tl;dr: Explored a possible optimization for Flux with `diffusers` when using `enable_sequential_cpu_offload()`. It did not work.
While trying to use Flux (nearly 22 GB of weights) with `diffusers` on a 12 GB graphics card, I noticed that it barely used any GPU memory when using `enable_sequential_cpu_offload()`. And it was super slow. It turns out that the largest module in Flux's transformer model is around 108 MB, so because `diffusers` streams modules one at a time, peak VRAM usage never rose above a few hundred MB.
And that felt odd - a few hundred MBs being used on a 12 GB graphics card. Wouldn't it be faster if it always kept 8-9 GB of the model weights on the GPU, and streamed the rest? Less data to stream == less time wasted on memory overhead == better rendering speed?
The summary is that, strangely enough, that optimization did not result in a real improvement. Sometimes it was barely any faster. So, IMO not worth the added complexity. Quantization probably has better ROI.
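For context, the setup looks roughly like this (a minimal sketch; the checkpoint id, prompt, and settings are just examples):

```python
import torch
from diffusers import FluxPipeline

# Minimal sketch: load Flux and let diffusers stream modules to the GPU
# one at a time instead of keeping the whole model in VRAM.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload()  # peak VRAM stays at a few hundred MB

image = pipe(
    "a photo of a cat", height=512, width=512, num_inference_steps=4
).images[0]
image.save("flux.png")
```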
Idea:
The way a diffusion pipeline usually works is: it first runs the `text encoder` module(s) once, then runs the `vae` module once (for encoding, when starting from an input image), then loops over the `unet`/`transformer` module several times (i.e. the inference steps), and finally runs the `vae` module once again (for decoding).
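As a quick check of where the weight actually sits, the pipeline's sub-models can be sized up like this (`pipe` is the pipeline from the snippet above; the attribute names are the ones `FluxPipeline` exposes):

```python
# How the ~22 GB of weights is split across the pipeline's sub-models.
# The text encoders run once per prompt, the transformer runs once per
# inference step, and the vae runs once at the end to decode the latents.
for name in ["text_encoder", "text_encoder_2", "transformer", "vae"]:
    module = getattr(pipe, name)
    size_gb = sum(p.numel() * p.element_size() for p in module.parameters()) / 1024**3
    print(f"{name}: {size_gb:.2f} GiB")
```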
So the idea was to keep a large fraction of the `unet`/`transformer` sub-modules on the GPU (instead of offloading them back to the CPU). This way, the second/third/fourth/etc. loop over the `unet`/`transformer` would need to transfer less data per iteration, and therefore incur less GPU-to-CPU-and-back overhead: 14 GB transferred per loop instead of 22 GB.
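Stripped of the diffusers/accelerate details, the concept looks something like this (a toy sketch with made-up layer sizes, not the actual implementation):

```python
import torch
from torch import nn

# Toy sketch of the pinning idea: pinned blocks stay resident on the GPU,
# the remaining blocks are streamed in for their forward pass and offloaded
# again afterwards.
device = "cuda"
blocks = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(8)])

pinned = set(range(4))           # keep the first half resident in VRAM
for i in pinned:
    blocks[i].to(device)

def forward(x):
    for i, block in enumerate(blocks):
        if i not in pinned:
            block.to(device)     # stream weights CPU -> GPU
        x = block(x)
        if i not in pinned:
            block.to("cpu")      # offload again to keep VRAM usage low
    return x

out = forward(torch.randn(1, 4096, device=device))
```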
Which modules did I pin?
For deciding which modules to "pin" to the GPU, I tried both orders: sorting the modules by size and pinning the smallest ones first, and sorting the other way and pinning the largest ones first. The first approach was meant to reduce the I/O wait time during computation, by avoiding waits on lots of small modules. The second approach was meant to keep the big modules resident on the GPU, to avoid large transfers.
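The selection itself can be sketched roughly like this (the ~8 GB budget matches the "keep 8-9 GB on the GPU" idea above; flip `reverse` to switch between the two orders):

```python
# Sketch: pick transformer sub-modules to pin, smallest first (largest first
# with reverse=True), until a ~8 GB VRAM budget is filled.
budget_bytes = 8 * 1024**3

# Size of each sub-module's own parameters (recurse=False avoids counting
# the same weights once per parent container).
sizes = {
    name: sum(p.numel() * p.element_size() for p in module.parameters(recurse=False))
    for name, module in pipe.transformer.named_modules()
}

pinned, used = [], 0
for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=False):
    if size == 0 or used + size > budget_bytes:
        continue
    pinned.append(name)
    used += size

print(f"would pin {len(pinned)} sub-modules, {used / 1024**3:.1f} GiB")
```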
Neither approach seemed to change the result.
Results:
Unfortunately, the performance gain ranged from non-existent to very marginal. I ran each test twice, to ensure that the OS would have the page files warmed up equally.
For a 512x512 image with 4 steps:
In other runs with 4 steps, the optimization was sometimes faster by 5-10 seconds, and sometimes about the same. With more steps (e.g. 10), the optimization was usually better by 15-20 seconds (i.e. around 160 seconds total vs 180 seconds).
Possible Explanation (of why it didn't work):
My best guess is OS paging, driver caching, or PyTorch caching. The first loop iteration would obviously be very slow, since it would read everything (including the "pinned" modules) from the CPU to the GPU.

The subsequent inference loops were actually faster with the optimization (2.5 s vs 4 s). But since the first iteration constituted nearly 95% of the total time, any savings from this optimization only affected the remaining 5% of the total time.

And I think paging or driver/PyTorch caching makes the corresponding I/O transfer times very similar after the 2nd iteration. While 2.5 s (optimized) is faster than 4 s (unoptimized) in subsequent iterations, the improvement is not very impactful: 4 seconds for transferring 22 GB is already pretty comparable to the optimized version, presumably due to heavy OS paging.
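One way to see this is to time each denoising step from the pipeline's step-end callback (assuming a `diffusers` version where `FluxPipeline` supports `callback_on_step_end`); the first interval dwarfs the rest:

```python
import time

# Crude per-step timing via the step-end callback. Note that the first
# interval also includes prompt encoding and pipeline warm-up, not just the
# first transformer pass.
step_times = []
last = time.perf_counter()

def timed(pipeline, step, timestep, callback_kwargs):
    global last
    now = time.perf_counter()
    step_times.append(now - last)
    last = now
    return callback_kwargs

pipe("a photo of a cat", num_inference_steps=4, callback_on_step_end=timed)
print([f"{t:.1f}s" for t in step_times])
```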
So the basic premise of this experiment turned out to be wrong. Subsequent iterations of the `unet`/`transformer` do not incur a heavy I/O overhead while streaming modules to the GPU. Therefore pinning a large chunk of that module to the GPU didn't save much time - not enough to make a difference to the overall render time.