cmdr2's notes

tl;dr: Explored a possible optimization for Flux with diffusers when using enable_sequential_cpu_offload(). It did not work.

While trying to use Flux (nearly 22 GB of weights) with diffusers on a 12 GB graphics card, I noticed that it barely used any GPU memory when using enable_sequential_cpu_offload(). And it was super slow. It turns out that the largest module in Flux's transformer model is around 108 MB, so because diffusers streams modules one at a time, the peak VRAM usage never rose above a few hundred MB.
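For reference, this is the kind of setup I'm describing - a minimal sketch, where the model id and dtype are illustrative rather than the exact settings I used:

```python
import torch
from diffusers import FluxPipeline

# Load Flux and let diffusers stream modules between CPU and GPU, one module at a time.
# The model id and dtype here are illustrative, not necessarily the exact settings used.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload()  # peak VRAM stays at a few hundred MB, but it's slow

image = pipe(
    "a photo of a cat", width=512, height=512, num_inference_steps=4
).images[0]
image.save("cat.png")
```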

And that felt odd - a few hundred MBs being used on a 12 GB graphics card. Wouldn't it be faster if it always kept 8-9 GB of the model weights on the GPU, and streamed the rest? Less data to stream == less time wasted on memory overhead == better rendering speed?

The summary is that, strangely enough, that optimization did not result in a real improvement. Sometimes it was barely any faster. So, IMO not worth the added complexity. Quantization probably has better ROI.

Idea:

The way a diffusion pipeline usually works is: it first runs the text encoder module(s) once, then runs the vae module once (for encoding), then loops over the unet/transformer module several times (i.e. the inference steps), and finally runs the vae module once again (for decoding).

So the idea was to keep a large fraction of the unet/transformer sub-modules on the GPU (instead of offloading them back to the CPU). This way, the second/third/fourth/etc loop of the unet/transformer would need to transfer less data per loop iteration, and therefore incur less overhead from GPU-to-CPU-and-back transfers. 14 GB transferred per loop, instead of 22 GB.
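To illustrate what "pinning" means here, a rough sketch (a hypothetical helper, not the actual change I made - in practice this has to cooperate with diffusers/accelerate's offload hooks, which would otherwise move the modules straight back to the CPU):

```python
import torch

def pin_submodules(transformer: torch.nn.Module, budget_bytes: int = 8 * 1024**3) -> list[str]:
    """Move leaf sub-modules to the GPU until a VRAM budget is used up (sketch only)."""
    used = 0
    pinned = []
    for name, module in transformer.named_modules():
        if list(module.children()):  # only consider leaf modules (the units that get streamed)
            continue
        size = sum(p.numel() * p.element_size() for p in module.parameters())
        if size == 0 or used + size > budget_bytes:
            continue
        module.to("cuda")  # keep this module resident on the GPU
        used += size
        pinned.append(name)
    return pinned
```

Something like pin_submodules(pipe.transformer) would correspond to the ~8 GB "pinned" budget mentioned above.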

Which modules did I pin?

For deciding which modules to "pin" to the GPU, I tried both orders - sorting by the smallest modules and pinning those first, as well as sorting by the largest modules and pinning those first. The first approach was intended to reduce the I/O wait time during computation (waiting for lots of small modules to stream in). The second was intended to keep the big modules on the GPU, to avoid large transfers.
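In terms of the sketch above, the two approaches only differ in the order in which candidate modules are considered (again, just an illustration; `transformer` is assumed to be the pipeline's transformer, e.g. pipe.transformer):

```python
# Per-module sizes in bytes, for the leaf modules that sequential offload streams.
module_sizes = {
    name: sum(p.numel() * p.element_size() for p in module.parameters())
    for name, module in transformer.named_modules()
    if not list(module.children())
}

smallest_first = sorted(module_sizes, key=module_sizes.get)               # pin small modules first
largest_first = sorted(module_sizes, key=module_sizes.get, reverse=True)  # pin big modules first
```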

Neither approach seemed to change the result.

Results:

Unfortunately, the performance gain ranged from non-existent to very marginal. I ran each test twice, to ensure that the OS would have the page files equally warmed up.

For a 512x512 image with 4 steps:

  • With this optimization: With 8 GB "pinned" (and the rest streamed), the overall image generation took 124 seconds.
  • Without this optimization: With everything streamed, the overall image generation took 122 seconds.
  • In other runs with 4 steps, the optimization was sometimes faster by 5-10 seconds, or sometimes similar.

  • With increased steps (e.g. 10 steps), the optimization is usually better by 15-20 seconds (i.e. 160 seconds total vs 180 seconds).

Possible Explanation (of why it didn't work):

OS paging, driver caching or PyTorch caching. The first loop iteration would obviously be very slow, since it reads everything (including the "pinned" modules) from the CPU to the GPU.

The subsequent inference loops were actually faster with the optimization (2.5s vs 4s). But since the first iteration constituted nearly 95% of the total time, any savings from this optimization only affected the remaining 5% of the total time.

And I think paging or driver/PyTorch caching makes the corresponding I/O transfer times very similar after the 2nd iteration. While 2.5 sec (optimized) is faster than 4 sec (unoptimized) in subsequent iterations, the improvement is not really very impactful. 4 seconds for transferring 22 GB is already pretty comparable to the optimized version. Presumably due to heavy OS paging.
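For what it's worth, raw host-to-device bandwidth is easy to sanity-check on a given machine; a generic micro-benchmark (not the measurement from the runs above) looks like this:

```python
import time
import torch

# Time a host-to-device copy of ~1 GB of pinned memory, as a rough proxy for streaming speed.
x = torch.empty(256 * 1024 * 1024, dtype=torch.float32).pin_memory()  # ~1 GB
torch.cuda.synchronize()
start = time.perf_counter()
x_gpu = x.to("cuda", non_blocking=True)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{x.numel() * x.element_size() / elapsed / 1e9:.1f} GB/s")
```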

So the basic premise of this experiment turned out to be wrong. Subsequent iterations of the unet/transformer do not incur a heavy I/O overhead while streaming GPU modules. Therefore pinning a large chunk of that module to the GPU didn't save much time. Not enough to make a difference to the overall render time.

Wrote a WebXR drawing tool with passthrough (i.e. AR overlay), in order to draw lines over real-world surfaces. It's pretty handy!

Uploaded it as Freebird Lite. It proved useful yesterday, since I could sketch lines around the house to plan different fittings and show the ideas to others (using the headset). Since it's just a website in a browser, it doesn't require any installation. And it works on all compatible 6-DoF headsets.

For now, it will only let you draw lines in the air. That was all I needed yesterday.

The main limitation right now is that it doesn't "anchor" things correctly, i.e. it resets the orientation when the headset recenters (e.g. after Quest wakes up from sleep). So the lines will no longer be where you drew them (relative to the real-world surfaces). I'm already using anchored elements, but clearly I'm doing it wrong.

Dev note - I created a self-signed certificate for my PC, so I can access it (over LAN) from the Quest. Debugging over temporary Cloudflare tunnels was pretty inconvenient.
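For reference, a minimal way to serve a local directory over HTTPS with such a self-signed certificate (WebXR needs a secure context, hence HTTPS even on the LAN). The file paths and port are placeholders, not necessarily how the actual dev server is set up:

```python
import http.server
import ssl

# Serve the current directory over HTTPS, so the Quest browser can reach it over LAN.
# cert.pem / key.pem are the self-signed certificate and key (placeholder paths).
server = http.server.HTTPServer(("0.0.0.0", 8443), http.server.SimpleHTTPRequestHandler)
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile="cert.pem", keyfile="key.pem")
server.socket = ctx.wrap_socket(server.socket, server_side=True)
server.serve_forever()
```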

Got !FS working in the browser using PyScript! It's pretty cool - python, skyfield, numpy etc. running inside the browser, fully client-side. And I didn't have to modify the code, it just works. And most importantly, the performance is acceptable.

The performance on desktop browsers is pretty good. It's a bit slower on mobile (but acceptable for my purpose).

sgp4 has some C bindings, so I had to compile it to WebAssembly using Emscripten and make a .whl (wheel). skyfield is pure python, so I made a wheel for it as well, using python -m build.
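For context, one way those custom wheels can then be pulled in on the Python side under PyScript/Pyodide is via micropip - the wheel filenames and paths below are placeholders, not the actual ones:

```python
# Runs inside the PyScript/Pyodide environment (top-level await is supported there).
# The wheel filenames/paths are placeholders for the locally built wheels, served alongside the page.
import micropip

await micropip.install("wheels/sgp4-2.23-cp311-cp311-emscripten_3_1_46_wasm32.whl")
await micropip.install("wheels/skyfield-1.49-py3-none-any.whl")

from skyfield.api import load  # now importable, fully client-side
```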

It all works surprisingly well.

Built a simple hydroponics growing container with an ESP8266. Code at https://github.com/cmdr2/farm

The seeds have been planted - 18 tomato seeds spread across 6 planters (aka a plastic egg carton, with plastic wrap over it to build up the humidity).

For the plumbing system, I've gone for a simple design. It has two tanks, stacked vertically. A motor switches on at preset intervals and pumps the nutrient solution up from the lower tank for a preset duration. A smaller hole (compared to the inlet pipe) in the upper tank (containing the plant roots) drains the water back into the lower tank. For fun, an emergency cut-off could've been built at the top of the upper tank (using a water-level sensor). But for now, that's not in scope.

So really, it's just an ESP8266 sending an ON signal at preset intervals, and holding it ON for another preset interval. The rest is gravity and plumbing.
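That control loop is small enough to sketch in full. A MicroPython-style version, where the pin number and both intervals are made up and the actual code in the repo may be structured quite differently:

```python
# MicroPython-style sketch of the pump timer. Pin number and intervals are placeholders.
import time
from machine import Pin

PUMP_PIN = 5               # GPIO driving the TIP122, which switches the motor
PUMP_INTERVAL_S = 60 * 60  # switch the pump on at this preset interval
PUMP_DURATION_S = 30       # hold it ON for this preset duration

pump = Pin(PUMP_PIN, Pin.OUT)

while True:
    pump.on()
    time.sleep(PUMP_DURATION_S)
    pump.off()
    time.sleep(PUMP_INTERVAL_S - PUMP_DURATION_S)
```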

The circuit is just a modified version of what I had from a previous project: a TIP122 transistor (with a diode) controlled by a GPIO pin on the ESP8266, which switches the motor on/off. For now, the motor and the ESP8266 board have different power sources. A TIP122 is not really necessary for this, but I'm not going for design elegance awards.

Later edit: Added a light sensor (LDR) to change the pumping frequency at night. And Amplitude analytics logging to ping a server each time it runs, so that I can check whether it's been running on time.