Spent a few days learning more about Diffusion models, UNets and Transformers. Wrote a few toy implementations of a denoising diffusion model (following diffusers' tutorial) and a simple multi-headed self-attention model for next-character prediction (following Karpathy's video).
The non-latent version of the denoising model was trained on the Smithsonian Butterfly dataset, and it successfully generates new butterfly images. But it's unconditional (i.e. no text prompts) and non-latent (i.e. it works directly on pixel data rather than in a compressed latent space).
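For reference, this is roughly what the training step looks like with diffusers (a minimal sketch; the model size, dataset resolution, and hyperparameters below are placeholder assumptions, not exactly what I trained with):

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel, DDPMScheduler

# Placeholder model config -- the real tutorial uses different channel widths.
model = UNet2DModel(
    sample_size=128,                          # assumed image resolution
    in_channels=3,
    out_channels=3,
    block_out_channels=(64, 128, 256, 256),   # assumed channel widths
)
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(clean_images: torch.Tensor) -> torch.Tensor:
    """One DDPM training step: predict the noise added at a random timestep."""
    noise = torch.randn_like(clean_images)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps, (clean_images.shape[0],),
        device=clean_images.device,
    )
    # Forward diffusion: mix clean images with noise according to the timestep.
    noisy_images = scheduler.add_noise(clean_images, noise, timesteps)
    # The UNet predicts the noise; the loss is plain MSE against the true noise.
    noise_pred = model(noisy_images, timesteps).sample
    loss = F.mse_loss(noise_pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```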
The latent version doesn't seem to be working correctly right now. It runs, but the output is garbage, and I don't think it's training correctly. I pre-converted the entire butterfly dataset into latent encodings before training (to speed up training), but the results were just as bad even when I did the VAE encoding during training. Still need to look into this more.
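For context, the pre-encoding step looks roughly like this (a sketch assuming a Stable Diffusion-style AutoencoderKL; the checkpoint name and the scaling are assumptions on my part, and the scaling factor in particular is an easy detail to miss):

```python
import torch
from diffusers import AutoencoderKL

# Assumed VAE checkpoint -- not necessarily the one I used.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.eval()

@torch.no_grad()
def encode_to_latents(images: torch.Tensor) -> torch.Tensor:
    """images: (B, 3, H, W) in [-1, 1] -> latents: (B, 4, H/8, W/8)."""
    latent_dist = vae.encode(images).latent_dist
    latents = latent_dist.sample()
    # Latents are conventionally scaled so they have roughly unit variance
    # before being fed to the UNet.
    return latents * vae.config.scaling_factor
```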
The multi-headed self-attention implementation is structurally okay (I think), but it's probably too simple to learn anything meaningful about sentence structure. I might be wrong, since I'm a newbie. I haven't implemented the rest of the transformer architecture, since I was just trying to get some intuition around the attention mechanism.
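For reference, a minimal multi-headed self-attention block in PyTorch looks something like this (a sketch in the spirit of Karpathy's video; the dimensions and the causal mask here are illustrative assumptions, not my exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, block_size: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)   # project to queries, keys, values
        self.proj = nn.Linear(embed_dim, embed_dim)      # final output projection
        # Causal mask so each position only attends to earlier characters.
        self.register_buffer(
            "mask", torch.tril(torch.ones(block_size, block_size)).bool()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (B, T, C) -> (B, num_heads, T, head_dim)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention: the (T x T) score matrix per head is
        # where the quadratic runtime/memory cost shows up.
        att = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))
        att = F.softmax(att, dim=-1)
        out = att @ v                                    # (B, num_heads, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
```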
The purpose of this deep-dive was to develop better intuition about how these models work and where their runtime and memory hotspots are.