cmdr2's notes

Built two experiments using locally-hosted LLMs. One is a script that lets two bots chat with each other endlessly. The other is a browser bookmarklet that summarizes the selected text in 300 words or less.

Both use an OpenAI-compatible API, so they can be pointed at regular OpenAI-compatible remote servers, or your own locally-hosted servers (like LMStudio).

> Bot Chat

> Summarize Bookmarklet

The bot chat script is fun: the conversation is genuinely interesting at first, but starts stagnating/repeating after 20-30 messages. The script lets you define the names and descriptions of the two bots, the scene description, and the first message by the first bot. After that, it lets the two bots talk to each other endlessly.
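
The core loop is roughly like this (a minimal sketch, not the actual script; the bot names, prompts, base URL and model name below are all placeholders):

    # Two bots take turns; each sees the conversation from its own perspective.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="none")  # any OpenAI-compatible server

    bots = {
        "Ada": "You are Ada, a curious physicist.",
        "Bob": "You are Bob, a skeptical historian.",
    }
    scene = "The two of you are stuck in an airport lounge during a storm."
    history = [("Ada", "Do you think the storm will pass soon?")]  # first message by the first bot

    def next_reply(speaker: str) -> str:
        # each bot sees its own lines as 'assistant' and the other bot's lines as 'user'
        messages = [{"role": "system", "content": bots[speaker] + "\n" + scene}]
        for name, text in history:
            role = "assistant" if name == speaker else "user"
            messages.append({"role": role, "content": text})
        resp = client.chat.completions.create(model="local-model", messages=messages)
        return resp.choices[0].message.content

    while True:
        speaker = "Bob" if history[-1][0] == "Ada" else "Ada"
        reply = next_reply(speaker)
        history.append((speaker, reply))
        print(f"{speaker}: {reply}\n")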

The browser bookmarklet is very useful, but many sites have domain restrictions (even if my server allows CORS). So it's a bit hit-or-miss.

Notes on two directions for ED4's UI that I'm unlikely to pursue further.

One is to start a desktop app with a full-screen webview (for the app UI). The other is to write the tabbed, browser-like shell of ED4 in a compiled language (like Go or C++) and load the contents of the tabs as regular webpages (using webviews). So it would load URLs like http://localhost:9000/ui/image_editor and http://localhost:9000/ui/settings etc.

In the first approach, we would start an empty full-screen webview, and let the webpage draw the entire UI, including the tabbed shell. The only purpose of this would be to start a desktop app instead of opening a browser tab, while being very lightweight (compared to Electron/Tauri style implementations).

In the second approach, the shell would essentially be like a 2008-era Google Chrome [1], that's super lightweight and fast. And the purpose of this would be to have a fast-starting UI, and provide a scaffolding for other apps like this that need tabbed interfaces.

Realistically, neither approach is really necessary for ED4's goals [2]. It's already really fast to open a browser tab, and I don't see a strong justification for the added project complexity of compiling webviews and maintaining a native-language shell. For example, I use a custom locally-hosted diary app, which also opens a browser tab for its UI, and I've never once felt that its startup time was too slow for my taste. On the contrary, I'm always pleased by how quickly it starts up.

I don't really care whether ED starts in a browser tab or runs as a dedicated desktop app. I just want ED4's UI to be interactive within a few hundred milliseconds of launching it. That's the goal.

---

[1] To be honest, the second approach is an old pet idea of mine (from 2010): writing things like IDEs in a fast, lightweight tabbed UI (like 2008-era Chrome), back when IDEs were massive trucks that took forever to load (Eclipse, NetBeans, Visual Studio etc). Chrome was also very novel in writing the rest of its user interface in HTML (e.g. Settings, Bookmarks, Downloads etc). So this is more of a pet itch than something that came out of ED4's project needs. I might explore it again one day, but it doesn't really matter that much to me right now.

[2] Another downside of the second approach is that it prevents ED from being used remotely from other computers via a web browser.

Worked on a few UI design ideas for Easy Diffusion v4. I've uploaded the work-in-progress mockups at https://github.com/easydiffusion/files.

So far, I've mocked out the design for the outer skeleton. That is, the new tabbed interface, the status bar, and the unified main menu. I also worked out how they would look on mobile devices.

It gives me a rough idea of the Vue components that would need to be written, and the surface area that plugins can impact. For example, plugins can only add new menu entries within the Plugins sub-menu.

The mockups draw inspiration from earlier versions of Easy Diffusion (obviously), Google Chrome, Firefox, and VS Code.

Freebird is finally out on sale - https://freebirdxr.com/buy

It's still called an Early Access version, since it needs more work to feel like a cohesive product. It's already got quite a lot of features, and it's definitely useful. But I think it's still missing a few key features, and needs an overall "fine-tuning" of the user experience and interface.

So yeah, lots more to do. But it feels good to get something out on sale after nearly 4 years of development. Freebird has already spent 2 years in free public beta, so quite a number of people have already used it.

Freebird sold its first few copies since it went on sale last night. The main emotion is a sense of relief, I think.

Today I explored an idea for what might happen if an AI model runs continuously, acting and receiving sensory inputs without interruption. Maybe in a text-adventure game. Instead of responding to isolated prompts, the AI would live in a simulated environment, interacting with its world in real time. The experiment is about observing whether behaviors like an understanding of time, awareness, or even a sense of self could emerge naturally through sustained operation.

The plan is simple: let the bot run indefinitely, receiving sensory inputs and responding to them. The idea is to see if patterns of self-awareness emerge just from interacting with the environment. To make things more interesting, the world could start with a batch of bots, with new ones introduced over time, creating a kind of multi-generational system of learned behaviors.

Other details include adding "teacher bots" to guide the AI at the start, occasional interactions with human players for feedback, and strategies to manage the growing memory efficiently. There’s also the possibility of introducing existential "drives" to shape the AI's motivation or pausing periodically to compress memory into neural weights.
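
If I ever prototype this, the core loop might be as simple as the sketch below (everything here is hypothetical: the world functions are stand-ins, and the memory handling is deliberately naive):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="none")  # any OpenAI-compatible server
    memory = []  # would eventually need compression/summarization to stay within the context limit

    def world_tick() -> str:
        # stand-in for the simulated environment producing the next sensory input
        return "You are in a small room. You hear rain against the window."

    def world_apply(action: str):
        # stand-in for applying the agent's action to the world
        print("agent:", action)

    def step(observation: str) -> str:
        messages = [{"role": "system", "content": "You are a being living in a simulated text world."}]
        messages += memory[-50:]  # naive sliding window over recent history
        messages.append({"role": "user", "content": observation})
        reply = client.chat.completions.create(model="local-model", messages=messages)
        action = reply.choices[0].message.content
        memory.extend([{"role": "user", "content": observation}, {"role": "assistant", "content": action}])
        return action

    while True:  # run indefinitely
        world_apply(step(world_tick()))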

This is still very much a vague idea, but I'm curious about what would happen with this kind of training. Long-term memory and context size will be technical challenges.

Spent a few days learning more about Diffusion models, UNets and Transformers. Wrote a few toy implementations of a denoising diffusion model (following diffusers' tutorial) and a simple multi-headed self-attention model for next-character prediction (following Karpathy's video).

The non-latent version of the denoising model was trained on the Smithsonian Butterfly dataset, and it successfully generates new butterfly images. But it's unconditional (i.e. no text prompts), and non-latent (i.e. works directly on the image data, instead of a compressed latent space).

The latent version doesn't seem to be working correctly right now. It runs, but the output is garbage, and I don't think it's training correctly. I pre-converted the entire butterfly dataset into latent encodings before training (to speed up training), but the results were garbage even when I did the VAE encoding during training. Still need to look into this more.
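
For reference, the pre-encoding step looks roughly like this (a sketch using diffusers' AutoencoderKL; my actual code differs, and the VAE checkpoint here is just an example):

    import torch
    from diffusers import AutoencoderKL

    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to("cuda").eval()

    @torch.no_grad()
    def encode_batch(images: torch.Tensor) -> torch.Tensor:
        # images: (N, 3, H, W), already normalized to [-1, 1]
        latents = vae.encode(images.to("cuda")).latent_dist.sample()
        return latents * vae.config.scaling_factor  # typically 0.18215

    # encode the whole dataset once, save the tensors, then train the UNet on them
    # torch.save(encode_batch(batch), "latents_0000.pt")

One thing worth double-checking in a setup like this is the latent scaling factor (and applying its inverse when decoding), since getting that wrong is an easy way to end up with garbage outputs.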

The multi-headed self-attention implementation is structurally okay, I think, but it's probably too simple to learn anything meaningful about sentence structures. I might be wrong, since I'm a newbie. I haven't implemented the rest of the transformer architecture, since I was just trying to get some intuition around the attention mechanism.
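
For reference, the block is structurally something like this (a minimal PyTorch sketch; the hyperparameters are arbitrary):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiHeadSelfAttention(nn.Module):
        def __init__(self, n_embd=64, n_heads=4, block_size=128):
            super().__init__()
            assert n_embd % n_heads == 0
            self.n_heads = n_heads
            self.qkv = nn.Linear(n_embd, 3 * n_embd)
            self.proj = nn.Linear(n_embd, n_embd)
            # causal mask: each position only attends to earlier positions
            self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

        def forward(self, x):
            B, T, C = x.shape
            q, k, v = self.qkv(x).split(C, dim=-1)
            # reshape into (B, heads, T, head_dim)
            q, k, v = (t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2) for t in (q, k, v))
            att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
            att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
            att = F.softmax(att, dim=-1)
            out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
            return self.proj(out)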

The purpose of this deep-dive was to develop better intuition about how the models work, and where the runtime performance and memory hotspots are (in these models).

Spent some more time on the v4 experiments for Easy Diffusion (i.e. C++ based, fast-startup, lightweight). stable-diffusion.cpp is missing a few features, which will be necessary for Easy Diffusion's typical workflow. I wasn't keen on forking stable-diffusion.cpp, but it's probably faster to work on a fork for now.

For now, I've added live preview and per-step progress callbacks (based on a few pending pull-requests on sd.cpp). And protection from GGML_ASSERT killing the entire process. I've been looking at the ability to load individual models (like the vae) without needing to reload the entire SD model.

sd.cpp for Flux in ED 3.5?

As a side-idea, I could use sd.cpp as the Flux and SD3 backend for the current version of Easy Diffusion. The Forge backend in ED 3.5 beta is a bit prone to crashing, and I'm only using it for Flux.

Long-term considerations aside, it might be an interesting experiment to try sd.cpp in ED 3.5 and see if it's more stable than Forge for that purpose.

I could write a simple web server with a similar API as Forge and ship it.

Spent a few days getting a C++ based version of Easy Diffusion working, using stable-diffusion.cpp. I'm working with a fork of stable-diffusion.cpp here, to add a few changes like per-step callbacks, live image previews etc.

It doesn't have a UI yet, and currently hardcodes a model path. It exposes a RESTful API server (written using the Crow C++ library), and uses a simple task manager that runs image generation tasks on a thread. The generated images are available at an API endpoint, and it shows the binary JPEG/PNG image (instead of base64 encoding).

The general intent of this project is to play with ideas for version 4 of Easy Diffusion, and make it as lightweight, easy to install and fast as possible. Cutting out as much unnecessary bloat as possible.

Wrote a simple hex-dumper for analysing dll and executable files. Uses pefile.
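
The core of it is something like this (a rough sketch, not the actual tool; it dumps the first bytes of each PE section):

    import sys
    import pefile

    def hex_dump(data: bytes, width: int = 16):
        for offset in range(0, len(data), width):
            chunk = data[offset:offset + width]
            hex_part = " ".join(f"{b:02x}" for b in chunk)
            ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
            print(f"{offset:08x}  {hex_part:<{width * 3}} {ascii_part}")

    pe = pefile.PE(sys.argv[1])
    for section in pe.sections:
        name = section.Name.rstrip(b"\x00").decode(errors="replace")
        print(f"{name}: offset=0x{section.PointerToRawData:x}, size=0x{section.SizeOfRawData:x}")
        hex_dump(section.get_data()[:64])  # first 64 bytes of each section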

Continuing on the race car simulator series. Last week, the "effective tire friction" calculation was implemented, which modeled the grip at the point of contact between the tire and the road surface. This intentionally did not take into account the vertical load (or any other forces), since the purpose was limited to calculating the "effective" friction coefficient based on the material conditions.

The next step was implemented yesterday, which calculates the effective force the tire will apply on the wheel axle, in reaction to the torque applied by the engine on the wheel axle. That reaction force will cause the car to move forward. It also factors in the existing inertial force (i.e. if the car is already moving) in order to model sideways slip (e.g. for drifting).

If the applied force on the tire's contact patch exceeds the max allowed traction force, then the tire will start slipping. That will reduce the effective torque applied back on the wheel axle (from the contact patch). The tire's softness is also considered, since the tire will deform in response to the different torques, which will reduce the force applied.
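
As a toy illustration of the traction-clamp part of that (not the real model, which also handles tire deformation and sideways slip; all numbers below are made up):

    import numpy as np

    def contact_patch_force(engine_torque, wheel_radius, effective_friction, vertical_load):
        # force the tire tries to apply at the contact patch, from the engine torque
        applied_force = engine_torque / wheel_radius
        # maximum force the contact patch can transmit before slipping
        max_traction = effective_friction * vertical_load
        slipping = np.abs(applied_force) > max_traction
        # clamp the transmitted force when slipping; the excess is lost as wheel spin
        transmitted = np.clip(applied_force, -max_traction, max_traction)
        return transmitted, slipping

    forces, slip = contact_patch_force(
        engine_torque=np.array([300.0, 900.0]),    # N*m, per tire
        wheel_radius=0.33,                         # m
        effective_friction=np.array([1.6, 1.6]),   # from the grip model
        vertical_load=np.array([3000.0, 3000.0]),  # N (downforce + weight share)
    )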

The current models for grip and force have been documented at the wiki:

  • grip
  • force

Interesting side note - the wiki documentation was written by feeding ChatGPT the code for each module, and asking it to write such a document (with math formulae). It was pretty good, and I didn't have to do a lot of manual editing on the generated output.

Also, all the searches performed on Perplexity AI for this project are available at the Perplexity Space created for this project.

Following up on yesterday's post, there's now full automation for converting provisional NORAD IDs to the official ones (once they're available in Celestrak). This automation was initially waiting to be deployed, because it needed to be tested against the official NORAD IDs for yesterday's Starlink launch (G6-77), which will be assigned next week. Update: this automation has now been deployed.

    So now, the only processes still done manually are (a) selecting a new leader for a train, if the current leader drifts away from the train, and (b) removing old trains that have spread out completely.

    Spent two days automating some of the processes around findstarlink.com, and updating some of the code that had started bit-rotting.

Most of FindStarlink's operations run as individual AWS Lambda functions, triggered periodically by CloudWatch Events (and Schedules). But a few processes are still done manually, mainly due to a mix of laziness and the processes being a bit tricky to automate. I also needed to migrate the existing automations to a newer NodeJS runtime in AWS Lambda, since the current runtime was nearing end-of-life support.

    What was automated?

One such process is making a database entry whenever a new batch of Starlink satellites launches. I used to make the database entry manually (well, a JSON entry), and this was becoming a real problem because I've been increasingly erratic and late with updating the DB. Delaying this entry means users don't learn about new launches for several hours, and Starlink trains are best seen within 24 hours of launch. I used to perform some manual validations after a launch, but I don't think that's necessary anymore, and removing the manual validation step made this straightforward to automate. I used the Launch Library API to get the latest launch info, and automated the rest of the steps.
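
The launch-detection step is roughly this (a sketch; the real automation runs inside Lambda and also writes the DB entry, and the endpoint/fields here are approximate):

    import requests

    resp = requests.get(
        "https://ll.thespacedevs.com/2.2.0/launch/previous/",
        params={"search": "Starlink", "limit": 1},
        timeout=30,
    )
    resp.raise_for_status()
    launch = resp.json()["results"][0]
    print(launch["name"], launch["net"])  # mission name and launch time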

Another problematic manual process is replacing the dummy NORAD ID (assigned to a satellite upon launch by Celestrak) with the provisional NORAD ID (assigned a day or two later). Delaying this replacement results in outdated orbit predictions, often leading to increasingly inaccurate predictions for users. Finding the like-for-like replacement isn't very straightforward, because we need to find a satellite with coordinates and an orbital path similar to the one being replaced. But a simple Manhattan-distance check seemed good enough for the job, and that's what the automation now uses. It also determines the list of satellites that are part of the train, and stores that in the database.
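
The matching logic boils down to something like this (a simplified sketch; the deployed code works off the actual orbital data):

    # pick the candidate satellite whose position is closest (Manhattan distance)
    # to the one being replaced
    def manhattan(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    def find_replacement(old_sat, candidates):
        # old_sat and candidates are dicts like {"id": ..., "lat": ..., "lon": ..., "alt": ...}
        old_pos = (old_sat["lat"], old_sat["lon"], old_sat["alt"])
        return min(candidates, key=lambda c: manhattan((c["lat"], c["lon"], c["alt"]), old_pos))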

    What's left?

    I'm still left with two manual processes.

One is replacing the provisional NORAD ID with the official NORAD ID (assigned a week after launch). This is different from what I automated yesterday - that one replaced the dummy ID with the provisional ID, while this one replaces the provisional ID with the official one.

    This process is a bit trickier to automate, because after a week, the particular satellite being tracked may no longer even be in the "train", i.e. it may have drifted away to a different orbit. This isn't a problem when done manually, because I can see the entire train in my dashboard, and visually pick the correct "leader" of the train. But doing this algorithmically would involve calculating the orbital paths of all the satellites in that "train", and automatically picking a new leader. So this automation isn't very hard to do, but isn't trivial either.

    The other manual process is removing the DB entries for older trains that have spread out completely. This is usually necessary 3-4 weeks after a launch. Again, this is similar in complexity to the previous process, because I need to calculate the orbital paths of all the satellites in the train, and check if they're still within a threshold distance and threshold path angle to be considered a train. The user impact of this process being late is low, because the system already warns users automatically about low visibility chances once a train is older than 7 days.

    Future ideas

    I'd also like to somehow store the main changes to the database as versions. The database is just a JSON file, and I'd like to store the change history for things like insertions, deletions, and NORAD ID changes. Would be ideal if the Lambda function could just commit to the GitHub repo.
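
A rough sketch of what that Lambda-side commit could look like, via GitHub's contents API (the repo, path and token below are placeholders; updating an existing file also needs its current sha):

    import base64, json, requests

    def commit_db(db: dict, token: str):
        url = "https://api.github.com/repos/OWNER/REPO/contents/db.json"
        headers = {"Authorization": f"Bearer {token}", "Accept": "application/vnd.github+json"}
        current = requests.get(url, headers=headers).json()
        body = {
            "message": "Update satellite DB",
            "content": base64.b64encode(json.dumps(db, indent=2).encode()).decode(),
        }
        if "sha" in current:
            body["sha"] = current["sha"]  # required when updating an existing file
        requests.put(url, headers=headers, json=body).raise_for_status()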

    Maybe this entire setup could be replaced with a GitHub action that runs periodically, and commits to the repo.

    Started building a car simulator, focused on F1-like car characteristics. It's reasonably detailed in terms of simulation, but is ultimately meant for games/machine learning, so it approximates some of the behavior. It isn't physically accurate.

    The first piece is the car simulator itself - https://github.com/cmdr2/car-sim. This module is a numbers-only simulation, i.e. it doesn't handle visualization, interaction or anything that's not related to the simulation of vehicle components.

    I've started from the point of contact between the tire and the road, and will work backwards from that. I've got a basic tire friction model working, which computes the "effective friction" against the track surface, by taking into account: tire material, tread amount, road type, road condition, tire width, tire hardness, tire pressure, tire temperature, tire wear and tire camber.

    This model is independent of the vertical tire load (i.e. downforce + car weight). Instead, the output of this model can be multiplied by the vertical load to get the max traction force.

    It uses numpy and all the operations are vectorized, so lots of tires (and conditions) can be simulated in parallel.
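
As a rough illustration of what "vectorized" means here (the real factor curves are more involved; the numbers below are made up):

    import numpy as np

    n = 100_000  # simulate many tires/conditions at once
    base_friction = np.full(n, 1.9)              # e.g. a soft slick compound on dry tarmac
    tread = np.random.uniform(0.0, 1.0, n)       # 0 = slick, 1 = full wet tread
    temperature = np.random.uniform(40, 120, n)  # deg C

    # each condition contributes a multiplicative factor around 1.0
    tread_factor = 1.0 - 0.2 * tread                        # slicks grip more on dry tarmac
    temp_factor = 1.0 - 0.004 * np.abs(temperature - 95.0)  # peak grip near operating temperature

    effective_friction = base_friction * tread_factor * temp_factor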

    tl;dr - Today I shipped the ability to see the desktop screen in VR (while using Freebird). And fixed a few user-reported bugs in Freebird.

    Performance

    The performance is still a bit laggy. The actual screencapture code now runs in a separate process, and copies data over a SharedMemory buffer (which works pretty well for sharing data between two separate processes). That helps avoid Python's GIL while performing numpy operations on large arrays.
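
The handoff between the two processes looks roughly like this (a minimal sketch with illustrative names and sizes, not Freebird's actual code):

    import numpy as np
    from multiprocessing import shared_memory

    H, W = 1080, 1920

    # capture process: create the buffer once, then write each captured frame into it
    shm = shared_memory.SharedMemory(name="screen_frame", create=True, size=H * W * 4)
    frame = np.ndarray((H, W, 4), dtype=np.uint8, buffer=shm.buf)
    frame[:] = 0  # write the captured pixels here, every frame

    # Blender side: attach to the same buffer by name, so no copy crosses the process boundary
    shm_view = shared_memory.SharedMemory(name="screen_frame")
    view = np.ndarray((H, W, 4), dtype=np.uint8, buffer=shm_view.buf)
    # (clean up with .close() on both sides and .unlink() on the creator when done)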

    But the main performance bottleneck is the inability to update an existing texture using Blender's gpu module. The current implementation in Blender forces me to create a new texture each frame, which is pretty slow. And this has to be done on the main thread, otherwise Blender crashes with a null context error.

    So I update the screen just two times per second. For now, the usability is adequate IMO. But if it becomes important, I can maybe look at the Blender codebase and see if there's anything that can be proposed as a solution. Like right now, doing video is impossible at decent framerates using the gpu module. Maybe I missed something, but I really searched a lot (and tried a lot of approaches).

    Illusion of responsiveness

    To create an illusion of responsiveness, the mouse cursor is updated every frame (72fps), while the actual screen updates twice per second (2fps).

    Custom code to avoid dependencies

    And thanks to ChatGPT/Claude, I got a custom Windows-only implementation for taking screenshots and getting the mouse cursor location. It's a bit faster than mss/pyautogui and works fine for my purpose.

The main reason was to avoid depending on external libraries (they're tricky to install from a Blender addon). With AI codegen tools, this might actually be a good tradeoff, since the generated code is reasonably straightforward and maintainable (and uses well-supported Windows APIs). And I didn't spend much time on them.
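
For example, the cursor lookup is just a couple of Win32 calls via ctypes (a sketch of that half; the screenshot half involves several more GDI calls, so I'm not including it here):

    # Windows-only: read the current mouse cursor position without any external dependency
    import ctypes
    from ctypes import wintypes

    def get_cursor_pos():
        pt = wintypes.POINT()
        ctypes.windll.user32.GetCursorPos(ctypes.byref(pt))
        return pt.x, pt.y

    print(get_cursor_pos())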

    Let's see how it performs during actual usage. It's now in early-access on Freebird. Hopefully it won't crash.

    Built an initial prototype of showing the desktop window screencapture inside VR (while in Freebird), using the mss library. Freebird will have to install it using subprocess.run([sys.executable, '-m', 'pip', 'install', 'mss']).

    It works, but is currently a bit laggy. The capture and processing happens on a thread, and a timer modal calls the actual GPU texture assignment. The GPU texture assignment takes about 2 ms, but the XR view is still juddering (way more than it would with an extra 2ms of latency). Still need to investigate and smoothen the performance.

    But it seems to work well overall. I even drew a tiny placeholder cursor, and it casts the entire screen, so it's possible to switch to other programs while in VR in Freebird. Promising.

    Finished the blog-agent project for now. The blog is now live, and the code is up at the GitHub project.

In summary, it lets me write my notes as text files in Dropbox, and it automatically formats and publishes them as a blog on S3. It runs by triggering an AWS Lambda function via a Dropbox webhook.

    It's built purely for a workflow that I'm very used to (writing notes in text files, one file per month, posts separated by two hyphens padded with line breaks). But making this a public project will probably force me to keep things well-documented (so that I can fix things easily, if they break in the future).

    Added the ability to auto-generate an atom.xml (for feeds), and auto-generate a Twitter feed-like index.html (with pagination).

    It currently uses a custom hacky static-site generator (in Python), but it would be better to make this a wrapper around Hugo (and to move the Lambda runtime to OS/Go). That'll help improve the customizability, performance, and robustness.

    Updated the flat_blog generator, and modified the blog-agent to use Dropbox Refresh tokens to get new auth tokens.

Also made it auto-generate an index.html, and added styling for the list of posts. The idea is to make it look a bit more like a Twitter feed, and less like a list of links. Still not fully there yet.

    Published the first version of the Dropbox-based blog that gets mirrored on S3.

    The GitHub project is live, and still has quite a few bugs and missing pieces.

    How does it work?

    It takes the original posts from Dropbox and automatically publishes them in other places (after formatting them). As the author of those notes, the only place I'm concerned about is my Dropbox folder with my text files. But the agent then goes and mirrors the writing in different places automatically, and I never need to think about that process at all.

    I wrote this post on my phone, using the Dropbox app and text editor. All I need to do is save this text file, and it's live. There's something interesting about that.

    tl;dr - Today, I fixed a few bugs in Easy Diffusion and Freebird/VR Puppetry. And started building a blog engine that automatically takes my text file blog-posts from Dropbox and publishes them as a static blog on S3. I've already been writing a private blog for 10+ years as text files on Dropbox, and like it that way.

    Fixed a few bugs reported in Easy Diffusion 3.5's beta. And investigated an issue in VR Puppetry and fixed a separate bug in bl_xr for VR Puppetry and Freebird. And did a bit of support work for Easy Diffusion and Freebird/VR Puppetry.

    After that I worked on a way to publish a blog from the text files that I write in my Dropbox folder.

    Background fluff (can skip)

    Like any self-respecting developer, I've spent more time writing static site generators and blog engines than I would care to admit. And certainly way more time than I've spent actually writing any blog posts. So this experiment is another in that series, and the main intent is to build something that I'd actually use for a while.

    Data format

Writing plain text files in a Dropbox folder is the only "blogging" system that's worked for me for a decent amount of time (10+ years now, with reasonably consistent post intervals). I make a new text file each month (e.g. "October 2024.txt"), and then keep writing posts in that file for the rest of the month. Posts are separated by a line containing two hyphens (with blank lines above and below it). The first line of each post is a date. The post body can contain markdown.
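
The splitting logic is essentially this (a simplified sketch; the real generator handles more edge cases):

    # split a monthly text file into posts: "--" on its own line separates posts,
    # and the first line of each post is the date
    def split_posts(text: str):
        posts = []
        for chunk in text.split("\n--\n"):
            lines = chunk.strip().splitlines()
            if not lines:
                continue
            date, body = lines[0].strip(), "\n".join(lines[1:]).strip()
            posts.append({"date": date, "body": body})
        return posts

    with open("October 2024.txt", encoding="utf-8") as f:
        for post in split_posts(f.read()):
            print(post["date"])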

    This system works for me, and I've written a lot of posts over the last 10 years with this system. Those posts are private, but if I was able to do the same thing for public posts, I think there'd be a good chance that I'd actually write public-facing blog posts. Anyway, it's interesting to build this blog-engine regardless.

    Overall publishing workflow

    So the ideal system would just involve me writing the same way. I would continue writing text files in a folder on Dropbox, with a new text file per month. Every save would trigger a Dropbox webhook calling an AWS Lambda function. This Lambda function would fetch the text files from Dropbox, split the posts in the monthly text files into separate posts, and write a static-site blog to S3.
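
The Lambda entry point would look roughly like this (a sketch assuming an API Gateway proxy event; the real thing also needs to verify Dropbox's webhook signature):

    import json

    def handler(event, context):
        params = event.get("queryStringParameters") or {}
        if "challenge" in params:
            # Dropbox webhook verification: echo the challenge back
            return {"statusCode": 200, "body": params["challenge"]}

        # a change notification: fetch the changed files from Dropbox,
        # split them into posts, render HTML, and upload to S3
        rebuild_site()
        return {"statusCode": 200, "body": json.dumps({"ok": True})}

    def rebuild_site():
        ...  # the static-site generation step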

    This would completely free me to write like I've always written and never worry about pressing the publish button, or do anything extra or artificial (for me). The system would automatically publish the changes to the public-facing site.

This would be very inefficient, if implemented naively. But a naive implementation is a reasonable start IMO, since plain text is tiny (especially if downloaded as a zip).

This would make Dropbox (and my local disks) the true copy of my notes (rather than some external provider's custom database). The S3 website would be a public-facing mirror of those notes (with different formatting). It would align with my habit of writing on whichever device I'm on (since I have Dropbox synced on all of them). And I can use whichever text editor I feel productive in at that moment - Sublime Text, vi, Notepad etc. No clutter or rich interface. Just me and a text editor, and raw text files.

    I have a bad habit of pressing Ctrl+S every few sentences (from the old days of unreliable PCs), and that needs to be addressed. Otherwise it would trigger an enormous amount of unnecessary rebuilds. Again, I'd like to solve this with tech, rather than force myself to change how I write.

    tl;dr - Today, I worked on using stable-diffusion.cpp in a simple C++ program. As a linked library, as well as compiling sd.cpp from scratch (with and without CUDA). The intent was to get a tiny and fast-starting executable UI for Stable Diffusion working. Also, ChatGPT is very helpful!

    Part 1: Using sd.cpp as a library

First, I tried calling the stable-diffusion.cpp library from a simple C++ program (which just loads the model and renders an image), via dynamic linking. That worked: its performance was the same as the example sd.exe CLI, and it detected and used the GPU correctly.

    The basic commands for this were (using MinGW64):

    gendef stable-diffusion.dll
    dlltool --dllname stable-diffusion.dll --output-lib libstable-diffusion.a --input-def stable-diffusion.def
    g++ -o your_program your_program.cpp -L. -lstable-diffusion
    

    And I had to set a CMAKE_GENERATOR="MinGW Makefiles" environment variable. The steps will be different if using MSVC's cl.exe.

I figured that I could write a simple HTTP server in C++ that wraps sd.cpp. Using a different language would involve keeping the language binding up-to-date with sd.cpp's header file. For example, the Go-lang wrapper is currently out-of-date with sd.cpp's latest header.

This thin-wrapper C++ server wouldn't be too complex; it would just act as a rendering backend process for a more complex Go-lang based server (which would implement other user-facing features like model management, task queue management etc).

    Here's a simple C++ example:

    #include "stable-diffusion.h"
    #include <iostream>
    
    int main() {
        // Create the Stable Diffusion context
        sd_ctx_t* ctx = new_sd_ctx("F:\\path\\to\\sd-v1-5.safetensors", "", "", "", "", "", "", "", "", "", "",
                                    false, false, false, -1, SD_TYPE_F16, STD_DEFAULT_RNG, DEFAULT, false, false, false);
    
        if (ctx == NULL) {
            std::cerr << "Failed to create Stable Diffusion context." << std::endl;
            return -1;
        }
    
        // Generate image using txt2img
        sd_image_t* image = txt2img(ctx, "A beautiful landscape painting", "", 0, 7.5f, 1.0f, 512, 512,
                                    EULER_A, 25, 42, 1, NULL, 0.0f, 0.0f, false, "");
    
        if (image == NULL) {
            std::cerr << "txt2img failed." << std::endl;
            free_sd_ctx(ctx);
            return -1;
        }
    
        // Output image details
        std::cout << "Generated image: " << image->width << "x" << image->height << std::endl;
    
        // Cleanup
        free_sd_ctx(ctx);
         
        return 0;
    }
    

    Part 2: Compiling sd.cpp from scratch (as a sub-folder in my project)

    Update: This code is now available in a github repo.

The next experiment was to compile sd.cpp from scratch on my PC (using the MinGW compiler as well as Microsoft's VS compiler). I used sd.cpp as a git submodule in my project, and linked to it statically.

I needed this initially to investigate a segfault inside a function of stable-diffusion.dll, which I wasn't able to trace (even with gdb). Plus it was fun to compile the whole thing and see the entire Stable Diffusion implementation fit into a tiny binary that starts up really quickly. A few megabytes for the CPU-only build.

    My folder tree was:

    - stable-diffusion.cpp # sub-module dir
    - src/main.cpp
    - CMakeLists.txt
    

    src/main.cpp is the same as before, except for this change at the start of int main() (in order to capture the logs):

    void sd_log_cb(enum sd_log_level_t level, const char* log, void* data) {
        std::cout << log;
    }
    
    int main(int argc, char* argv[]) {
        sd_set_log_callback(sd_log_cb, NULL);
    
        // ... rest of the code is the same
    }
    

    And CMakeLists.txt is:

    cmake_minimum_required(VERSION 3.13)
    project(sd2)
    
    # Set C++ standard
    set(CMAKE_CXX_STANDARD 17)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)
    
    # Add submodule directory for stable-diffusion
    add_subdirectory(stable-diffusion.cpp)
    
    # Include directories for stable-diffusion and its dependencies
    include_directories(stable-diffusion.cpp src)
    
    # Create executable from your main.cpp
    add_executable(sd2 src/main.cpp)
    
    # Link with the stable-diffusion library
    target_link_libraries(sd2 stable-diffusion)
    

    Compiled using:

cmake ..    # from a build directory
    cmake --build . --config Release
    

    This ran on the CPU, and was obviously slow. But good to see it running!

Tiny note: I noticed that compiling with g++ (mingw64) resulted in faster iterations compared to MSVC, e.g. 3.5 sec/it vs 4.5 sec/it for SD 1.5 (euler_a, 256x256, fp32). Not sure why.

    Part 3: Compiling the CUDA version of sd.cpp

Just for the heck of it, I also installed the CUDA Toolkit and compiled the CUDA version of my example project. That took some fiddling. I had to copy some files around to make it work, and point the CUDAToolkit_ROOT environment variable to where the CUDA toolkit was installed (e.g. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6).

    Compiled using:

cmake .. -DSD_CUBLAS=ON    # from a build directory
    cmake --build . --config Release
    

    The compilation took a long time, since it compiled all the cuda kernels inside ggml. But it worked, and was as fast as the official sd.exe build for CUDA (which confirmed that nothing was misconfigured).

    It resulted in a 347 MB binary (which compresses to a 71 MB .7z file for download). That's really good, compared to the 6 GB+ (uncompressed) behemoths in python-land for Stable Diffusion. Even including the CUDA DLLs (that are needed separately) that's "only" another 600 MB uncompressed (300 MB .7z compressed), which is still better.

    Conclusions

The binary size (and it being a single static binary) and the startup time are hands-down excellent. So that's pretty promising.

But in terms of performance, sd.cpp seems to be significantly slower for SD 1.5 than Forge WebUI (or even a basic diffusers pipeline): 3 it/sec vs 7.5 it/sec for an SD 1.5 image (euler_a, 512x512, fp16) on my NVIDIA 3060 12GB. I tested with the official sd.exe build. I don't know if this is just my PC, but another user reported something similar.

    Interestingly, the implementation for the Flux model in sd.cpp runs as fast as Forge WebUI, and is pretty efficient with memory.

    Also, I don't think it's really practical or necessary to compile sd.cpp from scratch, but I wanted to have the freedom to use things like the CLIP implementation inside sd.cpp, which isn't exposed via the DLL. But that could also be achieved by submitting a PR to the sd.cpp project, and maybe they'd be okay with exposing the useful inner models in the main DLL as well.

    But it'll be interesting to link this with the fast-starting Go frontend (from yesterday), or maybe even just as a fast-starting standalone C++ server. Projects like Jellybox exist (Go-lang frontend and sd.cpp backend), but it's interesting to play with this anyway, to see how small and fast an SD UI can be made.