~ / cmdr2

tl;dr - Today I worked on using stable-diffusion.cpp in a simple C++ program: as a linked library, as well as by compiling sd.cpp from scratch (with and without CUDA). The intent was to get a tiny and fast-starting executable UI for Stable Diffusion working. Also, ChatGPT is very helpful!

Part 1: Using sd.cpp as a library

First, I tried calling the stable-diffusion.cpp library from a simple C++ program (which just loads the model and renders an image), via dynamic linking. That worked: its performance was the same as the example sd.exe CLI, and it detected and used the GPU correctly.

The basic commands for this were (using MinGW64):

gendef stable-diffusion.dll
dlltool --dllname stable-diffusion.dll --output-lib libstable-diffusion.a --input-def stable-diffusion.def
g++ -o your_program your_program.cpp -L. -lstable-diffusion

And I had to set a CMAKE_GENERATOR="MinGW Makefiles" environment variable. The steps will be different if using MSVC's cl.exe.

I figured that I could write a simple HTTP server in C++ that wraps sd.cpp. Using a different language would involve keeping that language's binding up-to-date with sd.cpp's header file; e.g. the Go wrapper is currently out-of-date with sd.cpp's latest header.

This thin-wrapper C++ server wouldn't be too complex; it would just act as a rendering backend process for a more complex Go-based server (which would implement other user-facing features like model management, task queue management, etc.).
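Most of what the thin wrapper has to do is pull render parameters out of an HTTP request and forward them to txt2img. As a rough sketch of just that parsing step (the parameter names and the RenderParams struct are hypothetical, and URL-decoding plus the actual HTTP layer are omitted):

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical render-request parameters for the thin wrapper.
struct RenderParams {
    std::string prompt;
    int width  = 512;
    int height = 512;
    int steps  = 25;
};

// Parse a "key=value&key=value" query/body string into RenderParams.
// (URL-decoding is omitted for brevity.)
RenderParams parse_render_params(const std::string& body) {
    std::map<std::string, std::string> kv;
    std::istringstream ss(body);
    std::string pair;
    while (std::getline(ss, pair, '&')) {
        auto eq = pair.find('=');
        if (eq != std::string::npos)
            kv[pair.substr(0, eq)] = pair.substr(eq + 1);
    }
    RenderParams p;
    if (kv.count("prompt")) p.prompt = kv["prompt"];
    if (kv.count("width"))  p.width  = std::stoi(kv["width"]);
    if (kv.count("height")) p.height = std::stoi(kv["height"]);
    if (kv.count("steps"))  p.steps  = std::stoi(kv["steps"]);
    return p;
}
```

The handler would then pass these fields straight into txt2img and stream the resulting pixels back.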

Here's a simple C++ example:

#include "stable-diffusion.h"
#include <cstdlib>
#include <iostream>

int main() {
    // Create the Stable Diffusion context
    sd_ctx_t* ctx = new_sd_ctx("F:\\path\\to\\sd-v1-5.safetensors", "", "", "", "", "", "", "", "", "", "",
                                false, false, false, -1, SD_TYPE_F16, STD_DEFAULT_RNG, DEFAULT, false, false, false);

    if (ctx == NULL) {
        std::cerr << "Failed to create Stable Diffusion context." << std::endl;
        return -1;
    }

    // Generate image using txt2img
    sd_image_t* image = txt2img(ctx, "A beautiful landscape painting", "", 0, 7.5f, 1.0f, 512, 512,
                                EULER_A, 25, 42, 1, NULL, 0.0f, 0.0f, false, "");

    if (image == NULL) {
        std::cerr << "txt2img failed." << std::endl;
        free_sd_ctx(ctx);
        return -1;
    }

    // Output image details
    std::cout << "Generated image: " << image->width << "x" << image->height << std::endl;

    // Cleanup (the returned image buffer is owned by the caller)
    free(image->data);
    free(image);
    free_sd_ctx(ctx);

    return 0;
}
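The example above only prints the image dimensions. To actually save the result without pulling in an image library, the raw RGB buffer can be dumped as a binary PPM file using just the standard library. A sketch (the Image struct below mirrors sd.cpp's sd_image_t layout, redeclared here only so the snippet stands alone):

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Mirrors sd.cpp's sd_image_t layout (width, height, channel, data);
// redeclared here only so this sketch is self-contained.
struct Image {
    uint32_t width;
    uint32_t height;
    uint32_t channel;  // 3 for RGB
    uint8_t* data;
};

// Write an RGB buffer as a binary PPM (P6) file -- no external dependencies.
bool write_ppm(const Image& img, const std::string& path) {
    if (img.channel != 3 || img.data == nullptr) return false;
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out << "P6\n" << img.width << " " << img.height << "\n255\n";
    out.write(reinterpret_cast<const char*>(img.data),
              static_cast<std::streamsize>(img.width) * img.height * 3);
    return static_cast<bool>(out);
}
```

PPM is trivially convertible to PNG later; for a real UI you'd more likely encode PNG directly (sd.cpp's examples use stb_image_write for this).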

Part 2: Compiling sd.cpp from scratch (as a sub-folder in my project)

Update: This code is now available in a GitHub repo.

The next experiment was to compile sd.cpp from scratch on my PC (using both the MinGW compiler and Microsoft's VS compiler). I used sd.cpp as a git submodule in my project, and linked to it statically.

I needed this initially to investigate a segfault inside a function of stable-diffusion.dll, which I wasn't able to trace (even with gdb). Plus it was fun to compile the whole thing and see the entire Stable Diffusion implementation fit into a tiny binary that starts up really quickly: a few megabytes for the CPU-only build.

My folder tree was:

- stable-diffusion.cpp # sub-module dir
- src/main.cpp
- CMakeLists.txt

src/main.cpp is the same as before, except for this change at the start of int main() (in order to capture the logs):

void sd_log_cb(enum sd_log_level_t level, const char* log, void* data) {
    std::cout << log;
}

int main(int argc, char* argv[]) {
    sd_set_log_callback(sd_log_cb, NULL);

    // ... rest of the code is the same
}
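The callback also receives the log level, so it's easy to filter out noisy messages or prefix each line. A sketch (the sd_log_level_t values mirror sd.cpp's header, redeclared here so the snippet compiles standalone):

```cpp
#include <iostream>
#include <string>

// Log levels as declared in sd.cpp's stable-diffusion.h.
enum sd_log_level_t { SD_LOG_DEBUG, SD_LOG_INFO, SD_LOG_WARN, SD_LOG_ERROR };

// Prefix each message with its level; drop DEBUG spam entirely.
std::string format_log(sd_log_level_t level, const char* msg) {
    static const char* names[] = {"DEBUG", "INFO", "WARN", "ERROR"};
    if (level == SD_LOG_DEBUG) return "";  // filtered out
    return std::string("[") + names[level] + "] " + msg;
}

void sd_log_cb(enum sd_log_level_t level, const char* log, void* /*data*/) {
    std::string line = format_log(level, log);
    if (!line.empty()) std::cout << line;
}
```

The same callback shape could also forward log lines to the HTTP wrapper's clients instead of stdout.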

And CMakeLists.txt is:

cmake_minimum_required(VERSION 3.13)
project(sd2)

# Set C++ standard
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Add submodule directory for stable-diffusion
add_subdirectory(stable-diffusion.cpp)

# Include directories for stable-diffusion and its dependencies
include_directories(stable-diffusion.cpp src)

# Create executable from your main.cpp
add_executable(sd2 src/main.cpp)

# Link with the stable-diffusion library
target_link_libraries(sd2 stable-diffusion)

Compiled using:

cmake .
cmake --build . --config Release

This ran on the CPU, and was obviously slow. But good to see it running!

Tiny note: I noticed that compiling with g++ (MinGW64) resulted in faster iterations than MSVC; e.g. 3.5 sec/it vs 4.5 sec/it for SD 1.5 (euler_a, 256x256, fp32). Not sure why.
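For what it's worth, sec/it comparisons like this can be made with a small std::chrono wrapper around any repeated workload; a minimal sketch:

```cpp
#include <chrono>

// Run step_fn `steps` times and return wall-clock seconds per iteration.
template <typename Fn>
double seconds_per_iteration(int steps, Fn step_fn) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < steps; i++) step_fn();
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    return elapsed.count() / steps;
}
```

(sd.cpp also prints per-step timings in its own logs, which is where numbers like these usually come from.)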

Part 3: Compiling the CUDA version of sd.cpp

Just for the heck of it, I also installed the CUDA Toolkit and compiled the CUDA version of my example project. That took some fiddling: I had to copy some files around to make it work, and point the CUDAToolkit_ROOT environment variable to where the CUDA toolkit was installed (e.g. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6).

Compiled using:

cmake . -DSD_CUBLAS=ON
cmake --build . --config Release

The compilation took a long time, since it compiled all the CUDA kernels inside ggml. But it worked, and was as fast as the official sd.exe build for CUDA (which confirmed that nothing was misconfigured).

It resulted in a 347 MB binary (which compresses to a 71 MB .7z file for download). That's really good, compared to the 6 GB+ (uncompressed) behemoths in python-land for Stable Diffusion. Even including the CUDA DLLs (that are needed separately) that's "only" another 600 MB uncompressed (300 MB .7z compressed), which is still better.

Conclusions

The binary size (and being a single static binary) and the startup time are hands-down excellent. So that's pretty promising.

But in terms of performance, sd.cpp seems to be significantly slower for SD 1.5 than Forge WebUI (or even a basic diffusers pipeline): 3 it/sec vs 7.5 it/sec for an SD 1.5 image (euler_a, 512x512, fp16) on my NVIDIA 3060 12GB. I tested with the official sd.exe build. I don't know if this is just my PC, but another user reported something similar.

Interestingly, the implementation for the Flux model in sd.cpp runs as fast as Forge WebUI, and is pretty efficient with memory.

Also, I don't think it's really practical or necessary to compile sd.cpp from scratch, but I wanted the freedom to use things like the CLIP implementation inside sd.cpp, which isn't exposed via the DLL. That could also be achieved by submitting a PR to the sd.cpp project; maybe they'd be okay with exposing the useful inner models in the main DLL as well.

But it'll be interesting to link this with the fast-starting Go frontend (from yesterday), or maybe even run it as a fast-starting standalone C++ server. Projects like Jellybox already exist (a Go frontend with an sd.cpp backend), but it's interesting to play with this anyway, to see how small and fast an SD UI can be made.