tl;dr - Today I worked on using stable-diffusion.cpp in a simple C++ program, both as a linked library and by compiling sd.cpp from scratch (with and without CUDA). The intent was to get a tiny, fast-starting executable UI for Stable Diffusion working. Also, ChatGPT is very helpful!
Part 1: Using sd.cpp as a library
First, I tried calling the stable-diffusion.cpp library from a simple C++ program (which just loads the model and renders an image), via dynamic linking. That worked: its performance was the same as the example sd.exe CLI, and it detected and used the GPU correctly.
The basic commands for this were (using MinGW64):
gendef stable-diffusion.dll
dlltool --dllname stable-diffusion.dll --output-lib libstable-diffusion.a --input-def stable-diffusion.def
g++ -o your_program your_program.cpp -L. -lstable-diffusion
And I had to set a CMAKE_GENERATOR="MinGW Makefiles" environment variable. The steps will be different if using MSVC's cl.exe.
I figured that I could write a simple HTTP server in C++ that wraps sd.cpp. Using a different language would mean keeping a language binding up-to-date with sd.cpp's header file; for example, the Go wrapper is currently out-of-date with sd.cpp's latest header.
This thin-wrapper C++ server wouldn't be too complex: it would just act as a rendering backend process for a more complex Go-based server (which would implement other user-facing features like model management, task queue management etc).
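To make the "thin wrapper" idea concrete, here's a sketch of the request-parsing core such a server might have. Everything here is hypothetical (the RenderRequest struct, the query-string format, the field names are all made up for illustration); a real wrapper would also URL-decode and validate input, and would hand the parsed fields to sd.cpp's txt2img.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical request the Go frontend would POST to the C++ wrapper,
// e.g. "prompt=a%20cat&width=512&height=512&steps=25".
struct RenderRequest {
    std::string prompt;
    int width  = 512;
    int height = 512;
    int steps  = 25;
};

// Minimal query-string parser (no URL-decoding; sketch only).
RenderRequest parse_request(const std::string& query) {
    RenderRequest req;
    std::istringstream in(query);
    std::string pair;
    while (std::getline(in, pair, '&')) {
        auto eq = pair.find('=');
        if (eq == std::string::npos) continue;
        std::string key = pair.substr(0, eq);
        std::string val = pair.substr(eq + 1);
        if (key == "prompt")      req.prompt = val;
        else if (key == "width")  req.width  = std::stoi(val);
        else if (key == "height") req.height = std::stoi(val);
        else if (key == "steps")  req.steps  = std::stoi(val);
    }
    return req;
}
```

The handler behind this would just forward req.prompt, req.width, req.height and req.steps to the txt2img call shown below, and stream the resulting pixels back to the Go server.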
Here's a simple C++ example:
#include "stable-diffusion.h"
#include <iostream>

int main() {
    // Create the Stable Diffusion context
    sd_ctx_t* ctx = new_sd_ctx("F:\\path\\to\\sd-v1-5.safetensors", "", "", "", "", "", "", "", "", "", "",
                               false, false, false, -1, SD_TYPE_F16, STD_DEFAULT_RNG, DEFAULT, false, false, false);
    if (ctx == NULL) {
        std::cerr << "Failed to create Stable Diffusion context." << std::endl;
        return -1;
    }

    // Generate an image using txt2img
    sd_image_t* image = txt2img(ctx, "A beautiful landscape painting", "", 0, 7.5f, 1.0f, 512, 512,
                                EULER_A, 25, 42, 1, NULL, 0.0f, 0.0f, false, "");
    if (image == NULL) {
        std::cerr << "txt2img failed." << std::endl;
        free_sd_ctx(ctx);
        return -1;
    }

    // Output image details
    std::cout << "Generated image: " << image->width << "x" << image->height << std::endl;

    // Cleanup
    free_sd_ctx(ctx);
    return 0;
}
Part 2: Compiling sd.cpp from scratch (as a sub-folder in my project)
Update: This code is now available in a GitHub repo.
The next experiment was to compile sd.cpp from scratch on my PC (using both the MinGW compiler and Microsoft's VS compiler). I used sd.cpp as a git submodule in my project, and linked to it statically.
I needed this initially to investigate a segfault inside a function of stable-diffusion.dll, which I wasn't able to trace (even with gdb). Plus it was fun to compile the entire thing and see the entire Stable Diffusion implementation fit into a tiny binary that starts up really quickly: a few megabytes for the CPU-only build.
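For reference, wiring sd.cpp in as a submodule looks something like this (the URL is the upstream stable-diffusion.cpp repo; the target folder name matches the tree below):

```shell
# Add stable-diffusion.cpp as a submodule in the project root
git submodule add https://github.com/leejet/stable-diffusion.cpp stable-diffusion.cpp
# Pull in its own nested dependencies (e.g. ggml)
git submodule update --init --recursive
```

The --recursive flag matters because sd.cpp itself vendors ggml as a submodule.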
My folder tree was:
- stable-diffusion.cpp # sub-module dir
- src/main.cpp
- CMakeLists.txt
src/main.cpp is the same as before, except for this change at the start of int main() (in order to capture the logs):
void sd_log_cb(enum sd_log_level_t level, const char* log, void* data) {
    std::cout << log;
}

int main(int argc, char* argv[]) {
    sd_set_log_callback(sd_log_cb, NULL);
    // ... rest of the code is the same
}
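The callback above ignores the level argument and prints everything. A variant that tags each message by severity might look like the sketch below; note the enum here is a local stand-in I've written for illustration — the real sd_log_level_t comes from stable-diffusion.h and its exact members may differ, so check the header.

```cpp
#include <iostream>
#include <string>

// Stand-in for sd.cpp's sd_log_level_t; the real enum lives in
// stable-diffusion.h and may not match this exactly.
enum sd_log_level_t { SD_LOG_DEBUG, SD_LOG_INFO, SD_LOG_WARN, SD_LOG_ERROR };

// Map a log level to a short printable tag.
std::string level_tag(sd_log_level_t level) {
    switch (level) {
        case SD_LOG_DEBUG: return "[debug] ";
        case SD_LOG_INFO:  return "[info] ";
        case SD_LOG_WARN:  return "[warn] ";
        case SD_LOG_ERROR: return "[error] ";
    }
    return "[?] ";
}

// Drop-in replacement for the sd_log_cb above, prefixing each message.
void sd_log_cb_tagged(sd_log_level_t level, const char* log, void* /*data*/) {
    std::cout << level_tag(level) << log;
}
```

This is handy once the logs are being forwarded to a frontend, since the level can be used to filter or colorize them.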
And CMakeLists.txt is:
cmake_minimum_required(VERSION 3.13)
project(sd2)
# Set C++ standard
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# Add submodule directory for stable-diffusion
add_subdirectory(stable-diffusion.cpp)
# Include directories for stable-diffusion and its dependencies
include_directories(stable-diffusion.cpp src)
# Create executable from your main.cpp
add_executable(sd2 src/main.cpp)
# Link with the stable-diffusion library
target_link_libraries(sd2 stable-diffusion)
Compiled using:
cmake
cmake --build . --config Release
This ran on the CPU, and was obviously slow. But good to see it running!
Tiny note: I noticed that compiling with g++ (mingw64) resulted in faster iterations than MSVC: for example, 3.5 sec/it vs 4.5 sec/it for SD 1.5 (euler_a, 256x256, fp32). Not sure why.
Part 3: Compiling the CUDA version of sd.cpp
Just for the heck of it, I also installed the CUDA Toolkit and compiled the CUDA version of my example project. That took some fiddling: I had to copy some files around to make it work, and point the CUDAToolkit_ROOT environment variable to where the CUDA toolkit was installed (e.g. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6).
Compiled using:
cmake -DSD_CUBLAS=ON
cmake --build . --config Release
The compilation took a long time, since it compiled all the CUDA kernels inside ggml. But it worked, and was as fast as the official sd.exe build for CUDA (which confirmed that nothing was misconfigured).
It resulted in a 347 MB binary (which compresses to a 71 MB .7z file for download). That's really good, compared to the 6 GB+ (uncompressed) behemoths in python-land for Stable Diffusion. Even including the CUDA DLLs (that are needed separately) that's "only" another 600 MB uncompressed (300 MB .7z compressed), which is still better.
Conclusions
The binary size (and being a single static binary) and the startup time are hands-down excellent. So that's pretty promising.
But in terms of performance, sd.cpp seems to be significantly slower for SD 1.5 than Forge WebUI (or even a basic diffusers pipeline): 3 it/sec vs 7.5 it/sec for an SD 1.5 image (euler_a, 512x512, fp16) on my NVIDIA 3060 12GB. I tested with the official sd.exe build. I don't know if this is just my PC, but another user reported something similar.
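To put those rates in per-image terms (assuming a 25-step run, like the earlier example): 25 steps at 3 it/sec is about 8.3 seconds per image, while 7.5 it/sec is about 3.3 seconds. A trivial helper makes the conversion explicit:

```cpp
#include <cassert>

// Seconds to render one image at a given sampler throughput,
// assuming `steps` iterations per image.
double seconds_per_image(double iters_per_sec, int steps) {
    return steps / iters_per_sec;
}
```

So the gap is roughly 5 seconds per image at these settings, which adds up quickly over a batch.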
Interestingly, the implementation of the Flux model in sd.cpp runs as fast as Forge WebUI, and is pretty efficient with memory.
Also, I don't think it's really practical or necessary to compile sd.cpp from scratch, but I wanted the freedom to use things like the CLIP implementation inside sd.cpp, which isn't exposed via the DLL. But that could also be achieved by submitting a PR to the sd.cpp project; maybe they'd be okay with exposing the useful inner models in the main DLL as well.
But it'll be interesting to link this with the fast-starting Go frontend (from yesterday), or maybe even just as a fast-starting standalone C++ server. Projects like Jellybox already exist (a Go frontend with an sd.cpp backend), but it's interesting to play with this anyway, to see how small and fast an SD UI can be made.