cmdr2's notes

Development update for Easy Diffusion - It's chugging along in starts and stops. Broadly, there are three tracks:

- Maintenance: The past few months have seen increased support for AMD, Intel and integrated GPUs. This includes AMD on Windows. Added support for the new AMD 9060/9070 cards last week, and the new NVIDIA 50xx cards in March.

- Flux to the main branch / release v3.5 to stable: Right now, Flux / v3.5 still requires you to enable the ED beta first, and then install Forge. Last week I got Flux working in our main engine (with decent rendering speed). It still needs more work to support all the different model formats for Flux. Using Forge was a temporary arrangement, until Flux worked in our main engine.

- ED v4: I'm continuing to work on a new C++ based engine for ED 4 (based on ggml), which will allow ED to start up in less than a second, and have a significantly smaller download size. The new engine will also use quantization a lot more. Initially, only a few models will be supported in the new engine, so I think power users will continue using the v3.5 engine for advanced models.

Maintenance is going on regularly. Flux is now my active project again, as of last week. I spent a lot of time on ED 4 in Feb and Mar (and contributed code to ggml), but I've paused work on that until Flux goes to ED's stable branch.

The past few months have been a bit chaotic for me wrt time, since we were moving houses. I usually work one day a week on ED, sometimes less.

Experimented with an idea for extending HTML/CSS/JS to define 3D scenes, treating a 3D scene as just a depth extension of the DOM model.

This explores a syntax for defining a 3D scene in a web browser (especially for VR), without WebXR boilerplate, and with XR controller inputs handled as first-class browser events. I'll also explore a polyfill to support this on existing WebXR-compliant browsers.

My previous attempt at this idea (back in 2014) didn't go so well. At that point, I hadn't built any VR experiences, and the syntax I came up with wasn't very practical or productive (at creating anything beyond toy-sized scenes). I'm curious to see if I can do better this time, as most of my work since then has been about building VR experiences.

Here's a simple scene which contains a single 3D model, with an optional skybox:

<!DOCTYPE html>
<html>
<head>
<style>
body {
  background: skybox-procedural();
}
#terrain {
  material: unlit;
  texture: url("assets/terrain.jpg");
}
</style>
</head>
<body>
  <model id="terrain" src="assets/terrain.fbx" />
</body>
</html>

The body element represents an infinite bounding volume. The model tag is an extension element, and it is styled using some extended CSS properties.

Let's customize the skybox and add some linear fog:

<!DOCTYPE html>
<html>
<head>
<style>
body {
  background:
    skybox-procedural(#88aaff, #ddeeff, 0.5, 1.0), /* sky color, base color, atmospheric thickness, exposure */
    fog-linear(#cce0ff, #ffffff, 0.0m, 500.0m); /* near color, far color, near distance, far distance */
}
#terrain {
  material: unlit;
  texture: url("assets/terrain.jpg");
}
</style>
</head>
<body>
  <model id="terrain" src="assets/terrain.fbx" />
</body>
</html>

We use CSS Background Layers to specify multiple backgrounds (skybox, and then fog) for the body element.

I'd also love to write a toy browser that supports linking between different webpages without leaving VR. Remember Janus VR (previously Firebox)?

Obviously there are a LOT of open questions, plenty of which probably won't ever have good answers. The main thing is, I want something like this for VR, and the VR experience would be the main priority.

Spent the last few days refactoring ggml-cpu.c in ggml. The ggml-cpu.c file is currently a monolith with around 15,000 lines of code, and needs to be refactored into separate files and de-duplicated using C++ function templates.

The first part of that refactoring was pushed earlier today - https://github.com/ggml-org/ggml/pull/1144

I also worked on the next two PRs - one that splits SIMD Mapping definitions and vectorized functions into separate files, and another that moves all the operator functions (except mul_mat) into a separate C++ file. I tested the combined effect of these two PRs, and it successfully passed the runners on ggml-ci. These two PRs will shrink ggml-cpu.c to around 5k lines (down from 15k lines right now).

The next step is to test these two PRs a lot more, and go over the diff again with a fine-tooth comb, to ensure that this doesn't mess up any CPU extension or scenario.
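
As a rough sketch of the kind of de-duplication that C++ function templates enable (illustrative only - the names below are hypothetical, not the actual code in these PRs): instead of a copy-pasted loop per operator, a single templated kernel can be instantiated for each scalar op.

#include <cmath>

static inline float op_relu(float x)    { return x > 0.0f ? x : 0.0f; }
static inline float op_sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// one templated loop replaces a near-identical hand-written function per operator
template <float (*Op)(float)>
static void vec_unary_f32(const int n, float* dst, const float* src) {
    for (int i = 0; i < n; ++i) {
        dst[i] = Op(src[i]);
    }
}

// usage:
// vec_unary_f32<op_relu>(n, dst, src);
// vec_unary_f32<op_sigmoid>(n, dst, src);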

Upgraded the default Python version in Easy Diffusion to 3.9. Newer versions of torch don't support Python 3.8, so this became urgent after the release of NVIDIA's 50xx series GPUs.

I chose 3.9 as a temporary fix (instead of a newer Python version), since it had the fewest package conflicts. The future direction of Easy Diffusion's backend is unclear right now - there are a bunch of possible paths. So I didn't want to spend too much time on this. I also wanted to minimize the risk to existing users.

Added support for float16 ADD/SUB/MUL/DIV operations in the CUDA backend of ggml. Also fixed the CPU implementation of these operations in float16 to work with repeating tensors, and added test cases. PR: https://github.com/ggml-org/ggml/pull/1121
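
To illustrate what "repeating tensors" means here: the second operand of these ops can be smaller than the first, and gets broadcast (repeated) across it. Here's a minimal float16 sketch of that scenario, using the same backend API as in the ggml intro posts further down this page (the header names and memory-overhead sizing are assumptions, not taken from the PR's test code):

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cpu.h"  // for ggml_backend_cpu_init() on newer ggml; older versions declare it in ggml-backend.h

#include <cstdio>
#include <vector>

int main() {
    ggml_backend_t backend = ggml_backend_cpu_init();

    // a small context that only holds tensor metadata and the graph (no_alloc, so tensor data lives on the backend)
    ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    ggml_context* ctx = ggml_init(params);

    // a is a 2x3 matrix, b is a length-2 vector that gets repeated across a's rows
    ggml_tensor* a = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, 2, 3);
    ggml_tensor* b = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, 2);
    ggml_tensor* result = ggml_add(ctx, a, b);

    ggml_cgraph* gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, result);

    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(allocr, gf);

    // fill the tensors with float16 data
    std::vector<ggml_fp16_t> a_data(6), b_data(2);
    for (int i = 0; i < 6; i++) a_data[i] = ggml_fp32_to_fp16(float(i));
    b_data[0] = ggml_fp32_to_fp16(10.0f);
    b_data[1] = ggml_fp32_to_fp16(20.0f);
    ggml_backend_tensor_set(a, a_data.data(), 0, ggml_nbytes(a));
    ggml_backend_tensor_set(b, b_data.data(), 0, ggml_nbytes(b));

    ggml_backend_graph_compute(backend, gf);

    std::vector<ggml_fp16_t> out(6);
    ggml_backend_tensor_get(result, out.data(), 0, ggml_nbytes(result));
    for (int i = 0; i < 6; i++) printf("%g ", ggml_fp16_to_fp32(out[i]));  // expected: 10 21 12 23 14 25
    printf("\n");

    ggml_gallocr_free(allocr);
    ggml_free(ctx);
    ggml_backend_free(backend);
    return 0;
}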

Discussed making ggml-cpu.c into a C++ file, so that we can use function templates to de-duplicate a huge amount of code in that file.

Also worked on adding float16 support (in CUDA and CPU) for a number of unary operators, like SQRT, RELU, GELU, SIGMOID, LOG, COS, CLAMP etc. It seems to be passing the tests, so I'll propose this as a PR soon.

Good tutorial for understanding the basics of CUDA: https://www.pyspur.dev/blog/introduction_cuda_programming. It also links to NVIDIA's simple tutorial.

Implemented a simple float16 addition kernel in CUDA at https://github.com/cmdr2/study/blob/main/ml/cuda/half_add.cu. Compile it using nvcc -o half_add half_add.cu.
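
For reference, a minimal float16 addition kernel along those lines might look like this (an illustrative sketch, not the exact contents of that file; it adds in float32 and converts back, so it compiles for any GPU architecture - __hadd() could be used instead on sm_53+):

#include <cstdio>
#include <cuda_fp16.h>

__global__ void half_add(const __half* a, const __half* b, __half* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // add in fp32 and convert back to fp16 (works on all architectures)
        c[i] = __float2half(__half2float(a[i]) + __half2float(b[i]));
    }
}

int main() {
    const int n = 8;
    __half h_a[n], h_b[n], h_c[n];
    for (int i = 0; i < n; i++) {
        h_a[i] = __float2half(float(i));
        h_b[i] = __float2half(10.0f * i);
    }

    __half *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, n * sizeof(__half));
    cudaMalloc((void**)&d_b, n * sizeof(__half));
    cudaMalloc((void**)&d_c, n * sizeof(__half));
    cudaMemcpy(d_a, h_a, n * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(__half), cudaMemcpyHostToDevice);

    half_add<<<1, 32>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, n * sizeof(__half), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; i++) {
        printf("%g ", __half2float(h_c[i]));  // expected: 0 11 22 33 44 55 66 77
    }
    printf("\n");

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}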

// Part 2 in the "Simple introduction to ggml" series.

At the end of Part 1, we learnt how to keep the model weights separate from temporary computation-only tensor variables. This allowed the model weights to stay in memory across multiple predictions (which is the usual behavior of machine learning programs during inference).

Now let's modify that to build a simple Neural Network model using ggml. If you're new to ggml, I recommend reading Part 1 first.

Model and Training

Our model will behave like a logic gate (AND, OR, XOR - depending on its training). We'll use a simple model which has 2 fully-connected layers (2 inputs, 16 hidden nodes, 1 output). This model's design (and its training code) is based on Omkar Prabu's excellent intro to ggml.

We'll train the model by running python train_logic_gate.py --print-weights, which will train an XOR gate and print the trained weights (and also write them to a model.sft file). You can also ask the program to train an AND or OR gate instead, by passing in a --gate-type argument.

The trained weights printed by the program will look something like this:

fc1_weight = { 0.22488207, -0.39456311, ..., 0.07894109, -0.41966945 }
fc1_bias = { -0.35652003, -0.67564911, ..., 1.17234588, 0.77097332 }
fc2_weight = { 0.13858399, -0.20547047, ..., -1.64424217, -0.63815284 }
fc2_bias = { -0.55232018 }

Inference in ggml

Now let's implement this model using ggml, in order to run inference on it.

Define the model

First, we'll define the model as a struct (for convenience). This model will contain 4 tensors, i.e. a pair of weights and biases for the two fully-connected layers.

struct logic_gate_model {
    ggml_tensor* fc1_weight;
    ggml_tensor* fc1_bias;
    ggml_tensor* fc2_weight;
    ggml_tensor* fc2_bias;
    ggml_context* params_ctx;

    struct model_config {
        int32_t n_input = 2;
        int32_t n_hidden = 16;
        int32_t n_output = 1;
    } config;
};

Define the tensor variables required for model weights

Then we'll modify the load_weights() function to create tensors for the model weights.

model.fc1_weight = ggml_new_tensor_2d(model.params_ctx, GGML_TYPE_F32, model.config.n_input, model.config.n_hidden);
model.fc1_bias = ggml_new_tensor_1d(model.params_ctx, GGML_TYPE_F32, model.config.n_hidden);
model.fc2_weight = ggml_new_tensor_2d(model.params_ctx, GGML_TYPE_F32, model.config.n_hidden, model.config.n_output);
model.fc2_bias = ggml_new_tensor_1d(model.params_ctx, GGML_TYPE_F32, model.config.n_output);

Allocate memory for the model weight tensors, and assign the model data

Next, we'll allocate backend memory for the weight tensors using ggml_backend_alloc_ctx_tensors() (just like in Part 1's bonus section), and then copy the weights printed by the training code into them.

ggml_backend_alloc_ctx_tensors(model.params_ctx, backend); // allocate backend memory for the weight tensors

std::vector<float> fc1_weight = { 0.22488207, -0.39456311, ..., 0.07894109, -0.41966945 };
std::vector<float> fc1_bias = { -0.35652003, -0.67564911, ..., 1.17234588, 0.77097332 };
std::vector<float> fc2_weight = { 0.13858399, -0.20547047, ..., -1.64424217, -0.63815284 };
std::vector<float> fc2_bias = { -0.55232018 };

ggml_backend_tensor_set(model.fc1_weight, fc1_weight.data(), 0, ggml_nbytes(model.fc1_weight));
ggml_backend_tensor_set(model.fc1_bias, fc1_bias.data(), 0, ggml_nbytes(model.fc1_bias));
ggml_backend_tensor_set(model.fc2_weight, fc2_weight.data(), 0, ggml_nbytes(model.fc2_weight));
ggml_backend_tensor_set(model.fc2_bias, fc2_bias.data(), 0, ggml_nbytes(model.fc2_bias));

Update the computation graph

We'll modify the predict() function to define an input tensor, and write the series of math operations (mirroring the forward() function in the corresponding PyTorch model).

struct ggml_tensor* x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, model.config.n_input);

struct ggml_tensor* fc1 = ggml_add(ctx, ggml_mul_mat(ctx, model.fc1_weight, x), model.fc1_bias);  // multiply the weights, and add the bias
struct ggml_tensor* fc1_relu = ggml_relu(ctx, fc1);
struct ggml_tensor* fc2 = ggml_add(ctx, ggml_mul_mat(ctx, model.fc2_weight, fc1_relu), model.fc2_bias);
struct ggml_tensor* result = ggml_hardsigmoid(ctx, fc2);

Load the input data for prediction

This will create a truth table for the inputs: (0, 0), (0, 1), (1, 0), (1, 1).

for (int i = 0; i < 2; i++) {
    for (int j = 0; j < 2; j++) {
        std::vector<float> input = {float(i), float(j)};
        predict(model, input);
    }
}
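
For reference, here's roughly what the predict() function could look like with these pieces put together. This is just a sketch (it assumes the global backend variable from Part 1, and a graph-sized context); the actual version is in the complete example linked below.

float predict(logic_gate_model& model, const std::vector<float>& input) {
    // temporary context for this run's tensors and graph
    ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    ggml_context* ctx = ggml_init(params);

    // input tensor and the computation graph (same ops as shown above)
    ggml_tensor* x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, model.config.n_input);
    ggml_tensor* fc1 = ggml_add(ctx, ggml_mul_mat(ctx, model.fc1_weight, x), model.fc1_bias);
    ggml_tensor* fc1_relu = ggml_relu(ctx, fc1);
    ggml_tensor* fc2 = ggml_add(ctx, ggml_mul_mat(ctx, model.fc2_weight, fc1_relu), model.fc2_bias);
    ggml_tensor* result = ggml_hardsigmoid(ctx, fc2);

    ggml_cgraph* gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, result);

    // allocate the graph's tensors on the backend, copy the input, and run
    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(allocr, gf);
    ggml_backend_tensor_set(x, input.data(), 0, ggml_nbytes(x));
    ggml_backend_graph_compute(backend, gf);

    // read back the single output value
    float y;
    ggml_backend_tensor_get(result, &y, 0, sizeof(y));

    ggml_gallocr_free(allocr);
    ggml_free(ctx);
    return y;
}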

A complete working example for this is at logic_gate.cpp. It also tells you how to compile (at the top).

Minor refactoring

The code in logic_gate.cpp is getting pretty messy, for a fairly simple model. This will make it challenging to write larger models in the future.

So let's clean up the implementation slightly. We'll move the code related to model weights and model computation into the model struct. This separates the model's logic from the code required for actually running the model.

The model struct now looks like this:

struct logic_gate_model {
    ggml_tensor* fc1_weight;
    ggml_tensor* fc1_bias;
    ggml_tensor* fc2_weight;
    ggml_tensor* fc2_bias;
    ggml_context* params_ctx;

    struct model_config {
        int32_t n_input = 2;
        int32_t n_hidden = 16;
        int32_t n_output = 1;
    } config;

    logic_gate_model() {
        // create a context (for weights)
        int num_weight_tensors = 4; // since we store four tensors in the model
        params_ctx = ggml_init({
            /*.mem_size   =*/ ggml_tensor_overhead() * num_weight_tensors,
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ true,
        });

        // Define the tensor variables required for model weights
        fc1_weight = ggml_new_tensor_2d(params_ctx, GGML_TYPE_F32, config.n_input, config.n_hidden);
        fc1_bias = ggml_new_tensor_1d(params_ctx, GGML_TYPE_F32, config.n_hidden);
        fc2_weight = ggml_new_tensor_2d(params_ctx, GGML_TYPE_F32, config.n_hidden, config.n_output);
        fc2_bias = ggml_new_tensor_1d(params_ctx, GGML_TYPE_F32, config.n_output);

        ggml_backend_alloc_ctx_tensors(params_ctx, backend);
    }

    ~logic_gate_model() {
        ggml_free(params_ctx);
    }

    void load_weights() {
        std::vector<float> fc1_weight_data = { 0.22488207, -0.39456311, 0.32581645, -0.56285965, 2.41329503, -2.41322660, -0.37499088, 0.08395171, 0.21755114, 0.80772698, 0.25437704, 1.57216692, -0.43496752, 0.22240390, 0.46247596, -0.02229351, 0.32341745, 0.25361675, -0.20483392, 0.26918083, -0.91469419, 1.23764634, 0.15310341, -0.67303509, 1.77088165, 1.77059495, -0.11867817, -0.37374884, 0.79170924, -1.17232382, 0.07894109, -0.41966945 };
        std::vector<float> fc1_bias_data = { -0.35652003, -0.67564911, 0.00009615, -0.62946773, 0.27859268, 0.01491952, 0.52390707, -0.47604990, -0.25365347, 0.21269353, 0.00003640, -0.44338676, -1.77084744, 0.82772928, 1.17234588, 0.77097332 };
        std::vector<float> fc2_weight_data = { 0.13858399, -0.20547047, 3.41583562, 0.15011564, 0.56532770, 1.40391135, 0.00871399, 0.24152395, -0.39389160, 0.16984159, 1.34791148, -0.12602532, -3.02119160, -0.68023020, -1.64424217, -0.63815284 };
        std::vector<float> fc2_bias_data = { -0.55232018 };

        ggml_backend_tensor_set(fc1_weight, fc1_weight_data.data(), 0, ggml_nbytes(fc1_weight));
        ggml_backend_tensor_set(fc1_bias, fc1_bias_data.data(), 0, ggml_nbytes(fc1_bias));
        ggml_backend_tensor_set(fc2_weight, fc2_weight_data.data(), 0, ggml_nbytes(fc2_weight));
        ggml_backend_tensor_set(fc2_bias, fc2_bias_data.data(), 0, ggml_nbytes(fc2_bias));
    }

    ggml_tensor* forward(ggml_context *ctx, ggml_tensor *x) {
        ggml_tensor* fc1 = ggml_add(ctx, ggml_mul_mat(ctx, fc1_weight, x), fc1_bias);  // multiply the weights, and add the bias
        ggml_tensor* fc1_relu = ggml_relu(ctx, fc1);
        ggml_tensor* fc2 = ggml_add(ctx, ggml_mul_mat(ctx, fc2_weight, fc1_relu), fc2_bias);
        return ggml_hardsigmoid(ctx, fc2);
    }
};

A complete working example for this is at logic_gate_refactored.cpp. It also tells you how to compile (at the top).

A note about model weights

As you've noticed, we hardcoded the trained weights in the inference code. This isn't ideal. So we need to write a utility function that loads the weights from the model.sft file (safetensors format).

I've implemented a very basic safetensors loader at safetensors.hpp. This implementation isn't very efficient for very large models, but it's sufficient for our purposes right now, and is easy to understand.

Let's modify the load_weights() function. First we'll remove the hardcoded weights. Next, we'll call safetensors::load_from_file() and assign the tensor data to the corresponding ggml_tensor in the callback function.

std::unordered_map<std::string, struct ggml_tensor*> tensor_map;
...

// names of the parameters as written by the training code
tensor_map["fc1.weight"] = fc1_weight;
tensor_map["fc1.bias"] = fc1_bias;
tensor_map["fc2.weight"] = fc2_weight;
tensor_map["fc2.bias"] = fc2_bias;

...
auto tensors = tensor_map;
safetensors::load_from_file("model.sft", [&tensors](const std::string& key, const std::string& dtype, const std::vector<uint64_t>& shape, const std::vector<uint8_t>& tensor_data) {
    std::cout<<"Read tensor: "<<key<<", size: "<<tensor_data.size()<<" bytes"<<std::endl;

    auto it = tensors.find(key);
    if (it != tensors.end()) {
        ggml_tensor* tensor = it->second;
        ggml_backend_tensor_set(tensor, tensor_data.data(), 0, ggml_nbytes(tensor));
    } else {
        std::cout<<"Unknown key: "<<key<<std::endl;
    }
});

A complete working example for this is at logic_gate_with_weights_file.cpp. It also tells you how to compile (at the top).

A simple introduction to ggml.

// This is Part 1 in a series on ggml. You can read Part 2 after this one.

This post uses the new "backend" API in ggml. I wrote this to explain ggml to myself. I'm still learning about it, so please feel free to suggest any corrections!

Overall flow of a ggml program

At a very high-level, a ggml program has the following steps:

1. Define the tensor variables

2. Define the computation graph

3. Allocate memory for the tensor variables, and assign the data

4. Run the computation, and read the result

Let's explore each step briefly:

1. Define the tensor variables: We'll define the data type and shape of each tensor variable that we need. For e.g. a 32-bit float tensor with shape (2, 4).

2. Define the computation graph: This is a fancy way of saying that we'll specify the operations that'll be performed on the tensor variables. For e.g. if x, y and z are three tensor variables, then the computation (x * y) + z can be represented as add(mul(x, y), z).

3. Allocate memory for the tensor variables, and assign the data: We'll ask ggml to allocate memory (on the backend) for all the tensors. ggml will go through the computation graph, and allocate memory for each tensor used in the graph. After that, we'll copy the data of each tensor to its allocated memory.

4. Run the computation, and read the result: We'll ask ggml to run the sequence of operations defined in step 2. We'll then read the result from the final output tensor (or any step of the computation).

Implementing this in ggml

Now let's see how we can implement these steps in ggml.

Define the tensor variables

We'll create tensor variables in ggml using ggml_new_tensor_1d(), ggml_new_tensor_2d(), ggml_new_tensor_3d() or ggml_new_tensor_4d(). These functions create 1-, 2-, 3- or 4-dimensional tensors, respectively. We'll pass the type, e.g. GGML_TYPE_F32 (32-bit float), as well as the shape of the tensor, e.g. 2, 4.

Note: At the moment, ggml does not support dimensions higher than 4, since that's typically what's used in machine learning programs.
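
For example, the 32-bit float tensor with shape (2, 4) mentioned above would be created like this (ctx is a ggml context, explained later in this post):

ggml_tensor* t = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 4);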

Define the computation graph

We'll define the computation graph by chaining together different operator functions, like ggml_add(), ggml_mul(), ggml_soft_max() etc. We can take the output of one function, and pass that as the input to another operator function.

We'll then pass the last tensor variable (e.g. "result") in our operator chain to ggml_build_forward_expand(). This will tell ggml to run the computation in the "forward" direction (i.e. for inference, not training). For book-keeping, we also need to create a graph object using ggml_new_graph(), which will represent our computation graph.

Allocate memory for the tensor variables, and assign the data

We'll create a memory allocator for the graph using ggml_gallocr_new(), and then call ggml_gallocr_alloc_graph(). This will go through the entire computation graph, and allocate memory (on the backend) for each tensor in the graph.

We'll then copy the data for each tensor to its allocated memory, using ggml_backend_tensor_set().

Run the computation, and read the result

We'll call ggml_backend_graph_compute() to run the computation on the backend. After that, we'll get a reference to the last tensor in the graph using ggml_graph_node() and read its data using ggml_backend_tensor_get().

Putting these together

Let's write some code for each block. In this example, we'll try to add three tensors: [1, 2, 3], [10, 20, 30], and [100, 200, 300].

Define the tensor variables

ggml_tensor* a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 3);
ggml_tensor* b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 3);
ggml_tensor* c = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 3);

This will create three 1-dimensional tensors a, b and c, each with 3 elements of type 32-bit float. Don't worry about the ctx variable right now.

Define the computation graph

ggml_tensor* result = ggml_add(ctx, a, ggml_add(ctx, b, c));

ggml_cgraph* gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, result);

This computation graph effectively represents: a + (b + c).

Allocate memory for the tensor variables, and assign the data

ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
ggml_gallocr_alloc_graph(allocr, gf);

std::vector<float> a_data = {1, 2, 3};
std::vector<float> b_data = {10, 20, 30};
std::vector<float> c_data = {100, 200, 300};
ggml_backend_tensor_set(a, a_data.data(), 0, ggml_nbytes(a));
ggml_backend_tensor_set(b, b_data.data(), 0, ggml_nbytes(b));
ggml_backend_tensor_set(c, c_data.data(), 0, ggml_nbytes(c));

This will allocate memory and assign the data to the tensors (on the backend). Don't worry about the backend variable right now.

Run the computation, and read the result

ggml_backend_graph_compute(backend, gf);

// get the last node in the graph
ggml_tensor* result_node = ggml_graph_node(gf, -1);

// create an array to store the result data
int n = ggml_nelements(result_node);
std::vector<float> result_data(n);

// copy the data from the backend memory into the result array
ggml_backend_tensor_get(result_node, result_data.data(), 0, ggml_nbytes(result_node));

A complete working example for this is at simple_addition.cpp. It also tells you how to compile (at the top).

What's a context? What's a backend?

A context keeps references to the data structures involved in the program. It is created using ggml_init(), and can optionally allocate memory. For e.g. a context can hold references to the ggml_tensor objects that are long-lived, or references to the ggml_graph objects.

A backend refers to the device on which computations will be run, e.g. CUDA or the CPU. The backend object is created using ggml_backend_cuda_init() or ggml_backend_cpu_init() (or similar functions for other backend types).
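
For example, here's a minimal sketch of creating both (error handling omitted; the exact mem_size is just an illustrative value, and the CUDA line assumes ggml was built with its CUDA backend):

// a context that only tracks tensor/graph metadata (no_alloc = true, so no tensor data is stored in it)
ggml_init_params params = {
    /*.mem_size   =*/ 16 * 1024 * 1024,
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,
};
ggml_context* ctx = ggml_init(params);

// pick a backend: the CPU always works, CUDA requires the CUDA backend to be compiled in
ggml_backend_t backend = ggml_backend_cpu_init();
// ggml_backend_t backend = ggml_backend_cuda_init(0); // device 0

// ... create tensors, build and run the graph ...

ggml_backend_free(backend);
ggml_free(ctx);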

Bonus: Keeping model weights in memory, across multiple inference runs

In a typical machine learning program, we load the model weights at the beginning, and run inference on those weights repeatedly. We usually don't re-allocate the memory (or reload the data) for model weights each time we want to run inference computations on them.

So let's modify our steps a bit to handle this scenario.

Note: This is purely an optimization, and it is completely optional.

Let's add two steps to the beginning:

1. [new] Define the tensor variables required for model weights

2. [new] Allocate memory for the model weight tensors, and assign the model data

The rest of the steps will remain unchanged. This ensures that the model weights remain in memory across multiple inference computations.

Define the tensor variables required for model weights

ggml_tensor* weight = ggml_new_tensor_1d(ctx_weights, GGML_TYPE_F32, 3);

Allocate memory for the model weight tensors, and assign the model data

ggml_backend_alloc_ctx_tensors(ctx_weights, backend);

std::vector<float> weight_data = {0, 1, 2};
ggml_backend_tensor_set(weight, weight_data.data(), 0, ggml_nbytes(weight));

We'll use a different function to allocate the memory (ggml_backend_alloc_ctx_tensors()), since we won't have a computation graph at this point.

We can now use the weight tensor in our computation graph, without having to allocate or assign its data in each inference run.

A complete working example with this modification is at simple_addition_with_static_weights.cpp. It also tells you how to compile (at the top).

In Part 2, we'll explore how to build a simple Neural Network using ggml!

Easy Diffusion (and sdkit) now also support AMD on Windows automatically (using DirectML), thanks to integrating with torchruntime. It also supports integrated GPUs (Intel and AMD) on Windows, making Easy Diffusion faster on PCs without dedicated graphics cards.

Spent the last week or two getting torchruntime fully integrated into Easy Diffusion, and making sure that it handles all the edge-cases.

Easy Diffusion now uses torchruntime to automatically install the best-possible version of torch (on the users' computer) and support a wider variety of GPUs (as well as older GPUs). And it uses a GPU-agnostic device API, so Easy Diffusion will automatically support additional GPUs when they are supported by torchruntime.

This also makes it easier for developers to add support for more (or newer) GPUs in a simpler code base (i.e. torchruntime), instead of digging into the internals of Easy Diffusion's codebase.

This removes a lot of custom, hacky code from Easy Diffusion for installing torch. This hacky code was the initial source of knowledge for torchruntime, and I'm thankful to all the contributors who poured their knowledge into it.

Continued to test and fix issues in sdkit, after the change to support DirectML. The change is fairly intrusive, since it replaces direct references to torch.cuda with a layer of abstraction.

Fixed a few regressions, and it now passes all the regression tests for CPU and CUDA support (i.e. existing users). Will test for DirectML next, although it will fail (with out-of-memory) for anything but the simplest tests (since DirectML is quirky with memory allocation).

Spent a bit of time with Freebird - added the ability to scale the radius of curve points in Freebird (in EDIT mode), thanks to a user code contribution. And did some user support.

Also experimented with libtorch and wrote a simple C++ program that uses libtorch for computation. Managed to compile it for CPU and CUDA using the instructions on their website. But it feels fairly heavyweight compared to ggml, especially if compiled statically (which would take a LONG time). The experiment was driven by a desire to reduce the installation size of Easy Diffusion (from the current 6-8 GB behemoth), without necessarily discarding torch. Might experiment with compiling libtorch statically sometime, just to see how long it takes, and whether the final binary size is small enough for my needs.

Worked on adding support for DirectML in sdkit. This allows AMD GPUs and Integrated GPUs to generate images on Windows.

DirectML seems like it's really inefficient with memory though. So for now it only manages to generate images using SD 1.5. XL and larger models fail to generate, even though I have 12 GB of VRAM in my graphics card.

Continued from Part 1.

Spent a few days figuring out how to compile binary wheels of PyTorch and include all the necessary libraries (ROCm libs or CUDA libs).

tl;dr - In Part 2, the compiled PyTorch wheels now include the required libraries (including ROCm). But this isn't over yet. Torch starts now, but adding two numbers with it produces garbage values (on the GPU). There's probably a bug in the included ROCBLAS version, so I might need to recompile ROCBLAS for gfx803 separately. Will tackle that in Part 3 (tbd).

Compilation

Here's the process:

1. Pull the Docker image for the required ROCm version. For e.g. docker pull rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_1.13.1 for Torch 1.13.1 with ROCm 5.7.

2. Start the Docker instance: docker run -it YOUR_IMAGE_ID bash

3. (Optional) Install uv and after installing it, source the bash file for uv (it'll tell you to do so).

curl -LsSf https://astral.sh/uv/install.sh | sh

4. (Optional) Create a venv (for using a different python version):

cd ~
uv venv --python 3.10 torch-gfx803
cd torch-gfx803
source bin/activate

5. Link the required projects, and get the pytorch-builder repository (pytorch-builder is no longer used for PyTorch 2.4+, but I need it for PyTorch 1.13):

ln -s /var/lib/jenkins/pytorch pytorch
git clone https://github.com/pytorch/builder.git

6. Fix build-specific issues for ROCm 5.7:

cd pytorch
ln -s /usr/bin/patchelf /usr/local/bin/patchelf
mkdir -p .ci/docker/ci_commit_pins
echo "34f8189eae57a23cc15b4b4f032fe25757e0db8e" > .ci/docker/ci_commit_pins/triton-rocm.txt
echo "2.1.0" > .ci/docker/triton_version.txt

7. Edit ~/torch-gfx803/builder/manywheel/build_rocm.sh and comment out the entire block related to triton at the end (where it appends the +{TRITON_SHORTHASH} suffix).

8. Install ccache:

sudo apt install ccache

9. Set the required env variables (you can also save this in a file and source it instead, for convenience):

export HSA_OVERRIDE_GFX_VERSION="8.0.3"
export ROC_ENABLE_PRE_VEGA="1"
export PYTORCH_ROCM_ARCH="gfx803"
export ROCM_ARCH="gfx803"
export TORCH_BLAS_PREFER_HIPBLASLT="0"
export USE_CUDA="0"
export USE_ROCM="1"
export USE_LMDB="1"
export USE_OPENCV="1"
export USE_MKLDNN="0"
export USE_MPI="0"
export USE_NINJA="1"
export BLAS="Eigen"
export FORCE_CUDA="1"
export DESIRED_CUDA="rocm5.7"
export DESIRED_PYTHON="3.10"
export PYTORCH_FINAL_PACKAGE_DIR="/root/torch-gfx803/wheels"
export PYTORCH_ROOT="/root/torch-gfx803/pytorch"

export PYTORCH_BUILD_VERSION=1.13.0 PYTORCH_BUILD_NUMBER=1

10. Build PyTorch:

~/torch-gfx803/builder/manywheel/build_rocm.sh

Post-installation

For now, the user needs to install OpenMPI and MKL on their PC, before being able to use this wheel. I'm not sure how to remove the requirement for these libraries - the official PyTorch 1.13.0+rocm5.2 wheel does not include or need them, so maybe these requirements were added somewhere between ROCm 5.2 and 5.7.

To install these libraries, the user needs to run these commands on their PC:

OpenMPI:

sudo apt install libopenmpi-dev

MKL:

wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-oneapi-mkl-devel=2024.0.0-49656

export LD_LIBRARY_PATH=/opt/intel/oneapi/mkl/2024.0/lib:$LD_LIBRARY_PATH

But does it work?

Sort of. It now loads, i.e. we can run import torch and it works. But it produces garbage results if we add two numbers with it. CPU math works, but GPU math fails. For e.g.

>>> import torch
>>> cpu_x = torch.tensor([0, 1, 2])
>>> rocm_x = torch.tensor([0, 1, 2], device='cuda:0')
>>> cpu_x + 10
tensor([10, 11, 12])  # correct
>>> rocm_x + 10
tensor([ 4492320119074422909,   -88335127951390314, -7107455620438441222], device='cuda:0')  # <---- garbage values

There's probably a bug in the included ROCBLAS version, so I might need to recompile ROCBLAS for gfx803 separately. I could also write a simple addition program ([1], [2]) that uses ROCBLAS, to test it.
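
A minimal sketch of what such a test could look like (hypothetical - compiled with hipcc and linked against rocblas; the header path may be <rocblas.h> on older ROCm installs). It uses rocBLAS's saxpy to add a vector on the GPU:

#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>

#include <cstdio>
#include <vector>

int main() {
    const int n = 3;
    std::vector<float> x = {0, 1, 2};
    std::vector<float> y = {10, 10, 10};
    const float alpha = 1.0f;

    float *dx, *dy;
    hipMalloc((void**)&dx, n * sizeof(float));
    hipMalloc((void**)&dy, n * sizeof(float));
    hipMemcpy(dx, x.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, y.data(), n * sizeof(float), hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    // y = alpha * x + y, computed on the GPU
    rocblas_status status = rocblas_saxpy(handle, n, &alpha, dx, 1, dy, 1);

    hipMemcpy(y.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
    printf("status: %d, result: %g %g %g (expected 10 11 12)\n", (int)status, y[0], y[1], y[2]);

    rocblas_destroy_handle(handle);
    hipFree(dx);
    hipFree(dy);
    return 0;
}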

Will tackle that in Part 3 (tbd).

Continued in Part 2, where I figured out how to include the required libraries in the wheel.

Spent all of yesterday trying to compile pytorch with the compile-time PYTORCH_ROCM_ARCH=gfx803 environment variable.

tl;dr - In Part 1, I compiled wheels for PyTorch with ROCm, in order to add support for older AMD cards like RX 480. I managed to compile the wheels, but the wheel doesn't include the required ROCm libraries. I figured that out in Part 2.

The intention was to build ROCm 6.2 wheels for torch 2.4 with support for older AMD cards (like RX 480) that don't work with the official torch binary wheels. This is supposed to work (another).

If this worked, I could've hosted the wheel for users of Easy Diffusion (or torchruntime) with older AMD GPUs, without requiring them to install ROCm separately on their PCs.

Compilation was successful, but I wasn't able to get the compiled wheels to include the required libraries like libMIOpen. The compiled wheel was around 300 MB, while the official torch+rocm wheels are nearly 3 GB. A diff of the two wheels using unzip -Z1 shows that's because of the missing libraries.

Edit: Figured this out in Part 2.

I went through the builder code at the torch repo, as well as the deprecated pytorch-builder repo, but couldn't get this to include the libraries. I even tried auditwheel, but that failed with a "very-recent version of stdlib" error (something like that).

In any case, I followed this guide for compiling torch for ROCm.

Notes:

1. Ensure your PC has at least 120 GB of free disk space.

2. On a Windows host, use this command to start the container: docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ipc=host --shm-size 8G rocm/pytorch:latest-base

3. Create a uv venv if you need to build for a different version of Python.

Spent the last few days writing torchruntime, which will automatically install the correct torch distribution based on the user's OS and graphics card. This package was written by extracting this logic out of Easy Diffusion, and refactoring it into a cleaner implementation (with tests).

It can be installed (on Win/Linux/Mac) using pip install torchruntime.

The main intention is that it'll be easier for developers to contribute updates (for e.g. for newer or older GPUs). It wasn't easy to find or modify this code previously, since it was buried deep inside Easy Diffusion's internals.

Additionally, this package is useful to other developers who're building PyTorch-based apps that target users with NVIDIA/AMD/Intel graphics cards on Windows/Linux/Mac. The logic inside this package originates from a lot of user support and bug reports over the years in Easy Diffusion, and a bunch of research on the Internet (for the variety of combinations and settings required for AMD).

Spent most of the day doing some support work for Easy Diffusion, and experimenting with torch-directml for AMD support on Windows.

From the initial experiments, torch-directml seems to work properly with Easy Diffusion. I ran it on my NVIDIA card, and another user ran it on their AMD Radeon RX 7700 XT.

It's 7-10x faster than the CPU, so looks promising. It's 2x slower than CUDA on my NVIDIA card, but users with NVIDIA cards are not the target audience of this change.

I still need to run the full set of automated tests, so there's a chance of some corner scenario breaking.

Spent a few days prototyping a UI for Easy Diffusion v4. Files are at this repo.

The main focus was to get a simple but pluggable UI, that was backed by a reactive data model, and to allow splitting the codebase into individual components (with their own files). And require only a text editor and a browser to develop, i.e. no compilation or nodejs-based developer experiences.

I really want something that is easy to understand - for an outside developer and for myself (for e.g. if I'm returning to a portion of the codebase after a while). And with very little friction to start developing for it.

It uses Vue, but directly in the browser. I use vue3-sfc-loader to allow the UI to be divided into separate component files, without requiring compilation.

I got a basic tabbed interface shell working, and laid out the foundational data structures, and tested that plugins could add new tabs as well.

Next, I'm going to experiment with PrimeVue for fleshing out a simple UI. I looked at quite a few UI libraries (including classic Bootstrap), and PrimeVue seems closest to my own mindset - like if I designed a UI library, it would look a lot like PrimeVue. And it seems to have most of the components that I require.

Really need to figure out a way to render standard HTML elements (styled with CSS and modified with JS) in a 3D scene. Reinventing excellent libraries like PrimeVue again inside 3D (for rendering in VR) is just wasteful.

There have been attempts, e.g. A-Frame, but we really need to view the webpage in 3D. Just regular HTML elements. The regular DOM renderer. The pieces feel like they're there conceptually, but the implementation gap is probably big enough (that it hasn't happened yet).

For e.g. Freebird (Blender VR plugin) had to reinvent everything from scratch - UI buttons were rectangles drawn using shaders, grid layouts had to be reimplemented from scratch. I abstracted it into a sensible framework (heavily mimicking HTML/CSS/JS, because it makes sense!). But obviously this is all very wasteful, and results in poor quality UIs, because building a UI framework isn't the primary focus of Freebird (it's 3D modeling in VR).

A simple browser-like shell using ImGui and GLFW. It was supposed to show a webview, but I couldn't figure out how to embed a webview in the window (instead of it popping up in its own window). Maybe I'll revisit this in the future if I can figure it out.

Create a folder named thirdparty (alongside main.cpp and CMakeLists.txt) and clone the git repositories for imgui and glfw into the thirdparty folder.

Then compile using:

cmake -B build
cmake --build build --config Release

And run the compiled executable.

main.cpp:

#include <vector>
#include <string>
#include <memory>
#include <stdexcept>
#include <algorithm> // for std::max
#include <cstdio>    // for fprintf
#include <imgui.h>
#include <imgui_impl_glfw.h>
#include <imgui_impl_opengl3.h>
#include <GLFW/glfw3.h>

#include <iostream>

struct Tab {
    std::string url;

    Tab(const std::string& initial_url)
        : url(initial_url) {
    }
};

class Browser {
public:
    Browser() {
        tabs.emplace_back("https://example.com"); // Default tab
        current_tab = 0;
    }

    void run() {
        if (!glfwInit()) {
            throw std::runtime_error("Failed to initialize GLFW");
        }

        GLFWwindow* window = glfwCreateWindow(1280, 720, "Tabbed Browser", nullptr, nullptr);
        if (!window) {
            glfwTerminate();
            throw std::runtime_error("Failed to create GLFW window");
        }

        glfwMakeContextCurrent(window);
        glfwSwapInterval(1);

        ImGui::CreateContext();
        ImGui_ImplGlfw_InitForOpenGL(window, true);
        ImGui_ImplOpenGL3_Init("#version 130");

        while (!glfwWindowShouldClose(window)) {
            glfwPollEvents();
            render(window);
        }

        ImGui_ImplOpenGL3_Shutdown();
        ImGui_ImplGlfw_Shutdown();
        ImGui::DestroyContext();
        glfwDestroyWindow(window);
        glfwTerminate();
    }

private:
    std::vector<Tab> tabs;
    int current_tab;

    void render(GLFWwindow* window) {
        ImGui_ImplOpenGL3_NewFrame();
        ImGui_ImplGlfw_NewFrame();
        ImGui::NewFrame();

        ImGui::SetNextWindowPos(ImVec2(0, 0));
        ImGui::SetNextWindowSize(ImGui::GetIO().DisplaySize);
        ImGui::PushStyleVar(ImGuiStyleVar_WindowBorderSize, 0);
        ImGui::PushStyleVar(ImGuiStyleVar_WindowPadding, ImVec2(0, 0));
        ImGui::PushStyleVar(ImGuiStyleVar_WindowRounding, 0);

        ImGui::Begin("Browser", nullptr, ImGuiWindowFlags_NoDecoration | ImGuiWindowFlags_NoMove);

        if (ImGui::BeginTabBar("Tabs")) {
            for (size_t i = 0; i < tabs.size(); ++i) {
                bool open = true;
                if (ImGui::BeginTabItem(("Tab " + std::to_string(i + 1)).c_str(), &open)) {
                    current_tab = static_cast<int>(i);
                    ImGui::Text("URL %zu: %s", i, tabs[i].url.c_str());
                    ImGui::EndTabItem();
                }
                if (!open) {
                    tabs.erase(tabs.begin() + i);
                    if (current_tab >= static_cast<int>(i)) {
                        current_tab = std::max(0, current_tab - 1);
                    }
                }
            }

            // Add new tab button
            if (ImGui::TabItemButton("+")) {
                tabs.emplace_back("https://example.com");
                current_tab = static_cast<int>(tabs.size()) - 1;
            }

            ImGui::EndTabBar();
        }

        // Add the ≡ button
        ImGui::SameLine();
        if (ImGui::Button("≡")) {
            ImGui::OpenPopup("TabMenu");
        }

        // Menu popup logic
        if (ImGui::BeginPopup("TabMenu")) {
            if (ImGui::MenuItem("New Tab")) {
                tabs.emplace_back("https://example.com");
                current_tab = static_cast<int>(tabs.size()) - 1;
            }
            if (ImGui::MenuItem("New Private Tab")) {
                tabs.emplace_back("https://example.com?private=1");
                current_tab = static_cast<int>(tabs.size()) - 1;
            }
            if (ImGui::MenuItem("Settings")) {
                tabs.emplace_back("https://settings.example.com");
                current_tab = static_cast<int>(tabs.size()) - 1;
            }
            ImGui::EndPopup();
        }

        ImGui::End();
        ImGui::PopStyleVar(3);

        ImGui::Render();
        int display_w, display_h;
        glfwGetFramebufferSize(window, &display_w, &display_h);
        glViewport(0, 0, display_w, display_h);
        glClearColor(0.45f, 0.55f, 0.60f, 1.00f);
        glClear(GL_COLOR_BUFFER_BIT);
        ImGui_ImplOpenGL3_RenderDrawData(ImGui::GetDrawData());
        glfwSwapBuffers(window);
    }
};

int main() {
    try {
        Browser browser;
        browser.run();
    } catch (const std::exception& e) {
        fprintf(stderr, "Error: %s\n", e.what());
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}

CMakeLists.txt:

cmake_minimum_required(VERSION 3.16)

# Project name and version
project(TabbedBrowser VERSION 1.0)

# Set C++ standard
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED True)

set(CMAKE_RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/bin")
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/lib")
set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/lib")

add_subdirectory(thirdparty/glfw)

set(SOURCES main.cpp)

set(IMGUI_SOURCES
    thirdparty/imgui/imgui.cpp
    thirdparty/imgui/imgui_draw.cpp
    thirdparty/imgui/imgui_tables.cpp
    thirdparty/imgui/imgui_widgets.cpp
    thirdparty/imgui/imgui_demo.cpp
    thirdparty/imgui/backends/imgui_impl_glfw.cpp
    thirdparty/imgui/backends/imgui_impl_opengl3.cpp
)

# Platform-specific settings
if(WIN32)
    set(PLATFORM_LIBS glfw opengl32)
elseif(APPLE)
    find_library(COCOA_LIBRARY Cocoa REQUIRED)
    find_library(IOKIT_LIBRARY IOKit REQUIRED)
    find_library(COREFOUNDATION_LIBRARY CoreFoundation REQUIRED)
    find_library(COREGRAPHICS_LIBRARY CoreGraphics REQUIRED)
    set(PLATFORM_LIBS glfw ${COCOA_LIBRARY} ${IOKIT_LIBRARY} ${COREFOUNDATION_LIBRARY} ${COREGRAPHICS_LIBRARY})
elseif(UNIX)
    find_package(X11 REQUIRED)
    set(PLATFORM_LIBS glfw X11 GL)
endif()

# Add executable
add_executable(TabbedBrowser ${SOURCES} ${IMGUI_SOURCES})

# Link libraries
target_include_directories(TabbedBrowser PRIVATE thirdparty/imgui)
target_include_directories(TabbedBrowser PRIVATE thirdparty/imgui/backends)
target_link_libraries(TabbedBrowser PRIVATE ${PLATFORM_LIBS})

# Include directories
if(APPLE)
    include_directories(/usr/local/include)
    link_directories(/usr/local/lib)
endif()

I spent some time today doing support for Freebird, Puppetry and Easy Diffusion. Identified a bug in Freebird (bone axis gizmos aren't scaling correctly in VR), got annoyed by how little documentation I've written for Puppetry's scripting API, and got reminded about how annoying it is for Easy Diffusion to force-download the poor quality starter model (stock SD 1.4) during installation.

The majority of the day was spent in using a local LLM for classifying emails. I get a lot of repetitive emails for FindStarlink - people telling me whether they saw Starlink or not (using the predictions on the website). The first part of my reply is always a boilerplate "Glad you saw it" or "Sorry about that", followed by email-specific replies. I'd really like the system to auto-fill the first part of the email, if it's a report about Starlink sighting.

Is classification even necessary or correct?

Typing/pasting the first part of the reply tens of times a day is quite cumbersome. On occasion I've received over a hundred emails in a single day, and on a particularly peaky day, 1000 emails. So user empathy is one aspect, but I'm also a single human being.

I could of course just remove the entire email aspect, and make it a faceless "Yes/No" button on the website (to confirm Starlink sightings). But I think it would reduce the accuracy of the sighting reports, and would also make the site a colder place.

I understand that using an email classifier to auto-insert a reply is also cold, but it's still me replying manually after reading the mail. It's more like an auto-complete on steroids, not an auto-responder.

Classifier details

The Llama 3.1 8B model was pretty accurate. I tested it against a dump of 500 emails that I've labeled in the past as "fail", "success", and "other".

The first message was sent with a "user" role: "Here is an email. It reports whether the sender saw the Starlink train of satellites.", followed by the email subject and contents (plaintext).

I also included three messages with a "system" role, with more hints about classifying the emails, and restricting the output to the three labels only.

I also tried with the Llama 3.2 1B model, but it was significantly poorer in accuracy.

Here's the rough script:

import requests

OPENAI_API_HOST = "http://localhost:1234"
MODEL_NAME = "Llama-3.1-8B-Lexi-Uncensored-V2-GGUF"

LABELS = ["success", "fail", "other"]

CLASSIFY_PROMPT = "Here is an email. It reports whether the sender saw the Starlink train of satellites."

SYSTEM_MESSAGE1 = {
    "role": "system",
    "content": 'Reply with "fail" if the user failed to see Starlink, reply with "success" if the user successfully saw Starlink. Otherwise reply with "other".',
}
SYSTEM_MESSAGE2 = {
    "role": "system",
    "content": 'Understand the sentiment of the email. Sometimes the sender will describe an unrelated topic or ask an unrelated question. Classify such email as "other". Sometimes the user will say that a previous viewing was amazing, or that they can confirm seeing it or that they saw it (the satellites) or that it was visible or that the timings were spot on or correct or that it works well, classify those as "success". Sometimes they will say did not see or don\'t see or was not visible or that it was a let down or generally a negative experience, classify those as "fail".',
}
SYSTEM_MESSAGE3 = {"role": "system", "content": 'reply only with "fail", "success" or "other"'}


def classify(text):
    message = {"role": "user", "content": f"{CLASSIFY_PROMPT}\n\n{text}"}

    response = requests.post(
        f"{OPENAI_API_HOST}/v1/chat/completions",
        json={"model": MODEL_NAME, "messages": [message, SYSTEM_MESSAGE1, SYSTEM_MESSAGE2, SYSTEM_MESSAGE3]},
    )
    if response.status_code != 200:
        raise RuntimeError(f"Unexpected response from server. Status code: {response.status_code}:", response.text)

    response = response.json()
    response = response["choices"][0]["message"]["content"].lower().strip()

    if response not in LABELS:
        print("---")
        print(f"Unexpected label: {response} for {text}")
        print("---")
        response = "other"

    return response

Next step

Now I'd like to fetch the latest emails for the starlink email address, and then run the classifier, and have it save a draft reply for each email if it is classified as success or fail. I'm okay with running this script on my PC, maybe once every morning.

I tried writing a plugin for Thunderbird, and to be honest I didn't enjoy the experience. The API and developer tooling are quite nice, but Thunderbird is really slow and flaky at raising the 'new email' event handler. And the UI would freeze occasionally. I spent an entire afternoon fighting the system, and didn't even get to link the local LLM to it.

Plus I'd like a simpler UI, which presents a card layout of all the emails that need my attention, and a simple textbox under it showing the proposed reply. If I agree, I can type any additional content and/or press Send. Or I can click a button to pick the correct classification, and do the same. Reducing my effort is important to me.

So I also wrote a simple script that talks to GMail directly (using Google's python library), and got it to fetch my emails. This could be expanded to build such a UI for myself, since I don't need a general-purpose email client (GMail's web interface is enough for me).

In general, I'm surprised that we don't have programmable email clients, or email clients that classify emails for you etc. Like I'd expect GMail to at least offer this as a feature, with all the talent and compute Google has. Maybe they're working on this?

For now, I'm probably going to park this for a while.