A simple introduction to ggml.
// This is Part 1 in a series on ggml. You can read Part 2 after this one.
This post uses the new "backend" API in ggml. I wrote this to explain ggml to myself. I'm still learning about it, so please feel free to suggest any corrections!
Overall flow of a ggml program
At a very high level, a ggml program has the following steps:
1. Define the tensor variables
2. Define the computation graph
3. Allocate memory for the tensor variables, and assign the data
4. Run the computation, and read the result
Let's explore each step briefly:
1. Define the tensor variables: We'll define the data type and shape of each tensor variable that we need. For example, a 32-bit float tensor with shape (2, 4).
2. Define the computation graph: This is a fancy way of saying that we'll specify the operations that'll be performed on the tensor variables. For example, if x, y and z are three tensor variables, then the computation (x * y) + z can be represented as add(mul(x, y), z).
3. Allocate memory for the tensor variables, and assign the data: We'll ask ggml to allocate memory (on the backend) for all the tensors. ggml will go through the computation graph, and allocate memory for each tensor used in the graph. After that, we'll copy the data of each tensor to its allocated memory.
4. Run the computation, and read the result: We'll ask ggml to run the sequence of operations defined in step 2. We'll then read the result from the final output tensor (or any step of the computation).
Implementing this in ggml
Now let's see how we can implement these steps in ggml.
Define the tensor variables
We'll create tensor variables in ggml using ggml_new_tensor_1d(), ggml_new_tensor_2d(), ggml_new_tensor_3d() or ggml_new_tensor_4d(). These functions create a 1-, 2-, 3- or 4-dimensional tensor, respectively. We'll pass the type, e.g. GGML_TYPE_F32 (32-bit float), as well as the shape of the tensor, e.g. 2, 4.
Note: At the moment, ggml does not support dimensions higher than 4, since that's typically what's used in machine learning programs.
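For example, the 32-bit float tensor with shape (2, 4) from step 1 could be created like this (a minimal sketch; ctx is a ggml context, which we'll cover later, and t is just an illustrative name):
// a 2-dimensional 32-bit float tensor with shape (2, 4)
ggml_tensor* t = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 4);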
Define the computation graph
We'll define the computation graph by chaining together different operator functions, like ggml_add(), ggml_mul(), ggml_soft_max(), etc. We can take the output of one function, and pass that as the input to another operator function.
We'll then pass the last tensor variable (e.g. "result") in our operator chain to ggml_build_forward_expand(). This tells ggml to build the graph for the "forward" direction (i.e. for inference, not training). For book-keeping, we also need to create a graph object using ggml_new_graph(), which will represent our computation graph.
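As a sketch, chaining two operators together could look like this (assuming x and y are previously created 32-bit float tensors of the same shape, and ctx is the same context as before):
// chain two operators: soft_max(x * y)
ggml_tensor* out = ggml_soft_max(ctx, ggml_mul(ctx, x, y));
// book-keeping: create the graph object and build the forward graph ending at "out"
ggml_cgraph* gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, out);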
Allocate memory for the tensor variables, and assign the data
We'll create a memory allocator for the graph using ggml_gallocr_new(), and then call ggml_gallocr_alloc_graph(). This will go through the entire computation graph, and allocate memory (on the backend) for each tensor in the graph.
We'll then copy the data for each tensor to its allocated memory, using ggml_backend_tensor_set().
Run the computation, and read the result
We'll call ggml_backend_graph_compute() to run the computation on the backend. After that, we'll get a reference to the last tensor in the graph using ggml_graph_node(), and read its data using ggml_backend_tensor_get().
Putting these together
Let's write some code for each block. In this example, we'll try to add three tensors: [1, 2, 3], [10, 20, 30], and [100, 200, 300].
Define the tensor variables
ggml_tensor* a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 3);
ggml_tensor* b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 3);
ggml_tensor* c = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 3);
This will create three tensors a, b and c, each a 1-dimensional 32-bit float tensor with 3 elements. Don't worry about the ctx variable right now.
Define the computation graph
ggml_tensor* result = ggml_add(ctx, a, ggml_add(ctx, b, c));
ggml_cgraph* gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, result);
This computation graph effectively represents: a + (b + c).
Allocate memory for the tensor variables, and assign the data
ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
ggml_gallocr_alloc_graph(allocr, gf);
std::vector<float> a_data = {1, 2, 3};
std::vector<float> b_data = {10, 20, 30};
std::vector<float> c_data = {100, 200, 300};
ggml_backend_tensor_set(a, a_data.data(), 0, ggml_nbytes(a));
ggml_backend_tensor_set(b, b_data.data(), 0, ggml_nbytes(b));
ggml_backend_tensor_set(c, c_data.data(), 0, ggml_nbytes(c));
This will allocate memory and assign the data to the tensors (on the backend). Don't worry about the backend variable right now.
Run the computation, and read the result
ggml_backend_graph_compute(backend, gf);
// get the last node in the graph
ggml_tensor* result_node = ggml_graph_node(gf, -1);
// create an array to store the result data
int n = ggml_nelements(result_node);
std::vector<float> result_data(n);
// copy the data from the backend memory into the result array
ggml_backend_tensor_get(result_node, result_data.data(), 0, ggml_nbytes(result_node));
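If you print result_data at this point, you should see the element-wise sum of the three input tensors (printf here requires <cstdio>):
// expected output: 111, 222, 333
for (float v : result_data) {
    printf("%f\n", v);
}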
A complete working example for this is at simple_addition.cpp. Instructions for compiling it are at the top of the file.
What's a context? What's a backend?
A context keeps references to the data structures involved in the program. It is created using ggml_init(), and can optionally allocate memory. For example, a context can hold references to long-lived ggml_tensor objects, or to ggml_cgraph objects.
A backend refers to the device on which the computations will run, e.g. a CUDA GPU or the CPU. The backend object is created using ggml_backend_cuda_init() or ggml_backend_cpu_init() (or similar functions for other backend types).
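For illustration, here's roughly how the ctx and backend variables used in the snippets above can be set up. This is a sketch: the mem_size reserved here is an arbitrary amount of metadata space, and no_alloc = true means the actual tensor data will be allocated by the backend, not the context.
// create a context that only holds tensor/graph metadata
ggml_init_params params = {
    /*.mem_size   =*/ ggml_tensor_overhead() * 128 + ggml_graph_overhead(),
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,  // tensor data will live in backend buffers
};
ggml_context* ctx = ggml_init(params);
// pick a backend: the CPU here, or e.g. ggml_backend_cuda_init(0) for the first CUDA GPU
ggml_backend_t backend = ggml_backend_cpu_init();
// when done: ggml_free(ctx); ggml_backend_free(backend);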
Bonus: Keeping model weights in memory, across multiple inference runs
In a typical machine learning program, we load the model weights at the beginning, and run inference on those weights repeatedly. We usually don't re-allocate the memory (or reload the data) for model weights each time we want to run inference computations on them.
So let's modify our steps a bit to handle this scenario.
Note: This is purely an optimization, and it is completely optional.
Let's add two steps to the beginning:
1. [new] Define the tensor variables required for model weights
2. [new] Allocate memory for the model weight tensors, and assign the model data
The rest of the steps will remain unchanged. This ensures that the model weights remain in memory across multiple inference computations.
Define the tensor variables required for model weights
ggml_tensor* weight = ggml_new_tensor_1d(ctx_weights, GGML_TYPE_F32, 3);
Allocate memory for the model weight tensors, and assign the model data
ggml_backend_alloc_ctx_tensors(ctx_weights, backend);
std::vector<float> weight_data = {0, 1, 2};
ggml_backend_tensor_set(weight, weight_data.data(), 0, ggml_nbytes(weight));
We'll use a different function to allocate the memory (ggml_backend_alloc_ctx_tensors()), since we won't have a computation graph at this point.
We can now use the weight tensor in our computation graph, without having to allocate or assign its data in each inference run, as sketched below.
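For example, each inference run might build its graph in a separate, short-lived context while reusing the already allocated weight tensor. This is a sketch; ctx_compute is a hypothetical per-run context, set up like ctx earlier.
// per run: create a fresh input tensor and combine it with the static weight
ggml_tensor* input  = ggml_new_tensor_1d(ctx_compute, GGML_TYPE_F32, 3);
ggml_tensor* result = ggml_add(ctx_compute, input, weight);
ggml_cgraph* gf = ggml_new_graph(ctx_compute);
ggml_build_forward_expand(gf, result);
// ...then allocate the graph, set input's data, and compute as before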
A complete working example with this modification is at simple_addition_with_static_weights.cpp. Instructions for compiling it are at the top of the file.
In Part 2, we'll explore how to build a simple Neural Network using ggml!