Continued from Part 1.
Spent a few days figuring out how to compile binary wheels of PyTorch and include all the necessary libraries (ROCm libs or CUDA libs).
tl;dr - In Part 2, the compiled PyTorch wheels now include the required libraries (including ROCm). But this isn't over yet. Torch starts now, but adding two numbers with it on the GPU produces garbage values. There's probably a bug in the bundled rocBLAS version; I might need to recompile rocBLAS for gfx803 separately. Will tackle that in Part 3 (tbd).
Compilation
Here's the process:
1. Create a Docker instance for the required ROCm version, e.g. for PyTorch 1.13.1 with ROCm 5.7:
docker pull rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_1.13.1
2. Start the Docker instance: docker run -it YOUR_IMAGE_ID bash
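To find YOUR_IMAGE_ID, you can list the pulled images first (this just filters docker images by the tag pulled above; adjust the pattern if you used a different tag):
docker images | grep rocm5.7_ubuntu20.04_py3.9_pytorch_1.13.1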
3. (Optional) Install uv. After installing it, source the env file for uv (the installer will tell you to do so).
curl -LsSf https://astral.sh/uv/install.sh | sh
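The exact file to source depends on the uv version; recent installers typically place it at $HOME/.local/bin/env (older ones used $HOME/.cargo/env), and the installer prints the exact command, so go by what it says. For example:
source $HOME/.local/bin/env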
4. (Optional) Create a venv (for using a different python version):
cd ~
uv venv --python 3.10 torch-gfx803
cd torch-gfx803
source bin/activate
5. Link the required projects, and clone the pytorch/builder repository (builder is no longer used for PyTorch 2.4+, but I need it for PyTorch 1.13):
ln -s /var/lib/jenkins/pytorch pytorch
git clone https://github.com/pytorch/builder.git
6. Fix build-specific issues for ROCm 5.7:
cd pytorch
ln -s /usr/bin/patchelf /usr/local/bin/patchelf
mkdir -p .ci/docker/ci_commit_pins
echo "34f8189eae57a23cc15b4b4f032fe25757e0db8e" > .ci/docker/ci_commit_pins/triton-rocm.txt
echo "2.1.0" > .ci/docker/triton_version.txt
7. Edit ~/torch-gfx803/builder/manywheel/build_rocm.sh and comment out the entire triton-related block at the end (where it appends the +${TRITON_SHORTHASH} suffix), as illustrated below.
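For reference, the block to comment out looks roughly like this; the exact lines vary between builder commits, so treat it as an illustration of what to look for, not a verbatim copy:
# (illustration) the triton block near the end of build_rocm.sh, commented out:
# TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-rocm.txt)
# export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}"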
8. Install ccache:
sudo apt install ccache
9. Set the required env variables (you can also save these in a file and source it instead, for convenience):
export HSA_OVERRIDE_GFX_VERSION="8.0.3"
export ROC_ENABLE_PRE_VEGA="1"
export PYTORCH_ROCM_ARCH="gfx803"
export ROCM_ARCH="gfx803"
export TORCH_BLAS_PREFER_HIPBLASLT="0"
export USE_CUDA="0"
export USE_ROCM="1"
export USE_LMDB="1"
export USE_OPENCV="1"
export USE_MKLDNN="0"
export USE_MPI="0"
export USE_NINJA="1"
export BLAS="Eigen"
export FORCE_CUDA="1"
export DESIRED_CUDA="rocm5.7"
export DESIRED_PYTHON="3.10"
export PYTORCH_FINAL_PACKAGE_DIR="/root/torch-gfx803/wheels"
export PYTORCH_ROOT="/root/torch-gfx803/pytorch"
export PYTORCH_BUILD_VERSION=1.13.0 PYTORCH_BUILD_NUMBER=1
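For example, you can dump the exports above into a file (the filename here is just an example) and source it in each new shell:
# save the exports above as ~/torch-gfx803/env-gfx803.sh, then:
source ~/torch-gfx803/env-gfx803.sh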
10. Build PyTorch:
~/torch-gfx803/builder/manywheel/build_rocm.sh
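If the build succeeds, the wheel lands in the PYTORCH_FINAL_PACKAGE_DIR set above. A rough sketch for getting it onto the host and installing it there (CONTAINER_ID comes from docker ps; the exact wheel filename depends on the build):
ls /root/torch-gfx803/wheels/   # inside the container
docker cp CONTAINER_ID:/root/torch-gfx803/wheels ./wheels   # on the host
pip install ./wheels/torch-*.whl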
Post-installation
For now, the user needs to install OpenMPI and MKL on their PC before they can use this wheel. I'm not sure how to remove this requirement, since the official PyTorch 1.13.0+rocm5.2 wheel neither includes nor needs these libraries; maybe the dependency was introduced somewhere between ROCm 5.2 and 5.7.
To install these libraries, the user needs to run these commands on their PC:
OpenMPI:
sudo apt install libopenmpi-dev
MKL:
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-oneapi-mkl-devel=2024.0.0-49656
export LD_LIBRARY_PATH=/opt/intel/oneapi/mkl/2024.0/lib:$LD_LIBRARY_PATH
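A quick sanity check that the OpenMPI and MKL dependencies actually resolve (assuming the wheel is already installed) is to run ldd on torch's main shared library and make sure nothing shows up as "not found":
ldd "$(python -c 'import os, torch; print(os.path.join(os.path.dirname(torch.__file__), "lib", "libtorch_cpu.so"))')" | grep -iE 'mpi|mkl|not found'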
But does it work?
Sort of. It now loads, i.e. we can run import torch and it works. But it produces garbage results if we add two numbers with it. CPU math works, but GPU math fails. For example:
>>> import torch
>>> cpu_x = torch.tensor([0, 1, 2])
>>> rocm_x = torch.tensor([0, 1, 2], device='cuda:0')
>>> cpu_x + 10
tensor([10, 11, 12]) # correct
>>> rocm_x + 10
tensor([ 4492320119074422909, -88335127951390314, -7107455620438441222], device='cuda:0') # <---- garbage values
There's probably a bug in the bundled rocBLAS version; I might need to recompile rocBLAS for gfx803 separately. I could also write a simple addition program ([1], [2]) that uses rocBLAS, to test it.
Will tackle that in Part 3 (tbd).