Good tutorial for understanding the basics of CUDA: https://www.pyspur.dev/blog/introduction_cuda_programming. It also links to NVIDIA's simple tutorial.
Implemented a simple float16 addition kernel in CUDA at https://github.com/cmdr2/study/blob/main/ml/cuda/half_add.cu. Compile it using `nvcc -o half_add half_add.cu`.
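For reference, here is a minimal sketch of what a float16 addition kernel can look like. It is not a copy of the linked `half_add.cu`; the names `half_add_kernel`, `add_half_on_gpu`, and `check` are hypothetical and chosen only for illustration. The sketch assumes a CUDA toolkit recent enough that `__float2half` and `__half2float` can be called from host code, and it adds in float32 inside the kernel so that it does not depend on native half arithmetic (intrinsics such as `__hadd` need a GPU architecture that supports them).

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

// Element-wise c[i] = a[i] + b[i] for float16 inputs.
// Each thread handles one element.
__global__ void half_add_kernel(const __half* a, const __half* b, __half* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Convert to float, add, convert back to half.
        c[i] = __float2half(__half2float(a[i]) + __half2float(b[i]));
    }
}

// Hypothetical helper: turn a CUDA error code into an exception.
static void check(cudaError_t err) {
    if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
}

// Hypothetical host wrapper: converts float inputs to half, runs the
// kernel on the GPU, and returns the results converted back to float.
std::vector<float> add_half_on_gpu(const std::vector<float>& a,
                                   const std::vector<float>& b) {
    if (a.size() != b.size()) throw std::invalid_argument("size mismatch");
    const int n = static_cast<int>(a.size());
    if (n == 0) return {};

    // Host-side conversion to float16.
    std::vector<__half> ha(n), hb(n), hc(n);
    for (int i = 0; i < n; ++i) {
        ha[i] = __float2half(a[i]);
        hb[i] = __float2half(b[i]);
    }

    const size_t bytes = n * sizeof(__half);
    __half *da = nullptr, *db = nullptr, *dc = nullptr;
    check(cudaMalloc(&da, bytes));
    check(cudaMalloc(&db, bytes));
    check(cudaMalloc(&dc, bytes));
    check(cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice));
    check(cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice));

    // Round the block count up so every element gets a thread.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    half_add_kernel<<<blocks, threads>>>(da, db, dc, n);
    check(cudaGetLastError());

    // This copy also waits for the kernel to finish.
    check(cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost));
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);

    std::vector<float> out(n);
    for (int i = 0; i < n; ++i) out[i] = __half2float(hc[i]);
    return out;
}
```

To try it, call `add_half_on_gpu` from your own `main` (for example with `{1.0f, 2.5f}` and `{2.0f, 0.5f}`) and compile the file with `nvcc` as above. Inputs that are not exactly representable in float16 will come back rounded, since both the conversion and the stored result use half precision.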