GPU Server
The GPU machine is a two-socket server with AMD EPYC 7313 processors. Each processor has 16 cores with SMT enabled (32 threads). The server comes with 512 GB of memory and 2 x 4 TB U.3 (NVMe) SSDs as fast storage. There are 8 AMD Instinct MI50 GPU cards for computing.
Access is provided through SLURM via the separate partition "gpu".
AMD ROCm is installed as the software stack; it provides the ROCm and OpenCL interfaces. The current ROCm stack is version 6.2.1.
Submitting
GPUs are handled as generic resources in Slurm (gres).
Each GPU is handled as an allocatable item, and you can allocate up to 8 GPUs. You do this by adding "--gres=gpu:N", where N is the number of GPUs.
CPUs are handled as usual.
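For batch jobs, the GPU request combines with the usual CPU and memory options. A minimal sketch of a job script; the resource sizes and the application name are placeholders, not site defaults:

```shell
#!/bin/bash
#SBATCH --partition=gpu       # the separate GPU partition
#SBATCH --gres=gpu:2          # request 2 of the 8 GPUs
#SBATCH --cpus-per-task=8     # CPUs are requested as usual
#SBATCH --mem=64G             # example memory request

srun ./my_gpu_program         # placeholder for your application
```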
Example: interactive session with 2 GPUs:
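A sketch of such an allocation; exact defaults (time limit, memory) are site-specific:

```shell
# Start an interactive shell on the "gpu" partition with 2 GPUs allocated.
srun --partition=gpu --gres=gpu:2 --pty bash

# Inside the session, rocm-smi (part of the ROCm stack) lists the visible GPUs.
rocm-smi
```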
PyTorch
A popular framework for machine learning is PyTorch. An up-to-date version with ROCm support must be installed with pip3 inside a Python virtual environment (venv). Create and activate it with:
python3 -m venv venv
. venv/bin/activate
Install PyTorch:
At the time of writing, it is not available for ROCm 6.1. Please check the PyTorch website for updates.
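A sketch of the install command, assuming the ROCm wheel index published on the PyTorch website (the index URL and ROCm version tag below are an assumption; check the PyTorch site for the current one):

```shell
# Install a ROCm build of PyTorch from the PyTorch wheel index
# (run inside the activated venv; URL/version are assumptions).
pip3 install torch --index-url https://download.pytorch.org/whl/rocm6.1
```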
You can test the installation with:
import torch
print(torch.cuda.is_available())
On ROCm builds of PyTorch, the GPUs are exposed through the torch.cuda API, so this prints True when a usable device is found.
Links
GPU Cards: https://www.amd.com/en/products/professional-graphics/instinct-mi50
ROCm documentation: https://rocm.docs.amd.com/en/latest/rocm.html
Pytorch: https://pytorch.org/