Wiki source code of GPU Server
Version 12.1 by Thomas Coelho (local) on 2024/06/07 10:12
author | version | line-number | content |
---|---|---|---|
1.1 | 1 | {{box cssClass="floatinginfobox" title="**Contents**"}} |
2 | {{toc/}} | ||
3 | {{/box}} | ||
4 | |||
5 | = The GPU Server = | ||
6 | |||
8.1 | 7 | The GPU machine is a two-socket server with AMD EPYC 7313 processors. Each processor has 16 cores and, with SMT enabled, 32 threads. The machine comes with 512 GB of memory and 2 x 4 TB U.3 (NVMe) SSDs as fast storage. There are **8 AMD Instinct MI50** GPU cards for computing. |
1.1 | 8 | |
9 | Access is provided through Slurm via the separate partition "gpu". | ||
10 | |||
10.1 | 11 | AMD ROCm is installed as the software stack. It supports the ROCm (HIP) and OpenCL interfaces. The current ROCm stack is version 6.1. ROCm is also packaged in Ubuntu. |
1.1 | 12 | |
13 | (% class="box infomessage" %) | ||
14 | ((( | ||
15 | Because GPU computing is a new discipline, we can only provide limited information here. If you have something to share, please feel free to edit this page. | ||
16 | ))) | ||
17 | |||
12.1 | 18 | {{warning}} |
19 | To get built-in ROCm support in Slurm, this machine has already been updated to Ubuntu 24.04. The distribution includes some parts of the ROCm stack, in a mixture of versions 5.7 and 6.0. Official support from AMD for Ubuntu 24.04 is not yet available. PyTorch has been successfully tested with this setup. | ||
20 | {{/warning}} | ||
21 | |||
1.1 | 22 | == Submitting == |
23 | |||
24 | GPUs are handled as generic resources in Slurm (gres). | ||
25 | |||
26 | Each GPU is an allocatable item, and you can allocate up to 8 GPUs. To do so, add "~-~-gres=gpu:N", where N is the number of GPUs. | ||
27 | |||
28 | CPUs are handled as usual. | ||
29 | |||
30 | Example: interactive session with 2 GPUs: | ||
31 | |||
32 | {{code language="bash"}} | ||
33 | srun -p gpu --gres=gpu:2 --pty bash | ||
34 | {{/code}} | ||
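For scripted submissions, the same command line can be assembled programmatically. The following is a minimal Python sketch, not part of the site's tooling: the helper name gpu_srun_command and the constant MAX_GPUS are hypothetical, and the function only builds the command string without running it. It assumes the partition name "gpu" and the limit of 8 GPUs described above.

```python
import shlex

# Assumption taken from this page: the server offers at most 8 GPUs.
MAX_GPUS = 8


def gpu_srun_command(num_gpus, command=("bash",), partition="gpu", interactive=True):
    """Build an srun command line that requests num_gpus GPUs via gres.

    Hypothetical helper for illustration; it does not call Slurm itself.
    """
    if not 1 <= num_gpus <= MAX_GPUS:
        raise ValueError(f"num_gpus must be between 1 and {MAX_GPUS}, got {num_gpus}")

    args = ["srun", "-p", partition, f"--gres=gpu:{num_gpus}"]
    if interactive:
        # --pty attaches a pseudo terminal, as in the interactive example
        args.append("--pty")
    args.extend(command)

    # shlex.join quotes arguments where needed so the result is shell-safe
    return shlex.join(args)


if __name__ == "__main__":
    print(gpu_srun_command(2))
```

Called with 2 GPUs and the defaults, the sketch reproduces the interactive command shown above. Rejecting counts outside 1 to 8 early avoids submitting a request that the partition cannot satisfy.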
35 | |||
36 | == PyTorch == | ||
37 | |||
38 | A popular framework for machine learning is PyTorch. An up-to-date version with ROCm support must be installed with pip3 in a venv. | ||
39 | |||
40 | {{code language="bash"}} | ||
41 | python3 -m venv venv | ||
42 | . venv/bin/activate | ||
43 | {{/code}} | ||
44 | |||
45 | Install PyTorch: | ||
46 | |||
4.1 | 47 | {{code language="bash"}} |
9.1 | 48 | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0 |
49 | |||
50 | |||
4.1 | 51 | {{/code}} |
1.1 | 52 | |
11.1 | 53 | At the time of writing, PyTorch wheels are not available for ROCm 6.1. Please check the PyTorch website for updates. |
10.1 | 54 | |
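4.1 | 55 | You can test the installation with the following snippet. ROCm builds of PyTorch expose the GPUs through the torch.cuda interface, so it should print True inside a job that has a GPU allocated: |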
56 | |||
57 | {{code language="python"}} | ||
58 | import torch | ||
59 | |||
60 | print(torch.cuda.is_available()) | ||
61 | |||
62 | {{/code}} | ||
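A slightly more detailed check can list the visible devices and run a small computation on one of them. This is a minimal sketch under stated assumptions: the function name gpu_report is hypothetical, and the script is meant to be run inside a Slurm job that requested at least one GPU. It uses only the standard torch.cuda calls, which ROCm builds of PyTorch reuse for AMD GPUs.

```python
def gpu_report():
    """Return a list of lines describing the GPUs visible to PyTorch.

    Hypothetical helper for illustration only.
    """
    try:
        import torch
    except ImportError:
        return ["torch is not installed in this environment"]

    # torch.version.hip is set on ROCm builds and None on other builds
    lines = [f"HIP runtime: {getattr(torch.version, 'hip', None)}"]

    if not torch.cuda.is_available():
        lines.append("no GPU visible (was --gres=gpu:N requested?)")
        return lines

    for index in range(torch.cuda.device_count()):
        lines.append(f"device {index}: {torch.cuda.get_device_name(index)}")

    # Small matrix product on the first device to confirm that kernels run
    x = torch.rand(256, 256, device="cuda")
    lines.append(f"matmul ok, result shape {tuple((x @ x).shape)}")
    return lines


if __name__ == "__main__":
    print("\n".join(gpu_report()))
```

If fewer devices are listed than expected, the most likely cause is the number of GPUs requested with the gres option, since a job should only see the GPUs allocated to it.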
63 | |||
1.1 | 64 | == Links == |
65 | |||
2.2 | 66 | GPU Cards: [[https:~~/~~/www.amd.com/en/products/professional-graphics/instinct-mi50>>https://www.amd.com/en/products/professional-graphics/instinct-mi50]] |
67 | |||
1.1 | 68 | ROCm documentation: [[https:~~/~~/rocm.docs.amd.com/en/latest/rocm.html>>https://rocm.docs.amd.com/en/latest/rocm.html]] |
69 | |||
5.1 | 70 | PyTorch: [[https:~~/~~/pytorch.org/>>https://pytorch.org/]] |