Wiki source code of GPU Server

Version 1.1 by Thomas Coelho (local) on 2023/11/06 10:43

{{box cssClass="floatinginfobox" title="**Contents**"}}
{{toc/}}
{{/box}}

= The GPU Server =
6
The GPU machine is a two-socket server with AMD EPYC 7313 processors. Each processor has 16 cores with SMT enabled (32 threads). The server comes with 512 GB of memory and 2 x 4 TB U.3 (NVMe) SSDs as fast storage. There are **8 AMD Instinct MI50** GPU cards for computing.
8
Access is granted via Slurm through the separate partition "gpu".
10
AMD ROCm is installed as the software stack. It supports the ROCm and OpenCL interfaces.
12
(% class="box infomessage" %)
(((
Because GPU computing is a new discipline, we can only provide limited information here. If you have something to share, please feel free to edit this page.
)))
17
== Submitting ==
19
GPUs are handled as generic resources (gres) in Slurm.
21
Each GPU is handled as an allocatable item. You can allocate up to 8 GPUs by adding "~-~-gres=gpu:N" to your job request, where N is the number of GPUs.
23
CPUs are handled as usual.
25
Example: interactive session with 2 GPUs:
27
{{code language="bash"}}
srun -p gpu --gres=gpu:2 --pty bash
{{/code}}
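
The same allocation works in a batch script. A minimal sketch, assuming typical Slurm defaults on this cluster (the CPU count and time limit are placeholders to adapt to your workload):

{{code language="bash"}}
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2        # request 2 of the 8 MI50 GPUs
#SBATCH --cpus-per-task=8   # CPUs are requested as usual (placeholder value)
#SBATCH --time=01:00:00     # placeholder time limit

# rocm-smi (part of the ROCm stack) lists the GPUs visible to the job
rocm-smi

# launch your GPU program here
{{/code}}

Submit it with `sbatch jobscript.sh`; Slurm restricts the job to the requested GPUs.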
31
== PyTorch ==
33
A popular framework for machine learning is PyTorch. An up-to-date version with ROCm support must be installed with pip3 in a venv.
35
{{code language="bash"}}
python3 -m venv venv
. venv/bin/activate
{{/code}}
40
Install PyTorch:
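The ROCm builds of PyTorch come from a dedicated wheel index. A sketch of the install, assuming the ROCm 6.2 wheel index from the official PyTorch download site (adapt the index URL to the ROCm release actually installed on the server):

{{code language="bash"}}
# install a ROCm build of PyTorch into the activated venv
# (rocm6.2 in the index URL is an assumption -- match it to the server's ROCm version)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2

# quick check: ROCm builds expose the GPUs through the CUDA-compatible API
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
{{/code}}

Run the check inside a GPU job (e.g. the interactive session above), since no GPUs are visible on the login node.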
== Links ==
45
ROCm documentation: [[https:~~/~~/rocm.docs.amd.com/en/latest/rocm.html>>https://rocm.docs.amd.com/en/latest/rocm.html]]