Wiki source code of GPU Server

Version 1.1 by Thomas Coelho (local) on 2023/11/06 10:43

{{box cssClass="floatinginfobox" title="**Contents**"}}
{{toc/}}
{{/box}}

= The GPU Server =
6
The GPU machine is a two-socket server with AMD EPYC 7313 processors. Each processor has 16 cores with SMT enabled (32 threads). The server comes with 512 GB of memory and 2 x 4 TB U.3 (NVMe) SSDs as fast storage. There are **8 AMD Instinct MI50** GPU cards for computing.
8
Access is granted via Slurm through the separate partition "gpu".
10
AMD ROCm is installed as the software stack. It supports the ROCm and OpenCL interfaces.
12
(% class="box infomessage" %)
(((
Because GPU computing is a new discipline, we can only provide limited information here. If you have something to share, please feel free to edit this page.
)))
17
== Submitting ==
19
GPUs are handled as generic resources (gres) in Slurm.
21
Each GPU is handled as an allocatable item. You can allocate up to 8 GPUs by adding "~-~-gres=gpu:N" to your job request, where N is the number of GPUs.
23
CPUs are handled as usual.
25
Example: interactive session with 2 GPUs:
27
{{code language="bash"}}
srun -p gpu --gres=gpu:2 --pty bash
{{/code}}
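
The same allocation works in a batch script. A minimal sketch, assuming typical Slurm defaults on this cluster (the CPU count and time limit are placeholders to adapt to your workload):

{{code language="bash"}}
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2        # request 2 of the 8 MI50 GPUs
#SBATCH --cpus-per-task=8   # CPUs are requested as usual (placeholder value)
#SBATCH --time=01:00:00     # placeholder time limit

# rocm-smi (part of the ROCm stack) lists the GPUs visible to the job
rocm-smi

# launch your GPU program here
{{/code}}

Submit it with `sbatch jobscript.sh`; Slurm restricts the job to the requested GPUs.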
31
== PyTorch ==
33
A popular framework for machine learning is PyTorch. An up-to-date version with ROCm support must be installed with pip3 in a venv.
35
{{code language="bash"}}
python3 -m venv venv
. venv/bin/activate
{{/code}}
40
Install PyTorch:
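The ROCm builds of PyTorch come from a dedicated wheel index. A sketch of the install, assuming the ROCm 6.2 wheel index from the official PyTorch download site (adapt the index URL to the ROCm release actually installed on the server):

{{code language="bash"}}
# install a ROCm build of PyTorch into the activated venv
# (rocm6.2 in the index URL is an assumption -- match it to the server's ROCm version)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2

# quick check: ROCm builds expose the GPUs through the CUDA-compatible API
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
{{/code}}

Run the check inside a GPU job (e.g. the interactive session above), since no GPUs are visible on the login node.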
== Links ==
45
ROCm documentation: [[https:~~/~~/rocm.docs.amd.com/en/latest/rocm.html>>https://rocm.docs.amd.com/en/latest/rocm.html]]