SLURM
SLURM (the Simple Linux Utility for Resource Management) is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

Slurm is fully integrated into our system. You do not need to set any environment variables.


Partitions

A partition is a subset of the cluster, a bundle of compute nodes with the same characteristics.

Based on access restrictions, our cluster is divided into different partitions. 'sinfo' will only show the partitions you are allowed to use; 'sinfo -a' shows all partitions.

A partition is selected by '-p PARTITIONNAME'.
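
For example (itp is one of the common-use partitions from the table below):

  sinfo          # show the partitions you are allowed to use
  sinfo -a       # show all partitions
  sinfo -p itp   # show only the itp partition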

Partition | No. Nodes | Cores/Node | Tot. Cores | RAM/GB  | CPU                                        | Remark/Restriction
itp       | 10        | 12         | 120        | 32      | Six-Core AMD Opteron(tm) Processor 2427    |
itp-big   | 3         | 48         | 144        | 128     | AMD Opteron(tm) Processor 6172             |
dfg-big   | 3         | 32         | 96         | 128     | 8-Core AMD Opteron(tm) Processor 6128      | Group Valenti
dfg-big   | 3         | 48         | 144        | 128/256 | 12-Core AMD Opteron(tm) Processor 6168     | Group Valenti
dfg-big   | 4         | 64         | 256        | 128/256 | 16-Core AMD Opteron(tm) Processor 6272     | Group Valenti
dfg-big   | 4         | 48         | 192        | 128/256 | 12-Core AMD Opteron(tm) Processor 6344     | Group Valenti
fplo      | 2         | 12         | 24         | 256     | Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz  | Group Valenti
fplo      | 4         | 16         | 32         | 256     | Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz  | Group Valenti
dfg-xeon  | 5         | 16         | 32         | 128     | Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz  | Group Valenti
dfg-xeon  | 7         | 20         | 140        | 128     | Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz  | Group Valenti
iboga     | 44        | 20         | 880        | 64      | Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz  | Group Rezzolla
dreama    | 1         | 40         | 40         | 1024    | Intel(R) Xeon(R) CPU E7-4820 v3 @ 1.90GHz  | Group Rezzolla
barcelona | 8         | 40         | 320        | 192     | Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz   | Group Valenti

Most nodes are for exclusive use by their corresponding owners; the itp and itp-big nodes are for common use. Except for the 'fplo' and 'dfg-big' nodes, all machines are connected with Infiniband, which carries all traffic (IP and inter-node communication such as MPI).

Submitting Jobs

In most cases you want to submit a non-interactive job to be executed on our cluster.

This is very simple for serial (1 CPU) jobs:

  sbatch -p PARTITION jobscript.sh

where jobscript.sh is a shell script with your job commands.
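
A minimal jobscript might look like this (./my_program is a placeholder for your own executable):

#! /bin/bash
# jobscript.sh - serial example: run one program on a single CPU
./my_program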

Running openMPI jobs is not much more complicated:

  sbatch -p PARTITION -n X jobscript.sh

where X is the number of desired MPI processes. Launch the job in the jobscript with:

  mpirun YOUREXECUTABLE

You don't have to worry about passing the number of processes or specific nodes to mpirun; Slurm and openMPI know about each other.
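
A minimal MPI jobscript might look like this (./my_mpi_program is a placeholder); mpirun takes the process count and node list from Slurm:

#! /bin/bash
# jobscript.sh - MPI example: no -np needed, Slurm provides the allocation
mpirun ./my_mpi_program

Submit it, for example, with 'sbatch -p PARTITION -n 16 jobscript.sh' to get 16 MPI processes.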

If you want Infiniband for your MPI job (which is usually a good idea unless all processes run on the same node), you have to request the feature 'infiniband':

  sbatch -p dfg -C infiniband -n X jobscript.sh

Note: Infiniband is not available for 'fplo' and 'dfg-big'.

Running SMP jobs (multiple threads, not necessarily MPI): running MPI jobs on a single node is recommended for the dfg-big nodes. These are big hosts with up to 64 CPUs per node, but only a 'slow' Gigabit network connection. Launch SMP jobs with:

  sbatch -p PARTITION -N 1 -n X jobscript.sh
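
A sketch of an SMP jobscript for an OpenMP program (./my_openmp_program is a placeholder); the thread count is taken from the number of tasks allocated with -n:

#! /bin/bash
# jobscript.sh - SMP example: run as many threads as tasks were allocated
export OMP_NUM_THREADS=$SLURM_NTASKS
./my_openmp_program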

Differences in the network connection

The new v3 dfg-xeon nodes are equipped with a 10 Gbit/s Ethernet network. This is faster (in throughput) and has lower latency than Gigabit Ethernet, but is not as fast as the DDR Infiniband network. The 10 Gbit/s network is used for MPI and I/O; Infiniband is only used for MPI.

Defining Resource Limits

By default each job allocates 2 GB of memory and a run time of 3 days. More resources can be requested with:

  --mem-per-cpu=<MB>

where <MB> is the memory in megabytes. The virtual memory limit is 2.5 times the requested real memory limit.

The memory limit is not a hard limit: when you exceed it, your memory will be swapped out, and only when you use more than 150% of the limit will your job be killed. Still, be conservative to keep enough room for other jobs, because requested memory is blocked from use by other jobs.
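
For example, a job submitted with --mem-per-cpu=4000 gets a virtual memory limit of 10000 MB (2.5 × 4000 MB), starts swapping above 4000 MB, and is only killed above 6000 MB (150% of the limit).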

The run time limit is requested with

  -t <time> or --time=<time>

where <time> can be given in the format "days-hours". See the sbatch man page for more formats.
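
For example, to request 4000 MB per CPU and a run time of five days (the values are only an illustration):

  sbatch -p PARTITION --mem-per-cpu=4000 --time=5-0 jobscript.sh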


Memory Management

In Slurm you specify only one memory parameter, which is the limit for your real memory usage and drives the decision where your job is started. The virtual memory of your job may be up to 2.5 times your requested memory. You can exceed your memory limit by 20%, but the excess will be swap space instead of real memory. This prevents your job from crashing if the memory limit is a little too tight.

Inline Arguments

sbatch arguments can be written in the jobfile:

#! /bin/bash
#
# Choosing a partition:
#SBATCH -p itp
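#
# Further options can be set the same way (the values here are only an
# illustration, adjust them to your job):
#SBATCH -n 8                 # number of tasks
#SBATCH --mem-per-cpu=4000   # memory per CPU in MB
#SBATCH --time=2-12          # run time: 2 days and 12 hours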

YOUR JOB COMMANDS....

Links

  • SLURM-Homepage: http://slurm.schedmd.com/slurm.html