Slurm Quick Installation for Cluster on Ubuntu 20.04
Slurm makes a bunch of separate machines look much like a single cluster, doesn't it?
Naming Convention of Nodes
A typical cluster comprises management nodes and compute nodes. This article takes our cluster as an example to demonstrate the steps to install and configure Slurm. In our case, the management node is called clab-mgt01, while the compute nodes are named clab01 through clab20 in order.
Install Dependencies
Execute the following command to install the dependencies on all machines (clab-all refers to all machines, including both management and compute nodes).
```
clab-all$ sudo apt install slurm-wlm slurm-client munge
```
Tips: Several tools can help manage multiple nodes at once:
- iTerm2 (on Mac) / Terminator (on Linux)
- csshX (on Mac) / cssh (on Linux)
- Parallel SSH (on the cluster side); see the sketch after this list
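For instance, here is a minimal sketch using parallel-ssh, assuming the node hostnames are collected in a hypothetical hosts.txt file and that passwordless sudo is configured on the nodes:

```
# hosts.txt (hypothetical) lists one node per line:
#   clab-mgt01, clab01, ..., clab20
$ parallel-ssh -h hosts.txt -i 'sudo apt install -y slurm-wlm slurm-client munge'
```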
Generate Slurm Configuration
Slurm provides an official online configuration generator, and we should carefully check the fields below.
- SlurmctldHost: clab-mgt01 in our case.
- NodeName: clab[01-20] in our case.
- CPUs: It is recommended to leave it blank.
- Sockets: For the common dual-socket server, it should be 2.
- CoresPerSocket: The number of physical cores per socket.
- ThreadsPerCore: For a regular x86 server, it should be 2 if hyperthreading is enabled, otherwise 1.
- RealMemory: Optional.
Click Submit, then copy the generated content to /etc/slurm-llnl/slurm.conf on all machines.
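For reference, the node-related lines of the generated file should look roughly like the excerpt below. The hardware numbers here are assumptions for illustration; yours must match the actual machines.

```
# Illustrative excerpt of a generated slurm.conf
SlurmctldHost=clab-mgt01
NodeName=clab[01-20] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```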
Tips: Don't forget the shared storage (e.g., NFS) on the cluster. We can utilize it to distribute files.
Distribute Munge Key
Once Munge is installed successfully, the key /etc/munge/munge.key is generated automatically. All machines are required to hold the same key. Therefore, we should distribute the key on the management node to all remaining nodes, including the compute nodes and any backup management nodes, if present.
Tips: Again, we can utilize the shared storage to distribute the key.
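For example, a minimal sketch assuming a shared mount at /nfs/share (a hypothetical path; adjust to your cluster):

```
# On the management node: stage the key on shared storage
clab-mgt01$ sudo cp /etc/munge/munge.key /nfs/share/munge.key
# On every other node: install the key locally
clab$ sudo cp /nfs/share/munge.key /etc/munge/munge.key
```

Remember to delete the staged copy from the shared storage afterwards, since anyone who can read the key can authenticate to the cluster.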
Then make sure the permissions and the ownership are set correctly.
```
clab-all$ sudo chown munge:munge /etc/munge/munge.key
clab-all$ sudo chmod 400 /etc/munge/munge.key
```
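After replacing the key, restart the Munge service on every node so that the new key takes effect:

```
clab-all$ sudo systemctl restart munge
```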
Patch Slurm Cgroup Integration
By default, Slurm does not work well with cgroups. If we start the Slurm service right now, we may receive the error shown below.
```
error: cgroup namespace 'freezer' not mounted. aborting
```
Therefore, this issue can be fixed by pasting the following content into /etc/slurm-llnl/cgroup.conf on the compute nodes,
```
CgroupMountpoint=/sys/fs/cgroup
```
or by running this command:
```
echo CgroupMountpoint=/sys/fs/cgroup | sudo tee -a /etc/slurm-llnl/cgroup.conf
```
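To confirm that the freezer controller is actually mounted on a compute node (Ubuntu 20.04 uses cgroup v1 by default, so the directory below should exist):

```
clab$ ls /sys/fs/cgroup/freezer
```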
Fix Directory Permission
For unknown reasons, the permissions of the slurmctld spool directory are not set properly, which may lead to this error.
```
slurmctld: fatal: mkdir(/var/spool/slurmctld): Permission denied
```
The solution is to execute the commands below on the management nodes.
```
clab-mgt$ sudo mkdir -p /var/spool/slurmctld
clab-mgt$ sudo chown slurm:slurm /var/spool/slurmctld
```
Start Slurm Service
So far, we have finished the basic configuration. Let us launch Slurm now.
```
# On management nodes
clab-mgt$ sudo systemctl restart slurmctld

# On compute nodes
clab$ sudo systemctl restart slurmd
```
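Optionally, enable the services so that they start automatically at boot:

```
clab-mgt$ sudo systemctl enable slurmctld
clab$ sudo systemctl enable slurmd
```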
Run sinfo, and we should see that all the compute nodes are ready.
```
$ sinfo
```
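The output should look roughly like this (the partition name comes from the generated slurm.conf; debug is the generator's default):

```
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite     20   idle clab[01-20]
```

As a further smoke test, we can dispatch a trivial job to all 20 compute nodes:

```
$ srun -N 20 hostname
```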
Debugging Tips
If your Slurm is not working correctly, you can try these commands to debug.
```
# Run the daemons in the foreground to see the logs directly
clab-mgt$ sudo slurmctld -D
clab$ sudo slurmd -D
```
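It also helps to verify that Munge authentication works across nodes: a credential generated on one node should decode successfully on another (clab01 below is just an example target):

```
# Generate a credential locally and decode it on a remote node
clab-mgt$ munge -n | ssh clab01 unmunge
```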