Slurm Quick Installation for Cluster on Ubuntu 20.04

Slurm makes a bunch of separate machines look much like a single cluster, doesn't it?

Naming Convention of Nodes

A typical cluster comprises management nodes and compute nodes. This article takes our cluster as an example to demonstrate the steps to install and configure Slurm. In our case, the management node is called clab-mgt01, while the compute nodes are named clab01 through clab20 in order.
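For reference, this naming scheme could map to cluster-internal addresses in /etc/hosts as sketched below (the IP addresses are made up for illustration only):

# /etc/hosts excerpt -- addresses are illustrative only
10.0.0.100  clab-mgt01
10.0.0.1    clab01
10.0.0.2    clab02
# ... clab03 to clab19 continue the pattern
10.0.0.20   clab20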

Install Dependencies

Execute the following command to install the dependencies on all machines. (clab-all refers to all machines including management and compute nodes).

clab-all$ sudo apt install slurm-wlm slurm-client munge

Tips: There are several tools that may help to manage multiple nodes easily:

  • iTerm2 (on Mac) / Terminator (on Linux)
  • csshX (on Mac) / cssh (on Linux)
  • Parallel SSH (at cluster side; see the sketch after this list)
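For instance, with Parallel SSH set up on the management node, the dependency installation can be pushed to every node at once. This is only a sketch: the hosts.txt file (one node name per line) and passwordless sudo on the nodes are assumptions.

# hosts.txt lists clab-mgt01 and clab01 to clab20, one name per line
clab-mgt$ parallel-ssh -i -h hosts.txt "sudo apt install -y slurm-wlm slurm-client munge"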

Generate Slurm Configuration

There is an official online configuration generator, and we should carefully check the fields below.

  • SlurmctldHost: clab-mgt01 in our case.
  • NodeName: clab[01-20] in our case.
  • CPUs: It is recommended to leave it blank.
  • Sockets: For the dual-socket servers we commonly see, it should be 2.
  • CoresPerSocket: Number of physical cores per socket.
  • ThreadsPerCore: For a regular x86 server, if hyperthreading is enabled, it should be 2, otherwise 1.
  • RealMemory: Optional.

Click submit, then we can copy the generated content to /etc/slurm-llnl/slurm.conf on all machines.
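For reference, the node and partition definitions in the generated slurm.conf might look like the snippet below. The hardware numbers (2 sockets, 16 cores per socket, hyperthreading on, about 192 GB of RAM) are assumptions for illustration; yours will differ.

# slurm.conf excerpt -- hardware values are assumptions
SlurmctldHost=clab-mgt01
NodeName=clab[01-20] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=192000 State=UNKNOWN
PartitionName=debug Nodes=clab[01-20] Default=YES MaxTime=INFINITE State=UP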

Tips: Don't forget the shared storage (e.g. NFS storage) on the cluster. We could utilize it to distribute files.

Distribute Munge Key

Once Munge is installed successfully, the key /etc/munge/munge.key is generated automatically. All machines are required to hold the same key. Therefore, we can distribute the key on the management node to the remaining nodes, including the compute nodes and any backup management nodes if they exist.

Tips: Again, we could utilize the shared storage to distribute the key.
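Alternatively, the key can be pushed over SSH. The loop below is only a sketch; it assumes root SSH access from the management node and our clab01 to clab20 naming scheme.

clab-mgt$ for i in $(seq -w 1 20); do sudo scp /etc/munge/munge.key root@clab$i:/etc/munge/; done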

Then make sure the permissions and ownership are set correctly.

clab-all$ sudo chmod 400 /etc/munge/munge.key
clab-all$ sudo chown munge:munge /etc/munge/munge.key
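To verify that every node now holds a matching key, a credential generated on one node should decode on another (this check assumes SSH access from the management node to the compute nodes):

clab-mgt$ munge -n | ssh clab01 unmunge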

Patch Slurm Cgroup Integration

By default, Slurm does not work well with cgroups here. If we start the Slurm service right now, we may receive the error shown below.

error: cgroup namespace 'freezer' not mounted. aborting

Therefore, pasting the following content into /etc/slurm-llnl/cgroup.conf on the compute nodes fixes this issue.

CgroupMountpoint=/sys/fs/cgroup

Or use this command:

clab-comp$ echo CgroupMountpoint=/sys/fs/cgroup | sudo tee -a /etc/slurm-llnl/cgroup.conf

Fix Directory Permission

For unknown reasons, the slurmctld state directory is not created with the proper ownership, which may lead to this error.

slurmctld: fatal: mkdir(/var/spool/slurmctld): Permission denied

The solution is to execute the commands below on the management nodes.

clab-mgt$ sudo mkdir -p /var/spool/slurmctld
clab-mgt$ sudo chown slurm:slurm /var/spool/slurmctld/
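The directory should also match what slurm.conf points at. If the generator produced a different StateSaveLocation (or SlurmdSpoolDir for the compute nodes), adjust the path accordingly; the lines below are the defaults we assume here.

# slurm.conf entries the directories above correspond to (assumed defaults)
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd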

Start Slurm Service

So far, we have finished the basic configuration. Let us launch Slurm now.

# On management nodes
clab-mgt$ sudo systemctl enable munge
clab-mgt$ sudo systemctl start munge
clab-mgt$ sudo systemctl enable slurmctld
clab-mgt$ sudo systemctl start slurmctld

# On compute nodes
clab-comp$ sudo systemctl enable munge
clab-comp$ sudo systemctl start munge
clab-comp$ sudo systemctl enable slurmd
clab-comp$ sudo systemctl start slurmd

Run sinfo and we should see all the compute nodes are ready.

$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 20 idle clab[01-20]
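Beyond sinfo, running a trivial job with srun exercises the whole stack end to end; if the nodes print their hostnames, both scheduling and Munge authentication are working (the node count below is just an example):

clab-mgt$ srun -N 2 hostname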

Debugging Tips

If your Slurm is not working correctly, you can try these commands to run the daemons in the foreground for debugging.

clab-mgt$ sudo slurmctld -D
clab-comp$ sudo slurmd -D
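If a compute node got marked down while we were fixing things, it can be returned to service with scontrol once slurmd is healthy again (the node name is just an example):

clab-mgt$ sudo scontrol update NodeName=clab01 State=RESUME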
