Slurm

Overview

[Figure: Slurm architecture diagram (slurm-arch.gif)]

Set up the munge authentication service on the controller node

apt install munge
systemctl enable munge
systemctl start munge
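
Once the service is running, a local encode/decode round trip is a quick sanity check. The sketch below wraps it in a helper called check_munge, which is a hypothetical name used here for illustration; it assumes the munge and unmunge binaries installed by the munge package are on the PATH.

```shell
#!/bin/sh
# check_munge is a hypothetical helper name, not part of the munge package.
# It encodes an empty credential and decodes it again on the same host.
check_munge() {
    if munge -n | unmunge >/dev/null 2>&1; then
        echo "munge: OK"
    else
        echo "munge: FAILED" >&2
        return 1
    fi
}

# Usage, on the controller after the service has started:
#   check_munge
```

The same check can be repeated on each compute node once the shared key has been copied there.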

Set up Slurmdbd on the controller node

  • Install packages

apt install slurmdbd mysql-server
  • Edit /etc/mysql/conf.d/mysql

[mysqld]
innodb_buffer_pool_size=128M
  • Start MySQL, then create the slurm database and user

systemctl enable mysql
systemctl start mysql
echo "create database slurm" | mysql
echo "create user slurm@localhost identified by '$STORAGE_PASS'" | mysql
echo "GRANT ALL PRIVILEGES ON slurm.* TO 'slurm'@'localhost';" | mysql
  • Create slurmdbd config

zcat /usr/share/doc/slurmdbd/examples/slurmdbd.conf.simple.gz > /etc/slurm-llnl/slurmdbd.conf
  • Set StoragePass in /etc/slurm-llnl/slurmdbd.conf to the password used when creating the database user ($STORAGE_PASS above)

  • Start slurmdbd

systemctl enable slurmdbd
systemctl start slurmdbd
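
The StoragePass step above can be scripted rather than edited by hand. The following is a minimal sketch, not a Slurm tool: set_conf_key is a hypothetical helper that rewrites a KEY=VALUE line in place, and it assumes the key starts at the beginning of the line with no surrounding spaces.

```shell
#!/bin/sh
# set_conf_key FILE KEY VALUE  (hypothetical helper, not shipped with Slurm)
# Replaces every uncommented KEY=... line in FILE, or appends one if none
# exists. The key and value are passed through the environment so that awk
# does not interpret backslashes in the password.
set_conf_key() {
    file=$1
    tmp=$(mktemp) || return 1
    CONF_KEY="$2" CONF_VALUE="$3" awk -F= '
        $1 == ENVIRON["CONF_KEY"] {
            print ENVIRON["CONF_KEY"] "=" ENVIRON["CONF_VALUE"]
            found = 1
            next
        }
        { print }
        END { if (!found) print ENVIRON["CONF_KEY"] "=" ENVIRON["CONF_VALUE"] }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Usage, before starting slurmdbd:
#   set_conf_key /etc/slurm-llnl/slurmdbd.conf StoragePass "$STORAGE_PASS"
```

Writing back with cat instead of mv keeps the ownership and permissions of the existing file.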

Set up Slurmctld on the controller node

  • Install packages

apt install slurm-client slurmctld
  • Create example config

zcat /usr/share/doc/slurm-client/examples/slurm.conf.simple.gz > /etc/slurm-llnl/slurm.conf
  • Set ControlMachine to the hostname of the Slurm controller (Slurm 18.08 and later provide SlurmctldHost as its replacement)

  • Configure cluster nodes

#
# COMPUTE NODES
#
NodeName=DEFAULT CPUs=2 RealMemory=2000 TmpDisk=64000 State=UNKNOWN
NodeName=my-nodes-[1-42]

#
# Partition Configurations
#
PartitionName=mypart Nodes=my-nodes-[1-42] Default=YES MaxTime=INFINITE State=UP
  • To get the CPUs, Cores, RealMemory, etc. values for the above configuration, run the following on a compute node

slurmd -C
  • Start slurmctld

systemctl enable slurmctld
systemctl start slurmctld
  • To make slurmctld highly available, install it on a second machine and set BackupController in /etc/slurm-llnl/slurm.conf

  • Restart slurmctld and slurmd, then check that the configuration was loaded

systemctl restart slurmctld
systemctl restart slurmd
scontrol show config | grep Backup
  • Open TCP ports 6817 and 6818 (the default SlurmctldPort and SlurmdPort) on the controller and backup nodes
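
For the node definition step above, the output of slurmd -C can be turned into a NodeName=DEFAULT line instead of copying values by hand. This is a sketch under the assumption that slurmd -C prints one line beginning with NodeName=<host> followed by the hardware values, plus other lines (such as UpTime) that should be ignored; node_defaults is a hypothetical helper name.

```shell
#!/bin/sh
# node_defaults is a hypothetical helper. It reads `slurmd -C` output on stdin,
# keeps only the line that starts with NodeName=, swaps the host name for
# DEFAULT, and appends State=UNKNOWN to match the example configuration.
node_defaults() {
    sed -n 's/^NodeName=[^ ]* \(.*\)$/NodeName=DEFAULT \1 State=UNKNOWN/p'
}

# Usage, on a compute node:
#   slurmd -C | node_defaults
```

The resulting line is only appropriate as a DEFAULT entry when all compute nodes have the same hardware.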

Set up Slurmd on each compute node

  • Install packages

apt install slurmd munge
  • Copy config /etc/slurm-llnl/slurm.conf from controller node

  • Copy munge shared key /etc/munge/munge.key from controller node

  • Start munge and slurmd

systemctl enable munge
systemctl start munge

systemctl enable slurmd
systemctl start slurmd
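
If munge fails to start after the key has been copied, file permissions are a common cause, since the key should be readable only by the munge user. The sketch below checks the mode bits; key_is_private is a hypothetical helper name, and it assumes GNU stat as shipped on Debian and Ubuntu.

```shell
#!/bin/sh
# key_is_private FILE  (hypothetical helper)
# Succeeds if FILE has no group or other permission bits set,
# e.g. mode 400 or 600. Relies on GNU stat (-c).
key_is_private() {
    mode=$(stat -c '%a' "$1") || return 1
    case $mode in
        0|*00) return 0 ;;
        *)     echo "$1: mode $mode is too permissive" >&2
               return 1 ;;
    esac
}

# Usage, on the compute node:
#   key_is_private /etc/munge/munge.key || chmod 400 /etc/munge/munge.key
```

This checks the mode only; the file should also be owned by the munge user.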

Check cluster status

sinfo -a
scontrol show nodes
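
On a cluster with many nodes it can help to list only the nodes that need attention. The filter below is a sketch: unhealthy_nodes is a hypothetical name, and treating idle, mixed, and allocated as the only healthy states is an assumption that may need adjusting for a given site.

```shell
#!/bin/sh
# unhealthy_nodes is a hypothetical filter. It reads "NODE STATE" pairs on
# stdin and prints every pair whose state is not idle, mixed, or allocated.
unhealthy_nodes() {
    awk '$2 != "idle" && $2 != "mixed" && $2 != "allocated" { print $1, $2 }'
}

# Usage (format string: %N = node name, %T = node state):
#   sinfo -a -N -h -o '%N %T' | unhealthy_nodes
```

A state with a trailing asterisk, such as down*, marks a node that is not responding and is therefore reported as well.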

Submit a test batch job and show the job queue

echo -en '#!/bin/bash\n\nsleep 10\nhostname\n' > test.sh; chmod a+rx test.sh; sbatch ./test.sh
squeue
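
The one-liner above can also be written with a here-document, which is easier to read and extend. This is a sketch of the same test job; write_test_job is a hypothetical helper name.

```shell
#!/bin/sh
# write_test_job FILE  (hypothetical helper)
# Writes the same sleep-then-hostname script as above and marks it executable.
write_test_job() {
    cat > "$1" <<'EOF'
#!/bin/bash

sleep 10
hostname
EOF
    chmod a+rx "$1"
}

# Usage:
#   write_test_job test.sh
#   jobid=$(sbatch --parsable ./test.sh)   # prints the job ID, optionally ";cluster"
#   squeue -j "${jobid%%;*}"
```

With --parsable, sbatch prints only the job ID, so the queue can be narrowed to the job that was just submitted.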