Slurm¶
Overview¶
Set up the munge authentication service on the controller node¶
apt install munge
systemctl enable munge
systemctl start munge
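A quick way to confirm that the service can issue and decode credentials is to pipe a credential from munge into unmunge. The sketch below wraps that check in a helper; the function name check_munge is illustrative and not part of munge itself.

```shell
# check_munge: encode an empty credential and decode it on the same host.
# Returns 0 if unmunge accepts the credential, non-zero otherwise.
check_munge() {
    if ! command -v munge >/dev/null 2>&1; then
        echo "munge is not installed" >&2
        return 1
    fi
    # munge -n encodes a credential with no payload; unmunge validates it.
    if munge -n | unmunge >/dev/null; then
        echo "munge OK"
    else
        echo "munge credential check failed" >&2
        return 1
    fi
}
```

Once the key has been distributed to a compute node, the same idea works across hosts, for example `munge -n | ssh my-nodes-1 unmunge`, assuming SSH access to that node.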
Set up Slurmdbd on the controller node¶
Install packages
apt install slurmdbd mysql-server
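Edit /etc/mysql/conf.d/mysql and set the InnoDB buffer pool size in the [mysqld] section
innodb_buffer_pool_size=128M
Start MySQL, then create the slurm database and user (STORAGE_PASS must already be set in the shell)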
systemctl enable mysql
systemctl start mysql
echo "create database slurm" | mysql
echo "create user slurm@localhost identified by '$STORAGE_PASS'" | mysql
echo "GRANT ALL PRIVILEGES ON slurm.* TO 'slurm'@'localhost';" | mysql
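The commands above expect STORAGE_PASS to be set in the shell. As a minimal sketch, the password can be generated randomly and the same three statements emitted in one go. The helper names gen_pass and slurm_db_sql are hypothetical; slurm_db_sql only prints SQL and does not run mysql itself.

```shell
# gen_pass: print a random 32-character hex string. Hex avoids any
# quoting problems when the value is embedded in an SQL string literal.
gen_pass() {
    head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# slurm_db_sql PASSWORD: print the statements from the steps above.
slurm_db_sql() {
    cat <<EOF
CREATE DATABASE slurm;
CREATE USER 'slurm'@'localhost' IDENTIFIED BY '$1';
GRANT ALL PRIVILEGES ON slurm.* TO 'slurm'@'localhost';
EOF
}
```

Usage would look like `STORAGE_PASS=$(gen_pass); slurm_db_sql "$STORAGE_PASS" | mysql`. Keep the value of STORAGE_PASS around, since the same password goes into slurmdbd.conf.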
Create slurmdbd config
zcat /usr/share/doc/slurmdbd/examples/slurmdbd.conf.simple.gz > /etc/slurm-llnl/slurmdbd.conf
Set StoragePass in /etc/slurm-llnl/slurmdbd.conf to the password used when creating the database user
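Setting StoragePass can be scripted instead of edited by hand. The following is a sketch only: set_storage_pass is a hypothetical helper, and it assumes the option appears at most once in the file, either active or commented out with a leading `#`.

```shell
# set_storage_pass FILE PASSWORD
# Replace the first StoragePass= line (commented out or not) in FILE,
# or append one if the file has none.
set_storage_pass() {
    conf=$1
    tmp=$(mktemp) || return 1
    rc=1
    # The password travels through the environment so awk does not
    # interpret backslashes or other special characters in it.
    if STORAGE_PASS_NEW=$2 awk '
        !done && /^#? *StoragePass=/ {
            print "StoragePass=" ENVIRON["STORAGE_PASS_NEW"]; done = 1; next
        }
        { print }
        END { if (!done) print "StoragePass=" ENVIRON["STORAGE_PASS_NEW"] }
    ' "$conf" > "$tmp"; then
        # Overwrite in place so the file keeps its owner and mode.
        cat "$tmp" > "$conf" && rc=0
    fi
    rm -f "$tmp"
    return $rc
}
```

A call such as `set_storage_pass /etc/slurm-llnl/slurmdbd.conf "$STORAGE_PASS"` would apply it. Because the file then holds a password, it is reasonable to make it readable only by the user slurmdbd runs as; some Slurm releases require this.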
Start slurmdbd
systemctl enable slurmdbd
systemctl start slurmdbd
Set up Slurmctld on the controller node¶
Install packages
apt install slurm-client slurmctld
Create example config
zcat /usr/share/doc/slurm-client/examples/slurm.conf.simple.gz > /etc/slurm-llnl/slurm.conf
Set ControlMachine to the hostname of the Slurm controller
Configure the cluster nodes and partitions
#
# COMPUTE NODES
#
NodeName=DEFAULT CPUs=2 RealMemory=2000 TmpDisk=64000 State=UNKNOWN
NodeName=my-nodes-[1-42]
#
# Partition Configurations
#
PartitionName=mypart Nodes=my-nodes-[1-42] Default=YES MaxTime=INFINITE State=UP
To get the CPUs, cores, RealMemory, etc. for the configuration above, run the following on a compute node
slurmd -C
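slurmd -C reports the hardware of the machine it runs on, typically as a NodeName= line followed by an UpTime= line. As a rough sketch, the NodeName= line can be turned into a slurm.conf entry by swapping the local hostname for a node range. The helper name node_line_from_slurmd_c is made up for this example.

```shell
# node_line_from_slurmd_c PATTERN
# Read `slurmd -C` output on stdin, keep only the NodeName= line, and
# replace the reported hostname with PATTERN (for example a node range).
# PATTERN must not contain '/', '&' or '\', which are special to sed.
node_line_from_slurmd_c() {
    sed -n "s/^NodeName=[^ ]*/NodeName=$1/p"
}
```

For instance, `slurmd -C | node_line_from_slurmd_c 'my-nodes-[1-42]'` prints a line that can be pasted into the COMPUTE NODES block, provided all nodes in the range have identical hardware.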
Start slurmctld
systemctl enable slurmctld
systemctl start slurmctld
To make slurmctld highly available, install it on a second machine and set BackupController in /etc/slurm-llnl/slurm.conf
Restart slurmctld and slurmd, then check that the new configuration was loaded
systemctl restart slurmctld
systemctl restart slurmd
scontrol show config | grep Backup
Open TCP ports 6817 and 6818 (the default SlurmctldPort and SlurmdPort) on the controller and backup nodes
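How the ports are opened depends on the firewall in use. The sketch below assumes ufw, which this page does not mention; with iptables, nftables or firewalld the equivalent rules would be needed instead. The function name open_slurm_ports is illustrative.

```shell
# open_slurm_ports: allow inbound TCP on ports 6817 and 6818.
# Assumes ufw is the active firewall and that this runs as root.
open_slurm_ports() {
    for port in 6817 6818; do
        ufw allow "${port}/tcp" || return 1
    done
}
```

Running `open_slurm_ports` on each affected node, followed by `ufw status`, shows whether the rules are in place.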
Set up Slurmd on the compute node¶
Install packages
apt install slurmd munge
Copy the config /etc/slurm-llnl/slurm.conf from the controller node
Copy the munge shared key /etc/munge/munge.key from the controller node (the key must be identical on every node)
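The two copy steps can be sketched as a helper run on the controller. It assumes root SSH access to the compute node, which this page does not state, and the name push_cluster_files is hypothetical. After copying, it resets the key's owner and mode, since munge expects a key that only the munge user can read.

```shell
# push_cluster_files HOST
# Run on the controller as root. Copies slurm.conf and the munge key to
# HOST, then restricts the key to the munge user on the remote side.
push_cluster_files() {
    host=$1
    scp -p /etc/slurm-llnl/slurm.conf "root@${host}:/etc/slurm-llnl/slurm.conf" || return 1
    scp -p /etc/munge/munge.key "root@${host}:/etc/munge/munge.key" || return 1
    ssh "root@${host}" 'chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key'
}
```

For the node range used earlier, a loop such as `for i in $(seq 1 42); do push_cluster_files "my-nodes-$i"; done` would cover every node.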
Start munge and slurmd
systemctl enable munge
systemctl start munge
systemctl enable slurmd
systemctl start slurmd
Check cluster status¶
sinfo -a
scontrol show nodes
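On a larger cluster it helps to list only the nodes that are not usable. The sketch below filters sinfo output for states beginning with down, drain or fail; the helper name unavailable_nodes is made up, and it reads from stdin so it can be tried on any "name state" input.

```shell
# unavailable_nodes: read "name state" pairs on stdin and print the names
# of nodes whose state starts with down, drain, or fail (one per line).
unavailable_nodes() {
    awk '$2 ~ /^(down|drain|fail)/ { print $1 }' | sort -u
}
```

It is meant to be fed with `sinfo -h -N -o '%N %T' | unavailable_nodes`. A node that was down and has been fixed can be returned to service with `scontrol update NodeName=<node> State=RESUME`.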
Submit a test batch job and show job queue¶
echo -en '#!/bin/bash\n\nsleep 10\nhostname\n' > test.sh; chmod a+rx test.sh; sbatch ./test.sh
squeue
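The one-line test job can also be written as a script with #SBATCH directives, which keeps the job options next to the commands. This is a sketch: the job name, time limit and output file name are arbitrary choices, and the partition is the mypart example from the config above. write_test_job is a hypothetical helper that only creates the file.

```shell
# write_test_job FILE: create an executable batch script with #SBATCH options.
write_test_job() {
    cat > "$1" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=mypart
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
#SBATCH --output=hello-%j.out

sleep 10
hostname
EOF
    chmod a+rx "$1"
}
```

After `write_test_job test-job.sh`, submit with `sbatch ./test-job.sh`. The `%j` in the output name is replaced by the job ID, so the hostname ends up in a file such as hello-<jobid>.out.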