Configuring an Elastic Cloud Partition

Slurm’s ability to extend into the cloud relies on its Power Save capabilities. The Power Save module provides a way to regulate node state based on usage or other partition attributes, and it allows pre- and post- scripting to run when nodes are powered up or down. With those hooks, nodes can be provisioned from AWS (or other cloud providers) on demand and released when they are no longer needed. For this to work, Slurm ultimately needs a way of learning each cloud node’s name and IP address, unless the AWS instance is assigned a static IP and hostname.
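As a concrete illustration, the ResumeProgram hook might start the backing EC2 instance and report its address back to Slurm. The following is a minimal sketch, not a tested implementation: it assumes the aws CLI is installed and configured, and the node-to-instance-ID map is a placeholder to adapt.

#!/bin/bash
# Hypothetical ResumeProgram sketch: start the EC2 instance backing each
# node Slurm asks to resume, then report its IP address to slurmctld.
# The instance-ID map below is a placeholder; adapt it to your account.
declare -A INSTANCE_IDS=( [aws-comp0]=i-0123456789abcdef0 )

for node in $(scontrol show hostnames "$1"); do
    id=${INSTANCE_IDS[$node]}
    aws ec2 start-instances --instance-ids "$id"
    aws ec2 wait instance-running --instance-ids "$id"
    ip=$(aws ec2 describe-instances --instance-ids "$id" \
        --query 'Reservations[0].Instances[0].PrivateIpAddress' --output text)
    # Hand the freshly assigned address to slurmctld
    scontrol update nodename="$node" nodeaddr="$ip" nodehostname="$node"
done

A matching SuspendProgram would stop the instances and reset each node’s address.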

Setup

Global Users (all nodes, create before installing Slurm or Munge)

Slurm and Munge require their user UIDs and GIDs to be the same across all nodes, so create those users before installing either package.
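For example, on every node (a sketch; the numeric IDs below are arbitrary placeholders, any unused values that are identical cluster-wide will do):

export MUNGEUSER=991
groupadd -g $MUNGEUSER munge
useradd -m -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=992
groupadd -g $SLURMUSER slurm
useradd -m -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm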

MariaDB (master slurm instance, install before Slurm)

Slurm can store various accounting data in MariaDB. It only needs to be set up on the node that will run the master slurm instance.

If not installed, install MariaDB:

yum install mariadb-server mariadb-devel -y
systemctl enable mariadb
systemctl start mariadb
mysql_secure_installation

Start an interactive SQL session, create the accounting database, and grant the slurm user access to it. The password below is a placeholder; it must match the StoragePass set in slurmdbd.conf:

mysql> create database slurm_acct_db;
mysql> grant all on slurm_acct_db.* to 'slurm'@'localhost' identified by 'some_pass';

Then configure the /etc/slurm/slurmdbd.conf file on the master instance as shown here, and enable and start the daemon:

systemctl enable slurmdbd
systemctl start slurmdbd
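For reference, a minimal slurmdbd.conf for the setup above might look like the following sketch (hostnames, paths, and the password are placeholders to adapt):

AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=some_pass
StorageLoc=slurm_acct_db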

Munge (all nodes, install before Slurm)

Munge is an authentication tool used to verify that messages exchanged by the Slurm daemons originate from trusted hosts. Install Munge on all nodes:

yum install epel-release
yum install munge munge-libs munge-devel -y

After installing Munge, create a secret key on the slurm server node only (munged will also run there). Note that the dd command below overwrites the key produced by create-munge-key, so either method alone is sufficient:

yum install rng-tools -y
rngd -r /dev/urandom
/usr/sbin/create-munge-key -r 
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key

At the end it should look something like this:

[root@xcat2-master ~]# ls -la /etc/munge/munge.key
-r-------- 1 munge munge 1024 Feb  1 12:35 /etc/munge/munge.key

This key should now be propagated to all of the compute nodes in the cluster:

scp /etc/munge/munge.key root@nfs-1:/etc/munge
scp /etc/munge/munge.key centos@18.237.111.190:/etc/munge (i.e., a cloud instance)

Then log in to every node, correct the permissions, and start the munge daemon:

chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/
systemctl enable munge
systemctl start munge

Test communication between Munge clients. From the slurm master node, try:

munge -n                        # generate a credential locally
munge -n | unmunge              # decode it locally
munge -n | ssh nfs-1 unmunge    # decode it on a remote node
remunge                         # run a quick benchmark

If no errors are encountered, Munge is working as expected.

Slurm

Create the Slurm RPMs to install on all nodes. First install the build dependencies:

rpm -Uvh http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum install -y munge-devel munge-libs readline-devel perl-ExtUtils-MakeMaker openssl-devel pam-devel rpm-build perl-DBI perl-Switch munge mariadb-devel

Download the desired version of Slurm from https://www.schedmd.com/downloads.php, then build the RPM packages:

rpmbuild -ta slurm-17.11.2.tar.bz2

Now install the RPMs on all nodes:

rpm -Uvh ~/rpmbuild/RPMS/x86_64/*.rpm

Create and Configure slurm.conf

On the master slurm node, create a slurm.conf file like the one shown here. After copying the slurm.conf file to the cloud node, modify the copy so that ControlAddr points to the master’s publicly reachable address:

ControlAddr=128.200.34.10
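For the elastic partition itself, the Power Save settings in the master’s slurm.conf look something like the sketch below. The script paths, node names, and timings are placeholder assumptions; the parameter names are standard slurm.conf keys (see the power_save reference below):

# Elastic cloud / Power Save settings (values are examples)
ResumeProgram=/usr/local/bin/aws_resume.sh    # hypothetical startup hook (see sketch above)
SuspendProgram=/usr/local/bin/aws_suspend.sh  # hypothetical teardown hook
SuspendTime=600        # seconds a node may sit idle before suspension
ResumeTimeout=300      # max seconds to wait for a resumed node to respond
TreeWidth=65533        # disable message forwarding, recommended for cloud nodes
# Cloud nodes are defined up front but start powered down
NodeName=aws-comp[0-3] CPUs=1 RealMemory=990 State=CLOUD
PartitionName=cloud Nodes=aws-comp[0-3] MaxTime=INFINITE State=UP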

To match the paths set in the slurm.conf file, do the following on the master slurm instance:

touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log
mkdir /var/log/slurm
chown -R slurm:slurm /var/log/slurm

On the compute nodes, do the following:

touch /var/log/slurmd.log
chown slurm: /var/log/slurmd.log

Test the slurmd configuration on the compute nodes by running slurmd -C; the NodeName line it prints can be pasted into slurm.conf as that node’s definition:

[centos@aws-comp0 ~]$ slurmd -C
NodeName=aws-comp0 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=990
UpTime=8-01:39:56

Firewall (if running)

On the compute nodes, make sure any firewall is disabled (at least for the time being).

systemctl stop firewalld
systemctl disable firewalld

On the slurm server be sure to open the following ports:

firewall-cmd --permanent --zone=public --add-port=6817/udp
firewall-cmd --permanent --zone=public --add-port=6817/tcp
firewall-cmd --permanent --zone=public --add-port=6818/udp
firewall-cmd --permanent --zone=public --add-port=6818/tcp
firewall-cmd --permanent --zone=public --add-port=7321/udp
firewall-cmd --permanent --zone=public --add-port=7321/tcp
firewall-cmd --reload

Clock synchronization

Every node must have ntpd running so that clocks stay properly synchronized:

yum install ntp -y
systemctl enable ntpd
ntpdate pool.ntp.org
systemctl start ntpd

Start slurmd on Compute Nodes First

systemctl enable slurmd
systemctl start slurmd

Start slurmctld on Master Slurm Node:

systemctl enable slurmctld
systemctl start slurmctld

Test and Debug

On compute:

tail -f /var/log/slurmd.log /var/log/messages

On master:

tail -f /var/log/slurmctld.log /var/log/slurm/slurmdbd.log
systemctl -l status slurmctld slurmdbd munge

Useful commands for troubleshooting:

scontrol show nodes             # node status and reasons
scontrol show jobs              # job details
scontrol show daemons           # daemons that should run on this node
srun --ntasks=16 --label /bin/hostname   # quick launch test
sbatch                          # submit a batch script
salloc                          # create job alloc and start shell, interactive
srun                            # create job alloc and launch job step, MPI
sattach                         # attach to a running job step
sinfo                           # partition and node overview
sinfo --Node                    # node-oriented view
sinfo -p debug                  # one partition only
squeue -i60                     # redisplay the queue every 60 seconds
squeue -u jtatar -t all         # one user's jobs, all states
squeue -s -p debug              # job steps in the debug partition
smap                            # curses-based cluster view
sview                           # graphical cluster view
scontrol show partition
scontrol update PartitionName=debug MaxTime=60
scontrol show config
sacct -u jtatar                 # accounting data for one user
sacct -r debug                  # accounting data for one partition
sstat                           # status of running job steps
sreport                         # reports from accounting data
sacctmgr                        # manage accounts and associations
sprio                           # job priority factors
sshare                          # fair-share information
sdiag                           # scheduler diagnostics
scancel --user=jtatar --state=pending
scancel 444445
strigger                        # manage event triggers
# Submit a job array with index values between 0 and 31
sbatch --array=0-31 -N1 tmp
# Submit a job array with index values of 1, 3, 5 and 7
sbatch --array=1,3,5,7 -N1 tmp
# Submit a job array with index values between 1 and 7
# with a step size of 2 (i.e. 1, 3, 5 and 7)
sbatch --array=1-7:2 -N1 tmp

References

https://slurm.schedmd.com/power_save.html

http://biocluster.ucr.edu/~jhayes/slurm/elastic_computing.html

https://slurm.schedmd.com/quickstart_admin.html

http://sysadm.mielnet.pl/building-and-installing-rpm-slurm-on-centos-7/

https://www.slothparadise.com/how-to-install-slurm-on-centos-7-cluster/

http://www.hpckp.org/images/training/slurm15/02-hosts-partitions.pdf

Troubleshooting a Slurm Partition in a ‘DOWN’ State

Check Node Status

[root@xcat2-master ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
testQ*       up   infinite      1   down nfs-1

If the state is down, ping the machine(s) and see if they respond:

ping nfs-1

Check Daemons

On Master

Make sure the following daemons are running: slurmctld, slurmdbd, munged

[root@xcat2-master ~]# ps aux | grep slurm
slurm     1439  0.0  0.1 668460  5904 ?        Sl   Jun14   0:23 /usr/sbin/slurmctld
slurm     1926  0.0  0.0 276508  2456 ?        Sl   Jun14   0:00 /usr/sbin/slurmdbd
[root@xcat2-master ~]# ps aux | grep munged
munge     1092  0.0  0.0 243956  2216 ?        Sl   Jun14   0:00 /usr/sbin/munged

Check munge:

munge -n | unmunge | grep STATUS
STATUS:           Success (0)

On Compute

Make sure the following daemons are running: slurmd, munged

[root@nfs-1 ~]# ps aux | grep slurm
root     20214  0.0  0.1 131304  2552 ?        S    Jun13   0:00 /usr/sbin/slurmd
[root@nfs-1 ~]# ps aux | grep munged
munge      970  0.0  0.1 243956  2464 ?        Sl   May08   0:15 /usr/sbin/munged

Check munge:

munge -n | unmunge | grep STATUS
STATUS:           Success (0)

Check the Logs

On Master

/var/log/slurmctld.log /var/log/slurmsched.log /var/log/slurm/slurmdbd.log /var/log/munge/*

On Compute

/var/log/slurmd.log /var/log/munge/*

Restart Daemons (Master first)

First restart the master node’s slurmctld process:

sudo systemctl restart slurmctld

If issues persist, restart slurmd on the compute node next:

sudo systemctl restart slurmd

Able to ssh from master to compute node?

You should be able to ssh from the master to the compute node without having to enter a password. If that is not the case, create an SSH key pair and copy the public key to the compute node, for example as sketched below.
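A minimal sketch, assuming root access and the nfs-1 hostname used above:

ssh-keygen -t rsa -b 4096     # accept the default path, empty passphrase
ssh-copy-id root@nfs-1        # append the public key on the compute node
ssh root@nfs-1 hostname       # should now succeed without a password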

Then try the following, in this order (commands below):

1. Stop the slurmd instance on the compute node.
2. Restart the slurmctld instance on the master node.
3. Start the slurmd instance on the compute node.
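As commands, assuming the systemd units configured above:

ssh nfs-1 systemctl stop slurmd     # on the compute node
systemctl restart slurmctld         # on the master node
ssh nfs-1 systemctl start slurmd    # on the compute node again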

Last resort

scontrol update nodename=nfs-1 state=resume

References

https://www.eidos.ic.i.u-tokyo.ac.jp/~tau/lecture/parallel_distributed/2016/html/fix_broken_slurm.html

https://slurm.schedmd.com/troubleshoot.html