slurm example configurations
|
This page contains documentation and example configuration files showing how to set up the SLURM cluster resource manager, on both the controller side and the compute node side, for test and demonstration purposes.
|
For this scenario, I used VMware ESXi virtual machines for both the controller node and the compute nodes. Virtual machine
resource configuration was the same for all nodes and fairly minimal: 1 vCPU, 1 GB RAM, 40 GB vDisk. Each system was freshly
loaded with a basic installation of Ubuntu 12 LTS before beginning.
|
CONTROLLER CONFIGURATION
|
1. Install MUNGE
|
apt-get install libmunge-dev libmunge2 munge
|
2. Generate MUNGE key. There are various ways to do this, depending on the desired level of key quality. Refer to the
MUNGE installation guide for complete details. My
example below will generate a lower quality key at high speed, but that's OK because this test cluster is completely
detached from the public Internet and I'm the only user on the system.
|
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
|
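If you want a higher quality key, the MUNGE install guide suggests reading from /dev/random instead of /dev/urandom; the command below is the same idea, but it can block for a while until the kernel collects enough entropy. Set the ownership and permissions the same way afterward.
|
dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key
|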
3. Start MUNGE.
|
/etc/init.d/munge start
|
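To sanity-check that MUNGE is running and the key is usable, you can generate a credential and decode it locally:
|
munge -n | unmunge
|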
4. Install MySQL server (for SLURM accounting) and development tools (to build SLURM). We'll also install the BLCR tools so that
SLURM can take advantage of that checkpoint-and-restart functionality.
|
apt-get install mysql-server libmysqlclient-dev libmysqld-dev libmysqld-pic
apt-get install gcc bison make flex libncurses5-dev tcsh pkg-config
apt-get install blcr-dkms blcr-testsuite blcr-util libcr-dbg libcr-dev libcr0
|
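Since blcr-dkms builds a kernel module, it's worth confirming that the module actually loaded before counting on checkpoint/restart support:
|
lsmod | grep blcr
|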
5. Unpack and build SLURM.
|
bunzip2 slurm-2.5.4.tar.bz2
tar xfp slurm-2.5.4.tar
cd slurm-2.5.4
./configure
make
make install
|
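The default ./configure prefix installs everything under /usr/local, with the configuration directory at /usr/local/etc, which is what the rest of this page assumes. After make install it doesn't hurt to run ldconfig so the runtime linker picks up the new libraries in /usr/local/lib, and to confirm the build reports the expected version:
|
ldconfig
/usr/local/sbin/slurmctld -V
|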
6. Copy the example configuration files out to /usr/local/etc.
|
cd /usr/src/slurm-2.5.4
cp ./etc/slurm.conf.example /usr/local/etc/slurm.conf
cp ./etc/slurmdbd.conf.example /usr/local/etc/slurmdbd.conf
|
7. Set things up for slurmdbd (the SLURM accounting daemon) in MySQL.
|
mysql -u root -p
create database slurm_acct_db;
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = password('MyStoragePassword');
grant usage on *.* to 'slurm'@'localhost';
grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
flush privileges;
|
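If you want to double-check the grants before moving on, this (from the same mysql session) will show the slurm user's privileges:
|
show grants for 'slurm'@'localhost';
|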
8. Configure /usr/local/etc/slurmdbd.conf such that it looks like the following:
|
#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=localhost
DbdHost=localhost
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=4
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
#StorageHost=localhost
#StoragePort=1234
StoragePass=MyStoragePassword
StorageUser=slurm
StorageLoc=slurm_acct_db
|
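Note that slurmdbd.conf contains the database password in plain text, so once the slurm user exists (it gets created a few steps below) it's worth locking the file down so that only that user can read it:
|
chown slurm:slurm /usr/local/etc/slurmdbd.conf
chmod 600 /usr/local/etc/slurmdbd.conf
|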
9. Configure /usr/local/etc/slurm.conf such that it looks like the following:
|
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=diablonet
ControlMachine=slurm
ControlAddr=172.16.1.15
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/tmp
#SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFs=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/linear
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
JobCompType=jobcomp/filetxt
JobCompLoc=/tmp/slurm_job_completion.txt
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
AccountingStorageEnforce=limits,qos
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurm
AccountingStorageLoc=/tmp/slurm_job_accounting.txt
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
NodeName=cn[01-02] Procs=1 ThreadsPerCore=1 Sockets=1 CoresPerSocket=1 RealMemory=1000 State=UNKNOWN
# PARTITIONS
PartitionName=DEFAULT State=UP MaxTime=28-00:00:00 DefaultTime=01:00:00 PreemptMode=REQUEUE Priority=10000 Shared=FORCE:1
PartitionName=production Default=YES Nodes=cn[01-02]
|
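The Procs/Sockets/CoresPerSocket/RealMemory values on the NodeName line should reflect the actual hardware on the compute nodes; if a node registers with fewer resources than configured, the controller will typically mark it DOWN. If your slurmd build supports the -C flag, running it on a compute node (once SLURM is built there) prints a ready-made configuration line you can paste in here:
|
slurmd -C
|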
10. Copy the default init scripts to /etc/init.d.
|
cd /usr/src/slurm-2.5.4/etc
cp init.d.slurm /etc/init.d/slurm
cp init.d.slurmdbd /etc/init.d/slurmdbd
chmod +x /etc/init.d/slurm
chmod +x /etc/init.d/slurmdbd
|
11. Make some changes to the default init scripts we copied to /etc/init.d so that they work with Ubuntu. Honestly, a bit more work is
required here to get them fully working; I haven't bothered, and have just been starting the requisite daemons manually:
|
CONFDIR="/usr/local/etc"
LIBDIR="/usr/local/lib"
SBINDIR="/usr/local/sbin"
BINDIR="/usr/local/bin" # in init script for slurm only, not for slurmdbd
change each instance of: "/etc/rc.d/init.d/functions" to "/lib/lsb/init-functions"
|
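Once the scripts are behaving to your satisfaction, you could register them so the daemons come up at boot; on Ubuntu that would look something like this:
|
update-rc.d slurmdbd defaults
update-rc.d slurm defaults
|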
12. Add SLURM user:
|
echo "slurm:x:2000:2000:slurm admin:/home/slurm:/bin/bash" >> /etc/passwd
echo "slurm:x:2000:slurm >> /etc/group
pwconv
|
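Alternatively, if you'd rather not edit the passwd and group files by hand, groupadd and useradd will create the same entries (and update the shadow files) in one step:
|
groupadd -g 2000 slurm
useradd -u 2000 -g slurm -d /home/slurm -s /bin/bash -c "slurm admin" slurm
|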
13. Create SLURM spool and log directories and set permissions accordingly:
|
mkdir /var/spool/slurm
chown -R slurm:slurm /var/spool/slurm
mkdir /var/log/slurm
chown -R slurm:slurm /var/log/slurm
|
14. Start the SLURM database daemon:
|
/usr/local/sbin/slurmdbd &
|
15. Use sacctmgr to create the cluster in the accounting system:
|
sacctmgr add cluster diablonet
|
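A quick listing should show the new cluster before you go any further:
|
sacctmgr list cluster
|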
16. Start the SLURM controller daemon:
|
/usr/local/sbin/slurmctld &
|
17. Use sacctmgr to add accounts to the accounting system. Consider these as basically user classes:
|
sacctmgr add account research description "Research accounts" Organization=Research
|
18. Use sacctmgr to add users (already defined in /etc/passwd) to the SLURM accounting system such that they can
submit jobs. Note that in general, you'll need some way to keep user names and UIDs synchronized between the controller and all the
compute nodes, whether manually or by using a scheme like LDAP.
|
sacctmgr create user scaron account=xxx defaultaccount=xxx adminlevel=[None|Operator|Admin]
sacctmgr show user name=scaron
|
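Because slurm.conf above sets AccountingStorageEnforce=limits,qos, any per-association limits you set through sacctmgr will actually be enforced. As an illustration (the limit value here is arbitrary), you could cap the number of running jobs for a user like so:
|
sacctmgr modify user where user=scaron set MaxJobs=4
|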
At this point, the SLURM controller should be online; you can query the system with commands like squeue and sinfo.
|
COMPUTE NODE CONFIGURATION
|
Repeat the following steps for each compute node that will be configured. In a larger-scale, real-world scenario we'd use a configuration
management system to automate the task of setting up compute nodes from a master template, but for this smaller-scale test we'll
just set each host up manually.
|
1. Install MUNGE
|
apt-get install libmunge-dev libmunge2 munge
|
2. Copy MUNGE key over from the SLURM controller:
|
scp root@slurm:/etc/munge/munge.key /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
|
3. Start MUNGE.
|
/etc/init.d/munge start
|
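It's worth confirming that credentials generated on one host can be decoded on the other. Assuming you can ssh between the compute node and the controller (called slurm here), the first command below decodes a locally generated credential on the controller, and the second does the reverse:
|
munge -n | ssh slurm unmunge
ssh slurm munge -n | unmunge
|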
4. Install development tools to build SLURM, and the BLCR tools so that SLURM can take advantage of the checkpoint-and-restart
functionality.
|
apt-get install gcc bison make flex libncurses5-dev tcsh pkg-config
apt-get install blcr-dkms blcr-testsuite blcr-util libcr-dbg libcr-dev libcr0
|
5. Unpack and build SLURM.
|
bunzip2 slurm-2.5.4.tar.bz2
tar xfp slurm-2.5.4.tar
cd slurm-2.5.4
./configure
make
make install
|
6. Copy SLURM configuration over from the SLURM controller:
|
scp root@slurm:/usr/local/etc/slurm.conf /usr/local/etc
|
7. Copy SLURM init script over from the SLURM controller:
|
scp root@slurm:/etc/init.d/slurm /etc/init.d
|
8. Add SLURM user:
|
echo "slurm:x:2000:2000:slurm admin:/home/slurm:/bin/bash" >> /etc/passwd
echo "slurm:x:2000:slurm >> /etc/group
pwconv
|
9. Create SLURM spool and log directories and set permissions accordingly:
|
mkdir /var/spool/slurm
chown -R slurm:slurm /var/spool/slurm
mkdir /var/log/slurm
chown -R slurm:slurm /var/log/slurm
|
10. Start the SLURM client daemon:
|
/usr/local/sbin/slurmd &
|
11. As the nodes come online, the SLURM controller should gradually figure it out and mark them as idle in sinfo. You
can speed the process along by updating the node state with the scontrol command:
|
scontrol update NodeName=cn01 State=Resume
|
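A quick look from the controller should confirm that the node has registered and is back in service:
|
scontrol show node cn01
sinfo -N -l
|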
This completes the process of setting up SLURM on the clients. Now just set up an appropriate network filesystem to create some
common working space for the cluster, and start dispatching jobs!
|
For a quick test, create a script like the following, which we'll call slurmtest.sh:
|
#!/bin/sh
# Submit to a specific node with a command like:
# sbatch --nodelist=cn01 slurmtest.sh
# Or submit to any available node with:
# sbatch slurmtest.sh
#SBATCH --job-name=slurmtest
hostname
uptime
|
Then submit the job to the cluster using the sbatch command:
|
sbatch slurmtest.sh
|
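By default, sbatch writes the job's stdout to a file named slurm-<jobid>.out in the directory the job was submitted from, so assuming that directory lives on the shared filesystem mentioned above, you can watch the queue and then check the output (substituting the job ID that sbatch printed):
|
squeue
cat slurm-<jobid>.out
|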