===== Installation =====
  
===== Controller name: slurm-ctrl =====

Install slurm-wlm and tools

  ssh slurm-ctrl
  apt install slurm-wlm slurm-wlm-doc mailutils mariadb-client mariadb-server libmariadb-dev python-dev python-mysqldb
  
=== Install Maria DB Server ===
  
  apt-get install mariadb-server
  systemctl start mysql
  mysql -u root
  create database slurm_acct_db;
  create user 'slurm'@'localhost';
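The slurm user also needs privileges on the accounting database; a minimal example, assuming the password that will later be set as StoragePass in slurmdbd.conf:

  grant all on slurm_acct_db.* to 'slurm'@'localhost' identified by 'slurmdbpass';
  flush privileges;
  exit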
In the file /etc/mysql/mariadb.conf.d/50-server.cnf we should have the following setting:
  
  vi /etc/mysql/mariadb.conf.d/50-server.cnf
  bind-address = localhost

=== Node Authentication ===
First, let us configure the default options for the munge service:
  
  vi /etc/default/munge
  OPTIONS="--syslog --key-file /etc/munge/munge.key"
  
=== Central Controller ===
  
The main configuration file is /etc/slurm-llnl/slurm.conf. This file has to be present on the controller and on *ALL* of the compute nodes, and it also has to be consistent between all of them.

  vi /etc/slurm-llnl/slurm.conf
  
<code>
# /etc/slurm-llnl/slurm.conf
###############################
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=slurm-ctrl
#ControlAddr=10.7.20.97
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
##SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
##SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
#SelectTypeParameters=
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/SlurmctldLogFile
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/SlurmLogFile
#
#
# COMPUTE NODES
NodeName=linux1 NodeAddr=10.7.20.98 CPUs=1 State=UNKNOWN
</code>
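The CPUs value (and RealMemory, Sockets, CoresPerSocket, if used) in the NodeName line should match the node's real hardware; running slurmd -C on the compute node prints the values Slurm detects, which can be pasted into this file:

  slurmd -C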
  
Copy slurm.conf to the compute nodes!

  root@slurm-ctrl# scp /etc/slurm-llnl/slurm.conf csadmin@10.7.20.109:/tmp/.; scp /etc/slurm-llnl/slurm.conf csadmin@10.7.20.110:/tmp/.
  vi /lib/systemd/system/slurmctld.service

<code>
[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf
Documentation=man:slurmctld(8)

[Service]
Type=forking
EnvironmentFile=-/etc/default/slurmctld
ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecStartPost=/bin/sleep 2
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm-llnl/slurmctld.pid

[Install]
WantedBy=multi-user.target
</code>
  
  vi /lib/systemd/system/slurmd.service

<code>
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf
Documentation=man:slurmd(8)

[Service]
Type=forking
EnvironmentFile=-/etc/default/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecStartPost=/bin/sleep 2
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm-llnl/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity

[Install]
WantedBy=multi-user.target
</code>

  root@slurm-ctrl# systemctl daemon-reload
  root@slurm-ctrl# systemctl enable slurmdbd
  root@slurm-ctrl# systemctl start slurmdbd
  root@slurm-ctrl# systemctl enable slurmctld
  root@slurm-ctrl# systemctl start slurmctld
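To check that the daemons came up cleanly, look at their status and at the log files configured in slurm.conf above:

  root@slurm-ctrl# systemctl status slurmctld
  root@slurm-ctrl# tail /var/log/slurm-llnl/SlurmctldLogFile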
  
=== Accounting Storage ===
  
After the slurm-llnl-slurmdbd package is installed, configure it by editing the /etc/slurm-llnl/slurmdbd.conf file:

  vi /etc/slurm-llnl/slurmdbd.conf
  
<code>
########################################################################
#
# /etc/slurm-llnl/slurmdbd.conf is an ASCII file which describes Slurm
# Database Daemon (SlurmDBD) configuration information.
# The contents of the file are case insensitive except for the names of
# nodes and files. Any text following a "#" in the configuration file is
# treated as a comment through the end of that line. The size of each
# line in the file is limited to 1024 characters. Changes to the
# configuration file take effect upon restart of SlurmDbd or daemon
StoragePort=3306
StorageUser=slurm
StoragePass=slurmdbpass
StorageType=accounting_storage/mysql
StorageLoc=slurm_acct_db
PidFile=/var/run/slurm-llnl/slurmdbd.pid
SlurmUser=slurm

</code>
  
  root@slurm-ctrl# systemctl start slurmdbd
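If accounting through slurmdbd is enabled in slurm.conf (AccountingStorageType=accounting_storage/slurmdbd), the cluster also has to be registered once in the accounting database; a hedged example, using the ClusterName value from slurm.conf:

  root@slurm-ctrl# sacctmgr add cluster cluster
  root@slurm-ctrl# sacctmgr show cluster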
=== Authentication ===

Copy /etc/munge/munge.key to all compute nodes

  scp /etc/munge/munge.key csadmin@10.7.20.98:/tmp/.

Allow password-less access from slurm-ctrl to the compute nodes

  csadmin@slurm-ctrl:~$ ssh-copy-id -i .ssh/id_rsa.pub 10.7.20.102

Run a job from slurm-ctrl

  ssh csadmin@slurm-ctrl
  srun -N 1 hostname
  linux1
  
=== Test munge ===
  
  munge -n | unmunge | grep STATUS
  STATUS:           Success (0)
  munge -n | ssh slurm-ctrl unmunge | grep STATUS
  STATUS:           Success (0)
  
=== Test Slurm ===
  
  sinfo
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  debug*       up   infinite      1   idle linux1
  
If a compute node is down, bring it back to service with scontrol:

<code>
sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2   down gpu[02-03]
</code>

  scontrol update nodename=gpu02 state=idle
  scontrol update nodename=gpu03 state=idle
  scontrol update nodename=gpu02 state=resume

<code>
sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2   idle gpu[02-03]
</code>


===== Compute Nodes =====
  
A compute node is a machine which receives jobs to execute, sent from the controller; it runs the slurmd service.
  
{{:tech:slurm-hpc-cluster_compute-node.png?400|}}
  
=== Install slurm and munge ===
  
  ssh -l csadmin <compute-nodes> 10.7.20.109 10.7.20.110
  sudo apt install slurm-wlm libmunge-dev libmunge2 munge

  sudo vi /lib/systemd/system/slurmd.service

<code>
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf
Documentation=man:slurmd(8)

[Service]
Type=forking
EnvironmentFile=-/etc/default/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecStartPost=/bin/sleep 2
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm-llnl/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity

[Install]
WantedBy=multi-user.target
</code>

  sudo systemctl enable slurmd
  sudo systemctl enable munge
  sudo systemctl start slurmd
  sudo systemctl start munge

Generate ssh keys

  ssh-keygen

Copy the ssh key to slurm-ctrl

  ssh-copy-id -i ~/.ssh/id_rsa.pub csadmin@slurm-ctrl.inf.unibz.it

Become root to do important things:

  sudo -i
  vi /etc/hosts

Add the lines below to the /etc/hosts file

<code>
10.7.20.97      slurm-ctrl.inf.unibz.it slurm-ctrl
10.7.20.98      linux1.inf.unibz.it     linux1
</code>

The munge key was copied earlier from slurm-ctrl to the compute nodes; now fix its location, owner and permissions.

  mv /tmp/munge.key /etc/munge/.
  chown munge:munge /etc/munge/munge.key
  chmod 400 /etc/munge/munge.key

Place /etc/slurm-llnl/slurm.conf in the right place,

  mv /tmp/slurm.conf /etc/slurm-llnl/
  chown root: /etc/slurm-llnl/slurm.conf
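With the munge key and slurm.conf in place, restart the daemons on the compute node:

  systemctl restart munge slurmd

The node should then show up as idle when running sinfo on slurm-ctrl:

  sinfo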
===== Links =====

[[https://slurm.schedmd.com/overview.html|Slurm Workload Manager Overview]]

[[https://github.com/mknoxnv/ubuntu-slurm|Steps to create a small slurm cluster with GPU enabled nodes]]

[[https://implement.pt/2018/09/slurm-in-ubuntu-clusters-pt1/|Slurm in Ubuntu Clusters Part1]]

[[https://wiki.fysik.dtu.dk/niflheim/SLURM|Slurm batch queueing system]]

[[https://doku.lrz.de/display/PUBLIC/SLURM+Workload+Manager|SLURM Workload Manager]]

[[https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html|Slurm Quick Start Tutorial]]

{{ :tech:9-slurm.pdf |Linux Clusters Institute: Scheduling and Resource Management 2017}}


====== Modules ======

===== GCC =====

This takes a long time!

Commands to run to compile gcc-6.1.0:

  wget https://ftp.gnu.org/gnu/gcc/gcc-6.1.0/gcc-6.1.0.tar.bz2
  tar xfj gcc-6.1.0.tar.bz2
  cd gcc-6.1.0
  ./contrib/download_prerequisites
  ./configure --prefix=/opt/package/gcc/6.1.0 --disable-multilib
  make

After some time an error occurs, and the make process stops!
<code>
...
In file included from ../.././libgcc/unwind-dw2.c:401:0:
./md-unwind-support.h: In function ‘x86_64_fallback_frame_state’:
./md-unwind-support.h:65:47: error: dereferencing pointer to incomplete type ‘struct ucontext’
       sc = (struct sigcontext *) (void *) &uc_->uc_mcontext;
                                               ^~
../.././libgcc/shared-object.mk:14: recipe for target 'unwind-dw2.o' failed
</code>

To fix it, see this [[https://stackoverflow.com/questions/46999900/how-to-compile-gcc-6-4-0-with-gcc-7-2-in-archlinux|solution]]

  vi /opt/packages/gcc-6.1.0/x86_64-pc-linux-gnu/libgcc/md-unwind-support.h

and replace/comment out line 61 with this:

<code>
ucontext_t *uc_ = context->cfa;
</code>

old line: /* struct ucontext *uc_ = context->cfa; */
  make

Next error:

<code>
../../.././libsanitizer/sanitizer_common/sanitizer_stoptheworld_linux_libcdep.cc:270:22: error: aggregate ‘sigaltstack handler_stack’ has incomplete type and cannot be defined
   struct sigaltstack handler_stack;

</code>

To fix it, see this [[https://github.com/llvm-mirror/compiler-rt/commit/8a5e425a68de4d2c80ff00a97bbcb3722a4716da?diff=unified|solution]]
or [[https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81066]]

Amend the files according to the solutions above!

Next error:
<code>
...
checking for unzip... unzip
configure: error: cannot find neither zip nor jar, cannot continue
Makefile:23048: recipe for target 'configure-target-libjava' failed
...
...
</code>

  apt install unzip zip

and run make again!

  make
Next error:

<code>
...
In file included from ../.././libjava/prims.cc:26:0:
../.././libjava/prims.cc: In function ‘void _Jv_catch_fpe(int, siginfo_t*, void*)’:
./include/java-signal.h:32:26: error: invalid use of incomplete type ‘struct _Jv_catch_fpe(int, siginfo_t*, void*)::ucontext’
   gregset_t &_gregs = _uc->uc_mcontext.gregs;    \
...
</code>

Edit the file: /opt/packages/gcc-6.1.0/x86_64-pc-linux-gnu/libjava/include/java-signal.h

  vi /opt/packages/gcc-6.1.0/x86_64-pc-linux-gnu/libjava/include/java-signal.h

<note warning>Not enough, more errors follow!</note>

<code>
// kh
  ucontext_t *_uc = (ucontext_t *)_p;                             \
  //struct ucontext *_uc = (struct ucontext *)_p;                               \
// kh

</code>

Next error:
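Once the build eventually succeeds and is installed with make install under /opt/package/gcc/6.1.0, the compiler can be exposed to users through an Environment Modules modulefile (see the Modules links below); a minimal sketch, where the modulefile location and the list of exported variables are assumptions:

<code>
#%Module1.0
## Example modulefile for the gcc 6.1.0 build above
module-whatis "GCC 6.1.0 (installed under /opt/package/gcc/6.1.0)"

set root /opt/package/gcc/6.1.0
prepend-path PATH            $root/bin
prepend-path MANPATH         $root/share/man
prepend-path LD_LIBRARY_PATH $root/lib64
</code>

Users would then load it with:

  module load gcc/6.1.0
  gcc --version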
  
  
  
  
===== Links =====

http://www.walkingrandomly.com/?p=5680

https://modules.readthedocs.io/en/latest/index.html