User Tools

Site Tools


tech:slurm

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
Next revisionBoth sides next revision
tech:slurm [2020/02/07 10:34] – [Compute Nodes] kohofertech:slurm [2020/02/11 10:46] – [Compute Nodes] kohofer
Line 15: Line 15:
 ===== Installation ===== ===== Installation =====
  
-==== Controller ==== +===== Controller name: slurm-ctrl =====
- +
-Controller name: slurm-ctrl+
  
 Install slurm-wlm and tools Install slurm-wlm and tools
Line 114: Line 112:
 </code> </code>
  
-  root@slurm-ctrl# scp /etc/slurm-llnl/slurm.conf csadmin@10.7.20.98:/tmp/.; scp /etc/slurm-llnl/slurm.conf csadmin@10.7.20.102:/tmp/.+Copy slurm.conf to compute nodes! 
 + 
 +  root@slurm-ctrl# scp /etc/slurm-llnl/slurm.conf csadmin@10.7.20.109:/tmp/.; scp /etc/slurm-llnl/slurm.conf csadmin@10.7.20.110:/tmp/. 
 + 
 +  vi /lib/systemd/system/slurmctld.service 
 +   
 +<code> 
 +[Unit] 
 +Description=Slurm controller daemon 
 +After=network.target munge.service 
 +ConditionPathExists=/etc/slurm-llnl/slurm.conf 
 +Documentation=man:slurmctld(8) 
 + 
 +[Service] 
 +Type=forking 
 +EnvironmentFile=-/etc/default/slurmctld 
 +ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS 
 +ExecStartPost=/bin/sleep 2 
 +ExecReload=/bin/kill -HUP $MAINPID 
 +PIDFile=/var/run/slurm-llnl/slurmctld.pid 
 + 
 +[Install] 
 +WantedBy=multi-user.target 
 + 
 +</code> 
 + 
 +  vi /lib/systemd/system/slurmd.service 
 + 
 +<code> 
 +[Unit] 
 +Description=Slurm node daemon 
 +After=network.target munge.service 
 +ConditionPathExists=/etc/slurm-llnl/slurm.conf 
 +Documentation=man:slurmd(8) 
 + 
 +[Service] 
 +Type=forking 
 +EnvironmentFile=-/etc/default/slurmd 
 +ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS 
 +ExecStartPost=/bin/sleep 2 
 +ExecReload=/bin/kill -HUP $MAINPID 
 +PIDFile=/var/run/slurm-llnl/slurmd.pid 
 +KillMode=process 
 +LimitNOFILE=51200 
 +LimitMEMLOCK=infinity 
 +LimitSTACK=infinity 
 + 
 +[Install] 
 +WantedBy=multi-user.target 
 +</code> 
 + 
 +   
 +  root@slurm-ctrl# systemctl daemon-reload 
 +  root@slurm-ctrl# systemctl enable slurmdbd 
 +  root@slurm-ctrl# systemctl start slurmdbd 
 +  root@slurm-ctrl# systemctl enable slurmctld
   root@slurm-ctrl# systemctl start slurmctld   root@slurm-ctrl# systemctl start slurmctld
 +
  
 === Accounting Storage === === Accounting Storage ===
Line 187: Line 241:
   debug*       up   infinite      1   idle linux1   debug*       up   infinite      1   idle linux1
  
-==== Compute Nodes ====+If computer node is down 
 + 
 +<code> 
 +sinfo -a 
 +PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
 +debug*       up   infinite      2   down gpu[02-03] 
 +</code> 
 + 
 +  scontrol update nodename=gpu02 state=idle 
 +  scontrol update nodename=gpu03 state=idle 
 + 
 +<code> 
 +sinfo -a 
 +PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
 +debug*       up   infinite      2   idle gpu[02-03] 
 +</code> 
 + 
 + 
 +===== Compute Nodes ====
  
 A compute node is a machine which will receive jobs to execute, sent from the Controller, it runs the slurmd service. A compute node is a machine which will receive jobs to execute, sent from the Controller, it runs the slurmd service.
Line 197: Line 270:
   ssh -l csadmin <compute-nodes> 10.7.20.109 10.7.20.110   ssh -l csadmin <compute-nodes> 10.7.20.109 10.7.20.110
   sudo apt install slurm-wlm libmunge-dev libmunge2 munge   sudo apt install slurm-wlm libmunge-dev libmunge2 munge
 +
 +  sudo vi /lib/systemd/system/slurmd.service
 +
 +<code>
 +[Unit]
 +Description=Slurm node daemon
 +After=network.target munge.service
 +ConditionPathExists=/etc/slurm-llnl/slurm.conf
 +Documentation=man:slurmd(8)
 +
 +[Service]
 +Type=forking
 +EnvironmentFile=-/etc/default/slurmd
 +ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
 +ExecStartPost=/bin/sleep 2
 +ExecReload=/bin/kill -HUP $MAINPID
 +PIDFile=/var/run/slurm-llnl/slurmd.pid
 +KillMode=process
 +LimitNOFILE=51200
 +LimitMEMLOCK=infinity
 +LimitSTACK=infinity
 +
 +[Install]
 +WantedBy=multi-user.target
 +</code>
 +
   sudo systemctl enable slurmd   sudo systemctl enable slurmd
   sudo systemctl enable munge   sudo systemctl enable munge
Line 207: Line 306:
   ssh-keygen   ssh-keygen
  
-Copy ssh-keys to slurm-ctrl (using IP, because no DNS in place)+Copy ssh-keys to slurm-ctrl 
  
-  ssh-copy-id -i ~/.ssh/id_rsa.pub csadmin@10.7.20.97:+  ssh-copy-id -i ~/.ssh/id_rsa.pub csadmin@slurm-ctrl.inf.unibz.it:
  
 Become root to do important things: Become root to do important things:
/data/www/wiki.inf.unibz.it/data/pages/tech/slurm.txt · Last modified: 2022/11/24 16:17 by kohofer