tech:slurm

Differences

This shows you the differences between two versions of the page.

tech:slurm [2019/09/06 16:38] – [Links] kohofer
tech:slurm [2020/02/10 17:07] kohofer
Line 17: Line 17:
 ==== Controller ==== ==== Controller ====
  
-Controller name: slurm-ctrl+===== Controller name: slurm-ctrl =====
  
 Install slurm-wlm and tools Install slurm-wlm and tools
  
   ssh slurm-ctrl   ssh slurm-ctrl
-  apt install slurm-wlm slurm-wlm-doc mailutils sview mariadb-client mariadb-server libmariadb-dev python-dev python-mysqldb+  apt install slurm-wlm slurm-wlm-doc mailutils mariadb-client mariadb-server libmariadb-dev python-dev python-mysqldb
  
 === Install Maria DB Server === === Install Maria DB Server ===
Line 114: Line 114:
 </code> </code>
  
-  root@slurm-ctrl# scp /etc/slurm-llnl/slurm.conf csadmin@10.7.20.98:/tmp/.; scp /etc/slurm-llnl/slurm.conf csadmin@10.7.20.102:/tmp/.+Copy slurm.conf to compute nodes! 
 + 
 +  root@slurm-ctrl# scp /etc/slurm-llnl/slurm.conf csadmin@10.7.20.109:/tmp/.; scp /etc/slurm-llnl/slurm.conf csadmin@10.7.20.110:/tmp/. 
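With more than two compute nodes, the copy step above can be generated per node rather than typed out. The following is a minimal sketch, not part of the original page: the function name `copy_conf_cmds` and the `CONF` and `REMOTE_USER` variables are hypothetical, and their defaults are simply the path and user that appear in the scp command above. It only prints the commands, so they can be reviewed before anything is run.

```shell
# Defaults taken from the scp command on this page; override via environment.
CONF=${CONF:-/etc/slurm-llnl/slurm.conf}
REMOTE_USER=${REMOTE_USER:-csadmin}

# Print one scp command per compute node address given as an argument.
# Nothing is executed here; the function only writes the commands to stdout.
copy_conf_cmds() {
    for node in "$@"; do
        printf 'scp %s %s@%s:/tmp/.\n' "$CONF" "$REMOTE_USER" "$node"
    done
}

# Review the output first, then pipe it to sh to actually copy:
#   copy_conf_cmds 10.7.20.109 10.7.20.110
#   copy_conf_cmds 10.7.20.109 10.7.20.110 | sh
```

As in the original command, the file lands in /tmp on each node and still has to be moved into place there.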
 + 
 +  vi /lib/systemd/system/slurmctld.service 
 +   
 +<code> 
 +[Unit] 
 +Description=Slurm controller daemon 
 +After=network.target munge.service 
 +ConditionPathExists=/etc/slurm-llnl/slurm.conf 
 +Documentation=man:slurmctld(8) 
 + 
 +[Service] 
 +Type=forking 
 +EnvironmentFile=-/etc/default/slurmctld 
 +ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS 
 +ExecReload=/bin/kill -HUP $MAINPID 
 +PIDFile=/var/run/slurm-llnl/slurmctld.pid 
 + 
 +[Install] 
 +WantedBy=multi-user.target 
 + 
 +</code> 
 + 
 +  vi /lib/systemd/system/slurmd.service 
 + 
 +<code> 
 +[Unit] 
 +Description=Slurm node daemon 
 +After=network.target munge.service 
 +ConditionPathExists=/etc/slurm-llnl/slurm.conf 
 +Documentation=man:slurmd(8) 
 + 
 +[Service] 
 +Type=forking 
 +EnvironmentFile=-/etc/default/slurmd 
 +ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS 
 +ExecReload=/bin/kill -HUP $MAINPID 
 +PIDFile=/var/run/slurm-llnl/slurmd.pid 
 +KillMode=process 
 +LimitNOFILE=51200 
 +LimitMEMLOCK=infinity 
 +LimitSTACK=infinity 
 + 
 +[Install] 
 +WantedBy=multi-user.target 
 +</code> 
 + 
 +   
 +  root@slurm-ctrl# systemctl daemon-reload 
 +  root@slurm-ctrl# systemctl enable slurmdbd 
 +  root@slurm-ctrl# systemctl start slurmdbd 
 +  root@slurm-ctrl# systemctl enable slurmctld
   root@slurm-ctrl# systemctl start slurmctld   root@slurm-ctrl# systemctl start slurmctld
 +
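After enabling and starting the daemons, it can help to confirm that each unit is actually running. The following is a small sketch under the assumption that the controller uses systemd as shown above; the helper name `check_units` is hypothetical and not from the original page. It relies only on `systemctl is-active --quiet`, which returns zero when a unit is active.

```shell
# Print the state of each given unit.
# Returns non-zero if at least one unit is not active.
check_units() {
    rc=0
    for unit in "$@"; do
        if systemctl is-active --quiet "$unit"; then
            printf '%s: active\n' "$unit"
        else
            printf '%s: NOT active\n' "$unit"
            rc=1
        fi
    done
    return "$rc"
}

# On the controller, for example:
#   check_units munge slurmdbd slurmctld
```

If a unit is reported as not active, `systemctl status <unit>` and `journalctl -u <unit>` are the usual next places to look.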
  
 === Accounting Storage === === Accounting Storage ===
Line 186: Line 240:
   PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST   PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
   debug*       up   infinite      1   idle linux1   debug*       up   infinite      1   idle linux1
 +
 +If a compute node is down
 +
 +<code>
 +sinfo -a
 +PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
 +debug*       up   infinite      2   down gpu[02-03]
 +</code>
 +
 +  scontrol update nodename=gpu02 state=idle
 +  scontrol update nodename=gpu03 state=idle
 +
 +<code>
 +sinfo -a
 +PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
 +debug*       up   infinite      2   idle gpu[02-03]
 +</code>
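When several nodes are down, the `scontrol update` lines above can be generated from a list of node names. This is a hedged sketch: the function name `resume_cmds` is hypothetical, and it assumes the node names arrive one per line on stdin (for instance from `sinfo -N -h -t down -o '%N'`, which prints one node per line without a header). It keeps `state=idle` as used on this page and only prints the commands.

```shell
# Read node names from stdin (one per line) and print the matching
# "scontrol update" command for each. Blank lines are skipped.
resume_cmds() {
    while IFS= read -r node; do
        [ -n "$node" ] || continue
        printf 'scontrol update nodename=%s state=idle\n' "$node"
    done
}

# On the controller (assumes sinfo and scontrol are on PATH):
#   sinfo -N -h -t down -o '%N' | sort -u | resume_cmds        # review
#   sinfo -N -h -t down -o '%N' | sort -u | resume_cmds | sh   # apply
```

The `sort -u` is there because `sinfo -N` can list a node once per partition it belongs to.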
 +
 +
  
 ==== Compute Nodes ==== ==== Compute Nodes ====
Line 193: Line 266:
 {{:tech:slurm-hpc-cluster_compute-node.png?400|}} {{:tech:slurm-hpc-cluster_compute-node.png?400|}}
  
-=== Installation ===+=== Installation of slurm and munge === 
 + 
 +  ssh -l csadmin <compute-node>   # repeat for each node: 10.7.20.109, 10.7.20.110
 +  sudo apt install slurm-wlm libmunge-dev libmunge2 munge 
 +  sudo systemctl enable munge 
 +  sudo systemctl start munge 
 +  sudo systemctl enable slurmd 
 +  sudo systemctl start slurmd 
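Since slurmd depends on munge for authentication, a quick local round trip can confirm that munge works on a node after installation. The sketch below is an assumption-laden illustration, not a step from the original page: the helper name `munge_ok` is hypothetical, and it only checks that a credential created with `munge -n` can be decoded by `unmunge` on the same machine.

```shell
# Encode an empty credential and decode it again on the same host.
# The exit status is that of unmunge: zero means the round trip worked.
munge_ok() {
    munge -n | unmunge >/dev/null 2>&1
}

# On a compute node, for example:
#   munge_ok && echo "munge: ok" || echo "munge: FAILED"
```

This does not prove that the node shares a key with the controller; it only shows that the local munge daemon is answering.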
  
-  ssh -l csadmin 10.7.20.102 
-  sudo apt install slurm-wlm 
-  
 Generate ssh keys Generate ssh keys
  
Line 246: Line 324:
  
 [[https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html|Slurm Quick Start Tutorial]] [[https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html|Slurm Quick Start Tutorial]]
 +
 +{{ :tech:9-slurm.pdf |Linux Clusters Institute: Scheduling and Resource Management 2017}}
/data/www/wiki.inf.unibz.it/data/pages/tech/slurm.txt · Last modified: 2022/11/24 16:17 by kohofer