tech:slurm
Differences
This shows you the differences between two versions of the page.
tech:slurm [2020/02/07 12:55] – [Controller] kohofer | tech:slurm [2020/02/11 10:41] – kohofer | ||
---|---|---|---|
Line 15: | Line 15: | ||
===== Installation ===== | ===== Installation ===== | ||
- | ==== Controller | + | ===== Controller name: slurm-ctrl |
- | + | ||
- | Controller name: slurm-ctrl | + | |
Install slurm-wlm and tools | Install slurm-wlm and tools | ||
Line 113: | Line 111: | ||
NodeName=linux1 NodeAddr=10.7.20.98 CPUs=1 State=UNKNOWN | NodeName=linux1 NodeAddr=10.7.20.98 CPUs=1 State=UNKNOWN | ||
</ | </ | ||
+ | |||
+ | Copy slurm.conf to compute nodes! | ||
root@slurm-ctrl# | root@slurm-ctrl# | ||
+ | |||
+ | vi / | ||
+ | | ||
+ | < | ||
+ | [Unit] | ||
+ | Description=Slurm controller daemon | ||
+ | After=network.target munge.service | ||
+ | ConditionPathExists=/ | ||
+ | Documentation=man: | ||
+ | |||
+ | [Service] | ||
+ | Type=forking | ||
+ | EnvironmentFile=-/ | ||
+ | ExecStart=/ | ||
+ | ExecStartPost=/ | ||
+ | ExecReload=/ | ||
+ | PIDFile=/ | ||
+ | |||
+ | [Install] | ||
+ | WantedBy=multi-user.target | ||
+ | |||
+ | </ | ||
+ | |||
+ | vi / | ||
+ | |||
+ | < | ||
+ | [Unit] | ||
+ | Description=Slurm node daemon | ||
+ | After=network.target munge.service | ||
+ | ConditionPathExists=/ | ||
+ | Documentation=man: | ||
+ | |||
+ | [Service] | ||
+ | Type=forking | ||
+ | EnvironmentFile=-/ | ||
+ | ExecStart=/ | ||
+ | ExecStartPost=/ | ||
+ | ExecReload=/ | ||
+ | PIDFile=/ | ||
+ | KillMode=process | ||
+ | LimitNOFILE=51200 | ||
+ | LimitMEMLOCK=infinity | ||
+ | LimitSTACK=infinity | ||
+ | |||
+ | [Install] | ||
+ | WantedBy=multi-user.target | ||
+ | </ | ||
+ | |||
+ | | ||
root@slurm-ctrl# | root@slurm-ctrl# | ||
root@slurm-ctrl# | root@slurm-ctrl# | ||
Line 192: | Line 241: | ||
debug* | debug* | ||
- | ==== Compute Nodes ==== | + | If a compute node is down |
+ | |||
+ | < | ||
+ | sinfo -a | ||
+ | PARTITION AVAIL TIMELIMIT | ||
+ | debug* | ||
+ | </ | ||
+ | |||
+ | scontrol update nodename=gpu02 state=idle | ||
+ | scontrol update nodename=gpu03 state=idle | ||
+ | |||
+ | < | ||
+ | sinfo -a | ||
+ | PARTITION AVAIL TIMELIMIT | ||
+ | debug* | ||
+ | </ | ||
+ | |||
+ | |||
+ | ===== Compute Nodes ===== | ||
A compute node is a machine that receives jobs to execute from the controller and runs the slurmd service. | A compute node is a machine that receives jobs to execute from the controller and runs the slurmd service. |
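
The "If a compute node is down" step in the diff above resets each node by hand with `scontrol update nodename=... state=idle`. With many nodes, that step could be scripted. The following is a minimal sketch, not part of the original page. It assumes `sinfo` and `scontrol` are in the PATH on the controller, and the helper names `list_down_nodes` and `print_resume_commands` are hypothetical. It only prints the commands, so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Hypothetical helper: print the names of nodes whose state starts with "down"
# (this also matches "down*", which sinfo uses for non-responding nodes).
# Reads "NODENAME STATE" lines on stdin, e.g. from: sinfo -N -h -o "%N %T"
list_down_nodes() {
    # A node that belongs to several partitions is listed once per partition,
    # so duplicates are removed with sort -u.
    awk '$2 ~ /^down/ { print $1 }' | sort -u
}

# Hypothetical helper: print (not run) one scontrol command per down node,
# using the same "state=idle" update shown in the diff above.
print_resume_commands() {
    list_down_nodes | while read -r node; do
        printf 'scontrol update nodename=%s state=idle\n' "$node"
    done
}

# Usage on the controller (assumption: run as a user allowed to update nodes):
#   sinfo -N -h -o "%N %T" | print_resume_commands        # review the commands
#   sinfo -N -h -o "%N %T" | print_resume_commands | sh   # apply them
```

Printing the commands instead of running them directly keeps the sketch safe to try: a node that is down because of a real fault should be checked before it is returned to service.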
/data/www/wiki.inf.unibz.it/data/pages/tech/slurm.txt · Last modified: 2022/11/24 16:17 by kohofer