tech:slurm
Differences
This shows you the differences between two versions of the page.
tech:slurm [2020/02/10 16:25] – [Controller] kohofer | tech:slurm [2020/04/24 11:41] – kohofer | ||
---|---|---|---|
Line 14: | Line 14: | ||
===== Installation ===== | ===== Installation ===== | ||
- | |||
- | ==== Controller ==== | ||
===== Controller name: slurm-ctrl ===== | ===== Controller name: slurm-ctrl ===== | ||
Line 131: | Line 129: | ||
EnvironmentFile=-/ | EnvironmentFile=-/ | ||
ExecStart=/ | ExecStart=/ | ||
+ | ExecStartPost=/ | ||
ExecReload=/ | ExecReload=/ | ||
PIDFile=/ | PIDFile=/ | ||
Line 152: | Line 151: | ||
EnvironmentFile=-/ | EnvironmentFile=-/ | ||
ExecStart=/ | ExecStart=/ | ||
+ | ExecStartPost=/ | ||
ExecReload=/ | ExecReload=/ | ||
PIDFile=/ | PIDFile=/ | ||
Line 241: | Line 241: | ||
debug* | debug* | ||
- | ==== Compute Nodes ==== | + | If a compute node is down: |
+ | |||
+ | < | ||
+ | sinfo -a | ||
+ | PARTITION AVAIL TIMELIMIT | ||
+ | debug* | ||
+ | </ | ||
+ | |||
+ | scontrol update nodename=gpu02 state=idle | ||
+ | scontrol update nodename=gpu03 state=idle | ||
+ | scontrol update nodename=gpu02 state=resume | ||
+ | |||
+ | < | ||
+ | sinfo -a | ||
+ | PARTITION AVAIL TIMELIMIT | ||
+ | debug* | ||
+ | </ | ||
+ | |||
+ | |||
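The manual steps above (check `sinfo`, then run `scontrol update` per node) could be scripted roughly as follows. This is a minimal sketch, not part of the original page: the function names `down_nodes` and `resume_commands` are hypothetical, and it only prints the `scontrol` commands as a dry run so they can be reviewed before being executed.

```shell
# Print the names of nodes reported as "down" (or "down*", i.e. not responding).
# Reads "NODENAME STATE" lines on stdin, e.g. from: sinfo -h -N -o "%N %t"
down_nodes() {
    awk '$2 == "down" || $2 == "down*" { print $1 }' | sort -u
}

# Dry run: print one scontrol command per down node instead of executing it.
resume_commands() {
    down_nodes | while read -r node; do
        echo "scontrol update nodename=$node state=resume"
    done
}

# Usage on the controller (assumes sinfo is in PATH):
#   sinfo -h -N -o "%N %t" | resume_commands
```

Piping the printed commands to `sudo sh` would apply them; keeping that step separate makes it harder to resume a node that was taken down on purpose.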
+ | ===== Compute Nodes ===== | ||
A compute node is a machine that receives jobs to execute from the Controller; it runs the slurmd service. | A compute node is a machine that receives jobs to execute from the Controller; it runs the slurmd service. | ||
Line 251: | Line 271: | ||
ssh -l csadmin < | ssh -l csadmin < | ||
sudo apt install slurm-wlm libmunge-dev libmunge2 munge | sudo apt install slurm-wlm libmunge-dev libmunge2 munge | ||
+ | |||
+ | sudo vi / | ||
+ | |||
+ | < | ||
+ | [Unit] | ||
+ | Description=Slurm node daemon | ||
+ | After=network.target munge.service | ||
+ | ConditionPathExists=/ | ||
+ | Documentation=man: | ||
+ | |||
+ | [Service] | ||
+ | Type=forking | ||
+ | EnvironmentFile=-/ | ||
+ | ExecStart=/ | ||
+ | ExecStartPost=/ | ||
+ | ExecReload=/ | ||
+ | PIDFile=/ | ||
+ | KillMode=process | ||
+ | LimitNOFILE=51200 | ||
+ | LimitMEMLOCK=infinity | ||
+ | LimitSTACK=infinity | ||
+ | |||
+ | [Install] | ||
+ | WantedBy=multi-user.target | ||
+ | </ | ||
+ | |||
sudo systemctl enable slurmd | sudo systemctl enable slurmd | ||
sudo systemctl enable munge | sudo systemctl enable munge | ||
Line 261: | Line 307: | ||
ssh-keygen | ssh-keygen | ||
- | Copy ssh-keys to slurm-ctrl | + | Copy ssh-keys to slurm-ctrl |
- | ssh-copy-id -i ~/ | + | ssh-copy-id -i ~/ |
Become root to do important things: | Become root to do important things: | ||
Line 307: | Line 353: | ||
{{ : | {{ : | ||
+ | |||
+ | |||
+ | ====== Modules ====== | ||
+ | |||
+ | ===== GCC ===== | ||
+ | |||
+ | Commands to compile gcc-6.1.0: | ||
+ | |||
+ | wget https:// | ||
+ | tar xfj gcc-6.1.0.tar.bz2 | ||
+ | cd gcc-6.1.0 | ||
+ | ./ | ||
+ | ./configure --prefix=/ | ||
+ | make | ||
+ | |||
+ | In file included from ../ | ||
+ | ./ | ||
+ | ./ | ||
+ | sc = (struct sigcontext *) (void *) & | ||
+ | ^~ | ||
+ | ../ | ||
+ | |||
+ | To fix this error, see: | ||
+ | https:// | ||
+ | |||
+ | vi / | ||
+ | |||
+ | and replace line 61 with this: | ||
+ | |||
+ | struct ucontext_t *uc_ = context-> | ||
+ | |||
+ | or comment out the old line: /* struct ucontext *uc_ = context-> | ||
+ | |||
+ | Then run make again. | ||
+ | |||
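The manual edit described above can also be done non-interactively. The following is a minimal sketch under stated assumptions: `patch_ucontext` is a hypothetical helper, and the header path is passed as an argument because the page does not show the full path. It rewrites only the `struct ucontext *uc_` declaration and keeps a backup copy next to the file.

```shell
# Replace the old "struct ucontext" declaration of uc_ with "struct ucontext_t".
# $1: path to the header to patch (hypothetical argument; the real path is
#     elided on this page and must be supplied by the caller).
patch_ucontext() {
    file=$1
    cp -- "$file" "$file.orig"    # keep a backup of the unpatched header
    sed 's/struct ucontext \*uc_/struct ucontext_t *uc_/' "$file.orig" > "$file"
}

# Usage (path is a placeholder):
#   patch_ucontext path/to/header.h && make
```

Because the substitution matches the declaration text rather than a line number, it still works if the header's line numbering differs from the version described here.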
+ | |||
+ | |||
+ | ===== Links ===== | ||
+ | |||
+ | http:// | ||
+ | |||
+ | https:// |
/data/www/wiki.inf.unibz.it/data/pages/tech/slurm.txt · Last modified: 2022/11/24 16:17 by kohofer