tech:slurm
Differences
This shows you the differences between two versions of the page.
tech:slurm [2019/09/06 15:33] – [Controller] kohofer | tech:slurm [2020/04/29 09:17] – [Python] kohofer | ||
---|---|---|---|
Line 15: | Line 15: | ||
===== Installation ===== | ===== Installation ===== | ||
- | ==== Controller | + | ===== Controller name: slurm-ctrl |
- | + | ||
- | Controller name: slurm-ctrl | + | |
Install slurm-wlm and tools | Install slurm-wlm and tools | ||
ssh slurm-ctrl | ssh slurm-ctrl | ||
- | apt install slurm-wlm slurm-wlm-doc mailutils | + | apt install slurm-wlm slurm-wlm-doc mailutils mariadb-client mariadb-server libmariadb-dev python-dev python-mysqldb |
=== Install Maria DB Server === | === Install Maria DB Server === | ||
Line 51: | Line 49: | ||
=== Central Controller === | === Central Controller === | ||
- | The main configuration file is / | + | The main configuration file is / |
- | vi / | + | vi / |
< | < | ||
Line 114: | Line 112: | ||
</ | </ | ||
- | | + | Copy slurm.conf to compute nodes! |
+ | |||
+ | | ||
+ | |||
+ | vi / | ||
+ | |||
+ | < | ||
+ | [Unit] | ||
+ | Description=Slurm | ||
+ | After=network.target munge.service | ||
+ | ConditionPathExists=/ | ||
+ | Documentation=man: | ||
+ | |||
+ | [Service] | ||
+ | Type=forking | ||
+ | EnvironmentFile=-/ | ||
+ | ExecStart=/ | ||
+ | ExecStartPost=/ | ||
+ | ExecReload=/ | ||
+ | PIDFile=/ | ||
+ | |||
+ | [Install] | ||
+ | WantedBy=multi-user.target | ||
+ | |||
+ | </ | ||
+ | |||
+ | vi / | ||
+ | |||
+ | < | ||
+ | [Unit] | ||
+ | Description=Slurm node daemon | ||
+ | After=network.target munge.service | ||
+ | ConditionPathExists=/ | ||
+ | Documentation=man: | ||
+ | |||
+ | [Service] | ||
+ | Type=forking | ||
+ | EnvironmentFile=-/ | ||
+ | ExecStart=/ | ||
+ | ExecStartPost=/ | ||
+ | ExecReload=/ | ||
+ | PIDFile=/ | ||
+ | KillMode=process | ||
+ | LimitNOFILE=51200 | ||
+ | LimitMEMLOCK=infinity | ||
+ | LimitSTACK=infinity | ||
+ | |||
+ | [Install] | ||
+ | WantedBy=multi-user.target | ||
+ | </ | ||
+ | |||
+ | |||
+ | root@slurm-ctrl# | ||
+ | root@slurm-ctrl# | ||
+ | root@slurm-ctrl# | ||
+ | root@slurm-ctrl# | ||
+ | root@slurm-ctrl# systemctl start slurmctld | ||
=== Accounting Storage === | === Accounting Storage === | ||
Line 153: | Line 208: | ||
</ | </ | ||
- | root@controller# systemctl start slurmdbd | + | root@slurm-ctrl# systemctl start slurmdbd |
=== Authentication === | === Authentication === | ||
Line 159: | Line 214: | ||
Copy / | Copy / | ||
- | scp / | + | scp / |
+ | |||
+ | Allow password-less access to slurm-ctrl | ||
+ | |||
+ | csadmin@slurm-ctrl: | ||
| | ||
Run a job from slurm-ctrl | Run a job from slurm-ctrl | ||
Line 182: | Line 241: | ||
debug* | debug* | ||
- | ==== Compute Nodes ==== | + | If a compute node is down |
+ | |||
+ | < | ||
+ | sinfo -a | ||
+ | PARTITION AVAIL TIMELIMIT | ||
+ | debug* | ||
+ | </ | ||
+ | |||
+ | scontrol update nodename=gpu02 state=idle | ||
+ | scontrol update nodename=gpu03 state=idle | ||
+ | scontrol update nodename=gpu02 state=resume | ||
+ | |||
+ | < | ||
+ | sinfo -a | ||
+ | PARTITION AVAIL TIMELIMIT | ||
+ | debug* | ||
+ | </ | ||
+ | |||
+ | |||
+ | ===== Compute Nodes ===== | ||
A compute node is a machine that receives jobs to execute from the controller; it runs the slurmd service. | A compute node is a machine that receives jobs to execute from the controller; it runs the slurmd service. | ||
Line 188: | Line 267: | ||
{{: | {{: | ||
- | ==== Installation ==== | + | === Installation |
+ | |||
+ | ssh -l csadmin < | ||
+ | sudo apt install slurm-wlm libmunge-dev libmunge2 munge | ||
+ | |||
+ | sudo vi / | ||
+ | |||
+ | < | ||
+ | [Unit] | ||
+ | Description=Slurm node daemon | ||
+ | After=network.target munge.service | ||
+ | ConditionPathExists=/ | ||
+ | Documentation=man: | ||
+ | |||
+ | [Service] | ||
+ | Type=forking | ||
+ | EnvironmentFile=-/ | ||
+ | ExecStart=/ | ||
+ | ExecStartPost=/ | ||
+ | ExecReload=/ | ||
+ | PIDFile=/ | ||
+ | KillMode=process | ||
+ | LimitNOFILE=51200 | ||
+ | LimitMEMLOCK=infinity | ||
+ | LimitSTACK=infinity | ||
+ | |||
+ | [Install] | ||
+ | WantedBy=multi-user.target | ||
+ | </ | ||
+ | |||
+ | sudo systemctl enable slurmd | ||
+ | sudo systemctl enable munge | ||
+ | sudo systemctl start slurmd | ||
+ | sudo systemctl start munge | ||
- | ssh -l csadmin 10.7.20.102 | ||
- | sudo apt install slurm-wlm | ||
- | |||
Generate ssh keys | Generate ssh keys | ||
+ | |||
ssh-keygen | ssh-keygen | ||
- | Copy ssh-keys to slurm-ctrl | + | Copy ssh-keys to slurm-ctrl |
- | ssh-copy-id -i ~/ | + | ssh-copy-id -i ~/ |
Become root to do important things: | Become root to do important things: | ||
Line 212: | Line 323: | ||
</ | </ | ||
+ | First copy the munge key from slurm-ctrl to all compute nodes, then fix its location, | ||
+ | owner and permissions. | ||
+ | mv / | ||
+ | chown munge:munge / | ||
+ | chmod 400 / | ||
+ | Place / | ||
+ | |||
+ | mv / | ||
+ | chown root: / | ||
+ | |||
+ | | ||
Line 227: | Line 349: | ||
[[https:// | [[https:// | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | {{ : | ||
+ | |||
+ | |||
+ | ====== Modules ====== | ||
+ | |||
+ | ===== Python ===== | ||
+ | |||
+ | ==== Python 3.7.7 ==== | ||
+ | |||
+ | |||
+ | cd / | ||
+ | mkdir / | ||
+ | wget https:// | ||
+ | tar xfJ Python-3.7.7.tar.xz | ||
+ | cd Python-3.7.7/ | ||
+ | ./configure --prefix=/ | ||
+ | make | ||
+ | make install | ||
+ | | ||
+ | |||
+ | ==== Python 2.7.18 ==== | ||
+ | |||
+ | |||
+ | cd / | ||
+ | mkdir / | ||
+ | wget https:// | ||
+ | cd Python-2.7.18 | ||
+ | ./configure --prefix=/ | ||
+ | make | ||
+ | make install | ||
+ | |||
+ | ==== Create modules file ==== | ||
+ | |||
+ | |||
+ | cd / | ||
+ | vi python-2.7.18 | ||
+ | |||
+ | < | ||
+ | #%Module1.0 | ||
+ | proc ModulesHelp { } { | ||
+ | global dotversion | ||
+ | |||
+ | puts stderr " | ||
+ | } | ||
+ | |||
+ | module-whatis " | ||
+ | prepend-path PATH / | ||
+ | |||
+ | </ | ||
+ | | ||
+ | |||
+ | |||
+ | |||
+ | ===== GCC ===== | ||
+ | |||
+ | This takes a long time! | ||
+ | |||
+ | Commands to run to compile gcc-6.1.0 | ||
+ | |||
+ | wget https:// | ||
+ | tar xfj gcc-6.1.0.tar.bz2 | ||
+ | cd gcc-6.1.0 | ||
+ | ./ | ||
+ | ./configure --prefix=/ | ||
+ | make | ||
+ | |||
+ | After some time an error occurs, and the make process stops! | ||
+ | < | ||
+ | ... | ||
+ | In file included from ../ | ||
+ | ./ | ||
+ | ./ | ||
+ | sc = (struct sigcontext *) (void *) & | ||
+ | ^~ | ||
+ | ../ | ||
+ | </ | ||
+ | |||
+ | To fix do: [[https:// | ||
+ | |||
+ | vi / | ||
+ | |||
+ | and replace/ | ||
+ | |||
+ | < | ||
+ | struct ucontext_t *uc_ = context-> | ||
+ | </ | ||
+ | |||
+ | old line: /* struct ucontext *uc_ = context-> | ||
+ | |||
+ | make | ||
+ | |||
+ | Next error: | ||
+ | |||
+ | < | ||
+ | ../ | ||
+ | | ||
+ | |||
+ | </ | ||
+ | |||
+ | To fix see: [[https:// | ||
+ | or [[https:// | ||
+ | |||
+ | Amend the files according to the solution linked above! |
+ | |||
+ | Next error: | ||
+ | |||
+ | < | ||
+ | ... | ||
+ | checking for unzip... unzip | ||
+ | configure: error: cannot find neither zip nor jar, cannot continue | ||
+ | Makefile: | ||
+ | ... | ||
+ | ... | ||
+ | </ | ||
+ | |||
+ | apt install unzip zip | ||
+ | |||
+ | and run make again! | ||
+ | |||
+ | make | ||
+ | |||
+ | Next error: | ||
+ | |||
+ | < | ||
+ | ... | ||
+ | In file included from ../ | ||
+ | ../ | ||
+ | ./ | ||
+ | | ||
+ | ... | ||
+ | </ | ||
+ | |||
+ | Edit the file: / | ||
+ | |||
+ | vi / | ||
+ | |||
+ | <note warning> | ||
+ | |||
+ | < | ||
+ | // kh | ||
+ | ucontext_t *_uc = (ucontext_t *)_p; \ | ||
+ | //struct ucontext *_uc = (struct ucontext *)_p; \ | ||
+ | // kh | ||
+ | |||
+ | </ | ||
+ | |||
+ | Next error: | ||
+ | |||
+ | <code php> | ||
+ | ... | ||
+ | In file included from ../ | ||
+ | ./ | ||
+ | // | ||
+ | | ||
+ | ../ | ||
+ | ./ | ||
+ | | ||
+ | | ||
+ | ../ | ||
+ | | ||
+ | | ||
+ | ../ | ||
+ | | ||
+ | | ||
+ | ../ | ||
+ | ../ | ||
+ | | ||
+ | | ||
+ | ../ | ||
+ | ../ | ||
+ | ../ | ||
+ | | ||
+ | ... | ||
+ | </ | ||
+ | |||
+ | |||
+ | |||
+ | ===== Links ===== | ||
+ | |||
+ | http:// | ||
+ | |||
+ | https:// |
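
As a supplement to the node recovery step earlier on this page (the `scontrol update nodename=... state=resume` commands), the following is a minimal sketch of how those commands could be generated for a list of node names instead of being typed one by one. The function name `resume_commands` is hypothetical, and the script only prints the commands rather than running them, so the output can be reviewed first. The `sinfo` invocation mentioned in the comment is one possible way to obtain the list of down nodes and is an assumption, not something taken from this page.

```shell
#!/bin/sh
# Hypothetical helper: read node names from stdin (one per line) and print
# the matching "scontrol update" command for each. Nothing is executed.
resume_commands() {
    while IFS= read -r node; do
        # Skip empty lines in the input
        [ -n "$node" ] || continue
        printf 'scontrol update nodename=%s state=resume\n' "$node"
    done
}

# Example with the node names used on this page.
# On a live cluster the list could instead come from something like
#   sinfo -h -N -t down -o "%N" | sort -u
# (assumption: adjust the state filter to your situation).
printf 'gpu02\ngpu03\n' | resume_commands
```

After checking the printed commands, they can be run as root by piping the output to `sh`.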
/data/www/wiki.inf.unibz.it/data/pages/tech/slurm.txt · Last modified: 2022/11/24 16:17 by kohofer