tech:slurm
Differences
This shows you the differences between two versions of the page.
tech:slurm [2019/09/06 11:17] – created kohofer | tech:slurm [2020/02/18 11:03] – [Controller name: slurm-ctrl] kohofer | ||
---|---|---|---|
Line 13: | Line 13: | ||
{{: | {{: | ||
+ | ===== Installation ===== | ||
+ | ===== Controller name: slurm-ctrl ===== | ||
+ | Install slurm-wlm and tools | ||
+ | ssh slurm-ctrl | ||
+ | apt install slurm-wlm slurm-wlm-doc mailutils mariadb-client mariadb-server libmariadb-dev python-dev python-mysqldb | ||
+ | === Install MariaDB Server === | ||
+ | apt-get install mariadb-server | ||
+ | systemctl start mysql | ||
+ | mysql -u root | ||
+ | create database slurm_acct_db; | ||
+ | create user ' | ||
+ | set password for ' | ||
+ | grant usage on *.* to ' | ||
+ | grant all privileges on slurm_acct_db.* to ' | ||
+ | flush privileges; | ||
+ | exit | ||
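The user and host in the statements above are cut off in this revision view. As a rough sketch only, the same sequence can be generated from parameters and piped into `mysql -u root`. The helper name `slurm_db_sql` is hypothetical, the host `localhost` is an assumption, and the user, password and database names in the usage line are the ones that appear in the slurmdbd configuration further down this page. The sketch condenses the user creation and password steps into one `identified by` clause.

```shell
#!/bin/sh
# Hypothetical helper: print SQL that creates an accounting database
# and a user with full privileges on it.
# Arguments: database name, user, host, password (no quoting is done,
# so a password containing a single quote would break the statement).
slurm_db_sql() {
    db=$1 user=$2 host=$3 pass=$4
    cat <<EOF
create database ${db};
create user '${user}'@'${host}' identified by '${pass}';
grant all privileges on ${db}.* to '${user}'@'${host}';
flush privileges;
EOF
}

# Usage (assumed values, not run here):
#   slurm_db_sql slurm_acct_db slurm localhost slurmdbpass | mysql -u root
```

Printing the statements first makes it easy to review them before they touch the database.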
- | https:// | + | In the file / |
+ | |||
+ | vi / | ||
+ | bind-address = localhost | ||
+ | |||
+ | === Node Authentication === | ||
+ | |||
+ | First, let us configure the default options for the munge service: | ||
+ | |||
+ | vi / | ||
+ | OPTIONS=" | ||
+ | |||
+ | === Central Controller === | ||
+ | |||
+ | The main configuration file is / | ||
+ | |||
+ | vi / | ||
+ | |||
+ | < | ||
+ | ############################### | ||
+ | # / | ||
+ | ############################### | ||
+ | # slurm.conf file generated by configurator easy.html. | ||
+ | # Put this file on all nodes of your cluster. | ||
+ | # See the slurm.conf man page for more information. | ||
+ | # | ||
+ | ControlMachine=slurm-ctrl | ||
+ | # | ||
+ | # | ||
+ | # | ||
+ | MpiDefault=none | ||
+ | # | ||
+ | ProctrackType=proctrack/ | ||
+ | ReturnToService=1 | ||
+ | SlurmctldPidFile=/ | ||
+ | ## | ||
+ | # | ||
+ | SlurmdPidFile=/ | ||
+ | ## | ||
+ | # | ||
+ | SlurmdSpoolDir=/ | ||
+ | SlurmUser=slurm | ||
+ | # | ||
+ | StateSaveLocation=/ | ||
+ | SwitchType=switch/ | ||
+ | TaskPlugin=task/ | ||
+ | # | ||
+ | # | ||
+ | # TIMERS | ||
+ | # | ||
+ | # | ||
+ | # | ||
+ | # | ||
+ | # | ||
+ | # | ||
+ | # SCHEDULING | ||
+ | FastSchedule=1 | ||
+ | SchedulerType=sched/ | ||
+ | SelectType=select/ | ||
+ | # | ||
+ | # | ||
+ | # | ||
+ | # LOGGING AND ACCOUNTING | ||
+ | AccountingStorageType=accounting_storage/ | ||
+ | ClusterName=cluster | ||
+ | # | ||
+ | JobAcctGatherType=jobacct_gather/ | ||
+ | # | ||
+ | SlurmctldLogFile=/ | ||
+ | # | ||
+ | SlurmdLogFile=/ | ||
+ | # | ||
+ | # | ||
+ | # COMPUTE NODES | ||
+ | NodeName=linux1 NodeAddr=10.7.20.98 CPUs=1 State=UNKNOWN | ||
+ | </ | ||
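The COMPUTE NODES entry at the end of slurm.conf has to be repeated for every node. A small sketch of a formatter for such lines is shown here; the function name `node_line` is hypothetical and it only reproduces the four fields used in the example above. On a real node, `slurmd -C` prints the hardware configuration that slurmd detects and is probably the better source for the CPU count.

```shell
#!/bin/sh
# Hypothetical helper: format one compute-node line for slurm.conf.
# Arguments: node name, node address, CPU count.
node_line() {
    printf 'NodeName=%s NodeAddr=%s CPUs=%s State=UNKNOWN\n' "$1" "$2" "$3"
}

# Usage on a compute node (the address is an assumption, not run here):
#   node_line "$(hostname -s)" 10.7.20.98 "$(nproc)"
```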
+ | |||
+ | Copy slurm.conf to compute nodes! | ||
+ | |||
+ | root@slurm-ctrl# | ||
+ | |||
+ | vi / | ||
+ | |||
+ | < | ||
+ | [Unit] | ||
+ | Description=Slurm controller daemon | ||
+ | After=network.target munge.service | ||
+ | ConditionPathExists=/ | ||
+ | Documentation=man: | ||
+ | |||
+ | [Service] | ||
+ | Type=forking | ||
+ | EnvironmentFile=-/ | ||
+ | ExecStart=/ | ||
+ | ExecStartPost=/ | ||
+ | ExecReload=/ | ||
+ | PIDFile=/ | ||
+ | |||
+ | [Install] | ||
+ | WantedBy=multi-user.target | ||
+ | |||
+ | </ | ||
+ | |||
+ | vi / | ||
+ | |||
+ | < | ||
+ | [Unit] | ||
+ | Description=Slurm node daemon | ||
+ | After=network.target munge.service | ||
+ | ConditionPathExists=/ | ||
+ | Documentation=man: | ||
+ | |||
+ | [Service] | ||
+ | Type=forking | ||
+ | EnvironmentFile=-/ | ||
+ | ExecStart=/ | ||
+ | ExecStartPost=/ | ||
+ | ExecReload=/ | ||
+ | PIDFile=/ | ||
+ | KillMode=process | ||
+ | LimitNOFILE=51200 | ||
+ | LimitMEMLOCK=infinity | ||
+ | LimitSTACK=infinity | ||
+ | |||
+ | [Install] | ||
+ | WantedBy=multi-user.target | ||
+ | </ | ||
+ | |||
+ | |||
+ | root@slurm-ctrl# | ||
+ | root@slurm-ctrl# | ||
+ | root@slurm-ctrl# | ||
+ | root@slurm-ctrl# | ||
+ | root@slurm-ctrl# | ||
+ | |||
+ | |||
+ | === Accounting Storage === | ||
+ | |||
+ | Once the slurm-llnl-slurmdbd package is installed, configure it by editing the / | ||
+ | |||
+ | vi / | ||
+ | |||
+ | < | ||
+ | ######################################################################## | ||
+ | # | ||
+ | # / | ||
+ | # Database Daemon (SlurmDBD) configuration information. | ||
+ | # The contents of the file are case insensitive except for the names of | ||
+ | # nodes and files. Any text following a "#" | ||
+ | # treated as a comment through the end of that line. The size of each | ||
+ | # line in the file is limited to 1024 characters. Changes to the | ||
+ | # configuration file take effect upon restart of SlurmDbd or daemon | ||
+ | # receipt of the SIGHUP signal unless otherwise noted. | ||
+ | # | ||
+ | # This file should be only on the computer where SlurmDBD executes and | ||
+ | # should only be readable by the user which executes SlurmDBD (e.g. | ||
+ | # " | ||
+ | # it contains a database password. | ||
+ | ######################################################################### | ||
+ | AuthType=auth/ | ||
+ | AuthInfo=/ | ||
+ | StorageHost=localhost | ||
+ | StoragePort=3306 | ||
+ | StorageUser=slurm | ||
+ | StoragePass=slurmdbpass | ||
+ | StorageType=accounting_storage/ | ||
+ | StorageLoc=slurm_acct_db | ||
+ | LogFile=/ | ||
+ | PidFile=/ | ||
+ | SlurmUser=slurm | ||
+ | |||
+ | </ | ||
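The storage settings in this file have to agree with the database and user created earlier. A minimal sketch for reading a single value back out of a Key=Value file follows, so the two can be compared. The function name `conf_get` is hypothetical, keys are matched case-insensitively as the file header describes, and trailing comments on a value line are not stripped.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a key from a Key=Value file.
# Arguments: file, key. Comment lines are skipped; the last match wins.
conf_get() {
    awk -v key="$2" '
        /^[ \t]*#/ { next }                      # skip comment lines
        {
            pos = index($0, "=")
            if (pos == 0) next                   # not a Key=Value line
            k = substr($0, 1, pos - 1)
            gsub(/[ \t]/, "", k)                 # drop whitespace around the key
            if (tolower(k) == tolower(key)) val = substr($0, pos + 1)
        }
        END { if (val != "") print val }
    ' "$1"
}

# Usage (path is whatever file was edited above, not run here):
#   conf_get /path/to/slurmdbd.conf StorageLoc
```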
+ | |||
+ | root@slurm-ctrl# | ||
+ | |||
+ | === Authentication === | ||
+ | |||
+ | Copy / | ||
+ | |||
+ | scp / | ||
+ | |||
+ | Allow password-less access to slurm-ctrl | ||
+ | |||
+ | csadmin@slurm-ctrl: | ||
+ | |||
+ | Run a job from slurm-ctrl | ||
+ | |||
+ | ssh csadmin@slurm-ctrl | ||
+ | srun -N 1 hostname | ||
+ | linux1 | ||
+ | |||
+ | |||
+ | |||
+ | === Test munge === | ||
+ | |||
+ | munge -n | unmunge | grep STATUS | ||
+ | STATUS: | ||
+ | munge -n | ssh slurm-ctrl unmunge | grep STATUS | ||
+ | STATUS: | ||
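The two checks above can be wrapped so that a script fails when the credential does not decode. This is a sketch under the assumption that a working setup prints a STATUS line beginning with "Success", which is what unmunge normally reports; the output shown above is cut off, so treat that string as an assumption. The function name `munge_status_ok` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read unmunge output on stdin and succeed only
# if its STATUS line reports Success.
munge_status_ok() {
    grep -q '^STATUS:[[:space:]]*Success'
}

# Usage (not run here):
#   munge -n | unmunge | munge_status_ok && echo "local munge OK"
#   munge -n | ssh slurm-ctrl unmunge | munge_status_ok && echo "remote munge OK"
```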
+ | |||
+ | === Test Slurm === | ||
+ | |||
+ | sinfo | ||
+ | PARTITION AVAIL TIMELIMIT | ||
+ | debug* | ||
+ | |||
+ | If a compute node is down | ||
+ | |||
+ | < | ||
+ | sinfo -a | ||
+ | PARTITION AVAIL TIMELIMIT | ||
+ | debug* | ||
+ | </ | ||
+ | |||
+ | scontrol update nodename=gpu02 state=idle | ||
+ | scontrol update nodename=gpu03 state=idle | ||
+ | scontrol update nodename=gpu02 state=resume | ||
+ | |||
+ | < | ||
+ | sinfo -a | ||
+ | PARTITION AVAIL TIMELIMIT | ||
+ | debug* | ||
+ | </ | ||
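With more than a couple of nodes, the down nodes can be picked out of the sinfo output instead of being typed by hand. The sketch below assumes the node-oriented format `sinfo -h -N -o '%N %t'`, which prints one "node state" pair per line; the function name `down_nodes` is hypothetical. Only states beginning with "down" are matched, so drained nodes are left alone.

```shell
#!/bin/sh
# Hypothetical helper: read "<node> <state>" lines on stdin and print
# each node whose state starts with "down", once.
down_nodes() {
    awk '$2 ~ /^down/ { print $1 }' | sort -u
}

# Usage (not run here):
#   for n in $(sinfo -h -N -o '%N %t' | down_nodes); do
#       scontrol update nodename="$n" state=resume
#   done
```

The `sort -u` is there because a node that belongs to several partitions is listed once per partition.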
+ | |||
+ | |||
+ | ===== Compute Nodes ===== | ||
+ | |||
+ | |||
+ | A compute node is a machine that receives jobs to execute from the controller; it runs the slurmd service. | ||
+ | |||
+ | {{: | ||
+ | |||
+ | === Installing slurm and munge === | ||
+ | |||
+ | ssh -l csadmin < | ||
+ | sudo apt install slurm-wlm libmunge-dev libmunge2 munge | ||
+ | |||
+ | sudo vi / | ||
+ | |||
+ | < | ||
+ | [Unit] | ||
+ | Description=Slurm node daemon | ||
+ | After=network.target munge.service | ||
+ | ConditionPathExists=/ | ||
+ | Documentation=man: | ||
+ | |||
+ | [Service] | ||
+ | Type=forking | ||
+ | EnvironmentFile=-/ | ||
+ | ExecStart=/ | ||
+ | ExecStartPost=/ | ||
+ | ExecReload=/ | ||
+ | PIDFile=/ | ||
+ | KillMode=process | ||
+ | LimitNOFILE=51200 | ||
+ | LimitMEMLOCK=infinity | ||
+ | LimitSTACK=infinity | ||
+ | |||
+ | [Install] | ||
+ | WantedBy=multi-user.target | ||
+ | </ | ||
+ | |||
+ | sudo systemctl enable munge | ||
+ | sudo systemctl enable slurmd | ||
+ | sudo systemctl start munge | ||
+ | sudo systemctl start slurmd | ||
+ | |||
+ | |||
+ | Generate ssh keys | ||
+ | |||
+ | ssh-keygen | ||
+ | |||
+ | Copy ssh-keys to slurm-ctrl | ||
+ | |||
+ | ssh-copy-id -i ~/ | ||
+ | |||
+ | Become root for the following steps: | ||
+ | |||
+ | sudo -i | ||
+ | vi / | ||
+ | |||
+ | Add the following lines to the /etc/hosts file | ||
+ | |||
+ | < | ||
+ | 10.7.20.97 | ||
+ | 10.7.20.98 | ||
+ | </ | ||
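When the same entries have to be added on many nodes, appending them only if they are missing avoids duplicates. This is a sketch, not part of the original procedure: the function name `add_host_entry` is hypothetical, and the pairing of 10.7.20.98 with linux1 in the usage line is taken from the slurm.conf example earlier on this page.

```shell
#!/bin/sh
# Hypothetical helper: append "IP<TAB>name" to a hosts file unless a
# line with that IP and that name already exists.
# Arguments: hosts file, IP address, host name.
add_host_entry() {
    file=$1 ip=$2 name=$3
    if ! awk -v ip="$ip" -v name="$name" '
            $1 == ip { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
            END { exit !found }
         ' "$file" 2>/dev/null; then
        printf '%s\t%s\n' "$ip" "$name" >> "$file"
    fi
}

# Usage as root (not run here):
#   add_host_entry /etc/hosts 10.7.20.98 linux1
```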
+ | |||
+ | First copy the munge key from slurm-ctrl to each compute node, then fix its location, owner and permissions. | ||
+ | |||
+ | mv / | ||
+ | chown munge:munge / | ||
+ | chmod 400 / | ||
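Wrong permissions on the key are a common reason for the munge service refusing to start, so it can help to verify the result. A minimal sketch, assuming GNU `stat` (as on Debian and Ubuntu); the function name `key_mode_ok` is hypothetical and it checks the permission bits only, not the owner.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the permission bits of the given
# file are exactly 400 (read-only for the owner).
key_mode_ok() {
    [ "$(stat -c '%a' "$1" 2>/dev/null)" = "400" ]
}

# Usage (path is wherever the key was moved to above, not run here):
#   key_mode_ok /path/to/munge.key || echo "munge key has wrong permissions"
```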
+ | |||
+ | Place / | ||
+ | |||
+ | mv / | ||
+ | chown root: / | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | ===== Links ===== | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | {{ : |
/data/www/wiki.inf.unibz.it/data/pages/tech/slurm.txt · Last modified: 2022/11/24 16:17 by kohofer