tech:slurm
Differences
This shows you the differences between two versions of the page.
tech:slurm [2019/09/06 14:30] – kohofer | tech:slurm [2020/04/28 18:25] – kohofer | ||
---|---|---|---|
Line 15: | Line 15: | ||
===== Installation ===== | ===== Installation ===== | ||
- | ==== Controller ==== | + | ===== Controller |
- | Controller name: slurm-ctrl | + | Install |
- | | + | ssh slurm-ctrl |
- | | + | apt install slurm-wlm slurm-wlm-doc mailutils mariadb-client mariadb-server libmariadb-dev python-dev python-mysqldb |
=== Install Maria DB Server === | === Install Maria DB Server === | ||
- | | + | apt-get install mariadb-server |
- | | + | systemctl start mysql |
- | | + | mysql -u root |
create database slurm_acct_db; | create database slurm_acct_db; | ||
create user ' | create user ' | ||
Line 37: | Line 37: | ||
In the file / | In the file / | ||
+ | vi / | ||
bind-address = localhost | bind-address = localhost | ||
- | |||
- | === Configure munge === | ||
- | |||
- | $ ssh csadmin@linux1 | ||
- | scp slurm-ctrl:/ | ||
=== Node Authentication === | === Node Authentication === | ||
Line 48: | Line 44: | ||
First, let us configure the default options for the munge service: | First, let us configure the default options for the munge service: | ||
- | / | + | vi / |
- | + | OPTIONS=" | |
- | OPTIONS=" | + | |
=== Central Controller === | === Central Controller === | ||
- | The main configuration file is / | + | The main configuration file is / |
+ | |||
+ | vi / | ||
< | < | ||
Line 60: | Line 57: | ||
# / | # / | ||
############################### | ############################### | ||
- | # General | + | # slurm.conf file generated by configurator easy.html. |
- | ControlMachine=entry-node | + | # Put this file on all nodes of your cluster. |
- | AuthType=auth/ | + | # See the slurm.conf man page for more information. |
- | CacheGroups=0 | + | # |
- | CryptoType=crypto/ | + | ControlMachine=slurm-ctrl |
- | JobCheckpointDir=/ | + | # |
- | KillOnBadExit=01 | + | # |
- | MpiDefault=pmi2 | + | #MailProg=/ |
- | MailProg=/usr/bin/mail | + | MpiDefault=none |
- | PrivateData=usage, | + | #MpiParams=ports=#-# |
- | ProctrackType=proctrack/ | + | ProctrackType=proctrack/pgid |
- | PrologFlags=Alloc, | + | |
- | PropagateResourceLimits=NONE | + | |
- | RebootProgram=/ | + | |
ReturnToService=1 | ReturnToService=1 | ||
SlurmctldPidFile=/ | SlurmctldPidFile=/ | ||
- | SlurmctldPort=6817 | + | ## |
+ | #SlurmctldPort=6817 | ||
SlurmdPidFile=/ | SlurmdPidFile=/ | ||
- | SlurmdPort=6818 | + | ## |
- | SlurmdSpoolDir=/ | + | #SlurmdPort=6818 |
+ | SlurmdSpoolDir=/ | ||
SlurmUser=slurm | SlurmUser=slurm | ||
- | StateSaveLocation=/ | + | # |
+ | StateSaveLocation=/ | ||
SwitchType=switch/ | SwitchType=switch/ | ||
- | TaskPlugin=task/ | + | TaskPlugin=task/ |
- | + | # | |
- | # Timers | + | # |
- | InactiveLimit=0 | + | # TIMERS |
- | KillWait=30 | + | #KillWait=30 |
- | MinJobAge=300 | + | #MinJobAge=300 |
- | SlurmctldTimeout=120 | + | #SlurmctldTimeout=120 |
- | SlurmdTimeout=300 | + | #SlurmdTimeout=300 |
- | Waittime=0 | + | # |
- | + | # | |
- | # Scheduler | + | # SCHEDULING |
FastSchedule=1 | FastSchedule=1 | ||
SchedulerType=sched/ | SchedulerType=sched/ | ||
- | SchedulerPort=7321 | + | SelectType=select/ |
- | SelectType=select/ | + | #SelectTypeParameters= |
- | SelectTypeParameters=CR_CPU_Memory | + | # |
+ | # | ||
+ | # LOGGING AND ACCOUNTING | ||
+ | AccountingStorageType=accounting_storage/ | ||
+ | ClusterName=cluster | ||
+ | # | ||
+ | JobAcctGatherType=jobacct_gather/ | ||
+ | # | ||
+ | SlurmctldLogFile=/ | ||
+ | # | ||
+ | SlurmdLogFile=/ | ||
+ | # | ||
+ | # | ||
+ | # COMPUTE NODES | ||
+ | NodeName=linux1 NodeAddr=10.7.20.98 CPUs=1 State=UNKNOWN | ||
+ | </ | ||
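Each compute node needs its own NodeName line in slurm.conf. As a minimal sketch, the helper below prints such a line in the same shape as the linux1 entry above; the function name `node_line` is hypothetical and not part of Slurm, and only the four fields used above are emitted.

```shell
#!/bin/sh
# Hypothetical helper: print one slurm.conf NodeName line.
# Arguments: node name, node address, CPU count.
# State is fixed to UNKNOWN, as in the entry above.
node_line() {
    printf 'NodeName=%s NodeAddr=%s CPUs=%s State=UNKNOWN\n' "$1" "$2" "$3"
}

# Example: reproduces the line used for linux1 above.
node_line linux1 10.7.20.98 1
```

The output can be appended to the COMPUTE NODES section by hand; review it before copying the file to the nodes.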
- | # Preemptions | + | Copy slurm.conf to compute nodes! |
- | PreemptType=preempt/partition_prio | + | |
- | PreemptMode=REQUEUE | + | root@slurm-ctrl# scp / |
+ | |||
+ | vi / | ||
+ | |||
+ | < | ||
+ | [Unit] | ||
+ | Description=Slurm controller daemon | ||
+ | After=network.target munge.service | ||
+ | ConditionPathExists=/ | ||
+ | Documentation=man: | ||
- | # Accounting | + | [Service] |
- | AccountingStorageType=accounting_storage/ | + | Type=forking |
- | AccountingStoreJobComment=YES | + | EnvironmentFile=-/etc/default/ |
- | ClusterName=mycluster | + | ExecStart=/usr/sbin/ |
- | JobAcctGatherFrequency=30 | + | ExecStartPost=/bin/sleep 2 |
- | JobAcctGatherType=jobacct_gather/linux | + | ExecReload=/bin/kill -HUP $MAINPID |
- | SlurmctldDebug=3 | + | PIDFile=/var/run/ |
- | SlurmctldLogFile=/var/log/ | + | |
- | SlurmdDebug=3 | + | |
- | SlurmdLogFile=/var/log/slurm-llnl/ | + | |
- | SlurmSchedLogFile= /var/log/ | + | |
- | SlurmSchedLogLevel=3 | + | |
- | NodeName=compute-1 Procs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=128000 Weight=4 | + | [Install] |
- | NodeName=compute-2 Procs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=254000 Weight=3 | + | WantedBy=multi-user.target |
- | NodeName=compute-3 Procs=96 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=256000 Weight=3 | + | |
- | NodeName=compute-4 Procs=96 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=256000 Weight=3 | + | |
- | PartitionName=base Nodes=compute-1, | ||
- | PartitionName=long Nodes=compute-1, | ||
</ | </ | ||
- | root@controller# systemctl start slurmctld | + | |
+ | |||
+ | < | ||
+ | [Unit] | ||
+ | Description=Slurm node daemon | ||
+ | After=network.target munge.service | ||
+ | ConditionPathExists=/ | ||
+ | Documentation=man: | ||
+ | |||
+ | [Service] | ||
+ | Type=forking | ||
+ | EnvironmentFile=-/ | ||
+ | ExecStart=/ | ||
+ | ExecStartPost=/ | ||
+ | ExecReload=/ | ||
+ | PIDFile=/ | ||
+ | KillMode=process | ||
+ | LimitNOFILE=51200 | ||
+ | LimitMEMLOCK=infinity | ||
+ | LimitSTACK=infinity | ||
+ | |||
+ | [Install] | ||
+ | WantedBy=multi-user.target | ||
+ | </ | ||
+ | |||
+ | |||
+ | | ||
+ | root@slurm-ctrl# | ||
+ | root@slurm-ctrl# | ||
+ | root@slurm-ctrl# | ||
+ | root@slurm-ctrl# systemctl start slurmctld | ||
=== Accounting Storage === | === Accounting Storage === | ||
- | After we have the slurm-llnl-slurmdbd package installed we configure it, by editing the / | + | After we have the slurm-llnl-slurmdbd package installed we configure it, by editing the / |
+ | |||
+ | vi / | ||
< | < | ||
######################################################################## | ######################################################################## | ||
# | # | ||
- | # / | + | # / |
# Database Daemon (SlurmDBD) configuration information. | # Database Daemon (SlurmDBD) configuration information. | ||
# The contents of the file are case insensitive except for the names of | # The contents of the file are case insensitive except for the names of | ||
- | # nodes and files. Any text following a "#" | + | # nodes and files. Any text following a "#" |
+ | # treated as a comment through the end of that line. The size of each | ||
# line in the file is limited to 1024 characters. Changes to the | # line in the file is limited to 1024 characters. Changes to the | ||
# configuration file take effect upon restart of SlurmDbd or daemon | # configuration file take effect upon restart of SlurmDbd or daemon | ||
Line 153: | Line 199: | ||
StoragePort=3306 | StoragePort=3306 | ||
StorageUser=slurm | StorageUser=slurm | ||
- | StoragePass=safepassword | + | StoragePass=slurmdbpass |
StorageType=accounting_storage/ | StorageType=accounting_storage/ | ||
StorageLoc=slurm_acct_db | StorageLoc=slurm_acct_db | ||
Line 159: | Line 205: | ||
PidFile=/ | PidFile=/ | ||
SlurmUser=slurm | SlurmUser=slurm | ||
+ | |||
</ | </ | ||
- | root@controller# systemctl start slurmdbd | + | root@slurm-ctrl# systemctl start slurmdbd |
+ | |||
+ | === Authentication === | ||
+ | |||
+ | Copy / | ||
+ | |||
+ | scp / | ||
+ | |||
+ | Allow password-less access to slurm-ctrl | ||
+ | |||
+ | csadmin@slurm-ctrl: | ||
+ | |||
+ | Run a job from slurm-ctrl | ||
+ | |||
+ | ssh csadmin@slurm-ctrl | ||
+ | srun -N 1 hostname | ||
+ | linux1 | ||
=== Test munge === | === Test munge === | ||
- | | + | munge -n | unmunge | grep STATUS |
STATUS: | STATUS: | ||
- | | + | munge -n | ssh slurm-ctrl unmunge | grep STATUS |
STATUS: | STATUS: | ||
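The two munge checks above can be wrapped in a small helper for scripting. This is a minimal sketch, assuming that unmunge prints a `STATUS:` line containing `Success` when the credential decodes; the function name `munge_status_ok` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read unmunge output on stdin and succeed only
# if a STATUS line reports Success.
munge_status_ok() {
    grep -q '^STATUS:.*Success'
}

# Usage on a live system (not run here):
#   munge -n | unmunge | munge_status_ok && echo "local munge OK"
#   munge -n | ssh slurm-ctrl unmunge | munge_status_ok && echo "remote munge OK"
```

Using the exit status instead of reading the output by eye makes the check usable from a provisioning script.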
=== Test Slurm === | === Test Slurm === | ||
- | | + | sinfo |
PARTITION AVAIL TIMELIMIT | PARTITION AVAIL TIMELIMIT | ||
debug* | debug* | ||
- | ==== Compute Nodes ==== | + | If a compute node is down |
+ | |||
+ | < | ||
+ | sinfo -a | ||
+ | PARTITION AVAIL TIMELIMIT | ||
+ | debug* | ||
+ | </ | ||
+ | |||
+ | scontrol update nodename=gpu02 state=idle | ||
+ | scontrol update nodename=gpu03 state=idle | ||
+ | scontrol update nodename=gpu02 state=resume | ||
+ | |||
+ | < | ||
+ | sinfo -a | ||
+ | PARTITION AVAIL TIMELIMIT | ||
+ | debug* | ||
+ | </ | ||
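When several nodes are down, the scontrol calls above can be generated instead of typed. The sketch below is a dry run under stated assumptions: the function name `resume_commands` is hypothetical, and the sinfo options in the usage comment (`-h`, `-N`, `-t down`, `-o '%N'`) are assumed to list one down node per line on this Slurm version.

```shell
#!/bin/sh
# Hypothetical helper: read node names on stdin (one per line) and print
# the scontrol command that would return each node to service.
# Printing instead of executing keeps this a dry run.
resume_commands() {
    while IFS= read -r node; do
        # Skip empty lines.
        [ -n "$node" ] || continue
        printf 'scontrol update nodename=%s state=resume\n' "$node"
    done
}

# Usage on a live controller (not run here):
#   sinfo -h -N -t down -o '%N' | sort -u | resume_commands        # review
#   sinfo -h -N -t down -o '%N' | sort -u | resume_commands | sh   # apply
```

Reviewing the printed commands first avoids resuming a node that was taken down on purpose.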
+ | |||
+ | |||
+ | ===== Compute Nodes ===== | ||
A compute node is a machine that receives jobs from the Controller and executes them; it runs the slurmd service. | A compute node is a machine that receives jobs from the Controller and executes them; it runs the slurmd service. | ||
- | Drawing | + | {{: |
- | === Authentication | + | === Installation of slurm and munge === |
- | | + | ssh -l csadmin < |
- | | + | |
- | | + | |
- | Run a job from slurm-ctrl | + | < |
+ | [Unit] | ||
+ | Description=Slurm node daemon | ||
+ | After=network.target munge.service | ||
+ | ConditionPathExists=/ | ||
+ | Documentation=man: | ||
- | | + | [Service] |
- | $ srun -N 1 hostname | + | Type=forking |
- | | + | EnvironmentFile=-/ |
+ | ExecStart=/ | ||
+ | ExecStartPost=/ | ||
+ | ExecReload=/ | ||
+ | PIDFile=/ | ||
+ | KillMode=process | ||
+ | LimitNOFILE=51200 | ||
+ | LimitMEMLOCK=infinity | ||
+ | LimitSTACK=infinity | ||
+ | [Install] | ||
+ | WantedBy=multi-user.target | ||
+ | </ | ||
+ | |||
+ | sudo systemctl enable munge | ||
+ | sudo systemctl enable slurmd | ||
+ | sudo systemctl start munge | ||
+ | sudo systemctl start slurmd | ||
+ | |||
+ | |||
+ | Generate ssh keys | ||
+ | |||
+ | ssh-keygen | ||
+ | |||
+ | Copy ssh-keys to slurm-ctrl | ||
+ | |||
+ | ssh-copy-id -i ~/ | ||
+ | |||
+ | Become root to do important things: | ||
+ | |||
+ | sudo -i | ||
+ | vi /etc/hosts | ||
+ | |||
+ | Add the lines below to the /etc/hosts file: | ||
+ | |||
+ | < | ||
+ | 10.7.20.97 | ||
+ | 10.7.20.98 | ||
+ | </ | ||
+ | |||
+ | First copy the munge key from slurm-ctrl to all compute nodes, then fix its location, | ||
+ | owner and permissions. | ||
+ | |||
+ | mv / | ||
+ | chown munge:munge / | ||
+ | chmod 400 / | ||
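After the chmod above, the key's mode can be verified before starting munge. This is a minimal sketch, assuming GNU stat as shipped on Debian/Ubuntu; the function name `check_mode` and the `KEY_PATH` variable are hypothetical, and the key path is whatever location the key was moved to above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if FILE exists and has the expected
# octal mode (default 400, matching the chmod above).
# Relies on GNU stat's -c option.
check_mode() {
    file=$1
    want=${2:-400}
    [ -f "$file" ] || return 1
    [ "$(stat -c '%a' "$file")" = "$want" ]
}

# Usage (not run here; set KEY_PATH to the key location used above):
#   check_mode "$KEY_PATH" 400 && echo "munge key permissions OK"
```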
+ | |||
+ | Place / | ||
+ | |||
+ | mv / | ||
+ | chown root: / | ||
+ | |||
+ | | ||
+ | |||
+ | |||
+ | ===== Links ===== | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | {{ : | ||
+ | |||
+ | |||
+ | ====== Modules ====== | ||
+ | |||
+ | ===== Python ===== | ||
+ | |||
+ | Python 3.7.7 | ||
+ | |||
+ | cd / | ||
+ | mkdir / | ||
+ | wget https:// | ||
+ | tar xfJ Python-3.7.7.tar.xz | ||
+ | cd Python-3.7.7/ | ||
+ | ./configure --prefix=/ | ||
+ | make | ||
+ | make install | ||
+ | | ||
+ | |||
+ | |||
+ | ===== GCC ===== | ||
+ | |||
+ | This takes a long time! | ||
+ | |||
+ | Commands to run to compile gcc-6.1.0 | ||
+ | |||
+ | wget https:// | ||
+ | tar xfj gcc-6.1.0.tar.bz2 | ||
+ | cd gcc-6.1.0 | ||
+ | ./ | ||
+ | ./configure --prefix=/ | ||
+ | make | ||
+ | |||
+ | After some time an error occurs, and the make process stops! | ||
+ | < | ||
+ | ... | ||
+ | In file included from ../ | ||
+ | ./ | ||
+ | ./ | ||
+ | sc = (struct sigcontext *) (void *) & | ||
+ | ^~ | ||
+ | ../ | ||
+ | </ | ||
+ | |||
+ | To fix do: [[https:// | ||
+ | |||
+ | vi / | ||
+ | |||
+ | and replace/ | ||
+ | |||
+ | < | ||
+ | struct ucontext_t *uc_ = context-> | ||
+ | </ | ||
+ | |||
+ | old line: /* struct ucontext *uc_ = context-> | ||
+ | |||
+ | make | ||
+ | |||
+ | Next error: | ||
+ | |||
+ | < | ||
+ | ../ | ||
+ | | ||
+ | |||
+ | </ | ||
+ | |||
+ | To fix see: [[https:// | ||
+ | or [[https:// | ||
+ | |||
+ | Amend the files according to the solution linked above! | ||
+ | |||
+ | Next error: | ||
+ | |||
+ | < | ||
+ | ... | ||
+ | checking for unzip... unzip | ||
+ | configure: error: cannot find neither zip nor jar, cannot continue | ||
+ | Makefile: | ||
+ | ... | ||
+ | ... | ||
+ | </ | ||
+ | |||
+ | apt install unzip zip | ||
+ | |||
+ | and run make again! | ||
+ | |||
+ | make | ||
+ | |||
+ | Next error: | ||
+ | |||
+ | < | ||
+ | ... | ||
+ | In file included from ../ | ||
+ | ../ | ||
+ | ./ | ||
+ | | ||
+ | ... | ||
+ | </ | ||
+ | |||
+ | Edit the file: / | ||
+ | |||
+ | vi / | ||
+ | |||
+ | <note warning> | ||
+ | |||
+ | < | ||
+ | // kh | ||
+ | ucontext_t *_uc = (ucontext_t *)_p; \ | ||
+ | //struct ucontext *_uc = (struct ucontext *)_p; \ | ||
+ | // kh | ||
+ | |||
+ | </ | ||
+ | |||
+ | Next error: | ||
+ | |||
+ | <code php> | ||
+ | ... | ||
+ | In file included from ../ | ||
+ | ./ | ||
+ | // | ||
+ | | ||
+ | ../ | ||
+ | ./ | ||
+ | | ||
+ | | ||
+ | ../ | ||
+ | | ||
+ | | ||
+ | ../ | ||
+ | | ||
+ | | ||
+ | ../ | ||
+ | ../ | ||
+ | | ||
+ | | ||
+ | ../ | ||
+ | ../ | ||
+ | ../ | ||
+ | | ||
+ | ... | ||
+ | </ | ||
+ | ===== Links ===== | ||
+ | http:// | ||
- | https://slurm.schedmd.com/overview.html | + | https://modules.readthedocs.io/en/ |
/data/www/wiki.inf.unibz.it/data/pages/tech/slurm.txt · Last modified: 2022/11/24 16:17 by kohofer