tech:slurm
Differences
This shows you the differences between two versions of the page.
tech:slurm [2019/09/06 14:30] – kohofer | tech:slurm [2019/09/06 16:38] – [Links] kohofer | ||
---|---|---|---|
Line 19: | Line 19: | ||
Controller name: slurm-ctrl | Controller name: slurm-ctrl | ||
- | $ ssh csadmin@slurm-ctrl | + | Install slurm-wlm and tools |
- | | + | |
+ | | ||
+ | apt install slurm-wlm slurm-wlm-doc mailutils sview mariadb-client mariadb-server libmariadb-dev python-dev python-mysqldb | ||
=== Install Maria DB Server === | === Install Maria DB Server === | ||
- | | + | apt-get install mariadb-server |
- | | + | systemctl start mysql |
- | | + | mysql -u root |
create database slurm_acct_db; | create database slurm_acct_db; | ||
create user ' | create user ' | ||
Line 37: | Line 39: | ||
In the file / | In the file / | ||
+ | vi / | ||
bind-address = localhost | bind-address = localhost | ||
- | |||
- | === Configure munge === | ||
- | |||
- | $ ssh csadmin@linux1 | ||
- | scp slurm-ctrl:/ | ||
=== Node Authentication === | === Node Authentication === | ||
Line 48: | Line 46: | ||
First, let us configure the default options for the munge service: | First, let us configure the default options for the munge service: | ||
- | / | + | vi / |
- | + | OPTIONS=" | |
- | OPTIONS=" | + | |
=== Central Controller === | === Central Controller === | ||
- | The main configuration file is / | + | The main configuration file is / |
+ | |||
+ | vi / | ||
< | < | ||
Line 60: | Line 59: | ||
# / | # / | ||
############################### | ############################### | ||
- | # General | + | # slurm.conf file generated by configurator easy.html. |
- | ControlMachine=entry-node | + | # Put this file on all nodes of your cluster. |
- | AuthType=auth/ | + | # See the slurm.conf man page for more information. |
- | CacheGroups=0 | + | # |
- | CryptoType=crypto/ | + | ControlMachine=slurm-ctrl |
- | JobCheckpointDir=/ | + | # |
- | KillOnBadExit=01 | + | # |
- | MpiDefault=pmi2 | + | #MailProg=/ |
- | MailProg=/usr/bin/mail | + | MpiDefault=none |
- | PrivateData=usage, | + | #MpiParams=ports=#-# |
- | ProctrackType=proctrack/ | + | ProctrackType=proctrack/pgid |
- | PrologFlags=Alloc, | + | |
- | PropagateResourceLimits=NONE | + | |
- | RebootProgram=/ | + | |
ReturnToService=1 | ReturnToService=1 | ||
SlurmctldPidFile=/ | SlurmctldPidFile=/ | ||
- | SlurmctldPort=6817 | + | ## |
+ | #SlurmctldPort=6817 | ||
SlurmdPidFile=/ | SlurmdPidFile=/ | ||
- | SlurmdPort=6818 | + | ## |
- | SlurmdSpoolDir=/ | + | #SlurmdPort=6818 |
+ | SlurmdSpoolDir=/ | ||
SlurmUser=slurm | SlurmUser=slurm | ||
- | StateSaveLocation=/ | + | # |
+ | StateSaveLocation=/ | ||
SwitchType=switch/ | SwitchType=switch/ | ||
- | TaskPlugin=task/ | + | TaskPlugin=task/ |
- | + | # | |
- | # Timers | + | # |
- | InactiveLimit=0 | + | # TIMERS |
- | KillWait=30 | + | #KillWait=30 |
- | MinJobAge=300 | + | #MinJobAge=300 |
- | SlurmctldTimeout=120 | + | #SlurmctldTimeout=120 |
- | SlurmdTimeout=300 | + | #SlurmdTimeout=300 |
- | Waittime=0 | + | # |
- | + | # | |
- | # Scheduler | + | # SCHEDULING |
FastSchedule=1 | FastSchedule=1 | ||
SchedulerType=sched/ | SchedulerType=sched/ | ||
- | SchedulerPort=7321 | + | SelectType=select/ |
- | SelectType=select/ | + | #SelectTypeParameters= |
- | SelectTypeParameters=CR_CPU_Memory | + | # |
- | + | # | |
- | # Preemptions | + | # LOGGING AND ACCOUNTING |
- | PreemptType=preempt/ | + | AccountingStorageType=accounting_storage/ |
- | PreemptMode=REQUEUE | + | ClusterName=cluster |
- | + | #JobAcctGatherFrequency=30 | |
- | # Accounting | + | JobAcctGatherType=jobacct_gather/ |
- | AccountingStorageType=accounting_storage/ | + | #SlurmctldDebug=3 |
- | AccountingStoreJobComment=YES | + | SlurmctldLogFile=/ |
- | ClusterName=mycluster | + | #SlurmdDebug=3 |
- | JobAcctGatherFrequency=30 | + | SlurmdLogFile=/ |
- | JobAcctGatherType=jobacct_gather/ | + | # |
- | SlurmctldDebug=3 | + | # |
- | SlurmctldLogFile=/ | + | # COMPUTE NODES |
- | SlurmdDebug=3 | + | NodeName=linux1 NodeAddr=10.7.20.98 CPUs=1 State=UNKNOWN |
- | SlurmdLogFile=/ | + | |
- | SlurmSchedLogFile= / | + | |
- | SlurmSchedLogLevel=3 | + | |
- | + | ||
- | NodeName=compute-1 Procs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=128000 Weight=4 | + | |
- | NodeName=compute-2 Procs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=254000 Weight=3 | + | |
- | NodeName=compute-3 Procs=96 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=256000 Weight=3 | + | |
- | NodeName=compute-4 Procs=96 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=256000 Weight=3 | + | |
- | + | ||
- | PartitionName=base Nodes=compute-1, | + | |
- | PartitionName=long Nodes=compute-1, | + | |
</ | </ | ||
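Once the configuration file has been copied to all nodes, it can be useful to check which compute nodes it actually declares. The following is a minimal sketch, not part of Slurm itself: `list_node_names` is a hypothetical helper, and the path to the configuration file is passed as an argument because it is not spelled out here.

```shell
#!/bin/sh
# Hypothetical helper: print the NodeName value of every uncommented
# NodeName= line in a slurm.conf-style file given as the first argument.
list_node_names() {
    conf_file="$1"
    # -n suppresses default output; the trailing "p" prints only lines where
    # the substitution matched, i.e. lines that start with "NodeName=".
    sed -n 's/^NodeName=\([^[:space:]]*\).*/\1/p' "$conf_file"
}
```

For the `NodeName=linux1 NodeAddr=10.7.20.98 CPUs=1 State=UNKNOWN` line above, the helper prints `linux1`. Host list expressions in NodeName are printed unexpanded.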
- | root@controller# systemctl start slurmctld | + | root@slurm-ctrl# scp / |
+ | root@slurm-ctrl# systemctl start slurmctld | ||
=== Accounting Storage === | === Accounting Storage === | ||
- | Once the slurm-llnl-slurmdbd package is installed, configure it by editing the / | + | Once the slurm-llnl-slurmdbd package is installed, configure it by editing the / |
+ | |||
+ | vi / | ||
< | < | ||
######################################################################## | ######################################################################## | ||
# | # | ||
- | # / | + | # / |
# Database Daemon (SlurmDBD) configuration information. | # Database Daemon (SlurmDBD) configuration information. | ||
# The contents of the file are case insensitive except for the names of | # The contents of the file are case insensitive except for the names of | ||
- | # nodes and files. Any text following a "#" | + | # nodes and files. Any text following a "#" |
+ | # treated as a comment through the end of that line. The size of each | ||
# line in the file is limited to 1024 characters. Changes to the | # line in the file is limited to 1024 characters. Changes to the | ||
# configuration file take effect upon restart of SlurmDbd or daemon | # configuration file take effect upon restart of SlurmDbd or daemon | ||
Line 153: | Line 145: | ||
StoragePort=3306 | StoragePort=3306 | ||
StorageUser=slurm | StorageUser=slurm | ||
- | StoragePass=safepassword | + | StoragePass=slurmdbpass |
StorageType=accounting_storage/ | StorageType=accounting_storage/ | ||
StorageLoc=slurm_acct_db | StorageLoc=slurm_acct_db | ||
Line 159: | Line 151: | ||
PidFile=/ | PidFile=/ | ||
SlurmUser=slurm | SlurmUser=slurm | ||
+ | |||
</ | </ | ||
- | root@controller# systemctl start slurmdbd | + | root@slurm-ctrl# systemctl start slurmdbd |
+ | |||
+ | === Authentication === | ||
+ | |||
+ | Copy / | ||
+ | |||
+ | scp / | ||
+ | |||
+ | Allow password-less access to slurm-ctrl | ||
+ | |||
+ | csadmin@slurm-ctrl: | ||
+ | |||
+ | Run a job from slurm-ctrl | ||
+ | |||
+ | ssh csadmin@slurm-ctrl | ||
+ | srun -N 1 hostname | ||
+ | linux1 | ||
=== Test munge === | === Test munge === | ||
- | | + | munge -n | unmunge | grep STATUS |
STATUS: | STATUS: | ||
- | | + | munge -n | ssh slurm-ctrl unmunge | grep STATUS |
STATUS: | STATUS: | ||
=== Test Slurm === | === Test Slurm === | ||
- | | + | sinfo |
PARTITION AVAIL TIMELIMIT | PARTITION AVAIL TIMELIMIT | ||
debug* | debug* | ||
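On a cluster with more than one node, the sinfo listing can be filtered for nodes that need attention. The sketch below assumes one "name state" pair per input line, which `sinfo -h -N -o '%N %T'` produces. `unhealthy_nodes` is a hypothetical helper, and the state prefixes it looks for are an assumption about what counts as unhealthy.

```shell
#!/bin/sh
# Hypothetical helper: read "node state" pairs on stdin and print the nodes
# whose state starts with down, drain or fail, or ends with "*"
# (sinfo appends "*" to the state of a node that is not responding).
unhealthy_nodes() {
    awk '$2 ~ /^(down|drain|fail)/ || $2 ~ /\*$/ { print $1 }'
}

# Usage on a live cluster (-h: no header, -N: one line per node,
# -o: output format, %N: node name, %T: node state in extended form):
#   sinfo -h -N -o '%N %T' | unhealthy_nodes
```

Keeping the filter separate from the sinfo call lets it be tried on saved output before it is used on the live cluster.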
Line 181: | Line 191: | ||
A compute node is a machine that receives jobs to execute from the controller; it runs the slurmd service. | A compute node is a machine that receives jobs to execute from the controller; it runs the slurmd service. | ||
- | Drawing | + | {{: |
- | === Authentication | + | === Installation |
- | | + | ssh -l csadmin 10.7.20.102 |
- | | + | |
+ | |||
+ | Generate ssh keys | ||
- | | + | |
- | Run a job from slurm-ctrl | + | Copy ssh-keys to slurm-ctrl |
- | | + | ssh-copy-id -i ~/ |
- | | + | |
- | linux1 | + | Become root to do important things: |
+ | |||
+ | | ||
+ | | ||
+ | |||
+ | Add the following lines to the /etc/hosts file: | ||
+ | |||
+ | < | ||
+ | 10.7.20.97 | ||
+ | 10.7.20.98 | ||
+ | </ | ||
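When the same host entries have to be added on several nodes, appending them blindly creates duplicates on a second run. The sketch below adds an entry only if the name is not already present. `add_host_entry` is a hypothetical helper, and the linux1 / 10.7.20.98 pairing in the usage comment is taken from the NodeName line in the controller configuration.

```shell
#!/bin/sh
# Hypothetical helper: append "IP<TAB>NAME" to a hosts file unless an
# uncommented line already lists NAME as a host name.
add_host_entry() {
    hosts_file="$1"; ip="$2"; name="$3"
    # Field 1 is the address; fields 2..NF are host names and aliases.
    if awk -v n="$name" '
        !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$hosts_file"; then
        return 0    # name already present, nothing to do
    fi
    printf '%s\t%s\n' "$ip" "$name" >> "$hosts_file"
}

# Usage (as root):
#   add_host_entry /etc/hosts 10.7.20.98 linux1
```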
+ | |||
+ | First copy the munge key from slurm-ctrl to all compute nodes, then fix its location, | ||
+ | owner, and permissions. | ||
+ | |||
+ | mv / | ||
+ | chown munge:munge / | ||
+ | chmod 400 / | ||
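After moving the key, it may be worth confirming that the mode and owner really are what the commands above set. The following sketch assumes GNU `stat`, as shipped on Debian and Ubuntu. `check_key_file` is a hypothetical helper, and the key path is left as an argument because it is not spelled out here.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if FILE has mode 400 and is owned by
# the numeric user id UID_WANTED. Returns 2 if the file cannot be inspected.
check_key_file() {
    file="$1"; uid_wanted="$2"
    mode=$(stat -c '%a' "$file") || return 2       # octal mode, e.g. 400
    file_uid=$(stat -c '%u' "$file") || return 2   # numeric owner id
    [ "$mode" = "400" ] && [ "$file_uid" = "$uid_wanted" ]
}

# Usage, with KEY_PATH set to wherever the key was moved:
#   check_key_file "$KEY_PATH" "$(id -u munge)" && echo "key file looks fine"
```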
+ | |||
+ | Place / | ||
+ | |||
+ | mv / | ||
+ | chown root: / | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | ===== Links ===== | ||
+ | [[https:// | ||
+ | [[https:// | ||
+ | [[https:// | ||
+ | [[https:// | ||
+ | [[https:// | ||
- | https://slurm.schedmd.com/overview.html | + | [[https://support.ceci-hpc.be/doc/ |
/data/www/wiki.inf.unibz.it/data/pages/tech/slurm.txt · Last modified: 2022/11/24 16:17 by kohofer