tech:slurm

Differences

This shows you the differences between two versions of the page.

tech:slurm [2020/02/10 16:25] – [Controller] kohofer
tech:slurm [2020/05/27 10:57] kohofer
Line 14: Line 14:
  
 ===== Installation ===== ===== Installation =====
- 
-==== Controller ==== 
  
 ===== Controller name: slurm-ctrl ===== ===== Controller name: slurm-ctrl =====
Line 131: Line 129:
 EnvironmentFile=-/etc/default/slurmctld EnvironmentFile=-/etc/default/slurmctld
 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
 +ExecStartPost=/bin/sleep 2
 ExecReload=/bin/kill -HUP $MAINPID ExecReload=/bin/kill -HUP $MAINPID
 PIDFile=/var/run/slurm-llnl/slurmctld.pid PIDFile=/var/run/slurm-llnl/slurmctld.pid
Line 152: Line 151:
 EnvironmentFile=-/etc/default/slurmd EnvironmentFile=-/etc/default/slurmd
 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
 +ExecStartPost=/bin/sleep 2
 ExecReload=/bin/kill -HUP $MAINPID ExecReload=/bin/kill -HUP $MAINPID
 PIDFile=/var/run/slurm-llnl/slurmd.pid PIDFile=/var/run/slurm-llnl/slurmd.pid
Line 241: Line 241:
   debug*       up   infinite      1   idle linux1   debug*       up   infinite      1   idle linux1
  
-==== Compute Nodes ====
 +If a compute node is in state **<color #ed1c24>down</color>** or **<color #ed1c24>drain</color>**, return it to service with scontrol:
 + 
 +<code> 
 +sinfo -a 
 +PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
 +debug*       up   infinite      2   down gpu[02-03] 
 + 
 +sinfo  
 +PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
 +gpu*         up   infinite      1  drain gpu02 
 +gpu*         up   infinite      1   down gpu03 
 + 
 +</code> 
 + 
 +  scontrol update nodename=gpu02 state=idle 
 +  scontrol update nodename=gpu03 state=idle 
 +  scontrol update nodename=gpu02 state=resume 
 + 
 +<code> 
 +sinfo -a 
 +PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
 +debug*       up   infinite      2   idle gpu[02-03] 
 +</code> 
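
The manual scontrol calls above can be scripted. The following is a minimal sketch, not part of the original page: the function name ''resume_commands'' is hypothetical, and it assumes the node list is produced with ''sinfo -h -N -o "%N %T"'' (one "name state" pair per line). It only prints the commands, so they can be reviewed before anything is changed.

```shell
#!/bin/bash
# resume_commands: read "<nodename> <state>" lines on stdin and print one
# "scontrol update" command for every node whose state starts with
# "down" or "drain" (this also matches drained, draining and down*).
# Hypothetical helper; the input is assumed to come from:
#   sinfo -h -N -o "%N %T"
resume_commands() {
    # sort -u removes duplicates: a node that belongs to several
    # partitions is listed once per partition.
    awk '$2 ~ /^(down|drain)/ { print "scontrol update nodename=" $1 " state=resume" }' | sort -u
}
```

To review the commands, run ''sinfo -h -N -o "%N %T" | resume_commands''; append ''| sh'' to execute them. It is usually worth checking ''sinfo -R'' first to see why a node was taken out of service.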
 + 
 + 
 +===== Compute Nodes =====
  
 A compute node is a machine that receives jobs from the controller and executes them; it runs the slurmd service. A compute node is a machine that receives jobs from the controller and executes them; it runs the slurmd service.
Line 251: Line 277:
   ssh -l csadmin <compute-nodes> 10.7.20.109 10.7.20.110   ssh -l csadmin <compute-nodes> 10.7.20.109 10.7.20.110
   sudo apt install slurm-wlm libmunge-dev libmunge2 munge   sudo apt install slurm-wlm libmunge-dev libmunge2 munge
 +
 +  sudo vi /lib/systemd/system/slurmd.service
 +
 +<code>
 +[Unit]
 +Description=Slurm node daemon
 +After=network.target munge.service
 +ConditionPathExists=/etc/slurm-llnl/slurm.conf
 +Documentation=man:slurmd(8)
 +
 +[Service]
 +Type=forking
 +EnvironmentFile=-/etc/default/slurmd
 +ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
 +ExecStartPost=/bin/sleep 2
 +ExecReload=/bin/kill -HUP $MAINPID
 +PIDFile=/var/run/slurm-llnl/slurmd.pid
 +KillMode=process
 +LimitNOFILE=51200
 +LimitMEMLOCK=infinity
 +LimitSTACK=infinity
 +
 +[Install]
 +WantedBy=multi-user.target
 +</code>
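
Files under /lib/systemd/system belong to the package and may be overwritten by an upgrade. As an alternative to editing the unit directly, the added ''ExecStartPost'' line could live in a systemd drop-in. This is a sketch under that assumption, not what the page itself does: the function name ''write_sleep_dropin'' and the file name ''override.conf'' are illustrative choices.

```shell
#!/bin/bash
# write_sleep_dropin UNIT [BASEDIR]
# Create BASEDIR/UNIT.d/override.conf containing the ExecStartPost delay.
# BASEDIR defaults to /etc/systemd/system, where systemd looks for drop-ins.
# Hypothetical helper; prints the path of the file it wrote.
write_sleep_dropin() {
    local unit="$1" base="${2:-/etc/systemd/system}"
    local dir="$base/$unit.d"
    mkdir -p "$dir" || return 1
    printf '[Service]\nExecStartPost=/bin/sleep 2\n' > "$dir/override.conf" || return 1
    echo "$dir/override.conf"
}
```

Run as root, for example ''write_sleep_dropin slurmd.service'', then ''systemctl daemon-reload'' and ''systemctl restart slurmd'' so that systemd picks up the change. A daemon-reload is also needed after editing the unit file directly.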
 +
   sudo systemctl enable slurmd   sudo systemctl enable slurmd
   sudo systemctl enable munge   sudo systemctl enable munge
Line 261: Line 313:
   ssh-keygen   ssh-keygen
  
-Copy ssh-keys to slurm-ctrl (using IP, because no DNS in place)
 +Copy the SSH keys to slurm-ctrl
  
-  ssh-copy-id -i ~/.ssh/id_rsa.pub csadmin@10.7.20.97:
 +  ssh-copy-id -i ~/.ssh/id_rsa.pub csadmin@slurm-ctrl.inf.unibz.it
  
 Become root to do important things: Become root to do important things:
Line 307: Line 359:
  
 {{ :tech:9-slurm.pdf |Linux Clusters Institute: Scheduling and Resource Management 2017}} {{ :tech:9-slurm.pdf |Linux Clusters Institute: Scheduling and Resource Management 2017}}
 +
 +
 +====== Modules ======
 +
 +===== Python =====
 +
 +==== Python 3.7.7 ====
 +
 +
 +  cd /opt/packages
 +  mkdir -p /opt/packages/python/3.7.7
 +  wget https://www.python.org/ftp/python/3.7.7/Python-3.7.7.tar.xz
 +  tar xfJ Python-3.7.7.tar.xz
 +  cd Python-3.7.7/
 +  ./configure --prefix=/opt/packages/python/3.7.7/ --enable-optimizations
 +  make
 +  make install
 +  
 +
 +==== Python 2.7.18 ====
 +
 +
 +  cd /opt/packages
 +  mkdir -p /opt/packages/python/2.7.18
 +  wget https://www.python.org/ftp/python/2.7.18/Python-2.7.18.tar.xz
 +  tar xfJ Python-2.7.18.tar.xz
 +  cd Python-2.7.18
 +  ./configure --prefix=/opt/packages/python/2.7.18/ --enable-optimizations
 +  make
 +  make install
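
Both Python builds follow the same steps, so they can be wrapped in one function. The sketch below is an illustration only: ''run'', ''build_python'' and the ''DRY_RUN'' variable are hypothetical names, and the URL pattern and install prefix are taken from the commands on this page. With ''DRY_RUN'' set, the function only prints what it would do.

```shell
#!/bin/bash
# run: execute a command, or only print it when DRY_RUN is non-empty.
# Hypothetical helper used to preview the build steps.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# build_python VERSION
# Download, build and install CPython into /opt/packages/python/VERSION.
# The chain stops at the first step that fails.
build_python() {
    local v="$1"
    local prefix="/opt/packages/python/$v"
    run mkdir -p "$prefix" &&
    run cd /opt/packages &&
    run wget "https://www.python.org/ftp/python/$v/Python-$v.tar.xz" &&
    run tar xfJ "Python-$v.tar.xz" &&
    run cd "Python-$v" &&
    run ./configure --prefix="$prefix" --enable-optimizations &&
    run make &&
    run make install
}
```

For a preview, run ''DRY_RUN=1 build_python 3.7.7''; without ''DRY_RUN'' the steps are executed for real.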
 +
 +==== Create modules file ====
 +
 +
 +  cd /opt/modules/modulefiles/
 +  vi python-2.7.18
 +
 +<code>
 +#%Module1.0
 +proc ModulesHelp { } {
 +global dotversion
 + 
 +puts stderr "\tPython 2.7.18"
 +}
 + 
 +module-whatis "Python 2.7.18"
 +prepend-path PATH /opt/packages/python/2.7.18/bin
 +
 +</code>
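
The same modulefile is needed for every installed version, so it can be generated. This is a minimal sketch assuming the directory layout used on this page; the function name ''write_python_modulefile'' is hypothetical, and the unused ''global dotversion'' line is left out.

```shell
#!/bin/bash
# write_python_modulefile VERSION [DIR]
# Create DIR/python-VERSION, a modulefile that prepends the matching
# /opt/packages/python/VERSION/bin directory to PATH.
# DIR defaults to /opt/modules/modulefiles. Hypothetical helper.
write_python_modulefile() {
    local v="$1" dir="${2:-/opt/modules/modulefiles}"
    mkdir -p "$dir" || return 1
    # Unquoted heredoc: $v is expanded, "\t" is written literally for Tcl.
    cat > "$dir/python-$v" <<EOF
#%Module1.0
proc ModulesHelp { } {
    puts stderr "\tPython $v"
}

module-whatis "Python $v"
prepend-path PATH /opt/packages/python/$v/bin
EOF
}
```

After ''write_python_modulefile 3.7.7'', the module should be loadable with ''module use /opt/modules/modulefiles'' followed by ''module load python-3.7.7''; ''python3 --version'' then shows which interpreter is picked up.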
 +  
 +
 +
 +
 +===== GCC =====
 +
 +This takes a long time!
 +
 +Commands to run to compile gcc-6.1.0
 +
 +  wget https://ftp.gnu.org/gnu/gcc/gcc-6.1.0/gcc-6.1.0.tar.bz2
 +  tar xfj gcc-6.1.0.tar.bz2
 +  cd gcc-6.1.0
 +  ./contrib/download_prerequisites
 +  ./configure --prefix=/opt/packages/gcc/6.1.0 --disable-multilib
 +  make
 +
 +After some time an error occurs, and the make process stops!
 +<code>
 +...
 +In file included from ../.././libgcc/unwind-dw2.c:401:0:
 +./md-unwind-support.h: In function ‘x86_64_fallback_frame_state’:
 +./md-unwind-support.h:65:47: error: dereferencing pointer to incomplete type ‘struct ucontext’
 +       sc = (struct sigcontext *) (void *) &uc_->uc_mcontext;
 +                                               ^~
 +../.././libgcc/shared-object.mk:14: recipe for target 'unwind-dw2.o' failed
 +</code>
 +
 +To fix this, follow this [[https://stackoverflow.com/questions/46999900/how-to-compile-gcc-6-4-0-with-gcc-7-2-in-archlinux|solution]]:
 +
 +  vi /opt/packages/gcc-6.1.0/x86_64-pc-linux-gnu/libgcc/md-unwind-support.h
 +
 +and replace line 61 (or comment it out and add the new line) with this:
 +
 +<code>
 +struct ucontext_t *uc_ = context->cfa;
 +</code>
 +
 +old line: /* struct ucontext *uc_ = context->cfa; */
 +
 +  make
 +
 +Next error:
 +
 +<code>
 +../../.././libsanitizer/sanitizer_common/sanitizer_stoptheworld_linux_libcdep.cc:270:22: error: aggregate ‘sigaltstack handler_stack’ has incomplete type and cannot be defined
 +   struct sigaltstack handler_stack;
 +
 +</code>
 +
 +To fix see: [[https://github.com/llvm-mirror/compiler-rt/commit/8a5e425a68de4d2c80ff00a97bbcb3722a4716da?diff=unified|solution]]
 +or [[https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81066]]
 +
 +Amend the files according to the solution above!
 +
 +Next error:
 +
 +<code>
 +...
 +checking for unzip... unzip
 +configure: error: cannot find neither zip nor jar, cannot continue
 +Makefile:23048: recipe for target 'configure-target-libjava' failed
 +...
 +...
 +</code>
 +
 +  apt install unzip zip
 +
 +and run make again!
 +
 +  make
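
Since the build runs for a long time before it reaches the libjava configure step, it can help to check for the required tools up front. A small sketch, with ''missing_tools'' as a hypothetical helper name:

```shell
#!/bin/bash
# missing_tools CMD...
# Print the name of every given command that is not found on PATH.
# Prints nothing when all commands are available. Hypothetical helper.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For the error above, ''missing_tools zip unzip'' lists what still has to be installed with apt before ''make'' is started.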
 +
 +Next error:
 +
 +<code>
 +...
 +In file included from ../.././libjava/prims.cc:26:0:
 +../.././libjava/prims.cc: In function ‘void _Jv_catch_fpe(int, siginfo_t*, void*)’:
 +./include/java-signal.h:32:26: error: invalid use of incomplete type ‘struct _Jv_catch_fpe(int, siginfo_t*, void*)::ucontext’
 +   gregset_t &_gregs = _uc->uc_mcontext.gregs;    \
 +...
 +</code>
 +
 +Edit the file: /opt/packages/gcc-6.1.0/x86_64-pc-linux-gnu/libjava/include/java-signal.h
 +
 +  vi /opt/packages/gcc-6.1.0/x86_64-pc-linux-gnu/libjava/include/java-signal.h
 +
 +<note warning>This change is not enough; more errors follow!</note>
 +
 +<code>
 +// kh
 +  ucontext_t *_uc = (ucontext_t *)_p;                           \
 +  //struct ucontext *_uc = (struct ucontext *)_p;                               \
 +  // kh
 +
 +</code>
 +
 +Next error:
 +
 +<code>
 +...
 +In file included from ../.././libjava/prims.cc:26:0:          
 +./include/java-signal.h:32:3: warning: multi-line comment [-Wcomment]
 +   //struct ucontext *_uc = (struct ucontext *)_p;                                                  
 +                                                        
 +../.././libjava/prims.cc: In function ‘void _Jv_catch_fpe(int, siginfo_t*, void*)’:
 +./include/java-signal.h:31:15: warning: unused variable ‘_uc’ [-Wunused-variable]               
 +   ucontext_t *_uc = (ucontext_t *)_p;       
 +                       
 +../.././libjava/prims.cc:192:3: note: in expansion of macro ‘HANDLE_DIVIDE_OVERFLOW’            
 +   HANDLE_DIVIDE_OVERFLOW;       
 +   ^~~~~~~~~~~~~~~~~~~~~~
 +../.././libjava/prims.cc:203:1: error: expected ‘while’ before ‘jboolean’                    
 + jboolean                                       
 + ^~~~~~~~                                      
 +../.././libjava/prims.cc:203:1: error: expected ‘(’ before ‘jboolean’
 +../.././libjava/prims.cc:204:1: error: expected primary-expression before ‘_Jv_equalUtf8Consts’
 + _Jv_equalUtf8Consts (const Utf8Const* a, const Utf8Const *b)                   
 + ^~~~~~~~~~~~~~~~~~~                                    
 +../.././libjava/prims.cc:204:1: error: expected ‘)’ before ‘_Jv_equalUtf8Consts’
 +../.././libjava/prims.cc:204:1: error: expected ‘;’ before ‘_Jv_equalUtf8Consts’
 +../.././libjava/prims.cc:204:22: error: expected primary-expression before ‘const’
 + _Jv_equalUtf8Consts (const Utf8Const* a, const Utf8Const *b)
 +...
 +</code>
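
The ''-Wcomment'' warning in the log above hints at a likely cause. Every line of the ''HANDLE_DIVIDE_OVERFLOW'' macro ends in a backslash, so a ''%%//%%'' comment inside it continues onto the following lines and swallows the rest of the macro, including the closing ''while''. One possible way around this is to replace the line in place instead of commenting it out. The sketch below assumes this diagnosis and has not been verified against a complete GCC build; ''fix_ucontext_line'' is a hypothetical helper name, and ''sed -i'' as used here is the GNU form.

```shell
#!/bin/bash
# fix_ucontext_line FILE
# Replace "struct ucontext *_uc = (struct ucontext *)_p;" with the
# ucontext_t form, in place. No comment is added, so the trailing
# backslash that continues the macro stays on the same line.
# Hypothetical helper; uses GNU sed -i.
fix_ucontext_line() {
    sed -i 's/struct ucontext \*_uc = (struct ucontext \*)_p;/ucontext_t *_uc = (ucontext_t *)_p;/' "$1"
}
```

Applied to a fresh copy of java-signal.h (without the ''%%// kh%%'' lines), this changes only the declaration; ''make'' can then be started again.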
 +
 +===== Example =====
 +
 +A simple example job script that uses an NVIDIA GPU:
 +
 +<code>
 +#!/bin/bash
 +
 +#SBATCH --job-name=mnist
 +#SBATCH --output=mnist.out
 +#SBATCH --error=mnist.err
 +
 +#SBATCH --partition gpu
 +#SBATCH --gres=gpu
 +#SBATCH --mem-per-cpu=4gb
 +#SBATCH --nodes 2
 +#SBATCH --time=00:08:00
 +
 +#SBATCH --ntasks=10
 +
 +#SBATCH --mail-type=ALL
 +#SBATCH --mail-user=<your-email@address.com>
 +
 +ml load miniconda3
 +
 +python3 main.py
 +</code>
 +
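
The job script is submitted with ''sbatch'', which answers with a line of the form "Submitted batch job <id>". The following sketch extracts that id so the job can be followed up; ''job_id_from_sbatch'' and the script name ''mnist.sh'' are hypothetical.

```shell
#!/bin/bash
# job_id_from_sbatch
# Read sbatch's "Submitted batch job <id>" line on stdin and print <id>.
# Hypothetical helper.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}
```

For example: ''jobid=$(sbatch mnist.sh | job_id_from_sbatch)'', then ''squeue -j "$jobid"'' to see the queue state or ''scontrol show job "$jobid"'' for details.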
 +
 +
 +===== Links =====
 +
 +https://www.admin-magazine.com/HPC/Articles/Warewulf-Cluster-Manager-Development-and-Run-Time/Warewulf-3-Code/MPICH2
 +
 +https://proteusmaster.urcf.drexel.edu/urcfwiki/index.php/Environment_Modules_Quick_Start_Guide
 +
 +https://en.wikipedia.org/wiki/Environment_Modules_(software)
 +
 +http://www.walkingrandomly.com/?p=5680
 +
 +https://modules.readthedocs.io/en/latest/index.html
 +
/data/www/wiki.inf.unibz.it/data/pages/tech/slurm.txt · Last modified: 2022/11/24 16:17 by kohofer