===== Compute Nodes =====

A compute node is a machine that receives jobs to execute from the controller; it runs the slurmd service.
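
On the node itself you can check that slurmd is up and that the controller sees the node; a minimal sketch (gpu02 stands in for any compute node name):

  systemctl status slurmd.service
  scontrol show node gpu02

The controller learns about a node from its entry in slurm.conf; a hypothetical example line (CPU, memory and GPU counts are placeholders only):

<code>
# Hypothetical slurm.conf entry for a GPU compute node - adjust values to the real hardware
NodeName=gpu02 CPUs=32 RealMemory=192000 Gres=gpu:2 State=UNKNOWN
</code>
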
===== Modify user accounts =====

Add a user:

  sacctmgr add user <username>
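
Typically the user is added together with the account it should run under; a hedged example (''students'' is only a placeholder account name):

  sacctmgr add user misegata account=students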

Modify the user, granting 12000 CPU minutes (200 hours) of usage:

  sacctmgr modify user misegata set GrpTRESMin=cpu=12000,
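
The line above is cut off after the comma, and any further TRES it listed is unknown here. Written out with only the CPU limit (sacctmgr documents the attribute as GrpTRESMins), followed by a check of the resulting association, it might look like this:

  sacctmgr modify user misegata set GrpTRESMins=cpu=12000
  sacctmgr show assoc user=misegata format=user,account,grptresmins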

Restart the services:

  systemctl restart slurmctld.service
  systemctl restart slurmdbd.service

Check status:

  systemctl status slurmctld.service
  systemctl status slurmdbd.service
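
If both services are up, the controller and the accounting database should answer queries again; a quick sanity check:

  sacctmgr show cluster
  sinfo
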
===== Links =====

[[https://
[[https://
==== Create modules file ====

**PYTHON**

  cd /

<code>
...
</code>
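
The body of the Python modulefile is not visible above; a minimal sketch in the same style as the CUDA modulefile below, assuming Python is installed under /opt/python/3.8 (both the path and the version are assumptions):

<code>
#%Module1.0
proc ModulesHelp { } {
    puts stderr "\tLoads a local Python build (assumed prefix: /opt/python/3.8)"
}

module-whatis "python"

# Assumed installation prefix - adjust to the real location
set basedir /opt/python/3.8

prepend-path PATH            $basedir/bin
prepend-path LD_LIBRARY_PATH $basedir/lib
</code>
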
**CUDA**

  vi /

<code>
#%Module1.0
proc ModulesHelp { } {
    global dotversion

    puts stderr "..."
}

module-whatis "..."

set ...

setenv ...
prepend-path ...
prepend-path ...
</code>
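
The values in the block above are cut off; filled in under the assumption that CUDA 10.2 is installed in /usr/local/cuda-10.2 (version and path are assumptions), the modulefile could read:

<code>
#%Module1.0
proc ModulesHelp { } {
    global dotversion

    puts stderr "\tLoads the CUDA toolkit (assumed prefix: /usr/local/cuda-10.2)"
}

module-whatis "cuda"

# Assumed installation prefix - adjust to the real location
set basedir /usr/local/cuda-10.2

setenv       CUDA_HOME       $basedir
prepend-path PATH            $basedir/bin
prepend-path LD_LIBRARY_PATH $basedir/lib64
</code>

After creating the file, ''module avail'' should list it and ''module load <name>'' makes the paths active.
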
===== GCC =====

  squeue
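
squeue on its own lists all pending and running jobs; two common variants (the format string is only an example):

  squeue -u $USER
  squeue -o "%.8i %.9P %.20j %.8u %.2t %.10M %R"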

----

===== CUDA NVIDIA TESLA Infos =====

=== nvidia-smi ===

  root@gpu02:~# watch nvidia-smi

<code>
Every 2.0s: nvidia-smi

Mon Jun 22 17:49:14 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00                                                        |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE-32GB     | ...                  | ...                  |
| N/A  ...                      | ...                  | ...                  |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE-32GB     | ...                  | ...                  |
| N/A  ...                      | ...                  | ...                  |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8627      C   /...                                                |
+-----------------------------------------------------------------------------+
</code>
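
Besides the interactive view, nvidia-smi can print selected fields for scripting; the field list below is only an illustration:

  nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv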

=== deviceQuery ===

deviceQuery has to be built before it can be run. On gpu03, change into its source directory and run:

  make

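
deviceQuery ships with the CUDA samples; assuming the default samples location (the path is an assumption), the whole step looks roughly like this:

  cd /usr/local/cuda/samples/1_Utilities/deviceQuery
  make
  ./deviceQuery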

Add the PATH to the system-wide environment:

  vi /

Add this at the end:

  /

Next, enable it by sourcing the file:

  source /
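
The file and path above are cut off; purely as an illustration, the appended line could point PATH at the directory where the built samples end up (the path is an assumption):

  export PATH=$PATH:/usr/local/cuda/samples/bin/x86_64/linux/release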

<code>
root@gpu03: ./deviceQuery
deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla V100-PCIE-32GB"
  (80) Multiprocessors, ...
  GPU Max Clock rate:                            1380 MHz (1.38 GHz)
  Memory Clock rate:                             877 Mhz
  ...
  Total number of registers available per block: 65536
  Warp size:                                     32
  ...
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z):  (2147483647, ...)
  ...
  Device PCI Domain ID / Bus ID / location ID:   0 / 59 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla V100-PCIE-32GB"
  ...
  Device PCI Domain ID / Bus ID / location ID:   0 / 175 / 0

> Peer access from Tesla V100-PCIE-32GB (GPU0) -> Tesla V100-PCIE-32GB (GPU1) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU1) -> Tesla V100-PCIE-32GB (GPU0) : Yes

deviceQuery, ... NumDevs = 2
Result = PASS
</code>

===== Links =====