GPU compute server - gpu18

This machine is for GPU-accelerated scientific computations by university members. Each GPU can be scheduled to jobs separately. SLURM is used as the job scheduler. The operating system is Linux/Ubuntu. Connection and data transfer are possible via ssh (secure shell).

8 GPUs inside a DGX-1

Access is possible after login via ssh to gpu18.urz.uni-magdeburg.de, from the intra-university network only. An account can be created on informal inquiry to the hpc-admins. Please include the project name and its (maximum estimated) duration, and, for each class of jobs you may have, the maximum number of GPUs used together (mostly we expect a single GPU), GPU memory (max 32 GB/GPU), main memory, and disk space needed. This data is used to adjust the SLURM job system to the different needs for best usage, i.e. adapting the job queues to different job sizes. Students need permission (an informal email) from their tutor, who must be an employee of the university.
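As an orientation, logging in and copying data from inside the university network might look like this (the username and paths are placeholders):

    # log in (intra-university network only)
    ssh myuser@gpu18.urz.uni-magdeburg.de

    # copy input data to your home directory on the server
    scp -r ./mydata myuser@gpu18.urz.uni-magdeburg.de:~/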
Please use other, smaller GPU systems if your jobs do not really need 32 GB of GPU memory, to leave these GPUs for users who need that much GPU memory and have no other choice (e.g. the t100 GPU nodes). The GPU memory and its speed are the most expensive parts of this kind of GPU machine. Also note that although the machine has a large main memory (512 GB), if it must be distributed over 8 jobs on 8 GPUs, only 64 GB of main memory is left for each user's job, which is only a factor of 2 above the GPU memory. That means you do not have much room to buffer deep-learning data in main memory for feeding the GPU. If you want more main memory, there must be other users who are satisfied with less than 64 GB. Also, there is only 7 TB of SSD storage for the 8 GPUs. We are trying to remove this clear bottleneck in 2020/2021 with EDR InfiniBand storage, if we get funding for it.
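To judge whether your jobs really need the full 32 GB of GPU memory, you can watch the actual memory use of a test run with nvidia-smi (a minimal sketch; run it on the node while your job is active):

    # report per-GPU memory use; compare memory.used against the 32 GB per GPU
    nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv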
Please always be careful when allocating CPUs and memory, to avoid conflicts with other jobs. Check scaling before using multiple GPUs for one job. Sharing a single GPU between multiple jobs is not enabled by default, to avoid out-of-memory conflicts on the GPU memory (which may crash the GPU; please report your experiences to us). A sample job script is sketched below.
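A minimal SLURM batch script requesting a single GPU could look like the following sketch; the job name, resource sizes, and the final command are placeholders and must be adapted to your project and to the current SLURM configuration of gpu18:

    #!/bin/bash
    #SBATCH --job-name=example-job      # placeholder name
    #SBATCH --gres=gpu:1                # request a single GPU
    #SBATCH --cpus-per-task=8           # keep CPU allocation moderate
    #SBATCH --mem=64G                   # stay within the per-GPU share of main memory
    #SBATCH --time=24:00:00             # maximum estimated run time

    module load cuda                    # or a specific version, e.g. cuda/11.2
    srun python train.py                # placeholder command

Submit it with "sbatch jobscript.sh" and check the queue with "squeue".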
The operating system is Ubuntu Linux. CPU-intensive preparation of jobs should not be done on the GPU node; please use the hpc18 system for it.
This machine is not suited for work with personal data: there is no encryption, and failed disks will be replaced without any possibility of destroying the data on them before they are returned to a third party, due to the hardware warranty terms.

History:

2019-01-30 installation + first GPU-test-users
2019-01-30 ssh-access through ssh gpu18.urz.uni-magdeburg.de (intranet only)
2019-03-20 slurm 15.08.7 installed (from ubuntu repositories)
2019-04-08 fixed typo in dns-nameservers (network was partly slow)
2019-06-12 server room power loss due to a short circuit
2019-08-08 config central kerberos-passwords (university)
2019-08-12 slurm 17.11.12 installed (+dbd for single GPU allocation)
2019-09-22 single GPU-allocation by slurm, old: node allocation
2019-10-29 reboot to get dead nvidia6 running (slurm problem)
2020-01-14 install g++-5 g++-6 g++-7 g++-9 (old: g++-8)
2020-03-11 reboot (nfs4 partly hanging, some non-accessible files, ca. 40 zombies)
2020-03-20 nfs-server overload errors, nfs4 switched to nfs3 as a test
2020-10-05 07:20 system SSD 480GB defect, system down
2020-10-08 SSD replaced, system restore from system backup (2019-11)
2020-10-12 system restored from backup and up, users allowed
2020-10-28 13:55 -crash- uncorrectable memory error
2020-11-23 slurm Fair-Scheduling configured
2021-01-19 gcc-7.5 as default gcc (old gcc-5)
2021-02-02 new slurm gres syntax
2021-02-03 cuda-11.2 installed, please use "module load cuda" # or cuda/11.2
2021-03-11 quota activated (try "quota -l")
2021-08-09 remote shutdown tests for 2021-08-16
2021-08-16 planned maintenance, no air conditioning, 12h downtime
2021-11-21 /nfs_t100 disabled, t100 will be powered off
2021-11-22 upgrade OS from ubuntu-16 to ubuntu-18, /nfs1 broken (nfs4)
2021-11-29 upgrade to DGX-OS 5, Ubuntu 20.04, nvidia-470.82, CUDA 11.4
2021-12-01 fix /nfs1 (nfs3 permanently), python, pip
2022-05-16 downtime due to slurm security update
2022-06-14 maintenance, infiniband driver installation problems
2023-10-26 13:36:57 reboot (cuda-inst), /scratch2 unavailable, 15:30 fix
2024-02-21 driver-535 cuda-12.3 installed (thx to FIN-IKS A.K.) +reboot
2024-10-29 access to the outside network enabled, no ICMP
2024-11-04 testing slurm-configs for hpc21-gpus, please report problems

Projects (incomplete):

Problems: