SPINPACK
What is it about?
SPINPACK is a big program package to compute
the lowest eigenvalues and eigenstates and various expectation values
(spin correlations etc.) of quantum
spin systems.
These model systems can, for example, describe the magnetic properties of
insulators at very low temperatures (T=0), where the magnetic moments
of the particles form entangled quantum states.
The package generates the symmetrized configuration vector and
the sparse matrix representing the quantum interactions, then
computes its eigenvalues and eigenvectors using iterative matrix-vector
multiplication (SpMV) as the compute-intensive core operation,
and finally some expectation values for the quantum system.
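To illustrate the core operation, here is a minimal sparse matrix-vector
multiply in CSR format (an illustrative C sketch; the array names are
generic and do not correspond to Spinpack's internal data structures):

  /* y = H*x for a sparse matrix H in CSR format (illustrative sketch,
   * not Spinpack's internal code). One row per symmetrized basis state. */
  void spmv(long n, const long *row_ptr, const long *col_idx,
            const double *val, const double *x, double *y)
  {
      for (long i = 0; i < n; i++) {
          double sum = 0.0;
          for (long k = row_ptr[i]; k < row_ptr[i+1]; k++)
              sum += val[k] * x[col_idx[k]];  /* 2 FLOP per stored element */
          y[i] = sum;
      }
  }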
The first SPINPACK version was based on Nishimori's
TITPACK (Lanczos method, no symmetries), but
it was soon converted to C/C++ and completely rewritten (1994/1995).
Other diagonalization algorithms are implemented too
(Lanczos, 2x2-diagonalization, and LAPACK/BLAS for smaller systems).
It is able to handle
Heisenberg,
t-J and Hubbard systems with up to 64 sites, or more when using
special compiler and CPU features (usually up to 128 sites),
or even more sites in a slower emulation mode (C++/CXX required for int128 emulation).
For instance, we obtained the lowest eigenstates for the
Heisenberg Hamiltonian on a 40-site square lattice on our machines in 2002.
Note that the resources needed for the computation grow exponentially with the
number of lattice sites (N=40 means a matrix dimension of 2^N/symfactor).
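A rough worked example of these numbers: for N=40 at Sz=0, the plain basis
already has C(40,20) = 1.38e11 states; dividing by a symmetry factor of
roughly 40 translations * 4 rotations * 2 (spin inversion) = 320 gives about
4.3e8, matching the Hsize = 430909650 quoted for the N=40 result in the News
section below (the exact dimension is slightly larger than this estimate
because some symmetry orbits are shorter than the full group).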
The Hamilton matrix can be stored in memory or in file storage.
If there is no storage space, the matrix elements are recomputed
in every iteration (slow).
The package is written mainly in C to get it running on all Unix systems.
C++ is only needed for complex eigenvectors and
twisted boundary conditions if the C compiler has no complex extension (as gcc has).
This way the package is very portable.
Parallelization can be done using the MPI and PTHREAD libraries.
Mixed (hybrid) mode is possible, but not always faster
than pure MPI (2015).
v2.60 shows a slight hybrid-mode advantage on CPUs supporting hyper-threading.
This will hopefully be improved further. MPI scaling has been tested to work
up to 6000 cores; PTHREAD scaling works up to 510 cores but requires
careful tuning (scaling tests 2008-2016).
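For hybrid mode, the MPI library has to be initialized with thread support;
a minimal generic startup sketch (standard MPI API, not Spinpack-specific
code):

  /* Generic MPI + pthreads startup (illustrative, not Spinpack code). */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int provided, rank;
      /* FUNNELED: only the main thread calls MPI; worker threads
         created later do pure computation (e.g. SpMV slices). */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (provided < MPI_THREAD_FUNNELED && rank == 0)
          fprintf(stderr, "warning: MPI library lacks thread support\n");
      /* ... create pthreads, run the iteration ... */
      MPI_Finalize();
      return 0;
  }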
The program can use all topological symmetries,
S(z) symmetry and spin inversion to reduce the matrix size.
This reduces the needed computing resources by a linear factor.
Since 2015/2016, CPU vector extensions (SIMD: SSE2, AVX2)
are supported to get better performance for
the symmetry operations on the bit representations of the quantum spins.
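The bit representation means that each s=1/2 site maps to one bit of a
machine word, so S(z) and spin inversion reduce to cheap bit operations,
as in this sketch (using GCC builtins; Spinpack's actual encoding may
differ):

  #include <stdint.h>

  /* Spin configuration as a bit pattern: bit i set = spin up at site i.
   * Illustrative sketch; Spinpack's real encoding may differ. */
  static inline int sz_twice(uint64_t cfg, int n)      /* returns 2*Sz */
  {
      return 2 * __builtin_popcountll(cfg) - n;        /* #up - #down */
  }

  static inline uint64_t spin_invert(uint64_t cfg, int n) /* flip all spins */
  {
      uint64_t mask = (n >= 64) ? ~0ULL : ((1ULL << n) - 1ULL);
      return cfg ^ mask;
  }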
The results are very reliable because the package has been used
in scientific work since 1995. A low-latency, high-bandwidth network
and low-latency memory are needed to get the best performance on large
clusters.
News
- Bug 2022-07-04: do not use 64-bit-integer LAPACK libraries.
Spinpack uses only 32-bit integers for the Fortran API, which is unsafe
with such libraries: you may get strange errors in the full diagonalization
part, possibly resulting in segfaults, corrupted memory or bad results,
depending on the undefined data lying in the upper 32 bits.
Using 32-bit-integer LAPACK libraries is safe; use spinpack-2.59c or later.
- Ground state of the S=1/2 Heisenberg AFM on the N=42 kagome lattice: biggest
sub-matrix computed (Sz=1, k=Pi/7, size=36.7e9, nnz=41.59, v2.56 cplx8,
using partly non-blocking hybrid code on
supermuc.phase1,
10400 cores (650 nodes, 2 tasks/node, 8 cores/task, 2 hyperthreads/core, 4h),
matrix_storage=0.964e6 nz/s/core, SpMV=6.58e6 nz/s/core, Feb 2017)
- Ground state of the S=1/2 Heisenberg AFM on the N=42 linear chain computed
(E0/Nw=-0.22180752, Hsize = 3.2e9, v2.38, Jan 2009)
using 900 nodes of a SiCortex SC5832, 700 MHz, 4 GB RAM/node (320 min).
Update: N=41, Hsize = 6.6e9, E0/Nw=-0.22107343
(16*(16 cores + 256 GB + IB) * 32h, matrix stored, v2.41, Oct 2011).
- Ground state of the S=1/2 Heisenberg AFM on an N=42 square lattice computed
(E0 = -28.43433834, Hsize = 1602437797, ((7,3),(0,6)), v2.34, Apr 2008)
using 23 nodes, each with 2 dual-core Opterons at 2.2 GHz and 4 GB RAM,
connected via 1Gb Ethernet
(92 cores, usage=80%, ca. 60 GB RAM, 80 MB/s BW, 250h/100 iterations).
- The program is ready for clusters (MPI and Pthreads can be used at the same
time, see the performance graphic)
and can again use memory as the storage medium for performance measurements
(Dec 2007).
- Ground state of the S=1/2 Heisenberg AFM on an N=40 square lattice
computed (E0 = -27.09485025, Hsize = 430909650, v1.9.3, Jan 2002).
- Ground state of the S=1/2 J1-J2 Heisenberg AFM on an N=40 square lattice,
J2=0.5, zero-momentum space:
E0 = -19.96304839, Hsize = 430909650
(15 GB memory, 185 GB disk, v2.23, 60 iterations,
210h, Altix 330, IA64 1.5 GHz, 2 CPUs, GCC 3.3, Jan 2006).
- Ground state of the S=1/2 Heisenberg AFM on an N=39 triangular lattice
computed (E0 = -21.7060606, Hsize = 589088346, v2.19, Jan 2004).
- Largest complex matrix: Hsize=1.2e9 (26 GB memory, 288 GB disk, v2.19, Jul 2003);
90 iterations: 374h on an alpha-1GHz (with limited disk data rate, 4 CPUs, til4_36).
- Largest real matrix: Hsize=1.3e9 (18 GB memory, 259 GB disk, v2.21, Apr 2004);
90 iterations: real=40h, cpu=127h, sys=9% on an alpha-1.15GHz (8 CPUs, til9_42z7).
Download
Verify download using:
gpg --verify spinpack.tgz.asc spinpack.tgz
- spinpack.tgz experimental developer version (may have bug fixes, new features or speed improvements, see doc/history.html)
---
- spinpack-2.59d.tgz[.asc] improved usability for bigger systems, see doc/history (2022.07)
- spinpack-2.59c.tgz[.asc] fix for the LAPACK 64-bit bug, see doc/history (2022.07)
- spinpack-2.58a.tgz simpler block matrix handling + more, see doc/history (big NN speedup, matrix compression disabled, SuperMUC adaptations, multirun fix, 2019.07)
- spinpack-2.57.tgz simpler block matrix handling + more, see doc/history
- spinpack-2.56c.tgz backport of 2.57 fixes, above 2048*16 threads, FTLM random fix, see doc/history
- spinpack-2.56.tgz better hybrid MPI scaling above 1000 tasks (tested on kagome42_sym14_sz13..6, pgp-signed, updated 2017-02-23, see doc/history; still blocking MPI only)
- spinpack-2.55.tgz better MPI scaling above 1000 tasks (tested on kagome42_sym14_sz13..8..1, pgp-signed, updated 2017-02-21, see doc/history)
- spinpack-2.52.tgz OpenMP support (implemented as pthread emulation), but weak mixed-code speed (pgp-signed, Dec 2016)
- spinpack-2.51.tgz g++6 adaptations (gcc 6.2 compile errors/warnings fixed, pgp-signed, Sep 2016)
- spinpack-2.50d.tgz SIMD support (SSE2, AVX2), lots of bug fixes (Jan16 + fixFeb16 + fixMar16b+c + fixApr16d)
- spinpack-2.49.tgz mostly bug fixes (Mar15) (updated Mar15,12; buggy bfly-bench; NN>32 32bit-compile-error.patch; see experimental version above)
- spinpack-2.48.tgz test version (v2.48pre Feb14 new features, + tUfixMay14 + chkptFixDez14 + 2ndrunFixJan15)
- spinpack-2.47.tgz bug fixes (see doc/history.html, bug fixes of 2.45-2.46)
(version 2014/02/14, 1MB, gpg-signature)
- spinpack-2.44.tgz (see doc/history.html, known bugs)
(version 2013/01/23 + fixes May13/May14 2.44c, 1MB, gpg-signature)
- spinpack-2.43.tgz + checkpointing (see doc/history.html)
(version 2012/05/23, 1MB, gpg-signature)
- spinpack-2.42.tgz ns.mpi-speed++ (see doc/history.html)
(version 2012/05/07, 1MB, gpg-signature)
- spinpack-2.41.tgz mpi-speed++, doc++ (see doc/history.html)
(version 2011/10/24 + backport fix 2015-09-23, 1MB, gpg-signature)
- spinpack-2.40.tgz bug fixes (see doc/history.html)
(version 2009/11/26, 890kB, gpg-signature)
- spinpack-2.39.tgz new option -m, new lattice (doc/history.html)
(version 2009/04/20, 849kB, gpg-signature)
- spinpack-2.38.tgz MPI fixes (doc/history.html)
(version 2009/02/11, 849kB, gpg-signature)
- spinpack-2.36.tgz MPI-tuned (doc/history.html)
(version 2008/08/04, 802kB, gpg-signature)
- spinpack-2.35.tgz IA64-tuned (doc/history.html)
(version 2008/07/21, 796kB, gpg-signature)
- spinpack-2.34.tgz MPI bugs fixed (doc/history.html)
(version 2008/04/23, 770kB, gpg-signature)
- spinpack-2.33.tgz MPI bugs fixed (doc/history.html)
(version 2008/03/16, 620kB, gpg-signature)
- spinpack-2.32.tgz bug fixed (doc/history.html)
(version 2008/02/19, 544kB, gpg-signature)
- spinpack-2.31.tgz MPI works and scales
(version 2007/12/14, 544kB, gpg-signature)
- spinpack-2.26.tgz code simplified
and partly sped up, preparation for FPGA and MPI
(version 2007/02/27, gpg-signature)
- spinpack-2.15.tgz see doc/history.tex (updated 2003/01/20)
Installation
- gunzip -c spinpack-xxx.tgz | tar -xf - # xxx is the version number
- cd spinpack; ./configure --mpt
- make test # to test the package and create exe path
- # edit src/config.h exe/daten.def for your needs (see models/*.c)
- make
- cd exe; ./spin
Documentation
The documentation is available in the doc path.
Most parts of the documentation have been rewritten in English now. If you
still find parts written in German or out-of-date documentation,
send me an email with a short hint where to find that part and
I will rewrite it as soon as I can.
Please see doc/history.html for the latest changes.
Documentation about speed is available in the package, and an older version
on this
spinpack-speed-page.
ToDo
1) The efficiency (energy consumption) of the code is not optimal (2019-04).
Storing the matrix to memory is the
most efficient method (much less energy and time).
Computing the matrix is CPU-bound (approx. nzx*2000 ops / 8 byte)
and well optimized for Heisenberg systems.
The butterfly network, of depth O(2 log N), is used for the bit permutations
that compute the lattice symmetries (typically 4*N).
There is some room for hardware/software acceleration there.
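Each stage of such a butterfly/Benes network can be implemented as a
"delta swap", a standard bit-twiddling primitive shown below for
illustration (the per-stage masks encoding a given lattice symmetry
would be precomputed):

  #include <stdint.h>

  /* One butterfly stage: swap the bit pairs (i, i+d) for every position i
   * where mask has a 1 (standard delta-swap primitive; illustrative only). */
  static inline uint64_t delta_swap(uint64_t x, uint64_t mask, int d)
  {
      uint64_t t = (x ^ (x >> d)) & mask;
      return x ^ t ^ (t << d);
  }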
The core routine during iteration (SpMV), however, is network-I/O-bound.
Ideally we have a compute-to-transfer ratio of 2 FLOP per 8 byte (double)
in the worst case up to 8 FLOP per 8 byte (single-precision complex)
in the best case, which is
low compared to today's HPC systems with 150 FLOP per byte.
So only about 1 percent of peak performance can be used on
QDR-Infiniband clusters.
At the moment (2019-04), 4-byte index data and matrix size data
(latency) are transferred too, which costs about 50%
more data volume. This must be changed before the code can use
fully overlapping communication and spend the remaining CPU power on
data compression to get further acceleration on HPC clusters.
2) Parallel computation of two or more datasets (vectors) at the same time
increases memory consumption by a factor of (nnz+(m*2))/(nnz+2),
but turns the memory-bound SpMV core into a more efficient SpMM core
(m times the FLOPs per slightly increased memory bandwidth, see the factor above).
This is useful together with improved overlapping of computation and
communication.
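In sketch form, the idea is to reuse each matrix element loaded from memory
for m right-hand-side vectors (illustrative CSR code under the same generic
naming as the SpMV sketch above, not Spinpack's actual kernel):

  /* SpMM: multiply a CSR matrix by m dense vectors at once; each matrix
   * element is loaded once and used for m multiply-adds (illustrative). */
  void spmm(long n, int m, const long *row_ptr, const long *col_idx,
            const double *val, const double *x, double *y)  /* x,y: n*m */
  {
      for (long i = 0; i < n; i++) {
          for (int j = 0; j < m; j++)
              y[i*m + j] = 0.0;
          for (long k = row_ptr[i]; k < row_ptr[i+1]; k++) {
              double a = val[k];                    /* loaded once ... */
              const double *xc = &x[col_idx[k] * (long)m];
              for (int j = 0; j < m; j++)           /* ... used m times */
                  y[i*m + j] += a * xc[j];
          }
      }
  }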
3) The most time-consuming important function is b_smallest
in hilbert.c,
used during matrix generation.
This function computes the representative
of a set of symmetry-equivalent spin configurations (bit patterns), given a
member of this set. It also returns a phase factor and the orbit length.
It would be great progress
if the performance of that function could be improved. Ideas are welcome;
a sketch of the task follows.
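For illustration, the core of such a representative search might look as
follows (a hypothetical sketch: apply_sym stands for a precomputed symmetry
permutation and is not a Spinpack function; the real b_smallest additionally
computes the phase factor):

  #include <stdint.h>

  /* Hypothetical sketch of a representative search: apply every symmetry
   * permutation to a configuration and keep the smallest bit pattern.
   * The real b_smallest also returns a phase factor. */
  uint64_t representative(uint64_t cfg, int nsym,
                          uint64_t (*apply_sym)(uint64_t cfg, int s),
                          int *orbit_len)
  {
      uint64_t rep = cfg;
      int stab = 0;                     /* size of the stabilizer subgroup */
      for (int s = 0; s < nsym; s++) {  /* s=0 assumed to be the identity */
          uint64_t c = apply_sym(cfg, s);
          if (c == cfg) stab++;
          if (c < rep) rep = c;         /* smaller pattern wins */
      }
      *orbit_len = nsym / stab;         /* orbit-stabilizer theorem */
      return rep;
  }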
One of my ideas is to use FPGAs, but my impression
in 2009 was that the FPGA/VHDL compilers and Xilinx tools were so slow,
badly scaling and buggy that code generation and debugging were really no
fun; a much better FPGA toolchain is needed for HPC.
In 2015-05 I added a software Benes network to benefit from AVX2, but it seems
it still does not reach the maximum available speed (HT shows a speedup factor
of nearly 2; does the bitmask fall out of the L1 cache?).
Please use these data for your work or verify my data.
Questions and corrections are welcome. If you miss data or explanations
here, please send me a note.
Frequently asked questions (FAQ)
- Q: I try to diagonalize a 4-spin system, but I do not get the full spectrum. Why?
- A: Spinpack is designed to handle big systems. Therefore it uses as many
symmetries as it can. The very small 4-spin system has a very special
symmetry which makes it equivalent to a 2-spin system built from two s=1 spins.
Spinpack uses this symmetry automatically to give you the possibility
of emulating s=1 (or s=3/2, etc.) spin systems by pairs of s=1/2 spins.
If you want to switch this off, edit src/config.h and change
CONFIG_S1SYM to CONFIG_NOS1SYM.
- Q: What is the most suitable cluster hardware?
- A: Use SpMV benchmarks (HPCG) to find the best system if you cannot run
spinpack itself. 1st: network bandwidth is important; 2nd: network bandwidth
again; 3rd: memory channels; 4th: CPU cache size; 5th: CPU integer(!) units.
Don't forget the reliability of network and RAM.
This means: do not buy today's top CPUs for best overall performance;
rather look at top networks, high memory bandwidth
and cheap multicore/multichannel/high-bandwidth CPUs (check performance/price).
Calculate performance/price including 5 years of (90% SpMV + 10% HPL) power
consumption cost including cooling; see the rough example below.
Most CPUs have a similar HPL-performance/(price + 5y energy) ratio.
The 5-year HPL energy price can range from 100% to 300% of the CPU price!
Use the power consumption value measured during an HPCG test over at least
2 nodes; it will likely be much lower than the peak value during an HPL test.
Feel free to improve my suggestions.
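A rough worked example of the 5-year energy term (illustrative numbers only,
not measured values): a node drawing 300 W runs 5 * 8760 h = 43800 h, i.e.
300 W * 43800 h = 13140 kWh; at an assumed 0.25 EUR/kWh that is about
3300 EUR, which can indeed reach or exceed the price of the CPUs themselves.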
This picture shows a small sample of a possible Hamilton matrix.
The non-zero elements are shown as black pixels (v2.33, Feb 2008, kago36z14j2).
This picture shows a small sample of a possible Hamilton matrix.
The non-zero elements are shown as black (J1) and gray (J2) pixels
(v2.42, Nov 2011, j1j2-chain N=18, Sz=0, k=0). The configuration space is
sorted by the J1-Ising-model energy to show structures of the matrix.
Ising energy ranges are shown as slightly grayed areas.
Ground-state energy scaling for finite-size spin-1/2 AFM chains, N=4..40,
using up to 300 GB memory to store the N=39 sparse matrix and 245 CPU-hours
(2011, src=lc.gpl).
Author: Joerg Schulenburg, Uni-Magdeburg, 2008-2016