spinpack speed estimation 2014

This gives a rough overview of the speed limits of spinpack (v2.48, 2014). It should help you find optimal hardware to run spinpack. There are three main parts: first the Hilbert space generation, second the sparse matrix generation, which runs at least once, and third the iteration, whose main part is a matrix-vector multiplication repeated about 100 times or more. As an example we estimate for a 100-node cluster with about 24TB of available memory and a (non-blocking) InfiniBand network. About 5-10% of the memory is reserved as an upper limit for I/O buffering (network/storage). Each node has 256GB memory, 16 cores at 3GHz with 8 FLOP/cycle (380GF) and 3GB/s bandwidth.

Vector and matrix size

Having the biggest physical system in mind, the main memory is used for the Hilbert vector, which needs 8 Bytes per entry (1 bit per site, up to 64 sites). This vector is needed for matrix generation. Three floating-point vectors must also fit into memory (Lanczos method + eigenvector), each with 8 Byte double entries (alternatives would be 4 Byte float for maximum model size, 8 Byte float complex, or 16 Byte double complex). At least two of the three vectors are accessed sequentially only; the other is accessed randomly (over the whole cluster). Assuming the 4 Byte float variant for simplicity, we need 8 + 3*4 = 20 Bytes per vector element. With 249GB available memory per node this gives n1=7e9 double or n1=11e9 float vector elements per node, i.e. an overall matrix dimension of at most 1100e9.
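
As a worked example, the following C snippet recomputes the memory budget from the assumptions above (all sizes are the figures quoted in the text; note that the text rounds its element counts down conservatively):

  #include <stdio.h>

  int main(void)
  {
      double mem    = 249e9;            /* assumed usable Bytes per node  */
      int    nodes  = 100;
      int    b_hilb = 8;                /* Hilbert vector: 8 Bytes/entry  */
      int    b_dbl  = b_hilb + 3 * 8;   /* + three double vectors = 32B   */
      int    b_flt  = b_hilb + 3 * 4;   /* + three float vectors  = 20B   */
      printf("per node: %.1e double or %.1e float elements\n",
             mem / b_dbl, mem / b_flt); /* ~7.8e9 and ~12.4e9 elements    */
      printf("max matrix dimension: %.2e (float, %d nodes)\n",
             nodes * mem / b_flt, nodes);
      return 0;
  }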

The matrix will have 10 to 60 non-zero entries per row. For simplicity we assume nz=50 non-zero elements per row (J1-J2 model, N=32 square lattice). For jobs running longer than 4 days a checkpoint/suspend system is required. The cheapest suspend mechanism is SIGSTOP/SIGCONT plus swapping out (on diskless nodes via remote nbd-swap), as sketched below.
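
A minimal sketch of this suspend mechanism, assuming a small helper that gets the solver pid from the batch system (the function names are illustrative, not spinpack code):

  #include <signal.h>
  #include <sys/types.h>

  /* freeze the solver; its pages can then be swapped out (nbd-swap) */
  int suspend_job(pid_t pid) { return kill(pid, SIGSTOP); }

  /* continue the computation when the node becomes available again */
  int resume_job(pid_t pid)  { return kill(pid, SIGCONT); }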

Hilbert space generation

- work in progress -

The minimal symmetric configurations are generated and stored for a later index search during matrix generation. Computing the symmetric configurations takes most of the time and costs a lot of 64-bit bit manipulations (shift, and, or) plus at least numsym*numparticles Bytes of cache space (e.g. 256*64B = 16kB). The compiler does not vectorize this code at the moment, but it should be possible to vectorize it. Because configurations have to be stored as 64-bit elements, only 64-bit vectorization helps. As of 2014, vectorization is possible via the AVX CPU extension (256-bit registers YMM0-15; Sandy Bridge Q1 2011, Bulldozer Q4 2011) using GCC-4.6 on Linux 2.6.30 or later for context-switch support (since 2009). Since AVX2 (Haswell Q2 2013) the shift OPs (VPSLLQ/VPSRLQ; AVX already provides VPAND, VPANDN, VPOR) are implemented too. Vectorization is mostly untested yet (also SSE2 2*64bit-int, GCC-4 auto-vectorization).
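
A minimal scalar sketch of this step, assuming a permutation table perm[s][i] giving the target site of site i under symmetry s (names and the per-bit loop are illustrative; spinpack's real code works on whole 64-bit words):

  #include <stdint.h>

  /* apply one site permutation to a 64-bit spin configuration */
  static uint64_t apply_perm(uint64_t cfg, const uint8_t *perm, int nsites)
  {
      uint64_t out = 0;
      for (int i = 0; i < nsites; i++)       /* move bit i to site perm[i] */
          out |= ((cfg >> i) & 1ULL) << perm[i];
      return out;
  }

  /* keep only the minimal representative over all numsym symmetries;
     cost is on the order of numsym*numparticles bit ops, as stated above */
  uint64_t minimal_config(uint64_t cfg, const uint8_t perm[][64],
                          int numsym, int nsites)
  {
      uint64_t min = cfg;
      for (int s = 0; s < numsym; s++) {
          uint64_t t = apply_perm(cfg, perm[s], nsites);
          if (t < min) min = t;
      }
      return min;
  }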

Hilbert matrix generation

This is the slowest part, because a lot of bit permutations must be done to generate the minimal (bit-)configurations. Here is potential for (linear) speedup using co-processors or algorithmic improvements (Benes network + vectorization); the core Benes primitive is sketched below.
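
The core primitive of a Benes-network bit permutation is the well-known "delta swap" (see e.g. Hacker's Delight); this sketch is generic, not spinpack's code:

  #include <stdint.h>

  /* swap the bit pairs selected by mask that lie delta positions apart;
     a full 64-bit Benes network is 2*log2(64)-1 = 11 such stages with
     precomputed masks, instead of one shift/or per site */
  static inline uint64_t delta_swap(uint64_t x, uint64_t mask, int delta)
  {
      uint64_t t = ((x >> delta) ^ x) & mask;
      return x ^ t ^ (t << delta);
  }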

100 iterations - mainly matrix-vector multiplications

There are several implemented ways to get the matrix: it can be recomputed, read from memory, or read from disk. Reading is only possible if enough storage is available, which means 3.3TB per node. Disk throughput is assumed to be 70MB/s per disk. The network is assumed to have BW=2GB/s per node and 2us latency. The network is also needed to fetch vector elements from remote nodes; for simplicity it is assumed that remote traffic dominates the transfers.
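
For orientation, here is a serial sketch of the kernel that dominates these iterations, a CSR-style sparse matrix-vector multiply (storage layout and names are illustrative assumptions; in spinpack the vectors are distributed over the cluster, so x[col[j]] can be a remote access):

  #include <stdint.h>

  /* y = A*x with nz~50 entries per row; the matrix and y are streamed
     sequentially, x is accessed randomly (possibly remote) */
  void spmv(int64_t n, const int64_t *rowptr, const int64_t *col,
            const float *val, const float *x, float *y)
  {
      for (int64_t i = 0; i < n; i++) {
          float sum = 0.0f;
          for (int64_t j = rowptr[i]; j < rowptr[i + 1]; j++)
              sum += val[j] * x[col[j]];    /* random/remote access to x */
          y[i] = sum;
      }
  }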

The network is not the bottleneck when reading the matrix from disk: 8 disks deliver 8*70MB/s = 560MB/s, well below the 2GB/s network bandwidth.
Doubling the number of disk drives gives a speedup of about 2, doubles the checkpointing speed and costs only about +10W per node (the cheapest speedup).

checkpointing and HW fault tolerance

Checkpointing is needed for fair scheduling and for (predictable?) hardware failures.

One checkpoint taking 2h every 4 days results in a loss of about 2% of compute power (2h/96h). Checkpointing on demand would save power, but blockwise operation with checks between the blocks, or block redundancy, is needed to reduce the delay.