Sorry for the language mix (german, english).
- speed_tests estimate n1sym=n1/fac_sym by rnd states
- nsend_vy ... by malloc, not static (link errors)
- motivation for symmetry (a4 faster?)
1/sym smaller matrix (num_nz?) = more CPU locality, but sym-overhead
better convergence? (a4 kago,sq num_iterations * time/it = overall-time)
  all submatrices (sym=2) vs. full matrix
sq40.d=3 n1=9880 sym=1 nzx=6.692 2.19min/jsPC1t 3.03566559 3.14900966
k=1 n1=4921 sym=2 nzx=6.692 0.28min/jsPC1t 3.17382831
k=0 n1=4959 sym=2 nzx=6.693 0.29min/jsPC1t 3.22459890 3.28999171
dsyevd t=O(n1^3) ca.8xfaster speedup O(n1^2)
a0 n1=9880 sym=1 nzx=6.692 115it 2.6ns 3.78e8nz/s 4.02657620 4.06319662
k=1 n1=4921 sym=2 nzx=6.692 105it 3.4ns 2.94e8nz/s 4.06319662
k=0 n1=4959 sym=2 nzx=6.693 115it 3.7ns 2.74e8nz/s 4.02657620
trace/n1 : 7.153846154 k=*
trace/n1 : 7.153660012 k=0 lower
trace/n1 : 7.154033733 k=1 higher
d=2 trace/n1 : 8.051282051 6.00648538 6.03103089 ok
a4= +4.12191137 sameTrace t1=4.39634064 t2=4.17636083 wrong lapack !!!
d=1 n1=40 lapack=+6.80066150 lanczos=8.0 wrong lapack !!!
  a4~lapack correct (diag=9.0,nondiag=0.5)
WARNING: You are wasting memory and time using complex mode for real matrix H!
~mpi +thread segfault +dbg=ok valgrind: openblasp uninit values L4041
  libopenblasp threads start too early? (seen with more)
- combined-send-va2a-sz(it+2)+send-Hij(it+1)+send-Hij*xj(it)
on local-compute Hij and Hji_part_slow_sym_needed ???
+ collect_sym_caused_dbl(NZX-loop)
instead of store num_j/blk per line (or extra sendrcv)
- H_compression: worst case complex (N-phase-table)!?
storing norm2_i (pack-idx=1..sym sym=32: 32,16,8,4,2,1=6 (log2(N)+1))
or sq(morm2_i/norm_j)*H (pack_idx=above^2=2*bits)*pack_H_idx
  va) table: allNum 1..255 easy to implement, only some used (gaps)
  vb) table: allFactorsOfN(orSym) 1,2,3,4,6,8,12,24,48 compact, scan-on-SH
  symbased: 1 2(ud or even_sym)
  sym=30: 30 15 5 3 2 1 (all divisors!?) prime_sym=2pack_idx
  sym=12: 12 6 4 3 2 1
  sym=120: number of divisors=16? 2 3 4 5 6 8 10 12 15 20 24 30 40 60 -> 4bit
  # sym-coeff better stored to vector (4bit) + precomputed 1/sq-table
  # defined by sym-table, same for all tasks
  phase factors = gcd(sym.max=N)? -> log2(N) -> 6bit (temporarily used?)
  parameter-index=ca. 2bit
  compressionHinclVcoeff = 4+6+2=12bit = 2Byte (real=6bit)
  compressionH = 6+2= 8bit = 1B (real=2bit)
  nondiag: SiSj +-1/2 only 1bit (see examples)
  pack into 8B-blocks (aligned access + 64bit shift + and), see sketch below
  no compression: Hcplx=2*flt=2*4B=8Byte (real=4Byte)
  (excluding 32bit index?)
  sum-of-diff-phases possible! e.g. +cfg-cfg=0cfg
+phase -phase?
xooxoo => oxoxoo + oooxox + xoxooo + xoooxo
=+5p*xoxooo +3p*xoxooo +xoxooo + 2p*xoxooo = 0
check-compress maxN-loop
+p*oxooxo
+2p*ooxoox
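  A minimal packing sketch for the 12-bit codes above (4bit coeff-idx + 6bit
  phase + 2bit parameter-idx) stored back-to-back in aligned 64-bit blocks
  using only shift+and; names and layout are illustrative, not spinpack code:

    /* pack/unpack 12-bit H-codes into aligned uint64_t blocks (shift+and only) */
    #include <stdint.h>

    #define HCODE_BITS 12u

    static inline void hcode_put(uint64_t *blk, uint64_t i, uint32_t code) {
      uint64_t bit = i * HCODE_BITS, w = bit >> 6, s = bit & 63;
      blk[w] &= ~((uint64_t)0xfff << s);            /* clear old bits */
      blk[w] |= (uint64_t)(code & 0xfff) << s;
      if (s > 64 - HCODE_BITS) {                    /* code crosses a word boundary */
        blk[w + 1] &= ~((uint64_t)0xfff >> (64 - s));
        blk[w + 1] |= (uint64_t)(code & 0xfff) >> (64 - s);
      }
    }

    static inline uint32_t hcode_get(const uint64_t *blk, uint64_t i) {
      uint64_t bit = i * HCODE_BITS, w = bit >> 6, s = bit & 63;
      uint64_t v = blk[w] >> s;
      if (s > 64 - HCODE_BITS) v |= blk[w + 1] << (64 - s);
      return (uint32_t)(v & 0xfff);
    }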
- use base syms instead of product-syms, less L1 cache, fewer L1 misses
  LC40: product-list: 80syms * 64Byte or 80syms * 11bfly8B = 5-6kB L1
  base-syms 2 * 64Byte or 2 * 11Bfly8B = 16-170B L1 + 2*counters
  for future CPUs or GPUs
  GPUs have ca. 1kB/core but 2500 possible (hyper-)threads
  no cache coherence (too many cores), explicit flush/invalidate
  problem feasible or not? test LC6 (needed for correct factors)
- why is MPT SH much more unbalanced than MPI? fix it!
- hamilton_geth_block (reduce expensive NetIO), do not resend iy+sz
store remotely, NetIO-bound speed
- HW-design minNodeMem(max10%Caching for minPktSize) propto latency
minPktSize = latency * BW = 30 us * 10GbE/s = 30 kB
  factor transfers/byte? the nzx=40 matrix dominates RAM
  f64: 1*8B-transfer per Hxy=(8Bflt+4Bx+4By=16B) = 50%
  data volume sendable per node within a 4h walltime = use_node_Mem
  walltime/200 iterations * BW(MB/s) / 50% = 4h/200 * 1GB/s * 2 = 144 GB
  currently 12B per Hxy, less RAM effectively usable
- affinity vs. mpi (JoeS2021-03) not as important as rnd-mem-speed !!!?
  QDR40Gb=3GB/s 8B/3GB/s=2us/nz/nodes, from ca. 20 nodes on == mem-latency
  RAM_2ch*DDR2-800*8B=12.8GB/s cross-remote-BW? lat=100ns
  Ryzen threadripper 64c: 2dies-no-lmem +2dies-2ch-lmem 8c each
  lmem=64ns,2ch*3G2*8B=51.2GB/s rmem=105ns,25GB/s sum=100GB/s/32c
  rndmem= 8B/0.1us= ---80MB/s--- (*channels?) = ---slowest---
  SH=600ns/nz/c log2n1=17+8(ln-search=rnd) speed factor 7.5, hash?
MV= 81ns/nz/c == mem-latency!?
MV8 would hide SH-latencies using hyperthreading!?
mpi is slowest 2GB/s? 2us latency, and 10ns per nz=20B transfer
need overlap mpi-transfer + compute (see below)
+ overlap rnd-mem-access and cpu-intense-computations
speed is rnd_mem*nz_size == nz_size
- every thread computes/loads its own AH(?/pt_n?) part
- thread 0 collects threads 0-n data to one mpi-block rmem (25GB/s)
- mpi thread 0 sends to thread 0 (2GB/s) and again this is thread 1-n
non-local memory (rmem)
- if we split to local mpi tasks, mpi may bundle inter-node traffic
for me? # slower!?
## if only mem latency matters only improvement is memchannels(HW)
## and multivectors(rnd_vec_for_degeneracy) = simplest
### or/and parametrized-matrix j1*H1+j2*H2
- 2019-04 max 32threads,mpi=2,AH=4K malloc 5a80000..d933300 = 95MB..228MB
hamilton_geth_block.buf_idx+buf_vy_iy bsort_bck nsend_iy _vy _cfg nrecv_
static 2GB overflow, see nm -S --size-sort spin
add CFLAGS += -mcmodel=large
- 2019-04: SiSj is not efficient because SiSj is nzx=1 sparse
  better to compute SiSj (or XYiXYj) in parallel for all i,j
  old: per_cfg: n^2/sym* (one-SiSj nzx=1*sym io=nzx *+=2nzx) latency!
noSym: all-SiSj-terms per_cfg: nzx=N^2 io=nzx*Nbit *+=2nzx
4NSym: all-SiSj-terms per_cfg: nzx=N^2/sym*sym io=nzx*Nbit
bigger N^2/sym=N/4 bigger packages (10x) or N..N^2 better
power for this SpMV?
- configure autocheck -DCFG_CPUSET=2 (mostly better, + dyn. switch off)
- set HRMAX to 0 default (optional matrix compression)
output num orbits per IsingE
table: hrnum per model (for estimation) max=max_orbit_len^2
see hilbert.h
- read-out+show /proc/net/dev eth0: TX.bytes packets RX. + lo
- better butterfly commands for bigger int64 (int128 etc.) = solved19.03
example: N=256 pyro bn8=640ns vs. int64 6.4ns + ca. +4ns int64-exchange
sq40: N=256 532ns N=64 4.22ns (2.5% zero-masks)
~/papers/rechner/permutations/
- ToDo: FLOPS vs. memBW? NA*8cfgByte/sym? + 11*8mskBytes/sym
- ToDo: reuse of big vars, less reload from mem/cache = BW
  at least CFLAGS+= -mcmodel=large for big NAH (slower?)
- ToDo minsymcfg saves vec-transfers, replace by micro-loop, use gensym^i
- 20% cpu:NetIO= FLOP:Byte 95% SpMV = 2 or 10(cplx) flop : 2 or 4 * 4or8B
(a3+ib3) += (a1+ib1)*(a2+ib2) = a1*a2-b1*b2 +i(a1*b2+b1*a2) # 4mul+4add
f64: 2FLOP:8Byte=min. c128: 8FLOP:16B c64: 8FLOP:8B=max.
ToDo19:: too much NetIO
  send_y-index (better store remotely) 4B (max_n1/mpi) + latency
recv_v0[y] 4B(f32)...16B(c128) + latency
sum = 8..20B + 3*latency (50..20% improvement possible)
  Q: overlap computation? generate M_2 during SpMV with M_1 (in RAM)
  output values OPs/nz and transferB/nz for genMatrix and SpMV (f32..c128)
  like linpack n1*(nz*... + ... ) + ...
  estimated walltime for 100 It (assume NetwBW=100MB/s if without MPI)
  compute next matrix during SpMV
  ToDo: measurement via jobsystem prolog/epilog? traffic GbE IB
- similar HPCG-benchmark Sparse(Ax=b) usable for fulldiag (TopHPC=30min)
- ToDo: recomputing H on SpMV step is waste of energy + big_walltimes
better use bigger computer memory (LRZ), not for production
as long as recompute-power > storage-power
- try/cmp.perf: SLEPc EPS=Eigenvalue_Problem_Solver Krylov-Schur
SLEPc-syntax similar to PETSc-syntax MatCreate +
MatSetValues(nnz-value-submatrix)
- use Ext-SparseMatrixSolver-Libs for a4 MPI (for bigger matrices)
replace dsyev,zheev,dsyevd,zheevd:
https://math.nist.gov/MatrixMarket/formats.html text-list: row column entry
- www.ssisc.org/lis/ Lis (linear algebra library) from/to-file(stdin?)
incl. lanczos etc.
SCALAPACK dep. on PBLAS
- scalapack for fulldiag: PDSYEVD (dense matrix only? but EV = same size)
http://stackoverflow.com/questions/20706523/scalapack-matrix-diagonalization-pdsyevd
- PETSc sparse matrix (bad unusable website)
wiki https://www.mcs.anl.gov/petsc/petsc-dev/docs/manual.pdf = lost
allow the overlap of communication and computation
yum install petsc-openmpi-devel petsc4py # MUMPS SuperLU atlas scalapack ...
./configure --with-x=0 --with-64-bit-indices
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/
ROBUSTNESS: replace JJ by tU coding ???, l1-size doubles N=32 needs 128bit=slow
ROBUSTNESS: test-code tUJ_S1SYM (tU=0 must be same as JJ)
SPEED: no global (para)vars in hamilton (bad MIPS, bad parallel read to cache)
SPEED: see doc/cfgidx.txt # to reduce slow memory acesses (rndread=60-100ns scfg2idx)
reduce mpi-transfers (max. speed-of-light-latency = 3.3 ... 5 ns/m)
block according to ising-cloud-distance?
- SPEED: store transfersizes per H-block (n1/AH)^2*mpi_n*(rcv)
flags=pointer mpi_blk_sz!=0(readable)
more important than H itself? (2us latency away)
test on Gbit, use for matrix-pixel-graphic?
- FKT: NOSZSYM=1 for non-Sz-commute-hamiltonian SxSx (user-whish-list 2017-11)
test 5N-chain doc/example4_lc5.html n1=18 k=-1 -1.86803399...+1.25000000
-
check: papers/spinpack_rel_pap/kawamura1703_spinpack_konkurenz_aitchi-phi.pdf
Speed.Why_hybrid: Threads(or local MPI) + 1MPI-thread/node
get max packet size on large scale (ca. 64kB/n2n)
IB.QDR=3GB/s (mind.64kB/blk) vs. RAM=10-30GB/s 2016
(1000nodes*64kB=64MB)/1%Mem=min6.4GB/node OK upto 10000*64GB
vs.(50*16core*64kB=51MB)/1%Mem=min5.1GB/core OK upto 50n*80GB/n
IB: minAH=64K*mpi_n/nzx mpi=4K*40/64K=20 ???
Speed.ToDo17: split code + optimize debug+error-code for size, not speed
Speed.ToDo17: do not make n1_blocks of sorted symcfgs
instead use n1_blocks of sorted hash(symcfgs)
for equidistribution!? s1-LC27 v=16+3 - better distribution
balance= 179..408 max_eff=45.6% (speedup max. factor 2)
but less local communication!
DBG.ToDo17: for vvv&16 add temp-bit set by log2-iteration
DBG.ToDo17: output num blocks task_n1/AH (log2-SH-output + %%)
ah_blks (per thread?) balance? stored_blocks?
Speed.ToDo17:
  hasw.bn32 is 2.x times faster than thin.bn8, but SH is only 25% better
  haswell: bn32=9ns *14sym=126ns scfg2idx=127ns ?=105ns sumSH=368ns
  westmere: bn8=16ns *14sym=227ns scfg2idx=125ns ?=104ns sumSH=466ns
  vs. 16B/100MB/s = 160ns (Gbit) or 16B/1000MB/s = 16ns = factor 20 vs. SH
  haswell no advantage for small sym, from 1000*nnz on latencies are only 50% of time
  ToDo: use scfg2idx_blk (compare) + rnd cfgs
  ToDo17: from L1cache.lt.nodes*8B on, scfg2node falls out of the cache! warn?!
  with log2(n1)=30 accesses, ca. 10 accesses (30%) hit L1
  L1/core=16kB / 64Bcacheline = 256 entries (8 accesses from L1) +25%
  L2/core=256KB / 64Bcacheline = 4096 entries (12 tests from L2) +40%
  optimized with subtrees packed into cachelines: *8 entries, *3 accesses
  faster if xorshift(?) hashing? 1 access + chained mem on collision (sketch below)
  hashes with advantages for l1-generation? or cfg2node? balance?
  but locality is lost, hash needs bigger l1[n1+50%] to avoid collisions
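  A sketch of the xorshift-style hashing idea (one expected access, linear
  probing or a chain on collision); the mixer constant and table layout are
  illustrative and not existing spinpack code:

    #include <stdint.h>

    static inline uint64_t scfg_hash(uint64_t x) {   /* xorshift*-style mixer */
      x ^= x >> 12; x ^= x << 25; x ^= x >> 27;
      return x * 0x2545f4914f6cdd1dULL;
    }
    /* slot table sized ~1.5*n1 keeps collision chains short (cost: +50% l1 mem) */
    static inline uint64_t scfg2slot(uint64_t scfg, uint64_t nslots) {
      return scfg_hash(scfg) % nslots;
    }
    static inline int scfg2node(uint64_t scfg, int nodes) {
      return (int)(scfg_hash(scfg) % (uint64_t)nodes);
    }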
ToDo17: parallel MPI + compute (1st step write different yy|vr files)
i=0; aa2a_i-1(=dummy)
(compute_Hblk_i+compute_cfg_dst_i -or- loadHblk+dst_i;
wait_i-1;aa2a_i; round==0 ? storeH_i-1; i++)
i=0; aa2a_i-1(=dummy)
?
SH.nhm: nsend_cnt[MPI=10k]=10k*4B(n1/AH) (str/ld cnt)
nsend_cfg[AH*nzx] compute-iy=cfg2idx (should str/ld iy here)
if(l1) rnd-read-v0[iy]
nsend_iy(to store)|nsend_vy(to compute) [AH*nzx++]
sort_back subblocks to columns + compute|store(y+dn+hr)
i100.geth: ld(idx+destnode,Hr)
nsend_cnt[MPI=10k]=10k*4B(n1/AH) (ld?)
nsend_iy[AH*nzx] rnd-read-v[iy] (ToDo ld remote iy, not send)
nsend_vy(to compute)[AH*nzx++]
sort_back + compute nHv+=vy
  i=0-extra iread_y_0 (possibly gz, thread0 only!) (sketch below)
step i=0 wait_y_0 iread_y_1 v[y]_0 isend_v_0 skip v_i-1
step i=1..n-2 wait_y_i iread_y_i+1 v[y]_i isend_v_i wait_v_i-1 +=v_i-1
step i=n-1 wait_y_i skip_y_i+1 v[y]_i isend_v_i wait_v_i-1 +=v_i-1
i=n-1-extr wait_v_n-1 +=v_n-1
index-name= cnt_ah_blks loop-to total_ah_blks
if recv=ready at wait: t_compute > t_sendrecv (idle_time=MPIwait-MPItest)
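  A minimal double-buffering sketch of the wait_y_i / iread_y_{i+1} / compute
  v[y]_i steps above (the isend/accumulate legs are omitted here, see the MPI
  overlap sketch further down); helper names are placeholders, not spinpack API:

    #include <mpi.h>

    static void consume_block(const int *y, int n) { (void)y; (void)n; /* v[y]_i */ }

    void read_pipeline(MPI_File fh, int nblk) {
      enum { BLK = 1 << 16 };                 /* y-indices per block (example) */
      static int ybuf[2][BLK];                /* double-buffered read blocks   */
      MPI_Request req[2];

      MPI_File_iread(fh, ybuf[0], BLK, MPI_INT, &req[0]);   /* prefill block 0 */
      for (int i = 0; i < nblk; i++) {
        int cur = i & 1, nxt = cur ^ 1;
        MPI_Wait(&req[cur], MPI_STATUS_IGNORE);             /* wait_y_i        */
        if (i + 1 < nblk)                                   /* iread_y_{i+1}   */
          MPI_File_iread(fh, ybuf[nxt], BLK, MPI_INT, &req[nxt]);
        consume_block(ybuf[cur], BLK);                      /* compute v[y]_i  */
      }
    }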
- checkpointing or ckpt_help is more difficult (just wait_* earlier!)
- for old MPI, own implementation or blocked calls (also stepwise implem.)
- improve counter-names!
!!! better store multiple AH instead async blocks to get counter AH_i !!!
  2xMPI_IA...(...req) then 2xMPI_Wait(request,status) allowed? yes,
MPI_File_iread(...req) ompi-v1.2 read read_at ???
also do interleaved geth_mem^n:geth_dsk:geth_compute max.perf
get better nzxmax, dyn?, store every 1000th symcfg?
fix nzx: num data blocks for disk or mem computed at the beginning
dyn nzx: complex handling, unknown size needed at beginning
perf-test? or fail-run will get nzxmax before abort?
dyn_table: ofs_nz_blk
mpibuf_cfg mpibuf_iy mpibuf_ry +
shorten loop by functions getnode_ofs_from_thread_ofs etc.
- interleaved store_mem + storeH (factor = opt_maxmem/opt_max_file++)
- minimize traffic (design? slowest speed?)
old: load_nzAH*8B_from_disk send_nz4B_iy rcv_nz_v1_cplx8B = 20B in 3 steps
v2: load_nzAH*mix8B_from_disk rcv_nz_v1_cplx8B = 16B in 2 steps
v3: flt16? own_c++_class or load/store_float16 = 12B
interleaved - read mem/SSD/HDD to hide slow perf ssd*2,hdd*3
v1: MPI_File_iread, interleave_factor set by environment, pipe.bz?
-------new_design? 2mpi-to-1mpi, rnd_rd-to-rnd_rd+wr --------------
old16: [iy,Hxy] snd[iy] - rcv[vy] vx+=Hxy*vy(seq_write,rnd_rdbuf)
remote: - rcv[iy] rnd_rd[vy] snd[vy] -
new18? [iy,Hxy] Hxy*vx snd[iy,Hxy*vx] - # more simple, faster?
remote: - rcv[iy,Hxy*vx] rnd_vy+=Hxy*vx
# what about mpi in wop()
slower on single core? slowest operation? benchmark_using_rnd_xorshift?
- SpMV prefetch_rnd_rd + seq_update(+=) 60MB/s(1c) ? memspeed? prefetch?
- SpMV prefetch_rnd_update(+=) 50MB/s(1c)
- MPI Geth=100MB/s IB=3GB/s (but smaller packets?)
computer/mem_speed: hxy pf_rnd_rd8B
pse.2*6iX5650-2.3GHz - 6pf=440MB/s 18ns min_prefetch?
kautz.1*6mips-0.5GHz - 4pf=43MB/s 183ns
  see utils/SpMV_speed.c (standalone sketch below, after the timings)
  test1: loop_i1 loop_i2 vx[i1]+=hxy[i2]*vy[iy[i2]];
  test2: loop_i1 loop_i2 vx[iy[i2]]+=hxy[i2]*vy[i1]; # critical vx
n1=1.0e6 181.e6nz/s 5.5ns t1.nzx40 quantum.n24-3GHz 8cores
n1=1.0e6 164.e6nz/s 6.1ns t1.nzx40 quantum.n24-3GHz 2cores
n1=1.0e6 104.e6nz/s 9.6ns t1.nzx40 quantum.n24-3GHz
n1=1.0e6 86.9e6nz/s 11.5ns t2.nzx40 quantum.n24-3GHz t1+15%
n1=4.2e6 60.3e6nz/s 16.6ns t1.nzx40 quantum.n24-3GHz
n1=4.2e6 45.4e6nz/s 22.0ns t2.nzx40 quantum.n24-3GHz t1+15%
n1=1.0e6 6.18e6nz/s 161ns t1.nzx40 mkautz-.5GHz pcc-Ofast+handunroll2
n1=1.0e6 5.74e6nz/s 174ns t2.nzx40 mkautz-.5GHz t1+8%
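  A standalone sketch of the two access patterns (test1 = random read of vy,
  test2 = random update of vx) with random indices from a xorshift rng; sizes
  and values are illustrative and this is not utils/SpMV_speed.c itself:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static unsigned long long rng = 88172645463325252ULL;
    static long rnd(long mod) {                 /* xorshift64 */
      rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
      return (long)(rng % (unsigned long long)mod);
    }

    int main(void) {
      const long n1 = 1000000, nzx = 40, nz = n1 * nzx;
      double *vx = calloc(n1, sizeof *vx), *vy = malloc(n1 * sizeof *vy);
      double *hxy = malloc(nz * sizeof *hxy);
      int *iy = malloc(nz * sizeof *iy);
      for (long i = 0; i < n1; i++) vy[i] = 1.0;
      for (long k = 0; k < nz; k++) { hxy[k] = 0.5; iy[k] = (int)rnd(n1); }

      clock_t t0 = clock();
      for (long i1 = 0, k = 0; i1 < n1; i1++)            /* test1: rnd read vy   */
        for (long i2 = 0; i2 < nzx; i2++, k++) vx[i1] += hxy[k] * vy[iy[k]];
      clock_t t1 = clock();
      for (long i1 = 0, k = 0; i1 < n1; i1++)            /* test2: rnd update vx */
        for (long i2 = 0; i2 < nzx; i2++, k++) vx[iy[k]] += hxy[k] * vy[i1];
      clock_t t2 = clock();

      printf("test1 %.1f ns/nz  test2 %.1f ns/nz\n",
             1e9 * (t1 - t0) / CLOCKS_PER_SEC / nz,
             1e9 * (t2 - t1) / CLOCKS_PER_SEC / nz);
      return 0;
    }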
- balance: biggest cfg = maxnzx, interleave threads(+mpi?) via AH?
local nz computation instead of transfer? no, effort = CPU*mpi_n
or backward nz computation? v0+= H*v1 (send rnd ix,rr?) 1step but rndr/w
- np.parallel using MPI_File_write_ordered OMPI-1.2-2006.09
- move CPUwork from SH to i100 part (do async I/O first!)
  e.g. norm_factor=sqrt(n2[x]/n2[y])=sqrt(n2[x])/sqrt(n2[y]) per table (sketch below)
  possibly saves nzx*sqrt time! output num_diff_n2!
  need cplx_phase_table[x,y] too?
  hr=idx{Hr=0.5only?,phase,n2x,n2y} = less_bits?
  gives num_hr, not num_diff_n2^2 (but in practice we do not have all combinations, test!)
  but have to store n2 or recompute (after checkpoint reload)
  then splitting hr into off_diag_hr (smaller!) + diag_hr (compressed)
  makes loop shorter ... or
  even off_diag_hr can be generated directly before SH (same for all threads)
  the parameter dependency can also be put there, to avoid matrix recomputation
  = factorized matrix J1*hr[1]+J2*hr[2]
  check that improvement on a tiny problem!
  what is relevant if t_network > t_H_recompute?
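  A sketch of the precomputed norm table: store sqrt(n2) and 1/sqrt(n2) for
  the few distinct norms, so the inner loop does two lookups instead of a
  sqrt per nz; the n2-index mapping and names are assumed, not spinpack code:

    #include <math.h>

    #define NUM_DIFF_N2 64     /* assumed small; output num_diff_n2 to verify */
    static double sq_n2[NUM_DIFF_N2], rsq_n2[NUM_DIFF_N2];

    void init_norm_tables(const double *n2_values, int num_diff_n2) {
      for (int i = 0; i < num_diff_n2; i++) {
        sq_n2[i]  = sqrt(n2_values[i]);
        rsq_n2[i] = 1.0 / sq_n2[i];
      }
    }
    /* per nonzero element: norm_factor = sqrt(n2[x]/n2[y]) as two lookups */
    static inline double norm_factor(int n2idx_x, int n2idx_y) {
      return sq_n2[n2idx_x] * rsq_n2[n2idx_y];
    }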
- show matrix chain/2D/3D without sym vs. with sym (sorting = problem?)
see papers/spinpack_rel_pap/sparse_solvers_GHOST_PHIST_CP_thies2015.pdf
ToDo1503: test-suite NN=40,64,128 * JJ,tJ,tU * dietlibc,mpi,pthread, ca.50% MemSH
test on knoppix64 as 32bit=(-m32 -static) 64bit=(-m64 -L/usr/lib64) g++ gcc
also test _OPENMP + mixed MPI+PT/OMP code
diet.x86_64 has no sqrt+cos+sin, m32 only
test NN=512
ToDo: check src/*.[ch] ToDo's, may be some are easy to fix
ToDo2017: ansatz-auto_check, last check v1.4.7 2001 (for bigsys-speed_test?)
ToDo: add opt_performance_estimate_only store AH as n1 subset only
ToDo: see y2016/lrz* use both maxhfile for disk, maxhmem for mem
test mpt icpc (karman=AMD) -ftree-vectorize ignored, 64threads=Ok
module load torque;module load intel/cc/10.1.008 # dflt=32bit
module load torque;module load intel/cce/10.1.008 # dflt=64bit
  -march=core2 -mcpu=core2 -static # generates 32bit +SSE3 (dflt=pentium4)
PC10.bn8=29ns (but PC10.gcc493.bn8=22ns)
-xO (core2-duo+sse3) but PC10.is_dual-core
-xP (core-duo+sse3) PC10.bn8=29ns
PC10: pcc-1.2.0 fails linking (-O2 compiling)
quantum: g++485 5-30% faster than clang342
vm8: CC1=gcc CC2=dietc gcc CC3=gcc -std=c99 CC4=clang CC5=tinycc CC6=gcc41
noSym(div0?) LMsym(sameE) n1=0 4parametersets(memleak?)
test partly stored SH! using hfile! + 2 threads + mpi(if defined?)
test 2*a0, test sizeof(t1.xy[0]) = thxy = int128
  e.g. testJJ64cplx32SH0
sym_k=4 2; n1=0 (sym_k=2, 0+40), with/without symmetry (k=-1000)
ulimit<SHsize<maxfile: nonmpi=old=FAIL,new=OK mpi=hang
maxfile<min(ulimit,SHsize): hfmax=4e6(94%) nonmpi=OK mpi=?
  int32 (-m32) and n1>2e9 (1 bit for next line (remove!?)) ???
- use GOMP_CPU_AFFINITY='0,2,4,6' (gcc+intel)
ToDo1601:
- for eigenvectors + LM show SS and JJ for each LM-cluster (max?)
- set type_lm to useful values
- add tU/tJ/JJ lm-tests to make test chain=5 s=? see sym_tU.txt
largest doable tU 2s=4 (can contain . u d 3) chain without S1sym in 10min?
statH 4*4 submatrix num_elements
check biggest lc 2s=4 maxN or N=8 maxS Tnorm2 needed!
test max lm_factor using lm_factor_double, speed __int128 vs. double?
- check OVL+speed chain N=12 2S=7, cubocta N=28 2S=2
ToDo 2015-12:
- stability_Status/OpenProblems/ToDo release v2.50 as stable base!!!
- simplify S1SYM-code (+check lowest nzx) in hamilton_nhv +tU 2.51?
  save the problems found so far as tests!
- test BFLY code speed (lc_s1 + s5/2 ico) ok, but ns.r too slow vs. SH!
short look = missing sign code for tU, improvement not clear + 2.51 else
analyze on small model
check if we have !ismin_lm on test !!!
- parallel ns.recursive!
- on-same-node-mpi is slower (smaller pckts) and consumes more energy
ToDo: pthread or OMP within node (bigger packets, only one core polling)
- mpi+bigmalloc (90%) + access often causes SEGFAULT instead of a failed malloc
ToDo: big malloc (90%) before, not after initialization (difficult)
2.51
  check with t100!
- use set/getrlimit() for SH-storage (remove HBLen, use (HBLen=)AH*NZX)
  getrlimit? estimate via scaling of higher-Sz cfgs/symcfgs or output MC ratio
test in configure
- l1 write node i (+node caching), read node j often "read-error" + abort
ToDo: write node i, read node i + send to j via MPI (avoid w/r via share)
check! 2.50 tbase.c:load_l1(*l1,len,ofs) mymap(FILE_l1,ofs) = bad for MPI?
but better than 2000 files? (not robust!) 2.51? 2.50?
- give hint that parallel-ns is too slow above ncfgs>1e15 (30h*1000cores)
  see S=1-LC28 ncfgs=7.65e+15 144*16t100 > 40h
  on serial ns let other tasks idle? or better use OMP right away?
  for ^PRF: cut off all time-dependent outputs in the middle of the text!
  for diffs! or simply cut off t/m=? PRF for grep performance?
- benes for tU, tJ (no speedy sign code) correctness tested for N=4
N=16 8+8 t=-1 U=8 gcc492-O1 cxx0_1+ n1=2.58e6 ns=0.17m SH=1.18m E=-4.17493
testcxx8,cxx0(2m) + LM?
- remove HBLen, use AH*NZX-packets+size (ifundef HB=AH*NZX ??)
- change search, store and use of LM (more regulary) ???
  e.g.: as shift_to_next_site_in_lm-cluster=N + shift_to_next_lm_cluster=1 (defs1)
loop of symsearch never between higher sites in lm-cluster (faster without S1SYM)
what about s1-s3-s1-s3 chains? lm1.i0 lm1.i1 lm2.i0 lm2.i1 lm2.i2 ...
lm3.i0 lm3.i1 lm4.i0 lm4.i1 lm4.i2 (-a format)
or list of different lm-clusters.i0-idx + lm.len + next.lm.i0
- measure mpi_sendrecv-time but stop if max .lt. 1s or 1it (less syscalls)
- ibtraffic available at linux? like ifconfig Bytes?
/usr/sbin/perfquery -x
n114: 79MB/s TX + 79MB/s RX 811kpcks/s ca. 100B/pkt = full load 72nodes
mpi_stress: 128B / 76MB/s, 1k/415MB/s, 16K/1.5GB/s, 1M/2.2GB/s 36nodes
- visual output (matrix) of traffic_sum node i to j, or min/max
- 2.48 NN=84 fails to compile (s=7/2)
- mix mpi+pthread fails, speed_test.nhv sometimes hanging or segfault
(slower, but more powersaving)
- fix tJ + S1SYM + rekursiv
- fix serial a64 + parallel --chkpt_load=2 divergence (+ l1_0000.dat ronly)
see lc_s1.gpl N=28 n1=18735341583 \* 7 l1=123GB h12-h21!=0
only if n1 mod mpi_n is not zero!? try 159*7t100 mod=0
ToDo: improve speed Cuboctahedron N=12 s=7/2 reimar2004 v2.22 48sym n=84=42+42
  n1=35.6e6 norm2 up to 3e20 4marvel n2=39h xnz=39.5 SH=13.5h i100=6h
better estimated n1 + nzx by MC max 1s
models/cubocta.def
  quanta2 ns=10h SH=6000m=100h too slow
  PRF: minsymcfg_bnV8 t=16.4s loops/s= 2.7e+03 t[ns]/cfs= 45524.28 2 99i
  multithread mpi+mpt+openblas-pinning maxscfg=12m,
  new: mpt+pinning bnV8=1600ns but symcfg_bnv8=133ns maxscfg=9m SH=230m
try: defspin1.sh -s {2...n} [daten.def] also -a !
- recursive parallelization ln2(tasks)=bitdepth_start 2tasks=-1bit (S=1)
- recursive progress, higher z count (below 1s) + modulo to tasks
add to todo/hist or c-code as sample
ToDo: bad nzx from random, do some nzx-max iterations?
get better n1 estim. by MC?
speed-estim: t9_42.42sym.sz4.n1=6e9 18*10 SH=148e6nz/s i100=2e9nz/s
= 16GB/s / 16nodes = 1GB/s/node (measure on Gb, compare)
ToDo: spins L490 see bugs 12.05.15
ToDo1503: see speed_mpi.gpl L900
+ check ToDo's in speed_mpi.gpl (slow pt on kautz, needs memlocal sym)
+ speedup vlint (using bit[pair]set + bitget as shift + and, function?)
/usr/include/c++/4.7.2/x86_64-linux-gnu/bits/c++config.h
+ add pragma no-unroll to lm (code size) and unroll to VS-loop (vectorize)
check asm, where vectorcode was produced and its speed
+ add WARN if NOS1SYM
XXZ-model + field - howto ?
ToDo150409: replace ERR(630)(all nodes) by one node warning
  add estimate_n2lm=nzLMx + n2sym from n2speed_test_MC1000rnd_configs
  needed for s=3/2 n=3*18 better estimation,
  also rr-bits-(n2sym,params,factor) for better parallel storeH-ckpt
  or b16-minifloat=10m+5e+1s=16bit ? 1/8=.125 + b16 accuracy test
ToDo1504: spins.c L610 hfmax changed (bad idea?), update min b_*[blk] via MPI!
ToDo1503:
openmpi behav. under memory pressure? (out of memory?)
HyperThreads effectively reduce Cachesize/task (slower)
need better cache-hit-rate (contradict)
use __thread for sym on SC_MIPS (3-4*slower on mkautz! own copy is better!)
(L1 overflow?)
http://www.akkadia.org/drepper/cpumemory.pdf
robustness: (what happens, when I/O is away for a while, test qemu)
nfs no read daten.i
CHECK! 5000 cores 20MB memory reserve not enough, ompi hangs up (no error)?
  ok for 28*400MB meggie 10+30 24% in mem (372MB)
  possibly memcheck-pg per exec + kill after timeout before malloc?
- with chkpt6, handle duplicate tri.txt lines, correct It? +e0??
+ tU coded as uuu,ddd (recycle sym,bfly etc. as u*d)?
- buggy tU N=150 e3=...
  output GB/s at the first GB of transfer!?
ToDo: fault detection/tolerance
Immunity-aware programming, compare Matrix-xor-checksum per iteration?
ToDo: replace tphase as int in inner loops (multiply cmplx sign outside)
ToDo: 1505 replace tbase by tbc bascfg/bitcfg or vlint? replace vlint.h
by C-inline-functions/macros bc_[sg]etbit0,getbit1,getbitN,lshiftN
bc_and|or|xor (minimum needed for highperf lbrkN + benes)
  for speedup on big NN + elimination of the c++ class (sketch below)
  try use __m128i // __SSE2__
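  A minimal sketch of such a C bit-config type replacing the C++ vlint class:
  a fixed-size array of 64-bit words with get/set/shift/xor as inline
  functions; the bc_* names follow the note above but the layout is assumed:

    #include <stdint.h>

    #ifndef NN
    #define NN 128                          /* example number of sites */
    #endif
    #define BC_WORDS ((NN + 63) / 64)
    typedef struct { uint64_t w[BC_WORDS]; } tbc;

    static inline int  bc_getbit(const tbc *c, unsigned n) {
      return (int)((c->w[n >> 6] >> (n & 63)) & 1u);
    }
    static inline void bc_setbit(tbc *c, unsigned n, int v) {
      c->w[n >> 6] = (c->w[n >> 6] & ~((uint64_t)1 << (n & 63)))
                   | ((uint64_t)(v & 1) << (n & 63));
    }
    static inline void bc_xor(tbc *r, const tbc *a, const tbc *b) {
      for (int i = 0; i < BC_WORDS; i++) r->w[i] = a->w[i] ^ b->w[i];
    }
    static inline void bc_lshiftN(tbc *r, const tbc *a, unsigned n) { /* n < 64 */
      uint64_t carry = 0;
      for (int i = 0; i < BC_WORDS; i++) {
        uint64_t v = a->w[i];
        r->w[i] = (v << n) | carry;
        carry   = n ? (v >> (64 - n)) : 0;
      }
    }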
1503 cache-problem? cachegrind 1% hamilton_geth_block
  # meggie.speed: SH=2480ns/core vs. 5.3ns*160=848ns factor 3!!!
  # note: also too strong AH dependency! (cachesize?)
-t1
54.47% src/hilbert.c 1314-1360
33.91% src/hilbert.c 1339
23.66% src/spins.c 1369-1544
-t3
88.18% src/hilbert.c 1314-1360 slower on mkautz!
54.89% src/hilbert.c 1339 # sym[i][j]-loop
5.47% src/spins.c 1369-1544
nhm_line.(ii=0...hbuf->n).minsymcfg_dflt(hbuf->el[ii].bj, bj); L700
struct hbuf.struct shelem{bj,blkj,jj,rr}el[NZXMAX*NUM_AHEAD_LINES] change?
  unfavourable for cache + vectorization
  better: hbuf.H_bj[NZXMAX*AH],H_blkj[..]...
symcfg,sign,norm2 []
ToDo:
minsymcfg_lNbrk2(...,struct tsym *threadlocal_sym_copy) # ToDo! 2015-03
  don't use global (thread shared) data (nn,nu,nd,sym), see sketch below
kautz/MIPS: 6threads: 380ns(global)/48ns(local copy) = factor 8 faster!
1thread 40ns!
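  A sketch of the thread-local copy idea (factor ~8 on kautz/MIPS per the
  numbers above); struct tsym and sym_global are placeholders for the real
  spinpack types, not the actual layout:

    #include <string.h>
    #include <pthread.h>

    struct tsym { int nsym; unsigned long perm[64][64]; };  /* placeholder layout */
    static struct tsym sym_global;                          /* shared, read-only  */

    void *worker(void *arg) {
      struct tsym sym_local;              /* private copy, stays in local cache */
      memcpy(&sym_local, &sym_global, sizeof sym_local);
      /* ... call minsymcfg_lNbrk2(cfg, &sym_local) in the hot loop ... */
      (void)arg;
      return NULL;
    }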
ToDo: remove HBLen, by table.nz[n1_blk/AH?] (l1,v0,v1,ev,nz) ???
  simple loops, disk-store partly or completely unfilled blocks?
  e.g. sq40j1.8+32 nzx: mean=27.3 min=13 max=33(=nu*4+1) +21%
  e.g. sq40j2.8+32 nzx: mean=53.5 min=31 max=65(=nu*8+1) +21.4%
  e.g. sq40j2.6+34 nzx: mean=42.85 min=27 max=49(=nu*8+1) +14.3%
  + mpi-transfer complete AH-blocks?
ToDo: ns preset tu from nu for recursive (ca. n choose tu)/sym ca. 2*mpi_n
u8=482e3 u10=5.3e6 u12=35e6 + store chunks
ToDo: remove mmap, load l1_0000.dat via node01 to other nodes (local disk)
would work without tmp_shared too
dmtcp_launch
/opt/ompi-1.8.4/bin/mpirun --preload-files spin,daten.def,daten.i\
--bind-to core -H node01,node02 -np 4 ./spin
ToDo: sighand.c: better sigusr1 only (let usr2 for dmtcp)?
  + use timing = 2*SIGUSR1/10s = start MPI-traffic-free checkpoint window? (sketch below)
ckpt-window MPI_Bcast(sig1) if(sig1){print_ready_for_ckpt;sleep30;p_cnt}
before MPI_SendRecv or next_ahead
per CKPT_HELP (overhead? no)
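  A sketch of the MPI-traffic-free checkpoint window: the SIGUSR1 handler only
  sets a flag, all ranks agree on it via MPI_Bcast at the next iteration
  boundary (before MPI_SendRecv / next_ahead), then go quiet for the external
  checkpointer; timing values are illustrative:

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <mpi.h>

    static volatile sig_atomic_t got_usr1 = 0;
    static void on_usr1(int s) { (void)s; got_usr1 = 1; }

    void install_ckpt_handler(void) { signal(SIGUSR1, on_usr1); }

    void ckpt_window_if_requested(MPI_Comm comm) {
      int flag = (int)got_usr1;
      MPI_Bcast(&flag, 1, MPI_INT, 0, comm);   /* rank 0 decides for everyone */
      if (flag) {
        printf("ready_for_ckpt\n"); fflush(stdout);
        sleep(30);                             /* traffic-free window */
        got_usr1 = 0;
      }
    }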
ToDo: auto-define CONFIG_PTHREAD for B_NUM > 1??
ToDo: make test diet gcc -O2 testsqrt.c /usr/lib/libm.a # -lm does not work!
or just test __dietlibc__ compare gcc -Os vs. diet free/usable mem
maxmalloc mpi_i=0 danach mpi_i=1 ... (node structure?)
ToDo: remove mmap() l1_0000.dat (advantg. swapping, disadv. C/R open file)
SH stored 100%: l1 needed until 100% reached, release gives advantage
for checkpointing less mem (but remove file is better)
also needed for eigenvalues later
SH not or partly stored = no advantage of mmap (except swapping = bad)
before a0 ftell+fclose+a0+fopen+fseek daten.i (mod + closed file)
ToDo: vectorization + inhomogen MPI (memory Nodes + CoPros a la XeonPhi?)
minsymcfg_loopN(cfg) via 4sym-vector,
better do 4 vector-parallel-syms on same cfg and min at end = minsymcfg
ToDo: Overlapping Communication with Computation (1Gbit for 3GHz-core) 2015
  replace MPI_SendRecv by
  concurrent computation MPI_Isend + MPI_Irecv + CPU + MPI_Wait (sketch below)
  + output if not enough bandwidth
new module needs 3 (pipe-)buffer for each H,Hv:
- nodes*MPI_Isend+MPI_Irecv H(i-1), (H*v)(i-2); i is (AHEAD) block idx
old was a loop, but new is I_All_to_all
- optional async read H(i), store H(i-1) + system-ahead i+1, wcache
- compute H(i) or readH(i), (H*v)(i-1), (vHv,v+=Hv)(i-2)
optional RAID-update v(i-2)
- MPI_Wait and opt. I/O_wait
- Buffers: Hi_send[ahead*nzx]
Hi_recv[ahead*nzxmax]
Hi_comp[ahead*nzxmax]
3*Hvi[ahead]
+ 2 pipe-fill-steps (loop over 2+n1/ahead)
speed = minspeed ( cpu(ahead*nzx/s), mpi(ahead*(nzx*8B+4B)/s)
ToDo: benchmark both in spinpack! PRF
  MPI_All_to_all benchmark !!! vs. balanced pktsize for target nodes
  only makes sense from 4 test nodes on, better 8
  stat: number of pckts up to 4k, up to 16k, up to 64k, above that ToDo
max ahead = avail_buffer_space/task = ca. 10% of avail. mem / task
max pckt size = max ahead * nzx / num_threads
put this descr. to speed.html and leave a link to it
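  A minimal sketch of replacing MPI_Sendrecv by MPI_Isend + MPI_Irecv +
  compute + MPI_Waitall, so block i-1 is processed while block i is in
  flight; buffer names follow the description above but are placeholders:

    #include <mpi.h>

    void spmv_block_loop(int nblk, int right, int left, MPI_Comm comm) {
      enum { AHEAD = 4096 };
      static double snd[2][AHEAD], rcv[2][AHEAD];
      MPI_Request req[2];

      for (int i = 0; i < nblk; i++) {
        int cur = i & 1, prv = cur ^ 1;
        /* post transfer of block i (Hi_send / Hi_recv) */
        MPI_Irecv(rcv[cur], AHEAD, MPI_DOUBLE, left,  i, comm, &req[0]);
        MPI_Isend(snd[cur], AHEAD, MPI_DOUBLE, right, i, comm, &req[1]);
        if (i > 0) {
          /* compute on the already-received block i-1 while block i is in flight;
             the result fills the buffer that is sent in the next iteration */
          for (int k = 0; k < AHEAD; k++) snd[prv][k] = 2.0 * rcv[prv][k]; /* placeholder */
        }
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
      }
    }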
- remove threads + function calls ?? (not allowed in OpenCL), replace
by parallel-pragmas?!
- I/O store 50% not 0..50% but interleaved block on disk vs. on RAM
for better background load from disk (maybe interleaved recompute/diskload?)
works only if memory is allocated at the beginning (known size ram + matrix)
ToDo: power reduction using MPI_Isend,Irecv + MPI_Test idle usleep(1)-loop (sketch below)
usleep may be lengthened by the granularity of system timers
100*usleep(2) = 0.1s comp2=+1ms/usleep GbE=100kB/1ms
100*usleep(1000) = 0.2s == 100*nanosleep(1ms)
100*usleep(2000) = 0.3s == ! ! ! Problem ! ! !
depends of linux scheduler! 10-50ms or busyloop
in/out port 80 takes about 1us on x86 inb_p() or inb() asm/io.h
sched_setscheduler() SCHED_FIFO or SCHED_RR
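  A sketch of the power-saving wait: poll the outstanding request with
  MPI_Test and usleep between polls instead of busy-waiting in MPI_Wait;
  the sleep granularity caveats above (scheduler may stretch usleep) apply:

    #include <unistd.h>
    #include <mpi.h>

    void lowpower_wait(MPI_Request *req) {
      int done = 0;
      while (!done) {
        MPI_Test(req, &done, MPI_STATUS_IGNORE);
        if (!done) usleep(1000);   /* ~1ms; real granularity depends on the scheduler */
      }
    }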
ToDo: power footprint on kautz
32quanta1.idle=280W ns=500W SH=520W i100=570W 2.2GHz 2015-02 t/nz[ns]=112.8*16
32quanta1.idle=280W SH=390W 1.2GHz
ToDo: failure tolerance ?
checkpointing or redundancy (assuming 99% of memory changes its data)
- 8+1RAID5 redundancy needs lot of network bandwidth!
+ blocks of old and new data between check + update points
  block size max. 10% of memory for efficiency = network transfersize
additional block for overlapping transfer + compute
# block operations are good in general! speed
# using double memory or fast storage
networktransfers/iteration = log2(nodes/raidnodes)*memory
log2(80/10)*256GB=3*256GB/(IB=3GB/s)=256s=4min for save/restore
compute time should be 4min++/Iteration to have no cpu losses
does not help on total failure or job switching (or nonvolatile mem)
- checkpointing
checkpoint-interval must be smaller (1:10++) than failure interval
need storage = 2*memory, depends on storage speed
min storage: 2TB / 70MB/s == 256GB / 12.5MB/s = 5.7h
(chkpt-interval=60h, failure-interval=600h=25d)
one drv/node: 256GB / 70MB/s = 61min
(chkpt-interval=10h, failure-interval=100h=4d)
libs: not much implementations, no change of node numbers
own: restart with different node number is possible + crc
virtual SMP could help to use SMP checkpointing on clusters!
- all mallocs + errhandlings synchronized to avoid different thread-mem
  consumptions?
  problem with partly stored SH! background MPI also solves this, see above?
  use multiple of AH + zerofill,
ToDo: see speed_estim.html (compute time for full cluster memory usage)
can be used to buy spinpack-optimized compute cluster
ToDo: reduce L1-cache needs, using generators and list of gen_idx
through all syms, see also butterfly.txt for symconfig speedup
160sym 8+32 lNbrk=4.9ns SH=160*10.0ns=1600ns
40sym 8+32 lNbrk=6.7ns SH= 40*17.5ns=700ns sym_k=-13 0 -2 0
20sym 8+32 lNbrk=8.2ns SH= 20*28.4ns=568ns fit tSH=400+7.5*sym
20sym 8+32 lNbrk=7.5ns SH= 20*26.9ns=538ns NOS1SYM-6%
=520ns -Hstat-3.5%
20sym 6+34 lNbrk=7.3ns SH= 20*22.3ns=447ns = 285+20*8ns
4sym 6+34 lNbrk=11ns SH= 4* 83ns=333ns = 285+ 4*12ns
1sym 6+34 lNbrk=9.1ns SH= 1* 285ns=285ns
  =266ns -Hstat-7%
bondloop=80 + (nzx=33)*(nsym=160 + ln2tasks)
ToDo: hamilton_nhv like fast_hamilton_nhv (optional store H additionally)
nhv(cfg[AH],r[AH]?) = scfg[AH*nzxmax],sgn[AH*nzxmax],n2[AH*nzxmax]
mem+20%(better vect+omp) or nzx[AH](mem+1./nzx)
  with nzx, is vectorization+omp simpler? fillzeros?
  norm2 computation? effort + 1/nzx or (byte-) vector? const nzxmax
replace XY_FLAG by separate x entry (mem+...%) but sort HBSize
for tasks (const. package size)
old.xy_flag(mixed blocks): xy+by+rr=4B+2B+2B=8B + buffers
new.xx_idx(sorted blocks): xx+yy+rr=4B+4B+2B=10B + const.pktsize
-d$TMP for batchjobs? (better remove tmp dependency of code)
bug: if ulimit -v is smaller than maxfile, i100 fails ??? fixed 2011?
bug: hubbard model nu=nd=N n1=0 (should be n1=1)
- adapt to clusters without shared FS
(how to distribute sequence of data of unknown number? blocks?
or dry run for counting only)
simplest: switch to count only after OOM, store the last
stored and counted scfg and may be every 1024th?
2*nodes counted_ranges: 1 1 1 1 to 2 2 ... +2+2 to 4 4 ... +4+4
store start scfg + foundnum scfg + end_cfg + time_needed
so we have stored ranges and counted ranges
2nd round: recompute the counted only on the right nodes
optional "stop" per nonblocking MPI from OOM node?
OR roundrobin 10e6cfg-chunks(doubling if under 1min) + list of reallocated
scfg-blocks (sort blocks in 2nd run, testversion: do it parallel to
disk, if it works, remove disk code)
chunks of size of max. free space
{chunkidx, startcfgORidx, stopcfgORidx, numscfg, time, *scfgs...}
tree-algo usable for start and stop tree partitioning?
  stop at 1st depth-nu-cfg of depth-(nu-4) ??? problems?
- make an example.html page for different physical models
and put the link to the README
- parallel sort for fulldiag by E_Ising=Ez using a bitonic sorter
http://www-i1.informatik.rwth-aachen.de/~algorithmus/algo12.php
- remove maxmem from daten.i (maxmem+usemem? from the *.c) jobsystem!
set 0 as default (2011-12-09)
- generate matrix.pgm earlier in fulldiag, scaled for bigger matrices
up to n1=200e6 (fits to 2GB)
to 512..1023 pixel from getH without storing full sparse matrix 2011-11
- http://graphics.stanford.edu/~seander/bithacks.html
Swapping individual bits with XOR
+: using v & -v last bit counting?
0010100 | 0010011 = 0010111 (v|(v-1))+1 01... to 10...
0010100 & 1101100 = 0000100
... -1 = 0000011
  x = ((b >> i) ^ (b >> j)) & ((1U << n) - 1); // XOR temporary
  r = b ^ ((x << i) | (x << j));
  etc. (complete function below)
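  The XOR trick above as a complete 64-bit function: swap the n-bit fields
  starting at bit i and bit j of b (fields must not overlap); a small
  illustrative helper, not an existing spinpack routine:

    #include <stdint.h>

    static inline uint64_t bit_swap_range(uint64_t b, unsigned i, unsigned j, unsigned n) {
      uint64_t x = ((b >> i) ^ (b >> j)) & (((uint64_t)1 << n) - 1); /* XOR of the fields */
      return b ^ ((x << i) | (x << j));
    }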
- replace vdate.h(#define) by vdate.c(const char) to avoid recompiling spins.c
or split spins.c (add main.c?)
- ns write sns to local files and concat explicit (remove bottleneck NetFS)
filter n1/threads ... 1M-chunks for better distribution
- use generator_syms^ni instead of storing all syms (saves cache! more speed)
  generate products in smallest() + optional permute via tables (sign?), sketch below
+ fast permutation by tables
http://microcontrollers.wordpress.com/2011/03/11/how-to-do-really-fast-bit-permutations-with-few-operations/
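  A sketch of the generator idea: keep only two generators g0,g1 (as site
  permutations) and apply g0^j * g1^i on the fly inside smallest() instead of
  storing all n0*n1 product symmetries; apply_perm() and the flat permutation
  arrays are illustrative, not the spinpack representation:

    #include <stdint.h>

    typedef uint64_t cfg_t;

    static cfg_t apply_perm(cfg_t c, const int *perm, int nn) { /* permute site bits */
      cfg_t r = 0;
      for (int b = 0; b < nn; b++) r |= ((c >> b) & (cfg_t)1) << perm[b];
      return r;
    }

    /* representative = smallest config in the orbit of g0 (order n0) and g1 (order n1) */
    cfg_t smallest(cfg_t c, int nn, const int *g0, int n0, const int *g1, int n1) {
      cfg_t minc = c, a = c;
      for (int i = 0; i < n1; i++) {
        cfg_t b = a;
        for (int j = 0; j < n0; j++) {
          if (b < minc) minc = b;
          b = apply_perm(b, g0, nn);   /* next power of g0 */
        }
        a = apply_perm(a, g1, nn);     /* next power of g1 */
      }
      return minc;
    }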
- use recursive n1 and split search paths to threads
write nu=0 to file0
read file0 and write nu=1 to file1, rename file1 to file0
read file0 and write nu=2 to file1, ...
stop if nu reached
- suspend/resume in parallel mode (no storedH, v0/v1 only)
p.e. suspend after next 20 iterations,
mips-cluster.kautz dump.mem2TB/(200MB/s=2Gb/s)=167min=2.8h wtime=28h++
sq48 n1=168e9 *(4+4+6)=2.4TB
s2tri27 ... (ns try both serial +parallel and break slower method?)
recompute l1 if file l1_0000.dat is removed or bad (check last!)
- check all (quasi-parallel) disk operations (ns.l1=OK, rw_v=OK 2012-05)
- hr_restore: set to 0 + recompute missing indexes instead of thousands of files?
use maxnzx as static size for simpler code? but 10-20% more SH-memory
- check: libckpt (user-directed checkpointing)
- program memory-sparingly (cache); make recursive numscfg work with an arbitrary
  start and stop point (e.g. early return and continue later, or on another
  thread, and interrupt after a finite time)
  example config space (N=5,S=1) ...
  chkpt.resume nsymconf() from l1(?) + n1 (save last testcfg every 2h? chkpt0?)
  example: N=40-chain syms: 2 non-commuting l=2-syms
40syms generated by s0,s1,s0*s1,s1*s0,s0*s1*s0,... compact code?
- maxscfg by exclude higher empty subtrees
- skip complete a0-run on bad k_sym !? (avoid long runs) ??? if it's easy to implement
- try http://dmtcp.sourceforge.net/ distributed-mt-checkpoint-userlib
- test triangle48 sym=192=4N n1=ca168e9(l1=7n1=1.1TB=100MB/s*11000s(3h))
split l1 on failure? or per node or 256 threads ...? l1=216GB/4h
md-raid0 for 2TB (fuse?)
  fixed partitioning (equal size or equal nodes)
  or maxsize l1_0..63 (1TB/8=128GB)
  l1_%4d.dat in 200GB/Bsize chunks (links to different FSs, striped?)
- test triangle-s2 N=2*27=54 e= 3 3 0 9 (tU=108bit) nud=54,53
3+51 NoS1SYM n1=15540 SH=0.3m 54.70017464 30m/100It
S1SYM n1=3627 k=-1000 54.70017464
S1SYM n1=25 k=0 54.76837837 0min
see s1_triangle.gpl
- warn on l1 writing on long=32bit systems (split files?)
- ToDo: fermionic sign for b_smallest_lm() (LM/S1SYM) ???
- ToDo: test resume after break during checkpointing (incl. ev)
  + robustness against data errors?
- checkpoint resume after 2nd++ data-set (a0...a0)
ToDo: problem bad get_maxscfg for parallel speedup (47+1 8sym 000- n1=6)
- parastation-mpich send_16MB_from_all_to_task0 causes SEGFAULT on task0
test ulimit=4GB 4tasks lowsym n1=32e6(+16MB=OK),n1=225e6(730M+16MB) 8m/It
test: q.mpiexec -l -m ... ./wrapper.sh: ulimit -v lowmem + nice -19 spin
  behaviour under memory shortage with mpich,
  limit=70MB (4*64M+16M fail L229) ToDo: abort cleanly!?
limit=84MB (4*64M+16M) OK
ToDo: a2 2x2 mpi-version! scaling?!
- computes only one data set a0? if multithreaded
ToDo: ccNUMA tips, clear diskcache to allow local malloc !!!
ToDo: output i100.t during iteration first on every 10th, then only if changed by 10%
for smaller diffs
ToDo: oprofile (2013-04)
- +dietlib -printf
- struct s_float + cast op() test accuracy
- aio +
interleaved io (m threads writing/reading to/from one hnz-stream)
  pipe_read (+transparent mpi, const. block count/length?)
h_file.c
- replace itime by ftime=gettimeofday.s+us*1e-6 or MPI_T, for better shortruns 2013-04
- add CPUSET for 64...1024 (see memspeed.c)
------------------------------------------------------------------- 2013-04
- Strategy:
recompute matrix H (not storing, because computing is and will be (?)
always cheaper than fast storage, see GPU, multicore, L1-caches)
- checksum H-block-elements to avoid errors (2012)
- SU(2) use ???
see R. Schnalle + J. Schnack, Calculating the energy spectra of magnetic
molecules: application of real- and spin-space symmetries, Apr2010
International Reviews in Physical Chemistry Vol.29, No 2, 403-452
- speedup ns at MIPS NN>64 (remove y%() from vlint.h)
- s=1 CONFIG_S1SYM + PARALLEL buggy, n1 too big, for ud=-1 ???
- virt. Test-Cluster (rid-replacement)
- Zi != 0 for square Sz=0 k=-9999
- after release v2.40, make FPGA/GPU ready (b_smallest blockwise) + FPGA
- ising matrix to ising+1storder-excitations?
- storeh2 for mpi (for ns() too)
MC like method 10000 random configs divided into 100 blocks
defining start-cfg of each block,
isingerg as mean value of neighbouring ising ergs?
like e0+meanexcitations
- remove writing l1 (allows starting 2 jobs within same path)
save as one file on node0?
- rename zahl to lfloat (long float), mzahl to sfloat (short float)
- complete new strategy? build n1 via storeh2 (sorted by isingerg)
and balance number of nonzero elements per rank/thread
b2i have to ask probably more than one other rank?
- remove n2, parallel ns()
- err9200: set nzxmax-overflow-flag and reset hbuf->n
stop later! ??? or better dyn.malloc
or hbuf as compressed part of hbuf[NUM_
- mpi ns() without nfs-transfer
- fix compile-errors on SunOS(isut1), check MRule for kago39
- ns() replace fwrite(l1_xxx.tmp) ??? (tina has NFS problems)
by mpi_sendrecv
1st: send buffer to thread_i (i+1 if mem_i=full)
or simpler send buf_i to thread i%mpi_n (buf_mpi_i)
2nd: balance l1[] send l1[]begin(i+1)..end to
or simpler send buf_mpi_i to thread i (l1_i)
- speedup by local malloc? if not change back to v[i+b_ofs[blk]]
and node_ofs[..], b_ofs[]=0..node_len[node]-1;
- autobalancer, redistribute lines among threads after 1st iteration?
!ps -m -o time -p $PPID at end to calculate unbalance
- first mpi_n*MPI_calls per hbuf-line, later block a fixed number of lines
split nhm()
- hamilton_nhv(i) -> nhm(i,jnz) -> store hbuf[].cfg,rr
- smallest(hbuf) -> jnz*b_smallest -> store hbuf[].scfg,rr*=(sign,norm2)
- scfg2blk(hbuf) -> [blk].hbuf...
- mpi_n*MPI_Sendrecv([i+j].hbuf to i+j, [i].hbuf from i-j) cfg
scfg2idx -> hbuf[].idx
MPI_Sendrecv([i+j].hbuf to i-j, [i].hbuf from i+j) idx+v0?
for all b_len[0] (send 0 if b_len[i]<b_len[0])
- hlines[HLmax]
- problem: H blocks are sparse - xy+flag pointless? x+y but more mem, mpi-overhead?
  - store XandY ??? no!
    - Xsteps greater 1 between, 2 nz-Elements
    - 80% more memory needed, slower!?
- Y+(nextX-flag) !!!
- only stripes possible (sort mpi-blocks during read)
- creation of H needs more time and a lot more MPI-traffic
- store Y+Y.blk(byte) +20%memory
- [0].xy=num_elements_line_x [0].r=diag [0].by=block
[1..n-1]=nondiagonal elements (xy,by,r)
- by blocks for nodes only? C++ array access via MPI?
- highest to lowest vector_coeff? can int/llong be used for speedup on T1?
- ca. n1 without hash collisions?
- first 10 down-spins via table/recursive search * (last down-spins over the
  remaining places)
- next symcfg -> remove the last up-spin and search a new place until smallest
- alloc h_arrays within threads (h_xxxx[B_NUM] not needed),
open/close within threads?
- better speed measurement to find the bottleneck?
- MP-scaling, size-scaling hnz/s (if sensible)
- for FPGAs prepared
- smaller code for less bugs
- remove noSBase (k=-1 can be used, add kud=0 or -1, test speed)
- pictures of most probable 40 states! mark flips (@ critical J2=0.6)
show first perturbation terms
- long-term goal: FPGAs + MPI (batch the b_smallest calls)
- MPI async: MPI_PROBE, MPI_GET_COUNT, MPI_RECV, MPI_ISend,Ireceive,waitall/any
- async read n-to-m threads (m<=n), operations in blocks (speedup)
  too costly? better only in memory, on-the-fly computation or mmap to disk
- tJ has no .3 symmetry like tU (Ham2), does that make sense?
- store j1-j2-t-U as parameter index or via index (long 40-site j1-j2 computations)
  (because of a2 no separate parameter files), better an index?
  store index to [factor{0,+1/2,-1/2,-1,+1}, parameter index]
- n-site terms in H, n>2
- SiSj with sym much too slow, why?
- a4 parallel (world leader? <-> AHonecker)
- test / improve a2 parallel scaling?
- data multi-stream concept (pipeline concept of vector machines) for new HPC?
  serially, dynamically coupled programmable units (e.g. CPUs, FPGAs)
  mapped onto seq. processors as threads + stream buffers + stop/
  start mechanism if stream buffer is full/empty (wait for data)
  not sufficient for v2=H*v1+v2, random access also necessary (stream+RAM)
MIPS per Watt? cost-performance-per-watt (ARM SA-1100 1997 133MHz max250mW)
- nice graphic?: xy array of colored Ising-energy matrix weights
  minIsing=0(neel) maxIsing=maxNumBonds=N*Dim
  Ising=num-uu-Bonds+dd-Bonds, H1diff=0..2maxNN-2 H2diff=0..2maxNN
  + state Overlap <PisingstatesE1|H|PisingstatesE2> => x_out enhanced
  highest/lowest ising recursive? min..maxE1.E2.E3....
- check also: grep ToDo src/*.[ch]
- simplify/generalize LM bonds in H?
(use a more general (simple) method for local symmetries)
example:
..O O---O O---O O... this is a sample-chain, with 5*N sites
\ / \ / \ / and N vertical symmetries, which are
O O O completely decoupled from each other
/ \ / \ / \ and to the other symetries (very similar to LM),
..O O---O O---O O... a future version should care about this
speedup for local singlets (see 4site_exchange_diamond36.def)
  commuting symmetries
- LM also as a symmetry subgroup, e.g. N=5 S=3/2 (15 sites) 2013-04
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
generators:
subgroup0 0 1 2 = sym0=0-1 2, sym1=0 1-2 (sym2=0-2 1=sym0*sym1) l=3
subgroup1 3 4 5 = sym3=3-4 5, sym4=3 4-5 (sym5=3-5 4=sym0*sym1) l=3
...
subgroup4 12 13 14 ...
generators: 5*2(10 to store), subgroupsyms=5*3(15 to store)
oldNoS1sym: 3^5=243 (growing fast, slowdown + may hit CPU-cache size!)
- find solution to avoid 64bit overflow for spinchain N=6 s=7 and bigger
- instead of doing pthread_create/join on every iteration, do it only once
  and use with a mutex in an mpi-compatible way (sketch below)
  - improves sun_top_pcpu (pcpu is reset to 0 after pwd_create)
  - improves linux_top logging (new pid creation on p_create)
- could be made mpi compatible (MY_MPI)
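  A sketch of the persistent-worker idea: threads are created once at startup
  (pthread_create(worker), quit+broadcast at shutdown) and get one iteration
  at a time through a mutex/condvar handshake; do_iteration_part() and
  NWORKERS are placeholders, not spinpack code:

    #include <pthread.h>

    #define NWORKERS 4                      /* example thread count */
    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
    static int generation = 0, ndone = 0, quit = 0;

    static void do_iteration_part(int tid) { (void)tid; /* per-thread SpMV work */ }

    static void *worker(void *arg) {
      int tid = (int)(long)arg, mygen = 0;
      for (;;) {
        pthread_mutex_lock(&mtx);
        while (generation == mygen && !quit) pthread_cond_wait(&cv, &mtx);
        if (quit) { pthread_mutex_unlock(&mtx); return NULL; }
        mygen = generation;
        pthread_mutex_unlock(&mtx);

        do_iteration_part(tid);             /* the real work of this iteration */

        pthread_mutex_lock(&mtx);
        if (++ndone == NWORKERS) pthread_cond_broadcast(&cv);  /* wake master */
        pthread_mutex_unlock(&mtx);
      }
    }

    void run_iteration(void) {              /* master, once per iteration */
      pthread_mutex_lock(&mtx);
      ndone = 0; generation++;
      pthread_cond_broadcast(&cv);          /* release all workers */
      while (ndone < NWORKERS) pthread_cond_wait(&cv, &mtx);
      pthread_mutex_unlock(&mtx);
    }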
- calculate SiSj for twisted boundary conditions TBC
(using posx,posy, ww[NN*NN] ?)
- translate docs to english (partly done)
- new design via pipes (dataflow) and threads (ex: generate-H-thread
  writes elements sorted to 4 pipes for blocks, system does caching)
- coding/indexing by Ising energy with all possible bonds at equal
  symmetry (advantage: S=1, LM automatically integrated, faster?)
- optionally introduce a C++ type ULLLong with >64bit for N>64
- switching to C++ would make the program clearer!
  minimally for the data types
- compute expectation values H_t H_J H_U etc. (their sum is <H>)
  enables better interpretation? possibly H_J1, H_J2
  - H_J1={ny,array of pointer of {iy,nx,array of {ix,value[y,x]}}}
  - store H_J1,H_J2 separately and calculate H = J1*H_J1 + J2*H_J2 (sketch below)
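  A sketch of the factorized storage: keep H_J1 and H_J2 as separate sparse
  parts and form H*v = J1*(H_J1*v) + J2*(H_J2*v) at SpMV time, so a J1-J2 scan
  needs no matrix recomputation; the CSR-like layout here is illustrative:

    #include <stddef.h>

    typedef struct {              /* one parameter part of H, CSR-like */
      size_t n1;                  /* number of rows */
      size_t *row;                /* row[0..n1], offsets into col/val */
      int    *col;
      double *val;
    } spmat;

    void spmv_factorized(const spmat *h1, const spmat *h2,
                         double j1, double j2, const double *v, double *r) {
      for (size_t x = 0; x < h1->n1; x++) {
        double s = 0.0;
        for (size_t k = h1->row[x]; k < h1->row[x + 1]; k++)
          s += j1 * h1->val[k] * v[h1->col[k]];
        for (size_t k = h2->row[x]; k < h2->row[x + 1]; k++)
          s += j2 * h2->val[k] * v[h2->col[k]];
        r[x] = s;
      }
    }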
- try starting from neel and count nonzero elements per lanczos step (a8?)
  also check overlap to preceding lanczos vectors
- repeat with debug++ if fatal error, reduce debug output
- speed 600MHz lt=2:05 SH=2:15, --- 1s/It --- nu+nd=32+8 n1=482e3
call b_smallest lt=2:07 SH=2:13=133s
call 2*b_smallest lt=2:05 SH=3:50=230s => b_smallest=1:37=97s=72%
call b_getbase lt=2:07 SH=13:45
call 2*b2i lt=2:06 SH=2:29 => +12..14s=10%
nhmline() +1..3s=1%
1*nhm SH=2:21
2*nhm SH=4:32 => +131s=98%
nhm=return SH=0:07s
  build lt together with H? sorted by E_Ising? 3bonds=3dimIsing
+ use cos(k!=0,pi) possible for real numbers
H*000111=001011+100110 (6)->(6) better uu=dd=0 ud=du=1 (for +-Operators)
H*4.2.0 =2.2.2+2.2.2
H*001011=010011+000111b+001101+101010 (6)->(6,2)
H*2.2.2 =2.2.2+4.2.0b+2.2.2+0.6.0
hash(ising-string)? trees of Isingergs (level=bondtype)
  store the config with min. bit distance to older representatives as the representative?
  (kfg1^kfg2 yields <ij>)
n1=555e6 hnz=23e9=41.4*n1
idx -> (IsingRep -> kfg) -> H*kfg -> kfgs -> IsingRep -> idxs
H*kfg, kfgs->IsingRep per FPGAs?
idx <-> IsingRep per (hash)Table?
  v fully dynamic, starting from lowestIsing?
  1st iteration super fast, 2nd iteration 41x slower? etc.
  but the problem of finding the index remains?
numBonds? (topology-index 1dist.2dist.3dist...N/2dist for chain)
0=01010101 2(8.0.8.0) -> 10010101 + 01100101 + 00110101 + ... 8(6.4.4.)
1=10010101 8(6.4.4. ) -> 01010101=0
+ 10100101=1
+ 10001101 + 10010011 + ... 16(4.6.)
2=10001101 16(4.6.) -> 10010101=1
+ 10001011 (4.4.)
3=10001011 16(4.4.) -> 10001101=2
+ 01001011 (6.4.2.)
+ 10000111 (2.4.)
4a=01001011 8(6.4.2)
4b=10000111 8(2.4)
- choose code2
- kill SIGUSR sh parallel => no meaningful value
- LAPACK without EV (as an option), implement zheev for sparse + parallel!?
  license? http://www.netlib.org/lapack/faq.html#1.2
- change the name of routines if modified,
- We only ask that proper credit be given to the authors.
complex: zheev (JOBZ='N'|'V', UPLO='U', N, A[LDA,N], LDA>=max[1,N], W[N],..)
  wantz = LSAME( jobz,'v'); // test option
lower =
- translate/redesign symmetry.tex
- check and document the Oles term + possibly only pure Coulomb-U(i,j)
  6-site U/|i-j|
- Reimar's patch = ok
- check OP/sec, theoretical limits MBps MOps etc. no disk/IO?
pid=...
while ps -p $pid; do
echo -n "$(date +"%j %H:%M:%S") "
  # only for OSF -g $pid (for subprocesses gzip)
ps -p $pid -o "pid,pgid,ppid,time,etime,usertime,systime,pcpu,pagein,vsz,rss,inblock,oublock" | tail -1
sleep 30
done
# compare process + system pagein/inblock/oublock if possible
#
plot [700:840] "aab.log" u 0:3 t "cpu/%" w lp,\
"<awk '{print ($9-x)/3.e3; x=$9}' aab.log" u 0:1 t "read/3e3" w lp,\
"<awk '{print ($10-x)/1.e2; x=$10}' aab.log" u 0:1 t "write/1e2" w lp
marvel: full load
gzip -fc1 tmp/htmp001.001 >tmp/htmp001.1.gz 1m32.381s v1.2.4 8MB/s
gzip -fc6 tmp/htmp001.001 >tmp/htmp001.6.gz 3m07.496s
gzip -fc9 tmp/htmp001.001 >tmp/htmp001.9.gz 4m34.404s 4MB/s
bzip2 -fc1 tmp/htmp001.001 >tmp/htmp001.1.bz2 7m43.521s v1.0.1
bzip2 -fc9 tmp/htmp001.001 >tmp/htmp001.9.bz2 12m14.325s
ls -l tmp/htmp001.*
778485760 Jan 17 tmp/htmp001.001
303230293 Jan 19 tmp/htmp001.1.gz 39%
297352257 Jan 19 tmp/htmp001.6.gz 38%
296760506 Jan 19 tmp/htmp001.9.gz 38%
296796144 Jan 19 tmp/htmp001.1.bz2 38%
332110892 Jan 19 tmp/htmp001.9.bz2 42% ?
# decompress to /dev/null
cat tmp/htmp001.001 0m11.624s # 67MB/s
gunzip -c tmp/htmp001.1.gz 0m24.802s Todo: +mem? +ru=100%?
gunzip -c tmp/htmp001.9.gz 0m24.043s # 32MB/s async?
bunzip2 -c tmp/htmp001.1.bz2 1m55.676s #
# xz and bzip2 is slower than gzip, but compresses better
# 2017-02.lc40.7+33.SH=23MB PC10 (Gbit=100MB/s) PC10 quantum2G
time xz -fc1 tmp/htmp0000.001 >/dev/null # 3.5s 4.9MB 4.7s 1.1s
time xz -c tmp/htmp0000.001 >/dev/null # 37.0s 4.2MB 41.6s 0.9s
time bzip2 -fc1 tmp/htmp0000.001 >/dev/null # 4.8s 6.1MB 5.3s 1.6s
time gzip -fc1 tmp/htmp0000.001 >/dev/null # 0.9s 7.9MB 1.2s 0.3s 25MB/s
time gzip -c tmp/htmp0000.001 >/dev/null # 4.4s 7.4MB
time gzip -fc8 tmp/htmp0000.001 >/dev/null # 22.2s 7.3MB
time gzip -fc9 tmp/htmp0000.001 >/dev/null # 58.6s
sh prog1 | buffer | sh prog2 # buffered async read?
Performance and efficiency
The efficiency of spinpack-2.19 on a Pentium-M-1.4GHz was estimated using
valgrind-20030725 for the 40-site square lattice s=1/2-model.
37461525713 Instr./49s = 764M I/s (600MHz)
12647793092 Drefs/49s = 258M rw/s