Now we can compare the new SC5832 machine (light green) to an InfiniBand cluster with two QuadOpteron nodes (brown curve), which has enough bandwidth to show good scaling. Both scale well, but the SC5832 is better on a peak-GFLOP or energy-consumption basis. The MPI-Stress benchmark shows that the InfiniBand cluster has much higher latencies for collective communication. The SMP machines have latencies of 1.9 us for a 4-socket DualOpteron and 3.1 us for an 8-socket QuadOpteron system using OpenMPI-1.2.6. The Altix4700 at the LRZ has 510 usable IA64 processors per NUMAlink partition. The MPI speed depends very strongly on the MPI packet size (vertical lines, 128 kB for the middle line).
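As a rough illustration of how such collective latencies can be measured (this is a minimal sketch, not the MPI-Stress benchmark itself; the iteration count is an arbitrary assumption), one can time a loop of barrier calls and report the average per-call latency:

/* minimal collective-latency sketch: average MPI_Barrier time, rank 0 reports */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, i, iters = 10000;           /* assumed iteration count */
    double t0, t1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* synchronize before timing */
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg barrier latency: %.2f us\n", (t1 - t0) / iters * 1e6);
    MPI_Finalize();
    return 0;
}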
Checked for up to 1000 cores now (Jul08)!

linear in plot: x = log(CPUs), y = log(t1/t)
  lg2(t1/t) = b * lg2(CPUs)              # t1 = extrapolated 1-CPU time
  t1/t  = (2^b)^lg2(CPUs)                # 2^b = SpeedUp2 (CPU-doubling)
  t1/t2 = (2^b)^lg2(2) = 2^b = SpeedUp2
  BWFactor = t1/t(1CPU)                  # bandwidth factor
  OverallSpeedUp = SMPSpeedUp * MPISpeedUp
  SMPSpeedUp = SMPSpeedUp2 ^ lg2(SMPCores)
  MPISpeedUp = BWFactor * MPISpeedUp2 ^ lg2(MPINodes)

SpeedUp2:
  v2.33: SMP = 1.66 (up to 32 CPUs), MPI = 1.46 (up to 64 nodes)
  v2.36: SMP = 1.66 (up to 32 CPUs), MPI = 1.69 (up to 50 nodes)

BWFactor:
  100Mbit/s  = ca. 40%  ( 40% float, 25% double, 2*2GHz)
  1Gbit/s    = ca. 100% (100% float, 70% double, 4*2GHz)
  2*10Gbit/s = ca. 100% (estimated, BW*Cores/Node)

extrapolation:
  v2.33: SpeedUp = 1.66^lg2(SMPCores) * 1.46^lg2(MPINodes)
  v2.36: SpeedUp = 1.66^lg2(SMPCores) * 1.69^lg2(MPINodes) ??
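A minimal sketch of the extrapolation formula above (the v2.36 doubling factors 1.66/1.69 are the measured fits; BWFactor = 1.0 assumes a non-limiting network; the example core/node counts are arbitrary assumptions):

/* speedup extrapolation: SpeedUp2^lg2(cores) per doubling, times BWFactor */
#include <math.h>
#include <stdio.h>

double speedup(int smp_cores, int mpi_nodes,
               double smp2, double mpi2, double bwfactor) {
    /* OverallSpeedUp = SMPSpeedUp2^lg2(SMPCores) * BWFactor * MPISpeedUp2^lg2(MPINodes) */
    return pow(smp2, log2((double)smp_cores))
         * bwfactor * pow(mpi2, log2((double)mpi_nodes));
}

int main(void) {
    /* example: 8 SMP cores per node, 64 MPI nodes, v2.36 factors, fast network */
    printf("extrapolated speedup: %.1f\n", speedup(8, 64, 1.66, 1.69, 1.0));
    return 0;
}

(compile with -lm; for 8 cores * 64 nodes this gives roughly a factor of 100.)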