High Performance Linpack with CPU

  1. Compile HPL
    • From the hpl-2.3 top folder, copy the template Makefile:
      cp setup/Make.Linux_Intel64 Make.Linux_Intel64
      
    • Edit Make.Linux_Intel64 and change the following lines:
      TOPdir = <path to the hpl-2.3 top folder>
      
      MPdir = <Open MPI installation directory>
      MPinc = -I$(MPdir)/include
      MPlib = -L$(MPdir)/lib -lmpi
      
      LAdir = <OpenBLAS installation directory>
      LAinc = -I$(LAdir)/include
      LAlib = $(LAdir)/lib/libopenblas.a
      
      CC = mpicc
      CCNOOPT = $(HPL_DEFS)
      CCFLAGS = $(HPL_DEFS) -O3 -w -z noexecstack -z relro -z now -Wall # adjust the optimization flags for your CPU (e.g. -march=native)
      
      LINKFLAGS = $(CCFLAGS) $(OMP_DEFS)
      
    • Compile (this builds bin/Linux_Intel64/xhpl):
      make arch=Linux_Intel64
      
    • To clean the build:
      make clean arch=Linux_Intel64
      
  2. Run HPL
    • Edit the file bin/Linux_Intel64/HPL.dat inside the top folder.
      Here is an example for a machine with 8 GB of RAM and a 4-core CPU:

      HPLinpack benchmark input file
      Innovative Computing Laboratory, University of Tennessee
      HPL.out      output file name (if any) 
      6            device out (6=stdout,7=stderr,file)
      1            # of problems sizes (N)
      29184        Ns
      1            # of NBs
      192          NBs
      0            PMAP process mapping (0=Row-,1=Column-major)
      1            # of process grids (P x Q)
      2            Ps
      2            Qs
      16.0         threshold
      1            # of panel fact
      2            PFACTs (0=left, 1=Crout, 2=Right)
      1            # of recursive stopping criterium
      4            NBMINs (>= 1)
      1            # of panels in recursion
      2            NDIVs
      1            # of recursive panel fact.
      1            RFACTs (0=left, 1=Crout, 2=Right)
      1            # of broadcast
      1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
      1            # of lookahead depth
      1            DEPTHs (>=0)
      2            SWAP (0=bin-exch,1=long,2=mix)
      64           swapping threshold
      0            L1 in (0=transposed,1=no-transposed) form
      0            U  in (0=transposed,1=no-transposed) form
      1            Equilibration (0=no,1=yes)
      8            memory alignment in double (> 0)
      ##### This line (no. 32) is ignored (it serves as a separator). ######
      0                               Number of additional problem sizes for PTRANS
      1200 10000 30000                values of N
      0                               number of additional blocking sizes for PTRANS
      40 9 8 13 13 20 16 32 64        values of NB
      
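    • To tune the parameters, you can start from the values suggested by the website here. They are not guaranteed to be optimal, so try tuning the parameters yourself.

      The parameters you will most likely need to tune are:

      • Ps * Qs: the process grid. P * Q should equal the number of MPI processes (here, one per core); a near-square grid with P <= Q is usually a good starting point.
      • Ns: the problem size, i.e. the order N of the matrix. The N x N double-precision matrix takes 8 * N^2 bytes, so N is limited by the available RAM.
      • NBs: the block size. N is usually chosen as a multiple of NB.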
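    • Run the benchmark from bin/Linux_Intel64, since xhpl reads HPL.dat from the current directory. The number of processes should equal Ps * Qs:

      cd bin/Linux_Intel64
      mpirun -np <number of cores> ./xhpl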
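A common way to pick Ns is to size the N x N double-precision matrix (8 * N^2 bytes) to roughly 80% of the RAM and then round N down to a multiple of NB. The Python sketch below does exactly that. The 80% fraction is an assumption (a widely used rule of thumb, not an HPL setting), "8GB" is taken to mean 8 GiB, and `hpl_problem_size` is a hypothetical helper, not part of HPL.

```python
import math


def hpl_problem_size(mem_bytes: int, nb: int, mem_fraction: float = 0.8) -> int:
    """Largest N, rounded down to a multiple of nb, such that the N x N
    matrix of 8-byte doubles fits in mem_fraction * mem_bytes.

    Hypothetical helper; mem_fraction = 0.8 is a rule of thumb, not an HPL default.
    """
    budget_doubles = mem_fraction * mem_bytes / 8  # number of doubles that fit in the budget
    n = int(math.sqrt(budget_doubles))             # largest N with N * N <= budget
    return (n // nb) * nb                          # align N to the block size NB


if __name__ == "__main__":
    ram_bytes = 8 * 1024**3  # 8 GiB, as in the example HPL.dat
    print(hpl_problem_size(ram_bytes, nb=192))
```

For 8 GiB and NB = 192 this returns 29184, the Ns used in the example HPL.dat. If the machine starts swapping during the run, lower `mem_fraction`, since the OS and the MPI processes also need memory.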
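HPL needs Ps * Qs to match the number of MPI processes, and a commonly recommended shape is a grid that is as close to square as possible with P <= Q. The Python sketch below picks such a grid by taking the factor pair of the process count closest to its square root; `hpl_process_grid` is a hypothetical helper, not part of HPL.

```python
import math


def hpl_process_grid(nprocs: int) -> tuple[int, int]:
    """Return (P, Q) with P * Q == nprocs and P <= Q, as close to square as possible.

    Hypothetical helper for choosing Ps and Qs in HPL.dat.
    """
    if nprocs < 1:
        raise ValueError("nprocs must be a positive integer")
    # Walk down from floor(sqrt(nprocs)); the first divisor found gives the
    # most square grid, and P <= Q holds because P <= sqrt(nprocs).
    for p in range(math.isqrt(nprocs), 0, -1):
        if nprocs % p == 0:
            return p, nprocs // p
    # Unreachable: p = 1 always divides nprocs.
    raise AssertionError("no divisor found")


if __name__ == "__main__":
    for n in (4, 6, 8, 12):
        print(n, hpl_process_grid(n))
```

For the 4-core example it returns (2, 2), matching Ps and Qs in the HPL.dat above. For a prime process count it falls back to a 1 x n grid, which is a valid but usually less efficient layout.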