On multi-socket machines, there are additional considerations having to do with their NUMA architecture - see [[Performance]].
=== Xeon Phi (Knights Landing, KNL) ===
The benchmark was run on a single KNL 7210 processor (64 cores, 256 hardware threads) set to quadrant mode and using the MCDRAM as cache. '''The environment variable OMP_PROC_BIND was set to false, or KMP_AFFINITY set to none''' (if this is not done, the scheduler seems to put all threads on one core). XDS was compiled with the -xMIC-AVX512 option of ifort. These benchmarks were performed with a "warm" operating system cache, i.e. the first run of a given type did not count because it had to read all data from disk.
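In a bash shell, this could for example be done as follows before starting <code>xds_par</code> (a minimal sketch; adapt to your shell and environment):
 # prevent the OpenMP runtime from binding all threads to one core
 export OMP_PROC_BIND=false
 # alternatively, for the Intel OpenMP runtime:
 # export KMP_AFFINITY=none
 xds_par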
Deviating from the Xeon benchmark setup, BACKGROUND_RANGE was set to a more realistic value of 1 50 (instead of 1 9).
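In XDS.INP this corresponds to:
 BACKGROUND_RANGE= 1 50   ! instead of 1 9 as used in the Xeon benchmark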
Using the Dectris library, loaded through the <code>[[LIB]]=</code> [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#LIB= option] of XDS (an example XDS.INP line is given after the timings):
INIT:            elapsed wall-clock time      30.4 sec
COLSPOT:        elapsed wall-clock time      40.7 sec
INTEGRATE: total elapsed wall-clock time      52.9 sec
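For reference, the plugin is loaded by a <code>LIB=</code> line in XDS.INP; the path below is only a placeholder for wherever the Dectris library is installed:
 LIB= /path/to/dectris-neggia.so   ! placeholder path to the Dectris HDF5 plugin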
Now additionally running with <code>numactl --preferred=1 xds_par</code>, after modifying the forkintegrate script so that it starts mintegrate_par with the same numactl parameters (a sketch is given after the timings):
INIT.LP:        elapsed wall-clock time      29.8 sec
COLSPOT:        elapsed wall-clock time      40.0 sec
INTEGRATE: total elapsed wall-clock time      51.3 sec
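A minimal sketch of this setup (the exact command line inside forkintegrate differs between XDS versions, so the prefixing shown in the comment is only illustrative):
 # prefer memory allocations from NUMA node 1 (the MCDRAM node in flat/hybrid mode)
 numactl --preferred=1 xds_par
 # inside forkintegrate, prefix the mintegrate_par invocation in the same way, e.g.
 # numactl --preferred=1 mintegrate_par ...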
This was run with an 8 GB/8 GB split (''hybrid'' mode) of the MCDRAM. The same run, but with 8 JOBS and 32 PROCESSORS (see the XDS.INP lines after the timings), takes
INIT.LP:        elapsed wall-clock time      25.3 sec
COLSPOT:        elapsed wall-clock time      40.1 sec
INTEGRATE: total elapsed wall-clock time      53.1 sec
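The JOBS/PROCESSORS combination is set in XDS.INP; for the 8/32 case above:
 MAXIMUM_NUMBER_OF_JOBS= 8
 MAXIMUM_NUMBER_OF_PROCESSORS= 32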
Back to 16 JOBS and 16 PROCESSORS, but with the MCDRAM in ''flat'' mode and <code>numactl --preferred=1 xds_par</code> (thus using all 16 GB for arrays, and nothing for cache):
INIT.LP:        elapsed wall-clock time      29.5 sec
COLSPOT:        elapsed wall-clock time      38.6 sec
INTEGRATE: total elapsed wall-clock time      53.2 sec
Now setting the KNL to SNC4 mode, and the MCDRAM to cache (using it in flat mode is impractical because <code>--preferred</code> takes only a single node as argument, so choosing the correct node for each job would require scripting; the NUMA layout can be inspected as shown after the timings):
INIT.LP:        elapsed wall-clock time      29.6 sec
COLSPOT.LP:      elapsed wall-clock time      37.8 sec
INTEGRATE: total elapsed wall-clock time      49.6 sec
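Which NUMA nodes exist in SNC4 mode (and which of them are the CPU-less MCDRAM nodes when it is configured as flat memory) can be inspected with:
 numactl --hardware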
If the library is compiled with -mtune=knl, all times are about 1 second less.
Conclusions: since INIT benefits from more PROCESSORS, one could run XDS twice for fastest turnaround: the first run with JOB= XYCORR INIT and a high number of processors (99 is the maximum), the second run with JOB= COLSPOT IDXREF DEFPIX INTEGRATE CORRECT and an optimized JOBS/PROCESSORS combination. The SNC4 mode is fastest in this example; to do better than the cache mode of the MCDRAM, one needs to adapt the forkcolspot and forkintegrate scripts - see [[Performance]]. Other examples (with more frames) confirmed that cache mode is best for both quadrant and SNC4 mode, and showed quadrant mode to be superior to SNC4. To use the latter optimally, one needs to thoroughly understand and properly use the relevant environment variables, in particular KMP_AFFINITY and KMP_PLACE_THREADS.
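A sketch of this two-run scheme in terms of XDS.INP lines (the 16/16 combination of the benchmark above is used here only as an example):
 ! first run: steps that benefit from many processors
 JOB= XYCORR INIT
 MAXIMUM_NUMBER_OF_PROCESSORS= 99
 ! second run: remaining steps with an optimized JOBS/PROCESSORS combination
 JOB= COLSPOT IDXREF DEFPIX INTEGRATE CORRECT
 MAXIMUM_NUMBER_OF_JOBS= 16
 MAXIMUM_NUMBER_OF_PROCESSORS= 16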
For comparison, if these data are stored as CBFs, COLSPOT and INTEGRATE take 34.8 and 45.2 seconds, respectively, in SNC4 mode. However, with a cold cache (i.e. when data are read for the first time), the HDF5 files have an advantage because they are a factor of 2.5 smaller, due to their better compression.


== Troubleshooting ==