Eiger: Difference between revisions

378 bytes removed ,  21 March 2017
switch to LIB= usage (from H5ToXds)
(→‎General aspects: write "job" instead of "process")
(switch to LIB= usage (from H5ToXds))
Line 3: Line 3:
== General aspects ==
# The framecache of XDS uses memory to save on I/O; it keeps each frame in RAM after reading it for the first time. By default, each XDS (or mcolspot/mintegrate) job stores NUMBER_OF_IMAGES_IN_CACHE=DELPHI/OSCILLATION_RANGE images in memory, which corresponds to one DELPHI-sized batch of data. This requires (number of pixels)*(number of jobs)*4 bytes per frame, which amounts to 72 MB for the Eiger 16M when running with MAXIMUM_NUMBER_OF_JOBS=1. (If DELPHI=20 and OSCILLATION_RANGE=0.05, your computer thus needs at least 400*72 MB = 29 GB of memory for each job.) If it does not have that much, the fallback is the old behaviour of reading each frame three times (instead of once). There is an upper limit (2GB?) to the amount of memory that will be used by default; if the required memory exceeds it, a message is printed and the user must explicitly include a NUMBER_OF_IMAGES_IN_CACHE= line in XDS.INP.
# Dectris provides [https://www.dectris.com/news.html?page=2 H5ToXds] (Linux only!) which is needed by XDS. That program converts (as the name indicates) the HDF5 files to CBF files; however, it does not write the geometry and other information into the CBF header (therefore, [[generate_XDS.INP]] does not work with these files). As an alternative, one could use GlobalPhasing's hdf2mini-cbf program (needs autoPROC license) or, from http://www.mrc-lmb.cam.ac.uk/harry/imosflm/ver721/downloads, the eiger2cbf-osx or eiger2cbf-linux program written by T. Nakane. These programs do write a useful CBF header.
# Dectris provides a library [https://github.com/dectris/neggia] for native reading of HDF5 files, which can be loaded into XDS at runtime using the <code>LIB=</code> [http://homes.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/xds_parameters.html#LIB= keyword]. With this library, no conversion to CBF or otherwise is necessary. It is therefore just as fast and efficient to read HDF5 files as any other file format.
# For faster processing (Linux only; script needs to be adapted for OSX), the [[Eiger#A_script_for_faster_XDS_processing_of_Eiger_data|shell script]] below should be copied to /usr/local/bin/H5ToXds and made executable (<code>chmod a+rx /usr/local/bin/H5ToXds*</code>). The binary H5ToXds then should be named e.g. /usr/local/bin/H5ToXds.bin - note the .bin filename extension! The script ''also'' uses RAM to speed up processing; it uses it for fast storage of the temporary CBF file that H5ToXds/eiger2cbf/hdf2mini-cbf writes, and that each parallel thread ("processor") of XDS reads. The amount of additional RAM this requires is modest (about (number of pixels)*(number of threads) bytes).
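The framecache memory estimate from the first item can be reproduced with a few lines of shell. This is only a sketch: the pixel count of 18 million is an assumption back-calculated from the 72 MB per frame quoted above, and the variable names are made up for illustration.

```shell
#!/bin/bash
# Sketch: estimate framecache RAM per XDS job, following
# NUMBER_OF_IMAGES_IN_CACHE = DELPHI / OSCILLATION_RANGE.
delphi=20                 # DELPHI= in degrees (example value from the text)
oscillation_range=0.05    # OSCILLATION_RANGE= in degrees (example value from the text)
pixels=18000000           # assumed: about 18 Mpixel, i.e. 72 MB / 4 bytes
bytes_per_pixel=4

# bash has no floating point, so awk does the division (rounded to nearest integer)
frames=$(awk -v d="$delphi" -v o="$oscillation_range" 'BEGIN { printf "%d", d / o + 0.5 }')
mb_per_frame=$(( pixels * bytes_per_pixel / 1000000 ))
total_mb=$(( frames * mb_per_frame ))

echo "frames in cache: $frames"
echo "MB per frame:    $mb_per_frame"
echo "MB per job:      $total_mb"
```

With the example values this gives 400 frames and 28800 MB per job, i.e. the roughly 29 GB mentioned above. If that exceeds the available RAM, a smaller NUMBER_OF_IMAGES_IN_CACHE= can be set explicitly in XDS.INP.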


A suitable [[XDS.INP]] may have been written by the data collection (beamline) software. Latest [[generate_XDS.INP]] (<code>generate_XDS.INP xxx_master.h5</code>) or the [[Eiger#XDS_from_H5.py_script_for_generating_XDS.INP_given_a_master_.h5_file|XDS_from_H5.py script]] can be used if XDS.INP is not available.
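To switch an existing XDS.INP to the <code>LIB=</code> route, the keyword line has to point to the shared library built from the neggia sources. A minimal sketch follows; the library path /usr/local/lib64/dectris-neggia.so and the helper name set_lib are assumptions for illustration, so adapt them to where the library is installed on your system.

```shell
#!/bin/bash
# Sketch: make XDS.INP load an HDF5 reader plugin via LIB=.
# set_lib is a hypothetical helper, not part of XDS.
set_lib() {
    local inp=$1 lib=$2
    # drop any existing LIB= line, then append the new one
    grep -v '^[[:space:]]*LIB=' "$inp" > "$inp.tmp"
    echo "LIB=$lib" >> "$inp.tmp"
    mv "$inp.tmp" "$inp"
}

neggia=/usr/local/lib64/dectris-neggia.so   # assumed install location
if [ -f XDS.INP ]; then
    [ -r "$neggia" ] || echo "warning: $neggia not found" >&2
    set_lib XDS.INP "$neggia"
fi
```

The helper assumes that LIB= sits on a line of its own; an XDS.INP with several keywords per line would need a more careful edit.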
Line 19: Line 18:


Any comparisons should be based on a common dataset. I downloaded from https://www.dectris.com/datasets.html their latest dataset
Any comparisons should be based on a common dataset. I downloaded from https://www.dectris.com/datasets.html their latest dataset
ftp://dectris.com/EIGER_16M_Nov2015.tar.bz2 (900 frames) and processed it on a single unloaded CentOS7.2 64bit machine with dual Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz , HT enabled (showing 32 processors in /proc/cpuinfo), on a local XFS filesystem (all defaults), with four JOBs and 12 PROCESSORS (the XDS.INP that Dectris provides suggests 8 JOBs of 12 PROCESSORS, but I changed that). The numbers below refer to the H5ToXds binary as used in the [[Eiger#A_script_for_faster_XDS_processing_of_Eiger_data|script]] below.
ftp://dectris.com/EIGER_16M_Nov2015.tar.bz2 (900 frames) and processed it on a single unloaded CentOS7.2 64bit machine with dual Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz , HT enabled (showing 32 processors in /proc/cpuinfo), on a local XFS filesystem (all defaults), with four JOBs and 12 PROCESSORS (the XDS.INP that Dectris provides suggests 8 JOBs of 12 PROCESSORS, but I changed that).  
 
The timing on the first run, using XDS (BUILT=20151231), is
INIT:  elapsed wall-clock time      12.0 sec
COLSPOT: elapsed wall-clock time      44.9 sec
INTEGRATE: total elapsed wall-clock time      65.1 sec
CORRECT: elapsed wall-clock time        2.9 sec
Total elapsed wall-clock time for XDS      133.6 sec
 
When I repeat this, I get
Total elapsed wall-clock time for XDS      128.3 sec
Repeat once again:
Total elapsed wall-clock time for XDS      129.3 sec
So a bit of cache-warming helps, but not much. This machine has 64GB RAM. From the output of "top", the highest memory usage occurs during INTEGRATE, when each of the mintegrate_par processes consumes up to 7.4% of the memory. In other words, in this way less than 20GB of total memory are used. "top" shows a CPU consumption around (on average) 4 times 650%.
 
The number of JOBs and PROCESSORs could be optimized. I tried 6 JOBs and get
Total elapsed wall-clock time for XDS      120.1 sec
so there's still some room for improvement.
 
With program versions as of 2016-03-10, eiger2cbf-linux is practically as fast as the H5ToXds binary; hdf2mini-cbf is somewhat slower.
 
When unpacking the .h5 files to .cbf files and processing those, I get on the same machine and with same processing parameters:
Total elapsed wall-clock time for XDS      96.3 sec
which indicates a 24% overhead due to the HDF5-to-CBF conversion. However, one has to add to this the time for the HDF5-to-CBF conversion, which is (with 18 parallel H5ToXds jobs each converting 50 frames) 34.2 sec, so overall the "on-the-fly" route using the script below is faster than the "pre-conversion" route, at least on this machine.


On multi-socket machines, there are additional considerations having to do with their NUMA architecture - see [[Performance]].
Line 50: Line 26:
The benchmark was run on a single KNL7210 processor (64 cores, 256 hardware threads) set to quadrant mode and using the MCDRAM as cache. '''The environment variable OMP_PROC_BIND was set to false, or KMP_AFFINITY set to none''' (if this is not done, the scheduler seems to put all threads on one core). XDS was compiled with the -xMIC-AVX512 option of ifort. These benchmarks were performed with "warm" operating system cache, which means that the first run of a given type didn't count because it had to read all data from disk.
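The thread-placement settings described above could be applied in a small wrapper like the following sketch. OMP_PROC_BIND and KMP_AFFINITY are the variables named in the text; starting xds_par from the current directory is an assumption about how the job is launched.

```shell
#!/bin/bash
# Sketch: keep the OpenMP runtime from pinning all threads to one core.
# The text sets one variable or the other; both are set here for simplicity.
export OMP_PROC_BIND=false
export KMP_AFFINITY=none

if command -v xds_par >/dev/null 2>&1 && [ -f XDS.INP ]; then
    xds_par
else
    echo "xds_par or XDS.INP not found; environment is set but nothing was run" >&2
fi
```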


Deviating from the Xeon benchmark setup (above), BACKGROUND_RANGE was set to a more realistic value of 1 50 (instead of 1 9). The INIT numbers are therefore not comparable.
Deviating from the Xeon benchmark setup, BACKGROUND_RANGE was set to a more realistic value of 1 50 (instead of 1 9).  
This gives
COLSPOT:        elapsed wall-clock time      48.3 sec
INTEGRATE: total elapsed wall-clock time      61.2 sec
when run with MAXIMUM_NUMBER_OF_JOBS=16 and MAXIMUM_NUMBER_OF_PROCESSORS=16. These parameters, as well as the KNL setup, could still be optimized.


Update Feb 21, 2017 using XDS BUILT=20161205, and the CentOS-7.3 default kernel 3.10.0-514.6.1.el7:
Using the Dectris library that makes use of the <code>LIB=</code> [http://homes.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/xds_parameters.html#LIB= option] of XDS:
INIT:            elapsed wall-clock time      33.4 sec
COLSPOT:        elapsed wall-clock time      49.3 sec
INTEGRATE: total elapsed wall-clock time      59.8 sec
Using, instead of the H5ToXds script, a pre-release library that makes use of the <code>LIB=</code> [http://homes.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/xds_parameters.html#LIB= option] of XDS:
  INIT:            elapsed wall-clock time      30.4 sec
  COLSPOT:        elapsed wall-clock time      40.7 sec
Line 80: Line 48:
  COLSPOT.LP:      elapsed wall-clock time      37.8 sec  
  INTEGRATE: total elapsed wall-clock time      49.6 sec
If the library is compiled with -mtune=knl, all times are about 1 second less.


Conclusions: since INIT benefits from more PROCESSORs, one could run XDS twice for fastest turnaround: the first run with JOB=XYCORR INIT and a high number of processors (99 is the maximum), the second run with JOB=COLSPOT IDXREF DEFPIX INTEGRATE CORRECT and an optimized JOBS/PROCESSORS combination. The SNC4 mode is fastest in this example - to do better than the cache mode of the MCDRAM, one needs to adapt the forkcolspot and forkintegrate scripts - see [[Performance]]. Other examples (with more frames) confirmed that cache mode is best for quadrant and SNC4, and resulted in quadrant mode being superior to SNC4. To optimally use the latter, one needs to thoroughly understand and properly use the relevant environment variables, in particular KMP_AFFINITY and KMP_PLACE_THREADS.
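The two-pass strategy from the conclusions could be scripted roughly as follows. This is a sketch under assumptions: set_keyword and run_xds are hypothetical helpers, each keyword is assumed to sit on a line of its own in XDS.INP, and the 16/16 split in the second pass is simply the combination used in the benchmark above, not a tuned value.

```shell
#!/bin/bash
# set_keyword FILE KEY VALUE: remove uncommented KEY= lines, then append KEY=VALUE
set_keyword() {
    local inp=$1 key=$2 value=$3
    grep -v "^[[:space:]]*${key}=" "$inp" > "$inp.tmp"
    echo "${key}=${value}" >> "$inp.tmp"
    mv "$inp.tmp" "$inp"
}

# run_xds: start xds_par if it is installed, otherwise just say so
run_xds() {
    if command -v xds_par >/dev/null 2>&1; then
        xds_par
    else
        echo "xds_par not in PATH" >&2
    fi
}

if [ -f XDS.INP ]; then
    # pass 1: INIT profits from many processors (99 is the maximum)
    set_keyword XDS.INP JOB "XYCORR INIT"
    set_keyword XDS.INP MAXIMUM_NUMBER_OF_PROCESSORS 99
    run_xds

    # pass 2: remaining steps with a JOBS/PROCESSORS combination
    set_keyword XDS.INP JOB "COLSPOT IDXREF DEFPIX INTEGRATE CORRECT"
    set_keyword XDS.INP MAXIMUM_NUMBER_OF_JOBS 16
    set_keyword XDS.INP MAXIMUM_NUMBER_OF_PROCESSORS 16
    run_xds
fi
```

Because the pattern is anchored at the start of the line, rewriting JOB= leaves a MAXIMUM_NUMBER_OF_JOBS= line untouched.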
Line 87: Line 57:
== Troubleshooting ==
* make sure that master.h5 and the corresponding data.h5 files remain together as collected, and '''don't rename the data.h5 files''' - they are referred to from master.h5.  If you change the names of the data.h5 files or copy them somewhere else, that link is broken unless you fix master.h5.   
* the programs get a lot of testing on RHEL/CentOS/SL. To test whether the conversion program works (e.g. on uncommon distros like Mint), run it outside XDS, e.g. <pre> H5ToXds master.h5 1:100 out.cbf </pre> If this creates CBF-compressed files for the first 100 images of your dataset, all is good.


== XDS_from_H5.py script for generating XDS.INP given a master .h5 file ==
Line 710: Line 652:
         exit(-1)
         exit(-1)
</pre>
</pre>
= Old way of processing Eiger data with XDS i.e. using H5ToXds =
Dectris provides the program [https://www.dectris.com/news.html?page=2 H5ToXds] (Linux only!) which is needed by XDS. That program converts (as the name indicates) the HDF5 files to CBF files; however, it does not write the geometry and other information into the CBF header (therefore, [[generate_XDS.INP]] does not work with these files). As an alternative, one could use GlobalPhasing's hdf2mini-cbf program (needs autoPROC license) or, from http://www.mrc-lmb.cam.ac.uk/harry/imosflm/ver721/downloads, the eiger2cbf-osx or eiger2cbf-linux program written by T. Nakane. These programs do write a useful CBF header.
For faster processing (Linux only; script needs to be adapted for OSX), the [[Eiger#A_script_for_faster_XDS_processing_of_Eiger_data|shell script]] below should be copied to /usr/local/bin/H5ToXds and made executable (<code>chmod a+rx /usr/local/bin/H5ToXds*</code>). The binary H5ToXds then should be named e.g. /usr/local/bin/H5ToXds.bin - note the .bin filename extension! The script ''also'' uses RAM to speed up processing; it uses it for fast storage of the temporary CBF file that H5ToXds/eiger2cbf/hdf2mini-cbf writes, and that each parallel thread ("processor") of XDS reads. The amount of additional RAM this requires is modest (about (number of pixels)*(number of threads) bytes).
== Benchmark using H5ToXds ==
The numbers below refer to the H5ToXds binary as used in the script below.
The timing on the first run, using XDS (BUILT=20151231), is
INIT:  elapsed wall-clock time      12.0 sec
COLSPOT: elapsed wall-clock time      44.9 sec
INTEGRATE: total elapsed wall-clock time      65.1 sec
CORRECT: elapsed wall-clock time        2.9 sec
Total elapsed wall-clock time for XDS      133.6 sec
When I repeat this, I get
Total elapsed wall-clock time for XDS      128.3 sec
Repeat once again:
Total elapsed wall-clock time for XDS      129.3 sec
So a bit of cache-warming helps, but not much. This machine has 64GB RAM. From the output of "top", the highest memory usage occurs during INTEGRATE, when each of the mintegrate_par processes consumes up to 7.4% of the memory. In other words, in this way less than 20GB of total memory are used. "top" shows a CPU consumption around (on average) 4 times 650%.
The number of JOBs and PROCESSORs could be optimized. I tried 6 JOBs and get
Total elapsed wall-clock time for XDS      120.1 sec
so there's still some room for improvement.
With program versions as of 2016-03-10, eiger2cbf-linux is practically as fast as the H5ToXds binary; hdf2mini-cbf is somewhat slower.
When unpacking the .h5 files to .cbf files and processing those, I get on the same machine and with same processing parameters:
Total elapsed wall-clock time for XDS      96.3 sec
which indicates a 24% overhead due to the HDF5-to-CBF conversion. However, one has to add to this the time for the HDF5-to-CBF conversion, which is (with 18 parallel H5ToXds jobs each converting 50 frames) 34.2 sec, so overall the "on-the-fly" route using the script below is faster than the "pre-conversion" route, at least on this machine.
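The comparison in the last sentence is simple arithmetic; the sketch below adds the conversion time to the processing time of the pre-converted CBF files and compares the sum with the 120.1 sec on-the-fly run. All numbers are the timings quoted above; the variable names are illustrative.

```shell
#!/bin/bash
on_the_fly=120.1      # sec, XDS reading .h5 through the H5ToXds script (6 JOBs)
cbf_processing=96.3   # sec, XDS on pre-converted .cbf files
conversion=34.2       # sec, 18 parallel H5ToXds jobs of 50 frames each

# total cost of the pre-conversion route = conversion + processing
pre_conversion=$(awk -v p="$cbf_processing" -v c="$conversion" 'BEGIN { printf "%.1f", p + c }')
echo "pre-conversion route: $pre_conversion sec"
echo "on-the-fly route:     $on_the_fly sec"

# awk exits with status 0 when the on-the-fly route is the faster one
if awk -v a="$on_the_fly" -v b="$pre_conversion" 'BEGIN { exit !(a < b) }'; then
    echo "on-the-fly is faster"
fi
```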
== A script for faster XDS processing of Eiger data ==
<pre>
#!/bin/bash
# Kay Diederichs 10/2015
# 3/2016 adapt for eiger2cbf-linux and hdf2mini-cbf
# for the latter see https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ccp4bb;58a4ee1.1603 and
# https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ccp4bb;a048b4e8.1603
#
# Idea: put temporary files into fast local directory, instead of NFS
#
# Installation: Rename Dectris' H5ToXds to H5ToXds.bin
#              This script should be called H5ToXds and reside in $PATH
#              Modify this script according to which binary you use - see comments below.
#
# Recommendation:
# - for the fast local directory one should use a RAMdisk (one GB size at most)
# - /dev/shm seems to be set up for that purpose on most distributions
#
# Arguments (passed through unchanged to the converter):
#   $1 = master .h5 file, $2 = frame number(s), $3 = name of the CBF file to write
#
# temporary file in the RAMdisk; the current directory (with / replaced by _)
# is part of the name so that jobs running in different directories do not collide
tempfile="/dev/shm/H5ToXds${PWD//\//_}.$3"
#
# choose between H5ToXds.bin,  eiger2cbf and hdf2mini-cbf; un/comment accordingly
# (if the conversion fails, the incomplete temporary file is removed)
/usr/local/bin/H5ToXds.bin "$1" "$2" "$tempfile" || rm -f "$tempfile"
#/usr/local/bin/eiger2cbf-linux "$1" "$2" "$tempfile" >& /dev/null  || rm -f "$tempfile"
#/usr/local/bin/hdf2mini-cbf "$1" "$2" "$tempfile"  || rm -f "$tempfile"
# XDS opens $3; make it a symlink to the file in the RAMdisk
ln -sf "$tempfile" "$3" 2>/dev/null
</pre>


== See also ==


[[Performance]]