Performance

== General considerations ==


Roughly in order of decreasing effect:


# XDS scales well (i.e. the wallclock time for data processing goes down when the number of available cores is increased) in the INIT, COLSPOT, IDXREF, INTEGRATE and CORRECT steps when using the [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#MAXIMUM_NUMBER_OF_PROCESSORS= MAXIMUM_NUMBER_OF_PROCESSORS] keyword. This triggers program-level parallelization, using [http://www.openmp.org OpenMP] threads. IDXREF and CORRECT have a significant serial part, for example due to I/O.
# the program scales very well in the COLSPOT and INTEGRATE steps when using the [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#MAXIMUM_NUMBER_OF_JOBS= MAXIMUM_NUMBER_OF_JOBS] keyword. This triggers a shell-level parallelization, resulting in individual processes run by the operating system. There is a slight penalty associated with high values of MAXIMUM_NUMBER_OF_JOBS= :
#*in INTEGRATE, geometry refinement results are not transferred between JOBs: see [[Pathologies]];
#*in COLSPOT, the phi values at the borders between JOBs are less accurate (in particular if the mosaicity is high), and the same reflection may be listed twice in SPOT.XDS if it extends over the border between JOBs. The latter effect may be mitigated by having as many SPOT_RANGEs as JOBs, and leaving gaps between the SPOT_RANGEs; see [[Problems#IDXREF_produces_too_long_axes]].
# combining these two keywords gives the highest performance (see [[2VB1#XDS_processing]] for an example, and the XDS.INP fragment after this list). As a rough guide, I'd choose them to be approximately equal; an even number for MAXIMUM_NUMBER_OF_PROCESSORS should be chosen because that fits better with usual hardware. If in doubt, use a lower number for MAXIMUM_NUMBER_OF_JOBS than for MAXIMUM_NUMBER_OF_PROCESSORS. Since 2017, XDS has an automatic feature that divides the available cores up into JOBs (operating system processes) running with multiple cores each; you get this automatic behaviour when you run <code>xds_par</code> and don't specify MAXIMUM_NUMBER_OF_JOBS.
# NUMBER_OF_IMAGES_IN_CACHE avoids repeated (3-fold) reading of data frames in the INTEGRATE task during processing of a batch of frames. This comes at the expense of memory (RAM) and is discussed in [[Eiger]]. The default is DELPHI/OSCILLATION_RANGE+1 and is usually adequate. Only on low-memory systems (e.g. an 8 GB RAM machine processing Eiger 16M data collected with 0.1° oscillation range, at DELPHI=5 and MAXIMUM_NUMBER_OF_JOBS=1) should this be set to 0, to conserve memory and avoid slow processing due to thrashing, or even killed XDS processes. If the cache size of a process exceeds 8 GB, XDS prints a warning, and in that case the user has to explicitly include a NUMBER_OF_IMAGES_IN_CACHE=<desired number> line in XDS.INP to confirm that this much memory should indeed be used. '''Typically, you don't specify NUMBER_OF_IMAGES_IN_CACHE'''.
# XDS with the MAXIMUM_NUMBER_OF_JOBS and CLUSTER_NODES keywords can use [[Performance#Cluster|several machines]]. This requires some setup as explained at the bottom of [http://xds.mpimf-heidelberg.mpg.de/html_doc/downloading.html].
# Hyperthreading (SMT), if available, is often beneficial. A "virtual" core has only about 20% of the performance of a "physical" core, but it comes at no cost - you just have to switch it on in the BIOS of the machine.
# some overcommitting of resources (i.e. MAXIMUM_NUMBER_OF_PROCESSORS * MAXIMUM_NUMBER_OF_JOBS > number of cores) may be beneficial; you'll have to play with these two parameters since this depends on the actual hardware. If MAXIMUM_NUMBER_OF_PROCESSORS * MAXIMUM_NUMBER_OF_JOBS is >4096 (the default per-user process limit in RHEL), you may have to raise the maxproc limit of your shell; in bash: <code>ulimit -u unlimited</code>.
# the next thing to consider is [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#DELPHI= DELPHI] together with [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#OSCILLATION_RANGE= OSCILLATION_RANGE]: if DELPHI (the rotation range of a ''batch'' of frames) is an integer multiple of MAXIMUM_NUMBER_OF_PROCESSORS * OSCILLATION_RANGE, that is good because it nicely balances the usage of the threads. For this purpose, you may want to change (if possible, raise) the value of DELPHI (default is 5 degrees). If you are doing fine-slicing, imbalance of threads is not an issue - but for those users who want to collect 1° frames (which I think is not the best way nowadays ...) it should be a consideration. Additional consideration: the total number of frames should be an integer multiple of the intended number of frames in a batch. Example: 360 frames of 0.5° can be processed on an 8-core machine optimally by specifying DELPHI=4, since then there are 8 frames in a batch and the complete dataset has 45 batches. For weak data one should consider raising DELPHI to 12; that would give 15 batches. A trick: if you want to use DELPHI=8 in this situation, just specify DATA_RANGE=1 368 (pretending 23 batches of 8°) instead of DATA_RANGE=1 360 . XDS will complain about the missing 8 frames, but that has no adverse effects except that no FRAME.cbf will be produced. All of this doesn't matter much for a single dataset, but for mass processing of datasets it does make a difference.
# performance-wise, I/O also plays a role, because as soon as you run 24 or so processes, a single Gigabit ethernet connection may be limiting. OTOH shell-level parallelization smoothes the load. With BUILTs of XDS before 20191211, you could experiment with the environment variable FORT_BUFFERED which, when set to TRUE (in ~/.bashrc or, better, systemwide, with <code>export FORT_BUFFERED=TRUE</code>), results in faster I/O. Attention: on some installations this has resulted in XDS crashes, so if you get crashes with the error message <code>forrtl: severe (67): input statement requires too much data</code>, don't use this option! Since BUILT=20191211, a new version of the ifort compiler is used which should fix the underlying compiler bug. This and later BUILTs employ the option -assume buffered_io, and the environment variable is no longer needed for fast I/O.
# REFINE(INTEGRATE)= ! (empty list) makes INTEGRATE go much faster through the frames, since frames are processed less often when analyzing a batch of frames, and no geometry refinement takes place. TRUSTED_REGION=0 X results in fast processing if X is less than 1.4142 because that reduces the number of pixels of a rectangular detector that will be evaluated; X should of course be chosen such that it does not result in omission of useful data. For fast processing, the defaults for NUMBER_OF_PROFILE_GRID_POINTS_ALONG_ALPHA/BETA and NUMBER_OF_PROFILE_GRID_POINTS_ALONG_GAMMA should be used.
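As an illustration of how these keywords interact, here is a hypothetical XDS.INP fragment for the example above (360 frames of 0.5° on an 8-core machine); the values are a starting point only, not a validated recipe:
<pre>
! hypothetical XDS.INP fragment: 360 frames of 0.5 degrees, 8-core machine
MAXIMUM_NUMBER_OF_JOBS=2        ! shell-level parallelization (2 independent processes)
MAXIMUM_NUMBER_OF_PROCESSORS=4  ! OpenMP threads per JOB; 2*4 covers the 8 cores
OSCILLATION_RANGE=0.5
DELPHI=4                        ! 8 frames per batch; 4 is an integer multiple of 4*0.5
DATA_RANGE=1 360                ! 45 batches of 4 degrees each
</pre>
Alternatively, simply run <code>xds_par</code> without specifying MAXIMUM_NUMBER_OF_JOBS and let the automatic splitting mentioned in point 3 decide.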


== Cluster ==


If a cluster of computers is available whose nodes allow login with <code>ssh</code> without asking for a password, and which have the relevant directories NFS-mounted under the same paths, one can use the [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#CLUSTER_NODES= CLUSTER_NODES=] keyword in XDS.INP. Attention: if <code>CLUSTER_NODES=a b c</code>, then task 1 will run on node b, task 2 on node c, and task 3 on node a. This is due to the logic in the current (BUILT=20191015) version of <code>forkxds</code> (part of the XDS package).
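A hypothetical XDS.INP fragment for such a setup; the node names are placeholders for machines that are reachable by password-less <code>ssh</code> and see the data under the same NFS path:
<pre>
MAXIMUM_NUMBER_OF_JOBS=3
MAXIMUM_NUMBER_OF_PROCESSORS=8
CLUSTER_NODES=node1 node2 node3  ! placeholder names; with 3 JOBs, each node runs one JOB
</pre>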


If the other computers are not reachable by <code>ssh</code>, but coupled with a batch queueing system, then the forkxds script of the XDS distribution has to be modified: the node names are not relevant, and the <code>ssh</code> invocation has to be replaced by a <code>qsub</code> invocation - see [[Cluster Installation]].
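The following is only a minimal sketch of the idea, not the actual <code>forkxds</code> code; the <code>qsub</code> options shown are those of Grid Engine and will differ for other queueing systems (consult [[Cluster Installation]] for tested examples):
<pre>
# sketch only - not the actual forkxds. Where forkxds starts a task on a remote node with
#   ssh $node "$command" &
# one would instead submit the same command to the queue, e.g. with Grid Engine:
echo "$command" | qsub -cwd -V -N xds_task -sync y &
# -cwd     run in the current (NFS-shared) directory
# -V       export the environment, so that the XDS binaries are found in PATH
# -sync y  qsub returns only when the job has finished, so the 'wait' in forkxds still works
</pre>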
== Multi-socket machines ==
 
Multi-socket machines consist of several nodes, each comprising several CPU cores and some amount of memory. The nodes are connected by specialized hardware (sometimes called interconnect or bus) that transports data between the nodes. Typically, node-local memory is faster to read and write than memory on a different node. This NUMA (non-uniform memory access) setup has consequences for the performance of XDS jobs.
 
In particular, good performance is obtained if MAXIMUM_NUMBER_OF_JOBS is chosen as the number of nodes, and MAXIMUM_NUMBER_OF_PROCESSORS as the number of CPU cores (physical + virtual) of each socket; see the example fragment below. One then has to take care that each job ends up on its own socket. The scripts following the example do this. Please note that <tt>numactl</tt> has to be installed.
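For example, on a hypothetical machine with two sockets of 8 physical cores each and SMT enabled (16 logical cores per socket), the corresponding XDS.INP fragment would read:
<pre>
MAXIMUM_NUMBER_OF_JOBS=2         ! one JOB per socket/NUMA node
MAXIMUM_NUMBER_OF_PROCESSORS=16  ! logical cores (physical + virtual) of one socket
</pre>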
<pre>
#!/bin/bash
#                      forkcolspot
#
# enables  multi-tasking by splitting the COLSPOT step of
# xds into independent jobs. Each job is carried out by the
# Fortran program mcolspot or mcolspot_par started by this
# script as a background process with a different set of
# input parameters.
#
# 'forkcolspot' is called by xds or xds_par in the COLSPOT
# step using the Fortran instruction
# CALL SYSTEM('forkcolspot ntask maxcpu'),
#    ntask  ::total number of jobs
#  maxcpu  ::maximum number of processors used by each job
#
# Clearly, this can only work if forkcolspot, mcolspot, and
# mcolspot_par are correctly installed in the search path
# for executables.
#
# W.Kabsch and K.Rohm    Version February 2005
# NOTE: No blanks allowed adjacent to the = signs !!!
 
# K.Diederichs 3/2016 NUMA affinity added
#export KMP_AFFINITY="verbose"
maxnode=`numactl -H|awk '/available/{print $2-1}'`
#echo highest node is $maxnode
 
ntask=$1  #total number of jobs
maxcpu=$2 #maximum number of processors used by each job
  #maxcpu=1: use 'mcolspot' (single processor)
  #maxcpu>1: use 'mcolspot_par' (openmp version)
 
pids=""                    #list of background process ID's
itask=1
inode=0  # initialize inode
while test $itask -le $ntask
do
# KD modification: which node?
  let inode=$inode+1
  if [ $inode -gt $maxnode ]
      then let inode=0
  fi
#end modification
  if [ $maxcpu -gt 1 ]
      then echo "$itask" | numactl --cpunodebind=$inode mcolspot_par &
      else echo "$itask" | mcolspot    &
  fi
  pids="$pids $!"  #append id of the background process just started
 
  itask=`expr $itask + 1`
done
trap "kill -15 $pids" 2 15  # 2:Control-C; 15:kill
wait  #wait for all background processes issued by this shell
rm -f mcolspot.tmp  #this temporary file was generated by xds
</pre>
 
<pre>
#!/bin/bash
#                      forkintegrate
#
# enables  multi-tasking by splitting the INTEGRATE step of
# xds into independent jobs. Each job is carried out by the
# Fortran program mintegrate or mintegrate_par started by
# this script as a background process with a different set
# of input parameters.
#
# 'forkintegrate' is called by xds (or xds_par) in the
# INTEGRATE step using the Fortran instruction
# CALL SYSTEM('forkintegrate fframe ni ntask niba0 maxcpu'),
#    fframe ::id number of the first data image
#    ni    ::number of images in the data set
#    ntask  ::total number of jobs
#    niba0  ::minimum number of images in a batch
#    maxcpu ::maximum number of processors used by each job
#
# Clearly, this can only work if forkintegrate, mintegrate,
# and mintegrate_par are correctly installed in the search
# path for executables.
#
# W.Kabsch and K.Rohm    Version February 2005
# NOTE: No blanks allowed adjacent to the = signs !!!
 
# K.Diederichs 3/2016 NUMA affinity added
#export KMP_AFFINITY="verbose"
maxnode=`numactl -H|awk '/available/{print $2-1}'`
#echo highest node is $maxnode
 
 
fframe=$1 #id number of the first image
ni=$2    #number of images in the data set
ntask=$3  #total number of jobs
niba0=$4  #minimum number of images in a batch
maxcpu=$5 #maximum number of processors used by each job
  #maxcpu=1: use 'mintegrate' (single processor)
  #maxcpu>1: use 'mintegrate_par' (openmp version)
 
minitask=$(($ni / $ntask)) #minimum number of images in a job
mtask=$(($ni % $ntask))    #number of jobs with minitask+1 images
pids=""                    #list of background process ID's
nba=0
litask=0
itask=1
inode=0  # initialize inode
while test $itask -le $ntask
do
# KD modification: which node?
  let inode=$inode+1
  if [ $inode -gt $maxnode ]
      then let inode=0
  fi
#end modification
  if [ $itask -gt $mtask ]
      then nitask=$minitask
      else nitask=$(($minitask + 1))
  fi
  fitask=`expr $litask + 1`
  litask=`expr $litask + $nitask`
  if [ $nitask -lt $niba0 ]
      then n=$nitask
      else n=$niba0
  fi
  if [ $n -lt 1 ]
      then n=1
  fi
  nbatask=$(($nitask / $n))
  nba=`expr $nba + $nbatask`
  image1=$(($fframe + $fitask - 1)) #id number of the first image
 
  if [ $maxcpu -gt 1 ]
      then echo "$image1 $nitask $itask $nbatask" | numactl --cpunodebind=$inode mintegrate_par &
      else echo "$image1 $nitask $itask $nbatask" | mintegrate    &
  fi
  pids="$pids $!"  #append id of the background process just started
 
  itask=`expr $itask + 1`
done
trap "kill -15 $pids" 2 15  # 2:Control-C; 15:kill
wait  #wait for all background processes issued by this shell
rm -f mintegrate.tmp  #this temporary file was generated by xds
</pre>
 
As an alternative to <tt>numactl</tt>, one may use <tt>taskset</tt> or <tt>KMP_AFFINITY</tt>.
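As a minimal sketch (assuming, hypothetically, that NUMA node 0 holds logical CPUs 0-15 and node 1 holds CPUs 16-31; check the real layout with <code>numactl -H</code> or <code>lscpu</code>), the <tt>numactl</tt> line in <tt>forkcolspot</tt> could be replaced by one of:
<pre>
# taskset variant: bind the JOB to the logical CPUs of node $inode (here 16 CPUs per node - adjust!)
first=$(( inode * 16 )); last=$(( first + 15 ))
echo "$itask" | taskset -c ${first}-${last} mcolspot_par &

# KMP_AFFINITY variant (Intel OpenMP runtime): pin the threads of this JOB to an explicit CPU list
echo "$itask" | env KMP_AFFINITY="granularity=fine,proclist=[${first}-${last}],explicit" mcolspot_par &
</pre>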
 
If <tt>[https://github.com/RRZE-HPC/likwid/wiki likwid]</tt> were used instead of <tt>numactl</tt>, one would have much better control over affinity groups.
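A hypothetical example (assuming that NUMA node numbers coincide with socket numbers, which is the usual case): <tt>likwid-pin</tt> addresses cores through affinity domains such as <code>S0</code> for the first socket, so the corresponding line could look like
<pre>
echo "$itask" | likwid-pin -c S${inode}:0-15 mcolspot_par &
</pre>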
 
In my tests on a 4-socket machine, the difference between runs with the original scripts and the NUMA-aware ones was a reduction of wallclock time by about 8%. With a 2-socket machine, I saw a <1% effect. But this will depend very much on the specific hardware.
 
 
== Multi-socket machines in a cluster ==
 
In that case, I'd suggest modifying e.g. <tt>forkcolspot_cluster</tt> so that it does not run <tt>mcolspot_par</tt> directly on the remote machine, but rather runs a wrapper script on that machine which checks the number of NUMA nodes and starts <tt>mcolspot_par</tt> on the right node.
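A minimal sketch of such a wrapper (hypothetical and untested; the script name and the round-robin choice by task number are assumptions, and the wrapper must be in the PATH on the remote machine):
<pre>
#!/bin/bash
# run_mcolspot_numa: started on the remote machine instead of mcolspot_par itself.
# Reads the task number from stdin (as mcolspot_par would), picks a NUMA node
# round-robin, and binds mcolspot_par to that node.
read itask
maxnode=$(numactl -H | awk '/available/{print $2-1}')
inode=$(( itask % (maxnode + 1) ))
echo "$itask" | numactl --cpunodebind=$inode --membind=$inode mcolspot_par
</pre>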


== Processing compressed data ==
XDS can process data files that were previously compressed with compress (<code>.Z</code>), gzip (<code>.gz</code>), bzip2 (<code>.bz2</code>) or xz (<code>.xz</code>). It does this by on-the-fly decompression to temporary files with standard names (<code>SCRATCH2XXYY.tmp</code>) where XX (XX = 01..99) stands for the "JOB" and YY (YY = 01..99) for the thread number that produces the temporary file.  


Compression saves a lot of disk space, but decompression may be time-consuming in terms of CPU and I/O. The penalty associated with decompression can be mitigated by  
* using a [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#LIB= LIB=] plugin, which saves the I/O and the overhead of running an external program. This exists for gzip-compressed CBF files; see [[LIB#Existing_implementations]].
* (Linux only) using symlinks pointing to /dev/shm which results in <code>SCRATCH2XXYY.tmp</code> being written to RAM instead of (network) disk. A script (typically called <code>mklinks</code>) achieving this is
<pre>
#!/bin/bash
# purpose: create symlinks for xds_par
# usage: mklinks [# of jobs]

maxjobs=$1
test -z $1 && maxjobs=1

maxprocs=$(grep processor /proc/cpuinfo | wc -l)
echo creating symlinks for $maxprocs threads and $maxjobs JOBs

# create unique directory for SCRATCH2 files:
tempdir="/dev/shm/xds${PWD//\//_}"
rm -rf $tempdir
mkdir $tempdir

for j in $(seq 1 $maxjobs); do
  for i in $(seq 1 $maxprocs); do
    ln -sfn $tempdir/SCRATCH_$(printf "%02d" "$j")$(printf "%02d" "$i").tmp
    ln -sfn $tempdir/SCRATCH2_$(printf "%02d" "$j")$(printf "%02d" "$i").tmp
  done
done
</pre>
This has to be run in the XDS processing directory of the current dataset, before running <code>xds_par</code>. After finishing data processing, one may clean up with this script (typically called <code>rmlinks</code>):
<pre>
#!/bin/bash
tempdir="/dev/shm/xds${PWD//\//_}"
rm -rf $tempdir
rm -f SCRATCH*
</pre>
* if decompressing <code>.bz2</code> files, one can use the faster <code>lbunzip2</code> (if it is installed) simply by making a symlink to it (assuming $HOME/bin is in your $PATH):
 ln -s `which lbunzip2` $HOME/bin/bunzip2
These measures can be combined.
== Linux kernel setting ==
 cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
on RHEL6, and
 cat /sys/kernel/mm/transparent_hugepage/enabled
on RHEL7, respectively, should show that <code>always</code>, not <code>never</code>, is the active setting (the active setting is shown in brackets).
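If <code>never</code> is the active setting, it can be changed (as root; the change does not survive a reboot) with e.g.
 echo always > /sys/kernel/mm/transparent_hugepage/enabled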
