Cluster Installation

From XDSwiki
Revision as of 15:34, 10 September 2019 by Kay (talk | contribs) (→‎Performance)
Jump to navigation Jump to search

The following does not refer to the CLUSTER_NODES= setup. The latter does not require a queueing system!

XDS can be run in a cluster using any batch job scheduling software such as Grid Engine, Condor, Torque/PBS, LSF, SLURM etc. These are distributed resource management system which monitor the CPU and memory usage of the available computing resources and schedule jobs to the least used computers.

setup of XDS for a batch queue system

In order to setup XDS for a queuing system, the forkxds script needs to be changed to use qsub instead of ssh. Example scripts used for Univa Grid Engine (UGA) at Diamond (from https://github.com/DiamondLightSource/fast_dp/tree/master/etc/uge_array - thanks to Graeme Winter!) are below; they may need to be changed for the specific environment and queueing system.

# forkxds
#!/bin/bash
#                    forkxds          Version DLS-2017/08
#
# enables  multi-tasking by splitting the COLSPOT and INTEGRATE
# steps of xds into independent jobs. Each job is carried out by 
# a Fortran main program (mcolspot, mcolspot_par, mintegrate, or
# mintegrate_par). The jobs are distributed among the processor 
# nodes of the NFS cluster network.
#
# 'forkxds' is called by xds or xds_par by the Fortran instruction
# CALL SYSTEM('forkxds ntask maxcpu main rhosts'),
#    ntask  ::total number of independent jobs (tasks)
#   maxcpu  ::maximum number of processors used by each job
#    main   ::name of the main program to be executed; could be
#             mcolspot | mcolspot_par | mintegrate | mintegrate_par
#   rhosts  ::names of CPU cluster nodes in the NFS network 
#
# DLS UGE port of script to operate nicely with cluster 
# scheduling system - will work with any XDS usage but is 
# aimed for fast_dp see fast_dp#3. Options passed through environment:
#
# FORKXDS_PRIORITY - priority within queue, e.g. 1024
# FORKXDS_PROJECT - UGE project to assign for this
# FORKXDS_QUEUE - queue to submit to

ntask=$1  #total number of jobs
maxcpu=$2 #maximum number of processors used by each job
main=$3   #name of the main program to be executed

rm -f forkxds.params
itask=1
while test $itask -le $ntask
do
   echo $main >> forkxds.params
   itask=`expr $itask + 1`
done

# save environment
echo "PATH=$PATH" > forkxds.env
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" >> forkxds.env

# check environment for queue; project; priority information
qsub_opt=""
if [[ -n "$FORKXDS_PRIORITY" ]] ; then
    qsub_opt="$qsub_command -p $FORKXDS_PRIORITY"
fi

if [[ -n "$FORKXDS_PROJECT" ]] ; then
    qsub_opt="$qsub_command -P $FORKXDS_PROJECT"
fi

if [[ -n "$FORKXDS_QUEUE" ]] ; then
    qsub_opt="$qsub_command -q $FORKXDS_QUEUE"
fi

qsub $qsub_opt -sync y -V -cwd -pe smp $maxcpu -t 1-$ntask `which forkxds_job`


# forkxds_job

#!/bin/bash

params=$(awk "NR==$SGE_TASK_ID" forkxds.params)
JOB=`echo $params | awk '{print $1}'`

# load environment
. forkxds.env

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH
export PATH=$PATH
echo $SGE_TASK_ID | $JOB

Performance

If the cluster has nodes with more cores than the node where xds_par was started, the binary file mintegrate.tmp (written by xds_par, and read by mintegrate_par) must be patched, by adding these two lines to the forkxds script:

printf "0: %02x000000" <desired_number_of_threads> | xxd -r -g0 > nn.dat
dd bs=4 count=1 seek=13 conv=notrunc if=nn.dat of=mintegrate.tmp

If the cluster is homogeneous then this is of course no concern.