# R-factors

## Definitions

### Data quality indicators

In the following, all sums over hkl extend only over unique reflections with more than one observation!

• Rsym and Rmerge - the formula for both is:
$\displaystyle{ R = \frac{\sum_{hkl} \sum_{j} \vert I_{hkl,j}-\langle I_{hkl}\rangle\vert}{\sum_{hkl} \sum_{j}I_{hkl,j}} }$


where $\displaystyle{ \langle I_{hkl}\rangle }$ is the average of symmetry- (or Friedel-) related observations of a unique reflection.

It can be shown that this formula results in higher R-factors when the redundancy is higher (Diederichs and Karplus [1]). In other words, low-redundancy datasets appear better than high-redundancy ones, which obviously violates the intention of having an indicator of data quality!
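The multiplicity dependence can be illustrated with a small simulation. This is only a sketch under stated assumptions: every reflection has the same true intensity, the errors are Gaussian with a fixed standard deviation, and the function names `r_merge` and `simulate` are made up for this illustration.

```python
import random


def r_merge(groups):
    """Rmerge for a list of per-reflection intensity lists."""
    numerator = denominator = 0.0
    for intensities in groups:
        n = len(intensities)
        if n < 2:
            continue  # unique reflections with one observation are skipped
        mean = sum(intensities) / n
        numerator += sum(abs(i - mean) for i in intensities)
        denominator += sum(intensities)
    return numerator / denominator


def simulate(multiplicity, n_reflections=2000, true_i=1000.0, sigma=50.0, seed=0):
    """Toy data: each reflection observed `multiplicity` times with Gaussian noise."""
    rng = random.Random(seed)
    return [
        [rng.gauss(true_i, sigma) for _ in range(multiplicity)]
        for _ in range(n_reflections)
    ]


# Same true intensity and same noise level; only the multiplicity differs.
r_low = r_merge(simulate(multiplicity=2))
r_high = r_merge(simulate(multiplicity=8))
print(f"Rmerge at multiplicity 2: {r_low:.4f}")
print(f"Rmerge at multiplicity 8: {r_high:.4f}")
```

Although the noise level is identical in both runs, the higher-multiplicity dataset gets the larger Rmerge. For Gaussian errors the mean absolute deviation from the sample average scales with sqrt((n-1)/n), so it grows with the number of observations n.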

• Redundancy-independent version of the above:
$\displaystyle{ R_{meas} = \frac{\sum_{hkl} \sqrt \frac{n}{n-1} \sum_{j=1}^{n} \vert I_{hkl,j}-\langle I_{hkl}\rangle\vert}{\sum_{hkl} \sum_{j}I_{hkl,j}} }$


which unfortunately results in higher (but more realistic) numerical values than Rsym / Rmerge (Diederichs and Karplus [1], Weiss and Hilgenfeld [2]).
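The two formulas can be compared on a handful of made-up observations. The following is a minimal sketch, not the code of any data-processing program. It assumes that the hkl indices have already been reduced to the asymmetric unit, so that symmetry-related observations share one key. The function name and the data are hypothetical.

```python
import math
from collections import defaultdict


def r_merge_and_r_meas(observations):
    """Return (Rmerge, Rmeas) for a list of (hkl, intensity) pairs."""
    groups = defaultdict(list)
    for hkl, intensity in observations:
        groups[hkl].append(intensity)

    num_merge = num_meas = denominator = 0.0
    for intensities in groups.values():
        n = len(intensities)
        if n < 2:
            continue  # reflections with a single observation are omitted
        mean = sum(intensities) / n
        abs_dev = sum(abs(i - mean) for i in intensities)
        num_merge += abs_dev
        num_meas += math.sqrt(n / (n - 1)) * abs_dev  # per-reflection correction
        denominator += sum(intensities)
    return num_merge / denominator, num_meas / denominator


# Made-up observations: (unique hkl, measured intensity)
obs = [
    ((1, 0, 0), 100.0), ((1, 0, 0), 110.0),
    ((1, 1, 0), 50.0), ((1, 1, 0), 40.0), ((1, 1, 0), 60.0),
    ((2, 0, 0), 75.0),  # only one observation, so it does not contribute
]
r_merge, r_meas = r_merge_and_r_meas(obs)
print(f"Rmerge = {r_merge:.4f}, Rmeas = {r_meas:.4f}")
```

Because the factor sqrt(n/(n-1)) is always greater than 1, Rmeas is larger than Rmerge for the same data, and the difference is largest for reflections with few observations.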

#### Measuring quality of averaged intensities/amplitudes

For intensities, use (Weiss [3])

$\displaystyle{ R_{p.i.m.} = \frac{\sum_{hkl} \sqrt \frac{1}{n-1} \sum_{j=1}^{n} \vert I_{hkl,j}-\langle I_{hkl}\rangle\vert}{\sum_{hkl} \sum_{j}I_{hkl,j}} }$


Rmrgd-I is similarly defined in Diederichs and Karplus [1].

Similarly, one should use Rmrgd-F as a quality indicator for amplitudes [1], which may be calculated as:

$\displaystyle{ R_{mrgd-F} = \frac{\sum_{hkl} \sqrt \frac{1}{n-1} \sum_{j=1}^{n} \vert F_{hkl,j}-\langle F_{hkl}\rangle\vert}{\sum_{hkl} \sum_{j}F_{hkl,j}} }$


with $\displaystyle{ \langle F_{hkl}\rangle }$ defined analogously to $\displaystyle{ \langle I_{hkl}\rangle }$.

In the sums above, the summation omits those reflections with just one observation.
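As a rough illustration of how the precision-indicating R-factor behaves, the sketch below evaluates the Rp.i.m. formula on made-up intensities. The function name `r_pim` and the data layout (a dictionary from unique hkl to a list of observed intensities) are assumptions for this example only.

```python
import math


def r_pim(groups):
    """Rp.i.m. for a dict mapping unique hkl -> list of observed intensities."""
    numerator = denominator = 0.0
    for intensities in groups.values():
        n = len(intensities)
        if n < 2:
            continue  # reflections with just one observation are omitted
        mean = sum(intensities) / n
        abs_dev = sum(abs(i - mean) for i in intensities)
        numerator += math.sqrt(1.0 / (n - 1)) * abs_dev
        denominator += sum(intensities)
    return numerator / denominator


# The same scatter around the same mean, measured twice or four times.
two_obs = {(1, 0, 0): [90.0, 110.0]}
four_obs = {(1, 0, 0): [90.0, 110.0, 90.0, 110.0]}
r_two = r_pim(two_obs)
r_four = r_pim(four_obs)
print(f"Rp.i.m. with 2 observations: {r_two:.4f}")
print(f"Rp.i.m. with 4 observations: {r_four:.4f}")
```

Rmerge would be identical for both toy datasets, whereas Rp.i.m. decreases when more observations are averaged. That is the behaviour expected of an indicator for the precision of the averaged intensity.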

#### Measuring radiation damage

We can plot (Diederichs [4])

$\displaystyle{ R_{d} = \frac{\sum_{hkl} \sum_{|i-j|=d} \vert I_{hkl,i} - I_{hkl,j}\vert}{\sum_{hkl} \sum_{|i-j|=d} (I_{hkl,i} + I_{hkl,j})/2} }$

which gives us the average R-factor of two observations of the same unique reflection measured d frames apart. As long as the plot stays parallel to the x axis, there is no radiation damage. As soon as the plot starts to rise, we see a systematic error contribution due to radiation damage.

Strong wiggles at very high d are irrelevant as only few reflections contribute.

To my knowledge, the only program that implements this currently (December 2008) is XDSSTAT.
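For readers who want to experiment with the formula itself, here is a toy sketch of the R_d calculation. It is not XDSSTAT's implementation. It assumes each observation is given as a tuple (unique hkl, frame number, intensity), which is a hypothetical data layout, and that the denominator for each d is non-zero.

```python
from collections import defaultdict
from itertools import combinations


def r_d(observations):
    """Return {d: (R_d, number_of_pairs)} for (hkl, frame, intensity) tuples."""
    by_hkl = defaultdict(list)
    for hkl, frame, intensity in observations:
        by_hkl[hkl].append((frame, intensity))

    numerator = defaultdict(float)
    denominator = defaultdict(float)
    n_pairs = defaultdict(int)
    for obs in by_hkl.values():
        # every pair of observations of the same unique reflection
        for (frame_i, int_i), (frame_j, int_j) in combinations(obs, 2):
            d = abs(frame_i - frame_j)
            numerator[d] += abs(int_i - int_j)
            denominator[d] += 0.5 * (int_i + int_j)
            n_pairs[d] += 1
    return {d: (numerator[d] / denominator[d], n_pairs[d]) for d in sorted(numerator)}


# Made-up data in which the intensity of (1,0,0) decays with frame number.
obs = [
    ((1, 0, 0), 1, 100.0), ((1, 0, 0), 11, 98.0), ((1, 0, 0), 91, 80.0),
    ((1, 1, 0), 5, 50.0), ((1, 1, 0), 15, 49.0),
]
result = r_d(obs)
for d, (value, count) in result.items():
    print(f"d = {d:3d}  R_d = {value:.4f}  pairs = {count}")
```

Returning the number of pairs alongside each R_d value makes it easy to see which points at high d rest on only a few reflections.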

### Model quality indicators

• R and Rfree: the formula for both is
$\displaystyle{ R=\frac{\sum_{hkl}\vert F_{hkl}^{obs}-F_{hkl}^{calc}\vert}{\sum_{hkl} F_{hkl}^{obs}} }$


where $\displaystyle{ F_{hkl}^{obs} }$ and $\displaystyle{ F_{hkl}^{calc} }$ have to be scaled w.r.t. each other. R and Rfree differ in the set of reflections they are calculated from: R is calculated for the working set, whereas Rfree is calculated for the test set.
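A minimal sketch of this calculation follows. It assumes a single overall scale factor k, taken here as the ratio of summed observed to summed calculated amplitudes in the working set and then applied to both sets. Refinement programs use more elaborate scaling, so this choice, the function name and the data are assumptions made only for illustration.

```python
def r_work_and_r_free(reflections):
    """Return (R, Rfree) for a list of (f_obs, f_calc, is_free) tuples."""
    work = [(o, c) for o, c, is_free in reflections if not is_free]
    free = [(o, c) for o, c, is_free in reflections if is_free]

    # Assumed simple scaling: one overall factor from the working set.
    k = sum(o for o, _ in work) / sum(c for _, c in work)

    def residual(pairs):
        return sum(abs(o - k * c) for o, c in pairs) / sum(o for o, _ in pairs)

    return residual(work), residual(free)


# Made-up amplitudes: (Fobs, Fcalc, belongs to test set?)
reflections = [
    (100.0, 48.0, False), (200.0, 102.0, False),
    (150.0, 75.0, False), (50.0, 25.0, False),
    (120.0, 66.0, True), (80.0, 36.0, True),
]
r_work, r_free = r_work_and_r_free(reflections)
print(f"R = {r_work:.3f}, Rfree = {r_free:.3f}")
```

The same formula is evaluated twice; only the subset of reflections entering the sums changes.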

## What do R-factors try to measure, and how to interpret their values?

• relative deviation of

### Data quality

• typical values: ...

### Model quality

#### Relation between R and Rfree as a function of resolution

References:

• Tickle IJ, Laskowski RA and Moss DS. Rfree and the Rfree Ratio. I. Derivation of Expected Values of Cross-Validation Residuals Used in Macromolecular Least-Squares Refinement. Acta Cryst. (1998). D54, 547-557 [5]
• Tickle IJ, Laskowski RA and Moss DS. Rfree and the Rfree ratio. II. Calculation of the expected values and variances of cross-validation statistics in macromolecular least-squares refinement. Acta Cryst. (2000). D56, 442-450 [6]

- formula from that paper: Rfree = 1.065*R + 0.036

- plot with empirical data: http://xray.bmc.uu.se/gerard/supmat/rfree2000/rfminusr_vs_resolution.gif

- many more plots: http://xray.bmc.uu.se/gerard/supmat/rfree2000

- harry plotter (java): http://xray.bmc.uu.se/gerard/supmat/rfree2000/plotter.html
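As a worked instance of the linear relation quoted above, and assuming that R is expressed as a fraction rather than in percent (the units are not stated here), a working R of 0.20 gives

$\displaystyle{ R_{free} = 1.065 \times 0.20 + 0.036 = 0.249 }$

that is, a difference between Rfree and R of about 0.049.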

## What kinds of problems exist with these indicators?

• Rsym / Rmerge should not be used to judge data quality; Rmeas should be used instead. The reason is that the former depend on multiplicity, whereas the latter does not.
• R/Rfree and NCS: reflections in the working and test sets are not independent if the test set is chosen randomly. It is better to choose the test set reflections in thin resolution shells. Since twin-related reflections have the same sin(theta)/lambda values, they will not be split between the working and test sets. DATAMAN from the Uppsala Software Factory and XPREP (a program which may be obtained from Bruker) offer this option. A disadvantage is that the maps may not be quite as good as when the free R reflections are selected randomly. (FIXME: which Phenix program does this?). A paper investigating this thoroughly is Fabiola, F., A. Korostelev, et al. (2006). "Bias in cross-validated free R factors: mitigation of the effects of non-crystallographic symmetry." Acta Cryst. D 62: 227-38.
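The idea of selecting the test set in thin resolution shells can be sketched as follows. This is not the algorithm used by DATAMAN or XPREP. The function name, the parameters `n_shells` and `free_every`, and the choice of shells of equal reciprocal-space volume are all assumptions made for this illustration.

```python
def select_free_in_thin_shells(d_spacings, n_shells=100, free_every=20):
    """Return a list of booleans (True = test set), one per reflection.

    Shells are defined by resolution boundaries, so reflections with the same
    d-spacing always end up in the same shell and hence in the same set.
    """
    s_cubed = [(1.0 / d) ** 3 for d in d_spacings]  # proportional to shell volume
    lowest, highest = min(s_cubed), max(s_cubed)
    if highest == lowest:
        raise ValueError("all reflections have the same resolution")
    width = (highest - lowest) / n_shells

    flags = []
    for value in s_cubed:
        shell = min(int((value - lowest) / width), n_shells - 1)
        # flag every free_every-th shell, starting away from the lowest resolution
        flags.append(shell % free_every == free_every // 2)
    return flags


# Toy d-spacings from 20 A down to about 1.8 A, each value occurring twice
# to mimic pairs of reflections at identical resolution.
d_values = [20.0 / (1.0 + 0.01 * i) for i in range(1000)]
flags = select_free_in_thin_shells(d_values + d_values)
print(f"fraction in test set: {sum(flags) / len(flags):.3f}")
```

With the default values, 5 of the 100 shells are flagged, which corresponds to roughly 5% of reciprocal-space volume. Reflections that are only approximately equal in resolution, as for NCS-related ones, can still fall on opposite sides of a shell boundary.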

## Notes

1. K. Diederichs and P.A. Karplus (1997). Improved R-factors for diffraction data analysis in macromolecular crystallography. Nature Struct. Biol. 4, 269-275 [1]
2. M.S. Weiss and R. Hilgenfeld (1997). On the use of the merging R-factor as a quality indicator for X-ray data. J. Appl. Cryst. 30, 203-205 [2]
3. M.S. Weiss (2001). Global indicators of X-ray data quality. J. Appl. Cryst. 34, 130-135 [3]
4. K. Diederichs (2006). Some aspects of quantitative analysis and correction of radiation damage. Acta Cryst. D62, 96-101 [4]