CC1/2: Difference between revisions

From XDSwiki
Jump to navigation Jump to search
(value of CC1/2 at a resolution where the signal vanishes)
(→‎See also: link PDF)
(43 intermediate revisions by 2 users not shown)
Line 1: Line 1:


== number of reflection pairs ==
==CC<sub>1/2</sub> calculation==
[[CORRECT.LP]] and XSCALE.LP do not explicitly state the ''number of reflection pairs'' that were used to calculated CC1/2..
 
CC<sub>1/2</sub>  can be calculated with the so-called σ-τ method ([https://cms.uni-konstanz.de/index.php?eID=tx_nawsecuredl&u=0&g=0&t=1475179096&hash=5cf64234a23a794a1894c5408384c57208d7b602&file=fileadmin/biologie/ag-strucbio/pdfs/Assman2016_JApplCryst.pdf Assmann, G., Brehm, W. and Diederichs, K. (2016) Identification of rogue datasets in serial crystallography. J. Appl. Cryst. 49, 1021-1028.]) by:
 
: <math>CC_{1/2}=\frac{\sigma^2_{\tau}}{\sigma^2_{\tau}+\sigma^2_{\epsilon}} =\frac{\sigma^2_{y}- \frac{1}{2}\sigma^2_{\epsilon}}{\sigma^2_{y}+ \frac{1}{2}\sigma^2_{\epsilon}} </math>
 
This requires calculation of <math>\sigma^2_{y} </math>, the variance of the average intensities, and <math>\sigma^2_{\epsilon} </math>, the average of the variances of the averaged (merged) intensities.
 
== Method ==
 
===''' <math>\sigma^2_{\epsilon} </math>'''===
 
With <math>x_{j,i} </math> , a single observation j of all n observations of one reflection i, the average of all sample variances of the mean across all unique reflections of a resolution shell is obtained by calculating the unbiased sample variance of the mean for every unique reflection i by:
 
<math>s^2_{\epsilon i} = \frac{1}{n_{i}-1} \cdot \left ( \sum^{n_{i}}_{j} x^2_{j,i} - \frac{\left ( \sum^{n_{i}}_{j}x_{j,i} \right )^2}{n_{i}} \right )    / \frac{n_{i}}{2} </math>
 
<math>s^2_{\epsilon i} </math> is divided by the factor  <math>\frac{n}{2} </math>, because the variance of the sample mean (intensities of the merged observations) is the quantity of interest. The division by '''n/2''' takes care of providing the variance of the mean ([https://en.wikipedia.org/wiki/Sample_mean_and_covariance#Variance_of_the_sample_mean ]) (merged) intensity of the '''half'''-datasets, as defined in [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3457925/ Karplus and Diederichs (2012)]. These "variances of means" are averaged over all unique reflections of the resolution shell:
<math>s^2_{\epsilon}=\sum^N_{i} s^2_{\epsilon i} / N </math>
 
 
----
 
 
If the standard deviations <math>\sigma_{int\_j,i} </math> for the single observations are considered as weights for the CC<sub>1/2</sub> calculation, with <math>w_{j,i}=\frac{1}{\sigma_{int\_j,i}^2} </math>, one way to obtain the unbiased '''weighted''' sample variance of the half-dataset mean for every unique reflection i is:
 
<math>s^2_{\epsilon i\_w} =  \frac{n_{i}}{n_{i}-1} \cdot \left ( \frac{\sum^{n_{i}}_{j}w_{j,i} x^2_{j,i}}{\sum^{n_{i}}_{j}w_{j,i}} -\left ( \frac{ \sum^{n_{i}}_{j}w_{j,i}x_{j,i} }{\sum^{n_{i}}_{j}w_{j,i}}\right )^2 \right )    / \frac{n_{i}}{2} </math>
 
These " weighted variances of means" are averaged over all unique reflections of the resolution shell:
 
<math>s^2_{\epsilon_w}=\sum^N_{i} s^2_{\epsilon i\_w} / N </math>
 
It should be noted that it is not straightforward to define the correct way to calculate a weighted variance (and the weighted variance of the mean). The formula <math>s^2_w =  \frac{n_{i}}{n_{i}-1} \cdot \left ( \frac{\sum^{n_{i}}_{j}w_{j,i} x^2_{j,i}}{\sum^{n_{i}}_{j}w_{j,i}} -\left ( \frac{ \sum^{n_{i}}_{j}w_{j,i}x_{j,i} }{\sum^{n_{i}}_{j}w_{j,i}}\right )^2 \right )</math> is - after some manipulation - the same as that found at [https://stats.stackexchange.com/questions/6534/how-do-i-calculate-a-weighted-standard-deviation-in-excel],[https://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf]. Other ways of calculating the weighted variance of the mean ([https://en.wikipedia.org/wiki/Weighted_arithmetic_mean],[https://www.gnu.org/software/gsl/manual/html_node/Weighted-Samples.html]) should be considered.
 
----
 
===''' <math>\sigma^2_{y} </math>'''===
 
Let N be the number of reflections in a resolution shell. With <math>\overline{x}_{i}= \sum^n_{j} x_{j,i}</math> , the unbiased sample variance from all averaged intensities of all unique reflections is calculated by:
 
<math>s^2_{y} = \frac{1}{N-1} \cdot \left (\sum^N_{i} \overline{x}_{i}^2 - \frac{\left ( \sum^N_{i} \overline{x}_{i} \right )^2}{ N} \right ) </math>
 
----
If the standard deviations <math>\sigma_{int\_j,i} </math> for the single observations are used for weighting, <math>s^2_{y}</math> is obtained from
<math>\overline{x}_{i_w} = \frac{\sum^n_{j} w_{j,i}  x_{j,i}} {\sum^n_{j}w_{j,i}}</math>:
 
<math>s^2_{y_w} = \frac{1}{N-1} \cdot \left (\sum^N_{i} \overline{x}_{i_w}^2 - \frac{\left ( \sum^N_{i} \overline{x}_{i_w} \right )^2}{ N} \right ) </math>
 
== Example ==
An example is shown for a very simplified data file (unmerged ASCII.HKL).
 
<pre>
First reflection with 6 observations:
    h    k    l      int    σ(int) 
    2    0    0  9.156E+02  3.686E+00  1
    0    2    0  5.584E+02  3.093E+00  1
    0    0    2  6.301E+02  2.405E+01  1 
    2    0    0  9.256E+02  3.686E+00  2
    0    2    0  2.584E+02  3.093E+00  2
    0    0    2  7.301E+02  2.405E+01  2
</pre>
<math>\overline{x}_{1} </math> , the average intensity of all observations of this reflection = 669.6999
 
<math>\sigma^2_{\epsilon 1} </math>, the unbiased sample variance of the mean of all observations of this unique reflection = 62544.6597/(n/2) = 20848.2198
 
<pre>
Second reflection with 6 observations:
    h    k    l      int    σ(int) 
    1    1    2  2.395E+01  8.932E+01  1 
    1    2    1  9.065E+01  7.407E+00  1 
    2    1    1  5.981E+01  9.125E+00  1 
    1    1    2  3.395E+01  8.932E+01  2 
    1    2    1  9.065E+01  7.407E+00  2 
    2    1    1  1.608E+01  2.215E+01  2 
</pre>
<math>\overline{x}_{2} </math> , the average intensity of all observations of this reflection = 52.5150
 
<math>\sigma^2_{\epsilon 2} </math>, the unbiased sample variance of the mean of all observations of this unique reflection = 1089.9803/(n/2) = 363.3267
1089.9803/(n/2)
 
<math>\sigma^2_{\epsilon} </math> , the average of all the <math>\sigma^2_{\epsilon i} </math> = 10605.7733
 
<math>\sigma^2_{y} </math>, the variance of all the averaged intensities = 190458.6533
 
As a result of these calculations  CC<sub>1/2</sub> = (190458.6533-(0.5*10605.7733))/(190458.6533+(0.5*10605.7733)), which results in 0.9458.
 
The described calculation is implemented in [[XDSCC12]], and CC<sub>1/2</sub> and [[DeltaCC12|ΔCC<sub>1/2</sub>]] can be calculated for XDS_ASCII.HKL and XSCALE.HKL files.
 
==Fortran 95 code that assumes that all unique reflections have the same number of observations==
 
<pre>
sumibar=0
sumibar2=0
sumsig2eps=0
DO i=1,nref
  xbar=SUM(iobs(:,i))/nobs
  sumibar=sumi+xbar
  sumibar2=sumibar2+xbar**2
  sumsig2eps=sumeps + (SUM(iobs(:,i)**2)-xbar**2*nobs)/(nobs-1)/(nobs/2)
END DO
sig2y=(sumibar2-sumibar**2/nref)/(nref-1)
sig2eps=sumsig2eps/nref
print *,(sig2y-0.5*sig2eps)/(sig2y+0.5*sig2eps)
</pre>


However, the number can be calculated from the numbers available, for each resolution shell: there is the NUMBER OF UNIQUE REFLECTIONS (X), the NUMBER OF OBSERVED REFLECTIONS (Y), and the number of COMPARED reflections (Z) - the latter number is the total number of unmerged observations that contributed to the CC1/2 and the R-value calculations.


The ''number of reflections pairs'' that were used for the CC1/2 calculation can therefore be obtained as follows: Y-Z gives the number of unique reflections that have a single observation. The remaining (X-Y+Z) unique reflections have multiple observations, i.e. there were  (X-Y+Z) reflection pairs that went into CC1/2.
== number of reflection pairs ==
[[CORRECT.LP]] and XSCALE.LP do not explicitly state the ''number of reflection pairs'' that were used to calculated CC<sub>1/2</sub>.


However, the number can be calculated from the numbers available, for each resolution shell: there is the NUMBER OF UNIQUE REFLECTIONS (X), the NUMBER OF OBSERVED REFLECTIONS (Y), and the number of COMPARED reflections (Z) - the latter number is the total number of unmerged observations that contributed to the CC<sub>1/2</sub> and the R-value calculations.


== value of CC1/2 at a resolution where the signal vanishes ==
The ''number of reflections pairs'' that were used for the CC<sub>1/2</sub> calculation can therefore be obtained as follows: Y-Z gives the number of unique reflections that have a single observation. The remaining (X-Y+Z) unique reflections have multiple observations, i.e. there were  (X-Y+Z) reflection pairs that went into CC<sub>1/2</sub>.
At a resolution where the signal vanishes, CC1/2 should be around zero. However, empirically we sometimes see negative values of CC1/2 (to values down to around -0.4) when using sftools or phenix.cc_star for calculating it. On the other hand, CC1/2 as printed out in CORRECT.LP does approach zero. How can this be understood?


The reason is that CORRECT does "alien" rejection (as documented in [[CORRECT.LP]])  ''after'' the final statistics table is printed. "Aliens" are reflections that are much stronger than should be expected in their resolution range, e.g. ice reflections. These reflections are identified in the following way: the average intensity in a resolution range is calculated. Any (acentric) reflection whose intensity is larger than 10 times the average is suspicious/unexpected; it is printed out at the bottom of CORRECT.LP (for centrics, the criterion is a bit different). By default, the parameter REJECT_ALIENS has a value of 20, which means that those reflections with intensity > 20*average are marked as aliens (outliers), and are disregarded in downstream processing (e.g. [[XDSCONV]]).


In a resolution shell where the noise is much stronger than the signal (empirically, if <I/sigma> is less than 0.2), many reflections are considered as aliens - those where the noise happens to be strongly positive. If these are rejected (i.e. if the default REJECT_ALIEN is applied) then the average intensity becomes negative.  
== why CC<sub>1/2</sub> can be negative ==
If the numerator of the formula becomes negative, CC<sub>1/2</sub> is negative. This happens if the variance of the average intensities across the unique reflections of a resolution shell is low, but the individual measurements of each unique reflection vary strongly. This is discussed in §4.1 of [https://cms.uni-konstanz.de/index.php?eID=tx_nawsecuredl&u=0&g=0&t=1475179096&hash=5cf64234a23a794a1894c5408384c57208d7b602&file=fileadmin/biologie/ag-strucbio/pdfs/Assman2016_JApplCryst.pdf Assmann, G., Brehm, W. and Diederichs, K. (2016) Identification of rogue datasets in serial crystallography (2016) J. Appl. Cryst. 49, 1021-1028.]


In addition, CC1/2 becomes negative as can be seen in a simulation employing random numbers that are normally distributed with an average of zero and a variance of one. In the figure, each reflection is represented at a location determined by the intensities of its subset intensities. Reflections with intensity>1 are rejected (red crosses), whereas reflections with intensity<1 are used for calculating CC1/2 (green). The magenta line divides the plot into reflections with positive (total) intensity (upper right) and negative (total) intensity (lower left). The blue line is a least-squares fit to the "green" reflections; the correlation coefficient is -0.3.  
== Implementation ==
This way of calculating CC<sub>1/2</sub> is implemented in [[XDSCC12]] and in [http://www.desy.de/~twhite/crystfel/manual-partialator.html partialator] of [http://www.desy.de/~twhite/crystfel/relnotes-0.8.0 CrystFEL].


[[File:Reject_aliens.png]]
== See also ==
[https://www.youtube.com/watch?v=LirxJIcQ6T0 CC* - Linking crystallographic model and data quality.] Video recorded at SBGrid/NE-CAT workshop 2014; the PDF is [https://strucbio.biologie.uni-konstanz.de/pub/CC%20-%20Linking%20crystallographic%20model%20and%20data%20quality.pdf here]. The sound is poor.

Revision as of 20:14, 15 June 2020

CC1/2 calculation

CC1/2 can be calculated with the so-called σ-τ method (Assmann, G., Brehm, W. and Diederichs, K. (2016) Identification of rogue datasets in serial crystallography. J. Appl. Cryst. 49, 1021-1028.) by:

[math]\displaystyle{ CC_{1/2}=\frac{\sigma^2_{\tau}}{\sigma^2_{\tau}+\sigma^2_{\epsilon}} =\frac{\sigma^2_{y}- \frac{1}{2}\sigma^2_{\epsilon}}{\sigma^2_{y}+ \frac{1}{2}\sigma^2_{\epsilon}} }[/math]

This requires calculation of [math]\displaystyle{ \sigma^2_{y} }[/math], the variance of the average intensities, and [math]\displaystyle{ \sigma^2_{\epsilon} }[/math], the average of the variances of the averaged (merged) intensities.

Method

[math]\displaystyle{ \sigma^2_{\epsilon} }[/math]

With [math]\displaystyle{ x_{j,i} }[/math] , a single observation j of all n observations of one reflection i, the average of all sample variances of the mean across all unique reflections of a resolution shell is obtained by calculating the unbiased sample variance of the mean for every unique reflection i by:

[math]\displaystyle{ s^2_{\epsilon i} = \frac{1}{n_{i}-1} \cdot \left ( \sum^{n_{i}}_{j} x^2_{j,i} - \frac{\left ( \sum^{n_{i}}_{j}x_{j,i} \right )^2}{n_{i}} \right ) / \frac{n_{i}}{2} }[/math]

[math]\displaystyle{ s^2_{\epsilon i} }[/math] is divided by the factor [math]\displaystyle{ \frac{n}{2} }[/math], because the variance of the sample mean (intensities of the merged observations) is the quantity of interest. The division by n/2 takes care of providing the variance of the mean ([1]) (merged) intensity of the half-datasets, as defined in Karplus and Diederichs (2012). These "variances of means" are averaged over all unique reflections of the resolution shell:

[math]\displaystyle{ s^2_{\epsilon}=\sum^N_{i} s^2_{\epsilon i} / N }[/math]




If the standard deviations [math]\displaystyle{ \sigma_{int\_j,i} }[/math] for the single observations are considered as weights for the CC1/2 calculation, with [math]\displaystyle{ w_{j,i}=\frac{1}{\sigma_{int\_j,i}^2} }[/math], one way to obtain the unbiased weighted sample variance of the half-dataset mean for every unique reflection i is:

[math]\displaystyle{ s^2_{\epsilon i\_w} = \frac{n_{i}}{n_{i}-1} \cdot \left ( \frac{\sum^{n_{i}}_{j}w_{j,i} x^2_{j,i}}{\sum^{n_{i}}_{j}w_{j,i}} -\left ( \frac{ \sum^{n_{i}}_{j}w_{j,i}x_{j,i} }{\sum^{n_{i}}_{j}w_{j,i}}\right )^2 \right ) / \frac{n_{i}}{2} }[/math]

These " weighted variances of means" are averaged over all unique reflections of the resolution shell:

[math]\displaystyle{ s^2_{\epsilon_w}=\sum^N_{i} s^2_{\epsilon i\_w} / N }[/math]

It should be noted that it is not straightforward to define the correct way to calculate a weighted variance (and the weighted variance of the mean). The formula [math]\displaystyle{ s^2_w = \frac{n_{i}}{n_{i}-1} \cdot \left ( \frac{\sum^{n_{i}}_{j}w_{j,i} x^2_{j,i}}{\sum^{n_{i}}_{j}w_{j,i}} -\left ( \frac{ \sum^{n_{i}}_{j}w_{j,i}x_{j,i} }{\sum^{n_{i}}_{j}w_{j,i}}\right )^2 \right ) }[/math] is - after some manipulation - the same as that found at [2],[3]. Other ways of calculating the weighted variance of the mean ([4],[5]) should be considered.


[math]\displaystyle{ \sigma^2_{y} }[/math]

Let N be the number of reflections in a resolution shell. With [math]\displaystyle{ \overline{x}_{i}= \sum^n_{j} x_{j,i} }[/math] , the unbiased sample variance from all averaged intensities of all unique reflections is calculated by:

[math]\displaystyle{ s^2_{y} = \frac{1}{N-1} \cdot \left (\sum^N_{i} \overline{x}_{i}^2 - \frac{\left ( \sum^N_{i} \overline{x}_{i} \right )^2}{ N} \right ) }[/math]


If the standard deviations [math]\displaystyle{ \sigma_{int\_j,i} }[/math] for the single observations are used for weighting, [math]\displaystyle{ s^2_{y} }[/math] is obtained from [math]\displaystyle{ \overline{x}_{i_w} = \frac{\sum^n_{j} w_{j,i} x_{j,i}} {\sum^n_{j}w_{j,i}} }[/math]:

[math]\displaystyle{ s^2_{y_w} = \frac{1}{N-1} \cdot \left (\sum^N_{i} \overline{x}_{i_w}^2 - \frac{\left ( \sum^N_{i} \overline{x}_{i_w} \right )^2}{ N} \right ) }[/math]

Example

An example is shown for a very simplified data file (unmerged ASCII.HKL).

First reflection with 6 observations:
     h     k     l       int     σ(int)  
     2     0     0  9.156E+02  3.686E+00   1 
     0     2     0  5.584E+02  3.093E+00   1 
     0     0     2  6.301E+02  2.405E+01   1  
     2     0     0  9.256E+02  3.686E+00   2 
     0     2     0  2.584E+02  3.093E+00   2 
     0     0     2  7.301E+02  2.405E+01   2 

[math]\displaystyle{ \overline{x}_{1} }[/math] , the average intensity of all observations of this reflection = 669.6999

[math]\displaystyle{ \sigma^2_{\epsilon 1} }[/math], the unbiased sample variance of the mean of all observations of this unique reflection = 62544.6597/(n/2) = 20848.2198


Second reflection with 6 observations:
     h     k     l       int     σ(int)  
     1     1     2  2.395E+01  8.932E+01   1  
     1     2     1  9.065E+01  7.407E+00   1  
     2     1     1  5.981E+01  9.125E+00   1  
     1     1     2  3.395E+01  8.932E+01   2  
     1     2     1  9.065E+01  7.407E+00   2  
     2     1     1  1.608E+01  2.215E+01   2  

[math]\displaystyle{ \overline{x}_{2} }[/math] , the average intensity of all observations of this reflection = 52.5150

[math]\displaystyle{ \sigma^2_{\epsilon 2} }[/math], the unbiased sample variance of the mean of all observations of this unique reflection = 1089.9803/(n/2) = 363.3267 1089.9803/(n/2)

[math]\displaystyle{ \sigma^2_{\epsilon} }[/math] , the average of all the [math]\displaystyle{ \sigma^2_{\epsilon i} }[/math] = 10605.7733

[math]\displaystyle{ \sigma^2_{y} }[/math], the variance of all the averaged intensities = 190458.6533

As a result of these calculations CC1/2 = (190458.6533-(0.5*10605.7733))/(190458.6533+(0.5*10605.7733)), which results in 0.9458.

The described calculation is implemented in XDSCC12, and CC1/2 and ΔCC1/2 can be calculated for XDS_ASCII.HKL and XSCALE.HKL files.

Fortran 95 code that assumes that all unique reflections have the same number of observations

sumibar=0
sumibar2=0
sumsig2eps=0
DO i=1,nref
  xbar=SUM(iobs(:,i))/nobs
  sumibar=sumi+xbar
  sumibar2=sumibar2+xbar**2
  sumsig2eps=sumeps + (SUM(iobs(:,i)**2)-xbar**2*nobs)/(nobs-1)/(nobs/2)
END DO
sig2y=(sumibar2-sumibar**2/nref)/(nref-1)
sig2eps=sumsig2eps/nref
print *,(sig2y-0.5*sig2eps)/(sig2y+0.5*sig2eps)


number of reflection pairs

CORRECT.LP and XSCALE.LP do not explicitly state the number of reflection pairs that were used to calculated CC1/2.

However, the number can be calculated from the numbers available, for each resolution shell: there is the NUMBER OF UNIQUE REFLECTIONS (X), the NUMBER OF OBSERVED REFLECTIONS (Y), and the number of COMPARED reflections (Z) - the latter number is the total number of unmerged observations that contributed to the CC1/2 and the R-value calculations.

The number of reflections pairs that were used for the CC1/2 calculation can therefore be obtained as follows: Y-Z gives the number of unique reflections that have a single observation. The remaining (X-Y+Z) unique reflections have multiple observations, i.e. there were (X-Y+Z) reflection pairs that went into CC1/2.


why CC1/2 can be negative

If the numerator of the formula becomes negative, CC1/2 is negative. This happens if the variance of the average intensities across the unique reflections of a resolution shell is low, but the individual measurements of each unique reflection vary strongly. This is discussed in §4.1 of Assmann, G., Brehm, W. and Diederichs, K. (2016) Identification of rogue datasets in serial crystallography (2016) J. Appl. Cryst. 49, 1021-1028.

Implementation

This way of calculating CC1/2 is implemented in XDSCC12 and in partialator of CrystFEL.

See also

CC* - Linking crystallographic model and data quality. Video recorded at SBGrid/NE-CAT workshop 2014; the PDF is here. The sound is poor.