Xds nonisomorphism: Difference between revisions

From XDSwiki
[ftp://{{SERVERNAME}}/pub/linux_bin/xds_nonisomorphism xds_nonisomorphism] ([ftp://{{SERVERNAME}}/pub/sources/xds_nonisomorphism.f90 source], [ftp://{{SERVERNAME}}/pub/mac_bin/xds_nonisomorphism Mac binary]) is a program that analyzes data sets (typically fewer than 10) stored in unmerged reflection files (typically called XDS_ASCII.HKL) as written by [[XDS]]. It implements equation 2 of the theory of [https://doi.org/10.1107/S2059798317000699 Diederichs (2017)]. Its purpose is to identify non-isomorphous (i.e. dissimilar or less well related) data sets among other, more similar data sets. Based on the output of xds_nonisomorphism, the user may choose to merge only the most isomorphous (similar) data sets and to discard the non-isomorphous ones, or to analyze these separately. The program does not make that choice automatically; rather, it is assumed that the user will select the isomorphous data sets based on the program output, and scale these e.g. with [[XSCALE]].


It should be noted that the result of the analysis does not depend on the amount of random error, which means it does not depend on the strength of the data sets: it works just as well for weakly or strongly exposed crystals, and for tiny or big ones.
 
xds_nonisomorphism prints a short help text if the -h option is used.
 
== Data ==
 
The assumption is that several data sets exist, and that these should be merged with [[XSCALE]]. The program therefore reads the names of the XDS_ASCII.HKL files from XSCALE.INP. That file, and the XDS_ASCII.HKL files listed after each INPUT_FILE= line in XSCALE.INP, must exist. The program reads the files in the order given, and produces tables with pairwise statistics. The method requires data sets with internal multiplicity, and mutual overlap (common reflections) between data sets.
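
Before running the program it can be useful to check that every file named in XSCALE.INP is actually present. The following is a minimal sketch of such a check, not part of xds_nonisomorphism itself. The function name <code>input_files</code> is hypothetical, and the sketch assumes one INPUT_FILE= keyword per line and that "!" starts a comment; other XSCALE.INP syntax is not handled.

```python
import os


def input_files(xscale_inp_text):
    """Return the paths given on INPUT_FILE= lines, in the order listed."""
    paths = []
    for line in xscale_inp_text.splitlines():
        # Assumption: everything after '!' is a comment.
        line = line.split("!", 1)[0].strip()
        if line.startswith("INPUT_FILE="):
            paths.append(line[len("INPUT_FILE="):].strip())
    return paths


example = """\
OUTPUT_FILE=temp.ahkl
INPUT_FILE=../../xds_317_7.rg4/XDS_ASCII.HKL
INPUT_FILE=../../xds_317_8.rg4/XDS_ASCII.HKL
INPUT_FILE=../../xds_319_7.rg4/XDS_ASCII.HKL
"""

paths = input_files(example)
# Paths are interpreted relative to the current working directory.
missing = [p for p in paths if not os.path.exists(p)]
print(len(paths), "input files listed;", len(missing), "not found")
```

In practice the text would be read from the XSCALE.INP file in the working directory rather than from a string.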
 
 
== Calculation ==
 
In particular, for each pair of data sets it determines
* the CC* values (Karplus & Diederichs (2012). Science 336, 1030–1033) from the [[CC1/2]] of the data sets (using the σ-τ method of Assmann ''et al.'', J. Appl. Cryst. (2016). 49, 1021–1028) in columns 3 and 4 of the output, and
* the pairwise (Pearson's) correlation coefficients (column 5).
As given by equation 2 of [https://doi.org/10.1107/S2059798317000699 Diederichs (2017)], the ratio between the latter quantity and the product of the CC* values of a pair is a measure of the non-isomorphism - for isomorphous data, that ratio is 1; for non-isomorphous data, the ratio is lower. This ratio is given in column 6 under the heading "cos(phi)".
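
The arithmetic behind column 6 can be sketched in a few lines. This is an illustration only: the function names are hypothetical, CC* is computed from CC1/2 with the relation of Karplus & Diederichs (2012), and the σ-τ estimate of CC1/2 that the program uses is not reproduced. The numbers are those of resolution shell 10 in the comparison of data sets 1 and 2 in the example on this page.

```python
import math


def cc_star(cc_half):
    """CC* from CC1/2 (Karplus & Diederichs, 2012); needs cc_half >= 0."""
    return math.sqrt(2.0 * cc_half / (1.0 + cc_half))


def cos_phi(cc_12, cc_star_1, cc_star_2):
    """Ratio of the pairwise CC to the product of the two CC* values."""
    return cc_12 / (cc_star_1 * cc_star_2)


# CC(1,2) = 0.4395, CC*_1 = 0.9681, CC*_2 = 0.5344
ratio = cos_phi(0.4395, 0.9681, 0.5344)
print(f"{ratio:.4f}")  # prints 0.8495, the value listed in column 6
```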
 
 
== Analysis and interpretation ==
 
Angles (calculated as the inverse cosine of the ratio) are expressed in degrees. Less than 10° may be considered good isomorphism; 90° means highly non-isomorphous (i.e. completely unrelated) data sets. However, as seen in actual tables, the numerical value (and thus the interpretation of the magnitude of an angle) depends on the resolution.

There is another interpretation of the ratio (column 6): not as cos(phi) but as a factor. To make sense of this interpretation, the program uses a formula (McCoy et al. (2017) PNAS 114, 3637-3641, equation 1) that relates a coordinate difference (column 8 of the output) to the factor. This coordinate RMSD value should be independent of resolution. If it is ''not'' (which is sometimes seen in pairwise comparisons of data sets), this indicates that some other systematic difference, one that cannot be interpreted as a coordinate difference, exists between the data sets. Candidates are the many possible sources of systematic error, e.g. errors in data processing, twinning, overloads, vibrations ...
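
Both interpretations of the ratio can be sketched as follows. The angle is simply the inverse cosine. For the coordinate RMSD, the sketch assumes a Gaussian-type relation, ratio = exp(-(2π²/3)(RMSD/d)²), with d a representative resolution of the shell. That form and the choice of the shell midpoint for d are assumptions made for illustration; they are not confirmed here as what the program does, although for the narrow high-resolution shells of the example on this page they roughly reproduce column 8. The function names are hypothetical.

```python
import math


def angle_deg(ratio):
    """Angle in degrees whose cosine is the ratio (clamped to [-1, 1])."""
    return math.degrees(math.acos(max(-1.0, min(1.0, ratio))))


def rmsd_from_factor(ratio, d):
    """Coordinate RMSD (in the units of d) under the ASSUMED relation
    ratio = exp(-(2*pi**2/3) * (rmsd/d)**2); requires 0 < ratio <= 1."""
    return d * math.sqrt(-3.0 * math.log(ratio) / (2.0 * math.pi ** 2))


# Shell 10 of the comparison of data sets 1 and 2: ratio 0.8495,
# shell limits 1.991 and 1.889 Angstrom (midpoint used as d, an assumption).
ratio = 0.8495
d_mid = 0.5 * (1.991 + 1.889)
print(f"angle = {angle_deg(ratio):.1f} deg")
print(f"RMSD  = {rmsd_from_factor(ratio, d_mid):.2f} A")
```

For this shell the program lists 31.8427° and 0.3057 Å. For wide low-resolution shells the midpoint is a poor choice of d, so the sketch should not be expected to match column 8 there.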
 
After the analysis, the program produces a 3D representation of the arrangement of data sets such that their distances in 3D try to reproduce the angles (which are averaged across resolution bins). Please note that this representation is completely different from that of [[xscale_isocluster]]!  
 
 
== Example and explanation of output ==
Create XSCALE.INP (XSCALE does not have to be run at this point):
<pre>
OUTPUT_FILE=temp.ahkl
INPUT_FILE=../../xds_317_7.rg4/XDS_ASCII.HKL
INPUT_FILE=../../xds_317_8.rg4/XDS_ASCII.HKL
INPUT_FILE=../../xds_319_7.rg4/XDS_ASCII.HKL
</pre>
Now run xds_nonisomorphism:
<pre>
dikay@turn29:-xscale_rg4/tst% xds_nonisomorphism
xds_nonisomorphism KD 2017-12-03. Academic use only; binary expires 2018-12-31.
Pls cite Diederichs, K. (2017) Acta Cryst D73, 286-293. -h option shows options
reading XSCALE.INP to find XDS_ASCII.HKL-type files
!SPACE_GROUP_NUMBER=    4
!UNIT_CELL_CONSTANTS=    34.534    57.199    72.346  90.000  90.155  90.000
iset, dmax, dmin, name:  1  25.012   1.600 ../../xds_317_7.rg4/XDS_ASCII.HKL
iset, dmax, dmin, name:  2  25.012   1.889 ../../xds_317_8.rg4/XDS_ASCII.HKL
iset, dmax, dmin, name:  3  22.917   1.664 ../../xds_319_7.rg4/XDS_ASCII.HKL
iset,nobs,nunique,nunique w/ >1 observations=   1  125760   22536   22350
iset,nobs,nunique,nunique w/ >1 observations=   2  125086   22303   22114
iset,nobs,nunique,nunique w/ >1 observations=   3  150588   22496   22357
Lowest and highest resolution used: 22.917  1.889
10 resolution shells:
  5.800  4.168  3.422  2.972  2.663  2.433  2.255  2.110  1.991  1.889
</pre>
The data sets have been read, and some basic statistics are produced. Also, the resolution limits of the (default) 10 resolution shells are listed.
<pre>
 iset1, iset2=           1           2
resol_shell nmatch   CC*_1     CC*_2     CC(1,2)  cos(phi) angle(deg) RMSD_coord
         1     768    0.9998    0.9999    0.9966    0.9969    4.4984    0.1761
         2    1297    0.9999    0.9997    0.9670    0.9674   14.6752    0.3422
         3    1672    0.9998    0.9996    0.9787    0.9793   11.6919    0.2113
         4    1903    0.9998    0.9991    0.9751    0.9762   12.5148    0.1920
         5    2234    0.9995    0.9947    0.9702    0.9759   12.5988    0.1709
         6    2508    0.9991    0.9844    0.9260    0.9415   19.6870    0.2432
         7    2575    0.9979    0.9527    0.9042    0.9512   17.9739    0.2041
         8    2785    0.9957    0.8867    0.8418    0.9534   17.5545    0.1854
         9    3023    0.9900    0.7563    0.6880    0.9189   23.2369    0.2322
        10    2966    0.9681    0.5344    0.4395    0.8495   31.8427    0.3057
</pre>
CC<sup>*</sup>_1 is very high out to the highest resolution, so the first data set (iset1=1) is quite good. CC<sup>*</sup>_2 is weaker at high resolution. cos(phi) = CC(1,2)/(CC<sup>*</sup>_1 * CC<sup>*</sup>_2) in column 6 should ideally be 1, but indicates non-isomorphism here. Converting this to an angle (column 7), only the lowest resolution shell appears "good". However, we can estimate the RMS deviation of coordinates (column 8) that would give rise to this amount of non-isomorphism, and these estimates are fairly consistent across resolution, ranging from about 0.17 to 0.34 Å with most values near 0.2 Å.
<pre>
 iset1, iset2=           1           3
resol_shell nmatch   CC*_1     CC*_2     CC(1,2)  cos(phi) angle(deg) RMSD_coord
         1     781    0.9998    0.9997    0.9931    0.9936    6.5032    0.2544
         2    1329    0.9999    0.9997    0.9691    0.9695   14.1917    0.3312
         3    1613    0.9998    0.9991    0.9476    0.9486   18.4460    0.3347
         4    1947    0.9998    0.9983    0.9880    0.9898    8.1748    0.1251
         5    2246    0.9995    0.9924    0.9747    0.9826   10.6984    0.1449
         6    2518    0.9991    0.9813    0.9536    0.9727   13.4283    0.1650
         7    2578    0.9978    0.9442    0.8971    0.9522   17.7954    0.2020
         8    2819    0.9957    0.8555    0.7925    0.9304   21.5076    0.2282
         9    2936    0.9894    0.6446    0.5191    0.8140   35.5116    0.3620
        10    3216    0.9670    0.3030    0.2574    0.8784   28.5551    0.2721
</pre>
This comparison is similar to that of data sets 1 and 2, except that at low resolution another source of non-isomorphism appears to dominate.
<pre>
 iset1, iset2=           2           3
resol_shell nmatch   CC*_1     CC*_2     CC(1,2)  cos(phi) angle(deg) RMSD_coord
         1     773    0.9999    0.9997    0.9919    0.9924    7.0822    0.2777
         2    1338    0.9997    0.9997    0.9836    0.9842   10.2144    0.2369
         3    1629    0.9996    0.9991    0.9322    0.9334   21.0351    0.3828
         4    1913    0.9991    0.9984    0.9686    0.9710   13.8272    0.2123
         5    2234    0.9946    0.9923    0.9489    0.9614   15.9703    0.2172
         6    2503    0.9844    0.9812    0.8915    0.9230   22.6311    0.2805
         7    2675    0.9518    0.9425    0.8560    0.9542   17.4034    0.1974
         8    2783    0.8858    0.8568    0.6870    0.9053   25.1411    0.2679
         9    2938    0.7508    0.6453    0.3902    0.8054   36.3539    0.3712
        10    2953    0.5340    0.2847    0.1352    0.8894   27.2020    0.2591
</pre>
This comparison is again similar, except that the coordinates seem to differ a bit more between data sets 2 and 3.
<pre>
using average RMSD values (excluding unreasonable table entries):
dataset #, mean RMSD to all other datasets:          1  0.2341354   
dataset #, mean RMSD to all other datasets:          2  0.2483061   
dataset #, mean RMSD to all other datasets:          3  0.2561350   
central dataset (most isomorphous) is number          1
most distant dataset (least isom.) is number          3
RMSD= lines in XSCALE.INP.rename_me will be specified w.r.t. to central dataset
 
Jacobi it_num,num_rot:          8          10
Eigenvalues: -1.1685427E-09  2.4050672E-02  3.6891516E-02
coordinates in 3D that best reproduce the angles as distances:
-2.6219051E-02 -0.1248424      0.0000000E+00
-0.1207941      8.0754802E-02  0.0000000E+00
  0.1470131      4.4087593E-02  0.0000000E+00
wrote noniso.pdb
</pre>
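
The "mean RMSD to all other datasets" values above can be reproduced from column 8 of the three pairwise tables. The sketch below is one way to do this: it averages each pair's RMSD over the resolution shells and then, per data set, averages over the pairs that contain it. That this is how the program computes the values is an assumption; it agrees with the printed numbers in this example, where no table entries needed to be excluded as unreasonable. Variable names are hypothetical.

```python
from statistics import mean

# Column 8 (RMSD_coord) of the three pairwise tables, shells 1 to 10.
rmsd_by_pair = {
    (1, 2): [0.1761, 0.3422, 0.2113, 0.1920, 0.1709,
             0.2432, 0.2041, 0.1854, 0.2322, 0.3057],
    (1, 3): [0.2544, 0.3312, 0.3347, 0.1251, 0.1449,
             0.1650, 0.2020, 0.2282, 0.3620, 0.2721],
    (2, 3): [0.2777, 0.2369, 0.3828, 0.2123, 0.2172,
             0.2805, 0.1974, 0.2679, 0.3712, 0.2591],
}

# Average over resolution shells for each pair of data sets.
pair_mean = {pair: mean(values) for pair, values in rmsd_by_pair.items()}

# For each data set, average over all pairs it takes part in.
datasets = sorted({i for pair in pair_mean for i in pair})
mean_to_others = {
    i: mean(m for pair, m in pair_mean.items() if i in pair)
    for i in datasets
}

central = min(mean_to_others, key=mean_to_others.get)
most_distant = max(mean_to_others, key=mean_to_others.get)

for i in datasets:
    print(i, round(mean_to_others[i], 4))
print("central:", central, "most distant:", most_distant)
```

The printed means agree with the program output to four decimals, and data sets 1 and 3 come out as the central and most distant data set, respectively.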
noniso.pdb is a pseudo-PDB file in which each data set is represented as an atom position; it can be loaded into Coot for inspection. The three data sets form a roughly equilateral triangle: there is no hint that two of them are close to each other but far from the third, which would have suggested discarding that third data set.
<pre>
noniso.pdb=representation of data set arrangement in 3D (coords*100)
wrote XSCALE.INP.rename_me with additional RMSD= lines
</pre>
(Currently, the XSCALE.INP.rename_me file that xds_nonisomorphism writes cannot be used as is, because XSCALE does not understand the RMSD= lines.)
 
For completeness, this is noniso.pdb:
<pre>
CRYST1  100.000  100.000  100.000  90.00  90.00  90.00 P 1
HETATM    1  O   HOH A   1      -2.622 -12.484   0.000  1.0000.00
HETATM    2  O   HOH A   2     -12.079   8.075   0.000  1.0000.00
HETATM    3  O   HOH A   3      14.701   4.409   0.000  1.0000.00
</pre>
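
Without opening a graphics program, the shape of the arrangement can be checked by computing the pairwise distances between the pseudo-atoms. The following is a small sketch with hypothetical names; it splits each HETATM line on whitespace rather than using fixed PDB columns, which is enough for the file shown here but is not a general PDB parser.

```python
import math
from itertools import combinations

NONISO_PDB = """\
CRYST1  100.000  100.000  100.000  90.00  90.00  90.00 P 1
HETATM    1  O   HOH A   1      -2.622 -12.484   0.000  1.0000.00
HETATM    2  O   HOH A   2     -12.079   8.075   0.000  1.0000.00
HETATM    3  O   HOH A   3      14.701   4.409   0.000  1.0000.00
"""


def dataset_positions(pdb_text):
    """Map the atom serial number (= data set number) to its (x, y, z)."""
    positions = {}
    for line in pdb_text.splitlines():
        if line.startswith("HETATM"):
            fields = line.split()
            # Whitespace-separated fields 6 to 8 hold x, y and z.
            positions[int(fields[1])] = tuple(float(v) for v in fields[6:9])
    return positions


positions = dataset_positions(NONISO_PDB)
distances = {
    (i, j): math.dist(positions[i], positions[j])
    for i, j in combinations(sorted(positions), 2)
}
for pair, dist in distances.items():
    print(pair, round(dist, 1))
```

The three distances come out at about 22.6, 24.2 and 27.0 (in the file's coords*100 units), i.e. the sides of the triangle differ by less than 20%.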

Revision as of 18:29, 19 December 2019
