A quick overview of SOFTWARE PACKAGES (and databases) used in evaluation

[this document is available by clicking on HELP link in any of the Results pages]

The 2015/2016 EM Modeling Challenge evaluation system includes software for preprocessing and verification of models and their assessment at 4 levels of granularity: multimers, monomers, domains (if any) and inter-chain interfaces.

Acceptance, preliminary preprocessing and analysis of models:

· Maxit and other tools (Cathy Lawson - acceptance, preliminary preprocessing)

· In-house scripts (Andriy Kryshtafovych – analysis and preprocessing)

Multimers:

· TEMPy ^1-3 , (global and local model-map fit)

· PHENIX ⁴, including a new module for the EM model challenge implemented by Pavel Afonine (used for analysis of global and local model-map fit)

· EMRinger ⁵ (global and local model-map fit based on side-chain fit)

· Model/Map match (using chain comparison module of PHENIX)

· QSscore (Torsten Schwede /Martino Bertoni – not published yet; used for assessing multimeric model fit to reference structure(s) )

· EMRinger

· LDDT scores on chain correspondences found with QSscore (above)

Monomers and Domains:

· Molprobity ⁶ (verifying stereochemistry of models)

· dFIRE ⁷ *not sure we need it – results are often inconsistent; new version - dFIRE2 is hard to get; Molprobity may be enough

· PROSA ⁸ *not sure we need it – results are length dependent, Molprobity may be enough

· ProQ ⁹ (one of the best CASP single-model accuracy assessment methods)

· QMEAN ¹⁰ (one of the best CASP single-model accuracy assessment methods)

· LGA ¹¹ (used for generating GDT-family based scores - main rigid-body superposition-based scores used in CASP)

· TM-score ¹² (similar in principle to GDT_TS, i.e., rigid body superposition based - widely used in protein modeling community)

· LDDT ¹³ (superposition-free measure; compares difference in distance patterns)

· CAD score ¹⁴ (superposition-free measure; compares difference in contact areas)

· ECOD ¹⁵ (database for extracting domain definitions)

· Davis_QAconsensus ¹⁶ (in-house tool for assessment of models based on their clustering)

Interfaces:

· IFaceCheck (Andriy Kryshtafovych, just developed; tool name is tentative)

TARGETS

Targets are numbered consecutively, from T0001 to T0008. Target name is usually provided together with the target name, e.g. GroEL: T0003.

If different reference structures are used in a specific type of evaluation, target ID will include PDB ID of the reference structure, e.g. GroEL: T0003_1xck. Several reference structures can be used for the same target. If chains with different configurations are present in the target – the reference structure ID would include chain name as well, e.g. GroEL: T0003_1svt_A or GroEL: T0003_1svt_H.

If different EM maps are used in a specific type of evaluation, map EMDB ID will be shown in a separate column of the evaluation results. Results for different maps of the same target are organized as different subtargets, e.g. target T0002 has two subtargets: one for the emd_5623 map and another – for emd_6287.

MODELS

Information on submissions received at Rutgers is available for organizers from the Google Docs spreadsheet https://docs.google.com/spreadsheets/d/12C4SUu9L37qvRBu0RCLoC4KgZXbcbYbpVn5q3TxACd8/edit#gid=0

The same table (with the columns containing sensitive information being redacted) is publicly available from the Model Descr. Tab on the Results pages. Sub-tables with the information on the submitted models for each target are accessible by clicking on any model name in the Model column of any target-specific Results Table.

Each submission has its own name as assigned by the PDB extract system (e.g., emcm102_GSec – see column B of the spreadsheet). This name is used only internally and does not appear in the evaluation tables.

If several models are provided in the same submission file, they are preliminary split into separate files. Each model is assigned a unique ID and is evaluated separately.

Model names that are used in the Results tables (e.g. T0003EM164_1) are formed according to the following scheme:

· T0003 [target name]

· EM [electron microscopy]

· 164 [predictor number (see below)]

· _1 [model number from this predictor for this target]

Note 1: The suggested target reference structures and all models were analyzed one by one for their format consistency and compliancy with the evaluation software. Problems were recorded in a separate file available from the Google Docs.

https://docs.google.com/document/d/1lJnq4Yv0ZOYUS6hffp3o2d_41KVNwlave47ktJdZ_MQ/edit

The most typical problems were identified and systemized in yet another Google Docs document:

https://docs.google.com/document/d/1CehhkMVmKBkLxIc3tmoinNjajRHPHqDtEwdzSnMy9Yw/edit

Note 2: If a model is submitted as a monomer and symmetry info is provided within the model, the assessors may consider applying the symmetry operators and evaluating the model not only as a monomer, but also as a multimer.

PREDICTORS

Each group participating in the EM Model Challenge is assigned a unique number. Predictor IDs corresponding to each model are provided in the column C of the spreadsheet. The group_ID – group_name correspondence is concealed from the assessors and available only to the organizers.

Description of the EVALUATION MEASURES and the WEB INFRASTRUCTURE

Results of the evaluations are presented in the EM Model Challenge Results website (http://model-compare.emdatabank.org/, see the screenshot below). Each target has its own Results webpage that can be visited by clicking on the target pictograph/name. Submitted models in plain text format can be accessed by following the “Repositorium of models” link at the top of the page. Models are grouped in folders by target; directories ending in “_” contain representative chains from multi-chain submissions; directories ending in “–Dx” contain models split in structural domains, if any. Directory ‘fixed’ contains models fixed by the submitters or organizers. The final Results tables contain results for the fixed models and not the original ones.

Results pages hierarchy.

Once in any of the Results pages, you can switch between targets through the “Target” drop-down menu at the target-specific pages (thick blue arrow in the screenshot below).

Each model is evaluated at four levels of its structural composition: quaternary structure, tertiary structure, constituting domains and inter-chain interfaces (if applicable). Input data and evaluation tools are different for different assessment approaches. The results are reported separately under four different tabs provided at the highest level of the Results page hierarchy: Multimers (i.e. quaternary structure), Monomers (tertiary structure), Domains, and Interfaces. Additionally a Model Descr Tab is available to provide information on the details of the submitted models (as provided by the predictors). Going down the hierarchy users can browse results of specific analyses. Data are available as sortable result tables, clickable bar graphs or scatter plots (details further in the text).

1. Multimers (top level tab in the evaluation infrastructure)

Multimer prediction models are evaluated in three regimes:

1.1) Model Stats

1.2) vs the experimental EM map (and/or versus the improved map, if submitted together with model) and

1.3) vs the experimental structure(s).

Results in each of the evaluation modes are presented under a separate tab (pointed to by the blue arrow below).

1.1. Assessment of stereo-chemical properties of the multimeric models (2^nd level tab “Model Stats”) is carried out with the PHENIX ⁴ module designed specifically for the EM model challenge. This type of evaluation does not require reference structures. The results are presented as tables (3^rd level tab Scores) and histograms (3^rd level tab Histograms).

1.1.1. Scores.

Left section of the results table under the Scores tab (above) reports deviation of

· Bond distances,

· Angles,

· Chirality,

· Planarity

· and Dihedral angles

from the values observed in ideal models. For each of the listed model features, three values are provided: 1) rmsd calculated on the selected features, 2) maximum value of the deviations (in the selected feature units, e.g. Angstroms for Bonds and Degrees for Angles), and 3) and number of the selected objects in the model.

Right section of the table reports Molprobity scores for the whole multimeric model:

· Clash-Score: the number of all-atom steric overlaps > 0.4Å per 1000 atoms.

· Rot-out: Rotamer Outliers score - percentage of sidechains conformations classified as rotamer outliers, from those sidechains that can be evaluated.

· Ram-out: Ramachandran Outliers score - percentage of backbone Ramachandran conformations classified as outliers.

· Ram-fv: Ramachandran Favored score - percentage of backbone Ramachandran conformations in favored region.

Note that scores for separate chains are provided under the Monomers evaluation tab.

1.1.2. Histograms.

Histograms pages provide binned distribution of some of the model geometrical features for the same ranges of the scores for all models (so they are directly comparable). User can pick four different histograms per model showing:

· deviations from ideal bonds,

· deviations from ideal angles,

· deviation of non-bonded distances,

· atom displacement parameters (ADPs).

X-axis shows number of examples in bins specified in y-axis.

1.2. Evaluation of multimeric models vs the maps (2^nd level tab “vs EM maps”) includes calculation of global and per-residue fitness measures generated with TEMPy ^1-3

, PHENIX ⁴ and EMRinger⁵. All three packages have been locally installed and tested.

1.2.1. Global (full-model) fitness (3^rd level tab Global Accuracy)

1.2.1.1. Global scores reported in the table (4^th level tab Scores) are:

From TEMPy:

· CCC: cross-correlation function to score goodness of fit between the original map and the map convoluted from the model coordinates to the author-specified resolution of the experimental map (or an updated user-provided map)

· LAP: Laplacian-filtered CCC

· ENV: Envelope score

· MI: Mutual information score

Please read the paper [²] for a more detailed explanation of TEMPy scores.

From PHENIX:

· overall FSC: overall model-map Fourier Shell Correlation

· Box CC: overall map-model cross-correlation score (per chain scores are available by clicking at the score)

· Resolution at the FSC=0.5

From EM Ringer:

· EMringer score (clicking at the score will bring up the plot of the EMRinger score for different Electron Potential Thresholds).

There is a strong correlation between resolution and the EMRinger score. This is to be expected, given that EMRinger score reports on side chain density, which is only resolvable above about 4.5 Å. In general, for maps above around 3.5 Å resolution, the minimum score that should be expected is around 1, with a benchmark for a very good score lying around 2. Most structures which have been carefully refined, either in real or reciprocal space, score above 1.5, with some structures getting scores above 3.

Model/Map Match (PHENIX module chain_comparison):

· N(close) - number of Cα within 3Å

· N(far) - number of Cα further than 3Å

· CA score - the number of Cα within 3Å of the target divided by the rmsd

· Seq. match % - percentage of Cα atoms that have the correct residue name

1.2.1.2. Global fitness plots (4^th level tab Plots) are:

· FSC curves built from the PHENIX results for all models evaluated on the specific target (see below). This page is also reachable by clicking on the FSC score in the table of results. Mouse over the graph shows data values for each curve for the selected resolution value (x-axis). Mouse over the model name (under the graph) highlights the line for the selected model (can be very useful if many lines are running very close to each other). Clicking on the model name turns the line invisible (and greys the group name); clicking on the grey model name makes the line visible again.

1.2.2. Local (per-residue) fitness (3^rd level tab Local Accuracy)

Per-residue fitness of models to electron maps is shown as a Summary table and Interactive line plots.

1.2.2.1. Summary tab.

Summary table shows TEMPy’s CCC score and PHENIX’s box CC score for each chain in the model.

1.2.2.2. TEMPy tab.

· TEMPy tab provides access to scatter plots based on the SMOC calculation (Segment Based Manders’ Overlap Coefficients - please read paper [³] for a more detailed explanation). The per-residue accuracy of models according to this score can be explored in more details by selecting a specific region in the model on the lower line-only graph. Click on the residue you want to start a detailed exploration, drug mouse to the end of the desired interval and then release (alternatively use mouse’s scroll wheel for this operation) – the top plot will change accordingly showing details for the selected region. By default, the plots are shown for the chain with the best overall CCC score (highlighted at the bottom of the plot). If you want to see plots for some other chains on the same graph – please click on the additional chains that are greyed out at the bottom of the graph by default. (the exemplary plot below shows graphs for chains B and C selected). Moving the mouse over the names of the selected chains will highlight respective lines in the plots above. Scores for all chains from the model are shown in the table to the right of the plot.

1.2.2.3. PHENIX tab.

· Per-residue graph for the chain with the highest box CC score. The per-residue accuracy of models according to this score can be explored in more details by selecting a specific region in the model on the lower line-only graph. Click on the residue you want to start a detailed exploration, drug mouse to the end of the desired interval and then release (alternatively use mouse’s scroll wheel for this operation) – the top plot will change accordingly showing details for the selected region. By default, the plots are shown for the chain with the best overall box CC score (highlighted at the bottom of the plot). If you want to see plots for some other chains on the same graph – please click on the additional chains that are greyed out at the bottom of the graph by default. Moving the mouse over the names of the selected chains will highlight respective lines in the plots above. Scores for all chains from the model are shown in the table to the right of the plot.

1.2.2.4. EMRinger tab.

· Per-residue graph showing fraction of residues passing the rotameric threshold. Selecting option ‘–all--‘ in the Model dropdown menu will visualize the first (alphabetically) chain for all the models submitted on the selected target. The graphs can be explored in more details by selecting a specific region in the model on the lower graph (using the technique described above). If a specific model is selected (instead of –all-), then graphs for all chains in this model can be visualized. By default, the plot is shown for the 1^st alphabetically chain. If you want to see plots for some other chains on the same graph – please click on the additional chains that are greyed out at the bottom of the graph by default. Moving the mouse over the names of the selected chains will highlight respective lines in the plots above.

1.3. Evaluation of multimers vs the reference structure (2^nd level tab “vs Structure”) is carried out using a newly developed QSscore (Quaternary Structure score) from Torsten Schwede’s group. The software is locally installed and tested. The package first finds the best mapping between the reference target and model chains using the structure symmetry, and then reports 4 scores:

· QS_best: the fraction of interchain contacts (Cβ-Cβ<12A) that are shared between two structures for best fitting interface

· QS_global: the fraction of interchain contacts that are shared between two structures for all interfaces

· RMSD calculated on the whole aligned structure (CAs of all common chains)

· LDDT (Local Distance Difference Test): non-symmetric measure that does not penalize for overprediction, e.g. a tetrameric model (containing a perfect dimeric model) vs the dimeric target is giving a LDDT score of 1.0 (check paper [¹³] for more info on the LDDT)

2. Monomers. Submitted models are first split into separate chain-based models, which are then checked for similarity, and all different chain-based models are evaluated separately.

Monomer accuracy assessment is carried out with three types of measures: single-model validation measures; model consensus measures and comparison to the reference structure (X-ray or/and EM).

Single-model validation measures do not require a 'ground truth' structure and include knowledge-based potentials (Molprobity and dFIRE) and CASP a-priori estimators of global and local model accuracy (ProQ and QMEAN – all locally installed and tested).

Comparison to the reference structure is made with both, superposition-free measures (including LDDT and CAD) and rigid-body superposition-based measures including RMSD, GDT_TS, GDT_HA (sequence-dependent, i.e. assuming one-to-one residue correspondence in model and target), LGA_S and TM-score (sequence-independent, i.e. finding alignment first). All packages are locally installed and tested.

If several models are submitted on the same target, overall and local consensus of models is checked with locally implemented DAVIS-QA method. Global and local similarity of models is calculated in terms of GDT_TS score and distances between the corresponding residues in models. Models with the highest level of conservancy get the highest reliability scores. Calculations in this regime include all-against-all models scores, so in principle any of the models can be used as a reference structure.

2.1. Results of global accuracy assessment (2^nd level Global Accuracy tab) are presented in form of sortable tables (tab Scores) and plots (tab Plots)

2.1.1. Global accuracy scores.

Table for global accuracy scores include:

Single-model validation measures:

five scores from Molprobity (see Molprobity paper from the 1^st page of this manual for details):

· Molprb: aggregated Molprobity Score calculated according to the formula: Molprb = 0.426 *ln(1 + Clash-Score) + 0.33 *ln(1 + max(0, Rot-out - 1)) + 0.25 *ln( 1 + max(0, (100 - Ram-fv) - 2 )) + 0.5

(note: the lower score - the better model; scores below 3 indicate likely stereochemically reasonable model)

· Clash-Score: the number of all-atom steric overlaps > 0.4Å per 1000 atoms.

· Rot-out: Rotamer Outliers score - percentage of sidechains conformations classified as rotamer outliers, from those sidechains that can be evaluated.

· Ram-out: Ramachandran Outliers score - percentage of backbone Ramachandran conformations classified as outliers.

· Ram-fv: Ramachandran Favored score - percentage of backbone Ramachandran conformations in favored region.

· dFIRE: an energy-based score, i.e. the lower (negative) - the better. Note: The new version of dFIRE score has been published this year, but the software is not publicly available. I tried to get it through personal contacts – was promised, but guess the software is not in a form ready for the distribution).

· ProQ: a score from an a-priori estimator of model accuracy; the score is scaled to 0-1 range with higher numbers corresponding to likely better models (see the ProQ2 paper for details on the sequence and structure-based features used in scoring)

· QMEAN: a score from an a-priori estimator of model accuracy; the score is scaled to 0-1 range with higher numbers corresponding to likely better models (see the QMEAN paper or http://swissmodel.expasy.org/qmean/cgi/index.cgi?page=help web page to learn about 6 scoring function terms contributing to QMEAN).

Model consensus method:

· Davis-QA: a score showing average similarity of the selected model to all other submitted models on this target in terms of GDT_TS (range 0-100)

Reference structure superposition independent measures:

· LDDT: the score evaluates similarity of inter-atomic distances in model and targets (see the LDDT paper for details). The score is scaled in the range 0-1.

· CAD: the score evaluates protein models against the target structure by quantifying differences between contact areas (see the CAD paper for details). The score is scaled in the range 0-1.

Reference structure rigid body superposition-based methods:

· RMSD: rmsd on all CAs

· GDT_TS (Total Score): = (GDT_P1 + GDT_P2 + GDT_P4 + GDT_P8)/4, where GDT_Pn denotes percent of residues in model that can be fit to target under the distance cutoff of nÅ (see the LGA paper for details). GDT_TS is scaled in the range 0-100. Note: Clicking on the GDT_TS value for the model will show GDT plot, which is also accessible from the Plots tab.

· GDT_HA (High Accuracy): a fine-grained version of the GDT_TS score with halved cutoffs: GDT_HA = (GDT_P0.5 + GDT_P1 + GDT_P2 + GDT_P4)/4, where GDT_Pn denotes percent of residues in model that can be fit to target under the distance cutoff of nÅ. GDT_HA is scaled in the range 0-100.

· LGA_S (0-100) is similar to GDT_TS, but uses weighted scores from the full set of distance cutoffs in [0-10] range. Unlike GDT_TS, can be used in situations, where alignment between model and target cannot be established based on sequence.

· TMscore: a measure that is conceptually similar to GDT_TS, but has a different score normalization procedure; TM-score is in the range (0,1], where 1 indicates a perfect match between two structures (see the TM paper for details).

2.1.2. Global accuracy plots.

Accuracy of the model versus the reference structure is visually summarized by GDT plots showing percentage of residues in the model that can be superimposed into the target under the specified residue-residue distance cutoff. The closer the curve stays to the y-axis – the better the model. An ideal model would be represented by a curve going straight up and then staying horizontally across the whole range of distance cutoffs. Graphs are interactive so the lines in them can be switched on and off and the underlying scores can be shown using the techniques described in section 1.2.1.2.

2.2. Results of local accuracy assessment (2^nd level Local Accuracy tab) are presented in form of combined bar plots and tables.

There are five tabs under the parent Local Accuracy tab, each corresponding to the selected evaluation measure (GDT_TS, LDDT, ProQ, QMEAN and Davis-QA). Clicking on each of the tabs shows color-coded bar showing per-residue accuracy of the model according to the selected score. For example, clicking on the GDT_TS tab shows CA-CA distances between corresponding residues in model and target after their optimal superposition, while clicking on the LDDT tab gives per-residue LDDT score (type of the score is specified in the plot title). Clicking on the color-coded bar shows structural superposition of model and target colored the same way as the underlying bar.

3. Domains. If monomeric units consist of several structural domains the models are evaluated at the level of domains for different reference structures, if needed. Targets are split into domains by consulting the DomainParser and DDomain2 software and the ECOD database of structural domains.

General principles of web infrastructure for Domain results is similar to the Monomer infrastructure.

4. Interfaces

We locally developed the software for identification of corresponding interfaces, and calculation of statistical measures on interface similarity. First, similar interfaces are clustered in the target and model based on the Jaccard distance between them (identity of residues constituting interface). Clicking on the interface name (e.g., AK, meaning interface between the chains A and K) in the column Corresponding Interfaces shows the full list of similar interfaces in the cluster below the table (see below).

For each pair of similar interfaces, we calculate

· Prec (precision) = TP/(TP+FP)

· Recall = TP/(TP+FN)

· F1-score = 2*Prec*Recall/(Prec+Recall)

· Jd (Jaccard distance) = (FP+FN)/(TP+FP+FN)

between interchain contact pairs in target and models. The scores are reported for the best scoring interface from each of the corresponding interface clusters and also as average from all pairwise scores from all interfaces in the cluster (right side of the table).

· RMSD between the residues belonging to target interfaces and corresponding residues in the model is also calculated together with the

· Coverage of the target by the models residues (percentage of target residues used in the rmsd calculation).

References:

1. Farabella I, Vasishtan D, Joseph AP, Pandurangan AP, Sahota H, Topf M. : a Python library for assessment of three-dimensional electron microscopy density fits. J Appl Crystallogr 2015;48(Pt 4):1314-1323.

2. Vasishtan D, Topf M. Scoring functions for cryoEM density fitting. J Struct Biol 2011;174(2):333-343.

3. Joseph AP, Malhotra S, Burnley T, Wood C, Clare DK, Winn M, Topf M. Refinement of atomic models in high resolution EM reconstructions using Flex-EM and local assessment. Methods 2016;100:42-49.

4. Adams PD, Afonine PV, Bunkoczi G, Chen VB, Echols N, Headd JJ, Hung LW, Jain S, Kapral GJ, Grosse Kunstleve RW, McCoy AJ, Moriarty NW, Oeffner RD, Read RJ, Richardson DC, Richardson JS, Terwilliger TC, Zwart PH. The Phenix software for automated determination of macromolecular structures. Methods 2011;55(1):94-106.

5. Barad BA, Echols N, Wang RY, Cheng Y, DiMaio F, Adams PD, Fraser JS. EMRinger: side chain-directed model and map validation for 3D cryo-electron microscopy. Nat Methods 2015;12(10):943-946.

6. Chen VB, Arendall WB, 3rd, Headd JJ, Keedy DA, Immormino RM, Kapral GJ, Murray LW, Richardson JS, Richardson DC. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr D Biol Crystallogr 2010;66(Pt 1):12-21.

7. Yang Y, Zhou Y. Specific interactions for ab initio folding of protein terminal regions with secondary structures. Proteins 2008;72(2):793-803.

8. Wiederstein M, Sippl MJ. ProSA-web: interactive web service for the recognition of errors in three-dimensional structures of proteins. Nucleic Acids Res 2007;35(Web Server issue):W407-410.

9. Ray A, Lindahl E, Wallner B. Improved model quality assessment using ProQ2. BMC Bioinformatics 2012;13:224.

10. Benkert P, Kunzli M, Schwede T. QMEAN server for protein model quality estimation. Nucleic Acids Res 2009;37(Web Server issue):W510-514.

11. Zemla A. LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res 2003;31(13):3370-3374.

12. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 2005;33(7):2302-2309.

13. Mariani V, Biasini M, Barbato A, Schwede T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 2013;29(21):2722-2728.

14. Olechnovic K, Kulberkyte E, Venclovas C. CAD-score: a new contact area difference-based function for evaluation of protein structural models. Proteins 2013;81(1):149-162.

15. Cheng H, Schaeffer RD, Liao Y, Kinch LN, Pei J, Shi S, Kim BH, Grishin NV. ECOD: an evolutionary classification of protein domains. PLoS Comput Biol 2014;10(12):e1003926.

16. Kryshtafovych A, Monastyrskyy B, Fidelis K. CASP prediction center infrastructure and evaluation measures in CASP10 and CASP ROLL. Proteins 2014;82 Suppl 2:7-13.