Department of Biomedicine, University of Bergen, Bergen, Norway

Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany

Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA

Abstract

Background

Popular bioinformatics approaches for studying protein functional dynamics include comparisons of crystallographic structures, molecular dynamics simulations and normal mode analysis. However, determining how observed displacements and predicted motions from these traditionally separate analyses relate to each other, as well as to the evolution of sequence, structure and function within large protein families, remains a considerable challenge. This is in part due to the general lack of tools that integrate information on molecular structure, dynamics and evolution.

Results

Here, we describe the integration of new methodologies for evolutionary sequence, structure and simulation analysis into the Bio3D package. This major update includes unique high-throughput normal mode analysis for examining and contrasting the dynamics of related proteins with non-identical sequences and structures, as well as new methods for quantifying dynamical couplings and their residue-wise dissection from correlation network analysis. These new methodologies are integrated with major biomolecular databases as well as established methods for evolutionary sequence and comparative structural analysis. New functionality for directly comparing results derived from normal modes, molecular dynamics and principal component analysis of heterogeneous experimental structure distributions is also included. We demonstrate these integrated capabilities with example applications to dihydrofolate reductase and heterotrimeric G-protein families along with a discussion of the mechanistic insight provided in each case.

Conclusions

The integration of structural dynamics and evolutionary analysis in Bio3D enables researchers to go beyond a prediction of single protein dynamics to investigate dynamical features across large protein families. The Bio3D package is distributed with full source code and extensive documentation as a platform independent R package under a GPL2 license from

Background

The internal motions and intrinsic dynamics of proteins have increasingly been recognized as essential for protein function and activity

Current software solutions lack much of the flexibility needed for comparative studies of large heterogeneous structural datasets. For example, popular web servers for NMA typically operate on single structures and do not permit high-throughput calculations

Package overview and architecture

Bio3D version 2.0 now provides extensive functionality for high-throughput NMA of an ensemble of protein structures facilitating the study of evolutionary and comparative protein dynamics across protein families. The NMA module couples to major protein structure and sequence databases (PDB, PFAM, UniProt and NR) and associated search tools (including BLAST

A typical user workflow for the comparison of cross-species protein flexibility is depicted in Figure **get.seq**() function. This sequence is then used in a BLAST or HMMER search of the full PDB database to identify related protein structures (functions **blast**() or **hmmer**()). Identified structures can then optionally be downloaded (with the function **get.pdb**()) and aligned using the function **pdbaln**(). The output will be a multiple sequence alignment together with aligned coordinate data and associated attributes. Ensemble NMA on all aligned structures can then be carried out with function **nma**(). The function provides an **pca**(). This provides principal components of the same dimensions as the normal modes facilitating direct comparison of mode fluctuations, or alternatively mode vectors using functions such as **rmsip**() and **overlap**(). Indeed extensive new functions for the analysis of normal modes and principal components are now provided. These include cross-correlation, fluctuations, overlap, vector field, dynamic sub-domain clustering, correlation network analysis and movie generation along with integrated functions for plotting and visualization. Extensive multicore support is also included for a number of commonly used functions. This enables a significant speed-up for time-consuming tasks, such as ensemble NMA for large protein families, modes comparison, domain assignment, correlation analysis for multiple structures, and analysis for long-timescale MD simulations. Comprehensive tutorials integrating NMA with PCA, simulation data from MD, and additional sequence and structure analysis methods, including correlation network analysis, are available in Additional files

**Example workflow for ensemble**

Implementation

Elastic network models

A unique collection of multiple ENM force fields is now provided within Bio3D. These include the popular anisotropic network model (ANM)

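All implemented ENMs considered here employ a harmonic potential, where the potential energy between residues *i* and *j* is given by

$$U_{ij}(\mathbf{r}) = k\left(\lVert \mathbf{r}^{0}_{ij} \rVert\right)\left(\lVert \mathbf{r}_{ij} \rVert - \lVert \mathbf{r}^{0}_{ij} \rVert\right)^{2}$$

where **r** is the current protein conformation, **r**^{0} represents the equilibrium conformation, and ‖**r**_{ij}‖ the distance between residues *i* and *j*.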

with units of ^{− 1} Å^{− 2}. The selection of different force fields is described in detail both online and in Additional file
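To make the harmonic form concrete, the following Python sketch sums a pairwise term k(‖**r**^{0}_{ij}‖)(‖**r**_{ij}‖ − ‖**r**^{0}_{ij}‖)^{2} over all residue pairs. It is an illustration only and not Bio3D code: the function names are hypothetical, the pairwise form is assumed, and the force constant used (a constant spring within a 15 Å cutoff, in the spirit of ANM) is a placeholder rather than any of the force fields distributed with the package.

```python
import math


def pair_force_constant(r0, cutoff=15.0, k=1.0):
    """Hypothetical ANM-style constant: k within the cutoff, 0 beyond it."""
    return k if r0 <= cutoff else 0.0


def enm_energy(coords, ref_coords, force_constant=pair_force_constant):
    """Sum k(|r0_ij|) * (|r_ij| - |r0_ij|)^2 over all residue pairs i < j."""
    energy = 0.0
    n = len(ref_coords)
    for i in range(n):
        for j in range(i + 1, n):
            r0 = math.dist(ref_coords[i], ref_coords[j])  # equilibrium distance
            r = math.dist(coords[i], coords[j])           # current distance
            energy += force_constant(r0) * (r - r0) ** 2
    return energy


# Three C-alpha-like sites on a line; the last one is displaced by 1 Å.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
stretched = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (8.6, 0.0, 0.0)]

print(enm_energy(reference, reference))  # zero at the equilibrium conformation
print(enm_energy(stretched, reference))  # two springs stretched by 1 Å each
```

Swapping `pair_force_constant` for a distance-dependent function is all that is needed to mimic a different force field in this sketch.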

Ensemble NMA

Integrated multiple sequence and structural alignment methods are utilized to facilitate the analysis of structures of unequal composition and length. From these alignments, equivalent atom positions across structure ensembles are identified and normal mode vectors determined by calculating the effective force-constant Hessian matrix

$$\hat{\mathbf{K}} = \mathbf{K}_{AA} - \mathbf{K}_{AQ}\,\mathbf{K}_{QQ}^{-1}\,\mathbf{K}_{QA}$$

where **K**_{AA} represents the sub-matrix of **K** corresponding to the aligned C-alpha atoms, **K**_{QQ} that for the gapped regions, and **K**_{AQ} and **K**_{QA} are the sub-matrices relating the aligned and gapped sites. The normal modes are then obtained by solving the eigenvalue problem

$$\mathbf{V}^{T}\hat{\mathbf{K}}\mathbf{V} = \lambda$$

where **V** is the matrix of eigenvectors and λ the associated eigenvalues.
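The partitioning into aligned and gapped sites can be illustrated numerically. The Python sketch below assumes the effective Hessian takes the standard reduced form **K**_{AA} − **K**_{AQ}**K**_{QQ}^{−1}**K**_{QA}; the helper names are hypothetical, and a toy one-dimensional chain of three sites stands in for a real 3N × 3N Hessian. It is not the Bio3D implementation.

```python
def mat_mul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_inv(m):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(m)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("K_QQ is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [rv - f * cv for rv, cv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def effective_hessian(K, aligned, gapped):
    """Return K_AA - K_AQ K_QQ^-1 K_QA for the given index sets."""
    def sub(rows, cols):
        return [[K[i][j] for j in cols] for i in rows]

    K_AA, K_AQ = sub(aligned, aligned), sub(aligned, gapped)
    K_QA, K_QQ = sub(gapped, aligned), sub(gapped, gapped)
    corr = mat_mul(mat_mul(K_AQ, mat_inv(K_QQ)), K_QA)
    n = len(aligned)
    return [[K_AA[i][j] - corr[i][j] for j in range(n)] for i in range(n)]


# Toy chain 0 -- 1 -- 2 with spring constant 2 on each bond.
# Site 1 plays the role of a "gapped" position that is integrated out.
K = [[2.0, -2.0, 0.0],
     [-2.0, 4.0, -2.0],
     [0.0, -2.0, 2.0]]
K_eff = effective_hessian(K, aligned=[0, 2], gapped=[1])
print(K_eff)
```

In this toy case the two springs in series collapse to a single effective spring of constant 1 between the aligned sites, which is the behaviour expected from eliminating the gapped coordinates.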

Ensemble PCA

Principal component analysis can be performed on any structure dataset of equal or unequal sequence composition and length to capture and characterize inter-conformer relationships. The application of PCA to both distributions of experimental structures and MD trajectories, along with its ability to provide considerable insight into the nature of conformational differences in a range of protein families, has been previously discussed. PCA is based on the diagonalization of the covariance matrix, **C**, with elements C_{ij} calculated from the aligned and superimposed Cartesian coordinates, **r**,

$$C_{ij} = \left\langle \left(r_{i} - \langle r_{i} \rangle\right)\left(r_{j} - \langle r_{j} \rangle\right) \right\rangle$$

where 〈 ⋅ 〉 denotes the ensemble average.
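As a minimal sketch of the idea, the Python code below builds a covariance matrix from a small set of already aligned and superimposed coordinate vectors and extracts the leading principal component by power iteration. The function names are hypothetical, the 1/M normalization and the use of power iteration (instead of a full eigendecomposition) are simplifying assumptions, and none of this reflects how Bio3D computes PCA internally.

```python
import math


def covariance_matrix(ensemble):
    """Covariance of flattened coordinates; ensemble is a list of M vectors."""
    n_conf = len(ensemble)
    dim = len(ensemble[0])
    mean = [sum(conf[k] for conf in ensemble) / n_conf for k in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for conf in ensemble:
        dev = [x - mu for x, mu in zip(conf, mean)]
        for a in range(dim):
            for b in range(dim):
                cov[a][b] += dev[a] * dev[b] / n_conf  # ensemble average
    return cov


def leading_component(cov, iterations=200):
    """Largest eigenvalue and eigenvector of cov by power iteration."""
    dim = len(cov)
    vec = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # no variance in the data
            break
        vec = [x / norm for x in nxt]
    cov_vec = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
    eigenvalue = sum(v * cv for v, cv in zip(vec, cov_vec))
    return eigenvalue, vec


# One atom observed in three conformers, moving along the x = y diagonal.
ensemble = [[0.0, 0.0, 0.0], [1.0, 1.0, 0.0], [2.0, 2.0, 0.0]]
cov = covariance_matrix(ensemble)
variance, pc1 = leading_component(cov)
print(variance, pc1)
```

The first component points along the diagonal of motion and carries all of the variance, which is what a PCA of this toy ensemble should report.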

Similarity measures

Multiple similarity measures have been implemented to provide an enhanced framework for the assessment and comparison of ensemble NMA and PCA. These measures also facilitate clustering of proteins based on their predicted modes of motion:

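**Root mean square inner product** (RMSIP) measures the cumulative overlap between all pairs of the *l* largest eigenvectors, and is defined as

$$\mathrm{RMSIP} = \left(\frac{1}{l}\sum_{i=1}^{l}\sum_{j=1}^{l}\left(\mathbf{v}_{i}^{A}\cdot\mathbf{v}_{j}^{B}\right)^{2}\right)^{1/2}$$

where **v**_{i}^{A} and **v**_{j}^{B} represent the *i*th and *j*th eigenvectors obtained from proteins A and B, respectively, and *l* is the number of modes considered.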
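A short Python sketch of the RMSIP calculation follows. It assumes the usual definition (the square root of the summed squared inner products between the first *l* vectors of each set, divided by *l*) and that the input vectors are orthonormal within each set; the function name and arguments are illustrative and do not mirror the signature of the Bio3D **rmsip**() function.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rmsip(modes_a, modes_b, n_modes=None):
    """Root mean square inner product of the first n_modes vectors of each set."""
    if n_modes is None:
        n_modes = min(len(modes_a), len(modes_b))
    total = sum(dot(va, vb) ** 2
                for va in modes_a[:n_modes]
                for vb in modes_b[:n_modes])
    return math.sqrt(total / n_modes)


modes_a = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
modes_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]

print(rmsip(modes_a, modes_a))  # identical subspaces
print(rmsip(modes_a, modes_b))  # one shared direction out of two
```

A value of 1 indicates that the two mode subspaces are identical, while 0 indicates that they are orthogonal.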

**Covariance overlap** provides a measure of the correspondence between the eigenvectors (**v**_{i}) similar to the RMSIP measure, but also includes weighting by their associated eigenvalues (λ_{i}).

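**Bhattacharyya coefficient** provides a means to compare two covariance matrices derived from NMA or an ensemble of conformers (e.g. simulation or X-ray conformers). For ENM normal modes the covariance matrix (**C**) can be calculated as the pseudo-inverse from the mode eigenvectors:

$$\mathbf{C} = \sum_{i=1}^{3N-6}\frac{1}{\lambda_{i}}\,\mathbf{v}_{i}\mathbf{v}_{i}^{T}$$

where **v**_{i} represents the *i*th eigenvector, λ_{i} the corresponding eigenvalue, and *N* the number of C-alpha atoms (giving 3*N* − 6 non-trivial modes). The Bhattacharyya coefficient can then be written as

$$BC = \exp\left[-\frac{1}{2q}\ln\left(\frac{|\Lambda|}{\left(|\mathbf{Q}^{T}\mathbf{C}_{A}\mathbf{Q}|\,|\mathbf{Q}^{T}\mathbf{C}_{B}\mathbf{Q}|\right)^{1/2}}\right)\right]$$

where **Q** is the matrix of the principal components of (**C**_{A} + **C**_{B})/2, Λ is the diagonal matrix containing the corresponding eigenvalues, and *q* the number of components retained from **Q**. The Bhattacharyya coefficient varies between 0 and 1, and equals 1 if the covariance matrices (**C**_{A} and **C**_{B}) are identical.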

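**Squared Inner Product** (SIP) measures the linear correlation between two atomic fluctuation profiles

$$\mathrm{SIP} = \frac{\left(\mathbf{w}_{A}^{T}\mathbf{w}_{B}\right)^{2}}{\left(\mathbf{w}_{A}^{T}\mathbf{w}_{A}\right)\left(\mathbf{w}_{B}^{T}\mathbf{w}_{B}\right)}$$

where **w**_{A} and **w**_{B} are the vectors containing the per-atom fluctuation values for proteins A and B, respectively.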
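The SIP can be computed in a few lines. The Python sketch below assumes the usual normalized form, the squared inner product of the two fluctuation vectors divided by the product of their squared norms; the function name is illustrative and the fluctuation values are made-up numbers, not results from any structure.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sip(w_a, w_b):
    """Squared inner product between two fluctuation profiles of equal length."""
    if len(w_a) != len(w_b):
        raise ValueError("profiles must cover the same aligned positions")
    return dot(w_a, w_b) ** 2 / (dot(w_a, w_a) * dot(w_b, w_b))


profile_a = [1.0, 2.0, 3.0]   # hypothetical per-residue fluctuations
profile_b = [2.0, 4.0, 6.0]   # same shape, different amplitude
profile_c = [3.0, 2.0, 1.0]   # reversed flexibility pattern

print(sip(profile_a, profile_b))
print(sip(profile_a, profile_c))
```

Because of the normalization, profiles that differ only by a scale factor score 1, so the measure compares the shape of the flexibility pattern rather than its amplitude.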

PCA of cross-correlation and covariance matrices

New functionality facilitates PCA of residue-residue cross-correlations and covariance matrices derived from ensemble NMA. This analysis can be formulated as

where **Υ** is a matrix containing the elements of the **B** the eigenvectors and Γ the associated eigenvalues. Projection into the sub-space defined by the largest eigenvectors enables clustering of the structures based on the largest variance within the cross-correlation or covariance matrices.

All similarity measures described above can be utilized for clustering the ensemble of structures based on their normal modes. Various clustering algorithms are available, such as k-means clustering, as well as hierarchical clustering using Ward's minimum variance method, or single, complete and average linkage. The application and comparison of the described similarity measures is presented in Additional file

Force constants variance weighting

We propose to incorporate knowledge on the accessible conformational ensemble (e.g. all available X-ray structures) to lift the dependency of the force constants on the single structure they were derived from. We weight the force constants with the variance of the pairwise residue distances derived from the ensemble of structures. The weights (W_{ij}) and the modified force constants are defined in terms of S_{ij} (the elements of matrix **S**), which represents the variance of the distance between residues *i* and *j* in the ensemble.
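The quantity underlying the weighting, the variance of each pairwise residue distance across an ensemble, can be sketched as follows in Python. The function name is hypothetical, the use of the population variance is an assumption (a sample variance would be an equally reasonable choice), and the mapping from **S** to the weights W_{ij} is deliberately not reproduced here.

```python
import math
from statistics import pvariance


def distance_variance(ensemble):
    """Matrix S of pairwise-distance variances over an ensemble.

    ensemble: list of conformers, each a list of (x, y, z) tuples for the
    same set of equivalent (aligned) residues.
    """
    n = len(ensemble[0])
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dists = [math.dist(conf[i], conf[j]) for conf in ensemble]
            S[i][j] = S[j][i] = pvariance(dists)  # population variance (assumed)
    return S


# Two toy conformers: residues 0 and 1 stay rigid, residue 2 moves by 2 Å.
conformer_1 = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0)]
conformer_2 = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
S = distance_variance([conformer_1, conformer_2])
print(S)
```

Pairs whose distance is conserved across the ensemble receive a variance of zero, whereas pairs involving the mobile residue receive a large value and would therefore be the ones whose springs are softened by the weighting.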

Identification of dynamic domains

Analysis and identification of dynamic domains,

Correlation network analysis

Correlation network analysis can be employed to identify protein segments with correlated motions. In this approach, a weighted graph is constructed where each residue represents a node and the weight of the connection between nodes *i* and *j* is derived from their cross-correlation value, c_{ij}, expressed, for example, in the Pearson-like form. A cross-correlation matrix (**C**) is calculated for each structure in the ensemble NMA. Then, edges are added for residue pairs with |c_{ij}| ≥ c_{0} across all experimental structures, where c_{0} is a constant. In addition, edges are added for residues where |c_{ij}| ≥ c_{0} for at least one of the structures and the Cα-Cα distance, d_{ij}, satisfies d_{ij} ≤ 10 Å for at least 75% of all conformations. Edge weights are then calculated as −log(〈|c_{ij}|〉), where 〈 ⋅ 〉 denotes the ensemble average. Girvan and Newman betweenness clustering
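The edge-selection rules described above can be sketched in Python as follows. The sketch assumes that correlations are compared by absolute value and that edge weights are the negative logarithm of the ensemble-averaged absolute correlation; the function name and the default threshold of 0.4 for the constant are placeholders chosen for illustration, and the subsequent community clustering step is not included.

```python
import math


def consensus_edges(corr_mats, dist_mats, c0=0.4, d_cut=10.0, frac=0.75):
    """Consensus network edges from per-structure correlation/distance matrices.

    Returns a dict mapping residue pairs (i, j), i < j, to edge weights.
    c0 is a hypothetical default; d_cut and frac follow the 10 Å / 75% rule.
    """
    if c0 <= 0.0:
        raise ValueError("c0 must be positive")
    n_struct = len(corr_mats)
    n_res = len(corr_mats[0])
    edges = {}
    for i in range(n_res):
        for j in range(i + 1, n_res):
            abs_c = [abs(c[i][j]) for c in corr_mats]
            strong_in_all = all(v >= c0 for v in abs_c)
            strong_in_any = any(v >= c0 for v in abs_c)
            n_close = sum(1 for d in dist_mats if d[i][j] <= d_cut)
            mostly_close = n_close / n_struct >= frac
            if strong_in_all or (strong_in_any and mostly_close):
                edges[(i, j)] = -math.log(sum(abs_c) / n_struct)
    return edges


# Two toy structures with three residues each.
corr_1 = [[1.0, 0.8, 0.5], [0.8, 1.0, 0.1], [0.5, 0.1, 1.0]]
corr_2 = [[1.0, 0.6, 0.2], [0.6, 1.0, 0.1], [0.2, 0.1, 1.0]]
near = [[0.0, 5.0, 8.0], [5.0, 0.0, 6.0], [8.0, 6.0, 0.0]]
far = [[0.0, 5.0, 12.0], [5.0, 0.0, 6.0], [12.0, 6.0, 0.0]]

edges_near = consensus_edges([corr_1, corr_2], [near, near])
edges_far = consensus_edges([corr_1, corr_2], [far, far])
print(sorted(edges_near))
print(sorted(edges_far))
```

Residues 0 and 2 are strongly correlated in only one structure, so their edge survives only when they are also in contact in most conformations; strongly correlated pairs receive small weights, i.e. short path lengths in the graph.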

Results and discussion

In this section we demonstrate the application of new Bio3D functionality for analyzing functional motions in two distinct protein systems. Further examples, along with executable code, are provided in Additional files

Cross-species analysis of DHFR

Dihydrofolate reductase (DHFR) plays a critical role in promoting cell growth and proliferation in all organisms by catalyzing the reaction of dihydrofolate to tetrahydrofolate, an essential precursor for thymidylate synthesis

Following the workflow described in Figure **geostas**() reveals the presence of two dynamic sub-domains corresponding to the adenosine-binding sub-domain and the loop sub-domain, respectively (Figure

**Results of ensemble PCA and NMA on E. coli**

Beginning with the knowledge of only one DHFR PDB code, the complete PCA and NMA of the

To detect more distantly related DHFR homologues we built a hidden Markov model (HMM) from the PFAM multiple sequence alignment using the Bio3D interface to PFAM and HMMER (see the

**Cross-species normal modes analysis of DHFR. (A)** Sequence conservation of the collected DHFR species. **(B)** Aligned fluctuation profiles for selected species of DHFR. Shaded blue regions depict areas discussed in the text showing different fluctuation patterns between specific species. The region shaded in light red depicts the Met20 loop in **(C)** A visual comparison of mode fluctuations between DHFR from

Heterotrimeric G-proteins

Applying ensemble NMA to heterotrimeric G-protein α-subunits (Gα) reveals nucleotide-dependent structural dynamic features of functional relevance. Gα undergoes cycles of nucleotide-dependent conformational rearrangements to couple cell surface receptors to downstream effectors and signaling cascades that control diverse cellular processes. These processes range from movement and division to differentiation and neuronal activity. Interaction with activated receptor promotes the exchange of GDP for GTP on Gα and its separation from its βγ subunit partners (Gβγ). Both isolated Gα and Gβγ can then interact with and activate downstream effectors. GTP hydrolysis deactivates Gα, which re-associates with Gβγ, effectively completing the cycle.

In the current application, we collected 53 PDB structures of Gα (from application of the **blast.pdb**() function). These structures were aligned with the function **pdbaln**() and their modes of motion calculated with **nma**() (Figure

**Investigating functional dynamics in heterotrimeric G-proteins. (A)** Prediction of large-scale opening motions. **(B)** Prediction of dynamically coupled sub-domains (colored regions) from correlation network analysis of NMA results. Inter-subdomain couplings are highlighted with thick black lines. **(C)** Characterization of distinct GTP-active and GDP-inactive states from a clustering of NMA RMSIP results. **(D)** Fluctuation analysis reveals structural regions with significantly distinct flexibilities between the active (red) and inactive (green) states; sites with a p-value < 0.005 are highlighted with a blue shaded background. Full details for the reproduction of this analysis, along with PCA that distinguishes GDP and GTP states, can be found in the Additional file

It has been suggested that the activation mechanism of Gα involves a large domain opening that facilitates GDP/GTP exchange **cna**() function, reveals dynamically coupled subdomains that may facilitate the allosteric coupling of receptor and nucleotide binding sites (Figure

Related solutions and future developments

As noted in the introduction, a number of previously implemented software solutions (including multiple web-servers

| | **MMTK 2.7** | **ProDy 1.5** | **MAVEN 1.2** | **WebNM@ 2.0** | **Bio3D 2.0** |
|---|---|---|---|---|---|
| **Dependencies** | Python, NumPy, ScientificPython | Python, NumPy, MatplotLib | Matlab Component Runtime (MCR) | Web browser | R, Muscle |
| **Reading and analysis of molecular sequences** | No | Yes | No | No | Yes |
| **Reading and analysis of multiple molecular structures** | No | Yes | Yes | Yes | Yes |
| **Reading and analysis of binary MD simulation trajectories** | Yes | Yes | No | No | Yes |
| **Biomolecular database integration** | No | PDB, PFAM^{a} | No^{b} | No^{b} | PDB, PFAM, UNIPROT, NR^{c} |
| **Energy minimization and MD** | Yes | No | No | No | No |
| **Standard NMA** | Yes | Yes | Yes | Yes | Yes |
| **Ensemble NMA across heterogeneous structures** | No | No | No | Yes | Yes |
| **Forcefields for NMA** | C-alpha, ANM, Amber all-atom | GNM/ANM, Custom | GNM/ANM, pANM, STM, nnANM, mcgANM, Custom^{d} | C-alpha | C-alpha, ANM, pfANM, sdENM, REACH, Custom |
| **Ensemble PCA across heterogeneous structures** | No | Yes | Identical structures only | No | Yes |
| **Correlation network analysis from NMA and MD** | No | No | No | No | Yes |
| **Dynamic domain analysis** | No | No | No | No | Yes |
| **Sequence alignment** | No | No | No | No | Yes |
| **Structure alignment** | Yes | Yes | No | No | Yes |
| **Advanced statistical analysis** | No | No | No | No | Yes |
| **Permits both interactive and batch analysis** | Yes | Yes | No | Yes | Yes |
| **Open source code available** | Yes | Yes | Yes^{e} | No | Yes |
| **Multicore compatibility** | Yes | No | No | No | Yes |
| **GUI** | No | No^{f} | Yes | Webserver | No^{g} |

^{a}Read and search functionality.

^{b}Read-only functionality from the PDB.

^{c}Read, search, and annotation functionality, including enhanced search capabilities across multiple databases.

^{d}STM: Spring Tensor Model; pANM: power ANM; nnANM: nearest neighbor ANM; mcgANM: mixed coarse graining ANM.

^{e}Dependences are not open source.

^{f}VMD plugin NMWiz available for single molecule NMA.

^{g}Web interface for ensemble PCA and NMA in development.

Current and future development of Bio3D (see:

Conclusion

Bio3D version 2.0 provides a versatile integrated environment for protein structural and evolutionary analysis with unique capabilities including high-throughput ensemble NMA for examining the dynamics of evolutionarily related protein structures; a convenient interface for accessing multiple ENM force fields; and a direct integration with a large number of functions for sequence, structure and simulation analysis. The package is implemented in the R environment and thus couples to extensive graphical and statistical capabilities along with a powerful user-friendly interactive programming environment that, together with Bio3D, enables both exploratory structural bioinformatics analysis and automated batch analysis of large datasets.

Availability and requirements

Project name: Bio3D

Project home page:

Operating system(s): Platform independent

Programming language: R

Other requirements: R >= 3.0.0

License: GPL2

Any restrictions to use by non-academics: none

Abbreviations

CNA: Correlation network analysis

DHFR: Dihydrofolate reductase

ENM: Elastic network model

MD: Molecular dynamics

NMA: Normal mode analysis

PCA: Principal component analysis

RMSIP: Root mean square inner product

Competing interests

The authors declare that they have no competing interests.

Author contributions

Conceived and designed the study: LS, XY and BJG. Performed the study: LS and XY. Implementation: LS and XY (NMA functionality); XY, GS and BJG (CNA functionality). Analyzed and interpreted the data: LS, XY and BJG. Wrote the paper and the attached vignettes: LS, XY and BJG. All authors read and approved the final manuscript.

Additional files

**Comprehensive tutorials for traditional single structure and new ensemble NMA on Heterotrimeric G-proteins and other systems.**

**E. coli DHFR**

**Species wide NMA of the DHFR superfamily.**

**Complete example of the integration of ensemble NMA with correlation network analysis.**

Acknowledgements

We thank Edvin Fuglebakk and Julia Romanowska (University of Bergen, Norway) as well as the Bio3D user community for valuable discussions and software testing. We acknowledge the University of Bergen (LS) and University of Michigan (XY, GS and BJG) for funding.