https://github.com/markovmodel/pysfd



collective-variables ensemble molecular-dynamics order-parameters visualization


## PySFD - Significant Feature Differences Analyzer for Python

PySFD computes and visualizes any significant differences in features, e.g., distances, contacts, angles, dihedrals,
among different sets of molecular dynamics-simulated ensembles.

For an overview, please read the following documentation.
For further details, please read
- the doc strings throughout the PySFD package
- example jupyter notebook and python files located in `PySFD/docs/PySFD_example`

This software has been thoroughly tested, but comes with no warranty.

### Citation
If you find this tool useful, please cite:
```
Stolzenberg, S. "PySFD: Comprehensive Molecular Insights from Significant Feature Differences detected among many Simulated Ensembles." Bioinformatics (Oxford, England) (2018).
```
https://doi.org/10.1093/bioinformatics/bty818

### Installation
PySFD is a Python package for Python 2 and 3. It should, in principle, run on all common platforms,
but is currently supported only on Linux and macOS.

#### Required Packages

With conda installed, e.g., via

Miniconda https://conda.io/miniconda.html

first make sure you have activated the conda-forge channel:
```sh
conda config --add channels conda-forge
```

and then install the following Python packages:
```sh
conda install jupyter numpy pathos pandas biopandas mdtraj scipy matplotlib seaborn
```

To install SHIFTX2 (only works for Python 3):
```sh
conda install -c omnia/label/dev shiftx2
```

For visualization of significant feature differences, please install the latest versions of:

- PyMOL (https://pymol.org), e.g., via
```sh
conda install -c schrodinger pymol
conda install Pmw
```
- VMD (http://www.ks.uiuc.edu/Research/vmd)

In case you would like to visualize your PySFD results directly from a Python/iPython/Jupyter session (via ```pysfd.view_feature_diffs()```), you will need to install the Python module iPyMol
(https://github.com/cxhernandez/ipymol.git)
via:
```
pip install ipymol
```

#### Working Package System

PySFD was successfully tested, e.g., using the following python environment:

| package | version | build | channel |
|:---|:---|:---|:---|
|biopandas |0.2.3 | py36_0 |conda-forge|
|ipymol |0.5 | `<pip>` | |
|jupyter |1.0.0 | py_1 |conda-forge|
|matplotlib |2.2.2 | py36_1 |conda-forge|
|mdtraj |1.9.1 | py36_1 |conda-forge|
|numpy |1.14.5 |py36hcd700cb_0 | |
|pandas |0.23.1 | py36_0 |conda-forge|
|pathos |0.2.1 | py36_1 |conda-forge|
|pymol |2.1.1 | py36_2 |schrodinger|
|python |3.6.5 | 1 |conda-forge|
|scipy |1.1.0 |py36hfc37229_0 | |
|seaborn |0.8.1 | py36_0 |conda-forge|

(created via
``conda list "^python$|^jupyter$|numpy|pathos|pandas|biopandas|mdtraj|scipy|matplotlib|seaborn|shiftx2|pymol|ipymol"``)

The full conda environment is listed in `./pysfd.yaml`
(created via ``conda env export > pysfd.yaml``),
which can be used to automatically install all required packages via:

``conda env create -f pysfd.yaml``

#### PySFD

Then just download the PySFD package, e.g., via GitHub
```sh
git clone https://github.com/markovmodel/PySFD.git
```
and from within the downloaded "PySFD" directory,
install PySFD into the Python Path of your environment via:
```sh
python setup.py install
```
To uninstall, simply type:
```sh
pip uninstall pysfd
```
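
Before running the example notebooks, it can help to check that the packages from "Required Packages" are visible to the active Python environment. The following is a minimal sketch using only the standard library; it assumes that each conda package is importable under the same name as the package, and `find_missing` is a hypothetical helper, not part of PySFD.

```python
import importlib.util

# Import names of the packages listed under "Required Packages".
# Assumption: each conda package is importable under its package name.
REQUIRED = ["jupyter", "numpy", "pathos", "pandas", "biopandas",
            "mdtraj", "scipy", "matplotlib", "seaborn"]


def find_missing(names):
    """Return the names in `names` that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

This only tests whether a package can be located, not whether its version matches the tested environment listed above.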

### Features
Pre-defined features are stored in `pysfd.features` and currently include the following modules
and feature classes:

[srf.py] : Single Residue Feature (SRF)
- `CA_RMSF_VMD`: CA atom root mean square fluctuations (RMSF), computed with VMD
- `ChemicalShift`: predicted NMR chemical shifts, computed via mdtraj and shiftx2
- `Dihedral` (`Dihedral_Std`): dihedral angles (their standard deviations), computed with mdtraj; one of the following methods is passed to the `Dihedral` (`Dihedral_Std`) class:
    - `mdtraj.compute_chi1`
    - `mdtraj.compute_chi2`
    - `mdtraj.compute_chi3`
    - `mdtraj.compute_chi4`
    - `mdtraj.compute_omega`
    - `mdtraj.compute_phi`
    - `mdtraj.compute_psi`
- `IsDSSP_mdtraj`: binary secondary structure assignments (DSSP), e.g., whether or not a helix ("H") is formed in a particular residue
- `Scalar_Coupling`: scalar couplings, computed with mdtraj; one of the following methods is passed to the `Scalar_Coupling` class:
    - `mdtraj.compute_J3_HN_C`
    - `mdtraj.compute_J3_HN_CB`
    - `mdtraj.compute_J3_HN_HA`
- `SASA_sr`: solvent accessible surface areas (SASAs), computed via `mdtraj.shrake_rupley`
- `RSASA_sr`: relative SASA, i.e., normalized to the total SASA of a particular residue, computed via `mdtraj.shrake_rupley`

[paf.py] : Pairwise Atomic Features (PAF)
- `Atm2Atm_Distance`: atom-to-atom distance (CA atoms by default, i.e. if df_sel is None)
- `AtmPos_Correlation`: (partial) correlations between atom positions (CA atoms by default, i.e. if df_sel is None)

[prf.py] : Pairwise Residual Features (PRF)
- `Ca2Ca_Distance`: distance between CA atoms
- `CaPos_Correlation`: (partial) correlations between Ca positions
- `Dihedral_Correlation`: (partial) correlations between Dihedral angles (see srf module)
- `Scalar_Coupling_Correlation`: (partial) correlations between scalar couplings (see srf module)

[spbsf.py] : sparse Pairwise Backbone Sidechain Features (sPBSF) (contact frequencies and dwell times)
- `HBond_mdtraj`: hydrogen bonds via mdtraj
- `HBond_VMD`: hydrogen bonds via VMD
- `HBond_HBPLUS`: hydrogen bonds via HBPLUS
    - link to HBPLUS program: https://www.ebi.ac.uk/thornton-srv/software/HBPLUS/
    - after installation, please update "export hbdir=..." in `PySFD/features/scripts/compute_PI.hbplus.sh`
- `Hvvdwdist_VMD`: heavy atom van der Waals radii contacts between residues that are > 4 residue positions apart, computed via VMD
- `HvvdwHB`: contact, if there is a `Hvvdwdist_VMD` or `HBond_HBPLUS` contact in a simulation frame

`Hvvdwdist_VMD` contacts can be transformed into a single residue feature (SRF, see above), e.g., by only considering the contact frequencies of any residue backbone or side-chain with water (`Hvvdwdist_H2O`).
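
The two quantities named for sPBSF features, contact frequencies and dwell times, can be illustrated on a single binary contact time series. This is only a sketch under stated assumptions: it takes a dwell time to be the number of consecutive frames in which a contact stays formed, and the helper names `contact_frequency` and `dwell_times` are hypothetical. PySFD's own definitions and implementation may differ (see the docstrings in the spbsf module).

```python
from itertools import groupby


def contact_frequency(contacts):
    """Fraction of frames in which the contact is formed."""
    return sum(contacts) / len(contacts)


def dwell_times(contacts):
    """Lengths (in frames) of each uninterrupted run of formed contacts."""
    return [len(list(run)) for formed, run in groupby(contacts) if formed]


# Hypothetical contact time series for one residue pair (1 = contact formed)
contacts = [0, 1, 1, 1, 0, 0, 1, 1, 0, 1]

freq = contact_frequency(contacts)       # 6 of 10 frames are in contact
dwells = dwell_times(contacts)           # three runs: 3, 2 and 1 frames long
mean_dwell = sum(dwells) / len(dwells)   # average run length in frames
```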

[pprf.py] : Pairwise Pairwise Residual Features (PPRF)
- `Ca2Ca_Distance_Correlation`: pairwise (partial) correlations between Ca-to-Ca distances

[pspbsf.py] : Pairwise sparse Pairwise Backbone/Sidechain Features (PsPBSF)
- `sPBSF_Correlation`: (partial) correlations between sPBSF features (see spbsf module)

[pff.py] : Pairwise Feature Features (PFF)
- `Feature_Correlation`: pairwise (partial) correlations between (coarse-grained) features, each of which is specified for individual frames, e.g., for correlations between `srf.Dihedral` and `spbsf.HBond_mdtraj`, but not with `srf.CA_RMSF_VMD`

Beware of spurious correlations in big data!

Each of the feature classes contained in these modules is derived
from `pysfd.features._feature_agent.FeatureAgent`.
Each such feature class is instantiated and passed as a `FeatureObj` object into
a `pysfd.PySFD` object, which in turn extracts the essential feature function from `FeatureObj` via
`FeatureObj.get_feature_func()`.
These feature classes can be extended by adding further feature classes to an existing feature module,
or by adding further feature modules (make sure then to update `PySFD/features/__init__.py`).

An important functionality of PySFD is that it allows the histogramming of specific (or all) computed features of a particular type.
The features to be histogrammed are initially listed in the `df_hist_feats` parameter, a pandas.DataFrame passed in a particular feature class (see docstrings).
For each ensemble, the average feature histograms (with standard deviations) over ensemble trajectories are then computed, and stored into the `PySFD.df_fhists` dictionary.
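
The averaging step described above can be sketched in plain Python: one histogram per trajectory, then a per-bin mean and standard deviation over the trajectories of an ensemble. This is not PySFD's implementation. The function names, the normalization of each histogram to unit sum, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def histogram(values, edges):
    """Normalized counts of `values` in the bins defined by ascending `edges`."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            last = i == len(counts) - 1
            # bins are half-open [lo, hi), except the last one, which includes hi
            if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def ensemble_histogram(trajs, edges):
    """Per-bin mean and standard deviation over the trajectory histograms."""
    hists = [histogram(t, edges) for t in trajs]
    per_bin = list(zip(*hists))
    return [mean(b) for b in per_bin], [pstdev(b) for b in per_bin]


# Hypothetical feature values of one feature from two trajectories
trajs = [[0.1, 0.2, 0.6, 0.9], [0.3, 0.7, 0.8, 0.9]]
edges = [0.0, 0.5, 1.0]
h_mean, h_std = ensemble_histogram(trajs, edges)
```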

### Feature Coarse-Graining
Each of these features can be further coarse-grained for each simulation frame by "region" into regional features by providing
`df_rgn_seg_res_bb`, a user-defined pandas DataFrame, and
`rgn_agg_func`, a user-defined function or string,
to the specific feature class.

`df_rgn_seg_res_bb` maps from the residual (or sub-residual, i.e. backbone/sidechain) level to the regional level via
the columns `rgn` (region ID), `seg` (SegID), `res` (ResID), and optionally `bb` (is backbone (1)? is sidechain (0)?).
```python
'''
* df_rgn_seg_res_bb : optional pandas.DataFrame for coarse-graining that defines
                      regions by segIDs and resIDs, and optionally backbone/sidechain, e.g.
  df_rgn_seg_res_bb = _pd.DataFrame({'rgn' : ["a1", "a2", "b1", "b2", "c"],
                                     'seg' : ["A", "A", "B", "B", "C"],
                                     'res' : [range(4,83), range(83,185), range(4,95), range(95,191), range(102,121)]})
                      if None, no coarse-graining is performed
'''
```

`rgn_agg_func` defines the function used in the coarse-graining, e.g. mean, sum, ...
```python
'''
* rgn_agg_func : function or str for coarse-graining
                 function that defines how to aggregate from residues (backbone/sidechain) to regions in each frame
                 this function uses the coarse-graining mapping defined in `df_rgn_seg_res_bb`
                 - if a function, it has to be vectorized (i.e. able to be used by, e.g., a 1-dimensional numpy array)
                 - if a string, it has to be readable by the aggregate function of a pandas.DataFrame,
                   such as "mean", "std"
'''
```

Each feature class then coarse-grains these features
- in each simulation frame, if feature values are computed by frame
- otherwise over feature values, e.g., if the underlying feature type describes an entire trajectory (correlation coefficients, RMSF, ...)

Each feature type has a default for `rgn_agg_func` (usually "mean" or "sum"; see the docstrings).
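
The logic of this coarse-graining can be sketched without pandas: build a (seg, res) -> rgn lookup from the region definitions and aggregate the per-residue values of one frame by region. This is a simplified illustration, not PySFD's code. The `regions` dictionary only mirrors the `rgn`, `seg` and `res` columns of `df_rgn_seg_res_bb`, the optional `bb` column is ignored, and `coarse_grain` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical region definitions, mirroring the columns of df_rgn_seg_res_bb:
# region ID -> (segment ID, residue IDs)
regions = {
    "a1": ("A", range(4, 7)),
    "a2": ("A", range(7, 9)),
}

# (seg, res) -> rgn lookup table
segres2rgn = {(seg, res): rgn
              for rgn, (seg, residues) in regions.items()
              for res in residues}


def coarse_grain(frame, agg_func=mean):
    """Aggregate per-residue values of one frame into per-region values."""
    grouped = {}
    for (seg, res), value in frame.items():
        rgn = segres2rgn.get((seg, res))
        if rgn is not None:          # residues outside all regions are skipped
            grouped.setdefault(rgn, []).append(value)
    return {rgn: agg_func(values) for rgn, values in grouped.items()}


# Hypothetical per-residue feature values in a single simulation frame
frame = {("A", 4): 1.0, ("A", 5): 2.0, ("A", 6): 3.0,
         ("A", 7): 10.0, ("A", 8): 20.0, ("B", 4): 99.0}

by_mean = coarse_grain(frame)                # aggregate with the mean
by_sum = coarse_grain(frame, agg_func=sum)   # aggregate with the sum
```

Passing a different `agg_func` plays the role of `rgn_agg_func`; in PySFD itself this can also be a string understood by pandas, such as "mean" or "std".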

For help on how to semi-automatically define `df_rgn_seg_res_bb`
see:

`PySFD/docs/notes/how_to_define_rgn_to_seg_res_ranges.txt`

For example definitions of `df_rgn_seg_res_bb` data frames, see

`PySFD/docs/PySFD_example/scripts`

### Feature and Significant Feature Differences
These are computed by the following methods (see example jupyter notebooks in `PySFD/docs/PySFD_example`):
1) `PySFD.comp_features()`,
2) `PySFD.comp_feature_diffs()`, or `PySFD.comp_feature_diffs_with_dwells()`, and
3) `PySFD.comp_and_write_common_feature_diffs()`

(further notes below)

Multiple feature types and differences can be computed within the same PySFD instance and
are stored, respectively, in the dictionaries
`PySFD.df_features[PySFD.feature_func_name]` and
`PySFD.df_feature_diffs[PySFD.feature_func_name]`, where
`PySFD.feature_func_name` is the name of the
currently selected feature function `PySFD.feature_func`.

Further Notes:

1\. `PySFD.comp_features()`

Features are computed as means over ensemble trajectory means (and optionally higher statistical moments with respect to the mean, see parameter `max_mom_ord` in the docstrings),
i.e. each feature as a mean of the feature's mean (or higher statistical moment with respect to the mean) along each trajectory
(circular/arithmetic means for circular/linear feature types).
The uncertainties of these feature means (and optionally higher moments) can be computed either as

"standard errors" (`PySFD.error_type[PySFD.feature_func_name] = "std_err"`) (statistical uncertainty)

or

"standard deviations" (`PySFD.error_type[PySFD.feature_func_name] = "std_dev"`) ("exclusive"(/"effect size") uncertainty).

(optional higher moments, i.e. `max_mom_ord>1`, are defined only for "standard errors")

If these uncertainties are computed as "standard errors", each is
the (circular) standard deviation over the ensemble trajectory means.

If they are computed as "standard deviations", each is
the mean of the (circular) standard deviations over the ensemble trajectories.
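
For a linear (non-circular) feature, the two uncertainty definitions can be sketched with the standard library as follows. This is an illustration only: `ensemble_stats` is a hypothetical helper, only the first moment is handled, and the choice of the sample standard deviation (n-1 in the denominator) is an assumption, since the normalization used by PySFD is not stated here.

```python
from statistics import mean, stdev


def ensemble_stats(trajs, error_type="std_err"):
    """Return (feature mean, uncertainty) for one ensemble.

    trajs: list of trajectories, each a list of per-frame feature values
    """
    traj_means = [mean(t) for t in trajs]
    feature_mean = mean(traj_means)            # mean over trajectory means
    if error_type == "std_err":
        # standard deviation over the trajectory means
        error = stdev(traj_means)
    elif error_type == "std_dev":
        # mean of the per-trajectory standard deviations
        error = mean([stdev(t) for t in trajs])
    else:
        raise ValueError("unknown error_type: %r" % error_type)
    return feature_mean, error


# Hypothetical feature values from two trajectories of one ensemble
trajs = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
mean_se, err_se = ensemble_stats(trajs, "std_err")
mean_sd, err_sd = ensemble_stats(trajs, "std_dev")
```

In this toy example the trajectories fluctuate equally but around different means, so the two error types give different values: "std_err" reflects the spread between trajectories, "std_dev" the spread within them.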

`PySFD.error_type[PySFD.feature_func_name]` is implicitly defined in each `FeatureObj` object
passed into PySFD.

Implementation note: `PySFD.error_type[PySFD.feature_func_name]` is updated
at the beginning of `PySFD.run_ens()`:
```python
self.error_type[self._feature_func_name] = dataflags.get("error_type")
```

Alternatively, one could define `error_type` directly as a universal `PySFD.error_type`
and make `PySFD.PyASDA.feat_func()` read it, but then looping through a list of
feature functions would require
updating `self.error_type` in every iteration, e.g., via an external dictionary.

Reloading of already computed features (not differences) is invoked via
`PySFD.reload_features()`
and writing features to disk via
`PySFD.write_features()`.
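
The circular means and standard deviations mentioned in these notes matter for periodic features such as dihedral angles, where an arithmetic mean fails near the ±180° boundary. A minimal sketch is given below, assuming angles in radians and the common definition of the circular standard deviation as sqrt(-2 ln R), with R the mean resultant length. PySFD's own implementation may differ, and the function names are hypothetical.

```python
import math


def circular_mean(angles):
    """Circular mean of angles given in radians, in the range [-pi, pi]."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c)


def circular_std(angles):
    """Circular standard deviation sqrt(-2 ln R), R = mean resultant length."""
    n = len(angles)
    s = sum(math.sin(a) for a in angles) / n
    c = sum(math.cos(a) for a in angles) / n
    r = min(1.0, math.hypot(s, c))   # guard against rounding slightly above 1
    if r == 0.0:
        return math.inf              # angles cancel out completely
    return math.sqrt(-2.0 * math.log(r))


# Two hypothetical dihedral angles close to the +/-180 degree boundary
angles = [math.radians(170.0), math.radians(-170.0)]
arithmetic = sum(angles) / len(angles)   # lands at 0 degrees, far from both
circular = circular_mean(angles)         # lands at the +/-180 degree boundary
```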

2\. `PySFD.comp_feature_diffs()`, or `PySFD.comp_feature_diffs_with_dwells()`

To determine significant differences among pairs of ensembles, the individual ensemble
uncertainties are added geometrically (to form "sigma", i.e. the sqrt(...) in the manuscript)
and scaled by `num_sigma`.
An individual feature is significantly different between two ensembles, if its mean (optionally a higher moment with respect to the mean) differs in absolute value by
more than both `num_funit` and num_sigma * sigma:

![equation](http://latex.codecogs.com/gif.latex?%7Babs%7D%5Cleft%28%5Cbar%7Bf%7D_%7Ba%7D-%5Cbar%7Bf%7D_%7Bb%7D%5Cright%29-%5Cmax%5Cleft%28n_%7B%5Csigma%7D%5Ccdot%5Csqrt%7B%5CDelta_%7Bf_%7Ba%7D%7D%5E%7B2%7D+%5CDelta_%7Bf_%7Bb%7D%7D%5E%7B2%7D%7D%2Cn_%7Bf%7D%5Cright%29%3E0)

(f_a, f_b could either be a mean or optionally a higher statistical moment with respect to the mean)
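
The criterion above translates directly into a few lines of Python. The sketch below is for a single linear feature; `is_significant` is a hypothetical helper (not a PySFD function), and for circular features the difference `f_a - f_b` would additionally have to be wrapped, which is not shown.

```python
import math


def is_significant(f_a, f_b, err_a, err_b, num_sigma, num_funit):
    """True if |f_a - f_b| exceeds both num_sigma * sigma and num_funit."""
    # individual ensemble uncertainties added geometrically
    sigma = math.sqrt(err_a ** 2 + err_b ** 2)
    threshold = max(num_sigma * sigma, num_funit)
    return abs(f_a - f_b) - threshold > 0


# Hypothetical contact frequencies of one feature in ensembles a and b
# sigma = sqrt(0.03**2 + 0.04**2) = 0.05, threshold = max(2 * 0.05, 0.2) = 0.2
large_diff = is_significant(0.80, 0.30, 0.03, 0.04, num_sigma=2, num_funit=0.2)
small_diff = is_significant(0.35, 0.30, 0.03, 0.04, num_sigma=2, num_funit=0.2)
```

Here `num_funit` is given in the units of the feature itself, so it acts as a minimum effect size, while `num_sigma` controls how far the difference has to exceed the combined uncertainty.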

If significant differences are also computed with regard to higher statistical moments, then significant differences in the m-th moment (for all 1 < m <= `max_mom_ord`)
are selected, for which all n-th moments (n [...]

[...] the corresponding "seg,res"->"rgn" mappings have to be defined in
`$CURRENT_WORKDIR/scripts/df_rgn_seg_res_bb.dat`
(see
`PySFD/docs/notes/how_to_define_rgn_to_seg_res_ranges.txt`
for help on how to create this file from `df_rgn_seg_res_bb` defined in PySFD).
Currently, PyMOL visualizations for significant feature differences are implemented only for:
- Single Residual Features (SRF)
- Pairwise Residual Features (PRF)
- sparse Pairwise Backbone/Sidechain Features (sPBSF)

In case you would like to visualize your PySFD results in PyMOL directly from a Python/iPython/Jupyter session, you can use the ```PySFD.view_feature_diffs()``` method (see docstring for further documentation).

#### VMD
[VMD_Vis.tcl] : main VMD Visualizer function `VMD_visualize_feature_diffs`

Each of

[VMD_Vis.prf.tcl],

[VMD_Vis.spbsf.tcl], and

[VMD_Vis.srf.tcl]
contains all the feature-specific parameter values and an `add_vis()` function that
is executed by VMD_visualize_feature_diffs (contained in [VMD_Vis.tcl]).

These feature-specific tcl files are executed within VMD, e.g., via
```tcl
set VisFeatDiffsDir $CURRENT_WORKDIR/VisFeatDiffs
source $VisFeatDiffsDir/VMD/VMD_Vis.spbsf.tcl
```
(also see `VisFeatDiffs/VMD/readme.txt`)

You can "cursor" through the different simulated ensembles (and their significant feature differences) using the "a" and "d" keys.

Before visualizing coarse-grained significant differences,
the corresponding "seg,res"->"rgn" mappings have to be defined in
`$CURRENT_WORKDIR/scripts/rgn2segres.tcl`
(see `PySFD/docs/notes/how_to_define_rgn_to_seg_res_ranges.txt`
for help on how to create this file from `df_rgn_seg_res_bb` defined in PySFD).

Currently, VMD visualizations for significant feature differences are implemented only for:
- Single Residual Features (SRF)
- Pairwise Residual Features (PRF)
- sparse Pairwise Backbone/Sidechain Features (sPBSF)
