Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/edikedik/lxtractor
Library for analysing protein structures and sequences
https://github.com/edikedik/lxtractor
bioinfomatics computational-biology data-analysis data-mining feature-extraction python structural-biology
Last synced: about 5 hours ago
JSON representation
Library for analysing protein structures and sequences
- Host: GitHub
- URL: https://github.com/edikedik/lxtractor
- Owner: edikedik
- License: gpl-3.0
- Created: 2021-12-21T14:36:50.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2024-08-28T09:54:09.000Z (3 months ago)
- Last Synced: 2024-11-15T00:34:49.603Z (2 days ago)
- Topics: bioinfomatics, computational-biology, data-analysis, data-mining, feature-extraction, python, structural-biology
- Language: Python
- Homepage: https://lxtractor.readthedocs.io/
- Size: 3.85 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# lXtractor
[![Coverage Status](https://coveralls.io/repos/github/edikedik/lXtractor/badge.svg?branch=master)](https://coveralls.io/github/edikedik/lXtractor?branch=master)
[![Documentation status](https://readthedocs.org/projects/lxtractor/badge/?version=latest)](https://lxtractor.readthedocs.io/en/latest/?badge=latest)
[![PyPi status](https://img.shields.io/pypi/v/lXtractor.svg)](https://pypi.org/project/lXtractor)
[![Python version](https://img.shields.io/pypi/pyversions/lXtractor.svg)](https://pypi.org/project/lXtractor)
[![Hatch project](https://img.shields.io/badge/%F0%9F%A5%9A-Hatch-4051b5.svg)](https://github.com/pypa/hatch)## Introduction
`lXtractor` is a toolbox devoted to feature extraction from macromolecular
sequences and structures.
It's tailored towards creating shareable local data collections anchored to
a reference sequence-based object: a single sequence, MSA, or an HMM model.
Currently, it doesn't define any unique algorithms, aiming at simplicity and
transparency.
It simply provides a (hopefully) convenient interface simplifying mundane tasks,
such as fetching the data, extracting domains, mapping sequences, and computing
sequential and structural variables.
Sequences and structures anchored to a single reference object have a benefit
of interpretability in downstream applications, such as fitting interpretable
ML models.## Installation
`lXtractor` requires python>=3.10 installed on a Unix system and is
installable via pip```bash
pip install lXtractor
```We encourage users to first create a virtual environment via `conda` or `mamba`.
## Usage
`lXtractor` is designed to be flexible and its usage is defined by the initial
hypothesis or a reference object that one wants to extrapolate towards the
existing sequences or structures.
Below, we'll provide a very abstract description of what this package is
intended for.In creating data collections, one could define the following steps::
1. Assemble the data.
2. Map reference object to assembled entries' sequences.
3. Filter hits.
4. Define and calculate variables -- sequence or structure descriptors.
5. Save the data for later usage or modifications.`lXtractor` defines objects and routines helpful throughout this process.
Namely, `PDB`, `SIFTS`, `AlphaFold`, `fetch_uniprot()`
can aid in the first step.
Then, `Alignment` and `PyHMMer` can facilitate step 2.
At the end of the step 2 one will get a collection of `Chain*`-type objects.
If working with sequence-only collections, these are going to be
`ChainSequence` objects.
For structure-only data, these are going to be ``ChainStructure`` containers,
embedding `ChainSequence` and `GenericStructure` objects.
Finally, dealing with mappings between canonical sequence associated with
a group of structures will result in ``Chain`` objects.`ChainList` wraps `Chain*`-type objects into a list-like collection with
useful operations allowing to quickly filter and bulk-modify `Chain*`-type
objects.
Thus, filtering typically comes down to using ``ChainList.filter()`` method that
accepts a `Callable[Chain*, bool]` and returns a filtered `ChainList`.
One can save/load the collected objects using `ChainIO` and proceed
with the feature extraction.`lXtractor` defines various sequence and structure variables.
Variable-related operations are handled by `GenericCalculator` and
`Manager` classes. The former defines the calculation strategy and how
the calculations are parallelized, while the latter handles the calculations
and aggregates the results into a pandas `DataFrame`.As a result, one is left with a collection of `Chain*`-type objects and a
table with calculated variables. In addition, one can store the calculated
variables within the objects themselves, although we currently do not encourage
this practice.`lXtractor` is in the experimental stage and under active development.
Thus, objects' interfaces may change.For the time being, one can check the examples of
1. [finding sequence determinants](https://eboruta.readthedocs.io/en/latest/notebooks/sequence_determinants_tutorial.html)
of tyrosine and serine-threonine kinases and
2. [a protocol](https://github.com/edikedik/kinactive/blob/abae9c8a1fca0754d02e3f117dee210b587e666b/kinactive/db.py#L142)
to build a complete structural collection of protein kinase domains.More examples are to come in the future, so stay tuned. If you know a good example to apply `lXtractor`, feel free to raise an issue or reach out [[email protected]]([email protected]).