Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mattdoug604/pyvariant
Map between gene, transcript, exon, CDS, and protein coordinates.
https://github.com/mattdoug604/pyvariant
Last synced: about 2 months ago
JSON representation
Map between gene, transcript, exon, CDS, and protein coordinates.
- Host: GitHub
- URL: https://github.com/mattdoug604/pyvariant
- Owner: mattdoug604
- License: mit
- Created: 2020-03-13T20:57:27.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2023-12-21T22:09:49.000Z (about 1 year ago)
- Last Synced: 2024-11-05T11:09:35.483Z (about 2 months ago)
- Language: Python
- Size: 1 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# pyvariant
## What is it?
**pyvariant** is a Python package for mapping biological sequence variants (mutations) to their equivalent chromosome, cDNA, gene, exon, protein, and RNA positions.
## How to get it
The easiest way to get pyvariant is using [pip](https://pip.pypa.io/en/latest/quickstart.html):
```sh
pip install pyvariant
```The source code is hosted on GitHub at:
## How to use it
Before you can use pyvariant, you will need to download the necessary genomic data.
To download and install data from Ensembl, run:
```shell
pyvariant install --species --release
```For example:
```shell
pyvariant install --species 'homo sapiens' --release 100
```At the time of writing, installing a human dataset takes roughly 30-45 minutes and 1.5G of storage space. However, the actual time and space required to install a dataset will depend entirely on the size of the dataset, your computer, internet speed, etc.
By default, the data is downloaded to a [platform-specific data directory](https://pypi.org/project/appdirs/) that is generally only accessible by the user (e.g. `/home//.local/share/pyvariant/`). If you want the data to be accessible to other users, or your home directory does have enough storage space, you may to specify a different directory to download to with the `--cache` option:
```shell
pyvariant install --species homo_sapiens --release 100 --cache /path/to/cache/
```For more options, run:
```shell
pyvariant install --help
```Alternatively, you can run the installation from inside a Python process:
```python
>>> from pyvariant import EnsemblRelease
>>> ensembl100 = EnsemblRelease(species='homo_sapiens', release=100, cache_dir="/path/to/cache/")
>>> ensembl100.install()
```Once the data is installed, the `EnsemblRelease` object provides methods for getting information about different features, getting the DNA/RNA/protein sequence at a position, and converting between positions, etc.
### Map between feature types
The main use of the package is for converting between equivalent cDNA, DNA, exon, protein, and RNA positions. For example:
cDNA to DNA:
```python
>>> ensembl100.to_dna("ENST00000310581:c.-124C>T")
[DnaSubstitution(refseq='C', altseq='T', contig_id='5', start=1295113, start_offset=0, end=1295113, end_offset=0, strand='-')]
```Protein to DNA:
```python
>>> ensembl100.to_dna("BRCA1:p.Q1458*")
[DnaDelins(refseq='CAG', altseq='TAA', contig_id='17', start=43076598, start_offset=0, end=43076600, end_offset=0, strand='-'), DnaDelins(refseq='CAG', altseq='TGA', contig_id='17', start=43076598, start_offset=0, end=43076600, end_offset=0, strand='-'), DnaSubstitution(refseq='C', altseq='T', contig_id='17', start=43076600, start_offset=0, end=43076600, end_offset=0, strand='-')]
```Exon to RNA:
```python
>>> ensembl100.to_rna("ENST00000266970:e.7::ENST00000360299:e.2")
[RnaFusion(breakpoint1=RnaPosition(contig_id='12', start=972, start_offset=0, end=2240, end_offset=0, strand='+', gene_id='ENSG00000123374', gene_name='CDK2', transcript_id='ENST00000266970', transcript_name='CDK2-201'), breakpoint2=RnaPosition(contig_id='12', start=63, start_offset=0, end=317, end_offset=0, strand='+', gene_id='ENSG00000111540', gene_name='RAB5B', transcript_id='ENST00000360299', transcript_name='RAB5B-201'))]
```You can also limit mapping to the canonical transcript only:
```python
>>> ensembl100 = EnsemblRelease(species='homo_sapiens', release=100, canonical_transcript=["ENST00000000233"])
>>> ensembl100.to_cdna("7:g.127589084", canonical=False)
[CdnaPosition(contig_id='7', start=69, start_offset=0, end=69, end_offset=0, strand='+', gene_id='ENSG00000004059', gene_name='ARF5', transcript_id='ENST00000000233', transcript_name='ARF5-201', protein_id='ENSP00000000233'), CdnaPosition(contig_id='7', start=69, start_offset=0, end=69, end_offset=0, strand='+', gene_id='ENSG00000004059', gene_name='ARF5', transcript_id='ENST00000415666', transcript_name='ARF5-202', protein_id='ENSP00000412701')]
>>> ensembl100.to_cdna("7:g.127589084", canonical=True)
[CdnaPosition(contig_id='7', start=69, start_offset=0, end=69, end_offset=0, strand='+', gene_id='ENSG00000004059', gene_name='ARF5', transcript_id='ENST00000000233', transcript_name='ARF5-201', protein_id='ENSP00000000233')]
```Normalize a variant to each possible type:
```python
>>> ensembl100.to_all("ENST00000269305:c.376-2A>G")
[{'cdna': CdnaSubstitution(_core=EnsemblRelease(species=homo_sapiens, release=100), refseq='A', altseq='G', contig_id='17', start=376, start_offset=-2, end=376, end_offset=-2, strand='-', gene_id='ENSG00000141510', gene_name='TP53', transcript_id='ENST00000269305', transcript_name='TP53-201', protein_id='ENSP00000269305'), 'dna': DnaSubstitution(_core=EnsemblRelease(species=homo_sapiens, release=100), refseq='A', altseq='G', contig_id='17', start=7675238, start_offset=0, end=7675238, end_offset=0, strand='-'), 'exon': None, 'protein': None, 'rna': RnaSubstitution(_core=EnsemblRelease(species=homo_sapiens, release=100), refseq='A', altseq='G', contig_id='17', start=566, start_offset=-2, end=566, end_offset=-2, strand='-', gene_id='ENSG00000141510', gene_name='TP53', transcript_id='ENST00000269305', transcript_name='TP53-201')}]
```### Check if two variants are equivalent
Get the notation(s) that represent both variants:
```python
>>> x = ensembl100.same("ENSP00000358548:p.Q61K", "NRAS:c.181C>A")
>>> x.keys()
dict_keys(['cdna', 'dna', 'exon', 'protein', 'rna'])
>>> x["dna"]
[DnaSubstitution(refseq='C', altseq='A', contig_id='1', start=114713909, start_offset=0, end=114713909, end_offset=0, strand='-')]
```...or get the notation(s) that are unique to each variant:
```python
>>> x = ensembl100.diff("ENSP00000358548:p.Q61K", "NRAS:c.181C>A")
>>> x.keys()
dict_keys(['cdna', 'dna', 'exon', 'protein', 'rna'])
>>> x["dna"]
([DnaDelins(refseq='CAA', altseq='AAG', contig_id='1', start=114713907, start_offset=0, end=114713909, end_offset=0, strand='-')], [])
```### Fetch sequences
Get the mutated reference sequence, within a given window:
```python
>>> ensembl100.sequence("ENST00000635293:c.1044A>C", window=50)
CGCCTCTTTCAGAGACTTTTAACTTCAACATCTGTCCCTACCCAGCAGGC
```The sequence can also be normalized to a specific strand of the genome:
```python
>>> ensembl100.sequence("ENST00000635293:c.1044A>C", window=50, strand='+')
GCCTGCTGGGTAGGGACAGATGTTGAAGTTAAAAGTCTCTGAAAGAGGCG
```Get the sequence surrounding a fusion breakpoint:
```python
>>> ensembl100.sequence("ENST00000399410:r.2871::ENST00000561813:r.317", window=50)
ACAGTGCAGGGAAGCAACTGCAGAGGCTGTGCAATCTTGCACAAATATCT
```### Retrieve feature information
`pyvariant` also has functions for retrieving general information about various features. For example:
Get a list of transcript IDs for a gene:
```python
>>> ensembl100.transcript_ids("BRCA2")
['ENST00000380152', 'ENST00000470094', 'ENST00000528762', 'ENST00000530893', 'ENST00000533776', 'ENST00000544455', 'ENST00000614259', 'ENST00000665585', 'ENST00000666593', 'ENST00000670614', 'ENST00000671466']
```For a complete list of methods, run:
```python
>>> help(EnsemblRelease)
```## Notes
### Variant naming standards
This package follows [HGVS nomenclature](https://varnomen.hgvs.org/) recommendations for representing variants.
### Offset positions
When describing variants in an intron or UTR, it can be more informative to describe the position relative to a transcript, rather than the genome. These positions are described with a "+" or "-". For example, "TERT:c.-125" means "a position 125 nucleotides 5’ of the ATG translation initiation codon." See the [HGVS nomenclature](https://varnomen.hgvs.org/bg-material/numbering/) documentation for more information.
### Protein duplications
By default, the package assumes that protein duplications are the result of a nucleotide insertion, as opposed to a delins. This behaviour can be turned of by defining the environmental variable `PYVARIANT_GET_ALL_PROTEIN_DUPS`.
## License
This package is distributed with the [MIT](LICENSE) license.