https://github.com/biosustain/goodbye-genbank
A Python package for Biopython that gives feature annotations from GenBank records a new and better life
biopython genbank genomics gff3 sequence-ontology
- Host: GitHub
- URL: https://github.com/biosustain/goodbye-genbank
- Owner: biosustain
- License: other
- Created: 2016-04-06T16:01:50.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2016-04-27T16:01:19.000Z (about 9 years ago)
- Last Synced: 2025-04-06T10:44:49.566Z (about 2 months ago)
- Topics: biopython, genbank, genomics, gff3, sequence-ontology
- Language: Python
- Homepage: https://biosustain.github.io/goodbye-genbank/
- Size: 284 KB
- Stars: 14
- Watchers: 4
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.rst
- License: LICENSE
================
Goodbye, GenBank
================

.. image:: https://img.shields.io/travis/biosustain/goodbye-genbank/master.svg?style=flat-square
    :target: https://travis-ci.org/biosustain/goodbye-genbank

**Goodbye, GenBank** converts ``SeqFeature`` sequence
annotations from NCBI GenBank records to a common and simplified format. GenBank feature annotations have a
feature key and reasonably well defined qualifiers, but non-standard and discontinued feature types and qualifiers are commonly
used and often the feature key is something someone made up and not a valid GenBank feature key. And even when a valid GenBank feature key is used, it is often incomplete and useless without additional details in the qualifiers.

This package converts most feature keys to appropriate Sequence Ontology terms used by GFF3 and SBOL. Non-standard qualifiers are repaired or removed.

**Goodbye, GenBank** is intended for those who wish to clean up their GenBank files and then transition to a different format.
The philosophy of this project is to salvage what is salvageable and to discard what is not. GenBank feature types are translated
to Sequence Ontology terms; qualifiers are converted into a reduced set that contains only the parts that are not broken. Qualifiers are also converted to their correct type: ``int`` for integers, ``list`` only for qualifiers that can appear multiple times, ``bool`` for flags.

Moreover, different options are available to configure what is kept and what is thrown away.
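
The qualifier type conversion described above can be illustrated with a small, self-contained sketch. This is not the code used by *Goodbye, GenBank*: the function ``normalize_qualifiers`` and the three qualifier sets are hypothetical names chosen for this illustration, and the sets are deliberately incomplete. The input mimics a Biopython-style qualifier dictionary, where values are typically lists of strings.

```python
# Illustrative sketch only; NOT the implementation used by gbgb.
# The three sets below are assumptions made for this example and are incomplete.
INT_QUALIFIERS = {"codon_start", "transl_table"}   # assumed integer-valued
FLAG_QUALIFIERS = {"pseudo", "trans_splicing"}     # assumed value-less flags
MULTI_QUALIFIERS = {"db_xref", "EC_number"}        # assumed repeatable


def normalize_qualifiers(qualifiers):
    """Return a new dict with each qualifier coerced to a simpler type."""
    result = {}
    for name, value in qualifiers.items():
        # Accept both raw list values and already-normalized scalars.
        values = value if isinstance(value, list) else [value]
        if name in FLAG_QUALIFIERS:
            # A flag carries no value; its presence alone means True.
            result[name] = True
        elif not values:
            # Nothing salvageable: drop the qualifier.
            continue
        elif name in MULTI_QUALIFIERS:
            result[name] = [str(v) for v in values]
        elif name in INT_QUALIFIERS:
            result[name] = int(values[0])
        else:
            # Single-valued qualifier: keep only the first value.
            result[name] = values[0]
    return result


# Made-up input values, for demonstration only.
raw = {
    "codon_start": ["1"],
    "db_xref": ["ExampleDB:1", "ExampleDB:2"],
    "pseudo": [""],
    "product": ["hypothetical protein"],
}
clean = normalize_qualifiers(raw)
print(clean)
```

In this sketch, passing the result back through ``normalize_qualifiers`` returns an equal dictionary, so the conversion can be repeated safely.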

Installation
------------

You can install *Goodbye, GenBank* with pip:

::

    pip install gbgb

Example
-------

::

    >>> feature
    SeqFeature(FeatureLocation(ExactPosition(2931), ExactPosition(2936), strand=1), type='-10_signal')
    >>> feature.qualifiers
    {'ApEinfo_fwdcolor': ['pink'],
     'ApEinfo_graphicformat': ['arrow_data {{0 1 2 0 0 -1} {} 0} width 5 offset 0'],
     'ApEinfo_revcolor': ['pink'],
     'label': ['RNAII Promoter (-10 signal)']}
    >>>
    >>> from gbgb import convert_feature
    >>> feature = convert_feature(feature)
    >>> feature
    SeqFeature(FeatureLocation(ExactPosition(2931), ExactPosition(2936), strand=1), type='minus_10_signal')
    >>> feature.qualifiers
    {'note': 'RNAII Promoter (-10 signal)'}
    >>>
    >>> from gbgb import genbank_feature_key
    >>> genbank_feature_key('minus_10_signal')
    'regulatory'

Design considerations
---------------------

For the most part, *Goodbye, GenBank* attempts to be idempotent, i.e. features and their types/keys and qualifiers can be safely
transformed any number of times with the same settings. The apparent mismatch between the conversion to Sequence Ontology feature
terms and valid/fixed GenBank qualifiers is to simplify downstream processing. It is up to the users which qualifiers they wish
to keep, but at least the choices they are given are reasonable.

Contributing
------------

If you have any questions or suggestions or if you have found a unique new specimen of GenBank files that you would like
to convert, please open an issue.

Issues
------

- SO Term: "regulatory" feature type with /regulatory_class="enhancer_blocking_element"

  There is apparently no matching Sequence Ontology term. An enhancer blocking element behaves like an insulator, but
  is not an insulator. It is a transcriptional cis regulatory region, but that description is too broad.

- SO Term: "misc_structure" feature type

  GenBank uses this feature type for secondary and tertiary nucleotide structures. There appears to be
  no matching Sequence Ontology term.

- SO Term: "assembly_gap" feature type

  GenBank has both "gap" and "assembly_gap" feature types, which appear to have slightly different meanings. However,
  SO only has a "gap" term, which refers to assembly gaps.

- GFF3 export

  There is no good GFF3 exporter out there, so why not write one?
  Skeleton code in ``gbgb.export.gff3``.

- Reduction of SO terms

  Allow users to specify a set of Sequence Ontology terms (inheriting from "sequence_feature"). Feature types will be
  reduced to the nearest Ontology term. This is to simplify downstream analysis.

- /pseudo qualifier without /pseudogene=""

  There is no matching Sequence Ontology term for this. Several GenBank files contain /pseudo without /pseudogene=""
  to mean pseudogene.

- Mandatory qualifiers

  These should be filled in using a reasonable guess or errors should be thrown when trying to convert a feature without
  its mandatory qualifiers.

Materials
---------

- GenBank Feature Table Definition
- GenBank Release Notes (since December 1992)
- NCBI Prokaryotic Genome Annotation Guide
- Sequence Ontology Wiki -- Discontinuous Features.
  On trans-splicing.
- GFF3 Specification