https://github.com/edinburgh-genome-foundry/crazydoc
Read DNA sequences from colourful Microsoft Word documents
bioinformatics computer-aided-design dna-sequences molecular-biology synthetic-biology
- Host: GitHub
- URL: https://github.com/edinburgh-genome-foundry/crazydoc
- Owner: Edinburgh-Genome-Foundry
- License: mit
- Created: 2018-02-02T23:07:51.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2023-04-10T22:10:48.000Z (about 2 years ago)
- Last Synced: 2024-08-11T03:15:48.309Z (9 months ago)
- Topics: bioinformatics, computer-aided-design, dna-sequences, molecular-biology, synthetic-biology
- Language: Python
- Size: 344 KB
- Stars: 32
- Watchers: 6
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.rst
README
.. image:: https://github.com/Edinburgh-Genome-Foundry/crazydoc/actions/workflows/build.yml/badge.svg
   :target: https://github.com/Edinburgh-Genome-Foundry/crazydoc/actions/workflows/build.yml
   :alt: GitHub CI build status

.. image:: https://coveralls.io/repos/github/Edinburgh-Genome-Foundry/crazydoc/badge.svg?branch=master
   :target: https://coveralls.io/github/Edinburgh-Genome-Foundry/crazydoc?branch=master

Crazydoc is a Python library to parse one of the most common DNA representation formats: the joyfully coloured and stylishly annotated MS-Word document.
Crazydoc returns Biopython records of the sequences contained in an MS-Word document, with record features corresponding to the various sequence highlightings (background color, boldness, italics, case change, etc.). The records can be saved as GenBank files or easily plotted.
**Motivation**

While other standards such as FASTA or GenBank are better supported by modern sequence editors, none enjoys the same popularity among molecular biologists as MS-Word's ``.docx`` format, which is limited only by the sophistication and creativity of the user.

Relying on a loose syntax and unclear specifications, this format has however suffered from a lack of support in the developer community and is generally incompatible with mainstream software pipelines. This library converts MS-Word DNA sequences to more computing-friendly formats: Biopython records, FASTA, or annotated GenBank files.
Usage
-----

To obtain all sequences contained in a docx as annotated Biopython records (such as `this one `_):

.. code:: python

    from crazydoc import CrazydocParser

    parser = CrazydocParser(['highlight_color', 'bold', 'underline'])
    biopython_records = parser.parse_doc_file("./example.docx")

You can then plot the obtained records:
.. code:: python

    from crazydoc import CrazydocSketcher

    sketcher = CrazydocSketcher()
    for record in biopython_records:
        sketch = sketcher.translate_record(record)
        ax, _ = sketch.plot()
        ax.set_title(record.id)
        ax.figure.savefig('%s.png' % record.id)
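Each record returned by the parser carries one feature per styled stretch of the sequence. As a rough illustration of that idea, and not Crazydoc's actual implementation, the sketch below walks over a list of styled text runs and records where each style starts and stops. The ``runs`` structure and the ``runs_to_features`` helper are hypothetical names used only for this example; positions are 0-based with an exclusive end, as in Python slicing.

```python
def runs_to_features(runs):
    """Turn (text, styles) runs into a sequence and (start, end, style) spans.

    `runs` is a hypothetical stand-in for the styled text runs of a Word
    paragraph; `styles` is a set of style labels such as {"bold"}.
    """
    sequence_parts = []
    open_spans = {}  # style label -> start position of the stretch in progress
    features = []
    position = 0
    for text, styles in runs:
        # Close every stretch whose style is absent from the current run.
        for style in sorted(set(open_spans) - set(styles)):
            features.append((open_spans.pop(style), position, style))
        # Open a stretch for every style that begins with the current run.
        for style in styles:
            open_spans.setdefault(style, position)
        sequence_parts.append(text)
        position += len(text)
    # Close the stretches that run until the end of the sequence.
    for style in sorted(open_spans):
        features.append((open_spans[style], position, style))
    return "".join(sequence_parts), sorted(features)


# Three runs: bold, then bold and underlined, then unstyled.
runs = [
    ("ATGC", {"bold"}),
    ("GGTA", {"bold", "underline"}),
    ("CCAA", set()),
]
sequence, features = runs_to_features(runs)
print(sequence)
for start, end, style in features:
    print(start, end, style, sequence[start:end])
```

Adjacent runs that share a style are merged into a single span, which is why the bold stretch here covers the first two runs.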
To write the sequences down as GenBank records, with annotations:

.. code:: python

    from crazydoc import records_to_genbank

    records_to_genbank(biopython_records)

Note that ``records_to_genbank()`` will truncate the record name to 20 characters
to fit in the GenBank format. Additionally, slashes (``/``) will be replaced with
hyphens (``-``) in the filenames.

To read protein sequences, pass ``is_protein=True``:

.. code:: python

    biopython_records = parser.parse_doc_file(protein_path, is_protein=True)
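The two naming rules noted for ``records_to_genbank()`` (names cut to 20 characters, slashes replaced with hyphens in filenames) can be made concrete in plain Python. This is only a sketch of the described behaviour, not the code Crazydoc runs; ``truncated_name`` and ``safe_filename_stem`` are hypothetical helpers, and the example record name is made up.

```python
GENBANK_NAME_LIMIT = 20  # record-name length limit quoted in the note


def truncated_name(record_id):
    """Return the record name cut down to the 20-character limit."""
    return record_id[:GENBANK_NAME_LIMIT]


def safe_filename_stem(record_id):
    """Return a filename stem with slashes replaced by hyphens."""
    return record_id.replace("/", "-")


record_id = "pLac/GFP fusion construct v2"  # made-up example name
print(truncated_name(record_id))
print(safe_filename_stem(record_id))
```

Long or slash-containing sequence names in a document may therefore not survive a round trip unchanged, so short, distinct names are the safer choice. Both helpers only touch names; the parsing call with ``is_protein=True`` is unaffected.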
Parsing with ``is_protein=True`` returns *protein* records, which will be saved with a GenPept extension
(``.gp``) by ``records_to_genbank(biopython_records, is_protein=True)``,
unless specified otherwise with ``extension=``.

You can also save annotated sequences as colourful Word docs.
``write_crazydoc()`` takes a SeqRecord, the qualifier key to use as a feature name,
and a path to save the document to:

.. code:: python

    # Load an annotated sequence with Biopython
    from Bio import SeqIO
    from crazydoc import write_crazydoc

    seq = SeqIO.read("examples/examples_outputs/Sequence 1.gbk", "genbank")

    # Most features will already have some name qualifier but you can add your own
    for i, f in enumerate(seq.features):
        f.qualifiers['product'] = f"feature{i}"

    # Save the annotated sequence as a docx
    write_crazydoc(seq, 'product', 'test.docx')

Installation
------------

You can install crazydoc through PIP:
.. code::

    pip install crazydoc

Alternatively, you can unzip the sources in a folder and type:

.. code::

    python setup.py install
License = MIT
-------------

Crazydoc is open-source software originally written at the `Edinburgh Genome Foundry `_ by `Zulko `_ and `released on Github `_ under the MIT licence (Copyright 2018 Edinburgh Genome Foundry).
Everyone is welcome to contribute!
More biology software
---------------------

.. image:: https://raw.githubusercontent.com/Edinburgh-Genome-Foundry/Edinburgh-Genome-Foundry.github.io/master/static/imgs/logos/egf-codon-horizontal.png
   :target: https://edinburgh-genome-foundry.github.io/

Crazydoc is part of the `EGF Codons `_ synthetic biology software suite for DNA design, manufacturing and validation.