https://github.com/dabane-ghassan/dnazip
A Python implementation of The Burrows-Wheeler Transform alongside Huffman coding on DNA sequences.
algorithms burrows-wheeler-transform bwt compression data-structures dna-sequences gui huffman python tkinter-gui
- Host: GitHub
- URL: https://github.com/dabane-ghassan/dnazip
- Owner: dabane-ghassan
- License: mit
- Created: 2021-01-16T18:31:51.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2021-03-17T18:22:02.000Z (over 5 years ago)
- Last Synced: 2025-09-30T14:22:28.066Z (9 months ago)
- Topics: algorithms, burrows-wheeler-transform, bwt, compression, data-structures, dna-sequences, gui, huffman, python, tkinter-gui
- Language: Python
- Homepage: https://dabane-ghassan.github.io/dnazip/
- Size: 7.34 MB
- Stars: 4
- Watchers: 0
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# dnazip


- A Python implementation of ***The Burrows-Wheeler Transform (BWT)*** alongside ***Huffman compression*** on DNA sequences.
- Hosted on [GitHub](https://github.com/dabane-ghassan/dnazip)
- Documentation? [**here**](https://dabane-ghassan.github.io/dnazip/)
## Architecture
### Scripts

### UML

## Installation
- You can install the package either with pip or from the source code hosted on GitHub.
### With pip
```bash
pip install dnazip-bioinfo
```
### From source
```bash
git clone https://github.com/dabane-ghassan/dnazip.git
cd dnazip
sudo python3 setup.py install
```
## Getting started
### GUI
- After installing the package from source or using pip, the interface can be launched simply from the command line:
```bash
dnazip
```
- If problems occur with the installation, an interface instance can be imported and launched:
```python
from dnazip.interface import Interface
gui = Interface()
gui.main()
```
### Using the library
#### Generating a random DNA sequence
- `Sequence.generate` returns a random DNA sequence of the requested `length`, drawn from the alphabet `ATCGN`. The result can then be written to a file:
```python
from dnazip.sequence import Sequence
randseq = Sequence.generate(length=5000)
Sequence('/path/to/new/seq').write(randseq)
```
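- For readers who want to see the idea without the package, here is a minimal standard-library sketch of random sequence generation. The function name `random_dna` and its `seed` parameter are hypothetical and are not part of dnazip; `Sequence.generate` may be implemented differently.
```python
import random


def random_dna(length, alphabet="ATCGN", seed=None):
    """Return `length` characters drawn uniformly at random from `alphabet`.

    Hypothetical helper for illustration only, not the dnazip API.
    """
    rng = random.Random(seed)  # a seed makes the output reproducible
    return "".join(rng.choice(alphabet) for _ in range(length))


seq = random_dna(5000, seed=42)
print(len(seq))  # 5000
```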
#### Encoding a DNA sequence with Burrows-Wheeler + Huffman Coding
- To encode a DNA sequence using BWT and Huffman coding, use a `FullEncoder` object. It saves two files to the same directory as the sequence: the Burrows-Wheeler transform and the UTF-8 zipped form of the sequence:
```python
from dnazip.encoder import FullEncoder
encode = FullEncoder('/path/to/seq')
encode.full_zip()
```
- The attributes of the object can be accessed to see intermediary results:
```python
encode.bw_encoder.rotations # A matrix of all rotations of the sequence
encode.bw_encoder.bwm # The Burrows-Wheeler Matrix (the sorted rotations)
encode.bw_encoder.bwt # The Burrows-Wheeler Transform
encode.huff_encoder.header # The header of the zip file, containing the Huffman code of each character and the binary padding of the sequence
encode.huff_encoder.binary # The binary sequence of the BW transform
encode.huff_encoder.unicode # The binary sequence encoded in 8-bit chunks
```
```
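- The rotations, matrix and transform listed above can be illustrated with a short textbook sketch. It assumes a `$` terminator that sorts before every nucleotide; the helper names are hypothetical, and dnazip's own terminator and internals may differ.
```python
def rotations(text, terminator="$"):
    """Return every cyclic rotation of `text` plus a terminator (assumed to be '$')."""
    s = text + terminator
    return [s[i:] + s[:i] for i in range(len(s))]


def bwt_naive(text, terminator="$"):
    """Sort the rotations into the Burrows-Wheeler Matrix and read its last column."""
    matrix = sorted(rotations(text, terminator))
    return "".join(row[-1] for row in matrix)


print(bwt_naive("banana"))  # annb$aa
```
- This naive construction stores all rotations, so its memory use grows quadratically with the sequence length.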
- ***In one test, a random 1 kB sequence was compressed to 549 bytes, a little over half its original size.***
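- To make the Huffman step concrete, the sketch below builds a prefix-free code table, pads the bit string to a multiple of 8 and packs it into bytes. All names are hypothetical. It is an illustration under simplifying assumptions, not dnazip's file format: the package stores 8-bit chunks as characters, whereas this sketch returns raw `bytes`.
```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Map each symbol of `text` to a prefix-free bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:  # a single distinct symbol still needs a 1-bit code
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def huffman_encode(text):
    """Return (code table, number of padding bits, packed bytes)."""
    codes = huffman_codes(text)
    bits = "".join(codes[ch] for ch in text)
    padding = (8 - len(bits) % 8) % 8  # zero bits appended to fill the last byte
    padded = bits + "0" * padding
    payload = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return codes, padding, payload


def huffman_decode(codes, padding, payload):
    """Invert `huffman_encode` using the code table and the padding length."""
    bits = "".join(f"{byte:08b}" for byte in payload)
    if padding:
        bits = bits[:-padding]
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free: the first match is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)


codes, padding, payload = huffman_encode("annb$aa" * 10)
print(huffman_decode(codes, padding, payload) == "annb$aa" * 10)  # True
```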
#### Decoding a DNA sequence with Huffman decoding + Reversing Burrows-Wheeler transform
- To decode a zipped DNA sequence using Huffman decoding and the inverse BWT, use a `FullDecoder` object, which works in the same manner as the `FullEncoder` object:
```python
from dnazip.decoder import FullDecoder
decode = FullDecoder('path/to/seq')
decode.full_unzip()
```
- The attributes can also be accessed to see intermediary results:
```python
decode.huff_decoder.header # The header of the zip file, containing the Huffman code of each character and the binary padding that were saved when the Huffman tree was created
decode.huff_decoder.unicode # The 8-bit encoded sequence
decode.huff_decoder.binary # The binary sequence
decode.bw_decoder.bwm # The Burrows-Wheeler reconstructed matrix
decode.bw_decoder.original # The original sequence
```
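- The matrix reconstruction mentioned above can be sketched as follows: prepend the transform to a table as a new column and re-sort, once per character, then pick the row that ends with the terminator. The `$` terminator and the function name are assumptions, and dnazip's decoder may proceed differently.
```python
def inverse_bwt(bwt, terminator="$"):
    """Rebuild the sorted rotation matrix column by column and return the original text."""
    n = len(bwt)
    table = [""] * n
    for _ in range(n):
        # Prepend the BWT column to every row, then sort the rows again
        table = sorted(bwt[i] + table[i] for i in range(n))
    for row in table:
        if row.endswith(terminator):
            return row[:-len(terminator)]
    raise ValueError("terminator not found in the transform")


print(inverse_bwt("annb$aa"))  # banana
```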
#### Building the Burrows-Wheeler transform using the advanced algorithm
- The BWT can also be constructed with a Suffix Array (SA) based algorithm, which has better time and space complexity than building the full rotation matrix:
```python
from dnazip.sequence import Sequence
from dnazip.burros_wheeler import BurrosWheeler
seq = Sequence('/path/to/seq').read()
BurrosWheeler.bwt_advanced(seq)
```
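- As an illustration of the suffix-array idea, the sketch below builds the suffix array by prefix doubling and reads the BWT as the character preceding each sorted suffix. This is one common technique, chosen here for brevity. The helper names and the `$` terminator are assumptions, and `bwt_advanced` may use a different construction.
```python
def suffix_array(s):
    """Return the suffix array of `s` using prefix doubling (O(n log^2 n) time)."""
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]
    k = 1
    while n > 1:
        # Compare suffixes by their first 2k characters through pairs of ranks
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: the order is final
            break
        k *= 2
    return sa


def bwt_from_suffix_array(text, terminator="$"):
    """BWT[i] is the character just before the i-th smallest suffix."""
    s = text + terminator
    # For the suffix starting at 0, s[-1] wraps around to the terminator
    return "".join(s[i - 1] for i in suffix_array(s))


print(bwt_from_suffix_array("banana"))  # annb$aa
```
- Only integer arrays of length n are kept in memory, instead of the n rotations of length n that the matrix approach requires.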
## Documentation
Detailed documentation on the architecture can be found [here](https://dabane-ghassan.github.io/dnazip/).
## About
### :scroll: License
**MIT Licensed** © [Ghassan Dabane](https://github.com/dabane-ghassan), 2021.