https://github.com/cobilab/jarvis3

Efficient compression of biological data

Host: GitHub
URL: https://github.com/cobilab/jarvis3
Owner: cobilab
License: gpl-3.0
Created: 2023-03-08T16:32:40.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2026-03-13T21:58:22.000Z (2 months ago)
Last Synced: 2026-03-14T09:54:09.631Z (2 months ago)
Topics: data-compression, dna, dna-sequences
Language: C
Homepage:
Size: 75.7 MB
Stars: 7
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

[![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-blue.svg)](LICENSE)
[![Format1](https://img.shields.io/static/v1.svg?label=Format&message=DNA%20Sequence&color=green)](#)
[![Format2](https://img.shields.io/static/v1.svg?label=Format&message=FASTA&color=green)](#)
[![Format3](https://img.shields.io/static/v1.svg?label=Format&message=FASTQ&color=green)](#)
[![MX](https://img.shields.io/static/v1.svg?label=Mixing&message=neural%20network&color=yellow)](#)

[![HGR](https://img.shields.io/static/v1.svg?label=Human%20genome&message=compression%20record&color=orange)](#)
[![CR](https://img.shields.io/static/v1.svg?label=Cassava%20genome&message=compression%20record&color=orange)](#)

JARVIS3: an improved encoder for genomic sequences

### Installation ###

Manually:

```
git clone https://github.com/cobilab/jarvis3.git
cd jarvis3/src/
make
```

Using Conda:

```
conda install -c bioconda jarvis3
```

### Execution ###

#### Run JARVIS3 ####

Example of running JARVIS3 using level 7:

```
./JARVIS3 -v -l 7 File.seq
```

### Parameters ###

To see the available options, type:

```
./JARVIS3 -h
```

This will print the following options:
```

██ ███████ ███████ ██ ██ ██ ███████ ███████
██ ██ ██ ██ ██ ██ ██ ██ ██ ██
██ ███████ ██████ ██ ██ ██ ███████ ███████
██ ██ ██ ██ ██ ███ ██ ██ ██ ██ ██
███████ ██ ██ ██ ███ ████ ██ ███████ ███████

NAME
JARVIS3 v3.7,
Efficient lossless encoding of genomic sequences

SYNOPSIS
./JARVIS3 [OPTION]... [FILE]

SAMPLE
Run Compression -> ./JARVIS3 -v -l 14 sequence.txt
Run Decompression -> ./JARVIS3 -v -d sequence.txt.jc

DESCRIPTION
Lossless compression and decompression of genomic
sequences for miniaml storage and analysis purposes.
Measure an upper bound of the sequence complexity.

-h, --help
Usage guide (help menu).

-a, --version
Display program and version information.

-x, --explanation
Explanation of the context and repeat models.

-f, --force
Force mode. Overwrites old files.

-v, --verbose
Verbose mode (more information).

-p, --progress
Show progress bar.

-d, --decompress
Decompression mode.

-e, --estimate
It creates a file with the extension ".iae" with the
respective information content. If the file is FASTA or
FASTQ it will only use the "ACGT" (genomic) sequence.

-s, --show-levels
Show pre-computed compression levels (configured).

-l [NUMBER], --level [NUMBER]
Compression level (integer).
Default level: 7.
It defines compressibility in balance with computational
resources (RAM & time). Use -s for levels perception.

-sd [NUMBER], --seed [NUMBER]
Pseudo-random seed.
Default value: 0.

-hs [NUMBER], --hidden-size [NUMBER]
Hidden size of the neural network (integer).
Default value: 40.

-lr [DOUBLE], --learning-rate [DOUBLE]
Neural Network leaning rate (double).
The 0 value turns the Neural Network off.
Default value: 0.03.

-o [FILENAME], --output [FILENAME]
Compressed/decompressed output filename.

[FILENAME]
Input sequence filename (to compress) -- MANDATORY.
File to compress is the last argument.

COPYRIGHT
Copyright (C) 2014-2024.
This is a Free software, under GPLv3. You may redistribute
copies of it under the terms of the GNU - General Public
License v3 .

```
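The SAMPLE lines in the help text suggest a compress/decompress round trip. The following is a minimal Python sketch of that idea, assuming the compiled `JARVIS3` binary sits in the current directory and that compression writes `<input>.jc` as in the sample. The helper names (`compress_cmd`, `decompress_cmd`, `round_trip`) and the `.restored` suffix are hypothetical and not part of JARVIS3; only flags listed in the help menu are used.

```python
import subprocess
from pathlib import Path

JARVIS3 = "./JARVIS3"  # assumed location of the compiled binary


def compress_cmd(path, level=7):
    """Argument list for compression; the input file goes last."""
    return [JARVIS3, "-v", "-l", str(level), str(path)]


def decompress_cmd(path, output):
    """Argument list for decompression; -o names the restored file."""
    return [JARVIS3, "-v", "-d", "-o", str(output), str(path)]


def round_trip(path, level=7):
    """Compress, decompress to a separate file, and compare the bytes."""
    path = Path(path)
    packed = Path(str(path) + ".jc")          # extension taken from the SAMPLE lines
    restored = Path(str(path) + ".restored")  # hypothetical output name
    subprocess.run(compress_cmd(path, level), check=True)
    subprocess.run(decompress_cmd(packed, restored), check=True)
    return path.read_bytes() == restored.read_bytes()
```

Calling `round_trip("File.seq", level=7)` would then return `True` when the restored file matches the input byte for byte, assuming the input is a plain sequence file.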

To see the available levels (automatically chosen compression parameters), type:

```
./JARVIS3 -s
```

This will output the following pre-set models for each level:


```
Level 1: -rm 1:12:0.90:4:0.72:0:0.1:1
Level 2: -rm 1:12:0.90:4:0.72:1:0.1:1
Level 3: -rm 1:13:0.90:4:0.72:1:0.1:1
Level 4: -rm 1:14:0.90:4:0.72:1:0.1:1
Level 5: -rm 2:12:0.90:5:0.60:1:0.1:1
Level 6: -rm 4:12:0.94:7:0.70:1:0.05:3
Level 7: -rm 3:13:0.90:5:0.72:1:0.1:1
Level 8: -rm 3:14:0.90:5:0.72:1:0.1:1
Level 9: -rm 5:14:0.90:5:0.72:1:0.1:1
Level 10: -rm 6:12:0.90:6:0.78:1:0.03:1
Level 11: -rm 8:13:0.90:6:0.78:1:0.03:2
Level 12: -rm 10:12:0.91:7:0.80:1:0.02:3
Level 13: -rm 12:12:0.90:7:0.81:1:0.02:3
Level 14: -lr 0 -cm 1:1:0:0.9/0:0:0:0 -rm 2:12:0.92:7:0.80:1:0.05:2
Level 15: -lr 0 -cm 3:1:0:0.9/0:0:0:0 -rm 3:12:0.93:7:0.81:1:0.05:3
Level 16: -lr 0 -cm 3:1:0:0.9/0:0:0:0 -rm 4:12:0.92:7:0.81:1:0.03:2
Level 17: -lr 0 -cm 4:1:0:0.9/0:0:0:0 -rm 4:13:0.94:7:0.81:1:0.04:3
Level 18: -lr 0 -cm 6:1:0:0.9/0:0:0:0 -rm 4:13:0.94:7:0.81:1:0.04:3
Level 19: -lr 0 -cm 6:1:0:0.9/0:0:0:0 -rm 8:12:0.93:7:0.81:1:0.02:3
Level 20: -lr 0 -cm 4:1:0:0.9/0:0:0:0 -rm 20:12:0.9:7:0.85:1:0.01:4
Level 21: -lr 0 -cm 4:1:0:0.9/0:0:0:0 -rm 50:12:0.9:7:0.85:1:0.01:5
Level 22: -lr 0 -cm 4:1:0:0.9/0:0:0:0 -rm 100:12:0.9:7:0.85:1:0.01:5
Level 23: -lr 0 -cm 4:1:0:0.9/0:0:0:0 -rm 200:12:0.9:7:0.85:1:0.01:6
Level 24: -lr 0 -cm 6:1:0:0.9/0:0:0:0 -rm 6:15:0.93:6:0.81:1:0.02:1
Level 25: -lr 0.03 -hs 24 -cm 6:1:0:0.9/0:0:0:0 -rm 6:15:0.92:6:0.81:1:0.02:1
Level 26: -lr 0.03 -hs 32 -cm 4:1:0:0.9/0:0:0:0 -rm 20:15:0.90:7:0.82:1:0.02:1
Level 27: -lr 0.03 -hs 24 -cm 6:1:0:0.9/0:0:0:0 -rm 15:13:0.92:7:0.85:0:0.02:4 -rm 13:12:0.92:7:0.84:2:0.01:3
Level 28: -lr 0.03 -hs 42 -cm 6:1:0:0.9/0:0:0:0 -rm 6:15:0.93:6:0.81:1:0.02:1
Level 29: -lr 0.03 -hs 42 -cm 6:1:0:0.9/0:0:0:0 -rm 10:15:0.93:6:0.81:1:0.02:1
Level 30: -lr 0.03 -hs 42 -cm 6:1:0:0.9/0:0:0:0 -rm 10:15:0.93:6:0.81:0:0.02:1 -rm 10:15:0.93:6:0.81:2:0.02:1
Level 31: -lr 0.03 -hs 48 -cm 1:1:0:0.9/0:0:0:0 -cm 4:1:0:0.9/0:0:0:0 -cm 8:1:1:0.89/0:0:0:0 -cm 12:20:1:0.97/0:0:0:0 -rm 300:12:0.9:7:0.85:0:0.01:10 -rm 200:12:0.9:7:0.8:2:0.01:4
Level 32: -lr 0.04 -hs 64 -cm 1:1:0:0.9/0:0:0:0 -cm 4:1:0:0.9/0:0:0:0 -cm 8:1:1:0.89/0:0:0:0 -cm 12:20:1:0.97/0:0:0:0 -rm 500:12:0.9:7:0.85:0:0.01:12 -rm 200:12:0.9:7:0.8:2:0.01:4
Level 33: -lr 0.04 -hs 86 -cm 1:1:0:0.9/0:0:0:0 -cm 4:1:0:0.9/0:0:0:0 -cm 8:1:1:0.89/0:0:0:0 -cm 12:20:1:0.97/0:0:0:0 -rm 500:12:0.9:7:0.85:0:0.01:12 -rm 200:12:0.9:7:0.8:2:0.01:4
Level 34: -lr 0.04 -hs 256 -cm 1:1:0:0.9/0:0:0:0 -cm 4:1:0:0.9/0:0:0:0 -cm 8:1:1:0.9/0:0:0:0 -cm 12:20:1:0.97/0:0:0:0 -rm 1500:12:0.9:7:0.85:0:0.01:10 -rm 500:12:0.9:7:0.82:2:0.01:3
Level 35: -lr 0.04 -hs 248 -cm 1:1:0:0.9/0:0:0:0 -cm 3:1:0:0.9/0:0:0:0 -cm 7:1:0:0.9/0:0:0:0 -cm 9:1:1:0.9/0:0:0:0 -cm 11:10:0:0.9/0:0:0:0 -rm 100:14:0.9:7:0.85:1:0.01:3 -rm 200:12:0.88:7:0.85:0:0.01:3 -rm 300:12:0.87:7:0.85:2:0.01:3
Level 36: -lr 0.04 -hs 248 -cm 1:1:0:0.9/0:0:0:0 -cm 3:1:0:0.9/0:0:0:0 -cm 7:1:0:0.9/0:0:0:0 -cm 9:1:1:0.9/0:0:0:0 -cm 11:10:0:0.9/0:0:0:0 -cm 13:200:1:0.9/1:10:1:0.9 -rm 100:14:0.9:7:0.85:1:0.01:3 -rm 200:12:0.88:7:0.85:0:0.01:8 -rm 300:12:0.87:7:0.85:2:0.01:3
Level 37: -lr 0.01 -hs 248 -cm 1:1:0:0.9/0:0:0:0 -cm 3:1:0:0.9/0:0:0:0 -cm 6:1:0:0.9/0:0:0:0 -cm 9:1:0:0.9/0:0:0:0 -cm 11:10:1:0.9/0:0:0:0 -cm 14:200:1:0.9/1:10:1:0.9 -rm 300:14:0.88:7:0.85:0:0.01:8 -rm 300:14:0.88:7:0.85:2:0.01:8 -rm 500:12:0.88:7:0.85:0:0.01:15
Level 38: -lr 0 -cm 12:1:0:0.7/0:0:0:0 -rm 2:14:0.95:1:0.9:1:0.1:1
Level 39: -lr 0 -cm 12:1:0:0.7/0:0:0:0 -rm 3:14:0.95:1:0.9:1:0.1:1
Level 40: -lr 0.03 -lr 32 -cm 12:1:0:0.7/0:0:0:0 -rm 4:14:0.95:1:0.9:1:0.1:1
```
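Each level is a plain string of flags, so it can be taken apart mechanically. The sketch below is a hypothetical helper (not part of JARVIS3) that splits a level string into its flags and names the eight fields of an `-rm` value after the repeat-model template printed by `./JARVIS3 -x`. It assumes every flag is followed by exactly one value, as in the list above.

```python
# Field names follow the -rm template shown by `./JARVIS3 -x`.
RM_FIELDS = ("NB_R", "NB_C", "NB_B", "NB_L", "NB_G", "NB_I", "NB_W", "NB_Y")


def parse_level(spec):
    """Group the values of a level string by flag, keeping their order."""
    tokens = spec.split()
    flags = {}
    # Tokens alternate between a flag and its value.
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        flags.setdefault(flag, []).append(value)
    return flags


def parse_rm(value):
    """Map the colon-separated fields of one -rm value to their names."""
    return dict(zip(RM_FIELDS, (float(x) for x in value.split(":"))))


level_27 = ("-lr 0.03 -hs 24 -cm 6:1:0:0.9/0:0:0:0 "
            "-rm 15:13:0.92:7:0.85:0:0.02:4 -rm 13:12:0.92:7:0.84:2:0.01:3")
flags = parse_level(level_27)
for rm in flags["-rm"]:
    print(parse_rm(rm))
```

For level 27 this yields two repeat models, one with `NB_I` equal to 0 and one with `NB_I` equal to 2.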

To see the meaning of the model parameters, type:

```
./JARVIS3 -x
```

This will output the following content:


```
      -cm [NB_C]:[NB_D]:[NB_I]:[NB_G]/[NB_S]:[NB_E]:[NB_R]:[NB_A]
      Template of a context model.
      Parameters:
      [NB_C]: (integer [1;14]) order size of the regular context
              model. Higher values use more RAM but, usually, are
              related to a better compression score.
      [NB_D]: (integer [1;5000]) denominator to build alpha, which
              is a parameter estimator. Alpha is given by 1/[NB_D].
              Higher values are usually used with higher [NB_C],
              and related to confident bets. When [NB_D] is one,
              the probabilities assume a Laplacian distribution.
      [NB_I]: (integer {0,1,2}) number to define if a sub-program
              which addresses the specific properties of DNA
              sequences (Inverted repeats) is used or not. The
              number 1 turns ON the sub-program using at the same
              time the regular context model. The number 2 does
              only contemple the inversions only (NO regular). The
              number 0 does not contemple its use (Inverted repeats
              OFF). The use of this sub-program increases the
              necessary time to compress but it does not affect the
              RAM.
      [NB_G]: (real [0;1)) real number to define gamma. This value
              represents the decayment forgetting factor of the
              regular context model in definition.
      [NB_S]: (integer [0;20]) maximum number of editions allowed
              to use a substitutional tolerant model with the same
              memory model of the regular context model with
              order size equal to [NB_C]. The value 0 stands for
              turning the tolerant context model off. When the
              model is on, it pauses when the number of editions
              is higher that [NB_C], while it is turned on when
              a complete match of size [NB_C] is seen again. This
              is probabilistic-algorithmic model very useful to
              handle the high substitutional nature of genomic
              sequences. When [NB_S] > 0, the compressor used more
              processing time, but uses the same RAM and, usually,
              achieves a substantial higher compression ratio. The
              impact of this model is usually only noticed for
              higher [NB_C].
      [NB_R]: (integer {0,1}) number to define if a sub-program
              which addresses the specific properties of DNA
              sequences (Inverted repeats) is used or not. It is
              similar to the [NR_I] but for tolerant models.
      [NB_E]: (integer [1;5000]) denominator to build alpha for
              substitutional tolerant context model. It is
              analogous to [NB_D], however to be only used in the
              probabilistic model for computing the statistics of
              the substitutional tolerant context model.
      [NB_A]: (real [0;1)) real number to define gamma. This value
              represents the decayment forgetting factor of the
              substitutional tolerant context model in definition.
              Its definition and use is analogus to [NB_G].

      ... (you may use several context models)
```
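To make the role of `[NB_C]` and `[NB_D]` concrete, here is a minimal Python sketch of an order-k context model with an additive estimator. It assumes the common additive-smoothing form, P(s | context) = (n_s + alpha) / (n + 4 * alpha) with alpha = 1/`[NB_D]`; the text above only states how alpha is built, so the exact formula used by JARVIS3 is an assumption here. The function names `train` and `estimate` are hypothetical, and the gamma forgetting factor is left out.

```python
from collections import defaultdict

ALPHABET = "ACGT"


def train(sequence, order):
    """Count which symbol follows each context of length `order` (the role of NB_C)."""
    table = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    for i in range(order, len(sequence)):
        table[sequence[i - order:i]][sequence[i]] += 1
    return table


def estimate(counts, symbol, nb_d):
    """Additive estimator with alpha = 1/nb_d (assumed form, see lead-in)."""
    alpha = 1.0 / nb_d
    total = sum(counts[s] for s in ALPHABET)
    return (counts[symbol] + alpha) / (total + len(ALPHABET) * alpha)


table = train("ACGTACGTACGA", order=2)
counts = table["CG"]  # after "CG": T was seen twice, A once
print(estimate(counts, "T", nb_d=1))   # alpha = 1, Laplace's rule: (2 + 1) / (3 + 4)
print(estimate(counts, "T", nb_d=20))  # alpha = 0.05, a more confident bet on T
```

With `nb_d=1` the estimate stays close to uniform when counts are small; a larger `nb_d` shrinks alpha, so the observed counts dominate sooner, which matches the "confident bets" remark above.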

                                                                   

                                                                   

```
      -rm [NB_R]:[NB_C]:[NB_B]:[NB_L]:[NB_G]:[NB_I]:[NB_W]:[NB_Y]
      Template of a repeat model.
      Parameters:
      [NB_R]: (integer [1;10000] maximum number of repeat models
              for the class. On very repetive sequences the RAM
              increases along with this value, however it also
              improves the compression capability.
      [NB_C]: (integer [1;14]) order size of the repeat context
              model. Higher values use more RAM but, usually, are
              related to a better compression score.
      [NB_B]: (real (0;1]) beta is a real value, which is a
              parameter for discarding or maintaining a certain
              repeat model.
      [NB_L]: (integer (1;20]) a limit threshold to play with
              [NB_B]. It accepts or not a certain repeat model.
      [NB_G]: (real [0;1)) real number to define gamma. This value
              represents the decayment forgetting factor of the
              regular context model in definition.
      [NB_I]: (integer {0,1,2}) number to define if a sub-program
              which addresses the specific properties of DNA
              sequences (Inverted repeats) is used or not. The
              number 1 turns ON the sub-program using at the same
              time the regular context model. The number 0 does
              not contemple its use (Inverted repeats OFF). The
              number 2 uses exclusively Inverted repeats. The
              use of this sub-program increases the necessary time
              to compress but it does not affect the RAM.
      [NB_W]: (real (0;1)) initial weight for the repeat class.
      [NB_Y]: (integer {0}, [1;50]) maximum cache size. This will
              use a table cache with the specified size. The size
              must be in balance with the k-mer size [NB_C].
```
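The three `[NB_I]` modes can be illustrated with a toy k-mer lookup. In DNA, an inverted repeat is a reverse-complemented copy of an earlier segment. The Python sketch below is not JARVIS3's repeat model; it is a hypothetical `find_repeats` helper that only shows which earlier k-mers each mode would be allowed to match: direct copies only (0), direct and inverted (1), or inverted only (2).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def find_repeats(sequence, k, nb_i=1):
    """List (position, earlier_position, kind) for k-mers that were seen before."""
    seen = {}   # k-mer -> position of its first occurrence
    hits = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        rc = reverse_complement(kmer)
        if nb_i in (0, 1) and kmer in seen:
            hits.append((i, seen[kmer], "direct"))
        elif nb_i in (1, 2) and rc in seen:
            hits.append((i, seen[rc], "inverted"))
        seen.setdefault(kmer, i)
    return hits


# "CCGT" at position 8 is the reverse complement of "ACGG" at position 0.
print(find_repeats("ACGGTTTTCCGT", k=4, nb_i=1))
print(find_repeats("ACGGTTTTCCGT", k=4, nb_i=0))
```

The inverted match is found with `nb_i=1` or `nb_i=2` and missed with `nb_i=0`, at the cost of one extra lookup per position and no extra table.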

#### Compression and decompression of FASTA and FASTQ data ####

First, make the script executable by running the following in the src/ folder:

```
chmod +x JARVIS3.sh
```

The FASTA and FASTQ extension has its own menu, which lists the available parameters:

```
./JARVIS3.sh --help
```

This will output the following menu:


```
 -------------------------------------------------------

 JARVIS3, v3.7. High reference-free compression of DNA
                sequences, FASTA data, and FASTQ data.

 Program options ---------------------------------------

 -h, --help                   Show this,
 -a, --about                  Show program information,
 -c, --install                Install/compile programs,
 -s, --show                   Show compression levels,

 -l , --level       JARVIS3 compression level,
 -b , --block       Block size to be splitted,
 -t , --threads     Number of JARVIS3 threads,

 -dn, --dna                   Assume DNA sequence type,
 -fa, --fasta                 Assume FASTA data type,
 -fq, --fastq                 Assume FASTQ data type,
 -au, --automatic             Detect data type (def),

 -d, --decompress             Decompression mode,

 Input options -----------------------------------------

 -i , --input     Input DNA filename.

 Example -----------------------------------------------

 ./JARVIS3.sh --block 16MB -t 8 -i sample.seq
 ./JARVIS3.sh --decompress -t 4 -i sample.seq.tar

 -------------------------------------------------------
```

Preparing JARVIS3 for FASTA and FASTQ:

```
./JARVIS3.sh --install
```

Compression of FASTA data:

```
./JARVIS3.sh --threads 8 --fasta --block 10MB --input sample.fa
```

Decompression of FASTA data:

```
./JARVIS3.sh --decompress --fasta --threads 4 --input sample.fa.tar
```

Compression of FASTQ data:

```
./JARVIS3.sh --threads 8 --fastq --block 40MB --input sample.fq
```

Decompression of FASTQ data:

```
./JARVIS3.sh --decompress --fastq --threads 4 --input sample.fq.tar
```

### Citation ###


Sousa, Maria JP, Armando J. Pinho, and Diogo Pratas. "JARVIS3: an efficient encoder for genomic data." Bioinformatics 40.12 (2024): btae725.

### Issues ###

Please report any issue at the [issues page](https://github.com/cobilab/JARVIS3/issues).

### License ###

[![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-blue.svg)](LICENSE)

For more information:

http://www.gnu.org/licenses/gpl-3.0.html
