An open API service indexing awesome lists of open source software.

https://github.com/thodkatz/chapel-kmer

A parallel implementation of the k-mer counting problem using Chapel.
https://github.com/thodkatz/chapel-kmer

chapel parallel

Last synced: about 2 months ago
JSON representation

A parallel implementation of the k-mer counting problem using Chapel.

Awesome Lists containing this project

README

        







# k-mer counting using Chapel

## Directory structure

```
src
├── kmer
│   ├── benchmarking.chpl
│   └── kmer.chpl
└── wordcount
├── benchmarking.chpl
├── word_counting_parallel.chpl
└── word_counting_serial.chpl
```

In the `wordcount` directory, we are trying to count how many times each unique word of a random generated text (e.g. Lorem Ipsum) is appeared within the text. The goal is to be familiar with **Chapel** and the **Map** standard module. We implemented a naive version, without considering performance.

In the `kmer` directory, in analogous way with the `wordcount`, we are trying to implement the kmer-counting problem. Since it is computationally demanding for large sequences, we strive to find an efficient parallel solution. So far we have implemented a **standalone parallel iterator** taking into account memory limitations by using `yield` statements. Currently, we exploit **only shared memory** parallel scheme without any distribution technique. There are also many different approaches on this. [Jellyfish](https://github.com/gmarcais/Jellyfish), a kmer-counting project, leverage pthreads and an efficient [lock-free](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3051319/) mechanism (compare and swap named as CAS) to avoid the heavy usage of mutexes that kill performance. It should be noted that the Map standard module of Chapel is using also locks. As an optimization, a lock-free structure can be potentially implemented in Chapel with an efficient hash table.

The **input** of this program is intended to be files with the standardized **fasta** format. In order to parse these multiline files, we need to extract the valuable information of the genome sequence neglecting the newline characters. To create such a friendly, linear format, we use `awk`. This helped us to balance the work between threads while not missing any possible kmer when changing lines.

To linearise the fasta file, then:
```
make fasta FASTA=
```
This will create a `linear.fasta` ready to be parsed for our code.

Expected fasta files should have the following structure. Until now, we don't support multiple fasta format:
```
>text
CTTCGAAAGTTTGGGCCGAGTCTTACAGTCGGTCTTGAAGCAAAGTAACGAACTCCACGG..
CCCTGACTACCGAACCAGTTGTGAGTACTCAACTGGGTGAGAGTGCAGTCCCTATTGAGT..
.
.
.
```

## Usage

For the kmer solution:
```
git clone https://github.com/thodkatz/chapel-kmer.git
cd chapel-kmer/
make fasta FASTA=
make kmer
./bin/kmer --k=
```
Optionally we can override the `linear.fasta` generated by `make fasta` when running the program with `./bin/kmer --pathFasta=`.

The kmer table is generated as a text file with different versions, each one representing a different implementation but with the same output format. Currently we have one serial and one parallel version. For validation, we can check the differences between these generated files. The serial ones is used mainly for validation.

## Output format

```
(file.fasta)

>Rosalind_6431
CTTCGAAAGTTTGGGCCGAGTCTT
```

```
(kmer_parallel.txt)

kmer:count
CGA:2
GTT:1
AGT:2
GAA:1
TGG:1
GAG:1
TTG:1
TTC:1
GCC:1
CTT:2
GTC:1
CCG:1
AAG:1
TTT:1
AAA:1
TCT:1
GGG:1
TCG:1
GGC:1
```

## Related kmer-projects
- [Jellyfish](https://github.com/gmarcais/Jellyfish)
- [Gerbil: A fast and memory-efficient k-mer counter with GPU-support](https://github.com/uni-halle/gerbil)
- [KMC: Fast and frugal disk based k-mer counter](https://github.com/refresh-bio/KMC)