Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/StephenHwang/MEMO

MEMO: MEM-based pangenome indexing for k-mer queries
https://github.com/StephenHwang/MEMO

Last synced: 28 days ago
JSON representation

MEMO: MEM-based pangenome indexing for k-mer queries

Awesome Lists containing this project

README

        

# MEMO: MEM-based pangenome indexing for _k_-mer queries ![GitHub release (latest by date)](https://img.shields.io/github/v/release/StephenHwang/MEMO) ![GitHub](https://img.shields.io/github/license/StephenHwang/MEMO?color=green)

Maximal Exact Match Ordered (MEMO) is a pangenome indexing method based on maximal exact matches (MEMs) between genomes. A single MEMO index can handle arbitrary-length-_k_ _k_-mer queries over pangenomic windows. MEMO performs membership queries for per-genome _k_-mer presence/absence and conservation queries for the number of genomes containing the _k_-mers in a window. MEMO achieves smaller index sizes and faster queries than _k_-mer-based approaches like KMC3 and PanKmer. See the small example here on running MEMO for visualizing sequence conservation.

## Installation
### Docker/Singularity Container
MEMO is available as a Docker image on DockerHub.
```sh
### Docker:
docker pull hwangstephen/memo:latest
docker run hwangstephen/memo:latest memo -h
### Singularity:
singularity pull docker://hwangstephen/memo:latest
./memo_latest.sif memo -h
```

### Build from source
MEMO relies on the following dependencies:
- Python:
- python (>=3.10)
- pandas
- plotnine
- pyarrow
- numba
- numpy
- Others:
- MONI
- samtools
- seqtk

Compile MONI from repo:
```
sudo apt-get install -y build-essential cmake git python3 zlib1g-dev
git clone https://github.com/maxrossi91/moni
mkdir build
cd build
cmake ..
make
make install
```

After downloading/building the required dependencies, clone and run MEMO from its repo:
```
git clone https://github.com/StephenHwang/MEMO.git
cd MEMO/src
./memo -h
```

## Usage
### Index Creation
To create a MEMO conservation index, specify a list of genomes `-g` and an output location `-o` and prefix `-p`. To create the MEMO membership index, include the `-m` flag.
Each line in the `genome_list.txt` is the path to each genome in the pangenome; the first genome listed is the pangenome pivot.
```sh
./memo index \
-g genome_list.txt \
-o output_dir \
-p output_prefix
```

### Querying k-mer membership and conservation
Once you have created your indexes, specify your length-_k_ `k`, genomic region `-r`, and the total number of genomes in your genome (inclusive of pivot) `-n`. Then run `memo query` for the conservation query. To run the membership query, include the `-m` flag.
```sh
./memo query \
-b index.parquet \
-k k \
-n num_genomes \
-r chr:start-end \
-o memo_c_out.txt
```

### Visualizing sequence conservation

hprc_hla_seq_conservation

31-mer sequence conservation of the Human Leucocyte Antigen locus in the HPRC pangenome.

After the conservation query, use MEMO to visualize sequence conservation:
```sh
./memo view \
-i memo_c_out.txt \
-o out.png \
-n num_genomes \
-b num_bins
```

## Citing MEMO
>Stephen Hwang, Nathaniel K. Brown, Omar Y. Ahmed, Katharine M. Jenike, Sam Kovaka, Michael C. Schatz, Ben Langmead. MEM-based pangenome indexing for k-mer queries (2024). bioRxiv.