https://github.com/dieterich-lab/mAFiA

Host: GitHub
URL: https://github.com/dieterich-lab/mAFiA
Owner: dieterich-lab
Created: 2023-06-01T15:07:59.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-04-22T11:36:05.000Z (about 1 year ago)
Last Synced: 2024-04-22T12:04:26.239Z (about 1 year ago)
Language: Python
Size: 41 MB
Stars: 1
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-nanopore - mAFiA - [Python] - [Detecting m6A at single-molecular resolution via direct RNA sequencing and realistic training data](https://doi.org/10.1038/s41467-024-47661-2) (Software packages / RNA modification analysis)

README

![Logo](logo.png "mAFiA")

# mAFiA - (Another) m⁶A Finding Algorithm

Here we provide a brief walkthrough to run mAFiA, using the example of chromosome X.

## 0. Preliminary

- The following steps were tested with Python version 3.9. Get the code and activate a virtual environment, e.g.:

```
git clone https://github.com/dieterich-lab/mAFiA.git
cd mAFiA
python3 -m venv mafia-venv
source mafia-venv/bin/activate
```

If your pip version is older than 21.3, upgrade it:

```
python3 -m pip install --upgrade pip
```

Install the package in editable mode:

```
pip install -e .
```

- Download the models and data from [Zenodo](https://zenodo.org/record/8321727).

    - The folder "models" contains:

        - backbone.torch: [RODAN](https://github.com/biodlab/RODAN)-based neural network for basecalling and feature extraction

        - backbone.config: training configuration for backbone

        - classifiers: pickled logistic regression models

    - The folder "data" contains a subset of input data on chr X:

        - fast5_chrX: dRNA-Seq raw data from HEK293 WT mRNA

        - GRCh38_96.X: genome reference

        - GLORI_chrX.bed: query modification sites in bed format. This file specifically corresponds to those listed in [GLORI](https://www.nature.com/articles/s41587-022-01487-9).

- Assume that the data and models are unzipped to ${data} and ${models} respectively, and that your output directory is ${output}:

```
backbone="${models}/backbone.torch"
classifiers="${models}/classifiers"
fast5dir="${data}/fast5_chrX"
ref="${data}/GRCh38_96.X.fa"
mod="${data}/GLORI_chrX.bed"

mkdir -p "${output}"
basecall="${output}/rodan.fasta"
bam="${output}/minimap.q50.bam"
```
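Before starting the longer steps, it can help to confirm that the downloaded files are where the variables above expect them. The following is a minimal sketch, not part of mAFiA: the helper names `expected_inputs` and `missing_inputs` are hypothetical, and the file names are simply those listed in this walkthrough.

```python
from pathlib import Path


def expected_inputs(models_dir, data_dir):
    """Return the input paths named in this walkthrough, keyed by role."""
    models, data = Path(models_dir), Path(data_dir)
    return {
        "backbone": models / "backbone.torch",
        "classifiers": models / "classifiers",
        "fast5dir": data / "fast5_chrX",
        "ref": data / "GRCh38_96.X.fa",
        "mod": data / "GLORI_chrX.bed",
    }


def missing_inputs(models_dir, data_dir):
    """List the roles whose file or directory does not exist on disk."""
    return [
        role
        for role, path in expected_inputs(models_dir, data_dir).items()
        if not path.exists()
    ]
```

Calling `missing_inputs("/path/to/models", "/path/to/data")` returns an empty list when everything is in place.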

## 1. Basecalling

The basecalling script is adapted from the [RODAN](https://github.com/biodlab/RODAN) repository. Assume that ${mafia} is your code directory.

```
python3 ${mafia}/RODAN/basecall.py \
--fast5dir ${fast5dir} \
--model ${backbone} \
--batchsize 4096 \
--outdir ${output}
```

On a reasonably modern GPU machine, this step should take less than 30 minutes.

## 2. Alignment

Align the basecalling results to the reference genome, then filter (mapping quality of at least 50), sort, and index the BAM file.

```
minimap2 --secondary=no -ax splice -uf -k14 -t 36 --cs ${ref} ${basecall} \
| samtools view -bST ${ref} -q50 - \
| samtools sort - > ${bam}
samtools index ${bam}
```
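To make the `-q50` filter above concrete, here is a small illustration of what it keeps: alignments whose MAPQ (the fifth column of a SAM record) is at least 50. This is only a sketch on plain SAM text lines; the function names are hypothetical, and the real filtering is done by `samtools view`.

```python
def passes_mapq(sam_line, min_mapq=50):
    """Return True if a SAM alignment line has MAPQ >= min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    return int(fields[4]) >= min_mapq  # column 5 of a SAM record is MAPQ


def filter_sam(lines, min_mapq=50):
    """Keep header lines (starting with '@') and confidently mapped alignments."""
    return [
        line
        for line in lines
        if line.startswith("@") or passes_mapq(line, min_mapq)
    ]
```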

## 3. mAFiA

After the standard procedures, we can now measure m⁶A stoichiometry of the sites specified in ${mod}.

```
test_mAFiA \
--bam_file ${bam} \
--fast5_dir ${fast5dir} \
--ref_file ${ref} \
--mod_file ${mod} \
--min_coverage 50 \
--max_num_reads 1000 \
--backbone_model_path ${backbone} \
--classifier_model_dir ${classifiers} \
--mod_prob_thresh 0.5 \
--out_dir ${output}
```

The last step should take less than 1 hour. We are currently working on integrating the feature extraction step directly into basecalling. The run-time should then be significantly reduced.
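The options `--min_coverage`, `--max_num_reads`, and `--mod_prob_thresh` suggest how per-read probabilities could be turned into a site-level stoichiometry. The sketch below is an assumption about that logic, not mAFiA's actual code: it treats a read as modified when its probability is at least the threshold, keeps the first `max_num_reads` reads, and returns a fraction between 0 and 1 without assuming the scale used in the output file. The function name `site_mod_ratio` is hypothetical.

```python
def site_mod_ratio(read_probs, mod_prob_thresh=0.5, min_coverage=50,
                   max_num_reads=1000):
    """Fraction of reads called modified at one site, or None if coverage is too low.

    Assumptions (not taken from mAFiA's source): the first max_num_reads reads
    are used, and a read counts as modified when its probability is >= the
    threshold.
    """
    probs = list(read_probs)[:max_num_reads]
    if len(probs) < min_coverage:
        return None  # site skipped: not enough covering reads
    n_modified = sum(p >= mod_prob_thresh for p in probs)
    return n_modified / len(probs)
```

For example, a site covered by 60 reads, 15 of which have a probability of 0.9 and the rest 0.1, would get a ratio of 0.25 under these assumptions.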

In your ${output} directory, you should now see two files:

- "mAFiA.sites.bed": List of sites with coverage above the minimum threshold (default 50). The column "modRatio" lists each site's stoichiometry.

- "mAFiA.reads.bam": Aligned reads identical to those in the input BAM file ${bam}, but with additional MM and ML tags that mark the location and modification probability in each individual read. The results can be visualized with, e.g., [IGV](https://software.broadinstitute.org/software/igv/).
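Both outputs can be inspected with a few lines of standard-library Python. The sketch below makes two assumptions that should be checked against your own files: that "mAFiA.sites.bed" is tab-separated with a header row containing a "modRatio" column, and that ML values follow the SAM tag convention, where an integer N in 0 to 255 stands for the probability interval N/256 to (N+1)/256. The helper names `load_sites` and `ml_to_probability` are hypothetical.

```python
import csv


def load_sites(bed_path, min_mod_ratio=0.0):
    """Read the sites file and keep rows whose modRatio is >= min_mod_ratio.

    Assumes a tab-separated file with a header row that names a "modRatio"
    column; all other columns are passed through untouched as strings.
    """
    with open(bed_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [row for row in reader if float(row["modRatio"]) >= min_mod_ratio]


def ml_to_probability(ml_value):
    """Map an ML tag byte (0-255) to the midpoint of the interval it encodes."""
    if not 0 <= ml_value <= 255:
        raise ValueError("ML values are unsigned 8-bit integers")
    return (ml_value + 0.5) / 256
```

Under this convention, an ML value of 128 or higher corresponds to a probability of at least 0.5, which matches the `--mod_prob_thresh 0.5` setting used above.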

The complete HEK293 WT dataset for all chromosomes can be downloaded from [zenodo](https://zenodo.org/record/8319583).
