https://github.com/pirovc/metameta
- Host: GitHub
- URL: https://github.com/pirovc/metameta
- Owner: pirovc
- License: other
- Created: 2016-11-29T15:46:56.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2019-09-27T12:36:03.000Z (over 6 years ago)
- Last Synced: 2024-05-22T15:32:50.919Z (over 1 year ago)
- Topics: binning, metagenomics, pipeline, profiling, snakemake, taxonomy
- Language: Python
- Size: 14.2 MB
- Stars: 23
- Watchers: 5
- Forks: 10
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-microbes - MetaMeta - [python] - Integrates metagenome analysis tools to improve taxonomic profiling. [doi](https://doi.org/10.1101/138578) (Metagenomics (WGS, Shotgun sequencing))
README
# MetaMeta: Integrating metagenome analysis tools to improve taxonomic profiling
Vitor C. Piro (vitorpiro@gmail.com)
[Bioconda recipe](http://bioconda.github.io/recipes/metameta/README.html)
Piro, V. C., Matschkowski, M., & Renard, B. Y. (2017). MetaMeta: integrating metagenome analysis tools to improve taxonomic profiling. Microbiome, 5(1), 101. http://doi.org/10.1186/s40168-017-0318-y
Install:
--------
Miniconda:
# Download conda installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Set permissions to execute
chmod +x Miniconda3-latest-Linux-x86_64.sh
# Execute. Make sure to answer "yes" when asked to add conda to your PATH
./Miniconda3-latest-Linux-x86_64.sh
# Add channels
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
MetaMeta:
conda install metameta=1.2.0
* All other tools and dependencies are installed in their own environment automatically on the first run (with `--use-conda` parameter active).
Alternatively, install MetaMeta in a separate environment (named "metametaenv") with the command:
conda create -n metametaenv metameta=1.2.0
source activate metametaenv # Command to activate the environment. To deactivate use "source deactivate"
Run:
----
Create a configuration file (yourconfig.yaml) with the required fields (workdir, dbdir and samples):
workdir: "/home/user/folder/results/"
dbdir: "/home/user/folder/databases/"
samples:
  sample_name_1:
    fq1: "/home/user/folder/reads/file.1.fq"
    fq2: "/home/user/folder/reads/file.2.fq"
* All paths set on this file are relative to the workdir (if not absolute)
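The required fields and the path rule above can be sketched in Python. This is a minimal illustration, not MetaMeta's own validation code: the helpers `missing_fields` and `resolve_path` are hypothetical names, the configuration is written as a plain dictionary instead of being parsed from YAML, and the relative path `reads/file.1.fq` is an invented example.

```python
import posixpath

# Fields the README lists as required in the configuration file.
REQUIRED_FIELDS = ("workdir", "dbdir", "samples")


def missing_fields(config):
    """Return the required top-level fields that are absent from the config."""
    return [field for field in REQUIRED_FIELDS if field not in config]


def resolve_path(workdir, path):
    """Interpret a path relative to workdir unless it is already absolute."""
    return path if posixpath.isabs(path) else posixpath.join(workdir, path)


# Hypothetical configuration mirroring the YAML example above.
config = {
    "workdir": "/home/user/folder/results/",
    "dbdir": "/home/user/folder/databases/",
    "samples": {
        "sample_name_1": {
            "fq1": "reads/file.1.fq",                    # relative: resolved against workdir
            "fq2": "/home/user/folder/reads/file.2.fq",  # absolute: kept as-is
        }
    },
}

print(missing_fields(config))
for sample, reads in config["samples"].items():
    for key, path in reads.items():
        print(sample, key, resolve_path(config["workdir"], path))
```

Running `metameta --configfile yourconfig.yaml -np` remains the authoritative check, since it lets the pipeline itself report which rules and files it expects.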
Check rules and output files:
metameta --configfile yourconfig.yaml -np
Run MetaMeta:
metameta --configfile yourconfig.yaml --use-conda --keep-going --cores 24
* Alternatively, make a copy of the configuration file for the complete set of parameters ``cp ~/miniconda3/opt/metameta/config/example_complete.yaml yourconfig.yaml``
* The number of `--cores` is the total amount available to the pipeline. The number of threads for each tool should be set in the configuration file (yourconfig.yaml) with the parameter `threads`
* On the first run MetaMeta will download and install the configured tools as well as the database files (`archaea_bacteria_201503` by default - see below) necessary for each tool.
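As a rough illustration of how `--cores` and `threads` interact: Snakemake keeps the combined threads of concurrently running jobs within the total given by `--cores`, and caps a single job's threads at that total. The back-of-the-envelope sketch below assumes every tool job requests the same number of threads; `max_parallel_jobs` is a hypothetical helper, not part of MetaMeta or Snakemake.

```python
def max_parallel_jobs(cores, threads):
    """Estimate how many equally sized jobs fit into the core budget at once."""
    if cores < 1 or threads < 1:
        raise ValueError("cores and threads must be positive integers")
    # A single job cannot use more threads than the total number of cores.
    effective_threads = min(threads, cores)
    return cores // effective_threads


print(max_parallel_jobs(24, 6))   # 24 cores, 6 threads per tool
print(max_parallel_jobs(24, 32))  # threads above the core budget are capped
```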
Pre-configured databases:
-------------------------
Available databases:
| Info | Date | metameta database name |
| --- | --- | --- |
| Archaea + Bacteria - RefSeq Complete Genomes | 2015-03 | `archaea_bacteria_201503` |
| Fungal + Viral - RefSeq Complete Genomes | 2017-09 | `fungi_viral_201709` |
Database availability per tool:
| database | clark | dudes | gottcha | kaiju | kraken | motus |
| --- | --- | --- | --- | --- | --- | --- |
| `archaea_bacteria_201503` | [Yes](https://zenodo.org/record/820055) | [Yes](https://zenodo.org/record/820053) | [Yes](https://zenodo.org/record/819341) | [Yes](https://zenodo.org/record/819425) | [Yes](https://zenodo.org/record/819363) | [Yes](https://zenodo.org/record/819365) |
| `fungi_viral_201709` | [Yes](https://zenodo.org/record/1044318) | [Yes](https://zenodo.org/record/1044328) | No | [Yes](https://zenodo.org/record/1044326) | [Yes](https://zenodo.org/record/1044330) | No |
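The availability table can also be written as a small lookup, which makes it easy to check a planned tool selection before a run. This is only a sketch built from the table above; `AVAILABILITY` and `missing_tools` are hypothetical names and not part of MetaMeta.

```python
# Tools with a pre-configured database, transcribed from the table above.
AVAILABILITY = {
    "archaea_bacteria_201503": {"clark", "dudes", "gottcha", "kaiju", "kraken", "motus"},
    "fungi_viral_201709": {"clark", "dudes", "kaiju", "kraken"},
}


def missing_tools(database, tools):
    """Return the requested tools that have no pre-configured database of this name."""
    available = AVAILABILITY[database]
    return sorted(tool for tool in tools if tool not in available)


print(missing_tools("fungi_viral_201709", ["clark", "gottcha", "motus"]))
print(missing_tools("archaea_bacteria_201503", ["clark", "gottcha", "motus"]))
```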
Running sample data:
--------------------
cd ~/miniconda3/opt/metameta/
Pre-configured Archaea and Bacteria database:
./metameta --configfile sampledata/sample_data_archaea_bacteria.yaml --use-conda --keep-going --cores 6
Custom database (some viral reference genomes):
./metameta --configfile sampledata/sample_data_custom_viral.yaml --use-conda --keep-going --cores 6
Results:
cd sampledata/results/
Running MetaMeta on a cluster environment:
------------------------------------------
Make a copy of cluster configuration file:
cp ~/miniconda3/opt/metameta/config/cluster.json yourcluster.json
Edit the file with your cluster specifications (threads, partitions, cpu/memory, etc) for each rule.
Run MetaMeta (slurm example):
metameta --configfile yourconfig.yaml --keep-going --use-conda -j 999 --cluster-config yourcluster.json --cluster "sbatch --job-name {cluster.job-name} --output {cluster.output} --partition {cluster.partition} --nodes {cluster.nodes} --cpus-per-task {cluster.cpus-per-task} --mem {cluster.mem} --time {cluster.time}"
* You can change the cluster command (`sbatch`) and adapt it to your cluster system.
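To make the `{cluster.*}` placeholders less abstract, the sketch below mimics how they could be filled from a cluster configuration. Snakemake performs the real substitution; this is only an approximation. The `__default__` section is Snakemake's cluster-config convention, while the rule name `kraken_run_1`, all values, and the helper `fill_cluster_command` are hypothetical.

```python
import re


def fill_cluster_command(template, cluster_config, rule):
    """Replace {cluster.KEY} placeholders, letting rule values override __default__."""
    values = dict(cluster_config.get("__default__", {}))
    values.update(cluster_config.get(rule, {}))
    return re.sub(
        r"\{cluster\.([\w-]+)\}",
        lambda match: str(values[match.group(1)]),
        template,
    )


# Hypothetical content of yourcluster.json, written as a Python dictionary.
cluster_config = {
    "__default__": {"job-name": "metameta", "partition": "main",
                    "cpus-per-task": 1, "mem": "4G"},
    "kraken_run_1": {"cpus-per-task": 8, "mem": "64G"},
}

template = ("sbatch --job-name {cluster.job-name} --partition {cluster.partition} "
            "--cpus-per-task {cluster.cpus-per-task} --mem {cluster.mem}")

print(fill_cluster_command(template, cluster_config, "kraken_run_1"))
```

A rule without its own entry simply falls back to the `__default__` values.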
Custom databases:
-----------------
By default MetaMeta uses Archaea and Bacteria sequences as its reference database (`archaea_bacteria_201503` - see above). Additionally, MetaMeta allows the creation of custom databases.
First, select which databases should be used in the configuration file:
databases:
- archaea_bacteria_201503
- custom_db
* All samples will run against the "archaea_bacteria_201503" and the new "custom_db" databases
Second, create an entry with the path to the sequences that should be added to the custom database:
custom_db:
  clark: "sampledata/database/"
  dudes: "sampledata/database/"
  kaiju: "sampledata/database/"
  kraken: "sampledata/database/"
* clark and dudes require one or more fasta files (extension .fna) with the accession.version identifier after the header ">" (e.g. ">NC_001998.1 Guinea pig Chlamydia phage, complete genome")
* kaiju requires one or more GenBank flat files (extension .gbff)
* kraken requires one or more fasta files (extension .fna) with the gi identifier on the header (e.g. ">gi|9632287|ref|NC_001998.1| Guinea pig Chlamydia phage, complete genome")
MetaMeta will compile the "custom_db" on the first run and use it as a database. Once that run has finished, the database definition can be deleted from the configuration file for subsequent runs.
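Before pointing a tool at a folder of reference sequences, it can help to check which header style the FASTA files use. The sketch below classifies a header line with two regular expressions derived from the examples above; the patterns and the function name `header_style` are assumptions for illustration and may be stricter or looser than what each tool accepts.

```python
import re

# ">ACCESSION.VERSION description" (clark and dudes),
# e.g. ">NC_001998.1 Guinea pig Chlamydia phage, complete genome"
ACCESSION_HEADER = re.compile(r"^>[A-Za-z0-9_]+\.\d+(\s|$)")
# ">gi|NUMBER|..." (kraken, old NCBI style)
GI_HEADER = re.compile(r"^>gi\|\d+\|")


def header_style(header):
    """Classify a FASTA header as 'gi', 'accession.version' or 'unknown'."""
    if GI_HEADER.match(header):
        return "gi"
    if ACCESSION_HEADER.match(header):
        return "accession.version"
    return "unknown"


print(header_style(">NC_001998.1 Guinea pig Chlamydia phage, complete genome"))
print(header_style(">gi|9632287|ref|NC_001998.1| Guinea pig Chlamydia phage, complete genome"))
```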
Creating a custom database based on NCBI genomes:
-------------------------------------------------
It is possible to create a custom database based on a set of genomes from NCBI.
Download the genome_updater script:
git clone https://github.com/pirovc/genome_updater
Download the desired database:
Example -> All fungi genomes available on refseq, fasta and GenBank formats with 6 threads:
./genome_updater.sh -d "refseq" -g "fungi" -f "genomic.fna.gz,genomic.gbff.gz" -t 6 -o fungi_genomes/
mkdir -p custom_fungi_db/clark_dudes/ custom_fungi_db/kaiju/ custom_fungi_db/kraken/
Extract files:
clark and dudes:
zcat fungi_genomes/files/*.fna.gz > custom_fungi_db/clark_dudes/fungi_genomes.fna
kaiju:
zcat fungi_genomes/files/*.gbff.gz > custom_fungi_db/kaiju/fungi_genomes.gbff
kraken (with header conversion to GI, old NCBI style):
zcat fungi_genomes/files/*.fna.gz | awk '{if(substr($0, 1, 1)==">"){sep=index($0," ");acc=substr($0,2,sep-2);header=substr($0,sep+1); cmd="wget -qO - \"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id="acc"&rettype=gi\""; cmd | getline gi; close(cmd); print ">gi|" gi "|ref|" acc "| " header }else{ print $0 }}' > custom_fungi_db/kraken/fungi_genomes.fna
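The header rewriting done by the awk one-liner can be sketched in Python as follows. To stay self-contained, the sketch takes the GI numbers from a lookup dictionary instead of querying NCBI, so the lookup step itself is not shown. The function names `to_gi_header` and `convert_fasta` are hypothetical, and the sequence line is a made-up placeholder.

```python
def to_gi_header(header, gi):
    """Rewrite '>ACC description' as '>gi|GI|ref|ACC| description'."""
    accession, _, description = header[1:].partition(" ")
    return ">gi|{}|ref|{}| {}".format(gi, accession, description)


def convert_fasta(lines, gi_by_accession):
    """Yield FASTA lines with headers converted; sequence lines pass through unchanged."""
    for line in lines:
        if line.startswith(">"):
            accession = line[1:].split(" ", 1)[0]
            yield to_gi_header(line, gi_by_accession[accession])
        else:
            yield line


records = [
    ">NC_001998.1 Guinea pig Chlamydia phage, complete genome",
    "ACGTACGT",  # placeholder sequence, not real data
]
for line in convert_fasta(records, {"NC_001998.1": "9632287"}):
    print(line)
```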
Add entry on the configuration file:
databases:
- new_custom_fungi_db
Finally, add the path for each set of reference sequences on the configuration file:
new_custom_fungi_db:
  clark: "custom_fungi_db/clark_dudes/"
  dudes: "custom_fungi_db/clark_dudes/"
  kaiju: "custom_fungi_db/kaiju/"
  kraken: "custom_fungi_db/kraken/"
On the first run MetaMeta will compile the "new_custom_fungi_db" database for each configured tool. Once that run has finished, the database definition can be deleted from the configuration file for subsequent runs.
Pre-install a complete environment:
-----------------------------------
wget https://raw.githubusercontent.com/pirovc/metameta/master/envs/metameta_complete.yaml
conda env create -f metameta_complete.yaml
source activate metametaenv_complete
Merging final results:
----------------------
To merge final results from many samples into one final tabular file:
~/miniconda3/opt/metameta/scripts/merge_final_profiles.sh workdir/samples_*/metametamerge/database/final.metametamerge.profile.out
Folder structure:
-----------------
MetaMeta can run several tools with several samples against several databases. The files on the working directory and database directory are organized in the structure below:
WORKDIR:
    SAMPLE_1/
        TOOL_1/ (*)
            DB_1/
            DB_2/
            ...
        TOOL_2/ (*)
        ...
        PROFILES/
            DB_1/
                TOOL_1.profile.out
                TOOL_2.profile.out
                ...
            DB_2/
            ...
        METAMETAMERGE/
            DB_1/
                FINAL_PROFILE.out
                FINAL_PROFILE_KRONA.html
            DB_2/
            ...
        LOG/
            DB_1/
            DB_2/
            ...
        READS/ (*)
            TOOL_1.1.fq
            TOOL_1.2.fq
            TOOL_2.1.fq
            TOOL_2.2.fq
            ...
    SAMPLE_2/
    ...
    CLUSTERLOG/ (**)
DBDIR:
    DB_1/
        TOOL_1_DB/
        TOOL_2_DB/
        ...
        TOOL_1.dbprofile.out
        TOOL_2.dbprofile.out
        ...
        LOG/
    DB_2/
    ...
    TAXONOMY/
    LOG/
(\*) removed when keepfiles=0
(\*\*) only when running on cluster mode
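Given this layout, the location of a sample's final profile can be built programmatically. The sketch below is an assumption-based illustration: the lower-case folder name `metametamerge` and the file name `final.metametamerge.profile.out` are taken from the merge command in the "Merging final results" section, and `final_profile_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def final_profile_path(workdir, sample, database):
    """Build the expected path of the merged profile for one sample and database."""
    return (PurePosixPath(workdir) / sample / "metametamerge" / database
            / "final.metametamerge.profile.out")


print(final_profile_path("/home/user/folder/results/", "sample_name_1",
                         "archaea_bacteria_201503"))
```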
Adding a new tool:
------------------
MetaMeta integrates profiling and binning tools and has 6 pre-configured tools (clark, dudes, gottcha, kaiju, kraken and motus). To be added to the pipeline, a new tool must use the NCBI Taxonomy structure and nomenclature/identifiers. MetaMeta accepts the BioBoxes format directly (https://github.com/bioboxes/rfc/tree/master/data-format) or a .tsv file in the following format:
- Profiling: rank, taxon name or taxid, abundance
Example:
genus Methanospirillum 0.0029
genus Thermus 0.0029
genus 568394 0.0029
species Arthrobacter sp. FB24 0.0835
species 195 0.0582
species Mycoplasma gallisepticum 0.0536
- Binning: readid, taxon name or taxid, length of sequence assigned
Example:
M2|S1|R140 354 201
M2|S1|R142 195 201
M2|S1|R145 457425 201
M2|S1|R146 562 201
M2|S1|R147 1245471 201
M2|S1|R150 354 201
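The two formats can be read with a few lines of Python. The sketch below assumes the columns are separated by tabs (as the .tsv extension suggests) and treats an all-digit taxon field as a taxid; the function names are hypothetical and this is not the parser MetaMeta itself uses.

```python
def parse_taxon(field):
    """Split the taxon column into (taxid, name); exactly one of them is None."""
    return (int(field), None) if field.isdigit() else (None, field)


def parse_profile_line(line):
    """Parse 'rank<TAB>taxon name or taxid<TAB>abundance'."""
    rank, taxon, abundance = line.rstrip("\n").split("\t")
    taxid, name = parse_taxon(taxon)
    return {"rank": rank, "taxid": taxid, "name": name, "abundance": float(abundance)}


def parse_binning_line(line):
    """Parse 'readid<TAB>taxon name or taxid<TAB>length of sequence assigned'."""
    read_id, taxon, length = line.rstrip("\n").split("\t")
    taxid, name = parse_taxon(taxon)
    return {"read_id": read_id, "taxid": taxid, "name": name, "length": int(length)}


print(parse_profile_line("species\tArthrobacter sp. FB24\t0.0835"))
print(parse_profile_line("genus\t568394\t0.0029"))
print(parse_binning_line("M2|S1|R140\t354\t201"))
```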
The MetaMeta pipeline uses Snakemake. To add a new tool to the pipeline, two main files must be created, as described below. Replace 'newtool' with the tool identifier (lower case, no spaces, no special chars):
tools/newtool.sm -> specifies how to execute the tool
Rules:
- newtool_run_1[..n] -> one or more rules necessary to run the tool
- newtool_rpt -> final rule that should output a file newtool.profile.out in an accepted output format (described above)
tools/newtool_db_custom.sm -> specifies how to download/compile the database/references
Rules:
- newtool_db_custom_1[..n] -> one or more rules necessary to compile the database.
- newtool_db_custom_profile -> this rule automatically generates the database profile. Its output should be a file (newtool.dbaccession.out) with the accession.version identifier of every sequence used in the database.
- newtool_db_custom_check -> rule to check the required database files. Its input should list all mandatory files that must be present for the database to work properly.
* Template files can be found inside the folder tools/template. Once the two files are inside the tools folder, it is necessary to add the tool identifier to the YAML configuration file.
Changelog:
----------
v1.2.0)
- Updated to Snakemake 4.3.0 (from 3.9.1)
- Bug fixes on custom database creation and database profile generation
- Centralized taxonomy download (once for all tools, kept on dbdir:taxonomy/)
- Updated tools: kaiju 1.0 -> 1.4.5, dudes 0.07 -> 0.08, spades 3.9.0 -> 3.11.1
- Addition of new pre-configured databases: fungi_viral_201709
- Multiple pre-configured databases support
- Several fixes on custom database creation
v1.1.1) Bug fixes parsing output files for kraken and kaiju
v1.1) Support single and paired-end reads, multiple and custom databases, krona integration