# unite-train

A pipeline to build [Qiime2](https://qiime2.org/) taxonomy [classifiers](https://docs.qiime2.org/2021.11/data-resources/) for the [UNITE database](https://unite.ut.ee/repository.php).

## [Download a pre-trained classifier here! 🎁](https://github.com/colinbrislawn/unite-train/releases)

[![Issues](https://img.shields.io/github/issues/colinbrislawn/unite-train?style=for-the-badge)](https://github.com/colinbrislawn/unite-train/issues)
![pre-releases](https://img.shields.io/github/release-date-pre/colinbrislawn/unite-train?display_date=published_at&style=for-the-badge)
[![Downloads](https://img.shields.io/github/downloads/colinbrislawn/unite-train/total.svg?style=for-the-badge)](https://github.com/colinbrislawn/unite-train/releases)

### What is this?

If you are interested in Fungi 🍄🍄‍🟫, you can use their genomic fingerprints to identify them. Affordable PCR amplification and sequencing of the ITS region gives you these nucleic acid fingerprints, and the UNITE team provides a database that gives these sequences a name.

We can predict the taxonomy of our fungal fingerprints using an old-school machine learning method: a supervised [k-mer](https://en.wikipedia.org/wiki/K-mer)-based [naive Bayes classifier](https://scikit-learn.org/stable/modules/naive_bayes.html). But first, we need to prepare our database in a process called 'training.'
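
For context, the training step in Qiime2 boils down to a command like the sketch below. The file names are placeholders, and this is roughly the step the pipeline in this repo automates for each UNITE release:

```bash
# Rough sketch of the training step; file names are placeholders.
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads unite-ref-seqs.qza \
  --i-reference-taxonomy unite-ref-taxonomy.qza \
  --o-classifier unite-classifier.qza
```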

This is a pipeline that trains the UNITE ITS taxonomy database for use with Qiime2. You can run this pipeline yourself, but you don't have to! I've provided [ready-to-use pre-trained classifiers](https://github.com/colinbrislawn/unite-train/releases) so you can simply run [`qiime feature-classifier classify-sklearn`](https://docs.qiime2.org/2024.2/plugins/available/feature-classifier/classify-sklearn/).
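
A minimal usage sketch, assuming you've downloaded a classifier from the releases page; the file names below are placeholders for the classifier you grabbed and the representative sequences from your own study:

```bash
# Placeholder file names: substitute your downloaded classifier and your own rep seqs.
qiime feature-classifier classify-sklearn \
  --i-classifier unite-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza
```

Note that a classifier artifact generally has to be used with the same Qiime2 (scikit-learn) version it was trained with, so pick the release that matches your Qiime2 install.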

If you have questions about using Qiime2, ask on [the Qiime2 forums](https://forum.qiime2.org/).

If you have questions about the UNITE ITS database, [contact the UNITE team](https://unite.ut.ee/contact.php).

If you have questions about this pipeline, please [open a new issue](https://github.com/colinbrislawn/unite-train/issues/new)!

---

## Running the Snakemake workflow

Set up:

- Install [Mambaforge](https://github.com/conda-forge/miniforge#mambaforge) and configure [Bioconda](https://bioconda.github.io/).
- Install the version of [Qiime2](https://docs.qiime2.org/) you want using the recommended environment name.
(For a faster install, you can replace `conda` with `mamba`.)
- Install [Snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html) into an environment, then activate that environment.
(A rough command sketch follows this list.)
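
A minimal sketch of those steps, assuming the Qiime2 amplicon distribution. The environment file name below is a placeholder, so use the exact file and recommended environment name from the Qiime2 install docs for your release:

```bash
# Sketch only; the Qiime2 environment file name is a placeholder.
# Download the real one from https://docs.qiime2.org/ for your chosen release.
mamba env create -n qiime2-amplicon-2024.2 \
  --file qiime2-amplicon-2024.2-py38-linux-conda.yml

# Snakemake in its own environment, then activate it.
mamba create -n snakemake -c conda-forge -c bioconda snakemake
conda activate snakemake
```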

Configure:

- Open up `config/config.yaml` and configure it to your liking.
(For example, you may need to update the name of your Qiime2 environment; an illustrative snippet follows this list.)
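
For illustration only; the key name below is hypothetical, so check the comments in `config/config.yaml` itself for the real settings:

```bash
# Illustration only: the key name below is hypothetical.
# Edit the settings the file actually documents, e.g. the Qiime2 environment
# name you created during setup:
#
#   qiime2_env: qiime2-amplicon-2024.2
#
nano config/config.yaml
```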

Run:

```bash
# Up to 8 cores, conda-managed rule environments, and a ~10 GB memory budget.
snakemake --cores 8 --use-conda --resources mem_mb=10000
```

Training one classifier takes 1-9 hours on an [AMD EPYC 75F3 Milan](https://www.amd.com/en/products/cpu/amd-epyc-75f3), depending on the size and complexity of the data.

Run on a slurm cluster:

More specifically, the University of Florida HiPerGator supercomputer,
with access generously provided by the [Kawahara Lab](https://www.floridamuseum.ufl.edu/kawahara-lab/)!

```bash
screen    # start a persistent session
screen -r # reattach later (we land on a random login node, so this may not find the session)

snakemake --jobs 24 --slurm \
--rerun-incomplete --retries 3 \
--use-envmodules --latency-wait 10 \
--default-resources slurm_account=kawahara-b slurm_partition=hpg-milan
```

Run with Docker/Singularity containers:

For example, in 'the cloud' using [FlowDeploy](https://flowdeploy.com/).

```bash
snakemake --jobs 12 \
--rerun-incomplete --retries 3 \
--use-singularity \
--default-resources
```

Reports:

```bash
# Build an interactive HTML report of the completed run.
snakemake --report results/report.html
# Render the workflow DAG as a PDF (requires Graphviz's `dot`).
snakemake --forceall --dag --dryrun | dot -Tpdf > results/dag.pdf
```