https://github.com/HKU-BAL/Clair3

Clair3 - Symphonizing pileup and full-alignment for deep learning-based long-read variant calling
https://github.com/HKU-BAL/Clair3
computational-biology deep-learning genomics illumina long-reads nanopore ont-data ont-models pacbio variant-calling variant-detection
Last synced: 27 days ago
JSON representation
Clair3 - Symphonizing pileup and full-alignment for deep learning-based long-read variant calling
Host: GitHub
URL: https://github.com/HKU-BAL/Clair3
Owner: HKU-BAL
Created: 2021-03-30T11:02:58.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2026-06-09T03:06:38.000Z (30 days ago)
Last Synced: 2026-06-09T05:08:32.115Z (29 days ago)
Topics: computational-biology, deep-learning, genomics, illumina, long-reads, nanopore, ont-data, ont-models, pacbio, variant-calling, variant-detection
Language: Python
Homepage:
Size: 38.1 MB
Stars: 369
Watchers: 12
Forks: 38
Open Issues: 16
Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project

awesome-nanopore - Clair3 - [Python/C++] - [Clair3-RNA: a deep learning-based small variant caller for long-read RNA sequencing data](https://www.nature.com/articles/s41467-025-67237-y) (Software packages / Variant, SV calling, Phasing)
README

          


  

    

  

  
Clair3


  
Symphonizing pileup and full-alignment for deep-learning-based long-read variant calling


  


    

    

    

  



---

**Contact:** Ruibang Luo, Zhenxian Zheng, Xian Yu

**Email:** rbluo@cs.hku.hk · zxzheng@cs.hku.hk · yuxian@connect.hku.hk

---

## Introduction

**Clair3** is a germline small-variant caller for long-read sequencing. It combines two complementary models to balance speed and accuracy:

- **Pileup calling** — fast, handles the majority of variant candidates from summarized alignment statistics.

- **Full-alignment calling** — computationally intensive, resolves uncertain candidates from haplotype-resolved full alignments.

Clair3 is the 3rd generation of [Clair](https://github.com/HKU-BAL/Clair) (2nd) and [Clairvoyante](https://github.com/aquaskyline/Clairvoyante) (1st).

### Looking for a different variant caller?

| Use case | Tool |

| --- | --- |

| Germline on **long-read RNA-seq** | [Clair3-RNA](https://github.com/HKU-BAL/Clair3-RNA) |

| Somatic, **paired tumor/normal** | [ClairS](https://github.com/HKU-BAL/ClairS) |

| Somatic, **tumor-only** | [ClairS-TO](https://github.com/HKU-BAL/ClairS-TO) |

### Agent Skill

[Clair-skills](https://github.com/HKU-BAL/Clair-skills) is a plug-in for agentic AI coding assistants (Claude Code, Cursor, Codex, …) that covers the entire Clair suite. It helps the agent pick the right tool and model, generate ready-to-run commands, and analyze results.

---

## Contents

- [Latest Updates](#latest-updates)

- [Installation](#installation) — [Docker](#option-1-docker) · [Singularity](#option-2-singularity) · [Bioconda](#option-3-bioconda) · [Step-by-step (Conda)](#option-4-step-by-step-conda)

- [Pre-trained Models](#pre-trained-models)

- [Quick Demo](#quick-demo)

- [Usage](#usage)

- [Advanced Topics](#advanced-topics) — [Dwelling time](#dwelling-time-feature) · [Amplicon data](#dealing-with-amplicon-data) · [Postprocessing](#postprocessing-scripts)

- [Reference](#reference) — [Folder structure](#folder-structure-and-submodules) · [Training data](#training-data) · [VCF/GVCF formats](#vcfgvcf-output-formats) · [Model training guides](#model-training-guides)

- [Citation](#citation)

---

## Latest Updates

### v2.0.1 — *Apr 27, 2026*

- Added the ONT `r1041_e82_400bps_sup_v520_with_mv` signal-aware (move-table) model for Dorado v5.2 SUP basecalled data ([#428](https://github.com/HKU-BAL/Clair3/issues/428)).

- Added a pre-built **GPU Docker image** `hkubal/clair3:v2.0.1_gpu` (CUDA 12.1, PyTorch). See [GPU (NVIDIA CUDA on Linux)](#gpu-nvidia-cuda-on-linux) ([#433](https://github.com/HKU-BAL/Clair3/issues/433)).

- Fixed `clair3_version` shown in VCF headers — previously stuck at `1.2.0` regardless of installed version ([#432](https://github.com/HKU-BAL/Clair3/issues/432)).

- Fixed `SortVcf` writing pileup `RefCall` entries when no variants were found, instead of an empty VCF ([#436](https://github.com/HKU-BAL/Clair3/issues/436)).

### v2.0.0 — *Feb 9, 2026*   **(Major release)**

A preprint describing the performance of Clair3 v2 is available on [bioRxiv](https://www.biorxiv.org/content/10.64898/2026.02.13.705285v1).

- **PyTorch migration.** The deep-learning backend moved from TensorFlow to PyTorch. **v1 TensorFlow models are _not_ compatible with v2** (including the TF models ONT provides via Rerio). Use the [Converted Rerio Clair3 Models (PyTorch)](https://www.bio8.cs.hku.hk/clair3/clair3_models_rerio_pytorch/), or convert your own with the [Model Migration Guide](docs/model_migration_guide.md). Pre-trained PyTorch models: [download here](https://www.bio8.cs.hku.hk/clair3/clair3_models_pytorch/).

- **Signal-aware variant calling for ONT.** Pass `--enable_dwell_time` on BAMs with Dorado `mv` tags (requires `--emit-moves`). See [Dwelling Time Feature](docs/dwelling_time.md).

- **New Python runner.** `run_clair3.sh` was reconstructed as `run_clair3.py`; both remain usable.

- **Checkpoint format.** TF `.index`/`.data` → PyTorch `.pt`.

### v1.2.0 — *Aug 1, 2025*

Native GPU support on Linux and Apple Silicon. Clair3 on GPU runs **~5× faster than CPU**. See the [GPU Quick Start](docs/gpu_quick_start.md).



  



### v1.1.2 — *Jul 10, 2025*

- Boundary check for an insertion immediately followed by soft-clipping ([#394](https://github.com/HKU-BAL/Clair3/issues/394), @[dpryan79](https://github.com/dpryan79)).

- Parallel-job exit-code checking; pipeline now exits immediately on any job failure ([#392](https://github.com/HKU-BAL/Clair3/issues/392), @[SamStudio8](https://github.com/SamStudio8)).

### v1.1.1 — *May 19, 2025*

- Fixed the malformed VCF header on AWS ([#380](https://github.com/HKU-BAL/Clair3/issues/380)).

- Added an R10.4.1 model fine-tuned on 12 [bacterial genomes](https://elifesciences.org/reviewed-preprints/98300) ([notes](docs/fine-tuning_Clair3_with_12_bacteria_samples.pdf), @[wshropshire](https://github.com/wshropshire)).

Earlier versions (click to expand)

**v1.1.0 — Apr 8, 2025.** Removed `parallel` version checking ([#377](https://github.com/HKU-BAL/Clair3/issues/377)).

**v1.0.11 — Mar 19, 2025.** Added `--enable_variant_calling_at_sequence_head_and_tail` to call variants in the first/last 16 bp of a sequence (use with caution — less reliable alignments and less context; [#257](https://github.com/HKU-BAL/Clair3/issues/257)). Added `--output_all_contigs_in_gvcf_header` ([#371](https://github.com/HKU-BAL/Clair3/issues/371)). Added postprocessing `AddPairEndAlleleDepth` (PEAD tag, Bin Guan, NEI). Fixed AF format in GVCF output ([#365](https://github.com/HKU-BAL/Clair3/issues/365)). Added a [split-into-haplotypes calling workflow](docs/split_haplotype_into_haploid_calling.md). `set -o pipefail` in `run_clair3.sh` ([#368](https://github.com/HKU-BAL/Clair3/issues/368)). Clarified parameter docs ([#369](https://github.com/HKU-BAL/Clair3/issues/369)).

**v1.0.10 — Jul 28, 2024.** Fixed an out-of-range bug in non-human GVCF output ([#317](https://github.com/HKU-BAL/Clair3/issues/317)). Faster amplicon calling via `--chunk_num=-1` ([#306](https://github.com/HKU-BAL/Clair3/issues/306)). LongPhase bumped to 1.7.3 ([#321](https://github.com/HKU-BAL/Clair3/issues/321)).

**v1.0.9 — May 15, 2024.** Fixed VCF header ([#305](https://github.com/HKU-BAL/Clair3/pull/305)); updated `DP` FORMAT description.

**v1.0.8 — Apr 29, 2024.** Fixed occasional quality-score differences between VCF and GVCF output. LongPhase bumped to 1.7.

**v1.0.7 — Apr 7, 2024.** Memory guards for full-alignment C implementation ([#286](https://github.com/HKU-BAL/Clair3/pull/286)). Raised max mpileup coverage to 2^20 ([#292](https://github.com/HKU-BAL/Clair3/pull/292)). LongPhase bumped to 1.6.

**v1.0.6 — Mar 15, 2024.** Stack-overflow fix at very high coverage ([#282](https://github.com/HKU-BAL/Clair3/issues/282)). Reference caching for CRAM ([#278](https://github.com/HKU-BAL/Clair3/pull/278)). Fixed RefCall outputs when FA model calls no variant ([#271](https://github.com/HKU-BAL/Clair3/issues/271)). Fixed min-coverage filtering ([#262](https://github.com/HKU-BAL/Clair3/issues/262)). `--min_snp_af` / `--min_indel_af` default to 0.0 when `--vcf_fn` is set ([#261](https://github.com/HKU-BAL/Clair3/issues/261)).

**v1.0.5 — Dec 20, 2023.** Fixed multi-allelic AF at very high coverage ([#241](https://github.com/HKU-BAL/Clair3/issues/241)). `--base_err` and `--gq_bin_size` to reduce excess `./.` in GVCF ([#220](https://github.com/HKU-BAL/Clair3/issues/220)).

**v1.0.4 — Jul 11, 2023.** Command line and reference source now in VCF header. Fixed AF for 1/2 genotypes. Added AD tag.

**v1.0.3 — Jun 20, 2023.** Colon `:` allowed in reference sequence names ([#203](https://github.com/HKU-BAL/Clair3/issues/203)).

**v1.0.2 — May 22, 2023.** Added PacBio HiFi Revio model. Fixed halt on too few variant candidates ([#198](https://github.com/HKU-BAL/Clair3/issues/198)).

**v1.0.1 — Apr 24, 2023.** WhatsHap bumped to 1.7 (~15% faster haplotagging, [#193](https://github.com/HKU-BAL/Clair3/issues/193)). Fixed PL when ALT is N ([#191](https://github.com/HKU-BAL/Clair3/issues/191)).

**v1.0.0 — Mar 6, 2023.** Clair3 version in VCF header ([#141](https://github.com/HKU-BAL/Clair3/issues/141)). NumPy int fix ([#165](https://github.com/HKU-BAL/Clair3/issues/165)). IUPAC → N by default, keep with `--keep_iupac_bases` ([#153](https://github.com/HKU-BAL/Clair3/issues/153)). Added `--use_{longphase,whatshap}_for_intermediate_phasing` / `--use_{longphase,whatshap}_for_final_output_phasing` / `--use_whatshap_for_final_output_haplotagging` ([#164](https://github.com/HKU-BAL/Clair3/issues/164)). Fixed Docker shell under host user mode ([#175](https://github.com/HKU-BAL/Clair3/issues/175)).

**v0.1-r12 — Aug 19, 2022.** CRAM input ([#117](https://github.com/HKU-BAL/Clair3/issues/117)). Python 3.9, TensorFlow 2.8, Samtools 1.15.1, WhatsHap 1.4. `DP` now shows raw coverage for pileup calls ([#128](https://github.com/HKU-BAL/Clair3/issues/128)). Illumina representation-unification fix ([#110](https://github.com/HKU-BAL/Clair3/issues/110)). LongPhase 1.3.

**v0.1-r11 minor 2 — Apr 16, 2022.** Fixed missing non-variant GVCF positions at chunk boundaries. Reduced GVCF memory footprint ([#88](https://github.com/HKU-BAL/Clair3/issues/88)).

**v0.1-r11 — Apr 4, 2022.** ~2.5× faster on ONT Q20 data with pileup and full-alignment feature generation in C. LongPhase as a phasing option (`--longphase_for_phasing`). `--min_coverage`, `--min_mq`, `--min_contig_size`. CSI index support ([#90](https://github.com/HKU-BAL/Clair3/issues/90)). See [Notes on r11](docs/v0.1_r11_speedup.md).

**v0.1-r10 — Jan 13, 2022.** Added the Guppy5 model `r941_prom_sup_g5014` ([benchmarks](docs/guppy5_20220113.md)); applicable to `sup`, `hac`, `fast` reads. The older `r941_prom_sup_g506` was obsoleted. Added `--var_pct_phasing`.

**v0.1-r9 — Dec 1, 2021.** `--enable_long_indel` for indel calls >50 bp ([benchmarks](docs/indel_gt50_performance.md), [#64](https://github.com/HKU-BAL/Clair3/issues/64)).

**v0.1-r8 — Nov 11, 2021.** `--enable_phasing` to emit WhatsHap-phased VCF ([#63](https://github.com/HKU-BAL/Clair3/issues/63)). Fixed unexpected program termination on success.

**v0.1-r7 — Oct 18, 2021.** ONT `var_pct_full` raised 0.3 → 0.7 (+~0.2% indel F1). Fall-through to next-likely variant on low coverage ([#53](https://github.com/HKU-BAL/Clair3/pull/53)). Streamlined training. `mini_epochs` in `Train.py` ([#60](https://github.com/HKU-BAL/Clair3/pull/60)). GVCF intermediates now lz4-compressed (5× smaller). `--remove_intermediate_dir` ([#48](https://github.com/HKU-BAL/Clair3/issues/48)). ONT models renamed per [Medaka](https://github.com/nanoporetech/medaka/blob/master/medaka/options.py#L22) convention. Training-data leakage fixed ([#57](https://github.com/HKU-BAL/Clair3/issues/57)).

**ONT-provided models — Sep 23, 2021.** ONT also provides chemistry-/basecaller-specific Clair3 models via [Rerio](https://github.com/nanoporetech/rerio).

**v0.1-r6 — Sep 4, 2021.** Reduced `SortVcf` memory ([#45](https://github.com/HKU-BAL/Clair3/issues/45)). Lower `ulimit -n` requirement ([#47](https://github.com/HKU-BAL/Clair3/issues/47)). Clair3-Illumina in bioconda ([#42](https://github.com/HKU-BAL/Clair3/issues/42)).

**v0.1-r5 — Jul 19, 2021.** Training-data generator fix to avoid Tensorflow segfaults. Simplified Dockerfile. Fixed ALT output for reference calls. Fixed multi-allelic AF ([ACGT]Del). AD tag in GVCF. `--call_snp_only` ([#40](https://github.com/HKU-BAL/Clair3/issues/40)). Pileup/FA validity checks ([#32](https://github.com/HKU-BAL/Clair3/issues/32), [#38](https://github.com/HKU-BAL/Clair3/issues/38)).

**v0.1-r4 — Jun 28, 2021.** Bioconda install. ONT Guppy2 model ([benchmarks](docs/guppy2.md) — must be used on Guppy2-or-earlier data). [Colab notebooks](colab). Fix on too few variant candidates ([#28](https://github.com/HKU-BAL/Clair3/issues/28)).

**v0.1-r3 — Jun 9, 2021.** `ulimit -u` check with auto-retry on failed jobs ([#20](https://github.com/HKU-BAL/Clair3/issues/20), [#23](https://github.com/HKU-BAL/Clair3/issues/23), [#24](https://github.com/HKU-BAL/Clair3/issues/24)). ONT Guppy5 model ([benchmarks](docs/guppy5.md)).

**v0.1-r2 — May 23, 2021.** BED out-of-range fix ([#12](https://github.com/HKU-BAL/Clair3/issues/12)). Both `.bam.bai` and `.bai` accepted ([#10](https://github.com/HKU-BAL/Clair3/issues/10)). Boundary and package version checks.

**v0.1-r1 — May 18, 2021.** Relative paths in Conda ([#5](https://github.com/HKU-BAL/Clair3/issues/5)). `taskset` CPU-core visibility fix and Singularity image ([#6](https://github.com/HKU-BAL/Clair3/issues/6)).

**v0.1 — May 17, 2021.** Initial release.

---

## Installation

> **Pick the right install method for your hardware:**

> - **CPU** → Docker (Option 1), Singularity (Option 2), or Bioconda (Option 3).

> - **NVIDIA GPU (Linux)** → Docker GPU (Option 1) or Singularity GPU (Option 2); fall back to Step-by-step (Option 4) if unsupported.

> - **Apple Silicon (M1/M2/M3/M4)** → Step-by-step (Option 4).

>

> See the [GPU Quick Start](docs/gpu_quick_start.md) for tuned settings.

### Option 1. Docker

Pre-built image: [hkubal/clair3](https://hub.docker.com/r/hkubal/clair3).

> **Use absolute paths** for `INPUT_DIR` and `OUTPUT_DIR`.

#### CPU

```bash

INPUT_DIR="[YOUR_INPUT_FOLDER]"        # e.g. /home/user1/input  (absolute path)

OUTPUT_DIR="[YOUR_OUTPUT_FOLDER]"      # e.g. /home/user1/output (absolute path)

THREADS="[MAXIMUM_THREADS]"            # e.g. 8

MODEL_NAME="[YOUR_MODEL_NAME]"         # e.g. r1041_e82_400bps_sup_v500

docker run -it \

  -v ${INPUT_DIR}:${INPUT_DIR} \

  -v ${OUTPUT_DIR}:${OUTPUT_DIR} \

  hkubal/clair3:v2.0.1 \

  /opt/bin/run_clair3.sh \

    --bam_fn=${INPUT_DIR}/input.bam \

    --ref_fn=${INPUT_DIR}/ref.fa \

    --threads=${THREADS} \

    --platform=ont \                       ## {ont,hifi,ilmn}

    --model_path=/opt/models/${MODEL_NAME} \

    --output=${OUTPUT_DIR}

```

> `python3 /opt/bin/run_clair3.py` can replace `/opt/bin/run_clair3.sh` in the command above.

#### GPU (NVIDIA CUDA on Linux)

Image: `hkubal/clair3:v2.0.1_gpu` (built on CUDA 12.1).

**Requirements**

- NVIDIA driver ≥ 530.30.02.

- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) installed on the host.

```bash

docker run -it --gpus all \

  -v ${INPUT_DIR}:${INPUT_DIR} \

  -v ${OUTPUT_DIR}:${OUTPUT_DIR} \

  hkubal/clair3:v2.0.1_gpu \

  /opt/bin/run_clair3.sh \

    --bam_fn=${INPUT_DIR}/input.bam \

    --ref_fn=${INPUT_DIR}/ref.fa \

    --threads=${THREADS} \

    --platform=ont \                       ## {ont,hifi,ilmn}

    --model_path=/opt/models/${MODEL_NAME} \

    --output=${OUTPUT_DIR} \

    --use_gpu

```

**Notes**

- Select specific GPUs with `--gpus '"device=0,1"'` (Docker) and `--device=cuda:0,1` (Clair3).

- If the image does not work on your setup (unsupported driver/CUDA, no NVIDIA Container Toolkit, Apple Silicon, etc.), fall back to [Step-by-step (Option 4)](#option-4-step-by-step-conda).

### Option 2. Singularity

> **Use absolute paths** for `INPUT_DIR` and `OUTPUT_DIR`.

#### CPU

```bash

conda config --add channels defaults

conda create -n singularity-env -c conda-forge singularity -y

conda activate singularity-env

singularity pull docker://hkubal/clair3:v2.0.1

singularity exec \

  -B ${INPUT_DIR},${OUTPUT_DIR} \

  clair3_v2.0.1.sif \

  /opt/bin/run_clair3.sh \

    --bam_fn=${INPUT_DIR}/input.bam \

    --ref_fn=${INPUT_DIR}/ref.fa \

    --threads=${THREADS} \

    --platform=ont \                       ## {ont,hifi,ilmn}

    --model_path=/opt/models/${MODEL_NAME} \

    --output=${OUTPUT_DIR}

```

#### GPU (NVIDIA CUDA on Linux)

**Requirements**

- NVIDIA driver ≥ 530.30.02.

- Singularity (or Apptainer) with `--nv` support.

```bash

singularity pull docker://hkubal/clair3:v2.0.1_gpu

singularity exec --nv --cleanenv --env TMPDIR=/tmp \

  -B ${INPUT_DIR},${OUTPUT_DIR} \

  clair3_v2.0.1_gpu.sif \

  /opt/bin/run_clair3.sh \

    --bam_fn=${INPUT_DIR}/input.bam \

    --ref_fn=${INPUT_DIR}/ref.fa \

    --threads=${THREADS} \

    --platform=ont \                       ## {ont,hifi,ilmn}

    --model_path=/opt/models/${MODEL_NAME} \

    --output=${OUTPUT_DIR} \

    --use_gpu

```

**Notes**

- `--nv` injects the host NVIDIA driver and libraries into the container (equivalent of Docker's `--gpus all`); no NVIDIA Container Toolkit needed.

- `--cleanenv --env TMPDIR=/tmp` avoids `parallel` failing when the host `TMPDIR` points to a path not visible inside the container.

- If the image does not work on your setup, fall back to [Step-by-step (Option 4)](#option-4-step-by-step-conda).

### Option 3. Bioconda

Clair3 is available on [Bioconda](https://bioconda.github.io/recipes/clair3/README.html). The recipe bundles PyPy, samtools, parallel, whatshap, LongPhase, and the pre-trained models under `${CONDA_PREFIX}/bin/models/`. See [bioconda-recipes#64260](https://github.com/bioconda/bioconda-recipes/pull/64260) for the v2 (PyTorch) recipe.

```bash

mamba create -n clair3 -c conda-forge -c bioconda -y clair3

mamba activate clair3

MODEL_NAME="[YOUR_MODEL_NAME]"         # e.g. r1041_e82_400bps_sup_v500

run_clair3.sh \

  --bam_fn=input.bam \

  --ref_fn=ref.fa \

  --threads=${THREADS} \

  --platform=ont \                 ## {ont,hifi,ilmn}

  --model_path=${CONDA_PREFIX}/bin/models/${MODEL_NAME} \

  --output=${OUTPUT_DIR}

```

> **Note.** The Bioconda package ships a CPU-only PyTorch build. For NVIDIA GPU or Apple Silicon, use [Step-by-step (Option 4)](#option-4-step-by-step-conda).

### Option 4. Step-by-step (Conda)

Install Mamba or Conda from [miniforge](https://github.com/conda-forge/miniforge) (Mamba is much faster).

**Step 1 — Create and activate the environment**

```bash

mamba create -n clair3_v2 -c conda-forge -c bioconda -y \

  python=3.11 samtools whatshap parallel \

  zstd xz zlib bzip2 automake make gcc gxx curl pigz

mamba activate clair3_v2

pip install uv

```

**Step 2 — Install PyTorch**

Pick the right build for your system from the [PyTorch website](https://pytorch.org/get-started/locally/).

```bash

# Example: NVIDIA CUDA 13.0

uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130

# Or: CPU only

uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

```

**Step 3 — Clone Clair3**

```bash

cd ${HOME}

git clone https://github.com/HKU-BAL/Clair3.git

cd Clair3

export CLAIR3_PATH=$(pwd)

```

**Step 4 — Install Python deps and build C sources**

```bash

uv pip install numpy h5py hdf5plugin numexpr tqdm cffi torchmetrics

make PREFIX=${CONDA_PREFIX}

```

> `make` compiles samtools/htslib, LongPhase, and the Clair3 C shared library (`libclair3.so`) used for fast pileup and full-alignment tensor generation.

**Step 5 — Install PyPy3.11** (speeds up preprocessing)

```bash

wget https://downloads.python.org/pypy/pypy3.11-v7.3.20-linux64.tar.bz2

tar -xjf pypy3.11-v7.3.20-linux64.tar.bz2 && rm pypy3.11-v7.3.20-linux64.tar.bz2

ln -s $(pwd)/pypy3.11-v7.3.20-linux64/bin/pypy3 ${CONDA_PREFIX}/bin/pypy3

ln -s $(pwd)/pypy3.11-v7.3.20-linux64/bin/pypy3 ${CONDA_PREFIX}/bin/pypy

pypy3 -m ensurepip

pypy3 -m pip install mpmath==1.2.1

```

**Step 6 — (Optional) Download pre-trained models**

```bash

cd ${CLAIR3_PATH}

mkdir -p models

wget -r -np -nH --cut-dirs=2 -R "index.html*" -P ./models \

  https://www.bio8.cs.hku.hk/clair3/clair3_models_pytorch/

```

Individual models can also be grabbed from [the model index](https://www.bio8.cs.hku.hk/clair3/clair3_models_pytorch/).

**Step 7 — Run Clair3**

```bash

MODEL_NAME=r1041_e82_400bps_sup_v500

${CLAIR3_PATH}/run_clair3.sh \

  --bam_fn=input.bam \

  --ref_fn=ref.fa \

  --threads=${THREADS} \

  --platform=ont \

  --model_path=${CLAIR3_PATH}/models/${MODEL_NAME} \

  --output=${OUTPUT_DIR}

```

> `python3 ${CLAIR3_PATH}/run_clair3.py` accepts the same arguments and can be used interchangeably.

---

## Pre-trained Models

> **Important: v1 TensorFlow models are not compatible with Clair3 v2** (including the TF models ONT provides via Rerio). Convert your own with the [Model Migration Guide](docs/model_migration_guide.md), or use the pre-converted models below.

**Download:**

- HKU-provided: 

- Converted ONT Rerio: 

**Bundled locations:** `/opt/models/` (Docker) · `${CONDA_PREFIX}/bin/models/` (Bioconda).

### HKU-provided models

Listed at .

| Model | Platform | `--platform` | Training samples / Notes | Bioconda | Docker |

| --- | --- | :-: | --- | :-: | :-: |

| **`r1041_e82_400bps_hac_v600_with_mv`** *(latest)* | ONT R10.4.1 E8.2 (5 kHz), HAC | `ont` | HG001,2,5 (chr20 excluded) — **signal-aware**, use `--enable_dwell_time` | | |

| **`r1041_e82_400bps_hac_v520_with_mv`** *(latest)* | ONT R10.4.1 E8.2 (5 kHz), HAC | `ont` | HG001,2,5 (chr20 excluded) — **signal-aware**, use `--enable_dwell_time` | | ✓ |

| **`r1041_e82_400bps_sup_v520_with_mv`** *(latest)* | ONT R10.4.1 E8.2 (5 kHz), SUP | `ont` | HG001,2,5 (chr20 excluded) — **signal-aware**, use `--enable_dwell_time` | | ✓ |

| `r1041_e82_400bps_sup_v430_bacteria_finetuned` | ONT R10.4.1 | `ont` | Fine-tuned on 12 [bacterial genomes](https://elifesciences.org/reviewed-preprints/98300) | | ✓ |

| `r941_prom_sup_g5014` | ONT R9.4.1, Guppy5 SUP | `ont` | HG002,4,5; also usable on HAC reads ([benchmarks](docs/guppy5_20220113.md)) | ✓ | ✓ |

| `r941_prom_hac_g360+g422` | ONT R9.4.1, Guppy3/4 HAC | `ont` | HG001,2,4,5 | | |

| `hifi_revio` | PacBio HiFi Revio | `hifi` | HG002,4 | ✓ | ✓ |

| `hifi_sequel2` | PacBio HiFi Sequel II | `hifi` | HG001,2,4,5 | ✓ | ✓ |

| `ilmn` | Illumina | `ilmn` | HG001,2,4,5 | ✓ | ✓ |

> **Recommendation for modern ONT R10.4.1 data:** when your BAM has Dorado `mv` tags, use the dwell-time model (`..._with_mv`) for the best accuracy; otherwise, use an ONT-trained model below.

### ONT-provided models (bundled)

> ONT's models are fine-tuned to specific chemistries / basecallers and **typically outperform the HKU baselines** — we recommend using them for best results. Official PyTorch distributions from ONT are in progress; in the meantime, use the [converted Rerio models](#converted-rerio-models) below.

The following ONT-trained models are bundled with Clair3 Docker / Bioconda since v1.1.1:

| Model | Chemistry | Dorado model | Bioconda | Docker |

| --- | --- | --- | :-: | :-: |

| `r1041_e82_400bps_sup_v500` | R10.4.1 E8.2 (5 kHz) | v5.0.0 SUP | ✓ | ✓ |

| `r1041_e82_400bps_hac_v500` | R10.4.1 E8.2 (5 kHz) | v5.0.0 HAC | | ✓ |

| `r1041_e82_400bps_sup_v410` | R10.4.1 E8.2 (4 kHz) | v4.1.0 SUP | ✓ | ✓ |

| `r1041_e82_400bps_hac_v410` | R10.4.1 E8.2 (4 kHz) | v4.1.0 HAC | | ✓ |

> **ONT has released newer Dorado v5.2.0 models** (`r1041_e82_400bps_sup_v520` / `hac_v520`). They are not yet bundled in Docker / Bioconda — download them from the [Converted Rerio models](#converted-rerio-models) section below.

### Converted Rerio models

The full ONT [Rerio](https://github.com/nanoporetech/rerio) catalog converted to PyTorch for Clair3 v2 is available at . A selection of recent R10.4.1 E8.2 (5 kHz) models is listed below.

| Model | Chemistry | Dorado model |

| --- | --- | --- |

| `r1041_e82_400bps_hac_v600` *(latest)* | R10.4.1 E8.2 (5 kHz) | v6.0.0 HAC |

| `r1041_e82_400bps_sup_v520` *(latest)* | R10.4.1 E8.2 (5 kHz) | v5.2.0 SUP |

| `r1041_e82_400bps_hac_v520` | R10.4.1 E8.2 (5 kHz) | v5.2.0 HAC |

| `r1041_e82_400bps_sup_v500` | R10.4.1 E8.2 (5 kHz) | v5.0.0 SUP |

| `r1041_e82_400bps_hac_v500` | R10.4.1 E8.2 (5 kHz) | v5.0.0 HAC |

| `r1041_e82_400bps_sup_v430` | R10.4.1 E8.2 (5 kHz) | v4.3.0 SUP |

| `r1041_e82_400bps_hac_v430` | R10.4.1 E8.2 (5 kHz) | v4.3.0 HAC |

| `r1041_e82_400bps_sup_v410` | R10.4.1 E8.2 (5 kHz) | v4.1.0 SUP |

| `r1041_e82_400bps_hac_v410` | R10.4.1 E8.2 (5 kHz) | v4.1.0 HAC |

For other chemistries and basecaller versions (R10.4.1 E8.2 260 bps, R10.4 E8.1, earlier Guppy `g6xx` / `g5015`, v4.0.0 / v4.2.0), browse the full [model directory](https://www.bio8.cs.hku.hk/clair3/clair3_models_rerio_pytorch/) and pick the one matching your chemistry and basecaller (Dorado / Guppy) version.

---

## Quick Demo

- **ONT with dwelling time** — [ONT Dwelling Time Quick Demo](docs/quick_demo/ont_mv_quick_demo.md)

- **Oxford Nanopore (ONT)** — [ONT Quick Demo](docs/quick_demo/ont_quick_demo.md)

- **PacBio HiFi** — [PacBio HiFi Quick Demo](docs/quick_demo/pacbio_hifi_quick_demo.md)

- **Illumina NGS** — [Illumina Quick Demo](docs/quick_demo/illumina_quick_demo.md)

---

## Usage

### General usage

> **Caution:** Use `=value` for all parameters, e.g. `--bed_fn=fn.bed` (not `--bed_fn fn.bed`).

```bash

./run_clair3.sh \

  --bam_fn=${BAM} \

  --ref_fn=${REF} \

  --threads=${THREADS} \

  --platform=ont \                 ## {ont,hifi,ilmn}

  --model_path=${MODEL_PREFIX} \

  --output=${OUTPUT_DIR} \

  --include_all_ctgs               ## required for non-human species

```

Outputs:

| File | Description |

| --- | --- |

| `${OUTPUT_DIR}/pileup.vcf.gz` | Pileup model calls |

| `${OUTPUT_DIR}/full_alignment.vcf.gz` | Full-alignment model calls |

| `${OUTPUT_DIR}/merge_output.vcf.gz` | **Final Clair3 output** |

By default, variants are called on `chr{1..22,X,Y}` and `{1..22,X,Y}`. Override with `--include_all_ctgs`, `--ctg_name`, or `--bed_fn`.

> `python3 run_clair3.py` is interchangeable with `./run_clair3.sh`.

### Options

**Required**

```

-b, --bam_fn=FILE         Indexed BAM input.

-f, --ref_fn=FILE         Indexed FASTA reference.

-m, --model_path=STR      Folder containing pileup.pt and full_alignment.pt.

-t, --threads=INT         Max threads. Each chunk uses 4; ceil(threads/4)*3 chunks run in parallel.

-p, --platform=STR        {ont,hifi,ilmn}

-o, --output=PATH         VCF/GVCF output directory.

```

**Common options**

```

    --bed_fn=FILE                     Call variants only in these BED regions.

    --vcf_fn=FILE                     Candidate sites VCF; only call at these sites.

    --ctg_name=STR                    Sequence(s) to process.

    --sample_name=STR                 Sample name in the output VCF.

    --qual=INT                        Variants with QUAL > $qual are PASS, else LowQual.

    --chunk_size=INT                  Chunk size for parallel processing. Default: 5000000.

    --pileup_only                     Pileup model only. Default: disable.

    --print_ref_calls                 Include 0/0 calls in the VCF. Default: disable.

    --include_all_ctgs                Call on all contigs. Default: chr{1..22,X,Y}.

    --gvcf                            Emit GVCF. Default: disable.

    --remove_intermediate_dir         Drop intermediate files when no longer needed.

```

**GPU / signal-aware**

```

    --use_gpu                         Enable GPU-accelerated calling.

    --device=STR                      GPU device(s), e.g. 'cuda:0' or 'cuda:0,1'. Default: all visible GPUs.

    --enable_dwell_time               Signal-aware calling via Dorado mv tags (ONT only; C impl required).

```

**Phasing**

```

    --use_whatshap_for_intermediate_phasing      Default: enable.

    --use_longphase_for_intermediate_phasing     Default: disable.

    --use_whatshap_for_final_output_phasing      Default: disable.

    --use_longphase_for_final_output_phasing     Default: disable.

    --use_whatshap_for_final_output_haplotagging Default: disable.

    --enable_phasing                             Alias of --use_whatshap_for_final_output_phasing (legacy).

    --longphase_for_phasing                      Alias of --use_longphase_for_intermediate_phasing (legacy).

```

**External binaries**

```

    --samtools=STR     samtools >= 1.10

    --python=STR       python3 >= 3.6

    --pypy=STR         pypy3 >= 3.6

    --parallel=STR     parallel >= 20191122

    --whatshap=STR     whatshap >= 1.0

    --longphase=STR    longphase >= 1.0

```

**Experimental / advanced**

``` 
    --snp_min_af=FLOAT 
    --indel_min_af=FLOAT 
    --var_pct_full=FLOAT 
    --ref_pct_full=FLOAT 
    --var_pct_phasing=FLOAT 
    --pileup_model_prefix=STR 
    --fa_model_prefix=STR 
    --min_mq=INT 
    --min_coverage=INT 
    --min_contig_size=INT 
    --fast_mode 
    --haploid_precise 
    --haploid_sensitive 
    --no_phasing_for_fa 
    --call_snp_only 
    --enable_long_indel 
    --keep_iupac_bases 
    --base_err=FLOAT 
    --gq_bin_size=INT 
    --enable_variant_c 
 
    --output_all_conti 
 
    --disable_c_impl 
```

Min SNP AF. Default: ont/hifi/ilmn = 0.08. Min indel AF. Default: ont=0.15, hifi/ilmn=0.08. Pct of low-quality 0/1 and 1/1 pileup calls rerun in full-alignment. Default: 0.3. Pct of low-quality 0/0 pileup calls rerun in full-alignment. Default: 0.3 (ilmn/hifi), 0.1 (ont). Pct of high-quality 0/1 pileup variants used for WhatsHap phasing. Default: 0.8 (ont guppy5), 0.7 (others). Pileup model prefix. Default: pileup. Full-alignment model prefix. Default: full_alignment. Filter reads with MAPQ < $min_mq. Default: 5. Min coverage to call a variant. Default: 2. Skip contigs smaller than $min_contig_size. Default: 0. Skip candidates with AF <= 0.15. Haploid: only 1/1 is a variant. Haploid: 0/1 and 1/1 are variants. Skip WhatsHap phasing in full-alignment calling. Skip indels. Call indels > 50 bp. Keep IUPAC bases (default: convert to N). Estimated base error rate for GVCF. Default: 0.001. GQ bin size for non-variant merging in GVCF. Default: 5. alling_at_sequence_head_and_tail Call in the first/last 16 bp of a sequence (amplicon-friendly). gs_in_gvcf_header List all contigs in the GVCF header. Disable C implementation for tensor creation (default: enable).

### Examples

#### Call variants on selected chromosomes

```bash

CONTIGS_LIST="[YOUR_CONTIGS_LIST]"     # e.g "chr21" or "chr21,chr22"

docker run -it \

  -v ${INPUT_DIR}:${INPUT_DIR} \

  -v ${OUTPUT_DIR}:${OUTPUT_DIR} \

  hkubal/clair3:v2.0.1 \

  /opt/bin/run_clair3.sh \

    --bam_fn=${INPUT_DIR}/input.bam \

    --ref_fn=${INPUT_DIR}/ref.fa \

    --threads=${THREADS} \

    --platform=ont \

    --model_path=/opt/models/${MODEL_NAME} \

    --output=${OUTPUT_DIR} \

    --ctg_name=${CONTIGS_LIST}

```

#### Call variants at known sites

```bash

KNOWN_VARIANTS_VCF="[YOUR_VCF_PATH]"   # e.g. /home/user1/known_variants.vcf.gz

docker run -it \

  -v ${INPUT_DIR}:${INPUT_DIR} \

  -v ${OUTPUT_DIR}:${OUTPUT_DIR} \

  hkubal/clair3:v2.0.1 \

  /opt/bin/run_clair3.sh \

    --bam_fn=${INPUT_DIR}/input.bam \

    --ref_fn=${INPUT_DIR}/ref.fa \

    --threads=${THREADS} \

    --platform=ont \

    --model_path=/opt/models/${MODEL_NAME} \

    --output=${OUTPUT_DIR} \

    --vcf_fn=${KNOWN_VARIANTS_VCF}

```

#### Call variants in BED regions

> A BED file is recommended over point coordinates.

```bash

# Build a BED (0-based, "ctg start end") if needed

echo -e "${CONTIGS}\t${START_POS}\t${END_POS}" > /home/user1/tmp.bed

BED_FILE_PATH="[YOUR_BED_FILE]"        # e.g. /home/user1/tmp.bed

docker run -it \

  -v ${INPUT_DIR}:${INPUT_DIR} \

  -v ${OUTPUT_DIR}:${OUTPUT_DIR} \

  hkubal/clair3:v2.0.1 \

  /opt/bin/run_clair3.sh \

    --bam_fn=${INPUT_DIR}/input.bam \

    --ref_fn=${INPUT_DIR}/ref.fa \

    --threads=${THREADS} \

    --platform=ont \

    --model_path=/opt/models/${MODEL_NAME} \

    --output=${OUTPUT_DIR} \

    --bed_fn=${BED_FILE_PATH}

```

#### Call variants in non-diploid organisms (haploid)

```bash

docker run -it \

  -v ${INPUT_DIR}:${INPUT_DIR} \

  -v ${OUTPUT_DIR}:${OUTPUT_DIR} \

  hkubal/clair3:v2.0.1 \

  /opt/bin/run_clair3.sh \

    --bam_fn=${INPUT_DIR}/input.bam \

    --ref_fn=${INPUT_DIR}/ref.fa \

    --threads=${THREADS} \

    --platform=ont \

    --model_path=/opt/models/${MODEL_NAME} \

    --output=${OUTPUT_DIR} \

    --no_phasing_for_fa \                      ## disable FA phasing

    --include_all_ctgs \                       ## call on all contigs

    --haploid_precise \                        ## or --haploid_sensitive

    --enable_variant_calling_at_sequence_head_and_tail

```

---

## Advanced Topics

### Dwelling Time Feature

Clair3 v2.0 introduces **signal-aware variant calling** for Oxford Nanopore data. Dwell time (signal duration per base) extracted from BAM `mv` tags is used as an additional input channel to the full-alignment model, improving accuracy.

```bash

./run_clair3.sh \

  --bam_fn=input.bam \

  --ref_fn=ref.fa \

  --threads=8 \

  --platform=ont \

  --model_path=${MODEL_PATH} \

  --output=${OUTPUT_DIR} \

  --enable_dwell_time

```

**Requirements**

- BAM must contain `mv` (move-table) tags from Dorado with `--emit-moves`.

- `--platform=ont`.

- C implementation must be enabled (default; do **not** pass `--disable_c_impl`).

See [Dwelling Time Feature](docs/dwelling_time.md) (full guide incl. training) and the [ONT Dwelling Time Quick Demo](docs/quick_demo/ont_mv_quick_demo.md).

### Dealing with amplicon data

- Use `--enable_variant_calling_at_sequence_head_and_tail`.

- If coverage is excessively high: set `--var_pct_full=1` and `--ref_pct_full=1`.

  - Human: also set `--var_pct_phasing=1`.

  - Non-human: add `--no_phasing_for_fa`.

- Context: discussions [#160](https://github.com/HKU-BAL/Clair3/issues/160#issuecomment-1396743261), [#240](https://github.com/HKU-BAL/Clair3/issues/240).

### Postprocessing scripts

#### `SwitchZygosityBasedOnSVCalls`

Given a Clair3 VCF and a Sniffle2 SV VCF, this module re-labels Clair3 SNPs from homozygous to heterozygous when both:

1. AF ≤ 0.7, and

2. the ±16 bp flanking region falls inside one or more SV deletions.

Two INFO tags are added: `SVBASEDHET` and `ORG_CLAIR3_SCORE` (original QUAL). The new QUAL becomes the top QUAL among overlapping deletions. Inspired by Philipp Rescheneder (ONT).

```bash

pypy3 ${CLAIR3_PATH}/clair3.py SwitchZygosityBasedOnSVCalls \

  --bam_fn input.bam \

  --clair3_vcf_input clair3_input.vcf.gz \

  --sv_vcf_input sniffle2.vcf.gz \

  --vcf_output output.vcf \

  --threads 8

```

---

## Reference

### Folder Structure and Submodules

> All submodules accept `-h` / `--help`.

**`clair3/`** — not pypy-compatible, run with python.

| Submodule | Description |

| --- | --- |

| `CallVariants` | Call variants from a trained model and candidate tensors. |

| `CallVarBam` | Call variants from a trained model and a BAM. |

| `Train` | Train a model with AdamW (PyTorch). DDP via `torchrun`. Initial LR `1e-3` with warm-up. Takes tensor binaries from `Tensor2Bin`. |

**`preprocess/`** — pypy-compatible unless noted.

| Submodule | Description |

| --- | --- |

| `CheckEnvs` | Validate inputs/environment; preprocess BED; `--chunk_size` sets per-job chunk size. |

| `CreateTensorPileup` | Generate pileup tensors for training/calling. |

| `CreateTensorFullAlignment` | Generate phased full-alignment tensors. |

| `GetTruth` | Extract variants from a truth VCF (reference FASTA required if ALT contains `*`). |

| `MergeVcf` | Merge pileup and full-alignment VCF/GVCF. |

| `RealignReads` | Local read realignment (Illumina). |

| `SelectCandidates` | Select pileup candidates for full-alignment calling. |

| `SelectHetSnp` | Select heterozygous-SNP candidates for WhatsHap phasing. |

| `SelectQual` | Select a quality cutoff from pileup results; variants below it go to phasing + full-alignment. |

| `SortVcf` | Sort a VCF file. |

| `SplitExtendBed` | Split BED by contig; extend by 33 bp for variant calling. |

| `UnifyRepresentation` | Representation unification between candidates and truth. |

| `MergeBin` | Merge tensor binaries. |

| `CreateTrainingTensor` | Create training tensor binaries (pileup or full-alignment). |

| `Tensor2Bin` | Combine variant/non-variant tensors into a `blosc:lz4hc` binary (**not pypy-compatible**; ~10–15 GB training memory). |

### Training Data

Pileup and full-alignment models were trained on four GIAB samples (HG001, HG002, HG004, HG005), excluding HG003. On ONT, a second model trained on HG001–3, 5 excluded HG004. Chr20 was excluded from all training (chr1–19, 21, 22 only).

| Platform | Reference | Aligner | Training samples |

| --- | --- | --- | --- |

| ONT | GRCh38_no_alt | minimap2 | HG001,2,(3\|4),5 |

| PacBio HiFi Sequel II | GRCh38_no_alt | pbmm2 | HG001,2,4,5 |

| PacBio HiFi Revio | GRCh38_no_alt | pbmm2 | HG002,4 |

| Illumina | GRCh38 | BWA-MEM / NovoAlign | HG001,2,4,5 |

Full details and download links: [Training Data](docs/training_data.md).

### VCF/GVCF Output Formats

Clair3 uses **VCF 4.2**. Extra INFO tags distinguish call source:

- `P` — called by the pileup model.

- `F` — called by the full-alignment model.

GVCF output is **GATK-compatible** and passes GATK `ValidateVariants`. Clair3 uses `` (same as GATK), not DeepVariant's `<*>`. Merge with GLNexus — a caller-specific config is [available for download](http://www.bio8.cs.hku.hk/clair3_trio/config/clair3.yml).

### Model Training Guides

- [Pileup model training](docs/pileup_training.md)

- [Full-alignment model training](docs/full_alignment_training_r1.md)

- [Representation unification](docs/representation_unification.md)

- [Model migration (TensorFlow → PyTorch)](docs/model_migration_guide.md)

### Visualization

- [Model input visualization](docs/model_input_visualization.md)

- [Representation unification visualization](docs/representation_unification_visualization.md)

---

## Citation

| Paper | Venue | Topic |

| --- | --- | --- |

| Symphonizing pileup and full-alignment for deep learning-based long-read variant calling | [Nature Computational Science](https://rdcu.be/c1TPa) · [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2021.12.29.474431v2) | Original Clair3 |

| Accelerated long-read variant calling with Clair3 for whole-genome sequencing | [Bioinformatics, 2026](https://doi.org/10.1093/bioinformatics/btag181) | GPU-accelerated Clair3 |

| Leveraging ONT move table values for signal aware variant calling | [bioRxiv preprint, 2026](https://www.biorxiv.org/content/10.64898/2026.02.13.705285v1) | ONT `mv`-tag (move-table) signal-aware tuning |
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/HKU-BAL/Clair3

Awesome Lists containing this project

README

Clair3