An open API service indexing awesome lists of open source software.

https://github.com/bcgsc/hlaminer

⛏ HLA predictions from NGS shotgun data
https://github.com/bcgsc/hlaminer

alignments hla-prediction hla-sequence oxford-nanopore shotgun-sequence-datasets targeted-assemblies

Last synced: 3 months ago
JSON representation

⛏ HLA predictions from NGS shotgun data

Awesome Lists containing this project

README

          

[![Release](https://img.shields.io/github/release/warrenlr/HLAminer.svg)](https://github.com/warrenlr/HLAminer/releases)
[![Downloads](https://img.shields.io/github/downloads/warrenlr/HLAminer/total?logo=github)](https://github.com/warrenlr/HLAminer/releases/download/v1.4/HLAminer_1-4.tar.gz)
[![Issues](https://img.shields.io/github/issues/warrenlr/HLAminer.svg)](https://github.com/warrenlr/HLAminer/issues)
[![link](https://img.shields.io/badge/HLAminer-manuscript-brightgreen)](https://doi.org/10.1186/gm396)
[![link](https://img.shields.io/badge/HLAminer-protocol-brightgreen)](https://doi.org/10.1002/cpz1.70124)
Thank you for your [![Stars](https://img.shields.io/github/stars/warrenlr/HLAminer.svg)](https://github.com/warrenlr/HLAminer/stargazers)

![Logo](https://github.com/warrenlr/hlaminer/blob/master/hlaminer-logo.png)

## HLAminer (c) 2011-present

Derivation of HLA (Human Leukocyte Antigen) class I and II predictions from DNA/RNA sequencing datasets

*This manual assumes that you have a working knowledge of Unix, and some shell and perl scripting experience

### CONTENTS
--------
1. [SYNOPSIS](#synopsis)
2. [LICENSE](#license)
3. [OVERVIEW](#overview)
4. [DESCRIPTION](#description)
5. [INSTALL](#install)
6. [COMMANDS AND OPTIONS](#commands)
7. [PREDICTING FROM LONG (NANOPORE/PACBIO) READS](#nanopore)
8. [REFERENCE SEQUENCE FOR LONG-READ ALIGNMENTS](#reference)
9. [DATABASES](#databases)
10. [AUTHORS](#authors)
11. [CITING](#citing)
12. [FULL LICENSE](#full)
--------
### SYNOPSIS
--------

HLAminer is a pipeline for predicting Human Leukocyte Antigen (HLA) signatures from shotgun sequence data (ie. whole genome, whole transcriptome/RNA-Seq, exome), at the group and allele resolution.
It supports predictions from a variety of DNA sequencing technologies including those from Illumina, MGI, PacBio and Oxford Nanopore.
Predictions are either derived from targeted sequence assembly, or direct sequence alignments.

For quick tests on Illumina RNA-seq data:
1. Copy ./test-demo/ eg. cp -rf test-demo foo
2. In folder "foo", edit the patient.fof file to point to your NGS RNAseq data. Ensure all paths are ok.
3. For HLA Predictions by Targeted Assembly of Shotgun Reads: execute ./HLAminer/foo/HPTASRrnaseq.sh
For HLA Predictions by Read Alignment: execute ./HLAminer/foo/HPRArnaseq.sh

### LICENSE
--------

HLAminer Copyright (c) 2011-present Canada's Michael Smith Genome Science Centre. All rights reserved.
TASR Copyright (c) 2010-present Canada's Michael Smith Genome Science Centre. All rights reserved.
SSAKE Copyright (c) 2006-present Canada's Michael Smith Genome Science Centre. All rights reserved.

Due to the clinical implications of HLAminer, the code is now released
under the BC Cancer Agency software license agreement (academic use).
Details of the license can be accessed at:
and at the bottom of this readme file

For commercial licensing options, please contact Patrick Rebstein prebstein@bccancer.bc.ca

Software components of HLAminer (eg. TASR) are still distributed under the
terms of the GNU General Public License

### OVERVIEW
--------

Derivation of HLA class I and class II predictions from shotgun sequence datasets (HLAminer) by:
1) Targeted Assembly of Shotgun Reads (HPTASR)
2) Read Alignment (HPRA)

BEST SHORT READ RESULTS ARE OBTAINED WITH HPTASR WITH READS 100bp AND UP (IDEALLY 150bp).
IT WILL WORK WITH SHORTER READS (50bp) BUT 4-digit HLA ALLELE PREDICTIONS MAY BE AMBIGUOUS

This clip summarizes the pipeline:
https://www.youtube.com/watch?v=j-g8Geh5ST8&list=LL&index=110

### DESCRIPTION
--------

The HLA prediction by targeted assembly of short sequence reads (HPTASR), performs targeted de novo assembly of HLA NGS reads and align the resulting contigs to reference HLA alleles from the IMGT/HLA sequence repository using commodity hardware with standard specifications (<2GB RAM, 2GHz). Putative HLA types are inferred by mining and scoring the contig alignments and an expect value is determined for each. The method is accurate, simple and fast to execute and, for transcriptome data, requires low depth of coverage. Known HLA class I/class II reference sequences available from the IMGT/HLA public repository are read by TASR using default options (Warren and Holt 2011) to create a hash table of all possible 15 nt words (k-mers) from these reference sequences. Note that this parameter is customizable and larger k values will yield predictions with increased specificity (at the possible expense of sensitivity). Subsequently, NGS data sets are interrogated for the presence of one of these kmers (on either strand) at the 5’ or 3’ start. Whenever an HLA word is identified, the read is recruited as a candidate for de novo assembly. Upon de novo assembly of all recruited reads, a set of contigs is generated. Only sequence contigs equal or larger than 200nt in length are considered for further analysis, as longer contigs better resolve HLA allelic variants. Reciprocal BLASTN alignments are performed between the contigs and all HLA allelic reference sequences. HPTASR mines the alignments, scoring each possible HLA allele identified, computing and reporting an expect value (E-value) based on the chance of contigs characterizing given HLA alleles and, reciprocally, the chance of reference HLA alleles aligning best to certain assembled contig sequences

The HLA prediction from direct read alignment (HPRA) method is conceptually simpler and faster to execute, since paired reads are aligned up-front to reference HLA alleles. Alignments from the HPTASR and HPRA methods are processed by the same software (HLAminer.pl) to derive HLA-I predictions by scoring and evaluating the probability of each candidate bearing alignments.

### What's new in version 1.4?
--------

Ability to stream the (.sam) output of modern read aligners, directly into HLAminer.
Initial support for predicting HLA types from long nanopore reads such as those from Oxford Nanopore Technologies.
Better information/sub-routine/date tracking in hlaminer

### What's new in version 1.3?
--------

A more concise HLA allele summary in HLAminer_HPTASR.csv and HLAminer_HPRA.csv (associated .log is unchanged and lists all predictions)
Keeps top two [highest-scoring by HLA group] predictions per gene and only the 'P' designated allele when the summary include HLA Sequences reported to have the same antigen binding domain.
For the original output, refer to the HLAminer_v1-2.pl included in the ./bin directory
A prediction example from MCF-7 PacBio RNA-seq reads is also provided

### What's new in version 1.2?
--------

Updated all HLA sequence databases
Corrected shell script that download HLA sequences to reflect change of location at EBI (ie. fasta sub folder)
Added support for predictions from direct alignment of single-end reads

### INSTALL
--------


1. Download and decompress the tar ball
gunzip HLAminer_v1-4.tar.gz
tar -xvf HLAminer_v1-4.tar
2. Make sure you see the following directories:
./bin
./databases
./docs
./test-demo

3. Read the docs in the ./docs/ folder
4. Change/Add/Adjust the perl shebang line of each .pl and .sh script in the ./bin/ folder as needed


From direct Read Alignment (HPRA, faster but less accurate):
HPRArnaseq_classI.sh
HPRArnaseq_classI-II.sh
HPRAwgs_classI.sh
HPRAwgs_classI-II.sh
-and for single end reads-
HPRArnaseq_classI_SE.sh
HPRArnaseq_classI-II_SE.sh
HPRAwgs_classI_SE.sh
HPRAwgs_classI-II_SE.sh

From Targeted Assembly (HPTASR, longer but more accurate):
HPTASRrnaseq_classI.sh
HPTASRrnaseq_classI-II.sh
HPTASRwgs_classI.sh
HPTASRwgs_classI-II.sh

*Running HPTASRwgs(rnaseq)_classI-II.sh will take longer than HPTASRwgs(rnaseq)_classI.sh, due to the reciprocal BLAST step. You may remove this step from the former (and HLAminer.pl command) to speed things up. However, this step is helpful in weeding out spurious alignments to HLA references. That said, if you're solely interested in HLA-I, you have the option to run the latter set of scripts [HPTASRwgs(rnaseq)_classI.sh].

Also, in the ncbiBlastConfig2-2-XX.txt files (bin and test-demo directories), you may adjust the number of threads and number of reported alignments to speed things up. The options have different name depending on the blast version, refer to the blast manual
eg.
v2.2.22
option:description
-a:threads
-v:number of descriptions
-b:number of alignments

v2.2.28
-num_threads:threads
-max_target_seqs:number of hit sequences to report (when output is 5/xml)

In our hands, a few tests show that blast 2.2.22 may be faster than blast+ (2.2.28) while
producing accurate results - HLAminer (Warren et al. 2012) was thoroughly tested with 2.2.22

NCBI blast may be downloaded from:
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/
-or-
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/

HLAminer.pl
parseXMLblast.pl
[![link](https://img.shields.io/badge/TASR-github-yellow)](https://github.com/warrenlr/TASR)

5. You must install perl module Bio::SearchIO to use HPTASR
6. Edit the fullpath location of bwa and other software dependencies in the shell scripts in the ./bin/ folder, as needed
7. For your convenience, ncbi blastall and formatdb have been placed in the ./bin/ folder and executed from the following shell scripts:

NAME,PROCESS,NGS DATA TYPE,PREDICTIONS
HPRArnaseq_classI.sh,Paired read alignment,RNAseq (transcriptome),HLA-I A,B,C genes
HPRArnaseq_classI-II.sh,Paired read alignment,RNAseq (transcriptome),HLA-I A,B,C and HLA-II DP,DQ,DR genes

HPRAwgs_classI.sh,Paired read alignment,Exon capture (exome) and WGS (genome),HLA-I A,B,C genes
HPRAwgs_classI-II.sh,Paired read alignment,Exon capture (exome) and WGS (genome),HLA-I A,B,C and HLA-II DP,DQ,DR genes

HPTASRrnaseq_classI.sh,Targeted assembly of sequence reads,RNAseq (transcriptome),HLA-I A,B,C genes
HPTASRrnaseq_classI-II.sh,Targeted assembly of sequence reads,RNAseq (transcriptome),HLA-I A,B,C and HLA-II DP,DQ,DR genes

HPTASRwgs_classI.sh,Targeted assembly of sequence reads,Exon capture (exome) and WGS (genome),HLA-I A,B,C genes
HPTASRwgs_classI-II.sh,Targeted assembly of sequence reads,Exon capture (exome) and WGS (genome),HLA-I A,B,C and HLA-II DP,DQ,DR genes

Run those scripts by specifying the relative path ../bin/blastall or ../bin/formatdb in the shell scripts AND config file "ncbiBlastConfig.txt". Make sure HPRA and HPTASR are running in same-level directories to ../bin/ (eg. ../test-demo/)

8. Before running on your data, inspect the ./test-demo/ folder and familiarize yourself with the files and the execution.
9. When you are ready and the demo works well, place the fullpath location of your short read fastq or fasta files in the "patient.fof" file.
10. Make sure that the following files are in your working directory:

patient.fof
ncbiBlastConfig.txt (specific to the version of blast you are using, see in
../bin and ../test-demo directories

### ADDITIONAL INSTALL NOTES ON MAC OSX

#### install bioperl
---------------
sudo perl -MCPAN -e shell
install CJFIELDS/BioPerl-1.6.923.tar.gz
change shebang line in all PERL (.pl) scripts and location of bioperl on your system in HLAminer.pl

#### install ncbi blast
------------------
download and install blast-2.2.22-universal-macosx.tar.gz
change path to blast in ncbiconfig.txt

#### install homebrew
----------------


ruby -e "$(curl -fsSL
https://raw.githubusercontent.com/Homebrew/install/master/install)"

Since databases were indexed with older
version, had to re-index:
bwa index -a is HLA_ABC_CDS.fasta

change path to bwa in HPRA* shell scripts

TEST YOUR INSTALL BY RUNNING THE SCRIPT IN test-demo. CONSULT THE README-FIRST.txt FILE

### COMMANDS AND OPTIONS
--------

Usage: ../bin/HLAminer.pl [v1.4]
Derivation of HLA class I and II predictions from shotgun sequence datasets
--------------------------------------------------------------------
HPTASR (HLA Predictions by Targeted Assembly of Shotgun Reads):
-b blastn alignments.........................
-r reciprocal blastn.........................
-c contig fasta file.........................
-z minimum contig size.......................<200>
------------------------------- OR ---------------------------------
HPRA (HLA Predictions by Read Alignment):
-a sam alignments............................< -a ngs_vs_hla.sam > or < -a stream >
-e single-end reads used (1=yes/0=no)........<0>
--------------------------------------------------------------------
-h hla fasta file............................
-p P-designation file........................
-i minimum % sequence identity...............<99>
-q minimum log10 (phred-like) expect value...<30>
-s minimum score.............................<1000>
-n consider null alleles (1=yes/0=no)........<0>
-l label (run name) -optional-

The shell scripts are set to filter out short (<200) contigs that could blur HPTASR predictions. Feel free to adjust as you see fit.

Likewise, HLAminer.pl runs with the set defaults:
-z minimum contig size.......................<200> (HPTASR)
-i minimum % sequence identity...............<99> (HPTASR / HPRA)
-q minimum log10 (phred-like) expect value...<30> (HPTASR / HPRA)
-s minimum score.............................<1000> (HPTASR / HPRA)
-n consider null alleles (1=yes/0=no)........<0> (HPTASR / HPRA)

The minimum sequence identity applies to the short read paired alignment or blast alignment, depending on the choice made. HLA predictions with a phred-like expect value lower than -q or a score lower than -s will not be diplayed. Because IMGT/HLA reports numerous null alleles, an option exist to consider or not these unexpressed alleles.

Likewise, TASR-based (Targeted Assembly of Short Reads) predictions could be
improved by using larger k values for assembly (-k). Experimentation for
choosing the ideal k to use depends on the input read length and is warranted.

### PREDICTING FROM LONG (NANOPORE/PACBIO) READS
--------

HLAminer v1.4 provides initial support for HLA prediction from raw uncorrected shotgun nanopore long reads (such as those from Oxford Nanopore Technologies).
HLAminer v1.4 implements a streaming approach to reading .sam alignment files, supporting the alignment of GB worth of read data in a few hours and predictions within seconds of alignment completion, without saving costly .sam files to disk.

*More information here:*
Warren RL, Birol I. 2025. Streaming long-read sequence alignments for HLA
predictions using HLAminer. Current Protocols. https://doi.org/10.1002/cpz1.70124

We tested the software on the NA19240 WGS promethion dataset (2018, older chemistry).
https://gigabaseorgigabyte.wordpress.com/2018/05/24/promethion-human-genome-na19240/
with data available from ENA at this location:
https://www.ebi.ac.uk/ena/data/view/PRJEB26791

1) First, we downloaded read files (ERR2585112 and ERR2585115 from the ENA)


nohup wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR258/005/ERR2585115/ERR2585115.fastq.gz &
nohup wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR258/002/ERR2585112/ERR2585112.fastq.gz &

2) Then, we predicted HLA by running minimap2 and HLAminer v1.4:

/usr/bin/time -v -o minimap_hlaminerERR2585115-1mod.time minimap2 -t 60 -ax map-ont --MD ../database/GCA_000001405.15_GRCh38_genomic.chr-only-noChr6-HLA-I_II_GEN.fa.gz ERR2585115.fastq.gz | ./HLAminer.pl -h ../database/HLA-I_II_GEN.fasta -s 500 -q 1 -i 1 -p ../database/hla_nom_p.txt -a stream

-or-

/usr/bin/time -v -o minimap_hlaminerERR2585115-2mod.time minimap2 -t 60 -ax map-ont --MD ../database/GCA_000001405.15_GRCh38_genomic.chr-only-chr6hlamasked-HLA-I_II_GEN.fa.gz ERR2585115.fastq.gz | ./HLAminer.pl -h ../database/HLA-I_II_GEN.fasta -s 500 -q 1 -i 1 -p ../database/hla_nom_p.txt -a stream


A test is provided at ./test-demo/HPRAwgs_ONTclassI-IIdemo.sh

The output files:
HLAminer_HPRA_ERR2585115.csv
-and-
HLAminer_HPRA_ERR2585115.log

you generate should be comparable* to:
HLAminer_HPRA_ERR2585115_test.csv
-and-
HLAminer_HPRA_ERR2585115_test.log

*exact scores are not to be expected, because they vary based on the HLA sequence database used, which differ from different release of the code (and your own updating of HLA sequence databases, which is highly recommended to do before you use HLAminer).

Results below demonstrate HLAminer's ability to fairly accurately predict HLA-I and -II types from direct [and streamed] nanopore sequencing reads [old chemistry] alignments

HLAminer PREDICTIONS (top 2 per HLA allele group, by high-score):
--------------------

A*30:106/A*68:02P
B*35:01P/B*57:01P ex. B*57:03P
C*04:01P/C*18:01P
F*01:01P
G*01:01P

DQA1*01:02P/DQA1*05:01P
DQB1*05:02P/DQB1*03:03P ex. DQB1*03:01P
DRB1*12:01P/DRB1*16:02P
DPA1*02:02P/DPA1*01:03P
DPB1*90:01P/DPB1*01:01P
DRA*01:01P

REPORTED TYPES (extracted from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5804087/#MOESM1 Supp file 1 13059_2018_1388_MOESM1_ESM.xlsx Tab S6, NA19240 (Child)):
--------------

A*30:01:01G/A*68:02:01G
B*35:01:01G/B*57:03:01G
C*04:01:01G/C*18:01:01G
F*01:01:01:08;F*01:01:01:09;F*01:02;F*01:01:01:10;F*01:03:01:01;F*01:01:01:11;F*01:03:01:02;F*01:01:01:12;F*01:01:02:01;F*01:01:02:02;F*01:01:02:03;F*01:01:02:04;F*01:01:02:05;F*01:01:02:06;F*01:01:01:01;F*01:01:01:02;F*01:01:01:03;F*01:01:01:04;F*01:01:01:05;F*01:01:01:06;F*01:01:01:07/F*01:01:01:08;F*01:01:01:09;F*01:02;F*01:01:01:10;F*01:03:01:01;F*01:01:01:11;F*01:03:01:02;F*01:01:01:12;F*01:01:02:01;F*01:01:02:02;F*01:01:02:03;F*01:01:02:04;F*01:01:02:05;F*01:01:02:06;F*01:01:01:01;F*01:01:01:02;F*01:01:01:03;F*01:01:01:04;F*01:01:01:05;F*01:01:01:06;F*01:01:01:07
G*01:06;G*01:01:02:01;G*01:01:02:02;G*01:18;G*01:08:01;G*01:01:18;G*01:01:19/G*01:05N
H*02:05/H*02:05
J*02:01;J*01:01:01:05;J*01:01:01:04;J*01:01:01:03;J*01:01:01:02;J*01:01:01:01/J*02:01;J*01:01:01:05;J*01:01:01:04;J*01:01:01:03;J*01:01:01:02;J*01:01:01:01 L*01:01:02;L*01:01:01:01;L*01:01:01:02;L*01:01:01:03/L*01:02

DQA1*01:02:01G/DQA1*05:01:01G
DQB1*05:02:01G/DQB1*03:01:01G
DRB1*12:01:01G/DRB1*16:02:01G
DPA1*02:01:01G/DPA1*02:02:02G
DPB1*01:01:01G/DPB1*01:01:01G
DOA*01:01:05/DOA*01:01:02:03;DOA*01:01:02:02;DOA*01:01:02:01;DOA*01:01:04:02;DOA*01:01:04:01 DMA*01:01:01:02;DMA*01:01:01:01;DMA*01:04;DMA*01:03;DMA*01:02;DMA*01:01:01:04;DMA*01:01:01:03/DMA*01:01:01:02;DMA*01:01:01:01;DMA*01:04;DMA*01:03;DMA*01:02;DMA*01:01:01:04;DMA*01:01:01:03
DMB*01:02;DMB*01:01:01:04;DMB*01:01:01:03;DMB*01:01:01:02;DMB*01:03:01:04;DMB*01:01:01:01;DMB*01:03:01:03;DMB*01:03:01:02;DMB*01:03:01:01;DMB*01:05;DMB*01:04/DMB*01:02;DMB*01:01:01:04;DMB*01:01:01:03;DMB*01:01:01:02;DMB*01:03:01:04;DMB*01:01:01:01;DMB*01:03:01:03;DMB*01:03:01:02;DMB*01:03:01:01;DMB*01:05;DMB*01:04
DRA*01:01:01:01;DRA*01:02:01;DRA*01:01:01:02;DRA*01:02:02;DRA*01:01:01:03;DRA*01:02:03;DRA*01:01:02/DRA*01:01:01:01;DRA*01:02:01;DRA*01:01:01:02;DRA*01:02:02;DRA*01:01:01:03;DRA*01:02:03;DRA*01:01:02

For more information, please refer to:
[![link](https://img.shields.io/badge/HLAminer-preprint-brightgreen)](https://doi.org/10.48550/arXiv.2209.09155)
[![link](https://img.shields.io/badge/HLAminer-protocol-brightgreen)](https://doi.org/10.1002/cpz1.70124)

### REFERENCE SEQUENCE FOR LONG-READ ALIGNMENTS
--------
#### Use of HG38 reference for long-read alignments with HLAminer

Here’s why we originally used HG38 without chr6 for alignments with long reads:

✅ 1. Excluding chr6 and replacing with complete HLA alleles improves allele resolution
The classical HLA loci (HLA-A, -B, -C, -DRB1, etc.) on chr6 are highly polymorphic and not well-represented in a linear reference.
Mapping short or long reads to the limited HLA representation in GRCh38 often results in ambiguous or incorrect alignment, especially for novel or rare alleles.
By using all known HLA alleles (from IPD-IMGT/HLA) in a dedicated contig, you're allowing your mapper (e.g. minimap2) to consider each allele as an independent path — effectively emulating a graph reference without the overhead.

✅ 2. Keeping the rest of GRCh38 intact ensures genomic compatibility
Most reads outside of the MHC region still align correctly to the standard GRCh38 chromosomes.


GCA_000001405.15_GRCh38_genomic.chr-only-noChr6-HLA-I_II_GEN.fa.gz

Improvements to the reference sequence used with long-read alignments (April 2025):

✅ 1. Masking of chr6 HLA locus (HG38 coordinates 28510120-33480577). This range is the standard extended (Class I, II, and III regions and flanking genes+regulatory regions relevant to immune function) MHC region definition from 1000 Genomes and GRC and it's masked in many genomics pipelines to improve mapping accuracy and downstream interpretation. https://www.ncbi.nlm.nih.gov/grc/human/regions/MHC?asm=GRCh38


GCA_000001405.15_GRCh38_genomic.chr-only-chr6hlamasked-HLA-I_II_GEN.fa.gz

Notes:


-Custom references are better suited to HLA inference than any static genome, including hs38DH, and the approach aligns with best practices in immunogenomics.
-It's crucial for you to keep the HLA allele set up-to-date, and update+document+date those files frequently (see the database folder README and the DATABASES section below for details):


GCA_000001405.15_GRCh38_genomic.chr-only-noChr6-HLA-I_II_GEN.fa.gz
GCA_000001405.15_GRCh38_genomic.chr-only-chr6hlamasked-HLA-I_II_GEN.fa.gz

### DATABASES
--------

Follow these instructions to download updated HLA sequences from ebi/imgt (shell scripts to automatically download and format the databases exist in ./database/) and refer to README.txt in the ./database directory:

1) Coding HLA sequences

HLA CDS sequences from:


wget ftp://ftp.ebi.ac.uk/pub/databases/imgt/mhc/hla/fasta/A_nuc.fasta
wget ftp://ftp.ebi.ac.uk/pub/databases/imgt/mhc/hla/fasta/B_nuc.fasta
wget ftp://ftp.ebi.ac.uk/pub/databases/imgt/mhc/hla/fasta/C_nuc.fasta
cat A_nuc.fasta B_nuc.fasta C_nuc.fasta | perl -ne 'chomp;if(/\>\S+\s+(\S+)/){print ">$1\n";}else{print "$_\n";}' > HLA_ABC_CDS.fasta
../bin/formatdb -p F -i HLA_ABC_CDS.fasta
/home/pubseq/BioSw/bwa/bwa-0.5.9/bwa index -a is HLA_ABC_CDS.fasta

2) HLA genomic sequences

To make the HLA genomic sequence database, execute these unix commands:


wget ftp://ftp.ebi.ac.uk/pub/databases/imgt/mhc/hla/fasta/A_gen.fasta
wget ftp://ftp.ebi.ac.uk/pub/databases/imgt/mhc/hla/fasta/B_gen.fasta
wget ftp://ftp.ebi.ac.uk/pub/databases/imgt/mhc/hla/fasta/C_gen.fasta
cat A_gen.fasta B_gen.fasta C_gen.fasta | perl -ne 'chomp;if(/\>\S+\s+(\S+)/){print ">$1\n";}else{print "$_\n";}' > HLA_ABC_GEN.fasta
../bin/formatdb -p F -i HLA_ABC_GEN.fasta
/home/pubseq/BioSw/bwa/bwa-0.5.9/bwa index -a is HLA_ABC_GEN.fasta

FOR YOUR CONVENIENCE, A SINGLE SHELL SCRIPT CAN BE RUN FROM ../database TO
UPDATE ALL IMGT-HLA SEQUENCE DATABASES AND CREATE BWA/BLAST INDEXES. JUST KEEP
IN MIND THAT THIS IS DONE WITH A SPECIFIC VERSION OF THESE TOOLS, IF YOU
UPGRADE, YOU WILL HAVE TO REGENERATE THE INDEXES
******************************
***../database/updateAll.sh***
******************************

Note: For alignment of nanopore genomic/transcriptomic reads, we recommend aligning to a file comprised of human genome chromosomes and HLA-I and -II alleles to reduce the noise in alignments (especially when running minimap2). If you are predicting from direct ONT (nanopore) or PacBio long read alignments, please update the genome files (refer to the above section "Use of HG38 reference for long-read alignments with HLAminer"):

Genome file (without Chromosome 6):


https://www.bcgsc.ca/downloads/btl/hlaminer/GCA_000001405.15_GRCh38_genomic.chr-only-noChr6.fa.gz

then:

cd database
cat GCA_000001405.15_GRCh38_genomic.chr-only-noChr6.fa HLA-I_II_GEN.fasta | pigz - > GCA_000001405.15_GRCh38_genomic.chr-only-noChr6-HLA-I_II_GEN.fa.gz
cat GCA_000001405.15_GRCh38_genomic.chr-only-noChr6.fa HLA-I_II_CDS.fasta | pigz - > GCA_000001405.15_GRCh38_genomic.chr-only-noChr6-HLA-I_II_CDS.fa.gz

Genome file (without HLA region on Chromosome 6):


https://www.bcgsc.ca/downloads/btl/hlaminer/GCA_000001405.15_GRCh38_genomic.chr-only-chr6hlamasked.fa.gz

then:

cd database
cat GCA_000001405.15_GRCh38_genomic.chr-only-chr6hlamasked.fa HLA-I_II_GEN.fasta | pigz - > GCA_000001405.15_GRCh38_genomic.chr-only-chr6hlamasked-HLA-I_II_GEN.fa.gz
cat GCA_000001405.15_GRCh38_genomic.chr-only-chr6hlamasked.fa HLA-I_II_CDS.fasta | pigz - > GCA_000001405.15_GRCh38_genomic.chr-only-chr6hlamasked-HLA-I_II_CDS.fa.gz

or download those files (JAN/APR 2025 update) directly from:
https://www.bcgsc.ca/downloads/btl/hlaminer/

3) P designation files

Upgrade the P designation info from:

info:
http://hla.alleles.org/wmda/index.html

file:
http://hla.alleles.org/wmda/hla_nom_p.txt

### OUTPUT FILES
--------

HLA predictions from read pair alignments:


HLAminer_HPRA.log
HLAminer_HPRA.csv

HLA predictions from targeted assemblies:

HLAminer_HPTASR.log
HLAminer_HPTASR.csv

The .log file tracks the process of HLA mining. It contains the following information:
-HLAminer command and parameters utilized
-Contig/read pair alignment output and best HLA hit for each
-Initial gene summary, score and expect value
-Final summary, listing all predictions by highest score (more likely).

The .csv file contains HLAminer predictions. Predictions are listed by HLA
gene and ranked by highest score. Predictions 1) and 2) are expected to
represent the two alleles for each.

eg.


----------------------------------------------------------------------
SUMMARY
MOST LIKELY HLA-I ALLELES (Confidence (-10 * log10(Eval)) >= 30, Score >= 500)
Allele,Score,Expect (Eval) value,Confidence (-10 * log10(Eval))
----------------------------------------------------------------------

HLA-A
Prediction #1 - A*26
A*26:33,4179,5.22e-124,1232.8

Prediction #2 - A*33
A*33:24,1791,2.41e-75,746.2

Prediction #3 - A*68
A*68:05,597,1.85e-10,97.3

From these predictions, the individual is expected to be heterozygous with HLA-I A alleles A*26 and
A*33. The chance, expect value and confidence are different representations of the same
metric. The Confidence represents the Expect value - Eval - as a score in a
manner analoguous to the phred score employed in sequencing to quickly assess
the likelihood of a base being correct.

Predictions/read pair are ambiguous when there are multiple predicted allele groups and/or protein coding alleles with the same score.

### AUTHORS
--------


Rene Warren

### CITING
--------

Thank you for your [![Stars](https://img.shields.io/github/stars/warrenlr/HLAminer.svg)](https://github.com/warrenlr/HLAminer/stargazers) and for using, developing and promoting this free software!

If you use HLAminer for you research, please cite:


Warren RL, Choe G, Freeman DJ, Castellarin M, Munro S, Moore R, Holt RA. (2012)
Derivation of HLA types from shotgun sequence datasets
Genome Med. 4, 95. https://doi.org/10.1186/gm396

[![link](https://img.shields.io/badge/HLAminer-manuscript-brightgreen)](https://doi.org/10.1186/gm396)

and


Warren RL, Birol I. (2025)
Streaming long-read sequence alignments for HLA predictions using HLAminer
Current Protocols. 5, e70124. https://doi.org/10.1002/cpz1.70124

[![link](https://img.shields.io/badge/HLAminer-protocol-brightgreen)](https://doi.org/10.1002/cpz1.70124)

### LICENSE AGREEMENT
-----------------------------------------------------------
BC CANCER AGENCY SOFTWARE LICENSE AGREEMENT (ACADEMIC USE)

CAREFULLY READ THE FOLLOWING TERMS AND CONDITIONS. This License
Agreement (the "Agreement") is a legal contract between you, your
employer, educational institution or organization (collectively, "You")
and the British Columbia Cancer Agency ("BCCA") with respect to the
license of the software, including all associated documentation
(collectively, the "Product").

BCCA is willing to license the Product to You only if You accept the
terms and conditions of this Agreement. By clicking on the "I ACCEPT"
button, or by copying, downloading, accessing or otherwise using the
Product, You automatically agree to be bound by the terms of this
Agreement. IF YOU DO NOT WISH TO BE BOUND BY THE TERMS OF THIS
AGREEMENT, DO NOT COPY, DOWNLOAD, ACCESS OR OTHERWISE USE THE
PRODUCT.

1. AUTHORITY: In the event that You are an educational institution or
organization, Your representative who is clicking the "I ACCEPT"
button, or otherwise copying, downloading, accessing or using the
Product hereby, in their personal capacity, represents and warrants
that they possess the legal authority to enter into this Agreement
on Your behalf and to bind You to the terms of this Agreement.

2. LICENSE TO USE: BCCA hereby grants to You a personal, non-exclusive,
non-transferable, limited license to use the Product solely for
internal, non-commercial use for non-profit research or educational
purposes only on the terms and conditions contained in this Agreement.
The Product may be installed at a single site at Your premises only. A
copy of the Product installed on a single common machine or cluster of
machines may be shared for internal use by Qualified Users only. In
order to be a "Qualified User", an individual must be a student,
researcher, professor, instructor or staff member of a non-profit
educational institution or organization who uses the Product solely for
non-profit research or educational purposes.

3. RESTRICTIONS: You acknowledge and agree that You shall not, and
shall not authorize any third party to:
(a) make copies of the Product, except as provided in Section 2 and
except for a single backup copy, and any such copy together with the
original must be kept in Your possession or control;
(b) modify, adapt, decompile, disassemble, translate into another
computer language, create derivative works of, or otherwise reverse
engineer the Product, or disclose any trade secrets relating to the
Product, except as permitted in Section 5;
(c) license, sublicense, distribute, sell, lease, transfer, assign,
trade, rent or publish the Product or any part thereof and/or copies
thereof, to any third party;
(d) use the Product to process any data other than Your own;
(e) use the Product or any part thereof for any commercial or
for-profit purpose or any other purpose other than as permitted in
Section 2; or
(f) use, without its express permission, the name of BCCA.

4. INTELLECTUAL PROPERTY RIGHTS: Subject to Section 5 below, all
patents, copyrights, trade secrets, service marks, trademarks and
other proprietary rights in or related to the Product and any
improvements, modifications and enhancements thereof are and will
remain the exclusive property of BCCA or its licensors. You agree
that You will not, either during or after the termination of this
Agreement, contest or challenge the title to or the intellectual
property rights of BCCA or its licensors in the Product or any
portion thereof.

5. OWNERSHIP OF IMPROVEMENTS: In the event that the Product, in the
form provided to You, includes source code (the "Source Code"),
You are entitled to make improvements, modifications and
enhancements to the Source Code (collectively, "Improvements")
which Improvements are to be used by You for non-profit research
and educational purposes only and You shall be the owner of those
Improvements that You directly make and of all intellectual
property rights to such Improvements, subject to the foregoing
limits on Your use and distribution of such Improvements. You
hereby grant to BCCA a perpetual, non-exclusive, worldwide,
fully-paid, irrevocable license to use such Improvements for any
purposes whatsoever, and to sublicense such Improvements including
the right for third parties to sublicense the same, in perpetuity
to the extent such rights are not limited in duration under
applicable law, without identifying or seeking Your
consent. Notwithstanding the foregoing, You acknowledge that BCCA
and its licensors will retain or own all rights in and to any
pre-existing code or other technology, content and data that may be
incorporated in the Improvements. For greater certainty, this
Section applies solely to the Source Code and shall not give You
any rights with respect to the object code or any other portion or
format of the Product which use, for greater certainty, is limited
as set forth in this Agreement including as set out in Section 3(b)
above. You acknowledge and agree that you will provide copies of
Improvements to BCCA in such format as reasonably requested by BCCA
at any time upon the request of BCCA.

6. CONFIDENTIALITY: You acknowledge that the Product is and
incorporates confidential and proprietary information developed,
acquired by or licensed to BCCA. You will take all reasonable
precautions necessary to safeguard the confidentiality of the
Product, and will not disclose any information about the Product to
any other person without BCCA's prior written consent. You will
not allow the removal or defacement of any confidential or
proprietary notice placed on the Product. You acknowledge that any
breach of this Section 6 will cause irreparable harm to BCCA and
its licensors.

7. NO WARRANTIES: THIS PRODUCT IS PROVIDED TO YOU BY BCCA IN ORDER TO
ALLOW YOU TO OBTAIN ACCESS TO LEADING ACADEMIC RESEARCH. THE PRODUCT
IS PROVIDED TO YOU ON AN "AS IS" BASIS WITHOUT WARRANTY OF ANY
KIND. NO WARRANTY, REPRESENTATION OR CONDITION EITHER EXPRESS OR
IMPLIED, INCLUDING WITHOUT LIMITATION, ANY IMPLIED WARRANTY OR
CONDITION OF MERCHANTABILITY, NON-INFRINGEMENT, PERFORMANCE,
DURABILITY OR FITNESS FOR A PARTICULAR PURPOSE OR USE SHALL
APPLY. BCCA DOES NOT WARRANT THAT THE PRODUCT WILL OPERATE ON A
CONTINUOUS OR TROUBLE FREE BASIS.

8. LIMITATION OF LIABILITY: TO THE MAXIMUM EXTENT PERMITTED BY
APPLICABLE LAW, IN NO EVENT SHALL THE AGGREGATE LIABILITY OF BCCA TO
YOU EXCEED THE AMOUNT YOU HAVE PAID TO ACQUIRE THE PRODUCT ("MAXIMUM
AMOUNT") AND WHERE YOU HAVE NOT PAID ANY AMOUNT FOR THE PRODUCT THEN
THE MAXIMUM AMOUNT SHALL BE DEEMED TO BE CDN$100.00. IN NO EVENT SHALL
BCCA BE LIABLE FOR ANY INDIRECT, INCIDENTAL, CONSEQUENTIAL, OR SPECIAL
DAMAGES, INCLUDING WITHOUT LIMITATION ANY DAMAGES FOR LOST PROFITS OR
SAVINGS, REGARDLESS OF WHETHER THEY HAVE BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE. EXCEPT TO THE EXTENT THAT THE LAWS OF A
COMPETENT JURISDICTION REQUIRE LIABILITIES BEYOND AND DESPITE THESE
LIMITATIONS, EXCLUSIONS AND DISCLAIMERS, THESE LIMITATIONS, EXCLUSIONS
AND DISCLAIMERS SHALL APPLY WHETHER AN ACTION, CLAIM OR DEMAND ARISES
FROM A BREACH OF WARRANTY OR CONDITION, BREACH OF CONTRACT,
NEGLIGENCE, STRICT LIABILITY OR ANY OTHER KIND OF CIVIL OR STATUTORY
LIABILITY CONNECTED WITH OR ARISING FROM THIS AGREEMENT. YOU AGREE
THAT THE FOREGOING DISCLAIMER OF WARRANTIES AND LIMITATION OF
LIABILITY ARE FAIR IN LIGHT OF THE NATURE OF THE RIGHTS GRANTED HEREIN
AND THE AMOUNT OF FEES PAID BY YOU IN RESPECT OF THE PRODUCT.

9. INDEMNITY: You will indemnify, defend and hold harmless BCCA, its
board of directors, staff and agents from and against any and all
liability, loss, damage, action, claim or expense (including
attorney's fees and costs at trial and appellate levels) in
connection with any claim, suit, action, demand or judgement
(collectively, "Claim") arising out of, connected with, resulting
from, or sustained as a result of Your use of the Product or the
downloading of the Product, including without limitation, any Claim
relating to infringement of BCCA's intellectual property rights or
the intellectual property rights of any third party.

10. SUPPORT AND MAINTENANCE: You acknowledge and agree that, unless
and to the extent expressly agreed by BCCA in a separate written
document, the Product is provided to You without any support or
maintenance from BCCA and, for greater certainty, BCCA shall have
no obligation to issue any update or upgrade to any Product.

11. TERM: This Agreement is effective until terminated. You may
terminate this Agreement at any time by ceasing use of the Product
and destroying or deleting any copies of the Product. This
Agreement will terminate immediately without notice from BCCA if
You fail to comply with any provision of this Agreement. BCCA may
terminate this Agreement at any time upon notice to you where BCCA
determines, in its sole discretion, that any continued use of the
Product could infringe the rights of any third parties. Upon
termination of this Agreement, and in any event upon BCCA
delivering You notice of termination, You shall immediately purge
all Products from Your computer system(s), return to BCCA all
copies of the Product that are in Your possession or control, and
cease any further development of any Improvements. On any
termination of this Agreement Sections 1, 4, 6, 7, 8, 9, 13 and 14
shall survive such termination.

12. GOVERNMENT END USERS: Where any of the Product is used, duplicated
or disclosed by or to the United States government or a government
contractor or sub contractor, it is provided with RESTRICTED
RIGHTS as defined in Title 48 CFR 52.227-19 and is subject to the
following: Title 48 CFR 2.101, 52.227-19, 227.7201 through
227.7202-4, FAR 52.227-14, and FAR 52.227-19(c)(1-2) and (6/87),
and where applicable, the customary software license, as described
in Title 48 CFR 227-7202 with respect to commercial software and
commercial software documentation including DFAR 252.227-7013,
DFAR 252,227-7014, DFAR 252.227-7015 and DFAR 252.7018, all as
applicable.

13. USE OF THE DOWNLOAD SERVICE: You acknowledge and agree that you
will be responsible for all costs, charges and taxes (where
applicable) arising out of Your use of the Product and the
downloading of the Product. You acknowledge that You are
responsible for supplying any hardware or software necessary to
use the Product pursuant to this Agreement.

14. GENERAL PROVISIONS:
(a) This Agreement will be governed by the laws of the Province of
British Columbia, and the laws of Canada applicable therein, excluding
any rules of private international law that lead to the application of
the laws of any other jurisdiction. The United Nations Convention on
Contracts for the International Sale of Goods (1980) does not apply to
this Agreement. The courts of the Province of British Columbia shall
have non-exclusive jurisdiction to hear any matter arising in
connection with this Agreement.
(b) USE OF THE PRODUCT IS PROHIBITED IN ANY JURISDICTION WHICH DOES
NOT GIVE EFFECT TO THE TERMS OF THIS AGREEMENT.
(c) You agree that no joint venture, partnership, employment,
consulting or agency relationship exists between You and BCCA as a
result of this Agreement or Your use of the Product.
(d) You hereby consent to Your contact information and any other
personally identifiable information that You provide to us being
disclosed to and maintained and used by us and our business partners
for the purposes of (i) managing and developing our respective
businesses and operations; (ii) marketing products and services to You
and your staff; and (iii) developing new and enhancing existing
products. You further agree that we may provide this information to
other persons as required to satisfy any legal requirements and to any
person that acquires some or all of the assets of BCCA. Where any of
the personally identifiable information that You provide to us is in
respect of individuals other than Yourself (such as Your staff) then
You represent and warrant to use that You have obtained all necessary
consents and authorizations from such individuals in order to comply
with this provision. Please see the BCCA website for further
information regarding personally identifiable information.
(e) This Agreement is the entire Agreement between You and BCCA
relating to this subject matter. You will not contest the validity of
this Agreement merely because it is in electronic form. No
modification of this Agreement will be binding, unless in writing and
accepted by an authorized representative of each party.
(f) The provisions of this Agreement are severable in that if any
provision in the Agreement is determined to be invalid or
unenforceable under any controlling body of law, that will not affect
the validity or enforceability of the remaining provisions of the
Agreement.
(g) You agree to print out or download a copy of this Agreement and
retain it for Your records.
(h) You consent to the use of the English language in this Agreement.
(i) You may not assign this Agreement or any of Your rights or
obligations hereunder without BCCA's prior written consent. BCCA, at
its sole discretion may assign this Agreement without notice to You.

For commercial licensing options, please contact Patrick Rebstein prebstein@bccancer.bc.ca
-----------------------------------------------------------