{"id":37138536,"url":"https://github.com/uio-bmi/compairr","last_synced_at":"2026-01-16T10:13:05.446Z","repository":{"id":45320216,"uuid":"353091132","full_name":"uio-bmi/compairr","owner":"uio-bmi","description":"Comparison of Adaptive Immune Receptor Repertoires","archived":false,"fork":false,"pushed_at":"2025-01-28T12:16:05.000Z","size":328,"stargazers_count":26,"open_issues_count":4,"forks_count":5,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-01-28T13:33:45.747Z","etag":null,"topics":["airr","bioinformatics","immune-repertoire","immunoinformatics","immunology","rep-seq","repertoire-analysis"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/uio-bmi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-03-30T17:47:43.000Z","updated_at":"2025-01-28T12:16:09.000Z","dependencies_parsed_at":"2025-01-28T13:30:27.599Z","dependency_job_id":"d8c1fca0-998b-40c9-989f-42c063fcb4ad","html_url":"https://github.com/uio-bmi/compairr","commit_stats":null,"previous_names":[],"tags_count":27,"template":false,"template_full_name":null,"purl":"pkg:github/uio-bmi/compairr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uio-bmi%2Fcompairr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uio-bmi%2Fcompairr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uio-bmi%2Fcompairr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uio-bmi%2Fcompairr/manifests"
,"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/uio-bmi","download_url":"https://codeload.github.com/uio-bmi/compairr/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uio-bmi%2Fcompairr/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28478049,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T06:30:42.265Z","status":"ssl_error","status_checked_at":"2026-01-16T06:30:16.248Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airr","bioinformatics","immune-repertoire","immunoinformatics","immunology","rep-seq","repertoire-analysis"],"created_at":"2026-01-14T16:00:27.200Z","updated_at":"2026-01-16T10:13:05.431Z","avatar_url":"https://github.com/uio-bmi.png","language":"C++","readme":"[![](https://img.shields.io/static/v1?label=AIRR-C%20sw-tools%20v1\u0026message=compliant\u0026color=008AFF\u0026labelColor=000000\u0026style=plastic)](https://docs.airr-community.org/en/stable/swtools/airr_swtools_standard.html)\n\n# CompAIRR\n\nCompAIRR (`compairr`) is a command line tool to compare two sets of\nadaptive immune receptor repertoires and compute their overlap. It can\nalso identify which sequences are present in which repertoires.\nFurthermore, CompAIRR can cluster the sequences in a repertoire\nset. Sequence comparisons can be exact or approximate. 
CompAIRR has\nbeen shown to be very fast and to have a small memory footprint\ncompared to similar tools, when up to 2 differences are allowed.\n\n\n## Installation\n\nThe code is C++11 standard compliant and should compile easily using\n`make` and a modern C++ compiler (e.g. GNU GCC or LLVM Clang). Run\n`make clean`, `make`, `make test` and `make install` in the main\nfolder to clean, build, test and install the tool. There are no\ndependencies except for the C and C++ standard libraries.\n\nBinaries for Linux (x86_64) and macOS (x86_64 and Arm64) are also\ndistributed with each\n[release](https://github.com/uio-bmi/compairr/releases/latest).\n\nA `Dockerfile` is included if you want to make a Docker image.  A\ndocker image may be built with the following command:\n\n```sh\ndocker build -t compairr .\n```\n\nReady-made Docker images for CompAIRR can be found on the\n[Docker Hub](https://hub.docker.com/r/torognes/compairr).\n\nCompAIRR can be installed on macOS using homebrew with\n`brew install torognes/bioinf/compairr`.\n\n\n## Tutorial\n\nFor an introduction to how to use CompAIRR, please have a look at the\n[CompAIRR tutorial](https://github.com/LonnekeScheffer/compairr-tutorial).\n\n\n## General options\n\nUse the `-h` or `--help` option to show some help information.\n\nRun the program with `-v` or `--version` for version information.\n\nThe type of operation that should be performed is specified with one\nof the options `-m`, `-x`, `-c` or `-z` (or the corresponding long option\nforms `--matrix`, `--existence`, `--cluster`, or `--deduplicate`).\n\nThe code is multi-threaded. 
The number of threads may be specified\nwith the `-t` or `--threads` option.\n\nThe results will be written to standard out (stdout) unless a file\nname has been specified with the `-o` or `--output-file` option.\n\nWhile the program is running it will print some status and progress\ninformation to standard error (stderr) unless a log file has been\nspecified with the `-l` or `--log` option. Error messages and warnings\nwill also be written here.\n\nThe default is to compare amino acid sequences, but nucleotide\nsequences are compared if the `-n` or `--nucleotides` option is given.\nThe accepted amino acid symbols are `ACDEFGHIKLMNPQRSTVWY`, while the\naccepted nucleotide symbols are `ACGTU`. Lower case letters are also\naccepted. The program will abort with an error message if any other\nsymbol is encountered in a sequence, unless one specifies the `-u` or\n`--ignore-unknown` option, in which case CompAIRR will simply ignore\nthat sequence. If the program encounters an empty sequence it will\nalso abort with an error message, unless the `-e` or `--ignore-empty`\noption is given.\n\nBy default, the sequences should be given in the `junction` or\n`junction_aa` column of the input file, for nucleotide and amino acid\nsequences, respectively. Alternatively, the sequences may be present\nin the `cdr3` or `cdr3_aa` column, if the `--cdr3` option is given.\n\nThe user can specify how many differences are allowed when comparing\nsequences, using the option `-d` or `--differences`. To allow indels\n(insertions or deletions) the option `-i` or `--indels` may be\nspecified, otherwise only substitutions are allowed. By default, no\ndifferences are allowed. The `-i` option is allowed only when d=1. The\nnumber of differences allowed strongly influences the speed of\nCompAIRR. The program will be slower as more differences\nare allowed. When d=0 or d=1 it is very fast, but it will be relatively\nslow with d=2 and even slower when d\u003e2. 
See the section on performance\nbelow for an example.\n\nThe V and J gene alleles specified for each sequence must also match,\nunless the `-g` or `--ignore-genes` option is in effect.\n\n\n## Computing overlap between two repertoire sets\n\nTo compute the overlap between two repertoire sets, use the `-m` or\n`--matrix` option.\n\nFor each of the two repertoire sets there must be an input file of\ntab-separated values formatted according to [the AIRR standard for\nrearrangements](https://docs.airr-community.org/en/stable/datarep/rearrangements.html).\nThe two input files are specified on the command line without any\npreceding option letter. If only one filename is specified on the\ncommand line, or the same filename is specified twice, it is assumed\nthat the set should be compared to itself. Each file must contain the\nrepertoire ID and either the nucleotide or the amino acid sequence of\nthe rearrangement. If the repertoire ID column is missing, all\nsequences are assumed to belong to the same repertoire (with ID 1 or\n2, respectively, for the two sets). A sequence ID may also be\nincluded. Unless they should be ignored, the V gene, the J gene, and\nthe duplicate count are also needed.\n\nEach set can contain many repertoires and each repertoire can contain\nmany sequences. The tool will find the sequences in the two sets that\nare similar and output a matrix with results.\n\nCompAIRR assumes that all sequences within each repertoire are\ndistinct, and that the abundance of each sequence is indicated in the\n`duplicate_count` field in the input file. Duplicated sequences,\ni.e. identical sequences (with the same V and J genes) within the same\nrepertoire, may lead to unexpected results. CompAIRR will warn if it\ndetects duplicates. Duplicates may be merged with the `--deduplicate`\ncommand.\n\nThe similar sequences of each repertoire in each set are found by\ncomparing the sequences and their V and J genes.  
The duplicate count\nof each sequence is taken into account and a matrix is output\ncontaining a value for each combination of repertoires in the two\nsets. The value is usually the sum of the products of the duplicate\ncounts of all pairs of sequences in the two repertoires that match. If\nthe option `-f` or `--ignore-counts` is specified, the duplicate count\ninformation is ignored and all counts are treated as 1. Instead of\nsumming the product of the counts, the ratio, min, max, or mean may be\nused if specified with the `-s` or `--score` option. The Morisita-Horn\nindex or Jaccard index will be calculated if `MH` or `Jaccard` is\nspecified with the `-s` option. These indices can only be computed\nwhen d=0.\n\nThe output will be a matrix of values in a tab-separated plain text\nfile. Two different formats can be selected. In the default format,\nthe first line contains the hash character (`#`) followed by the\nrepertoire IDs from the second set. Each of the following lines contains the\nrepertoire ID from the first set, followed by the values corresponding\nto the comparison of this repertoire with each of the repertoires in\nthe second set.\n\nAn alternative output format is used when the `-a` or `--alternative`\noption is specified. It will write the results in a three column\nformat with the repertoire ID from set 1 and set 2 in the two first\ncolumns, respectively, and the value in the third column. There will\nbe one line for each combination of repertoires in the sets. The very\nfirst line will contain a hash character (`#`) followed by the field\nnames separated by tabs.\n\nIf the `-p` or `--pairs` option is specified, CompAIRR will write\ninformation about all pairs of matching sequences to a specified TSV\nfile. Please note that such files may grow very large when there are\nmany matches. Use of multithreading may be of little use in this\ncase. The order of the lines in the file is unspecified. 
The following\ncolumns from both input files will be included in the output:\n`repertoire_id`, `sequence_id`, `duplicate_count`, `v_call`, `j_call`,\nand `junction`. The term `junction` will be replaced with\n`junction_aa`, `cdr3`, or `cdr3_aa` as appropriate. Additional columns\nfrom the input files may be copied to the pairs file using the `-k` or\n`--keep-columns` option. Multiple columns, separated by commas (but no\nspaces), may be given. A warning will be given if any of the specified\ncolumns are missing. In the header, columns from the first and second\ninput file will be suffixed by `_1` and `_2`, respectively. The\ndistance between the sequences will be included if the `--distance`\noption is included. This is usually the Hamming distance (minimum\nnumber of substitutions), unless the `--indels` (or `-i`) option is\nspecified, in which case the distance is the Levenshtein distance\n(minimum number of substitutions or indels). If only the information\nin the pairs file is required, and not the information in the matrix,\nthe storage and output of the matrix can be avoided with the\n`--no-matrix` option. This may save some memory and time if there are\nmany repertoires in the sets.\n\n\n## Analysing in which repertoires a set of sequences are present\n\nUse the option `-x` or `--existence` to analyse in which repertoires a\nset of sequences are present, and create a sequence presence matrix.\n\nTwo input files with repertoire sets in standard format must be\nspecified on the command line. The first file should contain the\ndifferent sequences to analyse. The `sequence_id` column must be\npresent in this file. If the optional `repertoire_id` column is\npresent, all those identifiers must be identical. The second file must\ncontain the repertoires to match. 
The `repertoire_id` column must be\npresent in the second file, otherwise the ID will be set to 2 for all\nsequences.\n\nCompAIRR will identify in which repertoires each sequence is present\nand will output the results either as a matrix or as a three-column\ntable (if the `-a` option is specified). The options `-d`, `-i`, `-g`,\nand `-n` (and the corresponding long option names `--differences`,\n`--indels`, `--ignore-genes`, and `--nucleotides`) will be taken into\naccount when comparing sequences.\n\nThe output will be in a similar format as when computing the overlap\n(above), but the first column will contain the `sequence_id` from the\nfirst file instead of the `repertoire_id`.\n\nThe `-p` or `--pairs` option may be specified to output all pairs of\nmatching sequences in the same way as for the overlap computation.\n\n\n## Clustering the sequences in a repertoire\n\nTo cluster the sequences in one repertoire, use the `-c` or\n`--cluster` option.\n\nOne input file in tab-separated format must be specified on the\ncommand line.\n\nThe tool will cluster the sequences using single linkage hierarchical\nclustering, according to the specified differences and indel options\n(`-d`, `--differences`, `-i`, `--indels`). The V and J gene alleles will\nbe taken into account unless the `-g` or `--ignore-genes` option is\nspecified. The options `-n` or `--nucleotides` indicate that the\ncomparison should be performed with nucleotide sequences, not amino\nacid sequences. If the repertoire ID column is missing, all\nsequences are assumed to belong to the same repertoire (with ID 1).\n\nThe output will be in a similar TSV format as the input file, but\npreceded with two additional columns. The first column will contain a\ncluster number, starting at 1. The second column will contain the size\nof the cluster. 
The subsequent columns are `repertoire_id`,\n`sequence_id`, `duplicate_count`, `v_call`, `j_call`, and `junction`\n(or `junction_aa`, `cdr3` or `cdr3_aa`, as appropriate).\n\nThe clusters are sorted by size, in descending order.\n\n\n## Deduplication\n\nThe `--deduplicate` command may be used to deduplicate a data set by\nmerging entries in the same repertoire with identical sequences and\nidentical V and J genes. This may be necessary to get correct results\nwhen computing overlaps between repertoires. Duplicates may be present\nfor instance in cases where the data set contains both nucleotide and\namino acid sequences from the same rearrangement, where the nucleotide\nsequences may be distinct while the amino acid sequences may not be,\ndue to the degeneracy of the genetic code.\n\nOne input file in TSV format must be specified on the command line.\n\nStrictly identical sequences in the same repertoire will be merged and\ntheir counts will be added together. If the `-g` or `--ignore-genes`\noption is specified, the V and J genes are ignored. The `-n` or\n`--nucleotides` option may be specified if the input is nucleotide\nsequences, otherwise amino acid sequences will be assumed. If the `-f`\nor `--ignore-counts` option is specified, the counts in the input file\nwill be ignored, and just the number of identical sequences will be\ncounted. If the repertoire ID column is missing, all sequences are\nassumed to belong to the same repertoire (with ID 1).\n\nThe output will be in a similar TSV format as the input file, with the\nfollowing columns: `repertoire_id`, `duplicate_count`, `v_call`,\n`j_call`, and `junction` (or `junction_aa`, `cdr3` or `cdr3_aa`, as\nappropriate). 
If the `-g` or `--ignore-genes` option is specified, the\n`v_call` and `j_call` columns will not be included.\n\n\n## Input files\n\nThe input files must be in tab-separated value (TSV) format according\nto the [Rearrangement\nSchema](https://docs.airr-community.org/en/stable/datarep/rearrangements.html)\nof the [AIRR standards 1.3\ndocumentation](https://docs.airr-community.org/en/stable/).\n\nThe first line must contain the header. The rest of the file must\ncontain one line per sequence. The following fields should be included:\n\n* `repertoire_id`: identifier of the repertoire\n* `sequence_id`: identifier of the sequence (optional except for the first file when using `-x` or `--existence`)\n* `duplicate_count`: number of identical copies of the same rearrangement (required unless `-f` option given)\n* `v_call`: V gene name with allele (required unless `-g` option given)\n* `j_call`: J gene name with allele (required unless `-g` option given)\n* `junction`: nucleotide sequence (required if `-n` option given and `--cdr3` option not given)\n* `junction_aa`: amino acid sequence (single letter code) (required unless `-n` or `--cdr3` options given)\n* `cdr3`: nucleotide sequence (required if both `-n` and `--cdr3` options given)\n* `cdr3_aa`: amino acid sequence (single letter code) (required if `--cdr3` option given and `-n` option not given)\n\nSee below for an example. Other fields may be included, but will be\nignored.\n\n\n## Command line option overview\n\nThe command line should look like this:\n\n```\ncompairr OPTIONS TSVFILE1 [TSVFILE2]\n```\n\nExactly one of the command options `-m`, `-x`, `-c` or `-z` (or their long forms) must be specified. Other options as indicated in the table below could also be included. 
With the `-m` and `-x` command options, the names of two tab-separated value files with repertoires must also be specified on the command line, with the `-c` command option, only one such file should be specified.\n\nShort | Long               | Argument | Default  | Description\n------|--------------------|----------|----------|-------------\n`-a`  | `--alternative`    |          |          | Output results in three-column format, not matrix\n`  `  | `--cdr3`           |          |          | Use the `cdr3` or `cdr3_aa` column instead of `junction` or `junction_aa`\n`-c`  | `--cluster`        |          |          | Cluster sequences in one repertoire\n`-d`  | `--differences`    | INTEGER  | 0        | Number of differences accepted\n`  `  | `--distance`       |          |          | Include sequence distance in pairs file\n`-e`  | `--ignore-empty`   |          |          | Ignore empty sequences\n`-f`  | `--ignore-counts`  |          |          | Ignore duplicate count information\n`-g`  | `--ignore-genes`   |          |          | Ignore V and J gene information\n`-h`  | `--help`           |          |          | Display help text and exit\n`-i`  | `--indels`         |          |          | Allow insertions or deletions\n`-k`  | `--keep-columns`   | STRING   |          | Copy given comma-separated columns to pairs file\n`-l`  | `--log`            | FILENAME | (stderr) | Log to specified file instead of stderr\n`-m`  | `--matrix`         |          |          | Compute overlap matrix between two sets\n`  `  | `--no-matrix`      |          |          | Do not keep or output any matrix\n`-n`  | `--nucleotides`    |          |          | Compare nucleotides, not amino acids\n`-o`  | `--output`         | FILENAME | (stdout) | Output results to specified file instead of stdout\n`-p`  | `--pairs`          | FILENAME | (none)   | Output matching pairs to specified file\n`-s`  | `--score`          | STRING   | product  | Sum `product`, `ratio`, `min`, `max`, or `mean`; 
or compute `MH` or `Jaccard` index\n`-t`  | `--threads`        | INTEGER  | 1        | Number of threads to use (1-256)\n`-u`  | `--ignore-unknown` |          |          | Ignore sequences including unknown residue symbols\n`-v`  | `--version`        |          |          | Display version information\n`-x`  | `--existence`      |          |          | Check existence of sequences in repertoires\n`-z`  | `--deduplicate`    |          |          | Deduplicate sequences\n\n\n## Example 1: Repertoire overlap\n\nIn this example we will compute the overlap of two repertoire sets.\n\nLet's use two simple input files. The first is `seta.tsv`:\n\n```tsv\nrepertoire_id\tsequence_id\tduplicate_count\tv_call\tj_call\tjunction\tjunction_aa\tsequence\trev_comp\tproductive\td_call\tsequence_alignment\tgermline_alignment\tv_cigar\td_cigar\tj_cigar\nA1\tR\t1\tTCRBV07-06\tTCRBJ02-01\ttgcgcgagcagcaccagccatgaacagtatttt\tCASSTSHEQYF\t\t\t\t\t\t\t\t\t\nA2\tS\t3\tTCRBV07-09\tTCRBJ01-02\ttgcgcgagcagcctgcgcgtgggcggctatggctataccttt\tCASSLRVGGYGYTF\t\t\t\t\t\t\t\t\t\n```\n\n\nThe second is `setb.tsv`:\n\n```tsv\nrepertoire_id\tsequence_id\tduplicate_count\tv_call\tj_call\tjunction\tjunction_aa\tsequence\trev_comp\tproductive\td_call\tsequence_alignment\tgermline_alignment\tv_cigar\td_cigar\tj_cigar\nB1\tT\t5\tTCRBV07-09\tTCRBJ01-02\ttgcgcgagcagcctgcgcgtgggcggctatggctataccttt\tCASSLRVGGYGYTF\t\t\t\t\t\t\t\t\t\nB1\tU\t10\tTCRBV07-09\tTCRBJ01-02\ttgcgcgagcagcctgcgcgtgggcggctttggctataccttt\tCASSLRVGGFGYTF\t\t\t\t\t\t\t\t\t\nB2\tV\t7\tTCRBV07-06\tTCRBJ02-01\ttgcgcgagcagcaccagccatcagcagtatttt\tCASSTSHQQYF\t\t\t\t\t\t\t\t\t\n```\n\nWe run the following command:\n\n`compairr -m seta.tsv setb.tsv -d 1 -o output.tsv -p pairs.tsv`\n\nHere is the output to the console:\n\n```\nCompAIRR 1.7.0 - Comparison of Adaptive Immune Receptor Repertoires\nhttps://github.com/uio-bmi/compairr\n\nStart time:        Thu Mar 03 12:29:32 CET 2022\nCommand (m/c/x):   Overlap (-m)\nRepertoire set 1:  seta.tsv\nRepertoire 
set 2:  setb.tsv\nNucleotides (n):   No\nDifferences (d):   1\nIndels (i):        No\nIgnore counts (f): No\nIgnore genes (g):  No\nIgn. unknown (u):  No\nThreads (t):       1\nOutput file (o):   output.tsv\nOutput format (a): Matrix\nScore (s):         Sum of products of counts\nPairs file (p):    pairs.tsv\nLog file (l):      (stderr)\n\nImmune receptor repertoire set 1\n\nReading sequences: 100% (0s)\nRepertoires:       2\nSequences:         2\nResidues:          25\nShortest:          11\nLongest:           14\nAverage length:    12.5\nTotal dupl. count: 4\nIndexing:          100% (0s)\n\nRepertoires in set:\n# Sequences Count Repertoire ID\n1         1     1 A1\n2         1     3 A2\n\nImmune receptor repertoire set 2\n\nReading sequences: 100% (0s)\nRepertoires:       2\nSequences:         3\nResidues:          39\nShortest:          11\nLongest:           14\nAverage length:    13.0\nTotal dupl. count: 22\nIndexing:          100% (0s)\n\nRepertoires in set:\n# Sequences Count Repertoire ID\n1         2    15 B1\n2         1     7 B2\n\nUnique V genes:    2\nUnique J genes:    2\nComputing hashes:  100% (0s)\nComputing hashes:  100% (0s)\nHashing sequences: 100% (0s)\nAnalysing:         100% (0s)\nWriting results:   100% (0s)\n\nEnd time:          Thu Mar 03 12:29:32 CET 2022\n```\n\nRepertoires will be sorted alphabetically by ID. 
The program gives some\nstatistics on the input files after reading them.\n\nHere is the result in the `output.tsv` file:\n\n```tsv\n#\tB1\tB2\nA1\t0\t7\nA2\t45\t0\n```\n\nAnd here is the result in the `pairs.tsv` file:\n\n```tsv\n#repertoire_id_1\tsequence_id_1\tduplicate_count_1\tv_call_1\tj_call_1\tjunction_aa_1\trepertoire_id_2\tsequence_id_2\tduplicate_count_2\tv_call_2\tj_call_2\tjunction_aa_2\nA1\tR\t1\tTCRBV07-06\tTCRBJ02-01\tCASSTSHEQYF\tB2\tV\t7\tTCRBV07-06\tTCRBJ02-01\tCASSTSHQQYF\nA2\tS\t3\tTCRBV07-09\tTCRBJ01-02\tCASSLRVGGYGYTF\tB1\tT\t5\tTCRBV07-09\tTCRBJ01-02\tCASSLRVGGYGYTF\nA2\tS\t3\tTCRBV07-09\tTCRBJ01-02\tCASSLRVGGYGYTF\tB1\tU\t10\tTCRBV07-09\tTCRBJ01-02\tCASSLRVGGFGYTF\n```\n\nHere, sequence R in repertoire A1 is similar to sequence V in\nrepertoire B2. The only difference is the E and Q in the 8th\nposition. The gene allele names are also the same. They have duplicate\ncounts of 1 and 7, respectively. The product is 7. That value is found\nin the third column on the second line in the main output file.\n\nSequence S in repertoire A2 with duplicate count 3 is similar to both\nsequence T and U in repertoire B1, with duplicate counts of 5 and 10,\nrespectively. Sequence T in B1 is identical, while sequence U in B1\nhas an F instead of a Y in the 10th position. The result is 3 * (5 +\n10) = 3 * 15 = 45. 
That value is found in the second column on the\nthird line of the main output file.\n\nSince there are no sequences from repertoire A1 similar to B1 or from\nA2 similar to B2, the other values in the matrix are zero.\n\nThis small dataset is included in the test folder and the tool can\nautomatically be tested by running `make test`.\n\n\n## Example 2: Sequence existence\n\nIn this example we will use the `-x` or `--existence` option to find\nout in which repertoires a set of sequences are present.\n\nThe file `setc.tsv` contains the sequences that we will analyse:\n\n```tsv\nrepertoire_id\tsequence_id\tduplicate_count\tv_call\tj_call\tjunction\tjunction_aa\tsequence\trev_comp\tproductive\td_call\tsequence_alignment\tgermline_alignment\tv_cigar\td_cigar\tj_cigar\nC\tX\t1\tTCRBV07-09\tTCRBJ01-02\ttgcgcgagcagcctgcgcgtgggcggctttggctataccttt\tCASSLRVGGFGYTF\t\t\t\t\t\t\t\t\t\nC\tY\t1\tTCRBV07-06\tTCRBJ02-01\ttgcgcgagcagcaccagccatcagcagtatttt\tCASSTSHQQYF\t\t\t\t\t\t\t\t\t\n```\n\nThe file above is included in the folder `test` in the distribution.\n\nWe will compare it to repertoire sets in the file `setb.tsv` described\nearlier.\n\nWe run the following command:\n\n`compairr -x setc.tsv setb.tsv -d 1 -f -o output.tsv -p pairs.tsv`\n\nHere is the output to the console:\n\n```\nCompAIRR 1.7.0 - Comparison of Adaptive Immune Receptor Repertoires\nhttps://github.com/uio-bmi/compairr\n\nStart time:        Thu Mar 03 12:31:16 CET 2022\nCommand (m/c/x):   Existence (-x)\nRepertoire:        setc.tsv\nRepertoire set:    setb.tsv\nNucleotides (n):   No\nDifferences (d):   1\nIndels (i):        No\nIgnore counts (f): Yes\nIgnore genes (g):  No\nIgn. 
unknown (u):  No\nThreads (t):       1\nOutput file (o):   output.tsv\nOutput format (a): Matrix\nScore (s):         Sum of products of counts\nPairs file (p):    pairs.tsv\nLog file (l):      (stderr)\n\nImmune receptor repertoire set 1\n\nReading sequences: 100% (0s)\nRepertoires:       1\nSequences:         2\nResidues:          25\nShortest:          11\nLongest:           14\nAverage length:    12.5\nTotal dupl. count: 2\nIndexing:          100% (0s)\n\nRepertoires in set:\n# Sequences Count Repertoire ID\n1         2     2 C\n\nImmune receptor repertoire set 2\n\nReading sequences: 100% (0s)\nRepertoires:       2\nSequences:         3\nResidues:          39\nShortest:          11\nLongest:           14\nAverage length:    13.0\nTotal dupl. count: 22\nIndexing:          100% (0s)\n\nRepertoires in set:\n# Sequences Count Repertoire ID\n1         2    15 B1\n2         1     7 B2\n\nUnique V genes:    2\nUnique J genes:    2\nComputing hashes:  100% (0s)\nComputing hashes:  100% (0s)\nHashing sequences: 100% (0s)\nAnalysing:         100% (0s)\nWriting results:   100% (0s)\n\nEnd time:          Thu Mar 03 12:31:16 CET 2022\n```\n\nHere is the result in the `output.tsv` file:\n\n```tsv\n#\tB1\tB2\nX\t2\t0\nY\t0\t1\n```\n\nPlease note that the `-f` option was used to ignore the duplicate\ncounts.\n\nAnd here is the result in the `pairs.tsv` file:\n\n```tsv\n#repertoire_id_1\tsequence_id_1\tduplicate_count_1\tv_call_1\tj_call_1\tjunction_aa_1\trepertoire_id_2\tsequence_id_2\tduplicate_count_2\tv_call_2\tj_call_2\tjunction_aa_2\nC\tX\t1\tTCRBV07-09\tTCRBJ01-02\tCASSLRVGGFGYTF\tB1\tU\t10\tTCRBV07-09\tTCRBJ01-02\tCASSLRVGGFGYTF\nC\tX\t1\tTCRBV07-09\tTCRBJ01-02\tCASSLRVGGFGYTF\tB1\tT\t5\tTCRBV07-09\tTCRBJ01-02\tCASSLRVGGYGYTF\nC\tY\t1\tTCRBV07-06\tTCRBJ02-01\tCASSTSHQQYF\tB2\tV\t7\tTCRBV07-06\tTCRBJ02-01\tCASSTSHQQYF\n```\n\nThe results indicate that sequence X was found (twice) in repertoire\nB1 (matching sequences T and U) and that sequence Y was found in\nrepertoire 
B2 (matching sequence V).\n\n\n## Example 3: Clustering sequences\n\nThis time we will cluster the nucleotide sequences in the file\n`setb.tsv` using the `-c` or `--cluster` option.\n\nThe command line to run is:\n\n`compairr -c setb.tsv -d 1 -n -o output.tsv`\n\nThe output during the clustering is as follows:\n\n```\nCompAIRR 1.7.0 - Comparison of Adaptive Immune Receptor Repertoires\nhttps://github.com/uio-bmi/compairr\n\nStart time:        Thu Mar 03 12:33:05 CET 2022\nCommand (m/c/x):   Cluster (-c)\nRepertoire:        setb.tsv\nNucleotides (n):   Yes\nDifferences (d):   1\nIndels (i):        No\nIgnore counts (f): No\nIgnore genes (g):  No\nIgn. unknown (u):  No\nThreads (t):       1\nOutput file (o):   output.tsv\nLog file (l):      (stderr)\n\nImmune receptor repertoire clustering\n\nReading sequences: 100% (0s)\nRepertoires:       2\nSequences:         3\nResidues:          117\nShortest:          33\nLongest:           42\nAverage length:    39.0\nTotal dupl. count: 22\nIndexing:          100% (0s)\n\nUnique V genes:    2\nUnique J genes:    2\n\nComputing hashes:  100% (0s)\nHashing sequences: 100% (0s)\nBuilding network:  100% (0s)\nClustering:        100% (0s)\nSorting clusters:  100% (0s)\nWriting clusters:  100% (0s)\n\nClusters:          2\nEnd time:          Thu Mar 03 12:33:05 CET 2022\n```\n\nThe result in the file `output.tsv` looks like this:\n\n```tsv\n#cluster_no\tcluster_size\trepertoire_id\tsequence_id\tduplicate_count\tv_call\tj_call\tjunction\n1\t2\tB1\tT\t5\tTCRBV07-09\tTCRBJ01-02\ttgcgcgagcagcctgcgcgtgggcggctatggctataccttt\n1\t2\tB1\tU\t10\tTCRBV07-09\tTCRBJ01-02\ttgcgcgagcagcctgcgcgtgggcggctttggctataccttt\n2\t1\tB2\tV\t7\tTCRBV07-06\tTCRBJ02-01\ttgcgcgagcagcaccagccatcagcagtatttt\n```\n\nIn this case, there are 2 clusters. The first contains 2 sequences (T\nand U from B1), while the second cluster contains 1 sequence (V from\nB2). 
The sequences are clustered across repertoires.\n\n\n## Example 4: Deduplication\n\nThis time we will deduplicate the amino acid sequences in the file\n`setb.tsv` using the `-z` or `--deduplicate` option.\n\nThe command line to run is:\n\n`compairr -z setb.tsv -o output.tsv`\n\nThe output will look like this:\n\n```\nCompAIRR 1.8.0 - Comparison of Adaptive Immune Receptor Repertoires\nhttps://github.com/uio-bmi/compairr\n\nStart time:        Thu Sep 15 17:10:51 CEST 2022\nCommand:           Deduplicate (--deduplicate)\nRepertoire:        setb.tsv\nNucleotides (n):   No\nDifferences (d):   0\nIndels (i):        No\nIgnore counts (f): No\nIgnore genes (g):  No\nIgn. unknown (u):  No\nThreads (t):       1\nOutput file (o):   output.tsv\nLog file (l):      (stderr)\n\nReading sequences: 100% (0s)\nRepertoires:       2\nSequences:         3\nResidues:          39\nShortest:          11\nLongest:           14\nAverage length:    13.0\nTotal dupl. count: 22\nIndexing:          100% (0s)\nUnique V genes:    2\nUnique J genes:    2\nComputing hashes:  100% (0s)\nDeduplicating:     100% (0s)\nDuplicates merged: 0\nWriting output:    100% (0s)\n\nEnd time:          Thu Sep 15 17:10:51 CEST 2022\n```\n\nThe result in the file `output.tsv` looks like this:\n\n```tsv\nrepertoire_id\tduplicate_count\tv_call\tj_call\tjunction_aa\nB1\t5\tTCRBV07-09\tTCRBJ01-02\tCASSLRVGGYGYTF\nB1\t10\tTCRBV07-09\tTCRBJ01-02\tCASSLRVGGFGYTF\nB2\t7\tTCRBV07-06\tTCRBJ02-01\tCASSTSHQQYF\n```\n\nThere were no duplicates in this dataset so the output is essentially\nidentical to the input data, but does not include all the original\ncolumns. If the two sequences in repertoire B1 had been identical, the\ntwo lines would have been merged and the new `duplicate_count` would\nhave been 15.\n\n\n## Implementation\n\nThe program is written in C++. The strategy for finding similar\nsequences is based on a similar concept developed for the tool\n[Swarm](https://github.com/torognes/swarm) (Mahé et al.\n2021). 
In short, a 64-bit hash is computed for every sequence in the\nsets. All hashes for one set are stored in a Bloom filter and in a\nhash table. We then look for matches to sequences in the second set by\nlooking them up in the Bloom filter and then, if there is a match, in\nthe hash table. To find matches with 1 or 2 substitutions or indels,\nthe hashes of all these variant sequences are generated and looked\nup. When d\u003e2, a different strategy is used where all sequences are\ncompared against each other and the number of differences is found.\n\n\n## Performance\n\nAs a preliminary performance test, Cohort 2 (\"Keck\") of [the\ndataset](https://s3-us-west-2.amazonaws.com/publishedproject-supplements/emerson-2017-natgen/emerson-2017-natgen.zip)\nby Emerson et al. (2017) was compared to itself. It contains 120 repertoires\nwith a total of 24 205 557 extracted sequences. The test was performed\nwith CompAIRR version 1.3.1. The timing results are shown below.\n\nDistance | Indels | Threads | Time (s) | Time (mm:ss)\n-------: | :----: | ------: | -------: | -----------:\n0 | no | 1 | 18 | 0:18\n0 | no | 4 | 12 | 0:12\n1 | no | 1 | 224 | 3:44\n1 | no | 4 | 72 | 1:12\n1 | yes | 1 | 367 | 6:07\n1 | yes | 4 | 111 | 1:51\n2 | no | 4 | 3200 | 53:20\n\nWhen the distance is zero, almost all of the time is spent reading\nfiles.\n\nMemory usage was 2.5 GB, corresponding to an average of about 100 bytes\nper sequence.\n\nSince this is a comparison of a repertoire set to itself, the dataset\nis only read once, and the memory needed is also reduced compared\nto a situation where two different repertoire sets are compared.\n\nWall time and memory usage were measured with `/usr/bin/time`. The\nanalysis was performed on an Apple Mac Mini M1 (2020) with 16 GB RAM.\n\n\n## Benchmarking\n\nThe AIRR overlap functionality of CompAIRR has been thoroughly\nbenchmarked against similar tools. 
All data, scripts, and results are\navailable in a separate [CompAIRR benchmarking\nrepository](https://github.com/uio-bmi/compairr-benchmarking).\n\n\n## Tips\n\nIf computer memory is limited, the dataset may be split into blocks\nbefore running CompAIRR on each block separately. The results then need\nto be merged afterwards. This may be achieved with a\nsimple script. We will consider providing such a script.\n\n\n## Development team\n\nThe code has been developed by Torbjørn Rognes based on code from\nSwarm where Frédéric Mahé and Lucas Czech made important\ncontributions. Geir Kjetil Sandve had the idea of developing a tool\nfor rapid repertoire set comparison. Lonneke Scheffer has tested and\nbenchmarked the tool, and suggested new features. Milena Pavlovic and\nVictor Greiff have also contributed to the project.\n\n\n## Support\n\nWe will prioritize fixing important bugs. We will also try to answer\nquestions, improve documentation and implement suggested enhancements\nas time permits. As we have no dedicated funding for this project, we\ncannot make any guarantees on the level of support.\n\nTo report a potential bug, suggest enhancements or ask questions,\nplease use one of the following means:\n\n* [Submit an issue on GitHub](https://github.com/uio-bmi/compairr/issues) (preferred)\n\n* Send an email to [`torognes@ifi.uio.no`](mailto:torognes@ifi.uio.no)\n\nIf you would like to contribute code, you are most welcome to\n[submit a pull request](https://github.com/uio-bmi/compairr/pulls).\n\n\n## Citing CompAIRR\n\nPlease cite the following if you use CompAIRR in any published work:\n\n* Rognes T, Scheffer L, Greiff V, Sandve GK (2022) **CompAIRR: ultra-fast comparison of adaptive immune receptor repertoires by exact and approximate sequence matching.** *Bioinformatics*, btac505. 
doi: [10.1093/bioinformatics/btac505](https://doi.org/10.1093/bioinformatics/btac505)\n\nThe article is also available in preprint form:\n\n* Rognes T, Scheffer L, Greiff V, Sandve GK (2021) **CompAIRR: ultra-fast comparison of adaptive immune receptor repertoires by exact and approximate sequence matching.** *bioRxiv*, 2021.10.30.466600. doi: [10.1101/2021.10.30.466600](https://doi.org/10.1101/2021.10.30.466600)\n\n\n## References\n\n* Emerson RO, DeWitt WS, Vignali M, Gravley J, Hu JK, Osborne EJ, Desmarais C, Klinger M, Carlson CS, Hansen JA, Rieder M, Robins HS (2017) **Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire.** *Nature Genetics*, 49 (5): 659-665. doi: [10.1038/ng.3822](https://doi.org/10.1038/ng.3822)\n\n* Mahé F, Czech L, Stamatakis A, Quince C, de Vargas C, Dunthorn M, Rognes T (2021) **Swarm v3: Towards Tera-Scale Amplicon Clustering.** *Bioinformatics*, btab493. doi: [10.1093/bioinformatics/btab493](https://doi.org/10.1093/bioinformatics/btab493)\n","funding_links":[],"categories":["🔬 VDJ Analysis"],"sub_categories":["Structure \u0026 Modeling"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuio-bmi%2Fcompairr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fuio-bmi%2Fcompairr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuio-bmi%2Fcompairr/lists"}