{"id":32215144,"url":"https://github.com/sanger-pathogens/mlst_check","last_synced_at":"2026-02-21T12:02:09.803Z","repository":{"id":4051006,"uuid":"5153589","full_name":"sanger-pathogens/mlst_check","owner":"sanger-pathogens","description":"Multilocus sequence typing by blast using the schemes from PubMLST","archived":false,"fork":false,"pushed_at":"2022-06-23T22:51:30.000Z","size":5488,"stargazers_count":30,"open_issues_count":8,"forks_count":15,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-10-16T22:15:03.222Z","etag":null,"topics":["bioinformatics","bioinformatics-pipeline","genomics","global-health","infectious-diseases","next-generation-sequencing","pathogen","research","sequencing"],"latest_commit_sha":null,"homepage":"http://sanger-pathogens.github.io/mlst_check/","language":"Perl","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sanger-pathogens.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATIONS.md","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS","dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2012-07-23T15:23:36.000Z","updated_at":"2025-02-20T07:56:33.000Z","dependencies_parsed_at":"2022-09-02T03:40:39.863Z","dependency_job_id":null,"html_url":"https://github.com/sanger-pathogens/mlst_check","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/sanger-pathogens/mlst_check","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sanger-pathogens%2Fmlst_check","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sanger-pathogens%2Fmlst_check/tags","releases_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sanger-pathogens%2Fmlst_check/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sanger-pathogens%2Fmlst_check/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sanger-pathogens","download_url":"https://codeload.github.com/sanger-pathogens/mlst_check/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sanger-pathogens%2Fmlst_check/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280360155,"owners_count":26317439,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-21T02:00:06.614Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","bioinformatics-pipeline","genomics","global-health","infectious-diseases","next-generation-sequencing","pathogen","research","sequencing"],"created_at":"2025-10-22T07:35:46.280Z","updated_at":"2025-10-22T07:35:49.356Z","avatar_url":"https://github.com/sanger-pathogens.png","language":"Perl","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multilocus sequence typing \n\nMultilocus sequence typing by blast using the schemes from PubMLST.\n\n[![Build Status](https://travis-ci.org/sanger-pathogens/mlst_check.svg?branch=master)](https://travis-ci.org/sanger-pathogens/mlst_check)  \n[![License: GPL 
v3](https://img.shields.io/badge/License-GPL%20v3-brightgreen.svg)](https://github.com/sanger-pathogens/mlst_check/blob/master/GPL-LICENSE)  \n[![status](http://joss.theoj.org/papers/0b801d23613c9b626c2b6028f8c14056/status.svg)](http://joss.theoj.org/papers/0b801d23613c9b626c2b6028f8c14056)  \n[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg)](http://bioconda.github.io/recipes/perl-bio-mlst-check/README.html)  \n[![Container ready](https://img.shields.io/badge/container-ready-brightgreen.svg)](https://quay.io/repository/biocontainers/perl-bio-mlst-check)  \n[![Docker Build Status](https://img.shields.io/docker/build/sangerpathogens/mlst_check.svg)](https://hub.docker.com/r/sangerpathogens/mlst_check)  \n[![Docker Pulls](https://img.shields.io/docker/pulls/sangerpathogens/mlst_check.svg)](https://hub.docker.com/r/sangerpathogens/mlst_check)  \n[![codecov](https://codecov.io/gh/sanger-pathogens/mlst_check/branch/master/graph/badge.svg)](https://codecov.io/gh/sanger-pathogens/mlst_check)   \n\n\n## Contents\n  * [Introduction](#introduction)\n  * [Quick start](#quick-start)\n  * [Installation](#installation)\n    * [Required dependencies](#required-dependencies)\n    * [Bioconda \\- OSX/Linux](#bioconda---osxlinux)\n    * [Docker](#docker)\n    * [Debian/Ubuntu](#debianubuntu)\n    * [HomeBrew/LinuxBrew](#homebrewlinuxbrew)\n    * [Running the tests](#running-the-tests)\n  * [Usage](#usage)\n  * [Input format](#input-format)\n  * [Outputs](#outputs)\n  * [License](#license)\n  * [Feedback/Issues](#feedbackissues)\n  * [Citation](#citation)\n  * [Method](#method)\n  * [Contribute to the software](#contribute-to-the-software)\n\n## Introduction\nThis application is for taking MLST databases from multiple locations and consolidating them in one place so that they can be easily used (and kept up to date).\nThen you can provide FASTA files and get out sequence types (ST) for a given MLST database.\nTwo spreadsheets are 
outputted: one contains the allele number for each locus and the ST (or nearest ST); the other contains the genomic sequence for each allele.  \nIf more than 1 allele gives 100% identity for a locus, the contaminated flag is set.\nOptionally you can output a concatenated sequence in FASTA format, which you can then use with tree building programs.\nNew, unseen alleles are saved in FASTA format, with 1 per file, for submission back to the MLST databases.\n\n## Quick start\nSet the directory where you would like to store the MLST databases (if you use Docker, you can skip this and the next step, as the databases are bundled with the container):\n```\nexport MLST_DATABASES=/path/to/where_you_want_to_store_the_databases\n```\n\nDownload the databases:\n```\ndownload_mlst_databases\n```\n\nGet sequence types for all FASTA files in the current directory; this creates 2 spreadsheets of results:\n```\nget_sequence_type -s \"Escherichia coli\" *.fa\n```\n\nCreate a multi-FASTA alignment for tree building:\n```\nget_sequence_type -s \"Escherichia coli\" -c *.fa\n```\n\nList all available MLST databases:\n```\nget_sequence_type -a\n```\n\nMore details:\n```\nget_sequence_type -h\n```\n\n## Installation\nmlst_check has the following dependencies:\n\n### Required dependencies\n* [NCBI BLAST](http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web\u0026PAGE_TYPE=BlastDocs\u0026DOC_TYPE=Download) \n\nIf you encounter an issue when installing mlst_check please contact your local system administrator. If you encounter a bug please log it [here](https://github.com/sanger-pathogens/mlst_check/issues).\n\nInstructions are given for installing the software via Bioconda, via Docker (which can be run on all operating systems), for Debian/Ubuntu distributions and via HomeBrew/LinuxBrew.\n\n### Bioconda - OSX/Linux\nInstall conda. 
Then add the bioconda channels and install mlst_check:\n\n```\nconda config --add channels defaults\nconda config --add channels conda-forge\nconda config --add channels bioconda\nconda install perl-bio-mlst-check\n```\n\n### Docker\nThe Docker container includes a snapshot of the MLST databases from the day it was built.  To install it:\n\n```\ndocker pull sangerpathogens/mlst_check\n```\n\nSome example data is included in the container, and can be analysed using this command:\n```\ndocker run --rm -it -v /home/ubuntu/data:/data sangerpathogens/mlst_check get_sequence_type -s 'Salmonella enterica' /example/sample1.fa /example/sample2.fa /example/sample3.fa\n```\nYour results will then be in the /home/ubuntu/data directory (or whatever you have called it).\n\n\nTo use the command with your own data, place your FASTA files in /home/ubuntu/data (or substitute your own directory):\n```\ndocker run --rm -it -v /home/ubuntu/data:/data sangerpathogens/mlst_check get_sequence_type -s 'Salmonella enterica' my_sample.fa\n```\nYour results will then be in the /home/ubuntu/data directory, as before.\n\n### Debian/Ubuntu\nIf you run Debian or Ubuntu it should be straightforward to install the software. These instructions assume you have root access. 
Run:\n\n```\napt-get update -qq\napt-get install -y ncbi-blast+ cpanminus gcc autoconf make libxml2-dev zlib1g zlib1g-dev libmodule-install-perl\ncpanm -f Bio::MLST::Check\n```\n\nSet the directory where you would like to store the MLST databases:\n```\nexport MLST_DATABASES=/path/to/where_you_want_to_store_the_databases\n```\n\nDownload the latest copy of the databases (run it once per month):\n```\ndownload_mlst_databases\n```\nTo use the software to find the sequence types for all FASTA files in your current directory:\n```\nget_sequence_type -s \"Clostridium difficile\" *.fa\n```\n\n### HomeBrew/LinuxBrew\nIf you run OSX or a non-Debian Linux, or you do not have root access on your machine, you can use HomeBrew/LinuxBrew to install the dependencies.  First install [Homebrew](http://brew.sh/) (OSX) or [LinuxBrew](http://linuxbrew.sh/) (Linux).\n\n```\nbrew tap homebrew/science\nbrew install cpanminus blast\n```\n\nAssuming you have set up Perl modules to install in your local directory (~/perl5 in this case), install the software and all its Perl dependencies:\n```\ncpanm --local-lib=~/perl5 -f Bio::MLST::Check\n```\nSet a directory where you would like to store the MLST databases:\n```\nexport MLST_DATABASES=/path/to/where_you_want_to_store_the_databases\n```\n\nDownload the latest copy of the databases (run it once per month):\n```\ndownload_mlst_databases\n```\nTo use the software to find the sequence types for all FASTA files in your current directory:\n```\nget_sequence_type -s \"Clostridium difficile\" *.fa\n```\n\n### Running the tests\nThe tests can be run from the top level directory:  \n```\ndzil test --test-verbose\n```\n## Usage\nThe MLST databases must be downloaded first. This only needs to be done occasionally (about once per month). Before downloading, set the $MLST_DATABASES environment variable to the location where you want to save your databases. 
If you use Docker, you can skip this step as the databases are bundled with the container.\n```\nUsage: download_mlst_databases [options]\n   -c STR Config file containing details of MLST databases from pubMLST\n   -b STR Directory where MLST databases are stored [$MLST_DATABASES]\n   -h     Print this message and exit\n   -v     Print version number and exit\n```\n\nThe get_sequence_type script allows you to calculate the ST of a FASTA file against one or more databases. If you don't provide the '-s' option, then every database will be searched.  If you wish to build a phylogenetic tree, use the '-c' or '-y' options to get a single aligned FASTA/Phylip file.\n```\nUsage: get_sequence_type [options] *.fasta\n\n   -s STR Species of MLST database (0 or more comma separated)\n   -d INT Number of threads [1]\n   -c     Output a FASTA file of concatenated alleles and unknown sequences \n   -y     Output a phylip file of concatenated alleles and unknown sequences\n   -o STR Output directory [.]\n   -a     Print out all available MLST databases and exit\n   -h     Print this message and exit\n   -v     Print version number and exit\n```\n\n## Input format\nThe input files must be in [FASTA format](https://en.wikipedia.org/wiki/FASTA_format) and contain nucleotide sequences. These can be full genome sequences, fragmented _de novo_ assemblies or individual genes. If a gene is truncated or split over 2 sequences, it is unlikely to be detected by this algorithm. However, MLST genes usually assemble consistently well because they have been carefully chosen by the scheme's creators.\n\n## Outputs\nThe outputs are:\n* mlst_results.allele.csv\n\n  This is a tab-separated spreadsheet containing the ST number of each input FASTA file and the corresponding allele numbers for each gene in the scheme. If one of the alleles is not contained in the database, then it will be flagged with 'U' and the 3rd column will describe it as 'Unknown'. 
If the combination of allele numbers has never been seen before, it will be flagged as 'Novel'. The ST column is populated with the nearest ST found. A whole number indicates an exact match was found for the ST. If it is prefixed with a tilde (~), it is a 'best effort' result and the nearest matching ST with the lowest number is used.  Should two different alleles for a single gene be found, then the allele numbers will be put into the 'Contamination' column (since there shouldn't be 2 copies of these genes). However, some schemes are poorly defined, so take this flag with a pinch of salt.  If there are no matches, such as in _sample5_ below, the ST is blank and all alleles are marked as unknown (U).\n\n  Isolate | ST  |\"New ST\" |Contamination     | aroC | dnaN | hemD | hisD | purE | sucA | thrA\n  ------- | --- | --------|------------------|------|------|------|------|------|------|-----\n  sample1 | ~559| Unknown |                  | 130  | 97   | 25   | 125  | U    | 9    | 101\n  sample2 | 518 |         |                  | 101  | 41   | 40   | 184  | 76   | 90   | 3\n  sample3 | 150 |         | purE-422,purE-84 | 130  | 97   | 25   | 125  | 422  | 9    | 101\n  sample4 | ~150| Novel   |                  | 130  | 95   | 25   | 125  | 422  | 9    | 101\n  sample5 |     | Unknown |                  | U    | U    | U    | U    | U    | U    | U \n\n* mlst_results.genomic.csv\n\n  This spreadsheet is similar to the mlst_results.allele.csv spreadsheet, but it gives the full sequence of each allele instead of the allele number.\n\n* *unknown.fa\n\n  You can choose to output any new alleles (-c) which are not contained in the MLST database. 
These can then be fed back to the curators maintaining the MLST databases, where they can be assigned allele numbers and profiles.\n\n* concatenated_alleles.fa and concatenated_alleles.phylip\n\n  You can choose to output a multiple FASTA/Phylip alignment of all of the MLST genes concatenated together, where each sample is represented by a single sequence. This file can then be used as input to a phylogenetic tree building application (such as RAxML or FastTree) to create a phylogenetic tree (dendrogram).\n\n## License\nmlst_check is free software, licensed under [GPLv3](https://github.com/sanger-pathogens/mlst_check/blob/master/GPL-LICENSE).\n\n## Feedback/Issues\nPlease report any issues to the [issues page](https://github.com/sanger-pathogens/mlst_check/issues).\n\n## Citation\n```\"Multilocus sequence typing by blast from de novo assemblies against PubMLST\", Andrew J. Page, Ben Taylor, Jacqueline A. Keane, The Journal of Open Source Software, (2016). doi: http://dx.doi.org/10.21105/joss.00118```\n\n## Method\nThe user can decide to use a specific MLST scheme or search all of them. The first step is to generate a blastn database from the alleles using makeblastdb.  The input sequences are then blasted against the database using blastn.  If there is a 100% match to the full length of an allele, the corresponding allele number is noted. If there is only a partial match to an allele, the best hit is chosen, namely the one with the highest number of matching bases and the highest percentage identity. This nearest allele number is noted and it is flagged as 'Unknown'.  If there is contamination, and more than 1 allele for a single gene matches at 100%, the corresponding allele numbers are presented in the contamination column. The first allele in the blast results is used for the gene.  The profile for the MLST scheme links the combination of allele numbers for each gene to an ST number.  This number is presented if there is an exact match.  
If one or more of the alleles is _Unknown_, the nearest ST with the lowest integer number is used. Where the combination of allele numbers has not been seen before, the ST is marked as _Novel_, and the ST with the closest number of matches and the lowest integer is presented and indicated with a tilde (~).\n\n## Contribute to the software\nIf you wish to fix a bug or add new features to the software, we welcome Pull Requests. Please fork the repo, make the change, then submit a Pull Request with details about what the change is and what it fixes/adds.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsanger-pathogens%2Fmlst_check","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsanger-pathogens%2Fmlst_check","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsanger-pathogens%2Fmlst_check/lists"}