{"id":19592869,"url":"https://github.com/soedinglab/mmseqs","last_synced_at":"2025-04-27T14:33:54.448Z","repository":{"id":31468297,"uuid":"35032296","full_name":"soedinglab/MMseqs","owner":"soedinglab","description":null,"archived":false,"fork":false,"pushed_at":"2016-09-19T09:37:09.000Z","size":565,"stargazers_count":14,"open_issues_count":1,"forks_count":4,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-04-05T00:51:18.524Z","etag":null,"topics":["alignment","cpp","mmseqs","opensource","sequence-clustering"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/soedinglab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-05-04T12:00:32.000Z","updated_at":"2024-08-13T07:55:21.000Z","dependencies_parsed_at":"2022-08-03T15:18:05.896Z","dependency_job_id":null,"html_url":"https://github.com/soedinglab/MMseqs","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soedinglab%2FMMseqs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soedinglab%2FMMseqs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soedinglab%2FMMseqs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soedinglab%2FMMseqs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/soedinglab","download_url":"https://codeload.github.com/soedinglab/MMseqs/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositor
ies_count":251154828,"owners_count":21544563,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","cpp","mmseqs","opensource","sequence-clustering"],"created_at":"2024-11-11T08:37:14.775Z","updated_at":"2025-04-27T14:33:53.457Z","avatar_url":"https://github.com/soedinglab.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PLEASE USE MMSEQS2 (THIS VERSION IS NO LONGER DEVELOPED)\nPlease use MMseqs2 instead of MMseqs. It is faster, more sensitive, clusters better, and is more user-friendly.\nYou can find MMseqs2 here: \u003chttps://github.com/soedinglab/MMseqs2\u003e \n**This repository is no longer developed.**\n\n# MMseqs \nMMseqs (Many-against-Many sequence searching) is a software suite for very fast protein sequence searches and clustering of huge protein sequence data sets. \nMMseqs is around 1000 times faster than protein BLAST and sensitive enough to capture similarities down to less than 30% sequence identity.\n\n\n## Requirements\n\nTo compile from source, you will need:\n\n  * a recent C and C++ compiler (the minimum requirement is GCC 4.4; GCC 4.8 or later is recommended).\n\n### Memory Requirements\nWhen using MMseqs, the available memory limits the size of the database you will be able to process. 
\nWe recommend at least 128 GB of RAM so you can process databases of up to 50,000,000 entries.\n\nYou can calculate the memory requirement in bytes for a database of N sequences with an average length L using the following formula:\n        \n        M = (4*N*L + 8*a^k) byte\n\nMMseqs stores an index table and two auxiliary arrays, which have a total size of M byte.\n\nFor a database containing N sequences with an average length L, the memory consumption of the index table is `(4*N*L) byte`.\nNote that the memory consumption grows linearly with the number of sequences N in the database.\n\nThe two auxiliary arrays consume `(8*a^k) byte`, where a is the size of the amino acid alphabet (usually 21, including the unknown amino acid X) and k is the k-mer size.\n\n## Installation\n### Cloning from GIT\nIf you want to compile the most recent version, simply clone the git repository. \n\n        git clone https://github.com/soedinglab/MMseqs.git\n\n### Compile \nFirst, set environment variables:\n\n        export MMDIR=$HOME/path/to/mmseqs/\n        export PATH=$PATH:$MMDIR/bin\n\nMMseqs uses ffindex, a fast and simple database for wrapping and accessing a huge number of small files. Setting the environment variable `LD_LIBRARY_PATH` ensures that the needed ffindex libraries are available:\n\n        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MMDIR/lib/ffindex/src\n        cd $MMDIR/lib/ffindex\n        make\n \nThen build the MMseqs binaries:\n\n        cd $MMDIR/src\n        make\n\nMMseqs binaries are now located in $MMDIR/bin.\n\n## Overview of MMseqs\nMMseqs contains six binaries. Three commands execute complete workflows that combine MMseqs core modules. 
\nThe other three commands execute the single modules that the workflows use; they are available for advanced users.\n\n### Workflows\n* `mmseqs_search` Compares all sequences in the query database with all sequences in the target database.\n* `mmseqs_cluster` Clusters the sequences in the input database by sequence similarity.\n* `mmseqs_update` Incrementally updates an existing clustering, given the existing clustering of a sequence database and a new version of that database (with some sequences added and others deleted).\n\n### Single modules\n* `mmseqs_pref` Computes k-mer similarity scores between all sequences in the query database and all sequences in the target database.\n* `mmseqs_aln` Computes Smith-Waterman alignment scores between all sequences in the query database and the sequences of the target database whose prefiltering scores computed by `mmseqs_pref` pass a minimum threshold.\n* `mmseqs_clu` Computes a similarity clustering of a sequence database based on Smith-Waterman alignment scores of the sequence pairs computed by `mmseqs_aln`.\n\n### FFindex Database Format\n\nAll modules take ffindex databases as input and produce ffindex databases as output. ffindex was developed to avoid drastically slowing down the file system when millions of files need to be written and accessed. ffindex hides the files from the file system by storing them as unstructured data records in a single data file. In addition to this data file, an ffindex database includes a second file, the index file. \nFor each file, this index file stores a unique accession code, the start position in bytes of the data record in the ffindex data file, and the length of the record. When transforming a FASTA file with multiple sequences into an ffindex database, the accession code is the ID of the sequence parsed from the header. 
If no ID can be identified, the accession code is the part of the header before the first blank space, without the leading `\u003e` character.\n\nThe binaries `fasta2ffindex` and `ffindex2fasta` located in mmseqs/bin convert to and from the ffindex database format. `fasta2ffindex` generates an ffindex database from a FASTA sequence database. `ffindex2fasta` converts an ffindex database to a FASTA formatted text file: the headers are ffindex accession codes preceded by `\u003e`, with the corresponding dataset from the ffindex data file following.\nHowever, for fast access to individual datasets in very large databases it is advisable to use the ffindex database directly without converting. We offer the binary `ffindex_get` ($MMDIR/lib/ffindex/src/) for direct access to the datasets stored in an ffindex database.\n\n\n### How to cluster \nBefore clustering, convert your FASTA database into ffindex format:\n\n        fasta2ffindex DB.fasta DB\n\nFor large input databases, please ensure that the temporary folder tmp provides enough free space.\nFor the disk space requirements, see the user guide. \n\n        mkdir tmp\n        mmseqs_cluster DB DB_clu tmp --cascaded\n\nTo generate a FASTA-style formatted output file from the ffindex output file, type:\n\n        ffindex2fasta DB_clu DB_clu.fasta\n\nTo run the more sensitive cascaded clustering and convert the result into FASTA format, type:\n\n        mmseqs_cluster DB DB_clu_s7 tmp --cascaded -s 7\n        ffindex2fasta DB_clu_s7 DB_clu_s7.fasta\n\n### How to search\nYou can use the query database queryDB.fasta and target database targetDB.fasta to test the search workflow.\nBefore searching, you need to convert your database containing query sequences (queryDB.fasta) and your target database (targetDB.fasta) into ffindex format:\n\n        fasta2ffindex queryDB.fasta queryDB\n        fasta2ffindex targetDB.fasta targetDB\n\nIt generates ffindex database files, e. g. 
queryDB and ffindex index file queryDB.index\nfrom queryDB.fasta. Then, generate a directory for tmp files:\n\n        mkdir tmp\n\nPlease ensure that in case of large input databases tmp provides enough free space.\nFor the disc space requirements, see the user guide.\nTo run the search type:\n\n        mmseqs_search queryDB targetDB outDB tmp\n\nThen convert the result ffindex database into a FASTA formatted database: \n\n        ffindex2fasta outDB outDB.fasta\n\n## License Terms\n\n    This program is free software: you can redistribute it and/or modify\n    it under the terms of the GNU General Public License as published by\n    the Free Software Foundation, either version 3 of the License, or\n    (at your option) any later version.\n\n    This program is distributed in the hope that it will be useful,\n    but WITHOUT ANY WARRANTY; without even the implied warranty of\n    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n    GNU General Public License for more details.\n\n    You should have received a copy of the GNU General Public License\n    along with this program.  If not, see \u003chttp://www.gnu.org/licenses/\u003e.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoedinglab%2Fmmseqs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsoedinglab%2Fmmseqs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoedinglab%2Fmmseqs/lists"}