{"id":13752326,"url":"https://github.com/soedinglab/plass","last_synced_at":"2025-04-04T10:05:34.372Z","repository":{"id":54865666,"uuid":"118119513","full_name":"soedinglab/plass","owner":"soedinglab","description":"sensitive and precise assembly of short sequencing reads","archived":false,"fork":false,"pushed_at":"2024-10-11T20:43:30.000Z","size":28374,"stargazers_count":153,"open_issues_count":33,"forks_count":15,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-03-28T09:06:46.408Z","etag":null,"topics":["bioinformatics","metagenomics","metatranscriptomics","opensource","proteins","proteomics","sequence-assembler"],"latest_commit_sha":null,"homepage":"https://plass.mmseqs.com","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/soedinglab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-01-19T11:53:19.000Z","updated_at":"2025-03-18T15:07:56.000Z","dependencies_parsed_at":"2024-03-31T09:27:02.656Z","dependency_job_id":"2cfdcd09-e28e-45cc-aa13-8dd3b6aec837","html_url":"https://github.com/soedinglab/plass","commit_stats":{"total_commits":3200,"total_committers":62,"mean_commits":51.61290322580645,"dds":0.634375,"last_synced_commit":"58298c0c3b501bf503ebc554c572b689d78dc8b1"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soedinglab%2Fplass","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soedinglab%2Fplass/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/soedinglab%2Fplass/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soedinglab%2Fplass/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/soedinglab","download_url":"https://codeload.github.com/soedinglab/plass/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247157242,"owners_count":20893216,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","metagenomics","metatranscriptomics","opensource","proteins","proteomics","sequence-assembler"],"created_at":"2024-08-03T09:01:03.796Z","updated_at":"2025-04-04T10:05:34.349Z","avatar_url":"https://github.com/soedinglab.png","language":"C","readme":"# PLASS and PenguiN assembler\n[![BioConda Install](https://img.shields.io/conda/dn/bioconda/plass.svg?style=flag\u0026label=BioConda%20install)](https://anaconda.org/bioconda/plass)\n[![BioContainer Pulls](https://img.shields.io/endpoint?url=https%3A%2F%2Fmmseqs.com%2Fbiocontainer.php%3Fcontainer%3Dplass)](https://biocontainers.pro/#/tools/plass)\n[![DOI](https://zenodo.org/badge/118119513.svg)](https://zenodo.org/badge/latestdoi/118119513)\n\nPlass (Protein-Level ASSembler) and PenguiN (Protein guided nucleotide assembler) are software to assemble protein sequences or DNA/RNA contigs from short read sequencing data meant to work best for complex metagenomic or metatranscriptomic datasets. 
Plass and PenguiN are GPL-licensed open source software implemented in C++, available for Linux and macOS, and designed to run on multiple cores. \n\n[Plass:](https://github.com/soedinglab/plass/tree/master?tab=readme-ov-file#plass---protein-level-assembler) [Steinegger M, Mirdita M and Soeding J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature Methods, doi: doi.org/10.1038/s41592-019-0437-4 (2019)](https://www.nature.com/articles/s41592-019-0437-4).\n\n[PenguiN:](https://github.com/soedinglab/plass/tree/master?tab=readme-ov-file#penguin---Protein-guided-Nucleotide-Assembler) [Jochheim A, Jochheim FA, Kolodyazhnaya A, Morice E, Steinegger M, Soeding J. Strain-resolved de-novo metagenomic assembly of viral genomes and microbial 16S rRNAs. Microbiome 12, 187 (2024)](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-024-01904-y)\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/soedinglab/plass/master/.github/plass.png\" height=\"256\" /\u003e\u003c/p\u003e\n\n### Soil Reference Catalog (SRC) and Marine Eukaryotic Reference Catalog (MERC)\nSRC was created by assembling 640 soil metagenome samples. MERC was assembled from the metatranscriptomics datasets created by the TARA ocean expedition. Both catalogs were redundancy reduced to 90% sequence identity at 90% coverage.\nEach catalog is a single FASTA file containing the sequences; the header identifiers contain the Sequence Read Archive (SRA) identifiers.\nThe catalogs can be downloaded [here](http://wwwuser.gwdg.de/~compbiol/plass/current_release/).\nWe provide an [HH-suite3](https://github.com/soedinglab/hh-suite) database called \"BFD\", containing sequences from Metaclust, SRC, MERC and UniProt, [here](https://bfd.mmseqs.com/).\n\n# PenguiN - Protein-guided Nucleotide assembler\nPenguiN is software to assemble short-read sequencing data at the nucleotide level. 
In a first step it assembles coding sequences using the information from the translated protein sequences. In a second step it links them across non-coding regions. The main purpose of PenguiN is the assembly of complex metagenomic and metatranscriptomic datasets. It was especially tested for the assembly of viral genomes as well as 16S rRNA gene sequences. Compared with state-of-the-art assemblers like Megahit and the SPAdes variants, it assembles 3-40 times more complete viral genomes and six times as many 16S rRNA sequences.\n\n### Install Plass and PenguiN\nOur software can be installed via [conda](https://github.com/conda-forge/miniforge) or as statically compiled binaries. It requires a 64-bit Linux or macOS system.\n\n     # install from bioconda\n     conda install -c conda-forge -c bioconda plass \n     # pull the Docker image\n     docker pull ghcr.io/soedinglab/plass:latest\n     # static build with AVX2 (fastest)\n     wget https://mmseqs.com/plass/plass-linux-avx2.tar.gz; tar xvfz plass-linux-avx2.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH\n     # static build with SSE4.1\n     wget https://mmseqs.com/plass/plass-linux-sse41.tar.gz; tar xvfz plass-linux-sse41.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH\n     # universal build for macOS (Intel or Apple Silicon)\n     wget https://mmseqs.com/plass/plass-osx-universal.tar.gz; tar xvfz plass-osx-universal.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH\n\nOther precompiled binaries for SSE2, ARM and PowerPC can be found at [mmseqs.com/plass](https://mmseqs.com/plass).\n\n## How to assemble\nPlass and PenguiN can assemble both paired-end reads (FASTQ) and single reads (FASTA or FASTQ):\n\n      # assemble paired-end reads \n      plass assemble examples/reads_1.fastq.gz examples/reads_2.fastq.gz assembly.fas tmp\n\n      # assemble single-end reads \n      plass assemble examples/reads_1.fastq.gz assembly.fas tmp\n\n      # assemble single-end reads using stdin\n      cat examples/reads_1.fastq.gz | plass assemble stdin 
assembly.fas tmp\n\n\nImportant parameters: \n\n     --min-seq-id         Adjusts the overlap sequence identity threshold\n     --min-length         Minimum codon length for ORF prediction (default: 40)\n     -e                   E-value threshold for overlaps \n     --num-iterations     Number of iterations of assembly\n     --filter-proteins    Switches the neural network protein filter off/on\n\nPlass workflows: \n\n      plass assemble      Assembles proteins (i:Nucleotides -\u003e o:Proteins)\n      \n      \nPenguiN workflows: \n\n      penguin guided_nuclassemble  Assembles nucleotides using protein and nucleotide information (i:Nucleotides -\u003e o:Nucleotides)\n      penguin nuclassemble         Assembles nucleotides using only nucleotide information (i:Nucleotides -\u003e o:Nucleotides)\n\n### Assemble using MPI \nBoth tools can be distributed over several homogeneous computers. However, the `tmp` folder has to be shared between all nodes (e.g. NFS). The following command assembles on several nodes:\n\n    RUNNER=\"mpirun -np 42\" plass assemble examples/reads_1.fastq.gz examples/reads_2.fastq.gz assembly.fas tmp\n\n\n### Compile from source\nCompiling from source has the advantage that the binaries are optimized for the specific system, which should improve performance. To compile, `git`, `g++` (4.9 or higher) and `cmake` (3.0 or higher) are required. Afterwards, the PLASS and PenguiN binaries will be located in the `build/bin` directory.\n\n      git clone https://github.com/soedinglab/plass.git\n      cd plass\n      git submodule update --init\n      mkdir build \u0026\u0026 cd build\n      cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..\n      make -j 4 \u0026\u0026 make install\n      export PATH=\"$(pwd)/bin/:$PATH\"\n        \n:exclamation: If you want to compile PLASS or PenguiN on macOS, please install and use `gcc` from Homebrew. The default macOS `clang` compiler does not support OpenMP and PLASS will not be able to run multithreaded. 
Use the following cmake call:\n\n      CXX=\"$(brew --prefix)/bin/g++-13\" cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..\n\n#### Dependencies\n\nWhen compiling from source, our software requires `zlib` and `bzip2` to be installed.\n\n### Use the docker image\nWe also provide a Docker image of Plass. You can mount the current directory containing the reads to be assembled and run plass with the following command:\n\n      docker run -ti --rm -v \"$(pwd):/app\" -w /app ghcr.io/soedinglab/plass:latest assemble reads_1.fastq reads_2.fastq assembly.fas tmp\n\n## Hardware requirements\nPlass needs roughly 1 byte of memory per residue to work efficiently. Plass will scale its memory consumption based on the available main memory of the machine. Plass needs a CPU with at least the SSE4.1 instruction set to run. \n\n## Known problems \n* The Plass assembly includes all ORFs that have a start and end codon, including very short ORFs (\u003c 60 amino acids). Many of these short ORFs are spurious since our neural network cannot distinguish them well. We recommend using other methods to verify their coding potential. Assemblies above 100 amino acids are mostly genuine protein sequences. \n* By default, Plass searches for ORFs of 40 amino acids or longer, which requires reads longer than 120 nucleotides. To assemble shorter reads, you need to lower the `--min-length` threshold. Be aware that using short reads (\u003c 100 nucleotides) might result in lower sensitivity.\n","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoedinglab%2Fplass","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsoedinglab%2Fplass","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoedinglab%2Fplass/lists"}