{"id":23084122,"url":"https://github.com/smaegol/plasflow","last_synced_at":"2025-07-21T11:32:25.203Z","repository":{"id":49109872,"uuid":"78748754","full_name":"smaegol/PlasFlow","owner":"smaegol","description":"Software for prediction of plasmid sequences in metagenomic assemblies","archived":false,"fork":false,"pushed_at":"2021-06-28T17:00:50.000Z","size":39513,"stargazers_count":103,"open_issues_count":17,"forks_count":28,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-07-07T02:19:33.099Z","etag":null,"topics":["classification","contigs","fasta","metagenome","metagenome-assembly","metagenomes","plasflow","plasmid","plasmid-sequences","plasmids","prediction","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/smaegol.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-01-12T13:35:45.000Z","updated_at":"2025-06-25T11:43:48.000Z","dependencies_parsed_at":"2022-09-17T00:01:22.235Z","dependency_job_id":null,"html_url":"https://github.com/smaegol/PlasFlow","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/smaegol/PlasFlow","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smaegol%2FPlasFlow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smaegol%2FPlasFlow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smaegol%2FPlasFlow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smaegol%2FPlasFlow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/owners/smaegol","download_url":"https://codeload.github.com/smaegol/PlasFlow/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smaegol%2FPlasFlow/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266291739,"owners_count":23906323,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","contigs","fasta","metagenome","metagenome-assembly","metagenomes","plasflow","plasmid","plasmid-sequences","plasmids","prediction","tensorflow"],"created_at":"2024-12-16T15:49:04.977Z","updated_at":"2025-07-21T11:32:25.184Z","avatar_url":"https://github.com/smaegol.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Anaconda-Server Badge](https://anaconda.org/smaegol/plasflow/badges/installer/conda.svg)](https://anaconda.org/smaegol/plasflow) [![Anaconda-Server Badge](https://anaconda.org/smaegol/plasflow/badges/platforms.svg)](https://anaconda.org/smaegol/plasflow) [![Anaconda-Server Badge](https://anaconda.org/smaegol/plasflow/badges/downloads.svg)](https://anaconda.org/smaegol/plasflow)\n[![PyPI](https://img.shields.io/pypi/v/plasflow.svg)](https://pypi.org/project/plasflow/1.1.0/)\n\n# NOT MAINTAINED\n\nUse at your own risk. I am very grateful that it is being widely used but, as I completely changed my research area I cannot give my time to maintain this project. There are other, newer packages developed, which can be used instead. 
\n\n# PlasFlow 1.1\n\nPlasFlow is a set of scripts for predicting plasmid sequences in metagenomic contigs. It relies on neural network models trained on full genome and plasmid sequences and can differentiate between plasmids and chromosomes with accuracy reaching 96%. It outperforms other available solutions for plasmid recovery from metagenomes and incorporates thresholding, which allows uncertain predictions to be excluded. PlasFlow has been published in _Nucleic Acids Research_ (https://doi.org/10.1093/nar/gkx1321).\n\n# Table of contents\n\n- [News](#news)\n- [Requirements](#requirements)\n- [Installation](#installation)\n\n  - [Conda-based](#conda-based---recommended)\n  - [Pip installer](#pip-installer)\n  - [Manual installation](#manual-installation)\n  - [Perl modules for additional scripts](#perl-modules-for-additional-scripts)\n\n- [Getting started](#getting-started)\n\n- [Output](#output)\n\n- [Test dataset](#test-dataset)\n- [Detailed information](#detailed-information)\n- [Citation](#citation)\n- [TBD](#tbd)\n- [Support](#support)\n\n## News\n\n#### 2018-05-25 Version 1.1 released\n\nA new version (1.1) has been released; it is better suited for large datasets. 
It can be downloaded from conda and PyPI, but the simplest way to upgrade is to replace the PlasFlow.py file in your previous installation with the current one.\nIf you still encounter problems with the new version, try a smaller value for the `--batch_size` option.\n\n\n## Requirements:\n\n- Python 3.5\n- Python packages:\n\n  - Scikit-learn 0.18.1\n  - Numpy\n  - Pandas\n  - [TensorFlow 0.10.0](https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl)\n  - rpy2 \u003e= 2.8\n  - scipy\n  - biopython\n  - dateutil \u003e= 2.5\n\n- R 3.25\n\n- R packages:\n  - [Biostrings](https://bioconductor.org/packages/release/bioc/html/Biostrings.html)\n\nFor the Perl scripts, especially `filter_sequences_by_length.pl`:\n\n- Perl 5 and modules:\n\n  - Bioperl ([installation instructions](https://bioperl.org/INSTALL.html))\n  - Getopt\n\n\n## Installation\n\n### Conda-based - recommended\n\nConda is the recommended installation option, as it properly resolves all dependencies (including R and Biostrings) and allows installation without interfering with other installed packages. 
Conda is available both as [Anaconda](https://www.anaconda.com/download/) and as [Miniconda](https://conda.io/miniconda.html) (which is easier to install and maintain).\n\nAfter installing conda, add the [bioconda](https://bioconda.github.io/) channel, which is required for installation of the [Biostrings](https://bioconductor.org/packages/release/bioc/html/Biostrings.html) package:\n\n```\nconda config --add channels bioconda\n```\n\nSometimes it may also be necessary to add the [conda-forge](https://conda-forge.org/) channel:\n\n```\nconda config --add channels conda-forge\n```\n\n\nTo avoid dependency conflicts, it is encouraged to create a separate conda environment for PlasFlow using the command:\n\n```\nconda create --name plasflow python=3.5\n```\n\nPython 3.5 is required because of TensorFlow requirements.\n\nTo activate the created environment, type:\n\n```\nsource activate plasflow\n```\n\nMac users should install TensorFlow at this step (as the osx-64 package is not present in the default channels). If you encounter any problems with a missing TensorFlow dependency on other platforms, also try installing TensorFlow from this source.\n\n```\nconda install -c jjhelmus tensorflow=0.10.0rc0\n```\n\nPlasFlow can be installed as an Anaconda package from my Anaconda channel using:\n\n```\nconda install plasflow -c smaegol\n```\n\nThis command installs all required dependencies into the created conda environment. When installation is finished, PlasFlow can be invoked as described in the [Getting started](#getting-started) section.\n\nWhen you finish your work with PlasFlow, you can deactivate the current conda environment with the command:\n\n```\nsource deactivate\n```\n\n### Pip installer\n\nPip-based installation is also possible. However, some requirements have to be met:\n\n1. Python 3.5 is required (due to TensorFlow requirements)\n2. 
TensorFlow has to be installed manually:\n\n```\npip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl\n```\n\nthen install PlasFlow with\n\n```\npip install plasflow\n```\n\nHowever, the models used for prediction have to be downloaded separately (for example using `git clone https://github.com/smaegol/PlasFlow`).\n\n### Manual installation\n\nThe PlasFlow repo can also be cloned using\n\n```\ngit clone https://github.com/smaegol/PlasFlow\n```\n\nbut in that case all dependencies have to be installed manually. TensorFlow can be installed as specified above:\n\n```\npip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl\n```\n\nPython dependencies can be installed using pip:\n\n```\npip install numpy pandas scipy rpy2 scikit-learn biopython\n```\n\nTo install R Biostrings, go to \u003chttps://bioconductor.org/packages/release/bioc/html/Biostrings.html\u003e and follow the instructions there.\n\n### Perl modules for additional scripts\n\nPerl scripts (like `filter_sequences_by_length.pl`) included with PlasFlow require a few Perl modules. They can be installed using conda:\n\n```\nconda install -c bioconda perl-bioperl perl-getopt-long\n```\n\nor cpan:\n\n```\ncpan -i Bio::Perl Getopt::Long\n```\n\nor any package manager included in your system (apt, brew)\n\n## Getting started\n\nPlasFlow is designed to take a metagenomic assembly and identify contigs which may come from plasmids. It outputs several files, of which the most important is a tabular file containing all predictions (specified with the `--output` option).\n\nBefore invoking PlasFlow, it is highly recommended to filter sequences by length, leaving only those longer than 1000 bp. PlasFlow, like other k-mer-based methods, does not perform well on short sequences, as it is hard to get proper k-mer coverage from them. Hence, results for short sequences are unreliable. 
As metagenomic assemblies usually contain a large number of short contigs, this additional filtering step can improve results and speed up PlasFlow. It can also prevent excessive RAM usage.\n\nTo filter sequences using the provided Perl script, type:\n\n```\nfilter_sequences_by_length.pl -input input_dataset.fasta -output filtered_output.fasta -thresh sequence_length_threshold\n```\n\nwhere the sequence length threshold has to be provided in base pairs. The filtered fasta file can then be used directly for PlasFlow prediction.\n\n\nOptions available in PlasFlow include:\n\n- `--input` - specifies the input fasta file with assembly contigs to classify [required]\n- `--output` - the name of the tsv file with the tabular output of classification [required]\n- `--threshold` - manually specified threshold for probability filtering (default = 0.7)\n- `--labels` - manually specified custom location of the labels file (used for translation from numeric output to actual class names)\n- `--models` - custom location of the models used for prediction (has to be specified if PlasFlow was installed using pip)\n- `--batch_size` - how many sequences can be used in a single batch of k-mer frequency calculation\n\n\n## Output\n\nThe most important output of PlasFlow is a tabular file containing all predictions (specified with the `--output` option), consisting of several columns including:\n\ncontig_id | contig_name | contig_length | id | label | ...\n--------- | ----------- | ------------- | -- | ----- | ---\n\n\nwhere:\n\n- `contig_id` is an internal id of the sequence used for the classification\n- `contig_name` is the name of the contig used in the classification\n- `contig_length` shows the length of the classified sequence\n- `id` is an internal id of the produced label (classification)\n- `label` is the actual classification\n- `...` represents additional columns showing probabilities of assignment to each possible class\n\nSequences can be classified to 26 classes including: chromosome.Acidobacteria, 
chromosome.Actinobacteria, chromosome.Bacteroidetes, chromosome.Chlamydiae, chromosome.Chlorobi, chromosome.Chloroflexi, chromosome.Cyanobacteria, chromosome.DeinococcusThermus, chromosome.Firmicutes, chromosome.Fusobacteria, chromosome.Nitrospirae, chromosome.other, chromosome.Planctomycetes, chromosome.Proteobacteria, chromosome.Spirochaetes, chromosome.Tenericutes, chromosome.Thermotogae, chromosome.Verrucomicrobia, plasmid.Actinobacteria, plasmid.Bacteroidetes, plasmid.Chlamydiae, plasmid.Cyanobacteria, plasmid.DeinococcusThermus, plasmid.Firmicutes, plasmid.Fusobacteria, plasmid.other, plasmid.Proteobacteria, plasmid.Spirochaetes.\n\nIf the probability of assignment to a given class is lower than the threshold (default = 0.7), the sequence is treated as unclassified.\n\nAdditionally, PlasFlow produces fasta files containing input sequences binned to plasmids, chromosomes and unclassified.\n\n## Test dataset\n\nThe test dataset is located in the `test` folder (file `Citrobacter_freundii_strain_CAV1321_scaffolds.fasta`). It is the SPAdes 3.9.1 assembly of the Citrobacter freundii strain CAV1321 genome (NCBI assembly ID: GCA_001022155.1), which contains 1 chromosome and 9 plasmids. 
The same folder contains the classification results as a tsv file (`Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv`) and fasta files containing the identified bins (`Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv_chromosomes.fasta`, `Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv_plasmids.fasta` and `Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv_unclassified.fasta`).\n\nTo invoke PlasFlow on the test dataset, copy the `test/Citrobacter_freundii_strain_CAV1321_scaffolds.fasta` file to your current working directory and type:\n\n```\nPlasFlow.py --input Citrobacter_freundii_strain_CAV1321_scaffolds.fasta --output test.plasflow_predictions.tsv --threshold 0.7\n```\nThe predictions will be located in the `test.plasflow_predictions.tsv` file and can be compared to the results available in `test/Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv`.\n\n\n## Detailed information\n\nDetailed information concerning the algorithm and the assumptions on which PlasFlow is based can be found in the publication \"_PlasFlow - Predicting Plasmid Sequences in Metagenomic Data Using Genome Signatures_\" (_Nucleic Acids Research_, https://doi.org/10.1093/nar/gkx1321). The flowchart illustrating the major steps of training and prediction is shown below\n\n![PlasFlow Flowchart](https://github.com/smaegol/PlasFlow/blob/master/flowchart.png)\n\nAll models tested and described in the manuscript can be found in a separate repository: \u003chttps://github.com/smaegol/PlasFlow_models\u003e\n\nScripts used for the preparation of the training dataset and for neural network training are available in the `scripts` subfolder as well as in a separate repository: \u003chttps://github.com/smaegol/PlasFlow_processing\u003e\n\n## Citation\n\nPlease cite the following paper when using PlasFlow for your own research.\n\n\u003e Krawczyk PS, Lipinski L, Dziembowski A.\n\u003e Nucleic Acids Res. 2018 Apr 6;46(6):e35. 
doi: 10.1093/nar/gkx1321.\n\n## TBD\n\nIn the next releases we plan to retrain the models using the most recent TensorFlow release. During the development of PlasFlow there were many changes in the TensorFlow library, and the newest version is not compatible with the models trained for TensorFlow 0.10.0. However, retraining requires significant computational effort and recoding. As we want to include _Archaea_ sequences (which are currently missing) in the models, we plan to train new models with the latest TensorFlow version and release a new version of PlasFlow in the second half of 2018.\n\n## Support\n\nAny issues connected with PlasFlow should be addressed to Pawel Krawczyk (p.krawczyk (at) ibb.waw.pl).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsmaegol%2Fplasflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsmaegol%2Fplasflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsmaegol%2Fplasflow/lists"}