{"id":20405605,"url":"https://github.com/astrabert/virbicla","last_synced_at":"2025-06-25T20:07:04.612Z","repository":{"id":228280482,"uuid":"773489797","full_name":"AstraBert/VirBiCla","owner":"AstraBert","description":"A novel ML-based binary classifier to tell viral and non-viral long reads apart in metagenomic samples.","archived":false,"fork":false,"pushed_at":"2024-04-08T18:44:51.000Z","size":51442,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-23T01:03:38.894Z","etag":null,"topics":["binary-classifier","biodiversity","dna","genomics","healthcare","long-read-sequencing","machine-learning","metagenomics","microbiology","oxford-nanopore","virome","viromics","virus","voting-classifier"],"latest_commit_sha":null,"homepage":"https://astrabert.github.io/VirBiCla/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AstraBert.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":["AstraBert"]}},"created_at":"2024-03-17T19:52:05.000Z","updated_at":"2024-03-21T00:52:04.000Z","dependencies_parsed_at":"2024-04-08T20:12:29.904Z","dependency_job_id":null,"html_url":"https://github.com/AstraBert/VirBiCla","commit_stats":null,"previous_names":["astrabert/virbicla"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/AstraBert/VirBiCla","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraBert%2FVirBiCla","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/AstraBert%2FVirBiCla/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraBert%2FVirBiCla/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraBert%2FVirBiCla/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AstraBert","download_url":"https://codeload.github.com/AstraBert/VirBiCla/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraBert%2FVirBiCla/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261945376,"owners_count":23234237,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["binary-classifier","biodiversity","dna","genomics","healthcare","long-read-sequencing","machine-learning","metagenomics","microbiology","oxford-nanopore","virome","viromics","virus","voting-classifier"],"created_at":"2024-11-15T05:12:06.021Z","updated_at":"2025-06-25T20:06:59.576Z","avatar_url":"https://github.com/AstraBert.png","language":"Python","readme":"# VirBiCla\r\n\r\nA novel ML-based binary classifier to tell viral and non-viral DNA/RNA sequences apart.\r\n\r\n## Table of Contents\r\n\r\n- [Model Description](#model-description)\r\n  - [Model Components](#model-components)\r\n  - [Ensemble Voting](#ensemble-voting)\r\n  - [Model Training](#model-training)\r\n  - [Usage](#usage)\r\n- [Training](#training)\r\n  - [Data collection](#data-collection)\r\n  - [Data processing](#data-processing)\r\n  - [Training specs](#training-specs)\r\n- [Testing](#testing)\r\n  - [Data 
collection](#data-collection-1)\r\n  - [Data processing](#data-processing-1)\r\n  - [Tests](#tests)\r\n- [Use advices and limitations](#use-advices-and-limitations)\r\n- [Customization](#customization)\r\n- [License and right of usage](#license-and-right-of-usage)\r\n- [References](#references)\r\n\r\n## Model Description\r\n\r\nVirBiCla is an ensemble voting classifier model created using scikit-learn in python.\r\n\r\n### Model Components\r\n- **Logistic Regression (LR):** A linear model that predicts the probability of a categorical dependent variable.\r\n- **Random Forest (RF):** An ensemble learning method that constructs a multitude of decision trees during training and outputs the mode of the classes.\r\n- **Gaussian Naive Bayes (GNB):** A simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions.\r\n- **Decision Tree (DT):** A non-parametric supervised learning method used for classification and regression.\r\n- **K-Nearest Neighbors (KNN):** A non-parametric method used for classification and regression, where the input consists of the k closest training examples in the feature space.\r\n- **Bagging Classifier (BC):** An ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset.\r\n- **Histogram-Based Gradient Boosting Classifier (HGB):** A gradient boosting model that uses histograms to speed up training.\r\n- **Extra Trees Classifier (ETC):** An ensemble learning method that builds multiple decision trees and merges their predictions.\r\n- **Stochastic Gradient Descent (SGD):** An optimization algorithm that updates the parameters iteratively to minimize the loss function.\r\n\r\n### Ensemble Voting\r\n- **Voting Method:** Soft voting, where the predicted probabilities for each class are averaged and the class with the highest probability is chosen.\r\n- **Classifiers:** A combination of LR, RF, GNB, DT, KNN, BC, HGB, ETC, and SGD classifiers are combined into a 
voting ensemble.\r\n\r\n### Model Training\r\n- **Training Data:** The model is trained using the datasets you can see [here](./data/train/)\r\n- **Fit Method:** The classifier is fit to the training data using the `fit` method.\r\n\r\n### Usage\r\nOnce trained, this ensemble classifier can be used to make predictions on new data by calling the `predict` method.\r\n\r\nYou can use the model by running the [prediction script](./scripts/predict.py) as follows:\r\n\r\n```bash\r\n# make sure to include -p if you wish to get out a nice visualization of the viral/non-viral classification stats\r\npython3 predict.py -i input.fasta -o output.csv -p\r\n```\r\n\r\n## Training\r\n### Data collection\r\nTraining data were collected using [this script](./shell/retrieve_lots_of_blastdb.sh), which is based on the instructions provided by NCBI on [how to retrieve BLAST databases](https://www.ncbi.nlm.nih.gov/books/NBK569850/).\r\n\r\nTraining datasets are:\r\n\r\n- 16S_ribosomal_RNA: 16S ribosomal RNA (Bacteria and Archaea type strains) - 22239 seqs\r\n- 28S_fungal_sequences: 28S ribosomal RNA sequences (LSU) from Fungi type and reference material - 10292 seqs\r\n- SSU_eukaryote_rRNA: Small subunit ribosomal nucleic acid for Eukaryotes - 8784 seqs\r\n- ref_viruses_rep_genomes: RefSeq viruses representative genomes - 18688 seqs\r\n\r\nDescriptions are taken from a dedicated BLAST+ tools [repository](https://github.com/ncbi/blast_plus_docs?tab=readme-ov-file#blast-databases), while the sequence counts were retrieved from the metadata files available on the [BLAST FTP website](https://ftp.ncbi.nlm.nih.gov/blast/db/).\r\n\r\n### Data processing\r\nData were processed using [this script](./scripts/preprocess_train.py) and plots of the resulting training dataset were made with the aid of [this script](./scripts/plot_train.py).\r\n\r\nHere's a step-by-step breakdown of the processing procedure:\r\n\r\n1. 
**Nucleotide Proportions and GC Content Calculation:**\r\n    - The input DNA sequence is analyzed to compute the frequency of each nucleotide (A, T, C, G) in the sequence.\r\n    - Additionally, the GC content of the sequence is calculated, representing the proportion of guanine (G) and cytosine (C) bases relative to the total number of bases in the sequence.\r\n\r\n2. **Effective Number of Codons (ENC) Computation:**\r\n    - The program calculates the Effective Number of Codons (ENC), which provides insights into codon usage bias within the DNA sequence.\r\n    - ENC quantifies how far codon usage departs from equal use of synonymous codons, with lower values indicating stronger codon bias.\r\n\r\n3. **Homopolymeric Region Detection:**\r\n    - Homopolymeric regions, characterized by consecutive repeats of a single nucleotide (e.g., AAAA), are identified within the DNA sequence.\r\n    - The program computes the percentage of homopolymeric loci for each nucleotide (A, T, C, G) present in the sequence.\r\n\r\n4. **Information Entropy Estimation:**\r\n    - Information entropy of the DNA sequence is estimated, providing a measure of uncertainty or randomness in the sequence.\r\n    - Higher entropy values suggest greater sequence diversity, while lower values indicate more uniform or repetitive sequences.\r\n\r\n5. **Gene Density Calculation:**\r\n    - The program estimates gene density based on coding statistics of reading frames derived from the DNA sequence.\r\n    - It computes the ratio of coding regions (sequences encoding proteins) to the total length of the translated sequences, providing insights into the density of potential genes within the sequence.\r\n\r\n6. 
**Integration of Results:**\r\n    - Finally, the extracted features and metrics, including nucleotide proportions, GC content, ENC, homopolymeric regions, information entropy, and gene density, are integrated into a structured dictionary.\r\n    - The dictionary encapsulates all relevant information about the processed DNA sequence, facilitating further analysis and interpretation, and is then labeled with the viral/non-viral information (\"V\" or \"NV\") associated with each file.\r\n\r\n7. **Results storage**:\r\n    - The resulting dictionaries, one per DNA sequence, are stored in a list, converted into a DataFrame and, finally, merged into a [CSV file](./data/train/viral-vs-nonviral_train.csv).\r\n\r\nAll the functions employed to process data are defined [here](./scripts/utils.py).\r\n\r\nAll the plots are available [here](./data/plots/train/).\r\n\r\n### Training specs\r\nThe binary classifier is trained to predict the \"Domain\" field (in the CSV) from all the other features, classifying a sequence as either viral (\"V\") or non-viral (\"NV\"). \r\n\r\nTraining takes about 40s on the free Google Colab Python runtime (12GB RAM).\r\n\r\n## Testing\r\n### Data collection\r\nThree test sets were prepared ([test](./data/test/), [test1](./data/test1/) and [test2](./data/test2/)), and they comprise:\r\n\r\n- 401 sequences (ranging from 1000 to 8000 bp) from recently submitted viral specimens, collected from [NCBI virus](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/), 500 out of 4092 Araneae SSU rRNA sequences collected from [SILVA](https://www.arb-silva.de/) and 500 out of 1982 Tumebacillus SSU rRNA sequences also collected from SILVA.    
\r\n- 2000 DNA sequences (ranging from 1000 to 10000 bp) from bacteriophages, collected from [NCBI virus](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/), 1000 out of 4092 Araneae SSU rRNA sequences collected from [SILVA](https://www.arb-silva.de/) and 1000 out of 1982 Tumebacillus SSU rRNA sequences also collected from SILVA.\r\n- 1072 DNA sequences (ranging from 1000 to 15000 bp) from human viruses, collected from [NCBI virus](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/), 600 sequences from [LSU_prokaryote_rRNA](./data/train/LSU_prokaryote_rRNA.fsa.gz) and 600 sequences from [28S_fungal_sequences](./data/train/28S_fungal_sequences.fsa.gz).\r\n\r\n### Data processing\r\nProcessing followed the same steps described for training, with [this script](./scripts/preprocess_test.py). Plots were also generated using a similar [script](./scripts/plot_train.py).\r\n\r\nAll the plots are available [here](./data/plots/).\r\n\r\n### Tests\r\nYou can find the tests in the [dedicated notebook](./scripts/VirBiCla.ipynb). 
\r\n\r\n- `test` dataset had the following output:\r\n\r\n    + **Confusion matrix:**\r\n    \r\n    |  |  |\r\n    |---|---|\r\n    | True positive: 986 |  False positive: 14 |\r\n    | False negative: 14 | True negative: 387 |\r\n\r\n    + **Accuracy:** 0.9800142755174875\r\n\r\n- `test1` dataset had the following output:\r\n\r\n    + **Confusion matrix:**\r\n    \r\n    |  |  |\r\n    |---|---|\r\n    | True positive: 1980 |  False positive: 20 |\r\n    | False negative: 32 | True negative: 1968 |\r\n\r\n    + **Accuracy:** 0.987\r\n\r\n- `test2` dataset had the following output:\r\n\r\n    + **Confusion matrix:**\r\n    \r\n    |  |  |\r\n    |---|---|\r\n    | True positive: 1181 |  False positive: 19 |\r\n    | False negative: 1 | True negative: 1071 |\r\n\r\n    + **Accuracy:** 0.9911971830985915\r\n\r\nOverall, the model seems to perform well in telling viral and non-viral DNA sequences apart.\r\n\r\n## Use: advices and limitations\r\n\u003c!-- User case in based on BioProject [PRJEB52499](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJEB52499), from paper [*Characterizations of the Gut Bacteriome, Mycobiome, and Virome in Patients with Osteoarthritis*](https://journals.asm.org/doi/10.1128/spectrum.01711-22). --\u003e\r\n\r\nA use case for VirBiCla is available [here](./user_case/).\r\n\r\nIt is based on data downloaded from NCBI SRA; the accession numbers are retained in the file names, so all the information about sequencing and the related research is easily accessible.\r\n\r\nThe three samples refer to _Aedes aegypti_'s virome ([SRR25317595](./user_case/SRR25317595_resized.fasta)), _Tuta absoluta_'s microbiome ([SRR20766474](./user_case/SRR20766474.fas)) and the rRNA gene region from several fungal samples ([ERR2576718](./user_case/ERR2576718.fasta)).\r\n\r\nSize-selected virome (\u003e1000 bp) was first combined with microbiome without any size restriction, and the resulting ca. 
17000 reads were classified by VirBiCla, resulting in [50% accuracy](./user_case/microbiome_notselected/). From the [confusion matrix](./user_case/microbiome_notselected/test.stats) you can easily see that the model correctly predicted almost all the actual viral sequences, whereas it misclassified the bacterial ones.\r\n\r\nWhen size-selected virome was combined with size-selected microbiome ([\u003e1000 bp](./user_case/microbiome_selected/gt1000/) and [\u003e1500 bp](./user_case/microbiome_selected/gt1500/)), progressively greater accuracy (53 and 60%) and better confusion matrices were observed (you can find them in the test.stats files in each folder).\r\n\r\nWhen size-selected virome was combined with unprocessed mycetome (in which most sequences were much longer than 2000 bp), [results](./user_case/microbiome_selected/gt2000/) were definitely better. VirBiCla yielded an almost perfect performance, with 98% accuracy and [few misclassified sequences](./user_case/microbiome_selected/gt2000/test.stats).\r\n\r\nAs general advice, we suggest that the user employ the model \"as is\" only on long-read sequencing metagenomics samples: as you can see, VirBiCla is really good at differentiating between non-viral and viral sequences in a context of long reads (\u003e2000-3000 bp). \r\n\r\n## Customization\r\n\r\nIf you have different data and you wish to customize VirBiCla for your needs, you need to follow these steps:\r\n\r\n1. Modify [preprocess_train.py](./scripts/preprocess_train.py), adding your own FASTA files to the preprocessing pipeline. 
For example, let's say that we have two files called \"virome.fa\" and \"microbiome.fa\": we should then make these modifications:\r\n```python\r\n# map each FASTA file stem to its label: \"V\" (viral) or \"NV\" (non-viral)\r\nMAPPING_DOMAINS = {\r\n    \"virome\": \"V\",\r\n    \"microbiome\": \"NV\",\r\n}\r\n\r\nif __name__ == \"__main__\":\r\n    csvpath = \"viral-vs-nonviral_train.csv\"\r\n    hugelist = []\r\n    for fsa in list(MAPPING_DOMAINS.keys()):\r\n        print(f\"Loading data from {fsa}...\")\r\n        fastafile = f\"{fsa}.fa\"\r\n```\r\n2. Modify [model.py](./scripts/model.py) with your data. For example, let's say that we have saved our training data in a CSV file called \"viral-vs-nonviral_train.csv\": we should then make these modifications:\r\n```python\r\n# load the custom training table under the name used on the following lines\r\ndf_train = pd.read_csv(\"viral-vs-nonviral_train.csv\")\r\nX_train = df_train.iloc[:, 1:]\r\ny_train_rev = df_train[\"Domain\"]\r\n```\r\n\r\n## License and right of usage\r\n\r\nThe program is provided under the MIT license (more at [LICENSE](./LICENSE)).\r\n\r\nWhen using VirBiCla, please consider citing the author ([Astra Bertelli](https://astrabert.vercel.app)) and this repository.\r\n\r\n## References\r\n- [scikit-learn](https://scikit-learn.org/stable/)\r\n- [codon-bias](https://codon-bias.readthedocs.io/en/latest/)\r\n- [matplotlib](https://matplotlib.org/)\r\n- [biopython](https://biopython.org/)\r\n- [pandas](https://pandas.pydata.org/)\r\n- [scipy](https://scipy.org/)\r\n- NCBI, SILVA and BLAST+-tools docs repository","funding_links":["https://github.com/sponsors/AstraBert"],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrabert%2Fvirbicla","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fastrabert%2Fvirbicla","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrabert%2Fvirbicla/lists"}