{"id":20405565,"url":"https://github.com/astrabert/resistml","last_synced_at":"2025-08-16T18:09:52.485Z","repository":{"id":231137131,"uuid":"780656118","full_name":"AstraBert/resistML","owner":"AstraBert","description":"A tool for AMR gene family prediction, simple and ML-based","archived":false,"fork":false,"pushed_at":"2024-04-08T18:42:54.000Z","size":1612,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-11T07:55:51.056Z","etag":null,"topics":["ai","antibiotic-resistance","bert-model","finetuning","healthcare","machine-learning","protein-sequences","text-classification","voting-classifier"],"latest_commit_sha":null,"homepage":"https://huggingface.co/as-cle-bert/resistBERT","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AstraBert.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["AstraBert"]}},"created_at":"2024-04-01T23:07:38.000Z","updated_at":"2025-05-11T07:50:26.000Z","dependencies_parsed_at":"2024-11-15T05:12:06.861Z","dependency_job_id":"23d8c073-9509-4e14-aadd-36bb4d42b5c7","html_url":"https://github.com/AstraBert/resistML","commit_stats":null,"previous_names":["astrabert/resistml"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/AstraBert/resistML","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraBert%2FresistML","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraBert%2FresistML/tags","releases_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraBert%2FresistML/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraBert%2FresistML/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AstraBert","download_url":"https://codeload.github.com/AstraBert/resistML/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraBert%2FresistML/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270749468,"owners_count":24638746,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-16T02:00:11.002Z","response_time":91,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","antibiotic-resistance","bert-model","finetuning","healthcare","machine-learning","protein-sequences","text-classification","voting-classifier"],"created_at":"2024-11-15T05:11:55.864Z","updated_at":"2025-08-16T18:09:52.433Z","avatar_url":"https://github.com/AstraBert.png","language":"Jupyter Notebook","funding_links":["https://github.com/sponsors/AstraBert"],"categories":[],"sub_categories":[],"readme":".. 
raw:: html\r\n\r\n   \u003ctable\u003e\r\n   \u003ctr\u003e\r\n   \u003ctd\u003e\r\n   \u003cimg src=\"https://img.shields.io/github/languages/top/AstraBert/resistML\" alt=\"GitHub top language\"\u003e\r\n   \u003c/td\u003e\r\n   \u003ctd\u003e\r\n   \u003cimg src=\"https://img.shields.io/github/commit-activity/t/AstraBert/resistML\" alt=\"GitHub commit activity\"\u003e\r\n   \u003c/td\u003e\r\n   \u003ctd\u003e\r\n   \u003cimg src=\"https://img.shields.io/badge/resistML-stable-green\" alt=\"Static Badge\"\u003e\r\n   \u003c/td\u003e\r\n   \u003ctd\u003e\r\n   \u003cimg src=\"https://img.shields.io/badge/resistBERT-unstable-orange\" alt=\"Static Badge\"\u003e\r\n   \u003c/td\u003e\r\n   \u003ctd\u003e\r\n   \u003cimg src=\"https://img.shields.io/badge/Release-v0.0.0-blue\" alt=\"Static Badge\"\u003e\r\n   \u003c/td\u003e\r\n   \u003c/tr\u003e\r\n   \u003c/table\u003e\r\n\r\n========\r\nresistML\r\n========\r\n\r\nA tool for AMR gene family prediction, simple and ML-based.\r\n\r\nTraining\r\n========\r\n\r\nData collection for training\r\n----------------------------\r\n\r\nThe latest reference sequences (Feb 2024 release) were downloaded from **CARD** (*The Comprehensive Antibiotic Resistance Database*). If you want to download them yourself, use `this link \u003chttps://card.mcmaster.ca/latest/data\u003e`_.\r\n\r\nProtein sequences were mapped through their ARO indices to the corresponding AMR gene families (see `this file \u003chttps://github.com/AstraBert/resistML/tree/main/data/aro_categories_index.tsv\u003e`_ for reference) and the 12 most common families were chosen to train resistML and resistBERT.\r\n\r\nTraining procedures\r\n-------------------\r\n\r\nresistML (stable)\r\n~~~~~~~~~~~~~~~~~\r\n\r\nresistML was trained on all the protein sequences retrieved beforehand, with their features extracted into a `csv file \u003chttps://github.com/AstraBert/resistML/tree/main/data/proteinstats.tsv\u003e`_. 
\r\n\r\nFeatures were extracted with the Biopython :menuselection:`Bio.SeqUtils.ProtParam --\u003e ProteinAnalysis` class, and they are (uppercase marks the header you can find in the csv):\r\n\r\n- HIDROPHOBICITY score\r\n- ISOELECTRIC point\r\n- AROMATICity\r\n- INSTABility\r\n- MW (molecular weight)\r\n- HELIX,TURN,SHEET (percentage of these three secondary structures)\r\n- MOL_EXT_RED,MOL_EXT_OX (molar extinction coefficient, reduced and oxidized)\r\n\r\nDataset building occurred `here \u003chttps://github.com/AstraBert/resistML/tree/main/scripts/build_base_dataset.py\u003e`_.\r\n\r\nThe base model itself is a simple Voting Classifier based on a DecisionTreeClassifier, an ExtraTreesClassifier and a HistGradientBoostingClassifier, all provided by the scikit-learn library.\r\n\r\nDuring validation, it yielded 100% accuracy when predicting the training data.\r\n\r\nresistBERT (unstable)\r\n~~~~~~~~~~~~~~~~~~~~~\r\n\r\nresistBERT is a BERT model for text classification, finetuned from `prot_bert \u003chttps://huggingface.co/Rostlab/prot_bert\u003e`_ by Rostlab.\r\n\r\nThe data used for finetuning were a selection of 1496 sequences out of the 1836 total. 
80% were used for training and 20% for validation.\r\n\r\nSequences were preprocessed and labelled `here \u003chttps://github.com/AstraBert/resistML/tree/main/scripts/build_base_dataset.py\u003e`_, then the complete jsonl file was reduced `here \u003chttps://github.com/AstraBert/resistML/tree/main/scripts/reduce_dataset.py\u003e`_ and uploaded to Hugging Face under the identifier :command:`as-cle-bert/AMR-Gene-Families` through `this script \u003chttps://github.com/AstraBert/resistML/tree/main/scripts/jsonl2hfdataset.py\u003e`_.\r\n\r\nFinetuning was run on the HF dataset with AutoTrain. During validation, the model yielded the following stats:\r\n\r\n- loss: 0.08235077559947968\r\n\r\n- f1_macro: 0.986759581881533\r\n\r\n- f1_micro: 0.99\r\n\r\n- f1_weighted: 0.9899790940766551\r\n\r\n- precision_macro: 0.9871615312791784\r\n\r\n- precision_micro: 0.99\r\n\r\n- precision_weighted: 0.9901213818860879\r\n\r\n- recall_macro: 0.986574074074074\r\n\r\n- recall_micro: 0.99\r\n\r\n- recall_weighted: 0.99\r\n\r\n- accuracy: 0.99\r\n\r\nThe model is now available on Hugging Face under the identifier :command:`as-cle-bert/resistBERT`. There is also a widget through which you can run inference via the HF :command:`Inference API`. Keep in mind that the Inference API *can* be unstable, so downloading the model and running it on a local machine or cloud service is preferable. \r\n\r\nTesting\r\n=======\r\n\r\nData retrieval for tests\r\n------------------------\r\n\r\nData were downloaded from **CARD** (*The Comprehensive Antibiotic Resistance Database*), as the annotations for the family names used to label training sequences were the same. \r\n\r\nFor families \"PDC beta-lactamase\", \"CTX-M beta-lactamase\", \"SHV beta-lactamase\", \"CMY beta-lactamase\", sequences were downloaded after searching for the exact AMR gene family name used in the training labels, through the `Download sequences` method. 
On the download customization page, filters were set to `is_a` and `Protein`.\r\n\r\nFor all the other families, the procedure was the same but customization filters were set to `is_a`, `structurally_homologous_to`, `evolutionary_variant_of` and `Protein` to increase the number of retrieved sequences.\r\n\r\nTest building\r\n-------------\r\n\r\nTests were built with `this script \u003chttps://github.com/AstraBert/resistML/tree/main/scripts/build_tests.py\u003e`_. \r\n\r\nThese are the test metadata:\r\n\r\n**Metadata for test 0:**\r\n\r\n- Protein statistics for resistML were saved in test/testfiles/test_0.csv\r\n- Sequences and labels for resistBERT were saved in test/testfiles/test_0.jsonl\r\n- 12 protein sequences were taken into account for 2 families\r\n- Families taken into account were: quinolone resistance protein (qnr), CMY beta-lactamase\r\n\r\n**Metadata for test 1:**\r\n\r\n- Protein statistics for resistML were saved in test/testfiles/test_1.csv\r\n- Sequences and labels for resistBERT were saved in test/testfiles/test_1.jsonl\r\n- 11 protein sequences were taken into account for 2 families\r\n- Families taken into account were: VIM beta-lactamase, IMP beta-lactamase\r\n\r\n**Metadata for test 2:**\r\n\r\n- Protein statistics for resistML were saved in test/testfiles/test_2.csv\r\n- Sequences and labels for resistBERT were saved in test/testfiles/test_2.jsonl\r\n- 13 protein sequences were taken into account for 2 families\r\n- Families taken into account were: quinolone resistance protein (qnr), SHV beta-lactamase\r\n\r\n**Metadata for test 3:**\r\n\r\n- Protein statistics for resistML were saved in test/testfiles/test_3.csv\r\n- Sequences and labels for resistBERT were saved in test/testfiles/test_3.jsonl\r\n- 10 protein sequences were taken into account for 3 families\r\n- Families taken into account were: quinolone resistance protein (qnr), VIM beta-lactamase, CMY beta-lactamase\r\n\r\n**Metadata for test 4:**\r\n\r\n- Protein statistics for 
resistML were saved in test/testfiles/test_4.csv\r\n- Sequences and labels for resistBERT were saved in test/testfiles/test_4.jsonl\r\n- 12 protein sequences were taken into account for 2 families\r\n- Families taken into account were: CMY beta-lactamase, IMP beta-lactamase\r\n\r\n**Metadata for test 5:**\r\n\r\n- Protein statistics for resistML were saved in test/testfiles/test_5.csv\r\n- Sequences and labels for resistBERT were saved in test/testfiles/test_5.jsonl\r\n- 12 protein sequences were taken into account for 2 families\r\n- Families taken into account were: VIM beta-lactamase, SHV beta-lactamase\r\n\r\n**Metadata for test 6:**\r\n\r\n- Protein statistics for resistML were saved in test/testfiles/test_6.csv\r\n- Sequences and labels for resistBERT were saved in test/testfiles/test_6.jsonl\r\n- 11 protein sequences were taken into account for 3 families\r\n- Families taken into account were: PDC beta-lactamase, MCR phosphoethanolamine transferase, ACT beta-lactamase\r\n\r\n**Metadata for test 7:**\r\n\r\n- Protein statistics for resistML were saved in test/testfiles/test_7.csv\r\n- Sequences and labels for resistBERT were saved in test/testfiles/test_7.jsonl\r\n- 10 protein sequences were taken into account for 3 families\r\n- Families taken into account were: MCR phosphoethanolamine transferase, CTX-M beta-lactamase, PDC beta-lactamase\r\n\r\n**Metadata for test 8:**\r\n\r\n- Protein statistics for resistML were saved in test/testfiles/test_8.csv\r\n- Sequences and labels for resistBERT were saved in test/testfiles/test_8.jsonl\r\n- 12 protein sequences were taken into account for 2 families\r\n- Families taken into account were: ACT beta-lactamase, CMY beta-lactamase\r\n\r\n**Metadata for test 9:**\r\n\r\n- Protein statistics for resistML were saved in test/testfiles/test_9.csv\r\n- Sequences and labels for resistBERT were saved in test/testfiles/test_9.jsonl\r\n- 15 protein sequences were taken into account for 3 families\r\n- Families taken into account were: 
quinolone resistance protein (qnr), SHV beta-lactamase, KPC beta-lactamase\r\n\r\nAll data can be found `here \u003chttp://github.com/AstraBert/resistML/tree/main/test\u003e`_ , along with the sequences used to generate them.\r\n\r\nTest results\r\n------------\r\n\r\n**resistML** yielded 100% accuracy, f1 score, recall score and precision score in all 10 tests.\r\n\r\n**resistBERT** was more unstable:\r\n\r\n- On test_0, test_2, test_4, test_6, test_7, test_8 and test_9 it yielded 100% accuracy, f1 score, recall score and precision score\r\n- On test_1 it yielded:\r\n\r\n  1. Accuracy: 50%\r\n  2. f1 score: 33%\r\n  3. Precision: 25%\r\n  4. Recall: 50%\r\n\r\n- On test_3 it yielded 66.7% accuracy, f1 score, recall score and precision score\r\n- On test_5 it yielded 50% accuracy, f1 score, recall score and precision score\r\n\r\n\r\nAll results for resistBERT can be found `in the dedicated notebook \u003chttp://github.com/AstraBert/resistML/scripts/test_resistBERT.ipynb\u003e`_ . \r\n\r\nLicense and rights of usage \r\n===========================\r\n\r\nThis repository is provided under the MIT license (more at `LICENSE \u003chttps://github.com/AstraBert/breastcancer-auto-class/blob/main/LICENSE\u003e`_).\r\n\r\nIf you use this work for your projects, please consider citing the author `Astra Bertelli \u003chttp://astrabert.vercel.app\u003e`_ .\r\n\r\nReferences\r\n==========\r\n\r\n1. **CARD - The Comprehensive Antibiotic Resistance Database**\r\n\r\n2. **Biopython**\r\n\r\n3. **Scikit-learn** \r\n\r\n4. **Hugging Face's prot_bert Model**\r\n\r\n5. 
**Hugging Face's AutoTrain**\r\n\r\nIf you feel that your work was relevant in building resistML and you weren't referenced in this section, feel free to flag an issue or to contact the author.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrabert%2Fresistml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fastrabert%2Fresistml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrabert%2Fresistml/lists"}