{"id":16550335,"url":"https://github.com/marcogarlet/sarscov2vec","last_synced_at":"2025-10-28T18:30:48.064Z","repository":{"id":45187034,"uuid":"443329619","full_name":"MarcoGarlet/sarscov2vec","owner":"MarcoGarlet","description":"NLP applied to extract information of actives compound against SARS-CoV-2 viral protease from large text corpora.","archived":false,"fork":false,"pushed_at":"2024-10-03T21:52:24.000Z","size":3929,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-01T17:05:37.571Z","etag":null,"topics":["information-retrieval","nlp","svm-classifier","word2vec"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MarcoGarlet.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-12-31T11:26:06.000Z","updated_at":"2025-01-12T23:36:45.000Z","dependencies_parsed_at":"2022-07-26T16:32:00.620Z","dependency_job_id":null,"html_url":"https://github.com/MarcoGarlet/sarscov2vec","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MarcoGarlet%2Fsarscov2vec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MarcoGarlet%2Fsarscov2vec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MarcoGarlet%2Fsarscov2vec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MarcoGarlet%2Fsarscov2vec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MarcoGarlet","download_url":"ht
tps://codeload.github.com/MarcoGarlet/sarscov2vec/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238689771,"owners_count":19514091,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["information-retrieval","nlp","svm-classifier","word2vec"],"created_at":"2024-10-11T19:33:55.269Z","updated_at":"2025-10-28T18:30:47.213Z","avatar_url":"https://github.com/MarcoGarlet.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# sarscov2vec\n\n[![DOI](https://zenodo.org/badge/443329619.svg)](https://zenodo.org/badge/latestdoi/443329619)\n\nImplements the [Elton et al.](https://arxiv.org/pdf/1903.00415.pdf) pipeline using the [Mekni et al.](https://www.mdpi.com/1422-0067/22/14/7714) SARS-CoV-2 viral protease SVM on PubMed Central (PMC) Open Access articles.\n\n\n## scheme\n\u003cp align=\"center\"\u003e\n  \u003cimg alt=\"Elton\" src=\"img/EltonPipeline.jpg\" width=\"45%\"\u003e\n\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\n  \u003cimg alt=\"Mekni\" src=\"img/mekni.png\" width=\"35%\"\u003e\n\u003c/p\u003e\n\n\n## project \n\u003cp align=\"center\"\u003e\n  \u003cimg alt=\"IRArch\" src=\"img/IRArch.png\" width=\"35%\"\u003e\n\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\n  \u003cimg alt=\"flowchart\" src=\"img/flowchart.jpg\" width=\"35%\"\u003e\n\u003c/p\u003e\n\nChemDataExtractor is used to identify chemical entities, which are validated using PubChemPy; PaDEL-Descriptor software is used to extract compound descriptors.  \n\n## description\n\n2-d PCA is used to plot word2vec results following the Elton et al. 
pipeline. \nMoreover, as an alternative approach, the elbow method is used to select the optimal PCA output dimension, and incremental K-means is applied.\n\n## design\n\nThe Strategy pattern is used to switch dynamically between different load/store strategies and classifiers. \n\n\u003cimg src=\"img/StrategyPattern2.png \" width=\"45%\"/\u003e\n\n## usage\n```console\nfoo@bar:~/project$ ./build.sh\n...\n# start padel container\nfoo@bar:~/project$ ./padel-service/padel-service.sh \n...\n# start mongo docker container\nfoo@bar:~/project$ ./mongo-dock.sh\n...\n# start project\nfoo@bar:~/project$ python3 sarscov2vec.py\n...\n\n```\n\nOptionally, you can remove the lines in [code/mainProject.py](code/mainProject.py) (commented with \"delete this to use FS\") to disable MongoDB and use the file system to store chemical entities and sentences.\nIn this case, skip the command that starts the mongo docker container.\n\n\n## results\n\n### pca 2-d\n\nPCA 2-d results, coloring the compounds active against the SARS-CoV-2 viral protease.\n\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003cimg src=\"img/result_0_1.png\"  alt=\"40MB\" width = \"60%\"\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"img/result_0_3.png\" alt=\"64MB\" width = \"60%\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e \n  \u003ctr\u003e\n      \u003ctd\u003e\u003cimg src=\"img/result_0_5.png\" alt=\"190MB\" width = \"60%\"\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cimg src=\"img/result_0_7.png\"  alt=\"625MB\" width =\"60%\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n     \u003ctd\u003e\u003cimg src=\"img/result_0_9.png\" alt=\"747MB\" width = \"60%\"\u003e\u003c/td\u003e\n     \u003ctd\u003e\u003cimg src=\"img/result_0_11.png\" alt=\"902MB\" width = \"60%\"\u003e\u003c/td\u003e   \n  \u003c/tr\u003e\n\u003c/table\u003e\n\n### optimal PCA out and K-MEANS\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg alt=\"Elton\" src=\"img/PCA_w2vec.png\" width=\"45%\"\u003e\n\u0026nbsp; \u0026nbsp; 
\u0026nbsp; \u0026nbsp;\n  \u003cimg alt=\"Mekni\" src=\"img/K_means.png\" width=\"45%\"\u003e\n\u003c/p\u003e\n\n| Cluster Num.      | active CE | CE | words |\n| ----------- | ----------- | ----------- | ----------- |\n| 0           | 15          | 989         | 89879       |\n| 1           | 1           | 92          | 4370        |\n| 2           | 0           | 9           | 1272        |\n\n\n| Coeff.      | value |\n| ------------- | ----------- |\n| silhouette avg| 0.8479288100264025|\n| SSE (k=3)     | 2390           |\n\n\n| **Term** | t0 | t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 |\n| - | - | - | - | - | - | - | - | - | - | - |\n| **covid-19** | ill | psychiatric | hiv-positive | dementia | pandemic | concern | pertain | hemophilia | people | behaviour |\n\n### identified active fragments\n\n\u003ctable\u003e\n\n  \u003ctr\u003e\n    \u003ctd valign=\"top\"\u003e\u003cimg src=\"img/carbononitridic-bromide6.png\" width=\"45%\"\u003e\u003c/td\u003e\n    \u003ctd valign=\"top\"\u003e\u003cimg src=\"img/chloroethane5.png\" width=\"45%\"\u003e\u003c/td\u003e\n    \u003ctd valign=\"top\"\u003e\u003cimg src=\"img/chloroform;ethanol7.png\" width=\"45%\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n   \u003ctr\u003e\n    \u003ctd valign=\"top\"\u003e\u003cimg src=\"img/chloroform;methanol3.png\" width=\"45%\"\u003e\u003c/td\u003e\n    \u003ctd valign=\"top\"\u003e\u003cimg src=\"img/dichloromethane0.png\" width=\"45%\"\u003e\u003c/td\u003e\n    \u003ctd valign=\"top\"\u003e\u003cimg src=\"img/ethenol2.png\" width=\"45%\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n     \u003ctr\u003e\n    \u003ctd valign=\"top\"\u003e\u003cimg src=\"img/furan1.png\" width=\"45%\"\u003e\u003c/td\u003e\n    \u003ctd valign=\"top\"\u003e\u003cimg src=\"img/pentane8.png\" width=\"45%\"\u003e\u003c/td\u003e\n    \u003ctd valign=\"top\"\u003e\u003cimg src=\"img/2-%5B2-(2-heptoxyethoxy)ethoxy%5Dethanol4.png\" width=\"45%\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n 
\u003c/table\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarcogarlet%2Fsarscov2vec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmarcogarlet%2Fsarscov2vec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarcogarlet%2Fsarscov2vec/lists"}