Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/marcogarlet/sarscov2vec
NLP applied to extract information on compounds active against the SARS-CoV-2 viral protease from large text corpora.
information-retrieval nlp svm-classifier word2vec
- Host: GitHub
- URL: https://github.com/marcogarlet/sarscov2vec
- Owner: MarcoGarlet
- License: mit
- Created: 2021-12-31T11:26:06.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2022-04-29T10:10:55.000Z (over 2 years ago)
- Last Synced: 2023-04-04T07:11:49.907Z (over 1 year ago)
- Topics: information-retrieval, nlp, svm-classifier, word2vec
- Language: Python
- Homepage:
- Size: 3.74 MB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# sarscov2vec
[![DOI](https://zenodo.org/badge/443329619.svg)](https://zenodo.org/badge/latestdoi/443329619)
Implements the [Elton et al.](https://arxiv.org/pdf/1903.00415.pdf) pipeline, using the [Mekni et al.](https://www.mdpi.com/1422-0067/22/14/7714) SARS-CoV-2 viral protease SVM, on PubMed Central (PMC) Open Access articles.
## scheme
## project
ChemDataExtractor is used to identify chemical entities, which are validated using PubChemPy; PaDEL-Descriptor is used to extract compound descriptors.
## description
A 2-d PCA is used to plot the word2vec results, following the Elton et al. pipeline.
As an alternative approach, the elbow method is used to select the optimal PCA output dimension, and incremental K-means is then applied.
## design
The Strategy pattern is used to switch behavior dynamically between different load/store strategies and classifiers.
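As a rough illustration of how the Strategy pattern can separate storage back ends, here is a minimal sketch. All class and method names (`StoreStrategy`, `InMemoryStore`, `FileSystemStore`, `Pipeline`, `remember_entity`) are hypothetical and not taken from this repository; the in-memory class only stands in for a database-backed strategy such as MongoDB.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StoreStrategy(ABC):
    """Hypothetical interface for persisting chemical entities and sentences."""

    @abstractmethod
    def save(self, key: str, record: dict) -> None:
        ...

    @abstractmethod
    def load(self, key: str) -> dict:
        ...


class InMemoryStore(StoreStrategy):
    """Stand-in for a database-backed strategy (assumption: not the real MongoDB code)."""

    def __init__(self) -> None:
        self._records = {}

    def save(self, key: str, record: dict) -> None:
        self._records[key] = dict(record)

    def load(self, key: str) -> dict:
        return dict(self._records[key])


class FileSystemStore(StoreStrategy):
    """Stores each record as one JSON file inside a directory."""

    def __init__(self, root) -> None:
        self._root = Path(root)
        self._root.mkdir(parents=True, exist_ok=True)

    def save(self, key: str, record: dict) -> None:
        (self._root / f"{key}.json").write_text(json.dumps(record))

    def load(self, key: str) -> dict:
        return json.loads((self._root / f"{key}.json").read_text())


class Pipeline:
    """Context object: it depends only on the StoreStrategy interface."""

    def __init__(self, store: StoreStrategy) -> None:
        self.store = store

    def remember_entity(self, name: str, sentence: str) -> dict:
        # Persist the record, then read it back through the same strategy.
        self.store.save(name, {"name": name, "sentence": sentence})
        return self.store.load(name)


if __name__ == "__main__":
    pipeline = Pipeline(InMemoryStore())
    print(pipeline.remember_entity("aspirin", "example sentence"))
    # Swap the strategy at runtime without touching the pipeline code.
    pipeline.store = FileSystemStore(tempfile.mkdtemp())
    print(pipeline.remember_entity("aspirin", "example sentence"))
```

Because the context holds only a reference to the interface, swapping the back end is a one-line change, and the same idea would apply to interchangeable classifiers.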
## usage
```console
foo@bar:~/project$ ./build.sh
...
# start padel container
foo@bar:~/project$ ./padel-service/padel-service.sh
...
# start mongo docker container
foo@bar:~/project$ ./mongo-dock.sh
...
# start project
foo@bar:~/project$ python3 sarscov2vec.py
...
```
Optionally, MongoDB can be disabled by removing the lines in [code/mainProject.py](code/mainProject.py) marked with the comment "delete this to use FS"; chemical entities and sentences are then stored on the file system.
In this case, skip the command that starts the mongo docker container.
## results
### pca 2-d
2-d PCA results, with compounds active against the SARS-CoV-2 viral protease highlighted in color.
### optimal PCA out and K-MEANS
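The README does not say which curve the elbow method is applied to, so the following is only a sketch of one common heuristic: after rescaling both axes to [0, 1], pick the point farthest from the straight line joining the first and last points of the curve. The function name `elbow_index` and the sample scores are hypothetical.

```python
import math


def elbow_index(values):
    """Return the index of the point farthest from the chord joining
    the first and last points, after rescaling both axes to [0, 1]."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three points to locate an elbow")
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a flat curve has no elbow")

    # Rescale so that neither axis dominates the distance computation.
    xs = [i / (n - 1) for i in range(n)]
    ys = [(v - lo) / (hi - lo) for v in values]

    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    chord = math.hypot(x1 - x0, y1 - y0)

    best_i, best_d = 0, -1.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        # Perpendicular distance from (x, y) to the chord.
        d = abs((y1 - y0) * (x - x0) - (x1 - x0) * (y - y0)) / chord
        if d > best_d:
            best_i, best_d = i, d
    return best_i


if __name__ == "__main__":
    # Made-up scores (e.g. an error measure per candidate dimension).
    scores = [100, 40, 15, 12, 10, 9]
    print(elbow_index(scores))
```

The returned value is a position in the input list, so it has to be mapped back to the candidate dimension (or cluster count) that produced each score.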
| Cluster Num. | active CE | CE | words |
| ----------- | ----------- | ----------- | ----------- |
| 0 | 15 | 989 | 89879 |
| 1 | 1 | 92 | 4370 |
| 2 | 0 | 9 | 1272 |

| Coeff. | value |
| ------------- | ----------- |
| silhouette avg| 0.8479288100264025|
| SSE (k=3) | 2390 |

| **Term** | t0 | t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 |
| - | - | - | - | - | - | - | - | - | - | - |
| **covid-19** | ill | psychiatric | hiv-positive | dementia | pandemic | concern | pertain | hemophilia | people | behaviour |

### identified active fragments