https://github.com/marcogarlet/sarscov2vec

NLP applied to extract information on compounds active against the SARS-CoV-2 viral protease from large text corpora.

information-retrieval nlp svm-classifier word2vec


# sarscov2vec

[![DOI](https://zenodo.org/badge/443329619.svg)](https://zenodo.org/badge/latestdoi/443329619)

Implements the [Elton et al.](https://arxiv.org/pdf/1903.00415.pdf) pipeline, using the [Mekni et al.](https://www.mdpi.com/1422-0067/22/14/7714) SARS-CoV-2 viral protease SVM classifier, on PubMed Central (PMC) Open Access articles.

## scheme

*Figures: Elton, Mekni.*

## project

*Figures: IRArch, flowchart.*

ChemDataExtractor is used to identify chemical entities, which are validated using PubChemPy. PaDEL-Descriptor is then used to extract the compound descriptors.

## description

A 2-D PCA is used to plot the word2vec results, following the Elton et al. pipeline.
As an alternative approach, the elbow method is used to select the optimal number of PCA output dimensions, and incremental K-means is then applied.
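
The README does not state which elbow criterion the project uses, so the following is only a minimal sketch of one common heuristic: pick the point of the curve that lies farthest from the straight line joining its first and last points. The function name `elbow_index` and the explained-variance values are hypothetical. In practice the values might come from something like scikit-learn's `PCA.explained_variance_ratio_`, which is an assumption about the tooling, not a statement about this project.

```python
import math


def elbow_index(values):
    """Return the index of the point farthest from the line that joins
    the first and last points of the curve (a common elbow heuristic)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three points to locate an elbow")
    x0, y0 = 0, values[0]
    x1, y1 = n - 1, values[-1]
    norm = math.hypot(x1 - x0, y1 - y0)
    best_i, best_d = 0, -1.0
    for i, y in enumerate(values):
        # Perpendicular distance from (i, y) to the first-to-last line.
        d = abs((y1 - y0) * (i - x0) - (x1 - x0) * (y - y0)) / norm
        if d > best_d:
            best_i, best_d = i, d
    return best_i


# Hypothetical explained-variance ratios, one per principal component.
explained = [0.50, 0.25, 0.10, 0.05, 0.04, 0.03, 0.03]

# Components are counted from 1, indices from 0.
n_components = elbow_index(explained) + 1
print(n_components)
```

The same helper could also be applied to a curve of K-means SSE values to choose the number of clusters, since that curve is typically read the same way.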

## design

The Strategy pattern is used so that load/store strategies and classifiers can be swapped dynamically.
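
As an illustration of the idea, here is a minimal sketch of the Strategy pattern applied to storage. All class and method names (`StoreStrategy`, `FileSystemStore`, `InMemoryStore`, `EntityRepository`) are hypothetical and are not taken from the project's code. The in-memory class only stands in for a database-backed strategy such as MongoDB.

```python
from abc import ABC, abstractmethod
import json
import os
import tempfile


class StoreStrategy(ABC):
    """Interface shared by every storage backend (hypothetical name)."""

    @abstractmethod
    def save(self, key, record):
        ...

    @abstractmethod
    def load(self, key):
        ...


class FileSystemStore(StoreStrategy):
    """Stores each record as a JSON file inside a directory."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.root, f"{key}.json")

    def save(self, key, record):
        with open(self._path(key), "w", encoding="utf-8") as fh:
            json.dump(record, fh)

    def load(self, key):
        with open(self._path(key), encoding="utf-8") as fh:
            return json.load(fh)


class InMemoryStore(StoreStrategy):
    """Stand-in for a database-backed strategy (for example MongoDB)."""

    def __init__(self):
        self._data = {}

    def save(self, key, record):
        self._data[key] = record

    def load(self, key):
        return self._data[key]


class EntityRepository:
    """Context object: delegates to whichever strategy it currently holds."""

    def __init__(self, strategy):
        self.strategy = strategy

    def add(self, name, sentences):
        self.strategy.save(name, {"name": name, "sentences": sentences})

    def get(self, name):
        return self.strategy.load(name)


repo = EntityRepository(InMemoryStore())
repo.add("compound_a", ["placeholder sentence"])

# Swap the behaviour at run time without touching EntityRepository.
repo.strategy = FileSystemStore(tempfile.mkdtemp())
repo.add("compound_a", ["placeholder sentence"])
print(repo.get("compound_a"))
```

The calling code depends only on the `save`/`load` interface, so adding another backend or another classifier means adding one class rather than editing the callers.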

## usage
```console
foo@bar:~/project$ ./build.sh
...
# start padel container
foo@bar:~/project$ ./padel-service/padel-service.sh
...
# start mongo docker container
foo@bar:~/project$ ./mongo-dock.sh
...
# start project
foo@bar:~/project$ python3 sarscov2vec.py
...

```

Optionally, you can remove the lines in [code/mainProject.py](code/mainProject.py) marked with the comment "delete this to use FS" to disable MongoDB and store chemical entities and sentences on the file system instead.
In that case, skip the command that starts the Mongo Docker container.

## results

### pca 2-d

2-D PCA results, with compounds active against the SARS-CoV-2 viral protease highlighted in color.

*Figure panels: 40MB, 64MB, 190MB, 625MB, 747MB, 902MB.*

### optimal PCA output dimension and K-means

*Figures: Elton, Mekni.*

| Cluster Num. | active CE | CE | words |
| ----------- | ----------- | ----------- | ----------- |
| 0 | 15 | 989 | 89879 |
| 1 | 1 | 92 | 4370 |
| 2 | 0 | 9 | 1272 |

| Coeff. | value |
| ------------- | ----------- |
| silhouette avg| 0.8479288100264025|
| SSE (k=3) | 2390 |
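
For readers unfamiliar with the two coefficients above, the sketch below shows how an SSE and an average silhouette value are defined, computed on made-up toy points rather than on the project's data. The function names are hypothetical. A real pipeline would more likely call a library routine (for instance scikit-learn's `silhouette_score`, which is an assumption about the tooling).

```python
import math


def sse(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


def silhouette_avg(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself contributes 0 to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: two well-separated clusters of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
centroids = {0: (0.0, 0.5), 1: (10.0, 0.5)}

toy_sse = sse(points, labels, centroids)
toy_silhouette = silhouette_avg(points, labels)
print(toy_sse, round(toy_silhouette, 3))
```

A silhouette average close to 1 indicates compact, well-separated clusters, while SSE shrinks as points sit closer to their centroids.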

| **Term** | t0 | t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 |
| - | - | - | - | - | - | - | - | - | - | - |
| **covid-19** | ill | psychiatric | hiv-positive | dementia | pandemic | concern | pertain | hemophilia | people | behaviour |

### identified active fragments