https://github.com/qanastek/biocreative-vii-track-5
[BioCreative VII] Track 5 - LitCovid track Multi-label topic classification for COVID-19 literature annotation
bert biocreative biomedical bionlp challenge classification flair healthcare machine-learning nlp pubmed tars text-classification tf-idf
- Host: GitHub
- URL: https://github.com/qanastek/biocreative-vii-track-5
- Owner: qanastek
- Created: 2021-09-28T12:16:45.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2023-08-13T13:29:09.000Z (almost 2 years ago)
- Last Synced: 2025-01-18T09:18:34.084Z (5 months ago)
- Topics: bert, biocreative, biomedical, bionlp, challenge, classification, flair, healthcare, machine-learning, nlp, pubmed, tars, text-classification, tf-idf
- Language: Python
- Homepage:
- Size: 20.4 MB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# BioCreative-VII-Track-5
**Author**:
[LABRAK Yanis](https://www.linkedin.com/in/yanis-labrak-8a7412145/)
Master 2 – Computer Science

**Affiliation**:
[Laboratoire Informatique d’Avignon (LIA)](https://lia.univ-avignon.fr/)
Natural Language Processing Department

## Compatibility issues between Flair 0.8 and 0.9 scripts
Refer to [this GitHub issue](https://github.com/flairNLP/flair/issues/2426) to resolve the compatibility issues, or downgrade to Flair 0.8.
## Descriptions
|submit_id |label-based micro avg precision|label-based micro avg recall|label-based micro avg f1|label-based macro avg precision|label-based macro avg recall|label-based macro avg f1|instance-based precision|instance-based recall|instance-based f1|
|-----------------|-------------------------------|----------------------------|------------------------|-------------------------------|----------------------------|------------------------|------------------------|---------------------|-----------------|
|BC7_submission_39|0.5130 |0.8598 |0.6426 |0.5240 |0.7391 |0.5614 |0.5965 |0.8597 |0.7043 |
|BC7_submission_40|0.8760 |0.8659 |0.8709 |0.8498 |0.8138 |0.8231 |0.8981 |0.8942 |0.8961 |
|BC7_submission_62|0.8699 |0.8966 |0.8830 |0.8298 |0.8570 |0.8366 |0.8993 |0.9198 |0.9094 |
|BC7_submission_61|0.8951 |0.8280 |0.8602 |0.8814 |0.7723 |0.8174 |0.8787 |0.8610 |0.8698 |

### BC7 Submission 39
I used class-specific keywords, extracted from the training dataset (keywords + title + abstract) with TF-IDF, to enhance a HuggingFace PubMedBERT model (`microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`). The model was adapted to the task by switching the loss function to binary cross-entropy (BCE) for multi-label classification, and it was trained for 28 epochs with a learning rate of 5e-5.
**Folder**: `model 2 - PubMed Train`
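
To illustrate why a BCE loss suits multi-label classification, here is a minimal, dependency-free sketch. It is not the code from this repository: the function names, the four-label example, and the 0.5 threshold are assumptions made for illustration. Each label gets its own sigmoid, so several labels can be active for the same document. In PyTorch the equivalent loss is `torch.nn.BCEWithLogitsLoss`, although the source does not state which implementation was used.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over the labels of one document.

    Uses the stable form max(z, 0) - z * y + log(1 + exp(-|z|)),
    which equals -[y * log(p) + (1 - y) * log(1 - p)] with p = sigmoid(z).
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def predict_labels(logits, threshold=0.5):
    """Turn raw scores into independent 0/1 decisions, one per label."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


# Hypothetical raw model outputs for one document and four labels.
logits = [2.0, -1.5, 0.3, -3.0]
targets = [1, 0, 1, 0]

loss = bce_with_logits(logits, targets)
predicted = predict_labels(logits)
```

Unlike a softmax with cross-entropy, nothing here forces the label probabilities to sum to one, which is what allows a document to receive more than one topic.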
### BC7 Submission 40
I used TARS, a pretrained model from the paper "Task-Aware Representation of Sentences for Generic Text Classification" that is available in the Flair framework, to classify documents based only on their abstracts. It was trained for 50 epochs with a learning rate of 0.02, using only 85% of the training corpus.
**Folder**: `model 1 - 50 runs flair tars`
### [Refused] BC7 Submission 42
I used class-specific keywords, extracted from the training dataset (keywords + title + abstract) with TF-IDF, to enhance a HuggingFace PubMedBERT model (`microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`). The model was adapted to the task by switching the loss function to binary cross-entropy (BCE) for multi-label classification, and it was trained for 28 epochs on **Train and Dev** with a learning rate of 5e-5.
**Folder**: `model 5 - PubMed Train+Dev`
**Refused**: Due to negative predictions.
### BC7 Submission 61
I trained a TF-IDF model over 1-, 2-, and 3-grams on both the Train and Dev datasets to compute TF-IDF vectors (dimension 20K), which represent the documents (keywords + title + abstract) as input to the multi-label SVM classifier.
**Folder**: `model 6 - TF-IDF 1-2-3 gram`
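
The feature-extraction step described above can be sketched in plain Python as follows. This is an illustrative approximation, not the repository's code: the whitespace tokenizer, the smoothed IDF formula, the L2 normalisation, and the toy corpus are all assumptions. With scikit-learn, a comparable setup would presumably be `TfidfVectorizer(ngram_range=(1, 3), max_features=20000)` feeding a `LinearSVC` wrapped in `OneVsRestClassifier`, but the source does not say which library or SVM variant was used. The classifier itself is left out of the sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams with n_min <= n <= n_max as strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def fit_tfidf(docs, max_features=20000):
    """Build a vocabulary capped at max_features and smoothed IDF weights."""
    term_counts = Counter()  # total occurrences of each n-gram
    doc_freq = Counter()     # number of documents containing each n-gram
    for doc in docs:
        grams = ngrams(doc.lower().split())
        term_counts.update(grams)
        doc_freq.update(set(grams))
    # Keep the most frequent n-grams, up to the feature cap.
    vocab = [term for term, _ in term_counts.most_common(max_features)]
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf


def transform(doc, vocab, idf):
    """Map one document to an L2-normalised TF-IDF vector over vocab."""
    counts = Counter(ngrams(doc.lower().split()))
    vec = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


# Toy corpus standing in for "keywords + title + abstract" strings.
docs = [
    "covid vaccine trial results",
    "covid transmission in hospitals",
    "vaccine side effects study",
]
vocab, idf = fit_tfidf(docs, max_features=20000)
vectors = [transform(d, vocab, idf) for d in docs]
```

Each row of `vectors` would then be one training example for the multi-label classifier, with one binary decision per topic.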
### BC7 Submission 62
I used class-specific keywords, extracted from the training dataset (keywords + title + abstract) with TF-IDF, to enhance TARS, a pretrained model from the paper "Task-Aware Representation of Sentences for Generic Text Classification" that is available in the Flair framework. The model was adapted to multi-label classification and trained for 10 epochs on Train only with a learning rate of 0.02.
**Folder**: `model 4 - flair all + ner`
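
Several submissions above rely on class-specific keywords extracted with TF-IDF. One plausible way to obtain such keywords is sketched below, under clearly stated assumptions: every class is treated as a single pooled document, terms are scored by term frequency times inverse class frequency, and the top `top_k` terms are kept. The function name, the scoring formula, the label names, and the toy data are all hypothetical, and the source does not describe how the keywords were then fed to the model.

```python
import math
from collections import Counter, defaultdict


def class_keywords(docs, labels, top_k=3):
    """Rank terms per class by TF-IDF, treating each class as one document."""
    class_counts = defaultdict(Counter)
    for text, doc_labels in zip(docs, labels):
        tokens = text.lower().split()
        for label in doc_labels:  # multi-label: count the text for each of its labels
            class_counts[label].update(tokens)

    n_classes = len(class_counts)
    class_freq = Counter()  # number of classes in which each term occurs
    for counts in class_counts.values():
        class_freq.update(counts.keys())

    keywords = {}
    for label, counts in class_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_classes / class_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[label] = ranked[:top_k]
    return keywords


# Hypothetical documents and label names, for illustration only.
docs = [
    "vaccine trial efficacy",
    "vaccine dose trial",
    "mask transmission airborne",
    "transmission airborne spread vaccine",
]
labels = [["treatment"], ["treatment"], ["transmission"], ["transmission"]]

keywords = class_keywords(docs, labels, top_k=3)
```

A term that occurs in every class (here `vaccine`) scores zero and drops to the bottom of each ranking, which is the behaviour that makes the surviving keywords class-specific.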