https://github.com/cozek/profiling-fake-news-spreaders
This repo contains code related to our submission for the task of Profiling Fake News Spreaders in PAN at CLEF 2020
author-profiling ensemble-classifier fake-news pan2020 pytorch
- Host: GitHub
- URL: https://github.com/cozek/profiling-fake-news-spreaders
- Owner: cozek
- Created: 2020-07-17T13:52:30.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2020-10-16T16:16:03.000Z (over 4 years ago)
- Last Synced: 2024-10-25T07:32:23.170Z (3 months ago)
- Topics: author-profiling, ensemble-classifier, fake-news, pan2020, pytorch
- Language: Python
- Homepage:
- Size: 51.8 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# profiling-fake-news-spreaders
This repo contains code related to our submission for the task of [Profiling Fake News Spreaders](https://pan.webis.de/clef20/pan20-web/author-profiling.html) in PAN at CLEF 2020.
The [software](https://github.com/cozek/profiling-fake-news-spreaders/tree/master/software) directory includes the scripts used in TIRA, verbatim. PyTorch model weights are required to run the scripts. Run the scripts the same way they were run in TIRA:
```
./electra_{en/es}_{type}.py -i $inputDataset -o $outputDir
```
The scripts expect data in the XML format specified by the organizers on the [task homepage](https://pan.webis.de/clef20/pan20-web/author-profiling.html). You will need to edit the shebang to match your environment.

The model weights and Ensemble Training Notebooks can be viewed or downloaded from Kaggle:
- [Notebook/Weights for English Dataset](https://www.kaggle.com/coseck/fork-of-electra-on-pan-fake-news-2b295d)
- [Notebook/Weights for Spanish Dataset](https://www.kaggle.com/coseck/spanish-electra-on-pan-fake-news)

EDA of the Dataset: [EDA Notebook](https://www.kaggle.com/coseck/pan2020-profiling-fake-news-spreaders-eda)
Analysis of Ensemble: [Notebook](https://github.com/cozek/profiling-fake-news-spreaders/blob/master/notebooks/Analysis%20of%20Ensemble.ipynb)
Training data can be requested from [Zenodo](https://zenodo.org/record/3692319#.XxG-gi0w1QI).
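
For readers preparing their own input, the following is a minimal sketch of reading one author file and writing one prediction file. It is not taken from the scripts in this repo. The XML layout (an `<author lang="...">` root with `<document>` children, and an output element `<author id="..." lang="..." type="..."/>`) is an assumption based on the task description, so check it against the task homepage. The function names `read_author_tweets` and `write_prediction` are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def read_author_tweets(xml_path):
    """Return (author_id, lang, tweets) for one author file.

    Assumed layout: <author lang="en"><documents><document>...</document>...
    The author id is assumed to be the file name without its extension.
    """
    path = Path(xml_path)
    root = ET.parse(path).getroot()
    # Each <document> is assumed to hold the text of one tweet.
    tweets = [doc.text or "" for doc in root.iter("document")]
    return path.stem, root.get("lang"), tweets


def write_prediction(out_dir, author_id, lang, label):
    """Write <author id=... lang=... type=.../> to out_dir/<author_id>.xml.

    `label` is assumed to be 0 or 1; the meaning of each value is defined
    by the task organizers, not by this sketch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    elem = ET.Element(
        "author", {"id": author_id, "lang": lang, "type": str(label)}
    )
    out_path = out_dir / f"{author_id}.xml"
    ET.ElementTree(elem).write(out_path, encoding="utf-8")
    return out_path
```

A model's prediction step would sit between the two calls: read the tweets for each author file in the input directory, classify them, then write one output file per author.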
The scripts are described below:
- `electra_{en/es}_ensemble.py`: Runs the complete ensemble of 15 models on the given inputDataset.
```
./electra_{en/es}_ensemble.py -i inputDatasetDir -o outputDir -m savedModelsDir
```
- `electra_{en/es}_oneshot.py`: Runs the best model on the given inputDataset only once.
```
./electra_{en/es}_oneshot.py -i inputDatasetDir -o outputDir -m bestmodel.pt
```
- `electra_{en/es}_solo.py`: Creates the ensemble using 15 copies of the best model and runs it on the given inputDataset.
```
./electra_{en/es}_solo.py -i inputDatasetDir -o outputDir -m bestmodel.pt
```

Requirements:
- This work reuses code written for another project, which must be pulled as well:
```
git clone --recurse-submodules https://github.com/cozek/trac2020_submission.git
```
- Other libraries:
  - PyTorch
  - Transformers
  - Pandas
  - Numpy
  - Scikit-learn

If the code or paper was helpful, please cite:
```
@InProceedings{das:2020,
author = {{Kaushik Amar} Das and Arup Baruah and {Ferdous Ahmed} Barbhuiya and Kuntal Dey},
booktitle = {{CLEF 2020 Labs and Workshops, Notebook Papers}},
crossref = {pan:2020},
editor = {Linda Cappellato and Carsten Eickhoff and Nicola Ferro and Aur{\'e}lie N{\'e}v{\'e}ol},
month = sep,
publisher = {CEUR-WS.org},
title = {{Ensemble of ELECTRA for Profiling Fake News Spreaders---Notebook for PAN at CLEF 2020}},
url = {},
year = 2020
}
```