https://github.com/s1998/progressivetraincodeswitch
Code for the Findings of EMNLP 2022 paper "Progressive Sentiment Analysis for Code-Switched Text Data"
- Host: GitHub
- URL: https://github.com/s1998/progressivetraincodeswitch
- Owner: s1998
- Created: 2022-05-25T05:29:52.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2022-12-06T08:10:25.000Z (over 2 years ago)
- Last Synced: 2025-02-01T18:27:25.455Z (4 months ago)
- Language: Python
- Homepage:
- Size: 5.21 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
## Progressive Sentiment Analysis for Code-Switched Text Data
- [Model](#model)
- [Training](#training)
- [Required Inputs](#required-inputs)
- [Commands](#commands)
- [Requirements](#requirements)
- [Citation](#citation)

### Model
Most of the experiments have been carried out using ```bert-base-multilingual-cased``` as the backbone model.
The framework is implemented in ``` models/ds_model.py ```.
### Required inputs
External English data for pretraining should be placed in the ```data/english_data``` file.
For code-switched datasets, create a folder with the dataset name and create ```train.txt``` and ```validation.txt```.
For example, for the ```sail``` dataset used in ```GLUECoS```, files should be ```data/sail/train.txt``` and ```data/sail/validation.txt```.
Each row should contain the text and its label (i.e. positive or negative). Words in Hindi/Tamil should be transliterated to Devanagari script.
You can obtain the Hindi-English (sail) or Spanish-English (enes) dataset from [here](https://github.com/microsoft/GLUECoS) and put it in the data folder. The Tamil-English (taen) dataset can be downloaded from [here](https://dravidian-codemix.github.io/2020/datasets.html).
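As a quick sanity check before training, the split files can be read back and the label distribution inspected. The sketch below is not part of the repository: the helper names `load_split` and `label_counts` are hypothetical, and it assumes each row is the text followed by a tab and then the label, which this README does not specify. Adjust `sep` if your files use a different delimiter.

```python
from collections import Counter
from pathlib import Path
from typing import List, Tuple


def load_split(path: Path, sep: str = "\t") -> List[Tuple[str, str]]:
    """Read one split file into (text, label) pairs.

    ASSUMPTION: each row is "<text><sep><label>" with the label last.
    The separator is a guess (tab); it is not stated in the README.
    """
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            # Split on the LAST separator so the text itself may contain it.
            text, found, label = line.rpartition(sep)
            if not found:
                raise ValueError(f"{path}:{line_no}: no separator found")
            rows.append((text.strip(), label.strip()))
    return rows


def label_counts(rows: List[Tuple[str, str]]) -> Counter:
    """Count how many rows carry each label."""
    return Counter(label for _, label in rows)
```

For example, `label_counts(load_split(Path("data/sail/train.txt")))` would show how many rows carry each label in the training split, which is a cheap way to catch a wrong delimiter or a mislabeled column.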
### Commands
```main_ds.py``` accepts the following arguments:
- ```arg_data``` to denote the dataset, currently takes ```sail``` , ```enes``` or ```taen``` as input.
- ```external_data_imbalance_fix``` to deal with imbalance in the source dataset used for pretraining.
- ```seed``` to fix the random seed across experiments.
- ```zsl_ds_us_data_merged_multiple_m_half_data``` , ```zsl_ds_us_data_merged_multiple_m_half_data_many_runs``` or ```supervised``` to do a single run, multiple runs, or a supervised run, respectively.

Example commands to run:
```
python main_ds.py --external_data_imbalance_fix upsample --seed 22 --zsl_ds_us_data_merged_multiple_m_half_data_many_runs --arg_data sail > logs/sail_half_data_hrd_lbl_merged_bkts_ds_us_run22 &
python main_ds.py --external_data_imbalance_fix upsample --seed 22 --zsl_ds_us_data_merged_multiple_m_half_data_many_runs --arg_data taen > logs/taen_half_data_hrd_lbl_merged_bkts_ds_us_run22 &
python main_ds.py --external_data_imbalance_fix upsample --seed 22 --zsl_ds_us_data_merged_multiple_m_half_data_many_runs --arg_data enes > logs/enes_half_data_hrd_lbl_merged_bkts_ds_us_run22 &
```

### Requirements
This project is based on ```python==3.6.10```. The dependencies are as follows:
```
torch==1.9.1
argparse
transformers==3.5.1
nltk==3.5
sklearn
ai4bharat==0.5.0.3
```

### Citation
```
@misc{https://doi.org/10.48550/arxiv.2210.14380,
doi = {10.48550/ARXIV.2210.14380},
url = {https://arxiv.org/abs/2210.14380},
author = {Ranjan, Sudhanshu and Mekala, Dheeraj and Shang, Jingbo},
keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {Progressive Sentiment Analysis for Code-Switched Text Data},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
```