
# Top_3%-Kaggle-NLP-with-Disaster-Tweets
Code for the "Natural Language Processing with Disaster Tweets" competition on Kaggle.

Objective: Predict whether a particular tweet is about a disaster or not, using existing open-source LLMs.

You can learn from this code how to:

1. Preprocess text data.
2. Load a pretrained open-source LLM, and create a tokenizer and data collator.
3. Tokenize the data and prepare datasets for the model.
4. Train the model with a correctly configured optimizer and an early-stopping callback.
5. Evaluate the model (a minimal end-to-end sketch is shown below).
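
Below is a minimal, hedged sketch of that workflow using the Hugging Face `transformers` and `datasets` libraries. It is not the exact competition code: the checkpoint name (`roberta-base`), the cleaning regexes, the column handling and the hyperparameters are illustrative assumptions.

```python
import re

import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "roberta-base"  # assumed checkpoint; bert/distilbert variants plug in the same way


def preprocess(text: str) -> str:
    """Tiny example of character-level cleaning: strip URLs and collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\s+", " ", text).strip()


# 1. Preprocess the text data (Kaggle's train.csv has `text` and `target` columns).
df = pd.read_csv("train.csv")
df["text"] = df["text"].apply(preprocess)
dataset = Dataset.from_pandas(
    df[["text", "target"]].rename(columns={"target": "labels"}), preserve_index=False
)
splits = dataset.train_test_split(test_size=0.2, seed=42)

# 2. Load the pretrained model, and create the tokenizer and data collator.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
collator = DataCollatorWithPadding(tokenizer=tokenizer)

# 3. Tokenize the data and prepare datasets for the model.
tokenized = splits.map(
    lambda batch: tokenizer(batch["text"], truncation=True),
    batched=True,
    remove_columns=["text"],
)

# 4. Train with the chosen learning rate and an early-stopping callback.
args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-05,             # best value from the sweep described in the Summary
    num_train_epochs=5,
    per_device_train_batch_size=16,  # assumed batch size
    eval_strategy="epoch",           # `evaluation_strategy` on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()

# 5. Evaluate the best checkpoint on the held-out split.
print(trainer.evaluate())
```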

# Summary
1. Models: In this code I tested 3 models (distilbert_uncased, bert_uncased and roberta). Roberta was the clear winner.
2. Preprocessing: I got the best results using the unhashed (uncommented) preprocessing functions.
3. Learning_rate: I looped through the list [1e-06, 2e-06, ..., 9e-05] and 2e-05 was the clear winner.
4. Best public score for a single model: 0.84155 - generated by the code above.
5. Best public score for the ensemble: 0.844 - I averaged the predictions of all roberta models with different learning
rates trained on unpreprocessed data, plus a few of the best ones trained on preprocessed data. On top of that I added a few
distilbert_uncased and bert_uncased models (27 models in total). It is not very computationally efficient, so I didn't include it here; the averaging step is sketched below.
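
A minimal sketch of the averaging step in point 5, assuming each trained model has already written its predicted probability of the positive class to its own CSV file (the `predictions/*.csv` layout and the `id`/`prob` column names are hypothetical):

```python
from pathlib import Path

import pandas as pd

prediction_files = sorted(Path("predictions").glob("*.csv"))  # one file per trained model

# Average the positive-class probabilities across all models.
probs = pd.concat(
    [pd.read_csv(f).set_index("id")["prob"] for f in prediction_files], axis=1
).mean(axis=1)

# Threshold at 0.5 to get the final 0/1 disaster labels for submission.
submission = pd.DataFrame({"id": probs.index, "target": (probs >= 0.5).astype(int)})
submission.to_csv("submission.csv", index=False)
```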

Potential tasks for optimizing predictions:
1. Polishing the character preprocessing functions.
2. Experimenting with various random_states and learning rates, not only for roberta but for bert and distilbert as well.
3. Adding a few other 'bert'-family models to the ensemble, e.g. bert-large (not enough RAM) or a different roberta variant.
4. Reducing the number of models in the ensemble, leaving only a few with different and strong biases, to improve the
ensemble's generalization.