https://github.com/ohmatheus/kaggle_llmclassificationfinetuning
NLP + Finetuning + Feature Engineering
- Host: GitHub
- URL: https://github.com/ohmatheus/kaggle_llmclassificationfinetuning
- Owner: ohmatheus
- Created: 2024-11-28T13:39:40.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-12-20T09:22:48.000Z (5 months ago)
- Last Synced: 2024-12-20T09:31:51.755Z (5 months ago)
- Topics: deberta-v3, feature-engineering, nlp, robe, text-classification
- Language: Jupyter Notebook
- Homepage:
- Size: 1.42 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Kaggle_LLMClassificationFinetuning
## Overview
This repo is the humble result of my work on a Kaggle competition:
https://www.kaggle.com/competitions/llm-classification-finetuning/overview

The idea is to predict which response users will prefer in a head-to-head battle between chatbots powered by LLMs. The dataset is composed of a prompt and two responses coming from two different LLMs in the Chatbot Arena.
The only data accessible in the test set are [prompt], [response_a] and [response_b].
It is a multiclass classification task, evaluated on the log loss of the predicted probabilities for each class.
I did this mainly to improve my knowledge of NLP and LLM finetuning.
## Evolution
I started with a notebook provided by Kaggle, working with TensorFlow and WSL. I had *many* issues with that combination (tensor incompatibilities between TensorFlow and Transformers, for instance), so I quickly recreated the notebook using PyTorch, which worked like a charm.

First I tried a solution using RoBERTa and a siamese network, tokenizing the prompt paired with each response separately. It achieved a modest result, but good enough to start with.
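A minimal sketch of that pairing step with the Hugging Face tokenizer; the model name and maximum length here are illustrative placeholders, not necessarily the exact values used in the notebooks:

```python
from transformers import AutoTokenizer

# Pair the prompt with each response separately, so the siamese encoder
# sees two (prompt, response) sequences per example.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def encode_pair(prompt: str, response: str, max_length: int = 256):
    # text/text_pair tokenization inserts the separator token between them
    return tokenizer(
        prompt,
        response,
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors="pt",
    )

enc_a = encode_pair("Explain transformers.", "Transformers are models that ...")
enc_b = encode_pair("Explain transformers.", "A transformer is an architecture ...")
```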
Then I played a bit with some basic feature engineering (length, similarity, keyword overlap and lexical diversity). This improved my results a little.
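Illustrative versions of those feature families are sketched below; the exact definitions used in the repo may differ.

```python
# Hand-crafted features over (prompt, response_a, response_b).
def length_ratio(resp_a: str, resp_b: str) -> float:
    return len(resp_a) / max(len(resp_b), 1)

def jaccard_similarity(resp_a: str, resp_b: str) -> float:
    a, b = set(resp_a.lower().split()), set(resp_b.lower().split())
    return len(a & b) / max(len(a | b), 1)

def keyword_overlap(prompt: str, response: str) -> float:
    # share of prompt tokens that reappear in the response
    p, r = set(prompt.lower().split()), set(response.lower().split())
    return len(p & r) / max(len(p), 1)

def lexical_diversity(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / max(len(tokens), 1)

def make_features(prompt: str, resp_a: str, resp_b: str) -> list[float]:
    return [
        length_ratio(resp_a, resp_b),
        jaccard_similarity(resp_a, resp_b),
        keyword_overlap(prompt, resp_a),
        keyword_overlap(prompt, resp_b),
        lexical_diversity(resp_a),
        lexical_diversity(resp_b),
    ]
```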
To incorporate those features, I created a model using RoBERTa: I took both embeddings from it, concatenated them with a vector containing all my features, and added a classification head on top. Then I switched to mDeBERTa ("microsoft/mdeberta-v3-base" on Hugging Face), a multilingual DeBERTa variant, which helped improve results as well.
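A rough sketch of that architecture: a shared encoder produces one embedding for (prompt, response_a) and one for (prompt, response_b), both are concatenated with the hand-crafted feature vector, and a small head outputs the three class logits. The hidden sizes, pooling choice and feature count are assumptions for illustration, not the repo's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class PairwisePreferenceModel(nn.Module):
    """Sketch: shared encoder, two pooled embeddings, hand-crafted features,
    and a 3-way classification head over [a preferred, b preferred, tie]."""
    def __init__(self, model_name: str = "microsoft/mdeberta-v3-base",
                 n_features: int = 6, n_classes: int = 3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(2 * hidden + n_features, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, n_classes),
        )

    def _embed(self, enc):
        out = self.encoder(**enc)
        # mean-pool token embeddings over the attention mask
        mask = enc["attention_mask"].unsqueeze(-1).float()
        return (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

    def forward(self, enc_a, enc_b, features):
        emb_a = self._embed(enc_a)   # embedding of (prompt, response_a)
        emb_b = self._embed(enc_b)   # embedding of (prompt, response_b)
        x = torch.cat([emb_a, emb_b, features], dim=-1)
        return self.head(x)          # raw logits, softmax applied at loss time
```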
Finally, a good enhancement came from adding a warm-up/decay learning-rate scheduler (originally present in the TensorFlow starter notebook), together with different starting learning rates for the finetuned encoder and the classification layer. This drastically improved my results. I did not take the time to search for optimal hyperparameters, because I had already spent enough time on this project and wanted to start something else, so there are still possible improvements to be made here.
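A sketch of that setup using AdamW parameter groups and a warm-up/decay schedule from `transformers`; the learning rates, warm-up fraction and step counts are illustrative placeholders, not the values used in training.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = PairwisePreferenceModel()        # from the model sketch above
steps_per_epoch, num_epochs = 1000, 3    # placeholder values

# Two parameter groups: a small LR for the pretrained encoder being
# finetuned, a larger one for the freshly initialised classification head.
optimizer = torch.optim.AdamW(
    [
        {"params": model.encoder.parameters(), "lr": 1e-5},
        {"params": model.head.parameters(), "lr": 1e-3},
    ],
    weight_decay=0.01,
)

num_training_steps = steps_per_epoch * num_epochs
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # linear warm-up, then cosine decay
    num_training_steps=num_training_steps,
)
# call scheduler.step() after every optimizer.step() in the training loop
```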
## Results
The competition is scored using the log loss between the predicted probabilities and the test-set labels, with probabilities shared between [response_a preferred], [response_b preferred] and [tie].
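For reference, the metric can be checked offline with scikit-learn; the labels and probabilities below are made-up toy values.

```python
import numpy as np
from sklearn.metrics import log_loss

# 3-way probabilities over [a preferred, b preferred, tie]
y_true = [0, 2, 1]                     # integer class labels
y_pred = np.array([
    [0.70, 0.20, 0.10],
    [0.25, 0.25, 0.50],
    [0.10, 0.60, 0.30],
])
print(log_loss(y_true, y_pred, labels=[0, 1, 2]))  # lower is better
```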
I scored a loss of 1.19, while the best of the leaderboard is close to 0.83, which is 'ok' but not a particularly good result. But there is plenty of room to improve, and I now have a good backbone to start another interesting competition based on almost the same setup.
## Possible improvements
- Create a pipeline with less data so I can test different ideas/feature engineering/models, iterate faster and compare more strategies.
- Better feature engineering: I already have better ideas on how to handle similarity.
- Try bigger and better models: I saw very good results from people using Gemma2, and I recently learned about the existence of a multilingual Gemma2 (https://huggingface.co/BAAI/bge-multilingual-gemma2) that I would like to test.
- Grid search to optimize hyperparameters.
- Get the most out of Kaggle's T4 x2 GPU accelerator by using multi-GPU training.
- Increase the sequence length, currently at 256, which is not ideal.
- Change the model to create only one embedding containing the prompt, response_a and response_b (see the sketch after this list). Currently it uses too much memory by storing the prompt twice, and I am stuck with a poor sequence length (256).
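A possible sketch of that last idea: encode the prompt and both responses in a single sequence so the prompt is only stored once, leaving more of a longer sequence budget for the responses. The separator scheme and maximum length are assumptions for illustration.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")

def encode_triplet(prompt: str, resp_a: str, resp_b: str, max_length: int = 512):
    # One sequence holding all three texts, so the prompt is not duplicated.
    # Plain-text markers are used here; dedicated special tokens would also work.
    text = f"prompt: {prompt} response_a: {resp_a} response_b: {resp_b}"
    return tokenizer(
        text,
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors="pt",
    )
```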
## Next Step
Use this work as a baseline for another similar (timed) competition, [WSDM Cup - Multilingual Chatbot Arena](https://www.kaggle.com/competitions/wsdm-cup-multilingual-chatbot-arena/overview), which is almost the same but is only a binary classification (no tie) and asks for more support for multilingual prompts.

## Links
Training on Kaggle: https://www.kaggle.com/code/ohmatheus/llm-classification-supervisedlearning
Predict on Kaggle: https://www.kaggle.com/code/ohmatheus/llm-classification-predict