https://github.com/ohmatheus/kaggle_llmclassificationfinetuning
NLP + Finetuning + Feature Engineering
- Host: GitHub
- URL: https://github.com/ohmatheus/kaggle_llmclassificationfinetuning
- Owner: ohmatheus
- Created: 2024-11-28T13:39:40.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-12-20T09:22:48.000Z (5 months ago)
- Last Synced: 2024-12-20T09:31:51.755Z (5 months ago)
- Topics: deberta-v3, feature-engineering, nlp, robe, text-classification
- Language: Jupyter Notebook
- Homepage:
- Size: 1.42 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Kaggle_LLMClassificationFinetuning
## Overview
This repo is the humble result of my work on a Kaggle competition:
https://www.kaggle.com/competitions/llm-classification-finetuning/overview

The idea is to predict which response users will prefer in a head-to-head battle between chatbots powered by LLMs. The dataset is composed of a prompt and two responses coming from two different LLMs in the Chatbot Arena.
The only data accessible in the test set are [prompt], [response_a] and [response_b].
It is a multiclass classification task, evaluated on the log loss of the predicted probabilities for each class.
I did this mainly to improve my knowledge of NLP and LLM finetuning.
## Evolution
I started with a notebook provided by Kaggle, working with TensorFlow and WSL. I had *many* issues with that combination (tensor incompatibilities between TensorFlow and Transformers, for instance), so I quickly recreated the notebook using PyTorch, which worked like a charm.

First I tried a solution using RoBERTa and a siamese network, tokenizing the prompt paired with each response separately. It achieved a modest result, but good enough to start with.
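A minimal sketch of that pairing step with the Hugging Face tokenizer; the model name and maximum length here are illustrative placeholders, not necessarily the exact values used in the notebooks:

```python
from transformers import AutoTokenizer

# Pair the prompt with each response separately, so the siamese encoder
# sees two (prompt, response) sequences per example.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def encode_pair(prompt: str, response: str, max_length: int = 256):
    # text/text_pair tokenization inserts the separator token between them
    return tokenizer(
        prompt,
        response,
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors="pt",
    )

enc_a = encode_pair("Explain transformers.", "Transformers are models that ...")
enc_b = encode_pair("Explain transformers.", "A transformer is an architecture ...")
```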
Then I played a bit with some basic feature engineering (length, similarity, keyword overlap and lexical diversity). This improved my results a little.
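Illustrative versions of those feature families are sketched below; the exact definitions used in the repo may differ.

```python
# Hand-crafted features over (prompt, response_a, response_b).
def length_ratio(resp_a: str, resp_b: str) -> float:
    return len(resp_a) / max(len(resp_b), 1)

def jaccard_similarity(resp_a: str, resp_b: str) -> float:
    a, b = set(resp_a.lower().split()), set(resp_b.lower().split())
    return len(a & b) / max(len(a | b), 1)

def keyword_overlap(prompt: str, response: str) -> float:
    # share of prompt tokens that reappear in the response
    p, r = set(prompt.lower().split()), set(response.lower().split())
    return len(p & r) / max(len(p), 1)

def lexical_diversity(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / max(len(tokens), 1)

def make_features(prompt: str, resp_a: str, resp_b: str) -> list[float]:
    return [
        length_ratio(resp_a, resp_b),
        jaccard_similarity(resp_a, resp_b),
        keyword_overlap(prompt, resp_a),
        keyword_overlap(prompt, resp_b),
        lexical_diversity(resp_a),
        lexical_diversity(resp_b),
    ]
```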
To incorporate those features, I created a model using RoBERTa: I took both embeddings from it, concatenated them with a vector containing all my features, and added a classification head on top. Then I switched to mDeBERTa ("microsoft/mdeberta-v3-base" on Hugging Face), a multilingual DeBERTa variant, which helped improve results as well.
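A rough sketch of that architecture: a shared encoder produces one embedding for (prompt, response_a) and one for (prompt, response_b), both are concatenated with the hand-crafted feature vector, and a small head outputs the three class logits. The hidden sizes, pooling choice and feature count are assumptions for illustration, not the repo's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class PairwisePreferenceModel(nn.Module):
    """Sketch: shared encoder, two pooled embeddings, hand-crafted features,
    and a 3-way classification head over [a preferred, b preferred, tie]."""
    def __init__(self, model_name: str = "microsoft/mdeberta-v3-base",
                 n_features: int = 6, n_classes: int = 3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(2 * hidden + n_features, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, n_classes),
        )

    def _embed(self, enc):
        out = self.encoder(**enc)
        # mean-pool token embeddings over the attention mask
        mask = enc["attention_mask"].unsqueeze(-1).float()
        return (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

    def forward(self, enc_a, enc_b, features):
        emb_a = self._embed(enc_a)   # embedding of (prompt, response_a)
        emb_b = self._embed(enc_b)   # embedding of (prompt, response_b)
        x = torch.cat([emb_a, emb_b, features], dim=-1)
        return self.head(x)          # raw logits, softmax applied at loss time
```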
Finally, a good enhancement came from adding a warm-up/decay learning-rate scheduler (originally present in the TensorFlow starter notebook), together with different starting learning rates for the finetuned encoder and the classification layer. This drastically improved my results. I did not take the time to search for optimal hyperparameters, because I had already spent enough time on this project and wanted to start something else, so there are still possible improvements to be made here.
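A sketch of that setup using AdamW parameter groups and a warm-up/decay schedule from `transformers`; the learning rates, warm-up fraction and step counts are illustrative placeholders, not the values used in training.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = PairwisePreferenceModel()        # from the model sketch above
steps_per_epoch, num_epochs = 1000, 3    # placeholder values

# Two parameter groups: a small LR for the pretrained encoder being
# finetuned, a larger one for the freshly initialised classification head.
optimizer = torch.optim.AdamW(
    [
        {"params": model.encoder.parameters(), "lr": 1e-5},
        {"params": model.head.parameters(), "lr": 1e-3},
    ],
    weight_decay=0.01,
)

num_training_steps = steps_per_epoch * num_epochs
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # linear warm-up, then cosine decay
    num_training_steps=num_training_steps,
)
# call scheduler.step() after every optimizer.step() in the training loop
```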
## Results
The competition is scored using the log loss between the predicted probabilities and the test-set labels, with probabilities shared between [response_a preferred], [response_b preferred] and [tie].
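For reference, the metric can be checked offline with scikit-learn; the labels and probabilities below are made-up toy values.

```python
import numpy as np
from sklearn.metrics import log_loss

# 3-way probabilities over [a preferred, b preferred, tie]
y_true = [0, 2, 1]                     # integer class labels
y_pred = np.array([
    [0.70, 0.20, 0.10],
    [0.25, 0.25, 0.50],
    [0.10, 0.60, 0.30],
])
print(log_loss(y_true, y_pred, labels=[0, 1, 2]))  # lower is better
```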
I scored a loss of 1.19, while the best of the leaderboard is close to 0.83, which is 'ok' but not a particularly good result. But there is plenty of room to improve, and I now have a good backbone to start another interesting competition based on almost the same setup.
## Possible improvements
- Create a pipeline with less data so I can test different ideas/feature engineering/models, iterate faster and compare more strategies.
- Better feature engineering: I already have better ideas on how to handle similarity.
- Try bigger and better models: I saw very good results from people using Gemma2, and I recently learned about the existence of a multilingual Gemma2 (https://huggingface.co/BAAI/bge-multilingual-gemma2) that I would like to test.
- Grid search to optimize hyperparameters.
- Get the most out of Kaggle's T4 x2 GPU accelerator by using multi-GPU training.
- Increase the sequence length, currently at 256, which is not ideal.
- Change the model to create only one embedding containing the prompt, response_a and response_b (see the sketch after this list). Currently it uses too much memory by storing the prompt twice, and I am stuck with a poor sequence length (256).
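A possible sketch of that last idea: encode the prompt and both responses in a single sequence so the prompt is only stored once, leaving more of a longer sequence budget for the responses. The separator scheme and maximum length are assumptions for illustration.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")

def encode_triplet(prompt: str, resp_a: str, resp_b: str, max_length: int = 512):
    # One sequence holding all three texts, so the prompt is not duplicated.
    # Plain-text markers are used here; dedicated special tokens would also work.
    text = f"prompt: {prompt} response_a: {resp_a} response_b: {resp_b}"
    return tokenizer(
        text,
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors="pt",
    )
```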
## Next Step
Use this work as a baseline for another similar (timed) competition, [WSDM Cup - Multilingual Chatbot Arena](https://www.kaggle.com/competitions/wsdm-cup-multilingual-chatbot-arena/overview), which is almost the same but is only a binary classification (no tie) and asks for more support for multilingual prompts.

## Links
Training on Kaggle: https://www.kaggle.com/code/ohmatheus/llm-classification-supervisedlearning
Predict on Kaggle: https://www.kaggle.com/code/ohmatheus/llm-classification-predict