Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/atinyshrimp/tripadvisor-recommendation-ml-nlp
Machine Learning and NLP models for improving text-based recommendations on TripAdvisor, using BM25, TF-IDF, embeddings, and a Hybrid approach.
bm25 data-science embeddings kaggle-dataset machine-learning nlp nlp-machine-learning python recommandation-system sentence-embeddings sentence-transformers text-similarity tripadvisor
- Host: GitHub
- URL: https://github.com/atinyshrimp/tripadvisor-recommendation-ml-nlp
- Owner: atinyshrimp
- Created: 2024-11-19T09:21:15.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2024-11-19T10:09:44.000Z (about 1 month ago)
- Last Synced: 2024-11-19T10:31:18.337Z (about 1 month ago)
- Topics: bm25, data-science, embeddings, kaggle-dataset, machine-learning, nlp, nlp-machine-learning, python, recommandation-system, sentence-embeddings, sentence-transformers, text-similarity, tripadvisor
- Language: Jupyter Notebook
- Homepage: https://nbviewer.org/github/atinyshrimp/TripAdvisor-Recommendation-ML-NLP/blob/main/notebooks/TripAdvisor_Recommendation_Challenge_DENSON_LAPILUS_DIA2.ipynb
- Size: 489 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# TripAdvisor Recommendation Challenge - Beating BM25
## Overview
This project was developed as part of the ESILV A5 Machine Learning for NLP course. The objective is to improve recommendation quality for places on TripAdvisor based on review text, surpassing the BM25 baseline using advanced Natural Language Processing techniques.
## Authors
- **Sarujan Denson**
- **Joyce Lapilus**
## Problem Statement
The challenge is to recommend places on TripAdvisor that are most similar based on review content. The task involves using textual similarity methods to predict ratings and rank places effectively. Our aim is to explore both lexical and semantic models, evaluate their performance, and achieve better accuracy than the BM25 baseline.
## Models Explored
1. **BM25 Model**: A traditional lexical retrieval model that scores documents by term frequency and inverse document frequency, with document-length normalization.
2. **TF-IDF Model**: A custom implementation that uses term weighting to calculate similarity.
3. **Embedding-Based Model**: Utilizes SentenceTransformer to obtain dense semantic embeddings for review content.
4. **Hybrid Model**: A combination of TF-IDF for initial candidate retrieval, followed by embedding-based re-ranking for refined results.
## Implementation Details
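For reference, the BM25 baseline listed among the models can be sketched in a few lines of plain Python. This is a minimal illustration of the standard Okapi BM25 formula, not the notebook's code: the function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the toy reviews are all assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Hypothetical helper: k1, b and the IDF smoothing are common defaults,
    not values taken from the project.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Longer-than-average documents are penalized through this term.
        length_norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Toy reviews (invented for illustration), already lowercased and tokenized.
reviews = [
    "clean room friendly staff great location".split(),
    "noisy room rude staff".split(),
    "great pool great breakfast".split(),
]
scores = bm25_scores("great location".split(), reviews)
```

The first review matches both query terms and scores highest, while the second matches neither and scores zero.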
- **Embedding-Based Model**: Uses `SentenceTransformer` (all-MiniLM-L6-v2) to encode reviews as 384-dimensional dense vectors and compares them with cosine similarity.
- **Hybrid Model**: Combines TF-IDF and embedding similarity to balance lexical and semantic matching.
## Evaluation Metrics
- **Mean Squared Error (MSE)**: Measures the difference between predicted and actual ratings. A lower MSE indicates a more accurate model.
- **Normalized Discounted Cumulative Gain (NDCG)**: Measures ranking quality. A score closer to 1 means the predicted ranking is closer to the ideal ordering given by the ground truth.
## Results
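To make the two evaluation metrics concrete, here is a minimal sketch of how MSE and NDCG can be computed. It assumes linear gain and a log2 rank discount for NDCG, which is one common definition; the README does not say which variant the notebook uses, and the function names are hypothetical.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted ratings."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def dcg(relevances):
    """Discounted cumulative gain: each item is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances_in_predicted_order):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    return dcg(relevances_in_predicted_order) / ideal if ideal > 0 else 0.0

# Toy values, invented for illustration.
mse = mean_squared_error([4, 5, 3], [4, 4, 5])   # (0 + 1 + 4) / 3
perfect = ndcg([3, 2, 1])                        # already in ideal order
reversed_order = ndcg([1, 2, 3])                 # worst order for these items
```

A ranking that already matches the ideal order yields exactly 1, and any other order of the same items yields a lower value.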
- **MSE Performance**:
  - Hybrid Model: 22.13% (Best Performance)
  - Embedding-Based Model: 29.76%
  - BM25 Model: 29.91%
  - TF-IDF Model: 31.34%
- **NDCG Performance**:
  - Hybrid Model: 0.9949 (Highest Score)
  - BM25 Model: 0.9933
  - TF-IDF Model: 0.9933
  - Embedding-Based Model: 0.9917

The Hybrid Model outperformed all others, achieving the lowest MSE and the highest NDCG score, demonstrating the effectiveness of combining lexical and semantic similarity approaches.
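The two-stage idea behind the Hybrid Model (TF-IDF candidate retrieval followed by embedding-based re-ranking) can be sketched as below. This is an illustrative outline under stated assumptions, not the notebook's implementation: `hybrid_rank`, `tfidf_vectors`, the smoothed IDF, and `n_candidates` are hypothetical, and `toy_embed` is a deliberately crude stand-in for a real sentence encoder so that the example runs without any downloads.

```python
import math
import zlib
from collections import Counter

def sparse_cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def dense_cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def tfidf_vectors(token_lists):
    """TF-IDF vectors with a smoothed IDF (an assumed, common variant)."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for tokens in token_lists]

def hybrid_rank(query_idx, texts, embed, n_candidates=3):
    """Stage 1: TF-IDF shortlist. Stage 2: re-rank the shortlist with `embed`."""
    vectors = tfidf_vectors([t.lower().split() for t in texts])
    others = [i for i in range(len(texts)) if i != query_idx]
    shortlist = sorted(
        others,
        key=lambda i: sparse_cosine(vectors[query_idx], vectors[i]),
        reverse=True,
    )[:n_candidates]
    query_vec = embed(texts[query_idx])
    return sorted(
        shortlist,
        key=lambda i: dense_cosine(query_vec, embed(texts[i])),
        reverse=True,
    )

def toy_embed(text, dim=64):
    """Hypothetical stand-in encoder: hashed character-trigram counts."""
    vec = [0.0] * dim
    s = text.lower()
    for i in range(len(s) - 2):
        vec[zlib.crc32(s[i:i + 3].encode("utf-8")) % dim] += 1.0
    return vec

# Toy review texts, invented for illustration.
texts = [
    "Quiet room with a lovely view of the beach",
    "Lovely beach view and a quiet, spacious room",
    "The airport shuttle was late and the staff were rude",
    "Breakfast buffet was excellent",
]
ranking = hybrid_rank(0, texts, toy_embed, n_candidates=2)
```

With the `sentence-transformers` package installed, `embed` could instead be the `encode` method of `SentenceTransformer("all-MiniLM-L6-v2")`, the model named in this README. Keeping the encoder as a parameter means the cheap lexical stage limits how many texts the more expensive model has to compare.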
## File Structure
- `report/ESILV_A5_ML_for_NLP_Project_1_Report.pdf`: The detailed project report.
- `notebooks/recommendation_model.ipynb`: The Jupyter notebook containing code for data processing, model training, and evaluation.
## How to Run
1. Clone the repository:
```bash
git clone https://github.com/atinyshrimp/tripadvisor-recommendation-ml-nlp.git
cd tripadvisor-recommendation-ml-nlp
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Open the Jupyter notebook:
```bash
jupyter notebook notebooks/recommendation_model.ipynb
```
4. Download NLTK data packages (stopwords, punkt, and wordnet):
```bash
python -m nltk.downloader punkt stopwords wordnet
```
5. Follow the steps in the notebook to train and evaluate the models.
## Dataset Source
The dataset used in this project is publicly available on Kaggle:
- **Dataset**: [TripAdvisor Hotel Reviews](https://www.kaggle.com/datasets/joebeachcapital/hotel-reviews)
- **Author(s)/Contributor(s)**: [Joakim Arvidsson](https://www.kaggle.com/joebeachcapital)