https://github.com/pngo1997/learning-to-rank-algorithm
Builds a ranking model to predict the relevance score for query-product pairs in HomeDepot’s product search.
feature-engineering information-retrieval inverted-index learning-to-rank linear-regression mse neural-network point-wise python r-squared ranking-algorithm support-vector-regression text-processing tf-idf xgboost
- Host: GitHub
- URL: https://github.com/pngo1997/learning-to-rank-algorithm
- Owner: pngo1997
- Created: 2024-03-22T19:43:41.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-31T19:36:48.000Z (8 months ago)
- Last Synced: 2025-02-28T14:13:31.190Z (7 months ago)
- Topics: feature-engineering, information-retrieval, inverted-index, learning-to-rank, linear-regression, mse, neural-network, point-wise, python, r-squared, ranking-algorithm, support-vector-regression, text-processing, tf-idf, xgboost
- Language: Jupyter Notebook
- Homepage:
- Size: 19.7 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# 🛍️ HomeDepot Product Search Relevance Prediction
## 📜 Overview
This project builds a **ranking model** to predict the **relevance score** for **query-product pairs** in HomeDepot’s product search. Using **Learning to Rank (LTR)**, we apply a **Pointwise Approach** to train a regression model based on **text similarity features** between the user query and product information.

📌 **Note**: The product description dataset is very large. Please contact me if you want to use it.

📌 **Dataset**:
- **Train Set (`train_new.csv`)** – Query-product pairs with **ground-truth relevance scores**.
- **Test Set (`test_new.csv`)** – Query-product pairs for **prediction**.
- **Product Descriptions (`product_descriptions_new.csv`)** – Additional product details.
- **Product Attributes (`attributes_new.csv`)** – Additional structured product attributes.

📌 **Goal**:
1. Compute **text similarity** between `search_term` and:
   - `product_title`
   - `product_description`
   - `product_attributes`
2. Generate **feature vectors** for training and testing.
3. Train a **machine learning model** to predict **relevance scores**.
4. Evaluate performance using **Mean Squared Error (MSE) & R² score**.

📌 **Programming Language**: `Python 3`

📌 **Libraries Used**: `pandas`, `scikit-learn`, `nltk`, `numpy`, `scipy`, `XGBoost`

## 🚀 Approach
### **1️⃣ Data Preprocessing**
- **Text Cleaning** (e.g., spelling correction, numerical normalization).
- **Tokenization & Stopword Removal** using `NLTK`.
- **TF-IDF Vectorization** for product details.

### **2️⃣ Feature Engineering**
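As a rough illustration of the features listed in this section, the sketch below computes a TF-IDF cosine similarity plus Jaccard, Dice, and overlap coefficients for one text field. It is not the notebook's code: the helper names (`tokenize`, `set_similarities`, `tfidf_cosine`) and the tiny stopword list are assumptions made for the example, whereas the project itself uses NLTK for tokenization and stopword removal.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative stopword list (assumption); the project uses NLTK's list.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def tokenize(text: str) -> list[str]:
    """Lowercase, keep alphanumeric runs, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def set_similarities(query: str, field: str) -> dict[str, float]:
    """Jaccard, Dice, and overlap coefficients on the two token sets."""
    q, f = set(tokenize(query)), set(tokenize(field))
    if not q or not f:
        return {"jaccard": 0.0, "dice": 0.0, "overlap": 0.0}
    inter = len(q & f)
    return {
        "jaccard": inter / len(q | f),
        "dice": 2 * inter / (len(q) + len(f)),
        "overlap": inter / min(len(q), len(f)),
    }


def tfidf_cosine(queries: list[str], fields: list[str]) -> list[float]:
    """Row-wise cosine similarity between queries[i] and fields[i]."""
    vectorizer = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)
    vectorizer.fit(queries + fields)  # shared vocabulary for both sides
    q_vecs = vectorizer.transform(queries)
    f_vecs = vectorizer.transform(fields)
    # Rows are L2-normalized by default, so the row-wise dot product is the cosine.
    return np.asarray(q_vecs.multiply(f_vecs).sum(axis=1)).ravel().tolist()
```

For the query `cordless drill` and the title `DeWalt cordless drill with battery`, `set_similarities` gives a Jaccard of 0.5, a Dice coefficient of 2/3, and an overlap of 1.0. Applying both functions to the title, description, and attributes fields would give twelve features per pair, which satisfies the six-feature minimum.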
- Compute **Cosine Similarity** between `search_term` and:
  - `product_title`
  - `product_description`
  - `product_attributes`
- Compute **additional similarity measures** (e.g., **Jaccard, Dice Coefficient, Overlap**).
- Produce at least **6 similarity features** for each query-product pair.

### **3️⃣ Model Training & Evaluation**
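The pointwise setup in this section treats each query-product pair as an independent regression example. The sketch below shows one way to compare candidate regressors on a held-out split with scikit-learn. The feature matrix and target are synthetic stand-ins (an assumption, not the project's data), the hyperparameters are arbitrary, and only two of the four model families are shown. `xgboost.XGBRegressor` and `sklearn.neural_network.MLPRegressor` expose the same `fit`/`predict` interface and could be added to the `models` dictionary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the real data (assumption): 500 query-product pairs,
# 6 similarity features in [0, 1], and a noisy relevance-like target.
rng = np.random.default_rng(0)
X = rng.random((500, 6))
y = 1.0 + 2.0 * X.mean(axis=1) + rng.normal(0.0, 0.1, size=500)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "linear_regression": LinearRegression(),
    "svr": SVR(kernel="rbf", C=1.0, epsilon=0.1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # pointwise: one regression target per pair
    pred = model.predict(X_val)
    results[name] = {
        "mse": mean_squared_error(y_val, pred),  # mean of squared residuals
        "r2": r2_score(y_val, pred),  # 1 - residual SS / total SS
    }

for name, scores in results.items():
    print(f"{name}: MSE={scores['mse']:.4f}  R2={scores['r2']:.3f}")
```

Lower MSE and higher R² indicate a better fit. Scoring on a held-out split rather than on the training rows keeps the comparison between models honest.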
- Train models using **Supervised Learning Algorithms**:
  - **Linear Regression**
  - **Support Vector Regressor (SVR)**
  - **XGBoost Regressor**
  - **Neural Networks**
- Evaluate model performance using:
  - **Mean Squared Error (MSE)**
  - **R² Score**

### **4️⃣ Predictions on Test Data**
- Generate **predicted relevance scores** for `test_new.csv`.
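
To illustrate this last step, the sketch below fits a model on a training frame and attaches predictions to a test frame. The data is synthetic, and the column names (`id`, `relevance`, `sim_0` through `sim_5`) and the output path are hypothetical; the README does not specify the actual schema of `test_new.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = [f"sim_{i}" for i in range(6)]  # hypothetical feature column names

# Synthetic stand-ins (assumption) for the engineered train and test frames.
rng = np.random.default_rng(1)
train = pd.DataFrame(rng.random((200, 6)), columns=FEATURES)
train["relevance"] = 1.0 + 2.0 * train[FEATURES].mean(axis=1)
test = pd.DataFrame(rng.random((50, 6)), columns=FEATURES)
test.insert(0, "id", np.arange(1, 51))  # hypothetical row identifier

# Fit on all labeled pairs, then score the unlabeled test pairs.
model = LinearRegression().fit(train[FEATURES], train["relevance"])

predictions = test[["id"]].copy()
predictions["relevance"] = model.predict(test[FEATURES])

# predictions.to_csv("predictions.csv", index=False)  # hypothetical output path
print(predictions.head())
```

The test frame must carry the same feature columns, in the same order, as the frame used for fitting. In practice this means running the test set through the same preprocessing and feature-engineering code as the training set.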