https://github.com/kavayk29/quora-duplicate-question-pair

This project improves information retrieval by detecting duplicate question pairs in the Quora dataset using data exploration, text preprocessing, feature engineering, and models like Random Forest and LSTM, aiming to streamline question-answering.

Host: GitHub
URL: https://github.com/kavayk29/quora-duplicate-question-pair
Owner: Kavayk29
Created: 2024-08-08T17:19:04.000Z (11 months ago)
Default Branch: main
Last Pushed: 2024-11-27T14:58:40.000Z (8 months ago)
Last Synced: 2025-04-07T14:48:13.386Z (3 months ago)
Topics: beautifulsoup4, bilstm, gensim, keras, lstm, matplotlib, numpy, pandas, pytorch, random-forest, seaborn, sklearn, tensorflow, xgboost
Language: Jupyter Notebook
Homepage:
Size: 28 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

Project Overview
This project aims to detect duplicate question pairs in the Quora dataset. By identifying similar questions, the system can help streamline the question-answering process and improve the efficiency of information retrieval on the platform.

Key Features:
Data Exploration: Load and explore the Quora dataset to understand its structure and characteristics.
Text Preprocessing: Implement various techniques to clean and preprocess the text data, including the removal of HTML tags and special characters.
Feature Engineering: Extract meaningful features from the text data to improve the model’s ability to detect duplicate questions.
Modeling: Apply machine learning models such as Random Forest and XGBoost, as well as deep learning models like LSTM and BiLSTM, to predict duplicate question pairs.
Evaluation: Assess model performance using metrics like accuracy, precision, and recall.
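The preprocessing and feature engineering steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `clean_text` and `pair_features` and the three features are hypothetical, and a simple regular expression stands in for BeautifulSoup so the example runs with the standard library only.

```python
import html
import re

# Crude HTML tag matcher; a stand-in for BeautifulSoup (assumption).
TAG_RE = re.compile(r"<[^>]+>")
# Anything that is not a lowercase letter, digit, or whitespace.
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")


def clean_text(text: str) -> str:
    """Lowercase, strip HTML tags and special characters, collapse whitespace."""
    text = html.unescape(text)          # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)        # drop HTML tags
    text = NON_WORD_RE.sub(" ", text.lower())
    return " ".join(text.split())


def pair_features(q1: str, q2: str) -> dict:
    """Hypothetical hand-crafted features for one question pair."""
    t1, t2 = clean_text(q1).split(), clean_text(q2).split()
    s1, s2 = set(t1), set(t2)
    common = s1 & s2
    union = s1 | s2
    return {
        "len_diff": abs(len(t1) - len(t2)),   # difference in token counts
        "common_words": len(common),          # shared unique tokens
        "jaccard": len(common) / len(union) if union else 0.0,
    }


features = pair_features(
    "<p>How do I learn Python?</p>",
    "How can I learn Python quickly?",
)
print(features)
```

Features like these could be fed to a tree-based model such as Random Forest or XGBoost, while the LSTM and BiLSTM models would more likely consume the cleaned token sequences directly.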

Technologies Used:
Python: Core language for data processing and modeling.
Pandas: For handling and manipulating data structures.
NumPy: For numerical operations and array management.
Seaborn & Matplotlib: For data visualization and analysis.
BeautifulSoup: For removing HTML tags during text cleaning and preprocessing.

How to Use:
Load the Dataset: Begin by loading the Quora dataset using the provided code.
Preprocess the Data: Clean and prepare the text data for modeling.
Train the Models: Use the provided code to train and evaluate the different models on the dataset.
Analyze Results: Review the model performance metrics and visualizations to understand the results.
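For the evaluation step, accuracy, precision, and recall can be computed from predicted and true labels as in the sketch below. The helper `binary_metrics` is hypothetical and the labels are made up for illustration; they are not results from this project. In practice, scikit-learn's `accuracy_score`, `precision_score`, and `recall_score` would give the same quantities.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for 0/1 labels (1 = duplicate pair)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision and recall are worth reporting alongside accuracy here because duplicate and non-duplicate pairs are not necessarily balanced, so accuracy alone can hide poor performance on the duplicate class.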

Conclusion:
This project combines data preprocessing, feature engineering, and both classical machine learning and deep learning models to detect duplicate questions on Quora, with the goal of improving information retrieval on the platform.
