https://github.com/kavayk29/quora-duplicate-question-pair

This project improves information retrieval by detecting duplicate question pairs in the Quora dataset using data exploration, text preprocessing, feature engineering, and models like Random Forest and LSTM, aiming to streamline question-answering.

beautifulsoup4 bilstm gensim keras lstm matplotlib numpy pandas pytorch random-forest seaborn sklearn tensorflow xgboost

Project Overview
This project aims to detect duplicate question pairs in the Quora dataset. By identifying similar questions, the system can help streamline the question-answering process and improve the efficiency of information retrieval on the platform.

Key Features:
Data Exploration: Load and explore the Quora dataset to understand its structure and characteristics.
Text Preprocessing: Implement various techniques to clean and preprocess the text data, including the removal of HTML tags and special characters.
Feature Engineering: Extract meaningful features from the text data to improve the model’s ability to detect duplicate questions.
Modeling: Apply machine learning models such as Random Forest and XGBoost, as well as deep learning models like LSTM and BiLSTM, to predict duplicate question pairs.
Evaluation: Assess model performance using metrics like accuracy, precision, and recall.
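
The preprocessing and feature engineering steps above can be sketched as follows. This is a minimal, dependency-free illustration and not the repository's actual code: it swaps in the standard library's `html` and `re` modules for BeautifulSoup, and the helper names (`clean_text`, `pair_features`) and the three features chosen are assumptions made for the example.

```python
import html
import re

# Hypothetical patterns for this sketch; the project itself uses BeautifulSoup.
TAG_RE = re.compile(r"<[^>]+>")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_text(text: str) -> str:
    """Decode HTML entities, strip tags, lowercase, and drop special characters."""
    text = html.unescape(text)                  # "&amp;" becomes "&"
    text = TAG_RE.sub(" ", text)                # remove anything shaped like a tag
    text = NON_ALNUM_RE.sub(" ", text.lower())  # keep only letters, digits, whitespace
    return " ".join(text.split())               # collapse repeated whitespace


def pair_features(q1: str, q2: str) -> dict:
    """Compute simple token-overlap features for one question pair."""
    t1, t2 = clean_text(q1).split(), clean_text(q2).split()
    s1, s2 = set(t1), set(t2)
    common, union = s1 & s2, s1 | s2
    return {
        "len_diff": abs(len(t1) - len(t2)),   # difference in token counts
        "common_words": len(common),          # distinct shared tokens
        "jaccard": len(common) / len(union) if union else 0.0,
    }
```

Calling `pair_features("How do I learn Python?", "How can I learn Python?")` yields one feature dictionary per pair; dictionaries like this could then be collected into a table and passed to a classifier such as Random Forest or XGBoost.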

Technologies Used:
Python: Core language for data processing and modeling.
Pandas: For handling and manipulating data structures.
NumPy: For numerical operations and array management.
Seaborn & Matplotlib: For data visualization and analysis.
BeautifulSoup: For stripping HTML tags during text cleaning and preprocessing.

How to Use:
Load the Dataset: Begin by loading the Quora dataset using the provided code.
Preprocess the Data: Clean and prepare the text data for modeling.
Train the Models: Utilize the provided scripts to train and evaluate different models on the dataset.
Analyze Results: Review the model performance metrics and visualizations to understand the results.
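
To make the final step concrete, here is a small sketch of how accuracy, precision, and recall are computed for a binary duplicate/non-duplicate task. It is written in plain Python for illustration only; the function name `binary_metrics` is hypothetical, the labels are made-up toy values, and it assumes the label 1 marks a duplicate pair. The project's own scripts may compute these metrics differently.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall, treating label 1 as 'duplicate'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # duplicates correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-duplicates wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # duplicates that were missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-duplicates correctly passed
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels, invented for this example (not results from the project).
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
toy_scores = binary_metrics(toy_true, toy_pred)
```

Precision indicates how many flagged pairs are real duplicates, while recall indicates how many real duplicates were caught; comparing the two across models shows which kind of error each model tends to make.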

Conclusion:
This project covers the main stages of detecting duplicate questions on Quora: data preprocessing, feature engineering, and modeling with both classical machine learning and deep learning approaches. Together, these steps aim to improve information retrieval on the platform.