https://github.com/mayankmittal29/duplifinder-quora-clone-catcher
An advanced system for detecting semantically duplicate question pairs using cutting-edge NLP techniques. Combines traditional ML models (XGBoost, SVM, Random Forest) with deep learning architectures (BiLSTM, Siamese Networks, Transformers) and contextual embeddings (BERT, RoBERTa). Features engineered using token similarity, fuzzy matching, and em
bert bilstm cross-validation eda fastext fuzzy-matching glove numpy pandas python3 quora-question-pairs random-forest roberta seaborn stemming svm tf-idf transformers word2vec xgboost
- Host: GitHub
- URL: https://github.com/mayankmittal29/duplifinder-quora-clone-catcher
- Owner: mayankmittal29
- Created: 2025-03-19T13:35:08.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-03-19T14:18:00.000Z (3 months ago)
- Last Synced: 2025-03-19T14:33:56.732Z (3 months ago)
- Topics: bert, bilstm, cross-validation, eda, fastext, fuzzy-matching, glove, numpy, pandas, python3, quora-question-pairs, random-forest, roberta, seaborn, stemming, svm, tf-idf, transformers, word2vec, xgboost
- Language: Jupyter Notebook
- Homepage:
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
# DupliFinder: Quora Question Pairs Challenge



## Problem Statement
Quora is a platform where people ask questions and connect with others who contribute unique insights and quality answers. A key challenge on Quora is quickly identifying duplicate questions, which improves the user experience and helps maintain high-quality content.
This project tackles the Quora Question Pairs challenge from Kaggle, which requires building a machine learning model that predicts whether two questions are semantically identical (duplicates) or not.
## Project Goals
- Develop models to accurately classify question pairs as duplicates or non-duplicates
- Experiment with various text preprocessing techniques
- Compare performance of traditional ML algorithms and deep learning approaches
- Extract and engineer useful features from text data
- Optimize model performance through hyperparameter tuning and cross-validation

## Dataset Description
The dataset consists of over 400,000 question pairs from Quora, each with the following fields:
- **id**: The unique identifier for a question pair
- **qid1, qid2**: Unique identifiers for each question (only in train.csv)
- **question1, question2**: The full text of each question
- **is_duplicate**: The target variable (1 if questions are duplicates, 0 otherwise)
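
To make the schema concrete, here is a minimal sketch of reading rows with these fields using Python's standard `csv` module. The two sample rows are made up for illustration (they are not taken from the dataset), and `load_pairs` is a hypothetical helper name; in practice the same reader would be pointed at an opened `train.csv`.

```python
import csv
import io

# Two made-up rows in the train.csv column layout (illustrative only).
SAMPLE_CSV = '''"id","qid1","qid2","question1","question2","is_duplicate"
"0","1","2","How do I learn Python?","What is the best way to learn Python?","1"
"1","3","4","How do I learn Python?","How do I cook rice?","0"
'''

def load_pairs(handle):
    """Parse question pairs from an open CSV file handle."""
    pairs = []
    for row in csv.DictReader(handle):
        pairs.append({
            "id": int(row["id"]),
            "question1": row["question1"],
            "question2": row["question2"],
            "is_duplicate": int(row["is_duplicate"]),
        })
    return pairs

pairs = load_pairs(io.StringIO(SAMPLE_CSV))
# Share of pairs labelled as duplicates in the loaded sample.
duplicate_share = sum(p["is_duplicate"] for p in pairs) / len(pairs)
print(len(pairs), duplicate_share)
```
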
The dataset (`train.csv` and `test.csv`) can be found at this OneDrive link: https://iiithydstudents-my.sharepoint.com/:u:/g/personal/mayank_mittal_students_iiit_ac_in/Ef2igGfs64VDqRpSfgYc7-8Biad7vuYDD7qrnD2NDngVmQ?e=SUeJaH

⚠️ **Note**: The ground truth labels are subjective and were provided by human experts. While they represent a reasonable consensus, they may not be 100% accurate on a case-by-case basis.

## Methodology
### 1. Data Exploration and Preprocessing
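
The text preprocessing steps listed in this section can be sketched with the standard library alone. This is a minimal illustration, not the notebooks' actual pipeline: the contraction map and stopword set are tiny hand-picked samples, `preprocess` is a hypothetical helper name, and stemming/lemmatization (which would typically come from NLTK or spaCy) is left out.

```python
import html
import re

# Tiny illustrative samples; a real pipeline would use fuller lists.
CONTRACTIONS = {"what's": "what is", "can't": "cannot", "i'm": "i am", "don't": "do not"}
STOPWORDS = {"a", "an", "the", "is", "to", "of", "in", "do", "i"}

def preprocess(text, remove_stopwords=True):
    """Lowercase, strip HTML, expand contractions, tokenize, drop stopwords."""
    text = html.unescape(text).lower()
    text = re.sub(r"<[^>]+>", " ", text)          # remove HTML tags
    for short, full in CONTRACTIONS.items():      # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # remove special characters
    tokens = text.split()                         # whitespace tokenization
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

print(preprocess("<b>What's</b> the best way to learn Python?"))
```

Stopword removal is kept optional here because short function words can matter when two questions differ only slightly.
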
- **Exploratory Data Analysis (EDA)**
  - Distribution of duplicate/non-duplicate questions
  - Question length analysis
  - Word frequency analysis
  - Visualization of key features
- **Text Preprocessing**
  - Removal of HTML tags and special characters
  - Expanding contractions
  - Tokenization
  - Stopword removal
  - Stemming/Lemmatization
  - Advanced cleaning techniques

### 2. Feature Engineering
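
A few of the features listed in this section can be computed directly from the two question strings. The sketch below is illustrative: `pair_features` is a hypothetical helper, the word share ratio is taken here as shared unique words divided by the summed unique-word counts of both questions (one common definition, assumed rather than taken from the notebooks), and `difflib.SequenceMatcher` stands in for the Levenshtein-based scores that a library such as FuzzyWuzzy provides.

```python
from difflib import SequenceMatcher

def pair_features(q1, q2):
    """Compute simple length, overlap, and fuzzy features for a question pair."""
    w1, w2 = q1.lower().split(), q2.lower().split()
    s1, s2 = set(w1), set(w2)
    common = len(s1 & s2)
    total = len(s1) + len(s2)
    return {
        "len_q1": len(q1),
        "len_q2": len(q2),
        "words_q1": len(w1),
        "words_q2": len(w2),
        "common_words": common,
        "word_share": common / total if total else 0.0,
        # Similarity in [0, 1]; a stand-in for fuzz.ratio-style scores.
        "fuzzy_ratio": SequenceMatcher(None, q1.lower(), q2.lower()).ratio(),
    }

features = pair_features("how do i learn python", "how can i learn python fast")
print(features)
```
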
- **Basic Features**
  - Question length
  - Word count
  - Common words between questions
  - Word share ratio
- **Advanced Features**
  - Token features (common words, stopwords, etc.)
  - Length-based features
  - Fuzzy matching features (Levenshtein distance, etc.)
  - TF-IDF features
  - Word embedding features

### 3. Text Representation Methods
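
As one illustration of the representations listed in this section, the sketch below builds TF-IDF vectors by hand and compares questions with cosine similarity. It is a simplified stand-in for a library vectorizer (such as scikit-learn's `TfidfVectorizer`), using raw term counts and a smoothed IDF of `log((1 + N) / (1 + df)) + 1`; the function names and the three-question corpus are made up for the example.

```python
import math
from collections import Counter

def fit_idf(corpus):
    """Compute smoothed inverse document frequencies over tokenized documents."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """Map a token list to a sparse {term: tf * idf} vector (unknown terms dropped)."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

corpus = [
    "how do i learn python".split(),
    "what is the best way to learn python".split(),
    "how do i cook rice".split(),
]
idf = fit_idf(corpus)
vectors = [tfidf_vector(doc, idf) for doc in corpus]
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

On this toy corpus the first question scores closer to the rice question than to the Python paraphrase, because it shares more surface words with it. That kind of failure is one plausible motivation for also trying word and contextual embeddings.
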
- **Bag of Words (BoW)**
- **TF-IDF Vectorization**
- **Word Embeddings**
  - Word2Vec
  - GloVe
  - FastText
- **Contextual Embeddings**
  - BERT
  - RoBERTa
  - DistilBERT

### 4. Machine Learning Models
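
To show how hand-crafted pair features can feed one of the classifiers listed in this section, here is a minimal logistic regression trained with batch gradient descent in plain Python. It is a teaching sketch, not the notebooks' implementation (a library such as scikit-learn would normally be used); the two features per pair and the labels in the toy data are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example drives the gradient.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(x, w, b):
    """Probability that a pair with feature vector x is a duplicate."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Invented toy features per pair: [word_share, fuzzy_ratio]; label 1 = duplicate.
X = [[0.45, 0.90], [0.40, 0.85], [0.35, 0.80], [0.10, 0.30], [0.05, 0.20], [0.15, 0.35]]
y = [1, 1, 1, 0, 0, 0]

w, b = train_logreg(X, y)
print(predict_proba([0.42, 0.88], w, b), predict_proba([0.08, 0.25], w, b))
```
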
- **Traditional ML Algorithms**
  - Random Forest
  - XGBoost
  - Support Vector Machines (SVM)
  - Logistic Regression
  - Naive Bayes
- **Deep Learning Models**
  - LSTM/BiLSTM
  - Siamese Networks
  - Transformer-based models
  - Fine-tuned BERT/RoBERTa

### 5. Model Optimization
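
Of the techniques listed in this section, stratified K-fold cross-validation is easy to show without any library. The sketch below is a simplified version of what `sklearn.model_selection.StratifiedKFold` does: it deals the indices of each class round-robin into `k` folds so every fold keeps roughly the overall duplicate ratio. The function name and the toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs with class ratios preserved per fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):   # deal round-robin into folds
            folds[pos % k].append(idx)

    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, val_idx

# Toy labels: 12 non-duplicates and 8 duplicates (40% positive).
labels = [0] * 12 + [1] * 8
splits = list(stratified_kfold(labels, k=4))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx), sum(labels[i] for i in val_idx))
```

Stratification matters for this task because duplicates are the minority class, so a plain random split could leave some folds with a noticeably different duplicate ratio.
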
- **Hyperparameter Tuning**
  - Grid Search
  - Random Search
  - Bayesian Optimization
- **Cross-Validation**
  - K-Fold Cross-Validation
  - Stratified K-Fold Cross-Validation
- **Ensemble Methods**
  - Voting
  - Stacking
  - Bagging

## Performance Metrics
- Accuracy
- Precision
- Recall
- F1 Score
- ROC-AUC
- Log Loss

## Results
| Model | Embedding | Accuracy | F1 Score | ROC-AUC |
|-------|-----------|----------|----------|---------|
| Random Forest | BoW | 80.2% | 0.79 | 0.86 |
| XGBoost | BoW | 81.3% | 0.80 | 0.87 |
| SVM | TF-IDF | 82.5% | 0.81 | 0.88 |
| LSTM | Word2Vec | 83.7% | 0.82 | 0.89 |
| BERT | Contextual | 87.2% | 0.86 | 0.92 |

*Note: This table will be updated as more models are implemented and tested.*
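
For reference, the metrics used in the table above can be computed from true labels and predicted probabilities as sketched below. This is a plain-Python illustration with made-up labels and scores (it does not reproduce the table's numbers); in practice `sklearn.metrics` offers equivalents such as `accuracy_score`, `f1_score`, `roc_auc_score`, and `log_loss`.

```python
import math

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall, F1 and log loss for binary labels."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    eps = 1e-15  # clip probabilities so log() stays finite
    log_loss = -sum(
        t * math.log(min(max(p, eps), 1 - eps))
        + (1 - t) * math.log(1 - min(max(p, eps), 1 - eps))
        for t, p in zip(y_true, y_prob)
    ) / len(y_true)
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1, "log_loss": log_loss}

def roc_auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted duplicate probabilities.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]
metrics = classification_metrics(y_true, y_prob)
auc = roc_auc(y_true, y_prob)
print(metrics, auc)
```
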
## Future Work
- Implement more advanced deep learning architectures
- Experiment with different embedding techniques
- Explore transfer learning approaches
- Investigate attention mechanisms
- Develop an ensemble of best-performing models
- Build a simple web app for duplicate question detection

## Tools and Technologies
- **Programming Language**: Python
- **ML Libraries**: Scikit-learn, XGBoost, LightGBM
- **DL Libraries**: TensorFlow, Keras, PyTorch
- **NLP Libraries**: NLTK, SpaCy, Transformers
- **Data Manipulation**: NumPy, Pandas
- **Visualization**: Matplotlib, Seaborn, Plotly
- **Text Processing**: Regex, BeautifulSoup, FuzzyWuzzy

## Repository Structure
```
DupliFinder/
│
├── data/                            # Dataset files
│   ├── train.csv                    # Training set
│   └── test.csv                     # Test set
│
├── notebooks/                       # Jupyter notebooks
│   ├── 1_EDA.ipynb                  # Exploratory Data Analysis
│   ├── 2_Preprocessing.ipynb        # Text preprocessing
│   ├── 3_FeatureEngineering.ipynb   # Feature engineering
│   ├── 4_Traditional_ML.ipynb       # Traditional ML models
│   └── 5_Deep_Learning.ipynb        # Deep learning models
│
├── src/                             # Source code
│   ├── preprocessing/               # Text preprocessing modules
│   ├── features/                    # Feature engineering modules
│   ├── models/                      # Model implementations
│   ├── utils/                       # Utility functions
│   └── visualization/               # Visualization functions
│
├── models/                          # Saved model files
│
├── app/                             # Web application files
│
├── requirements.txt                 # Project dependencies
│
└── README.md                        # Project documentation
```

## Getting Started
### Prerequisites
- Python 3.7+
- pip

### Installation
1. Clone the repository:
```bash
git clone https://github.com/mayankmittal29/duplifinder-quora-clone-catcher.git
cd duplifinder-quora-clone-catcher
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Download the dataset:
```bash
mkdir -p data
# Download train.csv and test.csv from Kaggle and place them in the data/ directory
```
4. Run the notebooks or scripts:
```bash
jupyter notebook notebooks/1_EDA.ipynb
```

## Demo

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- Kaggle for hosting the original competition
- Quora for providing the dataset
- The open-source community for their invaluable tools and libraries

## Contact
If you have any questions or suggestions, feel free to reach out:
- GitHub: [mayankmittal29](https://github.com/mayankmittal29)
- LinkedIn: [your-linkedin](https://linkedin.com/in/your-linkedin)

---
⭐ Star this repository if you find it useful! ⭐