# 🔍 DupliFinder: Quora Question Pairs Challenge 🔍

![Python](https://img.shields.io/badge/Python-3.7+-blue.svg)
![Machine Learning](https://img.shields.io/badge/Machine%20Learning-NLP-brightgreen.svg)
![Deep Learning](https://img.shields.io/badge/Deep%20Learning-LSTM%2FBERT-orange.svg)
![Status](https://img.shields.io/badge/Status-Active-success.svg)

## 📚 Problem Statement

Quora is a platform where people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge for Quora is quickly identifying duplicate questions, both to provide a better user experience and to maintain high-quality content.

This project tackles the Quora Question Pairs challenge from Kaggle, which requires building a machine learning model that identifies whether a pair of questions is semantically identical (a duplicate) or not.

## 🎯 Project Goals

- Develop models to accurately classify question pairs as duplicates or non-duplicates
- Experiment with various text preprocessing techniques
- Compare performance of traditional ML algorithms and deep learning approaches
- Extract and engineer useful features from text data
- Optimize model performance through hyperparameter tuning and cross-validation

## 📊 Dataset Description

The dataset consists of over 400,000 question pairs from Quora, each with the following fields:

- **id**: The unique identifier for a question pair
- **qid1, qid2**: Unique identifiers for each question (only in train.csv)
- **question1, question2**: The full text of each question
- **is_duplicate**: The target variable (1 if questions are duplicates, 0 otherwise)
#### The dataset (train.csv and test.csv) can be found at this OneDrive link: https://iiithydstudents-my.sharepoint.com/:u:/g/personal/mayank_mittal_students_iiit_ac_in/Ef2igGfs64VDqRpSfgYc7-8Biad7vuYDD7qrnD2NDngVmQ?e=SUeJaH
⚠️ **Note**: The ground truth labels are subjective and were provided by human experts. While they represent a reasonable consensus, they may not be 100% accurate on a case-by-case basis.
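
A minimal sketch of loading the training split and checking the class balance, assuming `train.csv` has been placed in a local `data/` directory as in the repository structure below:

```python
import pandas as pd

# Assumes train.csv has been downloaded into data/ (see Repository Structure)
train = pd.read_csv("data/train.csv")

print(train.shape)             # roughly 400K rows, 6 columns
print(train.columns.tolist())  # ['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate']
print(train["is_duplicate"].value_counts(normalize=True))  # duplicate vs. non-duplicate ratio
```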

## 🔧 Methodology

### 1. Data Exploration and Preprocessing

- **Exploratory Data Analysis (EDA)** 📈
  - Distribution of duplicate/non-duplicate questions
  - Question length analysis
  - Word frequency analysis
  - Visualization of key features

- **Text Preprocessing** 🧹 (sketch below)
  - Removal of HTML tags and special characters
  - Expanding contractions
  - Tokenization
  - Stopword removal
  - Stemming/Lemmatization
  - Advanced cleaning techniques
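
For illustration, a condensed sketch of such a cleaning pipeline using BeautifulSoup and NLTK; the contraction map and exact ordering of steps are simplified stand-ins for what the notebooks do:

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords        # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

# Tiny illustrative contraction map; a real pipeline would use a much larger dictionary
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "i'm": "i am"}
STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    text = BeautifulSoup(text, "html.parser").get_text()  # strip HTML tags
    text = text.lower()
    for contraction, expanded in CONTRACTIONS.items():    # expand contractions
        text = text.replace(contraction, expanded)
    text = re.sub(r"[^a-z0-9\s]", " ", text)               # drop special characters
    tokens = word_tokenize(text)                           # tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]     # remove stopwords
    tokens = [stemmer.stem(t) for t in tokens]             # stem
    return " ".join(tokens)

print(preprocess("What's the best way to learn <b>Machine Learning</b>?"))
```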

### 2. Feature Engineering

- **Basic Features** 🧮
  - Question length
  - Word count
  - Common words between questions
  - Word share ratio

- **Advanced Features** 🔬 (sketch below)
  - Token features (common words, stopwords, etc.)
  - Length-based features
  - Fuzzy matching features (Levenshtein distance, etc.)
  - TF-IDF features
  - Word embedding features
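
As an illustration, a hypothetical `pair_features` helper that computes several of the basic and fuzzy-matching features listed above with FuzzyWuzzy; the feature names and exact formulas are illustrative, not necessarily those used in the notebooks:

```python
from fuzzywuzzy import fuzz

def pair_features(q1: str, q2: str) -> dict:
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    common = w1 & w2
    return {
        "q1_len": len(q1),                      # character lengths
        "q2_len": len(q2),
        "q1_words": len(w1),                    # word counts
        "q2_words": len(w2),
        "common_words": len(common),            # shared words
        "word_share": len(common) / max(len(w1) + len(w2), 1),
        "fuzz_ratio": fuzz.ratio(q1, q2),       # Levenshtein-based similarity
        "fuzz_partial": fuzz.partial_ratio(q1, q2),
        "fuzz_token_sort": fuzz.token_sort_ratio(q1, q2),
        "fuzz_token_set": fuzz.token_set_ratio(q1, q2),
    }

print(pair_features("How can I learn Python?", "What is the best way to learn Python?"))
```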

### 3. Text Representation Methods

- **Bag of Words (BoW)** 📝
- **TF-IDF Vectorization** 📊 (sketch below)
- **Word Embeddings** 🔤
  - Word2Vec
  - GloVe
  - FastText
- **Contextual Embeddings** 🧠
  - BERT
  - RoBERTa
  - DistilBERT
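
For example, TF-IDF vectors plus cosine similarity give a single pairwise feature; a scikit-learn sketch follows (the Word2Vec/GloVe/FastText and BERT representations follow the same encode-then-compare pattern):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

q1 = ["How can I learn Python?", "What is machine learning?"]
q2 = ["What is the best way to learn Python?", "How do airplanes fly?"]

# Fit one vocabulary over both question columns so the vectors are comparable
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
tfidf.fit(q1 + q2)

v1, v2 = tfidf.transform(q1), tfidf.transform(q2)

# Row-wise cosine similarity between question1[i] and question2[i]
sims = [cosine_similarity(v1[i], v2[i])[0, 0] for i in range(len(q1))]
print(sims)  # higher values suggest the pair is more likely a duplicate
```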

### 4. Machine Learning Models

- **Traditional ML Algorithms** 🤖
  - Random Forest
  - XGBoost
  - Support Vector Machines (SVM)
  - Logistic Regression
  - Naive Bayes

- **Deep Learning Models** 🧠
  - LSTM/BiLSTM
  - Siamese Networks (sketch below)
  - Transformer-based models
  - Fine-tuned BERT/RoBERTa
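
To illustrate the Siamese BiLSTM idea, a minimal Keras sketch with a shared encoder applied to both questions; the vocabulary size, sequence length, and layer widths are placeholders rather than tuned values:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 50_000, 300, 30  # placeholder hyperparameters

def build_encoder():
    """Shared tower: embeds a padded token-id sequence and encodes it with a BiLSTM."""
    inp = layers.Input(shape=(MAX_LEN,))
    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inp)  # could be initialised with GloVe/FastText
    x = layers.Bidirectional(layers.LSTM(64))(x)
    return Model(inp, x)

shared = build_encoder()
q1_in = layers.Input(shape=(MAX_LEN,), name="question1")
q2_in = layers.Input(shape=(MAX_LEN,), name="question2")
e1, e2 = shared(q1_in), shared(q2_in)

# Combine the two encodings: absolute difference and element-wise product
diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([e1, e2])
prod = layers.Multiply()([e1, e2])
merged = layers.Concatenate()([diff, prod])
out = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(out)  # probability of being a duplicate

model = Model([q1_in, q2_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```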

### 5. Model Optimization

- **Hyperparameter Tuning** 🎛️ (sketch below)
  - Grid Search
  - Random Search
  - Bayesian Optimization

- **Cross-Validation** ✅
  - K-Fold Cross-Validation
  - Stratified K-Fold Cross-Validation

- **Ensemble Methods** 🤝
  - Voting
  - Stacking
  - Bagging
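
A sketch combining stratified k-fold cross-validation with a randomized hyperparameter search over XGBoost; the parameter grid, fold count, and stand-in data are illustrative only:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from xgboost import XGBClassifier

# X: engineered feature matrix, y: is_duplicate labels (stand-in random data for the sketch)
X, y = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)

param_dist = {
    "n_estimators": [200, 400, 800],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.9, 1.0],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_distributions=param_dist,
    n_iter=20,
    scoring="neg_log_loss",  # the competition metric is log loss
    cv=cv,
    n_jobs=-1,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```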

## 📈 Performance Metrics

- Accuracy
- Precision
- Recall
- F1 Score
- ROC-AUC
- Log Loss
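
All of these are available in scikit-learn, e.g.:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

# y_true: ground-truth labels, y_prob: predicted duplicate probabilities from any model above
y_true = [0, 1, 1, 0, 1]
y_prob = [0.2, 0.9, 0.6, 0.4, 0.8]
y_pred = [int(p >= 0.5) for p in y_prob]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))
print("Log Loss :", log_loss(y_true, y_prob))
```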

## 🚀 Results

| Model | Embedding | Accuracy | F1 Score | ROC-AUC |
|-------|-----------|----------|----------|---------|
| Random Forest | BoW | 80.2% | 0.79 | 0.86 |
| XGBoost | BoW | 81.3% | 0.80 | 0.87 |
| SVM | TF-IDF | 82.5% | 0.81 | 0.88 |
| LSTM | Word2Vec | 83.7% | 0.82 | 0.89 |
| BERT | Contextual | 87.2% | 0.86 | 0.92 |

*Note: This table will be updated as more models are implemented and tested.*

## 🔮 Future Work

- Implement more advanced deep learning architectures
- Experiment with different embedding techniques
- Explore transfer learning approaches
- Investigate attention mechanisms
- Develop an ensemble of best-performing models
- Build a simple web app for question duplicate detection

## 🛠️ Tools and Technologies

- **Programming Language**: Python
- **ML Libraries**: Scikit-learn, XGBoost, LightGBM
- **DL Libraries**: TensorFlow, Keras, PyTorch
- **NLP Libraries**: NLTK, SpaCy, Transformers
- **Data Manipulation**: NumPy, Pandas
- **Visualization**: Matplotlib, Seaborn, Plotly
- **Text Processing**: Regex, BeautifulSoup, FuzzyWuzzy

## 📂 Repository Structure

```
DupliFinder/
│
├── data/                          # Dataset files
│   ├── train.csv                  # Training set
│   └── test.csv                   # Test set
│
├── notebooks/                     # Jupyter notebooks
│   ├── 1_EDA.ipynb                # Exploratory Data Analysis
│   ├── 2_Preprocessing.ipynb      # Text preprocessing
│   ├── 3_FeatureEngineering.ipynb # Feature engineering
│   ├── 4_Traditional_ML.ipynb     # Traditional ML models
│   └── 5_Deep_Learning.ipynb      # Deep learning models
│
├── src/                           # Source code
│   ├── preprocessing/             # Text preprocessing modules
│   ├── features/                  # Feature engineering modules
│   ├── models/                    # Model implementations
│   ├── utils/                     # Utility functions
│   └── visualization/             # Visualization functions
│
├── models/                        # Saved model files
│
├── app/                           # Web application files
│
├── requirements.txt               # Project dependencies
│
└── README.md                      # Project documentation
```

## 🚀 Getting Started

### Prerequisites

- Python 3.7+
- pip

### Installation

1. Clone the repository:
```bash
git clone https://github.com/mayankmittal29/duplifinder-quora-clone-catcher.git
cd duplifinder-quora-clone-catcher
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Download the dataset:
```bash
mkdir -p data
# Download from Kaggle (or the OneDrive link above) and place the CSVs in the data/ directory
```

4. Run the notebooks or scripts:
```bash
jupyter notebook notebooks/1_EDA.ipynb
```

## 📊 Demo

![Demo GIF](https://example.com/demo.gif)

## 📜 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- Kaggle for hosting the original competition
- Quora for providing the dataset
- The open-source community for their invaluable tools and libraries

## 📬 Contact

If you have any questions or suggestions, feel free to reach out:

- GitHub: [your-username](https://github.com/your-username)
- LinkedIn: [your-linkedin](https://linkedin.com/in/your-linkedin)

---

โญ Star this repository if you find it useful! โญ