https://github.com/hengluchang/quora-paraphrase-question-identification

Paraphrase question identification using Feature Fusion Network (FFN).
deep-learning feature-engineering kaggle neural-networks paraphrase-identification quora quora-question-pairs

Host: GitHub
URL: https://github.com/hengluchang/quora-paraphrase-question-identification
Owner: hengluchang
Created: 2017-04-09T18:34:40.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2017-09-05T05:17:18.000Z (over 7 years ago)
Last Synced: 2025-04-06T20:35:53.973Z (about 1 month ago)
Topics: deep-learning, feature-engineering, kaggle, neural-networks, paraphrase-identification, quora, quora-question-pairs
Language: Python
Homepage:
Size: 443 KB
Stars: 20
Watchers: 2
Forks: 4
Open Issues: 4
Metadata Files:
- Readme: README.md

[![Codacy Badge](https://api.codacy.com/project/badge/Grade/086fe025b4fb41599ee1e6dfa50f12bf)](https://www.codacy.com/app/hengluchang/Quora-Paraphrase-Question-Identification?utm_source=github.com&utm_medium=referral&utm_content=hengluchang/Quora-Paraphrase-Question-Identification&utm_campaign=Badge_Grade)

## Paraphrase Question Identification using Feature Fusion Network
Identify question pairs that have the same meaning. The Feature Fusion Network (FFN) learns rich features not only from sentence representations but also from hand-crafted features.

For more detailed information, please see our project research paper: [Paraphrase Question Identification using Feature Fusion Network](https://github.com/hengluchang/Quora-Paraphrase-Question-Identification/blob/master/paraphrase-question-identification.pdf).

## Model architecture
![FFN architecture](https://github.com/hengluchang/SemQuestionMatching/blob/master/FFN_architecture.jpg)
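
As a rough illustration of the fusion idea, the sketch below builds a fixed-size representation for each question by summing word vectors, then concatenates the two representations with a vector of hand-crafted features. The toy embeddings, the `sentence_vector` and `fuse` helpers, and the sum-based sentence encoder are assumptions made for illustration; the actual architecture is the one shown in the figure and described in the project paper.

```python
# Minimal sketch of feature fusion (illustrative; not the repository's model code).
from typing import Dict, List

# Toy 3-dimensional word vectors; a real run would use GloVe embeddings.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "how": [0.1, 0.0, 0.2],
    "learn": [0.4, 0.3, 0.0],
    "python": [0.0, 0.5, 0.5],
    "study": [0.3, 0.3, 0.1],
}
DIM = 3


def sentence_vector(tokens: List[str]) -> List[float]:
    """Sum the vectors of known tokens; unknown tokens contribute nothing."""
    total = [0.0] * DIM
    for tok in tokens:
        vec = TOY_EMBEDDINGS.get(tok)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total


def fuse(q1: List[str], q2: List[str], hcf: List[float]) -> List[float]:
    """Concatenate both sentence vectors with the hand-crafted features."""
    return sentence_vector(q1) + sentence_vector(q2) + list(hcf)


fused = fuse(["how", "learn", "python"], ["how", "study", "python"], [2.0, 0.0])
print(len(fused))  # 3 + 3 + 2 = 8 inputs for a downstream classifier
```

In a real network the fused vector would feed dense layers that output the probability that the two questions are paraphrases.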

## Results
- 0.895 testing accuracy for FFN (trained for 100 epochs)

## Requirements
- Python 3.5 for running FFN
- Python 2.7 for running Random Forest (RF) baseline

## Package dependencies
### RF baseline
- scikit-learn 0.18
- nltk
- pandas

### FFN
- numpy 1.11
- matplotlib 1.5
- Keras 1.2
- scikit-learn 0.18
- h5py 2.6
- hdf5 1.8
- TensorFlow 0.10

## How to run
```
$ git clone https://github.com/hengluchang/Quora-Paraphrase-Question-Identification
```
### Run Random Forest baseline
- Create a folder named "dataset".
```
$ cd Quora-Paraphrase-Question-Identification
$ mkdir -p dataset
```

- Go to the [Kaggle Quora Question Pairs website](https://www.kaggle.com/c/quora-question-pairs/data), download train.csv.zip and test.csv.zip, and unzip both. Place train.csv and test.csv in the dataset directory.

- Create 10 hand-crafted features (HCFs). This writes train_10features.csv and test_10features.csv to the dataset directory.
```
$ cd ..
$ python feature_gen.py ../dataset/train.csv ../dataset/test.csv
```
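
The 10 features themselves are defined in `feature_gen.py` and the project paper. For readers unfamiliar with this kind of feature, here is a sketch of three typical pair-level features. The feature names and the `pair_features` helper are hypothetical and may not match what the script computes.

```python
# Illustrative hand-crafted features for a question pair (hypothetical examples,
# not necessarily the 10 features produced by feature_gen.py).
from typing import Dict


def pair_features(q1: str, q2: str) -> Dict[str, float]:
    words1 = q1.lower().split()
    words2 = q2.lower().split()
    set1, set2 = set(words1), set(words2)
    shared = set1 & set2
    union = set1 | set2
    return {
        # Absolute difference in word count.
        "len_diff": float(abs(len(words1) - len(words2))),
        # Number of distinct words the two questions have in common.
        "common_words": float(len(shared)),
        # Jaccard similarity of the two word sets (0.0 when both are empty).
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


feats = pair_features("How do I learn Python", "How can I learn Python fast")
print(feats)
```

Features like these are cheap to compute and give a classifier direct lexical-overlap signals that a sentence encoder may not expose.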

- Run the Random Forest baseline on these 10 HCFs. This should give roughly 0.84 testing accuracy.

```
$ python run_baseline.py ../dataset/train_10features.csv
```

### Run Feature Fusion Network (FFN)

- Download the required data from [this Google Drive folder](https://drive.google.com/drive/folders/0B7j2V-uXleQ-ZjhxS0laWFBBTVk?usp=sharing) into the directory you cloned.

- Train FFN w/o HCF
```
$ python3 train_noHCF.py -i -t -g -w -e -n
```
For instance:
```
$ python3 train_noHCF.py -i train_rebalanced.csv -t test.csv -g glove.840B.300d.txt -w question_pairs_weights_100epoch_test10_val10_dropout20_sumOP_noAVG_rebalanced.h5 -e word_embedding_matrix_trainANDtest_rebalanced.npy -n nb_words_trainANDtest_rebalanced.json
```
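
The `-g` argument points at a GloVe text file, in which each line holds a token followed by its vector components, separated by spaces. A minimal sketch of reading that format into a dictionary follows. The `load_glove` helper is hypothetical and is not taken from the training script; splitting from the right guards against a token that itself contains spaces.

```python
# Sketch of parsing the GloVe text format (hypothetical helper, for illustration).
import io
from typing import Dict, Iterable, List


def load_glove(lines: Iterable[str], dim: int) -> Dict[str, List[float]]:
    """Map each token to its vector; the last `dim` fields are the vector."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip("\n").rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        token, values = parts[0], parts[1:]
        vectors[token] = [float(v) for v in values]
    return vectors


# Tiny in-memory stand-in for glove.840B.300d.txt, with 3 dimensions instead of 300.
sample = io.StringIO("what 0.1 0.2 0.3\nis 0.0 -0.5 0.25\n. . . 1.0 2.0 3.0\n")
glove = load_glove(sample, dim=3)
print(sorted(glove))
```

With the real file, pass an open file handle such as `open("glove.840B.300d.txt", encoding="utf-8")` and `dim=300`.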

- Train FFN
```
$ python3 train_HCF.py -i -t -f -g -w -e -n
```
For instance:
```
$ python3 train_HCF.py -i train_rebalanced.csv -t test.csv -f train_rebalanced_10features.csv -g glove.840B.300d.txt -w question_pairs_weights_100epoch_test10_val20_dropout20_sumOP_noAVG_HCF_rebalanced.h5 -e word_embedding_matrix_trainANDtest_rebalanced.npy -n nb_words_trainANDtest_rebalanced.json
```

- Test FFN w/o HCF
```
$ python3 test_noHCF.py -i -o -e -n -w
```
For instance:
```
$ python3 test_noHCF.py -i test.csv -o result_question_pairs_weights_100epoch_test10_val10_dropout20_sumOP_noAVG_rebalanced.csv -e word_embedding_matrix_trainANDtest_rebalanced.npy -n nb_words_trainANDtest_rebalanced.json -w question_pairs_weights_100epoch_test10_val10_dropout20_sumOP_noAVG_rebalanced.h5
```
- Test FFN

```
$ python3 test_HCF.py -i -f -o -e -n -w
```
For instance:
```
$ python3 test_sum_HCF.py -i test.csv -f test_10features.csv -o result_question_pairs_weights_100epoch_test10_val10_dropout20_sumOP_noAVG_HCF_rebalanced.csv -e word_embedding_matrix_trainANDtest_rebalanced.npy -n nb_words_trainANDtest_rebalanced.json -w question_pairs_weights_100epoch_test10_val10_dropout20_sumOP_noAVG_HCF_rebalanced.h5
```
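
The usage lines in this section list only the short flags. Judging from the example commands, the training flags appear to map to the inputs below; the following argparse sketch records that reading. The `dest` names and help texts are assumptions inferred from the example file names, not copied from the scripts.

```python
# Sketch of the command-line interface implied by the train_HCF.py example
# (argument names and help strings are inferred, not taken from the script).
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Train FFN with hand-crafted features")
    parser.add_argument("-i", dest="train_csv", required=True, help="training question pairs (CSV)")
    parser.add_argument("-t", dest="test_csv", required=True, help="test question pairs (CSV)")
    parser.add_argument("-f", dest="features_csv", required=True, help="hand-crafted features (CSV)")
    parser.add_argument("-g", dest="glove", required=True, help="GloVe vectors (text file)")
    parser.add_argument("-w", dest="weights", required=True, help="model weights file (HDF5)")
    parser.add_argument("-e", dest="embedding_matrix", required=True, help="word embedding matrix (.npy)")
    parser.add_argument("-n", dest="nb_words", required=True, help="vocabulary size metadata (JSON)")
    return parser


args = build_parser().parse_args([
    "-i", "train_rebalanced.csv",
    "-t", "test.csv",
    "-f", "train_rebalanced_10features.csv",
    "-g", "glove.840B.300d.txt",
    "-w", "weights.h5",
    "-e", "word_embedding_matrix_trainANDtest_rebalanced.npy",
    "-n", "nb_words_trainANDtest_rebalanced.json",
])
print(args.train_csv, args.features_csv)
```

The test scripts additionally take `-o`, which in the examples names the CSV file that receives the predictions.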

## Reference
- [Keras model to identify Quora question pairs](https://github.com/bradleypallen/keras-quora-question-pairs): borrowed most of the Deep Neural Network script
