https://github.com/AlexGidiotis/Document-Classifier-LSTM
A bidirectional LSTM with attention for multiclass/multilabel text classification.
- Host: GitHub
- URL: https://github.com/AlexGidiotis/Document-Classifier-LSTM
- Owner: AlexGidiotis
- License: mit
- Created: 2017-08-20T17:32:57.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2024-08-30T23:53:12.000Z (9 months ago)
- Last Synced: 2024-11-06T02:38:46.922Z (7 months ago)
- Topics: arxiv, attention-mechanism, hierarchical-attention-networks, keras, lstm, multilabel-multiclass, recurrent-neural-networks, tensorflow, text-classification
- Language: Python
- Homepage:
- Size: 82 KB
- Stars: 171
- Watchers: 6
- Forks: 52
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
# Document-Classifier-LSTM
Recurrent neural networks for multiclass, multilabel classification of texts. The models learn to tag small texts with 169 different tags from arXiv. classifier.py implements a standard BLSTM network with attention.
In hatt_classifier.py you can find the implementation of [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf).
The neural networks were built using Keras and Tensorflow.
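For orientation, here is a minimal sketch of what such a BLSTM-with-attention classifier can look like in Keras; the layer sizes and variable names are illustrative assumptions, not the exact architecture in classifier.py.

```python
# A minimal sketch of a BLSTM-with-attention multilabel classifier in Keras.
# Sizes and names are illustrative, not the exact architecture in classifier.py.
import tensorflow as tf
from tensorflow.keras import Model, layers

MAX_LEN, VOCAB_SIZE, EMB_DIM, NUM_TAGS = 200, 50000, 300, 169

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)
# Bidirectional LSTM keeps one hidden state per time step for attention.
h = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
# Simple additive attention: score each time step, softmax over time,
# then take the attention-weighted sum of the hidden states.
scores = layers.Dense(1, activation="tanh")(h)
weights = layers.Softmax(axis=1)(scores)
context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, weights])
# Sigmoid outputs + binary cross-entropy: each abstract may carry several tags.
outputs = layers.Dense(NUM_TAGS, activation="sigmoid")(context)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
```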
The best performing model is the attention BLSTM, which achieves a micro F1-score of 0.67 on the test set.
The Hierarchical Attention Network achieves only a 0.65 micro F1-score.
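Micro averaging pools true positives, false positives, and false negatives across all 169 labels before computing precision and recall. A toy scikit-learn illustration of the metric (not the repository's evaluation code):

```python
# Toy illustration of micro-averaged F1 for multilabel predictions.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
# Pooled over all labels: TP=2, FP=1, FN=1 -> precision = recall = 2/3.
print(f1_score(y_true, y_pred, average="micro"))  # 0.666...
```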
I am using 500k paper abstracts from arXiv. To download your own data, refer to the [arxiv OAI api](https://arxiv.org/help/bulk_data).
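If you want to harvest abstracts yourself, a minimal sketch of a single OAI-PMH request is below; the endpoint and oai_dc format are standard OAI-PMH, but check the arXiv bulk-data docs for resumption tokens and rate limits before harvesting at scale.

```python
# A sketch of one OAI-PMH request against arXiv, using the third-party
# "requests" library. Real harvesting must follow resumption tokens.
import xml.etree.ElementTree as ET
import requests

OAI_URL = "http://export.arxiv.org/oai2"
NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}

resp = requests.get(OAI_URL, params={"verb": "ListRecords",
                                     "metadataPrefix": "oai_dc",
                                     "set": "cs"})
root = ET.fromstring(resp.text)
for record in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
    abstract = record.find(".//dc:description", NS)
    subjects = [s.text for s in record.findall(".//dc:subject", NS)]
    if abstract is not None:
        print(subjects, abstract.text[:80])
```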
Pretrained word embeddings can be used, either GloVe or Word2Vec. You can download [GoogleNews-vectors-negative300.bin](https://code.google.com/archive/p/word2vec) or the [GloVe embeddings](https://nlp.stanford.edu/projects/glove).
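To feed either set of vectors into the network, they are typically loaded into a matrix indexed by the tokenizer's word ids. A sketch, assuming the 6B/300d GloVe file name; a Word2Vec .bin file would instead be read with gensim's KeyedVectors:

```python
# A sketch of loading GloVe vectors into an embedding matrix keyed by the
# tokenizer's word index. File name is an assumption about which download you use.
import numpy as np

EMB_DIM = 300
vectors = {}
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")

def build_matrix(word_index, vocab_size):
    """word_index maps token -> integer id assigned by the tokenizer."""
    matrix = np.zeros((vocab_size, EMB_DIM))
    for word, i in word_index.items():
        if i < vocab_size and word in vectors:
            matrix[i] = vectors[word]
    return matrix  # pass as initial weights to the Keras Embedding layer
```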
## Usage:
1) To train your own model, first prepare your data set using the data_prep.py script. The preprocessing converts text to lower case, tokenizes it, and removes very short words (see the preprocessing sketch after this list). The preprocessed files and label files should be saved in a /data folder.
2) You can now run classifier.py or hatt_classifier.py to build and train the models.
3) The trained models are exported to JSON and the weights to HDF5 (.h5) for later use (see the reloading sketch after this list).
4) You can use utils.visualize_attention to visualize the attention weights.
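As referenced in step 1, here is a minimal sketch of that preprocessing; the length threshold is illustrative, and data_prep.py's exact rules may differ.

```python
# Lowercase, tokenize with NLTK, drop very short tokens.
from nltk.tokenize import word_tokenize  # run nltk.download("punkt") once, if needed

def preprocess(text, min_len=3):
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if len(t) >= min_len]

print(preprocess("A bidirectional LSTM with attention for text classification."))
# ['bidirectional', 'lstm', 'with', 'attention', 'for', 'text', 'classification']
```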
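And for step 3, reloading an exported model might look like the following; the file names are assumptions, and any custom attention layer has to be supplied via custom_objects.

```python
# A sketch of reloading a model exported to JSON (architecture) and h5 (weights).
from tensorflow.keras.models import model_from_json

with open("model.json") as f:
    model = model_from_json(f.read())  # add custom_objects={...} if needed
model.load_weights("model.h5")
model.compile(optimizer="adam", loss="binary_crossentropy")
```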
## Requirements
- Python
- NLTK
- NumPy
- Pandas
- SciPy
- OpenCV
- scikit-learn
- [Tensorflow](https://github.com/tensorflow/tensorflow)
- [Keras](https://github.com/fchollet/keras)

Run `pip install -r requirements.txt` to install the requirements.