Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/paulrinckens/han_for_doc_classification
Hierarchical Attention Networks for Document Classification
document-classification hierarchical-attention-networks machine-learning natural-language-processing
Last synced: 2 months ago
JSON representation
Hierarchical Attention Networks for Document Classification
- Host: GitHub
- URL: https://github.com/paulrinckens/han_for_doc_classification
- Owner: paulrinckens
- License: mit
- Created: 2020-04-19T12:28:46.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2023-03-25T00:56:08.000Z (almost 2 years ago)
- Last Synced: 2024-10-03T10:33:40.135Z (3 months ago)
- Topics: document-classification, hierarchical-attention-networks, machine-learning, natural-language-processing
- Language: Jupyter Notebook
- Size: 820 KB
- Stars: 4
- Watchers: 1
- Forks: 2
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Hierarchical Attention Networks for Document Classification
An implementation of Hierarchical Attention Networks for Document Classification from the following paper (Yang et al., NAACL 2016):
![Paper on Hierarchical Attention Networks for Document Classification](img/paper_screenshot_han_for_doc_class.png)
## Content
This repo contains the following:
- Implementation of the network architecture with the following technologies (a minimal Keras sketch follows this list)
- Keras
- Preprocessing with spaCy
- Embedding layer with pretrained GloVe vectors
- Tokenizing with Keras tokenizer
- Utilities to train the model on the *Ten Thousand German News Article Dataset* (https://github.com/tblock/10kGNAD)
- Jupyter Notebook used during initial development of the network
- REST API server developed with FastAPI for running predictions with a trained model and displaying the sentence and word attentions as HTML
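
For orientation, the hierarchical attention architecture described above could be built in Keras roughly as follows. This is a minimal sketch: the attention layer, layer sizes, and hyper-parameters are illustrative assumptions, not the repository's actual code.

```
import tensorflow as tf
from tensorflow.keras import layers, Model

class AttentionPooling(layers.Layer):
    """Additive attention pooling in the spirit of Yang et al. (2016)."""

    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.W = self.add_weight(name="W", shape=(dim, dim), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(dim,), initializer="zeros")
        self.u = self.add_weight(name="u", shape=(dim,), initializer="glorot_uniform")

    def call(self, inputs):
        # inputs: (batch, steps, dim) -> attention-weighted sum over the steps
        hidden = tf.tanh(tf.tensordot(inputs, self.W, axes=1) + self.b)
        weights = tf.nn.softmax(tf.tensordot(hidden, self.u, axes=1), axis=-1)
        return tf.reduce_sum(inputs * tf.expand_dims(weights, -1), axis=1)

# Illustrative hyper-parameters, not taken from the repository
MAX_SENTS, MAX_WORDS = 30, 50
VOCAB_SIZE, EMB_DIM, GRU_UNITS, NUM_CLASSES = 20000, 300, 64, 9

# Word level: encode one sentence into a sentence vector
word_input = layers.Input(shape=(MAX_WORDS,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(word_input)  # pretrained GloVe weights go here
x = layers.Bidirectional(layers.GRU(GRU_UNITS, return_sequences=True))(x)
word_encoder = Model(word_input, AttentionPooling()(x))

# Sentence level: a document is a (MAX_SENTS, MAX_WORDS) matrix of token ids
doc_input = layers.Input(shape=(MAX_SENTS, MAX_WORDS), dtype="int32")
y = layers.TimeDistributed(word_encoder)(doc_input)
y = layers.Bidirectional(layers.GRU(GRU_UNITS, return_sequences=True))(y)
output = layers.Dense(NUM_CLASSES, activation="softmax")(AttentionPooling()(y))

han = Model(doc_input, output)
han.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```
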
## Get started

- Download the German word vectors pretrained by Deepset (https://deepset.ai/german-word-embeddings) from `https://int-emb-glove-de-wiki.s3.eu-central-1.amazonaws.com/vectors.txt` and place the file in the directory `embeddings/glove_german/` (see the loading sketch at the end of this section)
- The dataset can be downloaded from `https://github.com/tblock/10kGNAD/blob/master/articles.csv`. Place the `articles.csv` file in the directory `data/`.
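
A minimal sketch of how the downloaded `vectors.txt` could be turned into an embedding matrix for the Keras `Embedding` layer. The file is assumed to be in the standard GloVe text format (a token followed by its vector on each line); sizes and function names are illustrative, not the repository's code.

```
import numpy as np

EMB_DIM = 300       # dimensionality of the downloaded vectors; adjust to match vectors.txt
VOCAB_SIZE = 20000  # should match the tokenizer's num_words

def load_glove(path="embeddings/glove_german/vectors.txt"):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def build_embedding_matrix(word_index, vectors):
    # word_index comes from a fitted Keras Tokenizer (word -> integer id)
    matrix = np.zeros((VOCAB_SIZE, EMB_DIM), dtype="float32")
    for word, idx in word_index.items():
        if idx < VOCAB_SIZE and word in vectors:
            matrix[idx] = vectors[word]
    return matrix

# The matrix can then initialize the Embedding layer, e.g.
# layers.Embedding(VOCAB_SIZE, EMB_DIM,
#                  embeddings_initializer=tf.keras.initializers.Constant(matrix))
```
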
## Train the model

Run the training script `train_script.py`.
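
Training a HAN requires each article to be split into sentences and encoded as a fixed-size matrix of token ids. A rough sketch of this kind of preprocessing with spaCy and the Keras tokenizer follows; the pipeline name, function names, and sizes are assumptions, not the repository's code.

```
# Illustrative preprocessing: spaCy splits articles into sentences, the Keras
# tokenizer maps words to ids, and each document is padded to (MAX_SENTS, MAX_WORDS).
import numpy as np
import spacy
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_SENTS, MAX_WORDS, VOCAB_SIZE = 30, 50, 20000

nlp = spacy.load("de_core_news_sm")  # any German pipeline with sentence splitting

def encode_documents(articles, tokenizer):
    encoded = np.zeros((len(articles), MAX_SENTS, MAX_WORDS), dtype="int32")
    for i, article in enumerate(articles):
        sentences = [sent.text for sent in nlp(article).sents][:MAX_SENTS]
        seqs = tokenizer.texts_to_sequences(sentences)
        seqs = pad_sequences(seqs, maxlen=MAX_WORDS, padding="post", truncating="post")
        encoded[i, :len(seqs)] = seqs
    return encoded

# The tokenizer is fitted once on the training articles:
# tokenizer = Tokenizer(num_words=VOCAB_SIZE)
# tokenizer.fit_on_texts(training_articles)
```
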
## Start the server

Running
```
uvicorn app.app:app
```
starts the server at `http://127.0.0.1:8000`.
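
For orientation, the module behind `app.app:app` could be wired up roughly like this. The two endpoint paths match this README; everything else (model loading, response format) is an assumption, not the project's actual code.

```
# Sketch of a FastAPI app exposing the documented endpoints (illustrative only)
from fastapi import FastAPI
from fastapi.responses import HTMLResponse

app = FastAPI()

# In the real project, the trained HAN model and its tokenizer would be loaded here.

@app.get("/predict/")
def predict(text: str = ""):
    # Placeholder: preprocess `text`, run the model, return the predicted class.
    return {"text": text, "prediction": "<class label>"}

@app.get("/visualize/", response_class=HTMLResponse)
def visualize(text: str = ""):
    # Placeholder: render the sentence and word attention weights as HTML.
    return "<html><body>attention visualization goes here</body></html>"
```
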
### Access the API docs

After starting the server, the API docs can be accessed at `http://127.0.0.1:8000/docs`.
### Make predictions
Send an HTTP request to the running server, e.g.
```
curl --location --request GET 'http://127.0.0.1:8000/predict/?text='
```
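
Real article text contains spaces and umlauts, so the value of the `text` query parameter has to be URL-encoded. One convenient way (plain curl, nothing project-specific) is to let curl do the encoding:

```
curl -G --location 'http://127.0.0.1:8000/predict/' --data-urlencode 'text=<your article text>'
```
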
### Visualize prediction and attentions

Send an HTTP request to the running server, e.g.
```
http://127.0.0.1:8000/visualize/?text=
```
The server responds with a static HTML page that should look something like the following:
![Prediction and Attention Visualization Response](img/prediction_and_attention_visualization_response.png)