https://github.com/fengyh3/text-classification

Deep Learning for Text Classification in NLP

Host: GitHub
URL: https://github.com/fengyh3/text-classification
Owner: fengyh3
Created: 2020-04-04T12:37:07.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2020-04-14T13:54:57.000Z (over 6 years ago)
Last Synced: 2025-02-14T11:33:54.142Z (over 1 year ago)
Topics: tensorflow, text-classification
Language: Python
Size: 626 KB
Stars: 4
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Text-Classification

Deep Learning for Text Classification in NLP.

# Environment

Python 3 + TensorFlow 1.12+

# Dataset

Movie Review: the dataset is from [this website](http://www.cs.cornell.edu/people/pabo/movie-review-data/).

Yelp: the dataset is from the [Yelp academic review dataset](https://www.kaggle.com/yelp-dataset/yelp-dataset/version/2). Only the first 500,000 texts are used for training.
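Taking only the first 500,000 texts can be done without reading the whole file into memory. The following is a minimal sketch, assuming the reviews are stored as one JSON object per line with `text` and `stars` fields. The helper name `load_first_reviews` and the file layout are assumptions for illustration, not code from this repository.

```python
import itertools
import json


def load_first_reviews(path, limit=500000):
    """Read at most `limit` reviews from a JSON-lines file.

    Hypothetical helper: assumes one JSON object per line with
    "text" and "stars" keys.
    """
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        # islice stops after `limit` lines, so the rest of the file is never read.
        for line in itertools.islice(f, limit):
            record = json.loads(line)
            texts.append(record["text"])
            labels.append(int(record["stars"]))
    return texts, labels
```

Because the file is consumed lazily, memory use depends on `limit` rather than on the size of the full dataset.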

# Models

It currently contains five models: CNN, BiLSTM, BiLSTM + attention, FastText, and HAN. (To be continued...)

# Results

Accuracy results are shown below:

| Dataset | CNN | BiLSTM | BiLSTM + attention | FastText | RCNN_max-pooling | RCNN_average-pooling | HAN | Bert-Tiny | Bert-Mini |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Movie Review | 76.2% | 79.5% | 76.9% | 80.3% | 80.4% | 80.3% | - | 77.2% (dataset encoding issue) | 77.2% |
| Yelp | 65.1% | 68.2% | 70.2% | 69.5% | - | - | 70.5% | 72.5% | 74.8% |

# Tips

Note that the models do not include code for saving or loading TensorFlow models, but they do include visualization with TensorBoard. Moreover, the hyper-parameters are only roughly tuned, and FastText uses unigrams only. This is therefore a toy-level demo, intended for learning text classification.
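Since FastText here uses unigrams only, one possible extension is to add word bigrams as extra tokens. The following is a minimal sketch, assuming the input is already tokenized. `add_bigrams` is a hypothetical helper and is not part of this repository.

```python
def add_bigrams(tokens):
    """Return the unigram tokens followed by their adjacent-pair bigrams."""
    # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
    bigrams = [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams


tokens = "this movie is great".split()
features = add_bigrams(tokens)
# features: ['this', 'movie', 'is', 'great', 'this_movie', 'movie_is', 'is_great']
```

Each bigram would then need its own vocabulary entry (or a hashed bucket), which enlarges the embedding table.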

The Movie Review dataset is small and consists of short texts, so complicated DL methods may not perform as well as simpler DL methods or ML methods. In terms of training cost: RCNN > BiLSTM + attention ≈ BiLSTM > CNN >> FastText. In addition, the Movie Review dataset is encoded in 'windows-1252', which produces garbled text when training with BERT, so I could not get a good enough result.
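One way to work around the encoding problem described above is to convert the files to UTF-8 before tokenizing. The following is a minimal sketch, assuming the raw files really are windows-1252 (Python codec name `cp1252`). The helper name `convert_to_utf8` is hypothetical, and this has not been checked against the repository's BERT pipeline.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Re-encode a text file from `src_encoding` to UTF-8."""
    # errors="replace" substitutes U+FFFD for byte values the codec leaves undefined.
    with open(src_path, encoding=src_encoding, errors="replace") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```

Writing to a separate destination path keeps the original files untouched, so the conversion can be repeated if the assumed source encoding turns out to be wrong.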

The Yelp dataset is larger and its texts are longer. Due to limited computing resources, the models' hyper-parameters are not well tuned.

Future work will continue with Transformer, BERT, and so on.
