https://github.com/alanshaw-github/toyrnntext

This is a toy implementation of RNNText on the Zhihu tag classification dataset.

Host: GitHub
URL: https://github.com/alanshaw-github/toyrnntext
Owner: AlanShaw-GitHub
License: apache-2.0
Created: 2018-10-25T02:12:46.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2018-10-25T02:21:03.000Z (over 6 years ago)
Last Synced: 2025-04-12T03:14:54.926Z (20 days ago)
Topics: python, rnn, tensorflow, textclassification, zhihu
Language: Python
Homepage:
Size: 1.08 MB
Stars: 5
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# toyRNNText

Requirements:

- python >= 3.6
- tensorflow >= 1.10
- jieba
- gensim
- numpy
- pickle (part of the Python standard library)

### Screenshots:

![demo](demo.jpg)

This is a toy implementation of a common **text classification** model called RNNText.

It uses an end-to-end architecture, which takes a sentence as input and directly predicts the labels it belongs to.

Unlike traditional methods such as SVM, it uses neural networks to encode the rich information and correlations between sentences and their corresponding tags.
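To make the sentence-in, labels-out flow concrete, here is a minimal sketch in plain Python. It is not the code from this repository: the plain tanh recurrent cell, the independent sigmoid per label, the 0.5 threshold, and all names and sizes (`rnn_text_forward`, `random_matrix`, `w_xh`, `w_hh`, `w_hy`) are assumptions made for illustration. The real model is written in TensorFlow and may use a different cell and output layer.

```python
import math
import random


def rnn_text_forward(token_ids, embeddings, w_xh, w_hh, w_hy):
    """Run a plain tanh RNN over token ids and return one probability per label."""
    hidden_size = len(w_hh)
    h = [0.0] * hidden_size  # initial hidden state
    for tok in token_ids:
        x = embeddings[tok]  # embedding lookup for the current token
        # h_t = tanh(W_xh x_t + W_hh h_{t-1})
        h = [
            math.tanh(
                sum(w_xh[j][i] * x[i] for i in range(len(x)))
                + sum(w_hh[j][k] * h[k] for k in range(hidden_size))
            )
            for j in range(hidden_size)
        ]
    # Assumption: multi-label output, one independent sigmoid per label,
    # computed from the final hidden state.
    logits = [sum(w_hy[l][j] * h[j] for j in range(hidden_size)) for l in range(len(w_hy))]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights, only so that the sketch can run untrained."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
vocab_size, embed_dim, hidden_size, num_labels = 10, 4, 5, 3  # toy sizes
embeddings = random_matrix(vocab_size, embed_dim, rng)
w_xh = random_matrix(hidden_size, embed_dim, rng)
w_hh = random_matrix(hidden_size, hidden_size, rng)
w_hy = random_matrix(num_labels, hidden_size, rng)

# A "sentence" is a list of token ids (for Chinese text, e.g. after jieba segmentation).
probs = rnn_text_forward([1, 4, 2, 7], embeddings, w_xh, w_hh, w_hy)
predicted = [label for label, p in enumerate(probs) if p > 0.5]
```

With untrained random weights the probabilities are meaningless; the point is only the data flow from token ids to per-label scores.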

The model is extremely simple (the main model part takes fewer than 50 lines); we argue that the results are mainly achieved by tuning the hyper-parameters and applying empirical tricks.

We also found that adding an L2 regularization penalty to the final loss function significantly improves the results on the validation set, probably because neural network models easily overfit the training set.
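As an illustration of the idea, the sketch below adds an L2 term to a data loss in plain Python. The function names, the binary cross-entropy data loss, and the coefficient `l2_lambda` are assumptions for the example; the repository's actual loss and penalty weight are not stated in this README.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over labels (assumed data loss for this sketch)."""
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(probs, targets)
    ) / len(probs)


def l2_penalty(weight_matrices, l2_lambda):
    """l2_lambda times the sum of squared weights over all given matrices."""
    return l2_lambda * sum(w * w for m in weight_matrices for row in m for w in row)


def total_loss(probs, targets, weight_matrices, l2_lambda=1e-4):
    """Data loss plus the L2 penalty; larger weights make the loss larger."""
    return binary_cross_entropy(probs, targets) + l2_penalty(weight_matrices, l2_lambda)


weights = [[[1.0, -2.0], [0.5, 0.0]]]  # one toy 2x2 weight matrix
probs, targets = [0.9, 0.2], [1, 0]
data_loss = binary_cross_entropy(probs, targets)
loss = total_loss(probs, targets, weights, l2_lambda=0.01)
# The penalty here is 0.01 * (1.0 + 4.0 + 0.25) = 0.0525
```

Because the penalty grows with the squared magnitude of the weights, minimizing the total loss discourages large weights, which is the usual mechanism by which L2 regularization reduces overfitting.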

The original dataset is from the NLPCC website; see this link: http://tcci.ccf.org.cn/conference/2018/taskdata.php

The word embeddings come from a Google word2vec model pretrained on open-source Chinese Wikipedia dumps and are fine-tuned during training, which also improves the results on the validation set.
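One common way to use pretrained vectors is to build the initial embedding matrix from them and let training update it afterwards. The sketch below shows that initialization step in plain Python. The helper `build_embedding_matrix`, the toy vectors, and the random fallback for out-of-vocabulary words are assumptions for illustration, not the repository's code; in practice the vectors would come from a word2vec file loaded with gensim, and the matrix would initialize a trainable TensorFlow variable.

```python
import random


def build_embedding_matrix(vocab, pretrained, dim, seed=0, scale=0.1):
    """Return one row per vocabulary word.

    Words found in `pretrained` copy their vector; all others get a small
    random vector (an assumed fallback for out-of-vocabulary words).
    """
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
        else:
            matrix.append([rng.uniform(-scale, scale) for _ in range(dim)])
    return matrix


# Toy stand-in for pretrained word2vec vectors (real ones have hundreds of dimensions).
pretrained = {"知乎": [0.1, 0.2, 0.3], "问题": [0.4, 0.5, 0.6]}
vocab = ["<unk>", "知乎", "问题"]
embedding_matrix = build_embedding_matrix(vocab, pretrained, dim=3)
```

Fine-tuning then simply means the rows of this matrix are treated as trainable parameters rather than frozen constants.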

I will release the pretrained model, trained on 100k sentences (10k different labels), and the preprocessed data (also 100k, in pickle format). Note that the original dataset contains over 700k sentences (20k labels).

To use the pretrained model (100k), first download the cleaned dataset and the TensorFlow checkpoint from www.freedomworld.cn/toyRNNText, then put the dataset in the root path (./) and the checkpoints in ./model_path_large.
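Before running the model, it can help to check that the files are where they are expected. The sketch below loads a pickled dataset and checks that a checkpoint directory is non-empty. The file name `data_100k.pkl`, the dictionary layout of the dataset, and both helper functions are hypothetical; only the `./model_path_large` directory name comes from the instructions above. The demo runs in a temporary directory so that it does not touch real files.

```python
import os
import pickle
import tempfile


def load_pickled_dataset(path):
    """Load a pickled dataset. Only unpickle files from a source you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


def checkpoint_dir_ready(model_dir):
    """True if the directory exists and contains at least one file."""
    return os.path.isdir(model_dir) and len(os.listdir(model_dir)) > 0


with tempfile.TemporaryDirectory() as root:
    # Hypothetical file name and layout, standing in for the downloaded dataset.
    data_path = os.path.join(root, "data_100k.pkl")
    with open(data_path, "wb") as f:
        pickle.dump({"sentences": [["知乎", "问题"]], "labels": [[3, 7]]}, f)

    model_dir = os.path.join(root, "model_path_large")
    os.makedirs(model_dir)
    ready_before = checkpoint_dir_ready(model_dir)  # directory is still empty

    # Stand-in for the downloaded TensorFlow checkpoint files.
    open(os.path.join(model_dir, "checkpoint"), "w").close()

    dataset = load_pickled_dataset(data_path)
    ready_after = checkpoint_dir_ready(model_dir)
```

In a real checkout, `root` would be the repository root and the two paths would point at the downloaded files.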
