Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Text classification based on Transformers
https://github.com/zhanlaoban/transformers_for_text_classification
- Host: GitHub
- URL: https://github.com/zhanlaoban/transformers_for_text_classification
- Owner: zhanlaoban
- Created: 2019-12-19T02:12:13.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2021-09-17T02:24:43.000Z (over 3 years ago)
- Last Synced: 2025-01-12T19:08:20.806Z (12 days ago)
- Topics: nlp, text-classification, transformers
- Language: Python
- Size: 39.9 MB
- Stars: 338
- Watchers: 8
- Forks: 63
- Open Issues: 3
Metadata Files:
- Readme: README.md
README
# Transformers_for_Text_Classification

Text classification based on Transformers.

This project is a refactoring of the [transformers](https://github.com/huggingface/transformers/releases/tag/v2.2.2) v2.2.2 code released by [huggingface](https://github.com/huggingface). To guarantee that the code stays reproducible without compatibility problems, the [transformers](https://github.com/huggingface/transformers/releases/tag/v2.2.2) library is vendored locally in this repository and called from there.
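A small illustration of what vendoring means in practice (a minimal sketch; the checkpoint name is only an example): because the `transformers` folder sits at the repository root, scripts run from the root import the local v2.2.2 copy rather than any pip-installed release.

```python
# Sketch: run from the repository root, this resolves to ./transformers (the
# vendored v2.2.2 copy), not to a pip-installed transformers release.
from transformers import BertConfig, BertTokenizer

config = BertConfig.from_pretrained("bert-base-chinese")       # example checkpoint
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
```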
# Highlights

- Supports stacking various feature extractors on top of the transformer model (see the sketch after the Support list below)
- Includes prediction code for the test set
- Trims the original transformers code to better fit the text classification task
- Cleaner, more sensible logging output to the terminal

# Support
**model_type:**
- [x] bert
- [x] bert_cnn
- [x] bert_lstm
- [x] bert_gru
- [x] xlnet
- [ ] xlnet_cnn
- [x] xlnet_lstm
- [x] xlnet_gru
- [ ] albert
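To make the naming concrete, here is a minimal sketch, not the repository's actual class, of what a `bert_lstm`-style model looks like: a BERT encoder whose token-level outputs feed a bidirectional LSTM feature extractor and a final classification layer.

```python
import torch.nn as nn
from transformers import BertModel  # vendored copy

class BertLstmSketch(nn.Module):
    """Illustrative bert_lstm-style model: BERT tokens -> BiLSTM -> FC."""

    def __init__(self, num_labels, lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")  # example checkpoint
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask=None):
        # transformers v2.2.2 returns a tuple; [0] is (batch, seq_len, hidden)
        sequence_output = self.bert(input_ids, attention_mask=attention_mask)[0]
        lstm_out, _ = self.lstm(sequence_output)
        # use the final timestep's features as the sequence representation
        return self.classifier(lstm_out[:, -1, :])
```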
# Content

- dataset: stores the datasets
- pretrained_models: stores the pre-trained models
- transformers: the locally vendored transformers library
- results: stores the training results

# Usage
## 1. Choosing a model

**Set the `model_type` parameter in the shell script to select a model.**

For example, for BERT followed by a fully connected (FC) layer, set `model_type=bert`; for BERT followed by a CNN layer, set `model_type=bert_cnn`.

The `Support` section of this README lists the `model_type` values supported for each pre-trained model in this project.

Finally, run the shell script directly from the terminal, e.g.:
```
bash run_classifier.sh
```

**Note:** the Chinese RoBERTa, ERNIE, and BERT_wwm pre-trained language models are all loaded with BERT's `model_type` (`model_type=bert`).
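Internally, a `model_type` value typically selects a (config, model, tokenizer) triple, which is also why the three checkpoints above can share `model_type=bert`. A hypothetical sketch of such a dispatch table (illustrative names, not the repository's actual code):

```python
from transformers import (
    BertConfig, BertForSequenceClassification, BertTokenizer,
    XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer,
)

# Hypothetical dispatch table: the model_type set in the shell script chooses
# the classes used to build the model. Chinese RoBERTa / ERNIE / BERT_wwm
# checkpoints are BERT-shaped, so they reuse the "bert" entry.
MODEL_CLASSES = {
    "bert": (BertConfig, BertForSequenceClassification, BertTokenizer),
    "xlnet": (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer),
}

config_class, model_class, tokenizer_class = MODEL_CLASSES["bert"]
model = model_class.from_pretrained("bert-base-chinese", num_labels=5)  # example
```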
## 2. Using a custom dataset

1. Put your custom dataset folder under the `dataset` folder, e.g. `TestData`.
2. In `utils.py` at the repository root, write your own class modeled on `class THUNewsProcessor`, e.g. `class TestDataProcessor`, and add the corresponding entries to the three dicts `tasks_num_labels`, `processors`, and `output_modes` (see the sketch after this list).
3. Finally, in the shell script you want to run, set TASK_NAME to your task name, e.g. `TestData`.
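A hedged sketch of step 2, assuming `THUNewsProcessor` follows the standard `DataProcessor` pattern from transformers v2.2.2 (the file layout, label set, and dict keys below are illustrative, not the repository's actual code):

```python
import os
from transformers import DataProcessor, InputExample  # vendored copy

class TestDataProcessor(DataProcessor):
    """Illustrative processor for a custom dataset under dataset/TestData."""

    def get_labels(self):
        return ["0", "1"]  # assumption: two classes; adjust to your data

    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def _create_examples(self, lines, set_type):
        # assumption: each TSV line is "text<TAB>label"
        return [InputExample(guid=f"{set_type}-{i}", text_a=line[0], label=line[1])
                for i, line in enumerate(lines)]

# Then register the task in utils.py (keys are illustrative):
# tasks_num_labels["testdata"] = 2
# processors["testdata"] = TestDataProcessor
# output_modes["testdata"] = "classification"
```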
# Environment

- one 2080Ti, 12GB RAM
- Python: 3.6.5
- PyTorch: 1.3.1
- TensorFlow: 1.14.0 (only needed to support TensorBoard; serves no other purpose)
- Numpy: 1.14.6

# Performance
- Dataset: THUNews/5_5000
- Epochs: 1
- train_steps: 5000
| model | dev set best F1 / Acc | remark |
| ------------------ | -------------------------- | ------------------------------------------------- |
| bert_base | 0.9308869881728941, 0.9324 | BERT + FC layer, batch_size 8, learning_rate 2e-5 |
| bert_base+cnn | 0.9136314735833212, 0.9156 | BERT + CNN layer, batch_size 8, learning_rate 2e-5 |
| bert_base+lstm | 0.9369254464106703, 0.9372 | BERT + LSTM layer, batch_size 8, learning_rate 2e-5 |
| bert_base+gru | 0.9379539112313108, 0.938 | BERT + GRU layer, batch_size 8, learning_rate 2e-5 |
| roberta_large | | RoBERTa + FC layer, batch_size 2, learning_rate 2e-5 |
| xlnet_mid | 0.9530066512880131, 0.954 | XLNet + FC layer, batch_size 2, learning_rate 2e-5 |
| xlnet_mid+lstm | 0.9269927348553552, 0.9304 | XLNet + LSTM layer, batch_size 2, learning_rate 2e-5 |
| xlnet_mid+gru | 0.9494631023945569, 0.9508 | XLNet + GRU layer, batch_size 2, learning_rate 2e-5 |
| albert_xlarge_183k | | |

# Download Chinese Pre-trained Models
[NLP_PEMDC](https://github.com/zhanlaoban/NLP_PEMDC)