https://github.com/lunarwhite/tan-division
Chinese corpus sentiment analysis: sentiment classification of Chinese text from Tan Songbo's hotel review dataset (谭松波酒店评论).
deep-learning keras lstm nlp python rnn tensorflow
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/lunarwhite/tan-division
- Owner: lunarwhite
- License: mit
- Created: 2021-05-24T02:20:19.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2025-03-11T16:57:26.000Z (3 months ago)
- Last Synced: 2025-04-30T04:48:35.443Z (about 1 month ago)
- Topics: deep-learning, keras, lstm, nlp, python, rnn, tensorflow
- Language: Python
- Homepage:
- Size: 1.26 MB
- Stars: 57
- Watchers: 1
- Forks: 12
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# tan-division




Try Chinese corpus sentiment analysis with TensorFlow + Keras.
```
├───.gitignore
├───README.md
├───requirements.txt
├───res
│   ├───datanew
│   │   ├───neg
│   │   └───pos
│   └───word-vector
│       └───sgns.zhihu.bigram.bz2
├───src
│   └───run.py
└───tmp
    └───weights.hdf5
```

## 1 Overview
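- Sentiment analysis of Chinese text on Prof. Tan Songbo's hotel review dataset, framed as a binary classification problem
- The dataset has two labels, `pos` and `neg`, with 2000 txt files each
- RNN, LSTM, and Bi-LSTM serve as the base models, built and trained with Keras
- Main package versions: TensorFlow `2.0.0`, Keras `2.3.1`, and Python `3.6.2`
- Reaches a stable 92% accuracy on the test set

## 2 Setup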
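- Clone the repo: `git clone https://github.com/lunarwhite/tan-division.git`
- Upgrade pip: `pip3 install --upgrade pip`
- Create a virtual environment for the project: `conda create --name <env-name> python=3.6` (where `<env-name>` is a name of your choice)
- Activate the env: `conda activate <env-name>`
- Install the Python dependencies: `pip3 install -r requirements.txt`
- Download the pre-trained [Chinese word vectors](https://github.com/Embedding/Chinese-Word-Vectors). This project uses [Zhihu_QA Word + Ngram](https://pan.baidu.com/s/1OQ6fQLCgqT43WTwh5fh_lg); place the file under `res/word-vector`

## 3 Train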
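Training relies on the word vectors placed under `res/word-vector` during setup. The following is a minimal sketch of reading such a file, assuming it uses the word2vec text format (a `vocab_size dim` header line followed by one `word v1 ... vd` row per word). The helper names `parse_word2vec_text` and `load_bz2_vectors` and the `limit` parameter are hypothetical and are not taken from `src/run.py`.

```python
import bz2
import itertools
from typing import Dict, Iterable, List


def parse_word2vec_text(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse word2vec text format: a header line, then one vector per line."""
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])  # header is assumed to be "<vocab_size> <dim>"
    vectors: Dict[str, List[float]] = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip rows that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def load_bz2_vectors(path: str, limit: int = 50000) -> Dict[str, List[float]]:
    """Read the header plus at most `limit` rows from a bz2-compressed file."""
    with bz2.open(path, "rt", encoding="utf-8", errors="ignore") as f:
        return parse_word2vec_text(itertools.islice(f, limit + 1))


# Hypothetical usage with the path from the project tree:
# vectors = load_bz2_vectors("res/word-vector/sgns.zhihu.bigram.bz2")
```

Capping the number of rows keeps memory use manageable for a small corpus. A library loader such as gensim's `KeyedVectors.load_word2vec_format` is a common alternative for the same file format.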
- Run: `python src/run.py`
- Tune: edit the common parameters in `src/run.py`, shown below
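```python
# Assumed import path; only needed to run this excerpt on its own
from tensorflow.keras.optimizers import Nadam

my_lr = 1e-2  # initial learning rate
my_test_size = 0.1  # test set ratio
my_validation_split = 0.1  # validation set ratio
my_epochs = 40  # number of training epochs
my_batch_size = 128  # batch size
my_dropout = 0.2  # dropout rate
my_optimizer = Nadam(lr=my_lr)  # optimizer
my_loss = 'binary_crossentropy'  # loss function
```

## 4 Workflow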
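As a quick orientation on dataset size, here is a rough sketch of how the split parameters from section 3 would divide the 4000 reviews (2000 `pos` plus 2000 `neg`). It assumes the test split is taken first and the validation split is then taken from the remainder, which is how scikit-learn's `train_test_split` followed by Keras's `validation_split` would behave. Whether `src/run.py` splits in exactly this order is an assumption, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_total, test_size, validation_split, batch_size):
    """Estimate split sizes and batches per epoch (assumed split order)."""
    n_test = round(n_total * test_size)       # held-out test samples
    n_fit = n_total - n_test                  # samples passed to training
    n_val = round(n_fit * validation_split)   # validation samples taken from those
    n_train = n_fit - n_val                   # samples actually trained on
    steps = math.ceil(n_train / batch_size)   # batches per epoch
    return {"test": n_test, "val": n_val, "train": n_train, "steps_per_epoch": steps}


sizes = split_sizes(n_total=4000, test_size=0.1, validation_split=0.1, batch_size=128)
print(sizes)
# {'test': 400, 'val': 360, 'train': 3240, 'steps_per_epoch': 26}
```

Libraries round the split boundaries slightly differently, so the exact counts may differ by one sample in other configurations.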
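- Inspect the data
  - Dataset size
  - Dataset samples
  - Sample length
- Preprocess the data
  - Tokenization
  - Padding short sentences, truncating long ones
  - Indexing
  - Building word vectors
- Build the models
  - RNN
  - LSTM
  - Bi-LSTM
- Visual analysis
  - epochs-loss
  - epochs-accuracy
- Debugging
  - callback
  - checkpoint
- Improve the models
  - loss function
  - optimizer
  - learning rate
  - epochs
  - batch_size
  - dropout
  - early-stopping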
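
The indexing and padding/truncation steps of the preprocessing stage can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in `src/run.py`: the input is assumed to be already tokenized (for example by a Chinese word segmenter), index 0 is reserved for both padding and unknown tokens, and padding and truncation happen at the front of the sequence, matching the defaults of Keras's `pad_sequences`. The helper names are hypothetical.

```python
from typing import Dict, List


def build_vocab(tokenized: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct token an index starting at 1 (0 is reserved)."""
    vocab: Dict[str, int] = {}
    for tokens in tokenized:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def to_indices(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to indices; unknown tokens fall back to 0."""
    return [vocab.get(tok, 0) for tok in tokens]


def pad_or_truncate(seq: List[int], max_len: int, pad_value: int = 0) -> List[int]:
    """Left-pad short sequences and keep the last max_len items of long ones."""
    if len(seq) >= max_len:
        return seq[-max_len:]
    return [pad_value] * (max_len - len(seq)) + seq


docs = [["酒店", "很", "干净"], ["服务", "很", "差", "不", "推荐"]]
vocab = build_vocab(docs)
batch = [pad_or_truncate(to_indices(d, vocab), max_len=4) for d in docs]
print(batch)
# [[0, 1, 2, 3], [2, 5, 6, 7]]
```

In practice `max_len` is usually chosen from the sample-length distribution observed in the first workflow step, so that most reviews fit without truncation.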