{"id":18552999,"url":"https://github.com/lunarwhite/tan-division","last_synced_at":"2025-04-30T04:48:47.529Z","repository":{"id":39717188,"uuid":"370203563","full_name":"lunarwhite/tan-division","owner":"lunarwhite","description":"Chinese corpus sentiment analysis. 谭松波酒店评论中文文本情感分析","archived":false,"fork":false,"pushed_at":"2025-03-11T16:57:26.000Z","size":1321,"stargazers_count":57,"open_issues_count":0,"forks_count":12,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-30T04:48:35.443Z","etag":null,"topics":["deep-learning","keras","lstm","nlp","python","rnn","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lunarwhite.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-05-24T02:20:19.000Z","updated_at":"2025-04-18T03:58:11.000Z","dependencies_parsed_at":"2024-06-07T02:27:11.423Z","dependency_job_id":"f39d8e4e-3e36-4168-b6c7-ad3cdd4e5fd2","html_url":"https://github.com/lunarwhite/tan-division","commit_stats":null,"previous_names":["lunarwhite/tan-division"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lunarwhite%2Ftan-division","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lunarwhite%2Ftan-division/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lunarwhite%2Ftan-division/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lunarwhite%2Ftan-division/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lunarwhite","download_url":"https://codeload.github.com/lunarwhite/tan-division/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251644827,"owners_count":21620630,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","keras","lstm","nlp","python","rnn","tensorflow"],"created_at":"2024-11-06T21:15:46.410Z","updated_at":"2025-04-30T04:48:47.509Z","avatar_url":"https://github.com/lunarwhite.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tan-division\r\n\r\n![GitHub Repo stars](https://img.shields.io/github/stars/lunarwhite/tan-division?color=orange)\r\n![GitHub watchers](https://img.shields.io/github/watchers/lunarwhite/tan-division?color=yellow)\r\n![GitHub forks](https://img.shields.io/github/forks/lunarwhite/tan-division?color=green)\r\n![GitHub top language](https://img.shields.io/github/languages/top/lunarwhite/tan-division)\r\n![GitHub License](https://img.shields.io/github/license/lunarwhite/tan-division?color=white)\r\n\r\nTry Chinese corpus sentiment analysis with TensorFlow + Keras.\r\n\r\n```\r\n├───.gitignore \r\n├───README.md\r\n├───requirements.txt\r\n├───res\r\n│   ├───datanew\r\n│   │   ├───neg\r\n│   │   └───pos\r\n│   └───word-vector\r\n│       └───sgns.zhihu.bigram.bz2\r\n├───src\r\n│   └───run.py\r\n└───tmp\r\n    └───weights.hdf5\r\n```\r\n\r\n## 1 Overview\r\n\r\n- 基于谭松波老师的酒店评论数据集的中文文本情感分析，二分类问题\r\n- 数据集标签有 `pos` 和 `neg`，分别 2000 条 txt 文本\r\n- 选择 RNN、LSTM 和 Bi-LSTM 作为基础模型，借助 Keras 搭建训练\r\n- 主要工具包版本为 TensorFlow `2.0.0`、Keras `2.3.1` 和 Python `3.6.2`\r\n- 在测试集上可稳定达到 92% 的准确率\r\n\r\n## 2 Setup\r\n\r\n- clone repo：`git clone https://github.com/lunarwhite/tan-division.git`\r\n- 更新 pip：`pip3 install --upgrade pip`\r\n- 为项目创建虚拟环境：`conda create --name \u003cenv_name\u003e python=3.6`\r\n- 激活 env：`conda activate \u003cenv_name\u003e`\r\n- 安装 Python 库依赖：`pip3 install -r requirements.txt`\r\n- 下载封装好的[中文词向量](https://github.com/Embedding/Chinese-Word-Vectors)，本项目选取 [Zhihu_QA Word + Ngram](https://pan.baidu.com/s/1OQ6fQLCgqT43WTwh5fh_lg)，放在 `res/word-vector` 路径下\r\n\r\n## 3 Train\r\n\r\n- 运行：`python src/run.py`\r\n- 调参：在 `src/run.py` 文件中修改常用参数，如下\r\n    ```python\r\n    my_lr = 1e-2 # 初始学习率\r\n    my_test_size = 0.1\r\n    my_validation_split = 0.1 # 验证集比例\r\n    my_epochs = 40 # 训练轮数\r\n    my_batch_size = 128 # 批大小\r\n    my_dropout = 0.2 # dropout 参数大小\r\n    \r\n    my_optimizer = Nadam(lr=my_lr) # 优化方法\r\n    my_loss = 'binary_crossentropy' # 损失函数\r\n    ```\r\n\r\n## 4 Workflow\r\n\r\n- 观察数据\r\n  - 数据集大小\r\n  - 数据集样本\r\n  - 样本长度\r\n- 数据预处理\r\n  - 分词\r\n  - 短句补全、长句裁剪\r\n  - 索引化\r\n  - 构建词向量\r\n- 搭建模型\r\n  - RNN\r\n  - LSTM\r\n  - Bi-LSTM\r\n- 可视化分析\r\n  - epochs-loss\r\n  - epochs-accuracy\r\n- 调试\r\n  - callback\r\n  - checkpoint\r\n- 改进模型\r\n  - loss function\r\n  - optimizer\r\n  - learning rate\r\n  - epochs\r\n  - batch_size\r\n  - dropout\r\n  - early-stopping\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flunarwhite%2Ftan-division","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flunarwhite%2Ftan-division","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flunarwhite%2Ftan-division/lists"}