https://github.com/cudavailable/naive-bayes-classifier-demo

Text topic classification using a simple multinomial naive Bayes implementation
implementation-of-algorithms naive-bayes-classifier news-classification


Host: GitHub
URL: https://github.com/cudavailable/naive-bayes-classifier-demo
Owner: cudavailable
License: mit
Created: 2024-12-02T12:00:05.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2024-12-03T08:24:44.000Z (about 1 year ago)
Last Synced: 2025-03-21T07:29:31.655Z (9 months ago)
Topics: implementation-of-algorithms, naive-bayes-classifier, news-classification
Language: Python
Homepage:
Size: 20.5 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

## Naive-Bayes-Classifier-Demo
A from-scratch multinomial naive Bayes classifier, trained on a news text dataset and evaluated with 10-fold cross-validation.
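
As a rough illustration of the technique, here is a minimal sketch of a multinomial naive Bayes classifier with add-one (Laplace) smoothing. It is not the code in model.py: the class name, method names, smoothing choice and toy data are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Minimal multinomial naive Bayes over tokenized documents (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # add-alpha (Laplace) smoothing strength, an assumed default

    def fit(self, docs, labels):
        # docs: list of token lists; labels: list of class names
        self.vocab = sorted({tok for doc in docs for tok in doc})
        class_docs = Counter(labels)
        token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            token_counts[label].update(doc)

        n_docs = len(docs)
        # log P(class): fraction of training documents in each class
        self.log_prior = {c: math.log(n / n_docs) for c, n in class_docs.items()}
        # log P(token | class): smoothed relative token frequency within the class
        self.log_likelihood = {}
        for c in class_docs:
            total = sum(token_counts[c].values()) + self.alpha * len(self.vocab)
            self.log_likelihood[c] = {
                tok: math.log((token_counts[c][tok] + self.alpha) / total)
                for tok in self.vocab
            }
        return self

    def predict(self, doc):
        # Score each class in log space; tokens outside the vocabulary are skipped
        scores = {}
        for c, prior in self.log_prior.items():
            ll = self.log_likelihood[c]
            scores[c] = prior + sum(ll[tok] for tok in doc if tok in ll)
        return max(scores, key=scores.get)


# Toy data, made up for this example
train_docs = [["stock", "rise"], ["stock", "fund"], ["team", "match"], ["match", "goal"]]
train_labels = ["finance", "finance", "sports", "sports"]
clf = MultinomialNB().fit(train_docs, train_labels)
print(clf.predict(["stock", "fund", "rise"]))
print(clf.predict(["team", "goal"]))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which matters for long news articles.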

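## File Overview
- dataset.py : reads the text data by category from the given data directory, for training and testing
- model.py : contains a self-implemented multinomial naive Bayes classifier class
- train.py : preprocesses the input text before training, trains the model, and evaluates it with 10-fold cross-validation
- logger.py : contains a logger class, used only to write key experiment records to both the console and a log file at the given path
- main.py : sets the hyperparameters and calls the training function
- stopwords_cn.txt : the stop-word list
- log : contains the output log, log.txt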
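
The 10-fold evaluation mentioned for train.py can be sketched as follows. This is an assumed, standard-library-only version, not the repository's code: the function names, the fixed shuffle seed and the `train_and_score` callback are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


def cross_validate(samples, labels, train_and_score, k=10):
    """Average the score returned by `train_and_score` over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        score = train_and_score(
            [samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in test_idx], [labels[i] for i in test_idx],
        )
        scores.append(score)
    return sum(scores) / len(scores)


# Each sample lands in exactly one test fold
for train_idx, test_idx in k_fold_indices(25, k=10):
    print(len(train_idx), len(test_idx))
```

Here `train_and_score` stands for any function that fits a model on the training split and returns a score, such as accuracy, on the held-out split.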

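## Usage
1. git clone this repository to your local machine;
2. Download the THUCNews news dataset (http://thuctc.thunlp.org/);
3. Extract the downloaded data to a location with enough disk space, then check that the extracted text is not garbled and that there are 14 category subfolders (see the official site above for details on the dataset);
4. Adjust the parameters in the main function. For example, max_text_cnt is the number of texts used per news category and can be tuned as needed, but data_dir must be the absolute path of the downloaded dataset;
5. Run the main function and follow training and evaluation progress on the console and in the log.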
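
The folder check in step 3 and the max_text_cnt limit in step 4 can be sketched as below. This is an assumption about how such a loader might look, not the contents of dataset.py: the function name, the `*.txt` file pattern and the UTF-8 encoding are all hypothetical choices.

```python
from pathlib import Path


def load_texts(data_dir, max_text_cnt=500, encoding="utf-8"):
    """Read up to `max_text_cnt` .txt files from each category subfolder."""
    root = Path(data_dir)
    # Every direct subfolder is treated as one news category
    categories = sorted(p.name for p in root.iterdir() if p.is_dir())
    texts, labels = [], []
    for category in categories:
        files = sorted((root / category).glob("*.txt"))[:max_text_cnt]
        for path in files:
            # errors="ignore" drops undecodable bytes instead of raising
            texts.append(path.read_text(encoding=encoding, errors="ignore"))
            labels.append(category)
    return texts, labels, categories


# Hypothetical usage; replace the path with the absolute path of your extracted dataset:
# texts, labels, categories = load_texts("/absolute/path/to/THUCNews", max_text_cnt=500)
# print(len(categories))  # the usage steps above expect 14 category subfolders
```

Printing the number of categories right after loading is a cheap way to catch a wrong data_dir or an incomplete extraction before training starts.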

## My Configuration
max_text_cnt = 500
max_features = 5000
(See the log for other training and evaluation details)
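
The README does not define max_features. A common reading, assumed here, is that it caps the vocabulary at the most frequent tokens once stop words are removed. The sketch below illustrates that reading only; the function names and toy data are hypothetical, and train.py may do something different.

```python
from collections import Counter


def build_vocabulary(tokenized_docs, stopwords, max_features=5000):
    """Map the `max_features` most frequent non-stop-word tokens to column indices."""
    counts = Counter(
        tok for doc in tokenized_docs for tok in doc if tok not in stopwords
    )
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(max_features))}


def to_count_vector(doc, vocabulary):
    """Turn one tokenized document into a bag-of-words count vector."""
    vector = [0] * len(vocabulary)
    for tok in doc:
        idx = vocabulary.get(tok)
        if idx is not None:  # tokens outside the vocabulary are dropped
            vector[idx] += 1
    return vector


# Toy data, made up for this example
docs = [["the", "market", "rose"], ["the", "market", "rose"], ["the", "market", "won"]]
vocab = build_vocabulary(docs, stopwords={"the"}, max_features=2)
print(vocab)
print(to_count_vector(["the", "market", "market", "rose"], vocab))
```

A smaller max_features keeps the count vectors short and training fast, at the cost of discarding rare tokens that may still carry topic information.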
