{"id":14958285,"url":"https://github.com/lining0806/textmining","last_synced_at":"2025-04-07T11:05:41.254Z","repository":{"id":80295367,"uuid":"46331578","full_name":"lining0806/TextMining","owner":"lining0806","description":"Python text mining system (Research of Text Mining System)","archived":false,"fork":false,"pushed_at":"2018-03-02T06:01:15.000Z","size":3971,"stargazers_count":341,"open_issues_count":2,"forks_count":154,"subscribers_count":35,"default_branch":"master","last_synced_at":"2025-04-07T11:05:32.588Z","etag":null,"topics":["jieba","sklearn","stopwords","text-mining","tf-idf","user-dict"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lining0806.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2015-11-17T07:55:46.000Z","updated_at":"2025-04-02T07:37:25.000Z","dependencies_parsed_at":null,"dependency_job_id":"8108e3ed-946c-4834-9321-86e4d5f22b51","html_url":"https://github.com/lining0806/TextMining","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lining0806%2FTextMining","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lining0806%2FTextMining/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lining0806%2FTextMining/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lining0806%2FTextMining/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lining0806","download_url":"https://codeload.github.com/lining0806/TextMining/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247640463,"owners_count":20971557,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["jieba","sklearn","stopwords","text-mining","tf-idf","user-dict"],"created_at":"2024-09-24T13:16:40.621Z","updated_at":"2025-04-07T11:05:41.226Z","avatar_url":"https://github.com/lining0806.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Text Mining System\r\n\r\n***\r\n\r\n## System Overview\r\n\r\n* Integrates **text filtering and deduplication** with **real-time email notification**\r\n* Integrates **text keyword extraction**\r\n* Integrates **text classification**, i.e. **tagging**\r\n* Integrates **text recommendation**, i.e. **hot-topic evaluation**\r\n* **Supports Chinese and English**\r\n\r\n## System Architecture\r\n![image](Architecture-of-Text-Mining-System.png)\r\n\r\n## About Word Segmentation\r\n**English text is tokenized with the nltk toolkit**\r\n\r\n\tpip install nltk\r\n\r\n**Chinese text is segmented with the jieba toolkit**\r\n\r\n\tpip install jieba\r\n\r\n**jieba dictionaries**\r\n\r\n\tdict       main dictionary file\r\n\tuser_dict  user dictionary file, i.e. the segmentation whitelist\r\n\r\n**user_dict is the segmentation whitelist**\r\n* If an added filter word (from the blacklist or the whitelist) is not segmented correctly by jieba, add the word and its frequency to the main dictionary file dict or to the user dictionary user_dict, one word per line (the frequency may be omitted)\r\n\r\n## About Stop Words, the Blacklist, and the Whitelist\r\n\r\n**stopwords holds the stop words**\r\n* Stop words can be added at any time, one per line\r\n\r\n**blackwords is the blacklist of filter words**\r\n* Filter words can be added at any time, one per line\r\n\r\n**writewords is the whitelist of keywords**\r\n* Keywords can be added at any time, one per line\r\n\r\n## About Feature Words\r\n\r\n* Feature words are used for classification and for computing text features\r\n* Feature words can be selected by their term frequency in the training set\r\n* The number of feature words (the feature dimension) is configurable\r\n\r\n## About Configuration\r\n\r\n**config file:**\r\n* Configures the server connection and the fields (columns) of the specified collection in the database\r\n* Limits the number of database entries processed; by default, entries are taken from the most recent backwards in time\r\n* Selects the language (Chinese or English)\r\n* Sets the dimension of the classification feature-word dictionary\r\n* Sets whether email notifications are sent\r\n* Enables an acceleration mode. When classification is accelerated, the text feature words and the classification model are frozen. To test a different feature dictionary dimension, different classifier features, or a different algorithm, disable acceleration first.\r\n\r\n**Program files:**\r\n* Feature dictionary generation can be changed to use either a word's term frequency or its document frequency (the number of documents containing the word)\r\n* The text filtering and deduplication algorithms can be changed\r\n* The keyword extraction algorithm can be changed: feature-word-based, Tf-based, IDf-based, or TfIDf-based extraction; the method for selecting the top K keywords can also be changed\r\n* Feature generation for the training and test sets, which is based on the feature words, can be changed: Bool, Tf, IDf (not discriminative), or TfIDf features; feature selection or dimensionality reduction can optionally be applied\r\n* The text classification algorithm can be changed: SVC, LinearSVC, MultinomialNB, LogisticRegression, KNeighborsClassifier, or DecisionTreeClassifier; the hyperparameter search method can also be changed\r\n* The text recommendation algorithm can be changed\r\n\r\n## Other Notes\r\n* After changing the segmentation files dict, user_dict, or lag,\r\nmanually delete the datas folder before rerunning\r\n\r\n* After changing the training set,\r\nmanually delete all_words_dict and train_datas before rerunning\r\n\r\n* After changing stopwords, blackwords, writewords, or fea_dict_size,\r\nsimply rerun the program\r\n\r\n## About Environment Setup\r\n\r\n**Installing numpy, scipy, and matplotlib on Ubuntu**\r\n\r\n    sudo apt-get update\r\n    sudo apt-get install git g++ gfortran\r\n    sudo apt-get install python-dev python-setuptools python-pip\r\n    \r\n    sudo apt-get install libblas-dev liblapack-dev libatlas-base-dev\r\n    export BLAS=/usr/lib/libblas/libblas.so\r\n    export LAPACK=/usr/lib/lapack/liblapack.so\r\n    export ATLAS=/usr/lib/atlas-base/libatlas.so\r\n    \r\n    sudo apt-get install python-numpy\r\n    sudo apt-get install python-scipy\r\n    sudo apt-get install python-matplotlib\r\n    # or\r\n    sudo pip install numpy\r\n    sudo pip install scipy\r\n    sudo pip install matplotlib\r\n    \r\n    sudo pip install jieba\r\n    sudo pip install scikit-learn\r\n    sudo pip install simplejson\r\n    sudo pip install pymongo\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flining0806%2Ftextmining","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flining0806%2Ftextmining","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flining0806%2Ftextmining/lists"}