https://github.com/yiyepiaoling0715/unsupervised_extract_detect_words
multiprocess unsupervised chinese_detect_words ngram_combination
detect entropy hotword-detection multiprocessing mutual-information ngram pmi recursive segment unsupervised-learning
- Host: GitHub
- URL: https://github.com/yiyepiaoling0715/unsupervised_extract_detect_words
- Owner: yiyepiaoling0715
- Created: 2019-01-02T07:59:58.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2019-01-02T08:51:13.000Z (almost 6 years ago)
- Last Synced: 2024-08-29T17:28:44.856Z (4 months ago)
- Topics: detect, entropy, hotword-detection, multiprocessing, mutual-information, ngram, pmi, recursive, segment, unsupervised-learning
- Language: Python
- Size: 7.19 MB
- Stars: 25
- Watchers: 2
- Forks: 6
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
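1. Approach: adapted from an earlier blog post that mined new words from Renren (人人网) data, with improvements and optimizations.
2. Original approach: segment the documents with jieba, take adjacent words in groups of three, and for each two-word combination compute its left entropy, right entropy, and internal cohesion. A score is derived from these values, and new words are selected by score.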
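3. Improvements:
   1. The original could only combine two words; this is generalized to combining N adjacent words.
   2. The internal mutual information (the cohesion score) is normalized to the value it would have at a length of one word, so that candidates of different lengths can be compared on the same scale.
   3. Multiprocessing is used to speed up execution.
   4. A filtering mechanism is added, based on stop words, high-frequency common words, and similar lists.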
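The scoring described in items 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the code in segment_multi.py: the function names (`entropy`, `score_candidates`), the parameters `max_n` and `min_count`, the per-word division used for length normalization, and the choice to multiply cohesion by the smaller of the two boundary entropies are all assumptions made for the example. The input is assumed to be a list of words that has already been segmented (for instance by jieba); multiprocessing and stop-word filtering are left out.

```python
import math
from collections import Counter, defaultdict


def entropy(neighbors):
    """Shannon entropy (natural log) of a Counter of neighboring words."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())


def score_candidates(tokens, max_n=4, min_count=2):
    """Score every run of 2..max_n adjacent words in a segmented token list.

    Hypothetical helper: the scoring formula is an assumption, not the
    repository's actual implementation.
    """
    total = len(tokens)
    unigram = Counter(tokens)
    ngram = Counter()
    left = defaultdict(Counter)   # words seen immediately before each n-gram
    right = defaultdict(Counter)  # words seen immediately after each n-gram

    for n in range(2, max_n + 1):
        for i in range(total - n + 1):
            gram = tuple(tokens[i:i + n])
            ngram[gram] += 1
            if i > 0:
                left[gram][tokens[i - 1]] += 1
            if i + n < total:
                right[gram][tokens[i + n]] += 1

    scores = {}
    for gram, count in ngram.items():
        if count < min_count:
            continue
        # Pointwise mutual information of the n-gram versus independent words.
        pmi = math.log(count / total) - sum(
            math.log(unigram[w] / total) for w in gram
        )
        # Assumed normalization: divide by the number of words so that
        # candidates of different lengths share one scale.
        cohesion = pmi / len(gram)
        # Assumed combination: a candidate needs varied context on both sides.
        freedom = min(entropy(left[gram]), entropy(right[gram]))
        scores["_" + "_".join(gram)] = cohesion * freedom
    return scores


# Toy, already-segmented input (made up for this example).
sample = [
    "本", "公司", "承担", "重大", "疾病", "保险", "责任",
    "若", "确诊", "重大", "疾病", "则", "给付", "保险", "金",
    "因", "重大", "疾病", "导致", "身故",
]
scores = score_candidates(sample)
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print((word, score))
```

Candidates are keyed in the same `_word_word` form as the result listing, so the output can be read the same way. On a real corpus the counting loop is the part that would benefit from being split across processes.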
4. Entry point: segment_multi.py
   Run with: python segment_multi.py
   Parameters are set in: configs.py
5. Sample results:
('_重大_疾病', 0.017789747314352424)
('_保障_范围', 0.015639743403053734)
('_本_公司', 0.014212133249451173)
('_完全_丧失', 0.013672071599779227)
('_意外_伤害', 0.010722245979224557)
('_明确_诊断', 0.009062853195861094)
('_日常生活_活动', 0.008990786509666062)
('_六项_基本_日常生活', 0.008813957372202039)
('_基本_日常生活', 0.008694797110512052)
('_基本_日常生活_活动', 0.008671016020472998)
('_保险_事故', 0.008504469334120192)
('_六项_基本_日常生活_活动', 0.008471400808888209)
('_能力_完全_丧失', 0.008404916576493579)
('_全部_条件', 0.008136980840438046)
('_无法_独立', 0.008091270307811042)
('_满足_下列_全部_条件', 0.008055553080109046)
('_现金_价值', 0.007895715475057304)