https://github.com/yiyepiaoling0715/unsupervised_extract_detect_words

multiprocess unsupervised chinese_detect_words ngram_combination

Host: GitHub
URL: https://github.com/yiyepiaoling0715/unsupervised_extract_detect_words
Owner: yiyepiaoling0715
Created: 2019-01-02T07:59:58.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2019-01-02T08:51:13.000Z (over 6 years ago)
Last Synced: 2024-08-29T17:28:44.856Z (10 months ago)
Topics: detect, entropy, hotword-detection, multiprocessing, mutual-information, ngram, pmi, recursive, segment, unsupervised-learning
Language: Python
Size: 7.19 MB
Stars: 25
Watchers: 2
Forks: 6
Open Issues: 1
Metadata Files:
- Readme: README.md

README

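1. Idea: this project borrows the approach of an earlier blog post that mined new words from Renren (人人网) data, and improves and optimizes it.

2. Original approach: segment the documents with jieba, group every 3 adjacent words, compute the left entropy, right entropy, and internal cohesion of the two-word candidate, combine these into a score, and select new words by score.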
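The scoring described in step 2 can be sketched as follows. This is a minimal illustration and not the repository's code: it assumes the text has already been segmented into a list of tokens (for example by jieba), uses pointwise mutual information (PMI) as the cohesion measure, and combines the three quantities as PMI times the smaller of the two boundary entropies. The names `entropy` and `score_bigrams` and the combining formula are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighboring words."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def score_bigrams(tokens):
    """Score every adjacent word pair in an already-segmented token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    left = defaultdict(Counter)   # pair -> words seen immediately before it
    right = defaultdict(Counter)  # pair -> words seen immediately after it
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        if i > 0:
            left[pair][tokens[i - 1]] += 1
        if i + 2 < len(tokens):
            right[pair][tokens[i + 2]] += 1

    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for pair, count in bigrams.items():
        p_pair = count / n_bi
        p_a = unigrams[pair[0]] / n_uni
        p_b = unigrams[pair[1]] / n_uni
        pmi = math.log(p_pair / (p_a * p_b))
        # Assumed combination: cohesion times the weaker boundary entropy.
        scores[pair] = pmi * min(entropy(left[pair]), entropy(right[pair]))
    return scores
```

A pair scores highly only when its two words co-occur more often than chance (high PMI) and the words around it vary on both sides (high left and right entropy), which is the usual signal that the pair behaves as one unit.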

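3. Optimizations:

   1. Generalized from combining only two words to combining N adjacent words.

   2. The internal mutual information (the cohesion score) is normalized to the value it would have at a length of one word, so that candidates of different lengths can be compared on the same scale.

   3. Multiprocessing is used to speed up the run.

   4. A filtering mechanism was added, based on stop words, high-frequency common words, and similar lists.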
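The N-word generalization, length normalization, and stop-word filtering above can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the README does not give the normalization formula, so the sketch divides the n-gram PMI by the number of words n as one plausible reading of "normalized to a length of one word". The function names `ngram_candidates` and `normalized_pmi`, the `max_n` and `stop_words` parameters, and the rule of dropping any n-gram that contains a stop word are all hypothetical.

```python
import math
from collections import Counter


def ngram_candidates(tokens, max_n, stop_words=frozenset()):
    """Count all n-grams (2 <= n <= max_n) that contain no stop word."""
    counts = Counter()
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if not stop_words.intersection(gram):
                counts[gram] += 1
    return counts


def normalized_pmi(tokens, max_n=4, stop_words=frozenset()):
    """Per-word PMI for each candidate n-gram (assumed normalization)."""
    unigrams = Counter(tokens)
    total = len(tokens)
    result = {}
    for gram, count in ngram_candidates(tokens, max_n, stop_words).items():
        n = len(gram)
        positions = total - n + 1  # number of n-gram slots in the sequence
        log_p_gram = math.log(count / positions)
        log_p_indep = sum(math.log(unigrams[w] / total) for w in gram)
        # Divide by n so that 2-word and 5-word candidates share one scale.
        result[gram] = (log_p_gram - log_p_indep) / n
    return result
```

Without some normalization of this kind, longer candidates tend to get larger raw PMI values simply because the product of more unigram probabilities is smaller, which would make scores of different lengths hard to rank together.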

4. Entry point: segment_multi.py

How to run: python segment_multi.py

Parameters are set in: configs.py

5. Sample results (each tuple is a candidate phrase and its score):

('_重大_疾病', 0.017789747314352424)

('_保障_范围', 0.015639743403053734)

('_本_公司', 0.014212133249451173)

('_完全_丧失', 0.013672071599779227)

('_意外_伤害', 0.010722245979224557)

('_明确_诊断', 0.009062853195861094)

('_日常生活_活动', 0.008990786509666062)

('_六项_基本_日常生活', 0.008813957372202039)

('_基本_日常生活', 0.008694797110512052)

('_基本_日常生活_活动', 0.008671016020472998)

('_保险_事故', 0.008504469334120192)

('_六项_基本_日常生活_活动', 0.008471400808888209)

('_能力_完全_丧失', 0.008404916576493579)

('_全部_条件', 0.008136980840438046)

('_无法_独立', 0.008091270307811042)

('_满足_下列_全部_条件', 0.008055553080109046)

('_现金_价值', 0.007895715475057304)
