{"id":8683806,"url":"https://github.com/yiyepiaoling0715/unsupervised_extract_detect_words","last_synced_at":"2025-08-06T16:32:52.249Z","repository":{"id":236588065,"uuid":"163811686","full_name":"yiyepiaoling0715/unsupervised_extract_detect_words","owner":"yiyepiaoling0715","description":"multiprocess unsupervised  chinese_detect_words  ngram_combination","archived":false,"fork":false,"pushed_at":"2019-01-02T08:51:13.000Z","size":7535,"stargazers_count":25,"open_issues_count":1,"forks_count":6,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-08-29T17:28:44.856Z","etag":null,"topics":["detect","entropy","hotword-detection","multiprocessing","mutual-information","ngram","pmi","recursive","segment","unsupervised-learning"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yiyepiaoling0715.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-01-02T07:59:58.000Z","updated_at":"2022-12-07T15:25:03.000Z","dependencies_parsed_at":null,"dependency_job_id":"07b94721-51a3-4c91-b7ed-92df37127669","html_url":"https://github.com/yiyepiaoling0715/unsupervised_extract_detect_words","commit_stats":null,"previous_names":["zheng5yu9/unsupervised_extract_detect_words","yiyepiaoling0715/unsupervised_extract_detect_words"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yiyepiaoling0715%2Funsupervised_extract_detect_words","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
yiyepiaoling0715%2Funsupervised_extract_detect_words/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yiyepiaoling0715%2Funsupervised_extract_detect_words/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yiyepiaoling0715%2Funsupervised_extract_detect_words/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yiyepiaoling0715","download_url":"https://codeload.github.com/yiyepiaoling0715/unsupervised_extract_detect_words/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228923808,"owners_count":17992581,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["detect","entropy","hotword-detection","multiprocessing","mutual-information","ngram","pmi","recursive","segment","unsupervised-learning"],"created_at":"2024-04-27T23:44:51.823Z","updated_at":"2024-12-09T16:31:33.993Z","avatar_url":"https://github.com/yiyepiaoling0715.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"1. Idea: this project builds on an earlier blog post that mined new words from Renren (人人网) data, and improves and optimizes that approach.\n\n2. Original approach: segment the documents with jieba, take adjacent words in groups of 3, compute the left information entropy, the right information entropy, and the internal cohesion for two words, derive a score from these values, and select new words by score.\n\n3. Improvements: \n\n          1. The original could only combine two words; this is generalized to combining N adjacent words.\n\n          2. The internal mutual information (the cohesion measure) is normalized to its value for a length of 1 word, so candidates of different lengths can be compared on the same scale.\n          \n          3. Multiprocessing is used to speed up execution.\n          \n          4. A filtering step is added, based on stop words, high-frequency common words, and similar lists.\n          \n4. Entry file: segment_multi.py   \n\n  How to run: python segment_multi.py\n  \n  Parameter configuration file: configs.py\n\n5. Sample results\n\n    ('_重大_疾病', 0.017789747314352424)\n   \n    ('_保障_范围', 0.015639743403053734)\n   
 \n    ('_本_公司', 0.014212133249451173)\n    \n    ('_完全_丧失', 0.013672071599779227)\n    \n    ('_意外_伤害', 0.010722245979224557)\n    \n    ('_明确_诊断', 0.009062853195861094)\n    \n    ('_日常生活_活动', 0.008990786509666062)\n    \n    ('_六项_基本_日常生活', 0.008813957372202039)\n    \n    ('_基本_日常生活', 0.008694797110512052)\n    \n    ('_基本_日常生活_活动', 0.008671016020472998)\n    \n    ('_保险_事故', 0.008504469334120192)\n    \n    ('_六项_基本_日常生活_活动', 0.008471400808888209)\n    \n    ('_能力_完全_丧失', 0.008404916576493579)\n    \n    ('_全部_条件', 0.008136980840438046)\n    \n    ('_无法_独立', 0.008091270307811042)\n    \n    ('_满足_下列_全部_条件', 0.008055553080109046)    \n        \n    ('_现金_价值', 0.007895715475057304)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyiyepiaoling0715%2Funsupervised_extract_detect_words","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyiyepiaoling0715%2Funsupervised_extract_detect_words","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyiyepiaoling0715%2Funsupervised_extract_detect_words/lists"}