{"id":17568719,"url":"https://github.com/forestluo/pynldb","last_synced_at":"2026-05-17T02:04:48.649Z","repository":{"id":258883259,"uuid":"875847283","full_name":"forestluo/PyNLDB","owner":"forestluo","description":"NLP based on Python","archived":false,"fork":false,"pushed_at":"2024-10-25T11:27:13.000Z","size":429,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-10-25T18:43:38.018Z","etag":null,"topics":["nature-language-processing","nlp","pycharm","python3","segmentation"],"latest_commit_sha":null,"homepage":"http://www.algmain.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/forestluo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-21T00:47:29.000Z","updated_at":"2024-10-25T11:27:16.000Z","dependencies_parsed_at":"2024-10-27T06:50:28.730Z","dependency_job_id":null,"html_url":"https://github.com/forestluo/PyNLDB","commit_stats":null,"previous_names":["forestluo/pynldb"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forestluo%2FPyNLDB","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forestluo%2FPyNLDB/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forestluo%2FPyNLDB/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forestluo%2FPyNLDB/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/forestluo","download_url":"https://co
deload.github.com/forestluo/PyNLDB/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246193460,"owners_count":20738494,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nature-language-processing","nlp","pycharm","python3","segmentation"],"created_at":"2024-10-21T17:05:32.917Z","updated_at":"2025-10-29T01:06:23.584Z","avatar_url":"https://github.com/forestluo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PyNLDB\n\nDecember 13, 2024\n\n(1) Restructured the code.\n\nWord2Vector : now a standalone algorithm module.\n\nW2VOperator : a command tool for working with Word2Vector.\n\n    \"load words\":       loads words2.json.\n    \"load cores\":       loads cores.json.\n    \"load example\":     loads the built-in example data, used to verify the algorithm.\n    \"load dictionary\":  loads dictionary.json.\n    \"load vectors\":     loads vectors.json.\n    \"save vectors\":     saves vectors.json.\n    \"auto initialize\":  automatically initializes the raw data in preparation for further computation.\n    \"solving vectors\":  selects the algorithm type and runs the computation.\n    \"verify vectors\":   verifies the computed results.\n\n(2) Added two word segmentation algorithms.\n\nSegmentTool : word segmentation tool. All of its methods depend on the cores.json dictionary.\n\nSegmentTool.l2r : segments from left to right using maximum matching.\n\nSegmentTool.r2l : segments from right to left using maximum matching.\n\nSegmentTool.mid : splits the text into two parts at some position in the middle, then applies right-to-left maximum matching to the left part and left-to-right maximum matching to the right part.\n\n(3) Merged the correlation coefficient computation into GammaTool.\n\n---\n\nDecember 3, 2024\n\n(1) Restructured the code. All tool commands have been moved into the tool directory.\n\nGenerateData : extracts file data from the NLDB3 corpus.\n\nOperateData : runs operation tests on random corpus data, including sentence extraction and quantity-word extraction.\n\nSQLite3Operator : transfers the extracted file data into SQLite3. If everything is transferred successfully, the database is about 90 GB.\n\nWord2Vector : 
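vectorizes single characters based on correlation coefficients. There are two main methods; the gradient algorithm is currently recommended.\n\n(2) Added methods for vectorization based on correlation coefficients (including the gradient algorithm).\n\nMy NLP (Natural Language Processing) Journey (20) - Vectorization: https://zhuanlan.zhihu.com/p/9525651467\n\n(3) The core functions are the following three:\n\nContentTool.normalize_content : cleans the raw corpus.\n\nSentenceTemplate.extract : extracts sentences according to templates.\n\nQuantityTemplate.extract : extracts quantity words according to templates.\n\nWord segmentation has been implemented several times before, so it is set aside for now. Priority goes to experimenting with word vectorization algorithms.\n\n---\n\nA data processing program based on Python3 (PyCharm 2024) and NLDB. It mainly implements sentence splitting, word segmentation, and part-of-speech detection.\n\nThe original database is too large (over 40 GB) to share for free on GitHub, so it has not been uploaded (if you need it, please donate and then get in touch).\n\nAll data files are stored locally in JSON format. Because the data volume is large (the raw data exceeds 5 GB), a crawler will be added in the future to fetch raw data directly from the internet for analysis.\n\nNLDB3.py: connects to the database. The main program tests the connection to NLDB3.\n\nNLDB3Raw.py: (1) the main program exports the raw data (RawContent) in NLDB3 to raw.json; (2) transverse performs embedded traversal processing; (3) random selects one random record from the database (used to test the other routines).\n\nRawContent.py: saves, loads, and traverses raw data. (1) The main program loads the raw data (raw.json), normalizes (\"cleans\") it, and saves the result as normalized.json; (2) transverse performs embedded traversal processing.\n\nSentenceTool.py: a tool for splitting content into sentences. (1) The main program splits one random record from the database; (2) the split function breaks content apart at punctuation marks and tags the pieces; (3) the merge function merges the tagged pieces level by level.\n\nSentenceTemplate.py: sentence templates. Complete sentences are extracted by template matching. (1) The main program generates the default templates and saves them to templates.json; (2) the extract function extracts complete sentences from content according to punctuation. The basic sentence splitting process: split completely at punctuation -\u003e tag the pieces and merge where appropriate -\u003e merge level by level -\u003e extract by template.\n\nPoints to note: (1) The order of the default templates is fixed and must not be changed arbitrarily. They are arranged essentially according to the maximum matching principle. If you add your own templates, they must follow this principle as well. (2) Content without complete punctuation is not treated as a complete sentence. Such content is discarded outright.\n\nSentenceContent.py: saves, loads, and traverses sentence data. The main program uses the templates specified in templates.json to extract sentence data from normalized.json and saves it to sentences.json.\n\nTokenContent.py: saves, loads, and traverses Token data. A Token is a single Unicode character. Besides the Token itself, an occurrence count is stored. The main program counts Token occurrences in normalized.json.\n\nWordContent.py: saves, loads, and traverses Word data. Here a Word consists of two adjacent Tokens. (1) The main program uses the loaded normalized.json and tokens.json data to generate word statistics and compute correlation coefficients. The final result is saved as words.json; (2) the update_gamma function computes the gamma value from the Token statistics.\n\n# 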
Reference Links\n\nwww.algmain.com\n\nMy NLP (Natural Language Processing) Journey (8) - Frequency Statistics: https://zhuanlan.zhihu.com/p/539109593\n\nMy NLP (Natural Language Processing) Journey (9) - Dictionary Import: https://zhuanlan.zhihu.com/p/539464788\n\nMy NLP (Natural Language Processing) Journey (10) - Correlation Coefficients: https://zhuanlan.zhihu.com/p/541794935\n\nMy NLP (Natural Language Processing) Journey (11) - Mad Max: https://zhuanlan.zhihu.com/p/542073251\n\nMy NLP (Natural Language Processing) Journey (12) - Word Segmentation Algorithms: https://zhuanlan.zhihu.com/p/542550863\n\nMy NLP (Natural Language Processing) Journey (13) - Sentence Splitting Algorithms: https://zhuanlan.zhihu.com/p/542904661\n\nMy NLP (Natural Language Processing) Journey (14) - Word Segmentation Based on Correlation Coefficients: https://zhuanlan.zhihu.com/p/552443996\n\nMy NLP (Natural Language Processing) Journey (15) - Correlation Coefficients and Part-of-Speech Detection: https://zhuanlan.zhihu.com/p/555630299\n\nMy NLP (Natural Language Processing) Journey (16) - Extracting Quantity Words: https://zhuanlan.zhihu.com/p/557053336\n\nMy NLP (Natural Language Processing) Journey (17) - Information Entropy and Word Segmentation: https://zhuanlan.zhihu.com/p/557433900\n\nMy NLP (Natural Language Processing) Journey (18) - The Final Step of Word Segmentation: https://zhuanlan.zhihu.com/p/558171316\n\nMy NLP (Natural Language Processing) Journey (19) - Part-of-Speech Detection: https://zhuanlan.zhihu.com/p/560504920\n\n---\n\nDonate to the author:\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"https://github.com/forestluo/AlgMain/blob/main/weixin.jpg\" width=\"210px\"\u003e\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u003cimg src=\"https://github.com/forestluo/AlgMain/blob/main/zhifubao.jpg\" width=\"210px\"\u003e\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fforestluo%2Fpynldb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fforestluo%2Fpynldb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fforestluo%2Fpynldb/lists"}