{"id":20469952,"url":"https://github.com/budali/jd_nlp","last_synced_at":"2025-03-05T13:23:22.242Z","repository":{"id":45798060,"uuid":"399667129","full_name":"budaLi/Jd_nlp","owner":"budaLi","description":"贪心学院 京东nlp","archived":false,"fork":false,"pushed_at":"2021-09-10T06:14:05.000Z","size":3811,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-16T01:55:30.740Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/budaLi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-08-25T02:37:07.000Z","updated_at":"2022-07-16T15:46:57.000Z","dependencies_parsed_at":"2022-07-17T01:16:22.794Z","dependency_job_id":null,"html_url":"https://github.com/budaLi/Jd_nlp","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/budaLi%2FJd_nlp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/budaLi%2FJd_nlp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/budaLi%2FJd_nlp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/budaLi%2FJd_nlp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/budaLi","download_url":"https://codeload.github.com/budaLi/Jd_nlp/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242032173,"owners_count":20060735,"icon_url":"https://github.com/github.png","version":null,"crea
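ted_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T14:11:01.839Z","updated_at":"2025-03-05T13:23:22.224Z","avatar_url":"https://github.com/budaLi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 贪心学院 NLP\n\n# 2021.8.25  \n\nOne should not stay idle; idleness leads to feeling lost...\n\n## 002  Training camp and curriculum overview\n\n  Introduced the start of the program and roughly what the course covers.\n\n## 003 Definition of NLP and its ambiguity\n\n  1. NLP = NLU (natural language understanding) + NLG (natural language generation)\n  \n## 004, 005 Machine translation\n\n   1. Statistical machine translation\n\n![1.png](https://github.com/budaLi/Jd_nlp/blob/main/imgs/%E7%BB%9F%E8%AE%A1%E6%9C%BA%E5%99%A8%E7%BF%BB%E8%AF%91_1.jpg)\n\n    Traditional machine translation builds a lexicon by pairing each word in the corpus with its translation, then translates word by word using those pairs.\n    Drawbacks: slow, no semantic analysis, no context.\n\n\n   2. Chinese-to-English translation\n\n  今晚的课程有意思 (\"Tonight's course is interesting\")\n\n      1) Word segmentation: 今晚| 的| 课程| 有意思\n      2) Word-by-word translation: Tonight | of | the course | interesting\n      3) Enumerate the orderings of the translated words and score each one with a language model, which outputs\n          the probability that an ordering is grammatical; the ordering with the highest probability is the translation.\n\n  One problem with this approach: when there are many words, the number of orderings grows factorially, O(n!), so scoring every ordering with the language model is impractical.\n  Segmentation and word translation can be treated as the translation model, and the probability scoring as the language model. To simplify, can the two be combined? This motivates the Viterbi algorithm.\n\n   3. The three components\n\n           3.1 Language model\n               Given an English sentence e, compute the probability P(e).\n               P(e) is high if e is grammatical English and low if e is a random word sequence.\n           3.2 Translation model (dictionary)\n               Given a pair \u003cc,e\u003e, compute p(c|e), where c is the Chinese text and e is the English text.\n               p(c|e) is high when the two are semantically close and low otherwise.\n           3.3 Decoding algorithm (Viterbi)\n               Given the language model, the translation model and c, find the e that maximizes p(e)p(c|e).\n\n   4. 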
Language model\n\n       A language model has to be trained in advance. A good one can judge whether a sentence is grammatical and assign it a probability:\n       P(he is studying ai) \u003e P(he studying is ai)\n\n       That is, \"he is studying ai\" should receive a higher probability than \"he studying is ai\". How is this computed?\n           \n           Unigram: P(he is studying ai) = P(he) * P(is) * P(studying) * P(ai)   assumes every word is independent\n           \n           Markov assumption\n           \n           Bigram: P(he is studying ai) = P(he) * P(is|he) * P(studying|is) * P(ai|studying)   assumes each word depends only on the previous word\n           \n           Trigram: P(he is studying ai) = P(he) * P(is|he) * P(studying|he is) * P(ai|is studying)  assumes each word depends on the previous two words\n           \n           N-gram        Unigram, bigram and trigram generalize to N-gram; the first three are approximations made to simplify the computation\n\n       Joint probability\n       p(x1,x2) = p(x1) * p(x2|x1)  the joint probability of x1 and x2 is the prior p(x1) times the probability of x2 given x1\n\n           p(x1,x2,x3,x4)\n\n          = p(x1) * p(x2|x1) * p(x3|x1,x2) * p(x4|x1,x2,x3)  # chain rule; unigram, bigram and trigram are simplifications of it\n\n          = p(x1,x2) * p(x3|x1,x2) * p(x4|x1,x2,x3)\n\n          = p(x1,x2,x3) * p(x4|x1,x2,x3)\n\n          = p(x1,x2,x3,x4)\n\n## 006 NLP projects in practice\n\n    1. Question answering\n    2. Sentiment analysis\n       Stock price prediction, public opinion analysis, product reviews, event monitoring\n    3. Machine translation\n    4. Text summarization\n    5. Chatbots: chit-chat (seq2seq) and task-oriented (intent recognition)\n    6. Information extraction\n\n\n## 007 Key NLP techniques\n\n    Semantics (meaning)\n    Syntax (sentence structure)\n    Morphology (word forms)\n    Phonetics (sound)\n\n    1. Word segmentation\n      今天是自然语言处理训练营第一次课 (\"Today is the first class of the NLP training camp\")\n      今天 是  自然语言处理 训练营 第一次 课\n    2. Part-of-speech tagging\n       今天是1月22日，也是我们训练营的第一天，暂时课程，以ZOOM的方式直播 (\"Today is January 22, the first day of our training camp; for now the classes are streamed live via ZOOM\")\n    3. Named entity recognition\n       今天是（1月22日），也是我们(训练营)的第一天，暂时课程，以（ZOOM）的方式直播\n    4. Parsing (syntactic analysis)\n    5. Dependency parsing\n    6. 
Relation extraction\n\n## 008 Time complexity\n\n## 016  P, NP and NP-complete problems \n\n## 017 Question answering systems\n    \n    Match the user's question against the questions in the corpus, using either rule-based matching or sentence-similarity scoring.\n    Core of a retrieval-based QA system: 1. text representation 2. similarity computation\n    Knowledge graph: 1. entity extraction 2. relation extraction\n    \n![1.png](https://github.com/budaLi/Jd_nlp/blob/main/imgs/QA.PNG)\n\n# 2021.8.26\n\n## 020 Text processing pipeline\n\n    Forward maximum matching, backward maximum matching\n   \n## 024 Viterbi algorithm\n\n![1.png](https://github.com/budaLi/Jd_nlp/blob/main/imgs/viterbi.png)\n\n    Summary of word segmentation algorithms\n        1. Rule-based matching: max matching\n        2. Probabilistic and statistical methods: LM (language model), HMM, CRF\n    Word segmentation can be considered a solved problem.\n    \n    What to master:\n        1. Implementing max matching and the unigram LM method.\n        \n   ```\n   # Forward maximum matching\n   def forward_max_matching(text, dic, max_len):\n       cur_start = 0\n       cur_end = min(max_len, len(text))\n       res = []\n       while cur_end \u003c= len(text) and cur_start \u003c= cur_end:\n           cur_str = text[cur_start:cur_end]\n\n           if cur_str not in dic:\n               # No match: shrink the window by one character\n               cur_end -= 1\n           else:\n               # Match: record the word and move the window past it\n               res.append(cur_str)\n               cur_start = cur_end\n               cur_end = min(len(text), cur_end + max_len)\n           print(cur_start, cur_end, cur_str, res)\n       # Segmentation succeeded only if the whole string was consumed\n       if cur_start != len(text):\n           print(\"no matching \")\n       else:\n           print(res)\n\n\n   dic = [\"李\",\"不搭\",\"李不搭\",\"武功\",\"武功盖世\",\"天下\",\"第一\",\"一\"]\n   strs = \"李不搭武功盖世天下第一\"\n   max_len = 4\n   forward_max_matching(strs, dic, max_len)\n   ```\n   Output:\n   ```\n      0 3 李不搭武 []\n      3 7 李不搭 ['李不搭']\n      7 11 武功盖世 ['李不搭', '武功盖世']\n      7 10 天下第一 ['李不搭', '武功盖世']\n      7 9 天下第 ['李不搭', '武功盖世']\n      9 11 天下 ['李不搭', '武功盖世', '天下']\n      11 11 第一 ['李不搭', '武功盖世', '天下', '第一']\n      11 10  ['李不搭', '武功盖世', '天下', '第一']\n      ['李不搭', '武功盖世', '天下', '第一']\n\n   ```\n\nImplementing N-gram segmentation on its own is of limited value; it is just a simple data-processing method (sliding-window token extraction).\n\nGiven a corpus, an N-gram model can be used to predict or assess whether a sentence is plausible.\n\nReference: https://www.codenong.com/cs106431277/\n\n\n# 
2021.8.30\n\n## 025 Spelling correction\n\n  E-commerce sites, search engines and similar products need spelling correction, which is typically built on edit distance.\n  Edit distance is computed with dynamic programming.\n  \n  ![1.png](https://github.com/budaLi/Jd_nlp/blob/main/imgs/spell_correction.jpg)\n  \n  [Spelling correction](https://github.com/budaLi/Jd_nlp/blob/main/codes/spell_correction.py)\n  \n  \n  Edit distance: https://leetcode-cn.com/problems/edit-distance/comments/\n  \n  ![Edit distance](https://github.com/budaLi/Jd_nlp/blob/main/imgs/edit_distance.jpg)\n  \n  ```\n  class Solution(object):\n      def minDistance(self, word1, word2):\n          \"\"\"\n          :type word1: str\n          :type word2: str\n          :rtype: int\n          \"\"\"\n          m = len(word1)\n          n = len(word2)\n\n          # If either word is empty, the edit distance\n          # is the length of the other word\n          if m * n == 0:\n              return m + n\n\n          # cost[i][j] = edit distance between word1[:i] and word2[:j]\n          cost = [[0 for _ in range(n + 1)] for _ in range(m + 1)]\n\n          # Boundary: word2 is empty, so delete all i characters\n          for i in range(m + 1):\n              cost[i][0] = i\n\n          # Boundary: word1 is empty, so insert all j characters\n          for j in range(n + 1):\n              cost[0][j] = j\n\n          for i in range(1, m + 1):\n              for j in range(1, n + 1):\n                  if word1[i-1] == word2[j-1]:\n                      # Same character: no extra edit needed\n                      cost[i][j] = cost[i-1][j-1]\n                  else:\n                      # One edit on top of the cheapest of:\n                      # substitute (cost[i-1][j-1]), delete (cost[i-1][j]), insert (cost[i][j-1])\n                      cost[i][j] = 
1 + min(cost[i-1][j-1], cost[i-1][j], cost[i][j-1])\n          return cost[m][n]\n\n\n  S = Solution()\n  # word1 = \"horse\"\n  # word2 = \"ros\"\n  word1 = \"intention\"\n  word2 = \"execution\"\n  # word1 = \"a\"\n  # word2 = \"b\"\n  cos = S.minDistance(word1, word2)\n  print(cos)\n  ```\n  \n  \n  Drawback of edit distance: every word in the vocabulary has to be compared with the user input, so the cost is high: O(V) * O(mn),\n  \n  where V is the vocabulary size and m and n are the lengths of the two words being compared.\n  \n  Optimization: user input -\u003e generate the strings within edit distance 1 and 2 of it -\u003e filter -\u003e return\n  \n  \n  How to filter is not covered in depth here and still needs to be worked out later.\n  \n  ![image](https://user-images.githubusercontent.com/31475416/131482290-355a80a5-c824-4bc0-a463-25c2129dd1e7.png)\n\n\n## 028 Stop-word filtering and stemming\n\n  In NLP applications we usually filter out stop words and very low-frequency words first, which is similar to feature selection.\n  \n  In English, words such as \"the\", \"an\" and \"their\" can be treated as stop words, but the application has to be taken into account.\n  \n  For example, in sentiment analysis words such as \"好\" (good) and \"很好\" (very good) must not be filtered out.\n  \n  Word normalization\n  \n    Stemming: one way to normalize  \n     \n          went,go,going       -\u003e go\n          fly,flies           -\u003e fli\n          deny,denied,denying -\u003e deni\n          \n          \"The stemmed form is not necessarily a real word; stemming does not guarantee a valid base form.\"\n          \n![image](https://user-images.githubusercontent.com/31475416/131485124-a0953029-bba8-4c0b-af06-5e17dcf9a3e5.png)\n\n          \n    Lemmatization\n    \n        Guarantees that the result is a valid English word; stricter than stemming.\n    \n    \n## 029 Text representation\n\none-hot \n\n![image](https://user-images.githubusercontent.com/31475416/131486162-e0b87ce8-1a26-45c1-83cd-484b1758b952.png)\n\n![image](https://user-images.githubusercontent.com/31475416/131486196-17b78d06-2292-4976-a17d-8c5b7f530202.png)\n\n![image](https://user-images.githubusercontent.com/31475416/131486236-0576b853-9d9e-4276-9b81-d757d7c393e8.png)\n\n  \n  \n## 031 tf-idf 
\n\n![image](https://user-images.githubusercontent.com/31475416/131508626-2895891a-2b76-4bb2-a983-1417498349e8.png)\n\n![image](https://user-images.githubusercontent.com/31475416/131510167-608d3770-821e-48b9-a255-f5e5dd12df64.png)\n\n![image](https://user-images.githubusercontent.com/31475416/131511874-1ab92a99-0dc5-4a4b-9fbf-450f1c1c8466.png)\n\n![image](https://user-images.githubusercontent.com/31475416/131512645-94f55380-5a15-4512-8019-22508f721d01.png)\n\n![image](https://user-images.githubusercontent.com/31475416/131514379-d960fc59-8907-4f55-b11e-3ada3cf6ff60.png)\n\n![image](https://user-images.githubusercontent.com/31475416/131515364-7a1b3d04-8750-414a-8303-098f3e4f4701.png)\n\n\n## 034  Inverted index\n\nA retrieval-based QA system is too slow if every user input has to be compared against all questions in the QA corpus before an answer is returned.\n\nBorrowing from search engines, use an inverted index.\n\nSo the optimized QA system first uses keywords to filter out most of the QA corpus, then runs similarity matching on what remains.\n\n![image](https://user-images.githubusercontent.com/31475416/131630668-aa3cf036-2985-485d-a604-d66f0ee16842.png)\n\n## 035 Noisy Channel Model\n\n\np(text|source) is proportional to p(source|text)*p(text)\n\nIn other words, given a source signal, we want to convert it to text; the formula follows from Bayes' rule.\n\nApplications: speech recognition, machine translation, spelling correction, OCR, password cracking  -\u003e text\n\n![image](https://user-images.githubusercontent.com/31475416/131635032-38d7ee55-6046-4576-916c-c3adcbd4814c.png)\n\n![image](https://user-images.githubusercontent.com/31475416/131635056-9d2d6349-0654-4c47-aeda-4b7b046b802d.png)\n\n\n## 036 Language model\n\nA language model is used to judge whether a sentence is grammatically fluent.\n\nReview unigram, bigram and N-gram.\n\n\n## 050 Generating sentences with a language model\n\n  A unigram model can generate sentences by repeatedly sampling words from the vocabulary according to their probabilities. Because a unigram model ignores context and the dependence between words,\n  the generated sentences usually do not read like natural language.\n \n## 055 Some hard problems\n\n  1. Logical inference\n  2. Resolving rule conflicts\n  3. Selecting a minimal subset of rules\n  \n## 056 Machine learning\n  \n  1. Linear regression\n  2. Logistic regression\n  3. Naive Bayes\n  4. Neural networks\n  5. SVM\n  6. Random forest\n  7. Adaboost\n  8. CNN\n\nUnsupervised learning:\n  1. K-means\n  2. PCA\n  3. ICA\n  4. MF\n  5. LSA\n  6. 
LDA\n\n\n# 2021.9.10  \n\n  Put on hold for now because of work. When I resume, start from the hands-on part-of-speech tagging exercise.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbudali%2Fjd_nlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbudali%2Fjd_nlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbudali%2Fjd_nlp/lists"}