{"id":22677865,"url":"https://github.com/leask/maimemo-word-extractor","last_synced_at":"2025-03-29T12:44:55.276Z","repository":{"id":150323446,"uuid":"202420686","full_name":"Leask/MaiMemo-word-extractor","owner":"Leask","description":"墨墨提词算法","archived":false,"fork":false,"pushed_at":"2016-11-04T03:59:22.000Z","size":538,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-04T13:43:52.132Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Leask.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-08-14T20:24:00.000Z","updated_at":"2022-10-02T16:55:06.000Z","dependencies_parsed_at":"2023-04-18T07:17:31.630Z","dependency_job_id":null,"html_url":"https://github.com/Leask/MaiMemo-word-extractor","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Leask%2FMaiMemo-word-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Leask%2FMaiMemo-word-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Leask%2FMaiMemo-word-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Leask%2FMaiMemo-word-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Leask","download_url":"https://codeload.github.com/Leask/MaiMemo-word
-extractor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246187219,"owners_count":20737460,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-09T18:02:49.176Z","updated_at":"2025-03-29T12:44:55.261Z","avatar_url":"https://github.com/Leask.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# word-extractor\nMaiMemo word-extraction algorithm\n\n## Introduction\nWord extraction means pulling out of a piece of text the words that exist in the MaiMemo word bank, so that users can add them to their own study plan.\n\n## Submission\nPlace your code or project directly in the src folder.\n\n## Key features\n1. The user supplies a text, which is compared against a word bank (the MaiMemo word bank); the words present in both the text and the word bank are extracted.\n2. Extracted words are ordered by where they first appear in the user's text.\n3. A word that appears more than once is extracted only once.\n4. Phrases can be extracted, for example:\n    1. `He knows a bit of Dutch` =\u003e `a bit of`.\n    2. `as noisy as ever` =\u003e `as ... as`.\n    3. `keep up with jenny to` =\u003e `keep up with sb.`, [more pronouns](#pronouns-that-may-appear)\n5. Special characters are handled: for example, clean-up must be extracted as a single word and also split into its component words, yielding clean-up, clean, and up.\n6. (Optional) Regular inflections: if the text contains looked, first check whether the word bank contains look; if it does, stop searching, and only if it does not, look up looked. The same applies to -ing forms and to forms ending in s, es, or ies.\n7. (Optional) Irregular inflections: for drunk, first look up drunk itself; if the word bank contains it, stop searching, and only if it does not, look up drink.\n8. 
Maximize extraction speed while preserving correctness, for example by avoiding auto boxing/unboxing.\n\n## Pronouns that may appear\n`[\"do sth.\", \"do sth\",\"sb.'s\", \"sth.\", \"sb.\",\"sth\", \"sb\", \"one's\", \"somebody's\", \"somebody\", \"something\", \"someone\"]`\n\n## Implementation\n\n### Key points\n\n+ #### Word lookup\n    Every word has to be looked up in the word bank, so this is the most obvious performance bottleneck. Several data structures are well suited to it:\n    + `Hash Table`\n    + `Prefix Tree`\n    \n+ #### Phrase lookup\n    + `looking for` in `I'm looking for my wallet.`\n    + `do sb's best` in `do my best/do your best`\n    + (with inflection recognition enabled) `put on` in `putting on`\n\n+ #### Special cases\n    + For `something of`, the phrase itself must be extracted, not `... of`\n\n+ #### Inflection recognition\n    When inflection recognition is enabled, inflected words are converted to their base form before extraction, including words that appear inside phrases.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleask%2Fmaimemo-word-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fleask%2Fmaimemo-word-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleask%2Fmaimemo-word-extractor/lists"}