{"id":26483870,"url":"https://github.com/chatopera/wikidata-corpus","last_synced_at":"2025-03-20T04:58:31.288Z","repository":{"id":38323836,"uuid":"103103940","full_name":"chatopera/wikidata-corpus","owner":"chatopera","description":"Train Wikidata with word2vec for word embedding tasks","archived":false,"fork":false,"pushed_at":"2018-07-03T11:51:19.000Z","size":78193,"stargazers_count":120,"open_issues_count":1,"forks_count":29,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-04-14T20:20:44.559Z","etag":null,"topics":["wikidata","word-embeddings","word2vec"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chatopera.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-09-11T07:29:38.000Z","updated_at":"2024-03-12T06:52:24.000Z","dependencies_parsed_at":"2022-08-25T02:20:15.769Z","dependency_job_id":null,"html_url":"https://github.com/chatopera/wikidata-corpus","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chatopera%2Fwikidata-corpus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chatopera%2Fwikidata-corpus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chatopera%2Fwikidata-corpus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chatopera%2Fwikidata-corpus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chatopera","download_url":"https://codeload.github.com/chatopera/wikidata-corpus/tar.gz/refs/heads/
master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244554067,"owners_count":20471173,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["wikidata","word-embeddings","word2vec"],"created_at":"2025-03-20T04:58:30.654Z","updated_at":"2025-03-20T04:58:31.275Z","avatar_url":"https://github.com/chatopera.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![chatopera banner][co-banner-image]][co-url]\n\n[co-banner-image]: https://user-images.githubusercontent.com/3538629/42217321-3d5e44f6-7ef7-11e8-94e7-1574bfa1dbb8.png\n[co-url]: https://www.chatopera.com\n\n# wikidata\nwikidata.org\n\n## Download\n```\nSTORE_PATH=data\nDATA_URL=http://download.wikipedia.com/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2\n\ncd $STORE_PATH\nwget $DATA_URL\n```\n\n## Extract articles\n```\nWikiExtractor.py -b 5000M \\\n    -o data/zhwiki-latest-pages-articles.extracted \\\n    data/zhwiki-latest-pages-articles.xml.bz2\n```\n\n## Convert Traditional Chinese to Simplified Chinese\n```\nopencc -i data/zhwiki-latest-pages-articles.extracted/AA/wiki_00  \\\n    -o data/zhwiki-latest-pages-articles.0620.chs \\\n    -c t2s.json\n```\n\nDownload [t2s.json](https://raw.githubusercontent.com/BYVoid/OpenCC/master/data/config/t2s.json).\n\nAt this point, most of the Traditional-to-Simplified conversion is done.\n\n\n## Handling other cases\n\n1) Wikipedia converts between Traditional and Simplified Chinese using a vocabulary table plus manual corrections. Manually corrected text looks like the following; most such corrections exist to handle terminology that differs between regions:\n\n\u003e 他的主要成就包括Emacs及後來的GNU Emacs，GNU C 編譯器及-{zh-hant:GNU 除錯器;zh-hans:GDB 调试器}-。\n\nThese cases can be handled with a simple regular expression. The qualifier for Simplified Chinese is usually zh-hans or zh-cn.\n\n2) When Wikipedia Extractor extracts article text, it drops foreign-language text that carries special markup, which leaves body text like this:\n\n\u003e 
西方语言中“数学”（；）一词源自于古希腊语的（）\n\nThe sentence above no longer reads properly, but such sentences have little effect on the task at hand, so they are ignored for now. Finally, replace symbols such as 「」 and 『』 with quotation marks and remove the empty brackets.\n\n\n```\npython2 fix_special_symbols.py data/zhwiki-latest-pages-articles.0620.chs\n```\n\nWhen the program finishes, it writes its output to **data/zhwiki-latest-pages-articles.0620.chs.normalized**.\n\n## Browse the file\n\n```\nhead data/zhwiki-latest-pages-articles.0620.chs.normalized\n```\n\n## Word segmentation\n\n* Run the script\n\n```\nexport PYTHONIOENCODING=\"UTF-8\"\npython3 wordseg.py \u003e data/zhwiki-latest-pages-articles.0620.chs.normalized.wordseg\n```\n\n## word2vec\nThis step uses the official [word2vec](https://code.google.com/archive/p/word2vec) implementation.\n```\n./word2vec_c_format_train.sh\n```\n\n### Usage of word2vec model\n\n* word2vec cli\n\n```\ndistance, compute-accuracy, word-analogy\n```\n\n* python\n\n```\npython3 word2vec_gensim_similarity.py\n```\n\n## TF-IDF\n\n* plain code\n\ntrain\n\n```\npython3 tfidf_plain.py\n```\n\nAfter running, **words**, **weights** and **idf** are dumped into a pickle file.\n\n* advanced version using [sklearn](http://scikit-learn.org/)\n\nThis currently runs into a sparse-matrix problem; the workaround is to use a restricted vocabulary.\n\n```\npython3 tfidf_sklearn.py\n```\n\n## Related projects\n\n### [Synonyms](https://github.com/huyingxi/Synonyms)\nA Chinese synonym library. Synonyms builds its synonym table from word vectors trained with wikidata-corpus.\n\n## References\n* http://licstar.net/archives/328\n* http://licstar.net/archives/tag/wikipedia-extractor\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchatopera%2Fwikidata-corpus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchatopera%2Fwikidata-corpus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchatopera%2Fwikidata-corpus/lists"}