# MNBVC (Massive Never-ending BT Vast Chinese corpus)

## A request to the media: please do not report on us, so that we have more time to collect and organize data. What we fear most is being hyped to death. Letting us keep a low profile is itself a major contribution to the Chinese NLP community!

The oldest and most mysterious community on the Chinese internet (bar none), the [MOP LIWU community](http://mnbvc.253874.net/), solemnly announced on 2023-01-01:

**Under the wise and mighty leadership of the MOP admins, the community is determined to play to its strengths and help the open-source community maintain a long-running, continuously updated, largest-of-its-kind Chinese-internet corpus.**

The [MNBVC corpus](https://wiki.mnbvc.org) covers not only mainstream culture but also niche subcultures, even "Martian script" internet slang. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, academic papers, scripts and subtitles, forum posts, wikis, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more. All data is collected from the internet.

### Progress

The corpus currently totals 47,863 GB. The target is 253 TB, putting progress at 18.9%.

### Data notes

The password for all archives is 253874.

The Chinese text inside the archives comes in txt, json, jsonl, and parquet (multimodal-only) formats; everything will eventually be unified to jsonl and parquet.

links.txt in the root of each archive lists the source URL for each subfolder.

Each subfolder contains a PNG screenshot of the source web page.

For de-identification, digit strings of 8 or more characters are removed from the collected data.

The archived data has only been coarsely processed, e.g. html & xml converted to txt, csv & tsv converted to json.

### Indexing and classification

We have no capacity to audit the copyright of data sources. Although the dataset records source information, in order to keep providing updates and downloads over the long term, and to minimize copyright disputes, we do not provide an index or classification of the data inside the archives. We also earnestly ask everyone to restrain the urge to share: please do not discuss the archive index or the specific contents of the archives. Focus instead on applications of the large-scale corpus itself, and please use the data discreetly.

### huggingface

Cleaned and classified data is being published progressively at: [https://huggingface.co/datasets/liwu/MNBVC](https://huggingface.co/datasets/liwu/MNBVC)

### One person travels fast, many travel far (call for volunteers: email MNBVC@253874.net)

Group leads report that data cleaning involves a lot of grunt coding work, so technical progress is slow. We are looking for volunteers with spare time; knowing Python is enough, and hands-on mentoring is provided. Please read [the project's three red lines](https://wiki.mnbvc.org/doku.php/xmhx) before helping.

 + OCR transcription group (pushed by GPT-4 into becoming a text-image multimodal corpus group, with expanded headcount): currently 5 people, 5 openings (CV/NLP background required; the goal is to use NLP to assist OCR transcription, led by a top expert in the field)
 + Q&A corpus group: currently 3 people, 4 openings (currently all grunt work writing Python to align Q&A pairs and checking them by hand; the plan is to automate alignment with models later)
 + Corpus-enhancement group: currently 3 people, 2 openings (using NLP to fill in missing characters in corpora, text-quality checking, etc.)
 + The code-corpus and parallel-corpus groups still need a few helpers (the group leads will decide the tasks)
 + Classical-texts group (to be formed): transcoding local gazetteers and other ancient books; lots of material, very hard
 + Testing group (to be formed): testers welcome to help us improve data quality; we hope this group can explore using LLMs to generate test cases and test code directly

Even if you have no time to help with development, you can still take part in building the MNBVC corpus by uploading corpus documents via the [Corpus Spirit Bomb (语料元气弹)](https://mnbvc.253874.net/upload/form.htm) project.

### Large-scale Chinese corpus cleaning tools

To process Chinese corpora at scale, MNBVC contributors have optimized existing open-source software and provide more efficient versions:

 + Faster and more accurate Chinese encoding detection: [charset_mnbvc](https://github.com/alanshi/charset_mnbvc)
 + Batch-convert txt to jsonl and flag files with high paragraph duplication: [deduplication_mnbvc](https://github.com/aplmikex/deduplication_mnbvc)
 + Sample a given number of files by keyword from nested directories while preserving directory structure: [scan_copy_files_mnbvc](https://github.com/wanng-ide/scan_copy_files_mnbvc)
 + Format checker that unifies the MNBVC corpus format: [DataCheck_MNBVC](https://github.com/X94521/DataCheck_MNBVC)
 + Data-cleaning examples and tools: [DataClean-MNBVC](https://github.com/wormtooth/DataClean-MNBVC)

### Code-repository crawlers

Existing open-source code corpora all suffer from heavy human filtering, which makes catching up with ChatGPT harder. To avoid duplicated effort, we provide repository crawlers validated at MNBVC scale:

 + Crawl GitHub repository metadata: [publicRepos_mnbvc](https://github.com/washing1127/publicRepos_mnbvc)
 + Crawl the latest code of GitHub repositories: [github_downloader_mnbvc](https://github.com/imgingroot/github_downloader_mnbvc)
 + Crawl NotABug repositories: [notabug_download_mnbvc](https://github.com/gezi2333/notabug_download_mnbvc)
 + Crawl Bitbucket repositories: [bitbucket_crawl_mnbvc](https://github.com/chenzhwsysu57/bitbucket_crawl_mnbvc)
 + Convert code into corpus format: [githubcode_extractor_mnbvc](https://github.com/LinnaWang76/githubcode_extractor_mnbvc)
 + Crawl commit history: [get_github_commit_mnbvc](https://github.com/ppmmaiwo/get_github_commit_mnbvc)

### Multimodal processing tools

 + PDF metadata extraction: [pdf_meta_data_mnbvc](https://github.com/MIracleyin/pdf_meta_data_mnbvc)
 + PDF parsing rules: [mmdp_mnbvc](https://github.com/MIracleyin/mmdp_mnbvc)
 + First-generation pdf-to-txt tool: [pdf2txt_mnbvc](https://github.com/jayhenry/pdf2txt_mnbvc)
 + arXiv document parser: [Arxiv_mllm_mnbvc](https://github.com/flychen59/Arxiv_mllm_mnbvc)
 + arXiv image-caption pair tool: [ARXIV_IMAGE2CAPTION_mnbvc](https://github.com/KakaQK/ARXIV_IMAGE2CAPTION_mnbvc)
 + Convert PDF files to JSON and Markdown: [docling_parse_mnbvc](https://github.com/MIracleyin/docling_parse_mnbvc)

### Assorted cleaning code

 + wikiHow: [WikiHowQAExtractor-mnbvc](https://github.com/wanicca/WikiHowQAExtractor-mnbvc)
 + Chinese Ministry of Foreign Affairs press briefings: [QA_with_reporters_from_the_Ministry_of_Foreign_Affair_mnbvc](https://github.com/UnstoppableCurry/QA_with_reporters_from_the_Ministry_of_Foreign_Affair_mnbvc)
 + Math problems of all kinds: [Math_mnbvc](https://github.com/X94521/Math_mnbvc)
 + Stack Exchange: [stackexchange_mnbvc](https://github.com/livehl/stackexchange_mnbvc)
 + Parallel corpora: [parallel_corpus_mnbvc](https://github.com/liyongsea/parallel_corpus_mnbvc)
 + Exam papers: [Exam-Question-Bank-Dataset-zh_mnbvc](https://github.com/UnstoppableCurry/Exam-Question-Bank-Dataset-zh_mnbvc)
 + China Judgements Online: [MNBVC-judgment](https://github.com/wormtooth/MNBVC-judgment)
 + Murder-mystery game scripts: [MNBVC-pdf-extract](https://github.com/459737087/MNBVC-pdf-extract/)
 + DocLayNet: [DocLayNetPlus_mnbvc](https://github.com/luigide2020/DocLayNetPlus_mnbvc)

### Other small tools

 + chinaXiv crawler: [chinaxivCrawler_mnbvc](https://github.com/flychen59/chinaxivCrawler_mnbvc)
 + Extract files from WARC archives: [warc_extractor_mnbvc](https://github.com/akira-l/warc_extractor_mnbvc)
 + psyarxiv, chemrxiv, biorxiv, and medrxiv crawler: [xxarxiv_mnbvc](https://github.com/isLinXu/xxarxiv_mnbvc)
 + WIPO crawler: [wipo_mnbvc](https://github.com/X-233/wipo_mnbvc)

### Corpus download (each archive is updated as cleaning progresses)

1. Sync all archives and receive updates via [Verysync p2p](http://www.verysync.com/manual/)
We recommend disabling TCP hole punching and UDP transfer in the Verysync settings; otherwise Verysync may overload your router (though transfers may also be faster).
> Verysync key: B4MVPVJTK3DOOAOPVLJ3E7TA7RWW4J2ZEAXJRMRSRHSBPDB7OAFHUQ
> [Verysync direct link](https://link.verysync.com/#f=MNBVC%40xclimbing&sz=105E4&k=P4AJDJXHY3RCCOCDJZX3S7HO7FKK4X2NSOLXFAFGFVGPDRP7COTVIE&d=SJZHVB7GAZZLS2ZN43D3NNEBHPMU&t=1&tm=1676793101554&v=v2.16.0&a=1)

2. Download via Baidu Netdisk: [part 1](dupan/README.md), [part 2](dupan/README2.md)

### Citation

Please cite the repo if you use the data or code in this repo.

```
@misc{mnbvc,
  author = {{MOP-LIWU Community} and {MNBVC Team}},
  title = {MNBVC: Massive Never-ending BT Vast Chinese corpus},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/esbatmop/MNBVC}},
}
```
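The de-identification rule stated in the data notes (digit strings of 8 or more characters are removed) can be sketched as a single regex pass. This is a minimal illustrative sketch, not the project's actual cleaning code; the function name and the choice to replace matches with an empty string are assumptions:

```python
import re

# Digit runs of 8+ characters (phone numbers, chat IDs, etc.) get dropped.
# NOTE: illustrative sketch only, not MNBVC's actual cleaning pipeline.
DIGIT_RUN = re.compile(r"\d{8,}")

def deidentify(text: str) -> str:
    """Strip digit strings of length >= 8 from a corpus line."""
    return DIGIT_RUN.sub("", text)

print(deidentify("联系QQ: 123456789, 电话 13800138000, 年份 2023"))
# Short numbers such as years (2023) survive; long runs are removed.
```

Shorter digit runs like years and ordinary counts are left intact, which matches the stated threshold of "8 or more digits".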