Projects in Awesome Lists tagged with corpus-data
A curated list of projects in awesome lists tagged with corpus-data.
https://github.com/esbatmop/mnbvc
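MNBVC (Massive Never-ending BT Vast Chinese corpus): an ultra-large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, and even "Martian script" (火星文) text. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki pages, classical poetry, song lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.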
chinese chinese-language chinese-nlp chinese-simplified corpus-data nlp nlp-machine-learning
Last synced: 05 Apr 2025
https://github.com/esbatmop/MNBVC
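MNBVC (Massive Never-ending BT Vast Chinese corpus): an ultra-large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, and even "Martian script" (火星文) text. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki pages, classical poetry, song lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.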
chinese chinese-language chinese-nlp chinese-simplified corpus-data nlp nlp-machine-learning
Last synced: 02 Apr 2025
https://github.com/sheepzh/poetry
The most complete corpus of modern Chinese-language poetry on Earth: 3k+ poets, 80K+ poems, 15M+ characters
chinese-corpus corpus-data literature nlp poetry
Last synced: 26 Feb 2026
https://github.com/shijiebei2009/CEC-Corpus
Chinese Emergency Corpus, from the Semantic Intelligence Laboratory at Shanghai University
Last synced: 16 Nov 2025
https://github.com/grammarly/ua-gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
corpus corpus-data corpus-tools dataset gec grammatical-error-correction natural-language-processing nlp-datasets ukrainian-language
Last synced: 21 Feb 2026
https://github.com/hailiang-wang/egret-wenda-corpus
A Public Corpus for Machine Learning
Last synced: 27 May 2026
https://github.com/canclid/canto-filter
Cantonese text filter for screening Cantonese corpus data
cantonese cantonese-language corpus corpus-data data nlp
Last synced: 27 Oct 2025
https://github.com/undertheseanlp/corpus.viwiki
Vietnamese Wikipedia Corpus
corpus-data corpus-linguistics vietnamese vietnamese-nlp
Last synced: 05 Mar 2026
https://github.com/pythainlp/thaigov-v2-corpus
Thai News Dataset from Thai government website.
corpus corpus-data pythainlp thai-language thai-nlp
Last synced: 13 Apr 2025
https://github.com/dohliam/hawaiian-corpus
Data from a corpus of written Hawaiian
bigrams corpora corpus corpus-data corpus-linguistics frequency frequency-list hawaii hawaiian hawaiian-electronic-library hawaiian-language n-grams ngram olelo-hawaii stoplist stopwords ulukau
Last synced: 05 Jan 2026
https://github.com/filipefilardi/text-mining
Generic corpus-cleaning script made with the tm package
20newsgroup corpora corpus-data machine-learning text-mining
Last synced: 30 May 2026
https://github.com/sagesolar/Corpus-of-Taylor-Swift
This is a dataset consisting of all song lyric words found on all of Taylor Swift's studio albums (up to and including TTPD), as well as a selection of other songs written by her.
corpus corpus-data song-dataset song-lyrics taylor-swift ttpd
Last synced: 17 Mar 2025
https://github.com/dcavar/antisemitismdatathon2020
This is project material for the Antisemitism Datathon and Hackathon 2020 at Indiana University
antisemitism corpus-data flair hatespeech machine-learning nltk python pytorch social-media spacy tensorflow twitter
Last synced: 04 Oct 2025
https://github.com/bjascob/smartlmvocabs
Improving Language Model Performance through Smart Vocabularies
corpus-data language-model ml neural-language-model python tensorflow
Last synced: 19 Jun 2025
https://github.com/0xdolan/kurdish_news
Kurdish News text corpus
corpus corpus-data data kurdi kurdish news text
Last synced: 16 Aug 2025
https://github.com/nevmenandr/bashkir-corpus
Texts for a corpus of the Bashkir language
bashkir corpus corpus-data minority-language texts
Last synced: 14 Apr 2025
https://github.com/clariah/wp6-missieven
General Missives in Text-Fabric
corpus-data corpus-linguistics corpus-processing corpus-tools dutch history nlp
Last synced: 22 Apr 2025
https://github.com/jonsafari/multiway-corpus
Build an n-way multilingual corpus
corpus-data machine-translation mt multilingual multiway-corpus zero-shot
Last synced: 27 Feb 2026
https://github.com/jean-baptiste-camps/geste
A corpus of chansons de geste
corpus corpus-data lemmatization old-french pos-tagging xml-tei
Last synced: 15 Feb 2026
https://github.com/sinaahmadi/ZazaGoraniCorpus
A corpus for the Zazaki and Gorani languages
computational-linguistics corpus corpus-data corpus-linguistics feyli gorani kurdish kurdish-language-processing less-resource-languages natural-language-processing southern-kurdish zazaki
Last synced: 07 May 2025
https://github.com/clemsciences/old_norse_notebook
Jupyter notebooks for learning how to use CLTK for text analysis of Old Norse
cltk corpus-data historical-linguistics jupyter-notebook old-norse runes
Last synced: 02 Sep 2025
https://github.com/mhenderson/thomashardyr
An R package for Thomas Hardy's novels.
corpus-data rstats rstats-package
Last synced: 09 Mar 2026
https://github.com/krzjoa/komentarze
A corpus of manually classified comments for machine learning (filtering offensive comments)
corpus corpus-data dataset json-data machine-learning-dataset
Last synced: 28 Jul 2025
https://github.com/nevmenandr/artlang-dani-el
Texts and a grammar description of a language, made for M. A. Daniel's birthday
Last synced: 20 Jan 2026
https://github.com/digitallinguistics/dft
Discourse Functional Transcription
corpora corpus corpus-data corpus-linguistics data-format digital-humanities digital-linguistics discourse dlx functionalism language linguistics transcription
Last synced: 05 Jan 2026
https://github.com/liao961120/dcard-corpus
Dcard post data for building a corpus
ckiptagger concordancer corpus corpus-data dcard traditional-chinese
Last synced: 11 Mar 2025
https://github.com/aurelius84/django_web
admin corpus-data django machine-learning-practice nginx uwsgi
Last synced: 17 Apr 2026
https://github.com/gederajeg/corplingr
Tidy concordances, collocates, and wordlist
corpus-data corpus-linguistics corpus-processing corpus-tools indonesian indonesian-language indonesian-linguistics leipzig-corpora-collection leipzig-corpus-files usage-based-linguistics
Last synced: 01 Apr 2025
https://github.com/p-marco/czech-it
A linguistic corpus of native Czech speakers learning Italian
computational-linguistics corpus-data corpus-linguistics digital-humanities italian-language linguistics
Last synced: 25 Jan 2026
https://github.com/soras/esttimemlcorpus
Estonian TimeML Annotated Corpus (Eesti keele TimeML märgendatud korpus)
corpus corpus-data corpus-tools dependency-syntax estonian estonian-language events timeml timex tlinks
Last synced: 27 Jan 2026
https://github.com/mhenderson/dhlawrencer
An R package for D. H. Lawrence's novels.
corpus-data english-literature rstats rstats-package
Last synced: 03 Dec 2025
https://github.com/mt-digital/metacorps
Web app and tools for quantitative analysis of metaphor in corpora
Last synced: 19 Jan 2026
https://github.com/mantzaris/benchmarkdatanlp.jl
Generate synthetic text using a variety of methods, e.g., Context-Free Grammars (CFGs), with parameterized complexity to test your NLP methods (such as LLMs)
corpus-data data-generation data-generator llm-training nlp
Last synced: 20 Jan 2026
https://github.com/quasilyte/eldb
Emacs Lisp corpus. Code collected from many projects for you to query.
corpus corpus-data dataset emacs-lisp query storage
Last synced: 20 Jan 2026
https://github.com/omr5221/python-text-analytics
corpus-data nltk-library python
Last synced: 17 May 2026
https://github.com/ketanmehra003/parallel-corpus-management-tool
This project is designed to help manage and analyze large corpora of text data. It provides tools for importing, processing, and querying text data efficiently.
corpus corpus-data corpus-processing corpus-tools django language-translator-api machine-learning python3
Last synced: 01 May 2026
https://github.com/oyale/eslema
Asturian language corpus for FreeLing
asturian corpus-data corpus-linguistics freeling linguistics
Last synced: 01 Mar 2026
https://github.com/artefactual-labs/rdss-archivematica-test-data-corpus
A collection of research dataset files used for testing Archivematica integration and functionality in the JISC Research Data Shared Service (RDSS).
archivematica corpus-data digital-preservation jisc research-data-management
Last synced: 13 Mar 2026
https://github.com/soras/esttimexcorpora
Estonian TIMEX Annotated Corpora (Eesti keele ajaväljendimärgendustega korpused)
corpora corpus-data timeml timex timex3
Last synced: 27 Jan 2026
https://github.com/assada/free-words
Data for/from NLP
corpus-data data nlp-machine-learning npl
Last synced: 26 Feb 2026
https://github.com/mhenderson/coco-data
Data pipeline for the coco-explorer app.
corpus-data corpus-linguistics data-package r-package
Last synced: 18 Mar 2026