Projects in Awesome Lists tagged with corpora
A curated list of projects in awesome lists tagged with corpora.
https://github.com/nltk/nltk_data
NLTK Data
corpora linguistics natural-language-processing nlp nltk
Last synced: 13 May 2025
https://github.com/juand-r/entity-recognition-datasets
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
annotations corpora datasets entity-extraction entity-recognition named-entity-recognition natural-language-processing ner nlp nlp-resources
Last synced: 14 May 2025
https://github.com/piskvorky/gensim-data
Data repository for pretrained NLP models and NLP corpora.
corpora dataset gensim glove-model lda-model lsi-model pretrained-models word2vec-model
Last synced: 04 Apr 2025
https://github.com/nonamestreet/weixin_public_corpus
WeChat public account corpus
chinese-nlp corpora corpus linguistics natural-language-processing nlp wei-xin weixin weixin-data yu-liao yu-liao-ku
Last synced: 20 Nov 2025
https://github.com/PlanTL-GOB-ES/lm-spanish
Official source for Spanish language models and resources made at BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).
benchmarks corpora embeddings language-model nlp transformers
Last synced: 13 Jul 2025
https://github.com/canclid/awesome-cantonese-nlp
A curated list of resources dedicated to Natural Language Processing (NLP) of Cantonese | 粵語 NLP
Last synced: 26 Feb 2026
https://github.com/saidziani/Arabic-News-Article-Classification
Automatic categorization of documents consists of assigning a category to a text based on the information it contains. We'll follow different supervised machine learning approaches.
arabic-language arabic-nlp corpora machine-learning nlp nltk python3 text-categorization
Last synced: 07 May 2025
https://github.com/saidziani/arabic-news-article-classification
Automatic categorization of documents consists of assigning a category to a text based on the information it contains. We'll follow different supervised machine learning approaches.
arabic-language arabic-nlp corpora machine-learning nlp nltk python3 text-categorization
Last synced: 21 Mar 2025
https://github.com/josecannete/spanish-corpora
Unannotated Spanish 3 Billion Words Corpora
corpora linguistics natural-language-processing nlp spanish spanish-language
Last synced: 02 Apr 2026
https://github.com/czcorpus/kontext
An advanced, extensible web front-end for the Manatee-open corpus search engine
corpora corpus-linguistics corpus-tools user-interface
Last synced: 07 Feb 2026
https://github.com/kgjerde/corporaexplorer
An R package for dynamic exploration of text collections
corpora corpus r shiny text-analysis
Last synced: 22 Oct 2025
https://github.com/juliatext/corpusloaders.jl
A variety of loaders for various NLP corpora.
Last synced: 21 Oct 2025
https://github.com/digitallinguistics/data-format
The Data Format for Digital Linguistics (DaFoDiL)
corpora corpus-linguistics daffodil digital-humanities digital-linguistics dlx dlx-format json json-schema language languages linguistics natural-language schema
Last synced: 12 Apr 2025
https://github.com/korenyoni/opus-api
OPUS (opus.nlpl.eu) Python3 API
api corpora corporate corpus language-model machine-learning opus parallel-corpora parallel-corpus python
Last synced: 15 Apr 2025
https://github.com/dohliam/hawaiian-corpus
Data from a corpus of written Hawaiian
bigrams corpora corpus corpus-data corpus-linguistics frequency frequency-list hawaii hawaiian hawaiian-electronic-library hawaiian-language n-grams ngram olelo-hawaii stoplist stopwords ulukau
Last synced: 05 Jan 2026
https://github.com/filipefilardi/text-mining
Generic corpus-cleaning script made with the tm package
20newsgroup corpora corpus-data machine-learning text-mining
Last synced: 30 May 2026
https://github.com/jonsafari/habeas-corpus
Command-line corpus tools
command-line-tools corpora corpus corpus-linguistics text-corpus vocabulary
Last synced: 06 Nov 2025
https://github.com/czcorpus/wag
WaG - install your own word profile generator out of diverse data resources
corpora data-aggregation dictionaries language-resources linguistics portal react rxjs typescript visualization
Last synced: 07 Feb 2026
https://github.com/alexeykosh/lingcorpora.py
API for corpora
api corpora corpus national-corpus package
Last synced: 19 Jul 2025
https://github.com/writecrow/corpus_text_processor
A desktop application for preparing files for use in a corpus
corpora corpus-linguistics desktop-app text-processing
Last synced: 24 Aug 2025
https://github.com/danieldk/conllx-utils
CoNLL-X utilities
conll corpora cycle partitioning treebanks
Last synced: 07 Oct 2025
https://github.com/tanaikech/corporaapp
This is a Google Apps Script library for managing the corpora of the Gemini API.
corpora gemini gemini-api google-apps-script google-apps-script-library semantic-search
Last synced: 14 Mar 2026
https://github.com/qanastek/antilles
ANTILLES : An Open French Linguistically Enriched Part-of-Speech Corpus
bert corpora corpus flair flair-embeddings huggingface natural-language-processing part-of-speech part-of-speech-tagger part-of-speech-tagging transformers
Last synced: 30 Apr 2025
https://github.com/khashashin/chechen_corpora
This repository contains the source code for the Chechen Language Corpora website.
Last synced: 07 May 2025
https://github.com/zsxkib/ttds-g35-cw3
TTDS Group Project: Video Games Search Engine. Sakib Ahamed, Dan Buxton, Kenza Amira, Wini Lau, Mansoor Ahmad
corpora data-science neural-ranking-models pagerank query search-engine technologies text text-analysis text-classification ttds web-search
Last synced: 10 Apr 2025
https://github.com/litee/tts-asr-corpora
Catalogue of TTS and ASR corpora that can be used for machine learning
asr corpora corpus corpus-linguistics machine-learning text-to-speech tts
Last synced: 03 Jan 2026
https://github.com/made2591/cognitive-system-postagger
A POS-tagging library with Viterbi, CYK and an SVO -> XSV translator, made as part of my final exam for the Cognitive System course in the Department of Computer Science.
cky cognitive-services cognitive-systems computer-science corpora cyk department lemmatizer nlp nlp-library nlp-parsing nlp-stemming nltk nltk-grammar nlu postagger postagging sentence stemmer viterbi
Last synced: 31 May 2026
https://github.com/zlib-ng/corpora
Common corpora used for lossless compression testing and benchmarking.
Last synced: 02 Feb 2026
https://github.com/digitallinguistics/tools
Tools for Linguistic Productivity (TooLiP)
corpora corpus-linguistics digital-humanities digital-linguistics dlx language-documentation linguistics transliteration
Last synced: 23 Mar 2025
https://github.com/digitallinguistics/data-explorer
The DLx portal for viewing, searching, and aggregating data
corpora corpus corpus-linguistics digital-humanities documentary-linguistics language-documentation linguistics
Last synced: 09 Oct 2025
https://github.com/fostroll/corpuscula
Toolkit that simplifies corpus processing
conllu corpora natural-language-processing nlp universal-dependencies
Last synced: 10 Aug 2025
https://github.com/dwhieb/nuuchahnulth
Linguistic data on the Nuuchahnulth (Wakashan) language
corpora corpus corpus-linguistics documentary-linguistics language-documentation linguistics nuuchahnulth wakashan
Last synced: 06 Apr 2025
https://github.com/czcorpus/mquery
An HTTP API server for mining language corpora using Manatee-Open engine.
api-rest-server corpora linguistics service
Last synced: 07 Feb 2026
https://github.com/alhadis/silos
Dumping ground of search results collected for GitHub Linguist.
corpora file-formats github-linguist harvester linguist
Last synced: 04 Jan 2026
https://github.com/neroist/nimcorpora
A Nim interface for Darius Kazemi's Corpora project.
Last synced: 30 Mar 2025
https://github.com/kimaruthagna/nlp_tuts
Sample scripts that show the use of NLP in Python. Some will be proofs of concept, while others will be tutorials.
bag-of-words corpora frequency-distibution lemmatization matplotlib nlp nltk nltk-python pos-ta seaborn sentiment-analysis sentiment-polarity stemmer text-visualization tutorial word-cloud word2doc word2vec word2vec-algorithm xgboost
Last synced: 14 Jun 2025
https://github.com/digitallinguistics/app
The Lotus web app for managing linguistic data
corpora corpus corpus-linguistics descriptive-linguistics digital-humanities digital-linguistics dlx documentary-linguistics language language-description language-documentation languages lexicography lexicon linguistics
Last synced: 19 Oct 2025
https://github.com/digitallinguistics/dft
Discourse Functional Transcription
corpora corpus corpus-data corpus-linguistics data-format digital-humanities digital-linguistics discourse dlx functionalism language linguistics transcription
Last synced: 05 Jan 2026
https://github.com/dohliam/corpus-tools
A collection of scripts for working with multilingual text corpora
corpora corpus corpus-linguistics frequency language linguistics ngram ngrams ruby salience stoplist stopwords
Last synced: 21 Mar 2025
https://github.com/yash22222/terrorist-activity-forecasting-and-risk-assessment-system
In an era marked by global security challenges, the "TAFRAS" emerges as a cutting-edge solution to tackle the ever-evolving threat of terrorism. The project is grounded in the urgent need for predictive systems that can anticipate, assess, and mitigate potential terrorist activities.
corpora data-vizualisation folium-maps gensim global-terrorism-database lda machine-learning matplotlib networkx nltk nmf numpy pandas python random-forest-classifier seaborn sklearn spacy textblob vader-sentiment-analysis
Last synced: 08 Apr 2026
https://github.com/digitallinguistics/concordance
A Node.js library for performing concordance-related tasks on a corpus in DLx JSON format
corpora corpus corpus-linguistics digital-linguistics dlx linguistics
Last synced: 07 May 2025
https://github.com/dsfsi/puodata
Curated corpora for Setswana. Used to train PuoBERTa.
african-languages african-nlp corpora dsfsi-datasets natural-language-processing setswana south-africa tn tsn
Last synced: 30 Jan 2026
https://github.com/czcorpus/mquery-sru
CLARIN FCS 2.0 Endpoint for Manatee-open corpus search engine
clarin corpora fcs fcs-endpoint linguistics
Last synced: 07 Feb 2026
https://github.com/arne-cl/bislama-resources
Bislama language resources
bislama corpora language language-resources underresourced-languages vanuatu
Last synced: 09 Feb 2026
https://github.com/dwhieb/dissertation
My Ph.D. dissertation in linguistics at the University of California, Santa Barbara
corpora corpus-linguistics functionalism language lexical-categories lexical-flexibility lexicography linguistics parts-of-speech typology word-classes
Last synced: 07 Jan 2026
https://github.com/maidis/turkish-parallel-corpora
Turkish Parallel Corpora
corpora corpus english machine-translation nlp parallel-texts turkish
Last synced: 10 Oct 2025
https://github.com/ivansabik/chairum-corpus
Collection of text corpora for speeches from Mexican presidents Andres Manuel Lopez Obrador (AMLO) and Claudia Sheinbaum sourced from YouTube. The dataset includes their daily morning conferences (mañaneras).
amlo andres-manuel-lopez-obrador claudia-sheinbaum corpora corpus government-data latinamerica mexico-datos open-data political-data political-data-analysis political-datasets political-science political-speech-transcripts political-speeches populism spanish-language
Last synced: 26 Jan 2026
https://github.com/acqdiv/acqdiv
Pipeline for the ACQDIV Corpus Database
child-language corpora corpus-linguistics cross-linguistic-data databases language-acquisition linguistics linguistics-databases typology
Last synced: 24 Apr 2026
https://github.com/czcorpus/ictools
A program for calculating corpora alignments using a pivot language
cmd corpora corpus linguistics manatee-open parallel-corpora translation
Last synced: 07 Feb 2026
https://github.com/writecrow/crow_frontend
The user interface for the Corpus & Repository of Writing, built in Angular
angular corpora corpus corpus-builder corpus-linguistics natural-language-processing
Last synced: 18 May 2026
https://github.com/digitallinguistics/tags2dlx
A JavaScript (Node.js) library that converts a tagged (monolinear) text to DLx JSON format
corpora corpus corpus-linguistics digital-linguistics dlx linguistics
Last synced: 08 May 2026
https://github.com/timxor/corpora
📚 A collection of ML datasets & corpora
corpora corpus datasets lfw-dataset machine-learning
Last synced: 06 Sep 2025
https://github.com/miweru/vrt_spacy
corpora linguistic-corpora linguistics nlp spacy vrt wrapper
Last synced: 03 May 2026
https://github.com/dohliam/aligned-corpus-search
Simple aligned corpus search tool
Last synced: 16 Aug 2025
https://github.com/dsfsi/za-isizulu-siswati-news-2022
IsiZulu News (articles and headlines) and Siswati News (headlines) Corpora - za-isizulu-siswati-news-2022
african-nlp corpora dsfsi-datasets low-resource-languages natural-language-processing news-categorizer south-africa
Last synced: 02 Jan 2026
https://github.com/richardlitt/fortune-cookie-corpus
A growing corpus of fortune cookies (for NLP and fun). Add your fortunes!
corpora corpus corpus-linguistics fortune fortune-cookie fortune-cookies
Last synced: 15 Sep 2025
https://github.com/soras/esttimexcorpora
Estonian TIMEX Annotated Corpora \ Eesti keele ajaväljendimärgendustega korpused
corpora corpus-data timeml timex timex3
Last synced: 27 Jan 2026
https://github.com/qanastek/ner-mmtd
Named-entity recognition corpora for multilingual voice recognition in the music industry based on the Million Musical Tweets dataset
corpora dataset english french million-musical-tweets mmtd music named-entity-recognition ner neural-network recognition voice
Last synced: 22 Apr 2026
https://github.com/miweru/vrt_generator
Python class for creating vrt-annotated corpora
corpora linguistic-corpora linguistics vrt wrapper
Last synced: 21 Apr 2026
https://github.com/ololobus/slavic_text_scht
St. Petersburg corpus of hagiographic texts
corpora hagiographic-texts linguistics slavic-languages
Last synced: 17 Aug 2025
https://github.com/ggteixeira/corpus-cleaner
Linguistic tool (made by a linguist, for linguists) that scrapes corpora, automatically cleans them up, and generates n-grams.
beautifulsoup4 bs4 corpora corpus corpus-linguistics crawler linguistics nlp python scraper web-scraping
Last synced: 28 Feb 2025
https://github.com/jamnicki/split-corpus
Split-corpus package that divides text corpora into meaningful parts as close to a specified size as possible.
corpora corpus-processing large-files natural-language-processing nlp processing
Last synced: 29 Apr 2026