An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with corpora

A curated list of projects in awesome lists tagged with corpora.

https://github.com/juand-r/entity-recognition-datasets

A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.

annotations corpora datasets entity-extraction entity-recognition named-entity-recognition natural-language-processing ner nlp nlp-resources

Last synced: 14 May 2025

https://github.com/piskvorky/gensim-data

Data repository for pretrained NLP models and NLP corpora.

corpora dataset gensim glove-model lda-model lsi-model pretrained-models word2vec-model

Last synced: 04 Apr 2025

https://github.com/natasha/corus

Links to Russian corpora + Python functions for loading and parsing

corpora datasets nlp python russian

Last synced: 04 Apr 2025

https://github.com/PlanTL-GOB-ES/lm-spanish

Official source for Spanish language models and resources made at BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

benchmarks corpora embeddings language-model nlp transformers

Last synced: 13 Jul 2025

https://github.com/canclid/awesome-cantonese-nlp

A curated list of resources dedicated to Natural Language Processing (NLP) of Cantonese | 粵語 NLP

cantonese corpora corpus nlp

Last synced: 26 Feb 2026

https://github.com/saidziani/Arabic-News-Article-Classification

Automatic document categorization consists of assigning a category to a text based on the information it contains. We'll follow different supervised machine learning approaches.

arabic-language arabic-nlp corpora machine-learning nlp nltk python3 text-categorization

Last synced: 07 May 2025

https://github.com/saidziani/arabic-news-article-classification

Automatic document categorization consists of assigning a category to a text based on the information it contains. We'll follow different supervised machine learning approaches.

arabic-language arabic-nlp corpora machine-learning nlp nltk python3 text-categorization

Last synced: 21 Mar 2025

https://github.com/czcorpus/kontext

An advanced, extensible web front-end for the Manatee-open corpus search engine

corpora corpus-linguistics corpus-tools user-interface

Last synced: 07 Feb 2026

https://github.com/kgjerde/corporaexplorer

An R package for dynamic exploration of text collections

corpora corpus r shiny text-analysis

Last synced: 22 Oct 2025

https://github.com/juliatext/corpusloaders.jl

A variety of loaders for various NLP corpora.

corpora nlp

Last synced: 21 Oct 2025

https://github.com/filipefilardi/text-mining

Generic corpus-cleaning script made with the tm package

20newsgroup corpora corpus-data machine-learning text-mining

Last synced: 30 May 2026

https://github.com/czcorpus/wag

WaG - install your own word profile generator out of diverse data resources

corpora data-aggregation dictionaries language-resources linguistics portal react rxjs typescript visualization

Last synced: 07 Feb 2026

https://github.com/writecrow/corpus_text_processor

A desktop application for preparing files for use in a corpus

corpora corpus-linguistics desktop-app text-processing

Last synced: 24 Aug 2025

https://github.com/tanaikech/corporaapp

This is a Google Apps Script library for managing the corpora of Gemini API.

corpora gemini gemini-api google-apps-script google-apps-script-library semantic-search

Last synced: 14 Mar 2026

https://github.com/khashashin/chechen_corpora

This repository contains the source code for the Chechen Language Corpora website.

chechen corpora corpus nlp

Last synced: 07 May 2025

https://github.com/richardlitt/gaelic-resources

A list of computational resources for Gaelic

corpora corpus gaelic irish language nlp resources scots scottish scottish-gaelic

Last synced: 07 Jan 2026

https://github.com/zsxkib/ttds-g35-cw3

TTDS Group Project: Video Games Search Engine. Sakib Ahamed, Dan Buxton, Kenza Amira, Wini Lau, Mansoor Ahmad

corpora data-science neural-ranking-models pagerank query search-engine technologies text text-analysis text-classification ttds web-search

Last synced: 10 Apr 2025

https://github.com/litee/tts-asr-corpora

Catalogue of TTS and ASR corpora that can be used for machine learning

asr corpora corpus corpus-linguistics machine-learning text-to-speech tts

Last synced: 03 Jan 2026

https://github.com/corpora-inc/corpora

Corpora is a self-building corpus that can help build other arbitrary corpora

agpl ai api cli corpora corpus django markdown monorepo openapi pgvector postgresql python rust

Last synced: 16 Jun 2025

https://github.com/skyl/corpora

Corpora is a self-building corpus that can help build other arbitrary corpora

agpl ai api cli corpora corpus django markdown monorepo openapi pgvector postgresql python rust

Last synced: 12 Apr 2025

https://github.com/made2591/cognitive-system-postagger

A POS-tagging library with Viterbi, CYK, and an SVO -> XSV translator, made as part of my final exam for the Cognitive System course in the Department of Computer Science.

cky cognitive-services cognitive-systems computer-science corpora cyk department lemmatizer nlp nlp-library nlp-parsing nlp-stemming nltk nltk-grammar nlu postagger postagging sentence stemmer viterbi

Last synced: 31 May 2026

https://github.com/zlib-ng/corpora

Common corpora used for lossless compression testing and benchmarking.

compression corpora testing

Last synced: 02 Feb 2026

https://github.com/fostroll/corpuscula

Toolkit that simplifies corpus processing

conllu corpora natural-language-processing nlp universal-dependencies

Last synced: 10 Aug 2025

https://github.com/czcorpus/mquery

An HTTP API server for mining language corpora using Manatee-Open engine.

api-rest-server corpora linguistics service

Last synced: 07 Feb 2026

https://github.com/alhadis/silos

Dumping ground of search results collected for GitHub Linguist.

corpora file-formats github-linguist harvester linguist

Last synced: 04 Jan 2026

https://github.com/neroist/nimcorpora

A Nim interface for Darius Kazemi's Corpora project.

corpora nim

Last synced: 30 Mar 2025

https://github.com/dohliam/corpus-tools

A collection of scripts for working with multilingual text corpora

corpora corpus corpus-linguistics frequency language linguistics ngram ngrams ruby salience stoplist stopwords

Last synced: 21 Mar 2025

https://github.com/yash22222/terrorist-activity-forecasting-and-risk-assessment-system

In an era marked by global security challenges, the "TAFRAS" emerges as a cutting-edge solution to tackle the ever-evolving threat of terrorism. The project is grounded in the urgent need for predictive systems that can anticipate, assess, and mitigate potential terrorist activities.

corpora data-vizualisation folium-maps gensim global-terrorism-database lda machine-learning matplotlib networkx nltk nmf numpy pandas python random-forest-classifier seaborn sklearn spacy textblob vader-sentiment-analysis

Last synced: 08 Apr 2026

https://github.com/digitallinguistics/concordance

A Node.js library for performing concordance-related tasks on a corpus in DLx JSON format

corpora corpus corpus-linguistics digital-linguistics dlx linguistics

Last synced: 07 May 2025

https://github.com/czcorpus/mquery-sru

CLARIN FCS 2.0 Endpoint for Manatee-open corpus search engine

clarin corpora fcs fcs-endpoint linguistics

Last synced: 07 Feb 2026

https://github.com/dwhieb/dissertation

My Ph.D. dissertation in linguistics at the University of California, Santa Barbara

corpora corpus-linguistics functionalism language lexical-categories lexical-flexibility lexicography linguistics parts-of-speech typology word-classes

Last synced: 07 Jan 2026

https://github.com/ivansabik/chairum-corpus

Collection of text corpora for speeches from Mexican presidents Andres Manuel Lopez Obrador (AMLO) and Claudia Sheinbaum sourced from YouTube. The dataset includes their daily morning conferences (mañaneras).

amlo andres-manuel-lopez-obrador claudia-sheinbaum corpora corpus government-data latinamerica mexico-datos open-data political-data political-data-analysis political-datasets political-science political-speech-transcripts political-speeches populism spanish-language

Last synced: 26 Jan 2026

https://github.com/czcorpus/ictools

A program for calculating corpora alignments using a pivot language

cmd corpora corpus linguistics manatee-open parallel-corpora translation

Last synced: 07 Feb 2026

https://github.com/writecrow/crow_frontend

The user interface for the Corpus & Repository of Writing, built in Angular

angular corpora corpus corpus-builder corpus-linguistics natural-language-processing

Last synced: 18 May 2026

https://github.com/digitallinguistics/tags2dlx

A JavaScript (Node.js) library that converts a tagged (monolinear) text to DLx JSON format

corpora corpus corpus-linguistics digital-linguistics dlx linguistics

Last synced: 08 May 2026

https://github.com/timxor/corpora

📚 A collection of ML datasets & corpora

corpora corpus datasets lfw-dataset machine-learning

Last synced: 06 Sep 2025

https://github.com/dohliam/aligned-corpus-search

Simple aligned corpus search tool

corpora corpus

Last synced: 16 Aug 2025

https://github.com/dsfsi/za-isizulu-siswati-news-2022

IsiZulu News (articles and headlines) and Siswati News (headlines) Corpora - za-isizulu-siswati-news-2022

african-nlp corpora dsfsi-datasets low-resource-languages natural-language-processing news-categorizer south-africa

Last synced: 02 Jan 2026

https://github.com/richardlitt/fortune-cookie-corpus

A growing corpus of fortune cookies (for NLP and fun). Add your fortunes!

corpora corpus corpus-linguistics fortune fortune-cookie fortune-cookies

Last synced: 15 Sep 2025

https://github.com/soras/esttimexcorpora

Estonian TIMEX Annotated Corpora \ Eesti keele ajaväljendimärgendustega korpused

corpora corpus-data timeml timex timex3

Last synced: 27 Jan 2026

https://github.com/qanastek/ner-mmtd

Named-entity recognition corpora for multilingual voice recognition in the music industry based on the Million Musical Tweets dataset

corpora dataset english french million-musical-tweets mmtd music named-entity-recognition ner neural-network recognition voice

Last synced: 22 Apr 2026

https://github.com/miweru/vrt_generator

Python class for creating vrt-annotated corpora

corpora linguistic-corpora linguistics vrt wrapper

Last synced: 21 Apr 2026

https://github.com/ololobus/slavic_text_scht

St. Petersburg corpus of hagiographic texts

corpora hagiographic-texts linguistics slavic-languages

Last synced: 17 Aug 2025

https://github.com/ggteixeira/corpus-cleaner

Linguistic tool (made by a linguist, for linguists) that scrapes corpora, automatically cleans them up, and generates n-grams.

beautifulsoup4 bs4 corpora corpus corpus-linguistics crawler linguistics nlp python scraper web-scraping

Last synced: 28 Feb 2025

https://github.com/jamnicki/split-corpus

The split-corpus package divides text corpora into meaningful parts as close to a specified size as possible.

corpora corpus-processing large-files natural-language-processing nlp processing

Last synced: 29 Apr 2026