Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with corpora
A curated list of projects in awesome lists tagged with corpora.
https://github.com/juand-r/entity-recognition-datasets
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
annotations corpora datasets entity-extraction entity-recognition named-entity-recognition natural-language-processing ner nlp nlp-resources
Last synced: 19 Dec 2024
https://github.com/nltk/nltk_data
NLTK Data
corpora linguistics natural-language-processing nlp nltk
Last synced: 17 Dec 2024
https://github.com/piskvorky/gensim-data
Data repository for pretrained NLP models and NLP corpora.
corpora dataset gensim glove-model lda-model lsi-model pretrained-models word2vec-model
Last synced: 18 Dec 2024
https://github.com/PlanTL-GOB-ES/lm-spanish
Official source for Spanish language models and resources made at BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).
benchmarks corpora embeddings language-model nlp transformers
Last synced: 22 Nov 2024
https://github.com/saidziani/arabic-news-article-classification
Automatic categorization of documents consists of assigning a category to a text based on the information it contains. We'll follow different supervised machine learning approaches.
arabic-language arabic-nlp corpora machine-learning nlp nltk python3 text-categorization
Last synced: 28 Oct 2024
https://github.com/saidziani/Arabic-News-Article-Classification
Automatic categorization of documents consists of assigning a category to a text based on the information it contains. We'll follow different supervised machine learning approaches.
arabic-language arabic-nlp corpora machine-learning nlp nltk python3 text-categorization
Last synced: 14 Nov 2024
https://github.com/kgjerde/corporaexplorer
An R package for dynamic exploration of text collections
corpora corpus r shiny text-analysis
Last synced: 22 Nov 2024
https://github.com/digitallinguistics/data-format
The Data Format for Digital Linguistics (DaFoDiL)
corpora corpus-linguistics daffodil digital-humanities digital-linguistics dlx dlx-format json json-schema language languages linguistics natural-language schema
Last synced: 14 Oct 2024
https://github.com/korenyoni/opus-api
OPUS (opus.nlpl.eu) Python3 API
api corpora corporate corpus language-model machine-learning opus parallel-corpora parallel-corpus python
Last synced: 01 Nov 2024
https://github.com/dohliam/hawaiian-corpus
Data from a corpus of written Hawaiian
bigrams corpora corpus corpus-data corpus-linguistics frequency frequency-list hawaii hawaiian hawaiian-electronic-library hawaiian-language n-grams ngram olelo-hawaii stoplist stopwords ulukau
Last synced: 27 Nov 2024
https://github.com/alexeykosh/lingcorpora.py
API for corpora
api corpora corpus national-corpus package
Last synced: 27 Nov 2024
https://github.com/danieldk/conllx-utils
CoNLL-X utilities
conll corpora cycle partitioning treebanks
Last synced: 02 Dec 2024
https://github.com/writecrow/corpus_text_processor
A desktop application for preparing files for use in a corpus
corpora corpus-linguistics desktop-app text-processing
Last synced: 26 Nov 2024
https://github.com/khashashin/chechen_corpora
This repository contains the source code for the Chechen Language Corpora website.
Last synced: 02 Nov 2024
https://github.com/qanastek/antilles
ANTILLES: An Open French Linguistically Enriched Part-of-Speech Corpus
bert corpora corpus flair flair-embeddings huggingface natural-language-processing part-of-speech part-of-speech-tagger part-of-speech-tagging transformers
Last synced: 17 Nov 2024
https://github.com/made2591/cognitive-system-postagger
A POS-tagging library with Viterbi, CYK, and an SVO -> XSV translator, made as part of my final exam for the Cognitive System course in the Department of Computer Science.
cky cognitive-services cognitive-systems computer-science corpora cyk department lemmatizer nlp nlp-library nlp-parsing nlp-stemming nltk nltk-grammar nlu postagger postagging sentence stemmer viterbi
Last synced: 13 Nov 2024
https://github.com/litee/tts-asr-corpora
Catalogue of TTS and ASR corpora that can be used for machine learning
asr corpora corpus corpus-linguistics machine-learning text-to-speech tts
Last synced: 21 Nov 2024
https://github.com/dwhieb/nuuchahnulth
Linguistic data on the Nuuchahnulth (Wakashan) language
corpora corpus corpus-linguistics documentary-linguistics language-documentation linguistics nuuchahnulth wakashan
Last synced: 20 Dec 2024
https://github.com/fostroll/corpuscula
Toolkit that simplifies corpus processing
conllu corpora natural-language-processing nlp universal-dependencies
Last synced: 21 Dec 2024
https://github.com/zsxkib/ttds-g35-cw3
TTDS Group Project: Video Games Search Engine. Sakib Ahamed, Dan Buxton, Kenza Amira, Wini Lau, Mansoor Ahmad.
corpora data-science neural-ranking-models pagerank query search-engine technologies text text-analysis text-classification ttds web-search
Last synced: 30 Oct 2024
https://github.com/digitallinguistics/tools
Tools for Linguistic Productivity (TooLiP)
corpora corpus-linguistics digital-humanities digital-linguistics dlx language-documentation linguistics transliteration
Last synced: 30 Nov 2024
https://github.com/digitallinguistics/data-explorer
The DLx portal for viewing, searching, and aggregating data
corpora corpus corpus-linguistics digital-humanities documentary-linguistics language-documentation linguistics
Last synced: 30 Nov 2024
https://github.com/digitallinguistics/concordance
A Node.js library for performing concordance-related tasks on a corpus in DLx JSON format
corpora corpus corpus-linguistics digital-linguistics dlx linguistics
Last synced: 30 Nov 2024
https://github.com/kimaruthagna/nlp_tuts
Sample scripts that show the use of NLP in Python. Some will be proofs of concept, while others will be tutorials.
bag-of-words corpora frequency-distibution lemmatization matplotlib nlp nltk nltk-python pos-ta seaborn sentiment-analysis sentiment-polarity stemmer text-visualization tutorial word-cloud word2doc word2vec word2vec-algorithm xgboost
Last synced: 24 Nov 2024
https://github.com/digitallinguistics/dft
Discourse Functional Transcription
corpora corpus corpus-data corpus-linguistics data-format digital-humanities digital-linguistics discourse dlx functionalism language linguistics transcription
Last synced: 30 Nov 2024
https://github.com/yash22222/terrorist-activity-forecasting-and-risk-assessment-system
In an era marked by global security challenges, "TAFRAS" emerges as a cutting-edge solution to tackle the ever-evolving threat of terrorism. The project is grounded in the urgent need for predictive systems that can anticipate, assess, and mitigate potential terrorist activities.
corpora data-vizualisation folium-maps gensim global-terrorism-database lda machine-learning matplotlib networkx nltk nmf numpy pandas python random-forest-classifier seaborn sklearn spacy textblob vader-sentiment-analysis
Last synced: 09 Nov 2024
https://github.com/digitallinguistics/app
The Lotus web app for managing linguistic data
corpora corpus corpus-linguistics descriptive-linguistics digital-humanities digital-linguistics dlx documentary-linguistics language language-description language-documentation languages lexicography lexicon linguistics
Last synced: 30 Nov 2024
https://github.com/alhadis/silos
Dumping ground of search results collected for GitHub Linguist.
corpora file-formats github-linguist harvester linguist
Last synced: 20 Dec 2024
https://github.com/tanaikech/corporaapp
This is a Google Apps Script library for managing the corpora of Gemini API.
corpora gemini gemini-api google-apps-script google-apps-script-library semantic-search
Last synced: 11 Nov 2024
https://github.com/dohliam/corpus-tools
A collection of scripts for working with multilingual text corpora
corpora corpus corpus-linguistics frequency language linguistics ngram ngrams ruby salience stoplist stopwords
Last synced: 27 Nov 2024
https://github.com/neroist/nimcorpora
A Nim interface for Darius Kazemi's Corpora project.
Last synced: 12 Dec 2024
https://github.com/zlib-ng/corpora
Common corpora used for lossless compression testing and benchmarking.
Last synced: 07 Nov 2024
https://github.com/arne-cl/bislama-resources
Bislama language resources
bislama corpora language language-resources underresourced-languages vanuatu
Last synced: 10 Nov 2024
https://github.com/writecrow/crow_frontend
The user interface for the Corpus & Repository of Writing, built in Angular
angular corpora corpus corpus-builder corpus-linguistics natural-language-processing
Last synced: 26 Nov 2024
https://github.com/dwhieb/dissertation
My Ph.D. dissertation in linguistics at the University of California, Santa Barbara
corpora corpus-linguistics functionalism language lexical-categories lexical-flexibility lexicography linguistics parts-of-speech typology word-classes
Last synced: 08 Dec 2024
https://github.com/digitallinguistics/tags2dlx
A JavaScript (Node.js) library that converts a tagged (monolinear) text to DLx JSON format
corpora corpus corpus-linguistics digital-linguistics dlx linguistics
Last synced: 30 Nov 2024
https://github.com/jamnicki/split-corpus
Split-corpus is a package that divides text corpora into meaningful parts as close to a specified size as possible.
corpora corpus-processing large-files natural-language-processing nlp processing
Last synced: 21 Dec 2024
https://github.com/ololobus/slavic_text_scht
St. Petersburg corpus of hagiographic texts
corpora hagiographic-texts linguistics slavic-languages
Last synced: 21 Dec 2024
https://github.com/ggteixeira/corpus-cleaner
Linguistic tool (made by a linguist, for linguists) that scrapes corpora, automatically cleans them up, and generates n-grams.
beautifulsoup4 bs4 corpora corpus corpus-linguistics crawler linguistics nlp python scraper web-scraping
Last synced: 12 Nov 2024
https://github.com/qanastek/ner-mmtd
Named-entity recognition corpora for multilingual voice recognition in the music industry based on the Million Musical Tweets dataset
corpora dataset english french million-musical-tweets mmtd music named-entity-recognition ner neural-network recognition voice
Last synced: 17 Nov 2024
https://github.com/miweru/vrt_generator
Python class for creating vrt-annotated corpora
corpora linguistic-corpora linguistics vrt wrapper
Last synced: 17 Nov 2024
https://github.com/miweru/vrt_spacy
corpora linguistic-corpora linguistics nlp spacy vrt wrapper
Last synced: 07 Dec 2024
https://github.com/richardlitt/fortune-cookie-corpus
A growing corpus of fortune cookies (for NLP and fun). Add your fortunes!
corpora corpus corpus-linguistics fortune fortune-cookie fortune-cookies
Last synced: 05 Dec 2024
https://github.com/dohliam/aligned-corpus-search
Simple aligned corpus search tool
Last synced: 27 Nov 2024