Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with corpus

A curated list of projects in awesome lists tagged with corpus.

https://github.com/dariusk/corpora

A collection of small corpuses of interesting data for the creation of bots and similar stuff.

bots corpus language words

Last synced: 30 Sep 2024

https://github.com/cluebenchmark/clue

Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard

albert benchmark bert chinese chineseglue corpus dataset glue language-model nlu pretrained-models pytorch roberta tensorflow transformers

Last synced: 01 Oct 2024

https://github.com/wainshine/chinese-names-corpus

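Chinese personal names corpus. Name generator. Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Useful for Chinese word segmentation and person-name entity recognition.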

corpus dataset dict names ner

Last synced: 30 Sep 2024

https://github.com/CLUEbenchmark/CLUE

Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard

albert benchmark bert chinese chineseglue corpus dataset glue language-model nlu pretrained-models pytorch roberta tensorflow transformers

Last synced: 31 Jul 2024

https://github.com/wainshine/Chinese-Names-Corpus

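Chinese personal names corpus. Name generator. Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Useful for Chinese word segmentation and person-name entity recognition.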

corpus dataset dict names ner

Last synced: 31 Jul 2024

https://github.com/lucasjinreal/weibo_terminater

The final Weibo crawler: scrape anything from Weibo, including comments, post contents, and followers. The Terminator.

chatbot chinese corpus scraper sina weibo

Last synced: 30 Sep 2024

https://github.com/candlewill/dialog_corpus

Corpora for training Chinese and English dialog systems (datasets for training chatbot systems)

chatbot corpus dataset dialog system

Last synced: 30 Sep 2024

https://github.com/gunthercox/chatterbot-corpus

A multilingual dialog corpus

chatterbot corpus dialog language yaml

Last synced: 29 Sep 2024

https://github.com/wainshine/Company-Names-Corpus

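Company names corpus. Organization names corpus. Company short names, abbreviations, brand terms, and enterprise names. Useful for Chinese word segmentation and organization-name entity recognition.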

company corpus dataset dict ner

Last synced: 01 Aug 2024

https://github.com/NiuTrans/Classical-Modern

A very comprehensive parallel corpus of Classical Chinese (ancient texts) and modern Chinese

corpus parallel-corpus traditional-and-simplified-chinese traditional-chinese

Last synced: 03 Aug 2024

https://github.com/CLUEbenchmark/CLUECorpus2020

Large-scale Pre-training Corpus for Chinese (100G)

albert bert chinese chinese-corpus corpus datasets nlp pretrain roberta

Last synced: 03 Aug 2024

https://github.com/tensorlayer/seq2seq-chatbot

Chatbot in 200 lines of code using TensorLayer

bot chat chatbot corpus lstm nlp python rnn tensorflow tensorlayer

Last synced: 31 Jul 2024

https://github.com/quanteda/quanteda

An R package for the Quantitative Analysis of Textual Data

corpus natural-language-processing quanteda r text-analytics

Last synced: 30 Jul 2024

https://github.com/PlexPt/chatgpt-corpus

ChatGPT Chinese corpus: dialog, fiction, and customer-service corpora for training large models

awesome corpus corpus-data question-answering

Last synced: 03 Aug 2024

https://github.com/plexpt/chatgpt-corpus

ChatGPT Chinese corpus: dialog, fiction, and customer-service corpora for training large models

awesome corpus corpus-data question-answering

Last synced: 02 Aug 2024

https://github.com/CBLUEbenchmark/CBLUE

CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark (a benchmark for Chinese medical information processing)

acl2022 benchmark biomedical-tasks chinese chineseblue corpus dataset evaluation

Last synced: 01 Aug 2024

https://github.com/GAIR-NLP/MathPile

Generative AI for Math: MathPile

corpus language-model large-language-models math pre-training

Last synced: 09 Aug 2024

https://github.com/lil-lab/nlvr

Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.

computer-vision corpus machine-learning natural-language-processing

Last synced: 02 Aug 2024

https://github.com/EdinburghNLP/code-docstring-corpus

Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.

code-generation corpus docstrings documentation-generator neural-machine-translation

Last synced: 31 Jul 2024

https://github.com/m1-llie/TUMCC

[IP&M 2022] Telegram Underground Market Chinese Corpus, a corpus for identifying Chinese dark jargon in Telegram underground markets. Paper: Identification of Chinese Dark Jargons in Telegram Underground Markets Using Context-Oriented and Linguistic Features (IP&M, 2022).

chinese corpus dataset telegram

Last synced: 04 Aug 2024

https://github.com/christos-c/bible-corpus

A multilingual parallel corpus created from translations of the Bible.

bible bible-corpus corpus multilingual translation

Last synced: 31 Jul 2024

https://github.com/srvk/how2-dataset

This repository contains code and metadata for the How2 dataset

corpus dataset how2-dataset language machine-translation multimodality speech-recognition video

Last synced: 31 Jul 2024

https://github.com/proycon/colibri-core

Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.

c-plus-plus computational-linguistics corpus library linguistics ngram ngrams nlp pattern-recognition python skipgram text-processing

Last synced: 29 Sep 2024

https://github.com/maxoodf/russian_news_corpus

Russian mass media stemmed texts corpus: a corpus of lemmatized (morphologically normalized) texts from Russian mass media

articles corpus machine-learning ml nlp nlp-machine-learning russian text word2vec

Last synced: 15 Aug 2024

https://github.com/amir-zeldes/gum

Repository for the Georgetown University Multilayer Corpus (GUM)

annis annotations coreference corpus pos-tagging rhetorical-structure-theory treebank universal-dependencies

Last synced: 01 Aug 2024

https://github.com/open-discourse/open-discourse

Open Discourse is the first fully comprehensive corpus of the plenary proceedings of the federal German Parliament (Bundestag).

bundestag corpus data hacktoberfest

Last synced: 30 Jul 2024

https://github.com/cyrta/voxceleb

Mirror of the VoxCeleb dataset, a large-scale speaker identification dataset

corpus dataset speaker speaker-identification speaker-recognition speaker-verification speech

Last synced: 03 Aug 2024

https://github.com/kgjerde/corporaexplorer

An R package for dynamic exploration of text collections

corpora corpus r shiny text-analysis

Last synced: 05 Aug 2024

https://github.com/proycon/folia

FoLiA: Format for Linguistic Annotation. FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.

computational-linguistics corpus file-format folia language library linguistic-annotation-framework linguistics nlp python xml

Last synced: 30 Sep 2024

https://github.com/GermanT5/wikipedia2corpus

Wikipedia text corpus for self-supervised NLP model training

corpus german-nlp machine-learning nlp somajo wikipedia wikipedia-corpus

Last synced: 31 Jul 2024

https://github.com/proiel/proiel-treebank

Official releases of the PROIEL treebank of ancient Indo-European languages

ancient-greek ancient-languages armenian corpus gothic2 language latin linguistics new-testament old-church-slavonic treebank

Last synced: 31 Jul 2024

https://github.com/hugovk/everyfinnishword

Every Finnish word

corpus finnish language words

Last synced: 01 Oct 2024

https://github.com/INL/OpenConvert

Text conversion tool (from e.g. Word, HTML, txt) to the corpus formats TEI or FoLiA

conversion corpus

Last synced: 03 Aug 2024

https://github.com/megagonlabs/asdc

Accommodation Search Dialog Corpus (宿泊施設探索対話コーパス)

corpus dialog japanese-language

Last synced: 02 Aug 2024

https://github.com/global-asp/asp-source

Source stories from the African Storybook Project in Markdown format

africa corpus creative-commons multilingual storybooks

Last synced: 29 Sep 2024

https://github.com/AsoSoft/AsoSoft-Text-Corpus

AsoSoft Text Corpus is the first large scale text corpus for the Kurdish language.

central-kurdish corpus kurdish kurdish-language-processing natural-language-processing sorani text-corpus

Last synced: 03 Aug 2024

https://github.com/giocomai/castarter.legacy

castarter - Content analysis starter toolkit for R

content-analysis corpus r rstats

Last synced: 13 Aug 2024

https://github.com/AsoSoft/AsoSoft-TTS-Speech-Corpus-for-Central-Kurdish

AsoSoft Speech Corpus for Central-Kurdish Text-To-Speech

corpus kurdish speech tts

Last synced: 03 Aug 2024

https://github.com/global-asp/pb-source

Pratham Books stories in Markdown format

corpus creative-commons india multilingual storybooks

Last synced: 29 Sep 2024

https://github.com/danieldk/conllx-rs

CoNLL-X reader and writers for Rust

conll conllx corpus rust treebank

Last synced: 02 Oct 2024

https://github.com/nevmenandr/bashkir-corpus

Texts for a corpus of the Bashkir language

bashkir corpus corpus-data minority-language texts

Last synced: 03 Aug 2024

https://github.com/global-asp/lcb-source

Little Cree Books stories in Markdown format

canada corpus cree indigenous-languages storybooks storytelling syllabics

Last synced: 29 Sep 2024

https://github.com/richardlitt/gaelic-resources

A list of computational resources for Gaelic

corpora corpus gaelic irish language nlp resources scots scottish scottish-gaelic

Last synced: 03 Oct 2024

https://github.com/sagesolar/Corpus-of-Taylor-Swift

This is a dataset consisting of all song lyric words found on all of Taylor Swift's studio albums (up to and including TTPD), as well as a selection of other songs written by her.

corpus corpus-data song-dataset song-lyrics taylor-swift ttpd

Last synced: 31 Jul 2024

https://github.com/vxern/tatoeba

📜 A complete, documented API wrapper for querying and retrieving sentences from the Tatoeba corpus.

api clean corpus documented erlang gleam language sentence tatoeba tested translation wrapper

Last synced: 29 Sep 2024

https://github.com/clemsciences/cltk-2019-graz

Presentation of CLTK with slides and notebooks

cltk corpus digital-humanities jupyter-notebook lemmatizer nlp

Last synced: 02 Oct 2024

https://github.com/dellison/wikitext.jl

Julia interface to the WikiText dataset.

corpus dataset julia language-modeling natural-language-processing nlp

Last synced: 30 Sep 2024

https://github.com/datwaft/tree-sitter-corpus

A tree-sitter parser for tree-sitter's test files

corpus grammar tests tree-sitter tree-sitter-grammar tree-sitter-parser

Last synced: 26 Sep 2024

https://github.com/richardlitt/fortune-cookie-corpus

A growing corpus of fortune cookies (for NLP and fun). Add your fortunes!

corpora corpus corpus-linguistics fortune fortune-cookie fortune-cookies

Last synced: 03 Oct 2024

https://github.com/retr0327/corpus-backend

A simple corpus backend API built with KoaJs and Apache Lucene.

corpus koajs lucene

Last synced: 01 Oct 2024