Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with corpus
A curated list of projects in awesome lists tagged with corpus.
https://github.com/brightmart/nlp_chinese_corpus
Large Scale Chinese Corpus for NLP
bert chinese chinese-corpus chinese-dataset chinese-nlp corpus dataset language-model news nlp pretrain question-answering text-classification wiki word2vec
Last synced: 30 Sep 2024
https://github.com/dariusk/corpora
A collection of small corpuses of interesting data for the creation of bots and similar stuff.
Last synced: 30 Sep 2024
https://github.com/cluebenchmark/cluedatasetsearch
Search all Chinese NLP datasets, with commonly used English NLP datasets included
chinese corpus datasets knowledge-graph machine-reading-comprehension machine-translation match ner nlp qa sentiment-analysis text-classification text-similarity text-summarization
Last synced: 30 Sep 2024
https://github.com/CLUEbenchmark/CLUEDatasetSearch
Search all Chinese NLP datasets, with commonly used English NLP datasets included
chinese corpus datasets knowledge-graph machine-reading-comprehension machine-translation match ner nlp qa sentiment-analysis text-classification text-similarity text-summarization
Last synced: 31 Jul 2024
https://github.com/cluebenchmark/clue
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
albert benchmark bert chinese chineseglue corpus dataset glue language-model nlu pretrained-models pytorch roberta tensorflow transformers
Last synced: 01 Oct 2024
https://github.com/CLUEbenchmark/CLUE
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
albert benchmark bert chinese chineseglue corpus dataset glue language-model nlu pretrained-models pytorch roberta tensorflow transformers
Last synced: 31 Jul 2024
https://github.com/adbar/trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
article-extractor corpus corpus-builder corpus-tools crawler html-to-markdown html2text news news-aggregator news-crawler nlp readability rss-feed scraping tei text-cleaning text-extraction text-mining text-preprocessing web-scraping
Last synced: 30 Jul 2024
https://github.com/gunthercox/chatterbot-corpus
A multilingual dialog corpus
chatterbot corpus dialog language yaml
Last synced: 29 Sep 2024
https://github.com/chatopera/insuranceqa-corpus-zh
Insurance industry corpus for chatbots
chatbot corpus dataset insurance insuranceqa-corpus-zh machine-learning natural-language-processing natural-language-understanding qasystem question-answering
Last synced: 01 Aug 2024
https://github.com/NiuTrans/Classical-Modern
A very comprehensive parallel corpus of Classical Chinese (ancient texts) and Modern Chinese
corpus parallel-corpus traditional-and-simplified-chinese traditional-chinese
Last synced: 03 Aug 2024
https://github.com/CLUEbenchmark/CLUECorpus2020
Large-scale Pre-training Corpus for Chinese (100G)
albert bert chinese chinese-corpus corpus datasets nlp pretrain roberta
Last synced: 03 Aug 2024
https://github.com/tensorlayer/seq2seq-chatbot
Chatbot in 200 lines of code using TensorLayer
bot chat chatbot corpus lstm nlp python rnn tensorflow tensorlayer
Last synced: 31 Jul 2024
https://github.com/quanteda/quanteda
An R package for the Quantitative Analysis of Textual Data
corpus natural-language-processing quanteda r text-analytics
Last synced: 30 Jul 2024
https://github.com/PlexPt/chatgpt-corpus
ChatGPT Chinese corpus: dialogue, fiction, and customer-service corpora for training large models
awesome corpus corpus-data question-answering
Last synced: 03 Aug 2024
https://github.com/CLUEbenchmark/CLUEPretrainedModels
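A collection of high-quality Chinese pre-trained models: state-of-the-art large models, the fastest small models, and specialized similarity models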
albert bert chinese corpus dataset distillation pretrained-models roberta semantic-similarity sentence-analysis sentence-classification sentence-pairs text-classification
Last synced: 01 Aug 2024
https://github.com/plexpt/chatgpt-corpus
ChatGPT Chinese corpus: dialogue, fiction, and customer-service corpora for training large models
awesome corpus corpus-data question-answering
Last synced: 02 Aug 2024
https://github.com/CBLUEbenchmark/CBLUE
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark for Chinese medical information processing
acl2022 benchmark biomedical-tasks chinese chineseblue corpus dataset evaluation
Last synced: 01 Aug 2024
https://github.com/louisowen6/NLP_bahasa_resources
A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
bahasa-indonesia corpus corpus-linguistics dataset indonesian indonesian-language library natural-language-processing nlp nlp-bahasa-resources packages sentiment-analysis sentiment-analysis-dataset
Last synced: 01 Aug 2024
https://github.com/GAIR-NLP/MathPile
Generative AI for Math: MathPile
corpus language-model large-language-models math pre-training
Last synced: 09 Aug 2024
https://github.com/lil-lab/nlvr
Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
computer-vision corpus machine-learning natural-language-processing
Last synced: 02 Aug 2024
https://github.com/strongcourage/fuzzing-corpus
My fuzzing corpus
corpus file-format fuzzing testsuite vulnerability
Last synced: 26 Sep 2024
https://github.com/kirralabs/indonesian-NLP-resources
Data resources for Indonesian-language NLP
corpus corpus-linguistics crawler dataset dependency-parser indonesian indonesian-language named-entity-recognition nlp parallel-corpus pos-tagging sentiment-analysis
Last synced: 01 Aug 2024
https://github.com/EdinburghNLP/code-docstring-corpus
Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.
code-generation corpus docstrings documentation-generator neural-machine-translation
Last synced: 31 Jul 2024
https://github.com/m1-llie/TUMCC
[IP&M 2022] Telegram Underground Market Chinese Corpus, a corpus for identifying Chinese dark jargon in Telegram underground markets. Paper: Identification of Chinese Dark Jargons in Telegram Underground Markets Using Context-Oriented and Linguistic Features (IP&M, 2022).
chinese corpus dataset telegram
Last synced: 04 Aug 2024
https://github.com/christos-c/bible-corpus
A multilingual parallel corpus created from translations of the Bible.
bible bible-corpus corpus multilingual translation
Last synced: 31 Jul 2024
https://github.com/srvk/how2-dataset
This repository contains code and metadata of the How2 dataset
corpus dataset how2-dataset language machine-translation multimodality speech-recognition video
Last synced: 31 Jul 2024
https://github.com/proycon/colibri-core
Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller`` which allows you to build, view, manipulate and query pattern models.
c-plus-plus computational-linguistics corpus library linguistics ngram ngrams nlp pattern-recognition python skipgram text-processing
Last synced: 29 Sep 2024
https://github.com/GlobalMaksimum/sadedegel
A General Purpose NLP library for Turkish
acikhack2 ai artificial-intelligence bert binder corpus data-science deep-learning embeddings heroku machine-learning natural-language-processing neural-network neural-networks news-summarizer nlp python
Last synced: 02 Aug 2024
https://github.com/maxoodf/russian_news_corpus
Russian mass media stemmed texts corpus (a corpus of lemmatized, morphologically normalized texts from Russian mass media)
articles corpus machine-learning ml nlp nlp-machine-learning russian text word2vec
Last synced: 15 Aug 2024
https://github.com/amir-zeldes/gum
Repository for the Georgetown University Multilayer Corpus (GUM)
annis annotations coreference corpus pos-tagging rhetorical-structure-theory treebank universal-dependencies
Last synced: 01 Aug 2024
https://github.com/open-discourse/open-discourse
Open Discourse is the first fully comprehensive corpus of the plenary proceedings of the federal German Parliament (Bundestag).
bundestag corpus data hacktoberfest
Last synced: 30 Jul 2024
https://github.com/cyrta/voxceleb
Mirror of the VoxCeleb dataset, a large-scale speaker identification dataset
corpus dataset speaker speaker-identification speaker-recognition speaker-verification speech
Last synced: 03 Aug 2024
https://github.com/kgjerde/corporaexplorer
An R package for dynamic exploration of text collections
corpora corpus r shiny text-analysis
Last synced: 05 Aug 2024
https://github.com/proycon/folia
FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library as well as the full documentation, validation schemas, and set definitions.
computational-linguistics corpus file-format folia language library linguistic-annotation-framework linguistics nlp python xml
Last synced: 30 Sep 2024
https://github.com/GermanT5/wikipedia2corpus
Wikipedia text corpus for self-supervised NLP model training
corpus german-nlp machine-learning nlp somajo wikipedia wikipedia-corpus
Last synced: 31 Jul 2024
https://github.com/proiel/proiel-treebank
Official releases of the PROIEL treebank of ancient Indo-European languages
ancient-greek ancient-languages armenian corpus gothic2 language latin linguistics new-testament old-church-slavonic treebank
Last synced: 31 Jul 2024
https://github.com/INL/OpenConvert
Text conversion tool (from e.g. Word, HTML, txt) to the corpus formats TEI or FoLiA
Last synced: 03 Aug 2024
https://github.com/megagonlabs/asdc
Accommodation Search Dialog Corpus (宿泊施設探索対話コーパス)
corpus dialog japanese-language
Last synced: 02 Aug 2024
https://github.com/derintelligence/en-az-parallel-corpus
English-Azerbaijani parallel language corpus
azerbaijan azerbaijani-translation corpus language linguistics nlp parallel translation
Last synced: 02 Aug 2024
https://github.com/global-asp/asp-source
Source stories from the African Storybook Project in Markdown format
africa corpus creative-commons multilingual storybooks
Last synced: 29 Sep 2024
https://github.com/AsoSoft/AsoSoft-Text-Corpus
AsoSoft Text Corpus is the first large scale text corpus for the Kurdish language.
central-kurdish corpus kurdish kurdish-language-processing natural-language-processing sorani text-corpus
Last synced: 03 Aug 2024
https://github.com/giocomai/castarter.legacy
castarter - Content analysis starter toolkit for R
content-analysis corpus r rstats
Last synced: 13 Aug 2024
https://github.com/alexeykosh/lingcorpora.py
API for corpora
api corpora corpus national-corpus package
Last synced: 07 Aug 2024
https://github.com/AsoSoft/AsoSoft-TTS-Speech-Corpus-for-Central-Kurdish
AsoSoft Speech Corpus for Central-Kurdish Text-To-Speech
Last synced: 03 Aug 2024
https://github.com/global-asp/pb-source
Pratham Books stories in Markdown format
corpus creative-commons india multilingual storybooks
Last synced: 29 Sep 2024
https://github.com/KurdishBLARK/KTC
Kurdish Textbooks Corpus
corpus corpus-linguistics kurdish kurdish-language-processing language-resources natural-language-processing
Last synced: 03 Aug 2024
https://github.com/nevmenandr/bashkir-corpus
Texts for a corpus of the Bashkir language
bashkir corpus corpus-data minority-language texts
Last synced: 03 Aug 2024
https://github.com/global-asp/lcb-source
Little Cree Books stories in Markdown format
canada corpus cree indigenous-languages storybooks storytelling syllabics
Last synced: 29 Sep 2024
https://github.com/KurdishBLARK/KurdishLyricsCorpus
A Corpus of the Kurdish Folkloric Lyrics
corpus folkloristics kurdish kurdish-language-processing lyrics
Last synced: 03 Aug 2024
https://github.com/KurdishBLARK/InterdialectCorpus
A parallel corpus of Sorani, Kurmanji and English
corpus kurdish kurdish-language-processing machine-translation natural-language-processing parallel-corpus
Last synced: 03 Aug 2024
https://github.com/sagesolar/Corpus-of-Taylor-Swift
This is a dataset consisting of all song lyric words found on all of Taylor Swift's studio albums (up to and including TTPD), as well as a selection of other songs written by her.
corpus corpus-data song-dataset song-lyrics taylor-swift ttpd
Last synced: 31 Jul 2024
https://github.com/sinaahmadi/ZazaGoraniCorpus
A corpus for the Zazaki and Gorani languages
computational-linguistics corpus corpus-data corpus-linguistics feyli gorani kurdish kurdish-language-processing less-resource-languages natural-language-processing southern-kurdish zazaki
Last synced: 03 Aug 2024
https://github.com/dwhieb/nuuchahnulth
Linguistic data on the Nuuchahnulth (Wakashan) language
corpora corpus corpus-linguistics documentary-linguistics language-documentation linguistics nuuchahnulth wakashan
Last synced: 02 Oct 2024
https://github.com/vxern/tatoeba
📜 A complete, documented API wrapper for querying and retrieving sentences from the Tatoeba corpus.
api clean corpus documented erlang gleam language sentence tatoeba tested translation wrapper
Last synced: 29 Sep 2024
https://github.com/clemsciences/cltk-2019-graz
Presentation of CLTK with slides and notebooks
cltk corpus digital-humanities jupyter-notebook lemmatizer nlp
Last synced: 02 Oct 2024
https://github.com/KurdishBLARK/KTC-Segmented
A segmented version of KTC
corpus kurdish kurdish-language-processing natural-language-processing
Last synced: 03 Aug 2024
https://github.com/dellison/wikitext.jl
Julia interface to the WikiText dataset.
corpus dataset julia language-modeling natural-language-processing nlp
Last synced: 30 Sep 2024
https://github.com/rexshijaku/chatgpt-generated-text-detection-corpus
ChatGPT Generated Text Detection Corpus
chatgpt corpus dataset linguistics text-classification text-detection
Last synced: 01 Oct 2024
https://github.com/datwaft/tree-sitter-corpus
A tree-sitter parser for tree-sitter's test files
corpus grammar tests tree-sitter tree-sitter-grammar tree-sitter-parser
Last synced: 26 Sep 2024
https://github.com/richardlitt/fortune-cookie-corpus
A growing corpus of fortune cookies (for NLP and fun). Add your fortunes!
corpora corpus corpus-linguistics fortune fortune-cookie fortune-cookies
Last synced: 03 Oct 2024
https://github.com/retr0327/corpus-backend
A simple corpus backend API built with KoaJs and Apache Lucene.
Last synced: 01 Oct 2024