Projects in Awesome Lists tagged with corpus-processing
A curated list of projects in awesome lists tagged with corpus-processing.
https://github.com/hankcs/treebankpreprocessing
Python scripts preprocessing Penn Treebank and Chinese Treebank
corpus-processing natural-language-processing
Last synced: 17 Mar 2025
https://github.com/Helsinki-NLP/OpusFilter
OpusFilter - Parallel corpus processing toolkit
corpus-processing corpus-tools machine-translation natural-language-processing nlp parallel-corpus
Last synced: 19 Nov 2025
https://github.com/notesjor/corpusexplorer2.0
Corpus linguistics has never been so easy...
big-data cleaning-data cooccurrence corpus-linguistics corpus-processing data-minig data-mining data-science datajournalism journalism linguistics natural-language-processing natural-language-understanding nlp sdk tagger text-analysis text-mining text-processing visualization
Last synced: 17 Jan 2026
https://github.com/jaytimm/corpuslingr
A library of functions enabling complex corpus search in context (KWIC), search aggregation, bag-of-words building & keyphrase extraction.
corpus-processing corpus-search corpus-tools
Last synced: 13 Jul 2025
https://github.com/zgornel/stringanalysis.jl
Hard fork of JuliaText/TextAnalysis.jl
corpus-processing latent-semantic-analysis random-projections text-analysis text-processing
Last synced: 24 Jul 2025
https://github.com/cscfi/kielipankki-utilities
Scripts for data conversion
corpus-processing corpus-tools korp vrt
Last synced: 24 Apr 2025
https://github.com/clariah/wp6-missieven
General Missives in Text-Fabric
corpus-data corpus-linguistics corpus-processing corpus-tools dutch history nlp
Last synced: 22 Apr 2025
https://github.com/thecsw/katya-dev
Katya, or The Liberated Corpus: a text corpus that allows you to request and scrape any web resource!
corpus corpus-analysis corpus-builder corpus-generator corpus-linguistics corpus-processing russian russian-literature tagger text-corpus
Last synced: 17 Jan 2026
https://github.com/frankier/stiff
Sense Tagged Instances For Finnish
corpus-processing linguistic-corpora nlp word-sense-disambiguation wsd
Last synced: 05 Jul 2025
https://github.com/gederajeg/corplingr
Tidy concordances, collocates, and wordlists
corpus-data corpus-linguistics corpus-processing corpus-tools indonesian indonesian-language indonesian-linguistics leipzig-corpora-collection leipzig-corpus-files usage-based-linguistics
Last synced: 01 Apr 2025
https://github.com/utrechtuniversity/dataquest
A configurable pipeline for extracting and filtering articles from large corpora, tailored for the Delpher Kranten corpus, with support for features like keyword filtering and tf-idf-based relevance scoring.
article-extraction corpus-processing delpher-kranten information-retrieval keyword-filtering
Last synced: 05 Feb 2026
https://github.com/czcorpus/depreldb
A fast database for UD dependency relations between lemmas
collocation-extraction corpus-linguistics corpus-processing corpus-tools data-retrieval database linguistics universal-dependencies
Last synced: 07 Feb 2026
https://github.com/CentreForDigitalHumanities/ianalyzer-readers
Pre-processing functionality used in I-analyzer
Last synced: 23 Jul 2025
https://github.com/ketanmehra003/parallel-corpus-management-tool
This project is designed to help manage and analyze large corpora of text data. It provides tools for importing, processing, and querying text data efficiently.
corpus corpus-data corpus-processing corpus-tools django language-translator-api machine-learning python3
Last synced: 01 May 2026
https://github.com/mosesab/language-text-extraction-
Takes input text and extracts the sentences written in a given language, using that language's lexicon.
corpus corpus-processing corpus-search english language-processing language-resources languages natural-language-processing nlp python-programming python-standard-library python3 text-processing
Last synced: 28 Feb 2025
https://github.com/jamnicki/split-corpus
The split-corpus package divides text corpora into meaningful parts that are as close to a specified size as possible.
corpora corpus-processing large-files natural-language-processing nlp processing
Last synced: 29 Apr 2026
https://github.com/cadia-lvl/diar-az
Diarization A to Z - Kaldi to Gecko to Kaldi and corpus and back
corpus-processing diarization parsing rttm
Last synced: 11 Mar 2026
https://github.com/rodrigofrancisco/pln
Natural language processing assignments
Last synced: 20 Aug 2025