An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with corpus-processing

A curated list of projects in awesome lists tagged with corpus-processing.

https://github.com/hankcs/treebankpreprocessing

Python scripts for preprocessing the Penn Treebank and Chinese Treebank

corpus-processing natural-language-processing

Last synced: 17 Mar 2025

https://github.com/jaytimm/corpuslingr

A library of functions enabling complex corpus search in context (KWIC), search aggregation, bag-of-words building & keyphrase extraction.

corpus-processing corpus-search corpus-tools

Last synced: 13 Jul 2025

https://github.com/cscfi/kielipankki-utilities

Scripts for data conversion

corpus-processing corpus-tools korp vrt

Last synced: 24 Apr 2025

https://github.com/thecsw/katya-dev

Katya, or The Liberated Corpus: a text corpus that allows you to request and scrape any web resource!

corpus corpus-analysis corpus-builder corpus-generator corpus-linguistics corpus-processing russian russian-literature tagger text-corpus

Last synced: 17 Jan 2026

https://github.com/utrechtuniversity/dataquest

A configurable pipeline for extracting and filtering articles from large corpora, tailored for the Delpher Kranten corpus, with support for features like keyword filtering and tf-idf-based relevance scoring.

article-extraction corpus-processing delpher-kranten information-retrieval keyword-filtering

Last synced: 05 Feb 2026

https://github.com/CentreForDigitalHumanities/ianalyzer-readers

Pre-processing functionality used in I-analyzer

corpus-processing

Last synced: 23 Jul 2025

https://github.com/ketanmehra003/parallel-corpus-management-tool

This project is designed to help manage and analyze large corpora of text data. It provides tools for importing, processing, and querying text data efficiently.

corpus corpus-data corpus-processing corpus-tools django language-translator-api machine-learning python3

Last synced: 01 May 2026

https://github.com/jamnicki/split-corpus

A package that divides text corpora into meaningful parts, each as close to a specified size as possible.

corpora corpus-processing large-files natural-language-processing nlp processing

Last synced: 29 Apr 2026

https://github.com/cadia-lvl/diar-az

Diarization A to Z - Kaldi to Gecko to Kaldi and corpus and back

corpus-processing diarization parsing rttm

Last synced: 11 Mar 2026

https://github.com/rodrigofrancisco/pln

Natural language processing assignments

corpus-processing npl

Last synced: 20 Aug 2025