Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/LanguageMachines/PICCL
A set of workflows for corpus building through OCR, post-correction and normalisation
computational-linguistics corpus-linguistics corpus-tools folia nlp ocr workflow
Last synced: 30 Jun 2024
https://github.com/Helsinki-NLP/OpusFilter
OpusFilter - Parallel corpus processing toolkit
corpus-processing corpus-tools machine-translation natural-language-processing nlp parallel-corpus
Last synced: 20 Jun 2024
https://github.com/M4t1ss/parallel-corpora-tools
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
cleaning corpora corpus-tools data-processing data-science filtering language language-processing machine machine-translation natural-language natural-language-processing neural neural-machine-translation nlp nmt translation
Last synced: 20 Jun 2024
https://github.com/adbar/trafilatura
Python and command-line tool for gathering text from the Web: web crawling/scraping and extraction of text, metadata, and comments
article-extractor corpus corpus-builder corpus-tools crawler html-to-markdown html2text news news-aggregator news-crawler nlp readability rss-feed scraping tei text-cleaning text-extraction text-mining text-preprocessing web-scraping
Last synced: 09 Jun 2024
https://github.com/lennes/spect
SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/
analysis annotation conversational-speech corpus-linguistics corpus-tools praat spect speech speech-analysis speech-corpus spoken-language transcript transcription
Last synced: 07 Jun 2024
https://github.com/jaytimm/corpuslingr
A library of functions enabling complex corpus search in context (KWIC), search aggregation, bag-of-words building & keyphrase extraction.
corpus-processing corpus-search corpus-tools
Last synced: 20 May 2024
https://github.com/ynop/audiomate
Python library for handling audio datasets.
audio audio-datasets corpus-tools data-loader dataset-creation dataset-filtering dataset-manager music noise speech speech-recognition
Last synced: 28 Apr 2024
https://github.com/adbar/simplemma
Simple multilingual lemmatizer for Python, designed for speed and efficiency
corpus-tools language-detection language-identification lemmatiser lemmatization lemmatizer low-resource-nlp morphological-analysis nlp tokenization tokenizer wordlist
Last synced: 19 Apr 2024
https://github.com/koskenni/beta
An open source reimplementation of Benny Brodda's BETA in Python
benny-brodda beta corpus-tools hyphenation linguistics open-source string-manipulation string-rewriting
Last synced: 01 Apr 2024