Projects in Awesome Lists tagged with corpus-tools
A curated list of projects in awesome lists tagged with corpus-tools.
https://github.com/adbar/trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
article-extractor corpus corpus-builder corpus-tools crawler html-to-markdown html2text news news-aggregator news-crawler nlp readability rss-feed scraping tei text-cleaning text-extraction text-mining text-preprocessing web-scraping
Last synced: 14 Mar 2025
https://github.com/flairnlp/fundus
A very simple news crawler with a funny name
cc-news commoncrawl corpus corpus-tools crawler datasets image-classification image-extraction news-crawler news-scraping nlp python rss scraper sitemap text-extraction web-corpus web-scraping
Last synced: 14 May 2025
https://github.com/adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
corpus-tools language-detection language-identification lemmatiser lemmatization lemmatizer low-resource-nlp morphological-analysis nlp tokenization tokenizer wordlist
Last synced: 10 May 2025
https://github.com/ynop/audiomate
Python library for handling audio datasets.
audio audio-datasets corpus-tools data-loader dataset-creation dataset-filtering dataset-manager music noise speech speech-recognition
Last synced: 25 Nov 2024
https://github.com/koskenni/beta
An open source reimplementation of Benny Brodda's BETA in Python
benny-brodda beta corpus-tools hyphenation linguistics open-source string-manipulation string-rewriting
Last synced: 01 May 2025
https://github.com/lennes/spect
SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/
analysis annotation conversational-speech corpus-linguistics corpus-tools praat spect speech speech-analysis speech-corpus spoken-language transcript transcription
Last synced: 03 Apr 2025
https://github.com/LanguageMachines/PICCL
A set of workflows for corpus building through OCR, post-correction and normalisation
computational-linguistics corpus-linguistics corpus-tools folia nlp ocr workflow
Last synced: 02 Apr 2025
https://github.com/jaytimm/corpuslingr
A library of functions enabling complex corpus search in context (KWIC), search aggregation, bag-of-words building & keyphrase extraction.
corpus-processing corpus-search corpus-tools
Last synced: 22 Nov 2024
https://github.com/liao961120/concordancer
Searching in-memory corpus with Corpus Query Language (CQL)
concordancer corpus-query-language corpus-tools python3
Last synced: 14 Apr 2025
https://github.com/cscfi/kielipankki-utilities
Scripts for data conversion
corpus-processing corpus-tools korp vrt
Last synced: 24 Apr 2025
https://github.com/clariah/wp6-missieven
General Missives in Text-Fabric
corpus-data corpus-linguistics corpus-processing corpus-tools dutch history nlp
Last synced: 22 Apr 2025
https://github.com/andythefactory/article-extraction-dataset
Article title, authors, date and body extraction dataset.
article-extractor corpus corpus-builder corpus-tools dataset datasets html-to-markdown html2text news news-aggregator news-crawler readability scraping scraping-websites text-cleaning text-extraction text-mining text-preprocessing web-scraping
Last synced: 18 Feb 2025
https://github.com/rishit7/corpusqnatool
CorpusQnATool uses a corpus and ChatGPT to produce answers to input queries.
corpus-tools kmp-algorithm llm
Last synced: 28 Feb 2025
https://github.com/aitor-alvarez/emorabic
Tools for creating speech corpora by extracting audio from YouTube videos
audio corpus-tools speech speech-corpora speech-processing
Last synced: 20 Mar 2025
https://github.com/ancatmara/dl-sfl-2019
Digital Literacy course for School of Foreign Languages (NRU HSE, 2019)
bibliography-managers corpus-tools data-vizualization gephi google-forms google-search google-sheets machine-translation network-analysis office-tools regular-expressions scientometrics tutorials
Last synced: 21 Feb 2025
https://github.com/ancatmara/dl-historians-2017
Digital Literacy course for Historians & Philosophers (NRU HSE, 2017)
bibliography-managers corpus-tools crowdsourcing data-formats data-visualization digital-humanities digital-literacy digital-maps git office-tools presentation-tools regular-expressions tutorials
Last synced: 02 Jan 2025
https://github.com/ancatmara/dl-culturology-2018
Digital Literacy course for students in Culturology & Art History (NRU HSE, 2018)
bibliography-managers corpus-tools css data-vizualization digital-humanities digital-literacy digital-maps elan gephi git google-forms html markdown ms-excel ms-word network-analysis ocr office-tools tutorials web-development
Last synced: 21 Feb 2025
https://github.com/egorsmkv/asr-corpus-by-microphone
A simple solution for people who want to create their own corpus for Automatic Speech Recognition with just a microphone
asr automatic-speech-recognition corpus corpus-tools
Last synced: 28 Mar 2025
https://github.com/gederajeg/corplingr
Tidy concordances, collocates, and wordlist
corpus-data corpus-linguistics corpus-processing corpus-tools indonesian indonesian-language indonesian-linguistics leipzig-corpora-collection leipzig-corpus-files usage-based-linguistics
Last synced: 01 Apr 2025
https://github.com/ancatmara/dl-philology-2018
Digital Literacy for Philologists (NRU HSE, 2018)
bibliography-managers corpus-tools data-visualization databases digital-humanities digital-literacy digital-maps gephi git google-forms ms-excel ms-word network-analysis office-tools presentation-tools regular-expressions stylo stylometry tutorials xml
Last synced: 21 Feb 2025
https://github.com/ancatmara/dl-sfl-2018
Digital Literacy for School of Foreign Languages (NRU HSE, 2018)
bibliography-managers corpus-tools crowdsourcing data-formats data-vizualization digital-humanities digital-literacy gephi git google-forms inter-annotator-agreement machine-translation ms-excel ms-word network-analysis ocr office-tools presentation-tools regular-expressions tutorials
Last synced: 21 Feb 2025
https://github.com/unhammer/gt-corpustools
Branches of https://victorio.uit.no/langtech/trunk/tools/CorpusTools used by Giellatekno.UiT.no for corpus gathering.
Last synced: 18 Feb 2025
https://github.com/ketanmehra003/parallel-corpus-management-tool
This project is designed to help manage and analyze large corpora of text data. It provides tools for importing, processing, and querying text data efficiently.
corpus corpus-data corpus-processing corpus-tools django language-translator-api machine-learning python3
Last synced: 30 Mar 2025