Projects in Awesome Lists tagged with text-data
A curated list of projects in awesome lists tagged with text-data.
https://github.com/microsoft/DialoGPT
Large-scale pretraining for dialogue
data-processing dialogpt dialogue gpt-2 machine-learning pytorch text-data text-generation transformer
Last synced: 19 Jul 2025
https://github.com/asyml/texar
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
bert casl-project data-processing deep-learning dialog-systems gpt-2 machine-learning machine-translation natural-language-processing python tensorflow texar text-data text-generation xlnet
Last synced: 14 May 2025
https://github.com/microsoft/dialogpt
Large-scale pretraining for dialogue
data-processing dialogpt dialogue gpt-2 machine-learning pytorch text-data text-generation transformer
Last synced: 15 May 2025
https://github.com/microsoft/godel
Large-scale pretrained models for goal-directed dialog
conversational-ai data-processing dialogpt dialogue dialogue-systems grounded-generation language-grounding language-model machine-learning pretrained-model pytorch text-data text-generation transformer transformers
Last synced: 12 Apr 2025
https://github.com/microsoft/GODEL
Large-scale pretrained models for goal-directed dialog
conversational-ai data-processing dialogpt dialogue dialogue-systems grounded-generation language-grounding language-model machine-learning pretrained-model pytorch text-data text-generation transformer transformers
Last synced: 27 Mar 2025
https://github.com/asyml/texar-pytorch
Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/
bert casl-project data-processing deep-learning dialog-systems gpt-2 machine-learning machine-translation natural-language-processing python pytorch roberta texar texar-pytorch text-data text-generation xlnet
Last synced: 08 Oct 2025
https://github.com/asyml/forte
Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/
data-processing deep-learning information-retrieval machine-learning natural-language natural-language-processing pipeline python text-data
Last synced: 04 Apr 2025
https://github.com/lolei/redditcleaner
Cleans Reddit text data
data-cleaning hacktoberfest nlp praw psaw pushshift python reddit text-data
Last synced: 22 Jul 2025
https://github.com/trinker/textreadr
Tools to uniformly read in text data including semi-structured transcripts
doc docx pdf-reading r read-transcripts text-data text-mining
Last synced: 16 Mar 2025
https://github.com/trinker/textshape
Tools for reshaping text data
data-reshaping manipulation r sentence-boundary-detection text-data text-formating tidy
Last synced: 16 Mar 2025
https://github.com/balaka-18/rake_new2
A Python library that enables smooth keyword extraction from any text using the RAKE (Rapid Automatic Keyword Extraction) algorithm.
keyword-extraction keyword-search keywords nlp python-library text text-data
Last synced: 30 Jun 2025
https://github.com/tylerjthomas9/scrapesec.jl
Scrape EDGAR filings from https://www.sec.gov/
edgar finance financial-data julia scraper sec text-data
Last synced: 07 May 2025
https://github.com/hsankesara/the-tweets-of-wisdom
A dataset of 30k+ so-called "self-help" tweets from 100+ authors.
nlp text-data text-datasets tweepy tweets
Last synced: 13 Oct 2025
https://github.com/signaln/parallelio
For reading from and writing to parallel data files in Python
machine-learning natural-language-processing pre-processing preprocessing text text-data
Last synced: 14 Jan 2026
https://github.com/ptthanh02/vietnam-news-crawler
crawler crawling-python newspaper text-data text-mining
Last synced: 11 Aug 2025
https://github.com/infinitode/crsd
A synthetic customer review dataset for sentiment analysis, generated using different AI models.
ai data dataset datasets huggingface-datasets mit-license ml nlp open-source python sentiment sentiment-analysis sentiment-classification text-data
Last synced: 10 Jun 2026
https://github.com/mhenderson/pages2df
Read morning pages into a data frame in R.
morning-pages rstats rstats-package text-data
Last synced: 05 Mar 2025
https://github.com/putuwaw/slr-emotion-classification
Systematic Literature Review: Machine Learning Methods in Emotion Classification in Textual Data
emotion-classification sisfokom systematic-literature-review text-data
Last synced: 20 Feb 2026
https://github.com/klaragtknst/text_topic
This repository implements a pipeline that stores various fields extracted from the files of a large unstructured dataset. These fields are used for topic modeling (word clouds based on low-dimensional versions of embedding vectors, named entity clustering, and document-topic incidences). The information is aggregated and visualised using FCA.
documents elasticsearch embeddings fca ner ner-clustering sentence-transformers text-data top2vec topic-aggregation topics-modeling visualisation
Last synced: 26 Feb 2025
https://github.com/infinitode/duplipy
DupliPy is a quick and easy-to-use package that can handle text formatting and data augmentation tasks for NLP in Python. It now offers support for image augmentation tasks as well.
ai augmentation data-analysis data-preprocessing data-science images language-models nlp preprocessing text-data text-datasets text-formatting
Last synced: 15 Apr 2026
https://github.com/fareedkhan-dev/nlp-1k-stories-dataset-genres-100
This repository hosts a diverse NLP dataset comprising 1,000 stories spanning 100 genres for comprehensive language understanding tasks.
dataset deep-learning llm machine-learning nlp python text-data
Last synced: 09 Jun 2026
https://github.com/finnishcancerregistry/fwf
Read and write fixed-width format data.
data-export data-import data-processing epidemiology file-format fixed-width-format fwf io r-package tabular-data text-data
Last synced: 26 Mar 2025