Projects in Awesome Lists tagged with training-data
A curated list of projects in awesome lists tagged with training-data.
https://github.com/snorkel-team/snorkel
A system for quickly generating training data with weak supervision
ai data-augmentation data-science data-slicing labeling machine-learning python snorkel training-data weak-supervision
Last synced: 23 Feb 2026
https://hazyresearch.github.io/snorkel
A system for quickly generating training data with weak supervision
ai data-augmentation data-science data-slicing labeling machine-learning python snorkel training-data weak-supervision
Last synced: 26 Feb 2025
https://github.com/diffgram/diffgram
The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.
annotation annotation-tool annotations data data-analytics data-annotation data-science datasets datastore deep-learning image-annotation kubernetes labeling machine-learning training-data video-annotation
Last synced: 14 Mar 2025
https://github.com/ydataai/ydata-synthetic
Synthetic data generators for tabular and time-series data
datageneration datagenerator deep-learning gan gan-architectures gans generative-adversarial-network machine-learning python3 pytorch synthetic-data tensorflow2 time-series timeseries training-data
Last synced: 13 May 2025
https://github.com/NorskRegnesentral/skweak
skweak: A software toolkit for weak supervision applied to NLP tasks
data-science distant-supervision natural-language-processing nlp-library nlp-machine-learning python spacy training-data weak-supervision
Last synced: 14 Mar 2025
https://github.com/norskregnesentral/skweak
skweak: A software toolkit for weak supervision applied to NLP tasks
data-science distant-supervision natural-language-processing nlp-library nlp-machine-learning python spacy training-data weak-supervision
Last synced: 15 May 2025
https://github.com/ovidijusparsiunas/myvision
Computer-vision-based ML training data generation tool 🚀
ai annotation annotation-tool coco computer-vision image image-annotation label labeling-tool labelling machine-learning ml model object-detection tagging tensorflow training-data vgg vision yolo
Last synced: 15 May 2025
https://github.com/OvidijusParsiunas/myvision
Computer-vision-based ML training data generation tool 🚀
ai annotation annotation-tool coco computer-vision image image-annotation label labeling-tool labelling machine-learning ml model object-detection tagging tensorflow training-data vgg vision yolo
Last synced: 20 Mar 2025
https://github.com/alteryx/compose
A machine learning tool for automated prediction engineering. It allows you to easily structure prediction problems and generate labels for supervised learning.
ai automl data-labeling data-science labeling labeling-tool machine-learning prediction-engineering prediction-problem training-data
Last synced: 14 May 2025
https://github.com/a-maliarov/amazoncaptcha
Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.
amazon amazon-captcha amazon-scraper amazoncaptcha captcha captcha-solver data-extraction pillow python3 training-data
Last synced: 26 Mar 2025
https://github.com/sparkfish/augraphy
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
augmentation-pipeline computer-vision crappification data-augmentation data-pipeline deep-neural-networks image-processing machine-learning synthetic-data synthetic-dataset-generation training-data
Last synced: 10 Apr 2025
https://github.com/Slava/label-tool
Web application for image labeling and segmentation
boundingbox computer-vision computer-vision-tools data-labeling image-annotation image-label image-labeling image-labeling-tool labelme machine-learning segmentation sematic-segmentation training-data
Last synced: 06 Apr 2025
https://github.com/d5555/tageditor
🏖 TagEditor - Annotation tool for spaCy
annotation annotation-tool coreference-resolution data-science labeling-tool machine-learning named-entities named-entity-recognition natural-language-processing neural-networks neuralcoref nlp spacy spacy-visualizer tagging-tool text-annotation text-tagging training-data
Last synced: 20 Aug 2025
https://github.com/d5555/TagEditor
🏖 TagEditor - Annotation tool for spaCy
annotation annotation-tool coreference-resolution data-science labeling-tool machine-learning named-entities named-entity-recognition natural-language-processing neural-networks neuralcoref nlp spacy spacy-visualizer tagging-tool text-annotation text-tagging training-data
Last synced: 12 May 2025
https://github.com/kennethenevoldsen/augmenty
Augmenty is an augmentation library based on spaCy for augmenting texts.
augmentation natural-language-processing nlp nlproc python spacy spacy-extension spacy-nlp text-augmentation text-classification training-data
Last synced: 05 Apr 2025
https://github.com/google-research-datasets/swim-ir
SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 languages, generated using PaLM 2 and summarize-then-ask prompting.
cross-lingual datasets deep-learning information-retrieval machine-learning multilingual natural-language-processing neural-information-retrieval nlp training-data
Last synced: 01 Apr 2026
https://github.com/ableinc/git2txt
Convert all files in a Git repository to .txt files. Useful for training LLMs on your codebase.
git llm machine-learning python3 training-data txt
Last synced: 06 Oct 2025
https://github.com/hernanmd/covid-19-train-audio
COVID-19 cough audio files for training AI models
audio-analysis coronavirus cough-monitor covid-19 covid19 training-data wavelet-analysis
Last synced: 07 May 2025
https://github.com/megagonlabs/ruler
Data Programming by Demonstration (DPBD) for Document Classification
data-labeling data-programming data-science machine-learning training-data weak-supervision
Last synced: 07 Jul 2025
https://github.com/milangritta/Pragmatic-Guide-to-Geoparsing-Evaluation
Full resources supporting the publication "A Pragmatic Guide to Geoparsing Evaluation."
analysis data evaluation geocoder geocoding geography geoparser geoparsing google-cloud linguistics location machine-learning named-entity-recognition places spacy-nlp taxonomy toponym-resolution toponyms toponymy training-data
Last synced: 07 Apr 2025
https://github.com/instapy/instapy-gender-classification
🔎 Classification helper for the sex classification feature of InstaPy
classification helper instapy training-data
Last synced: 10 Oct 2025
https://github.com/abinashmeher999/voice-data-extract
A command line interface to combine text information from subtitles with voice data in the video. Provides a convenient way to generate training data for speech-recognition purposes.
speech-recognition speech-to-text training-data
Last synced: 14 Mar 2026
https://github.com/minhaskamal/alphabetrecognizer
Simple Optical Character Recognizer (english-ocr-image-to-text-recognition-sample-training-alphabet-photo-data-database-dataset)
alphabet-recognizer data database english image-processing java machine-learning ocr sample template-matching text-recognition training-data writing
Last synced: 11 Apr 2025
https://github.com/phineas-pta/speech-synthesis-ngngngan
Python script to download and process data to train a speech-synthesis model of Vietnamese M.C. Nguyễn Ngọc Ngạn
data-processing deep-learning matcha-tts model-training pytorch rvc training-data vietnamese vits2
Last synced: 23 Jun 2025
https://github.com/stritti/thermal-solar-plant-dataset
Realtime Thermal Solar Plant Dataset for Machine Learning
dataset examples iot machine-learning opendata public-data research smarthome training-data
Last synced: 27 Jan 2026
https://github.com/alea-institute/kl3m-data
KL3M training data collection and preprocessing
Last synced: 02 Sep 2025
https://github.com/deepraj1729/track
Training images for self-driving cars in the Udacity Nanodegree Self-Driving Car Simulator
deep-learning image-processing reinforcement-learning self-driving-car training-data udacity udacity-nanodegree udacity-self-driving-car
Last synced: 28 Feb 2025
https://github.com/raad-labs/raad-video
A high-performance video loading library for machine learning, designed for efficient training data preparation.
cuda machine-learning training-data
Last synced: 17 Oct 2025
https://github.com/mockloop/mockloop-mcp
Intelligent Model Context Protocol (MCP) server for AI-assisted API development. Generate mock servers from OpenAPI specs with advanced logging, performance analytics, and server discovery. Optimized for AI development workflows with comprehensive testing insights and automated analysis.
ai api feedback-loop llm mcp mcp-server mcp-servers mock mocking-server mocking-utility models openapi swagger training training-data
Last synced: 04 Sep 2025
https://github.com/jbaiter/archiscribe-corpus
Repository for 19th century German fraktur lines transcribed via archiscribe.jbaiter.de
19th-century dataset evaluation-data fraktur historical-data ocr training-data
Last synced: 27 Feb 2026
https://github.com/jakarto3d/jakarnotator
Jakarnotator is an annotation tool for creating your own database for instance segmentation problems.
annotations computer-vision data database deep-learning detectron instance-segmentation mscoco training-data
Last synced: 15 May 2025
https://github.com/atticusrussell/bingimageaitrainer
A tool for generating diverse synthetic training images using Bing Image Creator to facilitate the training of AI/ML image models.
ai ai-training bing-api generative-ai image-generation machine-image machine-learning machine-vision semantic-segmentation text-to-image training-data
Last synced: 02 May 2025
https://github.com/lfoppiano/supercon2
Staging area for automatically collected experimental data for the SuperCon database, with a curation interface and an enhanced document viewer
feedback grobid superconductors tdm training training-data
Last synced: 27 Mar 2025
https://github.com/openethicsai/oedp
Open Ethics Data Passport
data-governance datasets json-schemas model-management oedp open-ethics training-data
Last synced: 19 Aug 2025
https://github.com/cosmincatalin/shaper
DSL for generating basic images
data-generation dsl synthetic-data training-data
Last synced: 16 Jan 2026
https://github.com/scthornton/securecode
Unified security training dataset (2,185 examples) covering OWASP Top 10 2021 and OWASP LLM Top 10 2025
ai-security huggingface owasp secure-coding security-dataset training-data web-security
Last synced: 22 Jun 2026
https://github.com/headless-start/data-augmentation-impact
This repository shows the effect of data augmentation of the training set during model training.
augmented-images cuda data gpu keras matplotlib mnist opencv-python python3 tensorflow training-data
Last synced: 05 Apr 2026
https://github.com/ullaskunder3/javascript-ml
Automatically reverse text color based on background color using brain.js brain.NeuralNetwork()
basic brain javascript ml training-data
Last synced: 03 Apr 2025
https://github.com/exortions/diabetes-prediction-with-neural-networks
Predicting if a patient has diabetes based on a training set with an accuracy of ~80%
keras keras-tensorflow matplotlib matplotlib-pyplot neural-network pandas pandas-dataframe pandas-python python python3 reinforcement-learning reinforcement-learning-algorithms seaborn sklearn tensorflow training-data training-project
Last synced: 07 Apr 2026
https://github.com/ashutoshdongare/softskill-ner
Fine-tuning a 🤗 transformer model for a soft-skill NER task
bert-fine-tuning dataset distilbert huggingface ner softskills token-classification training-data transfer-learning transformers
Last synced: 21 Mar 2025
https://github.com/shreeshrii/kraken_devanagari
Kraken models for Devanagari
devanagari kraken ocr sanskrit training-data
Last synced: 21 Mar 2025
https://github.com/centralfloridaattorney/yahoostocks
YahooStock is a simple tool to make training and testing splits of stock market data from Yahoo using a ticker symbol.
data-science machine-learning stock-market training-data
Last synced: 11 Jun 2025
https://github.com/rastmob/wordpress-llms-output-plugin
A WordPress plugin to export posts, pages, and custom post types as JSON for training large language models (LLMs).
ai data llm llms training training-data wordpress wordpress-development wordpress-plugin
Last synced: 03 May 2026
https://github.com/scar17off/chess-ai
Chess engine powered by neural networks featuring web and desktop interfaces, training capabilities, and grandmaster opening support.
artificial-intelligence chess chess-engine chess-openings deep-learning game-ai gradio gui machine-learning neural-networks pygame python training-data web-interface
Last synced: 31 Mar 2025
https://github.com/DefinetlyNotAI/VulnScan_Data
Logicytics VulnScan Module's Training Data and old model archive
ai data logicytics ml models pytorch sensitive-files text-processing tfidf-text-analysis training-data
Last synced: 17 Aug 2025
https://github.com/vinayakdon/machine-learning-project-sentimental-classifier-
A sentiment classification tool using machine learning in Python to analyze and predict the sentiment of text data. Features preprocessing, model training, hyperparameter tuning, and evaluation for accurate sentiment analysis.
dataanalytics dataprocessing datascience python training-data
Last synced: 17 May 2026
https://github.com/264gaurav/deep-learning
Learning, building, and exploring deep learning and neural networks
artificial-neural-networks dagshub deep-learning deep-neural-networks dvc experiment-tracking hyperparameter-tuning keras-tensorflow keras-tuner mlflow-tracking numpy sklearn tensorflow testing training-data
Last synced: 06 May 2026
https://github.com/monu-yaduwanshi/climatexpert-climate_impact_financial_predictor
This app helps businesses estimate losses from natural disasters before they happen. The well-trained AI model helps you decide whether or not to apply for insurance.
android androidapp androidstudio api firebase insurance jetpack-compose kotlin-android python realtime tensorflow training-data
Last synced: 09 May 2026
https://github.com/definetlynotai/vulnscan_data
Logicytics VulnScan Module's Training Data and old model archive
ai data logicytics ml models pytorch sensitive-files text-processing tfidf-text-analysis training-data
Last synced: 11 Oct 2025
https://github.com/mundialis/r.incora
GRASS GIS addon for Incora landcover classification. See also https://github.com/mundialis/incora
classification grass-gis grass-gis-addons incora landcover landcover-classification machine-learning postprocessing training-data
Last synced: 02 Mar 2026
https://github.com/amirzenoozi/simple-classification
A simple vehicle classifier based on Keras and TensorFlow, plus a training script
cars classification dataset deep-learning image-processing keras keras-tensorflow machine-learning planes tensorflow tf2 train training-data vehicle vehicle-classification vehicle-dataset
Last synced: 04 May 2026
https://github.com/pattabhia/dataset-generator
A flexible, template-based dataset generator for creating high-quality training data for enterprise AI and RAG (Retrieval-Augmented Generation) systems.
dataset-generation fine-tuning knowledge-graph llama llama-factory machine-learning python python3 rag training-data vector-database
Last synced: 24 Dec 2025
https://github.com/veldhub/veld_data__apis_oebl__ner_gold
Data velds encapsulating NLP / NER gold data.
gold-data named-entity-recognition ner nlp spacy spacy-nlp spacy-nlp-ner training-data
Last synced: 10 Aug 2025
https://github.com/danmurf/datakeg
Brew synthetic training data from your documentation using LLMs
dataset-generation fine-tuning llm machine-learning nlp synthetic-data training-data
Last synced: 18 Feb 2026
https://github.com/jzombie/rust-triplets
Composable data sampling primitives for deterministic multi-source ML/AI training-data orchestration.
algorithms artificial-intelligence bm25 dataset-sampling science text-processing train-test-split training-data triplet-mining
Last synced: 07 Apr 2026
https://github.com/liuxiaotong/data-recipe
Reverse-engineering framework for AI datasets — extract annotation specs, cost models & reproducibility from samples or requirement docs.
ai-agent ai-data-pipeline annotation-spec cost-estimation dataset-analysis huggingface llm mcp python reverse-engineering training-data workflow-automation
Last synced: 08 Feb 2026
https://github.com/simonko-912/simongpt-simple-instruct
SimonGPT LLM and its training data. You can use this data for your own LLM too; make sure you follow the Apache 2.0 license.
ai json llm lm python pytorch torch training-data
Last synced: 16 Apr 2026
https://github.com/moonyfringers/ladon
crawler data-pipeline ladon ladon-framework llm python training-data web-crawler web-scraping
Last synced: 17 Apr 2026
https://github.com/pedroteixeiraw/variational_quantum_circuit_binary_classification
This project focuses on developing a Variational Quantum Circuit capable of performing Binary Classification between two classes: red wine and white wine, based on their characteristics using machine learning.
binary-classification cost-function json machine-learning matplotlib numpy pandas qiskit qiskit-machine-learning quantum-machine-learning scikit-learn training-data variational-circuit
Last synced: 04 Apr 2026
https://github.com/faizantkhan/machine-learning
Machine Learning Practice and Exercises. Welcome to our repository dedicated to the practice and mastery of machine learning (ML) concepts and techniques. This repository serves as a comprehensive resource for learners and enthusiasts looking to enhance their ML skills through hands-on exercises and practical applications.
classification-algorithm clustering-algorithm data-science datavisualization decision-trees eda linear-regression logistic-regression machine-learning machine-learning-algorithms machine-learning-library math matplotlib-pyplot model-selection pandas python sklearn-library testing-data training-data
Last synced: 18 Apr 2026
https://github.com/sivakiran7/pytorch_deeplearning
cnn-classification python pytorch scikit-learn training-data
Last synced: 29 Apr 2026
https://github.com/youseftareq33/java_ai_3_machine-learning
Building a predictive model using the linear regression algorithm
java linear-regression machine-learning training-data weka
Last synced: 23 Jul 2025
https://github.com/radom12/stockpredictior
Stock Price Prediction. Predict stock prices using machine learning and deep learning models. Analyze historical market data, implement state-of-the-art algorithms, and visualize predictions. Explore trends, evaluate accuracy, and contribute to enhancing predictive capabilities. Educational and research-focused. 📈💡
ml model nasdaq python stock-price-prediction training-data
Last synced: 09 May 2026
https://github.com/sagargaud01/ai-driven-media-investment-plan-
AI-Driven Media Investment Plan Across Channels for E-commerce
ai business-intelligence data-set juypter python seaborn training training-data
Last synced: 13 Jun 2026
https://github.com/veldhub/veld_data__amc_we_training_data
Data velds encapsulating Austrian Media Corpus gold data
gold-data nlp training-data word-embeddings wordembeddings
Last synced: 06 Jan 2026
https://github.com/monu-yaduwanshi/climate_impact_financial_predictor
This app helps businesses estimate losses from natural disasters before they happen. The well-trained AI model helps you decide whether or not to apply for insurance.
android androidapp androidstudio api firebase insurance jetpack-compose kotlin-android python realtime tensorflow training-data
Last synced: 31 Mar 2025
https://github.com/devo8604/cicd_llm_data_scraper
Automated pipeline for generating high-quality Q&A training data from Git repositories. Processes source code with LLMs to create fine-tuning datasets. Features smart caching, resume support, MLX (Apple Silicon) and llama.cpp backends, and multiple export formats (Alpaca, ChatML, etc.).
alpaca code-analysis data-pipeline dataset-generation fine-tuning instruction-tuning llamacpp llm machine-learning mlx python question-answering sqlite synthetic-data training-data
Last synced: 11 Apr 2026
https://github.com/jeanpaul20/aethon-mission-control
AI-powered coding orchestrator for VS Code — refine prompts, multi-model proposals, training data export
ai ai-agents ai-assistant code-review copilot machine-learning ollama orchestrator prompt-engineering training-data typescript vscode-extension
Last synced: 17 Feb 2026
https://github.com/provos/world-history-1500-qa
This repository contains a set of question and answer pairs derived from the book "World History Since 1500: An Open and Free Textbook" by John Rankin and Constanze Weise of East Tennessee State University.
Last synced: 05 Sep 2025
https://github.com/ekbass/updated-grade-school-math
Updated version of OpenAI's Grade School Math dataset.
dataset fine-tuning llm llm-training machine-learning math mathematical-expressions mathematics training-data
Last synced: 04 Jan 2026
https://github.com/leo-gan/anonymizer
An app and an SDK to anonymize large PDF files
anonymization anonymize anthropic deanonymization gemini healthcare huggingface-hub legal-documents llm ollama openai openrouter pdf python training-data
Last synced: 14 Jan 2026
https://github.com/mgschoen/picpic-explorer
Web UI for the data behind PicPic, an automatic image selection tool for news articles
computational-linguistics ffnn image-selection keyword-extraction machinelearning neural-networks news-articles single-page-app training-data
Last synced: 11 Apr 2026
https://github.com/szgabsz91/morpher-data
A collection of training and evaluation data for Morpher
evaluation-data hungarian morphology training-data
Last synced: 16 Feb 2026
https://github.com/veldhub/veld_data__demo_train_data_ts-vienna-2024
Demo training data for the CLSInfra training school 2024.
conllu gold-data nlp training-data
Last synced: 14 Feb 2026