An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with training-data

A curated list of projects in awesome lists tagged with training-data.

https://github.com/snorkel-team/snorkel

A system for quickly generating training data with weak supervision

ai data-augmentation data-science data-slicing labeling machine-learning python snorkel training-data weak-supervision

Last synced: 23 Feb 2026

https://hazyresearch.github.io/snorkel

A system for quickly generating training data with weak supervision

ai data-augmentation data-science data-slicing labeling machine-learning python snorkel training-data weak-supervision

Last synced: 26 Feb 2025

https://github.com/diffgram/diffgram

The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.

annotation annotation-tool annotations data data-analytics data-annotation data-science datasets datastore deep-learning image-annotation kubernetes labeling machine-learning training-data video-annotation

Last synced: 14 Mar 2025

https://github.com/alteryx/compose

A machine learning tool for automated prediction engineering. It allows you to easily structure prediction problems and generate labels for supervised learning.

ai automl data-labeling data-science labeling labeling-tool machine-learning prediction-engineering prediction-problem training-data

Last synced: 14 May 2025

https://github.com/a-maliarov/amazoncaptcha

Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.

amazon amazon-captcha amazon-scraper amazoncaptcha captcha captcha-solver data-extraction pillow python3 training-data

Last synced: 26 Mar 2025

https://github.com/google-research-datasets/swim-ir

SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 languages, generated using PaLM 2 and summarize-then-ask prompting.

cross-lingual datasets deep-learning information-retrieval machine-learning multilingual natural-language-processing neural-information-retrieval nlp training-data

Last synced: 01 Apr 2026

https://github.com/ableinc/git2txt

Convert all files in a Git repository to .txt files. Useful for training LLMs on your codebase.

git llm machine-learning python3 training-data txt

Last synced: 06 Oct 2025

https://github.com/megagonlabs/ruler

Data Programming by Demonstration (DPBD) for Document Classification

data-labeling data-programming data-science machine-learning training-data weak-supervision

Last synced: 07 Jul 2025

https://github.com/instapy/instapy-gender-classification

🔎 Classification helper for the sex classification feature of InstaPy

classification helper instapy training-data

Last synced: 10 Oct 2025

https://github.com/abinashmeher999/voice-data-extract

A command line interface to combine text information from subtitles with voice data in the video. Provides a convenient way to generate training data for speech-recognition purposes.

speech-recognition speech-to-text training-data

Last synced: 14 Mar 2026

https://github.com/minhaskamal/alphabetrecognizer

Simple Optical Character Recognizer (english-ocr-image-to-text-recognition-sample-training-alphabet-photo-data-database-dataset)

alphabet-recognizer data database english image-processing java machine-learning ocr sample template-matching text-recognition training-data writing

Last synced: 11 Apr 2025

https://github.com/phineas-pta/speech-synthesis-ngngngan

Python script to download and process data to train a speech-synthesis model of Vietnamese M.C. Nguyễn Ngọc Ngạn

data-processing deep-learning matcha-tts model-training pytorch rvc training-data vietnamese vits2

Last synced: 23 Jun 2025

https://github.com/stritti/thermal-solar-plant-dataset

Realtime Thermal Solar Plant Dataset for Machine Learning

dataset examples iot machine-learning opendata public-data research smarthome training-data

Last synced: 27 Jan 2026

https://github.com/alea-institute/kl3m-data

KL3M training data collection and preprocessing

ai alea kl3m training-data

Last synced: 02 Sep 2025

https://github.com/deepraj1729/track

Training images for self-driving cars in the Udacity Nanodegree Self-Driving Car Simulator

deep-learning image-processing reinforcement-learning self-driving-car training-data udacity udacity-nanodegree udacity-self-driving-car

Last synced: 28 Feb 2025

https://github.com/raad-labs/raad-video

A high-performance video loading library for machine learning, designed for efficient training data preparation.

cuda machine-learning training-data

Last synced: 17 Oct 2025

https://github.com/mockloop/mockloop-mcp

Intelligent Model Context Protocol (MCP) server for AI-assisted API development. Generate mock servers from OpenAPI specs with advanced logging, performance analytics, and server discovery. Optimized for AI development workflows with comprehensive testing insights and automated analysis.

ai api feedback-loop llm mcp mcp-server mcp-servers mock mocking-server mocking-utility models openapi swagger training training-data

Last synced: 04 Sep 2025

https://github.com/jbaiter/archiscribe-corpus

Repository for 19th-century German Fraktur lines transcribed via archiscribe.jbaiter.de

19th-century dataset evaluation-data fraktur historical-data ocr training-data

Last synced: 27 Feb 2026

https://github.com/jakarto3d/jakarnotator

The Jakarnotator is an annotation tool for creating your own database for instance segmentation problems.

annotations computer-vision data database deep-learning detectron instance-segmentation mscoco training-data

Last synced: 15 May 2025

https://github.com/atticusrussell/bingimageaitrainer

A tool for generating diverse synthetic training images using Bing Image Creator to facilitate the training of AI/ML image models.

ai ai-training bing-api generative-ai image-generation machine-image machine-learning machine-vision semantic-segmentation text-to-image training-data

Last synced: 02 May 2025

https://github.com/lfoppiano/supercon2

Staging area for automatically collected experimental data for the SuperCon database, with a curation interface and an enhanced document viewer

feedback grobid superconductors tdm training training-data

Last synced: 27 Mar 2025

https://github.com/cosmincatalin/shaper

DSL for generating basic images

data-generation dsl synthetic-data training-data

Last synced: 16 Jan 2026

https://github.com/scthornton/securecode

Unified security training dataset (2,185 examples) covering OWASP Top 10 2021 and OWASP LLM Top 10 2025

ai-security huggingface owasp secure-coding security-dataset training-data web-security

Last synced: 22 Jun 2026

https://github.com/mykhode/data_mining_py

Simple data scraping with Python

ai scrabe-data training-data

Last synced: 16 Jun 2026

https://github.com/headless-start/data-augmentation-impact

This repository shows the effect of data augmentation of the training set during model training.

augmented-images cuda data gpu keras matplotlib mnist opencv-python python3 tensorflow training-data

Last synced: 05 Apr 2026

https://github.com/ullaskunder3/javascript-ml

Reverse Text Color Based on Background Color Automatically using brain.js brain.NeuralNetwork()

basic brain javascript ml training-data

Last synced: 03 Apr 2025

https://github.com/shreeshrii/kraken_devanagari

Kraken models for Devanagari

devanagari kraken ocr sanskrit training-data

Last synced: 21 Mar 2025

https://github.com/centralfloridaattorney/yahoostocks

YahooStock is a simple tool to make training and testing splits of stock market data from Yahoo using a ticker symbol.

data-science machine-learning stock-market training-data

Last synced: 11 Jun 2025

https://github.com/rastmob/wordpress-llms-output-plugin

A WordPress plugin to export posts, pages, and custom post types as JSON for training Large Language Models (LLMs).

ai data llm llms training training-data wordpress wordpress-development wordpress-plugin

Last synced: 03 May 2026

https://github.com/scar17off/chess-ai

Chess engine powered by neural networks featuring web and desktop interfaces, training capabilities, and grandmaster opening support.

artificial-intelligence chess chess-engine chess-openings deep-learning game-ai gradio gui machine-learning neural-networks pygame python training-data web-interface

Last synced: 31 Mar 2025

https://github.com/DefinetlyNotAI/VulnScan_Data

Logicytics VulnScan Module's Training Data and old model archive

ai data logicytics ml models pytorch sensitive-files text-processing tfidf-text-analysis training-data

Last synced: 17 Aug 2025

https://github.com/vinayakdon/machine-learning-project-sentimental-classifier-

A sentiment classification tool using machine learning in Python to analyze and predict the sentiment of text data. Features preprocessing, model training, hyperparameter tuning, and evaluation for accurate sentiment analysis.

dataanalytics dataprocessing datascience python training-data

Last synced: 17 May 2026

https://github.com/monu-yaduwanshi/climatexpert-climate_impact_financial_predictor

This app helps businesses estimate losses from natural disasters before they happen. Its well-trained AI model helps you decide whether or not to apply for insurance.

android androidapp androidstudio api firebase insurance jetpack-compose kotlin-android python realtime tensorflow training-data

Last synced: 09 May 2026

https://github.com/definetlynotai/vulnscan_data

Logicytics VulnScan Module's Training Data and old model archive

ai data logicytics ml models pytorch sensitive-files text-processing tfidf-text-analysis training-data

Last synced: 11 Oct 2025

https://github.com/mundialis/r.incora

GRASS GIS addon for Incora landcover classification. See also https://github.com/mundialis/incora

classification grass-gis grass-gis-addons incora landcover landcover-classification machine-learning postprocessing training-data

Last synced: 02 Mar 2026

https://github.com/pattabhia/dataset-generator

A flexible, template-based dataset generator for creating high-quality training data for enterprise AI and RAG (Retrieval-Augmented Generation) systems.

dataset-generation fine-tuning knowledge-graph llama llama-factory machine-learning python python3 rag training-data vector-database

Last synced: 24 Dec 2025

https://github.com/danmurf/datakeg

Brew synthetic training data from your documentation using LLMs

dataset-generation fine-tuning llm machine-learning nlp synthetic-data training-data

Last synced: 18 Feb 2026

https://github.com/jzombie/rust-triplets

Composable data sampling primitives for deterministic multi-source ML/AI training-data orchestration.

algorithms artificial-intelligence bm25 dataset-sampling science text-processing train-test-split training-data triplet-mining

Last synced: 07 Apr 2026

https://github.com/liuxiaotong/data-recipe

Reverse-engineering framework for AI datasets — extract annotation specs, cost models & reproducibility from samples or requirement docs.

ai-agent ai-data-pipeline annotation-spec cost-estimation dataset-analysis huggingface llm mcp python reverse-engineering training-data workflow-automation

Last synced: 08 Feb 2026

https://github.com/simonko-912/simongpt-simple-instruct

SimonGPT LLM and its training data. You can use this data for your own LLM too; make sure you follow the Apache 2.0 license.

ai json llm lm python pytorch torch training-data

Last synced: 16 Apr 2026

https://github.com/pedroteixeiraw/variational_quantum_circuit_binary_classification

This project focuses on developing a Variational Quantum Circuit capable of performing Binary Classification between two classes: red wine and white wine, based on their characteristics using machine learning.

binary-classification cost-function json machine-learning matplotlib numpy pandas qiskit qiskit-machine-learning quantum-machine-learning scikit-learn training-data variational-circuit

Last synced: 04 Apr 2026

https://github.com/faizantkhan/machine-learning

Machine Learning Practice and Exercises. Welcome to our repository dedicated to the practice and mastery of machine learning (ML) concepts and techniques. This repository serves as a comprehensive resource for learners and enthusiasts looking to enhance their ML skills through hands-on exercises and practical applications.

classification-algorithm clustering-algorithm data-science datavisualization decision-trees eda linear-regression logistic-regression machine-learning machine-learning-algorithms machine-learning-library math matplotlib-pyplot model-selection pandas python sklearn-library testing-data training-data

Last synced: 18 Apr 2026

https://github.com/youseftareq33/java_ai_3_machine-learning

Building a predictive model using the Linear Regression algorithm

java linear-regression machine-learning training-data weka

Last synced: 23 Jul 2025

https://github.com/radom12/stockpredictior

Stock Price Prediction. Predict stock prices using machine learning and deep learning models. Analyze historical market data, implement state-of-the-art algorithms, and visualize predictions. Explore trends, evaluate accuracy, and contribute to enhancing predictive capabilities. Educational and research-focused. 📈💡

ml model nasdaq python stock-price-prediction training-data

Last synced: 09 May 2026

https://github.com/sagargaud01/ai-driven-media-investment-plan-

AI-Driven Media Investment Plan Across Channels for E-commerce

ai business-intelligence data-set juypter python seaborn training training-data

Last synced: 13 Jun 2026

https://github.com/veldhub/veld_data__amc_we_training_data

Data velds encapsulating Austrian Media Corpus gold data

gold-data nlp training-data word-embeddings wordembeddings

Last synced: 06 Jan 2026

https://github.com/monu-yaduwanshi/climate_impact_financial_predictor

This app helps businesses estimate losses from natural disasters before they happen. Its well-trained AI model helps you decide whether or not to apply for insurance.

android androidapp androidstudio api firebase insurance jetpack-compose kotlin-android python realtime tensorflow training-data

Last synced: 31 Mar 2025

https://github.com/devo8604/cicd_llm_data_scraper

Automated pipeline for generating high-quality Q&A training data from Git repositories. Processes source code with LLMs to create fine-tuning datasets. Features smart caching, resume support, MLX (Apple Silicon) and llama.cpp backends, and multiple export formats (Alpaca, ChatML, etc.).

alpaca code-analysis data-pipeline dataset-generation fine-tuning instruction-tuning llamacpp llm machine-learning mlx python question-answering sqlite synthetic-data training-data

Last synced: 11 Apr 2026

https://github.com/jeanpaul20/aethon-mission-control

AI-powered coding orchestrator for VS Code — refine prompts, multi-model proposals, training data export

ai ai-agents ai-assistant code-review copilot machine-learning ollama orchestrator prompt-engineering training-data typescript vscode-extension

Last synced: 17 Feb 2026

https://github.com/provos/world-history-1500-qa

This repository contains a set of question and answer pairs derived from the book "World History Since 1500: An Open and Free Textbook" by John Rankin and Constanze Weise of East Tennessee State University.

llm rag training-data

Last synced: 05 Sep 2025

https://github.com/mgschoen/picpic-explorer

Web UI for the data behind PicPic, an automatic image selection tool for news articles

computational-linguistics ffnn image-selection keyword-extraction machinelearning neural-networks news-articles single-page-app training-data

Last synced: 11 Apr 2026

https://github.com/szgabsz91/morpher-data

A collection of training and evaluation data for Morpher

evaluation-data hungarian morphology training-data

Last synced: 16 Feb 2026

https://github.com/veldhub/veld_data__demo_train_data_ts-vienna-2024

Demo training data for the CLSInfra training school 2024.

conllu gold-data nlp training-data

Last synced: 14 Feb 2026