An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with synthetic-data

A curated list of projects in awesome lists tagged with synthetic-data.

https://github.com/lk-geimfari/mimesis

Mimesis is a robust data generator for Python that can produce a wide range of fake data in multiple languages.

data dataframe datascience dummy factory factory-boy fake fixtures generator json-generator mimesis mock pandas polars pytest-plugin python schema syntetic synthetic-data testing

Last synced: 28 Dec 2025

https://github.com/nucleuscloud/neosync

Open Source Data Security Platform for Developers to Monitor and Detect PII, Anonymize Production Data and Sync it across environments.

benthos docker etl faker fine-tuning golang kubernetes mysql nextjs open-source orchestration postgresql reactjs self-hosted synthetic-data synthetic-data-generation test-data-generator testing typescript

Last synced: 12 May 2025

https://github.com/kiln-ai/kiln

The easiest tool for fine-tuning LLM models, synthetic data generation, and collaborating on datasets.

ai chain-of-thought collaboration dataset-generation evals evaluation fine-tuning machine-learning macos ml ollama openai prompt prompt-engineering python rlhf synthetic-data windows

Last synced: 23 Apr 2025

https://github.com/argilla-io/distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

ai huggingface llms openai python rlaif rlhf synthetic-data synthetic-dataset-generation

Last synced: 11 Apr 2025

https://github.com/hitsz-ids/synthetic-data-generator

SDG is a specialized framework designed to generate high-quality structured tabular data.

agent data-generator deep-learning gan generative-ai llm machine-learning privacy synthetic-data tabular-data

Last synced: 13 May 2025

https://github.com/unrealcv/unrealcv

UnrealCV: Connecting Computer Vision to Unreal Engine

computer-vision embodied-ai machine-learning simulation synthetic-data ue4 virtual-worlds

Last synced: 25 Sep 2025

https://github.com/huggingface/aisheets

Build, enrich, and transform datasets using AI models with no code

ai llm-evaluation llms nocode oss synthetic-data

Last synced: 14 Oct 2025

https://github.com/sdv-dev/ctgan

Conditional GAN for generating synthetic tabular data.

data-generation generative-adversarial-network synthetic-data synthetic-data-generation tabular-data

Last synced: 13 May 2025

https://github.com/jofpin/synthBTC

A tool that uses advanced Monte Carlo simulations and Turbit parallel processing to create possible Bitcoin prediction scenarios.

bitcoin data-processing monte-carlo-simulation nodejs prediction synthetic-data turbit

Last synced: 27 Sep 2025

https://github.com/batsresearch/bonito

A lightweight library for generating synthetic instruction tuning datasets for your data without GPT.

domain-adaptation gpt llm synthetic-data synthetic-dataset-generation task-adaptation zero-shot-learning

Last synced: 21 Apr 2025

https://github.com/magpie-align/magpie

[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data generation pipeline!

alignment dataset gemma llama2 llama3 llm nlp paper phi3 qwen2 supervised-finetuning synthetic-data synthetic-dataset-generation

Last synced: 15 May 2025

https://github.com/gretelai/gretel-synthetics

Synthetic data generators for structured and unstructured text, featuring differentially private learning.

artificial-intelligence differential-privacy privacy synthetic-data tensorflow

Last synced: 14 May 2025

https://github.com/SciPhi-AI/synthesizer

A multi-purpose LLM framework for RAG and data creation.

agents ai artificial-intelligence machine-learning synthetic-data

Last synced: 10 Jul 2025

https://github.com/paulbricman/thisrepositorydoesnotexist

A curated list of awesome projects which use Machine Learning to generate synthetic content.

generation-algorithms generative-adversarial-network synthetic-data synthetic-dataset-generation synthetic-images

Last synced: 12 Oct 2025

https://github.com/vanderschaarlab/synthcity

A library for generating and evaluating synthetic tabular data for privacy, fairness and data augmentation.

data-augmentation fairness-ml generative-model machine-learning privacy pytorch synthetic-data tabular-data

Last synced: 16 May 2025

https://github.com/plaitpy/plaitpy

plait.py - a fake data modeler

declarative modeling synthetic-data

Last synced: 04 Apr 2025

https://github.com/databrickslabs/dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) can generate large simulated or synthetic data sets for tests, POCs, and other uses in Databricks environments, including Delta Live Tables pipelines.

data-generation databricks datagen datageneration datagenerator delta-live-tables deltalake faker pyspark python spark spark-streaming synthetic-data

Last synced: 07 Jul 2025

https://github.com/GeorgeCazenavette/mtt-distillation

Official code for our CVPR '22 paper "Dataset Distillation by Matching Training Trajectories"

artificial-intelligence computer-vision machine-learning synthetic-data

Last synced: 08 May 2025

https://github.com/microsoft/genalog

Genalog is an open source, cross-platform Python package for generating synthetic document images with custom degradations and text alignment capabilities.

data-generation data-science machine-learning ner ocr-recognition python synthetic-data synthetic-data-generation synthetic-images text-alignment

Last synced: 04 Apr 2025

https://github.com/BMW-InnovationLab/BMW-Labeltool-Lite

This repository provides an easy-to-use labeling tool for state-of-the-art deep learning training. It supports auto-labeling.

annotaion auto-label autolabeling bounding-box boundingbox computer-vision deep-learning docker image-annotation inference label labeling-tool labeltool neural-network object-detection smart-labeling synthetic-data tensorflow voc yolov4

Last synced: 07 May 2025

https://github.com/unity-technologies/robotics-object-pose-estimation

A complete end-to-end demonstration in which we collect training data in Unity and use that data to train a deep neural network to predict the pose of a cube. This model is then deployed in a simulated robotic pick-and-place task.

autonomy computer-vision deep-learning machine-learning manipulation model-training motion-planning perception physics-simulation pose-estimation robotics robotics-simulation ros simulation synthetic-data trajectory-generation tutorial unity ur3-robot-arm urdf

Last synced: 06 Apr 2025

https://github.com/tabularis-ai/be_great

A novel approach for synthesizing tabular data using pretrained large language models

data-generation deep-learning synthetic-data synthetic-dataset-generation tabular-data transformers

Last synced: 12 Jun 2025

https://github.com/milaan9/clustering-datasets

This repository contains a collection of UCI (real-life) datasets and synthetic (artificial) datasets, with cluster labels and MATLAB files, ready to use with clustering algorithms.

benchmark-datasets cluster cluster-labels clustering clustering-datasets dataset datasets real-world-datasets synthetic-data synthetic-datasets uci uci-dataset uci-machine-learning

Last synced: 03 Jul 2025

https://github.com/fjxmlzn/DoppelGANger

[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions

dataset-generation datasets doppelganger fidelity gan gans generative-adversarial-network privacy synthetic-data synthetic-data-generation synthetic-data-generator synthetic-dataset-generation time-series timeseries

Last synced: 15 May 2025

https://github.com/ZumoLabs/zpy

Synthetic data for computer vision. An open source toolkit using Blender and Python.

ai blender blender-addon computer-vision data deep-learning ml python synthetic synthetic-data

Last synced: 11 May 2025

https://github.com/sdv-dev/TGAN

Generative adversarial training for generating synthetic tabular data.

generative-adversarial-network synthesizing-tabular-data synthetic-data tabular-data

Last synced: 02 May 2025

https://github.com/gszfwsb/NCFM

Official PyTorch implementation of the paper "Dataset Distillation with Neural Characteristic Function" (NCFM) in CVPR 2025.

computer-vision data-centric-ai dataset-distillation synthetic-data

Last synced: 01 Apr 2025

https://github.com/expectedparrot/edsl

Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.

anthropic data-labeling deepinfra domain-specific-language experiments llama2 llm llm-agent llm-framework llm-inference market-research mixtral open-source openai python social-science surveys synthetic-data

Last synced: 15 May 2025

https://github.com/worldbank/REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.

data-generation deep-learning gpt gpt-2 seq2seq-model synthetic-data synthetic-dataset-generation tabular-data transformers

Last synced: 17 Aug 2025

https://github.com/sdv-dev/SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.

metrics quality synthetic-data

Last synced: 02 May 2025

https://github.com/project-agml/agml

AgML is a centralized framework for agricultural machine learning. AgML provides access to public agricultural datasets for common agricultural deep learning tasks, with standard benchmarks and pretrained models, as well as the ability to generate synthetic data and annotations.

agriculture computer-vision dataset deep-learning image-classification object-detection pytorch semantic-segmentation synthetic-data

Last synced: 15 May 2025

https://github.com/TonicAI/masquerade

A Postgres Proxy to Mask Data in Realtime

fake-data postgres postgresql synthetic-data

Last synced: 23 Aug 2025

https://github.com/ndrplz/surround_vehicles_awareness

Learn to map surrounding vehicles onto a bird's eye view of the scene.

adas bird-eye deep-learning self-driving-car synthetic-data

Last synced: 23 Oct 2025

https://github.com/alexandervnikitin/tsgm

Generation and evaluation of synthetic time series datasets (plus augmentations, visualizations, and a collection of popular datasets). NeurIPS'24.

augmentations data-augmentation data-science datasets deep-learning generative-model keras machine-learning python synthetic-data synthetic-time-series tensorflow2 time-series vae

Last synced: 06 Apr 2025

https://github.com/zjrwtx/sft-data-builder

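Use free LLM APIs together with your private-domain data to generate SFT training data (completely free of charge). Supports the training data formats of tools such as LLaMA-Factory. Synthetic data.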

agents alpaca cot datagene gpt40 llm mllm multiagents o1 python react sharegpt slm synthetic-data tailwindcss visionlanguagemodel

Last synced: 05 Apr 2025

https://github.com/aimclub/BAMT

Repository of a data modeling and analysis tool based on Bayesian networks

bayesian-networks mixed-data parameters-learning structure-learning synthetic-data

Last synced: 02 May 2025

https://github.com/sdv-dev/deepecho

Synthetic Data Generation for mixed-type, multivariate time series.

data-generation deep-learning generative-adversarial-network sdv synthetic-data synthetic-data-generation time-series

Last synced: 16 May 2025

https://github.com/khawar-islam/diffuseMix

Official PyTorch implementation of DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models (CVPR 2024)

cutmix data-augmentation diffusion-models generative-data-augmentation image-classification mixup synthetic-data transfer-learning

Last synced: 15 Aug 2025

https://github.com/stefan-jansen/synthetic-data-for-finance

Material for QuantUniversity talk on Synthetic Data Generation for Finance.

algorithmic-trading finance generative-adversarial-network machine-learning synthetic-data

Last synced: 12 Apr 2025

https://github.com/microsoft/dpsda

Private Evolution: Generating DP Synthetic Data without Training [ICLR 2024, ICML 2024 Spotlight]

differential-privacy foundation-models private-evolution synthetic-data training-free

Last synced: 04 Jul 2025

https://github.com/Baukebrenninkmeijer/table-evaluator

Evaluate real and synthetic datasets against each other

data data-evaluation evaluation generation synthetic synthetic-data table-evaluator

Last synced: 02 May 2025

https://github.com/justchenhao/IAug_CDNet

Official PyTorch implementation of Adversarial Instance Augmentation for Building Change Detection in Remote Sensing Images.

bi-temporal-images building-change-detection cdnet change-detection instance-augmentation remote-se synthetic-data

Last synced: 11 May 2025

https://github.com/bmw-innovationlab/sordi-ai-evaluation-gui

This repository allows you to evaluate a trained computer vision model and get general information and evaluation metrics with little configuration.

ai bmw computer-vision dataset deeplearning docker evaluation evaluation-framework no-code python rest-api sordi synthetic-data tensorflow

Last synced: 02 Jul 2025

https://github.com/ryoungj/BoLT

Code for "Reasoning to Learn from Latent Thoughts"

language-model latent-variable-models pretraining self-improvement synthetic-data

Last synced: 04 Oct 2025

https://github.com/jason718/game-feature-learning

Code for paper "Cross-Domain Self-supervised Multi-task Feature Learning using Synthetic Imagery", Ren et al., CVPR'18

computer-vision deep-learning domain-adaptation representation-learning self-supervised synthetic-data

Last synced: 10 Jul 2025

https://github.com/spiros/tofu

Tofu is a Python tool for generating synthetic UK Biobank data.

synthetic-data ukbiobank

Last synced: 09 Apr 2025

https://github.com/gretelai/gretel-python-client

The Gretel Python Client allows you to interact with the Gretel REST API.

datascience machine-learning privacy privacy-enhancing-technologies stream-processing synthetic-data

Last synced: 04 Apr 2025

https://github.com/howiehwong/unigen

[ICLR'25] DataGen: Unified Synthetic Dataset Generation via Large Language Models

benchmark dataset dataset-generation large-language-models llm synthetic-data toolkit

Last synced: 09 Apr 2025

https://github.com/sodascience/metasyn

Transparent and privacy-friendly synthetic data generation

metadata open-data privacy synthetic-data

Last synced: 07 Apr 2025

https://github.com/sunchang0124/dp_cgans

A library to generate synthetic tabular or RDF data using Conditional Generative Adversarial Networks (GANs) combined with Differential Privacy techniques.

differential-privacy gan synthesizer synthetic-data

Last synced: 07 Apr 2025

https://github.com/gretelai/synthetic-data-genomics

Proof of concept code from Gretel.ai and Illumina using generative neural networks to create synthetic versions of mouse genotype and phenotype data.

generative-model genomics privacy-enhancing-technologies synthetic-data

Last synced: 11 Jul 2025

https://github.com/dbt-labs/jaffle-shop-generator

🥪🏭 A simple CLI for generating synthetic Jaffle Shop data.

analytics-engineering faker synthetic-data synthetic-data-generator

Last synced: 01 May 2025

https://github.com/vincentkoc/synthetic-user-research

Example Notebook for Synthetic User Research with Persona Prompting and Autonomous Agents

autogen autonomous-agents research synthetic-data

Last synced: 22 Mar 2025

https://github.com/vincentkoc/tiny_qa_benchmark_pp

Tiny QA Benchmark++: a micro-benchmark suite (52-item gold set plus on-demand multilingual synthetic packs), generator CLI, and CI-ready eval harness for ultra-fast LLM smoke-testing and regression-catching.

benchmark dataset evaluation huggingface-datasets litellm llm llm-testing llmops qa-dataset smoke-test synthetic-data tinybenchmarks

Last synced: 12 Jun 2025

https://github.com/hicservices/synthehr

Library and CLI for randomly generating medical data like you might get out of an Electronic Health Records (EHR) system

cli dataset ehr electronic-health-records hospital-admission nuget patient synthetic-data testing-tools tests

Last synced: 26 Jul 2025