An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-generation

A curated list of projects in awesome lists tagged with data-generation.

https://github.com/idea-research/grounded-segment-anything

Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything

3d-whole-body-pose-estimation automatic-labeling-system caption data-generation image-editing open-vocabulary-detection open-vocabulary-segmentation speech

Last synced: 23 Apr 2025

https://github.com/IDEA-Research/Grounded-Segment-Anything

Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything

3d-whole-body-pose-estimation automatic-labeling-system caption data-generation image-editing open-vocabulary-detection open-vocabulary-segmentation speech

Last synced: 20 Mar 2025

https://github.com/sdv-dev/ctgan

Conditional GAN for generating synthetic tabular data.

data-generation generative-adversarial-network synthetic-data synthetic-data-generation tabular-data

Last synced: 13 May 2025

https://github.com/sdv-dev/CTGAN

Conditional GAN for generating synthetic tabular data.

data-generation generative-adversarial-network synthetic-data synthetic-data-generation tabular-data

Last synced: 02 May 2025

https://github.com/whatyouhide/stream_data

Data generation and property-based testing for Elixir. 🔮

data-generation elixir property-based-testing property-testing quickcheck

Last synced: 13 May 2025

https://github.com/open-sciencelab/GraphGen

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

ai4science data-generation data-synthesis graphgen knowledge-graph llama-factory llm llm-training pretrain pretraining qa question-answering qwen sft sft-data xtuner

Last synced: 29 Nov 2025

https://github.com/tom-lord/regexp-examples

Generate strings that match a given regular expression

data-generation mri random-string regexp ruby

Last synced: 06 Jul 2025

https://github.com/cieslarmichal/faker-cxx

C++ Faker library for generating fake (but realistic) data.

cpp cpp20 data-generation fake fake-data faker

Last synced: 04 Feb 2026

https://github.com/databrickslabs/dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines

data-generation databricks datagen datageneration datagenerator delta-live-tables deltalake faker pyspark python spark spark-streaming synthetic-data

Last synced: 07 Jul 2025

https://github.com/microsoft/genalog

Genalog is an open source, cross-platform Python package for generating synthetic document images with custom degradations and text alignment capabilities.

data-generation data-science machine-learning ner ocr-recognition python synthetic-data synthetic-data-generation synthetic-images text-alignment

Last synced: 04 Apr 2025

https://github.com/tabularis-ai/be_great

A novel approach for synthesizing tabular data using pretrained large language models

data-generation deep-learning synthetic-data synthetic-dataset-generation tabular-data transformers

Last synced: 12 Jun 2025

https://github.com/trinker/wakefield

Generate random data sets

data-generation r wakefield

Last synced: 04 Apr 2025

https://github.com/worldbank/REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.

data-generation deep-learning gpt gpt-2 seq2seq-model synthetic-data synthetic-dataset-generation tabular-data transformers

Last synced: 17 Aug 2025

https://github.com/worldbank/realtabformer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.

data-generation deep-learning gpt gpt-2 seq2seq-model synthetic-data synthetic-dataset-generation tabular-data transformers

Last synced: 04 Jan 2026

https://github.com/finos/datahelix

The DataHelix generator allows you to quickly create data, based on a JSON profile that defines fields and the relationships between them, for the purpose of testing and validation

data-engineering data-generation data-generator java test-data-generator

Last synced: 27 Feb 2025

https://github.com/sdv-dev/deepecho

Synthetic Data Generation for mixed-type, multivariate time series.

data-generation deep-learning generative-adversarial-network sdv synthetic-data synthetic-data-generation time-series

Last synced: 12 Feb 2026

https://github.com/sdv-dev/DeepEcho

Synthetic Data Generation for mixed-type, multivariate time series.

data-generation deep-learning generative-adversarial-network sdv synthetic-data synthetic-data-generation time-series

Last synced: 04 Apr 2025

https://github.com/louisYen/Gen4Gen

🏞️ Official implementation of "Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition"

data-generation llm personalization stable-diffusion text-to-image-generation

Last synced: 27 Mar 2025

https://github.com/mjkvaak/ImageDataAugmentor

Custom image data generator for TF Keras that supports the modern augmentation module albumentations

augmentation augmentations data-generation deep-learning image-augmentation image-classification machine-learning python tensorflow2

Last synced: 08 May 2025

https://github.com/kgoldfeld/simstudy

simstudy: Illuminating research methods through data generation

data-generation data-simulation r simulation statistical-models

Last synced: 22 Oct 2025

https://github.com/br0kej/bin2ml

A command line tool for extracting machine learning ready data from software binaries powered by Radare2

binary-analysis data-generation graph-neural-networks machine-learning ml4sec nlp radare2 reverse-engineering

Last synced: 22 Jan 2026

https://github.com/microsoft/codemixed-text-generator

This tool helps automatically generate grammatically valid synthetic code-mixed data using linguistic theories such as Equivalence Constraint Theory and Matrix Language Theory.

code-mixing code-switching data-generation language-modeling linguistics natural-language-processing python3 synthetic-data-generation

Last synced: 29 May 2026

https://github.com/smartcat-labs/ranger

Ranger is a contextual data generator used to create sensible data for integration tests or for experimenting in the database

contextual-data data-generation data-generator test-data

Last synced: 12 Aug 2025

https://github.com/rasinmuhammed/misata

High-performance open-source synthetic data engine. Uses LLMs for schema design and vectorized NumPy for deterministic, scalable generation.

data-engineering data-generation database-seeding developer-tools generative-ai llm mock-data numpy pandas python synthetic synthetic-data synthetic-dataset-generation testing

Last synced: 10 May 2026

https://github.com/grafana/k6-example-data-generation

Example repository showing how to utilise k6 and faker to load test using generated data

data-generation examples load-testing performance-testing

Last synced: 30 Apr 2025

https://github.com/edyan/neuralyzer

Neuralyzer is a library and a command line tool to anonymize databases (by updating existing data or populating a table with fake data)

anonymisation anonymization anonymize data-generation data-generator data-privacy database dgpr private-life rgpd

Last synced: 04 Apr 2025

https://github.com/Stranger6667/hypothesis-graphql

Generate arbitrary queries matching your GraphQL schema, and use them to verify your backend implementation.

data-generation graphql hypothesis python testing

Last synced: 11 May 2025

https://github.com/goodarzmehr/simbev

[ITSC 2026] SimBEV is a configurable and scalable synthetic driving data generation tool and dataset based on the CARLA Simulator.

3d-object-detection autonomous-driving bev bev-perception bev-segmentation carla-simulator computer-vision data-generation dataset depth-estimation semantic semantic-occupancy-prediction semantic-segmentation

Last synced: 30 May 2026

https://github.com/stranger6667/hypothesis-graphql

Generate arbitrary queries matching your GraphQL schema, and use them to verify your backend implementation.

data-generation graphql hypothesis python testing

Last synced: 20 Aug 2025

https://github.com/gretelai/trainer

Simple interface to synthesize complex and highly dimensional datasets using Gretel APIs.

data-generation deep-learning gan gans language-model machine-learning synthetic-data

Last synced: 04 Apr 2025

https://github.com/glynnbird/datamaker

Data generator command-line tool and library. Create JSON, CSV, XML data from templates.

cli csv data-generation json nodejs xml

Last synced: 08 May 2025

https://github.com/farlee2121/fsspec

FsSpec represents value constraints as data to reuse one constraint declaration for validation, data generation, error explanation, and more.

data-generation type-driven-development validation

Last synced: 15 Apr 2025

https://github.com/synthesized-io/tdk-demo

This is a collection of TDK demo projects that use different databases and options

data-generation data-generator db2 mysql oracle postgresql synthetic-data synthetic-dataset-generation test-data-generator vault

Last synced: 24 Jun 2025

https://github.com/stefanheng/proggen

Code for paper "ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models"

data-generation efficient-nlp few-shot-learning large-language-models low-resource-nlp named-entity-recognition natural-language-processing training-data-generation

Last synced: 13 Apr 2025

https://github.com/matousc89/signalz

Data generators in Python

data data-generation signals test-data-generator

Last synced: 14 Jan 2026

https://github.com/daffidwilde/edo

A library for generating artificial datasets through genetic evolution.

data-generation evolutionary-algorithms optimisation

Last synced: 05 May 2025

https://github.com/infineon/streamgen

Python framework for generating streams of labeled data.

continual-learning data-generation data-streams data-structures function-composition python

Last synced: 24 Jun 2025

https://github.com/shaoyijia/cmg

Code for ECML-PKDD 2022 Paper --- CMG: A Class-Mixed Generation Approach to Out-of-Distribution Detection

anomaly-detection data-generation novelty-detection out-of-distribution-detection

Last synced: 04 Apr 2025

https://github.com/ipjohnson/simplefixture

Testing fixture for .NET

c-sharp data-generation fixtures testing

Last synced: 15 Aug 2025

https://github.com/buaadreamer/spn4cir

[ACM MM 2024] Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives

acmmm2024 blip blip2 clip composed-image-retrieval cross-modal-retrieval data-generation image-retrieval llama llava memory-bank multi-modal-retrieval multimodal-learning transformer

Last synced: 14 Mar 2026

https://github.com/chaturv3di/absynthe

A (branching) Behaviour Synthesiser -- Simulates the generation of application or process logs, where multiple modules (or processes) can execute simultaneously, in a distributed deployment, and dump the log messages in an interleaved manner in a single log file.

control-flow-graph data-generation gitflow-workflow labelled-data log-analysis log-generator python3

Last synced: 09 Apr 2026

https://github.com/matherealize/simdata

An R package for simulating data

data-generation r-package regression simulation statistics

Last synced: 22 Oct 2025

https://github.com/tensorsense/vlm_databuilder

This SDK generates datasets for training Video LLMs from YouTube videos.

data-generation data-science llm video-llms vlm

Last synced: 11 Sep 2025

https://github.com/team-enderio/regilite

Registration and Data Generation Utility for NeoForge.

data-generation minecraft minecraft-mod minecraft-neoforge

Last synced: 25 Jul 2025

https://github.com/markusjx/datagen

Random data generator based on JSON schemas

data-generation hacktoberfest json rust test-data-generator

Last synced: 07 Sep 2025

https://github.com/valencian-digital/planter

Blazingly fast data generation for MongoDB

data-generation database mongodb rust seeding

Last synced: 31 Mar 2025

https://github.com/nfdi4health/docker-vambn

A containerized implementation of the VAMBN approach by TA6.4.

data-generation differential-privacy synthetic-data

Last synced: 20 Feb 2026

https://github.com/cosmincatalin/shaper

DSL for generating basic images

data-generation dsl synthetic-data training-data

Last synced: 16 Jan 2026

https://github.com/fillol/iiot-simulator

An advanced Industrial IoT (IIoT) simulator for Smart Factory 4.0 environments using Python, MQTT, and Docker. Emulates configurable production lines with realistic sensor data (vibration, temperature, quality) and predictive alerts.

data-generation digital-twin docker docker-compose iiot industrial-automation industry-4-0 manufacturing mosquitto mqtt python sensor-simulation simulator smart-factory

Last synced: 04 Aug 2025

https://github.com/jaehyeon-kim/dynamic-des

Real-time SimPy control plane to dynamically update parameters and stream outputs via external systems like Kafka, Redis, or Postgres. Built for event-driven digital twins.

data-generation descrete-event-simulation digital-twin industry-4 kafka postgres python redis simpy simulation

Last synced: 21 Apr 2026

https://github.com/kaos599/apollo-synthetic-data-generator

Apollo is a Python GUI application designed to simplify the complex process of generating random data based on fixed values. It allows users to generate various types of binary datasets, such as Yes/No type questions, by specifying probabilities.

data data-engineering data-generation data-generator data-science faker-library machine-learning tkinter-gui

Last synced: 22 Jul 2025

https://github.com/dilumdesilva/daugmentor

DAugmentor: Automating Generative Adversarial Network Tuning for Data Augmentation

data-generation deep-learning deep-neural-networks gan

Last synced: 08 May 2026

https://github.com/ihebbelhadj/synthetic-time-series-hr-data

A Python project that transforms a static HR employee snapshot into a rich, historized dataset of event logs, perfect for powering HR analytics and testing ELT/ETL pipelines.

data-generation hr-analytics pandas python simulation timeseries

Last synced: 30 Apr 2026

https://github.com/cobluestars/dataherd-raika

Dataherd-Raika is a library designed to simulate large-scale user behavior datasets. It takes a single user event (like a click or keyword input) and, by applying simple probability distributions and custom variables, expands it into a vast dataset.

big-data data data-generation data-generator data-science front-end javascript machine-learning npm-package simulator statistics typescript user-behavior user-experience

Last synced: 02 Jan 2026

https://github.com/wilhelmagren/syndgen

SYNthetic Data GENeration made easy for everyone, free and open-sourced.

data-generation deep-learning gan generative-ai machine-learning multi-table relational-datasets synthetic-data

Last synced: 26 Mar 2025

https://github.com/kevindeyne/vardogr

Vardøgr is a CLI that can push production-like data to test environments securely and at scale

cli data-generation data-generator database mariadb mysql postgresql scrambled-data

Last synced: 12 Apr 2026

https://github.com/nicolasbizzozzero/datagenerator

Randomly generate various commonly used data

data data-generation data-generator data-science

Last synced: 18 Oct 2025

https://github.com/rozhakxd/ifakenumber

📱 IFakeNumber: Create fake Indonesian phone numbers for testing & education. Supports bulk generation & CSV export. Powered by Python. 🚀

data-generation data-simulation developer-tools educational-tools fake-number-generator indonesian-phone-number open-source python qa-testing testing-tools

Last synced: 08 Mar 2025

https://github.com/vesko-vujovic/dummy-data-rust

Data generation written in Rust. This generator will generate users, transactions, payment providers and user addresses.

data-generation data-generator rust

Last synced: 29 Apr 2026