{"id":44611723,"url":"https://github.com/msg555/subsetter","last_synced_at":"2026-02-14T12:01:32.130Z","repository":{"id":186807101,"uuid":"675567074","full_name":"msg555/subsetter","owner":"msg555","description":"CLI tool to subset mysql/postgres/sqlite databases","archived":false,"fork":false,"pushed_at":"2025-04-11T01:43:27.000Z","size":247,"stargazers_count":9,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-09-29T18:35:01.875Z","etag":null,"topics":["automation","cli","databases","mysql","postgresql","sampling","sqlite","subsetting","testing-tools"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/msg555.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-08-07T08:12:14.000Z","updated_at":"2025-09-22T15:15:27.000Z","dependencies_parsed_at":"2023-08-07T20:54:35.326Z","dependency_job_id":"fc7c7c11-7b85-47e0-8308-05310ff3cd8e","html_url":"https://github.com/msg555/subsetter","commit_stats":null,"previous_names":["msg555/subsetter"],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/msg555/subsetter","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msg555%2Fsubsetter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msg555%2Fsubsetter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msg555%2Fsubsetter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msg555%2Fsubsetter/manifests","owner_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/msg555","download_url":"https://codeload.github.com/msg555/subsetter/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msg555%2Fsubsetter/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29443468,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-14T10:51:12.367Z","status":"ssl_error","status_checked_at":"2026-02-14T10:50:52.088Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","cli","databases","mysql","postgresql","sampling","sqlite","subsetting","testing-tools"],"created_at":"2026-02-14T12:00:59.891Z","updated_at":"2026-02-14T12:01:32.125Z","avatar_url":"https://github.com/msg555.png","language":"Python","readme":"# Subsetter\n\nSubsetter is a Python utility that can be used for subsetting portions of\nrelational databases. _Subsetting_ is the action of extracting a smaller set of rows\nfrom your database that still maintain expected foreign-key relationships\nbetween your data. 
This can be useful for testing against a small but\nrealistic dataset or for generating sample data for use in demonstrations.\nThis tool also supports filtering that allows you to remove/anonymize rows that\nmay contain sensitive data.\n\nSimilar tools include Tonic.ai's platform and [condenser](https://github.com/TonicAI/condenser).\nThis is meant to be a simple CLI tool that overcomes many of the difficulties in\nusing `condenser`.\n\n## Installation\n\nYou can use subsetter by installing it through pip:\n\n```sh\npip install subsetter\n```\n\nOr by using the published `msg555/subsetter` image:\n\n```sh\ndocker run --rm -v \"./subsetter.yaml:/tmp/subsetter.yaml\" msg555/subsetter -c /tmp/subsetter.yaml subset\n```\n\n## Limitations\n\nThe subsetter tool takes an approach of \"one table, one query\". This means that\nthe subsetter will sample each table using only a single query. It cannot\nsupport calculating a full transitive closure of foreign key relationships for\nschemas that contain cycles. In general, as long as your schema contains no\nforeign key cycles and no target is reachable from another target, the subsetter\nwill be able to automatically generate a plan that can sample your data.\n\n# Usage\n\n## Create a sampling plan\n\nThe first step in subsetting a database is to generate a sampling plan. A\nsampling plan defines both the direct targets of the subsetter and what tables\nshould be brought in through indirect foreign key references.  You'll want to\ncreate a configuration file similar to\n[subsetter.example.yaml](subsetter.example.yaml), making sure to fill out the\n`planner` section to tell the planner what tables you want to sample and any\nadditional constraints that should be considered. 
Then you can create a plan\nwith the below command:\n\n```sh\nsubsetter -c my-config.yaml plan \u003e plan.yaml\n```\n\nIf you inspect the generated plan YAML document you will see a syntax tree\nthat defines how each table will be sampled, potentially referencing other\ntables. Queries can reference either source tables or previously sampled tables.\nIf you need to customize the way that tables are sampled beyond what the planner\ncan automatically produce, this is the place to do it. If needed, you can even\nwrite direct SQL here.\n\n## Sample a database with a plan\n\nThe sample sub-command will sample rows from the source database into your\ntarget output (either a database or JSON files) using a plan generated\nby the `plan` sub-command. By default this tool will **not** copy schema\nfrom the source database and expects tables to already exist. If you would like\nit to attempt to create tables in the output database pass the `--create` flag.\nAdditionally you must pass `--truncate` if you wish to clear any existing data\nin the output tables that may otherwise interfere with the sampling process.\n\n```sh\nsubsetter --config my-config.yaml sample --plan my-plan.yaml --create --truncate\n```\n\nThe sampling process proceeds in four phases:\n\n1. If `--create` is specified it will attempt to create any missing tables. Existing tables will not be touched even if the schema does not match what is expected.\n2. If `--truncate` is specified any tables about to be sampled will first be truncated. subsetter expects there to be no existing data in the destination database unless configured to run in _merge_ mode.\n3. Any sampled tables that are referenced by other tables will first be materialized into temporary tables on the source database.\n4. Data is copied for each table from the source to destination.\n\n## Plan and sample in one action\n\nThere's also a `subset` subcommand to perform the `plan` and `sample` actions\ntogether. 
This will automatically feed the generated plan into the sampler,\nin addition to ensuring the same source database configuration is used for\neach.\n\n```sh\nsubsetter -c my-config.yaml subset --create --truncate\n```\n\n# Sample Transformations\n\nBy default any sampled row is copied directly from the source database to the\ndestination database. However, there are several transformation steps that can\nbe configured at the sampling stage that can change this behavior.\n\n## Filtering\n\nFilters allow you to transform the columns in each sampled row using either a\nset of built-in filters or custom plugins. Built-in filters allow you to\neasily replace common sources of personally identifiable information with fake\ndata using the [faker](https://faker.readthedocs.io/en/master/) library. Filters\nfor names, emails, phone numbers, addresses, locations, and more come built in.\nSee [subsetter.example.yaml](subsetter.example.yaml) for full details on what\nfilters exist and how to create a custom filter plugin.\n\n## Identifier Compaction\n\nOften tables make use of auto-incrementing integer identifiers to function as\ntheir primary key. Sometimes we may want the identifiers in our sampled data\nto be compact -- instead of retaining the value in the source database we may\nwant our N sampled rows to have identifiers ranging from 1 to N. 
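The remapping described above can be pictured with a small illustration (a hypothetical Python sketch, not the tool's actual implementation; the table and column names are made up):

```python
# Hypothetical sketch of identifier compaction: remap a sampled table's
# primary keys onto the compact range 1..N and rewrite any foreign keys
# that reference them. Table/column names here are illustrative only.

def compact(rows, fk_rows, pk='id', fk='user_id'):
    # Assign compact identifiers 1..N in order of the original keys.
    id_map = {row[pk]: new_id
              for new_id, row in enumerate(sorted(rows, key=lambda r: r[pk]), start=1)}
    new_rows = [{**row, pk: id_map[row[pk]]} for row in rows]
    # Tables with a foreign key into the compacted column get the same remap.
    new_fk_rows = [{**row, fk: id_map[row[fk]]} for row in fk_rows]
    return new_rows, new_fk_rows

users = [{'id': 17}, {'id': 903}, {'id': 42}]
orders = [{'id': 5, 'user_id': 903}, {'id': 6, 'user_id': 17}]
new_users, new_orders = compact(users, orders)
```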
This is useful\nfor sample data where we want to keep the identifiers easy to reference.\n\nAny other table that has a foreign key that references one of these compacted\ncolumns will automatically also have the column involved in that foreign key\nadjusted to maintain semantic consistency.\n\nNote that enabling compaction can have a noticeable impact on performance.\nCompaction both requires more tables to be materialized on the source database\nand requires more joins when streaming data into the destination database.\n\n## Merging\n\nBy default the sampler expects no data to exist in the destination database.\nTo get around this constraint we can turn on \"merge\" mode. To use merge mode all\nsampled tables must be either marked as \"passthrough\" or have a single-column,\nnon-negative, integral primary key.\n\nWhen enabled, the sampler will calculate the largest existing primary key\nidentifier for each non-passthrough table and automatically shift the primary\nkey of each sampled row to be larger using the equation:\n\n```\nnew_id = source_id + max(0, existing_ids...) + 1\n```\n\nPassthrough tables instead will be sampled as normal except they will use the\n'skip' conflict strategy which will have the effect of only inserting rows in\na passthrough table if no row with the matching primary key exists in the\ndestination database.\n\nIf merging multiple times it may be necessary to turn on identifier compaction\nto keep the largest identifier in each table from growing too quickly due to\nlarge gaps.\n\n## Multiplicity\n\nSampling usually means condensing a large dataset into a semantically consistent\nsmall dataset. However, there are times when what you really want to do is\ncreate a semantically consistent large dataset from your existing data (e.g. for\nperformance testing). The sampler supports this via the multiplicity factor.\n\nMultiplicity works by creating multiple copies of your sampled dataset in your\noutput database. 
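One way to picture how several copies can coexist (purely illustrative; the tool's actual key-remapping scheme may differ) is to shift each copy's integer identifiers by a stride larger than any source identifier:

```python
# Illustrative only: duplicate a sampled table k times, shifting integer
# identifiers per copy so the copies cannot collide. This is an assumed
# scheme, not necessarily how subsetter remaps keys.

def multiply(rows, copies, pk='id'):
    stride = max(row[pk] for row in rows) + 1
    out = []
    for copy in range(copies):
        out.extend({**row, pk: row[pk] + copy * stride} for row in rows)
    return out

orders = [{'id': 1}, {'id': 2}, {'id': 3}]
tripled = multiply(orders, copies=3)
```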
To ensure these datasets do not collide it remaps all foreign\nkeys into a new key-space. Note that this process assumes your foreign keys are\nopaque integer identifiers.\n\n# FAQ\n\n## How do multiple targets work?\n\nWhen using multiple targets each target table will be sampled entirely\nindependently unless another target table directly or indirectly depends on some\nrows from it through a series of foreign keys. In the latter case the subsetter\nwill sample a union of the rows from the independent sampling of the table and\nthose rows that other targets depend on.\n\n## How does the subsetter use foreign keys?\n\nThe subsetter uses the foreign keys present in the database schema to understand\nrelationships between data and generate a sampling plan. Foreign key\nrelationships can be followed in both directions if need be. For example,\nsuppose there were a `users` table and an `orders` table where `orders` had a\nforeign key to the `users` table.\n\nIf `users` was sampled first the subsetter would sample `orders` from `users` by\nsampling all rows from `orders` such that their corresponding user row existed.\nThis represents the _maximal_ set of rows that can be included without violating\nforeign key constraints.\n\nOtherwise if `orders` was sampled first the subsetter would sample `users` from\n`orders` by sampling all rows from `users` such that they had at least one\n`order`. This represents the _minimal_ set of rows that can be included without\nviolating foreign key constraints.\n\nIn general the subsetter will always sample tables in an order such that all\nforeign key relationships to previously sampled tables are going in the same\ndirection. If they are followed in the forwards direction (as in our first case)\nthe subsetter will select the _intersection_ of all rows that obey each foreign\nkey relationship. 
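As an aside, both directions from the `users`/`orders` example above can be sketched with plain Python sets (illustrative only, not the SQL the tool actually emits; the sampled id sets are arbitrary):

```python
# Illustrative sketch of following a foreign key in each direction,
# using the users/orders example from above.

users = [{'id': 1}, {'id': 2}, {'id': 3}]
orders = [{'id': 10, 'user_id': 1}, {'id': 11, 'user_id': 1}, {'id': 12, 'user_id': 3}]

# Forwards: users sampled first (say ids 1 and 2); keep every order whose
# user row exists -- the maximal set that satisfies the foreign key.
sampled_users = [u for u in users if u['id'] in (1, 2)]
user_ids = {u['id'] for u in sampled_users}
forward_orders = [o for o in orders if o['user_id'] in user_ids]

# Backwards: orders sampled first (say ids 10 and 12); keep every user with
# at least one sampled order -- the minimal set that satisfies the key.
sampled_orders = [o for o in orders if o['id'] in (10, 12)]
referenced = {o['user_id'] for o in sampled_orders}
backward_users = [u for u in users if u['id'] in referenced]
```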
Otherwise if they are followed in the backwards direction (as\nin our second case) the subsetter will select the _union_ of all rows that obey\neach foreign key relationship. This strategy ensures no foreign key\nrelationships are violated in the sampled data.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsg555%2Fsubsetter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmsg555%2Fsubsetter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsg555%2Fsubsetter/lists"}