{"id":20500048,"url":"https://github.com/xeger/pipeclean","last_synced_at":"2026-04-20T01:03:40.370Z","repository":{"id":150214661,"uuid":"617292434","full_name":"xeger/pipeclean","owner":"xeger","description":"Parallel, streaming data sanitizer. Fast multi-core execution with no file size limits.","archived":false,"fork":false,"pushed_at":"2024-03-30T15:39:19.000Z","size":249,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-03-31T15:45:03.038Z","etag":null,"topics":["masking","pii","sanitization"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xeger.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-22T04:38:51.000Z","updated_at":"2025-09-11T21:24:56.000Z","dependencies_parsed_at":null,"dependency_job_id":"aaa8f137-db8e-440f-be7c-4b08164c4b32","html_url":"https://github.com/xeger/pipeclean","commit_stats":null,"previous_names":["xeger/sqlstream"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/xeger/pipeclean","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xeger%2Fpipeclean","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xeger%2Fpipeclean/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xeger%2Fpipeclean/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xeger%2Fpipeclean/manifests","owner_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/owners/xeger","download_url":"https://codeload.github.com/xeger/pipeclean/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xeger%2Fpipeclean/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32028547,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T00:18:06.643Z","status":"ssl_error","status_checked_at":"2026-04-20T00:17:31.068Z","response_time":55,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["masking","pii","sanitization"],"created_at":"2024-11-15T18:19:36.025Z","updated_at":"2026-04-20T01:03:40.348Z","avatar_url":"https://github.com/xeger.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# What Is This?\n\nPipeclean is a tool to efficiently remove sensitive information from large datasets by streaming them from stdin to stdout. By default, it expects its input to consist of a MySQL dump produced with `mysqldump`:\n\n```bash\ncat data.sql | pipeclean scrub \u003e sanitized.sql\n```\n\nPipeclean utilizes streaming parsers, achieving constant memory usage even for multi-GiB input files. Maximum filesize is bounded only by available disk space. It also employs parallelism up to the number of available CPU cores. 
Sanitizing a 500 MiB MySQL dump takes ~30 seconds with peak memory usage of ~4 GiB on a 2020-era MacBook Pro M1 with eight cores.\n\nPipeclean attempts to sanitize encapsulated data, too; if a MySQL column contains JSON or YAML, it will parse and traverse the encapsulated document and write sanitized JSON/YAML as the output column value.\n\nPipeclean can employ Markovian language models to generate plausible-looking replacement data; you can train it with actual people's names from your database, for example, and at runtime, use the trained model to generate replacement names that have a similar look and feel. Markovian models can also be used for heuristic scrubbing, allowing pipeclean to recognize data fields that contain a first name (etc.) regardless of how the field is named or where in the input it appears, and replace each occurrence with a fake name.\n\nFinally, pipeclean includes a `verify` command to report on the efficacy and safety of your scrubbing configuration.\n\n## Install\n\nIf you have the Go [SDK](https://go.dev/doc/install) installed, you can run `go install github.com/xeger/pipeclean@main`. Otherwise, visit the [release page](https://github.com/xeger/pipeclean/releases/latest) to download binaries.\n\n# Usage Overview\n\nThis is a summary; for more detailed information, see [the reference guide](REFERENCE.md).\n\nPipeclean is invoked with a subcommand that tells it what to do:\n\n```bash\n# train models based on input data\ncat data.sql | pipeclean learn path/to/models\n\n# sanitize sensitive data; print clean stream to stdout\ncat data.sql | pipeclean scrub path/to/models\n```\n\n## Configuration\n\nPipeclean works best when you specify a configuration file to influence its behavior. 
If none is provided, the [default configuration](scrubbing/policy.go#L26) masks emails, phone numbers and zip codes by scrambling individual characters.\n\nTo customize its behavior, author a `pipeclean.json` to define some models and a scrubbing policy of your own:\n\n```json\n{\n  \"learning\": {\n    \"givenName\": {\n      \"markov\": {\n        \"order\": 2\n      }\n    },\n    \"sn\": {\n      \"markov\": {\n        \"order\": 2\n      }\n    }\n  },\n  \"scrubbing\": {\n    \"fieldname\": [\n      {\n        \"in\": \"email\",\n        \"out\": \"mask\"\n      },\n      {\n        \"in\": \"first_name\",\n        \"out\": \"generate(givenName)\"\n      },\n      {\n        \"in\": \"last_name\",\n        \"out\": \"generate(sn)\"\n      }\n    ]\n  }\n}\n```\n\n## Training\n\nAfter defining some models in the configuration, you can train the models using genuine input data. Make sure to create a directory to hold the models. Pipeclean takes the location of its config file and models as CLI parameters:\n\n```bash\ncat data.sql | pipeclean learn --config pipeclean.json ./data/models\n```\n\nTo use the trained models, re-run this command but replace the subcommand with `scrub` to generate sanitized output:\n\n```bash\ncat data.sql | pipeclean scrub -c pipeclean.json ./data/models\n```\n\nPipeclean will use your scrubbing policy to identify input fields and train the corresponding model with the real value. At the end, it will write the trained models to JSON files under `./data/models`, and those models can be used with future invocations of the command.\n\n## Providing Context\n\nThe `learn` and `scrub` commands both accept a `-x` / `--context` flag, which is a list of extra files that pipeclean should parse to learn about the structure of data. The contents of these files do not appear in pipeclean's output nor contribute to the training of models.\n\n**Context is important**. 
For example, in MySQL dumps, the `INSERT` statements use a shorthand form that does not specify column names. Without context, your pipeclean rules need to refer to columns by their insertion index:\n\n```json\n{ \"in\": \"users.3\", \"out\": \"mask\" }\n```\n\nIf you produce your MySQL dump as two files, a `schema.sql` produced with `mysqldump --no-data` and a `data.sql` produced with `mysqldump --no-create-info`, you can tell pipeclean about your column names with an `-x schema.sql` flag. This allows your configuration to specify column names instead of indices:\n\n```json\n{ \"in\": \"users.email\", \"out\": \"mask\" }\n```\n\nIf you provide the schema definition as context, pipeclean's default configuration is quite useful, handling common field names such as `email`, `phone`, or `zip`, and you can omit a configuration file for basic sanitization. Without context, the default configuration is useless and pipeclean won't be able to sanitize anything.\n\nIf your MySQL dump is a single file that contains both the schema and data, you can still use it to provide context:\n\n```bash\ncat dump.sql | pipeclean scrub -x dump.sql\n```\n\nHowever, **context does not use a streaming parser**, so pipeclean's memory usage may be extraordinarily high if your single-file database dump is large. It is _much_ better to separate the schema from the data.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxeger%2Fpipeclean","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxeger%2Fpipeclean","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxeger%2Fpipeclean/lists"}