{"id":47725696,"url":"https://github.com/nvidia-nemo/safe-synthesizer","last_synced_at":"2026-06-04T20:00:33.721Z","repository":{"id":348040052,"uuid":"1138622116","full_name":"NVIDIA-NeMo/Safe-Synthesizer","owner":"NVIDIA-NeMo","description":":shield: NeMo Safe Synthesizer: Create private, safe versions of sensitive tabular datasets.","archived":false,"fork":false,"pushed_at":"2026-06-01T18:52:58.000Z","size":50855,"stargazers_count":23,"open_issues_count":93,"forks_count":3,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-01T20:24:08.713Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://nvidia-nemo.github.io/Safe-Synthesizer/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVIDIA-NeMo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.md","codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":"DCO","cla":null}},"created_at":"2026-01-20T22:55:42.000Z","updated_at":"2026-05-28T22:13:39.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/NVIDIA-NeMo/Safe-Synthesizer","commit_stats":null,"previous_names":["nvidia-nemo/safe-synthesizer"],"tags_count":17,"template":false,"template_full_name":null,"purl":"pkg:github/NVIDIA-NeMo/Safe-Synthesizer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FSafe-Synthesizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FSafe-Synthesizer
/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FSafe-Synthesizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FSafe-Synthesizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVIDIA-NeMo","download_url":"https://codeload.github.com/NVIDIA-NeMo/Safe-Synthesizer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FSafe-Synthesizer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33917184,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-04T02:00:06.755Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-02T20:22:41.070Z","updated_at":"2026-06-04T20:00:33.715Z","avatar_url":"https://github.com/NVIDIA-NeMo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🛡️ NeMo Safe Synthesizer\n\nNVIDIA NeMo Safe Synthesizer creates private, safe versions of sensitive tabular datasets -- entirely synthetic data with no one-to-one mapping to your original records. 
Purpose-built for privacy compliance and sensitive information protection while preserving data utility for downstream AI tasks.\n\n## Quick Start\n\nRead detailed usage below, or jump to the documentation with [Getting Started](https://nvidia-nemo.github.io/Safe-Synthesizer/user-guide/getting-started/) or the [Safe Synthesizer 101](https://nvidia-nemo.github.io/Safe-Synthesizer/tutorials/safe-synthesizer-101/) notebook.\n\n\n### Prerequisites\n\n- Python 3.11–3.13 (we pin a specific 3.11.x in `.python-version` for local/dev bootstrap; any 3.11, 3.12, or 3.13 interpreter works. Python 3.14+ is NOT supported because ray, a transitive dependency of vLLM, does not yet publish `cp314` wheels)\n- [uv](https://docs.astral.sh/uv/) (recommended) or pip -- Python package manager\n- NVIDIA GPU (A100 or larger) for training and generation\n- Linux only -- macOS, Windows, and Apple Silicon are not supported for training or generation. A CPU-only install is available for development and configuration validation.\n\n### Installation\n\n```bash\n# With uv (recommended):\nuv pip install \"nemo-safe-synthesizer[cu129,engine]\" \\\n  --index https://flashinfer.ai/whl/cu129 \\\n  --index https://download.pytorch.org/whl/cu129 \\\n  --index https://wheels.vllm.ai/88d34c6409e9fb3c7b8ca0c04756f061d2099eb1/cu129 \\\n  --index-strategy unsafe-best-match\n\n# With pip:\npip install \"nemo-safe-synthesizer[cu129,engine]\" \\\n  --extra-index-url https://download.pytorch.org/whl/cu129 \\\n  --extra-index-url https://flashinfer.ai/whl/cu129 \\\n  --extra-index-url https://wheels.vllm.ai/88d34c6409e9fb3c7b8ca0c04756f061d2099eb1/cu129\n```\n\nOr install from source:\n\n```bash\ngit clone https://github.com/NVIDIA-NeMo/Safe-Synthesizer.git\ncd Safe-Synthesizer\nmake setup # installs pinned mise, pinned tools from mise.lock, and .venv\nmise run bootstrap-nss cuda\n```\n\nDevelopment tools (`ruff`, `ty`, `yq`, `gh`, etc.) are managed via [mise](https://mise.jdx.dev/). 
Tool versions are declared in `.mise.toml` and locked in `mise.lock` (committed). mise also manages environment variables -- place project-local secrets or overrides in `.env` or `.env.local` (both git-ignored, auto-loaded by mise).\n\nProject commands run through mise tasks under `.mise/tasks/`: `*.toml` files for declarative tasks, executable scripts for bash-heavy logic.\n\n```bash\nmise tasks                # list public tasks\nmise tasks --hidden       # include helper and legacy alias tasks\nmise tasks deps validate  # inspect the pre-PR validation graph\nmise run validate         # check + lock-check + CI unit tests\n```\n\n### Running\n\nActivate the Python virtual environment and run the CLI using `safe-synthesizer`:\n\n```bash\n\u003e safe-synthesizer --help\nUsage: safe-synthesizer [OPTIONS] COMMAND [ARGS]...\n\n  NeMo Safe Synthesizer command-line interface. This application is used to\n  run the Safe Synthesizer pipeline. It can be used to train a model, generate\n  synthetic data, and evaluate the synthetic data. It can also be used to\n  modify a config file.\n\nOptions:\n  --help  Show this message and exit.\n\nCommands:\n  artifacts  Artifacts management commands.\n  config     Manage Safe Synthesizer configurations.\n  run        Run the Safe Synthesizer end-to-end pipeline.\n```\n\n## Running the Pipeline\n\nThe `run` command executes the Safe Synthesizer pipeline. Without a subcommand, it runs the full end-to-end pipeline:\n\n```bash\n\u003e uv run safe-synthesizer run --help\nUsage: safe-synthesizer run [OPTIONS] COMMAND [ARGS]...\n\n  Run the Safe Synthesizer end-to-end pipeline.\n\n  Without a subcommand, runs the full end-to-end pipeline. 
Use 'run train' or\n  'run generate' for individual stages.\n\nOptions:\n  --config TEXT                   path to a yaml config file\n  --data-source TEXT                      Dataset name, URL, or path to CSV dataset.\n                                  For 'run generate', this is optional if a\n                                  cached dataset exists in the workdir.\n  --artifact-path DIRECTORY       Base directory for all runs. Runs are\n                                  created as \u003cartifact-\n                                  path\u003e/\u003cconfig\u003e---\u003cdataset\u003e/\u003ctimestamp\u003e/. Can\n                                  also be set via NSS_ARTIFACTS_PATH env var.\n                                  [default: ./safe-synthesizer-artifacts]\n  --run-path DIRECTORY            Explicit path for this run's output\n                                  directory. When specified, outputs go\n                                  directly to this path. Overrides --artifact-\n                                  path.\n  --output-file PATH              Path to output CSV file. Overrides the\n                                  default workdir output location.\n  --log-format [json|plain]       Log format for console output. File logging\n                                  will always be JSON. Can also be set via\n                                  NSS_LOG_FORMAT env var. [default: plain]\n  --log-color / --no-log-color    Whether to colorize the log output on the\n                                  console. [default: --log-color]\n  --log-file PATH                 Path to log file. Defaults to a file nested\n                                  under the run directory. Can also be set via\n                                  NSS_LOG_FILE env var.\n  --wandb-mode [online|offline|disabled]\n                                  Wandb mode. 
'online' will upload logs to\n                                  wandb, 'offline' will save logs to a local\n                                  file, 'disabled' will not upload logs to\n                                  wandb. Can also be set via WANDB_MODE env\n                                  var. [default: disabled]\n  --wandb-project TEXT            Wandb project. Can also be set via\n                                  WANDB_PROJECT env var.\n  -v                              Verbose logging. 'v' shows debug info from\n                                  main program, 'vv' shows debug from\n                                  dependencies too\n  --dataset-registry TEXT         URL or path of a dataset registry YAML file.\n                                  If provided, datasets in the registry may be\n                                  referenced by name in --data-source. Can also be set\n                                  via NSS_DATASET_REGISTRY env var. If both\n                                  env var and CLI option are provided, the CLI\n                                  option takes precedence.\n  --emit_telemetry BOOLEAN        Whether to emit anonymous Safe Synthesizer\n                                  telemetry events.\n  --help                          Show this message and exit.\n\nCommands:\n  generate  Run the generation stage only.\n  train     Run the training stage only.\n```\n\n### Subcommands\n\n- `safe-synthesizer run train` - Run only the training stage, saving the adapter to the run directory.\n- `safe-synthesizer run generate` - Run only the generation stage using a saved adapter.\n\n```bash\n\u003e uv run safe-synthesizer run generate --help\nUsage: safe-synthesizer run generate [OPTIONS]\n\n  Run the generation stage only.\n\n  This command loads a trained adapter and generates synthetic data. 
Requires\n  'run train' to have been executed first.\n\n  Use --run-path to specify the exact run directory containing the trained\n  model, or use --auto-discover-adapter with --artifact-path to automatically\n  find the latest trained run.\n\nOptions:\n  --config TEXT                   path to a yaml config file\n  --data-source TEXT                      Dataset name, URL, or path to CSV dataset.\n                                  [required]\n  --artifact-path DIRECTORY       Base directory for all runs. Runs are\n                                  created as \u003cartifact-path\u003e/\u003cconfig\u003e-\n                                  \u003cdataset\u003e/\u003ctimestamp\u003e/. [default: ./safe-\n                                  synthesizer-artifacts]\n  --run-path DIRECTORY            Explicit path for this run's output\n                                  directory. When specified, outputs go\n                                  directly to this path. Overrides --artifact-\n                                  path.\n  --output-file PATH              Path to output CSV file. Overrides the\n                                  default workdir output location.\n  --log-format [json|plain]       Log format for console output. File logging\n                                  will always be JSON.\n  --log-color / --no-log-color    Whether to colorize the log output on the\n                                  console\n  --log-file PATH                 Path to log file. Defaults to a file nested\n                                  under the run directory.\n  -v                              Verbose logging. 'v' shows debug info from\n                                  main program, 'vv' shows debug from\n                                  dependencies too\n  --wandb-mode [online|offline|disabled]\n                                  Wandb mode. 
'online' will upload logs to\n                                  wandb, 'offline' will save logs to a local\n                                  file, 'disabled' will not upload logs to\n                                  wandb.\n  --wandb-project TEXT            Wandb project. If not specified, the project\n                                  will be taken from the environment variable\n                                  WANDB_PROJECT.\n  --auto-discover-adapter         Automatically find the latest trained\n                                  adapter in --artifact-path. Without this\n                                  flag, --run-path must point to a specific\n                                  trained run.\n  --help                          Show this message and exit.\n```\n\n## Managing Configurations\n\nThe `config` command provides tools to validate and modify configuration files:\n\n```bash\n\u003e uv run safe-synthesizer config --help\nUsage: safe-synthesizer config [OPTIONS] COMMAND [ARGS]...\n\n  Manage Safe Synthesizer configurations.\n\nOptions:\n  --help  Show this message and exit.\n\nCommands:\n  modify    Modify a Safe Synthesizer configuration.\n  validate  Validate a Safe Synthesizer configuration.\n```\n\n## Attention Configuration\n\nSafe Synthesizer exposes attention implementation settings for both training and generation.\n\n### Training (`attn_implementation`)\n\nControls the HuggingFace attention backend used during model loading for training. 
Set via config YAML, CLI, or SDK:\n\n```yaml\n# config.yaml\ntraining:\n  attn_implementation: \"sdpa\"\n```\n\n```bash\n# CLI override\nsafe-synthesizer run --training__attn_implementation sdpa --data-source my_data.csv\n```\n\n| Value | Description | Requires |\n|-------|-------------|----------|\n| `sdpa` | PyTorch scaled dot product attention (default) | None (built-in) |\n| `eager` | Standard PyTorch attention | None (built-in) |\n| `kernels-community/flash-attn2` | Flash Attention 2 via HuggingFace Kernels Hub | `kernels` pip package |\n| `kernels-community/vllm-flash-attn3` | Flash Attention 3 via HuggingFace Kernels Hub | `kernels` pip package and compatible prebuilt kernel |\n| `flash_attention_2` | Flash Attention 2 (traditional) | `flash-attn` pip package |\n| `flash_attention_3` | Flash Attention 3 (traditional) | `flash-attn-3` support |\n\nIf a `kernels-community/...` value is configured but the `kernels` package is not installed, the backend automatically falls back to `sdpa`.\n\n### Generation (`attention_backend`)\n\nControls the vLLM attention backend used during synthetic data generation. Defaults to `\"auto\"`, which lets vLLM auto-select the best available backend.\n\n```yaml\n# config.yaml\ngeneration:\n  attention_backend: \"FLASH_ATTN\"\n```\n\nCommon values: `FLASHINFER`, `FLASH_ATTN`, `TORCH_SDPA`, `TRITON_ATTN`, `FLEX_ATTENTION`.\n\n## NIM Integration\n\nColumn classification uses a NIM/OpenAI-compatible endpoint to detect entity types\nin your data. 
`NSS_INFERENCE_ENDPOINT` defaults to `https://integrate.api.nvidia.com/v1`;\noverride it to use a different endpoint.\n\nWhen using the CLI or Python SDK, set `NSS_INFERENCE_KEY` (and `NSS_INFERENCE_ENDPOINT` only if not\nusing the default) so column classification can run.\n\n### Local Endpoint\n\nTo point to a locally hosted LLM, add the variables to `.env.local` (git-ignored, auto-loaded by mise):\n\n```bash\n# .env.local\nNSS_INFERENCE_ENDPOINT=https://your-local-nim-endpoint\nNSS_INFERENCE_KEY=your-api-key  # pragma: allowlist secret\n```\n\nOr export them in your shell:\n\n```bash\nexport NSS_INFERENCE_ENDPOINT=\"https://your-local-nim-endpoint\"\nexport NSS_INFERENCE_KEY=\"your-api-key\"  # pragma: allowlist secret\n```\n\n### Disable Classification\n\nTo disable classification entirely:\n\n```yaml\nreplace_pii:\n  globals:\n    classify:\n      enable_classify: false\n```\n\nWhen classification is disabled, NSS falls back to default entity types.\n\n## Artifacts and Workdirs\n\nSafe Synthesizer uses a structured directory format to manage artifacts (trained models, synthetic data, logs).\n\n### Directory Layout\n\nBy default, runs are nested under `--artifact-path` using the project name (`\u003cconfig\u003e---\u003cdataset\u003e`) and a unique run name.\n\n```text\n\u003cartifact-path\u003e/\u003cconfig\u003e---\u003cdataset\u003e/\u003crun_name\u003e/\n├── train/\n│   ├── safe-synthesizer-config.json\n│   └── adapter/                     # trained PEFT adapter\n│       ├── adapter_config.json\n│       ├── adapter_model.safetensors\n│       ├── metadata_v2.json\n│       └── dataset_schema.json\n├── generate/\n│   ├── logs.jsonl                   # generate-only workflow\n│   ├── info.json                    # generate-only workflow\n│   ├── synthetic_data.csv\n│   ├── evaluation_report.html\n│   └── evaluation_metrics.json      # machine-readable metrics\n├── dataset/\n│   ├── training.csv\n│   ├── test.csv\n│   ├── validation.csv               # when 
training.validation_ratio \u003e 0\n│   └── transformed_training.csv     # when PII replacement transforms the data\n└── logs/\n    └── \u003cphase\u003e.jsonl                # e.g. end_to_end.jsonl or train.jsonl\n```\n\n### Run Names\n\nIf not provided with `--run-path`, run names are automatically generated using the current `\u003ctimestamp\u003e`.\n\n### Overriding Paths\n\n- Use `--run-path` to specify an explicit directory for the run, bypassing the `\u003cproject\u003e/\u003ctimestamp\u003e` nesting.\n- Use `--output-file` to specify an explicit path for the final synthetic CSV, overriding the default location in the `generate/` directory.\n\n## WandB Logging\n\nSafe Synthesizer supports Weights \u0026 Biases (WandB) for experiment tracking.\n\n### Configuration\n\nYou can enable WandB logging using CLI options or environment variables:\n\n- `--wandb-mode [online|offline|disabled]`: Set the WandB mode. Default is `disabled`.\n- `--wandb-project \u003cname\u003e`: Specify the WandB project name.\n- `WANDB_API_KEY`: Ensure your API key is set in your environment.\n\n### Logged Data\n\nThe following information is logged to WandB:\n\n- Configuration parameters\n- Training metrics (if supported by the backend)\n- Generation statistics\n- Evaluation results\n- Timing information\n\n## Dataset Registry\n\nSafe Synthesizer supports a *dataset registry* to simplify working with a standard set of datasets.\nDatasets in the registry may be referenced by name, rather than repeatedly specifying long URLs or file paths on the command line.\nAdditionally, the registry supports custom config overrides or args that are specific to individual datasets.\n\n### Providing a Dataset Registry\n\nYou can supply a dataset registry (YAML file) via either the CLI or an environment variable:\n\n- CLI Option:\n`--dataset-registry \u003cpath_or_url\u003e`\n- Environment Variable:\nSet `NSS_DATASET_REGISTRY` to point to your YAML file (path or URL).\n\nIf both are provided, the CLI 
option takes precedence.\n\n### Referencing Datasets\n\nWhen a dataset registry is provided, you can use dataset names defined in the registry with the `--data-source` argument.\nFor example:\n\n```bash\nnemo-safe-synthesizer run --dataset-registry my_registry.yaml --data-source my_dataset\n```\n\nThis loads the dataset from its URL and applies any overrides for `my_dataset` from the registry YAML.\n\n### Dataset Registry YAML Format\n\nThe registry file should conform to the pydantic model defined by `DatasetRegistry` in `cli/datasets.py`. For example:\n\n```yaml\n# registry.yaml\nbase_url: /root/data/location\ndatasets:\n- name: dataset1\n  url: dataset1.csv\n- name: dataset2\n  url: dataset2.jsonl\n  overrides:\n    data:\n      group_training_examples_by: id\n- name: dataset3\n  url: /absolute/path/to/dataset.csv\n- name: dataset4\n  url: https://myhost.com/path/to/dataset.json\n  load_args:\n    keyword: custom_arg_for_data_reader\n```\n\n- Minimal requirements for each entry in the `datasets:` list are a `name` and a `url`.\n`url` may be a URL or a file path, anything that data readers like `pd.read_csv` will accept.\n- `base_url` - Any relative URLs or paths will be prepended with the `base_url` before attempting to load the dataset.\nThis only applies to the named datasets in the registry that have a relative URL.\nPassing a relative `--data-source` on the CLI will attempt to load the file relative to your current working directory, regardless of whether a registry is provided or whether `base_url` is set.\n`base_url` is optional; if it is not provided, it is recommended to use absolute URLs or file paths for all entries.\n- `overrides` - Dataset-specific config overrides, such as a dataset that should always be run with `group_training_examples_by`.\nConfig values passed as CLI arguments always take precedence, then any overrides from the registry, and finally values from the `--config` YAML file.\n- `load_args` - Extra arguments needed by the data reader 
for a specific dataset.\nFor example, changing the separator used by `pd.read_csv` for a `.csv` file with a different delimiter.\n\n## Telemetry \u0026 Privacy\n\nNeMo Safe Synthesizer includes an optional function to share anonymous telemetry data with NVIDIA for product improvement. Data collected is limited to run-level operational metrics (such as final run status, processing time, record and token counts, configuration parameters, top-level quality and privacy scores, base model used, deployment type, and GPU type). No user or device information is collected. This data is used to prioritize product improvements and will be shared in aggregate with the community. It is not used to track any individual user behavior.\n\nYou may opt out of telemetry collection at any time. Opting out applies only to data collection by the NeMo Safe Synthesizer library itself. To disable telemetry in a YAML config, set:\n\n```yaml\nemit_telemetry: false\n```\n\nTo disable telemetry for one CLI invocation, pass `--emit_telemetry false`:\n\n```bash\nsafe-synthesizer run --emit_telemetry false --data-source my_data.csv\n```\n\nTo disable telemetry for the current shell, set `NEMO_TELEMETRY_ENABLED=false` (other accepted disabling values: `0`, `no`) in your environment before running:\n\n```bash\nexport NEMO_TELEMETRY_ENABLED=false\n```\n\nUse of third-party endpoints, including NVIDIA Build: NeMo Safe Synthesizer can be configured to use various inference endpoints, including build.nvidia.com (NVIDIA Build). If you choose to use NVIDIA Build or any other third-party endpoint, that endpoint's own terms of service and privacy practices apply independently of this library. Any opt-out you exercise within NeMo Safe Synthesizer does not extend to data collection by your chosen endpoint. NVIDIA Build is intended for evaluation and testing purposes only and may not be used in production environments. 
Do not submit any confidential information or personal data when using NVIDIA Build.\n\n## License\n\nNeMo Safe Synthesizer is licensed under the [Apache License 2.0](https://github.com/NVIDIA-NeMo/Safe-Synthesizer/blob/main/LICENSE).\n\n## Contact\n\n- [Need help? Ask us a question](https://github.com/NVIDIA-NeMo/Safe-Synthesizer/discussions)\n- [Report a bug](https://github.com/NVIDIA-NeMo/Safe-Synthesizer/issues/new?template=bug-report.yml)\n- [Make a feature request](https://github.com/NVIDIA-NeMo/Safe-Synthesizer/issues/new?template=feature-request.yml)\n- [Report a security vulnerability](https://github.com/NVIDIA-NeMo/Safe-Synthesizer/security/policy)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvidia-nemo%2Fsafe-synthesizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnvidia-nemo%2Fsafe-synthesizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvidia-nemo%2Fsafe-synthesizer/lists"}