{"id":37079997,"url":"https://github.com/mostly-ai/mostlyai-qa","last_synced_at":"2026-04-27T11:01:10.598Z","repository":{"id":262848622,"uuid":"888516274","full_name":"mostly-ai/mostlyai-qa","owner":"mostly-ai","description":"Synthetic Data Quality Assurance 🔎","archived":false,"fork":false,"pushed_at":"2026-04-23T16:30:35.000Z","size":135399,"stargazers_count":66,"open_issues_count":1,"forks_count":13,"subscribers_count":4,"default_branch":"main","last_synced_at":"2026-04-23T16:32:07.695Z","etag":null,"topics":["synthetic-data","synthetic-data-quality"],"latest_commit_sha":null,"homepage":"https://mostly-ai.github.io/mostlyai-qa/","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mostly-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-11-14T14:33:41.000Z","updated_at":"2026-04-23T15:16:33.000Z","dependencies_parsed_at":"2025-01-21T16:32:09.130Z","dependency_job_id":"988b3129-6ebf-4205-bae0-03b59f73ca2c","html_url":"https://github.com/mostly-ai/mostlyai-qa","commit_stats":null,"previous_names":["mostly-ai/mostlyai-qa"],"tags_count":55,"template":false,"template_full_name":null,"purl":"pkg:github/mostly-ai/mostlyai-qa","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mostly-ai%2Fmostlyai-qa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mostly-ai%2Fmostlyai-qa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mostly-ai%2Fmostlyai-qa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mostly-ai%2Fmostlyai-qa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mostly-ai","download_url":"https://codeload.github.com/mostly-ai/mostlyai-qa/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mostly-ai%2Fmostlyai-qa/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32333199,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-26T23:26:28.701Z","status":"online","status_checked_at":"2026-04-27T02:00:06.769Z","response_time":128,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["synthetic-data","synthetic-data-quality"],"created_at":"2026-01-14T09:40:45.923Z","updated_at":"2026-04-27T11:01:10.572Z","avatar_url":"https://github.com/mostly-ai.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Synthetic Data Quality Assurance 🔎\n\n[![Documentation](https://img.shields.io/badge/docs-latest-green)](https://mostly-ai.github.io/mostlyai-qa/) [![stats](https://pepy.tech/badge/mostlyai-qa)](https://pypi.org/project/mostlyai-qa/) ![license](https://img.shields.io/github/license/mostly-ai/mostlyai-qa) ![GitHub Release](https://img.shields.io/github/v/release/mostly-ai/mostlyai-qa) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mostlyai-qa)\n\n[Documentation](https://mostly-ai.github.io/mostlyai-qa/) | [Sample Reports](#sample-reports) | [Technical White Paper](https://arxiv.org/abs/2504.01908)\n\nAssess the fidelity and novelty of synthetic samples with respect to original samples:\n\n1. calculate a rich set of accuracy, similarity and distance [metrics](https://mostly-ai.github.io/mostlyai-qa/api/#mostlyai.qa.metrics.ModelMetrics)\n2. visualize statistics for easy comparison to training and holdout samples\n3. generate a standalone, easy-to-share, easy-to-read HTML summary report\n\n...all with a few lines of Python code 💥.\n\nhttps://github.com/user-attachments/assets/b27e270a-f19c-4059-b4f2-ed209c9a26b9\n\n## Installation\n\nThe latest release of `mostlyai-qa` can be installed via pip:\n\n```bash\npip install -U mostlyai-qa\n```\n\nOn Linux, one can explicitly install the CPU-only variant of torch together with `mostlyai-qa`:\n\n```bash\npip install -U torch==2.8.0+cpu torchvision==0.23.0+cpu mostlyai-qa --extra-index-url https://download.pytorch.org/whl/cpu\n```\n\n## Quick Start\n\n```python\nimport pandas as pd\nimport webbrowser\nfrom mostlyai import qa\n\n# initialize logging to stdout\nqa.init_logging()\n\n# fetch original + synthetic data\nbase_url = \"https://github.com/mostly-ai/mostlyai-qa/raw/refs/heads/main/examples/quick-start\"\nsyn = pd.read_csv(f\"{base_url}/census2k-syn_mostly.csv.gz\")\n# syn = pd.read_csv(f'{base_url}/census2k-syn_flip30.csv.gz') # a 30% perturbation of trn\ntrn = pd.read_csv(f\"{base_url}/census2k-trn.csv.gz\")\nhol = pd.read_csv(f\"{base_url}/census2k-hol.csv.gz\")\n\n# calculate metrics\nreport_path, metrics = qa.report(\n    syn_tgt_data=syn,\n    trn_tgt_data=trn,\n    hol_tgt_data=hol,\n)\n\n# pretty print metrics\nprint(metrics.model_dump_json(indent=4))\n\n# open up HTML report in new browser window\nwebbrowser.open(f\"file://{report_path.absolute()}\")\n```\n\n## Basic Usage\n\n```python\nfrom mostlyai import qa\n\n# initialize logging to stdout\nqa.init_logging()\n\n# analyze single-table data\nreport_path, metrics = qa.report(\n    syn_tgt_data = synthetic_df,\n    trn_tgt_data = training_df,\n    hol_tgt_data = holdout_df,  # optional\n)\n\n# analyze sequential data\nreport_path, metrics = qa.report(\n    syn_tgt_data = synthetic_df,\n    trn_tgt_data = training_df,\n    hol_tgt_data = holdout_df,  # optional\n    tgt_context_key = \"user_id\",\n)\n\n# analyze sequential data with context\nreport_path, metrics = qa.report(\n    syn_tgt_data = synthetic_df,\n    trn_tgt_data = training_df,\n    hol_tgt_data = holdout_df,  # optional\n    syn_ctx_data = synthetic_context_df,\n    trn_ctx_data = training_context_df,\n    hol_ctx_data = holdout_context_df,  # optional\n    ctx_primary_key = \"id\",\n    tgt_context_key = \"user_id\",\n)\n```\n\n## Sample Reports\n\n* [Baseball Players](https://html-preview.github.io/?url=https://github.com/mostly-ai/mostlyai-qa/blob/main/examples/baseball-players.html) (Flat Data)\n* [Baseball Seasons](https://html-preview.github.io/?url=https://github.com/mostly-ai/mostlyai-qa/blob/main/examples/baseball-seasons-with-context.html) (Sequential Data)\n\n## Citation\n\nPlease consider citing our project if you find it useful:\n\n```bibtex\n@misc{mostlyai-qa,\n      title={Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework},\n      author={Andrey Sidorenko and Michael Platzer and Mario Scriminaci and Paul Tiwald},\n      year={2025},\n      eprint={2504.01908},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https://arxiv.org/abs/2504.01908},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmostly-ai%2Fmostlyai-qa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmostly-ai%2Fmostlyai-qa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmostly-ai%2Fmostlyai-qa/lists"}