{"id":34110166,"url":"https://github.com/munichpavel/fake-data-for-learning","last_synced_at":"2026-04-08T12:05:32.951Z","repository":{"id":36814275,"uuid":"192226100","full_name":"munichpavel/fake-data-for-learning","owner":"munichpavel","description":"Sample interesting fake data for machine and human learning","archived":false,"fork":false,"pushed_at":"2025-02-27T10:45:13.000Z","size":486,"stargazers_count":8,"open_issues_count":5,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-12-17T02:32:58.988Z","etag":null,"topics":["data-generation","data-science","fake-data","fake-data-generator","machine-learning","python","statistics"],"latest_commit_sha":null,"homepage":"https://munichpavel.github.io/fake-data-for-learning","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/munichpavel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-16T18:54:51.000Z","updated_at":"2025-02-27T10:45:14.000Z","dependencies_parsed_at":"2025-02-27T11:47:18.233Z","dependency_job_id":"e6dc7781-ee4d-4539-9872-17f0bff78b39","html_url":"https://github.com/munichpavel/fake-data-for-learning","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/munichpavel/fake-data-for-learning","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/munichpavel%2Ffake-data-for-learning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/munichpavel%2Ffake-data-for-learning/tags","releases_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/munichpavel%2Ffake-data-for-learning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/munichpavel%2Ffake-data-for-learning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/munichpavel","download_url":"https://codeload.github.com/munichpavel/fake-data-for-learning/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/munichpavel%2Ffake-data-for-learning/sbom","scorecard":{"id":668067,"data":{"date":"2025-08-11","repo":{"name":"github.com/munichpavel/fake-data-for-learning","commit":"5d836bd683a2badee1ec5716b0b2f985f4aabb53"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.8,"checks":[{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Code-Review","score":0,"reason":"Found 0/16 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/ci.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Maintained","score":0,"reason":"0 
commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/ci.yml:19: update your workflow using https://app.stepsecurity.io/secureworkflow/munichpavel/fake-data-for-learning/ci.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/ci.yml:20: update your workflow using https://app.stepsecurity.io/secureworkflow/munichpavel/fake-data-for-learning/ci.yml/main?enable=pin","Warn: pipCommand not pinned by hash: .github/workflows/ci.yml:27","Warn: pipCommand not pinned by hash: .github/workflows/ci.yml:28","Warn: pipCommand not pinned by hash: .github/workflows/ci.yml:29","Info:   0 out of   2 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   3 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the 
dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":-1,"reason":"internal error: error during 
branchesHandler.setup: internal error: githubv4.Query: Resource not accessible by integration","details":null,"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":0,"reason":"17 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2022-249 / GHSA-9jmq-rx5f-8jwq","Warn: Project is vulnerable to: PYSEC-2018-34 / GHSA-2fc2-6r4j-p65h","Warn: Project is vulnerable to: PYSEC-2021-856 / GHSA-5545-2q6w-2gh6","Warn: Project is vulnerable to: PYSEC-2019-108 / GHSA-9fq2-x9r6-wfmf","Warn: Project is vulnerable to: PYSEC-2018-33 / GHSA-cw6w-4rcx-xphc","Warn: Project is vulnerable to: PYSEC-2021-857 / GHSA-f7c7-j99h-c22f","Warn: Project is vulnerable to: GHSA-fpfv-jqm9-f5jm","Warn: Project is vulnerable to: PYSEC-2017-1 / GHSA-frgw-fgh6-9g52","Warn: Project is vulnerable to: PYSEC-2020-73","Warn: Project is vulnerable to: PYSEC-2018-100 / GHSA-r38r-qp28-2m63","Warn: Project is vulnerable to: PYSEC-2020-107 / GHSA-jjw5-xxj6-pcv5","Warn: Project is vulnerable to: PYSEC-2024-110 / GHSA-jw8x-6495-233v","Warn: Project is vulnerable to: PYSEC-2020-108","Warn: Project is vulnerable to: PYSEC-2019-156 / GHSA-xp76-357g-9wqq","Warn: Project is vulnerable to: PYSEC-2023-102","Warn: Project is vulnerable to: PYSEC-2023-114","Warn: Project is vulnerable to: PYSEC-2022-43017 / GHSA-qwmp-2cf2-g9g6"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 18 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses 
static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-21T18:51:20.748Z","repository_id":36814275,"created_at":"2025-08-21T18:51:20.748Z","updated_at":"2025-08-21T18:51:20.748Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31554164,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-08T10:21:54.569Z","status":"ssl_error","status_checked_at":"2026-04-08T10:21:38.171Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-generation","data-science","fake-data","fake-data-generator","machine-learning","python","statistics"],"created_at":"2025-12-14T18:46:52.348Z","updated_at":"2026-04-08T12:05:32.944Z","avatar_url":"https://github.com/munichpavel.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# fake-data-for-learning\n\n![ci](https://github.com/munichpavel/fake-data-for-learning/actions/workflows/ci.yml/badge.svg)\n\nInteresting fake multivariate data is harder to generate than it should be. Textbooks typically give definitions, two standard examples (multinomial and multivariate normal) and then proceed to proving theorems and propositions. 
True, one-dimensional distributions can be combined, but here too the supply of examples is sparse, e.g. products of distributions or copulas (typically Gaussian or t-copulas) applied to these 1-d examples.\n\nFor machine learning experimentation, it is useful to have an unlimited supply of interesting fake data, where by interesting I mean that we know certain properties of the data and want to test whether an algorithm can pick them up. A great potential source of such data is graphical models.\n\nThe goal of this package is to make it easy to generate interesting fake data. In the current release, we generate fake data with discrete Bayesian networks (also known as directed graphical models).\n\n* **Website**: [https://munichpavel.github.io/fake-data-for-learning](https://munichpavel.github.io/fake-data-for-learning)\n* **Documentation**: [https://munichpavel.github.io/fake-data-docs/](https://munichpavel.github.io/fake-data-docs/)\n\n## Basic usage\n\nThe methods and interfaces for `fake_data_for_learning` largely follow those of [scipy](https://scipy.org), e.g. 
the method `rvs` to generate random samples, and `pmf` for the probability mass function, with extensions to handle non-integer sample values.\n\nDefining and sampling from (discrete) conditional random variables:\n\n```python\nimport numpy as np\nfrom fake_data_for_learning.fake_data_for_learning import BayesianNodeRV, SampleValue\n\n# Gender -\u003e Y\n# Define Gender with probability table, node label and value labels\nGender = BayesianNodeRV('Gender', np.array([0.55, 0.45]), values=['female', 'male'])\n\n# Define Y with conditional probability table, node, value and parent labels\npt_YcGender = np.array([\n    [0.9, 0.1],\n    [0.4, 0.6],\n])\nY = BayesianNodeRV('Y', pt_YcGender, parent_names=['Gender'])\n\n# Evaluate probability mass function for given parent values\nY.pmf(0, parent_values={'Gender': SampleValue('male', label_encoder=Gender.label_encoder)})\n# 0.4\n\n# Sample from Y given Gender\nY.rvs({'Gender': SampleValue('male', label_encoder=Gender.label_encoder)}, seed=42)\n# array([0])\n```\n\nCombine into a Bayesian network; sample and calculate the probability mass function of each sample:\n\n```python\nfrom fake_data_for_learning.fake_data_for_learning import FakeDataBayesianNetwork\n\n# Build the network from the nodes defined above\nbn = FakeDataBayesianNetwork(Gender, Y)\nsamples = bn.rvs(size=5)\n# Rounding of pmf is only for display purposes\nsamples['pmf'] = samples[['Gender', 'Y']].apply(lambda sample: round(bn.pmf(sample), 3), axis=1)\n```\n\n![docs/graphics/network_sample.png](docs/graphics/network_sample.png)\n\nVisualize the Bayesian network:\n\n```python\nbn.draw_graph()\n```\n\n![docs/graphics/graph.png](docs/graphics/graph.png)\n\nSee the demo notebook [notebooks/bayesian-network.ipynb](notebooks/bayesian-network.ipynb) for feature examples.\n\nTo avoid having to enter each value of a conditional probability array by hand, there are also two ways to generate random conditional probability tables.\n\nThe method `fake_data_for_learning.utils.RandomCpt()` gives a random conditional probability table, but if you want to constrain 
the entries to satisfy constraints on expectation values, this is done in the class `fake_data_for_learning.utils.ProbabilityPolytope`; see the example notebook [notebooks/conditional-probability-tables-with-constraints.ipynb](notebooks/conditional-probability-tables-with-constraints.ipynb). See also [Optional Dependencies](#optional-dependencies) below.\n\n## Installation\n\nInstall from [pypi](https://pypi.org/project/fake-data-for-learning/): `pip install fake-data-for-learning`\n\n### Optional dependencies\n\nNote that the methods of `utils.ProbabilityPolytope` that use polytope calculations to generate conditional probability tables subject to constraints on expectation value require the non-pure-python library [pypoman](https://github.com/stephane-caron/pypoman). System dependencies include [`cython`](https://cython.org) and [`glpk`](https://www.gnu.org/software/glpk/). See the [installation instructions](https://github.com/stephane-caron/pypoman#installation) for external dependency instructions.\n\nOn Mac OS X, these can be installed with\n\n```console\nbrew install cython glpk cddlib gmp\n```\n\n(see also the [pycddlib](https://github.com/mcmtroffaes/pycddlib) issue [Installation using pip on MacOS 11.6 #49](https://github.com/mcmtroffaes/pycddlib/issues/49), but beware the typo in the first comment dependency name).\n\nFollow the instructions regarding setting environment variables (e.g. in your `.bashrc` or `.zshrc` files). 
For example, your `.zshrc` file on Mac OS X might contain\n\n```bash\n# openblas\nexport LDFLAGS=\"-L/opt/homebrew/opt/openblas/lib\"\nexport CPPFLAGS=\"-I/opt/homebrew/opt/openblas/include\"\n\n# pycddlib\nexport CFLAGS=\"-I$(brew --prefix)/include -L$(brew --prefix)/lib\"\nexport PATH=\"/opt/homebrew/opt/cython/bin:$PATH\"\n\n# For pkg-config to find openblas you may need to set:\nexport PKG_CONFIG_PATH=\"/opt/homebrew/opt/openblas/lib/pkgconfig\"\n```\n\nBy default the python dependencies for `utils.ProbabilityPolytope` are not installed; to install them, run from your virtual environment `pip install 'fake-data-for-learning[probability_polytope]'`.\n\n### Local development\n\n* ``git clone`` the repository and ``cd`` into the project directory\n* Create a virtual environment from the included ``requirements.txt`` file\n\n## Documentation\n\nTo generate your own [Sphinx documentation](http://sphinx-doc.org/), you must set the environment variable ``LOCAL_BUILDDIR``.\n\nConvenience scripts for the case of separate build directories (local and remote) are in [docs/scripts](https://github.com/munichpavel/fake-data-for-learning/tree/master/docs/scripts).\n\n## Related packages\n\nThis package exists because I became tired of googling for existing implementations of how I wanted to generate fake data. In the development process, however, I found other packages for generating interesting fake data, notably\n\n* [pyro](https://pyro.ai/) is convenient for generating a wide variety of interesting fake data. It is easy to generate fake data from Bayesian networks joined by link functions; see e.g. [the introductory tutorial](http://pyro.ai/examples/intro_part_i.html).\n\n* [pgmpy](http://pgmpy.org/index.html) has a large amount of overlapping functionality, though `pgmpy` has a significantly larger scope. 
One difference is the bookkeeping convention for conditional probability tables: `pgmpy` represents conditional probability tables as 2d matrices, whereas we give each of the *n*-1 conditioned variables its own dimension, resulting in an *n*-dimensional array.\n\n* [pyagrum](https://pyagrum.readthedocs.io) is a Python wrapper around the C++ library [aGrUM](http://agrum.org/), and has similar functionality with a larger scope. Unlike `pgmpy`, `pyagrum` has a similar API for specifying conditional probability tables to the one used here.\n\n* [causalgraphicalmodels](https://github.com/ijmbarr/causalgraphicalmodels)'s class `StructuralCausalModel` allows sampling from Bayesian networks where the variables are related as functions of one another, rather than via the conditional probability tables used here.\n\n## Change log\n\n### v0.4.5\n\nAdd to Mac OS X optional install instructions.\n\n### v0.4.4\n\nFix missing usage of optional dependency specification.\n\n### v0.4.3\n\nMake non-python-dependencies from `utils.ProbabilityPolytope` an optional install.\n\n### v0.4.2\n\nFix Mac OS X dependency install issue.\n\n### v0.4.1\n\nFix dependencies' API changes.\n\n### v0.4.0\n\nThis release adds a method for generating categorical data whose (multidimensional) contingency table equals a given one. The motivation is to generate fake data exhibiting [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmunichpavel%2Ffake-data-for-learning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmunichpavel%2Ffake-data-for-learning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmunichpavel%2Ffake-data-for-learning/lists"}