{"id":37064315,"url":"https://github.com/bkestelman/dasy-ml","last_synced_at":"2026-01-14T07:30:58.445Z","repository":{"id":62566547,"uuid":"441320824","full_name":"bkestelman/dasy-ml","owner":"bkestelman","description":"DaSy DataSynthesizer - Create synthetic data with desired statistical properties for machine learning research.","archived":false,"fork":false,"pushed_at":"2021-12-28T20:30:32.000Z","size":121,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-01-08T04:54:00.762Z","etag":null,"topics":["data","data-science","machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bkestelman.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-12-23T23:47:38.000Z","updated_at":"2021-12-28T20:30:35.000Z","dependencies_parsed_at":"2022-11-03T16:16:04.205Z","dependency_job_id":null,"html_url":"https://github.com/bkestelman/dasy-ml","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/bkestelman/dasy-ml","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bkestelman%2Fdasy-ml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bkestelman%2Fdasy-ml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bkestelman%2Fdasy-ml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bkestelman%2Fdasy-ml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bkestelman","download_url":"https://c
odeload.github.com/bkestelman/dasy-ml/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bkestelman%2Fdasy-ml/sbom","scorecard":{"id":241601,"data":{"date":"2025-08-11","repo":{"name":"github.com/bkestelman/dasy-ml","commit":"25d3d4029bb036c7dabd88ffd02aac6c72687137"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3,"checks":[{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable 
(binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Vulnerabilities","score":10,"reason":"0 
existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}}]},"last_synced_at":"2025-08-17T06:46:12.680Z","repository_id":62566547,"created_at":"2025-08-17T06:46:12.680Z","updated_at":"2025-08-17T06:46:12.680Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28413323,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T05:26:33.345Z","status":"ssl_error","status_checked_at":"2026-01-14T05:21:57.251Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","data-science","machine-learning"],"created_at":"2026-01-14T07:30:57.699Z","updated_at":"2026-01-14T07:30:58.437Z","avatar_url":"https://github.com/bkestelman.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# dasy-ml\nDaSy DataSynthesizer - Create synthetic data with desired statistical properties for machine learning research.\n\n## Quick-Start\n```\npip install dasy-ml\n```\n### Simple Usage\n\n#### dasy for Classification\n```python3\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom dasy.synthesizers.gaussian import GaussianSynth\nfrom dasy.labelers.classification.centroids import CentroidsLabeler\nplt.clf()\n\n# 1. 
Define the problem\ndim = 2 # dimension of each input\nclasses = 2 # number of classes\nn = 100 # number of data points\n# 2. Create synthetic input data X\nsynth = GaussianSynth(dim=dim)\nX = synth.sample(n=n) # sample n dim-dimensional data points from a Gaussian\n# 3. Assign labels y\nlabeler = CentroidsLabeler(classes=classes, dim=dim) # for 2 classes, this creates linearly separable labels\ny = labeler.assign(X)\n# 4. Plot\nplt.scatter(X.T[0], X.T[1], c=y)\nplt.title('Synthetic Classification Problem')\nplt.tight_layout()\nplt.show()\n```\n\n#### dasy for Regression\n```python3\nimport matplotlib.pyplot as plt\nfrom dasy.synthesizers.uniform import UniformSynth\nfrom dasy.labelers.regression.linear import LinearRegressionLabeler\nplt.clf()\n\n# 1. Define the problem\ndim = 1 # dimension of each input\nn = 50 # number of data points\n# 2. Create synthetic input data X\nsynth = UniformSynth(dim=dim)\nX = synth.sample(n=n) # sample n dim-dimensional data points from a Uniform distribution\n# 3. Assign continuous targets y\nlabeler = LinearRegressionLabeler(dim=dim)\ny = labeler.assign(X)\n# 4. Plot\nplt.scatter(X.T[0], y)\nplt.axline((0, labeler.b), slope=labeler.w, color='r') # plot the underlying line which generated the targets\nplt.title('Synthetic Regression Problem')\nplt.tight_layout()\nplt.show()\n```\n\n### Developers\n```\ngit clone https://github.com/bkestelman/dasy-ml\ncd dasy-ml\npip install -e .\npython -m unittest\n```\n\n## Introduction\nWhen researching machine learning algorithms, we often want to know how they behave on data with specific properties, for example data that is linearly separable, correlated, or isotropic. This library aims to let you construct synthetic datasets with any desired statistical properties, so researchers can easily study how algorithms respond to different types of data. 
\n\nWhy is this useful for machine learning research compared to using existing datasets?\n- Existing datasets may lack certain statistical properties you want to test your algorithm against.\n- You may not have enough information about where an existing dataset comes from. For example, is it IID?\n- You may want to test against many different types of data.\n- You may want to arbitrarily adjust the size of the dataset.\n\nNote: this is not a library for adding synthetic data to an existing dataset; many other libraries already do this.\n\n## Examples\n![](https://i.ibb.co/VY2Q2d9/gaussian-centroids-subplots.png)\n\nAbove, the input data X is sampled from a Gaussian centered at the origin. The data is then labeled by creating random centroids and assigning each point the label of its nearest centroid (similar to the assignment step in k-means). On the left, with only 2 classes, the classes are linearly separable. With 3 or more classes, no single hyperplane separates the classes, and the decision boundaries form a Voronoi diagram of the centroids.\n\n## DataSynthesizers and Labelers\nThe core of the library is its DataSynthesizers and Labelers.\n\nDataSynthesizers sample inputs X from the feature space.\n\nLabelers take inputs X and assign labels y to them.\n\nThese are very general classes. The procedure for creating X typically involves sampling from some probability distribution. Assigning labels may be a deterministic or probabilistic function. Each x or y may be sampled independently of the others, or it may depend on earlier samples, for example when the data is generated by a Markov process.\n\nA DataSynthesizer may also assign labels directly to its own data if you want to couple the label distribution with the input distribution.\n\n## Discussion of Kinds of Data\n\n### Independent vs. Non-Independent Data\n\n### Time-Series Data\n\n### Data for Classification Problems\n\n#### Deterministic vs. Probabilistic Labels\nIf a given input x always receives the same label, then the labels are deterministic. In other words, the label y=f(x), where f is a pure function. Typically, y is encoded as a one-hot vector.\n\nOn the other hand, if a given input x may be assigned different labels, then the labels are probabilistic. Here, y is drawn from the possible classes according to some probability distribution p(x), which gives the probability of each class for the given input.\n\nTheoretically, it is possible to achieve 100% accuracy on a deterministic classification problem, since a classifier that learns f exactly never errs. In a probabilistic classification problem this is generally impossible: even the best possible classifier, which always predicts the most probable class under p(x), is wrong whenever a less probable class is drawn.\n\n#### Noisy Labels\n\n#### Linearly Separable Data\n\n### Data for Regression Problems\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbkestelman%2Fdasy-ml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbkestelman%2Fdasy-ml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbkestelman%2Fdasy-ml/lists"}