{"id":21461140,"url":"https://github.com/hmasdev/ssbgm","last_synced_at":"2026-05-17T21:02:24.033Z","repository":{"id":261905335,"uuid":"884866359","full_name":"hmasdev/ssbgm","owner":"hmasdev","description":"Score Based Generative Model with scikit-learn","archived":false,"fork":false,"pushed_at":"2024-12-22T12:12:58.000Z","size":7132,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-23T14:23:02.553Z","etag":null,"topics":["generative-model","scikit-learn"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hmasdev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-07T14:29:01.000Z","updated_at":"2024-12-22T12:13:01.000Z","dependencies_parsed_at":"2024-11-09T06:30:48.155Z","dependency_job_id":"1f232ce2-dff8-40df-9536-30c5fe4f787b","html_url":"https://github.com/hmasdev/ssbgm","commit_stats":null,"previous_names":["hmasdev/ssbgm"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmasdev%2Fssbgm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmasdev%2Fssbgm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmasdev%2Fssbgm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmasdev%2Fssbgm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hmasdev","download_url":"https://codeload.github.com/hmas
dev/ssbgm/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243976493,"owners_count":20377692,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["generative-model","scikit-learn"],"created_at":"2024-11-23T07:07:36.000Z","updated_at":"2026-05-17T21:02:23.936Z","avatar_url":"https://github.com/hmasdev.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# `ssbgm`: Scikit-learn-based Score Based Generative Model\n\n![GitHub top language](https://img.shields.io/github/languages/top/hmasdev/ssbgm)\n![GitHub tag (latest SemVer)](https://img.shields.io/github/v/tag/hmasdev/ssbgm?sort=semver)\n![GitHub](https://img.shields.io/github/license/hmasdev/ssbgm)\n![GitHub last commit](https://img.shields.io/github/last-commit/hmasdev/ssbgm)\n\n![Scheduled Test](https://github.com/hmasdev/ssbgm/actions/workflows/tests-on-schedule.yaml/badge.svg)\n\n`ssbgm` is a python library which enables you to generate synthetic data using a score based generative model with `scikit-learn`.\n\nYou can use `ssbgm` to predict a target value with some features by generating synthetic data given the features.\n\n## Installation\n\n### Requirements\n\n- Python (\u003e= 3.10)\n- libraries:\n  - catboost\u003e=1.2.7\n  - lightgbm\u003e=4.5.0\n  - numpy\u003e=1.26.4\n  - scikit-learn\u003e=1.5.2\n  - tqdm\u003e=4.67.0\n  - types-tqdm\u003e=4.66.0.20240417\n\nSee [./pyproject.toml](./pyproject.toml) for more details.\n\n### How to Install\n\nYou can install `ssbgm` via pip:\n\n```bash\npip install 
git+https://github.com/hmasdev/ssbgm.git\n```\n\nor\n\n```bash\ngit clone https://github.com/hmasdev/ssbgm.git\ncd ssbgm\npip install .\n```\n\n## Usage\n\n### Generate Synthetic Data\n\nHere is an example of generating synthetic data using `ssbgm`:\n\n```python\nimport numpy as np\nfrom sklearn.linear_model import LinearRegression\nfrom ssbgm import ScoreBasedGenerator\n\n# Prepare the dataset from which you want to generate synthetic data\n# row: sample, column: output dimension\nX: np.ndarray = ...\n\n# initialize the generator with LinearRegression\ngenerator = ScoreBasedGenerator(estimator=LinearRegression())\n\n# fit the generator\ngenerator.fit(X)\n\n# generate synthetic data\n# three sampling methods are shown: Langevin Monte Carlo, Euler, and Euler-Maruyama\nX_syn_lmc = generator.sample(n_samples=128, sampling_method=ScoreBasedGenerator.SamplingMethod.LANGEVIN_MONTECARLO, alpha=0.2).squeeze()\nX_syn_euler = generator.sample(n_samples=128, sampling_method=ScoreBasedGenerator.SamplingMethod.EULER).squeeze()\nX_syn_em = generator.sample(n_samples=128, sampling_method=ScoreBasedGenerator.SamplingMethod.EULER_MARUYAMA).squeeze()\n# The shape of each X_syn_* is (128, X.shape[1])\n```\n\n### Conditional Generation\n\nYou can use `ssbgm` to predict a target value with some features by generating synthetic data given the features.\n\n```python\nimport numpy as np\nfrom sklearn.linear_model import LinearRegression\nfrom ssbgm import ScoreBasedGenerator\n\n# Prepare the dataset from which you want to generate synthetic data\n# row: sample, column: features\nX: np.ndarray = ...\n# row: sample, column: target value\ny: np.ndarray = ...\n\n# initialize the generator with LinearRegression\ngenerator = ScoreBasedGenerator(estimator=LinearRegression())\n\n# fit the generator\ngenerator.fit(X, y)\n\n# predict the target value on X\ny_pred_by_mean, y_pred_std = generator.predict(X, aggregate='mean', return_std=True)  # Shape: (X.shape[0], y.shape[1]), (X.shape[0], y.shape[1])\ny_pred_by_median = generator.predict(X, aggregate='median')  # Shape: (X.shape[0], 
y.shape[1])\n\n# generate synthetic data conditioned on X\n# three sampling methods are shown: Langevin Monte Carlo, Euler, and Euler-Maruyama\nX_syn_lmc = generator.sample(X, n_samples=128, sampling_method=ScoreBasedGenerator.SamplingMethod.LANGEVIN_MONTECARLO, alpha=0.2, n_warmup=1000).squeeze()\nX_syn_euler = generator.sample(X, n_samples=128, sampling_method=ScoreBasedGenerator.SamplingMethod.EULER).squeeze()\nX_syn_em = generator.sample(X, n_samples=128, sampling_method=ScoreBasedGenerator.SamplingMethod.EULER_MARUYAMA).squeeze()\n# The shape of each X_syn_* is (128, X.shape[0], X.shape[1])\n```\n\n### Examples\n\nThis section shows some examples of using `ssbgm`.\n\nFor more details, see the [./samples](./samples) directory.\nIn particular, [./samples/cheatsheet.ipynb](./samples/cheatsheet.ipynb) is a good starting point.\n\n#### Mixed Gaussian Distribution\n\nSee [./samples/mixed_gaussian_distribution.ipynb](./samples/mixed_gaussian_distribution.ipynb) for more details.\n\n```python\n# import libraries\nfrom catboost import CatBoostRegressor\nfrom lightgbm import LGBMRegressor\nimport matplotlib.pyplot as plt  # Installing matplotlib is required\nimport numpy as np\n\nimport sys\nsys.path.append('../')\nfrom ssbgm import ScoreBasedGenerator\n\nnp.random.seed(0)\nN = 10000\n```\n\n```python\n# Case: 1d mixed gaussian\n\n# generate a training dataset\nx_train = np.random.randn(N) + (2*(np.random.rand(N) \u003e 0.5) - 1) * 1.6\n\n# train a generative model with score-based model\ngenerative_model_1d_mixed_gaussian = ScoreBasedGenerator(LGBMRegressor(random_state=42)).fit(x_train, noise_strengths=np.sqrt(np.logspace(-3, np.log(x_train.var()), 101)))\n\n# generate samples from the trained model\nx_gen = generative_model_1d_mixed_gaussian.sample(n_samples=N, sampling_method=ScoreBasedGenerator.SamplingMethod.EULER).squeeze()\n\n# plot the results\ntrue_pdf = lambda x: 0.5*np.exp(-0.5*(x-1.6)**2)/np.sqrt(2*np.pi) + 0.5*np.exp(-0.5*(x+1.6)**2)/np.sqrt(2*np.pi)\nplt.hist(x_train, 
bins=30, label='train data', color='blue', alpha=0.5, density=True)\nplt.hist(x_gen, bins=30, label='generated data', color='red', alpha=0.5, density=True)\nplt.plot(np.linspace(x_train.min(), x_train.max()), true_pdf(np.linspace(x_train.min(), x_train.max())), 'k-', label='true pdf')\nplt.legend(loc='upper left')\nplt.show()\n```\n\n   [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004026 seconds.\n   You can set `force_row_wise=true` to remove the overhead.\n   And if memory is not enough, you can set `force_col_wise=true`.\n   [LightGBM] [Info] Total Bins 357\n   [LightGBM] [Info] Number of data points in the train set: 1010000, number of used features: 2\n   [LightGBM] [Info] Start training from score 0.000087\n\n![./pics/1d_mixed_gaussian_distribution_example.png](./pics/1d_mixed_gaussian_distribution_example.png)\n\n```python\n# Case: 2d mixed gaussian\n\n# generate a training dataset\nX_train = np.random.randn(N, 2)\nlabel = 2*(np.random.rand(N) \u003e 0.5) - 1\nX_train[:, 0] = X_train[:, 0] + label * 1.6\nX_train[:, 1] = X_train[:, 1] + label * 1.6\n\n# train a generative model with score-based model\ngenerative_model_2d_mixed_gaussian = ScoreBasedGenerator(\n    estimator=CatBoostRegressor(\n        verbose=0,\n        loss_function='MultiRMSE',\n        random_state=42,\n    )\n)\ngenerative_model_2d_mixed_gaussian.fit(\n    X_train,\n    noise_strengths=np.sqrt(np.logspace(-3, np.log(max(np.var(X_train, axis=0))), 11)),\n)\n\n# generate samples from the trained model\nX_gen = generative_model_2d_mixed_gaussian.sample(n_samples=N, sampling_method=ScoreBasedGenerator.SamplingMethod.EULER).squeeze()\n\n# plot the results\ntrue_pdf = lambda X: 0.5*np.exp(-0.5*(X[:, 0]-1.6)**2 - 0.5*(X[:, 1]-1.6)**2)/2/np.pi + 0.5*np.exp(-0.5*(X[:, 0]+1.6)**2 - 0.5*(X[:, 1]+1.6)**2)/2/np.pi\nXX_, YY_ = np.meshgrid(np.linspace(X_train[:, 0].min(), X_train[:, 0].max()), np.linspace(X_train[:, 1].min(), X_train[:, 
1].max()))\nplt.scatter(X_train[:, 0], X_train[:, 1], label='train data', color='blue', alpha=0.2, marker='x')\nplt.scatter(X_gen[:, 0], X_gen[:, 1], label='generated data', color='red', alpha=0.2, marker='o')\nplt.contourf(XX_, YY_, true_pdf(np.c_[XX_.ravel(), YY_.ravel()]).reshape(XX_.shape), alpha=0.5)\nplt.legend(loc='upper left')\nplt.xlim(X_train[:, 0].min(), X_train[:, 0].max())\nplt.ylim(X_train[:, 1].min(), X_train[:, 1].max())\nplt.show()\n```\n\n![./pics/2d_mixed_gaussian_distribution_example.png](./pics/2d_mixed_gaussian_distribution_example.png)\n\n## How to Develop\n\n1. Fork the repository: [https://github.com/hmasdev/ssbgm](https://github.com/hmasdev/ssbgm)\n2. Clone the repository\n\n   ```bash\n   git clone https://github.com/{YOUR_NAME}/ssbgm\n   cd ssbgm\n   ```\n\n3. Create a virtual environment\n\n   ```bash\n   python -m venv venv\n   source venv/bin/activate\n   ```\n\n4. Install the required packages\n\n   ```bash\n   pip install -e .[dev]\n   ```\n\n5. Check out your working branch\n\n   ```bash\n   git checkout -b your-working-branch\n   ```\n\n6. Make your changes\n\n7. Test your changes\n\n   ```bash\n   pytest\n   flake8 ssbgm tests\n   mypy ssbgm tests\n   ```\n\n8. Commit your changes\n\n   ```bash\n   git add .\n   git commit -m \"Your commit message\"\n   ```\n\n9. Push your changes\n\n   ```bash\n   git push origin your-working-branch\n   ```\n\n10. Create a pull request: [https://github.com/hmasdev/ssbgm/compare](https://github.com/hmasdev/ssbgm/compare)\n\n## License\n\n[MIT](./LICENSE)\n\n## Author\n\n[hmasdev](https://github.com/hmasdev)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhmasdev%2Fssbgm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhmasdev%2Fssbgm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhmasdev%2Fssbgm/lists"}