{"id":19668430,"url":"https://github.com/pat-alt/optimalsubsampling","last_synced_at":"2026-06-14T15:34:13.214Z","repository":{"id":119125617,"uuid":"313033055","full_name":"pat-alt/optimalSubsampling","owner":"pat-alt","description":"This project investigates if and how systematic subsampling can be applied to imbalanced learning.","archived":false,"fork":false,"pushed_at":"2021-01-01T15:26:11.000Z","size":15346,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-10T21:41:38.953Z","etag":null,"topics":["bias-variance","subsampling"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pat-alt.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-11-15T13:05:55.000Z","updated_at":"2023-04-26T16:28:42.000Z","dependencies_parsed_at":null,"dependency_job_id":"19880e7d-b4de-4bbe-934e-a644dec9f12c","html_url":"https://github.com/pat-alt/optimalSubsampling","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/pat-alt/optimalSubsampling","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pat-alt%2FoptimalSubsampling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pat-alt%2FoptimalSubsampling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pat-alt%2FoptimalSubsampling/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pat-alt%
2FoptimalSubsampling/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pat-alt","download_url":"https://codeload.github.com/pat-alt/optimalSubsampling/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pat-alt%2FoptimalSubsampling/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":285636559,"owners_count":27205878,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-21T02:00:06.175Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bias-variance","subsampling"],"created_at":"2024-11-11T16:35:26.474Z","updated_at":"2025-11-21T15:03:54.334Z","avatar_url":"https://github.com/pat-alt.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\ntitle: \"README\"\noutput: github_document\n---\n\n```{r setup, include=FALSE}\nknitr::opts_chunk$set(echo = TRUE)\n```\n\n# optimalSubsampling\n\nThis project investigates if and how systematic subsampling can be applied to imbalanced learning. \nAll details can be found in this [Jupyter notebook](notebook.ipynb) - good if you want a condensed, interactive version and like working with Jupyter notebooks. For a better reading experience I would recommend using the more detailed [HTML](model_selection.html). 
A quick overview is provided below.\n\n## Overview\n\nSubsampling is motivated by settings where $n \\gg p$, that is, where the number of observations $n$ is very large relative to the number of parameters $p$. In such cases we may be interested in estimating model coefficients $\\hat\\beta_m$ instead of $\\hat\\beta_n$, where $p \\le m \\ll n$ and $m$ is freely chosen by us. In practice we may want to do this to avoid the high computational costs associated with large $n$. The basic algorithm for estimating $\\hat\\beta_m$ is simple:\n\n1. Subsample $m$ observations with replacement from the data according to sampling probabilities $\\{\\pi_i\\}$.\n2. Compute the least-squares estimator $\\hat\\beta_m$ using the subsample.\n\nHere we look at a few of the subsampling methods investigated and proposed in Zhu et al. (2015), which differ primarily in their choice of subsampling probabilities $\\{\\pi_i\\}$. The baseline results from Zhu et al. (2015) are replicated here and are consistent with the authors' findings: systematic subsampling can greatly improve model performance.\n\n![](www/mse.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpat-alt%2Foptimalsubsampling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpat-alt%2Foptimalsubsampling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpat-alt%2Foptimalsubsampling/lists"}