{"id":19932060,"url":"https://github.com/amazon-science/ssepy","last_synced_at":"2026-02-28T11:43:59.166Z","repository":{"id":247868343,"uuid":"826973043","full_name":"amazon-science/ssepy","owner":"amazon-science","description":"Python package for stratifying, sampling, and estimating model performance with fewer annotations.","archived":false,"fork":false,"pushed_at":"2025-03-06T05:05:42.000Z","size":390,"stargazers_count":3,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-09-09T05:10:43.040Z","etag":null,"topics":["estimation","sampling","statistical-inference","statistics","stratified-sampling"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amazon-science.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-07-10T18:51:18.000Z","updated_at":"2025-09-08T20:28:39.000Z","dependencies_parsed_at":"2024-07-11T03:13:58.836Z","dependency_job_id":"a9a9e162-2211-43fe-901f-3b61ff7bdf39","html_url":"https://github.com/amazon-science/ssepy","commit_stats":null,"previous_names":["amazon-science/ssepy"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/amazon-science/ssepy","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fssepy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fssepy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazo
n-science%2Fssepy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fssepy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/amazon-science","download_url":"https://codeload.github.com/amazon-science/ssepy/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fssepy/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29932765,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-28T09:58:13.507Z","status":"ssl_error","status_checked_at":"2026-02-28T09:57:57.047Z","response_time":90,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["estimation","sampling","statistical-inference","statistics","stratified-sampling"],"created_at":"2024-11-12T23:08:54.503Z","updated_at":"2026-02-28T11:43:59.142Z","avatar_url":"https://github.com/amazon-science.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# `ssepy`: A Library for Efficient Model Evaluation through \u003cins\u003eS\u003c/ins\u003etratification, \u003cins\u003eS\u003c/ins\u003eampling, and \u003cins\u003eE\u003c/ins\u003estimation in \u003cins\u003ePy\u003c/ins\u003ethon\n\n\u003cp align=\"center\"\u003e\n    \u003ca 
href=\"https://arxiv.org/pdf/2406.07320\"\u003e\u003cimg src=\"https://img.shields.io/badge/paper-arXiv-red\" alt=\"Paper\"\u003e\u003c/a\u003e\n            \u003ca style=\"text-decoration:none !important;\" href=\"https://pypi.org/project/ssepy/\" alt=\"package management\"\u003e \u003cimg src=\"https://img.shields.io/badge/pip-package-blue\" /\u003e\u003c/a\u003e\n        \u003cimg src=\"https://img.shields.io/github/license/amazon-science/ssepy\" alt=\"Apache-2.0\"\u003e\n\u003c/p\u003e\n\n**Given an unlabeled dataset and model predictions, how can we select which\ninstances to annotate in one go to maximize the precision of our estimates of\nmodel performance on the entire dataset?**\n\nssepy helps you estimate the mean of any random variable across a large dataset. When the focus is on a model’s performance, it treats each sample’s performance as a random variable and aims to estimate the average (i.e., mean) performance over the entire dataset.\n\nThe main idea:\n\n1. **Predict**: Obtain a proxy or predicted value for each sample (e.g., a model’s predicted performance on that sample).\n2. **Stratify**: Use these proxies to group the samples into strata.\n3. **Sample**: From each stratum, draw a subset of samples according to the chosen allocation method (proportional, Neyman, or others).\n4. **Annotate**: Acquire ground-truth labels or real outcomes for the sampled subset.\n5. 
**Estimate**: Compute the overall mean (e.g., the mean model performance) using an estimator such as Horvitz-Thompson or a difference estimator.\n\nSee our paper [here](https://arxiv.org/pdf/2406.07320) for a technical overview of the framework.\n\n# Getting started\n\nTo install the package, run\n```bash\npip install ssepy\n```\n\nAlternatively, clone the repo, `cd` into it, and run\n\n```bash\npip install .\n```\n\nYou may want to create and activate a conda environment before installing.\n\nTest your setup using this example, which demonstrates data stratification,\nbudget allocation for annotation via proportional allocation, sampling via\nstratified simple random sampling, and estimation using the Horvitz-Thompson\nestimator:\n\n```python\nimport numpy as np\nfrom sklearn.cluster import KMeans\nfrom ssepy import ModelPerformanceEvaluator\n\nnp.random.seed(0)\n# Generate data\nN = 100000\nY = np.random.normal(0, 1, N) # Ground truth\n\n# Unobserved target\nprint(np.mean(Y))\n\nn = 100 # Annotation budget\n# 1. Proxy for ground truth\nYh = Y + np.random.normal(0, 0.1, N)\nevaluator = ModelPerformanceEvaluator(Yh=Yh, budget=n) # Initialize evaluator\n# 2. Stratify on Yh\nevaluator.stratify_data(clustering_algo=KMeans(n_clusters=5, random_state=0, n_init=\"auto\"), X=Yh) # 5 strata\n# 3. Allocate budget with proportional allocation and sample\nevaluator.allocate_budget(allocation_type=\"proportional\")\nsampled_idx = evaluator.sample()\n# 4. Annotate\nYl = Y[sampled_idx]\n# 5. Estimate target and variance of estimate\nestimate, variance_estimate = evaluator.compute_estimate(Yl, estimator=\"ht\")\nprint(estimate, variance_estimate)\n```\n\nFor the difference estimator under simple random sampling, run\n\n```python\nevaluator = ModelPerformanceEvaluator(Yh=Yh, budget=n) # initialize sampler\nsampled_idx = evaluator.sample(sampling_method=\"srs\") # 3. sample\nYl = Y[sampled_idx] # 4. 
annotate\nestimate, variance_estimate = evaluator.compute_estimate(Yl, estimator=\"df\") # 5. estimate\nprint(estimate, variance_estimate)\n```\n\nSee also the examples in the associated folder.\n\n# Features\n\nThe supported sampling designs are simple random sampling without\nreplacement (SRS), stratified simple random sampling without replacement (SSRS) with\nproportional or optimal (Neyman) allocation, and Poisson sampling. Each sampling\ndesign has associated Horvitz-Thompson (HT) and difference (DF) estimators.\n\n# Bugs and contributions\n\nFeel free to reach out if you find a bug or would like other features to\nbe implemented in the package.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fssepy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famazon-science%2Fssepy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fssepy/lists"}