{"id":30990616,"url":"https://github.com/finite-sample/fewlab","last_synced_at":"2025-09-14T22:06:32.860Z","repository":{"id":314027251,"uuid":"1052834142","full_name":"finite-sample/fewlab","owner":"finite-sample","description":"Pick the fewest items to label for most efficient unbiased OLS regression on per‑row trait shares","archived":false,"fork":false,"pushed_at":"2025-09-08T17:38:54.000Z","size":38,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-10T06:33:08.956Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/finite-sample.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-08T16:01:17.000Z","updated_at":"2025-09-09T23:55:45.000Z","dependencies_parsed_at":"2025-09-10T06:33:10.298Z","dependency_job_id":"8d47ded9-7d59-4f67-a75a-fec2d78c9a20","html_url":"https://github.com/finite-sample/fewlab","commit_stats":null,"previous_names":["finite-sample/fewlab"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/finite-sample/fewlab","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/finite-sample%2Ffewlab","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/finite-sample%2Ffewlab/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/finite-sample%2Ffewlab/releases","man
ifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/finite-sample%2Ffewlab/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/finite-sample","download_url":"https://codeload.github.com/finite-sample/fewlab/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/finite-sample%2Ffewlab/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274792995,"owners_count":25350653,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-12T02:00:09.324Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-09-12T19:51:08.234Z","updated_at":"2025-09-12T19:51:12.199Z","avatar_url":"https://github.com/finite-sample.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## fewlab: fewest items to label for most efficient unbiased OLS on shares\n\n[![PyPI version](https://img.shields.io/pypi/v/fewlab.svg)](https://pypi.org/project/fewlab/)\n[![Downloads](https://pepy.tech/badge/fewlab)](https://pepy.tech/project/fewlab)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)\n\n**Problem**: You have usage data (users × items) and want to understand how user 
traits relate to item preferences. But you can't afford to label every item. This tool tells you which items to label first to get the most accurate analysis.\n\n## When You Need This\n\nYou have:\n- A usage matrix: rows are users, columns are items (websites, products, apps)\n- User features you want to analyze (demographics, behavior patterns)\n- Limited budget to label items (safe/unsafe, brand affiliation, category)\n\nYou want to run a regression to understand relationships between user features and item traits, but labeling is expensive. Random sampling wastes budget on items that don't affect your analysis.\n\n## How It Works\n\nThe tool identifies items that most influence your regression coefficients. It prioritizes items that:\n1. Are used by many people\n2. Show different usage patterns across your user segments\n3. Would most change your conclusions if mislabeled\n\nThink of it as \"statistical leverage\"—some items matter more for understanding user-trait relationships.\n\n## Basic Usage\n\n```python\nfrom fewlab import items_to_label\nimport pandas as pd\n\n# Your data: user features and item usage\nuser_features = pd.DataFrame(...)  # User characteristics\nitem_usage = pd.DataFrame(...)     # Usage counts per user-item\n\n# Get top 100 items to label\npriority_items = items_to_label(\n    counts=item_usage,\n    X=user_features,\n    K=100\n)\n\n# Send priority_items to your labeling team\nprint(f\"Label these items first: {priority_items}\")\n```\n\n## What You Get\n\nA ranked list of K items that will give you the most precise regression estimates. The tool considers:\n- How much each item is used\n- Which user segments use which items  \n- The statistical relationship between items and your analysis goals\n\n## Practical Considerations\n\n**Choosing K**: Start with 10-20% of items. You can always label more if needed.\n\n**Validation**: Compare regression stability with different K values. 
When the coefficients stop changing appreciably as K grows, you have enough labels.\n\n**Limitations**:\n- Works best when usage patterns correlate with user features\n- Assumes item labels are binary (has trait / doesn't have trait)\n- Most effective for sparse usage matrices\n\n## Advanced: Ensuring Unbiased Estimates\n\nThe basic approach gives you the most informative items to label, but fully unbiased statistical estimates technically require some randomization in which items get labeled. If you need formal statistical guarantees, add a small random sample on top of the priority list. See the [statistical details](link) for more.\n\n## Installation\n\n```bash\npip install fewlab\n```\n\nRequires: numpy, pandas\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffinite-sample%2Ffewlab","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffinite-sample%2Ffewlab","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffinite-sample%2Ffewlab/lists"}