{"id":22173474,"url":"https://github.com/jlumbroso/affirmative-sampling","last_synced_at":"2025-10-26T11:52:37.839Z","repository":{"id":62591910,"uuid":"474830155","full_name":"jlumbroso/affirmative-sampling","owner":"jlumbroso","description":"Reference implementation of the Affirmative Sampling algorithm by Jérémie Lumbroso and Conrado Martínez (2022). 🍀","archived":false,"fork":false,"pushed_at":"2022-06-02T15:18:09.000Z","size":813,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-13T15:08:01.861Z","etag":null,"topics":["affirmative-sampling","cardinality-estimation","data-streaming","probabilistic-algorithm","random-sampling"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/affirmative-sampling/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jlumbroso.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null}},"created_at":"2022-03-28T03:21:20.000Z","updated_at":"2024-11-08T01:08:10.000Z","dependencies_parsed_at":"2022-11-03T22:56:01.291Z","dependency_job_id":null,"html_url":"https://github.com/jlumbroso/affirmative-sampling","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jlumbroso%2Faffirmative-sampling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jlumbroso%2Faffirmative-sampling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jlumbroso%2Faffirmative-sampling/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jlumbroso%2Faffirmative-sampling/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jlumbroso","download_url":"https://codeload.github.com/jlumbroso/affirmative-sampling/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227690972,"owners_count":17805017,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["affirmative-sampling","cardinality-estimation","data-streaming","probabilistic-algorithm","random-sampling"],"created_at":"2024-12-02T07:33:33.642Z","updated_at":"2025-10-26T11:52:32.806Z","avatar_url":"https://github.com/jlumbroso.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Affirmative Sampling: Reference Implementation\n\n[![DOI](https://zenodo.org/badge/474830155.svg)](https://zenodo.org/badge/latestdoi/474830155)\n[![pytest](https://github.com/jlumbroso/affirmative-sampling/actions/workflows/continuous-integration.yaml/badge.svg)](https://github.com/jlumbroso/affirmative-sampling/actions/workflows/continuous-integration.yaml)\n[![codecov](https://codecov.io/gh/jlumbroso/affirmative-sampling/branch/main/graph/badge.svg?token=4S8TD999YC)](https://codecov.io/gh/jlumbroso/affirmative-sampling)\n\nThis repository contains a reference implementation, in Python, of the _Affirmative Sampling_ algorithm by Jérémie Lumbroso and Conrado Martínez (2022), as well as the original paper, accepted at the Analysis of Algorithms 2022 edition in Philadelphia.\n\n**Table of contents:**\n\n- [Abstract](#abstract)\n- [Installation](#installation)\n- [Historical Context](#historical-context)\n- [Example](#example)\n- [Intuition of How the Random Sample Grows](#intuition-of-how-the-random-sample-grows)\n- [License](#license)\n\n\u003csmall\u003e\u003ci\u003e\u003ca href='http://ecotrust-canada.github.io/markdown-toc/'\u003eToC generated with markdown-toc\u003c/a\u003e\u003c/i\u003e\u003c/small\u003e\n\n## Abstract\n\n_Affirmative Sampling_ is a practical and efficient novel algorithm to obtain random samples of distinct elements from a data stream.\n\nIts most salient feature is that the size $S$ of the sample will, on expectation, **grow with the (unknown) number $n$ of distinct elements in the data stream**.\n\nAs any distinct element has the same probability to be sampled, and the sample size is greater when the \"diversity\" (the number of distinct elements) is greater, the samples that _Affirmative Sampling_ delivers are more representative than those produced by any scheme where the sample size is fixed _a priori_—hence its name. This repository contains a reference implementation, in Python, to illustrate how the algorithm works and showcase some basic experimentation.\n\n## Installation\n\nThis package is available on PyPI and can be installed through the typical means:\n\n```shell\n$ pip install affirmative_sampling\n```\n\nThe hash functions that are used in this package come from [the `randomhash` Python package](https://github.com/jlumbroso/python-random-hash).\n\n## Historical Context\n\nSampling is a very important tool, because it makes it possible to infer information about a large data set using the characteristics of a much smaller data set. Historically, it came in the following flavors:\n\n- **(Straight) Sampling**: Each element of the initial data set of size $N$ is taken with the same (fixed) probability $p$. Such sample's size is a random variable, distributed like a binomial and centered in a mean of $Np$.\n\n- **Reservoir Sampling**: This (family of) algorithm(s), [introduced by Jeffrey Vitter (1985)](https://doi.org/10.1145/3147.3165), ensures the size of the resulting sample is fixed, by using _replacements_—indeed an element that is in the sample at some point in these algorithms, might later be evicted and replaced, to ensure the sample is both of fixed size, yet contains elements with uniform probability.\n\n- **Adaptive/Distinct Sampling**: This algorithm, introduced by Mark Wegman (1980), [analyzed by Philippe Flajolet (1990)](https://doi.org/10.1007/BF02241657) and [rebranded by Philip Gibbons (2001)](http://www.vldb.org/conf/2001/P541.pdf), draws elements not from a data set of size $N$, but by the underlying set of distinct items of that data set, of cardinality $n$. Both previous families of algorithm are susceptible to large, frequent elements, that drawn out other more rare elements. Distinct sampling algorithms are family of sampling algorithms that use hash functions to be insensitive to repetitions. While the size of the sample is not fixed, it oscillates closely around a fixed (constant) size.\n\n- **Affirmative Sampling**: This novel algorithm conserves the properties of the Distinct Sampling family of algorithms (because it also uses a hash function to be insensitive to repeated elements), but allows the target size of the sample to be a function of $n$, the number of distinct elements in the source data set—to be precise, the size of the sample is supposed to be $~k \\cdot \\log \\frac n k + k$, logarithmic in the number of distinct elements. This is important, because the accuracy of estimates inferred from a random sample depend on how representative the sample is of the diversity of the source data set, and Affirmative Sampling calibrates the size of the sample to deliver accurate estimates.\n\n## Example\n\nYou can look and run an example. Assuming you have `pipenv`:\n\n```shell\n$ pipenv run python example.py\n```\n\nor otherwise assuming your current Python environment has the package `affirmative_sampling` installed:\n\n```shell\n$ python example.py\n```\n\nThe output will be something along the following lines (exact value will change as the seed depends on the computer's clock):\n\n```\n=====================================================\n'Affirmative Sampling' by J. Lumbroso and C. Martínez\n=====================================================\n\n   Examples use Moby Dick (from the Gutenberg Project)\n   N=215436, n=19962, k=100, k*ln(n/k)=629.641555925844\n\nEXAMPLE 1: Number of tokens without 'e'\n====================================\n- Exact count: 6839\n- Estimated count: 7398.94\n  - Error: 8.19%\n\n- Size of sample: 648\n   - Expected size of sample: 629.64\n   - Tokens in sample without 'e': 242\n   - Proportion of tokens in sample without 'e': 37.35%\n\n\nEXAMPLE 2: Number of mice (freq. less or equal to 5)\n====================================================\n- Exact count: 16450\n- Estimated count: 16999.21\n  - Error: 3.34%\n\n- Size of sample: 648\n   - Expected size of sample: 629.64\n   - Number of mice in sample: 556\n   - Proportion of mice in sample: 85.8%\n\nSAMPLE\n======\n 1780 but\n  427 would\n  315 do\n  165 water\n  103 sight\n   86 give\n   68 name\n   61 together\n   54 entire\n   43 straight\n   37 famous\n   33 idea\n   31 mariners\n   29 person\n   29 stands\n   27 wooden\n   26 circumstance\n   26 cutting\n   26 otherwise\n   25 souls\n   22 aboard\n   22 owing\n   20 ah\n   19 concluded\n   19 deeper\n   19 leaves\n   19 ordinary\n   18 anchor\n   18 presently\n   17 foolish\n   17 previously\n   17 weight\n   16 fate\n   15 fit\n   15 flag\n   15 grass\n   15 shake\n   14 intent\n   14 rock\n   13 bunger\n   13 cool\n   13 eager\n   13 glancing\n   13 slightly\n   13 token\n   13 trademark\n   13 visit\n   12 america\n   12 smells\n   12 solemn\n   12 street\n   12 touched\n   11 ashes\n   11 carefully\n   11 carpenters\n   11 dish\n   11 downwards\n   11 sounding\n   11 stream\n   10 event\n   10 inferior\n   10 lift\n   10 perch\n    9 cask\n    9 change\n    9 driving\n    9 everlasting\n    8 crushed\n    8 currents\n    8 damp\n    8 leviathanic\n    8 mayhew\n    8 monkey\n    8 ought\n    8 published\n    8 shooting\n    8 strove\n    7 cracked\n    7 destined\n    7 knocking\n    7 lookout\n    6 arch\n    6 bury\n    6 cheek\n    6 comfort\n    6 decent\n    6 longitude\n    6 probable\n    6 purple\n    6 subjects\n    6 symptoms\n    6 value\n    5 depend\n    5 dip\n    5 disordered\n    5 faded\n    5 fasten\n    5 france\n    5 guard\n    5 humming\n    5 invited\n    5 navy\n    5 paradise\n    5 pen\n    5 riveted\n    5 rude\n    5 specimen\n    5 sufficient\n    5 wait\n    4 anchored\n    4 arsacidean\n    4 assert\n    4 beast\n    4 beaver\n    4 blubberroom\n    4 boatknife\n    4 cease\n    4 damages\n    4 distinguish\n    4 fidelity\n    4 follows\n    4 gills\n    4 hearses\n    4 moves\n    4 music\n    4 ninety\n    4 offers\n    4 paintings\n    4 razor\n    4 respectfully\n    4 scorching\n    4 sets\n    4 spaniards\n    4 standers\n    4 stroll\n    4 supposition\n    4 tufted\n    4 unrecorded\n    3 asiatic\n    3 behooves\n    3 brilliancy\n    3 capacity\n    3 capricious\n    3 cares\n    3 characteristics\n    3 charley\n    3 churned\n    3 closet\n    3 cuts\n    3 describe\n    3 disposition\n    3 dodge\n    3 entity\n    3 epidemic\n    3 eternities\n    3 extinct\n    3 fancies\n    3 figured\n    3 fleetness\n    3 flooded\n    3 flurry\n    3 grizzled\n    3 halls\n    3 hip\n    3 inconsiderable\n    3 inmates\n    3 inseparable\n    3 mending\n    3 mule\n    3 pouring\n    3 pregnant\n    3 providence\n    3 quoted\n    3 rags\n    3 romish\n    3 route\n    3 shun\n    3 smoky\n    3 socks\n    3 spots\n    3 stained\n    3 stolen\n    3 substantiated\n    3 suspect\n    3 tarpaulins\n    3 tashtegos\n    3 thrusts\n    3 ticklish\n    3 tows\n    3 tragedy\n    3 treat\n    3 typhoons\n    3 unabated\n    3 user\n    3 weighty\n    3 westward\n    3 whittling\n    3 wraps\n    2 accessible\n    2 admitting\n    2 admonished\n    2 aglow\n    2 agonized\n    2 alluding\n    2 attain\n    2 avenues\n    2 awed\n    2 backwoodsman\n    2 barely\n    2 belshazzars\n    2 bout\n    2 brag\n    2 bravest\n    2 bumps\n    2 burkes\n    2 ceases\n    2 chancelike\n    2 chasefirst\n    2 complement\n    2 confidently\n    2 constitution\n    2 cows\n    2 cringing\n    2 decanting\n    2 digest\n    2 dilapidated\n    2 distinctive\n    2 dusting\n    2 egotistical\n    2 enlivened\n    2 ensue\n    2 entrances\n    2 error\n    2 essentially\n    2 exertion\n    2 expiring\n    2 faraway\n    2 fearlessly\n    2 fishe\n    2 fishspears\n    2 girdling\n    2 glide\n    2 grammar\n    2 halloo\n    2 hilariously\n    2 housekeeping\n    2 hover\n    2 hudson\n    2 imputation\n    2 injured\n    2 junks\n    2 keyhole\n    2 manofwar\n    2 masterless\n    2 meridian\n    2 misanthropic\n    2 navel\n    2 newspaper\n    2 obligations\n    2 opulent\n    2 oughts\n    2 outlast\n    2 outwardbound\n    2 overseeing\n    2 paramount\n    2 penetrating\n    2 performed\n    2 permitting\n    2 pumping\n    2 quaint\n    2 quilt\n    2 rabelais\n    2 reappeared\n    2 regulating\n    2 ripple\n    2 ruinous\n    2 sadder\n    2 sagittarius\n    2 saltsea\n    2 scandinavian\n    2 scratches\n    2 serves\n    2 shunned\n    2 snows\n    2 squeezed\n    2 stiffest\n    2 sympathies\n    2 tarpaulin\n    2 temperature\n    2 texas\n    2 toilings\n    2 tweezers\n    2 underneath\n    2 unthinkingly\n    2 unwarrantably\n    2 ushered\n    2 vagabond\n    2 whalehunters\n    2 woodlands\n    1 abstemious\n    1 accomplishment\n    1 acquiesce\n    1 admirer\n    1 adoring\n    1 affghanistan\n    1 afterhes\n    1 ahabshudder\n    1 airfreighted\n    1 alpine\n    1 amosti\n    1 ancestress\n    1 andromedaindeed\n    1 animosity\n    1 annually\n    1 antecedent\n    1 aroostook\n    1 arter\n    1 asa\n    1 atom\n    1 attarofrose\n    1 backof\n    1 ballena\n    1 bamboozingly\n    1 bamboozle\n    1 battled\n    1 bays\n    1 bedclothes\n    1 beehive\n    1 bellbuttons\n    1 bestreaked\n    1 billiardball\n    1 billiardballs\n    1 boatsmark\n    1 boatswain\n    1 brandingiron\n    1 breedeth\n    1 brutal\n    1 brutes\n    1 bungle\n    1 burlybrowed\n    1 cajoling\n    1 cambrics\n    1 centipede\n    1 channel\n    1 characteristically\n    1 chickens\n    1 circumambient\n    1 clapt\n    1 claw\n    1 claws\n    1 cloudscud\n    1 colorless\n    1 commentator\n    1 confidentially\n    1 congeniality\n    1 connexions\n    1 consolatory\n    1 constrain\n    1 contiguity\n    1 controllable\n    1 costermongers\n    1 couldin\n    1 counteracted\n    1 counterbalanced\n    1 counters\n    1 courtesymay\n    1 coverlid\n    1 creware\n    1 crookedness\n    1 crownjewels\n    1 czarship\n    1 dallied\n    1 deaden\n    1 decisionone\n    1 defiles\n    1 delightwho\n    1 demonism\n    1 departing\n    1 detects\n    1 digester\n    1 dines\n    1 disbands\n    1 discipline\n    1 dissolve\n    1 domineered\n    1 donned\n    1 donthe\n    1 doubleshuffle\n    1 doubling\n    1 doughnuts\n    1 dubiouslooking\n    1 dugongs\n    1 dumbest\n    1 dwarfed\n    1 earththat\n    1 eavetroughs\n    1 ego\n    1 ellery\n    1 elucidating\n    1 emoluments\n    1 englishknowing\n    1 engraven\n    1 enthusiasmbut\n    1 errorabounding\n    1 eventuated\n    1 exaggerate\n    1 exegetists\n    1 exploring\n    1 expressly\n    1 exultation\n    1 factories\n    1 feasting\n    1 featuring\n    1 ferdinando\n    1 fiercefanged\n    1 fissures\n    1 fitsthats\n    1 flavorish\n    1 froissart\n    1 funereally\n    1 furs\n    1 garterknights\n    1 ghastliness\n    1 glimmering\n    1 gloss\n    1 glows\n    1 godomnipresent\n    1 grease\n    1 greenly\n    1 grog\n    1 groupings\n    1 guido\n    1 halfbelieved\n    1 hangdog\n    1 hayseed\n    1 headladen\n    1 heraldic\n    1 hitching\n    1 hoarfrost\n    1 honing\n    1 hopefulness\n    1 horned\n    1 hussars\n    1 ifand\n    1 ignore\n    1 ignoring\n    1 ills\n    1 illumination\n    1 imitated\n    1 import\n    1 incidents\n    1 indianfile\n    1 inflated\n    1 instigation\n    1 intangible\n    1 intercedings\n    1 interflow\n    1 inventing\n    1 inventors\n    1 irresolution\n    1 ithow\n    1 ixion\n    1 jobcoming\n    1 jollynot\n    1 jugglers\n    1 lackaday\n    1 lacks\n    1 ladthe\n    1 lakeevinced\n    1 laureate\n    1 legmaker\n    1 lend\n    1 leopardsthe\n    1 leviathanism\n    1 lifeas\n    1 lighten\n    1 lighthouse\n    1 lordvishnoo\n    1 lovings\n    1 maintruckha\n    1 maltreated\n    1 manufacturer\n    1 marquee\n    1 meatmarket\n    1 miasmas\n    1 midnighthow\n    1 migrating\n    1 milkiness\n    1 misfortune\n    1 mixing\n    1 mock\n    1 moons\n    1 mossy\n    1 mutinying\n    1 mystically\n    1 namelessly\n    1 nantuckois\n    1 napoleons\n    1 naythe\n    1 negligence\n    1 neighborsthe\n    1 netted\n    1 nondescripts\n    1 offwe\n    1 oftenest\n    1 ohwhew\n    1 oilpainting\n    1 onsets\n    1 overbalance\n    1 overdoing\n    1 palpableness\n    1 panicstricken\n    1 parenthesize\n    1 particoloured\n    1 pascal\n    1 pauselessly\n    1 pave\n    1 peddlin\n    1 pedestal\n    1 perturbation\n    1 pester\n    1 philopater\n    1 plaintively\n    1 platos\n    1 poker\n    1 prescribed\n    1 princess\n    1 proas\n    1 propulsion\n    1 prtorians\n    1 queerqueer\n    1 quitthe\n    1 quivered\n    1 rads\n    1 rarities\n    1 readable\n    1 reasona\n    1 rechristened\n    1 regardless\n    1 reglar\n    1 repent\n    1 reverenced\n    1 reveriestallied\n    1 rightdown\n    1 rioting\n    1 rob\n    1 rosesome\n    1 rustling\n    1 saidtherefore\n    1 sawlightning\n    1 sayshands\n    1 scoot\n    1 scornfully\n    1 scuffling\n    1 seafowl\n    1 seamless\n    1 seasalt\n    1 seconds\n    1 sedentary\n    1 seducing\n    1 selfcollectedness\n    1 sheathed\n    1 shindy\n    1 shipwhich\n    1 shirrbut\n    1 shortwarpthe\n    1 shoutedsail\n    1 sideladder\n    1 silverso\n    1 singlesheaved\n    1 sirin\n    1 slanderous\n    1 soars\n    1 sodom\n    1 songster\n    1 sphynxs\n    1 spill\n    1 spoiling\n    1 spurzheim\n    1 starbuckbut\n    1 staterooms\n    1 stingy\n    1 stoopingly\n    1 sunburnt\n    1 superseded\n    1 surcoat\n    1 surpassingly\n    1 surveying\n    1 syren\n    1 tenement\n    1 terribleness\n    1 theni\n    1 therethe\n    1 thingbe\n    1 thingnamely\n    1 thingsoak\n    1 thinkbut\n    1 thisgreen\n    1 thisthe\n    1 thunderclotted\n    1 ticdollyrow\n    1 tick\n    1 ticking\n    1 tie\n    1 timberhead\n    1 tipping\n    1 topple\n    1 tracingsout\n    1 traditional\n    1 trans\n    1 treachery\n    1 treasuries\n    1 trivial\n    1 trumpblister\n    1 tunnels\n    1 unblinkingly\n    1 unchallenged\n    1 undecided\n    1 undefiled\n    1 underground\n    1 unequal\n    1 unfavourable\n    1 unfractioned\n    1 unmisgiving\n    1 unsay\n    1 unthought\n    1 usei\n    1 vanquished\n    1 victory\n    1 virgo\n    1 volunteered\n    1 wading\n    1 wales\n    1 wan\n    1 wary\n    1 waythats\n    1 weathersheet\n    1 weaverpauseone\n    1 wept\n    1 wethough\n    1 whalethis\n    1 whalewise\n    1 winces\n    1 workmen\n    1 worm\n    1 wornout\n    1 worseat\n    1 zip\n```\n\n## Intuition of How the Random Sample Grows\n\nThe novel property of the algorithm is that it grows in a controlled way, that is related to the logarithm of the number of distinct elements. The sample is divided into two parts: A fixed-size part (`sample_core`) that will always be of size $k$; and a variable-size part (`sample_xtra`) that will grow slowly throughout the process of the algorithm. Depending on its hashed value, a new element $z$ might either be DISCARDED, REPLACE an existing element of the sample, or EXPAND the variable-size sample, see diagram below:\n\n```\nREPRESENTATION OF THE SAMPLE DURING THE ALGORITHM         |   OUTCOMES FOR NEW ELEMENT z\n                                                          |         y = hash(z)\n      High hash values                                    |\n             ^                                            |\n             |                                            |\n             |                                            |\n+-------------------------+ \u003c-- max hash of S so far      |\n|                         |     (no need to track this)   | \u003c-- y \u003e= k-th hash\n| sample_core             |                               |\n| size = k (always/fixed) |                               |  EXPAND:\n|                         |                               |    ADD z to sample_core\n+-------------------------+ \u003c-- k-th hash of S            |    MOVE z_kth_hash from\n|                         |     = min hash in sample_core |      sample_core to sample_xtra\n|                         |                               |    total size ++\n| sample_xtra             |                               |\n| size = S - k (variable) |                               | \u003c-- kth_hash \u003e y \u003e min_hash\n|                         |                               |\n|                         |                               |  REPLACE z_min_hash with z:\n|                         |                               |    ADD z to sample_xtra\n+-------------------------+ \u003c-- min hash of S             |    REMOVE z_min_hash from sample_xtra\n             |                  = min hash in sample_xtra |\n             |                                            | \u003c-- y \u003c= min_hash\n             |                                            |\n             v                                            |  DISCARD z\n       Low hash values                                    |\n                                                          |\n```\n\nAs the paper illustrates, it is also possible to design variants of the Affirmative Sampling algorithm, with a growth rate that is different than logarithmic.\n\n## License\n\nThis project is licensed under the MIT license, which means that you can do whatever you want with this code, as long as you preserve, in some form, the associated copyright and license notice.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjlumbroso%2Faffirmative-sampling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjlumbroso%2Faffirmative-sampling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjlumbroso%2Faffirmative-sampling/lists"}