{"id":27430985,"url":"https://github.com/bigmlcom/sampling","last_synced_at":"2026-03-12T17:36:30.854Z","repository":{"id":3748714,"uuid":"4823719","full_name":"bigmlcom/sampling","owner":"bigmlcom","description":"Random Sampling in Clojure","archived":false,"fork":false,"pushed_at":"2019-04-17T17:22:59.000Z","size":121,"stargazers_count":109,"open_issues_count":0,"forks_count":11,"subscribers_count":16,"default_branch":"master","last_synced_at":"2024-04-16T10:59:00.726Z","etag":null,"topics":["clojure","reservoir-sampling","sampling","stream-sampling"],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bigmlcom.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-06-28T17:56:55.000Z","updated_at":"2024-04-10T11:31:38.000Z","dependencies_parsed_at":"2022-09-24T04:43:34.861Z","dependency_job_id":null,"html_url":"https://github.com/bigmlcom/sampling","commit_stats":null,"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigmlcom%2Fsampling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigmlcom%2Fsampling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigmlcom%2Fsampling/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigmlcom%2Fsampling/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bigmlcom","download_url":"https://codeload.github.com/bigmlcom/sampling/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"g
ithub","repositories_count":248905864,"owners_count":21181065,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clojure","reservoir-sampling","sampling","stream-sampling"],"created_at":"2025-04-14T15:28:12.377Z","updated_at":"2026-03-12T17:36:30.798Z","avatar_url":"https://github.com/bigmlcom.png","language":"Clojure","readme":"# Random Sampling in Clojure\n\n## Installation\n\n`sampling` is available as a Maven artifact from\n[Clojars](http://clojars.org/bigml/sampling).\n\n[![Clojars Project](https://img.shields.io/clojars/v/bigml/sampling.svg)](https://clojars.org/bigml/sampling)\n\n## Overview\n\nThis library supports three flavors of random sampling:\n[simple sampling](#simple-sampling),\n[reservoir sampling](#reservoir-sampling),\nand [stream sampling](#stream-sampling).\n\nSimple sampling is the best choice if the data is small enough to\ncomfortably keep in memory.  If that's not true but the sample you\nwant to take is small enough for memory, then reservoir sampling is a\ngood choice.  If neither the original population nor the sample can\nreside in memory, then take a look at stream sampling.\n\n![](img/memory.png)\n\nAs we review each, feel free to follow along in the REPL:\n\n```clojure\nuser\u003e (ns test\n        (:require (bigml.sampling [simple :as simple]\n                                  [reservoir :as reservoir]\n                                  [stream :as stream])\n                  (bigml.sampling.test [stream :as stream-test])))\n```\n\n## Simple Sampling\n\n`sample.simple` provides simple random sampling.  
With this technique\nthe original population is kept in memory but the resulting sample is\na lazy sequence.\n\nBy default, sampling is done [without replacement](http://www.ma.utexas.edu/users/parker/sampling/repl.htm). This\nis equivalent to a lazy [Fisher-Yates shuffle](http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle).\n\n```clojure\ntest\u003e (simple/sample (range 5))\n(2 3 1 0 4)\n```\n\nSetting `:replace` to true will sample with replacement.  Since there\nis no limit to the number of items that may be sampled with\nreplacement from a population, the result will be an infinite\nlist.  So make sure to `take` however many samples you need.\n\n```clojure\ntest\u003e (take 10 (simple/sample (range 5) :replace true))\n(2 3 3 2 4 1 1 1 3 0)\n```\n\nEach call to `simple/sample` will return a new sample order.\n\n```clojure\ntest\u003e (simple/sample (range 5))\n(0 2 3 1 4)\ntest\u003e (simple/sample (range 5))\n(3 1 4 2 0)\n```\n\nSetting the `:seed` parameter allows the sample order to be recreated.\n\n```clojure\ntest\u003e (simple/sample (range 5) :seed 7)\n(2 1 3 4 0)\ntest\u003e (simple/sample (range 5) :seed 7)\n(2 1 3 4 0)\n```\n\nAny value that's hashable is valid as a seed:\n\n```clojure\ntest\u003e (simple/sample (range 5) :seed :foo)\n(1 0 3 2 4)\n```\n\nThe underlying random number generator may also be selected with the\n`:generator` parameter.  The two options are `:lcg`\n([linear congruential generator](http://en.wikipedia.org/wiki/Linear_congruential_generator))\nand `:twister`\n([Mersenne twister](http://en.wikipedia.org/wiki/Mersenne_twister)).\nThe default is `:lcg`.\n\n```clojure\ntest\u003e (simple/sample (range 5) :seed 7 :generator :lcg)\n(2 1 3 4 0)\ntest\u003e (simple/sample (range 5) :seed 7 :generator :twister)\n(4 3 0 1 2)\n```\n\n### Weighted Simple Sampling\n\nA sample may be weighted using the `:weigh` parameter.  
If the\nparameter is supplied with a function that takes an item and produces\na non-negative weight, then the resulting sample will be weighted\naccordingly.\n\n```clojure\ntest\u003e (take 5 (simple/sample [:heads :tails]\n                             :weigh {:heads 0.5 :tails 0.5}\n                             :replace true))\n(:tails :heads :heads :heads :tails)\n```\n\nThe weights need not sum to 1.\n\n```clojure\ntest\u003e (frequencies (take 100 (simple/sample [:heads :tails]\n                                            :weigh {:heads 2 :tails 1}\n                                            :replace true)))\n{:heads 66, :tails 34}\n```\n\n## Reservoir Sampling\n\n`sample.reservoir` provides functions for [reservoir sampling](http://en.wikipedia.org/wiki/Reservoir_sampling).  Reservoir sampling\nkeeps the sampled population in memory (the 'reservoir').  However,\nthe original population is streamed through the reservoir so it does\nnot need to reside in memory.  This makes reservoirs useful when the\noriginal population is too large to fit into memory or the overall\nsize of the population is unknown.\n\nTo create a sample reservoir, use `reservoir/create` and give it the\nnumber of samples you desire.  The resulting reservoir acts as a\ncollection, so you can simply `conj` values into the reservoir to\ncreate a sample.  
For example:\n\n```clojure\ntest\u003e (reduce conj (reservoir/create 3) (range 10))\n(5 7 2)\n```\n\nSimilarly, a collection can be fed into the reservoir with `into`:\n\n```clojure\ntest\u003e (into (reservoir/create 3) (range 10))\n(7 0 8)\n```\n\nTo see how the reservoir changes as items are added, we can use\n`reductions`:\n\n```clojure\ntest\u003e (reductions conj (reservoir/create 3) (range 10))\n(() (0) (0 1) (0 1 2) (0 3 2) (0 3 2) (5 3 2) (6 3 2) (6 3 2) (6 3 2) (6 9 2))\n```\n\nFor convenience, `reservoir/sample` accepts a collection and a\nreservoir size and returns the final reservoir:\n\n```clojure\ntest\u003e (reservoir/sample (range 10) 5)\n(0 9 2 1 4)\n```\n\nBoth `reservoir/sample` and `reservoir/create` support the `:replace`,\n`:seed`, `:generator`, and `:weigh` parameters.\n\n```clojure\ntest\u003e (reservoir/sample (range 10) 5 :replace true :seed 1 :weigh identity)\n(7 2 4 5 1)\n```\n\nOne caveat is that samples for reservoirs using `:weigh` won't be in a\nrandom order (with respect to item weights).  So you may need to\nshuffle the results if that's important for you.\n\n### Merging Reservoirs\n\nReservoirs may be merged with `reservoir/merge`. The resulting sample\nwill be similar to a single reservoir over the entire population.  For\nexample:\n\n```clojure\ntest\u003e (reduce + (reservoir/sample (range 0 10000) 500))\n2517627\n\ntest\u003e (reduce + (reservoir/merge\n                 (reservoir/sample (range 0 5000) 500)\n                 (reservoir/sample (range 5000 8000) 500)\n                 (reservoir/sample (range 8000 10000) 500)))\n2527384\n```\n\nWith `reservoir/merge`, reservoirs may be built in parallel on subsets\nof the population and combined afterwards, even if the subsets are of\nvarying size.\n\n### Reservoir Implementations\n\nLastly, there are two implementations of reservoir sampling available:\n`:insertion` and `:efraimdis`.  `:efraimdis` is the default and\ngenerally the better option.  
`:insertion` does not support the\n`:weigh` parameter; however, it can be faster when sampling from\nsmall-ish populations or when sampling with replacement.\n\nThe implementation may be selected when calling either\n`reservoir/sample` or `reservoir/create` by using the\n`:implementation` parameter:\n\n```clojure\ntest\u003e (time (count (reservoir/sample (range 10000) 2000\n                                     :implementation :efraimdis\n                                     :replace true)))\n\"Elapsed time: 4197.798 msecs\"\n2000\ntest\u003e (time (count (reservoir/sample (range 10000) 2000\n                                     :implementation :insertion\n                                     :replace true)))\n\"Elapsed time: 651.868 msecs\"\n2000\n```\n\n## Stream Sampling\n\n`sample.stream` is useful when taking a large sample from a large\npopulation. Neither the original population nor the resulting sample\nis kept in memory.  There are a couple of caveats.  First, unlike the\nother sampling techniques, the resulting sample stream is not in\nrandom order.  It will be in the order of the original population.  So\nif you need a random ordering, you'll want to shuffle the sample.  The\nsecond caveat is that, unlike reservoir sampling, the size of the\npopulation must be declared up-front.\n\nTo use stream sampling, call `stream/sample` with the population, the\ndesired number of samples, and the size of the population.  The result\nis a lazy sequence of samples.\n\nAs an example, we take five samples from a population of ten values:\n\n```clojure\ntest\u003e (stream/sample (range) 5 10)\n(1 2 4 7 9)\n```\n\nAs elsewhere, `stream/sample` supports `:replace`, `:seed`, and\n`:generator`:\n\n```clojure\ntest\u003e (stream/sample (range) 5 10 :replace true :seed 2)\n(0 0 4 6 7)\n```\n\n### Out-of-bag Items\n\nIf an item isn't selected as part of a sampling, it's called\n*out-of-bag*.  
Setting the `:out-of-bag` parameter to true will return\na sequence of the out-of-bag items instead of the sampled items.  This\ncan be useful when paired with `:seed`.\n\n```clojure\ntest\u003e (stream/sample (range) 7 10 :seed 0)\n(0 2 3 5 6 7 9)\ntest\u003e (stream/sample (range) 7 10 :seed 0 :out-of-bag true)\n(1 4 8)\n```\n\n### Rate\n\nIt's computationally expensive to select the exact number of desired\nsamples when using `stream/sample` with replacement.  If you're okay\nwith the number of samples being approximately the desired number,\nthen you can set `:rate` to true to decrease the computation cost.\nWhen this is the case, the probability of selecting an item will be\ncalculated only once and then applied to each item in the population\nindependently.  As an example:\n\n```clojure\ntest\u003e (time (count (stream/sample (range 10000) 5000 10000\n                                  :replace true)))\n\"Elapsed time: 374.021 msecs\"\n5000\ntest\u003e (time (count (stream/sample (range 10000) 5000 10000\n                                  :replace true :rate true)))\n\"Elapsed time: 33.923 msecs\"\n4954\n```\n\n`:rate` is also useful if you want to sample the population at a\nparticular rate rather than collect a specific sample size.\n\nTo illustrate, when `stream/sample` is given an infinite list of\nvalues as the population, the default behavior is to take the\nrequested samples from the expected population.  
In this case, it\nmeans taking exactly one sample from the first thousand values of the\npopulation:\n\n```clojure\ntest\u003e (stream/sample (range) 1 1000)\n(229)\n```\n\nHowever, when `:rate` is true the resulting sample is also infinite,\nwith each item sampled at a probability of `1/1000`:\n\n```clojure\ntest\u003e (take 10 (stream/sample (range) 1 1000 :rate true))\n(1149 1391 1562 3960 4359 4455 5141 5885 6310 7568 7828)\n```\n\n### Cond-Sample\n\nWhile stream sampling does not yet support sample weights, the\n`cond-sample` fn can be useful for fine-tuned sampling.\n\n`cond-sample` accepts a collection followed by pairs of clauses and\nsample definitions.  A clause should be a function that accepts an\nitem and returns either true or false.  After each clause should\nfollow a sample definition that describes the sampling technique to use\nwhen the condition is true.\n\nAs an example, we'll use the well-known [iris dataset](http://en.wikipedia.org/wiki/Iris_flower_data_set):\n```clojure\ntest\u003e (first stream-test/iris-data)\n[5.1 3.5 1.4 0.2 \"Iris-setosa\"]\n```\n\nThere are 50 instances of each species:\n```clojure\ntest\u003e (frequencies (map last stream-test/iris-data))\n{\"Iris-setosa\" 50, \"Iris-versicolor\" 50, \"Iris-virginica\" 50}\n```\n\nLet's say we want to sample all of `Iris-setosa`, half as many\n`Iris-versicolor`, and none of the `Iris-virginica`.  
If you knew the\npopulation for each class ahead of time, you could use `cond-sample`\nlike so:\n\n```clojure\ntest\u003e (def new-sample\n         (stream/cond-sample stream-test/iris-data\n                             #(= \"Iris-setosa\" (last %)) [50 50]\n                             #(= \"Iris-versicolor\" (last %)) [25 50]\n                             #(= \"Iris-virginica\" (last %)) [0 50]))\ntest\u003e (frequencies (map last new-sample))\n{\"Iris-setosa\" 50, \"Iris-versicolor\" 25}\n```\n\nIf you did not know the class populations ahead of time, a similar\nsample could be done using `:rate`.  Also, an item that doesn't\nsatisfy any condition will be left out of the final sample.  So\n`Iris-virginica` does not need to have its own clause:\n\n```clojure\ntest\u003e (def new-sample\n         (stream/cond-sample stream-test/iris-data\n                             #(= \"Iris-setosa\" (last %)) [1 1 :rate true]\n                             #(= \"Iris-versicolor\" (last %)) [1 2 :rate true]))\ntest\u003e (frequencies (map last new-sample))\n{\"Iris-setosa\" 50, \"Iris-versicolor\" 23}\n```\n\n### Multi-Sample\n\nThe `stream/multi-sample` fn can be used to generate multiple\nsamplings in one pass over the population.  The fn takes the\npopulation followed by sets of sampling parameters, one for each\ndesired sampling.\n\nEach set of sample parameters should be composed of a consumer fn, the\nsample size, the population size, and optionally the parameters\n`:replace`, `:seed`, and `:rate`.\n\n`multi-sample` will generate a unique sampling for every parameter\nset.  Whenever a value is sampled, it will be consumed by the\nparameter set's consumer fn.  A consumer fn should accept a single\nparameter.\n\nAs an example, let's imagine we're running a retail store and want to\ndistribute awards to the stream of customers entering the store.  
To\ndo this we'll create two samplings from the customer stream: 1 out of\n100 will win a gift certificate and 1 out of 500 will win a Hawaiian\nvacation.\n\n```clojure\ntest\u003e (defn award-gift-certificate! [customer-id]\n        (println \"Customer\" customer-id \"wins a gift certificate.\"))\ntest\u003e (defn award-hawaiian-vacation! [customer-id]\n        (println \"Customer\" customer-id \"wins a Hawaiian vacation.\"))\ntest\u003e (def customer-ids (range 1000))\ntest\u003e (stream/multi-sample customer-ids\n                           [award-gift-certificate! 1 100 :rate true]\n                           [award-hawaiian-vacation! 1 500 :rate true])\nCustomer 161 wins a Hawaiian vacation.\nCustomer 427 wins a gift certificate.\nCustomer 627 wins a gift certificate.\nCustomer 646 wins a gift certificate.\nCustomer 661 wins a gift certificate.\nCustomer 731 wins a gift certificate.\nCustomer 745 wins a gift certificate.\nCustomer 786 wins a gift certificate.\nCustomer 794 wins a gift certificate.\nCustomer 833 wins a Hawaiian vacation.\nCustomer 836 wins a gift certificate.\n```\n\n### Multi-Reduce\n\n`multi-reduce` is very similar to `multi-sample`, except every set of\nsample parameters defines a sampling along with a reduction function.\nSo each set of sample parameters should be composed of a reduce fn, an\ninitial reduce value, the sample size, the population size, and\noptionally the `:replace`, `:seed`, and `:rate` parameters.\n\n`multi-reduce` will return a seq of values, each value being the final\nreduction for a sampling.  
A reducer fn should accept two parameters.\n\nAn example:\n\n```clojure\ntest\u003e (stream/multi-reduce (range) [+ 0 20 30 :seed 3]\n                                   [+ 0 20 30 :seed 4])\n(273 330)\n```\n\n## License\n\nCopyright (C) 2013-2018 BigML Inc.\n\nDistributed under the Apache License, Version 2.0.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigmlcom%2Fsampling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbigmlcom%2Fsampling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigmlcom%2Fsampling/lists"}