https://github.com/aluriak/linear_choosens
Choose randomly n in k elements, in linear time and constant memory complexity
https://github.com/aluriak/linear_choosens
benchmark choice python random
Last synced: 7 months ago
JSON representation
Choose randomly n in k elements, in linear time and constant memory complexity
- Host: GitHub
- URL: https://github.com/aluriak/linear_choosens
- Owner: Aluriak
- Created: 2016-08-07T17:24:56.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2018-12-06T10:47:46.000Z (almost 7 years ago)
- Last Synced: 2025-01-06T05:12:10.982Z (9 months ago)
- Topics: benchmark, choice, python, random
- Language: Python
- Size: 57.5 MB
- Stars: 3
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.mkd
Awesome Lists containing this project
README
# n choose k problem, and related implementations
Comparison of four Python implementations of the n choose k problem: *Choose randomly n elements in a list of k elements*.
The distribution should be uniform: equiprobability is a necessary condition.This works takes its roots from [this blog post](https://getkerf.wordpress.com/2016/03/30/the-best-algorithm-no-one-knows-about/),that explain both interests and history of the problem and its solutions.
## Notes on the linear implementation
The complexity in time is linear, and constant in memory (if you do not save the choosen items).The linear implementation provides not only a performance boost compared to stdlib when the number of item is greater than 20 in the tests, but also a greater flexibility: you don't have to provides the full items in order to walk them, just their number.
You can therefore choose N elements into a generator of K elements, without having to store the full generator
or even the N elements if you treat them immediatly.The current implementation of linear_choosen implements these uses, and is a standalone ;
you can therefore copy-paste it whereever you need it.## Repository
- **plot.py:** plotting with matplotlib
- **test.py:** launch benchmarks
- **linear_choosen.py:** methods implementation
- **results/:** images outputs## Implemented methods
The four following methods are implemented. The third one is the main interest of this repository.
Further formal analysis of the method needs to be done.### stdlib
The `random.sample` method do exactly the job. It automatically choose between two internal implementations,
function to n. [Link to the doc](https://hg.python.org/cpython/file/3.5/Lib/random.py#l280).[Source](https://github.com/Aluriak/linear_choosens/blob/master/linear_choosens.py#L11).
### dumb
The very obvious *mix it, then take the n firsts*.
[Very costly](https://github.com/Aluriak/linear_choosens#runtime-comparison).[Source](https://github.com/Aluriak/linear_choosens/blob/master/linear_choosens.py#L16).
### linear
This algorithm is detailed in [Vitter paper](http://www.ittc.ku.edu/~jsv/Papers/Vit87.RandomSampling.pdf)
*An Efficient Algorithm for Sequential Random Sampling*,
published in 1987, with the mathematical proof.
You can also find it in Knutt's *Seminumerical Algorithms*.
(source: Jon Bentley's *Programming Pearls*)This is an implementation proposal, that could probably be improved.
Its less efficient than stdlib when n is small, but notably quicker when n comes near k.The first element have a `n/k` likelihood to be in the output subset.
If this element is choosen, then find the next choosens is like n-1 choose k-1.
This treatment is recursively applied on the k items.This method allows an O(1) complexity in memory, and a O(k) complexity in time (worst case),
because elements are walked only once, and deciding whether an element is choosen or not
is a comparison of a random number against a likelihood.
The subset is consequently constructed during the walk of all elements.
All the k elements don't need to be walked, when the search is 0 choose k-i.[Source](https://github.com/Aluriak/linear_choosens/blob/master/linear_choosens.py#L31).
### (linear) recursive
Purely recursive implementation of the linear algorithm.
Slower, don't work on big dataset without modification of python stack size.[Source](https://github.com/Aluriak/linear_choosens/blob/master/linear_choosens.py#L65).
## Runtime comparison

The stdlib implementation is better when n is small, but the linear implementation
is quicker in other cases.## linear method distribution
Following graphics don't show any obvious distribution bias between firsts and lasts elements.

These results are comparable to stdlib:


And to the dumb method:


Further analysis of the data could highlight a bias induced by the linear method.
If so, it is probably a bias of implementation (bad source code) or a random bias (bad random generator).
The method itself is proven.