https://github.com/aluriak/linear_choosens

Choose randomly n in k elements, in linear time and constant memory complexity
https://github.com/aluriak/linear_choosens

benchmark choice python random

Last synced: 7 months ago
JSON representation

Choose randomly n in k elements, in linear time and constant memory complexity

Host: GitHub
URL: https://github.com/aluriak/linear_choosens
Owner: Aluriak
Created: 2016-08-07T17:24:56.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2018-12-06T10:47:46.000Z (almost 7 years ago)
Last Synced: 2025-01-06T05:12:10.982Z (9 months ago)
Topics: benchmark, choice, python, random
Language: Python
Size: 57.5 MB
Stars: 3
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.mkd

Awesome Lists containing this project

README

# n choose k problem, and related implementations
Comparison of four Python implementations of the n choose k problem: *Choose randomly n elements in a list of k elements*.
The distribution should be uniform: equiprobability is a necessary condition.

This works takes its roots from [this blog post](https://getkerf.wordpress.com/2016/03/30/the-best-algorithm-no-one-knows-about/),that explain both interests and history of the problem and its solutions.

## Notes on the linear implementation
The complexity in time is linear, and constant in memory (if you do not save the choosen items).

The linear implementation provides not only a performance boost compared to stdlib when the number of item is greater than 20 in the tests, but also a greater flexibility: you don't have to provides the full items in order to walk them, just their number.

You can therefore choose N elements into a generator of K elements, without having to store the full generator
or even the N elements if you treat them immediatly.

The current implementation of linear_choosen implements these uses, and is a standalone ;
you can therefore copy-paste it whereever you need it.

## Repository

- **plot.py:** plotting with matplotlib
- **test.py:** launch benchmarks
- **linear_choosen.py:** methods implementation
- **results/:** images outputs

## Implemented methods
The four following methods are implemented. The third one is the main interest of this repository.
Further formal analysis of the method needs to be done.

### stdlib
The `random.sample` method do exactly the job. It automatically choose between two internal implementations,
function to n. [Link to the doc](https://hg.python.org/cpython/file/3.5/Lib/random.py#l280).

[Source](https://github.com/Aluriak/linear_choosens/blob/master/linear_choosens.py#L11).

### dumb
The very obvious *mix it, then take the n firsts*.
[Very costly](https://github.com/Aluriak/linear_choosens#runtime-comparison).

[Source](https://github.com/Aluriak/linear_choosens/blob/master/linear_choosens.py#L16).

### linear
This algorithm is detailed in [Vitter paper](http://www.ittc.ku.edu/~jsv/Papers/Vit87.RandomSampling.pdf)
*An Efficient Algorithm for Sequential Random Sampling*,
published in 1987, with the mathematical proof.
You can also find it in Knutt's *Seminumerical Algorithms*.
(source: Jon Bentley's *Programming Pearls*)

This is an implementation proposal, that could probably be improved.
Its less efficient than stdlib when n is small, but notably quicker when n comes near k.

The first element have a `n/k` likelihood to be in the output subset.
If this element is choosen, then find the next choosens is like n-1 choose k-1.
This treatment is recursively applied on the k items.

This method allows an O(1) complexity in memory, and a O(k) complexity in time (worst case),
because elements are walked only once, and deciding whether an element is choosen or not
is a comparison of a random number against a likelihood.
The subset is consequently constructed during the walk of all elements.
All the k elements don't need to be walked, when the search is 0 choose k-i.

[Source](https://github.com/Aluriak/linear_choosens/blob/master/linear_choosens.py#L31).

### (linear) recursive
Purely recursive implementation of the linear algorithm.
Slower, don't work on big dataset without modification of python stack size.

[Source](https://github.com/Aluriak/linear_choosens/blob/master/linear_choosens.py#L65).

## Runtime comparison
![runtimes](results/runtime_100.png)
The stdlib implementation is better when n is small, but the linear implementation
is quicker in other cases.

## linear method distribution
Following graphics don't show any obvious distribution bias between firsts and lasts elements.

![runtimes](./results/benchmark_l_100_1000_100.png)

![runtimes](./results/benchmark_l_900_1000_100.png)

These results are comparable to stdlib:

![runtimes](./results/benchmark_s_100_1000_100.png)

![runtimes](./results/benchmark_s_900_1000_100.png)

And to the dumb method:

![runtimes](./results/benchmark_d_100_1000_100.png)

![runtimes](./results/benchmark_d_900_1000_100.png)

Further analysis of the data could highlight a bias induced by the linear method.
If so, it is probably a bias of implementation (bad source code) or a random bias (bad random generator).
The method itself is proven.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/aluriak/linear_choosens

Awesome Lists containing this project

README