Reservoir sampling is a fascinating algorithm that is especially useful when you have to deal with streaming data, which is usually what you have in production.
The main benefit of reservoir sampling is that it puts an upper bound on memory usage that is independent of the length of the input stream. This makes it possible to sample streams that are far larger than the available memory of a system. It is also useful in situations where computations on a small sample are both representative of the complete population and faster or more efficient than operating on the complete input.
In general, Reservoir Sampling is well suited to situations where data arrives sequentially over time and/or is too large to fit entirely into memory.
For example:
Imagine you have an incoming stream of tweets and you want to sample a certain number, k, of tweets to do analysis or train a model on. You don’t know how many tweets there are, but you know you can’t fit them all in memory, which means you don’t know in advance the probability at which a tweet should be selected. You want to ensure that:
• Every tweet has an equal probability of being selected.
• You can stop the algorithm at any time and the tweets are sampled with the correct probability.
One solution for this problem is reservoir sampling. The algorithm involves a reservoir, which can be an array, and consists of three steps:
1. Put the first k elements into the reservoir.
2. For each incoming nth element, generate a random number i such that 1 ≤ i ≤ n.
3. If 1 ≤ i ≤ k: replace the ith element in the reservoir with the nth element. Else, do nothing.
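The three steps above translate almost directly into code. Here is a minimal Python sketch (the function name and signature are my own, not from any library):

```python
import random

def reservoir_sample(stream, k):
    """Uniformly sample up to k items from an iterable of unknown length."""
    reservoir = []
    for n, element in enumerate(stream, start=1):
        if n <= k:
            # Step 1: put the first k elements into the reservoir.
            reservoir.append(element)
        else:
            # Step 2: generate a random number i with 1 <= i <= n.
            i = random.randint(1, n)
            # Step 3: if 1 <= i <= k, replace the ith reservoir element.
            if i <= k:
                reservoir[i - 1] = element
    return reservoir
```

For the tweet example, something like `reservoir_sample(tweet_stream, k=1000)` would keep a uniform sample of 1,000 tweets. Because the invariant holds after every iteration, the loop can also be stopped at any time and the reservoir is still a valid uniform sample of everything seen so far.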
This means that each incoming nth element has k/n probability of being in the reservoir. You can also prove that each element in the reservoir has k/n probability of being there. This means that all samples have an equal chance of being selected. If we stop the algorithm at any time, all samples in the reservoir have been sampled with the correct probability. Figure 4-2 shows an illustrative example of how reservoir sampling works.
Figure 4-2. Choose k entries from n numbers so that each number is selected with the probability of k/n:

1. Choose 1, 2, 3, ..., k first and put them into the reservoir.
2. When k+1 comes, pick it with a probability of k/(k+1), and have it randomly replace a number in the reservoir.
3. When k+i comes, pick it with a probability of k/(k+i), and have it randomly replace a number in the reservoir.
4. Repeat until k+i reaches n.
Why does this give every number a probability of k/n? When number k+i comes, the probability that it is selected and will replace a number in the reservoir is k/(k+i). For a number X that is already in the reservoir, the probability that it keeps staying in the reservoir is:

P(X was in the reservoir last time) × P(X is not replaced by k+i)
= P(X was in the reservoir last time) × (1 − P(k+i is selected and replaces X))
= k/(k+i−1) × (1 − k/(k+i) × 1/k)
= k/(k+i)

So when k+i reaches n, the probability of each number staying in the reservoir is k/n.
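This k/n property is easy to check empirically. The sketch below (function and variable names are my own; it inlines the three-step algorithm described above) repeatedly samples k = 3 out of n = 10 elements and tallies how often each element ends up in the reservoir:

```python
import random
from collections import Counter

def reservoir_sample(stream, k):
    reservoir = []
    for n, element in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(element)
        else:
            i = random.randint(1, n)
            if i <= k:
                reservoir[i - 1] = element
    return reservoir

# Sample k = 3 out of n = 10 elements many times and count how often each
# element lands in the reservoir; every frequency should approach k/n = 0.3.
random.seed(42)
trials = 20_000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(range(10), 3))

frequencies = {element: counts[element] / trials for element in range(10)}
```

With 20,000 trials, each empirical frequency should land within a couple of percentage points of 0.3, matching the k/n claim.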
As a concrete example, choose 3 numbers from [111, 222, 333, 444], making sure each number is selected with a probability of 3/4:

1. Take [111, 222, 333] as the initial reservoir.
2. Pick 444 with a probability of 3/4, and have it randomly replace one of the three numbers in the reservoir.
3. For 111, it stays with a probability of

P(444 is not selected) + P(444 is selected but it replaces 222 or 333)
= 1/4 + 3/4 × 2/3
= 3/4

Similarly, 222 and 333 each have a probability of 3/4 to be picked.

Resources:
https://gregable.com/2007/10/reservoir-sampling.html
https://towardsdatascience.com/reservoir-sampling-for-efficient-stream-processing-97f47f85c11b
https://steadbytes.com/blog/reservoir-sampling/
https://florian.github.io/reservoir-sampling/