 ###### A comprehensive tutorial on MLflow for MLOps: From experimentation to production
2022-07-30 ###### A simple tutorial on Sampling Importance and Monte Carlo with Python codes
2022-08-01

Reservoir sampling is a fascinating algorithm that is especially useful when you have to deal with streaming data, which is usually what you have in production.

## When/why is Reservoir Sampling useful?

The main benefit of Reservoir Sampling is that it provides an upper bound on memory usage that is invariant of the input stream length. This allows sampling input streams that are far larger in size than the available memory of a system. It is also useful in situations where computations on a small sample are both representative of the complete population and faster/more efficient than operating on the complete input.

In general, Reservoir Sampling is well suited to situations where data arrives sequentially over time and/or is too large to fit entirely into memory.

For example:

• Streaming data calculations.
• Testing/verifying a system in production.
• Checking every input would impact performance too much.
• Estimating aggregate properties of a large dataset
• The total number of records satisfying a predicate.
• Type inference on semi-structured data.
• Infer the type of each column in a large CSV file without exhaustively checking every row.

Imagine you have an incoming stream of tweets and you want to sample a certain number, k, of tweets to do analysis or train a model on. You don’t know how many tweets there are, but you know you can’t fit them all in memory, which means you don’t know in advance the probability at which a tweet should be selected. You want to ensure that:
• Every tweet has an equal probability of being selected.
• You can stop the algorithm at any time and the tweets are sampled with the correct probability.
One solution for this problem is reservoir sampling. The algorithm involves a reservoir, which can be an array, and consists of three steps:
1. Put the first k elements into the reservoir.
2. For each incoming nth element, generate a random number i such that 1 ≤ i ≤ n.
3. If 1 ≤ i ≤ k: replace the ith element in the reservoir with the nth element. Else, do
nothing.

This means that each incoming nth element has k/n probability of being in the reservoir. You can also prove that each element in the reservoir has k/n probability of being there. This means that all samples have an equal chance of being selected. If we stop the algorithm at any time, all samples in the reservoir have been sampled with the correct probability. Figure 4-2 shows an illustrative example of how reservoir sampling works.

### Problem:

• Choose `k` entries from `n` numbers. Make sure each number is selected with the probability of `k/n`

### Basic idea:

• Choose `1, 2, 3, ..., k` first and put them into the reservoir.
• For `k+1`, pick it with a probability of `k/(k+1)`, and randomly replace a number in the reservoir.
• For `k+i`, pick it with a probability of `k/(k+i)`, and randomly replace a number in the reservoir.
• Repeat until `k+i` reaches `n`

### Proof:

• For `k+i`, the probability that it is selected and will replace a number in the reservoir is `k/(k+i)`
• For a number in the reservoir before (let’s say `X`), the probability that it keeps staying in the reservoir is
• `P(X was in the reservoir last time)` × `P(X is not replaced by k+i)`
• `P(X was in the reservoir last time)` × (`1` – `P(k+i is selected and replaces X)`)
• `k/(k+i-1)` × （`1` – `k/(k+i)` × `1/k`
• `k/(k+i)`
• When `k+i` reaches `n`, the probability of each number staying in the reservoir is `k/n`

### Example

• Choose `3` numbers from `[111, 222, 333, 444]`. Make sure each number is selected with a probability of `3/4`
• First, choose `[111, 222, 333]` as the initial reservoir
• Then choose `444` with a probability of `3/4`
• For `111`, it stays with a probability of
• `P(444 is not selected)` + `P(444 is selected but it replaces 222 or 333)`
• `1/4` + `3/4`*`2/3`
• `3/4`
• The same case with `222` and `333`
• Now all the numbers have the probability of `3/4` to be picked

Resources:

https://gregable.com/2007/10/reservoir-sampling.html 