|
1 | 1 | KLL Sketch |
2 | 2 | ---------- |
| 3 | +Implementation of a very compact quantiles sketch with a lazy compaction scheme
| 4 | +and nearly optimal accuracy per retained item. |
| 5 | +See `Optimal Quantile Approximation in Streams`. |
| 6 | + |
| 7 | +This is a stochastic streaming sketch that enables near real-time analysis of the |
| 8 | +approximate distribution of items from a very large stream in a single pass, requiring only |
| 9 | +that the items are comparable. |
| 10 | +The analysis is obtained using the `get_quantile()` function or the
| 11 | +inverse functions `get_rank()`, `get_pmf()` (Probability Mass Function), and `get_cdf()` |
| 12 | +(Cumulative Distribution Function). |
| 13 | + |
| 14 | +As of May 2020, this implementation produces serialized sketches which are binary-compatible |
| 15 | +with the equivalent Java implementation only when template parameter `T = float` |
| 16 | +(32-bit single precision values). |
| 17 | + |
| 18 | +Given an input stream of `N` items, the `natural rank` of any specific |
| 19 | +item is defined as its index `(1 to N)` in inclusive mode |
| 20 | +or `(0 to N-1)` in exclusive mode |
| 21 | +in the hypothetical sorted stream of all `N` input items. |
| 22 | + |
| 23 | +The `normalized rank` (`rank`) of any specific item is defined as its |
| 24 | +`natural rank` divided by `N`. |
| 25 | +Thus, the `normalized rank` is between zero and one. |
| 26 | +In the documentation for this sketch `natural rank` is never used so any |
| 27 | +reference to just `rank` should be interpreted to mean `normalized rank`. |
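The two rank definitions above can be illustrated with a small pure-Python sketch. The helper names `natural_rank` and `normalized_rank` are hypothetical and are not part of the sketch API; the example also assumes all items are distinct.

```python
def natural_rank(sorted_items, item, inclusive=True):
    """Index of `item` in the sorted stream: 1..N (inclusive) or 0..N-1 (exclusive)."""
    idx = sorted_items.index(item)  # 0-based position; assumes distinct items
    return idx + 1 if inclusive else idx


def normalized_rank(sorted_items, item, inclusive=True):
    """Natural rank divided by N, so the result lies between zero and one."""
    return natural_rank(sorted_items, item, inclusive) / len(sorted_items)


stream = [30, 10, 50, 20, 40]
sorted_stream = sorted(stream)  # the hypothetical sorted stream of all N items

rank_inclusive = normalized_rank(sorted_stream, 30, inclusive=True)
rank_exclusive = normalized_rank(sorted_stream, 30, inclusive=False)
```

Here the item 30 has natural rank 3 in inclusive mode and 2 in exclusive mode, so the two modes differ by exactly `1/N` for distinct items.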
| 28 | + |
| 29 | +This sketch is configured with a parameter `k`, which affects the size of the sketch |
| 30 | +and its estimation error. |
| 31 | + |
| 32 | +The estimation error is commonly called `epsilon` (or `eps`) and is a fraction |
| 33 | +between zero and one. Larger values of `k` result in smaller values of `epsilon`. |
| 34 | +Epsilon is always with respect to the rank and cannot be applied to the |
| 35 | +corresponding items. |
| 36 | + |
| 37 | +The relationship between the `normalized rank` and the corresponding items can be viewed |
| 38 | +as a two-dimensional monotonic plot with the `normalized rank` on one axis and the |
| 39 | +corresponding items on the other axis. If the y-axis is specified as the item-axis and |
| 40 | +the x-axis as the `normalized rank`, then `y = get_quantile(x)` is a monotonically |
| 41 | +increasing function. |
| 42 | + |
| 43 | +The function `get_quantile(rank)` translates ranks into |
| 44 | +corresponding quantiles. The functions `get_rank(item)`, |
| 45 | +`get_cdf(...)` (Cumulative Distribution Function), and `get_pmf(...)` |
| 46 | +(Probability Mass Function) perform the opposite operation and translate items into ranks. |
| 47 | + |
| 48 | +The `get_pmf(...)` function has about 13 to 47% worse rank error (depending
| 49 | +on `k`) than the other queries. The mass of each "bin" of the PMF is computed
| 50 | +by subtracting the ranks of its lower and upper edges, so it has "double-sided"
| 51 | +error: the errors from the two edges can sometimes add.
| 52 | + |
| 53 | +The default `k` of 200 yields a "single-sided" `epsilon` of about 1.33% and a |
| 54 | +"double-sided" (PMF) `epsilon` of about 1.65%. |
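As a quick arithmetic check of the figures quoted above, the following lines compare the two `epsilon` values for the default `k` and translate the single-sided error into stream positions. The stream length `n` is an arbitrary value chosen for illustration.

```python
single_sided = 0.0133  # epsilon for get_quantile / get_rank at k = 200
double_sided = 0.0165  # epsilon for get_pmf at k = 200

# Relative increase of the PMF error over the single-sided error.
relative_increase = double_sided / single_sided - 1.0

# For a stream of n items, a rank error of eps spans about eps * n positions.
n = 1_000_000  # hypothetical stream length
positions = round(single_sided * n)
```

The relative increase comes out at roughly 24%, inside the 13 to 47% range stated for `get_pmf(...)`, and the single-sided error corresponds to about 13,300 positions in a stream of one million items.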
| 55 | + |
| 56 | +A `get_quantile(rank)` query has the following guarantees: |
| 57 | +- Let `q = get_quantile(r)` where `r` is the rank between zero and one. |
| 58 | +- The quantile `q` will be an item from the input stream. |
| 59 | +- Let `true_rank` be the true rank of `q` derived from the hypothetical sorted |
| 60 | +stream of all `N` items. |
| 61 | +- Let `eps = get_normalized_rank_error(false)`. |
| 62 | +- Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%. Note that the |
| 63 | +error is on the rank, not the quantile. |
| 64 | + |
| 65 | +A `get_rank(item)` query has the following guarantees: |
| 66 | +- Let `r = get_rank(i)` where `i` is an item between the min and max items of |
| 67 | +the input stream. |
| 68 | +- Let `true_rank` be the true rank of `i` derived from the hypothetical sorted |
| 69 | +stream of all `N` items. |
| 70 | +- Let `eps = get_normalized_rank_error(false)`. |
| 71 | +- Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%. |
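The rank guarantee shared by `get_quantile(rank)` and `get_rank(item)` can be expressed as a simple check against an exact reference. This is a minimal sketch, not the library's implementation: `true_rank` and `satisfies_guarantee` are hypothetical helpers, the rank is computed in inclusive mode, and the "estimated" quantiles are made-up values standing in for sketch output.

```python
from bisect import bisect_right


def true_rank(sorted_items, item):
    """Exact inclusive normalized rank: fraction of items <= item."""
    return bisect_right(sorted_items, item) / len(sorted_items)


def satisfies_guarantee(r, sorted_items, q, eps):
    """True if r - eps <= true_rank(q) <= r + eps."""
    t = true_rank(sorted_items, q)
    return r - eps <= t <= r + eps


data = list(range(1, 1001))  # sorted stream 1..1000
eps = 0.0133                 # single-sided epsilon quoted for k = 200

# Pretend a sketch answered the rank-0.5 query with 507 (true rank 0.507).
ok = satisfies_guarantee(0.5, data, 507, eps)
# An answer of 520 (true rank 0.52) would fall outside the bound.
bad = satisfies_guarantee(0.5, data, 520, eps)
```

Note that the check is entirely in rank space: nothing is said about how far 507 is from 500 as a value.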
| 72 | + |
| 73 | +A `get_pmf(...)` query has the following guarantees:
| 74 | +- Let `{r1, r2, ..., r(m+1)} = get_pmf(s1, s2, ..., sm)` where `s1, s2, ..., sm` are
| 75 | +split points (items from the input domain) between the min and max items of
| 76 | +the input stream.
| 77 | +- Let `mass_i` be the estimated mass between `s_i` and `s_(i+1)`.
| 78 | +- Let `true_mass` be the true mass between the items `s_i` and
| 79 | +`s_(i+1)`, derived from the hypothetical sorted stream of all `N` items.
| 80 | +- Let `eps = get_normalized_rank_error(true)`.
| 81 | +- Then `mass_i - eps ≤ true_mass ≤ mass_i + eps` with a confidence of 99%.
| 82 | +- `r(m+1)` includes the mass of all points larger than `s_m`. |
| 83 | + |
| 84 | +A `get_cdf(...)` query has the following guarantees:
| 85 | +- Let `{r1, r2, ..., r(m+1)} = get_cdf(s1, s2, ..., sm)` where `s1, s2, ..., sm` are
| 86 | +split points (items from the input domain) between the min and max items of
| 87 | +the input stream.
| 88 | +- Let `mass_i = r_(i+1) - r_i`.
| 89 | +- Let `true_mass` be the true mass between the true ranks of `s_i` and
| 90 | +`s_(i+1)`, derived from the hypothetical sorted stream of all `N` items.
| 91 | +- Let `eps = get_normalized_rank_error(true)`.
| 92 | +- Then `mass_i - eps ≤ true_mass ≤ mass_i + eps` with a confidence of 99%.
| 93 | +- `r(m+1)` is always 1.0, so `1 - r_m` is the mass of all points larger than `s_m`.
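To make the shape of the PMF and CDF results concrete, here is an exact (non-sketch) reference computation in pure Python. It is only an illustration under stated assumptions: `exact_cdf` and `exact_pmf` are hypothetical helpers, ranks are inclusive (items `<=` the split point), and the final CDF entry is taken to be 1.0.

```python
from bisect import bisect_right


def exact_cdf(sorted_items, split_points):
    """m split points -> m + 1 cumulative ranks, the last one being 1.0."""
    n = len(sorted_items)
    ranks = [bisect_right(sorted_items, s) / n for s in split_points]
    return ranks + [1.0]


def exact_pmf(sorted_items, split_points):
    """m split points -> m + 1 bin masses, obtained by differencing the CDF."""
    cdf = exact_cdf(sorted_items, split_points)
    return [cdf[0]] + [hi - lo for lo, hi in zip(cdf, cdf[1:])]


data = list(range(1, 101))  # sorted stream 1..100
splits = [25, 50, 75]       # m = 3 split points

cdf = exact_cdf(data, splits)
pmf = exact_pmf(data, splits)
```

Each interior PMF mass is the difference of two CDF ranks, which is why an estimated mass carries error from both of its edges. The last PMF entry holds the mass of all points larger than the last split point.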
| 94 | + |
| 95 | +From the above, it might seem like we could make some estimates to bound the |
| 96 | +`item` returned from a call to `get_quantile()`. The sketch, however, does not |
| 97 | +let us derive error bounds or confidences around items. Because errors are independent, we |
| 98 | +can approximately bracket a value as shown below, but there are no error estimates available. |
| 99 | +Additionally, the interval may be quite large for certain distributions. |
| 100 | +- Let `q = get_quantile(r)`, the estimated quantile of rank `r`. |
| 101 | +- Let `eps = get_normalized_rank_error(false)`. |
| 102 | +- Let `q_lo = estimated quantile of rank (r - eps)`. |
| 103 | +- Let `q_hi = estimated quantile of rank (r + eps)`. |
| 104 | +- Then `q_lo ≤ q ≤ q_hi`, with 99% confidence. |
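The bracketing steps above can be sketched as follows. An exact quantile function is used as a stand-in for `get_quantile()`, so this only shows the mechanics. `exact_quantile` and `bracket` are hypothetical helpers, and clamping the rank to the range zero to one is an assumption made for this example.

```python
import math


def exact_quantile(sorted_items, rank):
    """Smallest item whose inclusive normalized rank is >= rank."""
    n = len(sorted_items)
    rank = min(max(rank, 0.0), 1.0)        # clamp to [0, 1]
    idx = max(math.ceil(rank * n) - 1, 0)  # 0-based index into the sorted stream
    return sorted_items[idx]


def bracket(quantile_fn, r, eps):
    """Return (q_lo, q, q_hi) for ranks r - eps, r, and r + eps."""
    return quantile_fn(r - eps), quantile_fn(r), quantile_fn(r + eps)


data = list(range(1, 1001))  # sorted stream 1..1000
eps = 0.0133                 # single-sided epsilon quoted for k = 200

q_lo, q, q_hi = bracket(lambda r: exact_quantile(data, r), 0.5, eps)
```

For this uniform stream the interval is narrow, but for skewed or heavy-tailed data the same rank interval can map to a very wide range of values.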
| 105 | + |
| 106 | + |
| 107 | + |
3 | 108 |
|
4 | 109 | .. autoclass:: _datasketches.kll_ints_sketch |
5 | 110 | :members: |
|