Commit a83ebbc

Merge pull request #28 from apache/docs-update
Docs update
2 parents e10511a + 7aa4701 commit a83ebbc

12 files changed: 345 additions & 47 deletions

docs/README.md

Lines changed: 13 additions & 40 deletions
@@ -1,50 +1,23 @@
 Follow these steps to build the documentation.
 1. Clone the directory in an appropriate location `git clone https://github.com/apache/datasketches-python.git`
-2. Switch to the correct branch: `git checkout python-docs`.
-3. In project root run `source python-docs-venv/bin/activate`
-
-If there are problems running the virtual env then you may need to install `virtualenv`
-and install the packages manually as below
-(nb my environment has `python` aliased to `python3` so just use whichever is appropriate for your installation)
+2. Make a new branch and switch to that branch.
 ```
-python -m venv python-docs-venv # create a new virtual env named python-docs-venv
-source python-docs-venv/bin/activate
-python -m pip install sphinx
-python -m pip install sphinx-rtd-theme
+cd datasketches-python
+git branch new-branch
+git checkout new-branch
+```
+3. In project root, make a new virtual environment with the appropriate packages. Depending on how python is aliased in your environment, you may
+need `python` or `python3`, as indicated by `python(3)`.
 ```
-4. In project root run `python3 -m pip install .` to build the python bindings.
+python -m venv venv # create a new virtual env named venv using system python
+source venv/bin/activate
+python(3) -m pip install sphinx # now using venv python
+python(3) -m pip install sphinx-rtd-theme
+```
+4. In project root run `python(3) -m pip install .` to build the python bindings.
 5. Build and open the documentation:
 ```
 cd python/docs
 make html
 open build/html/index.html
 ```
-
-## Problems
-The `density_sketch` and `tuple_sketch` are not yet included.
-I have not included the file to avoid cluttering the PR with things that may not work.
-You can easily include them by making a `density_sketch.rst` file in the same location as
-all of the other `X.rst` files for the sketches and copying in the following:
-
-```
-Density Sketch
---------------
-
-.. autoclass:: datasketches.density_sketch
-   :members:
-   :undoc-members:
-
-.. autoclass:: datasketches.GaussianKernel
-   :members:
-```
-
-Additionally, you will need to add the below to `index.rst`
-```
-Density Estimation
-##################
-
-.. toctree::
-   :maxdepth: 1
-
-density_sketch
-```
-

docs/source/conf.py

Lines changed: 2 additions & 3 deletions
@@ -8,7 +8,6 @@

 import sys
 import os
-import datasketches

 # need to fix the paths so that sphinx can find the source code.
 sys.path.insert(0, os.path.abspath("../../datasketches"))
@@ -19,12 +18,12 @@
 copyright = '2023'
 author = 'Apache Software Foundation'
 release = '0.1'
+add_module_names = False

 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

-extensions = ["sphinx.ext.autodoc","sphinx.ext.autosummary"]
-
+extensions = ["sphinx.ext.autodoc"]
 templates_path = ['_templates']
 exclude_patterns = []


docs/source/count_min_sketch.rst

Lines changed: 23 additions & 1 deletion
@@ -1,6 +1,28 @@
 CountMin Sketch
 ---------------

-.. autoclass:: _datasketches.count_min_sketch
+The CountMin sketch, as described by Cormode and Muthukrishnan in
+http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf,
+is used for approximate frequency estimation.
+For an item :math:`x` with frequency :math:`f_x`, the sketch provides an estimate :math:`\hat{f_x}`
+such that :math:`\hat{f_x} \approx f_x`.
+The sketch guarantees that :math:`f_x \le \hat{f_x}` and provides a probabilistic upper bound which depends on the size parameters.
+The sketch provides an estimate of the occurrence frequency for any queried item but, in contrast
+to the Frequent Items sketch, this sketch does not provide a list of
+heavy hitters.
+
+.. currentmodule:: _datasketches
+
+.. autoclass:: count_min_sketch
    :members:
    :undoc-members:
+   :exclude-members: deserialize, suggest_num_buckets, suggest_num_hashes
+   :member-order: groupwise
+
+.. rubric:: Static Methods:
+
+.. automethod:: deserialize
+.. automethod:: suggest_num_buckets
+.. automethod:: suggest_num_hashes
+
+.. rubric:: Non-static Methods:

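The one-sided guarantee above (the estimate never undercounts, and overcounts only by a bounded amount) is easy to see in a minimal pure-Python rendition of the CountMin table. This is an illustrative toy, not the library's implementation; the class and parameter names here are invented for the example.

```python
import hashlib

class ToyCountMin:
    """Illustrative CountMin: a num_hashes x num_buckets grid of counters."""

    def __init__(self, num_hashes, num_buckets):
        self.num_hashes = num_hashes
        self.num_buckets = num_buckets
        self.table = [[0] * num_buckets for _ in range(num_hashes)]

    def _bucket(self, item, row):
        # derive an independent-ish hash per row by salting with the row index
        h = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.num_buckets

    def update(self, item, weight=1):
        for row in range(self.num_hashes):
            self.table[row][self._bucket(item, row)] += weight

    def get_estimate(self, item):
        # minimum over rows: collisions only ever inflate a counter,
        # so the estimate is always >= the true frequency
        return min(self.table[row][self._bucket(item, row)]
                   for row in range(self.num_hashes))

sk = ToyCountMin(num_hashes=5, num_buckets=128)
for word in ["a"] * 10 + ["b"] * 3 + ["c"]:
    sk.update(word)
est_a = sk.get_estimate("a")  # >= 10 by the one-sided guarantee
```

Taking the minimum across rows is what turns many noisy overestimates into one tight upper bound; the real sketch sizes `num_buckets` and `num_hashes` from the desired relative error and confidence.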
docs/source/cpc.rst

Lines changed: 18 additions & 1 deletion
@@ -1,7 +1,24 @@
 Compressed Probabilistic Counting (CPC)
 ---------------------------------------
-The *Compressed Probabilistic Counting* sketch is a space-efficient method for estimating cardinalities of sets.
+High-performance C++ implementation of the Compressed Probabilistic Counting (CPC) sketch.
+This is a unique-counting sketch that implements the Compressed Probabilistic Counting (CPC, a.k.a. FM85) algorithms developed by Kevin Lang in his paper
+`Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm <https://arxiv.org/abs/1708.06839>`_.
+This sketch is extremely space-efficient when serialized.
+In an apples-to-apples empirical comparison against compressed HyperLogLog sketches, this new algorithm simultaneously wins on the two dimensions of the space/accuracy trade-off and produces sketches that are smaller than the entropy of HLL, so no possible implementation of compressed HLL can match its space efficiency for a given accuracy. As described in the paper, this sketch implements a newly developed ICON estimator algorithm that survives unioning operations; another well-known estimator, the Historical Inverse Probability (HIP) estimator, does not.
+The update speed of this sketch is quite fast and comparable to that of HLL.
+The unioning (merging) capability of this sketch also allows merging of sketches with different configurations of K.
+For additional security, this sketch can be configured with a user-specified hash seed.

 .. autoclass:: _datasketches.cpc_sketch
    :members:
    :undoc-members:
+   :exclude-members: deserialize
+   :member-order: groupwise
+
+.. rubric:: Static Methods:
+
+.. automethod:: deserialize
+
+.. rubric:: Non-static Methods:

docs/source/density_sketch.rst

Lines changed: 9 additions & 0 deletions
@@ -1,5 +1,14 @@
 Density Sketch
 --------------
+Builds a coreset from the given set of input points.
+Provides a density estimate at a given point.
+
+Based on the following paper: Zohar Karnin, Edo Liberty,
+"Discrepancy, Coresets, and Sketches in Machine Learning",
+https://proceedings.mlr.press/v99/karnin19a/karnin19a.pdf
+
+Inspired by the following implementation: https://github.com/edoliberty/streaming-quantiles/blob/f688c8161a25582457b0a09deb4630a81406293b/gde.py
+
 .. autoclass:: datasketches.density_sketch
    :members:
    :undoc-members:

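For intuition, the quantity the density sketch estimates can be written down (without any coreset compression) as a plain Gaussian kernel density estimate over a set of retained points. The function names below are invented for illustration; the real sketch evaluates the same kind of sum over a compressed, weighted coreset rather than over all input points.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * bandwidth ** 2))

def density_estimate(points, query, bandwidth=1.0):
    # average kernel value between the query and every retained point
    return sum(gaussian_kernel(p, query, bandwidth) for p in points) / len(points)

points = [(0.0, 0.0), (0.1, -0.1), (5.0, 5.0)]
near = density_estimate(points, (0.0, 0.0))   # close to a cluster: high density
far = density_estimate(points, (20.0, 20.0))  # far from all points: near zero
```

The coreset construction in the paper exists precisely so that this sum can be approximated from a small weighted subset instead of the full stream.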
docs/source/frequent_items.rst

Lines changed: 77 additions & 0 deletions
@@ -1,6 +1,83 @@
 Frequent Items
 --------------

+This sketch is useful for tracking approximate frequencies of items of type `<T>` with optional associated counts `(<T> item, int count)`
+that are members of a multiset of such items.
+The true frequency of an item is defined to be the sum of its associated counts.
+
+This implementation provides the following capabilities:
+
+* Estimate the *frequency* of an item.
+* Return *upper* and *lower bounds* for any item, such that the true frequency is always between the upper and lower bounds.
+* Return a global *maximum error* that holds for all items in the stream.
+* Return an array of frequent items under either a *NO_FALSE_POSITIVES* or a *NO_FALSE_NEGATIVES* error type.
+* *Merge* itself with another sketch object created from this class.
+* *Serialize/Deserialize* to/from a byte array.
+
+**Space Usage**
+
+The sketch is initialized with a maximum map size, `maxMapSize`, that specifies the maximum physical length of the internal hash map of the form `(<T> item, int count)`.
+The maximum map size is always a power of 2, defined through the parameter `lg_max_map_size`.
+
+The hash map starts at a very small size (8 entries) and grows as needed up to the specified maximum map size.
+
+Excluding external space required for the item objects, the internal memory space usage of this sketch is `18 * mapSize` bytes (assuming 8 bytes for each reference),
+plus a small constant number of additional bytes.
+The internal memory space usage of this sketch will never exceed `18 * maxMapSize` bytes, plus a small constant number of additional bytes.
+
+**Maximum Capacity of the Sketch**
+
+The `LOAD_FACTOR` for the hash map is internally set at :math:`75\%`, which means at any time the map capacity of `(item, count)` pairs is `mapCap = 0.75 * mapSize`.
+The maximum capacity of `(item, count)` pairs of the sketch is `maxMapCap = 0.75 * maxMapSize`.
+
+**Updating the sketch with `(item, count)` pairs**
+
+If the item is found in the hash map, the mapped count field (the "counter") is incremented by the incoming count; otherwise, a new `(item, count)` pair is created.
+If the number of tracked counters reaches the maximum capacity of the hash map, the sketch decrements all of the counters (by an approximately computed median)
+and removes any non-positive counters.
+
+**Accuracy**
+
+If fewer than `0.75 * maxMapSize` different items are inserted into the sketch, the estimated frequencies returned by the sketch will be exact.
+
+The logic of the frequent items sketch is such that the stored counts and true counts are never too different.
+More specifically, for any item, the sketch can return an estimate of the true frequency of the item, along with upper and lower bounds on the frequency (that hold deterministically).
+
+For this implementation and for a specific active item, it is guaranteed that the true frequency will be between the Upper Bound (UB) and the Lower Bound (LB) computed for that item.
+Specifically, :math:`(UB - LB) \le W \cdot \epsilon`, where :math:`W` denotes the sum of all item counts and :math:`\epsilon = 3.5/M`, where :math:`M` is the maxMapSize.
+
+This is a worst-case guarantee that applies to arbitrary inputs.
+For inputs typically seen in practice, `(UB - LB)` is usually much smaller.
+
+**Background**
+
+This code implements a variant of what is commonly known as the "Misra-Gries algorithm".
+Variants of it were discovered, rediscovered, and redesigned several times over the years:
+
+* *Finding repeated elements*, Misra, Gries, 1982
+* *Frequency estimation of Internet packet streams with limited space*, Demaine, Lopez-Ortiz, Munro, 2002
+* *A simple algorithm for finding frequent elements in streams and bags*, Karp, Shenker, Papadimitriou, 2003
+* *Efficient Computation of Frequent and Top-k Elements in Data Streams*, Metwally, Agrawal, Abbadi, 2006
+
+For speed, we do employ some randomization that introduces a small probability that our proof of the worst-case bound might not apply to a given run.
+However, we have ensured that this probability is extremely small.
+For example, if the stream causes one table purge (rebuild), our proof of the worst-case bound applies with a probability of at least `1 - 1E-14`.
+If the stream causes `1E9` purges, our proof applies with a probability of at least `1 - 1E-5`.
+
+Parameter: `<T>` The type of item to be tracked by this sketch
+
 .. autoclass:: _datasketches.frequent_items_sketch
    :members:
    :undoc-members:
+   :exclude-members: deserialize, get_epsilon_for_lg_size, get_apriori_error
+   :member-order: groupwise
+
+.. rubric:: Static Methods:
+
+.. automethod:: deserialize
+.. automethod:: get_epsilon_for_lg_size
+.. automethod:: get_apriori_error
+
+.. rubric:: Non-static Methods:

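The decrement-and-purge behavior described under "Updating the sketch" is the classic Misra-Gries step. A minimal pure-Python rendition, which decrements by 1 rather than by an approximate median and uses invented names, looks like this:

```python
def misra_gries(stream, max_counters):
    """Toy Misra-Gries: track at most max_counters (item, count) pairs."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < max_counters:
            counters[item] = 1
        else:
            # table is full: decrement every counter and drop the zeros
            # (the real sketch decrements by an approximate median instead)
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij")
counts = misra_gries(stream, max_counters=4)
# heavy hitters "a" and "b" survive; each stored count undercounts the
# true count by at most W/(k+1), where W is the total stream weight
```

Each decrement round removes up to `k+1` stream occurrences (the arriving item plus `k` counters), which is where the `W/(k+1)`-style error bound, and hence the `epsilon = 3.5/M` guarantee above, comes from.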
docs/source/hyper_log_log.rst

Lines changed: 26 additions & 1 deletion
@@ -1,7 +1,32 @@
 HyperLogLog (HLL)
 -----------------
-The HyperLogLog (HLL) sketch is a space-efficient method for estimating cardinalities of sets.
+This is a high-performance implementation of Philippe Flajolet's HLL sketch, but with significantly improved error behavior.
+
+If the ONLY use case for sketching is counting uniques and merging, the HLL sketch is a reasonable choice, although the highest-performing in terms of accuracy for storage space consumed is CPC (Compressed Probabilistic Counting). For large enough counts, this HLL version (with HLL_4) can be 2 to 16 times smaller than the Theta sketch family for the same accuracy.
+
+This implementation offers three different types of HLL sketch, each with different trade-offs among accuracy, space, and performance.
+These types are specified with the `target_hll_type` parameter.
+
+In terms of accuracy, all three types, for the same `lg_config_k`, have the same error distribution as a function of `n`, the number of unique values fed to the sketch.
+The configuration parameter `lg_config_k` is the log-base-2 of `K`, where `K` is the number of buckets or slots for the sketch.
+
+During warmup, when the sketch has only received a small number of unique items (up to about 10% of `K`), this implementation leverages a new class of estimator algorithms with significantly better accuracy.
+
+This sketch also offers the capability of operating off-heap.
+Given a WritableMemory object created by the user, the sketch will perform all of its updates and internal phase transitions in that object, which can reside either on-heap or off-heap based on how it is configured.
+In large systems that must update and merge many millions of sketches, having the sketch operate off-heap avoids the serialization and deserialization costs of moving sketches to and from off-heap memory-mapped files, for example, and eliminates big garbage collection delays.

 .. autoclass:: _datasketches.hll_sketch
    :members:
    :undoc-members:
+   :exclude-members: deserialize, get_max_updatable_serialization_bytes, get_rel_err
+   :member-order: groupwise
+
+.. rubric:: Static Methods:
+
+.. automethod:: deserialize
+.. automethod:: get_max_updatable_serialization_bytes
+.. automethod:: get_rel_err
+
+.. rubric:: Non-static Methods:

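The core HLL idea, `K` registers that each track the maximum number of leading zero bits seen among hashed values routed to them, can be sketched in a few lines of pure Python. This shows only the raw harmonic-mean estimator, without the library's warmup estimators, bias corrections, or compact HLL_4 storage; all names are invented for the example.

```python
import hashlib

def estimate_uniques(items, lg_config_k=10):
    k = 1 << lg_config_k                 # number of buckets ("slots")
    registers = [0] * k
    for item in items:
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        bucket = h & (k - 1)             # low lg_config_k bits pick the register
        rest = h >> lg_config_k          # remaining bits supply the rank
        # rank = number of leading zeros in the remaining bits, plus one
        rank = (64 - lg_config_k) - rest.bit_length() + 1
        registers[bucket] = max(registers[bucket], rank)
    # raw harmonic-mean estimate (no small- or large-range corrections)
    alpha = 0.7213 / (1.0 + 1.079 / k)
    return alpha * k * k / sum(2.0 ** -r for r in registers)

est = estimate_uniques(range(50_000))
# with k = 1024 registers the relative error is roughly 1.04/sqrt(k) ~ 3%
```

The accuracy/space trade-off follows directly from `k`: doubling the register count halves the variance, which is why all the types above share one error distribution for a given `lg_config_k`.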
docs/source/kll.rst

Lines changed: 105 additions & 0 deletions
@@ -1,5 +1,110 @@
 KLL Sketch
 ----------
+Implementation of a very compact quantiles sketch with a lazy compaction scheme
+and nearly optimal accuracy per retained item.
+See `Optimal Quantile Approximation in Streams`.
+
+This is a stochastic streaming sketch that enables near real-time analysis of the
+approximate distribution of items from a very large stream in a single pass, requiring only
+that the items are comparable.
+The analysis is obtained using the `get_quantile()` function or the
+inverse functions `get_rank()`, `get_pmf()` (Probability Mass Function), and `get_cdf()`
+(Cumulative Distribution Function).
+
+As of May 2020, this implementation produces serialized sketches which are binary-compatible
+with the equivalent Java implementation only when the template parameter `T = float`
+(32-bit single-precision values).
+
+Given an input stream of `N` items, the `natural rank` of any specific
+item is defined as its index `(1 to N)` in inclusive mode,
+or `(0 to N-1)` in exclusive mode,
+in the hypothetical sorted stream of all `N` input items.
+
+The `normalized rank` (`rank`) of any specific item is defined as its
+`natural rank` divided by `N`.
+Thus, the `normalized rank` is between zero and one.
+In the documentation for this sketch, `natural rank` is never used, so any
+reference to just `rank` should be interpreted to mean `normalized rank`.
+
+This sketch is configured with a parameter `k`, which affects the size of the sketch
+and its estimation error.
+
+The estimation error is commonly called `epsilon` (or `eps`) and is a fraction
+between zero and one. Larger values of `k` result in smaller values of `epsilon`.
+Epsilon is always with respect to the rank and cannot be applied to the
+corresponding items.
+
+The relationship between the `normalized rank` and the corresponding items can be viewed
+as a two-dimensional monotonic plot with the `normalized rank` on one axis and the
+corresponding items on the other axis. If the y-axis is specified as the item-axis and
+the x-axis as the `normalized rank`, then `y = get_quantile(x)` is a monotonically
+increasing function.
+
+The function `get_quantile(rank)` translates ranks into
+corresponding quantiles. The functions `get_rank(item)`,
+`get_cdf(...)` (Cumulative Distribution Function), and `get_pmf(...)`
+(Probability Mass Function) perform the opposite operation and translate items into ranks.
+
+The `get_pmf(...)` function has about 13 to 47% worse rank error (depending
+on `k`) than the other queries because the mass of each "bin" of the PMF has
+"double-sided" error from the upper and lower edges of the bin as a result of a subtraction,
+as the errors from the two edges can sometimes add.
+
+The default `k` of 200 yields a "single-sided" `epsilon` of about 1.33% and a
+"double-sided" (PMF) `epsilon` of about 1.65%.
+
+A `get_quantile(rank)` query has the following guarantees:
+
+- Let `q = get_quantile(r)` where `r` is the rank between zero and one.
+- The quantile `q` will be an item from the input stream.
+- Let `true_rank` be the true rank of `q` derived from the hypothetical sorted
+  stream of all `N` items.
+- Let `eps = get_normalized_rank_error(false)`.
+- Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%. Note that the
+  error is on the rank, not the quantile.
+
+A `get_rank(item)` query has the following guarantees:
+
+- Let `r = get_rank(i)` where `i` is an item between the min and max items of
+  the input stream.
+- Let `true_rank` be the true rank of `i` derived from the hypothetical sorted
+  stream of all `N` items.
+- Let `eps = get_normalized_rank_error(false)`.
+- Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%.
+
+A `get_pmf(...)` query has the following guarantees:
+
+- Let `{r1, r2, ..., r(m+1)} = get_pmf(s1, s2, ..., sm)` where `s1, s2, ...` are
+  split points (items from the input domain) between the min and max items of
+  the input stream.
+- Let `mass_i` = estimated mass between `s_i` and `s_(i+1)`.
+- Let `true_mass` be the true mass between the items `s_i` and
+  `s_(i+1)` derived from the hypothetical sorted stream of all `N` items.
+- Let `eps = get_normalized_rank_error(true)`.
+- Then `mass - eps ≤ true_mass ≤ mass + eps` with a confidence of 99%.
+- `r(m+1)` includes the mass of all points larger than `s_m`.
+
+A `get_cdf(...)` query has the following guarantees:
+
+- Let `{r1, r2, ..., r(m+1)} = get_cdf(s1, s2, ..., sm)` where `s1, s2, ...` are
+  split points (items from the input domain) between the min and max items of
+  the input stream.
+- Let `mass_i = r_(i+1) - r_i`.
+- Let `true_mass` be the true mass between the true ranks of `s_i` and
+  `s_(i+1)` derived from the hypothetical sorted stream of all `N` items.
+- Let `eps = get_normalized_rank_error(true)`.
+- Then `mass - eps ≤ true_mass ≤ mass + eps` with a confidence of 99%.
+- `1 - r(m+1)` includes the mass of all points larger than `s_m`.
+
+From the above, it might seem like we could make some estimates to bound the
+item returned from a call to `get_quantile()`. The sketch, however, does not
+let us derive error bounds or confidences around items. Because errors are independent, we
+can approximately bracket a value as shown below, but there are no error estimates available.
+Additionally, the interval may be quite large for certain distributions.
+
+- Let `q = get_quantile(r)`, the estimated quantile of rank `r`.
+- Let `eps = get_normalized_rank_error(false)`.
+- Let `q_lo` = estimated quantile of rank `(r - eps)`.
+- Let `q_hi` = estimated quantile of rank `(r + eps)`.
+- Then `q_lo ≤ q ≤ q_hi`, with 99% confidence.
+

 .. autoclass:: _datasketches.kll_ints_sketch
    :members:

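The rank/quantile definitions above can be made concrete with exact (non-sketch) reference functions computed over a fully sorted stream; there is no `epsilon` here because nothing is compressed, and the helper names are invented for illustration.

```python
import bisect
import math

def exact_rank(sorted_items, item, inclusive=True):
    # normalized rank: fraction of the N items that are <= item
    # (inclusive mode) or strictly < item (exclusive mode)
    n = len(sorted_items)
    cut = bisect.bisect_right if inclusive else bisect.bisect_left
    return cut(sorted_items, item) / n

def exact_quantile(sorted_items, rank):
    # smallest item whose normalized rank is at least the requested rank
    n = len(sorted_items)
    idx = min(n - 1, max(0, math.ceil(rank * n) - 1))
    return sorted_items[idx]

data = list(range(1, 101))     # the hypothetical sorted stream, N = 100
q = exact_quantile(data, 0.5)  # the item at normalized rank 0.5
r = exact_rank(data, 50)       # inclusive rank of the item 50
```

The KLL sketch approximates exactly these two mappings, with the guarantee that the returned rank is within `eps` of the value these exact functions would produce.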