Commit a83ebbc

Merge pull request #28 from apache/docs-update
Docs update
2 parents e10511a + 7aa4701 commit a83ebbc

12 files changed: 345 additions & 47 deletions

docs/README.md

Lines changed: 13 additions & 40 deletions
@@ -1,50 +1,23 @@
 Follow these steps to build the documentation.
 1. Clone the directory in an appropriate location `git clone https://github.com/apache/datasketches-python.git`
-2. Switch to the correct branch: `git checkout python-docs`.
-3. In project root run `source python-docs-venv/bin/activate`
-
-If there are problems running the virtual env then you may need to install `virtualenv`
-and install the packages manually as below
-(nb my environment has `python` aliased to `python3` so just use whichever is appropriate for your installation)
+2. Make a new branch and switch to that branch.
 ```
-python -m venv python-docs-venv # create a new virtual env named python-docs-venv
-source python-docs-venv/bin/activate
-python -m pip install sphinx
-python -m pip install sphinx-rtd-theme
+cd datasketches-python
+git branch new-branch
+git checkout new-branch
+```
+3. In project root, make a new virtual environment with the appropriate packages. Depending on how python is aliased in your environment, you may
+need `python` or `python3`, as indicated by `python(3)`.
 ```
-4. In project root run `python3 -m pip install .` to build the python bindings.
+python -m venv venv # create a new virtual env named venv using system python
+source venv/bin/activate
+python(3) -m pip install sphinx # now using venv python
+python(3) -m pip install sphinx-rtd-theme
+```
+4. In project root run `python(3) -m pip install .` to build the python bindings.
 5. Build and open the documentation:
 ```
 cd python/docs
 make html
 open build/html/index.html
 ```
-
-## Problems
-The `density_sketch` and `tuple_sketch` are not yet included.
-I have not included the file to avoid cluttering the PR with things that may not work.
-You can easily include them by making a `density_sketch.rst` file in the same location as
-all of the other `X.rst` files for the sketches and copying in the following:
-
-```
-Density Sketch
---------------
-
-.. autoclass:: datasketches.density_sketch
-   :members:
-   :undoc-members:
-
-.. autoclass:: datasketches.GaussianKernel
-   :members:
-```
-
-Additionally, you will need to add the below to `index.rst`
-```
-Density Estimation
-##################
-
-.. toctree::
-   :maxdepth: 1
-
-density_sketch
-```
-

docs/source/conf.py

Lines changed: 2 additions & 3 deletions
@@ -8,7 +8,6 @@

 import sys
 import os
-import datasketches

 # need to fix the paths so that sphinx can find the source code.
 sys.path.insert(0, os.path.abspath("../../datasketches"))
@@ -19,12 +18,12 @@
 copyright = '2023'
 author = 'Apache Software Foundation'
 release = '0.1'
+add_module_names = False

 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

-extensions = ["sphinx.ext.autodoc","sphinx.ext.autosummary"]
-
+extensions = ["sphinx.ext.autodoc"]
 templates_path = ['_templates']
 exclude_patterns = []


docs/source/count_min_sketch.rst

Lines changed: 23 additions & 1 deletion
@@ -1,6 +1,28 @@
 CountMin Sketch
 ---------------

-.. autoclass:: _datasketches.count_min_sketch
+The CountMin sketch, as described by Cormode and Muthukrishnan in
+http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf,
+is used for approximate frequency estimation.
+For an item :math:`x` with frequency :math:`f_x`, the sketch provides an estimate :math:`\hat{f_x}`
+such that :math:`\hat{f_x} \approx f_x`.
+The sketch guarantees that :math:`f_x \le \hat{f_x}` and provides a probabilistic upper bound which depends on the size parameters.
+The sketch provides an estimate of the occurrence frequency for any queried item but, in contrast
+to the Frequent Items sketch, this sketch does not provide a list of
+heavy hitters.
+
+.. currentmodule:: _datasketches
+
+.. autoclass:: count_min_sketch
    :members:
    :undoc-members:
+   :exclude-members: deserialize, suggest_num_buckets, suggest_num_hashes
+   :member-order: groupwise
+
+.. rubric:: Static Methods:
+
+.. automethod:: deserialize
+.. automethod:: suggest_num_buckets
+.. automethod:: suggest_num_hashes
+
+.. rubric:: Non-static Methods:

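The one-sided guarantee above (the estimate never undercounts, and overcounts only by a bounded amount) is easy to see in a minimal pure-Python rendition of the CountMin table. This is an illustrative toy, not the library's implementation; the class and parameter names here are invented for the example.

```python
import hashlib

class ToyCountMin:
    """Illustrative CountMin: a num_hashes x num_buckets grid of counters."""

    def __init__(self, num_hashes, num_buckets):
        self.num_hashes = num_hashes
        self.num_buckets = num_buckets
        self.table = [[0] * num_buckets for _ in range(num_hashes)]

    def _bucket(self, item, row):
        # derive an independent-ish hash per row by salting with the row index
        h = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.num_buckets

    def update(self, item, weight=1):
        for row in range(self.num_hashes):
            self.table[row][self._bucket(item, row)] += weight

    def get_estimate(self, item):
        # minimum over rows: collisions only ever inflate a counter,
        # so the estimate is always >= the true frequency
        return min(self.table[row][self._bucket(item, row)]
                   for row in range(self.num_hashes))

sk = ToyCountMin(num_hashes=5, num_buckets=128)
for word in ["a"] * 10 + ["b"] * 3 + ["c"]:
    sk.update(word)
est_a = sk.get_estimate("a")  # >= 10 by the one-sided guarantee
```

Taking the minimum across rows is what turns many noisy overestimates into one tight upper bound; the real sketch sizes `num_buckets` and `num_hashes` from the desired relative error and confidence.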
docs/source/cpc.rst

Lines changed: 18 additions & 1 deletion
@@ -1,7 +1,24 @@
 Compressed Probabilistic Counting (CPC)
 ---------------------------------------
-The *Compressed Probabilistic Counting* sketch is a space-efficient method for estimating cardinalities of sets.
+High-performance C++ implementation of the Compressed Probabilistic Counting (CPC) sketch.
+This is a unique-counting sketch that implements the Compressed Probabilistic Counting (CPC, a.k.a. FM85) algorithms developed by Kevin Lang in his paper
+`Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm <https://arxiv.org/abs/1708.06839>`_.
+This sketch is extremely space-efficient when serialized.
+In an apples-to-apples empirical comparison against compressed HyperLogLog sketches, this new algorithm simultaneously wins on the two dimensions of the space/accuracy trade-off and produces sketches that are smaller than the entropy of HLL, so no possible implementation of compressed HLL can match its space efficiency for a given accuracy. As described in the paper, this sketch implements a newly developed ICON estimator algorithm that survives unioning operations; another well-known estimator, the Historical Inverse Probability (HIP) estimator, does not.
+The update speed of this sketch is quite fast and comparable to that of HLL.
+The unioning (merging) capability of this sketch also allows merging of sketches with different configurations of K.
+For additional security, this sketch can be configured with a user-specified hash seed.

 .. autoclass:: _datasketches.cpc_sketch
    :members:
    :undoc-members:
+   :exclude-members: deserialize
+   :member-order: groupwise
+
+.. rubric:: Static Methods:
+
+.. automethod:: deserialize
+
+.. rubric:: Non-static Methods:

docs/source/density_sketch.rst

Lines changed: 9 additions & 0 deletions
@@ -1,5 +1,14 @@
 Density Sketch
 --------------
+Builds a coreset from the given set of input points.
+Provides a density estimate at a given point.
+
+Based on the following paper: Zohar Karnin, Edo Liberty,
+"Discrepancy, Coresets, and Sketches in Machine Learning",
+https://proceedings.mlr.press/v99/karnin19a/karnin19a.pdf
+
+Inspired by the following implementation: https://github.com/edoliberty/streaming-quantiles/blob/f688c8161a25582457b0a09deb4630a81406293b/gde.py
+
 .. autoclass:: datasketches.density_sketch
    :members:
    :undoc-members:

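For intuition, the quantity the density sketch estimates can be written down (without any coreset compression) as a plain Gaussian kernel density estimate over a set of retained points. The function names below are invented for illustration; the real sketch evaluates the same kind of sum over a compressed, weighted coreset rather than over all input points.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * bandwidth ** 2))

def density_estimate(points, query, bandwidth=1.0):
    # average kernel value between the query and every retained point
    return sum(gaussian_kernel(p, query, bandwidth) for p in points) / len(points)

points = [(0.0, 0.0), (0.1, -0.1), (5.0, 5.0)]
near = density_estimate(points, (0.0, 0.0))   # close to a cluster: high density
far = density_estimate(points, (20.0, 20.0))  # far from all points: near zero
```

The coreset construction in the paper exists precisely so that this sum can be approximated from a small weighted subset instead of the full stream.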
docs/source/frequent_items.rst

Lines changed: 77 additions & 0 deletions
@@ -1,6 +1,83 @@
 Frequent Items
 --------------

+This sketch is useful for tracking approximate frequencies of items of type `<T>` with optional associated counts `(<T> item, int count)`
+that are members of a multiset of such items.
+The true frequency of an item is defined to be the sum of its associated counts.
+
+This implementation provides the following capabilities:
+
+* Estimate the *frequency* of an item.
+* Return *upper* and *lower bounds* for any item, such that the true frequency is always between the upper and lower bounds.
+* Return a global *maximum error* that holds for all items in the stream.
+* Return an array of frequent items under either a *NO_FALSE_POSITIVES* or a *NO_FALSE_NEGATIVES* error type.
+* *Merge* itself with another sketch object created from this class.
+* *Serialize/Deserialize* to/from a byte array.
+
+**Space Usage**
+
+The sketch is initialized with a maximum map size, `maxMapSize`, that specifies the maximum physical length of the internal hash map of the form `(<T> item, int count)`.
+The maximum map size is always a power of 2, defined through the parameter `lg_max_map_size`.
+
+The hash map starts at a very small size (8 entries) and grows as needed up to the specified maximum map size.
+
+Excluding external space required for the item objects, the internal memory space usage of this sketch is `18 * mapSize` bytes (assuming 8 bytes for each reference),
+plus a small constant number of additional bytes.
+The internal memory space usage of this sketch will never exceed `18 * maxMapSize` bytes, plus a small constant number of additional bytes.
+
+**Maximum Capacity of the Sketch**
+
+The `LOAD_FACTOR` for the hash map is internally set at :math:`75\%`, which means at any time the map capacity of `(item, count)` pairs is `mapCap = 0.75 * mapSize`.
+The maximum capacity of `(item, count)` pairs of the sketch is `maxMapCap = 0.75 * maxMapSize`.
+
+**Updating the sketch with `(item, count)` pairs**
+
+If the item is found in the hash map, the mapped count field (the "counter") is incremented by the incoming count; otherwise, a new `(item, count)` pair is created.
+If the number of tracked counters reaches the maximum capacity of the hash map, the sketch decrements all of the counters (by an approximately computed median)
+and removes any non-positive counters.
+
+**Accuracy**
+
+If fewer than `0.75 * maxMapSize` different items are inserted into the sketch, the estimated frequencies returned by the sketch will be exact.
+
+The logic of the frequent items sketch is such that the stored counts and true counts are never too different.
+More specifically, for any item, the sketch can return an estimate of the true frequency of the item, along with upper and lower bounds on the frequency (that hold deterministically).
+
+For this implementation and for a specific active item, it is guaranteed that the true frequency will be between the Upper Bound (UB) and the Lower Bound (LB) computed for that item.
+Specifically, :math:`(UB - LB) \le W \cdot \epsilon`, where :math:`W` denotes the sum of all item counts and :math:`\epsilon = 3.5/M`, where :math:`M` is the maxMapSize.
+
+This is a worst-case guarantee that applies to arbitrary inputs.
+For inputs typically seen in practice, `(UB - LB)` is usually much smaller.
+
+**Background**
+
+This code implements a variant of what is commonly known as the "Misra-Gries algorithm".
+Variants of it were discovered, rediscovered, and redesigned several times over the years:
+
+* *Finding repeated elements*, Misra, Gries, 1982
+* *Frequency estimation of Internet packet streams with limited space*, Demaine, Lopez-Ortiz, Munro, 2002
+* *A simple algorithm for finding frequent elements in streams and bags*, Karp, Shenker, Papadimitriou, 2003
+* *Efficient Computation of Frequent and Top-k Elements in Data Streams*, Metwally, Agrawal, Abbadi, 2006
+
+For speed, we do employ some randomization that introduces a small probability that our proof of the worst-case bound might not apply to a given run.
+However, we have ensured that this probability is extremely small.
+For example, if the stream causes one table purge (rebuild), our proof of the worst-case bound applies with a probability of at least `1 - 1E-14`.
+If the stream causes `1E9` purges, our proof applies with a probability of at least `1 - 1E-5`.
+
+Parameter: `<T>` The type of item to be tracked by this sketch
+
 .. autoclass:: _datasketches.frequent_items_sketch
    :members:
    :undoc-members:
+   :exclude-members: deserialize, get_epsilon_for_lg_size, get_apriori_error
+   :member-order: groupwise
+
+.. rubric:: Static Methods:
+
+.. automethod:: deserialize
+.. automethod:: get_epsilon_for_lg_size
+.. automethod:: get_apriori_error
+
+.. rubric:: Non-static Methods:

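The decrement-and-purge behavior described under "Updating the sketch" is the classic Misra-Gries step. A minimal pure-Python rendition, which decrements by 1 rather than by an approximate median and uses invented names, looks like this:

```python
def misra_gries(stream, max_counters):
    """Toy Misra-Gries: track at most max_counters (item, count) pairs."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < max_counters:
            counters[item] = 1
        else:
            # table is full: decrement every counter and drop the zeros
            # (the real sketch decrements by an approximate median instead)
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij")
counts = misra_gries(stream, max_counters=4)
# heavy hitters "a" and "b" survive; each stored count undercounts the
# true count by at most W/(k+1), where W is the total stream weight
```

Each decrement round removes up to `k+1` stream occurrences (the arriving item plus `k` counters), which is where the `W/(k+1)`-style error bound, and hence the `epsilon = 3.5/M` guarantee above, comes from.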
docs/source/hyper_log_log.rst

Lines changed: 26 additions & 1 deletion
@@ -1,7 +1,32 @@
 HyperLogLog (HLL)
 -----------------
-The HyperLogLog (HLL) sketch is a space-efficient method for estimating cardinalities of sets.
+This is a high-performance implementation of Philippe Flajolet's HLL sketch, but with significantly improved error behavior.
+
+If the ONLY use case for sketching is counting uniques and merging, the HLL sketch is a reasonable choice, although the highest-performing in terms of accuracy for storage space consumed is CPC (Compressed Probabilistic Counting). For large enough counts, this HLL version (with HLL_4) can be 2 to 16 times smaller than the Theta sketch family for the same accuracy.
+
+This implementation offers three different types of HLL sketch, each with different trade-offs among accuracy, space, and performance.
+These types are specified with the `target_hll_type` parameter.
+
+In terms of accuracy, all three types, for the same `lg_config_k`, have the same error distribution as a function of `n`, the number of unique values fed to the sketch.
+The configuration parameter `lg_config_k` is the log-base-2 of `K`, where `K` is the number of buckets or slots for the sketch.
+
+During warmup, when the sketch has only received a small number of unique items (up to about 10% of `K`), this implementation leverages a new class of estimator algorithms with significantly better accuracy.
+
+This sketch also offers the capability of operating off-heap.
+Given a WritableMemory object created by the user, the sketch will perform all of its updates and internal phase transitions in that object, which can reside either on-heap or off-heap based on how it is configured.
+In large systems that must update and merge many millions of sketches, having the sketch operate off-heap avoids the serialization and deserialization costs of moving sketches to and from off-heap memory-mapped files, for example, and eliminates big garbage collection delays.

 .. autoclass:: _datasketches.hll_sketch
    :members:
    :undoc-members:
+   :exclude-members: deserialize, get_max_updatable_serialization_bytes, get_rel_err
+   :member-order: groupwise
+
+.. rubric:: Static Methods:
+
+.. automethod:: deserialize
+.. automethod:: get_max_updatable_serialization_bytes
+.. automethod:: get_rel_err
+
+.. rubric:: Non-static Methods:

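The core HLL idea, `K` registers that each track the maximum number of leading zero bits seen among hashed values routed to them, can be sketched in a few lines of pure Python. This shows only the raw harmonic-mean estimator, without the library's warmup estimators, bias corrections, or compact HLL_4 storage; all names are invented for the example.

```python
import hashlib

def estimate_uniques(items, lg_config_k=10):
    k = 1 << lg_config_k                 # number of buckets ("slots")
    registers = [0] * k
    for item in items:
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        bucket = h & (k - 1)             # low lg_config_k bits pick the register
        rest = h >> lg_config_k          # remaining bits supply the rank
        # rank = number of leading zeros in the remaining bits, plus one
        rank = (64 - lg_config_k) - rest.bit_length() + 1
        registers[bucket] = max(registers[bucket], rank)
    # raw harmonic-mean estimate (no small- or large-range corrections)
    alpha = 0.7213 / (1.0 + 1.079 / k)
    return alpha * k * k / sum(2.0 ** -r for r in registers)

est = estimate_uniques(range(50_000))
# with k = 1024 registers the relative error is roughly 1.04/sqrt(k) ~ 3%
```

The accuracy/space trade-off follows directly from `k`: doubling the register count halves the variance, which is why all the types above share one error distribution for a given `lg_config_k`.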
docs/source/kll.rst

Lines changed: 105 additions & 0 deletions
@@ -1,5 +1,110 @@
 KLL Sketch
 ----------
+Implementation of a very compact quantiles sketch with a lazy compaction scheme
+and nearly optimal accuracy per retained item.
+See `Optimal Quantile Approximation in Streams`.
+
+This is a stochastic streaming sketch that enables near real-time analysis of the
+approximate distribution of items from a very large stream in a single pass, requiring only
+that the items are comparable.
+The analysis is obtained using the `get_quantile()` function or the
+inverse functions `get_rank()`, `get_pmf()` (Probability Mass Function), and `get_cdf()`
+(Cumulative Distribution Function).
+
+As of May 2020, this implementation produces serialized sketches which are binary-compatible
+with the equivalent Java implementation only when the template parameter `T = float`
+(32-bit single-precision values).
+
+Given an input stream of `N` items, the `natural rank` of any specific
+item is defined as its index `(1 to N)` in inclusive mode,
+or `(0 to N-1)` in exclusive mode,
+in the hypothetical sorted stream of all `N` input items.
+
+The `normalized rank` (`rank`) of any specific item is defined as its
+`natural rank` divided by `N`.
+Thus, the `normalized rank` is between zero and one.
+In the documentation for this sketch, `natural rank` is never used, so any
+reference to just `rank` should be interpreted to mean `normalized rank`.
+
+This sketch is configured with a parameter `k`, which affects the size of the sketch
+and its estimation error.
+
+The estimation error is commonly called `epsilon` (or `eps`) and is a fraction
+between zero and one. Larger values of `k` result in smaller values of `epsilon`.
+Epsilon is always with respect to the rank and cannot be applied to the
+corresponding items.
+
+The relationship between the `normalized rank` and the corresponding items can be viewed
+as a two-dimensional monotonic plot with the `normalized rank` on one axis and the
+corresponding items on the other axis. If the y-axis is specified as the item-axis and
+the x-axis as the `normalized rank`, then `y = get_quantile(x)` is a monotonically
+increasing function.
+
+The function `get_quantile(rank)` translates ranks into
+corresponding quantiles. The functions `get_rank(item)`,
+`get_cdf(...)` (Cumulative Distribution Function), and `get_pmf(...)`
+(Probability Mass Function) perform the opposite operation and translate items into ranks.
+
+The `get_pmf(...)` function has about 13 to 47% worse rank error (depending
+on `k`) than the other queries because the mass of each "bin" of the PMF has
+"double-sided" error from the upper and lower edges of the bin as a result of a subtraction,
+as the errors from the two edges can sometimes add.
+
+The default `k` of 200 yields a "single-sided" `epsilon` of about 1.33% and a
+"double-sided" (PMF) `epsilon` of about 1.65%.
+
+A `get_quantile(rank)` query has the following guarantees:
+
+- Let `q = get_quantile(r)` where `r` is the rank between zero and one.
+- The quantile `q` will be an item from the input stream.
+- Let `true_rank` be the true rank of `q` derived from the hypothetical sorted
+  stream of all `N` items.
+- Let `eps = get_normalized_rank_error(false)`.
+- Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%. Note that the
+  error is on the rank, not the quantile.
+
+A `get_rank(item)` query has the following guarantees:
+
+- Let `r = get_rank(i)` where `i` is an item between the min and max items of
+  the input stream.
+- Let `true_rank` be the true rank of `i` derived from the hypothetical sorted
+  stream of all `N` items.
+- Let `eps = get_normalized_rank_error(false)`.
+- Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%.
+
+A `get_pmf(...)` query has the following guarantees:
+
+- Let `{r1, r2, ..., r(m+1)} = get_pmf(s1, s2, ..., sm)` where `s1, s2, ...` are
+  split points (items from the input domain) between the min and max items of
+  the input stream.
+- Let `mass_i` = estimated mass between `s_i` and `s_(i+1)`.
+- Let `true_mass` be the true mass between the items `s_i` and
+  `s_(i+1)` derived from the hypothetical sorted stream of all `N` items.
+- Let `eps = get_normalized_rank_error(true)`.
+- Then `mass - eps ≤ true_mass ≤ mass + eps` with a confidence of 99%.
+- `r(m+1)` includes the mass of all points larger than `s_m`.
+
+A `get_cdf(...)` query has the following guarantees:
+
+- Let `{r1, r2, ..., r(m+1)} = get_cdf(s1, s2, ..., sm)` where `s1, s2, ...` are
+  split points (items from the input domain) between the min and max items of
+  the input stream.
+- Let `mass_i = r_(i+1) - r_i`.
+- Let `true_mass` be the true mass between the true ranks of `s_i` and
+  `s_(i+1)` derived from the hypothetical sorted stream of all `N` items.
+- Let `eps = get_normalized_rank_error(true)`.
+- Then `mass - eps ≤ true_mass ≤ mass + eps` with a confidence of 99%.
+- `1 - r(m+1)` includes the mass of all points larger than `s_m`.
+
+From the above, it might seem like we could make some estimates to bound the
+item returned from a call to `get_quantile()`. The sketch, however, does not
+let us derive error bounds or confidences around items. Because errors are independent, we
+can approximately bracket a value as shown below, but there are no error estimates available.
+Additionally, the interval may be quite large for certain distributions.
+
+- Let `q = get_quantile(r)`, the estimated quantile of rank `r`.
+- Let `eps = get_normalized_rank_error(false)`.
+- Let `q_lo` = estimated quantile of rank `(r - eps)`.
+- Let `q_hi` = estimated quantile of rank `(r + eps)`.
+- Then `q_lo ≤ q ≤ q_hi`, with 99% confidence.
+

 .. autoclass:: _datasketches.kll_ints_sketch
    :members:

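The rank/quantile definitions above can be made concrete with exact (non-sketch) reference functions computed over a fully sorted stream; there is no `epsilon` here because nothing is compressed, and the helper names are invented for illustration.

```python
import bisect
import math

def exact_rank(sorted_items, item, inclusive=True):
    # normalized rank: fraction of the N items that are <= item
    # (inclusive mode) or strictly < item (exclusive mode)
    n = len(sorted_items)
    cut = bisect.bisect_right if inclusive else bisect.bisect_left
    return cut(sorted_items, item) / n

def exact_quantile(sorted_items, rank):
    # smallest item whose normalized rank is at least the requested rank
    n = len(sorted_items)
    idx = min(n - 1, max(0, math.ceil(rank * n) - 1))
    return sorted_items[idx]

data = list(range(1, 101))     # the hypothetical sorted stream, N = 100
q = exact_quantile(data, 0.5)  # the item at normalized rank 0.5
r = exact_rank(data, 50)       # inclusive rank of the item 50
```

The KLL sketch approximates exactly these two mappings, with the guarantee that the returned rank is within `eps` of the value these exact functions would produce.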