Commit 53ff9e9

Provided a more useful blurb in each family index page
1 parent 17d3103 commit 53ff9e9

7 files changed

Lines changed: 55 additions & 25 deletions

docs/source/distinct_counting/index.rst

Lines changed: 14 additions & 1 deletion
@@ -1,6 +1,19 @@
 Distinct Counting
 =================
-These are all of the sketches for distinct counting....
+
+.. currentmodule:: datasketches
+
+Distinct counting is one of the earliest tasks to which sketches were applied. The concept is simple:
+provide an estimate of the number of unique elements in a set of data. One of the earliest solutions came
+from Flajolet and Martin in 1985 with their seminal work
+`Probabilistic Counting Algorithms for Data Base Applications <http://db.cs.berkeley.edu/cs286/papers/flajoletmartin-jcss1985.pdf>`_.
+
+The DataSketches library offers several types of distinct counting sketches, each with different properties.
+
+* :class:`hll_sketch`: HyperLogLog, a well-known sketch for distinct counting but no longer state-of-the-art.
+* :class:`cpc_sketch`: Provides a better accuracy-space trade-off than HLL, but with a somewhat larger footprint while in-memory.
+* :class:`theta_sketch`: Theta sketch, a type of k-minimum value sketch, which provides good performance with intersection and set difference operations.
+* :class:`tuple_sketch`: Tuple sketch, which is similar to a theta sketch but supports additional data stored with each key.

 .. toctree::
    :maxdepth: 1
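
The k-minimum value idea mentioned for the theta sketch can be illustrated with a small, self-contained sketch in plain Python. This is not the library's implementation or API: the class name ``KMVSketch``, its methods, and the choice of SHA-256 as the hash are assumptions made purely for illustration. The estimator keeps the ``k`` smallest hash values seen and infers the number of distinct items from how small the k-th one is.

```python
import hashlib
import heapq


class KMVSketch:
    """Hypothetical k-minimum-values estimator (illustration only)."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated values: a max-heap of the k smallest hashes
        self._members = set()  # hash values currently retained

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        h = self._hash(item)
        if h in self._members:
            return  # duplicate of a retained value: nothing to do
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New value is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest
```

Memory use is bounded by ``k`` regardless of stream length, and larger ``k`` gives a tighter estimate at the cost of more retained values.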

docs/source/frequency/index.rst

Lines changed: 7 additions & 1 deletion
@@ -1,6 +1,12 @@
 Frequency Sketches
 ==================
-These are all of the sketches for frequency estimation
+
+Frequency estimation involves determining how often an item has been seen in a stream. The library currently
+offers two types of sketches for frequency estimation, one of which has two closely-related variants.
+
+* :class:`frequent_items_sketch`: Identifies the *Top K* or *heavy hitters* in a stream, those items whose weight is above a certain percentage of the entire stream. Does not necessarily provide an estimate for most items outside the heavy hitters.
+* :class:`frequent_strings_sketch`: Like the items version but containing only strings (an implementation from before the library handled generic objects).
+* :class:`count_min_sketch`: Provides an estimate for any item, regardless of relative weight, but does not maintain a list of the heaviest items.

 .. toctree::
    :maxdepth: 1
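
To make the count-min behavior described above concrete, here is a minimal plain-Python sketch of the general technique. It is not the library's ``count_min_sketch`` class: the name ``CountMin``, its parameters, and the SHA-256-based hashing are assumptions for illustration. Each item is hashed into one counter per row, and the estimate is the minimum of those counters, so an estimate may overcount because of collisions but never undercounts.

```python
import hashlib


class CountMin:
    """Hypothetical count-min table (illustration only)."""

    def __init__(self, num_hashes=4, num_buckets=256):
        self.num_hashes = num_hashes
        self.num_buckets = num_buckets
        self.table = [[0] * num_buckets for _ in range(num_hashes)]

    def _buckets(self, item):
        # One bucket per row, derived by salting the hash with the row index.
        for row in range(self.num_hashes):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.num_buckets

    def update(self, item, weight=1):
        for row, col in self._buckets(item):
            self.table[row][col] += weight

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._buckets(item))
```

Any item can be queried, including ones never seen, but nothing in the table records which items were heavy; that is the gap the frequent-items sketches fill.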

docs/source/helper/index.rst

Lines changed: 8 additions & 0 deletions
@@ -1,11 +1,19 @@
 Helper Classes
 ==============

+.. currentmodule:: datasketches
+
 These classes are required for certain sketches or specific
 functionality within sketches.
 Some of them are abstract base classes, but in those cases there is at
 least one reference example of a concrete class.

+* :class:`serde` is used when serializing and deserializing sketches.
+* :class:`jaccard` is used to compute the Jaccard similarity between pairs of theta or tuple sketches.
+* :class:`tuple_policy` is required when using a :class:`tuple_sketch`; it specifies how summaries are combined.
+* :func:`ks_test` performs a Kolmogorov-Smirnov test on sketches from the absolute-error quantiles family.
+* :class:`kernel_function` is required when using a :class:`density_sketch` for Kernel Density Estimation.
+
 .. toctree::
    :maxdepth: 1

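
For reference, the quantity that the Jaccard helper approximates from sketches can be computed exactly on small sets. The function below is a plain-Python illustration, not the library's ``jaccard`` class; the name ``jaccard_similarity`` and the convention of returning 1.0 for two empty sets are assumptions.

```python
def jaccard_similarity(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two iterables."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumed convention: two empty sets are treated as identical
    return len(set_a & set_b) / len(union)
```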

docs/source/quantiles/index.rst

Lines changed: 17 additions & 1 deletion
@@ -1,6 +1,22 @@
 Quantiles Sketches
 ==================
-These are all of the sketches for quantile estimation....
+
+Quantile estimation is useful for understanding the distribution of data values in a stream. The sketches currently
+in the library are designed to answer queries about the `rank` of an item in the stream of items. That is, when
+applying a global ordering on all the items, what fraction of the items seen so far is less than (alternatively,
+less than or equal to) the given item? Using straightforward logic, they can also estimate the item at a given rank
+in the stream.
+
+These sketches may be used to compute approximate histograms, Probability Mass Functions (PMFs), or
+Cumulative Distribution Functions (CDFs).
+
+The library provides three types of quantiles sketches, each of which has a generic-item version as well as versions
+specific to a given numeric type (e.g. integer or floating point values). All three types provide error
+bounds on rank estimation with proven probabilistic error distributions.
+
+* KLL: Provides uniform rank estimation error over the entire range.
+* REQ: Provides relative rank error estimates, with error that decreases approaching either the high or low end values.
+* Classic quantiles: Largely deprecated in favor of KLL; also provides uniform rank estimation error. Included mainly for backwards compatibility with historic data.

 .. toctree::
    :maxdepth: 1
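
The rank and quantile definitions above can be pinned down with an exact computation on a small sorted list. The helpers below are plain Python and are not part of the library; the names ``rank`` and ``quantile`` and the ``inclusive`` flag are chosen for illustration. A quantiles sketch answers the same two questions approximately, without retaining every item.

```python
import bisect
import math


def rank(sorted_values, item, inclusive=False):
    """Fraction of values less than (or, if inclusive, less than or equal to) item."""
    if inclusive:
        count = bisect.bisect_right(sorted_values, item)
    else:
        count = bisect.bisect_left(sorted_values, item)
    return count / len(sorted_values)


def quantile(sorted_values, r):
    """Smallest value whose inclusive rank is at least r, for 0 < r <= 1."""
    index = max(0, math.ceil(r * len(sorted_values)) - 1)
    return sorted_values[index]


values = sorted([5, 1, 3, 3, 9])
exclusive_rank = rank(values, 3)                  # one of five values is below 3
inclusive_rank = rank(values, 3, inclusive=True)  # three of five values are at most 3
median = quantile(values, 0.5)
```

Evaluating ``rank`` at a list of split points gives a CDF, and differencing consecutive CDF values gives the corresponding PMF or histogram.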

docs/source/sampling/index.rst

Lines changed: 4 additions & 2 deletions
@@ -1,6 +1,8 @@
 Random Sampling Sketches
 ========================

+.. currentmodule:: datasketches
+
 These sketches are used to randomly sample items. The length of the input
 stream does not need to be known in advance.

@@ -9,8 +11,8 @@ Probability Proportional to Size) sketches will include sample items based on
 each item's weight relative to the weight of the entire stream but
 they differ in details:

-* EBPPS ensures that the probability of including an item is always exactly proportional to the item's weight.
-* VarOpt optimizes for applying a predicate to the resulting sample such that the variance of the subset sum after applying the predicate is minimized, even if the inclusion probability differs somewhat from being proportional to the item's weight.
+* :class:`ebpps_sketch` ensures that the probability of including an item is always exactly proportional to the item's weight.
+* :class:`var_opt_sketch` optimizes for applying a predicate to the resulting sample such that the variance of the subset sum after applying the predicate is minimized, even if the inclusion probability differs somewhat from being proportional to the item's weight.

 .. toctree::
    :maxdepth: 1
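
The point that the stream length need not be known in advance can be shown with classic uniform reservoir sampling (Algorithm R). This is a simpler, unweighted technique than EBPPS or VarOpt and is not how those sketches are implemented; the function name ``reservoir_sample`` and its parameters are assumptions for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable."""
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if len(sample) < k:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(n)     # uniform index in [0, n)
            if j < k:
                sample[j] = item     # keep the new item with probability k / n
    return sample
```

The stream is consumed once and only ``k`` items are held at any time, which is the property the sampling sketches share; the weighted variants additionally adjust inclusion probabilities by item weight.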

docs/source/sampling/index.rst~

Lines changed: 0 additions & 19 deletions
This file was deleted.

docs/source/vector/index.rst

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,10 @@
 Vector Sketches
 ==================
-These sketches are designed to accept vector inputs.
+
+.. currentmodule:: datasketches
+
+These sketches are designed to accept vector inputs. For now, the library provides only the
+:class:`density_sketch` for Kernel Density Estimation.

 .. toctree::
    :maxdepth: 1
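
As background for Kernel Density Estimation on vector inputs, the exact quantity can be written in a few lines of plain Python: the density at a query point is the average kernel value between the query and every stored point. The code below is an illustration only and does not use the library; the function names, the unnormalized Gaussian kernel, and the ``bandwidth`` parameter are assumptions. A density sketch would aim to approximate this kind of average from a small retained subset of points.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Unnormalized Gaussian kernel: 1.0 when x == y, decaying with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kernel_density(points, query, bandwidth=1.0):
    """Average kernel value between the query vector and each stored vector."""
    return sum(gaussian_kernel(p, query, bandwidth) for p in points) / len(points)
```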
