docs/source/distinct_counting/index.rst
14 additions & 1 deletion
@@ -1,6 +1,19 @@
 Distinct Counting
 =================
-These are all of the sketches for distinct counting....
+
+.. currentmodule:: datasketches
+
+Distinct counting is one of the earliest tasks to which sketches were applied. The concept is simple:
+provide an estimate of the number of unique elements in a set of data. One of the earliest solutions came
+from Flajolet and Martin in 1985 with their seminal work
+`Probabilistic Counting Algorithms for Data Base Applications <http://db.cs.berkeley.edu/cs286/papers/flajoletmartin-jcss1985.pdf>`_.
+
+The DataSketches library offers several types of distinct counting sketches, each with different properties.
+
+* :class:`hll_sketch`: HyperLogLog, a well-known sketch for distinct counting but no longer state-of-the-art.
+* :class:`cpc_sketch`: Provides a better accuracy-space trade-off than HLL, but with a somewhat larger footprint while in memory.
+* :class:`theta_sketch`: Theta sketch, a type of k-minimum values sketch, which provides good performance with intersection and set difference operations.
+* :class:`tuple_sketch`: Tuple sketch, which is similar to a theta sketch but supports additional data stored with each key.
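
Reviewer note: to make the "k-minimum values" idea behind the theta sketch concrete, here is a minimal pure-Python sketch of a KMV estimator. It is an illustration only and not the library's implementation; the class name `KMVSketch`, its methods, and the use of SHA-256 as the hash are all assumptions made for this example.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative k-minimum values estimator (hypothetical, not datasketches code)."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so the largest retained hash is at index 0
        self._members = set()  # retained hashes, used to ignore duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-uniform value in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        h = self._hash(item)
        if h in self._members:
            return  # duplicate of a retained item
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the current k-th minimum: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct items: exact count
        kth_minimum = -self._heap[0]
        return (self.k - 1) / kth_minimum


sketch = KMVSketch(k=256)
for i in range(20000):
    sketch.update(f"user-{i % 10000}")  # 10000 distinct values, each seen twice
print(round(sketch.estimate()))
```

The estimate relies only on the k smallest hash values, which is why memory stays bounded no matter how long the stream is. The relative standard error of this estimator is roughly 1/sqrt(k).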
docs/source/frequency/index.rst
7 additions & 1 deletion
@@ -1,6 +1,12 @@
 Frequency Sketches
 ==================
-These are all of the sketches for frequency estimation
+
+Frequency estimation involves determining how often an item has been seen in a stream. The library currently
+offers two types of sketches for frequency estimation, one of which has two closely related variants.
+
+* :class:`frequent_items_sketch`: Identifies the *Top K* or *heavy hitters* in a stream, those items whose weight is above a certain percentage of the entire stream. Does not necessarily provide an estimate for most items outside the heavy hitters.
+* :class:`frequent_strings_sketch`: Like the items version but containing only strings (an implementation from before the library handled generic objects).
+* :class:`count_min_sketch`: Provides an estimate for any item, regardless of relative weight, but does not maintain a list of the heaviest items.
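
Reviewer note: the trade-off described for the count-min sketch (an estimate for every item, but no list of heavy hitters) can be seen in a few lines of pure Python. This is a hedged illustration, not the library's code; the class name `CountMin`, the default `width` and `depth`, and the SHA-256 based row hashing are assumptions chosen for the example.

```python
import hashlib


class CountMin:
    """Illustrative count-min sketch (hypothetical, not datasketches code)."""

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One column per row, derived from a hash salted with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def update(self, item, weight=1):
        for row, col in self._columns(item):
            self.table[row][col] += weight

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest upper bound on the true weight.
        return min(self.table[row][col] for row, col in self._columns(item))


cm = CountMin()
stream = ["a"] * 500 + ["b"] * 300 + [f"rare-{i}" for i in range(200)]
for item in stream:
    cm.update(item)
print(cm.estimate("a"), cm.estimate("b"), cm.estimate("rare-7"))
```

Note that the table stores only counters, never the items themselves, so it cannot enumerate the heaviest items. That is the gap a frequent-items style sketch fills.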
docs/source/quantiles/index.rst
17 additions & 1 deletion
@@ -1,6 +1,22 @@
 Quantiles Sketches
 ==================
-These are all of the sketches for quantile estimation....
+
+Quantile estimation is useful for understanding the distribution of data values in a stream. The sketches currently
+in the library are designed to answer queries about the `rank` of an item in the stream of items. That is, when
+applying a global ordering on all the items, what fraction of the items seen so far are less than (alternatively,
+less than or equal to) the given item? Using straightforward logic, they can also estimate the item at a given rank
+in the stream.
+
+These sketches may be used to compute approximate histograms, Probability Mass Functions (PMFs), or
+Cumulative Distribution Functions (CDFs).
+
+The library provides three types of quantiles sketches, each of which has generic items as well as versions
+specific to a given numeric type (e.g. integer or floating point values). All three types provide error
+bounds on rank estimation with proven probabilistic error distributions.
+
+* KLL: Provides uniform rank estimation error over the entire range.
+* REQ: Provides relative rank error, which decreases approaching either the high or low end of the range.
+* Classic quantiles: Largely deprecated in favor of KLL, also provides uniform rank estimation error. Included largely for backwards compatibility with historic data.
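
Reviewer note: the rank, quantile, PMF, and CDF queries described above can be pinned down with an exact computation over a small sorted list. The sketches approximate these same quantities in bounded memory; the code below does not sketch anything, and the helper names (`rank`, `quantile`, `cdf`, `pmf`) and the `inclusive` flag are assumptions for illustration rather than the library's API.

```python
from bisect import bisect_left, bisect_right
from math import ceil


def rank(sorted_items, item, inclusive=False):
    """Fraction of items less than (or, if inclusive, less than or equal to) item."""
    find = bisect_right if inclusive else bisect_left
    return find(sorted_items, item) / len(sorted_items)


def quantile(sorted_items, r):
    """Smallest item whose inclusive rank is at least r, for r in [0, 1]."""
    index = max(0, ceil(r * len(sorted_items)) - 1)
    return sorted_items[index]


def cdf(sorted_items, split_points, inclusive=False):
    """Cumulative mass at each split point, followed by 1.0 for the final bucket."""
    return [rank(sorted_items, s, inclusive) for s in split_points] + [1.0]


def pmf(sorted_items, split_points, inclusive=False):
    """Mass of each bucket delimited by the split points (a normalized histogram)."""
    c = cdf(sorted_items, split_points, inclusive)
    return [c[0]] + [hi - lo for lo, hi in zip(c, c[1:])]


data = sorted([13, 2, 8, 1, 21, 3, 2, 5])
print(rank(data, 2), rank(data, 2, inclusive=True))
print(quantile(data, 0.5))
print(cdf(data, [2, 8]), pmf(data, [2, 8]))
```

The exclusive and inclusive ranks of a repeated value differ by exactly the fraction of the stream equal to that value, which is why the choice of convention matters for data with many duplicates.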
docs/source/sampling/index.rst
4 additions & 2 deletions
@@ -1,6 +1,8 @@
 Random Sampling Sketches
 ========================

+.. currentmodule:: datasketches
+
 These sketches are used to randomly sample items. The length of the input
 stream does not need to be known in advance.

@@ -9,8 +11,8 @@ Probability Proportional to Size) sketches will include sample items based on
 each item's weight relative to the weight of the entire stream but
 they differ in details:

-* EBPPS ensures that the probability of including an item is always exactly proportional to the item's weight.
-* VarOpt optimizes for applying a predicate to the resulting sample such that the variance of the subset sum after applying the predicate is minimized, even if the inclusion probability differs somewhat from being proportional to the item's weight.
+* :class:`ebpps_sketch` ensures that the probability of including an item is always exactly proportional to the item's weight.
+* :class:`var_opt_sketch` optimizes for applying a predicate to the resulting sample such that the variance of the subset sum after applying the predicate is minimized, even if the inclusion probability differs somewhat from being proportional to the item's weight.
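
Reviewer note: for readers new to weighted sampling from a stream of unknown length, here is a minimal pure-Python example. It uses a different and simpler technique than either sketch in this file: the Efraimidis-Spirakis weighted reservoir method (A-Res). It is neither EBPPS nor VarOpt, and its inclusion probabilities are only approximately proportional to weight. The function name `weighted_sample` and its parameters are assumptions for illustration.

```python
import heapq
import random


def weighted_sample(stream, k, seed=None):
    """Sample up to k items from (item, weight) pairs in one pass (A-Res)."""
    rng = random.Random(seed)
    heap = []  # min-heap of (key, arrival index, item); smallest key is evicted first
    for index, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue  # items without positive weight can never be selected
        # Heavier items tend to draw keys closer to 1 and so survive longer.
        key = rng.random() ** (1.0 / weight)
        entry = (key, index, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]


stream = ((f"item-{i}", 1.0 + i % 5) for i in range(1000))
print(weighted_sample(stream, k=10, seed=42))
```

The stream is consumed once through a generator, so its length is never needed up front, which matches the property stated for the sampling sketches above.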