
Commit 332ca80

Merge pull request #31 from apache/sidebar-v2
Sidebar v2
2 parents 253de6a + 38422ed commit 332ca80

24 files changed

Lines changed: 143 additions & 47 deletions
File renamed without changes.
File renamed without changes.
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
Distinct Counting
=================

.. currentmodule:: datasketches

Distinct counting is one of the earliest tasks to which sketches were applied. The concept is simple:
provide an estimate of the number of unique elements in a set of data. One of the earliest solutions came
from Flajolet and Martin in 1985 with their seminal work
`Probabilistic Counting Algorithms for Data Base Applications <http://db.cs.berkeley.edu/cs286/papers/flajoletmartin-jcss1985.pdf>`_.

The DataSketches library offers several types of distinct counting sketches, each with different properties.

* :class:`hll_sketch`: HyperLogLog (HLL), a well-known sketch for distinct counting but no longer state-of-the-art.
* :class:`cpc_sketch`: Compressed Probabilistic Counting (CPC), which provides a better accuracy-space trade-off than HLL, but with a somewhat larger in-memory footprint.
* :class:`theta_sketch`: Theta sketch, a type of k-minimum values sketch, which provides good performance with intersection and set difference operations.
* :class:`tuple_sketch`: Tuple sketch, which is similar to a theta sketch but supports additional data stored with each key.
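
To make the k-minimum values idea behind the theta sketch concrete, here is a minimal pure-Python sketch of the technique. It is an illustration only and is not the library's implementation: the class name ``KMVSketch``, its methods, the default ``k``, and the use of SHA-256 as the hash are all assumptions made for this example.

```python
import hashlib
import heapq


class KMVSketch:
    """Toy k-minimum values sketch (hypothetical; not the datasketches API)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes, so the largest retained hash is on top
        self._members = set()  # retained hashes, used to ignore duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-uniform value in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        h = self._hash(item)
        if h in self._members:
            return  # a repeated item never changes the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest retained hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: exact count
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


sketch = KMVSketch(k=1024)
for i in range(100_000):
    sketch.update(f"user-{i}")
    sketch.update(f"user-{i}")  # duplicates are ignored
print(round(sketch.estimate()))
```

The sketch retains only the ``k`` smallest hash values, so memory stays bounded regardless of stream size, and the relative error shrinks roughly as one over the square root of ``k``.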

.. toctree::
   :maxdepth: 1

   hyper_log_log
   cpc
   theta
   tuple
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.

docs/source/frequency/index.rst

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
Frequency Sketches
==================

Frequency estimation involves determining how often an item has been seen in a stream. The library currently
offers two types of sketches for frequency estimation, one of which has two closely-related variants.

* :class:`frequent_items_sketch`: Identifies the *Top K* or *heavy hitters* in a stream, that is, the items whose weight exceeds a certain percentage of the entire stream. Does not necessarily provide an estimate for most items outside the heavy hitters.
* :class:`frequent_strings_sketch`: Like the items version but containing only strings (an implementation from before the library handled generic objects).
* :class:`count_min_sketch`: Provides an estimate for any item, regardless of relative weight, but does not maintain a list of the heaviest items.
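
The count-min behavior described above (an estimate for any item, with no list of heavy hitters) can be illustrated with a small pure-Python version of the technique. This is a sketch under stated assumptions, not the library's implementation: the class name ``CountMin``, its parameters, and the SHA-256-based row hashing are hypothetical choices for this example.

```python
import hashlib


class CountMin:
    """Toy count-min sketch (hypothetical; not the datasketches API)."""

    def __init__(self, num_hashes=4, num_buckets=256):
        self.num_hashes = num_hashes
        self.num_buckets = num_buckets
        # One row of counters per hash function.
        self.table = [[0] * num_buckets for _ in range(num_hashes)]

    def _bucket(self, row, item):
        # Derive a per-row hash by salting the item with the row index.
        data = f"{row}:{item}".encode("utf-8")
        digest = hashlib.sha256(data).digest()
        return int.from_bytes(digest[:8], "big") % self.num_buckets

    def update(self, item, weight=1):
        # Assumes non-negative weights, so counters only grow.
        for row in range(self.num_hashes):
            self.table[row][self._bucket(row, item)] += weight

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest available upper bound on the true weight.
        return min(
            self.table[row][self._bucket(row, item)]
            for row in range(self.num_hashes)
        )


cm = CountMin(num_hashes=4, num_buckets=256)
for word in ["a", "b", "a", "c", "a", "b"]:
    cm.update(word)
print(cm.estimate("a"), cm.estimate("never-seen"))
```

With non-negative weights the estimate never falls below the true weight of an item. It can exceed it when other items collide in every row, which is why the sketch alone cannot reliably list the heaviest items.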

.. toctree::
   :maxdepth: 1

   frequent_items
   count_min_sketch

docs/source/helper/index.rst

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
Helper Classes
==============

.. currentmodule:: datasketches

These classes are required for certain sketches or specific
functionality within sketches.
Some of them are abstract base classes, but in those cases there is at
least one reference example of a concrete class.

* :class:`serde` is used when serializing and deserializing sketches.
* :class:`jaccard` is used to compute the Jaccard similarity between pairs of theta or tuple sketches.
* :class:`tuple_policy` is required to use a :class:`tuple_sketch` by specifying how summaries are combined.
* :func:`ks_test` performs a Kolmogorov-Smirnov test on sketches from the absolute-error quantiles family.
* :class:`kernel_function` is required when using a :class:`kernel_sketch` for Kernel Density Estimation.
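
For reference, the quantity that the Jaccard helper estimates from sketches is the size of the intersection divided by the size of the union. The function below computes it exactly on plain Python sets. It is illustrative only: the name ``jaccard_similarity`` and the convention of returning 1.0 for two empty inputs are assumptions of this example, not the library's API.

```python
def jaccard_similarity(a, b):
    """Exact Jaccard similarity |A intersect B| / |A union B| of two iterables."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(set_a & set_b) / len(union)


# Two of the four distinct elements are shared.
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))
```

A sketch-based computation trades this exactness for bounded memory and returns an estimate of the same ratio.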

.. toctree::
   :maxdepth: 1

   serde
   jaccard
   tuple_policy
   ks_test
   kernel
File renamed without changes.
