Commit db90841

Add notes about which classes require a serde and which use python built-in object methods
1 parent 6c501d5 commit db90841

7 files changed

Lines changed: 69 additions & 29 deletions

docs/source/ebpps.rst

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,8 @@
11
Exact and Bounded, Probability Proportional to Size (EBPPS) Sampling
22
---------------------------------------------------------------------
33

4+
.. currentmodule:: datasketches
5+
46
An EBPPS sketch produces a random sample of data from a stream of items, ensuring that the probability
57
of including an item is always exactly equal to the item's size. The size of an item is defined as its
68
weight relative to the total weight of all items seen so far by the sketch. In contrast to VarOpt sampling,
@@ -14,7 +16,10 @@ Information Processing Letters, 2023.
1416
EBPPS sampling is related to reservoir sampling, but handles unequal item weights.
1517
Feeding the sketch items with a uniform weight value will produce a sample equivalent to reservoir sampling.
1618

17-
.. autoclass:: datasketches.ebpps_sketch
19+
.. note::
20+
Serializing and deserializing this sketch requires the use of a :class:`PyObjectSerDe`.
21+
22+
.. autoclass:: ebpps_sketch
1823
:members:
1924
:undoc-members:
2025
:exclude-members: deserialize

docs/source/frequent_items.rst

Lines changed: 21 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,8 @@
11
Frequent Items
22
--------------
33

4+
.. currentmodule:: datasketches
5+
46
This sketch is useful for tracking approximate frequencies of items of type `<T>` with optional associated counts `(<T> item, int count)`
57
that are members of a multiset of such items.
68
The true frequency of an item is defined to be the sum of associated counts.
@@ -16,38 +18,38 @@ This implementation provides the following capabilities:
1618

1719
**Space Usage**
1820

19-
The sketch is initialized with a maximum map size, `maxMapSize`, that specifies the maximum physical length of the internal hash map of the form `(<T> item, int count)`.
20-
The maximum map size is always a power of 2, defined through the variables `lg_max_map_size`.
21+
The sketch is initialized with a maximum map size, ``maxMapSize``, that specifies the maximum physical length of the internal hash map of the form ``(object item, int count)``.
22+
The maximum map size is always a power of 2, defined through the variable ``lg_max_map_size``.
2123

2224
The hash map starts at a very small size (8 entries) and grows as needed up to the specified maximum map size.
2325

24-
Excluding external space required for the item objects, the internal memory space usage of this sketch is `18 * mapSize bytes` (assuming 8 bytes for each reference),
26+
Excluding external space required for the item objects, the internal memory space usage of this sketch is ``18 * mapSize`` bytes (assuming 8 bytes for each reference),
2527
plus a small constant number of additional bytes.
26-
The internal memory space usage of this sketch will never exceed `18 * maxMapSize` bytes, plus a small constant number of additional bytes.
28+
The internal memory space usage of this sketch will never exceed ``18 * maxMapSize`` bytes, plus a small constant number of additional bytes.
2729

2830
**Maximum Capacity of the Sketch**
2931

30-
The `LOAD_FACTOR` for the hash map is internally set at :math:`75\%`, which means at any time the map capacity of `(item, count)` pairs is `mapCap = 0.75 * mapSize`.
31-
The maximum capacity of `(item, count)`` pairs of the sketch is `maxMapCap = 0.75 * maxMapSize`.
32+
The ``LOAD_FACTOR`` for the hash map is internally set at :math:`75\%`, which means at any time the map capacity of ``(item, count)`` pairs is ``mapCap = 0.75 * mapSize``.
33+
The maximum capacity of ``(item, count)`` pairs of the sketch is ``maxMapCap = 0.75 * maxMapSize``.
3234

33-
**Updating the sketch with `(item, count)` pairs**
35+
**Updating the sketch with ``(item, count)`` pairs**
3436

3537
If the item is found in the hash map, the mapped count field (the "counter") is incremented by the incoming count; otherwise, a new counter `"(item, count) pair"` is created.
3638
If the number of tracked counters reaches the maximum capacity of the hash map, the sketch decrements all of the counters (by an approximately computed median)
3739
and removes any non-positive counters.
3840

3941
**Accuracy**
4042

41-
If fewer than `0.75 * maxMapSize` different items are inserted into the sketch, the estimated frequencies returned by the sketch will be exact.
43+
If fewer than ``0.75 * maxMapSize`` different items are inserted into the sketch, the estimated frequencies returned by the sketch will be exact.
4244

4345
The logic of the frequent items sketch is such that the stored counts and true counts are never too different.
4446
More specifically, for any item, the sketch can return an estimate of the true frequency of the item, along with upper and lower bounds on the frequency (that hold deterministically).
4547

4648
For this implementation and for a specific active item, it is guaranteed that the true frequency will be between the Upper Bound (UB) and the Lower Bound (LB) computed for that item.
47-
Specifically, `(UB - LB) ≤ W * epsilon`, where :math:`W` denotes the sum of all item counts, and :math:`epsilon = 3.5/M`, where :math:`M` is the maxMapSize.
49+
Specifically, ``(UB - LB) ≤ W * epsilon``, where :math:`W` denotes the sum of all item counts, and :math:`epsilon = 3.5/M`, where :math:`M` is the ``maxMapSize``.
4850

4951
This is a worst-case guarantee that applies to arbitrary inputs.
50-
For inputs typically seen in practice, `(UB-LB)` is usually much smaller.
52+
For inputs typically seen in practice, ``(UB-LB)`` is usually much smaller.
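
The sizing and error formulas documented above can be worked through numerically. The following is only an illustrative sketch of that arithmetic: the function name ``frequent_items_bounds`` and its return keys are hypothetical and are not part of the datasketches API.

```python
def frequent_items_bounds(lg_max_map_size: int, total_weight: int) -> dict:
    """Evaluate the documented formulas for one configuration (hypothetical helper)."""
    max_map_size = 1 << lg_max_map_size       # maxMapSize is always a power of 2
    max_map_cap = 0.75 * max_map_size         # LOAD_FACTOR * maxMapSize
    max_bytes = 18 * max_map_size             # excludes the small constant overhead
    epsilon = 3.5 / max_map_size              # epsilon = 3.5 / M
    max_bound_width = total_weight * epsilon  # worst-case (UB - LB) = W * epsilon
    return {
        "max_map_size": max_map_size,
        "max_map_cap": max_map_cap,
        "max_bytes": max_bytes,
        "epsilon": epsilon,
        "max_bound_width": max_bound_width,
    }


# lg_max_map_size = 10 gives a 1024-slot map; W is an assumed stream weight.
bounds = frequent_items_bounds(lg_max_map_size=10, total_weight=1_000_000)
print(bounds)
```

With these assumed inputs, the map holds at most 768 counters, and the worst-case gap between the upper and lower bound is about 3418 out of a total weight of one million.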
5153

5254
**Background**
5355

@@ -63,14 +65,19 @@ Variants of it were discovered and rediscovered and redesigned several times ove
6365
For speed, we do employ some randomization that introduces a small probability that our proof of the worst-case bound might not apply to a given run.
6466
However, we have ensured that this probability is extremely small.
6567
For example, if the stream causes one table purge (rebuild), our proof of the worst-case bound applies with a probability of at least `1 - 1E-14`.
66-
If the stream causes `1E9` purges, our proof applies with a probability of at least `1 - 1E-5`.
68+
If the stream causes ``1E9`` purges, our proof applies with a probability of at least ``1 - 1E-5``.
6769

6870
There are two flavors of Frequent Items Sketches, one with generic items (objects) and another specific to strings.
6971
The string version is a legacy name from before the library supported generic objects and is retained
7072
only for backwards compatibility.
7173

74+
.. note::
75+
The :class:`frequent_items_sketch` uses an input object's ``__hash__`` and ``__eq__`` methods.
76+
77+
.. note::
78+
Serializing and deserializing the :class:`frequent_items_sketch` requires the use of a :class:`PyObjectSerDe`.
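
To illustrate what relying on ``__hash__`` and ``__eq__`` means for item identity, here is a minimal sketch that uses a plain ``dict`` as a stand-in for the sketch's internal hash map. The ``Endpoint`` class is hypothetical, and the ``dict`` is an assumption made for illustration only; it is not how the sketch is implemented.

```python
class Endpoint:
    """Hypothetical item type: two instances with equal fields count as one item."""

    def __init__(self, host: str, port: int):
        self.host = host
        self.port = port

    def __eq__(self, other):
        if not isinstance(other, Endpoint):
            return NotImplemented
        return (self.host, self.port) == (other.host, other.port)

    def __hash__(self):
        # Must agree with __eq__: equal objects need equal hashes.
        return hash((self.host, self.port))


# A dict stands in for the (item, count) map; this is not the sketch itself.
counts = {}
updates = [(Endpoint("a", 80), 2), (Endpoint("a", 80), 3), (Endpoint("b", 80), 1)]
for item, count in updates:
    counts[item] = counts.get(item, 0) + count

print(counts[Endpoint("a", 80)])
```

Because the two ``Endpoint("a", 80)`` instances compare equal and hash identically, their counts accumulate in a single counter. A class that defines ``__eq__`` without ``__hash__`` becomes unhashable in Python, so both methods should be defined together.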
7279

73-
.. autoclass:: _datasketches.frequent_items_error_type
80+
.. autoclass:: frequent_items_error_type
7481

7582
.. autoattribute:: NO_FALSE_POSITIVES
7683
:annotation: : Returns only true positives but may miss some heavy hitters.
@@ -79,7 +86,7 @@ only for backwards compatibility.
7986
:annotation: : Does not miss any heavy hitters but may return false positives.
8087

8188

82-
.. autoclass:: _datasketches.frequent_items_sketch
89+
.. autoclass:: frequent_items_sketch
8390
:members:
8491
:undoc-members:
8592
:exclude-members: deserialize, get_epsilon_for_lg_size, get_apriori_error
@@ -96,7 +103,7 @@ only for backwards compatibility.
96103
.. automethod:: __init__
97104

98105

99-
.. autoclass:: _datasketches.frequent_strings_sketch
106+
.. autoclass:: frequent_strings_sketch
100107
:members:
101108
:undoc-members:
102109
:exclude-members: deserialize, get_epsilon_for_lg_size, get_apriori_error

docs/source/kll.rst

Lines changed: 11 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,8 @@
11
KLL Sketch
22
----------
3+
4+
.. currentmodule:: datasketches
5+
36
Implementation of a very compact quantiles sketch with lazy compaction scheme
47
and nearly optimal accuracy per retained item.
58
See `Optimal Quantile Approximation in Streams`.
@@ -106,7 +109,11 @@ Additionally, the interval may be quite large for certain distributions.
106109
.. note::
107110
For the :class:`kll_items_sketch`, objects must be comparable with ``__lt__``.
108111

109-
.. autoclass:: _datasketches.kll_ints_sketch
112+
.. note::
113+
Serializing and deserializing a :class:`kll_items_sketch` requires the use of a :class:`PyObjectSerDe`.
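
As a rough illustration of the ``__lt__`` requirement, the sketch below defines a hypothetical ``Reading`` class that supports only the less-than comparison. Python's built-in ``sorted`` is used here purely as a stand-in for the ordering that a quantiles sketch needs; no datasketches API is called.

```python
class Reading:
    """Hypothetical item type ordered by its numeric value."""

    def __init__(self, sensor: str, value: float):
        self.sensor = sensor
        self.value = value

    def __lt__(self, other):
        # The only comparison defined; sorted() needs nothing else.
        return self.value < other.value

    def __repr__(self):
        return f"Reading({self.sensor!r}, {self.value})"


readings = [Reading("a", 3.5), Reading("b", 1.0), Reading("c", 2.25)]
ordered = sorted(readings)            # relies on Reading.__lt__ only
median = ordered[len(ordered) // 2]   # middle element of the ordered list
print(ordered, median)
```

Defining ``__lt__`` consistently (a strict ordering over the items) is what makes ranks and quantiles meaningful for custom objects.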
114+
115+
116+
.. autoclass:: kll_ints_sketch
110117
:members:
111118
:undoc-members:
112119
:exclude-members: deserialize, get_normalized_rank_error
@@ -120,7 +127,7 @@ Additionally, the interval may be quite large for certain distributions.
120127

121128
.. automethod:: __init__
122129

123-
.. autoclass:: _datasketches.kll_floats_sketch
130+
.. autoclass:: kll_floats_sketch
124131
:members:
125132
:undoc-members:
126133
:exclude-members: deserialize, get_normalized_rank_error
@@ -134,7 +141,7 @@ Additionally, the interval may be quite large for certain distributions.
134141

135142
.. automethod:: __init__
136143

137-
.. autoclass:: _datasketches.kll_doubles_sketch
144+
.. autoclass:: kll_doubles_sketch
138145
:members:
139146
:undoc-members:
140147
:exclude-members: deserialize, get_normalized_rank_error
@@ -148,7 +155,7 @@ Additionally, the interval may be quite large for certain distributions.
148155

149156
.. automethod:: __init__
150157

151-
.. autoclass:: _datasketches.kll_items_sketch
158+
.. autoclass:: kll_items_sketch
152159
:members:
153160
:undoc-members:
154161
:exclude-members: deserialize, get_normalized_rank_error

docs/source/quantiles_depr.rst

Lines changed: 12 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,11 @@
11
Quantiles Sketch (Deprecated)
22
-----------------------------
3+
4+
.. currentmodule:: datasketches
5+
36
This is a deprecated quantiles sketch that is included for cross-language compatibility.
4-
Most new projects will favor the KLL sketch over this one.
7+
Most new projects will favor the KLL sketch over this one, or the REQ sketch for higher accuracy
8+
at the very edge of a distribution.
59

610
This is a stochastic streaming sketch that enables near-real time analysis of the
711
approximate distribution from a very large stream in a single pass.
@@ -42,7 +46,10 @@ a confidence of about 99%.
4246
.. note::
4347
For the :class:`quantiles_items_sketch`, objects must be comparable with ``__lt__``.
4448

45-
.. autoclass:: _datasketches.quantiles_ints_sketch
49+
.. note::
50+
Serializing and deserializing a :class:`quantiles_items_sketch` requires the use of a :class:`PyObjectSerDe`.
51+
52+
.. autoclass:: quantiles_ints_sketch
4653
:members:
4754
:undoc-members:
4855
:exclude-members: deserialize, get_normalized_rank_error
@@ -56,7 +63,7 @@ a confidence of about 99%.
5663

5764
.. automethod:: __init__
5865

59-
.. autoclass:: _datasketches.quantiles_floats_sketch
66+
.. autoclass:: quantiles_floats_sketch
6067
:members:
6168
:undoc-members:
6269
:exclude-members: deserialize, get_normalized_rank_error
@@ -70,7 +77,7 @@ a confidence of about 99%.
7077

7178
.. automethod:: __init__
7279

73-
.. autoclass:: _datasketches.quantiles_doubles_sketch
80+
.. autoclass:: quantiles_doubles_sketch
7481
:members:
7582
:undoc-members:
7683
:exclude-members: deserialize, get_normalized_rank_error
@@ -84,7 +91,7 @@ a confidence of about 99%.
8491

8592
.. automethod:: __init__
8693

87-
.. autoclass:: _datasketches.quantiles_items_sketch
94+
.. autoclass:: quantiles_items_sketch
8895
:members:
8996
:undoc-members:
9097
:exclude-members: deserialize, get_normalized_rank_error

docs/source/req.rst

Lines changed: 9 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,8 @@
11
Relative Error Quantiles (REQ) Sketch
22
-------------------------------------
3+
4+
.. currentmodule:: datasketches
5+
36
This is an implementation based on the `paper <https://arxiv.org/abs/2004.01668>`_ "Relative Error Streaming Quantiles" by Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, Pavel Veselý, and loosely derived from a Python prototype written by Pavel Veselý.
47

58
This implementation differs from the algorithm described in the paper in the following:
@@ -30,7 +33,10 @@ This is not only useful for debugging, but is a powerful tool to help users unde
3033
.. note::
3134
For the :class:`req_items_sketch`, objects must be comparable with ``__lt__``.
3235

33-
.. autoclass:: _datasketches.req_ints_sketch
36+
.. note::
37+
Serializing and deserializing a :class:`req_items_sketch` requires the use of a :class:`PyObjectSerDe`.
38+
39+
.. autoclass:: req_ints_sketch
3440
:members:
3541
:undoc-members:
3642
:exclude-members: deserialize, get_RSE
@@ -44,7 +50,7 @@ This is not only useful for debugging, but is a powerful tool to help users unde
4450

4551
.. automethod:: __init__
4652

47-
.. autoclass:: _datasketches.req_floats_sketch
53+
.. autoclass:: req_floats_sketch
4854
:members:
4955
:undoc-members:
5056
:exclude-members: deserialize, get_RSE
@@ -58,7 +64,7 @@ This is not only useful for debugging, but is a powerful tool to help users unde
5864

5965
.. automethod:: __init__
6066

61-
.. autoclass:: _datasketches.req_items_sketch
67+
.. autoclass:: req_items_sketch
6268
:members:
6369
:undoc-members:
6470
:exclude-members: deserialize, get_RSE

docs/source/tuple.rst

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,9 @@ Set operations (union, intersection, A-not-B) are performed through the use of d
1515
Several `Jaccard similarity <https://en.wikipedia.org/wiki/Jaccard_similarity>`_
1616
measures can be computed between theta sketches with the :class:`tuple_jaccard_similarity` class.
1717

18+
.. note::
19+
Serializing and deserializing this sketch requires the use of a :class:`PyObjectSerDe`.
20+
1821
.. autoclass:: tuple_sketch
1922
:members:
2023
:undoc-members:

docs/source/varopt.rst

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,8 @@
11
Variance Optimal Sampling (VarOpt)
22
----------------------------------
33

4+
.. currentmodule:: datasketches
5+
46
A VarOpt sketch samples data from a stream of items. The sketch is designed for optimal (minimum)
57
variance when querying the sketch to estimate subset sums of items matching a provided predicate.
68
The sketch will produce a sample of size `k` (or smaller if fewer items have been presented), with
@@ -10,7 +12,10 @@ weight of all items presented to the sketch.
1012
VarOpt sampling is related to reservoir sampling, with improved error bounds for subset sum estimation.
1113
Feeding the sketch items with a uniform weight value will produce a sample equivalent to reservoir sampling.
1214

13-
.. autoclass:: datasketches.var_opt_sketch
15+
.. note::
16+
Serializing and deserializing this sketch requires the use of a :class:`PyObjectSerDe`.
17+
18+
.. autoclass:: var_opt_sketch
1419
:members:
1520
:undoc-members:
1621
:exclude-members: deserialize
@@ -23,7 +28,7 @@ Feeding the sketch items with a uniform weight value will produce a sample equiv
2328

2429
.. automethod:: __init__
2530

26-
.. autoclass:: datasketches.var_opt_union
31+
.. autoclass:: var_opt_union
2732
:members:
2833
:undoc-members:
2934
:exclude-members: deserialize
