docs/source/frequent_items.rst (+21 -14: 21 additions & 14 deletions)
@@ -1,6 +1,8 @@
 Frequent Items
 --------------
 
+.. currentmodule:: datasketches
+
 This sketch is useful for tracking approximate frequencies of items of type `<T>` with optional associated counts `(<T> item, int count)`
 that are members of a multiset of such items.
 The true frequency of an item is defined to be the sum of associated counts.
@@ -16,38 +18,38 @@ This implementation provides the following capabilities:
 
 **Space Usage**
 
-The sketch is initialized with a maximum map size, `maxMapSize`, that specifies the maximum physical length of the internal hash map of the form `(<T> item, int count)`.
-The maximum map size is always a power of 2, defined through the variables `lg_max_map_size`.
+The sketch is initialized with a maximum map size, ``maxMapSize``, that specifies the maximum physical length of the internal hash map of the form ``(object item, int count)``.
+The maximum map size is always a power of 2, defined through the variable ``lg_max_map_size``.
 
 The hash map starts at a very small size (8 entries) and grows as needed up to the specified maximum map size.
 
-Excluding external space required for the item objects, the internal memory space usage of this sketch is `18 * mapSize bytes` (assuming 8 bytes for each reference),
+Excluding external space required for the item objects, the internal memory space usage of this sketch is ``18 * mapSize`` bytes (assuming 8 bytes for each reference),
 plus a small constant number of additional bytes.
-The internal memory space usage of this sketch will never exceed `18 * maxMapSize` bytes, plus a small constant number of additional bytes.
+The internal memory space usage of this sketch will never exceed ``18 * maxMapSize`` bytes, plus a small constant number of additional bytes.
 
 **Maximum Capacity of the Sketch**
 
-The `LOAD_FACTOR` for the hash map is internally set at :math:`75\%`, which means at any time the map capacity of `(item, count)` pairs is `mapCap = 0.75 * mapSize`.
-The maximum capacity of `(item, count)`` pairs of the sketch is `maxMapCap = 0.75 * maxMapSize`.
+The ``LOAD_FACTOR`` for the hash map is internally set at :math:`75\%`, which means at any time the map capacity of ``(item, count)`` pairs is ``mapCap = 0.75 * mapSize``.
+The maximum capacity of ``(item, count)`` pairs of the sketch is ``maxMapCap = 0.75 * maxMapSize``.
 
-**Updating the sketch with `(item, count)` pairs**
+**Updating the sketch with (item, count) pairs**
 
 If the item is found in the hash map, the mapped count field (the "counter") is incremented by the incoming count; otherwise, a new counter `"(item, count) pair"` is created.
 If the number of tracked counters reaches the maximum capacity of the hash map, the sketch decrements all of the counters (by an approximately computed median)
 and removes any non-positive counters.
 
 **Accuracy**
 
-If fewer than `0.75 * maxMapSize` different items are inserted into the sketch, the estimated frequencies returned by the sketch will be exact.
+If fewer than ``0.75 * maxMapSize`` different items are inserted into the sketch, the estimated frequencies returned by the sketch will be exact.
 
 The logic of the frequent items sketch is such that the stored counts and true counts are never too different.
 More specifically, for any item, the sketch can return an estimate of the true frequency of item, along with upper and lower bounds on the frequency (that hold deterministically).
 
 For this implementation and for a specific active item, it is guaranteed that the true frequency will be between the Upper Bound (UB) and the Lower Bound (LB) computed for that item.
-Specifically, `(UB- LB) ≤ W * epsilon`, where :math:`W` denotes the sum of all item counts, and :math:`epsilon = 3.5/M`, where :math:`epsilon = M` is the maxMapSize.
+Specifically, ``(UB - LB) ≤ W * epsilon``, where :math:`W` denotes the sum of all item counts and :math:`epsilon = 3.5/M`, where :math:`M` is the ``maxMapSize``.
 
 This is a worst-case guarantee that applies to arbitrary inputs.
-For inputs typically seen in practice, `(UB-LB)` is usually much smaller.
+For inputs typically seen in practice, ``(UB-LB)`` is usually much smaller.
 
 **Background**
 
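Reviewer note: the update, purge, and bound rules documented in this hunk can be illustrated with a small pure-Python model. This is a sketch under stated assumptions, not the library's code: the class ``ToyFrequentItems``, its ``offset`` field, and the use of an exact ``median_low`` (the docs describe an approximately computed median) are all hypothetical, and the toy skips the growth of the map from 8 entries.

```python
from collections import Counter
from statistics import median_low


class ToyFrequentItems:
    """Toy model of the documented update/purge logic (hypothetical, not the library)."""

    LOAD_FACTOR = 0.75

    def __init__(self, lg_max_map_size):
        self.max_map_size = 1 << lg_max_map_size  # always a power of 2
        self.max_map_cap = int(self.LOAD_FACTOR * self.max_map_size)
        self.counters = {}      # item -> stored count
        self.offset = 0         # total amount subtracted by all purges so far
        self.total_weight = 0   # W: sum of all incoming counts

    def update(self, item, count=1):
        self.total_weight += count
        self.counters[item] = self.counters.get(item, 0) + count
        # Purge once the number of tracked counters reaches the capacity.
        if len(self.counters) >= self.max_map_cap:
            self._purge()

    def _purge(self):
        # Assumption: exact median instead of an approximate one.
        med = median_low(self.counters.values())
        self.offset += med
        # Decrement every counter and drop the non-positive ones.
        self.counters = {k: c - med for k, c in self.counters.items() if c > med}

    def lower_bound(self, item):
        return self.counters.get(item, 0)

    def upper_bound(self, item):
        return self.counters.get(item, 0) + self.offset


stream = ["a"] * 50 + ["b"] * 30 + [f"x{i}" for i in range(40)]
truth = Counter(stream)

sketch = ToyFrequentItems(lg_max_map_size=3)  # map size 8, capacity 6
for item in stream:
    sketch.update(item)

# Every true frequency lies between the toy's lower and upper bound.
in_bounds = all(
    sketch.lower_bound(k) <= truth[k] <= sketch.upper_bound(k) for k in truth
)
# The documented worst-case gap W * 3.5 / M, computed for comparison only.
worst_case_gap = sketch.total_weight * 3.5 / sketch.max_map_size
print(in_bounds, sketch.offset, worst_case_gap)
```

Each purge lowers any single counter by at most the subtracted median, so the accumulated ``offset`` is the gap between the toy's upper and lower bounds. The worst-case figure is printed only for comparison; the toy is not claimed to satisfy the library's proof.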
@@ -63,14 +65,19 @@ Variants of it were discovered and rediscovered and redesigned several times ove
 For speed, we do employ some randomization that introduces a small probability that our proof of the worst-case bound might not apply to a given run.
 However, we have ensured that this probability is extremely small.
 For example, if the stream causes one table purge (rebuild), our proof of the worst-case bound applies with a probability of at least `1 - 1E-14`.
-If the stream causes `1E9` purges, our proof applies with a probability of at least `1 - 1E-5`.
+If the stream causes ``1E9`` purges, our proof applies with a probability of at least ``1 - 1E-5``.
 
 There are two flavors of Frequent Items Sketches, one with generic items (objects) and another specific to strings.
 The string version is a legacy name from before the library supported generic objects and is retained
 only for backwards compatibility.
 
+.. note::
+    The :class:`frequent_items_sketch` uses an input object's ``__hash__`` and ``__eq__`` methods.
+
+.. note::
+    Serializing and deserializing the :class:`frequent_items_sketch` requires the use of a :class:`PyObjectSerDe`.
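
Reviewer note: the new ``__hash__``/``__eq__`` note means that two objects count as the same item only if they hash alike and compare equal. A minimal illustration, assuming a hypothetical ``Endpoint`` item class and using a plain ``dict`` as a stand-in for the sketch's internal map (no ``datasketches`` calls are made):

```python
class Endpoint:
    """Hypothetical item type that is hashable and comparable by value."""

    def __init__(self, host, port):
        self.host = host
        self.port = port

    def __eq__(self, other):
        if not isinstance(other, Endpoint):
            return NotImplemented
        return (self.host, self.port) == (other.host, other.port)

    def __hash__(self):
        # Must agree with __eq__: equal objects need equal hashes.
        return hash((self.host, self.port))


# A dict keyed on the items mimics how a hash map groups equal objects.
counts = {}
for ep in [Endpoint("a.example", 80), Endpoint("a.example", 80), Endpoint("b.example", 443)]:
    counts[ep] = counts.get(ep, 0) + 1

print(len(counts))
```

Without the two methods, Python falls back to identity-based hashing and equality, so the two ``a.example`` objects would be tracked as separate items.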
docs/source/req.rst (+9 -3: 9 additions & 3 deletions)
@@ -1,5 +1,8 @@
 Relative Error Quantiles (REQ) Sketch
 -------------------------------------
+
+.. currentmodule:: datasketches
+
 This is an implementation based on the `paper <https://arxiv.org/abs/2004.01668>`_ "Relative Error Streaming Quantiles" by Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, Pavel Veselý, and loosely derived from a Python prototype written by Pavel Veselý.
 
 This implementation differs from the algorithm described in the paper in the following:
@@ -30,7 +33,10 @@ This is not only useful for debugging, but is a powerful tool to help users unde
 .. note::
     For the :class:`req_items_sketch`, objects must be comparable with ``__lt__``.
 
-.. autoclass:: _datasketches.req_ints_sketch
+.. note::
+    Serializing and deserializing a :class:`req_items_sketch` requires the use of a :class:`PyObjectSerDe`.
+
+.. autoclass:: req_ints_sketch
     :members:
     :undoc-members:
     :exclude-members: deserialize, get_RSE
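
Reviewer note: the ``__lt__`` requirement in the context lines above can be met with a single method. A minimal illustration, assuming a hypothetical ``Version`` item class; ``sorted()`` stands in for the sketch's comparisons and no ``datasketches`` calls are made:

```python
class Version:
    """Hypothetical item type ordered by (major, minor)."""

    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def __lt__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return (self.major, self.minor) < (other.major, other.minor)

    def __repr__(self):
        return f"Version({self.major}, {self.minor})"


versions = [Version(2, 1), Version(1, 9), Version(1, 10)]
# sorted() relies only on __lt__, the same method the note asks for.
ordered = sorted(versions)
print(ordered)
```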
@@ -44,7 +50,7 @@ This is not only useful for debugging, but is a powerful tool to help users unde
 
     .. automethod:: __init__
 
-.. autoclass:: _datasketches.req_floats_sketch
+.. autoclass:: req_floats_sketch
     :members:
     :undoc-members:
     :exclude-members: deserialize, get_RSE
@@ -58,7 +64,7 @@ This is not only useful for debugging, but is a powerful tool to help users unde