docs/source/frequent_items.rst
53 additions & 16 deletions
@@ -1,8 +1,10 @@
 Frequent Items
 --------------

-This sketch is useful for tracking approximate frequencies of items of type `<T>` with optional associated counts `(<T> item, int count)`
-that are members of a multiset of such items.
+.. currentmodule:: datasketches
+
+This sketch is useful for tracking approximate frequencies of items (``object`` or ``string``) with optional associated
+integer counts that are members of a multiset of such items.
 The true frequency of an item is defined to be the sum of associated counts.

 This implementation provides the following capabilities:
@@ -16,38 +18,38 @@ This implementation provides the following capabilities:

 **Space Usage**

-The sketch is initialized with a maximum map size, `maxMapSize`, that specifies the maximum physical length of the internal hash map of the form `(<T> item, int count)`.
-The maximum map size is always a power of 2, defined through the variables `lg_max_map_size`.
+The sketch is initialized with a maximum map size, ``maxMapSize``, that specifies the maximum physical length of the internal hash map of the form ``(object item, int count)``.
+The maximum map size is always a power of 2, defined through the variable ``lg_max_map_size``.

 The hash map starts at a very small size (8 entries) and grows as needed up to the specified maximum map size.

-Excluding external space required for the item objects, the internal memory space usage of this sketch is `18 * mapSize bytes` (assuming 8 bytes for each reference),
+Excluding external space required for the item objects, the internal memory space usage of this sketch is ``18 * mapSize`` bytes (assuming 8 bytes for each reference),
 plus a small constant number of additional bytes.
-The internal memory space usage of this sketch will never exceed `18 * maxMapSize` bytes, plus a small constant number of additional bytes.
+The internal memory space usage of this sketch will never exceed ``18 * maxMapSize`` bytes, plus a small constant number of additional bytes.
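As a rough illustration of the space bound described in the Space Usage text, the following sketch computes ``18 * maxMapSize`` from ``lg_max_map_size``. The function name ``max_internal_bytes`` is hypothetical and not part of the library; the result excludes the item objects themselves and the small constant overhead.

```python
def max_internal_bytes(lg_max_map_size: int, bytes_per_entry: int = 18) -> int:
    """Upper bound on internal hash-map memory, in bytes.

    Hypothetical helper for illustration only. Excludes the item objects
    and the small constant overhead mentioned in the text.
    """
    max_map_size = 1 << lg_max_map_size  # the map size is always a power of 2
    return bytes_per_entry * max_map_size


# Example: lg_max_map_size = 10 gives maxMapSize = 1024 entries.
print(max_internal_bytes(10))
```

For ``lg_max_map_size = 10`` this gives 18 * 1024 = 18432 bytes.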

 **Maximum Capacity of the Sketch**

-The `LOAD_FACTOR` for the hash map is internally set at :math:`75\%`, which means at any time the map capacity of `(item, count)` pairs is `mapCap = 0.75 * mapSize`.
-The maximum capacity of `(item, count)`` pairs of the sketch is `maxMapCap = 0.75 * maxMapSize`.
+The ``LOAD_FACTOR`` for the hash map is internally set at :math:`75\%`, which means at any time the map capacity of ``(item, count)`` pairs is ``mapCap = 0.75 * mapSize``.
+The maximum capacity of ``(item, count)`` pairs of the sketch is ``maxMapCap = 0.75 * maxMapSize``.
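The capacity rule can be tabulated with a few lines of Python. This is only a sketch: ``map_capacity`` is a hypothetical helper, ``lg_max_map_size = 6`` is an arbitrary example setting, and growth by doubling is an assumption (the text only says that the map starts at 8 entries and that the maximum size is a power of 2).

```python
LOAD_FACTOR = 0.75  # value stated in the text


def map_capacity(map_size: int) -> int:
    """Number of (item, count) pairs a map of this physical size can hold."""
    return int(LOAD_FACTOR * map_size)


lg_max_map_size = 6  # hypothetical setting: maxMapSize = 64
size = 8             # initial map size stated in the text
capacities = []
while size <= (1 << lg_max_map_size):
    capacities.append((size, map_capacity(size)))
    size *= 2        # assumption: the map grows by doubling

print(capacities)
```

With these settings the last pair is ``(64, 48)``, so ``maxMapCap`` is 48 counters.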

-**Updating the sketch with `(item, count)` pairs**
+**Updating the sketch with ``(item, count)`` pairs**

-If the item is found in the hash map, the mapped count field (the "counter") is incremented by the incoming count; otherwise, a new counter `"(item, count) pair"` is created.
+If the item is found in the hash map, the mapped count field (the "counter") is incremented by the incoming count; otherwise, a new counter ``(item, count)`` pair is created.
 If the number of tracked counters reaches the maximum capacity of the hash map, the sketch decrements all of the counters (by an approximately computed median)
 and removes any non-positive counters.
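The update-and-purge rule above can be sketched with a plain dictionary. This toy class is not the datasketches implementation: the names are hypothetical, it assumes positive integer counts, and it uses an exact median where the text describes an approximately computed one. The ``offset`` field records the total amount subtracted by purges, which is what makes deterministic lower and upper bounds possible in this toy version.

```python
import statistics


class ToyFrequentItems:
    """Toy counter table for illustration; not the datasketches implementation."""

    def __init__(self, max_counters: int):
        self.max_counters = max_counters  # stands in for maxMapCap
        self.counters = {}                # item -> stored count
        self.offset = 0                   # total amount removed by purges

    def update(self, item, count: int = 1) -> None:
        # Existing counter: add the incoming count. New item: create a counter.
        self.counters[item] = self.counters.get(item, 0) + count
        if len(self.counters) >= self.max_counters:
            self._purge()

    def _purge(self) -> None:
        # Assumption: exact median here; the text describes an approximate one.
        median = int(statistics.median(self.counters.values()))
        self.offset += median
        # Decrement every counter and drop the non-positive ones.
        self.counters = {
            item: c - median for item, c in self.counters.items() if c - median > 0
        }

    def lower_bound(self, item) -> int:
        return self.counters.get(item, 0)

    def upper_bound(self, item) -> int:
        return self.counters.get(item, 0) + self.offset


sketch = ToyFrequentItems(max_counters=4)
stream = [("a", 5), ("b", 3), ("c", 1), ("d", 1)]
for item, count in stream:
    sketch.update(item, count)

for item, true_count in stream:
    print(item, sketch.lower_bound(item), true_count, sketch.upper_bound(item))
```

In this run the fourth update triggers one purge with a median of 2, leaving counters for ``a`` and ``b`` only; every true count still lies between the reported lower and upper bounds.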

 **Accuracy**

-If fewer than `0.75 * maxMapSize` different items are inserted into the sketch, the estimated frequencies returned by the sketch will be exact.
+If fewer than ``0.75 * maxMapSize`` different items are inserted into the sketch, the estimated frequencies returned by the sketch will be exact.

 The logic of the frequent items sketch is such that the stored counts and true counts are never too different.
 More specifically, for any item, the sketch can return an estimate of the true frequency of item, along with upper and lower bounds on the frequency (that hold deterministically).

 For this implementation and for a specific active item, it is guaranteed that the true frequency will be between the Upper Bound (UB) and the Lower Bound (LB) computed for that item.
-Specifically, `(UB- LB) ≤ W * epsilon`, where :math:`W` denotes the sum of all item counts, and :math:`epsilon = 3.5/M`, where :math:`epsilon = M` is the maxMapSize.
+Specifically, ``(UB - LB) ≤ W * epsilon``, where :math:`W` denotes the sum of all item counts, and :math:`epsilon = 3.5/M`, where :math:`M` is the maxMapSize.

 This is a worst-case guarantee that applies to arbitrary inputs.
-For inputs typically seen in practice, `(UB-LB)` is usually much smaller.
+For inputs typically seen in practice, ``(UB-LB)`` is usually much smaller.
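To get a feel for the worst-case guarantee, the bound ``W * epsilon`` with ``epsilon = 3.5 / M`` can be evaluated directly, taking ``M`` to be the maximum map size. The helper name ``max_bound_width`` and the example numbers are hypothetical.

```python
def max_bound_width(total_weight: int, lg_max_map_size: int) -> float:
    """Worst-case (UB - LB) for a stream whose counts sum to total_weight.

    Hypothetical helper for illustration; M is taken to be maxMapSize.
    """
    max_map_size = 1 << lg_max_map_size
    epsilon = 3.5 / max_map_size
    return total_weight * epsilon


# Example: one million total counts with maxMapSize = 1024.
width = max_bound_width(1_000_000, 10)
print(width)
```

With one million total counts and ``maxMapSize = 1024``, the gap between the upper and lower bound is at most about 3418 counts; each doubling of the map size halves that worst-case gap.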

 **Background**

@@ -63,12 +65,45 @@ Variants of it were discovered and rediscovered and redesigned several times ove
 For speed, we do employ some randomization that introduces a small probability that our proof of the worst-case bound might not apply to a given run.
 However, we have ensured that this probability is extremely small.
 For example, if the stream causes one table purge (rebuild), our proof of the worst-case bound applies with a probability of at least `1 - 1E-14`.
-If the stream causes `1E9` purges, our proof applies with a probability of at least `1 - 1E-5`.
+If the stream causes ``1E9`` purges, our proof applies with a probability of at least ``1 - 1E-5``.
+
+There are two flavors of Frequent Items Sketches, one with generic items (objects) and another specific to strings.
+The string version is a legacy name from before the library supported generic objects and is retained
+only for backwards compatibility.
+
+.. note::
+    The :class:`frequent_items_sketch` uses an input object's ``__hash__`` and ``__eq__`` methods.
+
+.. note::
+    Serializing and deserializing the :class:`frequent_items_sketch` requires the use of a :class:`PyObjectSerDe`.
+
+.. autoclass:: frequent_items_error_type
+
+    .. autoattribute:: NO_FALSE_POSITIVES
+        :annotation: : Returns only true positives but may miss some heavy hitters.
+
+    .. autoattribute:: NO_FALSE_NEGATIVES
+        :annotation: : Does not miss any heavy hitters but may return false positives.
docs/source/hyper_log_log.rst
31 additions & 8 deletions
@@ -7,26 +7,49 @@ If the ONLY use case for sketching is counting uniques and merging, the HLL sket
 This implementation offers three different types of HLL sketch, each with different trade-offs with accuracy, space and performance.
 These types are specified with the target_hll_type parameter.

-In terms of accuracy, all three types, for the same lg_config_k, have the same error distribution as a function of n, the number of unique values fed to the sketch.
-The configuration parameter `lg_config_k` is the log-base-2 of `K`, where `K` is the number of buckets or slots for the sketch.
+In terms of accuracy, all three types, for the same lg_config_k, have the same error distribution as a function of ``n``, the number of unique values fed to the sketch.
+The configuration parameter ``lg_config_k`` is the log-base-2 of ``k``, where ``k`` is the number of buckets or slots for the sketch.

-During warmup, when the sketch has only received a small number of unique items (up to about 10% of `K`), this implementation leverages a new class of estimator algorithms with significantly better accuracy.
+During warmup, when the sketch has only received a small number of unique items (up to about 10% of ``k``), this implementation leverages a new class of estimator algorithms with significantly better accuracy.
+
+
+.. autoclass:: _datasketches.tgt_hll_type
+
+    .. autoattribute:: HLL_4
+        :annotation: : 4 bits per entry
+
+    .. autoattribute:: HLL_6
+        :annotation: : 6 bits per entry
+
+    .. autoattribute:: HLL_8
+        :annotation: : 8 bits per entry

-This sketch also offers the capability of operating off-heap.
-Given a WritableMemory object created by the user, the sketch will perform all of its updates and internal phase transitions in that object, which can actually reside either on-heap or off-heap based on how it is configured.
-In large systems that must update and merge many millions of sketches, having the sketch operate off-heap avoids the serialization and deserialization costs of moving sketches to and from off-heap memory-mapped files, for example, and eliminates big garbage collection delays.
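The relationship between ``lg_config_k``, ``k``, and the bits-per-entry annotations on ``HLL_4``, ``HLL_6`` and ``HLL_8`` can be made concrete with a little arithmetic. The following is a back-of-the-envelope sketch only: the helper names are hypothetical, the size covers just the array of ``k`` slots (it ignores headers and any additional structures a real sketch keeps), and the warm-up figure simply applies the "about 10% of ``k``" statement.

```python
# Bits per slot, taken from the annotations in the diff above.
BITS_PER_ENTRY = {"HLL_4": 4, "HLL_6": 6, "HLL_8": 8}


def slot_array_bytes(lg_config_k: int, hll_type: str) -> int:
    """Approximate size of the slot array alone (hypothetical helper)."""
    k = 1 << lg_config_k  # k = 2 ** lg_config_k slots
    return k * BITS_PER_ENTRY[hll_type] // 8


def approx_warmup_limit(lg_config_k: int) -> int:
    """Roughly 10% of k, the warm-up range mentioned in the text."""
    return (1 << lg_config_k) // 10


lg_config_k = 12  # example setting: k = 4096
for hll_type in BITS_PER_ENTRY:
    print(hll_type, slot_array_bytes(lg_config_k, hll_type))
print("warm-up up to about", approx_warmup_limit(lg_config_k), "unique items")
```

For ``lg_config_k = 12`` the slot arrays come to roughly 2 KiB, 3 KiB and 4 KiB respectively, which is the space side of the trade-off between the three types.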