Commit 6c501d5

add ks test to docs, add support for items sketches in ks test
1 parent edcd1c4 commit 6c501d5

8 files changed

Lines changed: 50 additions & 6 deletions

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -76,6 +76,7 @@ Helper Classes
 kernel
 jaccard
 tuple_policy
+ks_test


 These classes are required for certain sketches or specific functionality within sketches.

docs/source/kll.rst

Lines changed: 2 additions & 0 deletions
@@ -103,6 +103,8 @@ Additionally, the interval may be quite large for certain distributions.
 - Let `q_hi = estimated quantile of rank (r + eps)`.
 - Then `q_lo ≤ q ≤ q_hi`, with 99% confidence.

+.. note::
+    For the :class:`kll_items_sketch`, objects must be comparable with ``__lt__``.

 .. autoclass:: _datasketches.kll_ints_sketch
     :members:
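
The note in the hunk above says that items fed to an items sketch must be comparable with ``__lt__``. As a minimal sketch of what that means on the Python side, the hypothetical `Version` class below (not part of the commit or the library) defines only `__lt__`. The built-in `sorted` stands in for the sketch's comparator here, since it also orders objects using `__lt__` alone; no sketch is constructed.

```python
from dataclasses import dataclass


@dataclass
class Version:
    """Hypothetical item type; only __lt__ is needed for ordering."""
    major: int
    minor: int

    def __lt__(self, other):
        # Compare field by field, most significant first.
        return (self.major, self.minor) < (other.major, other.minor)


versions = [Version(1, 10), Version(1, 2), Version(0, 9)]

# sorted() relies solely on __lt__, so it succeeds for this class.
ordered = sorted(versions)
```

An object without `__lt__` (a plain `object()`, for example) raises `TypeError` when sorted this way, which is the kind of failure the note warns about.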

docs/source/ks_test.rst

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+Kolmogorov-Smirnov Test
+#######################
+
+.. currentmodule:: datasketches
+
+A `Kolmogorov-Smirnov Test <https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test>`_ is
+a test of equality of two distributions to determine if they are likely to have come from the same
+underlying distribution.
+The DataSketches library provides a modified form of the test that takes into account the error
+in each underlying sketch in the analysis.
+
+Currently, the test assumes both input sketches are of the same family and data type.
+
+.. autofunction:: ks_test
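
For orientation, the classical two-sample test that the new page describes can be sketched in plain Python. This is not the DataSketches implementation: it works on raw samples, ignores the sketch error the page mentions, and the names `ks_statistic` and `ks_reject` are hypothetical. The parameter `p` is treated as a significance level, which is an assumption that mirrors the `ks_test` docstrings. The null hypothesis is rejected when the largest gap between the two empirical CDFs exceeds the usual asymptotic critical value.

```python
import bisect
import math


def ks_statistic(sample_1, sample_2):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = sorted(sample_1), sorted(sample_2)
    d = 0.0
    # The supremum is attained at one of the observed points.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_reject(sample_1, sample_2, p):
    """True if the null hypothesis (same distribution) is rejected at level p."""
    n, m = len(sample_1), len(sample_2)
    # Asymptotic critical value: c(p) * sqrt((n + m) / (n * m)).
    threshold = math.sqrt(-0.5 * math.log(p / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_1, sample_2) > threshold


same = [float(i) for i in range(1000)]
shifted = [x + 500 for x in same]

identical_rejected = ks_reject(same, same, 0.001)    # gap is 0, so no rejection
shifted_rejected = ks_reject(same, shifted, 0.001)   # gap is 0.5, so rejection
```

Identical inputs give a gap of zero and can never be rejected, which is the same property the new unit tests in this commit assert for a sketch and its deserialized copy.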

docs/source/quantiles_depr.rst

Lines changed: 3 additions & 0 deletions
@@ -39,6 +39,9 @@ For example, the median item returned from `get_quantile(0.5)` will be between t
 from the hypothetically sorted array of input items at normalized ranks of 0.483 and 0.517, with
 a confidence of about 99%.

+.. note::
+    For the :class:`quantiles_items_sketch`, objects must be comparable with ``__lt__``.
+
 .. autoclass:: _datasketches.quantiles_ints_sketch
     :members:
     :undoc-members:

docs/source/req.rst

Lines changed: 3 additions & 0 deletions
@@ -27,6 +27,9 @@ This implementation allows the user to use both the `INCLUSIVE` criterion and th
 This implementation provides extensive debug visibility into the operation of the sketch with two levels of detail output.
 This is not only useful for debugging, but is a powerful tool to help users understand how the sketch works.

+.. note::
+    For the :class:`req_items_sketch`, objects must be comparable with ``__lt__``.
+
 .. autoclass:: _datasketches.req_ints_sketch
     :members:
     :undoc-members:

src/ks_wrapper.cpp

Lines changed: 19 additions & 6 deletions
@@ -20,6 +20,7 @@
 #include "kolmogorov_smirnov.hpp"
 #include "kll_sketch.hpp"
 #include "quantiles_sketch.hpp"
+#include "py_object_lt.hpp"

 #include <nanobind/nanobind.h>

@@ -29,38 +30,50 @@ void init_kolmogorov_smirnov(nb::module_ &m) {
   using namespace datasketches;

   m.def("ks_test", &kolmogorov_smirnov::test<kll_sketch<int>>, nb::arg("sk_1"), nb::arg("sk_2"), nb::arg("p"),
-    "Performs the Kolmogorov-Smirnov Test between kll_ints_sketches.\n"
+    "Performs the Kolmogorov-Smirnov Test for :code:`kll_ints_sketch` pairs.\n"
     "Note: if the given sketches have insufficient data or if the sketch sizes are too small, "
     "this will return false.\n"
     "Returns True if we can reject the null hypothesis (that the sketches reflect the same underlying "
     "distribution) using the provided p-value, otherwise False.");
   m.def("ks_test", &kolmogorov_smirnov::test<kll_sketch<float>>, nb::arg("sk_1"), nb::arg("sk_2"), nb::arg("p"),
-    "Performs the Kolmogorov-Smirnov Test between kll_floats_sketches.\n"
+    "Performs the Kolmogorov-Smirnov Test for :code:`kll_floats_sketch` pairs.\n"
     "Note: if the given sketches have insufficient data or if the sketch sizes are too small, "
     "this will return false.\n"
     "Returns True if we can reject the null hypothesis (that the sketches reflect the same underlying "
     "distribution) using the provided p-value, otherwise False.");
   m.def("ks_test", &kolmogorov_smirnov::test<kll_sketch<double>>, nb::arg("sk_1"), nb::arg("sk_2"), nb::arg("p"),
-    "Performs the Kolmogorov-Smirnov Test between kll_doubles_sketches.\n"
+    "Performs the Kolmogorov-Smirnov Test for :code:`kll_doubles_sketch` pairs.\n"
+    "Note: if the given sketches have insufficient data or if the sketch sizes are too small, "
+    "this will return false.\n"
+    "Returns True if we can reject the null hypothesis (that the sketches reflect the same underlying "
+    "distribution) using the provided p-value, otherwise False.");
+  m.def("ks_test", &kolmogorov_smirnov::test<kll_sketch<nb::object, py_object_lt>>, nb::arg("sk_1"), nb::arg("sk_2"), nb::arg("p"),
+    "Performs the Kolmogorov-Smirnov Test for :code:`kll_items_sketch` pairs.\n"
     "Note: if the given sketches have insufficient data or if the sketch sizes are too small, "
     "this will return false.\n"
     "Returns True if we can reject the null hypothesis (that the sketches reflect the same underlying "
     "distribution) using the provided p-value, otherwise False.");

   m.def("ks_test", &kolmogorov_smirnov::test<quantiles_sketch<int>>, nb::arg("sk_1"), nb::arg("sk_2"), nb::arg("p"),
-    "Performs the Kolmogorov-Smirnov Test between quantiles_ints_sketches.\n"
+    "Performs the Kolmogorov-Smirnov Test for :code:`quantiles_ints_sketch` pairs.\n"
     "Note: if the given sketches have insufficient data or if the sketch sizes are too small, "
     "this will return false.\n"
     "Returns True if we can reject the null hypothesis (that the sketches reflect the same underlying "
     "distribution) using the provided p-value, otherwise False.");
   m.def("ks_test", &kolmogorov_smirnov::test<quantiles_sketch<float>>, nb::arg("sk_1"), nb::arg("sk_2"), nb::arg("p"),
-    "Performs the Kolmogorov-Smirnov Test between quantiles_floats_sketches.\n"
+    "Performs the Kolmogorov-Smirnov Test for :code:`quantiles_floats_sketch` pairs.\n"
     "Note: if the given sketches have insufficient data or if the sketch sizes are too small, "
     "this will return false.\n"
     "Returns True if we can reject the null hypothesis (that the sketches reflect the same underlying "
     "distribution) using the provided p-value, otherwise False.");
   m.def("ks_test", &kolmogorov_smirnov::test<quantiles_sketch<double>>, nb::arg("sk_1"), nb::arg("sk_2"), nb::arg("p"),
-    "Performs the Kolmogorov-Smirnov Test between quantiles_doubles_sketches.\n"
+    "Performs the Kolmogorov-Smirnov Test for :code:`quantiles_doubles_sketch` pairs.\n"
+    "Note: if the given sketches have insufficient data or if the sketch sizes are too small, "
+    "this will return false.\n"
+    "Returns True if we can reject the null hypothesis (that the sketches reflect the same underlying "
+    "distribution) using the provided p-value, otherwise False.");
+  m.def("ks_test", &kolmogorov_smirnov::test<quantiles_sketch<nb::object, py_object_lt>>, nb::arg("sk_1"), nb::arg("sk_2"), nb::arg("p"),
+    "Performs the Kolmogorov-Smirnov Test for :code:`quantiles_items_sketch` pairs.\n"
     "Note: if the given sketches have insufficient data or if the sketch sizes are too small, "
     "this will return false.\n"
     "Returns True if we can reject the null hypothesis (that the sketches reflect the same underlying "

tests/kll_test.py

Lines changed: 4 additions & 0 deletions
@@ -163,6 +163,10 @@ def test_kll_items_sketch(self):
     self.assertEqual(kll.get_quantile(0.7), new_kll.get_quantile(0.7))
     self.assertEqual(kll.get_rank(str(n/4)), new_kll.get_rank(str(n/4)))

+    # The sketches are identical so we cannot reject the null hypothesis that they
+    # reflect the same underlying distribution
+    self.assertFalse(ks_test(kll, new_kll, 0.001))
+

 if __name__ == '__main__':
     unittest.main()

tests/quantiles_test.py

Lines changed: 4 additions & 0 deletions
@@ -163,6 +163,10 @@ def test_quantiles_items_sketch(self):
     self.assertEqual(quantiles.get_quantile(0.7), new_quantiles.get_quantile(0.7))
     self.assertEqual(quantiles.get_rank(str(n/4)), new_quantiles.get_rank(str(n/4)))

+    # The sketches are identical so we cannot reject the null hypothesis that they
+    # reflect the same underlying distribution
+    self.assertFalse(ks_test(quantiles, new_quantiles, 0.001))
+

 if __name__ == '__main__':
     unittest.main()
