Skip to content

Commit e85a56f

Browse files
committed
add medoid silhouette, prepare 0.4.0
1 parent 743994c commit e85a56f

6 files changed

Lines changed: 179 additions & 28 deletions

File tree

CHANGELOG.md

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2,9 +2,11 @@
22

33
For changes to the main Rust package, please see <https://github.com/kno10/rust-kmedoids/blob/main/CHANGELOG.md>
44

5-
## kmedoids 0.4.0 (2022-09-23)
5+
## kmedoids 0.4.0 (2022-09-24)
66

7-
- Add PAMSIL, PAMMEDSIL, FastMSC, FasterMSC (Lars Lenssen)
7+
- add clustering by optimizing the Silhouette: PAMSIL
8+
- add medoid silhouette
9+
- add medoid silhouette clustering: PAMMEDSIL, FastMSC, FasterMSC
810

911
## kmedoids 0.3.3 (2022-07-06)
1012

README.md

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,13 @@
33
This python package implements k-medoids clustering with PAM and variants of clustering by direct optimization of the (Medoid) Silhouette.
44
It can be used with arbitrary dissimilarites, as it requires a dissimilarity matrix as input.
55

6+
This software package has been introduced in JOSS:
7+
8+
> Erich Schubert and Lars Lenssen
9+
> **Fast k-medoids Clustering in Rust and Python**
10+
> Journal of Open Source Software 7(75), 4183
11+
> <https://doi.org/10.21105/joss.04183> (open access)
12+
613
For further details on the implemented algorithm FasterPAM, see:
714

815
> Erich Schubert, Peter J. Rousseeuw
@@ -120,10 +127,11 @@ print("Loss with PAM:", pam.loss)
120127
* PAM (Kaufman and Rousseeuw, 1987) with BUILD and SWAP
121128
* Alternating optimization (k-means-style algorithm)
122129
* Silhouette index for evaluation (Rousseeuw, 1987)
123-
* FasterMSC (Lenssen and Schubert, 2022)
130+
* **FasterMSC** (Lenssen and Schubert, 2022)
124131
* FastMSC (Lenssen and Schubert, 2022)
125132
* PAMSIL (Van der Laan and Pollard, 2003)
126133
* PAMMEDSIL (Van der Laan and Pollard, 2003)
134+
* Medoid Silhouette index for evaluation (Van der Laan and Pollard, 2003)
127135

128136
Note that the k-means-like algorithm for k-medoids tends to find much worse solutions.
129137

docs/index.rst

Lines changed: 26 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,8 @@ This package is a wrapper around the fast
77
`Rust k-medoids package <https://github.com/kno10/rust-kmedoids>`_,
88
implementing the FasterPAM and FastPAM algorithms
99
along with the baseline k-means-style and PAM algorithms.
10+
Furthermore, the (Medoid) Silhouette can be optimized by the
11+
FasterMSC, FastMSC, PAMMEDSIL and PAMSIL algorithms.
1012

1113
All algorithms expect a *distance matrix* and the number of clusters as input.
1214
They hence can be used with arbitrary dissimilarity functions.
@@ -111,8 +113,10 @@ Implemented Algorithms
111113
* :ref:`FastMSC<fastmsc>` (Lenssen and Schubert, 2022)
112114
* :ref:`PAMSIL<pamsil>` (Van der Laan and Pollard, 2003)
113115
* :ref:`PAMMEDSIL<pammedsil>` (Van der Laan and Pollard, 2003)
116+
* :ref:`Medoid Silhouette<medoid_silhouette>` (Van der Laan and Pollard, 2003)
114117

115-
Note that the k-means style "alternating" algorithm yields rather poor result quality.
118+
Note that the k-means style "alternating" algorithm yields rather poor result quality
119+
(see Schubert and Rousseeuw 2021 for an example and explanation).
116120

117121
.. _FasterPAM:
118122

@@ -149,13 +153,6 @@ PAM BUILD
149153

150154
.. autofunction:: pam_build
151155

152-
.. _Silhouette:
153-
154-
Silhouette
155-
=========
156-
157-
.. autofunction:: silhouette
158-
159156
.. _FasterMSC:
160157

161158
FasterMSC
@@ -184,6 +181,20 @@ PAMMEDSIL
184181

185182
.. autofunction:: pammedsil
186183

184+
.. _Silhouette:
185+
186+
Silhouette
187+
=========
188+
189+
.. autofunction:: silhouette
190+
191+
.. _MedoidSilhouette:
192+
193+
Medoid Silhouette
194+
=================
195+
196+
.. autofunction:: medoid_silhouette
197+
187198
.. _KMedoidsResult:
188199

189200
k-Medoids result object
@@ -205,6 +216,13 @@ sklearn-compatible API
205216
References
206217
==========
207218

219+
This software has been published in the Journal of Open-Source Software:
220+
221+
| Erich Schubert and Lars Lenssen:
222+
| **Fast k-medoids Clustering in Rust and Python**
223+
| Journal of Open Source Software 7(75), 4183
224+
| https://doi.org/10.21105/joss.04183 (open access)
225+
208226
For further details on the implemented algorithm FasterPAM, see:
209227

210228
| Erich Schubert, Peter J. Rousseeuw
@@ -251,9 +269,3 @@ License: GPL-3 or later
251269
You should have received a copy of the GNU General Public License
252270
along with this program. If not, see <https://www.gnu.org/licenses/>.
253271
254-
FAQ: Why GPL and not Apache/MIT/BSD?
255-
------------------------------------
256-
257-
Because copyleft software like Linux is what built the open-source community.
258-
259-
Tit for tat: you get to use my code, I get to use your code.

kmedoids/__init__.py

Lines changed: 65 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,11 @@
2121
2222
References:
2323
24+
| Erich Schubert and Lars Lenssen:
25+
| Fast k-medoids Clustering in Rust and Python
26+
| Journal of Open Source Software 7(75), 4183
27+
| https://doi.org/10.21105/joss.04183 (open access)
28+
2429
| Erich Schubert, Peter J. Rousseeuw
2530
| Fast and Eager k-Medoids Clustering:
2631
| O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms
@@ -603,7 +608,7 @@ def silhouette(diss, labels, samples=False, n_cpu=-1):
603608
:param n_cpu: number of threads to use (-1: automatic)
604609
:type n_cpu: int
605610
606-
:return: tuple containing the overall silhouette and the individual samples
611+
:return: tuple containing the average silhouette and the individual samples
607612
:rtype: (float, ndarray)
608613
"""
609614
import numpy as np, os
@@ -613,7 +618,7 @@ def silhouette(diss, labels, samples=False, n_cpu=-1):
613618
if not isinstance(diss, np.ndarray):
614619
diss = np.array(diss)
615620
if not isinstance(labels, np.ndarray):
616-
labels = np.array(labels, dtype=np.uint)
621+
labels = np.array(labels, dtype=np.uint64)
617622

618623
if isinstance(diss, np.ndarray):
619624
dtype = diss.dtype
@@ -623,11 +628,11 @@ def silhouette(diss, labels, samples=False, n_cpu=-1):
623628
if n_cpu > 1:
624629
assert not samples, "samples=true currently may only be used with n_cpu=1"
625630
if dtype == np.float32:
626-
return _par_silhouette_f32(diss, labels.astype(np.uint64), n_cpu)
631+
return (_par_silhouette_f32(diss, labels.astype(np.uint64), n_cpu), [])
627632
elif dtype == np.float64:
628-
return _par_silhouette_f64(diss, labels.astype(np.uint64), n_cpu)
633+
return (_par_silhouette_f64(diss, labels.astype(np.uint64), n_cpu), [])
629634
elif dtype == np.int32:
630-
return _par_silhouette_i32(diss, labels.astype(np.uint64), n_cpu)
635+
return (_par_silhouette_i32(diss, labels.astype(np.uint64), n_cpu), [])
631636
elif dtype == np.int64:
632637
raise ValueError("Input of int64 is currently not supported, as it could overflow the float64 used internally when computing Silhouette. Use diss.astype(numpy.float64) if that is acceptable and you have the necessary memory for this copy.")
633638
else:
@@ -641,6 +646,54 @@ def silhouette(diss, labels, samples=False, n_cpu=-1):
641646
raise ValueError("Input of int64 is currently not supported, as it could overflow the float64 used internally when computing Silhouette. Use diss.astype(numpy.float64) if that is acceptable and you have the necessary memory for this copy.")
642647
raise ValueError("Input data not supported. Use a numpy array of floats.")
643648

649+
def medoid_silhouette(diss, meds, samples=False):
650+
"""Medoid Silhouette index for cluster evaluation.
651+
652+
The Medoid Silhouette is an approximation to the Silhouette index, that
653+
uses the distance to the cluster medoids instead of the average distance
654+
to the other cluster members. If every point is assigned to the nearest
655+
medoid, the Medoid Silhouette of a point reduces to 1-a/b where a is the
656+
distance to the nearest, and b the distance to the second nearest medoid.
657+
If b is 0, the Medoid Silhouette is 1.
658+
659+
This function assumes you already have a distance matrix. It is not necessary
660+
to compute a distance matrix to evaluate the medoid silhouette -- only the
661+
distances between points and medoids are necessary. If you do not have a
662+
distance matrix, simply compute the medoid Silhouette directly, by computing
663+
(1) the N x k distance matrix to the medoids, (2) finding the two smallest values
664+
for each data point, and (3) computing the average of 1-a/b on these (with 0/0 as 0).
665+
This can be implemented in a few lines with numpy easily.
666+
667+
:param diss: square numpy array of dissimilarities
668+
:type diss: ndarray
669+
:param meds: medoid indexes (k distinct values in 0 to n-1)
670+
:type meds: ndarray of int
671+
:param samples: whether to return individual samples or not
672+
:type samples: boolean
673+
674+
:return: tuple containing the average Medoid Silhouette and the individual samples
675+
:rtype: (float, ndarray)
676+
"""
677+
import numpy as np, os
678+
from .kmedoids import _medoid_silhouette_i32, _medoid_silhouette_f32, _medoid_silhouette_f64
679+
680+
if not isinstance(diss, np.ndarray):
681+
diss = np.array(diss)
682+
if not isinstance(meds, np.ndarray):
683+
meds = np.array(meds, dtype=np.uint64)
684+
685+
if isinstance(diss, np.ndarray):
686+
dtype = diss.dtype
687+
if dtype == np.float32:
688+
return _medoid_silhouette_f32(diss, meds.astype(np.uint64), samples)
689+
elif dtype == np.float64:
690+
return _medoid_silhouette_f64(diss, meds.astype(np.uint64), samples)
691+
elif dtype == np.int32:
692+
return _medoid_silhouette_i32(diss, meds.astype(np.uint64), samples)
693+
elif dtype == np.int64:
694+
raise ValueError("Input of int64 is currently not supported, as it could overflow the float64 used internally when computing Silhouette. Use diss.astype(numpy.float64) if that is acceptable and you have the necessary memory for this copy.")
695+
raise ValueError("Input data not supported. Use a numpy array of floats.")
696+
644697
# This is a hack to make sklearn an optional dependency only:
645698
try:
646699
from sklearn.base import BaseEstimator, ClusterMixin, TransformerMixin
@@ -655,8 +708,13 @@ class KMedoids(SKLearnClusterer):
655708
656709
References:
657710
658-
| Erich Schubert, Peter J. Rousseeuw
659-
| Fast and Eager k-Medoids Clustering:
711+
| Erich Schubert and Lars Lenssen:
712+
| Fast k-medoids Clustering in Rust and Python
713+
| Journal of Open Source Software 7(75), 4183
714+
| https://doi.org/10.21105/joss.04183 (open access)
715+
716+
| Erich Schubert, Peter J. Rousseeuw:
717+
| Fast and Eager k-Medoids Clustering
660718
| O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms
661719
| Information Systems (101), 2021, 101804
662720
| https://doi.org/10.1016/j.is.2021.101804 (open access)

src/lib.rs

Lines changed: 33 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -225,6 +225,33 @@ par_silhouette_call!(par_silhouette_f64, f64);
225225
par_silhouette_call!(par_silhouette_i32, i32);
226226
// i64 not supported, as the f64 used internally may overflow
227227

228+
macro_rules! medoid_silhouette_call {
229+
($name:ident, $type: ty) => {
230+
/// Run the Medoid Silhouette index evaluation for $type precision
231+
///
232+
/// :param dist: distance matrix
233+
/// :type dist: ndarray
234+
/// :param meds: medoids indexes
235+
/// :type meds: ndarray
236+
/// :param samples: return the per-point Medoid Silhouette values
237+
/// :type samples: bool
238+
/// :return: Medoid Silhouette evaluation result
239+
/// :rtype: pair of Medoid Silhouette score and Medoid Silhouette coefficients per point
240+
#[pyfunction]
241+
fn $name(dist: PyReadonlyArray2<'_, $type>, meds: PyReadonlyArray1<'_, usize>, samples: bool) -> PyResult<Py<PyAny>> {
242+
assert_eq!(dist.ndim(), 2);
243+
assert_eq!(dist.shape()[0], dist.shape()[1]);
244+
let (sil, sils): (f64, _) = rustkmedoids::medoid_silhouette(&dist.as_array(), &meds.to_vec()?, samples);
245+
Python::with_gil(|py| -> PyResult<Py<PyAny>> {
246+
Ok((sil, PyArray1::from_vec(py, sils)).to_object(py))
247+
})
248+
}
249+
}}
250+
medoid_silhouette_call!(medoid_silhouette_f32, f32);
251+
medoid_silhouette_call!(medoid_silhouette_f64, f64);
252+
medoid_silhouette_call!(medoid_silhouette_i32, i32);
253+
// i64 not supported, as the f64 used internally may overflow
254+
228255
#[pymodule]
229256
#[allow(unused_variables)]
230257
fn kmedoids(py: Python, m: &PyModule) -> PyResult<()> {
@@ -263,14 +290,19 @@ fn kmedoids(py: Python, m: &PyModule) -> PyResult<()> {
263290
m.add("_silhouette_f32", wrap_pyfunction!(silhouette_f32, m)?)?;
264291
m.add("_silhouette_f64", wrap_pyfunction!(silhouette_f64, m)?)?;
265292
m.add("_silhouette_i32", wrap_pyfunction!(silhouette_i32, m)?)?;
293+
// not supported: m.add("_silhouette_i64", wrap_pyfunction!(silhouette_i64, m)?)?;
266294
m.add("_par_fasterpam_f32", wrap_pyfunction!(par_fasterpam_f32, m)?)?;
267295
m.add("_par_fasterpam_f64", wrap_pyfunction!(par_fasterpam_f64, m)?)?;
268296
m.add("_par_fasterpam_i32", wrap_pyfunction!(par_fasterpam_i32, m)?)?;
269297
m.add("_par_fasterpam_i64", wrap_pyfunction!(par_fasterpam_i64, m)?)?;
270298
m.add("_par_silhouette_f32", wrap_pyfunction!(par_silhouette_f32, m)?)?;
271299
m.add("_par_silhouette_f64", wrap_pyfunction!(par_silhouette_f64, m)?)?;
272300
m.add("_par_silhouette_i32", wrap_pyfunction!(par_silhouette_i32, m)?)?;
273-
// not supported: m.add("_silhouette_i64", wrap_pyfunction!(silhouette_i64, m)?)?;
301+
// not supported: m.add("_üarsilhouette_i64", wrap_pyfunction!(par_silhouette_i64, m)?)?;
302+
m.add("_medoid_silhouette_f32", wrap_pyfunction!(medoid_silhouette_f32, m)?)?;
303+
m.add("_medoid_silhouette_f64", wrap_pyfunction!(medoid_silhouette_f64, m)?)?;
304+
m.add("_medoid_silhouette_i32", wrap_pyfunction!(medoid_silhouette_i32, m)?)?;
305+
// not supported: m.add("_medoid_silhouette_i64", wrap_pyfunction!(medoid_silhouette_i64, m)?)?;
274306
Ok(())
275307
}
276308

tests/test_integration.py

Lines changed: 42 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -43,14 +43,53 @@ def test_alternating(self):
4343
assert np.array_equal(alt.medoids, alt_rust[2])
4444
assert alt.loss == alt_rust[0]
4545

46+
def test_pamsil(self):
47+
dist = np.array([[0, 2, 3, 4, 5], [2, 0, 6, 7, 8], [3, 6, 0, 9, 10], [4, 7, 9, 0, 11], [5, 8, 10, 11, 0]], dtype=np.float32)
48+
pamsil = kmedoids.pamsil(dist, 2)
49+
pamsil_rust = kmedoids.kmedoids._pamsil_swap_f32(dist, pamsil.medoids, 100)
50+
assert pamsil.loss == 0.5137878787878788
51+
assert pamsil.loss == pamsil_rust[0]
52+
assert np.array_equal(pamsil.medoids, pamsil_rust[2])
53+
54+
def test_pammedsil(self):
55+
dist = np.array([[0, 2, 3, 4, 5], [2, 0, 6, 7, 8], [3, 6, 0, 9, 10], [4, 7, 9, 0, 11], [5, 8, 10, 11, 0]], dtype=np.float32)
56+
pms = kmedoids.pammedsil(dist, 2)
57+
pms_rust = kmedoids.kmedoids._pammedsil_swap_f32(dist, pms.medoids, 100)
58+
assert pms.loss == 0.8172727272727272
59+
assert pms.loss == pms_rust[0]
60+
assert np.array_equal(pms.medoids, pms_rust[2])
61+
62+
def test_fastmsc(self):
63+
dist = np.array([[0, 2, 3, 4, 5], [2, 0, 6, 7, 8], [3, 6, 0, 9, 10], [4, 7, 9, 0, 11], [5, 8, 10, 11, 0]], dtype=np.float32)
64+
fmsc = kmedoids.fastmsc(dist, 2, init="build")
65+
fmsc_rust = kmedoids.kmedoids._fastmsc_f32(dist, fmsc.medoids, 100)
66+
assert fmsc.loss == 0.8172727272727272
67+
assert np.array_equal(fmsc.medoids, fmsc_rust[2])
68+
assert fmsc.loss == fmsc_rust[0]
69+
70+
def test_fastermsc(self):
71+
dist = np.array([[0, 2, 3, 4, 5], [2, 0, 6, 7, 8], [3, 6, 0, 9, 10], [4, 7, 9, 0, 11], [5, 8, 10, 11, 0]], dtype=np.float32)
72+
fmsc = kmedoids.fastermsc(dist, 2, init="build")
73+
fmsc_rust = kmedoids.kmedoids._fastermsc_f32(dist, fmsc.medoids, 100)
74+
assert fmsc.loss == 0.8172727272727272
75+
assert np.array_equal(fmsc.medoids, fmsc_rust[2])
76+
assert fmsc.loss == fmsc_rust[0]
77+
4678
def test_silhouette(self):
4779
dist = np.array([[0, 2, 3, 4, 5], [2, 0, 6, 7, 8], [3, 6, 0, 9, 10], [4, 7, 9, 0, 11], [5, 8, 10, 11, 0]], dtype=np.int32)
4880
pam = kmedoids.pam(dist, 2)
49-
sil = kmedoids.silhouette(dist, pam.labels)
81+
sil = kmedoids.silhouette(dist, pam.labels, n_cpu=1)
5082
par_sil = kmedoids.silhouette(dist, pam.labels, n_cpu=2)
5183
sil_rust = kmedoids.kmedoids._silhouette_i32(dist, pam.labels, False)
52-
assert sil == par_sil
53-
assert sil == sil_rust[0]
84+
assert sil[0] == par_sil
85+
assert sil[0] == sil_rust[0]
86+
87+
def test_medoid_silhouette(self):
88+
dist = np.array([[0, 2, 3, 4, 5], [2, 0, 6, 7, 8], [3, 6, 0, 9, 10], [4, 7, 9, 0, 11], [5, 8, 10, 11, 0]], dtype=np.float32)
89+
pam = kmedoids.fasterpam(dist, 2)
90+
sil = kmedoids.medoid_silhouette(dist, pam.labels)
91+
sil_rust = kmedoids.kmedoids._medoid_silhouette_f32(dist, pam.labels, False)
92+
assert sil[0] == sil_rust[0]
5493

5594
def test_sklearn_interface(self):
5695
dist = np.array([[0, 2, 3, 4, 5], [2, 0, 6, 7, 8], [3, 6, 0, 9, 10], [4, 7, 9, 0, 11], [5, 8, 10, 11, 0]], dtype=np.int32)

0 commit comments

Comments
 (0)