This repository was archived by the owner on Feb 9, 2025. It is now read-only.
-[$k$-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the most commonly used clustering algorithms that clusters the data points into a predefined number of clusters. The Spark MLlib implementation includes a parallelized variant of the [$k$-means++](https://en.wikipedia.org/wiki/K-means%2B%2B) method called [$k$-means||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
+[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the most commonly used clustering algorithms that clusters the data points into a predefined number of clusters. The Spark MLlib implementation includes a parallelized variant of the [k-means++](https://en.wikipedia.org/wiki/K-means%2B%2B) method called [k-means||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
`KMeans` is implemented as an `Estimator` and generates a [`KMeansModel`](https://spark.apache.org/docs/3.5.0/api/python/reference/api/pyspark.ml.clustering.KMeansModel.html) as the base model.
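
For intuition about what such an estimator computes, the following is a minimal single-machine sketch in plain Python of k-means++ seeding followed by Lloyd iterations. It is not Spark's k-means|| implementation and does not use the MLlib API; the helper names (`sq_dist`, `kmeans_pp_init`, `kmeans`) and the toy data are assumptions made for illustration only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each new centre is sampled with probability
    proportional to its squared distance from the nearest chosen centre."""
    centres = [rng.choice(points)]
    while len(centres) < k:
        weights = [min(sq_dist(p, c) for c in centres) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a centre; fall back to a uniform pick.
            centres.append(rng.choice(points))
        else:
            centres.append(rng.choices(points, weights=weights, k=1)[0])
    return centres


def assign(points, centres):
    """Index of the nearest centre for every point."""
    return [
        min(range(len(centres)), key=lambda j: sq_dist(p, centres[j]))
        for p in points
    ]


def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centres = kmeans_pp_init(points, k, rng)
    for _ in range(iters):
        labels = assign(points, centres)
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # Keep an empty cluster's centre where it was.
                new_centres.append(centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    # Labels are recomputed so they match the final centres.
    return centres, assign(points, centres)


# Toy data (hypothetical): two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, labels = kmeans(points, k=2)
print(sorted(centres), labels)
```

The predefined number of clusters corresponds to `k` here; on a cluster, Spark distributes the assignment and update steps rather than looping over a Python list.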
@@ -37,8 +37,6 @@ The following parameters are available:
-*distanceMeasure*: either Euclidean (default) or cosine distance measure
-*weightCol*: optional weighting of data points
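
To make these two parameters concrete, the sketch below contrasts Euclidean and cosine distance and shows how per-point weights shift a centroid. It assumes that cosine distance means 1 minus cosine similarity and that weights enter as a weighted mean; the helper names are hypothetical and are not part of the MLlib API.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Assumed definition: 1 - cosine similarity (undefined for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def weighted_centroid(points, weights):
    """Weighted mean of the points, one coordinate at a time."""
    total = sum(weights)
    dim = len(points[0])
    return tuple(
        sum(w * p[d] for p, w in zip(points, weights)) / total
        for d in range(dim)
    )


# Two vectors pointing the same way but with different lengths.
a, b = (1.0, 0.0), (3.0, 0.0)
d_euc = euclidean_distance(a, b)  # 2.0: the lengths differ
d_cos = cosine_distance(a, b)     # 0.0: the directions agree

# A weight of 3 pulls the centroid three quarters of the way to (4, 0).
centroid = weighted_centroid([(0.0, 0.0), (4.0, 0.0)], [1.0, 3.0])  # (3.0, 0.0)
```

Cosine distance ignores vector length, so it tends to suit data where direction matters more than magnitude, whereas Euclidean distance treats the two vectors above as far apart.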
-Let us request for 2 cores using a regular queue. We activate the environment as usual and then install `matplotlib` (if you have not done so).
-
### Getting started
First log into the Stanage cluster
@@ -52,7 +50,7 @@ You need to replace `$USER` with your username (using **lowercase** and without
Once logged in, we can request 2 cpu cores from reserved resources by
-source myspark.sh # assuming you copied HPC/myspark.sh to your root directory (see Lab 1 Task 2)
+source myspark.sh # assuming you copied HPC/myspark.sh to your root directory (see Lab 1, Task 2)
```
if you created a `myspark.sh` script in Lab 1. If not, use
@@ -354,4 +352,4 @@ Carry out some further studies on the iris clustering problem above.
### Color Quantization using K-Means
-- Follow the scikit-learn example [Color Quantization using K-Means](https://scikit-learn.org/stable/auto_examples/cluster/plot_color_quantization.html#sphx-glr-auto-examples-cluster-plot-color-quantization-py) to perform the same using PySpark on your high-resolution photos.
+- Follow the scikit-learn example [Color Quantization using K-Means](https://scikit-learn.org/stable/auto_examples/cluster/plot_color_quantization.html#sphx-glr-auto-examples-cluster-plot-color-quantization-py) to perform the same using PySpark on your high-resolution photos.
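
Once cluster centres are available, the quantization step amounts to replacing every pixel with its nearest centre. The following plain-Python sketch shows only that mapping step. It is neither the scikit-learn example nor a PySpark solution; the palette and pixel values are made up, and `nearest_index` and `quantize` are hypothetical helpers.

```python
def nearest_index(pixel, palette):
    """Index of the palette colour closest to the pixel (squared Euclidean)."""
    return min(
        range(len(palette)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(pixel, palette[i])),
    )


def quantize(pixels, palette):
    """Replace each RGB pixel with its nearest palette colour."""
    return [palette[nearest_index(px, palette)] for px in pixels]


# Hypothetical palette standing in for fitted cluster centres.
palette = [(0, 0, 0), (255, 255, 255), (255, 0, 0)]

# A few made-up RGB pixels from a flattened image.
pixels = [(10, 5, 0), (250, 240, 245), (200, 30, 20), (128, 128, 128)]

quantized = quantize(pixels, palette)
print(quantized)
```

In a PySpark version, the palette would presumably come from the centres of a fitted k-means model, and the per-pixel lookup would run over a DataFrame of pixel rows rather than a Python list.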