This repository was archived by the owner on Feb 9, 2025. It is now read-only.

Commit 2411f8d

Updated Lab 8
1 parent 65e66ac commit 2411f8d

1 file changed

Lines changed: 7 additions & 9 deletions

Lab 8 - Scalable k-means clustering.md

@@ -14,13 +14,13 @@
1414
- [Clustering in Spark](https://spark.apache.org/docs/3.5.0/ml-clustering.html)
1515
- [PySpark API on clustering](https://spark.apache.org/docs/3.5.0/api/python/reference/api/pyspark.ml.clustering.KMeans.html)
1616
- [PySpark code on clustering](https://github.com/apache/spark/blob/master/python/pyspark/ml/clustering.py)
17-
- [$k$-means clustering on Wiki](https://en.wikipedia.org/wiki/K-means_clustering)
18-
- [$k$-means++ on Wiki](https://en.wikipedia.org/wiki/K-means%2B%2B)
19-
- [$k$-means|| paper](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf)
17+
- [k-means clustering on Wiki](https://en.wikipedia.org/wiki/K-means_clustering)
18+
- [k-means++ on Wiki](https://en.wikipedia.org/wiki/K-means%2B%2B)
19+
- [k-means|| paper](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf)
2020

2121
## 1. $k$-means clustering
2222

23-
[$k$-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the most commonly used clustering algorithms that clusters the data points into a predefined number of clusters. The Spark MLlib implementation includes a parallelized variant of the [$k$-means++](https://en.wikipedia.org/wiki/K-means%2B%2B) method called [$k$-means||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
23+
[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the most commonly used clustering algorithms that clusters the data points into a predefined number of clusters. The Spark MLlib implementation includes a parallelized variant of the [k-means++](https://en.wikipedia.org/wiki/K-means%2B%2B) method called [k-means||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
2424

2525
`KMeans` is implemented as an `Estimator` and generates a [`KMeansModel`](https://spark.apache.org/docs/3.5.0/api/python/reference/api/pyspark.ml.clustering.KMeansModel.html) as the base model.
2626

@@ -37,8 +37,6 @@ The following parameters are available:
3737
- *distanceMeasure*: either Euclidean (default) or cosine distance measure
3838
- *weightCol*: optional weighting of data points
3939

40-
Let us request for 2 cores using a regular queue. We activate the environment as usual and then install `matplotlib` (if you have not done so).
41-
4240
### Getting started
4341

4442
First log into the Stanage cluster
@@ -52,7 +50,7 @@ You need to replace `$USER` with your username (using **lowercase** and without
5250
Once logged in, we can request 2 cpu cores from reserved resources by
5351

5452
```sh
55-
srun --account=default --reservation=com6012-7 --cpus-per-task=2 --time=01:00:00 --pty /bin/bash
53+
srun --account=default --reservation=com6012-8 --cpus-per-task=2 --time=01:00:00 --pty /bin/bash
5654
```
5755

5856
if the reserved resources are not available, request core from the general queue by
@@ -64,7 +62,7 @@ srun --pty --cpus-per-task=2 bash -i
6462
Now set up our conda environment, using
6563

6664
```sh
67-
source myspark.sh # assuming you copied HPC/myspark.sh to your root directory (see Lab 1 Task 2)
65+
source myspark.sh # assuming you copied HPC/myspark.sh to your root directory (see Lab 1, Task 2)
6866
```
6967

7068
if you created a `myspark.sh` script in Lab 1. If not, use
@@ -354,4 +352,4 @@ Carry out some further studies on the iris clustering problem above.
354352

355353
### Color Quantization using K-Means
356354

357-
- Follow the scikit-learn example [Color Quantization using K-Means](https://scikit-learn.org/stable/auto_examples/cluster/plot_color_quantization.html#sphx-glr-auto-examples-cluster-plot-color-quantization-py) to perform the same using PySpark on your high-resolution photos.
355+
- Follow the scikit-learn example [Color Quantization using K-Means](https://scikit-learn.org/stable/auto_examples/cluster/plot_color_quantization.html#sphx-glr-auto-examples-cluster-plot-color-quantization-py) to perform the same using PySpark on your high-resolution photos.
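
Note on the lab text touched by this commit: section 1 describes k-means as clustering data points into a predefined number of clusters. A minimal pure-Python sketch of that idea (plain Lloyd iterations) is given here for orientation only. It is not Spark's implementation: the function name `kmeans`, the toy points, and the hand-picked starting centers are all assumptions made for illustration, whereas Spark MLlib initialises with k-means|| and distributes the work across executors.

```python
import math


def assign(points, centers):
    """Return, for each point, the index of its nearest center (Euclidean)."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]


def kmeans(points, init_centers, max_iter=20):
    """Plain Lloyd iterations; `init_centers` fixes the number of clusters k."""
    centers = [tuple(c) for c in init_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Move the center to the coordinate-wise mean of its members.
                new_centers.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members))
                )
            else:
                # Keep an empty cluster's center where it was.
                new_centers.append(old)
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical toy data: two well-separated groups in the plane.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, init_centers=[(0, 0), (10, 10)])
print(centers)
print(labels)
```

In the lab itself the same two steps (assign each point to its nearest center, then recompute the centers) are carried out by the `KMeans` estimator from `pyspark.ml.clustering`, with the number of clusters set through its `k` parameter.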
