# Lab 8: $k$-means clustering

[COM6012 Scalable Machine Learning **2024**](https://github.com/COM6012/ScalableML) by [Shuo Zhou](https://shuo-zhou.github.io/) at The University of Sheffield

## Study schedule

- [Task 1](#1-k-means-clustering): To finish in the lab session on 19th April. **Essential**
- [Task 2](#2-exercises): To finish by the following Wednesday, 24th April. ***Exercise***
- [Task 3](#3-additional-ideas-to-explore-optional): To explore further. *Optional*

### Suggested reading

- Chapters *Clustering* and *RFM Analysis* of the [PySpark tutorial](https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf)
- [Clustering in Spark](https://spark.apache.org/docs/3.5.0/ml-clustering.html)
- [PySpark API on clustering](https://spark.apache.org/docs/3.5.0/api/python/reference/api/pyspark.ml.clustering.KMeans.html)
- [PySpark code on clustering](https://github.com/apache/spark/blob/master/python/pyspark/ml/clustering.py)
- [$k$-means clustering on Wiki](https://en.wikipedia.org/wiki/K-means_clustering)
- [$k$-means++ on Wiki](https://en.wikipedia.org/wiki/K-means%2B%2B)
- [$k$-means|| paper](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf)

## 1. $k$-means clustering

[$k$-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the most commonly used clustering algorithms. It partitions the data points into a predefined number of clusters. The Spark MLlib implementation includes a parallelized variant of the [$k$-means++](https://en.wikipedia.org/wiki/K-means%2B%2B) method called [$k$-means||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).

`KMeans` is implemented as an `Estimator` and generates a [`KMeansModel`](https://spark.apache.org/docs/3.5.0/api/python/reference/api/pyspark.ml.clustering.KMeansModel.html) as the base model.

[API](https://spark.apache.org/docs/3.5.0/api/python/reference/api/pyspark.ml.clustering.KMeans.html): `class pyspark.ml.clustering.KMeans(featuresCol='features', predictionCol='prediction', k=2, initMode='k-means||', initSteps=2, tol=0.0001, maxIter=20, seed=None, distanceMeasure='euclidean', weightCol=None)`

The following parameters are available:

- *k*: the number of desired clusters.
- *maxIter*: the maximum number of iterations.
- *initMode*: specifies either random initialization or initialization via $k$-means||.
- *initSteps*: the number of steps in the $k$-means|| initialization (default=2, advanced).
- *tol*: the distance threshold within which we consider $k$-means to have converged.
- *seed*: the **random seed** (so that multiple runs give the same results).
- *distanceMeasure*: either the Euclidean (default) or cosine distance measure.
- *weightCol*: optional weighting of data points.
Let us request 2 cores from a regular queue. We activate the environment as usual and then install `matplotlib` (if you have not done so already).

### Getting started

First log into the Stanage cluster

```sh
ssh $USER@stanage.shef.ac.uk
```

You need to replace `$USER` with your username (in **lowercase** and without the `$`).

Once logged in, we can request 2 CPU cores from the reserved resources by

```sh
srun --account=default --reservation=com6012-7 --cpus-per-task=2 --time=01:00:00 --pty /bin/bash
```

If the reserved resources are not available, request cores from the general queue by

```sh
srun --pty --cpus-per-task=2 bash -i
```

Now set up our conda environment, using

```sh
source myspark.sh # assuming you copied HPC/myspark.sh to your root directory (see Lab 1 Task 2)
```

if you created a `myspark.sh` script in Lab 1. If not, use

```sh
module load Java/17.0.4
module load Anaconda3/2022.05
source activate myspark
```

We will be generating plots as part of this lab, so you will need to install `matplotlib` if you have not done so already:

```sh
pip install matplotlib
```

Now we can start the PySpark shell with two CPU cores

```sh
cd com6012/ScalableML # our main working directory
pyspark --master local[2] # start pyspark with the 2 CPU cores requested above
```

If you experience a `segmentation fault` when entering the `pyspark` interactive shell, run `export LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8` to fix it. It is recommended to add this line to your `myspark.sh` file.

We will do some plotting in this lab. To plot and save figures on the HPC (which has no display attached), we need to select a non-interactive backend **before** importing pyplot:

```python
import matplotlib
matplotlib.use('Agg')  # must come before importing matplotlib.pyplot or pylab!
```

Now import the modules needed in this lab:

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.clustering import KMeansModel
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.linalg import Vectors
import matplotlib.pyplot as plt
```
### Clustering of simple synthetic data

Here, we study $k$-means clustering on a simple example with four well-separated data points, as follows.

```python
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])
kmeans = KMeans(k=2, seed=1)  # two clusters with seed = 1
model = kmeans.fit(df)
```

We examine the cluster centers (centroids) and use the trained model to "predict" the cluster index for a data point.

```python
centers = model.clusterCenters()
len(centers)
# 2
for center in centers:
    print(center)
# [0.5 0.5]
# [8.5 8.5]
model.predict(df.head().features)
# 0
```

We can use the trained model to cluster any data points in the same space, where the cluster index is given as the `prediction`.

```python
transformed = model.transform(df)
transformed.show()
# +---------+----------+
# | features|prediction|
# +---------+----------+
# |[0.0,0.0]|         0|
# |[1.0,1.0]|         0|
# |[9.0,8.0]|         1|
# |[8.0,9.0]|         1|
# +---------+----------+
```

We can examine the training summary for the trained model.

```python
model.hasSummary
# True
summary = model.summary
summary
# <pyspark.ml.clustering.KMeansSummary object at 0x2b1662948d30>
summary.k
# 2
summary.clusterSizes
# [2, 2]
summary.trainingCost  # sum of squared distances of points to their nearest center
# 2.0
```

You can check out the [KMeansSummary API](https://spark.apache.org/docs/3.5.0/api/java/org/apache/spark/ml/clustering/KMeansSummary.html) for details of the summary information, e.g., we can find out that the training cost is the sum of squared distances to the nearest centroid over all points in the training dataset.
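As a sanity check, the reported `trainingCost` of 2.0 can be reproduced by hand: sum the squared Euclidean distances from each of the four toy points to its assigned centroid. A minimal NumPy sketch, with the points and centroids copied from the output above:

```python
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
centers = np.array([[0.5, 0.5], [8.5, 8.5]])   # centroids reported by the model
assignment = [0, 0, 1, 1]                      # cluster index of each point

# Sum of squared distances of points to their assigned (nearest) centroid.
cost = sum(float(np.sum((p - centers[c]) ** 2)) for p, c in zip(points, assignment))
print(cost)  # 2.0
```

Each point contributes $0.5^2 + 0.5^2 = 0.5$, giving $4 \times 0.5 = 2.0$, which matches `summary.trainingCost`.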

### Save and load an algorithm/model

We can save an algorithm/model to a (here, temporary) location (see the [API on save](https://spark.apache.org/docs/3.5.0/api/python/reference/api/pyspark.ml.PipelineModel.html?highlight=pipelinemodel%20save#pyspark.ml.PipelineModel.save)) and then load it back later.

Save and load the $k$-means algorithm (settings):

```python
import tempfile

temp_path = tempfile.mkdtemp()
kmeans_path = temp_path + "/kmeans"
kmeans.save(kmeans_path)
kmeans2 = KMeans.load(kmeans_path)
kmeans2.getK()
# 2
```

Save and load the learned $k$-means model (note that only the learned model is saved, not the summary):

```python
model_path = temp_path + "/kmeans_model"
model.save(model_path)
model2 = KMeansModel.load(model_path)
model2.hasSummary
# False
model2.clusterCenters()
# [array([0.5, 0.5]), array([8.5, 8.5])]
```

### Iris clustering

Clustering of the [Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set) is a classical example [discussed on the Wikipedia page of $k$-means clustering](https://en.wikipedia.org/wiki/K-means_clustering#Discussion). This data set was introduced by [Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher), "the father of modern statistics and experimental design" (and thus machine learning) and also "the greatest biologist since Darwin". The code below is based on Chapter *Clustering* of the [PySpark tutorial](https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf), with some changes introduced.

#### Load and inspect the data

```python
df = spark.read.load("Data/iris.csv", format="csv", inferSchema="true", header="true").cache()
df.show(5, True)
# +------------+-----------+------------+-----------+-------+
# |sepal_length|sepal_width|petal_length|petal_width|species|
# +------------+-----------+------------+-----------+-------+
# |         5.1|        3.5|         1.4|        0.2| setosa|
# |         4.9|        3.0|         1.4|        0.2| setosa|
# |         4.7|        3.2|         1.3|        0.2| setosa|
# |         4.6|        3.1|         1.5|        0.2| setosa|
# |         5.0|        3.6|         1.4|        0.2| setosa|
# +------------+-----------+------------+-----------+-------+
# only showing top 5 rows
df.printSchema()
# root
#  |-- sepal_length: double (nullable = true)
#  |-- sepal_width: double (nullable = true)
#  |-- petal_length: double (nullable = true)
#  |-- petal_width: double (nullable = true)
#  |-- species: string (nullable = true)
```

We can use `.describe().show()` to inspect the statistics of the data:

```python
df.describe().show()
# +-------+------------------+-------------------+------------------+------------------+---------+
# |summary|      sepal_length|        sepal_width|      petal_length|       petal_width|  species|
# +-------+------------------+-------------------+------------------+------------------+---------+
# |  count|               150|                150|               150|               150|      150|
# |   mean| 5.843333333333335| 3.0540000000000007|3.7586666666666693|1.1986666666666672|     null|
# | stddev|0.8280661279778637|0.43359431136217375| 1.764420419952262|0.7631607417008414|     null|
# |    min|               4.3|                2.0|               1.0|               0.1|   setosa|
# |    max|               7.9|                4.4|               6.9|               2.5|virginica|
# +-------+------------------+-------------------+------------------+------------------+---------+
```

#### Convert the data to dense vectors (features)

Use a `transData` function similar to that in Lab 2 to convert the attributes into feature vectors.

```python
def transData(data):
    return data.rdd.map(lambda r: [Vectors.dense(r[:-1])]).toDF(['features'])

dfFeatureVec = transData(df).cache()
dfFeatureVec.show(5, False)
# +-----------------+
# |features         |
# +-----------------+
# |[5.1,3.5,1.4,0.2]|
# |[4.9,3.0,1.4,0.2]|
# |[4.7,3.2,1.3,0.2]|
# |[4.6,3.1,1.5,0.2]|
# |[5.0,3.6,1.4,0.2]|
# +-----------------+
# only showing top 5 rows
```
#### Determine $k$ via silhouette analysis

We can perform a [Silhouette Analysis](https://en.wikipedia.org/wiki/Silhouette_(clustering)) to determine $k$ by running $k$-means with multiple different values of $k$ and evaluating the clustering results. See [the ClusteringEvaluator API](https://spark.apache.org/docs/3.5.0/api/python/reference/api/pyspark.ml.evaluation.ClusteringEvaluator.html), where `silhouette` is the default metric. You can also refer to this [scikit-learn notebook on the same topic](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html). Other ways of determining the best $k$ can be found on [a dedicated wiki page](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set).

```python
import numpy as np

numK = 10
silhouettes = np.zeros(numK)
costs = np.zeros(numK)
for k in range(2, numK):  # k = 2, ..., 9
    kmeans = KMeans().setK(k).setSeed(11)
    model = kmeans.fit(dfFeatureVec)
    predictions = model.transform(dfFeatureVec)
    costs[k] = model.summary.trainingCost
    evaluator = ClusteringEvaluator()  # to compute the silhouette score
    silhouettes[k] = evaluator.evaluate(predictions)
```

We can take a look at the clustering results (the `prediction` column below is the cluster index/label).

```python
predictions.show(15)
# +-----------------+----------+
# |         features|prediction|
# +-----------------+----------+
# |[5.1,3.5,1.4,0.2]|         1|
# |[4.9,3.0,1.4,0.2]|         1|
# |[4.7,3.2,1.3,0.2]|         1|
# |[4.6,3.1,1.5,0.2]|         1|
# |[5.0,3.6,1.4,0.2]|         1|
# |[5.4,3.9,1.7,0.4]|         5|
# |[4.6,3.4,1.4,0.3]|         1|
# |[5.0,3.4,1.5,0.2]|         1|
# |[4.4,2.9,1.4,0.2]|         1|
# |[4.9,3.1,1.5,0.1]|         1|
# |[5.4,3.7,1.5,0.2]|         5|
# |[4.8,3.4,1.6,0.2]|         1|
# |[4.8,3.0,1.4,0.1]|         1|
# |[4.3,3.0,1.1,0.1]|         1|
# |[5.8,4.0,1.2,0.2]|         5|
# +-----------------+----------+
# only showing top 15 rows
```

Plot the cost (the sum of squared distances of points to their nearest centroid; the smaller the better) against $k$.

```python
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
ax.plot(range(2, numK), costs[2:numK], marker="o")
ax.set_xlabel('$k$')
ax.set_ylabel('Cost')
plt.grid()
plt.savefig("Output/Lab8_cost.png")
```

We can see that this cost measure is biased towards a large $k$. Let us plot the silhouette metric (the larger the better) against $k$.

```python
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
ax.plot(range(2, numK), silhouettes[2:numK], marker="o")
ax.set_xlabel('$k$')
ax.set_ylabel('Silhouette')
plt.grid()
plt.savefig("Output/Lab8_silhouette.png")
```

We can see that the silhouette measure is biased towards a small $k$. By the silhouette metric, we should choose $k=2$, but we know the ground-truth $k$ is 3 (read the [data description](https://archive.ics.uci.edu/ml/datasets/iris) or count the unique species). Therefore, this metric does not give the ideal result in this case either. [Determining the optimal number of clusters](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set) is an open problem.
## 2. Exercises

### Further study on iris clustering

Carry out some further studies on the iris clustering problem above.

1. Choose $k=3$ and evaluate the clustering results against the ground truth (class labels) using the [Normalized Mutual Information (NMI) available in scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html). You need to install `scikit-learn` in the `myspark` environment via `conda install -y scikit-learn`. This allows us to study the clustering quality when we know the true number of clusters.
2. Use multiple (e.g., 10 or 20) random seeds to generate different clustering results and plot the respective NMI values (with respect to the ground truth, with $k=3$ as in the question above) to observe the effect of initialisation.
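A useful property of NMI for Exercise 1: it compares two label assignments while ignoring the actual label values, so it is invariant to permuting the cluster indices. A minimal sketch of the metric on hypothetical toy labels (an assumption for illustration; in the lab you would instead collect the `species` column and the `prediction` column from the Spark DataFrame, e.g. via `.collect()`):

```python
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical ground-truth species codes and cluster indices for 9 points.
true_labels    = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cluster_labels = [1, 1, 1, 0, 0, 0, 2, 2, 2]  # same grouping, permuted indices

# Identical grouping gives a perfect score despite the different index values.
print(normalized_mutual_info_score(true_labels, cluster_labels))  # 1.0

# One wrongly clustered point lowers the score below 1.
cluster_labels[5] = 2
score = normalized_mutual_info_score(true_labels, cluster_labels)
print(0.0 < score < 1.0)  # True
```

NMI ranges from 0 (independent labelings) to 1 (identical partitions), which makes it convenient for comparing runs with different seeds in Exercise 2.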

## 3. Additional ideas to explore (*optional*)

### RFM Customer Value Analysis

- Follow Chapter *RFM Analysis* of the [PySpark tutorial](https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf) to perform an [RFM Customer Value Analysis](https://en.wikipedia.org/wiki/RFM_(customer_value)).
- The data can be downloaded from the [Online Retail Data Set](https://archive.ics.uci.edu/ml/datasets/online+retail) at UCI.
- Note the **data cleaning** step that checks for and removes rows containing null values via `.dropna()`. You may need to do the same when dealing with real data.
- The **data manipulation** steps are also useful to learn.

### Network intrusion detection

- The original task is a classification task. We can ignore the class labels and perform clustering on the data.
- Write a standalone program (and submit it as a batch job to the HPC) to do $k$-means clustering on the [KDDCUP1999 data](https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data) with 4M points. You may start with the smaller 10% subset.

### Color Quantization using K-Means

- Follow the scikit-learn example [Color Quantization using K-Means](https://scikit-learn.org/stable/auto_examples/cluster/plot_color_quantization.html#sphx-glr-auto-examples-cluster-plot-color-quantization-py) to perform the same using PySpark on your high-resolution photos.
