
Commit 52fe440

update lab 8
1 parent f45d54f commit 52fe440

1 file changed

Lines changed: 7 additions & 54 deletions


Lab 8- Scalable Decision trees.md

@@ -19,31 +19,11 @@ There are several challenges when implementing decision trees in a distributed s
You can find more technical details on the implementation of Decision Trees in Apache Spark in the youtube video [Scalable Decision Trees in Spark MLlib](https://www.youtube.com/watch?v=N453EV5gHRA&t=10m30s) by Manish Amde and the youtube video [Decision Trees on Spark](https://www.youtube.com/watch?v=3WS9OK3EXVA) by Joseph Bradley. These technical details are also reviewed in a [blog post on decision trees](https://databricks.com/blog/2014/09/29/scalable-decision-trees-in-mllib.html) and another [blog post on random forests](https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html).

**Dependencies.** For this lab, we need to install the ``matplotlib`` and `pandas` packages. Make sure you install the packages in the environment **myspark**.

Before you continue, open a new terminal in [ShARC](https://docs.hpc.shef.ac.uk/en/latest/hpc/index.html), use the `rse-com6012` queue with four nodes, and activate the **myspark** environment

`module load apps/java/jdk1.8.0_102/binary`

`module load apps/python/conda`

`source activate myspark`

You can now use pip to install the packages using

`pip install matplotlib pandas`

**You only need to install matplotlib and pandas in your environment once.**

## 1. Decision trees in PySpark

We will build a decision tree classifier that will be able to detect spam from the text in an email. We already saw this example using [scikit-learn](https://scikit-learn.org/stable/) in the previous module [COM6509 Machine Learning and Adaptive Intelligence](https://github.com/maalvarezl/MLAI). The Notebook is in [this link](https://colab.research.google.com/github/maalvarezl/MLAI/blob/master/Labs/Lab%203%20-%20Decision%20trees%20and%20ensemble%20methods.ipynb).

The dataset that we will use is from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php), where UCI stands for University of California Irvine. The UCI repository has long been a valuable resource in machine learning. It contains datasets for classification, regression, clustering and several other machine learning problems. These datasets are open source and have been uploaded by the contributors of many research articles.

The particular dataset that we will use is the [Spambase Dataset](http://archive.ics.uci.edu/ml/datasets/Spambase). A detailed description is in the previous link. The dataset contains 57 features related to word frequency, character frequency, and others related to capital letters. The description of the features and labels in the dataset is available [here](http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names). The output label indicates whether an email was considered 'ham' or 'spam', so it is a binary label.

After installing `matplotlib` and `pandas`, go to the folder `ScalableML` in your terminal and open pyspark.
In this lab, we will explore the performance of Decision Trees on the datasets we already used in the Notebook for Logistic Regression for Classification, [Lab 3](https://github.com/haipinglu/ScalableML/blob/master/Lab%203%20-%20Scalable%20Logistic%20Regression.md).

We now load the dataset and load the names of the features and label that we will use to create the schema for the dataframe. We also cache the dataframe since we are going to perform several operations on rawdata inside a loop.

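For readers following this excerpt on its own, a minimal sketch of this loading step is shown below. The file locations `./Data/spambase.data` and `./Data/spambase.names`, and the exact parsing of the names file, are assumptions for illustration rather than the lab's verbatim code.

```python
# Minimal sketch (assumed paths and parsing): load the data, build the list of
# feature names, and cache the dataframe because we will transform it repeatedly.
# 'spark' is the SparkSession created automatically by the pyspark shell.
rawdata = spark.read.csv('./Data/spambase.data')   # no header; columns are _c0, _c1, ...
rawdata.cache()
ncolumns = len(rawdata.columns)

spam_names = []
with open('./Data/spambase.names', 'r') as f:
    for line in f:
        # feature definitions look like "word_freq_make: continuous."
        if ':' in line and not line.startswith('|'):
            spam_names.append(line.split(':')[0])
spam_names.append('class')   # placeholder for the label column; renamed to 'labels' below
number_names = len(spam_names)
```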
@@ -62,7 +42,6 @@ for i in range(number_names):
We use the [<tt>withColumnRenamed</tt>](https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html?highlight=withcolumn#pyspark.sql.DataFrame.withColumnRenamed) method for the dataframe to rename the columns using the more familiar names for the features.
```python
schemaNames = rawdata.schema.names
spam_names[ncolumns-1] = 'labels'
@@ -74,7 +53,6 @@ Perhaps one of the most important operations when doing data analytics in Apache

Let us first see what the types of the original features are after reading the file.

```python
rawdata.printSchema()
```
@@ -138,7 +116,7 @@ rawdata.printSchema()
|-- capital_run_length_longest: string (nullable = true)
|-- capital_run_length_total: string (nullable = true)
|-- labels: string (nullable = true)
We notice that all the features and the label are of type `String`. We import the <tt>String</tt> type from pyspark.sql.types, and later use the [<tt>withColumn</tt>](https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html?highlight=withcolumn#pyspark.sql.DataFrame.withColumn) method for the dataframe to `cast()` each column to `Double`.
```python
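# A hedged sketch of the casting step (the lab's own loop may differ in detail):
# cast every column that was read in as a string to a double.
from pyspark.sql.types import StringType, DoubleType
for c in rawdata.columns:
    if isinstance(rawdata.schema[c].dataType, StringType):
        rawdata = rawdata.withColumn(c, rawdata[c].cast(DoubleType()))
rawdata.printSchema()
```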
@@ -215,7 +193,7 @@ rawdata.printSchema()
|-- capital_run_length_longest: double (nullable = true)
|-- capital_run_length_total: double (nullable = true)
|-- labels: double (nullable = true)
We now have a dataframe that contains several columns corresponding to the features, of type double, and the last column corresponding to the labels, also of type double.
We can now start the machine learning analysis by creating the training and test set and then designing the DecisionTreeClassifier using the training data.
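A rough sketch of these steps is shown below; the variable names (`trainingData`, `vecAssembler`, `dt`) and the parameter values are assumptions for illustration, not necessarily the lab's exact choices.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Sketch: split the data, assemble the 57 feature columns into a single vector column,
# and fit a decision tree classifier on the assembled training data.
trainingData, testData = rawdata.randomSplit([0.7, 0.3], 42)
vecAssembler = VectorAssembler(inputCols=spam_names[0:ncolumns-1], outputCol='features')
vecTrainingData = vecAssembler.transform(trainingData)
dt = DecisionTreeClassifier(labelCol="labels", featuresCol="features",
                            maxDepth=10, impurity='entropy')
model = dt.fit(vecTrainingData)
```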
@@ -259,7 +237,7 @@ vecTrainingData.select("features", "labels").show(5)
|(57,[54,55,56],[1...| 0.0|
+--------------------+------+
only showing top 5 rows
The [DecisionTreeClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassifier.html#pyspark.ml.classification.DecisionTreeClassifier) implemented in PySpark has several parameters to tune. Some of them are
> **maxDepth**: it corresponds to the maximum depth of the tree. The default is 5.<p>
@@ -303,7 +281,6 @@ plt.savefig("./Output/feature_importances.png")

The feature with the highest importance is

```python
spam_names[np.argmax(imp_feat)]
```
@@ -312,7 +289,6 @@ spam_names[np.argmax(imp_feat)]

We can visualise the DecisionTree in the form of *if-then-else* statements.

```python
print(model.toDebugString)
```
@@ -607,14 +583,11 @@ print(model.toDebugString)
Predict: 0.0
Else (feature 51 > 0.4325)
Predict: 1.0

Indirectly, decision trees allow feature selection: features used to make decisions near the top of the tree are more relevant to the decision problem.

We can organise the information provided by the visualisation above in the form of a table using Pandas.

```python
import pandas as pd
featureImp = pd.DataFrame(
@@ -623,9 +596,6 @@ featureImp = pd.DataFrame(
featureImp.sort_values(by="importance", ascending=False)
```

<div>
<table border="1" class="dataframe">
<thead>
@@ -925,13 +895,10 @@ featureImp.sort_values(by="importance", ascending=False)
</table>
</div>
A better visualisation of the tree in pyspark can be obtained by using, for example, [spark-tree-plotting](https://github.com/julioasotodv/spark-tree-plotting). The trick is to convert the spark tree to a JSON format. Once you have the JSON format, you can visualise it using [D3](https://d3js.org/) or you can transform from JSON to DOT and use graphviz as we did in scikit-learn for the Notebook in MLAI.

**Pipeline.** We have not mentioned the test data yet. Before applying the decision tree to the test data, this is a good opportunity to introduce a pipeline that includes the VectorAssembler and the Decision Tree.

```python
from pyspark.ml import Pipeline

@@ -944,7 +911,6 @@ pipelineModel = pipeline.fit(trainingData)
We finally use the [MulticlassClassificationEvaluator](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.MulticlassClassificationEvaluator.html?highlight=multiclassclassificationevaluator#pyspark.ml.evaluation.MulticlassClassificationEvaluator) tool to assess the accuracy on the test set.
```python
predictions = pipelineModel.transform(testData)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
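# A hedged sketch of the evaluation step elided by the diff (variable names assumed):
evaluator = MulticlassClassificationEvaluator(labelCol="labels", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Accuracy = %g " % accuracy)
```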
@@ -956,7 +922,6 @@ print("Accuracy = %g " % accuracy)

Accuracy = 0.915679

### Decision trees for regression

The main difference between Decision Trees for Classification and Decision Trees for Regression is the impurity measure used. For regression, PySpark uses the variance of the targets as the impurity measure.
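As a brief illustration of the corresponding API (reusing the assumed variable names from the classification sketch above; since the spam labels are binary, this is only an API illustration, not a sensible regression problem):

```python
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# Sketch: a regression tree (impurity is variance) evaluated with RMSE on held-out data.
dtr = DecisionTreeRegressor(labelCol="labels", featuresCol="features",
                            maxDepth=5, impurity='variance')
model_reg = dtr.fit(vecTrainingData)
preds_reg = model_reg.transform(vecAssembler.transform(testData))
evaluator_reg = RegressionEvaluator(labelCol="labels", predictionCol="prediction",
                                    metricName="rmse")
print("RMSE = %g " % evaluator_reg.evaluate(preds_reg))
```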
@@ -972,7 +937,7 @@ You will have the opportunity to experiment with the [DecisionTreeRegressor](htt

## 2. Ensemble methods

We studied the implementation of ensemble methods in scikit-learn in COM6509. See [this notebook](https://colab.research.google.com/github/maalvarezl/MLAI/blob/master/Labs/Lab%203%20-%20Decision%20trees%20and%20ensemble%20methods.ipynb) for a refresher.

PySpark implements two types of tree ensembles: random forests and gradient boosting. The main difference between the two methods is the way in which they combine the different trees that compose the ensemble.

@@ -983,7 +948,7 @@ The variant of Random Forests implemented in Apache Spark is also known as baggi
Besides the parameters that we already mentioned for the [DecisionTreeClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassifier.html#pyspark.ml.classification.DecisionTreeClassifier) and the [DecisionTreeRegressor](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.DecisionTreeRegressor.html), the [RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html) and the [RandomForestRegressor](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.RandomForestRegressor.html) in PySpark require three additional parameters:
> **numTrees** the total number of trees to train<p>
**featureSubsetStrategy** number of features to use as candidates for splitting at each tree node. Options include all, onethird, sqrt, log2, [1-n]<p>
**subsamplingRate**: size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset.
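
A minimal sketch of setting these three parameters on a classifier (the specific values are arbitrary illustrations, not recommendations):

```python
from pyspark.ml.classification import RandomForestClassifier

# Sketch: a random forest classifier with the ensemble-specific parameters set explicitly.
rfc = RandomForestClassifier(labelCol="labels", featuresCol="features",
                             numTrees=10, featureSubsetStrategy="sqrt", subsamplingRate=0.8)
```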

We already did an example of classification with decision trees. Let us now use random forests to perform regression.

@@ -1017,16 +982,13 @@ rawdataw.printSchema()
|-- sulphates: string (nullable = true)
|-- alcohol: string (nullable = true)
|-- quality: string (nullable = true)

We now follow a very familiar procedure to get the dataset to a format that can be input to Spark MLlib, which consists of:
1. transforming the data from type string to type double.
2. creating a pipeline that includes a vector assembler and a random forest regressor.

We start by transforming the data types.

```python
from pyspark.sql.types import StringType
from pyspark.sql.functions import col
@@ -1039,7 +1001,6 @@ rawdataw = rawdataw.withColumnRenamed('quality', 'labels')

Notice that we used the withColumnRenamed method to rename the target feature from 'quality' to 'labels'.

```python
rawdataw.printSchema()
```
@@ -1057,26 +1018,21 @@ rawdataw.printSchema()
|-- sulphates: double (nullable = true)
|-- alcohol: double (nullable = true)
|-- labels: double (nullable = true)

We now partition the data into a training and a test set

```python
trainingDataw, testDataw = rawdataw.randomSplit([0.7, 0.3], 42)
```

Now, we create the pipeline. First, we create the vector assembler.

```python
vecAssemblerw = VectorAssembler(inputCols=StringColumns[:-1], outputCol="features")
```

And now, the Random Forest regressor and the pipeline

```python
from pyspark.ml.regression import RandomForestRegressor
rf = RandomForestRegressor(labelCol="labels", featuresCol="features", maxDepth=5, numTrees=3, \
@@ -1090,7 +1046,6 @@ pipelineModelw = pipeline.fit(trainingDataw)

We now apply the pipeline to the test data and compute the RMSE between the predictions and the ground truth

```python
predictions = pipelineModelw.transform(testDataw)

@@ -1185,8 +1140,6 @@ featureImp.sort_values(by="importance", ascending=False)
</table>
</div>

### Gradient Boosting

In [Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting) or [Gradient-boosted trees](https://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting) (GBT), each tree in the ensemble is trained sequentially: the first tree is trained as usual on the training data, the second tree is trained on the residuals between the predictions of the first tree and the labels of the training data, the third tree is trained on the residuals of the predictions of the second tree, and so on. The prediction of the ensemble is the sum of the predictions of the individual trees. The type of residuals used is determined by the loss function being minimised. In the PySpark implementation of Gradient-Boosted trees, the loss function for binary classification is the log loss and the loss function for regression is either the squared error or the absolute error. For details, follow this [link](https://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts).
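
As a hedged sketch of how a gradient-boosted regressor could be dropped into the same wine-quality pipeline used above (the parameter values are illustrative assumptions):

```python
from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor

# Sketch: a gradient-boosted tree regressor reusing the wine-quality vector assembler;
# maxIter sets the number of boosting iterations (i.e. trees added sequentially).
gbt = GBTRegressor(labelCol="labels", featuresCol="features", maxDepth=5, maxIter=10)
pipeline_gbt = Pipeline(stages=[vecAssemblerw, gbt])
pipelineModel_gbt = pipeline_gbt.fit(trainingDataw)
```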
@@ -1226,7 +1179,7 @@ print("RMSE = %g " % rmse)

## 3. Exercises

**Note**: A *reference* solution will be provided in Blackboard for this part by the following Thursday (27.04.2023).
### Exercise 1
