# Lab 8: Scalable Decision Trees
There are several challenges when implementing decision trees in a distributed setting.
You can find more technical details on the implementation of Decision Trees in Apache Spark in the YouTube video [Scalable Decision Trees in Spark MLlib](https://www.youtube.com/watch?v=N453EV5gHRA&t=10m30s) by Manish Amde and the YouTube video [Decision Trees on Spark](https://www.youtube.com/watch?v=3WS9OK3EXVA) by Joseph Bradley. These technical details are also reviewed in a [blog post on decision trees](https://databricks.com/blog/2014/09/29/scalable-decision-trees-in-mllib.html) and another [blog post on random forests](https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html).
**Dependencies.** For this lab, we need to install the `matplotlib` and `pandas` packages. Make sure you install the packages in the environment **myspark**.
Before you continue, open a new terminal in [ShARC](https://docs.hpc.shef.ac.uk/en/latest/hpc/index.html), use the `rse-com6012` queue with four nodes, and activate the **myspark** environment:
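One way to request such a session (this exact command is an assumption based on the scheduler syntax used in earlier labs; check the module notes for the precise flags) is

`qrshx -P rse-com6012 -pe smp 4`

Then load the modules and activate the environment: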
`module load apps/java/jdk1.8.0_102/binary`
`module load apps/python/conda`
`source activate myspark`
You can now use pip to install the packages using:
`pip install matplotlib pandas`
**You only need to install matplotlib and pandas in your environment once.**
## 1. Decision trees in PySpark
We will build a decision tree classifier able to detect spam from the text of an email. We already saw this example using [scikit-learn](https://scikit-learn.org/stable/) in the previous module, [COM6509 Machine Learning and Adaptive Intelligence](https://github.com/maalvarezl/MLAI). The notebook is available at [this link](https://colab.research.google.com/github/maalvarezl/MLAI/blob/master/Labs/Lab%203%20-%20Decision%20trees%20and%20ensemble%20methods.ipynb).
The dataset that we will use is from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php), where UCI stands for the University of California, Irvine. The UCI repository has long been a valuable resource in machine learning: it contains open datasets for classification, regression, clustering and several other machine learning problems, uploaded by the contributors of many research articles.
The particular dataset that we will use is the [Spambase Dataset](http://archive.ics.uci.edu/ml/datasets/Spambase); a detailed description is available at that link. The dataset contains 57 features related to word frequency, character frequency, and capital letters. The description of the features and the label is available [here](http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names). The output label indicates whether an email was considered 'ham' or 'spam', so it is a binary label.
After installing `matplotlib` and `pandas`, go to the folder `ScalableML` in your terminal and open pyspark.
In this lab, we will explore the performance of Decision Trees on the datasets we already used in the notebook on scalable logistic regression, [Lab 3](https://github.com/haipinglu/ScalableML/blob/master/Lab%203%20-%20Scalable%20Logistic%20Regression.md).
We now load the dataset together with the names of the features and the label, which we will use to create the schema for the dataframe. We also cache the dataframe, since we are going to perform several operations on `rawdata` inside a loop.
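A minimal sketch of this step (the paths under `Data/` are assumptions; adjust them to your local copy of the Spambase files):

```python
# Hypothetical paths; adjust to wherever you keep the Spambase files.
rawdata = spark.read.csv('./Data/spambase.data')
rawdata.cache()  # rawdata is reused many times below, so keep it in memory
ncolumns = len(rawdata.columns)

# spambase.names lists one feature per line, e.g. "word_freq_make: continuous.";
# comment lines start with '|'. Keep the 57 feature names and append a
# placeholder for the output column, renamed to 'labels' below.
spam_names = [line.split(':')[0] for line in open('./Data/spambase.names')
              if 'continuous' in line and not line.startswith('|')]
spam_names.append('class')
```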
We use the [<tt>withColumnRenamed</tt>](https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html?highlight=withcolumn#pyspark.sql.DataFrame.withColumnRenamed) method of the dataframe to rename the columns using the more familiar names for the features.
```python
schemaNames = rawdata.schema.names
spam_names[ncolumns - 1] = 'labels'
for i in range(ncolumns):
    rawdata = rawdata.withColumnRenamed(schemaNames[i], spam_names[i])
```
Perhaps one of the most important operations when doing data analytics in Apache Spark is preprocessing the dataset so that it can be analysed with the MLlib package.
Let us first check the type of the original features after reading the file. We notice that all the features and the label are of type `String`. We import <tt>DoubleType</tt> from pyspark.sql.types, and then use the [<tt>withColumn</tt>](https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html?highlight=withcolumn#pyspark.sql.DataFrame.withColumn) method of the dataframe to `cast()` each column to `Double`.
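One way to do the casting, looping over every column (including the label):

```python
from pyspark.sql.types import DoubleType

for i in range(ncolumns):
    rawdata = rawdata.withColumn(spam_names[i],
                                 rawdata[spam_names[i]].cast(DoubleType()))
```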
We now have a dataframe that contains several columns corresponding to the features, all of type double, and a last column corresponding to the labels, also of type double.
We can now start the machine learning analysis by creating the training and test sets and then designing the DecisionTreeClassifier using the training data. The [DecisionTreeClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassifier.html#pyspark.ml.classification.DecisionTreeClassifier) implemented in PySpark has several parameters to tune. Some of them are:
> **maxDepth**: the maximum depth of the tree. The default is 5.
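Putting these pieces together, a sketch of the split and of fitting the classifier (the 70/30 split, the seed and the `maxDepth` value are assumptions, not the lab's exact settings); this produces the `model` object used below:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

trainingData, testData = rawdata.randomSplit([0.7, 0.3], seed=42)

assembler = VectorAssembler(inputCols=spam_names[:-1], outputCol='features')
dt = DecisionTreeClassifier(labelCol='labels', featuresCol='features', maxDepth=10)
model = dt.fit(assembler.transform(trainingData))
```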
We can visualise the DecisionTree in the form of *if-then-else* statements.
```python
print(model.toDebugString)
```
The end of the printed tree looks like this:

```
   Predict: 0.0
  Else (feature 51 > 0.4325)
   Predict: 1.0
```
Indirectly, decision trees allow feature selection: features that appear near the top of the tree are the most relevant for the decision problem.
We can organise the information provided by the visualisation above in the form of a table using Pandas.
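For instance, a minimal sketch based on the fitted model's `featureImportances` attribute (the exact table built in the lab may differ):

```python
import pandas as pd

fi = model.featureImportances.toArray()  # one importance score per feature
fi_table = pd.DataFrame({'feature': spam_names[:-1], 'importance': fi})
print(fi_table.sort_values('importance', ascending=False).head(10))
```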
A better visualisation of the tree in PySpark can be obtained by using, for example, [spark-tree-plotting](https://github.com/julioasotodv/spark-tree-plotting). The trick is to convert the Spark tree to JSON format. Once you have the JSON, you can visualise it using [D3](https://d3js.org/), or transform it from JSON to DOT and use graphviz, as we did with scikit-learn in the MLAI notebook.
**Pipeline.** We have not used the test data yet. Before applying the decision tree to the test data, this is a good opportunity to introduce a pipeline that includes the VectorAssembler and the decision tree.
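A minimal sketch, reusing the `assembler` and `dt` stages defined above:

```python
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[assembler, dt])
pipelineModel = pipeline.fit(trainingData)
```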
We finally use the [MulticlassClassificationEvaluator](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.MulticlassClassificationEvaluator.html?highlight=multiclassclassificationevaluator#pyspark.ml.evaluation.MulticlassClassificationEvaluator) tool to assess the accuracy on the test set.
```python
predictions = pipelineModel.transform(testData)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol='labels', metricName='accuracy')
print("Accuracy = %g " % evaluator.evaluate(predictions))
```
The main difference between Decision Trees for Classification and Decision Trees for Regression is the impurity measure used: for regression, PySpark uses the variance of the target features as the impurity measure.
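For instance, a regressor is instantiated much like the classifier (a sketch; `'variance'` is the only impurity supported for regression):

```python
from pyspark.ml.regression import DecisionTreeRegressor

dtr = DecisionTreeRegressor(labelCol='labels', featuresCol='features',
                            maxDepth=5, impurity='variance')
```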
You will have the opportunity to experiment with the [DecisionTreeRegressor](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.DecisionTreeRegressor.html).
## 2. Ensemble methods
We studied the implementation of ensemble methods in scikit-learn in COM6509. See [this notebook](https://colab.research.google.com/github/maalvarezl/MLAI/blob/master/Labs/Lab%203%20-%20Decision%20trees%20and%20ensemble%20methods.ipynb) for a refresher.
PySpark implements two types of tree ensembles: random forests and gradient boosting. The main difference between the two methods is the way in which they combine the different trees that compose the ensemble.
The variant of Random Forests implemented in Apache Spark is also known as bagging.
Besides the parameters that we already mentioned for the [DecisionTreeClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassifier.html#pyspark.ml.classification.DecisionTreeClassifier) and the [DecisionTreeRegressor](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.DecisionTreeRegressor.html), the [RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html) and the [RandomForestRegressor](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.RandomForestRegressor.html) in PySpark require three additional parameters:
> **numTrees**: the total number of trees to train.<p>
**featureSubsetStrategy**: the number of features to use as candidates for splitting at each tree node. Options include all, onethird, sqrt, log2, [1-n].<p>
**subsamplingRate**: the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset.
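As an illustration of these parameters (the values below are arbitrary choices, not recommendations):

```python
from pyspark.ml.classification import RandomForestClassifier

rfc = RandomForestClassifier(labelCol='labels', featuresCol='features',
                             numTrees=20, featureSubsetStrategy='sqrt',
                             subsamplingRate=0.8)
```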
We already did an example of classification with decision trees. Let us now use random forests to perform regression.
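The regression example uses the wine quality data. A sketch of the loading step (the file name and location are assumptions; the UCI wine quality files are semicolon-separated with a header row):

```python
# Hypothetical path; adjust to your local copy of the wine quality data.
rawdataw = spark.read.csv('./Data/winequality-white.csv', sep=';', header=True)
rawdataw.cache()
```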
We can check the schema that was read:

```python
rawdataw.printSchema()
```

The tail of the output shows that every column, including the target `quality`, has been read as a string:

```
 |-- sulphates: string (nullable = true)
 |-- alcohol: string (nullable = true)
 |-- quality: string (nullable = true)
```
We now follow a very familiar procedure to get the dataset into a format that can be input to Spark MLlib, which consists of the following steps (a sketch is given after the list):
1. transforming the data from type string to type double.
2. creating a pipeline that includes a vector assembler and a random forest regressor.
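A sketch of both steps under the assumptions above (the label is `quality` and all remaining columns are features; names such as `assemblerw` are hypothetical):

```python
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml import Pipeline

# 1. Cast every column, including the target, from string to double
for c in rawdataw.columns:
    rawdataw = rawdataw.withColumn(c, rawdataw[c].cast(DoubleType()))

# 2. Chain a vector assembler and a random forest regressor in a pipeline
assemblerw = VectorAssembler(
    inputCols=[c for c in rawdataw.columns if c != 'quality'],
    outputCol='features')
rfr = RandomForestRegressor(labelCol='quality', featuresCol='features', numTrees=20)
pipelinew = Pipeline(stages=[assemblerw, rfr])

trainw, testw = rawdataw.randomSplit([0.7, 0.3], seed=42)
pipelineModelw = pipelinew.fit(trainw)
```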
In [Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting) or [Gradient-boosted trees](https://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting) (GBT), the trees in the ensemble are trained sequentially: the first tree is trained as usual on the training data, the second tree is trained on the residuals between the predictions of the first tree and the labels of the training data, the third tree is trained on the residuals of the predictions of the second tree, and so on. The prediction of the ensemble is the sum of the predictions of the individual trees. The type of residual is determined by the loss function being minimised. In the PySpark implementation of gradient-boosted trees, the loss function for binary classification is the log-loss, and the loss function for regression is either the squared error or the absolute error. For details, follow this [link](https://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts).
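A sketch of a gradient-boosted regressor on the same wine data, evaluated with the RMSE (parameter values are assumptions):

```python
from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator

gbtr = GBTRegressor(labelCol='quality', featuresCol='features',
                    maxIter=20, lossType='squared')
gbtModel = Pipeline(stages=[assemblerw, gbtr]).fit(trainw)
predictionsw = gbtModel.transform(testw)
evaluatorw = RegressionEvaluator(labelCol='quality', metricName='rmse')
rmse = evaluatorw.evaluate(predictionsw)
print("RMSE = %g " % rmse)
```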
## 3. Exercises
**Note**: A *reference* solution will be provided in Blackboard for this part by the following Thursday (27.04.2023).