This repository was archived by the owner on Dec 15, 2025. It is now read-only.
There are 19 workloads in HiBench in total. The workloads are divided into 6 categories.
**Machine Learning:**
1. Bayesian Classification (Bayes)

This workload benchmarks Naive Bayesian Classification implemented in Spark-MLlib. The workload uses automatically generated documents whose words follow the Zipfian distribution. The dictionary used for text generation comes from the default Linux word file /usr/share/dict/linux.words.
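The workload itself runs MLlib's implementation; as a quick illustration of the multinomial Naive Bayes classification being benchmarked (the function names below are invented for this sketch, not HiBench or MLlib APIs):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Count class frequencies and per-class word frequencies
    from (label, words) training pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, words):
    """Pick the class maximising log P(class) + sum of log P(word|class),
    with Laplace (add-one) smoothing over the vocabulary."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label, n in class_counts.items():
        lp = math.log(n / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            lp += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [
    ("sports", ["ball", "goal", "team"]),
    ("sports", ["team", "score", "goal"]),
    ("tech", ["cpu", "code", "bug"]),
    ("tech", ["code", "compile", "cpu"]),
]
model = train_nb(docs)
print(predict_nb(model, ["goal", "team"]))  # → sports
print(predict_nb(model, ["cpu", "code"]))   # → tech
```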
2. K-means clustering (Kmeans)

This workload tests K-means (a well-known clustering algorithm for knowledge discovery and data mining) implemented in Spark-MLlib. The input data set is generated by GenKMeansDataset based on the Uniform and Gaussian distributions.
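For reference, the core of the algorithm the workload exercises is plain Lloyd's iteration; a minimal pure-Python sketch (illustrative only — MLlib's implementation is distributed and uses k-means|| initialisation):

```python
def kmeans(points, k, iters=20):
    """Lloyd's algorithm on 2-D points; returns (centroids, labels).
    Centroids start from the first k points to keep the sketch
    deterministic; real implementations initialise randomly."""
    centroids = [p for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        labels = [
            min(range(k),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                            + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, labels

pts = [(0.1, 0.0), (0.0, 0.2), (-0.1, 0.1),   # one tight blob near the origin
       (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]    # another blob near (5, 5)
cents, labels = kmeans(pts, k=2)
print(labels)  # → [0, 0, 0, 1, 1, 1]
```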
3. Logistic Regression (LR)

This workload benchmarks Logistic Regression (LR) implemented in Spark-MLlib with the LBFGS optimizer. The input data set is generated by LogisticRegressionDataGenerator based on a random balanced decision tree. It contains three kinds of data: categorical, continuous, and binary.
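The workload runs MLlib's L-BFGS-based trainer; to sketch the model itself, here is binary logistic regression fitted with plain batch gradient descent as a stand-in optimizer (everything below is illustrative, not HiBench code):

```python
import math

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Binary logistic regression via batch gradient descent on the
    log-loss. (The workload uses L-BFGS; gradient descent is
    substituted here to keep the sketch short.)"""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        gw = [0.0] * len(w)
        gb = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - yi                     # gradient of the log-loss
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

X = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 0.5]]
y = [0, 0, 1, 1]
w, b = train_logreg(X, y)
print([predict(w, b, x) for x in X])  # → [0, 0, 1, 1]
```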
4. Alternating Least Squares (ALS)

This workload benchmarks Alternating Least Squares (ALS) implemented in Spark-MLlib. The input data set is generated by RatingDataGenerator for a product recommendation system.
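To show the idea behind ALS (the alternation between two closed-form least-squares updates), here is a rank-1 version on a dense ratings matrix — a deliberately simplified sketch, not MLlib's distributed rank-k implementation:

```python
def als_rank1(R, iters=20):
    """Rank-1 alternating least squares: alternately solve the
    closed-form least-squares update for the user vector u and the
    item vector v so that R[i][j] ≈ u[i] * v[j]."""
    n_users, n_items = len(R), len(R[0])
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iters):
        # Fix v, solve u_i = argmin sum_j (R_ij - u_i v_j)^2
        vv = sum(x * x for x in v)
        u = [sum(R[i][j] * v[j] for j in range(n_items)) / vv
             for i in range(n_users)]
        # Fix u, solve the symmetric update for v
        uu = sum(x * x for x in u)
        v = [sum(R[i][j] * u[i] for i in range(n_users)) / uu
             for j in range(n_items)]
    return u, v

# A perfectly rank-1 ratings matrix: R[i][j] = a[i] * b[j]
R = [[2.0, 4.0, 6.0],
     [1.0, 2.0, 3.0],
     [3.0, 6.0, 9.0]]
u, v = als_rank1(R)
print(round(u[0] * v[0], 6))  # → 2.0
```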
5. Gradient Boosting Tree (GBT)

This workload benchmarks Gradient Boosting Tree (GBT) implemented in Spark-MLlib. The input data set is generated by GradientBoostingTreeDataGenerator.
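Gradient boosting fits each new tree to the residuals of the ensemble so far; a toy one-feature version with depth-1 trees (stumps) makes the loop visible. This is an illustration of the technique, not the workload's code:

```python
def best_stump(x, residuals):
    """Find the 1-D threshold split minimising squared error."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]  # (threshold, left_value, right_value)

def gbt_fit(x, y, rounds=10, lr=0.5):
    """Each stump fits the residuals left by the current ensemble."""
    pred = [0.0] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lv, rv = best_stump(x, residuals)
        stumps.append((t, lv, rv))
        pred = [pi + lr * (lv if xi <= t else rv)
                for pi, xi in zip(pred, x)]
    return stumps

def gbt_predict(stumps, xi, lr=0.5):
    return sum(lr * (lv if xi <= t else rv) for t, lv, rv in stumps)

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 1.0, 3.0, 3.0]
stumps = gbt_fit(x, y)
print(round(gbt_predict(stumps, 1.5), 2))  # → 1.0
print(round(gbt_predict(stumps, 3.5), 2))  # → 3.0
```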
6. Linear Regression (LiR)

This workload benchmarks Linear Regression (LiR) implemented in Spark-MLlib with the SGD optimizer. The input data set is generated by LinearRegressionDataGenerator.
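Since the workload description names the SGD optimizer, here is what one SGD pass over a linear model looks like, in a minimal pure-Python sketch (illustrative names, not MLlib code):

```python
import random

def linreg_sgd(X, y, lr=0.05, epochs=200, seed=0):
    """Linear regression trained with stochastic gradient descent:
    one squared-error gradient step per training sample."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)   # visit samples in random order each epoch
        for i in idx:
            pred = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            err = pred - y[i]
            w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b

# Noiseless data on the line y = 2x + 1
X = [[float(i)] for i in range(6)]
y = [2.0 * x[0] + 1.0 for x in X]
w, b = linreg_sgd(X, y)
print(round(w[0], 2), round(b, 2))  # → 2.0 1.0
```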
7. Latent Dirichlet Allocation (LDA)

This workload benchmarks Latent Dirichlet Allocation (LDA) implemented in Spark-MLlib. The input data set is generated by LDADataGenerator.
8. Principal Components Analysis (PCA)

This workload benchmarks Principal Components Analysis (PCA) implemented in Spark-MLlib. The input data set is generated by PCADataGenerator.
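PCA finds the directions of maximum variance; for 2-D data the top component can be extracted by power iteration on the covariance matrix, sketched below (illustrative only — MLlib computes an SVD-based decomposition at scale):

```python
def first_principal_component(data, iters=100):
    """Top principal component of 2-D data via power iteration
    on the 2x2 covariance matrix of the mean-centred points."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    xs = [p[0] - mx for p in data]
    ys = [p[1] - my for p in data]
    # Covariance matrix entries
    cxx = sum(x * x for x in xs) / n
    cxy = sum(x * y for x, y in zip(xs, ys)) / n
    cyy = sum(y * y for y in ys) / n
    v = (1.0, 0.0)
    for _ in range(iters):
        # Multiply by the covariance matrix, then renormalise
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v

# Points lying near the line y = x: the top component is ~(0.707, 0.707)
pts = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]
v = first_principal_component(pts)
```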
9. Random Forest (RF)

This workload benchmarks Random Forest (RF) implemented in Spark-MLlib. The input data set is generated by RandomForestDataGenerator.
10. Support Vector Machine (SVM)

This workload benchmarks Support Vector Machine (SVM) implemented in Spark-MLlib. The input data set is generated by SVMDataGenerator.
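A linear SVM can be trained by subgradient descent on the regularised hinge loss; the following is a compact sketch of that training loop (not MLlib's implementation, which distributes the same objective):

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM via subgradient descent on the hinge loss.
    Labels must be +1 / -1; lam is the L2 regularisation strength."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin (or misclassified): hinge subgradient
                w = [wj + lr * yi * xj - lr * lam * wj
                     for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Outside the margin: only the regulariser shrinks w
                w = [wj - lr * lam * wj for wj in w]
    return w, b

X = [[-2.0, 0.0], [-1.5, 0.5], [1.5, -0.5], [2.0, 0.0]]
y = [-1, -1, 1, 1]
w, b = train_linear_svm(X, y)
signs = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1
         for x in X]
print(signs)  # → [-1, -1, 1, 1]
```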