This repository was archived by the owner on Dec 15, 2025. It is now read-only.
There are 19 workloads in HiBench in total. The workloads are divided into 6 categories.

1. Bayesian Classification (Bayes)

    Naive Bayes is a simple multiclass classification algorithm that assumes independence between every pair of features. This workload is implemented in spark.mllib and uses automatically generated documents whose words follow a Zipfian distribution. The dictionary used for text generation is the default Linux word file /usr/share/dict/linux.words.

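To illustrate the algorithm being benchmarked (a minimal pure-Python sketch on invented toy documents, not the HiBench or spark.mllib code), a multinomial Naive Bayes classifier with Laplace smoothing can be written as:

```python
import math
from collections import Counter, defaultdict

# Toy training corpus: (document tokens, class label) -- invented for illustration
docs = [
    (["spark", "hadoop", "cluster"], "bigdata"),
    (["spark", "benchmark", "workload"], "bigdata"),
    (["goal", "match", "team"], "sports"),
    (["team", "score", "match"], "sports"),
]

# Per-class word frequencies and document counts
word_counts = defaultdict(Counter)
class_counts = Counter()
for tokens, label in docs:
    class_counts[label] += 1
    word_counts[label].update(tokens)

vocab = {w for tokens, _ in docs for w in tokens}

def predict(tokens):
    """Pick the class maximizing log P(c) + sum log P(w|c), Laplace-smoothed."""
    best, best_score = None, float("-inf")
    for c in class_counts:
        total = sum(word_counts[c].values())
        score = math.log(class_counts[c] / len(docs))
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

print(predict(["spark", "workload"]))   # bigdata
```

The independence assumption shows up directly: the score is a plain sum of per-word log-probabilities.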
2. K-means clustering (Kmeans)

    This workload tests K-means clustering, a well-known algorithm for knowledge discovery and data mining, in spark.mllib. The input data set is generated by GenKMeansDataset based on a uniform distribution and a Gaussian distribution.

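The clustering loop itself can be sketched in plain Python (a toy Lloyd's-algorithm implementation on made-up 2-D points, not the spark.mllib version):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(pts, 2)))
```

On this toy data the two centroids settle on the means of the two obvious point groups.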
3. Logistic Regression (LR)

    Logistic Regression (LR) is a popular method for predicting a categorical response. This workload is implemented in spark.mllib with the L-BFGS optimizer, and the input data set is generated by LogisticRegressionDataGenerator based on a random balanced decision tree. It contains three kinds of data: categorical, continuous, and binary.

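As a sketch of the underlying model (spark.mllib trains with L-BFGS; this toy uses plain per-sample gradient descent, and the data is invented):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=200):
    """Per-sample gradient descent on the logistic loss; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi                     # gradient of the log-loss
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Toy 1-D data: label 1 when the feature is positive
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
print(sigmoid(w[0] * 1.5 + b) > 0.5)   # True
```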
4. Alternating Least Squares (ALS)

    The alternating least squares (ALS) algorithm is a well-known algorithm for collaborative filtering. This workload is implemented in spark.mllib, and the input data set is generated by RatingDataGenerator for a product recommendation system.

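The alternating idea can be shown on a toy rank-1 problem (a hand-rolled sketch, not the spark.mllib ALS, which fits higher-rank factors with regularization):

```python
def als_rank1(R, iters=20):
    """Rank-1 ALS: find u, v minimizing sum over i,j of (R[i][j] - u[i]*v[j])^2."""
    m, n = len(R), len(R[0])
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(iters):
        # Fix v, solve each u[i] in closed form (1-D least squares)
        u = [sum(R[i][j] * v[j] for j in range(n)) / sum(vj * vj for vj in v)
             for i in range(m)]
        # Fix u, solve each v[j] the same way
        v = [sum(R[i][j] * u[i] for i in range(m)) / sum(ui * ui for ui in u)
             for j in range(n)]
    return u, v

# A rating matrix that is exactly rank 1: R[i][j] = a[i] * b[j]
R = [[1, 2, 3],
     [2, 4, 6],
     [3, 6, 9]]
u, v = als_rank1(R)
print(round(u[1] * v[2], 3))
```

Because each subproblem is linear when the other factor is fixed, every half-step has a closed-form solution; on this exactly rank-1 matrix the factors reproduce the ratings.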
5. Gradient Boosting Trees (GBT)

    Gradient-boosted trees (GBT) is a popular regression method using ensembles of decision trees. This workload is implemented in spark.mllib, and the input data set is generated by GradientBoostingTreeDataGenerator.

6. Linear Regression (Linear)

    Linear Regression (Linear) is a workload implemented in spark.mllib with the SGD optimizer. The input data set is generated by LinearRegressionDataGenerator.

7. Latent Dirichlet Allocation (LDA)

    Latent Dirichlet allocation (LDA) is a topic model that infers topics from a collection of text documents. This workload is implemented in spark.mllib, and the input data set is generated by LDADataGenerator.

8. Principal Components Analysis (PCA)

    Principal component analysis (PCA) is a statistical method that finds a rotation such that the first coordinate has the largest possible variance, and each succeeding coordinate in turn has the largest possible variance. PCA is widely used for dimensionality reduction. This workload is implemented in spark.mllib. The input data set is generated by PCADataGenerator.

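The first principal component can be illustrated with power iteration on a toy 2-D data set (a pure-Python sketch; spark.mllib computes PCA with distributed linear algebra):

```python
import math

def first_component(data, iters=100):
    """Power iteration on the 2x2 covariance matrix of 2-D points."""
    n = len(data)
    # Center the data
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    pts = [(x - mx, y - my) for x, y in data]
    # Covariance matrix entries
    cxx = sum(x * x for x, _ in pts) / n
    cxy = sum(x * y for x, y in pts) / n
    cyy = sum(y * y for _, y in pts) / n
    v = (1.0, 0.0)
    for _ in range(iters):
        # Multiply by the covariance matrix, then renormalize
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(*w)
        v = (w[0] / norm, w[1] / norm)
    return v

# Points spread along the y = x direction
data = [(0, 0), (1, 1.1), (2, 1.9), (3, 3.05)]
v = first_component(data)
print(round(v[0], 2), round(v[1], 2))   # approximately 0.71 0.71
```

Repeated multiplication by the covariance matrix amplifies the direction of largest variance, so the iterate converges to the first principal axis.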
9. Random Forest (RF)

    Random forests (RF) are ensembles of decision trees and one of the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. This workload is implemented in spark.mllib, and the input data set is generated by RandomForestDataGenerator.

10. Support Vector Machine (SVM)

    Support Vector Machine (SVM) is a standard method for large-scale classification tasks. This workload is implemented in spark.mllib, and the input data set is generated by SVMDataGenerator.

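A sketch of a linear SVM trained by subgradient descent on the hinge loss (invented toy data; the spark.mllib trainer optimizes the same L2-regularized hinge loss, though its code differs):

```python
def fit_svm(X, y, lam=0.01, lr=0.01, epochs=200):
    """Subgradient descent on the L2-regularized hinge loss.
    Labels y must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Point inside the margin: hinge subgradient is active
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Outside the margin: only the regularizer acts
                w = [wj - lr * lam * wj for wj in w]
    return w, b

# Linearly separable toy data in 1-D
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [-1, -1, 1, 1]
w, b = fit_svm(X, y)
print(all((w[0] * xi[0] + b > 0) == (yi > 0) for xi, yi in zip(X, y)))  # True
```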
11. Singular Value Decomposition (SVD)

    Singular value decomposition (SVD) factorizes a matrix into three matrices. This workload is implemented in spark.mllib, and its input data set is generated by SVDDataGenerator.