This repository was archived by the owner on Dec 15, 2025. It is now read-only.

Commit fc3aed3

Merge remote-tracking branch 'upstream/master'
update to master
2 parents 50b9750 + 85c651d commit fc3aed3

13 files changed

Lines changed: 174 additions & 87 deletions


README.md

Lines changed: 31 additions & 9 deletions
@@ -50,23 +50,45 @@ There are totally 19 workloads in HiBench. The workloads are divided into 6 cate

 **Machine Learning:**

-1. Bayesian Classification (bayes)
+1. Bayesian Classification (Bayes)

-  This workload benchmarks NaiveBayesian Classification implemented in Spark-MLLib/Mahout examples.
+  This workload benchmarks Naive Bayesian Classification implemented in Spark-MLLib. The workload uses automatically generated documents whose words follow a Zipfian distribution. The dictionary used for text generation is the default Linux file /usr/share/dict/linux.words.

-  Large-scale machine learning is another important use of MapReduce. This workload tests the Naive Bayesian (a popular classification algorithm for knowledge discovery and data mining) trainer in Mahout 0.7, which is an open source (Apache project) machine learning library. The workload uses the automatically generated documents whose words follow the zipfian distribution. The dict used for text generation is also from the default linux file /usr/share/dict/linux.words.
+2. K-means clustering (Kmeans)

-  2. K-means clustering (kmeans)
+  This workload tests the K-means clustering (a well-known clustering algorithm for knowledge discovery and data mining) implemented in Spark-MLlib. The input data set is generated by GenKMeansDataset based on a uniform distribution and a Gaussian distribution.

-  This workload tests the K-means (a well-known clustering algorithm for knowledge discovery and data mining) clustering in Mahout 0.7/Spark-MLlib. The input data set is generated by GenKMeansDataset based on Uniform Distribution and Guassian Distribution.
+3. Logistic Regression (LR)

-  3. Logistic Regression (lr)
+  This workload benchmarks Logistic Regression (LR) implemented in Spark-MLLib with the LBFGS optimizer. The input data set is generated by LogisticRegressionDataGenerator based on a random balanced decision tree. It contains three different kinds of data types: categorical data, continuous data, and binary data.

-  This workload benchmarks Logistic Regression implemented in Spark-MLLib examples. Logistic Regreesion is realized with LBFGS. The input data set is generated by LabeledPointDataGenerator based on random balance decision tree. It contains three different kinds of data types, including categorical data, continuous data, and binary data.
+4. Alternating Least Squares (ALS)

-  4. Alternating Least Squares (als)
+  This workload benchmarks Alternating Least Squares (ALS) implemented in Spark-MLLib. The input data set is generated by RatingDataGenerator for a product recommendation system.

-  This workload benchmarks Alternating Least Squares implememnted in Spark-MLLib examples. The input data set is generated by RatingDataGenerator for a product recommendation system.
+5. Gradient Boosting Tree (GBT)
+
+  This workload benchmarks Gradient Boosting Tree (GBT) implemented in Spark-MLLib. The input data set is generated by GradientBoostingTreeDataGenerator.
+
+6. Linear Regression (LiR)
+
+  This workload benchmarks Linear Regression (LiR) implemented in Spark-MLLib with the SGD optimizer. The input data set is generated by LinearRegressionDataGenerator.
+
+7. Latent Dirichlet Allocation (LDA)
+
+  This workload benchmarks Latent Dirichlet Allocation (LDA) implemented in Spark-MLLib. The input data set is generated by LDADataGenerator.
+
+8. Principal Components Analysis (PCA)
+
+  This workload benchmarks Principal Components Analysis (PCA) implemented in Spark-MLLib. The input data set is generated by PCADataGenerator.
+
+9. Random Forest (RF)
+
+  This workload benchmarks Random Forest (RF) implemented in Spark-MLLib. The input data set is generated by RandomForestDataGenerator.
+
+10. Support Vector Machine (SVM)
+
+  This workload benchmarks Support Vector Machine (SVM) implemented in Spark-MLLib. The input data set is generated by SVMDataGenerator.

 **SQL:**
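For a concrete sense of what these Spark-MLLib workloads exercise, here is a minimal, hedged sketch of a K-means run against MLlib's RDD API. The input path, k, and iteration count are illustrative placeholders only; they are not GenKMeansDataset output or the benchmark's actual driver code, which takes its values from the workload's conf file.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeans sketch"))

    // Hypothetical input: one whitespace-separated feature vector per line.
    val points = sc.textFile("hdfs:///tmp/kmeans-input")
      .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
      .cache()

    // Illustrative hyperparameters; HiBench reads its values from the workload conf instead.
    val model = KMeans.train(points, k = 10, maxIterations = 5)
    println(s"Within-set sum of squared errors: ${model.computeCost(points)}")

    sc.stop()
  }
}
```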

bin/functions/hibench_prop_env_mapping.py

Lines changed: 6 additions & 1 deletion
@@ -119,7 +119,12 @@
 # For Random Forest
 NUM_EXAMPLES_RF="hibench.rf.examples",
 NUM_FEATURES_RF="hibench.rf.features",
-NUMTREES="hibench.rf.numTrees",
+NUM_TREES_RF="hibench.rf.numTrees",
+NUM_CLASSES_RF="hibench.rf.numClasses",
+FEATURE_SUBSET_STRATEGY_RF="hibench.rf.featureSubsetStrategy",
+IMPURITY_RF="hibench.rf.impurity",
+MAX_DEPTH_RF="hibench.rf.maxDepth",
+MAX_BINS_RF="hibench.rf.maxBins",
 # For SVD
 NUM_EXAMPLES_SVD="hibench.svd.examples",
 NUM_FEATURES_SVD="hibench.svd.features",

bin/workloads/ml/lda/spark/run.sh

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ rmr_hdfs $OUTPUT_HDFS || true

 SIZE=`dir_size $INPUT_HDFS`
 START_TIME=`timestamp`
-run_spark_job com.intel.hibench.sparkbench.ml.LDAExample $INPUT_HDFS $OUTPUT_HDFS $NUM_TOPICS_LDA $NUM_ITERATIONS_LDA $OPTIMIZER_LDA $MAXRESULTSIZE_LDA
+run_spark_job com.intel.hibench.sparkbench.ml.LDAExample --numTopics $NUM_TOPICS_LDA --maxIterations $NUM_ITERATIONS_LDA --optimizer $OPTIMIZER_LDA --maxResultSize $MAXRESULTSIZE_LDA $INPUT_HDFS $OUTPUT_HDFS
 END_TIME=`timestamp`

 gen_report ${START_TIME} ${END_TIME} ${SIZE}

bin/workloads/ml/rf/spark/run.sh

Lines changed: 7 additions & 1 deletion
@@ -26,7 +26,13 @@ rmr_hdfs $OUTPUT_HDFS || true

 SIZE=`dir_size $INPUT_HDFS`
 START_TIME=`timestamp`
-run_spark_job com.intel.hibench.sparkbench.ml.RandomForestClassification ${INPUT_HDFS} ${NUMTREES}
+OPTION="--numTrees $NUM_TREES_RF \
+        --numClasses $NUM_CLASSES_RF \
+        --featureSubsetStrategy $FEATURE_SUBSET_STRATEGY_RF \
+        --impurity $IMPURITY_RF \
+        --maxDepth $MAX_DEPTH_RF \
+        --maxBins $MAX_BINS_RF"
+run_spark_job com.intel.hibench.sparkbench.ml.RandomForestClassification $OPTION $INPUT_HDFS
 END_TIME=`timestamp`

 gen_report ${START_TIME} ${END_TIME} ${SIZE}
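The six new flags line up with the parameters of Spark MLlib's RandomForest.trainClassifier, and the default values added to conf/workloads/ml/rf.conf later in this commit map onto a call like the following sketch. The input path and the object-file loading here are illustrative assumptions, not the actual RandomForestClassification sources.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.rdd.RDD

object RandomForestSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RandomForest sketch"))

    // Hypothetical input: LabeledPoint records saved as an HDFS object file.
    val data: RDD[LabeledPoint] = sc.objectFile("hdfs:///tmp/rf-input")

    // Values mirror the new rf.conf defaults: 100 trees, 2 classes, auto/gini, depth 4, 32 bins.
    val model = RandomForest.trainClassifier(
      data,
      numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 100,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 4,
      maxBins = 32)

    println(s"Trained an ensemble of ${model.numTrees} trees")
    sc.stop()
  }
}
```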

conf/hibench.conf

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # Data scale profile. Available value is tiny, small, large, huge, gigantic and bigdata.
 # The definition of these profiles can be found in the workload's conf file i.e. conf/workloads/micro/wordcount.conf
-hibench.scale.profile tiny
+hibench.scale.profile tiny
 # Mapper number in hadoop, partition number in Spark
 hibench.default.map.parallelism 8


conf/workloads/ml/rf.conf

Lines changed: 17 additions & 12 deletions
@@ -1,21 +1,26 @@
-hibench.rf.tiny.examples 10
-hibench.rf.tiny.features 100
-hibench.rf.small.examples 100
-hibench.rf.small.features 500
-hibench.rf.large.examples 1000
-hibench.rf.large.features 1000
-hibench.rf.huge.examples 10000
-hibench.rf.huge.features 200000
-hibench.rf.gigantic.examples 10000
-hibench.rf.gigantic.features 300000
-hibench.rf.bigdata.examples 20000
-hibench.rf.bigdata.features 220000
+hibench.rf.tiny.examples 10
+hibench.rf.tiny.features 100
+hibench.rf.small.examples 100
+hibench.rf.small.features 500
+hibench.rf.large.examples 1000
+hibench.rf.large.features 1000
+hibench.rf.huge.examples 10000
+hibench.rf.huge.features 200000
+hibench.rf.gigantic.examples 10000
+hibench.rf.gigantic.features 300000
+hibench.rf.bigdata.examples 20000
+hibench.rf.bigdata.features 220000


 hibench.rf.examples ${hibench.rf.${hibench.scale.profile}.examples}
 hibench.rf.features ${hibench.rf.${hibench.scale.profile}.features}
 hibench.rf.partitions ${hibench.default.map.parallelism}
 hibench.rf.numTrees 100
+hibench.rf.numClasses 2
+hibench.rf.featureSubsetStrategy auto
+hibench.rf.impurity gini
+hibench.rf.maxDepth 4
+hibench.rf.maxBins 32

 hibench.workload.input ${hibench.hdfs.data.dir}/RF/Input
 hibench.workload.output ${hibench.hdfs.data.dir}/RF/Output

sparkbench/ml/src/main/scala/com/intel/sparkbench/ml/ALSExample.scala

Lines changed: 3 additions & 2 deletions
@@ -126,10 +126,11 @@ object ALSExample {

     println(s"Test RMSE = $rmse.")

-    // Recommend products for all users
+    // Recommend products for all users, enable the following code to test recommendForAll
+    /*
     val userRecommend = model.recommendProductsForUsers(numRecommends)
     userRecommend.count()
-
+    */
     sc.stop()
   }

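Since recommendProductsForUsers is now commented out (it materializes recommendations for every user and can dominate the run), a lighter smoke test is to request recommendations for a single user. This is a hedged sketch against MLlib's MatrixFactorizationModel API, not code from HiBench; the user id and count are illustrative.

```scala
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

// Recommend a few products for one (illustrative) user instead of materializing
// recommendations for the whole user base.
def sampleRecommendations(model: MatrixFactorizationModel): Array[Rating] =
  model.recommendProducts(user = 1, num = 10)
```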

sparkbench/ml/src/main/scala/com/intel/sparkbench/ml/GradientBoostingTree.scala

Lines changed: 1 addition & 2 deletions
@@ -53,7 +53,7 @@ object GradientBoostingTree {
     // Train a GradientBoostedTrees model.
     // The defaultParams for Classification use LogLoss by default.
     val boostingStrategy = BoostingStrategy.defaultParams("Classification")
-    boostingStrategy.numIterations = numIterations // Note: Use more iterations in practice.
+    boostingStrategy.numIterations = numIterations
     boostingStrategy.treeStrategy.numClasses = numClasses
     boostingStrategy.treeStrategy.maxDepth = maxDepth
     // Empty categoricalFeaturesInfo indicates all features are continuous.
@@ -68,7 +68,6 @@ object GradientBoostingTree {
     }
     val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
     println("Test Error = " + testErr)
-    println("Learned classification GBT model:\n" + model.toDebugString)

     sc.stop()
   }

sparkbench/ml/src/main/scala/com/intel/sparkbench/ml/GradientBoostingTreeDataGenerator.scala

Lines changed: 6 additions & 6 deletions
@@ -29,13 +29,13 @@ import org.apache.spark.rdd.RDD

 /**
  * :: DeveloperApi ::
- * Generate test data for LogisticRegression. This class chooses positive labels
+ * Generate test data for Gradient Boosting Tree. This class chooses positive labels
  * with probability `probOne` and scales features for positive examples by `eps`.
  */
 object GradientBoostingTreeDataGenerator {

   /**
-   * Generate an RDD containing test data for LogisticRegression.
+   * Generate an RDD containing test data for Gradient Boosting Tree.
    *
    * @param sc SparkContext to use for creating the RDD.
    * @param nexamples Number of examples that will be contained in the RDD.
@@ -44,7 +44,7 @@ object GradientBoostingTreeDataGenerator {
    * @param nparts Number of partitions of the generated RDD. Default value is 2.
    * @param probOne Probability that a label is 1 (and not 0). Default value is 0.5.
    */
-  def generateLogisticRDD(
+  def generateGBTRDD(
     sc: SparkContext,
     nexamples: Int,
     nfeatures: Int,
@@ -73,7 +73,7 @@ object GradientBoostingTreeDataGenerator {
     val parallel = sc.getConf.getInt("spark.default.parallelism", sc.defaultParallelism)
     val numPartitions = IOCommon.getProperty("hibench.default.shuffle.parallelism")
       .getOrElse((parallel / 2).toString).toInt
-    val eps = 3
+    val eps = 0.3

     if (args.length == 3) {
       outputPath = args(0)
@@ -84,12 +84,12 @@ object GradientBoostingTreeDataGenerator {
       println(s"Num of Features: $numFeatures")
     } else {
       System.err.println(
-        s"Usage: $LogisticRegressionDataGenerator <OUTPUT_PATH> <NUM_EXAMPLES> <NUM_FEATURES>"
+        s"Usage: $GradientBoostingTreeDataGenerator <OUTPUT_PATH> <NUM_EXAMPLES> <NUM_FEATURES>"
       )
       System.exit(1)
     }

-    val data = generateLogisticRDD(sc, numExamples, numFeatures, eps, numPartitions)
+    val data = generateGBTRDD(sc, numExamples, numFeatures, eps, numPartitions)

     data.saveAsObjectFile(outputPath)

sparkbench/ml/src/main/scala/com/intel/sparkbench/ml/LDAExample.scala

Lines changed: 46 additions & 27 deletions
@@ -22,44 +22,63 @@ import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel, LocalLDAModel}
 import org.apache.spark.mllib.linalg.{Vector, Vectors}
 import org.apache.spark.rdd.RDD
-
+import scopt.OptionParser
 object LDAExample {
-
+  case class Params(
+    inputPath: String = null,
+    outputPath: String = null,
+    numTopics: Int = 10,
+    maxIterations: Int = 10,
+    optimizer: String = "online",
+    maxResultSize: String = "1g")
+
   def main(args: Array[String]): Unit = {
-    var inputPath = ""
-    var outputPath = ""
-    var numTopics: Int = 10
-    var maxIterations: Int = 10
-    var optimizer = "online"
-    var maxResultSize = "1g"
+    val defaultParams = Params()
+
+    val parser = new OptionParser[Params]("LDA") {
+      head("LDA: an example app for LDA.")
+      opt[String]("optimizer")
+        .text(s"optimizer, default: ${defaultParams.optimizer}")
+        .action((x, c) => c.copy(optimizer = x))
+      opt[String]("maxResultSize")
+        .text(s"max resultSize, default: ${defaultParams.maxResultSize}")
+        .action((x, c) => c.copy(maxResultSize = x))
+      opt[Int]("numTopics")
+        .text(s"number of Topics, default: ${defaultParams.numTopics}")
+        .action((x, c) => c.copy(numTopics = x))
+      opt[Int]("maxIterations")
+        .text(s"number of max iterations, default: ${defaultParams.maxIterations}")
+        .action((x, c) => c.copy(maxIterations = x))
+      arg[String]("<inputPath>")
+        .required()
+        .text("Input path")
+        .action((x, c) => c.copy(inputPath = x))
+      arg[String]("<outputPath>")
+        .required()
+        .text("Output path")
+        .action((x, c) => c.copy(outputPath = x))

-    if (args.length == 6) {
-      inputPath = args(0)
-      outputPath = args(1)
-      numTopics = args(2).toInt
-      maxIterations = args(3).toInt
-      optimizer = args(4)
-      maxResultSize = args(5)
-    } else {
-      System.err.println(
-        s"Usage: $LDAExample <INPUT_PATH> <OUTPUT_PATH> <NUM_TOPICS> <MAX_RESULT_SIZE>"
-      )
-      System.exit(1)
     }
-
+    parser.parse(args, defaultParams) match {
+      case Some(params) => run(params)
+      case _ => sys.exit(1)
+    }
+  }
+
+  def run(params: Params): Unit = {
     val conf = new SparkConf()
-      .setAppName("LDA Example")
-      .set("spark.driver.maxResultSize",maxResultSize)
+      .setAppName(s"LDA Example with $params")
+      .set("spark.driver.maxResultSize", params.maxResultSize)
     val sc = new SparkContext(conf)

-    val corpus: RDD[(Long, Vector)] = sc.objectFile(inputPath)
+    val corpus: RDD[(Long, Vector)] = sc.objectFile(params.inputPath)

     // Cluster the documents into numTopics topics using LDA
-    val ldaModel = new LDA().setK(numTopics).setMaxIterations(maxIterations).setOptimizer(optimizer).run(corpus)
+    val ldaModel = new LDA().setK(params.numTopics).setMaxIterations(params.maxIterations).setOptimizer(params.optimizer).run(corpus)

     // Save and load model.
-    ldaModel.save(sc, outputPath)
-    val savedModel = LocalLDAModel.load(sc, outputPath)
+    ldaModel.save(sc, params.outputPath)
+    val savedModel = LocalLDAModel.load(sc, params.outputPath)

     sc.stop()
   }
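For reference, the scopt pattern adopted above folds named flags and positional arguments into an immutable config object, which is what lets run.sh pass --numTopics, --maxIterations, --optimizer, and --maxResultSize in any order before the two paths. A minimal standalone sketch of that pattern, independent of HiBench and assuming scopt 3.x on the classpath, with all names here being illustrative:

```scala
import scopt.OptionParser

// Illustrative config; not HiBench's Params class.
case class DemoConfig(input: String = "", topics: Int = 10)

object ScoptDemo {
  def main(args: Array[String]): Unit = {
    val parser = new OptionParser[DemoConfig]("demo") {
      opt[Int]("numTopics")
        .text("number of topics")
        .action((x, c) => c.copy(topics = x))
      arg[String]("<input>")
        .required()
        .action((x, c) => c.copy(input = x))
    }
    // e.g. args = Array("--numTopics", "20", "hdfs:///tmp/lda-input")
    parser.parse(args, DemoConfig()) match {
      case Some(cfg) => println(s"input=${cfg.input}, topics=${cfg.topics}")
      case None => sys.exit(1)
    }
  }
}
```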
