Apply initial centroids on Spark Kmeans workload. #187
base: master
Conversation
… hold the same convergence condition on both the MapReduce and Spark benchmarks.
cc @hqzizania
.setK(k)
.setMaxIterations(iterations)
.setRuns(runs)
.setInitialModel(initModel)
Why not use KMeans.RANDOM? What is the difference?
KMeans.RANDOM is from Spark MLlib, while KMeans.setInitialModel shares the initial centroids generated by HiBench GenKMeansDataset. To get a meaningful comparison, we have to pick one of the approaches and apply it to both the MapReduce and Spark benchmarks. In this PR we select the latter approach, HiBench GenKMeansDataset, which generates the random centroids based on a normal (Gaussian) distribution.
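To make the contrast concrete, here is a minimal sketch of the two approaches, assuming the input points have already been parsed into an `RDD[Vector]` and the HiBench-generated centroids into an `Array[Vector]`; the helper names are illustrative only.

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Option A: let Spark MLlib choose random initial centroids itself.
def trainWithRandomInit(data: RDD[Vector], k: Int, iterations: Int): KMeansModel =
  new KMeans()
    .setK(k)
    .setMaxIterations(iterations)
    .setInitializationMode(KMeans.RANDOM)
    .run(data)

// Option B (this PR): start from the centroids produced by HiBench GenKMeansDataset,
// so the MapReduce and Spark benchmarks share the same initial model.
def trainWithGivenCentroids(data: RDD[Vector], centroids: Array[Vector], iterations: Int): KMeansModel =
  new KMeans()
    .setK(centroids.length)
    .setMaxIterations(iterations)
    .setInitialModel(new KMeansModel(centroids))
    .run(data)
```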
Yes, in Spark KMeans the random centroids would be generated by takeSample without a normal distribution even if KMeans.RANDOM is used. If getting a meaningful comparison between the MapReduce and Spark benchmarks is the core value of HiBench, I agree with your approach, though it does not seem sufficient for benchmarking the performance of the Spark K-Means implementation by itself. @carsonwang
If k != the number of centroids in initModel, Spark KMeans will throw an exception:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: mismatched cluster count
@hqzizania, this should be expected behavior. If Spark KMeans uses the HiBench-generated initial model, the parameter k must match the value defined by the initModel.
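In other words, when an initial model is supplied, k has to equal the number of centroids in that model, otherwise MLlib's internal requirement check fails. A small sketch, reusing the `initModel` from the diff being reviewed:

```scala
// initModel is a KMeansModel built from the HiBench-generated centroids.
// Setting k to anything other than initModel.clusterCenters.length makes
// KMeans.run fail with "requirement failed: mismatched cluster count".
val k = initModel.clusterCenters.length
val kmeans = new KMeans()
  .setK(k)
  .setMaxIterations(iterations)
  .setInitialModel(initModel)
```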
@pfxuan, please see 10-data-scale-profile.conf. The num_of_clusters used to generate the input data and the k might be different. This is one of the concerns with using the input centroids as the initial model. The MR version currently uses the input centroids and doesn't pass the k parameter. We'd prefer to use the k parameter, as this is the expected number of clusters.

As @hqzizania mentioned, the problem in the Java KMeans is that the RDD is not cached. Can you please check the Scala version, which supports both KMeans|| and Random initialization? It seems there is no huge computation cost when initializing the model, as the RDD is cached. If this is the case, we can fix the Java version and also pass -k to the MR version. This should make all of them comparable.
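A rough sketch of the requested check, assuming the input has already been parsed into an `RDD[Vector]`; the timing helper and variable names are illustrative only.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def timeKMeans(data: RDD[Vector], k: Int, iterations: Int, mode: String): Long = {
  data.cache()
  data.count()                      // materialize the cache before timing
  val start = System.currentTimeMillis()
  new KMeans()
    .setK(k)
    .setMaxIterations(iterations)
    .setInitializationMode(mode)    // KMeans.RANDOM or KMeans.K_MEANS_PARALLEL
    .run(data)
  System.currentTimeMillis() - start
}

// Compare the two built-in initializers on the cached RDD.
val randomMillis   = timeKMeans(data, k, iterations, KMeans.RANDOM)
val parallelMillis = timeKMeans(data, k, iterations, KMeans.K_MEANS_PARALLEL)
```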
I did a performance test on the Scala version of KMeans. The input data is about 100 GB across 4 nodes. The running time with Random and Parallel initialization is almost the same: about 4 minutes for 3 iterations, including 1.3 minutes on centroid initialization. So there is roughly 48.1% overhead when using either the Spark Random or Parallel initializer. In comparison, the implementation in this PR took only about 2.4 minutes for 3 iterations, with almost zero overhead on initialization.
In addition, the Mahout version of random initialization is a sequential rather than a MapReduce-based implementation. I passed the -k 20 parameter to the MR benchmark, and it took 18.8 minutes to generate 20 initial centroids using only one CPU core. To make a reasonable comparison, I think it would be better to keep the original HiBench generator for all KMeans benchmarks.
Hi @pfxuan, I just got a chance to run KMeans with Random initialization by passing --initMode Random in run.sh. In my test, the Random initialization ran much faster and I saw far fewer stages. The Parallel initialization is slow because there are many stages like collect and aggregate which run over several iterations, but these stages do not appear when Random initialization is used. Can you take a look at the Spark UI to see if there is any difference?
@carsonwang, @hqzizania any update on this PR? Thanks!
@hqzizania are you suggesting to merge this PR?
This is a good way to get a meaningful comparison between MapReduce and Spark. But unfortunately it will not use "k-means||", the specific and excellent parallel initialization algorithm in Spark KMeans, which is very time consuming but produces better initial centroids than random choice and so speeds up convergence. As a result, the benchmark results can't fully reflect the performance of Spark KMeans, since the important "k-means||" feature is not exercised.
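For context, a small sketch of what selecting k-means|| explicitly looks like in MLlib; `k`, `iterations`, and `data` are placeholders, and the step count shown is only illustrative.

```scala
import org.apache.spark.mllib.clustering.KMeans

// k-means|| spends extra time up front on several distributed sampling passes
// to pick well-spread starting centroids, which usually reduces the number of
// Lloyd iterations needed to converge.
val model = new KMeans()
  .setK(k)
  .setMaxIterations(iterations)
  .setInitializationMode(KMeans.K_MEANS_PARALLEL)
  .setInitializationSteps(5)   // number of k-means|| sampling passes
  .run(data)
```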
I think it is easier to update the KMeans MR version to make them comparable. Simply pass a -k parameter to the MR version as well.
Is HiBench a performance benchmark suite for big data execution engines or for big data applications? I probably misunderstood the design goal of the HiBench benchmark suite, so please correct me if I'm wrong. I was thinking of HiBench as a performance benchmark tool for characterizing and measuring the efficiency of big data execution engines/runners (MapReduce/Spark/Flink/Storm). Under that view, to get a fair comparison it is better to apply a similar workload with exactly the same initial conditions (e.g. randomly chosen centroids) across the different engines/runners, which ultimately requires a matched algorithm with a very close implementation for each of them. Otherwise, comparing two different workloads is like comparing apples and oranges.

The comparison mentioned in the previous discussion looks more like an algorithm-and-implementation-oriented benchmark, which suggests we should fix the convergence condition rather than the algorithm, implementation, and initial conditions when benchmarking each engine/runner. Such a measurement rule makes sense for choosing a highly efficient implementation for a production system, but it might lead to an imprecise evaluation of the execution engines/runners themselves, and would make HiBench a benchmark suite for big data applications. Any thoughts and comments would be much appreciated. Big thanks to all of you!
OK, thanks for the explanation of the HiBench design goals above. As @carsonwang suggested, is choosing random centroids the more elegant way?
@hqzizania, if we use each application's built-in generator, it will produce two different random sets of centroids, and thus give the same input sample two different initial conditions in the MapReduce and Spark benchmarks. Even if this difference can be safely ignored compared with the overall KMeans computation cost, the cost of generating the random centroids may or may not be negligible. I did a quick check, and it looks like Mahout and Spark MLlib use slightly different implementations for generating random centroids. Would you be able to confirm whether their computation costs, especially the I/O part, are equivalent? Because I know the Spark … Thank you so much!
@pfxuan ooops, a key problem is that the RDD is not cached in the Java code. I suggest it should be cached before being passed to KMeans, like in the Scala version, since Spark is a memory-based computing engine. Even if a very large RDD cannot be fully cached, Spark only uses RDD … On the other hand, we can support both …
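A minimal sketch of the caching suggestion, written in Scala for consistency with the other snippets (the same change would apply to the Java workload); the input path, parsing logic, k, and iterations are placeholders.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

// Parse the input into vectors and cache the RDD before handing it to KMeans,
// so the repeated passes of the k-means|| initializer and of Lloyd's iterations
// do not re-read and re-parse the input every time.
val data = sc.textFile(inputPath)
  .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
  .persist(StorageLevel.MEMORY_ONLY)  // partitions that don't fit are simply recomputed

val model = KMeans.train(data, k, iterations)
```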
This PR makes HiBench hold the same convergence condition on both the MapReduce and Spark benchmarks. Otherwise, Spark uses scalable K-means++ (k-means||) as its default initialization, which incurs a large computational cost and makes the Spark KMeans performance incomparable with MapReduce; e.g. the existing Spark KMeans benchmark is actually slower than the MapReduce benchmark at the same problem size. In comparison, the improved version enables Spark to achieve a >4x speedup using 2 iterations.