
exception trying to run Mnist example #121

Open
dumoulma opened this issue Apr 10, 2016 · 8 comments

Comments

@dumoulma

Hi!

I'm trying to run SparkNet on a MapR cluster running Spark 1.5.2.
I can get Caffe to run locally, including the Python bindings, and the SparkNet assembly is built with the SPARKNETCPU artifacts (with JavaCPP pinned to the 03-16 version, as indicated in another post).

The job starts up and completes Stage 3 successfully, but then throws an exception:
16/04/10 10:18:52 WARN TaskSetManager: Lost task 3.0 in stage 14.0 (TID 41, 10.0.0.217): java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at libs.JavaNDArray.baseFlatInto(JavaNDArray.java:67)
at libs.JavaNDArray.recursiveFlatInto(JavaNDArray.java:79)
at libs.JavaNDArray.recursiveFlatInto(JavaNDArray.java:82)
at libs.JavaNDArray.flatCopy(JavaNDArray.java:93)
at libs.JavaNDArray.toFlat(JavaNDArray.java:111)
at libs.NDArray.toFlat(NDArray.scala:32)
at libs.TensorFlowUtils$.tensorFromNDArray(TensorFlowUtils.scala:71)
at libs.TensorFlowNet$$anonfun$setWeights$1.apply(TensorFlowNet.scala:114)
at libs.TensorFlowNet$$anonfun$setWeights$1.apply(TensorFlowNet.scala:112)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:102)
at libs.TensorFlowNet.setWeights(TensorFlowNet.scala:112)
at apps.MnistApp$$anonfun$main$4.apply$mcVI$sp(MnistApp.scala:96)
at apps.MnistApp$$anonfun$main$4.apply(MnistApp.scala:96)
at apps.MnistApp$$anonfun$main$4.apply(MnistApp.scala:96)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:894)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:894)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Any help would be greatly appreciated.

Note: the Cifar example also fails with what seems to be the exact same error.
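
For reference, the failure mode in the trace above can be reproduced in isolation. This is a minimal sketch, not SparkNet's actual JavaNDArray code; the `FlattenSketch` class and its `flatten` helper are hypothetical, but they show the same mechanism as the `baseFlatInto` frame: `System.arraycopy` throws `ArrayIndexOutOfBoundsException` when the shape metadata implies more elements than the backing array actually holds.

```java
// Hypothetical sketch (not SparkNet's JavaNDArray): a flatten that, like
// baseFlatInto in the trace, calls System.arraycopy with a copy length
// derived from shape metadata. If the declared shape claims more elements
// than the backing buffer contains, arraycopy throws
// ArrayIndexOutOfBoundsException, which is one plausible explanation for
// the crash: the weights' shape and the buffer length disagree somewhere.
public class FlattenSketch {
    static float[] flatten(float[] data, int offset, int[] shape) {
        int size = 1;
        for (int d : shape) size *= d;
        float[] result = new float[size];
        // Throws ArrayIndexOutOfBoundsException when offset + size > data.length
        System.arraycopy(data, offset, result, 0, size);
        return result;
    }

    public static void main(String[] args) {
        float[] buffer = new float[6];    // backing store holds 6 values
        int[] goodShape = {2, 3};         // 6 elements: consistent, works
        System.out.println(flatten(buffer, 0, goodShape).length); // prints 6
        int[] badShape = {2, 4};          // claims 8 elements: inconsistent
        try {
            flatten(buffer, 0, badShape);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("AIOOBE, as in the stack trace");
        }
    }
}
```

If that is what is happening here, the interesting question is why the shape metadata and the buffer length disagree only on this cluster setup.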

@robertnishihara
Member

Just to be sure, can you tell me what command you're running to launch the Mnist app? Also, did you download the Mnist data with SparkNet/data/mnist/get_mnist.sh? Similarly, did you download the Cifar data with SparkNet/data/cifar10/get_cifar10.sh?

@dumoulma
Author

Yes, I used get_mnist.sh/get_cifar10.sh, and I used the command shown in the README:
spark_submit.sh --class apps.CifarApp path/to/Sparknet-jar-with-deps.jar 2

@robertnishihara
Member

I'd suggest running the individual commands from a Spark shell and seeing specifically where the error occurs. Also, are there any error messages on the workers?

@dumoulma
Author

The error happens after the data is loaded. The Caffe network config loads and runs for a bit, then it crashes with the ArrayIndexOutOfBoundsException.
For comparison, CaffeOnSpark runs without issues on that same EC2 instance (m3.xlarge) running a Spark 1.6 or 1.5 standalone cluster with 2 workers.

@robertnishihara
Member

Since it's on EC2, if you want to share the image with us, it would be easy for us to look into it.

It should work fine on the image that we provide (in the README).

@abongLee

Did you solve the problem? I have a similar problem: the SparkNet assembly is also built with the SPARKNETCPU artifacts, and it crashes with the same ArrayIndexOutOfBoundsException.

@dumoulma2

I have not. I was severely pressed for time, so I got CaffeOnSpark working and decided to go with that instead. I would still like to get SparkNet working, though.

@pcmoritz
Collaborator

Hey, thanks for keeping us updated. I think I can reproduce the problem now: it seems to occur in local mode with more than one SparkNet worker. That is not a regime we typically use, which is why we hadn't run into it. I'll keep you updated once I find out why the problem occurs.
