
exception trying to run Mnist example #121

Open
dumoulma opened this issue Apr 10, 2016 · 8 comments

Comments

@dumoulma

Hi!

I'm trying to run SparkNet on a MapR cluster running Spark 1.5.2.
I can get Caffe to run locally, including the Python bindings, and the SparkNet assembly is built with the SPARKNETCPU artifacts (with JavaCPP pinned to the 03-16 version, as indicated in another post).

The job starts up and completes Stage 3 successfully, but then throws an exception:
16/04/10 10:18:52 WARN TaskSetManager: Lost task 3.0 in stage 14.0 (TID 41, 10.0.0.217): java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at libs.JavaNDArray.baseFlatInto(JavaNDArray.java:67)
at libs.JavaNDArray.recursiveFlatInto(JavaNDArray.java:79)
at libs.JavaNDArray.recursiveFlatInto(JavaNDArray.java:82)
at libs.JavaNDArray.flatCopy(JavaNDArray.java:93)
at libs.JavaNDArray.toFlat(JavaNDArray.java:111)
at libs.NDArray.toFlat(NDArray.scala:32)
at libs.TensorFlowUtils$.tensorFromNDArray(TensorFlowUtils.scala:71)
at libs.TensorFlowNet$$anonfun$setWeights$1.apply(TensorFlowNet.scala:114)
at libs.TensorFlowNet$$anonfun$setWeights$1.apply(TensorFlowNet.scala:112)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:102)
at libs.TensorFlowNet.setWeights(TensorFlowNet.scala:112)
at apps.MnistApp$$anonfun$main$4.apply$mcVI$sp(MnistApp.scala:96)
at apps.MnistApp$$anonfun$main$4.apply(MnistApp.scala:96)
at apps.MnistApp$$anonfun$main$4.apply(MnistApp.scala:96)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:894)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:894)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Any help would be greatly appreciated.

Note: the Cifar example also fails with what seems to be the exact same error.
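
For reference, the failure mode in the trace above can be reproduced in isolation. This is a minimal sketch, not SparkNet's actual JavaNDArray code; the `FlattenSketch` class and its `flatten` helper are hypothetical, but they show the same mechanism as the `baseFlatInto` frame: `System.arraycopy` throws `ArrayIndexOutOfBoundsException` when the shape metadata implies more elements than the backing array actually holds.

```java
// Hypothetical sketch (not SparkNet's JavaNDArray): a flatten that, like
// baseFlatInto in the trace, calls System.arraycopy with a copy length
// derived from shape metadata. If the declared shape claims more elements
// than the backing buffer contains, arraycopy throws
// ArrayIndexOutOfBoundsException, which is one plausible explanation for
// the crash: the weights' shape and the buffer length disagree somewhere.
public class FlattenSketch {
    static float[] flatten(float[] data, int offset, int[] shape) {
        int size = 1;
        for (int d : shape) size *= d;
        float[] result = new float[size];
        // Throws ArrayIndexOutOfBoundsException when offset + size > data.length
        System.arraycopy(data, offset, result, 0, size);
        return result;
    }

    public static void main(String[] args) {
        float[] buffer = new float[6];    // backing store holds 6 values
        int[] goodShape = {2, 3};         // 6 elements: consistent, works
        System.out.println(flatten(buffer, 0, goodShape).length); // prints 6
        int[] badShape = {2, 4};          // claims 8 elements: inconsistent
        try {
            flatten(buffer, 0, badShape);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("AIOOBE, as in the stack trace");
        }
    }
}
```

If that is what is happening here, the interesting question is why the shape metadata and the buffer length disagree only on this cluster setup.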

@robertnishihara
Member

Just to be sure, can you tell me what command you're running to launch the Mnist app? Also, did you download the Mnist data with SparkNet/data/mnist/get_mnist.sh? Similarly, did you download the Cifar data with SparkNet/data/cifar10/get_cifar10.sh?

@dumoulma
Author

Yes, I used get_mnist.sh/get_cifar10.sh, and I used the command shown in the README:
spark_submit.sh --class apps.CifarApp path/to/Sparknet-jar-with-deps.jar 2

@robertnishihara
Member

I'd suggest running the individual commands from a Spark shell and seeing specifically where the error occurs. Also, are there any error messages on the workers?

@dumoulma
Author

The error happens after the data is loaded. The Caffe network config loads and runs for a bit, then it crashes with the ArrayIndexOutOfBoundsException.
For comparison, CaffeOnSpark runs without issues on that same EC2 instance (m3.xlarge) running a Spark 1.6 or 1.5 standalone cluster with 2 workers.

@robertnishihara
Member

Since it's on EC2, if you want to share the image with us, it would be easy for us to look into it.

It should work fine on the image that we provide (in the README).

@abongLee

Did you solve the problem? I have a similar problem: the SparkNet assembly is also built with the SPARKNETCPU artifacts, and it crashes with the same ArrayIndexOutOfBoundsException.

@dumoulma2

I have not. I was severely pressed for time, so I got CaffeOnSpark working and decided to go with that instead. I would still like to get SparkNet working, though.

@pcmoritz
Collaborator

Hey, thanks for keeping us updated. I think I can reproduce the problem now: it seems to occur in local mode with more than one SparkNet worker. That is not a regime we typically use, which is why we hadn't run into it. I'll keep you updated once I find out why the problem occurs.
