
python_mllib_perf: java.lang.OutOfMemoryError: Java heap space #92

Open
ghost opened this issue Dec 1, 2015 · 11 comments

ghost commented Dec 1, 2015

Java options: -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer -Dspark.executor.memory=16g -Dspark.locality.wait=60000000 -Dspark.shuffle.manager=SORT
Options: GLMClassificationTest --num-trials=10 --inter-trial-wait=3 --num-partitions=128 --random-seed=5 --num-examples=1000000 --num-features=10000 --num-iterations=20 --step-size=0.001 --reg-type=l2 --reg-param=0.1 --optimizer=sgd --per-negative=0.3 --scale-factor=1.0 --loss=logistic

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/01 15:06:11 INFO SparkContext: Running Spark version 1.5.1
15/12/01 15:06:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/01 15:06:11 INFO SecurityManager: Changing view acls to: root
15/12/01 15:06:11 INFO SecurityManager: Changing modify acls to: root
15/12/01 15:06:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/12/01 15:06:12 INFO Slf4jLogger: Slf4jLogger started
15/12/01 15:06:12 INFO Remoting: Starting remoting
15/12/01 15:06:12 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:56882]
15/12/01 15:06:12 INFO Utils: Successfully started service 'sparkDriver' on port 56882.
15/12/01 15:06:12 INFO SparkEnv: Registering MapOutputTracker
15/12/01 15:06:12 INFO SparkEnv: Registering BlockManagerMaster
15/12/01 15:06:12 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-528106e5-c257-4af4-9bf6-276bc5a64f00
15/12/01 15:06:12 INFO MemoryStore: MemoryStore started with capacity 10.8 GB
15/12/01 15:06:12 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ddb4b717-dfbe-4167-9ece-02f36c77c8a3/httpd-53fa3c9e-df28-476b-9bdc-b5a2fd07602b
15/12/01 15:06:12 INFO HttpServer: Starting HTTP Server
15/12/01 15:06:12 INFO Utils: Successfully started service 'HTTP file server' on port 56904.
15/12/01 15:06:12 INFO SparkEnv: Registering OutputCommitCoordinator
15/12/01 15:06:13 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/12/01 15:06:13 INFO SparkUI: Started SparkUI at http://10.88.67.113:4040
15/12/01 15:06:13 INFO Utils: Copying /home/test/spark-perf-master/pyspark-tests/mllib_tests.py to /tmp/spark-ddb4b717-dfbe-4167-9ece-02f36c77c8a3/userFiles-228e8f4d-b978-457a-b499-2d1a13e8153c/mllib_tests.py
15/12/01 15:06:13 INFO SparkContext: Added file file:/home/test/spark-perf-master/pyspark-tests/mllib_tests.py at http://10.88.67.113:56904/files/mllib_tests.py with timestamp 1448962573242
15/12/01 15:06:13 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/12/01 15:06:13 INFO AppClient$ClientEndpoint: Connecting to master spark://pts00450-vm8:7077...
15/12/01 15:06:13 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20151201150613-0000
15/12/01 15:06:13 INFO AppClient$ClientEndpoint: Executor added: app-20151201150613-0000/0 on worker-20151201150600-10.88.67.113-54856 (10.88.67.113:54856) with 2 cores
15/12/01 15:06:13 INFO SparkDeploySchedulerBackend: Granted executor ID app-20151201150613-0000/0 on hostPort 10.88.67.113:54856 with 2 cores, 9.0 GB RAM
15/12/01 15:06:13 INFO AppClient$ClientEndpoint: Executor updated: app-20151201150613-0000/0 is now RUNNING
15/12/01 15:06:13 INFO AppClient$ClientEndpoint: Executor updated: app-20151201150613-0000/0 is now LOADING
15/12/01 15:06:13 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46515.
15/12/01 15:06:13 INFO NettyBlockTransferService: Server created on 46515
15/12/01 15:06:13 INFO BlockManagerMaster: Trying to register BlockManager
15/12/01 15:06:13 INFO BlockManagerMasterEndpoint: Registering block manager 10.88.67.113:46515 with 10.8 GB RAM, BlockManagerId(driver, 10.88.67.113, 46515)
15/12/01 15:06:13 INFO BlockManagerMaster: Registered BlockManager
15/12/01 15:06:13 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/12/01 15:06:14 INFO SparkContext: Starting job: count at /home/test/spark-perf-master/pyspark-tests/mllib_tests.py:73
15/12/01 15:06:14 INFO DAGScheduler: Got job 0 (count at /home/test/spark-perf-master/pyspark-tests/mllib_tests.py:73) with 128 output partitions
15/12/01 15:06:14 INFO DAGScheduler: Final stage: ResultStage 0(count at /home/test/spark-perf-master/pyspark-tests/mllib_tests.py:73)
15/12/01 15:06:14 INFO DAGScheduler: Parents of final stage: List()
15/12/01 15:06:14 INFO DAGScheduler: Missing parents: List()
15/12/01 15:06:14 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[3] at count at /home/test/spark-perf-master/pyspark-tests/mllib_tests.py:73), which has no missing parents
15/12/01 15:06:14 INFO MemoryStore: ensureFreeSpace(86504) called with curMem=0, maxMem=11596411699
15/12/01 15:06:14 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 84.5 KB, free 10.8 GB)
15/12/01 15:06:14 INFO MemoryStore: ensureFreeSpace(85000) called with curMem=86504, maxMem=11596411699
15/12/01 15:06:14 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 83.0 KB, free 10.8 GB)
15/12/01 15:06:14 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.88.67.113:46515 (size: 83.0 KB, free: 10.8 GB)
15/12/01 15:06:14 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:861
15/12/01 15:06:14 INFO DAGScheduler: Submitting 128 missing tasks from ResultStage 0 (PythonRDD[3] at count at /home/test/spark-perf-master/pyspark-tests/mllib_tests.py:73)
15/12/01 15:06:14 INFO TaskSchedulerImpl: Adding task set 0.0 with 128 tasks
15/12/01 15:06:17 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://[email protected]:52848/user/Executor#-940891447]) with ID 0
15/12/01 15:06:17 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:17 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:17 INFO BlockManagerMasterEndpoint: Registering block manager 10.88.67.113:51505 with 8.1 GB RAM, BlockManagerId(0, 10.88.67.113, 51505)
15/12/01 15:06:17 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.88.67.113:51505 (size: 83.0 KB, free: 8.1 GB)
15/12/01 15:06:23 INFO BlockManagerInfo: Added rdd_2_0 in memory on 10.88.67.113:51505 (size: 597.0 MB, free: 7.5 GB)
15/12/01 15:06:23 INFO BlockManagerInfo: Added rdd_2_1 in memory on 10.88.67.113:51505 (size: 597.1 MB, free: 6.9 GB)
15/12/01 15:06:24 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:24 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:24 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 6987 ms on 10.88.67.113 (1/128)
15/12/01 15:06:24 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 6931 ms on 10.88.67.113 (2/128)
15/12/01 15:06:28 INFO BlockManagerInfo: Added rdd_2_2 in memory on 10.88.67.113:51505 (size: 597.0 MB, free: 6.4 GB)
15/12/01 15:06:29 INFO BlockManagerInfo: Added rdd_2_3 in memory on 10.88.67.113:51505 (size: 597.1 MB, free: 5.8 GB)
15/12/01 15:06:30 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:30 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 5991 ms on 10.88.67.113 (3/128)
15/12/01 15:06:30 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:30 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 6197 ms on 10.88.67.113 (4/128)
15/12/01 15:06:36 INFO BlockManagerInfo: Added rdd_2_5 in memory on 10.88.67.113:51505 (size: 597.1 MB, free: 5.2 GB)
15/12/01 15:06:36 INFO BlockManagerInfo: Added rdd_2_4 in memory on 10.88.67.113:51505 (size: 597.0 MB, free: 4.6 GB)
15/12/01 15:06:36 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:36 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:36 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 6511 ms on 10.88.67.113 (5/128)
15/12/01 15:06:36 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 6768 ms on 10.88.67.113 (6/128)
15/12/01 15:06:41 INFO BlockManagerInfo: Added rdd_2_6 in memory on 10.88.67.113:51505 (size: 597.0 MB, free: 4.0 GB)
15/12/01 15:06:42 INFO BlockManagerInfo: Added rdd_2_7 in memory on 10.88.67.113:51505 (size: 597.1 MB, free: 3.4 GB)
15/12/01 15:06:42 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:42 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 5899 ms on 10.88.67.113 (7/128)
15/12/01 15:06:42 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 6308 ms on 10.88.67.113 (8/128)
15/12/01 15:08:24 INFO TaskSetManager: Starting task 10.0 in stage 0.0 (TID 10, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:08:24 WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8, 10.88.67.113): java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2479)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:130)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:105)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:165)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:134)
at org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:294)
at org.xerial.snappy.SnappyOutputStream.compressInput(SnappyOutputStream.java:306)
at org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:245)
at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:107)
at org.apache.spark.io.SnappyOutputStreamWrapper.write(CompressionCodec.scala:189)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:158)
at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199)
at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:132)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:793)
at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:627)
at java.lang.Thread.run(Thread.java:809)

15/12/01 15:08:24 ERROR TaskSchedulerImpl: Lost executor 0 on 10.88.67.113: remote Rpc client disassociated

ghost (Author) commented Dec 1, 2015

Max. Heap Size (Estimated): 8.00G


We made the following changes in the spark-perf-master/config/config.py file:

SPARK_HOME_DIR = "/home/test/spark-1.5.1"
RUN_SPARK_TESTS = False
RUN_PYSPARK_TESTS = False
RUN_STREAMING_TESTS = False
RUN_MLLIB_TESTS = False
RUN_PYTHON_MLLIB_TESTS = True

PREP_SPARK_TESTS = False
PREP_PYSPARK_TESTS = False
PREP_STREAMING_TESTS = False
PREP_MLLIB_TESTS = False

COMMON_JAVA_OPTS = [
    JavaOptionSet("spark.storage.memoryFraction", [0.66]),
    JavaOptionSet("spark.serializer", ["org.apache.spark.serializer.JavaSerializer"]),
    JavaOptionSet("spark.executor.memory", ["16g"]),
    JavaOptionSet("spark.locality.wait", [str(60 * 1000 * 1000)]),
]

SPARK_DRIVER_MEMORY = "4g"
MLLIB_SPARK_VERSION = 1.5

JoshRosen (Contributor) commented:

Is this a bug in spark-perf? A problem with Spark 1.5.1? I'm trying to figure out if this issue is actionable.

ghost (Author) commented Dec 2, 2015

We haven't been able to figure it out exactly, but I don't think it's a problem with Spark 1.5.1, because we do get output for the other SPARK_TESTS entries in spark-perf-master/config/config.py.

ghost (Author) commented Dec 7, 2015

We also tried spark-perf with Spark 1.5.2 and got the same failure for RUN_PYTHON_MLLIB_TESTS = True, this time reported as java.io.IOException (No space left on device).

ehiggs commented Dec 7, 2015

I see this outside spark-perf too, in Spark 1.5.0 with ALS. Something similar was reported in July for Spark 1.3.

ghost (Author) commented Dec 8, 2015

So @ehiggs @JoshRosen, I think it might be a bug in spark-perf. Please take action on this issue.

jkbradley (Contributor) commented:

@JetsonPaul Can you try running a problem of similar size outside of Spark perf, but on the same cluster and with a similar configuration? Also, are you running with only 1 worker with 16GB of memory, and if so, does it work if you add a few more workers?
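
For reference, a standalone job of roughly the reported size might look like the sketch below. This is only an illustrative reconstruction from the GLMClassificationTest options in the original report (not the spark-perf test code itself), and it assumes the Spark 1.5 pyspark.mllib RDD API; names and the labeling rule are made up for the example.

```python
# Illustrative standalone reproduction of a GLM classification workload of the
# reported size; sizes mirror --num-examples/--num-features/--num-partitions
# from the original report. Not the actual spark-perf test code.
from pyspark import SparkContext
from pyspark.mllib.random import RandomRDDs
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext(appName="glm-oom-repro")

num_examples = 1000 * 1000
num_features = 10 * 1000
num_partitions = 128

# Dense random feature vectors with synthetic binary labels.
features = RandomRDDs.normalVectorRDD(
    sc, num_examples, num_features, numPartitions=num_partitions, seed=5)
data = features.map(lambda v: LabeledPoint(1 if v[0] > 0 else 0, v)).cache()
data.count()  # forces the cache; this is where the OOM appears in the log above

model = LogisticRegressionWithSGD.train(
    data, iterations=20, step=0.001, regParam=0.1, regType="l2")
```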

ghost (Author) commented Dec 9, 2015

@jkbradley We got output for RUN_MLLIB_TESTS = True on the same cluster with a similar configuration, but RUN_PYTHON_MLLIB_TESTS = True gives the issue above. So far we have only tried 1 worker with 24GB of memory (max), not more workers.

jkbradley (Contributor) commented:

I see. I'm still not sure if it's a problem with spark-perf or with pyspark requiring more memory than equivalent Scala/Java jobs. The best way to figure that out might be to try to replicate the ML task outside of spark-perf.

I wouldn't call this a fix, but one thing you could try is lowering the number of partitions. For ML jobs, I tend to set # partitions = # CPU cores available in cluster. That will likely mean less memory allocated.
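
For context, a quick back-of-envelope calculation (assuming dense 8-byte double features, which the ~597 MB cached blocks in the log above suggest) shows how much data this test tries to cache relative to the 8.1 GB block manager the executor registered with:

```python
# Back-of-envelope sizing for the reported workload; figures are taken from the
# options in the original report and assume dense 8-byte double features.
num_examples = 1000 * 1000
num_features = 10 * 1000
num_partitions = 128

bytes_per_example = num_features * 8             # ~80 KB per dense vector
total_bytes = num_examples * bytes_per_example   # ~80 GB for the whole RDD
per_partition_mib = total_bytes / num_partitions / 2**20

print(per_partition_mib)  # ~596 MiB, in line with the ~597 MB rdd_2_* blocks
```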

ghost (Author) commented Dec 10, 2015

@jkbradley Where do we need to make those configuration changes? Could anyone explain a little more, with a small example?

jkbradley (Contributor) commented:

> For ML jobs, I tend to set # partitions = # CPU cores available in cluster

This can be set in the spark-perf/config/config.py file. Note that there is an mllib-specific num-partitions config in that file.
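
As a rough sketch, such a change might look like the snippet below. The MLLIB_COMMON_OPTS / OptionSet names are assumed from the usual layout of that file and may differ in your copy; the point is simply to lower num-partitions for the MLlib tests to the number of CPU cores available (2, per the executor registration in the log above).

```python
# Hypothetical edit to spark-perf/config/config.py: lower the MLlib tests'
# partition count to match the number of CPU cores in the cluster (2 here).
# Variable and option-set names are assumed, not copied from the file.
MLLIB_COMMON_OPTS = COMMON_OPTS + [
    OptionSet("num-partitions", [2]),   # was [128] in the run above
    OptionSet("random-seed", [5]),
]
```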
