python_mllib_perf: java.lang.OutOfMemoryError: Java heap space #92
Max. Heap Size (Estimated): 8.00G. We made the following changes in config.py: SPARK_HOME_DIR = …, PREP_SPARK_TESTS = …, COMMON_JAVA_OPTS = …, SPARK_DRIVER_MEMORY = …
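The exact values were lost from this report; the sketch below is a hedged reconstruction of what such config.py entries could look like, with the Java option values taken from the "Java options:" line echoed in the log further down. The path and PREP_SPARK_TESTS value are placeholders, not from the original report.

```python
# Hypothetical reconstruction of the edited config.py entries.
# JavaOptionSet comes from spark-perf's own config helpers (see config.py.template).

SPARK_HOME_DIR = "/path/to/spark"   # placeholder, not from the report
PREP_SPARK_TESTS = True             # placeholder, not from the report
SPARK_DRIVER_MEMORY = "8g"          # consistent with the ~8.00G estimated heap above

COMMON_JAVA_OPTS = [
    # Fraction of JVM memory used for caching RDDs.
    JavaOptionSet("spark.storage.memoryFraction", [0.66]),
    JavaOptionSet("spark.serializer", ["org.apache.spark.serializer.JavaSerializer"]),
    JavaOptionSet("spark.executor.memory", ["16g"]),
    JavaOptionSet("spark.locality.wait", [str(60 * 1000 * 1000)]),
    JavaOptionSet("spark.shuffle.manager", ["SORT"]),
]
```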
Is this a bug in …?
I'm not able to figure it out exactly, but I think it's not a problem with …
We even tried …
I see this outside spark-perf, in Spark 1.5.0 with ALS. Something similar was reported in July for Spark 1.3.
So @ehiggs @JoshRosen, I think it might be a bug in …
@JetsonPaul Can you try running a problem of similar size outside of spark-perf, but on the same cluster and with a similar configuration? Also, are you running with only 1 worker with 16GB of memory, and if so, does it work if you add a few more workers?
@jkbradley We got output for …
I see. I'm still not sure if it's a problem with spark-perf or with PySpark requiring more memory than equivalent Scala/Java jobs. The best way to figure that out might be to try to replicate the ML task outside of spark-perf. I wouldn't call this a fix, but one thing you could try is lowering the number of partitions. For ML jobs, I tend to set # partitions = # CPU cores available in the cluster. That will likely mean less memory allocated.
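A minimal sketch of replicating the task outside spark-perf, using the standard pyspark.mllib API available in Spark 1.5. The sizes mirror the failing run's --num-examples/--num-features/--num-partitions; the app name and the make_point helper are illustrative, not from the report.

```python
# Standalone approximation of GLMClassificationTest, run with spark-submit
# against the same cluster. Shrink the sizes first if you just want to
# check that the script itself works.
import random

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="glm-oom-repro")

num_examples = 1000000   # --num-examples=1000000
num_features = 10000     # --num-features=10000
num_partitions = 128     # --num-partitions=128

def make_point(i):
    # Deterministic per-row RNG so the data is reproducible across runs.
    rng = random.Random(i)
    features = [rng.gauss(0.0, 1.0) for _ in range(num_features)]
    label = 1.0 if sum(features) > 0 else 0.0
    return LabeledPoint(label, features)

points = sc.parallelize(range(num_examples), num_partitions).map(make_point).cache()
points.count()  # materializes the cache; this is the step that OOMs in spark-perf

model = LogisticRegressionWithSGD.train(
    points, iterations=20, step=0.001, regType="l2", regParam=0.1)
```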
@jkbradley Where do we need to make those configuration changes? Can anyone please explain a little more, with a small example?
This can be set in the spark-perf/config/config.py file. Note that there is an mllib-specific num-partitions config in that file. |
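Concretely, the relevant block in config/config.py looks roughly like the sketch below (following the OptionSet style in spark-perf's config.py.template). The log below shows --num-partitions=128 and a single 2-core executor, so 2 would match the partitions-per-core rule of thumb above.

```python
# In spark-perf's config/config.py. The template ships num-partitions as a
# scalable 128; pinning it to the number of CPU cores in the cluster (2 for
# the single 2-core executor seen in the log) reduces per-run memory use.
MLLIB_COMMON_OPTS = COMMON_OPTS + [
    # The number of input partitions.
    OptionSet("num-partitions", [2]),   # was: [128], can_scale=True
    # A random seed to make tests reproducible.
    OptionSet("random-seed", [5]),
]
```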
Java options: -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer -Dspark.executor.memory=16g -Dspark.locality.wait=60000000 -Dspark.shuffle.manager=SORT
Options: GLMClassificationTest --num-trials=10 --inter-trial-wait=3 --num-partitions=128 --random-seed=5 --num-examples=1000000 --num-features=10000 --num-iterations=20 --step-size=0.001 --reg-type=l2 --reg-param=0.1 --optimizer=sgd --per-negative=0.3 --scale-factor=1.0 --loss=logistic
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/01 15:06:11 INFO SparkContext: Running Spark version 1.5.1
15/12/01 15:06:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/01 15:06:11 INFO SecurityManager: Changing view acls to: root
15/12/01 15:06:11 INFO SecurityManager: Changing modify acls to: root
15/12/01 15:06:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/12/01 15:06:12 INFO Slf4jLogger: Slf4jLogger started
15/12/01 15:06:12 INFO Remoting: Starting remoting
15/12/01 15:06:12 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:56882]
15/12/01 15:06:12 INFO Utils: Successfully started service 'sparkDriver' on port 56882.
15/12/01 15:06:12 INFO SparkEnv: Registering MapOutputTracker
15/12/01 15:06:12 INFO SparkEnv: Registering BlockManagerMaster
15/12/01 15:06:12 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-528106e5-c257-4af4-9bf6-276bc5a64f00
15/12/01 15:06:12 INFO MemoryStore: MemoryStore started with capacity 10.8 GB
15/12/01 15:06:12 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ddb4b717-dfbe-4167-9ece-02f36c77c8a3/httpd-53fa3c9e-df28-476b-9bdc-b5a2fd07602b
15/12/01 15:06:12 INFO HttpServer: Starting HTTP Server
15/12/01 15:06:12 INFO Utils: Successfully started service 'HTTP file server' on port 56904.
15/12/01 15:06:12 INFO SparkEnv: Registering OutputCommitCoordinator
15/12/01 15:06:13 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/12/01 15:06:13 INFO SparkUI: Started SparkUI at http://10.88.67.113:4040
15/12/01 15:06:13 INFO Utils: Copying /home/test/spark-perf-master/pyspark-tests/mllib_tests.py to /tmp/spark-ddb4b717-dfbe-4167-9ece-02f36c77c8a3/userFiles-228e8f4d-b978-457a-b499-2d1a13e8153c/mllib_tests.py
15/12/01 15:06:13 INFO SparkContext: Added file file:/home/test/spark-perf-master/pyspark-tests/mllib_tests.py at http://10.88.67.113:56904/files/mllib_tests.py with timestamp 1448962573242
15/12/01 15:06:13 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/12/01 15:06:13 INFO AppClient$ClientEndpoint: Connecting to master spark://pts00450-vm8:7077...
15/12/01 15:06:13 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20151201150613-0000
15/12/01 15:06:13 INFO AppClient$ClientEndpoint: Executor added: app-20151201150613-0000/0 on worker-20151201150600-10.88.67.113-54856 (10.88.67.113:54856) with 2 cores
15/12/01 15:06:13 INFO SparkDeploySchedulerBackend: Granted executor ID app-20151201150613-0000/0 on hostPort 10.88.67.113:54856 with 2 cores, 9.0 GB RAM
15/12/01 15:06:13 INFO AppClient$ClientEndpoint: Executor updated: app-20151201150613-0000/0 is now RUNNING
15/12/01 15:06:13 INFO AppClient$ClientEndpoint: Executor updated: app-20151201150613-0000/0 is now LOADING
15/12/01 15:06:13 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46515.
15/12/01 15:06:13 INFO NettyBlockTransferService: Server created on 46515
15/12/01 15:06:13 INFO BlockManagerMaster: Trying to register BlockManager
15/12/01 15:06:13 INFO BlockManagerMasterEndpoint: Registering block manager 10.88.67.113:46515 with 10.8 GB RAM, BlockManagerId(driver, 10.88.67.113, 46515)
15/12/01 15:06:13 INFO BlockManagerMaster: Registered BlockManager
15/12/01 15:06:13 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/12/01 15:06:14 INFO SparkContext: Starting job: count at /home/test/spark-perf-master/pyspark-tests/mllib_tests.py:73
15/12/01 15:06:14 INFO DAGScheduler: Got job 0 (count at /home/test/spark-perf-master/pyspark-tests/mllib_tests.py:73) with 128 output partitions
15/12/01 15:06:14 INFO DAGScheduler: Final stage: ResultStage 0(count at /home/test/spark-perf-master/pyspark-tests/mllib_tests.py:73)
15/12/01 15:06:14 INFO DAGScheduler: Parents of final stage: List()
15/12/01 15:06:14 INFO DAGScheduler: Missing parents: List()
15/12/01 15:06:14 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[3] at count at /home/test/spark-perf-master/pyspark-tests/mllib_tests.py:73), which has no missing parents
15/12/01 15:06:14 INFO MemoryStore: ensureFreeSpace(86504) called with curMem=0, maxMem=11596411699
15/12/01 15:06:14 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 84.5 KB, free 10.8 GB)
15/12/01 15:06:14 INFO MemoryStore: ensureFreeSpace(85000) called with curMem=86504, maxMem=11596411699
15/12/01 15:06:14 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 83.0 KB, free 10.8 GB)
15/12/01 15:06:14 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.88.67.113:46515 (size: 83.0 KB, free: 10.8 GB)
15/12/01 15:06:14 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:861
15/12/01 15:06:14 INFO DAGScheduler: Submitting 128 missing tasks from ResultStage 0 (PythonRDD[3] at count at /home/test/spark-perf-master/pyspark-tests/mllib_tests.py:73)
15/12/01 15:06:14 INFO TaskSchedulerImpl: Adding task set 0.0 with 128 tasks
15/12/01 15:06:17 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://[email protected]:52848/user/Executor#-940891447]) with ID 0
15/12/01 15:06:17 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:17 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:17 INFO BlockManagerMasterEndpoint: Registering block manager 10.88.67.113:51505 with 8.1 GB RAM, BlockManagerId(0, 10.88.67.113, 51505)
15/12/01 15:06:17 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.88.67.113:51505 (size: 83.0 KB, free: 8.1 GB)
15/12/01 15:06:23 INFO BlockManagerInfo: Added rdd_2_0 in memory on 10.88.67.113:51505 (size: 597.0 MB, free: 7.5 GB)
15/12/01 15:06:23 INFO BlockManagerInfo: Added rdd_2_1 in memory on 10.88.67.113:51505 (size: 597.1 MB, free: 6.9 GB)
15/12/01 15:06:24 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:24 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:24 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 6987 ms on 10.88.67.113 (1/128)
15/12/01 15:06:24 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 6931 ms on 10.88.67.113 (2/128)
15/12/01 15:06:28 INFO BlockManagerInfo: Added rdd_2_2 in memory on 10.88.67.113:51505 (size: 597.0 MB, free: 6.4 GB)
15/12/01 15:06:29 INFO BlockManagerInfo: Added rdd_2_3 in memory on 10.88.67.113:51505 (size: 597.1 MB, free: 5.8 GB)
15/12/01 15:06:30 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:30 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 5991 ms on 10.88.67.113 (3/128)
15/12/01 15:06:30 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:30 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 6197 ms on 10.88.67.113 (4/128)
15/12/01 15:06:36 INFO BlockManagerInfo: Added rdd_2_5 in memory on 10.88.67.113:51505 (size: 597.1 MB, free: 5.2 GB)
15/12/01 15:06:36 INFO BlockManagerInfo: Added rdd_2_4 in memory on 10.88.67.113:51505 (size: 597.0 MB, free: 4.6 GB)
15/12/01 15:06:36 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:36 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:36 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 6511 ms on 10.88.67.113 (5/128)
15/12/01 15:06:36 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 6768 ms on 10.88.67.113 (6/128)
15/12/01 15:06:41 INFO BlockManagerInfo: Added rdd_2_6 in memory on 10.88.67.113:51505 (size: 597.0 MB, free: 4.0 GB)
15/12/01 15:06:42 INFO BlockManagerInfo: Added rdd_2_7 in memory on 10.88.67.113:51505 (size: 597.1 MB, free: 3.4 GB)
15/12/01 15:06:42 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:42 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 5899 ms on 10.88.67.113 (7/128)
15/12/01 15:06:42 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:06:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 6308 ms on 10.88.67.113 (8/128)
15/12/01 15:08:24 INFO TaskSetManager: Starting task 10.0 in stage 0.0 (TID 10, 10.88.67.113, PROCESS_LOCAL, 2042 bytes)
15/12/01 15:08:24 WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8, 10.88.67.113): java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2479)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:130)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:105)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:165)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:134)
at org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:294)
at org.xerial.snappy.SnappyOutputStream.compressInput(SnappyOutputStream.java:306)
at org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:245)
at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:107)
at org.apache.spark.io.SnappyOutputStreamWrapper.write(CompressionCodec.scala:189)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:158)
at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199)
at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:132)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:793)
at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:627)
at java.lang.Thread.run(Thread.java:809)
15/12/01 15:08:24 ERROR TaskSchedulerImpl: Lost executor 0 on 10.88.67.113: remote Rpc client disassociated
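For scale: each cached block in the log (rdd_2_0 through rdd_2_7) is roughly 597 MB, the job has 128 partitions, and the executor's BlockManager reports only 8.1 GB of storage memory. A back-of-envelope check, using only numbers taken from the log above:

```python
# Back-of-envelope arithmetic from the log above.
partition_mb = 597.0           # size of each cached rdd_2_* block
num_partitions = 128           # from --num-partitions=128
storage_gb = 8.1               # executor storage memory per BlockManager

total_gb = partition_mb * num_partitions / 1024.0
print("cache needed: %.1f GB" % total_gb)                      # ~74.6 GB
print("available:    %.1f GB (~%.0fx over)" % (storage_gb, total_gb / storage_gb))
```

So with these settings the cached RDD would need roughly nine times the available storage memory, which is consistent with the advice above to reduce the partition count or add workers.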