java builder fails to die when encountering OutOfMemoryError #14093

Closed
asuffield opened this issue Oct 8, 2021 · 9 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-Rules-Java Issues for Java rules type: bug

Comments

@asuffield
Contributor

This catch silently discards OutOfMemoryError: https://cs.opensource.google/bazel/bazel/+/master:src/java_tools/buildjar/java/com/google/devtools/build/buildjar/javac/BlazeJavacMain.java;l=139;drc=935f783fc54b55168db7f156c9661c4584892b6e

This has the unfortunate effect that if the JVM is borderline on memory, the worker keeps running, keeps collecting compile actions, and fails them all with OutOfMemoryError. I believe the right thing to do here is just to rethrow instances of VirtualMachineError.
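
To illustrate the difference between the two behaviors, here is a minimal, self-contained sketch. It is not the BlazeJavacMain code: the `Action` interface, the `Status` enum, and both method names are hypothetical stand-ins, and the OutOfMemoryError is simulated rather than produced by real heap exhaustion.

```java
final class RethrowVmErrorSketch {
  // Hypothetical stand-in for the builder's result status.
  enum Status { OK, CRASHED }

  // Hypothetical stand-in for a single compile action.
  interface Action {
    void run() throws Exception;
  }

  // Catch-all handling: an OutOfMemoryError is turned into a status,
  // so the caller's loop keeps accepting new work.
  static Status swallowing(Action action) {
    try {
      action.run();
      return Status.OK;
    } catch (Throwable t) {
      return Status.CRASHED;
    }
  }

  // Proposed handling: VirtualMachineError (which includes OutOfMemoryError)
  // is rethrown before the catch-all clause can absorb it.
  static Status rethrowing(Action action) {
    try {
      action.run();
      return Status.OK;
    } catch (VirtualMachineError e) {
      throw e;
    } catch (Throwable t) {
      return Status.CRASHED;
    }
  }

  public static void main(String[] args) {
    Action simulatedOom = () -> {
      throw new OutOfMemoryError("simulated");
    };
    System.out.println("swallowing: " + swallowing(simulatedOom));
    try {
      rethrowing(simulatedOom);
    } catch (OutOfMemoryError e) {
      System.out.println("rethrowing: error propagated to the caller");
    }
  }
}
```

Rethrowing only ends the worker if every caller further up the stack also lets the error terminate the process.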

@aiuto aiuto added team-Rules-Java Issues for Java rules untriaged labels Oct 11, 2021
@comius
Contributor

comius commented Nov 4, 2021

cc @larsrc-google as this looks workers related

@comius comius added P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) and removed untriaged labels Jan 31, 2022
@comius
Contributor

comius commented Jan 31, 2022

From looking at the surrounding code, the status is set to Crashed and the stack trace is printed. Could you provide more data? And perhaps a link to the code with the commit ID in it.

@larsrc-google
Contributor

@cushon

@larsrc-google
Contributor

The worker code doesn't look at that status, so indeed it keeps going.

@asuffield
Contributor Author

It's an annoying one to reproduce (you need to push a JVM to the point where it throws OutOfMemoryError inside the build but has enough memory for the outer loop to keep going, which is a fine margin), but this was an "observed in production" bug: it empirically can happen, and once the worker gets into that state, it stays there.

@cushon
Contributor

cushon commented Jan 31, 2022

I think I can reproduce with something like the following. After rebuilding java_tools, I see workers reporting OOMs as errors, which show up in the Blaze log.

diff --git a/src/java_tools/buildjar/java/com/google/devtools/build/buildjar/javac/BlazeJavacMain.java b/src/java_tools/buildjar/java/com/google/devtools/build/buildjar/javac/BlazeJavacMain.java
index 2c66d662e9..427c91771b 100644
--- a/src/java_tools/buildjar/java/com/google/devtools/build/buildjar/javac/BlazeJavacMain.java
+++ b/src/java_tools/buildjar/java/com/google/devtools/build/buildjar/javac/BlazeJavacMain.java
@@ -133,11 +133,18 @@ public class BlazeJavacMain {
                   fileManager.getJavaFileObjectsFromPaths(arguments.sourceFiles()),
                   context);

+      StringBuilder sb = new StringBuilder();
+      for (int i = 0; i < Integer.MAX_VALUE; i++) {
+        sb.append("the rain in spain falls mainly on the plain".repeat(1000));
+      }
+
       try {
         status = fromResult(((JavacTaskImpl) task).doCall());
       } catch (PropagatedException e) {
         throw e.getCause();

Adding the following doesn't help:

         status = fromResult(((JavacTaskImpl) task).doCall());
       } catch (PropagatedException e) {
         throw e.getCause();
       }
+    } catch (VirtualMachineError e) {
+      throw e;
     } catch (Throwable t) {
       if (t.getCause() instanceof CancelRequestException) {

With --worker_verbose I see four workers start up, and I see OutOfMemoryErrors in their logs, and then Bazel just hangs waiting for the workers:

[680 / 1,313] 12 actions, 4 running
    Building src/main/java/com/google/devtools/common/options/processor/liboptions_preprocessor_lib.jar (3 source files) [for tool]; 230s multiplex-worker
    //src/main/java/com/google/devtools/common/options:options_internal; 226s multiplex-worker
    Building src/main/java/com/google/devtools/build/lib/buildeventstream/proto/libbuild_event_stream_proto-speed.jar (1 source jar); 207s multiplex-worker
    Building src/main/java/com/google/devtools/build/lib/clock/libclock.jar (3 source files); 207s multiplex-worker

@larsrc-google what is the correct way for workers to handle this?

@larsrc-google
Contributor

@cushon For this kind of failure, the process should terminate in whatever way it sees fit, and Bazel ought to pick that up. In your repro, is the worker process still alive after the OOM?

@cushon
Contributor

cushon commented Feb 8, 2022

I think the hang when JavaBuilder rethrows VirtualMachineError happens because WorkRequestHandler uses a separate thread to process the requests, and that thread doesn't set an UncaughtExceptionHandler.
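
For context, a throwable that escapes a non-main thread terminates only that thread. The JVM keeps running while other non-daemon threads are alive, so anything waiting on the dead thread's response can wait forever. Below is a minimal sketch of the general mechanism using plain `java.lang.Thread`. It is not the WorkRequestHandler code and not necessarily how the eventual fix works: the class name, method name, and thread name are hypothetical, and the OutOfMemoryError is simulated.

```java
import java.util.concurrent.atomic.AtomicReference;

final class UncaughtHandlerSketch {
  // Runs the task on a separate thread and returns whatever throwable
  // escaped it, or null if the task completed normally.
  static Throwable runAndCapture(Runnable task) {
    AtomicReference<Throwable> seen = new AtomicReference<>();
    Thread thread = new Thread(task, "request-processor"); // hypothetical name
    // Without a handler, the error only gets printed to stderr by the
    // default thread-group behavior and the thread quietly dies.
    thread.setUncaughtExceptionHandler((t, error) -> seen.set(error));
    thread.start();
    try {
      thread.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for task", e);
    }
    return seen.get();
  }

  public static void main(String[] args) {
    Throwable seen = runAndCapture(() -> {
      throw new OutOfMemoryError("simulated");
    });
    System.out.println("handler saw: " + seen);
    // Assumption: a real worker could react here by ending the process,
    // for example with Runtime.getRuntime().halt(1), so that the parent
    // build tool notices the worker is gone instead of waiting on it.
  }
}
```

The handler gives the process one place to decide that a fatal error on a request thread should take the whole worker down.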

@larsrc-google larsrc-google self-assigned this Feb 14, 2022
@larsrc-google larsrc-google added P1 I'll work on this now. (Assignee required) and removed P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) labels Feb 14, 2022
@larsrc-google
Contributor

I do believe f95fda5 fixed this.
