Memory Leak when Recompiling Modules and Reinstantiating Runtimes #1534

Closed
abraithwaite opened this issue Jun 23, 2023 · 22 comments
Labels
bug Something isn't working

Comments

@abraithwaite
Contributor

Describe the bug
We are recompiling modules frequently because our users may change their code.
We're also reinstantiating runtimes as a way to easily clear the module cache and start with a "clean slate" (we think/thought) for isolation purposes.
We're using the python interpreter guest.
We're closing all references to all wazero structs before reinstantiating things.
The memory continues to grow between each close/create cycle.

To Reproduce
Reproducible; just run main.go:
https://github.com/runreveal/wazero_memleak
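
For reference, the cycle looks roughly like this (a minimal sketch, not the linked reproducer verbatim; "guest.wasm" is a placeholder for the module we actually load, and WASI/host-module setup is omitted):

package main

import (
	"context"
	"log"
	"os"

	"github.com/tetratelabs/wazero"
)

func main() {
	ctx := context.Background()
	wasmBytes, err := os.ReadFile("guest.wasm") // placeholder for the real guest module
	if err != nil {
		log.Fatal(err)
	}
	for {
		r := wazero.NewRuntime(ctx) // compiler mode by default on amd64/arm64
		compiled, err := r.CompileModule(ctx, wasmBytes)
		if err != nil {
			log.Fatal(err)
		}
		mod, err := r.InstantiateModule(ctx, compiled, wazero.NewModuleConfig())
		if err != nil {
			log.Fatal(err)
		}
		// Close everything we hold before the next cycle; despite this, the
		// process RSS keeps growing between iterations.
		_ = mod.Close(ctx)
		_ = compiled.Close(ctx)
		_ = r.Close(ctx)
	}
}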

Expected behavior
We expect memory to not grow when we're not holding any references to wazero objects any longer.

Environment (please complete the relevant information):

  • Go version: 1.20.5
  • wazero Version: v1.2.1
  • Host architecture: amd64, arm64 (m2)
  • Runtime mode: compiler mode
@abraithwaite abraithwaite added the bug Something isn't working label Jun 23, 2023
@inkeliz
Contributor

inkeliz commented Jun 25, 2023

I was able to reproduce the issue, ran some pprof profiles, and took a look at the code. My suspicion was: "some mmap is leaking", and that seems to be the case, but it was odd, because wazero uses runtime.SetFinalizer.

However, releaseCompiledModule is never called. So I tried to hunt down who is holding the reference, and made a hack to remove the "inner reference".

Please, apply that change and see if that fix the issue: 7367c2e

Patch: https://github.com/tetratelabs/wazero/commit/7367c2ea24b3127280bf7210be71cb972b55aa00.patch

I noticed the following:

	compiledModule struct {
		executable        asm.CodeSegment
		functions         []compiledFunction
		source            *wasm.Module
		ensureTermination bool
	}

If you take a look at []compiledFunction, each compiledFunction has:

compiledFunction struct {
		executableOffset uintptr
		stackPointerCeil uint64

		index           wasm.Index
		goFunc          interface{}
		listener        experimental.FunctionListener

		parent          *compiledModule                          ////  << HERE?!

		sourceOffsetMap sourceOffsetMap
	}

Each compiledFunction holds a pointer back to its compiledModule. So, I GUESS, that is the reason for the leak: the compiledModule is always referenced.

That is not the perfect solution. Someone needs to find out why compiledFunction is still referenced, because that might be the real cause of the failure to release: if compiledFunction is not released, then compiledModule isn't either. My change doesn't fix the first problem, just the second one (which almost fixes the memory leak).
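
To illustrate the mechanism in isolation (a standalone sketch, not wazero's code): runtime.SetFinalizer only fires once the object is unreachable, so any compiledFunction that stays reachable keeps its parent compiledModule, and therefore the munmap in the finalizer, from ever running:

package main

import (
	"fmt"
	"runtime"
	"time"
)

// Hypothetical stand-ins for the compiledModule/compiledFunction relationship.
type module struct{ funcs []function }
type function struct{ parent *module }

func main() {
	var retained []function // simulates engine-level state that outlives the module

	m := &module{}
	runtime.SetFinalizer(m, func(*module) {
		fmt.Println("finalizer ran: the mmapped code segment would be released here")
	})
	m.funcs = []function{{parent: m}}

	retained = append(retained, m.funcs...)
	m = nil // drop the "outer" reference, as closing the wazero objects would

	runtime.GC()
	time.Sleep(100 * time.Millisecond)
	// The finalizer never runs: retained[0].parent keeps the module reachable.
	fmt.Println("functions still retained:", len(retained))
}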

@abraithwaite
Contributor Author

This is great! Thank you!! 🙏

Testing with our contrived test seems to have a stable memory profile and runtime! 🎉

@abraithwaite
Contributor Author

abraithwaite commented Jun 26, 2023

Grrr, fat fingered "enter"...

ANYWAY, I was going to say that we tried it in our environment and actually started getting panics for what seems like a related reference.

Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: panic: module closed with exit_code(0) [recovered]
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         panic: runtime error: invalid memory address or nil pointer dereference
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0xf0ac76]
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: goroutine 2911 [running]:
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/tetratelabs/wazero/internal/engine/compiler.(*function).definition(...)
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/golang/pkg/mod/github.com/tetratelabs/[email protected]/internal/engine/compiler/engine.go:699
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/tetratelabs/wazero/internal/engine/compiler.(*callEngine).deferredOnCall(0xc00c5b14a0, {0x3a84a30, 0xc00f8eb1d0}, 0xc0154e7110?, {0x112a860, 0x56c7f94})
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/golang/pkg/mod/github.com/tetratelabs/[email protected]/internal/engine/compiler/engine.go:831 +0x196
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/tetratelabs/wazero/internal/engine/compiler.(*callEngine).call.func1()
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/golang/pkg/mod/github.com/tetratelabs/[email protected]/internal/engine/compiler/engine.go:739 +0x6e
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: panic({0x112a860, 0x56c7f94})
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /opt/homebrew/Cellar/go/1.20.5/libexec/src/runtime/panic.go:884 +0x213
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/tetratelabs/wazero/internal/engine/compiler.(*callEngine).execWasmFunction(0xc00c5b14a0, {0x3a84a30, 0xc00f8eb1d0}, 0xf0?)
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/golang/pkg/mod/github.com/tetratelabs/[email protected]/internal/engine/compiler/engine.go:1056 +0x425
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/tetratelabs/wazero/internal/engine/compiler.(*callEngine).call(0xc00c5b14a0, {0x3a84a30, 0xc00f8eb1d0}, {0x0?, 0x0?, 0xc00ff00820?}, {0x0, 0x0, 0x0})
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/golang/pkg/mod/github.com/tetratelabs/[email protected]/internal/engine/compiler/engine.go:754 +0x2e5
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/tetratelabs/wazero/internal/engine/compiler.(*callEngine).Call(0xc00c5b13b0?, {0x3a84a30?, 0xc00f8eb1d0?}, {0x0?, 0x3a84a30?, 0xc00f8eb1d0?})
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/golang/pkg/mod/github.com/tetratelabs/[email protected]/internal/engine/compiler/engine.go:708 +0xd9
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/tetratelabs/wazero.(*runtime).InstantiateModule(0xc00ff20fc0, {0x3a84a30, 0xc00f8eb1d0}, {0x3a8b550?, 0xc00fde8200}, {0x3a92620?, 0xc00fc78ff0})
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/golang/pkg/mod/github.com/tetratelabs/[email protected]/runtime.go:333 +0x41c
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/runreveal/runreveal/internal/wasm/python.WASM.Run({{0x3a8b550, 0xc00fde8200}, {0x3a8c2d8, 0xc00ff20fc0}, {0x3a92620, 0xc00fc78ff0}, {{0x3a7d0e0, 0xc00d76bfe0}}, {{0x3a7d0c0, 0xc00d76bfe8}}, ...})
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/github.com/runreveal/runreveal/internal/wasm/python/wasm.go:180 +0x4f
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/runreveal/runreveal/internal/transforms.(*TransformEngine).loadTransforms.func1()
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/github.com/runreveal/runreveal/internal/transforms/transforms.go:187 +0xa5
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: created by github.com/runreveal/runreveal/internal/transforms.(*TransformEngine).loadTransforms
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/github.com/runreveal/runreveal/internal/transforms/transforms.go:186 +0x7b3

This only appears to happen when we're recreating the runtime (calling the Close methods), not on first instantiation, or during routine operation.

@abraithwaite abraithwaite reopened this Jun 26, 2023
@codefromthecrypt
Contributor

on it

@codefromthecrypt
Contributor

There are a lot of errors possible if a file cache is in use and it is closed in the middle of a function call. @abraithwaite is this possible in your case?
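
(For reference, a file cache in wazero is opt-in: it only exists if the embedder wires one up explicitly, roughly like the sketch below; the cache directory is just an example.)

ctx := context.Background()
cache, err := wazero.NewCompilationCacheWithDir("/tmp/wazero-cache") // example directory
if err != nil {
	log.Fatal(err)
}
defer cache.Close(ctx)
r := wazero.NewRuntimeWithConfig(ctx, wazero.NewRuntimeConfig().WithCompilationCache(cache))
defer r.Close(ctx)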

@codefromthecrypt
Contributor

#1537 reproduces the panic

@abraithwaite
Contributor Author

There are a lot of errors possible if a file cache is in use and it is closed in the middle of a function call. @abraithwaite is this possible in your case?

We're not using a file cache to my knowledge, unless one is created/set by default internally by wazero for something.

@codefromthecrypt
Contributor

Thanks @abraithwaite ... hmm, I was unable to trigger the panic except by yanking the cache. Any chance you can share any code that may be custom around the behavior below:

This only appears to happen when we're recreating the runtime (calling the Close methods), not on first instantiation, or during routine operation.

@abraithwaite
Contributor Author

abraithwaite commented Jun 27, 2023

Okay, looking at the stack trace and the fix in #1535, I think I understand what's going on. I don't believe it has to do with any filesystem cache.

Our setup has hardly changed at all from this example I made a little while back:

abraithwaite@ca10894d

What I see in #1535 is a change that, when closing an engine, we're resetting the parent references of the functions to nil.

I believe when we deployed this, it's panicking because we've still got a function running. Specifically, the start routine of the python interpreter, which doesn't and shouldn't return until close is called (at least that's my understanding of how it's implemented now).

When we hit the exit-code check in execWasmFunction (per the stack trace above), the error is sys.ExitCode(0).

So the panic is triggered because err != nil despite it being exit code 0 (I understand the reason for this, just context for the next bit).

After that panic triggers, we enter the deferred recover routine on line 739:

// We ensure that this Call method never panics as
// this Call method is indirectly invoked by embedders via store.CallFunction,
// and we have to make sure that all the runtime errors, including the one happening inside
// host functions, will be captured as errors, not panics.
defer func() {
	err = ce.deferredOnCall(ctx, m, recover())
	if err == nil {
		// If the module closed during the call, and the call didn't err for another reason, set an ExitError.
		err = m.FailIfClosed()
	}
}()

The deferredOnCall routine attempts to call fn.definition on line 831, which dereferences the function's parent (the compiled module); that parent is now nil because it was set to nil when the engine was first closed. This triggers the second panic, the nil pointer dereference.

This is our run and our close code:

func (w WASM) Run() error {
	var err error
	_, err = w.runtime.InstantiateModule(w.modCtx, w.module, w.config)
	if errors.Is(err, sys.NewExitError(sys.ExitCodeContextCanceled)) {
		return nil
	}
	return err
}

func (w *WASM) Close(ctx context.Context) error {
	w.modCancel()
	w.stdin.Close()
	w.stdout.Close()
	return w.runtime.Close(ctx)
}

We're setting WithCloseOnContextDone(true) on the runtime config.

Perhaps canceling the context then closing the runtime is causing a double close unintentionally? Also something I don't understand is what's actually happening in the close function for the engine. Is setting everything to nil really all that's necessary to stop the engine?

// Close implements the same method as documented on wasm.Engine.
func (e *engine) Close() (err error) {
	e.mux.Lock()
	defer e.mux.Unlock()
	// Releasing the references to compiled codes including the memory-mapped machine codes.
	for i := range e.codes {
		for j := range e.codes[i].functions {
			e.codes[i].functions[j].parent = nil
		}
	}
	e.codes = nil
	return
}

Finally, I see that ensureTermination is actually referenced twice inside call:

if ce.ensureTermination {
	select {
	case <-ctx.Done():
		// If the provided context is already done, close the call context
		// and return the error.
		m.CloseWithCtxErr(ctx)
		return nil, m.FailIfClosed()
	default:
	}
}

and here:

if ce.ensureTermination {
	done := m.CloseModuleOnCanceledOrTimeout(ctx)
	defer done()
}

For my own understanding, why is that?

Thanks again for the great work! 🤩 Love what you all are doing and it's been working pretty great for us. This stuff is so complex, I can't fathom how hard it must be to keep all these stateful things in your head. Best of luck and let me know if I can provide anything else!

@ckaznocha
Contributor

ckaznocha commented Jun 27, 2023

I see that ensureTermination is actually referenced twice inside call … why is that?

Since I'm in the "blame" for parts of this I can help explain. The ce.ensureTermination guard on line 723 checks if the context.Context passed in has already finished. This prevents call from doing any work if it's been called after some caller up the chain has already decided to stop processing. The one on line 749 sets up a cleanup routine.

I'd have to dig back into this bit, but they could probably be consolidated by moving lines 750-751 into the default case of the select on line 730. Without revisiting the code I can't remember if there was some side-effecty thing requiring the done to be set up later, or if this is just ripe for a light refactor.
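
Roughly, the consolidated version could look something like this (a sketch only, reusing the identifiers from the snippets above; untested):

if ce.ensureTermination {
	select {
	case <-ctx.Done():
		// The caller already gave up: close the call context and surface the exit error.
		m.CloseWithCtxErr(ctx)
		return nil, m.FailIfClosed()
	default:
		// Otherwise arm the cleanup that closes the module if ctx is canceled
		// or times out while the call is still in flight.
		done := m.CloseModuleOnCanceledOrTimeout(ctx)
		defer done()
	}
}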

@evacchi
Contributor

evacchi commented Jun 29, 2023

I suspect the issue is that the function that is left running is never given a chance to verify that the module has closed. In fact, I could verify this with a quick and dirty change:

if I modify execWasmFunction by ensuring that ExitCode 0 is ignored:

			case builtinFunctionIndexCheckExitCode:
				// Note: this operation must be done in Go, not native code. The reason is that
				// native code cannot be preempted and that means it can block forever if there are not
				// enough OS threads (which we don't have control over).
				if err := m.FailIfClosed(); err != nil && err != sys.NewExitError(0) {
					panic(err)
				}

and then I add an extra check above:

		case nativeCallStatusCodeCallGoHostFunction:
			if closed := atomic.LoadUint64(&m.Closed); closed != 0 {
				break
			}
                        // leave the rest unmodified

then the reproducer at abraithwaite@ca10894d won't fail. However, this is not really a fix, because it introduces an extra atomic load at every single host function call.

🤔

@abraithwaite
Contributor Author

To be clear, the example I linked with the race is/was a different issue. It's not directly relevant to this issue, I merely included it to show the configuration and setup of our runtime environment.

@evacchi
Contributor

evacchi commented Jun 29, 2023

oh I see. If that's representative at all (it does panic) it may be enough to apply this small patch evacchi@f319648

it appears that the function that is running exits as expected (people correct me if I am wrong, but the panic() in engine.go seems to be used for unwinding the stack) but when the panic is recovered() the function instance state is inconsistent because it's been closed already. Since the cleanup has already happened, I figure it's ok to just break out of the loop.
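
(In isolation, the unwinding pattern being described is roughly this; a standalone sketch, not wazero's actual code:)

package main

import (
	"errors"
	"fmt"
)

// execWasmFunction stands in for the engine loop, which uses panic to unwind
// the native call stack when the module has been closed with an exit code.
func execWasmFunction() {
	panic(errors.New("module closed with exit_code(0)"))
}

// call converts the panic raised deep inside the engine back into a regular
// error, mirroring how callEngine.call recovers in its deferred function.
func call() (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("unwound: %v", r)
		}
	}()
	execWasmFunction()
	return nil
}

func main() {
	fmt.Println(call()) // prints: unwound: module closed with exit_code(0)
}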

@evacchi
Contributor

evacchi commented Jun 29, 2023

I think this may be related to #1546 -- it is not exactly the same issue, though, because evacchi@f319648 doesn't fix it.

@abraithwaite
Contributor Author

oh I see. If that's representative at all (it does panic) it may be enough to apply this small patch evacchi@f319648

We'll try this patch. Looks as if there shouldn't be any major side effects except maybe slightly reduced debuggability, is that right?

@evacchi
Contributor

evacchi commented Jun 30, 2023

not really, the error you were seeing was a side effect of #1535: because the fields were cleared but the function was still running, in some specific cases (like yours on the Python interpreter) the method function.definition() would be invoked but the underlying function.parent would be nil, causing a panic. That small patch just avoids the panic and returns immediately; my expectation is that in that case the function is about to terminate anyway.

@evacchi
Contributor

evacchi commented Jun 30, 2023

@abraithwaite I think I figured out what the issue really is. You don't need that patch, but you need to ensure that the module is shut down safely. What is happening, e.g. in the example abraithwaite@ca10894d, is that you have a blocking read on stdin, and you can't just close the runtime abruptly, otherwise you will be clearing some internal fields while a function is potentially still running.

What you really want to do is:

  • cancel the context
  • wait for the context to be actually Done()
  • close the file handles that are currently open so that you give the chance to the blocking read/writes to unblock
  • wait for the runtime to exit cleanly (with an exit code)
  • close the runtime

so, I added to the WASM struct a done chan struct{} that you will close on Run():

func (w *WASM) Run() error {
	var err error
	w.moduleInstance, err = w.runtime.InstantiateModule(w.modCtx, w.module, w.config)
	close(w.done)
	if errors.Is(err, sys.NewExitError(sys.ExitCodeContextCanceled)) {
		return nil
	}
	return err
}

then, your close routine becomes:

func (w *WASM) Close(ctx context.Context) error {
	w.modCancel() // cancel the context
	<-w.modCtx.Done() // wait for Done()
	w.stdin.Close() // close the file handles 
	w.stdout.Close()
	<-w.done // wait until the module terminates cleanly 
	return w.runtime.Close(ctx) // close the runtime
}
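
(One assumption I'm making about your constructor, sketched below: done has to be allocated before Run() starts, because close(w.done) panics on a nil channel and <-w.done would block forever.)

type WASM struct {
	// ... your existing fields (runtime, module, config, modCtx, modCancel, stdin, stdout) ...
	moduleInstance api.Module // github.com/tetratelabs/wazero/api
	done           chan struct{}
}

func NewWASM( /* existing args */ ) *WASM {
	w := &WASM{ /* existing initialization */ }
	w.done = make(chan struct{})
	return w
}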

for an example look here https://github.com/evacchi/wazero/pull/24/files

let me know if I understood everything correctly and this works :)

@abraithwaite
Contributor Author

abraithwaite commented Jun 30, 2023

Hey @evacchi, really appreciate the feedback and advice! I'm curious if and how this will help with the memory leak mentioned in the OP.

Does closing the module cleanly properly free the resources?

Just trying to reconcile your advice with the original reproducing code I gave in the OP:

https://github.com/runreveal/wazero_memleak/blob/main/main.go

I don't see how this much simpler example code has any issues, or I'm clearly missing something. I added context cancellation to match what you've given and it's still leaking memory. There are no file handles to close in the example I've given.

@evacchi
Contributor

evacchi commented Jun 30, 2023

wait, I thought that #1535 solved the memory issues, and now you had panics for that nil reference, I was trying to address that part of the problem :P

EDIT: to be more precise, the issue re: the nil reference should be specifically related to closing the runtime before the function has terminated; so you still have to wait on, say, a done channel before closing. That I think should fix the panic.

@abraithwaite
Contributor Author

Nope, apologies for the confusion. I try to stay pretty strict about staying on topic, but this one involved a lot of context in the code. I was talking about the panic not because the panic itself was a problem, but because I believed that the panic meant that some memory wasn't being freed properly.

The panic only happened after #1535, so I mentioned it because I don't believe it's the right fix.

@abraithwaite
Contributor Author

abraithwaite commented Jul 1, 2023

Gonna call this one resolved! Thanks everybody! And huge props to @inkeliz and @evacchi for fixing the bug and helping me understand the panic afterwards. 🚢

[go-health screenshot]

Looks like we may have another, more minor leak happening, but the big obvious one is gone now. Much appreciated!

@abraithwaite
Contributor Author

Related: #1600 #1608
