Memory Leak when Recompiling Modules and Reinstantiating Runtimes #1534

Closed
abraithwaite opened this issue Jun 23, 2023 · 22 comments
Labels
bug Something isn't working

Comments

@abraithwaite
Contributor

Describe the bug
We are recompiling modules frequently because our users may change their code.
We're also reinstantiating runtimes as a way to easily clear the module cache and start with a "clean slate" (we think/thought) for isolation purposes.
We're using the python interpreter guest.
We're closing all references to all wazero structs before reinstantiating things.
The memory continues to grow between each close/create cycle.

To Reproduce
Reproducible; just run main.go:
https://github.com/runreveal/wazero_memleak
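
For reference, the cycle looks roughly like this (a minimal sketch, not the linked reproducer verbatim; "guest.wasm" is a placeholder for the module we actually load, and WASI/host-module setup is omitted):

package main

import (
	"context"
	"log"
	"os"

	"github.com/tetratelabs/wazero"
)

func main() {
	ctx := context.Background()
	wasmBytes, err := os.ReadFile("guest.wasm") // placeholder for the real guest module
	if err != nil {
		log.Fatal(err)
	}
	for {
		r := wazero.NewRuntime(ctx) // compiler mode by default on amd64/arm64
		compiled, err := r.CompileModule(ctx, wasmBytes)
		if err != nil {
			log.Fatal(err)
		}
		mod, err := r.InstantiateModule(ctx, compiled, wazero.NewModuleConfig())
		if err != nil {
			log.Fatal(err)
		}
		// Close everything we hold before the next cycle; despite this, the
		// process RSS keeps growing between iterations.
		_ = mod.Close(ctx)
		_ = compiled.Close(ctx)
		_ = r.Close(ctx)
	}
}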

Expected behavior
We expect memory to not grow when we're not holding any references to wazero objects any longer.

Environment (please complete the relevant information):

  • Go version: 1.20.5
  • wazero Version: v1.2.1
  • Host architecture: amd64, arm64 (m2)
  • Runtime mode: compiler mode
@abraithwaite abraithwaite added the bug Something isn't working label Jun 23, 2023
@inkeliz
Contributor

inkeliz commented Jun 25, 2023

I was able to reproduce the issue, ran some pprof profiles, and took a look at the code. My suspicion was: "some mmap is leaking", and that seems to be the case, but it was odd, because wazero uses runtime.SetFinalizer.

However, releaseCompiledModule is never called. So I tried to hunt down who is holding the reference, and made a hack to remove the "inner reference".

Please, apply that change and see if that fix the issue: 7367c2e

Patch: https://github.com/tetratelabs/wazero/commit/7367c2ea24b3127280bf7210be71cb972b55aa00.patch

I noticed the following:

	compiledModule struct {
		executable        asm.CodeSegment
		functions         []compiledFunction
		source            *wasm.Module
		ensureTermination bool
	}

If you take a look at []compiledFunction, each compiledFunction has:

compiledFunction struct {
		executableOffset uintptr
		stackPointerCeil uint64

		index           wasm.Index
		goFunc          interface{}
		listener        experimental.FunctionListener

		parent          *compiledModule                          ////  << HERE?!

		sourceOffsetMap sourceOffsetMap
	}

Each compiledFunction holds a pointer back to its compiledModule. So, I GUESS, that is the reason for the leak: the compiledModule is always referenced.

That is not the perfect solution. Someone needs to find out why compiledFunction is still referenced, because that might be the real cause of the failure to release: if compiledFunction is not released, then compiledModule isn't either. My change doesn't fix the first problem, just the second one (which almost fixes the memory leak).
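
To illustrate the mechanism in isolation (a standalone sketch, not wazero's code): runtime.SetFinalizer only fires once the object is unreachable, so any compiledFunction that stays reachable keeps its parent compiledModule, and therefore the munmap in the finalizer, from ever running:

package main

import (
	"fmt"
	"runtime"
	"time"
)

// Hypothetical stand-ins for the compiledModule/compiledFunction relationship.
type module struct{ funcs []function }
type function struct{ parent *module }

func main() {
	var retained []function // simulates engine-level state that outlives the module

	m := &module{}
	runtime.SetFinalizer(m, func(*module) {
		fmt.Println("finalizer ran: the mmapped code segment would be released here")
	})
	m.funcs = []function{{parent: m}}

	retained = append(retained, m.funcs...)
	m = nil // drop the "outer" reference, as closing the wazero objects would

	runtime.GC()
	time.Sleep(100 * time.Millisecond)
	// The finalizer never runs: retained[0].parent keeps the module reachable.
	fmt.Println("functions still retained:", len(retained))
}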

@abraithwaite
Contributor Author

This is great! Thank you!! 🙏

Testing with our contrived test seems to have a stable memory profile and runtime! 🎉

@abraithwaite
Contributor Author

abraithwaite commented Jun 26, 2023

Grrr, fat fingered "enter"...

ANYWAY, I was going to say that we tried it in our environment and actually started getting panics for what seems like a related reference.

Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: panic: module closed with exit_code(0) [recovered]
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         panic: runtime error: invalid memory address or nil pointer dereference
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0xf0ac76]
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: goroutine 2911 [running]:
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/tetratelabs/wazero/internal/engine/compiler.(*function).definition(...)
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/golang/pkg/mod/github.com/tetratelabs/[email protected]/internal/engine/compiler/engine.go:699
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/tetratelabs/wazero/internal/engine/compiler.(*callEngine).deferredOnCall(0xc00c5b14a0, {0x3a84a30, 0xc00f8eb1d0}, 0xc0154e7110?, {0x112a860, 0x56c7f94})
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/golang/pkg/mod/github.com/tetratelabs/[email protected]/internal/engine/compiler/engine.go:831 +0x196
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/tetratelabs/wazero/internal/engine/compiler.(*callEngine).call.func1()
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/golang/pkg/mod/github.com/tetratelabs/[email protected]/internal/engine/compiler/engine.go:739 +0x6e
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: panic({0x112a860, 0x56c7f94})
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /opt/homebrew/Cellar/go/1.20.5/libexec/src/runtime/panic.go:884 +0x213
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/tetratelabs/wazero/internal/engine/compiler.(*callEngine).execWasmFunction(0xc00c5b14a0, {0x3a84a30, 0xc00f8eb1d0}, 0xf0?)
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/golang/pkg/mod/github.com/tetratelabs/[email protected]/internal/engine/compiler/engine.go:1056 +0x425
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/tetratelabs/wazero/internal/engine/compiler.(*callEngine).call(0xc00c5b14a0, {0x3a84a30, 0xc00f8eb1d0}, {0x0?, 0x0?, 0xc00ff00820?}, {0x0, 0x0, 0x0})
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/golang/pkg/mod/github.com/tetratelabs/[email protected]/internal/engine/compiler/engine.go:754 +0x2e5
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/tetratelabs/wazero/internal/engine/compiler.(*callEngine).Call(0xc00c5b13b0?, {0x3a84a30?, 0xc00f8eb1d0?}, {0x0?, 0x3a84a30?, 0xc00f8eb1d0?})
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/golang/pkg/mod/github.com/tetratelabs/[email protected]/internal/engine/compiler/engine.go:708 +0xd9
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/tetratelabs/wazero.(*runtime).InstantiateModule(0xc00ff20fc0, {0x3a84a30, 0xc00f8eb1d0}, {0x3a8b550?, 0xc00fde8200}, {0x3a92620?, 0xc00fc78ff0})
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/golang/pkg/mod/github.com/tetratelabs/[email protected]/runtime.go:333 +0x41c
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/runreveal/runreveal/internal/wasm/python.WASM.Run({{0x3a8b550, 0xc00fde8200}, {0x3a8c2d8, 0xc00ff20fc0}, {0x3a92620, 0xc00fc78ff0}, {{0x3a7d0e0, 0xc00d76bfe0}}, {{0x3a7d0c0, 0xc00d76bfe8}}, ...})
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/github.com/runreveal/runreveal/internal/wasm/python/wasm.go:180 +0x4f
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: github.com/runreveal/runreveal/internal/transforms.(*TransformEngine).loadTransforms.func1()
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/github.com/runreveal/runreveal/internal/transforms/transforms.go:187 +0xa5
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]: created by github.com/runreveal/runreveal/internal/transforms.(*TransformEngine).loadTransforms
Jun 26 18:15:06 ip-172-30-2-252 rrq[35306]:         /Users/abraithwaite/Projects/github.com/runreveal/runreveal/internal/transforms/transforms.go:186 +0x7b3

This only appears to happen when we're recreating the runtime (calling the Close methods), not on first instantiation, or during routine operation.

@abraithwaite abraithwaite reopened this Jun 26, 2023
@codefromthecrypt
Contributor

on it

@codefromthecrypt
Contributor

There are a lot of errors possible if a file cache is in use and it is closed in the middle of a function call. @abraithwaite is this possible in your case?
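
(For reference, a file cache in wazero is opt-in: it only exists if the embedder wires one up explicitly, roughly like the sketch below; the cache directory is just an example.)

ctx := context.Background()
cache, err := wazero.NewCompilationCacheWithDir("/tmp/wazero-cache") // example directory
if err != nil {
	log.Fatal(err)
}
defer cache.Close(ctx)
r := wazero.NewRuntimeWithConfig(ctx, wazero.NewRuntimeConfig().WithCompilationCache(cache))
defer r.Close(ctx)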

@codefromthecrypt
Contributor

#1537 reproduces the panic

@abraithwaite
Contributor Author

There are a lot of errors possible if a file cache is in use and it is closed in the middle of a function call. @abraithwaite is this possible in your case?

We're not using a file cache to my knowledge, unless one is created/set by default internally by wazero for something.

@codefromthecrypt
Contributor

Thanks @abraithwaite ... hmm, I was unable to trigger the panic except by yanking the cache. Any chance you can share any code that may be custom around the behavior below:

This only appears to happen when we're recreating the runtime (calling the Close methods), not on first instantiation, or during routine operation.

@abraithwaite
Contributor Author

abraithwaite commented Jun 27, 2023

Okay, looking at the stack trace and the fix in #1535, I think I understand what's going on. I don't believe it has to do with any filesystem cache.

Our setup has hardly changed at all from this example I made a little while back:

abraithwaite@ca10894d

What I see in #1535 is a change that, when closing an engine, we're resetting the parent references of the functions to nil.

I believe when we deployed this, it's panicking because we've still got a function running. Specifically, the start routine of the python interpreter, which doesn't and shouldn't return until close is called (at least that's my understanding of how it's implemented now).

When we hit the exit-code check in execWasmFunction (per the stack trace above), the error is sys.ExitCode(0).

So the panic is triggered because err != nil despite it being exit code 0 (I understand the reason for this, just context for the next bit).

After that panic triggers, we enter the deferred recover routine on line 739:

// We ensure that this Call method never panics as
// this Call method is indirectly invoked by embedders via store.CallFunction,
// and we have to make sure that all the runtime errors, including the one happening inside
// host functions, will be captured as errors, not panics.
defer func() {
	err = ce.deferredOnCall(ctx, m, recover())
	if err == nil {
		// If the module closed during the call, and the call didn't err for another reason, set an ExitError.
		err = m.FailIfClosed()
	}
}()

The deferredOnCall routine attempts to call fn.definition on line 831, which dereferences the function's parent (the compiled module); that parent is now nil because it was set to nil when the engine was first closed. This triggers the second panic, the nil pointer dereference.

This is our run and our close code:

func (w WASM) Run() error {
	var err error
	_, err = w.runtime.InstantiateModule(w.modCtx, w.module, w.config)
	if errors.Is(err, sys.NewExitError(sys.ExitCodeContextCanceled)) {
		return nil
	}
	return err
}

func (w *WASM) Close(ctx context.Context) error {
	w.modCancel()
	w.stdin.Close()
	w.stdout.Close()
	return w.runtime.Close(ctx)
}

We're setting WithCloseOnContextDone(true) on the runtime config.

Perhaps canceling the context then closing the runtime is causing a double close unintentionally? Also something I don't understand is what's actually happening in the close function for the engine. Is setting everything to nil really all that's necessary to stop the engine?

// Close implements the same method as documented on wasm.Engine.
func (e *engine) Close() (err error) {
	e.mux.Lock()
	defer e.mux.Unlock()
	// Releasing the references to compiled codes including the memory-mapped machine codes.
	for i := range e.codes {
		for j := range e.codes[i].functions {
			e.codes[i].functions[j].parent = nil
		}
	}
	e.codes = nil
	return
}

Finally, I see that ensureTermination is actually referenced twice inside call:

if ce.ensureTermination {
	select {
	case <-ctx.Done():
		// If the provided context is already done, close the call context
		// and return the error.
		m.CloseWithCtxErr(ctx)
		return nil, m.FailIfClosed()
	default:
	}
}

and here:

if ce.ensureTermination {
	done := m.CloseModuleOnCanceledOrTimeout(ctx)
	defer done()
}

For my own understanding, why is that?

Thanks again for the great work! 🤩 Love what you all are doing and it's been working pretty great for us. This stuff is so complex, I can't fathom how hard it must be to keep all these stateful things in your head. Best of luck and let me know if I can provide anything else!

@ckaznocha
Contributor

ckaznocha commented Jun 27, 2023

I see that ensureTermination is actually referenced twice inside call … why is that?

Since I'm in the "blame" for parts of this I can help explain. The ce.ensureTermination guard on line 723 checks if the context.Context passed in has already finished. This prevents call from doing any work if it's been called after some caller up the chain has already decided to stop processing. The one on line 749 sets up a cleanup routine.

I'd have to dig back into this bit, but they could probably be consolidated by moving lines 750-751 into the default case of the select on line 730. Without revisiting the code I can't remember if there was some side-effecty thing requiring the done to be set up later, or if this is just ripe for a light refactor.
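
Roughly, the consolidated version could look something like this (a sketch only, reusing the identifiers from the snippets above; untested):

if ce.ensureTermination {
	select {
	case <-ctx.Done():
		// The caller already gave up: close the call context and surface the exit error.
		m.CloseWithCtxErr(ctx)
		return nil, m.FailIfClosed()
	default:
		// Otherwise arm the cleanup that closes the module if ctx is canceled
		// or times out while the call is still in flight.
		done := m.CloseModuleOnCanceledOrTimeout(ctx)
		defer done()
	}
}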

@evacchi
Contributor

evacchi commented Jun 29, 2023

I suspect the issue is that the function that is left running is never given a chance to verify that the module has closed. In fact, I could verify this with a quick and dirty change:

if I modify execWasmFunction by ensuring that ExitCode 0 is ignored:

			case builtinFunctionIndexCheckExitCode:
				// Note: this operation must be done in Go, not native code. The reason is that
				// native code cannot be preempted and that means it can block forever if there are not
				// enough OS threads (which we don't have control over).
				if err := m.FailIfClosed(); err != nil && err != sys.NewExitError(0) {
					panic(err)
				}

and then I add an extra check above:

		case nativeCallStatusCodeCallGoHostFunction:
			if closed := atomic.LoadUint64(&m.Closed); closed != 0 {
				break
			}
                        // leave the rest unmodified

then the reproducer at abraithwaite@ca10894d won't fail. However, this is not really a fix, because it introduces an extra atomic load at every single host function call.

🤔

@abraithwaite
Contributor Author

To be clear, the example I linked with the race is/was a different issue. It's not directly relevant to this issue, I merely included it to show the configuration and setup of our runtime environment.

@evacchi
Contributor

evacchi commented Jun 29, 2023

oh I see. If that's representative at all (it does panic) it may be enough to apply this small patch evacchi@f319648

it appears that the function that is running exits as expected (people correct me if I am wrong, but the panic() in engine.go seems to be used for unwinding the stack) but when the panic is recovered() the function instance state is inconsistent because it's been closed already. Since the cleanup has already happened, I figure it's ok to just break out of the loop.
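
(In isolation, the unwinding pattern being described is roughly this; a standalone sketch, not wazero's actual code:)

package main

import (
	"errors"
	"fmt"
)

// execWasmFunction stands in for the engine loop, which uses panic to unwind
// the native call stack when the module has been closed with an exit code.
func execWasmFunction() {
	panic(errors.New("module closed with exit_code(0)"))
}

// call converts the panic raised deep inside the engine back into a regular
// error, mirroring how callEngine.call recovers in its deferred function.
func call() (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("unwound: %v", r)
		}
	}()
	execWasmFunction()
	return nil
}

func main() {
	fmt.Println(call()) // prints: unwound: module closed with exit_code(0)
}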

@evacchi
Contributor

evacchi commented Jun 29, 2023

I think this may be related to #1546 -- it is not exactly the same issue, though, because evacchi@f319648 doesn't fix it.

@abraithwaite
Contributor Author

oh I see. If that's representative at all (it does panic) it may be enough to apply this small patch evacchi@f319648

We'll try this patch. Looks as if there shouldn't be any major side effects except maybe slightly reduced debuggability, is that right?

@evacchi
Contributor

evacchi commented Jun 30, 2023

not really, the error you were seeing was a side effect of #1535: because the fields were cleared but the function was still running, in some specific cases (like yours on the Python interpreter) the method function.definition() would be invoked but the underlying function.parent would be nil, causing a panic. That small patch just avoids the panic and returns immediately; my expectation is that in that case the function is about to terminate anyway.

@evacchi
Contributor

evacchi commented Jun 30, 2023

@abraithwaite I think I figured out what the issue really is. You don't need that patch, but you need to ensure that the module is shut down safely. What is happening, e.g. in the example abraithwaite@ca10894d, is that you have a blocking read on stdin, and you can't just close the runtime abruptly, otherwise you will be clearing some internal fields while a function is potentially still running.

What you really want to do is:

  • cancel the context
  • wait for the context to be actually Done()
  • close the file handles that are currently open so that you give the chance to the blocking read/writes to unblock
  • wait for the runtime to exit cleanly (with an exit code)
  • close the runtime

so, I added to the WASM struct a done chan struct{} that you will close on Run():

func (w *WASM) Run() error {
	var err error
	w.moduleInstance, err = w.runtime.InstantiateModule(w.modCtx, w.module, w.config)
	close(w.done)
	if errors.Is(err, sys.NewExitError(sys.ExitCodeContextCanceled)) {
		return nil
	}
	return err
}

then, your close routine becomes:

func (w *WASM) Close(ctx context.Context) error {
	w.modCancel() // cancel the context
	<-w.modCtx.Done() // wait for Done()
	w.stdin.Close() // close the file handles 
	w.stdout.Close()
	<-w.done // wait until the module terminates cleanly 
	return w.runtime.Close(ctx) // close the runtime
}
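
(One assumption I'm making about your constructor, sketched below: done has to be allocated before Run() starts, because close(w.done) panics on a nil channel and <-w.done would block forever.)

type WASM struct {
	// ... your existing fields (runtime, module, config, modCtx, modCancel, stdin, stdout) ...
	moduleInstance api.Module // github.com/tetratelabs/wazero/api
	done           chan struct{}
}

func NewWASM( /* existing args */ ) *WASM {
	w := &WASM{ /* existing initialization */ }
	w.done = make(chan struct{})
	return w
}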

for an example look here https://github.com/evacchi/wazero/pull/24/files

let me know if I understood everything correctly and this works :)

@abraithwaite
Contributor Author

abraithwaite commented Jun 30, 2023

Hey @evacchi, really appreciate the feedback and advice! I'm curious if and how this will help with the memory leak mentioned in the OP.

Does closing the module cleanly properly free the resources?

Just trying to reconcile your advice with the original reproducing code I gave in the OP:

https://github.com/runreveal/wazero_memleak/blob/main/main.go

I don't see how this much simpler example code has any issues, or I'm clearly missing something. I added context cancellation to match what you've given and it's still leaking memory. There are no file handles to close in the example I've given.

@evacchi
Contributor

evacchi commented Jun 30, 2023

wait, I thought that #1535 solved the memory issues, and now you had panics for that nil reference, I was trying to address that part of the problem :P

EDIT: to be more precise, the issue re: the nil reference should be specifically related to closing the runtime before the function has terminated; so you still have to wait on, say, a done channel before closing. That I think should fix the panic.

@abraithwaite
Contributor Author

Nope, apologies for the confusion. I try to stay pretty strict about staying on topic, but this one involved a lot of context in the code. I was talking about the panic not because the panic itself was a problem, but because I believed that the panic meant that some memory wasn't being freed properly.

The panic only happened after #1535, so I mentioned it because I don't believe it's the right fix.

@abraithwaite
Contributor Author

abraithwaite commented Jul 1, 2023

Gonna call this one resolved! Thanks everybody! And huge props to @inkeliz and @evacchi for fixing the bug and helping me understand the panic afterwards. 🚢

[go-health screenshot]

Looks like we may have another, more minor leak happening, but the big obvious one is gone now. Much appreciated!

@abraithwaite
Contributor Author

Related: #1600 #1608
