Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NRG: Wait for goroutines to shutdown when recreating group #5832

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

neilalexander
Copy link
Member

@neilalexander neilalexander commented Aug 26, 2024

This should fix some logical races where multiple sets of goroutines can fight over the same store directory, i.e. when shutting down and recreating a group rapidly.

Signed-off-by: Neil Twigg [email protected]

@neilalexander neilalexander marked this pull request as ready for review August 26, 2024 22:22
@neilalexander neilalexander requested a review from a team as a code owner August 26, 2024 22:22
@neilalexander neilalexander changed the title NRG: Wait for goroutines on shutdown NRG (2.11): Wait for goroutines on shutdown Aug 27, 2024
@neilalexander neilalexander changed the title NRG (2.11): Wait for goroutines on shutdown NRG (2.11): Wait for goroutines on shutdown, hold JS lock on group create Sep 2, 2024
@neilalexander neilalexander marked this pull request as draft September 2, 2024 18:41
@neilalexander neilalexander changed the title NRG (2.11): Wait for goroutines on shutdown, hold JS lock on group create NRG (2.11): Wait for goroutines to shutdown when recreating group Sep 18, 2024
@neilalexander neilalexander force-pushed the neil/nrgwg branch 6 times, most recently from 61b9a2e to 3cc6288 Compare September 18, 2024 10:48
@neilalexander neilalexander marked this pull request as ready for review September 18, 2024 12:10
@neilalexander neilalexander changed the title NRG (2.11): Wait for goroutines to shutdown when recreating group NRG: Wait for goroutines to shutdown when recreating group Sep 20, 2024
@neilalexander
Copy link
Member Author

Have rebased, this one is ready for review.

if node := s.lookupRaftNode(rg.Name); node != nil {
if node.State() == Closed {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like there may be a better approach here.

Also I think we need to check that the peer lists are the same if we are going to re-use.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to return a closed node here?

// Wait for goroutines to shut down before trying to clean up
// the log below, otherwise we might end up deleting the store
// from underneath running goroutines.
n.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have these patterns for sure elsewhere but I try hard to avoid them if at all possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can avoid it here, it's the only way we can potentially avoid a deadlock if the runAs goroutine is finishing up processing some operations.

This should fix some logical races where multiple sets of goroutines
can fight over the same store directory, i.e. when shutting down and
recreating a group rapidly.

Signed-off-by: Neil Twigg <[email protected]>
if node := s.lookupRaftNode(rg.Name); node != nil {
if node.State() == Closed {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to return a closed node here?

@@ -1728,12 +1731,17 @@ func (n *raft) Stop() {
n.shutdown(false)
}

func (n *raft) WaitForStop() {
n.wg.Wait()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check that the state is set to stopped to avoid calling in here and never returning?

@@ -1744,12 +1752,12 @@ func (n *raft) shutdown(shouldDelete bool) {
if n.state.Swap(int32(Closed)) == int32(Closed) {
// If we get called again with shouldDelete, in case we were called first with Stop() cleanup
if shouldDelete {
n.wg.Wait()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we hold the lock here this may never return.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants