Proposal: Avoid blocking sends to a channel without timeout #131
Oof, I hope not! I can't speak to specific packages that may serve specific purposes for you, but in general, it should be very rare that you make a nonblocking send on a channel. Go already gives you async where it matters, on syscalls; you want your application code to be nice and synchronous, i.e. deterministic, so that it takes advantage of the runtime. In the hopefully rare cases you use channels to communicate between components, they should almost always be blocking sends and recvs on chans with capacity zero. Buffered channels are usually queues, and queues exist solely to solve burstiness. That means they belong at the edges of a system, not within it 😇
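The blocking, capacity-zero style described above might look like this minimal sketch (the function and channel names are illustrative, not from any real package):

```go
package main

import "fmt"

// synchronousHandoff demonstrates a blocking send/recv pair on an
// unbuffered channel: the sending goroutine blocks until the receiver
// is ready, so the two goroutines rendezvous deterministically.
func synchronousHandoff() int {
	results := make(chan int) // capacity zero: the send blocks until a receive
	go func() {
		results <- 42 // blocks here until the recv below runs
	}()
	return <-results
}

func main() {
	fmt.Println(synchronousHandoff())
}
```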
This issue isn't to indicate that all sends must be non-blocking, but that blocking sends probably should have a fallback, e.g., context cancellation or timeout.
When a chan send occurs in a context-cancelable context 😉 I agree, you want that cancelation respected. But I don't think it's correct to claim blocking sends should have a fallback/timeout generally.
It's true that there's not really a "one-size-fits-all" approach, but I don't think the issue is whether the send is sync/async, or whether channel usage is error-prone: the issue is that it isn't defensive - it's fragile - unless you add an unblocking strategy, and channel capacity is not an unblocking strategy. In any event, I would argue that one should have a good reason for not having any control mechanisms in place.
Absolutely true! A rule like "Don't add capacity to a channel without knowing exactly why" would be 😎
I'm not sure the rule generalizes in the way you're expressing it here. A goroutine blocked in a chan send, or recv, or select statement, isn't by itself a problem — it's only a problem if that introduces deadlocks, races, etc. as a consequence. And if that can happen, adding a timeout to the select statement isn't any better than adding capacity to the channel — both just paper over the underlying structural errors in the code. To be clear, I get the idea you're trying to convey, here. I tend to express it slightly differently, not in terms of the channel operations, but in terms of goroutine lifecycle management. From my own style guide —
The key points here being the first two. And just for context, so it doesn't look like I'm swooping in here with all these unsolicited complaints 😅 — the Uber style guide is frequently referenced as a source of truth among the Gophers I help and mentor and stuff, so I'm somewhat motivated to nudge the (few!) points that in my experience tend to lead those folks down suboptimal paths.
They're good points overall. Like anything, there are exceptions, but in common cases these hold true. This is effectively the spirit in which this proposal is intended.
I think that's a situational assessment, and is predicated on the effect that it has on the rest of the program relative to that behavior (i.e., it could either be a feature or a defect, depending on how you're rationalizing it). If the effects of said blocking are insulated from the rest of the program, or if transitive backpressure is intended, then yeah, I agree with you that it's not an issue in the same sense as I'm describing. That argument would not hold for situations with leaks, obviously. However, we should avoid assuming specialized application behaviors/contracts/semantics/etc, so I think that we need to treat those "situationally okay" scenarios as exceptions, rather than the common case (hence taking a more general stance that "a bug is the possibility of defective behavior, regardless of how many times it manifests" rather than "it's not a bug if we haven't observed it [yet]"), and observing that a lot of the time, basic flow management or other synchronization tools reduce the possibility of various related bugs. (We definitely can't cover every possible related bug, and there is certainly some amount of RTFM required because folks should understand the implications of what they're doing, but I'm personally of the opinion that it's good to document behaviors that encourage thoughtfulness, clear intent, and consistently nominal behavior, which I hope that this falls into.)
We appreciate the feedback! Glad to hear that folks are (hopefully) finding it helpful, and we're also quite motivated to ensure not only that it's as useful to others as it is for us, but also that it evolves as we (and the community) do.
Sure! But a goroutine blocked in a chan send, recv, or select does not by itself constitute the possibility of defective behavior, any more than does an integer with a value of 0, say. The possibility of deadlock, or a goroutine leak, or whatever else is the defective behavior, and that requires more than just the one goroutine's code to assess. Similarly, a function that uses that integer as a denominator is required for its 0 value to be worrisome. The thing I'm guarding against here is the addition of timeout cases to channel operations that serve no purpose. Today I wrote some code that selected on one of four channels; three were errors that may or may not come, and one was the result of an operation that would take an indeterminate amount of time. The function containing that select statement is meant to block until one of those results has manifested. It would be incorrect to add a timeout to the select, or to thread a context through, in this situation. But this rule would suggest that my code is faulty. Or, even easier: consider signal.Notify — same semantics! Sometimes goroutines can block indefinitely. And it's not a rare situation: the Go runtime is optimized for exactly this situation. So it's not necessarily a problem. Right?
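A simplified two-channel version of the scenario described above (names and the sleep duration are illustrative) — a function whose contract is to block until a result or an error manifests, so a timeout case would be incorrect:

```go
package main

import (
	"fmt"
	"time"
)

// await is *meant* to block indefinitely until one of the channels
// fires; adding a timeout case here would change the function's
// contract, not make it more robust.
func await(result <-chan string, errc <-chan error) (string, error) {
	select {
	case r := <-result:
		return r, nil
	case err := <-errc:
		return "", err
	}
}

func main() {
	result := make(chan string)
	errc := make(chan error)
	go func() {
		time.Sleep(10 * time.Millisecond) // stand-in for indeterminate work
		result <- "done"
	}()
	r, err := await(result, errc)
	fmt.Println(r, err)
}
```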
This is essentially the crux of the issue. I'm not suggesting that we add channel dogma that should be followed in all situations, but rather to be sensitive to context/semantics and provide reasonable heuristics for when you probably should (and if there are any overwhelmingly common cases, call them out as examples). Agreed that it doesn't make sense to "just add timeouts" everywhere because that itself somehow magically makes channel usage "resilient" or "correct", because it doesn't. To that end, whatever guidance we add should probably be something to the effect of: "if your channel send/recv should not block for an unknown amount of time (potentially forever), add a timeout/etc case". (TBD, of course; we're not in a hurry to rush any guidance.) |
Semi-related to "Channel Size is One or None"
We've seen cases of deadlocks/unexpected blocking caused by blocking sends to a channel (example: tchannel-go#528). It's generally much safer to use a select with another case for a timeout / ctx.Done(), or a non-blocking send. There are some use-cases where it's safe; typically when the number of writers is bounded by the channel size and can be verified easily, so not sure how generally applicable this is.
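The non-blocking send mentioned above is a select with a default case, so the send either succeeds immediately or is abandoned; a sketch with an illustrative helper name:

```go
package main

import "fmt"

// trySend performs a non-blocking send: the default case fires
// whenever no receiver is ready and the buffer (if any) is full,
// and the function reports whether the value was delivered.
func trySend(ch chan<- int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan int, 1)
	fmt.Println(trySend(ch, 1)) // buffer has room: delivered
	fmt.Println(trySend(ch, 2)) // buffer full, no receiver: dropped
}
```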