Don't pause processing when send_local_response fails #423

Open
wants to merge 1 commit into
base: main

krinkinmu

For context see the Envoy issue envoyproxy/envoy#28826. Here is a shorter summary:

  1. A wasm plugin calls proxy_send_local_response from both onRequestHeaders and onResponseHeaders
  2. When proxy_send_local_response is called from onRequestHeaders, it triggers a local reply, and that reply goes through the filter chain in Envoy
  3. The same plugin is called again as part of the filter chain processing but this time onResponseHeaders is called
  4. onResponseHeaders calls proxy_send_local_response, which ultimately does not generate a local reply but still stops filter chain processing.

As a result, we end up with a stuck connection in Envoy: no local reply is sent and processing is stopped.

I think that proxy-wasm plugins shouldn't use proxy_send_local_response this way, so ultimately whoever created such a plugin shot themselves in the foot. That being said, I think there are a few improvements that could be made on the Envoy/proxy-wasm side to handle this situation somewhat better:

  1. We can avoid stopping processing in such cases to prevent stuck connections on Envoy
  2. We can return errors from proxy_send_local_response instead of silently ignoring them.

Currently, the Envoy implementation of sendLocalResponse can detect when a second local response is requested, and in that case it returns an error without actually trying to send a local response.

However, even though Envoy reports an error, send_local_response ignores the result of the host-specific sendLocalResponse implementation, stops processing, and returns success to the wasm plugin.

With this change, send_local_response will check the result of the host-specific implementation of sendLocalResponse. When sendLocalResponse fails, it will just propagate the error to the caller and do nothing else (in particular, it will not stop processing).
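
The difference in control flow can be sketched as below. This is a minimal, self-contained model and not the actual proxy-wasm-cpp-host code: `Status`, `ModelHost`, `sendLocalResponseBefore`, and `sendLocalResponseAfter` are hypothetical names, and `Status::Failure` stands in for whatever error the host really returns.

```cpp
#include <cassert>

// Hypothetical result type; the real ABI uses its own result codes.
enum class Status { Ok, Failure };

// Hypothetical model of a host such as Envoy.
struct ModelHost {
  bool local_reply_sent = false;    // a local reply was already generated
  bool processing_stopped = false;  // filter chain iteration is paused

  // Rejects a second local reply, as the description says Envoy does.
  Status sendLocalResponse() {
    if (local_reply_sent) return Status::Failure;
    local_reply_sent = true;
    return Status::Ok;
  }

  void stopProcessing() { processing_stopped = true; }
};

// Behavior described as current: the host result is ignored, processing is
// stopped regardless, and the plugin is told the call succeeded.
Status sendLocalResponseBefore(ModelHost& host) {
  (void)host.sendLocalResponse();
  host.stopProcessing();
  return Status::Ok;
}

// Behavior proposed in this change: a host failure is propagated to the
// caller and no other side effect (such as stopping processing) happens.
Status sendLocalResponseAfter(ModelHost& host) {
  const Status result = host.sendLocalResponse();
  if (result != Status::Ok) return result;
  host.stopProcessing();
  return Status::Ok;
}
```

In this model, a second call through `sendLocalResponseBefore` leaves the host paused with no new reply, which corresponds to the stuck connection; the same call through `sendLocalResponseAfter` returns `Status::Failure` and leaves processing untouched.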

I think this behavior makes sense in principle: on the one hand, we don't ignore the failure from sendLocalResponse, and on the other hand, when the failure happens we don't trigger any of the side effects expected from a successful proxy_send_local_response call.

NOTE: Even though I do think that this is a more reasonable behavior, it's still a change from the previous behavior and it might break existing proxy-wasm plugins. Specifically:

  1. C++ plugins that proactively check the result of proxy_send_local_response will change behavior (previously, proxy_send_local_response failed silently)
  2. Rust plugins, due to the way the Rust SDK handles errors from proxy_send_local_response, will crash at runtime in this case.

On the bright side, the plugins affected by this change currently just cause stuck connections in Envoy, so we are trading one undesirable behavior for another, more explicit one.

A couple of additional notes for reviewers:

  • If there is no disagreement with the overall approach, but you don't want to change user-visible behavior when sendLocalResponse fails, I can revert to silencing the error, though that would not be my first preference;
  • I created a utility function for unit tests to stringify a list of arguments, but I'm pretty sure similar functions already exist in libraries like Abseil; if reviewers are in favor of adding Abseil to proxy-wasm-cpp-host, I can spend some time working out how to make that happen without breaking Envoy in the meantime.

@krinkinmu
Author

+cc @keithmattix

@keithmattix
Contributor

Seems reasonable to me! The change seems worth the risk IMO; from what I understand, the affected population are plugin authors who are already pausing envoy

@krinkinmu
Author

Seems reasonable to me! The change seems worth the risk IMO; from what I understand, the affected population are plugin authors who are already pausing envoy

Yes, that's correct. Plugins affected by this change are already affected, but in a different way.

@PiotrSikora
Member

Propagating returned status is definitely a good thing to do. But please note that proxy_send_local_response was originally infallible, and the error codepath was added in envoyproxy/envoy#23049 to address the "double send local response" issue (breaking the ABI contract in the process, which is why Rust SDK panics when this happens).

Having said that:

  1. The plugin returning multiple local responses is clearly buggy, so why are we adding a workaround for it if it's not crashing hosts? We're not adding workarounds for plugins that consistently return Pause and never make any progress, which results in exactly the same "stuck" behavior.
  2. What should be the fallback for failed proxy_send_local_response (which in Envoy is pretty much exclusively used for generating short error responses) that plugins should implement? This adds extra complexity to all plugins in order to address a broken use case.

Also, and that's a topic for a separate issue, I question whether we should be calling the plugin for response processing of a response it generated itself.

@PiotrSikora
Member

  1. What should be the fallback for failed proxy_send_local_response (which in Envoy is pretty much exclusively used for generating short error responses) that plugins should implement? This adds extra complexity to all plugins in order to address a broken use case.

For example, in the Rust SDK's example HTTP authorization plugin that relies on this behavior (like every other authorization plugin), what should the behavior be when this call fails?

The only solution that comes to mind is returning Pause and adding checks to make sure that Pause is also returned in all other callbacks, which is basically reimplementing the existing logic and "stuck" behavior on the plugin side in a much more error-prone way, and at a much higher cost.

What am I missing?

Signed-off-by: Mikhail Krinkin <[email protected]>
@krinkinmu
Author

@PiotrSikora you're correct that the plugins that do that are buggy, and it's definitely not the intent here to create workarounds for them.

That being said, I think Envoy can do better in the presence of a buggy plugin and not leave a stuck connection behind. And the author of a buggy plugin could benefit from a signal telling them that they are doing something wrong.

As for the fallback, my thinking here is that plugins shouldn't need a fallback in this case: they should just stop calling proxy_send_local_response when processing a local response. In a way, proxy_send_local_response remains infallible as long as it is used correctly.
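
A minimal sketch of such a plugin-side guard, assuming the plugin tracks on its own whether it has already requested a local reply. Everything here is hypothetical and for illustration only (`FilterStatus`, `CountingHost`, `GuardedPlugin`, and the `authorized` / `upstream_failed` inputs); it does not use the real proxy-wasm SDK API.

```cpp
#include <cassert>

// Hypothetical stand-in for the SDK's header callback return values.
enum class FilterStatus { Continue, StopIteration };

// Hypothetical host that only counts local reply requests.
struct CountingHost {
  int local_replies_requested = 0;
  void sendLocalResponse(int /*status_code*/) { ++local_replies_requested; }
};

class GuardedPlugin {
 public:
  explicit GuardedPlugin(CountingHost& host) : host_(host) {}

  FilterStatus onRequestHeaders(bool authorized) {
    if (authorized) return FilterStatus::Continue;
    requestLocalReply(403);
    return FilterStatus::StopIteration;
  }

  FilterStatus onResponseHeaders(bool upstream_failed) {
    // The response being processed is the local reply this plugin requested
    // earlier, so let it pass instead of requesting a second one.
    if (local_reply_requested_) return FilterStatus::Continue;
    if (!upstream_failed) return FilterStatus::Continue;
    requestLocalReply(502);
    return FilterStatus::StopIteration;
  }

 private:
  void requestLocalReply(int status_code) {
    host_.sendLocalResponse(status_code);
    local_reply_requested_ = true;
  }

  CountingHost& host_;
  bool local_reply_requested_ = false;  // set once per stream
};
```

With this guard, a stream rejected in `onRequestHeaders` requests exactly one local reply, and the later `onResponseHeaders` call for that reply returns `Continue` without touching the host.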

Rather than returning an error, I can try to find a way to "crash" the plugin (i.e., the Rust SDK would just panic in this case, and I can see if I can do the same for C++). I can also skip returning an error altogether, though I don't think it's the best way forward, because plugins that do that basically get no clear indication that they are doing something wrong.

I also agree with you that the whole situation where a plugin processes its own local reply is confusing and seems to be causing subtle issues. So before moving forward with the review of this PR, let me see if I can come up with a change on the Envoy side to address that behavior; maybe the Envoy folks will be receptive to such a change.
