Add client request size metric channel #2023

Merged
merged 10 commits into from
Oct 3, 2023

Conversation

andybradshaw
Contributor

Before this PR

Currently, the best way to understand the size of requests made by a particular client is to analyze server request logs. While this provides some high-level insight into the behavior of a client, it doesn't provide any information about the client's understanding of whether a request is "repeatable". This is important because not all retry approaches have the same opinion on the repeatability of a request (e.g. Envoy has a payload size constraint), so we need to know which large requests Dialogue currently believes would get retried.

After this PR

==COMMIT_MSG==
Add a request size metric channel, which records the size of payloads written by the client.
==COMMIT_MSG==
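
One way to record the size of a payload as it is written is to wrap the output stream in a byte counter. The sketch below is a minimal illustration using only the JDK; `CountingOutputStream` and `CountingDemo` are hypothetical names and are not taken from this PR's diff.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: counts bytes as they pass through to the wrapped stream. */
final class CountingOutputStream extends FilterOutputStream {
    private long count;

    CountingOutputStream(OutputStream delegate) {
        super(delegate);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        count++;
    }

    @Override
    public void write(byte[] buffer, int offset, int length) throws IOException {
        // Forward bulk writes directly instead of looping byte by byte.
        out.write(buffer, offset, length);
        count += length;
    }

    long count() {
        return count;
    }
}

final class CountingDemo {
    /** Writes the payload through a counting stream and returns the observed size. */
    static long measure(byte[] payload) {
        CountingOutputStream counting = new CountingOutputStream(new ByteArrayOutputStream());
        try {
            counting.write(payload);
            counting.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counting.count();
    }

    public static void main(String[] args) {
        System.out.println(measure("hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The observed count would then be reported to the metric once the body has been fully written.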

Possible downsides?

This is currently a histogram, but I don't believe we need all of the metrics that would come out of it, and it's likely quite expensive to add this metric to all clients. I think we probably only care about the p99 or max values here, as average request size is largely irrelevant.

@changelog-app

changelog-app bot commented Sep 28, 2023

Generate changelog in changelog/@unreleased

What do the change types mean?
  • feature: A new feature of the service.
  • improvement: An incremental improvement in the functionality or operation of the service.
  • fix: Remedies the incorrect behaviour of a component of the service in a backwards-compatible way.
  • break: Has the potential to break consumers of this service's API, inclusive of both Palantir services
    and external consumers of the service's API (e.g. customer-written software or integrations).
  • deprecation: Advertises the intention to remove service functionality without any change to the
    operation of the service itself.
  • manualTask: Requires the possibility of manual intervention (running a script, eyeballing configuration,
    performing database surgery, ...) at the time of upgrade for it to succeed.
  • migration: A fully automatic upgrade migration task with no engineer input required.

Note: only one type should be chosen.

How are new versions calculated?
  • ❗The break and manual task changelog types will result in a major release!
  • 🐛 The fix changelog type will result in a minor release in most cases, and a patch release version for patch branches. This behaviour is configurable in autorelease.
  • ✨ All others will result in a minor version release.

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Add a request size metric channel, which records the size of payloads written by the client.

Check the box to generate changelog(s)

  • Generate changelog entry

private final EndpointChannel delegate;
private final Histogram requestSize;

static EndpointChannel create(Config cf, EndpointChannel channel, Endpoint endpoint) {
Contributor

We'll need to wire this up from DialogueChannel so that it's part of the chain of channels used to send requests

Contributor Author

Yep! wasn't sure if we'd want to do that separately or not, and also wasn't sure where in the chain to add this... maybe alongside the timing endpoint channel?

Contributor

Hmm, the timing endpoint channel is in a conditional to protect against an edge case where clients may arbitrarily build new Endpoint objects (if they're not using standard Dialogue-generated bindings). I don't want to miss out on that data, so perhaps we can avoid fanout by setting a relatively high threshold for reporting data, something like 1 MB+.

We'll need to restructure a bit of the code to avoid creating a histogram until we've seen a large request for the first time (which is impossible-ish on GET endpoints, and many PUTs/POSTs will never reach it) by creating a memoized supplier for the histogram rather than passing the histogram directly.
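
A rough sketch of that idea, assuming a 1 MiB threshold and a hand-rolled memoizing supplier. `LazyLargeRequestRecorder`, `SizeHistogram`, and `memoize` are hypothetical names for illustration only; the code in this PR may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class LazyLargeRequestRecorder {
    /** Illustrative threshold of 1 MiB; the value used in the PR may differ. */
    static final long THRESHOLD_BYTES = 1 << 20;

    /** Stand-in for a metrics histogram; not a real library type. */
    interface SizeHistogram {
        void update(long value);
    }

    private final Supplier<SizeHistogram> histogram;

    LazyLargeRequestRecorder(Supplier<SizeHistogram> factory) {
        this.histogram = memoize(factory);
    }

    void record(long sizeBytes) {
        // Small requests never touch the supplier, so no histogram is created for them.
        if (sizeBytes >= THRESHOLD_BYTES) {
            histogram.get().update(sizeBytes);
        }
    }

    /** Minimal thread-safe memoizing supplier: the factory runs at most once. */
    static <T> Supplier<T> memoize(Supplier<T> factory) {
        return new Supplier<T>() {
            private T value;

            @Override
            public synchronized T get() {
                if (value == null) {
                    value = factory.get();
                }
                return value;
            }
        };
    }
}

final class LazyRecorderDemo {
    /** Returns {histograms created, samples recorded} after recording the given sizes. */
    static int[] run(long... sizes) {
        int[] created = {0};
        List<Long> samples = new ArrayList<>();
        LazyLargeRequestRecorder recorder = new LazyLargeRequestRecorder(() -> {
            created[0]++;
            return value -> samples.add(value);
        });
        for (long size : sizes) {
            recorder.record(size);
        }
        return new int[] {created[0], samples.size()};
    }

    public static void main(String[] args) {
        int[] result = run(512, 2 << 20, 3 << 20);
        System.out.println(result[0] + " histogram(s), " + result[1] + " sample(s)");
    }
}
```

An endpoint that only ever sends small bodies never registers a histogram in this sketch, which is what keeps the metric fanout low.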

Setting a threshold also improves the probability of capturing maximums. We use a sampling reservoir that holds only ~1024 samples at a given time (with heavy recency bias), so dropping small, uninteresting samples makes it more likely that outliers are retained.
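
To illustrate the sampling argument, here is a small simulation. It swaps in a plain uniform reservoir (Algorithm R) as a simpler stand-in for the recency-biased reservoir mentioned above, so the numbers are only indicative. All class names, sizes, and counts are assumptions made for the example.

```java
import java.util.Random;

/** Uniform reservoir sampling (Algorithm R); a stand-in, not the reservoir Dialogue uses. */
final class UniformReservoir {
    private final long[] samples;
    private final Random random;
    private long seen;

    UniformReservoir(int capacity, long seed) {
        this.samples = new long[capacity];
        this.random = new Random(seed);
    }

    void update(long value) {
        seen++;
        if (seen <= samples.length) {
            samples[(int) (seen - 1)] = value;
        } else {
            // Keep the new value with probability capacity / seen.
            long slot = (long) (random.nextDouble() * seen);
            if (slot < samples.length) {
                samples[(int) slot] = value;
            }
        }
    }

    long max() {
        long max = 0;
        int filled = (int) Math.min(seen, samples.length);
        for (int i = 0; i < filled; i++) {
            max = Math.max(max, samples[i]);
        }
        return max;
    }
}

final class ReservoirDemo {
    static final long THRESHOLD = 1 << 20;   // 1 MiB, illustrative
    static final long SMALL = 4096;          // typical small request, illustrative
    static final long OUTLIER = 50L << 20;   // one 50 MiB request, illustrative

    /** Feeds 100,000 requests with a single outlier and returns the max left in the reservoir. */
    static long retainedMax(boolean applyThreshold, long seed) {
        UniformReservoir reservoir = new UniformReservoir(1024, seed);
        for (int i = 0; i < 100_000; i++) {
            long size = (i == 500) ? OUTLIER : SMALL;
            if (!applyThreshold || size >= THRESHOLD) {
                reservoir.update(size);
            }
        }
        return reservoir.max();
    }

    public static void main(String[] args) {
        System.out.println("with threshold:    " + retainedMax(true, 1L));
        System.out.println("without threshold: " + retainedMax(false, 1L));
    }
}
```

With the threshold applied, the outlier is the only sample and is always retained. Without it, the outlier competes with roughly 100,000 small samples for 1024 slots and is usually evicted.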

@@ -56,6 +56,10 @@ namespaces:
      type: timer
      tags: [ channel-name ]
      docs: Time spent waiting in the sticky queue before execution attempt.
    requests.size:
      type: histogram
      tags: [service-name, endpoint]
Contributor

Let's include channel-name which is roughly equivalent to the service name in a service discovery block (this shouldn't increase cardinality)

Comment on lines +63 to +66
if (body.isEmpty()) {
// No need to record empty bodies
return request;
}
Contributor

Currently this will record metrics for both requests which can and cannot be retried, without distinction. Should we tag metrics based on whether or not retries are allowed? (We could hold two histograms and pass the correct one to RequestSizeRecordingRequestBody based on repeatable().)
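
A minimal sketch of the two-histogram suggestion. The `Body` interface and the lists standing in for histograms are hypothetical simplifications; only `repeatable()` is taken from the comment above.

```java
import java.util.ArrayList;
import java.util.List;

final class RetryTaggedRecorder {
    /** Stand-in for the request body; only the method discussed above is modeled. */
    interface Body {
        boolean repeatable();
    }

    // Stand-ins for two histograms that would differ only by a tag value.
    final List<Long> repeatableSizes = new ArrayList<>();
    final List<Long> nonRepeatableSizes = new ArrayList<>();

    /** Routes the sample to the histogram matching the body's repeatability. */
    void record(Body body, long sizeBytes) {
        List<Long> target = body.repeatable() ? repeatableSizes : nonRepeatableSizes;
        target.add(sizeBytes);
    }

    public static void main(String[] args) {
        RetryTaggedRecorder recorder = new RetryTaggedRecorder();
        recorder.record(() -> true, 2_000_000L);
        recorder.record(() -> false, 5_000_000L);
        System.out.println("repeatable: " + recorder.repeatableSizes);
        System.out.println("non-repeatable: " + recorder.nonRepeatableSizes);
    }
}
```

Choosing the histogram once, when the body is wrapped, keeps the per-write path free of tag lookups.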

}
}

@Override
Contributor

We need to override and pass through the default interface methods from RequestBody (some tests are failing due to the missing Content-Length header)
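
The failure mode can be reproduced with a simplified interface. `SimpleBody` below is a hypothetical stand-in for RequestBody with one default method; the wrapper classes are illustrative only and are not the classes from this PR.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.OptionalLong;

/** Simplified stand-in for a request body interface with a default method. */
interface SimpleBody {
    void writeTo(OutputStream output) throws IOException;

    default OptionalLong contentLength() {
        return OptionalLong.empty();
    }
}

/** A body that knows its length up front. */
final class FixedBody implements SimpleBody {
    private final byte[] bytes;

    FixedBody(byte[] bytes) {
        this.bytes = bytes;
    }

    @Override
    public void writeTo(OutputStream output) throws IOException {
        output.write(bytes);
    }

    @Override
    public OptionalLong contentLength() {
        return OptionalLong.of(bytes.length);
    }
}

/** Forgets to forward contentLength(), so callers see the interface default (empty). */
final class LossyWrapper implements SimpleBody {
    private final SimpleBody delegate;

    LossyWrapper(SimpleBody delegate) {
        this.delegate = delegate;
    }

    @Override
    public void writeTo(OutputStream output) throws IOException {
        delegate.writeTo(output);
    }
}

/** Forwards every method, including the default one, to the delegate. */
final class ForwardingWrapper implements SimpleBody {
    private final SimpleBody delegate;

    ForwardingWrapper(SimpleBody delegate) {
        this.delegate = delegate;
    }

    @Override
    public void writeTo(OutputStream output) throws IOException {
        delegate.writeTo(output);
    }

    @Override
    public OptionalLong contentLength() {
        return delegate.contentLength();
    }
}

final class WrapperDemo {
    public static void main(String[] args) {
        SimpleBody body = new FixedBody(new byte[7]);
        System.out.println(new LossyWrapper(body).contentLength());
        System.out.println(new ForwardingWrapper(body).contentLength());
    }
}
```

A caller that derives the Content-Length header from `contentLength()` would omit it for the lossy wrapper, which matches the kind of test failure described above.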

@@ -71,6 +71,11 @@ Dialogue-specific metrics that are not necessarily applicable to other client im
- `dialogue.client.request.queued.time` tagged `channel-name` (timer): Time spent waiting in the queue before execution.
- `dialogue.client.request.endpoint.queued.time` tagged `channel-name`, `service-name`, `endpoint` (timer): Time spent waiting in the queue before execution on a specific endpoint due to server QoS.
- `dialogue.client.request.sticky.queued.time` tagged `channel-name` (timer): Time spent waiting in the sticky queue before execution attempt.
- `dialogue.client.requests.size` (histogram): Size of requests
Contributor

We may want to describe the reporting size threshold in this metric description (though I suppose it doesn't claim to report the size of every request ;-)

Can update the metric-schema/docs later if need be.

Contributor Author

Updated! Yeah, good to call this out... might even be worth renaming the metric to large_requests or something to explicitly call out that there's some distinction/threshold that's used for it.

Contributor

I think the current name gives us flexibility if we want to change thresholds later on, I don't mind it as is if you're happy with it. Thanks for updating!

Contributor

@carterkozak carterkozak left a comment

Great stuff, thanks for the contribution! :-)

@bulldozer-bot bulldozer-bot bot merged commit 50b0bfd into develop Oct 3, 2023
6 checks passed
@bulldozer-bot bulldozer-bot bot deleted the ab/add-client-request-size-metrics branch October 3, 2023 15:15
@svc-autorelease
Collaborator

Released 3.94.0
