Add client request size metric channel #2023

andybradshaw · 2023-09-28T18:13:49Z

Before this PR

Currently, the best way to understand the size of requests made by a particular client is to analyze server request logs. While this provides some high-level insight into the behavior of a client, it says nothing about the client's understanding of whether a request is "repeatable". This matters because not all retry approaches share the same opinion on the repeatability of a request (e.g. Envoy has a payload size constraint), so we need to know which large requests Dialogue currently believes would get retried.

After this PR

==COMMIT_MSG==
Add a request size metric channel, which records the size of payloads written by the client.
==COMMIT_MSG==

Possible downsides?

This is currently a histogram, but I don't believe we need all of the metrics that would come out of it, and it's likely quite expensive to add this metric to all clients. I think we probably only care about the p99 or max values here, as average request size is largely irrelevant.

changelog-app · 2023-09-28T18:13:53Z

Generate changelog in `changelog/@unreleased`

What do the change types mean?

feature: A new feature of the service.
improvement: An incremental improvement in the functionality or operation of the service.
fix: Remedies the incorrect behaviour of a component of the service in a backwards-compatible way.
break: Has the potential to break consumers of this service's API, inclusive of both Palantir services and external consumers of the service's API (e.g. customer-written software or integrations).
deprecation: Advertises the intention to remove service functionality without any change to the operation of the service itself.
manualTask: Requires the possibility of manual intervention (running a script, eyeballing configuration, performing database surgery, ...) at the time of upgrade for it to succeed.
migration: A fully automatic upgrade migration task with no engineer input required.

Note: only one type should be chosen.

How are new versions calculated?

❗The break and manual task changelog types will result in a major release!
🐛 The fix changelog type will result in a minor release in most cases, and a patch release version for patch branches. This behaviour is configurable in autorelease.
✨ All others will result in a minor version release.


carterkozak · 2023-09-28T18:15:15Z

dialogue-core/src/main/java/com/palantir/dialogue/core/RequestSizeMetricsChannel.java

+    private final EndpointChannel delegate;
+    private final Histogram requestSize;
+
+    static EndpointChannel create(Config cf, EndpointChannel channel, Endpoint endpoint) {


We'll need to wire this up from DialogueChannel so that it's part of the chain of channels used to send requests

Yep! wasn't sure if we'd want to do that separately or not, and also wasn't sure where in the chain to add this... maybe alongside the timing endpoint channel?

Hmm, the timing endpoint channel is in a conditional to protect against an edge case where clients may arbitrarily build new Endpoint objects (if they're not using standard Dialogue-generated bindings). I don't want to miss out on that data, so perhaps we can avoid fanout by setting a relatively high threshold for reporting data, something like 1 MB+.

We'll need to restructure a bit of the code to avoid creating a histogram until we've seen a large request for the first time (which is impossible-ish on GET endpoints, and many PUTs/POSTs will never reach it) by creating a memoized supplier for the histogram rather than passing the histogram directly.

Setting a threshold also improves the probability of capturing maximums. Because we use a sampling reservoir, we only hold ~1024 samples at a given time (with heavy recency bias), so dropping small, uninteresting samples increases the probability that we capture outliers.
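A minimal sketch of the idea described above, under stated assumptions: `SketchHistogram`, `MemoizingSupplier`, `LargeRequestRecorder`, and the 1,000,000-byte threshold are hypothetical names and values chosen for illustration, not the code merged in this PR. It shows a histogram that is only created the first time a request crosses the threshold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for a metrics histogram; it only stores raw samples. */
final class SketchHistogram {
    final List<Long> samples = new ArrayList<>();

    void update(long value) {
        samples.add(value);
    }
}

/** Caches the first result of the delegate so the histogram is created at most once. */
final class MemoizingSupplier<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private volatile T value;

    MemoizingSupplier(Supplier<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public T get() {
        T result = value;
        if (result == null) {
            synchronized (this) {
                result = value;
                if (result == null) {
                    result = delegate.get();
                    value = result;
                }
            }
        }
        return result;
    }
}

/** Records only requests at or above an assumed reporting threshold. */
final class LargeRequestRecorder {
    // Assumption: "1 MB+" is taken as 1,000,000 bytes for this sketch.
    static final long THRESHOLD_BYTES = 1_000_000L;

    private final Supplier<SketchHistogram> histogram;

    LargeRequestRecorder(Supplier<SketchHistogram> factory) {
        this.histogram = new MemoizingSupplier<>(factory);
    }

    void record(long requestBytes) {
        if (requestBytes >= THRESHOLD_BYTES) {
            // The factory runs on the first large request only.
            histogram.get().update(requestBytes);
        }
    }
}
```

With this shape, endpoints that never send a large body (most GETs, and many PUTs/POSTs) never register a histogram at all, which is what keeps metric fanout down.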

carterkozak · 2023-09-28T18:16:24Z

dialogue-core/src/main/metrics/dialogue-core-metrics.yml

@@ -56,6 +56,10 @@ namespaces:
        type: timer
        tags: [ channel-name ]
        docs: Time spent waiting in the sticky queue before execution attempt.
+      requests.size:
+        type: histogram
+        tags: [service-name, endpoint]


Let's include channel-name which is roughly equivalent to the service name in a service discovery block (this shouldn't increase cardinality)

carterkozak · 2023-09-28T18:21:12Z

dialogue-core/src/main/java/com/palantir/dialogue/core/RequestSizeMetricsChannel.java

+        if (body.isEmpty()) {
+            // No need to record empty bodies
+            return request;
+        }


Currently this will record metrics for both requests which can and cannot be retried, without distinction. Should we tag metrics based on whether or not retries are allowed? (We could hold two histograms and pass the correct one to RequestSizeRecordingRequestBody based on repeatable().)
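One possible shape for the two-histogram suggestion, as a rough sketch: `SizeSamples` and `RetryTaggedSizes` are hypothetical names, and the boolean stands in for whatever `repeatable()` returns on the request body. The real change may tag or structure this differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a histogram; it only stores raw samples. */
final class SizeSamples {
    private final List<Long> values = new ArrayList<>();

    void update(long value) {
        values.add(value);
    }

    int count() {
        return values.size();
    }
}

/** Holds one sample set per value of an assumed "retryable" tag. */
final class RetryTaggedSizes {
    private final SizeSamples retryable = new SizeSamples();
    private final SizeSamples nonRetryable = new SizeSamples();

    /** Selects the sample set once per request, based on the body's repeatable() answer. */
    SizeSamples forBody(boolean repeatable) {
        return repeatable ? retryable : nonRetryable;
    }
}
```

Choosing the histogram up front means the recording wrapper itself does not need to know about the tag; it just updates whichever instance it was handed.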

carterkozak · 2023-10-02T15:07:22Z

dialogue-core/src/main/java/com/palantir/dialogue/core/RequestSizeMetricsChannel.java

+            }
+        }
+
+        @Override


We need to override and pass through the default interface methods from RequestBody (some tests are failing due to the missing Content-Length header).
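To illustrate why the pass-through matters, here is a small sketch with a hypothetical `SketchBody` interface standing in for `RequestBody`. It assumes a default `contentLength()` that returns an empty value; if the wrapper did not forward it, callers would see the default instead of the delegate's known length.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.OptionalLong;

/** Hypothetical stand-in for a request body interface with a default method. */
interface SketchBody {
    void writeTo(OutputStream output) throws IOException;

    default OptionalLong contentLength() {
        return OptionalLong.empty();
    }
}

/** A body with a known length, which it reports by overriding the default. */
final class FixedBody implements SketchBody {
    private final byte[] bytes;

    FixedBody(byte[] bytes) {
        this.bytes = bytes;
    }

    @Override
    public void writeTo(OutputStream output) throws IOException {
        output.write(bytes);
    }

    @Override
    public OptionalLong contentLength() {
        return OptionalLong.of(bytes.length);
    }
}

/** Wrapper that counts written bytes and forwards the default method to the delegate. */
final class CountingBody implements SketchBody {
    private final SketchBody delegate;
    private long written;

    CountingBody(SketchBody delegate) {
        this.delegate = delegate;
    }

    @Override
    public void writeTo(OutputStream output) throws IOException {
        delegate.writeTo(new FilterOutputStream(output) {
            @Override
            public void write(int b) throws IOException {
                out.write(b);
                written++;
            }

            @Override
            public void write(byte[] b, int off, int len) throws IOException {
                out.write(b, off, len);
                written += len;
            }
        });
    }

    // Without this override, the wrapper would report the interface default (empty).
    @Override
    public OptionalLong contentLength() {
        return delegate.contentLength();
    }

    long bytesWritten() {
        return written;
    }
}
```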

carterkozak · 2023-10-03T15:03:36Z

dialogue-clients/metrics.md

@@ -71,6 +71,11 @@ Dialogue-specific metrics that are not necessarily applicable to other client im
 - `dialogue.client.request.queued.time` tagged `channel-name` (timer): Time spent waiting in the queue before execution.
 - `dialogue.client.request.endpoint.queued.time` tagged `channel-name`, `service-name`, `endpoint` (timer): Time spent waiting in the queue before execution on a specific endpoint due to server QoS.
 - `dialogue.client.request.sticky.queued.time` tagged `channel-name` (timer): Time spent waiting in the sticky queue before execution attempt.
+- `dialogue.client.requests.size` (histogram): Size of requests


We may want to describe the reporting size threshold in this metric description (though I suppose it doesn't claim to report the size of every request ;-)

Can update the metric-schema/docs later if need be.

Updated! Yeah, good to call this out... might even be worth renaming the metric to large_requests or something to explicitly call out that there's some distinction/threshold that's used for it.

I think the current name gives us flexibility if we want to change thresholds later on, I don't mind it as is if you're happy with it. Thanks for updating!

carterkozak

Great stuff, thanks for the contribution! :-)

svc-autorelease · 2023-10-03T15:15:41Z

Released 3.94.0

This reverts commit 50b0bfd.

andybradshaw added 2 commits September 26, 2023 17:37

Add request size metrics channel

6dc9052

Don't throw on close exception

0582992

andybradshaw requested a review from carterkozak September 28, 2023 18:13

andybradshaw self-assigned this Sep 28, 2023

probot-autolabeler bot added the autorelease label Sep 28, 2023

carterkozak reviewed Sep 28, 2023


andybradshaw and others added 4 commits September 28, 2023 15:11

PR feedback

28dcb36

Add generated changelog entries

4084d82

fix checks and tests

638a1a2

PR feedback

880b94c