Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-8331 telemetry: Support client metrics dump without agent #14289

Merged
merged 2 commits into from
May 3, 2024

Conversation

mjmac
Copy link
Contributor

@mjmac mjmac commented Apr 30, 2024

Enable scenarios where client telemetry is collected and dumped
to a CSV without agent config changes or involvement.

Setting D_CLIENT_METRICS_DUMP_DIR in the client process
environment will enable client telemetry dump to the
specified directory even if the agent is not configured
to export telemetry.

Features: telemetry
Required-githooks: true
Change-Id: I243d11a2e00059ef3115d392d63c523048477122
Signed-off-by: Michael MacDonald [email protected]

Copy link

Ticket title is 'Client side metrics/stats support for DAOS'
Status is 'In Review'
Labels: 'HPE'
Errors are Unknown component
https://daosio.atlassian.net/browse/DAOS-8331

@mjmac mjmac changed the title DAOS-8331 metrics: Support client metrics dump without agent DAOS-8331 telemetry: Support client metrics dump without agent Apr 30, 2024
Enable scenarios where client telemetry is collected and dumped
to a CSV without agent config changes or involvement.

Setting D_CLIENT_METRICS_DUMP_DIR in the client process
environment will enable client telemetry dump to the
specified directory even if the agent is not configured
to export telemetry.

Skip-NLT: true
Features: telemetry
Required-githooks: true
Change-Id: I243d11a2e00059ef3115d392d63c523048477122
Signed-off-by: Michael MacDonald <[email protected]>
@mjmac mjmac force-pushed the mjmac/DAOS-8331-no_agent branch from 35ec346 to 7118526 Compare May 1, 2024 16:16
@mjmac mjmac marked this pull request as ready for review May 2, 2024 14:43
@mjmac mjmac requested review from a team as code owners May 2, 2024 14:43
@mjmac
Copy link
Contributor Author

mjmac commented May 2, 2024

Moving this out of draft for review. Note, NLT seems to be having issues with this PR. The actual tests pass, but the stage is marked as failed due to an error in calculating deltas. I wound up pushing with Skip-NLT: true out of frustration, because I wanted to get the rest of the test coverage.

Given that NLT actually passed and the failure does not seem to be related to changes made in this PR, I don't think it's worth re-running again to try and get an NLT pass.

@mjmac mjmac requested a review from kccain May 2, 2024 14:47
/* Request that the agent adds our segment into the tree. */
rc = dc_mgmt_tm_register(NULL, dc_jobid, pid, &agent_uid);
if (rc != 0) {
DL_ERROR(rc, "client telemetry setup failed.");
if (rc == -DER_UNINIT && d_isenv_def(DAOS_CLIENT_METRICS_DUMP_DIR)) {
D_INFO("telemetry dump dir set -- proceeding without agent management.\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this be appropriate to classify as a warning / D_WARN? Since the user has requested metrics to be retained but it can't be done?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this, but decided that it would be too noisy/alarming. The intent behind this PR is to allow a user to dump client telemetry without any involvement from the admin (i.e. no need to reconfigure the agent, etc).

Rather than requiring the user to set yet another environment variable to indicate that they want telemetry to work even if the agent isn't configured to manage it, I just modified the code to quietly handle this special case (-DER_UNINIT indicates that the agent hasn't initialized the telemetry library). To me, this follows the principle of least surprise (or annoyance, depending on your outlook).

The exact use cases here are still being hashed out, but in case it's not clear, my expectation is that if the user is asking to dump the client telemetry at program exit, then they probably don't care about whether or not the agent is configured to manage the telemetry. Agent management is only needed if the telemetry will be sampled in realtime while the application is still running.

@mjmac mjmac added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label May 2, 2024
@mjmac mjmac requested a review from a team May 2, 2024 20:30
Required-githooks: true

Change-Id: I10d6df7d67399b564f964fca6f3da7af0698b18f
@mjmac mjmac removed the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label May 2, 2024
@mjmac mjmac removed the request for review from a team May 2, 2024 23:17
@mjmac
Copy link
Contributor Author

mjmac commented May 2, 2024

grumble ... Just merged master, re-running.

Copy link
Contributor

@kjacque kjacque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mjmac mjmac merged commit b364e49 into master May 3, 2024
50 of 51 checks passed
@mjmac mjmac deleted the mjmac/DAOS-8331-no_agent branch May 3, 2024 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants