diff --git a/CHANGELOG.md b/CHANGELOG.md
index abedf0c84a3..b4560403436 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -217,6 +217,7 @@
 * [ENHANCEMENT] Specify in which component the configuration flags `-compactor.blocks-retention-period`, `-querier.max-query-lookback`, `-query-frontend.max-total-query-length`, `-query-frontend.max-query-expression-size-bytes` are applied and that they are applied to remote read as well. #8433
 * [ENHANCEMENT] Provide more detailed recommendations on how to migrate from classic to native histograms. #8864
 * [ENHANCEMENT] Clarify that `{namespace}` and `{groupName}` path segments in the ruler config API should be URL-escaped. #8969
+* [ENHANCEMENT] Include stalled compactor network drive information in runbooks. #9297
 
 ### Tools
 
diff --git a/docs/sources/mimir/manage/mimir-runbooks/_index.md b/docs/sources/mimir/manage/mimir-runbooks/_index.md
index a77e161e3b5..87f88398a0f 100644
--- a/docs/sources/mimir/manage/mimir-runbooks/_index.md
+++ b/docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -624,6 +624,10 @@ How to **investigate**:
 ./tools/markblocks/markblocks -backend gcs -gcs.bucket-name -mark no-compact -tenant -details "Result block exceeds symbol table maximum size" ...
 ```
 - Further reading: [Compaction algorithm]({{< relref "../../references/architecture/components/compactor#compaction-algorithm" >}}).
+ - Compactor network disk unresponsive:
+   - **How to detect**: A telltale sign is having many cores of sustained kernel-mode CPU usage by the compactor process. Check the metric `rate(container_cpu_system_seconds_total{pod=""}[$__rate_interval])` for the affected pod.
+   - **What it means**: The compactor process has frozen because it's blocked on kernel-mode flushes to an unresponsive network block storage device.
+   - **How to mitigate**: Unknown. This typically self-resolves after ten to twenty minutes.
- Check the [Compactor Dashboard]({{< relref "../monitor-grafana-mimir/dashboards/compactor" >}}) and set it to view the last 7 days.