Include stalled network storage device info in compactor runbooks. (#9297)

* Include stalled network drive in compactor runbook info.

* Tweaks.

* Changelog

* Update docs/sources/mimir/manage/mimir-runbooks/_index.md

Co-authored-by: Taylor C <[email protected]>

* Update docs/sources/mimir/manage/mimir-runbooks/_index.md

Co-authored-by: Taylor C <[email protected]>

* Update docs/sources/mimir/manage/mimir-runbooks/_index.md

Co-authored-by: Taylor C <[email protected]>

* Update CHANGELOG.md

Co-authored-by: Taylor C <[email protected]>

---------

Co-authored-by: Taylor C <[email protected]>
seizethedave and tacole02 authored Sep 13, 2024
1 parent 90fd07b commit 841f2d1
Showing 2 changed files with 5 additions and 0 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -217,6 +217,7 @@
* [ENHANCEMENT] Specify in which component the configuration flags `-compactor.blocks-retention-period`, `-querier.max-query-lookback`, `-query-frontend.max-total-query-length`, `-query-frontend.max-query-expression-size-bytes` are applied and that they are applied to remote read as well. #8433
* [ENHANCEMENT] Provide more detailed recommendations on how to migrate from classic to native histograms. #8864
* [ENHANCEMENT] Clarify that `{namespace}` and `{groupName}` path segments in the ruler config API should be URL-escaped. #8969
* [ENHANCEMENT] Include stalled compactor network drive information in runbooks. #9297

### Tools

4 changes: 4 additions & 0 deletions docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -624,6 +624,10 @@ How to **investigate**:
./tools/markblocks/markblocks -backend gcs -gcs.bucket-name <bucket> -mark no-compact -tenant <tenant-id> -details "Result block exceeds symbol table maximum size" <block-1> <block-2>...
```
- Further reading: [Compaction algorithm]({{< relref "../../references/architecture/components/compactor#compaction-algorithm" >}}).
- Compactor network disk unresponsive:
- **How to detect**: A telltale sign is having many cores of sustained kernel-mode CPU usage by the compactor process. Check the metric `rate(container_cpu_system_seconds_total{pod="<pod>"}[$__rate_interval])` for the affected pod.
- **What it means**: The compactor process has frozen because it's blocked on kernel-mode flushes to an unresponsive network block storage device.
- **How to mitigate**: Unknown. This typically self-resolves after ten to twenty minutes.

- Check the [Compactor Dashboard]({{< relref "../monitor-grafana-mimir/dashboards/compactor" >}}) and set it to view the last 7 days.
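
The detection step in the "Compactor network disk unresponsive" entry can be sketched as a small check. This is a minimal illustration, not part of the commit or of Mimir's tooling. The helper names `kernel_cpu_query` and `looks_stalled`, the fixed `5m` window, and the thresholds (two cores sustained over five consecutive samples) are all assumptions chosen for the example; the runbook only says "many cores" and does not give a number. The fixed window stands in for `$__rate_interval`, which is a Grafana variable and isn't available when querying Prometheus directly.

```python
def kernel_cpu_query(pod: str, window: str = "5m") -> str:
    """Build the runbook's PromQL expression for one pod.

    The window replaces Grafana's $__rate_interval (assumed value: 5m).
    """
    return f'rate(container_cpu_system_seconds_total{{pod="{pod}"}}[{window}])'


def looks_stalled(samples, min_cores=2.0, min_points=5):
    """Return True if the most recent samples show sustained kernel-mode CPU.

    samples: kernel-mode CPU usage in cores, oldest first, for example the
             values returned by evaluating kernel_cpu_query over a time range.
    min_cores, min_points: hypothetical thresholds, not taken from the runbook.
    """
    recent = samples[-min_points:]
    # Require enough data points, and every one of them at or above the threshold.
    return len(recent) >= min_points and min(recent) >= min_cores


if __name__ == "__main__":
    print(kernel_cpu_query("compactor-0"))
    print(looks_stalled([0.1, 3.2, 3.1, 3.4, 3.0, 3.3]))
```

Requiring every recent sample to exceed the threshold, rather than the average, avoids flagging a single short spike in system CPU; tune both values to the core count of your compactor pods.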
