Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NAS-130426 / Cherry-pick multi-gen LRU patches for DF #187

Merged
merged 3 commits into from
Aug 6, 2024

Conversation

usaleem-ix
Copy link
Contributor

This PR contains 2 commits, that are cherry-picked from v6.6.44 release. These commits contain fixes for multi-gen LRU.

Changelog is also updated.

API tests run can be found here.

Manually tested by creating an ISO and installing on disk. No new warnings or failures found in dmesg logs.

yuzhaogoogle and others added 3 commits August 6, 2024 11:43
commit 3f74e6bd3b84a8b6bb3cc51609c89e5b9d58eed7 upstream.

set_initial_priority() tries to jump-start global reclaim by estimating
the priority based on cold/hot LRU pages.  The estimation does not account
for shrinker objects, and it cannot do so because their sizes can be in
different units other than page.

If shrinker objects are the majority, e.g., on TrueNAS SCALE 24.04.0 where
ZFS ARC can use almost all system memory, set_initial_priority() can
vastly underestimate how much memory ARC shrinker can evict and assign
extreme low values to scan_control->priority, resulting in overshoots of
shrinker objects.

To reproduce the problem, using TrueNAS SCALE 24.04.0 with 32GB DRAM, a
test ZFS pool and the following commands:

  fio --name=mglru.file --numjobs=36 --ioengine=io_uring \
      --directory=/root/test-zfs-pool/ --size=1024m --buffered=1 \
      --rw=randread --random_distribution=random \
      --time_based --runtime=1h &

  for ((i = 0; i < 20; i++))
  do
    sleep 120
    fio --name=mglru.anon --numjobs=16 --ioengine=mmap \
      --filename=/dev/zero --size=1024m --fadvise_hint=0 \
      --rw=randrw --random_distribution=random \
      --time_based --runtime=1m
  done

To fix the problem:
1. Cap scan_control->priority at or above DEF_PRIORITY/2, to prevent
   the jump-start from being overly aggressive.
2. Account for the progress from mm_account_reclaimed_pages(), to
   prevent kswapd_shrink_node() from raising the priority
   unnecessarily.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: e4dde56 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <[email protected]>
Reported-by: Alexander Motin <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
commit 8b671fe1a879923ecfb72dda6caf01460dd885ef upstream.

evict_folios() uses a second pass to reclaim folios that have gone through
page writeback and become clean before it finishes the first pass, since
folio_rotate_reclaimable() cannot handle those folios due to the
isolation.

The second pass tries to avoid potential double counting by deducting
scan_control->nr_scanned.  However, this can result in underflow of
nr_scanned, under a condition where shrink_folio_list() does not increment
nr_scanned, i.e., when folio_trylock() fails.

The underflow can cause the divisor, i.e., scale=scanned+reclaimed in
vmpressure_calc_level(), to become zero, resulting in the following crash:

  [exception RIP: vmpressure_work_fn+101]
  process_one_work at ffffffffa3313f2b

Since scan_control->nr_scanned has no established semantics, the potential
double counting has minimal risks.  Therefore, fix the problem by not
deducting scan_control->nr_scanned in evict_folios().

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 359a5e1 ("mm: multi-gen LRU: retry folios written back while isolated")
Reported-by: Wei Xu <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>
Cc: Alexander Motin <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
@bugclerk
Copy link

bugclerk commented Aug 6, 2024

@ixhamza
Copy link
Contributor

ixhamza commented Aug 6, 2024

Are we also planning to enable CONFIG_LRU_GEN_ENABLED in dragonfish? I am wondering whether it's worth maintaining this patch with CONFIG_LRU_GEN_ENABLED disabled, or if there is a plan to enable it in the upcoming dragonfish release.

@usaleem-ix
Copy link
Contributor Author

usaleem-ix commented Aug 6, 2024

CONFIG_LRU_GEN_ENABLED option enables the multi-gen LRU by default. Currently, it is disabled for both EE and DF (this is coming from Debian). I am not sure we are planning to enable this by default.

Since, these patches have landed in v6.6.44, I don't think this would introduce extra maintainability overhead. Moving to or merging a newer release will already contain these patches.

@amotin amotin merged commit ef7d464 into stable/dragonfish Aug 6, 2024
6 checks passed
@amotin amotin deleted the NAS-130426-df branch August 6, 2024 13:44
@bugclerk
Copy link

bugclerk commented Aug 6, 2024

JIRA ticket https://ixsystems.atlassian.net/browse/NAS-130426 is targeted to the following versions which have not received their corresponding PRs: 24.10

@bugclerk
Copy link

bugclerk commented Aug 6, 2024

This PR has been merged and conversations have been locked.
If you would like to discuss more about this issue please use our forums or raise a Jira ticket.

@truenas truenas locked as resolved and limited conversation to collaborators Aug 6, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants