
Shorten critical section in findEviction #217

Closed
wants to merge 3 commits

Conversation

@igchor (Contributor) commented Mar 24, 2023

This is the next part of the 'critical section' patch after #183.

Original PR (some of the changes already upstreamed): #172

This PR contains changes to findEviction only. The remaining part (changes in SlabRelease code) will be provided in the next (final) PR.

Comment on lines 1333 to 1328:

if (!toRecycle)
  continue;

Contributor:

Suggested change:

if (!toRecycle) {
  continue;
}

Contributor:

This happens iff we exhausted the entire linked list and haven't found a candidate, while not exhausting searchTries, right?

Contributor Author:

Yes, that's correct.

Contributor:

Please fix the format suggestion.

// remove the child from the mmContainer as we will not be evicting
// it. We could abort right here, but we need to cleanup in case
// unmarkForEviction() returns 0 - so just go through normal path.
if (!toRecycle_->isChainedItem() ||

Contributor:

Can you elaborate more?
In the old way of eviction, if we are trying to recycle a chained item (toRecycle.isChainedItem()) but the parent has changed, we don't do any specific treatment until releaseBackToAllocator, where we walk the chain with the parent being the candidate and find out that toRecycle was never reached. So releaseBackToAllocator fails and we continue iterating the eviction queue.

How does the new way work?

Contributor Author:

The new way is pretty much the same. This comment states that we don't do any specific treatment as well. We'll just try to evict the item as usual and leverage the existing logic in releaseBackToAllocator.

It might be possible to optimize this case but for this PR I felt like having similar behavior is better.

The only reason for this if statement here is to optimize the happy case: if the item is not a chained item, or the parent hasn't changed, we can safely remove the item from the MMContainer (we already hold the MMContainer lock, so doing it here has low overhead). If the parent has changed, we cannot call mmContainer.remove() here as it could lead to deadlocks. Instead, we'll just call mmContainer.remove() inside unlinkItemForEviction, which is slower (we need to acquire the lock again) but safe.
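
For illustration, here is a minimal, self-contained sketch of the decision being described; the toy types and the parent check are stand-ins, not CacheLib's real API:

#include <mutex>

// Toy stand-ins for the types in the snippet above (hypothetical).
struct Item {
  Item* parent = nullptr;                     // non-null => chained item
  bool isChainedItem() const { return parent != nullptr; }
};
struct MMContainer {
  std::mutex lock;                            // "the MMContainer lock"
  void removeLocked(Item&) {}                 // unlink from LRU; lock already held
};

// Called while the MMContainer lock is already held (inside the eviction
// iterator). Returns true if the item could be removed here (the happy path).
bool tryRemoveUnderIteratorLock(MMContainer& mm, Item& toRecycle, Item& candidate) {
  // Regular item, or a chained item whose parent is still the candidate we
  // just marked for eviction: removing it here is cheap and deadlock-free.
  if (!toRecycle.isChainedItem() || toRecycle.parent == &candidate) {
    mm.removeLocked(toRecycle);
    return true;
  }
  // Parent changed: don't touch the container here (risk of lock-order
  // issues). unlinkItemForEviction() will re-acquire the lock later and
  // remove the item there - slower, but safe.
  return false;
}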

Contributor:

I understand the optimization now.
I think as long as the chained item can't change parents after this point, this is safe. And since we marked the parent as forEviction, it can't be moved or accessed, so we don't have to take the chainedItemLock_ here.

How much does this optimization give us?

Contributor Author:

Using the default hit_ratio/graph_cache_leader_fbobj/config.json, this optimization improves throughput by ~40% on my machine. This is mostly due to an almost 100% eviction success rate (without the optimization I get a ~50% success rate).


token = createPutToken(*candidate_);

if (shouldWriteToNvmCache(*candidate_) && !token.isValid()) {

Contributor:

shouldWriteToNvmCache is called once in createPutToken and another time here; can we make it be called only once? I think we can just do it the old way?

Contributor Author:

Since I wrapped token creation in createPutToken, I no longer know whether the token is invalid due to FailConcurrentFill or because shouldWriteToNvmCache returned false.

Perhaps I should just put stats_.evictFailConcurrentFill.inc(); inside createPutToken?
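
A hedged sketch of what that could look like, with toy stub types (the real CacheLib types and signatures differ):

#include <string>

// Minimal stubs for illustration only.
struct PutToken {
  bool valid;
  bool isValid() const { return valid; }
};
struct NvmCache {
  PutToken createPutToken(const std::string& /*key*/) { return {true}; }
};
struct Stats {
  struct Counter { void inc() {} } evictFailConcurrentFill;
};
struct Item {
  std::string key;
  const std::string& getKey() const { return key; }
};

struct AllocatorSketch {
  NvmCache* nvmCache_ = nullptr;
  Stats stats_;
  bool shouldWriteToNvmCache(const Item&) const { return nvmCache_ != nullptr; }

  // The idea floated above: account for the concurrent-fill failure inside
  // the wrapper itself, so findEviction no longer needs to call
  // shouldWriteToNvmCache() a second time to interpret an invalid token.
  PutToken createPutToken(const Item& item) {
    PutToken token{false};
    if (shouldWriteToNvmCache(item)) {
      token = nvmCache_->createPutToken(item.getKey());
      if (!token.isValid()) {
        stats_.evictFailConcurrentFill.inc();  // invalid here => concurrent fill
      }
    }
    return token;
  }
};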

Contributor:

Is there a reason you have to create a function for createPutToken?

Contributor Author:

I wanted to use it in other places as well. But since for now it's only used in findEviction I can remove it. Done.

Contributor Author:

Also, I just realized I bumped evictFailParentAC and evictFailAC even if the failure was due to an invalid token - I fixed this as well.

…and enable combinedLocking support. Shorten the MMContainer critical section by removing the item from the Access Container after the MMContainer lock is dropped. Only markForEviction is executed under the MMContainer critical section now. If markForEviction succeeds, the item is guaranteed to be evicted.
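
As a rough, toy-level sketch of the flow this describes (illustrative names and simplifications, not CacheLib's real signatures):

#include <vector>

// Toy model of the flow described above.
struct Item {
  int refCount = 0;
  bool accessible = true;          // still linked in the access container
  bool markForEviction() { return refCount == 0; }   // atomics elided
};

struct MMContainer {
  std::vector<Item*> lru;
  // Runs `fn` over the eviction candidates under the container lock (the
  // lock itself is elided here; with combined locking the callback may even
  // be executed by another thread that currently holds the lock).
  template <typename Fn> void withEvictionIterator(Fn&& fn) { fn(lru); }
};

Item* findEvictionSketch(MMContainer& mm) {
  Item* candidate = nullptr;
  mm.withEvictionIterator([&](std::vector<Item*>& items) {
    for (Item* it : items) {
      // markForEviction() is the only state change made under the
      // MMContainer lock; once it succeeds, no other thread can obtain a
      // handle to the item, so eviction is guaranteed.
      if (it->markForEviction()) {
        candidate = it;            // the real code also unlinks it from the LRU
        break;
      }
    }
  });
  // The MMContainer lock is dropped here. Access-container removal (and the
  // NVM insertion and memory release) happen outside the critical section.
  if (candidate) {
    candidate->accessible = false;
  }
  return candidate;
}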



@igchor force-pushed the find_eviction_refactor branch 2 times, most recently from efba6bc to 0253683, on April 17, 2023 at 22:54

@facebook-github-bot:

@haowu14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Comment on lines 1292 to 1294:

if (evictToNvmCache)
  token = nvmCache_->createPutToken(candidate_->getKey());

Contributor:

Suggested change:

if (evictToNvmCache) {
  token = nvmCache_->createPutToken(candidate_->getKey());
}

Contributor:

This might have been lost in one of the patches you added after you ran the tests, but here is a bug.
The token can be created for candidate A; then candidate A fails markForEviction and we move to the next item.
For candidate B, evictToNvmCache is false but markForEviction succeeds.

We'll exit this loop with candidate B but with the token created for candidate A.
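
A tiny self-contained toy that reproduces the mismatch described here (not CacheLib code; names are illustrative):

#include <cstdio>
#include <string>
#include <vector>

struct PutToken {
  std::string key;                       // key the token was created for
  bool isValid() const { return !key.empty(); }
};
struct Candidate {
  std::string key;
  bool evictToNvmCache;                  // should this item go to NVM?
  bool marksForEviction;                 // does markForEviction() succeed?
};

int main() {
  std::vector<Candidate> lru = {{"A", /*nvm=*/true,  /*marks=*/false},
                                {"B", /*nvm=*/false, /*marks=*/true}};
  PutToken token;                        // lives across loop iterations
  const Candidate* chosen = nullptr;
  for (const auto& c : lru) {
    if (c.evictToNvmCache) token = PutToken{c.key};  // A's token is created...
    if (c.marksForEviction) { chosen = &c; break; }  // ...but B is chosen
  }
  // Prints: evicting B with token for A  -> the mismatch described above.
  std::printf("evicting %s with token for %s\n",
              chosen->key.c_str(), token.isValid() ? token.key.c_str() : "-");
}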

Contributor:

I saw a crash in a production test with the exception "Item is not nvm evicted and nvm clean", and I suspect this is the reason. I'm putting together a hacky fix on my end to resume the test.


@haowu14 (Contributor) left a comment:

@igchor I continued the production test with a hacky change where I do auto token_ = createPutToken(...) and only do token = std::move(token_) if the item is successfully marked for eviction. However, I noticed a regression in allocation latency, mostly caused by creating and destroying PutTokens.
I think this is because in this PR we try to create the put token before marking for eviction, while before this PR we checked refcount==0 and markMoving before trying to create the PutToken.
Can we do it the same way as before?
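
Reusing the toy PutToken/Candidate types from the stale-token sketch earlier in this thread, the workaround described here looks roughly like the fragment below (std::move mirrors the token = std::move(token_) in the quoted fix):

// Fragment of the candidate-selection loop (toy types from the earlier
// sketch). The token is created into a loop-local variable and only
// transferred to the outer `token` once markForEviction succeeds for the
// same candidate.
for (const auto& c : lru) {
  PutToken localToken;                      // scoped to this iteration
  if (c.evictToNvmCache) localToken = PutToken{c.key};
  if (c.marksForEviction) {
    token = std::move(localToken);          // token now matches the chosen item
    chosen = &c;
    break;
  }
  // localToken is destroyed here; this per-iteration create/destroy is the
  // PutToken churn blamed for the allocation-latency regression above.
}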

@facebook-github-bot:

@igchor has updated the pull request. You must reimport the pull request before landing.

@igchor (Contributor Author) commented Apr 25, 2023

@igchor I continued the production test with a hacky change where I do auto token_ = createPutToken(...) and only do token = std::move(token_) if the item is successfully marked for eviction. However, I noticed a regression in allocation latency, mostly caused by creating and destroying PutTokens. I think this is because in this PR we try to create the put token before marking for eviction, while before this PR we checked refcount==0 and markMoving before trying to create the PutToken. Can we do it the same way as before?

We need to create the token before marking for eviction because markForEviction is the point where the item stops being visible to other threads. I think we could do some optimizations here: for example, check the refcount / moving bit first and move to the next item if necessary (it's still possible we'd need to destroy the token, but that should be less common).

Could you share the eviction success rate and how much regression you have observed? I've run a few benchmarks where the eviction success rate was around 80% and didn't see any regression with the fix. I've also included the fix in this PR as a separate commit.

Also, have you tried running with "useCombinedLockForIterators": true ?

@haowu14 (Contributor) commented Apr 25, 2023

@igchor
I just realized I made a mistake in my hacky fix. I just fixed that and am rebuilding a binary. The mistake was checking the wrong token, which would always be invalid and resulted in advancing the iterator. Hopefully that makes the fact that we now create the token before marking for eviction less significant.

"useCombinedLockForIterators": true

Not yet. I'd first need to make sure that with this set to false there's no regression (I hope)?

@facebook-github-bot:

@haowu14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@igchor (Contributor Author) commented Apr 26, 2023

@igchor I just realized I made a mistake in my hacky fix. I just fixed that and am rebuilding a binary. The mistake was checking the wrong token, which would always be invalid and resulted in advancing the iterator. Hopefully that makes the fact that we now create the token before marking for eviction less significant.

"useCombinedLockForIterators": true

Not yet. I'd first need to make sure that with this set to false there's no regression (I hope)?

Got it. Right, regarding useCombinedLockForIterators, I would expect there is no regression when it's set to false. When it's set to 'true' I'd expect lower allocate latency and higher throughput at the cost of slightly higher find latency.

@haowu14 (Contributor) commented Apr 28, 2023

In the most recent version there's no significant regression; however, I don't see an improvement from combined locking either.

Our test is a shadow setup, meaning that the binaries running base (trunk) and test (this PR) receive the exact same get/set requests at the same time, regardless of how quickly each binary can process them. So we can't look at QPS directly for comparison.
Looking at p95 allocation latency, they look pretty much the same (test is about 1us slower; not sure if it's just an accounting error). The eviction success rate is between 55% and 70%, and it is the same most of the time across the two versions.

I'll run the experiment again from scratch, since this existing pair of servers ran my buggy code and may not have exactly the same cache content (but I don't think it matters much).

My questions for you:

  1. Any other metrics you'd like me to look at as a proxy to evaluate this throughput win?
  2. Can you remind me what the slowdown on the find path is?

@igchor (Contributor Author) commented Apr 28, 2023

In the most recent version there's no significant regression; however, I don't see an improvement from combined locking either.

Our test is a shadow setup, meaning that the binaries running base (trunk) and test (this PR) receive the exact same get/set requests at the same time, regardless of how quickly each binary can process them. So we can't look at QPS directly for comparison. Looking at p95 allocation latency, they look pretty much the same (test is about 1us slower; not sure if it's just an accounting error). The eviction success rate is between 55% and 70%, and it is the same most of the time across the two versions.

I'll run the experiment again from scratch, since this existing pair of servers ran my buggy code and may not have exactly the same cache content (but I don't think it matters much).

My questions for you:

  1. Any other metrics you'd like me to look at as a proxy to evaluate this throughput win?
  2. Can you remind me what the slowdown on the find path is?
  1. We actually saw the highest difference on p99 (and higher) latencies.
    I think combined locking helps the most when there is higher contention on the MMContainer lock. If that's not the case for the workload you're running, then there might be no difference. However, combined locking would probably still help if our multi-tier version were run on that workload (the critical section for multi-tier is longer, which drives contention up).

  2. The MMContainer lock is used in (at least) two places: in findEviction (called from allocate), to protect the search for an eviction candidate, and in markUseful, which is called by all readers (on find). Without combined locking, a thread calling find would just execute its own (short) critical section, or wait in case of contention. With combined locking, that thread might end up executing a queued task from findEviction as well, and since the findEviction critical section is longer, this can increase find latency. My understanding is that with combined locking we are pretty much splitting the work (under the MMContainer lock) more equally between allocate and find (see the sketch below).

Also, one more thing we've noticed when running benchmarks: with this patch we can get similar (or even better) performance (QPS and latency) while using a lower AC locksPower_ (because we no longer lock the AC under the MMContainer lock).
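
For readers unfamiliar with the idea, here is a minimal flat-combining sketch of what "combined locking" means in this discussion; it is illustrative only and is not CacheLib's MMContainer implementation:

#include <atomic>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>

// Minimal flat-combining sketch: whichever thread holds the lock drains the
// queue, so it may execute operations submitted by other threads.
class CombinedLock {
 public:
  // Run `op` under the lock, possibly on behalf of another thread.
  void run(std::function<void()> op) {
    auto done = std::make_shared<std::atomic<bool>>(false);
    {
      std::lock_guard<std::mutex> g(queueLock_);
      pending_.push([op = std::move(op), done] { op(); done->store(true); });
    }
    // Spin until our operation has run: either another combiner executes it
    // for us, or we acquire the lock and drain the queue ourselves.
    while (!done->load()) {
      if (mainLock_.try_lock()) {
        drain();
        mainLock_.unlock();
      }
    }
  }

 private:
  void drain() {
    for (;;) {
      std::function<void()> op;
      {
        std::lock_guard<std::mutex> g(queueLock_);
        if (pending_.empty()) return;
        op = std::move(pending_.front());
        pending_.pop();
      }
      op();  // could be a reader's markUseful *or* a queued findEviction
    }        // search, which is why find can absorb eviction work
  }

  std::mutex mainLock_;
  std::mutex queueLock_;
  std::queue<std::function<void()>> pending_;
};

In this shape, a thread calling find can end up draining a queued eviction search, which matches the trade-off described above: lower allocate latency and higher throughput at the cost of somewhat higher find latency.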

@therealgymmy (Contributor) left a comment:

Sorry for the late review. This LGTM overall. A few comments inline.

template <typename CacheTrait>
void CacheAllocator<CacheTrait>::unlinkItemForEviction(Item& it) {
  XDCHECK(it.isMarkedForEviction());
  XDCHECK(it.getRefCount() == 0);

Contributor:

nitpick: XDCHECK_EQ(0, it.getRefcount());

Contributor Author:

Done.

  XDCHECK(it.isMarkedForEviction());
  XDCHECK(it.getRefCount() == 0);
  accessContainer_->remove(it);
  removeFromMMContainer(it);

Contributor:

Didn't we already remove this item from the mm-container in findEviction at L1313 (if it was a parent or just a regular item)?

Contributor Author:

Yes, we might have already removed the item, but removeFromMMContainer checks whether the item is still in the MMContainer and only then calls mmContainer.remove(). We need to keep this call here because if the parent of the item changed, the item will still be in the MMContainer.
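
In toy form, the behavior being described is roughly the following (hypothetical, simplified; the real flag and helper live in CacheLib's Item and CacheAllocator):

// Simplified illustration: remove() is effectively a no-op for items that
// findEviction already unlinked under the iterator lock, but it still
// unlinks a chained item left in the container because its parent changed.
struct Item { bool inMMContainer = false; };
struct MMContainer {
  bool remove(Item& it) {
    if (!it.inMMContainer) return false;    // already removed earlier
    it.inMMContainer = false;               // real code unlinks from the LRU
    return true;
  }
};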

  // Since we managed to mark the item for eviction we must be the only
  // owner of the item.
  const auto ref = it.unmarkForEviction();
  XDCHECK(ref == 0u);

Contributor:

same nitpick: XDCHECK_EQ

Contributor Author:

Done

if (evictToNvmCache && !token_.isValid()) {
  stats_.evictFailConcurrentFill.inc();
} else if (candidate_->markForEviction()) {
  XDCHECK(candidate_->isMarkedForEviction());

Contributor:

Can we refactor this a bit to have fewer nested if-stmt?

e.g.

if (evictToNvmCache && !token_.isValid()) {
  stats_.evictFailConcurrentFill.inc();
  ++itr;
  continue;
}

auto markedForEviction = candidate_->markForEviction();
if (!markedForEviction) {
  ++itr;
  continue;
}

// ...

Contributor Author:

Sure, done.

if (candidate->getRefCount() != 0 || !candidate->markMoving()) {
  ++itr;

while (config_.evictionSearchTries == 0 ||
       config_.evictionSearchTries > searchTries) {

Contributor:

Do we still need this when we're doing a while loop already inside withEvictionIterator?

Can we just reset the iterator internally and restart the search under lock?

Contributor Author:

Potentially yes, but it might have an impact on performance. The current implementation behaves pretty much the same as the original implementation. If we got rid of this loop, the MMContainer lock could be held for a longer duration.

Resetting the iterator under the lock might actually be a better solution but I just didn't want to change the behavior too much in this PR - perhaps it would be better to do it in a separate PR and evaluate it?

Contributor Author:

Also, I think the outer loop is also needed for cases when releaseBackToAllocator ends up not releasing the item we intend to recycle - this is already done outside the lock.

@facebook-github-bot:

@igchor has updated the pull request. You must reimport the pull request before landing.

@haowu14 (Contributor) commented May 31, 2023

Performance tests have passed with autoFDO off, so we are good performance-wise.

@facebook-github-bot:

@haowu14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot:

@haowu14 merged this pull request in 80af379.

@facebook-github-bot:

This pull request has been reverted by ed9e088.

@haowu14 (Contributor) commented Jun 8, 2023

It crashed in a staging test for a service. Validating a fix; we'll land it again once it's fixed.

@haowu14 (Contributor) commented Jun 13, 2023

We upstreamed this again in ae7442a.

The problem was the lifetime of the putToken. It could outlive the point where the candidate is released and its key becomes empty. The destructor would then try to remove this empty key from the InflightPuts map and assert on success. It's fixed by making sure the putToken is destroyed right after attempting to insert into NVM, before trying to release the candidate back to the allocator.
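
A self-contained toy of the lifetime issue and the fix (not CacheLib code; the assert stands in for the InflightPuts bookkeeping):

#include <cassert>
#include <string>

// Toy model of the lifetime bug and the fix.
struct Item {
  std::string key = "k1";
  void releaseBackToAllocator() { key.clear(); }  // key becomes empty
};

struct PutToken {
  const Item* item;
  ~PutToken() {
    // Stand-in for removing the key from the in-flight puts map: it asserts
    // that the key is still non-empty, like the crash described above.
    assert(!item->key.empty());
  }
};

int main() {
  Item candidate;
  {
    PutToken token{&candidate};   // fix: scope the token so it is destroyed
  }                               // here, right after the (elided) NVM insert
  candidate.releaseBackToAllocator();
  // With the buggy ordering (token destroyed only after this release), the
  // assert above would fire on the empty key instead.
  return 0;
}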

@igchor (Contributor Author) commented Jun 14, 2023

@haowu14 thanks for letting us know! We'll try to apply the fix on our end and validate it as well. We've actually seen a few inconsistencies reported by cachebench over the last few days on our develop branch, but we thought they were due to other changes.
