
Consider switching to TinyUFO #411

Open
zaidoon1 opened this issue Apr 5, 2024 · 2 comments

Comments

@zaidoon1

zaidoon1 commented Apr 5, 2024

See https://crates.io/crates/TinyUFO. I think we would see a good perf boost from this.

@ben-manes

Actually, the policy they use is called W-TinyLFU, and it is documented on this wiki as a future feature. Their implementation has a few gaps that ideally would be addressed:

  • It does not adaptively adjust the region sizes or use jitter to avoid getting stuck by a flawed admission decision. These techniques improve the hit rate across a variety of workloads by adapting toward more optimal behavior rather than relying on a static configuration. It would be nice if both libraries could take advantage of those design improvements.
  • It appears from a casual read of the code that an update that grows or shrinks an entry while it is being moved across queues could lead to inconsistency in the region sizes. That is a tricky race that is easy to overlook.
  • A fully lock-free design may increase the complexity of handling races, be brittle to evolve, and may not outperform a lock-based one. For example, the cache read can be lock-free by using ring buffers to record the policy updates while applying the policy maintenance under a lock (see the sketch after this list). In either case eviction does not offer much parallelization, as it serializes on the queues. Thus one can get the performance benefits of a lock-free design while retaining the advantages of the more advanced functionality of a lock-based one (adaptive regions, O(1) expiration, etc.). For example, Java's Caffeine scored 940M reads/s and 585M/s mixed using 16 threads. The policy is less important (as proven first using LRU); it is the concurrency design that matters. The Pingora results are also excellent; it is just unclear if they have limited the cache's flexibility and made it harder to resolve bugs by taking too brittle a concurrency design approach. Or it might be a good shortcut to their performance goals if they don't want to include more advanced functionality (so, as usual, it all depends).
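Below is a minimal sketch of that read-buffer pattern: reads are recorded lock-free in a bounded ring buffer, and a single thread drains it into the policy under an ordinary lock. The names and structure are illustrative only (not Caffeine's or moka's actual internals), and `crossbeam_queue::ArrayQueue` stands in for the ring buffer.

```rust
use crossbeam_queue::ArrayQueue;
use std::sync::Mutex;

/// Policy bookkeeping (LRU order, frequency sketch, ...); mutated only under the lock.
struct PolicyState;

struct ReadRecorder {
    read_buffer: ArrayQueue<u64>, // lock-free ring buffer of recent read events
    policy: Mutex<PolicyState>,
}

impl ReadRecorder {
    fn new(capacity: usize) -> Self {
        Self {
            read_buffer: ArrayQueue::new(capacity),
            policy: Mutex::new(PolicyState),
        }
    }

    /// Hot path: record the read without blocking. If the buffer is full,
    /// drop the sample (fine for an approximate policy) and try to drain.
    fn record_read(&self, key_hash: u64) {
        if self.read_buffer.push(key_hash).is_err() {
            self.maybe_drain();
        }
    }

    /// Maintenance: one thread applies the buffered events under a plain
    /// lock; contending threads simply skip instead of waiting.
    fn maybe_drain(&self) {
        if let Ok(mut policy) = self.policy.try_lock() {
            while let Some(hash) = self.read_buffer.pop() {
                apply_read(&mut policy, hash);
            }
        }
    }
}

fn apply_read(_policy: &mut PolicyState, _key_hash: u64) {
    // e.g., bump the entry in the frequency sketch or reorder the LRU list.
}
```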

@tatsuya6502
Member

tatsuya6502 commented Apr 12, 2024

Hi. Thank you for the information. I reviewed the designs of TinyUFO and S3-FIFO. I also compared the performance of a TinyUFO implementation (the tinyufo crate) against moka using my mokabench tool with the ARC-S3 trace dataset.

Here are my first thoughts:

  1. We can learn from its design and could apply some of its ideas to moka.
  2. However, we do not have to rush, as the performance gain might be small and we have very limited development bandwidth.

Before looking at the mokabench result, let me clarify the current development priorities of moka.

(Ordered from highest to lowest priority)

  1. Provide a high cache hit rate.
  2. Easy to use.
    • e.g. no lock is needed in the user code.
  3. Convenient features (see the example after this list).
    • e.g. expirations, atomic insert (get_with, and_compute_with, etc.), eviction listener, iterator, cache statistics.
  4. Small memory footprint.
  5. Short compile time and small binary size.
  6. Small performance overhead compared to a concurrent hash table.
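To illustrate 2 and 3 concretely, here is a minimal sketch of how those conveniences look in user code with moka's sync::Cache builder (the capacity and TTL values are arbitrary, chosen only for illustration):

```rust
use std::time::Duration;
use moka::sync::Cache;

fn main() {
    // Bounded capacity plus time-to-live expiration, configured up front.
    let cache: Cache<String, String> = Cache::builder()
        .max_capacity(10_000)
        .time_to_live(Duration::from_secs(60))
        .build();

    // Atomic insert: the init closure runs at most once per key even under
    // concurrent calls, and no lock is needed in the user code.
    let value = cache.get_with("key1".to_string(), || "computed value".to_string());
    assert_eq!(value, "computed value");
}
```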

As for 1 (hit rate), I believe TinyUFO will provide a hit rate similar to W-TinyLFU with a fixed size of W. A high hit rate is critical for application performance, as the cache miss penalty (the extra latency to get the data from a slower medium) is much greater than the latency to read the data from the cache.
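To make that concrete, the expected latency per lookup is hit_ratio × cache_latency + (1 − hit_ratio) × miss_penalty. Here is a back-of-the-envelope sketch; the 100 µs SSD miss penalty is an assumption for illustration, not a measurement:

```rust
/// Expected latency per lookup:
/// hit_ratio * cache_latency + (1 - hit_ratio) * miss_penalty.
fn expected_ns(hit_ratio: f64, cache_ns: f64, miss_ns: f64) -> f64 {
    hit_ratio * cache_ns + (1.0 - hit_ratio) * miss_ns
}

fn main() {
    let ssd_ns = 100_000.0; // assumed ~100 µs miss penalty (SSD read)

    // Raising the hit ratio from 65% to 70% saves ~5 µs per lookup ...
    println!("{:.0} ns", expected_ns(0.65, 700.0, ssd_ns)); // ~35455 ns
    println!("{:.0} ns", expected_ns(0.70, 700.0, ssd_ns)); // ~30490 ns

    // ... while shaving 300 ns off the cache operation saves only ~0.2 µs.
    println!("{:.0} ns", expected_ns(0.65, 400.0, ssd_ns)); // ~35260 ns
}
```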

From the mokabench result I am going to show, the average latency per cache operation (read or write) was as follows:

  • moka: Between ~700 ns and ~850 ns (nanoseconds).
  • TinyUFO: Between ~400 ns and ~700 ns.

As for examples of cache miss penalties, here is a quote from the book Systems Performance: Enterprise and the Cloud, 2nd Edition, by Brendan Gregg:

[Image: systems-performance-p26 — a table of example system latencies and their scaled equivalents]

The "latency" column shows example system latencies, and the "scaled" column shows the latencies scaled to an imaginary system in which a CPU cycle takes one full second.

  • On the same scale, a 700 ns cache operation in real life takes 42 minutes.
  • SSD: 9 — 90 hours
  • HDD: 1 — 12 months
  • Internet: San Francisco to New York: 4 years

As you can see, 42 minutes is almost nothing compared to the latencies of accessing HDD or Internet.

So, in general, 1 (hit rate) is much more important than 6 (small performance overhead). And by design, TinyUFO and W-TinyLFU (with a fixed size of W) will provide competing hit rates.

As for 4 (memory footprint), TinyUFO can do better than W-TinyLFU, as they use queues and doubly linked lists respectively. A queue can be array-based and have a smaller memory footprint than a doubly linked list. However, a queue is not a suitable data structure for an efficient entry expiration algorithm, so moka may have to continue using doubly linked lists to support expirations (see the sketch below). This is a trade-off between 4 and 3 (convenient features).
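Here is a minimal sketch of why the doubly linked list wins for expiration; the field names and arena layout are illustrative, not moka's actual internals. Once an expired entry is located (e.g., via the hash table or a timer wheel), it can be unlinked in O(1), whereas removing an arbitrary entry from the middle of an array-based queue needs an O(n) scan or tombstones.

```rust
use std::time::Instant;

// One cache entry's node in an intrusive doubly linked list, stored in an
// arena and addressed by index (illustrative, not moka's actual layout).
struct Node {
    prev: Option<usize>,
    next: Option<usize>,
    expires_at: Instant, // checked by the expiration policy
}

struct ExpirationList {
    nodes: Vec<Node>, // arena; a real implementation would reuse freed slots
    head: Option<usize>,
    tail: Option<usize>,
}

impl ExpirationList {
    /// O(1): unlink node `i` by rewiring its neighbors; no scanning needed.
    fn unlink(&mut self, i: usize) {
        let (prev, next) = (self.nodes[i].prev, self.nodes[i].next);
        match prev {
            Some(p) => self.nodes[p].next = next,
            None => self.head = next,
        }
        match next {
            Some(n) => self.nodes[n].prev = prev,
            None => self.tail = prev,
        }
        self.nodes[i].prev = None;
        self.nodes[i].next = None;
    }
}
```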

There is a trade-off between 3 and 6 (small performance overhead) too. So moka, which prioritizes 3, will generally perform less well than other cache implementations that offer simpler functionality. But this will not be a problem in real-world applications, as 1 (hit rate) has a much greater impact on application performance than 6.

OK, having said that, here is the mokabench result.

  • Command: `cargo run -F tiny-ufo --release -- -n 1`
  • Number of clients (-n) was 1 (single thread).
  • Ran on a Linux x86_64 PC with an Intel Core i7-12700F.
  • The memory usage (RSS) was measured using the `top` command.
    • This is the whole memory used by the mokabench process.
| Cache | Max Capacity | Estimated number of keys for LFU | Clients | Inserts | Reads | Hit Ratio | Memory Usage (RSS) | Duration | Avg. ns per operation |
|---|---|---|---|---|---|---|---|---|---|
| Moka Sync Cache | 100,000 | n/a | 1 | 14,694,850 | 16,407,702 | 10.439% | 572,852k | 21.595 s | 694 ns |
| TinyUFO | 100,000 | 100,000 | 1 | 15,685,418 | 16,407,702 | 4.402% | 570,560k | 13.077 s | 407 ns |
| TinyUFO | 100,000 | 1,000,000 | 1 | 15,074,058 | 16,407,702 | 8.128% | 633,272k | 21.706 s | 689 ns |
| Moka Sync Cache | 400,000 | n/a | 1 | 9,439,274 | 16,407,702 | 42.470% | 725,132k | 21.926 s | 848 ns |
| TinyUFO | 400,000 | 400,000 | 1 | 11,456,282 | 16,407,702 | 30.177% | 723,656k | 17.174 s | 616 ns |
| TinyUFO | 400,000 | 4,000,000 | 1 | 10,948,476 | 16,407,702 | 33.272% | 994,580k | 19.670 s | 719 ns |
| Moka Sync Cache | 800,000 | n/a | 1 | 4,868,066 | 16,407,702 | 70.331% | 927,520k | 15.246 s | 716 ns |
| TinyUFO | 800,000 | 800,000 | 1 | 5,597,449 | 16,407,702 | 65.885% | 929,512k | 12.203 s | 555 ns |
| TinyUFO | 800,000 | 8,000,000 | 1 | 5,597,708 | 16,407,702 | 65.884% | 1.4g | 13.530 s | 614 ns |
  • The TinyUFO implementation was up to twice as fast as moka.
  • For some reason, TinyUFO provided a lower hit ratio than moka.
    • TinyUFO is expected to provide about the same hit ratio as moka.
  • For some reason, TinyUFO used about the same amount of memory as moka.
    • TinyUFO is expected to use a smaller amount of memory than moka.
    • I will probably need to subtract the baseline memory used by mokabench's driver.

It seems the current TinyUFO implementation can be improved for a better hit ratio and a smaller memory footprint. But even if it is improved, I am afraid that the performance gain might be small.
