
fix: BoundedQueue, and restart thread / keep alive on errors #38

Merged (24 commits) · Oct 30, 2024

Conversation

@aryascripts (Collaborator) commented Oct 28, 2024

Why

OOMs for GraphQL Hive Reporter due to @queue never flushing after an error occurs in the thread.

What is happening:

  1. The thread only processes while @queue returns something
  2. When @queue returns an operation that causes an error, the thread dies and the while loop ends with it
  3. add_operation can still be called, even though nothing is processing any operations
  4. @queue is unbounded
  5. @queue size keeps increasing, causing a memory leak
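The failure mode above can be sketched with plain Ruby primitives; a hedged reproduction, where the names and shapes are illustrative rather than the gem's actual internals:

```ruby
queue = Queue.new # unbounded, like the original @queue

worker = Thread.new do
  while (op = queue.pop)
    raise "boom" if op == :bad # a single bad operation...
  end
rescue
  # ...kills the loop; nothing drains the queue anymore
end

queue.push(:bad)
worker.join

# add_operation can still enqueue, with no consumer left:
100.times { queue.push(:op) }
puts queue.size # => 100, and it grows without bound
```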

What

  • Create a bounded queue to avoid OOMs
  • Restart / keep thread alive when there are errors, so we can keep processing
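The keep-alive idea can be sketched as rescuing inside the loop rather than around it, so one bad operation doesn't end the thread. This is a minimal illustration of the approach, not the gem's exact code:

```ruby
queue = Queue.new
processed = []
errors = []

worker = Thread.new do
  while (op = queue.pop)
    begin
      raise "boom" if op == :bad
      processed << op
    rescue => e
      errors << e.message # log-and-continue keeps the loop alive
    end
  end
end

queue.push(:bad)  # error is rescued; the thread survives
queue.push(:good) # still processed
queue.push(nil)   # sentinel: ends the loop cleanly
worker.join
puts processed.inspect # => [:good]
```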

@aryascripts aryascripts changed the title Aplt 606 oom on errors hive buffer fix: BoundedQueue, and restart thread / keep alive on errors Oct 28, 2024
Comment on lines 13 to 16
if size >= @bound
@logger.error("BoundedQueue is full, discarding operation")
return
end
Collaborator Author:

There's maybe a problem with the way it's being done here because this is not safe from concurrency. There could be many puma threads adding to this queue, and potentially they both get the same number for size, and add to the queue. Though this may be safe because we always break the cycle whenever the size is >= bound
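For reference, the usual way to make a check-then-push safe is to hold one mutex across both the size check and the append; a minimal sketch, where the class and names are hypothetical rather than this PR's implementation:

```ruby
# Hypothetical bounded queue: the bound check and the append happen
# under the same mutex, so two threads can't both observe size == 3
# and push past a bound of 4.
class TinyBoundedQueue
  def initialize(bound:)
    @bound = bound
    @items = []
    @lock = Mutex.new
  end

  def push(item)
    @lock.synchronize do
      return if @items.size >= @bound # drop instead of growing
      @items << item
    end
  end

  def size
    @lock.synchronize { @items.size }
  end
end

q = TinyBoundedQueue.new(bound: 4)
threads = 8.times.map { |i| Thread.new { q.push(i) } }
threads.each(&:join)
puts q.size # => 4
```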

Collaborator:

You can write a test that uses threads to push to the queue.

subject(:queue) { GraphQL::Hive::BoundedQueue.new(bound: 4, logger: logger) }

it "should discard items and log when full" do
  threads = 2.times.map do |i|
    Thread.new do
      3.times do |ii|
        queue.push("Thread #{i} operation #{ii}")
      end
    end
  end
  threads.each(&:join)

  expect(queue.size).to eq(4)
  expect(logger).to have_received(:error).with("BoundedQueue is full, discarding operation").twice
end

Collaborator Author (@aryascripts, Oct 29, 2024):

Thank you! I just added a test very similar to this one:
#38 (comment)

@@ -36,6 +36,7 @@ class Hive < GraphQL::Tracing::PlatformTracing
read_operations: true,
report_schema: true,
buffer_size: 50,
bounded_queue_multiple: 5,
Collaborator:

I'm guessing that you added this so that we don't drop operations when the queue is being flushed. I don't think we need a new configuration value here, though. We want to drop the buffer array anyway in the future.

Suggested change:
- bounded_queue_multiple: 5,

Collaborator Author (@aryascripts, Oct 29, 2024):

This one controls the size of the @queue in UsageReporter, which is different from the buffer size. Maybe it needs a better name!

If we have a buffer size of 5, we get a @queue of size 25 by default, and the puma threads can add a max of 25 operations before we start dropping them.

While puma is adding operations, we would be flushing the @queue continuously in the reporting thread, and making room for puma to add more. But if things get too fast (higher rps), we would start dropping since there's no more room in the queue. I've tested this locally with the k6 code.

Some services could update the bounded_queue_multiple to get a higher queue size if they're OK with the memory usage of the queue. I just thought 5 was a good number to start with, but I'm open to discussion!

Collaborator:

Yes, I understand that. My comment here is that I am doubtful we want to expose this as a configuration value. What benefit do users get from configuring the bounded queue size? And why is it a multiplier, for example?

I would recommend that we do not expose the bounded_queue size and, for now, make it a slightly larger number than the buffer size. After all, we have discussed dropping the buffer and just using the queue.

I would prefer to keep configuration values consistent so we don't have to deprecate and remove them in the future.

Collaborator Author:

Ah, I see what you mean, thank you for clarifying! I might be giving too much control here to the consumer of this gem.

Collaborator Author (@aryascripts, Oct 29, 2024):

I've updated this now to remove the bounded-multiple config value and just use buffer_size. We discussed that this bound should practically never be reached anyway, since we pop after every operation. But if it is reached, we probably have a larger problem to solve (like this OOM we're trying to fix).

Comment on lines 25 to 26
queue_bound = (options[:buffer_size] * options[:bounded_queue_multiple]).to_int
@queue = BoundedQueue.new(bound: queue_bound, logger: options[:logger])
Collaborator:

Suggested change:
- queue_bound = (options[:buffer_size] * options[:bounded_queue_multiple]).to_int
- @queue = BoundedQueue.new(bound: queue_bound, logger: options[:logger])
+ queue_bound = (options[:buffer_size].to_i * 5)
+ @queue = BoundedQueue.new(bound: queue_bound, logger: options[:logger])

because

irb(main):003> ("5" * 5).to_i
=> 55555
irb(main):004> ("5".to_i * 5)
=> 25

Collaborator Author:

Good catch, thank you!

@aryascripts (Collaborator Author) left a comment:

Just leaving some comments for reviewers, and flagging the parts where I'd like feedback.

end

def push(item)
@lock.synchronize do
Collaborator Author:

Should we also synchronize the pop here? That would mean that we sync between the Hive Reporting Thread (pop) and the main Thread for puma (push).

Collaborator:

Yes, it would make size accurate.

Collaborator Author:

Hmm, I thought I'd comment here on where I'm at. I'm having deadlock issues with the current approach: @lock.synchronize on pop means the pop happens first, the thread's while(operation) loop instantly ends, and there are no more threads available to process.

In tests, this means that the test instantly fails since the processing thread dies. And in the server, we seem to be running into the puma main thread never being able to get the lock 🤔

Collaborator Author:

The puma main thread isn't able to get the lock because we're constantly popping in our thread. We discussed that this refactor can be part of, or come after, #33 so we don't make too many changes here.
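For reference, a condition variable is the standard way to let a consumer wait without monopolizing the lock: `ConditionVariable#wait` releases the mutex while sleeping, so producers can acquire it to push. A hedged sketch of that pattern, not the gem's code (and the PR deliberately defers this refactor):

```ruby
# Hypothetical queue: pop waits on a condition variable instead of
# spinning, so the lock is free for push while the consumer sleeps.
class WaitingQueue
  def initialize
    @items = []
    @lock = Mutex.new
    @cond = ConditionVariable.new
  end

  def push(item)
    @lock.synchronize do
      @items << item
      @cond.signal # wake one waiting consumer
    end
  end

  def pop
    @lock.synchronize do
      @cond.wait(@lock) while @items.empty? # releases @lock while waiting
      @items.shift
    end
  end
end

q = WaitingQueue.new
consumer = Thread.new { q.pop }
q.push(:op) # the producer can get the lock even while pop is pending
puts consumer.value.inspect # => :op
```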

Comment on lines 48 to 60
it "should be thread-safe and discard items when full" do
  threads = []
  20.times do |i|
    threads << Thread.new do
      queue.push(i)
    end
  end

  threads.each(&:join)

  expect(queue.size).to eq(2)
  expect(logger).to have_received(:error).with("BoundedQueue is full, discarding operation").exactly(18).times
end
Collaborator Author:

I'd love to get some opinions on this test here!

Collaborator:

I think this should also test cases where the queue is popped after hitting the bound. I know you test that in a single thread on line 36, but we should demonstrate that the lock and bound work as expected in a multi-threaded scenario too.

Collaborator Author:

Good point! I added a new test case for this

@aryascripts aryascripts marked this pull request as ready for review October 29, 2024 15:39
@sampler = Sampler.new(options[:collect_usage_sampling], options[:logger]) # NOTE: logs for deprecated field
@queue = BoundedQueue.new(bound: options[:buffer_size], logger: options[:logger])
Contributor:

I know that the bounded_queue_multiple config was removed, but should we still have the bound be the buffer size * 5?

Suggested change:
- @queue = BoundedQueue.new(bound: options[:buffer_size], logger: options[:logger])
+ @queue = BoundedQueue.new(bound: options[:buffer_size].to_i * 5, logger: options[:logger])

Collaborator Author:

We had a discussion here about this! #38 (comment)

We don't need this to be considerably larger than the buffer size because the thread wakes up for every single operation, and calls pop.

So ideally, this queue won't ever grow beyond a size of ~1. But to be safe, we provide a larger bound (the buffer size). If we do reach this limit, we have other things going wrong.

Contributor:

That makes sense! I thought we still wanted it to be considerably larger. Thanks for clearing that up!

spec/graphql/graphql-hive/usage_reporter_spec.rb (outdated; resolved)

@options[:logger].debug("processing operation from queue: #{operation}")
buffer << operation if @sampler.sample?(operation)

@options_mutex.synchronize do
Collaborator:

Perhaps out of scope because you are fixing the OOM issue. But I don't think this mutex does anything. There isn't another thread accessing this code.

@aryascripts aryascripts requested review from rperryng and removed request for al-yanna October 30, 2024 15:31
@aryascripts aryascripts merged commit 06bfa53 into master Oct 30, 2024
5 checks passed