Implement operations status polling in backend pod #749

Open · mbarnes wants to merge 6 commits into main from backend-operations-scan
Conversation

@mbarnes (Collaborator) commented Oct 21, 2024

What this PR does

This adds functionality to the backend pod that updates Cosmos DB items with the status of active (non-terminal) asynchronous operations for clusters. Status polling for node pools is not currently possible; that part will be completed when the new /api/aro_hcp/v1 OCM endpoint becomes available.

This runs on two independent polling intervals, each of which can be overridden through environment variables.

The first poller, which by default runs every 30 seconds, scans the Cosmos DB "Operations" container for items with a non-terminal status and stores them in an internal list (see the sketch below).
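
A minimal sketch of such a ticker-driven loop, assuming hypothetical environment-variable and function names that are not taken from the PR:

```go
package backend

import (
	"context"
	"log/slog"
	"os"
	"time"
)

// pollInterval reads an interval override from the environment, falling
// back to a default. The environment variable names passed in are
// illustrative, not the ones the PR defines.
func pollInterval(envVar string, fallback time.Duration) time.Duration {
	if v := os.Getenv(envVar); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return fallback
}

// runLoop invokes fn on every tick until the context is cancelled.
func runLoop(ctx context.Context, interval time.Duration, fn func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := fn(ctx); err != nil {
				slog.Error("poll failed", "err", err)
			}
		}
	}
}
```

The two pollers would then be started as, e.g., `go runLoop(ctx, pollInterval("OPERATIONS_SCAN_INTERVAL", 30*time.Second), scanOperations)`, and likewise for the 10-second Cluster Service poller described next.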

The second poller, which by default runs every 10 seconds, iterates over that list and queries Cluster Service for the status of each resource. It then translates the Cluster Service status to a ProvisioningState value, records an error message for failed operations, and, with the subscription locked, updates the appropriate "Operations" item and "Resources" item in Cosmos DB.

Additionally, if a deletion operation has completed successfully, it deletes the corresponding "Resources" item from Cosmos DB.
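
To illustrate the translation step, a mapping along these lines could convert a Cluster Service state into a ProvisioningState; the type and state strings below are assumptions for the sketch, not the PR's actual definitions:

```go
package backend

// ProvisioningState represents the ARM-style provisioning status stored
// on "Operations" and "Resources" items; the values are illustrative.
type ProvisioningState string

const (
	ProvisioningStateProvisioning ProvisioningState = "Provisioning"
	ProvisioningStateDeleting     ProvisioningState = "Deleting"
	ProvisioningStateSucceeded    ProvisioningState = "Succeeded"
	ProvisioningStateFailed       ProvisioningState = "Failed"
)

// translateState maps a Cluster Service status string to a
// ProvisioningState. The Cluster Service state names here are assumed
// for the sketch, not quoted from the service.
func translateState(csState string) ProvisioningState {
	switch csState {
	case "ready":
		return ProvisioningStateSucceeded
	case "error":
		return ProvisioningStateFailed
	case "uninstalling":
		return ProvisioningStateDeleting
	default:
		return ProvisioningStateProvisioning
	}
}
```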

Jira: ARO-8598 - Prototype async op status notification mechanism for RP
Link to demo recording:

Special notes for your reviewer

For the record, I'm not a fan of this design and consider it a "Mark I" iteration. Polling introduces a lot of latency in the statuses reported by the Microsoft.RedHatOpenShift/hcpOpenShiftClusters API. But we're stuck with polling for now for two reasons:

  1. Microsoft's own Go SDK for Cosmos DB is minimally functional and lacks any support for change feeds, so notification of changes to Cosmos containers through the SDK is currently not possible and the backend cannot respond immediately to new operations. (There are other ways to address this, but the team agreed to initially just poll for simplicity's sake.)
  2. Cluster Service today lacks any kind of publish-subscribe mechanism and is, by design, unaware of the ARO-HCP RP. This could be addressed more easily if the recently posted ADR is approved. For example, a dedicated Cluster Service backend for ARO-HCP could update Cosmos DB directly and eliminate the need for polling altogether.

@SudoBrendan (Collaborator) left a comment


At a glance, this appears to do what we need. This feels like code that should get some tests because it is foundational - can we add them to solidify our requirements as code? E.g. happy paths, error paths, "no more than one call is made to Items since it is not concurrency safe", etc.

Comment on lines +32 to +53
func (iter operationCacheIterator) Items(ctx context.Context) iter.Seq[[]byte] {
	return func(yield func([]byte) bool) {
		for _, doc := range iter.operation {
			// Marshalling the document struct only to immediately unmarshal
			// it back to a document struct is a little silly but this is to
			// conform to the DBClientIterator interface.
			item, err := json.Marshal(doc)
			if err != nil {
				iter.err = err
				return
			}

			if !yield(item) {
				return
			}
		}
	}
}

func (iter operationCacheIterator) GetError() error {
	return iter.err
}
this doesn't look concurrency safe - not sure if that's a requirement. Should we at least log err as we set it? How would we know what call to Items failed?

@mbarnes (Author) commented Oct 23, 2024

It's not meant to be concurrency safe; it's meant to be used as part of a range-over-function pattern, which is new in Go 1.23. (Think generator functions in Python, if you're familiar with those.)

The pattern as described in the blog post doesn't address the possibility of iteration failing (such as when paging over results from a remote source) so I came up with this iterator interface as a way to stash an iterator error.

The idiom looks like:

for item := range iterator.Items(ctx) {
    ...
}

// Find out if iteration completed or aborted early.
err := iterator.GetError()

An iterator error would immediately break the "for" loop.
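
For illustration, here is a self-contained toy version of such an iterator over an in-memory slice (Go 1.23 or later, for iter.Seq); the failure injection and names are invented for the example:

```go
package backend

import (
	"context"
	"fmt"
	"iter"
)

// sliceIterator is a toy implementation of the iterator interface over an
// in-memory slice. The pointer receiver is what lets an error stashed
// during iteration survive until GetError is called.
type sliceIterator struct {
	items [][]byte
	err   error
}

func (s *sliceIterator) Items(ctx context.Context) iter.Seq[[]byte] {
	return func(yield func([]byte) bool) {
		for i, item := range s.items {
			if item == nil { // injected failure, standing in for a marshal or paging error
				s.err = fmt.Errorf("item %d: invalid document", i)
				return // ends the caller's range loop immediately
			}
			if !yield(item) {
				return // caller broke out of the loop early
			}
		}
	}
}

func (s *sliceIterator) GetError() error {
	return s.err
}
```

Ranging over `(&sliceIterator{items: docs}).Items(ctx)` and then checking `GetError()` follows exactly the idiom above.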

@mbarnes (Author) added:
Also maybe worth mentioning, since I didn't realize you were highlighting the in-memory cache...

I don't believe this particular function is currently used. I wrote it for the in-memory cache in order to fulfill the DBClient interface, but the in-memory cache is nowadays only used to mock database operations in unit tests.

@mbarnes (Author) commented Oct 23, 2024

> This feels like code that should get some tests because it is foundational - can we add them to solidify our requirements as code?

Are you referring to the new database functions or to the backend code paths... or both? The backend might be difficult to unit test at the moment since parts of it rely on database locking (see #680), which I don't have a way to mock yet. (Mocking should be doable; it's just not done.)

@mbarnes force-pushed the backend-operations-scan branch 3 times, most recently from b60460d to a8d253e on October 28, 2024
Matthew Barnes added 6 commits on October 28, 2024:

- Converts the result of azcosmos.ContainerItem.NewQueryItemsPager to a failable push iterator (push iterators are new in Go 1.23). This also defines a DBClientIterator interface so the in-memory cache can mimic QueryItemsIterator. (A sketch of this conversion follows the commit list.)
- Will add something similar for node pools once the new "aro_hcp" API is available from Cluster Service.
- Periodically scans the "Operations" Cosmos DB container.
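
As noted in the first commit, the conversion wraps an azcosmos query pager in a push iterator. A sketch of what that might look like, with struct and field names assumed from the commit message rather than taken from the diff; only the azcosmos pager calls are real SDK API:

```go
package database

import (
	"context"
	"iter"

	azruntime "github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime"
	"github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
)

// queryItemsIterator adapts a Cosmos DB query pager to the failable
// push-iterator shape discussed above.
type queryItemsIterator struct {
	pager *azruntime.Pager[azcosmos.QueryItemsResponse]
	err   error
}

func (q *queryItemsIterator) Items(ctx context.Context) iter.Seq[[]byte] {
	return func(yield func([]byte) bool) {
		for q.pager.More() {
			page, err := q.pager.NextPage(ctx)
			if err != nil {
				q.err = err // stash for GetError; ends the range loop
				return
			}
			for _, item := range page.Items {
				if !yield(item) {
					return
				}
			}
		}
	}
}

func (q *queryItemsIterator) GetError() error {
	return q.err
}
```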