Fix replication service denial after MaxObjectSize increase #2911

Draft
wants to merge 2 commits into master

Conversation

cthulhu-rider (Contributor)

Overall, the solution is very simple: listen to config changes and update the server limit. But...

gRPC

The lib does not provide an option to update the max recv msg size on a running server: https://pkg.go.dev/google.golang.org/grpc#MaxRecvMsgSize is static. It can be supported very easily, so I patched my fork. If we decide this proposal is good, I suggest adding an organization fork (and later proposing the change to the upstream).
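
For illustration, a minimal sketch of how the node side could apply the setting if such an option existed; `SetMaxRecvMsgSize` is a hypothetical method of the forked server, not an upstream grpc-go API, and the constants are illustrative:

```go
package main

// forkedServer models the single extension the fork would need: a runtime
// setter for the receive limit (hypothetical, not present in upstream grpc-go).
type forkedServer interface {
	SetMaxRecvMsgSize(bytes int)
}

// onMaxObjectSizeChange is what the node's config watcher could call whenever
// the MaxObjectSize network setting is refreshed.
func onMaxObjectSizeChange(srv forkedServer, maxObjectSize uint64) {
	const maxHeaderSize = 16 << 10 // current protocol header size limit
	const requestSlack = 1 << 10   // room for the signature and protobuf service bytes

	srv.SetMaxRecvMsgSize(int(maxObjectSize) + maxHeaderSize + requestSlack)
}
```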

There is an alternative approach that I don't like at all:

  1. gracefully stop the running server
  2. create a new one with the updated limit and run it

Although MaxObjectSize is expected to change very rarely, this method looks very ugly given the simplicity of the native option (see the sketch after the list).
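
A rough sketch of that restart-based alternative with stock grpc-go; the address, registration callback and error handling are simplified and illustrative, not the actual node wiring:

```go
package main

import (
	"net"

	"google.golang.org/grpc"
)

// restartWithLimit drains the old server (which also closes its listener) and
// brings up a fresh one with the new receive limit.
func restartWithLimit(addr string, old *grpc.Server, limit int, register func(*grpc.Server)) (*grpc.Server, error) {
	if old != nil {
		old.GracefulStop() // finish in-flight RPCs, then stop
	}

	lis, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err
	}

	srv := grpc.NewServer(grpc.MaxRecvMsgSize(limit))
	register(srv) // re-register ObjectService and the other services

	go func() { _ = srv.Serve(lis) }()

	return srv, nil
}
```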

Notifications

We don't have them now (nspcc-dev/neofs-contract#427). Polling is used as a temporary solution.


This brings us closer to fixing the test within which the bug was originally detected. It should be noted that there is no 100% stable solution for it: there will always be a gap between the contract update and the node's reaction. The only option I see for now is to ignore overflow errors for about 1-2 minutes and retry.
cc @evgeniiz321
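
On the test side, the retry idea could look roughly like this; the helper name, the 2-minute window and the 5-second step are assumptions, while the `ResourceExhausted` check reflects how grpc-go reports oversized messages:

```go
package main

import (
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// putWithRetry retries an object write while the node still rejects it as too
// large, giving it time to pick up the new MaxObjectSize via polling.
func putWithRetry(put func() error) error {
	deadline := time.Now().Add(2 * time.Minute)

	for {
		err := put()
		if err == nil {
			return nil
		}
		// grpc-go reports a too-large message as ResourceExhausted
		// ("received message larger than max ...").
		if status.Code(err) != codes.ResourceExhausted || time.Now().After(deadline) {
			return err
		}
		time.Sleep(5 * time.Second)
	}
}
```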

Since the storage node serves the `ObjectService.Replicate` RPC, the
gRPC server must be able to accept the biggest allowed object.
Previously, the node set the max size of received gRPC messages to the
sum of the payload (`MaxObjectSize` network setting) and header (16K at
the moment) size limits. This was not entirely correct because the
replication request message also contains non-object fields (only the
signature for now) and protobuf service bytes. Thus, when the size of
an object approached the maximum allowed, it was possible to overflow
the calculated limit and get a refusal of service.

This adds 1 KB to the calculated limit. This is a heuristic value that
is larger than the extra request data of the current protocol version
while leaving room for small extensions in advance. The exact value is
not critical given how small these amounts are compared to the payload
limit.

Signed-off-by: Leonard Lyubich <[email protected]>
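
For clarity, the limit arithmetic described in the message above, as a sketch with illustrative constant names:

```go
package main

const (
	maxHeaderSize = 16 << 10 // object header limit fixed by the protocol
	requestSlack  = 1 << 10  // 1 KB for the signature and other non-object fields
)

// maxRecvMsgSize returns the gRPC receive limit for a given value of the
// MaxObjectSize network setting.
func maxRecvMsgSize(maxObjectSize uint64) uint64 {
	return maxObjectSize + maxHeaderSize + requestSlack
}
```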
Since the storage node serves the `ObjectService.Replicate` RPC, the
gRPC server must be able to accept the biggest allowed object.
Previously, the node calculated the global message limit for the gRPC
server once on startup. With this behavior, when the `MaxObjectSize`
network setting was increased, the node stopped accepting written
objects larger than the previous limit. This manifested itself as a
denial of replication service.

From now on, the storage node updates the max received gRPC message
size (if needed) on each refresh of the `MaxObjectSize` setting cache
and via polling of the Netmap contract done once per minute.

Refs #2910.

Signed-off-by: Leonard Lyubich <[email protected]>
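
A rough sketch of the polling-driven update described above; the function and callback names are illustrative, not the actual neofs-node wiring:

```go
package main

import (
	"context"
	"time"
)

// pollMaxObjectSize periodically reads the MaxObjectSize setting (e.g. from
// the Netmap contract) and applies a new gRPC receive limit when it changes.
func pollMaxObjectSize(ctx context.Context, read func() (uint64, error), apply func(limit uint64)) {
	const headerAndSlack = (16 << 10) + (1 << 10) // header limit + request slack

	var last uint64

	t := time.NewTicker(time.Minute) // once per minute, as described above
	defer t.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			size, err := read()
			if err != nil || size == last {
				continue
			}
			last = size
			apply(size + headerAndSlack)
		}
	}
}
```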

codecov bot commented Aug 8, 2024

Codecov Report

Attention: Patch coverage is 31.81818% with 45 lines in your changes missing coverage. Please review.

Project coverage is 23.78%. Comparing base (73e8414) to head (19f39df).

| Files | Patch % | Lines |
| --- | --- | --- |
| cmd/neofs-node/netmap.go | 0.00% | 16 Missing ⚠️ |
| cmd/neofs-node/cache.go | 0.00% | 15 Missing ⚠️ |
| cmd/neofs-node/grpc.go | 65.62% | 11 Missing ⚠️ |
| cmd/neofs-node/object.go | 0.00% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2911      +/-   ##
==========================================
+ Coverage   23.75%   23.78%   +0.02%     
==========================================
  Files         774      774              
  Lines       44866    44911      +45     
==========================================
+ Hits        10660    10681      +21     
- Misses      33356    33380      +24     
  Partials      850      850              


@roman-khimov (Member)

> If we decide this proposal is good

It's OK with me (it's a new feature there, it solves a problem, and it's not very invasive). But I wouldn't like to use any forks of the upstream library, so a server restart is perfectly fine with me as well, as a temporary solution (until your patch is accepted). This is a very rare event.

We can consider some handling on the test side as well. Node restart will solve this for sure.

@roman-khimov (Member)

Please upstream the library improvement.
