[Living Ticket] Scalability related efforts #621

Open
1 of 11 tasks
okdas opened this issue Jun 18, 2024 · 6 comments

Comments

@okdas
Member

okdas commented Jun 18, 2024

Objective

Ensure that Shannon scales both on-chain & off-chain.

Origin Document

This issue is intended to be a living document to keep track of all related efforts.

Identified issues and points of investigation

  • RelayMiner RAM usage: [Relay Miner] Address high memory usage #551
  • AppGateServer (and Gateway) CPU #Infrastructure
    • Relays are not going through, and CPU utilization is at its limit. We need to capture pprof snapshots & evaluate them.
  • Validator scalability
    • Ensure the validator's resource usage (CPU, RAM, etc.) stays reasonable when the number of claims & proofs grows VERY LARGE. Note: This is why we need distribution of claims & proofs.
    • Probabilistic Proofs: make sure these parameters are adjusted properly for both: #Algorithmic
      • Validator scalability #Infrastructure
      • Discouraging adversarial actors #Algorithmic
  • Relay Mining #Algorithmic #Infrastructure
    • Ensure gateway consumption is reasonable #Infrastructure
    • Ensure relayminer footprint is as small as possible #Infrastructure

Things to investigate:

  • Replacing the KV store in the SMT (e.g. BadgerDB or other)
  • Keeping things in memory or flushing to disk
  • Changing parameters

Creator: @okdas
Co-Owners: @red-0ne @bryanchriswhite @Olshansk

@okdas okdas added this to the Shannon Beta TestNet Launch milestone Jun 18, 2024
@okdas okdas self-assigned this Jun 18, 2024
@Olshansk Olshansk changed the title [Scalability] Living ticket: tracking related efforts [Living Ticket] Scalability related efforts Jun 19, 2024
@Olshansk
Member

@okdas Made some changes, updates & improvements to this ticket. PTAL

@okdas
Member Author

okdas commented Aug 17, 2024

To investigate: ran into a panic. We are potentially not handling the error from the RPC gracefully:

Panic Error

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x3b15c58]

Goroutine Stack Trace

goroutine 297 [running]:
github.com/pokt-network/poktroll/pkg/relayer/session.(*sessionTree).Delete(0x4003e33040)
    /Users/dk/pocket/poktroll/pkg/relayer/session/sessiontree.go:250 +0xc8

github.com/pokt-network/poktroll/pkg/relayer/session.(*relayerSessionsManager).deleteExpiredSessionTreesFn.func1({0x51e5e00, 0x4000c07830}, {0x4001675aa0, 0x1, 0x1})
    /Users/dk/pocket/poktroll/pkg/relayer/session/session.go:456 +0x278

github.com/pokt-network/poktroll/pkg/observable/channel.ForEach[...].func1({0x4001675aa0, 0x1, 0x1})
    /Users/dk/pocket/poktroll/pkg/observable/channel/map.go:103 +0x6c

github.com/pokt-network/poktroll/pkg/observable/channel.goMapTransformNotification[...]({0x51e5e00, 0x4000c07830}, {0x51df2b0, 0x400157b620}, 0x40012bd008, 0x40012bd050, 0x40012da480)
    /Users/dk/pocket/poktroll/pkg/observable/channel/map.go:125 +0xc4

created by github.com/pokt-network/poktroll/pkg/observable/channel.Map[...] in goroutine 1
    /Users/dk/pocket/poktroll/pkg/observable/channel/map.go:24 +0x318

Related Log Messages

2024-08-16 17:19:22.783    {"level":"debug","message":"deleting expired session"}

2024-08-16 17:19:22.781    {"level":"error","error":"with hash: a451156fe642c5f425af9bc1818ae423307789be0a4c581d26621f7fc698a419: error in json rpc client, with http response metadata: (Status: 200 OK, Protocol HTTP/1.1). RPC error -32603 - Internal error: tx (A451156FE642C5F425AF9BC1818AE423307789BE0A4C581D26621F7FC698A419) not found: error encountered while querying for tx","message":"failed to create claims"}

2024-08-16 17:19:22.783    {"level":"error","error":"with hash: a451156fe642c5f425af9bc1818ae423307789be0a4c581d26621f7fc698a419: error in json rpc client, with http response metadata: (Status: 200 OK, Protocol HTTP/1.1). RPC error -32603 - Internal error: tx (A451156FE642C5F425AF9BC1818AE423307789BE0A4C581D26621F7FC698A419) not found: error encountered while querying for tx"}

@okdas
Member Author

okdas commented Aug 17, 2024

To investigate. Given the nature of RelayMiner we need it to try to recover first.

RelayMiner stops on:
{"level":"error","work_name":"goPublishEvents","error":"eventsqueryclient connection closed","message":"on retry: 1"}

@Olshansk
Member

@okdas This is related to the observable, so I think we may be reaching a place where:

  1. A deadlock happens (or something mutex related)
  2. The observable is blocked on events (either empty or too many)

Do you mind creating a dedicated ticket for your comment here for @bryanchriswhite to tackle?

okdas added a commit that referenced this issue Aug 22, 2024
## Summary

## Issue

- #551 
- #621 

## Type of change

Select one or more:

- [ ] New feature, functionality or library
- [ ] Bug fix
- [x] Code health or cleanup
- [ ] Documentation
- [ ] Other (specify)

## Testing

**Documentation changes** (only if making doc changes)
- [ ] `make docusaurus_start`; only needed if you make doc changes

**Local Testing** (only if making code changes)
- [ ] **Unit Tests**: `make go_develop_and_test`
- [ ] **LocalNet E2E Tests**: `make test_e2e`
- See [quickstart
guide](https://dev.poktroll.com/developer_guide/quickstart) for
instructions

**PR Testing** (only if making code changes)
- [ ] **DevNet E2E Tests**: Add the `devnet-test-e2e` label to the PR.
- **THIS IS VERY EXPENSIVE**, so only do it after all the reviews are
complete.
- Optionally run `make trigger_ci` if you want to re-trigger tests
without any code changes
- If tests fail, try re-running failed tests only using the GitHub UI as
shown
[here](https://github.com/pokt-network/poktroll/assets/1892194/607984e9-0615-4569-9452-4c730190c1d2)


## Sanity Checklist

- [ ] I have tested my changes using the available tooling
- [ ] I have commented my code
- [ ] I have performed a self-review of my own code; both comments &
source code
- [ ] I create and reference any new tickets, if applicable
- [ ] I have left TODOs throughout the codebase, if applicable

---------

Co-authored-by: Daniel Olshansky <[email protected]>
Co-authored-by: Bryan White <[email protected]>
@okdas
Member Author

okdas commented Sep 30, 2024

There are some new issues uncovered by #742 (more details in that ticket). So far nothing is super critical, and we are addressing issues as we find them.

@okdas
Member Author

okdas commented Oct 29, 2024

Just to provide an update: we've been finding and resolving different issues lately, mostly within the scope of #742.

Projects
Status: 🏗 In progress