Retrospective: OP Drop #1

Summary

On May 31, 2022, prior to the official launch announcement from Optimism, some users noticed that Airdrop #1 claims were live and commented in Discord and on Twitter. Soon after, we saw a massive influx of traffic that caused several of our internal services to fail under the load. We provisioned additional capacity both internally and with our infrastructure providers in response to the failures. During this process some users experienced delays or were prevented from claiming their airdrop. Additionally, our public endpoint experienced high error rates and degraded performance, and the combination of issues caused frustration within our community. Our internal teams worked closely with partners and infrastructure providers to mitigate issues and make claims available to all users attempting to access their tokens. The issues were resolved at 2:57 pm PST that day, with all systems operating normally.

We dramatically underestimated the amount of traffic that the airdrop would create. We’re humbled by the excitement from the community, but because we didn’t expect this response, we hadn’t given Alchemy a heads-up. Since nodes can take up to 26 hours to start up, we had agreed to give Alchemy a 48-hour heads-up on large capacity increases. However, it was only after the airdrop began that we found out we needed to 7x the capacity of our public endpoint - which meant doubling the global capacity of Optimism.

We’d like to extend special thanks to Alchemy, who dropped everything and deployed 10 engineers to implement creative solutions to keep our public endpoint up, while they immediately started doubling their global capacity for Optimism. They have been supporting our public endpoint for free for months, and we couldn’t have withstood the traffic without their help.

While we’re thrilled that OP Airdrop #1 is live, we recognize the frustrations our users experienced and have detailed the day’s events - and more importantly, what we’ve learned and how we’ll apply those lessons to future airdrops. We hope this retrospective provides helpful takeaways for other foundations and organizations planning airdrops as well.

Timeline

All times listed in PST.

  • May 31 Optimism team deploys airdrop contract
  • 07:35 Team partially funds airdrop contract
  • 07:38 Claims frontend is tested on staging
  • 07:54 Airdrop contract is funded with full amount
  • 07:54 Team deploys claims frontend
  • 07:55 Some Discord users notice and post that claims frontend is live
  • 07:58 Tweets from users also gain traction
  • 08:09 Claims backend stops responding to traffic
  • 08:15 Public endpoint shows elevated error rates
  • 08:16 Team reaches out to Alchemy to request a ~700% capacity increase on the public endpoint. The airdrop requires more capacity than every other app and user on Optimism combined, so the Alchemy engineering team goes heads down hacking together a way to handle double the global capacity of Optimism while new nodes spin up
  • 08:20 Team makes the decision to postpone public airdrop announcement scheduled for 8:30
  • 08:25 Alchemy increases capacity on public endpoint
  • 08:25 Team takes down claims frontend to prevent users from hitting errors
  • 08:26 Edge proxy nodes exhaust available Redis connections
  • 09:10 Indexer stops responding to requests
  • 09:15 Edge proxy nodes begin experiencing resource exhaustion
  • 09:40 Alchemy continues increasing capacity, and spinning up new nodes
  • 09:54 Indexer service is restored
  • 09:55 Warp Speed deposit processing halts
  • 10:11 Redis instance is resized to avoid connection pool exhaustion
  • 10:15 Edge proxy Redis cache is turned off, since resizing the instances did not solve the issue
  • 10:30 Claims frontend is put back up, despite degraded performance
  • 10:36 Doubled number of edge proxy boxes
  • 11:16 Alchemy continues increasing capacity
  • 11:20 Banner is added to claims site to alert users of delayed transactions
  • 11:26 Team reaches out to additional providers, QuickNode and Infura, to increase capacity if needed
  • 11:31 Degraded performance declared on status page
  • 11:54 Work begins to limit archive requests at our edge proxy
  • 11:57 Team is alerted to lags with the batch submitter
  • 12:06 Team joins video call with Alchemy to debug archive issues
  • 12:10 Soft announce of airdrop — explained increased load situation, but did not yet announce that claims were live
  • 12:21 Claims frontend is taken down again
  • 12:28 Pingdom alert is deployed for the claims API
  • 12:54 Rate limiter on archival requests is deployed at edge proxy
  • 12:55 Team reduces batch submitter confirmation depth from 6 to 3
  • 13:00 We notice that Discord invite links are rate limited
  • 13:18 Claims frontend is put back up
  • 14:10 Alchemy deploys a follow-up to the stopgap to restore full archive support and finishes provisioning additional capacity
  • 14:10 Update is deployed to Warp Speed to clear the backlog
  • 14:15 After determining it was safe to do so, batch submitter confirmation depth is further reduced to 2
  • 14:17 Status page is set to monitoring
  • 14:45 Airdrop is officially announced from Optimism’s Twitter account
  • 14:57 Status page is set to resolved
  • 15:44 A fix is pushed for a bug on the claims backend that prevented claims
  • 16:55 Warp Speed backlog clears
  • 17:01 A fix is implemented for case sensitivity in claims API
  • 17:42 Team publishes a mini retro and hosts a Discord AMA
  • 17:57 Delayed transaction banner is removed from website

Leadup

After performing internal testing on staging, we enabled the claims frontend in preparation for the public launch of OP Airdrop #1. Community members on Discord noticed almost immediately, and high-traction tweets followed shortly afterwards. Traffic to our public endpoint, indexer, and claims flow backend increased by a factor of 10.

Causes

Our internal services were unable to keep up with demand. Almost all of our infrastructure was affected: the public endpoint, Warp Speed, the indexer, and the new claims backend. Each service fell over for different reasons, so we’ll describe each cause individually:

Public Endpoint

  • Insufficient edge proxy capacity.
  • Optimism didn’t give Alchemy sufficient advance notice to increase capacity.

Claims Flow

  • The claims flow initiated multiple requests to the public endpoint for each user request. This caused us to rate limit ourselves (see the sketch after this list).
  • The API was case sensitive to addresses.
  • The API was improperly managing disk usage, which caused Kubernetes pods to be evicted.
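
The first point above is an amplification problem: a single user action fanned out into several upstream RPC calls. One common way to limit that fan-out is to share in-flight lookups so that concurrent requests for the same address hit the public endpoint only once. The sketch below illustrates that pattern; the `fetchDelegateInfo` and `getDelegateInfo` names are hypothetical stand-ins, and this is not necessarily how the actual fix was implemented.

```typescript
// Hypothetical sketch: collapse duplicate upstream lookups so each address
// triggers at most one in-flight request to the public endpoint, instead of
// one (or several) per user request.

type DelegateInfo = { delegate: string; votes: string };

// Stand-in for the real RPC call (e.g. an eth_call against the token contract).
async function fetchDelegateInfo(address: string): Promise<DelegateInfo> {
  return { delegate: address, votes: "0" };
}

const inFlight = new Map<string, Promise<DelegateInfo>>();

function getDelegateInfo(address: string): Promise<DelegateInfo> {
  const key = address.toLowerCase();
  const existing = inFlight.get(key);
  if (existing) return existing;

  // Share the pending promise with any concurrent callers, and clear it once
  // the upstream request settles so later calls fetch fresh data.
  const pending = fetchDelegateInfo(key).finally(() => inFlight.delete(key));
  inFlight.set(key, pending);
  return pending;
}
```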

Indexer

  • Insufficient Kubernetes replica capacity.
  • Insufficient database CPU.

Warp Speed

  • The deposit processing service used the public endpoint, and so was unable to query chain state when the endpoint went down.
  • The disburser API tried to do too many disbursements at once, which caused the disbursement contract to revert.

Batch Submitter

  • Submission rate was too low to keep up with demand.

Impact

  • Some users were unable to claim their airdrop at various points throughout the day, while users of the public endpoint had difficulty interacting with the chain at all.
  • For a period of time, more technical users were able to get their claims through by calling the contract directly, while users relying on the claims flow UI were not. This made the airdrop feel unfair, and magnified user frustration.
  • Centralized exchange launches and deposits were delayed several hours.
  • Deposits made using Warp Speed were backed up for several hours.

Recovery

Public Endpoint

  • We reached out to Alchemy, who doubled global Optimism capacity for us.
  • Alchemy pushed a stopgap measure to temporarily route more traffic to full nodes rather than archive nodes.
  • We temporarily limited archival request depth to 64 blocks to reduce pressure on Alchemy’s archive nodes (see the sketch after this list).
  • We doubled the size of the edge proxy’s cache, and when that was ineffective disabled caching altogether.
  • We doubled the number of active proxy instances from 15 to 30.
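
For illustration, the archival-depth limit above can be expressed as a check on the block tag of state-reading JSON-RPC calls. The sketch below is a simplified, hypothetical version of that check; the method list and `MAX_ARCHIVE_DEPTH` constant are assumptions, not the actual edge proxy code.

```typescript
// Hypothetical sketch: reject state reads deeper than MAX_ARCHIVE_DEPTH blocks
// behind the chain head so they never reach the overloaded archive nodes.
// The method list and constant are assumptions, not the real proxy config.

const MAX_ARCHIVE_DEPTH = 64;

// JSON-RPC methods whose last parameter is a block tag.
const STATE_METHODS = new Set([
  "eth_call",
  "eth_getBalance",
  "eth_getCode",
  "eth_getStorageAt",
  "eth_getTransactionCount",
]);

interface JsonRpcRequest {
  method: string;
  params: unknown[];
}

function isTooDeep(req: JsonRpcRequest, latestBlock: number): boolean {
  if (!STATE_METHODS.has(req.method) || req.params.length === 0) return false;

  const blockTag = req.params[req.params.length - 1];
  // Non-hex tags ("latest", "pending") are passed through in this sketch.
  if (typeof blockTag !== "string" || !blockTag.startsWith("0x")) return false;

  return latestBlock - parseInt(blockTag, 16) > MAX_ARCHIVE_DEPTH;
}

// Usage inside the proxy's request handler (sketch):
//   if (isTooDeep(rpcRequest, latestBlockNumber)) {
//     return jsonRpcError(-32000, "archival requests are temporarily limited");
//   }
```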

Claims Backend

  • We fixed a bug that caused the claims backend to make multiple RPC requests for each delegate request.
  • We fixed a bug that was causing pod disk growth to increase and the pods to be killed.
  • We pushed a fix that made queries for address proofs case insensitive (see the sketch after this list).
  • We added a Pingdom alert to notify us of any further issues with the claims backend.
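
The case-sensitivity fix boils down to normalizing addresses before using them as lookup keys. A minimal sketch, assuming a hypothetical in-memory proofs map rather than the backend’s real storage layer:

```typescript
// Hypothetical sketch: key the proofs store by lowercased address so that
// checksummed, all-lowercase, and all-uppercase inputs resolve to the same
// claim. The real backend's storage layer is not shown here.

interface ClaimProof {
  amount: string;
  proof: string[];
}

const proofsByAddress = new Map<string, ClaimProof>();

function normalizeAddress(address: string): string {
  if (!/^0x[0-9a-fA-F]{40}$/.test(address)) {
    throw new Error(`invalid address: ${address}`);
  }
  return address.toLowerCase();
}

function storeProof(address: string, proof: ClaimProof): void {
  proofsByAddress.set(normalizeAddress(address), proof);
}

function getProof(address: string): ClaimProof | undefined {
  // Works for 0xAbC..., 0xabc..., and 0xABC... alike.
  return proofsByAddress.get(normalizeAddress(address));
}
```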

Indexer

  • We increased the number of replica pods.
  • We doubled the size of the indexer’s database.

Warp Speed

  • We switched Warp Speed’s L2 endpoint to a dedicated Alchemy URL.
  • We updated Warp Speed to send a maximum of 15 disbursements per transaction, as sketched below. This circumvented a size limit that Warp Speed was running into while trying to clear the backlog.
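
The batching change can be illustrated with a short sketch; the `Disbursement` type and `sendDisbursementTx` function are hypothetical stand-ins for the real Warp Speed code.

```typescript
// Hypothetical sketch: split the disbursement backlog into chunks of at most
// MAX_DISBURSEMENTS_PER_TX so no single transaction exceeds the size the
// disbursement contract can handle. Types and function names are stand-ins.

const MAX_DISBURSEMENTS_PER_TX = 15;

interface Disbursement {
  recipient: string;
  amount: bigint;
}

// Stand-in for the call that submits one disbursement transaction on-chain.
async function sendDisbursementTx(batch: Disbursement[]): Promise<void> {
  console.log(`submitting batch of ${batch.length} disbursements`);
}

function chunk<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

async function clearBacklog(backlog: Disbursement[]): Promise<void> {
  // Submit sequentially so one oversized or reverting batch doesn't take the
  // whole backlog down with it.
  for (const batch of chunk(backlog, MAX_DISBURSEMENTS_PER_TX)) {
    await sendDisbursementTx(batch);
  }
}
```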

Batch Submitter

  • We reduced the number of confirmations the batch submitter requires before submitting a new batch, first from 6 to 3 and later to 2 (see the sketch below).
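
For context, the confirmation depth controls how many L1 confirmations the submitter waits for on its previous batch transaction before sending the next one, so lowering it trades a small amount of reorg safety for higher submission throughput. The sketch below is a simplified illustration of that wait, not the batch submitter’s actual code.

```typescript
// Hypothetical sketch: the submitter waits until its previous batch
// transaction has `confirmationDepth` L1 confirmations before sending the
// next batch. Lowering the depth (6 -> 3 -> 2) shortens that wait and raises
// throughput, at the cost of slightly higher reorg risk.

interface L1Client {
  getBlockNumber(): Promise<number>;
  // Returns the block number the transaction was included in, or null if it
  // is still pending.
  getTxBlockNumber(txHash: string): Promise<number | null>;
}

async function waitForConfirmations(
  l1: L1Client,
  txHash: string,
  confirmationDepth: number,
  pollMs = 1_000,
): Promise<void> {
  for (;;) {
    const [head, included] = await Promise.all([
      l1.getBlockNumber(),
      l1.getTxBlockNumber(txHash),
    ]);
    if (included !== null && head - included + 1 >= confirmationDepth) {
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```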

Lessons Learned

Principles

Plan Ahead for Traffic Increases

We dramatically underestimated the amount of traffic that the airdrop would create. This was a planning failure, and could have been avoided had we incorporated the following:

  • Perform load tests regularly. Load tests let us find performance bugs that only happen when the system is under stress. They also give us a baseline upon which we can calculate our future capacity needs. Had we performed a load test prior to launch, we could have caught the Redis issues we saw in our edge proxy and the multiple-RPCs-per-request problem in the claims backend, and had a much more rigorous idea of how much traffic we could handle (see the sketch after this list).
  • Over-provision - it’s cheaper than going down. We didn’t provision any more infrastructure to support launch because we thought we were already over-provisioned by 4x. This was the wrong call. It’s much cheaper to over-provision and then scale down than it is to under-provision and have to scale up under load.
  • Give partners a ton of warning. We should make sure that our infrastructure providers and partners have at least 48 hours’ notice to provision additional infrastructure. We didn’t ask partners to provision new capacity before the launch itself. It takes 26 hours to spin up a new node, which creates a hard cap on new capacity beyond a certain point. No provider could increase capacity by 10x with no warning.
  • Prioritize concurrent batch submission. We will continue to fall behind on batch submission during periods of high throughput until we implement the concurrent batch submitter. We should do this sooner rather than later, since this problem does not go away with Bedrock.
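
As an example of the kind of baseline load test described in the first point above, the sketch below fires concurrent `eth_blockNumber` requests at an endpoint and reports request count, error count, and average latency. The endpoint URL, concurrency, and duration are placeholders.

```typescript
// Hypothetical sketch of a baseline load test: `CONCURRENCY` workers each send
// JSON-RPC requests for `DURATION_MS`, then we report totals. The URL and
// numbers are placeholders. Requires Node 18+ for the global fetch API.

const ENDPOINT = "https://rpc.example.invalid"; // placeholder endpoint
const CONCURRENCY = 50;
const DURATION_MS = 30_000;

interface Stats {
  requests: number;
  errors: number;
  totalLatencyMs: number;
}

async function worker(stats: Stats, deadline: number): Promise<void> {
  while (Date.now() < deadline) {
    const start = Date.now();
    try {
      const res = await fetch(ENDPOINT, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify({
          jsonrpc: "2.0",
          id: 1,
          method: "eth_blockNumber",
          params: [],
        }),
      });
      if (!res.ok) stats.errors++;
    } catch {
      stats.errors++;
    }
    stats.requests++;
    stats.totalLatencyMs += Date.now() - start;
  }
}

async function main(): Promise<void> {
  const stats: Stats = { requests: 0, errors: 0, totalLatencyMs: 0 };
  const deadline = Date.now() + DURATION_MS;
  await Promise.all(
    Array.from({ length: CONCURRENCY }, () => worker(stats, deadline)),
  );
  const avg = stats.requests ? stats.totalLatencyMs / stats.requests : 0;
  console.log(
    `requests=${stats.requests} errors=${stats.errors} avgLatencyMs=${avg.toFixed(1)}`,
  );
}

main().catch(console.error);
```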

Get Out of the Endpoint Business

We’re currently subsidizing the ecosystem’s addiction to the public endpoint. We should continue implementing our plan to get out of the endpoint business. Specifically:

  • Have additional providers. Mission critical services need redundancy. Having multiple providers available with sufficient capacity to handle the load lets us split traffic between them, and reduces the amount that each provider needs to scale up in order to handle the traffic (see the sketch after this list). We should have this ready before launching.
  • Discourage non-MetaMask usage of the public endpoint. Too many people are using the public endpoint from data centers, or as part of their replicas. We should start discouraging these use cases. Additionally, large dApps like Uniswap should pay for their own endpoint.
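
To make the multi-provider point concrete, the sketch below shows one way an edge layer could split RPC traffic by weight and fail over to the next provider on error. The provider names, URLs, and weights are placeholders rather than our actual routing configuration.

```typescript
// Hypothetical sketch: pick an upstream provider by weight and fall back to
// the others on failure, so no single provider has to absorb the full load
// (or the full outage). Names, URLs, and weights are placeholders.

interface Provider {
  name: string;
  url: string;
  weight: number; // relative share of traffic
}

const PROVIDERS: Provider[] = [
  { name: "provider-a", url: "https://provider-a.invalid", weight: 3 },
  { name: "provider-b", url: "https://provider-b.invalid", weight: 1 },
];

function pickProvider(providers: Provider[]): Provider {
  const total = providers.reduce((sum, p) => sum + p.weight, 0);
  let roll = Math.random() * total;
  for (const p of providers) {
    roll -= p.weight;
    if (roll <= 0) return p;
  }
  return providers[providers.length - 1];
}

async function forwardRpc(body: string): Promise<Response> {
  // Try the weighted pick first, then the remaining providers in order.
  const first = pickProvider(PROVIDERS);
  const ordered = [first, ...PROVIDERS.filter((p) => p !== first)];
  let lastError: unknown = new Error("no providers configured");
  for (const provider of ordered) {
    try {
      const res = await fetch(provider.url, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body,
      });
      if (res.ok) return res;
      lastError = new Error(`${provider.name} returned ${res.status}`);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```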

Button Up Our Practices

  • Update the status page quickly when issues occur. It took us several hours to update the status page after the public endpoint went down. This left users in the dark, and contributed to their frustration. It also eroded trust in the status page itself. Whenever something happens that impacts a production service, the status page should be updated within minutes of confirming the issue.
  • Inform our community with timely and transparent communications. With uncertainty about how long the launch would be delayed - and in an effort to avoid preempting planned announcement communications - we remained quiet in the midst of increased questions and speculation from the community in Discord and on Twitter. Moving forward, we’ll structure our communications to prioritize transparency and community experience.
  • Monitoring and alerting is a requirement to go to prod. The claims backend crashed, and it was our users who let us know. We need monitoring and alerting in place for every service before it goes live to production traffic. There’s too much going on on launch day to manually watch every dashboard. We need to get pushed alerts when things go down.
  • Don’t use free-tier infra internally. We shouldn’t use the public endpoint for internal services for the same reasons the community shouldn’t. We should use a dedicated Alchemy key per service.
  • Lock down our launches. On-chain activity is public, so whenever smart contracts are involved in a launch those contracts should be pause-able.
  • Don’t sacrifice quality for speed. The claims backend was rushed, and had little oversight or review. We should put a cultural stake in the ground and clearly say that we won’t sacrifice that oversight in order to make a launch.
  • Minimize cross-team dependencies. The OP Labs engineering team has historically been structured by function (e.g., "frontend team", "client team") and not by product. This has increased the need for cross-team communication and increased the overhead required to ship production-quality software. OP Labs is in the process of transitioning its engineering team into a product-oriented structure which should significantly reduce the number of cross-team dependencies and generally diffuse engineering knowledge across the entire organization.

Action Items

Public Endpoint

  • Add Redis read replicas to edge proxy
  • Increase size of edge proxy VMs (in progress)
  • Split public RPC traffic through multiple providers (in progress)
  • Implement rate limiting for non-browser user agents (in progress; see the sketch after this list)
  • Share developer best practices for using RPC endpoints (in progress)
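
As a rough illustration of the user-agent rate limiting item above, the sketch below applies a tighter per-IP limit to clients whose User-Agent does not look like a browser. The thresholds, the browser heuristic, and the in-memory counter store are placeholders, not the production edge configuration.

```typescript
// Hypothetical sketch: apply a tighter per-IP limit to clients whose
// User-Agent does not look like a browser. The thresholds, the browser
// heuristic, and the in-memory counters are placeholders.

const BROWSER_LIMIT_PER_MIN = 300;
const NON_BROWSER_LIMIT_PER_MIN = 30;

const BROWSER_UA = /Mozilla|Chrome|Safari|Firefox|Edg/;

interface Counter {
  windowStart: number;
  count: number;
}

const counters = new Map<string, Counter>();

function allowRequest(ip: string, userAgent: string, now = Date.now()): boolean {
  const limit = BROWSER_UA.test(userAgent)
    ? BROWSER_LIMIT_PER_MIN
    : NON_BROWSER_LIMIT_PER_MIN;

  const counter = counters.get(ip);
  if (!counter || now - counter.windowStart >= 60_000) {
    // Start a new one-minute window for this IP.
    counters.set(ip, { windowStart: now, count: 1 });
    return true;
  }

  counter.count++;
  return counter.count <= limit;
}
```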

Infra

  • Go through our infra, and remove any configurations that use the public endpoint (in progress)
  • Create internal documentation for best practices on periodic load testing
  • Ensure that alerting is in place for:
    • Warp Speed
    • Claims backend
  • Scope, plan, and implement concurrent batch submission (in progress)

Errata

  • Disable text-to-speech on Discord globally (if possible)
  • Document that users should not make assumptions around block time until Bedrock