Retrospective: OP Drop #1

Summary

On May 31, 2022, prior to the official launch announcement from Optimism, some users noticed that Airdrop #1 claims were live and commented in Discord and on Twitter. Soon after, we saw a massive influx of traffic that caused several of our internal services to fail under the load. We provisioned additional capacity both internally and with our infrastructure providers in response to the failures. During this process some users experienced delays or were prevented from claiming their airdrop. Additionally, our public endpoint experienced high error rates and degraded performance, and the combination of issues caused frustration within our community. Our internal teams worked closely with partners and infrastructure providers to mitigate issues and make claims available to all users attempting to access their tokens. The issues were resolved at 2:57 pm PST that day, with all systems operating normally.

We dramatically underestimated the amount of traffic that the airdrop would create. We’re humbled by the excitement from the community, but because we didn’t expect this response, we hadn’t given Alchemy a heads-up. Since nodes can take up to 26 hours to start up, we had agreed to give Alchemy a 48-hour heads-up on large capacity increases. However, it was only after the airdrop began that we found out we needed to 7x the capacity of our public endpoint - which meant doubling the global capacity of Optimism.

We’d like to extend special thanks to Alchemy, who dropped everything and deployed 10 engineers to implement creative solutions to keep our public endpoint up, while they immediately started doubling their global capacity for Optimism. They have been supporting our public endpoint for free for months, and we couldn’t have withstood the traffic without their help.

While we’re thrilled that OP Airdrop #1 is live, we recognize the frustrations our users experienced and have detailed the day’s events - and more importantly, what we’ve learned and how we’ll apply those lessons to future airdrops. We hope this retrospective provides helpful takeaways for other foundations and organizations planning airdrops as well.

Timeline

All times listed in PST.

  • May 31 Optimism team deploys airdrop contract
  • 07:35 Team partially funds airdrop contract
  • 07:38 Claims frontend is tested on staging
  • 07:54 Airdrop contract is funded with full amount
  • 07:54 Team deploys claims frontend
  • 07:55 Some Discord users notice and post that claims frontend is live
  • 07:58 Tweets from users also gain traction
  • 08:09 Claims backend stops responding to traffic
  • 08:15 Public endpoint shows elevated error rates
  • 08:16 Team reaches out to Alchemy to request a ~700% capacity increase on the public endpoint. The airdrop requires more capacity than every other app and user on Optimism combined, so the Alchemy engineering team goes heads down hacking together a way to handle double the global capacity of Optimism while new nodes spin up
  • 08:20 Team makes the decision to postpone public airdrop announcement scheduled for 8:30
  • 08:25 Alchemy increases capacity on public endpoint
  • 08:25 Team takes down claims frontend to prevent users from hitting errors
  • 08:26 Edge proxy nodes exhaust available Redis connections
  • 09:10 Indexer stops responding to requests
  • 09:15 Edge proxy nodes begin experiencing resource exhaustion
  • 09:40 Alchemy continues increasing capacity, and spinning up new nodes
  • 09:54 Indexer service is restored
  • 09:55 Warp Speed deposit processing halts
  • 10:11 Redis instance is resized to avoid connection pool exhaustion
  • 10:15 Edge proxy Redis cache is turned off, since resizing the instances did not solve the issue
  • 10:30 Claims frontend is put back up, despite degraded performance
  • 10:36 Doubled number of edge proxy boxes
  • 11:16 Alchemy continues increasing capacity
  • 11:20 Banner is added to claims site to alert users of delayed transactions
  • 11:26 Team reaches out to additional providers, QuickNode and Infura, to increase capacity if needed
  • 11:31 Degraded performance declared on status page
  • 11:54 Work begins to limit archive requests at our edge proxy
  • 11:57 Team is alerted to lags with the batch submitter
  • 12:06 Team joins video call with Alchemy to debug archive issues
  • 12:10 Soft announce of airdrop — explained increased load situation, but did not yet announce that claims were live
  • 12:21 Claims frontend is taken down again
  • 12:28 Pingdom alert is deployed for the claims API
  • 12:54 Rate limiter on archival requests is deployed at edge proxy
  • 12:55 Team reduces batch submitter confirmation depth from 6 to 3
  • 13:00 We notice that Discord invite links are rate limited
  • 13:18 Claims frontend is put back up
  • 14:10 Alchemy deploys a follow-up to the stopgap to restore full archive support and finishes provisioning additional capacity
  • 14:10 Update is deployed to Warp Speed to clear the backlog
  • 14:15 After determining it was safe to do so, batch submitter confirmation depth is further reduced to 2
  • 14:17 Status page is set to monitoring
  • 14:45 Airdrop is officially announced from Optimism’s Twitter account
  • 14:57 Status page is set to resolved
  • 15:44 A fix is pushed for a bug on the claims backend that prevented claims
  • 16:55 Warp Speed backlog clears
  • 17:01 A fix is implemented for case sensitivity in claims API
  • 17:42 Team publishes a mini retro and hosts a Discord AMA
  • 17:57 Delayed transaction banner is removed from website

Leadup

After performing internal testing on staging, we enabled the claims frontend in preparation for the public launch of OP Airdrop #1. Community members on Discord noticed almost immediately, and high-traction tweets followed shortly afterwards. Traffic to our public endpoint, indexer, and claims flow backend increased by a factor of 10.

Causes

Our internal services were unable to keep up with demand. Almost all of our infrastructure was affected: the public endpoint, Warp Speed, the indexer, and the new claims backend. Each service fell over for different reasons, so we’ll describe each cause individually:

Public Endpoint

  • Insufficient edge proxy capacity.
  • Optimism didn’t give Alchemy sufficient advance notice to increase capacity.

Claims Flow

  • The claims flow initiated multiple requests to the public endpoint for each user request. This caused us to rate limit ourselves (see the sketch after this list).
  • The API was case sensitive to addresses.
  • The API was improperly managing disk usage, which caused Kubernetes pods to be evicted.
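
The first point above is an amplification problem: a single user action fanned out into several upstream RPC calls. One common way to limit that fan-out is to share in-flight lookups so that concurrent requests for the same address hit the public endpoint only once. The sketch below illustrates that pattern; the `fetchDelegateInfo` and `getDelegateInfo` names are hypothetical stand-ins, and this is not necessarily how the actual fix was implemented.

```typescript
// Hypothetical sketch: collapse duplicate upstream lookups so each address
// triggers at most one in-flight request to the public endpoint, instead of
// one (or several) per user request.

type DelegateInfo = { delegate: string; votes: string };

// Stand-in for the real RPC call (e.g. an eth_call against the token contract).
async function fetchDelegateInfo(address: string): Promise<DelegateInfo> {
  return { delegate: address, votes: "0" };
}

const inFlight = new Map<string, Promise<DelegateInfo>>();

function getDelegateInfo(address: string): Promise<DelegateInfo> {
  const key = address.toLowerCase();
  const existing = inFlight.get(key);
  if (existing) return existing;

  // Share the pending promise with any concurrent callers, and clear it once
  // the upstream request settles so later calls fetch fresh data.
  const pending = fetchDelegateInfo(key).finally(() => inFlight.delete(key));
  inFlight.set(key, pending);
  return pending;
}
```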

Indexer

  • Insufficient Kubernetes replica capacity.
  • Insufficient database CPU.

Warp Speed

  • The deposit processing service used the public endpoint, and so was unable to query chain state when the endpoint went down.
  • The disburser API tried to do too many disbursements at once, which caused the disbursement contract to revert.

Batch Submitter

  • Submission rate was too low to keep up with demand.

Impact

  • Some users were unable to claim their airdrop at various points throughout the day, while users of the public endpoint had difficulty interacting with the chain at all.
  • For a period of time, more technical users were able to get their claims through by calling the contract directly, while users relying on the claims flow UI were not. This made the airdrop feel unfair, and magnified user frustration.
  • Centralized exchange launches and deposits were delayed several hours.
  • Deposits made using Warp Speed were backed up for several hours.

Recovery

Public Endpoint

  • We reached out to Alchemy, who doubled global Optimism capacity for us.
  • Alchemy pushed a stopgap measure to temporarily route more traffic to full nodes rather than archive nodes.
  • We temporarily limited archival request depth to 64 blocks to reduce pressure on Alchemy’s archive nodes (see the sketch after this list).
  • We doubled the size of the edge proxy’s cache, and when that was ineffective disabled caching altogether.
  • We doubled the number of active proxy instances from 15 to 30.
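
For illustration, the archival-depth limit above can be expressed as a check on the block tag of state-reading JSON-RPC calls. The sketch below is a simplified, hypothetical version of that check; the method list and `MAX_ARCHIVE_DEPTH` constant are assumptions, not the actual edge proxy code.

```typescript
// Hypothetical sketch: reject state reads deeper than MAX_ARCHIVE_DEPTH blocks
// behind the chain head so they never reach the overloaded archive nodes.
// The method list and constant are assumptions, not the real proxy config.

const MAX_ARCHIVE_DEPTH = 64;

// JSON-RPC methods whose last parameter is a block tag.
const STATE_METHODS = new Set([
  "eth_call",
  "eth_getBalance",
  "eth_getCode",
  "eth_getStorageAt",
  "eth_getTransactionCount",
]);

interface JsonRpcRequest {
  method: string;
  params: unknown[];
}

function isTooDeep(req: JsonRpcRequest, latestBlock: number): boolean {
  if (!STATE_METHODS.has(req.method) || req.params.length === 0) return false;

  const blockTag = req.params[req.params.length - 1];
  // Non-hex tags ("latest", "pending") are passed through in this sketch.
  if (typeof blockTag !== "string" || !blockTag.startsWith("0x")) return false;

  return latestBlock - parseInt(blockTag, 16) > MAX_ARCHIVE_DEPTH;
}

// Usage inside the proxy's request handler (sketch):
//   if (isTooDeep(rpcRequest, latestBlockNumber)) {
//     return jsonRpcError(-32000, "archival requests are temporarily limited");
//   }
```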

Claims Backend

  • We fixed a bug that caused the claims backend to make multiple RPC requests for each delegate request.
  • We fixed a bug that was causing pod disk growth to increase and the pods to be killed.
  • We pushed a fix that made queries for address proofs case insensitive (see the sketch after this list).
  • We added a Pingdom alert to notify us of any further issues with the claims backend.
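
The case-sensitivity fix boils down to normalizing addresses before using them as lookup keys. A minimal sketch, assuming a hypothetical in-memory proofs map rather than the backend’s real storage layer:

```typescript
// Hypothetical sketch: key the proofs store by lowercased address so that
// checksummed, all-lowercase, and all-uppercase inputs resolve to the same
// claim. The real backend's storage layer is not shown here.

interface ClaimProof {
  amount: string;
  proof: string[];
}

const proofsByAddress = new Map<string, ClaimProof>();

function normalizeAddress(address: string): string {
  if (!/^0x[0-9a-fA-F]{40}$/.test(address)) {
    throw new Error(`invalid address: ${address}`);
  }
  return address.toLowerCase();
}

function storeProof(address: string, proof: ClaimProof): void {
  proofsByAddress.set(normalizeAddress(address), proof);
}

function getProof(address: string): ClaimProof | undefined {
  // Works for 0xAbC..., 0xabc..., and 0xABC... alike.
  return proofsByAddress.get(normalizeAddress(address));
}
```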

Indexer

  • We increased the number of replica pods.
  • We doubled the size of the indexer’s database.

Warp Speed

  • We switched Warp Speed’s L2 endpoint to a dedicated Alchemy URL.
  • We updated Warp Speed to send a maximum of 15 disbursements per transaction, as sketched below. This circumvented a size limit that Warp Speed was running into while trying to clear the backlog.
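
The batching change can be illustrated with a short sketch; the `Disbursement` type and `sendDisbursementTx` function are hypothetical stand-ins for the real Warp Speed code.

```typescript
// Hypothetical sketch: split the disbursement backlog into chunks of at most
// MAX_DISBURSEMENTS_PER_TX so no single transaction exceeds the size the
// disbursement contract can handle. Types and function names are stand-ins.

const MAX_DISBURSEMENTS_PER_TX = 15;

interface Disbursement {
  recipient: string;
  amount: bigint;
}

// Stand-in for the call that submits one disbursement transaction on-chain.
async function sendDisbursementTx(batch: Disbursement[]): Promise<void> {
  console.log(`submitting batch of ${batch.length} disbursements`);
}

function chunk<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

async function clearBacklog(backlog: Disbursement[]): Promise<void> {
  // Submit sequentially so one oversized or reverting batch doesn't take the
  // whole backlog down with it.
  for (const batch of chunk(backlog, MAX_DISBURSEMENTS_PER_TX)) {
    await sendDisbursementTx(batch);
  }
}
```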

Batch Submitter

  • We reduced the number of confirmations the batch submitter requires before submitting a new batch, first from 6 to 3 and later to 2 (see the sketch below).
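
For context, the confirmation depth controls how many L1 confirmations the submitter waits for on its previous batch transaction before sending the next one, so lowering it trades a small amount of reorg safety for higher submission throughput. The sketch below is a simplified illustration of that wait, not the batch submitter’s actual code.

```typescript
// Hypothetical sketch: the submitter waits until its previous batch
// transaction has `confirmationDepth` L1 confirmations before sending the
// next batch. Lowering the depth (6 -> 3 -> 2) shortens that wait and raises
// throughput, at the cost of slightly higher reorg risk.

interface L1Client {
  getBlockNumber(): Promise<number>;
  // Returns the block number the transaction was included in, or null if it
  // is still pending.
  getTxBlockNumber(txHash: string): Promise<number | null>;
}

async function waitForConfirmations(
  l1: L1Client,
  txHash: string,
  confirmationDepth: number,
  pollMs = 1_000,
): Promise<void> {
  for (;;) {
    const [head, included] = await Promise.all([
      l1.getBlockNumber(),
      l1.getTxBlockNumber(txHash),
    ]);
    if (included !== null && head - included + 1 >= confirmationDepth) {
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```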

Lessons Learned

Principles

Plan Ahead for Traffic Increases

We dramatically underestimated the amount of traffic that the airdrop would create. This was a planning failure, and could have been avoided had we incorporated the following:

  • Perform load tests regularly. Load tests let us find performance bugs that only happen when the system is under stress. They also give us a baseline upon which we can calculate our future capacity needs. Had we performed a load test prior to launch, we could have caught the Redis issues we saw in our edge proxy and the multiple-RPCs-per-request problem in the claims backend, and had a much more rigorous idea of how much traffic we could handle (see the sketch after this list).
  • Over-provision - it’s cheaper than going down. We didn’t provision any more infrastructure to support launch because we thought we were already over-provisioned by 4x. This was the wrong call. It’s much cheaper to over-provision and then scale down than it is to under-provision and have to scale up under load.
  • Give partners a ton of warning. We should make sure that our infrastructure providers and partners have at least 48 hours’ notice to provision additional infrastructure. We didn’t ask partners to provision new capacity before the launch itself. It takes 26 hours to spin up a new node, which creates a hard cap on new capacity beyond a certain point. No provider could increase capacity by 10x with no warning.
  • Prioritize concurrent batch submission. We will continue to fall behind on batch submission during periods of high throughput until we implement the concurrent batch submitter. We should do this sooner rather than later, since this problem does not go away with Bedrock.
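
As an example of the kind of baseline load test described in the first point above, the sketch below fires concurrent `eth_blockNumber` requests at an endpoint and reports request count, error count, and average latency. The endpoint URL, concurrency, and duration are placeholders.

```typescript
// Hypothetical sketch of a baseline load test: `CONCURRENCY` workers each send
// JSON-RPC requests for `DURATION_MS`, then we report totals. The URL and
// numbers are placeholders. Requires Node 18+ for the global fetch API.

const ENDPOINT = "https://rpc.example.invalid"; // placeholder endpoint
const CONCURRENCY = 50;
const DURATION_MS = 30_000;

interface Stats {
  requests: number;
  errors: number;
  totalLatencyMs: number;
}

async function worker(stats: Stats, deadline: number): Promise<void> {
  while (Date.now() < deadline) {
    const start = Date.now();
    try {
      const res = await fetch(ENDPOINT, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify({
          jsonrpc: "2.0",
          id: 1,
          method: "eth_blockNumber",
          params: [],
        }),
      });
      if (!res.ok) stats.errors++;
    } catch {
      stats.errors++;
    }
    stats.requests++;
    stats.totalLatencyMs += Date.now() - start;
  }
}

async function main(): Promise<void> {
  const stats: Stats = { requests: 0, errors: 0, totalLatencyMs: 0 };
  const deadline = Date.now() + DURATION_MS;
  await Promise.all(
    Array.from({ length: CONCURRENCY }, () => worker(stats, deadline)),
  );
  const avg = stats.requests ? stats.totalLatencyMs / stats.requests : 0;
  console.log(
    `requests=${stats.requests} errors=${stats.errors} avgLatencyMs=${avg.toFixed(1)}`,
  );
}

main().catch(console.error);
```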

Get Out of the Endpoint Business

We’re currently subsidizing the ecosystem’s addiction to the public endpoint. We should continue implementing our plan to get out of the endpoint business. Specifically:

  • Have additional providers. Mission critical services need redundancy. Having multiple providers available with sufficient capacity to handle the load lets us split traffic between them, and reduces the amount that each provider needs to scale up in order to handle the traffic (see the sketch after this list). We should have this ready before launching.
  • Discourage non-MetaMask usage of the public endpoint. Too many people are using the public endpoint from data centers, or as part of their replicas. We should start discouraging these use cases. Additionally, large dApps like Uniswap should pay for their own endpoint.
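
To make the multi-provider point concrete, the sketch below shows one way an edge layer could split RPC traffic by weight and fail over to the next provider on error. The provider names, URLs, and weights are placeholders rather than our actual routing configuration.

```typescript
// Hypothetical sketch: pick an upstream provider by weight and fall back to
// the others on failure, so no single provider has to absorb the full load
// (or the full outage). Names, URLs, and weights are placeholders.

interface Provider {
  name: string;
  url: string;
  weight: number; // relative share of traffic
}

const PROVIDERS: Provider[] = [
  { name: "provider-a", url: "https://provider-a.invalid", weight: 3 },
  { name: "provider-b", url: "https://provider-b.invalid", weight: 1 },
];

function pickProvider(providers: Provider[]): Provider {
  const total = providers.reduce((sum, p) => sum + p.weight, 0);
  let roll = Math.random() * total;
  for (const p of providers) {
    roll -= p.weight;
    if (roll <= 0) return p;
  }
  return providers[providers.length - 1];
}

async function forwardRpc(body: string): Promise<Response> {
  // Try the weighted pick first, then the remaining providers in order.
  const first = pickProvider(PROVIDERS);
  const ordered = [first, ...PROVIDERS.filter((p) => p !== first)];
  let lastError: unknown = new Error("no providers configured");
  for (const provider of ordered) {
    try {
      const res = await fetch(provider.url, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body,
      });
      if (res.ok) return res;
      lastError = new Error(`${provider.name} returned ${res.status}`);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```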

Button Up Our Practices

  • Update the status page quickly when issues occur. It took us several hours to update the status page after the public endpoint went down. This left users in the dark, and contributed to their frustration. It also eroded trust in the status page itself. Whenever something happens that impacts a production service, the status page should be updated within minutes of confirming the issue.
  • Inform our community with timely and transparent communications. With uncertainty about how long the launch would be delayed - and in an effort to avoid preempting planned announcement communications - we remained quiet in the midst of increased questions and speculation from the community in Discord and on Twitter. Moving forward, we’ll structure our communications to prioritize transparency and community experience.
  • Monitoring and alerting is a requirement to go to prod. The claims backend crashed, and it was our users who let us know. We need monitoring and alerting in place for every service before it goes live to production traffic. There’s too much going on on launch day to manually watch every dashboard. We need to get pushed alerts when things go down.
  • Don’t use free-tier infra internally. We shouldn’t use the public endpoint for internal services for the same reasons the community shouldn’t. We should use a dedicated Alchemy key per service.
  • Lock down our launches. On-chain activity is public, so whenever smart contracts are involved in a launch those contracts should be pause-able.
  • Don’t sacrifice quality for speed. The claims backend was rushed, and had little oversight or review. We should put a cultural stake in the ground and clearly say that we won’t sacrifice that oversight in order to make a launch.
  • Minimize cross-team dependencies. The OP Labs engineering team has historically been structured by function (e.g., "frontend team", "client team") and not by product. This has increased the need for cross-team communication and increased the overhead required to ship production-quality software. OP Labs is in the process of transitioning its engineering team into a product-oriented structure which should significantly reduce the number of cross-team dependencies and generally diffuse engineering knowledge across the entire organization.

Action Items

Public Endpoint

  • Add Redis read replicas to edge proxy
  • Increase size of edge proxy VMs (in progress)
  • Split public RPC traffic through multiple providers (in progress)
  • Implement rate limiting for non-browser user agents (in progress; see the sketch after this list)
  • Share developer best practices for using RPC endpoints (in progress)
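
As a rough illustration of the user-agent rate limiting item above, the sketch below applies a tighter per-IP limit to clients whose User-Agent does not look like a browser. The thresholds, the browser heuristic, and the in-memory counter store are placeholders, not the production edge configuration.

```typescript
// Hypothetical sketch: apply a tighter per-IP limit to clients whose
// User-Agent does not look like a browser. The thresholds, the browser
// heuristic, and the in-memory counters are placeholders.

const BROWSER_LIMIT_PER_MIN = 300;
const NON_BROWSER_LIMIT_PER_MIN = 30;

const BROWSER_UA = /Mozilla|Chrome|Safari|Firefox|Edg/;

interface Counter {
  windowStart: number;
  count: number;
}

const counters = new Map<string, Counter>();

function allowRequest(ip: string, userAgent: string, now = Date.now()): boolean {
  const limit = BROWSER_UA.test(userAgent)
    ? BROWSER_LIMIT_PER_MIN
    : NON_BROWSER_LIMIT_PER_MIN;

  const counter = counters.get(ip);
  if (!counter || now - counter.windowStart >= 60_000) {
    // Start a new one-minute window for this IP.
    counters.set(ip, { windowStart: now, count: 1 });
    return true;
  }

  counter.count++;
  return counter.count <= limit;
}
```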

Infra

  • Go through our infra, and remove any configurations that use the public endpoint (in progress)
  • Create internal documentation for best practices on periodic load testing
  • Ensure that alerting is in place for:
    • Warp Speed
    • Claims backend
  • Scope, plan, and implement concurrent batch submission (in progress)

Errata

  • Disable text-to-speech on Discord globally (if possible)
  • Document that users should not make assumptions around block time until Bedrock