Starting at 13:57 UTC on July 20, 2021, Square’s Authentication servers experienced a service disruption, impacting Square products.
In this postmortem recap, we'll explain the root cause of the disruption, document the steps we took to diagnose and resolve it, and share our analysis and follow-up actions to ensure that we are properly defending our customers from service interruptions like this in the future.
13:57 (start of impact): Higher-than-average traffic began to hit one of our public API endpoints. The rate limiter for this endpoint, which is designed to prevent exactly this type of impact, did not slow the traffic through the endpoint as expected. This resulted in degraded performance of our Authentication services.
13:59: Within two minutes, engineers were alerted to the degraded performance of our Authentication services and began to troubleshoot the issue.
14:04: Our customer success team escalated customer reports of failed login attempts and active sessions terminating unexpectedly to the engineering team.
14:28-14:40: Engineers manually shifted traffic between multiple datacenters in an attempt to improve performance.
14:42: Square engineering teams disabled non-critical sources of traffic to our Authentication services.
14:55: Engineers prioritized traffic to ensure active sessions were not unexpectedly terminated.
14:58: Traffic to the impacted endpoint decreased to expected average levels.
15:14: Engineers began a rolling restart of our Authentication servers.
15:32: Engineers reset our traffic load balancing between servers to default and removed all temporary traffic restrictions.
15:33 (end of impact): While some minor latency remained, Authentication services returned to normal operation and all Square products recovered.
The root cause of this incident was a rate-limiter misconfiguration. During the incident, our Authentication services received significantly higher-than-normal traffic on one of our public endpoints. While a rate limiter did exist for this endpoint, it was incorrectly configured and did not throttle the traffic. After additional investigation, we are confident that this traffic was not malicious in nature.
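To illustrate how a rate limiter is supposed to contain a traffic spike like this one, here is a minimal token-bucket sketch in Python. This is a generic illustration, not Square's actual implementation; the rates and capacities are hypothetical. A correctly sized bucket absorbs a short burst and then throttles everything beyond it, whereas a misconfigured limit (for example, one set far above the service's real capacity) would let the full spike through.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: sustains `rate` requests/sec
    and allows bursts up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A correctly configured limiter caps an instantaneous burst at roughly
# its capacity, no matter how large the burst actually is.
limiter = TokenBucket(rate=100, capacity=100)
allowed = sum(limiter.allow() for _ in range(10_000))
print(allowed)  # roughly the burst capacity (~100), not all 10,000
```

The key failure mode is configuration, not code: if `rate` or `capacity` is set orders of magnitude too high for the backend, the limiter "works" but never rejects anything, which matches the behavior we saw at the start of impact.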
Immediately after the incident was resolved, Square engineers 1) identified and confirmed the intent of the traffic spike that contributed to the service degradation, 2) identified the affected endpoint and the improperly configured rate limiter, and 3) implemented a fix that properly reconfigured rate limiting to prevent this type of traffic spike from degrading services again.
Additionally, as a takeaway from this incident, we plan to add an additional layer of rate-limiting protection for our Authentication services, further shielding us from similar outages. We also have architectural changes in flight that will ensure these services scale in line with Square's growth, reducing the chance that high levels of traffic will cause outages in the future.
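Layered rate limiting of the kind described above can be sketched as two checks per request: a narrow per-endpoint limit and a service-wide limit, where a request must pass both. This is a hypothetical illustration (the endpoint name, rates, and `LayeredLimiter` class are invented for this example), showing why a second layer helps: even if one endpoint's limit is misconfigured or overwhelmed, the global budget still caps total load on the service.

```python
import time

class TokenBucket:
    """Token bucket sustaining `rate` requests/sec with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class LayeredLimiter:
    """Two layers of defense: a request must pass its per-endpoint
    limiter AND the shared service-wide limiter."""
    def __init__(self, global_rate: float, global_burst: float):
        self.global_limiter = TokenBucket(global_rate, global_burst)
        self.endpoint_limiters: dict[str, TokenBucket] = {}

    def register(self, endpoint: str, rate: float, burst: float) -> None:
        self.endpoint_limiters[endpoint] = TokenBucket(rate, burst)

    def allow(self, endpoint: str) -> bool:
        ep = self.endpoint_limiters.get(endpoint)
        # Check the narrower endpoint limit first; short-circuiting means a
        # rejected request never draws down the shared global budget.
        return (ep is None or ep.allow()) and self.global_limiter.allow()

# Hypothetical configuration: one hot endpoint capped well below the
# service-wide budget, so a spike on it cannot exhaust the whole service.
limiter = LayeredLimiter(global_rate=1000, global_burst=1000)
limiter.register("/login", rate=50, burst=50)
burst = sum(limiter.allow("/login") for _ in range(5000))
print(burst)  # capped near the endpoint burst (~50), far under the global budget
```

The design choice worth noting is ordering: evaluating the endpoint limit before the global one keeps a single misbehaving endpoint from starving well-behaved traffic that shares the global bucket.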