Performance Issue: Login Errors
Incident Report for Square US
Postmortem

Incident Summary

Starting at 13:57 UTC on July 20, 2021, Square’s Authentication servers experienced a service disruption, impacting Square products.

In this postmortem, we’ll communicate the root cause of this disruption, document the steps we took to diagnose and resolve it, and share our analysis and the actions we are taking to protect our customers from service interruptions like this in the future.

Timeline (UTC)

13:57 (start of impact): Higher-than-average traffic began to hit one of our public API endpoints. The rate limiter for this endpoint, which is meant to prevent this type of impact, did not slow the rate of traffic through the endpoint as expected. This resulted in degraded performance of our Authentication services.

13:59: Within two minutes, engineers were alerted to the degraded performance of our Authentication services and began to troubleshoot the issue.

14:04: Our customer success team members alerted our engineering team to reports of failed login attempts and active sessions terminating unexpectedly.

14:28-14:40: Engineers manually shifted traffic between multiple datacenters in an attempt to increase performance.

14:42: Square engineering teams disabled non-critical sources of traffic to our Authentication services.

14:55: Engineers prioritized traffic to ensure active sessions were not unexpectedly terminated.

14:58: Traffic to the impacted endpoint decreased to expected average levels.

15:14: Engineers began a rolling restart of our Authentication servers.

15:32: Engineers reset our traffic load balancing between servers to default and removed all temporary traffic restrictions.

15:33 (end of impact): While there was still some minor latency, Authentication services returned to normal operations and all Square products recovered.

Analysis

The root cause of this incident can be traced to a rate limiter misconfiguration. During the incident, our Authentication services received significantly higher-than-normal traffic to one of our public endpoints. While a rate limiter did exist for this endpoint, it was incorrectly configured. After further investigation, we are confident that this traffic was not malicious in nature.
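
For readers unfamiliar with rate limiting, the following is a minimal sketch of one common approach, a token bucket, written in Python. It is an illustration only and not Square's implementation; this report does not describe the algorithm or the configuration values involved. The class name TokenBucket and the parameters rate, capacity, and clock are assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical limiter: about `rate` requests/second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # Reject obviously invalid settings at startup rather than at request time.
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity      # start with a full bucket
        self._last = clock()         # time of the previous refill

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be throttled."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

In a sketch like this, a limiter whose rate or capacity is set far above what the backing service can handle would let a traffic spike through almost unchanged, which is one way a limiter can exist yet fail to slow traffic.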

Immediately after the incident was resolved, Square engineers 1) identified and confirmed the intent of the traffic spike that contributed to the service degradation, 2) identified the affected endpoint and the improperly configured rate limiter, and 3) implemented a fix that properly reconfigured rate limiting to prevent this type of traffic spike from degrading services again.

As an additional takeaway from this incident, we plan to add another layer of rate-limiting protection for our Authentication services. This will further shield us from similar outages. We also have architectural changes in flight that will ensure these services will scale in line with Square’s growth, reducing the chances that high levels of traffic will cause outages in the future.
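
One way to picture a second layer of rate limiting is a service-wide limit placed behind the per-endpoint limit, so that a gap in one layer is still caught by the other. The Python sketch below is an assumption-laden illustration, not a description of Square's planned design: the classes FixedWindowLimiter and LayeredLimiter, the "auth" key, and all limits are hypothetical, and a simple fixed-window counter is used only to keep the example short.

```python
from collections import defaultdict
from typing import Dict, Tuple


class FixedWindowLimiter:
    """Hypothetical limiter: at most `limit` requests per key per time window."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # Counts per (key, window index). Old windows are never evicted here;
        # a long-running service would need to clean them up.
        self._counts: Dict[Tuple[str, int], int] = defaultdict(int)

    def allow(self, key: str, now: float) -> bool:
        window = int(now // self.window_seconds)
        slot = (key, window)
        if self._counts[slot] >= self.limit:
            return False
        self._counts[slot] += 1
        return True


class LayeredLimiter:
    """A request must pass the per-endpoint layer and then the service-wide layer."""

    def __init__(self, per_endpoint: FixedWindowLimiter,
                 service_wide: FixedWindowLimiter) -> None:
        self.per_endpoint = per_endpoint
        self.service_wide = service_wide

    def allow(self, endpoint: str, now: float) -> bool:
        # Layer 1: limit each endpoint individually.
        if not self.per_endpoint.allow(endpoint, now):
            return False
        # Layer 2: cap total traffic to the service, whatever the endpoint.
        # Simplification: a request rejected here has already been counted
        # against its endpoint in layer 1.
        return self.service_wide.allow("auth", now)


limiter = LayeredLimiter(
    per_endpoint=FixedWindowLimiter(limit=2, window_seconds=1.0),
    service_wide=FixedWindowLimiter(limit=3, window_seconds=1.0),
)
```

With these example limits, a third request to the same endpoint within one second is rejected by the first layer, and a fourth request across all endpoints within that second is rejected by the second layer.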

Posted Aug 05, 2021 - 08:13 PDT

Resolved
All our services are up and running as expected.

Sorry for today’s interruption. Thank you for bearing with us.
Posted Jul 20, 2021 - 12:06 PDT
Monitoring
We're beginning to see services recover. You should now be able to log in to Square Dashboard and the Square apps as normal. Customers should also be able to place orders on the Square Online site successfully.

We’ll continue to monitor as our teams work to ensure the fix will completely resolve the issue.
Posted Jul 20, 2021 - 08:47 PDT
Update
Our teams are actively working on this issue. If you try to log in to your Square account, it will look like the password is incorrect, but this is due to the current outage. Please do not request a password reset at the moment.

We'll be back with updates as we learn more.
Posted Jul 20, 2021 - 08:29 PDT
Update
If you're currently logged in to your Square app, you may experience some slowness. Please do not try to log out and log back in.

We're also receiving reports that customers may experience intermittent errors when placing an order on your Square Online site.

Our teams are working to implement a solution for the issues you are experiencing. Thank you for your continued patience.
Posted Jul 20, 2021 - 07:51 PDT
Update
In addition, developers may have trouble logging in to the Developer Dashboard.

Thank you for bearing with us! We'll be back with more updates.
Posted Jul 20, 2021 - 07:35 PDT
Investigating
We're currently fielding reports of issues logging in to Square Dashboard, the Point of Sale app, and the Square Online settings overview.

Our teams are aware of the issue and are working towards a fix. We'll continue to provide updates as we learn more.
Posted Jul 20, 2021 - 07:17 PDT
This incident affected: Point of Sale, Dashboard, and Square Online.