Multiple Service Login Issue
Incident Report for Square US
Postmortem

Incident Summary

Starting at Nov 13, 2021 19:06 UTC, Square’s servers experienced a service disruption, impacting Square products. For 14 minutes, services for retrieving application settings suffered performance degradation affecting the mobile login flow and syncing settings, such as whether tipping is enabled. These services automatically recovered without intervention by oncall engineers.

Starting at Nov 13, 2021 20:49 UTC, those same services suffered a more severe performance degradation lasting approximately 1 hour 52 minutes. These services recovered after direct intervention from oncall engineers.

In this postmortem recap, we’ll communicate the root cause of this disruption, document the steps that we took to diagnose and resolve the disruption, and share our analysis and actions to ensure that we are properly defending our customers from service interruptions like this in the future.

Timeline (UTC)

19:06 Beginning of 1st impact: APIs for fetching account & device settings suffered increasing latency and declining request success rate.

19:20 End of 1st impact: Request success rates returned to their baseline before the incident began.

19:24 Engineering team marks the incident as temporarily stable and opens an investigation into its root cause.

20:49 Beginning of 2nd impact: APIs for fetching account & device settings suffered increasing latency and declining request success rate.

21:15 Engineering team eliminates upstream APIs as a possible root cause.

21:23 Customer Success updates issquareup.com with instructions to remain logged in during the incident.

21:30 Engineering team reviews database query performance but no single query is identified as a root cause.

21:39 Account Management shares customer testimonials confirming the impact of the incident on the ability to log in or sync settings.

21:42 Engineering team eliminates recent deployments as a possible root cause by analyzing the two-week deployment history of all services.

21:49 Engineering team identifies that the performance degradation is concentrated in the database connection pool for the device settings microservice.

22:00 Engineering team elects to completely disable the worst performing queries to improve database performance. Work begins to construct a new deployable version of the service.

22:40 Engineering team completes migrating database to alternative hardware.

22:41 Engineering team confirms database query performance improved immediately after the migration to alternative hardware.

22:41 End of 2nd impact: Request success rates returned to their baseline before the incident began.

22:47 Account Management shares anecdotes from sellers that have seen functionality restored.

23:00 Engineering team merges the pull request to disable the worst performing queries.

23:22 SqSupport tweets to acknowledge the incident and to state that Square will prepare a response for Sellers affected by it.

00:00 (Nov 14) Customer Success updates issquareup.com to confirm the fix and that engineering teams will monitor the situation.

00:54 (Nov 14) Engineering team deploys the change to disable the worst performing queries.
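As a quick cross-check of the durations quoted in the summary, the timeline entries above can be subtracted directly. This is only a worked-arithmetic sketch: the helper name `utc` is ours, and it assumes the final entries after midnight fall on Nov 14 UTC.

```python
from datetime import datetime, timezone


def utc(hhmm, day=13):
    """Build a UTC timestamp in Nov 2021 from a timeline entry such as '19:06'."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2021, 11, day, hour, minute, tzinfo=timezone.utc)


# Windows taken from the timeline entries.
first_impact = utc("19:20") - utc("19:06")
second_impact = utc("22:41") - utc("20:49")
# Gap between recovery and deploying the query change (assumed to be Nov 14).
deploy_lag = utc("00:54", day=14) - utc("22:41")

print(first_impact)   # 0:14:00
print(second_impact)  # 1:52:00
print(deploy_lag)     # 2:13:00
```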

Analysis

This incident revealed areas of improvement for both our technical infrastructure and our engineering processes, several of which have already been implemented.

The database storing device settings saw elevated load, both from its own microservice and from neighbors sharing the same underlying hardware. Query performance degraded slightly, and the database connection pool was exhausted: requests spent significant time waiting for an open connection.
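To illustrate how a modest slowdown in queries can exhaust a fixed-size connection pool, here is a minimal simulation. It is a sketch under stated assumptions, not Square's service code: the `ConnectionPool` class, the `simulate` helper, and all pool sizes, query times, and timeouts are hypothetical.

```python
import queue
import threading
import time


class ConnectionPool:
    """Hypothetical fixed-size pool; connections are just labels here."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout):
        # Block until a connection is free, or give up after `timeout` seconds.
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("no free connection") from None

    def release(self, conn):
        self._free.put(conn)


def run_query(pool, query_seconds, acquire_timeout, results, lock):
    started = time.monotonic()
    try:
        conn = pool.acquire(acquire_timeout)
    except TimeoutError:
        with lock:
            results.append(("failed", time.monotonic() - started))
        return
    waited = time.monotonic() - started
    try:
        time.sleep(query_seconds)  # stand-in for the database round trip
    finally:
        pool.release(conn)
    with lock:
        results.append(("ok", waited))


def simulate(pool_size, requests, query_seconds, acquire_timeout):
    """Fire `requests` concurrent queries; return (status, seconds waited) pairs."""
    pool = ConnectionPool(pool_size)
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(
            target=run_query,
            args=(pool, query_seconds, acquire_timeout, results, lock),
        )
        for _ in range(requests)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for label, query_seconds in (("healthy", 0.02), ("degraded", 0.3)):
        outcome = simulate(2, 6, query_seconds, acquire_timeout=0.1)
        failed = sum(1 for status, _ in outcome if status == "failed")
        print(label, "failed requests:", failed, "of", len(outcome))
```

In the healthy case every caller gets a connection well within the acquire timeout. Once queries hold connections for longer than callers are willing to wait, most requests fail at the pool before they ever reach the database.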

That performance degradation, despite having a different root cause, had the same negative impact on Sellers as the Sep 18th incident: it interrupted their ability to log in on mobile devices or sync their preferred tipping settings. Our recent improvement, released in iOS 5.77, prevents the tipping setting from changing during any such incident. However, that release had reached only 72% of devices, which left many sellers still vulnerable to the bug in iOS apps that forcibly disables tipping.
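One common way to keep a setting from changing while the settings service is unavailable is to fall back to the last value that was fetched successfully. The sketch below shows that general pattern only; it is not the actual change shipped in iOS 5.77, and `TippingSettings`, `SettingsUnavailable`, and `SettingsCache` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class TippingSettings:
    enabled: bool


class SettingsUnavailable(Exception):
    """Raised by the (hypothetical) fetcher when the settings API fails."""


class SettingsCache:
    """Serve the last known good settings when a refresh fails."""

    def __init__(self, fetch: Callable[[], TippingSettings],
                 default: TippingSettings):
        self._fetch = fetch
        self._default = default
        self._last_known_good: Optional[TippingSettings] = None

    def current(self) -> TippingSettings:
        try:
            fresh = self._fetch()
        except SettingsUnavailable:
            # Do not overwrite the seller's choice on a failed sync.
            if self._last_known_good is not None:
                return self._last_known_good
            return self._default
        self._last_known_good = fresh
        return fresh


if __name__ == "__main__":
    responses = iter([TippingSettings(enabled=True), SettingsUnavailable()])

    def flaky_fetch() -> TippingSettings:
        item = next(responses)
        if isinstance(item, Exception):
            raise item
        return item

    cache = SettingsCache(flaky_fetch, default=TippingSettings(enabled=False))
    print(cache.current().enabled)  # first sync succeeds
    print(cache.current().enabled)  # second sync fails, cached value is kept
```

The design choice here is that a failed sync is treated as "no new information" rather than as a signal to reset the setting. This also keeps a failed device from propagating a changed value to the rest of a seller's fleet.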

We also heard testimonies from sellers that tipping was disabled on Android devices. If any vulnerable iOS device was among a Seller’s fleet of devices, it would automatically share those setting changes across every device. This is the primary way Android devices might have been affected.

Square will continue to seek adoption of iOS 5.77 to prevent this issue and will continue its ongoing work to improve fault tolerance of settings.

Some sellers were hesitant to upgrade to the latest versions to minimize disruption to their businesses, which impeded our ability to minimize the impact of a known issue. Learning from this, we will explore better ways to highlight critical updates in our release notes to empower sellers to make informed decisions around risk.

Lastly, Square will re-architect the mobile login flow to reduce the critical path and increase fault tolerance, yielding positive results for sellers that rely on Square to conduct their business.

Posted Dec 01, 2021 - 15:01 PST

Resolved
Good news! We’ve confirmed with our Engineering Teams that the earlier issues have been resolved, and all previously impacted services have returned to normal. We’re sorry for any inconvenience this may have caused, and thank you for your patience as we worked through the issue.
Posted Nov 13, 2021 - 16:59 PST
Monitoring
We have released a fix for the issues experienced earlier, and sellers should see services returning to full functionality. We’ll continue to watch this closely, and will update you to confirm that this is completely resolved. Thank you for being patient with us!
Posted Nov 13, 2021 - 16:00 PST
Identified
Our Engineering Teams have identified the root cause of the issue, and are actively working on a solution. We recommend you remain logged in until we have more information to provide.

If tipping is forcibly disabled, please remain logged in, update your app to 5.77, and enable tipping from the tipping settings.

We’ll report back once we’ve implemented a fix for this issue. Thank you for your patience.
Posted Nov 13, 2021 - 15:07 PST
Update
We are currently investigating issues impacting multiple services and preventing some sellers from being able to log in to their Square account. Additionally, some sellers may be unable to sync account tipping settings, preventing the tipping screen from appearing during check out.

If you are already signed in to your Square account, please do not sign out at this time.
Our Engineering Teams are aware and we will update issquareup.com as more info becomes available. Thank you for your patience.
Posted Nov 13, 2021 - 14:40 PST
Update
We are continuing to investigate this issue.
Posted Nov 13, 2021 - 13:40 PST
Investigating
We are currently investigating issues impacting multiple services and preventing some sellers from being able to log in to their Square account.
If you are already signed in to your Square account, please do not sign out at this time.
Our Engineering Teams are aware and we will update issquareup.com as more info becomes available. Thank you for your patience.
Posted Nov 13, 2021 - 13:23 PST
This incident affected: Appointments, Point of Sale, Square for Restaurants, and Square for Retail.