At 0826 EDT on 10/14/2020, our systems engineers were alerted to 503 errors indicating that users could not reach https://app.shadowhealth.com.
Our systems engineers confirmed that our compute clusters, in all environments, had lost worker nodes, causing our platform and other services to be unavailable.
At 0930 EDT we were able to restore the nodes, but the underlying compute instances were again terminated starting at 0942 EDT.
By 1027 EDT compute instances underlying the worker nodes and the cluster as a whole had stabilized and we turned our attention to restoring service.
Our platform uses Redis as a cache service. We determined that Redis within the cluster could not achieve high availability because it had no nodes in one of the three availability zones. Other services within the cluster also require three availability zones for redundant data persistence (not user data); however, these were able to function partially (i.e., without service disruption) with nodes in only two zones.
We worked to restore Redis within the cluster until 1100 EDT, then pivoted to deploying a separate Redis HA instance outside the cluster. However, services continued to experience network connection issues reaching this Redis instance. We then identified an existing, unused Redis instance in our environment and redirected services to it at 1210 EDT. Connectivity was confirmed, the platform resumed running successfully, and we observed traffic immediately.
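A cutover like the one above hinges on confirming connectivity before redirecting services. A minimal sketch of such a check, assuming a generic `ping` callable (in practice this could wrap redis-py's `Redis.ping`; the function name and retry parameters here are hypothetical):

```python
import time

def confirm_connectivity(ping, attempts: int = 3, delay_s: float = 0.0) -> bool:
    """Retry a health-check callable before declaring an endpoint usable.

    `ping` is any zero-argument callable returning True on success, e.g.
    a wrapper around a Redis PING. Returns False if every attempt fails.
    """
    for _ in range(attempts):
        if ping():
            return True
        if delay_s:
            time.sleep(delay_s)
    return False

# Stub standing in for a real endpoint check: fails twice, then succeeds.
results = iter([False, False, True])
print(confirm_connectivity(lambda: next(results)))  # True
```

Gating the traffic redirect on a check like this is what let us confirm the replacement instance was reachable before declaring the platform restored.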
After testing student assignment attempts end-to-end, we determined that the platform was stable and considered the immediate issue resolved at 1216 EDT.
After service was restored, systems engineers continued to monitor cluster health and provisioned a redundant cluster in case of recurrent instability.
Our compute clusters request a variable number of compute instances (a "fleet") from our cloud infrastructure provider. Our design anticipates that, from time to time, a request for additional instances cannot be fulfilled, or that an instance will be terminated, causing a minor, self-healing disruption. It does not anticipate that all instances will be terminated at the same time; that is what happened, twice, on the morning of October 14th. When instances could be provisioned again, they could not be provisioned across all three availability zones required to achieve high availability, causing Redis HA failures. Our cloud infrastructure provider could not or would not provision compute instances of the required type and size in that third availability zone for several hours.
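The gap between "instances were provisioned" and "instances were provisioned across all three zones" can be made explicit with a small check. This sketch is illustrative (the AZ names are placeholders, not our actual regions); it flags any required zone where a fleet request yielded no capacity, which is the condition that blocked HA placement during this incident:

```python
def missing_azs(fulfilled: dict[str, int], required_azs: set[str]) -> set[str]:
    """Return the required AZs where the fleet request yielded no instances.

    `fulfilled` maps an AZ to the number of instances actually provisioned.
    A non-empty result means quorum-based services cannot spread safely,
    even if total capacity looks sufficient.
    """
    return {az for az in required_azs if fulfilled.get(az, 0) == 0}

required = {"us-east-1a", "us-east-1b", "us-east-1c"}  # placeholder AZ names
# Plenty of total capacity, but one required zone entirely unfilled:
print(missing_azs({"us-east-1a": 4, "us-east-1b": 3}, required))
```

In our case the third zone stayed in that "unfilled" state for several hours, which is why restoring total node count alone did not restore Redis HA.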
We are changing our cluster design parameters to expect the complete loss of all non-"reserved" compute instances underlying worker nodes. We are also investigating how to ensure that baseline compute capacity exists in all three availability zones and is available to the services that cannot temporarily survive running in, e.g., a single AZ.
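One way to express the new design parameter is as an invariant that must hold even after every non-reserved instance is lost. The sketch below is a hypothetical formulation (the function and threshold are ours for illustration, not a committed implementation): reserved capacity alone must cover a minimum footprint in each of the three zones.

```python
def baseline_ok(reserved_per_az: dict[str, int], min_per_az: int = 1) -> bool:
    """Check the worst-case invariant: assume all non-reserved instances
    are gone, and require that each of at least three AZs still retains
    `min_per_az` reserved instances for quorum-dependent services.
    """
    return len(reserved_per_az) >= 3 and all(
        count >= min_per_az for count in reserved_per_az.values()
    )

# Reserved capacity in three zones satisfies the invariant:
print(baseline_ok({"az-a": 1, "az-b": 1, "az-c": 1}))  # True

# Reserved capacity concentrated in two zones does not, regardless of size:
print(baseline_ok({"az-a": 4, "az-b": 4}))  # False
```

A check like this, run against desired cluster state, would have surfaced the single-zone capacity gap before it became a Redis HA failure.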
The platform was effectively down for all users from 0826 EDT to 1210 EDT, totaling 224 minutes. Most users would have encountered 503 errors when attempting to access the platform. Some users were able to log in and view prior work but would not have been able to create or resume assignment attempts during the outage.
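The 224-minute figure follows directly from the alert and recovery timestamps:

```python
from datetime import datetime

start = datetime(2020, 10, 14, 8, 26)   # first 503 alerts, 0826 EDT
end = datetime(2020, 10, 14, 12, 10)    # traffic restored, 1210 EDT
outage_minutes = int((end - start).total_seconds() // 60)
print(outage_minutes)  # 224
```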