At 0826 EDT on 10/14/2020, our systems engineers were alerted to 503 errors indicating that users could not reach https://app.shadowhealth.com.
Our systems engineers confirmed that our compute clusters, in all environments, had lost worker nodes, causing our platform and other services to be unavailable.
At 0930 EDT we were able to restore the nodes, but the underlying compute instances were again terminated starting at 0942 EDT.
By 1027 EDT compute instances underlying the worker nodes and the cluster as a whole had stabilized and we turned our attention to restoring service.
Our platform uses Redis as a cache service. We determined that Redis within the cluster could not achieve high availability because it had no nodes in one of the three availability zones. Other services within the cluster also require three availability zones for redundant data persistence (not user data); however, these were able to function partially (i.e., without service disruption) with nodes in only two zones.
We worked to restore Redis within the cluster until 1100 EDT, then pivoted to deploying a separate Redis HA instance outside the cluster. However, services continued to experience network connection issues reaching this Redis instance. We then identified an existing, unused Redis instance in our environment and redirected services to it at 1210 EDT. Connectivity was confirmed, the platform resumed running successfully, and we observed traffic immediately.
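A cutover like the one above hinges on confirming connectivity before redirecting services. A minimal sketch of such a check, assuming a generic `ping` callable (in practice this could wrap redis-py's `Redis.ping`; the function name and retry parameters here are hypothetical):

```python
import time

def confirm_connectivity(ping, attempts: int = 3, delay_s: float = 0.0) -> bool:
    """Retry a health-check callable before declaring an endpoint usable.

    `ping` is any zero-argument callable returning True on success, e.g.
    a wrapper around a Redis PING. Returns False if every attempt fails.
    """
    for _ in range(attempts):
        if ping():
            return True
        if delay_s:
            time.sleep(delay_s)
    return False

# Stub standing in for a real endpoint check: fails twice, then succeeds.
results = iter([False, False, True])
print(confirm_connectivity(lambda: next(results)))  # True
```

Gating the traffic redirect on a check like this is what let us confirm the replacement instance was reachable before declaring the platform restored.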
After testing student assignment attempts end-to-end, we determined that the platform was stable and considered the immediate issue resolved at 1216 EDT.
After service was restored, systems engineers continued to monitor cluster health and provisioned a redundant cluster in case of recurrent instability.
Our compute clusters request a variable number of compute instances (a "fleet") from our cloud infrastructure provider. Our design anticipates that, from time to time, a request for additional instances cannot be fulfilled, or that an instance will be terminated, causing a minor, self-healing disruption. It does not anticipate that all instances will be terminated at the same time; that is what happened, twice, on the morning of October 14th. When instances could be provisioned again, they could not be provisioned across all three availability zones required to achieve high availability, causing Redis HA failures. Our cloud infrastructure provider could not or would not provision compute instances of the required type and size in that third availability zone for several hours.
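The gap between "instances were provisioned" and "instances were provisioned across all three zones" can be made explicit with a small check. This sketch is illustrative (the AZ names are placeholders, not our actual regions); it flags any required zone where a fleet request yielded no capacity, which is the condition that blocked HA placement during this incident:

```python
def missing_azs(fulfilled: dict[str, int], required_azs: set[str]) -> set[str]:
    """Return the required AZs where the fleet request yielded no instances.

    `fulfilled` maps an AZ to the number of instances actually provisioned.
    A non-empty result means quorum-based services cannot spread safely,
    even if total capacity looks sufficient.
    """
    return {az for az in required_azs if fulfilled.get(az, 0) == 0}

required = {"us-east-1a", "us-east-1b", "us-east-1c"}  # placeholder AZ names
# Plenty of total capacity, but one required zone entirely unfilled:
print(missing_azs({"us-east-1a": 4, "us-east-1b": 3}, required))
```

In our case the third zone stayed in that "unfilled" state for several hours, which is why restoring total node count alone did not restore Redis HA.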
We are changing our cluster design parameters to expect the complete loss of all non-"reserved" compute instances underlying worker nodes. We are also investigating how to ensure that baseline compute capacity exists in all three availability zones and is available to the services that cannot temporarily survive running in, e.g., a single AZ.
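One way to express the new design parameter is as an invariant that must hold even after every non-reserved instance is lost. The sketch below is a hypothetical formulation (the function and threshold are ours for illustration, not a committed implementation): reserved capacity alone must cover a minimum footprint in each of the three zones.

```python
def baseline_ok(reserved_per_az: dict[str, int], min_per_az: int = 1) -> bool:
    """Check the worst-case invariant: assume all non-reserved instances
    are gone, and require that each of at least three AZs still retains
    `min_per_az` reserved instances for quorum-dependent services.
    """
    return len(reserved_per_az) >= 3 and all(
        count >= min_per_az for count in reserved_per_az.values()
    )

# Reserved capacity in three zones satisfies the invariant:
print(baseline_ok({"az-a": 1, "az-b": 1, "az-c": 1}))  # True

# Reserved capacity concentrated in two zones does not, regardless of size:
print(baseline_ok({"az-a": 4, "az-b": 4}))  # False
```

A check like this, run against desired cluster state, would have surfaced the single-zone capacity gap before it became a Redis HA failure.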
The platform was effectively down for all users from 0826 EDT to 1210 EDT, totaling 224 minutes. Most users would have encountered 503 errors when attempting to access the platform. Some users were able to log in and view prior work but would not have been able to create or resume assignment attempts during the outage.
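The 224-minute figure follows directly from the alert and recovery timestamps:

```python
from datetime import datetime

start = datetime(2020, 10, 14, 8, 26)   # first 503 alerts, 0826 EDT
end = datetime(2020, 10, 14, 12, 10)    # traffic restored, 1210 EDT
outage_minutes = int((end - start).total_seconds() // 60)
print(outage_minutes)  # 224
```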