Shadow Health is experiencing technical issues.

Incident Report for Shadow Health

Postmortem

What happened

On May 30th at 0652 EDT, communication between the Shadow Health platform and some backend services began failing.

Soon after 0700 EDT we received the first user report of issues starting assignment attempts. Because few users were active at that hour, the extent of the impact was uncertain, though various features appeared to be affected. The platform itself remained available, and many features, including user login and admin features, were functional.

As our engineers investigated, they confirmed that, while all individual services were available, the platform servers were rejecting connections to the services due to what was reported as an invalid SSL certificate. However, internal testing tools and web browsers reported our SSL certificates as valid.

This investigation determined that the impact to end users was the inability to start or resume assignment attempts, view results or the Results Book, or enroll in courses. We also determined there was no risk of data loss or corruption. After confirming this outage of critical features we sent notifications to users within our platform. At 1026 EDT we posted on this status page that we were investigating a critical incident.

Engineers redeployed platform servers on a backup cloud provider and confirmed these servers were able to contact the backend services. We completed testing of this deployment and began directing traffic to it at 1153 EDT. We then migrated additional LTI customer DNS entries, finishing at 1218 EDT. We continued to monitor traffic until 1351 EDT before removing notifications within the platform and resolving the incident reported on this status page.

Concurrently with the failover, we investigated potential upstream issues and noted reports of widespread SSL issues caused by the expiration of a root certificate belonging to a large certificate authority (CA). This was the CA that issued SAN certificates for our backend services. A “cross signing” root certificate used by the CA in 2010 to “increase trust” of its main root certificate had expired at 0648 EDT on May 30th. This caused some HTTP client libraries used in our platform (and its dependencies) to reject the SSL certificate used by our backend services.

Though our certificate was valid and the CA certificate at the root of our certificate chain was valid, the certificate validation logic used by clients based on some OpenSSL versions and several other TLS libraries considered the certificate to be invalid due to the expired cross-signing certificate in the chain.
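The difference between the two validation behaviors can be illustrated with a deliberately simplified model. This is a sketch only, not the logic of OpenSSL or any other real TLS library: the `Cert` class, both function names, the certificate names, and the 2030 expiry date are hypothetical. Only the cross-signing certificate's expiry time (0648 EDT, i.e. 1048 UTC, on May 30th) comes from the timeline above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str
    not_after: datetime


def strict_chain_valid(sent_chain: List[Cert], now: datetime) -> bool:
    """Simplified 'restrictive' behavior: every certificate the server
    sends must be unexpired, even ones that are not needed for trust."""
    return all(cert.not_after > now for cert in sent_chain)


def path_building_valid(sent_chain: List[Cert],
                        trust_store: Dict[str, Cert],
                        now: datetime) -> bool:
    """Simplified 'less restrictive' behavior: walk issuer links from the
    leaf and succeed as soon as an unexpired trusted root is reached."""
    usable = {c.subject: c for c in sent_chain if c.not_after > now}
    current = sent_chain[0]
    if current.not_after <= now:
        return False
    # Bounded loop so a malformed chain cannot cause endless iteration.
    for _ in range(len(sent_chain) + 1):
        root = trust_store.get(current.issuer)
        if root is not None and root.not_after > now:
            return True
        nxt = usable.get(current.issuer)
        if nxt is None or nxt is current:
            return False
        current = nxt
    return False


# Hypothetical example data.
CROSS_SIGN_EXPIRY = datetime(2020, 5, 30, 10, 48, tzinfo=timezone.utc)
LATER = datetime(2030, 1, 1, tzinfo=timezone.utc)  # assumed, illustrative

SENT_CHAIN = [
    Cert("backend.example", "Issuing CA", LATER),       # our certificate
    Cert("Issuing CA", "Modern Root", LATER),           # intermediate
    Cert("Modern Root", "Legacy Root", CROSS_SIGN_EXPIRY),  # cross-signed copy
]
TRUST_STORE = {"Modern Root": Cert("Modern Root", "Modern Root", LATER)}

INCIDENT_START = datetime(2020, 5, 30, 10, 52, tzinfo=timezone.utc)  # 0652 EDT
strict_result = strict_chain_valid(SENT_CHAIN, INCIDENT_START)
lenient_result = path_building_valid(SENT_CHAIN, TRUST_STORE, INCIDENT_START)
```

In this model the strict check fails at the start of the incident because the cross-signed certificate in the sent chain has expired, while the path-building check succeeds because it reaches the trusted, unexpired root before it ever needs that certificate.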

The CA’s certificate expiration issue is covered in depth by others' posts:

https://www.namecheap.com/blog/sectigo-ssl-certificate-root-expiration-issue/
https://www.cmu.edu/iso/service/cert-auth/addtrust.html
https://www.agwa.name/blog/post/fixing_the_addtrust_root_expiration
https://gitlab.com/gnutls/gnutls/-/issues/1008

Once this cause was identified our engineers removed the offending certificate from the chain used by both the platform servers and the backend services and redeployed with the corrected certificate chain. After testing that all functionality was restored, we began to redirect traffic from the backup deployment to the main deployment. After monitoring traffic overnight, we de-provisioned the backup deployment.

Why did it happen

Without human intervention, our platform monitoring could not determine that there was a major loss of functionality due to SSL certificate errors between the platform servers and other services. Automated monitoring could only see that both the platform servers and the backend services were individually functional. As a result, engineers were not notified when the issue began, nearer to 0700 EDT, which would have enabled earlier mitigation and resolution. Though monitoring did alert us to increased error rates, those rates only rose once more users came online, hours after the issue started.

What will we do

The certificate expiration occurred a week before a planned patch release that updates several libraries used in our platform to newer versions. That update may have prevented the cross-signing certificate expiration from becoming an issue, as newer versions of OpenSSL-based HTTP clients interpret certificate chains in a smarter, less restrictive way (though the newest versions of some widely used TLS libraries still do not). The update is expected to prevent a recurrence of this exact scenario, where a CA-issued certificate in the chain, but not the root certificate, becomes invalid.

We will also investigate automation of SSL certificate provisioning, which would expedite a step in the process of remediating issues with certificates.

We are also enhancing our application performance monitoring capabilities, which will enable us to automatically alert on issues with performance and availability of individual features, rather than only on issues of performance and availability of entire servers or services.
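As a rough illustration of feature-level checking, the sketch below runs named checks and reports certificate verification failures separately from other connection failures. It is an assumption-laden sketch, not our monitoring implementation: `tls_handshake_check`, `run_feature_checks`, and the check names are hypothetical, and a real check would need to use the same HTTP client library as the feature it covers, since libraries differed in how they validated the chain during this incident.

```python
import socket
import ssl
from typing import Callable, Dict


def tls_handshake_check(host: str, port: int = 443, timeout: float = 5.0) -> None:
    """Open a verified TLS connection to host:port.

    Raises ssl.SSLCertVerificationError if the certificate chain is
    rejected, or another OSError if the connection itself fails.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host):
            pass  # A completed handshake is all this check needs.


def run_feature_checks(checks: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Run each named check and return a failure reason per failing feature."""
    failures: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
        except ssl.SSLCertVerificationError as exc:
            # Must come first: this is a subclass of OSError.
            failures[name] = f"certificate verification failed: {exc}"
        except OSError as exc:
            failures[name] = f"connection failed: {exc}"
    return failures
```

A scheduler could call `run_feature_checks` from the platform servers with one entry per feature, for example `{"start_assignment": lambda: tls_handshake_check("backend.example")}` (a hypothetical hostname), and alert whenever the returned dictionary is non-empty, regardless of how many users are online.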

Impact

We determined from application logs that 774 users were impacted during this feature outage from 0652 EDT to 1218 EDT.

Posted Jun 03, 2020 - 18:02 EDT

Resolved

Issues with the Shadow Health Portal have been resolved. An incident report will be posted within the next two days.

Posted May 30, 2020 - 13:51 EDT

Investigating

We apologize for the inconvenience, Shadow Health is currently experiencing technical issues. Our engineers are aware and working to resolve the problem. Thank you again for your patience.

Posted May 30, 2020 - 10:26 EDT

This incident affected: Shadow Health Portal.