Disruption of Login Service
Incident Report for evalink
Postmortem

Summary - What happened?

On September 28, 2021, at 15:00 CEST, evalink talos users experienced login issues and were unable to access the alarm management platform. The disruption was caused by partial outages at Auth0, the underlying authentication and authorization platform provider, which affected the EU and US regions from 15:00 to 18:00 CEST.

At 16:30 CEST on September 28, 2021, Sitasys provided a workaround so that users could log in again. At 17:45 CEST on September 28, 2021, the service was fully restored and all users were able to successfully access the evalink talos platform.

After further investigation and to prevent this from happening in the future, Sitasys is actively working on an alternative emergency login solution that does not depend on the Auth0 service. We’ll update all users as soon as this solution becomes available.

1. Detection and remediation

Sitasys became aware of a service disruption in Auth0 at 15:00 CEST on September 28, 2021, which left multiple users unable to access the evalink talos environment. During the service outage, evalink talos users experienced sudden logouts and could not log back in to their accounts.

At approximately 16:30 CEST, Sitasys implemented a temporary workaround that allowed users to log in by bypassing the non-functional subsystem in Auth0. Despite the workaround, evalink talos users still experienced unexpected logouts and degraded performance caused by very high latency in the Auth0 platform.

After closely monitoring the situation, Sitasys safely disabled the workaround once the Auth0 service was fully restored at approximately 18:00 CEST on September 28, 2021.

Auth0 will provide a Root Cause Analysis (RCA) within 14 days.

2. FAQs

How did the service disruption affect users? The Auth0 outage directly affected the login and the partner login of evalink talos and therefore prevented users from accessing the evalink User Interface (UI).

Were other Sitasys systems affected during or as a consequence of the outage? No issues were found on any other Sitasys systems.

What happened to alarms and signals that were transmitted during the outage? There was no impact on alarms and signals; they were still received and stored by evalink talos. Automated workflows were carried out normally.

Were alarm panel connections affected? No, our virtual receivers were operating normally. Connections remained established and were properly monitored at all times. However, outage messages could not be viewed or processed manually in the UI. Actions triggered in automated workflows, including but not limited to email, SMS, Slack, phone calls and alarm escalation, were working normally.

Could users view signals and alarms during the outage? Only if previously configured: alarm escalation, site sharing, and the API continued to work at all times.

What happened exactly with Auth0? Auth0 will provide a Root Cause Analysis (RCA) within 14 days.

3. Key learnings and next steps

To maintain the high level of performance that our customers expect from Sitasys and to prevent this issue from recurring, our focus is on continuous learning and improvement. Sitasys is fully committed to minimizing downtime when incidents do occur. We also continually assess and improve our tools, processes, and architecture to provide you with the best service possible.

  1. Auth0 is itself a highly redundant platform. Nevertheless, outages affecting a critical subsystem in multiple regions at the same time led to a partial disruption of the service on which the login of evalink talos depends.
  2. The evalink talos user interface is built to detect outages at the authentication provider and would normally not log the user out in case of an error. Unfortunately, this specific partial outage produced an error that was not handled correctly and caused an automatic logout. Our authentication mechanism is explicitly built to stay functional even during longer outages. We will improve the user interface to better detect outage scenarios and never log the user out automatically when an error occurs.
  3. As an additional safety measure, we will implement an alternative login mechanism that will allow users to log in to their evalink talos company account even when the normal authentication with Auth0 isn’t working. This will become available in the following weeks and Sitasys will notify all users.
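
The behavior described in point 2 can be illustrated with a minimal sketch. It is not the evalink talos implementation: the `AuthFailure` type, the `decideSessionAction` function, and the choice of HTTP 401 as the only logout trigger are assumptions made for illustration. The idea is that the session ends only when the provider explicitly rejects the credentials, while any other error, including one the UI does not recognize, is treated as a possible outage and the user stays logged in.

```typescript
// Hypothetical failure categories a UI might see when talking to its
// authentication provider. These names are illustrative assumptions.
type AuthFailure =
  | { kind: "http"; status: number }       // provider answered with an HTTP error
  | { kind: "network" }                    // request never reached the provider
  | { kind: "unknown"; detail: string };   // anything the UI does not recognize

type SessionAction = "keep-session" | "logout";

// Decide what to do with the current session after a failed auth request.
// Assumption: only an explicit 401 means the session is really invalid.
function decideSessionAction(failure: AuthFailure): SessionAction {
  if (failure.kind === "http" && failure.status === 401) {
    return "logout";
  }
  // Server errors, timeouts, network failures and unrecognized errors are
  // treated as a possible provider outage: the user is never logged out.
  return "keep-session";
}
```

With this default, an unexpected error type falls through to `keep-session` rather than to a logout, which is the safer failure mode for the scenario described above.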
Posted Sep 29, 2021 - 17:07 CEST

Resolved
Due to an outage at our authentication provider, it is currently not possible to log in to evalink.
Posted Sep 28, 2021 - 15:00 CEST