Downtime

Sites are experiencing downtime

Sep 06 at 06:46am CEST

Resolved
Sep 06 at 10:18am CEST

Website outage post-mortem

What happened?

At 04:04, an automatic upgrade was triggered for our production environment. These automatic upgrades gradually roll out a new version of the underlying infrastructure orchestration solution.

About 18 minutes later, at 04:22, our internal monitoring system indicated the first interruptions to public-facing websites.

Short, intermittent interruptions are to be expected during these automatic upgrades, but no significant downtime should occur. In this case, however, the disturbances persisted even after the upgrade itself was completed at approximately 05:15.

At 05:50, our team responded to the incident and began investigating and troubleshooting. The system component responsible for secrets storage was identified as a possible cause of the issues. Our investigation showed that it did not start properly after the upgrade, which left other parts of the platform unable to access the configuration they need to run.

A status update about the incident was posted to our status page at 06:46 and the issue was escalated internally.

Continued investigation confirmed that the secrets storage was part of the cause. At 07:12 the secrets storage was restored to a working state.

In addition to the issue with the secrets storage, some of the applications within the platform could not stay running because they ran out of memory. The configurations for these applications were updated at 07:25 to allow for higher memory usage.

Shortly thereafter, our monitoring system detected the recovery of the public-facing websites. At 07:36 all websites had recovered.
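
The intervals implied by the timeline above can be worked out with a short script. This is only an illustrative sketch: the timestamps are copied from the text (the 05:15 value is stated there as approximate), while the dictionary keys and function names are made up for this example and do not refer to any internal tooling.

```python
from datetime import datetime

# Timestamps exactly as stated in the timeline above (same day, same timezone).
# The key names are hypothetical labels chosen for this example.
TIMELINE = {
    "upgrade_triggered": "04:04",
    "first_interruption": "04:22",
    "upgrade_completed": "05:15",  # stated as approximate in the text
    "team_responded": "05:50",
    "status_page_update": "06:46",
    "secrets_storage_restored": "07:12",
    "memory_limits_raised": "07:25",
    "all_sites_recovered": "07:36",
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def format_duration(minutes: int) -> str:
    """Format a minute count as 'Xh YYm'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}m"


def summarize(timeline: dict) -> dict:
    """Compute the main intervals, in minutes, from the timeline."""
    first = timeline["first_interruption"]
    return {
        "upgrade_start_to_first_interruption": minutes_between(
            timeline["upgrade_triggered"], first
        ),
        "first_interruption_to_response": minutes_between(
            first, timeline["team_responded"]
        ),
        "first_interruption_to_status_update": minutes_between(
            first, timeline["status_page_update"]
        ),
        "first_interruption_to_full_recovery": minutes_between(
            first, timeline["all_sites_recovered"]
        ),
    }


if __name__ == "__main__":
    for name, minutes in summarize(TIMELINE).items():
        print(f"{name}: {format_duration(minutes)}")
```

Measured from the first detected interruption at 04:22, this gives 1h 28m until the team responded, 2h 24m until the first status page update, and 3h 14m until all websites had recovered.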

What actions are being taken?

Based on our findings from this incident, we will be taking several actions:

  • We will review the monitoring configuration so that alerts reach the correct team members.
  • We will also talk to our infrastructure provider about how we can handle these automatic upgrades together.

Within the platform itself, we will be:

  • exploring different options for improving the reliability of the secrets storage, as well as
  • reviewing the memory requirements for the aforementioned applications with high memory usage.

Updated
Sep 06 at 07:33am CEST

All sites are now up and running again. We are still monitoring the sites and will update here with more details later.

Updated
Sep 06 at 07:28am CEST

We have implemented the fixes and can see that sites are slowly getting up and running again.

Updated
Sep 06 at 07:17am CEST

We have located the issue and are implementing a fix.

Created
Sep 06 at 06:46am CEST

We have received alerts that some sites are currently unavailable.
We are investigating.