mail system down
Incident Report for Fused
Postmortem

One of the mail cluster's member nodes suffered a hardware failure of unknown origin and crashed without leaving any traces. The cluster was built with that situation in mind and was supposed to fail over with almost no noticeable impact on end users. Sadly, one of the components that was supposed to reliably reset the failed node didn’t work and returned a failed status to the cluster, which disrupted the fail-over procedure. As a result, the cluster was stuck in a transitioning state, still “believing” the failed node was operational, and so did not let the other nodes take over. The situation affected mail (SMTP, POP3/IMAP) and some internal services.

That said, the alert system did its job properly and sent a substantial number of notifications. The holiday weekend contributed to admins ignoring them. When the first couple of alerts arrived, someone from the team decided to check the status of the web services on the servers instead of checking the mail system, as the alert text indicated. Only later, when another team member noticed the alerts and the complaints from users, was the information passed on to the sysadmin and a cluster state reset performed.
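To make the stalled fail-over described above more concrete, here is a minimal sketch of a fail-over that is gated on a confirmed node reset. It is an illustration only, not the logic of the cluster software we actually run; `fence_node`, `fail_over`, and the reset-method callables are hypothetical names. It also shows one way a single broken reset component could be tolerated: by trying further reset methods before giving up.

```python
from typing import Callable, Sequence

# Hypothetical: a reset method returns True only when it has confirmed
# that the node was actually reset or powered off.
ResetMethod = Callable[[str], bool]


def fence_node(node: str, methods: Sequence[ResetMethod]) -> bool:
    """Try each reset method in order; stop at the first confirmed success."""
    for method in methods:
        try:
            if method(node):
                return True
        except Exception:
            # A reset method that crashes is treated like one that failed.
            continue
    return False


def fail_over(node: str, methods: Sequence[ResetMethod]) -> str:
    """Return the resulting state for the failed node (illustrative labels)."""
    if fence_node(node, methods):
        return "failed-over"  # reset confirmed: other nodes may take over
    return "stuck"            # reset unconfirmed: takeover stays blocked
```

With only one reset method that reports failure, `fail_over` returns `"stuck"`, which mirrors what happened during this incident. Adding a second, independent method gives the cluster another chance to confirm the reset.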

So, lessons learned. We’re going to:

  • first of all, replace the node hardware entirely
  • work out a more reliable node reset method
  • make the alert text clearer (apparently, “mailcluster.fused.com is DOWN” wasn’t clear enough)
  • provide more team training on what to do in such situations
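As a rough idea of what the alert-text item above could look like, here is a small sketch of an alert message that names the affected service and tells the recipient what to check first. The `Alert` fields and the `format_alert` helper are hypothetical and not tied to any particular monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    host: str         # e.g. "mailcluster.fused.com"
    service: str      # human-readable name of the affected service
    state: str        # e.g. "DOWN"
    first_check: str  # what the person on call should look at first


def format_alert(alert: Alert) -> str:
    """Build a two-line alert: what is affected, then what to do first."""
    return (
        f"[{alert.state}] {alert.service} on {alert.host}\n"
        f"First check: {alert.first_check}"
    )
```

For this incident, the first line would name the mail cluster and its protocols, and the second would point at the mail cluster status rather than leaving the recipient to guess.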

Oh, and one more thing: although mail could not be sent or received, no mail was lost during the outage.

Posted Dec 31, 2018 - 13:38 UTC

Resolved
The issue has been fully resolved. An RFO report will be published in the coming days.
Posted Dec 29, 2018 - 21:26 UTC
Identified
An issue has been identified with the cluster that handles the email system. The issue is currently under control and mail should be flowing normally in both directions. We're still investigating the cause.
Posted Dec 29, 2018 - 20:20 UTC
This incident affected: Email.