One of the mail cluster member nodes experienced a hardware failure of unknown origin and crashed without producing any traces. The cluster itself was built with exactly that situation in mind and was supposed to fail over with almost no noticeable impact on end users. Sadly, the component that was supposed to reliably reset the failed node didn’t work and returned a failed status to the cluster, which blocked the fail-over procedure. As a result, the cluster got stuck in a transitioning state, still “believing” the failed node was operational, and thus not letting the other nodes take over. The situation affected mail (SMTP, POP3/IMAP) and some internal services.

That said, the alert system was doing its job properly, sending a substantial number of notifications. The holiday weekend contributed to the admins ignoring them. When the first couple of alerts arrived, someone from the team decided to check the status of the web services on the servers instead of checking the mail system, as the alert text indicated. Only later, when another team member noticed the alerts and the complaints from users, was the information passed on to a sysadmin and the cluster state reset.
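The failure mode described above — a fencing (node-reset) step reporting failure and thereby blocking the whole fail-over — can be sketched roughly like this. This is a toy model, not our cluster stack's actual code; all names (`Cluster`, `handle_failure`, `fence_agent`) are illustrative:

```python
from enum import Enum, auto

class NodeState(Enum):
    UP = auto()
    FAILED = auto()
    FENCED = auto()

class Cluster:
    """Toy model of the fail-over path (illustrative names, not real cluster software)."""

    def __init__(self, nodes, fence_agent):
        self.nodes = nodes              # node name -> NodeState
        self.fence_agent = fence_agent  # callable: node name -> bool (did the reset succeed?)

    def handle_failure(self, node):
        self.nodes[node] = NodeState.FAILED
        # Before the surviving members may take over the failed node's
        # services, the node must be reliably reset ("fenced").  If the
        # fencing component reports failure, the cluster cannot prove the
        # node is really down and stays stuck in transition -- which is
        # exactly what happened during this outage.
        if not self.fence_agent(node):
            return "stuck-in-transition"
        self.nodes[node] = NodeState.FENCED
        return "failover-complete"
```

With a fence agent that always reports failure, `handle_failure("mail1")` returns `"stuck-in-transition"` and no takeover happens; with a working one, the node is marked fenced and fail-over completes. The point of the design is safety: taking over services from a node that might still be alive risks split-brain and data corruption, so the cluster prefers to block.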
So, lessons learned. We’re going to:
Oh, and one more thing: despite mail being unable to be sent or received, no mail was lost during the outage.