December 20th, 2024 — Downtime
DNS Outage
Monitoring — resolved
STATUS: MONITORING, RESOLVED
Impacted systems: DNS / limited number of dedicated servers / shared / virtual / semi-dedicated.
Impact was felt across email, web & internal services.
Incident
- Impact: Due to zone misconfigurations, components of our DNS cluster were knocked offline, leading to DNS resolution issues across the fleet. This impacted users on all systems relying on affected DNS servers — email, websites, internal software.
- Length of Impact: Varied from 15 minutes to 4 hours, depending on the impacted client or system.
Root Cause
- Primary Issue:
  - Misconfiguration in zone files caused the named service to crash.
  - The corrupted zone files led to improper loading of DNS zones, triggering the service failure.
- Cascading issues caused:
  - Custom code external to the DNS stack exacerbated the issue by propagating blank configurations to some dedicated/virtual servers.
Resolution
- The corrupted zone files were identified and corrected.
- The named service was restarted successfully after all DNS zone files had been validated.
- Enhanced monitoring has been implemented to track DNS health and service uptime.
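As an illustration of the kind of DNS health tracking mentioned above, here is a minimal sketch of a resolution probe. It is not the monitoring that was actually deployed. The function names `resolves` and `check_dns_health` are hypothetical, and the probe goes through the host's system resolver, so it only indicates whether names resolve from the machine running it.

```python
import socket
from typing import Callable, Iterable, List


def resolves(hostname: str) -> bool:
    """Return True if the system resolver returns at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved.
        return False


def check_dns_health(
    hostnames: Iterable[str],
    resolver: Callable[[str], bool] = resolves,
) -> List[str]:
    """Return the hostnames that failed to resolve.

    An empty list means every probe succeeded. The resolver is injectable
    so the check can be exercised without touching the network.
    """
    return [name for name in hostnames if not resolver(name)]
```

A scheduler could call `check_dns_health` with a list of names served by the cluster and raise an alert whenever the returned list is not empty. Probing a specific nameserver rather than the system resolver would need a dedicated DNS library, which this sketch does not attempt.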
Mitigation & Next Steps
- Immediate Actions Taken:
  - Verified and corrected all zone files across affected systems.
  - Restarted the named service on all impacted servers.
  - Monitored closely for stability post-fix.
- Preventative Measures:
  - Add additional validation for zone file integrity before deployment.
  - Update custom DNS stack code to include fail-safes and stricter error handling.
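The two preventative measures above could be combined into a single deployment gate. The following is a sketch under stated assumptions, not the actual tooling: `zone_is_valid` and `deploy_zone` are hypothetical names, and the validation step assumes BIND's `named-checkzone` utility is installed on the host. The gate refuses blank content outright, validates the candidate file, and only then replaces the live file.

```python
import os
import subprocess
import tempfile
from typing import Callable


def zone_is_valid(origin: str, path: str) -> bool:
    """Run named-checkzone on the file; exit status 0 means the zone loaded.

    Assumes named-checkzone is on PATH; subprocess.run raises
    FileNotFoundError if it is not.
    """
    result = subprocess.run(
        ["named-checkzone", origin, path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def deploy_zone(
    origin: str,
    new_text: str,
    live_path: str,
    validator: Callable[[str, str], bool] = zone_is_valid,
) -> bool:
    """Replace live_path with new_text only if it is non-blank and valid.

    Returns True when the file was deployed and False when it was rejected.
    On rejection the live file is left untouched.
    """
    # Fail-safe: never propagate a blank configuration.
    if not new_text.strip():
        return False

    # Write the candidate next to the live file so os.replace stays on one
    # filesystem. File ownership and permissions are not handled here.
    directory = os.path.dirname(os.path.abspath(live_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".candidate")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(new_text)
        if not validator(origin, tmp_path):
            return False
        os.replace(tmp_path, live_path)
        return True
    finally:
        # Clean up the candidate if it was rejected or an error occurred.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Because the live file is only swapped after validation passes, a corrupted or empty candidate leaves the last known-good zone in place. The named service would still need to be reloaded after a successful deploy, which is outside the scope of this sketch.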
Conclusion
This incident highlighted gaps in zone file validation. More frequent DNS zone scanning and linting is required as a long-term mitigation. The planned mitigations aim to prevent recurrence and ensure faster resolution if similar issues arise.