====== When Things Go Wrong ======

===== Email Outage =====

  * External-facing Linux server running BIND.
  * The zone was updated after a long time, but Linux file permissions left the zone file locked and unusable.
  * The zone included MX records, which then became unavailable.
  * This caused an email outage.
  * The Linux admin team decided they did not want DNS service maintenance to be part of their role, so they moved it to Infoblox.

===== Web Outage =====

  * NIOS hidden primary.
  * A third-party DNS provider received the zones via zone transfer to host them publicly.
  * The third party was acquired by another company.
  * The acquiring company migrated systems.
  * Everything worked.
  * The acquiring company deleted the migration systems.
  * It turned out that a few customers had their TSIG keys tied up in the migration systems.
  * Those TSIG keys no longer existed.
  * Zone transfers failed.
  * Alerts for zone transfer failures... also failed.
  * After a week, the secondary servers dropped the zones they could no longer update.
  * Result: an outage of a critical website until the customer updated public DNS to "unhide" their NIOS boxes.
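The week-long gap before the zones disappeared is consistent with how SOA timers work: a secondary keeps serving its last copy of a zone until the SOA expire interval passes without a successful refresh, and then stops answering for it. The incident's actual SOA values are not stated, so the sketch below assumes a common expire value of 604800 seconds (7 days). It is a minimal, hypothetical monitoring check; ''SoaTimers'', ''transfer_health'' and ''time_left'' are illustrative names, not a NIOS or BIND API. The idea is to flag a zone as stale as soon as refreshes start failing, long before it expires.

<code python>
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SoaTimers:
    """SOA timer fields in seconds (assumed example values, not from the incident)."""
    refresh: int  # how often the secondary checks the primary's serial
    retry: int    # how often it retries after a failed check
    expire: int   # how long it keeps serving without a successful refresh


def transfer_health(timers: SoaTimers, last_success: datetime, now: datetime) -> str:
    """Classify a secondary's copy of a zone by time since its last successful refresh."""
    age = now - last_success
    if age >= timedelta(seconds=timers.expire):
        # The secondary stops answering for the zone at this point.
        return "EXPIRED"
    if age > timedelta(seconds=timers.refresh + timers.retry):
        # Assumed heuristic: a refresh and a retry window have both passed
        # without success, so alert now while the zone is still being served.
        return "STALE"
    return "OK"


def time_left(timers: SoaTimers, last_success: datetime, now: datetime) -> timedelta:
    """Time remaining before the secondary drops the zone (zero or negative once expired)."""
    return last_success + timedelta(seconds=timers.expire) - now


if __name__ == "__main__":
    timers = SoaTimers(refresh=3600, retry=900, expire=604800)  # expire = 7 days
    last_ok = datetime(2024, 1, 1, tzinfo=timezone.utc)  # hypothetical last good transfer
    for days in (0, 1, 6, 7):
        now = last_ok + timedelta(days=days)
        print(days, transfer_health(timers, last_ok, now), time_left(timers, last_ok, now))
</code>

Alerting on ''STALE'' rather than waiting for ''EXPIRED'' leaves most of the expire window to repair a missing TSIG key or broken transfer path before the public secondaries drop the zone.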