Infrastructure

It’s 2am. Something is broken in production. An engineer SSHes into the box, finds the problem (a config file with the wrong value), fixes it, restarts the service, watches the metrics recover. Crisis averted. Everyone goes back to sleep. The Ansible playbook never learns about the fix. The next scheduled run either overwrites it or, more likely, doesn’t run at all because nobody wants to roll the dice on a Friday. Six months later, someone re-provisions the server from the same playbook and the old bug is back. Nobody connects the dots for another two weeks.