HELP! The Site’s Down!
Subject: HELP! The Site’s Down!

On a long enough timeline, every web developer will receive this email. Every web developer learns to dread this email. Your day just got a whole lot worse.

Imagine a major site your team is responsible for just went down – or started timing out, throwing errors, booting users off, etc. You don’t actually know the problem yet. All you know is that people are freaking out. The boss is yelling at you, clients are calling, an executive from another business unit is demanding an update.

Your adrenaline starts pumping, you feel that pit forming in your stomach, and if you are not extremely careful, you are about to make some bad decisions. However, there are some steps you can take to minimize risk.

Take a deep breath. We’ve all been there.

First, Do No Harm

The biggest engineering mistakes I’ve made in my career have occurred while trying to fix an urgent problem. This is true for many world-class engineers I know.

There is an old saying in politics: It’s not the crime; it’s the cover-up that does the most damage. It’s similar in incident management; the response often causes a bigger calamity than the original problem.

Before you take any action or make any changes, you need to understand this: the wrong action will make the situation far worse. Don’t let the fear of a mistake paralyze you, but remember that the system being down or slow is not actually the worst-case scenario.

What could be worse? Data loss. Data leakage. Security breaches. Improperly charging customers. These situations are almost always an order of magnitude worse than an outage. Between us, we could name about a hundred other ways things could get much worse.

So even though this incident feels urgent – and it probably is – our first obligation is to not make things worse.

Doubt Your Instincts

There are many reasons people make bad decisions in a crisis. A big part of it is simple human biology; it’s nearly impossible to think clearly while panicking, which most people do in a crisis.

Practice and repeated exposure can lessen that instinctive reaction, but it’s not something you can control entirely. It helps to be aware of it, but you’ll still make bad decisions, even accounting for that fact. Don’t completely disregard your instincts, but verify hunches before acting on them.

Slow Down to Speed Up

Another reason for mistakes is that in a crisis, people want to take action, to do something, anything, to try to fix the problem. Other stakeholders want to see something being done. There’s pressure to act now. We know that’s probably a bad idea.

There’s a mantra in performance management that sometimes you need to slow down in order to speed up, i.e., being more deliberate and methodical in the short term can help you achieve your objectives faster over the long term. This is often true in incident response as well: extra caution in the first half hour can greatly shorten the total time to resolution.

So how do we push back against our bad instincts and all of this pressure to act quickly?

Get a Second Opinion, Always

Ideally you’d have another engineer to talk through these problems with, but if not, grab anyone who can listen and ask questions. Even Dan from accounting, …
This article first appeared on simplethread.com. If you'd like to keep reading, follow the white rabbit.