In 1999, when I was tasked with setting up an organization dedicated to performance, monitoring, site reliability, and so on, I looked for guidance: books, gurus, mentors… There were none. But I did end up finding this amazing quote from Rudyard Kipling: “I keep six honest serving-men (They taught me all I knew); Their names are What and Why and When And How and Where and Who.”
Thanks to the Kipling quote, we set up guidelines and rules for troubleshooting, and ultimately made the Quality of Services group a success at DoubleClick and, to this day, at Google. We always used our six friends when we had a performance or system reliability issue at 3 am. They allowed us to ask the right questions during a crisis, filter out the noise, and finally focus on finding the culprit and fixing it.
Over time, we tuned our monitoring and data-gathering capabilities so that we could answer all of those questions very quickly. We used these questions to build a process.
Here are some examples:
What? Is there a problem? Is it one monitoring tool or multiple that turned red? Is it only one node or multiple monitoring nodes, in the case of a 3rd-party tool? Is it the DB?…
Where? Where is the problem? Is it specific to a country? A city? A customer? A datacenter?
Who? Who is affected? All customers? Some customers? Just the monitoring tools?
When? When did it start? Does it always happen? Is it sporadic?
Why? Did something change? Did we roll out something? Did the environment change? Did the customer behavior change?
How? How did this happen? Change management? Router upgrade? Did someone upload a crazy ad?
These are just examples, and in some situations one of the “honest serving-men” will not be necessary. But I think if you apply a similar protocol in the ops world, you might reach a resolution faster.
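For readers who like to see a protocol written down, here is a minimal sketch of the six questions as a checklist that tracks what is still unanswered during an incident. The names `SIX_QUESTIONS` and `IncidentTriage` are hypothetical and made up for illustration. They do not come from any tool mentioned in this post, and the sample findings are invented.

```python
from dataclasses import dataclass, field

# The six questions, in the order listed above (prompts are abbreviated).
SIX_QUESTIONS = {
    "what": "Is there a problem? One monitoring tool or several?",
    "where": "Is it specific to a country, city, customer, or datacenter?",
    "who": "Who is affected? All customers, some, or just the monitoring tools?",
    "when": "When did it start? Does it always happen, or is it sporadic?",
    "why": "Did something change? A rollout, the environment, customer behavior?",
    "how": "How did this happen? Change management, an upgrade, a bad upload?",
}


@dataclass
class IncidentTriage:
    """Hypothetical record of the answers gathered during one incident."""

    answers: dict = field(default_factory=dict)

    def answer(self, question: str, finding: str) -> None:
        """Record a finding for one of the six questions."""
        key = question.lower()
        if key not in SIX_QUESTIONS:
            raise KeyError(f"unknown question: {question}")
        self.answers[key] = finding

    def open_questions(self) -> list:
        """Return the questions that nobody has answered yet, in order."""
        return [q for q in SIX_QUESTIONS if q not in self.answers]


# Invented example: two questions answered, four still open.
triage = IncidentTriage()
triage.answer("What", "Two monitoring tools turned red")
triage.answer("Where", "One datacenter only")
for key in triage.open_questions():
    print(f"{key.upper()}: {SIX_QUESTIONS[key]}")
```

The point of the sketch is only that the open questions stay visible until someone answers them. Any question that does not apply to a given incident can simply be answered with "not relevant".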
Enjoy our six mutual friends.
Mehdi – Catchpoint