One persistent myth surrounding programmers is that our only task is to sit in front of a screen and convert caffeine to code. But there are job duties that go beyond programming, and one of the critical ones is production support.
Production support, or being on call, involves dealing with issues that pop up on the live application from time to time. Different organizations have different ways of setting up support systems. Whatever the setup, situations do arise where the developer needs to jump in and help solve the issue then and there.
Some good practices that help developers during on-call issues are:
Know Where to Look
Monitoring and logging tools are crucial for a robust software system. When on a support call, it is vital to know where to start looking for answers. Having a bookmark of the critical graphs that reflect the system health is a must. Knowing which logs to query and which graphs to check helps save valuable time when dealing with a support issue.
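One way to keep those bookmarks handy is a small lookup table that maps a symptom to the graphs and log queries to check first. The sketch below is only an illustration: the symptom names, dashboard URLs, and query strings are all hypothetical placeholders, and a real team would fill them in from its own monitoring and logging tools.

```python
from typing import Dict, List

# Hypothetical bookmarks: every symptom name, URL, and query here is a placeholder.
RUNBOOK: Dict[str, Dict[str, List[str]]] = {
    "high_error_rate": {
        "graphs": ["https://dashboards.example.com/error-rate"],
        "log_queries": ["level=ERROR service=web"],
    },
    "slow_responses": {
        "graphs": ["https://dashboards.example.com/latency"],
        "log_queries": ["duration_ms>2000 service=web"],
    },
}

# Fallback for symptoms nobody has written an entry for yet.
DEFAULT_ENTRY: Dict[str, List[str]] = {
    "graphs": ["https://dashboards.example.com/overview"],
    "log_queries": [],
}


def where_to_look(symptom: str) -> Dict[str, List[str]]:
    """Return the graphs and log queries to check first for a symptom."""
    return RUNBOOK.get(symptom, DEFAULT_ENTRY)


for kind, items in where_to_look("high_error_rate").items():
    print(kind, "->", items)
```

Keeping the table in version control next to the code is one option; a shared bookmarks folder would serve the same purpose.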
Assess the Situation
Try to identify quickly what the problem is and how one can help. Usually, site reliability engineers will page or call the on-call developer. Getting in front of one’s machine, pulling up the necessary graphs and understanding the impact of the problem should be among the first steps. Then, figure out the course of action necessary to deal with the issue. Sometimes there is no time to actually fix the problem, and the best course is to take steps to mitigate it. Sometimes it may be a false alarm. Whatever the case, before taking any corrective steps, we would do well to confirm what the actual problem is.
Do Not Make Assumptions
One of the biggest mistakes we can make is assuming what the problem is before actually confirming it. We’ve been there before. We saw bot attacks last week and assume that it’s the same this week. But it turns out to be real users hitting the site, and they’re actually having problems. Even if a problem is a recurring one, we should diagnose it using the monitoring tools before declaring what it is. In the software engineering world, assuming is not knowing.
Cut Through The Nonsense
Many times during a production support call, we tend to spout vague drivel. Sometimes it will be the other people on the call who go off on tangents. Phrases like ‘Why wasn’t this caught before’ or ‘Who is responsible for this’ will come thick and fast. We must remember that at that time, the only focus should be on fixing the problem at hand. Anything else is a waste of time. Asking incisive questions about the problem, its impact, the health metrics of the system and other domain-specific details should help focus the conversation. If a discussion is not helping in fixing the problem, it’s not helping at all.
Keep Calm
Some of us have been there before. The site is crashing. People are shouting on a conference call. There is complete pandemonium.
Relax.
Take a deep breath.
Running around like headless chickens during a critical issue helps no one. Indeed, some developers would rather scream and stress others out than actually help mitigate the problem. Remaining calm and collected will help resolve the issue faster. We need to stay focused on the problem at hand rather than losing our heads.
Bring In Reinforcements
Sometimes we might not be able to identify or fix the support issue. At such a time, it makes sense to call on other developers for help. We might think that as the others are not on call, they shouldn’t be disturbed. Respecting off-call time is a good culture to follow, but when the going gets rough, it helps to bring in someone who can diagnose the problem faster. Also, when we’re off-call, it is good practice to help out if someone does reach out to us. Two heads, or even three, often prove to be better than one.
Write It Down
Once the issue is solved, no matter how trivial, it would be beneficial to have a record of it. Teams should have a template for recording support issues. It should include the issue description, the steps taken to find the problem and the fix put in for the problem. This report should be hosted on a public wiki or FAQ section so that other developers have a quick reference when they are on call.
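As a minimal sketch of such a template, the record below captures the three parts mentioned above: the issue description, the steps taken to find the problem, and the fix. The class name, field names, and sample incident are all made up for illustration; any real template would follow the team's own wiki conventions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentReport:
    """Hypothetical record for one support issue; field names are illustrative."""

    title: str
    description: str
    diagnosis_steps: List[str] = field(default_factory=list)
    fix: str = ""

    def to_text(self) -> str:
        """Render the report as plain text that can be pasted into a wiki page."""
        lines = [
            f"Issue: {self.title}",
            f"Description: {self.description}",
            "Steps taken to find the problem:",
        ]
        lines += [
            f"  {i}. {step}"
            for i, step in enumerate(self.diagnosis_steps, start=1)
        ]
        lines.append(f"Fix: {self.fix or 'not recorded yet'}")
        return "\n".join(lines)


# A made-up incident, used only to show the shape of a filled-in report.
report = IncidentReport(
    title="Checkout page timeouts",
    description="Checkout requests timed out for some users.",
    diagnosis_steps=[
        "Checked the error-rate graph",
        "Queried the application logs for timeout errors",
    ],
    fix="Rolled back the most recent deployment",
)
print(report.to_text())
```

Numbering the diagnosis steps keeps the order in which things were checked, which is often what the next on-call developer needs most.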
On-call duties might not be the most rewarding or glamorous part of a software developer's life. But a balanced mindset and a keenness to help can make our on-call days a little bit easier.