The Art of Software Triage: the Trained Eye

David Peček
Aug 6, 2019

Updated: Sep 11, 2020

I've often been asked to document how I manage to be so effective at triage. The problem is that it's not one thing I do to find a problem. It's a set of skills that, combined, give me a trained eye for knowing where to look so I can understand the problem quickly. Here I will try to outline some of the skills needed and how to use them to be as effective as you can at software triage.

Triage is not one thing; it's the combined knowledge of every aspect of the software stack, from the base infrastructure to end-customer behavior, that makes an effective triage engineer.

Skillset Areas

What makes an effective dedicated triage engineer? The best phrase that comes to mind is jack of all trades, or tech ninja. These are the key areas I have found necessary for developing that inkling of where to look first, an inkling that is usually correct.

Software Development: a basic development background in any language makes for a good triage engineer. Knowing the fundamentals of an object-oriented language and a scripting language gives you the foundation for knowing where software problems can lie. You need to understand how to read logs, decipher exceptions, and interpret application configurations. Engineers should also know how to check the status of running applications and troubleshoot them.

Database administration: applications often have issues related to the database, so a good working knowledge of the database is key for application triage. I have seen incorrectly used indexes cause major slowdowns, with queries taking far too long to return. Other times applications have caused data corruption that needed to be cleaned up. Race conditions or deadlocks cause applications to freeze. You will need to know best practices for table setup and normalization, advanced SQL for querying, and how to look into the backend of the database to analyze running queries and performance issues. It's also good to know what healthy data structures generated by these applications look like, so you can spot issues in the database.
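As a rough illustration of checking whether a query uses an index, here is a minimal sketch using Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN statement. The orders table, its columns, and the index name are made up for the example; your own database will have its own tooling for inspecting query plans, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


# Hypothetical table, held in memory purely for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, status) VALUES (?, ?)",
    [(i % 100, "open") for i in range(1000)],
)

sql = "SELECT * FROM orders WHERE customer_id = ?"

# Without an index on customer_id, SQLite has to scan the whole table.
plan_before = query_plan(conn, sql, (42,))

# After adding the index, the plan should mention it by name.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

A plan that reports a full scan on a large table for a query that runs often is usually the first thing worth questioning.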

Application infrastructure: depending on which platform your applications are running on, you should be an expert in the maintenance and operation of the containers running your applications. Key things to know are how to start, stop, deploy, roll back, and scale them. When running in virtual environments, understanding the containers the applications run in, their limits, and their behavior at those limits is also helpful.

Deep application knowledge: it goes without saying, but you need to know your business and how your company has implemented it in the software you are supporting. Being a subject matter expert helps you tell whether the customer is misunderstanding the issue or the software is behaving in a way that is inconsistent with business expectations.

Networking / IT / Security: while this is less common with modern application frameworks, I have seen problems arise from simple networking issues or incorrectly configured security groups that prevented applications from talking to each other. It's important to understand the foundational layers where your applications reside so you can troubleshoot effectively when problems do arise in this area.

Always Looking for Clues

When you first read through a problem, these are the things a seasoned triage engineer looks for. Basically, you are looking for what is out of the ordinary. With your knowledge of correctly running systems, applications, and data structures, you should be able to look around and determine where things have gone awry.

Is the data structure correct? If you have a data visualization tool for the backend, use it to see what the raw data says for the hierarchy of data you are looking at. Where are pieces missing, or flags or statuses not set correctly?
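When no visualization tool is available, the same check can be scripted. The sketch below assumes a purely hypothetical order record with items, a status, and a shipped_at field; the rules it applies are illustrative stand-ins for whatever "healthy" means in your own data.

```python
def find_anomalies(order):
    """List the ways a (hypothetical) order record differs from a healthy one."""
    problems = []

    # Missing pieces: a healthy order always has at least one item.
    if not order.get("items"):
        problems.append("order has no items")

    # Statuses not set: every item should carry a status.
    for item in order.get("items", []):
        if item.get("status") is None:
            problems.append(f"item {item.get('id')} has no status")

    # Inconsistent flags: a shipped order should record when it shipped.
    if order.get("status") == "shipped" and order.get("shipped_at") is None:
        problems.append("status is 'shipped' but shipped_at is not set")

    return problems


healthy = {
    "id": 1,
    "status": "shipped",
    "shipped_at": "2019-08-06T10:00:00",
    "items": [{"id": "a", "status": "packed"}],
}
broken = {
    "id": 2,
    "status": "shipped",
    "shipped_at": None,
    "items": [{"id": "b", "status": None}],
}

print(find_anomalies(healthy))
print(find_anomalies(broken))
```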

Where did the processing stop? Data often flows along a chain in the system. One trick is to follow the chain and see where the logs or the data stopped. That is where the problem lies.
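Following the chain can be sketched in a few lines. This example assumes a made-up log format in which each line carries id=... and stage=... fields, and a made-up list of pipeline stages; adapt both to whatever your system actually writes.

```python
import re

# Assumed log format: "<timestamp> id=<record id> stage=<stage name>"
LINE_RE = re.compile(r"\bid=(?P<id>\S+)\s+stage=(?P<stage>\S+)")


def where_did_it_stop(log_lines, record_id, stages):
    """Return (last stage reached, first stage never reached) for one record."""
    seen = set()
    for line in log_lines:
        match = LINE_RE.search(line)
        if match and match.group("id") == record_id:
            seen.add(match.group("stage"))

    last_reached = None
    for stage in stages:
        if stage not in seen:
            return last_reached, stage
        last_reached = stage
    return last_reached, None  # the record made it through every stage


# Hypothetical pipeline and log excerpt.
STAGES = ["received", "validated", "enriched", "persisted", "notified"]
LOGS = [
    "2019-08-06T10:00:01 id=order-17 stage=received",
    "2019-08-06T10:00:01 id=order-18 stage=received",
    "2019-08-06T10:00:02 id=order-17 stage=validated",
    "2019-08-06T10:00:02 id=order-18 stage=validated",
    "2019-08-06T10:00:03 id=order-18 stage=enriched",
    "2019-08-06T10:00:04 id=order-18 stage=persisted",
    "2019-08-06T10:00:05 id=order-18 stage=notified",
]

print(where_did_it_stop(LOGS, "order-17", STAGES))
print(where_did_it_stop(LOGS, "order-18", STAGES))
```

Here order-17 was validated but never enriched, so the enrichment step is where to look next.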

What caused the crash? If an application went down, what are the common failures with that application? What do the logs say around that time? Were any heap / thread dumps generated by the application before it crashed?
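To see what the logs say around the time of a crash, a small filter is often enough. The sketch below assumes each log entry starts with an ISO 8601 timestamp followed by a space; lines without a timestamp, such as stack traces, are treated as belonging to the entry above them. The sample lines and the two-minute window are invented for illustration.

```python
from datetime import datetime, timedelta


def lines_near(log_lines, crash_time, window=timedelta(minutes=2)):
    """Keep log entries whose timestamp is within `window` of `crash_time`."""
    nearby = []
    keep = False
    for line in log_lines:
        stamp = line.split(" ", 1)[0]
        try:
            keep = abs(datetime.fromisoformat(stamp) - crash_time) <= window
        except ValueError:
            # No timestamp: a continuation line (e.g. a stack trace) that
            # inherits the decision made for the entry it belongs to.
            pass
        if keep:
            nearby.append(line)
    return nearby


# Hypothetical log excerpt.
LOGS = [
    "2019-08-06T09:50:00 INFO worker started",
    "2019-08-06T09:59:30 WARN heap usage at 95%",
    "2019-08-06T10:00:05 ERROR request failed",
    "java.lang.OutOfMemoryError: Java heap space",
    "2019-08-06T10:10:00 INFO worker restarted",
]

crash_time = datetime(2019, 8, 6, 10, 0, 0)
for line in lines_near(LOGS, crash_time):
    print(line)
```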

Which components could not communicate? If you have a system where applications need to communicate or messages are routed between applications, follow the path to see where the communication or message stopped.
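One simple way to follow the path is to test each hop with a plain TCP connection. This is only a sketch: the hop names are hypothetical, and the demo uses a local listener and a closed local port as stand-ins for real services. A successful TCP connection also only shows that the port is reachable, not that the application behind it is healthy.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_broken_hop(hops, timeout=2.0):
    """Walk the hops in order and return the name of the first unreachable one."""
    for name, host, port in hops:
        if not can_connect(host, port, timeout):
            return name
    return None


# Demo: a local listener stands in for a reachable service.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen()
open_port = listener.getsockname()[1]

# Bind and immediately close a socket to obtain a port nothing listens on.
placeholder = socket.socket()
placeholder.bind(("127.0.0.1", 0))
closed_port = placeholder.getsockname()[1]
placeholder.close()

hops = [
    ("api -> queue", "127.0.0.1", open_port),
    ("queue -> worker", "127.0.0.1", closed_port),
]
broken = first_broken_hop(hops)
listener.close()
print(broken)
```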

Hopefully this list of questions and symptoms will guide you towards a faster and more accurate diagnosis of any issues you might encounter. Happy triaging!
