This is the final post in a series of four blog posts on troubleshooting. The first three parts can be found here, here and here.
At the risk of sounding like a debater, let me begin with a definition: The Oxford English Dictionary defines Analysis as
Detailed examination of the elements or structure of something
Another good word for this step would be Theorising. The OED gives us:
Form a theory or theories about something
Basically, we look at the information we have gathered and try to come up with an explanation for the occurrence. The order matters: if you theorise before you have data, the temptation is to make the data fit the theory rather than the theory fit the data. Holmes was very clear on the unacceptability of this in the previous post! Most of this phase changes from issue to issue, and it is often done in parallel with the data collection phase (sometimes the cause is obvious as soon as you get some information on the problem). While there is much variation in how the analysis proceeds, there are a few general questions that always need to be answered:
- Patch levels: is the product on a supported version?
- Does the issue description match a known bug?
- Check the event logs around the time of the known occurrences of the issue. Record any items of note even if they don’t appear relevant at the moment, as they may be useful later.
- Confirm it’s not a configuration issue.
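As a rough illustration of the event log check above, here is a minimal Python sketch that keeps only the records falling within a window around the known occurrences. The record layout, the `events_near` helper, the ten minute window and the sample data are all assumptions made up for this example; real logs would need to be exported and parsed into this shape first.

```python
from datetime import datetime, timedelta


def events_near(events, occurrences, window=timedelta(minutes=10)):
    """Return events whose timestamp is within `window` of any occurrence."""
    return [
        e for e in events
        if any(abs(e["time"] - t) <= window for t in occurrences)
    ]


# Hypothetical sample records, not real log output.
events = [
    {"time": datetime(2024, 1, 5, 9, 58), "source": "ServiceA",
     "message": "connection timed out"},
    {"time": datetime(2024, 1, 5, 12, 30), "source": "ServiceB",
     "message": "scheduled task completed"},
    {"time": datetime(2024, 1, 5, 10, 4), "source": "Disk",
     "message": "retry on volume"},
]

# Times at which the issue is known to have occurred (also hypothetical).
occurrences = [datetime(2024, 1, 5, 10, 0)]

nearby = events_near(events, occurrences)
for e in sorted(nearby, key=lambda e: e["time"]):
    print(e["time"].isoformat(), e["source"], e["message"])
```

Everything inside the window is kept, whether or not it looks relevant yet, which matches the advice to record items of note for later.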
The final phase – and if you’re working in a customer facing support role, the most important one – is giving results to the interested parties.
The end result of the analysis will take one of four forms:
- The cause is identified and the issue resolved
- The cause is not identified but an acceptable workaround is found
- The cause is not identified and no workaround is possible, so the issue needs to be escalated to the vendor
- The cause is not identified and you have to refine the approach and find more information
Of the above results, I would honestly only consider #1 and #3 to be the end of the matter. With #1 you have resolved the problem, and with #3 it’s escalated to the vendor: you still have to work with them, but with specialist help a solution should be found or at least a cause identified. I would consider #2 to be OK in the short to medium term but not a valid solution in the long term, and obviously #4 means you have to start the whole process again.
Hopefully the time spent reading these posts will help refine whatever method you use to troubleshoot issues.
This is the third post in a series of four. Parts 1 and 2 are available here and here.
For a completely fictional character, Sherlock Holmes does give some great quotes
“It is a capital mistake to theorize before one has data. Insensibly, one begins to twist facts to suit theories, instead of theories to suit facts.”
To solve any problem you need data on what is happening. Without data, you are effectively playing darts blindfolded. Data takes many forms and it’s not all about log collection. In all honesty, logs are usually one of the last things you look for. Unless you have some idea of what you are looking for, there is just too much information to parse. The only reasonably common exception to this would be the Windows event logs, as (a) they really don’t record verbose logs and (b) if you have timings for when the problem occurred, you can really focus your searches. Generally, though, the answers to the scoping questions (listed in the previous post) are much more useful during the initial phase of troubleshooting. Of the scoping questions, the most important are:
- Has this ever worked?
- Is it a supported configuration?
If the configuration is never going to work you are trying to perform a miracle rather than troubleshoot, and miracles in production environments are best avoided.
It’s hard to give specific recommendations about what data to collect, as this will vary massively from problem to problem. In fact, in my experience the scoping questions usually end up being the data collection for all but the most intractable of problems. While I can’t give specific recommendations about the information required, some general information is always needed:
- Environmental information (virtualisation, physical, operating system, etc.)
- Software versions and hotfix/patch level
- For Windows servers, Event logs are always a good place to start
- Reproducibility or timings of when the issue occurs/occurred
- Screenshots of error messages or videos of the event happening.
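One way to keep track of the general information above is a simple checklist that reports what has not been collected yet. The following is only a sketch: the item names, the `missing_items` helper and the sample case notes are hypothetical and simply paraphrase the list.

```python
# Hypothetical checklist keys, paraphrasing the list above.
REQUIRED_ITEMS = [
    "environment",
    "software_versions",
    "event_logs",
    "timings_or_repro",
    "screenshots_or_video",
]


def missing_items(collected):
    """Return the checklist items that have no value recorded yet."""
    return [item for item in REQUIRED_ITEMS if not collected.get(item)]


# Example case notes (made up); empty or absent entries count as missing.
case_notes = {
    "environment": "Virtualised, Windows Server",
    "software_versions": "",
    "timings_or_repro": "Occurs every morning around 10:00",
}

still_needed = missing_items(case_notes)
print("Still to collect:", ", ".join(still_needed))
```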
That last one may seem strange, but that whole “a picture is worth a thousand words” idea isn’t a total exaggeration. In a previous job with a software vendor, I once had a case where a customer was having a graphical display problem with the software. After troubleshooting it for some time and getting nowhere, I eventually showed the video the customer had sent me to a colleague, who knew exactly what bug it was and supplied me with a fix. This illustrates two things: 1) don’t be afraid to ask for help and 2) pictures/videos can be the bit you need to solve the problem.