Troubleshooting Methodology 4: Analysis and Results


This is the final post in a series of four blog posts on troubleshooting. The first three parts can be found here, here and here.

At the risk of sounding like a debater, let me begin with a definition. The Oxford English Dictionary defines analysis as:

Detailed examination of the elements or structure of something

Another good word for this step would be theorising. The OED gives us:

Form a theory or theories about something

Basically, we look at the information we have gathered and try to come up with an explanation for the occurrence. The order matters: if you theorise before you have data, the temptation is to make the data fit the theory rather than the theory fit the data. Holmes was very clear on the unacceptability of this in the previous post! Most of this phase changes from issue to issue, and it is often done in parallel with the data collection phase (sometimes the cause is obvious as soon as you get some information on the problem). While there is much variation in how the analysis proceeds, there are a few general questions that always need to be answered:

  • Patch levels: is the product on a supported version?
  • Does the issue description match a known bug?
  • Check event logs around the time of the known occurrences of the issue. Any items of note should be recorded even if they don’t appear relevant at the moment; they may be useful later
  • Confirm it’s not a configuration issue.
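As a rough illustration of the event log check above, here is a minimal Python sketch that keeps only the entries logged close to a known occurrence. It assumes the log has already been exported and parsed into (timestamp, message) pairs; the function name events_near, the 10 minute window and the sample entries are all made up for the example.

```python
from datetime import datetime, timedelta


def events_near(events, incidents, window_minutes=10):
    """Return events whose timestamp is within +/- window_minutes of any incident."""
    window = timedelta(minutes=window_minutes)
    return [
        (ts, msg)
        for ts, msg in events
        if any(abs(ts - incident) <= window for incident in incidents)
    ]


# Hypothetical sample data: (timestamp, message) pairs from a log export.
events = [
    (datetime(2014, 3, 1, 9, 0), "Service started"),
    (datetime(2014, 3, 1, 14, 2), "Disk latency warning"),
    (datetime(2014, 3, 1, 14, 7), "Application hang detected"),
    (datetime(2014, 3, 1, 18, 30), "Backup completed"),
]
incidents = [datetime(2014, 3, 1, 14, 5)]  # when the user reported the problem

nearby = events_near(events, incidents)
for ts, msg in nearby:
    print(ts.isoformat(), msg)
```

Widening the window is cheap, so it is usually worth starting narrow and expanding only if nothing of note turns up.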

 

Results:

The final phase – and if you’re working in a customer facing support role, the most important one – is giving results to the interested parties.

The end result of the analysis will take one of four forms:

  1. The cause is identified and the issue resolved
  2. The cause is not identified but an acceptable workaround is found
  3. The cause is not identified and no workaround is possible, so the issue needs to be escalated to the vendor
  4. The cause is not identified and you have to refine the approach and find more information

 

Of the above results, I would honestly only consider #1 and #3 to be the end of the matter. With #1 you have resolved the problem. With #3 the issue is escalated to the vendor; you still have to work with them, but with specialist help a solution should be found, or at least a cause identified. I would consider #2 to be OK in the short to medium term but not a valid solution in the long term, and obviously #4 means you have to start the whole process again.
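If you track cases in a script or a ticketing tool, the four outcomes map neatly onto a small enumeration. The sketch below is only one way to write it down; the names Outcome, FOLLOW_UP and is_end_of_matter are invented for the example, and the follow-up text is my paraphrase of the paragraph above.

```python
from enum import Enum


class Outcome(Enum):
    RESOLVED = 1     # cause identified and issue resolved
    WORKAROUND = 2   # cause not identified, acceptable workaround found
    ESCALATED = 3    # cause not identified, no workaround, escalated to vendor
    REFINE = 4       # cause not identified, approach needs refining


# Paraphrased follow-up for each outcome (wording is illustrative only).
FOLLOW_UP = {
    Outcome.RESOLVED: "done",
    Outcome.WORKAROUND: "fine short to medium term; keep looking for the cause",
    Outcome.ESCALATED: "work with the vendor",
    Outcome.REFINE: "start the process again",
}


def is_end_of_matter(outcome):
    """Only outcomes #1 and #3 count as the end of the matter."""
    return outcome in (Outcome.RESOLVED, Outcome.ESCALATED)


for outcome in Outcome:
    print(outcome.value, outcome.name, is_end_of_matter(outcome), FOLLOW_UP[outcome])
```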

Hopefully the time spent reading these posts will help refine whatever method you use to troubleshoot issues.

Troubleshooting Methodology 3: Data collection


This is the third post in a series of four. Parts 1 and 2 are available here and here.

For a completely fictional character, Sherlock Holmes does give some great quotes:

“It is a capital mistake to theorize before one has data. Insensibly, one begins to twist facts to suit theories, instead of theories to suit facts.”

To solve any problem you need data on what is happening. Without data, you are effectively playing darts blindfolded. Data takes many forms and it’s not all about log collection. In all honesty, logs are usually one of the last things you look at: unless you have some idea of what you are looking for, there is just too much information to parse. The only reasonably common exception is the Windows event logs, as (a) they are not particularly verbose and (b) if you have timings for when the problem occurred you can really focus your searches. Generally, though, the answers to the scoping questions (listed in the previous post) are much more useful during the initial phase of troubleshooting. Of the scoping questions, the most important are:

  • Has this ever worked?
  • Is it a supported configuration?

If the configuration is never going to work you are trying to perform a miracle rather than troubleshoot, and miracles in production environments are best avoided.

It’s hard to give specific recommendations about what data to collect, as this varies massively from problem to problem. In fact, in my experience the answers to the scoping questions usually end up being all the data you need for all but the most intractable problems. While I can’t give specific recommendations, some general information is always required:

  • Environmental information (virtualised or physical, operating system, etc.)
  • Software versions and hotfix/patch level
  • For Windows servers, the event logs are always a good place to start
  • Reproducibility or timings of when the issue occurs/occurred
  • Screenshots of error messages or videos of the event happening.

That last one may seem strange, but the whole “a picture is worth a thousand words” idea isn’t a total exaggeration. In a previous job with a software vendor I once had a case where a customer was having a graphical display problem with the software. After troubleshooting it for some time and getting nowhere, I eventually showed the video the customer had sent me to a colleague, who knew exactly what bug it was and supplied me with a fix. This illustrates two things: 1) don’t be afraid to ask for help and 2) pictures/videos can be the bit you need to solve the problem.

Troubleshooting Methodology 2: Scoping the Problem

Part 1 is available here

The following Einstein quote is probably apocryphal, but that doesn’t make it any less useful:

“If I had an hour to solve a problem and my life depended on the solution, I would spend the first 55 minutes determining the proper question to ask, for once I know the proper question, I could solve the problem in less than five minutes.”

Scoping the problem, or determining the proper question to ask, is honestly the most important part of the whole troubleshooting process: if you don’t know what the problem is, how can you fix it? Also, if you don’t define the problem, how do you know it’s a problem? That one sounds a little odd, but it is true: the “issue” could just be a configuration that is never going to work.

A long time ago I used to be a broadband support tech at an ISP. Wireless routers were new on the market and just being rolled out. Soon after I started supporting broadband (previously I was supporting only dial-up), I had a call with a customer who couldn’t get their internet working. After about 5 minutes struggling to get a handle on the problem and checking some things on the computer, I went back to the beginning and asked them where their router was. The response was, “Oh that thing? That’s in the shed. Why would I need it – the internet is wireless.”

This one has always stuck in my head. If I had nailed down what the issue was at the beginning, it would have been a faster solution and there would have been far less messing around. It also illustrates that sometimes the issue may not be a technical fault but a configuration problem. The setup the customer had was obviously never going to work. Correct scoping helps fix these too, and helps keep you clear of rabbit holes.

For scoping a problem I generally have a list of questions that I run through. Some, or even all of them, may not apply to every case, but it’s a good starting point.

The List:

  • What is happening? By this I don’t mean the overall issue, I mean the symptoms of the issue. Computer blue screening would be an obvious example of a symptom rather than an issue.
  • Has this setup ever worked? Is it a new configuration or something that has been in place for some time?
  • If it has worked, when was the last time it worked?
  • What is the configuration trying to accomplish?
  • How many issues are they experiencing? If more than one, are they all different or related?
  • How many users is this affecting? How many servers is this affecting?
  • How severe is the problem? System crashes, slowness, application crashes or just a vague feeling of unease?
  • If intermittent, how frequent or random is it?
  • Is it reproducible?
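One way to make sure the list actually gets run through is to record the answers in a small structure and let a script point out what is still unanswered. The following is a minimal Python sketch of that idea; the Scope class, its field names and the open_questions helper are hypothetical and cover only some of the questions above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scope:
    symptoms: str
    ever_worked: bool
    last_worked: Optional[str] = None
    users_affected: int = 0
    servers_affected: int = 0
    reproducible: bool = False


def open_questions(scope):
    """List the scoping gaps that still need an answer."""
    gaps = []
    if not scope.ever_worked:
        gaps.append("Is this a configuration that can work at all?")
    elif scope.last_worked is None:
        gaps.append("When was the last time it worked?")
    if scope.users_affected == 0 and scope.servers_affected == 0:
        gaps.append("How many users and servers are affected?")
    if not scope.reproducible:
        gaps.append("If intermittent, how frequent or random is it?")
    return gaps


# The router-in-the-shed call, recorded after the fact.
call = Scope(
    symptoms="No internet connection",
    ever_worked=False,
    users_affected=1,
    reproducible=True,
)
print(open_questions(call))
```

For the router call the only gap left is whether the setup could ever have worked, which is exactly the question that solved it.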

Troubleshooting Methodology 1: Intro


Like a lot of people working in IT, I ended up doing this job because of the enjoyment I get out of fixing things and generally tinkering with various bits of technology. Curiosity about how things work would probably be the biggest driver in my career (and life – but that is another topic!). Thinking back on it, I would have to say that my earliest memory of troubleshooting is “helping” fix the upstairs phone with my father (who is a telecoms engineer) when I was about 5. This was a great introduction to troubleshooting as it was such a hands-on problem, but more on this later.

Over the years I’ve spent working in technical support, I’ve noticed people tend to troubleshoot things in one of two ways:

  1. Working on intuition and essentially just randomly making changes with no real reason for making the changes (The Potluck Approach)
  2. Gathering some information, drawing some conclusions and then making changes (The Structured Approach)

The second method is rarer than you would think. Sometimes randomly pushing buttons will fix a problem faster, simply through luck, but overall taking a structured approach is faster and results in fewer disasters caused by pushing the wrong button. It also means you generally learn what caused the problem and thus can prevent it from happening again.

My approach to problems is straightforward enough but I’ve found it does help. Below is my general, step by step, approach:

  1. Scoping the problem:  What is happening and (sometimes) why is this a problem?
  2. Data collection: Varies from problem to problem but usually includes environmental information, version of software, when it last worked etc.
  3. Analysis: Looking at the data collected and seeing if there are any indications of a problem.
  4. Result and conclusions: This varies based on where the analysis has led you. Sometimes you’ll have to go back, change what you’re looking for, and take a different tack.

The result of the historic phone troubleshooting? We ran the cables and initially it didn’t work. Then we checked the first junction box and got a signal, so the problem was between the junction box and the terminating socket. Turned out it was the socket. This might be a basic example, but the benefit of a structured approach is that it applies to all problems. If we apply the structure to the steps taken, it would look like this:

  1. Scoped the problem: Phone wasn’t working.
  2. Data collection: Checked how far the signal was getting.
  3. Analysis: Problem was between junction box and socket.
  4. Results: As replacing the socket was easier than ripping the cable out of the wall, we tried replacing that and hey presto it worked without tearing the wall apart. Success!
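The data collection step in the phone example is really just walking the signal path and noting where the signal stops. A minimal Python sketch of that check is below; the function name locate_fault and the readings dictionary are made up for the illustration, with the two test points taken from the story.

```python
def locate_fault(path, has_signal):
    """Walk the path in order; return (last point with signal, first point without).

    Returns (last_point, None) when every point on the path has a signal.
    """
    last_good = None
    for point in path:
        if not has_signal(point):
            return last_good, point
        last_good = point
    return last_good, None


# Hypothetical readings for the phone line in the story.
path = ["first junction box", "terminating socket"]
readings = {"first junction box": True, "terminating socket": False}

segment = locate_fault(path, readings.get)
print(segment)  # ('first junction box', 'terminating socket')
```

The result only narrows the fault to a segment; deciding which end of the segment to replace first (the socket, because it was easier) is still the analysis step.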

To NUC or not to NUC

You’ve probably heard of the Intel Next Unit of Computing, or NUC. Essentially, NUCs are very small form factor computers that Intel produces and sells as a reference design to show that regular PCs can be small and relatively affordable. They come with a range of processors (Atom, Celeron, i3, i5) and a range of specs (here, here and here being some examples). All they need is RAM and a hard drive to get them going.

Recently my wife was in need of her own computer, so I decided to go with an Intel NUC as it would be small and tidy. The monitor had a VESA mount, which allowed the NUC to be mounted on the back, and with a wireless keyboard and mouse the end result was extremely tidy, making everyone happy.

The spec of the NUC was as follows:

NUC: Intel DN2820FYKH Barebone Desktop (Celeron N2820 2.39GHz, HD Graphics, WLAN, Bluetooth 4.0)

Hard Drive: Intel 530 Series 120GB SSD

RAM: Kingston ValueRam 8 GB DDR3L 1600 MHz SODIMM CL11 Memory Module

Monitor: BenQ GL2250HM 21.5-inch Widescreen LED Multimedia Monitor

A couple of gotchas about Intel NUCs generally and this model specifically:

  1. The RAM must be low-voltage (1.35 V) SODIMM modules. Normal-voltage RAM will not work in an Intel NUC. There is a supported memory list which gives the full specs of the RAM: http://www.intel.com/support/motherboards/desktop/sb/CS-034475.htm
  2. For this specific model of NUC you have to use a 2.5 inch hard drive as opposed to an mSATA drive. However, to fit the drive bay of the NUC the drive must be no thicker than 9.5mm.

Installing Windows 8.1

Honestly, this was the most painful part of the process. The Windows installer loaded fine; all the pain came from the NUC.

Issues:

  1. SSD not detected
  2. BIOS screen not displaying on monitor

To install an operating system on the Intel NUC DN2820FYKH you need to upgrade the BIOS from the factory-shipped version of 0015 to a minimum of 0025, along with making some fairly minor changes to the BIOS settings. As you can probably imagine, the inability to see the BIOS screen made flashing the BIOS less straightforward, but not impossible. Intel supplies the BIOS updates in 4 packages:

  • Recovery BIOS update : used for flashing directly from the BIOS screen
  • iFlash BIOS update : DOS based utility for updating the BIOS
  • Self-extracting Windows-based update file
  • Self-extracting Windows PE update file

In my case the simplest fix was to make a Windows PE boot USB (instructions here), then copy the appropriate BIOS version to the drive. I then booted into Windows PE and ran the .exe, which updated the BIOS, and all proceeded smoothly from there. The SSD was detected and the OS installed.

 

vCenter appliance fails to boot after changing hostname

Follow-up to yesterday’s post about changing the certs/appliance name.

After regenerating the certs the appliance fails to boot… oops. The boot process hangs at “Waiting for embedded database to Start up [OK]”.

[Image: embedded_DB]

I suppose that’s what test environments are for.

To recover from this you need to get to the command line of the VCA. To do that, stop the boot process at the GRUB boot loader by hitting the “Down” arrow at the boot screen. Then make the following changes:

  1. Press “p” and enter the root password
  2. Select “VMware vCenter Server Appliance” and hit “e” to edit the boot settings
  3. Highlight the kernel line and hit “e” to edit it
  4. Add a space and the number 1 at the end of the boot string so it ends with “showopts 1”. This forces the machine to boot into a console
  5. Hit Enter and press “b” to boot

Once it boots, log in with the root password and remove the allow_regeneration file with the following command:

rm /etc/vmware-vpx/ssl/allow_regeneration

Then reboot the virtual machine.

This clears the “toggle certificate setting” and allows the VCA to boot normally.

Correct way to regenerate certificates on the vCenter virtual appliance

I have been working with the virtual appliance and had to regenerate its certificates. The trials of getting this done are covered here, but to regenerate the certificates properly, without hangs at boot, follow these steps:

  1. Enable certificate regeneration, either by hitting “Toggle certificate setting” in the web console or by logging onto the VCA via SSH and running:
    touch /etc/vmware-vpx/ssl/allow_regeneration
  2. Stop all the vCenter and SSO services on the vCenter appliance
  3. Regenerate the certificates
    source vpxd_commonutils; regenerate_certificates
    The result of this should be VC_CFG_RESULT=0
  4. Replace all the certs
    source vpxd_commonutils; generate_all_certificates replace
  5. Clean up the regeneration file by deleting the allow_regeneration file
    rm /etc/vmware-vpx/ssl/allow_regeneration
  6. Reboot the machine and check it comes up cleanly
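If you capture the output of step 3 to a file or variable, the VC_CFG_RESULT check can be automated rather than eyeballed. This is a small Python sketch under that assumption; the helper names cfg_result and regeneration_ok are invented, and the sample strings are hypothetical apart from the VC_CFG_RESULT=0 line quoted in the steps.

```python
import re


def cfg_result(output):
    """Return the integer after VC_CFG_RESULT= in captured output, or None if absent."""
    match = re.search(r"VC_CFG_RESULT=(-?\d+)", output)
    return int(match.group(1)) if match else None


def regeneration_ok(output):
    """True only when the output reports VC_CFG_RESULT=0."""
    return cfg_result(output) == 0


# Hypothetical captured output; only the VC_CFG_RESULT=0 form comes from the steps.
good = "VC_CFG_RESULT=0\n"
bad = "VC_CFG_RESULT=1\n"
print(regeneration_ok(good), regeneration_ok(bad), cfg_result("no result line"))
```

Treating a missing result line as a failure is deliberate: if the command printed nothing recognisable, it is safer not to carry on to the replace step.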

This should resolve the issue.

Changing the hostname on the vCenter appliance

Just a quick one.

In my lab environment I use the vCenter virtual appliance (VCA). As it was set up quite quickly, I never bothered adding the VCA to my testing domain at initial setup. I needed to test some domain stuff, so I decided to add it today.

The process for adding the VCA to the domain is quite simple:

  1. In your Active Directory DNS, create both a forward and a reverse lookup entry for the VCA
  2. Under the network configuration, ensure the VCA is set to use your AD DNS server
  3. On the same screen, change the hostname of the VCA to the FQDN you have created (it has to be the FQDN rather than just the appliance name, in the form VCA.domainname.tld)
  4. Reboot the appliance (this is required)
  5. After the reboot, go to the authentication screen and enter the AD credentials and domain name
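Before rebooting, it can be worth confirming that the forward and reverse entries from step 1 agree with each other. The Python sketch below does a simple round trip using the standard socket module; the function name dns_round_trip is made up, and the stub lookups and the 192.0.2.10 address are placeholders so the example runs without a real DNS server.

```python
import socket


def dns_round_trip(fqdn, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve fqdn to an IP, resolve the IP back, and compare the names.

    Returns (ip, reverse_name, matches).
    """
    ip = forward(fqdn)
    reverse_name = reverse(ip)[0]  # gethostbyaddr returns (hostname, aliases, addresses)
    matches = reverse_name.lower().rstrip(".") == fqdn.lower().rstrip(".")
    return ip, reverse_name, matches


# Stub lookups standing in for real DNS (placeholder address and names).
def fake_forward(host):
    return "192.0.2.10"


def fake_reverse(ip):
    return ("vca.domainname.tld", [], [ip])


result = dns_round_trip("VCA.domainname.tld", fake_forward, fake_reverse)
print(result)
```

Calling dns_round_trip("VCA.domainname.tld") with the default arguments would query whatever DNS server the machine running the script is configured to use, so run it from a machine pointed at the AD DNS.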

After doing all of this you will notice that you can no longer log into the vCenter client; you get the following error:

[Image: vsphere_client]

If you are using the built-in certs, then to fix this issue you have to go to the Admin tab and toggle the “regenerate SSL certificate” setting.

If you are using 3rd party certs then they need to be updated to reflect the new host name.

Full documentation on this issue here

E2EVC Rome 2013: a wrap-up

This year I attended the twentieth E2EVC conference (formerly known as PubForum) in Rome, and all in all it was a fantastic conference. It’s a bit different from other conferences in that it’s organised by the community for the community, so it is largely vendor neutral and there are next to no marketing presentations, which is always excellent. Marketing has its place, but it is best to understand the technology before trying to sell it; spinning the technology into marketing jazz is never a good approach.

Speakers and sessions

Over the course of the weekend I attended as many sessions as I could (there were two rooms running sessions in parallel, so unfortunately it was not possible to attend all of them). All of the sessions were worth attending; however, in my opinion a number of stand-out sessions rose above the others (but only slightly) for both content and presentation.

  • Andrew Wood and Jim Moyle’s Atlantis IO presentation
    The technology they were demonstrating was fascinating, and the two guys presented it well, playing off each other to get their point across with some humour in the mix too, plus the added fun of a live demo
  • Shawn Bass’s multi factor authentication presentation
    I have never seen anyone pack more information into a 45 minute presentation without melting everyone’s head and losing the audience along the way. A fascinating look at passwords and pass phrases
  • Andrew Wood and Jim Moyle’s keynote “What’s new with Citrix”
    Again they got their point across with a minimum of death by PowerPoint. Most of the slides were images or single lines that provided a base to talk from, which is my favourite form of presentation. The content was a stark look at where Citrix and the market are now and where they appear to be going. It was a positive direction overall, but they didn’t pull any punches
  • Jeff Wouters’s PowerShell DSC deep dive
    An overview of a topic I haven’t had an opportunity to look at myself. Jeff is good at making complex topics appear easy through the use of examples and metaphor
  • Wilco Van Bragt and Ingmar Verheij’s PVS design decisions
    A really interesting take on this topic. They each talked about the design phases of a project and showed how one or two constraints early on can radically change how you design an environment. They also discussed how they dealt with those constraints. An excellent demonstration of how there is rarely a 100% correct answer, and how every project is a learning exercise as there is always something you can do better
  • Carl Webster’s Documentation scripts
    The amount of work that has gone into these scripts is enormous and the level of community involvement is just staggering. I first came across these scripts when I was trying to work with Citrix’s PowerShell cmdlets and found the explanatory blog posts that Carl publishes with his scripts enormously helpful. The scripts do an excellent job at their actual function (documentation) and Carl’s presentation style was entertaining

It was a great conference and I’d highly recommend going to the one in Brussels, which is on May 30 – June 1, 2014. Exact venue details still have to be announced. All information for the event will be here.

Can’t wait for the next one.