Timing Flaw Discovered in Windows Ping Utility
It’s not every day that you discover a flaw in a core networking component of a major operating system, but that’s just what happened here at FrameFlow last week. The command line “ping” is the workhorse of network and server monitoring. It’s the basic test that any sysadmin will use to determine if a remote system is alive and responding to network requests. As we investigated in more detail, we were even more surprised to discover that this bug sits deep in the Windows API and affects pretty much any program that needs to run ping tests.
The Story Unfolds
We were working with a partner in the Netherlands who had deployed a new FrameFlow installation at one of their client sites. This customer site had a complex but very stable network topology. As part of the initial deployment, our partner had set up ping event monitors to watch various critical systems on the client’s network. Soon after the deployment, our ping event monitor began to report sporadic timeouts, not just for individual systems but for pretty much all systems on the network. The client expressed some disbelief that these timeouts were legitimate, so we worked with our partner to investigate.
First we set up tests in our labs, but we could not reproduce what was being seen at the client site despite using the exact same settings. The only remaining possibility was that there was some significant difference between our networking infrastructure and the client’s. Working with our partner, we discovered that the client site had multiple segments, so pings from one end of the network to the other would have significantly higher hop counts than we see at most customer sites. While working on this, we also discovered another puzzling piece of information: setting the ping threshold higher eliminated all of the timeout messages from our ping event monitor. All the while, command line pings on the client network never reported timeouts and consistently reported response times of 60ms or lower.
Back to the FrameFlow Server Monitoring Lab
Back at our test lab we tried to reproduce what we observed at the client site. We set up a ping event monitor and pointed it at an external website, which guaranteed a higher hop count and had response times around 60ms. After about ten runs we got a timeout! Figuring it might have been a legitimate report, we pointed the event monitor at www.google.com and got the same thing. Meanwhile, command line pings to both sites were running flawlessly. Our default ping timeout is 500ms, so we set it to 1000ms, which is the default for the command line ping… and then magically the timeouts disappeared.
Digging into the Code
We thought we had finally found the problem: somewhere in our code there must be something that handles low ping thresholds incorrectly. Our coders tore through the source line by line, but turned up nothing until one of them asked, “What does the Windows command line ping do at 500ms?” It turns out it shows the exact same flaw.
Take a look at these two screenshots. Both are running command line pings to www.google.com. The one on the left is using a timeout of 999ms and the one on the right is using 1000ms.
The one on the left shows two timeouts while the one on the right runs smoothly. The only difference is the selected timeout, which varies by exactly 1 millisecond. We observed that the barrier between success and failure is exactly at this point. With any timeout under 1000ms, periodic timeouts occur. At 1000ms or more, they do not.
As you can see in the following screenshot, we ran an extended test of 1000 pings and the overall loss rate was 4%. In multiple tests the timeout ratio was consistently in the range of 3% to 6%.
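For readers who want to repeat the comparison, here is a minimal Python sketch of the experiment described above. It is not FrameFlow's test harness. The helper names (ping_command, loss_rate, run_trial) are hypothetical, and the sketch assumes a Windows machine with ping.exe on the PATH and English-language output, where each lost reply is printed as "Request timed out.".

```python
import subprocess


def ping_command(host, timeout_ms, count):
    """Build a Windows ping command line.

    -n sets the number of echo requests, -w sets the per-reply timeout in ms.
    """
    return ["ping", "-n", str(count), "-w", str(timeout_ms), host]


def loss_rate(ping_output, count):
    """Return the fraction of requests that ping reported as timed out.

    Assumes English-language output from the Windows ping utility.
    """
    timeouts = ping_output.count("Request timed out.")
    return timeouts / count


def run_trial(host, timeout_ms, count=1000):
    """Run one batch of pings and return the observed loss rate.

    Only meaningful on Windows; other platforms use different ping flags.
    """
    result = subprocess.run(
        ping_command(host, timeout_ms, count),
        capture_output=True,
        text=True,
    )
    return loss_rate(result.stdout, count)
```

Calling run_trial("www.google.com", 999) and then run_trial("www.google.com", 1000) on an affected system should let you compare the two loss rates side by side. The actual numbers will depend on your network path.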
Under the Hood… IcmpSendEcho
FrameFlow Server Monitor, the Windows command line ping utility and pretty much any other Windows software that sends ping requests all use a function called IcmpSendEcho, which lives in a Windows file called Iphlpapi.dll. We suspect that around the release of Windows Server 2012, a flaw was introduced in the way this module times responses.
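To show where the timeout value enters the picture, here is a rough Python sketch that calls IcmpSendEcho through ctypes. It is an illustration only, not the code used by FrameFlow or by ping.exe. The wrapper and helper names (icmp_send_echo, ip_to_ipaddr, reply_buffer_size) are hypothetical; IcmpCreateFile, IcmpSendEcho, IcmpCloseHandle and the ICMP_ECHO_REPLY layout are the documented Windows ones. The call itself only works on Windows.

```python
import ctypes
import socket
import struct

IP_SUCCESS = 0
IP_REQ_TIMED_OUT = 11010  # status reported when no reply arrives in time


class IpOptionInformation(ctypes.Structure):
    # Mirrors the Windows IP_OPTION_INFORMATION structure.
    _fields_ = [
        ("Ttl", ctypes.c_ubyte),
        ("Tos", ctypes.c_ubyte),
        ("Flags", ctypes.c_ubyte),
        ("OptionsSize", ctypes.c_ubyte),
        ("OptionsData", ctypes.c_void_p),
    ]


class IcmpEchoReply(ctypes.Structure):
    # Mirrors the Windows ICMP_ECHO_REPLY structure.
    _fields_ = [
        ("Address", ctypes.c_uint32),
        ("Status", ctypes.c_uint32),
        ("RoundTripTime", ctypes.c_uint32),
        ("DataSize", ctypes.c_uint16),
        ("Reserved", ctypes.c_uint16),
        ("Data", ctypes.c_void_p),
        ("Options", IpOptionInformation),
    ]


def ip_to_ipaddr(ip):
    """Convert dotted IPv4 text to the IPAddr integer the API expects."""
    return struct.unpack("=I", socket.inet_aton(ip))[0]


def reply_buffer_size(payload_len):
    """One reply structure, the echoed payload, plus 8 bytes for an ICMP error."""
    return ctypes.sizeof(IcmpEchoReply) + payload_len + 8


def icmp_send_echo(ip, timeout_ms, payload=b"ping"):
    """Send one echo request; return (status, round_trip_ms or None). Windows only."""
    iphlpapi = ctypes.WinDLL("iphlpapi", use_last_error=True)
    iphlpapi.IcmpCreateFile.restype = ctypes.c_void_p
    iphlpapi.IcmpCloseHandle.argtypes = [ctypes.c_void_p]
    iphlpapi.IcmpSendEcho.restype = ctypes.c_uint32
    iphlpapi.IcmpSendEcho.argtypes = [
        ctypes.c_void_p, ctypes.c_uint32, ctypes.c_char_p, ctypes.c_uint16,
        ctypes.c_void_p, ctypes.c_void_p, ctypes.c_uint32, ctypes.c_uint32,
    ]
    handle = iphlpapi.IcmpCreateFile()
    if handle in (None, ctypes.c_void_p(-1).value):  # INVALID_HANDLE_VALUE
        raise OSError("IcmpCreateFile failed")
    try:
        buf = ctypes.create_string_buffer(reply_buffer_size(len(payload)))
        replies = iphlpapi.IcmpSendEcho(
            handle, ip_to_ipaddr(ip), payload, len(payload),
            None, buf, len(buf), timeout_ms,  # timeout_ms is the value at issue
        )
        if replies == 0:
            # No reply stored; the last-error value carries the status code.
            return ctypes.get_last_error(), None
        reply = IcmpEchoReply.from_buffer(buf)
        return reply.Status, reply.RoundTripTime
    finally:
        iphlpapi.IcmpCloseHandle(handle)
```

On a Windows machine, calling icmp_send_echo with the same target address at 999 and at 1000 milliseconds in a loop is one way to check whether the behavior described here also appears outside of ping.exe.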
Affected Windows Versions
We fired up virtual machines in our lab to see if we could determine whether this has been an issue for a long time. In our tests we could not reproduce it on Windows Server 2003 R2 or on Windows Server 2008 R2. However, it was seen consistently on Windows Server 2012 R2 and on the latest builds of Windows 10.
Since server monitoring software is our bread and butter, we puzzled over why this had not been discovered earlier. There turned out to be two reasons. First, a few releases ago we reviewed the default values for various event monitors. The ping event monitor had a default of 1000ms, which was ridiculously high for most purposes, so we lowered it substantially. Any event monitors that had been created before that change would still be configured with 1000ms for the timeout. Second, the issue only appears when the hop count is above a certain level. We don’t yet have figures on where that threshold lies, but it appears to be higher than what you’ll find in most corporate networking environments.
Scope and Conclusion
Until Microsoft releases a fix for this issue, it is important to use caution with ping results when an explicit timeout below 1000ms has been specified. These false timeouts can give the appearance of a network that is not 100% reliable, which can then lead to a lot of administrative effort to investigate causes. Many of the IT groups that we deal with are overworked and understaffed, so chasing down a red herring like this takes time away from other important tasks. For our part, in FrameFlow Server Monitor and FrameFlow Multi-Site Monitor we’ve updated our ping tests to use a minimum timeout of 1000ms, which works around the problem for now.
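If you maintain your own ping scripts, the same workaround can be sketched in a few lines of Python. This is not FrameFlow's implementation. The names are hypothetical, and comparing the measured round trip against the originally requested threshold afterwards is an assumption added here so that a lower alert threshold can still be honored.

```python
# Smallest timeout at which the false timeouts were not observed.
MIN_SAFE_TIMEOUT_MS = 1000


def effective_timeout_ms(requested_ms):
    """Never hand the ping call a timeout below the safe minimum."""
    return max(requested_ms, MIN_SAFE_TIMEOUT_MS)


def exceeds_threshold(round_trip_ms, requested_ms):
    """Apply the caller's original threshold after the ping has completed.

    round_trip_ms is None when no reply arrived within the effective timeout.
    """
    if round_trip_ms is None:
        return True
    return round_trip_ms > requested_ms
```

With this split, a monitor configured for 500ms would wait up to 1000ms for a reply, then flag the result only if the measured time really was above 500ms or no reply arrived at all.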
FrameFlow develops professional server and network monitoring tools to help you be sure your critical systems are up and running. We invite you to take a closer look and see how FrameFlow can help you to keep your servers going and keep your network flowing.