Here at FrameFlow, as we work to offer a robust and versatile server and IT systems monitoring solution for a variety of enterprise customers, we’ve noticed several common issues, some of them interrelated, that our software detects across the board. Here’s a rundown of the top five:
Digital technology has become an essential backbone for most business operations in our modern economy, resulting in the proliferation of log files, email, ERP and a host of large and complex computer applications, many of which require a vast and increasing amount of disk space.
As data volume expands, response times can grow right along with it, because applications that read and write data are much less efficient when disk space is low. Low disk space also affects your server’s ability to grow the paging file and can negatively impact virtual memory management. These inefficiencies can slow down your entire network, cause transactions to fail, grind websites to a halt and generally create headaches for any IT department aiming for the “five nines” gold standard of availability.
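As a rough illustration of the kind of check involved, here is a minimal Python sketch that reports the percentage of free space on a volume using the standard library. The function names and the 10% alert threshold are assumptions made for this example; they are not FrameFlow settings.

```python
import shutil


def disk_free_percent(path="."):
    """Return the percentage of free space on the volume that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    if usage.total == 0:
        return 0.0  # guard against unusual filesystems that report no capacity
    return 100.0 * usage.free / usage.total


def is_low_on_space(free_percent, threshold_percent=10.0):
    """True when free space falls below the threshold (10% is an assumed value)."""
    return free_percent < threshold_percent


if __name__ == "__main__":
    pct = disk_free_percent(".")
    status = "LOW" if is_low_on_space(pct) else "ok"
    print(f"free space: {pct:.1f}% ({status})")
```

The right threshold depends on the workload: a volume that holds fast-growing log files may warrant an earlier warning than a mostly static one.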
As Evan Marcus, Principal Engineer at Veritas Software, observes, 99.999% availability works out to 5.39 minutes of total downtime – planned or unplanned – in a given year. – TechTarget.com
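To see where a downtime budget of this size comes from, the arithmetic can be sketched in a few lines of Python. The function name is an assumption for illustration. For a 365-day year the calculation gives roughly 5.26 minutes, slightly below the figure quoted above.

```python
def downtime_minutes_per_year(availability_percent, days_per_year=365):
    """Minutes of downtime per year allowed at the given availability level."""
    minutes_per_year = days_per_year * 24 * 60  # 525,600 for a 365-day year
    return minutes_per_year * (1 - availability_percent / 100)


if __name__ == "__main__":
    for level in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(level)
        print(f"{level}% availability allows {minutes:.2f} minutes of downtime")
```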
As the brain of your server, the CPU has a finite capacity to handle the processing requests it receives from any applications requiring its resources. Sustained high CPU utilization can bog down even the best servers as each new request must wait until the necessary capacity is freed up before it can be processed.
Sporadic spikes in CPU usage are part of normal server operations, but if a CPU is running at more than 80% capacity most of the time, this could indicate a serious problem. Possible causes of such high CPU utilization include poorly written applications that consume excessive CPU time, malware that has infected the server, ‘memory leaks’ from programs that fail to release memory that is no longer needed, and a high number of interrupts caused by an inefficient page-swapping process.
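The difference between a sporadic spike and sustained load can be sketched as a simple check over a series of utilization samples. In this Python example the function names are assumptions, and "most of the time" is interpreted as more than half of the samples; how the samples are collected is left out.

```python
def fraction_above(samples, threshold=80.0):
    """Fraction of CPU utilization samples (in percent) above the threshold."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s > threshold) / len(samples)


def is_sustained_high(samples, threshold=80.0, min_fraction=0.5):
    """True when more than `min_fraction` of the samples exceed the threshold."""
    return fraction_above(samples, threshold) > min_fraction


if __name__ == "__main__":
    spiky = [20, 95, 30, 25, 90, 35]  # occasional spikes, mostly idle
    busy = [85, 92, 88, 60, 97, 91]   # above 80% in most samples
    print("spiky:", is_sustained_high(spiky))
    print("busy:", is_sustained_high(busy))
```

Looking at the share of samples over a window, rather than at any single reading, keeps normal short-lived spikes from raising an alert.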
Low disk space and high CPU utilization can result in unresponsive databases and applications, but these programs can have their own independent range of issues. Poor application architecture or excessive logging could be the culprits. Databases such as SQL Server can have issues with login credentials if passwords are not updated on the server where the instance is installed; startup parameters might have incorrect file path locations; a port could already be in use by another SQL instance on the same machine; or some database files might have accidentally been deleted or corrupted due to a disk failure.
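One of these conditions, a port that is already taken, is easy to test for. The following Python sketch checks whether anything is accepting TCP connections on a given local port. The function name is an assumption for this example, and the check only shows that the port is in use, not which process or instance holds it.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 1433 is the default TCP port for a SQL Server default instance
    print("port 1433 in use:", port_in_use(1433))
```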
Throughout the course of a given business day, there are bound to be spikes in network traffic to your servers. Scheduled backups inside the LAN, the use of remote backup tools and virus scanner updates can all cause significant spikes in network traffic, but these processes can usually be managed and scheduled to have the least impact on operations during peak business hours. Unexpected traffic spikes, such as those caused by mail server problems, malware outbreaks or hacking attempts, are another matter altogether. They need to be monitored and investigated where required, as such incidents can have widespread negative implications for your IT systems.
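One simple way to separate unexpected spikes from ordinary variation is to compare each traffic sample against a baseline built from the samples just before it. The Python sketch below flags any sample that exceeds the trailing mean by more than a chosen number of standard deviations. The function name, the window size and the multiplier are all assumptions for illustration, not a description of how FrameFlow detects spikes.

```python
from statistics import mean, stdev


def find_spikes(samples, window=5, k=3.0):
    """Return indexes of samples above trailing-window mean + k * stdev.

    `window` must be at least 2 so a standard deviation can be computed.
    """
    spikes = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]  # the `window` samples before index i
        limit = mean(baseline) + k * stdev(baseline)
        if samples[i] > limit:
            spikes.append(i)
    return spikes


if __name__ == "__main__":
    # Hypothetical traffic readings, for example in megabits per second
    traffic = [100, 102, 98, 101, 99, 100, 400, 101]
    print("spike at indexes:", find_spikes(traffic))
```

Known events such as scheduled backups would still trip a check like this, so in practice the baseline or the alert schedule needs to account for them.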
We’ve seen some recent examples of Distributed Denial of Service (DDoS) attacks in which high-profile networks are overwhelmed with persistent junk traffic to the point where they become unresponsive and shut down. For example, just this past fall:
"Cyberattacks targeting a little known internet infrastructure company, Dyn, disrupted access to dozens of websites on Friday [Oct. 21], preventing some users from accessing PayPal, Twitter and Spotify… Dyn said attacks were coming from tens of millions of internet-connected devices – such as webcams, printers and thermostats – infected with malicious software that turns them into “bots” that can be used in massive denial-of-service attacks." – CBC News
The last item on FrameFlow’s top five list of issues detected by our server and IT systems monitoring software is hardware failures. “The sweeping majority (80.9%) of [hardware] failures are caused by hard drive malfunction,” (source: Hardware Failure Survey by storagecraft.com). The impact of a hard disk failure depends on the server experiencing the problem and the specific deployment, but that impact can often be limited by using RAID arrays and performing regular disk checks.
Power source failure is a distant second, accounting for about 4.7% of hardware failures. Insufficient voltage levels or power fluctuations can cause server power problems. Inoperable, unplugged, or disengaged server power supplies (PSUs) are also a significant cause of power supply issues.
A variety of other problems can cause hardware failure, such as a security breach, virus infection, data corruption, faulty firmware updates, systems overheating because fans are running too slowly or not at all, human error (e.g. an incorrect OS installation) and, of course, physical disaster, just to name a few.
As new issues emerge that could negatively impact server operations, network flow and the overall effectiveness of IT systems, FrameFlow continues to evolve right along with these challenges. If you would like to experience how FrameFlow can make a difference in your monitoring capabilities, we invite you to download a 30-day trial version of our software… with no obligation, cost or credit card required!