Picture this: it’s a day like any other at your large business when, suddenly, the servers hosting vital processes and customer databases can’t be reached. Uh oh! The phone lines explode with angry clients losing thousands of dollars per minute to your critical IT issue. This is a red alert!
It all comes back online within minutes, but the damage to your reputation and customer satisfaction is already done. Checking in with your server administrator later on, you get to the root of the issue: a critical server froze because it ran out of available memory. Whoops!
That’s a legitimate enough problem; memory is a finite resource like any other. But did it really have to get that far? Did the whole system have to grind to a halt before anyone noticed? The answer to both questions is “absolutely not.”
If I were asked to review this organization’s IT monitoring procedures in the aftermath of such an event, one of the first things I would look into is this: are IT assets monitored both reactively and proactively? If the answer is “no”, then we’ve got some work to do. When you’re simply monitoring for a down server – “Is this asset available?” – the reaction comes too late: no alarm fires until the server is already down and the business is already impacted. Monitoring should not merely be reactive, but proactive as well.
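To make that distinction concrete, here’s a minimal sketch (in Python) of what a purely reactive availability check looks like. The host, port, and print-based “alert” are hypothetical placeholders, not any particular monitoring product.

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Reactive check: can we open a TCP connection to the service right now?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical example: by the time this alarm fires, the outage has already happened.
if not is_reachable("db01.example.com", 5432):
    print("ALERT: db01 is unreachable -- customers are already impacted")
```

Useful, yes, but notice what it can’t do: it only tells you about the crash after the crash.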
As the names imply, reactive monitoring is an after-the-fact alarm for a problem, while proactive monitoring is meant to predict one. With proactive monitoring, you’re catching developing problems before they turn into real ones. Don’t just monitor for a down node; monitor the conditions that lead to a down node. The former leaves you chasing the wave while the latter keeps you in front of it. Hang 10, bro!
In the scenario mentioned in the first paragraph, the problem could have been avoided with proactive monitoring of server conditions. Had the servers been monitored for memory usage, the issue could’ve been addressed beforehand, perhaps by killing unneeded processes or expanding the available memory. The same goes for CPU or disk usage – find the thresholds that warn you of impending problems and set your alarms accordingly. As an IT professional with NOC experience, I can say firsthand that proactive server monitoring with appropriate thresholds will save both you and your clients a lot of time and brainpower.
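For illustration, here’s a rough sketch of what those threshold checks might look like, using the third-party psutil library. The specific percentages and the print-based warnings are assumptions you’d replace with thresholds and notifications tuned to your own environment.

```python
import psutil  # third-party library: pip install psutil

# Hypothetical warning thresholds -- tune these to your own baseline.
MEMORY_WARN_PERCENT = 85.0
CPU_WARN_PERCENT = 90.0
DISK_WARN_PERCENT = 80.0

def check_server_health() -> list[str]:
    """Proactive check: warn while there is still time to act, not after the freeze."""
    warnings = []
    mem = psutil.virtual_memory().percent
    cpu = psutil.cpu_percent(interval=1)   # sample CPU usage over one second
    disk = psutil.disk_usage("/").percent
    if mem >= MEMORY_WARN_PERCENT:
        warnings.append(f"Memory usage at {mem:.0f}% -- free memory or add capacity")
    if cpu >= CPU_WARN_PERCENT:
        warnings.append(f"CPU usage at {cpu:.0f}% -- investigate runaway processes")
    if disk >= DISK_WARN_PERCENT:
        warnings.append(f"Disk usage at {disk:.0f}% -- clean up or expand storage")
    return warnings

for warning in check_server_health():
    print(f"WARNING: {warning}")  # in practice, page a technician or open a ticket
```

Run something like this on a schedule and the memory problem from our opening scenario becomes a ticket on a quiet afternoon instead of a red alert.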
This extends beyond server monitoring. Networks can be monitored by looking at throughput and traffic, examining the protocol, transport, physical, or application level, and they can be proactively monitored by setting network condition thresholds and alarms that let technicians know when performance is beginning to suffer. Applications can be proactively monitored with metric-based tools for website, database, or application performance. Usage trends can provide insight into key time periods to watch and allow for preparation based on business patterns.
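As one hedged example of turning “performance is beginning to suffer” into an actual alarm, the sketch below samples interface counters and compares the computed inbound throughput to a threshold. The threshold value is a placeholder, and a real deployment would feed these samples into trending and alerting tools rather than a print statement.

```python
import time
import psutil  # third-party library: pip install psutil

# Hypothetical threshold: warn when inbound traffic approaches the link's capacity.
INBOUND_WARN_BYTES_PER_SEC = 100 * 1024 * 1024  # roughly 100 MB/s

def inbound_throughput(interval: float = 5.0) -> float:
    """Sample total received bytes twice and return bytes per second over the interval."""
    before = psutil.net_io_counters().bytes_recv
    time.sleep(interval)
    after = psutil.net_io_counters().bytes_recv
    return (after - before) / interval

rate = inbound_throughput()
if rate >= INBOUND_WARN_BYTES_PER_SEC:
    print(f"WARNING: inbound traffic at {rate / 1e6:.1f} MB/s -- performance may degrade")
```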
All of this will aid in preventing problems before they occur – saving both time and money. On top of that, modern remote monitoring software is cost-effective and does not have to be difficult to administer. When tuned to your data center and network, even inexpensive remote monitoring performs admirably at catching small issues before they snowball into something worse. And with features like auto-discovery, traffic statistics, and report generation, monitoring a large or complex network becomes much simpler.
Reactive monitoring is still just as necessary as proactive monitoring; you need alarms that tell you definitively when an outage has occurred. But if your organization also ensures that IT assets are monitored proactively, with thresholds tuned to real patterns of performance, that will make the difference between issues that become disasters and issues that are nipped in the bud.