Asking the Right Technology Questions.

I recently configured a suite of disaster recovery (DR) mechanisms on a cloud platform delivering a business-critical web portal. One of these mechanisms uses an HTTP health check to monitor DR targets – in this case, service ports on firewalls in separate cloud regions – and routes incoming HTTP traffic to A-record targets based on each target's health status.

The health check was configured to send an HTTP request to the web server backend for a simple version page. As long as the web server replies successfully, the health check marks the target as healthy and able to receive traffic.
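To make that rule concrete, here is a minimal sketch of what such a check boils down to. It is not the cloud platform's actual implementation: the function names, the timeout, and the "any 2xx means healthy" rule are my own assumptions for illustration.

```python
from typing import Callable
from urllib.error import URLError
from urllib.request import urlopen


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def target_is_healthy(url: str, fetch: Callable[[str], int] = http_status) -> bool:
    """Mark a target healthy only if the probe URL answers with a 2xx status.

    `fetch` is injectable (an assumption of this sketch) so the rule can be
    exercised without a live web server.
    """
    try:
        return 200 <= fetch(url) < 300
    except (URLError, OSError):
        # Timeouts, refused connections, and HTTP error responses
        # all count as unhealthy.
        return False
```

Notice that the verdict depends entirely on which URL gets probed. The check has no opinion about anything the probed page does not touch.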

The full DR configuration was not yet ready for production use, but the portal application suddenly experienced an outage one evening for unrelated reasons – so I took the opportunity to watch the health check activity with interest as the first unofficial live test of my configuration. Serendipity!

Hmmm. The primary region portal application was definitely down – visiting the portal page loaded nothing but a blank screen – but the health check was not registering any problem and continued to show status green for the primary region. This was no good – the DR health check should have been switching traffic to the secondary target in our backup region, but nothing was happening.

I then realized the problem. The portal application was down due to network problems interfering with web server comms to microservice dependencies, BUT… the web server itself was still online and accessible. The health check was simply designed to query the web server – and despite the ongoing outage killing the application, the web server was fine and responding to requests for version info. So as far as the health check was concerned, there was no problem and therefore no automated DR remediation. *CUE DRAMATIC REVEAL MUSIC*

We are often concerned with finding the right answers – but we more importantly need to ensure we’re asking the right questions. My health check was asking “Is the web server healthy?” when the ACTUAL question needed to be “Is the application healthy?” The DR solution was meant to address the application as a whole, but the critical health check was only looking at a subset of the whole. I needed a health check looking at the whole.

The solution was simple: I selected a different health check path that queries a main portal logon component, which works only if the overall application is working – web server, microservices, database and all. I gave it a whirl in the dev environment and presto – I could replicate the outage and get the correct automated DR activity.
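For anyone who does not have a handy component that exercises the whole stack, the same idea can be sketched as a dedicated "deep" health endpoint. This is a hypothetical illustration, not what I actually deployed: the `application_health` function, the dependency names, and the 200/503 convention are all assumptions.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def application_health(checks: Dict[str, Check]) -> Tuple[int, Dict[str, bool]]:
    """Run every dependency check; report 200 only if all of them pass."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that blows up is treated the same as one that fails.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


# Replaying the outage: the web server is up, but the microservices
# are unreachable (simulated here with hard-coded results).
status, detail = application_health({
    "web_server": lambda: True,
    "microservices": lambda: False,
    "database": lambda: True,
})
print(status, detail)
# prints: 503 {'web_server': True, 'microservices': False, 'database': True}
```

Pointing the DR health check at an endpoint like this means a broken dependency turns the target red even while the web server keeps answering.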

When addressing a challenge or problem, it pays to take a moment and consider where you started – not where you finished. Were you asking the right questions from the start? If you ask the wrong questions, don’t be shocked when you get the wrong answers!



Categories: The IT Philosopher
