When continuity planning is quite literally the difference between life and death, the role of the business continuity planner takes on a whole new meaning.
Roger Jarvis, Disaster Recovery Manager at Ernst & Young, shared with us some of his experience of IT service continuity in hospital environments. This article looks at what hospitals do when their computer systems fail:
“When something depends on Information Technology (IT) and the IT fails, then that something stops too. If you go to a bank machine to get some cash and the ATM system fails then you get no cash. Except, well, there’s probably another ATM around the corner.
But imagine a different something that can’t be allowed to stop, even if the IT on which it depends crashes. Imagine if the failure of that IT could easily result in death.
Some elements of hospital life are utterly dependent on technology. From lighting and lift systems to medical kit and patient information systems, there’s not much room for down-time that won’t affect patient care. Even a non-emergency technological delay can be detrimental to health: “I’m sorry, the patient can’t go on to the ward because our bed-tracker system has gone offline,” “I didn’t know the test results were critical because my computer was down,” and “We’ve run out of that drug because the automated ordering system sucks” don’t help anyone get better.
When patients are sick they need expert human intervention: they can’t wait for a ‘system’ to help them. When technology is sick, it also needs expert human intervention. In a hospital environment, neither can wait.
The key reason for investing in IT in hospitals is to ensure vital data flows to where it needs to go, easily entered, easily updated and easily retrieved, so patients get the best care possible. On top of that, managing the hospital, including making sure there are enough staff on the ward and drugs in the cupboards, is massively compromised when technology goes wrong.
In an IT disaster the systems stop but the needs continue. Staff use alternative behavior to replace the alerting, allocating, recording and reporting functions of the IT systems, and do that in addition to their normal duties, until the systems come back online. So now they have a lot more to do, and they are not well-practiced in doing it. (Think, for example, how you’d personally raise a purchase order without a computer and you get some idea of how tricky it can be to do things in any way you don’t consider routine.)
So everything that was “just managed by a system” is now happening manually: by hand, foot, phone or fax, and so we have a workload problem.
In fact, more than that, consider that time moves on: tests, dosages, locations and results happen every minute. Information changes and, by the time the system comes back online, some of it is massively out of date. So now we have a data integrity problem too.
And while we may scoff at government targets in hospitals, we have to acknowledge that anything that was automated and now has to be done manually is going to slow down the speed at which we get things done; the speed at which we treat everyone that comes through the door. So we have a delay problem and, for the hospital and its management, a Service Level and Target problem too.
And it’s not like you can cancel a 20-car pile up or a heart attack.
So how does the IT department deal with this? The finger of blame is definitely going to be pointed at them, even if the fault actually lies elsewhere. The end users see their screens blank or frozen and that’s down to IT. Well: it’s down to IT to fix it, and they’re going to be harassed and harangued until it’s sorted.
The challenge then is to provide IT disaster recovery to an environment that is itself dedicated to disaster recovery of an even more serious kind.
And it needs to be done in a way that protects the hospital from delaying treatment, data integrity problems, and missing their target times and helps the IT team show they’re dealing with the issue efficiently.
So what’s the solution?
The solution model can be called Systems Continuity. It comprises a combination of system resilience and recoverability.
In terms of resilience: systems need to be able to withstand expected assault and battery and have self-regulating modules and alarms, like a modern lifeboat in heavy seas. Recoverability ensures that if the system does finally sink under overwhelming odds, the way to get it floating again is clear, rapid and agreed. That might even involve borrowing another lifeboat for a while, but it does not mean waiting for calmer weather: we cannot wait.
If “Systems Continuity” is the name on the box, what goes in it?
The Test Case
Most real hospital-based IT problems are confidential, so let’s use a fictional example!
In Hospital Simple it’s 0600hrs on a cold wet Monday. The vital bed management system – let’s call it the “Bed Online Keeper Resource System” or BONKERS for short – has completely failed. It prints garbage, looks like it has no data in it and locks all keyboards every time someone tries to use it. Attempts to revive it by restarting the server in the network room have failed.
Now at Hospital Simple the IT staff aren’t dim. They know that policy dictates that the service provided by BONKERS is vital because its absence has critical impact. So, after kicking the server a few times, they activate their Disaster Recovery (DR) Plan.
Like all the best plans and all the worst flat-packs, it comes in several parts. First, there’s the part for the hospital staff, who have to work around the issue, and then there’s the part for the IT team, who have to fix the underlying problem, stat.
The Medical Teams and Administrators
The medical teams are locked out of BONKERS. Until it’s fixed, that system is nothing to do with them, and they’re going to ignore IT (except to kick them, of course). There are no headless chickens on the wards. The medical staff had a plan for a manual workaround and they didn’t waste time working out what to do. They have a heavier workload, but they’re just getting on with it.
But what are the implications for them when BONKERS comes back online?
By the time BONKERS comes back online there’s a tonne of information that hasn’t been entered or updated. Just because the system is “back online” doesn’t mean it’s useful. And in a hospital setting, an out-of-date information system is downright dangerous.
So as well as a workaround system, they also need a way to input the missing data into BONKERS when it comes back up, and to let IT know when that’s done, so the message can go out that the system is back online and usable.
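As a sketch only, here is what that backfill step might look like in code. Everything in it is hypothetical (the names `ManualEntry`, `BedSystem` and `backfill` are illustrative, not from any real hospital product); the point is the rule it encodes: replay the paper log in time order, and never let a stale manual note overwrite a newer entry made after restoration.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ManualEntry:
    """One line from the ward's paper log kept while the system was down."""
    bed_id: str
    status: str            # e.g. "occupied", "free", "cleaning"
    recorded_at: datetime  # when the ward wrote it down

class BedSystem:
    """Stand-in for the restored bed-management system."""
    def __init__(self):
        self.beds = {}  # bed_id -> (status, last_updated)

    def apply(self, entry: ManualEntry) -> None:
        current = self.beds.get(entry.bed_id)
        # Only overwrite if the manual record is newer than what the
        # system already holds: a stale paper note must not clobber an
        # update entered after the system came back up.
        if current is None or entry.recorded_at > current[1]:
            self.beds[entry.bed_id] = (entry.status, entry.recorded_at)

def backfill(system: BedSystem, paper_log: list[ManualEntry]) -> int:
    """Replay the manual log in time order; return how many entries took effect."""
    applied = 0
    for entry in sorted(paper_log, key=lambda e: e.recorded_at):
        before = system.beds.get(entry.bed_id)
        system.apply(entry)
        if system.beds.get(entry.bed_id) != before:
            applied += 1
    return applied
```

Once `backfill` has run and the ward confirms the log is exhausted, IT can be told the data is current.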
The IT Team
The IT team has two key issues. The first is the broken system, of course. But the second is one of communication.
BONKERS has an IT manual, as well as a Users manual. The IT one includes a Disaster Recovery (DR) section (and an appendix on what will happen next if the DR instructions don’t work – because we’ve all been there!). So there are no headless chickens here either. There is a problem, there is a process to resolve it, and that’s all going ahead immediately by the 24 hour staff in situ and, quite probably, the staff they’ve called in to help… because big problems usually happen at 0300 when there aren’t enough people in the office to fix it.
But this isn’t the only problem for IT. They have to “manage” the situation as it pertains to them too. At the outset they have to ensure that everyone is aware that they know there is a problem with BONKERS and that they are working on it. They have to gather all the relevant information from the staff so they know all the symptoms before diagnosing the problem (and at least this is terminology the medical teams understand!).
Over time this needs to move on, so that everyone knows that IT understands the problem, is being kept in touch with what is being done to resolve it, and knows what the timescale to restoration might be.
Towards the restoration end of the issue, the communications strategy needs to change again. Liaison now is about getting missing information put into BONKERS and understanding when enough of it is there that it’s safe to use again. And, after letting people know that BONKERS is back up, there are the post-incident report and debriefs to assure people that the same problem isn’t going to recur.
All this communication needs a robust plan! Some IT experts aren’t renowned for their communication skills (!), so ensuring the DR plan puts the right communicators in place to manage the process is as vital as fixing the system itself.
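The stages described above can be sketched as a simple phase tracker. This is purely illustrative (the phase names paraphrase the article, and `IncidentComms` is not any real incident-management tool); what it shows is that communications move through ordered phases and a status message should go out at every transition.

```python
from enum import Enum, auto

class Phase(Enum):
    """Communication phases of the incident, in order."""
    ACKNOWLEDGED = auto()  # "we know BONKERS is down and we're on it"
    DIAGNOSING = auto()    # gathering symptoms from the wards
    FIXING = auto()        # DR plan running, timescale communicated
    BACKFILLING = auto()   # system up, missing data being entered
    RESTORED = auto()      # data current: safe to use again

ORDER = list(Phase)  # Enum iteration preserves definition order

class IncidentComms:
    """Tracks the incident phase and drafts a status line per transition."""
    def __init__(self, system_name: str):
        self.system = system_name
        self.phase = Phase.ACKNOWLEDGED
        self.log = [self._message()]  # first message goes out at the outset

    def advance(self) -> str:
        """Move to the next phase and record the new status message."""
        idx = ORDER.index(self.phase)
        if idx + 1 < len(ORDER):
            self.phase = ORDER[idx + 1]
        msg = self._message()
        self.log.append(msg)
        return msg

    def _message(self) -> str:
        return f"{self.system}: {self.phase.name.lower()}"
```

The design choice worth noting is that the message log is part of the tracker: in a real incident, "what did we tell people, and when" matters as much in the debrief as the fix itself.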
IT Service Continuity really comprises three things:
- A well-known and rehearsed workaround system for the system users
- A good DR plan for the system
- An excellent communication protocol to ensure that the incident is managed well and the right people know what they need to know at all times.
Things that can help make sure this can happen:
- IT experts who can spot a sick system
- Process whereby IT are quickly alerted to any problems experienced by users
- 24 hour IT call out system
- Protocol for responding to major technology incidents
- Robust incident communications plan
- Protocol that is well understood by all and well rehearsed by those who may need to enact it
- Understanding that the system isn’t fixed until the data is current
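The list above can be treated as a self-audit, and the final point as a closure rule. As a sketch only (the checklist strings paraphrase the bullets above, and nothing here comes from a real continuity standard):

```python
# Paraphrased from the enablers listed above; purely illustrative.
CHECKLIST = [
    "experts who can spot a sick system",
    "fast user-to-IT alerting process",
    "24 hour call-out rota",
    "major-incident response protocol",
    "incident communications plan",
    "protocol rehearsed by those who enact it",
]

def gaps(status: dict[str, bool]) -> list[str]:
    """Return the checklist items not yet in place."""
    return [item for item in CHECKLIST if not status.get(item, False)]

def incident_closed(system_up: bool, backlog_entries_remaining: int) -> bool:
    # The final point above: the system isn't "fixed" until the data is
    # current, so closure needs both a running system and an empty
    # manual backlog.
    return system_up and backlog_entries_remaining == 0
```

An up-but-stale system passes the first test and fails the second, which is exactly the trap the closure rule guards against.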
Taking on IT Service Continuity for a hospital isn’t for the faint-hearted or anyone who doesn’t deal well with stress, since everyone is well aware that they’re influencing life and death.
But I can’t think of anywhere where doing a good job is more rewarding.”