Case Study: Chapter 18 Emergency Department Downtime (page 305 in your textbook).
Complete the case study found in “Health Informatics: An Interprofessional Approach” by reading the case and answering the questions. In addition, design a form for capturing data when the Emergency Department registration system is not available. Your case study must include:
A minimum of 20 data elements
A data dictionary outlining:
Data field name
Field format (alpha, numeric, or alphanumeric)
Number of characters in the field
Source of information
Example of the content
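The data dictionary requirements above can be sketched in code. The following is only an illustration under stated assumptions: the three sample fields, their lengths, and the character patterns are hypothetical and are not taken from the textbook or the case study. It shows how each entry can carry the five required attributes and how an example value can be checked against its declared format and length.

```python
from dataclasses import dataclass
import re


@dataclass(frozen=True)
class FieldSpec:
    name: str        # data field name
    fmt: str         # "alpha", "numeric", or "alphanumeric"
    max_chars: int   # number of characters in the field
    source: str      # source of information
    example: str     # example of the content


# Assumed character sets for each format; adjust to local policy.
PATTERNS = {
    "alpha": re.compile(r"[A-Za-z .'-]+"),
    "numeric": re.compile(r"[0-9]+"),
    "alphanumeric": re.compile(r"[A-Za-z0-9 .,'/-]+"),
}


def is_valid(spec: FieldSpec, value: str) -> bool:
    """Check a value against the length and format declared in the dictionary."""
    if not value or len(value) > spec.max_chars:
        return False
    return PATTERNS[spec.fmt].fullmatch(value) is not None


# Hypothetical entries for a downtime registration form.
DICTIONARY = [
    FieldSpec("patient_last_name", "alpha", 35, "Patient or family", "Nguyen"),
    FieldSpec("date_of_birth", "numeric", 8, "Patient or photo ID", "19840312"),
    FieldSpec("downtime_record_number", "alphanumeric", 10,
              "Preprinted downtime form", "DT-000123"),
]

for spec in DICTIONARY:
    print(spec.name, spec.fmt, spec.max_chars, is_valid(spec, spec.example))
```

Checking each example against its own specification is a quick way to catch inconsistencies in a draft dictionary before the form is printed.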
Your response to the case study questions should be a minimum of one full page in length, excluding the title and reference pages, and be formatted according to APA guidelines as outlined in the Ashford Writing Center. Include a minimum of two scholarly resources, in APA format, in addition to any textbooks.
Chapter 18 Health Information Systems: Downtime and Disaster Recovery
Nancy C. Brazelton
The primary objective for downtime and disaster planning is to protect the organization and the patients who are served by that organization by minimizing disruption to the operations.
At the completion of this chapter the reader will be prepared to:
1. Explain downtime risk assessment
2. Analyze considerations for a health system inventory
3. Describe an assessment tool for evaluating a downtime event
4. Summarize the clinician’s role in system downtime planning
5. Summarize information technology’s role in system downtime planning
6. Establish key components of a business continuity plan for an organization
7. Formulate communication strategies for downtime in an organization
Bolt-on system
Business continuity
Clinical application
Cold site
Configuration management database (CMDB)
Data center
Disaster recovery
Electronic data interchange (EDI)
Enterprise resource planning (ERP)
High availability
Hot site
Human-made disaster
Incident response team (IRT)
Information system
Natural disaster
Picture archiving and communication system (PACS)
Revenue cycle
Service level agreement (SLA)
Healthcare entities are complex operations that are becoming increasingly dependent on computerization. This chapter identifies tactics related to planning for and responding to computer downtime events and disasters. Focus areas include the clinical impact, the information technology (IT) impact, business continuity, and communications. A model for assessing the level of downtime response is provided.
Healthcare entities, no matter how small or large, are extremely complex businesses that are becoming increasingly dependent on computerization in their quest to provide exceptional healthcare. The employees of these healthcare organizations move through a unique labyrinth of systems, machines, workflows, regulatory requirements, business rules, and tools to provide the best care for patients and their families and to keep the business intact.
Given the importance of information systems within a healthcare entity, these institutions and their employees must be prepared for the many variations of downtime, including human errors, software and hardware failures, severed power cables, software viruses that get past the firewall, and disruptions caused by Mother Nature. Most of the literature on downtime consists of white papers, blogs, and industry group publications found via Internet searches; there are very few research articles. The impetus to write is often a downtime event experienced by the author, who admonishes the reader to get plans in place.1–3 Anderson reports a study conducted by AC Group that determined that downtime can cost $488 per physician per hour.4 In their time and motion study, the authors determined that a physician spends 2.15 minutes in recovery for each minute the system was unavailable. The costs are staggering when translated to a large healthcare enterprise. Given the cost and the risk to patient care, preventing downtime and recovering quickly when an incident occurs are priorities for every healthcare institution.
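To see how these figures scale, the arithmetic can be sketched as follows. The two constants come from the figures cited above; the 200-physician enterprise and the 4-hour outage are hypothetical inputs chosen only for illustration, and the two measures are kept separate because the text does not say how recovery time translates into dollars.

```python
COST_PER_PHYSICIAN_HOUR = 488.0        # dollars, as cited in the text
RECOVERY_MIN_PER_DOWNTIME_MIN = 2.15   # recovery minutes per downtime minute


def downtime_cost(physicians: int, hours: float) -> float:
    """Estimated cost in dollars for a given number of physicians and hours."""
    return COST_PER_PHYSICIAN_HOUR * physicians * hours


def recovery_minutes(downtime_minutes: float) -> float:
    """Estimated recovery time per physician, in minutes."""
    return RECOVERY_MIN_PER_DOWNTIME_MIN * downtime_minutes


# Hypothetical example: 200 physicians affected by a 4-hour outage.
print(f"Estimated cost: ${downtime_cost(200, 4):,.0f}")
print(f"Recovery time per physician: {recovery_minutes(4 * 60) / 60:.1f} hours")
```

Under these assumptions a single 4-hour outage costs about $390,400 and leaves each physician with roughly 8.6 hours of recovery work.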
This chapter focuses on practical, tactical ways to put plans in place for managing health system downtime and disaster recovery. Tips and tools for downtime risk assessments, downtime and disaster response planning, clinical and IT system recovery, and business continuity are provided. The chapter also provides information on developing a communication strategy and ideas for keeping patients safe and the business functioning because downtime does happen.
Downtime Risk Assessment
Planning for downtime can and should occur from project inception through system maturity and must include all existing systems and infrastructure. The complexity and criticality of the organization will determine how extensive an exercise this must be. Potential types of downtime and their impacts on the systems should be anticipated, with mitigation plans put in place.
Downtimes can be classified by root cause and degree of impact. A network-related downtime, for example, differs from a power outage or a planned software upgrade. Likewise, the unavailability of a single noncritical software application has a different impact than the loss of the admission, discharge, and transfer (ADT) system, a third-party ancillary system (e.g., radiology and imaging, anesthesiology, respiratory therapy), or bedside physiologic monitors.
Determining the root cause of a downtime is not always straightforward. Determining what is and is not functioning can be a first step but not the final answer. For example, a network downtime may be diagnosed easily by asking the following questions: Can you access the Internet and the intranet? How about the computer next to you? What about the computer on the unit downstairs? If the answer to these questions is “no,” suspect an entire network downtime. However, networks in many healthcare enterprises are now segmented for security purposes and include a series of switches, firewalls with coded rules, as well as the actual fiber network, cable pulls, and other components. This makes diagnosis of the specific network problem or the location of the problem complex, potentially increasing the length of the network downtime. Also, external factors may exist, such as telecommunication fiber vendors that may be having problems as well as the millions of miles of fiber infrastructure vulnerable to physical damage, commonly referred to as a “backhoe outage.”
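The three diagnostic questions above can be expressed as a small triage function. This is a minimal sketch: only the first branch (all answers "no" suggests an entire network downtime) comes from the text, and the remaining branches and their wording are assumptions added for illustration.

```python
def triage_network(local_ok: bool, neighbor_ok: bool, other_unit_ok: bool) -> str:
    """Each flag answers: can that workstation reach the Internet and the intranet?

    local_ok      -- the user's own computer
    neighbor_ok   -- the computer next to the user
    other_unit_ok -- a computer on another unit (e.g., downstairs)
    """
    answers = (local_ok, neighbor_ok, other_unit_ok)
    if not any(answers):
        # From the text: "no" to every question suggests the whole network is down.
        return "suspect entire network downtime"
    if all(answers):
        # Assumption: connectivity is fine, so look elsewhere.
        return "network reachable; check the application or server"
    if not local_ok and neighbor_ok:
        # Assumption: only this device is affected.
        return "suspect this workstation or its connection"
    # Assumption: mixed answers point to one segment of a segmented network.
    return "suspect a segment-level problem (switch, firewall rule, cabling)"


print(triage_network(False, False, False))
print(triage_network(False, False, True))
```

Because many networks are segmented, a mixed set of answers narrows the search without identifying the failed component, which is why the last branch names only a class of causes.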
Thus, the first step in preventing or managing a downtime is to determine what might cause a downtime and then to perform a risk assessment of the impact for each potential downtime. This step can be iterative with step three discussed below, which is compiling an inventory of existing applications and systems. Classifying all potential downtimes and putting them into mutually exclusive categories can be difficult. It is usually best to start by identifying the most common technology sources of downtime and documenting them. A systematic approach starting with infrastructure is more likely to ensure a comprehensive list. Begin by dividing the infrastructure into IT infrastructure and physical infrastructure.
IT infrastructure can include the network and application delivery systems, such as those listed in Box 18-1. Examples of physical structures include those listed in Box 18-2. Some overlap exists between the IT infrastructure and the physical infrastructure. Also, this model does not address partial versus complete downtimes. Box 18-3 includes examples of IT infrastructure and physical infrastructure. The order of the elements does not reflect their priority. Priority will be unique for each organization. Networks include both hardware and software components and therefore need to be examined from both aspects.
Box 18-1 Sample Elements of Information Technology Infrastructure
• Electronic health record software
• Clinical and ancillary system software (e.g., physiologic monitoring, endoscopy, registry databases)
• Picture archiving and communication system (PACS)
• Laboratory applications
• Cardiology applications
• Radiology applications
• Anesthesia systems
• Surgical processing systems
• Revenue cycle software
• Interfaces or the interface engine
• Enterprise data warehouse
Box 18-2 Examples of Information Technology Physical Structure
• Hardware related to the chillers that keep the data center cool
• Storage (physical hardware that stores the electronic health record, email, and other third-party systems)
• Electrical power
• Network switches and hubs
• Biomedical devices
• Any component of the buildings themselves
Box 18-3 Information Technology (IT) and Physical Infrastructure
IT Infrastructure
Application delivery system (e.g., Citrix)
Biomedical devices with software components or network requirements
Email, other communications
Enterprise data warehouse
Help desk and computer support
Identity management (e.g., active directory)
Keyless entry systems or other security software
Software or applications (should have a complete list with interdependencies identified)
Telecommunication systems (hospital operators, paging, cellphone, analog, voice over Internet Protocol)
Physical Infrastructure
Buildings and facilities (list most likely maintained by an environment of care committee per requirements of The Joint Commission and other regulatory agencies)
Chillers for data center
Generators and fuel supply
Help desk and computer support
Inventory (think broadly; replacement computer hardware to patient care supplies and paper forms)
Medical record (paper) storage
Network cables and other physical components
Pneumatic tube systems
Physical security of buildings
Storage area network, other storage
Switches and hubs
Utilities: power, heating, cooling, water
The second step is to identify the most common potential causes of a downtime in the facility, areas of vulnerability, and the most likely scenarios of natural disasters or human-made disasters in the geographic area. Is the facility or data center on an old or outdated power grid? Is the building only rated to withstand a 6.5 magnitude earthquake in an area where experts predict one much stronger? Is the facility’s generator located in an area that is vulnerable to flooding? Common disasters are noted in Table 18-1; these lists are not mutually exclusive but instead provide a starting point for planning at a specific organization.
The third step is to complete an inventory of all systems and document them. All systems in use at the organization should be inventoried because each is important to some aspect of the business. A sample inventory is located in Table 18-2.
An inventory list can be surprisingly difficult to compile and may involve doing walk-throughs of departments or units to observe systems and devices that end-users are actually using in their day-to-day workflow. This is especially critical if the institution has a hybrid system (i.e., a system using multiple vendors for specific functionality). For instance, the ancillary systems within a hospital often have few but highly specific data requirements. These unique systems are added on to the main electronic health record (EHR) because of the need for specific functionality or perhaps because the ancillary application was built and implemented prior to the EHR. Table 18-3 lists areas that may have special considerations within an acute care setting.
To locate applications, consider functional operations, type of personnel, data and information being processed or consumed, how and where vital records are housed, and policies and procedures that guide the business or practice.5 This compilation will involve persistently reaching out to all members of the IT team or others who host applications or provide some component of infrastructure for input. The more specific and complete the inventory, the more useful and helpful it will be for planning purposes in the event of an actual planned or unplanned downtime. Due to the difficulty of getting a very complete inventory, it is best to start with the most critical applications and work on less critical ones later. Minimally, the inventory should include the items listed in Box 18-4.
Other data useful for general system support and downtime planning should be carefully documented as well. These items may be part of the application inventory or they may be housed in a separate document, including those listed in Box 18-5.
As the final step, system dependencies, configuration diagrams, and interface data should also be documented and stored in a place that is easily accessible and backed up on a routine basis. From an Information Technology Infrastructure Library (ITIL) perspective, the best practice for maintaining this inventory and documentation is a configuration management database (CMDB) that has configuration items unique to the organization. However, many tools or combinations of tools are available for this purpose, such as shared drives and folders, spreadsheets, databases, vendor-supplied tools, wiki sites, collaboration software, document management system repositories, or even simple paper notebooks.
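One reason to document dependencies is that they make it possible to work out which systems are affected when a single component fails. The sketch below illustrates the idea with a plain dictionary rather than a real CMDB product; the system names and the dependency map are hypothetical.

```python
from collections import deque

# Hypothetical dependency map: each system lists what it directly depends on.
DEPENDS_ON = {
    "EHR": ["application delivery", "interface engine", "network"],
    "application delivery": ["identity management", "network"],
    "interface engine": ["network"],
    "laboratory system": ["interface engine"],
    "identity management": ["network"],
    "billing": ["EHR"],
}


def affected_by(failed: str, depends_on: dict) -> set:
    """Return every system that depends, directly or indirectly, on `failed`."""
    # Invert the map: component -> systems that depend on it directly.
    dependents = {}
    for system, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(system)

    # Breadth-first walk outward from the failed component.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for system in dependents.get(current, ()):
            if system not in affected:
                affected.add(system)
                queue.append(system)
    return affected


print(sorted(affected_by("interface engine", DEPENDS_ON)))
```

In this made-up map, losing the interface engine affects the laboratory system and the EHR directly and billing indirectly, which is the kind of answer an incident team needs in the first minutes of an event.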
TABLE 18-1 Downtime Vulnerabilities and Common Human-Made and Natural Disasters
MOST SIGNIFICANT DOWNTIME VULNERABILITIES
Buildings or data center not to current code or vulnerable to natural or human-made disasters
Lack of recovery site
Lack of disaster planning or business continuity planning
Lack of backups or inability to recover from backups
Lack of high availability or failover for critical systems or applications
Lack of downtime planning
Outdated or aging physical infrastructure
Outdated or aging technology (i.e., not on current or supported level of code or servers that are no longer supported)
Power grid and supply
System resources at or near capacity (disk space, database, storage, etc.)
HUMAN-MADE DISASTERS
Explosion: intentional (bomb)
Explosion: unintentional (natural gas line rupture)
Workforce violence, shootings, loss of life of key personnel
NATURAL DISASTERS
Landslide, mudslide, debris flow
Space weather, geomagnetic storm
TABLE 18-2 System Inventory Considerations*
TYPE OF SYSTEM
Core clinical applications
Electronic medical record (EMR), electronic health record (EHR), emergency department, computerized provider order entry (CPOE), clinical documentation, medication administration record (MAR), surgical services and anesthesia information system
Ancillary service and procedure area information services
Pharmacy, radiology and imaging, laboratory, arterial blood gas, cardiology, endoscopy, respiratory, neurology, nutrition care, dictation, health information management, biomedical devices (physiologic monitors, vital sign machines, intravenous pumps, ventilators, pneumatic tube systems, etc.)
Online reference databases
Drug information references; patient education; policies and procedures; disease, diagnosis, and interventional protocol databases; formulas or health-related calculators
Admission, discharge, and transfer; enterprise scheduling; preauthorization; facility and technical billing; coding; professional and physician billing; claim scrubbers; print vendors; address verification; electronic data interchange (EDI) transactions; benefit checking
Business, finance, and personnel
Email, office software, cash collections, credit card transactions, banking, business intelligence, reports and reporting, supply chain and enterprise resource planning (ERP), budgeting, human resources, payroll, staff scheduling, keyless entry, facilities and engineering, telephone systems and wiring, telephone operators, paging systems, wireless communication devices
Printers, Bluetooth devices (scanners, label printers), reports, data warehouse, scanning, print vendor, Internet-based public web pages, Intranet and related internal web sites, wikis, clinical health information exchanges, retail outlets (retail pharmacies, gift shops, food service)
* Not a complete list.
TABLE 18-3 Considerations by Area for Acute Care Setting
Ventilators, anesthetic gases, frequent vital signs
Automated charge capture
Can be from many systems
Cardiac catheterization lab
Hemodynamic monitors, image capture, documentation for registries
Tracking patients in the waiting room and through the department
Image capture and specific discrete documentation for registries or billing
Coding, release of information, maintenance of the legal medical record, legal cases, insurance queries, scanning solutions
Computerized provider order entry (CPOE), clinical decision support systems, diagnostic test results, dictation
Newborn intensive care
Bedside monitoring and extracorporeal membrane oxygenation devices, ventilators, intravenous pumps
Care planning, CPOE, Bar Code Medication Administration or electronic medication administration record, telemetry, patient communication systems with nurses, clinical decision support reminders, nursing databases (e.g., patient education resources)
Fetal monitoring, mother and baby monitoring and documentation during the labor period, preterm wave forms, information from mother’s record that needs to be available on newborn’s record for continuity of care
Medication dispensing machines (Pyxis and Omnicell), robots, inpatient versus retail pharmacy ordering systems, intravenous pumps that contain drug-specific information (Alaris, etc.), pharmacy ordering system often interfaces with the medication supplier
Physical, occupational, and speech therapies
Therapy systems have specific patient education content
Often supported by biomedical engineering and has vendor-specific content
Radiology, imaging, and picture archiving and communication system
Image and procedure capture, multiple modalities, questionnaires
Contains respiratory measures like ventilator settings, ventilator weaning parameters, respiratory treatments and measurements
Box 18-4 Inventory Items
• Vendor name
• If developed in-house, where is source code and other documentation?
• Date of contract and its current location
• Application or module name
• Date of original go-live
• Current version
• Date of upgrade
• Categorization (site defined): major/minor, Tier I/II/III, other
• Host model (where the application or service is located): in-house or remote
• If remote, supported by whom?
• Interfaces, both inbound and outbound
• Third-party bolt-on systems
• Other key dependencies
• Primary use
• Primary users and number of users
• Business owners
• IT contacts
• Notes and comments
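The inventory items in Box 18-4 map naturally onto a structured record. The following sketch covers a subset of those items; the field names, the choice of which fields may be left empty, and the sample values (including the vendor name) are assumptions made for illustration. It also shows a simple completeness check, since an inventory is most useful when its gaps are visible.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class InventoryRecord:
    application_name: str
    vendor_name: str = ""
    current_version: str = ""
    go_live_date: str = ""
    tier: str = ""            # site-defined categorization, e.g., Tier I/II/III
    host_model: str = ""      # "in-house" or "remote"
    inbound_interfaces: List[str] = field(default_factory=list)
    outbound_interfaces: List[str] = field(default_factory=list)
    business_owners: List[str] = field(default_factory=list)
    it_contacts: List[str] = field(default_factory=list)
    notes: str = ""


# Assumption: these fields may legitimately be empty for some applications.
OPTIONAL = ("notes", "inbound_interfaces", "outbound_interfaces")


def missing_items(record: InventoryRecord) -> List[str]:
    """List required inventory items that have not been filled in yet."""
    return [f.name for f in fields(record)
            if f.name not in OPTIONAL and not getattr(record, f.name)]


# Hypothetical, partially completed entry.
lab = InventoryRecord("Laboratory system", vendor_name="ExampleVendor",
                      tier="Tier I", host_model="in-house")
print(missing_items(lab))
```

Starting with the most critical applications, as recommended above, a report of missing items per record gives the team a running list of what still needs to be collected.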
Box 18-5 Elements for General Systems Support
• A checklist for the team to follow when a planned or unplanned downtime occurs
• Known system vulnerabilities
• Frequent error messages
• Patterns of error messages indicating known problems or pending system failure
• Knowledge objects used in supporting or maintaining system
• Contact information with phone numbers for vendors and the teams supporting the application
• Service level agreements (SLAs) with key users of the application
• Preferred user communication plan for planned and unplanned downtimes
• Unit- or department-based workflow diagrams
• Unit or department blueprints that document all electronic devices connected to the wired or wireless network
• Agreed-upon time for planned changes and maintenance work, also known as a change window
• Policies, procedures, rules, or standards from information technology or the broader organization applying to the particular application
With the appropriate data collected, IT should work very closely with the organization’s emergency preparedness and disaster planning groups in planning for disasters. Having some component of IT downtime as a part of disaster drills is a very effective way for IT and staff to practice disaster response and hone plans. The Federal Emergency Management Agency (FEMA) website at www.ready.gov is extremely helpful for planning at an institutional and personal level and is full of information in the event of an actual disaster.
Downtime and Response Planning
Once the system documentation has been developed and the risks are understood, the healthcare institution is ready to define the different types of potential downtimes on a continuum indicating the degree of significance for each potential downtime. The significance of the different downtimes will depend on the level of complexity and the installed base of the institution. For example, if the organization has results review implemented and a single results feed is down, the response will be very different from the response when the organization has a mature EHR and the network is unavailable. Is this a physician’s office with a single provider and a very experienced staff or a huge integrated delivery network (IDN) spread across multiple states or geographic regions? Consider both planned and unplanned downtimes. If a downtime has been scheduled, there should be ample time to plan; however, if the downtime is unexpected, there may be no contingency plans in place. A worst-case scenario of an unplanned downtime could be a total loss of the network occurring midweek at the start of the business day when the hospital and clinic schedules are full, all operating rooms are in use, and the emergency department is busy and expecting two traumas (one via flight service) on a snowy winter day when 20% of staff is late due to road conditions.
TABLE 18-4 Impact Considerations
Expected or actual duration: 4 hours; should also plan for a catastrophic event where network or systems will have to be rebuilt, which will be weeks to months
Time of day: Middle of the slowest night of the week to middle of the day on the busiest day of the week
Number of users affected and scope and breadth of downtime or outage: Single user or department to the entire facility; partial to full; single system or infrastructure component to complete loss of application, network, or building
Information technology (IT) infrastructure: Intact to completely damaged and replacement parts need to be ordered
Impact on workflow: Users are able to carry on activities with minimal disruption to complete change in workflow reverting to inefficient manual systems
Complexity of IT installs and criticality of applications: Single system with review-only functionality to an organization that is >90% electronic and paperless with multiple systems for all business and healthcare requirements
Planned or unplanned: Downtime is scheduled during agreed-upon service level agreement and system comes back up as promised to an unexpected systemwide downtime with no estimated time to recovery
Complexity of health system or complexity and criticality of unit or department affected: Single office to multistate integrated delivery network; office that still keeps paper records to a fully electronic intensive care unit with patients on multiple assistive devices
Communication methods and mechanisms: All communication mechanisms except word of mouth are unavailable to multiple different mechanisms that account for multiple different scenarios
Redundancy of infrastructure and the ability to recover: No redundant systems or infrastructure to fully redundant, highly available system in a colocated data center
Maturity of downtime plans, policies, and procedures and availability of backup supplies: No plans or supplies to mature and tested policy, procedure, and plans with stocked supplies and staff aware of them
Table 18-4 identifies a number of elements that will influence the impact of an individual downtime. The list in Table 18-4 is not intended to be followed linearly. Different scenarios or error messages will lead down different paths just as different symptoms might lead to different diagnoses in patient care. One would take different actions if a patient’s temperature increased by a half-degree Celsius and the heart rate increased by 10 beats per minute over the last 45 minutes than if a patient had a detached intravenous line. The same process occurs when managing EHRs or other systems. Being able to quickly translate error messages and recognize patterns is key to reducing the length of the downtime and restoring clinician and staff workflow.
Once the organization has clear definitions for downtimes and has methods of assessing the significance of potential events,2 the emergency preparedness and disaster planning team can develop the response and recovery plans. A comprehensive and accurate assessment will provide a reliable starting point for the team responding to the downtime, thereby decreasing chaos and saving critical time at the start of an event. The downtime plan will include different levels of interventions for different events. For example, a downtime event with a simple application may be managed with a decision tool. An example of a simple decision tree is shown in Figure 18-1. However, this same approach may become too cumbersome when dealing with multiple systems. In these cases an organization may use a “level” system such as the one shown in Table 18-5.
FIG 18-1 Simple downtime decision tree. IT, Information technology.
Table 18-5 Downtime Levels
Part of a system down or unavailable but minimal impact and no loss of content or data integrity. Expected time to recovery less than 1 hour.
Information technology (IT) team and targeted users only are involved per standard service level agreement (SLA).
Complete system unavailable, data may be unavailable, and data will have to be entered into the system to maintain integrity. Expected time to recovery up to 4 hours.
IT team and targeted users are involved per standard SLA. Unit-based downtime plans invoked. May require additional communication to stakeholders and plans for reentry of data.
Multiple systems unavailable, big impact on workflow, and content may be unavailable and will have to be entered into the system to maintain integrity. Expected time to recovery greater than 4 hours.
IT incident response team involved along with multiple teams. Downtime plans invoked. Broad communication to the organization. Notification to key stakeholders and administration. Plan for reentry of data.
All systems and network unavailable but root cause is known and recovery is possible. Users must complete downtime plans. Estimated time to recovery greater than 4 hours.
IT incident response team involved along with multiple teams. Downtime plans invoked. Broad communication to the organization. Notification to key stakeholders and administration. Plan for reentry of data. May involve emergency response team and opening of command center.
All systems and network unavailable. Major catastrophic event and facility structure may be compromised. Systems and infrastructure need to be rebuilt. Systemwide emergency plans and response invoked.
All hands on deck and event directed per emergency response team or administration. May require communication to the wider community.
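The rows of Table 18-5 can be read as a classification rule. The sketch below is one possible reading, not the table's own algorithm: the level numbers 1 through 5 are assumed from the order of the rows (the numbering does not appear in the table as reproduced here), the scope labels are hypothetical, and the rule that a longer expected recovery escalates the level is an added assumption.

```python
def downtime_level(scope: str, hours_to_recovery: float,
                   rebuild_required: bool = False) -> int:
    """Map a downtime to an assumed level 1-5 based on the rows of Table 18-5.

    scope: "partial"  -- part of one system is down
           "single"   -- one complete system is unavailable
           "multiple" -- multiple systems are unavailable
           "all"      -- all systems and the network are unavailable
    """
    if scope not in ("partial", "single", "multiple", "all"):
        raise ValueError(f"unknown scope: {scope!r}")

    if scope == "all":
        # Catastrophic events that require rebuilding are the top level.
        return 5 if rebuild_required else 4

    level = {"partial": 1, "single": 2, "multiple": 3}[scope]
    # Assumption: a longer expected recovery escalates the response.
    if hours_to_recovery > 4:
        level = max(level, 3)
    elif hours_to_recovery >= 1:
        level = max(level, 2)
    return level


print(downtime_level("partial", 0.5))
print(downtime_level("single", 6))
print(downtime_level("all", 720, rebuild_required=True))
```

Each organization would substitute its own thresholds and attach its own predefined response to each level.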
With a multiple systems downtime event, the use of a tool to quickly assess the significance of the downtime event will help the IT team and users to determine which of the predefined responses should be invoked. One example of such a tool is the Downtime Determinator© depicted in Figure 18-2. The x-axis is the length of downtime (or time to recovery) and the y-axis is the impact and risk. Each of the seven risk attributes is plotted on the Downtime Determinator tool in one of the four quadrants and a pattern or cluster of numbers will start to emerge. The pattern of numbers becomes the basis for evaluating the event.
FIG 18-2 Downtime Determinator model. IT, Information technology.
Using the risk attributes described above, quadrant responses would be defined by the organization and might be similar to the earlier examples provided with the levels. For example, when the majority of numbers cluster in the lower left quadrant, this should invoke a quadrant 1 response; numbers clustered in the upper right quadrant should invoke a quadrant 4 response. The lower left quadrant represents the least critical events and the upper right quadrant represents the most critical events. The Downtime Determinator displays the numbers 2/3 and 3/2 in the upper left and lower right quadrants, respectively. Each organization will need to assign quadrants 2 and 3 based on its assessment of each individual downtime event, after considering the impact and risk versus time. There may be times when the length of the downtime is so long that the event warrants a quadrant 3 (more intense) response. There may be other times when the event is so massive that even though the downtime is scheduled for 30 minutes, the impact to the organization is so great that it warrants a quadrant 3 response.
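The quadrant logic described above can be sketched in a few lines. This is an illustration only and not the Downtime Determinator itself: the 4-hour cutoff on the x-axis, the 0 to 10 impact scale with a cutoff of 5 on the y-axis, and the sample attribute scores are all hypothetical values that an organization would replace with its own definitions.

```python
from collections import Counter


def quadrant(hours: float, impact: float,
             hours_cut: float = 4.0, impact_cut: float = 5.0) -> str:
    """Place one risk attribute in a quadrant.

    hours  -- length of downtime or time to recovery (x-axis)
    impact -- impact and risk score (y-axis); the scale is an assumption
    """
    long_outage = hours > hours_cut
    high_impact = impact > impact_cut
    if not long_outage and not high_impact:
        return "1"      # lower left: least critical
    if long_outage and high_impact:
        return "4"      # upper right: most critical
    # Upper left is labeled 2/3 and lower right 3/2; the organization
    # decides whether each warrants a quadrant 2 or quadrant 3 response.
    return "2/3" if high_impact else "3/2"


def dominant_quadrant(points) -> str:
    """Return the quadrant where most attributes cluster.

    If two quadrants tie, the one encountered first is returned.
    """
    counts = Counter(quadrant(h, i) for h, i in points)
    return counts.most_common(1)[0][0]


# Hypothetical scores for a planned 3-hour overnight downtime.
attributes = [(3, 2), (3, 1), (3, 6), (3, 3)]
print(dominant_quadrant(attributes))
```

Here three of the four attributes fall in the lower left, so the cluster points to a quadrant 1 response even though one attribute scores high on impact.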
The Downtime Determinator is similar to the “level” system mentioned earlier, as both have four categories. However, the Downtime Determinator allows for more specificity and nuances in responses because each of the attributes can be considered separately and responded to in relation to other attributes. When using a tool such as the Downtime Determinator, each organization should customize the tool by defining each attribute and delineating time along the x-axis that is significant to the organization. Four scenarios are outlined below using the Downtime Determinator, ranging from least to most impact.
FIG 18-3 Downtime Determinator quadrant 1 example. IT, Information technology.
• Scenario 1: Level I trauma and teaching hospital. Planned EHR downtime from 0200 to 0500 on a Wednesday night. IT infrastructure and communications intact (Fig. 18-3).
• Scenario 2: Community hospital. Unplanned downtime of the hospital billing system at 1530 on a Monday. Multiple staff unable to do work but patient care is not affected. Recovery expected in 3 hours. System requires replacement of a hard drive; the hard drive is available locally and it should be delivered to the data center by the vendor shortly. Communications intact (Fig. 18-4).
• Scenario 3: Acute care hospital with multiple intensive care units (ICUs) with a physiologic monitoring system interfaced to the EHR via bedside medical device integration (BMDI). The vendor-specific server is damaged due to a water spill in the communication closet and the BMDI unexpectedly quits working at 0800 on a Saturday morning. The vendor indicates a 2-week lag until a replacement server will be available (Fig. 18-5).
• Scenario 4: An F-16 military airplane crashes into the data center of an academic medical center at 1900 on a Friday night during training exercises. The data center is destroyed physically, a large fuel spill covers the area, and there is complete IT system downtime. The hospital and campus are otherwise intact. The academic medical center has a hot site that can host approximately 30% of critical systems and a cold site that has capacity to host the remainder of the systems. The critical hot site applications can be available in 24 hours but the remaining 70% of applications to be built in the cold site will take 30 days for complete recovery (Fig. 18-6).
FIG 18-4 Downtime Determinator quadrant 2 example. IT, Information technology.
FIG 18-5 Downtime Determinator quadrant 3 example. IT, Information technology.
FIG 18-6 Downtime Determinator quadrant 4 example. IT, Information technology.
As seen with these scenarios, the clustering of numbers helps to guide the response of the organization in managing the event or disaster. A quadrant 1 event (scenario 1) should be a routine event that is managed with standard processes, communications, and service level agreements (SLAs) that are already in place. With a quadrant 4 major disaster (scenario 4), the response should be massive and system- and organization-wide. Such a disaster would have a significant impact on the ability of the organization to carry on the business of healthcare. Taking time to plan and put realistic processes into place before the event will determine whether the organization can continue to care for patients safely and keep the business intact.
Clinical Impact and Planning: Acute Care Focus
With the increased use of technology at the point of care and in the clinical environment, healthcare organizations must now determine their response when this technology is unavailable. How do clinicians find historical data, including recent vital signs, the first of three troponin results, the history and physical prior to surgery, and the last time the PRN (as needed) pain medication was given and the patient’s response to it? How do healthcare providers document new events, medications, and treatments? How does the pharmacy dispense a medication and keep track of the dispensing? Do the automated supply cabinets have programming that permits overriding the system and, if so, how are charges captured after the fact? These are the potential problems that may arise at institutions in the event of EHR unavailability. Each addition to the technology in use can produce unintended consequences and in turn affect the initial assessment and resulting plan. For example, a 4-year analysis of the unintended consequences of computerized provider order entry (CPOE) identified nine types of unintended consequences.6 A list of these unintended consequences is included in Box 18-6.
Logically, the more electronic components in the organization, the more complex troubleshooting becomes. The elements that need to be considered include inpatient and outpatient venues, networks, intranets, printers, databases, interfaces, storage hardware, published applications, and layering software used to manage the myriad devices in an institution. Failure at any of these points may result in some sort of downtime for clinicians. Reducing the risk of a downtime can be accomplished by using the approaches outlined in the following sections.
Redundant systems, also known as backup systems, provide clinicians the ability to access some if not all patient data during an electronic downtime. If clinicians can recover just enough information to carry on with patient care from the point at which the downtime begins, care can proceed safely.7 Therefore a subset of critical data must always be available, even during downtime. Each individual organization should define the required subset of data according to applications and services. Suggestions are basic demographics, orders, medication administration records (MARs), most recent vitals, laboratory values, imaging reports, and physician and provider progress notes.
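One way to make the "required subset" concrete is to write it down as a checklist against which a downtime snapshot can be checked. The following is a minimal sketch, assuming hypothetical category names drawn from the suggestions above; `REQUIRED_CATEGORIES` and `missing_categories` are illustrative names, not part of any vendor product, and each organization would substitute its own list.

```python
# Hypothetical category names based on the suggested subset in the text.
REQUIRED_CATEGORIES = frozenset({
    "demographics",
    "orders",
    "mar",              # medication administration record
    "vitals",
    "lab_values",
    "imaging_reports",
    "progress_notes",
})


def missing_categories(snapshot: dict) -> list:
    """Return the required categories that are absent or empty in a snapshot."""
    return sorted(
        name for name in REQUIRED_CATEGORIES
        if not snapshot.get(name)
    )


if __name__ == "__main__":
    # Illustrative test patient only; no real data.
    example = {
        "demographics": {"name": "TEST, PATIENT", "mrn": "0000000"},
        "orders": ["CBC in AM"],
        "mar": ["acetaminophen 650 mg PO 0800"],
        "vitals": [{"hr": 72, "time": "0745"}],
        "lab_values": [],          # empty, so it is reported as missing
        "progress_notes": ["Stable overnight."],
    }
    print(missing_categories(example))  # ['imaging_reports', 'lab_values']
```

A check of this kind could be run against the backup system on a schedule so that gaps are found before a downtime rather than during one.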
Box 18-6 Unintended Consequences of the Impact of Computerized Provider Order Entry
• Increased workload
• Workflow issues with mismatch of order entry and related activities
• Never-ending demands
• Paper persistence
• Communication issues with changes in communication patterns
• Emotional reactions to issues and problems
• New kinds of errors introduced by use of a computer
• Changes in the power structure
• Overdependence on technology
Vendors are increasingly responding to the need for improved downtime solutions. As an example, a vendor might install one to four stand-alone machines in each patient care area depending on the average patient census and geographic layout of the unit. Each machine might be designed to store a subset of historical patient data up to 30 days. During a downtime, staff access these machines to retrieve the previous 30 days of patient-related data. Once the downtime begins, these systems will no longer be updated. New patient data that are generated as patient care continues must be maintained manually during the downtime event. Because healthcare providers may need printed data during the downtime event, each machine should be directly connected by cable to a printer so it is not dependent on the down network connection for printing services. Another benefit of these machines is that they can be portable. In the event of a hospital evacuation, these machines can be removed from the premises and data from these machines can be used until a recovery plan is in place.
However, this particular downtime solution has limitations. One concept that is difficult for clinical and IT staff to understand is that once the network becomes unavailable, these downtime machines are no longer updated with patient information. This requires the healthcare providers to check the new manually recorded data as well as the historical data maintained in the temporary system when providing care. Another limitation is that the data may be organized differently than in the EHR and as a result information may be displayed or printed in a different format. This can cause confusion and even errors in patient care.
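The "frozen copy" behavior described above can be illustrated with a short sketch. It assumes a hypothetical record layout of `(timestamp, description)` tuples and the 30-day window from the vendor example; the function names are illustrative only.

```python
from datetime import datetime, timedelta

RETENTION_DAYS = 30  # the example in the text stores up to 30 days of history


def visible_history(records, downtime_start):
    """Return the records a stand-alone machine would hold at downtime start.

    Each record is a (timestamp, description) tuple. Anything charted after
    downtime_start never reaches the machine, and anything older than the
    retention window has aged out.
    """
    oldest = downtime_start - timedelta(days=RETENTION_DAYS)
    return [r for r in records if oldest <= r[0] <= downtime_start]


def staleness(downtime_start, now):
    """How long the machine's copy has gone without an update."""
    return now - downtime_start


if __name__ == "__main__":
    start = datetime(2024, 1, 31, 8, 0)
    records = [
        (datetime(2023, 12, 15, 9, 0), "vitals, outside the 30-day window"),
        (datetime(2024, 1, 30, 9, 0), "PRN pain medication given"),
        (datetime(2024, 1, 31, 9, 0), "charted on paper after downtime began"),
    ]
    for when, what in visible_history(records, start):
        print(when, what)
    print("Copy is stale by", staleness(start, datetime(2024, 1, 31, 12, 0)))
```

Only the middle record is visible on the machine in this example, which is why clinicians must also consult the manually recorded data.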
The downtime solution using temporary machines must meet Health Insurance Portability and Accountability Act (HIPAA) requirements for security, privacy, and confidentiality. As a result, these machines require an extra layer of encryption to prevent information theft in the event that the machine is removed from the hospital. However, these encryption systems typically are add-ons that slow down the response time of patient care applications running on the machines.
Other terms used to describe redundant systems are shadow, mirror, or read-only systems. The downtime system described above is a shadow or mirror system that clinicians can only read; they cannot add any patient data to this type of system. Some shadow or mirror systems duplicate the EHR. In the event that the primary system crashes, the secondary system automatically, and hopefully seamlessly, transitions the clinician to the secondary system. The clinician continues to document orders, medications, or care. Generally these systems reside on separate hardware that “mirrors” the configuration of the primary system. These systems are generally more robust than the redundant downtime solution described above, encompassing a similar look and feel and often read and write capability. These systems are beneficial because clinicians use their current log-in and password to access the system, the look and feel of the system are almost identical to the EHR, and printing can be available.
Redundant systems often resemble the configuration of the existing EHR, requiring a substantial financial investment with the vendor. The financial investment can be an obstacle; therefore, a business case should be made with and for the clinicians on behalf of patients. The more mature the EHR, the more dependent the clinicians are on the system to get information to provide patient care. Investing in a backup system of this caliber is arguably a necessity after institutions have reached a certain level of EHR maturity, and organizations should assess this requirement frequently.
In addition to the above solutions, homegrown web-based solutions are available for use by clinicians prior to a planned downtime. For example, a web-based solution may be configured so that clinicians can print an MAR or all current patient orders. These have proven helpful during planned upgrades because they provide clinicians with enough information to weather the upgrade as well as provide a place to begin the manual documentation of patient care in the immediate period following the downtime.
Downtime Policies and Procedures
Approved downtime policies and procedures are needed to guide the clinical team. These policies should be prescriptive, include roles and responsibilities, and define workarounds or manual procedures that allow for the continuity of critical functions. They should include specific instructions about required data entry to the legal and permanent EHR record at the conclusion of the downtime. Examples of downtime policies are available in the literature and on the Internet.7–10
A best practice is for patient units to have up-to-date “downtime” boxes.11 Each box contains documentation specific to the patient care area. For example, ICUs may revert to traditional six-panel paper flow sheets. Other patient care areas have screenshots of the electronic “patient admission” form or other forms directly from the EHR. When no preprinted forms are available, blank or lined pieces of paper are used and work as long as healthcare providers are aware of documentation requirements. Each downtime box should be stocked to last at least 24 hours and have a documented plan in the event of a catastrophic event. Each patient care area is expected to maintain and customize the contents of its “downtime” box.11 Informaticists can partner closely with clinicians to create downtime policies and procedures to ensure that clinical requirements are matched with available IT solutions.
IT Impact and Planning
The IT impact and downtime risk can be reduced by following a systematic process when applying changes to the “production” or “live” system. One approach is to establish a service management program that organizes the risk assessment and the downtime planning document. Service management is a discipline for managing IT systems that focuses on the customer and the business and its operations as opposed to simply being technology-centric. The service life cycle includes service strategy, service design, service transition, service operation, and continual service improvement.12 Interestingly, this life cycle is similar to both the system development life cycle used to implement computer systems13 and the nursing process.14 Various process-based systems exist to assist the IT team in instituting a service management program, including ITIL,12,15 Six Sigma, and total quality management (TQM). These systems require standardized terminology, problem identification and management, change control measures, and communication patterns. The benefits of using these systems are agreed-upon, realistic service levels; predictable and consistent processes; metrics; and alignment with business needs.
Implementing a service management program requires financial and time commitments from the organization, the IT executive, and all members of the IT team but is well worth the investment. Commitments are needed from the IT staff to fill roles on committees such as the change advisory board, incident response team (IRT), and IT service management. The benefits of a service management program include a framework with clear rules and processes to structure IT activities so there are fewer unplanned events. The main disadvantage is that the program will invariably lengthen the time to implementation for some new code or new functionality. However, waiting curbs knee-jerk reactions and gives the IT team more time to test the new functionality and discover any dependencies. Waiting also benefits clinicians because they can negotiate a standard change time and reduce unnecessary downtimes.
Organizations are obligated to have contingency and disaster plans to comply with the Security Rule issued under HIPAA (enacted in 1996) and with the requirements of the U.S. Department of Health & Human Services and The Joint Commission. A separate set of IT policies should exist to supplement the organization’s overall disaster plan and include security and privacy components. Senior leadership of the IT department, the security and privacy office, the emergency preparedness group, and senior leadership from the broader organization should review and approve the plans.5 These plans should be frequently reviewed, tested, and revised as needed. Staff need to be updated on a consistent basis so they are prepared to implement a contingency and disaster plan with minimum effort. Considerations for the IT components of an IT disaster plan are listed in Table 18-6.
Once contingency and disaster plans have been implemented, and at the point when the institution is converting back to its standard systems, the organization needs to be prepared to test the clinical system rapidly to ensure that all aspects of the system are functioning as planned. Having up-to-date test plans for all clinical systems is an integral part of turning around a downtime quickly once hardware and database issues have been resolved. Software upgrades can take place every 6 months. If the test plans are not current, the testing process may not be reliable. In addition, without systematic preplanned processes in place it is difficult to enlist the help of non-IT people. Keeping test plans up to date ensures that people external to the recovery team can assist with testing and getting the system online sooner.
TABLE 18-6 Information Technology (IT) Contingency and Disaster Recovery Plan Considerations
• Identify and agree on roles and responsibilities
• Document and make accessible system documentation
• Demand “knowledge transfers” among all team members to avoid SPOFs (single points of failure)
• Develop stakeholder relationships
• Update emergency contacts for vendors and IT staff
Data backups are completed with the intent to prevent data loss and provide the ability to recover data if they are lost or corrupted. Backups are scheduled daily or as needed, as determined by the business. There are hardware and software components for storage and backup. Backups should be stored with the same care and level of security as the production data.
• Preserve all critical data associated with the patient, the business, finances, payroll, and personnel
• Be knowledgeable about retention policies of medical records and business documents for the organization and state
• Duplicate data on tapes, disks, and optical disks
• Store data in one of the clouds and/or across multiple servers
Off-site storage of removable media
• Include a secure plan for transporting the media to an off-site location
• Encrypt data
Ensuring that there is adequate storage for the production systems
• Evaluate storage capacity and procedures proactively
Having development and test domains for all systems
• Test all changes to production prior to promoting the code to production using a formal process
System monitoring and notifications
• Keep a spreadsheet of common errors to help speed up diagnostics
• Introduce various types of downtimes to a test (nonproduction) system and evaluate the error messages or system issues; for example, in the nonproduction system, turn off the interfaces to see what types of errors or alerts you receive in the monitoring
• Work with the database team to mimic at-capacity database tables and carefully monitor system errors that are displayed
Plan for relocation of equipment for continuity of operations. All contracts signed with vendors should have a disaster recovery component based on the organization’s requirements.
• Design systems to be highly available and redundant
• Archive source code with a reputable third-party company
• Complete negotiations up front for replacement hardware, with commitments on days to ship and configure
• Arrange, with the vendor, colocation sites with adequate network bandwidth and dual network pathways
• Have reciprocal agreements or consortium arrangements
An organization that has a data center should also have internal plans for colocation sites, which may be hot sites, warm sites, or cold sites
• Ensure that there is adequate network bandwidth
Be disciplined with postevent review and complete revisions of processes and policy as appropriate
• Have formal process for the review, including documentation
• Complete and document event review as near to the event as possible
Data from Hoong LL, Marthandan G. Factors influencing the success of the disaster recovery planning process: a conceptual paper. Research and Innovation in Information Systems (ICRIIS), 2011 International Conference. 2011;1-6, 23-24. doi:10.1109/ICRIIS.2011.6125683; and Federal Emergency Management Agency (FEMA). IT disaster recovery plan. FEMA. http://www.ready.gov/business/implementation/IT; 2012.
Preparedness and planning are the keys to disaster recovery following either a simple incident or a catastrophic event. In fact, the process of planning can be as beneficial to an organization as the final written plan. Recovery should include all components identified as crucial: network, servers, connectivity, data, telecommunications, hardware, software, desktops, security, wireless, and any other specific items.
The goal of disaster recovery is to recover the business fully and completely. Depending on the severity of the event or disaster, it may be necessary to do an incremental recovery. Key administrative leaders, with input from the staff, should be involved in the decisions about the sequence of recovery of systems or applications. All employees in an organization will likely have changed workflows during the disaster and it is important that they understand their roles during the disaster or downtime and during the recovery period. The steps to actual recovery will be different for each event and for each organization. Because of this complexity and the time involved to develop a comprehensive disaster recovery plan, an organization may choose to hire outside consultants instead of using internal resources.16,17
Business continuity management is a complementary process to disaster recovery. Business continuity has a larger scope than recovering only IT systems. It also includes determining which administrative and healthcare services must be available using a defined timeline and identifies which systems can be excluded from initial recovery. Business continuity management outlines the functions, processes, and systems needed to allow the core business of providing health services to continue.
A tier system works well for this purpose and each organization will have unique requirements. For example, Tier I applications would be identified as critical and they are recovered first. The organization defines the expected time to recovery based on the requirement for service and available resources. As a general rule, the faster the recovery must occur, the more expensive the recovery process will be. The cost should also be evaluated in comparison with the cost of the downtime. For a Tier I application to be recovered in 24 hours or less, it is likely that a hot site would be required with hardware standing by. Tier II applications would come next and may be identified as needing to be available within 72 hours. Finally, Tier III and Tier IV may be identified as requiring recovery within 1 week and 1 month, respectively. For healthcare, business continuity includes sustaining both patient care and the revenue cycle. Defining business continuity should be a formal process that includes the following:
•A business impact analysis that takes into consideration the institution’s business needs and the needs of the community for healthcare services
•Definition of recovery strategies
•Development of a formal plan
•Planned exercises to test the plan
As with other elements discussed above, this process will need resources (both human and financial) from the organization’s senior leadership.18,19
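The tier example above can be expressed as a small lookup that also yields a recovery sequence. This is a sketch only: the targets are the illustrative figures from the text (with "1 month" approximated as 30 days), and the application names and tier assignments in the usage example are hypothetical, since each organization defines its own.

```python
from datetime import timedelta

# Recovery targets from the example in the text; each organization sets its own.
TIER_TARGETS = {
    1: timedelta(hours=24),
    2: timedelta(hours=72),
    3: timedelta(weeks=1),
    4: timedelta(days=30),   # "1 month", approximated here as 30 days
}


def recovery_sequence(applications):
    """Order applications by tier, then by name, so Tier I is recovered first.

    `applications` maps an application name to its tier number (1-4).
    Returns a list of (name, tier, target) tuples.
    """
    ordered = sorted(applications.items(), key=lambda item: (item[1], item[0]))
    return [(name, tier, TIER_TARGETS[tier]) for name, tier in ordered]


if __name__ == "__main__":
    # Hypothetical assignments for illustration only.
    apps = {"payroll": 3, "ehr": 1, "laboratory": 2, "archive": 4, "registration": 1}
    for name, tier, target in recovery_sequence(apps):
        print(f"Tier {tier}: {name} (recover within {target})")
```

Keeping the assignments in one reviewable table makes it easier for administrative leaders to revisit the sequence as part of the business impact analysis.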
Downtime boxes were mentioned earlier in this chapter. Every business unit needs to have a downtime box that includes items such as registration forms, charge sheets, fax forms, and other commonly used forms for that business area. In addition, it is a good idea to have these documents stored on a portable media device to be kept in the downtime box. In the event of a disaster, these forms can be stored on another portable device and kept in a secure location. Organizations should make specific assignments to ensure that these are kept up to date.
Communication is an integral part of any downtime. A communication plan should determine the following five items:
•Who needs to know the details
•What details are needed
•What media or modes of communication will be used
•Who will communicate what information
•The systems or workflow processes that are affected
Of course, the more complex the downtime is, the more people need to be notified and the more information needs to be communicated. For example, if the bedside monitoring device is not transmitting data to the EHR, only the ICU staff need to be notified. If the EHR database becomes corrupt, then all clinicians who use the system will need to be notified as well as all IT teams and possibly hospital administration and the risk management department.
Fahrenholz et al.9 compiled a downtime communication template useful to readers. Their questions are as follows:
•What system will be down?
•When will the downtime begin?
•How long will the system be unavailable?
•Why will the system be down?
•What changes are being made to the system?
•Who will be affected and what can the end-user expect?
•What procedures should be followed during the downtime?
These guidelines can be adapted for any facility’s use during an unexpected downtime.
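As one possible way to adapt the template, the seven questions can be captured as fields of a structured notice so that no question is skipped when a message is drafted. The class and field names below are illustrative assumptions, not part of the published template.

```python
from dataclasses import dataclass


@dataclass
class DowntimeNotice:
    """One field per question in the template above; field names are illustrative."""
    system: str
    begins: str
    duration: str
    reason: str
    changes: str
    affected: str
    procedures: str

    def render(self) -> str:
        """Format the notice as plain text, one line per question."""
        lines = [
            f"System: {self.system}",
            f"Begins: {self.begins}",
            f"Expected duration: {self.duration}",
            f"Reason: {self.reason}",
            f"Changes being made: {self.changes}",
            f"Who is affected and what to expect: {self.affected}",
            f"Procedures during downtime: {self.procedures}",
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    notice = DowntimeNotice(
        system="EHR",
        begins="Saturday 0200",
        duration="4 hours",
        reason="Scheduled upgrade",
        changes="New software version",
        affected="All clinicians; read-only backup access only",
        procedures="Use the forms in the unit downtime box",
    )
    print(notice.render())
```

Because every field is required, an incomplete notice fails when it is created rather than after it has been sent. For an unexpected downtime, fields such as the duration can be filled with "unknown" and updated as the incident response team learns more.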
If the hospital or facility uses a tool for IT service management such as ITIL, the procedures discussed here will be used. If the facility does not use one of these systems, policies and procedures are needed to ensure that communication occurs properly. Communication occurs most predictably and reliably when the responsibility belongs to one consistent team or group of people. A service management or equivalent team works well to manage the communications. Whoever is designated as the primary communication team must work very closely with the incident response team (IRT) and the help desk. Communications are coordinated and the help desk is kept informed of the event and of the information it should supply to end-users as inquiries are made about the event. The help desk is critical to communications. In most cases, when the staff are experiencing technical problems they will contact the help desk first. The help desk staff are on-site and on-call agents available 24 hours a day. Training the help desk staff to manage these communications allows the infrastructure and application teams to work on resolving the problems. Some tools that might be used in addition to managing the trouble ticket queue are continuous or intermittent conference calls, individual and group paging for the IT department, individual and group instant messaging, webcasts, updated web pages, group email updates, recorded phone messages, and coordination with hospital operators. Multiple means of communication need to be considered when planning for an event. The technical problems causing the downtime or the disaster event may also eliminate certain communication modes. For example, if the network is down, an email with information about managing the downtime cannot be sent to the clinical units.
The hospital telecommunications operators can manage many aspects of the communication plan, including individual and group pagers, cellphones, tablets, and other communication devices for clinicians and the operational areas. Sending information to these communication devices may help to manage information distribution during sudden or extended downtimes. A best practice in the age of Internet-based phone systems is to have some analog phones available in key hospital areas because they function during network and electrical downtimes. These phones can be identified by using a different color of phone, such as red. Hospital telecommunications operators can also use the overhead paging system in the hospital to distribute information. The point is to be sure to include these hospital operators in the downtime communication plans.
In the event of a major disaster, satellite radios and phones can be used. Satellite phones and radios are network independent but require electricity to recharge their batteries. The local emergency management office in the organization will have more information about these capabilities.
The IT staff is responsible for communicating the necessary information to the help desk agents. In addition, some electronic systems contain notification alert capability. For example, planned downtimes can be communicated using the notification system in the facility’s EHR. Obviously this method would not be available during an EHR downtime, but it can be used to announce a planned downtime or when any of the ancillary systems are off-line.
IT leaders are responsible for communicating with the organization’s senior leadership and the public relations department so they can manage media relations with the community. Social media applications can also be used to manage information with the media and to distribute information to staff in the event of a downtime (assuming that staff members have subscribed to the service and the service can provide the appropriate level of security and privacy).
Other mechanisms may be in place depending on the institution and the setting. For example, if the institution is affiliated with a university, a “campus alert” system may be available. Using this system, notifications can be sent via email, cellphone, work phone, home phone, or a combination of these. This communication strategy can be very helpful during disaster drills as well as unexpected events.
There are numerous ways to communicate with hospital employees and leadership, IT staff, news media, and the public. Finding the right combination for the facility’s budget and staff and formalizing the ownership of specific communication will facilitate the workflow transitions during EHR downtimes at the facility.
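Because the event itself may eliminate certain communication modes, it can help to record what each mode depends on. The sketch below is a simplified illustration. The dependencies for email, EHR notifications, analog phones, and satellite phones follow the discussion above; the entries for Internet-based phones and overhead paging are assumptions, and all names are hypothetical.

```python
# Dependencies per channel. Email, EHR alerts, analog phones, and satellite
# phones follow the text; the remaining entries are assumptions for illustration.
CHANNEL_DEPENDENCIES = {
    "email": {"network"},
    "ehr_notification": {"network", "ehr"},
    "ip_phone": {"network", "power"},
    "analog_phone": set(),
    "satellite_phone": set(),   # network independent; batteries need recharging
    "overhead_paging": {"power"},
}


def usable_channels(failed):
    """Return the channels whose dependencies are all still working."""
    failed = set(failed)
    return sorted(
        name for name, needs in CHANNEL_DEPENDENCIES.items()
        if not (needs & failed)
    )


if __name__ == "__main__":
    print(usable_channels({"network"}))
    # ['analog_phone', 'overhead_paging', 'satellite_phone']
```

A table like this, reviewed during downtime drills, shows at a glance which modes remain for a given failure and where the plan has no fallback.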
Conclusion and Future Directions
This chapter identifies tactics for health system downtime planning and disaster recovery. It challenges clinicians and informaticists to assess, plan for, respond to, recover from, communicate about, continue business during, and prevent downtimes and disasters when possible. The primary objective for downtime and disaster planning is to protect the organization and the patients who are served by that organization by minimizing disruption to the operations. This includes minimizing economic loss; ensuring organizational stability; protecting critical assets of the organization; ensuring safety for personnel, patients, and other customers; reducing variability in decision making during a disaster; and hopefully minimizing legal liability.5 In healthcare and health information technology, the single most important reason to carry out the activities described in the chapter carefully and methodically is the ability to provide uninterrupted, exceptional service and safe care to all patients.
Going forward, the potential impact of downtimes and disasters will continue to grow as healthcare entities become more dependent on technology and as individual healthcare institutions continue to become part of a larger network. This should drive administrators to invest additional human and material resources in assessing and planning to minimize the impact of potential threats. Advances in technology can be expected to offer better solutions than currently exist. These solutions may be less costly as new and improved technology eventually reduces the potential for downtimes. Additional research is very much needed to help clinicians and health systems understand the experience of downtime workflow interruptions, the patient safety implications, and the operational impacts. In addition, research to drive a health system standard approach to downtime planning, disaster recovery, and business continuity efforts should be initiated. There are many focus areas along the continuum of disaster planning to business continuity where research could have a very positive impact for the health system and its clients.
References
1. Polaneczky, M: When the electronic medical record goes down. 2007, The Blog that Ate Manhattan, http://www.tbtam.com/2007/03/when-the-electronic-medical-record-goes-down.html.
2. Getz, L: Dealing with downtime: how to survive if your EHR system fails. For the Record. 21(21), 2009, 16, http://www.fortherecordmag.com/archives/110909p16.shtml.
3. Capital One: Business continuity and disaster recovery checklist for small business owners. 2011, Continuity Central, http://www.continuitycentral.com/feature0501.htm.
4. Anderson, M: The costs and implications of EMR system downtime on physician practices. 2011, XML Journal, http://xml.sys-con.com/node/1900855.
5. Wold, GH: Disaster recovery planning process. 2011, Disaster Recovery Journal, http://www.drj.com/new2dr/w2_002.htm.
6. Ash, JS, Sittig, DF, Poon, EG, Guappone, K, Campbell, E, Dykstra, RH: The extent and importance of unintended consequences related to computerized provider order entry. J Am Med Inform Assn. 14(4), 2007, 415–423.
7. Nelson, N: Downtime procedures for a clinical information system: a critical issue. J Crit Care. 22, 2007, 45–50.
8. Zodum, MA: Practice management and electronic medical record implementation best practices: guide to successful implementation for primary care and federally qualified health centers. 2008, Eastern Shore Rural Health System, Inc, http://www.himss.org/content/files/Code%2067%20Practice%20Mgmt%20And%20EMR%20Best%20Practices_2008.pdf, November.
9. Fahrenholz, CG, Smith, LJ, Tucker, K, Warner, D: A practical approach to downtime planning in medical practices. 2009, American Health Information Management Association, http://library.ahima.org/xpedio/groups/public/documents/ahima/bok1_045486.hcsp?dDocN.
10. The University of Kentucky: Department of Pharmacy policy: computer down time procedures (unplanned). 2010, The University of Kentucky, http://www.hosp.uky.edu/policy/pharmacy/departpolicy/PH18-04.pdf.
11. Vaughn, S: Planning for system downtimes. Comput Nurs. 29(4), 2011, 201–203.
12. Arraj, V: ITIL: the basics. Best Management Practice. 2010, White Paper www.best-management-practice.com/gempdf/itil_the_basics.pdf.
13. Whitten, J, Bentley, L: System Analysis and Design Methods. 7th ed, 2007, McGraw-Hill Higher Education, New York, NY.
14. American Nurses Association: The nursing process. Nursing World. 2012, http://nursingworld.org/EspeciallyForYou/What-is-Nursing/Tools-You-Need/Thenursingprocess.html.
15. Hoerbst, A, Hackl, WO, Blomer, R, Ammenwerth, E: The status of IT service management in health care: ITIL in selected European countries. BMC Med Inform Decis Mak. 11, 2011, 76.
16. Hoong, LL, Marthandan, G: Factors influencing the success of the disaster recovery planning process: a conceptual paper. Research and Innovation in Information Systems (ICRIIS), 2011 International Conference. 2011, 1–6, 23-24. doi:10.1109/ICRIIS.2011.6125683.
17. Federal Emergency Management Agency (FEMA): IT disaster recovery plan. 2012, FEMA, http://www.ready.gov/business/implementation/IT.
18. Federal Emergency Management Agency (FEMA): Business continuity plan. 2012, FEMA, http://ready.gov/business/implementation/continuity.
19. Nickolette, C: Business continuity planning description and framework. 2001, Comprehensive Consulting Solutions, Inc, http://www.comp-soln.com/BCP_whitepaper.pdf.
Discussion Questions
1. Explain the importance of an organization-specific downtime risk assessment.
2. Describe the pros and cons of different assessment tools for evaluating downtime events and discuss scenarios in which they might be used to their best advantage.
3. Compare and contrast the roles of the informaticist, the clinician, and IT personnel in system downtime planning.
4. Describe key components of a business continuity plan and (a) how they might differ for different types of organizations and (b) how they might differ depending on EHR maturity level.
5. Contrast different communication methods for system downtime events and summarize the pros and cons of each.
Case Study
At your Level 2 trauma center an unplanned EHR downtime occurs at 1700 on a Tuesday. After 1 hour of troubleshooting and working with the vendor’s help desk, the IT team attempts a system reboot, which is unsuccessful. The vendor is in a different time zone so specialists have to be called in from home to respond to this incident. The initial assessment is that the downtime is due to database corruption and the system will have to be recovered from backup systems. Unfortunately the system is not configured with high availability techniques nor is it redundant. The IT department estimates that it will take 8 hours to recover the system for a total downtime of 10 hours.