Lessons learned in business continuity planning
Overlooking or underestimating issues in business continuity planning can add insult to injury when disaster finally occurs.
Many government IT managers today are being tasked with developing and testing business continuity plans for payroll, email, financial and other key administrative systems. In 20 years of developing such plans, we’ve seen areas that many IT managers overlook or underestimate, and those gaps add insult to injury when disaster finally occurs. Some of these areas include:
Missing digital IDs. Many users need digital IDs to access external “secure” systems. Typically, these IDs are tied to a specific user’s workstation but are not automatically saved when the workstation is backed up. After a disaster, those users are unable to access the secure application from their “restored” workstations. Mitigating this issue requires IT to work with the remote secure site beforehand, either to determine an alternate means of storing these IDs or to obtain additional IDs to install on backup workstations.
Incomplete backups. Backups often miss files. One client discovered that its backup program had been upgraded to treat file names as case sensitive when it never had before. As a result, crucial files were no longer being backed up, and no one noticed until the organization tried to recover an entire application during testing.
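A simple way to catch this class of gap before a disaster forces the issue is to compare a listing of the live files against the backup tool’s manifest. The sketch below is illustrative only: the manifest format (one relative path per line) and the directory names are assumptions, not features of any particular backup product.

```python
from pathlib import Path

def files_under(root: str) -> set[str]:
    """Collect every file path under root, relative to root."""
    base = Path(root)
    return {str(p.relative_to(base)) for p in base.rglob("*") if p.is_file()}

def check_backup_completeness(source_root: str, manifest_path: str) -> None:
    """Compare live files against a backup manifest (one path per line).

    Flags files missing outright, and files that match only when case is
    ignored -- the symptom of a backup tool that became case sensitive.
    """
    live = files_under(source_root)
    with open(manifest_path) as f:
        backed_up = {line.strip() for line in f if line.strip()}

    missing = live - backed_up
    backed_up_lower = {p.lower() for p in backed_up}
    case_only = {p for p in missing if p.lower() in backed_up_lower}

    for path in sorted(case_only):
        print(f"CASE MISMATCH (backed up under a different case): {path}")
    for path in sorted(missing - case_only):
        print(f"MISSING FROM BACKUP: {path}")

if __name__ == "__main__":
    # Hypothetical paths for illustration.
    check_backup_completeness("/data/payroll", "/backups/payroll_manifest.txt")
```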
Too few remote user licenses. Agencies that use remote-desktop-access software to let users work offsite probably have licenses that allow only a limited number of simultaneous connections, because that is typically all that is required. During a disaster, however, more employees may need to connect than the license allows. IT managers should work out a plan with the software vendor beforehand for an emergency increase in the license limit.
Backups that are too hard to validate. After a restore, users must validate applications for correctness and completeness, typically by comparing before-and-after log reports or other global statistics. Often, however, the modules that produce such logs were not purchased with the application. This is frequently the case even with software library applications that store and manage in-house-developed programs. Without that capability, the data files must be laboriously checked transaction by transaction to validate the restoration.
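When the application lacks those reporting modules, a lighter-weight alternative is to capture a few global statistics, such as row counts and control totals, at backup time and recompute them after the restore. A minimal sketch, assuming the data can be exported to CSV and using a hypothetical amount column name:

```python
import csv
import json

def snapshot_stats(csv_path: str, amount_column: str) -> dict:
    """Compute simple global statistics: row count and a control total."""
    count, total = 0, 0.0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            count += 1
            total += float(row[amount_column])
    return {"rows": count, "control_total": round(total, 2)}

def save_stats(csv_path: str, amount_column: str, out_path: str) -> None:
    """Run at backup time; store the statistics alongside the backup."""
    with open(out_path, "w") as f:
        json.dump(snapshot_stats(csv_path, amount_column), f)

def verify_restore(csv_path: str, amount_column: str, saved_path: str) -> bool:
    """Run after a restore; compare against the saved statistics."""
    with open(saved_path) as f:
        expected = json.load(f)
    actual = snapshot_stats(csv_path, amount_column)
    if actual != expected:
        print(f"MISMATCH: expected {expected}, got {actual}")
        return False
    print("Restore matches backup-time statistics.")
    return True

if __name__ == "__main__":
    # Hypothetical file and column names for illustration.
    save_stats("payroll_detail.csv", "gross_pay", "payroll_detail.stats.json")
    verify_restore("payroll_detail.csv", "gross_pay", "payroll_detail.stats.json")
```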
Seasonality. Some functions and applications are crucial only at certain points in the year. IT managers must document this seasonality and adjust planning so that response priorities are appropriate to the time of the year when the disaster occurs.
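One way to document that seasonality is a simple, machine-readable schedule mapping each month to the applications that must come back first. The months and application names below are hypothetical placeholders, not recommendations.

```python
from datetime import date

# Hypothetical seasonal priorities: month -> applications that must be
# restored first if a disaster strikes in that month.
SEASONAL_PRIORITIES = {
    1: ["payroll", "W-2 reporting"],
    4: ["property tax billing"],
    7: ["fiscal year-end close", "budget preparation"],
    10: ["benefits open enrollment"],
}

# Applications that matter year-round.
BASELINE = ["payroll", "email", "general ledger"]

def recovery_priorities(on: date) -> list[str]:
    """Return the restore order for a disaster occurring on a given date."""
    seasonal = SEASONAL_PRIORITIES.get(on.month, [])
    # Seasonal items first, then the year-round baseline, without duplicates.
    return seasonal + [app for app in BASELINE if app not in seasonal]

if __name__ == "__main__":
    print(recovery_priorities(date(2024, 1, 15)))   # January: W-2 season first
    print(recovery_priorities(date(2024, 6, 15)))   # June: baseline only
```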
Weak notification systems. Many IT managers plan to rely on email to communicate with employees and stakeholders during a disaster. Typically, however, email will not be available. Because third-party notification services have become much less expensive in recent years, IT managers should evaluate them as a communications alternative.
Electronics vulnerable to "falling water." Telecommunications lines usually terminate at open wall-mounted panels in the basement. These will burn out when water drips down on them from upstairs. Shielding or enclosing these termination panels can prevent the loss of telecommunications caused by bad weather or even a restroom overflow.
Overly optimistic testing plans. Many recovery test plans aim for a full transfer of operations to the backup site on the very first testing attempt. When this fails, it is often very difficult to untangle what actually went wrong because too many components were in flux at the same time.
Rather than trying to start with a full system recovery test, we’ve found it’s best to test by components. Typically, testing should start with remote connectivity. This usually doesn’t work for everyone on the first try because of errors in authorization setup, incorrect software on some users’ workstations, or simple lack of training.
After system recovery is tested, the IT department must validate the restored data to ensure that what had been backed up was correctly restored. Then selected users can validate that the original backups themselves were correct and complete.
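For the IT-side check, recording checksums at backup time and recomputing them against the restored files is usually enough to prove that what was backed up is what came back, before users invest time in content validation. A minimal sketch with hypothetical paths:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large data files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def record_checksums(root: str, out_path: str) -> None:
    """Run at backup time; write 'checksum  relative-path' lines."""
    base = Path(root)
    with open(out_path, "w") as out:
        for p in sorted(base.rglob("*")):
            if p.is_file():
                out.write(f"{sha256_of(p)}  {p.relative_to(base)}\n")

def verify_restored(root: str, checksum_file: str) -> bool:
    """Run after the restore; report any file that changed or disappeared."""
    base, ok = Path(root), True
    with open(checksum_file) as f:
        for line in f:
            expected, rel = line.rstrip("\n").split("  ", 1)
            target = base / rel
            if not target.is_file():
                print(f"MISSING AFTER RESTORE: {rel}")
                ok = False
            elif sha256_of(target) != expected:
                print(f"CONTENT CHANGED: {rel}")
                ok = False
    return ok

if __name__ == "__main__":
    # Hypothetical paths for illustration.
    record_checksums("/data/finance", "/backups/finance.sha256")
    verify_restored("/restore/finance", "/backups/finance.sha256")
```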
Not all user departments should be involved in the first round of this user validation. Instead, select one encapsulated, high-value, easily validated application, such as payroll, for initial testing. Following that, test functions that “walk across” many files, such as those that produce a trial balance from the general ledger accounting subsystem. After that, other applications can be tested based on the priority and availability of the various other user departments.
Over the years, three key lessons for business continuity planning have become clear:
- Don’t assume something works correctly just because it works.
- Keep tests simple and incremental, building on what has already been proven.
- Strategize with the users to better understand exactly what, when and how much will be demanded of the IT system during the chaotic time following a disaster.