Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant
Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind some of this content will be out of date.
Coping with Disasters
11/15/2001
Now that we've all witnessed a disaster that beggars the imagination, preventing disaster seems not only an appropriate topic but an imperative one. As Mike Handy, acting director of information technology services for the Library of Congress, said after September 11, "Until recently, our planning efforts have assumed the most significant threats to be from accidental disruptions such as natural calamity, fire, power failure, etc. Obviously, now our assumptions include previously unimaginable possibilities."
Although clearly the most unthinkable disaster would involve the loss of life or injury to library staff, that type of disaster planning is outside the scope of this column. Rather, I will consider what can be done to protect your digital library services and collections from the many disasters--whether they be outrageous or minor--that may befall those who do not prepare. Through planning and preparation, we can help prevent disasters from happening or minimize the damage once they do.
Prevention
"An ounce of prevention is worth a pound of cure" remains a valuable aphorism for disaster prevention. Everything that you can reasonably do to avoid or lessen the impact of disasters by planning ahead of time will be well worth your time, effort, and resources.
For digital systems, the classic prevention technique is an effective protection system. Effective computer protection systems are constructed in layers. The first layer is the disk itself--or, more accurately, the way in which data are stored on the disk. The most secure way to store data on hard disks is by using RAID technology. RAID, an acronym for Redundant Array of Inexpensive (or Independent) Disks, specifies various methods of storing data that are optimized for different requirements.
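To make the idea of redundant storage concrete, here is a minimal sketch in Python of the XOR parity scheme that some RAID levels rely on. It is a toy illustration, not how a RAID controller is actually implemented: the three "disks" are just byte strings, and the helper name xor_blocks is hypothetical. Real arrays work on fixed-size blocks at the device level and, in some levels, rotate the parity across all the drives.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of several equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical "disks", each holding one 8-byte block of data.
data_disks = [b"LIBRARY ", b"CATALOG ", b"RECORDS "]

# A fourth disk stores the parity block computed from the other three.
parity = xor_blocks(data_disks)

# Simulate losing the second disk, then rebuild its block by XOR-ing
# the surviving data blocks with the parity block.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # b'CATALOG '
```

Because XOR-ing a value with itself cancels it out, any one missing block can be recomputed from the others. If two blocks are lost at once, a single parity block is no longer enough to recover them.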
Which level you choose depends on your priorities. If you want fast response times and can afford to devote half of your disk space to redundancy, you may choose RAID level 1, which is simply mirroring, or creating another complete copy of the data on a second disk. If, on the other hand, making efficient use of disk space is more important than write speed, then a parity-based level of RAID may be selected. For example, level 5 distributes both the data and the parity information required to recover it across several physical disks, which protects you from the failure of any single drive.
Layers of protection
The second layer of protection includes such strategies as uninterruptible power supplies (which can prevent disk drive damage in power failures), fire extinguisher systems, alarm systems, and other methods for securing the computer disks or the room where they are kept.
The third layer entails making copies of the data--backing it up. The typical computer backup system copies the data that you wish to retain to another disk, tape, or other digital medium. This backup can be incremental (only changed files are backed up) or complete. For better protection, the second copy should be stored at a location distant from the first (and I mean really distant--the farther the better, considering the impact of disasters such as earthquakes and hurricanes).
A common technique for locating data close to where it is needed can also serve as a default backup system. Called "mirroring," this technique was developed primarily in response to slow or costly Internet connections. For example, those in Australia must pay a per-byte charge for overseas Internet traffic. Therefore, it's helpful for them to copy, or mirror, popular sites locally. Not only can this serve as a default backup, but it can also be essential in emergencies, when users can be shunted from the main site to the mirror location.
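Both an incremental backup and a mirror come down to the same basic operation: copy whatever is new or has changed since the last run. The following is a minimal sketch of that idea in Python, using only the standard library. The function name incremental_backup and the directory names in the demonstration are hypothetical, and the test for "changed" (a different size or a newer modification time) is an assumption made for simplicity. Production backup tools also deal with deleted files, permissions, and verification, none of which is attempted here.

```python
import shutil
from pathlib import Path


def incremental_backup(source, destination):
    """Copy files from source that are new or changed since the last run.

    A file is treated as changed when the backup copy is missing, has a
    different size, or is older than the source file. Returns the list of
    relative paths that were copied.
    """
    copied = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue
        relative = src_file.relative_to(source)
        dst_file = destination / relative
        src_stat = src_file.stat()
        if dst_file.exists():
            dst_stat = dst_file.stat()
            unchanged = (dst_stat.st_size == src_stat.st_size
                         and dst_stat.st_mtime >= src_stat.st_mtime)
            if unchanged:
                continue
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)  # copy2 also preserves the modification time
        copied.append(relative)
    return copied


if __name__ == "__main__":
    import tempfile

    # Demonstration in a throwaway directory; real paths would differ.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "collection"
        backup = Path(tmp) / "backup"
        source.mkdir()
        (source / "a.txt").write_text("first record")
        (source / "b.txt").write_text("second record")

        print(incremental_backup(source, backup))  # first run copies everything
        (source / "b.txt").write_text("second record, revised")
        print(incremental_backup(source, backup))  # second run copies only b.txt
```

The first run against an empty destination amounts to a complete backup; every later run is incremental. Pointing the destination at a disk in another building, or another city, is what turns this from a convenience into protection.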
What might be considered the logical endpoint of mirroring is represented by a preservation scheme advanced by Stanford University. Called Lots of Copies Keep Stuff Safe (LOCKSS), the strategy employs a large pool of interconnected and physically distant computers that constantly share copies of each computer's data. If any single computer crashes, the data it contained can be recovered from other computers still online. The system is designed to use standard-issue PCs, even those that would be too underpowered to run standard office applications (a typical LOCKSS installation would only require something like a PC with a 100MHz Pentium chip, 32MB of RAM, and one or two large disks).
Whether LOCKSS is used or not, since hard disk storage is so cheap (you can find disk drives for as low as $3/GB now, and prices continue to drop), there is no good reason not to create multiple redundant copies of critical data. This can (and should) be as simple as setting up a script to copy all of your data to additional hard drives each night. If those drives are physically distant--which the Internet enables easily--it is even better.
Emergency response & recovery
Good preparation includes knowing what you will do in the middle of an emergency. One of the quickest and easiest ways to cope with an emergency is to route users to a mirror (see above). If, for example, www.xxx.org goes down, that domain name can quickly be assigned to the host computer that has a mirror. Once this change propagates through the Internet's Domain Name System (which can take from a few hours to a few days), users will be none the wiser that they are going to a different physical location, since the domain name remains unchanged.
If you're not lucky enough to have a mirror, you will need to do something else. What you do will depend on how essential your operation is to those who matter.
If your data are important but not essential, then hang tight until the emergency passes and you can move on to the "recovery" stage. If your systems must be constantly responsive, then, one hopes, you will have determined ahead of time how you will cope. Again, planning is everything.
Once the emergency has passed, you should know what steps must be taken to get everything back up and functioning. Specifically, you should know in advance how to install new hardware and software, retrieve data from a backup system, and get everything back online. Here you will discover just how well (or poorly) you have prepared. Those who have planned well will find this process to be quick and smooth, while those who haven't will find it time-consuming and difficult.
Run with the big boys
If you have data you can't afford to lose, you can't afford to be without a disaster plan. The plan should include aspects dealing with prevention, emergencies, and recovery. Luckily, there is little to prevent small libraries from having a disaster plan similar to that of the Library of Congress, which uses many of the techniques outlined here. If you don't know where to begin, start with the Federal Emergency Management Agency's Emergency Management Guide for Business & Industry, which will guide you through the process of making a plan that will get you through just about anything--except perhaps the unimaginable.
LINK LIST
Disaster Recovery Journal's Glossary www.drj.com/glossary/glossary.htm
Emergency Management Guide for Business & Industry www.fema.gov/library/bizindex.htm
Keeping Memory Alive: Practices for Preserving Content at the National Digital Library Program of the Library of Congress www.rlg.org/preserv/diginews/diginews4-3.html
LOCKSS lockss.stanford.edu
Public Library Association Tech Note: Digital Disaster Planning www.pla.org/technotes/disaster.html