Backup Failure Costs State a Bundle
By Paul Shread
March 23, 2007
A perfect storm of errors caused the Alaska Department of Revenue's data backup procedures to fail, wiping out data on a $38 billion account and costing the state more than $200,000 to restore the data.
The mistakes apparently began in a Dell/EMC storage array and were compounded by improperly backed up data tapes. The mishap occurred last summer, but the story gained attention this week when the Associated Press reported it.
The good news is the department had a paper backup: 800,000 documents that had been scanned into the system over many months. The AP reported that it took 70 people working overtime and weekends two months to reenter the lost data at a cost of $220,700, including $128,400 in overtime and $71,800 for computer consultants.
State officials put out a detailed statement late today describing the data loss that hit the Permanent Fund Dividend (PFD) Division. The statement was attributed to Norm Snyder, IT Manager of the Department of Revenue. The state blamed human failure for the data loss and said there were no hardware or software failures.
The PFD Division uses a pair of clustered Dell servers running Microsoft SQL 2005 connected to Dell/EMC storage arrays with about 3TB of usable space. The PFD database consists of approximately 1.5TB of data and images, with an additional 300-500MB of data and images added each year.
The majority of the PFD database consists of paper documents that have been scanned and added to the database as digital images. Since the space requirement for images is quite large, the state said, the image portion of the database consists of a partitioned table, which is divided into file groups, with each year a separate partition in its own file group.
The active year is a read/write file group, and once closed out, it is marked as read-only. Then the entire database is backed up using filegroup (partial) backups, followed by a transaction log backup. These database backup files are then backed up to tape and safely stored at several locations. Partitioned tables are a new feature of SQL 2005, the state said, making it much easier to back up large databases. Since the data in read-only file groups does not change, it does not have to be backed up as frequently as the information kept in the active read/write file groups.
However, that flexibility adds another layer of complexity to file management. As new active read/write file groups are added to a database, they must be backed up at the same time as the other active file groups, along with transaction logs, in order to be able to completely restore the database and bring it online.
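The dependency described above can be sketched as a simple completeness check. This is a minimal illustration in Python, not SQL Server's actual restore logic: the `FileGroup` class, the `missing_from_backup` function, and the filegroup names are all hypothetical, and the rule that the primary filegroup is always required is a simplifying assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileGroup:
    name: str
    read_only: bool


def missing_from_backup(filegroups, backed_up, log_backed_up):
    """Return the items a backup set still lacks before a restore could work."""
    missing = []
    for fg in filegroups:
        # Assumption: PRIMARY and every read/write filegroup must be
        # captured together in the same backup set.
        required = fg.name == "PRIMARY" or not fg.read_only
        if required and fg.name not in backed_up:
            missing.append(fg.name)
    if not log_backed_up:
        missing.append("transaction log")
    return missing


# Hypothetical layout: one closed-out year and one active year.
filegroups = [
    FileGroup("PRIMARY", read_only=False),
    FileGroup("FG2005", read_only=True),
    FileGroup("FG2006", read_only=False),
]

# The active year and the log are covered, but PRIMARY is not.
print(missing_from_backup(filegroups, {"FG2006"}, log_backed_up=True))  # ['PRIMARY']
```

An empty list means the set covers everything this simplified model requires. Read-only filegroups are deliberately left out of the check because they can be backed up on a slower cycle.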
Problems Begin to Appear
Last June, hard errors were reported by one of the disk drive storage array processors. The state's Network Specialist opened a service call with the hardware vendor and was advised to run a Background Verify as the last option to repair inconsistencies in the RAID stripe set. That failed to correct the problem.
The final option was to unbind and then rebind the two LUNs that were corrupted. First, the specialist moved the database filegroups stored on the corrupted LUNs to another LUN in preparation for the unbind and rebind (unbinding and rebinding destroys all data on the LUN).
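Because unbinding and rebinding destroys everything on a LUN, a pre-flight check that refuses to proceed while any file remains on the target volumes is one possible safeguard. The Python sketch below is illustrative only: the function names, the LUN numbers, and the idea of keeping an explicit LUN-to-mount-point table are assumptions, not part of the state's procedure or of any Dell/EMC tool.

```python
import os
import tempfile


def files_on_volumes(lun_to_mount, luns_to_rebind):
    """List every file found on the volumes backing the LUNs to be rebound."""
    found = []
    for lun in luns_to_rebind:
        mount = lun_to_mount[lun]  # KeyError if the mapping is incomplete
        if not os.path.isdir(mount):
            # A bad mapping must not be mistaken for an empty volume.
            raise ValueError(f"LUN {lun} maps to missing path {mount!r}")
        for root, _dirs, files in os.walk(mount):
            found.extend(os.path.join(root, name) for name in files)
    return sorted(found)


def safe_to_rebind(lun_to_mount, luns_to_rebind):
    """True only when the target volumes contain no files at all."""
    return not files_on_volumes(lun_to_mount, luns_to_rebind)


# Demo: temporary directories stand in for two mounted volumes.
with tempfile.TemporaryDirectory() as vol_e, tempfile.TemporaryDirectory() as vol_f:
    lun_to_mount = {4: vol_e, 5: vol_f}  # hypothetical LUN numbers
    open(os.path.join(vol_f, "pfd_2006.ndf"), "w").close()  # a stray data file
    print(safe_to_rebind(lun_to_mount, [4]))     # True: volume is empty
    print(safe_to_rebind(lun_to_mount, [4, 5]))  # False: LUN 5 still holds a file
```

Raising an error on an unknown or missing mount point is a deliberate choice here: an incomplete mapping should stop the operation rather than look like an empty volume.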
"Unfortunately, it is difficult to correlate which LUN number corresponds to which drive letter designation on the server, and some of the files were mistakenly moved to the LUNs which were to be rebound," the state said. The specialist was working on a remote desktop session with a Dell storage specialist at the time of the unbind and rebind, and the Dell specialist also missed the error. After the rebind, it was discovered that one of the rebound LUNs had contained data files and also the SQL backup files.
The Network Specialist then attempted to restore the database backups from tape, but one critical file, the primary filegroup (MDF) for the PFD Database, had inadvertently not been selected to be written to tape during the normal backup process. Without a current backup of the primary .MDF file, the database could not be restored and brought online.
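An omission like this can be caught by comparing the files a restore needs against the tape job's selection list. The following Python sketch assumes both lists are available as plain path strings. The function name and every path shown are hypothetical, and real tape backup software would expose its selection list in its own way.

```python
def unselected_files(required_files, tape_selection):
    """Return required files that are absent from the tape job's selection."""
    # Windows paths are case-insensitive, so compare in lower case.
    selected = {path.lower() for path in tape_selection}
    return [path for path in required_files if path.lower() not in selected]


# Hypothetical file names for illustration only.
required = [
    r"E:\sql\pfd.mdf",
    r"E:\sql\pfd_2006.ndf",
    r"E:\sql\pfd_log.trn",
]
selection = [
    r"e:\sql\pfd_2006.ndf",
    r"E:\sql\pfd_log.trn",
]

for path in unselected_files(required, selection):
    print("NOT ON TAPE:", path)
```

Run after every change to the backup job, a check of this kind turns a silently unchecked box into a visible failure.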
IT staffers worked through the weekend restoring all read-only historical file groups for the years 2000 through 2005, using an older backup of the Primary (MDF) file, but they were unable to successfully restore the 2006 filegroup. Even though there was a current backup for 2006 and transaction log backups, a matching backup of the Primary file was also needed.
While little actual data was lost, what was lost was the scanned images of 800,000 paper documents that had been received during 2006. "It must be made perfectly clear that the loss of these images was not the fault of the backup software, the tape library, storage array, or any other components. It was strictly human error and the consequences of not placing a check box next to the .MDF file, instructing the tape backup software to place the file onto tape."
To recover, the database was placed online with an active (2006) file group containing no data. Four seasonal employees returned to work in the summer of 2006, and for 2 1/2 months, the paper documents were rescanned and the active file group was repopulated with the newly scanned images. The images that could be OCR'd were automatically linked with their corresponding data elements. Those that could not be OCR'd were manually linked. No data or images were permanently lost in the process, the state said.
"Our IT Staff and PFD application programmers are now much more familiar with what is required to successfully restore this database from filegroup backups, some of which is new in MS SQL 2005," the state said.
"We have since added additional disk storage to the disk drive array and now have a total of 7 TB usable space, making it much easier to restore databases and perform regular tests. We now have a formal written backup plan where active file groups and logs are backed up daily and full backups, which include all file groups, are performed quarterly."
A team of IT and PFD Programming staff now review and certify backup logs daily and review all database properties, backup procedures and scripts monthly, the state said. Backup cycle rotations are reviewed and each scheduled backup is now certified by three people as being complete. IT and PFD Programming staff also perform end-to-end backup and restore drills quarterly, verifying that the entire database can be restored from tapes used in the quarterly full backups. Future plans include the development of an off-site file replication system to be located in the state's Anchorage facility.
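The revised plan (daily backups of active file groups and logs, quarterly full backups, and sign-off by three people) can be sketched as a small scheduling helper. This is a rough Python illustration under stated assumptions: the statement does not say on which day the quarterly full backup runs, so the first day of January, April, July and October is assumed, and the function names and reviewer names are hypothetical.

```python
from datetime import date

# Assumption: quarterly full backups run on the first day of these months.
FULL_BACKUP_MONTHS = (1, 4, 7, 10)


def backups_due(day):
    """Return the backup jobs due on a given date under the written plan."""
    jobs = ["active filegroups", "transaction logs"]  # performed daily
    if day.day == 1 and day.month in FULL_BACKUP_MONTHS:
        jobs.append("full backup (all filegroups)")   # performed quarterly
    return jobs


def certified(signoffs, required=3):
    """A backup counts as certified once enough distinct people sign off."""
    return len(set(signoffs)) >= required


print(backups_due(date(2007, 4, 1)))
print(backups_due(date(2007, 4, 2)))
print(certified(["ann", "bob", "ann"]))   # False: only two distinct reviewers
print(certified(["ann", "bob", "cara"]))  # True
```

Counting distinct names rather than total sign-offs keeps one person from certifying the same backup three times.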
The lesson to glean from the Alaska Department of Revenue's troubles is to make sure your backup and recovery processes work, say analysts.
Bob Abraham of Freeman Reports, a tape storage research firm, said it "looks like the inability to recover the data was due to the failure to record one critical file," the primary .MDF file.
Greg Schulz, founder and senior analyst of StorageIO, said such data losses are more common than many realize, although usually on a smaller scale.
"If people are using their backups, snapshots or other means to recover data after an accidental deletion, it's called a recover, restoration or restart," said Schulz. "Then there are the cases when backups, snapshots and archives are of no use and it's time to turn to recovery services like Ontrack to see if they can recover data off of the servers or storage."
Replication and mirroring are also important backup issues to consider, he said.
"There is another hidden message in all of this, and that is if you are relying on just RAID mirroring, local or remote replication/mirroring, if something is deleted at one site, it will be deleted at the other, so if you are relying on just replication and mirroring, at least take regular snapshots, and better yet, back up those snapshots to other mediums."
Story courtesy of Enterprise Storage Forum.