Backup failure resolution and ongoing maintenance

Jon Cowling 28-Jan-2016 15:58:20

Here is the second of five blogs written by our Managed Services Practice Head, Marcus Cowles. In this series, Marcus discusses the top incidents logged against Oracle and Microsoft SQL Server database systems.

In the first blog of this series we looked at the logical data structures that databases use to store information, the methods and approaches which can be used to keep things running smoothly and the negative impact of insufficient or inappropriate maintenance. In this blog we’re going to address what is possibly the most important subject that Microsoft SQL Server and Oracle database administrators face: backups.

All IT systems, regardless of purpose, size and activity, should have a backup solution of one type or another. For critical production systems the importance of a backup solution is all the more significant. However, what happens when the backup solution in place fails? What are the possible repercussions of a backup failing, and how can we ensure that we have the best backup solution for our environment?

A system which does not have a successful backup does not have a recovery position. This means that a server, storage or even data centre failure could cause not only loss of service to an application (e.g. your banking website going offline for a few hours) but the complete loss of all data on that system. For an online banking system that could mean losing all customers' account data, a completely unthinkable prospect.

The bottom line: backups are essential.

The problem is that if a backup fails, the impact is not felt until a restore is attempted. A backup failure does not stop the system working, and there might not even be any errors, so the failure stays invisible until the backup set is actually needed. The middle of a major incident is the worst time to find out that a backup failed; it is exactly when you need a good backup.

Could this be a catch-22? Initially it might seem so, but fortunately there are a few measures which can be employed to ensure that the system is kept in a recoverable state, i.e. one that can be restored in the event of a failure.

Monitoring and alerting

An obvious but often overlooked solution is to set the system to alert an individual or a group if a backup fails. This could be as simple as an email reporting the failed backup, or a more sophisticated approach involving SMS messages, escalation trees and other procedural techniques designed to ensure that the alert is not missed or ignored.

All enterprise database platforms (Oracle, Microsoft SQL Server, etc.) offer a native, built-in backup solution, and most feature simple alerting tools to notify system administrators or DBAs of a backup failure. Alerting is a low or zero cost option, often easily configured or even pre-configured, which gives the right people early warning of a backup failure, allowing remedial action to be taken and recoverability maintained.
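
As a concrete illustration, the Python sketch below (using the pyodbc and smtplib libraries) checks SQL Server's built-in backup history table, msdb.dbo.backupset, and emails the DBA team if the most recent full backup of a database is missing or older than a threshold. The server names, email addresses and 24-hour threshold are placeholders; treat this as a minimal outline of the idea rather than a production monitor.

    import smtplib
    from email.message import EmailMessage
    import pyodbc

    CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
                "SERVER=dbserver01;Trusted_Connection=yes")   # hypothetical server

    def last_full_backup_age_hours(database):
        # msdb.dbo.backupset records every backup SQL Server has taken;
        # type 'D' denotes a full database backup
        sql = ("SELECT DATEDIFF(HOUR, MAX(backup_finish_date), GETDATE()) "
               "FROM msdb.dbo.backupset "
               "WHERE database_name = ? AND type = 'D'")
        with pyodbc.connect(CONN_STR) as conn:
            row = conn.cursor().execute(sql, database).fetchone()
        return row[0]   # None if no full backup has ever completed

    def send_alert(database, age):
        msg = EmailMessage()
        msg["Subject"] = f"Backup alert: {database}"
        msg["From"] = "dba-alerts@example.com"        # placeholder addresses
        msg["To"] = "dba-team@example.com"
        msg.set_content(f"Last full backup of {database} is "
                        f"{age} hours old (or missing).")
        with smtplib.SMTP("mailhost.example.com") as smtp:
            smtp.send_message(msg)

    age = last_full_backup_age_hours("SalesDB")
    if age is None or age > 24:     # threshold: one missed nightly backup
        send_alert("SalesDB", age)

A scheduled check like this catches not only jobs that fail noisily but also jobs that silently stopped running, which a failure-triggered alert alone would miss.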

Backup set availability

Once your backup is set up and you have verified that the backup job succeeded (or failed, but you were alerted to the fact), you're all set, right? Unfortunately not; the paranoia around backups doesn't end with a successful backup completing. The primary purpose of a backup is to restore a system (in this case a database) in the event of some kind of failure. Just as one would not place all one's eggs in one basket, one certainly should not store a backup in the same location as the system it is designed to save.

So what are the options? In principle, a backup should be transferred to a separate system, both topologically and topographically. Topologically, the backup should be written to a different logical system, i.e. not saved alongside the system it has backed up. This usually takes the form of a backup vault of some kind, which can be as simple as a shared file system or something far more sophisticated; in the case of a database failure, for example, the database itself should not be needed in order to access the backup set. Topographically, the backup set should be separated from the originating system by some physical distance. It would be self-defeating to store the backup vault on the same physical server as the target system: a server failure would take out both, and a restore would not be possible.
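
To make the idea concrete, here is a minimal Python sketch of shipping a backup file to a separate vault (a network share in this example) and verifying the copy with a checksum before trusting it. The paths are hypothetical; a real vault might equally be dedicated backup software or object storage.

    import hashlib
    import shutil
    from pathlib import Path

    def sha256(path, chunk=1024 * 1024):
        # hash the file in 1 MB chunks so large backup sets fit in memory
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                digest.update(block)
        return digest.hexdigest()

    def ship_to_vault(backup_file, vault_dir):
        src = Path(backup_file)
        dest = Path(vault_dir) / src.name
        shutil.copy2(str(src), str(dest))     # copy, preserving timestamps
        if sha256(src) != sha256(dest):       # the copy must be byte-identical
            dest.unlink()
            raise IOError(f"Checksum mismatch copying {src} to {dest}")
        return dest

    ship_to_vault(r"D:\backups\SalesDB_20160128.bak", r"\\vault01\sqlbackups")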

Multiple redundant copies

We are once again employing a degree of paranoia, but with backups you can never be too safe: an unrecoverable database has failed in its most basic task, storing data securely. It does no harm to produce two copies of each backup set and physically store them in two separate locations, neither of which is the location of the production system. The odds of losing both the target system's data centre and the off-site backup location are slim, but the chance of losing the target system and two redundant backup locations is positively infinitesimal, and the benefit massively outweighs the cost. To an IT manager or director it is a fairly obvious choice.
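
Extending the previous sketch, fanning the copy out to two vaults at different sites is a small loop. The vault paths are again placeholders; the important detail is that a failure to write either redundant copy is surfaced rather than ignored.

    VAULTS = [r"\\vault01\sqlbackups",      # local vault
              r"\\dr-site\sqlbackups"]      # second, physically remote site

    def ship_everywhere(backup_file):
        failures = {}
        for vault in VAULTS:
            try:
                ship_to_vault(backup_file, vault)   # from the sketch above
            except OSError as exc:
                failures[vault] = exc               # keep trying the other vaults
        if failures:
            # feed this into the alerting described earlier; a missing
            # redundant copy must never be silent
            raise RuntimeError(f"Redundant copy failed for: {sorted(failures)}")

    ship_everywhere(r"D:\backups\SalesDB_20160128.bak")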

Restore tests

Monitoring the backup process for failures and restarting, correcting or resuming a backup does provide some peace of mind. However, even if the backup itself completes successfully, what reassurance does one have that it can actually be used to perform a recovery? The simple answer is: none.

Unfortunately, while most systems can validate that a backup has completed, and can even test the validity of the backup set, there is no substitute for proving a backup set by performing an actual restore.

This is an active measure, in contrast to passive monitoring, because it makes no assumptions: just because the backup job completed, it does not follow that the backup can be used for a recovery.
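
As an illustration of what an automated restore test can look like for SQL Server, the sketch below restores a backup onto a scratch server under a throwaway name and then runs DBCC CHECKDB against the restored copy. The server name, logical file names and paths are placeholders, and a real test would also verify application-level data.

    import pyodbc

    SCRATCH = ("DRIVER={ODBC Driver 17 for SQL Server};"
               "SERVER=restore-test01;Trusted_Connection=yes")  # scratch server

    def restore_test(backup_file, test_db="RestoreTest"):
        # RESTORE cannot run inside a transaction, so autocommit is required
        conn = pyodbc.connect(SCRATCH, autocommit=True)
        cur = conn.cursor()
        # 'SalesDB' / 'SalesDB_log' are the logical file names inside the
        # backup (hypothetical here); WITH MOVE relocates them on the scratch box
        cur.execute(
            f"RESTORE DATABASE {test_db} "
            f"FROM DISK = N'{backup_file}' "
            f"WITH REPLACE, "
            f"MOVE 'SalesDB' TO N'D:\\scratch\\{test_db}.mdf', "
            f"MOVE 'SalesDB_log' TO N'D:\\scratch\\{test_db}.ldf'")
        while cur.nextset():            # drain RESTORE progress messages
            pass
        # the real proof: open the restored copy and verify its integrity
        cur.execute(f"DBCC CHECKDB ({test_db}) WITH NO_INFOMSGS")
        while cur.nextset():
            pass
        cur.execute(f"DROP DATABASE {test_db}")
        conn.close()

    restore_test(r"\\vault01\sqlbackups\SalesDB_20160128.bak")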

There is obviously a balance to strike between the ideal and the real world. Ideally, one would test every backup set taken to ensure that it is valid, but that process would probably take longer than backing up the database itself, and the IT department would find itself dedicating a team to performing restore tests. Some companies can afford to do this, and for others it is mandatory (government bodies, financial organisations and so on), but for most companies it is neither feasible nor necessary. It is up to each organisation and its IT policy to decide how frequently the backups should be tested and to what degree; an organisation that performs backup testing once a quarter is well ahead of the curve.

The disadvantage of restore testing, as may already be apparent, is that it involves human effort and therefore cost. While there are automated systems which can schedule restore tests and provide feedback, these must also be monitored and checked and, of course, the validity of the restore test must itself be verified, a technological version of 'who polices the police?'. As stated above, these concerns are normally limited to organisations which must be extremely rigorous about their restore capability, and those organisations almost always have a budget to match.

Backup media and performance

These two topics are closely linked, as the choice of media often has an impact on the performance of the backup and, possibly more importantly, of the restore process. Performance is often regarded as a luxury, but consider the scenario of a complete system failure where a restore is required: the question on everybody's lips will be "how soon will the system be available?", and suddenly performance is not a luxury but a matter of lost revenue per hour of downtime.

The other aspect of performance is technology coordination. A backup, like any other process, requires horsepower and resources, and one would not want it to interfere with a critical overnight batch, for example. Contention between processes can cause delays and even failures, leading to a failed backup and an unrecoverable system; suddenly the backup is the cause of a failure rather than the saviour. Alerting on a failed backup and responding is all very well, but a backup solution which is less likely to fail in the first place is preferable.

Choice of media is generally the most significant factor in backup and restore performance. A backup is essentially a transfer of data from one place to another, so the same principle applies as elsewhere in IT: the I/O (input/output) rate determines how long it takes to move the data to the backup destination and complete the process. The options range from the latest SSDs (solid state drives) and hybrid disks (which combine SSDs and traditional spindle disks) to disk arrays (often hybrid) and tape devices. The general rule of thumb with storage is: the faster the device, the more it costs.

Since backups are generally large and their performance is not as critical as that of an online system, tape remains the most common long-term backup destination. Tape is relatively cheap to operate (the tapes themselves are inexpensive, though the drives can be very costly), portable, and has excellent data retention, holding data reliably for many years. Portability is of particular interest, as it allows tape backups to be physically moved to a separate site and stored in a good old-fashioned fire-proof safe. A tape backup can then be labelled, archived and retrieved at a later date according to a central catalogue (be it electronic or papyrus).
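
A back-of-the-envelope calculation makes the trade-off tangible. The throughput figures below are illustrative only (the tape figure is roughly LTO-6 native speed), but the arithmetic, volume divided by throughput, is the heart of any restore-time estimate:

    def transfer_hours(data_gb, rate_mb_per_s):
        # time = volume / throughput (using 1 GB = 1024 MB)
        return data_gb * 1024 / rate_mb_per_s / 3600

    for device, rate in [("LTO-6 tape", 160),      # ~native LTO-6 speed
                         ("spindle array", 400),   # illustrative figures
                         ("SSD array", 1500)]:
        print(f"2 TB restore from {device:13}: "
              f"{transfer_hours(2048, rate):4.1f} hours")

On those assumptions a 2 TB restore takes roughly 3.6 hours from tape but well under an hour from an SSD array, which is exactly the difference that matters when revenue is being lost for every hour of downtime.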

Hybrid backup solutions

Tape is not the definitive answer, however. Many organisations employ a hybrid backup process involving both disk and tape to speed up the backup and restore processes while minimising disk array usage. For example, in order to maintain long-term backups while providing quick short-term recovery, each backup is initially taken to disk (which is relatively fast) with a retention of, say, two backups (to minimise the footprint on storage). As the third backup is taken, the oldest disk backup is transferred to tape and archived. The tape device is slower than disk, but most restore operations require a recent backup, in which case the faster disk copy can be used.

This approach allows an IT manager to maintain long-term, cost-effective backup sets in the form of the tape archive, while also keeping fast-restore backup sets available on disk for a short time. A principle emerges: a recent backup can be restored quickly, an archived backup more slowly. Most organisations accept this principle and, depending on requirements, an appropriate number of 'fast' backups can be kept on disk to give the business the right flexibility. A minimal sketch of the rotation logic follows below.
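
The sketch assumes backups land as .bak files in a fast disk vault and that a staging share is swept onto tape by the tape library; all paths and the retention count are placeholders.

    import shutil
    from pathlib import Path

    DISK_VAULT = Path(r"\\vault01\fast")      # short-term, fast-restore copies
    TAPE_STAGE = Path(r"\\tapesrv\staging")   # swept onto tape by the library
    KEEP_ON_DISK = 2                          # retention: two newest backups

    def rotate():
        # newest first, by file modification time
        backups = sorted(DISK_VAULT.glob("*.bak"),
                         key=lambda p: p.stat().st_mtime, reverse=True)
        for old in backups[KEEP_ON_DISK:]:
            # everything beyond the retention window goes to the tape stage
            shutil.move(str(old), str(TAPE_STAGE / old.name))

    rotate()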

Standard technologies

The backup is the system's, the organisation's and the IT manager's get-out-of-jail-free card. It allows recovery from the worst possible nightmare with a high degree of confidence that no data has been lost. Downtime is bad enough; data loss is simply catastrophic.

So the last thing an organisation wants to deal with is a custom or heavily customised backup and restore solution. There is a very good reason why enterprise database vendors put a lot of effort into producing standardised, supported backup solutions, and there is no need to re-invent the wheel. The disadvantages of a customised or home-made solution of any kind are fairly obvious: lack of vendor support, buggy behaviour, limited documentation and limited training. So when it comes to the most important part of a database estate, why trust the job to an untested custom-built solution? There is no reason to, none. At all costs, an organisation should look to use the tools provided by recognised vendors who have subjected their products to industry-standard testing.
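
In practice, "using the vendor's tools" can be as simple as scheduling the native backup command. For SQL Server, for example, a nightly job might invoke the vendor's own sqlcmd utility to run a native BACKUP with a checksum; the server, database and path below are placeholders.

    import subprocess

    BACKUP_SQL = ("BACKUP DATABASE SalesDB "
                  "TO DISK = N'D:\\backups\\SalesDB.bak' "
                  "WITH CHECKSUM, INIT")      # native T-SQL backup options

    result = subprocess.run(
        ["sqlcmd", "-S", "dbserver01", "-b", "-Q", BACKUP_SQL],
        capture_output=True, text=True)
    if result.returncode != 0:   # -b makes sqlcmd return non-zero on T-SQL errors
        raise RuntimeError(result.stderr or result.stdout)

The Oracle equivalent would be an RMAN BACKUP DATABASE script; either way, the heavy lifting stays inside the vendor's supported tooling rather than in home-grown code.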

If a standard solution does fail (and it happens; nothing is infallible), the organisation has the comfort of knowing that it can pick up the phone and raise a priority incident with the vendor, usually at no cost if the technology is a supported version.

Obviously there are exceptions and there are some organisations for whom standard technologies may not be acceptable, but these organisations are very much making a compromise rather than a choice. They will also need very rigorous internal testing, validation and documentation around the solution, which obviously comes at a premium cost.

The combination of the right tools and the right configuration means that most systems' backup solutions can be designed and implemented without a great deal of trouble. Enterprise database vendors recognise the importance of database backups and have made it as easy as possible to take, manage, test and recover backups of all types. Backup and restore projects can be fascinating and rewarding, and there are a vast number of options and configurations, but in production, when the organisation's revenue is on the line, the solution should be simple, secure and tested. Follow these simple guidelines and it is hard to go wrong.