Incidents logged against Oracle & SQL database systems. Part 1

Jon Cowling 05-Feb-2016 14:29:36

Incidents logged against Oracle SQL Database Systems Part 1 

Here is the first of 5 blogs written by our Managed Services Practice Head - Marcus Cowles. In this series, Marcus will discuss the top incidents logged against database systems.

Tactics to ensure that the database environment remains functional and available.
Managing an Oracle or Microsoft SQL Server database can be a pleasure or an absolute nightmare depending on a number of factors. Pro-active maintenance, preventative fixes and comprehensive monitoring can prevent most but not all issues. We have been analysing the most common incidents which are logged against database systems and some simple steps which can be taken, both technical and procedural, to reduce their frequency, impact and required resolution time.

Our top 5 incidents are as follows:

  1. Database maintenance (space management etc)
  2. Backups (maintenance and failure)
  3. Performance
  4. Connectivity (users and applications)
  5. Patching

Section 1: Database space management (such as tablespace and/or filesystem management)

Most databases operate as a store for information and in almost all cases, the volume of information stored in a system will increase over time. Customer orders, logged telephone calls, online transactions and other business-as-usual activities all generate data. So where does this data get stored and how does the IT manager or DBA know that the system is equipped to handle the load? Below are a few simple tactics which can be employed to ensure that the database environment remains functional and available:

1. Pro-active tablespace management
This is not to be confused with tablespace monitoring (which we will discuss shortly) – monitoring a tablespace and waiting for it to reach a threshold (say, 90% used) then taking action to expand it is not pro-active, it is reactive since the DBA support team is reacting to a monitoring alert. True, it is not a reaction to an outage, so the database availability is maintained, but it is still a reactive approach.

In the case of a logical storage unit such as a tablespace pro-active action should be pre-hoc, that is, before the event which may cause an incident. Now this sounds great, but we can’t all see into the future so the million dollar question is “what kind of event can cause a tablespace capacity issue?” Fortunately for us mere, non-time-travelling mortals, the answer is simple: putting data into the database will cause an increased capacity requirement. As it happens, some of these activities, such as a large data load, can be predicted and communicated and capacity adjusted accordingly before an incident is triggered. Obviously this sounds very straightforward and might seem more like common sense than an actual IT policy but we very frequently see occasions where the database support team are not aware of large-data activities that other departments/teams are performing and then have to react to changes in capacity requirements at very short notice, sometimes outside business hours which can cause frustration, delay and, of course, cost.

2. It makes sense that the responsibility of making sure that these activities are communicated and prepared for lies with the database support team since it is on their shoulders that the availability and performance of the database lies. So in a simple sound bite: the database support team should endeavour to maintain documentation on all significant data loads, their characteristics and frequency with the assistance of the application/development teams