Oracle database configuration issues cause downtime – Part 2
In this article, I continue listing common database configuration issues that can affect Oracle database high availability (HA) and cause unplanned downtime. Make sure you have read the first part of Oracle database configuration issues that cause downtime.
Control file limit reached
This issue can occur when you reach the limit of a DB configuration parameter stored in the control file, such as MAXLOGFILES, MAXLOGMEMBERS or MAXINSTANCES.
Fixing this requires downtime, typically to recreate the control file. Oracle 10g has relaxed some of these limitations, though.
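One way to notice an approaching limit before it forces an outage is to compare used and total record slots per control file section. The sketch below assumes the rows come from the V$CONTROLFILE_RECORD_SECTION view (columns TYPE, RECORDS_TOTAL, RECORDS_USED). The function name, the 80% threshold and the sample rows are my own illustrative assumptions, and actually running the query needs a database driver that is not shown here.

```python
# Query one might run to obtain the rows (not executed in this sketch).
CONTROLFILE_USAGE_SQL = """
SELECT type, records_total, records_used
FROM v$controlfile_record_section
"""


def sections_near_limit(rows, threshold=0.8):
    """Return (type, used, total) for sections at or above the threshold.

    rows: iterable of (type, records_total, records_used) tuples.
    threshold: hypothetical warning level, as a fraction of the total.
    """
    flagged = []
    for section, total, used in rows:
        # Skip sections with no allocated slots to avoid dividing by zero.
        if total > 0 and used / total >= threshold:
            flagged.append((section, used, total))
    return flagged


# Made-up sample rows, for illustration only.
sample = [("REDO LOG", 16, 15), ("DATAFILE", 100, 12), ("REDO THREAD", 8, 8)]
for section, used, total in sections_near_limit(sample):
    print(f"{section}: {used} of {total} record slots used")
```

A check like this could run from a scheduled monitoring job, so that a control file rebuild can be planned rather than forced.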
Oracle ASM instance limits
Oracle ASM can be considered another database instance that has its own parameter limits, which you also need to consider carefully. For example, if the ASM instance reaches its maximum number of processes, all of your databases on that host will hang.
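To keep an eye on the process limit, you can look at the 'processes' row of V$RESOURCE_LIMIT, which reports CURRENT_UTILIZATION and LIMIT_VALUE. The sketch below assumes that row was fetched while connected to the ASM instance. The helper name is hypothetical, and it only handles the fact that LIMIT_VALUE is a padded string that may read UNLIMITED.

```python
def process_headroom(current_utilization, limit_value):
    """Return the number of free process slots, or None if unlimited.

    limit_value is treated as text because V$RESOURCE_LIMIT reports it
    as a padded string that may contain 'UNLIMITED'.
    """
    text = str(limit_value).strip()
    if text.upper() == "UNLIMITED":
        return None
    return int(text) - int(current_utilization)


# Made-up values, for illustration only.
print(process_headroom(95, "       100"))
```

How much headroom is enough depends on how many database instances and sessions share the ASM instance, so any alert threshold is a local decision.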
Oracle auditing may cause service unavailability
Be aware that Oracle by default does a lot of auditing, both at the OS level and inside the database SYSTEM tablespace. As a matter of fact, the latest Oracle versions produce even more audit records by default. So I suggest reviewing the defaults and, based on your auditing strategy, keeping only what is required and implementing proper storage and housekeeping structures.
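As an illustration of OS-level housekeeping, here is a minimal sketch that finds and optionally removes old audit files. It assumes the audit files carry the .aud extension and live in a single directory (the one your AUDIT_FILE_DEST parameter points to). The function names and the retention period are my own assumptions, and your auditing policy may require archiving files before deleting them.

```python
import time
from pathlib import Path


def old_audit_files(audit_dir, max_age_days, now=None):
    """List *.aud files in audit_dir last modified over max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(audit_dir).glob("*.aud")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def purge_audit_files(audit_dir, max_age_days, dry_run=True):
    """Delete old audit files; with dry_run=True only report them."""
    victims = old_audit_files(audit_dir, max_age_days)
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims
```

Calling purge_audit_files with the default dry_run=True only returns the candidate files, which makes it easy to review the list before deleting anything.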
As with all software, Oracle bugs sometimes cause database unavailability or simply increase the length of an outage, for example after a hardware or storage failure. Define a robust patching strategy and follow it carefully.
Patching may introduce database unavailability
This is the other side of the coin. Do not rush upgrades or patches into production environments, but rather test them carefully on test databases first. It is important to consider not only the time needed to perform the upgrade, but also the effect the changes may have on the overall database and application.
Sometimes patches introduce new bugs that reduce database high availability. Although Oracle states that Patch Sets contain only bug fixes and that application vendors need not certify them, I strongly recommend conducting extensive testing and getting a sign-off from the application vendors and/or your application owners before going live.
Database recovery situation
A recovery situation can potentially cause your longest outage and break your SLA with the customer. To minimize the downtime, you have to be very careful about:
– utilizing modern software and hardware tools for potential recovery situations
– designing your backup and recovery strategy
– documenting it properly
– testing it periodically
– performing preliminary file, database and object consistency checks
– considering values of certain database initialization parameters:
DB_ULTRA_SAFE (new in 11g)
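DB_ULTRA_SAFE is a shortcut that sets defaults for three block protection parameters. The lookup below summarizes what each value implies, as I understand the 11g Reference. Treat the table as an assumption and verify it against the documentation for your release. The helper name is hypothetical.

```python
# Defaults implied by each DB_ULTRA_SAFE value (assumption: taken from my
# reading of the 11g Reference; check the documentation for your release).
DB_ULTRA_SAFE_EFFECT = {
    "OFF": {},
    "DATA_ONLY": {
        "DB_BLOCK_CHECKING": "MEDIUM",
        "DB_BLOCK_CHECKSUM": "FULL",
        "DB_LOST_WRITE_PROTECT": "TYPICAL",
    },
    "DATA_AND_INDEX": {
        "DB_BLOCK_CHECKING": "FULL",
        "DB_BLOCK_CHECKSUM": "FULL",
        "DB_LOST_WRITE_PROTECT": "TYPICAL",
    },
}


def implied_settings(ultra_safe):
    """Return a copy of the parameter defaults implied by DB_ULTRA_SAFE."""
    return dict(DB_ULTRA_SAFE_EFFECT[ultra_safe.upper()])


for name, value in sorted(implied_settings("data_only").items()):
    print(f"{name} = {value}")
```

Stronger checking costs CPU and I/O, so measure the overhead on a test system before enabling it in production.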
Poor database security
Although Oracle holds 17 or more security certifications, DBAs have to strengthen database security at all levels during DB setup, configuration, application installation and maintenance. And this is not only to prevent break-ins by intruders. My first rule is simple: minimize the number of users that have direct access to your database and stick to the least-privilege rule. This can reduce the number and impact of user errors, recovery situations and system break-ins.
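A simple way to apply the rule above is to review periodically who holds powerful roles. The sketch below assumes (grantee, granted_role) pairs taken from the DBA_ROLE_PRIVS view. The watch list, the allow list and the sample rows are hypothetical and need to be adapted to your environment.

```python
# Hypothetical watch list of roles that deserve a regular review.
POWERFUL_ROLES = {"DBA"}


def grantees_to_review(role_privs, allowed, watched=POWERFUL_ROLES):
    """Return grantees holding a watched role who are not on the allow list.

    role_privs: iterable of (grantee, granted_role) pairs.
    allowed: set of grantees expected to hold such roles.
    """
    return sorted({
        grantee
        for grantee, role in role_privs
        if role in watched and grantee not in allowed
    })


# Made-up sample rows, for illustration only.
sample_privs = [
    ("SYS", "DBA"),
    ("SYSTEM", "DBA"),
    ("APP_USER", "DBA"),
    ("REPORTING", "CONNECT"),
]
print(grantees_to_review(sample_privs, allowed={"SYS", "SYSTEM"}))
```

Direct system privilege grants are not covered by this check and would need a similar review of their own.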
Database environment complexity
Yes, even this sometimes becomes an issue and leads to more database outages. Companies implement complex database infrastructure without taking into account the risk that comes with it: more software and hardware components that can fail or introduce bugs and issues, compared to simple and robust solutions.
DBAs, or the human factor
Human error, which is a leading cause of failures, includes errors by an operator, user, database administrator, or system administrator. Another human factor that can cause unplanned downtime is sabotage.
Nobody is perfect and every DBA can make mistakes, but what you can do is train your personnel properly and periodically test your common DBA tasks and procedures. In the end, you need to find an experienced DBA you can trust and rely on… and I think this is the most difficult thing to do.
In the following articles, I plan to discuss some of Oracle's solutions to downtime.