Pythian Blog: Technical Track

How to Test and Verify that Oracle RDS Alert Log Monitoring is Working

One of the standard Oracle DBA tasks is to monitor the alert log for errors. Whether you’re using the integrated AWS Cloudwatch or customized monitoring scripts, you’ll want to test that the monitoring function is working properly.

 

 

 

There can be many noise pages that you want to exclude from alerting. For instance, if your network is sometimes unstable, you’ll frequently see TNS errors in your alert log–perhaps something like “ORA-12196: TNS:received an error from TNS”.

You’re aware of these errors and don’t want to be notified each time they occur. In this case, you have an “EXCLUDE list” which lists all errors you wish to ignore or exclude from paging. It’s less likely, but not impossible, that instead you have an “INCLUDE list” which includes only those alert log messages on which you wish to be notified.

Once you have created your EXCLUDE or INCLUDE list, how do you verify that your monitoring is working correctly and that you’re indeed excluding or including the appropriate notifications? You don’t really want to cause errors on your system to test the monitoring. Instead, you write fake error messages to your alert log, and then confirm your monitoring.

Oracle provides a mechanism to write directly to the alert log using DBMS_SYSTEM.KSDWRT procedure. However, this is not available in a relational database service (RDS) environment. So, we need to find another way to write to the alert log to test monitoring.

The following provides an example of a simple workaround to write to the alert log. It requires only three database objects: a table, a procedure, and a scheduler job.

To avoid filling the alert log with useless information, we put a simple mechanism in place that allows us to turn on/off writing errors to the alert log. We use simple control table for this :

create table control_job_failing ( should_job_fail char(1));
insert into control_job_failing values ( 'N');
commit;

Next, you’ll cause Oracle to raise certain harmless error conditions and temporarily include these errors in your EXCLUDE or INCLUDE list. There are three ways to force Oracle to raise an error:

  1. Oracle PL/SQL provides the built-in procedure “RAISE_APPLICATION_ERROR”. This allows you to raise an exception in the error range of -20000 to -20999.
  2. The Oracle built-in “RAISE” allows you to raise one of the pre-defined exceptions.
  3. Manually execute an invalid command, such as “divide by zero”.

This procedure will read the control table and then either terminate successfully or terminate with error based on the control table value:

create or replace procedure proc_control_job_failing
as
v_should_job_fail char(1);
divisor number;
begin
select should_job_fail into v_should_job_fail
from control_job_failing;
if v_should_job_fail = 'Y' then
-- using one of the three ways to force an error
raise_application_error ( -20000, 'Forcing failure for testing');
-- raise INVALID_CURSOR ;
-- select 5/0 into divisor from dual;
end if;
end;
/

Lastly, create a scheduler job that will run the procedure on a regular basis:

begin
dbms_scheduler.create_job (
job_name => 'JOB_FAILING_OR_NOT',
job_type => 'STORED_PROCEDURE',
job_action => 'PROC_CONTROL_JOB_FAILING',
start_date => systimestamp + 5/1440,
repeat_interval => 'FREQ=HOURLY;BYMINUTE=0,5,10,15,20,25,30,35,40,45,50,55',
enabled => TRUE);
END;
/

Verify that the scheduler job has been created successfully:


set linesize 150 trimspool on
col job_name format a30
col next_run_date format a30
col start_date format a30
col repeat_interval format a20
select job_name, enabled, next_run_date, start_date, repeat_interval
from dba_scheduler_jobs where job_name = 'JOB_FAILING_OR_NOT';

Test the scheduler job by running with the control setting to do nothing and with value set to cause the error to be written:

exec dbms_scheduler.run_job( 'AVAIL.JOB_FAILING_OR_NOT');
update control_job_failing set should_job_fail = 'Y' ;
commit;
exec dbms_scheduler.run_job( 'AVAIL.JOB_FAILING_OR_NOT');

View the scheduler run details to verify that it has been run by scheduler:

set linesize 200 trimspool on
col log_date format a20
col status format a15
col additional_info word_wrapped format a80
select log_date, status, error#, additional_info
from dba_SCHEDULER_JOB_RUN_DETAILS
where job_name = 'JOB_FAILING_OR_NOT'
order by log_date;

You’ll want to confirm that the errors actually have been written to the alert log. You don’t necessarily want to review all records in alert log to locate and view that your test errors were written to alert log. Use the following SQL to view the alert log messages from just before and just after the most recent alert log error:


with find_time
as ( select max(originating_timestamp) max_timestamp
from alertlog where instr(message_text, 'ORA-') > 0 )
select ORIGINATING_TIMESTAMP, message_text
from alertlog, find_time
where ( ORIGINATING_TIMESTAMP > ( find_time.max_timestamp - 5/1440 )
and ORIGINATING_TIMESTAMP < ( find_time.max_timestamp + 5/1440 )
)
order by ORIGINATING_TIMESTAMP;

So, now we have shown how to force an error to the alert log and view alert log messages created by our testing. Hopefully by now, your regular alert log monitoring has notified you of the errors you have generated and caused to be written. If not, take this time to review your alert log monitoring and EXCLUDE list to verify if you should have been notified.

 

I hope you’ve found this post helpful. Feel free to share your questions in the comments and subscribe so you don’t miss the next post.

 

 

 

No Comments Yet

Let us know what you think

Subscribe by email