Pythian Blog: Technical Track

Failovers with Oracle Dataguard

In the last month, we have been hit with two clients’ large-scale failures. The first involved network issues; the second, disk failures. Fun fun.

We have one production and two standby environments. In the case of the first failure we went into our trusty dataguard config and performed what can only be described as a minor miracle.

First we connected to the standby that was to become the primary.

Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys/xxxxx@rcatstdby2
Connected.
DGMGRL> show configuration;

Configuration
Name: rcat
Enabled: YES
Protection Mode: MaxPerformance
Fast-Start Failover: DISABLED
Databases:
rcat_aussw10_db - Physical standby database
rcat_nflsw01_db - Primary database
rcat_aussw20_db - Physical standby database

Current status for "rcat":
^C

We had to control-c out because it hung: it could not connect to nflsw01 and we knew that at least dataguard thought there was a problem. Then we took the steps that separate the men from the boys.

oracle@aussw10-->(ORACLE_SID = rcat) ** prodstdby2, rcatstdby2, limtest, COREprod, **
/home/oracle> dgmgrl
DGMGRL for Solaris: Version 10.2.0.2.0 - 64bit Production

Copyright (c) 2000, 2005, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.

DGMGRL> connect sys/xxxxxxx@rcatstdby2
Connected.
DGMGRL> remove site rcat_nflsw01_db
you cannot remove the primary site
DGMGRL> failover to rcat_aussw10_db graceful;
Performing failover NOW, please wait...
Failover succeeded, new primary is "rcat_aussw10_db"
DGMGRL> show configuration

Configuration
Name: rcat
Enabled: YES
Protection Mode: MaxPerformance
Fast-Start Failover: DISABLED
Databases:
rcat_aussw10_db - Primary database
rcat_nflsw01_db - Physical standby database (disabled)
rcat_aussw20_db - Physical standby database

Current status for "rcat":
SUCCESS

DGMGRL>

That was it. What a let down! No magical status updates, no little “dots” going across the screen. Just one line and 15 seconds. Oracle, get your act together! I even had to use bold text to make it stand out.

What happened, you may ask. We took one of the standbys and turned it into a primary. The other standby then automagically started recovering from the new primary! Wow! Maybe you don’t know, but this used to be a huge, complicated process once upon a time. I remember doing — ugh! — manual failovers in the past and now it just feels so… 2005. It’s really amazing what dataguard can do isn’t it?

Sadly, it can’t take care of everything. The second disk failure managed to fubar dataguard enough that we had to do our trusty 2005 version of a manual failover. We were using standby redo logs, and when the failure hit, some of our actual archive logs hadn’t made it over because we weren’t using ASYNC. I’m still not sure how this happened, but when dataguard tried to failover, it found that we were missing archive logs and redo, and wouldn’t do a graceful failover.

Here’s this attempt with the accompanying alert log from the standby that we tried to activate.

DGMGRL> show configuration

Configuration
  Name:            prod
  Enabled:         YES
  Protection Mode: MaxPerformance
  Databases:
    prod_aussw20_db - Primary database
    prod_nflsw01_db - Physical standby database
    prod_aussw10_db - Physical standby database

Current status for "prod":
Error: ORA-16625: cannot reach the primary
DGMGRL> failover to  prod_nflsw01_db graceful;
Performing failover NOW. Please wait...
Error: ORA-16771: Converting physical standby to primary failed in the failover operation.

Failed.
Can not proceed to failover.
DGMGRL>

And here is the output from the alert log:

ARCH: Connecting to console port...
Mon Oct 30 07:15:30 2006
Terminal Recovery: completion detected (prod)
Mon Oct 30 07:15:30 2006
ARCH: Connecting to console port...
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DI
Mon Oct 30 07:15:30 2006
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
Mon Oct 30 07:15:30 2006
ALTER DATABASE SWITCHOVER TO PRIMARY (prod)
ARCH: Connecting to console port...
If media recovery active, switchover will wait 900 seconds
Database not available for switchover
  End-Of-REDO archived log file has not been recovered
Database not available for switchover
  End-Of-REDO archived log file has not been recovered
Switchover: Media recovery required - standby not in limbo
ORA-16139 signalled during: ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAI...

It unfortunate that it wasn’t smart enough to know that it had missing archive logs. I think it should have asked to perform a resetlogs or something. In the end, we found two of the three archive logs on the other standby. The third must be in ora-0600 purgatory.

For reference, I’ve added the alert log from the successful dataguard failover that I described in the first part of the article. It’s neat-o. When it works, it works incredibly well.

MRP0: Background Media Recovery cancelled with status 16037
Thu Oct 19 15:07:28 2006
Errors in file /home/oracle/app/oracle/admin/rcat/bdump/rcat_mrp0_6688.trc:
ORA-16037: user requested cancel of managed recovery operation
Managed Standby Recovery not using Real Time Apply
Recovery interrupted!
Recovered data files to a consistent state at change 68214212
Thu Oct 19 15:07:29 2006
Errors in file /home/oracle/app/oracle/admin/rcat/bdump/rcat_mrp0_6688.trc:
ORA-16037: user requested cancel of managed recovery operation
Thu Oct 19 15:07:29 2006
MRP0: Background Media Recovery process shutdown (rcat)
Thu Oct 19 15:07:29 2006
Terminal Recovery: Stopped real time apply
Thu Oct 19 15:07:29 2006
Attempt to do a Terminal Recovery (rcat)
Thu Oct 19 15:07:29 2006
Media Recovery Start: Managed Standby Recovery (rcat)
Managed Standby Recovery not using Real Time Apply
parallel recovery started with 5 processes
Terminal Recovery timestamp is '10/19/2006 15:07:31'
Terminal Recovery: applying standby redo logs.
Terminal Recovery: thread 1 seq# 98673 redo required
Terminal Recovery: /home/oracle/oradata/rcat/sbylog1_g6.dbf
Identified End-Of-Redo for thread 1 sequence 98673
Thu Oct 19 15:07:31 2006
Incomplete recovery applied all redo ever generated.
Recovery completed through change 68214224
Thu Oct 19 15:07:31 2006
Media Recovery Complete (rcat)
Terminal Recovery: successful completion
Begin: Standby Redo Logfile archival
End: Standby Redo Logfile archival
Resetting standby activation ID 375195464 (0x165d0748)
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
Thu Oct 19 15:07:35 2006
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
Thu Oct 19 15:07:35 2006
ALTER DATABASE SWITCHOVER TO PRIMARY (rcat)
If media recovery active, switchover will wait 900 seconds
Standby terminal recovery start SCN: 68214212
SwitchOver after complete recovery through change 68214224
Online log /home/oracle/oradata/rcat/log1_g1.dbf: Thread 1 Group 1 was previously cleared
Online log /home/oracle/oradata/rcat/log2_g1.dbf: Thread 1 Group 1 was previously cleared
Online log /home/oracle/oradata/rcat/log1_g2.dbf: Thread 1 Group 2 was previously cleared
Online log /home/oracle/oradata/rcat/log2_g2.dbf: Thread 1 Group 2 was previously cleared
Online log /home/oracle/oradata/rcat/log1_g3.dbf: Thread 1 Group 3 was previously cleared
Online log /home/oracle/oradata/rcat/log2_g3.dbf: Thread 1 Group 3 was previously cleared
Standby became primary SCN: 68214211
Converting standby mount to primary mount.
Thu Oct 19 15:07:36 2006
Switchover: Complete - Database mounted as primary (rcat)
Thu Oct 19 15:07:36 2006
ARC0: STARTING ARCH PROCESSES
Thu Oct 19 15:07:36 2006
Completed: ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
Thu Oct 19 15:07:36 2006
ARC2: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
ARC0: Becoming the 'no SRL' ARCH
ARC2 started with pid=13, OS id=14326
Thu Oct 19 15:07:37 2006
ALTER SYSTEM SET standby_archive_dest='' SCOPE=BOTH SID='rcat';
Thu Oct 19 15:07:37 2006
ALTER SYSTEM SET log_archive_dest_1='location="/export/backup/arch/rcat"', 'valid_for=(ONLINE_LOGFILE,ALL_ROLES)' SCOPE=BOTH SID='rcat';
Thu Oct 19 15:07:37 2006
ALTER SYSTEM SET log_archive_dest_state_1='ENABLE' SCOPE=BOTH SID='rcat';
Thu Oct 19 15:07:37 2006
ALTER SYSTEM SET log_archive_dest_2='service="(DESCRIPTION=(ADDRESS_LIST= (ADDRESS=(PROTOCOL=tcp)(HOST=aussw20)(PORT=1521))) (CONNECT_DATA=(SERVICE_NAME=rcat_aussw20_db_XPT.aussw20.lim.com) (INSTANCE_NAME=rcat)(SERVER=dedicated)))"',' ARCH SYNC NOAFFIRM delay=0 OPTIONAL max_failure=0 max_connections=1 reopen=300 db_unique_name="rcat_aussw20_db" register net_timeout=180 valid_for=(online_logfile,primary_role)' SCOPE=BOTH;
Thu Oct 19 15:07:37 2006
ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=BOTH;
Thu Oct 19 15:07:37 2006
ALTER DATABASE OPEN
Thu Oct 19 15:07:38 2006
Assigning activation ID 374183684 (0x164d9704)
Thread 1 opened at log sequence 98674
Current log# 2 seq# 98674 mem# 0: /home/oracle/oradata/rcat/log1_g2.dbf
Current log# 2 seq# 98674 mem# 1: /home/oracle/oradata/rcat/log2_g2.dbf
Successful open of redo thread 1

No Comments Yet

Let us know what you think

Subscribe by email