
How to test an Oracle database upgrade using a physical standby

It was a Thursday morning. I started my day at work and found out that I had been tasked with running a test upgrade to 11.2.0.4 on Friday. I mention this to make clear that I did not have much time for planning and testing; there may be better ways to complete this task, but this is the approach I found that worked.

Let's start with the setup. This is an 11.2.0.3 single-instance database with two physical standbys managed with Data Guard (DG). To add a bit of salt to the mix, Fast-Start Failover (FSFO) is enabled. Let's call these our new database friends A (the primary), B (one physical standby) and C (the other physical standby). There is a fourth partner in the party: the DG Observer. This is part of the FSFO architecture, a process that ideally runs on a server that is not hosting any of the databases. The idea of the test was to remove B from the DG setup, upgrade it to 11.2.0.4, downgrade it back to 11.2.0.3 and then put it back in the mix. Easily said, not so easily accomplished. This blog post is a simplification of the whole process, straight to the point and highlighting the possible caveats, in case someone (myself included) faces a requirement like this in the future.

The plan

So, the Pythian machinery started to work. Yes, we are good, not only because we are individually good, but also because we collaborate internally to make things even better, hence "the machinery" term I've just coined. I started with a basic plan:
  • Change the failover target in Data Guard to point to C
  • Disable the B physical standby in the Data Guard configuration
  • Create a guaranteed restore point on B
  • Activate the physical standby
  • Upgrade to 11.2.0.4
  • Downgrade to 11.2.0.3
  • Flashback the database
  • Enable it back into the Data Guard configuration
  • Go get some rest

Get to the point

First things first, and this is a lesson I've learned the hard way: always use TNS to connect to the Data Guard CLI, dgmgrl. Why? Most of the time you will just be reviewing or changing things, but when it comes to executing a switchover, a bequeath connection cannot reconnect to the database that is restarted during the role change, and the operation fails.
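To make the difference concrete, here is a minimal sketch (using the same A alias as in the rest of this post; the actual TNS entries will obviously depend on your environment):
# Bequeath (local) connection: fine for reviewing the configuration, but it
# cannot reconnect to an instance that restarts during a switchover:
dgmgrl /

# TNS connection: goes through the listener, so it can reconnect to either
# database as roles change:
dgmgrl sys@A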

Now I start by changing the FSFO target, which initially points to B, so that it points to the C database instead. This requires temporarily disabling FSFO, or you will face the following error:
Error: ORA-16654: fast-start failover is enabled
So, we disable FSFO, change the database configuration and re-enable FSFO:
DGMGRL> disable fast_start failover;
 Disabled.
 DGMGRL> edit database 'A' set property 'FastStartFailoverTarget' = 'C';
 Property "FastStartFailoverTarget" updated
 DGMGRL> enable fast_start failover;
 Error: ORA-16651: requirements not met for enabling fast-start failover
Oops!! What happened here? A quick review of the MOS document "Data Guard Broker - Configure Fast Start Failover, Data Protection Levels and the Data Guard Observer (Doc ID 1508729.1)" showed me that the LogXptMode of the C database was set to ASYNC, while it is required to be SYNC for the database to be eligible as an FSFO target. Let's fix that then:
DGMGRL> edit database C set property 'LogXptMode' = 'SYNC';
 Property "LogXptMode" updated
 DGMGRL> disable fast_start failover;
 Disabled.
 DGMGRL> edit database 'A' set property 'FastStartFailoverTarget' = 'C';
 Property "FastStartFailoverTarget" updated
 DGMGRL> enable fast_start failover;
 Enabled.

Right, the first step is completed. I will now force a few archived logs over to the standby databases, just to make sure that everything is up to date before I disable the B database.

And here comes another piece of advice: enable time and timing in SQL*Plus wherever you are working. It gives your notes context and lets you easily trace back your work in case something goes wrong. Yes, I learned this one the hard way, too.

sys@A> set time on timing on
 01:38:01 sys@A> alter system archive log current;
 
 System altered.
 
 Elapsed: 00:00:00.29
 01:38:13 sys@A> alter system archive log current;
 
 System altered.
 
 Elapsed: 00:00:00.92
 01:38:14 sys@A> alter system archive log current;
 
 System altered.
 
 Elapsed: 00:00:01.19
 01:38:15 sys@A> alter system checkpoint;
 
 System altered.
 
 Elapsed: 00:00:00.18
 
 
And now I modify the DG configuration.
oracle@serverA> dgmgrl
 DGMGRL for Linux: Version 11.2.0.3.0 - 64bit Production
 
 Copyright (c) 2000, 2009, Oracle. All rights reserved.
 
 Welcome to DGMGRL, type "help" for information.
 DGMGRL> connect sys@A
 Password:
 Connected.
 DGMGRL> show database 'B';
 
 Database - B
 
 Role: PHYSICAL STANDBY
 Intended State: APPLY-ON
 Transport Lag: 0 seconds
 Apply Lag: 0 seconds
 Real Time Query: OFF
 Instance(s):
 B
 
 Database Status:
 SUCCESS
 
 DGMGRL> EDIT DATABASE 'B' SET STATE='APPLY-OFF';
 Succeeded.
 DGMGRL> disable database 'B';
 Disabled.
 DGMGRL> show configuration
 
 Configuration - fsfo_A
 
 Protection Mode: MaxAvailability
 Databases:
 A - Primary database
 C - (*) Physical standby database
 Error: ORA-16820: fast-start failover observer is no longer observing this database
 
 B - Physical standby database (disabled)
 
 Fast-Start Failover: ENABLED
 
 Configuration Status:
 ERROR

Wait, what? Another problem? This one was harder to spot. It turned out to be a problem with the Observer: it was unable to connect to the C database because the proper credentials were missing. An "ORA-01031: insufficient privileges" error in the Observer log file was the tip-off. Simply adding the credentials to the Oracle Wallet used by the Observer fixed the issue, as I was able to verify from that very same connection:

oracle@observer> mkstore -wrl /home/oracle/wallet/.private -createCredential C sys ************
 Oracle Secret Store Tool : Version 11.2.0.3.0 - Production
 Copyright (c) 2004, 2011, Oracle and/or its affiliates. All rights reserved.
 
 Enter wallet password:
 Create credential oracle.security.client.connect_string7
 
 oracle@observer> dgmgrl /@C (This is not a bequeath connection ;) )
 DGMGRL for Linux: Version 11.2.0.3.0 - 64bit Production
 
 Copyright (c) 2000, 2009, Oracle. All rights reserved.
 
 Welcome to DGMGRL, type "help" for information.
 Connected.
 
 
 DGMGRL> disable fast_start failover
 Disabled.
 DGMGRL> enable fast_start failover
 Enabled.
 DGMGRL> show configuration verbose;
 
 Configuration - fsfo_A
 
 Protection Mode: MaxAvailability
 Databases:
 A - Primary database
 C - (*) Physical standby database
 B - Physical standby database (disabled)
 
 (*) Fast-Start Failover target
 
 Properties:
 FastStartFailoverThreshold = '30'
 OperationTimeout = '30'
 FastStartFailoverLagLimit = '30'
 CommunicationTimeout = '180'
 FastStartFailoverAutoReinstate = 'TRUE'
 FastStartFailoverPmyShutdown = 'TRUE'
 BystandersFollowRoleChange = 'ALL'
 
 Fast-Start Failover: ENABLED
 
 Threshold: 30 seconds
 Target: C
 Observer: observer
 Lag Limit: 30 seconds (not in use)
 Shutdown Primary: TRUE
 Auto-reinstate: TRUE
 
 Configuration Status:
 SUCCESS
 
 DGMGRL> exit
At this point, we have B out of the Data Guard equation and can proceed with the upgrade/downgrade part.

Upgrade to 11.2.0.4 and downgrade to 11.2.0.3

In order to run the test, I have to activate the standby database so I can open it as a primary and execute the upgrade/downgrade process. This is why I removed it from the DG configuration as a first step: to avoid the serious trouble of having two primary databases enabled.
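Before activating anything, it does not hurt to double-check the current role of B. A quick sanity check (a generic query, not part of the original session) would be:
-- On B: expect PHYSICAL STANDBY and MOUNTED at this point.
SELECT name, database_role, open_mode FROM v$database;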

So I start by creating a Guaranteed Restore Point (GRP) to easily revert the database back to its standby role. To create the GRP, redo apply must be stopped, which I have already done.
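If you want to be sure redo apply really is stopped before creating the restore point, a quick look at v$managed_standby (again, a generic check rather than part of my session) will confirm it:
-- No MRP0 process should be listed while redo apply is stopped.
SELECT process, status FROM v$managed_standby;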
2:53:48 sys@B> CREATE RESTORE POINT BEFORE_UPGRADE_11204 GUARANTEE FLASHBACK DATABASE;
 
 Restore point created.
 

Now that the GRP has been created, I activate the standby and proceed with the tests. The DG broker process on the database must be stopped to avoid conflicts with the DG configuration. I also cleared the log_archive_config init parameter to make sure that no redo gets shipped out of this database.

02:54:00 sys@B> alter system set dg_broker_start=false scope=spfile;
 
 System altered.
 
 Elapsed: 00:00:00.04
 
 02:54:24 sys@B> alter system set log_archive_config='' scope=both;
 
 System altered.
 
 Elapsed: 00:00:00.05
 02:54:41 sys@B> shut immediate;
 ORA-01109: database not open
 
 Database dismounted.
 ORACLE instance shut down.
 
 02:55:01 sys@B> startup mount;
 
 ORACLE instance started.
 
 Total System Global Area 8551575552 bytes
 Fixed Size 2245480 bytes
 Variable Size 3607104664 bytes
 Database Buffers 4932501504 bytes
 Redo Buffers 9723904 bytes
 Database mounted.
 
 02:56:25 sys@B> alter database activate standby database;
 
 Database altered.
 
 Elapsed: 00:00:00.58
 02:56:31 sys@B> alter database open;
 
 Database altered.

It is now time to upgrade and downgrade the database. There is plenty of documentation about the process and I encountered no issues, so I will not go through it step by step here.
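For reference only, the standard 11gR2 sequence looks roughly like this; it is a simplified sketch, not a transcript of my run, so follow the upgrade guide and the relevant MOS notes for your exact patch level:
-- Upgrade, from the 11.2.0.4 home:
STARTUP UPGRADE
@?/rdbms/admin/catupgrd.sql
-- restart the database, then run utlu112s.sql, catuppst.sql and utlrp.sql

-- Downgrade, first from the 11.2.0.4 home:
STARTUP DOWNGRADE
@?/rdbms/admin/catdwgrd.sql
SHUTDOWN IMMEDIATE
-- then from the 11.2.0.3 home:
STARTUP UPGRADE
@?/rdbms/admin/catrelod.sql
-- restart and run utlrp.sql to recompile invalid objects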

I'll post the script I used to run catupgrd.sql, just in case it is useful to someone in the future. As you know, catupgrd.sql modifies the data dictionary to adjust it to the new version. Depending on the gap between versions and the size of the data dictionary, it may run for quite a long time. Combine this with the risk of a remote session dropping over VPN or the like, and you will want to make sure that your database session is still there when you come back. Tools like screen or tmux help here, but they may not be available or usable, so I usually rely on nohup and a simple bash script. This particular version is meant to be run after loading the proper Oracle environment variables with oraenv, but you can easily modify it to include that step if you want to schedule the script with cron or similar.
#!/bin/bash
# Run catupgrd.sql as SYSDBA and spool the output to a log file.
# Assumes the Oracle environment (ORACLE_HOME, ORACLE_SID) has already been set with oraenv.

sqlplus /nolog << EOF
conn / as sysdba
set time on timing on trimspool on pages 999 lines 300
alter session set nls_date_format='dd-mm-yyyy hh24:mi:ss';

spool /covisint/user/a_catbpd/11204Upgrade/upgrade_dryrun_output.log
@$ORACLE_HOME/rdbms/admin/catupgrd.sql
spool off;

exit
EOF
Once the script is ready and has execute permissions, simply run it with nohup and some logging in place:
nohup ./run_upgrade.sh > run_upgrade_`date "+%F_%H-%M"`.log 2>&1 &
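While it runs, you can keep an eye on progress by tailing the nohup log, for example:
tail -f run_upgrade_*.log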
The same script can be used to run the catdwgrd.sql and catrelod.sql scripts for the downgrade by simply changing the relevant lines, as shown below.
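As a sketch of what that looks like for the catdwgrd.sql run (the spool file name here is just an example; catrelod.sql is then run the same way from the 11.2.0.3 home):
#!/bin/bash
# Same wrapper, pointed at the downgrade script instead of catupgrd.sql.
# The database must already be started with STARTUP DOWNGRADE from the 11.2.0.4 home.

sqlplus /nolog << EOF
conn / as sysdba
set time on timing on trimspool on pages 999 lines 300
alter session set nls_date_format='dd-mm-yyyy hh24:mi:ss';

spool /covisint/user/a_catbpd/11204Upgrade/downgrade_output.log
@$ORACLE_HOME/rdbms/admin/catdwgrd.sql
spool off;

exit
EOF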

Back to the start point

After the upgrade and downgrade tests are successfully completed, it is time to bring everything back to what it looked like when we started this exercise.

The very first step is to flash back the B standby database to the GRP I created earlier. This is still done using the 11.2.0.4 binaries.
05:03:19 sys@B> flashback database to restore point before_upgrade_11204;
 
 Flashback complete.
 
 Elapsed: 00:00:09.92
 05:03:34 sys@B> alter database convert to physical standby;
 
 Database altered.
 
 Elapsed: 00:00:00.39
 05:04:06 sys@B> shutdown immediate
 ORA-01507: database not mounted
 
 
 ORACLE instance shut down.
After the flashback is complete, we mount the database again, now with the 11.2.0.3 binaries, and start the DG broker process:
05:05:22 > startup nomount
 ORACLE instance started.
 
 Total System Global Area 8551575552 bytes
 Fixed Size 2245480 bytes
 Variable Size 1358957720 bytes
 Database Buffers 7180648448 bytes
 Redo Buffers 9723904 bytes
 05:05:30 > alter system set dg_broker_start=true scope=Both;
 
 System altered.
 
 Elapsed: 00:00:00.01
 05:05:38 > alter database mount standby database;
 
 Database altered.
 
 Elapsed: 00:00:05.25
 
 Once the database is back up and ready to apply redo, I can enable it back in the DG configuration:
 
 oracle@serverA> dgmgrl
 DGMGRL for Linux: Version 11.2.0.3.0 - 64bit Production
 
 Copyright (c) 2000, 2009, Oracle. All rights reserved.
 
 Welcome to DGMGRL, type "help" for information.
 DGMGRL> connect sys@A
 Password:
 Connected.
 DGMGRL> show configuration
 
 Configuration - fsfo_A
 
 Protection Mode: MaxAvailability
 Databases:
 A - Primary database
 C - (*) Physical standby database
 B - Physical standby database (disabled)
 
 Fast-Start Failover: ENABLED
 
 Configuration Status:
 SUCCESS
 
 
 DGMGRL> enable database B
 Enabled.
 
 
 DGMGRL> EDIT DATABASE 'B' SET STATE='APPLY-ON';
 Succeeded.
 
 
 DGMGRL> show database B
 
 Database - B
 
 Role: PHYSICAL STANDBY
 Intended State: APPLY-ON
 Transport Lag: 0 seconds
 Apply Lag: 1 hour(s) 11 minutes 49 seconds  <== We have some lag here 
 Real Time Query: OFF
 Instance(s):
 B
 
 Database Status:
 SUCCESS
 
 After a few minutes, the standby is back in sync with the primary:
 
 
 DGMGRL> show database B
 
 Database - B
 
 Role: PHYSICAL STANDBY
 Intended State: APPLY-ON
 Transport Lag: 0 seconds
 Apply Lag: 0 seconds  <== It caught up quite quickly
 Real Time Query: OFF
 Instance(s):
 B
 
 Database Status:
 SUCCESS
I now set B again as the FSFO target and validate the final setup.
DGMGRL> DISABLE FAST_START FAILOVER
 Disabled.
 DGMGRL> edit database 'A' set property 'FastStartFailoverTarget' = 'B';
 Property "FastStartFailoverTarget" updated
 DGMGRL> edit database 'B' set property 'FastStartFailoverTarget' = 'A';
 Property "FastStartFailoverTarget" updated
 DGMGRL> edit database 'C' set property 'FastStartFailoverTarget' = '';
 Property "FastStartFailoverTarget" updated
 DGMGRL> ENABLE FAST_START FAILOVER
 Enabled.
 DGMGRL> show database A FastStartFailoverTarget;
 FastStartFailoverTarget = 'B'
 DGMGRL> show database B FastStartFailoverTarget;
 FastStartFailoverTarget = 'A'
 DGMGRL> show database C FastStartFailoverTarget;
 FastStartFailoverTarget = ''
 
 DGMGRL> show fast_start failover
 
 Fast-Start Failover: ENABLED
 
 Threshold: 30 seconds
 Target: B
 Observer: observer
 Lag Limit: 30 seconds (not in use)
 Shutdown Primary: TRUE
 Auto-reinstate: TRUE
 
 Configurable Failover Conditions
 Health Conditions:
 Corrupted Controlfile YES
 Corrupted Dictionary YES
 Inaccessible Logfile NO
 Stuck Archiver NO
 Datafile Offline YES
 
 Oracle Error Conditions:
 (none)
 
 DGMGRL>
 DGMGRL> show configuration
 
 Configuration - fsfo_A
 
 Protection Mode: MaxAvailability
 Databases:
 A - Primary database
 B - (*) Physical standby database
 C - Physical standby database
 
 Fast-Start Failover: ENABLED
 
 Configuration Status:
 SUCCESS
Don't forget to drop the GRP on B or you may have a little alarm to deal with later :)
DROP RESTORE POINT BEFORE_UPGRADE_11204;
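A quick way to confirm nothing was left behind (a generic check, not from my session):
-- Should return no rows once the GRP has been dropped.
SELECT name, guarantee_flashback_database FROM v$restore_point;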

Final thoughts

This was an interesting exercise for several reasons.

First, it gave me the chance to test a procedure before actually executing it in a production environment. This is always a good thing. No matter how many times you have done something, there is always a slight change, something done differently in a given installation, or a bug hiding in the bushes. Testing helps you prepare for what may come and keeps surprises to a minimum.

It also gives you deeper familiarity with the environment you are working on, and confidence while running the process in production. As consultants, we may not work frequently on a given customer's systems, and getting to know the environment makes the task easier.

Another reason I liked this exercise is that I, once more, got support from my co-workers here at Pythian (special thanks to my team members). Throw an idea into a Slack channel and they will come back with more ideas, experiences, caveats and whatnot, making the task more enjoyable and the result better.

If four eyes see better than two, imagine sixteen or twenty. There will still be room for mistakes and issues, but they will surely be quite rare.
