Pythian Blog: Technical Track

How to Fix the Status of the Oracle GI CRS After a Failed Upgrade

Editor's Note: Because our bloggers have lots of useful tips, every now and then we update and bring forward a popular post from the past. Today's post was originally published on April 29, 2019.

Blogger's Note: The information in this post also applies to 19.x upgrades, not just 18.x upgrades.

A couple of weeks ago I was working on a two-node Oracle Grid Infrastructure upgrade from 12.1 to 18.5. Everything went well, and rootupgrade.sh ran successfully on both nodes. The only remaining step was the gridSetup.sh -executeConfigTools command, which failed in the rhprepos upgradeSchema step:

[oracle@node1 /u01/app/18.5.0/grid ]$ ./gridSetup.sh -executeConfigTools -responseFile /tmp/gridresponse.rsp -silent 
 
 ########################################
 # From the upgrade log file :
 ########################################
 INFO: [Apr 9, 2019 3:24:08 PM] Starting 'Upgrading RHP Repository' 
 INFO: [Apr 9, 2019 3:24:08 PM] Starting 'Upgrading RHP Repository' 
 INFO: [Apr 9, 2019 3:24:08 PM] Executing RHPUPGRADE 
 INFO: [Apr 9, 2019 3:24:08 PM] Command /u01/app/18.5.0/grid/bin/rhprepos upgradeSchema -fromversion 12.1.0.2.0 
 INFO: [Apr 9, 2019 3:24:08 PM] ... GenericInternalPlugIn.handleProcess() entered. 
 INFO: [Apr 9, 2019 3:24:08 PM] ... GenericInternalPlugIn: getting configAssistantParmas. 
 INFO: [Apr 9, 2019 3:24:08 PM] ... GenericInternalPlugIn: checking secretArguments. 
 INFO: [Apr 9, 2019 3:24:08 PM] No arguments to pass to stdin 
 INFO: [Apr 9, 2019 3:24:08 PM] ... GenericInternalPlugIn: starting read loop. 
 INFO: [Apr 9, 2019 3:24:11 PM] Completed Plugin named: rhpupgrade 
 INFO: [Apr 9, 2019 3:24:11 PM] ConfigClient.saveSession method called 
 INFO: [Apr 9, 2019 3:24:11 PM] Upgrading RHP Repository failed. 
 INFO: [Apr 9, 2019 3:24:11 PM] Upgrading RHP Repository failed. 
 INFO: [Apr 9, 2019 3:24:11 PM] ConfigClient.executeSelectedToolsInAggregate action performed 
 ...
 INFO: [Apr 9, 2019 3:24:11 PM] Validating state <setup> 
 WARNING: [Apr 9, 2019 3:24:11 PM] [WARNING] [INS-43080] Some of the configuration assistants failed, were cancelled or skipped 
 
 [oracle@node1 ~]$ crsctl query crs activeversion -f 
 Oracle Clusterware active version on the cluster is [18.0.0.0.0]. The cluster upgrade state is [UPGRADE FINAL]. The cluster active patch level is [2532936542].
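If you want to script the check shown above, a small helper can pull the upgrade state out of the crsctl output. This is a minimal sketch, not an Oracle tool: the function names get_upgrade_state and upgrade_is_normal are hypothetical, and it assumes the output keeps the "upgrade state is [...]" wording shown above.

```shell
# Hypothetical helpers (not Oracle tools) for reading the output of
# "crsctl query crs activeversion -f", which is passed in on stdin.

# Print the text inside the brackets after "upgrade state is",
# for example "UPGRADE FINAL" or "NORMAL".
get_upgrade_state() {
  sed -n 's/.*upgrade state is \[\([^]]*\)].*/\1/p'
}

# Succeed (exit status 0) only when the upgrade state is NORMAL.
upgrade_is_normal() {
  [ "$(get_upgrade_state)" = "NORMAL" ]
}

# Example use on a live cluster (commented out so the sketch runs anywhere):
# if ! crsctl query crs activeversion -f | upgrade_is_normal; then
#   echo "Upgrade not finished; rerun gridSetup.sh -executeConfigTools" >&2
# fi
```

A state of UPGRADE FINAL after all rootupgrade.sh runs, as in the transcript above, means the configuration tools have not completed yet.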

I searched My Oracle Support (MOS), but I couldn't find much to help solve the issue, just a lot of bugs related to the Rapid Home Provisioning (RHP) repository.

The root cause was that the MGMTDB was not upgraded to 18.5 (or 19.5) during the upgrade process and remained at version 12.1. As a result, the RHP repository upgrade (rhprepos upgradeSchema) failed when it ran.

I was lucky enough to get on a call with a good friend, Ricardo Gonzalez, who is the product manager for RHP, and together we worked through it. Below is the solution.

The first step is to start the MGMTDB from the 12.1 GI_HOME. In this case the resource was disabled, so it had to be enabled before it would start.

[oracle@node1 ~]$ srvctl start mgmtdb
 PRCR-1079 : Failed to start resource ora.mgmtdb
 CRS-2501: Resource 'ora.mgmtdb' is disabled
 [oracle@node1 ~]$ srvctl enable mgmtdb
 [oracle@node1 ~]$ srvctl start mgmtdb 
 [oracle@node1 ~]$ srvctl status mgmtdb
 Database is enabled
 Instance -MGMTDB is running on node node2
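Because the MGMTDB removal later in this procedure should be run from the first node, it helps to know where the instance actually started (node2 here). The following sketch parses the srvctl status output; get_mgmtdb_node is a hypothetical helper name, and it assumes the "is running on node <name>" line format shown above.

```shell
# Hypothetical helper: read "srvctl status mgmtdb" output on stdin and print
# the node named in the "Instance -MGMTDB is running on node <name>" line.
# Prints nothing if no such line is present (for example, when it is down).
get_mgmtdb_node() {
  awk '/is running on node/ { print $NF }'
}

# Example (commented out): relocate only if the instance is not on node1.
# node=$(srvctl status mgmtdb | get_mgmtdb_node)
# if [ "$node" != "node1" ]; then srvctl relocate mgmtdb -node node1; fi
```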

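Once the MGMTDB is up and running, remove the RHP server resource that was created during the rootupgrade.sh run. Do this from the 18.5 GI_HOME. In my case, the plain srvctl remove rhpserver command failed with PRCT-1470 because it could not connect to the GIMR, so I had to force the removal with -f.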

[root@node2 ~]$ env | grep ORA
 ORACLE_SID=+ASM2
 ORACLE_BASE=/u01/app/oracle
 ORACLE_HOME=/u01/app/18.5.0/grid
 [root@node2 ~]$ srvctl remove rhpserver
 PRCT-1470 : failed to reset the Rapid Home Provisioning (RHP) repository
  PRCT-1011 : Failed to run "mgmtca". Detailed error: [MGTCA-1005 : Could not connect to the GIMR. 
  ORA-01034: ORACLE not available
  ORA-27101: shared memory realm does not exist
  Linux-x86_64 Error: 2: No such file or directory
  Additional information: 4150
  Additional information: -1526109961
  ]
 [root@node2 ~]$ srvctl remove rhpserver -f
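Since this procedure switches between the 12.1 and 18.5 homes, it is easy to run srvctl from the wrong one. A minimal guard, using a hypothetical function named require_home, checks both ORACLE_HOME and which binary PATH actually resolves to. The second check matters because the export PATH=$PATH:$ORACLE_HOME/bin lines used in this post append the home to PATH, so a binary from a home listed earlier in PATH would still be the one that runs.

```shell
# Hypothetical guard: succeed only if ORACLE_HOME equals the expected GI home
# AND the given tool (default: srvctl) resolves to that home's bin directory.
require_home() {
  local expected="$1" tool="${2:-srvctl}" found
  if [ "${ORACLE_HOME:-}" != "$expected" ]; then
    echo "ORACLE_HOME is '${ORACLE_HOME:-unset}', expected '$expected'" >&2
    return 1
  fi
  # An earlier PATH entry can shadow $ORACLE_HOME/bin, so check which
  # binary would actually be executed.
  found=$(command -v "$tool") || { echo "$tool not found in PATH" >&2; return 1; }
  case "$found" in
    "$expected"/bin/*) return 0 ;;
    *) echo "$tool resolves to $found, not $expected/bin" >&2; return 1 ;;
  esac
}

# Example (commented out): try a normal removal first, then force it.
# require_home /u01/app/18.5.0/grid &&
#   { srvctl remove rhpserver || srvctl remove rhpserver -f; }
```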
 

Now that you have removed the RHP service, you need to remove the 12.1 MGMTDB.

Do this from the first node. Although it is possible to do it from another node, Oracle highly recommends using the first node, so if the MGMTDB is running on any other node, relocate it to the first node. Before removing it, stop and disable the ora.crf (Cluster Health Monitor) resource on every node, as shown below.

########################################
 # As root user on BOTH nodes
 ########################################
 #Node 1
 [root@node1 ~]$ export ORACLE_HOME=/u01/app/12.1.0.2/grid
 [root@node1 ~]$ export PATH=$PATH:$ORACLE_HOME/bin
 [root@node1 ~]$ crsctl stop res ora.crf -init
 [root@node1 ~]$ crsctl modify res ora.crf -attr ENABLED=0 -init
 
 #Node 2
 [root@node2 ~]$ export ORACLE_HOME=/u01/app/12.1.0.2/grid
 [root@node2 ~]$ export PATH=$PATH:$ORACLE_HOME/bin
 [root@node2 ~]$ crsctl stop res ora.crf -init
 [root@node2 ~]$ crsctl modify res ora.crf -attr ENABLED=0 -init
 
 ########################################
 # As oracle User on Node 1
 ########################################
 [oracle@node1 ~]$ export ORACLE_HOME=/u01/app/12.1.0.2/grid
 [oracle@node1 ~]$ export PATH=$PATH:$ORACLE_HOME/bin
 [oracle@node1 ~]$ srvctl relocate mgmtdb -node node1 
 [oracle@node1 ~]$ srvctl stop mgmtdb
 [oracle@node1 ~]$ srvctl stop mgmtlsnr
 [oracle@node1 ~]$ srvctl remove mgmtdb -force
 Remove the database _mgmtdb? (y/[n]) y
 ########################################
 ##### Manually remove the MGMTDB files in ASMCMD
 ##### Verify that the files for MGMTDB match your environment before deleting them
 ########################################
 ASMCMD> cd DBFS_DG/_MGMTDB/DATAFILE
 ASMCMD> ls
 SYSAUX.257.879563483
 SYSTEM.258.879563493
 UNDOTBS1.259.879563509
 ASMCMD> rm system.258.879563493
 ASMCMD> rm sysaux.257.879563483
 ASMCMD> rm undotbs1.259.879563509
 ASMCMD> cd ../PARAMETERFILE
 ASMCMD> rm spfile.268.879563627
 ASMCMD> cd ../TEMPFILE
 ASMCMD> rm TEMP.264.879563553
 ASMCMD> cd ../ONLINELOG
 ASMCMD> rm group_1.261.879563549
 ASMCMD> rm group_2.262.879563549
 ASMCMD> rm group_3.263.879563549
 ASMCMD> cd ../CONTROLFILE
 ASMCMD> rm Current.260.879563547
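Before typing rm commands in ASMCMD, it can help to generate the list from a file listing and review it. This sketch is a dry run: it only prints ASMCMD rm lines for paths under the _MGMTDB directory and deletes nothing. The function name mgmtdb_rm_plan is hypothetical, and the commented example assumes asmcmd find prints one full path per line, with directory entries ending in a slash.

```shell
# Hypothetical dry-run helper: read ASM file paths on stdin, one per line,
# and print an ASMCMD "rm" command for each file under the given _MGMTDB
# directory. Nothing is deleted; paths outside the prefix are reported on
# stderr so you can review the plan before running anything.
mgmtdb_rm_plan() {
  local prefix="$1" path
  while IFS= read -r path; do
    case "$path" in
      "") ;;                                  # ignore blank lines
      "$prefix"/*/) ;;                        # ignore directory entries
      "$prefix"/*) printf 'rm %s\n' "$path" ;;
      *) printf 'SKIP (outside %s): %s\n' "$prefix" "$path" >&2 ;;
    esac
  done
}

# Example (commented out); review the output before running any rm in ASMCMD:
# asmcmd find +DBFS_DG/_MGMTDB '*' | mgmtdb_rm_plan +DBFS_DG/_MGMTDB
```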

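With the old MGMTDB deleted, run mdbutil.pl (available from MOS Doc ID 2065175.1) from the 18.5 GI_HOME to create the new MGMTDB. The script also tries to configure CHM on each node as root and prompts for the root password. If that step fails, as it did here, run /tmp/mdbutil.pl --addchm manually as root on each node, then re-enable and start ora.crf.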

########################################
 # As oracle User on Node 1
 ########################################
 [oracle@node1 ~]$ env | grep ORA
 ORACLE_SID=+ASM1
 ORACLE_BASE=/u01/app/oracle
 ORACLE_HOME=/u01/app/18.5.0/grid
 [oracle@node1 ~]$ ./mdbutil.pl --addmdb --target=+DBFS_DG
 mdbutil.pl version : 1.95
 2019-04-14 19:11:48: I Starting To Configure MGMTDB at +DBFS_DG...
 2019-04-14 19:11:53: I Container database creation in progress... for GI 18.0.0.0.0
 2019-04-14 19:20:29: I Plugable database creation in progress...
 2019-04-14 19:22:25: I Executing "/tmp/mdbutil.pl --addchm" on node1 as root to configure CHM.
 root@node1's password:
 2019-04-14 19:23:08: W Not able to execute "/tmp/mdbutil.pl --addchm" on node1 as root to configure CHM.
 2019-04-14 19:23:08: I Executing "/tmp/mdbutil.pl --addchm" on node2 as root to configure CHM.
 root@node2's password:
 2019-04-14 19:23:27: W Not able to execute "/tmp/mdbutil.pl --addchm" on node2 as root to configure CHM.
 2019-04-14 19:23:27: I MGMTDB & CHM configuration done!
 
 ########################################
 # As root user on BOTH nodes
 ########################################
 [root@node1 ~]$ env | grep ORA
 ORACLE_SID=+ASM1
 ORACLE_BASE=/u01/app/oracle
 ORACLE_HOME=/u01/app/18.5.0/grid
 [root@node1 ~]$ /tmp/mdbutil.pl --addchm ##Only if it failed in the mdbutil.pl execution
 [root@node1 ~]$ crsctl modify res ora.crf -attr ENABLED=1 -init
 [root@node1 ~]$ crsctl start res ora.crf -init
 CRS-2672: Attempting to start 'ora.crf' on 'node1'
 CRS-2676: Start of 'ora.crf' on 'node1' succeeded
 
 [root@node2 ~]$ env | grep ORA
 ORACLE_SID=+ASM2
 ORACLE_BASE=/u01/app/oracle
 ORACLE_HOME=/u01/app/18.5.0/grid
 [root@node2 ~]$ /tmp/mdbutil.pl --addchm ##Only if it failed in the mdbutil.pl execution
 [root@node2 ~]$ crsctl modify res ora.crf -attr ENABLED=1 -init
 [root@node2 ~]$ crsctl start res ora.crf -init
 CRS-2672: Attempting to start 'ora.crf' on 'node2'
 CRS-2676: Start of 'ora.crf' on 'node2' succeeded
 
 ########################################
 # As oracle User on Node 1
 ########################################
 [oracle@node1 ~]$ srvctl status MGMTDB
 Database is enabled
 Instance -MGMTDB is running on node tstedbadm01
 [oracle@node1 ~]$ srvctl status mgmtlsnr
 Listener MGMTLSNR is enabled
 Listener MGMTLSNR is running on node(s): tstedbadm01
 [oracle@node1 ~]$ srvctl config MGMTDB
 Database unique name: _mgmtdb
 Database name: 
 Oracle home: <CRS home>
 Oracle user: oracle
 Spfile: +DBFS_DG/_MGMTDB/PARAMETERFILE/spfile.282.1005320705
 Password file: 
 Domain: 
 Start options: open
 Stop options: immediate
 Database role: PRIMARY
 Management policy: AUTOMATIC
 Type: Management
 PDB name: GIMR_DSCREP_10
 PDB service: GIMR_DSCREP_10
 Cluster name: test-clu
 Database instance: -MGMTDB
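To see which nodes still need the manual --addchm step, you can scan the mdbutil.pl output for its W (warning) lines. This is a minimal sketch that assumes the message format shown in the transcript above; the function nodes_needing_addchm and the saved-output path are hypothetical.

```shell
# Hypothetical helper: read mdbutil.pl output on stdin and print each node
# on which it reported it was "Not able to execute ... --addchm" as root.
# Those are the nodes where "mdbutil.pl --addchm" must be run manually.
nodes_needing_addchm() {
  sed -n 's/.* W Not able to execute ".*--addchm" on \([^ ]*\) as root.*/\1/p'
}

# Example (commented out), assuming the mdbutil.pl output was saved to the
# hypothetical file /tmp/mdbutil.out:
# nodes_needing_addchm < /tmp/mdbutil.out
```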

Once the MGMTDB has been recreated, rerun the gridSetup.sh -executeConfigTools command. This time the configuration tools complete, the cluster upgrade state changes from UPGRADE FINAL to NORMAL, and everything runs as expected on version 18.5 (or 19.5).

[oracle@node1 ~]$ env | grep ORA
 ORACLE_SID=+ASM1
 ORACLE_BASE=/u01/app/oracle
 ORACLE_HOME=/u01/app/18.5.0/grid
 [oracle@node1 ~]$ /u01/app/18.5.0/grid/gridSetup.sh -executeConfigTools -responseFile /tmp/gridresponse.rsp -silent 
 Launching Oracle Grid Infrastructure Setup Wizard...
 
 You can find the logs of this session at:
 /u01/app/oraInventory/logs/GridSetupActions2019-04-11_04-07-18PM
 
 Successfully Configured Software.
 
 [oracle@node1 ~]$ crsctl query crs activeversion -f
 Oracle Clusterware active version on the cluster is [18.0.0.0.0]. The cluster upgrade state is [NORMAL]. The cluster active patch level is [2532936542].
 
 [oracle@node1 ~]$ crsctl check cluster -all
 **************************************************************
 node1:
 CRS-4537: Cluster Ready Services is online
 CRS-4529: Cluster Synchronization Services is online
 CRS-4533: Event Manager is online
 **************************************************************
 node2:
 CRS-4537: Cluster Ready Services is online
 CRS-4529: Cluster Synchronization Services is online
 CRS-4533: Event Manager is online
 **************************************************************
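The final health check can also be scripted. This sketch, built around a hypothetical nodes_not_healthy function, reads crsctl check cluster -all output and prints any node that is missing one of the three online messages shown above (CRS-4537, CRS-4529, CRS-4533). It assumes the node-name-followed-by-colon layout shown in the transcript.

```shell
# Hypothetical helper: read "crsctl check cluster -all" output on stdin and
# print every node that does not report all three services online:
# CRS-4537 (CRS), CRS-4529 (CSS) and CRS-4533 (EVM).
# No output means every node reported all three as online.
nodes_not_healthy() {
  awk '
    { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "") }   # trim surrounding blanks
    /^CRS-4537:.*online/ { crs[node] = 1; next }
    /^CRS-4529:.*online/ { css[node] = 1; next }
    /^CRS-4533:.*online/ { evm[node] = 1; next }
    /^[^*]+:$/ { node = substr($0, 1, length($0) - 1); nodes[++n] = node }
    END {
      for (i = 1; i <= n; i++) {
        nd = nodes[i]
        if (!(crs[nd] && css[nd] && evm[nd])) print nd
      }
    }'
}

# Example (commented out):
# crsctl check cluster -all | nodes_not_healthy
```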

I hope this post helps if you ever run into this problem.

Quick note: we were not using the Rapid Home Provisioning feature, and deleting the GIMR database had no impact on the environment. If you are using RHP, I highly recommend contacting Oracle before doing this, to avoid losing the RHP repository.

Oracle also confirmed that this is a bug in the 18.x upgrade process, so hopefully it will be fixed in a future release.

Note: This was originally posted on rene-ace.com.
