Pythian Blog: Technical Track

ASM disk group just will not mount

I recently had an issue involving a damaged ASM disk group. Each attempt to mount the disk group returned an ORA-600 [kfrValAcd30]. There had been network and I/O issues reported in this environment, so some damage was not surprising. The usual tricks using kfed to repair the headers were not working. The GI version was 11.2.0.2, and it smelled like a bug. A client SR with Oracle Support yielded a recommendation to delete the headers of the disks and recreate the disk group. Hmmm... this sounded a bit extreme as a first step. This was the vendor's verbatim response:
"ORA-600 [kfrValAcd30] signaled when the expected change sequence doesn't match with the sequence we find, during the recovery of the diskgroup. This is some kind of inconsistency in the ACD block which is usually caused by a platform specific install/operational issue. It may also happen if there is any IO or storage issue . This should be further investigated by the OS/Storage vendor. Solution The only way to resolve this is by recreating the respective diskgroup. How To Drop and Recreate ASM Diskgroup DOC ID - 563048.1"
The proposed solution in MOS Note 563048.1 starts off with: "Erase ASM metadata information for disk using OS command: !!! YOU HAVE TO BE CAREFUL WHEN TYPING DISK NAME !!! For Unix platforms using 'dd' Unix command. Example:"
[code language="bash"]
dd if=/dev/zero of=/dev/raw/raw13 bs=1048576 count=50
[/code]
Okay, any client following this recommendation has now officially nuked the first 50MB of data on the disk without backing up anything. If possible, it is always a good idea to back up as much of the damaged environment as you can, so that if what you try fails, you can restart from the last known good point.
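Before running a destructive command like the `dd` above, capture the exact region it would overwrite. A minimal sketch, using a plain file to stand in for the device so it is safe to run anywhere (with a real disk you would substitute its path, and only under support supervision):

```shell
# Hypothetical illustration: back up the region the MOS note's dd command wipes.
# A plain file stands in for the disk so this is safe to run; with a real
# device you would use its path (e.g. /dev/raw/raw13) instead.
DISK=./fake-asm-disk.img
BACKUP=./fake-asm-disk.hdr.bak

# Simulate a 60MB disk (skip this step with a real device).
dd if=/dev/zero of="$DISK" bs=1048576 count=60 2>/dev/null

# Save the first 50MB -- the same range the destructive dd would zero out.
dd if="$DISK" of="$BACKUP" bs=1048576 count=50 2>/dev/null

ls -l "$BACKUP"
```

Restoring the saved region is the same command with `if` and `of` reversed.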
To be fair, earlier in the MOS note (in the Goals section), this course of action is qualified with: "Erasing the header using "dd" command is very dangerous operation and need to be done under support supervision and when it [is] confirmed by support that fixing the header is impossible. ... Backup disk header for all member disks in diskgroup"
This MOS note has a purpose: it tackles the problem of dropping a disk group that cannot be dropped with SQL commands. So the note is good; as first-step advice, it is less so, for two reasons. Firstly, no attempt had been made to restore data availability, either with regular restore procedures or by attempting to extract data from the disks. Dropping the disk group as a first step causes all data in it to be lost. Secondly, I was not convinced that the disk group could not be dropped without resorting to wiping the headers when the time came to drop it. I ignored the "dd" advice and tested several other options in a 12.1.0.2 two-node RAC lab environment. ASM disk groups may sometimes have problems due to bugs and, more commonly, due to I/O subsystem failures. The software is generally resilient enough to fix itself, but if a disk group refuses to mount, here are two options to consider.

Option 1: DB has regular RMAN backups

Solution 1 Overview:
  1. Collect some metadata to recreate the diskgroup when we drop it.
  2. Collect some DB structural info to identify which datafiles reside in affected diskgroup.
  3. Restore and recover datafiles previously resident in this diskgroup from RMAN backupset.
  4. Drop and recreate the faulty disk group.

Option 2: DB has NO regular RMAN backups, recent archivelogs are available

Solution 2 Overview:
  1. Same as Option 1, steps 1 and 2. Since we cannot restore datafiles from an RMAN backupset, we will use AMDU to extract them from the disk group.
  2. Use AMDU to report on metadata from the unmountable disk group. View the report produced for corruptions and other problems.
  3. Use AMDU to extract the datafiles from the unmountable disk group.
  4. Rename the datafile in the controlfile to point to the AMDU extracted datafiles, apply recovery and open the database.
  5. Drop and recreate the faulty disk group.

Lab Scenario

There are 2 disk groups named DATA and TEST. The TEST_TBS tablespace is created with one datafile residing in the TEST diskgroup. A test table DBT is created to verify whether data loss has occurred. The new tablespace is backed up.
[code language="sql"]
SQL> create tablespace test_tbs datafile '+TEST' size 10m;

Tablespace created.

SQL> create table dbt tablespace test_tbs as select * from dba_tables;
-- this has 2340 rows in my example

Table created.

RMAN> backup tablespace test_tbs;
input datafile file number=00014 name=+TEST/CDBRAC/DATAFILE/test_tbs.256.912776579
piece handle=+DATA/CDBRAC/BACKUPSET/2016_05_25/nnndf0_tag20160525t130349_0.303.912776631 tag=TAG20160525T130349 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:03
[/code]
I have the name of my datafile, but you should query v$datafile or look at previously collected metadata to identify the datafiles resident in the problematic diskgroup.

Option 1: DB has regular RMAN backups

1. Collect some metadata to recreate the diskgroup when we drop it.
[code language="sql"]
SQL> select group_number, name, compatibility, database_compatibility
     from v$asm_diskgroup where name='TEST';

GROUP_NUMBER NAME         COMPATIBILITY DATABASE_COMPATIBILITY
------------ ------------ ------------- ----------------------
           2 TEST         12.1.0.0.0    10.1.0.0.0

SQL> select path, redundancy from v$asm_disk where group_number=2;

PATH                                                         REDUNDANCY
------------------------------------------------------------ ----------
/dev/asm-disk4                                               UNKNOWN
[/code]
2. Collect some DB structural info to identify which datafiles reside in affected diskgroup.
[code language="sql"]
SQL> select name from v$datafile where name like '+TEST%';

NAME
--------------------------------------------------------------------------------
+TEST/CDBRAC/DATAFILE/test_tbs.256.912785223
[/code]
3. Restore and recover datafiles previously resident in this diskgroup from RMAN backupset.
You will recognize this as a simple RMAN restore and recover. Do not be surprised if this actually works.
[code language="sql"]
RMAN> startup mount;
...
database mounted

RMAN> restore datafile 14;
RMAN> recover datafile 14;
RMAN> alter database open;
[/code]
Querying the DBT test table shows all rows restored.
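The check on the DBT table can be as simple as comparing its row count against the 2340 rows created in the lab setup:
[code language="sql"]
SQL> select count(*) from dbt;

  COUNT(*)
----------
      2340
[/code]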
4. Drop and recreate the faulty disk group.
Drop the disk group from an ASM instance connected as SYSASM, then recreate it using the metadata collected in step 1.
[code language="sql"]
SQL> alter diskgroup test dismount;

Diskgroup altered.

SQL> drop diskgroup test force including contents;

Diskgroup dropped.
[/code]
[code language="sql"]
create diskgroup test external redundancy disk '/dev/asm-disk4'
  attribute 'compatible.asm'='12.1.0.0.0', 'compatible.rdbms'='10.1.0.0.0';
[/code]
You may want to place the restored datafile back into the original diskgroup.
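One way to move the restored datafile back into the recreated disk group is an RMAN copy-and-switch. A minimal sketch, assuming datafile 14 as in this lab (verify the syntax against the RMAN reference for your version before relying on it):
[code language="sql"]
RMAN> sql 'alter database datafile 14 offline';
RMAN> backup as copy datafile 14 format '+TEST';
RMAN> switch datafile 14 to copy;
RMAN> recover datafile 14;
RMAN> sql 'alter database datafile 14 online';
[/code]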

Option 2: DB has NO regular RMAN backups, recent archivelogs are available

1. Same as Option 1 steps 1 and 2.
Since we cannot restore datafiles from an RMAN backupset, we use AMDU to extract them from the disk group. MOS note How to Restore the Database Using AMDU after Diskgroup Corruption (Doc ID 1597581.1) has details about AMDU.
2. Use AMDU to report on metadata from the unmountable disk group.
View the report produced for corruptions or any other issues.
[code language="bash"]
$ amdu -diskstring '/dev/asm-disk4' -dump TEST
amdu_2016_05_25_14_25_22/

$ ls -lrt ./amdu_2016_05_25_14_25_22/
total 52244
-rw-r--r--. 1 oracle oinstall     4240 May 25 14:25 TEST.map
-rw-r--r--. 1 oracle oinstall 53485568 May 25 14:25 TEST_0001.img
-rw-r--r--. 1 oracle oinstall     2981 May 25 14:25 report.txt

$ cat report.txt
-*-amdu-*-
******************************* AMDU Settings ********************************
ORACLE_HOME = /u01/app/12.1.0.2/grid
System name:    Linux
Machine:        x86_64
amdu run:       25-MAY-16 14:25:22
Endianess:      1
...
************************** SCANNING DISKGROUP TEST ***************************
---------------------------- SCANNING DISK N0001 -----------------------------
Disk N0001: '/dev/asm-disk4'

------------------------- SUMMARY FOR DISKGROUP TEST -------------------------
           Allocated AU's: 53
                Free AU's: 5061
       AU's read for dump: 53
       Block images saved: 13058
        Map lines written: 53
          Heartbeats seen: 0
  Corrupt metadata blocks: 0
        Corrupt AT blocks: 0
[/code]
3. Use AMDU to extract the datafiles from the unmountable disk group.
[code language="bash"]
$ amdu -diskstring '/dev/asm-disk4' -extract TEST.256
amdu_2016_05_25_15_30_12/

$ ls -lrth ./amdu_2016_05_25_15_30_12
total 11M
-rw-r--r--. 1 oracle oinstall  11M May 25 15:30 TEST_256.f
-rw-r--r--. 1 oracle oinstall 3.7K May 25 15:30 report.txt
[/code]
4. Rename the datafile in the controlfile to point to the AMDU extracted datafiles, apply recovery and open the database.
[code language="sql"]
SQL> startup mount
SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01157: cannot identify/lock data file 15 - see DBWR trace file
ORA-01110: data file 15: '+TEST/CDBRAC/DATAFILE/test_tbs.256.912785223'

SQL> alter database rename file '+TEST/CDBRAC/DATAFILE/test_tbs.256.912785223'
     to '/home/oracle/extract/amdu_2016_05_25_15_30_12/TEST_256.f';

Database altered.

SQL> recover datafile 15;
Media recovery complete.

SQL> alter database open;

Database altered.
[/code]
Querying the DBT test table shows all rows restored.
5. Same as Option 1 step 4.

Conclusion

In this note, the assumption is that attempts to repair and mount the disk group have failed: the ASM disk headers were damaged, but the actual datafiles stored in the disk group were intact. When there are usable RMAN backups, it is a simple matter to restore the missing files, recover them, and then go about dropping and recreating the damaged disk group. When backups are not available, utilities like AMDU can extract datafiles without requiring the disk group to be mounted.

Notice that the focus was placed on data availability, and the damaged disk groups were the last items to be considered. Dropping and recreating faulty disk groups should not raise concerns about data loss, because by the time you get to that task, the data involved should already be available elsewhere.

Backing up metadata is good practice. Consider using md_backup, or query the v$asm views from your backup scripts so the output is always included in the backup log file.

The advice provided by the support engineer had to be weighed carefully and balanced with common sense. Why risk data loss by wiping headers and forcibly dropping disk groups without attempting to restore data availability first?
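As an illustration, a periodic ASM metadata backup with asmcmd might look like the following. This is a sketch using this lab's disk group names; the backup file location is illustrative:
[code language="bash"]
$ asmcmd md_backup /home/oracle/backup/asm_metadata.bkp -G DATA,TEST
[/code]
The file produced can later be fed to md_restore to recreate the disk group definitions (though not the data they contained).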
