Oracle ASM DISK HEADER CORRUPTION
Environment: 10.2.0.4 ASM on a 2-node RAC. In August 2011 some disks were dropped from an ASM diskgroup using the sqlplus alter diskgroup drop disk command. Because the file descriptors were still held, the customer scheduled downtime on 21 Jan 2012 to restart the ASM instance and release the locks. Only the ASM instance on Node-1 was restarted.
After the restart, two diskgroups (orcl1_DATADG01 and orcl1_ARCHDG01) did not mount. One disk from each diskgroup was reported as missing. In v$asm_disk these disks appeared as CANDIDATE, and kfed read on them showed no ASM metadata header. Only aunum=0 blknum=0 was affected; the rest of the blocks were fine.
The purpose of this bug is to determine what could have corrupted the header block.
DIAGNOSTIC ANALYSIS:
--------------------
Both diskgroups were still mounted on Node-2, and the issue was fixed with the
following steps:
1] Drop the affected disk from diskgroup on Node-2 where it is mounted.
2] Rebalance started and completed. After the rebalance completed, the
diskgroup was dismounted with the following errors:
ERROR: empty ASM disk check aborted, diskgroup (orcl1_DATADG01)
ERROR: ORA-15066 thrown in RBAL for group number 1
ORA-15066: offlining disk "Porcl1_DATADG01_0023" may result in a data loss
3] After this, however, diskgroup orcl1_DATADG01 could be mounted on both
nodes.
4] Restarted the ASM instances on both nodes. All diskgroups mounted on both nodes, and ASM no longer looked for disk 23.
5] On Node-1, created a dummy diskgroup using
'/dev/oracle/orcl1/orcl1_datadg01_62'. Mounted it on both nodes, then
dropped the dummy diskgroup.
6] Added the disk back to orcl1_DATADG01 with power 11. Rebalance completed
without errors.
The affected disks (diskgroup : disk name : path):
orcl1_DATADG01 : orcl1_DATADG01_0023 : /dev/oracle/orcl1/orcl1_datadg01_62
orcl1_ARCHDG01 : orcl1_ARCHDG01_0032 : /dev/oracle/orcl1/orcl1_archdg01_25
We do not have a disk dump of orcl1_archdg01_25 from before the fix, but we do have the dd dump of orcl1_datadg01_62 taken when the issue was seen and the header was lost.
Two affected diskgroups:
orcl1_DATADG01
orcl1_ARCHDG01
- The diskgroups were last mounted successfully at:
asm1:
Tue Aug 30 20:46:42 2011
NOTE: cache mounting group 2/0x729B169F (orcl1_DATADG01) succeeded
SUCCESS: diskgroup orcl1_DATADG01 was mounted
Tue Aug 30 20:46:42 2011
NOTE: cache mounting group 1/0x728B169E (orcl1_ARCHDG01) succeeded
SUCCESS: diskgroup orcl1_ARCHDG01 was mounted
asm2:
Tue Aug 30 20:59:40 2011
NOTE: cache mounting group 1/0x7288AD5B (orcl1_ARCHDG01) succeeded
SUCCESS: diskgroup orcl1_ARCHDG01 was mounted
Tue Aug 30 20:59:40 2011
NOTE: cache mounting group 2/0x7298AD5C (orcl1_DATADG01) succeeded
SUCCESS: diskgroup orcl1_DATADG01 was mounted
- Then ASM1 was restarted, and diskgroups orcl1_ARCHDG01 and orcl1_DATADG01
failed to mount:
Sat Jan 21 09:15:18 2012
NOTE: PST enabling heartbeating (grp 1)
Sat Jan 21 09:15:18 2012
ERROR: diskgroup orcl1_ARCHDG01 was not mounted
NOTE: cache dismounting group 2/0x72978BD8 (orcl1_DATADG01)
NOTE: dbwr not being msg'd to dismount
Sat Jan 21 09:15:18 2012
NOTE: PST enabling heartbeating (grp 2)
Sat Jan 21 09:15:18 2012
ERROR: diskgroup orcl1_DATADG01 was not mounted
NOTE: cache opening disk 1 of grp 3: orcl1_REDO1DG01_0001 path:/dev/oracle/orcl1/orcl1_redo1dg01_01
NOTE: F1X0 found on disk 1 fcn 0.172655
NOTE: cache mounting (not first) group 3/0x72978BD9 (orcl1_REDO1DG01)
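The mount results in excerpts like the ones above can be pulled out of an ASM alert log mechanically. The following is a minimal sketch, assuming the log uses exactly the timestamp and message formats shown in these excerpts; the function name mount_events and the sample text are illustrative only and are not part of any Oracle tool.

```python
import re

# Message formats copied from the alert log excerpts in this note.
MOUNTED = re.compile(r"^SUCCESS: diskgroup (\S+) was mounted")
NOT_MOUNTED = re.compile(r"^ERROR: diskgroup (\S+) was not mounted")
# Timestamp lines look like "Sat Jan 21 09:15:18 2012".
TIMESTAMP = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}$"
)


def mount_events(lines):
    """Yield (timestamp, diskgroup, mounted) for each mount result found.

    The timestamp is the most recent timestamp line seen before the
    message, or None if no timestamp line preceded it.
    """
    current_ts = None
    for raw in lines:
        line = raw.strip()
        if TIMESTAMP.match(line):
            current_ts = line
            continue
        for pattern, mounted in ((MOUNTED, True), (NOT_MOUNTED, False)):
            match = pattern.match(line)
            if match:
                yield current_ts, match.group(1), mounted
                break


if __name__ == "__main__":
    # Hypothetical sample assembled from lines quoted in this note.
    sample = """\
Tue Aug 30 20:46:42 2011
NOTE: cache mounting group 2/0x729B169F (orcl1_DATADG01) succeeded
SUCCESS: diskgroup orcl1_DATADG01 was mounted
Sat Jan 21 09:15:18 2012
NOTE: PST enabling heartbeating (grp 1)
Sat Jan 21 09:15:18 2012
ERROR: diskgroup orcl1_ARCHDG01 was not mounted
"""
    for ts, diskgroup, ok in mount_events(sample.splitlines()):
        print(ts, diskgroup, "mounted" if ok else "NOT mounted")
```

Run against the alert log of each ASM instance, a scan like this gives the last successful mount and the first failed mount per diskgroup, which is the timeline this analysis relies on.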
- The missing disks are:
orcl1_DATADG01 : orcl1_DATADG01_0023 : /dev/oracle/orcl1/orcl1_datadg01_62
orcl1_ARCHDG01 : orcl1_ARCHDG01_0032 : /dev/oracle/orcl1/orcl1_archdg01_25
As per the bug update, the dd output of the missing disk orcl1_DATADG01_0023
shows that at least the first 0x30 bytes are corrupted:
dd if=./orcl1_datadg01_62.dd bs=4k count=1 | hexdump -C
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 2.1049e-05 seconds, 195 MB/s
00000000 53 00 00 00 fd ff ff ff 06 ff 00 00 d8 00 00 00  |S...............|
00000010 00 4a 2a 08 b0 cf ff ff ad 4d 2a 08 30 75 00 00  |.J*......M*.0u..|
00000020 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
00000040 00 00 10 0a 17 00 01 03 50 4f 4c 4e 46 58 4d 31  |........orcl1|
00000050 5f 44 41 54 41 44 47 30 31 5f 30 30 32 33 00 00  |_DATADG01_0023..|
00000060 00 00 00 00 00 00 00 00 50 4f 4c 4e 46 58 4d 31  |........orcl1|
00000070 5f 44 41 54 41 44 47 30 31 00 00 00 00 00 00 00  |_DATADG01.......|
00000080 53 00 00 00 fd ff ff ff 0a ff 00 00 08 00 00 00  |S...............|
There are no errors or activity in the ASM alert.log that suggest anything suspicious. The S, J* and M* characters were not written by ASM. The first byte of an ASM header is kfbh.endian, which should be 0x01 on Linux, but here it is 0x53 ("S"). It appears that something in the operating system or the HBA is overwriting the first 64 bytes of block 0 on some ASM disks. Later versions of ASM can provide disk header backup and restore. The corruption was caused by something outside of Oracle.
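The endian-byte check described above can be applied to a dd image of block 0 with a few lines of Python. This is a minimal sketch, not a replacement for kfed. The expected kfbh.endian value of 0x01 comes from this note. The second check, for an "ORCLDISK" marker at offset 0x20, is an assumption about the usual disk header layout and is not stated in the bug. The function name check_block0 is hypothetical.

```python
EXPECTED_ENDIAN = 0x01        # kfbh.endian value this note gives for Linux
PROVSTR_OFFSET = 0x20         # ASSUMPTION: offset of the provider string
PROVSTR_MARKER = b"ORCLDISK"  # ASSUMPTION: marker normally stored there


def check_block0(block):
    """Return a list of human-readable problems found in an ASM block 0."""
    if len(block) < PROVSTR_OFFSET + len(PROVSTR_MARKER):
        return ["block too short to inspect"]
    problems = []
    if block[0] != EXPECTED_ENDIAN:
        problems.append(
            "kfbh.endian is 0x%02x, expected 0x%02x"
            % (block[0], EXPECTED_ENDIAN)
        )
    marker = block[PROVSTR_OFFSET:PROVSTR_OFFSET + len(PROVSTR_MARKER)]
    if marker != PROVSTR_MARKER:
        problems.append(
            "no %s marker at offset 0x%02x"
            % (PROVSTR_MARKER.decode("ascii"), PROVSTR_OFFSET)
        )
    return problems


if __name__ == "__main__":
    # First 0x40 bytes of the dd dump quoted in this note.
    damaged = bytes.fromhex(
        "53 00 00 00 fd ff ff ff 06 ff 00 00 d8 00 00 00 "
        "00 4a 2a 08 b0 cf ff ff ad 4d 2a 08 30 75 00 00 "
        "02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "
        "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "
    )
    for problem in check_block0(damaged):
        print(problem)
```

To check a real dump, read the first 4096 bytes of the dd image in binary mode and pass them to check_block0. An empty result only means these two fields look sane; it does not prove that the rest of the header is intact.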