Oracle ASM DISK HEADER CORRUPTION

Posted by PDSERVICE on Nov 09, 2016

The environment is 10.2.0.4 ASM on a 2-node RAC. In August 2011 some disks were dropped from an ASM diskgroup with the sqlplus alter diskgroup drop disk command. Because the file descriptors were still held, the customer scheduled downtime on 21 Jan 2012 to restart the ASM instance and release the locks. The ASM instance was restarted on Node-1 only. After the restart, two diskgroups (orcl1_DATADG01 and orcl1_ARCHDG01) did not mount, and one disk from each diskgroup was reported as missing. In v$asm_disk these disks were shown as CANDIDATE. A kfed read on them shows no ASM metadata header. Only aunum=0 blknum=0 was damaged; the rest of the blocks were fine. The purpose of this bug is to figure out what could have corrupted the header block data.

DIAGNOSTIC ANALYSIS:

--------------------

Both diskgroups were mounted on Node-2, and the issue was fixed with the following steps:

1] Dropped the affected disk from the diskgroup on Node-2, where it was mounted.

2] The rebalance started and completed. After the rebalance completed, the diskgroup was dismounted, failing with:

ERROR: empty ASM disk check aborted, diskgroup (orcl1_DATADG01)

ERROR: ORA-15066 thrown in RBAL for group number 1

ORA-15066: offlining disk "orcl1_DATADG01_0023" may result in a data loss

3] After this, however, diskgroup orcl1_DATADG01 could be mounted on both nodes.

4] Restarted the ASM instances on both nodes. All diskgroups mounted on both nodes, and ASM no longer looked for disk 23.

5] On Node-1, created a dummy diskgroup using '/dev/oracle/orcl1/orcl1_datadg01_62', mounted it on both nodes, then dropped the dummy diskgroup.

6] Added the disk to orcl1_DATADG01 with power 11. The rebalance completed without errors.

The affected disks:

orcl1_DATADG01 : orcl1_DATADG01_0023 : /dev/oracle/orcl1/orcl1_datadg01_62

orcl1_ARCHDG01 : orcl1_ARCHDG01_0032 : /dev/oracle/orcl1/orcl1_archdg01_25

We do not have a disk dump of orcl1_archdg01_25 from before the fix, but we do have a dd dump of orcl1_datadg01_62 taken when the issue was seen and the header was lost.

The two affected diskgroups:

orcl1_DATADG01

orcl1_ARCHDG01

- The diskgroups last mounted successfully at:

asm1:

Tue Aug 30 20:46:42 2011

NOTE: cache mounting group 2/0x729B169F (orcl1_DATADG01) succeeded

SUCCESS: diskgroup orcl1_DATADG01 was mounted

Tue Aug 30 20:46:42 2011

NOTE: cache mounting group 1/0x728B169E (orcl1_ARCHDG01) succeeded

SUCCESS: diskgroup orcl1_ARCHDG01 was mounted

asm2:

Tue Aug 30 20:59:40 2011

NOTE: cache mounting group 1/0x7288AD5B (orcl1_ARCHDG01) succeeded

SUCCESS: diskgroup orcl1_ARCHDG01 was mounted

Tue Aug 30 20:59:40 2011

NOTE: cache mounting group 2/0x7298AD5C (orcl1_DATADG01) succeeded

SUCCESS: diskgroup orcl1_DATADG01 was mounted

- Then ASM1 was restarted, and diskgroups orcl1_ARCHDG01 and orcl1_DATADG01 failed to mount:

Sat Jan 21 09:15:18 2012

NOTE: PST enabling heartbeating (grp 1)

Sat Jan 21 09:15:18 2012

ERROR: diskgroup orcl1_ARCHDG01 was not mounted

NOTE: cache dismounting group 2/0x72978BD8 (orcl1_DATADG01)

NOTE: dbwr not being msg'd to dismount

Sat Jan 21 09:15:18 2012

NOTE: PST enabling heartbeating (grp 2)

Sat Jan 21 09:15:18 2012

ERROR: diskgroup orcl1_DATADG01 was not mounted

NOTE: cache opening disk 1 of grp 3: orcl1_REDO1DG01_0001 path:/dev/oracle/orcl1/orcl1_redo1dg01_01

NOTE: F1X0 found on disk 1 fcn 0.172655

NOTE: cache mounting (not first) group 3/0x72978BD9 (orcl1_REDO1DG01)

- The missing disks are:

orcl1_DATADG01 : orcl1_DATADG01_0023 : /dev/oracle/orcl1/orcl1_datadg01_62

orcl1_ARCHDG01 : orcl1_ARCHDG01_0032 : /dev/oracle/orcl1/orcl1_archdg01_25

As per the bug update, the dd output of the missing disk orcl1_DATADG01_0023 shows that at least the first 0x30 bytes are corrupted:

dd if=./orcl1_datadg01_62.dd bs=4k count=1 | hexdump -C

1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 2.1049e-05 seconds, 195 MB/s
00000000  53 00 00 00 fd ff ff ff  06 ff 00 00 d8 00 00 00  |S...............|
00000010  00 4a 2a 08 b0 cf ff ff  ad 4d 2a 08 30 75 00 00  |.J*......M*.0u..|
00000020  02 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000040  00 00 10 0a 17 00 01 03  50 4f 4c 4e 46 58 4d 31  |........orcl1|
00000050  5f 44 41 54 41 44 47 30  31 5f 30 30 32 33 00 00  |_DATADG01_0023..|
00000060  00 00 00 00 00 00 00 00  50 4f 4c 4e 46 58 4d 31  |........orcl1|
00000070  5f 44 41 54 41 44 47 30  31 00 00 00 00 00 00 00  |_DATADG01.......|
00000080  53 00 00 00 fd ff ff ff  0a ff 00 00 08 00 00 00  |S...............|
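The dump above suggests that the disk and diskgroup names at offsets 0x48 and 0x68 survived even though the start of the block did not. The following is a minimal Python sketch, not part of the original analysis, that decodes those strings from the first 0x80 bytes of the hexdump. The helper names (load_block, c_string) are hypothetical, and reading the two bytes at 0x44 as a little-endian disk number is an assumption based on the value matching disk 23, not something the source states. Note that the raw hex bytes spell a longer prefix than the orcl1 shown in the text column and elsewhere in this post; the sketch prints whatever the bytes decode to.

```python
# First 0x80 bytes of the dumped block, copied from the hexdump above.
DUMP_HEX = """
53 00 00 00 fd ff ff ff 06 ff 00 00 d8 00 00 00
00 4a 2a 08 b0 cf ff ff ad 4d 2a 08 30 75 00 00
02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 10 0a 17 00 01 03 50 4f 4c 4e 46 58 4d 31
5f 44 41 54 41 44 47 30 31 5f 30 30 32 33 00 00
00 00 00 00 00 00 00 00 50 4f 4c 4e 46 58 4d 31
5f 44 41 54 41 44 47 30 31 00 00 00 00 00 00 00
"""


def load_block(hex_text: str) -> bytes:
    """Turn whitespace-separated hex pairs into raw bytes."""
    return bytes.fromhex("".join(hex_text.split()))


def c_string(block: bytes, offset: int, max_len: int = 32) -> str:
    """Read a NUL-terminated ASCII string of at most max_len bytes."""
    raw = block[offset:offset + max_len]
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace")


block = load_block(DUMP_HEX)

# Offsets taken directly from where the names appear in the hexdump.
disk_name = c_string(block, 0x48)
group_name = c_string(block, 0x68)

# ASSUMPTION: the two bytes at 0x44 hold a little-endian disk number.
disk_number = int.from_bytes(block[0x44:0x46], "little")

print("disk name  :", disk_name)
print("group name :", group_name)
print("disk number:", disk_number)
```

If the names and disk number decode cleanly, as they do here, that is consistent with the observation that the damage is confined to the beginning of the block rather than spread across the whole header.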

There are no errors or activities in the ASM alert.log that suggest anything suspicious. The S, J* and M* characters were not written by ASM. The first byte of an ASM header should be "kfbh.endian"; on Linux it should be 0x01, but here it is 0x53 ("S"). It appears that something in the operating system or HBA is overwriting the first 64 bytes of block 0 on some ASM disks. Later versions of ASM can provide disk header backup and restore. The corruption was caused by something outside of Oracle.
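The first-byte check described above can be scripted as a quick screen over dd dumps. This is a minimal sketch under stated assumptions: the function names are hypothetical, the expected value 0x01 is the one the text gives for Linux, and a passing check only means the first byte looks right, not that the whole header is intact.

```python
EXPECTED_ENDIAN = 0x01  # value the text gives for kfbh.endian on Linux


def read_first_block(path: str, size: int = 4096) -> bytes:
    """Read the first block of a dd dump (or a device, given permission)."""
    with open(path, "rb") as handle:
        return handle.read(size)


def endian_byte_ok(block: bytes) -> bool:
    """Return True when the first byte matches the expected endian marker."""
    if not block:
        raise ValueError("empty block")
    return block[0] == EXPECTED_ENDIAN


def describe(block: bytes) -> str:
    """Summarize the first-byte check in one line."""
    status = "plausible" if endian_byte_ok(block) else "suspect"
    return f"first byte 0x{block[0]:02x}: header {status}"


# The first eight bytes of the corrupted dump shown in this post.
corrupted = bytes.fromhex("53000000fdffffff")
print(describe(corrupted))
```

For a real dump, describe(read_first_block("./orcl1_datadg01_62.dd")) would apply the same check to the file used in the dd command earlier in this post.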
