@ Due to flood all system was shutdown from main switch and storage cache was
@ dirty. To online the disks dirty cache on the storage was cleared. Customer
@ does not have os backups so can not restore. The only option is to mount the
@ diskgroup. ASM instance crashes with the error ORA-00600: internal error
@ code, arguments: [kfrValAcd30], [DATA], [1], [43], [8486], [44], [8486].
@ Customer cannot run anything on it now. AMDU shows only one corrupt block.
@ kfed output shows valid asm header.
Fri Dec 28 07:19:06 2012
NOTE: attached to recovery domain 1
NOTE: starting recovery of thread=1 ckpt=43.8486 group=1 (DATA)
NOTE: starting recovery of thread=2 ckpt=47.24 group=1 (DATA)
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_11865.trc
(incident=32228):
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [43],
[8486], [44], [8486], [], [], [], [], []
Incident details in:
/u01/app/oracle/diag/asm/+asm/+ASM1/incident/incdir_32228/+ASM1_ora_11865_i322
28.trc
Abort recovery for domain 1
NOTE: crash recovery signalled OER-600
ERROR: ORA-600 signalled during mount of diskgroup DATA
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [43],
[8486], [44], [8486], [], [], [], [], []
ERROR: ALTER DISKGROUP ALL MOUNT /* asm agent */
.
.
.
// From the OS log, there are some multipathd errors before ASM start.
.
.
Dec 28 06:32:34 srv-ural multipathd: asm!.asm_ctl_vbg0: add path (uevent)
Dec 28 06:32:34 srv-ural multipathd: asm!.asm_ctl_vbg0: failed to store path
info
Dec 28 06:32:34 srv-ural multipathd: uevent trigger error
Dec 28 06:32:34 srv-ural multipathd: asm!.asm_ctl_vbg1: add path (uevent)
Dec 28 06:32:34 srv-ural multipathd: asm!.asm_ctl_vbg1: failed to store path
info
Dec 28 06:32:35 srv-ural multipathd: uevent trigger error
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg2: add path (uevent)
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg2: failed to store path
info
Dec 28 06:32:35 srv-ural multipathd: uevent trigger error
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg3: add path (uevent)
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg3: failed to store path
info
Dec 28 06:32:35 srv-ural multipathd: uevent trigger error
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg4: add path (uevent)
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg4: failed to store path
info
Dec 28 06:32:35 srv-ural multipathd: uevent trigger error
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg5: add path (uevent)
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg5: failed to store path
info
Dec 28 06:32:35 srv-ural multipathd: uevent trigger error
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg6: add path (uevent)
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg6: failed to store path
info
Dec 28 06:32:35 srv-ural multipathd: uevent trigger error
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg7: add path (uevent)
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg7: failed to store path
info
Dec 28 06:32:35 srv-ural multipathd: uevent trigger error
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg8: add path (uevent)
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg8: failed to store path
info
Dec 28 06:32:35 srv-ural multipathd: uevent trigger error
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg9: add path (uevent)
Dec 28 06:32:35 srv-ural multipathd: asm!.asm_ctl_vbg9: failed to store path
info
node 2 - message
.
Dec 24 21:02:23 srv-aral CLSD: The clock on host srv-aral has been updated by
the Cluster Time Synchronization Service to be synchronous with the mean
cluster time.
Dec 24 21:51:35 srv-aral CLSD: The clock on host srv-aral has been updated by
the Cluster Time Synchronization Service to be synchronous with the mean
cluster time.
Dec 24 22:21:41 srv-aral kernel: qla2xxx 0000:06:00.0: LOOP DOWN detected (2
6 0).
Dec 24 22:21:42 srv-aral kernel: qla2xxx 0000:06:00.1: LOOP DOWN detected (2
6 0).
Dec 27 18:56:08 srv-aral syslogd 1.4.1: restart.
Dec 27 18:56:08 srv-aral kernel: klogd 1.4.1, log source = /proc/kmsg
started.
*** RVARDHIN 12/31/12 01:57 am ***
requested ct to set the below values
.
From
.
blacklist {
# wwid 26353900f02796769
devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
devnode "^hd[a-z][[0-9]*]"
# devnode "^cciss!c[0-9]d[0-9]*"
# device {
# vendor "HP"
# product "OPEN-*"
# }
}
.
.
To,
.
blacklist {
# wwid 26353900f02796769
devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
devnode "^hd[a-z][[0-9]*]"
# devnode "^cciss!c[0-9]d[0-9]*"
# device {
# vendor "HP"
# product "OPEN-*"
# wwid "*"
devnode "^asm/*" # Here <<<.
devnode "ofsctl" # Here <<<
.
# }
}
ct changed the multipatch.conf file. issue still exist
.
/u01/app/11.2.0/grid/bin/crsctl start cluster -all
.
CRS-2672: Attempting to start 'ora.asm' on 'srv-aral'
CRS-2672: Attempting to start 'ora.cssd' on 'srv-ural'
CRS-2672: Attempting to start 'ora.diskmon' on 'srv-ural'
CRS-2676: Start of 'ora.diskmon' on 'srv-ural' succeeded
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [43],
[8486], [44], [8486], [], [], [], [], []
CRS-2674: Start of 'ora.asm' on 'srv-aral' failed
CRS-2679: Attempting to clean 'ora.asm' on 'srv-aral'
CRS-2681: Clean of 'ora.asm' on 'srv-aral' succeeded
CRS-2676: Start of 'ora.cssd' on 'srv-ural' succeeded
CRS-2672: Attempting to start 'ora.ctssd' on 'srv-ural'
CRS-2676: Start of 'ora.ctssd' on 'srv-ural' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'srv-ural'
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [43],
[8486], [44], [8486], [], [], [], [], []
CRS-2674: Start of 'ora.asm' on 'srv-ural' failed
CRS-2679: Attempting to clean 'ora.asm' on 'srv-ural'
CRS-2681: Clean of 'ora.asm' on 'srv-ural' succeeded
CRS-2673: Attempting to stop 'ora.ctssd' on 'srv-ural'
CRS-2677: Stop of 'ora.ctssd' on 'srv-ural' succeeded
CRS-2673: Attempting to stop 'ora.diskmon' on 'srv-ural'
CRS-2677: Stop of 'ora.diskmon' on 'srv-ural' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'srv-ural'
CRS-2677: Stop of 'ora.cssd' on 'srv-ural' succeeded
- AMDU report shows only one Block as corrupted
158 ---------------------------- SCANNING DISK N0001
-----------------------------
159 Disk N0001: '/dev/oracleasm/disks/DISK1'
160 AMDU-00209: Corrupt block found: Disk N0001 AU [33] block [91] type [0]
161 AMDU-00201: Disk N0001: '/dev/oracleasm/disks/DISK1'
162 Allocated AU's: 93584
163 Free AU's: 954991
164 AU's read for dump: 165
165 Block images saved: 30383
166 Map lines written: 165
167 Heartbeats seen: 0
168 Corrupt metadata blocks: 1
169 Corrupt AT blocks: 0
- ORA 600 kfrValAcd30 is hit since ckpt ABA and current block ABA is not
matching as expected.
ASM1 alert log shows:
16747 ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1],
[43], [8486], [44], [8486], [], [], [], [], []
- It doesn't look like the ABA is wrong here. It is having expected value as
per ASM alert logs. (Still have to see why the assert is triggered)
ASM1 alert log shows:
16546 Thu Dec 20 18:14:17 2012
16547 NOTE: attached to recovery domain 1
16548 NOTE: starting recovery of thread=1 ckpt=43.8486 group=1 (DATA) <===
ABA value reported in ORA 600
16549 NOTE: advancing ckpt for thread=1 ckpt=43.8486
16550 NOTE: cache recovered group 1 to fcn 0.814009
16551 Thu Dec 20 18:14:17 2012
16552 NOTE: LGWR attempting to mount thread 1 for diskgroup 1 (DATA)
16553 NOTE: LGWR found thread 1 closed at ABA 43.8485
16554 NOTE: LGWR mounted thread 1 for diskgroup 1 (DATA)
16555 NOTE: LGWR opening thread 1 at fcn 0.814009 ABA 44.8486 <=== ABA value
reported in ORA 600
16556 NOTE: cache mounting group 1/0x0C886DF4 (DATA) succeeded
16557 NOTE: cache ending mount (success) of group DATA number=1
incarn=0x0c886df4
- Though ABA is right as per alert logs, ASM throws assert. We need to check
why so.
.
SQL> alter diskgroup all mount;
alter diskgroup all mount
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [43],
[8486], [44], [8486], [], [], [], [], []
performed below activity by customer, getting below errors
.
-bash-3.2$ sqlplus / as sysasm
.
SQL*Plus: Release 11.2.0.1.0 Production on Wed Jan 2 13:38:10 2013
.
Copyright (c) 1982, 2009, Oracle. All rights reserved.
.
.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
SQL> alter diskgroup all mount;
alter diskgroup all mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15130: diskgroup "DATA" is being dismounted
ORA-15066: offlining disk "DATA_0000" may result in a data loss
.
.
SQL>
1: Please explain what do you mean by remap here.
2: Lost write means the data is valid but old. ASM uses set of methods (like
magic, checksum, expected values of fields, etc) to validate whether it is
.
corrupted or not. With lost write, it is valid data but stale data. ASM
need not identify it as corrupted.
regarding remap please see:
11gR1 ASM New Feature (Doc ID 429098.1)
remap - You can remap and recover bad blocks on an ASM disk in normal or high
redundancy that have been reported by storage management tools such as disk
scrubbers. ASM reads from the good copy of an ASM mirror and rewrites these
blocks to an alternate location on disk.
.
What is the redundancy on this disk group?
.
Customer requirement outlined here is to mount the disk group and then
recover what data can be recovered at this time. Prabakaran is investigating
if a diagnostic patch can be made to mount the disk group without hitting the
assert.
.
points to note here:
The customer needs to review their backup policy
Looks like ct had external,I have conveyed the same to ct
.
[gridstgrac1 ~]$ kfed read disk1_dd.out | more
kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt: 1 ; 0x003: 0x01
|
|
kfdhdb.grptyp: 1 ; 0x026: KFDGTP_EXTERNAL>>>>
kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname: DATA_0000 ; 0x028: length=9
kfdhdb.grpname: DATA ; 0x048: length=4
kfdhdb.fgname: DATA_0000 ; 0x068: length=9
.
- As per ckpt record, ABA is 43.8486
16 kfbh.fcn.base: 0 ; 0x010: 0x00000000
17 kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
29 kfracdc.ckpt.seq: 43 ; 0x018: 0x0000002b
30 kfracdc.ckpt.blk: 8486 ; 0x01c: 0x00002126
- While mounting a DG, ASM hit below assert:
805 Errors in file
/ade/b/1029623278/oracle/log/diag/asm/+asm/two0one1asm/trace/two0one1asm_ora_1
1194.trc (incident=4939):
806 ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA],
[1], [43], [8486], [44], [8486], [], [], [], [], []
7771 KSEORAASSERTNM6(FALSE,
7772 OERINM("kfrValAcd30"),
7773 kseidn(*thdCtx->grpName_kfrthdCtx),
7774 ksenrg(thdCtx->thread_kfrthdCtx),
7775
ksenrg(thdCtx->currAba_kfrthdCtx.seq_kfraba),
7776
ksenrg(thdCtx->currAba_kfrthdCtx.blk_kfraba),
7777 ksenrg(acdb->aba_kfracdbamd_kfraba),
7778 ksenrg(acdb->aba_kfracdb.blk_kfraba));
- In the previous mount, DG ckpt was 43.8486 which matches with thread
Context
16546 Thu Dec 20 18:14:17 2012
16547 NOTE: attached to recovery domain 1
16548 NOTE: starting recovery of thread=1 ckpt=43.8486 group=1 (DATA)
16549 NOTE: advancing ckpt for thread=1 ckpt=43.8486
16550 NOTE: cache recovered group 1 to fcn 0.814009
.
- Following the recovery, LGWR opened the thread with ABA 44.8486. Followin
it, ASM2 was crashed. Thread 2 recovery for DATA is performed.
16551 Thu Dec 20 18:14:17 2012
16552 NOTE: LGWR attempting to mount thread 1 for diskgroup 1 (DATA)
16553 NOTE: LGWR found thread 1 closed at ABA 43.8485
16554 NOTE: LGWR mounted thread 1 for diskgroup 1 (DATA)
16555 NOTE: LGWR opening thread 1 at fcn 0.814009 ABA 44.8486
16556 NOTE: cache mounting group 1/0x0C886DF4 (DATA) succeeded
16557 NOTE: cache ending mount (success) of group DATA number=1
incarn=0x0c886df4
16574 Thu Dec 20 18:17:24 2012
16575 Reconfiguration started (old inc 4, new inc 6)
16576 List of instances:
16577 1 (myinst: 1)
16574 Thu Dec 20 18:17:24 2012
16575 Reconfiguration started (old inc 4, new inc 6)
16598 NOTE: SMON starting instance recovery for group DATA domain 1
(mounted)
16599 NOTE: F1X0 found on disk 0 au 2 fcn 0.0
16600 NOTE: starting recovery of thread=2 ckpt=45.10601 group=1 (DATA)
16601 NOTE: SMON waiting for thread 2 recovery enqueue
16602 NOTE: SMON about to begin recovery lock claims for diskgroup 1 (DATA)
16603 NOTE: advancing ckpt for thread=2 ckpt=45.10601 <=== recovery is
done
16604 NOTE: SMON did instance recovery for group DATA domain 1
- ASM alert log does not show any activity for next 8 days. In the next mount
(on Dec 28), kfrValAcd30 was triggered.
16621 Reconfiguration complete
16622 Fri Dec 28 07:18:51 2012
16623 NOTE: No asm libraries found in the system
16624 Starting ORACLE instance (normal)
16716 Fri Dec 28 07:18:59 2012
16717 SQL> ALTER DISKGROUP ALL MOUNT /* asm agent */
16718 NOTE: Diskgroup used for Voting files is:
16719 DATA
16740 Fri Dec 28 07:19:06 2012
16741 * allocate domain 1, invalid = TRUE
16742 Fri Dec 28 07:19:06 2012
16743 NOTE: attached to recovery domain 1
16744 NOTE: starting recovery of thread=1 ckpt=43.8486 group=1 (DATA)
16745 NOTE: starting recovery of thread=2 ckpt=47.24 group=1 (DATA)
16746 Errors in file
/u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_11865.trc
(incident=32228):
16747 ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA],
[1], [43], [8486], [44], [8486], [], [], [], [], []
16748 Incident details in:
/u01/app/oracle/diag/asm/+asm/+ASM1/incident/incdir_32228/+ASM1_ora_11865_i322
28.trc
.
- Excerpt from F3 dump is given below
There is no recovery with ABA of
43.8486 where ckpt is made
10 kfbh.hard: 130 ; 0x001: 0x82
11 kfbh.type: 7 ; 0x002: KFBTYP_ACDC
12 kfbh.datfmt: 1 ; 0x003: 0x01
29 kfracdc.ckpt.seq: 43 ; 0x018: 0x0000002b
30 kfracdc.ckpt.blk: 8486 ; 0x01c: 0x00002126
.
3542863 kfbh.block.blk: 8486 ; 0x004: T=0 NUMB=0x2126
3542864 kfbh.block.obj: 3 ; 0x008: TYPE=0x0 NUMB=0x3
3542870 kfracdb.aba.seq: 43 ; 0x000: 0x0000002b
3542871 kfracdb.aba.blk: 8485 ; 0x004: 0x00002125
.
3542942 kfbh.block.blk: 8487 ; 0x004: T=0 NUMB=0x2127
3542943 kfbh.block.obj: 3 ; 0x008: TYPE=0x0 NUMB=0x3
3542949 kfracdb.aba.seq: 44 ; 0x000: 0x0000002c
3542950 kfracdb.aba.blk: 8486 ; 0x004: 0x00002126
.
.