Email: [email protected] 7 x 24 online support!

简体中文
English
日本語

ORA-600 [3020] "Stuck Recovery"

Posted by PDSERVICE on Jan 18, 2016 In

If you cannot recover the data by yourself, ask Parnassusdata, the professional ORACLE database recovery team for help.

Parnassusdata Software Database Recovery Team

Service Hotline: +86 13764045638 E-mail: [email protected]

Format: ORA-600 [3020] [a] [b] [c] [d] [e]

VERSIONS:

version 6.0 and above DESCRIPTION:

This is called a 'STUCK RECOVERY'.

There is an inconsistency between the information stored in the redo and the information stored in a database block being recovered.

ARGUMENTS:

For Oracle 9.2 and earlier: Arg [a] Block DBA

Arg [b] Redo Thread Arg [c] Redo RBA Seq

Arg [d] Redo RBA Block No Arg [e] Redo RBA Offset.

For Oracle 10.1

Arg [a] Absolute file number of the datafile. Arg [b] Block number

Arg [c] Block DBA

FUNCTIONALITY:

kernel cache recovery parallel

IMPACT:

INSTANCE FAILURE during recovery.

SUGGESTIONS:

There have been cases of receiving this error when RECOVER has been issued, but either some datafiles were not restored to disk, or the restoration has not finished.

Therefore, ensure that the entire backup has been restored and that the restore has finished PRIOR to issuing a RECOVER database command.

If problems continue, consider restoring from a backup and doing a point-in-time recovery to a time PRIOR to the one implied by

the ORA-600[3020] error.

Example:

SQL> recover database until time 'YYYY-MON-DD:HH:MI:SS'; This error can also be caused by a lost update.

During normal operations, block updates/writes are being performed to a number of files including database datafiles, redo log files, archived redo log files etc.

This error can be reported if any of these updates are lost for some reason.

Therefore, thoroughly check your operating system and disk hardware.

In the case of a lost update, restore an old copy of the datafile and attempt to recover and roll forward again.

If the Known Issues section below does not help in terms of identifying a solution, please submit the trace files and alert.log to Oracle Support Services for further analysis.

Known Issues:

Related Articles

Note:1265884.1 Resolving ORA-752 or ORA-600 [3020] During Standby Recovery

Known Bugs

You can restrict the list below to issues likely to affect one of the following versions by clicking the relevant button:

NB	Bug	Fixed	Description
	9847338		Session hang after applying the patch for Bug 9587912 which causes ORA-600 [30
+	13467683	11.2.0.2.BP15, 11.2.0.3.3, 11.2.0.3.BP04, 12.1.0.0	Join of temp and permanent tables in RAC might cause corruption of permanent ta Regression by bug 10352368
	12831782	11.2.0.2.BP11, 11.2.0.3.BP01, 12.1.0.0	ORA-600 [3020] / ORA-333 Recovery of datafile or async transport do not read mi there is a stale block
	12582839	11.2.0.3, 12.1.0.0	ORA-8103/ORA-600 [3020] on RMAN recovered locally managed tablespace
	11689702	11.2.0.2.5, 11.2.0.2.BP13, 11.2.0.2.GIPSU05, 11.2.0.3, 12.1.0.0	ORA-600 [3020] during recovery after datafile RESIZE (to smaller size)
	10329146	11.2.0.1.BP10, 11.2.0.2.2, 11.2.0.2.BP03, 11.2.0.2.GIBUNDLE02, 11.2.0.2.GIPSU02, 11.2.0.3, 12.1.0.0	Lost write in ASM with multiple DBWs and a disk is offlined and then onlined
	10218814	11.2.0.2.2, 11.2.0.2.BP02, 11.2.0.3, 12.1.0.0	ORA-600 [3020] during recovery / on standby
+	10209232	11.1.0.7.7, 11.2.0.1.BP08, 11.2.0.2.1, 11.2.0.2.BP02, 11.2.0.2.GIBUNDLE01, 11.2.0.3, 12.1.0.0	ORA-1578 / ORA-600 [3020] Corruption. Misplaced Blocks and Lost Write in ASM
*	10205230	11.2.0.1.6, 11.2.0.1.BP09, 11.2.0.2.2, 11.2.0.2.BP04, 11.2.0.3, 12.1.0.0	ORA-600 / corruption possible during shutdown in RAC
	10094823	11.2.0.2.4, 11.2.0.2.BP09, 11.2.0.3, 12.1.0.0	Block change tracking on physical standby can cause data loss
	10071193	11.2.0.2.BP02, 11.2.0.3, 12.1.0.0	Lost write / ORA-600 [kclchkblk_3] / ORA-600 [3020] in RAC - superceded
	9587912	11.2.0.2, 12.1.0.0	ORA-600 [3020] in datafile that went offline/online in a RAC instance
	8774868	11.2.0.1.2, 11.2.0.1.BP06, 11.2.0.2, 12.1.0.0	OERI[3020] reinstating primary
+	8769473	11.2.0.2, 12.1.0.0	ORA-600 [kcbzib_5] on multi block read in RAC. Invalid lock in RAC. ORA-600 [302 Recovery
P	8635179	10.2.0.5, 11.2.0.2, 12.1.0.0	Solaris: directio may be disabled for RAC file access. Corruption / Lost Write
+	8597106	11.2.0.1.BP06, 11.2.0.2, 12.1.0.0	Lost Write in ASM when normal redundancy is used
P	12330911	12.1	EXADATA LSI firmware for lost writes
+	10425010	11.2.0.3, 12.1	Stale data blocks may be returned by Exadata FlashCache
	8826708	10.2.0.5, 11.2.0.2	ORA-600 [3020] for block type 0x3a (58) during recovery for block restored by RM backup
	11684626	11.2.0.1	ORA-600 [3020] on standby involving "BRR" redo when db_lost_write_protect is e
	8230457	10.2.0.4.1, 10.2.0.5, 11.1.0.7.1, 11.2.0.1	Physical standby media recovery gets OERI[krr_media_12]
+	7680907	10.2.0.5, 11.1.0.7.1, 11.2.0.1	ORA-600 [kclexpandlock_2] in LMS / instance crash. Incorrect locks in RAC. ORA-6 [3020] in recovery
	4637668	10.2.0.3, 11.1.0.6	IMU transactions can produce out-of-order redo (OERI [3020] on recovery)
	4594917	9.2.0.8, 10.2.0.2, 11.1.0.6	Write IO error can cause incorrect file header checkpoint information
	4453449	10.2.0.2, 11.1.0.6	OERI:3020 / corruption errors from multiple FLASHBACK DATABASE
	7197445	10.2.0.4.1, 10.2.0.5	Standby Recovery session cancelled due to ORA-600 [3020] "CHANGE IN FUTURE BLOCK"
	5610267	10.2.0.5	MRP terminated by ORA-600[krr_media_12] / OERI:3020 after flashback
	3762714	9.2.0.7, 10.1.0.4, 10.2.0.1	ALTER DATABASE RECOVER MANAGED STANDBY fails with OERI[3020]
	3560209	10.2.0.1	OERI[3020] stuck recovery under RAC
	3397181	9.2.0.5, 10.1.0.3, 10.2.0.1	ALTER SYSTEM KILL SESSION of recovery slave causes stuck recovery
*	3381950	10.2.0.1	Backups from RAC DB before Data Guard Failover cannot be used
	3535712	9.2.0.6, 10.1.0.4	OERI[3020] / ORA-10567 from RAC with standby in max performance mode
	4594912	9.2.0.8, 10.1.0.2	Incorrect checkpoint possible in datafile headers
	3635331	9.2.0.6, 10.1.0.4	Stuck recovery (OERI:3020) / ORA-1172 on startup after a crash
	2322620	9.2.0.1	OERI:3020 possible on recovery of LOB DATA
P+	656370	7.3.3.4, 7.3.4.0, 8.0.3.0	AlphaNT only: Corrupt Redo (zeroed byte) OERI:3020

Note:190263.1

ORA-1172 OR ORA-600[3020] Quick Support Debugging Guide

Given that this error could be due to a lost update to either the datafile and/or the redo files, one thing to do would be to get dumps of both.

Refer to the following notes for information on how to do this :

Note:1031381.6 How to Dump Redo Log File Information

Note:45852.1 Taking BLOCKDUMPS on Oracle8 - The ALTER SYSTEM DUMP command **INTERNAL ONLY**

It is especially useful to focus on the particular datafile block implied by the ORA-600 [3020]. Dump all redo for that block, starting with the log sequence before the restored datafile,

up to the point of failure.

Blockdumps of the datafile should be taken at various stages of the recovery process - for example right after doing the restore; and then again after each redo log file has been applied; and just before the SCN (or point in time) that the ORA-600 was reported; and just after redo

for the given SCN has been applied; and so on.

The idea being that you may narrow down the point at which something went wrong.

ORA-600 [3020] [a] [b] [c] [d] [e]

Versions: 7.0.X - 8.0.5 Source: knl/kcrp.c

===========================================================================

Meaning:

Recovering database and REDO entry has an INC/SEQUENCE number greater than that on the database block.

In Oracle8 where the block structure is different it still means the same basic thing - the redo record we have has an SCN / SEQ which does not match the database block we are wanting to apply it to.

This is called 'STUCK RECOVERY'.

---------------------------------------------------------------------------

Argument Description:

a. Block DBA

b. Redo Thread

c. Redo RBA Seq

d. Redo RBA Block No

e. Redo RBA Offset.

---------------------------------------------------------------------------

Diagnosis:

There are many possible causes for this most resulting from either invalid sets of commands or media corruption.

- Has customer restored a backup, open the DB, closed the DB and then tried to recover without re-loading the backup ??

** If they say no GET THE ALERT LOG and prove it - it's easy to waste a lot of time when this was the real cause.

- If the problem was a lost update, restore of an OLDER copy of the datafile and a recovery may work.

- The quick option here is to restore and recover UP TO an SCN

just before the problem. Customer will lose some data as this is an incomplete recovery so you need to know the priority:

a) TIME or b) Minimal Data Loss.

- Check the tracefile for the 3020 report. It is possible to signal OERI(3020) if the datafile block is corrupt.

Eg: OERI(3020) with Inc=0 Seq=1 reported for the disk block is possibly a zeroed out data-block on the datafile and NOT a redo issue.

- Is parallel server being used ?

If so another thread may have the required changes and they haven't been read for some reason. Check for OS and DLM errors. Try to make sure only ONE instance attempts any recovery by shutting down other instances.

- Are hot backups being used ??

Check that the backups are occuring correctly between BEGIN and END backup commands.

- Up to Oracle 8i you can try to skip the error using the hidden

parameter:CORRUPT_BLOCKS_ON_STUCK_RECOVERY

Be aware that blocks will be marked corrupt if this is used so make sure the error is not on a dictionary object !!

- From 9i you can try to skip the error using the 'ALLOW .. CORRUPTION' clause of the RECOVER DATABASE command.

(Note that in 11g onwards you may need to set DB_LOST_WRITE_PROTECT=NONE for the "ALLOW 1 CORRUPTION" clause to work)

- For logging a bug you need:

(a) Where an error is reported, get any trace files produced and relevant redo log dumps if necessary.

Document completely the circumstances leading

up to the error including configuration; type of backup (manual, RMAN, incremental, etc.);

the exact commands used to create the backups and

the exact commands used to do the restore and recovery.

(b) Provide a reproducible test case or dial-in information to development.

Articles:

Parameter:CORRUPT_BLOCKS_ON_STUCK_RECOVERY

---------------------------------------------------------------------------

Example OERI:3020 dump in Oracle8

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*** 1999.07.02.01.02.58.000

RECOVERY OF THREAD 1 STUCK AT BLOCK 14099 OF FILE 6

REDO RECORD - Thread:1 RBA: 0x0045ee.00009c8b.0010 LEN: 0x00e8 VLD: 0x01 SCN scn: 0x0000.0951d868 07/02/99 00:57:19

CHANGE #2 TYP:2 CLS: 1 AFN:6 DBA:0x01803713 SCN:0x0000.09519e84 SEQ: 1 OP:10.4

buffer tsn: 5 rdba: 0x01803713 (6/14099) scn:0x0000.0951b4d4 seq:0x01 flg:0x00 tail:0xb4d40601

frmt:0x02 chkval:0x0000 type:0x06=trans data

*** 1999.07.02.01.02.58.000

ksedmp: internal or fatal error

ORA-00600: internal error code, arguments:

[3020], [25179923], [1], [17902], [40075], [16], [], []

Breaking this up shows the following SCN information: Redo SCN: 0x0000.0951d868

SCN expected on block: 0x0000.09519e84 SCN on Buffer: 0x0000.0951b4d4

In this case ithe actual SCN marked in the block

in the buffer cache is _later_ than the expected SCN, but _before_

the SCN level for the redo change vector. Normally, the SCN in the CHANGE line must match exactly the one on the block (in the buffer cache);

and redo application brings that block to the (later) SCN/SEQ

on the redo record. One possible explanation is that the system

saw a stale copy of the datafile block when the redo was generated, so that the SCN in the CHANGE line is the wrong one. That would indicate a possible lost update to the datafile.

More commonly, the ORA-600 [3020] error indicates that the SCN on the block is BEHIND the SCN on the redo we want to apply,

so there is a GAP. I.e., the REDO is ahead of the block.

However, in this example there is still a problem even though the block initially appears to be AHEAD of the REDO (normally OK).

Why? The SCN on the block is BELOW the most recent commit SCN.

If we applied the current redo record then the SCN on the block would advance to the more recent commit SCN so if this block is truely

ahead of this redo record it must have an SCN >= the most recent commit SCN. It hasn't, so something is wrong - most likely a lost datafile write which occurred between two items of redo causing

two redo records using the same block SCN to base their change on.

Known issues caused by 3rd party provider

1. Lost IO / Corruption caused by EMC. From JET SR: 3-1260172021. EMC bug ID: emc230687

ID: emc230687

Domain: EMC1 HP-UX 11v1

Solution Class: 3.X Compatibility

ORA-600 [3020] during recovery caused bu LOST IO due to EMC bug ID: emc230687.

No errors raised within the I/O stack at the host level nor from a Timefinder perspective, API ECA debug data void of any anomalies Timefinder w/Oracle best practices process is being adhered to ( recoverable business solution process )

This also caused some corruption errors like:

ORA-00600: internal error code, arguments: [kddummy_blkchk], [29], [2121334], [6108] kdbchk: xaction header lock count mismatch

No errors raised with the Symm as well. Corruption issue resolved by applying fix 44177, see the following Primus article for more i ETA emc204393

2. Lost IO by EMC. EMC solution # is emc251398

Fixed by the latest microcode version 5773.163.113 applied on the Symmetrix DMX (no changes on V-MAX cabins). EMC solution # is emc2

Ensure that this note comes out on top in Metalink when searched ora-600 ora-600 ora-600 ora-600 ora-600 ora-600 ora-600

ora-600 ora-600 ora-600 ora-600 ora-600 ora-600 ora-600 3020 3020 3020 3020 3020 3020 3020 3020 3020 3020

3020 3020 3020 3020 3020 3020 3020 3020 3020 3020

You are here

ORA-600 [3020] "Stuck Recovery"