Kbase P20719: Determining if database corruption in memory is causing errors 1124 1152
Author   Progress Software Corporation - Progress
Access   Public
Published   19/05/2009
Status: Verified

SYMPTOM(s):

Determining if database corruption in memory is causing errors 1124 1152

To diagnose database corruption in memory

SYSTEM ERROR: wrong dbkey in block. Found <dbkey1>, should be <dbkey2> (1124)

SYSTEM ERROR: read wrong dbkey at offset <offset> in file <file> found <dbkey>, expected <dbkey>, retrying. (1152)

SYSTEM ERROR: read wrong dbkey at offset <offset> in file <file> found <dbkey>, expected <dbkey>, retrying. (9445)

FACT(s) (Environment):

UNIX
Progress 8.x
Progress 9.x
OpenEdge 10.x
How to use the UNIX od command to dump a block from memory
How to make an octal dump (od)
No apparent evidence of hardware failure
No evidence of data corruption

CAUSE:

Progress posts a (1124) or similar error to indicate that a block read from disk does not match that block in the UNIX buffer cache. Most often, this is due to what we consider 'corruption'; that is, that the block on disk is no longer correct.

Scans of the database using Progress utilities such as:
- proutil -C dbrpr, options 1; 1+4
- proutil -C dbscan # introduced in Progress 9.1C; equivalent to the reporting option of dbrpr
- dbtool, option 3 # available in Progress 9.1D and later
would normally locate this corruption.
The Progress user must isolate the cause, which will typically be some type of hardware failure. Sometimes, for those cases where the block is an index block not a data block, an index test must also be done to locate the corruption that's causing the error.

The UNIX buffer cache would be suspected as a cause for error (1124) or similar when the following conditions are all in effect:

- A reboot of the machine causes the error to go away.
- The scans and test index are clean (no errors).
- All possible hardware problems are eliminated.
- There is no write caching enabled on the disks being used.
- The LVM structure is such that there are no concatenated or spanned drives.

An example of drives that would be considered spanned or concatenated is if there are 3 physical drives -- A, B, C -- in the logical set and you specify to use all of A, then all of B, then all of C. A striped structure would be a stripe of A, stripe of B, stripe of C, stripe of A....

FIX:

This Solution confirms a problem with information in the UNIX buffer cache and helps to resolve error (1124) or similar where there is no hardware failure nor actual data corruption.

It also shows that the issue is not related to Progress. Since the block corruption was in memory (Progress could not read the block that it wrote) and points to the wrong address on disk, this problem is related to memory, cache or hardware.

STEPS:

1) Use the UNIX od command to 'dump' the block in question.
2) Then reboot the system to clear the UNIX buffer cache, and
3) Use od again to dump the same block.
If there is a problem with the cache, then we expect the output to be different. If the system is not dedicated to Progress, then this test may not provide conclusive evidence, as the information in the UNIX buffer cache is considered 'dirty' and could be flushed at any time. However, in this circumstance, if the buffer was flushed then we would expect to be reading from disk, and the information returned would be the dbkey expected.

The od command example described below is specific to IBM AIX and Sun Solaris systems. Please refer to your local man page for specific information on "od".

Make a note of the dbkey expected and the one actually found in the block. The dbkey that was expected should be the correct value; the value found is assumed to be incorrect.

When error (1124) is posted, please follow the steps below. What you will be doing is using system commands to dump a block which should be currently cached in memory, re-booting the system to clear the UNIX buffer cache, then dumping the block again and comparing the output.

1) Determine the physical offset for the block associated with the expected dbkey within the database.

Progress Version 8.
Divide the dbkey by 32, or by 64 if the database uses an 8K blocksize, and ignore any remainder. The result is the block number within the database. For example, with a 1K blocksize, if the expected dbkey were 240028, then the block number is (240028/32)=7500.

Progress Version 9+
As above, except divide the dbkey by the records per block setting.
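
The division in step 1 can be sketched with shell integer arithmetic. The function name dbkey_to_block and its argument order are illustrative only and are not part of any Progress utility; the divisor (32 or 64 for Version 8, or the records per block setting for Version 9 and later) must be supplied by the caller.

```shell
#!/bin/sh
# Hypothetical helper: convert a dbkey to a block number.
# $1 = dbkey, $2 = divisor (records per block), supplied by the caller
dbkey_to_block() {
    # Shell integer division discards the remainder, as the step requires.
    echo $(( $1 / $2 ))
}

# The 1K-blocksize example from the text
dbkey_to_block 240028 32   # prints 7500
```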

2) Determine the physical byte offset for the block to be dumped. This will be based on the block number that was determined by the above instructions, and the database blocksize for the database. On most UNIX systems the Progress database blocksize is 1024 but could be 2048, 4096 or 8192.

Single-volume:
The first block in the database starts at offset 0, so take the block number, subtract 1, and multiply by the blocksize. This is the physical byte offset. For example, if the block number is 7500 and the database blocksize is 1024:
(7500-1) x 1024 = 7678976 = byte-offset.
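
As a quick check of the single-volume arithmetic, the same calculation in shell might look like the sketch below. The function name block_to_offset is an assumption made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: byte offset of a block in a single-volume database.
# $1 = block number (the first block is number 1), $2 = blocksize in bytes
block_to_offset() {
    echo $(( ($1 - 1) * $2 ))
}

# Block 7500 with a 1024-byte blocksize, as in the example above
block_to_offset 7500 1024   # prints 7678976
```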

- Or -

Multi-volume:
Identify the block within the database structure and then the offset relative to the starting block of that extent. (Reference your db.st file for the extent sizes. It is always a good idea to create a current structure snapshot using the PROSTRCT list command.)

For example, if the database is made up of five 2016 KB extents with a 1 KB blocksize, then the blocks are laid out as follows:

Extent    d1     d2     d3     d4     d5
Start      1   2016   4031   6046   8061
End     2015   4030   6045   8060  10075

Block 7500 would be located in extent 4. Subtract the starting block of that extent from the block number you are looking for to make the block number relative to the extent where it is located, then add one to account for the extent header. This gives you the relative block. Multiply this by the blocksize to get the offset. For example:

(7500-6046) + 1 = 1455 = relative block number.
(1455*1024) = 1489920 = .d4 byte-offset.
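
The extent lookup can be sketched as a loop over the extent sizes taken from the structure file. This is a minimal sketch, not a Progress utility: the function name locate_block is hypothetical, and it assumes that every extent gives up exactly one block to its header, which matches the 1 KB example above but should be confirmed for other layouts.

```shell
#!/bin/sh
# Hypothetical helper: find the extent and byte offset of a database block.
# $1 = block number, $2 = blocksize in bytes, remaining args = extent sizes in KB
locate_block() {
    block=$1; blocksize=$2; shift 2
    start=1; n=1
    for size_kb in "$@"; do
        # Assumption: one block of each extent is used by the extent header
        data_blocks=$(( size_kb * 1024 / blocksize - 1 ))
        end=$(( start + data_blocks - 1 ))
        if [ "$block" -le "$end" ]; then
            # Relative block = distance from the extent's first block, plus the header
            relative=$(( block - start + 1 ))
            echo "extent=$n relative_block=$relative byte_offset=$(( relative * blocksize ))"
            return 0
        fi
        start=$(( end + 1 )); n=$(( n + 1 ))
    done
    echo "block $block is beyond the listed extents" >&2
    return 1
}

# Five 2016 KB extents with a 1 KB blocksize, as in the example above
locate_block 7500 1024 2016 2016 2016 2016 2016
# prints: extent=4 relative_block=1455 byte_offset=1489920
```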

The calculation in this section may not be necessary if the log file reveals the byte offset within the (1152) logged error message as listed below. The byte offset in this example is 392609792.

01:19:09 Usr 11: SYSTEM ERROR: read wrong dbkey at offset 392609792 in file /<dir>/<db>_11.d5 found -278578860, expected 19451008, retrying. (1152)
01:19:19 Usr 11: Corrupt block detected when reading from database. (4229)
01:19:19 Usr 11: SYSTEM ERROR: wrong dbkey in block. Found -278578860, should be 19451008 (1124)
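
When the log already contains the offset, the values can be pulled out of the message with sed instead of being copied by hand. The sketch below assumes the message layout shown above; the function name extract_offset is illustrative, and the pattern accepts any message number in the trailing parentheses so that the otherwise identical (9445) message is matched too.

```shell
#!/bin/sh
# Hypothetical helper: extract offset, file and dbkeys from "read wrong dbkey" log lines.
# Reads the named log file(s), or standard input when no file is given.
extract_offset() {
    sed -n 's/.*read wrong dbkey at offset \([0-9][0-9]*\) in file \([^ ]*\) found \(-\{0,1\}[0-9][0-9]*\), expected \(-\{0,1\}[0-9][0-9]*\), retrying\. ([0-9][0-9]*).*/offset=\1 file=\2 found=\3 expected=\4/p' "$@"
}

# Demonstration with the log line quoted in the text
printf '%s\n' '01:19:09 Usr 11: SYSTEM ERROR: read wrong dbkey at offset 392609792 in file /<dir>/<db>_11.d5 found -278578860, expected 19451008, retrying. (1152)' | extract_offset
# prints: offset=392609792 file=/<dir>/<db>_11.d5 found=-278578860 expected=19451008
```

Lines that do not match the pattern, such as the (4229) and (1124) messages, produce no output.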

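3) Use the OS command od to dump one block (1024 bytes in this example) starting at the byte-offset value. Refer to the man page for od for complete details on the command qualifiers. In the command below, -t x requests hexadecimal output, -j gives the starting byte offset in decimal, -A x prints the addresses in hexadecimal, and -N limits the output to the number of bytes in one block.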

od -t x -j 1489920 -A x -N 1024 db.d4 > block.out
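
Because the -N value must match the database blocksize, it can help to wrap the command so that the file, offset and blocksize are passed in together. The wrapper below is only a sketch: dump_block is a hypothetical name, the options are the same ones used in the command above, and the output file names in the comment are placeholders.

```shell
#!/bin/sh
# Hypothetical wrapper around the od command used in this step.
# $1 = extent file, $2 = byte offset (decimal), $3 = blocksize in bytes
dump_block() {
    od -t x -j "$2" -A x -N "$3" "$1"
}

# Example invocation (placeholder file names), one dump per pass:
#   dump_block db.d4 1489920 1024 > block.before
#   ... reboot ...
#   dump_block db.d4 1489920 1024 > block.after
```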

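The dbkey of the block is the first 4-byte word of the dump, which appears as the second field on the first output line (the first field is the address). Convert this hexadecimal value to decimal.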

In this example the expected value is 240028, but the value actually found should be the same as the dbkey reported as found in the original (1124) message, provided that the UNIX filesystem buffer cache has not changed substantially since the failure.

Here is an example of the first line of a dumped block.
0000000 0128cc80 037f0004 00000000 0000002c

The second field on the line (the first 4 bytes of the block) is 0128cc80, which converts to decimal 19451008. This is the expected dbkey, as can be seen from this log entry:

01:19:19 Usr 11: SYSTEM ERROR: wrong dbkey in block. Found -278578860, should be 19451008 (1124)
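
The hexadecimal-to-decimal conversion can be done in the shell. The sketch below uses a hypothetical function name, hex_to_dbkey, and makes two assumptions: the shell evaluates arithmetic with at least 64-bit integers, and the log prints dbkeys as signed 32-bit numbers (suggested by the negative "found" value in the messages above), so values of 2^31 or more are shown as negative.

```shell
#!/bin/sh
# Hypothetical helper: convert an 8-digit hex word from the dump to the
# decimal form used in the log messages.
# $1 = hex digits without a 0x prefix, for example 0128cc80
hex_to_dbkey() {
    value=$(( 0x$1 ))
    # Assumption: the log shows dbkeys as signed 32-bit values
    if [ "$value" -ge 2147483648 ]; then
        value=$(( value - 4294967296 ))
    fi
    echo "$value"
}

# The word from the sample dump line
hex_to_dbkey 0128cc80   # prints 19451008
```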

Rename your output file or move it to a new location so that when you repeat step 3, it won't get overwritten.

4) Reboot the system (that is, clear the UNIX buffer cache)

5) Repeat step 3.

What we expect to find now is the correct dbkey; that is, the expected value, which is different from the one found before the system was rebooted.
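
The comparison of the two dumps can also be scripted. This is a minimal sketch under stated assumptions: compare_dumps is a hypothetical name, both files were produced by the od command from step 3, and the dbkey is the second field of the first line (after the address). The file names in the comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: compare the dbkey word of two od dumps.
# $1 = dump taken before the reboot, $2 = dump taken after the reboot
compare_dumps() {
    before=$(awk 'NR==1 {print $2}' "$1")
    after=$(awk 'NR==1 {print $2}' "$2")
    if [ "$before" = "$after" ]; then
        echo "before=$before after=$after result=same"
    else
        echo "before=$before after=$after result=different"
    fi
}

# Example invocation (placeholder file names):
#   compare_dumps block.before block.after
```

A result of "different", with the word from the second dump matching the expected dbkey, is the outcome the article describes for a buffer cache problem.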