Consultor Eletrônico



Kbase 19439: Database Corruption in Memory Causing Errors (1124) & (1152)
Author   Progress Software Corporation - Progress
Access   Public
Published   31/01/2002
SUMMARY:

This Knowledge Base article describes how to confirm a problem with information in the UNIX buffer cache and helps you resolve error (1124) when there is neither a hardware failure nor actual data corruption on disk. It also shows that the issue is not caused by Progress. Because the block corruption is in memory (Progress could not read the block that it wrote) and the block points to the wrong address on disk, the problem is related to memory, cache, or hardware.

SYSTEM ERROR: read wrong dbkey at offset <offset> in file <file>
found <dbkey>, expected <dbkey>, retrying. (1152)
SYSTEM ERROR: wrong dbkey in block. Found <dbkey>, should be
<dbkey2> (1124)

EXPLANATION:

Progress posts error (1124) to indicate that a block read from disk does not match that block in the UNIX buffer cache. Most often, this is due to what we consider 'corruption'; that is, the block on disk is no longer correct. Scanning the database with dbrpr would locate this corruption. The Progress user must then isolate the cause, which is typically some type of hardware failure. When the block is an index block rather than a data block, an index test must also be run to locate the corruption that is causing the error.

The UNIX buffer cache would be suspected as a cause for error (1124) when the following conditions are all in effect:

-- A reboot of the machine causes the error to go away.
-- The scans and the index test are clean (no errors).
-- All possible hardware problems have been eliminated.
-- There is no write caching enabled on the disks being used.
-- The LVM structure contains no concatenated or spanned drives.

For example, drives are considered spanned or concatenated if the logical set contains 3 physical drives (A, B, and C) and you specify that all of A is used, then all of B, then all of C. A striped structure would instead be a stripe of A, a stripe of B, a stripe of C, a stripe of A, and so forth.

SOLUTION:

Use the UNIX od command to 'dump' the block in question. Then reboot the system to clear the UNIX buffer cache, and use od again to dump the same block. If there is a problem with the cache, the two outputs are expected to differ.

NOTE: If the system is not dedicated to Progress, this test may not provide conclusive evidence, because the information in the UNIX buffer cache is considered 'dirty' and could be flushed at any time. If the buffer has already been flushed, however, the block is read from disk and the dbkey returned is expected to be the expected dbkey.

The od command shown below uses the syntax found on, for example, IBM AIX and Sun Solaris systems. Refer to your local man page for specific information on od.

Note the dbkey expected and the one actually found in the block.
The dbkey that was expected should be the correct value; the value
found is assumed to be incorrect.

When error (1124) is posted, follow the steps below. You will use system commands to dump a block that should currently be cached in memory, reboot the system to clear the UNIX buffer cache, then dump the block again and compare the output.

1) Determine the physical offset for the block associated with the
expected dbkey within the database.
NOTE that the dbkey in Progress 9 is unique by area, not by
the database itself.

Progress Version 8..
Divide the dbkey by 32 (ignore any remainder) if not using 8K
database blocksizes. Otherwise divide by 64. This is the block
number within the database. For example, with a 1K blocksize,
if the expected dbkey were 240028, then the block number is
(240028/32)=7500.

- Or -

Progress Version 9..
Divide the dbkey by 32 (ignore any remainder) if you're not
using 8K database blocksizes and you *are* using default records
per block in the area. Otherwise, if using 8K blocksizes and
using the default records per block in the area, divide by 64.
If the records per block setting of the area has been changed,
divide by the records per block configured for the storage
area.
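
The division in step 1 can be sketched as a small shell function. This is only an illustration: the name block_number is hypothetical, and you must supply the records-per-block divisor yourself (32, 64, or the value configured for the storage area).

```shell
# Hypothetical helper: block number for a dbkey.
# $1 = dbkey
# $2 = records per block (32, or 64 for 8K blocks, or the value
#      configured for the storage area in Version 9)
block_number() {
    # Shell integer division drops the remainder, as step 1 requires.
    echo $(( $1 / $2 ))
}

block_number 240028 32    # prints 7500
```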

2) Determine the physical byte offset for the block you want to
dump. This will be based on the block number that was
determined by the above instructions, and the database blocksize
for the database. On most UNIX systems the Progress database
blocksize is 1024 but could be 2048, 4096 or 8192.

Single-volume..
The first block in the database starts at offset 0,
so take the block number, subtract 1, and multiply by blocksize.
This is the physical byte offset. For example:
if block number=7500 and database blocksize is 1024..

(7500-1)x(1024)=7678976 = byte-offset.
- Or -

Multi-volume..
You need to identify the block within the database
structure and then the offset relative to the starting block of
that extent. (You can reference your db.st file for the extent
sizes. If you no longer have it, then one can be created using
the PROSTRCT list command.)

For example, if the database is made up of five extents of 2016
blocks each (2015 database blocks plus the extent header), then
the blocks are laid out as follows:

Extent     d1     d2     d3     d4     d5
Start       1   2016   4031   6046   8061
End      2015   4030   6045   8060  10075

Block 7500 would be located in extent 4. Subtract the starting
block of that extent from the block number you are looking for to
make the block number relative to the extent where it is located,
then add one to account for the extent header. This gives you
the relative block number. Multiply it by the blocksize to get
the byte offset. For example:

(7500-6046) + 1 = 1455 = relative block number.
(1455*1024) = 1489920 = .d4 byte-offset.

The calculation in this section may not be necessary if the log
file reveals the byte offset within the (1152) logged error
message as listed below.
The byte offset in this example is 392609792.

01:19:09 Usr 11: SYSTEM ERROR: read wrong dbkey at
offset 392609792 in file /MAUDT1/MUTUFUND_11.d5 found
-278578860, expected 19451008, retrying. (1152)
01:19:19 Usr 11: Corrupt block detected when reading
from database. (4229)
01:19:19 Usr 11: SYSTEM ERROR: wrong dbkey in block.
Found -278578860, should be 19451008 (1124)
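
The single-volume and multi-volume calculations in step 2 can be sketched in shell arithmetic as follows. The function names are hypothetical and the sketch assumes the layout described above: the first block at offset 0 for a single-volume database, and one header block at the start of each extent for a multi-volume database.

```shell
# Hypothetical helpers for step 2; all arguments are decimal integers.

# Single-volume: $1 = block number, $2 = database blocksize
single_volume_offset() {
    echo $(( ($1 - 1) * $2 ))
}

# Multi-volume: $1 = block number, $2 = starting block of the extent,
#               $3 = database blocksize
extent_offset() {
    # The +1 skips the extent header block.
    echo $(( ($1 - $2 + 1) * $3 ))
}

single_volume_offset 7500 1024    # prints 7678976
extent_offset 7500 6046 1024      # prints 1489920
```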

3) Use the OS command od to dump one block (1024 bytes in this
example) starting at the byte-offset value. Refer to the man
page for od for complete details on the command qualifiers.
Essentially, the command below specifies output in hex (-t x),
a starting offset given in decimal (-j), addresses printed in
hex (-A x), and a length of 1 block (-N 1024).

od -t x -j 1489920 -A x -N 1024 db.d4 > block.out

The dbkey of the block is the second field on the first line of
the dump (the first field is the address printed by od).
Convert this value to decimal.
In this example the expected dbkey is 240028, but the value
actually found should be the same as the dbkey found in the
original (1124) message, provided the UNIX filesystem buffer
cache has not changed substantially since the failure.

Here is an example of the first line of a dumped block.

0000000 0128cc80 037f0004 00000000 0000002c

The second field is 0128cc80, which converts to decimal
19451008. This is the expected dbkey, as can be seen from
this log entry:

01:19:19 Usr 11: SYSTEM ERROR: wrong dbkey in block.
Found -278578860, should be 19451008 (1124)

Rename your output file or move it to a new location so when
you repeat step 3, it won't get overwritten.
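
The hex-to-decimal conversion in step 3 can be sketched as a shell function. The name dbkey_from_od_line is hypothetical. The sketch assumes that od printed an address column followed by 4-byte hex words, as in the example line above, and that the shell performs arithmetic in at least 64 bits. Values with the top bit set are reported as negative numbers, which is how the log messages show them.

```shell
# Hypothetical helper: signed decimal dbkey from one line of od output.
# $1 = a line such as "0000000 0128cc80 037f0004 00000000 0000002c"
dbkey_from_od_line() {
    set -- $1                  # split the line into fields
    value=$(( 0x$2 ))          # field 2 is the first word of the block
    # Treat the word as a signed 32-bit number, as the log messages do.
    if [ "$value" -ge 2147483648 ]; then
        value=$(( value - 4294967296 ))
    fi
    echo "$value"
}

dbkey_from_od_line "0000000 0128cc80 037f0004 00000000 0000002c"    # prints 19451008
```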

4) Reboot the system (that is, clear the UNIX buffer cache).

5) Repeat step 3.

After the reboot, the dump is expected to show the correct dbkey; that
is, the expected value, which differs from the one found before the
system was rebooted.
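
The comparison of the two dumps can be sketched with the standard cmp utility. The function name compare_dumps is hypothetical, and the file names you pass in are whatever names you gave the saved od output before and after the reboot.

```shell
# Hypothetical helper: report whether two saved od dumps differ.
# $1 = dump taken before the reboot, $2 = dump taken after it
compare_dumps() {
    # Refuse to guess if either dump is missing or unreadable.
    if [ ! -r "$1" ] || [ ! -r "$2" ]; then
        echo "missing dump file" >&2
        return 2
    fi
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

A result of "different" is what the procedure above expects when the copy of the block in the UNIX buffer cache was the problem. A result of "identical" suggests looking elsewhere.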