Kbase P4300: Determining if database corruption in memory is causing errors 1124 1152
Author: Progress Software Corporation - Progress
Access: Public
Published: 21/12/2006
Status: Verified
FACT(s) (Environment):
UNIX
Progress 9.x
OpenEdge 10.x
SYMPTOM(s):
Database corruption in memory
SYSTEM ERROR: read wrong dbkey at offset in file found , expected , retrying. (1152)
SYSTEM ERROR: wrong dbkey in block. Found , should be (1124)
No hardware failure
No actual data corruption
CAUSE:
Progress posts a (1124) error to indicate that a block read from disk does not match that block in the UNIX buffer cache. Most often, this is due to what we consider 'corruption'; that is, the block on disk is no longer correct.
Scans of the database using dbrpr would locate this corruption. The Progress user must isolate the cause, which will typically be some type of hardware failure. In cases where the block is an index block rather than a data block, an index test must also be run to locate the corruption that is causing the error.
The UNIX buffer cache would be suspected as the cause of error (1124) when all of the following conditions hold:
- A reboot of the machine causes the error to go away.
- The scans and the index test are clean (no errors).
- All possible hardware problems have been eliminated.
- No write caching is enabled on the disks being used.
- The LVM structure contains no concatenated or spanned drives.
Drives are considered spanned or concatenated if, for example, the logical set contains 3 physical drives (A, B, C) and you specify to use all of A, then all of B, then all of C. A striped structure, by contrast, would be a stripe of A, a stripe of B, a stripe of C, a stripe of A, and so forth.
FIX:
This Solution confirms a problem with information in the UNIX buffer cache and helps to resolve error (1124) where there is no hardware failure nor actual data corruption.
It also shows that the issue is not related to Progress. Since the block corruption was in memory (Progress could not read the block that it wrote) and points to the wrong address on disk, this problem is related to memory, cache or hardware.
The following steps complete this exercise and are described in more detail below:
1) Use the UNIX od command to 'dump' the block in question.
2) Reboot the system to clear the UNIX buffer cache.
3) Use od again to dump the same block. If there is a problem with the cache, then we expect the output to be different.
If the system is not dedicated to Progress, then this test may not provide conclusive evidence, as the information in the UNIX buffer cache is considered 'dirty' and could be flushed at any time. However, in this circumstance, if the buffer was flushed then we would expect to be reading from disk, and the information returned would be the dbkey expected.
The od command shown below uses the syntax found on, for example, IBM AIX and Sun Solaris systems. Refer to your local man page for specific information on od.
Make a note of the dbkey expected and the one actually found in the block. The dbkey that was expected should be the correct value; the value found is assumed to be incorrect.
When error (1124) is posted, please follow the steps below. What you will be doing is using system commands to dump a block which should be currently cached in memory, re-booting the system to clear the UNIX buffer cache, then dumping the block again and comparing the output.
1) Determine the physical offset for the block associated with the expected dbkey within the database. NOTE that the dbkey in Progress 9 is unique by area, not in the database itself.
Progress Version 9:
If the database blocksize is not 8K and the area uses the default records per block, divide the dbkey by 32 (ignore any remainder).
If the database blocksize is 8K and the area uses the default records per block, divide the dbkey by 64.
If the records per block for the area have been changed, divide by the records per block configured for the storage area.
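The division rule above can be sketched in a few lines of Python. This is only an illustration: the function names dbkey_to_block and default_records_per_block are hypothetical, and the sample dbkey 240028 is the one used later in this Solution.

```python
def default_records_per_block(blocksize: int) -> int:
    """Default divisor as stated above: 64 for 8K blocks, 32 otherwise."""
    return 64 if blocksize == 8192 else 32


def dbkey_to_block(dbkey: int, records_per_block: int) -> int:
    """Divide the dbkey by the records per block, ignoring any remainder."""
    if records_per_block <= 0:
        raise ValueError("records_per_block must be positive")
    return dbkey // records_per_block


# dbkey 240028 in an area with default records per block and 1K blocks
print(dbkey_to_block(240028, default_records_per_block(1024)))  # 7500
```

If the storage area was created with a non-default records per block setting, pass that value directly instead of calling default_records_per_block.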
2) Determine the physical byte offset for the block that you want to dump. This will be based on the block number that was determined by the above instructions, and the database blocksize for the database. On most UNIX systems the Progress database blocksize is 1024 but could be 2048, 4096 or 8192.
You need to identify the block within the database structure and then the offset relative to the starting block of that extent.
(You can reference your db.st file for the extent sizes. If you no longer have it, then one can be created using the PROSTRCT list command.)
For example, if the database is made up of five 2016-block extents, then the blocks are laid out as follows:
Extent     d1     d2     d3     d4     d5
Start       1   2016   4031   6046   8061
End      2015   4030   6045   8060  10075
Block 7500 would be located in extent 4. Subtract the starting block of that extent from the block number you are looking for to make the block number relative to the extent where it is located, then add one to account for the extent header. This gives you the relative block. Multiply the relative block by the block size to get the byte offset.
For example:
(7500-6046) + 1 = 1455 = relative block number.
(1455*1024) = 1489920 = .d4 byte-offset.
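The same arithmetic can be sketched in Python for an arbitrary block number. The function name locate_block is hypothetical, and the list of extent starting blocks is taken from the example layout above; the sketch does not check whether the block lies past the end of the last extent.

```python
from bisect import bisect_right
from typing import Sequence, Tuple


def locate_block(block: int, extent_starts: Sequence[int], blocksize: int) -> Tuple[int, int]:
    """Return (extent number, byte offset within that extent) for a block.

    extent_starts[i] is the first block number held in extent i + 1.
    """
    index = bisect_right(extent_starts, block) - 1
    if index < 0:
        raise ValueError("block is below the first extent's starting block")
    # Make the block relative to its extent, then add one for the extent header.
    relative_block = (block - extent_starts[index]) + 1
    return index + 1, relative_block * blocksize


starts = [1, 2016, 4031, 6046, 8061]  # starting blocks of d1..d5 from the example
print(locate_block(7500, starts, 1024))  # (4, 1489920)
```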
The calculation in this section may not be necessary if the log file reveals the byte offset within the logged (1152) error message, as shown below.
The byte offset in this example is 392609792.
01:19:09 Usr 11: SYSTEM ERROR: read wrong dbkey at offset 392609792
in file /<dir>/<dbname> _11.d5 found -278578860, expected 19451008,
retrying. (1152)
01:19:19 Usr 11: Corrupt block detected when reading from database. (4229)
01:19:19 Usr 11: SYSTEM ERROR: wrong dbkey in block. Found -278578860,
should be 19451008 (1124)
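As an illustration, the offset and both dbkeys can be pulled out of a (1152) message with a short Python sketch. The function name parse_1152 and the regular expression are assumptions based only on the message layout shown above; other Progress versions may word or wrap the message differently.

```python
import re

# Pattern modelled on the (1152) message shown above; the layout is an assumption.
PATTERN_1152 = re.compile(
    r"read wrong dbkey at offset (?P<offset>\d+)\s+in file (?P<file>.+?)\s+"
    r"found (?P<found>-?\d+),\s+expected (?P<expected>-?\d+)"
)


def parse_1152(log_text: str) -> dict:
    """Extract offset, file, found dbkey and expected dbkey from a (1152) message."""
    joined = " ".join(log_text.split())  # undo line wrapping in the log
    match = PATTERN_1152.search(joined)
    if match is None:
        raise ValueError("no (1152) message found")
    return {
        "offset": int(match.group("offset")),
        "file": match.group("file"),
        "found": int(match.group("found")),
        "expected": int(match.group("expected")),
    }


LOG = """01:19:09 Usr 11: SYSTEM ERROR: read wrong dbkey at offset 392609792
in file /<dir>/<dbname> _11.d5 found -278578860, expected 19451008,
retrying. (1152)"""

info = parse_1152(LOG)
print(info["offset"], info["found"], info["expected"])
```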
3) Use the OS command od to dump 1024 bytes starting at the byte-offset value. Refer to the man page for od for complete details on the command qualifiers. Essentially, the qualifiers below request output in hex (-t x), skip to a starting offset given in decimal (-j), print addresses in hex (-A x), and limit the output to one block (-N 1024).
od -t x -j 1489920 -A x -N 1024 db.d4 > block.out
The dbkey of the block is the second field on the first line of the dump, that is, the first 4-byte word after the address column. Convert this hex value to decimal.
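Where od is unavailable or its qualifiers differ, a rough Python equivalent can read the same bytes. This is a sketch under assumptions, not a replacement for the documented procedure: the function name read_block_dbkey is hypothetical, the dbkey is taken to be the first 4 bytes of the block (as in the sample dump in this Solution), and the default byte order is big-endian, which should be checked against the platform in use. It performs an ordinary file read, so like od it is expected to be served from the filesystem buffer cache when the block is cached.

```python
import struct


def read_block_dbkey(path: str, byte_offset: int, blocksize: int = 1024,
                     byteorder: str = ">") -> int:
    """Read one block at byte_offset and return its first 4 bytes as a signed int.

    byteorder is a struct prefix: ">" for big-endian (assumed default),
    "<" for little-endian.
    """
    with open(path, "rb") as extent:
        extent.seek(byte_offset)
        block = extent.read(blocksize)
    if len(block) < 4:
        raise ValueError("offset is at or past the end of the file")
    (dbkey,) = struct.unpack(byteorder + "i", block[:4])
    return dbkey
```

For the worked example above, the call would be read_block_dbkey("db.d4", 1489920).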
In this example the correct value would be 240028. However, the value actually found at this point should be the same as the dbkey found in the original (1124) message, because we do not expect the UNIX filesystem buffer cache to have changed substantially since the failure.
Here is an example of the first line of a dumped block.
0000000 0128cc80 037f0004 00000000 0000002c
The second field is 0128cc80, which when converted to decimal is 19451008. This is the expected dbkey, as can be seen from this log entry:
01:19:19 Usr 11: SYSTEM ERROR: wrong dbkey in block.
Found -278578860, should be 19451008 (1124)
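The hex-to-decimal conversion can be sketched in Python as follows. The function name dbkey_from_od_line is hypothetical, and treating the word as a signed 32-bit value is an assumption made so that negative dbkeys, such as the one in the log above, convert to the same form the log uses.

```python
def dbkey_from_od_line(line: str) -> int:
    """Convert the second field of an od output line to a signed 32-bit decimal."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("expected an address column followed by data words")
    value = int(fields[1], 16)
    # Interpret values with the top bit set as negative (two's complement).
    if value >= 0x80000000:
        value -= 0x100000000
    return value


print(dbkey_from_od_line("0000000 0128cc80 037f0004 00000000 0000002c"))  # 19451008
```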
Rename your output file or move it to a new location so that when you repeat step 3, it won't get overwritten.
4) Reboot the system (that is, clear the UNIX buffer cache)
5) Repeat step 3.
What we expect to find now is the correct dbkey; that is, the expected value, which is different from the one found before the system was rebooted.
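The comparison of the two dumps can be summarised in a small Python sketch. The function name interpret_dumps and the wording of the verdicts are illustrative only; they restate the reasoning of this Solution and are not an official diagnostic.

```python
def interpret_dumps(before: int, after: int, expected: int, found: int) -> str:
    """Compare the dbkey dumped before and after the reboot.

    expected and found are the values reported in the (1124) message.
    """
    if before == found and after == expected:
        return "consistent with a UNIX buffer cache problem"
    if before == expected and after == expected:
        # The cached block may already have been flushed before the first dump.
        return "inconclusive: block was already correct before the reboot"
    if after != expected:
        return "block still wrong after reboot: not explained by the cache alone"
    return "inconclusive"


print(interpret_dumps(before=-278578860, after=19451008,
                      expected=19451008, found=-278578860))
```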