Consultor Eletrônico

Determining the causes of S/E 1124 (wrong dbkey in block)

INTRODUCTION:
=============

"SYSTEM ERROR: wrong dbkey in block. Found <dbkey>, should be <dbkey2> (1124)" is very often caused by hardware or operating system problems.

Error 1124 is an indication of database corruption. This very serious error is reported by the database storage manager when database block header validation fails after a disk read. Error 1124 may also be reported in 3 other cases, described below.

EXPLANATION:
============

Each database block has a unique identifier called a "dbkey". The dbkey identifies a block's location within the database. Every database block stored on disk contains a copy of its dbkey in its block header. When the Progress database storage manager reads the block for a particular dbkey from disk, it compares the dbkey in the header of the block that was just read with the dbkey that was requested. If they do not match, error 1124 is reported. The 1124 error is immediately preceded by error 4229 (Corrupt block detected when reading from database).
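
The comparison described above can be illustrated with a minimal sketch. The block layout used here (a 1024-byte block whose first 8 bytes hold a little-endian dbkey) and the names make_block, validate_block and WrongDbkeyError are assumptions made for illustration only; they are not the actual Progress on-disk format or API.

```python
import struct

BLOCK_SIZE = 1024      # assumed block size for this sketch
HEADER_FORMAT = "<q"   # assumed layout: block starts with an 8-byte dbkey
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


class WrongDbkeyError(Exception):
    """Stands in for error 1124 in this sketch."""


def make_block(dbkey, payload=b""):
    """Build a block whose header carries a copy of its own dbkey."""
    header = struct.pack(HEADER_FORMAT, dbkey)
    body = payload[:BLOCK_SIZE - HEADER_SIZE].ljust(BLOCK_SIZE - HEADER_SIZE, b"\0")
    return header + body


def validate_block(block, requested_dbkey):
    """Compare the dbkey found in the header with the dbkey that was requested."""
    (found,) = struct.unpack_from(HEADER_FORMAT, block, 0)
    if found != requested_dbkey:
        raise WrongDbkeyError(
            f"wrong dbkey in block. Found {found}, should be {requested_dbkey} (1124)"
        )
```

In this sketch, a block built with make_block(128) passes validate_block(block, 128), while asking for dbkey 256 and receiving that same block raises the error. That is the situation the storage manager reports when the read returns a damaged block or the wrong block.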

If the dbkeys do not match, it means either that the block has been damaged and does not contain valid data, or that the read operation returned the wrong block. If the block does not contain valid data, then those data are permanently lost and cannot be reconstructed by the crash recovery mechanism of the database.

If a database block has been read successfully from disk into the memory-resident buffer pool, then the database manager validates the block header again whenever a buffer lock is released, after the block has been updated, and before writing the block back to disk. If the validation fails at any of these checks, then the block was damaged while it was in memory. In each case, error 1124 is reported, preceded by error 4232, 4231, or 4230, depending on when the error was detected. The affected block is NOT WRITTEN back to the database.
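
When reading a database log, the error number that immediately precedes 1124 indicates where the validation failed. The following sketch encodes that mapping using the error numbers and message texts given in this note. The names Checkpoint, damaged_in_memory and describe are hypothetical helpers, not part of any Progress tool.

```python
from enum import Enum


class Checkpoint(Enum):
    """Validation points, keyed by the error number that precedes 1124."""
    READ = 4229
    WRITE = 4230
    MODIFY = 4231
    RELEASE = 4232


MESSAGES = {
    Checkpoint.READ: "Corrupt block detected when reading from database",
    Checkpoint.WRITE: "Corrupt block detected when attempting to write a block",
    Checkpoint.MODIFY: "Corrupt block detected when attempting to modify a block",
    Checkpoint.RELEASE: "Corrupt block detected when attempting to release a buffer",
}


def damaged_in_memory(preceding_error):
    """True if the preceding error points at damage inside the buffer pool."""
    return Checkpoint(preceding_error) is not Checkpoint.READ


def describe(preceding_error):
    """Summarize where the block was most likely damaged."""
    checkpoint = Checkpoint(preceding_error)
    if damaged_in_memory(preceding_error):
        where = "in memory"
    else:
        where = "outside the buffer pool (disk or I/O path)"
    return f"{MESSAGES[checkpoint]} ({checkpoint.value}): block damaged {where}"
```

An unknown error number raises ValueError, which keeps the helper from guessing about messages this note does not cover.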

The Progress database storage manager applies the same block header validation in all executables that read from or write to the data extents of a database, e.g., self-service clients, page writers, probkup and prorest. Note that this block header validation is also performed for the on-disk temporary storage used for 4GL TEMP-TABLES.

Error 1124 can occur for many different reasons. In most cases, especially when preceded by error 4229, the cause is external to the Progress database, occurring some time between when the database manager wrote a good block to disk and when it later reads the block again and the header validation fails. Isolating the cause of the problem is often difficult and time consuming. Once the cause has been determined and the problem corrected, the best course of action is to restore the database from backup.

Many actions occur between the database storage manager writing a good block and a subsequent read of the same block. Between these two events, there are many possible points of failure. A simplified sequence of events is as follows:

* The storage manager issues a buffered write request, and the block is copied into the operating system's buffers.
Faulty RAM or an operating system or filesystem bug can cause corruption here.
Replace any newly installed RAM, and check with your OS supplier for the latest OS patch information.

* The operating system then passes the block to a device driver.
A device driver bug can cause corruption here.
Check with your OS supplier for the latest patch information.

* The device driver then passes the block to the disk controller.
A faulty disk controller or a controller firmware bug can cause corruption here.

* The disk controller then transfers the block to the disk, possibly via an external cable.
A faulty disk or cable can cause corruption here.

When the block is read, the same steps happen in reverse. A similar sequence of events occurs when backing up and restoring. Using non-PROGRESS backup utilities (particularly if they are not the standard ones provided with the operating system) introduces another potential point of failure.

In addition, after a block has been read from disk and while it is present in the database manager's buffer pool, a memory shortage may cause the operating system to page the buffer to disk (in the paging file) and retrieve it later, unbeknownst to the storage manager. Errors and corruption can occur during this process as well.

On SCO Unix, UnixWare, Linux, and Solaris Intel systems running on old PC hardware, the BIOS sector translation for DOS drives greater than 1 GB _MUST_ be disabled. If that translation is turned on, the BIOS translates sectors for the benefit of DOS; other operating systems do not need this translation. Turning BIOS sector translation off may greatly reduce, if not eliminate, the 1124 errors and may also greatly improve performance.

Miscellaneous User Experience
==============================
These 'case studies' are provided merely as suggestions on where to
focus your research efforts when trying to troubleshoot the cause of
the 1124 error.

1) During an idxbuild on a system with a defective SCSI cable,
errors occurred at random points in the job, different each
time.

If the problem seems to come and go, then check such
things as terminators, too little cable between
connectors, termination power, and anything else related.

Another customer, on an AIX system with all the disks
connected via SCSI controllers, found that the total
cable length exceeded the specified maximum length
for SCSI. They split off the disks and the 1124 errors went away.

2) The 1124 error occurred on a database that had been cpio'd from
another machine. The customer was able to run the process
on the original machine but not on the copy.

Damage may have been caused when the cpio copy was performed.
Check the size of the database against the ulimit size on
the machine that you copied the database to. cpio will truncate the database at the ulimit size without giving you an error message.

NOTE: The PROGRESS backup utility, probkup, will override the
ulimit size but cpio will not.

3) The machine may have memory problems, or, if you
are using a disk cache, the caching process may be
corrupting the block.

4) Verify that the motherboard's speed is correctly set to match
the CPU's speed.

5) Examine other simultaneous processes accessing the drives.
One customer had a benchmark process that was able to cause the
error. They also found that running multiple index rebuilds on
multiple databases at the same time caused the error. It
seemed that any workload that generated high levels of disk activity
would cause the error.

The customer was able to determine that this problem occurred
only on one particular model of hard drive they were using.
The problem occurred regardless of whether the hard drive was a
master or a slave drive.

The cause was ultimately determined to be flawed disk drive design
and all drives of this particular model were faulty.
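
For case 2 above, the size check against the ulimit can be scripted. The sketch below is a minimal example that assumes a Unix system where Python's standard resource module is available; the function names are hypothetical. Note that resource.getrlimit reports the file size limit in bytes, whereas the shell's ulimit command typically reports it in blocks.

```python
import os
import resource


def fits_under_file_size_limit(size_bytes, limit_bytes):
    """True if a file of size_bytes can be written under the given limit."""
    if limit_bytes == resource.RLIM_INFINITY:
        return True  # no file size limit is in effect
    return size_bytes <= limit_bytes


def file_fits_current_limit(path):
    """Check an existing file against this process's soft file size limit."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_FSIZE)
    return fits_under_file_size_limit(os.path.getsize(path), soft_limit)
```

Run the check on the destination machine, as the user who performs the copy, using the sizes of the source database files. A copy whose size exactly equals the limit is a strong hint that it was truncated.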

Related error messages:

Corrupt block detected when reading from database (4229)
Corrupt block detected when attempting to write a block (4230)
Corrupt block detected when attempting to modify a block (4231)
Corrupt block detected when attempting to release a buffer (4232)
Database block <nbr> has incorrect recid: <nbr> (355)

Progress Software Technical Support Note # 15349