Kbase 2784: Bad RECID system errors 18 210 Bug in early PROGRESS
Author: Progress Software Corporation - Progress
Access: Public
Published: 5/10/1998
910826-rgw01
INTRODUCTION
============
This Progress Software Technical Support Knowledgebase entry
explains how to diagnose and fix "index refers to bad recid"
and "attempt to read block" errors caused by rollover of the
last transaction number stored in the master block.
PROGRESS ERROR TEXT:
====================
SYSTEM ERROR: Index refers to bad recid for <filename>, entry
ignored. (18)
SYSTEM ERROR: Attempt to read block XXXX which does not
exist. (210)
BACKGROUND:
===========
In the leaf block (the lowest level) of an index, PROGRESS
stores a compressed version of the index key and the recid of
the record the index entry points to. During a transaction, when
a record is deleted, PROGRESS puts a placeholder in place of
the recid for the deleted record. This placeholder indicates
that the record has been deleted but does not actually delete
the index entry. The reason is that actually deleting the entry
and rearranging the index blocks requires a fair amount of
overhead, and if the transaction needs to be backed out, the
work to be done doubles. Inserting a placeholder saves a lot of
processing time. The placeholder is derived from the last
transaction number stored in the master block of the database:
PROGRESS takes the last transaction number, converts it to a
negative number, and puts it in place of the recid in the index.
The problem comes when the transaction number grows so large
that it rolls over and becomes negative. When PROGRESS negates
this already-negative transaction number to build the
placeholder, the result is positive, so it is no longer
recognized as a placeholder and is not deleted at the end of the
transaction. The next time the index is accessed and the key
associated with this bad placeholder is hit, either the "index
refers to bad recid" error or the "attempt to read block" error
occurs, since this number is not the recid of any real record.
Which error appears depends on whether the incorrect recid falls
within the bounds of the database. (Error 18 appears if it does;
error 210 appears if it does not.)
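The sign flip described above can be reproduced with ordinary
32-bit two's-complement arithmetic. The following Python sketch is
only an illustration under that assumption; the helper names
to_int32, placeholder_for, and looks_like_placeholder are
hypothetical and are not part of PROGRESS.

```python
def to_int32(value):
    """Wrap an integer into the signed 32-bit range, as a 32-bit counter would."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def placeholder_for(last_txn):
    """Negate the last transaction number, as the text describes (assumed 32-bit)."""
    return to_int32(-last_txn)


def looks_like_placeholder(index_entry):
    """Assumption for this sketch: a negative entry marks a deleted record."""
    return index_entry < 0


# Normal case: a positive transaction number yields a negative placeholder.
normal = placeholder_for(1234)

# Rolled-over case, using the transaction number from this article.
broken = placeholder_for(-2114922664)

print(normal, looks_like_placeholder(normal))
print(broken, looks_like_placeholder(broken))
```

In the second case the "placeholder" is a positive number, so it is
indistinguishable from an ordinary recid.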
The transaction number in the master block is a large (32-bit)
number, so one would think that it would be close to impossible
to run that many transactions. But in Version 5 on a heavily
loaded system, the transaction number is not always incremented
by one. Instead, several numbers can be skipped, causing the
transaction number to grow at a faster rate. In Versions 6.2A
through 6.2F, transaction numbers are allocated without skips.
In 6.2G01 and higher, the transaction number is automatically
reset to 1 when it nears the rollover point.
DIAGNOSIS:
==========
When an "index refers to bad recid" or "attempt to read block"
error occurs, the first step is to look at the block number or
recid reported. If the number is very high (beyond the bounds of
the database), restart the server and use promon to check the
last transaction number for the database.
PROGRESS MONITOR Version 5.0
Database: /users/ps/rgw/demo5
1. User Control
2. Locking and Waiting Statistics
3. Block Access
4. Record Locking Table
5. Index Cursors
6. Shared Resources
7. Database Status
8. Shut Down Database
M. Modify Defaults
Q. Quit
Enter your selection: 7
Choose option 7 to display the status of the database. Status
information appears. For example:
Database Status:
Database version number: 2109
Database state: Not modified since last open (2)
Database damaged flags: None (0)
Integrity flags: None (0)
Total number of database blocks: 2097120
Database blocks high water mark: 2097120
Number of empty blocks: 5
Blocks with free space: 18723
Data extents (multi-volume): 0
.bi extents (multi-volume): 0
Structure file (multi-volume):
Last transaction number: -2114922664 <-- **SEE NOTE BELOW
^^^^^^^^^^^
Highest file number defined: 10
Database Language: Language Group 1 (Non-Scandinavian)
Database created (multi-volume): - -
Most recent database open: 07/31/91 16:17
Previous database open: 07/31/91 16:17
Most recent .bi file open: 07/31/91 16:17
Previous .bi file open: - -
** HERE, THE TRANSACTION NUMBER HAS ROLLED OVER.
If the transaction number is not negative, it is not the culprit,
so a fix based on this hypothesis will not help. In that case,
please call Technical Support for another approach to solving the
problem.
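If the status screen has been saved to a text file, the check can
be scripted. The sketch below is a hypothetical helper, not a
PROGRESS tool; it assumes the "Last transaction number:" label
appears exactly as in the sample screen above, and it does not
cover how the promon output is captured.

```python
import re


def last_transaction_number(status_text):
    """Return the 'Last transaction number' value from status text, or None."""
    match = re.search(r"Last transaction number:\s*(-?\d+)", status_text)
    return int(match.group(1)) if match else None


def has_rolled_over(status_text):
    """True when the reported last transaction number is negative."""
    txn = last_transaction_number(status_text)
    return txn is not None and txn < 0


# A few lines copied from the sample status screen in this article.
sample = """\
Database blocks high water mark: 2097120
Last transaction number: -2114922664
Highest file number defined: 10
"""

print(last_transaction_number(sample))
print(has_rolled_over(sample))
```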
CORRECTIVE MEASURES:
====================
Fixing the problem:
Fixing the problem is straightforward: reset the last
transaction number in the master block, then rebuild the indexes.
The first step is to truncate the before-image file (.bi) and
turn off after-imaging. Both before-imaging and after-imaging are
based on the transaction number, so truncate the .bi file and do
an aimage end. Once this is done, make a good backup (or two) of
the database and put it in a safe place.
To reset the last transaction number, use the database repair
utility portion of proutil. The syntax is as follows:
proutil <dbname> -C dbrpr
When you enter this command (replacing <dbname> with the name
of the database) it starts the utility and brings up the
Database Repair Menu. Choose selection 4, Dump Block, and it
prompts you for a dbkey. Enter 32, which is the dbkey of the
master block. This dumps the contents of the master block to a
file named 32.dmp.
The dump file looks somewhat different depending on the
processor your system is based on. First, an example with forward
byte order (Motorola style):
# BLOCK REPAIR UTILITIES
# DATABASE = /users/ps/rgw/demo6.db
# DBKEY = 32
# BLOCK NUM = 1
What we need to fix is the transaction number in the master block.
>0000 0000 0020 017F 0001 0000 0000 0000 0362
>0010 043F 0002 000A 0000 001F FFF0 81F0 D758 <-- This is the
                                    ^^^^ ^^^^     transaction number.
                                                  Converting it to
                                                  decimal shows it has
                                                  rolled over
                                                  (-2114922664 decimal).
>0020 0000 0000 0000 2880 0000 0000 0000 0001
>0030 0000 0000 0000 0000 0000 0000 0000 0000
>0040 2845 4E40 0000 0000 0000 0000 0000 0000
>0050 2845 4E73 0000 0000 0000 0001 0000 0000
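The hex-to-decimal conversion mentioned in the dump annotation can
be checked with a few lines of Python. This is a minimal sketch
that assumes the two words 81F0 D758 form one signed 32-bit value
in forward (big-endian) byte order; the function name
txn_from_dump_words is hypothetical.

```python
def txn_from_dump_words(words, byteorder):
    """Interpret two 4-digit hex words from a dump row as a signed 32-bit integer."""
    raw = bytes.fromhex(words.replace(" ", ""))
    return int.from_bytes(raw, byteorder=byteorder, signed=True)


# The last two columns of row 0010 in the Motorola-style dump.
print(txn_from_dump_words("81F0 D758", "big"))  # -2114922664
```

A negative result confirms that the stored transaction number has
rolled over.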
The transaction number in your database will be different, but
it will be in the last two columns of the row labeled 0010. Use
vi to patch the file so it looks like this:
# BLOCK REPAIR UTILITIES
# DATABASE = /users/ps/rgw/demo6.db
# DBKEY = 32
# BLOCK NUM = 1
>0000 0000 0020 017F 0001 0000 0000 0000 0362
>0010 043F 0002 000A 0000 001F FFF0 0000 0001 <-- Set this
^^^^ ^^^^ value to 1
>0020 0000 0000 0000 2880 0000 0000 0000 0001
>0030 0000 0000 0000 0000 0000 0000 0000 0000
>0040 2845 4E40 0000 0000 0000 0000 0000 0000
>0050 2845 4E73 0000 0000 0000 0001 0000 0000
On an Intel-based machine, the byte order is reversed. This adds
another step to the process.
# BLOCK REPAIR UTILITIES
# DATABASE = /usr7/john/cru.db
# DBKEY = 32
# BLOCK NUM = 1
>0000 2000 0000 017F 5B00 0000 0000 FA6F 0800
>0010 3D08 0200 1E00 0100 80BE 0100 58D7 F081 <-- After the byte swap:
                                    ^^^^ ^^^^     81F0 D758
                                                  (-2114922664 decimal)
>0020 4065 3700 A0B7 3700 1200 0000 EC20 0000
>0030 0000 0000 0000 0000 0000 0000 0000 0000
>0040 E538 DD27 0000 0000 0000 0000 0000 0000
>0050 7C80 DF27 0000 0000 3D00 0000 0000 0000
>0060 0001 1300 0000 0000 0000 0000 0000 0000
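The extra byte-swap step for the Intel-style dump can be sketched
as follows. The sketch assumes the dump lists the four bytes of the
transaction number in reversed (little-endian) order, as the
annotation above indicates; swap_bytes is a hypothetical helper
name.

```python
def swap_bytes(words):
    """Reverse the byte order of the hex words and regroup them in fours."""
    raw = bytes.fromhex(words.replace(" ", ""))
    swapped = raw[::-1].hex().upper()
    return " ".join(swapped[i:i + 4] for i in range(0, len(swapped), 4))


# The last two columns of row 0010 in the Intel-style dump.
dumped = "58D7 F081"
swapped = swap_bytes(dumped)

# Reading the original bytes as little-endian gives the same value directly.
value = int.from_bytes(bytes.fromhex(dumped.replace(" ", "")), "little", signed=True)

print(swapped)  # 81F0 D758
print(value)    # -2114922664
```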
The transaction number in your database will be different, but
it will be in the last two columns of the row labeled 0010. Use
vi to patch the file so it looks like this:
# BLOCK REPAIR UTILITIES
# DATABASE = /usr7/john/cru.db
# DBKEY = 32
# BLOCK NUM = 1
Setting the transaction number back to 1.
>0000 2000 0000 017F 5B00 0000 0000 FA6F 0800
>0010 3D08 0200 1E00 0100 80BE 0100 0100 0000 <-- After the byte swap:
                                    ^^^^ ^^^^     0000 0001
>0020 4065 3700 A0B7 3700 1200 0000 EC20 0000
>0030 0000 0000 0000 0000 0000 0000 0000 0000
>0040 E538 DD27 0000 0000 0000 0000 0000 0000
>0050 7C80 DF27 0000 0000 3D00 0000 0000 0000
>0060 0001 1300 0000 0000 0000 0000 0000 0000
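To double-check a hand edit, the patched row can be generated and
compared with what was typed in vi. The sketch below is
illustrative only: patch_row is a hypothetical helper, it assumes
the row format shown in this article (an offset followed by eight
4-digit hex words, with any trailing annotation removed), and it
does not touch the 32.dmp file itself.

```python
def patch_row(row, new_value, byteorder):
    """Return the dump row with its last two words replaced by new_value."""
    fields = row.split()
    encoded = new_value.to_bytes(4, byteorder=byteorder, signed=True).hex().upper()
    fields[-2:] = [encoded[:4], encoded[4:]]
    return " ".join(fields)


# Forward byte order (Motorola style): 1 is written as 0000 0001.
motorola = patch_row(">0010 043F 0002 000A 0000 001F FFF0 81F0 D758", 1, "big")

# Reversed byte order (Intel style): 1 is written as 0100 0000.
intel = patch_row(">0010 3D08 0200 1E00 0100 80BE 0100 58D7 F081", 1, "little")

print(motorola)
print(intel)
```

The two printed rows should match row 0010 of the patched dumps
shown above for each byte order.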
The last step is to run proutil <dbname> -C dbrpr and pick 5,
Load Block, from the menu and enter 32 at the dbkey prompt.
This loads your patched block back into the database. You can
then quit out, make a backup, rebuild the indexes, restart your
after-imaging, and start the server. The database is fixed.
Progress Software Technical Support Note # 2784