Kbase P42834: How to troubleshoot index corruption?
Author   Progress Software Corporation - Progress
Access   Public
Published   21/07/2005
Status: Verified

GOAL:

How to troubleshoot index corruption?

GOAL:

How to identify the cause of index corruption?

GOAL:

Troubleshooting errors 1422 and 2699

FIX:

The important thing to remember when receiving one of the following errors:
"SYSTEM ERROR: Index in for RECID could not be deleted. (1422)",
"SYSTEM ERROR: proixdel failed for word indexing of array (2699)"

or a similar error relating to indexes, is that the error itself usually has very little to do with the actual cause of the issue. The index corruption is detected and reported at the time the record is accessed, but the real damage has usually occurred earlier.

The first thing that needs to be determined is *when* this actual damage occurs. Once that is known, it can be determined *how* the damage is done - the root cause of the issue.
Note that under some circumstances the errors can be misleading, and no actual index corruption exists in the database. In these cases, the guidelines provided in this Solution can still help identify the cause of the error.

1) To determine when the damage happens, apply the following steps:

1.1) Establish which indexes (and consequently which tables) are affected by running PROUTIL -C <IDXtool of choice>

** If different indexes on different tables are reported at this time, this indicates hardware failure as a possible cause. This becomes important when determining how the damage is done (see below).

** It can also happen that no actual index corruption is reported in the database at this point. In this case the investigation should focus on possible configuration issues and the interaction between clients and the database, and the steps related to monitoring the database can be skipped.

1.2) Create a clean baseline by repairing the existing index corruption using PROUTIL -C <IDXtool of choice>.
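
As an illustration of steps 1.1 and 1.2, typical invocations might look like the sketch below, assuming a hypothetical database /db/proddb. Depending on the Progress version, these utilities may require the database to be offline or may support online operation; check the documentation for the version in use.

    # Step 1.1 - report which indexes contain corruption
    proutil /db/proddb -C idxcheck

    # Step 1.2 - repair the reported corruption; IDXFIX presents a menu
    # from which the scan and fix options for the affected indexes are chosen
    proutil /db/proddb -C idxfix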

1.3) Schedule a PROUTIL -C IDXFIX operation on the affected index(es) to be run periodically and as frequently as possible, in order to narrow down when the corruption occurs and to minimize the impact of the corruption in the production environment. This should be done at least daily to obtain a sufficiently narrow result.

** This step also serves to minimize downtime while the corruption has not yet been resolved.
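
As a sketch only, the periodic check could be driven from cron, assuming a hypothetical wrapper script /scripts/idxfix_scan.sh that runs the chosen PROUTIL scan against the affected indexes and writes its output to a log for review:

    # crontab entry: run the index scan every night at 02:00 (all paths are hypothetical)
    0 2 * * * /scripts/idxfix_scan.sh /db/proddb >> /logs/idxfix_scan.log 2>&1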

1.4) Identify which processes interact with the index, and are therefore potential causes of the corruption. These are the processes that:
- create or delete records in the associated table,
- modify one or more of the field definitions that the index is based on,
- modify the data in one or more of the fields in the index,
- touch the extent files that the index is located in.

If possible, add logging capability to these processes to track at what time these changes are made.
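
Where application-level logging cannot easily be added, the client logging startup parameters available in more recent Progress versions can serve a similar purpose. A minimal sketch, using hypothetical file and procedure names (verify that these parameters are supported by the version in use):

    # start the suspect client session with 4GL trace logging enabled
    mpro /db/proddb -p start_update.p -clientlog /logs/update_client.log -logentrytypes 4GLTrace -logginglevel 2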

1.5) When index corruption is reported, cross-reference it with the processes that were run between the IDXFIX operations. It may be necessary to repeat this step several times in order to establish a trend.

** At this time it is recommended to temporarily disable as many of the processes as possible, as this will help to narrow down and identify which process is responsible. Should this cause the corruption to stop occurring, the disabled processes should be re-enabled one at a time until the corruption reoccurs.

Once it has been determined where the index corruption occurs, it should be possible to build a reproducible scenario (in a lab environment) where it can be investigated further.

2) Establishing how the damage is done:


2.1) OS-related causes can be determined by monitoring file space and memory usage, both in terms of the total amount used and the number and size of separate files/segments. If any of these reach set OS limits or the "traditional limits" (such as the 2GB file limit), this warrants further investigation. An additional check here is to establish whether other applications also have issues under similar conditions.
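
On Unix, a quick way to spot extents approaching the traditional 2GB limit, assuming the extents live under a hypothetical /db directory, is a simple find command; ulimit -a shows the file size and memory limits in effect for the user that starts the database:

    # list database and BI extents that have grown past 2,000,000,000 bytes
    find /db -type f \( -name "*.d[0-9]*" -o -name "*.b[0-9]*" \) -size +2000000000c -exec ls -l {} \;

    # show the OS resource limits for the current user/shell
    ulimit -a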

2.2) Third-party interference such as non-Progress online backup utilities, virus scanners and I/O caching at the hardware or OS level is also a potential cause of index corruption. To investigate this, disable these processes and features. If this stops the index corruption, re-enable them one at a time until the corruption re-occurs.

** Note that for databases in general, it is not recommended to enable I/O caching at all. Most database engines are already optimized to minimize I/O operations, and the delay caused by additional caching may cause data on disk and in memory to get out of sync.

2.3) There are a number of possible known causes that should be investigated, especially if the corrupted index is a word-index or otherwise based on character fields:
- Ensure the proper word-break rules have been applied (especially in the case of UNICODE/UTF-8 databases)
- Ensure the codepage settings for the clients and the database match
- Ensure there are no duplicate word indexes on the same field
- After a binary load, ensure all active indexes have been rebuilt (see the example after this list)
- Ensure that there are not multiple records with identical key values for a unique index. Such a situation may occur if:
-- Records were added while the unique index was inactive, and the index was then activated.
-- An existing non-unique index is modified to be a unique one, and non-unique keys exist in the index.
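
For the binary load point in the list above, the rebuild is typically done with PROUTIL IDXBUILD against the offline database. A minimal sketch, again using the hypothetical /db/proddb database (the ALL option rebuilds every active index and can take considerable time on large tables):

    # rebuild all active indexes after a binary load
    proutil /db/proddb -C idxbuild all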

2.4) Investigate possible hardware-related causes one at a time, by moving the extents containing the corrupted index(es) to different physical disks, swapping memory banks, etc. This is especially important if the corruption is intermittent, or not limited to a single index or table.

2.5) If none of the above apply, the index corruption is likely caused by a Progress bug. In this case a reproducible case will need to be extracted from the application code, so that the bug can be logged for investigation and addressed.

Based on these five points, the actual cause of the corruption can be identified, and the appropriate action(s) taken to prevent recurrence.

The following should also be taken into account while investigating index corruption:

When a damaged database needs to be transported for investigation, do not use a dump and load operation. The dump files will not contain the contents of the damaged index, so they are essentially worthless - unless the index corruption actually occurs during a LOAD operation.
What is needed are the actual database blocks as they appear in the damaged database. The easiest way to obtain these is by performing the following steps:

1) Make a copy of the damaged database.
2) Remove all sequences from the copy.
3) Remove all tables that do not report index corruption and that are not required for a reproducible case.
4) Make a probkup of the database.
5) Use that backup and the database structure (.st) file to recreate the damaged part of the database as needed.
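
A sketch of that sequence from the command line, using hypothetical names; the sequence and table removal in steps 2 and 3 is done through the Data Dictionary / Data Administration tool rather than from the shell:

    # 1) make a copy of the damaged database
    procopy /db/proddb /tmp/damaged_copy

    # 2-3) remove sequences and unrelated tables from /tmp/damaged_copy
    #      using the Data Dictionary / Data Administration tool

    # 4) back up the trimmed copy
    probkup /tmp/damaged_copy /tmp/damaged_copy.bck

    # 5) on the target system, recreate the structure from the .st file and restore
    prostrct create /db/repro /db/repro.st
    prorest /db/repro /tmp/damaged_copy.bck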

The stack traces (protrace and/or Dr. Watson) are not useful in these scenarios. The traces are generated when error 1422 or 2699 occurs and, as stated above, anything that happens at that time typically bears little relation to the actual cause of the issue.