Kbase P109265: Why You Need The Before-image Log?
Author   Progress Software Corporation - Progress
Access   Public
Published   1/16/2006
Status: Unverified

GOAL:

Why You Need The Before-image Log?

GOAL:

Purpose of the Before-image Log

GOAL:

When Changed Blocks Are Written to Data Extents

FIX:

Why You Need the Before-image Log

Introduction

The before-image log, also called the "bi file", contains the primary transaction log[1] of the Progress RDBMS. It is an integral and essential part of every database and is just as important as the data extents used to store table data and their associated indexes. Without the before-image log extents, the database cannot be accessed.

If you lose your bi log or it becomes corrupted somehow, your database will be hosed (that's a technical term for "damaged beyond repair"). To reliably recover from such a situation, you must restore the database from a recent backup and roll forward using the after-image logs. You do make backup copies of your database, don't you? You do use the after-image journalling feature, don't you?

Some have advocated a number of short-cuts to avoid restoring from backup in the event the bi log is lost. Such methods hardly ever work and, even worse, there is no way to tell if they do. You will not discover that these "recovery" methods don't work until much later, when you encounter errors caused by data corruption that has been there since the apparently successful short-cut was taken.

If your data are valuable and you experience the loss of a data extent or a transaction log extent, do not rely on any recovery method other than restoring from a backup.

Purpose of the Before-image Log

Although it is called the "before-image" log, the data stored in it are /not/ before-images of anything. The name is historical and perhaps was poorly chosen. The before-image log contains records, called "notes", that allow previously made changes to the database to be repeated (redone) or rolled back (undone).

The before-image log is mainly used for two things:

0) To perform crash recovery in the event of a failure,

1) To perform transaction rollback during normal processing

In addition, the before-image log allows for a variety of performance optimizations because it, along with the storage manager's buffer management policies[2], permits database blocks to be modified in-place without requiring them to be written to the data extents right away (neither at the time changes are made, nor at transaction commit time).

How It Works:

During normal operation, copies of database blocks are read from the data extents on disk into the database buffer pool (also called the buffer cache) as needed, if a requested block is not already present. When database changes are made by running transactions, those changes are made /only/ in memory, on the /copies/ of the blocks in the buffer pool.

Eventually, the changed blocks must be written out to the data extents on disk, but the database storage manager postpones the write operations for as long as possible, sometimes for many hours or even days. One important reason for the postponement is that a given block will often be updated more than once and it is advantageous to avoid writing it for each change. Changed blocks are normally written by an automatic asynchronous checkpoint mechanism (a topic too complex to go into here), or when a normal database shutdown takes place.

If the changes are made only in memory, what happens if the system crashes and the changes are lost? The answer is that the changes can be recovered using the information that is recorded in the transaction log (the bi file). As each change is made to any database block (and to some in-memory data structures as well), a note describing the change is first spooled to the transaction log buffers. For some types of changes, multiple notes are generated. The change notes in the transaction log buffers must be forced to disk when the transaction that generated them is committed (or shortly thereafter; more on that later).

The notes written to the before-image log contain storage area numbers, block numbers, rowids, and other data which are used to identify what was changed and precisely how it was changed. As a result, the before-image log is tightly coupled to its database. A before-image log can only be used with its own database.

The storage manager meticulously follows what we call the "write-ahead logging rule" or WAL rule. This rule states that no changed database block can be written to a data extent on-disk unless the transaction log notes describing the changes to it have been written to the transaction log /first/. Thus there will always be a record of the changes on disk in the transaction log[6]. If the changes are lost, we can use the notes in the transaction log to recreate those changes by performing them again[3]. Similarly, we can use the same log notes to undo changes that were made by an uncommitted transaction.

The WAL rule makes it possible to update database blocks in memory and not immediately write them to the data extents on disk. Furthermore, we do not have to write the changes to disk even when the transaction that made them commits. We merely have to write the change notes to the transaction log.
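The WAL rule can be sketched in a few lines of Python. This is a toy model, not Progress internals; the class and field names are invented for illustration. Each in-memory block remembers the log sequence number (LSN) of its latest change note, and writing a block to a data extent first forces the log at least that far:

```python
# Toy model of the write-ahead logging (WAL) rule. Not Progress internals;
# all names here are invented for illustration.

class WalDatabase:
    def __init__(self):
        self.log_buffer = []      # change notes not yet on disk
        self.log_on_disk = []     # change notes durably written
        self.extents = {}         # block_no -> block contents on "disk"
        self.buffer_pool = {}     # block_no -> (contents, last_change_lsn)
        self.next_lsn = 0

    def change_block(self, block_no, new_contents):
        """Make the change in memory only, spooling a note first."""
        lsn = self.next_lsn
        self.next_lsn += 1
        self.log_buffer.append((lsn, block_no, new_contents))
        self.buffer_pool[block_no] = (new_contents, lsn)

    def flush_log(self, up_to_lsn):
        """Force log notes to disk through up_to_lsn."""
        self.log_on_disk += [n for n in self.log_buffer if n[0] <= up_to_lsn]
        self.log_buffer = [n for n in self.log_buffer if n[0] > up_to_lsn]

    def write_block(self, block_no):
        """WAL rule: the notes describing this block's changes go to disk first."""
        contents, lsn = self.buffer_pool[block_no]
        self.flush_log(lsn)                    # log first ...
        self.extents[block_no] = contents      # ... then the data block
```

Because commit only requires flushing log notes, never data blocks, the model also shows why the data extents on disk can lag far behind the true database state.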

A result of the foregoing is that the on-disk versions of the data extents are almost never up-to-date in a running system. The complete and correct database state is made up of three separate parts:

0) the blocks on disk in the various data extents

1) newer versions of some of those blocks in memory

2) the change notes in the before-image log

The data in the before-image log and the data in the data extents, /taken together/, are sufficient to recreate the newer versions that are present in memory in the buffer pool. That means it is acceptable to lose what is in memory, and when the system crashes, we /do/ lose those changed blocks. However, if a crash occurs and there is no bi log (perhaps because the disk it was on failed), then it is impossible to recreate the changes that were buffered in memory and not written to disk. Note that it is also impossible to recreate the data in the before-image log.

If you use the after-image journalling feature, then you can easily recover from the loss of either the before-image log or data extents. The after-image log extents contain a second copy of all the change notes that are written to the before-image log. By restoring a backup of the database and rolling forward all of the previous ai logs, you can recreate all of the changes that were made since the backup.

Crash Recovery

During normal operation as transactions update the database, before-image log notes are generated, spooled to the log buffers, and written to the log, blocks are updated in the buffer pool in shared-memory, and blocks are occasionally written to the data extents. If a crash occurs, the contents of shared-memory will be lost.
Assuming the crash was not caused by a failure that caused the loss or corruption of the data on the disks, we will be left only with what remains on disk in the before-image log and in the outdated data extents.

When the database is restarted after a crash, the recovery manager reads through the notes in the before-image log and recreates any changes that were not previously written to disk. To aid in this process, each database block contains a version number that is incremented whenever a change is made. The version number and the block number are copied to the before-image note describing the change.

As the recovery manager reads the before-image notes, in the order they were written, it compares the version number in the note with the version number in the block, which it reads into the buffer pool from disk if necessary. If the block's version number is later than the note's, we know the change described by the note was previously made[4]. We can discard it and read the next one. If the version numbers match, we make the change and increment the block's version number[5]. Once crash recovery reaches the end of the before-image log, we will have reproduced any changes that were lost. Most of the changed blocks will be in memory in the database buffer pool, but some may have been written to the data extents if it was necessary to make room for reading in blocks. Once all the notes have been read, we can roll back any incomplete transactions and recovery is complete.
Note: changes to the in-memory transaction table are also recorded in the before-image log, and the table is reconstructed as the notes are read.
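The version-number comparison can be sketched as follows. Again, this is a simplified illustration, not the actual note format; it assumes a note recorded against block version n describes the change from version n to n+1, as footnote [5] explains:

```python
# Sketch of crash-recovery redo (simplified; not the actual note format).
# blocks maps block_no -> (version, data) as read from the data extents.
# A note recorded against block version n changes the block to version n + 1.

def redo(notes, blocks):
    """Replay before-image notes in the order they were written."""
    for note in notes:
        version, _data = blocks[note["block"]]
        if version > note["version"]:
            continue                          # change already on disk; discard the note
        assert version == note["version"]     # redo must never skip a change
        blocks[note["block"]] = (version + 1, note["new_data"])
    return blocks
```

For example, if block 5 is on disk at version 2 and the log holds notes for versions 1 and 2, the first note is discarded (its change reached disk before the crash) and the second is reapplied, leaving the block at version 3.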

When Changed Blocks Are Written to Data Extents

Modified blocks in the database buffer pool are written to their associated data extents on disk at the following times:

0) When an asynchronous page writer (APW) processes the modified blocks in the checkpoint queue and writes them.

1) When a block not in the buffer pool is requested and a previously modified block must be written to make room for the new block.

2) When an APW writes a modified block from the APW "write it now queue" (called simply "the APW queue").

3) When an APW has no useful work to do and slowly scans the modified block list, occasionally writing them.

4) Shortly after an online backup is initiated, the online backup tool writes all modified blocks in the buffer pool. This is actually unnecessary.

5) When the database is shut down by way of a normal shutdown.

In all of these cases, because we must adhere to the write-ahead-logging rule, all of the transaction log notes that describe the changes must be written to the transaction log /before/ the data blocks can be written.

In the first five cases, the database is in use and usually new changes are being made at the same time. There is no way to know which data extents are up-to-date and which are not. In the case labelled 3, while there may be little or no update activity taking place, there is still no way to be certain that all modified blocks have been written. This is true even if the database has been idle for a long time, even several hours.

Note that if you are not using the APWs the situation is somewhat different. Without page writers, blocks will be written at the following times:

0) When a process generating a note discovers that the current log cluster is full. Before initializing the next log cluster, all modified blocks on the checkpoint queue will be written.

1) When a block not in the buffer pool is requested and a previously modified block must be written to make room for the new block.

2) Shortly after an online backup is initiated, the online backup tool writes all modified blocks in the buffer pool. This is actually unnecessary.

3) When the database is shut down by way of a normal shutdown.

Without page writers, all of these writes will take longer than they do when page writers are used.

The disk-resident data extents are completely up-to-date when /all/ of the following conditions have been satisfied:

0) The database is not in use in either multi-user mode or in single-user mode and is not being used by any database utility or any other program.

1) The database was shut down by way of a normal shutdown.

2) All data blocks that were written previously have been written to disk from the operating system's buffer pool by the operating system. It is sometimes difficult to know when this condition is true.

If you are using the -directio startup option, or you are using Version 10, then following completion of a successful shutdown, it will be true.

If you are NOT using the -directio option or Version 10, then it is not possible to determine /with certainty/ that there are no unwritten blocks in the operating system buffer pool. At shutdown, after writing all modified blocks, the storage manager issues a sync() system call to request the operating system to flush its buffers, including data unrelated to the database. On some operating systems, this system call returns before the operating system's writes have been completed[7]. The storage manager always waits to give the system time to finish writing before declaring the shutdown complete. The default wait time is 60 seconds, which may be insufficient. The -G startup parameter can be used to increase the wait time. If the system crashes before the operating system has finished writing, some changes may be lost.

3) There are no unwritten data blocks in disk controller buffers, buffers in the disks themselves, or disk subsystem (for example, a disk array) buffers. Unless you are using a high-quality storage subsystem that is known to be reliable and has battery backup, disks should always be configured so that writes are not buffered.

When the database is restarted after a normal shutdown, it goes through some of the same processing used in crash recovery. For no particular reason, we call this "warm start" processing. Among other things, it reads transaction log notes and verifies that no block changes were lost after the shutdown. Unless the system crashes after the shutdown but before the operating system has finished writing its filesystem buffers, no changes will have been lost.

If all of the above conditions are met, and only then, you can be sure the data extents on disk are up-to-date.

An Example Failure Scenario:

The proutil tool can be used to skip crash recovery (with the -F option) and truncate the bi log, discarding its contents. Skipping crash recovery means the records in the before-image log are not read or processed, any database changes that were lost are not recreated, and any incomplete transactions will not be rolled back. As a result, the database will likely be corrupted by performing this operation. Furthermore, this corruption may not be immediately apparent. You may be able to successfully open the database and access some, but not all, of the data in it. The following example illustrates what may happen. The example is small to make it easy to understand. I've left out a few details to keep it from becoming too involved, but nothing important. Everything is applicable to the real world.

Consider the following sequence of events:

Let's assume you have a very small buffer pool, with only 4 database buffers, and you are going to update 1 record and then read another one.

To fetch the record that will be updated, we need to read some stuff from the data extents on disk. First, we read in the root block of an index into one of the 4 buffers. Assume the index has two levels, so we will have to read a second index block as well.

Next, we read a record block to fetch the record that is going to be updated. Assume it has two fragments so we will also have to read another record block.

At this point each of the 4 blocks in the buffer pool contains one of the blocks we just read - two index blocks and two record blocks.

Now, when we update the record, each of the two fragments will be updated, causing /both/ record blocks to become dirty (i.e. modified). Transaction log notes will be generated for each change.

After the update is finished, we have the following situation:

- We generated 4 notes to the before-image log:

a) a transaction begin note,

b) a record update note for the first fragment,

c) a record update note for the second fragment,

d) a transaction end note.

- Nothing has been written to disk. So on disk, we still have exactly what we started with.

- In the buffer pool, there are two dirty blocks which have not been written to disk, but will need to be written in the future.

Now we read the second record, again reading in two index blocks. Because the previous index blocks were oldest, they will be discarded and the new ones replace them and move to the front of the lru chain.
Then we read in a single record block, replacing one of the previously modified record blocks, which is written to disk. We do not update this second record.

At this point, we have:

- the 4 notes from the previous update

- on disk, one of the two modified record blocks has been written

- in memory we have two index blocks, one unmodified record block, and one modified record block.

The modified record block contains one of the two fragments from the record we updated earlier. The unmodified one contains the second record we read but did not update.

Now assume that the system crashes because someone unplugs the server so they can plug in the vacuum cleaner. Stuff like this actually happens! But failures occur for many other reasons too. Maybe it was a laptop instead of a vacuum cleaner.

After we plug in the server and reboot it, we have:

- the 4 before-image log notes from the update, in the bi file.

- on disk, one of the two modified record blocks which was written,

- the rest of the stuff that was in the data extents before we started this exercise.

- whatever was in memory has disappeared. This includes one of our updated record blocks. The /only/ evidence that the block was updated is in the notes in the before-image log.

If we start the database the normal way, it will go through crash recovery, read the before-image log notes, discover that one of the block updates was lost, and make that update again. After crash recovery does this, we will again have both updated record blocks. Everything is quite fine.

If instead we use proutil to truncate the bi log with the -F option, the bi notes are thrown away without being processed. The note that contains the only evidence of the updated block that was never written to disk is discarded. We are left with whatever was on disk in the database. We have one half of the updated record and the old version of the other half. Or maybe there was no old version of the other half because the original record fit in one block but the updated one did not.

Rebuilding indexes will not repair the damaged record. There is /no/ information available /anywhere/ that can be used to repair it. You are hosed, big time.
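The scenario above can be reduced to a toy simulation (invented names; not proutil itself): one updated fragment reached disk, while the note for the other exists only in the bi log. Normal crash recovery redoes the lost change; truncating the bi log with -F discards the note unread and leaves a half-updated record:

```python
# Toy model of the failure scenario: the record's two fragments were both
# updated in memory, but only frag1 was written to a data extent before
# the crash. The bi log holds the only evidence of the change to frag2.

disk = {"frag1": "new", "frag2": "old"}       # data extents after the crash
bi_notes = [("frag2", "new")]                 # note for the lost change

def crash_recovery(extents, notes):
    """Normal startup: redo any change that never made it to disk."""
    for block, new_value in notes:
        if extents[block] != new_value:       # change was lost; make it again
            extents[block] = new_value
    return extents

def truncate_bi_with_F(extents, notes):
    """Model of truncating the bi log with -F: discard notes without processing them."""
    notes.clear()
    return extents                            # the half-updated record remains
```

After `crash_recovery`, both fragments hold their new values. After `truncate_bi_with_F`, frag2 still holds the old value: exactly the damage described above.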

When you skip crash recovery this way, the "damaged" flag is set in the database, and each time you start the database you receive a message that says the database is damaged and you must dump and reload the data. If the damage is severe, this may not be possible. Even when it is possible, you are likely to lose some data.

Nota Bene:

If you do what is sometimes called a "mock index rebuild", where you tell the index-rebuild tool that you want to rebuild some indexes and then don't actually specify any, the database damaged flag that got turned on because you skipped crash recovery will be turned off. But nothing has been fixed. You are still hosed, big time.

If you actually rebuild all the indexes, and the damage is confined to indexes, then doing the index rebuild will result in a good database. But index rebuild cannot and does not repair damaged records. The index rebuild might complete successfully even though damaged records are present, as long as all those fragments that contain the key fields for the indexes being rebuilt are readable.

The loss of a before-image file /may/ be recoverable via proutil truncate bi with the -F option if you are absolutely certain that the database was cleanly shut down and you are confident that the operating system and disk subsystem flushed everything to disk. Unfortunately, it is difficult or impossible to tell whether this recovery method was successful.

Final Words

The before-image log is part of the database just like data extents are. If you lose a data extent, you must restore the database from a backup. If you lose your before-image log, you must restore the database from backup. There is no other solution that can be reliably expected to work.

Footnotes:

[1] The database geek's technical term for this kind of transaction log is "undo-redo log".

[2] The buffer management policy is "no-force, steal" which means that modified buffers do not have to be forced to disk at transaction commit time, and a modified buffer can be written to disk, possibly by another transaction, whenever it is necessary to do so in order to make room to read in a block that is not present in the buffer pool.

[3] This is the reason it is referred to as an "undo-redo log" -- it allows us to redo changes we made previously. Redo is also called "repeating history" or "replaying transactions".

[4] It is possible for only some changes, but not all, to be lost in the event of a crash. This might occur if a modified block needs to be written to disk in order to make room for another block that needs to be read into memory (into the buffer pool). It may also occur if a block is written but the operating system buffers it and has not yet written it to disk.

[5] Note that this scheme requires that all changes must be redone in the same order they were done to begin with, and no change can be left out. Each change record is a description of how to change the block from version "n" to version "n+1". The situation for rollback is different, more complicated, and we will not go into it here.

[6] Strictly speaking, only descriptions of /recent/ changes are available in the before-image log. This is because space in the log is reused to reduce the amount of disk space it occupies. When reusing log space, the log manager first ensures that the data about to be discarded (i.e overwritten with new data) will no longer be needed.

[7] You can tell if the sync() call is synchronous by performing a simple experiment. At a time when the system is active and many file writes are taking place, type a "time sync" command in a command shell a few times and see how long it takes. If it returns immediately the first time, then sync() is probably not synchronous. If it takes a while (several seconds or a few minutes) and subsequent sync commands return quickly, then it is synchronous and the disk writes were completed before returning control to the shell. This experiment only works when there are many modified filesystem buffers present in the system buffer pool.