Kbase 13866: Undo and Redo Logging : How the bi file works
Author   Progress Software Corporation - Progress
Access   Public
Published   10/05/1998
Undo and Redo Logging : How the bi file works

The following document was written to provide a detailed explanation
of how the before-image file works as an integral part of the PROGRESS
database.


Logging in the Progress RDBMS


1. Introduction

This document contains a brief discussion of how the Progress
database manager uses logging to achieve reliability and high
performance.

2. Transactions

The transaction concept is central to the Progress database manager.
A brief review is provided here as background.

Transactions are an *error handling* mechanism. They allow you to do
an arbitrary amount of work and then change your mind. You can tell
the system "Oops, I don't want to do this after all. Put everything
back the way it was before I started." or "OK, I'm finished now, make
all my changes official". If a failure occurs before you are done, the
system will automatically undo any work that was not completed and
cannot be finished.

Transactions have four basic properties (the "ACID" properties):
atomicity, consistency, isolation, and durability. These properties are
closely related to each other and are briefly defined below. The
transaction properties of atomicity and durability have a profound
effect on the functionality of the Progress database manager and how
it works.

Atomicity

Transactions frequently require making several related changes to the
database. For example, to transfer money from one bank account to
another, funds must be deducted from one account record and added to
another. These operations must be performed as a unit. Either all the
changes are made, or none of them are made.

Consistency

The consistency property says that transactions transform the database
from one consistent state to another.

Isolation

Isolation means that the effects of several concurrently executing
transactions are not visible to each other. The changes made by a
transaction are provisional. They do not become official or permanent
until the transaction ends successfully or commits. Until a
transaction commits, an error will cause the changes made in that
transaction to be reversed or undone.

Durability

When a transaction is committed, its effects on the database are
permanent. They will not be undone even if a failure occurs. A
committed transaction can only be undone by executing a second
transaction that reverses the effects of the first.
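
To make atomicity and rollback concrete, the following is a minimal
sketch in Python (not Progress 4GL; the account table and the undo
list are invented for the example) of a transfer that either applies
both changes or restores both original values:

    # Minimal sketch of an atomic transfer: either both changes take
    # effect, or both are undone. The accounts dict and the undo list
    # are illustrative only, not part of the Progress database manager.
    accounts = {"checking": 500, "savings": 200}

    def transfer(source, target, amount):
        undo = []                          # old values of changed records
        try:
            undo.append((source, accounts[source]))
            accounts[source] -= amount
            if accounts[source] < 0:
                raise ValueError("insufficient funds")
            undo.append((target, accounts[target]))
            accounts[target] += amount
            # commit: the changes become official, undo data is discarded
        except Exception:
            # roll back: restore every changed value, newest first
            for name, old_value in reversed(undo):
                accounts[name] = old_value
            raise

    transfer("checking", "savings", 100)
    print(accounts)                        # {'checking': 400, 'savings': 300}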

3. Logging

The Progress database manager provides the transaction properties of
atomicity and durability by using a technique called undo-redo logging
combined with write-ahead logging. As database changes are made,
notations of them are durably recorded in a sequential file called a
log file. These log records can be used to undo an incomplete
transaction if an error occurs and the transaction is rolled back, and
to restore the database to a consistent state when a failure occurs.

In addition, the use of the undo-redo log allows the Progress database
manager to store database changes in memory indefinitely, writing
them to disk when it is convenient. Changes can be made in-place to
the actual data and do not need to be written to disk when a
transaction is committed (this is known as "deferred writes"). This
allows for high performance and optimization of disk write operations.
In effect, random writes can be transformed into sequential writes.

4. The Undo-Redo Log

The undo-redo log generated by the Progress database manager is called
the "before-image file" or "bi file". The before-image file contains a
log of all recent database changes. As transactions change the
database during normal ("forward") processing, the database manager
writes one or more log records describing each change to the
before-image file. Although it is called the before-image file, the
undo-redo log does not contain just before-images. It contains data
sufficient to:

a) Undo or reverse the effects of a transaction that has not yet been
committed.

b) Restore the database to a consistent state after a failure or
"crash".

c) Reuse space in the log when it is no longer needed.

d) Limit the amount of data that must be processed when recovering
from a failure.

Depending on the operation being performed, what is logged may be a
copy of the data before it was changed, a copy of new data, a
combination of the two, or some other information. There are over 50
different types of log records and each type contains different data.
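
As a rough illustration, a single log record can be pictured as
carrying a record type, a transaction identifier, and whatever undo
and redo data that type needs. The sketch below uses a layout invented
for the example; the actual record formats are internal to Progress:

    # Simplified sketch of undo-redo log records. Field names and record
    # types are illustrative; real bi notes have many more variations.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LogRecord:
        record_type: str                 # e.g. "TXN_BEGIN", "CHANGE", "TXN_END"
        transaction_id: int
        block_address: Optional[int]     # database block the change applies to
        undo_data: Optional[bytes]       # enough to reverse the change
        redo_data: Optional[bytes]       # enough to repeat the change

    # One small update transaction might generate records like these:
    log = [
        LogRecord("TXN_BEGIN", 42, None, None, None),
        LogRecord("CHANGE",    42, 1001, b"old balance", b"new balance"),
        LogRecord("TXN_END",   42, None, None, None),   # commit note
    ]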

Database changes are buffered in memory, in the database buffer pool,
and are written back to disk only when it becomes necessary to do so.
Over time, the disk-resident copy of the database becomes
progressively more out of date.

The current state of the database is the sum of the disk-resident part
and the memory-resident part (the contents of database buffer pool,
the transaction table, and other memory-resident data structures).
When the database is shut down in an orderly manner, the
memory-resident data are written to disk. When a failure occurs, the
memory-resident part of the database state will be lost, but it can be
reconstructed from the data in the before-image file.

5. Write-Ahead Logging

In write-ahead logging, all database changes are recorded and written
to the log *first* and later to the database. This way, the data in
the before-image file can be used to repeat or redo the changes if
they are never written to the database. This technique also allows
database changes to be stored in memory indefinitely before they are
written to disk.

For write-ahead logging to work, it is necessary that all writes to
the before-image log are unbuffered or "synchronous". If before-image
records are buffered in operating system buffers, the data will be
lost if the system fails. When the system fails, a record that a
transaction was committed might be lost, or writes to database files
might occur before the log records have been written to the
before-image log (because the operating system is free to schedule
buffered disk writes as it sees fit). If this happens, there could be
a database change on disk for which no log record exists. Without the
log record, the change cannot be undone.
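
The ordering rule itself can be sketched as follows, using os.fsync to
force the log record to disk before the changed database block is
written in place. The file names, record format, and function are
invented for the example; the real I/O path is internal to the
database manager:

    import os

    # Sketch of the write-ahead rule: the log record must reach stable
    # storage (fsync) before the changed database block may be written.
    def log_then_write(log_path, db_path, log_record, block_offset, new_block):
        with open(log_path, "ab") as log:
            log.write(log_record)
            log.flush()
            os.fsync(log.fileno())         # unbuffered, synchronous log write
        # Only now is it safe to write the changed block in place; in
        # practice this write is deferred and batched (see checkpoints).
        with open(db_path, "r+b") as db:
            db.seek(block_offset)
            db.write(new_block)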

Because of the requirement for write-ahead logging, database files and
before-image logs should not be stored on remote filesystems mounted
across networks, such as NFS.

6. Transaction Undo

To undo or "roll back" an active transaction, the database manager
reads, in reverse order, all the log records generated by the
transaction, back to where it began. The effects of each change are
reversed and the original data values are restored. Note that, as
changes are undone, the database is changed again. These new changes
are logged, generating additional log records during the rollback.
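
A minimal sketch of this reverse scan, using plain Python dictionaries
as stand-ins for log records and database blocks (both invented for
the example):

    # Sketch of rolling back an active transaction: its log records are
    # read newest to oldest and each change is reversed. Undoing is
    # itself a change, so it is logged as well. The record layout is
    # invented for the example; real bi notes are internal to Progress.
    def rollback(log, txn_id, blocks):
        for rec in reversed(list(log)):            # scan backwards
            if rec.get("txn") != txn_id or "undo" not in rec:
                continue
            blocks[rec["block"]] = rec["undo"]     # restore the old value
            log.append({"txn": txn_id, "block": rec["block"],
                        "undo": rec["redo"], "redo": rec["undo"]})
        log.append({"txn": txn_id, "type": "TXN_END", "committed": False})

    blocks = {1001: "new value"}
    log = [{"txn": 7, "type": "TXN_BEGIN"},
           {"txn": 7, "block": 1001, "undo": "old value", "redo": "new value"}]
    rollback(log, 7, blocks)
    print(blocks[1001])                            # old value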

7. Crash Recovery

When a failure ("crash") occurs, the memory-resident part of the
database state is lost and the disk-resident part is in an unknown
state. Some database changes may have been written to disk and some
may have been buffered in memory and lost. One or more transactions
may have been active and not completed.

During recovery processing, the memory-resident and disk-resident data
are reconstructed by reading log records from the before-image file
and repeating or "redoing" any changes which were not written to the
database files. This is possible because the Progress database
manager uses a technique called "write-ahead logging".

Once the memory-resident and disk-resident states of the database have
been reconstructed up to the point at which the crash occurred, any
transactions that were active at that point are rolled back in much
the same way as during normal transaction undo. When this recovery
process has been completed, the database is once again in a consistent
state.
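
Using the same kind of simplified record layout, the two phases can be
sketched as follows. For simplicity the sketch redoes every logged
change; the real database manager only redoes changes that had not yet
reached the database files:

    # Sketch of crash recovery: (1) redo logged changes so the
    # disk-resident blocks catch up with the log, then (2) undo the
    # changes of transactions that never committed. The record layout
    # (plain dicts) is invented for the example.
    def recover(log, blocks):
        committed = {r["txn"] for r in log
                     if r.get("type") == "TXN_END" and r.get("committed")}
        # Phase 1: redo, in forward (log) order.
        for r in log:
            if "redo" in r:
                blocks[r["block"]] = r["redo"]
        # Phase 2: undo incomplete transactions, newest change first.
        for r in reversed(log):
            if "undo" in r and r["txn"] not in committed:
                blocks[r["block"]] = r["undo"]

    blocks = {}                                    # disk-resident state
    log = [{"txn": 1, "type": "TXN_BEGIN"},
           {"txn": 1, "block": 1001, "undo": "a", "redo": "b"},
           {"txn": 1, "type": "TXN_END", "committed": True},
           {"txn": 2, "type": "TXN_BEGIN"},
           {"txn": 2, "block": 1002, "undo": "x", "redo": "y"}]
    recover(log, blocks)
    print(blocks)                                  # {1001: 'b', 1002: 'x'}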

8. Checkpoints

If we never wrote changed database blocks back to disk, we would have
to repeat all changes from the start of the session when a failure
occurs. This could take a long time and we would never be able to
reuse any before-image space since we would always need the data. To
prevent this from happening, the disk-resident state of the database
is periodically reconciled with the memory-resident state. This
reconciling operation is called a "checkpoint" and is initiated
whenever a before-image cluster (see before-image space allocation,
below) is filled.

During the checkpoint operation, all memory-resident state is written
to disk. This includes the transaction table, all changed database
blocks in the buffer pool, and some other miscellaneous data
structures.

Synchronous Checkpoints

Before version 6.3, checkpoints were always synchronous. *No*
database changes could take place while a checkpoint was being
performed. If the buffer pool was large, the cluster size was large,
and many updates had been made, there may have been thousands of dirty
database buffers that had to be written to disk. This might have taken
several minutes. For example, assume there were 5000 modified buffers
and that one can write 30 blocks per second to disk. Writing all the
buffers would have taken about 166 seconds (nearly three minutes).

Fuzzy Checkpoints

In version 6.3 and later, checkpoints are asynchronous (also called
"fuzzy checkpoints"). A checkpoint is begun when a before-image
cluster is filled, but it does not have to be completed until the next
one begins. Additional database changes can continue while the
checkpoint is occurring. One or more "asynchronous page writer"
processes can do the necessary disk writes in the background while
servers continue to process client database requests.

Provided the page writers can complete a checkpoint by the time the
next one must begin, only a very few disk writes will be required at
cluster close time. If any buffers are still marked for checkpointing
when the next checkpoint must begin, they will be written first.
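
As a rough sketch of the idea, the following Python uses a background
thread and a queue as stand-ins for an asynchronous page writer
working through the buffers that were marked when the checkpoint
began (all names and structures are invented for the example):

    import queue
    import threading
    import time

    # Sketch of a fuzzy checkpoint: buffers that were dirty when the
    # checkpoint began are queued, and a background "page writer" writes
    # them while foreground work continues. They must all be on disk
    # before the next checkpoint begins.
    checkpoint_queue = queue.Queue()

    def page_writer():
        while True:
            buf = checkpoint_queue.get()
            if buf is None:                 # shutdown signal for the sketch
                break
            time.sleep(0.01)                # stand-in for a disk write
            print(f"wrote buffer {buf}")

    writer = threading.Thread(target=page_writer)
    writer.start()

    def begin_checkpoint(dirty_buffers):
        for buf in dirty_buffers:           # mark buffers for this checkpoint
            checkpoint_queue.put(buf)

    begin_checkpoint([101, 102, 103])
    checkpoint_queue.put(None)
    writer.join()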

9. Before-Image File Space Allocation

Space in the before-image file is allocated in fixed-size units called
"clusters". The cluster size can be changed at the same time that the
before-image file is truncated.

When the before-image file is initialized, space for four clusters is
allocated and linked together to form a ring. Once these four clusters
fill with log records, all the initially allocated space will have
been used. Then either the first cluster is reused or a new cluster is
added, either by using existing unused space from a multi-extent bi
file or by expanding the bi file if no unused space is available. When
a new cluster is added, it is linked into the ring, time stamped, and
assigned the next sequential cluster number.

Each time a cluster is filled, either the oldest cluster is reused, or
a new one is added. A cluster can be reused when the log records it
contains are no longer needed. They are not needed when:

a) All transactions that were active when the records were generated
have ended. Then we will not need to undo them.

b) All database blocks that were changed when the log records within
the cluster were generated have been written to disk. Then we will not
need to repeat any of the changes during crash recovery.
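
A minimal sketch of this reuse test, with the transaction and buffer
bookkeeping invented purely for illustration (the real checks are
internal to the database manager):

    from dataclasses import dataclass

    # Sketch of the reuse test for the oldest cluster in the ring.
    @dataclass
    class ActiveTxn:
        first_cluster: int       # cluster holding the transaction's first record

    @dataclass
    class DirtyBlock:
        changed_in_cluster: int  # oldest cluster whose records changed the block

    def can_reuse(cluster_number, active_txns, dirty_blocks):
        # (a) no still-active transaction has log records in this cluster
        if any(t.first_cluster <= cluster_number for t in active_txns):
            return False
        # (b) every block changed by this cluster's records is on disk
        if any(b.changed_in_cluster <= cluster_number for b in dirty_blocks):
            return False
        return True

    print(can_reuse(5, [ActiveTxn(7)], []))             # True: reuse allowed
    print(can_reuse(5, [ActiveTxn(3)], []))             # False: old active transaction
    print(can_reuse(5, [], [DirtyBlock(4)]))            # False: unwritten block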

A long-running transaction can prevent all the space from the cluster
in which the transaction began through the most recently written
cluster from being reused. If the transaction has to be rolled back,
the log records it generated will be needed to reverse the effects of
its changes.

When a previously used cluster is going to be reused, the cluster
header is updated with a new cluster number and the current time.
Since previously used clusters are already linked into the ring of
allocated clusters, they do not need to be formatted or linked again.

A new cluster is allocated either by using available previously
created space from fixed-length extents or by expanding a variable
length extent.

When a cluster is allocated from available fixed-length extents, it is
formatted by writing zeros into the entire cluster. This is necessary
because the space may have been used before and may contain old log
records that are obsolete and should not be processed during crash
recovery.

When a variable length extent is expanded, the new cluster is
completely formatted by writing zeros into it. This is necessary
because on many systems, space is not allocated from the filesystem
until data are written to the file.

After a cluster has been formatted, it is linked into the ring of
allocated clusters, timestamped, and assigned the next cluster number.

Before version 7.2E, four clusters were formatted each time the file
was expanded. In version 7.2E and later, after the initial four
clusters have been allocated, only one cluster is formatted at a time.
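
The zero-filling step can be sketched as follows, writing one
before-image block of zeros at a time across the new cluster (the
sizes and file handling are example values, not Progress defaults):

    # Sketch of formatting a newly allocated before-image cluster by
    # writing zeros over its entire extent, so no stale log records are
    # ever mistaken for valid ones during crash recovery.
    CLUSTER_SIZE = 512 * 1024              # example: 512 KB cluster
    BI_BLOCK_SIZE = 8 * 1024               # example: 8 KB bi blocks

    def format_cluster(bi_path, cluster_offset):
        zeros = bytes(BI_BLOCK_SIZE)
        with open(bi_path, "r+b") as bi:
            bi.seek(cluster_offset)
            for _ in range(CLUSTER_SIZE // BI_BLOCK_SIZE):
                bi.write(zeros)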

10. Truncating the Before-Image File

Truncating the bi file consists of going through recovery to ensure
the database files are consistent and that there are no active
transactions, and then discarding the contents of the bi file. There
are several reasons why you might want to truncate the bi file. Among
them are:

a) To avoid having to write the bi file when doing a full off-line
backup. In version 7.3A and later the full backup does not write the
bi file to the backup medium.

b) To start after-imaging after a full backup. In this case,
Progress automatically truncates the bi file.

c) To change the before-image cluster size (version 6.2 and later) or
block size (version 7.2 and later).

d) To reduce the size of a before-image file that has become unusually
large. This *usually* does not occur during normal operation because
the before-image space is reused. Most of the time, the log reaches
some stable size and does not grow further. The size is dependent on
the types of transactions executed by the application.

When a variable-length bi file is truncated, the master block is
updated to reflect the fact that it has been truncated, the existing
bi file is deleted, and a new, empty one (zero bytes long) is created.
The next time the database is opened, four initial clusters will be
allocated, expanding the file.

When a multi-extent bi file is truncated, the master block is updated
and fixed length extents are marked as truncated by updating the
header. The previously used space is not reformatted. Any variable
length extent is deleted and an empty one created. The next time the
database is opened, four clusters are allocated.

11. Clusters, Buffers and Blocks

A cluster is composed of a fixed number of fixed-size blocks
(before-image blocks). The number of blocks is determined by the
cluster size and the block size. The cluster size is a multiple of the
before-image block size.
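
For example, with a 512 kilobyte cluster and the 8 kilobyte block size
mentioned below (both values illustrative), a cluster holds 64 blocks:

    # Example only: blocks per cluster = cluster size / block size.
    cluster_size = 512 * 1024
    bi_block_size = 8 * 1024
    print(cluster_size // bi_block_size)   # 64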

Before version 6.2, the cluster size was fixed at 16 kilobytes. In
version 6.2 and later, the cluster size can be changed when the
before-image file is truncated.

Before version 7.2, the before-image block size was fixed at the same
size as the database block size. In version 7.2 and later, the
before-image block size can be adjusted. It can be changed at the same
time that the before-image file is truncated. The optimal block size
is 8 kilobytes for most Unix systems.

In version 6.3 and later, the number of before-image buffers can be
controlled. The default (and minimum) is 5 buffers. Each buffer is the
size of a before-image block.

Progress Software Technical Support Note # 13866