Kbase P10481: How the primary recovery (.bi) file works
Author: Progress Software Corporation - Progress
Access: Public
Published: 28/07/2009
Status: Verified
GOAL:
How the primary recovery (.bi) file works
GOAL:
What is Undo and Redo Logging
GOAL:
What is roll-back recovery ?
FACT(s) (Environment):
All Supported Operating Systems
Progress/OpenEdge Product Family
FIX:
The following provides a detailed explanation of how the before-image file works and some of the history behind its evolution. The before-image file is an integral part of the Progress database.
1. Introduction
This document contains a brief discussion of how the Progress database manager uses logging to achieve reliability and high performance.
2. Transactions
The transaction concept is central to the Progress database manager. A brief review is provided here as background.
Transactions are an *error handling* mechanism. They allow you to do an arbitrary amount of work and then change your mind. You can tell the system "Oops, I don't want to do this after all. Put everything back the way it was before I started." or "OK, I'm finished now, make all my changes official." If a failure occurs before you are done, the system will automatically take away any work that is not completed and cannot be finished.
Transactions have four basic properties (the "ACID" properties):
atomicity, consistency, isolation, and durability.
These properties are closely related to each other and are briefly defined below. The transaction properties of atomicity and durability have a profound effect on the functionality of the Progress database manager and how it works.
Atomicity
Transactions frequently require making several related changes to the database. For example, to transfer money from one bank account to another, funds must be deducted from one account record and added to another. These operations must be performed as a unit. Either all the changes are made, or none of them are made.
Consistency
The consistency property says that transactions transform the database from one consistent state to another.
Isolation
Isolation means that the effects of several concurrently executing transactions are not visible to each other. The changes made by a transaction are provisional. They do not become official or permanent until the transaction ends successfully or commits. Until a transaction commits, an error will cause the changes made in that transaction to be reversed or undone.
Durability
When a transaction is committed, its effects on the database are permanent. They will not be undone even if a failure occurs. A committed transaction can only be undone by executing a second transaction that reverses the effects of the first.
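The atomicity property can be illustrated with the bank-transfer example. The following is a minimal Python sketch, not Progress code: the `transfer` function, the `InsufficientFunds` exception, and the dictionary of balances are hypothetical names introduced only for illustration, and the rule that a balance may not go negative is an assumption.

```python
class InsufficientFunds(Exception):
    """Hypothetical error used to force a rollback in this sketch."""


def transfer(accounts, source, target, amount):
    """Move `amount` from source to target as a unit: both changes or neither."""
    # Keep a copy of the original balances so the work can be undone.
    before_images = dict(accounts)
    try:
        accounts[source] -= amount
        if accounts[source] < 0:  # assumed business rule
            raise InsufficientFunds(source)
        accounts[target] += amount
    except Exception:
        # Any failure puts everything back the way it was, then re-raises.
        accounts.clear()
        accounts.update(before_images)
        raise


accounts = {"checking": 100, "savings": 50}
transfer(accounts, "checking", "savings", 30)  # both changes are kept

try:
    transfer(accounts, "checking", "savings", 500)  # fails part-way through
except InsufficientFunds:
    pass  # the partial deduction has been undone
```

After the failed second call, the balances are exactly what the first call left behind. A real database manager achieves the same effect with log records rather than a full copy of the data.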
3. Logging
The Progress database manager provides the transaction properties of atomicity and durability by using a technique called undo-redo logging combined with write-ahead logging. As database changes are made, notations of them are durably recorded in a sequential file called a log file. These log records can be used to undo an incomplete transaction if an error occurs and the transaction is rolled back, and to restore the database to a consistent state when a failure occurs.
In addition, the use of the undo-redo log allows the Progress database manager to store database changes in memory indefinitely, writing them to disk when it is convenient. Changes can be made in-place to the actual data and do not need to be written to disk when a transaction is committed (this is known as "deferred writes"). This allows for high performance and optimization of disk write operations. In effect, random writes can be transformed into sequential writes.
4. The Undo-Redo Log
The undo-redo log generated by the Progress database manager is called the "before-image file" or "bi file". The before-image file contains a log of all recent database changes. As transactions change the database during normal ("forward") processing, the database manager writes one or more log records describing each change to the before-image file. Although it is called the before-image file, the undo-redo log does not contain just before-images. It contains data sufficient to:
a) Undo or reverse the effects of a transaction that has not yet been committed.
b) Restore the database to a consistent state after a failure or "crash".
c) Reuse space in the log when it is no longer needed.
d) Limit the amount of data that must be processed when recovering from a failure.
Depending on the operation being performed, what is logged may be a copy of the data before it was changed, a copy of new data, a combination of the two, or some other information. There are over 50 different types of log records and each type contains different data.
Database changes are buffered in memory, in the database buffer pool, and are written back to disk only when it becomes necessary to do so. Over time, the disk-resident copy of the database becomes progressively more out of date.
The current state of the database is the sum of the disk-resident part and the memory-resident part (the contents of database buffer pool, the transaction table, and other memory-resident data structures). When the database is shut down in an orderly manner, the memory-resident data are written to disk. When a failure occurs, the memory-resident part of the database state will be lost, but it can be reconstructed from the data in the before-image file.
5. Write-Ahead Logging
In write-ahead logging, all database changes are recorded and written to the log *first* and later to the database. This way, the data in the before-image file can be used to repeat or redo the changes if they are never written to the database. This technique also allows database changes to be stored in memory indefinitely before they are written to disk.
For write-ahead logging to work, all writes to the before-image log must be unbuffered, or "synchronous". If before-image records are held in operating system buffers, they will be lost if the system fails. A record that a transaction was committed might be lost, or writes to database files might reach disk before the corresponding log records have been written to the before-image log (because the operating system is free to schedule buffered disk writes as it sees fit). If this happens, there could be a database change on disk for which no log record exists. Without the log record, the change cannot be undone.
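The ordering rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Progress implementation: the `BeforeImageLog` class, the `update` function, the JSON record layout, and the `demo.bi` file name are all hypothetical. The only point shown is that the log record is forced to disk (`flush` followed by `os.fsync`) before the in-memory copy of the data is changed.

```python
import json
import os
import tempfile


class BeforeImageLog:
    """Toy append-only log; every append is forced to disk before returning."""

    def __init__(self, path):
        self.path = path
        self.file = open(path, "ab")

    def append(self, record):
        self.file.write(json.dumps(record).encode("utf-8") + b"\n")
        self.file.flush()             # push Python's buffer to the OS
        os.fsync(self.file.fileno())  # ask the OS to write through to disk

    def records(self):
        with open(self.path, "rb") as f:
            return [json.loads(line) for line in f]

    def close(self):
        self.file.close()


def update(log, buffer_pool, key, new_value):
    """Log first, then change the in-memory copy (the write-ahead rule)."""
    log.append({"key": key, "before": buffer_pool.get(key), "after": new_value})
    buffer_pool[key] = new_value      # the database file write can come later


workdir = tempfile.mkdtemp()
log = BeforeImageLog(os.path.join(workdir, "demo.bi"))
buffer_pool = {}
update(log, buffer_pool, "balance", 100)
update(log, buffer_pool, "balance", 70)
logged = log.records()
log.close()
```

Because each record carries both the old and the new value, the same record can later be used either to redo the change or to undo it.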
6. Transaction Undo
To undo or "roll back" an active transaction, the database manager reads, in reverse order, all the log records generated by the transaction, back to where it began. The effects of each change are reversed and the original data values are restored. Note that, as changes are undone, the database is changed again. These new changes are logged, generating additional log records during the rollback.
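A rollback of this kind might look like the following Python sketch. The record layout (`txn`, `kind`, `key`, `before`, `after`), the `change` and `rollback` helpers, and the use of `None` to mean "the key did not exist" are assumptions made for the example; real before-image records are far more varied.

```python
def rollback(log, data, txn_id):
    """Undo one transaction by reading its log records newest-first."""
    to_undo = [r for r in log if r["txn"] == txn_id and r["kind"] == "update"]
    for record in reversed(to_undo):
        key = record["key"]
        # The reversal is itself a database change, so it is logged too.
        log.append({"txn": txn_id, "kind": "undo", "key": key,
                    "before": data.get(key), "after": record["before"]})
        if record["before"] is None:
            data.pop(key, None)  # the key did not exist before the change
        else:
            data[key] = record["before"]


data = {"a": 1}
log = []


def change(txn_id, key, value):
    """Forward processing: log the change, then apply it."""
    log.append({"txn": txn_id, "kind": "update", "key": key,
                "before": data.get(key), "after": value})
    data[key] = value


change("t1", "a", 2)
change("t1", "b", 5)
change("t1", "a", 3)
records_before_rollback = len(log)
rollback(log, data, "t1")
```

Three forward changes produce three log records, and undoing them produces three more, which is why a rollback makes the log grow rather than shrink.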
7. Crash Recovery
When a failure ("crash") occurs, the memory-resident part of the database state is lost and the disk-resident part is in an unknown state. Some database changes may have been written to disk and some may have been buffered in memory and lost. One or more transactions may have been active and not completed.
During recovery processing, the memory-resident and disk-resident data are reconstructed by reading log records from the before-image file and repeating or "redoing" any changes which were not written to the database files. This is possible because the Progress database manager uses a technique called "write-ahead logging".
Once the memory-resident and disk-resident states of the database have been reconstructed up to the point at which the crash occurred, any transactions that were active at that point are rolled back in much the same way as during normal transaction undo. When this recovery process has been completed, the database is once again in a consistent state.
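The redo-then-undo sequence can be sketched as follows. This Python example is a simplification under several assumptions: the record layout and the `recover` function are hypothetical, the whole log is replayed from its start (checkpoints are ignored), and `None` stands for "the key did not exist".

```python
def recover(disk_state, log):
    """Rebuild a consistent state from whatever reached disk plus the log."""
    data = dict(disk_state)
    committed = {r["txn"] for r in log if r["kind"] == "commit"}

    # Redo: repeat every logged change in order. Replaying a change that
    # already reached disk is harmless because the new value is absolute.
    for r in log:
        if r["kind"] == "update":
            data[r["key"]] = r["after"]

    # Undo: reverse the changes of transactions that never committed,
    # newest first, as in a normal rollback.
    for r in reversed(log):
        if r["kind"] == "update" and r["txn"] not in committed:
            if r["before"] is None:
                data.pop(r["key"], None)
            else:
                data[r["key"]] = r["before"]
    return data


log = [
    {"txn": "t1", "kind": "update", "key": "x", "before": 1, "after": 2},
    {"txn": "t1", "kind": "commit"},
    {"txn": "t2", "kind": "update", "key": "y", "before": None, "after": 5},
    {"txn": "t2", "kind": "update", "key": "x", "before": 2, "after": 9},
]

# Two possible disk states at the moment of the crash.
recovered_stale = recover({"x": 1}, log)            # nothing was written back
recovered_partial = recover({"x": 1, "y": 5}, log)  # one uncommitted write landed
```

Both starting points converge on the same result: the committed change from `t1` survives and every trace of the unfinished `t2` is removed.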
It is also important to note that the Database engine performs crash recovery every time you open the database, although not all of the recovery phases are logged in the database .lg file. For example, the Database engine performs and logs the Physical Redo phase unconditionally, but the Physical Undo and Logical Undo phases are only performed and logged when outstanding transactions are found.
8. Checkpoints
If we never wrote changed database blocks back to disk, we would have to repeat all changes from the start of the session when a failure occurs. This could take a long time and we would never be able to reuse any before-image space since we would always need the data. To prevent this from happening, the disk-resident state of the database is periodically reconciled with the memory-resident state. This reconciling operation is called a "checkpoint" and is initiated whenever a before-image cluster (see before-image space allocation, below) is filled.
During the checkpoint operation, all memory-resident state is written to disk. This includes the transaction table, all changed database blocks in the buffer pool, and some other miscellaneous data structures.
Synchronous Checkpoints
Before version 6.3, checkpoints were always synchronous. *No* database changes could take place while a checkpoint was being performed. If the buffer pool was large, the cluster size was large, and many updates had been made, there may have been thousands of dirty database buffers that had to be written to disk. This might have taken several minutes. For example, assume there were 5000 modified buffers and that one can write 30 blocks per second to disk. Writing all the buffers would have taken about 166 seconds (nearly three minutes).
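The arithmetic in this example is easy to reproduce. The short Python sketch below uses the figures from the text (5000 modified buffers, 30 blocks per second); the function name is hypothetical.

```python
def synchronous_checkpoint_seconds(dirty_buffers, blocks_per_second):
    """Time during which no database changes can be made."""
    return dirty_buffers / blocks_per_second


seconds = synchronous_checkpoint_seconds(5000, 30)
summary = f"{seconds:.1f} seconds ({seconds / 60:.1f} minutes)"
print(summary)  # 166.7 seconds (2.8 minutes)
```

The stall grows linearly with the number of dirty buffers, so larger buffer pools and larger clusters made the pause proportionally worse.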
Fuzzy Checkpoints
In version 6.3 and later, checkpoints are asynchronous (also called "fuzzy checkpoints"). A checkpoint is begun when a before-image cluster is filled, but it does not have to be completed until the next one begins. Additional database changes can continue while the checkpoint is occurring. One or more "asynchronous page writer" processes can do the necessary disk writes in the background while servers continue to process client database requests.
Provided the page writers can complete a checkpoint by the time the next one must begin, only a very few disk writes will be required at cluster close time. If any buffers are still marked for checkpointing when the next checkpoint must begin, they will be written first.
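The interplay between checkpoints and page writers can be modelled with a small Python sketch. The `BufferPool` class and its method names are hypothetical, and the model is deliberately crude (blocks are plain integers and a "disk write" is an entry in a list); it only illustrates that buffers left over from one checkpoint are written before the next one is scheduled.

```python
class BufferPool:
    def __init__(self):
        self.dirty = set()   # modified blocks not yet on disk
        self.marked = set()  # blocks owed to the checkpoint in progress
        self.writes = []     # order in which blocks reached disk

    def modify(self, block):
        self.dirty.add(block)

    def _write(self, block):
        self.writes.append(block)
        self.dirty.discard(block)
        self.marked.discard(block)

    def begin_checkpoint(self):
        """Called when a before-image cluster fills."""
        leftovers = sorted(self.marked)
        for block in leftovers:        # finish the previous checkpoint first
            self._write(block)
        self.marked = set(self.dirty)  # schedule the new checkpoint
        return len(leftovers)

    def page_writer_pass(self, max_writes):
        """Background work done while servers keep processing requests."""
        for block in sorted(self.marked)[:max_writes]:
            self._write(block)


pool = BufferPool()
for block in (1, 2, 3):
    pool.modify(block)
forced_first = pool.begin_checkpoint()   # nothing is owed yet
pool.page_writer_pass(max_writes=2)      # background writes: blocks 1 and 2
pool.modify(4)                           # updates continue in the meantime
forced_second = pool.begin_checkpoint()  # block 3 is still owed
```

In this run only one block has to be written at the second checkpoint, because the page writer had already taken care of the other two.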
9. Before-Image File Space Allocation
Space in the before-image file is allocated in fixed-size units called "clusters". The cluster size can be changed at the same time that the before-image file is truncated.
When the before-image file is initialized, space for four clusters is allocated and the clusters are linked together to form a ring. Once these four clusters fill with log records, all the initially allocated space will have been used. Then, either the first cluster is reused or a new cluster is added, by using existing unused space from a multi-extent bi file, or by expanding the bi file if no unused space is available. When a new cluster is added, it is linked into the ring, time stamped, and assigned the next sequential cluster number.
Each time a cluster is filled, either the oldest cluster is reused, or a new one is added. A cluster can be reused when the log records it contains are no longer needed. They are not needed when:
a) All transactions that were active when the records were generated have ended. Then we will not need to undo them.
b) All database blocks that were changed when the log records within the cluster were generated have been written to disk. Then we will not need to repeat any of the changes during crash recovery.
A long-running transaction can prevent reuse of every cluster from the one in which the transaction began through the most recently written one. If the transaction has to be rolled back, the log records it generated will be needed to reverse the effects of its changes.
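The two reuse conditions can be expressed as a simple predicate. In this Python sketch the `can_reuse` function and the dictionary describing a cluster (`txns` for the transactions active while it was filled, `blocks` for the database blocks its records changed) are hypothetical bookkeeping invented for the example.

```python
def can_reuse(cluster, active_txns, unwritten_blocks):
    """True when none of the cluster's log records can still be needed."""
    # a) No transaction that was active when the records were written is
    #    still running, so nothing here will be needed for undo.
    no_undo_needed = not (cluster["txns"] & active_txns)
    # b) Every block changed by these records is on disk, so nothing here
    #    will be needed for redo during crash recovery.
    no_redo_needed = not (cluster["blocks"] & unwritten_blocks)
    return no_undo_needed and no_redo_needed


cluster = {"txns": {"t1", "t2"}, "blocks": {10, 11}}

blocked_by_txn = can_reuse(cluster, active_txns={"t2"}, unwritten_blocks=set())
blocked_by_block = can_reuse(cluster, active_txns=set(), unwritten_blocks={11})
reusable = can_reuse(cluster, active_txns={"t9"}, unwritten_blocks={42})
```

The first call shows the long-running transaction case: as long as `t2` stays open, the cluster cannot be reused even though all its blocks have been written.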
When a previously used cluster is going to be reused, the cluster header is updated with a new cluster number and the current time. Since previously used clusters are already linked into the ring of allocated clusters, they do not need to be formatted or linked again.
A new cluster is allocated either by using available, previously created space from fixed-length extents or by expanding a variable-length extent.
When a cluster is allocated from available fixed-length extents, it is formatted by writing zeros into the entire cluster. This is necessary because the space may have been used before and may contain old log records that are obsolete and should not be processed during crash recovery.
When a variable-length extent is expanded, the new cluster is likewise formatted completely by writing zeros into it. This is necessary because on many systems, space is not allocated from the file system until data are written to the file.
After a cluster has been formatted, it is linked into the ring of allocated clusters, timestamped, and assigned the next cluster number.
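The ring behaviour can be sketched in Python as follows. The `ClusterRing` class is a hypothetical model, not the actual on-disk structure: it tracks only cluster numbers, assumes `advance` is called each time the newest cluster fills, and leaves the decision about whether the oldest cluster is reusable to the caller.

```python
from collections import deque


class ClusterRing:
    def __init__(self, initial_clusters=4):
        self.next_number = 1
        self.ring = deque()  # cluster numbers, oldest at the left
        self.formatted = 0   # how many clusters had to be zero-filled
        for _ in range(initial_clusters):
            self._add_new()

    def _add_new(self):
        # New space is formatted, linked in, and numbered.
        self.formatted += 1
        self.ring.append(self.next_number)
        self.next_number += 1

    def advance(self, oldest_is_reusable):
        """Pick the cluster that will receive the next log records."""
        if oldest_is_reusable:
            # Reuse: only the header changes (new number and time stamp).
            self.ring.popleft()
            self.ring.append(self.next_number)
            self.next_number += 1
        else:
            self._add_new()
        return self.ring[-1]


ring = ClusterRing()
initial = list(ring.ring)                            # [1, 2, 3, 4]
grown = ring.advance(oldest_is_reusable=False)       # file grows: cluster 5
recycled = ring.advance(oldest_is_reusable=True)     # cluster 1 becomes 6
```

Reuse leaves the number of clusters and the count of formatted clusters unchanged, which is why a stable workload eventually stops growing the file.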
Before version 7.2E, four clusters were formatted each time the file was expanded. In version 7.2E and later, after the initial four clusters have been allocated, only one cluster is formatted at a time.
10. Truncating the Before-Image File
Truncating the bi file consists of going through recovery to ensure the database files are consistent and that there are no active transactions, and then discarding the contents of the bi file. There are several reasons why you might want to truncate the bi file. Among them are:
a) To avoid having to write the bi file when doing a full off-line backup. In version 7.3A and later the full backup does not write the bi file to the backup medium.
b) To start after-imaging after a full backup. In this case, Progress automatically truncates the bi file.
c) To change the before-image cluster size (version 6.2 and later) or block size (version 7.2 and later).
d) To reduce the size of a before-image file that has become unusually large. This *usually* does not occur during normal operation because the before-image space is reused. Most of the time, the log reaches some stable size and does not grow further. The size is dependent on the types of transactions executed by the application.
When a variable-length bi file is truncated, the master block is updated to reflect the fact that it has been truncated, the existing bi file is deleted, and a new, empty one (zero bytes long) is created. The next time the database is opened, four initial clusters will be allocated, expanding the file.
When a multi-extent bi file is truncated, the master block is updated and fixed-length extents are marked as truncated by updating the header. The previously used space is not reformatted. Any variable-length extent is deleted and an empty one is created. The next time the database is opened, four clusters are allocated.
11. Clusters, Buffers and Blocks
A cluster is composed of a fixed number of fixed-size blocks (before-image blocks). The number of blocks is determined by the cluster size and the block size. The cluster size is a multiple of the before-image block size.
Before version 6.2, the cluster size was fixed at 16 kilobytes.
In version 6.2 and later, the cluster size can be changed when the before-image file is truncated.
Before version 7.2, the before-image block size was fixed at the same size as the database block size.
In version 7.2 and later, the before-image block size can be adjusted. It can be changed at the same time that the before-image file is truncated. The optimal block size is 8 kilobytes for most UNIX systems.
In version 6.3 and later, the number of before-image buffers can be controlled. The default (and minimum) is 5 buffers. Each buffer is the size of a before-image block.
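The relationships between cluster size, block size, and buffers reduce to simple arithmetic. The Python sketch below uses hypothetical helper names; the sample inputs (a 16-kilobyte cluster, 8-kilobyte blocks, 5 buffers) are taken from figures mentioned in this section purely as illustrative values, not as a recommended configuration.

```python
def blocks_per_cluster(cluster_kb, block_kb):
    """Number of before-image blocks in one cluster."""
    if cluster_kb % block_kb != 0:
        raise ValueError("cluster size must be a multiple of the block size")
    return cluster_kb // block_kb


def buffer_memory_kb(buffers, block_kb):
    """Memory used by the before-image buffers, one block per buffer."""
    return buffers * block_kb


blocks = blocks_per_cluster(cluster_kb=16, block_kb=8)  # 2 blocks
memory = buffer_memory_kb(buffers=5, block_kb=8)        # 40 kilobytes
```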