hd_keywords {file I/O}

proc hd_assumption {id text} {
  # hd_requirement $id $text
}

unset -nocomplain fioreq
hd_read_requirement_file $::DOC/req/hlr35000.txt fioreq

proc fileio_noop {args} {}

proc process {text} {
  set zOut ""
  set zSpecial ""
  foreach zLine [split $text "\n"] {
    switch -regexp $zLine {

      {^ *REQ *[^ ][^ ]* *$} {
        regexp { *REQ *([^ ]+) *} $zLine -> zRecid
        append zOut "\n"
        append zOut "[lindex $::fioreq($zRecid) 1]\n"
        set zRecText ""
      }

      {^ *ASSUMPTION *[^ ][^ ]* *$} {
        regexp { *ASSUMPTION *([^ ]+) *} $zLine -> zRecid
        append zOut "\n"
        set zSpecial hd_assumption
        set zRecText ""
      }

      {^ *$} {
        if {$zSpecial ne ""} {
          $zSpecial $zRecid $zRecText
          set zSpecial ""
          append zOut "\n"
        }
      }

      default {
        if {$zSpecial ne ""} {
          if {[regexp {^ *\. *$} $zLine]} {set zLine ""}
          append zRecText "$zLine\n"
        }
        append zOut "$zLine\n"
      }
    }
  }
  set zOut
}

hd_resolve [process {
SQLite File IO Specification
Table Of Contents
Javascript is required for some features of this document, including the table of contents, figure numbering and internal references (section numbers and hyper-links).

Overview

SQLite stores an entire database within a single file, the format of which is described in the SQLite Database File Format document ff_sqlitert_requirements. Each database file is stored within a file system, presumably provided by the host operating system. Instead of interfacing with the operating system directly, the host application is required to supply an adaptor component that implements the SQLite Virtual File System interface (described in capi_sqlitert_requirements). The adaptor component is responsible for translating the calls made by SQLite to the VFS interface into calls to the file-system interface provided by the operating system. This arrangement is depicted in figure figure_vfs_role.

Figure - Virtual File System (VFS) Adaptor
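The arrangement depicted above can be illustrated with a small sketch. The structure and function names below are simplified, illustrative stand-ins for SQLite's actual VFS interface (they are not the real sqlite3_vfs or sqlite3_io_methods declarations from sqlite3.h); the sketch only shows the idea of translating VFS-style read and write calls into calls to an underlying OS file API, here C stdio:

```c
#include <stdio.h>

/* Hypothetical, simplified stand-ins for SQLite's VFS structures.
** The names and signatures below are illustrative only. */
typedef struct VfsFile VfsFile;
struct VfsFile {
  /* Read nAmt bytes from byte offset iOfst into zBuf. Return 0 on success. */
  int (*xRead)(VfsFile*, void *zBuf, int nAmt, long iOfst);
  /* Write nAmt bytes from zBuf at byte offset iOfst. Return 0 on success. */
  int (*xWrite)(VfsFile*, const void *zBuf, int nAmt, long iOfst);
  FILE *pOsFile;              /* Underlying OS-level file */
};

/* Translate VFS-style calls into C stdio calls. */
static int stdioRead(VfsFile *p, void *zBuf, int nAmt, long iOfst){
  if( fseek(p->pOsFile, iOfst, SEEK_SET) ) return 1;
  return fread(zBuf, 1, (size_t)nAmt, p->pOsFile)==(size_t)nAmt ? 0 : 1;
}
static int stdioWrite(VfsFile *p, const void *zBuf, int nAmt, long iOfst){
  if( fseek(p->pOsFile, iOfst, SEEK_SET) ) return 1;
  return fwrite(zBuf, 1, (size_t)nAmt, p->pOsFile)==(size_t)nAmt ? 0 : 1;
}

/* Open a VfsFile backed by a temporary OS file. */
static int vfsOpenTemp(VfsFile *p){
  p->xRead = stdioRead;
  p->xWrite = stdioWrite;
  p->pOsFile = tmpfile();
  return p->pOsFile ? 0 : 1;
}
```

A real adaptor would also implement locking, syncing and file deletion, which the rest of this document depends on.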

Although it would be easy to design a system that uses the VFS interface to read and update the content of a database file stored within a file-system, there are several complicated issues that need to be addressed by such a system:

  1. SQLite is required to implement atomic and durable transactions (the 'A' and 'D' from the ACID acronym), even if an application, operating system or power failure occurs midway through or shortly after updating a database file.

    To implement atomic transactions in the face of potential application, operating system or power failures, database writers write a copy of those portions of the database file that they are going to modify into a second file, the journal file, before writing to the database file. If a failure does occur while modifying the database file, SQLite can reconstruct the original database (before the modifications were attempted) based on the contents of the journal file.

  2. SQLite is required to implement isolated transactions (the 'I' from the ACID acronym).

    This is done by using the file locking facilities provided by the VFS adaptor to serialize writers (write transactions) and preventing readers (read transactions) from accessing database files while writers are midway through updating them.

  3. For performance reasons, it is advantageous to minimize the quantity of data read from and written to the file-system.

    As one might expect, the amount of data read from the database file is minimized by caching portions of the database file in main memory. Additionally, multiple updates to the database file that are part of the same write transaction may be cached in main memory and written to the file together, allowing for more efficient IO patterns and eliminating the redundant write operations that could take place if part of the database file is modified more than once within a single write transaction.
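The journal scheme described in point 1 above can be sketched with a toy in-memory model. The page size, structures and function names here are illustrative only (the real journal format is specified in the file-format document); the essential ordering is that the original page image is saved to the journal before the database page is overwritten, so that rollback can always restore the pre-transaction content:

```c
#include <string.h>

/* A toy model of journal-based rollback. All sizes and names are
** illustrative, not SQLite's actual implementation. */
#define PGSZ   8      /* Toy page size */
#define NPAGE  4      /* Pages in the toy database */

typedef struct {
  char aDb[NPAGE][PGSZ];            /* The "database file" */
  char aJrnl[NPAGE][PGSZ];          /* Journalled original page images */
  int  aJrnlPgno[NPAGE];            /* Page numbers present in the journal */
  int  nJrnl;                       /* Number of journalled pages */
} ToyDb;

/* Before modifying a page, save its original image to the journal. */
static void writePage(ToyDb *p, int pgno, const char *aNew){
  memcpy(p->aJrnl[p->nJrnl], p->aDb[pgno], PGSZ);   /* journal first... */
  p->aJrnlPgno[p->nJrnl++] = pgno;
  memcpy(p->aDb[pgno], aNew, PGSZ);                 /* ...then modify */
}

/* Recover after a failure: copy original images back into the database. */
static void rollback(ToyDb *p){
  for(int i=0; i<p->nJrnl; i++){
    memcpy(p->aDb[p->aJrnlPgno[i]], p->aJrnl[i], PGSZ);
  }
  p->nJrnl = 0;
}
```

In the real system the journal resides in the file-system, and sync operations are required at specific points to guarantee the ordering that this sketch gets for free in memory.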

System requirement references for the above points.

This document describes in detail the way that SQLite uses the API provided by the VFS adaptor component to solve the problems and implement the strategies enumerated above. It also specifies the assumptions made about the properties of the system that the VFS adaptor provides access to. For example, specific assumptions about the extent of data corruption that may occur if a power failure occurs while a database file is being updated are presented in section fs_characteristics.

This document does not specify the details of the interface that must be implemented by the VFS adaptor component; that is left to capi_sqlitert_requirements.

Relationship to Other Documents

Related to C-API requirements:

  1. Opening a connection.
  2. Closing a connection.

Related to SQL requirements:

  1. Opening a read-only transaction.
  2. Terminating a read-only transaction.
  3. Opening a read-write transaction.
  4. Committing a read-write transaction.
  5. Rolling back a read-write transaction.
  6. Opening a statement transaction.
  7. Committing a statement transaction.
  8. Rolling back a statement transaction.
  9. Committing a multi-file transaction.

Related to file-format requirements:

  1. Pinning (reading) a database page.
  2. Unpinning a database page.
  3. Modifying the contents of a database page.
  4. Appending a new page to the database file.
  5. Truncating a page from the end of the database file.

Document Structure

Section vfs_assumptions of this document describes the various assumptions made about the system to which the VFS adaptor component provides access. The basic capabilities and functions required from the VFS implementation are presented along with the description of the VFS interface in capi_sqlitert_requirements. Section vfs_assumptions complements this by describing in more detail the assumptions made about VFS implementations on which the algorithms presented in this document depend. Some of these assumptions relate to performance issues, but most concern the expected state of the file-system following a failure that occurs midway through modifying a database file.

Section database_connections introduces the concept of a database connection, a combination of a file-handle and in-memory cache used to access a database file. It also describes the VFS operations required when a new database connection is created (opened), and when one is destroyed (closed).

Section reading_data describes the steps required to open a read transaction and read data from a database file.

Section writing_data describes the steps required to open a write transaction and write data to a database file.

Section rollback describes the way in which aborted write transactions may be rolled back (reverted), either as a result of an explicit user directive or because an application, operating system or power failure occurred while SQLite was midway through updating a database file.

Section page_cache_algorithms describes some of the algorithms used to determine exactly which portions of the database file are cached by a page cache, and the effect that they have on the quantity and nature of the required VFS operations. It may at first seem odd to include the page cache, which is primarily an implementation detail, in this document. However, it is necessary to acknowledge and describe the page cache in order to provide a more complete explanation of the nature and quantity of IO performed by SQLite.

Glossary

After this document is ready, make the vocabulary consistent and then add a glossary here.

VFS Adaptor Related Assumptions

This section documents those assumptions made about the system that the VFS adaptor provides access to. The assumptions noted in section fs_characteristics are particularly important. If these assumptions are not true, then a power or operating system failure may cause SQLite databases to become corrupted.

Performance Related Assumptions

SQLite uses the assumptions in this section to try to speed up reading from and writing to the database file. ASSUMPTION A21010 It is assumed that writing a series of sequential blocks of data to a file in order is faster than writing the same blocks in an arbitrary order.
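For example, a pager honoring assumption A21010 might sort its set of dirty page numbers before writing them out, so that the writes land in ascending file-offset order. A minimal sketch (illustrative, not SQLite's actual code) using the C library qsort():

```c
#include <stdlib.h>

/* Comparator for ascending page numbers. */
static int cmpPgno(const void *a, const void *b){
  int x = *(const int*)a, y = *(const int*)b;
  return (x>y) - (x<y);
}

/* Sort dirty page numbers so that the subsequent writes are sequential. */
static void sortDirtyPages(int *aPgno, int nPgno){
  qsort(aPgno, (size_t)nPgno, sizeof(int), cmpPgno);
}
```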

System Failure Related Assumptions

In the event of an operating system or power failure, the various combinations of file-system software and storage hardware available provide varying levels of guarantee as to the integrity of the data written to the file system just before or during the failure. The exact combination of IO operations that SQLite is required to perform in order to safely modify a database file depends on the exact characteristics of the target platform.

This section describes the assumptions that SQLite makes about the content of a file-system following a power or system failure. In other words, it describes the extent of file and file-system corruption that such an event may cause.

SQLite queries an implementation for file-system characteristics using the xDeviceCharacteristics() and xSectorSize() methods of the database file file-handle. These two methods are only ever called on file-handles open on database files. They are not called for journal files, master-journal files or temporary database files.

The file-system sector size value determined by calling the xSectorSize() method is a power of 2 between 512 and 32768, inclusive (a reference to exactly how this value is determined belongs here). SQLite assumes that the underlying storage device stores data in blocks of sector-size bytes, known as sectors. It is also assumed that each aligned block of sector-size bytes of each file is stored in a single device sector. If the file is not an exact multiple of sector-size bytes in size, then the final device sector is partially empty.

Normally, SQLite assumes that if a power failure occurs while updating any portion of a sector then the contents of the entire device sector are suspect following recovery. After writing to any part of a sector within a file, it is assumed that the modified sector contents are held in a volatile buffer somewhere within the system (main memory, disk cache etc.). SQLite does not assume that the updated data has reached the persistent storage media until after it has successfully synced the corresponding file by invoking the VFS xSync() method. Syncing a file causes all modifications to the file up until that point to be committed to persistent storage.

Based on the above, SQLite is designed around a model of the file-system whereby any sector of a file written to is considered to be in a transient state until after the file has been successfully synced. Should a power or system failure occur while a sector is in a transient state, it is impossible to predict its contents following recovery. It may be written correctly, not written at all, overwritten with random data, or any combination thereof.

For example, if the sector-size of a given file-system is 2048 bytes, and SQLite opens a file and writes a 1024 byte block of data to offset 3072 of the file, then according to the model the second sector of the file is in the transient state. If a power failure or operating system crash occurs before or during the next call to xSync() on the file handle, then following system recovery SQLite assumes that all file data between byte offsets 2048 and 4095, inclusive, is invalid. It also assumes that the first sector of the file, containing the data from byte offsets 0 to 2047 inclusive, is valid, since that sector was not in a transient state when the crash occurred.
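The byte range that the model treats as transient can be computed directly from the write offset, the write size and the sector size. The helper below is illustrative (it is not part of SQLite); with the values from the example above it yields the range 2048 to 4095:

```c
/* Given a write of nByte bytes at byte offset iOfst, compute the range of
** bytes covered by the device sectors the write touches -- the region the
** model treats as transient until the file is synced. nSector must be a
** power of two (512..32768, per xSectorSize()). */
static void transientRange(long iOfst, long nByte, long nSector,
                           long *pFirst, long *pLast){
  *pFirst = (iOfst / nSector) * nSector;          /* start of first sector */
  *pLast = ((iOfst + nByte - 1) / nSector) * nSector + nSector - 1;
}
```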

Assuming that any and all sectors in the transient state may be corrupted following a power or system failure is a very pessimistic approach. Some modern systems provide more sophisticated guarantees than this. SQLite allows the VFS implementation to specify at runtime that the current platform supports zero or more of the following properties: atomic-write, safe-append and sequential-write. The assumptions associated with each of these properties are described in the following section.

Failure Related Assumption Details

This section describes how the assumptions presented in the parent section apply to the individual API functions and operations provided by the VFS to SQLite for the purposes of modifying the contents of the file-system.

SQLite manipulates the contents of the file-system using a combination of the following four types of operation:

  1. Create file operations (the VFS xOpen() method).
  2. Delete file operations (the VFS xDelete() method).
  3. Truncate file operations (the xTruncate() method of the sqlite3_file object).
  4. Write file operations (the xWrite() method of the sqlite3_file object).

Additionally, all VFS implementations are required to provide the sync file operation, accessed via the xSync() method of the sqlite3_file object, used to flush create, write and truncate operations on a file to the persistent storage medium.

The formalized assumptions in this section refer to system failure events. In this context, this should be interpreted as any failure that causes the system to stop operating; for example, a power failure or operating system crash.

SQLite does not assume that a create file operation has actually modified the file-system records within persistent storage until after the file has been successfully synced. ASSUMPTION A21001 If a system failure occurs during or after a "create file" operation, but before the created file has been synced, then SQLite assumes that it is possible that the created file may not exist following system recovery.

Of course, it is also possible that it does exist following system recovery. ASSUMPTION A21002 If a "create file" operation is executed by SQLite, and then the created file synced, then SQLite assumes that the file-system modifications corresponding to the "create file" operation have been committed to persistent media. It is assumed that if a system failure occurs any time after the file has been successfully synced, then the file is guaranteed to appear in the file-system following system recovery.

A delete file operation (invoked by a call to the VFS xDelete() method) is assumed to be an atomic and durable operation.

ASSUMPTION A21003 If a system failure occurs at any time after a "delete file" operation (call to the VFS xDelete() method) returns successfully, it is assumed that the file-system will not contain the deleted file following system recovery. ASSUMPTION A21004 If a system failure occurs during a "delete file" operation, it is assumed that following system recovery the file-system will either contain the file being deleted in the state it was in before the operation was attempted, or not contain the file at all. It is assumed that it is not possible for the file to have become corrupted purely as a result of a failure occurring during a "delete file" operation.

The effects of a truncate file operation are not assumed to be made persistent until after the corresponding file has been synced. ASSUMPTION A21005 If a system failure occurs during or after a "truncate file" operation, but before the truncated file has been synced, then SQLite assumes that the size of the truncated file is either as large or larger than the size that it was to be truncated to. ASSUMPTION A21006 If a system failure occurs during or after a "truncate file" operation, but before the truncated file has been synced, then it is assumed that the contents of the file up to the size that the file was to be truncated to are not corrupted.

The above two assumptions may be interpreted to mean that if a system failure occurs after file truncation but before the truncated file is synced, the contents of the file following the point at which it was to be truncated may not be trusted. They may contain the original file data, or may contain garbage. ASSUMPTION A21007 If a "truncate file" operation is executed by SQLite, and then the truncated file synced, then SQLite assumes that the file-system modifications corresponding to the "truncate file" operation have been committed to persistent media. It is assumed that if a system failure occurs any time after the file has been successfully synced, then the effects of the file truncation are guaranteed to appear in the file system following recovery.

A write file operation modifies the contents of an existing file within the file-system. It may also increase the size of the file. The effects of a write file operation are not assumed to be made persistent until after the corresponding file has been synced. ASSUMPTION A21008 If a system failure occurs during or after a "write file" operation, but before the corresponding file has been synced, then it is assumed that the contents of all sectors spanned by the write file operation are untrustworthy following system recovery. This includes regions of the sectors that were not actually modified by the write file operation. ASSUMPTION A21011 If a system failure occurs on a system that supports the atomic-write property for blocks of size N bytes following an aligned write of N bytes to a file, but before the file has been successfully synced, then it is assumed following recovery that all sectors spanned by the write operation were correctly updated, or that none of the sectors were modified at all. ASSUMPTION A21012 If a system failure occurs on a system that supports the safe-append property following a write operation that appends data to the end of the file without modifying any of the existing file content, but before the file has been successfully synced, then it is assumed following recovery that either the data was correctly appended to the file, or that the file size remains unchanged. It is assumed that it is impossible for the file to be extended but populated with incorrect data. ASSUMPTION A21013 Following a system recovery, if a device sector is deemed to be untrustworthy as defined by A21008 and neither A21011 nor A21012 applies to the range of bytes written, then no assumption can be made about the content of the sector following recovery. It is assumed that it is possible for such a sector to be written correctly, not written at all, populated with garbage data or any combination thereof. 
ASSUMPTION A21009 If a system failure occurs during or after a "write file" operation that causes the file to grow, but before the corresponding file has been synced, then it is assumed that the size of the file following recovery is as large or larger than it was when it was most recently synced.

If a system supports the sequential-write property, then further assumptions may be made with respect to the state of the file-system following recovery from a system failure. Specifically, it is assumed that create, truncate, delete and write file operations are applied to the persistent representation in the same order as they are performed by SQLite. Furthermore, it is assumed that the file-system waits until one operation is safely written to the persistent media before the next is attempted, just as if the relevant file were synced following each operation. ASSUMPTION A21014 If a system failure occurs on a system that supports the sequential-write property, then it is assumed that all operations completed before the last time any file was synced have been successfully committed to persistent media. ASSUMPTION A21015 If a system failure occurs on a system that supports the sequential-write property, then it is assumed that the set of possible states that the file-system may be in following recovery is the same as if each of the write operations performed since the most recent time a file was synced was itself followed by a sync file operation, and that the system failure may have occurred during any of the write or sync file operations.

Database Connections

Within this document, the term database connection has a slightly different meaning from that which one might assume. The handles returned by the sqlite3_open() and sqlite3_open16() APIs (reference) are referred to as database handles. A database connection is a connection to a single database file using a single file-handle, which is held open for the lifetime of the connection. Using the SQL ATTACH syntax, multiple database connections may be accessed via a single database handle. Or, using SQLite's shared-cache mode feature, multiple database handles may access a single database connection.

Usually, a new database connection is opened whenever the user opens a new database handle on a real database file (not an in-memory database), or when a database file is attached to an existing database connection using the SQL ATTACH syntax. However, if the shared-cache mode feature is enabled, then the database file may be accessed through an existing database connection. For more information on shared-cache mode, refer to Reference. The various IO operations required to open a new connection are detailed in section open_new_connection of this document.

Similarly, a database connection is usually closed when the user closes a database handle that is open on a real database file or has had one or more real database files attached to it using the ATTACH mechanism, or when a real database file is detached from a database connection using the DETACH syntax. Again, the exception is if shared-cache mode is enabled. In this case, a database connection is not closed until its number of users reaches zero. The IO related steps required to close a database connection are described in section closing_database_connection.

After sections 4 and 5 are finished, come back here and see if we can add a list of state items associated with each database connection to make things easier to understand. i.e each database connection has a file handle, a set of entries in the page cache, an expected page size etc.

Opening a New Connection

This section describes the VFS operations that take place when a new database connection is created.

Opening a new database connection is a two-step process:

  1. A file-handle is opened on the database file.
  2. If step 1 was successful, an attempt is made to read the database file header from the database file using the new file-handle.

In step 2 of the procedure above, the database file is not locked before it is read from. This is the only exception to the locking rules described in section reading_data.

The reason for attempting to read the database file header is to determine the page-size used by the database file. Because it is not possible to be certain as to the page-size without holding at least a shared lock on the database file (because some other database connection might have changed it since the database file header was read), the value read from the database file header is known as the expected page size. REQ H35060 REQ H35070 REQ H35080 REQ H35090
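As a sketch of step 2, the expected page size can be extracted from a buffer holding the database file header. Per the file-format document, the page size is stored as a 2-byte big-endian integer at byte offset 16; in current file-format versions the special value 1 at this offset denotes a page size of 65536. The function name below is illustrative, not SQLite's actual code:

```c
/* Extract the expected page size from the 100-byte database file header.
** Offset 16 holds a 2-byte big-endian integer; the special value 1
** represents a page size of 65536 (current file-format versions). */
static int expectedPageSize(const unsigned char *aHdr){
  int pgsz = (aHdr[16] << 8) | aHdr[17];
  if( pgsz==1 ) pgsz = 65536;
  return pgsz;
}
```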

Closing a Connection

This section describes the VFS operations that take place when an existing database connection is closed (destroyed).

Closing a database connection is a simple matter. The open VFS file-handle is closed and in-memory page cache related resources are released. REQ H35400 REQ H35430

The Page Cache

The contents of an SQLite database file are formatted as a set of fixed size pages. See ff_sqlitert_requirements for a complete description of the format used. The page size used for a particular database is stored as part of the database file header at a well-known offset within the first 100 bytes of the file. Almost all read and write operations performed by SQLite on database files are done on blocks of data page-size bytes in size.

All SQLite database connections running within a single process share a single page cache. The page cache caches data read from database files in main-memory on a per-page basis. When SQLite requires data from a database file to satisfy a database query, it checks the page cache for usable cached versions of the required database pages before loading them from the database file. If no usable cache entry can be found and the database page data is loaded from the database file, it is cached in the page cache in case the same data is needed again later. Because reading from the database file is assumed to be an order of magnitude slower than reading from main-memory, caching database page content in the page cache to minimize the number of read operations performed on the database file is a significant performance enhancement.

The page cache is also used to buffer database write operations. When SQLite is required to modify one or more of the database pages that make up a database file, it first modifies the cached version of the page in the page cache. At that point the page is considered a "dirty" page. At some point later on, the new content of the "dirty" page is copied from the page cache into the database file via the VFS interface. Buffering writes in the page cache can reduce the number of write operations required on the database file (in cases where the same page is updated twice) and allows optimizations based on the assumptions outlined in section fs_performance.
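The effect of buffering writes can be sketched as follows. The arrays and names below are illustrative bookkeeping, not SQLite's actual page cache implementation; the point is simply that two modifications to the same page within a transaction result in a single write operation at flush time:

```c
/* Illustrative sketch of write coalescing in a page cache. */
#define N_TOY_PAGES 8

static int aDirty[N_TOY_PAGES];     /* Non-zero if the page needs writing */
static int nWrites;                 /* Write operations issued at flush */

static void modifyPage(int pgno){
  aDirty[pgno] = 1;                 /* Repeat modifications coalesce here */
}

static void flushCache(void){
  for(int i=0; i<N_TOY_PAGES; i++){
    if( aDirty[i] ){
      nWrites++;                    /* One write operation per dirty page */
      aDirty[i] = 0;
    }
  }
}
```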

Database read and write operations, and the way in which they interact with and use the page cache, are described in detail in sections reading_data and writing_data of this document, respectively.

At any one time, the page cache contains zero or more page cache entries, each of which has the following data associated with it:

The first two elements in the list above, the associated database connection and the page number, uniquely identify the page cache entry. At no time may the page cache contain two entries for which both the database connection and page number are identical. Or, put another way, a single database connection never caches more than one copy of a database page within the page cache.
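The uniqueness rule can be sketched with illustrative structures (not SQLite's actual page cache code): storing content for a (connection, page number) pair that is already cached replaces the existing entry rather than creating a duplicate:

```c
#include <string.h>

/* Toy page cache keyed by the (connection, page number) pair. */
#define MAX_ENTRY 16

typedef struct {
  int iConn;       /* Identifies the database connection (illustrative) */
  int pgno;        /* Page number within the database file */
  char aData[8];   /* Cached page content (toy size) */
} CacheEntry;

static CacheEntry aCache[MAX_ENTRY];
static int nCache;

static CacheEntry *cacheFind(int iConn, int pgno){
  for(int i=0; i<nCache; i++){
    if( aCache[i].iConn==iConn && aCache[i].pgno==pgno ) return &aCache[i];
  }
  return 0;
}

static void cacheStore(int iConn, int pgno, const char *aData){
  CacheEntry *p = cacheFind(iConn, pgno);
  if( p==0 ){                      /* Not present: add a new entry */
    p = &aCache[nCache++];
    p->iConn = iConn;
    p->pgno = pgno;
  }
  memcpy(p->aData, aData, 8);      /* Present: replace the content */
}
```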

At any one time, each page cache entry may be said to be a clean page, a non-writable dirty page or a writable dirty page, according to the following definitions:

The exact logic used to determine if a page cache entry with modified content is a dirty page or writable page is presented in section page_cache_algorithms.

Because main-memory is a limited resource, the page cache cannot be allowed to grow indefinitely. As a result, unless all database files opened by database connections within the process are quite small, sometimes data must be discarded from the page cache. In practice this means page cache entries must be purged to make room for new ones. If a page cache entry being removed from the page cache to free main-memory is a dirty page, then its contents must be saved into the database file before it can be discarded without data loss. The following two sub-sections describe the algorithms used by the page cache to determine exactly when existing page cache entries are purged (discarded).

Page Cache Configuration

Describe the parameters set to configure the page cache limits.

Page Cache Algorithms

Requirements describing the way in which the configuration parameters are used. About LRU etc.

Reading Data

In order to return data from the database to the user, for example as the results of a SELECT query, SQLite must at some point read data from the database file. Usually, data is read from the database file in aligned blocks of page-size bytes. The exception is when the database file header fields are being inspected, before the page-size used by the database can be known.

With two exceptions, a database connection must have an open transaction (either a read-only transaction or a read/write transaction) on the database before data may be read from the database file.

The two exceptions are:

Once a transaction has been opened, reading data from a database connection is a simple operation. Using the xRead() method of the file-handle open on the database file, the required database file pages are read one at a time. SQLite never reads partial pages and always uses a single call to xRead() for each required page.
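Since pages are numbered starting from 1, the byte offset passed to xRead() for page pgno is (pgno - 1) * page-size, and exactly one full page is read per call. A trivial helper (illustrative only):

```c
/* Byte offset of page pgno within the database file. Pages are numbered
** from 1, so page 1 starts at offset 0. */
static long pageOffset(int pgno, int pageSize){
  return (long)(pgno - 1) * pageSize;
}
```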

After reading the data for a database page, SQLite stores the raw page of data in the page cache. Each time a page of data is required by the upper layers, the page cache is queried to see if it contains a copy of the required page stored by the current database connection. If such an entry can be found, then the required data is read from the page cache instead of the database file. Only a connection with an open transaction (either a read-only transaction or a read/write transaction) on the database may read data from the page cache. In this sense, reading from the page cache is no different from reading from the database file.

Refer to section page_cache_algorithms for a description of exactly how and for how long page data is stored in the page cache. REQ H35010 REQ H35020 REQ H35420

Opening a Read-Only Transaction

Before data may be read from a database file or queried from the page cache, a read-only transaction must be successfully opened by the associated database connection (this is true even if the connection will eventually write to the database, as a read/write transaction may only be opened by upgrading from a read-only transaction). This section describes the procedure for opening a read-only transaction.

The key element of a read-only transaction is that the file-handle open on the database file obtains and holds a shared-lock on the database file. Because a connection requires an exclusive-lock before it may actually modify the contents of the database file, and by definition while one connection is holding a shared-lock no other connection may hold an exclusive-lock, holding a shared-lock guarantees that no other process may modify the database file while the read-only transaction remains open. This ensures that read-only transactions are sufficiently isolated from the transactions of other database users (see section overview).

Obtaining the shared lock itself on the database file is quite simple: SQLite just calls the xLock() method of the database file handle. Some of the other processes that take place as part of opening the read-only transaction are quite complex. The steps that SQLite is required to take to open a read-only transaction, in the order in which they must occur, are as follows:

  1. A shared-lock is obtained on the database file.
  2. The connection checks if a hot journal file exists in the file-system. If one does, then it is rolled back before continuing.
  3. The connection checks if the data in the page cache may still be trusted. If not, all page cache data is discarded.
  4. If the file-size is not zero bytes and the page cache does not contain valid data for the first page of the database, then the data for the first page must be read from the database.

Of course, an error may occur while attempting any of the 4 steps enumerated above. If this happens, then the shared-lock is released (if it was obtained) and an error is returned to the user. Step 2 of the procedure above is described in more detail in section hot_journal_detection. Section cache_validation describes the process identified by step 3 above. Further detail on step 4 may be found in section read_page_one. REQ H35100 REQ H35110
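The control flow of the four steps, including the rule that the shared lock is released if a later step fails, can be sketched as follows. The helper functions are stubs standing in for the operations described above (here the step 4 stub is made to fail so that the unwind path is exercised):

```c
/* Control-flow sketch of opening a read-only transaction. All function
** names are illustrative stubs, not SQLite's actual implementation. */
enum { OK = 0, ERROR = 1 };

static int sharedLockHeld;

static int obtainSharedLock(void){ sharedLockHeld = 1; return OK; }
static void releaseSharedLock(void){ sharedLockHeld = 0; }
static int rollbackHotJournalIfAny(void){ return OK; }   /* step 2 stub */
static int validateOrPurgeCache(void){ return OK; }      /* step 3 stub */
static int loadPageOneIfNeeded(void){ return ERROR; }    /* step 4 stub,
                                                         ** fails on purpose */
static int openReadTransaction(void){
  int rc = obtainSharedLock();                           /* step 1 */
  if( rc==OK ) rc = rollbackHotJournalIfAny();           /* step 2 */
  if( rc==OK ) rc = validateOrPurgeCache();              /* step 3 */
  if( rc==OK ) rc = loadPageOneIfNeeded();               /* step 4 */
  if( rc!=OK && sharedLockHeld ) releaseSharedLock();    /* unwind on error */
  return rc;
}
```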

The most common reason an attempt to obtain a shared-lock may fail is that some other connection is holding an exclusive or pending lock. However, it may also fail because some other error (e.g. an IO or communications-related error) occurs within the call to the xLock() method. REQ H35030 REQ H35120

Section hot_journal_detection contains a description of and requirements governing the detection of a hot-journal file referred to in the above requirements. REQ H35040

The cache validation process is described in detail in section cache_validation. REQ H35050

The numbered list above notes that the data for the first page of the database file, if it exists and is not already loaded into the page cache, must be read from the database file before the read-only transaction may be considered opened. This is handled by requirement H35240.

Hot Journal Detection

This section describes the procedure that SQLite uses to detect a hot journal file. If a hot journal file is detected, this indicates that at some point the process of writing a transaction to the database was interrupted and a recovery operation (hot journal rollback) needs to take place. This section does not describe the process of hot journal rollback (see section hot_journal_rollback) or the processes by which a hot journal file may be created (see section writing_data).

The procedure used to detect a hot-journal file is quite complex. The following steps take place:

  1. Using the VFS xAccess() method, SQLite queries the file-system to see if the journal file associated with the database exists. If it does not, then there is no hot-journal file.
  2. By invoking the xCheckReservedLock() method of the file-handle opened on the database file, SQLite checks if some other connection holds a reserved lock or greater. If some other connection does hold a reserved lock, this indicates that the other connection is midway through a read/write transaction (see section writing_data). In this case the journal file is not a hot-journal and must not be rolled back.
  3. Using the xFileSize() method of the file-handle opened on the database file, SQLite checks if the database file is 0 bytes in size. If it is, the journal file is not considered to be a hot journal file. Instead of rolling back the journal file, in this case it is deleted from the file-system by calling the VFS xDelete() method. Technically, there is a race condition here. This step should be moved to after the exclusive lock is held.
  4. An attempt is made to upgrade to an exclusive lock on the database file. If the attempt fails, then all locks, including the recently obtained shared lock are dropped. The attempt to open a read-only transaction has failed. This occurs when some other connection is also attempting to open a read-only transaction and the attempt to gain the exclusive lock fails because the other connection is also holding a shared lock. It is left to the other connection to roll back the hot journal.
    It is important that the file-handle lock is upgraded directly from shared to exclusive in this case, instead of first upgrading to reserved or pending locks as is required when obtaining an exclusive lock to write to the database file (section writing_data). If SQLite were to first upgrade to a reserved or pending lock in this scenario, then a second process also trying to open a read-transaction on the database file might detect the reserved lock in step 2 of this process, conclude that there was no hot journal, and commence reading data from the database file.
  5. The xAccess() method is invoked again to detect if the journal file is still in the file system. If it is, then it is a hot-journal file and SQLite tries to roll it back (see section rollback).
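The five steps above can be sketched as follows. This is an illustrative sketch only: the `vfs` and `db` objects and their method names are hypothetical stand-ins for the VFS xAccess(), xCheckReservedLock(), xFileSize(), xLock() and xDelete() methods named in the procedure, not a real SQLite API.

```python
EXCLUSIVE = 4  # assumed lock-level constant for illustration

def detect_hot_journal(vfs, db, journal_path):
    # Step 1: does the journal file exist at all?
    if not vfs.access(journal_path):
        return False                      # no journal file, no hot journal
    # Step 2: a RESERVED (or greater) lock held elsewhere means another
    # connection is midway through a write transaction; not a hot journal.
    if db.check_reserved_lock():
        return False
    # Step 3: an empty database file cannot have a hot journal; the
    # stale journal file is simply deleted.
    if db.file_size() == 0:
        vfs.delete(journal_path)
        return False
    # Step 4: upgrade directly from SHARED to EXCLUSIVE. On failure the
    # attempt to open the read transaction is abandoned (not modelled here).
    if not db.lock(EXCLUSIVE):
        raise BlockingIOError("lock upgrade failed; another reader may roll back")
    # Step 5: re-check, under the exclusive lock, that the journal still exists.
    return vfs.access(journal_path)
```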

Master journal file pointers?

The following requirements describe step 1 of the above procedure in more detail. REQ H35140 REQ H35510 REQ H35150

The following requirements describe step 2 of the above procedure in more detail. REQ H35160 REQ H35520 REQ H35170

The following requirements describe step 3 of the above procedure in more detail. REQ H35440 REQ H35530 REQ H35450 REQ H35540 REQ H35460 REQ H35550

The following requirements describe step 4 of the above procedure in more detail. REQ H35470 REQ H35480

Finally, the following requirements describe step 5 of the above procedure in more detail. REQ H35490 REQ H35560 REQ H35570 REQ H35500

Cache Validation

When a database connection opens a read transaction, the page cache may already contain data associated with the database connection. However, if another process has modified the database file since the cached pages were loaded it is possible that the cached data is invalid.

SQLite determines whether or not the page cache entries belonging to the database connection are valid or not using the file change counter, a field in the database file header. The file change counter is a 4-byte big-endian integer field stored starting at byte offset 24 of the database file header. Before the conclusion of a read/write transaction that modifies the contents of the database file in any way (see section writing_data), the value stored in the file change counter is incremented. When a database connection unlocks the database file, it stores the current value of the file change counter. Later, while opening a new read-only transaction, SQLite checks the value of the file change counter stored in the database file. If the value has not changed since the database file was unlocked, then the page cache entries can be trusted. If the value has changed, then the page cache entries cannot be trusted and all entries associated with the current database connection are discarded. REQ H35180 REQ H35190
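The change-counter comparison described above can be sketched in a few lines. The function names here are illustrative, not part of any SQLite API; only the field's offset (24) and encoding (4-byte big-endian) are taken from the text.

```python
import struct

CHANGE_COUNTER_OFFSET = 24  # byte offset within the database file header

def read_change_counter(header: bytes) -> int:
    """Decode the 4-byte big-endian file change counter from the
    database file header."""
    return struct.unpack_from(">I", header, CHANGE_COUNTER_OFFSET)[0]

def cache_is_valid(header: bytes, counter_at_unlock: int) -> bool:
    # Cached pages may be trusted only if the change counter has not
    # moved since this connection last unlocked the database file.
    return read_change_counter(header) == counter_at_unlock
```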

Why a 16 byte block? Why not 4? (something to do with encrypted databases). REQ H35200

Requirement H35050 (section open_read_only_trans) specifies the action SQLite is required to take upon determining that the cache contents are invalid.

Page 1 and the Expected Page Size

As the last step in opening a read transaction on a database file that is more than 0 bytes in size, SQLite is required to load data for page 1 of the database into the page cache, if it is not already there. This is slightly more complicated than it seems, as the database page-size is not known at this point.

Even though the database page-size cannot be known for sure, SQLite is usually able to guess correctly by assuming it to be equal to the connection's expected page size. The expected page size is the value of the page-size field read from the database file header while opening the database connection (see section open_new_connection), or the page-size of the database file when the most recent read transaction was concluded. REQ H35210 REQ H35220 REQ H35230 REQ H35240

Reading Database Data

Add something about checking the page-cache first etc.

Ending a Read-only Transaction

To end a read-only transaction, SQLite simply relinquishes the shared lock on the file-handle open on the database file. No other action is required. REQ H35130

See also requirements H35180 and H35210 above.

Writing Data

Using DDL or DML SQL statements, SQLite users may modify the contents and size of a database file. Exactly how changes to the logical database are translated to modifications to the database file is described in ff_sqlitert_requirements. From the point of view of the sub-systems described in this document, each DDL or DML statement executed results in the contents of zero or more database file pages being overwritten with new data. A DDL or DML statement may also append or truncate one or more pages to or from the end of the database file. One or more DDL and/or DML statements are grouped together to make up a single write transaction. A write transaction is required to have the special properties described in section overview; a write transaction must be isolated, durable and atomic.

SQLite accomplishes these goals using the following techniques:

The page cache is used to buffer modifications to the database file image before they are written to the database file. When the contents of a page are required to be modified as the result of an operation within a write transaction, the modified copy is stored in the page cache. Similarly, if new pages are appended to the end of a database file, they are added to the page cache instead of being immediately written to the database file within the file-system.

Ideally, all changes for an entire write transaction are buffered in the page cache until the end of the transaction. When the user commits the transaction, all changes are applied to the database file in the most efficient way possible, taking into account the assumptions enumerated in section fs_performance. Unfortunately, since main-memory is a limited resource, this is not always possible for large transactions. In this case changes are buffered in the page cache until some internal condition or limit is reached, then written out to the database file in order to free resources as they are required. Section page_cache_algorithms describes the circumstances under which changes are flushed through to the database file mid-transaction to free page cache resources.

Even if an application or system failure does not occur while a write transaction is in progress, a rollback operation to restore the database file and page cache to the state that it was in before the transaction started may be required. This may occur if the user explicitly requests transaction rollback (by issuing a "ROLLBACK" command), or automatically, as a result of encountering an SQL constraint (see sql_sqlitert_requirements). For this reason, the original page content is stored in the journal file before the page is even modified within the page cache.

Introduce the following sub-sections.

Journal File Format

This section describes the format used by an SQLite journal file.

A journal file consists of one or more journal headers, zero or more journal records and optionally a master journal pointer. Each journal file always begins with a journal header, followed by zero or more journal records. Following this may be a second journal header followed by a second set of zero or more journal records and so on. There is no limit to the number of journal headers a journal file may contain. Following the journal headers and their accompanying sets of journal records may be the optional master journal pointer. Or, the file may simply end following the final journal record.

This section only describes the format of the journal file and the various objects that make it up. But because a journal file may be read by an SQLite process following recovery from a system failure (hot journal rollback, see section hot_journal_rollback) it is also important to describe the way the file is created and populated within the file-system using a combination of write file, sync file and truncate file operations. These are described in section write_transactions.

Journal Header Format

A journal header is sector-size bytes in size, where sector-size is the value returned by the xSectorSize method of the file handle opened on the database file. Only the first 28 bytes of the journal header are used; the remainder may contain garbage data. The first 28 bytes of each journal header consist of an 8-byte block set to a well-known value, followed by five big-endian 32-bit unsigned integer fields.

Figure - Journal Header Format

Figure figure_journal_header graphically depicts the layout of a journal header. The individual fields are described in the following table. The offsets in the 'byte offset' column of the table are relative to the start of the journal header.
Byte offset | Size in bytes | Description
0 | 8 | The journal magic field always contains a well-known 8-byte string value used to identify SQLite journal files. The well-known sequence of byte values is: 0xd9 0xd5 0x05 0xf9 0x20 0xa1 0x63 0xd7
8 | 4 | This field, the record count, is set to the number of journal records that follow this journal header in the journal file.
12 | 4 | The checksum initializer field is set to a pseudo-random value. It is used as part of the algorithm to calculate the checksum for all journal records that follow this journal header.
16 | 4 | This field, the database page count, is set to the number of pages that the database file contained before any modifications associated with the write transaction are applied.
20 | 4 | This field, the sector size, is set to the sector size of the device on which the journal file was created, in bytes. This value is required when reading the journal file to determine the size of each journal header.
24 | 4 | The page size field contains the database page size used by the corresponding database file when the journal file was created, in bytes.

All journal headers are positioned in the file so that they start at a sector size aligned offset. To achieve this, unused space may be left between the start of the second and subsequent journal headers and the end of the journal records associated with the previous header.
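The header layout above can be illustrated with a small encode/decode sketch. The function names are mine; only the magic bytes, the five big-endian 32-bit fields, and the sector-size padding are taken from the format description (real padding may contain garbage, zeroes are used here for simplicity).

```python
import struct

JOURNAL_MAGIC = bytes([0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, 0xd7])

def encode_journal_header(record_count, cksum_init, db_page_count,
                          sector_size, page_size):
    """Build a sector_size-byte journal header: the 8-byte magic
    followed by five big-endian 32-bit unsigned integer fields."""
    body = JOURNAL_MAGIC + struct.pack(">5I", record_count, cksum_init,
                                       db_page_count, sector_size, page_size)
    return body.ljust(sector_size, b"\x00")   # pad remainder of the sector

def decode_journal_header(header):
    if header[:8] != JOURNAL_MAGIC:
        raise ValueError("not a journal header")
    return struct.unpack_from(">5I", header, 8)
```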

Journal Record Format

Each journal record contains the original data for a database page modified by the write transaction. If a rollback is required, then this data may be used to restore the contents of the database page to the state it was in before the write transaction was started.

Figure - Journal Record Format

A journal record, depicted graphically by figure figure_journal_record, contains three fields, as described in the following table. Byte offsets are relative to the start of the journal record.
Byte offset | Size in bytes | Description
0 | 4 | The page number of the database page associated with this journal record, stored as a 4-byte big-endian unsigned integer.
4 | page-size | This field contains the original data for the page, exactly as it appeared in the database file before the write transaction began.
4 + page-size | 4 | This field contains a checksum value, calculated based on the contents of the journalled database page data (the previous field) and the value stored in the checksum initializer field of the preceding journal header.

The set of journal records that follow a journal header in a journal file are packed tightly together. There are no alignment requirements for journal records as there are for journal headers.

Master Journal Pointer

To support atomic transactions that modify more than one database file, SQLite sometimes includes a master journal pointer record in a journal file. Multiple file transactions are described in section multifile_transactions. A master journal pointer contains the name of a master journal file along with a checksum and some well-known values that allow the master journal pointer to be recognized as such when the journal file is read during a rollback operation (section rollback).

As is the case for a journal header, the start of a master journal pointer is always positioned at a sector size aligned offset. If the journal record or journal header that appears immediately before the master journal pointer does not end at an aligned offset, then unused space is left between the end of the journal record or journal header and the start of the master journal pointer.

Figure - Master Journal Pointer Format

A master journal pointer, depicted graphically by figure figure_master_journal_ptr, contains five fields, as described in the following table. Byte offsets are relative to the start of the master journal pointer.
Byte offset | Size in bytes | Description
0 | 4 | This field, the locking page number, is always set to the page number of the database locking page, stored as a 4-byte big-endian integer. The locking page is the page that begins at byte offset 2^30 of the database file. Even if the database file is large enough to contain the locking page, the locking page is never used to store any data, and so the first four bytes of a valid journal record will never contain this value. For further description of the locking page, refer to ff_sqlitert_requirements.
4 | name-length | The master journal name field contains the name of the master journal file, encoded as a UTF-8 string. There is no nul-terminator appended to the string.
4 + name-length | 4 | The name-length field contains the length of the previous field in bytes, formatted as a 4-byte big-endian unsigned integer.
8 + name-length | 4 | The checksum field contains a checksum value stored as a 4-byte big-endian signed integer. The checksum value is calculated as the sum of the bytes that make up the master journal name field, interpreting each byte as an 8-bit signed integer.
12 + name-length | 8 | Finally, the journal magic field always contains a well-known 8-byte string value; the same value stored in the first 8 bytes of a journal header. The well-known sequence of bytes is: 0xd9 0xd5 0x05 0xf9 0x20 0xa1 0x63 0xd7
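A master journal pointer can be assembled as sketched below. This is illustrative only: the field order, encodings, and signed-byte checksum rule come from the table above, while the derivation of the locking page number from the page size (the one-based number of the page containing byte offset 2^30) is my assumption and the function names are hypothetical.

```python
import struct

JOURNAL_MAGIC = bytes([0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, 0xd7])
LOCKING_PAGE_OFFSET = 1 << 30   # the locking page begins at this byte offset

def master_journal_checksum(name: bytes) -> int:
    # Sum of the name bytes, each interpreted as an 8-bit signed integer.
    return sum(b - 256 if b > 127 else b for b in name)

def encode_master_journal_ptr(page_size: int, master_name: str) -> bytes:
    name = master_name.encode("utf-8")        # no nul terminator
    # Assumed: one-based number of the page containing byte 2^30.
    locking_pgno = LOCKING_PAGE_OFFSET // page_size + 1
    return (struct.pack(">I", locking_pgno) + name
            + struct.pack(">Ii", len(name), master_journal_checksum(name))
            + JOURNAL_MAGIC)
```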

Write Transactions

This section describes the progression of an SQLite write transaction. From the point of view of the systems described in this document, most write transactions consist of three steps:

  1. The write transaction is opened. This process is described in section opening_a_write_transaction.

  2. The end-user executes DML or DDL SQL statements that require the structure of the database file to be modified. These modifications may be any combination of operations to

    • modify the content of an existing database page,
    • append a new database page to the database file image, or
    • truncate (discard) a database page from the end of the database file.
    These operations are described in detail in section modifying_appending_truncating. How user DDL or DML SQL statements are mapped to combinations of these three operations is described in ff_sqlitert_requirements.
  3. The write transaction is concluded and the changes made permanently committed to the database. The process required to commit a transaction is described in section committing_a_transaction.

As an alternative to step 3 above, the transaction may be rolled back. Transaction rollback is described in section rollback. Finally, it is also important to remember that a write transaction may be interrupted by a system failure at any point. In this case, the contents of the file-system (the database file and journal file) must be left in such a state so as to enable the database file to be restored to the state it was in before the interrupted write transaction was started. This is known as hot journal rollback, and is described in section hot_journal_rollback. Section fs_assumption_details describes the assumptions made regarding the effects of a system failure on the file-system contents following recovery.

Beginning a Write Transaction

Before any database pages may be modified within the page cache, the database connection must open a write transaction. Opening a write transaction requires that the database connection obtains a reserved lock (or greater) on the database file. Because obtaining a reserved lock on a database file guarantees that no other database connection may hold or obtain a reserved lock or greater, it follows that no other database connection may have an open write transaction.

A reserved lock on the database file may be thought of as an exclusive lock on the journal file. No database connection may read from or write to a journal file without a reserved or greater lock on the corresponding database file.

Before opening a write transaction, a database connection must have an open read transaction, opened via the procedure described in section open_read_only_trans. This ensures that there is no hot-journal file that needs to be rolled back and that any data stored in the page cache can be trusted.

Once a read transaction has been opened, upgrading to a write transaction is a two step process, as follows:

  1. A reserved lock is obtained on the database file.
  2. The journal file is opened (and created if necessary) using the VFS xOpen method, and a journal file header is written to the start of it using a single call to the file handle's xWrite method.

Requirements describing step 1 of the above procedure in detail: REQ H35350 REQ H35360 REQ H35580

Requirements describing step 2 of the above procedure in detail: REQ H35370 REQ H35380

Writing a Journal Header

Requirements describing how a journal header is appended to a journal file: REQ H35680 REQ H35690 REQ H35700 REQ H35710 REQ H35720 REQ H35730 REQ H35740

Modifying, Adding or Truncating a Database Page

When the end-user executes a DML or DDL SQL statement to modify the database schema or content, SQLite is required to update the database file image to reflect the new database state. This involves modifying the content of, appending, or truncating one or more database file pages. Instead of modifying the database file directly using the VFS interface, changes are first buffered within the page cache.

Before modifying a database page within the page cache that may need to be restored by a rollback operation, the page must be journalled. Journalling a page is the process of copying that page's original data into the journal file so that it can be recovered if the write transaction is rolled back. The process of journalling a page is described in section journalling_a_page. REQ H35590 REQ H35600

When a new database page is appended to a database file, there is no requirement to add a record to the journal file. If a rollback is required the database file will simply be truncated back to its original size based on the value stored at byte offset 12 of the journal file. REQ H35610

If required to truncate a database page from the end of the database file, the associated page cache entry is discarded. The adjusted size of the database file is stored internally. The database file is not actually truncated until the current write transaction is committed (see section committing_a_transaction). REQ H35620 REQ H35630

Journalling a Database Page

A page is journalled by adding a journal record to the journal file. The format of a journal record is described in section journal_record_format. REQ H35270 REQ H35280 REQ H35290

The checksum value written to the journal file immediately after the page data (requirement H35290), is a function of both the page data and the checksum initializer field stored in the journal header (see section journal_header_format). Specifically, it is the sum of the checksum initializer and the value of every 200th byte of page data interpreted as an 8-bit unsigned integer, starting with the (page-size % 200)'th byte of page data. For example, if the page-size is 1024 bytes, then a checksum is calculated by adding the values of the bytes at offsets 23, 223, 423, 623, 823 and 1023 (the last byte of the page) together with the value of the checksum initializer. REQ H35300

The '%' character is used in requirement H35300 to represent the modulo operator, just as it is in programming languages such as C, Java and Javascript.
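The checksum rule described above can be expressed compactly. A sketch, assuming page sizes are the usual powers of two not smaller than 512 bytes (so that page-size % 200 is never zero); the function name is mine:

```python
def journal_page_checksum(page: bytes, cksum_init: int) -> int:
    """Checksum initializer plus every 200th byte of page data,
    starting with the (len(page) % 200)'th byte (counting from 1)."""
    # For a 1024-byte page this visits offsets 23, 223, 423, 623, 823, 1023.
    first = len(page) % 200 - 1        # 0-based offset of that byte
    return cksum_init + sum(page[first::200])
```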

Syncing the Journal File

Even after the original data of a database page has been written into the journal file using calls to the journal file file-handle xWrite method (section journalling_a_page), it is still not safe to write to the page within the database file. This is because in the event of a system failure the data written to the journal file may still be corrupted (see section fs_characteristics). Before the page can be updated within the database itself, the following procedure takes place:

  1. The xSync method of the file-handle opened on the journal file is called. This operation ensures that all journal records in the journal file have been written to persistent storage, and that they will not become corrupted as a result of a subsequent system failure.
  2. The journal record count field (see section journal_header_format) of the most recently written journal header in the journal file is updated to contain the number of journal records added to the journal file since the header was written.
  3. The xSync method is called again, to ensure that the update to the journal record count has been committed to persistent storage.

If all three of the steps enumerated above are executed successfully, then it is safe to modify the content of the journalled database pages within the database file itself. The combination of the three steps above is referred to as syncing the journal file. REQ H35750 REQ H35760 REQ H35770
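The three-step sync procedure can be sketched as follows. The `journal` object and its method names are hypothetical; the point of the sketch is the ordering, in which the record-count update is bracketed by two syncs:

```python
def sync_journal_file(journal, record_count):
    # Step 1: ensure all journal records have reached persistent storage.
    journal.sync()
    # Step 2: only now update the most recent header's record-count field
    # to cover the records written since that header was written.
    journal.write_record_count(record_count)
    # Step 3: ensure the updated record count has reached storage too.
    journal.sync()
```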

Upgrading to an Exclusive Lock

Before the content of a page modified within the page cache may be written to the database file, an exclusive lock must be held on the database file. The purpose of this lock is to prevent another connection from reading from the database file while the first connection is midway through writing to it. Whether the reason for writing to the database file is because a transaction is being committed, or to free up space within the page cache, upgrading to an exclusive lock always occurs immediately after syncing the journal file. REQ H35780 REQ H35790

What happens if the exclusive lock cannot be obtained? It is not possible for the attempt to upgrade from a reserved to a pending lock to fail.

Committing a Transaction

Committing a write transaction is the final step in updating the database file. Committing a transaction is a seven-step process, summarized as follows:

  1. The database file header change counter field is incremented. The change counter, described in ff_sqlitert_requirements, is used by the cache validation procedure described in section cache_validation.

  2. The journal file is synced. The steps required to sync the journal file are described in section syncing_journal_file.

  3. Upgrade to an exclusive lock on the database file, if an exclusive lock is not already held. Upgrading to an exclusive lock is described in section upgrading_to_exclusive_lock.

  4. Copy the contents of all dirty pages stored in the page cache into the database file. The set of dirty pages are written to the database file in page number order in order to improve performance (see the assumptions in section fs_performance for details).

  5. The database file is synced to ensure that all updates are stored safely on the persistent media.

  6. The file-handle open on the journal file is closed and the journal file itself deleted. At this point the write transaction has been irrevocably committed.

  7. The database file is unlocked.
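The seven steps above can be sketched as a single sequence. The method names on the `pager` object are hypothetical; only the order of operations (and the page-number ordering of step 4) is taken from the list:

```python
def commit_write_transaction(pager):
    pager.increment_change_counter()          # 1. bump the header change counter
    pager.sync_journal()                      # 2. sync the journal file
    pager.lock_exclusive()                    # 3. upgrade to an exclusive lock
    for pgno in sorted(pager.dirty_pages()):  # 4. write dirty pages in
        pager.write_page(pgno)                #    page-number order
    pager.sync_database()                     # 5. sync the database file
    pager.close_and_delete_journal()          # 6. transaction is now committed
    pager.unlock()                            # 7. release the database lock
```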

Expand on and explain the above a bit.

The following requirements describe the steps enumerated above in more detail. REQ H35800

The change counter is a 4-byte big-endian integer field stored at byte offset 24 of the database file. The modification to page 1 required by H35800 is made using the process described in section modifying_appending_truncating. If page 1 has not already been journalled as a part of the current write-transaction, then incrementing the change counter may require that page 1 be journalled. In all cases the page cache entry corresponding to page 1 becomes a dirty page as part of incrementing the change counter value. REQ H35810 REQ H35820 REQ H35830 REQ H35840 REQ H35850 REQ H35860

Is the shared lock held after committing a write transaction?

Purging a Dirty Page

Usually, no data is actually written to the database file until the user commits the active write transaction. The exception is if a single write transaction contains too many modifications to be stored in the page cache. In this case, some of the database file modifications stored in the page cache must be applied to the database file before the transaction is committed so that the associated page cache entries can be purged from the page cache to free memory. Exactly when this condition is reached and dirty pages must be purged is described in section page_cache_algorithms.

Before the contents of a page cache entry can be written into the database file, the entry must meet the criteria for a writable dirty page, as defined in section page_cache_algorithms. If the dirty page selected for purging by the algorithms in section page_cache_algorithms is not writable, SQLite is required to sync the journal file. Immediately after the journal file is synced, all dirty pages associated with the database connection are classified as writable dirty pages. REQ H35640 REQ H35660

Appending a new journal header to the journal file is described in section writing_journal_header.

Once the dirty page being purged is writable, it is simply written into the database file. REQ H35670

Multi-File Transactions

Statement Transactions

Rollback

Hot Journal Rollback

Transaction Rollback

Statement Rollback

References

[1] C API Requirements Document.
[2] SQL Requirements Document.
[3] File Format Requirements Document.
}]