The transaction mechanism in SQLite is designed so that even if there is an OS crash or power failure in the middle of the commit, the partially written changes will be automatically rolled back the next time the database is opened. It will always appear as if either the database was completely changed or completely unchanged regardless of when the OS crash or power failure occurred. The restoration process is described in more detail on the CrashRecovery wiki page. It is recommented that you read and understand this page before attempting to study CrashRecovery.
The diagram at the right shows the initial state of the system. All database information is held on disk. Nothing is in the operating sytems's cache. And since the database has not yet been opened, nothing has been loading into user memory either.
The database is divided into fixed-size pages. The default page size is 1024 bytes, but this can be changed either at compile-time or using the page_size PRAGMA. In the diagram on the right, each page of the database is shown as a separate blue block.
Every page in a disk file can be potentially mirrored in the operating system disk cache. Initially, of course, nothing has been read from disk so nothing has been mirrored in cache. We show this in the diagram by not coloring in the pages of the database which are not yet in cache.
Before attempting to read the databse, SQLite obtains a "shared" lock on the database file. The shared lock insures that no other process will be writing to the file at the same time our process is trying to read it. If another process were to write to the file at the same time we are reading it, then we might read some old data together with some new data and the two would not necessary be consistent. The shared lock prevents this.
Even though file locks are associated with disk files, locks are not actually stored on disk. They are held in the operating system's file cache so that they disappear automatically following a reboot.
Some subset of the database (3 pages in this example) is read into user memory. The operating system also usually makes copies of these pages in its disk cache.
This example shows a database containing only 8 pages. A real database would likely contain hundreds or thousands or millions of pages. One of the advantages of using a b-tree based database engine, like SQLite, is that only a small subset of the database containing the information you actually need is read into memory.
The previous steps show what happens when SQLite only wants to read information from the database. Here we begin the process of trying to write. The first thing that occurs is that SQLite obtains a "reserved" lock on the database file. The reserved lock does little more than a "shared" lock. No other process can write to the database while we hold a reserved lock but other processes can continue to read from the database. The difference between a shared lock and a reserved lock is that many processes can hold a shared lock at once but only a single process at a time can hold a reserved lock. Another way of thinking about a reserved lock is that it signals that the current process intends to modify the database but has not yet actually gotten around to doing any modifications so no other process can being the writing process but other processes can continue to read for the time being.
Before any changes are made to the database, the original content of all pages that going to change is written into a rollback journal file. The name of the rollback journal is the same as the name of the original database with "-journal" appended. The rollback journal contains a short header (shown in green) and some other formatting information but is otherwise just a list of the content of the pages that are to be changed. The rollback journal contains everything that is needed to restore the database to its original state if the transaction is aborted.
Once the original page content has been written to the rollback journal, pages of the database in user memory can be modified. Modified pages are shown in red.
Notice that no part of the database has yet been changed on disk or in the OS disk cache. All of the changes have occurred in local memory only. So other processes can continue to read from the database file.
Also notice that the content of the rollback journal might yet only be in the operating systems write cache and not yet reached the disk surface.
Before going further, we have to make sure that the content of the rollback journal has been written to the disk surface. This insures that a rollback will be possible even if there is an unexpected powerloss during subsequent parts of the commit process. This is the first "sync" operation that occurs during commit. Actual synchronous disk I/O occurs at this point. This is one of the two steps that take a lot of time.
We are now ready to write the changes to the database file. But before continuing, we first need to obtain an exclusive lock on the database file. An exclusive lock guarantees that no other processes are reading from the database. We do not want to write while other processes are reading since our writes might interfere with the reads of the other processes.
Escalating a reserved lock to an exclusive lock is a really a two-step process. The first thing that happens si that the reserved lock is promoted to a "pending" lock. Both reserved and pending locks allow other processes to continue reading. But a pending lock prevents new processes from starting to read. Presumably, after setting a pending lock all other processes will eventually finish whatever reading they are doing and drop their shared locks, thus allowing us to continue. The pending lock prevents new processes from starting a read while other processes are finishing. In this way, a pending lock helps prevent writer starvation.
Once all other processes have finished reading the database (and dropped their shared locks) the exclusive lock can be obtained.
Once an exclusive lock is obtained, the modified pages of the database can be written. As with all writes, the information is normally stored only in the operating systems write cache and is not immediately written onto the disk surface.
The database changes must be flushed to disk. Otherwise the information might be lost if the operating system crashes or if there is an unexpected power failure. this is the second "sync" of a commit. This step, together with step 6 above, is why commits are slow.
After all database file changes have been written to disk, the rollback journal is deleted. This is the moment when the transaction commits. If a power loss occurs at any prior point during the transaction, then the transaction will automatically rollback the next time the database is accessed. After the rollback journal has been deleted, however, the transaction persists even if power is lost.
Deleting a file is not really an atomic operation, but it appears to be atomic to a user-space program like SQLite. Following a power failure, if a program (like SQLite) asks whether or not a file exists, it receives back either a "yes" or a "no" as an answer. When SQLite wants to know if a transaction has committed, it asks the OS if the rollback journal exists, and gets back an unambiguous "yes" or "no". In this sense, the commit is atomic, or at least appears to be from the point of view of a user-level program like SQLite.
Finally, the locks on the database file are removed, allowing other processes to once again read the database.
- commit-0.gif 6487 bytes added by drh on 2007-Apr-08 20:54:31 UTC.
- commit-1.gif 7265 bytes added by drh on 2007-Apr-08 20:54:50 UTC.
- commit-2.gif 7252 bytes added by drh on 2007-Apr-08 20:55:03 UTC.
- commit-3.gif 7668 bytes added by drh on 2007-Apr-08 20:55:12 UTC.
- commit-4.gif 9769 bytes added by drh on 2007-Apr-08 20:55:25 UTC.
- commit-5.gif 7716 bytes added by drh on 2007-Apr-08 20:55:35 UTC.
- commit-6.gif 7790 bytes added by drh on 2007-Apr-08 20:55:45 UTC.
- commit-7.gif 7477 bytes added by drh on 2007-Apr-08 20:55:59 UTC.
- commit-8.gif 8275 bytes added by drh on 2007-Apr-08 20:56:10 UTC.
- commit-9.gif 7704 bytes added by drh on 2007-Apr-08 20:56:20 UTC.
- commit-A.gif 9508 bytes added by drh on 2007-Apr-08 20:56:31 UTC.
- commit-B.gif 6718 bytes added by drh on 2007-Apr-08 20:56:42 UTC.