Documentation Source Text: Artifact [4e123bdff2]

Artifact 4e123bdff21487bca36c4e3131628aab438430ecd95fed5a9d46f87a7f46e7cd:

File pages/atomiccommit.in — part of check-in [f4e38c11f1] at 2017-09-27 09:47:43 on branch branch-3.20 — Fix two minor typos. (user: drh size: 66258) [more...]
<title>Atomic Commit In SQLite</title>
<tcl>hd_keywords {atomic commit} {*Atomic Commit}</tcl>
<table_of_contents>

<h1> Introduction</h1>

<p>An important feature of transactional databases like SQLite
is "atomic commit".  
Atomic commit means that either all database changes within a single 
transaction occur or none of them occur.  With atomic commit, it
is as if many different writes to different sections of the database
file occur instantaneously and simultaneously.
Real hardware serializes writes to mass storage, and writing
a single sector takes a finite amount of time.
So it is impossible to truly write many different sectors of a 
database file simultaneously and/or instantaneously.
But the atomic commit logic within
SQLite makes it appear as if the changes for a transaction
are all written instantaneously and simultaneously.</p>

<p>SQLite has the important property that transactions appear
to be atomic even if the transaction is interrupted by an
operating system crash or power failure.</p>

<p>This article describes the techniques used by SQLite to create the
illusion of atomic commit.</p>

<p>The information in this article applies only when SQLite is operating
in "rollback mode", or in other words when SQLite is not 
using a [write-ahead log].  SQLite still supports atomic commit when
write-ahead logging is enabled, but it accomplishes atomic commit by
a different mechanism from the one described in this article.  See
the [WAL | write-ahead log documentation] for additional information on how
SQLite supports atomic commit in that context.</p>

<tcl>hd_fragment hardware</tcl>
<h1> Hardware Assumptions</h1>

<p>Throughout this article, we will call the mass storage device "disk"
even though the mass storage device might really be flash memory.</p>

<p>We assume that disk is written in chunks which we call a "sector".
It is not possible to modify any part of the disk smaller than a sector.
To change a part of the disk smaller than a sector, you have to read in
the full sector that contains the part you want to change, make the
change, then write back out the complete sector.</p>

<p>On a traditional spinning disk, a sector is the minimum unit of transfer
in both directions, both reading and writing.  On flash memory, however,
the minimum size of a read is typically much smaller than a minimum write.
SQLite is only concerned with the minimum write amount and so for the
purposes of this article, when we say "sector" we mean the minimum amount
of data that can be written to mass storage in a single go.</p>

<p>
  Prior to SQLite version 3.3.14, a sector size of 512 bytes was
  assumed in all cases.  There was a compile-time option to change
  this but the code had never been tested with a larger value.  The
  512 byte sector assumption seemed reasonable since until very recently
  all disk drives used a 512 byte sector internally.  However, there
  has recently been a push to increase the sector size of disks to
  4096 bytes.  Also the sector size
  for flash memory is usually larger than 512 bytes.  For these reasons,
  versions of SQLite beginning with 3.3.14 have a method in the OS
  interface layer that interrogates the underlying filesystem to find
  the true sector size.  As currently implemented (version 3.5.0) this
  method still returns a hard-coded value of 512 bytes, since there
  is no standard way of discovering the true sector size on either
  Unix or Windows.  But the method is available for embedded device
  manufacturers to tweak according to their own needs.  And we have
  left open the possibility of filling in a more meaningful implementation
  on Unix and Windows in the future.</p>

<p>SQLite has traditionally assumed that a sector write is <u>not</u> atomic.
However, SQLite does always assume that a sector write is linear.  By "linear"
we mean that SQLite assumes that when writing a sector, the hardware begins
at one end of the data and writes byte by byte until it gets to
the other end.  The write might go from beginning to end or from
end to beginning.  If a power failure occurs in the middle of a
sector write it might be that part of the sector was modified
and another part was left unchanged.  The key assumption by SQLite
is that if any part of the sector gets changed, then either the
first or the last bytes will be changed.  So the hardware will
never start writing a sector in the middle and work towards the
ends.  We do not know if this assumption is always true but it
seems reasonable.</p>

<p>The previous paragraph states that SQLite does not assume that
sector writes are atomic.  This is true by default.  But as of
SQLite version 3.5.0, there is a new interface called the
Virtual File System ([VFS]) interface.  The [VFS] is the only means
by which SQLite communicates to the underlying filesystem.  The
code comes with default VFS implementations for Unix and Windows
and there is a mechanism for creating new custom VFS implementations
at runtime.  In this new VFS interface there is a method called
xDeviceCharacteristics.  This method interrogates the underlying
filesystem to discover various properties and behaviors that the
filesystem may or may not exhibit.  The xDeviceCharacteristics
method might indicate that sector writes are atomic, and if it does
so indicate, SQLite will try to take advantage of that fact.  But
the default xDeviceCharacteristics method for both Unix and Windows
does not indicate atomic sector writes and so these optimizations
are normally omitted.</p>

<p>SQLite assumes that the operating system will buffer writes and
that a write request will return before data has actually been stored
in the mass storage device.
SQLite further assumes that write operations will be reordered by
the operating system.
For this reason, SQLite does a "flush" or "fsync" operation at key
points.  SQLite assumes that the flush or fsync will not return until
all pending write operations for the file that is being flushed have
completed.  We are told that the flush and fsync primitives
are broken on some versions of Windows and Linux.  This is unfortunate.
It opens SQLite up to the possibility of database corruption following
a power loss in the middle of a commit.  However, there is nothing 
that SQLite can do to test for or remedy the situation.  SQLite
assumes that the operating system that it is running on works as
advertised.  If that is not quite the case, well then hopefully you
will not lose power too often.</p>

<p>SQLite assumes that when a file grows in length that the new
file space originally contains garbage and then later is filled in
with the data actually written.  In other words, SQLite assumes that
the file size is updated before the file content.  This is a 
pessimistic assumption and SQLite has to do some extra work to make
sure that it does not cause database corruption if power is lost
between the time when the file size is increased and when the
new content is written.  The xDeviceCharacteristics method of
the [VFS] might indicate that the filesystem will always write the
data before updating the file size.  (This is the 
SQLITE_IOCAP_SAFE_APPEND property for those readers who are looking
at the code.)  When the xDeviceCharacteristics method indicates
that files content is written before the file size is increased,
SQLite can forego some of its pedantic database protection steps
and thereby decrease the amount of disk I/O needed to perform a
commit.  The current implementation, however, makes no such assumptions
for the default VFSes for Windows and Unix.</p>

<p>SQLite assumes that a file deletion is atomic from the
point of view of a user process.  By this we mean that if SQLite
requests that a file be deleted and the power is lost during the
delete operation, once power is restored either the file will
exist completely with all if its original content unaltered, or
else the file will not be seen in the filesystem at all.  If
after power is restored the file is only partially deleted,
if some of its data has been altered or erased,
or the file has been truncated but not completely removed, then
database corruption will likely result.</p>

<p>SQLite assumes that the detection and/or correction of 
bit errors caused by cosmic rays, thermal noise, quantum
fluctuations, device driver bugs, or other mechanisms, is the 
responsibility of the underlying hardware and operating system.  
SQLite does not add any redundancy to the database file for
the purpose of detecting corruption or I/O errors.
SQLite assumes that the data it reads is exactly the same data 
that it previously wrote.</p>

<p>By default, SQLite assumes that an operating system call to write
a range of bytes will not damage or alter any bytes outside of that range
even if a power loss or OS crash occurs during that write.  We
call this the "[PSOW | powersafe overwrite]" property.  
Prior to [version 3.7.9] ([dateof:3.7.9]),
SQLite did not assume powersafe overwrite.  But with the standard
sector size increasing from 512 to 4096 bytes on most disk drives, it
has become necessary to assume powersafe overwrite in order to maintain
historical performance levels and so powersafe overwrite is assumed by
default in recent versions of SQLite.  The assumption of powersafe 
overwrite property can be disabled at compile-time or a run-time if
desired.  See the [PSOW | powersafe overwrite documentation] for further
details.


<a name="section_3_0"></a>
<h1> Single File Commit</h1>

<p>We begin with an overview of the steps SQLite takes in order to
perform an atomic commit of a transaction against a single database
file.  The details of file formats used to guard against damage from
power failures and techniques for performing an atomic commit across
multiple databases are discussed in later sections.</p>

<tcl>hd_fragment initstate</tcl>
<h2> Initial State</h2>

<img src="images/ac/commit-0.gif" align="right" hspace="15">

<p>The state of the computer when a database connection is
first opened is shown conceptually by the diagram at the
right.
The area of the diagram on the extreme right (labeled "Disk") represents
information stored on the mass storage device.  Each rectangle is
a sector.  The blue color represents that the sectors contain
original data.
The middle area is the operating systems disk cache.  At the
onset of our example, the cache is cold and this is represented
by leaving the rectangles of the disk cache empty.
The left area of the diagram shows the content of memory for
the process that is using SQLite.  The database connection has
just been opened and no information has been read yet, so the
user space is empty.
</p>
<br clear="both">

<tcl>hd_fragment rdlck</tcl>
<h2> Acquiring A Read Lock</h2>

<img src="images/ac/commit-1.gif" align="right" hspace="15">

<p>Before SQLite can write to a database, it must first read
the database to see what is there already.  Even if it is just
appending new data, SQLite still has to read in the database
schema from the <b>sqlite_master</b> table so that it can know
how to parse the INSERT statements and discover where in the
database file the new information should be stored.</p>

<p>The first step toward reading from the database file
is obtaining a shared lock on the database file.  A "shared"
lock allows two or more database connections to read from the
database file at the same time.  But a shared lock prevents
another database connection from writing to the database file
while we are reading it.  This is necessary because if another
database connection were writing to the database file at the
same time we are reading from the database file, we might read
some data before the change and other data after the change.
This would make it appear as if the change made by the other
process is not atomic.</p>

<p>Notice that the shared lock is on the operating system
disk cache, not on the disk itself.  File locks
really are just flags within the operating system kernel,
usually.  (The details depend on the specific OS layer
interface.)  Hence, the lock will instantly vanish if the
operating system crashes or if there is a power loss.  It
is usually also the case that the lock will vanish if the
process that created the lock exits.</p>

<br clear="both">

<a name="section_3_3"></a>
<h2> Reading Information Out Of The Database</h2>

<img src="images/ac/commit-2.gif" align="right" hspace="15">

<p>After the shared lock is acquired, we can begin reading
information from the database file.  In this scenario, we
are assuming a cold cache, so information must first be
read from mass storage into the operating system cache then
transferred from operating system cache into user space.
On subsequent reads, some or all of the information might
already be found in the operating system cache and so only
the transfer to user space would be required.</p>

<p>Usually only a subset of the pages in the database file
are read.  In this example we are showing three
pages out of eight being read.  In a typical application, a
database will have thousands of pages and a query will normally
only touch a small percentage of those pages.</p>

<br clear="both">

<tcl>hd_fragment rsvdlock</tcl>
<h2> Obtaining A Reserved Lock</h2>

<img src="images/ac/commit-3.gif" align="right" hspace="15">

<p>Before making changes to the database, SQLite first
obtains a "reserved" lock on the database file.  A reserved
lock is similar to a shared lock in that both a reserved lock
and shared lock allow other processes to read from the database
file.  A single reserve lock can coexist with multiple shared
locks from other processes.  However, there can only be a
single reserved lock on the database file.  Hence only a
single process can be attempting to write to the database
at one time.</p>

<p>The idea behind a reserved lock is that it signals that
a process intends to modify the database file in the near
future but has not yet started to make the modifications.
And because the modifications have not yet started, other
processes can continue to read from the database.  However,
no other process should also begin trying to write to the
database.</p>

<br clear="both">
<a name="section_3_5"></a>
<h2> Creating A Rollback Journal File</h2>
<img src="images/ac/commit-4.gif" align="right" hspace="15">

<p>Prior to making any changes to the database file, SQLite first
creates a separate rollback journal file and writes into the 
rollback journal the original
content of the database pages that are to be altered.
The idea behind the rollback journal is that it contains
all information needed to restore the database back to 
its original state.</p>

<p>The rollback journal contains a small header (shown in green
in the diagram) that records the original size of the database
file.  So if a change causes the database file to grow, we
will still know the original size of the database.  The page
number is stored together with each database page that is 
written into the rollback journal.</p>

<p>
  When a new file is created, most desktop operating systems
  (Windows, Linux, Mac OS X) will not actually write anything to
  disk.  The new file is created in the operating systems disk
  cache only.  The file is not created on mass storage until sometime
  later, when the operating system has a spare moment.  This creates
  the impression to users that I/O is happening much faster than
  is possible when doing real disk I/O.  We illustrate this idea in
  the diagram to the right by showing that the new rollback journal
  appears in the operating system disk cache only and not on the
  disk itself.</p>

<br clear="both">
<a name="section_3_6"></a>
<h2> Changing Database Pages In User Space</h2>
<img src="images/ac/commit-5.gif" align="right" hspace="15">

<p>After the original page content has been saved in the rollback
journal, the pages can be modified in user memory.  Each database
connection has its own private copy of user space, so the changes
that are made in user space are only visible to the database connection
that is making the changes.  Other database connections still see
the information in operating system disk cache buffers which have
not yet been changed.  And so even though one process is busy
modifying the database, other processes can continue to read their
own copies of the original database content.</p>

<br clear="both">
<a name="section_3_7"></a>
<h2> Flushing The Rollback Journal File To Mass Storage</h2>
<img src="images/ac/commit-6.gif" align="right" hspace="15">

<p>The next step is to flush the content of the rollback journal
file to nonvolatile storage.
As we will see later, 
this is a critical step in insuring that the database can survive
an unexpected power loss.
This step also takes a lot of time, since writing to nonvolatile
storage is normally a slow operation.</p>

<p>This step is usually more complicated than simply flushing
the rollback journal to the disk.  On most platforms two separate
flush (or fsync()) operations are required.  The first flush writes
out the base rollback journal content.  Then the header of the
rollback journal is modified to show the number of pages in the 
rollback journal.  Then the header is flushed to disk.  The details
on why we do this header modification and extra flush are provided
in a later section of this paper.</p>

<br clear="both">
<a name="section_3_8"></a>
<h2> Obtaining An Exclusive Lock</h2>
<img src="images/ac/commit-7.gif" align="right" hspace="15">

<p>Prior to making changes to the database file itself, we must
obtain an exclusive lock on the database file.  Obtaining an
exclusive lock is really a two-step process.  First SQLite obtains
a "pending" lock.  Then it escalates the pending lock to an
exclusive lock.</p>

<p>A pending lock allows other processes that already have a
shared lock to continue reading the database file.  But it
prevents new shared locks from being established.  The idea
behind a pending lock is to prevent writer starvation caused
by a large pool of readers.  There might be dozens, even hundreds,
of other processes trying to read the database file.  Each process
acquires a shared lock before it starts reading, reads what it
needs, then releases the shared lock.  If, however, there are
many different processes all reading from the same database, it
might happen that a new process always acquires its shared lock before
the previous process releases its shared lock.  And so there is
never an instant when there are no shared locks on the database
file and hence there is never an opportunity for the writer to
seize the exclusive lock.  A pending lock is designed to prevent
that cycle by allowing existing shared locks to proceed but
blocking new shared locks from being established.  Eventually
all shared locks will clear and the pending lock will then be
able to escalate into an exclusive lock.</p>

<br clear="both">
<a name="section_3_9"></a>
<h2> Writing Changes To The Database File</h2>
<img src="images/ac/commit-8.gif" align="right" hspace="15">

<p>Once an exclusive lock is held, we know that no other
processes are reading from the database file and it is
safe to write changes into the database file.  Usually
those changes only go as far as the operating systems disk
cache and do not make it all the way to mass storage.</p>

<br clear="both">
<a name="section_3_10"></a>
<h2>0 Flushing Changes To Mass Storage</h2>
<img src="images/ac/commit-9.gif" align="right" hspace="15">

<p>Another flush must occur to make sure that all the
database changes are written into nonvolatile storage.
This is a critical step to ensure that the database will
survive a power loss without damage.  However, because
of the inherent slowness of writing to disk or flash memory, 
this step together with the rollback journal file flush in section
3.7 above takes up most of the time required to complete a
transaction commit in SQLite.</p>

<br clear="both">
<a name="section_3_11"></a>
<h2>1 Deleting The Rollback Journal</h2>
<img src="images/ac/commit-A.gif" align="right" hspace="15">

<p>After the database changes are all safely on the mass
storage device, the rollback journal file is deleted.
This is the instant where the transaction commits.
If a power failure or system crash occurs prior to this
point, then recovery processes to be described later make
it appear as if no changes were ever made to the database
file.  If a power failure or system crash occurs after
the rollback journal is deleted, then it appears as if
all changes have been written to disk.  Thus, SQLite gives
the appearance of having made no changes to the database
file or having made the complete set of changes to the
database file depending on whether or not the rollback
journal file exists.</p>

<p>Deleting a file is not really an atomic operation, but
it appears to be from the point of view of a user process.
A process is always able to ask the operating system "does
this file exist?" and the process will get back a yes or no
answer.  After a power failure that occurs during a 
transaction commit, SQLite will ask the operating system
whether or not the rollback journal file exists.  If the
answer is "yes" then the transaction is incomplete and is
rolled back.  If the answer is "no" then it means the transaction
did commit.</p>

<p>The existence of a transaction depends on whether or
not the rollback journal file exists and the deletion
of a file appears to be an atomic operation from the point of
view of a user-space process.  Therefore, 
a transaction appears to be an atomic operation.</p>

<p>The act of deleting a file is expensive on many systems.
As an optimization, SQLite can be configured to truncate
the journal file to zero bytes in length
or overwrite the journal file header with zeros.  In either
case, the resulting journal file is no longer capable of rolling
back and so the transaction still commits.  Truncating a file
to zero length, like deleting a file, is assumed to be an atomic
operation from the point of view of a user process.  Overwriting
the header of the journal with zeros is not atomic, but if any
part of the header is malformed the journal will not roll back.
Hence, one can say that the commit occurs as soon as the header
is sufficiently changed to make it invalid.  Typically this happens
as soon as the first byte of the header is zeroed.</p>

<br clear="both">
<a name="section_3_12"></a>
<h2>2 Releasing The Lock</h2>
<img src="images/ac/commit-B.gif" align="right" hspace="15">

<p>The last step in the commit process is to release the
exclusive lock so that other processes can once again
start accessing the database file.</p>

<p>In the diagram at the right, we show that the information
that was held in user space is cleared when the lock is released.
This used to be literally true for older versions of SQLite.  But
more recent versions of SQLite keep the user space information
in memory in case it might be needed again at the start of the
next transaction.  It is cheaper to reuse information that is
already in local memory than to transfer the information back
from the operating system disk cache or to read it off of the
disk drive again.  Prior to reusing the information in user space,
we must first reacquire the shared lock and then we have to check
to make sure that no other process modified the database file while
we were not holding a lock.  There is a counter in the first page
of the database that is incremented every time the database file
is modified.  We can find out if another process has modified the
database by checking that counter.  If the database was modified,
then the user space cache must be cleared and reread.  But it is
commonly the case that no changes have been made and the user
space cache can be reused for a significant performance savings.</p>

<br clear="both">
<tcl>hd_fragment rollback</tcl>
<h1> Rollback</h1>

<p>An atomic commit is supposed to happen instantaneously.  But the processing
described above clearly takes a finite amount of time.
Suppose the power to the computer were cut
part way through the commit operation described above.  In order
to maintain the illusion that the changes were instantaneous, we
have to "rollback" any partial changes and restore the database to
the state it was in prior to the beginning of the transaction.</p>

<tcl>hd_fragment crisis</tcl>
<h2> When Something Goes Wrong...</h2>
<img src="images/ac/rollback-0.gif" align="right" hspace="15">

<p>Suppose the power loss occurred
during <a href="#section_3_10">step 3.10</a> above,
while the database changes were being written to disk.
After power is restored, the situation might be something
like what is shown to the right.  We were trying to change
three pages of the database file but only one page was
successfully written.  Another page was partially written
and a third page was not written at all.</p>

<p>The rollback journal is complete and intact on disk when
the power is restored.  This is a key point.  The reason for
the flush operation in <a href="#section_3_7">step 3.7</a>
is to make absolutely sure that
all of the rollback journal is safely on nonvolatile storage
prior to making any changes to the database file itself.</p>

<br clear="both">
<a name="section_4_2"></a>
<h2> Hot Rollback Journals</h2>
<img src="images/ac/rollback-1.gif" align="right" hspace="15">

<p>The first time that any SQLite process attempts to access
the database file, it obtains a shared lock as described in
<a href="section_3_2">section 3.2</a> above.
But then it notices that there is a 
rollback journal file present.  SQLite then checks to see if
the rollback journal is a "hot journal".   A hot journal is
a rollback journal that needs to be played back in order to
restore the database to a sane state.  A hot journal only
exists when an earlier process was in the middle of committing
a transaction when it crashed or lost power.</p>

<p>A rollback journal is a "hot" journal if all of the following
are true:</p>

<ul>
<li>The rollback journal exists.
<li>The rollback journal is not an empty file.
<li>There is no reserved lock on the main database file.
<li>The header of the rollback journal is well-formed and in particular
    has not been zeroed out.
<li>The rollback journal does not
contain the name of a master journal file (see
<a href="#section_5_5">section 5.5</a> below) or if does
contain the name of a master journal, then that master journal
file exists.
</ul>

<p>The presence of a hot journal is our indication
that a previous process was trying to commit a transaction but
it aborted for some reason prior to the completion of the
commit.  A hot journal means that
the database file is in an inconsistent state and needs to
be repaired (by rollback) prior to being used.</p>

<br clear="both">
<tcl>hd_fragment exlock</tcl>
<h2> Obtaining An Exclusive Lock On The Database</h2>
<img src="images/ac/rollback-2.gif" align="right" hspace="15">

<p>The first step toward dealing with a hot journal is to
obtain an exclusive lock on the database file.  This prevents two
or more processes from trying to rollback the same hot journal
at the same time.</p>

<br clear="both">
<a name="section_4_4"></a>
<h2> Rolling Back Incomplete Changes</h2>
<img src="images/ac/rollback-3.gif" align="right" hspace="15">

<p>Once a process obtains an exclusive lock, it is permitted
to write to the database file.  It then proceeds to read the
original content of pages out of the rollback journal and write
that content back to where it came from in the database file.
Recall that the header of the rollback journal records the original
size of the database file prior to the start of the aborted
transaction.  SQLite uses this information to truncate the
database file back to its original size in cases where the
incomplete transaction caused the database to grow.  At the
end of this step, the database should be the same size and
contain the same information as it did before the start of
the aborted transaction.</p>

<br clear="both">
<tcl>hd_fragment delhotjrnl</tcl>
<h2> Deleting The Hot Journal</h2>
<img src="images/ac/rollback-4.gif" align="right" hspace="15">

<p>After all information in the rollback journal has been
played back into the database file (and flushed to disk in case
we encounter yet another power failure), the hot rollback journal
can be deleted.</p>

<p>As in <a href="#section_3_11">section 3.11</a>, the journal
file might be truncated to zero length or its header might
be overwritten with zeros as an optimization on systems where
deleting a file is expensive.  Either way, the journal is no 
longer hot after this step.</p>

<br clear="both">
<tcl>hd_fragment cont</tcl>
<h2> Continue As If The Uncompleted Writes Had Never Happened</h2>
<img src="images/ac/rollback-5.gif" align="right" hspace="15">

<p>The final recovery step is to reduce the exclusive lock back
to a shared lock.  Once this happens, the database is back in the
state that it would have been if the aborted transaction had never
started.  Since all of this recovery activity happens completely
automatically and transparently, it appears to the program using
SQLite as if the aborted transaction had never begun.</p>

<br clear="both">
<tcl>hd_fragment multicommit</tcl>
<h1> Multi-file Commit</h1>

<p>SQLite allows a single 
<a href="c3ref/sqlite3.html">database connection</a> to talk to
two or more database files simultaneously through the use of
the <a href="lang_attach.html">ATTACH DATABASE</a> command.
When multiple database files are modified within a single
transaction, all files are updated atomically.  
In other words, either all of the database files are updated or
else none of them are.
Achieving an atomic commit across multiple database files is
more complex that doing so for a single file.  This section
describes how SQLite works that bit of magic.</p>

<tcl>hd_fragment multijrnl</tcl>
<h2> Separate Rollback Journals For Each Database</h2>
<img src="images/ac/multi-0.gif" align="right" hspace="15">

<p>When multiple database files are involved in a transaction,
each database has its own rollback journal and each database
is locked separately.  The diagram at the right shows a scenario
where three different database files have been modified within
one transaction.  The situation at this step is analogous to 
the single-file transaction scenario at 
<a href="#section_3_6">step 3.6</a>.  Each database file has
a reserved lock.  For each database, the original content of pages 
that are being changed have been written into the rollback journal
for that database, but the content of the journals have not yet
been flushed to disk.  No changes have been made to the database
file itself yet, though presumably there are changes being held
in user memory.</p>

<p>For brevity, the diagrams in this section are simplified from
those that came before.  Blue color still signifies original content
and pink still signifies new content.  But the individual pages
in the rollback journal and the database file are not shown and
we are not making the distinction between information in the
operating system cache and information that is on disk.  All of
these factors still apply in a multi-file commit scenario.  They
just take up a lot of space in the diagrams and they do not add
any new information, so they are omitted here.</p>

<br clear="both">
<tcl>hd_fragment masterjrnl</tcl>
<h2> The Master Journal File</h2>
<img src="images/ac/multi-1.gif" align="right" hspace="15">

<p>The next step in a multi-file commit is the creation of a
"master journal" file.  The name of the master journal file is
the same name as the original database filename (the database
that was opened using the 
<a href="c3ref/open.html">sqlite3_open()</a> interface,
not one of the <a href="lang_attach.html">ATTACHed</a> auxiliary
databases) with the text "<b>-mj</b><i>HHHHHHHH</i>" appended where
<i>HHHHHHHH</i> is a random 32-bit hexadecimal number.  The
random <i>HHHHHHHH</i> suffix changes for every new master journal.</p>

<p><i>(Nota bene: The formula for computing the master journal filename
given in the previous paragraph corresponds to the implementation as
of SQLite version 3.5.0.  But this formula is not part of the SQLite
specification and is subject to change in future releases.)</i></p>

<p>Unlike the rollback journals, the master journal does not contain
any original database page content.  Instead, the master journal contains
the full pathnames for rollback journals for every database that is
participating in the transaction.</p>

<p>After the master journal is constructed, its content is flushed
to disk before any further actions are taken.  On Unix, the directory
that contains the master journal is also synced in order to make sure
the master journal file will appear in the directory following a
power failure.</p>

<p>The purpose of the master journal is to ensure that multi-file
transactions are atomic across a power-loss.  But if the database files
have other settings that compromise integrity across a power-loss event
(such as [PRAGMA synchronous=OFF] or [PRAGMA journal_mode=MEMORY]) then
the creation of the master journal is omitted, as an optimization.

<br clear="both">
<tcl>hd_fragment multijrnlupdate</tcl>
<h2> Updating Rollback Journal Headers</h2>
<img src="images/ac/multi-2.gif" align="right" hspace="15">

<p>The next step is to record the full pathname of the master journal file
in the header of every rollback journal.  Space to hold the master
journal filename was reserved at the beginning of each rollback journal
as the rollback journals were created.</p>

<p>The content of each rollback journal is flushed to disk both before
and after the master journal filename is written into the rollback
journal header.  It is important to do both of these flushes.  Fortunately,
the second flush is usually inexpensive since typically only a single
page of the journal file (the first page) has changed.</p>

<p>This step is analogous to 
<a href="#section_3_7">step 3.7</a> in the single-file commit
scenario described above.</p>

<br clear="both">
<tcl>hd_fragment multidbupdate</tcl>
<h2> Updating The Database Files</h2>
<img src="images/ac/multi-3.gif" align="right" hspace="15">

<p>Once all rollback journal files have been flushed to disk, it
is safe to begin updating database files.  We have to obtain an
exclusive lock on all database files before writing the changes.
After all the changes are written, it is important to flush the
changes to disk so that they will be preserved in the event of
a power failure or operating system crash.</p>

<p>This step corresponds to steps
<a href="#section_3_8">3.8</a>,
<a href="#section_3_9">3.9</a>, and
<a href="#section_3_10">3.10</a> in the single-file commit
scenario described previously.</p>


<br clear="both">
<a name="section_5_5"></a>
<h2> Delete The Master Journal File</h2>
<img src="images/ac/multi-4.gif" align="right" hspace="15">

<p>The next step is to delete the master journal file.
This is the point where the multi-file transaction commits.
This step corresponds to 
<a href="#section_3_11">step 3.11</a> in the single-file
commit scenario where the rollback journal is deleted.</p>

<p>If a power failure or operating system crash occurs at this
point, the transaction will not rollback when the system reboots
even though there are rollback journals present.  The
difference is the master journal pathname in the header of the
rollback journal.  Upon restart, SQLite only considers a journal
to be hot and will only playback the journal if there is no
master journal filename in the header (which is the case for
a single-file commit) or if the master journal file still
exists on disk.</p>

<br clear="both">
<tcl>hd_fragment cleanup</tcl>
<h2> Clean Up The Rollback Journals</h2>
<img src="images/ac/multi-5.gif" align="right" hspace="15">

<p>The final step in a multi-file commit is to delete the
individual rollback journals and drop the exclusive locks on
the database files so that other processes can see the changes.
This corresponds to 
<a href="#section_3_12">step 3.12</a> in the single-file
commit sequence.</p>

<p>The transaction has already committed at this point so timing
is not critical in the deletion of the rollback journals.
The current implementation deletes a single rollback journal
then unlocks the corresponding database file before proceeding
to the next rollback journal.  But in the future we might change
this so that all rollback journals are deleted before any database
files are unlocked.  As long as the rollback journal is deleted before
its corresponding database file is unlocked it does not matter in what
order the rollback journals are deleted or the database files are
unlocked.</p>

<tcl>hd_fragment moredetail</tcl>
<h1> Additional Details Of The Commit Process</h1>

<p><a href="#section_3_0">Section 3.0</a> above provides an overview of
how atomic commit works in SQLite.  But it glosses over a number of
important details.  The following subsections will attempt to fill
in the gaps.</p>

<tcl>hd_fragment completesectors</tcl>
<h2> Always Journal Complete Sectors</h2>

<p>When the original content of a database page is written into
the rollback journal (as shown in <a href="#section_3_5">section 3.5</a>),
SQLite always writes a complete sector of data, even if the
page size of the database is smaller than the sector size.  
Historically, the sector size in SQLite has been hard coded to 512
bytes and since the minimum page size is also 512 bytes, this has never
been an issue.  But beginning with SQLite version 3.3.14, it is possible
for SQLite to use mass storage devices with a sector size larger than 512
bytes.  So, beginning with version 3.3.14, whenever any page within a
sector is written into the journal file, all pages in that same sector
are stored with it.</p>

<p>It is important to store all pages of a sector in the rollback
journal in order to prevent database corruption following a power
loss while writing the sector.  Suppose that pages 1, 2, 3, and 4 are
all stored in sector 1 and that page 2 is modified.  In order to write
the changes to page 2, the underlying hardware must also rewrite the
content of pages 1, 3, and 4 since the hardware must write the complete
sector.  If this write operation is interrupted by a power outage,
one or more of the pages 1, 3, or 4 might be left with incorrect data.
Hence, to avoid lasting corruption to the database, the original content
of all of those pages must be contained in the rollback journal.</p>

<tcl>hd_fragment journalgarbage</tcl>
<h2> Dealing With Garbage Written Into Journal Files</h2>

<p>When data is appended to the end of the rollback journal,
SQLite normally makes the pessimistic assumption that the file
is first extended with invalid "garbage" data and that afterwards
the correct data replaces the garbage.  In other words, SQLite assumes
that the file size is increased first and then afterwards the content
is written into the file.  If a power failure occurs after the file
size has been increased but before the file content has been written,
the rollback journal can be left containing garbage data.  If after
power is restored, another SQLite process sees the rollback journal
containing the garbage data and tries to roll it back into the original
database file, it might copy some of the garbage into the database file
and thus corrupt the database file.</p>

<p>SQLite uses two defenses against this problem.  In the first place,
SQLite records the number of pages in the rollback journal in the header
of the rollback journal.  This number is initially zero.  So during an
attempt to rollback an incomplete (and possibly corrupt) rollback
journal, the process doing the rollback will see that the journal
contains zero pages and will thus make no changes to the database.  Prior
to a commit, the rollback journal is flushed to disk to ensure that
all content has been synced to disk and there is no "garbage" left
in the file, and only then is the page count in the header changed from
zero to true number of pages in the rollback journal.  The rollback journal
header is always kept in a separate sector from any page data so that
it can be overwritten and flushed without risking damage to a data
page if a power outage occurs.  Notice that the rollback journal
is flushed to disk twice: once to write the page content and a second
time to write the page count in the header.</p>

<p>The previous paragraph describes what happens when the
synchronous pragma setting is "full".</p>

<blockquote>
PRAGMA synchronous=FULL;
</blockquote>

<p>The default synchronous setting is full so the above is what usually
happens.  However, if the synchronous setting is lowered to "normal",
SQLite only flushes the rollback journal once, after the page count has
been written.
This carries a risk of corruption because it might happen that the 
modified (non-zero) page count reaches the disk surface before all
of the data does.  The data will have been written first, but SQLite
assumes that the underlying filesystem can reorder write requests and
that the page count can be burned into oxide first even though its
write request occurred last.  So as a second line of defense, SQLite
also uses a 32-bit checksum on every page of data in the rollback
journal.  This checksum is evaluated for each page during rollback
while rolling back a journal as described in 
<a href="#section_4_4">section 4.4</a>.  If an incorrect checksum
is seen, the rollback is abandoned.  Note that the checksum does
not guarantee that the page data is correct since there is a small
but finite probability that the checksum might be right even if the data is
corrupt.  But the checksum does at least make such an error unlikely.
</p>

<p>Note that the checksums in the rollback journal are not necessary
if the synchronous setting is FULL.  We only depend on the checksums
when synchronous is lowered to NORMAL.  Nevertheless, the checksums
never hurt and so they are included in the rollback journal regardless
of the synchronous setting.</p>

<tcl>hd_fragment cachespill</tcl>
<h2> Cache Spill Prior To Commit</h2>

<p>The commit process shown in <a href="#section_3_0">section 3.0</a>
assumes that all database changes fit in memory until it is time to
commit.  This is the common case.  But sometimes a larger change will
overflow the user-space cache prior to transaction commit.  In those
cases, the cache must spill to the database before the transaction
is complete.</p>

<p>At the beginning of a cache spill, the status of the database
connection is as shown in <a href="#section_3_6">step 3.6</a>.
Original page content has been saved in the rollback journal and
modifications of the pages exist in user memory.  To spill the cache,
SQLite executes steps <a href="#section_3_7">3.7</a> through
<a href="#section_3_9">3.9</a>.  In other words, the rollback journal
is flushed to disk, an exclusive lock is acquired, and changes are
written into the database.  But the remaining steps are deferred
until the transaction really commits.  A new journal header is
appended to the end of the rollback journal (in its own sector)
and the exclusive database lock is retained, but otherwise processing
returns to <a href="#section_3_6">step 3.6</a>.  When the transaction
commits, or if another cache spill occurs, steps
<a href="#section_3_7">3.7</a> and <a href="#section_3_9">3.9</a> are
repeated.  (Step <a href="#section_3_8">3.8</a> is omitted on second
and subsequent passes since an exclusive database lock is already held
due to the first pass.)</p>

<p>A cache spill causes the lock on the database file to
escalate from reserved to exclusive.  This reduces concurrency.
A cache spill also causes extra disk flush or fsync operations to
occur and these operations are slow, hence a cache spill can
seriously reduce performance.
For these reasons a cache spill is avoided whenever possible.</p>

<tcl>hd_fragment opts</tcl>
<h1> Optimizations</h1>

<p>Profiling indicates that for most systems and in most circumstances
SQLite spends most of its time doing disk I/O.  It follows then that
anything we can do to reduce the amount of disk I/O will likely have a
large positive impact on the performance of SQLite.  This section
describes some of the techniques used by SQLite to try to reduce the
amount of disk I/O to a minimum while still preserving atomic commit.</p>

<tcl>hd_fragment keepcache</tcl>
<h2> Cache Retained Between Transactions</h2>

<p><a href="#section_3_12">Step 3.12</a> of the commit process shows
that once the shared lock has been released, all user-space cache
images of database content must be discarded.  This is done because
without a shared lock, other processes are free to modify the database
file content and so any user-space image of that content might become
obsolete.  Consequently, each new transaction would begin by rereading
data which had previously been read.  This is not as bad as it sounds
at first since the data being read is still likely in the operating
systems file cache.  So the "read" is really just a copy of data
from kernel space into user space.  But even so, it still takes time.</p>

<p>Beginning with SQLite version 3.3.14 a mechanism has been added
to try to reduce the needless rereading of data.  In newer versions
of SQLite, the data in the user-space pager cache is retained when
the lock on the database file is released.  Later, after the
shared lock is acquired at the beginning of the next transaction,
SQLite checks to see if any other process has modified the database
file.  If the database has been changed in any way since the lock
was last released, the user-space cache is erased at that point.
But commonly the database file is unchanged and the user-space cache
can be retained, and some unnecessary read operations can be avoided.</p>

<p>In order to determine whether or not the database file has changed,
SQLite uses a counter in the database header (in bytes 24 through 27)
which is incremented during every change operation.  SQLite saves a copy
of this counter prior to releasing its database lock.  Then after
acquiring the next database lock it compares the saved counter value
against the current counter value and erases the cache if the values
are different, or reuses the cache if they are the same.</p>

<a name="section_7_2"></a>
<h2> Exclusive Access Mode</h2>

<p>SQLite version 3.3.14 adds the concept of "Exclusive Access Mode".
In exclusive access mode, SQLite retains the exclusive
database lock at the conclusion of each transaction.  This prevents
other processes from accessing the database, but in many deployments
only a single process is using a database so this is not a
serious problem.  The advantage of exclusive access mode is that
disk I/O can be reduced in three ways:</p>

<ol>
<li><p>It is not necessary to increment the change counter in the
database header for transactions after the first transaction.  This
will often save a write of page one to both the rollback
journal and the main database file.</p></li>

<li><p>No other processes can change the database so there is never
a need to check the change counter and clear the user-space cache
at the beginning of a transaction.</p></li>

<li><p>Each transaction can be committed by overwriting the rollback
journal header with zeros rather than deleting the journal file.
This avoids having to modify the directory entry for the journal file
and it avoids having to deallocate disk sectors associated with the 
journal.  Furthermore, the next transaction will overwrite existing
journal file content rather than append new content and on most systems
overwriting is much faster than appending.</p></li>
</ol>

<p>The third optimization, zeroing the journal file header rather than
deleting the rollback journal file,
does not depend on holding an exclusive lock at all times.
This optimization can be set independently of exclusive lock mode
using the [journal_mode pragma]
as described in <a href="#section_7_6">section 7.6</a> below.</p>

<tcl>hd_fragment freelistjrnl</tcl>
<h2> Do Not Journal Freelist Pages</h2>

<p>When information is deleted from an SQLite database, the pages used
to hold the deleted information are added to a "[freelist]".  Subsequent
inserts will draw pages off of this freelist rather than expanding the
database file.</p>

<p>Some freelist pages contain critical data; specifically the locations
of other freelist pages.  But most freelist pages contain nothing useful.
These latter freelist pages are called "leaf" pages.  We are free to
modify the content of a leaf freelist page in the database without
changing the meaning of the database in any way.</p>

<p>Because the content of leaf freelist pages is unimportant, SQLite
avoids storing leaf freelist page content in the rollback journal
in <a href="#section_3_5">step 3.5</a> of the commit process.
If a leaf freelist page is changed and that change does not get rolled back
during a transaction recovery, the database is not harmed by the omission.
Similarly, the content of a new freelist page is never written back
into the database at <a href="#section_3_9">step 3.9</a> nor
read from the database at <a href="#section_3_3">step 3.3</a>.
These optimizations can greatly reduce the amount of I/O that occurs
when making changes to a database file that contains free space.</p>

<tcl>hd_fragment atomicsector</tcl>
<h2> Single Page Updates And Atomic Sector Writes</h2>

<p>Beginning in SQLite version 3.5.0, the new Virtual File System (VFS)
interface contains a method named xDeviceCharacteristics which reports
on special properties that the underlying mass storage device
might have.  Among the special properties that
xDeviceCharacteristics might report is the ability of to do an
atomic sector write.</p>

<p>Recall that by default SQLite assumes that sector writes are
linear but not atomic.  A linear write starts at one end of the
sector and changes information byte by byte until it gets to the
other end of the sector.  If a power loss occurs in the middle of
a linear write then part of the sector might be modified while the
other end is unchanged.  In an atomic sector write, either the entire
sector is overwritten or else nothing in the sector is changed.</p>

<p>We believe that most modern disk drives implement atomic sector
writes.  When power is lost, the drive uses energy stored in capacitors
and/or the angular momentum of the disk platter to provide power to 
complete any operation in progress.  Nevertheless, there are so many
layers in between the write system call and the on-board disk drive
electronics that we take the safe approach in both Unix and w32 VFS
implementations and assume that sector writes are not atomic.  On the
other hand, device
manufacturers with more control over their filesystems might want
to consider enabling the atomic write property of xDeviceCharacteristics
if their hardware really does do atomic writes.</p>

<p>When sector writes are atomic and the page size of a database is
the same as a sector size, and when there is a database change that
only touches a single database page, then SQLite skips the whole
journaling and syncing process and simply writes the modified page
directly into the database file.  The change counter in the first
page of the database file is modified separately since no harm is
done if power is lost before the change counter can be updated.</p>

<tcl>hd_fragment safeappend</tcl>
<h2> Filesystems With Safe Append Semantics</h2>

<p>Another optimization introduced in SQLite version 3.5.0 makes
use of "safe append" behavior of the underlying disk.
Recall that SQLite assumes that when data is appended to a file
(specifically to the rollback journal) that the size of the file
is increased first and that the content is written second.  So
if power is lost after the file size is increased but before the
content is written, the file is left containing invalid "garbage"
data.  The xDeviceCharacteristics method of the VFS might, however,
indicate that the filesystem implements "safe append" semantics.
This means that the content is written before the file size is
increased so that it is impossible for garbage to be introduced
into the rollback journal by a power loss or system crash.</p>

<p>When safe append semantics are indicated for a filesystem,
SQLite always stores the special value of -1 for the page count
in the header of the rollback journal.  The -1 page count value
tells any process attempting to rollback the journal that the
number of pages in the journal should be computed from the journal
size.  This -1 value is never changed.  So that when a commit
occurs, we save a single flush operation and a sector write of
the first page of the journal file.  Furthermore, when a cache
spill occurs we no longer need to append a new journal header
to the end of the journal; we can simply continue appending
new pages to the end of the existing journal.</p>

<a name="section_7_6"></a>
<h2> Persistent Rollback Journals</h2>

<p>Deleting a file is an expensive operation on many systems.
So as an optimization, SQLite can be configured to avoid the
delete operation of <a href="#section_3_11">section 3.11</a>.
Instead of deleting the journal file in order to commit a transaction,
the file is either truncated to zero bytes in length or its
header is overwritten with zeros.  Truncating the file to zero
length saves having to make modifications to the directory containing
the file since the file is not removed from the directory. 
Overwriting the header has the additional savings of not having
to update the length of the file (in the "inode" on many systems)
and not having to deal with newly freed disk sectors.  Furthermore,
at the next transaction the journal will be created by overwriting
existing content rather than appending new content onto the end
of a file, and overwriting is often much faster than appending.</p>

<p>SQLite can be configured to commit transactions by overwriting
the journal header with zeros instead of deleting the journal file
by setting the "PERSIST" journaling mode using the 
<a href="pragma.html#pragma_journal_mode">journal_mode</a> PRAGMA.
For example:</p>

<blockquote><pre>
PRAGMA journal_mode=PERSIST;
</per></blockquote>

<p>The use of persistent journal mode provides a noticeable performance
improvement on many systems.  Of course, the drawback is that the 
journal files remain on the disk, using disk space and cluttering
directories, long after the transaction commits.  The only safe way
to delete a persistent journal file is to commit a transaction
with journaling mode set to DELETE:</p>

<blockquote><pre>
PRAGMA journal_mode=DELETE;
BEGIN EXCLUSIVE;
COMMIT;
</per></blockquote>

<p>Beware of deleting persistent journal files by any other means
since the journal file might be hot, in which case deleting it will
corrupt the corresponding database file.</p>

<p>Beginning in SQLite [version 3.6.4] ([dateof:3.6.4]), 
the TRUNCATE journal mode is
also supported:</p>

<blockquote><pre>
PRAGMA journal_mode=TRUNCATE;
</pre></blockquote>

<p>In truncate journal mode, the transaction is committed by truncating
the journal file to zero length rather than deleting the journal file
(as in DELETE mode) or by zeroing the header (as in PERSIST mode).
TRUNCATE mode shares the advantage of PERSIST mode that the directory
that contains the journal file and database does not need to be updated.
Hence truncating a file is often faster than deleting it.  TRUNCATE has
the additional advantage that it is not followed by a
system call (ex: fsync()) to synchronize the change to disk.  It might
be safer if it did.
But on many modern filesystems, a truncate is an atomic and
synchronous operation and so we think that TRUNCATE will usually be safe
in the face of power failures.  If you are uncertain about whether or
not TRUNCATE will be synchronous and atomic on your filesystem and it is
important to you that your database survive a power loss or operating
system crash that occurs during the truncation operation, then you might
consider using a different journaling mode.</p>

<p>On embedded systems with synchronous filesystems, TRUNCATE results
in slower behavior than PERSIST.  The commit operation is the same speed.
But subsequent transactions are slower following a TRUNCATE because it is
faster to overwrite existing content than to append to the end of a file.
New journal file entries will always be appended following a TRUNCATE but
will usually overwrite with PERSIST.</p>

<tcl>hd_fragment  testing</tcl>
<h1> Testing Atomic Commit Behavior</h1>

<p>The developers of SQLite are confident that it is robust
in the face of power failures and system crashes because the
automatic test procedures do extensive checks on
the ability of SQLite to recover from simulated power loss.
We call these the "crash tests".</p>

<p>Crash tests in SQLite use a modified VFS that can simulate
the kinds of filesystem damage that occur during a power
loss or operating system crash.  The crash-test VFS can simulate
incomplete sector writes, pages filled with garbage data because
a write has not completed, and out of order writes, all occurring
at varying points during a test scenario.  Crash tests execute
transactions over and over, varying the time at which a simulated
power loss occurs and the properties of the damage inflicted.
Each test then reopens the database after the simulated crash and
verifies that the transaction either occurred completely
or not at all and that the database is in a completely
consistent state.</p>

<p>The crash tests in SQLite have discovered a number of very
subtle bugs (now fixed) in the recovery mechanism.  Some of 
these bugs were very obscure and unlikely to have been found
using only code inspection and analysis techniques.  From this
experience, the developers of SQLite feel confident that any other
database system that does not use a similar crash test system
likely contains undetected bugs that will lead to database
corruption following a system crash or power failure.</p>

<tcl>hd_fragment {sect_9_0} {Things That Can Go Wrong}</tcl>
<h1> Things That Can Go Wrong</h1>

<p>The atomic commit mechanism in SQLite has proven to be robust,
but it can be circumvented by a sufficiently creative
adversary or a sufficiently broken operating system implementation.
This section describes a few of the ways in which an SQLite database
might be corrupted by a power failure or system crash.
(See also: [how to corrupt | How To Corrupt Your Database Files].)</p>

<tcl>hd_fragment brokenlocks</tcl>
<h2> Broken Locking Implementations</h2>

<p>SQLite uses filesystem locks to make sure that only one
process and database connection is trying to modify the database
at a time.  The filesystem locking mechanism is implemented
in the VFS layer and is different for every operating system.
SQLite depends on this implementation being correct.  If something
goes wrong and two or more processes are able to write the same
database file at the same time, severe damage can result.</p>

<p>We have received reports of implementations of both
Windows network filesystems and NFS in which locking was
subtly broken.  We can not verify these reports, but as
locking is difficult to get right on a network filesystem
we have no reason to doubt them.  You are advised to 
avoid using SQLite on a network filesystem in the first place,
since performance will be slow.  But if you must use a 
network filesystem to store SQLite database files, consider
using a secondary locking mechanism to prevent simultaneous
writes to the same database even if the native filesystem
locking mechanism malfunctions.</p>

<p>The versions of SQLite that come preinstalled on Apple
Mac OS X computers contain a version of SQLite that has been
extended to use alternative locking strategies that work on
all network filesystems that Apple supports.  These extensions
used by Apple work great as long as all processes are accessing
the database file in the same way.  Unfortunately, the locking
mechanisms do not exclude one another, so if one process is
accessing a file using (for example) AFP locking and another
process (perhaps on a different machine) is using dot-file locks,
the two processes might collide because AFP locks do not exclude
dot-file locks or vice versa.</p>

<tcl>hd_fragment fsync</tcl>
<h2> Incomplete Disk Flushes</h2>

<p>SQLite uses the fsync() system call on Unix and the FlushFileBuffers()
system call on w32 in order to sync the file system buffers onto disk
oxide as shown in <a href="#section_3_7">step 3.7</a> and
<a href="#section_3_10">step 3.10</a>.  Unfortunately, we have received
reports that neither of these interfaces works as advertised on many
systems.  We hear that FlushFileBuffers() can be completely disabled
using registry settings on some Windows versions.  Some historical
versions of Linux contain versions of fsync() which are no-ops on
some filesystems, we are told.  Even on systems where 
FlushFileBuffers() and fsync() are said to be working, often
the IDE disk control lies and says that data has reached oxide
while it is still held only in the volatile control cache.</p>

<p>On the Mac, you can set this pragma:</p>

<blockquote>
PRAGMA fullfsync=ON;
</blockquote>

<p>Setting fullfsync on a Mac will guarantee that data really does
get pushed out to the disk platter on a flush.  But the implementation
of fullfsync involves resetting the disk controller.  And so not only
is it profoundly slow, it also slows down other unrelated disk I/O.
So its use is not recommended.</p>

<tcl>hd_fragment filedel</tcl>
<h2> Partial File Deletions</h2>

<p>SQLite assumes that file deletion is an atomic operation from the
point of view of a user process.  If power fails in the middle of
a file deletion, then after power is restored SQLite expects to see
either the entire file with all of its original data intact, or it
expects not to find the file at all.  Transactions may not be atomic
on systems that do not work this way.</p>

<tcl>hd_fragment filegarbage</tcl>
<h2> Garbage Written Into Files</h2>

<p>SQLite database files are ordinary disk files that can be
opened and written by ordinary user processes.  A rogue process
can open an SQLite database and fill it with corrupt data.  
Corrupt data might also be introduced into an SQLite database
by bugs in the operating system or disk controller; especially
bugs triggered by a power failure.  There is nothing SQLite can
do to defend against these kinds of problems.</p>

<tcl>hd_fragment mvhotjrnl</tcl>
<h2> Deleting Or Renaming A Hot Journal</h2>

<p>If a crash or power loss does occur and a hot journal is left on
the disk, it is essential that the original database file and the hot
journal remain on disk with their original names until the database
file is opened by another SQLite process and rolled back.  
During recovery at <a href="#section_4_2">step 4.2</a> SQLite locates
the hot journal by looking for a file in the same directory as the
database being opened and whose name is derived from the name of the
file being opened.  If either the original database file or the
hot journal have been moved or renamed, then the hot journal will
not be seen and the database will not be rolled back.</p>

<p>We suspect that a common failure mode for SQLite recovery happens
like this:  A power failure occurs.  After power is restored, a well-meaning
user or system administrator begins looking around on the disk for
damage.  They see their database file named "important.data".  This file
is perhaps familiar to them.  But after the crash, there is also a
hot journal named "important.data-journal".  The user then deletes
the hot journal, thinking that they are helping to cleanup the system.
We know of no way to prevent this other than user education.</p>

<p>If there are multiple (hard or symbolic) links to a database file,
the journal will be created using the name of the link through which
the file was opened.  If a crash occurs and the database is opened again
using a different link, the hot journal will not be located and no
rollback will occur.</p>

<p>Sometimes a power failure will cause a filesystem to be corrupted
such that recently changed filenames are forgotten and the file is
moved into a "/lost+found" directory.  When that happens, the hot
journal will not be found and recovery will not occur.
SQLite tries to prevent this
by opening and syncing the directory containing the rollback journal
at the same time it syncs the journal file itself.  However, the
movement of files into /lost+found can be caused by unrelated processes
creating unrelated files in the same directory as the main database file.
And since this is out from under the control of SQLite, there is nothing
that SQLite can do to prevent it.  If you are running on a system that
is vulnerable to this kind of filesystem namespace corruption (most
modern journalling filesystems are immune, we believe) then you might
want to consider putting each SQLite database file in its own private
subdirectory.</p>

<tcl>hd_fragment future</tcl>
<h1>10.0 Future Directions And Conclusion</h1>

<p>Every now and then someone discovers a new failure mode for
the atomic commit mechanism in SQLite and the developers have to
put in a patch.  This is happening less and less and the
failure modes are becoming more and more obscure.  But it would
still be foolish to suppose that the atomic commit logic of
SQLite is entirely bug-free.  The developers are committed to fixing
these bugs as quickly as they might be found.</p>

<p>
The developers are also on the lookout for new ways to
optimize the commit mechanism.  The current VFS implementations
for Unix (Linux and Mac OS X) and Windows make pessimistic assumptions about
the behavior of those systems.  After consultation with experts
on how these systems work, we might be able to relax some of the
assumptions on these systems and allow them to run faster.  In
particular, we suspect that most modern filesystems exhibit the
safe append property and that many of them might support atomic
sector writes.  But until this is known for certain, SQLite will
take the conservative approach and assume the worst.</p>