Documentation Source Text

Check-in [8b28a36fa1]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Add the atomiccommit.html document. No links to it yet.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 8b28a36fa14aa14df7b083f571618e57a5668cf4
User & Date: drh 2007-11-13 01:43:31
Context
2007-11-13
02:53
Darken the font color on hyperlinks. Fix a bad link to the icons on the famous.html page. check-in: 8916aa40ec user: drh tags: trunk
01:43
Add the atomiccommit.html document. No links to it yet. check-in: 8b28a36fa1 user: drh tags: trunk
01:24
Cleaned up the famous users webpage. Work on the about page. check-in: 8a34ae6a2f user: drh tags: trunk
Changes
Hide Diffs Side-by-Side Diffs Ignore Whitespace Patch

Added images/ac/commit-0.gif.

cannot compute difference between binary files

Added images/ac/commit-1.gif.

cannot compute difference between binary files

Added images/ac/commit-2.gif.

cannot compute difference between binary files

Added images/ac/commit-3.gif.

cannot compute difference between binary files

Added images/ac/commit-4.gif.

cannot compute difference between binary files

Added images/ac/commit-5.gif.

cannot compute difference between binary files

Added images/ac/commit-6.gif.

cannot compute difference between binary files

Added images/ac/commit-7.gif.

cannot compute difference between binary files

Added images/ac/commit-8.gif.

cannot compute difference between binary files

Added images/ac/commit-9.gif.

cannot compute difference between binary files

Added images/ac/commit-A.gif.

cannot compute difference between binary files

Added images/ac/commit-B.gif.

cannot compute difference between binary files

Added images/ac/multi-0.gif.

cannot compute difference between binary files

Added images/ac/multi-1.gif.

cannot compute difference between binary files

Added images/ac/multi-2.gif.

cannot compute difference between binary files

Added images/ac/multi-3.gif.

cannot compute difference between binary files

Added images/ac/multi-4.gif.

cannot compute difference between binary files

Added images/ac/multi-5.gif.

cannot compute difference between binary files

Added images/ac/rollback-0.gif.

cannot compute difference between binary files

Added images/ac/rollback-1.gif.

cannot compute difference between binary files

Added images/ac/rollback-2.gif.

cannot compute difference between binary files

Added images/ac/rollback-3.gif.

cannot compute difference between binary files

Added images/ac/rollback-4.gif.

cannot compute difference between binary files

Added images/ac/rollback-5.gif.

cannot compute difference between binary files

Added pages/atomiccomit.in.

            1  +<title>Atomic Commit In SQLite</title>
            2  +
            3  +<h1 align="center">
            4  +Atomic Commit In SQLite
            5  +</h1>
            6  +
            7  +<h2>1.0 Introduction</h2>
            8  +
            9  +<p>An important feature of transactional databases like SQLite
           10  +is "atomic commit".  
           11  +Atomic commit means that either all database changes within a single 
           12  +transaction occur or none of them occur.  With atomic commit, it
           13  +is as if many different writes to different sections of the database
           14  +file occur instantaneously and simultaneously.
           15  +Real hardware serializes writes to mass storage, and writing
           16  +a single sector takes a finite amount of time.
           17  +So it is impossible to truly write many different sectors of a 
           18  +database file simultaneously and/or instantaneously.
           19  +But the atomic commit logic within
           20  +SQLite makes it appear as if the changes for a transaction
           21  +are all written instantaneously and simultaneously.</p>
           22  +
           23  +<p>SQLite has the important property that transactions appear
           24  +to be atomic even if the transaction is interrupted by an
           25  +operating system crash or power failure.</p>
           26  +
           27  +<p>This article describes the techniques used by SQLite to create the
           28  +illusion of atomic commit.</p>
           29  +
           30  +
           31  +<h2>2.0 Hardware Assumptions</h2>
           32  +
           33  +<p>Throughout this article, we will call the mass storage device "disk"
           34  +even though the mass storage device might really be flash memory.</p>
           35  +
           36  +<p>We assume that disk is written in chunks which we call a "sector".
           37  +It is not possible to modify any part of the disk smaller than a sector.
           38  +To change a part of the disk smaller than a sector, you have to read in
           39  +the full sector that contains the part you want to change, make the
           40  +change, then write back out the complete sector.</p>
           41  +
           42  +<p>On a traditional spinning disk, a sector is the minimum unit of transfer
           43  +in both directions, both reading and writing.  On flash memory, however,
           44  +the minimum size of a read is typically much smaller than a minimum write.
           45  +SQLite is only concerned with the minimum write amount and so for the
           46  +purposes of this article, when we say "sector" we mean the minimum amount
           47  +of data that can be written to mass storage in a single go.</p>
           48  +
           49  +<p>Prior to SQLite version 3.3.14, a sector size of 512 bytes was
           50  +assumed in all cases.  There was a compile-time option to change
           51  +this but the code had never been tested with a larger value.  The
           52  +512 byte sector assumption seemed reasonable since until very recently
           53  +all disk drives used a 512 byte sector internally.  However, there
           54  +has recently been a push to increase the sector size of disks to
           55  +4096 bytes.  Also the sector size
           56  +for flash memory is usually larger than 512 bytes.  For these reasons,
           57  +versions of SQLite beginning with 3.3.14 have a method in the OS
           58  +interface layer that interrogates the underlying filesystem to find
           59  +the true sector size.  As currently implemented (version 3.5.0) this
           60  +method still returns a hard-coded value of 512 bytes, since there
           61  +is no standard way of discovering the true sector size on either
           62  +win32 or unix.  But the method is available for embedded device
           63  +manufactures to tweak according to their own needs.  And we have
           64  +left open the possibility of filling in a more meaningful implementation
           65  +on unix and win32 in the future.</p>
           66  +
           67  +<p>SQLite does <u>not</u> assume that a sector write is atomic.
           68  +However, it does assume that a sector write is linear.  By "linear"
           69  +we mean that SQLite assumes that when writing a sector, the hardware begins
           70  +at one end of the data and writes byte by byte until it gets to
           71  +the other end.  The write might go from beginning to end or from
           72  +end to beginning.  If a power failure occurs in the middle of a
           73  +sector write it might be that part of the sector was modified
           74  +and another part was left unchanged.  The key assumption by SQLite
           75  +is that if any part of the sector gets changed, then either the
           76  +first or the last bytes will be changed.  So the hardware will
           77  +never start writing a sector in the middle and work towards the
           78  +ends.  We do not know if this assumption is always true but it
           79  +seems reasonable.</p>
           80  +
           81  +<p>The previous paragraph states that SQLite does not assume that
           82  +sector writes are atomic.  This is true by default.  But as of
           83  +SQLite version 3.5.0, there is a new interface called the
           84  +Virtual File System (VFS) interface.  The VFS is the only means
           85  +by which SQLite communicates to the underlying filesystem.  The
           86  +code comes with default VFS implementations for unix and windows
           87  +and there is a mechanism for creating new custom VFS implementations
           88  +at runtime.  In this new VFS interface there is a method called
           89  +xDeviceCharacteristics.  This method interrogates the underlying
           90  +filesystem to discover various properties and behaviors that the
           91  +filesystem may or may not exhibit.  The xDeviceCharacteristics
           92  +method might indicate that sector writes are atomic, and if it does
           93  +so indicate, SQLite will try to take advantage of that fact.  But
           94  +the default xDeviceCharacteristics method for both unix and windows
           95  +does not indicate atomic sector writes and so these optimizations
           96  +are normally omitted.</p>
           97  +
           98  +<p>SQLite assumes that the operating system will buffer writes and
           99  +that a write request will return before data has actually been stored
          100  +in the mass storage device.
          101  +SQLite further assumes that write operations will be reordered by
          102  +the operating system.
          103  +For this reason, SQLite does a "flush" or "fsync" operation at key
          104  +points.  SQLite assumes that the flush or fsync will not return until
          105  +all pending write operations for the file that is being flushed have
          106  +completed.  We are told that the flush and fsync primitives
          107  +are broken on some versions of windows and Linux.  This is unfortunate.
          108  +It opens SQLite up to the possibility of database corruption following
          109  +a power loss in the middle of a commit.  However, there is nothing 
          110  +that SQLite can do to test for or remedy the situation.  SQLite
          111  +assumes that the operating system that it is running on works as
          112  +advertised.  If that is not quite the case, well then hopefully you
          113  +will not lose power too often.</p>
          114  +
          115  +<p>SQLite assumes that when a file grows in length that the new
          116  +file space originally contains garbage and then later is filled in
          117  +with the data actually written.  In other words, SQLite assumes that
          118  +the file size is updated before the file content.  This is a 
          119  +pessimistic assumption and SQLite has to do some extra work to make
          120  +sure that it does not cause database corruption if power is lost
          121  +between the time when the file size is increased and when the
          122  +new content is written.  The xDeviceCharacteristics method of
          123  +the VFS might indicate that the filesystem will always write the
          124  +data before updating the file size.  (This is the 
          125  +SQLITE_IOCAP_SAFE_APPEND property for those readers who are looking
          126  +at the code.)  When the xDeviceCharacteristics method indicates
          127  +that files content is written before the file size is increased,
          128  +SQLite can forego some of its pedantic database protection steps
          129  +and thereby decrease the amount of disk I/O needed to perform a
          130  +commit.  The current implementation, however, makes no such assumptions
          131  +for the default VFSes for windows and unix.</p>
          132  +
          133  +<p>SQLite assumes that a file deletion is atomic from the
          134  +point of view of a user process.  By this we mean that if SQLite
          135  +requests that a file be deleted and the power is lost during the
          136  +delete operation, once power is restored either the file will
          137  +exist completely with all if its original content unaltered, or
          138  +else the file will not be seen in the filesystem at all.  If
          139  +after power is restored the file is only partially deleted,
          140  +if some of its data has been altered or erased,
          141  +or the file has been truncated but not completely removed, then
          142  +database corruption will likely result.</p>
          143  +
          144  +<p>SQLite assumes that the detection and/or correction of 
          145  +bit errors caused by cosmic rays, thermal noise, quantum
          146  +fluctuations, device driver bugs, or some other mechanism, is the 
          147  +responsibility of the underlying hardware and operating system.  
          148  +SQLite does not add any redundancy to the database file for
          149  +the purpose of detecting corruption or I/O errors.
          150  +SQLite assumes that the data it reads is exactly the same data 
          151  +that it previously wrote.</p>
          152  +
          153  +<a name="section_3_0"></a>
          154  +<h2>3.0 Single File Commit</h2>
          155  +
          156  +<p>We begin with an overview of the steps SQLite takes in order to
          157  +perform an atomic commit of a transaction against a single database
          158  +file.  The details of file formats used to guard against damage from
          159  +power failures and techniques for performing an atomic commit across
          160  +multiple databases are discussed in later sections.</p>
          161  +
          162  +<h3>3.1 Initial State</h3>
          163  +
          164  +<img src="images/ac/commit-0.gif" align="right" hspace="15">
          165  +
          166  +<p>The state of the computer when a database connection is
          167  +first opened is shown conceptually by the diagram at the
          168  +right.
          169  +The area of the diagram on the extreme right (labeled "Disk") represents
          170  +information stored on the mass storage device.  Each rectangle is
          171  +a sector.  The blue color represents that the sectors contain
          172  +original data.
          173  +The middle area is the operating systems disk cache.  At the
          174  +onset of our example, the cache is cold and this is represented
          175  +by leaving the rectangles of the disk cache empty.
          176  +The left area of the diagram shows the content of memory for
          177  +the process that is using SQLite.  The database connection has
          178  +just been opened and no information has been read yet, so the
          179  +user space is empty.
          180  +</p>
          181  +<br clear="both">
          182  +
          183  +<h3>3.2 Acquiring A Read Lock</h3>
          184  +
          185  +<img src="images/ac/commit-1.gif" align="right" hspace="15">
          186  +
          187  +<p>Before SQLite can write to a database, it must first read
          188  +the database to see what is there already.  Even if it is just
          189  +appending new data, SQLite still has to read in the database
          190  +schema from the <b>sqlite_master</b> table so that it can know
          191  +how to parse the INSERT statements and discover where in the
          192  +database file the new information should be stored.</p>
          193  +
          194  +<p>The first step toward reading from the database file
          195  +is obtaining a shared lock on the database file.  A "shared"
          196  +lock allows two or more database connections to read from the
          197  +database file at the same time.  But a shared lock prevents
          198  +another database connection from writing to the database file
          199  +while we are reading it.  This is necessary because if another
          200  +database connection were writing to the database file at the
          201  +same time we are reading from the database file, we might read
          202  +some data before the change and other data after the change.
          203  +This would make it appears as if the change made by the other
          204  +process is not atomic.</p>
          205  +
          206  +<p>Notice that the shared lock is on the operating system
          207  +disk cache, not on the disk itself.  File locks
          208  +really are just flags within the operating system kernel,
          209  +usually.  (The details depend on the specific OS layer
          210  +interface.)  Hence, the lock will instantly vanish if the
          211  +operating system crashes or if there is a power loss.  It
          212  +is usually also the case that the lock will vanish if the
          213  +process that created the lock exits.</p>
          214  +
          215  +<br clear="both">
          216  +
          217  +<a name="section_3_3"></a>
          218  +<h3>3.3 Reading Information Out Of The Database</h3>
          219  +
          220  +<img src="images/ac/commit-2.gif" align="right" hspace="15">
          221  +
          222  +<p>After the shared lock is acquired, we can begin reading
          223  +information from the database file.  In this scenario, we
          224  +are assuming a cold cache, so information must first be
          225  +read from mass storage into the operating system cache then
          226  +transferred from operating system cache into user space.
          227  +On subsequent reads, some or all of the information might
          228  +already be found in the operating system cache and so only
          229  +the transfer to user space would be required.</p>
          230  +
          231  +<p>Usually only a subset of the pages in the database file
          232  +are read.  In this example we are showing three
          233  +pages out of eight being read.  In a typical application, a
          234  +database will have thousands of pages and a query will normally
          235  +only touch a small percentage of those pages.</p>
          236  +
          237  +<br clear="both">
          238  +
          239  +<h3>3.4 Obtaining A Reserved Lock</h3>
          240  +
          241  +<img src="images/ac/commit-3.gif" align="right" hspace="15">
          242  +
          243  +<p>Before making changes to the database, SQLite first
          244  +obtains a "reserved" lock on the database file.  A reserved
          245  +lock is similar to a shared lock in that both a reserved lock
          246  +and shared lock allow other processes to read from the database
          247  +file.  A single reserve lock can coexist with multiple shared
          248  +locks from other processes.  However, there can only be a
          249  +single reserved lock on the database file.  Hence only a
          250  +single process can be attempting to write to the database
          251  +at one time.</p>
          252  +
          253  +<p>The idea behind a reserved locks is that it signals that
          254  +a process intends to modify the database file in the near
          255  +future but has not yet started to make the modifications.
          256  +And because the modifications have not yet started, other
          257  +processes can continue to read from the database.  However,
          258  +no other process should also begin trying to write to the
          259  +database.</p>
          260  +
          261  +<br clear="both">
          262  +<a name="section_3_5"></a>
          263  +<h3>3.5 Creating A Rollback Journal File</h3>
          264  +<img src="images/ac/commit-4.gif" align="right" hspace="15">
          265  +
          266  +<p>Prior to making any changes to the database file, SQLite first
          267  +creates a separate rollback journal file and writes into the 
          268  +rollback journal the original
          269  +content of the database pages that are to be altered.
          270  +The idea behind the rollback journal is that it contains
          271  +all information needed to restore the database back to 
          272  +its original state.</p>
          273  +
          274  +<p>The rollback journal contains a small header (shown in green
          275  +in the diagram) that records the original size of the database
          276  +file.  So if a change causes the database file to grow, we
          277  +will still know the original size of the database.  The page
          278  +number is stored together with each database page that is 
          279  +written into the rollback journal.</p>
          280  +
          281  +<p>When a new file is created, most desktop operating systems
          282  +(windows, linux, macOSX) will not actually write anything to
          283  +disk.  The new file is created in the operating systems disk
          284  +cache only.  The file is not created on mass storage until sometime
          285  +later, when the operating system has a spare moment.  This creates
          286  +the impression to users that I/O is happening much faster than
          287  +is possible when doing real disk I/O.  We illustrate this idea in
          288  +the diagram to the right by showing that the new rollback journal
          289  +appears in the operating system disk cache only and not on the
          290  +disk itself.</p>
          291  +
          292  +<br clear="both">
          293  +<a name="section_3_6"></a>
          294  +<h3>3.6 Changing Database Pages In User Space</h3>
          295  +<img src="images/ac/commit-5.gif" align="right" hspace="15">
          296  +
          297  +<p>After the original page content has been saved in the rollback
          298  +journal, the pages can be modified in user memory.  Each database
          299  +connection has its own private copy of user space, so the changes
          300  +that are made in user space are only visible to the database connection
          301  +that is making the changes.  Other database connections still see
          302  +the information in operating system disk cache buffers which have
          303  +not yet been changed.  And so even though one process is busy
          304  +modifying the database, other processes can continue to read their
          305  +own copies of the original database content.</p>
          306  +
          307  +<br clear="both">
          308  +<a name="section_3_7"></a>
          309  +<h3>3.7 Flushing The Rollback Journal File To Mass Storage</h3>
          310  +<img src="images/ac/commit-6.gif" align="right" hspace="15">
          311  +
          312  +<p>The next step is to flush the content of the rollback journal
          313  +file to nonvolatile storage.
          314  +As we will see later, 
          315  +this is a critical step in insuring that the database can survive
          316  +an unexpected power loss.
          317  +This step also takes a lot of time, since writing to nonvolatile
          318  +storage is normally a slow operation.</p>
          319  +
          320  +<p>This step is usually more complicated than simply flushing
          321  +the rollback journal to the disk.  On most platforms two separate
          322  +flush (or fsync()) operations are required.  The first flush writes
          323  +out the base rollback journal content.  Then the header of the
          324  +rollback journal is modified to show the number of pages in the 
          325  +rollback journal.  Then the header is flushed to disk.  The details
          326  +on why we do this header modification and extra flush are provided
          327  +in a later section of this paper.</p>
          328  +
          329  +<br clear="both">
          330  +<a name="section_3_8"></a>
          331  +<h3>3.8 Obtaining An Exclusive Lock</h3>
          332  +<img src="images/ac/commit-7.gif" align="right" hspace="15">
          333  +
          334  +<p>Prior to making changes to the database file itself, we must
          335  +obtain an exclusive lock on the database file.  Obtaining an
          336  +exclusive lock is really a two-step process.  First SQLite obtains
          337  +a "pending" lock.  Then it escalates the pending lock to an
          338  +exclusive lock.</p>
          339  +
          340  +<p>A pending lock allows other processes that already have a
          341  +shared lock to continue reading the database file.  But it
          342  +prevents new shared locks from being established.  The idea
          343  +behind a pending lock is to prevent writer starvation caused
          344  +by a large pool of readers.  There might be dozens, even hundreds,
          345  +of other processes trying to read the database file.  Each process
          346  +acquires a shared lock before it starts reading, reads what it
          347  +needs, then releases the shared lock.  If, however, there are
          348  +many different processes all reading from the same database, it
          349  +might happen that a new process always acquires its shared lock before
          350  +the previous process releases its shared lock.  And so there is
          351  +never an instant when there are no shared locks on the database
          352  +file and hence there is never an opportunity for the writer to
          353  +seize the exclusive lock.  A pending lock is designed to prevent
          354  +that cycle by allowing existing shared locks to proceed but
          355  +blocking new shared locks from being established.  Eventually
          356  +all shared locks will clear and the pending lock will then be
          357  +able to escalate into an exclusive lock.</p>
          358  +
          359  +<br clear="both">
          360  +<a name="section_3_9"></a>
          361  +<h3>3.9 Writing Changes To The Database File</h3>
          362  +<img src="images/ac/commit-8.gif" align="right" hspace="15">
          363  +
          364  +<p>Once an exclusive lock is held, we know that no other
          365  +processes are reading from the database file and it is
          366  +safe to write changes into the database file.  Usually
          367  +those changes only go as far as the operating systems disk
          368  +cache and do not make it all the way to mass storage.</p>
          369  +
          370  +<br clear="both">
          371  +<a name="section_3_10"></a>
          372  +<h3>3.10 Flushing Changes To Mass Storage</h3>
          373  +<img src="images/ac/commit-9.gif" align="right" hspace="15">
          374  +
          375  +<p>Another flush must occur to make sure that all the
          376  +database changes are written into nonvolatile storage.
          377  +This is a critical step to insure that the database will
          378  +survive a power loss without damage.  However, because
          379  +of the inherent slowness of writing to disk or flash memory, 
          380  +this step together with the rollback journal file flush in section
          381  +3.7 above takes up most the time required to complete a
          382  +transaction commit in SQLite.</p>
          383  +
          384  +<br clear="both">
          385  +<a name="section_3_11"></a>
          386  +<h3>3.11 Deleting The Rollback Journal</h3>
          387  +<img src="images/ac/commit-A.gif" align="right" hspace="15">
          388  +
          389  +<p>After the database changes are all safely on the mass
          390  +storage device, the rollback journal file is deleted.
          391  +This is the instant where the transaction commits.
          392  +If a power failure or system crash occurs prior to this
          393  +point, then recovery processes to be described later make
          394  +it appears as if no changes were ever made to the database
          395  +file.  If a power failure or system crash occurs after
          396  +the rollback journal is deleted, then it appears as if
          397  +all changes have been written to disk.  Thus, SQLite gives
          398  +the appearance of having made no changes to the database
          399  +file or having made the complete set of changes to the
          400  +database file depending on whether or not the rollback
          401  +journal file exists.</p>
          402  +
          403  +<p>Deleting a file is not really an atomic operation, but
          404  +it appears to be from the point of view of a user process.
          405  +A process is always able to ask the operating system "does
          406  +this file exist?" and the process will get back a yes or no
          407  +answer.  After a power failure that occurs during a 
          408  +transaction commit, SQLite will ask the operating system
          409  +whether or not the rollback journal file exists.  If the
          410  +answer is "yes" then the transaction is incomplete and is
          411  +rolled back.  If the answer is "no" then it means the transaction
          412  +did commit.</p>
          413  +
          414  +<p>The existence of a transaction depends on whether or
          415  +not the rollback journal file exists and the deletion
          416  +of a file appears to be an atomic operation from the point of
          417  +view of a user-space process.  Therefore, 
          418  +a transaction appears to be an atomic operation.</p>
          419  +
          420  +<br clear="both">
          421  +<a name="section_3_12"></a>
          422  +<h3>3.12 Releasing The Lock</h3>
          423  +<img src="images/ac/commit-B.gif" align="right" hspace="15">
          424  +
          425  +<p>The last step in the commit process is to release the
          426  +exclusive lock so that other processes can once again
          427  +start accessing the database file.</p>
          428  +
          429  +<p>In the diagram at the right, we show that the information
          430  +that was held in user space is cleared when the lock is released.
          431  +This used to be literally true for older versions of SQLite.  But
          432  +more recent versions of SQLite keep the user space information
          433  +in memory in case it might be needed again at the start of the
          434  +next transaction.  It is cheaper to reuse information that is
          435  +already in local memory than to transfer the information back
          436  +from the operating system disk cache or to read it off of the
          437  +disk drive again.  Prior to reusing the information in user space,
          438  +we must first reacquire the shared lock and then we have to check
          439  +to make sure that no other process modified the database file while
          440  +we were not holding a lock.  There is a counter in the first page
          441  +of the database that is incremented every time the database file
          442  +is modified.  We can find out if another process has modified the
          443  +database by checking that counter.  If the database was modified,
          444  +then the user space cache must be cleared and reread.  But it is
          445  +commonly the case that no changes have been made and the user
          446  +space cache can be reused for a significant performance savings.</p>
          447  +
          448  +<br clear="both">
          449  +<h2>4.0 Rollback</h2>
          450  +
          451  +<p>An atomic commit is suppose to happen instantaneously.  But the processing
          452  +described above clearly takes a finite amount of time.
          453  +Suppose the power to the computer were cut
          454  +part way through the commit operation described above.  In order
          455  +to maintain the illusion that the changes were instantaneous, we
          456  +have to "rollback" any partial changes and restore the database to
          457  +the state it was in prior to the beginning of the transaction.</p>
          458  +
          459  +<h3>4.1 When Something Goes Wrong...</h3>
          460  +<img src="images/ac/rollback-0.gif" align="right" hspace="15">
          461  +
          462  +<p>Suppose the power loss occurred during step 3.10 above,
          463  +while the database changes were being written to disk.
          464  +After power is restored, the situation might be something
          465  +like what is shown to the right.  We were trying to change
          466  +three pages of the database file but only one page was
          467  +successfully written.  Another page was partially written
          468  +and a third page was not written at all.</p>
          469  +
          470  +<p>The rollback journal is complete and intact on disk when
          471  +the power is restored.  This is a key point.  The reason for
          472  +the flush operation in step 3.7 is to make absolutely sure that
          473  +all of the rollback journal is safely on nonvolatile storage
          474  +prior to making any changes to the database file itself.</p>
          475  +
          476  +<br clear="both">
          477  +<a name="section_4_2"></a>
          478  +<h3>4.2 Hot Rollback Journals</h3>
          479  +<img src="images/ac/rollback-1.gif" align="right" hspace="15">
          480  +
          481  +<p>The first time that any SQLite process attempts to access
          482  +the database file, it obtains a shared lock as described in
          483  +section 3.2 above.  But then it notices that there is a 
          484  +rollback journal file present.  SQLite then checks to see if
          485  +the rollback journal is a "hot journal".   A hot journal is
          486  +a rollback journal that needs to be played back in order to
          487  +restore the database to a sane state.  A hot journal only
          488  +exists when an earlier process was in the middle of committing
          489  +a transaction when it crashed or lost power.</p>
          490  +
          491  +<p>A rollback journal is a "hot" journal if all of the following
          492  +are true:</p>
          493  +
          494  +<ul>
          495  +<li>The rollback journal exist.
          496  +<li>The rollback journal is not an empty file.
          497  +<li>There is no reserved lock on the main database file.
          498  +<li>The rollback journal header does not
          499  +contain the name of a master journal file (see
          500  +<a href="#section_5_5">section 5.5</a> below) or if does
          501  +contain the name of a master journal, then that master journal
          502  +file exists.
          503  +</ul>
          504  +
          505  +<p>The presence of a hot journal is our indication
          506  +that a previous process was trying to commit a transaction but
          507  +it aborted for some reason prior to the completion of the
          508  +commit.  The presence of a hot journal is our indication that
          509  +the database file is in an inconsistent state and needs to
          510  +be repaired (by rollback) prior to being used.</p>
          511  +
          512  +<br clear="both">
          513  +<h3>4.3 Obtaining An Exclusive Lock On The Database</h3>
          514  +<img src="images/ac/rollback-2.gif" align="right" hspace="15">
          515  +
          516  +<p>The first step toward dealing with a hot journal is to
          517  +obtain an exclusive lock on the database file.  This prevents two
          518  +or more processes from trying to rollback the same hot journal
          519  +at the same time.</p>
          520  +
          521  +<br clear="both">
          522  +<a name="section_4_4"></a>
          523  +<h3>4.4 Rolling Back Incomplete Changes</h3>
          524  +<img src="images/ac/rollback-3.gif" align="right" hspace="15">
          525  +
          526  +<p>Once a process obtains an exclusive lock, it is permitted
          527  +to write to the database file.  It then proceeds to read the
          528  +original content of pages out of the rollback journal and write
          529  +that content back to were it came from in the database file.
          530  +Recall that the header of the rollback journal records the original
          531  +size of the database file prior to the start of the aborted
          532  +transaction.  SQLite uses this information to truncate the
          533  +database file back to its original size in cases where the
          534  +incomplete transaction caused the database to grow.  At the
          535  +end of this step, the database should be the same size and
          536  +contain the same information as it did before the start of
          537  +the aborted transaction.</p>
          538  +
          539  +<br clear="both">
          540  +<h3>4.5 Deleting The Hot Journal</h3>
          541  +<img src="images/ac/rollback-4.gif" align="right" hspace="15">
          542  +
          543  +<p>After all information in the rollback journal has been
          544  +played back into the database file (and flushed to disk in case
          545  +we encounter yet another power failure), the hot rollback journal
          546  +can be deleted.</p>
          547  +
          548  +<br clear="both">
          549  +<h3>4.6 Continue As If The Uncompleted Writes Had Never Happened</h3>
          550  +<img src="images/ac/rollback-5.gif" align="right" hspace="15">
          551  +
          552  +<p>The final recovery step is to reduce the exclusive lock back
          553  +to a shared lock.  Once this happens, the database is back in the
          554  +state that it would have been if the aborted transaction had never
          555  +started.  Since all of this recovery activity happens completely
          556  +automatically and transparently, it appears to the program using
          557  +SQLite as if the aborted transaction had never begun.</p>
          558  +
          559  +<br clear="both">
          560  +<h2>5.0 Multi-file Commit</h2>
          561  +
          562  +<p>SQLite allows a single 
          563  +<a href="/capi3ref.html#sqlite3">database connection</a> to talk to
          564  +two or more database files simultaneously through the use of
          565  +the <a href="/lang_attach.html">ATTACH DATABASE</a> command.
          566  +When multiple database files are modified within a single
          567  +transaction, all files are updated atomically.  
          568  +In other words, either all of the database files are updated or
          569  +else none of them are.
          570  +Achieving an atomic commit across multiple database files is
          571  +more complex that doing so for a single file.  This section
          572  +describes how SQLite works that bit of magic.</p>
          573  +
          574  +<h3>5.1 Separate Rollback Journals For Each Database</h3>
          575  +<img src="images/ac/multi-0.gif" align="right" hspace="15">
          576  +
          577  +<p>When multiple database files are involved in a transaction,
          578  +each database has its own rollback journal and each database
          579  +is locked separately.  The diagram at the right shows a scenario
          580  +where three different database files have been modified within
          581  +one transaction.  The situation at this step is analogous to 
          582  +the single-file transaction scenario at 
          583  +<a href="#section_3_6">step 3.6</a>.  Each database file has
          584  +a reserved lock.  For each database, the original content of pages 
          585  +that are being changed have been written into the rollback journal
          586  +for that database, but the content of the journals have not yet
          587  +been flushed to disk.  No changes have been made to the database
          588  +file itself yet, though presumably there are changes being held
          589  +in user memory.</p>
          590  +
          591  +<p>For brevity, the diagrams in this section are simplified from
          592  +those that came before.  Blue color still signifies original content
          593  +and pink still signifies new content.  But the individual pages
          594  +in the rollback journal and the database file are not shown and
          595  +we are not making the distinction between information in the
          596  +operating system cache and information that is on disk.  All of
          597  +these factors still apply in a multi-file commit scenario.  They
          598  +just take up a lot of space in the diagrams and they do not add
          599  +any new information, so they are omitted here.</p>
          600  +
          601  +<br clear="both">
          602  +<h3>5.2 The Master Journal File</h3>
          603  +<img src="images/ac/multi-1.gif" align="right" hspace="15">
          604  +
          605  +<p>The next step in a multi-file commit is the creation of a
          606  +"master journal" file.  The name of the master journal file is
          607  +the same name as the original database filename (the database
          608  +that was opened using the 
          609  +<a href="capi3ref.html#sqlite3_open">sqlite3_open()</a> interface,
          610  +not one of the <a href="lang_attach.html">ATTACHed</a> auxiliary
          611  +databases) with the text "<b>-mj</b><i>HHHHHHHH</i>" appended where
          612  +<i>HHHHHHHH</i> is a random 32-bit hexadecimal number.  The
          613  +random <i>HHHHHHHH</i> suffix changes for every new master journal.</p>
          614  +
          615  +<p><i>(Nota bene: The formula for computing the master journal filename
          616  +given in the previous paragraph corresponds to the implementation as
          617  +of SQLite version 3.5.0.  But this formula is not part of the SQLite
          618  +specification and is subject to change in future releases.)</i></p>
          619  +
          620  +<p>Unlike the rollback journals, the master journal does not contain
          621  +any original database page content.  Instead, the master journal contains
          622  +the full pathnames for rollback journals for every database that is
          623  +participating in the transaction.</p>
          624  +
          625  +<p>After the master journal is constructed, its content is flushed
          626  +to disk before any further actions are taken.  On unix, the directory
          627  +that contains the master journal is also synced in order to make sure
          628  +the master journal file will appear in the directory following a
          629  +power failure.</p>
          630  +
          631  +<br clear="both">
          632  +<h3>5.3 Updating Rollback Journal Headers</h3>
          633  +<img src="images/ac/multi-2.gif" align="right" hspace="15">
          634  +
          635  +<p>The next step is to record the full pathname of the master journal file
          636  +in the header of every rollback journal.  Space to hold the master
          637  +journal filename was reserved at the beginning of each rollback journal
          638  +as the rollback journals were created.</p>
          639  +
          640  +<p>The content of each rollback journal is flushed to disk both before
          641  +and after the master journal filename is written into the rollback
          642  +journal header.  It is important to do both of these flushes.  Fortunately,
          643  +the second flush is usually inexpensive since typically only a single
          644  +page of the journal file (the first page) has changed.</p>
          645  +
          646  +<p>This step is analogous to 
          647  +<a href="#section_3_7">step 3.7</a> in the single-file commit
          648  +scenario described above.</p>
          649  +
          650  +<br clear="both">
          651  +<h3>5.4 Updating The Database Files</h3>
          652  +<img src="images/ac/multi-3.gif" align="right" hspace="15">
          653  +
          654  +<p>Once all rollback journal files have been flushed to disk, it
          655  +is safe to begin updating database files.  We have to obtain an
          656  +exclusive lock on all database files before writing the changes.
          657  +After all the changes are written, it is important to flush the
          658  +changes to disk so that they will be preserved in the event of
          659  +a power failure or operating system crash.</p>
          660  +
          661  +<p>This step corresponds to steps
          662  +<a href="#section_3_8">3.8</a>,
          663  +<a href="#section_3_9">3.9</a>, and
          664  +<a href="#section_3_10">3.10</a> in the single-file commit
          665  +scenario described previously.</p>
          666  +
          667  +
          668  +<br clear="both">
          669  +<a name="section_5_5"></a>
          670  +<h3>5.5 Delete The Master Journal File</h3>
          671  +<img src="images/ac/multi-4.gif" align="right" hspace="15">
          672  +
          673  +<p>The next step is to delete the master journal file.
          674  +This is the point where the multi-file transaction commits.
          675  +This step corresponds to 
          676  +<a href="#section_3_11">step 3.11</a> in the single-file
          677  +commit scenario where the rollback journal is deleted.</p>
          678  +
          679  +<p>If a power failure or operating system crash occurs at this
          680  +point, the transaction will not rollback when the system reboots
          681  +even though there are rollback journals present.  The
          682  +difference is the master journal pathname in the header of the
          683  +rollback journal.  Upon restart, SQLite only considers a journal
          684  +to be hot and will only playback the journal if there is no
          685  +master journal filename in the header (which is the case for
          686  +a single-file commit) or if the master journal file still
          687  +exists on disk.</p>
          688  +
          689  +<br clear="both">
          690  +<h3>5.6 Clean Up The Rollback Journals</h3>
          691  +<img src="images/ac/multi-5.gif" align="right" hspace="15">
          692  +
          693  +<p>The final step in a multi-file commit is to delete the
          694  +individual rollback journals and drop the exclusive locks on
          695  +the database files so that other processes can see the changes.
          696  +This corresponds to 
          697  +<a href="#section_3_12">step 3.12</a> in the single-file
          698  +commit sequence.</p>
          699  +
          700  +<p>The transaction has already committed at this point so timing
          701  +is not critical in the deletion of the rollback journals.
          702  +The current implementation deletes a single rollback journal
          703  +then unlocks the corresponding database file before proceeding
          704  +to the next rollback journal.  But in the future we might change
          705  +this so that all rollback journals are deleted before any database
          706  +files are unlocked.  As long as the rollback journal is deleted before
          707  +its corresponding database file is unlocked it does not matter in what
          708  +order the rollback journals are deleted or the database files are
          709  +unlocked.</p>
          710  +
          711  +<h2>6.0 Additional Details Of The Commit Process</h2>
          712  +
          713  +<p><a href="#section_3_0">Section 3.0</a> above provides an overview of
          714  +how atomic commit works in SQLite.  But it glosses over a number of
          715  +important details.  The following subsections will attempt to fill
          716  +in the gaps.</p>
          717  +
          718  +<h3>6.1 Always Journal Complete Sectors</h3>
          719  +
          720  +<p>When the original content of a database page is written into
          721  +the rollback journal (as shown in <a href="#section_3_5">section 3.5</a>),
          722  +SQLite always writes a complete sectors worth of data, even if the
          723  +page size of the database is smaller than the sector size.  
          724  +Historically, the sector size in SQLite has been hard coded to 512
          725  +bytes and since the minimum page size is also 512 bytes, this has never
          726  +been an issue.  But beginning with SQLite version 3.3.14, it is possible
          727  +for SQLite to use mass storage devices with a sector size larger than 512
          728  +bytes.  So, beginning with version 3.3.14, whenever any page within a
          729  +sector is written into the journal file, all pages in that same sector
          730  +are stored with it.</p>
          731  +
          732  +<p>It is important to store all pages of a sector in the rollback
          733  +journal in order to prevent database corruption following a power
          734  +loss while writing the sector.  Suppose that pages 1, 2, 3, and 4 are
          735  +all stored in sector 1 and that page 2 is modified.  In order to write
          736  +the changes to page 2, the underlying hardware must also rewrite the
          737  +content of pages 1, 3, and 4 since the hardware must write the complete
          738  +sector.  If this write operation is interrupted by a power outage,
          739  +one or more of the pages 1, 3, or 4 might be left with incorrect data.
          740  +Hence, to avoid lasting corruption to the database, the original content
          741  +of all of those pages must be contained in the rollback journal.</p>
          742  +
          743  +<h3>6.2 Dealing With Garbage Written Into Journal Files</h3>
          744  +
          745  +<p>When data is appended to the end of the rollback journal,
          746  +SQLite normally makes the pessimistic assumption that the file
          747  +is first extended with invalid "garbage" data and that afterwards
          748  +the correct data replaces the garbage.  In other words, SQLite assumes
          749  +that the file size is increased first and then afterwards the content
          750  +is written into the file.  If a power failure occurs after the file
          751  +size has been increased but before the file content has been written,
          752  +the rollback journal can be left containing garbage data.  If after
          753  +power is restored, another SQLite process sees the rollback journal
          754  +containing the garbage data and tries to roll it back into the original
          755  +database file, it might copy some of the garbage into the database file
          756  +and thus corrupt the database file.</p>
          757  +
          758  +<p>SQLite uses two defenses against this problem.  In the first place,
          759  +SQLite records the number of pages in the rollback journal in the header
          760  +of the rollback journal.  This number is initially zero.  So during an
          761  +attempt to rollback an incomplete (and possibly corrupt) rollback
          762  +journal, the process doing the rollback will see that the journal
          763  +contains zero pages will thus make no changes to the database.  Prior
          764  +to a commit, the rollback journal is flushed to disk to ensure that
          765  +all content has been synched to disk and there is no "garbage" left
          766  +in the file, and only then is the page count in the header increased
          767  +true number of pages in the rollback journal.  The rollback journal
          768  +header is always kept in a separate sector from any page data so that
          769  +it can be overwritten and flushed without risking damage to a data
          770  +page if a power outage occurs.  Notice that the rollback journal
          771  +is flushed to disk twice: once to write the page content and a second
          772  +time to write the page count in the header.</p>
          773  +
          774  +<p>The previous paragraph describes what happens when the
          775  +synchronous pragma setting is "full".</p>
          776  +
          777  +<blockquote>
          778  +PRAGMA synchronous=FULL;
          779  +</blockquote>
          780  +
          781  +<p>The default synchronous setting is full so the above is what usually
          782  +happens.  However, if the synchronous setting is lowered to "normal",
          783  +SQLite only flushes the rollback journal once, after the page count has
          784  +been written.
          785  +This carries a risk of corruption because it might happen that the 
          786  +modified (non-zero) page count reaches the disk surface before all
          787  +of the data does.  The data will have been written first, but SQLite
          788  +assumes that the underlying filesystem can reorder write requests and
          789  +that the page count can be burned into oxide first even though its
          790  +write request occurred last.  So as a second line of defense, SQLite
          791  +also uses a 32-bit checksum on every page of data in the rollback
          792  +journal.  This checksum is evaluated for each page during rollback
          793  +while rolling back a journal as described in 
          794  +<a href="#section_4_4">section 4.4</a>.  If an incorrect checksum
          795  +is seen, the rollback is abandoned.  Note that the checksum does
          796  +not guarantee that the page data is correct since there is a small
          797  +but finite probability that the checksum might be right even if the data is
          798  +corrupt.  But the checksum does at least make such an error unlikely.
          799  +</p>
          800  +
          801  +<p>Note that the checksums in the rollback journal are not necessary
          802  +if the synchronous setting is FULL.  We only depend on the checksums
          803  +when synchronous is lowered to NORMAL.  Nevertheless, the checksums
          804  +never hurt and so they are included in the rollback journal regardless
          805  +of the synchronous setting.</p>
          806  +
          807  +<h3>6.3 Cache Spill Prior To Commit</h3>
          808  +
          809  +<p>The commit process shown in <a href="#section_3_0">section 3.0</a>
          810  +assumes that all database changes fit in memory until it is time to
          811  +commit.  This is the common case.  But sometimes a larger change will
          812  +overflow the user-space cache prior to transaction commit.  In those
          813  +cases, the cache must spill to the database before the transaction
          814  +is complete.</p>
          815  +
          816  +<p>At the beginning of a cache spill, the status of the database
          817  +connection is as shown in <a href="#section_3_6">step 3.6</a>.
          818  +Original page content has been saved in the rollback journal and
          819  +modifications of the pages exist in user memory.  To spill the cache,
          820  +SQLite executes steps <a href="#section_3_7">3.7</a> through
          821  +<a href="#section_3_9">3.9</a>.  In other words, the rollback journal
          822  +is flushed to disk, an exclusive lock is acquired, and changes are
          823  +written into the database.  But the remaining steps are deferred
          824  +until the transaction really commits.  A new journal header is
          825  +appended to the end of the rollback journal (in its own sector)
          826  +and the exclusive database lock is retained, but otherwise processing
          827  +returns to <a href="#section_3_6">step 3.6</a>.  When the transaction
          828  +commits, or if another cache spill occurs, steps
          829  +<a href="#section_3_7">3.7</a> and <a href="#section_3_9">3.9</a> are
          830  +repeated.  (Step <a href="#section_3_8">3.8</a> is omitted on second
          831  +and subsequent passes since an exclusive database lock is already held
          832  +due to the first pass.)</p>
          833  +
          834  +<p>A cache spill causes the lock on the database file to
          835  +escalate from reserved to exclusive.  This reduces concurrency.
          836  +A cache spill also causes extra disk flush or fsync operations to
          837  +occur and these operations are slow, hence a cache spill can
          838  +seriously reduce performance.
          839  +For these reasons a cache spill is avoided whenever possible.</p>
          840  +
          841  +<h2>7.0 Optimizations</h2>
          842  +
          843  +<p>Profiling indicates that for most systems and in most circumstances
          844  +SQLite spends most of its time doing disk I/O.  It follows then that
          845  +anything we can do to reduce the amount of disk I/O will likely have a
          846  +large positive impact on the performance of SQLite.  This section
          847  +describes some of the techniques used by SQLite to try to reduce the
          848  +amount of disk I/O to a minimum while still preserving atomic commit.</p>
          849  +
          850  +<h3>7.1 Cache Retained Between Transactions</h3>
          851  +
          852  +<p><a href="#section_3_12">Step 3.12</a> of the commit process shows
          853  +that once the shared lock has been released, all user-space cache
          854  +images of database content must be discarded.  This is done because
          855  +without a shared lock, other processes are free to modify the database
          856  +file content and so any user-space image of that content might become
          857  +obsolete.  Consequently, each new transaction would begin by rereading
          858  +data which had previously been read.  This is not as bad as it sounds
          859  +at first since the data being read is still likely in the operating
          860  +systems file cache.  So the "read" is really just a copy of data
          861  +from kernel space into user space.  But even so, it still takes time.</p>
          862  +
          863  +<p>Beginning with SQLite version 3.3.14 a mechanism has been added
          864  +to try to reduce the needless rereading of data.  In newer versions
          865  +of SQLite, the data in the user-space pager cache is retained when
          866  +the lock on the database file is released.  Later, after the
          867  +shared lock is acquired at the beginning of the next transaction,
          868  +SQLite checks to see if any other process has modified the database
          869  +file.  If the database has been changed in any way since the lock
          870  +was last released, the user-space cache is erased at that point.
          871  +But commonly the database file is unchanged and the user-space cache
          872  +can be retained, and some unnecessary read operations can be avoided.</p>
          873  +
          874  +<p>In order to determine whether or not the database file has changed,
          875  +SQLite uses a counter in the database header (in bytes 24 through 27)
          876  +which is incremented during every change operation.  SQLite saves a copy
          877  +of this counter prior to releasing its database lock.  Then after
          878  +acquiring the next database lock it compares the saved counter value
          879  +against the current counter value and erases the cache if the values
          880  +are different, or reuses the cache if they are the same.</p>
          881  +
          882  +<h3>7.2 Exclusive Access Mode</h3>
          883  +
          884  +<p>SQLite version 3.3.14 adds the concept of "Exclusive Access Mode".
          885  +In exclusive access mode, SQLite retains the exclusive
          886  +database lock at the conclusion of each transaction.  This prevents
          887  +other processes for accessing the database, but in many deployments
          888  +only a single process is using a database so often this is not a
          889  +serious problem.  The advantage of exclusive access mode is that
          890  +disk I/O can be reduced in three ways:</p>
          891  +
          892  +<ol>
          893  +<li><p>It is not necessary to increment the change counter in the
          894  +database header for transactions after the first transaction.  This
          895  +will often save a write of page one to both the rollback
          896  +journal and the main database file.</p></li>
          897  +
          898  +<li><p>No other processes can change the database so there is never
          899  +a need to check the change counter and clear the user-space cache
          900  +at the beginning of a transaction.</p></li>
          901  +
          902  +<li><p>The rollback journal file can be truncated rather than deleted
          903  +at the end of each transaction.  On many operating systems, truncating
          904  +a file is much faster than deleting it.</p></li>
          905  +</ol>
          906  +
          907  +<p>The third optimization, truncating rather than deleting the rollback
          908  +journal file, does not depend on holding an exclusive lock at all times.
          909  +We could, in theory, do that optimization at all times, not just when
          910  +exclusive access mode is enabled, and we may well choose to do so in 
          911  +some future release of SQLite.  But for now (version 3.5.0) the
          912  +rollback journal truncation optimization is only enabled in conjunction
          913  +with exclusive access mode.</p>
          914  +
          915  +<h3>7.3 Do Not Journal Freelist Pages</h3>
          916  +
          917  +<p>When information is deleted from an SQLite database, the pages used
          918  +to old the deleted information are added to a "freelist".  Subsequent
          919  +inserts will draw pages off of this freelist rather than expanding the
          920  +database file.</p>
          921  +
          922  +<p>Some freelist pages contain critical data; specifically the locations
          923  +of other freelist pages.  But most freelist pages contain nothing useful.
          924  +These latter freelist pages are called "leaf" pages.  We are free to
          925  +modify the content of a leaf freelist page in the database without
          926  +changing the meaning of the database in any way.</p>
          927  +
          928  +<p>Because the content of leaf freelist pages is unimportant, SQLite
          929  +avoids storing leaf freelist page content in the rollback journal
          930  +in <a href="#section_3_5">step 3.5</a> of the commit process.
          931  +If a leaf freelist page is changed and that change does not get rolled back
          932  +during a transaction recovery, the database is not harmed by the omission.
          933  +Similarly, the content of a new freelist page is never written back
          934  +into the database at <a href="#section_3_9">step 3.9</a> nor
          935  +read from the database at <a href="#section_3_3">step 3.3</a>.
          936  +These optimizations can greatly reduce the amount of I/O that occurs
          937  +when making changes to a database file that contains free space.</p>
          938  +
          939  +<h3>7.4 Single Page Updates And Atomic Sector Writes</h3>
          940  +
          941  +<p>Beginning in SQLite version 3.5.0, the new Virtual File System (VFS)
          942  +interface contains a method named xDeviceCharacteristics which reports
          943  +on special properties that the underlying mass storage device
          944  +might have.  Among the special properties that
          945  +xDeviceCharacteristics might report is the ability of to do an
          946  +atomic sector write.</p>
          947  +
          948  +<p>Recall that by default SQLite assumes that sector writes are
          949  +linear but not atomic.  A linear write starts at one end of the
          950  +sector and changes information byte by byte until it gets to the
          951  +other end of the sector.  If a power loss occurs in the middle of
          952  +a linear write then part of the sector might be modified while the
          953  +other end is unchanged.  In an atomic sector write, either the entire
          954  +sector is overwritten or else nothing in the sector is changed.</p>
          955  +
          956  +<p>We believe that most modern disk drives implement atomic sector
          957  +writes.  When power is lost, the drive uses energy stored in capacitors
          958  +and/or the angular momentum of the disk platter to provide power to 
          959  +complete any operation in progress.  Nevertheless, there are so many
          960  +layers in between the write system call and the on-board disk drive
          961  +electronics that we take the safe approach in both Unix and w32 VFS
          962  +implementations and assume that sector writes are not atomic.  On the
          963  +other hand, device
          964  +manufactures with more control over their filesystems might want
          965  +to consider enabling the atomic write property of xDeviceCharacteristics
          966  +if their hardware really does do atomic writes.</p>
          967  +
          968  +<p>When sector writes are atomic and the page size of a database is
          969  +the same as a sector size, and when there is a database change that
          970  +only touches a single database page, then SQLite skips the whole
          971  +journaling and syncing process and simply writes the modified page
          972  +directly into the database file.  The change counter in the first
          973  +page of the database file is modified separately since no harm is
          974  +done if power is lost before the change counter can be updated.</p>
          975  +
          976  +<h3>7.5 Filesystems With Safe Append Semantics</h3>
          977  +
          978  +<p>Another optimization introduced in SQLite version 3.5.0 makes
          979  +use of "safe append" behavior of the underlying disk.
          980  +Recall that SQLite assumes that when data is appended to a file
          981  +(specifically to the rollback journal) that the size of the file
          982  +is increased first and that the content is written second.  So
          983  +if power is lost after the file size is increased but before the
          984  +content is written, the file is left containing invalid "garbage"
          985  +data.  The xDeviceCharacteristics method of the VFS might, however,
          986  +indicate that the filesystem implements "safe append" semantics.
          987  +This means that the content is written before the file size is
          988  +increased so that it is impossible for garbage to be introduced
          989  +into the rollback journal by a power loss or system crash.</p>
          990  +
          991  +<p>When safe append semantics are indicated for a filesystem,
          992  +SQLite always stores the special value of -1 for the page count
          993  +in the header of the rollback journal.  The -1 page count value
          994  +tells any process attempting to rollback the journal that the
          995  +number of pages in the journal should be computed from the journal
          996  +size.  This -1 value is never changed.  So that when a commit
          997  +occurs, we save a single flush operation and a sector write of
          998  +the first page of the journal file.  Furthermore, when a cache
          999  +spill occurs we no longer need to append a new journal header
         1000  +to the end of the journal; we can simply continue appending
         1001  +new pages to the end of the existing journal.</p>
         1002  +
         1003  +<h2>8.0 Testing Atomic Commit Behavior</h2>
         1004  +
         1005  +<p>The developers of SQLite are confident that it is robust
         1006  +in the face of power failures and system crashes because the
         1007  +automatic test procedures do extensive checks on
         1008  +the ability of SQLite to recover from simulated power loss.
         1009  +We call these the "crash tests".</p>
         1010  +
         1011  +<p>Crash tests in SQLite use a modified VFS that can simulate
         1012  +the kinds of filesystem damage that occur during a power
         1013  +loss or operating system crash.  The crash-test VFS can simulate
         1014  +incomplete sector writes, pages filled with garbage data because
         1015  +a write has not completed, and out of order writes, all occurring
         1016  +at varying points during a test scenario.  Crash tests execute
         1017  +transactions over and over, varying the time at which a simulated
         1018  +power loss occurs and the properties of the damage inflicted.
         1019  +Each test then reopens the database after the simulated crash and
         1020  +verifies that the transaction either occurred completely
         1021  +or not at all and that the database is in a completely
         1022  +consistent state.</p>
         1023  +
         1024  +<p>The crash tests in SQLite have discovered a number of very
         1025  +subtle bugs (now fixed) in the recovery mechanism.  Some of 
         1026  +these bugs were very obscure and unlikely to have been found
         1027  +using only code inspection and analysis techniques.  From this
         1028  +experience, the developers of SQLite feel confident that any other
         1029  +database system that does not use a similar crash test system
         1030  +likely contains undetected bugs that will lead to database
         1031  +corruption following a system crash or power failure.</p>
         1032  +
         1033  +<h2>9.0 Things That Can Go Wrong</h2>
         1034  +
         1035  +<p>The atomic commit mechanism in SQLite has proven to be robust,
         1036  +but it can be circumvented by a sufficiently creative
         1037  +adversary or a sufficiently broken operating system implementation.
         1038  +This section describes a few of the ways in which an SQLite database
         1039  +might be corrupted by a power failure or system crash.</p>
         1040  +
         1041  +<h3>9.1 Broken Locking Implementations</h3>
         1042  +
         1043  +<p>SQLite uses filesystem locks to make sure that only one
         1044  +process and database connection is trying to modify the database
         1045  +at a time.  The filesystem locking mechanism is implemented
         1046  +in the VFS layer and is different for every operating system.
         1047  +SQLite depends on this implementation being correct.  If something
         1048  +goes wrong and two or more processes are able to write the same
         1049  +database file at the same time, severe damage can result.</p>
         1050  +
         1051  +<p>We have received reports of implementations of both
         1052  +windows network filesystems and NFS in which locking was
         1053  +subtly broken.  We can not verify these reports, but as
         1054  +locking is difficult to get right on a network filesystem
         1055  +we have no reason to doubt them.  You are advised to 
         1056  +avoid using SQLite on a network filesystem in the first place,
         1057  +since performance will be slow.  But if you must use a 
         1058  +network filesystem to store SQLite database files, consider
         1059  +using a secondary locking mechanism to prevent simultaneous
         1060  +writes to the same database even if the native filesystem
         1061  +locking mechanism malfunctions.</p>
         1062  +
         1063  +<p>The versions of SQLite that come preinstalled on Apple
         1064  +Mac OS X computers contain a version of SQLite that has been
         1065  +extended to use alternative locking strategies that work on
         1066  +all network filesystems that Apple supports.  These extensions
         1067  +used by Apple work great as long as all processes are accessing
         1068  +the database file in the same way.  Unfortunately, the locking
         1069  +mechanisms do not exclude one another, so if one process is
         1070  +accessing a file using (for example) AFP locking and another
         1071  +process (perhaps on a different machine) is using dot-file locks,
         1072  +the two processes might collide because AFP locks do not exclude
         1073  +dot-file locks or vice versa.</p>
         1074  +
         1075  +<h3>9.2 Incomplete Disk Flushes</h3>
         1076  +
         1077  +<p>SQLite uses the fsync() system call on unix and the FlushFileBuffers()
         1078  +system call on w32 in order to sync the file system buffers onto disk
         1079  +oxide as shown in  <a href="#section_3_7">step 3.7</a> and
         1080  +<a href="#section_3.10">step 3.10</a>.  Unfortunately, we have received
         1081  +reports that neither of these interfaces works as advertised on many
         1082  +systems.  We hear that FlushFileBuffers() can be completely disabled
         1083  +using registry settings on some windows versions.  Some historical
         1084  +versions of Linux contain versions of fsync() which are no-ops on
         1085  +some filesystems, we are told.  Even on systems where 
         1086  +FlushFileBuffers() and fsync() are said to be working, often
         1087  +the IDE disk control lies and says that data has reached oxide
         1088  +while it is still held only in the volatile control cache.</p>
         1089  +
         1090  +<p>On the Mac, you can set this pragma:</p>
         1091  +
         1092  +<blockquote>
         1093  +PRAGMA fullfsync=ON;
         1094  +</blockquote>
         1095  +
         1096  +<p>Setting fullfsync on a Mac will guarantee that data really does
         1097  +get pushed out to the disk platter on a flush.  But the implementation
         1098  +of fullfsync involves resetting the disk controller.  And so not only
         1099  +is it profoundly slow, it also slows down other unrelated disk I/O.
         1100  +So its use is not recommended.</p>
         1101  +
         1102  +<h3>9.3 Partial File Deletions</h3>
         1103  +
         1104  +<p>SQLite assumes that file deletion is an atomic operation from the
         1105  +point of view of a user process.  If power fails in the middle of
         1106  +a file deletion, then after power is restored SQLite expects to see
         1107  +either the entire file with all of its original data intact, or it
         1108  +expects not to find the file at all.  Transactions may not be atomic
         1109  +on systems that do not work this way.</p>
         1110  +
         1111  +<h3>9.4 Garbage Written Into Files</h3>
         1112  +
         1113  +<p>SQLite database files are ordinary disk files that can be
         1114  +opened and written by ordinary user processes.  A rogue process
         1115  +can open an SQLite database and fill it with corrupt data.  
         1116  +Corrupt data might also be introduced into an SQLite database
         1117  +by bugs in the operating system or disk controller; especially
         1118  +bugs triggered by a power failure.  There is nothing SQLite can
         1119  +do to defend against these kinds of problems.</p>
         1120  +
         1121  +<h3>9.5 Deleting Or Renaming A Hot Journal</h3>
         1122  +
         1123  +<p>If a crash or power loss does occur and a hot journal is left on
         1124  +the disk, it is essential that the original database file and the hot
         1125  +journal remain on disk with their original names until the database
         1126  +file is opened by another SQLite process and rolled back.  
         1127  +During recovery at <a href="section_4_2">step 4.2</a> SQLite locates
         1128  +the hot journal by looking for a file in the same directory as the
         1129  +database being opened and whose name is derived from the name of the
         1130  +file being opened.  If either the original database file or the
         1131  +hot journal have been moved or renamed, then the hot journal will
         1132  +not be seen and the database will not be rolled back.</p>
         1133  +
         1134  +<p>We suspect that a common failure mode for SQLite recovery happens
         1135  +like this:  A power failure occurs.  After power is restored, a well-meaning
         1136  +user or system administrator begins looking around on the disk for
         1137  +damage.  They see their database file named "important.data".  This file
         1138  +is perhaps familiar to them.  But after the crash, there is also a
         1139  +hot journal named "important.data-journal".  The user then deletes
         1140  +the hot journal, thinking that they are helping to cleanup the system.
         1141  +We know of no way to prevent this other than user education.</p>
         1142  +
         1143  +<p>If there are multiple (hard or symbolic) links to a database file,
         1144  +the journal will be created using the name of the link through which
         1145  +the file was opened.  If a crash occurs and the database is opened again
         1146  +using a different link, the hot journal will not be located and no
         1147  +rollback will occur.</p>
         1148  +
         1149  +<p>Sometimes a power failure will cause a filesystem to be corrupted
         1150  +such that recently changed filenames are forgotten and the file is
         1151  +moved into a "/lost+found" directory.  When that happens, the hot
         1152  +journal will not be found and recovery will not occur.
         1153  +SQLite tries to prevent this
         1154  +by opening and syncing the directory containing the rollback journal
         1155  +at the same time it syncs the journal file itself.  However, the
         1156  +movement of files into /lost+found can be caused by unrelated processes
         1157  +creating unrelated files in the same directory as the main database file.
         1158  +And since this is out from under the control of SQLite, there is nothing
         1159  +that SQLite can do to prevent it.  If you are running on a system that
         1160  +is vulnerable to this kind of filesystem namespace corruption (most
         1161  +modern journalling filesystems are immune, we believe) then you might
         1162  +want to consider putting each SQLite database file in its own private
         1163  +subdirectory.</p>
         1164  +
         1165  +<h2>10.0 Future Directions And Conclusion</h2>
         1166  +
         1167  +<p>Every now and then someone discovers a new failure mode for
         1168  +the atomic commit mechanism in SQLite and the developers have to
         1169  +put in a patch.  This is happening less and less and the
         1170  +failure modes are becoming more and more obscure.  But it would
         1171  +still be foolish to suppose that the atomic commit logic of
         1172  +SQLite is entirely bug-free.  The developers are committed to fixing
         1173  +these bugs as quickly as they might be found.</p>
         1174  +
         1175  +<p>The developers are also on the lookout for new ways to
         1176  +optimize the commit mechanism.  The current VFS implementations
         1177  +for Linux, MacOSX, and w32 make pessimistic assumptions about
         1178  +the behavior of those systems.  After consultation with experts
         1179  +on how these systems work, we might be able to relax some of the
         1180  +assumptions on these systems and allow them to run faster.  In
         1181  +particular, we suspect that most modern filesystems exhibit the
         1182  +safe append property and that many of them might support atomic
         1183  +sector writes.  But until this is known for certain, SQLite will
         1184  +take the conservative approach and assume the worst.</p>