Documentation Source Text

Check-in [ad70832674]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Add fragments to headings of longer documents.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: ad70832674c129fa6bcfea6c769af324a62efec1
User & Date: drh 2010-01-13 22:15:07
Context
2010-01-14
09:56
Update backup.in to explain that writing to an in-memory source database mid-backup causes the entire backup operation to be restarted in the next call to backup_step(). check-in: f5ee14345e user: dan tags: trunk
2010-01-13
22:15
Add fragments to headings of longer documents. check-in: ad70832674 user: drh tags: trunk
18:09
Add a description of matchinfo() and update the description of snippet() in fts3.html. check-in: c31f4ab8b8 user: dan tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/atomiccommit.in.

24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
...
157
158
159
160
161
162
163

164
165
166
167
168
169
170
...
178
179
180
181
182
183
184

185
186
187
188
189
190
191
...
234
235
236
237
238
239
240

241
242
243
244
245
246
247
...
459
460
461
462
463
464
465

466
467
468
469
470
471
472
473
474
475

476
477
478
479
480
481
482
...
525
526
527
528
529
530
531

532
533
534
535
536
537
538
...
552
553
554
555
556
557
558

559
560
561
562
563
564
565
...
567
568
569
570
571
572
573

574
575
576
577
578
579
580
581
582
583
584

585
586
587
588
589
590
591
...
592
593
594
595
596
597
598

599
600
601
602
603
604
605
...
620
621
622
623
624
625
626

627
628
629
630
631
632
633
...
650
651
652
653
654
655
656

657
658
659
660
661
662
663
...
669
670
671
672
673
674
675

676
677
678
679
680
681
682
...
708
709
710
711
712
713
714

715
716
717
718
719
720
721
...
729
730
731
732
733
734
735

736
737
738
739
740
741
742

743
744
745
746
747
748
749
...
761
762
763
764
765
766
767

768
769
770
771
772
773
774
...
825
826
827
828
829
830
831

832
833
834
835
836
837
838
...
859
860
861
862
863
864
865

866
867
868
869
870
871
872
873
874

875
876
877
878
879
880
881
...
937
938
939
940
941
942
943

944
945
946
947
948
949
950
...
961
962
963
964
965
966
967

968
969
970
971
972
973
974
...
998
999
1000
1001
1002
1003
1004

1005
1006
1007
1008
1009
1010
1011
....
1101
1102
1103
1104
1105
1106
1107

1108
1109
1110
1111
1112
1113
1114
....
1141
1142
1143
1144
1145
1146
1147

1148
1149
1150
1151
1152
1153
1154
....
1175
1176
1177
1178
1179
1180
1181

1182
1183
1184
1185
1186
1187
1188
....
1202
1203
1204
1205
1206
1207
1208

1209
1210
1211
1212
1213
1214
1215
1216
1217

1218
1219
1220
1221
1222
1223
1224
1225
1226
1227

1228
1229
1230
1231
1232
1233
1234
....
1265
1266
1267
1268
1269
1270
1271

1272
1273
1274
1275
1276
1277
1278
<p>SQLite has the important property that transactions appear
to be atomic even if the transaction is interrupted by an
operating system crash or power failure.</p>

<p>This article describes the techniques used by SQLite to create the
illusion of atomic commit.</p>


<h2>2.0 Hardware Assumptions</h2>

<p>Throughout this article, we will call the mass storage device "disk"
even though the mass storage device might really be flash memory.</p>

<p>We assume that disk is written in chunks which we call a "sector".
It is not possible to modify any part of the disk smaller than a sector.
................................................................................

<p>We begin with an overview of the steps SQLite takes in order to
perform an atomic commit of a transaction against a single database
file.  The details of file formats used to guard against damage from
power failures and techniques for performing an atomic commit across
multiple databases are discussed in later sections.</p>


<h3>3.1 Initial State</h3>

<img src="images/ac/commit-0.gif" align="right" hspace="15">

<p>The state of the computer when a database connection is
first opened is shown conceptually by the diagram at the
right.
................................................................................
The left area of the diagram shows the content of memory for
the process that is using SQLite.  The database connection has
just been opened and no information has been read yet, so the
user space is empty.
</p>
<br clear="both">


<h3>3.2 Acquiring A Read Lock</h3>

<img src="images/ac/commit-1.gif" align="right" hspace="15">

<p>Before SQLite can write to a database, it must first read
the database to see what is there already.  Even if it is just
appending new data, SQLite still has to read in the database
................................................................................
are read.  In this example we are showing three
pages out of eight being read.  In a typical application, a
database will have thousands of pages and a query will normally
only touch a small percentage of those pages.</p>

<br clear="both">


<h3>3.4 Obtaining A Reserved Lock</h3>

<img src="images/ac/commit-3.gif" align="right" hspace="15">

<p>Before making changes to the database, SQLite first
obtains a "reserved" lock on the database file.  A reserved
lock is similar to a shared lock in that both a reserved lock
................................................................................
is modified.  We can find out if another process has modified the
database by checking that counter.  If the database was modified,
then the user space cache must be cleared and reread.  But it is
commonly the case that no changes have been made and the user
space cache can be reused for a significant performance savings.</p>

<br clear="both">

<h2>4.0 Rollback</h2>

<p>An atomic commit is supposed to happen instantaneously.  But the processing
described above clearly takes a finite amount of time.
Suppose the power to the computer were cut
part way through the commit operation described above.  In order
to maintain the illusion that the changes were instantaneous, we
have to "rollback" any partial changes and restore the database to
the state it was in prior to the beginning of the transaction.</p>


<h3>4.1 When Something Goes Wrong...</h3>
<img src="images/ac/rollback-0.gif" align="right" hspace="15">

<p>Suppose the power loss occurred during step 3.10 above,
while the database changes were being written to disk.
After power is restored, the situation might be something
like what is shown to the right.  We were trying to change
................................................................................
that a previous process was trying to commit a transaction but
it aborted for some reason prior to the completion of the
commit.  A hot journal means that
the database file is in an inconsistent state and needs to
be repaired (by rollback) prior to being used.</p>

<br clear="both">

<h3>4.3 Obtaining An Exclusive Lock On The Database</h3>
<img src="images/ac/rollback-2.gif" align="right" hspace="15">

<p>The first step toward dealing with a hot journal is to
obtain an exclusive lock on the database file.  This prevents two
or more processes from trying to rollback the same hot journal
at the same time.</p>
................................................................................
database file back to its original size in cases where the
incomplete transaction caused the database to grow.  At the
end of this step, the database should be the same size and
contain the same information as it did before the start of
the aborted transaction.</p>

<br clear="both">

<h3>4.5 Deleting The Hot Journal</h3>
<img src="images/ac/rollback-4.gif" align="right" hspace="15">

<p>After all information in the rollback journal has been
played back into the database file (and flushed to disk in case
we encounter yet another power failure), the hot rollback journal
can be deleted.</p>
................................................................................
<p>As in <a href="#section_3_11">section 3.11</a>, the journal
file might be truncated to zero length or its header might
be overwritten with zeros as an optimization on systems where
deleting a file is expense.  Either way, the journal is no 
long hot after this step.</p>

<br clear="both">

<h3>4.6 Continue As If The Uncompleted Writes Had Never Happened</h3>
<img src="images/ac/rollback-5.gif" align="right" hspace="15">

<p>The final recovery step is to reduce the exclusive lock back
to a shared lock.  Once this happens, the database is back in the
state that it would have been if the aborted transaction had never
started.  Since all of this recovery activity happens completely
automatically and transparently, it appears to the program using
SQLite as if the aborted transaction had never begun.</p>

<br clear="both">

<h2>5.0 Multi-file Commit</h2>

<p>SQLite allows a single 
<a href="c3ref/sqlite3.html">database connection</a> to talk to
two or more database files simultaneously through the use of
the <a href="lang_attach.html">ATTACH DATABASE</a> command.
When multiple database files are modified within a single
................................................................................
transaction, all files are updated atomically.  
In other words, either all of the database files are updated or
else none of them are.
Achieving an atomic commit across multiple database files is
more complex that doing so for a single file.  This section
describes how SQLite works that bit of magic.</p>


<h3>5.1 Separate Rollback Journals For Each Database</h3>
<img src="images/ac/multi-0.gif" align="right" hspace="15">

<p>When multiple database files are involved in a transaction,
each database has its own rollback journal and each database
is locked separately.  The diagram at the right shows a scenario
where three different database files have been modified within
................................................................................
we are not making the distinction between information in the
operating system cache and information that is on disk.  All of
these factors still apply in a multi-file commit scenario.  They
just take up a lot of space in the diagrams and they do not add
any new information, so they are omitted here.</p>

<br clear="both">

<h3>5.2 The Master Journal File</h3>
<img src="images/ac/multi-1.gif" align="right" hspace="15">

<p>The next step in a multi-file commit is the creation of a
"master journal" file.  The name of the master journal file is
the same name as the original database filename (the database
that was opened using the 
................................................................................
<p>After the master journal is constructed, its content is flushed
to disk before any further actions are taken.  On Unix, the directory
that contains the master journal is also synced in order to make sure
the master journal file will appear in the directory following a
power failure.</p>

<br clear="both">

<h3>5.3 Updating Rollback Journal Headers</h3>
<img src="images/ac/multi-2.gif" align="right" hspace="15">

<p>The next step is to record the full pathname of the master journal file
in the header of every rollback journal.  Space to hold the master
journal filename was reserved at the beginning of each rollback journal
as the rollback journals were created.</p>
................................................................................
page of the journal file (the first page) has changed.</p>

<p>This step is analogous to 
<a href="#section_3_7">step 3.7</a> in the single-file commit
scenario described above.</p>

<br clear="both">

<h3>5.4 Updating The Database Files</h3>
<img src="images/ac/multi-3.gif" align="right" hspace="15">

<p>Once all rollback journal files have been flushed to disk, it
is safe to begin updating database files.  We have to obtain an
exclusive lock on all database files before writing the changes.
After all the changes are written, it is important to flush the
................................................................................
rollback journal.  Upon restart, SQLite only considers a journal
to be hot and will only playback the journal if there is no
master journal filename in the header (which is the case for
a single-file commit) or if the master journal file still
exists on disk.</p>

<br clear="both">

<h3>5.6 Clean Up The Rollback Journals</h3>
<img src="images/ac/multi-5.gif" align="right" hspace="15">

<p>The final step in a multi-file commit is to delete the
individual rollback journals and drop the exclusive locks on
the database files so that other processes can see the changes.
This corresponds to 
................................................................................
to the next rollback journal.  But in the future we might change
this so that all rollback journals are deleted before any database
files are unlocked.  As long as the rollback journal is deleted before
its corresponding database file is unlocked it does not matter in what
order the rollback journals are deleted or the database files are
unlocked.</p>


<h2>6.0 Additional Details Of The Commit Process</h2>

<p><a href="#section_3_0">Section 3.0</a> above provides an overview of
how atomic commit works in SQLite.  But it glosses over a number of
important details.  The following subsections will attempt to fill
in the gaps.</p>


<h3>6.1 Always Journal Complete Sectors</h3>

<p>When the original content of a database page is written into
the rollback journal (as shown in <a href="#section_3_5">section 3.5</a>),
SQLite always writes a complete sectors worth of data, even if the
page size of the database is smaller than the sector size.  
Historically, the sector size in SQLite has been hard coded to 512
................................................................................
the changes to page 2, the underlying hardware must also rewrite the
content of pages 1, 3, and 4 since the hardware must write the complete
sector.  If this write operation is interrupted by a power outage,
one or more of the pages 1, 3, or 4 might be left with incorrect data.
Hence, to avoid lasting corruption to the database, the original content
of all of those pages must be contained in the rollback journal.</p>


<h3>6.2 Dealing With Garbage Written Into Journal Files</h3>

<p>When data is appended to the end of the rollback journal,
SQLite normally makes the pessimistic assumption that the file
is first extended with invalid "garbage" data and that afterwards
the correct data replaces the garbage.  In other words, SQLite assumes
that the file size is increased first and then afterwards the content
................................................................................

<p>Note that the checksums in the rollback journal are not necessary
if the synchronous setting is FULL.  We only depend on the checksums
when synchronous is lowered to NORMAL.  Nevertheless, the checksums
never hurt and so they are included in the rollback journal regardless
of the synchronous setting.</p>


<h3>6.3 Cache Spill Prior To Commit</h3>

<p>The commit process shown in <a href="#section_3_0">section 3.0</a>
assumes that all database changes fit in memory until it is time to
commit.  This is the common case.  But sometimes a larger change will
overflow the user-space cache prior to transaction commit.  In those
cases, the cache must spill to the database before the transaction
................................................................................
<p>A cache spill causes the lock on the database file to
escalate from reserved to exclusive.  This reduces concurrency.
A cache spill also causes extra disk flush or fsync operations to
occur and these operations are slow, hence a cache spill can
seriously reduce performance.
For these reasons a cache spill is avoided whenever possible.</p>


<h2>7.0 Optimizations</h2>

<p>Profiling indicates that for most systems and in most circumstances
SQLite spends most of its time doing disk I/O.  It follows then that
anything we can do to reduce the amount of disk I/O will likely have a
large positive impact on the performance of SQLite.  This section
describes some of the techniques used by SQLite to try to reduce the
amount of disk I/O to a minimum while still preserving atomic commit.</p>


<h3>7.1 Cache Retained Between Transactions</h3>

<p><a href="#section_3_12">Step 3.12</a> of the commit process shows
that once the shared lock has been released, all user-space cache
images of database content must be discarded.  This is done because
without a shared lock, other processes are free to modify the database
file content and so any user-space image of that content might become
................................................................................
<p>The third optimization, zeroing the journal file header rather than
deleting the rollback journal file,
does not depend on holding an exclusive lock at all times.
This optimization can be set independently of exclusive lock mode
using the [journal_mode pragma]
as described in <a href="#section_7_6">section 7.6</a> below.</p>


<h3>7.3 Do Not Journal Freelist Pages</h3>

<p>When information is deleted from an SQLite database, the pages used
to hold the deleted information are added to a "freelist".  Subsequent
inserts will draw pages off of this freelist rather than expanding the
database file.</p>

................................................................................
during a transaction recovery, the database is not harmed by the omission.
Similarly, the content of a new freelist page is never written back
into the database at <a href="#section_3_9">step 3.9</a> nor
read from the database at <a href="#section_3_3">step 3.3</a>.
These optimizations can greatly reduce the amount of I/O that occurs
when making changes to a database file that contains free space.</p>


<h3>7.4 Single Page Updates And Atomic Sector Writes</h3>

<p>Beginning in SQLite version 3.5.0, the new Virtual File System (VFS)
interface contains a method named xDeviceCharacteristics which reports
on special properties that the underlying mass storage device
might have.  Among the special properties that
xDeviceCharacteristics might report is the ability of to do an
................................................................................
the same as a sector size, and when there is a database change that
only touches a single database page, then SQLite skips the whole
journaling and syncing process and simply writes the modified page
directly into the database file.  The change counter in the first
page of the database file is modified separately since no harm is
done if power is lost before the change counter can be updated.</p>


<h3>7.5 Filesystems With Safe Append Semantics</h3>

<p>Another optimization introduced in SQLite version 3.5.0 makes
use of "safe append" behavior of the underlying disk.
Recall that SQLite assumes that when data is appended to a file
(specifically to the rollback journal) that the size of the file
is increased first and that the content is written second.  So
................................................................................
<p>On embedded systems with synchronous filesystems, TRUNCATE results
in slower behavior than PERSIST.  The commit operation is the same speed.
But subsequent transactions are slower following a TRUNCATE because it is
faster to overwrite existing content than to append to the end of a file.
New journal file entries will always be appended following a TRUNCATE but
will usually overwrite with PERSIST.</p>


<h2>8.0 Testing Atomic Commit Behavior</h2>

<p>The developers of SQLite are confident that it is robust
in the face of power failures and system crashes because the
automatic test procedures do extensive checks on
the ability of SQLite to recover from simulated power loss.
We call these the "crash tests".</p>
................................................................................
<p>The atomic commit mechanism in SQLite has proven to be robust,
but it can be circumvented by a sufficiently creative
adversary or a sufficiently broken operating system implementation.
This section describes a few of the ways in which an SQLite database
might be corrupted by a power failure or system crash.
(See also: [How To Corrupt Your Database Files].)</p>


<h3>9.1 Broken Locking Implementations</h3>

<p>SQLite uses filesystem locks to make sure that only one
process and database connection is trying to modify the database
at a time.  The filesystem locking mechanism is implemented
in the VFS layer and is different for every operating system.
SQLite depends on this implementation being correct.  If something
................................................................................
the database file in the same way.  Unfortunately, the locking
mechanisms do not exclude one another, so if one process is
accessing a file using (for example) AFP locking and another
process (perhaps on a different machine) is using dot-file locks,
the two processes might collide because AFP locks do not exclude
dot-file locks or vice versa.</p>


<h3>9.2 Incomplete Disk Flushes</h3>

<p>SQLite uses the fsync() system call on Unix and the FlushFileBuffers()
system call on w32 in order to sync the file system buffers onto disk
oxide as shown in  <a href="#section_3_7">step 3.7</a> and
<a href="#section_3_10">step 3.10</a>.  Unfortunately, we have received
reports that neither of these interfaces works as advertised on many
................................................................................

<p>Setting fullfsync on a Mac will guarantee that data really does
get pushed out to the disk platter on a flush.  But the implementation
of fullfsync involves resetting the disk controller.  And so not only
is it profoundly slow, it also slows down other unrelated disk I/O.
So its use is not recommended.</p>


<h3>9.3 Partial File Deletions</h3>

<p>SQLite assumes that file deletion is an atomic operation from the
point of view of a user process.  If power fails in the middle of
a file deletion, then after power is restored SQLite expects to see
either the entire file with all of its original data intact, or it
expects not to find the file at all.  Transactions may not be atomic
on systems that do not work this way.</p>


<h3>9.4 Garbage Written Into Files</h3>

<p>SQLite database files are ordinary disk files that can be
opened and written by ordinary user processes.  A rogue process
can open an SQLite database and fill it with corrupt data.  
Corrupt data might also be introduced into an SQLite database
by bugs in the operating system or disk controller; especially
bugs triggered by a power failure.  There is nothing SQLite can
do to defend against these kinds of problems.</p>


<h3>9.5 Deleting Or Renaming A Hot Journal</h3>

<p>If a crash or power loss does occur and a hot journal is left on
the disk, it is essential that the original database file and the hot
journal remain on disk with their original names until the database
file is opened by another SQLite process and rolled back.  
During recovery at <a href="section_4_2">step 4.2</a> SQLite locates
................................................................................
And since this is out from under the control of SQLite, there is nothing
that SQLite can do to prevent it.  If you are running on a system that
is vulnerable to this kind of filesystem namespace corruption (most
modern journalling filesystems are immune, we believe) then you might
want to consider putting each SQLite database file in its own private
subdirectory.</p>


<h2>10.0 Future Directions And Conclusion</h2>

<p>Every now and then someone discovers a new failure mode for
the atomic commit mechanism in SQLite and the developers have to
put in a patch.  This is happening less and less and the
failure modes are becoming more and more obscure.  But it would
still be foolish to suppose that the atomic commit logic of







|







 







>







 







>







 







>







 







>










>







 







>







 







>







 







>











>







 







>







 







>







 







>







 







>







 







>







 







>







>







 







>







 







>







 







>









>







 







>







 







>







 







>







 







>







 







>







 







>







 







>









>










>







 







>







24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
...
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
...
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
...
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
...
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
...
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
...
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
...
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
...
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
...
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
...
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
...
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
...
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
...
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
...
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
...
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
...
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
...
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
...
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
....
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
....
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
....
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
....
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
....
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
....
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
<p>SQLite has the important property that transactions appear
to be atomic even if the transaction is interrupted by an
operating system crash or power failure.</p>

<p>This article describes the techniques used by SQLite to create the
illusion of atomic commit.</p>

<tcl>hd_fragment hardware</tcl>
<h2>2.0 Hardware Assumptions</h2>

<p>Throughout this article, we will call the mass storage device "disk"
even though the mass storage device might really be flash memory.</p>

<p>We assume that disk is written in chunks which we call a "sector".
It is not possible to modify any part of the disk smaller than a sector.
................................................................................

<p>We begin with an overview of the steps SQLite takes in order to
perform an atomic commit of a transaction against a single database
file.  The details of file formats used to guard against damage from
power failures and techniques for performing an atomic commit across
multiple databases are discussed in later sections.</p>

<tcl>hd_fragment initstate</tcl>
<h3>3.1 Initial State</h3>

<img src="images/ac/commit-0.gif" align="right" hspace="15">

<p>The state of the computer when a database connection is
first opened is shown conceptually by the diagram at the
right.
................................................................................
The left area of the diagram shows the content of memory for
the process that is using SQLite.  The database connection has
just been opened and no information has been read yet, so the
user space is empty.
</p>
<br clear="both">

<tcl>hd_fragment rdlck</tcl>
<h3>3.2 Acquiring A Read Lock</h3>

<img src="images/ac/commit-1.gif" align="right" hspace="15">

<p>Before SQLite can write to a database, it must first read
the database to see what is there already.  Even if it is just
appending new data, SQLite still has to read in the database
................................................................................
are read.  In this example we are showing three
pages out of eight being read.  In a typical application, a
database will have thousands of pages and a query will normally
only touch a small percentage of those pages.</p>

<br clear="both">

<tcl>hd_fragment rsvdlock</tcl>
<h3>3.4 Obtaining A Reserved Lock</h3>

<img src="images/ac/commit-3.gif" align="right" hspace="15">

<p>Before making changes to the database, SQLite first
obtains a "reserved" lock on the database file.  A reserved
lock is similar to a shared lock in that both a reserved lock
................................................................................
is modified.  We can find out if another process has modified the
database by checking that counter.  If the database was modified,
then the user space cache must be cleared and reread.  But it is
commonly the case that no changes have been made and the user
space cache can be reused for a significant performance savings.</p>

<br clear="both">
<tcl>hd_fragment rollback</tcl>
<h2>4.0 Rollback</h2>

<p>An atomic commit is supposed to happen instantaneously.  But the processing
described above clearly takes a finite amount of time.
Suppose the power to the computer were cut
part way through the commit operation described above.  In order
to maintain the illusion that the changes were instantaneous, we
have to "rollback" any partial changes and restore the database to
the state it was in prior to the beginning of the transaction.</p>

<tcl>hd_fragment crisis</tcl>
<h3>4.1 When Something Goes Wrong...</h3>
<img src="images/ac/rollback-0.gif" align="right" hspace="15">

<p>Suppose the power loss occurred during step 3.10 above,
while the database changes were being written to disk.
After power is restored, the situation might be something
like what is shown to the right.  We were trying to change
................................................................................
that a previous process was trying to commit a transaction but
it aborted for some reason prior to the completion of the
commit.  A hot journal means that
the database file is in an inconsistent state and needs to
be repaired (by rollback) prior to being used.</p>

<br clear="both">
<tcl>hd_fragment exlock</tcl>
<h3>4.3 Obtaining An Exclusive Lock On The Database</h3>
<img src="images/ac/rollback-2.gif" align="right" hspace="15">

<p>The first step toward dealing with a hot journal is to
obtain an exclusive lock on the database file.  This prevents two
or more processes from trying to rollback the same hot journal
at the same time.</p>
................................................................................
database file back to its original size in cases where the
incomplete transaction caused the database to grow.  At the
end of this step, the database should be the same size and
contain the same information as it did before the start of
the aborted transaction.</p>

<br clear="both">
<tcl>hd_fragment delhotjrnl</tcl>
<h3>4.5 Deleting The Hot Journal</h3>
<img src="images/ac/rollback-4.gif" align="right" hspace="15">

<p>After all information in the rollback journal has been
played back into the database file (and flushed to disk in case
we encounter yet another power failure), the hot rollback journal
can be deleted.</p>
................................................................................
<p>As in <a href="#section_3_11">section 3.11</a>, the journal
file might be truncated to zero length or its header might
be overwritten with zeros as an optimization on systems where
deleting a file is expense.  Either way, the journal is no 
long hot after this step.</p>

<br clear="both">
<tcl>hd_fragment cont</tcl>
<h3>4.6 Continue As If The Uncompleted Writes Had Never Happened</h3>
<img src="images/ac/rollback-5.gif" align="right" hspace="15">

<p>The final recovery step is to reduce the exclusive lock back
to a shared lock.  Once this happens, the database is back in the
state that it would have been if the aborted transaction had never
started.  Since all of this recovery activity happens completely
automatically and transparently, it appears to the program using
SQLite as if the aborted transaction had never begun.</p>

<br clear="both">
<tcl>hd_fragment multicommit</tcl>
<h2>5.0 Multi-file Commit</h2>

<p>SQLite allows a single 
<a href="c3ref/sqlite3.html">database connection</a> to talk to
two or more database files simultaneously through the use of
the <a href="lang_attach.html">ATTACH DATABASE</a> command.
When multiple database files are modified within a single
................................................................................
transaction, all files are updated atomically.  
In other words, either all of the database files are updated or
else none of them are.
Achieving an atomic commit across multiple database files is
more complex that doing so for a single file.  This section
describes how SQLite works that bit of magic.</p>

<tcl>hd_fragment multijrnl</tcl>
<h3>5.1 Separate Rollback Journals For Each Database</h3>
<img src="images/ac/multi-0.gif" align="right" hspace="15">

<p>When multiple database files are involved in a transaction,
each database has its own rollback journal and each database
is locked separately.  The diagram at the right shows a scenario
where three different database files have been modified within
................................................................................
we are not making the distinction between information in the
operating system cache and information that is on disk.  All of
these factors still apply in a multi-file commit scenario.  They
just take up a lot of space in the diagrams and they do not add
any new information, so they are omitted here.</p>

<br clear="both">
<tcl>hd_fragment masterjrnl</tcl>
<h3>5.2 The Master Journal File</h3>
<img src="images/ac/multi-1.gif" align="right" hspace="15">

<p>The next step in a multi-file commit is the creation of a
"master journal" file.  The name of the master journal file is
the same name as the original database filename (the database
that was opened using the 
................................................................................
<p>After the master journal is constructed, its content is flushed
to disk before any further actions are taken.  On Unix, the directory
that contains the master journal is also synced in order to make sure
the master journal file will appear in the directory following a
power failure.</p>

<br clear="both">
<tcl>hd_fragment multijrnlupdate</tcl>
<h3>5.3 Updating Rollback Journal Headers</h3>
<img src="images/ac/multi-2.gif" align="right" hspace="15">

<p>The next step is to record the full pathname of the master journal file
in the header of every rollback journal.  Space to hold the master
journal filename was reserved at the beginning of each rollback journal
as the rollback journals were created.</p>
................................................................................
page of the journal file (the first page) has changed.</p>

<p>This step is analogous to 
<a href="#section_3_7">step 3.7</a> in the single-file commit
scenario described above.</p>

<br clear="both">
<tcl>hd_fragment multidbupdate</tcl>
<h3>5.4 Updating The Database Files</h3>
<img src="images/ac/multi-3.gif" align="right" hspace="15">

<p>Once all rollback journal files have been flushed to disk, it
is safe to begin updating database files.  We have to obtain an
exclusive lock on all database files before writing the changes.
After all the changes are written, it is important to flush the
................................................................................
rollback journal.  Upon restart, SQLite only considers a journal
to be hot and will only playback the journal if there is no
master journal filename in the header (which is the case for
a single-file commit) or if the master journal file still
exists on disk.</p>

<br clear="both">
<tcl>hd_fragment cleanup</tcl>
<h3>5.6 Clean Up The Rollback Journals</h3>
<img src="images/ac/multi-5.gif" align="right" hspace="15">

<p>The final step in a multi-file commit is to delete the
individual rollback journals and drop the exclusive locks on
the database files so that other processes can see the changes.
This corresponds to 
................................................................................
to the next rollback journal.  But in the future we might change
this so that all rollback journals are deleted before any database
files are unlocked.  As long as the rollback journal is deleted before
its corresponding database file is unlocked it does not matter in what
order the rollback journals are deleted or the database files are
unlocked.</p>

<tcl>hd_fragment moredetail</tcl>
<h2>6.0 Additional Details Of The Commit Process</h2>

<p><a href="#section_3_0">Section 3.0</a> above provides an overview of
how atomic commit works in SQLite.  But it glosses over a number of
important details.  The following subsections will attempt to fill
in the gaps.</p>

<tcl>hd_fragment completesectors</tcl>
<h3>6.1 Always Journal Complete Sectors</h3>

<p>When the original content of a database page is written into
the rollback journal (as shown in <a href="#section_3_5">section 3.5</a>),
SQLite always writes a complete sectors worth of data, even if the
page size of the database is smaller than the sector size.  
Historically, the sector size in SQLite has been hard coded to 512
................................................................................
the changes to page 2, the underlying hardware must also rewrite the
content of pages 1, 3, and 4 since the hardware must write the complete
sector.  If this write operation is interrupted by a power outage,
one or more of the pages 1, 3, or 4 might be left with incorrect data.
Hence, to avoid lasting corruption to the database, the original content
of all of those pages must be contained in the rollback journal.</p>

<tcl>hd_fragment journalgarbage</tcl>
<h3>6.2 Dealing With Garbage Written Into Journal Files</h3>

<p>When data is appended to the end of the rollback journal,
SQLite normally makes the pessimistic assumption that the file
is first extended with invalid "garbage" data and that afterwards
the correct data replaces the garbage.  In other words, SQLite assumes
that the file size is increased first and then afterwards the content
................................................................................

<p>Note that the checksums in the rollback journal are not necessary
if the synchronous setting is FULL.  We only depend on the checksums
when synchronous is lowered to NORMAL.  Nevertheless, the checksums
never hurt and so they are included in the rollback journal regardless
of the synchronous setting.</p>

<tcl>hd_fragment cachespill</tcl>
<h3>6.3 Cache Spill Prior To Commit</h3>

<p>The commit process shown in <a href="#section_3_0">section 3.0</a>
assumes that all database changes fit in memory until it is time to
commit.  This is the common case.  But sometimes a larger change will
overflow the user-space cache prior to transaction commit.  In those
cases, the cache must spill to the database before the transaction
................................................................................
<p>A cache spill causes the lock on the database file to
escalate from reserved to exclusive.  This reduces concurrency.
A cache spill also causes extra disk flush or fsync operations to
occur and these operations are slow, hence a cache spill can
seriously reduce performance.
For these reasons a cache spill is avoided whenever possible.</p>

<tcl>hd_fragment opts</tcl>
<h2>7.0 Optimizations</h2>

<p>Profiling indicates that for most systems and in most circumstances
SQLite spends most of its time doing disk I/O.  It follows then that
anything we can do to reduce the amount of disk I/O will likely have a
large positive impact on the performance of SQLite.  This section
describes some of the techniques used by SQLite to try to reduce the
amount of disk I/O to a minimum while still preserving atomic commit.</p>

<tcl>hd_fragment keepcache</tcl>
<h3>7.1 Cache Retained Between Transactions</h3>

<p><a href="#section_3_12">Step 3.12</a> of the commit process shows
that once the shared lock has been released, all user-space cache
images of database content must be discarded.  This is done because
without a shared lock, other processes are free to modify the database
file content and so any user-space image of that content might become
................................................................................
<p>The third optimization, zeroing the journal file header rather than
deleting the rollback journal file,
does not depend on holding an exclusive lock at all times.
This optimization can be set independently of exclusive lock mode
using the [journal_mode pragma]
as described in <a href="#section_7_6">section 7.6</a> below.</p>

<tcl>hd_fragment freelistjrnl</tcl>
<h3>7.3 Do Not Journal Freelist Pages</h3>

<p>When information is deleted from an SQLite database, the pages used
to hold the deleted information are added to a "freelist".  Subsequent
inserts will draw pages off of this freelist rather than expanding the
database file.</p>

................................................................................
during a transaction recovery, the database is not harmed by the omission.
Similarly, the content of a new freelist page is never written back
into the database at <a href="#section_3_9">step 3.9</a> nor
read from the database at <a href="#section_3_3">step 3.3</a>.
These optimizations can greatly reduce the amount of I/O that occurs
when making changes to a database file that contains free space.</p>

<tcl>hd_fragment atomicsector</tcl>
<h3>7.4 Single Page Updates And Atomic Sector Writes</h3>

<p>Beginning in SQLite version 3.5.0, the new Virtual File System (VFS)
interface contains a method named xDeviceCharacteristics which reports
on special properties that the underlying mass storage device
might have.  Among the special properties that
xDeviceCharacteristics might report is the ability of to do an
................................................................................
the same as a sector size, and when there is a database change that
only touches a single database page, then SQLite skips the whole
journaling and syncing process and simply writes the modified page
directly into the database file.  The change counter in the first
page of the database file is modified separately since no harm is
done if power is lost before the change counter can be updated.</p>

<tcl>hd_fragment safeappend</tcl>
<h3>7.5 Filesystems With Safe Append Semantics</h3>

<p>Another optimization introduced in SQLite version 3.5.0 makes
use of "safe append" behavior of the underlying disk.
Recall that SQLite assumes that when data is appended to a file
(specifically to the rollback journal) that the size of the file
is increased first and that the content is written second.  So
................................................................................
<p>On embedded systems with synchronous filesystems, TRUNCATE results
in slower behavior than PERSIST.  The commit operation is the same speed.
But subsequent transactions are slower following a TRUNCATE because it is
faster to overwrite existing content than to append to the end of a file.
New journal file entries will always be appended following a TRUNCATE but
will usually overwrite with PERSIST.</p>

<tcl>hd_fragment  testing</tcl>
<h2>8.0 Testing Atomic Commit Behavior</h2>

<p>The developers of SQLite are confident that it is robust
in the face of power failures and system crashes because the
automatic test procedures do extensive checks on
the ability of SQLite to recover from simulated power loss.
We call these the "crash tests".</p>
................................................................................
<p>The atomic commit mechanism in SQLite has proven to be robust,
but it can be circumvented by a sufficiently creative
adversary or a sufficiently broken operating system implementation.
This section describes a few of the ways in which an SQLite database
might be corrupted by a power failure or system crash.
(See also: [How To Corrupt Your Database Files].)</p>

<tcl>hd_fragment brokenlocks</tcl>
<h3>9.1 Broken Locking Implementations</h3>

<p>SQLite uses filesystem locks to make sure that only one
process and database connection is trying to modify the database
at a time.  The filesystem locking mechanism is implemented
in the VFS layer and is different for every operating system.
SQLite depends on this implementation being correct.  If something
................................................................................
the database file in the same way.  Unfortunately, the locking
mechanisms do not exclude one another, so if one process is
accessing a file using (for example) AFP locking and another
process (perhaps on a different machine) is using dot-file locks,
the two processes might collide because AFP locks do not exclude
dot-file locks or vice versa.</p>

<tcl>hd_fragment fsync</tcl>
<h3>9.2 Incomplete Disk Flushes</h3>

<p>SQLite uses the fsync() system call on Unix and the FlushFileBuffers()
system call on w32 in order to sync the file system buffers onto disk
oxide as shown in  <a href="#section_3_7">step 3.7</a> and
<a href="#section_3_10">step 3.10</a>.  Unfortunately, we have received
reports that neither of these interfaces works as advertised on many
................................................................................

<p>Setting fullfsync on a Mac will guarantee that data really does
get pushed out to the disk platter on a flush.  But the implementation
of fullfsync involves resetting the disk controller.  And so not only
is it profoundly slow, it also slows down other unrelated disk I/O.
So its use is not recommended.</p>

<tcl>hd_fragment filedel</tcl>
<h3>9.3 Partial File Deletions</h3>

<p>SQLite assumes that file deletion is an atomic operation from the
point of view of a user process.  If power fails in the middle of
a file deletion, then after power is restored SQLite expects to see
either the entire file with all of its original data intact, or it
expects not to find the file at all.  Transactions may not be atomic
on systems that do not work this way.</p>

<tcl>hd_fragment filegarbage</tcl>
<h3>9.4 Garbage Written Into Files</h3>

<p>SQLite database files are ordinary disk files that can be
opened and written by ordinary user processes.  A rogue process
can open an SQLite database and fill it with corrupt data.  
Corrupt data might also be introduced into an SQLite database
by bugs in the operating system or disk controller; especially
bugs triggered by a power failure.  There is nothing SQLite can
do to defend against these kinds of problems.</p>

<tcl>hd_fragment mvhotjrnl</tcl>
<h3>9.5 Deleting Or Renaming A Hot Journal</h3>

<p>If a crash or power loss does occur and a hot journal is left on
the disk, it is essential that the original database file and the hot
journal remain on disk with their original names until the database
file is opened by another SQLite process and rolled back.  
During recovery at <a href="section_4_2">step 4.2</a> SQLite locates
................................................................................
And since this is out from under the control of SQLite, there is nothing
that SQLite can do to prevent it.  If you are running on a system that
is vulnerable to this kind of filesystem namespace corruption (most
modern journalling filesystems are immune, we believe) then you might
want to consider putting each SQLite database file in its own private
subdirectory.</p>

<tcl>hd_fragment future</tcl>
<h2>10.0 Future Directions And Conclusion</h2>

<p>Every now and then someone discovers a new failure mode for
the atomic commit mechanism in SQLite and the developers have to
put in a patch.  This is happening less and less and the
failure modes are becoming more and more obscure.  But it would
still be foolish to suppose that the atomic commit logic of

Changes to pages/optoverview.in.

127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
...
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
  ^Thus for the example index above, if there is no WHERE clause term
  that constraints column c, then terms that constrain columns a and b can
  be used with the index but not terms that constraint columns d through z.
  ^Similarly, no index column will be used (for indexing purposes)
  that is to the right of a 
  column that is constrained only by inequalities.
}
HEADING 2 {Index term usage examples}
PARAGRAPH {
  ^(For the index above and WHERE clause like this:
}
CODE {
  ... WHERE a=5 AND b IN (1,2,3) AND c IS NULL AND d='hello'
}
PARAGRAPH {
................................................................................
  SQLite choose by default?  ^(As of version 3.6.18, without running [ANALYZE],
  SQLite will choose option 2.)^
  ^But if the [ANALYZE] command is run in order to gather statistics,
  a different choice might be made if the statistics indicate that the
  alternative is likely to run faster.
}

HEADING 2 {Manual Control Of Query Plans}

PARAGRAPH {
  SQLite provides the ability for advanced programmers to exercise control
  over the query plan chosen by the optimizer. One method for doing this
  is to fudge the [ANALYZE] results in the <b>sqlite_stat1</b> and
  <b>sqlite_stat2</b> tables.  That approach is not recommended except
  for the one scenario described in the following paragraph.







|







 







|







127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
...
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
  ^Thus for the example index above, if there is no WHERE clause term
  that constraints column c, then terms that constrain columns a and b can
  be used with the index but not terms that constraint columns d through z.
  ^Similarly, no index column will be used (for indexing purposes)
  that is to the right of a 
  column that is constrained only by inequalities.
}
HEADING 2 {Index term usage examples} idxexamp
PARAGRAPH {
  ^(For the index above and WHERE clause like this:
}
CODE {
  ... WHERE a=5 AND b IN (1,2,3) AND c IS NULL AND d='hello'
}
PARAGRAPH {
................................................................................
  SQLite choose by default?  ^(As of version 3.6.18, without running [ANALYZE],
  SQLite will choose option 2.)^
  ^But if the [ANALYZE] command is run in order to gather statistics,
  a different choice might be made if the statistics indicate that the
  alternative is likely to run faster.
}

HEADING 2 {Manual Control Of Query Plans} manctrl

PARAGRAPH {
  SQLite provides the ability for advanced programmers to exercise control
  over the query plan chosen by the optimizer. One method for doing this
  is to fudge the [ANALYZE] results in the <b>sqlite_stat1</b> and
  <b>sqlite_stat2</b> tables.  That approach is not recommended except
  for the one scenario described in the following paragraph.

Changes to pages/tempfiles.in.

32
33
34
35
36
37
38

39
40
41
42
43
44
45
..
53
54
55
56
57
58
59

60
61
62
63
64
65
66
...
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
...
168
169
170
171
172
173
174

175
176
177
178
179
180
181
...
214
215
216
217
218
219
220

221
222
223
224
225
226
227
...
248
249
250
251
252
253
254

255
256
257
258
259
260
261
...
353
354
355
356
357
358
359

360
361
362
363
364
365
366
...
446
447
448
449
450
451
452

453
454
455
456
457
458
459
...
551
552
553
554
555
556
557

558
559
560
561
562
563
564
updated.  But there is no guarantee that future versions of SQLite
will use temporary files in the same way.  New kinds of temporary
files might be employed  and some of
the current temporary file uses might be discontinued
in future releases of SQLite.
</p>


<h2>2.0 Seven Kinds Of Temporary Files</h2>

<p>
SQLite currently uses seven distinct types of temporary files:
</p>

<ol>
................................................................................
</ol>

<p>
Additional information about each of these temporary file types
is in the sequel.
</p>


<h3>2.1 Rollback Journals</h3>

<p>
A rollback journal is a temporary file used to implement
atomic commit and rollback capabilities in SQLite.
(For a detailed discussion of how this works, see
the separate document titled
................................................................................
commit and rollback capabilities of SQLite.  The ROLLBACK command
is not available when OFF journal mode is set.  And if a crash or power
loss occurs in the middle of a transaction that uses the OFF journal
mode, no recovery is possible and the database file will likely
go corrupt.
</p>


<h3>2.2 Master Journal Files</h3>

<p>
The master journal file is used as part of the atomic commit
process when a single transaction makes changes to multiple
databases that have been added to a single [database connection]
using the [ATTACH] statement.  The master journal file is always
................................................................................
would not be atomic across all databases.  In other words, if the
commit were interrupted in the middle by a crash or power loss, then
the changes to one of the databases might complete while the changes
to another database might roll back.  The master journal causes all
changes in all databases to either rollback or commit together.
</p>


<h3>2.3 Statement Journal Files</h3>

<p>
A statement journal file is used to rollback partial results of
a single statement within a larger transaction.  For example, suppose
an UPDATE statement will attempt to modify 100 rows in the database.
But after modifying the first 50 rows, the UPDATE hits
................................................................................
in the same directory as the main database, and is automatically
deleted at the conclusion of the transaction.  The size of the
statement journal is proportional to the size of the change implemented
by the UPDATE or INSERT statement that caused the statement journal
to be created.
</p>


<h3>2.4 TEMP Databases</h3>

<p>Tables created using the "CREATE TEMP TABLE" syntax are only
visible to the [database connection] in which the "CREATE TEMP TABLE"
statement is originally evaluated.  These TEMP tables, together
with any associated indices, triggers, and views, are collectively
stored in a separate temporary database file that is created as
................................................................................

<p>
The temporary files associated with the TEMP database and its
rollback journal are only created if the application makes use
of the "CREATE TEMP TABLE" statement.
</p>


<h3>2.5 Materializations Of Views And Subqueries</h3>

<p>Queries that contain subqueries must sometime evaluate
the subqueries separately and store the results in a temporary
table, then use the content of the temporary table to evaluate
the outer query.
We call this "materializing" the subquery.
................................................................................
the query can be flattened depends on such factors as whether
or not the subquery or outer query contain aggregate functions,
ORDER BY or GROUP BY clauses, LIMIT clauses, and so forth.
The rules for when a query and an cannot be flattened are
very complex and are beyond the scope of this document.
</p>


<h3>2.6 Transient Indices</h3>

<p>
SQLite may make use of transient indices to
implement SQL language features such as:
</p>

................................................................................

<p>
Note that the UNION ALL operator for compound queries does not
use transient indices by itself (though of course the right
and left subqueries of the UNION ALL might use transient indices
depending on how they are composed.)


<h3>2.7 Transient Database Used By [VACUUM]</h3>

<p>
The [VACUUM] command works by creating a temporary file
and then rebuilding the entire database into that temporary
file.  Then the content of the temporary file is copied back
into the original database file and the temporary file is
................................................................................
influence the temporary files other than the rollback journal
and the master journal.  The rollback journal and the master
journal are always written to disk regardless of the settings of
the [SQLITE_TEMP_STORE] compile-time parameter and the
[temp_store pragma].
</p>


<h2>4.0 Other Temporary File Optimizations</h2>

<p>
SQLite uses a page cache of recently read and written database
pages.  This page cache is used not just for the main database
file but also for transient indices and tables stored in temporary
files.  If SQLite needs to use a temporary index or table and







>







 







>







 







|







 







>







 







>







 







>







 







>







 







>







 







>







32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
..
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
...
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
...
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
...
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
...
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
...
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
...
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
...
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
updated.  But there is no guarantee that future versions of SQLite
will use temporary files in the same way.  New kinds of temporary
files might be employed  and some of
the current temporary file uses might be discontinued
in future releases of SQLite.
</p>

<tcl>hd_fragment seventypes</tcl>
<h2>2.0 Seven Kinds Of Temporary Files</h2>

<p>
SQLite currently uses seven distinct types of temporary files:
</p>

<ol>
................................................................................
</ol>

<p>
Additional information about each of these temporary file types
is in the sequel.
</p>

<tcl>hd_fragment rollbackjrnl</tcl>
<h3>2.1 Rollback Journals</h3>

<p>
A rollback journal is a temporary file used to implement
atomic commit and rollback capabilities in SQLite.
(For a detailed discussion of how this works, see
the separate document titled
................................................................................
commit and rollback capabilities of SQLite.  The ROLLBACK command
is not available when OFF journal mode is set.  And if a crash or power
loss occurs in the middle of a transaction that uses the OFF journal
mode, no recovery is possible and the database file will likely
go corrupt.
</p>

<tcl>hd_fragment masterjrnl</tcl>
<h3>2.2 Master Journal Files</h3>

<p>
The master journal file is used as part of the atomic commit
process when a single transaction makes changes to multiple
databases that have been added to a single [database connection]
using the [ATTACH] statement.  The master journal file is always
................................................................................
would not be atomic across all databases.  In other words, if the
commit were interrupted in the middle by a crash or power loss, then
the changes to one of the databases might complete while the changes
to another database might roll back.  The master journal causes all
changes in all databases to either rollback or commit together.
</p>

<tcl>hd_fragment stmtjrnl</tcl>
<h3>2.3 Statement Journal Files</h3>

<p>
A statement journal file is used to rollback partial results of
a single statement within a larger transaction.  For example, suppose
an UPDATE statement will attempt to modify 100 rows in the database.
But after modifying the first 50 rows, the UPDATE hits
................................................................................
in the same directory as the main database, and is automatically
deleted at the conclusion of the transaction.  The size of the
statement journal is proportional to the size of the change implemented
by the UPDATE or INSERT statement that caused the statement journal
to be created.
</p>

<tcl>hd_fragment tempdb</tcl>
<h3>2.4 TEMP Databases</h3>

<p>Tables created using the "CREATE TEMP TABLE" syntax are only
visible to the [database connection] in which the "CREATE TEMP TABLE"
statement is originally evaluated.  These TEMP tables, together
with any associated indices, triggers, and views, are collectively
stored in a separate temporary database file that is created as
................................................................................

<p>
The temporary files associated with the TEMP database and its
rollback journal are only created if the application makes use
of the "CREATE TEMP TABLE" statement.
</p>

<tcl>hd_fragment views</tcl>
<h3>2.5 Materializations Of Views And Subqueries</h3>

<p>Queries that contain subqueries must sometime evaluate
the subqueries separately and store the results in a temporary
table, then use the content of the temporary table to evaluate
the outer query.
We call this "materializing" the subquery.
................................................................................
the query can be flattened depends on such factors as whether
or not the subquery or outer query contain aggregate functions,
ORDER BY or GROUP BY clauses, LIMIT clauses, and so forth.
The rules for when a query and an cannot be flattened are
very complex and are beyond the scope of this document.
</p>

<tcl>hd_fragment transidx</tcl>
<h3>2.6 Transient Indices</h3>

<p>
SQLite may make use of transient indices to
implement SQL language features such as:
</p>

................................................................................

<p>
Note that the UNION ALL operator for compound queries does not
use transient indices by itself (though of course the right
and left subqueries of the UNION ALL might use transient indices
depending on how they are composed.)

<tcl>hd_fragment vacuumdb</tcl>
<h3>2.7 Transient Database Used By [VACUUM]</h3>

<p>
The [VACUUM] command works by creating a temporary file
and then rebuilding the entire database into that temporary
file.  Then the content of the temporary file is copied back
into the original database file and the temporary file is
................................................................................
influence the temporary files other than the rollback journal
and the master journal.  The rollback journal and the master
journal are always written to disk regardless of the settings of
the [SQLITE_TEMP_STORE] compile-time parameter and the
[temp_store pragma].
</p>

<tcl>hd_fragment otheropt</tcl>
<h2>4.0 Other Temporary File Optimizations</h2>

<p>
SQLite uses a page cache of recently read and written database
pages.  This page cache is used not just for the main database
file but also for transient indices and tables stored in temporary
files.  If SQLite needs to use a temporary index or table and