Documentation Source Text

Check-in [8b28a36fa1]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Add the atomiccommit.html document. No links to it yet.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 8b28a36fa14aa14df7b083f571618e57a5668cf4
User & Date: drh 2007-11-13 01:43:31
Context
2007-11-13
02:53
Darken the font color on hyperlinks. Fix a bad link to the icons on the famous.html page. check-in: 8916aa40ec user: drh tags: trunk
01:43
Add the atomiccommit.html document. No links to it yet. check-in: 8b28a36fa1 user: drh tags: trunk
01:24
Cleaned up the famous users webpage. Work on the about page. check-in: 8a34ae6a2f user: drh tags: trunk
Changes
Hide Diffs Unified Diffs Show Whitespace Changes Patch

Added images/ac/commit-0.gif.

cannot compute difference between binary files

Added images/ac/commit-1.gif.

cannot compute difference between binary files

Added images/ac/commit-2.gif.

cannot compute difference between binary files

Added images/ac/commit-3.gif.

cannot compute difference between binary files

Added images/ac/commit-4.gif.

cannot compute difference between binary files

Added images/ac/commit-5.gif.

cannot compute difference between binary files

Added images/ac/commit-6.gif.

cannot compute difference between binary files

Added images/ac/commit-7.gif.

cannot compute difference between binary files

Added images/ac/commit-8.gif.

cannot compute difference between binary files

Added images/ac/commit-9.gif.

cannot compute difference between binary files

Added images/ac/commit-A.gif.

cannot compute difference between binary files

Added images/ac/commit-B.gif.

cannot compute difference between binary files

Added images/ac/multi-0.gif.

cannot compute difference between binary files

Added images/ac/multi-1.gif.

cannot compute difference between binary files

Added images/ac/multi-2.gif.

cannot compute difference between binary files

Added images/ac/multi-3.gif.

cannot compute difference between binary files

Added images/ac/multi-4.gif.

cannot compute difference between binary files

Added images/ac/multi-5.gif.

cannot compute difference between binary files

Added images/ac/rollback-0.gif.

cannot compute difference between binary files

Added images/ac/rollback-1.gif.

cannot compute difference between binary files

Added images/ac/rollback-2.gif.

cannot compute difference between binary files

Added images/ac/rollback-3.gif.

cannot compute difference between binary files

Added images/ac/rollback-4.gif.

cannot compute difference between binary files

Added images/ac/rollback-5.gif.

cannot compute difference between binary files

Added pages/atomiccomit.in.

































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
<title>Atomic Commit In SQLite</title>

<h1 align="center">
Atomic Commit In SQLite
</h1>

<h2>1.0 Introduction</h2>

<p>An important feature of transactional databases like SQLite
is "atomic commit".  
Atomic commit means that either all database changes within a single 
transaction occur or none of them occur.  With atomic commit, it
is as if many different writes to different sections of the database
file occur instantaneously and simultaneously.
Real hardware serializes writes to mass storage, and writing
a single sector takes a finite amount of time.
So it is impossible to truly write many different sectors of a 
database file simultaneously and/or instantaneously.
But the atomic commit logic within
SQLite makes it appear as if the changes for a transaction
are all written instantaneously and simultaneously.</p>

<p>SQLite has the important property that transactions appear
to be atomic even if the transaction is interrupted by an
operating system crash or power failure.</p>

<p>This article describes the techniques used by SQLite to create the
illusion of atomic commit.</p>


<h2>2.0 Hardware Assumptions</h2>

<p>Throughout this article, we will call the mass storage device "disk"
even though the mass storage device might really be flash memory.</p>

<p>We assume that disk is written in chunks which we call a "sector".
It is not possible to modify any part of the disk smaller than a sector.
To change a part of the disk smaller than a sector, you have to read in
the full sector that contains the part you want to change, make the
change, then write back out the complete sector.</p>

<p>On a traditional spinning disk, a sector is the minimum unit of transfer
in both directions, both reading and writing.  On flash memory, however,
the minimum size of a read is typically much smaller than a minimum write.
SQLite is only concerned with the minimum write amount and so for the
purposes of this article, when we say "sector" we mean the minimum amount
of data that can be written to mass storage in a single go.</p>

<p>Prior to SQLite version 3.3.14, a sector size of 512 bytes was
assumed in all cases.  There was a compile-time option to change
this but the code had never been tested with a larger value.  The
512 byte sector assumption seemed reasonable since until very recently
all disk drives used a 512 byte sector internally.  However, there
has recently been a push to increase the sector size of disks to
4096 bytes.  Also the sector size
for flash memory is usually larger than 512 bytes.  For these reasons,
versions of SQLite beginning with 3.3.14 have a method in the OS
interface layer that interrogates the underlying filesystem to find
the true sector size.  As currently implemented (version 3.5.0) this
method still returns a hard-coded value of 512 bytes, since there
is no standard way of discovering the true sector size on either
win32 or unix.  But the method is available for embedded device
manufactures to tweak according to their own needs.  And we have
left open the possibility of filling in a more meaningful implementation
on unix and win32 in the future.</p>

<p>SQLite does <u>not</u> assume that a sector write is atomic.
However, it does assume that a sector write is linear.  By "linear"
we mean that SQLite assumes that when writing a sector, the hardware begins
at one end of the data and writes byte by byte until it gets to
the other end.  The write might go from beginning to end or from
end to beginning.  If a power failure occurs in the middle of a
sector write it might be that part of the sector was modified
and another part was left unchanged.  The key assumption by SQLite
is that if any part of the sector gets changed, then either the
first or the last bytes will be changed.  So the hardware will
never start writing a sector in the middle and work towards the
ends.  We do not know if this assumption is always true but it
seems reasonable.</p>

<p>The previous paragraph states that SQLite does not assume that
sector writes are atomic.  This is true by default.  But as of
SQLite version 3.5.0, there is a new interface called the
Virtual File System (VFS) interface.  The VFS is the only means
by which SQLite communicates to the underlying filesystem.  The
code comes with default VFS implementations for unix and windows
and there is a mechanism for creating new custom VFS implementations
at runtime.  In this new VFS interface there is a method called
xDeviceCharacteristics.  This method interrogates the underlying
filesystem to discover various properties and behaviors that the
filesystem may or may not exhibit.  The xDeviceCharacteristics
method might indicate that sector writes are atomic, and if it does
so indicate, SQLite will try to take advantage of that fact.  But
the default xDeviceCharacteristics method for both unix and windows
does not indicate atomic sector writes and so these optimizations
are normally omitted.</p>

<p>SQLite assumes that the operating system will buffer writes and
that a write request will return before data has actually been stored
in the mass storage device.
SQLite further assumes that write operations will be reordered by
the operating system.
For this reason, SQLite does a "flush" or "fsync" operation at key
points.  SQLite assumes that the flush or fsync will not return until
all pending write operations for the file that is being flushed have
completed.  We are told that the flush and fsync primitives
are broken on some versions of windows and Linux.  This is unfortunate.
It opens SQLite up to the possibility of database corruption following
a power loss in the middle of a commit.  However, there is nothing 
that SQLite can do to test for or remedy the situation.  SQLite
assumes that the operating system that it is running on works as
advertised.  If that is not quite the case, well then hopefully you
will not lose power too often.</p>

<p>SQLite assumes that when a file grows in length that the new
file space originally contains garbage and then later is filled in
with the data actually written.  In other words, SQLite assumes that
the file size is updated before the file content.  This is a 
pessimistic assumption and SQLite has to do some extra work to make
sure that it does not cause database corruption if power is lost
between the time when the file size is increased and when the
new content is written.  The xDeviceCharacteristics method of
the VFS might indicate that the filesystem will always write the
data before updating the file size.  (This is the 
SQLITE_IOCAP_SAFE_APPEND property for those readers who are looking
at the code.)  When the xDeviceCharacteristics method indicates
that files content is written before the file size is increased,
SQLite can forego some of its pedantic database protection steps
and thereby decrease the amount of disk I/O needed to perform a
commit.  The current implementation, however, makes no such assumptions
for the default VFSes for windows and unix.</p>

<p>SQLite assumes that a file deletion is atomic from the
point of view of a user process.  By this we mean that if SQLite
requests that a file be deleted and the power is lost during the
delete operation, once power is restored either the file will
exist completely with all if its original content unaltered, or
else the file will not be seen in the filesystem at all.  If
after power is restored the file is only partially deleted,
if some of its data has been altered or erased,
or the file has been truncated but not completely removed, then
database corruption will likely result.</p>

<p>SQLite assumes that the detection and/or correction of 
bit errors caused by cosmic rays, thermal noise, quantum
fluctuations, device driver bugs, or some other mechanism, is the 
responsibility of the underlying hardware and operating system.  
SQLite does not add any redundancy to the database file for
the purpose of detecting corruption or I/O errors.
SQLite assumes that the data it reads is exactly the same data 
that it previously wrote.</p>

<a name="section_3_0"></a>
<h2>3.0 Single File Commit</h2>

<p>We begin with an overview of the steps SQLite takes in order to
perform an atomic commit of a transaction against a single database
file.  The details of file formats used to guard against damage from
power failures and techniques for performing an atomic commit across
multiple databases are discussed in later sections.</p>

<h3>3.1 Initial State</h3>

<img src="images/ac/commit-0.gif" align="right" hspace="15">

<p>The state of the computer when a database connection is
first opened is shown conceptually by the diagram at the
right.
The area of the diagram on the extreme right (labeled "Disk") represents
information stored on the mass storage device.  Each rectangle is
a sector.  The blue color represents that the sectors contain
original data.
The middle area is the operating systems disk cache.  At the
onset of our example, the cache is cold and this is represented
by leaving the rectangles of the disk cache empty.
The left area of the diagram shows the content of memory for
the process that is using SQLite.  The database connection has
just been opened and no information has been read yet, so the
user space is empty.
</p>
<br clear="both">

<h3>3.2 Acquiring A Read Lock</h3>

<img src="images/ac/commit-1.gif" align="right" hspace="15">

<p>Before SQLite can write to a database, it must first read
the database to see what is there already.  Even if it is just
appending new data, SQLite still has to read in the database
schema from the <b>sqlite_master</b> table so that it can know
how to parse the INSERT statements and discover where in the
database file the new information should be stored.</p>

<p>The first step toward reading from the database file
is obtaining a shared lock on the database file.  A "shared"
lock allows two or more database connections to read from the
database file at the same time.  But a shared lock prevents
another database connection from writing to the database file
while we are reading it.  This is necessary because if another
database connection were writing to the database file at the
same time we are reading from the database file, we might read
some data before the change and other data after the change.
This would make it appears as if the change made by the other
process is not atomic.</p>

<p>Notice that the shared lock is on the operating system
disk cache, not on the disk itself.  File locks
really are just flags within the operating system kernel,
usually.  (The details depend on the specific OS layer
interface.)  Hence, the lock will instantly vanish if the
operating system crashes or if there is a power loss.  It
is usually also the case that the lock will vanish if the
process that created the lock exits.</p>

<br clear="both">

<a name="section_3_3"></a>
<h3>3.3 Reading Information Out Of The Database</h3>

<img src="images/ac/commit-2.gif" align="right" hspace="15">

<p>After the shared lock is acquired, we can begin reading
information from the database file.  In this scenario, we
are assuming a cold cache, so information must first be
read from mass storage into the operating system cache then
transferred from operating system cache into user space.
On subsequent reads, some or all of the information might
already be found in the operating system cache and so only
the transfer to user space would be required.</p>

<p>Usually only a subset of the pages in the database file
are read.  In this example we are showing three
pages out of eight being read.  In a typical application, a
database will have thousands of pages and a query will normally
only touch a small percentage of those pages.</p>

<br clear="both">

<h3>3.4 Obtaining A Reserved Lock</h3>

<img src="images/ac/commit-3.gif" align="right" hspace="15">

<p>Before making changes to the database, SQLite first
obtains a "reserved" lock on the database file.  A reserved
lock is similar to a shared lock in that both a reserved lock
and shared lock allow other processes to read from the database
file.  A single reserve lock can coexist with multiple shared
locks from other processes.  However, there can only be a
single reserved lock on the database file.  Hence only a
single process can be attempting to write to the database
at one time.</p>

<p>The idea behind a reserved locks is that it signals that
a process intends to modify the database file in the near
future but has not yet started to make the modifications.
And because the modifications have not yet started, other
processes can continue to read from the database.  However,
no other process should also begin trying to write to the
database.</p>

<br clear="both">
<a name="section_3_5"></a>
<h3>3.5 Creating A Rollback Journal File</h3>
<img src="images/ac/commit-4.gif" align="right" hspace="15">

<p>Prior to making any changes to the database file, SQLite first
creates a separate rollback journal file and writes into the 
rollback journal the original
content of the database pages that are to be altered.
The idea behind the rollback journal is that it contains
all information needed to restore the database back to 
its original state.</p>

<p>The rollback journal contains a small header (shown in green
in the diagram) that records the original size of the database
file.  So if a change causes the database file to grow, we
will still know the original size of the database.  The page
number is stored together with each database page that is 
written into the rollback journal.</p>

<p>When a new file is created, most desktop operating systems
(windows, linux, macOSX) will not actually write anything to
disk.  The new file is created in the operating systems disk
cache only.  The file is not created on mass storage until sometime
later, when the operating system has a spare moment.  This creates
the impression to users that I/O is happening much faster than
is possible when doing real disk I/O.  We illustrate this idea in
the diagram to the right by showing that the new rollback journal
appears in the operating system disk cache only and not on the
disk itself.</p>

<br clear="both">
<a name="section_3_6"></a>
<h3>3.6 Changing Database Pages In User Space</h3>
<img src="images/ac/commit-5.gif" align="right" hspace="15">

<p>After the original page content has been saved in the rollback
journal, the pages can be modified in user memory.  Each database
connection has its own private copy of user space, so the changes
that are made in user space are only visible to the database connection
that is making the changes.  Other database connections still see
the information in operating system disk cache buffers which have
not yet been changed.  And so even though one process is busy
modifying the database, other processes can continue to read their
own copies of the original database content.</p>

<br clear="both">
<a name="section_3_7"></a>
<h3>3.7 Flushing The Rollback Journal File To Mass Storage</h3>
<img src="images/ac/commit-6.gif" align="right" hspace="15">

<p>The next step is to flush the content of the rollback journal
file to nonvolatile storage.
As we will see later, 
this is a critical step in insuring that the database can survive
an unexpected power loss.
This step also takes a lot of time, since writing to nonvolatile
storage is normally a slow operation.</p>

<p>This step is usually more complicated than simply flushing
the rollback journal to the disk.  On most platforms two separate
flush (or fsync()) operations are required.  The first flush writes
out the base rollback journal content.  Then the header of the
rollback journal is modified to show the number of pages in the 
rollback journal.  Then the header is flushed to disk.  The details
on why we do this header modification and extra flush are provided
in a later section of this paper.</p>

<br clear="both">
<a name="section_3_8"></a>
<h3>3.8 Obtaining An Exclusive Lock</h3>
<img src="images/ac/commit-7.gif" align="right" hspace="15">

<p>Prior to making changes to the database file itself, we must
obtain an exclusive lock on the database file.  Obtaining an
exclusive lock is really a two-step process.  First SQLite obtains
a "pending" lock.  Then it escalates the pending lock to an
exclusive lock.</p>

<p>A pending lock allows other processes that already have a
shared lock to continue reading the database file.  But it
prevents new shared locks from being established.  The idea
behind a pending lock is to prevent writer starvation caused
by a large pool of readers.  There might be dozens, even hundreds,
of other processes trying to read the database file.  Each process
acquires a shared lock before it starts reading, reads what it
needs, then releases the shared lock.  If, however, there are
many different processes all reading from the same database, it
might happen that a new process always acquires its shared lock before
the previous process releases its shared lock.  And so there is
never an instant when there are no shared locks on the database
file and hence there is never an opportunity for the writer to
seize the exclusive lock.  A pending lock is designed to prevent
that cycle by allowing existing shared locks to proceed but
blocking new shared locks from being established.  Eventually
all shared locks will clear and the pending lock will then be
able to escalate into an exclusive lock.</p>

<br clear="both">
<a name="section_3_9"></a>
<h3>3.9 Writing Changes To The Database File</h3>
<img src="images/ac/commit-8.gif" align="right" hspace="15">

<p>Once an exclusive lock is held, we know that no other
processes are reading from the database file and it is
safe to write changes into the database file.  Usually
those changes only go as far as the operating systems disk
cache and do not make it all the way to mass storage.</p>

<br clear="both">
<a name="section_3_10"></a>
<h3>3.10 Flushing Changes To Mass Storage</h3>
<img src="images/ac/commit-9.gif" align="right" hspace="15">

<p>Another flush must occur to make sure that all the
database changes are written into nonvolatile storage.
This is a critical step to insure that the database will
survive a power loss without damage.  However, because
of the inherent slowness of writing to disk or flash memory, 
this step together with the rollback journal file flush in section
3.7 above takes up most the time required to complete a
transaction commit in SQLite.</p>

<br clear="both">
<a name="section_3_11"></a>
<h3>3.11 Deleting The Rollback Journal</h3>
<img src="images/ac/commit-A.gif" align="right" hspace="15">

<p>After the database changes are all safely on the mass
storage device, the rollback journal file is deleted.
This is the instant where the transaction commits.
If a power failure or system crash occurs prior to this
point, then recovery processes to be described later make
it appears as if no changes were ever made to the database
file.  If a power failure or system crash occurs after
the rollback journal is deleted, then it appears as if
all changes have been written to disk.  Thus, SQLite gives
the appearance of having made no changes to the database
file or having made the complete set of changes to the
database file depending on whether or not the rollback
journal file exists.</p>

<p>Deleting a file is not really an atomic operation, but
it appears to be from the point of view of a user process.
A process is always able to ask the operating system "does
this file exist?" and the process will get back a yes or no
answer.  After a power failure that occurs during a 
transaction commit, SQLite will ask the operating system
whether or not the rollback journal file exists.  If the
answer is "yes" then the transaction is incomplete and is
rolled back.  If the answer is "no" then it means the transaction
did commit.</p>

<p>The existence of a transaction depends on whether or
not the rollback journal file exists and the deletion
of a file appears to be an atomic operation from the point of
view of a user-space process.  Therefore, 
a transaction appears to be an atomic operation.</p>

<br clear="both">
<a name="section_3_12"></a>
<h3>3.12 Releasing The Lock</h3>
<img src="images/ac/commit-B.gif" align="right" hspace="15">

<p>The last step in the commit process is to release the
exclusive lock so that other processes can once again
start accessing the database file.</p>

<p>In the diagram at the right, we show that the information
that was held in user space is cleared when the lock is released.
This used to be literally true for older versions of SQLite.  But
more recent versions of SQLite keep the user space information
in memory in case it might be needed again at the start of the
next transaction.  It is cheaper to reuse information that is
already in local memory than to transfer the information back
from the operating system disk cache or to read it off of the
disk drive again.  Prior to reusing the information in user space,
we must first reacquire the shared lock and then we have to check
to make sure that no other process modified the database file while
we were not holding a lock.  There is a counter in the first page
of the database that is incremented every time the database file
is modified.  We can find out if another process has modified the
database by checking that counter.  If the database was modified,
then the user space cache must be cleared and reread.  But it is
commonly the case that no changes have been made and the user
space cache can be reused for a significant performance savings.</p>

<br clear="both">
<h2>4.0 Rollback</h2>

<p>An atomic commit is suppose to happen instantaneously.  But the processing
described above clearly takes a finite amount of time.
Suppose the power to the computer were cut
part way through the commit operation described above.  In order
to maintain the illusion that the changes were instantaneous, we
have to "rollback" any partial changes and restore the database to
the state it was in prior to the beginning of the transaction.</p>

<h3>4.1 When Something Goes Wrong...</h3>
<img src="images/ac/rollback-0.gif" align="right" hspace="15">

<p>Suppose the power loss occurred during step 3.10 above,
while the database changes were being written to disk.
After power is restored, the situation might be something
like what is shown to the right.  We were trying to change
three pages of the database file but only one page was
successfully written.  Another page was partially written
and a third page was not written at all.</p>

<p>The rollback journal is complete and intact on disk when
the power is restored.  This is a key point.  The reason for
the flush operation in step 3.7 is to make absolutely sure that
all of the rollback journal is safely on nonvolatile storage
prior to making any changes to the database file itself.</p>

<br clear="both">
<a name="section_4_2"></a>
<h3>4.2 Hot Rollback Journals</h3>
<img src="images/ac/rollback-1.gif" align="right" hspace="15">

<p>The first time that any SQLite process attempts to access
the database file, it obtains a shared lock as described in
section 3.2 above.  But then it notices that there is a 
rollback journal file present.  SQLite then checks to see if
the rollback journal is a "hot journal".   A hot journal is
a rollback journal that needs to be played back in order to
restore the database to a sane state.  A hot journal only
exists when an earlier process was in the middle of committing
a transaction when it crashed or lost power.</p>

<p>A rollback journal is a "hot" journal if all of the following
are true:</p>

<ul>
<li>The rollback journal exist.
<li>The rollback journal is not an empty file.
<li>There is no reserved lock on the main database file.
<li>The rollback journal header does not
contain the name of a master journal file (see
<a href="#section_5_5">section 5.5</a> below) or if does
contain the name of a master journal, then that master journal
file exists.
</ul>

<p>The presence of a hot journal is our indication
that a previous process was trying to commit a transaction but
it aborted for some reason prior to the completion of the
commit.  The presence of a hot journal is our indication that
the database file is in an inconsistent state and needs to
be repaired (by rollback) prior to being used.</p>

<br clear="both">
<h3>4.3 Obtaining An Exclusive Lock On The Database</h3>
<img src="images/ac/rollback-2.gif" align="right" hspace="15">

<p>The first step toward dealing with a hot journal is to
obtain an exclusive lock on the database file.  This prevents two
or more processes from trying to rollback the same hot journal
at the same time.</p>

<br clear="both">
<a name="section_4_4"></a>
<h3>4.4 Rolling Back Incomplete Changes</h3>
<img src="images/ac/rollback-3.gif" align="right" hspace="15">

<p>Once a process obtains an exclusive lock, it is permitted
to write to the database file.  It then proceeds to read the
original content of pages out of the rollback journal and write
that content back to were it came from in the database file.
Recall that the header of the rollback journal records the original
size of the database file prior to the start of the aborted
transaction.  SQLite uses this information to truncate the
database file back to its original size in cases where the
incomplete transaction caused the database to grow.  At the
end of this step, the database should be the same size and
contain the same information as it did before the start of
the aborted transaction.</p>

<br clear="both">
<h3>4.5 Deleting The Hot Journal</h3>
<img src="images/ac/rollback-4.gif" align="right" hspace="15">

<p>After all information in the rollback journal has been
played back into the database file (and flushed to disk in case
we encounter yet another power failure), the hot rollback journal
can be deleted.</p>

<br clear="both">
<h3>4.6 Continue As If The Uncompleted Writes Had Never Happened</h3>
<img src="images/ac/rollback-5.gif" align="right" hspace="15">

<p>The final recovery step is to reduce the exclusive lock back
to a shared lock.  Once this happens, the database is back in the
state that it would have been if the aborted transaction had never
started.  Since all of this recovery activity happens completely
automatically and transparently, it appears to the program using
SQLite as if the aborted transaction had never begun.</p>

<br clear="both">
<h2>5.0 Multi-file Commit</h2>

<p>SQLite allows a single 
<a href="/capi3ref.html#sqlite3">database connection</a> to talk to
two or more database files simultaneously through the use of
the <a href="/lang_attach.html">ATTACH DATABASE</a> command.
When multiple database files are modified within a single
transaction, all files are updated atomically.  
In other words, either all of the database files are updated or
else none of them are.
Achieving an atomic commit across multiple database files is
more complex that doing so for a single file.  This section
describes how SQLite works that bit of magic.</p>

<h3>5.1 Separate Rollback Journals For Each Database</h3>
<img src="images/ac/multi-0.gif" align="right" hspace="15">

<p>When multiple database files are involved in a transaction,
each database has its own rollback journal and each database
is locked separately.  The diagram at the right shows a scenario
where three different database files have been modified within
one transaction.  The situation at this step is analogous to 
the single-file transaction scenario at 
<a href="#section_3_6">step 3.6</a>.  Each database file has
a reserved lock.  For each database, the original content of pages 
that are being changed have been written into the rollback journal
for that database, but the content of the journals have not yet
been flushed to disk.  No changes have been made to the database
file itself yet, though presumably there are changes being held
in user memory.</p>

<p>For brevity, the diagrams in this section are simplified from
those that came before.  Blue color still signifies original content
and pink still signifies new content.  But the individual pages
in the rollback journal and the database file are not shown and
we are not making the distinction between information in the
operating system cache and information that is on disk.  All of
these factors still apply in a multi-file commit scenario.  They
just take up a lot of space in the diagrams and they do not add
any new information, so they are omitted here.</p>

<br clear="both">
<h3>5.2 The Master Journal File</h3>
<img src="images/ac/multi-1.gif" align="right" hspace="15">

<p>The next step in a multi-file commit is the creation of a
"master journal" file.  The name of the master journal file is
the same name as the original database filename (the database
that was opened using the 
<a href="capi3ref.html#sqlite3_open">sqlite3_open()</a> interface,
not one of the <a href="lang_attach.html">ATTACHed</a> auxiliary
databases) with the text "<b>-mj</b><i>HHHHHHHH</i>" appended where
<i>HHHHHHHH</i> is a random 32-bit hexadecimal number.  The
random <i>HHHHHHHH</i> suffix changes for every new master journal.</p>

<p><i>(Nota bene: The formula for computing the master journal filename
given in the previous paragraph corresponds to the implementation as
of SQLite version 3.5.0.  But this formula is not part of the SQLite
specification and is subject to change in future releases.)</i></p>

<p>Unlike the rollback journals, the master journal does not contain
any original database page content.  Instead, the master journal contains
the full pathnames for rollback journals for every database that is
participating in the transaction.</p>

<p>After the master journal is constructed, its content is flushed
to disk before any further actions are taken.  On unix, the directory
that contains the master journal is also synced in order to make sure
the master journal file will appear in the directory following a
power failure.</p>

<br clear="both">
<h3>5.3 Updating Rollback Journal Headers</h3>
<img src="images/ac/multi-2.gif" align="right" hspace="15">

<p>The next step is to record the full pathname of the master journal file
in the header of every rollback journal.  Space to hold the master
journal filename was reserved at the beginning of each rollback journal
as the rollback journals were created.</p>

<p>The content of each rollback journal is flushed to disk both before
and after the master journal filename is written into the rollback
journal header.  It is important to do both of these flushes.  Fortunately,
the second flush is usually inexpensive since typically only a single
page of the journal file (the first page) has changed.</p>

<p>This step is analogous to 
<a href="#section_3_7">step 3.7</a> in the single-file commit
scenario described above.</p>

<br clear="both">
<h3>5.4 Updating The Database Files</h3>
<img src="images/ac/multi-3.gif" align="right" hspace="15">

<p>Once all rollback journal files have been flushed to disk, it
is safe to begin updating database files.  We have to obtain an
exclusive lock on all database files before writing the changes.
After all the changes are written, it is important to flush the
changes to disk so that they will be preserved in the event of
a power failure or operating system crash.</p>

<p>This step corresponds to steps
<a href="#section_3_8">3.8</a>,
<a href="#section_3_9">3.9</a>, and
<a href="#section_3_10">3.10</a> in the single-file commit
scenario described previously.</p>


<br clear="both">
<a name="section_5_5"></a>
<h3>5.5 Delete The Master Journal File</h3>
<img src="images/ac/multi-4.gif" align="right" hspace="15">

<p>The next step is to delete the master journal file.
This is the point where the multi-file transaction commits.
This step corresponds to 
<a href="#section_3_11">step 3.11</a> in the single-file
commit scenario where the rollback journal is deleted.</p>

<p>If a power failure or operating system crash occurs at this
point, the transaction will not rollback when the system reboots
even though there are rollback journals present.  The
difference is the master journal pathname in the header of the
rollback journal.  Upon restart, SQLite only considers a journal
to be hot and will only playback the journal if there is no
master journal filename in the header (which is the case for
a single-file commit) or if the master journal file still
exists on disk.</p>

<br clear="both">
<h3>5.6 Clean Up The Rollback Journals</h3>
<img src="images/ac/multi-5.gif" align="right" hspace="15">

<p>The final step in a multi-file commit is to delete the
individual rollback journals and drop the exclusive locks on
the database files so that other processes can see the changes.
This corresponds to 
<a href="#section_3_12">step 3.12</a> in the single-file
commit sequence.</p>

<p>The transaction has already committed at this point so timing
is not critical in the deletion of the rollback journals.
The current implementation deletes a single rollback journal
then unlocks the corresponding database file before proceeding
to the next rollback journal.  But in the future we might change
this so that all rollback journals are deleted before any database
files are unlocked.  As long as the rollback journal is deleted before
its corresponding database file is unlocked it does not matter in what
order the rollback journals are deleted or the database files are
unlocked.</p>

<h2>6.0 Additional Details Of The Commit Process</h2>

<p><a href="#section_3_0">Section 3.0</a> above provides an overview of
how atomic commit works in SQLite.  But it glosses over a number of
important details.  The following subsections will attempt to fill
in the gaps.</p>

<h3>6.1 Always Journal Complete Sectors</h3>

<p>When the original content of a database page is written into
the rollback journal (as shown in <a href="#section_3_5">section 3.5</a>),
SQLite always writes a complete sectors worth of data, even if the
page size of the database is smaller than the sector size.  
Historically, the sector size in SQLite has been hard coded to 512
bytes and since the minimum page size is also 512 bytes, this has never
been an issue.  But beginning with SQLite version 3.3.14, it is possible
for SQLite to use mass storage devices with a sector size larger than 512
bytes.  So, beginning with version 3.3.14, whenever any page within a
sector is written into the journal file, all pages in that same sector
are stored with it.</p>

<p>It is important to store all pages of a sector in the rollback
journal in order to prevent database corruption following a power
loss while writing the sector.  Suppose that pages 1, 2, 3, and 4 are
all stored in sector 1 and that page 2 is modified.  In order to write
the changes to page 2, the underlying hardware must also rewrite the
content of pages 1, 3, and 4 since the hardware must write the complete
sector.  If this write operation is interrupted by a power outage,
one or more of the pages 1, 3, or 4 might be left with incorrect data.
Hence, to avoid lasting corruption to the database, the original content
of all of those pages must be contained in the rollback journal.</p>

<h3>6.2 Dealing With Garbage Written Into Journal Files</h3>

<p>When data is appended to the end of the rollback journal,
SQLite normally makes the pessimistic assumption that the file
is first extended with invalid "garbage" data and that afterwards
the correct data replaces the garbage.  In other words, SQLite assumes
that the file size is increased first and then afterwards the content
is written into the file.  If a power failure occurs after the file
size has been increased but before the file content has been written,
the rollback journal can be left containing garbage data.  If after
power is restored, another SQLite process sees the rollback journal
containing the garbage data and tries to roll it back into the original
database file, it might copy some of the garbage into the database file
and thus corrupt the database file.</p>

<p>SQLite uses two defenses against this problem.  In the first place,
SQLite records the number of pages in the rollback journal in the header
of the rollback journal.  This number is initially zero.  So during an
attempt to rollback an incomplete (and possibly corrupt) rollback
journal, the process doing the rollback will see that the journal
contains zero pages will thus make no changes to the database.  Prior
to a commit, the rollback journal is flushed to disk to ensure that
all content has been synched to disk and there is no "garbage" left
in the file, and only then is the page count in the header increased
true number of pages in the rollback journal.  The rollback journal
header is always kept in a separate sector from any page data so that
it can be overwritten and flushed without risking damage to a data
page if a power outage occurs.  Notice that the rollback journal
is flushed to disk twice: once to write the page content and a second
time to write the page count in the header.</p>

<p>The previous paragraph describes what happens when the
synchronous pragma setting is "full".</p>

<blockquote>
PRAGMA synchronous=FULL;
</blockquote>

<p>The default synchronous setting is full so the above is what usually
happens.  However, if the synchronous setting is lowered to "normal",
SQLite only flushes the rollback journal once, after the page count has
been written.
This carries a risk of corruption because it might happen that the 
modified (non-zero) page count reaches the disk surface before all
of the data does.  The data will have been written first, but SQLite
assumes that the underlying filesystem can reorder write requests and
that the page count can be burned into oxide first even though its
write request occurred last.  So as a second line of defense, SQLite
also uses a 32-bit checksum on every page of data in the rollback
journal.  This checksum is evaluated for each page during rollback
while rolling back a journal as described in 
<a href="#section_4_4">section 4.4</a>.  If an incorrect checksum
is seen, the rollback is abandoned.  Note that the checksum does
not guarantee that the page data is correct since there is a small
but finite probability that the checksum might be right even if the data is
corrupt.  But the checksum does at least make such an error unlikely.
</p>

<p>Note that the checksums in the rollback journal are not necessary
if the synchronous setting is FULL.  We only depend on the checksums
when synchronous is lowered to NORMAL.  Nevertheless, the checksums
never hurt and so they are included in the rollback journal regardless
of the synchronous setting.</p>

<h3>6.3 Cache Spill Prior To Commit</h3>

<p>The commit process shown in <a href="#section_3_0">section 3.0</a>
assumes that all database changes fit in memory until it is time to
commit.  This is the common case.  But sometimes a larger change will
overflow the user-space cache prior to transaction commit.  In those
cases, the cache must spill to the database before the transaction
is complete.</p>

<p>At the beginning of a cache spill, the status of the database
connection is as shown in <a href="#section_3_6">step 3.6</a>.
Original page content has been saved in the rollback journal and
modifications of the pages exist in user memory.  To spill the cache,
SQLite executes steps <a href="#section_3_7">3.7</a> through
<a href="#section_3_9">3.9</a>.  In other words, the rollback journal
is flushed to disk, an exclusive lock is acquired, and changes are
written into the database.  But the remaining steps are deferred
until the transaction really commits.  A new journal header is
appended to the end of the rollback journal (in its own sector)
and the exclusive database lock is retained, but otherwise processing
returns to <a href="#section_3_6">step 3.6</a>.  When the transaction
commits, or if another cache spill occurs, steps
<a href="#section_3_7">3.7</a> and <a href="#section_3_9">3.9</a> are
repeated.  (Step <a href="#section_3_8">3.8</a> is omitted on second
and subsequent passes since an exclusive database lock is already held
due to the first pass.)</p>

<p>A cache spill causes the lock on the database file to
escalate from reserved to exclusive.  This reduces concurrency.
A cache spill also causes extra disk flush or fsync operations to
occur and these operations are slow, hence a cache spill can
seriously reduce performance.
For these reasons a cache spill is avoided whenever possible.</p>

<h2>7.0 Optimizations</h2>

<p>Profiling indicates that for most systems and in most circumstances
SQLite spends most of its time doing disk I/O.  It follows then that
anything we can do to reduce the amount of disk I/O will likely have a
large positive impact on the performance of SQLite.  This section
describes some of the techniques used by SQLite to try to reduce the
amount of disk I/O to a minimum while still preserving atomic commit.</p>

<h3>7.1 Cache Retained Between Transactions</h3>

<p><a href="#section_3_12">Step 3.12</a> of the commit process shows
that once the shared lock has been released, all user-space cache
images of database content must be discarded.  This is done because
without a shared lock, other processes are free to modify the database
file content and so any user-space image of that content might become
obsolete.  Consequently, each new transaction would begin by rereading
data which had previously been read.  This is not as bad as it sounds
at first since the data being read is still likely in the operating
systems file cache.  So the "read" is really just a copy of data
from kernel space into user space.  But even so, it still takes time.</p>

<p>Beginning with SQLite version 3.3.14 a mechanism has been added
to try to reduce the needless rereading of data.  In newer versions
of SQLite, the data in the user-space pager cache is retained when
the lock on the database file is released.  Later, after the
shared lock is acquired at the beginning of the next transaction,
SQLite checks to see if any other process has modified the database
file.  If the database has been changed in any way since the lock
was last released, the user-space cache is erased at that point.
But commonly the database file is unchanged and the user-space cache
can be retained, and some unnecessary read operations can be avoided.</p>

<p>In order to determine whether or not the database file has changed,
SQLite uses a counter in the database header (in bytes 24 through 27)
which is incremented during every change operation.  SQLite saves a copy
of this counter prior to releasing its database lock.  Then after
acquiring the next database lock it compares the saved counter value
against the current counter value and erases the cache if the values
are different, or reuses the cache if they are the same.</p>

<h3>7.2 Exclusive Access Mode</h3>

<p>SQLite version 3.3.14 adds the concept of "Exclusive Access Mode".
In exclusive access mode, SQLite retains the exclusive
database lock at the conclusion of each transaction.  This prevents
other processes for accessing the database, but in many deployments
only a single process is using a database so often this is not a
serious problem.  The advantage of exclusive access mode is that
disk I/O can be reduced in three ways:</p>

<ol>
<li><p>It is not necessary to increment the change counter in the
database header for transactions after the first transaction.  This
will often save a write of page one to both the rollback
journal and the main database file.</p></li>

<li><p>No other processes can change the database so there is never
a need to check the change counter and clear the user-space cache
at the beginning of a transaction.</p></li>

<li><p>The rollback journal file can be truncated rather than deleted
at the end of each transaction.  On many operating systems, truncating
a file is much faster than deleting it.</p></li>
</ol>

<p>The third optimization, truncating rather than deleting the rollback
journal file, does not depend on holding an exclusive lock at all times.
We could, in theory, do that optimization at all times, not just when
exclusive access mode is enabled, and we may well choose to do so in 
some future release of SQLite.  But for now (version 3.5.0) the
rollback journal truncation optimization is only enabled in conjunction
with exclusive access mode.</p>

<h3>7.3 Do Not Journal Freelist Pages</h3>

<p>When information is deleted from an SQLite database, the pages used
to old the deleted information are added to a "freelist".  Subsequent
inserts will draw pages off of this freelist rather than expanding the
database file.</p>

<p>Some freelist pages contain critical data; specifically the locations
of other freelist pages.  But most freelist pages contain nothing useful.
These latter freelist pages are called "leaf" pages.  We are free to
modify the content of a leaf freelist page in the database without
changing the meaning of the database in any way.</p>

<p>Because the content of leaf freelist pages is unimportant, SQLite
avoids storing leaf freelist page content in the rollback journal
in <a href="#section_3_5">step 3.5</a> of the commit process.
If a leaf freelist page is changed and that change does not get rolled back
during a transaction recovery, the database is not harmed by the omission.
Similarly, the content of a new freelist page is never written back
into the database at <a href="#section_3_9">step 3.9</a> nor
read from the database at <a href="#section_3_3">step 3.3</a>.
These optimizations can greatly reduce the amount of I/O that occurs
when making changes to a database file that contains free space.</p>

<h3>7.4 Single Page Updates And Atomic Sector Writes</h3>

<p>Beginning in SQLite version 3.5.0, the new Virtual File System (VFS)
interface contains a method named xDeviceCharacteristics which reports
on special properties that the underlying mass storage device
might have.  Among the special properties that
xDeviceCharacteristics might report is the ability of to do an
atomic sector write.</p>

<p>Recall that by default SQLite assumes that sector writes are
linear but not atomic.  A linear write starts at one end of the
sector and changes information byte by byte until it gets to the
other end of the sector.  If a power loss occurs in the middle of
a linear write then part of the sector might be modified while the
other end is unchanged.  In an atomic sector write, either the entire
sector is overwritten or else nothing in the sector is changed.</p>

<p>We believe that most modern disk drives implement atomic sector
writes.  When power is lost, the drive uses energy stored in capacitors
and/or the angular momentum of the disk platter to provide power to 
complete any operation in progress.  Nevertheless, there are so many
layers in between the write system call and the on-board disk drive
electronics that we take the safe approach in both Unix and w32 VFS
implementations and assume that sector writes are not atomic.  On the
other hand, device
manufactures with more control over their filesystems might want
to consider enabling the atomic write property of xDeviceCharacteristics
if their hardware really does do atomic writes.</p>

<p>When sector writes are atomic and the page size of a database is
the same as a sector size, and when there is a database change that
only touches a single database page, then SQLite skips the whole
journaling and syncing process and simply writes the modified page
directly into the database file.  The change counter in the first
page of the database file is modified separately since no harm is
done if power is lost before the change counter can be updated.</p>

<h3>7.5 Filesystems With Safe Append Semantics</h3>

<p>Another optimization introduced in SQLite version 3.5.0 makes
use of "safe append" behavior of the underlying disk.
Recall that SQLite assumes that when data is appended to a file
(specifically to the rollback journal) that the size of the file
is increased first and that the content is written second.  So
if power is lost after the file size is increased but before the
content is written, the file is left containing invalid "garbage"
data.  The xDeviceCharacteristics method of the VFS might, however,
indicate that the filesystem implements "safe append" semantics.
This means that the content is written before the file size is
increased so that it is impossible for garbage to be introduced
into the rollback journal by a power loss or system crash.</p>

<p>When safe append semantics are indicated for a filesystem,
SQLite always stores the special value of -1 for the page count
in the header of the rollback journal.  The -1 page count value
tells any process attempting to rollback the journal that the
number of pages in the journal should be computed from the journal
size.  This -1 value is never changed.  So that when a commit
occurs, we save a single flush operation and a sector write of
the first page of the journal file.  Furthermore, when a cache
spill occurs we no longer need to append a new journal header
to the end of the journal; we can simply continue appending
new pages to the end of the existing journal.</p>

<h2>8.0 Testing Atomic Commit Behavior</h2>

<p>The developers of SQLite are confident that it is robust
in the face of power failures and system crashes because the
automatic test procedures do extensive checks on
the ability of SQLite to recover from simulated power loss.
We call these the "crash tests".</p>

<p>Crash tests in SQLite use a modified VFS that can simulate
the kinds of filesystem damage that occur during a power
loss or operating system crash.  The crash-test VFS can simulate
incomplete sector writes, pages filled with garbage data because
a write has not completed, and out of order writes, all occurring
at varying points during a test scenario.  Crash tests execute
transactions over and over, varying the time at which a simulated
power loss occurs and the properties of the damage inflicted.
Each test then reopens the database after the simulated crash and
verifies that the transaction either occurred completely
or not at all and that the database is in a completely
consistent state.</p>

<p>The crash tests in SQLite have discovered a number of very
subtle bugs (now fixed) in the recovery mechanism.  Some of 
these bugs were very obscure and unlikely to have been found
using only code inspection and analysis techniques.  From this
experience, the developers of SQLite feel confident that any other
database system that does not use a similar crash test system
likely contains undetected bugs that will lead to database
corruption following a system crash or power failure.</p>

<h2>9.0 Things That Can Go Wrong</h2>

<p>The atomic commit mechanism in SQLite has proven to be robust,
but it can be circumvented by a sufficiently creative
adversary or a sufficiently broken operating system implementation.
This section describes a few of the ways in which an SQLite database
might be corrupted by a power failure or system crash.</p>

<h3>9.1 Broken Locking Implementations</h3>

<p>SQLite uses filesystem locks to make sure that only one
process and database connection is trying to modify the database
at a time.  The filesystem locking mechanism is implemented
in the VFS layer and is different for every operating system.
SQLite depends on this implementation being correct.  If something
goes wrong and two or more processes are able to write the same
database file at the same time, severe damage can result.</p>

<p>We have received reports of implementations of both
windows network filesystems and NFS in which locking was
subtly broken.  We can not verify these reports, but as
locking is difficult to get right on a network filesystem
we have no reason to doubt them.  You are advised to 
avoid using SQLite on a network filesystem in the first place,
since performance will be slow.  But if you must use a 
network filesystem to store SQLite database files, consider
using a secondary locking mechanism to prevent simultaneous
writes to the same database even if the native filesystem
locking mechanism malfunctions.</p>

<p>The versions of SQLite that come preinstalled on Apple
Mac OS X computers contain a version of SQLite that has been
extended to use alternative locking strategies that work on
all network filesystems that Apple supports.  These extensions
used by Apple work great as long as all processes are accessing
the database file in the same way.  Unfortunately, the locking
mechanisms do not exclude one another, so if one process is
accessing a file using (for example) AFP locking and another
process (perhaps on a different machine) is using dot-file locks,
the two processes might collide because AFP locks do not exclude
dot-file locks or vice versa.</p>

<h3>9.2 Incomplete Disk Flushes</h3>

<p>SQLite uses the fsync() system call on unix and the FlushFileBuffers()
system call on w32 in order to sync the file system buffers onto disk
oxide as shown in  <a href="#section_3_7">step 3.7</a> and
<a href="#section_3.10">step 3.10</a>.  Unfortunately, we have received
reports that neither of these interfaces works as advertised on many
systems.  We hear that FlushFileBuffers() can be completely disabled
using registry settings on some windows versions.  Some historical
versions of Linux contain versions of fsync() which are no-ops on
some filesystems, we are told.  Even on systems where 
FlushFileBuffers() and fsync() are said to be working, often
the IDE disk control lies and says that data has reached oxide
while it is still held only in the volatile control cache.</p>

<p>On the Mac, you can set this pragma:</p>

<blockquote>
PRAGMA fullfsync=ON;
</blockquote>

<p>Setting fullfsync on a Mac will guarantee that data really does
get pushed out to the disk platter on a flush.  But the implementation
of fullfsync involves resetting the disk controller.  And so not only
is it profoundly slow, it also slows down other unrelated disk I/O.
So its use is not recommended.</p>

<h3>9.3 Partial File Deletions</h3>

<p>SQLite assumes that file deletion is an atomic operation from the
point of view of a user process.  If power fails in the middle of
a file deletion, then after power is restored SQLite expects to see
either the entire file with all of its original data intact, or it
expects not to find the file at all.  Transactions may not be atomic
on systems that do not work this way.</p>

<h3>9.4 Garbage Written Into Files</h3>

<p>SQLite database files are ordinary disk files that can be
opened and written by ordinary user processes.  A rogue process
can open an SQLite database and fill it with corrupt data.  
Corrupt data might also be introduced into an SQLite database
by bugs in the operating system or disk controller; especially
bugs triggered by a power failure.  There is nothing SQLite can
do to defend against these kinds of problems.</p>

<h3>9.5 Deleting Or Renaming A Hot Journal</h3>

<p>If a crash or power loss does occur and a hot journal is left on
the disk, it is essential that the original database file and the hot
journal remain on disk with their original names until the database
file is opened by another SQLite process and rolled back.  
During recovery at <a href="section_4_2">step 4.2</a> SQLite locates
the hot journal by looking for a file in the same directory as the
database being opened and whose name is derived from the name of the
file being opened.  If either the original database file or the
hot journal have been moved or renamed, then the hot journal will
not be seen and the database will not be rolled back.</p>

<p>We suspect that a common failure mode for SQLite recovery happens
like this:  A power failure occurs.  After power is restored, a well-meaning
user or system administrator begins looking around on the disk for
damage.  They see their database file named "important.data".  This file
is perhaps familiar to them.  But after the crash, there is also a
hot journal named "important.data-journal".  The user then deletes
the hot journal, thinking that they are helping to cleanup the system.
We know of no way to prevent this other than user education.</p>

<p>If there are multiple (hard or symbolic) links to a database file,
the journal will be created using the name of the link through which
the file was opened.  If a crash occurs and the database is opened again
using a different link, the hot journal will not be located and no
rollback will occur.</p>

<p>Sometimes a power failure will cause a filesystem to be corrupted
such that recently changed filenames are forgotten and the file is
moved into a "/lost+found" directory.  When that happens, the hot
journal will not be found and recovery will not occur.
SQLite tries to prevent this
by opening and syncing the directory containing the rollback journal
at the same time it syncs the journal file itself.  However, the
movement of files into /lost+found can be caused by unrelated processes
creating unrelated files in the same directory as the main database file.
And since this is out from under the control of SQLite, there is nothing
that SQLite can do to prevent it.  If you are running on a system that
is vulnerable to this kind of filesystem namespace corruption (most
modern journalling filesystems are immune, we believe) then you might
want to consider putting each SQLite database file in its own private
subdirectory.</p>

<h2>10.0 Future Directions And Conclusion</h2>

<p>Every now and then someone discovers a new failure mode for
the atomic commit mechanism in SQLite and the developers have to
put in a patch.  This is happening less and less and the
failure modes are becoming more and more obscure.  But it would
still be foolish to suppose that the atomic commit logic of
SQLite is entirely bug-free.  The developers are committed to fixing
these bugs as quickly as they might be found.</p>

<p>The developers are also on the lookout for new ways to
optimize the commit mechanism.  The current VFS implementations
for Linux, MacOSX, and w32 make pessimistic assumptions about
the behavior of those systems.  After consultation with experts
on how these systems work, we might be able to relax some of the
assumptions on these systems and allow them to run faster.  In
particular, we suspect that most modern filesystems exhibit the
safe append property and that many of them might support atomic
sector writes.  But until this is known for certain, SQLite will
take the conservative approach and assume the worst.</p>