Documentation Source Text

Check-in [a7c7d3a520]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Fleshing out the WAL documentation.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: a7c7d3a520e4af96d00f417c6b4737eb8a1736de
User & Date: drh 2010-05-07 16:18:29
Context
2010-05-08
14:05
Fix typos in the testing document. Change "insure" to "ensure" throughout the website. Ticket [6afbaac77a52917fc]. check-in: 8c1658c1e2 user: drh tags: trunk
2010-05-07
16:18
Fleshing out the WAL documentation. check-in: a7c7d3a520 user: drh tags: trunk
02:46
Add the new Kreibich book. Add preliminary documentation on WAL pragmas. Refactor the pragma.html document. check-in: f1676af6d8 user: drh tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/changes.in.

39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
    }
    hd_close_aux
    hd_enable_main 1
  }
}

chng {2010 July 1 (3.7.0)} {
<li> Added support for write-ahead logging
<li> Query planner enhancements
}

chng {2010 March 30 (3.6.23.1)} {
<li> Fix a bug in the offsets() function of [FTS3]
<li> Fix a missing "sync" that when omitted could lead to database
     corruption if a power failure or OS crash occurred just as a







|







39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
    }
    hd_close_aux
    hd_enable_main 1
  }
}

chng {2010 July 1 (3.7.0)} {
<li> Added support for [WAL | write-ahead logging].
<li> Query planner enhancements
}

chng {2010 March 30 (3.6.23.1)} {
<li> Fix a bug in the offsets() function of [FTS3]
<li> Fix a missing "sync" that when omitted could lead to database
     corruption if a power failure or OS crash occurred just as a

Changes to pages/fileformat2.in.

102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
<tr><th>Offset<th>Size<th>Description
<tr><td valign=top align=center>0<td valign=top align=center>16<td align=left>
The header string: "SQLite format 3\000"
<tr><td valign=top align=center>16<td valign=top align=center>2<td align=left>
The database page size in bytes.  Must be a power of two between 512
and 32768 inclusive.
<tr><td valign=top align=center>18<td valign=top align=center>1<td align=left>
File format write version.  Must be 1.
<tr><td valign=top align=center>19<td valign=top align=center>1<td align=left>
File format read version.  Must be 1.
<tr><td valign=top align=center>20<td valign=top align=center>1<td align=left>
Bytes of unused "reserved" space at the end of each page.  Usually 0.
<tr><td valign=top align=center>21<td valign=top align=center>1<td align=left>
Maximum embedded payload fraction.  Must be 64.
<tr><td valign=top align=center>22<td valign=top align=center>1<td align=left>
Minimum embedded payload fraction.  Must be 32.
<tr><td valign=top align=center>23<td valign=top align=center>1<td align=left>







|

|







102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
<tr><th>Offset<th>Size<th>Description
<tr><td valign=top align=center>0<td valign=top align=center>16<td align=left>
The header string: "SQLite format 3\000"
<tr><td valign=top align=center>16<td valign=top align=center>2<td align=left>
The database page size in bytes.  Must be a power of two between 512
and 32768 inclusive.
<tr><td valign=top align=center>18<td valign=top align=center>1<td align=left>
File format write version.  1 for legacy; 2 for [WAL].
<tr><td valign=top align=center>19<td valign=top align=center>1<td align=left>
File format read version.  1 for legacy; 2 for [WAL].
<tr><td valign=top align=center>20<td valign=top align=center>1<td align=left>
Bytes of unused "reserved" space at the end of each page.  Usually 0.
<tr><td valign=top align=center>21<td valign=top align=center>1<td align=left>
Maximum embedded payload fraction.  Must be 64.
<tr><td valign=top align=center>22<td valign=top align=center>1<td align=left>
Minimum embedded payload fraction.  Must be 32.
<tr><td valign=top align=center>23<td valign=top align=center>1<td align=left>

Changes to pages/wal.in.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

16
17
18
19
20

21
22
23
24
25
26
27
28
29
30


31
32
33
34
35
36
37
38
39



40


































































































































































































<title>Write-Ahead Logging</title>
<tcl>hd_keywords {WAL} {write-ahead log}</tcl>

<h1 align="center">Write-Ahead Logging</h1>

<p>The default method in which SQLite implements
[atomic commit | atomic commit and rollback] is through the use of
a [rollback journal].
Beginning with [version 3.7.0], a new write-ahead log option
(hereafter referred to as "WAL")
for implementing atomic commit and rollback is available as an option
on some platforms.</p>

<p>There are advantages and disadvantages to using WAL.  We begin with
a quick summary of advantages:</p>


<ol>
<li>WAL is significately faster in the common case of many small transactions.
<li>WAL provides more concurrency as readers do not block and 
    are not blocked by a writer.

<li>The sequence of disk I/O operations tends to  be more sequential using WAL.
<li>WAL uses many fewer fsync() operations and is thus less vulnerable to
    problems on systems where the fsync() system call is broken.
</ol>

<p>But there are also disadvantages:</p>

<ol>
<li>WAL is only available when the [sqlite3_vfs | VFS] and the underlying
    operating system support a set of shared-memory primitives.


<li>All processes using a database must be on the same host computer;
    WAL does not work over a network filesystem.
<li>Transactions that involve changes against multiple [ATTACH | ATTACHed]
    databases are atomic for each individual database, but are not
    atomic across all databases as a set.
<li>With WAL, it is not possible to change database page sizes on [VACUUM]
    or on recovery using the [backup API].
<li>WAL can be slower than the traditional rollback-journal approach
    for large transactions (such as when using the [VACUUM] command).



</ol>







































































































































































































|
|
<
|
|
<
<

|
<
>



|
|
>
|







|
|
>
>





|
|


>
>
>

>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
1
2
3
4
5
6
7

8
9


10
11

12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
<title>Write-Ahead Logging</title>
<tcl>hd_keywords {WAL} {write-ahead log}</tcl>

<h1 align="center">Write-Ahead Logging</h1>

<p>The default method by which SQLite implements
[atomic commit | atomic commit and rollback] is a [rollback journal].

Beginning with [version 3.7.0], a new "Write-Ahead Log" option
(hereafter referred to as "WAL") is available.</p>



<p>There are advantages and disadvantages to using WAL instead of

a rollback journal.  Advantages include:</p>

<ol>
<li>WAL is significately faster in the common case of many small transactions.
<li>WAL provides more concurrency as readers do not block writers and 
    are not blocked by a writer.  Reading and writing can proceed 
    concurrently.
<li>The sequence of disk I/O operations tends to be more sequential using WAL.
<li>WAL uses many fewer fsync() operations and is thus less vulnerable to
    problems on systems where the fsync() system call is broken.
</ol>

<p>But there are also disadvantages:</p>

<ol>
<li>WAL is only available when the [sqlite3_vfs | VFS] 
    supports shared-memory primitives.  The built-in unix and windows VFSes
    support this but third-party extension VFSes for custom operating
    systems might not.
<li>All processes using a database must be on the same host computer;
    WAL does not work over a network filesystem.
<li>Transactions that involve changes against multiple [ATTACH | ATTACHed]
    databases are atomic for each individual database, but are not
    atomic across all databases as a set.
<li>With WAL, it is not possible to change the database page sizes 
    using [VACUUM] or when restoring from a backup using the [backup API].
<li>WAL can be slower than the traditional rollback-journal approach
    for large transactions (such as when using the [VACUUM] command).
<li>There is the extra opertion of [checkpointing] which, though automatic
    by default, is still something that application developers need to
    be mindful of.
</ol>

<h2>How WAL Works</h2>

<p>The traditional rollback journal works by writing a copy of the
original unchanged database content into a separate rollback journal file
and then writing changes directly into the database file.  In the
event of a crash or [ROLLBACK], the original content contained in the
rollback journal is played back into the database file to
revert the database file to its original state.  The [COMMIT] occurs
when the rollback journal is deleted.</p>

<p>The WAL approach inverts this.  The original content is preserved
in the database file and the changes are appended into a separate
WAL file.  A [COMMIT] occurs when a special record indicating a commit
is appended to the WAL.  Thus a COMMIT can happen without ever writing
to the original database, which allows reader to continue operating
from the original, unaltered database while changes are simultaneously
commited into the WAL.  Multiple transactions can be appended to the
end of a single WAL file.</p>

<tcl>hd_fragment ckpt checkpoint checkpointed checkpointing</tcl>
<h3>Checkpointing</h3>

<p>Of course, one wants to eventually transfer all the transactions that
are appended in the WAL file back into the original database.  Moving
the WAL file transactions back into the database is called a
"<i>checkpoint</i>".<p>

<p>Another way to think about the difference between rollback and 
write-ahead log is that in the rollback-journal
approach, there are two primitive operations:  reading from the database
and writing changes into the database but in the write-ahead log approach,
there are now three primitive operations:  reading, writing, and
checkpointing.</p>

<tcl>hd_fragment concurrency {WAL concurrency}</tcl>
<h3>Concurrency</h3>

<p>When a read operation begins on a WAL-mode database, it first
remembers the location of the last valid commit record in the WAL.
Call this point the "end mark".  Because the WAL can be growing and
adding new commit records while various readers connect to the database,
each reader can potentially have its own end mark.  But for any
particular reader, the end mark is unchanged for the duration of the
transaction, thus insuring that a single read transaction only sees
the database content as it existed at a single point in time.</p>

<p>When a reader needs a page of content, it first checks the WAL to
see if that page appears there, and if so it pulls in the last copy
of the page that occurs in the WAL prior to the reader's end mark.
If no copy of the page exists in the WAL prior to the reader's end mark,
then the page is read from the original database file.  Readers can
exist in separate processes, so to avoid forcing every reader to scan
the entire WAL looking for pages it needs (since WAL file can grow to
multiple megabytes, depending on how often checkpoints are run), a
data structure called the wal-index is maintained in shared memory
which helps readers locate pages in the WAL quickly and with a minimum
of I/O.  The wal-index greatly improves the performance of readers,
but the use of shared memory means that all readers must exist on the
same machine.  This is why the write-ahead log implementation will not
work on a network filesystem.</p>

<p>Writers merely append new content to the end of the WAL file.
Because writers do nothing that would interfere with the actions of
readers, writers and readers can run at the same time.  However,
since there is only one WAL file, there can only be one writer at
a time.</p>

<p>A checkpoint operation takes all the content from the WAL file
and transfers it back into the original database file.
As long as all readers are looking at the entire WAL file (in other
words, as long as no reader is using an end mark that is earlier
than the actual end of the WAL) then it will be OK for the checkpoint
to run in parallel with the readers.  The database file changes
that the checkpointer makes will not interfere with the readers, because
the database pages changed all also exist in the WAL and so
the readers will never access them.  Thus, a checkpoint will never
block a reader and readers will not block a checkpoint as long as all
readers are looking at the entire WAL.  But, older readers that only
see a prefix of the WAL will block a checkpoint.  And writers will
block a checkpoint.  And a checkpoint will block a writer.</p>

<tcl>hd_fragment fast</tcl>
<h3>Performance Considerations</h3>

<p>Write transactions are very fast since they only involve writing
the content once (versus twice for rollback-journal transactions)
and because the writes are all sequential.  Further, syncing the
content to the disk is not required, as long as the application is
willing to sacrifice durability following a power loss or hard reboot.</p>

<p>On the other hand, read performance deteriorates as the WAL file
grows in size since reader must check the WAL file for the content
they need and the time needed to check the WAL file is proportional
to the size of the WAL file.  The wal-index helps readers find content
in the WAL file much faster, but performance still falls off with
increasing WAL file size.  Hence, to maintain good read performance 
it is important to keep the WAL file size down by
run checkpoints at regular intervals.</p>

<p>Checkpointing does require sync operations in order to avoid
the possibility of database corruption following a power loss
or hard reboot.  The WAL must be synced to persistent storage
prior to the start of the checkpoint and the database file must by synced
prior to resetting the WAL.  Checkpoint also requires more seeking.
The checkpointer makes an effort to
do as many sequential page writes to the database as it can (the pages
are transferred from WAL to database in ascending order) but even
then there will typically be many seek operations interspersed among
the page writes.  These factors combine to make checkpoints slower than
write transactions.</p>

<p>The default strategy is to allow successive write transactions to
grow the WAL until the WAL becomes about 1000 pages in size, then to
run a single checkpoint operation.  By default, the checkpoint will be
run automatically by the same thread that does the COMMIT that pushes
the WAL over its size limit.  This has the effect of causing most
COMMIT operations to be very fast but an occasional COMMIT that triggers
a checkpoint to be much slower.  If that effect is undersirable, then
the application can disable automatic checkpointing and run the
periodic checkpoints in a separate thread, or separate process.</p>

<p>Notice that there is a tradeoff between average read performance
and average write performance.  To maximize the read performance,
one wants to keep the WAL as small as possible and hence run checkpoints
frequently, perhaps as often as every COMMIT.  To maximize
write performance, one wants to amortize the cost of each checkpoint
over as many writes as possible, meaning that one wants to run checkpoints
infrequently and let the WAL grow as large as possible before each 
checkpoint.  The decision of how often to run checkpoints may therefore
vary from one application to another depending on the relative read
and write performance requirements of the application.
The default strategy of running a checkpoint once the WAL
reaches 1000 pages seems to work well in test applications on 
workstations, but other strategies might work better on different 
platforms or for different workloads.</p>

<h2>Activating And Configuring WAL Mode</h2>

<p>An SQLite database connection defaults to 
[journal_mode | journal_mode=DELETE].  To convert to WAL mode, use the
following pragma:</p>

<blockquote><pre>
PRAGMA journal_mode=WAL;
</pre></blockquote>

<p>The journal_mode pragma returns a string which is the new journal mode.
On success, the pragma will return the string "<tt>wal</tt>".  If 
the conversion to WAL could not be completed (for example, if the VFS
does not support the necessary shared-memory primitives) then the
journaling mode will be unchanged and the string returned from the
primitive will be the prior journaling mode (for example "<tt>delete</tt>").

<p>By default, SQLite will automatically checkpoint whenever the WAL
file reaches or exceeds 1000 pages of change, or when the last database
connection on a database file closes.  The default configuration is
intended to work well for most applications.  But programs that want
more control can force a checkpoint
using the [wal_checkpoint pragma] or by calling the
[sqlite3_wal_checkpoint()] C interface.  The automatic checkpoint
threshold can be changes or automatic checkpointing can be completely
disabled using the [wal_autocheckpoint pragma] or by calling the
[sqlite3_wal_autocheckpoint()] C interface.  A program can also 
use [sqlite3_wal_hook()] to register a callback to be invoked whenever
any transaction commits to the WAL.  This callback can then invoke
[sqlite3_wal_checkpoint()] to for a checkpoint based on whatever
criteria it thinks is appropriate.  (The automatic checkpoint mechanism
is implemented as a simple wrapper around [sqlite3_wal_hook()].)</p>

<h2>Backwards Compatibility</h2>

<p>The database file format is unchanged for WAL mode.  However, the
WAL file is a new concept and so older versions of SQLite will not know
how to recover a crashed SQLite database that was operating in WAL mode
when the crash occurred.
To prevent older versions of SQLite from trying to recover
a WAL-mode database (and making matters worse) the database file format
version numbers (bytes 18 and 19 in the [database header])
are incremented in WAL mode.
Thus, if an older version of SQLite attempts to connect to an SQLite
database that is operating in WAL mode, it will report an error along
the lines of "file is encrypted or is not a database".</p>

<p>One can explicitly changing out of WAL mode using a pragma such as
this:</p>

<blockquote><pre>
PRAGMA journal_mode=DELETE;
</pre></blockquote>

<p>Deliberately changing out of WAL mode lowers the database file format
version numbers so that older versions of SQLite can again access the
database file.</p>