Documentation Source Text

Check-in [8cdaccbb90]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Updates to the WAL documentation.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 8cdaccbb9047043a13a9e650bf1f5ed16430b55f
User & Date: drh 2010-07-13 23:51:04
Context
2010-07-14
18:39
Improved documentation of the IN and NOT IN operators. Push back the 3.7.0 release date to 2010-07-22. check-in: 07096801a3 user: drh tags: trunk
2010-07-13
23:51
Updates to the WAL documentation. check-in: 8cdaccbb90 user: drh tags: trunk
16:51
Enhancements to the ext3 barrier problem description in lockingv3.html. check-in: a1de10679b user: drh tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/fileformat2.in.

1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
because frames for page P can appear anywhere within the WAL, the
reader has to scan the entire WAL looking for page P frames.  If the
WAL is large (multiple megabytes is typical) that scan can be slow,
and read performance suffers.  To overcome this problem, a separate
data structure called the wal-index is maintained to expedite the
search for frames of a particular page.</p>

<tcl>hd_fragment walindexformat {WAL-index format}</tcl>
<h3>4.5 WAL-Index Format</h3>

<p>Conceptually, the wal-index is shared memory, though the current
VFS implementations use a mmapped file for the wal-index.  The mmapped
file is in the same directory as the database and has the same name
as the database with a "<tt>-shm</tt>" suffix appended.  Because
the wal-index is shared memory, SQLite does not support 







|







1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
because frames for page P can appear anywhere within the WAL, the
reader has to scan the entire WAL looking for page P frames.  If the
WAL is large (multiple megabytes is typical) that scan can be slow,
and read performance suffers.  To overcome this problem, a separate
data structure called the wal-index is maintained to expedite the
search for frames of a particular page.</p>

<tcl>hd_fragment walindexformat {wal-index} {WAL-index format}</tcl>
<h3>4.5 WAL-Index Format</h3>

<p>Conceptually, the wal-index is shared memory, though the current
VFS implementations use a mmapped file for the wal-index.  The mmapped
file is in the same directory as the database and has the same name
as the database with a "<tt>-shm</tt>" suffix appended.  Because
the wal-index is shared memory, SQLite does not support 

Changes to pages/lockingv3.in.

461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483




484
485
486
487
488
489
490
491
492
493
494
495
496
497
498

<p>
SQLite uses the fsync() system call to flush data to the disk under Unix and
it uses the FlushFileBuffers() to do the same under Windows.  Once again,
SQLite assumes that these operating system services function as advertised.
But it has been reported that fsync() and FlushFileBuffers() do not always
work correctly, especially with inexpensive IDE disks.  Apparently some
manufactures of IDE disks have defective controller chips that report
that data has reached the disk surface when in fact the data is still
in volatile cache memory in the disk drive electronics.  There are also
reports that Windows sometimes chooses to ignore FlushFileBuffers() for
unspecified reasons.  The author cannot verify any of these reports.
But if they are true, it means that database corruption is a possibility
following an unexpected power loss.  These are hardware and/or operating
system bugs that SQLite is unable to defend against.
</p>

<tcl>hd_fragment {ext3-barrier-problem} {the ext3 barrier problem}</tcl>
<p>In the Linux <a href="http://en.wikipedia.org/wiki/Ext3">ext3</a>
file system, if the filesystem is not mounted with the "barrier=1" option
in the <a href="http://en.wikipedia.org/wiki/fstab">/etc/fstab</a>
and the disk drive write cache is enabled
then filesystem corruption can occur following a power loss or OS crash.




Various ext3 experts
<a href="http://www.redhat.com/archives/ext3-users/2010-July/msg00001.html">
confirm this behavior</a>.
We are told that most Linux distributions do not use barrier=1 and and do
not disable the write cache so most
Linux distributions are vulnerable to this problem.  Note that this is an
operating system issue and that there is nothing that SQLite can do to
work around it.  
<a href="http://ozlabs.org/~rusty/index.cgi/tech/2009-10-20.html">
Other database engines</a> have also run into this same problem.</p>

<p>
If a crash or power failure occurs and results in a hot journal but that
journal is deleted, the next process to open the database will not
know that it contains changes that need to be rolled back.  The rollback







|










|
|



>
>
>
>



|


|
|







461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502

<p>
SQLite uses the fsync() system call to flush data to the disk under Unix and
it uses the FlushFileBuffers() to do the same under Windows.  Once again,
SQLite assumes that these operating system services function as advertised.
But it has been reported that fsync() and FlushFileBuffers() do not always
work correctly, especially with inexpensive IDE disks.  Apparently some
manufactures of IDE disks have controller chips that report
that data has reached the disk surface when in fact the data is still
in volatile cache memory in the disk drive electronics.  There are also
reports that Windows sometimes chooses to ignore FlushFileBuffers() for
unspecified reasons.  The author cannot verify any of these reports.
But if they are true, it means that database corruption is a possibility
following an unexpected power loss.  These are hardware and/or operating
system bugs that SQLite is unable to defend against.
</p>

<tcl>hd_fragment {ext3-barrier-problem} {the ext3 barrier problem}</tcl>
<p>If a Linux <a href="http://en.wikipedia.org/wiki/Ext3">ext3</a>
filesystem is mounted without the "barrier=1" option
in the <a href="http://en.wikipedia.org/wiki/fstab">/etc/fstab</a>
and the disk drive write cache is enabled
then filesystem corruption can occur following a power loss or OS crash.
Whether or not corruption can occur depends on the details of the disk control
hardware; corruption is more likely with inexpensive consumer-grade disks
and less of a problem for enterprise-class storage devices with advanced
features such as non-volatile write caches.
Various ext3 experts
<a href="http://www.redhat.com/archives/ext3-users/2010-July/msg00001.html">
confirm this behavior</a>.
We are told that most Linux distributions do not use barrier=1 and do
not disable the write cache so most
Linux distributions are vulnerable to this problem.  Note that this is an
operating system and hardware issue and that there is nothing that SQLite
can do to work around it.  
<a href="http://ozlabs.org/~rusty/index.cgi/tech/2009-10-20.html">
Other database engines</a> have also run into this same problem.</p>

<p>
If a crash or power failure occurs and results in a hot journal but that
journal is deleted, the next process to open the database will not
know that it contains changes that need to be rolled back.  The rollback

Changes to pages/sqlite.in.

1
2
3
4
5
6
7
8
9
<title>Command Line Shell For SQLite</title>
<tcl>hd_keywords {CLI} {Command Line Interface}</tcl>

<h1 align=center>
Command Line Shell For SQLite
</h1>

<p>The SQLite library includes a simple command-line utility named
<b>sqlite3</b> (or <b>sqlite3.exe</b> on windows)

|







1
2
3
4
5
6
7
8
9
<title>Command Line Shell For SQLite</title>
<tcl>hd_keywords {CLI} {Command Line Interface} {command-line shell}</tcl>

<h1 align=center>
Command Line Shell For SQLite
</h1>

<p>The SQLite library includes a simple command-line utility named
<b>sqlite3</b> (or <b>sqlite3.exe</b> on windows)

Changes to pages/wal.in.

33
34
35
36
37
38
39





40
41
42

43
44
45
46
47
48
49
50
...
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
...
231
232
233
234
235
236
237


238
239
240
241
242
243
244
...
246
247
248
249
250
251
252
253




































































































254
255
256
257
258
259

260
261
262
263
264
265
266
<li>Transactions that involve changes against multiple [ATTACH | ATTACHed]
    databases are atomic for each individual database, but are not
    atomic across all databases as a set.
<li>It is not possible to change the database page size after entering WAL
    mode, either on an empty database or by using [VACUUM] or by restoring
    from a backup using the [backup API].  You must be in a rollback journal
    mode to change the page size.





<li>WAL might be very slightly slower (perhaps 1% or 2% slower)
    than the traditional rollback-journal approach
    in applications that do mostly reads and seldom write.

<li>There is an additional persistent "*-wal" file associated with each
    database, which can make SQLite less appealing for use as an 
    [application file-format].
<li>There is the extra operation of [checkpointing] which, though automatic
    by default, is still something that application developers need to
    be mindful of.
</ol>

................................................................................
then there will typically be many seek operations interspersed among
the page writes.  These factors combine to make checkpoints slower than
write transactions.</p>

<p>The default strategy is to allow successive write transactions to
grow the WAL until the WAL becomes about 1000 pages in size, then to
run a checkpoint operation for each subsequent COMMIT until the WAL
is reset to be smalller than 1000 pages.  By default, the checkpoint will be
run automatically by the same thread that does the COMMIT that pushes
the WAL over its size limit.  This has the effect of causing most
COMMIT operations to be very fast but an occasional COMMIT (those that trigger
a checkpoint) to be much slower.  If that effect is undesirable, then
the application can disable automatic checkpointing and run the
periodic checkpoints in a separate thread, or separate process.
(Links to commands and interfaces to accomplish this are
................................................................................
On success, the pragma will return the string "<tt>wal</tt>".  If 
the conversion to WAL could not be completed (for example, if the VFS
does not support the necessary shared-memory primitives) then the
journaling mode will be unchanged and the string returned from the
primitive will be the prior journaling mode (for example "<tt>delete</tt>").

<a name="how_to_checkpoint"></a>


<p>By default, SQLite will automatically checkpoint whenever a [COMMIT]
occurs that causes the WAL file to be 1000 pages or more in size, or when the 
last database connection on a database file closes.  The default 
configuration is intended to work well for most applications.
But programs that want more control can force a checkpoint
using the [wal_checkpoint pragma] or by calling the
[sqlite3_wal_checkpoint()] C interface.  The automatic checkpoint
................................................................................
disabled using the [wal_autocheckpoint pragma] or by calling the
[sqlite3_wal_autocheckpoint()] C interface.  A program can also 
use [sqlite3_wal_hook()] to register a callback to be invoked whenever
any transaction commits to the WAL.  This callback can then invoke
[sqlite3_wal_checkpoint()] to for a checkpoint based on whatever
criteria it thinks is appropriate.  (The automatic checkpoint mechanism
is implemented as a simple wrapper around [sqlite3_wal_hook()].)</p>





































































































<h2>Backwards Compatibility</h2>

<p>The database file format is unchanged for WAL mode.  However, the
WAL file (located in the same directory or folder as the original database
file and with the same name as the original database with 
"<tt>-wal</tt>" appended) is a new concept and so older versions of 

SQLite will not know
how to recover a crashed SQLite database that was operating in WAL mode
when the crash occurred.
To prevent older versions of SQLite from trying to recover
a WAL-mode database (and making matters worse) the database file format
version numbers (bytes 18 and 19 in the [database header])
are increased from 1 to 2 in WAL mode.







>
>
>
>
>



>
|







 







|







 







>
>







 








>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>



<
<
<
>







33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
...
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
...
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
...
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364



365
366
367
368
369
370
371
372
<li>Transactions that involve changes against multiple [ATTACH | ATTACHed]
    databases are atomic for each individual database, but are not
    atomic across all databases as a set.
<li>It is not possible to change the database page size after entering WAL
    mode, either on an empty database or by using [VACUUM] or by restoring
    from a backup using the [backup API].  You must be in a rollback journal
    mode to change the page size.
<li>It is not possible to open [read-only WAL databases].
    The opening process must have write privileges for "<tt>-shm</tt>"
    [wal-index] shared memory file associated with the database, if that
    file exists, or else write access on the directory containing
    the database file if the "<tt>-shm</tt>" file does not exist.
<li>WAL might be very slightly slower (perhaps 1% or 2% slower)
    than the traditional rollback-journal approach
    in applications that do mostly reads and seldom write.
<li>There is an additional quasi-persistent "<tt>-wal</tt>" file and
    "<tt>-shm</tt> shared memory file associated with each
    database, which can make SQLite less appealing for use as an 
    [application file-format].
<li>There is the extra operation of [checkpointing] which, though automatic
    by default, is still something that application developers need to
    be mindful of.
</ol>

................................................................................
then there will typically be many seek operations interspersed among
the page writes.  These factors combine to make checkpoints slower than
write transactions.</p>

<p>The default strategy is to allow successive write transactions to
grow the WAL until the WAL becomes about 1000 pages in size, then to
run a checkpoint operation for each subsequent COMMIT until the WAL
is reset to be smaller than 1000 pages.  By default, the checkpoint will be
run automatically by the same thread that does the COMMIT that pushes
the WAL over its size limit.  This has the effect of causing most
COMMIT operations to be very fast but an occasional COMMIT (those that trigger
a checkpoint) to be much slower.  If that effect is undesirable, then
the application can disable automatic checkpointing and run the
periodic checkpoints in a separate thread, or separate process.
(Links to commands and interfaces to accomplish this are
................................................................................
On success, the pragma will return the string "<tt>wal</tt>".  If 
the conversion to WAL could not be completed (for example, if the VFS
does not support the necessary shared-memory primitives) then the
journaling mode will be unchanged and the string returned from the
primitive will be the prior journaling mode (for example "<tt>delete</tt>").

<a name="how_to_checkpoint"></a>
<h3>Automatic Checkpoint</h3>

<p>By default, SQLite will automatically checkpoint whenever a [COMMIT]
occurs that causes the WAL file to be 1000 pages or more in size, or when the 
last database connection on a database file closes.  The default 
configuration is intended to work well for most applications.
But programs that want more control can force a checkpoint
using the [wal_checkpoint pragma] or by calling the
[sqlite3_wal_checkpoint()] C interface.  The automatic checkpoint
................................................................................
disabled using the [wal_autocheckpoint pragma] or by calling the
[sqlite3_wal_autocheckpoint()] C interface.  A program can also 
use [sqlite3_wal_hook()] to register a callback to be invoked whenever
any transaction commits to the WAL.  This callback can then invoke
[sqlite3_wal_checkpoint()] to for a checkpoint based on whatever
criteria it thinks is appropriate.  (The automatic checkpoint mechanism
is implemented as a simple wrapper around [sqlite3_wal_hook()].)</p>

<h3>Persistence of WAL mode</h3>

<p>Unlike the other journaling modes, 
[journal_mode | PRAGMA journal_mode=WAL] is
persistent.  If a process sets WAL mode, then closes and reopens the
database, the database will come back in WAL mode.  In contrast, if
a process sets (for example) PRAGMA journal_mode=TRUNCATE and then closes and
reopens the database will come back up in the default rollback mode of
DELETE rather than the previous TRUNCATE setting.</p>

<p>The persistence of WAL mode means that applications can be converted
to using SQLite in WAL mode without making any changes to the application
itself.  One has merely to run "<tt>PRAGMA journal_mode=WAL;</tt>" on the
database file(s) using the [command-line shell] or other utility, then
restart the application.</p>

<p>The WAL journal mode will be set on all
connections to the same database file if it is set on any one connection.
</p>

<tcl>hd_fragment {readonly} {read-only WAL databases}</tcl>
<h2>Read-Only Databases</h2>

<p>No SQLite database (regardless of whether or not it is WAL mode) is
readable if it is located on read-only media and it requires recovery.
So, for example, if an application crashes and leaves an SQLite database
with a [hot journal], that database cannot be opened unless the opening
process has write privilege on the database file, the directory
containing the database file, and the hot journal.  This is because the 
incomplete transaction left over from the crash must be rolled back prior 
to reading the database and that rollback cannot occur without write 
permission on all files and the directory containing them.</p>

<p>A database in WAL mode cannot generally be opened from read-only 
media because even ordinary reads in WAL mode require recovery-like
operations.</p>

<p>An efficient implementation of the [WAL read algorithm] requires that
there exist a hash table in shared memory over the content of the WAL file.
This hash table is called the [wal-index].
The wal-index is in shared memory, and so technically it does not have 
to have a name in the host computer filesystem.  Custom
[sqlite3_vfs | VFS] implementations are free to implement shared 
memory in any way they see fit, but the default unix and windows 
drivers that come built-in with SQLite implement shared memory
using <a href="http://en.wikipedia.org/wiki/Mmap">mmapped files</a>
named using the suffix "<tt>-shm</tt>" and
located in the same directory as the database file.  The wal-index must
be rebuilt upon first access, even by readers, and so in order to open
the WAL database, write access is required on the "<tt>-shm</tt>" shared
memory file if the file exists, or else write access is required on the
directory containing the database so that the wal-index can be created if
it does not already exist.
This does not preclude custom VFS implementations that implement shared 
memory differently from being able to access read-only WAL databases, but
it does prevent the default unix and windows backends from accessing
WAL databases on read-only media.</p>

<p>Hence, SQLite databases should always be converted to 
[journal_mode | PRAGMA journal_mode=DELETE] prior to being transferred
to read-only media.</p>

<p>Also, if multiple processes are to access a WAL mode database, then
all processes should run under user or group IDs that give them write
access to the database files, the WAL file, the shared memory 
<tt>-shm</tt> file, and the containing directory.</p>

<h2>Implementation Of Shared-Memory For The WAL-Index</h2>

<pThe [wal-index] is implemented using an ordinary file that is
mmapped for robustness.  Early (pre-release) implementations of WAL mode
stored the wal-index in volatile shared-memory, such as files created in
/dev/shm on Linux or /tmp on other unix systems.  The problem
with that approach is that processes with a different root directory
(changed via <a href="http://en.wikipedia.org/wiki/Chroot">chroot</a>)
will see different files and hence use different shared memory areas,
leading to database corruption.  Other methods for creating nameless
shared memory blocks are not portable across the various flavors of
unix.  And we could not find any method to create nameless shared
memory blocks on windows.  The only way we have found to guarantee
that all processes accessing the same database file use the same shared
memory is to create the shared memory by mmapping a file in the same
directory as the database itself.</p>

<p>Using an ordinary disk file to provide shared memory has the 
disadvantge that it might actually do unnecessary disk I/O by
writing the shared memory to disk.  However, the developers do not
think this is a major concern since the wal-index rarely exceeds
32 KiB in size and is never synced.  Furthermore, the wal-index 
backing file is deleted when the last database connection disconnects,
which often prevents any real disk I/O from ever happening.</p>

<p>Specialized applications for which the default implementation of
shared memory is unacceptable can devise alternative methods via a
custom [sqlite3_vfs | VFS].  
For example, if it is known that a particular database
will only be accessed by threads within a single process, the wal-index
can be implemented using heap memory instead of true shared memory.</p>


<h2>Backwards Compatibility</h2>

<p>The database file format is unchanged for WAL mode.  However, the



WAL file and the [wal-index] are new concepts and so older versions of 
SQLite will not know
how to recover a crashed SQLite database that was operating in WAL mode
when the crash occurred.
To prevent older versions of SQLite from trying to recover
a WAL-mode database (and making matters worse) the database file format
version numbers (bytes 18 and 19 in the [database header])
are increased from 1 to 2 in WAL mode.