Documentation Source Text
Check-in [e63c0c]
Not logged in

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Edits to the VFS documentation. Better cross-referencing to VFS.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1:e63c0ce4fb1d824ffcac9360ecc6f24cc854c341
User & Date: drh 2011-05-23 19:04:56
Context
2011-05-24
00:44
Restrict website search text to omit non-alphanumeric characters. check-in: 252855 user: drh tags: trunk
2011-05-23
19:04
Edits to the VFS documentation. Better cross-referencing to VFS. check-in: e63c0c user: drh tags: trunk
2011-05-20
23:21
Further tightening of the validity rules for valid HTTP requests in althttpd.c. check-in: 73bc37 user: drh tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/atomiccommit.in.

83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
...
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
never start writing a sector in the middle and work towards the
ends.  We do not know if this assumption is always true but it
seems reasonable.</p>

<p>The previous paragraph states that SQLite does not assume that
sector writes are atomic.  This is true by default.  But as of
SQLite version 3.5.0, there is a new interface called the
Virtual File System (VFS) interface.  The VFS is the only means
by which SQLite communicates to the underlying filesystem.  The
code comes with default VFS implementations for Unix and Windows
and there is a mechanism for creating new custom VFS implementations
at runtime.  In this new VFS interface there is a method called
xDeviceCharacteristics.  This method interrogates the underlying
filesystem to discover various properties and behaviors that the
filesystem may or may not exhibit.  The xDeviceCharacteristics
................................................................................
file space originally contains garbage and then later is filled in
with the data actually written.  In other words, SQLite assumes that
the file size is updated before the file content.  This is a 
pessimistic assumption and SQLite has to do some extra work to make
sure that it does not cause database corruption if power is lost
between the time when the file size is increased and when the
new content is written.  The xDeviceCharacteristics method of
the VFS might indicate that the filesystem will always write the
data before updating the file size.  (This is the 
SQLITE_IOCAP_SAFE_APPEND property for those readers who are looking
at the code.)  When the xDeviceCharacteristics method indicates
that files content is written before the file size is increased,
SQLite can forego some of its pedantic database protection steps
and thereby decrease the amount of disk I/O needed to perform a
commit.  The current implementation, however, makes no such assumptions







|







 







|







83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
...
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
never start writing a sector in the middle and work towards the
ends.  We do not know if this assumption is always true but it
seems reasonable.</p>

<p>The previous paragraph states that SQLite does not assume that
sector writes are atomic.  This is true by default.  But as of
SQLite version 3.5.0, there is a new interface called the
Virtual File System ([VFS]) interface.  The [VFS] is the only means
by which SQLite communicates to the underlying filesystem.  The
code comes with default VFS implementations for Unix and Windows
and there is a mechanism for creating new custom VFS implementations
at runtime.  In this new VFS interface there is a method called
xDeviceCharacteristics.  This method interrogates the underlying
filesystem to discover various properties and behaviors that the
filesystem may or may not exhibit.  The xDeviceCharacteristics
................................................................................
file space originally contains garbage and then later is filled in
with the data actually written.  In other words, SQLite assumes that
the file size is updated before the file content.  This is a 
pessimistic assumption and SQLite has to do some extra work to make
sure that it does not cause database corruption if power is lost
between the time when the file size is increased and when the
new content is written.  The xDeviceCharacteristics method of
the [VFS] might indicate that the filesystem will always write the
data before updating the file size.  (This is the 
SQLITE_IOCAP_SAFE_APPEND property for those readers who are looking
at the code.)  When the xDeviceCharacteristics method indicates
that files content is written before the file size is increased,
SQLite can forego some of its pedantic database protection steps
and thereby decrease the amount of disk I/O needed to perform a
commit.  The current implementation, however, makes no such assumptions

Changes to pages/compile.in.

371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389

    <li> No locking style. If none of the above can be supported, this 
         locking style is used. No database locking mechanism is used. When
         this system is used it is not safe for a single database to be
         accessed by multiple clients.
  </ul>

  Additionally, five extra VFS implementations are provided as well as the
  default. By specifying one of the extra VFS implementations 
  when calling [sqlite3_open_v2()], an application may bypass the file-system
  detection logic and explicitly select one of the above locking styles. The
  five extra VFS implementations are called "unix-posix", "unix-afp",
  "unix-flock", "unix-dotfile" and "unix-none".
}

COMPILE_OPTION {SQLITE_ENABLE_MEMORY_MANAGEMENT} {
  This option adds extra logic to SQLite that allows it to release unused
  memory upon request.  This option must be enabled in order for the
  [sqlite3_release_memory()] interface to work.  If this compile-time







|



|







371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389

    <li> No locking style. If none of the above can be supported, this 
         locking style is used. No database locking mechanism is used. When
         this system is used it is not safe for a single database to be
         accessed by multiple clients.
  </ul>

  Additionally, five extra [VFS] implementations are provided as well as the
  default. By specifying one of the extra VFS implementations 
  when calling [sqlite3_open_v2()], an application may bypass the file-system
  detection logic and explicitly select one of the above locking styles. The
  five extra [VFS] implementations are called "unix-posix", "unix-afp",
  "unix-flock", "unix-dotfile" and "unix-none".
}

COMPILE_OPTION {SQLITE_ENABLE_MEMORY_MANAGEMENT} {
  This option adds extra logic to SQLite that allows it to release unused
  memory upon request.  This option must be enabled in order for the
  [sqlite3_release_memory()] interface to work.  If this compile-time

Changes to pages/fileformat2.in.

360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
that contains the bytes at offsets between 1073741824 and 1073742335,
inclusive.  A database file that is less than or equal to 1073741824 bytes 
in size contains no lock-byte page.  A database file larger than
1073741824 contains exactly one lock-byte page.
</p>

<p>The lock-byte page is set aside for use by the operating-system specific
VFS implementation in implementing the database file locking primitives.
^SQLite does not use the lock-byte page.  ^The SQLite core 
will never read or write, though operating-system specific VFS 
implementations may choose to read or write bytes on the lock-byte 
page according to the 
needs and proclivities of the underlying system.  The unix and win32
VFS implementations that come built into SQLite do not write to the
lock-byte page, but third-party VFS implementations for
other operating systems might.</p>

<tcl>hd_fragment {freelist} {freelist} {free-page list}</tcl>
<h3>1.4 The Freelist</h3>

<p>A database file might contain one or more pages that are not in







|

|



|







360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
that contains the bytes at offsets between 1073741824 and 1073742335,
inclusive.  A database file that is less than or equal to 1073741824 bytes 
in size contains no lock-byte page.  A database file larger than
1073741824 contains exactly one lock-byte page.
</p>

<p>The lock-byte page is set aside for use by the operating-system specific
[VFS] implementation in implementing the database file locking primitives.
^SQLite does not use the lock-byte page.  ^The SQLite core 
will never read or write, though operating-system specific [VFS] 
implementations may choose to read or write bytes on the lock-byte 
page according to the 
needs and proclivities of the underlying system.  The unix and win32
[VFS] implementations that come built into SQLite do not write to the
lock-byte page, but third-party VFS implementations for
other operating systems might.</p>

<tcl>hd_fragment {freelist} {freelist} {free-page list}</tcl>
<h3>1.4 The Freelist</h3>

<p>A database file might contain one or more pages that are not in

Changes to pages/pragma.in.

743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
    <p>^(<b>PRAGMA synchronous;
       <br>PRAGMA synchronous = </b>
          <i>0 | OFF | 1 | NORMAL | 2 | FULL</i><b>;</b></p>

    <p>Query or change the setting of the "synchronous" flag.)^
    ^The first (query) form will return the synchronous setting as an 
    integer.  ^When synchronous is FULL (2), the SQLite database engine will
    use the xSync method of the VFS to ensure that all content is safely
    written to the disk surface prior to continuing.
    This ensures that an operating system crash or power failure will
    not corrupt the database.
    FULL synchronous is very safe, but it is also slower. 
    ^When synchronous is NORMAL (1), the SQLite database
    engine will still sync at the most critical moments, but less often
    than in FULL mode.  There is a very small (though non-zero) chance that







|







743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
    <p>^(<b>PRAGMA synchronous;
       <br>PRAGMA synchronous = </b>
          <i>0 | OFF | 1 | NORMAL | 2 | FULL</i><b>;</b></p>

    <p>Query or change the setting of the "synchronous" flag.)^
    ^The first (query) form will return the synchronous setting as an 
    integer.  ^When synchronous is FULL (2), the SQLite database engine will
    use the xSync method of the [VFS] to ensure that all content is safely
    written to the disk surface prior to continuing.
    This ensures that an operating system crash or power failure will
    not corrupt the database.
    FULL synchronous is very safe, but it is also slower. 
    ^When synchronous is NORMAL (1), the SQLite database
    engine will still sync at the most critical moments, but less often
    than in FULL mode.  There is a very small (though non-zero) chance that

Changes to pages/selfcontained.in.

33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
The date and time SQL functions provided by SQLite require
some additional C library support, but those functions can
be also be omitted from the build using compile-time options.
</p>

<p>
Communications between SQLite and the operating system and disk are
mediated through an interchangeable 
[sqlite3_vfs | Virtual File System (VFS)] layer.
VFS modules for Unix (Linux and Mac OS X), OS/2, and Windows (Win32 and WinCE)
are provided in the source tree.  It is a simple matter to devise an
alternative VFS for embedded devices.
</p>

<p>
For safe operation in multi-threaded environments, SQLite requires







|
<







33
34
35
36
37
38
39
40

41
42
43
44
45
46
47
The date and time SQL functions provided by SQLite require
some additional C library support, but those functions can
be also be omitted from the build using compile-time options.
</p>

<p>
Communications between SQLite and the operating system and disk are
mediated through an interchangeable [VFS] layer.

VFS modules for Unix (Linux and Mac OS X), OS/2, and Windows (Win32 and WinCE)
are provided in the source tree.  It is a simple matter to devise an
alternative VFS for embedded devices.
</p>

<p>
For safe operation in multi-threaded environments, SQLite requires

Changes to pages/testing.in.

296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
...
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
...
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
course, and so crash testing is done in simulation.  An alternative
[sqlite3_vfs | Virtual File System] is inserted that allows the test
harness to simulate the state of the database file following a crash.</p>

<p>In the TCL test harness, the crash simulation is done in a separate
process.  The main testing process spawns a child process which runs
some SQLite operation and randomly crashes somewhere in the middle of
a write operation.  A special VFS randomly reorders and corrupts
the unsynchronized
write operations to simulate the effect of buffered filesystems.  After
the child dies, the original test process opens and reads the test
database and verifies that the changes attempted by the child either
completed successfully or else were completely rolled back.  The
[integrity_check] [PRAGMA] is used to make sure no database corruption
occurs.</p>

<p>The TH3 test harness needs to run on embedded systems that do not
necessarily have the ability to spawn child processes, so it uses
an in-memory VFS to simulate crashes.  The in-memory VFS can be rigged
to make a snapshot of the entire filesystem after a set number of I/O
operations.  Crash tests run in a loop.  On each iteration of the loop,
the point at which a snapshot is made is advanced until the SQLite
operations being tested run to completion without ever hitting a
snapshot.  Within the loop, after the SQLite operation under test has
completed, the filesystem is reverted to the snapshot and random file
damage is introduced that is characteristic of the kinds of damage
................................................................................

<p>The SQLite core has 100% branch test coverage under [TH3] as of
2009-07-25, in its default configuration as measured by
[http://gcc.gnu.org/onlinedocs/gcc/Gcov.html | gcov]
utility on SuSE Linux 10.1 on x86 hardware with the GCC 4.0.1 compiler.</p>

<p>The "SQLite core" in the previous paragraph excludes the
operating-system dependent [sqlite3_vfs | VFS] backends, since it is
not possible to write cross-platform tests for those modules.  Extensions
such as FTS3 and RTree are also excluded from the analysis.</p>

<tcl>hd_fragment stmtvbr</tcl>
<h3>7.1 Statement versus branch coverage</h3>

<p>There are many ways to measure test coverage.  The most popular
................................................................................
<tcl>hd_fragment journaltest</tcl>
<h3>8.5 Journal Tests</h3>

<p>One of the things that SQLite does to ensure that transactions
are atomic across system crashes and power failures is to write
all changes into the rollback journal file prior to changing the
database.  The TCL test harness contains an alternative
[sqlite3_vfs | Virtual File System ] implementation that helps to
verify this is occurring correctly.  The "journal-test VFS" monitors
all disk I/O traffic between the database file and rollback journal,
checking to make sure that nothing is written into the database
file which has not first been written and synced to the rollback journal.
If any discrepancies are found, an assertion fault is raised.</p>

<p>The journal tests are an additional double-check over and above







|










|







 







|







 







|







296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
...
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
...
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
course, and so crash testing is done in simulation.  An alternative
[sqlite3_vfs | Virtual File System] is inserted that allows the test
harness to simulate the state of the database file following a crash.</p>

<p>In the TCL test harness, the crash simulation is done in a separate
process.  The main testing process spawns a child process which runs
some SQLite operation and randomly crashes somewhere in the middle of
a write operation.  A special [VFS] randomly reorders and corrupts
the unsynchronized
write operations to simulate the effect of buffered filesystems.  After
the child dies, the original test process opens and reads the test
database and verifies that the changes attempted by the child either
completed successfully or else were completely rolled back.  The
[integrity_check] [PRAGMA] is used to make sure no database corruption
occurs.</p>

<p>The TH3 test harness needs to run on embedded systems that do not
necessarily have the ability to spawn child processes, so it uses
an in-memory [VFS] to simulate crashes.  The in-memory [VFS] can be rigged
to make a snapshot of the entire filesystem after a set number of I/O
operations.  Crash tests run in a loop.  On each iteration of the loop,
the point at which a snapshot is made is advanced until the SQLite
operations being tested run to completion without ever hitting a
snapshot.  Within the loop, after the SQLite operation under test has
completed, the filesystem is reverted to the snapshot and random file
damage is introduced that is characteristic of the kinds of damage
................................................................................

<p>The SQLite core has 100% branch test coverage under [TH3] as of
2009-07-25, in its default configuration as measured by
[http://gcc.gnu.org/onlinedocs/gcc/Gcov.html | gcov]
utility on SuSE Linux 10.1 on x86 hardware with the GCC 4.0.1 compiler.</p>

<p>The "SQLite core" in the previous paragraph excludes the
operating-system dependent [VFS] backends, since it is
not possible to write cross-platform tests for those modules.  Extensions
such as FTS3 and RTree are also excluded from the analysis.</p>

<tcl>hd_fragment stmtvbr</tcl>
<h3>7.1 Statement versus branch coverage</h3>

<p>There are many ways to measure test coverage.  The most popular
................................................................................
<tcl>hd_fragment journaltest</tcl>
<h3>8.5 Journal Tests</h3>

<p>One of the things that SQLite does to ensure that transactions
are atomic across system crashes and power failures is to write
all changes into the rollback journal file prior to changing the
database.  The TCL test harness contains an alternative
[OS backend] implementation that helps to
verify this is occurring correctly.  The "journal-test VFS" monitors
all disk I/O traffic between the database file and rollback journal,
checking to make sure that nothing is written into the database
file which has not first been written and synced to the rollback journal.
If any discrepancies are found, an assertion fault is raised.</p>

<p>The journal tests are an additional double-check over and above

Changes to pages/th3.in.

23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
     errors, and power loss during transaction commit. </p></li>

<li><p> TH3 exercises SQLite in a variety of run-time configurations
     (UTF8 vs UTF16, different pages sizes, varying journal modes, etc.)
     </p></li>

<li><p> TH3 achieves 100% branch test coverage over SQLite core.
    (Test coverage of the operating-system specific VFSes and extensions
    such as FTS and RTREE is less than 100%). </p></li>
</ul>

<p>TH3 was originally written for validation testing only, but has
subsequently been used for development testing and debugging
as well, and has proven very helpful in those roles.  A full-coverage
test run for TH3 takes less than 10 minutes on a workstation and hence







|







23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
     errors, and power loss during transaction commit. </p></li>

<li><p> TH3 exercises SQLite in a variety of run-time configurations
     (UTF8 vs UTF16, different pages sizes, varying journal modes, etc.)
     </p></li>

<li><p> TH3 achieves 100% branch test coverage over SQLite core.
    (Test coverage of the operating-system specific [VFSes] and extensions
    such as FTS and RTREE is less than 100%). </p></li>
</ul>

<p>TH3 was originally written for validation testing only, but has
subsequently been used for development testing and debugging
as well, and has proven very helpful in those roles.  A full-coverage
test run for TH3 takes less than 10 minutes on a workstation and hence

Changes to pages/vfs.in.

3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
..
38
39
40
41
42
43
44
45
46

47
48
49
50
51
52
53
54
..
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
..
84
85
86
87
88
89
90
91
92
93

94
95
96
97
98
99


100
101
102
103
104
105
106
...
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
...
145
146
147
148
149
150
151
152

153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
...
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240

<h1 align="center">
The SQLite OS Interface or "VFS"
</h1>

<p>
This article describes the SQLite OS portability layer or "VFS" - the
thin module that is at the bottom of the SQLite implementation stack
and which provides portability across operating systems.
</p>

<img src="images/vfs1.gif" align="right" hspace="10">
<h2>1.0 The VFS In Relation To The Rest Of SQLite</h2>

<p>
The internal organization of the SQLite library can be viewed as a
stack of modules such as shown to the right.
The Tokenizer, Parser, and Code Generator components are used to
process SQL statements and convert them into executable programs 
in a virtual machine language or byte code.
Roughly speaking, these top three layers implement
[sqlite3_prepare_v2()].  The byte code generated by the top three
layers is a [prepared statement].
The Virtual Machine module is responsible for running the SQL statement 
................................................................................

<p>
This article is about the bottom layer.
</p>

<p>The OS Interface - also called the "VFS" - is what makes SQLite 
portable across operating systems.  Whenever any of the other modules
in SQLite needs to communicate with the filesystem, or the operating
system, they invoke methods in the VFS to communicate with the 

operating system on their behalf.  Hence, porting SQLite to a new
operating system is simply a matter of writing a new OS interface layer
or "VFS".</p>

<h2>2.0 Multiple VFSes</h2>

<p>
The standard SQLite source tree contains built-in VFSes for os/2, unix,
................................................................................

<p>
Multiple VFSes can be registered at the same time.
Each VFS has a unique names.
Separate [database connections] within the same process can be using
different VFSes at the same time.   For that matter, if a single
database connection has multiple database files open using
the [ATTACH] command, then each database might be using a
different VFS.
</p>

<p>
Unix builds come with multiple VFSes built-in.  The default VFS
for unix is called "unix" and is what is used in an overwhelming
majority of applications.  Other VFSes that can be found in unix
may include:
</p>

<ol>
<li><p><b>unix-dotfile</b> - uses dot-file locking rather than
          POSIX advisory locks.
................................................................................
<li><p><b>unix-none</b> - all file locking operations are no-ops.
<li><p><b>unix-namedsem</b> - uses named semaphores for file locking.
       VXWorks only.
</ol>

<p>
The various unix VFSes differ only in the way they handle file locking -
they share most of the implementation in common with one another and
are all located in the same SQLite source file:  "<tt>os_unix.c</tt>".
Note that because they do use different locking strategies, they are

subtle imcompatible with one another.  If two processes are accessing
the same SQLite database using different unix VFSes, they will probably
not see each others locks and may end up interfering with one another,
resulting in database corruption.  The "unix-none" VFS in particular
does no locking at all and will easily result in database corruption if
used to access a database by two or more database connections at a time.


</p>

<h2>2.1 Specifying Which VFS To Use</h2>

<p>
There is always one VFS which is the default VFS.  On unix systems,
the "unix" VFS comes up as the default and on windows it is "win32".
................................................................................
If no other actions are taken, new database connections will make use
of the default VFS.
</p>

<p>
The default VFS can be changed by registering or re-registering the
VFS using the [sqlite3_vfs_register()] interface with a second parameter
of 1.  Hence, of a (unix) process to make the "unix-nolock" VFS the
default in place of "unix", the following code would work:
</p>

<blockquote><pre>
sqlite3_vfs_register(sqlite3_vfs_find("unix-nolock"), 1);
</pre></blockquote>

<p>
................................................................................
a VFS specified as the fourth argument to [sqlite3_open_v2()].  The
default VFS is used if no VFS is specified otherwise.
</p>

<h2>2.2 VFS Shims</h2>

<p>
From the point of view of the uppers layers of the SQLite stack, there

is only one VFS in use at a time.  But in practice, a particular VFS might
just be a thin wrapper around another VFS does the real work.
We call a wrapper VFS a "shim".
</p>

<p>
A simple example of a shim is the "vfstrace" VFS.  This is a VFS
(imnplemented in the 
[http://www.sqlite.org/src/finfo?name=src/test_vfstrace.c | test_vfstrace.c]
source file) that writes a message associated with each VFS method call
into a log file, then passes control off to another VFS to do the actual
work.
</p>

<h2>2.3 Other Example VFSes</h2>

................................................................................
<p>
The following are other VFS implementations available in the public
SQLite source tree:
</p>

<ul>
<li><p>
<b>test_demovfs.c</b> -
This file implements a very simple VFS named "demo" that uss POSIX 
functions such as
open(), read(), write(), fsync(), close(), fsync(), sleep(), time(),
and so forth.  This VFS only works on unix systems.  But it is not
intended as a replacement for the standard "unix" VFS used by default
on unix platforms.  The "demo" VFS is deliberately kept very simple
so that it can be used as a learning aid or as template for building
other VFSes or for porting SQLite to new operating systems.

<li><p>
<b>test_quota.c</b> -
This file implements a shim called "quota" that enforces cumulative
file size limits on a collection of database files.  An auxiliary
interface is used to define "quote groups".  A quota group is a
set of files (database files, journals, and temporary files) whose
names all match a [GLOB] pattern.  The sum of the sizes of all files
in each quota group is tracked, and if that sum exceeds a threshold
defined for the quota group, a callback function is invoked.  That
callback can either increase the threshold or cause the operation
that would have exceeded the quota to fail with an 
[SQLITE_FULL] error.  One of the uses of this shim is used to enforce 
resource limits on application databases in Firefox.

<li><p>
<b>test_multiplex.c</b> -
This file implements a shim that allows database files to exceed the
maximum file size of the underlying filesystem.  This shim presents
an interface to the upper six layers of SQLite that makes it look like
very large files are being used, when in reality each such large file
is split up into many smaller files on the underlying system.
This shim has been used, for example, to allow databases to grow
larger than 2 gibibytes on FAT16 filesystems.

<li><p>
<b>test_onefile.c</b> -
This file implements a demonstration VFS named "fs" that shows how SQLite 
can be used on an embedded device that lacks a filesystem.  Content is
written directly to the underlying media.  A VFS derived from this
demonstration code could be used by a gadget with a limited amount of
flash memory to make SQLite behave as the filesystem for the flash memory
on the device.

<li><p>
<b>test_journal.c</b> -
This file implements a shim used during SQLite testing that verifies that
the database and rollback journal are written in the correct order and
are "synced" at appropriate times in order to guarantee that the database
can recover from a power lose are hard reset at any time.  The shim
checks several invariants on the operation of databases and rollback
journals and raises exceptions if any of those invariants are violated.
These invariants, in turn, assure that the database is always recoverable.
Running a large suite of test cases using this shim provides added
assurance that SQLite databases will not be damaged by unexpected
power failures or device resets.

<li><p>
<b>test_vfs.c</b> -
This file implements a shim that can be used to simulate filesystem faults.
This shim is used during testing to verify that SQLite responses sanely
to hardware malfunctions or to other error conditions such as running out
of filesystem space that are difficult to test on a real system.
</ul>

<p>







|
|






|
|







 







|
|
>
|







 







|





|







 







|
|
|
>
|
|



|
>
>







 







|
|







 







|
>
|
|





|
|







 







|










|













|









|








|












|







3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
..
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
..
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
..
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
...
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
...
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
...
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245

<h1 align="center">
The SQLite OS Interface or "VFS"
</h1>

<p>
This article describes the SQLite OS portability layer or "VFS" - the
module at the bottom of the SQLite implementation stack
that provides portability across operating systems.
</p>

<img src="images/vfs1.gif" align="right" hspace="10">
<h2>1.0 The VFS In Relation To The Rest Of SQLite</h2>

<p>
The internal organization of the SQLite library can be viewed as the
stack of modules shown to the right.
The Tokenizer, Parser, and Code Generator components are used to
process SQL statements and convert them into executable programs 
in a virtual machine language or byte code.
Roughly speaking, these top three layers implement
[sqlite3_prepare_v2()].  The byte code generated by the top three
layers is a [prepared statement].
The Virtual Machine module is responsible for running the SQL statement 
................................................................................

<p>
This article is about the bottom layer.
</p>

<p>The OS Interface - also called the "VFS" - is what makes SQLite 
portable across operating systems.  Whenever any of the other modules
in SQLite needs to communicate with the operating
system, they invoke methods in the VFS.  The VFS then invokes the
operating-specific code needed to satisfy the request.
Hence, porting SQLite to a new
operating system is simply a matter of writing a new OS interface layer
or "VFS".</p>

<h2>2.0 Multiple VFSes</h2>

<p>
The standard SQLite source tree contains built-in VFSes for os/2, unix,
................................................................................

<p>
Multiple VFSes can be registered at the same time.
Each VFS has a unique names.
Separate [database connections] within the same process can be using
different VFSes at the same time.   For that matter, if a single
database connection has multiple database files open using
the [ATTACH] command, then each attached database might be using a
different VFS.
</p>

<p>
Unix builds come with multiple VFSes built-in.  The default VFS
for unix is called "unix" and is the VFS used in an overwhelming
majority of applications.  Other VFSes that can be found in unix
may include:
</p>

<ol>
<li><p><b>unix-dotfile</b> - uses dot-file locking rather than
          POSIX advisory locks.
................................................................................
<li><p><b>unix-none</b> - all file locking operations are no-ops.
<li><p><b>unix-namedsem</b> - uses named semaphores for file locking.
       VXWorks only.
</ol>

<p>
The various unix VFSes differ only in the way they handle file locking -
they share most of their implementation in common with one another and
are all located in the same SQLite source file:  
[http://www.sqlite.org/src/doc/trunk/src/os_unix.c | os_unix.c].
Note that except for "unix" and "unix-excl", the various unix VFSes all
use incompatible locking implementations.  If two processes are accessing
the same SQLite database using different unix VFSes, they may
not see each others locks and may end up interfering with one another,
resulting in database corruption.  The "unix-none" VFS in particular
does no locking at all and will easily result in database corruption if
used by two or more database connections at the same time.
Programmers are encouraged to use only "unix" or "unix-excl" unless
there is a compelling reason to do otherwise.
</p>

<h2>2.1 Specifying Which VFS To Use</h2>

<p>
There is always one VFS which is the default VFS.  On unix systems,
the "unix" VFS comes up as the default and on windows it is "win32".
................................................................................
If no other actions are taken, new database connections will make use
of the default VFS.
</p>

<p>
The default VFS can be changed by registering or re-registering the
VFS using the [sqlite3_vfs_register()] interface with a second parameter
of 1.  Hence, if a (unix) process to always use the "unix-nolock" VFS 
in place of "unix", the following code would work:
</p>

<blockquote><pre>
sqlite3_vfs_register(sqlite3_vfs_find("unix-nolock"), 1);
</pre></blockquote>

<p>
................................................................................
a VFS specified as the fourth argument to [sqlite3_open_v2()].  The
default VFS is used if no VFS is specified otherwise.
</p>

<h2>2.2 VFS Shims</h2>

<p>
From the point of view of the uppers layers of the SQLite stack, each
open database file uses exactly one VFS.
But in practice, a particular VFS might
just be a thin wrapper around another VFS that does the real work.
We call a wrapper VFS a "shim".
</p>

<p>
A simple example of a shim is the "vfstrace" VFS.  This is a VFS
(implemented in the 
[http://www.sqlite.org/src/doc/trunk/src/test_vfstrace.c | test_vfstrace.c]
source file) that writes a message associated with each VFS method call
into a log file, then passes control off to another VFS to do the actual
work.
</p>

<h2>2.3 Other Example VFSes</h2>

................................................................................
<p>
The following are other VFS implementations available in the public
SQLite source tree:
</p>

<ul>
<li><p>
[http://www.sqlite.org/src/doc/trunk/src/test_demovfs.c | test_demovfs.c] - 
This file implements a very simple VFS named "demo" that uss POSIX 
functions such as
open(), read(), write(), fsync(), close(), fsync(), sleep(), time(),
and so forth.  This VFS only works on unix systems.  But it is not
intended as a replacement for the standard "unix" VFS used by default
on unix platforms.  The "demo" VFS is deliberately kept very simple
so that it can be used as a learning aid or as template for building
other VFSes or for porting SQLite to new operating systems.

<li><p>
[http://www.sqlite.org/src/doc/trunk/src/test_quota.c | test_quota.c] - 
This file implements a shim called "quota" that enforces cumulative
file size limits on a collection of database files.  An auxiliary
interface is used to define "quote groups".  A quota group is a
set of files (database files, journals, and temporary files) whose
names all match a [GLOB] pattern.  The sum of the sizes of all files
in each quota group is tracked, and if that sum exceeds a threshold
defined for the quota group, a callback function is invoked.  That
callback can either increase the threshold or cause the operation
that would have exceeded the quota to fail with an 
[SQLITE_FULL] error.  One of the uses of this shim is used to enforce 
resource limits on application databases in Firefox.

<li><p>
[http://www.sqlite.org/src/doc/trunk/src/test_multiplex.c | test_multiplex.c] - 
This file implements a shim that allows database files to exceed the
maximum file size of the underlying filesystem.  This shim presents
an interface to the upper six layers of SQLite that makes it look like
very large files are being used, when in reality each such large file
is split up into many smaller files on the underlying system.
This shim has been used, for example, to allow databases to grow
larger than 2 gibibytes on FAT16 filesystems.

<li><p>
[http://www.sqlite.org/src/doc/trunk/src/test_onefile.c | test_onefile.c] - 
This file implements a demonstration VFS named "fs" that shows how SQLite 
can be used on an embedded device that lacks a filesystem.  Content is
written directly to the underlying media.  A VFS derived from this
demonstration code could be used by a gadget with a limited amount of
flash memory to make SQLite behave as the filesystem for the flash memory
on the device.

<li><p>
[http://www.sqlite.org/src/doc/trunk/src/test_journal.c | test_journal.c] - 
This file implements a shim used during SQLite testing that verifies that
the database and rollback journal are written in the correct order and
are "synced" at appropriate times in order to guarantee that the database
can recover from a power lose are hard reset at any time.  The shim
checks several invariants on the operation of databases and rollback
journals and raises exceptions if any of those invariants are violated.
These invariants, in turn, assure that the database is always recoverable.
Running a large suite of test cases using this shim provides added
assurance that SQLite databases will not be damaged by unexpected
power failures or device resets.

<li><p>
[http://www.sqlite.org/src/doc/trunk/src/test_vfs.c | test_vfs.c] - 
This file implements a shim that can be used to simulate filesystem faults.
This shim is used during testing to verify that SQLite responses sanely
to hardware malfunctions or to other error conditions such as running out
of filesystem space that are difficult to test on a real system.
</ul>

<p>

Changes to pages/wal.in.

20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
...
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
...
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
...
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
...
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
<li>WAL uses many fewer fsync() operations and is thus less vulnerable to
    problems on systems where the fsync() system call is broken.
</ol>

<p>But there are also disadvantages:</p>

<ol>
<li>WAL normally requires that the [sqlite3_vfs | VFS] 
    supports shared-memory primitives.
    (Exception: [WAL without shared memory])
    The built-in unix and windows VFSes
    support this but third-party extension VFSes for custom operating
    systems might not.
<li>All processes using a database must be on the same host computer;
    WAL does not work over a network filesystem.
<li>Transactions that involve changes against multiple [ATTACH | ATTACHed]
................................................................................

<blockquote><pre>
PRAGMA journal_mode=WAL;
</pre></blockquote>

<p>The journal_mode pragma returns a string which is the new journal mode.
On success, the pragma will return the string "<tt>wal</tt>".  If 
the conversion to WAL could not be completed (for example, if the VFS
does not support the necessary shared-memory primitives) then the
journaling mode will be unchanged and the string returned from the
primitive will be the prior journaling mode (for example "<tt>delete</tt>").

<a name="how_to_checkpoint"></a>
<h3>Automatic Checkpoint</h3>

................................................................................
operations.</p>

<p>An efficient implementation of the [WAL read algorithm] requires that
there exist a hash table in shared memory over the content of the WAL file.
This hash table is called the [wal-index].
The wal-index is in shared memory, and so technically it does not have 
to have a name in the host computer filesystem.  Custom
[sqlite3_vfs | VFS] implementations are free to implement shared 
memory in any way they see fit, but the default unix and windows 
drivers that come built-in with SQLite implement shared memory
using <a href="http://en.wikipedia.org/wiki/Mmap">mmapped files</a>
named using the suffix "<tt>-shm</tt>" and
located in the same directory as the database file.  The wal-index must
be rebuilt upon first access, even by readers, and so in order to open
the WAL database, write access is required on the "<tt>-shm</tt>" shared
................................................................................
think this is a major concern since the wal-index rarely exceeds
32 KiB in size and is never synced.  Furthermore, the wal-index 
backing file is deleted when the last database connection disconnects,
which often prevents any real disk I/O from ever happening.</p>

<p>Specialized applications for which the default implementation of
shared memory is unacceptable can devise alternative methods via a
custom [sqlite3_vfs | VFS].  
For example, if it is known that a particular database
will only be accessed by threads within a single process, the wal-index
can be implemented using heap memory instead of true shared memory.</p>

<tcl>hd_fragment noshm {WAL without shared memory}</tcl>
<h2>Use of WAL Without Shared-Memory</h2>

................................................................................
<p>Beginning in SQLite version 3.7.4, ^WAL databases can be created, read, and
written even if shared memory is unavailable as long as the
[locking_mode] is set to EXCLUSIVE before the first attempted access.
In other words, a process can interact with
a WAL database without using shared memory if that
process is guaranteed to be the only process accessing the database.
^This feature allows WAL databases to be created, read, and written
by legacy [sqlite3_vfs | VFSes] that lack the "version 2" shared-memory
methods xShmMap, xShmLock, xShmBarrier, and xShmUnmap on the
[sqlite3_io_methods] object.</p>

<p>^(If EXCLUSIVE locking mode is set prior to the first WAL-mode 
database access, then SQLite never attempts to call any of the
shared-memory methods and hence no shared-memory
wal-index is ever created.)^







|
|







 







|







 







|







 







|







 







|







20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
...
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
...
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
...
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
...
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
<li>WAL uses many fewer fsync() operations and is thus less vulnerable to
    problems on systems where the fsync() system call is broken.
</ol>

<p>But there are also disadvantages:</p>

<ol>
<li>WAL normally requires that the [VFS] 
    support shared-memory primitives.
    (Exception: [WAL without shared memory])
    The built-in unix and windows VFSes
    support this but third-party extension VFSes for custom operating
    systems might not.
<li>All processes using a database must be on the same host computer;
    WAL does not work over a network filesystem.
<li>Transactions that involve changes against multiple [ATTACH | ATTACHed]
................................................................................

<blockquote><pre>
PRAGMA journal_mode=WAL;
</pre></blockquote>

<p>The journal_mode pragma returns a string which is the new journal mode.
On success, the pragma will return the string "<tt>wal</tt>".  If 
the conversion to WAL could not be completed (for example, if the [VFS]
does not support the necessary shared-memory primitives) then the
journaling mode will be unchanged and the string returned from the
primitive will be the prior journaling mode (for example "<tt>delete</tt>").

<a name="how_to_checkpoint"></a>
<h3>Automatic Checkpoint</h3>

................................................................................
operations.</p>

<p>An efficient implementation of the [WAL read algorithm] requires that
there exist a hash table in shared memory over the content of the WAL file.
This hash table is called the [wal-index].
The wal-index is in shared memory, and so technically it does not have 
to have a name in the host computer filesystem.  Custom
[VFS] implementations are free to implement shared 
memory in any way they see fit, but the default unix and windows 
drivers that come built-in with SQLite implement shared memory
using <a href="http://en.wikipedia.org/wiki/Mmap">mmapped files</a>
named using the suffix "<tt>-shm</tt>" and
located in the same directory as the database file.  The wal-index must
be rebuilt upon first access, even by readers, and so in order to open
the WAL database, write access is required on the "<tt>-shm</tt>" shared
................................................................................
think this is a major concern since the wal-index rarely exceeds
32 KiB in size and is never synced.  Furthermore, the wal-index 
backing file is deleted when the last database connection disconnects,
which often prevents any real disk I/O from ever happening.</p>

<p>Specialized applications for which the default implementation of
shared memory is unacceptable can devise alternative methods via a
custom [VFS].  
For example, if it is known that a particular database
will only be accessed by threads within a single process, the wal-index
can be implemented using heap memory instead of true shared memory.</p>

<tcl>hd_fragment noshm {WAL without shared memory}</tcl>
<h2>Use of WAL Without Shared-Memory</h2>

................................................................................
<p>Beginning in SQLite version 3.7.4, ^WAL databases can be created, read, and
written even if shared memory is unavailable as long as the
[locking_mode] is set to EXCLUSIVE before the first attempted access.
In other words, a process can interact with
a WAL database without using shared memory if that
process is guaranteed to be the only process accessing the database.
^This feature allows WAL databases to be created, read, and written
by legacy [VFSes] that lack the "version 2" shared-memory
methods xShmMap, xShmLock, xShmBarrier, and xShmUnmap on the
[sqlite3_io_methods] object.</p>

<p>^(If EXCLUSIVE locking mode is set prior to the first WAL-mode 
database access, then SQLite never attempts to call any of the
shared-memory methods and hence no shared-memory
wal-index is ever created.)^