Documentation Source Text

Check-in [05824b6e6a]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Adding first draft of WAL file format.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 05824b6e6a18312794b134d37c4486596a000d02
User & Date: drh 2010-07-06 01:27:33
Context
2010-07-07
02:03
Update the compile-time options document and footprint size bounds. Updates to the file format document. check-in: 0992cf9ad0 user: drh tags: trunk
2010-07-06
01:27
Adding first draft of WAL file format. check-in: 05824b6e6a user: drh tags: trunk
2010-07-05
15:06
Updates to the file format documentation. check-in: 2d2c9fd38a user: drh tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/fileformat2.in.

1147
1148
1149
1150
1151
1152
1153











































































































































































































may be zero padded out to the next multiple of the sector size and another
journal header may be inserted.  All journal headers within the same
journal must contain the same database page size and sector size.</p>

<p>If M is -1 in the initial journal header, then the number of page records
that follow is computed by figure out how many page records will fit in
the available space of the remainder of the journal file.</p>


















































































































































































































>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
may be zero padded out to the next multiple of the sector size and another
journal header may be inserted.  All journal headers within the same
journal must contain the same database page size and sector size.</p>

<p>If M is -1 in the initial journal header, then the number of page records
that follow is computed by figure out how many page records will fit in
the available space of the remainder of the journal file.</p>

<tcl>hd_fragment walformat {WAL format}</tcl>
<h2>4.0 The Write-Ahead Log</h2>

<p>Beginning with [version 3.7.0], SQLite supports a new transaction
control mechanism called "[WAL | write-ahead log]" or "[WAL]".
When a database is in WAL mode, all connections to that database must
use the WAL.  A particular database will use either a rollback journal
or a WAL, but not both at the same time.
The WAL is always located in the directory as the database
file and has the same name as the database file but with the string
"<tt>-wal</tt>" appended.</p>

<h3>4.1 WAL File Format</h4>

<p>A WAL file consists of a header followed by zero or more "frames".
Each frame records the revised content of a single page from the
database file.  All changes to the database are recorded by writing
frames into the WAL.  Transactions commit when a frame is written that
contains a commit marker.  A single WAL can and usually does record 
multiple transactions.  Periodically, the content of the WAL is
transferred back into the database file in an operation called a
"checkpoint".</p>

<p>A single WAL file can be reused multiple times.  In other words, the
WAL can fill up with frames and then be checkpointed and then new
frames can overwrite the old ones.  A WAL always grows from beginning
toward the end.  Checksums and counters attached to each frame are
used to determine which frames within the WAL are valid and which
are leftovers from prior checkpoints.</p>

<p>The WAL header is 32 bytes in size and consists of the following eight
big-endian 32-bit unsigned integer values:</p>

<center>
<i>WAL Header Format</i><br>
<table width="80%" border=1>
<tr><th>Offset<th>Size<th>Description
<tr><td valign=top align=center>0<td valign=top align=center>4
    <td>Magic number.  0x377f0682 or 0x377f0683
<tr><td valign=top align=center>4<td valign=top align=center>4
    <td>File format version.  Currently 3007000.
<tr><td valign=top align=center>8<td valign=top align=center>4
    <td>Database page size.  Example: 1024
<tr><td valign=top align=center>12<td valign=top align=center>4
    <td>Checkpoint sequence number
<tr><td valign=top align=center>16<td valign=top align=center>4
    <td>Salt-1: random integer incremented with each checkpoint
<tr><td valign=top align=center>20<td valign=top align=center>4
    <td>Salt-2: a different random number for each checkpoint
<tr><td valign=top align=center>24<td valign=top align=center>4
    <td>Checksum-1: First part of a checksum on the first 24 bytes of header
<tr><td valign=top align=center>28<td valign=top align=center>4
    <td>Checksum-2: Second part of the checksum on the first 24 bytes of header
</table>
</center>

<p>Immediately following the wal-header are zero or more frames. Each
frame consists of a 24-byte frame-header followed by a <i>page-size</i> bytes
of page data. The frame-header is six big-endian 32-bit unsigned 
integer values, as follows:

<center>
<i>WAL Frame Header Format</i><br>
<table width="80%" border=1>
<tr><th>Offset<th>Size<th>Description
<tr><td valign=top align=center>0<td valign=top align=center>4
    <td>Page number
<tr><td valign=top align=center>4<td valign=top align=center>4
    <td>For commit records, the size of the database file in pages
        after the commit.  For all other records, zero.
<tr><td valign=top align=center>8<td valign=top align=center>4
    <td>Salt-1 copied from the WAL header
<tr><td valign=top align=center>12<td valign=top align=center>4
    <td>Salt-2 copied from the WAL header
<tr><td valign=top align=center>16<td valign=top align=center>4
    <td>Checksum-1:  Cumulative checksum up through and including this page
<tr><td valign=top align=center>20<td valign=top align=center>4
    <td>Checksum-2:  Second half of the cumulative checksum.
</table>
</center>

<p>A frame is considered valid if and only if the following conditions are
true:</p>

<ol>
<li><p>The salt-1 and salt-2 values in the frame-header match
       salt values in the wal-header</p></li>

<li><p>The checksum values in the final 8 bytes of the frame-header
       exactly match the checksum computed consecutively on the
       WAL header and the first 8 bytes and the content of all frames
       up to and including the current frame.</p></li></li>
</ol>

<tcl>hd_fragment walcksm {WAL checksum algorithm}</tcl>
<h3>4.2 Checksum Algorithm</h3>

<p>The checksum is computed by interpreting the input as
an even number of unsigned 32-bit integers: x(0) through x(N).
The 32-bit integers are big-endian if the
magic number in the first 4 bytes of the WAL header is 0x377f0683 and
the integers are little-endian the magic number is 0x377f0682.
The checksum values are always stored in the frame header in a
big-endian format regardless of which byte order is used to compute
the checksum.</p>

<p>The checksum algorithm only works for content which is a multiple of
8 bytes in length.  In other words, if the puts are x(0) through x(N)
then N must be odd.
The checksum algorithm is as follows:

<blockquote><pre> 
s0 = s1 = 0
for i from 0 to n-1 step 2:
   s0 += x(i) + s1;
   s1 += x(i+1) + s0;
endfor
# result in s0 and s1
</pre></blockquote>

<p>The outputs s0 and s1 are both weighted checksums using fibonacci weights
in reverse order (the largest fibonacci weight occurs on the first element
of the sequence being summed.)  The s1 value spans all 32-bit integer
terms of the sequence whereas s0 omits the final term.</p>

<h3>4.3 Checkpoint Algorithm</h3>

<p>On a [checkpoint], the WAL is first flushed to persistent storage using
the xSync method of the [sqlite3_io_methods | VFS]. 
Then valid content of the WAL is transferred into the database file.
Finally, the database is flushed to persistent storage using another
xSync method call.
The xSync operations serve as write barriers - all writes launched
before the xSync must complete before any write that launches after the
xSync begins.</p>

<p>After each checkpoint, the WAL header salt-1 value is incremented and the 
salt-2 value is randomized.  This prevents old and new frames in the WAL from
being considered valid at the same time and being checkpointing together
following a crash.</p>

<tcl>hd_fragment walread {WAL read algorithm}</tcl>
<h3>4.4 Reader Algorithm</h3>

<p>To read a page from the database (call it page number P), a reader
first checks the WAL to see if it contains page P.  If so, then the
last valid instance of page P that is a followed by a commit frame
or is a commit frame itself becomes the value read.  If the WAL
contains no copies of page P that are valid and which are a commit
frame or are followed by a commit frame, then page P is read from
the database file.</p>

<p>To start a read transaction, the reader records the index of the last
valid frame in the WAL.  The reader uses this recorded "mxFrame" value
for all subsequent read operations.  New transactions can be appended
to the WAL, but as long as the reader uses its original mxFrame value
and ignores the newly appended content, it will see a consistent snapshot
of the database from a single point in time.  This technique allows
multiple concurrent readers to view different versions of the database
content simultaneously.</p>

<p>The reader algorithm in the previous paragraphs works correctly, but 
because frames for page P can appear anywhere within the WAL, the
reader has to scan the entire WAL looking for page P frames.  If the
WAL is large (multiple megabytes is typical) that scan can be slow,
and read performance suffers.  To overcome this problem, a separate
data structure called the wal-index is maintained to expedite the
search for frames of a particular page.</p>

<tcl>hd_fragment walindexformat {WAL-index format}</tcl>
<h3>4.5 WAL-Index Format</h3>

<p>Conceptually, the wal-index is shared memory, though the current
VFS implementations use a mmapped file for the wal-index.  Because
the wal-index is shared memory, SQLite does not support 
[PRAGMA journal_mode | journal_mode=WAL] 
on a network filesystem when clients are on different machines.
All users of the database must be able to share the same memory.</p>

<p>The purpose of the wal-index is to answer this question quickly:</p>

<blockquote><i>
Given a page number P and a maximum WAL frame index M,
return the largest WAL frame index for page P that does not exceed M, 
or return NULL if there are no frames for page P that do not exceed M.
</i></blockquote>


<p>The wal-index is transient.  After a crash, the wal-index is
reconstructed from the original WAL file.  The VFS is required
to either truncate or zero the header of the wal-index when the last
connection to it closes.  Because the wal-index is transient, it can
use an architecture-specific format; it does not have to be cross-platform.
Hence, unlike the database and WAL file formats which store all values
as big endian, the wal-index stores multi-byte values in the native
byte order of the host computer.</p>

<p>This document is concerned with the persist state of the database
file, and since the wal-index is a transient structure, no further 
information about the format of the wal-index will be provided here.
Complete details on the format of the wal-index are contained within
comments in SQLite source code.</p>