Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Fleshing out the WAL documentation. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
a7c7d3a520e4af96d00f417c6b4737eb |
User & Date: | drh 2010-05-07 16:18:29.000 |
Context
2010-05-08
| ||
14:05 | Fix typos in the testing document. Change "insure" to "ensure" throughout the website. Ticket [6afbaac77a52917fc]. (check-in: 8c1658c1e2 user: drh tags: trunk) | |
2010-05-07
| ||
16:18 | Fleshing out the WAL documentation. (check-in: a7c7d3a520 user: drh tags: trunk) | |
02:46 | Add the new Kreibich book. Add preliminary documentation on WAL pragmas. Refactor the pragma.html document. (check-in: f1676af6d8 user: drh tags: trunk) | |
Changes
Changes to pages/changes.in.
︙ | ︙ | |||
39 40 41 42 43 44 45 | } hd_close_aux hd_enable_main 1 } } chng {2010 July 1 (3.7.0)} { | | | 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 | } hd_close_aux hd_enable_main 1 } } chng {2010 July 1 (3.7.0)} { <li> Added support for [WAL | write-ahead logging]. <li> Query planner enhancements } chng {2010 March 30 (3.6.23.1)} { <li> Fix a bug in the offsets() function of [FTS3] <li> Fix a missing "sync" that when omitted could lead to database corruption if a power failure or OS crash occurred just as a |
︙ | ︙ |
Changes to pages/fileformat2.in.
︙ | ︙ | |||
102 103 104 105 106 107 108 | <tr><th>Offset<th>Size<th>Description <tr><td valign=top align=center>0<td valign=top align=center>16<td align=left> The header string: "SQLite format 3\000" <tr><td valign=top align=center>16<td valign=top align=center>2<td align=left> The database page size in bytes. Must be a power of two between 512 and 32768 inclusive. <tr><td valign=top align=center>18<td valign=top align=center>1<td align=left> | | | | 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 | <tr><th>Offset<th>Size<th>Description <tr><td valign=top align=center>0<td valign=top align=center>16<td align=left> The header string: "SQLite format 3\000" <tr><td valign=top align=center>16<td valign=top align=center>2<td align=left> The database page size in bytes. Must be a power of two between 512 and 32768 inclusive. <tr><td valign=top align=center>18<td valign=top align=center>1<td align=left> File format write version. 1 for legacy; 2 for [WAL]. <tr><td valign=top align=center>19<td valign=top align=center>1<td align=left> File format read version. 1 for legacy; 2 for [WAL]. <tr><td valign=top align=center>20<td valign=top align=center>1<td align=left> Bytes of unused "reserved" space at the end of each page. Usually 0. <tr><td valign=top align=center>21<td valign=top align=center>1<td align=left> Maximum embedded payload fraction. Must be 64. <tr><td valign=top align=center>22<td valign=top align=center>1<td align=left> Minimum embedded payload fraction. Must be 32. <tr><td valign=top align=center>23<td valign=top align=center>1<td align=left> |
︙ | ︙ |
Changes to pages/wal.in.
1 2 3 4 5 | <title>Write-Ahead Logging</title> <tcl>hd_keywords {WAL} {write-ahead log}</tcl> <h1 align="center">Write-Ahead Logging</h1> | | | < | | < < | | | | > | | | > > | | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 | <title>Write-Ahead Logging</title> <tcl>hd_keywords {WAL} {write-ahead log}</tcl> <h1 align="center">Write-Ahead Logging</h1> <p>The default method by which SQLite implements [atomic commit | atomic commit and rollback] is a [rollback journal]. Beginning with [version 3.7.0], a new "Write-Ahead Log" option (hereafter referred to as "WAL") is available.</p> <p>There are advantages and disadvantages to using WAL instead of a rollback journal. Advantages include:</p> <ol> <li>WAL is significately faster in the common case of many small transactions. <li>WAL provides more concurrency as readers do not block writers and are not blocked by a writer. Reading and writing can proceed concurrently. <li>The sequence of disk I/O operations tends to be more sequential using WAL. <li>WAL uses many fewer fsync() operations and is thus less vulnerable to problems on systems where the fsync() system call is broken. </ol> <p>But there are also disadvantages:</p> <ol> <li>WAL is only available when the [sqlite3_vfs | VFS] supports shared-memory primitives. The built-in unix and windows VFSes support this but third-party extension VFSes for custom operating systems might not. <li>All processes using a database must be on the same host computer; WAL does not work over a network filesystem. <li>Transactions that involve changes against multiple [ATTACH | ATTACHed] databases are atomic for each individual database, but are not atomic across all databases as a set. <li>With WAL, it is not possible to change the database page sizes using [VACUUM] or when restoring from a backup using the [backup API]. <li>WAL can be slower than the traditional rollback-journal approach for large transactions (such as when using the [VACUUM] command). <li>There is the extra opertion of [checkpointing] which, though automatic by default, is still something that application developers need to be mindful of. </ol> <h2>How WAL Works</h2> <p>The traditional rollback journal works by writing a copy of the original unchanged database content into a separate rollback journal file and then writing changes directly into the database file. In the event of a crash or [ROLLBACK], the original content contained in the rollback journal is played back into the database file to revert the database file to its original state. The [COMMIT] occurs when the rollback journal is deleted.</p> <p>The WAL approach inverts this. The original content is preserved in the database file and the changes are appended into a separate WAL file. A [COMMIT] occurs when a special record indicating a commit is appended to the WAL. Thus a COMMIT can happen without ever writing to the original database, which allows reader to continue operating from the original, unaltered database while changes are simultaneously commited into the WAL. Multiple transactions can be appended to the end of a single WAL file.</p> <tcl>hd_fragment ckpt checkpoint checkpointed checkpointing</tcl> <h3>Checkpointing</h3> <p>Of course, one wants to eventually transfer all the transactions that are appended in the WAL file back into the original database. Moving the WAL file transactions back into the database is called a "<i>checkpoint</i>".<p> <p>Another way to think about the difference between rollback and write-ahead log is that in the rollback-journal approach, there are two primitive operations: reading from the database and writing changes into the database but in the write-ahead log approach, there are now three primitive operations: reading, writing, and checkpointing.</p> <tcl>hd_fragment concurrency {WAL concurrency}</tcl> <h3>Concurrency</h3> <p>When a read operation begins on a WAL-mode database, it first remembers the location of the last valid commit record in the WAL. Call this point the "end mark". Because the WAL can be growing and adding new commit records while various readers connect to the database, each reader can potentially have its own end mark. But for any particular reader, the end mark is unchanged for the duration of the transaction, thus insuring that a single read transaction only sees the database content as it existed at a single point in time.</p> <p>When a reader needs a page of content, it first checks the WAL to see if that page appears there, and if so it pulls in the last copy of the page that occurs in the WAL prior to the reader's end mark. If no copy of the page exists in the WAL prior to the reader's end mark, then the page is read from the original database file. Readers can exist in separate processes, so to avoid forcing every reader to scan the entire WAL looking for pages it needs (since WAL file can grow to multiple megabytes, depending on how often checkpoints are run), a data structure called the wal-index is maintained in shared memory which helps readers locate pages in the WAL quickly and with a minimum of I/O. The wal-index greatly improves the performance of readers, but the use of shared memory means that all readers must exist on the same machine. This is why the write-ahead log implementation will not work on a network filesystem.</p> <p>Writers merely append new content to the end of the WAL file. Because writers do nothing that would interfere with the actions of readers, writers and readers can run at the same time. However, since there is only one WAL file, there can only be one writer at a time.</p> <p>A checkpoint operation takes all the content from the WAL file and transfers it back into the original database file. As long as all readers are looking at the entire WAL file (in other words, as long as no reader is using an end mark that is earlier than the actual end of the WAL) then it will be OK for the checkpoint to run in parallel with the readers. The database file changes that the checkpointer makes will not interfere with the readers, because the database pages changed all also exist in the WAL and so the readers will never access them. Thus, a checkpoint will never block a reader and readers will not block a checkpoint as long as all readers are looking at the entire WAL. But, older readers that only see a prefix of the WAL will block a checkpoint. And writers will block a checkpoint. And a checkpoint will block a writer.</p> <tcl>hd_fragment fast</tcl> <h3>Performance Considerations</h3> <p>Write transactions are very fast since they only involve writing the content once (versus twice for rollback-journal transactions) and because the writes are all sequential. Further, syncing the content to the disk is not required, as long as the application is willing to sacrifice durability following a power loss or hard reboot.</p> <p>On the other hand, read performance deteriorates as the WAL file grows in size since reader must check the WAL file for the content they need and the time needed to check the WAL file is proportional to the size of the WAL file. The wal-index helps readers find content in the WAL file much faster, but performance still falls off with increasing WAL file size. Hence, to maintain good read performance it is important to keep the WAL file size down by run checkpoints at regular intervals.</p> <p>Checkpointing does require sync operations in order to avoid the possibility of database corruption following a power loss or hard reboot. The WAL must be synced to persistent storage prior to the start of the checkpoint and the database file must by synced prior to resetting the WAL. Checkpoint also requires more seeking. The checkpointer makes an effort to do as many sequential page writes to the database as it can (the pages are transferred from WAL to database in ascending order) but even then there will typically be many seek operations interspersed among the page writes. These factors combine to make checkpoints slower than write transactions.</p> <p>The default strategy is to allow successive write transactions to grow the WAL until the WAL becomes about 1000 pages in size, then to run a single checkpoint operation. By default, the checkpoint will be run automatically by the same thread that does the COMMIT that pushes the WAL over its size limit. This has the effect of causing most COMMIT operations to be very fast but an occasional COMMIT that triggers a checkpoint to be much slower. If that effect is undersirable, then the application can disable automatic checkpointing and run the periodic checkpoints in a separate thread, or separate process.</p> <p>Notice that there is a tradeoff between average read performance and average write performance. To maximize the read performance, one wants to keep the WAL as small as possible and hence run checkpoints frequently, perhaps as often as every COMMIT. To maximize write performance, one wants to amortize the cost of each checkpoint over as many writes as possible, meaning that one wants to run checkpoints infrequently and let the WAL grow as large as possible before each checkpoint. The decision of how often to run checkpoints may therefore vary from one application to another depending on the relative read and write performance requirements of the application. The default strategy of running a checkpoint once the WAL reaches 1000 pages seems to work well in test applications on workstations, but other strategies might work better on different platforms or for different workloads.</p> <h2>Activating And Configuring WAL Mode</h2> <p>An SQLite database connection defaults to [journal_mode | journal_mode=DELETE]. To convert to WAL mode, use the following pragma:</p> <blockquote><pre> PRAGMA journal_mode=WAL; </pre></blockquote> <p>The journal_mode pragma returns a string which is the new journal mode. On success, the pragma will return the string "<tt>wal</tt>". If the conversion to WAL could not be completed (for example, if the VFS does not support the necessary shared-memory primitives) then the journaling mode will be unchanged and the string returned from the primitive will be the prior journaling mode (for example "<tt>delete</tt>"). <p>By default, SQLite will automatically checkpoint whenever the WAL file reaches or exceeds 1000 pages of change, or when the last database connection on a database file closes. The default configuration is intended to work well for most applications. But programs that want more control can force a checkpoint using the [wal_checkpoint pragma] or by calling the [sqlite3_wal_checkpoint()] C interface. The automatic checkpoint threshold can be changes or automatic checkpointing can be completely disabled using the [wal_autocheckpoint pragma] or by calling the [sqlite3_wal_autocheckpoint()] C interface. A program can also use [sqlite3_wal_hook()] to register a callback to be invoked whenever any transaction commits to the WAL. This callback can then invoke [sqlite3_wal_checkpoint()] to for a checkpoint based on whatever criteria it thinks is appropriate. (The automatic checkpoint mechanism is implemented as a simple wrapper around [sqlite3_wal_hook()].)</p> <h2>Backwards Compatibility</h2> <p>The database file format is unchanged for WAL mode. However, the WAL file is a new concept and so older versions of SQLite will not know how to recover a crashed SQLite database that was operating in WAL mode when the crash occurred. To prevent older versions of SQLite from trying to recover a WAL-mode database (and making matters worse) the database file format version numbers (bytes 18 and 19 in the [database header]) are incremented in WAL mode. Thus, if an older version of SQLite attempts to connect to an SQLite database that is operating in WAL mode, it will report an error along the lines of "file is encrypted or is not a database".</p> <p>One can explicitly changing out of WAL mode using a pragma such as this:</p> <blockquote><pre> PRAGMA journal_mode=DELETE; </pre></blockquote> <p>Deliberately changing out of WAL mode lowers the database file format version numbers so that older versions of SQLite can again access the database file.</p> |