Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Continuing work on the walformat.html document. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA3-256: |
74c71c744b4646507a0704b070ece467 |
User & Date: | drh 2017-11-13 23:06:54.415 |
Context
2017-11-25
| ||
14:02 | Add a description of the "^" syntax to fts5.in. (check-in: 58bbcf8b2f user: dan tags: trunk) | |
2017-11-13
| ||
23:06 | Continuing work on the walformat.html document. (check-in: 74c71c744b user: drh tags: trunk) | |
17:03 | Continuing work on the walformat.html document. This is an incremental check-in. (check-in: 627feecddc user: drh tags: trunk) | |
Changes
Changes to pages/fileformat2.in.
︙ | ︙ | |||
1805 1806 1807 1808 1809 1810 1811 | ^Then valid content of the WAL is transferred into the database file. ^Finally, the database is flushed to persistent storage using another xSync method call. The xSync operations serve as write barriers - all writes launched before the xSync must complete before any write that launches after the xSync begins.</p> | > > > > > > > > > > > > | > | > > > > > | > | | | > | | | 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 | ^Then valid content of the WAL is transferred into the database file. ^Finally, the database is flushed to persistent storage using another xSync method call. The xSync operations serve as write barriers - all writes launched before the xSync must complete before any write that launches after the xSync begins.</p> <p>A checkpoint need not run to completion. It might be that some readers are still using older transactions with data that is contained in the database file. In that case, transferring content for newer transactions from the WAL file into the database would delete the content out from under readers still using the older transactions. To avoid that, checkpoints only run to completion if all reader are using the last transaction in the WAL. <tcl>hd_fragment walreset {reset the WAL} {WAL reset}</tcl> <h2>WAL Reset</h2> <p>^After a complete checkpoint, if no other connections are in transactions that use the WAL, then subsequent write transactions can overwrite the WAL file from the beginning. This is called "resetting the WAL". ^At the start of the first new write transaction, the WAL header salt-1 value is incremented and the salt-2 value is randomized. These changes to the salts invalidate old frames in the WAL that have already been checkpointed but not yet overwritten, and prevent them from being checkpointed again.</p> <p>The WAL file can optionally be truncated on a reset, but it need not be. Performance is usually a little better if the WAL is not truncate, since filesystems generally will overwrite an existing file faster than they will grow a file. <tcl>hd_fragment walread {WAL read algorithm}</tcl> <h2>Reader Algorithm</h2> <p>^(To read a page from the database (call it page number P), a reader first checks the WAL to see if it contains page P. If so, then the last valid instance of page P that is followed by a commit frame or is a commit frame itself becomes the value read.)^ ^If the WAL contains no copies of page P that are valid and which are a commit frame or are followed by a commit frame, then page P is read from the database file.</p> <p>To start a read transaction, the reader records the number of value frames in the WAL as "mxFrame". ([mxFrame|More detail]) The reader uses this recorded mxFrame value for all subsequent read operations. New transactions can be appended to the WAL, but as long as the reader uses its original mxFrame value and ignores subsequently appended content, the reader will see a consistent snapshot of the database from a single point in time. ^This technique allows multiple concurrent readers to view different versions of the database content simultaneously.</p> <p>The reader algorithm in the previous paragraphs works correctly, but because frames for page P can appear anywhere within the WAL, the reader has to scan the entire WAL looking for page P frames. If the WAL is large (multiple megabytes is typical) that scan can be slow, and read performance suffers. ^To overcome this problem, a separate data structure called the wal-index is maintained to expedite the search for frames of a particular page.</p> <tcl>hd_fragment walindexformat</tcl> <h2>WAL-Index Format</h2> <p>Conceptually, the wal-index is shared memory, though the current VFS implementations use a memory-mapped file for operating-system portability. ^The memory-mapped file is in the same directory as the database and has the same name as the database with a "<tt>-shm</tt>" suffix appended. Because the wal-index is shared memory, SQLite does not support [PRAGMA journal_mode | journal_mode=WAL] on a network filesystem when clients are on different machines, as all clients of the database must be able to share the same memory.</p> <p>The purpose of the wal-index is to answer this question quickly:</p> <blockquote><i> Given a page number P and a maximum WAL frame index M, return the largest WAL frame index for page P that does not exceed M, or return NULL if there are no frames for page P that do not exceed M. |
︙ | ︙ | |||
1877 1878 1879 1880 1881 1882 1883 | Hence, unlike the database and WAL file formats which store all values as big endian, the wal-index stores multi-byte values in the native byte order of the host computer.</p> <p>This document is concerned with the persistent state of the database file, and since the wal-index is a transient structure, no further information about the format of the wal-index will be provided here. | | | | 1897 1898 1899 1900 1901 1902 1903 1904 1905 | Hence, unlike the database and WAL file formats which store all values as big endian, the wal-index stores multi-byte values in the native byte order of the host computer.</p> <p>This document is concerned with the persistent state of the database file, and since the wal-index is a transient structure, no further information about the format of the wal-index will be provided here. Additional details on the format of the wal-index are contained in the separate [WAL-index File Format] document.</p> |
Changes to pages/walformat.in.
1 | <title>WAL-mode File Format</title> | | | | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | <title>WAL-mode File Format</title> <tcl>hd_keywords {wal-index} {WAL-index format} {WAL-index File Format}</tcl> <table_of_contents> <p>This document describes low-level details on how [WAL mode] is implemented on unix and windows. <p>The separate [file format] description provides details on the structure of a database file and of the write-head log file used in [WAL mode]. But details of the locking protocol and of the format of the WAL-index are deliberately omitted since those details are left to descretion of individual [VFS] implementations. This document fills in those missing details for the unix and windows [VFSes]. <p>For completeness, some of the higher level formatting information contains in the [file format] document and elsewhere is replicated here, when it pertains to WAL mode processing. |
︙ | ︙ | |||
84 85 86 87 88 89 90 91 92 93 94 95 96 97 | <h2>Variations</h2> <p>When [PRAGMA locking_mode=EXCLUSIVE] is set, only a single client is allowed to have the database open at one time. Since only a single client can use the database, the shm file is omitted. The single client uses a buffer in heap memory as a substitute for the memory-mapped shm file. <h1>The WAL-Index File Format</h1> <p> The WAL-index or "shm" file is used to coordinate access to the database by multiple clients, and as a cache to help clients quickly locate frames within the wal file. | > > > > > > > | 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 | <h2>Variations</h2> <p>When [PRAGMA locking_mode=EXCLUSIVE] is set, only a single client is allowed to have the database open at one time. Since only a single client can use the database, the shm file is omitted. The single client uses a buffer in heap memory as a substitute for the memory-mapped shm file. <p>If a read/write client invokes [sqlite3_file_control]([SQLITE_FCNTL_PERSIST_WAL]) prior to shutdown, then at shutdown a checkpoint is still run, but the shm file and wal file are not deleted. This allows subsequent read-only clients to connect to and read the database. <h1>The WAL-Index File Format</h1> <p> The WAL-index or "shm" file is used to coordinate access to the database by multiple clients, and as a cache to help clients quickly locate frames within the wal file. |
︙ | ︙ | |||
135 136 137 138 139 140 141 | The size of integers may be 8, 16, 32, or 64 bits. A detailed breakout of the individual fields of the shm header follows: <center> <i>WAL-Index Header Details</i><br> <table width="80%" border=1> | | | | | | | | | | | | | | | | | | | | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 | The size of integers may be 8, 16, 32, or 64 bits. A detailed breakout of the individual fields of the shm header follows: <center> <i>WAL-Index Header Details</i><br> <table width="80%" border=1> <tr><th>Bytes</th><th>Name</th><th>Meaning</th></tr> <tr> <td>0..3</td><td>iVersion</td> <td>The WAL-index format version number. Always 3007000.</td> </tr> <tr> <td>4..7</td><td> </td> <td>Unused padding space. Must be zero. </tr> <tr> <td>8..11</td><td>iChange</td> <td>Unsigned integer counter, incremented with each transaction </tr> <tr> <td>12</td><td>isInit</td> <td>The "isInit" flag. 1 when the shm file has been initialized. </tr> <tr> <td>13</td><td>bigEndCksum</td> <td>True if the WAL file uses big-ending checksums. 0 if the WAL uses little-endian checksums. </tr> <tr> <td>14..15</td><td>szPage</td> <td>The database page size in bytes, or 1 if the page size is 65536. </tr> <tr> <td>16..19</td><td>mxFrame</td> <td>Number of valid frames in the WAL file. </tr> <tr> <td>20..23</td><td>nPage</td> <td>Size of the database file in pages. </tr> <tr> <td>24..31</td><td>aFrameCksum</td> <td>Checksum of the last frame in the WAL file. </tr> <tr> <td>32..39</td><td>aSalt</td> <td>The two salt value copied from the WAL file header. </tr> <tr> <td>40..47</td><td>aCksum</td> <td>A checksum over bytes 0 through 39 of this header. </tr> <tr> <td>48..95</td><td> </td> <td>A copy of bytes 0 through 47 of this header. </tr> <tr> <td>96..99</td><td>nBackfill</td> <td>Number of WAL frames that have already been backfilled into the database by prior checkpoints </tr> <tr> <td>100..119</td><td>read-mark[0..4] <td>Five "read marks". Each read mark is a 32-bit unsigned integer (4 bytes). </tr> <tr> <td>120..127</td><td> </td> <td>Unused space set aside for 8 file locks. </tr> <tr> <td>128..132</td><td>nBackfillAttempted</td> <td>Number of WAL frames that have attempted to be backfilled but which might not have been backfilled successfully. </tr> <tr> <td>132..136</td><td> </td> <td>Unused space reserved for futher expansion. </tr> </table> </center> <tcl>hd_fragment mxframe mxFrame</tcl> <h3>The mxFrame field</h3> <p>The 32-bit unisgned integer at offset 16 (and repeated at offset 64) is the number of valid frames in the WAL. <p> When this field is zero, it indicates that the WAL is empty and that all content should be obtained directly from the database file. <p> When mxFrame is equal to [nBackfill], that indicates that all content in the WAL has been written back into the database. In that case, all content can be read directly from the database. Furthermore, the next writer is free to [reset the WAL] if no other connections are holding locks on WAL_READ_LOCK(N) for N>0. <p> The mxFrame value is always greater than or equal to both [nBackfill] and nBackfillAttempted. <tcl>hd_fragment nbackfill nBackfill</tcl> <h3>The nBackfill field</h3> <p>The 32-bit unsigned integer at offset 128 in the WAL-index header is called the "nBackfill". this field holds the number of frames in the WAL file which have been copied back into the main database. <p>The nBackfill number is never greater than [mxFrame]. When nBackfill equals [mxFrame], that means that the WAL content has been completely written back into the database and it is ok to [reset the WAL] if there are no locks held on any of WAL_READ_LOCK(N) for N>0. <p>The nBackfill can only be increased while holding the WAL_CKPT_LOCK. However, nBackfill is changed to zero during a [WAL reset], and this happens while holding the WAL_WRITE_LOCK. <tcl>hd_fragment locks {WAL-mode locks}</tcl> <h3>WAL Locks</h3> <p>Eight bytes of space are set aside in the header to support file locking using the xShmLock() method in the [sqlite3_io_methods] object. These eight bytes are never read nor written by SQLite since |
︙ | ︙ | |||
275 276 277 278 279 280 281 | <td>WAL_READ_LOCK(4) <td>7 <td>127 </tr> </table> </center> | < < < < < < < < < < < < < < < < < | 320 321 322 323 324 325 326 327 328 329 330 331 332 333 | <td>WAL_READ_LOCK(4) <td>7 <td>127 </tr> </table> </center> <p><i>TBD: More information about the header</i> <h2>WAL-Index Hash Tables</h2> <p>The hash tables in the shm file are designed to answer the |
︙ | ︙ | |||
450 451 452 453 454 455 456 | This is different from rollback mode, where the SQLITE_LOCK_SHARED is released at the end of each transaction. <li><p><b>SQLITE_LOCK_EXCLUSIVE</b> <p> Connections hold an exclusive lock when change in between WAL mode and any of the various rollback-modes. Connections might also attempt to | | | | 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 | This is different from rollback mode, where the SQLITE_LOCK_SHARED is released at the end of each transaction. <li><p><b>SQLITE_LOCK_EXCLUSIVE</b> <p> Connections hold an exclusive lock when change in between WAL mode and any of the various rollback-modes. Connections might also attempt to obtain an EXCLUSIVE lock when they disconnect from WAL mode. If a connection is able to obtain an EXCLUSIVE lock, that means it is the only connection to the database and so it may attempt to checkpoint and then delete the WAL-index and WAL files. <p> When a connection is holding a SHARED lock on the main database, that will prevent any other connection from acquiring the EXCLUSIVE lock, which in turn prevents the WAL-index and WAL files from being deleted out from under other users, and prevents a transition out of WAL-mode while other users are accessing the database in WAL-mode. <li><p><b>WAL_WRITE_LOCK</b> <p> The WAL_WRITE_LOCK is only locked exclusively. There is never a shared lock taken on WAL_WRITE_LOCK. <p> An EXCLUSIVE WAL_WRITE_LOCK is held by any connection that is appending content to the end of the WAL. Hence, only a single process at a time can append content to the WAL. If a [WAL reset] occurs as a consequence of a write, then the [nBackfill] field of the WAL-index header is reset to zero while holding this lock. <p>An EXCLUSIVE is also held WAL_WRITE_LOCK, and on several other locking bytes, when a connection is running [recovery] on the shared WAL-index. <li><p><b>WAL_CKPT_LOCK</b> <p> |
︙ | ︙ | |||
514 515 516 517 518 519 520 | they are within a transaction. Connections also obtain an exclusive lock on read locks, one at a time, for the brief moment while they are updating the values of the corresponding read-marks. Read locks 1 through 4 are held exclusively when running [recovery]. <p> Each read lock byte corresponds to one of the five 32-bit read-mark | | | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 | they are within a transaction. Connections also obtain an exclusive lock on read locks, one at a time, for the brief moment while they are updating the values of the corresponding read-marks. Read locks 1 through 4 are held exclusively when running [recovery]. <p> Each read lock byte corresponds to one of the five 32-bit read-mark integers located in bytes 100 through 119 of the WAL-index header, as follows: <center> <table border=1> <tr><th>Lock Name<th>Lock offset<th>Read-mark name<th>Read-mark offset <tr><td>WAL_READ_LOCK(0)<td>123<td>read-mark[0]<td>100..103 <tr><td>WAL_READ_LOCK(1)<td>124<td>read-mark[1]<td>104..107 <tr><td>WAL_READ_LOCK(2)<td>125<td>read-mark[2]<td>108..111 <tr><td>WAL_READ_LOCK(3)<td>126<td>read-mark[3]<td>112..115 <tr><td>WAL_READ_LOCK(4)<td>127<td>read-mark[4]<td>116..119 </table> </center> <p> When a connection holds a shared lock on WAL_READ_LOCK(N), that is a promise by the connection that it will use the WAL and not the database file for any database pages that are modified by the first read-mark[N] entries in the WAL. The read-mark[0] is always zero. If a connection holds a shared lock on WAL_READ_LOCK(0), that means the connection expects to be able to ignore the WAL and read any content it wants from the main database. If N>0 then the connection is free to use more of the WAL file beyond read-mark[N] if it wants to, up to the first mxFrame frames. But when a connection holds a shared lock on WAL_READ_LOCK(0), that is a promise that it will never read content from the WAL and will acquire all content directly from the main database. <p> When a checkpoint runs, if it sees a lock on WAL_READ_LOCK(N), then it must not move WAL content into the main database for more than the first read-mark[N] frames. Were it to do so, it would overwrite content that the process holding the lock was expecting to be able to read out of the main database file. A consequence of if this is that f the WAL file contains more than read-mark[N] frames (if mxFrame>read-mark[N] for any read-mark for which WAL_READ_LOCK(N) is held by another process), then the checkpoint cannot run to completion. <p> When a writer wants to [reset the WAL], it must ensure that there are no locks on WAL_READ_LOCK(N) for N>0 because such locks indicate that some other connection is still using the current WAL file and a [WAL reset] would delete content out from those other connections. It is ok for a [WAL reset] to occur if other connections are holding WAL_READ_LOCK(0) because by holding WAL_READ_LOCK(0), those other connections are promising not to use any content from the WAL. </ul> <h3>Operations that require locks and which locks those operations use</h3> <ul> <li><p><b>Transition into and out of WAL-mode</b> |
︙ | ︙ | |||
564 565 566 567 568 569 570 | without write permission on the database, WAL, or WAL-index files to open the database read-only. If the WAL and WAL-index files are missing, then a process that lacks permission to create and initialize those files will not be able to open the database, unless the database is designated as immutable using the [immutable query parameter]. | | < < < < | 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 | without write permission on the database, WAL, or WAL-index files to open the database read-only. If the WAL and WAL-index files are missing, then a process that lacks permission to create and initialize those files will not be able to open the database, unless the database is designated as immutable using the [immutable query parameter]. <li><p><b>Reconstruct the global shared WAL-index during [recovery]</b> <li><p><b>Append a new transaction to the end of the WAL</b> <li><p><b>Read content from the database and WAL as part of a transaction</b> <li><p><b>Run a checkpoint</b> |
︙ | ︙ |