Overview

Comment:     Updates to the WAL documentation.
SHA1:        8cdaccbb9047043a13a9e650bf1f5ed1
User & Date: drh 2010-07-13 23:51:04.000
Context

2010-07-14
  18:39  Improved documentation of the IN and NOT IN operators.
         Push back the 3.7.0 release date to 2010-07-22.
         (check-in: 07096801a3 user: drh tags: trunk)

2010-07-13
  23:51  Updates to the WAL documentation.
         (check-in: 8cdaccbb90 user: drh tags: trunk)
  16:51  Enhancements to the ext3 barrier problem description in
         lockingv3.html. (check-in: a1de10679b user: drh tags: trunk)
Changes
Changes to pages/fileformat2.in.
︙

Lines 1317-1331 (as revised):

because frames for page P can appear anywhere within the WAL,
the reader has to scan the entire WAL looking for page P frames.
If the WAL is large (multiple megabytes is typical) that scan can
be slow, and read performance suffers.  To overcome this problem,
a separate data structure called the wal-index is maintained
to expedite the search for frames of a particular page.</p>

<tcl>hd_fragment walindexformat {wal-index} {WAL-index format}</tcl>
<h3>4.5 WAL-Index Format</h3>

<p>Conceptually, the wal-index is shared memory, though the current
VFS implementations use a mmapped file for the wal-index.  The
mmapped file is in the same directory as the database and has the
same name as the database with a "<tt>-shm</tt>" suffix appended.
Because the wal-index is shared memory, SQLite does not support

︙
Changes to pages/lockingv3.in.
︙

Lines 461-502 (as revised):

<p>
SQLite uses the fsync() system call to flush data to the disk under
Unix and it uses the FlushFileBuffers() to do the same under Windows.
Once again, SQLite assumes that these operating system services
function as advertised.  But it has been reported that fsync() and
FlushFileBuffers() do not always work correctly, especially with
inexpensive IDE disks.  Apparently some manufacturers of IDE disks
have controller chips that report that data has reached the disk
surface when in fact the data is still in volatile cache memory in
the disk drive electronics.  There are also reports that Windows
sometimes chooses to ignore FlushFileBuffers() for unspecified
reasons.  The author cannot verify any of these reports.  But if
they are true, it means that database corruption is a possibility
following an unexpected power loss.  These are hardware and/or
operating system bugs that SQLite is unable to defend against.
</p>

<tcl>hd_fragment {ext3-barrier-problem} {the ext3 barrier problem}</tcl>
<p>If a Linux <a href="http://en.wikipedia.org/wiki/Ext3">ext3</a>
filesystem is mounted without the "barrier=1" option in the
<a href="http://en.wikipedia.org/wiki/fstab">/etc/fstab</a> and the
disk drive write cache is enabled, then filesystem corruption can
occur following a power loss or OS crash.  Whether or not corruption
can occur depends on the details of the disk controller hardware;
corruption is more likely with inexpensive consumer-grade disks and
less of a problem for enterprise-class storage devices with advanced
features such as non-volatile write caches.  Various ext3 experts
<a href="http://www.redhat.com/archives/ext3-users/2010-July/msg00001.html">
confirm this behavior</a>.  We are told that most Linux distributions
do not use barrier=1 and do not disable the write cache, so most
Linux distributions are vulnerable to this problem.  Note that this
is an operating system and hardware issue and that there is nothing
that SQLite can do to work around it.
<a href="http://ozlabs.org/~rusty/index.cgi/tech/2009-10-20.html">
Other database engines</a> have also run into this same problem.</p>

<p>
If a crash or power failure occurs and results in a hot journal but
that journal is deleted, the next process to open the database will
not know that it contains changes that need to be rolled back.  The
rollback

︙
Changes to pages/sqlite.in.
Lines 1-9 (as revised):

<title>Command Line Shell For SQLite</title>
<tcl>hd_keywords {CLI} {Command Line Interface} {command-line shell}</tcl>

<h1 align=center>
Command Line Shell For SQLite
</h1>

<p>The SQLite library includes a simple command-line utility named
<b>sqlite3</b> (or <b>sqlite3.exe</b> on windows)

︙
Changes to pages/wal.in.
︙

Lines 33-56 (as revised):

<li>Transactions that involve changes against multiple [ATTACH |
ATTACHed] databases are atomic for each individual database, but are
not atomic across all databases as a set.

<li>It is not possible to change the database page size after
entering WAL mode, either on an empty database or by using [VACUUM]
or by restoring from a backup using the [backup API].  You must be
in a rollback journal mode to change the page size.

<li>It is not possible to open [read-only WAL databases].  The
opening process must have write privileges for the "<tt>-shm</tt>"
[wal-index] shared memory file associated with the database, if that
file exists, or else write access on the directory containing the
database file if the "<tt>-shm</tt>" file does not exist.

<li>WAL might be very slightly slower (perhaps 1% or 2% slower) than
the traditional rollback-journal approach in applications that do
mostly reads and seldom write.

<li>There is an additional quasi-persistent "<tt>-wal</tt>" file and
"<tt>-shm</tt>" shared memory file associated with each database,
which can make SQLite less appealing for use as an
[application file-format].

<li>There is the extra operation of [checkpointing] which, though
automatic by default, is still something that application developers
need to be mindful of.
</ol>
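The quasi-persistent "<tt>-wal</tt>" and "<tt>-shm</tt>" files mentioned in this list are easy to observe from any SQLite binding. A minimal sketch (database path is illustrative) using Python's built-in sqlite3 module:

```python
import os
import sqlite3
import tempfile

# Illustrative scratch database path.
dbpath = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(dbpath)

# Switch to WAL mode; the pragma reports the journal mode now in effect.
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# A write transaction brings the companion files into existence.
con.execute("CREATE TABLE t(x)")
con.execute("INSERT INTO t VALUES (1)")
con.commit()

has_wal = os.path.exists(dbpath + "-wal")   # the write-ahead log
has_shm = os.path.exists(dbpath + "-shm")   # the mmapped wal-index
print(mode, has_wal, has_shm)
con.close()
```

While any connection remains open, both companion files persist; when the last connection closes cleanly, SQLite normally removes them.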
︙

Lines 182-196 (as revised):

then there will typically be many seek operations interspersed among
the page writes.  These factors combine to make checkpoints slower
than write transactions.</p>

<p>The default strategy is to allow successive write transactions to
grow the WAL until the WAL becomes about 1000 pages in size, then to
run a checkpoint operation for each subsequent COMMIT until the WAL
is reset to be smaller than 1000 pages.  By default, the checkpoint
will be run automatically by the same thread that does the COMMIT
that pushes the WAL over its size limit.  This has the effect of
causing most COMMIT operations to be very fast but an occasional
COMMIT (those that trigger a checkpoint) to be much slower.  If that
effect is undesirable, then the application can disable automatic
checkpointing and run the periodic checkpoints in a separate thread,
or separate process.  (Links to commands and interfaces to accomplish
this are
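The 1000-page default described here is exposed through the wal_autocheckpoint pragma, and a manual checkpoint can be issued with the wal_checkpoint pragma. A brief sketch using Python's built-in sqlite3 module (the database path is illustrative):

```python
import os
import sqlite3
import tempfile

dbpath = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(dbpath)
con.execute("PRAGMA journal_mode=WAL")

# The automatic checkpoint threshold defaults to 1000 pages.
threshold = con.execute("PRAGMA wal_autocheckpoint").fetchone()[0]
print(threshold)  # 1000

# Disable automatic checkpointing; a housekeeping thread or process
# could then issue checkpoints on its own schedule instead.
con.execute("PRAGMA wal_autocheckpoint=0")

con.execute("CREATE TABLE t(x)")
con.execute("INSERT INTO t VALUES (1)")
con.commit()

# Run a manual checkpoint; the pragma reports
# (busy, log-frames, checkpointed-frames).
busy, log_frames, ckpt_frames = \
    con.execute("PRAGMA wal_checkpoint").fetchone()
print(busy)  # 0 on success
con.close()
```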
︙
Lines 237-372 (as revised):

On success, the pragma will return the string "<tt>wal</tt>".  If
the conversion to WAL could not be completed (for example, if the VFS
does not support the necessary shared-memory primitives) then the
journaling mode will be unchanged and the string returned from the
primitive will be the prior journaling mode (for example
"<tt>delete</tt>").

<a name="how_to_checkpoint"></a>
<h3>Automatic Checkpoint</h3>

<p>By default, SQLite will automatically checkpoint whenever a
[COMMIT] occurs that causes the WAL file to be 1000 pages or more in
size, or when the last database connection on a database file closes.
The default configuration is intended to work well for most
applications.  But programs that want more control can force a
checkpoint using the [wal_checkpoint pragma] or by calling the
[sqlite3_wal_checkpoint()] C interface.  The automatic checkpoint
threshold can be changed or automatic checkpointing can be completely
disabled using the [wal_autocheckpoint pragma] or by calling the
[sqlite3_wal_autocheckpoint()] C interface.  A program can also use
[sqlite3_wal_hook()] to register a callback to be invoked whenever
any transaction commits to the WAL.  This callback can then invoke
[sqlite3_wal_checkpoint()] to force a checkpoint based on whatever
criteria it thinks is appropriate.  (The automatic checkpoint
mechanism is implemented as a simple wrapper around
[sqlite3_wal_hook()].)</p>

<h3>Persistence of WAL mode</h3>

<p>Unlike the other journaling modes,
[journal_mode | PRAGMA journal_mode=WAL] is persistent.  If a process
sets WAL mode, then closes and reopens the database, the database
will come back in WAL mode.  In contrast, if a process sets (for
example) PRAGMA journal_mode=TRUNCATE and then closes and reopens the
database, the database will come back up in the default rollback mode
of DELETE rather than the previous TRUNCATE setting.</p>

<p>The persistence of WAL mode means that applications can be
converted to using SQLite in WAL mode without making any changes to
the application itself.  One has merely to run
"<tt>PRAGMA journal_mode=WAL;</tt>" on the database file(s) using the
[command-line shell] or other utility, then restart the
application.</p>

<p>The WAL journal mode will be set on all connections to the same
database file if it is set on any one connection.</p>

<tcl>hd_fragment {readonly} {read-only WAL databases}</tcl>
<h2>Read-Only Databases</h2>

<p>No SQLite database (regardless of whether or not it is WAL mode)
is readable if it is located on read-only media and it requires
recovery.  So, for example, if an application crashes and leaves an
SQLite database with a [hot journal], that database cannot be opened
unless the opening process has write privilege on the database file,
the directory containing the database file, and the hot journal.
This is because the incomplete transaction left over from the crash
must be rolled back prior to reading the database and that rollback
cannot occur without write permission on all files and the directory
containing them.</p>

<p>A database in WAL mode cannot generally be opened from read-only
media because even ordinary reads in WAL mode require recovery-like
operations.</p>

<p>An efficient implementation of the [WAL read algorithm] requires
that there exist a hash table in shared memory over the content of
the WAL file.  This hash table is called the [wal-index].  The
wal-index is in shared memory, and so technically it does not have to
have a name in the host computer filesystem.  Custom
[sqlite3_vfs | VFS] implementations are free to implement shared
memory in any way they see fit, but the default unix and windows
drivers that come built-in with SQLite implement shared memory using
<a href="http://en.wikipedia.org/wiki/Mmap">mmapped files</a> named
using the suffix "<tt>-shm</tt>" and located in the same directory as
the database file.  The wal-index must be rebuilt upon first access,
even by readers, and so in order to open the WAL database, write
access is required on the "<tt>-shm</tt>" shared memory file if the
file exists, or else write access is required on the directory
containing the database so that the wal-index can be created if it
does not already exist.  This does not preclude custom VFS
implementations that implement shared memory differently from being
able to access read-only WAL databases, but it does prevent the
default unix and windows backends from accessing WAL databases on
read-only media.</p>

<p>Hence, SQLite databases should always be converted to
[journal_mode | PRAGMA journal_mode=DELETE] prior to being
transferred to read-only media.</p>

<p>Also, if multiple processes are to access a WAL mode database,
then all processes should run under user or group IDs that give them
write access to the database files, the WAL file, the shared memory
<tt>-shm</tt> file, and the containing directory.</p>

<h2>Implementation Of Shared-Memory For The WAL-Index</h2>

<p>The [wal-index] is implemented using an ordinary file that is
mmapped for robustness.  Early (pre-release) implementations of WAL
mode stored the wal-index in volatile shared-memory, such as files
created in /dev/shm on Linux or /tmp on other unix systems.  The
problem with that approach is that processes with a different root
directory (changed via
<a href="http://en.wikipedia.org/wiki/Chroot">chroot</a>) will see
different files and hence use different shared memory areas, leading
to database corruption.  Other methods for creating nameless shared
memory blocks are not portable across the various flavors of unix.
And we could not find any method to create nameless shared memory
blocks on windows.  The only way we have found to guarantee that all
processes accessing the same database file use the same shared memory
is to create the shared memory by mmapping a file in the same
directory as the database itself.</p>

<p>Using an ordinary disk file to provide shared memory has the
disadvantage that it might actually do unnecessary disk I/O by
writing the shared memory to disk.  However, the developers do not
think this is a major concern since the wal-index rarely exceeds
32 KiB in size and is never synced.  Furthermore, the wal-index
backing file is deleted when the last database connection
disconnects, which often prevents any real disk I/O from ever
happening.</p>

<p>Specialized applications for which the default implementation of
shared memory is unacceptable can devise alternative methods via a
custom [sqlite3_vfs | VFS].  For example, if it is known that a
particular database will only be accessed by threads within a single
process, the wal-index can be implemented using heap memory instead
of true shared memory.</p>

<h2>Backwards Compatibility</h2>

<p>The database file format is unchanged for WAL mode.  However, the
WAL file and the [wal-index] are new concepts and so older versions
of SQLite will not know how to recover a crashed SQLite database that
was operating in WAL mode when the crash occurred.  To prevent older
versions of SQLite from trying to recover a WAL-mode database (and
making matters worse) the database file format version numbers (bytes
18 and 19 in the [database header]) are increased from 1 to 2 in WAL
mode.
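The persistence of WAL mode, and the recommended conversion back to DELETE mode before shipping a database on read-only media, can both be seen directly. A sketch using Python's built-in sqlite3 module (the database path is illustrative):

```python
import os
import sqlite3
import tempfile

dbpath = os.path.join(tempfile.mkdtemp(), "demo.db")

# One connection sets WAL mode ...
con = sqlite3.connect(dbpath)
con.execute("PRAGMA journal_mode=WAL")
con.execute("CREATE TABLE t(x)")
con.commit()
con.close()

# ... and after a close/reopen the mode is still WAL, because the
# setting is recorded in the database file itself.
con = sqlite3.connect(dbpath)
mode_after_reopen = con.execute("PRAGMA journal_mode").fetchone()[0]
print(mode_after_reopen)  # wal

# Convert back to a rollback journal mode before copying the file
# to read-only media.
mode_final = con.execute("PRAGMA journal_mode=DELETE").fetchone()[0]
print(mode_final)  # delete
con.close()
```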
︙