Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Initial identification of requirements in the fileformat2.html document. |
---|---|
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
8925c8c2e1ab502eb1f261ff0717146b |
User & Date: | drh 2010-08-12 17:55:32 |
Context
2010-08-13
| ||
17:44 | Enhancements to the description of how the COLLATE operator works. check-in: 17093bc7d6 user: drh tags: trunk | |
2010-08-12
| ||
17:55 | Initial identification of requirements in the fileformat2.html document. check-in: 8925c8c2e1 user: drh tags: trunk | |
16:26 | Work on the queryplanner.html document. check-in: e8152063fb user: drh tags: trunk | |
Changes
Changes to pages/fileformat2.in.
31 31 scenario and so are uncommon, but they are part of the state of an SQLite 32 32 database and so should not be ignored. This document will define the format 33 33 of a rollback journal, but most of the content of this document will focus 34 34 on the main database file.</p> 35 35 36 36 <h3>1.1 Pages</h3> 37 37 38 -<p>The main database file consists of one or more pages. The size of a 39 -page is a power of two between 512 and 32768 inclusive. All pages within 40 -the same database are the same size. The page size for a database file 38 +<p>The main database file consists of one or more pages. ^The size of a 39 +page is a power of two between 512 and 65536 inclusive. All pages within 40 +the same database are the same size. ^The page size for a database file 41 41 is determined by the 2-byte integer located at an offset of 42 42 16 bytes from the beginning of the database file.</p> 43 43 44 44 <p>Pages are numbered beginning with 1. The maximum page number is 45 45 2147483646 (2<sup><small>31</small></sup> - 2). The minimum size 46 46 SQLite database is a single 512-byte page. 47 -The maximum size database would be 2147483646 pages at 32768 bytes per 48 -page or 70,368,744,112,128 bytes (about 70 terabytes). Usually SQLite will 47 +The maximum size database would be 2147483646 pages at 65536 bytes per 48 +page or 140,737,488,224,256 bytes (about 140 terabytes). Usually SQLite will 49 49 hit the maximum file size limit of the underlying filesystem or disk 50 50 hardware size limit long before it hits its own internal size limit.</p> 51 51 52 52 <p>In common use, SQLite databases tend to range in size from a few kilobytes 53 53 to a few gigabytes.</p> 54 54 55 55 <p>At any point in time, every page in the main database has a single ................................................................................ 69 69 <li>An index b-tree leaf page 70 70 </ul> 71 71 <li>A payload overflow page 72 72 <li>A pointer map page 73 73 </ul> 74 74 </p> 75 75 76 -<p>All reads from and writes to the main database file begin at a page 77 -boundary and all writes are an integer number of pages in size. Reads 76 +<p>^All reads from and writes to the main database file begin at a page 77 +boundary and all writes are an integer number of pages in size. ^Reads 78 78 are also usually an integer number of pages in size, with the one exception 79 79 that when the database is first opened, the first 100 bytes of the 80 80 database file (the database file header) are read as a sub-page size unit.</p> 81 81 82 -<p>Before any information-bearing page of the database is modified, 82 +<p>^Before any information-bearing page of the database is modified, 83 83 the original unmodified content of that page is written into the 84 84 rollback journal. If a transaction is interrupted and needs to be 85 85 rolled back, the rollback journal can then be used to restore the 86 -database to its original state. Freelist leaf pages bear no 86 +database to its original state. ^Freelist leaf pages bear no 87 87 information that would need to be restored on a rollback and so they 88 88 are not written to the journal prior to modification, in order to 89 89 reduce disk I/O.</p> 90 90 91 91 <tcl>hd_fragment database_header {database header}</tcl> 92 92 <h3>1.2 The Database Header</h3> 93 93 ................................................................................ 147 147 The [version-valid-for number]. 148 148 <tr><td valign=top align=center>96<td valign=top align=center>4<td align=left> 149 149 [SQLITE_VERSION_NUMBER] 150 150 </table></center> 151 151 152 152 <h4>1.2.1 Magic Header String</h4> 153 153 154 -<p>Every SQLite database file begins with the following 16 bytes (in hex): 155 -53 51 4c 69 74 65 20 66 6f 72 6d 61 74 20 33 00. This byte sequence 154 +<p>^Every valid SQLite database file begins with the following 16 bytes 155 +(in hex): 53 51 4c 69 74 65 20 66 6f 72 6d 61 74 20 33 00. This byte sequence 156 156 corresponds to the UTF-8 string "SQLite format 3" including the nul 157 157 terminator character at the end.</p> 158 158 159 159 <h4>1.2.2 Page Size</h4> 160 160 161 161 <p>The two-byte value beginning at offset 16 determines the page size of 162 162 the database. For SQLite versions 3.7.0.1 and earlier, this value is ................................................................................ 164 164 512 and 32768, inclusive. Beginning with SQLite version 3.7.1, a page 165 165 size of 65536 bytes is supported. The value 65536 will not fit in a 166 166 two-byte integer, so to specify a 65536-byte page size, the value is 167 167 at offset 16 is 0x00 0x01. 168 168 This value can be interpreted as a big-endian 169 169 1 and thought of is as a magic number to represent the 65536 page size. 170 170 Or one can view the two-byte field as a little endian number and say 171 -that it represents the page size divided by 256.</p> 171 +that it represents the page size divided by 256. These two 172 +interpretations of the page-size field are equivelent.</p> 172 173 173 174 <h4>1.2.3 File format version numbers</h4> 174 175 175 176 <p>The file format write version and file format read version at offsets 176 177 18 and 19 are intended to allow for enhancements of the file format 177 178 in future versions of SQLite. In current versions of SQLite, both of 178 179 these values are 1 for rollback journalling modes and 2 for [WAL] ................................................................................ 183 184 greater than 2 is encounter, then that database cannot be read or written.</p> 184 185 185 186 <h4>1.2.4 Reserved bytes per page</h4> 186 187 187 188 <p>SQLite has the ability to set aside a small number of extra bytes at 188 189 the end of every page for use by extensions. These extra bytes are 189 190 used, for example, by the SQLite Encryption Extension to store a nonce 190 -and/or cryptographic checksum associated with each page. The 191 +and/or cryptographic checksum associated with each page. ^The 191 192 "reserved space" size in the 1-byte integer at offset 20 is the number 192 193 of bytes of space at the end of each page to reserve for extensions. 193 194 This value is usually 0. The value can be odd.</p> 194 195 195 196 <tcl>hd_fragment usable_size {usable size}</tcl> 196 197 <p>The "usable size" of a database page is the page size specify by the 197 198 2-byte integer at offset 16 in the header less the "reserved" space size 198 199 recorded in the 1-byte integer at offset 20 in the header. The usable 199 -size of a page might be an odd number. However, the usable size is not 200 +size of a page might be an odd number. ^(However, the usable size is not 200 201 allowed to be less than 480. In other words, if the page size is 512, 201 -then the reserved space size cannot exceed 32.</p> 202 +then the reserved space size cannot exceed 32.)^</p> 202 203 203 204 <h4>1.2.5 Payload fractions</h4> 204 205 205 -<p>The maximum and minimum embedded payload fractions and the leaf 206 +<p>^The maximum and minimum embedded payload fractions and the leaf 206 207 payload fraction values must be 64, 32, and 32. These values were 207 208 originally intended to as tunable parameters that could be used to 208 209 modify the storage format of the b-tree algorithm. However, that 209 210 functionality is not supported and there are no current plans to add 210 211 support in the future. Hence, these three bytes are fixed at the 211 212 values specified.</p> 212 213 213 214 <h4>1.2.6 File change counter</h4> 214 215 215 216 <tcl>hd_fragment chngctr {change counter}</tcl> 216 -<p>The file change counter is a 4-byte big-endian integer which is 217 -incremented whenever the database file is changed in rollback mode. 217 +<p>^The file change counter is a 4-byte big-endian integer which is 218 +incremented whenever the database file is unlocked after having 219 +been modified. 218 220 When two or more processes are reading the same database file, each 219 221 process can detect database changes from other processes by monitoring 220 222 the change counter. 221 223 A process will normally want to flush its database page cache when 222 224 another process modified the database, since the cache has become stale. 223 225 The file change counter facilitates this.</p> 224 226 ................................................................................ 225 227 <p>In WAL mode, changes to the database are detected using the wal-index 226 228 and so the change counter is not needed. Hence, the change counter might 227 229 not be incremented on each transaction in WAL mode.</p> 228 230 229 231 <h4>1.2.7 In-header database size</h4> 230 232 231 233 <tcl>hd_fragment filesize {in-header database size}</tcl> 232 -<p>The 4-byte big-endian integer at offset 28 into the header 233 -stores the size of the database file in pages. If this in-header 234 +<p>^The 4-byte big-endian integer at offset 28 into the header 235 +stores the size of the database file in pages. ^If this in-header 234 236 datasize size is not valid (see the next paragraph), then the database 235 237 size is computed by looking 236 238 at the actual size of the database file. Older versions of SQLite 237 239 ignored the in-header database size and used the actual file size 238 -exclusively. Newer versions of SQLite use the in-header database 240 +exclusively. ^Newer versions of SQLite use the in-header database 239 241 size if it is available but fall back to the actual file size if 240 242 the in-header database size is not valid.</p> 241 243 242 -<p>The in-header database size is only considered to be valid if 244 +<p>^The in-header database size is only considered to be valid if 243 245 it is non-zero and if the 4-byte [change counter] at offset 24 244 246 exactly matches the 4-byte [version-valid-for number] at offset 92. 245 -The in-header database size should always be valid 247 +^The in-header database size is always valid 246 248 when the database is only modified using recent versions of SQLite 247 249 (versions 3.7.0 and later). 248 250 If a legacy version of SQLite writes to the database, it will not 249 251 know to update the in-header database size and so the in-header 250 252 database size could be incorrect. But legacy versions of SQLite 251 253 will also leave the version-valid-for number at offset 92 unchanged 252 254 so it will not match the change-counter. Hence, invalid in-header 253 255 database sizes can be detected (and ignored) by observing when 254 256 the change-counter does not match the version-valid-for number.</p> 255 257 256 258 <h4>1.2.8 Free page list</h4> 257 259 258 -<p>Unused pages in the database file are stored on a freelist. The 260 +<p>Unused pages in the database file are stored on a freelist. ^The 259 261 4-byte big-endian integer at offset 32 stores the page number of 260 262 the first page of the freelist, or zero if the freelist is empty. 261 -The 4-byte big-endian integer at offset 36 stores stores the total 263 +^The 4-byte big-endian integer at offset 36 stores stores the total 262 264 number of pages on the freelist.</p> 263 265 264 266 <h4>1.2.9 Schema cookie</h4> 265 267 266 -<p>The schema cookie is a 4-byte big-endian integer at offset 40 268 +<p>^The schema cookie is a 4-byte big-endian integer at offset 40 267 269 that is incremented whenever the database schema changes. A 268 270 prepared statement is compiled against a specific version of the 269 -database schema. Whenever the database schema changes, the statement 270 -must be reprepared. Whenever a prepared statement runs, it first checks 271 +database schema. ^Whenever the database schema changes, the statement 272 +must be reprepared. ^Whenever a prepared statement runs, it first checks 271 273 the schema cookie to make sure the value is the same as when the statement 272 -was prepared and if not it aborts to force the statement to be reprepared.</p> 274 +was prepared and if the schema cookie has changed, the statement aborts 275 +in order to force the statement to be reprepared.</p> 273 276 274 277 <h4>1.2.10 Schema format number</h4> 275 278 276 279 <p>The schema format number is a 4-byte big-endian integer at offset 44. 277 280 The schema format number is similar to the file format read and write 278 281 version numbers at offsets 18 and 19 except that the schema format number 279 282 refers to the high-level SQL formatting rather than the low-level b-tree ................................................................................ 287 290 [ALTER TABLE | ALTER TABLE ... ADD COLUMN] functionality. Support for 288 291 reading and writing format 2 was added in SQLite version 3.1.3 289 292 on 2005-02-19.</li> 290 293 <li value=3>Format 3 adds the ability of extra columns added by 291 294 [ALTER TABLE | ALTER TABLE ... ADD COLUMN] to have non-NULL default 292 295 values. This capability was added in SQLite version 3.1.4 293 296 on 2005-03-11.</li> 294 -<li value=4>Format 4 causes SQLite to respect the DESC keyword on 295 -index declarations. (The DESC keyword is ignored in indices for 297 +<li value=4>^Format 4 causes SQLite to respect the DESC keyword on 298 +index declarations. (^The DESC keyword is ignored in indices for 296 299 formats 1, 2, and 3.) 297 -Format 4 also adds two new boolean record type values ([serial types] 300 +^Format 4 also adds two new boolean record type values ([serial types] 298 301 8 and 9.) Support for format 4 was added in SQLite 3.3.0 on 299 302 2006-01-10.</li> 300 303 </ol> 301 304 302 -<p>New database files created by SQLite use format 1 by default, so 305 +<p>^New database files created by SQLite use format 1 by default, so 303 306 that database files created by newer versions of SQLite can still 304 307 be read by older versions of SQLite. 305 -The [legacy_file_format pragma] can be used to cause SQLite 308 +^The [legacy_file_format pragma] can be used to cause SQLite 306 309 to create new database files using format 4. Future versions of 307 310 SQLite may begin to create files using format 4 by default.</p> 308 311 309 312 <h4>1.2.11 Suggested cache size</h4> 310 313 311 314 <p>The 4-byte big-endian signed integer at offset 48 is the suggest 312 315 cache size in pages for the database file. The value is a suggestion ................................................................................ 313 316 only and SQLite is under no obligation to honor it. The absolute value 314 317 of the integer is used as the suggested size. The suggested cache size 315 318 can be set using the [default_cache_size pragma].</p> 316 319 317 320 <h4>1.2.12 Incremental vacuum settings</h4> 318 321 319 322 <p>The two 4-byte big-endian integers at offsets 52 and 64 are used 320 -to manage the [auto_vacuum] and [incremental_vacuum] modes. If 323 +to manage the [auto_vacuum] and [incremental_vacuum] modes. ^If 321 324 the integer at offset 52 is zero then pointer-map (ptrmap) pages are 322 325 omitted from the database file and neither auto_vacuum nor 323 -incremental_vacuum are supported. If the integer at offset 52 is 326 +incremental_vacuum are supported. ^If the integer at offset 52 is 324 327 non-zero then it is the page number of the largest root page in the 325 328 database file, the database file contain ptrmap pages, and the 326 -mode must be either auto_vacuum or incremental_vacuum. In this latter 329 +mode must be either auto_vacuum or incremental_vacuum. ^In this latter 327 330 case, the integer at offset 64 is true for incremental_vacuum and 328 -false for auto_vacuum. If the integer at offset 52 is zero then 331 +false for auto_vacuum. ^If the integer at offset 52 is zero then 329 332 the integer at offset 64 must also be zero.</p> 330 333 331 334 <h4>1.2.13 Text encoding</h4> 332 335 333 -<p>The 4-byte big-endian integer at offset 56 determines the encoding 334 -used for all text strings stored in the database. A value of 1 means 335 -UTF-8. A value of 2 means UTF-16le. A value of 3 means UTF-16be. 336 +<p>^The 4-byte big-endian integer at offset 56 determines the encoding 337 +used for all text strings stored in the database. ^A value of 1 means 338 +UTF-8. ^A value of 2 means UTF-16le. ^A value of 3 means UTF-16be. 336 339 No other values are allowed.</p> 337 340 338 341 <h4>1.2.14 User version number</h4> 339 342 340 -<p>The 4-byte big-endian integer at offset 60 is the user version which 343 +<p>^The 4-byte big-endian integer at offset 60 is the user version which 341 344 is set and queried by the [user_version pragma]. The user version is 342 345 not used by SQLite.</p> 343 346 344 347 <tcl>hd_fragment validfor {version-valid-for number}</tcl> 345 348 <h4>1.2.15 Write library version number and version-valid-for number</h4> 346 349 347 -<p>The 4-byte big-endian integer at offset 96 stores the 348 -[SQLITE_VERSION_NUMBER] value. The 4-byte big-ending integer at 350 +<p>^The 4-byte big-endian integer at offset 96 stores the 351 +[SQLITE_VERSION_NUMBER] value for the SQLite library that most 352 +recently modified the database file. ^The 4-byte big-ending integer at 349 353 offset 92 is the value of the [change counter] when the version number 350 354 was stored. The integer at offset 92 indicates which transaction 351 355 the version number is valid for and is sometimes called the 352 356 "version-valid-for number". 353 357 354 358 <h4>1.2.16 Header space reserved for expansion</h4> 355 359 ................................................................................ 363 367 inclusive. A database file that is less than or equal to 1073741824 bytes 364 368 in size contains no lock-byte page. A database file larger than 365 369 1073741824 contains exactly one lock-byte page. 366 370 </p> 367 371 368 372 <p>The lock-byte page is set aside for use by the operating-system specific 369 373 VFS implementation in implementing the database file locking primitives. 370 -SQLite will not use the lock-byte page; it will never be read or written 374 +^SQLite will not use the lock-byte page; it will never be read or written 371 375 by the SQLite core, though operating-system specific VFS implementations may 372 376 choose to read or write bytes on the lock-byte page according to the 373 377 needs and proclivities of the underlying system. The unix and win32 374 378 VFS implementations that come built into SQLite do not write to the 375 379 lock-byte page, but we are aware of third-party VFS implementations for 376 380 other operating systems that do sometimes write to the lock-byte page.</p> 377 381 ................................................................................ 385 389 <p>The freelist is organized as a linked list of freelist trunk pages 386 390 with each trunk pages containing page numbers for zero or more freelist 387 391 leaf pages.</p> 388 392 389 393 <p>A freelist trunk page consists of an array of 4-byte big-endian integers. 390 394 The size of the array is as many integers as will fit in the usable space 391 395 of a page. The minimum usable space is 480 bytes so the array will always 392 -be at least 120 entries in length. The first integer in the array 396 +be at least 120 entries in length. ^The first integer in the array 393 397 is the page number of the next freelist trunk page in the list or zero 394 -if this is the last freelist trunk page. The second integer in the array 398 +if this is the last freelist trunk page. ^The second integer in the array 395 399 is the number of leaf page pointers to follow. Call the second integer L. 396 -If L is greater than zero then integers with array indexes between 2 and 400 +^If L is greater than zero then integers with array indexes between 2 and 397 401 L+1 inclusive contain page numbers for freelist leaf pages.</p> 398 402 399 -<p>Freelist leaf pages contain no information. SQLite avoid reading or 403 +<p>Freelist leaf pages contain no information. ^SQLite avoid reading or 400 404 writing freelist leaf pages in order to reduce disk I/O.</p> 401 405 402 406 <p>A bug in SQLite versions prior to 3.6.0 caused the database to be 403 407 reported as corrupt if any last 6 entries in the freelist trunk page 404 408 array contained non-zero values. Newer versions of SQLite do not have 405 -this problem. However, newer versions of SQLite still avoid using the 409 +this problem. ^However, newer versions of SQLite still avoid using the 406 410 last six entries in the freelist trunk page array in order that database 407 411 files created by newer versions of SQLite can be read by older versions 408 412 of SQLite.</p> 409 413 410 -<p>The number of freelist pages is stored as a 4-byte big-endian integer 414 +<p>^The number of freelist pages is stored as a 4-byte big-endian integer 411 415 in the database header at an offset of 36 from the beginning of the file. 412 -The database header also stores the page number of the first freelist trunk 416 +^The database header also stores the page number of the first freelist trunk 413 417 page as a 4-byte big-endian integer at an offset of 32 from the beginning 414 418 of the file.</p> 415 419 416 420 <h3>1.5 B-tree Pages</h3> 417 421 418 422 <p>A b-tree page is either an interior page or a leaf page. 419 423 A leaf page contains keys and in the case of a table b-tree each ................................................................................ 420 424 key has associated content. An interior page contains 421 425 K keys without content but with K+1 pointers to child b-tree pages. 422 426 A "pointer" in an interior b-tree page is just the 31-bit integer 423 427 page number of the child page.</p> 424 428 425 429 426 430 <p>Define the depth 427 -of a leaf b-tree to be 1 and that the depth of any interior b-tree to be one 428 -more than the maximum depth of any of its children. In a well-formed 431 +of a leaf b-tree to be 1 and the the depth of any interior b-tree to be one 432 +more than the maximum depth of any of its children. ^In a well-formed 429 433 database, all children of any one interior b-tree have the same depth.</p> 430 434 431 435 <p>In an interior b-tree page, the pointers and keys logically alternate 432 436 with a pointer on both ends. (The previous sentence is to be understood 433 437 conceptually - the actual layout of the keys and 434 438 pointers within the page is more complicated and will be described in 435 -the sequel.) All keys within the same page are unique and are organized 436 -in ascending order from left to right. For any key X, pointers to the left 439 +the sequel.) All keys within the same page are unique and are logically 440 +organized in ascending order from left to right. (Again, this ordering 441 +is logical, not physical. The actual location of keys within the page 442 +is arbitrary.) ^For any key X, pointers to the left 437 443 of a X refer to b-tree pages on which alls keys are less than or equal to X. 438 -Pointers to the right of X refer to pages where all keys are greater than X.</p> 444 +^Pointers to the right of X refer to pages where all keys are 445 +greater than X.</p> 439 446 440 447 <p>Within an interior b-tree page, each key and the pointer to its 441 448 immediate left are combined into a structure called a "cell". The 442 449 right-most pointer is held separately. A leaf b-tree page has no 443 450 pointers, but it still uses the cell structure to hold keys for 444 451 index b-trees or keys and content for table b-trees.</p> 445 452 </p> ................................................................................ 498 505 499 506 <p>The 100-byte database file header is found only on page 1, which is 500 507 always a table b-tree page. All other b-tree pages in the database file 501 508 omit this 100-byte header.</p> 502 509 503 510 <p>The reserved region is an area of unused space at the end of every 504 511 page (except the locking page) that extensions can use to hold per-page 505 -information. The size of the reserved region is determined by the one-byte 512 +information. ^The size of the reserved region is determined by the one-byte 506 513 unsigned integer found at an offset of 20 into the database file header. 507 514 The size of the reserved region is usually zero.</p> 508 515 509 516 <p>The b-tree page header is 8 bytes in size for leaf pages and 12 510 517 bytes for interior pages. All multibyte values in the page header 511 518 are big-endian. 512 519 The b-tree page header is composed of the following fields:</p> ................................................................................ 513 520 514 521 <center> 515 522 <i>B-tree Page Header Format</i><br> 516 523 <table border=1 width="80%"> 517 524 <tr><th>Offset<th>Size<th>Description 518 525 <tr><td align=center valign=top>0<td align=center valign=top>1<td align=left> 519 526 A flag indicating the b-tree page type 520 -A value of 2 means the page is an interior index b-tree page. 521 -A value of 5 means the page is an interior table b-tree page. 522 -A value of 10 means the page is a leaf index b-tree page. 523 -A value of 13 means the page is a leaf table b-tree page. 524 -Any other value is an error. 527 +^A value of 2 means the page is an interior index b-tree page. 528 +^A value of 5 means the page is an interior table b-tree page. 529 +^A value of 10 means the page is a leaf index b-tree page. 530 +^A value of 13 means the page is a leaf table b-tree page. 531 +^Any other value for the b-tree page type is an error. 525 532 <tr><td align=center valign=top>1<td align=center valign=top>2<td align=left> 526 533 Byte offset into the page of the first freeblock 527 534 <tr><td align=center valign=top>3<td align=center valign=top>2<td align=left> 528 535 Number of cells on this page 529 536 <tr><td align=center valign=top>5<td align=center valign=top>2<td align=left> 530 537 Offset to the first byte of the cell content area. A zero value is used to represent an offset of 65536, which occurs on an empty root page when using a 65536-byte page size. 531 538 <tr><td align=center valign=top>7<td align=center valign=top>1<td align=left> 532 539 Number of fragmented free bytes within the cell content area 533 540 <tr><td align=center valign=top>8<td align=center valign=top>4<td align=left> 534 541 The right-most pointer (interior b-tree pages only) 535 542 </table></blockquote></center> 536 543 537 -<p>The cell pointer array of a b-tree page immediately follows the b-tree 538 -page header. Let K be the number of cells on the btree. The cell pointer 539 -array consists of K 2-byte integer offsets to the cell contents. The 544 +<p>^The cell pointer array of a b-tree page immediately follows the b-tree 545 +page header. Let K be the number of cells on the btree. ^The cell pointer 546 +array consists of K 2-byte integer offsets to the cell contents. ^The 540 547 cell pointers are arranged in key order with left-most cell (the cell with the 541 548 smallest key) first and the right-most cell (the cell with the largest 542 549 key) last.</p> 543 550 544 551 <p>Cell content is stored in the cell content region of the b-tree page. 545 552 SQLite strives to place cells as far toward the end of the b-tree page as 546 553 it can, in order to leave space for future growth of the cell pointer array. 547 554 The area in between the last cell pointer array entry and the beginning of 548 555 the first cell is the unallocated region. 549 556 </p> 550 557 551 -<p>If a page contains no cells (which is only possible for a root page 558 +<p>^If a page contains no cells (which is only possible for a root page 552 559 of a table that contains no rows) then the offset to the cell content 553 -area will equal the page size minus the bytes of reserved space. If 560 +area will equal the page size minus the bytes of reserved space. ^(If 554 561 the database uses a 65536-byte page size and the reserved space is zero 555 -(the usual value for reserved space) then the cell content offset would 556 -want to be 65536. However, that integer is too large to be stored in a 557 -2-byte unsigned integer, so a value of 0 is used in its place. 562 +(the usual value for reserved space) then the cell content offset of an 563 +empty page wants to be 65536. 564 +However, that integer is too large to be stored in a 565 +2-byte unsigned integer, so a value of 0 is used in its place.)^ 558 566 559 567 <p>A freeblock is a structure used to identify unallocated space within 560 -a b-tree page. Freeblocks are organized on a chain. The first 2 bytes of 568 +a b-tree page. Freeblocks are organized as a chain. ^The first 2 bytes of 561 569 a freeblock are a big-endian integer which is the offset in the b-tree page 562 570 of the next freeblock in the chain, or zero if the freeblock is the last on 563 -the chain. The third and fourth bytes of each freeblock form 571 +the chain. ^The third and fourth bytes of each freeblock form 564 572 a big-endian integer which is the size of the freeblock in bytes, including 565 -the 4-byte header. Freeblocks are always connected in order 566 -of increasing offset. The second field of the b-tree page header is the 573 +the 4-byte header. ^Freeblocks are always connected in order 574 +of increasing offset. ^The second field of the b-tree page header is the 567 575 offset of the first freeblock, or zero if there are no freeblocks on the 568 -page. In a well-formed b-tree page, there will always be at least one cell 576 +page. ^In a well-formed b-tree page, there will always be at least one cell 569 577 before the first freeblock.</p> 570 578 571 579 <p>A freeblock requires at least 4 bytes of space. If there is an isolated 572 580 group of 1, 2, or 3 unused bytes within the cell content area, those bytes 573 -comprise a fragment. The total number of bytes in all fragments is stored 574 -in the fifth field of the b-tree page header. In a well-formed b-tree page, 581 +comprise a fragment. ^The total number of bytes in all fragments is stored 582 +in the fifth field of the b-tree page header. ^In a well-formed b-tree page, 575 583 the total number of bytes in fragments may not exceed 60.</p> 576 584 577 585 <p>The total amount of free space on a b-tree page consists of the size 578 586 of the unallocated region plus the total size of all freeblocks plus the 579 -number of fragmented free bytes. SQLite may from time to time reorganize 587 +number of fragmented free bytes. ^SQLite may from time to time reorganize 580 588 a b-tree page so that there are no freeblocks or fragment bytes, all 581 589 unused bytes are contained in the unallocated space region, and all 582 590 cells are packed tightly at the end of the page. This is called 583 591 "defragmenting" the b-tree page.</p> 584 592 585 593 <tcl>hd_fragment varint {variable-length integer} {varint}</tcl> 586 594 ................................................................................ 692 700 the page type. For the following computations, let U be the usable size 693 701 of a database page, the total page size less the reserved space at the 694 702 end of each page. And let P be the payload size.</p> 695 703 696 704 <blockquote><dl> 697 705 <dt>Table B-Tree Leaf Cell:</dt> 698 706 <dd><p> 699 -If the payload size P is less than U-36 then the entire payload is stored 700 -on the b-tree page. Let M be ((U-12)*32/255)-23. If P is greater than U-35 707 +^If the payload size P is less than U-36 then the entire payload is stored 708 +on the b-tree page. ^(Let M be ((U-12)*32/255)-23. If P is greater than U-35 701 709 then the number of byte stored on the b-tree page is the lessor of 702 -M+((P-M)%(U-4)) and U-35. 710 +M+((P-M)%(U-4)) and U-35.)^ 703 711 </p></dd> 704 712 705 713 <dt>Table B-Tree Interior Cell:</dt> 706 714 <dd><p> 707 715 Interior pages of table b-trees have no payload and so there is never 708 716 any payload to spill. 709 717 </p></dd> 710 718 711 719 <dt>Index B-Tree Leaf Or Interior Cell:</dt> 712 720 <dd><p> 713 -If the payload size P is less than ((U-12)*64/255)-22 then the entire 721 +^If the payload size P is less than ((U-12)*64/255)-22 then the entire 714 722 payload is stored on the b-tree page. 715 -Let M be ((U-12)*32/255)-23. If P is greater than ((U-12)*64/255)-23 723 +^(Let M be ((U-12)*32/255)-23. If P is greater than ((U-12)*64/255)-23 716 724 then the number of byte stored on the b-tree page is the lessor of 717 -M+((P-M)%(U-4)) and ((U-12)*64/255)-23. 725 +M+((P-M)%(U-4)) and ((U-12)*64/255)-23.)^ 718 726 </p></dd> 719 727 </dl></blockquote> 720 728 721 729 <p>The overflow thresholds are designed to give a minimum fanout of 722 730 4 for index b-trees and to make sure enough of the payload 723 731 is on the b-tree page that the record header can usually be accessed 724 732 without consulting an overflow page. In hindsight, the designers of ................................................................................ 725 733 the SQLite b-tree logic realize that these thresholds could have been 726 734 made much simpler. However, the computations cannot be now be changed 727 735 without resulting in an incompatible file format. And the current computations 728 736 work well, even if they are a little complex.</p> 729 737 730 738 <h3>1.6 Cell Payload Overflow Pages</h3> 731 739 732 -<p>When the payload of a b-tree cell is too large for the b-tree page, 733 -the surplus is spilled onto overflow pages. Overflow pages form a linked 734 -list. The first four bytes of each overflow page are a big-endian 740 +<p>^When the payload of a b-tree cell is too large for the b-tree page, 741 +the surplus is spilled onto overflow pages. ^Overflow pages form a linked 742 +list. ^The first four bytes of each overflow page are a big-endian 735 743 integer which is the page number of the next page in the chain, or zero 736 -for the final page in the chain. The fifth byte through the last usable 744 +for the final page in the chain. ^The fifth byte through the last usable 737 745 byte are used to hold overflow content.</p> 738 746 739 747 <h3>1.7 Pointer Map or Ptrmap Pages</h3> 740 748 741 749 <p>Pointer map or ptrmap pages are extra pages inserted into the database 742 750 to make the operation of [auto_vacuum] and [incremental_vacuum] modes 743 751 more efficient. Other page types in the database typically have pointers 744 752 from parent to child. For example, a interior b-tree page contains pointers 745 753 to its child b-tree pages and an overflow chain has a pointer 746 754 from earlier to later links in the chain. A ptrmap page contains linkage 747 755 information going in the opposite direction, from child to parent.</p> 748 756 749 -<p>Ptrmap pages must exist in any database file which has a non-zero 757 +<p>^Ptrmap pages must exist in any database file which has a non-zero 750 758 largest root b-tree page value at offset 52 in the database header. 751 -If the largest root b-tree page value is zero, then the database must not 759 +^If the largest root b-tree page value is zero, then the database must not 752 760 contain ptrmap pages.</p> 753 761 754 -<p>In a database with ptrmap pages, the first ptrmap page is page 2. 762 +<p>^In a database with ptrmap pages, the first ptrmap page is page 2. 755 763 A ptrmap page consists of an array of 5-byte entries. Let J be the 756 764 number of 5-byte entries that will fit in the usable space of a page. 757 -(In other words, J=U/5.) The first ptrmap page will contain back pointer 758 -information for pages 3 through J+2, inclusive. The next pointer map 765 +(In other words, J=U/5.) ^The first ptrmap page will contain back pointer 766 +information for pages 3 through J+2, inclusive. ^The second pointer map 759 767 page will be on page J+3 and that ptrmap page will provide back pointer 760 768 information for pages J+4 through 2*J+3 inclusive. An so forth for 761 769 the entire database file.</p> 762 770 763 -<p>In a database that uses ptrmap pages, all pages at locations identified 771 +<p>^(In a database that uses ptrmap pages, all pages at locations identified 764 772 by the computation in the previous paragraph must be ptrmap page and no 765 773 other page may be a ptrmap page. Except, if the byte-lock page happens to 766 774 fall on the same page number as a ptrmap page, then the ptrmap is moved 767 -to the following page for that one case.</p> 775 +to the following page for that one case.)^</p> 768 776 769 777 <p>Each 5-byte entry on a ptrmap page provides back-link information about 770 -one of pages that immediately follow the pointer map. If page B is 778 +one of pages that immediately follow the pointer map. ^(If page B is a 771 779 ptrmap page then back-link information about page B+1 is provided by 772 780 the first entry on the pointer map. Information about page B+2 is 773 -provided by the second entry. And so forth.</p> 781 +provided by the second entry. And so forth.)^</p> 774 782 775 783 <p>Each 5-byte ptrmap entry consists of one byte of "page type" information 776 784 followed by a 4-byte big-endian page number. Five page types are recognized: 777 785 </p> 778 786 779 787 <ol> 780 788 <li>A b-tree root page. The ................................................................................ 787 795 <li>A page in an overflow chain 788 796 other than the first page. The page number is the prior page of the 789 797 overflow chain. 790 798 <li>A non-root b-tree page. The 791 799 page number is the parent b-tree page. 792 800 </ol> 793 801 794 -<p>In any database file that contains ptrmap pages, all b-tree root pages 802 +<p>^In any database file that contains ptrmap pages, all b-tree root pages 795 803 must come before any non-root b-tree page, cell payload overflow page, or 796 -freelist page.</p> 804 +freelist page. This restriction ensures that a root page will never 805 +be moved during an auto-vacuum or incremental-vacuum. The auto-vacuum 806 +logic is not to update the root_page field of the sqlite_master 807 +table and so it is necessary to prevent root pages from being moved 808 +during an auto-vacuum in order to preserve the integrity of the 809 +sqlite_master table. ^Root pages are moved to the beginning of the 810 +database file by the CREATE TABLE, CREATE INDEX, DROP TABLE, and 811 +DROP INDEX operations.</p> 797 812 798 813 <h2>2.0 Schema Layer</h2> 799 814 800 815 <p>The foregoing text describes low-level aspects of the SQLite file 801 -format. The b-tree mechanism provides powerful and efficient means of 816 +format. The b-tree mechanism provides a powerful and efficient means of 802 817 accessing a large data set. This section will describe how the 803 818 low-level b-tree layer is used to implement higher-level SQL 804 819 capabilities.</p> 805 820 806 821 <tcl>hd_fragment record_format {record format}</tcl> 807 822 <h3>2.1 Record Format</h3> 808 823 809 824 <p>The content of a table b-tree leaf page and the key 810 825 of any index b-tree page was characterized above 811 826 as an arbitrary sequence of bytes. 812 827 The prior discussion mentioned one key being less than another, but 813 828 did not define what "less than" meant. The current section will address 814 -these deficiencies.</p> 829 +these omissions.</p> 815 830 816 831 <p>Payload, be it table content or index keys, is always in the "record 817 832 format". The record format defines a sequence of values corresponding to 818 833 to columns in a table or index. The record format specifies the number 819 834 of columns, the datatype of each column, and the content of each column.</p> 820 835 821 836 <p>The record format makes extensive use of the 822 837 [variable-length integer] or [varint] 823 838 representation of 64-bit signed integers defined above.</p> 824 839 825 840 <tcl>hd_fragment serialtype {serial type} {serial types}</tcl> 826 841 <p>A record contains a header and a body, in that order. 827 -The header begins with a single varint which determines the total number 842 +^(The header begins with a single varint which determines the total number 828 843 of bytes in the header. The varint value is the size of the header in 829 -byte including the size varint itself. Following the size varint are 844 +bytes including the size varint itself.)^ ^Following the size varint are 830 845 one or more additional varints, one per column. These additional varints 831 846 are called "serial type" numbers and 832 847 determine the datatype of each column, according to the following key:</p> 833 848 834 849 <center> 835 -<i>Serial Type Codes Of The Record Format</i><br> 850 +^(<i>Serial Type Codes Of The Record Format</i><br> 836 851 <table width="80%" border=1> 837 852 <tr><th>Serial Type<th>Content Size<th>Meaning 838 853 <tr><td valign=top align=center>0<td valign=top align=center>0<td align=left> 839 854 NULL 840 855 <tr><td valign=top align=center>1<td valign=top align=center>1<td align=left> 841 856 8-bit twos-complement integer 842 857 <tr><td valign=top align=center>2<td valign=top align=center>2<td align=left> ................................................................................ 861 876 <tr><td valign=top align=center>N≥12 and even 862 877 <td valign=top align=center>(N-12)/2<td align=left> 863 878 A string in the database encoding and (N-12)/2 bytes in length. 864 879 The nul terminator is omitted. 865 880 <tr><td valign=top align=center>N≥13 and odd 866 881 <td valign=top align=center>(N-13)/2<td align=left> 867 882 A BLOB that is (N-13)/2 bytes in length 868 -</table></center> 883 +</table></center>)^ 869 884 870 885 <p>Note that because of the way varints are defined, the header size varint 871 886 and serial type varints will usually consist of a single byte. The 872 887 serial type varints for large strings and BLOBs might extend to two or three 873 888 byte varints, but that is the exception rather than the rule. 874 889 The varint format is very efficient at coding the record header.</p> 875 890 876 891 <p>The values for each column in the record immediately follow the header. 877 -Note that for serial types 0, 8, 9, 12, and 13, the value is zero bytes in 892 +^(Note that for serial types 0, 8, 9, 12, and 13, the value is zero bytes in 878 893 length. If all columns are of these types then the body section of the 879 -record is empty.</p> 894 +record is empty.)^</p> 880 895 881 896 <h3>2.2 Record Sort Order</h3> 882 897 883 898 <p>The order of keys in an index b-tree is determined by the sort order of 884 899 the records that the keys represent. Record comparison proceeds column 885 900 by column. Columns of a record are examined from left to right. The 886 901 first pair of columns that are not equal determines the relative order 887 902 of the two records. The sort order of individual columns is as 888 903 follows:</p> 889 904 890 905 <ol> 891 -<li>NULL values (serial type 0) sort first 892 -<li>Numeric values (serial types 1 through 9) sort next and in numeric order 893 -<li>Text values (even serial types 12 and larger) sort next in the order 906 +<li>^NULL values (serial type 0) sort first 907 +<li>^Numeric values (serial types 1 through 9) sort next and in numeric order 908 +<li>^Text values (even serial types 12 and larger) sort next in the order 894 909 determined by the columns collating function 895 -<li>BLOB values (odd serial types 13 and larger) sort last in order 910 +<li>^BLOB values (odd serial types 13 and larger) sort last in order 896 911 determined by memcmp(). 897 912 </ol> 898 913 899 914 <p>A collating function for each column is necessary in order to compute 900 915 the order of text fields. SQLite defines three built-in collating functions: 901 916 </p> 902 917 903 918 <blockquote><table border=0 cellspacing=10> 904 -<tr><td valign=top>BINARY 919 +<tr>^<td valign=top>BINARY 905 920 <td>Strings are compared byte by byte using the memcmp() function 906 921 from the standard C library. 907 -<tr><td valign=top>NOCASE 922 +<tr>^<td valign=top>NOCASE 908 923 <td>Like BINARY except that uppercase ASCII characters ('A' through 'Z') 909 924 are folded into their lowercase equivalents prior to running the 910 - comparison. Note that only ASCII characters are case-folded. NOCASE 925 + comparison. Note that only ASCII characters are case-folded. ^NOCASE 911 926 does not implement a general purpose unicode caseless comparison. 912 -<tr><td valign=top>RTRIM 927 +<tr>^<td valign=top>RTRIM 913 928 <td>Like BINARY except that spaces at the end of the string are elided 914 929 prior to comparison. 915 930 </table></blockquote> 916 931 917 -<p>Additional application-specific collating functions can be added to 932 +<p>^Additional application-specific collating functions can be added to 918 933 SQLite using the [sqlite3_create_collation()] interface.</p> 919 934 920 -<p>The default collating function for all strings is BINARY. 921 -Alternative collating functions for table columns can be specified in the 935 +<p>^The default collating function for all strings is BINARY. 936 +^Alternative collating functions for table columns can be specified in the 922 937 [CREATE TABLE] statement using the COLLATE clause on the column definition. 923 -When a column is indexed, the same collating function specified in the 938 +^When a column is indexed, the same collating function specified in the 924 939 [CREATE TABLE] statement is used for the column in the index, by default, 925 940 though this can be overridden using a COLLATE clause in the 926 941 [CREATE INDEX] statement. 927 942 928 943 <h3>2.3 Representation Of SQL Tables</h3> 929 944 930 945 <p>Each ordinary SQL table in the database schema is represented on disk ................................................................................ 931 946 by a table b-tree. Each entry in the table b-tree corresponds to a row 932 947 of the SQL table. The [rowid] of the SQL table is the 64-bit signed 933 948 integer key for each entry in the table b-tree.</p> 934 949 935 950 <p>The content of each SQL table row is stored in the database file by 936 951 first combining the values in the various columns into a byte array 937 952 in the record format, then storing that byte array as the payload in 938 -an entry in the table b-tree. The order of values in the record is 953 +an entry in the table b-tree. ^The order of values in the record is 939 954 the same as the order of columns in the SQL table definition. 940 -When an SQL table that includes an 955 +^When an SQL table that includes an 941 956 [INTEGER PRIMARY KEY] column (which aliases the [rowid]) then that 942 -column appears in the record as a NULL value. SQLite will know to use 957 +column appears in the record as a NULL value. ^SQLite will always use 943 958 the table b-tree key rather than the NULL value when referencing the 944 959 [INTEGER PRIMARY KEY] column.</p> 945 960 946 -<p>If the [affinity] of a column is REAL and that column contains a 961 +<p>^If the [affinity] of a column is REAL and that column contains a 947 962 value that can be converted to an integer without loss of information 948 963 (if the value contains no fractional part and is not too large to be 949 964 represented as an integer) then the column may be stored in the record 950 -as an integer. SQLite will know to convert the value back to floating 965 +as an integer. ^SQLite will convert the value back to floating 951 966 point when extracting it from the record.</p> 952 967 953 968 954 969 <h3>2.4 Representation Of SQL Indices</h3> 955 970 956 -<p>Each SQL index, whether explicitly declared via a [CREATE INDEX] statement 971 +<p>^Each SQL index, whether explicitly declared via a [CREATE INDEX] statement 957 972 or implied by a UNIQUE constraint, corresponds to an index b-tree in the 958 973 database file. 959 -There is one entry in index b-tree for each row in the corresponding table. 960 -The key to an index b-tree is 974 +^There is one entry in index b-tree for each row in the corresponding table. 975 +^The key to an index b-tree is 961 976 a record composed of the columns that are being indexed followed by the 962 977 [rowid] of the table row. Because every row in a table has a unique 963 978 rowid and all keys in an index contain the rowid, all keys in an index 964 979 are unique.</p> 965 980 966 -<p>There is a one-to-one mapping between rows in a table and 981 +<p>^There is a one-to-one mapping between rows in a table and 967 982 entries in each index associated with that table. 968 -Corresponding rows int the index and table b-trees share the same rowid 983 +^Corresponding rows int the index and table b-trees share the same rowid 969 984 value, and contain the same value for all indexed columns.</p> 970 985 971 986 <h3>2.5 Storage Of The SQL Database Schema</h3> 972 987 973 -<p>Page 1 of a database file is the root page of a table b-tree that 988 +<p>^Page 1 of a database file is the root page of a table b-tree that 974 989 holds a special table named "sqlite_master" (or "sqlite_temp_master" in 975 990 the case of a TEMP database) which stores the complete 976 -database schema. The structure of the sqlite_master table is as 991 +database schema. ^(The structure of the sqlite_master table is as 977 992 if it had been created using the following SQL:</p> 978 993 979 994 <blockquote><pre> 980 995 CREATE TABLE sqlite_master( 981 996 type text, 982 997 name text, 983 998 tbl_name text, 984 999 rootpage integer, 985 1000 sql text 986 1001 ); 987 -</pre></blockquote> 1002 +</pre></blockquote>)^ 988 1003 989 -<p>The sqlite_master table contains a row for each table, index, view, 1004 +<p>^The sqlite_master table contains a row for each table, index, view, 990 1005 and trigger in the database schema, except there is no entry for the 991 1006 sqlite_master table itself.</p> 992 1007 993 -<p>The sqlite_master.type column will be one 1008 +<p>^(The sqlite_master.type column will be one 994 1009 of the following text strings: 'table', 'index', 'view', or 'trigger' 995 -according to the type of object defined. The 'table' string is used 996 -for both ordinary and [virtual tables].</p> 1010 +according to the type of object defined. ^The 'table' string is used 1011 +for both ordinary and [virtual tables].)^</p> 997 1012 998 -</p>The sqlite_master.name column will hold the name of the object. 999 -For indices that are automatically created by UNIQUE constraints, the 1000 -name is "sqlite_autoindex_TABLE_N" where TABLE is replaced by the name 1001 -of the table that contains the UNIQUE constraint and N is an integer beginning 1002 -with 1 and increasing by one with each UNIQUE constraint seen.</p> 1013 +</p>^(The sqlite_master.name column will hold the name of the object. 1014 +^For indices that are automatically created by UNIQUE or PRIMARY KEY 1015 +constraints, the name is "sqlite_autoindex_TABLE_N" where TABLE is 1016 +replaced by the name of the table that contains the constraint and N 1017 +is an integer beginning 1018 +with 1 and increasing by one with each constraint seen.)^</p> 1003 1019 1004 1020 <p>The sqlite_master.tbl_name column holds the name of a table or view 1005 -that the object is associated with. For a table or view, the 1006 -tbl_name column is a copy of the name column. For an index, the tbl_name 1007 -is the name of the table that is indexed. For a trigger, the tbl_name 1021 +that the object is associated with. ^For a table or view, the 1022 +tbl_name column is a copy of the name column. ^For an index, the tbl_name 1023 +is the name of the table that is indexed. ^For a trigger, the tbl_name 1008 1024 column stores the name of the table or view that causes the trigger 1009 1025 to fire.</p> 1010 1026 1011 -<p>The sqlite_master.rootpage column stores the page number of the root 1012 -b-tree page for tables and indices. For rows that define views, triggers, 1027 +<p>^(The sqlite_master.rootpage column stores the page number of the root 1028 +b-tree page for tables and indices.)^ ^For rows that define views, triggers, 1013 1029 and virtual tables, the rootpage column is 0 or NULL.</p> 1014 1030 1015 -<p>The sqlite_master.sql column stores SQL text that describes the 1031 +<p>^(The sqlite_master.sql column stores SQL text that describes the 1016 1032 object. This SQL text is a [CREATE TABLE], [CREATE VIRTUAL TABLE], 1017 1033 [CREATE INDEX], 1018 1034 [CREATE VIEW], or [CREATE TRIGGER] statement that if evaluated against 1019 1035 the database file when it is the main database of a [database connection] 1020 -would recreated the object. The text is usually a copy of the original 1036 +would recreated the object.) The text is usually a copy of the original 1021 1037 statement used to create the object but with normalizations applied so 1022 1038 that the text conforms to the following rules: 1023 1039 1024 1040 <ul> 1025 -<li>The CREATE, TABLE, VIEW, TRIGGER, and INDEX keywords at the beginning 1041 +<li>^The CREATE, TABLE, VIEW, TRIGGER, and INDEX keywords at the beginning 1026 1042 of the statement are converted to all upper case letters. 1027 -<li>The TEMP or TEMPORARY keyword is removed if it occurs after the 1043 +<li>^The TEMP or TEMPORARY keyword is removed if it occurs after the 1028 1044 initial CREATE keyword. 1029 -<li>Any database name qualifier that occurs prior to the name of the 1045 +<li>^Any database name qualifier that occurs prior to the name of the 1030 1046 object being created is removed. 1031 -<li>Leading spaces are removed. 1032 -<li>All spaces following the first two keywords are converted into a single 1047 +<li>^Leading spaces are removed. 1048 +<li>^All spaces following the first two keywords are converted into a single 1033 1049 space. 1034 1050 </ul> 1035 1051 1036 -<p>The text in the sqlite_master.sql column is a copy of the original 1052 +<p>^(The text in the sqlite_master.sql column is a copy of the original 1037 1053 CREATE statement text that created the object, except normalized as 1038 -described above and as modified by subsequent [ALTER TABLE] statements.</p> 1054 +described above and as modified by subsequent [ALTER TABLE] statements.)^</p> 1039 1055 1040 -<p>For indices that are automatically created by UNIQUE constraints, 1041 -the sqlite_master.sql field is NULL.</p> 1056 +<p>^(For indices that are automatically created by UNIQUE or 1057 +PRIMARY KEY constraints, the sqlite_master.sql field is NULL.)^</p> 1042 1058 1043 1059 1044 1060 <tcl>hd_fragment rollbackjournal {rollback journal format}</tcl> 1045 1061 <h2>3.0 The Rollback Journal</h2> 1046 1062 1047 1063 <p>The rollback journal is a file associated with each SQLite database 1048 1064 fail that hold information used to restore the database file to its initial 1049 1065 state during the course of a transaction. 1050 -The rollback journal file is always located in the directory as the database 1066 +^The rollback journal file is always located in the directory as the database 1051 1067 file and has the same name as the database file but with the string 1052 1068 "<tt>-journal</tt>" appended. There can only be a single rollback journal 1053 1069 associated with a give database and hence there can only be one write 1054 1070 transaction open against a single database at one time.</p> 1055 1071 1056 1072 <p>If a transaction is aborted due to an application crash, an operating 1057 1073 system crash, or a hardware power failure or crash, then the database may 1058 -be left in an inconsistent state. The next time SQLite attempts to open 1074 +be left in an inconsistent state. ^The next time SQLite attempts to open 1059 1075 the database file, the presence of the rollback journal file will be 1060 1076 detected and the journal will be automatically played back to restore the 1061 1077 database to its state at the start of the incomplete transaction.</p> 1062 1078 1063 -<p>A rollback journal is only considered to be valid if it exists and 1079 +<p>^A rollback journal is only considered to be valid if it exists and 1064 1080 contains a valid header. Hence a transaction can be committed in one 1065 1081 of three ways: 1066 1082 <ol> 1067 -<li>The rollback journal file can be deleted, 1068 -<li>The rollback journal file can be truncated to zero length, or 1069 -<li>The header of the rollback journal can be overwritten with 1083 +<li>^The rollback journal file can be deleted, 1084 +<li>^The rollback journal file can be truncated to zero length, or 1085 +<li>^The header of the rollback journal can be overwritten with 1070 1086 invalid header text (for example, all zeros). 1071 1087 </ol> 1072 -These three ways of committing a transaction correspond to the DELETE, 1088 +^These three ways of committing a transaction correspond to the DELETE, 1073 1089 TRUNCATE, and PERSIST settings, respectively, of the [journal_mode pragma]. 1074 1090 </p> 1075 1091 1076 1092 1077 1093 <p>A valid rollback journal begins with a header in the following format:</p> 1078 1094 1079 1095 <center> 1080 1096 <i>Rollback Journal Header Format</i><br> 1081 1097 <table width="80%" border=1> 1082 1098 <tr><th>Offset<th>Size<th>Description 1083 -<tr><td valign=top align=center>0 1099 +<tr>^(<td valign=top align=center>0 1084 1100 <td valign=top align=center>8 1085 - <td>Header string: 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, 0xd7 1086 -<tr><td valign=top align=center>8 1101 + <td>Header string: 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, 0xd7)^ 1102 +<tr>^(<td valign=top align=center>8 1087 1103 <td valign=top align=center>4 1088 1104 <td>The "Page Count" - The number of pages in the next segment of the 1089 1105 journal, or -1 to 1090 - mean all content to the end of the file 1091 -<tr><td valign=top align=center>12 1106 + mean all content to the end of the file)^ 1107 +<tr>^(<td valign=top align=center>12 1092 1108 <td valign=top align=center>4 1093 - <td>A random nonce for the checksum 1094 -<tr><td valign=top align=center>16 1109 + <td>A random nonce for the checksum)^ 1110 +<tr>^(<td valign=top align=center>16 1095 1111 <td valign=top align=center>4 1096 - <td>Initial size of the database in pages 1097 -<tr><td valign=top align=center>20 1112 + <td>Initial size of the database in pages)^ 1113 +<tr>^(<td valign=top align=center>20 1098 1114 <td valign=top align=center>4 1099 1115 <td>Size of a disk sector assumed by the process that wrote this 1100 - journal. 1101 -<tr><td valign=top align=center>24 1116 + journal.)^ 1117 +<tr>^(<td valign=top align=center>24 1102 1118 <td valign=top align=center>4 1103 - <td>Size of pages in this journal. 1119 + <td>Size of pages in this journal.)^ 1104 1120 </table> 1105 1121 </center> 1106 1122 1107 -<p>A rollback journal header is padded with zeros out to the size of a 1123 +<p>^A rollback journal header is padded with zeros out to the size of a 1108 1124 single sector (as defined by the sector size integer at offset 20). 1109 1125 The header is in a sector by itself so that if a power loss occurs while 1110 1126 writing the sector, information that follows the header will be 1111 1127 (hopefully) undamaged.</p> 1112 1128 1113 -<p>After the header and zero padding are zero or more page records. Each 1129 +<p>^After the header and zero padding are zero or more page records. ^Each 1114 1130 page record stores a copy of the content of a page from the database file 1115 -before it was changed. The same page may not appear more than once 1131 +before it was changed. ^The same page may not appear more than once 1116 1132 within a single rollback journal. 1117 1133 To rollback an incomplete transaction, a process 1118 1134 has merely to read the rollback journal from beginning to end and 1119 1135 write pages found in the journal back into the database file at the 1120 1136 appropriate location.</p> 1121 1137 1122 1138 <p>Let the database page size (the value of the integer at offset 24 ................................................................................ 1123 1139 in the journal header) be N. 1124 1140 Then the format of a page record is as follows:</p> 1125 1141 1126 1142 <center> 1127 1143 <i>Rollback Journal Page Record Format</i><br> 1128 1144 <table width="80%" border=1> 1129 1145 <tr><th>Offset<th>Size<th>Description 1130 -<tr><td valign=top align=center>0 1146 +<tr>^(<td valign=top align=center>0 1131 1147 <td valign=top align=center>4 1132 - <td>The page number in the database file 1133 -<tr><td valign=top align=center>4 1148 + <td>The page number in the database file)^ 1149 +<tr>^(<td valign=top align=center>4 1134 1150 <td valign=top align=center>N 1135 - <td>Original content of the page prior to the start of the transaction 1136 -<tr><td valign=top align=center>N+4 1151 + <td>Original content of the page prior to the start of the transaction)^ 1152 +<tr>^(<td valign=top align=center>N+4 1137 1153 <td valign=top align=center>4 1138 - <td>Checksum 1154 + <td>Checksum)^ 1139 1155 </table> 1140 1156 </center> 1141 1157 1142 1158 1143 -<p>The checksum is an unsigned 32-bit integer computed as follows:</p> 1159 +<p>^(The checksum is an unsigned 32-bit integer computed as follows:</p> 1144 1160 1145 1161 <ol> 1146 1162 <li>Initialize the checksum to the checksum nonce value found in the 1147 1163 journal header at offset 12. 1148 1164 <li>Initialize index X to be N-200 (where N is the size of a database page 1149 1165 in bytes. 1150 1166 <li>Interpret the four bytes at offset X into the page as a 4-byte big-endian 1151 1167 unsigned integer. Add the value of that integer to the checksum. 1152 1168 <li>Subtrace 200 from X. 1153 1169 <li>If X is greater than or equal to zero, go back to step 3. 1154 -</ol> 1170 +</ol>)^ 1155 1171 1156 1172 <p>The checksum value is used to guard against incomplete writes of 1157 1173 a journal page record following a power failure. A different random nonce 1158 1174 is used each time a transaction is started in order to minimize the risk 1159 1175 that unwritten sectors might by chance contain data from the same page 1160 1176 that was a part of prior journals. By changing the nonce for each 1161 1177 transaction, stale data on disk will still generate an incorrect checksum 1162 1178 and be detected with high probability. The checksum only uses a sparce sample 1163 1179 of 32-bit words from the data record for performance reasons - design studies 1164 1180 during the planning phases of SQLite 3.0.0 showed 1165 1181 a significant performance hit in checksumming the entire page.</p> 1166 1182 1167 -<p>Let the page count value at offset 8 in the journal header by M. 1168 -If M is greater than zero then after M page records the journal file 1183 +<p>Let the page count value at offset 8 in the journal header be M. 1184 +^If M is greater than zero then after M page records the journal file 1169 1185 may be zero padded out to the next multiple of the sector size and another 1170 -journal header may be inserted. All journal headers within the same 1186 +journal header may be inserted. ^All journal headers within the same 1171 1187 journal must contain the same database page size and sector size.</p> 1172 1188 1173 -<p>If M is -1 in the initial journal header, then the number of page records 1174 -that follow is computed by figure out how many page records will fit in 1189 +<p>^If M is -1 in the initial journal header, then the number of page records 1190 +that follow is computed by computing how many page records will fit in 1175 1191 the available space of the remainder of the journal file.</p> 1176 1192 1177 1193 <tcl>hd_fragment walformat {WAL format}</tcl> 1178 1194 <h2>4.0 The Write-Ahead Log</h2> 1179 1195 1180 1196 <p>Beginning with [version 3.7.0], SQLite supports a new transaction 1181 1197 control mechanism called "[WAL | write-ahead log]" or "[WAL]". 1182 -When a database is in WAL mode, all connections to that database must 1183 -use the WAL. A particular database will use either a rollback journal 1198 +^When a database is in WAL mode, all connections to that database must 1199 +use the WAL. ^A particular database will use either a rollback journal 1184 1200 or a WAL, but not both at the same time. 1185 -The WAL is always located in the directory as the database 1201 +^The WAL is always located in the directory as the database 1186 1202 file and has the same name as the database file but with the string 1187 1203 "<tt>-wal</tt>" appended.</p> 1188 1204 1189 1205 <h3>4.1 WAL File Format</h4> 1190 1206 1191 1207 <p>A WAL file consists of a header followed by zero or more "frames". 1192 1208 Each frame records the revised content of a single page from the 1193 1209 database file. All changes to the database are recorded by writing 1194 1210 frames into the WAL. Transactions commit when a frame is written that 1195 -contains a commit marker. A single WAL can and usually does record 1211 +contains a commit marker. ^A single WAL can and usually does record 1196 1212 multiple transactions. Periodically, the content of the WAL is 1197 1213 transferred back into the database file in an operation called a 1198 1214 "checkpoint".</p> 1199 1215 1200 -<p>A single WAL file can be reused multiple times. In other words, the 1216 +<p>^A single WAL file can be reused multiple times. ^In other words, the 1201 1217 WAL can fill up with frames and then be checkpointed and then new 1202 -frames can overwrite the old ones. A WAL always grows from beginning 1218 +frames can overwrite the old ones. ^A WAL always grows from beginning 1203 1219 toward the end. Checksums and counters attached to each frame are 1204 1220 used to determine which frames within the WAL are valid and which 1205 1221 are leftovers from prior checkpoints.</p> 1206 1222 1207 1223 <p>The WAL header is 32 bytes in size and consists of the following eight 1208 1224 big-endian 32-bit unsigned integer values:</p> 1209 1225 ................................................................................ 1251 1267 <tr><td valign=top align=center>16<td valign=top align=center>4 1252 1268 <td>Checksum-1: Cumulative checksum up through and including this page 1253 1269 <tr><td valign=top align=center>20<td valign=top align=center>4 1254 1270 <td>Checksum-2: Second half of the cumulative checksum. 1255 1271 </table> 1256 1272 </center> 1257 1273 1258 -<p>A frame is considered valid if and only if the following conditions are 1274 +^(<p>A frame is considered valid if and only if the following conditions are 1259 1275 true:</p> 1260 1276 1261 1277 <ol> 1262 1278 <li><p>The salt-1 and salt-2 values in the frame-header match 1263 1279 salt values in the wal-header</p></li> 1264 1280 1265 1281 <li><p>The checksum values in the final 8 bytes of the frame-header 1266 1282 exactly match the checksum computed consecutively on the 1267 1283 WAL header and the first 8 bytes and the content of all frames 1268 1284 up to and including the current frame.</p></li></li> 1269 -</ol> 1285 +</ol>)^ 1270 1286 1271 1287 <tcl>hd_fragment walcksm {WAL checksum algorithm}</tcl> 1272 1288 <h3>4.2 Checksum Algorithm</h3> 1273 1289 1274 1290 <p>The checksum is computed by interpreting the input as 1275 1291 an even number of unsigned 32-bit integers: x(0) through x(N). 1276 -The 32-bit integers are big-endian if the 1292 +^The 32-bit integers are big-endian if the 1277 1293 magic number in the first 4 bytes of the WAL header is 0x377f0683 and 1278 1294 the integers are little-endian the magic number is 0x377f0682. 1279 -The checksum values are always stored in the frame header in a 1295 +^The checksum values are always stored in the frame header in a 1280 1296 big-endian format regardless of which byte order is used to compute 1281 1297 the checksum.</p> 1282 1298 1283 1299 <p>The checksum algorithm only works for content which is a multiple of 1284 1300 8 bytes in length. In other words, if the inputs are x(0) through x(N) 1285 1301 then N must be odd. 1286 -The checksum algorithm is as follows: 1302 +^(The checksum algorithm is as follows: 1287 1303 1288 1304 <blockquote><pre> 1289 1305 s0 = s1 = 0 1290 1306 for i from 0 to n-1 step 2: 1291 1307 s0 += x(i) + s1; 1292 1308 s1 += x(i+1) + s0; 1293 1309 endfor 1294 1310 # result in s0 and s1 1295 -</pre></blockquote> 1311 +</pre></blockquote>)^ 1296 1312 1297 1313 <p>The outputs s0 and s1 are both weighted checksums using fibonacci weights 1298 1314 in reverse order (the largest fibonacci weight occurs on the first element 1299 1315 of the sequence being summed.) The s1 value spans all 32-bit integer 1300 1316 terms of the sequence whereas s0 omits the final term.</p> 1301 1317 1302 1318 <h3>4.3 Checkpoint Algorithm</h3> ................................................................................ 1306 1322 Then valid content of the WAL is transferred into the database file. 1307 1323 Finally, the database is flushed to persistent storage using another 1308 1324 xSync method call. 1309 1325 The xSync operations serve as write barriers - all writes launched 1310 1326 before the xSync must complete before any write that launches after the 1311 1327 xSync begins.</p> 1312 1328 1313 -<p>After each checkpoint, the WAL header salt-1 value is incremented and the 1329 +<p>^After each checkpoint, the WAL header salt-1 value is incremented and the 1314 1330 salt-2 value is randomized. This prevents old and new frames in the WAL from 1315 1331 being considered valid at the same time and being checkpointing together 1316 1332 following a crash.</p> 1317 1333 1318 1334 <tcl>hd_fragment walread {WAL read algorithm}</tcl> 1319 1335 <h3>4.4 Reader Algorithm</h3> 1320 1336 1321 -<p>To read a page from the database (call it page number P), a reader 1337 +<p>^(To read a page from the database (call it page number P), a reader 1322 1338 first checks the WAL to see if it contains page P. If so, then the 1323 -last valid instance of page P that is a followed by a commit frame 1324 -or is a commit frame itself becomes the value read. If the WAL 1339 +last valid instance of page P that is followed by a commit frame 1340 +or is a commit frame itself becomes the value read.)^ ^If the WAL 1325 1341 contains no copies of page P that are valid and which are a commit 1326 1342 frame or are followed by a commit frame, then page P is read from 1327 1343 the database file.</p> 1328 1344 1329 1345 <p>To start a read transaction, the reader records the index of the last 1330 1346 valid frame in the WAL. The reader uses this recorded "mxFrame" value 1331 1347 for all subsequent read operations. New transactions can be appended 1332 1348 to the WAL, but as long as the reader uses its original mxFrame value 1333 1349 and ignores subsequently appended content, the reader will see a 1334 1350 consistent snapshot of the database from a single point in time. 1335 -This technique allows multiple concurrent readers to view different 1351 +^This technique allows multiple concurrent readers to view different 1336 1352 versions of the database content simultaneously.</p> 1337 1353 1338 1354 <p>The reader algorithm in the previous paragraphs works correctly, but 1339 1355 because frames for page P can appear anywhere within the WAL, the 1340 1356 reader has to scan the entire WAL looking for page P frames. If the 1341 1357 WAL is large (multiple megabytes is typical) that scan can be slow, 1342 -and read performance suffers. To overcome this problem, a separate 1358 +and read performance suffers. ^To overcome this problem, a separate 1343 1359 data structure called the wal-index is maintained to expedite the 1344 1360 search for frames of a particular page.</p> 1345 1361 1346 1362 <tcl>hd_fragment walindexformat {wal-index} {WAL-index format}</tcl> 1347 1363 <h3>4.5 WAL-Index Format</h3> 1348 1364 1349 1365 <p>Conceptually, the wal-index is shared memory, though the current ................................................................................ 1365 1381 1366 1382 <p>The <i>M</i> value in the previous paragraph is the "mxFrame" value 1367 1383 defined in [WAL read algorithm | section 4.4] that is read at the start 1368 1384 of a transaction and which defines the maximum frame from the WAL that 1369 1385 the reader will use.</p> 1370 1386 1371 1387 <p>The wal-index is transient. After a crash, the wal-index is 1372 -reconstructed from the original WAL file. The VFS is required 1388 +reconstructed from the original WAL file. ^The VFS is required 1373 1389 to either truncate or zero the header of the wal-index when the last 1374 1390 connection to it closes. Because the wal-index is transient, it can 1375 1391 use an architecture-specific format; it does not have to be cross-platform. 1376 1392 Hence, unlike the database and WAL file formats which store all values 1377 1393 as big endian, the wal-index stores multi-byte values in the native 1378 1394 byte order of the host computer.</p> 1379 1395 1380 1396 <p>This document is concerned with the persist state of the database 1381 1397 file, and since the wal-index is a transient structure, no further 1382 1398 information about the format of the wal-index will be provided here. 1383 1399 Complete details on the format of the wal-index are contained within 1384 1400 comments in SQLite source code.</p>