Overview

Comment: Add other notes to lsm.wiki.
SHA1: ec9a80516479bb7c2ef1c58b313b5964
User & Date: dan 2013-02-21 19:56:37.420
Context

2013-02-22
  20:23  Fix over-length source code lines in sqliteInt.h. (check-in: 035df9b1de, user: drh, tags: trunk)
2013-02-21
  19:56  Add other notes to lsm.wiki. (check-in: ec9a805164, user: dan, tags: trunk)
  17:53  Add a summary of locks to lsm.wiki. (check-in: 3987cb831e, user: dan, tags: trunk)
Changes
Changes to www/lsm.wiki.
<title>LSM Design Overview</title>
<nowiki>

<div id=start_of_toc></div>
<a href=#summary style=text-decoration:none>1. Summary</a><br>
<a href=#locks style=text-decoration:none>2. Locks</a><br>
<a href=#database_connect_and_disconnect_operations style=text-decoration:none>3. Database Connect and Disconnect Operations</a><br>
<a href=#read-write_clients style=text-decoration:none>3.1. Read-write clients</a><br>
<a href=#read-only_clients style=text-decoration:none>3.2. Read-only clients</a><br>
<a href=#data_structures style=text-decoration:none>4. Data Structures</a><br>
<a href=#database_file style=text-decoration:none>4.1. Database file</a><br>
<a href=#sorted_runs style=text-decoration:none>4.1.1. Sorted Runs</a><br>
<a href=#levels style=text-decoration:none>4.1.2. Levels</a><br>
<a href=#snapshots style=text-decoration:none>4.1.3. Snapshots</a><br>
<a href=#in-memory_tree style=text-decoration:none>4.2. In-Memory Tree</a><br>
<a href=#memory_allocation style=text-decoration:none>4.2.1. Memory Allocation</a><br>
<a href=#header_fields style=text-decoration:none>4.2.2. Header Fields</a><br>
<a href=#other_shared-memory_fields style=text-decoration:none>4.3. Other Shared-Memory Fields</a><br>
<a href=#log_file style=text-decoration:none>4.4. Log file</a><br>
<a href=#database_operations style=text-decoration:none>5. Database Operations</a><br>
<a href=#reading style=text-decoration:none>5.1. Reading</a><br>
<a href=#writing style=text-decoration:none>5.2. Writing</a><br>
<a href=#flushing_the_in-memory_tree_to_disk style=text-decoration:none>5.2.1. Flushing the in-memory tree to disk</a><br>
<a href=#shared-memory_management style=text-decoration:none>5.2.2. Shared-memory management</a><br>
<a href=#log_file_management style=text-decoration:none>5.2.3. Log file management</a><br>
<a href=#working style=text-decoration:none>5.3. Working</a><br>
<a href=#free-block_list_management style=text-decoration:none>5.3.1. Free-block list management</a><br>
<a href=#checkpoint_operations style=text-decoration:none>5.4. Checkpoint Operations</a><br>
<div id=end_of_toc></div>

<h1 id=summary>1. Summary</h1>

<p>
The LSM embedded database software stores data in three distinct data
structures:
︙
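The flush-and-merge lifecycle this document describes (writes accumulate in an in-memory tree, which is flushed to the database file as a sorted run; adjacent runs are then merged) can be sketched as a toy model. All names below are hypothetical illustration, not LSM's actual API, and a Python dict stands in for the b-tree:

```python
# Toy model of the LSM write path: memtable -> sorted runs -> merged run.
class ToyLSM:
    def __init__(self, flush_threshold=4):
        self.memtable = {}             # stands in for the in-memory tree
        self.runs = []                 # newest-first list of sorted runs
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the in-memory tree out as a new sorted run.
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def merge_runs(self):
        # Merge chronologically adjacent runs; newer entries shadow older.
        merged = {}
        for run in reversed(self.runs):    # apply oldest first
            merged.update(run)
        self.runs = [sorted(merged.items())]

    def read(self, key):
        if key in self.memtable:           # in-memory tree first
            return self.memtable[key]
        for run in self.runs:              # then newest run to oldest
            d = dict(run)
            if key in d:
                return d[key]
        return None
```

A reader consults the in-memory tree first and then the runs from newest to oldest, which is why a merge must let newer entries win.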
<p>
When an application writes to the database, the new data is written to the
in-memory tree. Once the in-memory tree has grown large enough, its contents
are written into the database file as a new sorted run. To reduce the number
of sorted runs in the database file, chronologically adjacent sorted runs
may be merged together into a single run, either automatically or on demand.

<h1 id=locks>2. Locks</h1>

Read/write (shared/exclusive) file locks are used to control concurrent
access. LSM uses the following "locking regions". Each locking region may be
locked and unlocked separately.

<p>
<table style="margin:0 4ex">
<tr><td valign=top style="width:16ex">DMS1
<td><p style=margin-top:0>
This locking region is used to serialize all connection and disconnection
operations performed by read-write database connections. An EXCLUSIVE lock
is taken for the duration of all such operations.
<p>Additionally, read-only connections take a SHARED lock on this locking
region while attempting to connect to a database. This ensures that a
read-only connection does not attempt to connect to the database while a
read-write client's connect or disconnect operation is in progress.

<tr><td valign=top>DMS2
<td><p style=margin-top:0>
Read-write connections hold a SHARED lock on this locking region for as long
as they are connected to the database.

<tr><td valign=top>DMS3
<td><p style=margin-top:0>
Read-only connections hold a SHARED lock on this locking region for as long
as they are connected to the database.

<tr><td valign=top>RWCLIENT(n)
<td><p style=margin-top:0>
There are a total of 16 RWCLIENT locking regions. After a read-write client
connects to the database, it attempts to find a free RWCLIENT locking slot
to take an EXCLUSIVE lock on. If it cannot find one, this is not an error.
If it can, the lock is held for as long as the read-write client is
connected to the database.
<p>The sole purpose of these locks is to allow a read-only client to detect
whether or not there exists at least one read-write client connected to the
database. Of course, if large numbers of read-write clients connect to and
disconnect from the system in an inconvenient order, the system may enter a
state in which one or more read-write clients are connected but none of
them holds a RWCLIENT lock.
This is not important - if a read-only client fails to detect that the
system has read-write clients it may be less efficient, but it will not
malfunction.

<tr><td valign=top>WRITER
<td><p style=margin-top:0>
A database client holds an EXCLUSIVE lock on this locking region while
writing data to the database. Outside of recovery, only clients holding this
lock may modify the contents of the in-memory b-tree. Holding this lock is
synonymous with having an open write transaction on the database.

<tr><td valign=top>WORKER
<td><p style=margin-top:0>
A database client holds an EXCLUSIVE lock on this locking region while
performing database work (writing data into the body of the database file).

<tr><td valign=top>CHECKPOINTER
<td><p style=margin-top:0>
A database client holds an EXCLUSIVE lock on this locking region while
performing a checkpoint (syncing the database file and writing to the
database header).

<tr><td valign=top>ROTRANS
<td><p style=margin-top:0>
A read-only database client holds a SHARED lock on this locking region while
reading from a non-live database system.

<tr><td valign=top>READER(n)
<td><p style=margin-top:0>
There are a total of 6 READER locking regions. Unless it is a read-only
client reading from a non-live database, a client holds a SHARED lock on one
of these while it has an open read transaction. Each READER lock is
associated with a pair of id values identifying the regions of the in-memory
tree and database file that may be read by clients holding such SHARED
locks.
</table>

<h1 id=database_connect_and_disconnect_operations>3. Database Connect and Disconnect Operations</h1>

<h2 id=read-write_clients>3.1. Read-write clients</h2>

<p>When an LSM database connection is opened (i.e. lsm_open() is called):

<pre>
  lock(DMS1, EXCLUSIVE)           # Block until successful
  trylock(DMS2+DMS3, EXCLUSIVE)
  if( trylock() successful ){
    zero shared memory and run recovery
  }
  if( no error during recovery ){
    lock(DMS2, SHARED)            # "cannot" fail
    lock(RWCLIENT(x), EXCLUSIVE)  # see comment below
  }
  unlock(DMS1)
</pre>

<p>
Running recovery involves reading the database file header and log file to
initialize the contents of shared-memory. Recovery is only run when the
first connection connects to the database file. There are no circumstances
(including the unexpected failure of a writer process) that may cause
recovery to take place after the first client has successfully connected.

<p>
After the SHARED lock on DMS2 is taken, an effort is made to find a free
RWCLIENT locking region and take an EXCLUSIVE lock on it. If no such region
can be found, this step is omitted. It is not an error if this happens.

<p>
Assuming recovery is successful (or not required), the database client is
left holding a SHARED lock on DMS2 and, possibly, an EXCLUSIVE lock on one
of the RWCLIENT locks. These locks are held for as long as the database
connection remains open.

<p>
When disconnecting from a database (i.e. an lsm_close() call following a
successful lsm_open()):

<pre>
  lock(DMS1, EXCLUSIVE)           # Block until successful
  trylock(DMS2, EXCLUSIVE)
  if( trylock() successful ){
    flush in-memory tree
    checkpoint database
    trylock(DMS3, EXCLUSIVE)
    if( trylock() successful ){
      delete shared memory
    }
    trylock(ROTRANS, EXCLUSIVE)
    if( trylock() successful ){
      unlink log file
    }
  }
  unlock(RWCLIENT(x))             # If holding RWCLIENT lock
  unlock(DMS2)
  unlock(DMS1)
</pre>

<h2 id=read-only_clients>3.2. Read-only clients</h2>

<p>It is assumed that read-only clients:

<ul>
  <li>may take SHARED locks only,
  <li>may not write to shared-memory, the database file or the log file, and
  <li>may not use the trylock(x, EXCLUSIVE) primitive to detect SHARED locks
      held by other clients.
</ul>

<p>A read-only client does not attempt to connect to the database from
within the lsm_open() call. Instead, this is deferred until the first time
the client attempts to read from the database.

<pre>
  lock(DMS1, SHARED)              # Block until successful
  trylock(RWCLIENT(all), SHARED)
  if( trylock() successful ){
    # System is not live. The database must be read directly from disk.
    lock(ROTRANS, SHARED)
  }else{
    # Client is now connected. The read transaction may now be opened
    # in the same way as it would be for a read-write client.
    lock(DMS3, SHARED)
  }
  unlock(DMS1)
</pre>

<p>
Assuming no error occurs, the procedure above leads to one of two possible
outcomes:

<ul>
  <li> <p>DMS3 is locked and the read-only client is now connected to the
  database. From this point on, the read-only client uses the same procedure
  to open a read transaction as any other client. The lock on DMS3 is held
  until the client disconnects from the database.

  <li> <p>ROTRANS is locked and the read-only client is still disconnected.
  Holding the lock on ROTRANS guarantees that no read-write client will
  overwrite any existing data in the database file or log file. This allows
  the read-only client to run a disconnected read transaction - it recovers
  any data in the log file into private memory, then reads data as required
  from the database file for the duration of the user's read transaction.
</ul>

<p>A disconnected read transaction is closed by dropping the ROTRANS lock.

<h1 id=data_structures>4. Data Structures</h1>

<p>
In the following sections, "the WRITER lock" refers to an exclusive lock on
the WRITER locking region. For example, "holding the WRITER lock" is
equivalent to "holding an exclusive lock on the WRITER locking region".
Similar interpretations apply to "the WORKER lock" and "the CHECKPOINTER
lock".

<h2 id=database_file>4.1. Database file</h2>

<p>
This section summarizes the contents of the database file informally. A
detailed description is found in the header comments for source code files
<a href="../src/lsm_file.c">lsm_file.c</a> (blocks, pages etc.),
<a href="../src/lsm_sorted.c">lsm_sorted.c</a> (sorted run format) and
<a href="../src/lsm_ckpt.c">lsm_ckpt.c</a> (database snapshot format).
︙
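The read-only connection decision of section 3.2 can be modeled in a few lines. The helper below is a hypothetical sketch, not the LSM API: it assumes only that a SHARED trylock on a RWCLIENT slot fails exactly when some read-write client holds that slot EXCLUSIVE:

```python
# Model of the read-only connect decision: trylock(RWCLIENT(all), SHARED)
# succeeds only if no read-write client holds any RWCLIENT slot EXCLUSIVE.
def choose_read_path(exclusive_rwclient_slots):
    """exclusive_rwclient_slots: set of RWCLIENT slot numbers (0..15)
    currently held EXCLUSIVE by read-write clients."""
    if not exclusive_rwclient_slots:
        # No writer detected: treat the system as non-live and read
        # directly from disk under ROTRANS.
        return "lock(ROTRANS, SHARED)"
    # At least one read-write client detected: connect as a reader.
    return "lock(DMS3, SHARED)"
```

Note the caveat from section 2: read-write clients may be connected without holding any RWCLIENT lock, so the ROTRANS branch may be taken even when the system is live. Per the document, that costs efficiency but not correctness.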
<p>
As with an SQLite database file, each page in the database may be addressed
by its 32-bit page number. This means the maximum database size is roughly
(pgsz * 2^32) bytes. The first and last pages in each block are 4 bytes
smaller than the others. This is to make room for a single page-number.

<h3 id=sorted_runs>4.1.1. Sorted Runs</h3>

<p>
A single sorted run is spread across one or more database pages (each page
is a part of at most one sorted run). Given the page number of a page in a
sorted run, the following statements are true:

<ul>
︙
In other words, given the page numbers of the first and last pages of a
sorted run and the page number of the root page for the embedded b-tree, it
is possible to traverse the entire run in either direction or query for
arbitrary values.

<p><span style="color:red"> TODO: Embedded pointers. </span>

<h3 id=levels>4.1.2. Levels</h3>

<p>
Each sorted run is assigned to a "level". Normally, a level consists of a
single sorted run. However, a level may also consist of a set of sorted runs
being incrementally merged into a single run.

<p>
︙
time for all entries.

<h3 id=snapshots>4.1.3. Snapshots</h3>

<p>
Each meta page may contain a database <b>snapshot</b>. A snapshot contains
all the information required to interpret the remainder of the database file
(the sorted runs and free space). Specifically, it contains:

<ul>
︙
Recovery and Shutdown" below).
</ul>

<p>
A more detailed description is available in the header comments in source
code file <a href="../src/lsm_ckpt.c">lsm_ckpt.c</a>.

<h2 id=in-memory_tree>4.2. In-Memory Tree</h2>

<p>
The in-memory tree is an append-only b-tree of order 4 (each node may have
up to 4 children), which is more or less equivalent to a red-black tree. An
append-only tree is convenient, as it naturally supports the
single-writer/many-readers MVCC concurrency model.

<p>
The implementation includes some optimizations, not described here, that
reduce the number of interior nodes updated when a leaf node is written.
See the header comments in source code file
<a href=../src/lsm_tree.c>lsm_tree.c</a> for details.

<h3 id=memory_allocation>4.2.1. Memory Allocation</h3>

<p>
More than one in-memory tree may exist in shared-memory at any time. For
example, in the following scenario:

<ol>
︙
but the values that connect the linked list together are not. The writer
that detects the failure must scan the entire shared-memory region to
reconstruct the linked list. Any sequence ids assigned by the failed writer
are reverted (perhaps not to their original values, but to values that put
them at the start of the linked list - before those chunks that may still be
in use by existing readers).

<h3 id=header_fields>4.2.2. Header Fields</h3>

<p>
As well as the in-memory tree data, the following fixed-size fields stored
in well-known locations in shared-memory are part of the in-memory tree.
Like the in-memory tree data, outside of recovery these fields are only ever
written to by clients holding the WRITER lock.

<ul>
  <li> Two copies of a data structure called a "tree-header": tree-header-1
  and tree-header-2. A tree-header structure contains all the information
  required to read or write to a particular version of the append-only
  b-tree. It also contains a 64-bit checksum.

  <li> A boolean flag set to true at the beginning of every write
  transaction and cleared after that transaction is successfully concluded -
  the "writer flag". This is used to detect failures that occur
  mid-transaction. It is only ever read (or written) by clients that hold
  the WRITER lock.
</ul>

<h2 id=other_shared-memory_fields>4.3. Other Shared-Memory Fields</h2>

<ul>
  <li> Snapshot 1.
  <li> Snapshot 2.
  <li> The meta-page pointer. This value is either 1 or 2. It indicates
  which of the two meta-pages contains the most recent database snapshot.
  <li> READER lock values.
</ul>

<h2 id=log_file>4.4. Log file</h2>

See the header comments in source code file
<a href=../src/lsm_log.c>lsm_log.c</a>.

<h1 id=database_operations>5. Database Operations</h1>

<h2 id=reading>5.1. Reading</h2>

<p>
Opening a read transaction:

<ol>
  <li> <p>Load the current tree-header from shared-memory.
︙
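Section 4.2.2 explains why step 1 above can be made safe against torn writes: two tree-header copies are kept, each carrying a 64-bit checksum. A minimal sketch of loading a consistent copy follows; the field layout and checksum function here are hypothetical stand-ins, and the real consistency rules are those of the (elided) steps of section 5.1:

```python
import zlib

def header_checksum(fields):
    # Stand-in for the 64-bit checksum a real tree-header carries.
    return zlib.crc32(repr(sorted(fields.items())).encode())

def load_tree_header(copy1, copy2):
    """Each copy is (fields_dict, stored_checksum). Return the fields of a
    copy whose checksum validates, or None if neither is consistent (in
    which case a reader would have to retry)."""
    for fields, stored in (copy1, copy2):
        if header_checksum(fields) == stored:
            return fields
    return None
```

The point of the second copy is that a writer updating one copy mid-crash leaves the other intact, so a reader can always fall back to a consistent version.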
Once a read transaction is opened, the reader may continue to read the
versions of the in-memory tree and database file for as long as the READER
lock is held.

<p>
To close a read transaction all that is required is to drop the SHARED lock
held on the READER slot.

<h2 id=writing>5.2. Writing</h2>

<p>
To open a write transaction:

<ol>
  <li> <p>Open a read transaction, if one is not already open.
  <li> <p>Obtain the WRITER lock.
︙
  <li> Sweep the shared-memory area to rebuild the linked list of chunks so
  that it is consistent with the current tree-header.
  <li> Clear the writer flag.
</ol>

<h3 id=flushing_the_in-memory_tree_to_disk>5.2.1. Flushing the in-memory tree to disk</h3>

<p>
For the purposes of writing, the database file and the in-memory tree are
largely independent. Processes holding the WRITER lock write to the
in-memory tree, and processes holding the WORKER lock write to the database
file.

<ol>
︙
  <li> Update the private copy of the tree-header to reflect a new, empty
  tree.
  <li> Commit the write transaction, writing the new, empty tree to
  shared-memory.
</ol>

<h3 id=shared-memory_management>5.2.2. Shared-memory management</h3>

<p>
A writer client may have to allocate new shared-memory chunks. This can be
done either by extending the shared-memory region or by recycling the first
chunk in the linked list. To check whether the first chunk in the linked
list may be reused, the writer must check that:

<ul>
  <li> The chunk is not part of the current in-memory tree (the one being
  appended to by the writer). A writer can check this by examining its
  private copy of the tree-header.

  <li> The chunk is not part of an in-memory tree being used by an existing
  reader. A writer checks this by scanning (and possibly updating) the
  values associated with the READER locks - similar to the way SQLite does
  in WAL mode.
</ul>

<h3 id=log_file_management>5.2.3. Log file management</h3>

<p>
A writer client also writes to the log file. All the information required to
write to the log file (the offset to write to and the initial checksum
values) is embedded in the tree-header. However, in order to reuse log file
space (wrap around to the start of the log file), a writer needs to know
that the space being recycled will not be required by any recovery process
in the future.
In other words, the writer must know that the information contained in the
transactions being overwritten has already been written into the database
file and is part of the snapshot written into the database file by a
checkpointer (see "Checkpoint Operations" below).

<p>
To determine whether or not the log file can be wrapped, the writer requires
access to information stored in the newest snapshot written into the
database header. There exists a shared-memory variable indicating which of
the two meta-pages contains this snapshot, but the writer process still has
to read the snapshot data from disk and verify its checksum.

<h2 id=working>5.3. Working</h2>

<p>
Working is similar to writing. The difference is that a "writer" modifies
the in-memory tree, while a "worker" modifies the contents of the database
file.

<ol>
  <li> <p>Take the WORKER lock.
︙
  <li> <p>Invoke xShmBarrier().
  <li> <p>Update snapshot-1 in shared-memory.
  <li> <p>Release the WORKER lock.
</ol>

<h3 id=free-block_list_management>5.3.1. Free-block list management</h3>

<p>
Worker clients occasionally need to allocate new database blocks or move
existing blocks to the free-block list. Along with the block number of each
free block, the free-block list contains the snapshot-id of the first
snapshot created after the block was moved to the free list. The free-block
list is always stored in order of snapshot-id, so that the first block in
︙
header. This is done by reading (and verifying the checksum of) the snapshot
currently stored in the database meta-page indicated by the shared-memory
variable.
</ul>

<h2 id=checkpoint_operations>5.4. Checkpoint Operations</h2>

<ol>
  <li> Take the CHECKPOINTER lock.
  <li> Load snapshot-1 from shared-memory. If the checksum does not match
  the content, release the CHECKPOINTER lock and abandon the attempt to
  checkpoint the database.
︙
  <li> Sync the database file again.
  <li> Update the shared-memory variable to indicate the meta-page written
  in step 5.
  <li> Drop the CHECKPOINTER lock.
</ol>
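The checkpoint sequence above (sync the file body, write a meta-page, sync again, then flip the shared-memory meta-page pointer) can be sketched with in-memory stand-ins. The class below is a hypothetical model, not LSM's code; the key property it demonstrates is that the snapshot is always written to the meta-page <i>not</i> currently pointed to, so a crash mid-checkpoint leaves the previous snapshot intact:

```python
# Toy checkpointer: alternate between the two meta-pages, flipping the
# shared-memory pointer only after the new snapshot is fully "synced".
class ToyCheckpointer:
    def __init__(self):
        self.meta = {1: None, 2: None}   # the two database meta-pages
        self.shm_meta_ptr = 1            # shared-memory meta-page pointer

    def sync(self):
        pass                             # stand-in for fsync()

    def checkpoint(self, snapshot):
        self.sync()                      # 1. sync the database file body
        target = 2 if self.shm_meta_ptr == 1 else 1
        self.meta[target] = snapshot     # 2. write snapshot to spare page
        self.sync()                      # 3. sync again before flipping
        self.shm_meta_ptr = target       # 4. readers now see new snapshot

    def current_snapshot(self):
        return self.meta[self.shm_meta_ptr]
```

Until step 4 runs, readers following the pointer still load the old snapshot, which is exactly why the writer in section 5.2.3 must re-read and checksum the pointed-to meta-page before wrapping the log.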