Documentation Source Text

Artifact [1b0bd8db88]

Artifact 1b0bd8db88f6d6a85e09a3e329d1ac39825d856a:

<!-- title>SQLite B-Tree Module</title -->

hd_keywords {btree design}
source [file join $::DOC pages fancyformat.tcl]

foreach header {btree.h sqliteInt.h} {
  set fd [open [file join $SRC src $header]]
  set headers($header) [read $fd]
  close $fd

proc header_api_defn {header args} {
  foreach api $args {

    set re_head {\n[^\n]*}
    if { [string match struct* $api] } {
      set re_tail {[^\{]*\};\n}
    } elseif { [string match BTREE* $api] } {
      set re_tail {[^A-Za-z][^\n]*\n}
    } else {
      set re_tail {[^;A-Za-z][^;]*;}

    set api_defn {}
    regexp $re_head$api$re_tail $::headers($header) api_defn
    lappend ret [string trim $api_defn]
  return "<pre class=api>[join $ret "\n"]</pre>"

proc btree_api_defn {args} {
  eval header_api_defn btree.h $args
proc sqliteint_api_defn {args} {
  eval header_api_defn sqliteInt.h $args

fancyformat_document "SQLite B-Tree Module" {hlr50000.txt llr50000.txt} {

  [h1 "Document Overview"]

  [h2 "Scope and Purpose"]

    This document provides a description of the functionality of and public 
    interface offered by the SQLite b-tree module. It also, to a certain extent,
    describes the algorithm and implementation techniques used internally by
    the module. 

    <li><p> To make it easier to maintain, test and improve this critical
            sub-system of the SQLite software library.

    <li><p> To facilitate development of compatible backend modules that can 
            be used with other SQLite sub-systems, either for experimental or 
            production purposes.

    It is important to note that, even given the second bullet point above,
    the interfaces and sub-systems described in this document are not stable.
    They may be changed in any way with each new SQLite release. Any 
    external software development that uses these interfaces must be prepared
    to adapt to interface refactoring without notice.

  [h2 "Document and Requirements Organization"]

     <p class=todo>
       Change the following so that those requirements that describe the API
       are "low-level" requirements.

        [Tr] <th> Requirement ids <th> Contents
        [Tr] <td> H50**** <td> Requirement statements specifying the functionality required of the B-Tree module.
        [Tr] <td> H51**** <td> Requirement statements specifying the API provided by the B-Tree module.
        [Tr] <td> L****** <td> Requirement statements specifying some details of the internal workings of the B-Tree module.

  [h2 "Glossary"]

    <table id=glossary>
      [Glossary "Balance-Siblings Algorithm" {
        The balance-siblings algorithm is one of four algorithms that may be used
	used to redistribute data within a b-tree structure after an insert or
        delete operation that causes a b-tree node to become overfull or underfull.
        See section <cite>balance_siblings</cite> for details.
      [Glossary "B-Tree Cursor" {
        <span class=todo>Define this.
      [Glossary "B-Tree Database Connection" {
        A B-Tree database connection is a single client connection to an in-memory 
        page cache, through which a single temporary or persistent database may
        be accessed. This term is used throughout this document to avoid confusing
	such connections with SQL level SQLite client connections, which are
        sometime simply termed "database connections".
      [Glossary "Lazy-write cache" {
        <span class=todo>Define this.
      [Glossary "Overflow Cell" {
        <span class=todo>Define this.
      [Glossary "Page cache" {
        <span class=todo>Define this.
      [Glossary "Persistent database" {
        <span class=todo>Define this.
      [Glossary "Read-through cache" {
        <span class=todo>Define this.
      [Glossary "Shared-cache mode" {
        <span class=todo>Define this.
      [Glossary "SQLite Error Code" {
        <span class=todo>Define this.
      [Glossary "Temporary database" {
        <span class=todo>Define this.

  [h1 "Module Requirements"]

    The SQLite B-Tree module, the software module described by this document,
    is designed to query and modify a database stored using the database image
    format described in <cite>ref_file_format</cite>. Database images may
    exist only in volatile main-memory (in-memory databases), or may be stored 
    persistently within the file-system (also described in
    <cite>ref_file_format</cite>). Or, a database image may be stored primarily
    in main-memory with the file-system used as secondary storage if the
    database image grows too large. Database images stored only in main-memory,
    and those stored primarily in main-memory with the file-system used only to
    provide secondary storage space are known collectively as temporary
    databases. Database images stored persistently in the file-system are termed
    persistent databases.

    This module implements an in-memory page cache to manage database image
    content. The size of the pages managed by the cache are the same as the
    page-size of the database image. When operating on a persistent database, 
    the cache operates as a read-through, lazy-write cache. When committing a 
    database transaction, the user explicitly directs the cache to flush all 
    dirty pages through to persistent storage. A single in-memory page cache 
    used to access the content of a persistent database may support multiple
    logical client connections. <span class=todo>Some brief explanation of what
    this means. And maybe a pointer to the "Multi-User Database Requirements"

    When operating on a temporary database, there may only be one client for
    each page cache. Depending on the SQLite configuration, either the database
    or journal file, or both, may be omitted from the system.

      [Figure btreemodule_overview.svg figure_overview "Role of Page Cache"]

    Figure <cite>figure_overview</cite> depicts...

    Roughly what is encapsulated by the module.

    [h2 "Functional Requirements"]

      This section contains requirements describing the functionality required 
      from the B-Tree module.

    <p class=todo>
      Figure out where integrity-check goes.

    [h3 "Opening and Closing Connections"]
      The B-Tree module provides an interface to open new b-tree database connections.

      [fancyformat_import_requirement H50010]
      [fancyformat_import_requirement H50020]
      [fancyformat_import_requirement H50030]
      [fancyformat_import_requirement H50040]
      [fancyformat_import_requirement H50050]

      The B-Tree module also provides an interface to close existing b-tree database 

      [fancyformat_import_requirement H50060]
      [fancyformat_import_requirement H50070]

    [h3 "New Database Image Configuration"]

      The following requirements describe database configuration options that
      are only applicable to new database images. For the purposes of the
      following requirements, a "new database image" is defined as one that is
      zero pages in size.

      [fancyformat_import_requirement H50080]
      [fancyformat_import_requirement H50090]

    [h3 "Transaction and Savepoint Functions" hlr_transactions]

     <p class=todo>
       This needs a lot of work...

      All read and write operations performed on a database image via the
      B-Tree module interfaces occur within the context of a read or write
      transaction. <span class=todo>Something about the ACID nature of
      transactions and how this applies to read and write transactions</span>)

      [fancyformat_import_requirement H50100]
      [fancyformat_import_requirement H50101]


      [fancyformat_import_requirement H50102]
      [fancyformat_import_requirement H50103]
      [fancyformat_import_requirement H50104]

    <p class=todo>
      Multi-file transaction support.

      Transaction state query:
      [fancyformat_import_requirement H50108]


    <p class=todo>
      Define "savepoint transactions" and fix the following requirements.

      [fancyformat_import_requirement H50105]
      [fancyformat_import_requirement H50106]
      [fancyformat_import_requirement H50107]

    [h3 "Reading From the Database Image" hlr_reading_data]

      The B-Tree module allows the user to read a subset of the fields from the
      database image header. Each such field is stored in the header as a 4-byte 
      unsigned big-endian integer. A complete description of each field and its 
      interpretation may be found in <cite>ref_file_format</cite>. 

      [fancyformat_import_requirement H50109]

      In other words, the database image header fields that may be read via
      this module are:

      <li> The number of free pages in the database image,
      <li> The database image schema version (schema cookie).
      <li> The database image schema layer file-format.
      <li> The default page-cache size.
      <li> The "auto-vacuum last root-page" field.
      <li> The database image text-encoding field.
      <li> The database image user-cookie value.
      <li> The database image incremental-vacuum flag.

      With the exception of the database image header fields described above,
      all data is read from the database image using B-Tree cursors. A B-Tree
      cursor is a control structure for traversing the contents of a single
      table or index b-tree structure within a database image. As well as
      "forward" and "back" operations, a B-Tree cursor supports fast seeking to
      a table entry identified by key value, or to the first or last entry in
      the table.

      When a B-Tree cursor is created, the specific table or index b-tree that
      it is used to traverse is identified by the database image page number 
      of its root page. Since the root-page of the schema table is always page
      1, and the contents of the schema table includes the root page numbers of all
      other index and table b-tree structures in the database image, it is
      possible for the application to determine the set of valid root-page
      numbers by first traversing the schema table.

      [fancyformat_import_requirement H50110]
      [fancyformat_import_requirement H50111]
      [fancyformat_import_requirement H50112]
      [fancyformat_import_requirement H50113]
      [fancyformat_import_requirement H50114]
      [fancyformat_import_requirement H50115]
      [fancyformat_import_requirement H50116]
      [fancyformat_import_requirement H50117]
      [fancyformat_import_requirement H50118]

      As well as traversing a b-tree structure using the operations enumerated
      by the above requirements, it is also possible to use a cursor to search
      a b-tree structure for a specified key value. If the key value can be
      found, the cursor is left pointing at the entry with the specified key
      value. Otherwise, the cursor is left pointing at either the entry with the 
      largest key that is smaller than the specified key, or to the entry with 
      the smallest key that is larger than the specified key. For table b-tree
      structures, where the key values are 64-bit integers, the definition of
      smaller, larger and equal to is straightforward. For index b-tree
      structures, where the key values are database records, the manner in
      which key values must be compared is more complicated. Refer to
      <cite>ref_file_format</cite> for a full explanation.

    <p class=todo>
      There is a specific section in <cite>ref_file_format</cite> devoted to
      record sort order in index b-tree structures. There needs to be some way to
      point to it. Or, better, to the requirement or range of requirements.

    <p class=todo>
      Maybe a system that automatically links text like H30100 to the
      corresponding requirement. Within a document if it can find it, or a
      summary page (hlreq.html for example).

      [fancyformat_import_requirement H50119]
      [fancyformat_import_requirement H50120]
      [fancyformat_import_requirement H50121]

    <p class=todo>
      Does it depend on the structure of the tree whether the cursor is left
      pointing to a smaller or larger entry after a failed search? Or is it
      possible to determine which it will be based only on the set of keys
      stored in the tree?

      As well as the standard search operation described by the above
      requirements, cursors open on index b-tree structures are required to
      support several variants, as follows:

      <li> <b>Ignore rowid search mode</b>. The final value in a database
	   record used as an index-btree key is always an integer "rowid"
	   field. A search in this mode proceeds as if each key in the b-tree
           was missing this field.

      <li> <b>Increment key mode</b>.
      <li> <b>Prefix match mode</b>.
      <li> <b>Prefix search mode</b>.

    <p class=todo>
      Finish the bullet points above and add HLR for each search mode.

      More than one cursor can be open on a single b-tree structure at one time.
      It is also possible for a write-cursor to modify the contents of a b-tree
      structure while other cursors are open on it. The b-tree module does not
      include any type of row-locking mechanism. It is possible for a write-cursor
      to be used to delete an entry from a b-tree structure even if there are
      one or more other cursors currently pointing to the entry being deleted.

    <p class=todo>
      Requirements to do with how the above is handled. Traceability to 
      sqlite3BtreeCursorHasMoved is required.

    [h3 "Writing to the Database Image"]

      The B-Tree module allows the user to write values to a subset of the
      fields from the database image header. The set of writable fields is
      the same as the set of fields enumerated in section
      <cite>hlr_reading_data</cite> that the B-Tree module is required to
      provide read access to by requirement H50109.

      [fancyformat_import_requirement H50122]

      The B-Tree module also supports operations to create new b-tree 
      structures within the database image. Existing b-tree structures may be
      deleted from the database image entirely, or their entire contents may be
      deleted, leaving an empty b-tree structure.

      [fancyformat_import_requirement H50123]
      [fancyformat_import_requirement H50124]
      [fancyformat_import_requirement H50125]

      As one would expect, the B-Tree module also provides an interface to
      insert and delete entries from b-tree structures. These operations are
      performed using a B-Tree write cursor, a special type of B-Tree cursor
      (see section <cite>hlr_reading_data</cite>).

      [fancyformat_import_requirement H50126]
      [fancyformat_import_requirement H50127]
      [fancyformat_import_requirement H50128]

    <p class=todo>
      Incremental vacuum step.

    [h3 "Page-Cache Configuration Requirements"]

      A page-cache has a number of operational parameters that may be configured
      at run-time via an open b-tree database connection. Note that even though the
      interfaces provided by this module allow these parameters to be set via a
      b-tree database connection, they are properties of the page-cache, not
      the b-tree database connection. In situations where more than one b-tree
      database connection is connected to a single page-cache, writes made via
      one b-tree database connection may overwrite the values set by another.
      The following table summarizes the available configuration parameters.

      [Tr] <th>Parameter <th>Description  <th>Requirements
      [Tr] <td>Locking-mode 
           <td><span class=todo>This!</span> 
           <td>H50138, H50139, H50140
      [Tr] <td>Journal-mode 
           <td><span class=todo>This!</span> 
           <td>H50141, H50142, H50143, H50144, H50145, H50146
      [Tr] <td>Journal-file size limit
	   <td>The journal-file size limit parameter may be set to any integer
	       value within the range of a 64-bit signed integer. Any negative
	       values is interpreted as "no limit". Otherwise, if the
	       journal-file size limit is set to zero or a positive number, it
	       represents an upper limit on the size of the journal file in
	       bytes. If the application executes a database write operation that
	       would normally cause the journal file to grow larger than this
	       configured limit, the operation fails and an error is returned
	       to the user. The default value of this parameter is -1 (no
           <td>H50147, H50148, H50149
      [Tr] <td style="white-space:nowrap">Database-file size limit
	   <td>The database-image size limit parameter may be set to any integer
	       value greater than zero within the range of a 32-bit signed
               integer. The configured value represents an upper limit on the size of
	       the database image in pages. If the application executes a
               database write operation that would normally cause the database image to
	       grow larger than this configured limit, the operation fails and
               an error is returned to the user.
           <td>H50150, H50151, H50152
      [Tr] <td>Cache size
	   <td>The cache-size parameter may be set to any integer value. How it
               affects operation depends on the specific P-Cache implementation used
	       by the page-cache. <span class=todo>Refer to details for the
               behaviour of the built-in default P-Cache.</span>
      [Tr] <td>Safety level
	   <td>The safety-level parameter may be set to "none", "normal" or "full".
               <span class=todo> Where will the effect of this defined/required?</span>
           <td>H50154, H50155

      [fancyformat_import_requirement H50138]
      [fancyformat_import_requirement H50139]
      [fancyformat_import_requirement H50140]

    <p class=todo>
      And if a read/write transaction is downgraded to a read-only transaction?
      This scenario should also be dealt with in section <cite>hlr_transactions</cite>.

      [fancyformat_import_requirement H50141]
      [fancyformat_import_requirement H50142]
      [fancyformat_import_requirement H50143]
      [fancyformat_import_requirement H50144]
      [fancyformat_import_requirement H50145]
      [fancyformat_import_requirement H50146]

    <p class=todo>
      The difference in functionality provided by "off", "memory" and the 3
      modes that use a real journal file should also feature in

      [fancyformat_import_requirement H50147]
      [fancyformat_import_requirement H50148]
      [fancyformat_import_requirement H50149]

      [fancyformat_import_requirement H50150]
      [fancyformat_import_requirement H50151]
      [fancyformat_import_requirement H50152]

      [fancyformat_import_requirement H50153]

      See section <cite>hlr_memory</cite> for a description of and requirements
      specifying how the value of the cache-size parameter affects the
      operation of a page-cache. <span class=todo>Check this reference is
      relevant after it is written. Refer to a specific requirement if possible

      [fancyformat_import_requirement H50154]
      [fancyformat_import_requirement H50155]

    <p class=todo>
      Description of what the safety-level actually does. Or pointer to where a
      description and requirements can be found (transactions section?).

    <p class=todo>
      Interface to set the codec function (encryption).

    <p class=todo>
      The busy-handler. Where exactly does this come in? Transactions and
      savepoints section?

      The six page-cache operational parameters listed above may also be
      queried. The following requirements specify the required query

      [fancyformat_import_requirement H50132]
      [fancyformat_import_requirement H50133]
      [fancyformat_import_requirement H50134]
      [fancyformat_import_requirement H50135]
      [fancyformat_import_requirement H50136]
      [fancyformat_import_requirement H50137]

      It is also possible to interrogate a b-tree database handle to determine
      if it was opened on a temporary or persistent database. An b-tree
      database handle opened on a persistent database may be queried for the
      name of (full-path to) either the database or journal file associated
      with the open database.

      [fancyformat_import_requirement H50131]
      [fancyformat_import_requirement H50129]
      [fancyformat_import_requirement H50130]

    [h3 "Multi-User Database Requirements"]

      [fancyformat_import_requirement H50156]

    <li> Lock on schema memory object.
    <li> Locks on b-tree tables.
    <li> "Unlock notify" feature.
    <li> Mutexes/thread-safety features.

    <p class=todo>
      The b-tree module preventing deadlock (by always grabbing mutexes in
      order of BtShared pointer) should be required here.

    [h3 "Backup/Vacuum API Requirements"]
    <li> Callbacks for backup module.
    <li> Page read/write APIs for backup module.

    [h3 "Integrity Check Requirements"]
    <li> Callbacks for backup module.
    <li> Page read/write APIs for backup module.

  [h2 "Other Requirements and Constraints"]

    [h3 "Caching and Memory Management Requirements" hlr_memory]
    <li> Memory allocation related features (pcache, scratch memory, other...).
    <li> Default pcache implementation (sqlite3_release_memory()).
    <li> Schema memory object allocation (destructor registration).

    [h3 "Exception Handling Requirements"]

    System failure. Do not corrupt the database image.
    Three kinds of exception:
    <li> IO Error.
    <li> Malloc request failure.
    <li> Database image corruption.

    [h3 "Well-Formedness Requirements"]
    <li> Identify the subset of file-format well-formedness requirements that
         this module is responsible for implementing.
    <li> Define how the module should respond to corrupt database files: don't
         crash, return SQLITE_CORRUPT as early as is practical. Should it also
         put the b-tree into a permanent error state?

  [h1 "Module API"]

      <p class=todo>
        Description of the interface in btree.h. Also other interfaces accessed by
        external modules. Including release_memory() and those pager interfaces that
        are accessed directly by other modules. All of these requirements will be
        descended/derived from requirements in the previous sections. Some of the
        text could/should be pulled in from btree.h.

      <p class=todo>
	  The name of sqlite3BtreeBeginStmt() should probably change to
	  sqlite3BtreeOpenSavepoint(). Matches the pager layer and is a more
	  accurate description of the function.

      <p class=todo>
        There are only a few places in which the pager object is used directly,
        always to call some trivial get/set configuration function. These should
	  be replaced somehow with sqlite3BtreeXXX() APIs. Also, the current
	  approach is probably Ok, but worth looking it over for thread-safety

      <p class=todo>
	  It would be easier to write up if the dependency between the B-Tree
        layer and the sqlite3 structure did not exist. At present, it is used for:
          <br> * The unlock-notify feature (arguments to sqlite3ConnectionBlocked() are database handles),
          <br> * Accessing the SQLITE_ReadUncommitted flag,
          <br> * Invoking the busy-handler callback,
          <br> * During sqlite3BtreeOpen(), to find the VFS to use,
          <br> * Accessing the SQLITE_SharedCache flag (for setting it),
          <br> * To check the same B-Tree is not attached more than once in shared-cache mode,
          <br> * To link the B-Tree into the pointer-order list of shared-cache b-trees used by the same handle (used for mutexes).
          <br> * To determine if an in-memory sub-journal should be used.
          <br> * To know how many savepoints are open in BtreeBeginTrans().
          <br> * Many, many times to assert() that the db mutex is held when the b-tree layer is accessed..

    [h2 "Opening and Closing Connections"]

      This section describes the API offered by the B-Tree module to other
      SQLite sub-systems to open and close B-Tree database connections.

      [btree_api_defn Btree]

      A B-Tree database connection is accessed by other SQLite sub-systems
      using an opaque handle, modelled in C code using the type "Btree*".

    [h3 sqlite3BtreeOpen]

      [btree_api_defn sqlite3BtreeOpen] 

      [fancyformat_import_requirement H51001]
      [fancyformat_import_requirement H51002]

      [fancyformat_import_requirement H51003]
      [fancyformat_import_requirement H51004]

      The combination of the above two requirements implies that if the
      zFilename argument passed to sqlite3BtreeOpen is other than a NULL
      pointer or a pointer to a nul-terminated string, the type of or
      filename of the database that sqlite3BtreeOpen attempts to open a
      connection to are undefined.

      Valid values for the <i>flags</i> argument to the sqlite3BtreeOpen
      function consist of the bitwise OR of zero or more of the following


      [fancyformat_import_requirement H51005]

      When opening a connection to a persistent database, the value of the
      BTREE_OMIT_JOURNAL bit in the flags parameter is ignored by

      [fancyformat_import_requirement H51006]

      When opening a connection to a temporary database, the value of the
      BTREE_NO_READLOCK bit in the flags parameter is ignored, as temporary
      databases are never locked for either reading or writing 
      (<span class=todo>reference to some requirement for this statement.</span>).
      Whether or not a new page-cache is created when a connection to a
      persistent database is opened is governed by requirements H50040 and

      [fancyformat_import_requirement H51007]
      [fancyformat_import_requirement H51008]

    <p class=todo>
      Requirements explaining how the db parameter to sqlite3BtreeOpen is used. Must be there for something.

    [h3 sqlite3BtreeClose]

      [btree_api_defn sqlite3BtreeClose]

      [fancyformat_import_requirement H51009]

      If a call to sqlite3BtreeClose is made with a value that is not a valid
      b-tree database connection handle passed as the only argument, the
      results are undefined.

      [fancyformat_import_requirement H51010]

      See also requirement H50070.

    [h2 "Database Image Configuration"]

      <p class=todo>
	This category doesn't work all that well. These APIs are used for other
        things too (i.e. switching to incremental-vacuum mode).

      [btree_api_defn sqlite3BtreeSetAutoVacuum sqlite3BtreeSetPageSize]


      [btree_api_defn sqlite3BtreeGetPageSize]
      [btree_api_defn sqlite3BtreeGetReserve]
      [btree_api_defn sqlite3BtreeGetAutoVacuum]

    [h2 "Connection Configuration"]

      [btree_api_defn sqlite3BtreeSetCacheSize sqlite3BtreeSetSafetyLevel sqlite3BtreeMaxPageCount ]

      And functions to query the current configuration:

      [btree_api_defn sqlite3BtreeSyncDisabled]

    [h2 "Query Interfaces"]

      [btree_api_defn sqlite3BtreeGetFilename]

      [fancyformat_import_requirement H51011]
      [fancyformat_import_requirement H51012]

      [btree_api_defn sqlite3BtreeGetJournalname]

      [fancyformat_import_requirement H51013]
      [fancyformat_import_requirement H51014]

      Requirement H51013 holds true even if the B-Tree database connection is
      configured to use an in-memory journal file or no journal file at all
      (<span class=todo>ref requirements</span>). In these cases the buffer returned
      contains the full-path of the journal file that would be used if the
      connection were configured to use a journal file.

    [h2 "Mutex Functions"]

      [btree_api_defn {typedef struct BtreeMutexArray} {struct BtreeMutexArray \{}]

      [btree_api_defn sqlite3BtreeEnter sqlite3BtreeEnterAll sqlite3BtreeLeave sqlite3BtreeEnterCursor sqlite3BtreeLeaveCursor sqlite3BtreeLeaveAll sqlite3BtreeMutexArrayEnter sqlite3BtreeMutexArrayLeave sqlite3BtreeMutexArrayInsert]

    [h2 "Transaction and Savepoint API"]

      [btree_api_defn sqlite3BtreeBeginTrans sqlite3BtreeCommitPhaseOne sqlite3BtreeCommitPhaseTwo sqlite3BtreeCommit sqlite3BtreeRollback]

      [btree_api_defn sqlite3BtreeBeginStmt sqlite3BtreeSavepoint]

      [btree_api_defn sqlite3BtreeIsInTrans sqlite3BtreeIsInReadTrans sqlite3BtreeIsInBackup]

    [h2 "Reading and Traversing the Database Image"]

      <p class=todo>
	sqlite3BtreeMoveto is never called from outside of the b-tree layer. It
        could/should be removed from the API.

      <p class=todo>
	The "bias" argument to sqlite3BtreeMovetoUnpacked is only ever true
        when it is called from within sqlite3BtreeInsert. This argument could/should 
        also be removed from the API, if only to make it simpler to describe.

      [btree_api_defn BtCursor]

      [btree_api_defn sqlite3BtreeCursor sqlite3BtreeCursorSize]
      [btree_api_defn sqlite3BtreeCloseCursor sqlite3BtreeClearCursor]

      [btree_api_defn sqlite3BtreeFirst sqlite3BtreeLast sqlite3BtreeNext \
                      sqlite3BtreePrevious sqlite3BtreeEof]

      [btree_api_defn sqlite3BtreeKeySize sqlite3BtreeKey sqlite3BtreeKeyFetch \
                      sqlite3BtreeDataFetch sqlite3BtreeDataSize sqlite3BtreeData]

      [btree_api_defn sqlite3BtreeCount]

    [h3 sqlite3BtreeMovetoUnpacked]

        The sqlite3BtreeMovetoUnpacked function is used to 

      [btree_api_defn sqlite3BtreeMovetoUnpacked]

        The following requirements specify exactly how a b-tree cursor is to be moved
        by a successful call to sqlite3BtreeMovetoUnpacked.

      [fancyformat_import_requirement L50008]
      [fancyformat_import_requirement L50009]
      [fancyformat_import_requirement L50010]
      [fancyformat_import_requirement L50011]

      <p class=todo>
	Not clear how to deal with these. Define an external module to
	encapsulate these and define sorting order etc.? That's tricky as things are
	because the UnpackedRecord.flags field defines the "search mode" used
        by sqlite3BtreeMovetoUnpacked.

      [sqliteint_api_defn {typedef struct KeyInfo} {struct KeyInfo \{}]
      [sqliteint_api_defn {typedef struct UnpackedRecord} {struct UnpackedRecord \{}]

    [h3 sqlite3BtreeGetMeta]

	The sqlite3BtreeGetMeta interface may be used to retrieve the current
        value of certain fields from the database image header.

      [btree_api_defn sqlite3BtreeGetMeta]

      [fancyformat_import_requirement H51015]
      [fancyformat_import_requirement H51016]

        The two requirements above imply that if sqlite3BtreeGetMeta is called with
        anything other than a b-tree database connection handle with an open read-only
        or read-write transaction as the first argument, or with anything other than
        an integer between 0 and 7 (inclusive) as the second, the results are undefined.


      [fancyformat_import_requirement H51017]

    [h2 "Modifying the Database Image"]

      [h3 sqlite3BtreeCreateTable sqlite3BtreeCreateTable]
      [btree_api_defn sqlite3BtreeCreateTable]

      [h3 sqlite3BtreeDropTable sqlite3BtreeDropTable]
      [btree_api_defn sqlite3BtreeDropTable ]

      [h3 sqlite3BtreeClearTable sqlite3BtreeClearTable]
      [btree_api_defn sqlite3BtreeClearTable]

      [h3 sqlite3BtreeCursorHasMoved sqlite3BtreeCursorHasMoved]
      [btree_api_defn sqlite3BtreeCursorHasMoved]

      [h3 sqlite3BtreePutData  sqlite3BtreePutData]
      [btree_api_defn sqlite3BtreePutData]

      [h3 sqlite3BtreeUpdateMeta sqlite3BtreeUpdateMeta]
        [btree_api_defn sqlite3BtreeUpdateMeta]

      [h3 sqlite3BtreeDelete sqlite3BtreeDelete]

        [btree_api_defn sqlite3BtreeDelete]

        [fancyformat_import_requirement L50013]

      <p class=todo>
        Effect of a delete operation on other cursors that are pointing to the
        deleted b-tree entry.

      <p class=todo>
        Malloc and IO error handling. Same as for sqlite3BtreeInsert.

      [h3 sqlite3BtreeInsert]

        [btree_api_defn sqlite3BtreeInsert]

        [fancyformat_import_requirement L50001]

        The requirement above implies that the results of passing anything else as 
        the first argument to sqlite3BtreeInsert, for example a read-only b-tree cursor,
        are undefined.

        [fancyformat_import_requirement L50012]

	In other words, the sqlite3BtreeInsert API could easily be renamed
        sqlite3BtreeInsertOrReplace. <span class=todo>We will probably need a module
        requirement for the "replace" operation.</span>

        [fancyformat_import_requirement L50002]
        [fancyformat_import_requirement L50003]
        [fancyformat_import_requirement L50004]

        The following requirements describe the seventh and eighth paramaters passed
        to the sqlite3BtreeInsert function. Both of these are used to provide extra
        information used by sqlite3BtreeInsert to optimize the insert operation. They
        may be safely ignored by alternative b-tree implementations.

      <p class=todo>
        There should be some rationalization for these, eventually. Some traceability
        from somewhere to show how the b-tree module offering these slightly esoteric
        interfaces is helpful to SQLite overall.

        [fancyformat_import_requirement L50005]
        [fancyformat_import_requirement L50006]

        If a non-zero value is passed as the eighth parameter to sqlite3BtreeInsert 
        and the b-tree cursor has not been positioned as assumed by L50006, the
        results are undefined.

      <p class=todo>
        Malloc and IO error handling. Maybe these should be grouped together
        for a whole bunch of APIs. And hook into the above via a definition of
        "successful call".

      [h3 sqlite3BtreeIncrVacuum sqlite3BtreeIncrVacuum]
      [btree_api_defn sqlite3BtreeIncrVacuum]

    [h2 "Advisory B-Tree Locks"] 

	This section describes the b-tree module interfaces used for acquiring
	and querying the advisory locks that can be placed on database image
	pages. The locking mechanisms described in this section are only used
	to arbitrate between multiple clients of the same in-memory page-cache.
	The locking mechanism used to control access to a file-system
	representation of the database when multiple in-memory page caches
	(possibly located in different OS processes) are open on it is
        described in <span class=todo>this</span>.

        As well as obtaining advisory locks explicitly using the 
        sqlite3BtreeLockTable API (see below), a read-lock on page 1 of the
        database image is automatically obtained whenever a b-tree database 
	connection opens a read-only or read-write transaction (see 
        <span class=todo>requirement number</span>). Note that this means
        that a write-lock on page 1 is effectively an exclusive lock on
	the entire page-cache, as it prevents any other connection from opening
        a transaction of any kind.

      [h3 sqlite3BtreeLockTable]

      [btree_api_defn sqlite3BtreeLockTable]

        The sqlite3BtreeLockTable API allows database clients to place 
        advisory read or write locks on a specified page of the database 
        image. The specified page need not exist within the database image.
        By convention, SQLite acquires read and write locks on the root
        pages of table b-trees only, but this is not required to be enforced
        by the b-tree module. Locks may only be obtained when a database
        client has an open transaction. All locks are automatically released
        when the open transaction is concluded.

        [fancyformat_import_requirement L50016]
        [fancyformat_import_requirement L50017]

        The two requirements above imply that the results of calling 
        sqlite3BtreeLockTable on a b-tree database connection handle that does
        not currently have an open transaction, or attempting to obtain
        a write-lock using a b-tree database connection handle that only has
        a read-only transaction open are undefined.

        [fancyformat_import_requirement L50019]
        [fancyformat_import_requirement L50020]

        Requirement L50020 is overly conservative. Because a write-lock may 
        only be requested if the b-tree database connection has an open read-write 
	transaction (L50017), and at most a single b-tree database connection
        may have such an open transaction at one time, it is not possible for
        a request for a write-lock to fail because another connection is holding
        a write-lock on the same b-tree database image page. It may, however,
        fail because another connection is holding a read-lock.

        All locks are held until the current transaction is concluded. Or, if
        a read-write transaction is downgraded to a read-only transaction, all
        currently held write-locks are changed to read-locks.

        [fancyformat_import_requirement L50018]
        [fancyformat_import_requirement L50021]

      <p class=todo> The requirement above should refer to some other
	requirement that defines when a read-write transaction is downgraded to
        a read-only transaction.

      <p class=todo> Malloc failure?

      <p class=todo> Read uncommitted flag. Maybe this should be handled
        outside of the b-tree module. Is there anything to stop connections
        with this flag set simply not obtaining read locks? There are assert()
        statements in the b-tree module that need to take this flag into account,
        but not actual functionality.

      [h3 sqlite3BtreeSchemaLocked]
      [btree_api_defn sqlite3BtreeSchemaLocked]

        [fancyformat_import_requirement L50014]
        [fancyformat_import_requirement L50015]

    [h2 "What do these do?"]

      <p class=todo>
        Where do the following go?

      [btree_api_defn sqlite3BtreeIntegrityCheck sqlite3BtreePager sqlite3BtreeCopyFile]

      [btree_api_defn sqlite3BtreeSchema sqlite3BtreeSchemaLocked sqlite3BtreeLockTable sqlite3BtreeTripAllCursors]

      <p class=todo>
        I know what the following do, but is this mechanism ever used? Or has it been superceded by other tricks in OP_NewRowid?

      [btree_api_defn sqlite3BtreeSetCachedRowid sqlite3BtreeGetCachedRowid]

      <p class=todo>
        Should move to btreeInt.h

      [btree_api_defn BtShared]

    [h2 "APIs not branded sqlite3BtreeXXX()"]

<li> sqlite3PagerLockingMode
<li> sqlite3PagerJournalMode
<li> sqlite3PagerIsMemdb (vacuum and backup).
<li> sqlite3PagerJournalSizeLimit
<li> sqlite3PagerFile (used by sqlite3_file_control() and pragma lock_proxy_file).
<li> sqlite3PagerPagecount (pragma page_count and backup).
<li> Page APIs used by backup routines:
    <li> sqlite3PagerGet
    <li> sqlite3PagerWrite
    <li> sqlite3PagerGetData
    <li> sqlite3PagerGetExtra
    <li> sqlite3PagerUnref
    <li> sqlite3PagerTruncateImage
    <li> sqlite3PagerSync
    <li> sqlite3PagerFile
    <li> sqlite3PagerCommitPhaseOne, sqlite3PagerCommitPhaseTwo
    <li> sqlite3PagerBackupPtr

  [h1 "Module Implementation"]

  [h2 "Database Image Traversal"] 

  [h2 "Database Image Manipulation"]

     <p class=todo>
       This section should describe exactly how bits and bytes are shifted
       around when the database image is traversed and modified. i.e. how the b-tree 
       balancing works, deleting an internal cell from an index b-tree etc.

    [h3 "Creating a B-Tree Structure"]
    [h3 "Clearing a B-Tree Structure"]
    [h3 "Deleting a B-Tree Structure"]

    [h3 "Inserting, Replacing and Deleting B-Tree Entries"]

        The following two sections describe the way entries are added and removed
        from B-Tree structures within a database image. 

        As one might expect, the algorithms described in the following sections
        involve adding and removing b-tree cells to and from b-tree node pages.
        The format of b-tree node pages is described in detail in 
        <cite>ref_file_format</cite>. This document does not describe the exact
        way in which content is manipulated within a page, as these details are
        considered not considered high-level enough to be documented outside of
        the SQLite source code itself. For the purposes of the descriptions in
        the following sections, a b-tree node page is considered to be a container
        for an ordered list of b-tree cells. Cells may be inserted into or removed
        from any position in the ordered list as required.

	A b-tree node page has a finite capacity. If one of the algorithms
	described here is required to insert a cell into a b-tree node page,
	and there is not enough free space within the page to accommodate the
	cell, it is still nominally inserted into the requested position within
        the node, but becomes an overflow cell. Overflow cells never remain so
        for very long. If an insert, replace or delete entry operation creates
        one or more overflow cells, the b-tree structure is rearranged so that
        all cells are stored within the body of a b-tree node page before the
        operation is considered complete. This process of rearranging the b-tree
        structure is termed b-tree balancing, and is described in section 

    [h4 "B-Tree Insert/Replace Entry"]

        This section describes the way in which new entries may be inserted 
        into a b-tree structure, and how existing entries may be replaced. Both
        of these operations are accessed using the sqlite3BtreeInsert API.

        An insert/replace operation involves the following steps:

        <li> Based on the supplied key and value, and the type of b-tree being
             inserted into, allocate and populate any required overflow pages.
             <span class=todo>Should reference file-format requirements that
             provide the formula for doing this.</span>

        <li> Attempt to move the b-tree write cursor to an entry with a key
             that matches the new key being inserted. If a matching entry is 
             found, then the operation is a replace. Otherwise, if the key is
             not found, an insert.

        <ol type="a">
          <li> Requirements L50008, L50009, L50010 and L50011 apply to the cursor
               seek operation here. This ensures that if the search does not find
	       an exact match, the cursor is left pointing to the leaf page that 
               the new entry should be added into.

          <li> As specified by L50006, the cursor may already be positioned. In 
               this case the seek operation is not required.

        <li> If a matching key was found in the b-tree, then it must be removed and
             the new entry added in its place.

        <ol type="a">
          <li> If there are one or more overflow pages associated with the entry
               being replaced, they are moved to the free-list.
          <li> The cell corresponding to the entry being removed is removed from
               the b-tree node page.
          <li> The new cell is inserted in the position previously occupied by the
               cell removed in the previous step. If the page is not a leaf page,
	       then the first four-bytes (the child-page pointer) of the old
               cell are copied to the first four bytes of the new cell. If the new
               cell is larger than the cell that it replaced, then it may become
               an overflow cell.

        <li> If no matching key was found in the b-tree, then the new cell is inserted
             into the leaf page that the cursor was left pointing to by step 1. The
             new cell may become an overflow cell.

        <li> If the new cell is now an overflow cell, then the balancing algorithm 
             (see section <cite>btree_balancing_algorithm</cite>) is run on the 
             overflowing b-tree node page.

    [h4 "B-Tree Delete Entry"]

        This section describes the way in which entries may be removed from
	a b-tree structure, as required when the sqlite3BtreeDelete (section
        <cite>sqlite3BtreeDelete</cite>) API is invoked. Removing an entry
        from a b-tree table involves the following steps:

        <li> All overflow pages in the overflow page chain (if any) associated
             with the entry must be moved to the database free-list. If the
             database image is an autovacuum database, the pointer-map entries
             that correspond to each overflow page in the chain must be updated.
        <li> The b-tree cell corresponding to the entry must be removed from
             the b-tree structure.

      <p class=todo>
	Note about the optimization that makes it possible to move overflow pages
        to the free-list without reading their contents (i.e. without loading them
        into the cache).

        If the b-tree entry being removed is located on a leaf page (as is always the
        case with table b-tree structures), then deleting an entry from a b-tree
        is quite simple.

      [Figure btreemodule_delete1.svg figure_delete1 "Delete from an Internal Node"]

    [h3 "B-Tree Balancing Algorithm" btree_balancing_algorithm]

       <li><p>The <b>balance deeper</b> sub-algorithm is used when the root page of
           a b-tree is overfull. It creates a new page and copies the
           entire contents of the overfull root page to it. The root page
           is then zeroed and the new page installed as its only child.
           The balancing algorithm is then run on the new child page (in case
           it is overfull).

       <li><p>The <b>balance shallower</b> sub-algorithm is used when the root page 
           of a b-tree has only a single child page. If possible, the data from
           the child page is copied into the root-page and the child page discarded.

       <li><p>The <b>balance quick</b> sub-algorithm is used in a very specific,
           but common scenario. It is used only for table b-trees, when a new entry that
           has a key value greater than all existing keys in the b-tree is inserted and
           causes the right-most leaf page of the b-tree structure to become overfull.

       <li><p>The <b>balance siblings</b> sub-algorithm is run when a b-tree page that
           is not the root-page of its b-tree structure is either overfull or underfull.


    [h4 "Balance Deeper"]
        <li> Allocate a new page (the child-page).
        <li> Copy page data from root-page to child-page (including overflow cells).
        <li> Fix pointer map entries associated with new child-page content.
        <li> Zero the root-page.
        <li> Set the right-child pointer of the root-page to point to the new child-page.
        <li> Set the pointer map entry for the new child page.
        <li> Execute the balance procedure on the new child page.

      [Figure btreemodule_balance_deeper.svg figure_balance_deeper "Example Balance Deeper Transform"]

    [h4 "Balance Shallower"]

        <li> Copy node data from child-page to root-page.
        <li> Fix pointer map entries associated with new root-page content.
        <li> Move child-page to database image free-list.

      [Figure btreemodule_balance_shallower.svg figure_balance_shallower "Example Balance Shallower Transform"]

    [h4 "Balance Quick"]

        <li> Allocate a new page (the new sibling-page).

        <li> Populate the new sibling page with the new b-tree entry.

	<li> Add a new divider cell to the parent. The divider cell contains a
             pointer to the page that is currently the right-child of the parent.
             The key in the new divider cell is a copy of the largest key in the
             page that is currently the right-child of the parent.

        <li> Set the right-child of the parent page to point to the new sibling page.

	<li> If the database is an auto-vacuum database, set the pointer map
	     entry associated with the new sibling page. If the cell on the new
             sibling page contains a pointer to an overflow page, set the pointer map
	     entry associated with the overflow page.

        <li> Execute the balance procedure on the parent page.

      [Figure btreemodule_balance_quick.svg figure_balance_quick "Example Balance Quick Transform"]

    [h4 "Balance Siblings" balance_siblings]

      [fancyformat_import_requirement L51001]

    <p class=todo>
      The following description describes how balance() is to be implemented. This
      represents (I think) the lowest level of detail that should be in this document.
      One skilled in the art could use this description to reimplement SQLite's
      balance-siblings algorithm. We also need requirements at a higher level
      of detail in this section. Something to test!

      The balance-siblings algorithm, as implemented by SQLite, is described as
      a series of steps below.  <span class=todo> there are a few terms used
      below that need definitions/clarifications.</span>

        <li> Determine the set of sibling pages to redistribute the cells of, using 
             the following rules:
        <ol type="a">
          <li> If the parent page has three or fewer child pages, then all child 
               pages are deemed to be sibling pages for the purposes of the balance-siblings
	  <li> If the page being balanced is the left-most child of the parent
               page, then the three left-most child pages are used as the siblings.
	  <li> If the page being balanced is the right-most child of the parent
               page, then the three right-most child pages are used as the siblings.
	  <li> Otherwise, if none of the above three conditions are true, then the
               sibling pages are page being balanced and the child pages immediately
               to the left and right of it.

        <li> Determine an ordered list of cells to redistribute. There are several
             variations of this step, depending on the type of page being balanced.
        <ol type="a">
	  <li> If the page being balanced is a leaf page of a table b-tree,
	       then the list of cells to redistribute is simply the concatenation
               of the ordered lists of cells stored on each sibling page, in order
               from left-most sibling to right-most.
	  <li> If the page being balanced is a leaf page of an index b-tree, then 
               the list of cells to redistribute is comprised of the cells on each
               of the sibling pages and the divider cells in the parent page that 
               contain the pointers to each sibling page except the right-most. The
               list is arranged so that it contains: 
             <li> The cells from the left-most sibling page, in order, followed by
             <li> the divider cell from the parent page that contains the pointer 
		  to the left-most sibling (if there is more than one sibling
                  page), followed by
	     <li> the divider cell that contains the pointer to the second left-most
                  sibling and the cells from the remaining sibling page (if there are three
                  sibling pages).
	  <li> If the page being balanced is an internal b-tree node, then the list of
               cells to redistribute is determined as described in the previous case.
               However, when balancing an internal node each cell is associated with
               the page number of a child page of one of the sibling pages. The page 
               number associated with cells stored on a sibling page is the same as
               the page number stored as the first four bytes of the cell. The page
               number associated with a divider cell within the parent page is the page
               number of the right-child page of the sibling page to which the divider
               cell contains a pointer.

        <li> Determine the new cell distribution, using the following steps:
        <ol type="a">
          <li> Assign as may cells as will fit from the start of the ordered list of 
               cells to the left-most sibling page. Then, if any cells remain, assign
               one to be a divider cell, and as many as will fit to the next sibling 
	       page. Repeat until all cells have been assigned a location.
               <span class=todo> no divider cells for table b-tree leaf balance</span>

          <li> The previous step generates a distribution that is biased towards the
	       left-hand side. The right-most sibling may even be completely
	       empty (if the last cell in the ordered list was assigned to be a
               divider cell). To rectify this, cells are moved out of the second 
               right-most sibling page and into the right-most, one at a time, until
               there is at least one cell in the right-most sibling page and to move
               another cell would mean that the right-most sibling page is more full
               than the next to right-most sibling page. This is repeated for the next
               right-most pair of sibling pages, shifting cells out of the third 
               right-most sibling page and into the second right-most, and so on.
               <span class=todo> note about divider cells </span>

        <li> Determine the set of database pages to use as the new sibling pages. 

        <ol type="a">
	   <li> If there were an equal or greater number of siblings identified
                in step 1 than are required by the distribution calculated in step 3, 
                reuse as many as possible, starting with the left-most. If step 3
                calculated a distribution that requires more sibling pages than were
                identified in step 1, allocate the required extra pages using the
                <span class=todo>Refer to ???</span> algorithm.

	   <li> Arrange the new sibling pages from left to right in ascending
                page number order. The new sibling page with the smallest page number
                becomes the left-most sibling page, and so forth.

        <li> Populate the new sibling pages.
        <ol type="a">
	   <li> Populate each new sibling page with the required set of cells. If the
                page being balanced is not a leaf page, then the child-page pointer
                field of each cell is populated with the page-number associated with
                the cell as part of step 2 above. 

	   <li> If the page being balanced is not a leaf page, then the right-child 
                pointer stored in the page header of each new sibling page must also
                be populated. For each new sibling page except the right-most, this 
                field is set to the page number associated with the cell that 
                immediately follows the cells stored on the page (the cell that was
                assigned to be divider cell in step 3). For the right-most sibling page,
                the right-child pointer is set to the value that was stored in the
                right-child pointer of the right-most original sibling page identified
                in step 1.
        <li> Populate the parent page.
        <ol type="a">
	  <li> If the page being balanced is (was) not a leaf page of a table
	       b-tree, the cells that contained pointers to the old sibling
               pages are replaced by the cells designated as divider cells as part
               of step 3. The right-child pointer field of the first divider cell
               is overwritten with the page number of the first new sibling page, and 
               so on.

	  <li> If the page being balanced is (was) a leaf page of a table
	       b-tree, the cells that contained pointers to the old sibling
               pages are replaced by a divider cell associated with all but the
               right-most sibling page. The child-page number stored in each divider
               cell is set to the page number of the associated sibling. The integer key 
               value stored in each divider cell is a copy of the largest integer key
               value stored on the associated sibling page.

	  <li> Before balancing, the parent page contained a pointer to the right-most
               sibling page, either as part of a cell or as the right-child pointer
               stored in the page header. Either way, this value must be overwritten
               with the page number of the new right-most sibling page.

        <li> Populate pointer map entries.
        <ol type="a">
          <li> For each sibling page that was not also an original sibling page, the
               associated pointer-map entry must be updated. Similarly, the pointer-map
               entry associated with each original sibling page that is no longer a
               sibling page must be updated.
          <li> For each cell containing an overflow pointer that has been moved from one 
               page to another, the pointer-map entry associated with the overflow page 
               must be updated.
          <li> If the page being balanced is (was) not a leaf, then for each cell that
               has moved from one page to another the pointer-map entry associated with
               the cell's child page must be updated.
          <li> If the page being balanced is (was) not a leaf, then the pointer-map entry
               associated with each sibling's right-child page may need to be updated.

    [h3 "Page Allocation and Deallocation"]

     <p class=todo>
       Amongst other things, this section needs to explain our old pals the
       DontWrite() and DontRollback() optimizations.

    [h4 "Moving an overflow-chain to the free-list" free_overflow_chain]

     <p class=todo>
       Describe how this can sometimes be done without reading the content of
       overflow pages.

    [h3 "Incremental Vacuum Step"]

  [h2 "Transactions and Savepoints"]

     <p class=todo>
       Requirements surrounding how transactions are made atomic and isolated.
       Also how savepoints are implemented. What happens to active cursors after
       a rollback or savepoint-rollback.

  [h1 References]

  <table id="refs" style="width:auto; margin: 1em 5ex">
    [Ref 1 ref_file_format {
      SQLite Online Documentation,<u>SQLite Database File Format</u>,
      <a href="fileformat.html"></a>.
    [Ref 2 ref_pcache_interface {
      SQLite Online Documentation,<u>Application Defined Page Cache</u>,
      <a href="c3ref/pcache_methods.html"></a>.
    [Ref 3 ref_os_interface {
      SQLite Online Documentation,<u>OS Interface Object</u>,
      <a href="c3ref/vfs.html"></a>.