SQLite4: Artifact [936bc43cc4]

Artifact 936bc43cc48b33f2397e4c46bad946752d0501e0:

File www/lsmusr.wiki — part of check-in [9afc42d70d] at 2013-02-08 15:22:56 on branch trunk — Avoid extending the database file when truncating it to the minimum number of blocks required during system shutdown. (user: dan size: 63662) [more...]

<title>LSM Users Guide</title>
<nowiki>

<h2>Table of Contents</h2>










<div id=start_of_toc></div>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#introduction_to_lsm style=text-decoration:none>1. Introduction to LSM</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#using_lsm_in_applications style=text-decoration:none>2. Using LSM in Applications </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#basic_usage style=text-decoration:none>3. Basic Usage</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#opening_and_closing_database_connections style=text-decoration:none>3.1. Opening and Closing Database Connections </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#writing_to_a_database style=text-decoration:none>3.2. Writing to a Database </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#reading_from_a_database style=text-decoration:none>3.3. Reading from a Database </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#database_transactions_and_mvcc style=text-decoration:none>3.4. Database Transactions and MVCC </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#data_durability style=text-decoration:none>4. Data Durability </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#compressed_and_encrypted_databases style=text-decoration:none>5. Compressed and Encrypted Databases </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#performance_tuning style=text-decoration:none>6. Performance Tuning</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#overview_of_lsm_architecture style=text-decoration:none>6.1. Overview of LSM Architecture</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#performance_related_configuration_options style=text-decoration:none>6.2. Performance Related Configuration Options </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#work_and_checkpoint_scheduling style=text-decoration:none>6.3. Work and Checkpoint Scheduling</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#automatic_scheduling style=text-decoration:none>6.3.1. Automatic Scheduling</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#explicit_scheduling style=text-decoration:none>6.3.2. Explicit Scheduling</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#compulsary_work_and_checkpoints style=text-decoration:none>6.3.3. Compulsary Work and Checkpoints</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#database_file_optimization style=text-decoration:none>6.4. Database File Optimization</a><br>

<div id=end_of_toc></div>

<h2>Overview</h2>

<p>This document describes the LSM embedded database library and use thereof.
It is part user-manual and part tutorial. It is intended to complement the 
<a href=lsmapi.wiki>LSM API reference manual</a>.

<p>The <a href=#introduction_to_lsm>first section</a> of this document contains
a description of the LSM library and its features. 
<a href=#using_lsm_in_applications>Section 2</a> describes how to use LSM from
within a C or C++ application (how to compile and link LSM, what to #include
etc.). The <a href=#basic_usage>third section</a> describes the essential APIs
that applications use to open and close database connections, and to read from
and write to databases. 

<p>The three sections described above contain all the information required to
create applications that use LSM. The remaining sections discuss more
specialized topics. <a href=#data_durability>Section 4</a> discusses the
configuration parameter that influences transaction durability (the guarantees
offered with respect to recently committed transactions if a power failure 
occurs). <a href=#compressed_and_encrypted_databases>Section 5</a> explains
the interface provided by LSM that allows external data compression and/or
encryption functions to be used to create compressed and/or encrypted
databases. <a href=#performance_tuning>Section 6</a> deals with performance 
tuning.

<h1 id=introduction_to_lsm>1. Introduction to LSM</h1>

<p>LSM is an embedded database library for key-value data, roughly similar
in scope to
<a href="http://www.oracle.com/technetwork/products/berkeleydb/overview/index.html">Berkeley DB</a>, 
<a href="http://code.google.com/p/leveldb/">LevelDB</a> or
<a href="http://fallabs.com/kyotocabinet/">KyotoCabinet</a>.
Both keys and
values are specified and stored as byte arrays. Duplicate keys are not 
supported. Keys are always sorted in memcmp() order. LSM supports the following
operations for the manipulation and query of database data:

<ul>
  <li> Writing a new key and value into the database.
  <li> Deleting an existing key from the database.
  <li> Deleting a range of keys from the database.
  <li> Querying the database for a specific key.
  <li> Iterating through a range of database keys (either forwards or
       backwards).
</ul>

<p>Other salient features are:

<ul>
  <li><p>A <b>single-writer/multiple-reader MVCC</b> based transactional
      concurrency model. SQL style nested sub-transactions are supported. 
      Clients may concurrently access a single LSM database from within a 
      single process or multiple application processes. 

  <li><p>An entire database is stored in a <b>single file on disk</b>. 

  <li><p>Data <b>durability in the face of application or power failure</b>.
      LSM may optionally use a write-ahead log file when writing to the
      database to ensure committed transactions are not lost if an application
      or power failure occurs.

  <li>An API that <b>allows external data compression and/or encryption 
      routines to be used </b> to create and access compressed and/or 
      encrypted databases.
</ul>
    

<p>Many database systems that support range queries, including <a
href=http://www.sqlite.org>SQLite 3</a>, Berkeley DB and Kyoto Cabinet, are
based on one of many variants of the 
<a href="http://en.wikipedia.org/wiki/B-tree">b-tree data structure</a>.
B-trees are attractive because a b-tree structure minimizes the number of disk
sectors that must be read from disk when searching the database for a specific
key. However, b-tree implementations usually suffer from poor write
localization - updating the contents of a b-tree often involves modifying the
contents of nodes scattered throughout the database file. If the database is
stored on a spinning disk (HDD), then the disk heads must be moved between
writing non-contiguous sectors, which is extremely slow. If the database is
stored on solid state storage (SDD) a similar phenomena is encountered due to
the large erase-block sizes. In general, writing to a series of contiguous disk
sectors is orders of magnitude faster than updating to the same number of disk
sectors scattered randomly throughout a large file. Additionally, b-tree
structures are prone to fragmentation, reducing the speed of range queries.

<p><i>TODO: fix the link in the next paragraph to point to a description
of the log-structured-merge tree within lsm.wiki (or its successor).</i>

<p>LSM uses a <a href=lsm.wiki>different data structure</a> that makes the
following performance tradeoffs relative to a b-tree:

<ul>
  <li> A very large percentage of the disk sectors modified when writing to
       the database are contiguous.
       Additionally, in many cases the total number of sectors written
       to disk is reduced. This makes writing to an LSM database faster 
       than the equivalent b-tree.

  <li> LSM databases do not suffer from fragmentation to the same degree
       as b-trees. This means that the performance of large range queries 
       does not degrade as the database is updated as it may with a b-tree.

  <li> Under some circumstances searching an LSM database for a given key will
       involve examining more disk sectors than it would with a b-tree. In
       terms of disk sectors accessed when searching a database of size N, 
       both b-trees and LSM provide O(log(N)) efficiency, but the base of 
       the logarithm is generally larger for a b-tree than for LSM.
</ul>

<p>In other words, writing to an LSM database should be very fast and scanning
through large ranges of keys should also perform well, but searching the
database for specific keys may be slightly slower than when using a b-tree
based system. Additionally, avoiding random writes in favour of largely
contiguous updates (as LSM does) can significantly reduce the wear on SSD or
flash memory devices.

<p>Although it has quite different features to LSM in other respects, 
LevelDB makes similar performance tradeoffs.

<p>Benchmark test results for LSM are <a href=lsmperf.wiki>available here</a>.

<h1 id=using_lsm_in_applications>2. Using LSM in Applications </h1>

<p>LSM is not currently built or distributed independently. Instead, it
is part of the SQLite4 library. To use LSM in an application, the application
links against libsqlite4 and includes the header file "lsm.h" in any files
that access the LSM API.

<p><span style=color:red>Pointer to build instructions for sqlite4</span>

<h1 id=basic_usage>3. Basic Usage</h1>

<h2 id=opening_and_closing_database_connections>3.1. Opening and Closing Database Connections </h2>

<p>Opening a connection to a database is a two-step process. The 
<a href=lsmapi.wiki#lsm_new>lsm_new()</a> function is used to create a new
database handle, and the <a href=lsmapi.wiki#lsm_open>lsm_open()</a> function 
is used to connect an existing database handle to a database on disk. This
is because some database connection properties may only be configured
before the database is opened. In that case, one or more calls to the 
<a href=lsmapi.wiki#lsm_config>lsm_config()</a> method are made between the
calls to lsm_new() and lsm_open().

<p>The functions are defined as follows:

<verbatim>
  int lsm_new(lsm_env *env, lsm_db **pDb);
  int lsm_open(lsm_db *db, const char *zFile);
</verbatim>

<p>Like most lsm_xxx() functions that return type int (the exception is
<a href=lsmapi.wiki#lsm_csr_valid>lsm_csr_valid()</a>), both of the above 
return LSM_OK (0) if successful, or an <a href=lsmapi.wiki#LSM_ERROR>LSM 
error code</a> otherwise. The first argument to lsm_new() may be passed either
a pointer to a <a href=lsmapi.wiki#lsm_env>database environment object</a>
or NULL. Almost all applications should pass NULL. A database environment
object allows the application to supply custom implementations of the various
operating system calls that LSM uses to read and write files, allocate heap 
memory, and coordinate between multiple application threads and processes.
This is normally only required if LSM is being used on a platform that is not
supported by default. Passing NULL instructs the library to use the default
implementations of all these things. The second argument to lsm_new() is an
output variable. Assuming the call is successful, *pDb is set to point to the
new database handle before returning.

<p>The first argument passed to lsm_open() must be an existing database 
handle. The second is the name of the database file to connect to. Once
lsm_open() has been successfully called on a database handle, it can not be
called again on the same handle. Attempting to do so is an LSM_MISUSE error.

<p>For example, to create a new handle and connect it to database "test.db"
on disk:
<verbatim>
  int rc;
  lsm_db *db;

  /* Allocate a new database handle */
  rc = lsm_new(0, &db);
  if( rc!=LSM_OK ) exit(1);

  /* Connect the database handle to database "test.db" */
  rc = lsm_open(db, "test.db");
  if( rc!=LSM_OK ) exit(1);
</verbatim>

<p>A database connection can be closed using the lsm_close() function. Calling
lsm_close() disconnects from the database (assuming lsm_open() has been
successfully called) and deletes the handle itself. Attempting to use a
database handle after it has been passed to lsm_close() results in undefined
behaviour (likely a segfault).

<verbatim>
  rc = lsm_close(db);
</verbatim>

<p>It is important that lsm_close() is called to close all database handles
created with lsm_new(), particularly if the connection has written to the
database. If an application writes to the database and then exits without
closing its database connection, then subsequent clients may have to run
"database recovery" when they open the database, slowing down the lsm_open() 
call. Additionally, not matching each successful lsm_new() call with a call 
to lsm_close() is a resource leak.

<p>Counter-intuitively, an lsm_close() call may fail. In this case the database
handle is not closed, so if the application exits it invites the "database
recovery" performance problem mentioned above. The usual reason for an
lsm_close() call failing is that the database handle has been used to create
<a href=lsmapi.wiki#lsm_csr_open>database cursors</a> that have not been 
closed. Unless all database cursors are closed before lsm_close() is called,
it fails with an LSM_BUSY error and the database handle is not closed.

<h2 id=writing_to_a_database>3.2. Writing to a Database </h2>

<p>Three API functions are used to write to the database:

<ul>
  <li> <b>lsm_insert()</b>: insert a new key/value pair into the database,
       overwriting any existing entry with the same key.
  <li> <b>lsm_delete()</b>: remove a specific key from the database.
  <li> <b>lsm_delete_range()</b>: remove an open-ended range (one that does not
       include its endpoints) of keys from the database. 
</ul>

<p>Each of these functions returns LSM_OK (0) if successful, or an LSM error
code otherwise (some non-zero value).

<p>The following example code inserts a key/value pair into the database. The
key is a 1-byte blob consisting of the value 0x61, and the value is a 3 byte
blob consisting of 0x6F, 0x6E and 0x65, in that order. An application might
interpret these as utf-8 or ASCII codepoints, but LSM treats them as opaque
blobs of data.
<verbatim>
  rc = lsm_insert(db, "a", 1, "one", 3);
</verbatim>

<p>Remove the entry with the key "a" (single 0x61 octet) from the database:

<verbatim>
  rc = lsm_delete(db, "a", 1);
</verbatim>

<p>Remove all entries with keys that are greater than "c" but less than "f".
In this context key blobs are compared in the normal way - using memcmp(), 
with longer keys being considered larger than their prefixes.

<verbatim>
  rc = lsm_delete_range(db, "c", 1, "f", 1);
</verbatim>

<p>The example above removes all keys between "c" and "f", but does not
remove the endpoints "c" and "f" themselves. To do this, requires three
separate calls - one to remove the open-ended range of keys and two to 
remove the two endpoints. As follows:

<verbatim>
  /* Should be checking return codes! */
  lsm_delete(db, "c", 1);
  lsm_delete_range(db, "c", 1, "f", 1);
  lsm_delete(db, "f", 1);
</verbatim>

<h2 id=reading_from_a_database>3.3. Reading from a Database </h2>

<p>All data read from an LSM database is read via a cursor handle. Cursor
handles are opened using the 
<a href=lsmapi.wiki#lsm_csr_open>lsm_csr_open()</a> API, as follows:

<verbatim>
  lsm_csr *csr;
  rc = lsm_csr_open(db, &csr);
</verbatim>

<p>Once an application has finished using a database cursor, it must be closed
using the <a href=lsmapi.wiki#lsm_csr_close>lsm_csr_close()</a> API. The
lsm_csr_close() function does not return any value. It cannot fail.

<verbatim>
  lsm_csr_close(csr);
</verbatim>

<p>Database cursors support the following functions for positioning the cursor:

<ul>
  <li> <b>lsm_csr_seek()</b> - move the cursor to point to a nominated 
       database key.
  <li> <b>lsm_csr_first()</b> - move the cursor to point to the first entry in
       the database (the one with the smallest key).
  <li> <b>lsm_csr_last()</b> - move the cursor to point to the last entry in
       the database (the one with the largest key).
  <li> <b>lsm_csr_next()</b> - move the cursor to point to the next entry in 
       the the database.
  <li> <b>lsm_csr_prev()</b> - move the cursor to point to the previous entry
       in the database.
</ul>

<p>Once a cursor has been positioned, it supports the following functions
for retrieving the details of the current entry:

<ul>
  <li> <b>lsm_csr_valid()</b> - determine whether or not the cursor currently
       points to a valid entry.
  <li> <b>lsm_csr_key()</b> - retrieve the key associated with the
       database entry the cursor points to.
  <li> <b>lsm_csr_value()</b> - retrieve the value associated with the
       database entry the cursor points to.
  <li> <b>lsm_csr_cmp()</b> - compare a key supplied by the application with
       the key associated with the entry the cursor points to. 
</ul>

<p>The following example demonstrates using the 
<a href=lsmapi.wiki#lsm_csr_seek>lsm_csr_seek()</a> function to search the
database for a specified key, 
<a href=lsmapi.wiki#lsm_csr_valid>lsm_csr_valid()</a> to check if the search
was successful, and <a href=lsmapi.wiki#lsm_csr_value>lsm_csr_value()</a> 
to retrieve the value associated with the key within the database.

<verbatim>
  rc = lsm_csr_seek(csr, "b", 1, LSM_SEEK_EQ);
  if( lsm_csr_valid(csr) ){
    const void *pVal; int nVal;

    rc = lsm_csr_value(csr, &pVal, &nVal);
    if( rc==LSM_OK ){
      /* pVal now points to a buffer nVal bytes in size containing the
      ** value associated with database key "b".  */
    }
  }
</verbatim>

<p> The example code below iterates forwards through all entries (in key 
order, from smallest to largest) in the database. Function 
<a href=lsmapi.wiki#lsm_csr_first>lsm_csr_first()</a> is used to position
the cursor to point to the first entry in the database, and 
<a href=lsmapi.wiki#lsm_csr_next>lsm_csr_next()</a> is used to advance to
the next entry. After lsm_csr_next() is called to advance past the final
entry in the database, the cursor is left pointing to no entry at all,
lsm_csr_valid() returns 0, and the loop is finished. API function 
<a href=lsmapi.wiki#lsm_csr_key>lsm_csr_key()</a> is used to retrieve the
key associated with each database entry visited.

<verbatim>
  for(rc=lsm_csr_first(csr); rc==LSM_OK && lsm_csr_valid(csr); rc=lsm_csr_next(csr)){
    const void *pKey; int nKey;
    const void *pVal; int nVal;

    rc = lsm_csr_key(csr, &pKey, &nKey);
    if( rc==LSM_OK ) rc = lsm_csr_value(csr, &pVal, &nVal);
    if( rc==LSM_OK ) break;

    /* At this point pKey points to the current key (size nKey bytes) and
    ** pVal points to the corresponding value (size nVal bytes).  */
  }
</verbatim>

<p> The example code above could be modified to iterate backwards through
the entries in the database (again in key order, but this time from largest
to smallest) by replacing the call to lsm_csr_first() with 
<a href=lsmapi.wiki#lsm_csr_last>lsm_csr_last()</a> and the call to lsm_csr_next() with <a href=lsmapi.wiki#lsm_csr_prev>lsm_csr_prev()</a>.

<p>The signature of lsm_csr_seek() is:

<verbatim>
  int lsm_csr_seek(lsm_cursor *csr, const void *pKey, int nKey, int eSeek);
</verbatim>

<p>The second and third arguments passed to lsm_csr_seek() define the key
to search the database for (pKey must point to the buffer containing the
nKey byte key when this function is called). Assuming no error occurs, if 
there an entry with the requested key is present in the database, the cursor 
is left pointing to it. Otherwise, if no such entry is present, the final
position of the cursor depends on the value passed as the fourth parameter
to lsm_csr_seek(). Valid values for the fourth parameter to lsm_csr_seek()
are:

<dl>
  <dt> LSM_SEEK_EQ
  <dd> <p style=margin-top:0>
       In this case, if the specified key is not present in the database, the
       cursor is not left pointing to any database entry (i.e. calling
       lsm_csr_valid() returns 0).

  <dt> LSM_SEEK_LE
  <dd> <p style=margin-top:0>
       If the specified key is not present in the database and the fourth
       argument to lsm_csr_seek() is LSM_SEEK_LE (Less than or Equal), the
       cursor is left pointing to the database entry with the largest key
       that is less than the specified key. Or, if there are no entries in
       the database with keys smaller than the specified key, the cursor is
       left pointing to no entry at all.

  <dt> LSM_SEEK_GE
  <dd> <p style=margin-top:0>
       If the specified key is not present in the database and the fourth
       argument to lsm_csr_seek() is LSM_SEEK_GE (Greater than or Equal), the
       cursor is left pointing to the database entry with the smallest key
       that is greater than the specified key. Or, if there are no entries 
       in the database with keys larger than the specified key, the cursor is
       left pointing to no entry at all.
  <dd> <p style=margin-top:0>
</dl>

<p> Calls made to lsm_csr_seek() with LSM_SEEK_EQ as the final argument are
slightly more efficient than those made specifying LSM_SEEK_LE or LSM_SEEK_GE.
So to retrieve a specific entry from a database, LSM_SEEK_EQ should be
preferred. The other two values are primarily useful for implementing
range queries. For example, to iterate backwards through all keys from "ggg" 
to "cc", inclusive:

<verbatim>
  for(rc = lsm_csr_seek(csr, "ggg", 3, LSM_SEEK_LE); lsm_csr_valid(csr); rc = lsm_csr_prev(csr)){
    const void *pKey; int nKey;
    const void *pVal; int nVal;
    int res;

    /* Compare the key that the cursor currently points to with "cc". If
    ** the cursor key is less than "cc", break out of the loop. */
    rc = lsm_csr_cmp(csr, "cc", 2, &res);
    if( rc!=LSM_OK || res<0 ) break;

    rc = lsm_csr_key(csr, &pKey, &nKey);
    if( rc==LSM_OK ) rc = lsm_csr_value(csr, &pVal, &nVal);
    if( rc!=LSM_OK ) break;

    /* At this point pKey points to the current key (size nKey bytes) and
    ** pVal points to the corresponding value (size nVal bytes).  */
  }
</verbatim>

<p>In the example code above, the call to lsm_csr_seek() positions the cursor
to point to the entry with key "ggg", if it exists, or to the largest entry
in the database with a key smaller than "ggg" if such a key can be found,
or to EOF otherwise. The lsm_csr_prev() call advances the cursor to the 
next entry in the database file (in key order from largest to smallest), and
the lsm_csr_valid() call returns 0 to break out of the loop once the
cursor is advance past the entry with the smallest key in the database. So
on its own, the "for" statement serves to iterate the cursor in reverse order 
through all keys in the database less than or equal to "ggg".

<p>The call to <a href=lsmapi.wiki#lsm_csr_cmp>lsm_csr_cmp()</a> call in the
body of the loop is used to enforce the lower bound (keys >= "cc") on the 
range query by breaking out of the loop if an entry with a key smaller than 
"cc" is ever visited. lsm_csr_cmp() has the following signature:

<verbatim>
  int lsm_csr_cmp(lsm_cursor *csr, const void *pKey, int nKey, int *piRes);
</verbatim>

<p> When lsm_csr_cmp() is called, the key specified by the second and third
arguments (pKey and nKey) is compared to the database key that the cursor
currently points to. Assuming no error occurs, depending on whether or not the
cursors key is less than, equal to, or greater than the specified key, *piRes
is set to a value less than, equal to, or greater than zero before returning.
In other words:

<pre><i>  *piRes = (cursors key) - (specified key) </i></pre>


<h2 id=database_transactions_and_mvcc>3.4. Database Transactions and MVCC </h2>

<p>LSM supports a single-writer/multiple-reader 
<a href=http://en.wikipedia.org/wiki/Multiversion_concurrency_control>MVCC</a>
based transactional concurrency model. This is the same model that SQLite
supports in <a href="http://www.sqlite.org/wal.html">WAL mode</a>.

<p>A read-transaction must be opened in order to read from the database. 
After a read-transaction has been opened, no writes to the database made 
by other database connections are visible to the database reader. Instead,
the reader operates on a snapshot of the database as it existed when the
read transaction was first opened. Any number of clients may simultaneously
maintain open read-transactions.

<p>If one is not already open, a read-transaction is opened when a database 
cursor is created (the lsm_csr_open() function). It is closed when the number
of open cursors drops to zero.

<p>A write-transaction is required to write to the database. At any point,
at most one database client may hold an open write transaction. If another
client already has an open write transaction, then attempting to open one
is an error (LSM_BUSY). If a read-transaction is already open when the
write-transaction is opened, then the snapshot read by the read-transaction
must correspond to the most recent version of the database. Otherwise,
the attempt to open the write-transaction fails (LSM_BUSY). In other words,
if any other client has written to the database since the current clients
read-transaction was opened, it will not be possible to upgrade to a
write-transaction.

<p>Write-transactions may be opened either implicitly or explicitly. If any
of the following functions are called to write to the database when there 
is no write-transaction open, then an implicit write-transaction is opened and
closed (committed) within the call:

<ul>
  <li> lsm_insert()
  <li> lsm_delete()
  <li> lsm_delete_range()
</ul>

<p>This means, of course, that all three of the above may return LSM_BUSY.
Indicating either that another client currently has an open write-transaction,
or that there is currently an open read-transaction and some other client
has written to the database since it was opened. 

<p>When an explicitly opened transaction is closed, it may either be 
committed or rolled back (reverted - so that the state of the database is
unchanged). Within a write-transaction there may also be a hierarchy of 
nested sub-transactions that may be rolled back or committed independently.
A write-transaction is a property of a database connection - all writes
made by the connection become part of the current transaction (and possibly
sub-transaction).

<p>The functions used to open, commit and rollback explicity transactions
and sub-transactions are, respectively:

<verbatim>
  int lsm_begin(lsm_db *, int);
  int lsm_commit(lsm_db *, int);
  int lsm_rollback(lsm_db *, int);
</verbatim>

<p>In all cases, the second parameter is either the maximum (lsm_commit(),
lsm_rollback()) or minimum (lsm_begin()) the number of nested
write-transactions that will exist following the call (assuming it succeeds).
If the second parameter passed is <i>N</i>,  

<ul>
  <li> <p>Calling <b>lsm_begin(db, <i>N</i>)</b> attempts opens zero or more
       nested write-transactions so that the database connection is left with
       at least <i>N</i> open nested write-transactions. If there are already
       <i>N</i> or more open nested write-transactions open, then lsm_begin(db,
       <i>N</i>) is a no-op. lsm_begin(db, 0) is always a no-op. Calling
       lsm_begin(db, 1) when there is no open write-transaction opens a
       top-level write-transaction.

  <li> <p>Calling <b>lsm_commit(db, <i>N</i>)</b> commits zero or more nested
       write-transactions so that the database connection is left with at most
       <i>N</i> open write-transactions. If the connection has <i>N</i> or
       fewer open nested write-transactions, then lsm_commit(db, <i>N</i>) is a
       no-op. Calling lsm_commit(db, 0) commits the outermost transaction
       (if any).

  <li> <p>Calling <b>lsm_rollback(db, 0)</b> closes and rolls back the 
       top-level write-transaction. Calling lsm_rollback(db, <i>N</i>)
       for any value of <i>N</i> greater than zero closes zero or more nested
       write-transactions so that the database connection is left with at most
       <i>N</i> open transactions. If, following this, the database connection
       has exactly <i>N</i> open nested write-transactions, the outermost is
       rolled back, but not closed. Calling lsm_rollback(db, 1) rolls back
       (but does not close) the top-level transaction.
</ul>

<p>Examples follow. With error checking omitted for brevity's sake.

<verbatim>
  /* Open a write-transaction. Write some data to the database. Then
  ** commit and close the write transaction. Following this, the database
  ** contains:
  **
  **   "j" -> "ten"
  **   "k" -> "eleven"
  */
  lsm_begin(db, 1);
  lsm_insert(db, "j", 1, "ten",    3);
  lsm_insert(db, "k", 1, "eleven", 6);
  lsm_commit(db, 0);

  /* Open a write-transaction, perform all manner of writes and other
  ** operations (not shown). Then roll the top-level transaction back.
  ** Regardless of the write operations performed, the database remains
  ** unchanged:
  **
  **   "j" -> "ten"
  **   "k" -> "eleven"
  */
  lsm_begin(db, 1);
  /* Do all manner of writes, sub-transactions etc. */
  lsm_rollback(db, 0);

  /* Open a write-transaction. Write some data to the database. Then
  ** rollback the top level transaction but do not close it. Write 
  ** different data to the database and commit. Following this block,
  ** the database is:
  **
  **   "j" -> "ten"
  **   "k" -> "eleven"
  **   "m" -> "thirteen"
  */
  lsm_begin(db, 1);
  lsm_insert(db, "l", 1, "twelve",    3);
  lsm_rollback(db, 1);
  lsm_insert(db, "m", 1, "thirteen", 6);
  lsm_commit(db, 0);

  /* Open a write-transaction and 2 nested sub-transactions. Delete a
  ** database key. Then commit and close the outermost sub-transaction.
  ** Open another sub-transaction (so that there are again 2 nested
  ** sub-transactions). Delete a different database key. Then rollback
  ** and close the outermost sub-transaction. Finally, delete yet another
  ** db key and commit the outermost transaction. Leaving just:
  **
  **   "k" -> "eleven"
  */
  lsm_begin(db, 3);
  lsm_delete(db, "j", 1);
  lsm_commit(db, 2);
  lsm_begin(db, 3);
  lsm_delete(db, "k", 1);
  lsm_rollback(db, 2);
  lsm_delete(db, "m", 1);
  lsm_commit(db, 0);
  
</verbatim>

<h1 id=data_durability>4. Data Durability </h1>

<p>The value of the configuration parameter LSM_CONFIG_SAFETY determines
how often data is synced to disk by the LSM library. This is an important
tradeoff - syncing less often can lead to orders of magnitude better
performance, but also exposes the application to the risk of partial or total
data loss in the event of a power failure;

<table valign=top>
<tr> <td valign=top>LSM_SAFETY_OFF 
     <td valign=top style="padding-left:1ex;padding-right:1ex">(0)
     <td> Do not sync to disk at all. This is the fastest mode.
          <p>If a power failure occurs while writing to the database, 
          following recovery the database may be corrupt. All or some data may
          be recoverable.

<tr> <td valign=top>LSM_SAFETY_NORMAL 
     <td valign=top style="padding-left:1ex;padding-right:1ex">(1)
     <td> Sync only as much as is necessary to prevent database corruption.
         This is the default setting. Although slower than LSM_SAFETY_OFF, 
         this mode is still much faster than LSM_SAFETY_FULL.
     <p> If a power failure occurs while writing to the database, following
          recovery some recently committed transactions may have been lost.
          But the database file should not be corrupt and older data intact.

<tr> <td valign=top>LSM_SAFETY_FULL 
     <td valign=top style="padding-left:1ex;padding-right:1ex">(2)
     <td> Sync every transaction to disk as part of committing it. This is
          the slowest mode.
       <p>If a power failure occurs while writing to the database, all
          successfully committed transactions should be present.
          The database file should not be corrupt.
</table>

<p>The following example code sets the value of the LSM_CONFIG_SAFETY 
parameter for connection db to LSM_SAFETY_FULL:

<verbatim>
  int iSafety = LSM_SAFETY_FULL;
  lsm_config(db, LSM_CONFIG_SAFETY, &iSafety);
</verbatim>

<p>The current value of the LSM_CONFIG_SAFETY parameter can also be queried
by setting the initial value of the argument to -1 (or any other negative
value). For example:

<verbatim>
  int iSafety = -1;
  lsm_config(db, LSM_CONFIG_SAFETY, &iSafety);
  /* At this point, variable iSafety is set to the currently configured value
  ** of the LSM_CONFIG_SAFETY parameter (either 0, 1 or 2).  */
</verbatim>

<p>The lsm_config() function may also be used to configure other database
connection parameters.  


<h1 id=compressed_and_encrypted_databases>5. Compressed and Encrypted Databases </h1>

<p>LSM does not provide built-in methods for creating encrypted or compressed
databases. Instead, it allows the user to provide hooks to call external
functions to compress and/or encrypt data before it is written to the database
file, and to decrypt and/or uncompress data as it is read from the database
file.

<p>A database connection is configured to call compression functions using a
call to lsm_config() with the second argument set to
LSM_CONFIG_SET_COMPRESSION. The third argument should point to an instance
of type lsm_compress, which is defined as follows:

<verbatim>
  typedef struct lsm_compress lsm_compress;
  struct lsm_compress {
    u32 iId;
    void *pCtx;
    int (*xBound)(void *pCtx, int nIn);
    int (*xCompress)(void *pCtx, void *pOut, int *pnOut, const void *pIn, int nIn);
    int (*xUncompress)(void *pCtx, void *pOut, int *pnOut, const void *pIn, int nIn);
    void (*xFree)(void *pCtx);
  };
</verbatim>

<p><span style=color:red> Explain how the hooks work here (same as zipvfs)
</span>

<p><span style=color:red> Example code? Using zlib? Or something simple like an
RLE implementation?</span>

<p>The database file header of any LSM database contains a 32-bit unsigned
"compression id" field. If the database is not a compressed database, this
field is set to 1. Otherwise, it is set to an application supplied value
identifying the compression and/or encryption scheme in use. Application
compression scheme ids must be greater than or equal to 10000. Values smaller
than 10000 are reserved for internal use.

<p>The lsm_info() API may be used to read the compression id from a database 
connection as follows: 

<verbatim>
  unsigned int iCompressionId;
  rc = lsm_info(db, LSM_INFO_COMPRESSION_ID, &iCompressionId);
  if( rc==LSM_OK ){
    /* Variable iCompressionId now contains the db compression id */
  }
</verbatim>

Because the compression id is stored in the database
header, it may be read before any required compression or encryption hooks
are configured.

<verbatim>
  #define LSM_COMPRESSION_EMPTY    0
  #define LSM_COMPRESSION_NONE     1
</verbatim>

<p>When a database is opened for the first time, before it is first written,
the compression id field is set to LSM_COMPRESSION_EMPTY (0). After data is
written into the database file, the database compression id is set to a copy 
of the lsm_compress.iId field of the compression hooks for the database handle
doing the writing, or to LSM_COMPRESSION_NONE (1) if no compression hooks 
are configured.

<p>Once the compression id is set to something other than 
LSM_COMPRESSION_EMPTY, when a database handle attempts to read or write the
database file, the compression id is compared against the lsm_compress.iId 
field of the configured compression hooks, or against LSM_COMPRESSION_NONE if
no compression hooks are configured. If the compression id does not match, then
an LSM_MISMATCH error is returned and the operation fails (no transaction or
database cursor is opened).

<p>It is also possible to register a compression factory callback with a 
database handle. If one is registered, the compression factory callback is
invoked instead of returning LSM_MISMATCH if the configured compression hooks
do not match the compression id of a database. If the callback registers
compatible compression hooks with the database handle (using the normal
lsm_config() interface), then the database read or write operation resumes
after it returns. Otherwise, if the compression factory callback does not
register new, compatible, compression hooks with the database handle,
LSM_MISMATCH is returned to the user.

<p>A compression factory callback is registered with a database handle
by calling lsm_config() with the second argument set to
LSM_CONFIG_SET_COMPRESSION_FACTORY, and the third argument set to point to
an instance of structure lsm_compress_factory. The lsm_config() copies the
contents of the structure - it does not retain a pointer to it.

<verbatim>
  typedef struct lsm_compress_factory lsm_compress_factory;
  struct lsm_compress_factory {
    void *pCtx;
    int (*xFactory)(void *pCtx, lsm_db *db, unsigned int iCompressionId);
    void (*xFree)(void *pCtx);
  };
</verbatim>

<p><span style=color:red> Explain how the xFactory hook works here. </span>

<h1 id=performance_tuning>6. Performance Tuning</h1>

<p> This section describes the various measures that can be taken in order to
fine-tune LSM in order to improve performance in specific circumstances.
Sub-section 6.1 contains a high-level overview of the 
<a href=#overview_of_lsm_architecture>system architecture</a>
intended to help in understanding the various performance tradeoffs and
optimizations available to LSM applications. Sub-section 6.2 identifies the 
<a href=#performance_related_configuration_options> configuration parameters</a>
that can be used to influence database performance. 
Sub-section 6.3 discusses options and methods for scheduling the time-consuming
processes of actually 
<a href=#work_and_checkpoint_scheduling>writing and syncing the database file
</a>to disk. Finally, 6.4 introduces "<a
href=#database_file_optimization>database optimization</a>"
- the process of reorganizing a database file internally so that it is as small
as possible and optimized for search queries.

<h2 id=overview_of_lsm_architecture>6.1. Overview of LSM Architecture</h2>

<p>The following steps describe the journey taken by data written to the 
   database from the application to the database file on disk:

<ol>
  <li> <p>
       When an application writes to an LSM database, the new data is first
       written to a log file and stored in an in-memory tree structure. The
       log file is used for database recovery - if an application crash or
       power failure occurs and the contents of the in-memory tree are lost,
       data is recovered by reading the log file.

  <li> <p>
       Once sufficient data has been accumulated in an in-memory tree (by
       default "sufficient data" means 1MB, including data structure 
       overhead), it is marked as "old" and a new "live" in-memory tree 
       created. An old in-memory tree is immutable - new data is always
       inserted into the live tree. There may be at most one old tree
       in memory at a time.

  <li> <p>
       The contents of an old in-memory tree may be written into the 
       database file at any point. Once its contents have been written (or
       "flushed") to the database file, the in-memory tree may be discarded.
       Flushing an in-memory tree to the database file creates a new database
       "segment". A database segment is an immutable b-tree structure stored
       within the database file. A single database file may contain up to 64 
       segments.

  <li> <p>
       At any point, two or more existing segments within the database file
       may be merged together into a single segment. Once their contents has
       been merged into the new segment, the original segments may be 
       discarded.

  <li> <p>
       After the set of segments in a database file has been modified (either
       by flushing an in-memory tree to disk or by merging existing segments
       together), the changes may be made persistent by "checkpointing" the 
       database. Checkpointing involves updating the database file header and
       and (usually) syncing the contents of the database file to disk.
</ol>

<p>Steps 3 and 4 above are known as "working" on the database. Step 5 is
refered to as "checkpointing". By default, database connections perform work
and checkpoint operations periodically from within calls to API functions
<code>lsm_insert</code>, <code>lsm_delete</code>, <code>lsm_delete_range</code>
and <code>lsm_commit</code> (i.e. functions that write to the database).
Alternatively, work and checkpoint operations may be performed on demand using
the <code>lsm_work</code> and <code>lsm_checkpoint</code> APIs. By opening a
second database connection, these operations may be moved to a background
thread or process. 

<p>Optimizing database write throughput and responsiveness is done by
configuring and scheduling work and checkpoint operations, and by configuring 
a few other parameters, as described in the following two sections.

<p>The speed of database read operations is largely determined by the number
of segments in the database file. So optimizing read operations is also linked 
to the configuring and scheduling of database write operations, as these
policies determine the number of segments that are present in the database
file at any time.

<p>Any data written to the database file since the last checkpoint may be
lost if a power or application failure occurs. As a result of this, regular
database checkpoints are required to ensure that unused space within the log
file and database file can be reused in a timely fashion. Specifically:

<ul>
  <li><p>Space within the log file cannot be recycled until the corresponding
         data has been written into a database segment and a checkpoint
         performed.

  <li><p>When two or more existing segments are merged into a new segment
         within the database file, the space occupied by the original segments
         may not be recycled until after a checkpoint has been performed.
</ul>

<p>
In other words, without checkpoints the system will function, but both the log and database files will grow indefinitely as the database is modified (even if the size of the dataset remains constant). Additionally, if a crash or power failure occurs, the next client to open the database file has to process all data written to the log file since the most recent checkpoint. If checkpoints are performed infrequently, this can be a time consuming exercise. 

<p>If a connection attempts to open a write transaction on the database when
another connection already has an open write transaction, the attempt fails
and LSM_BUSY is returned to the caller. This is because to write to the
database, the connection must first obtain the WRITER lock - and at most 
one connection may simultaneously hold the WRITER lock. As well as the WRITER
lock, there are two other exclusive locks that may be obtained on the database
- the WORKER and CHECKPOINTER locks. These are used, not surprisingly, to
ensure that at most one connection attempts to work on or checkpoint the
database at a time. More specifically, the roles of the three locks are:

<table valign=top>
<tr><td valign=top>WRITER<td style="width:3ex"><td>
The WRITER lock is required to modify the contents of the in-memory tree.
Including marking an in-memory tree as "old" and starting a new live tree.
It is also required to write to the log file.

<tr><td valign=top>WORKER<td><td>
The WORKER lock is required to write to segments within the database file.
Either when merging two or more existing segments within the database, or
when flushing an in-memory tree to disk to create a new segment.

<tr><td valign=top>CHECKPOINTER<td><td>
The CHECKPOINTER lock is required to write to the database file header.
</table>

<p>The three locks are independent. It is possible to simultaneously have one
client writing to the database, another working on the database file and a
third performing a checkpoint operation.


<h2 id=performance_related_configuration_options>6.2. Performance Related Configuration Options </h2>

<p>The options in this section are all set to integer values. They may be 
set and queried using the <a href=lsmapi.wiki#lsm_config>lsm_config()</a>
function. To set an option to a value, lsm_config() is used as follows:

<verbatim>
  /* Set the LSM_CONFIG_AUTOFLUSH option to 1MB (1024 KB) */
  int iVal = 1024;
  rc = lsm_config(db, LSM_CONFIG_AUTOFLUSH, &iVal);
</verbatim>

<p>In order to query the current value of an option, the initial value of
the parameter (iVal in the example code above) should be set to a negative
value. Or any other value that happens to be out of range for the parameter -
negative values just happen to be out of range for all integer lsm_config()
parameters.

<verbatim>
  /* Set iVal to the current value of LSM_CONFIG_AUTOFLUSH */
  int iVal = -1;
  rc = lsm_config(db, LSM_CONFIG_AUTOFLUSH, &iVal);
</verbatim>

<dl>
  <dt> <a href=lsmapi.wiki#LSM_CONFIG_AUTOCHECKPOINT>LSM_CONFIG_AUTOCHECKPOINT</a>
  <dd> <p style=margin-top:0>
    This option determines how often the database is checkpointed (synced to
    disk). A checkpoint is performed after N KB (approximately) have been
    written to the database file, where N is the value of this option. 
    Increasing this value (say to 4MB or even 8MB) may improve overall write
    throughput. 

    <p>The default value is 2048 (2MB).

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_AUTOFLUSH>LSM_CONFIG_AUTOFLUSH</a>
  <dd> <p style=margin-top:0>
    This option determines how much data in KB is allowed to accumulate in a
    live in-memory tree before it is marked as "old" (and made eligible to be
    flushed through to the database file). Increasing this value may improve 
    overall write throughput. Decreasing it reduces memory usage.

    <p>The default value is 1024 (1MB).

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_AUTOMERGE>LSM_CONFIG_AUTOMERGE</a>
  <dd> <p style=margin-top:0>
    <p>If auto-work (the LSM_CONFIG_AUTOWORK option below) is enabled, then 
    this option is set to the number of segments that the library attempts 
    to merge simultaneously. Increasing this value may reduce the total
    amount of data written to the database file. Decreasing it increases
    the amount of data written to the file, but also decreases the average
    number of segments present in the file, which can improve the performance
    of database read operations.

    <p>Additionally, whether or not auto-work is enabled, this option is used
    to determine the maximum number of segments of a given age that are 
    allowed to accumulate in the database file. This is described in the
    <a href=#compulsary_work_and_checkpoints>compulsary work and checkpoints</a>
    section below.

    <p>The default value is 4. This option must be set to a value between
    2 and 8, inclusive.

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_AUTOWORK>LSM_CONFIG_AUTOWORK</a>
  <dd> <p style=margin-top:0>
    <p>This option may be set to either 1 (true) or 0 (false). If it is set
    to true, then work and checkpoint operations are automatically scheduled
    within calls to lsm_insert(), lsm_delete(), lsm_delete_range() and
    lsm_commit(). Otherwise, if it is set to false, these operations must
    be <a href=#explicit_scheduling>explicitly scheduled</a> by the
    application. 

    <p>The default value is 1.

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_MMAP>LSM_CONFIG_MMAP</a>
  <dd> <p style=margin-top:0>
    If LSM is running on a system with a 64-bit address space, this option
    may be set to either 1 (true) or 0 (false). On a 32-bit platform, it is
    always set to 0.
    <p> If it is set to true, the entire database file is memory mapped. Or, if
    it is false, data is accessed using ordinary OS file read and write
    primitives. Memory mapping the database file can significantly improve the
    performance of read operations, as database pages do not have to be copied
    from operating system buffers into user space buffers before they can be
    examined. 
    <p>This option may not be set if there is a read or write transaction
    open on the database.
    <p>The default value is 1 (true) on a 64-bit platform, and 0 otherwise.

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_MULTIPLE_PROCESSES>LSM_CONFIG_MULTIPLE_PROCESSES</a>
  <dd> <p style=margin-top:0>
    This option may also be set to either 1 (true) or 0 (false).  The default
    value is 1 (true). If it is set to false, then the library assumes that all
    database clients are located within the same process (have access to the
    same memory space). Assuming this means the library can avoid using OS file
    locking primitives to lock the database file, which speeds up opening and
    closing read and write transactions. 

    <p>This option can only be set before lsm_open() is called on the database
    connection.

    <p>If this option is set to false and there is already a connection to the
    database from another process when lsm_open() is called, the lsm_open()
    call fails with error code LSM_BUSY.

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_SAFETY>LSM_CONFIG_SAFETY</a>
  <dd> <p style=margin-top:0>
    The effect of this option on <a href=#data_durability>data durability</a>
    is described above.

    <p>From a performance point of view, this option determines how often the
    library pauses to wait for data written to the file-system to be stored
    on the persistent media (e.g. hard disk or solid-state memory). This is
    also known as "syncing" data to disk. Since this is orders of magnitude
    slower than simply copying data into operating system buffers, the value
    of this option has a large effect on write performance.
  
    <p>If LSM_CONFIG_SAFETY is set to 2 (FULL), then the library syncs the
    data written to the log file to disk whenever a transaction is committed.
    Or, if LSM_CONFIG_SAFETY is set to 1 (NORMAL), then data is only synced
    to disk when a checkpoint is performed (see above). Finally, if it is set
    to 0 (OFF), then no data is ever synced to disk.

    <p>The default value is 1 (NORMAL).

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_USE_LOG>LSM_CONFIG_USE_LOG</a>
  <dd> <p style=margin-top:0>
    This is another option that may be set to either 1 (true) or 0 (false). 
    The default value is 1 (true). If it is set to false, then the library 
    does not write data into the database log file. This makes writing faster,
    but also means that if an application crash or power failure occurs, it is
    very likely that any recently committed transactions will be lost.

    <p>If this option is set to true, then an application crash cannot cause
    data loss. Whether or not data loss may occur in the event of a power
    failure depends on the value of the <a href=#data_durability>
    LSM_CONFIG_SAFETY</a> parameter.

    <p>This option can only be set if the connection does not currently have
    an open write transaction.

</dl>

<h2 id=work_and_checkpoint_scheduling>6.3. Work and Checkpoint Scheduling</h2>

<h3 id=automatic_scheduling>6.3.1. Automatic Scheduling</h3>

<p>This section describes how work and checkpoint operations are scheduled 
if the boolean LSM_CONFIG_AUTOWORK parameter is set to true. Automatic
work operations may occur within calls to any of the following API functions:

<ul>
  <li>lsm_insert()
  <li>lsm_delete()
  <li>lsm_delete_range()
  <li>lsm_commit()
</ul>

<p>Each time a transaction is committed in auto-work mode, the library checks
to see if there exists an "old" in-memory tree (see the LSM_CONFIG_AUTOFLUSH
option above). If so, it attempts to flush it to disk immediately. Unlike
merges of existing segments, the entire in-memory tree must be flushed to disk
before control is returned to the user. It is not possible to incrementally
flush an in-memory tree in the same ways as it is possible to incrementally
merge existing database segments together. 

<p>Each segment in the database file is assigned an "age" - an integer zero 
or greater indicating how many times the data in the segment has been merged. 
A segment created by flushing the in-memory tree to disk is assigned an age 
of 1. When two or more segments with age=1 are merged together to create a 
larger segment, it is assigned an age of 2. And so on. 

<p>Assuming auto-work is enabled, the library periodically checks the state of
the database file to see if there exist N or more segments with the same age
value A, where N is the value assigned to the LSM_CONFIG_AUTOMERGE parameter.
If so, work is done to merge all such segments with age=A into a new, larger
segment assigned age=A+1. At present, "periodically" as used above means
roughly once for every 32KB of data (including overhead) written to the
in-memory tree. The merge operation is not necessarily completed within a
single call to a write API (this would result in blocking the writer thread for
too long in many cases - in large databases segments may grow to be many GB in
size). Currently, the amount of data written by a single auto-work operation is
roughly 32KB multiplied by the number of segments in the database file. This
formula may change - the point is that the library attempts to limit the amount
of data written in order to avoid blocking the writer thread for too long
within a single API call. 

<p>Checkpoint operations are scheduled based on the value assigned to the
LSM_CONFIG_AUTOCHECKPOINT configuration parameter.

<p>In order to automatically perform work and checkpoint operations, the 
client must obtain the WORKER and CHECKPOINTER locks, respectively. If an
attempt to obtain either of these locks fails (because some other client is
already holding them), it is not an error, the scheduled work or checkpoint
is simply not performed.



<h3 id=explicit_scheduling>6.3.2. Explicit Scheduling</h3>

<p>The alternative to automatic scheduling of work and checkpoint operations
is to explicitly schedule them - possibly in a background thread or dedicated
application process. In order to disable automatic work, a client must set
the LSM_CONFIG_AUTOWORK parameter to zero. This parameter is a property of
a database connection, not of a database itself, so it must be cleared
separately by all processes that may write to the database. Otherwise, they
may attempt automatic database work or checkpoints.

<verbatim>
  /* Disable auto-work on connection db */
  int iVal = 0;
  lsm_config(db, LSM_CONFIG_AUTOWORK, &iVal);
</verbatim>

<p>The lsm_work() function is used to explicitly perform work on the database:

<verbatim>
  int lsm_work(lsm_db *db, int nMerge, int nKB, int *pnWrite);
</verbatim>

<p>Parameter nKB is passed a limit on the number of KB of data that 
should be written to the database file before the call returns. It is a 
hint only, the library does not honor this limit strictly.

<p>If the database has an old in-memory tree when lsm_work() is called, it is
flushed to disk. If this means that more than nKB KB of data is written
to the database file, no further work is performed. Otherwise, the number
of KB written is subtracted from nKB before proceeding.

<p>If parameter nMerge is greater than 1, then the library searches for 
nMerge or more segments of the same age within the database file and performs
up to nKB KB of work to merge them together. If the merge is completed
before the nKB limit is exceeded, the library searches for another set of
nMerge or more segments to work on, and so on. If at any point no such set of
nMerge segments can be found, the call returns without performing any 
further work.

<p>Calling lsm_work() with the nMerge argument set to 1 is used to "optimize"
the database (see below). Passing a value of zero or less for the nMerge
parameter is an error.

<p>In any case, before returning the value of *pnWrite is set to the actual
number of KB written to the database file.

<p>The example code below might be executed in a background thread or process
in order to perform database work and checkpointing. In this case all other
clients should set the LSM_CONFIG_AUTOWORK parameter to zero.

<verbatim>
  int rc;
  lsm_db *db;
  int nCkpt = 4*1024;             /* 4096KB == 4MB */

  /* Open a database connection to database "test.db". 
  **
  ** Configure the connection to automatically checkpoint the database after
  ** writing each 4MB of data to it (instead of the default 2MB). As well
  ** as to auto-work, the LSM_CONFIG_AUTOCHECKPOINT parameter applies to data
  ** written by explicit calls to lsm_work().
  */
  lsm_new(0, &db);
  lsm_config(db, LSM_CONFIG_AUTOCHECKPOINT, &nCkpt);
  lsm_open(db, "test.db");

  while( 1 ){
    int nWrite;

    /* Attempt up to 512KB of work. Set nWrite to the number of bytes
    ** actually written to disk.  */
    rc = lsm_work(db, 2, 512, &nWrite);
    if( rc!=LSM_OK && rc!=LSM_BUSY ){
      /* Anything other than LSM_OK or LSM_BUSY is a problem. LSM_BUSY
      ** indicates that some other client has taken the WORKER lock. Any
      ** other error indicates something has gone quite wrong.  */
      lsm_close(db);
      return rc;
    }

    /* nWrite may be set to zero here in two scenarios. lsm_work()
    ** may have failed to obtain the WORKER lock and returned LSM_BUSY,
    ** indicating that some other connection is working on the database.
    ** Alternatively, it may be that there was no old in-memory tree to
    ** flush and no two segments of the same age within the database file,
    ** meaning the function could find no work to do.
    **
    ** In either case, there is no point in calling lsm_work() again 
    ** immediately. Instead, sleep for a second before continuing. By that
    ** time, things may have changed (the other process may have relinquished
    ** the WORKER lock, or an in-memory tree may have been marked as old).
    */
    if( nWrite==0 ) sleep(1);
  }
</verbatim>

<p>The mechanism associated with the LSM_CONFIG_AUTOCHECKPOINT configuration
parameter applies to data written by both automatically scheduled work and
work performed by calls to the lsm_work() function. The amount of
uncheckpointed data that has been written into the database file is a property
of the database file, not a single connection, so checkpoints occur at the
configured interval even if multiple connections are used to work on the
database.

<p>Alternatively, checkpoint operations may be scheduled separately. If the
LSM_CONFIG_AUTOCHECKPOINT parameter is set to zero, then a connection never
performs a database checkpoint, regardless of how much data it or any other 
connection writes into the database file. As with LSM_CONFIG_AUTOWORK, this
parameter must be zeroed for all connections that may perform work on the
database. Otherwise, they may perform a checkpoint operation.

<p>The <a href=lsmapi.wiki#lsm_checkpoint>lsm_checkpoint()</a> API is used
to expicitly request a checkpoint.

<verbatim>
  int lsm_checkpoint(lsm_db *db, int *pnKB);
</verbatim>

<p>If no work has been performed on the database since the previous
checkpoint, lsm_checkpoint() sets *pnKB to zero and returns immediately. 
Otherwise, it checkpoints the database and sets *pnKB to the number of KB of
data written to the database file since the previous checkpoint.

<p>A database may be queried for the number of KB written to the database 
since the most recent checkpoint using the 
<a href=lsmapi.wiki#lsm_info>lsm_info()</a> API function. As follows:

<verbatim>
  int nCkpt;
  rc = lsm_info(db, LSM_INFO_CHECKPOINT_SIZE, &nCkpt);
</verbatim>

<p>It may also be queried for the size of the in-memory tree or trees. The
following block of code sets variable nLive to the size of the current 
live in-memory tree in KB, and nOld to the size of the old in-memory tree in 
KB (or 0 if there is no old in-memory tree).

<verbatim>
  int nOld, nLive;
  rc = lsm_info(db, LSM_INFO_TREE_SIZE, &nOld, &nLive);
</verbatim>

<h3 id=compulsary_work_and_checkpoints>6.3.3. Compulsary Work and Checkpoints</h3>

<p>There are three scenarios where database work or checkpointing may be
performed automatically, regardless of the value of the LSM_CONFIG_AUTOWORK
parameter.

<ul>
  <li> When closing a database connection, and 
  <li> When the number of segments with a common age in the database 
       file grows unacceptably high.
  <li> When the total number of segments in the database file grows 
       unacceptably high.
</ul>

<p>Whenever an lsm_close() call would mean that the total number of 
connections to a database drops to zero, the connection checks if the 
in-memory tree is empty. If not, it is flushed to disk. Both the live and 
old in-memory trees are flushed to disk in this case. It also checks if the
database file has been modified since the most recent checkpoint was 
performed. If so, it also performs a checkpoint. Finally, assuming no error
has occurred, it deletes the log file.

<p>Additionally, whenever a worker wishes to flush an in-memory tree to a new
age=1 segment, it must first ensure that there are less than (N+1) existing 
age=1 segments, where N is the value that the LSM_CONFIG_AUTOMERGE parameter is
set to. If there are already (N+1) or more age=1 segments, they must be merged
into an age=2 segment before a new age=1 segment can be created within the
database file. Similar rules apply to segments of other ages - it is not
possible to create a new age=I segment if there are already (N+1) segments 
with age=I in the database file. This has two implications:

<p>This scenario should never come about if all connections that write to the
database have auto-work enabled. It only occurs if auto-work is disabled and
the lsm_work() function is called too infrequently. In this case it is possible
that flushing an in-memory tree may require writing a tremendous amount of
data to disk (possibly even rewriting the entire database file).

<p>Finally, regardless of age, a database is limited to a maximum of 64
segments in total. If an attempt is made to flush an in-memory tree to disk
when the database already contains 64 segments, two or more existing segments
must be merged together before the new segment can be created.

<h2 id=database_file_optimization>6.4. Database File Optimization</h2>

<p>Database optimization transforms the contents of database file so that
the following are true:

<ul>
  <li> <p>All database content is stored in a single 
       <a href=#architectural_overview>segment</a>. This makes the
       data structure equivalent to an optimally packed b-tree stucture
       for search operations - minimizing the number of disk sectors that need
       to be visted when searching the database.

  <li> <p>The database file contains no (or as little as possible) free space.
       In other words, it is no larger than required to contain the single
       segment.
</ul>

<p>In order to optimize the database, lsm_work() should be called with 
the nMerge argument set to 1 and the third parameter set to a negative value
(interpreted as "keep working until there is no more work to do"). For 
example:

<verbatim>
  rc = lsm_work(db, 1, -1, 0);
</verbatim>

<p>When optimizing the database as above, either the LSM_CONFIG_AUTOCHECKPOINT
parameter should be set to a non-zero value or lsm_checkpoint() should be
called periodically. Otherwise, no checkpoints will be performed, preventing
the library from reusing any space occupied by old segments even after their
content has been merged into the new segment. The result - a database file that
is optimized, except that it is up to twice as large as it otherwise would be.