SQLite FTS3 Extension

<title>SQLite FTS3 Extension</title>
<tcl>

hd_keywords {fts3}
source [file join $::DOC pages fancyformat.tcl]
fancyformat_document "SQLite FTS3 Extension" {} {

<h2 style="margin-left:1.0em"> Overview</h2>

[h1 "Introduction to FTS3"]

<p>
  The FTS3 extension module allows users to create special tables with a 
  built-in full-text index (hereafter "FTS3 tables"). The full-text index
  allows the user to efficiently query the database for all rows that contain
  one or more instances specified word (hereafter a "token", even if the table
  contains many large documents.

<p>
  For example, if each of the 517430 documents in the 
  "<a href="http://www.cs.cmu.edu/~enron/">Enron E-Mail Dataset</a>"
  is inserted into both the FTS3 table and the ordinary SQLite table
  created using the following SQL script:

[Code {
  CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);     /* FTS3 table */
  CREATE TABLE enrondata2(content TEXT);                        /* Ordinary table */
}]

<p>
  Then either of the two queries below may be executed to find the number of
  documents in the database that contain the word "linux" (351). Using one
  desktop PC hardware configuration, the query on the FTS3 table returns in
  approximately 0.03 seconds, versus 22.5 for querying the ordinary table.

[Code {
  SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';  /* 0.03 seconds */
  SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
}]

<p>
  Of course, the two queries above are not entirely equivalent. For example
  the LIKE query matches rows that contain terms such as "linuxophobe"
  or "EnterpriseLinux" (as it happens, the Enron E-Mail Dataset does not
  actually contain any such terms), whereas the MATCH query on the FTS3 table
  selects only those rows that contain "linux" as a discreet token. Both 
  searches are case-insensitive. The FTS3 table consumes around 2006 MB on
  disk compared to just 1453 MB for the ordinary table. Using the same
  hardware configuration used to perform the SELECT queries above, the FTS3
  table took just over 39 minutes to populate, versus 25 for the ordinary
  table.

[h1 "Compiling and Enabling FTS3"]

[h1 "Custom Tokenizers" customtokenizer {custom tokenizer}]

[h1 "MATCH Expression Syntax"]

[h2 "Advanced Query Syntax"]
[h2 "Legacy Query Syntax"]

[h1 "Implementation Details"]

<p>
  This section describes at a high-level the implementation of the SQLite FTS3
  extension. It is <b>not necessary to read or understand the material in this
  section in order to use FTS3</b> in an application. However, it may be useful
  to application developers attempting to analyze and understand FTS3 
  performance characteristics, or to developers contemplating enhancements to
  the existing FTS3 feature set.

[h2 "Data Structures"]

<p>
  For each FTS3 virtual table in a database, three real (non-virtual) tables 
  are created to store the underlying data. The real tables are named "%_content",
  "%_segdir" and "%_segments", where "%" is replaced by the name supplied by
  the user for the FTS3 virtual table.

<p>
  The leftmost column of the "%_content" table is an INTEGER PRIMARY KEY field
  named "docid". Following this is one column for each column of the FTS3
  virtual table as declared by the user, named by prepending the column name
  supplied by the user with "c<i>N</i>", where <i>N</i> is the index of the 
  column within the table, numbered from left to right starting with 1. Data
  types supplied as part of the virtual table declaration are not used as
  part of the %_content table declaration. For example:

[Code {
  /* Virtual table declaration */
  CREATE VIRTUAL TABLE abc USING FTS3(a NUMBER, b TEXT, c);

  /* Corresponding %_content table declaration */
  CREATE TABLE abc_content(docid INTEGER PRIMARY KEY, c1a, c2b, c2c);
}]

<p>
  The %_content table contains the unadulterated data inserted by the user 
  into the FTS3 virtual table by the user. If the user does not explicitly
  supply a "docid" value when inserting records, one is selected automatically
  by the system.

<p>
  The two remaining tables, %_segments and %_segdir, are used to store the 
  full-text index. Conceptually, this index is a lookup table that maps each 
  term (word) to the set of docid values corresponding to records in the 
  %_content table that contain one or more occurrences of the term. To
  retrieve all documents that contain a specified term, the FTS3 module
  queries this index to determine the set of docid values for records that
  contain the term, then retrieves the required documents from the %_content
  table. Regardless of the schema of the FTS3 virtual table, the %_segments
  and %_segdir tables are always created as follows:

[Code {
  CREATE TABLE %_segments(
    blockid INTEGER PRIMARY KEY,       /* B-tree node id */
    block blob                         /* B-tree node data */
  );

  CREATE TABLE %_segdir(
    level INTEGER,
    idx INTEGER,
    start_block INTEGER,               /* Blockid of first node in %_segments */
    leaves_end_block INTEGER,          /* Blockid of last leaf node in %_segments */
    end_block INTEGER,                 /* Blockid of last node in %_segments */
    root BLOB,                         /* B-tree root node */
    PRIMARY KEY(level, idx)
  );
}]

<p>
  The schema depicted above is not designed to store the full-text index 
  directly. Instead, it is used to one or more b-tree structures. There
  is one b-tree for each row in the %_segdir table. The %_segdir table
  row contains the root node and various meta-data associated with the
  b-tree structure, and the %_segments table contains all other (non-root)
  b-tree nodes. Each b-tree is referred to as a "segment". Once it has
  been created, a segment b-tree is never updated (although it may be
  deleted altogether).

<p>
  The keys used by each segment b-tree are terms (words). As well as the
  key, each segment b-tree entry has an associated "doclist" (document list).
  A doclist consists of zero or more entries, where each entry consists of:

<ul>
  <li> A docid (document id), and
  <li> A list of term offsets, one for each occurrence of the term within
       the document. A term offset indicates the number of tokens (words)
       that occur before the term in question, not the number of characters
       or bytes. For example, the term offset of the term "war" in the
       phrase "Ancestral voices prophesying war!" is 3.
</ul>

<p>
  Entries within a doclist are sorted by docid. Positions within a doclist
  entry are stored in ascending order.

<p>
  The contents of the logical full-text index is found by merging the
  contents of all segment b-trees. If a term is present in more than one
  segment b-tree, then it maps to the union of each individual doclist. If,
  for a single term, the same docid occurs in more than one doclist, then only
  the doclist that is part of the most recently created segment b-tree is 
  considered valid.
  
[h3 "Segment B-Tree Format"]

<p>
  Segment b-trees are prefix-compressed b+-trees.

[h4 "Segment B-Tree Leaf Nodes"]

[Fig fts3_leaf_node.png "Segment B-Tree Leaf Node Format"]

<p class=todo> There is a varint field before each doclist containing the size of the doclist in bytes.

[h4 "Segment B-Tree Interior Nodes"]

[Fig fts3_interior_node.png "Segment B-Tree Interior Node Format"]

[h3 "Doclist Format"]

<p>
  A doclist consists of an array of 64-bit signed integers, serialized using
  the SQLite <a href="fileformat.html#varint_format">varint</a> format. Each
  doclist entry is made up of a series of two or more integers, as follows:

<ol>
  <li> The docid value. The first entry in a doclist contains the literal docid
       value. The first field of each subsequent doclist entry contains the 
       difference between the new docid and the previous one (always a positive 
       number).
  <li> Zero or more term-offset lists. A term-offset list is present for each
       column of the FTS3 virtual table that contains the term. A term-offset
       list consists of the following:
     <ol>
       <li> Constant value 1. This field is omitted for any term-offset list
            associated with column 0.
       <li> The column number (1 for the second leftmost column, etc.). This
            field is omitted for any term-offset list associated with column 0.
       <li> A list of term-offsets, sorted from smallest to largest. Instead
            of storing the term-offset value literally, each integer stored 
            is the difference between the current term-offset and the previous 
            one (or zero if the current term-offset is the first), plus 2.
     </ol>
  <li> Constant value 0.
</ol>

[Fig fts3_doclist2.png "FTS3 Doclist Format"]

[Fig fts3_doclist.png "FTS3 Doclist Entry Format"]

<p>
  For doclists for which the term appears in more than one column of the FTS3
  virtual table, term-offset lists within the doclist are stored in column 
  number order. This ensures that the term-offset list associated with 
  column 0 (if any) is always first, allowing the first two fields of the
  term-offset list to be omitted in this case.

<p class=todo>
  Add a diagram and make this description more user-friendly.

[h2 "Queries (SELECT)"]

[h2 "Updates (INSERT, UPDATE and DELETE)"]

[h2 "Segment Merges"]

}