SQLite FTS3 Extension

<title>SQLite FTS3 Extension</title>
<tcl>

hd_keywords *fts3 FTS3
source [file join $::DOC pages fancyformat.tcl]
fancyformat_document "SQLite FTS3 Extension" {} {

<h2 style="margin-left:1.0em"> Overview</h2>

[h1 "Introduction to FTS3"]

<p>
  The FTS3 extension module allows users to create special tables with a 
  built-in full-text index (hereafter "FTS3 tables"). The full-text index
  allows the user to efficiently query the database for all rows that contain
  one or more instances specified word (hereafter a "token", even if the table
  contains many large documents.

<p>
  For example, if each of the 517430 documents in the 
  "<a href="http://www.cs.cmu.edu/~enron/">Enron E-Mail Dataset</a>"
  is inserted into both the FTS3 table and the ordinary SQLite table
  created using the following SQL script:

[Code {
  CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);     /* FTS3 table */
  CREATE TABLE enrondata2(content TEXT);                        /* Ordinary table */
}]

<p>
  Then either of the two queries below may be executed to find the number of
  documents in the database that contain the word "linux" (351). Using one
  desktop PC hardware configuration, the query on the FTS3 table returns in
  approximately 0.03 seconds, versus 22.5 for querying the ordinary table.

[Code {
  SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';  /* 0.03 seconds */
  SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
}]

<p>
  Of course, the two queries above are not entirely equivalent. For example
  the LIKE query matches rows that contain terms such as "linuxophobe"
  or "EnterpriseLinux" (as it happens, the Enron E-Mail Dataset does not
  actually contain any such terms), whereas the MATCH query on the FTS3 table
  selects only those rows that contain "linux" as a discreet token. Both 
  searches are case-insensitive. The FTS3 table consumes around 2006 MB on
  disk compared to just 1453 MB for the ordinary table. Using the same
  hardware configuration used to perform the SELECT queries above, the FTS3
  table took just over 39 minutes to populate, versus 25 for the ordinary
  table.

<p class=todo> 
  Check if the numbers are still valid using the refactored version.

[h2 "Creating and Destroying FTS3 Tables"]

<p>
  Like other virtual table types, new FTS3 tables are created using a 
  \[CREATE VIRTUAL TABLE\] statement. The module name, which follows
  the USING keyword, is "fts3". The virtual table module arguments may
  be left empty, in which case an FTS3 table with a single user-defined 
  column named "content" is created. Alternatively, the module arguments
  may be passed a list of comma separated column names. 

<p>
  If column names are explicitly provided for the FTS3 table as part of
  the CREATE VIRTUAL TABLE statement, then a datatype name may be optionally 
  specified for each column. However, this is pure syntactic sugar, the
  supplied typenames are not used by FTS3 or the SQLite core for any
  purpose. The same applies to any constraints specified along with an
  FTS3 column name - they are parsed but not used or recorded by the system
  in any way.

[Code {
  <i>-- Create an FTS3 table named "data" with one column - "content":</i>
  CREATE VIRTUAL TABLE data USING fts3();

  <i>-- Create an FTS3 table named "pages" with three columns:</i>
  CREATE VIRTUAL TABLE pages USING fts3(title, keywords, body);

  <i>-- Create an FTS3 table named "mail" with two columns. Datatypes
  -- and column constraints are specified along with each column. These
  -- are completely ignored by FTS3 and SQLite. </i>
  CREATE VIRTUAL TABLE mail USING fts3(
    subject VARCHAR(256) NOT NULL,
    body TEXT CHECK(length(body)<10240)
  );
}]

<p>
  As well as a list of columns, the module arguments passed to a CREATE
  VIRTUAL TABLE statement used to create an FTS3 table may be used to specify
  a \[custom tokenizer\]. This is done by specifying a string of the form
  "tokenize=&lt;tokenizer name&gt;" in place of a column name. A tokenizer
  specification may be placed anywhere in the column list, but at most one
  tokenizer declaration is allowed for each CREATE VIRTUAL TABLE statement.
  The second and subsequent tokenizer declaration are interpreted as column
  names. \[custom tokenizer|See below\] for a detailed description of
  implementing and using a custom tokenizer.

[Code {
  <i>-- Create an FTS3 table named "papers" with two columns that uses</i>
  <i>-- the custom tokenizer "porter".</i>
  CREATE VIRTUAL TABLE papers USING fts3(author, document, tokenize=porter);

  <i>-- Create an FTS3 table with a single column - "content" - that uses</i>
  <i>-- the "simple" tokenizer.</i>
  CREATE VIRTUAL TABLE data USING fts3(tokenize=simple);
}]

<p>
  FTS3 tables may be dropped from the database using an ordinary \[DROP TABLE\]
  statement. For example:

[Code {
  <i>-- Create, then immediately drop, an FTS3 table.</i>
  CREATE VIRTUAL TABLE data USING fts3();
  DROP TABLE data;
}]

[h2 "Populating FTS3 Tables"]

  <p>
    FTS3 tables are populated using \[INSERT\], \[UPDATE\] and \[DELETE\]
    statements in the same way as ordinary SQLite tables are.

  <p>
    As well as the columns named by the user (or the "content" column if no
    module arguments where specified as part of the \[CREATE VIRTUAL TABLE\] 
    statement), each FTS3 table has a "rowid" column that behaves like an
    \[INTEGER PRIMARY KEY\], except that values remain unchanged if the
    database is rebuilt using the \[VACUUM\] command. For FTS3 tables, "docid"
    is allowed as an alias along with the usual "rowid", "oid" and "_oid_"
    identifiers. Attempting to insert or update a row with a docid value that 
    already exists in the table is an error, just as it would be with an 
    ordinary SQLite table.

[Code {
  <i>-- Create an FTS3 table</i>
  CREATE VIRTUAL TABLE pages USING fts3(title, body);

  <i>-- Insert a row with a specific docid value.</i>
  INSERT INTO pages(docid, title, body) VALUES(53, 'Home Page', 'SQLite is a software...');

  <i>-- Insert a row and allow FTS3 to assign a docid value using the same algorithm as</i>
  <i>-- SQLite uses for ordinary tables. In this case the new docid will be 54,</i>
  <i>-- one greater than the largest docid currently present in the table.</i>
  INSERT INTO pages(title, body) VALUES('Download', 'All SQLite source code...');

  <i>-- Change the title of the row just inserted.</i>
  UPDATE pages SET title = 'Download SQLite' WHERE rowid = last_insert_rowid();

  <i>-- Delete the entire table contents.</i>
  DELETE FROM pages;
}]

  <p>
    To support full-text queries, FTS3 maintains an inverted index that maps
    from each unique term or word that appears in the dataset to the locations
    in which it appears within the table contents. For the curious, a 
    complete description of the \[segment btree|data structure\] used to store
    this index within the database file is described below. A feature of
    this data structure is that at any time the database may contain not
    one index b-tree, but several different b-trees that are incrementally
    merged as rows are inserted, updated and deleted. This technique improves 
    performance when writing to an FTS3 table, but causes some overhead for
    full-text queries that use the index. Executing an SQL statement of the
    form "SELECT optimize(&lt;fts3-table&gt;) FROM &lt;fts3-table&gt;" 
    causes FTS3 to merge all existing index b-trees into a single large
    b-tree containing the entire index. This can be an expensive operation,
    but may speed up future queries. 

  <p>
    For example, to optimize the full-text index for an FTS3 table named
    "docs":

[Code {
  <i>-- Optimize the internal structure of FTS3 table "docs".</i>
  SELECT optimize(docs) FROM docs LIMIT 1;
}]

  <p>
    The statement above may appear syntacticly incorrect to some. Refer to
    the section describing the \[snippets()|FTS3 auxillary functions\] for
    a description of why it is not.

  <p>
    The optimize() function returns a text value. If the index was already
    optimized when it was called the text is "Index already optimal". 
    Otherwise if the index was not already optimized, it is made so and
    the text "Index optimized" returned.

[h2 "Querying FTS3 Tables" {} {simple fts3 queries}]

<p class=todo>
  Some note clarifying what is meant by a "term", and how that relates to
  the specific tokenizer used by the FTS3 table in question.

<p>
  As for all other SQLite tables, virtual or otherwise, data is retrieved
  from FTS3 tables using a \[SELECT\] statement.

<p>
  FTS3 tables can be queried efficiently using SELECT statements of two
  different forms:

<ul>
  <li><p>
    <b>Query by rowid</b>. If the WHERE clause of the SELECT statement
    contains a sub-clause of the form "rowid = ?", where ? is an SQL expression,
    FTS3 is able to retreive the requested row directly using the equivalent 
    of an SQLite \[INTEGER PRIMARY KEY\] index.

  <li><p>
    <b>Full-text query</b>. If the WHERE clause of the SELECT statement contains
    a sub-clause of the form "&lt;column&gt; MATCH ?", FTS3 is able to use 
    the built-in full-text index to restrict the search to those documents 
    that match the full-text query string specified as the right-hand operand
    of the MATCH clause.
</ul>

<p>
  If neither of the two query strategies enumerated above can be used, all
  queries on FTS3 tables are implemented using a linear scan of the entire
  table. If the table contains large amounts of data, this may be an 
  impractically approach (the first example on this page shows that a linear
  scan of 1.5 GB of data takes around 30 seconds using a modern PC).

[Code {
  <i>-- The examples in this block assume the following FTS3 table:</i>
  CREATE VIRTUAL TABLE mail USING fts3(subject, body);

  SELECT * FROM mail WHERE rowid = 15;                <i>-- Fast. Rowid lookup.</i>
  SELECT * FROM mail WHERE body MATCH 'sqlite';       <i>-- Fast. Full-text query.</i>
  SELECT * FROM mail WHERE mail MATCH 'search';       <i>-- Fast. Full-text query.</i>
  SELECT * FROM mail WHERE rowid BETWEEN 15 AND 20;   <i>-- Slow. Linear scan.</i>
  SELECT * FROM mail WHERE subject = 'database';      <i>-- Slow. Linear scan.</i>
  SELECT * FROM mail WHERE subject MATCH 'database';  <i>-- Fast. Full-text query.</i>
}]

<p>
  In all of the full-text queries above, the right-hand operand of the MATCH
  operator is a string consisting of a single word. In this case, the MATCH
  expression evaluates to true for all documents that contain one or more 
  instances of the specified word ("sqlite", "search" or "database", depending 
  on which example you look at). Specifying a single word as the right-hand
  operand of the MATCH operator results in the simplest (and most common) type 
  of full-text query possible. However more complicated queries are possible,
  including phrase searches, term-prefix searches and searches for documents 
  containing combinations of terms occuring within a defined proximity of each
  other. The various ways in which the full-text index may be queried are
  \[FTS3 MATCH|described below\].

<p>
  The paragraph above notes that a MATCH operator with a simple word as the
  right-hand operand evaluates to true for all documents that contain the
  specified term. In this context, the "document" may refer to either the 
  data stored in a single column of a row of an FTS3 table, or to the contents
  of all columns in a single row, depending on the identifier used as the
  left-hand operand to the MATCH operator. If the identifier specified as
  the left-hand operand of the MATCH operator is an FTS3 table column name,
  then the document that the search term must be contained in is the value
  stored in the specified column. However, if the identifier is the name
  of the FTS3 <i>table</i> itself, then the MATCH operator evaluates to true
  for each row of the FTS3 table for which any column contains the search 
  term. The following example demonstrates this:

[Code {
  <i>-- Example schema</i>
  CREATE VIRTUAL TABLE mail USING fts3(subject, body);

  <i>-- Example table population</i>
  INSERT INTO mail(docid, subject, body) VALUES(1, 'software feedback', 'found it too slow');
  INSERT INTO mail(docid, subject, body) VALUES(2, 'software feedback', 'no feedback');
  INSERT INTO mail(docid, subject, body) VALUES(3, 'slow lunch order',  'was a software problem');

  <i>-- Example queries</i>
  SELECT * FROM mail WHERE subject MATCH 'software';    <i>-- Selects rows 1 and 2</i>
  SELECT * FROM mail WHERE body    MATCH 'feedback';    <i>-- Selects row 2</i>
  SELECT * FROM mail WHERE mail    MATCH 'software';    <i>-- Selects rows 1, 2 and 3</i>
  SELECT * FROM mail WHERE mail    MATCH 'slow';        <i>-- Selects rows 1 and 3</i>
}]
  
<p>
  At first glance, the final two full-text queries in the example above seem
  to be syntacticly incorrect, as there is a table name ("mail") used as
  an SQL expression. The reason this is acceptable is that each FTS3 table
  actually has a \[sqlite3_declare_vtab|HIDDEN\] column with the same name
  as the table itself (in this case, "mail"). The value stored in this
  column is not meaningful to the application, but can be used as the 
  left-hand operand to a MATCH operator. This special column may also be
  passed as an argument to the \[snippets()|FTS3 auxillary functions\].

<p>
  The following example illustrates the above. The expressions "docs", 
  "docs.docs" and "main.docs.docs" all refer to column "docs". However, the 
  expression "main.docs" does not refer to any column. It could be used to 
  refer to a table, but a table name is not allowed in the context in which
  it is used below.

[Code {
  <i>-- Example schema</i>
  CREATE VIRTUAL TABLE docs USING fts3(content);

  <i>-- Example queries</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite';              <i>-- OK.</i>
  SELECT * FROM docs WHERE docs.docs MATCH 'sqlite';         <i>-- OK.</i>
  SELECT * FROM docs WHERE main.docs.docs MATCH 'sqlite';    <i>-- OK.</i>
  SELECT * FROM docs WHERE main.docs MATCH 'sqlite';         <i>-- Error.</i>
}]
 
[h2 "Summary"]

<p>
  From the users point of view, FTS3 tables are similar to ordinary SQLite
  tables in many ways. Data may be added to, modified within and removed 
  from FTS3 tables using the INSERT, UPDATE and DELETE commands just as 
  it may be with ordinary tables. Similarly, the SELECT command may be used 
  to query data. The following list summarizes the differences between FTS3
  and ordinary tables:

<ol>
  <li><p> 
    As with all virtual table types, it is not possible to create indices or
    triggers attached to FTS3 tables. Nor is it possible to use the ALTER TABLE
    command to add extra columns to FTS3 tables (although it is possible to use
    ALTER TABLE to rename an FTS3 table).

  <li><p> 
    Data-types specified as part of the "CREATE VIRTUAL TABLE" statement
    used to create an FTS3 table are ignored completely. Instead of the
    normal rules for applying type \[affinity\] to inserted values, all
    values inserted into FTS3 table columns (except the special rowid
    column) are converted to type TEXT before being stored.

  <li><p> 
    FTS3 tables permit the special alias "docid" to be used to refer to the
    rowid column supported by all \[virtual tables\].

  <li><p> 
    The \[FTS3 MATCH\] operator is supported for queries based on the built-in
    full-text index. 

  <li><p> 
    The FTS3 auxillary functions, \[snippets|snippets() and offsets()\], are 
    available to support full-text queries.

  <li><p> 
    Each FTS3 table has a \[sqlite3_declare_vtab()|HIDDEN column\] with the 
    same name as the table itself. The value contained in each row for the
    special column is only useful when used on the left-hand side of a 
    \[FTS3 MATCH|MATCH\] operator, or when specified as an argument to one 
    of the \[snippets|FTS3 auxillary functions\].
    

</ol>


[h1 "Compiling and Enabling FTS3"]

<p>
  Although FTS3 is distributed as part of the SQLite source code, it is not
  enabled by default. To build SQLite with FTS3 functionality enabled, define
  the preprocessor macro \[SQLITE_ENABLE_FTS3\] when compiling. New applications
  should also define the \[SQLITE_ENABLE_FTS3_PARENTHESIS\] macro to enable the
  advanced query syntax (see below). Usually, this is done by adding the 
  following two switches to the compiler command line:

[Code {
  -DSQLITE_ENABLE_FTS3
  -DSQLITE_ENABLE_FTS3_PARENTHESIS
}]

<p>
  If using the amalgamation autoconf based build system, setting the CPPFLAGS
  environment variable while running the 'configure' script is an easy
  way to set these macros. For example, the following command:

[Code {
  CPPFLAGS="-DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS" ./configure &lt;configure options&gt;
}]

<p>
  where <i>&lt;configure options&gt;</i> are those options normally passed to
  the configure script, if any.

<p>
  Because FTS3 is a virtual table, it is incompatible with the
  \[SQLITE_OMIT_VIRTUALTABLE\] option.

<p>
  If an SQLite build does not include FTS3, then any attempt to prepare an
  SQL statement to create an FTS3 table or to drop or access an existing 
  FTS3 table in any way will fail. The error message returned will be similar 
  to "no such module: fts3".
  

[h1 "Full-text Index Queries" {} {FTS3 MATCH}]

<p>
  The most useful thing about FTS3 tables is the queries that may be 
  performed using the built-in full-text index. Full-text queries are 
  performed by specifying a clause of the form 
  "&lt;column&gt; MATCH &lt;full-text query expression&gt;" to the WHERE 
  clause of a SELECT statement that reads data from an FTS3 table. 
  \[simple fts3 queries|Simple FTS3 queries\] that return all documents that 
  contain a given term are described above. In that discussion the right-hand
  operand of the MATCH operator was assumed to be a string consisting of a
  single term. This section describes the more complex query types supported 
  by FTS3 tables, and how they may be utilized by specifying more a more
  complex query expression as the right-hand operand of a MATCH operator.

<p class=todo>
  Where should the column selection syntax go?

<p>
  FTS3 tables support three basic query types:

<ul>
  <li><p><b>Token or token prefix queries</b>. 
    An FTS3 table may be queried for all documents that contain a specified
    term (the \[simple fts3 queries|simple case\] described above), or for
    all documents that contain a term with a specified prefix. As we have
    seen, the query expression for a specific term is simply the term itself.
    The query expression used to search for a term prefix is the prefix
    itself with a '*' character appended to it. For example:
</ul>

[Code {
  <i>-- Query for all documents containing the term "linux":</i>
  SELECT * FROM docs WHERE docs MATCH 'linux';

  <i>-- Query for all documents containing a term with the prefix "lin". This will match</i>
  <i>-- all documents that contain "linux", but also those that contain terms "linear",</i>
  <i>--"linker", "linguistic" and so on.</i>
  SELECT * FROM docs WHERE docs MATCH 'lin*';
}]
   
<ul>
  <li><p><b>Phrase queries</b>.
    A phrase query is a query that retreives all documents that contain a
    nominated set of terms or term prefixes in a specified order with no
    intervening tokens. Phrase queries are specified by enclosing a space
    separated sequence of terms or term prefixes in double quotes (").
    For example:
</ul>

[Code {
  <i>-- Query for all documents that contain the phrase "linux applications".</i>
  SELECT * FROM docs WHERE docs MATCH '"linux applications"';

  <i>-- Query for all documents that contain a phrase that matches "lin* app*". As well as</i>
  <i>-- "linux applications", this will match common phrases such as "linoleum appliances"</i>
  <i>-- or "link apprentice".</i>
  SELECT * FROM docs WHERE docs MATCH '"lin* app*"';
}]

<ul>
  <li><p><b>NEAR queries</b>. 
    A NEAR query is a query that returns documents that contain a two or
    more nominated terms of phrases within a specified proximity of each 
    other (by default with 10 or less intervening terms). A NEAR query is 
    specified by putting the keyword "NEAR" between two phrase, term or 
    term prefix queries. To specify a proximity other than the default,
    an operator of the form "NEAR/<i>&lt;N&gt;</i>" may be used, where
    <i>&lt;N&gt;</i> is the maximum number of intervening terms allowed.
    For example:
</ul>

[Code {
  <i>-- Virtual table declaration.</i>
  CREATE VIRTUAL TABLE docs USING fts3();

  <i>-- Virtual table data.</i>
  INSERT INTO docs VALUES('SQLite is an ACID compliant embedded relational database management system');

  <i>-- Search for a document that contains the terms "sqlite" and "database" with</i>
  <i>-- not more than 10 intervening terms. This matches the only document in</i>
  <i>-- table docs (since there are only six terms between "SQLite" and "database"</i> 
  <i>-- in the document)</i>.
  SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR database';

  <i>-- Search for a document that contains the terms "sqlite" and "database" with</i>
  <i>-- not more than 6 intervening terms. This also matches the only document in</i>
  <i>-- table docs. Note that the order in which the terms appear in the document</i>
  <i>-- does not have to be the same as the order in which they appear in the query.</i>
  SELECT * FROM docs WHERE docs MATCH 'database NEAR/6 sqlite';

  <i>-- Search for a document that contains the terms "sqlite" and "database" with</i>
  <i>-- not more than 5 intervening terms. This query matches no documents.</i>
  SELECT * FROM docs WHERE docs MATCH 'database NEAR/5 sqlite';

  <i>-- Search for a document that contains the phrase "ACID compliant" and the term</i>
  <i>-- "database" with not more than 2 terms separating the two. This matches the</i>
  <i>-- document stored in table docs.</i>
  SELECT * FROM docs WHERE docs MATCH 'database NEAR/2 "ACID compliant"';

  <i>-- Search for a document that contains the phrase "ACID compliant" and the term</i>
  <i>-- "sqlite" with not more than 2 terms separating the two. This also matches</i>
  <i>-- the only document stored in table docs.</i>
  SELECT * FROM docs WHERE docs MATCH '"ACID compliant" NEAR/2 sqlite';
}]

<ul>
  <li style="list-style: none"><p>
    More than one NEAR operator may appear in a single query. In this case each
    pair of terms or phrases separated by a NEAR operator must appear within the
    specified proximity of each other in the document. Using the same table and
    data as in the block of examples above:
</ul>

[Code { <i>-- The following query selects documents that contains an instance of the term </i>
  <i>-- "sqlite" separated by two or fewer terms from an instance of the term "acid",</i>
  <i>-- which is in turn separated by two or fewer terms from an instance of the term</i>
  <i>-- "relational". As it happens, the only document in table docs satisfies this criteria.</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR/2 acid NEAR/2 relational';

  <i>-- This query matches no documents. There is an instance of the term "sqlite" with</i>
  <i>-- sufficient proximity to an instance of "acid" but it is not sufficiently close</i>
  <i>-- to an instance of the term "relational".</i>
  SELECT * FROM docs WHERE docs MATCH 'acid NEAR/2 sqlite NEAR/2 relational';
}]

<p>
  Phrase and NEAR queries may not span multiple columns within a row.

<p>
  The three basic query types described above may be used to query the full-text
  index for the set of documents that match the specified criteria. Using the
  FTS3 query expression language it is possible to perform various set 
  operations on the results of basic queries. There are currently three 
  supported operations:

<ul>
  <li> The AND operator determines the <b>intersection</b> of two sets of documents.

  <li> The OR operator calculates the <b>union</b> of two sets of documents.

  <li> The NOT operator (or, if using the standard syntax, a unary "-" operator)
       may be used to compute the <b>relative complement</b> of one set of
       documents with respect to another.
</ul>

[h2 "Set Operations Using The Enhanced MATCH Syntax"]

[h2 "Set Operations Using The Standard MATCH Syntax"]


[h1 "Auxillary functions - Snippets and Offsets" {} snippets offsets]

[h1 "Custom Tokenizers" customtokenizer {custom tokenizer}]

[h1 "Data Structures" {} {segment btree}]

<p>
  This section describes at a high-level the way the FTS3 module stores its
  index and content in the database. It is <b>not necessary to read or 
  understand the material in this section in order to use FTS3</b> in an 
  application. However, it may be useful to application developers attempting 
  to analyze and understand FTS3 performance characteristics, or to developers 
  contemplating enhancements to the existing FTS3 feature set.

<p>
  For each FTS3 virtual table in a database, three real (non-virtual) tables 
  are created to store the underlying data. The real tables are named "%_content",
  "%_segdir" and "%_segments", where "%" is replaced by the name supplied by
  the user for the FTS3 virtual table.

<p>
  The leftmost column of the "%_content" table is an INTEGER PRIMARY KEY field
  named "docid". Following this is one column for each column of the FTS3
  virtual table as declared by the user, named by prepending the column name
  supplied by the user with "c<i>N</i>", where <i>N</i> is the index of the 
  column within the table, numbered from left to right starting with 1. Data
  types supplied as part of the virtual table declaration are not used as
  part of the %_content table declaration. For example:

[Code {
  <i>-- Virtual table declaration</i>
  CREATE VIRTUAL TABLE abc USING FTS3(a NUMBER, b TEXT, c);

  <i>-- Corresponding %_content table declaration</i>
  CREATE TABLE abc_content(docid INTEGER PRIMARY KEY, c1a, c2b, c2c);
}]

<p>
  The %_content table contains the unadulterated data inserted by the user 
  into the FTS3 virtual table by the user. If the user does not explicitly
  supply a "docid" value when inserting records, one is selected automatically
  by the system.

<p>
  The two remaining tables, %_segments and %_segdir, are used to store the 
  full-text index. Conceptually, this index is a lookup table that maps each 
  term (word) to the set of docid values corresponding to records in the 
  %_content table that contain one or more occurrences of the term. To
  retrieve all documents that contain a specified term, the FTS3 module
  queries this index to determine the set of docid values for records that
  contain the term, then retrieves the required documents from the %_content
  table. Regardless of the schema of the FTS3 virtual table, the %_segments
  and %_segdir tables are always created as follows:

[Code {
  CREATE TABLE %_segments(
    blockid INTEGER PRIMARY KEY,       <i>-- B-tree node id</i>
    block blob                         <i>-- B-tree node data</i>
  );

  CREATE TABLE %_segdir(
    level INTEGER,
    idx INTEGER,
    start_block INTEGER,               <i>-- Blockid of first node in %_segments</i>
    leaves_end_block INTEGER,          <i>-- Blockid of last leaf node in %_segments</i>
    end_block INTEGER,                 <i>-- Blockid of last node in %_segments</i>
    root BLOB,                         <i>-- B-tree root node</i>
    PRIMARY KEY(level, idx)
  );
}]

<p>
  The schema depicted above is not designed to store the full-text index 
  directly. Instead, it is used to one or more b-tree structures. There
  is one b-tree for each row in the %_segdir table. The %_segdir table
  row contains the root node and various meta-data associated with the
  b-tree structure, and the %_segments table contains all other (non-root)
  b-tree nodes. Each b-tree is referred to as a "segment". Once it has
  been created, a segment b-tree is never updated (although it may be
  deleted altogether).

<p>
  The keys used by each segment b-tree are terms (words). As well as the
  key, each segment b-tree entry has an associated "doclist" (document list).
  A doclist consists of zero or more entries, where each entry consists of:

<ul>
  <li> A docid (document id), and
  <li> A list of term offsets, one for each occurrence of the term within
       the document. A term offset indicates the number of tokens (words)
       that occur before the term in question, not the number of characters
       or bytes. For example, the term offset of the term "war" in the
       phrase "Ancestral voices prophesying war!" is 3.
</ul>

<p>
  Entries within a doclist are sorted by docid. Positions within a doclist
  entry are stored in ascending order.

<p>
  The contents of the logical full-text index is found by merging the
  contents of all segment b-trees. If a term is present in more than one
  segment b-tree, then it maps to the union of each individual doclist. If,
  for a single term, the same docid occurs in more than one doclist, then only
  the doclist that is part of the most recently created segment b-tree is 
  considered valid. 

<p>
  Multiple b-tree structures are used instead of a single b-tree to reduce
  the cost of inserting records into FTS3 tables. When a new record is 
  inserted into an FTS3 table that already contains a lot of data, it is
  likely that many of the terms in the new record are already present in
  a large number of existing records. If a single b-tree were used, then
  large doclist structures would have to be loaded from the database,
  amended to include the new docid and term-offset list, then written back
  to the database. Using multiple b-tree tables allows this to be avoided
  by creating a new b-tree which can be merged with the existing b-tree
  (or b-trees) later on. Merging of b-tree structures can be performed as
  a background task, or once a certain number of separate b-tree structures
  have been accumulated. Of course, this scheme makes queries more expensive
  (as the FTS3 code may have to look up individual terms in more than one
  b-tree and merge the results), but it has been found that in practice this
  overhead is often negligible.
  
[h2 "Variable Length Integer (varint) Format"]

<p>
  Integer values stored as part of segment b-tree nodes are encoded using the
  FTS3 varint format. This encoding is similar, but <b>not identical</b>, to the
  the <a href="fileformat.html#varint_format">SQLite varint format</a>.

<p>
  An encoded FTS3 varint consumes between one and ten bytes of space. The
  number of bytes required is determined by the sign and magnitude of the
  integer value encoded. More accurately, the number of bytes used to store
  the encoded integer depends on the position of the most significant set bit
  in the 64-bit twos-compliment representation of the integer value. Negative
  values always have the most significant bit set (the sign bit), and so are
  always stored using the full ten bytes. Positive integer values may be
  stored using less space.

<p>
  The final byte of an encoded FTS3 varint has its most significant bit 
  cleared. All preceding bytes have the most significant bit set. Data
  is stored in the remaining seven least signficant bits of each byte.
  The first byte of the encoded representation contains the least significant
  seven bits of the encoded integer value. The second byte of the encoded
  representation, if it is present, contains the seven next least significant
  bits of the integer value, and so on. The following table contains examples
  of encoded integer values:

[Table]
  [Tr]<th>Decimal<th>Hexadecimal<th width=100%>Encoded Representation
  [Tr]<td>43<td>0x000000000000002B<td>0x2B 
  [Tr]<td>200815<td>0x000000000003106F<td>0x9C 0xA0 0x0C
  [Tr]<td>-1<td>0xFFFFFFFFFFFFFFFF<td>0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0x01
</table>
  

[h2 "Segment B-Tree Format"]

<p>
  Segment b-trees are prefix-compressed b+-trees. There is one segment b-tree
  for each row in the %_segdir table (see above). The root node of the segment
  b-tree is stored as a blob in the "root" field of the corresponding row
  of the %_segdir table. All other nodes (if any exist) are stored in the 
  "blob" column of the %_segments table. Nodes within the %_segments table are
  identified by the integer value in the blockid field of the corresponding
  row. The following table describes the fields of the %_segdir table:

[Table]
  [Tr]<th>Column           <th width=100%>Interpretion
  [Tr]<td>level            <td> 
    Between them, the contents of the "level" and "idx" fields define the
    relative age of the segment b-tree. The smaller the value stored in the
    "level" field, the more recently the segment b-tree was created. If two
    segment b-trees are of the same "level", the segment with the larger
    value stored in the "idx" column is more recent. The PRIMARY KEY constraint
    on the %_segdir table prevents any two segments from having the same value
    for both the "level" and "idx" fields.
  [Tr]<td>idx              <td> See above.
  [Tr]<td>start_block      <td>
    The blockid that corresponds to the node with the smallest blockid that 
    belongs to this segment b-tree. Or zero if the entire segment b-tree
    fits on the root node. If it exists, this node is always a leaf node.
  [Tr]<td>leaves_end_block <td>
    The blockid that corresponds to the leaf node with the largest blockid 
    that belongs to this segment b-tree. Or zero if the entire segment b-tree
    fits on the root node.
  [Tr]<td>end_block <td>
    The blockid that corresponds to the interior node with the largest 
    blockid that belongs to this segment b-tree.  Or zero if the entire segment
    b-tree fits on the root node. If it exists, this node is always an
    interior node.
  [Tr]<td>root             <td>
    Blob containing the root node of the segment b-tree.
</table>

<p>
  Apart from the root node, the nodes that make up a single segment b-tree are
  always stored using a contiguous sequence of blockids. Furthermore, the
  nodes that make up a single level of the b-tree are themselves stored as
  a contiguous block, in b-tree order. The contiguous sequence of blockids
  used to store the b-tree leaves are allocated starting with the blockid
  value stored in the "start_block" column of the corresponding %_segdir row,
  and finishing at the blockid value stored in the "leaves_end_block"
  field of the same row. It is therefore possible to iterate through all the
  leaves of a segment b-tree, in key order, by traversing the %_segments 
  table in blockid order from "start_block" to "leaves_end_block".  

[h3 "Segment B-Tree Leaf Nodes"]

<p>
  The following diagram depicts the format of a segment b-tree leaf node.

[Fig fts3_leaf_node.png "Segment B-Tree Leaf Node Format"]

<p>
  The first term stored on each node ("Term 1" in the figure above) is
  stored verbatim. Each subsequent term is prefix-compressed with respect
  to its predecessor. Terms are stored within a page in sorted (memcmp)
  order.

[h3 "Segment B-Tree Interior Nodes"]

<p>
  The following diagram depicts the format of a segment b-tree interior 
  (non-leaf) node.

[Fig fts3_interior_node.png "Segment B-Tree Interior Node Format"]


[h2 "Doclist Format"]

<p>
  A doclist consists of an array of 64-bit signed integers, serialized using
  the FTS3 varint format. Each doclist entry is made up of a series of two 
  or more integers, as follows:

<ol>
  <li> The docid value. The first entry in a doclist contains the literal docid
       value. The first field of each subsequent doclist entry contains the 
       difference between the new docid and the previous one (always a positive 
       number).
  <li> Zero or more term-offset lists. A term-offset list is present for each
       column of the FTS3 virtual table that contains the term. A term-offset
       list consists of the following:
     <ol>
       <li> Constant value 1. This field is omitted for any term-offset list
            associated with column 0.
       <li> The column number (1 for the second leftmost column, etc.). This
            field is omitted for any term-offset list associated with column 0.
       <li> A list of term-offsets, sorted from smallest to largest. Instead
            of storing the term-offset value literally, each integer stored 
            is the difference between the current term-offset and the previous 
            one (or zero if the current term-offset is the first), plus 2.
     </ol>
  <li> Constant value 0.
</ol>

[Fig fts3_doclist2.png "FTS3 Doclist Format"]

[Fig fts3_doclist.png "FTS3 Doclist Entry Format"]

<p>
  For doclists for which the term appears in more than one column of the FTS3
  virtual table, term-offset lists within the doclist are stored in column 
  number order. This ensures that the term-offset list associated with 
  column 0 (if any) is always first, allowing the first two fields of the
  term-offset list to be omitted in this case.

}