Documentation Source Text

Artifact [a89233647d]
Login

Artifact a89233647de75f6e7b301eeb763880063a3278ed:



<tcl>hd_keywords *fts3 FTS3 {full-text search}</tcl>
<title>SQLite FTS3 and FTS4 Extensions</title>

<table_of_contents>

<h2 style="margin-left:1.0em" notoc> Overview</h2>

<p>
  FTS3 and FTS4 are an SQLite virtual table modules that allows users to perform 
  full-text searches on a set of documents. The most common (and effective) 
  way to describe full-text searches is "what Google, Yahoo and Altavista do
  with documents placed on the World Wide Web". Users input a term, or series 
  of terms, perhaps connected by a binary operator or grouped together into a 
  phrase, and the full-text query system finds the set of documents that best 
  matches those terms considering the operators and groupings the user has 
  specified. This article describes the deployment and usage of FTS3 and FTS4.

<p>
  FTS1 and FTS2 are obsolete full-text search modules for SQLite.  There are known
  issues with these older modules and their use should be avoided.
  Portions of the original FTS3 code were contributed to the SQLite project 
  by Scott Hess of <a href="http://www.google.com">Google</a>. It is now 
  developed and maintained as part of SQLite.

<h1>Introduction to FTS3 and FTS4</h1>

<p>
  The FTS3 and FTS4 extension modules allows users to create special tables with a 
  built-in full-text index (hereafter "FTS tables"). The full-text index
  allows the user to efficiently query the database for all rows that contain
  one or more words (hereafter "tokens"), even if the table
  contains many large documents.

<p>
  For example, if each of the 517430 documents in the 
  "<a href="http://www.cs.cmu.edu/~enron/">Enron E-Mail Dataset</a>"
  is inserted into both an FTS table and an ordinary SQLite table
  created using the following SQL script:

<codeblock>
  CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);     /* FTS3 table */
  CREATE TABLE enrondata2(content TEXT);                        /* Ordinary table */
</codeblock>

<p>
  Then either of the two queries below may be executed to find the number of
  documents in the database that contain the word "linux" (351). Using one
  desktop PC hardware configuration, the query on the FTS3 table returns in
  approximately 0.03 seconds, versus 22.5 for querying the ordinary table.

<codeblock>
  SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';  /* 0.03 seconds */
  SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
</codeblock>

<p>
  Of course, the two queries above are not entirely equivalent. For example
  the LIKE query matches rows that contain terms such as "linuxophobe"
  or "EnterpriseLinux" (as it happens, the Enron E-Mail Dataset does not
  actually contain any such terms), whereas the MATCH query on the FTS3 table
  selects only those rows that contain "linux" as a discrete token. Both 
  searches are case-insensitive. The FTS3 table consumes around 2006 MB on
  disk compared to just 1453 MB for the ordinary table. Using the same
  hardware configuration used to perform the SELECT queries above, the FTS3
  table took just under 31 minutes to populate, versus 25 for the ordinary
  table.

<h2>Differences between FTS3 and FTS4</h2>
<tcl>hd_fragment fts4 FTS4</tcl>

<p>
  FTS3 and FTS4 are nearly identical. They share most of their code in common,
  and their interfaces are the same. The differences are:

<ul>
  <li> <p>FTS4 contains query performance optimizations that may significantly
       improve the performance of full-text queries that contain terms that are
       very common (present in a large percentage of table rows).

  <li> <p>FTS4 supports some additional options that may used with the [matchinfo()]
       function. 

  <li> <p>Because it stores extra information on disk in two new 
       [FTS shadow tables|shadow tables] in order to support the performance
       optimizations and extra matchinfo() options, FTS4 tables may consume more
       disk space than the equivalent table created using FTS3. Usually the overhead
       is 1-2% or less, but may be as high as 10% if the documents stored in the
       FTS table are very small. The overhead may be reduced by specifying the
       directive [matchinfo_fts3|"matchinfo=fts3"] as part of the FTS4 table
       declaration, but this comes at the expense of sacrificing some of the
       extra supported matchinfo() options.

  <li> <p>FTS4 provides hooks (the compress and uncompress 
       [FTS4 options|options]) allowing data to be stored in a compressed 
       form, reducing disk usage and IO.
</ul>

<p>
  FTS4 is an enhancement to FTS3.  FTS3 has been available since SQLite [version 3.5.0] in
  2007-09-04.  The enhancements for FTS4 were added with SQLite [version 3.7.4] on 2010-12-08.

<p>
  Which module, FTS3 or FTS4, should you use in your application?  FTS4 is
  sometimes significantly faster than FTS3, even orders of magnitude faster
  depending on the query, though in the common case the performance of the two
  modules is similar. FTS4 also offers the enhanced [matchinfo()] outputs which
  can be useful in ranking the results of a [FTS MATCH|MATCH] operation.  On the
  other hand, in the absence of a [matchinfo_fts3|matchinfo=fts3] directive FTS4 requires a little
  more disk space than FTS3, though only a percent of two in most cases.

<p>
  For newer applications, FTS4 is recommended; though if compatibility with older 
  versions of SQLite is important, then FTS3 will usually serve just as well.  

<h2>Creating and Destroying FTS Tables</h2>

<p>
  Like other virtual table types, new FTS tables are created using a 
  [CREATE VIRTUAL TABLE] statement. The module name, which follows
  the USING keyword, is either "fts3" or "fts4". The virtual table module arguments may
  be left empty, in which case an FTS table with a single user-defined 
  column named "content" is created. Alternatively, the module arguments
  may be passed a list of comma separated column names. 

<p>
  If column names are explicitly provided for the FTS table as part of
  the CREATE VIRTUAL TABLE statement, then a datatype name may be optionally 
  specified for each column. This is pure syntactic sugar, the
  supplied typenames are not used by FTS or the SQLite core for any
  purpose. The same applies to any constraints specified along with an
  FTS column name - they are parsed but not used or recorded by the system
  in any way.

<codeblock>
  <i>-- Create an FTS table named "data" with one column - "content":</i>
  CREATE VIRTUAL TABLE data USING fts3();

  <i>-- Create an FTS table named "pages" with three columns:</i>
  CREATE VIRTUAL TABLE pages USING fts4(title, keywords, body);

  <i>-- Create an FTS table named "mail" with two columns. Datatypes
  -- and column constraints are specified along with each column. These
  -- are completely ignored by FTS and SQLite. </i>
  CREATE VIRTUAL TABLE mail USING fts3(
    subject VARCHAR(256) NOT NULL,
    body TEXT CHECK(length(body)&lt;10240)
  );
</codeblock>

<p>
  As well as a list of columns, the module arguments passed to a CREATE
  VIRTUAL TABLE statement used to create an FTS table may be used to specify
  a [tokenizer]. This is done by specifying a string of the form
  "tokenize=&lt;tokenizer name&gt; &lt;tokenizer args&gt;" in place of a column
  name, where &lt;tokenizer name&gt; is the name of the tokenizer to use and
  &lt;tokenizer args&gt; is an optional list of whitespace separated qualifiers
  to pass to the tokenizer implementation. A tokenizer specification may be
  placed anywhere in the column list, but at most one tokenizer declaration is
  allowed for each CREATE VIRTUAL TABLE statement. [tokenizer|See below] for a 
  detailed description of using (and, if necessary, implementing) a tokenizer.

<codeblock>
  <i>-- Create an FTS table named "papers" with two columns that uses</i>
  <i>-- the tokenizer "porter".</i>
  CREATE VIRTUAL TABLE papers USING fts3(author, document, tokenize=porter);

  <i>-- Create an FTS table with a single column - "content" - that uses</i>
  <i>-- the "simple" tokenizer.</i>
  CREATE VIRTUAL TABLE data USING fts4(tokenize=simple);

  <i>-- Create an FTS table with two columns that uses the "icu" tokenizer.</i>
  <i>-- The qualifier "en_AU" is passed to the tokenizer implementation</i>
  CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU);
</codeblock>

<p>
  FTS tables may be dropped from the database using an ordinary [DROP TABLE]
  statement. For example:

<codeblock>
  <i>-- Create, then immediately drop, an FTS4 table.</i>
  CREATE VIRTUAL TABLE data USING fts4();
  DROP TABLE data;
</codeblock>

<h2>Populating FTS Tables</h2>

  <p>
    FTS tables are populated using [INSERT], [UPDATE] and [DELETE]
    statements in the same way as ordinary SQLite tables are.

  <p>
    As well as the columns named by the user (or the "content" column if no
    module arguments were specified as part of the [CREATE VIRTUAL TABLE] 
    statement), each FTS table has a "rowid" column. The rowid of an FTS
    table behaves in the same way as the rowid column of an ordinary SQLite 
    table, except that the values stored in the rowid column of an FTS table 
    remain unchanged if the database is rebuilt using the [VACUUM] command. 
    For FTS tables, "docid" is allowed as an alias along with the usual "rowid",
    "oid" and "_oid_" identifiers. Attempting to insert or update a row with a 
    docid value that already exists in the table is an error, just as it would 
    be with an ordinary SQLite table.

  <p>
    There is one other subtle difference between "docid" and the normal SQLite
    aliases for the rowid column. Normally, if an INSERT or UPDATE statement 
    assigns discrete values to two or more aliases of the rowid column, SQLite 
    writes the rightmost of such values specified in the INSERT or UPDATE
    statement to the database. However, assigning a non-NULL value to both
    the "docid" and one or more of the SQLite rowid aliases when inserting or
    updating an FTS table is considered an error. See below for an example.

<codeblock>
  <i>-- Create an FTS table</i>
  CREATE VIRTUAL TABLE pages USING fts4(title, body);

  <i>-- Insert a row with a specific docid value.</i>
  INSERT INTO pages(docid, title, body) VALUES(53, 'Home Page', 'SQLite is a software...');

  <i>-- Insert a row and allow FTS to assign a docid value using the same algorithm as</i>
  <i>-- SQLite uses for ordinary tables. In this case the new docid will be 54,</i>
  <i>-- one greater than the largest docid currently present in the table.</i>
  INSERT INTO pages(title, body) VALUES('Download', 'All SQLite source code...');

  <i>-- Change the title of the row just inserted.</i>
  UPDATE pages SET title = 'Download SQLite' WHERE rowid = 54;

  <i>-- Delete the entire table contents.</i>
  DELETE FROM pages;

  <i>-- The following is an error. It is not possible to assign non-NULL values to both</i>
  <i>-- the rowid and docid columns of an FTS table.</i>
  INSERT INTO pages(rowid, docid, title, body) VALUES(1, 2, 'A title', 'A document body');
</codeblock>

  <p>
    To support full-text queries, FTS maintains an inverted index that maps
    from each unique term or word that appears in the dataset to the locations
    in which it appears within the table contents. For the curious, a 
    complete description of the [segment btree|data structure] used to store
    this index within the database file appears below. A feature of
    this data structure is that at any time the database may contain not
    one index b-tree, but several different b-trees that are incrementally
    merged as rows are inserted, updated and deleted. This technique improves 
    performance when writing to an FTS table, but causes some overhead for
    full-text queries that use the index. Executing an SQL statement of the
    form "INSERT INTO &lt;fts-table&gt;(&lt;fts-table&gt;) VALUES('optimize')"
    causes FTS to merge all existing index b-trees into a single large
    b-tree containing the entire index. This can be an expensive operation,
    but may speed up future queries. 

  <p>
    For example, to optimize the full-text index for an FTS table named
    "docs":

<codeblock>
  <i>-- Optimize the internal structure of FTS table "docs".</i>
  INSERT INTO docs(docs) VALUES('optimize');
</codeblock>

  <p>
    The statement above may appear syntactically incorrect to some. Refer to
    the section describing the [simple fts queries] for an explanation.

  <p>
    There is another, deprecated, method for invoking the optimize 
    operation using a SELECT statement. New code should use statements
    similar to the INSERT above to optimize FTS structures.

<h2 tags="simple fts queries">Simple FTS Queries</h2>

<p>
  As for all other SQLite tables, virtual or otherwise, data is retrieved
  from FTS tables using a [SELECT] statement.

<p>
  FTS tables can be queried efficiently using SELECT statements of two
  different forms:

<ul>
  <li><p>
    <b>Query by rowid</b>. If the WHERE clause of the SELECT statement
    contains a sub-clause of the form "rowid = ?", where ? is an SQL expression,
    FTS is able to retrieve the requested row directly using the equivalent 
    of an SQLite [INTEGER PRIMARY KEY] index.

  <li><p>
    <b>Full-text query</b>. If the WHERE clause of the SELECT statement contains
    a sub-clause of the form "&lt;column&gt; MATCH ?", FTS is able to use 
    the built-in full-text index to restrict the search to those documents 
    that match the full-text query string specified as the right-hand operand
    of the MATCH clause.
</ul>

<p>
  If neither of these two query strategies can be used, all
  queries on FTS tables are implemented using a linear scan of the entire
  table. If the table contains large amounts of data, this may be an 
  impractical approach (the first example on this page shows that a linear
  scan of 1.5 GB of data takes around 30 seconds using a modern PC).

<codeblock>
  <i>-- The examples in this block assume the following FTS table:</i>
  CREATE VIRTUAL TABLE mail USING fts3(subject, body);

  SELECT * FROM mail WHERE rowid = 15;                <i>-- Fast. Rowid lookup.</i>
  SELECT * FROM mail WHERE body MATCH 'sqlite';       <i>-- Fast. Full-text query.</i>
  SELECT * FROM mail WHERE mail MATCH 'search';       <i>-- Fast. Full-text query.</i>
  SELECT * FROM mail WHERE rowid BETWEEN 15 AND 20;   <i>-- Slow. Linear scan.</i>
  SELECT * FROM mail WHERE subject = 'database';      <i>-- Slow. Linear scan.</i>
  SELECT * FROM mail WHERE subject MATCH 'database';  <i>-- Fast. Full-text query.</i>
</codeblock>

<p>
  In all of the full-text queries above, the right-hand operand of the MATCH
  operator is a string consisting of a single term. In this case, the MATCH
  expression evaluates to true for all documents that contain one or more 
  instances of the specified word ("sqlite", "search" or "database", depending 
  on which example you look at). Specifying a single term as the right-hand
  operand of the MATCH operator results in the simplest and most common type 
  of full-text query possible. However more complicated queries are possible,
  including phrase searches, term-prefix searches and searches for documents 
  containing combinations of terms occuring within a defined proximity of each
  other. The various ways in which the full-text index may be queried are
  [FTS MATCH|described below].

<p>
  Normally, full-text queries are case-insensitive. However, this is
  is dependent on the specific [tokenizer] used by the FTS table
  being queried. Refer to the section on [tokenizer|tokenizers] for details.

<p>
  The paragraph above notes that a MATCH operator with a simple term as the
  right-hand operand evaluates to true for all documents that contain the
  specified term. In this context, the "document" may refer to either the 
  data stored in a single column of a row of an FTS table, or to the contents
  of all columns in a single row, depending on the identifier used as the
  left-hand operand to the MATCH operator. If the identifier specified as
  the left-hand operand of the MATCH operator is an FTS table column name,
  then the document that the search term must be contained in is the value
  stored in the specified column. However, if the identifier is the name
  of the FTS <i>table</i> itself, then the MATCH operator evaluates to true
  for each row of the FTS table for which any column contains the search 
  term. The following example demonstrates this:

<codeblock>
  <i>-- Example schema</i>
  CREATE VIRTUAL TABLE mail USING fts3(subject, body);

  <i>-- Example table population</i>
  INSERT INTO mail(docid, subject, body) VALUES(1, 'software feedback', 'found it too slow');
  INSERT INTO mail(docid, subject, body) VALUES(2, 'software feedback', 'no feedback');
  INSERT INTO mail(docid, subject, body) VALUES(3, 'slow lunch order',  'was a software problem');

  <i>-- Example queries</i>
  SELECT * FROM mail WHERE subject MATCH 'software';    <i>-- Selects rows 1 and 2</i>
  SELECT * FROM mail WHERE body    MATCH 'feedback';    <i>-- Selects row 2</i>
  SELECT * FROM mail WHERE mail    MATCH 'software';    <i>-- Selects rows 1, 2 and 3</i>
  SELECT * FROM mail WHERE mail    MATCH 'slow';        <i>-- Selects rows 1 and 3</i>
</codeblock>
  
<p>
  At first glance, the final two full-text queries in the example above seem
  to be syntactically incorrect, as there is a table name ("mail") used as
  an SQL expression. The reason this is acceptable is that each FTS table
  actually has a [sqlite3_declare_vtab|HIDDEN] column with the same name
  as the table itself (in this case, "mail"). The value stored in this
  column is not meaningful to the application, but can be used as the 
  left-hand operand to a MATCH operator. This special column may also be
  passed as an argument to the [snippet()|FTS auxiliary functions].

<p>
  The following example illustrates the above. The expressions "docs", 
  "docs.docs" and "main.docs.docs" all refer to column "docs". However, the 
  expression "main.docs" does not refer to any column. It could be used to 
  refer to a table, but a table name is not allowed in the context in which
  it is used below.

<codeblock>
  <i>-- Example schema</i>
  CREATE VIRTUAL TABLE docs USING fts4(content);

  <i>-- Example queries</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite';              <i>-- OK.</i>
  SELECT * FROM docs WHERE docs.docs MATCH 'sqlite';         <i>-- OK.</i>
  SELECT * FROM docs WHERE main.docs.docs MATCH 'sqlite';    <i>-- OK.</i>
  SELECT * FROM docs WHERE main.docs MATCH 'sqlite';         <i>-- Error.</i>
</codeblock>
 
<h2>Summary</h2>

<p>
  From the users point of view, FTS tables are similar to ordinary SQLite
  tables in many ways. Data may be added to, modified within and removed 
  from FTS tables using the INSERT, UPDATE and DELETE commands just as 
  it may be with ordinary tables. Similarly, the SELECT command may be used 
  to query data. The following list summarizes the differences between FTS
  and ordinary tables:

<ol>
  <li><p> 
    As with all virtual table types, it is not possible to create indices or
    triggers attached to FTS tables. Nor is it possible to use the ALTER TABLE
    command to add extra columns to FTS tables (although it is possible to use
    ALTER TABLE to rename an FTS table).

  <li><p> 
    Data-types specified as part of the "CREATE VIRTUAL TABLE" statement
    used to create an FTS table are ignored completely. Instead of the
    normal rules for applying type [affinity] to inserted values, all
    values inserted into FTS table columns (except the special rowid
    column) are converted to type TEXT before being stored.

  <li><p> 
    FTS tables permit the special alias "docid" to be used to refer to the
    rowid column supported by all [virtual tables].

  <li><p> 
    The [FTS MATCH] operator is supported for queries based on the built-in
    full-text index. 

  <li><p> 
    The [FTS auxiliary functions], [snippet()], [offsets()], and [matchinfo()] are 
    available to support full-text queries.

  <li><p>
    <tcl>hd_fragment hiddencol {FTS hidden column}</tcl>
    Every FTS table has a [hidden column] with the 
    same name as the table itself. The value contained in each row for the
    hidden column is a blob that is only useful as the left operand of a
    [FTS MATCH|MATCH] operator, or as the left-most argument to one
    of the [FTS auxiliary functions].
    

</ol>


<h1 tags="compile fts">Compiling and Enabling FTS3 and FTS4</h1>

<p>
  Although FTS3 and FTS4 are included with the SQLite core source code, they are not
  enabled by default. To build SQLite with FTS functionality enabled, define
  the preprocessor macro [SQLITE_ENABLE_FTS3] when compiling. New applications
  should also define the [SQLITE_ENABLE_FTS3_PARENTHESIS] macro to enable the
  [enhanced query syntax] (see below). Usually, this is done by adding the 
  following two switches to the compiler command line:

<codeblock>
  -DSQLITE_ENABLE_FTS3
  -DSQLITE_ENABLE_FTS3_PARENTHESIS
</codeblock>

<p>
  Note that enabling FTS3 also makes FTS4 available.  There is not a separate
  SQLITE_ENABLE_FTS4 compile-time option.  An build of SQLite either supports
  both FTS3 and FTS4 or it supports neither.

<p>
  If using the amalgamation autoconf based build system, setting the CPPFLAGS
  environment variable while running the 'configure' script is an easy
  way to set these macros. For example, the following command:

<codeblock>
  CPPFLAGS="-DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS" ./configure &lt;configure options&gt;
</codeblock>

<p>
  where <i>&lt;configure options&gt;</i> are those options normally passed to
  the configure script, if any.

<p>
  Because FTS3 and FTS4 are virtual tables, The [SQLITE_ENABLE_FTS3] compile-time option
  is incompatible with the [SQLITE_OMIT_VIRTUALTABLE] option.

<p>
  If a build of SQLite does not include the FTS modules, then any attempt to prepare an
  SQL statement to create an FTS3 or FTS4 table or to drop or access an existing 
  FTS table in any way will fail. The error message returned will be similar 
  to "no such module: ftsN" (where N is either 3 or 4).

<p>
  If the C version of the <a href=http://site.icu-project.org/>ICU library</a>
  is available, then FTS may also be compiled with the SQLITE_ENABLE_ICU
  pre-processor macro defined. Compiling with this macro enables an FTS
  [tokenizer] that uses the ICU library to split a document into terms
  (words) using the conventions for a specified language and locale.

<codeblock>
  -DSQLITE_ENABLE_ICU
</codeblock>
  

<h1 tags="FTS MATCH">Full-text Index Queries</h1>

<p>
  The most useful thing about FTS tables is the queries that may be 
  performed using the built-in full-text index. Full-text queries are 
  performed by specifying a clause of the form 
  "&lt;column&gt; MATCH &lt;full-text query expression&gt;" as part of the WHERE 
  clause of a SELECT statement that reads data from an FTS table. 
  [simple fts queries|Simple FTS queries] that return all documents that 
  contain a given term are described above. In that discussion the right-hand
  operand of the MATCH operator was assumed to be a string consisting of a
  single term. This section describes the more complex query types supported 
  by FTS tables, and how they may be utilized by specifying a more
  complex query expression as the right-hand operand of a MATCH operator.

<p>
  FTS tables support three basic query types:

<ul>
  <tcl>hd_fragment termprefix {prefix query} {prefix queries}</tcl>
  <li><p><b>Token or token prefix queries</b>. 
    An FTS table may be queried for all documents that contain a specified
    term (the [simple fts queries|simple case] described above), or for
    all documents that contain a term with a specified prefix. As we have
    seen, the query expression for a specific term is simply the term itself.
    The query expression used to search for a term prefix is the prefix
    itself with a '*' character appended to it. For example:
</ul>

<codeblock>
  <i>-- Virtual table declaration</i>
  CREATE VIRTUAL TABLE docs USING fts3(title, body);

  <i>-- Query for all documents containing the term "linux":</i>
  SELECT * FROM docs WHERE docs MATCH 'linux';

  <i>-- Query for all documents containing a term with the prefix "lin". This will match</i>
  <i>-- all documents that contain "linux", but also those that contain terms "linear",</i>
  <i>--"linker", "linguistic" and so on.</i>
  SELECT * FROM docs WHERE docs MATCH 'lin*';
</codeblock>

<ul>
  <li style="list-style:none"><p>
    Normally, a token or token prefix query is matched against the FTS table 
    column specified as the right-hand side of the MATCH operator. Or, if the
    special column with the same name as the FTS table itself is specified,
    against all columns. This may be overridden by specifying a column-name
    followed by a ":" character before a basic term query. There may be space
    between the ":" and the term to query for, but not between the column-name
    and the ":" character. For example:
</ul>
   
<codeblock>
  <i>-- Query the database for documents for which the term "linux" appears in</i>
  <i>-- the document title, and the term "problems" appears in either the title</i>
  <i>-- or body of the document.</i>
  SELECT * FROM docs WHERE docs MATCH 'title:linux problems';

  <i>-- Query the database for documents for which the term "linux" appears in</i>
  <i>-- the document title, and the term "driver" appears in the body of the document</i>
  <i>-- ("driver" may also appear in the title, but this alone will not satisfy the</i>.
  <i>-- query criteria).</i>
  SELECT * FROM docs WHERE body MATCH 'title:linux driver';
</codeblock>

<ul>
  <li><p><b>Phrase queries</b>.
    A phrase query is a query that retrieves all documents that contain a
    nominated set of terms or term prefixes in a specified order with no
    intervening tokens. Phrase queries are specified by enclosing a space
    separated sequence of terms or term prefixes in double quotes (").
    For example:
</ul>

<codeblock>
  <i>-- Query for all documents that contain the phrase "linux applications".</i>
  SELECT * FROM docs WHERE docs MATCH '"linux applications"';

  <i>-- Query for all documents that contain a phrase that matches "lin* app*". As well as</i>
  <i>-- "linux applications", this will match common phrases such as "linoleum appliances"</i>
  <i>-- or "link apprentice".</i>
  SELECT * FROM docs WHERE docs MATCH '"lin* app*"';
</codeblock>

<ul>
  <li><p><b>NEAR queries</b>. 
    A NEAR query is a query that returns documents that contain a two or
    more nominated terms or phrases within a specified proximity of each 
    other (by default with 10 or less intervening terms). A NEAR query is 
    specified by putting the keyword "NEAR" between two phrase, term or 
    [prefix queries]. To specify a proximity other than the default,
    an operator of the form "NEAR/<i>&lt;N&gt;</i>" may be used, where
    <i>&lt;N&gt;</i> is the maximum number of intervening terms allowed.
    For example:
</ul>

<codeblock>
  <i>-- Virtual table declaration.</i>
  CREATE VIRTUAL TABLE docs USING fts4();

  <i>-- Virtual table data.</i>
  INSERT INTO docs VALUES('SQLite is an ACID compliant embedded relational database management system');

  <i>-- Search for a document that contains the terms "sqlite" and "database" with</i>
  <i>-- not more than 10 intervening terms. This matches the only document in</i>
  <i>-- table docs (since there are only six terms between "SQLite" and "database"</i> 
  <i>-- in the document)</i>.
  SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR database';

  <i>-- Search for a document that contains the terms "sqlite" and "database" with</i>
  <i>-- not more than 6 intervening terms. This also matches the only document in</i>
  <i>-- table docs. Note that the order in which the terms appear in the document</i>
  <i>-- does not have to be the same as the order in which they appear in the query.</i>
  SELECT * FROM docs WHERE docs MATCH 'database NEAR/6 sqlite';

  <i>-- Search for a document that contains the terms "sqlite" and "database" with</i>
  <i>-- not more than 5 intervening terms. This query matches no documents.</i>
  SELECT * FROM docs WHERE docs MATCH 'database NEAR/5 sqlite';

  <i>-- Search for a document that contains the phrase "ACID compliant" and the term</i>
  <i>-- "database" with not more than 2 terms separating the two. This matches the</i>
  <i>-- document stored in table docs.</i>
  SELECT * FROM docs WHERE docs MATCH 'database NEAR/2 "ACID compliant"';

  <i>-- Search for a document that contains the phrase "ACID compliant" and the term</i>
  <i>-- "sqlite" with not more than 2 terms separating the two. This also matches</i>
  <i>-- the only document stored in table docs.</i>
  SELECT * FROM docs WHERE docs MATCH '"ACID compliant" NEAR/2 sqlite';
</codeblock>

<ul>
  <li style="list-style: none"><p>
    More than one NEAR operator may appear in a single query. In this case each
    pair of terms or phrases separated by a NEAR operator must appear within the
    specified proximity of each other in the document. Using the same table and
    data as in the block of examples above:
</ul>

<codeblock> 
 <i>-- The following query selects documents that contains an instance of the term </i>
  <i>-- "sqlite" separated by two or fewer terms from an instance of the term "acid",</i>
  <i>-- which is in turn separated by two or fewer terms from an instance of the term</i>
  <i>-- "relational". As it happens, the only document in table docs satisfies this criteria.</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR/2 acid NEAR/2 relational';

  <i>-- This query matches no documents. There is an instance of the term "sqlite" with</i>
  <i>-- sufficient proximity to an instance of "acid" but it is not sufficiently close</i>
  <i>-- to an instance of the term "relational".</i>
  SELECT * FROM docs WHERE docs MATCH 'acid NEAR/2 sqlite NEAR/2 relational';
</codeblock>

<p>
  Phrase and NEAR queries may not span multiple columns within a row.

<p>
  The three basic query types described above may be used to query the full-text
  index for the set of documents that match the specified criteria. Using the
  FTS query expression language it is possible to perform various set 
  operations on the results of basic queries. There are currently three 
  supported operations:

<ul>
  <li> The AND operator determines the <b>intersection</b> of two sets of documents.

  <li> The OR operator calculates the <b>union</b> of two sets of documents.

  <li> The NOT operator (or, if using the standard syntax, a unary "-" operator)
       may be used to compute the <b>relative complement</b> of one set of
       documents with respect to another.
</ul>

<p>
  The FTS modules may be compiled to use one of two slightly different versions
  of the full-text query syntax, the "standard" query syntax and the "enhanced" 
  query syntax. The basic term, term-prefix, phrase and NEAR queries described 
  above are the same in both versions of the syntax. The way in which set 
  operations are specified is slightly different. The following two sub-sections 
  describe the part of the two query syntaxes that pertains to set operations. 
  Refer to the description of how to [compile fts] for compilation notes.

<h2 tags="enhanced query syntax">
  Set Operations Using The Enhanced Query Syntax</h2>

<p>
  The enhanced query syntax supports the AND, OR and NOT binary set operators.
  Each of the two operands to an operator may be a basic FTS query, or the
  result of another AND, OR or NOT set operation. Operators must be entered
  using capital letters. Otherwise, they are interpreted as basic term queries
  instead of set operators.

<p>
  The AND operator may be implicitly specified. If two basic queries appear 
  with no operator separating them in an FTS query string, the results are
  the same as if the two basic queries were separated by an AND operator.
  For example, the query expression "implicit operator" is a more succinct
  version of "implicit AND operator".

<codeblock>
  <i>-- Virtual table declaration</i>
  CREATE VIRTUAL TABLE docs USING fts3();

  <i>-- Virtual table data</i>
  INSERT INTO docs(docid, content) VALUES(1, 'a database is a software system');
  INSERT INTO docs(docid, content) VALUES(2, 'sqlite is a software system');
  INSERT INTO docs(docid, content) VALUES(3, 'sqlite is a database');

  <i>-- Return the set of documents that contain the term "sqlite", and the</i>
  <i>-- term "database". This query will return the document with docid 3 only.</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite AND database';

  <i>-- Again, return the set of documents that contain both "sqlite" and</i>
  <i>-- "database". This time, use an implicit AND operator. Again, document</i>
  <i>-- 3 is the only document matched by this query. </i>
  SELECT * FROM docs WHERE docs MATCH 'database sqlite';

  <i>-- Query for the set of documents that contains either "sqlite" or "database".</i>
  <i>-- All three documents in the database are matched by this query.</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite OR database';

  <i>-- Query for all documents that contain the term "database", but do not contain</i>
  <i>-- the term "sqlite". Document 1 is the only document that matches this criteria.</i>
  SELECT * FROM docs WHERE docs MATCH 'database NOT sqlite';

  <i>-- The following query matches no documents. Because "and" is in lowercase letters,</i>
  <i>-- it is interpreted as a basic term query instead of an operator. Operators must</i>
  <i>-- be specified using capital letters. In practice, this query will match any documents</i>
  <i>-- that contain each of the three terms "database", "and" and "sqlite" at least once.</i>
  <i>-- No documents in the example data above match this criteria.</i>
  SELECT * FROM docs WHERE docs MATCH 'database and sqlite';
</codeblock>

<p>
  The examples above all use basic full-text term queries as both operands of 
  the set operations demonstrated. Phrase and NEAR queries may also be used,
  as may the results of other set operations. When more than one set operation
  is present in an FTS query, the precedence of operators is as follows:

<table striped=1>
  <tr><th>Operator<th>Enhanced Query Syntax Precedence
  <tr><td>NOT <td> Highest precedence (tightest grouping).
  <tr><td>AND <td>
  <tr><td>OR  <td> Lowest precedence (loosest grouping).
</table>

<p>
  When using the enhanced query syntax, parenthesis may be used to override
  the default precedence of the various operators. For example:

<codeblock>

  <i>-- Return the docid values associated with all documents that contain the</i>
  <i>-- two terms "sqlite" and "database", and/or contain the term "library".</i>
  SELECT docid FROM docs WHERE docs MATCH 'sqlite AND database OR library';

  <i>-- This query is equivalent to the above.</i>
  SELECT docid FROM docs WHERE docs MATCH 'sqlite AND database'
    UNION
  SELECT docid FROM docs WHERE docs MATCH 'library';

  <i>-- Query for the set of documents that contains the term "linux", and at least</i>
  <i>-- one of the phrases "sqlite database" and "sqlite library".</i>
  SELECT docid FROM docs WHERE docs MATCH '("sqlite database" OR "sqlite library") AND linux';

  <i>-- This query is equivalent to the above.</i>
  SELECT docid FROM docs WHERE docs MATCH 'linux'
    INTERSECT
  SELECT docid FROM (
    SELECT docid FROM docs WHERE docs MATCH '"sqlite library"'
      UNION
    SELECT docid FROM docs WHERE docs MATCH '"sqlite database"'
  );
</codeblock>


<h2>Set Operations Using The Standard Query Syntax</h2>

<p>
  FTS query set operations using the standard query syntax are similar, but
  not identical, to set operations with the enhanced query syntax. There
  are four differences, as follows:

<ol>
  <li value=1><p> Only the implicit version of the AND operator is supported.
    Specifying the string "AND" as part of an standard query syntax query is
    interpreted as a term query for the set of documents containing the term 
    "and".
</ol>

<ol>
  <li value=2><p> Parenthesis are not supported.
</ol>

<ol>
  <li value=3><p> The NOT operator is not supported. Instead of the NOT 
    operator, the standard query syntax supports a unary "-" operator that
    may be applied to basic term and term-prefix queries (but not to phrase
    or NEAR queries). A term or term-prefix that has a unary "-" operator
    attached to it may not appear as an operand to an OR operator. An FTS
    query may not consist entirely of terms or term-prefix queries with unary
    "-" operators attached to them.
</ol>

<codeblock>
  <i>-- Search for the set of documents that contain the term "sqlite" but do</i>
  <i>-- not contain the term "database".</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite -database';
</codeblock>

<ol>
  <li value=4><p> The relative precedence of the set operations is different. 
   In particular, using the standard query syntax the "OR" operator has a
   higher precedence than "AND". The precedence of operators when using the
   standard query syntax is: 
</ol>

<table striped=1>
  <tr><th>Operator<th>Standard Query Syntax Precedence
  <tr><td>Unary "-" <td> Highest precedence (tightest grouping).
  <tr><td>OR  <td>
  <tr><td>AND <td> Lowest precedence (loosest grouping).
</table>

<ol><li style="list-style:none">
  The following example illustrates precedence of operators using the standard 
  query syntax:
</ol>

<codeblock>
  <i>-- Search for documents that contain at least one of the terms "database"</i>
  <i>-- and "sqlite", and also contain the term "library". Because of the differences</i>
  <i>-- in operator precedences, this query would have a different interpretation using</i>
  <i>-- the enhanced query syntax.</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite OR database library';
</codeblock>

<tcl>hd_fragment snippet {FTS auxiliary functions}</tcl>
<h1>Auxiliary Functions - Snippet, Offsets and Matchinfo</h1>

<p>
  The FTS3 and FTS4 modules provide three special SQL scalar functions that may be useful
  to the developers of full-text query systems: "snippet", "offsets" and
  "matchinfo". The purpose of the "snippet" and "offsets" functions is to allow
  the user to identify the location of queried terms in the returned documents.
  The "matchinfo" function provides the user with metrics that may be useful
  for filtering or sorting query results according to relevance.

<p>
  The first argument to all three special SQL scalar functions
  must be the [FTS hidden column] of the FTS table that the function is
  applied to.  The [FTS hidden column] is an automatically-generated column found on
  all FTS tables that has the same name as the FTS table itself.
  For example, given an FTS table named "mail":

<codeblock>
  SELECT offsets(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;
  SELECT snippet(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;
  SELECT matchinfo(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;
</codeblock>

<p>
  The three auxiliary functions are only useful within a SELECT statement that
  uses the FTS table's full-text index. ^If used within a SELECT that uses
  the "query by rowid" or "linear scan" strategies, then the snippet and
  offsets both return an empty string, and the matchinfo function returns
  a blob value zero bytes in size.

<p id=matchable>
  All three auxiliary functions extract a set of "matchable phrases" from
  the FTS query expression to work with. The set of matchable phrases for
  a given query consists of all phrases (including unquoted tokens and
  token prefixes) in the expression except those that are prefixed with
  a unary "-" operator (standard syntax) or are part of a sub-expression 
  that is used as the right-hand operand of a NOT operator.

<p>
  With the following provisos, each series of tokens in the FTS table that
  matches one of the matchable phrases in the query expression is known as a
  "phrase match":

<ol>
  <li> If a matchable phrase is part of a series of phrases connected by
       NEAR operators in the FTS query expression, then each phrase match
       must be sufficiently close to other phrase matches of the relevant
       types to satisfy the NEAR condition.

  <li> If the matchable phrase in the FTS query is restricted to matching
       data in a specified FTS table column, then only phrase matches that 
       occur within that column are considered.
</ol>
 
<tcl>hd_fragment offsets offsets</tcl>
<h2>The Offsets Function</h2>

<p>
  For a SELECT query that uses the full-text index, the offsets() function 
  returns a text value containing a series of space-separated integers. For
  each term in each <a href=#matchable>phrase match</a> of the current row, 
  there are four integers in the returned list. Each set of four integers is 
  interpreted as follows:

<table striped=1>
  <tr><th>Integer <th>Interpretation
  <tr><td>0 
      <td>The column number that the term instance occurs in (0 for the
          leftmost column of the FTS table, 1 for the next leftmost, etc.).
  <tr><td>1
      <td>The term number of the matching term within the full-text query
          expression. Terms within a query expression are numbered starting
          from 0 in the order that they occur.
  <tr><td>2
      <td>The byte offset of the matching term within the column.
  <tr><td>3
      <td>The size of the matching term in bytes.
</table>

<p>
  The following block contains examples that use the offsets function.

<codeblock>
  CREATE VIRTUAL TABLE mail USING fts3(subject, body);
  INSERT INTO mail VALUES('hello world', 'This message is a hello world message.');
  INSERT INTO mail VALUES('urgent: serious', 'This mail is seen as a more serious mail');

  <i>-- The following query returns a single row (as it matches only the first</i>
  <i>-- entry in table "mail". The text returned by the offsets function is</i>
  <i>-- "0 0 6 5 1 0 24 5".</i>
  <i>--</i>
  <i>-- The first set of four integers in the result indicate that column 0</i>
  <i>-- contains an instance of term 0 ("world") at byte offset 6. The term instance</i>
  <i>-- is 5 bytes in size. The second set of four integers shows that column 1</i>
  <i>-- of the matched row contains an instance of term 0 ("world") at byte offset</i>
  <i>-- 24. Again, the term instance is 5 bytes in size.</i>
  SELECT offsets(mail) FROM mail WHERE mail MATCH 'world';

  <i>-- The following query returns also matches only the first row in table "mail".</i>
  <i>-- In this case the returned text is "1 0 5 7 1 0 30 7".</i>
  SELECT offsets(mail) FROM mail WHERE mail MATCH 'message';

  <i>-- The following query matches the second row in table "mail". It returns the</i>
  <i>-- text "1 0 28 7 1 1 36 4". Only those occurrences of terms "serious" and "mail"</i>
  <i>-- that are part of an instance of the phrase "serious mail" are identified; the</i>
  <i>-- other occurrences of "serious" and "mail" are ignored.</i>
  SELECT offsets(mail) FROM mail WHERE mail MATCH '"serious mail"';
</codeblock>

<tcl>hd_fragment snippet snippet</tcl>
<h2>The Snippet Function</h2>

<p>
  The snippet function is used to create formatted fragments of document text
  for display as part of a full-text query results report. The snippet function 
  may be passed between one and six arguments, as follows:

<table striped=1>
  <tr><th>Argument <th>Default Value <th>Description
  <tr><td>0 <td>N/A
      <td> The first argument to the snippet function must always be the [FTS hidden column]
           of the FTS table being queried and from which the snippet is to be taken.  The
           [FTS hidden column] is an automatically generated column with the same name as the
           FTS table itself.
  <tr><td>1 <td>"&lt;b&gt;"
      <td> The "start match" text.
  <tr><td>2 <td>"&lt;/b&gt;"
      <td> The "end match" text.
  <tr><td>3 <td>"&lt;b&gt;...&lt;/b&gt;"
      <td> The "ellipses" text.
  <tr><td>4 <td>-1
      <td> The FTS table column number to extract the returned fragments of
           text from. Columns are numbered from left to right starting with
           zero. A negative value indicates that the text may be extracted
           from any column.
  <tr><td>5 <td>-15
      <td> The absolute value of this integer argument is used as the 
           (approximate) number of tokens to include in the returned text 
           value. The maximum allowable absolute value is 64. The value of
           this argument is referred to as <i>N</i> in the discussion below.
</table>

<p>
  The snippet function first attempts to find a fragment of text consisting
  of <i>|N|</i> tokens within the current row that contains at least one phrase 
  match for each matchable phrase matched somewhere in the current row, 
  where <i>|N|</i> is the absolute value of the sixth argument passed to the
  snippet function. If the text stored in a single column contains less than
  <i>|N|</i> tokens, then the entire column value is considered. Text fragments 
  may not span multiple columns.

<p>
  If such a text fragment can be found, it is returned with the following
  modifications:

<ul>
  <li> If the text fragment does not begin at the start of a column value,
       the "ellipses" text is prepended to it.
  <li> If the text fragment does not finish at the end of a column value,
       the "ellipses" text is appended to it.
  <li> For each token in the text fragment that is part of a phrase match,
       the "start match" text is inserted into the fragment before the token,
       and the "end match" text is inserted immediately after it.
</ul>

<p>
  If more than one such fragment can be found, then fragments that contain
  a larger number of "extra" phrase matches are favoured. The start of
  the selected text fragment may be moved a few tokens forward or backward
  to attempt to concentrate the phrase matches toward the center of the
  fragment.

<p>
  Assuming <i>N</i> is a positive value, if no fragments can be found that
  contain an phrase match corresponding to each matchable phrase, the snippet
  function attempts to find two fragments of approximately <i>N</i>/2 tokens
  that between them contain at least one phrase match for each matchable phrase
  matched by the current row. If this fails, attempts are made to find three
  fragments of <i>N</i>/3 tokens each and finally four <i>N</i>/4 token
  fragments. If a set of four fragments cannot be found that encompasses the
  required phrase matches, the four fragments of <i>N</i>/4 tokens that provide
  the best coverage are selected.

<p>
  If <i>N</i> is a negative value, and no single fragment can be found 
  containing the required phrase matches, the snippet function searches
  for two fragments of <i>|N|</i> tokens each, then three, then four. In
  other words, if the specified value of <i>N</i> is negative, the sizes
  of the fragments is not decreased if more than one fragment is required
  to provide the desired phrase match coverage.

<p>
  After the <i>M</i> fragments have been located, where <i>M</i> is between
  two and four as described in the paragraphs above, they are joined together
  in sorted order with the "ellipses" text separating them. The three 
  modifications enumerated earlier are performed on the text before it is 
  returned.

<codeblock>
  <b>Note: In this block of examples, newlines and whitespace characters have
  been inserted into the document inserted into the FTS table, and the expected
  results described in SQL comments. This is done to enhance readability only,
  they would not be present in actual SQLite commands or output.</b>

  <i>-- Create and populate an FTS table.</i>
  CREATE VIRTUAL TABLE text USING fts4();
  INSERT INTO text VALUES('
    During 30 Nov-1 Dec, 2-3oC drops. Cool in the upper portion, minimum temperature 14-16oC 
    and cool elsewhere, minimum temperature 17-20oC. Cold to very cold on mountaintops, 
    minimum temperature 6-12oC. Northeasterly winds 15-30 km/hr. After that, temperature 
    increases. Northeasterly winds 15-30 km/hr.     
  ');

  <i>-- The following query returns the text value:</i>
  <i>--</i>
  <i>--   "&lt;b&gt;...&lt;/b&gt;cool elsewhere, minimum temperature 17-20oC. &lt;b&gt;Cold&lt;/b&gt; to very </i>
  <i>--    &lt;b&gt;cold&lt;/b&gt; on mountaintops, minimum temperature 6&lt;b&gt;...&lt;/b&gt;".</i>
  <i>--</i>
  SELECT snippet(text) FROM text WHERE text MATCH 'cold';

  <i>-- The following query returns the text value:</i>
  <i>--</i>
  <i>--   "...the upper portion, &#91;minimum&#93; &#91;temperature&#93; 14-16oC and cool elsewhere,</i>
  <i>--    &#91;minimum&#93; &#91;temperature&#93; 17-20oC. Cold..."</i>
  <i>--</i>
  SELECT snippet(text, '&#91; '&#93;', '...') FROM text WHERE text MATCH '"min* tem*"'
</codeblock>

<h2 id=matchinfo tags=matchinfo>The Matchinfo Function</h2>

<p>
  The matchinfo function returns a blob value. If it is used within a query
  that does not use the full-text index (a "query by rowid" or "linear scan"),
  then the blob is zero bytes in size. Otherwise, the blob consists of zero
  or more 32-bit unsigned integers in machine byte-order. The exact number
  of integers in the returned array depends on both the query and the value
  of the second argument (if any) passed to the matchinfo function.

<p>
  The matchinfo function is called with either one or two arguments. As for
  all auxiliary functions, the first argument must be the special 
  [FTS hidden column]. The second argument, if it is specified, must be a text value
  comprised only of the characters 'p', 'c', 'n', 'a', 'l', 's' and 'x'.
  If no second argument is explicitly supplied, it defaults to "pcx". The
  second argument is refered to as the "format string" below.

<p>
  Characters in the matchinfo format string are processed from left to right. 
  Each character in the format string causes one or more 32-bit unsigned
  integer values to be added to the returned array. The "values" column in
  the following table contains the number of integer values appended to the
  output buffer for each supported format string character. In the formula
  given, <i>cols</i> is the number of columns in the FTS table, and 
  <i>phrases</i> is the number of <a href=#matchable>matchable phrases</a> in 
  the query. 

<table striped=1>
  <tr><th>Character<th>Values<th>Description
  <tr><td>p <td>1 <td>The number of matchable phrases in the query.
  <tr><td>c <td>1 <td>The number of user defined columns in the FTS
    table (i.e. not including the docid or the [FTS hidden column]).
  <tr><td>x <td style="white-space:nowrap">3 * <i>cols</i> * <i>phrases</i> 
    <td>
      For each distinct combination of a phrase and table column, the
      following three values:
      <ul>
        <li> In the current row, the number of times the phrase appears in 
             the column.
        <li> The total number of times the phrase appears in the column in
             all rows in the FTS table.
        <li> The total number of rows in the FTS table for which the 
             column contains at least one instance of the phrase.
      </ul>
      The first set of three values corresponds to the left-most column
      of the table (column 0) and the left-most matchable phrase in the
      query (phrase 0). If the table has more than one column, the second
      set of three values in the output array correspond to phrase 0 and
      column 1. Followed by phrase 0, column 2 and so on for all columns of
      the table. And so on for phrase 1, column 0, then phrase 1, column 1
      etc. In other words, the data for occurences of phrase <i>p</i> in
      column <i>c</i> may be found using the following formula:
<pre>
          hits_this_row  = array&#91;3 * (c + p*cols) + 0&#93;
          hits_all_rows  = array&#91;3 * (c + p*cols) + 1&#93;
          docs_with_hits = array&#91;3 * (c + p*cols) + 2&#93;
</pre>

  <tr><td>n <td>1 <td>The number of rows in the FTS4 table. This value is
    only available when querying FTS4 tables, not FTS3.
  <tr><td>a <td><i>cols</i> <td>For each column, the average number of
    tokens in the text values stored in the column (considering all rows in
    the FTS4 table). This value is only available when querying FTS4 tables,
    not FTS3.  
  <tr><td>l <td><i>cols</i> <td>
    For each column, the length of the value stored in the current row of the
    FTS4 table, in tokens.  This value is only available when querying
    FTS4 tables, not FTS3. And only if the "matchinfo=fts3" directive was not
    specified as part of the "CREATE VIRTUAL TABLE" statement used to create
    the FTS4 table.
  <tr><td>s <td><i>cols</i> <td>For each column, the length of the longest 
    subsequence of phrase matches that the column value has in common
    with the query text. For example, if a table column contains the text
    'a b c d e' and the query is 'a c "d e"', then the length of the longest
    common subsequence is 2 (phrase "c" followed by phrase "d e").
    
    
</table>

<p>
  For example:

<codeblock>
  <i>-- Create and populate an FTS4 table with two columns:</i>
  CREATE VIRTUAL TABLE t1 USING fts4(a, b);
  INSERT INTO t1 VALUES('transaction default models default', 'Non transaction reads');
  INSERT INTO t1 VALUES('the default transaction', 'these semantics present');
  INSERT INTO t1 VALUES('single request', 'default data');

  <i>-- In the following query, no format string is specified and so it defaults</i>
  <i>-- to "pcx". It therefore returns a single row consisting of a single blob</i>
  <i>-- value 80 bytes in size (20 32-bit integers - 1 for "p", 1 for "c" and</i>
  <i>-- 3*2*3 for "x"). If each block of 4 bytes in</i> the blob is interpreted</i> 
  <i>-- as an unsigned integer in machine byte-order, the values will be:</i>
  <i>--</i>
  <i>--     3 2  1 3 2  0 1 1  1 2 2  0 1 1  0 0 0  1 1 1</i>
  <i>--</i>
  <i>-- The row returned corresponds to the second entry inserted into table t1.</i>
  <i>-- The first two integers in the blob show that the query contained three</i>
  <i>-- phrases and the table being queried has two columns. The next block of</i>
  <i>-- three integers describes column 0 (in this case column "a") and phrase</i>
  <i>-- 0 (in this case "default"). The current row contains 1 hit for "default"</i>
  <i>-- in column 0, of a total of 3 hits for "default" that occur in column</i>
  <i>-- 0 of any table row. The 3 hits are spread across 2 different rows.</i>
  <i>--</i>
  <i>-- The next set of three integers (0 1 1) pertain to the hits for "default"</i>
  <i>-- in column 1 of the table (0 in this row, 1 in all rows, spread across </i>
  <i>-- 1 rows).</i>
  <i>--</i>
  SELECT matchinfo(t1) FROM t1 WHERE t1 MATCH 'default transaction "these semantics"';

  <i>-- The format string for this query is "nl". The output array will therefore</i>
  <i>-- contain 3 integer values - 1 for "n" and 2 for "l". The query returns</i>
  <i>-- two rows (the first two rows in the table match). The values returned are:</i>
  <i>--</i>
  <i>--     3  1 1</i>
  <i>--     3  2 0</i>
  <i>--</i>
  <i>-- The first value in the matchinfo array returned for both rows is 3 (the </i>
  <i>-- number of rows in the table). The following two values are the lengths </i>
  <i>-- of the longest common subsequence of phrase matches in each column.</i>
  SELECT matchinfo(t1, 'nl') FROM t1 WHERE t1 MATCH 'default transaction';
</codeblock>

<p>
  The matchinfo function is much faster than either the snippet or offsets
  functions. This is because the implementation of both snippet and offsets
  is required to retrieve the documents being analyzed from disk, whereas
  all data required by matchinfo is available as part of the same portions
  of the full-text index that are required to implement the full-text query
  itself. This means that of the following two queries, the first may be
  an order of magnitude faster than the second:

<codeblock>
  SELECT docid, matchinfo(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;;
  SELECT docid, offsets(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;;
</codeblock>

<p>
  The matchinfo function provides all the information required to calculate
  probabilistic "bag-of-words" relevancy scores such as 
  <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may
  be used to order results in a full-text search application. Appendix A of this 
  document, "[search application tips]", contains an example of using the
  matchinfo() function efficiently.

<h1 id=fts4aux tags="fts4aux">Fts4aux - Direct Access to the Full-Text Index</h1>

<p>
  As of version 3.7.6, SQLite includes a new virtual table module called 
  "fts4aux", which can be used to inspect the full-text index of an exiting
  FTS table directly. Despite its name, fts4aux works just as well with FTS3
  tables as it does with FTS4 tables. Fts4aux tables are read-only. The only
  way to modify the contents of an fts4aux table is by modifying the
  contents of the associated FTS table. The fts4aux module is automatically
  included in all [compile fts|builds that include FTS].

<p>
  An fts4aux virtual table is constructed with a single argument - the 
  unqualified name of the FTS table that it will be used to access.
  For example:

<codeblock>
  <i>-- Create an FTS4 table</i>
  CREATE VIRTUAL TABLE ft USING fts4(x, y);

  <i>-- Create an fts4aux table to access the full-text index for table "ft"</i>
  CREATE VIRTUAL TABLE ft_terms USING fts4aux(ft);
</codeblock>

<p>
  For each term present in the FTS table, there are between 2 and N+1 rows
  in the fts4aux table, where N is the number of user-defined columns in
  the associated FTS table. An fts4aux table always has the same four columns, 
  as follows, from left to right:

<table striped=1>
  <tr><th>Column Name<th>Column Contents
  <tr><td>term<td> 
    Contains the text of the term for this row.
  <tr><td>col<td> 
    This column may contain either the text value '*' (i.e. a single 
    character, UTF codepoint 42) or an integer between 0 and N-1, where N is
    again the number of user-defined columns in the corresponding FTS table.

  <tr><td>documents<td>
    This column always contains an integer value greater than zero.
    <br><br>
    If the "col" column contains the value '*', then this column
    contains the number of rows of the FTS table that contain at least one
    instance of the term (in any column). If col contains an integer
    value, then this column contains the number of rows of the FTS table that
    contain at least one instance of the term in the column identified by
    the col value. As usual, the columns of the FTS table are numbered
    from left to right, starting with zero.

  <tr><td>occurrences<td>
    This column also always contains an integer value greater than zero.
    <br><br>
    If the "col" column contains the value '*', then this column
    contains the total number of instances of the term in all rows of the 
    FTS table (in any column). Otherwise, if col contains an integer
    value, then this column contains the total number of instances of the
    term that appear in the FTS table column identified by the col
    value.
</table>

<p>
  For example, using the tables created above:

<codeblock>
  INSERT INTO ft(x, y) VALUES('Apple banana', 'Cherry');
  INSERT INTO ft(x, y) VALUES('Banana Date Date', 'cherry');
  INSERT INTO ft(x, y) VALUES('Cherry Elderberry', 'Elderberry');

  <i>-- The following query returns this data:</i>
  <i>--</i>
  <i>--     apple       |  *  |  1  |  1</i>
  <i>--     apple       |  0  |  1  |  1</i>
  <i>--     banana      |  *  |  2  |  2</i>
  <i>--     banana      |  0  |  2  |  2</i>
  <i>--     cherry      |  *  |  3  |  3</i>
  <i>--     cherry      |  0  |  1  |  1</i>
  <i>--     cherry      |  1  |  2  |  2</i>
  <i>--     date        |  *  |  1  |  2</i>
  <i>--     date        |  0  |  1  |  2</i>
  <i>--     elderberry  |  *  |  1  |  2</i>
  <i>--     elderberry  |  1  |  1  |  1</i>
  <i>--     elderberry  |  1  |  1  |  1</i>
  <i>--</i>
  SELECT term, col, documents, occurrences FROM ft_terms;
</codeblock>

<p>
  In the example, the values in the "term" column are all lower case, 
  even though they were inserted into table "ft" in mixed case. This is because
  an fts3aux table contains the terms as extracted from the document text
  by the [tokenizer]. In this case, since table "ft" uses the 
  [tokenizer|simple tokenizer], this means all terms have been folded to
  lower case. Also, there is (for example) no row with column "term"
  set to "apple" and column "col" set to 1. Since there are no instances
  of the term "apple" in column 1, no row is present in the fts4aux table.

<p>
  During a transaction, some of the data written to an FTS table may be 
  cached in memory and written to the database only when the transaction is 
  committed. However the implementation of the fts4aux module is only able 
  to read data from the database. In practice this means that if an fts4aux 
  table is queried from within a transaction in which the associated 
  FTS table has been modified, the results of the query are likely to reflect 
  only a (possibly empty) subset of the changes made.

<h1 id=fts4_options tags="FTS4 options">FTS4 Options</h1>

<p>
<tcl>hd_fragment *matchinfo_fts3 {matchinfo_fts3}</tcl>

  If the "CREATE VIRTUAL TABLE" statement specifies module FTS4 (not FTS3), 
  then special directives - FTS4 options - similar to the "tokenize=*" option
  may also appear in place of column names. An FTS4 option consists of the
  option name, followed by an "=" character, followed by the option value.
  The option value may optionally be enclosed in single or double quotes, with
  embedded quote characters escaped in the same way as for SQL literals. There
  may not be whitespace on either side of the "=" character. For example,
  to create an FTS4 table with the value of option "matchinfo" set to "fts3":

<codeblock>
  <i>-- Create a reduced-footprint FTS4 table.</i>
  CREATE VIRTUAL TABLE papers USING fts4(author, document, matchinfo=fts3);
</codeblock>

<p>
  FTS4 currently supports the following three options:

<table striped=1>
  <tr><th>Option<th>Interpretation
  <tr><td>matchinfo<td> This option may only be set to the value "fts3".
    Attempting to set it otherwise is an error. If this option is specified,
    then some of the extra information stored by FTS4 is omitted. This reduces
    the amount of disk space consumed by an FTS4 table until it is almost the
    same as the amount that would be used by the equivalent FTS3 table, but 
    also means that the data accessed by passing the 'l' flag to the 
    [matchinfo()] function is not available.

  <tr><td>compress<td>
    This option is used to specify the compress function. It is an error to
    specify a compress function without also specifying an uncompress
    function. [fts4 compress option|See below] for details.
    
  <tr><td>uncompress<td>
    This option is used to specify the uncompress function. It is an error to
    specify an uncompress function without also specifying a compress
    function. [fts4 compress option|See below] for details.

  <tr><td>order<td>
    <tcl>hd_fragment fts4order {FTS4 order option}</tcl>
    ^The "order" option may be set to either "DESC" or "ASC" (in upper or
    lower case). ^If it is set to "DESC", then FTS4 stores its data in such
    a way as to optimize returning results in descending order by docid.
    ^If it is set to "ASC" (the default), then the data structures are 
    optimized for returning results in ascending order by docid.  ^In other
    words, if many of the queries run against the FTS4 table use "ORDER BY
    docid DESC", then it may improve performance to add the "order=desc" 
    option to the CREATE VIRTUAL TABLE statement.

  <tr><td>prefix<td>
    This option may be set to a comma-separated list of positive non-zero 
    integers. For each integer N in the list, a separate index is created
    in the database file to optimize [prefix queries] where
    the query term is N bytes in length, not including the '*' character,
    when encoded using UTF-8. [fts4 prefix option|See below] for details.

</table>

<p>
  When using FTS4, specifying a column name that contains an "=" character
  and is not either a "tokenize=*" specification or a recognized FTS4 option
  is an error. With FTS3, the first token in the unrecognized directive is 
  interpreted as a column name. Similarly, specifying multiple "tokenize=*"
  directives in a single table declaration is an error when using FTS4, whereas
  the second and subsequent "tokenize=*" directives are interpreted as column
  names by FTS3. For example:

<codeblock>
  <i>-- An error. FTS4 does not recognize the directive "xyz=abc".</i>
  CREATE VIRTUAL TABLE papers USING fts4(author, document, xyz=abc);

  <i>-- Create an FTS3 table with three columns - "author", "document"</i>
  <i>-- and "xyz".</i>
  CREATE VIRTUAL TABLE papers USING fts3(author, document, xyz=abc);

  <i>-- An error. FTS4 does not allow multiple tokenize=* directives</i>
  CREATE VIRTUAL TABLE papers USING fts4(tokenize=porter, tokenize=simple);

  <i>-- Create an FTS3 table with a single column named "tokenize". The</i>
  <i>-- table uses the "porter" tokenizer.</i>
  CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple);

  <i>-- An error. Cannot create a table with two columns named "tokenize".</i>
  CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple, tokenize=icu);
</codeblock>

<tcl>hd_fragment *fts4compression {compressed FTS4 content}</tcl>
<h2 tags="fts4 compress option"> Compress/Uncompress </h2>

<p>
  The compress and uncompress options allow FTS4 content to be stored in
  the database in a compressed form. Both options should be set to the name
  of an SQL scalar function registered using [sqlite3_create_function()]
  that accepts a single argument. 

<p>
  The compress function should return a compressed version of the value 
  passed to it as an argument. Each time data is written to the FTS4 table, 
  each column value is passed to the compress function and the result value 
  stored in the database. The compress function may return any type of SQLite 
  value (blob, text, real, integer or null).

<p>
  The uncompress function should uncompress data previously compressed by
  the compress function. In other words, for all SQLite values X, it should
  be true that uncompress(compress(X)) equals X. When data that has been
  compressed by the compress function is read from the database by FTS4, it
  is passed to the uncompress function before it is used.

<p>
  If the specified compress or uncompress functions do not exist, the table
  may still be created. An error is not returned until the FTS4 table is
  read (if the uncompress function does not exist) or written (if it is the 
  compress function that does not exist).

<codeblock>
  <i>-- Create an FTS4 table that stores data in compressed form. This</i>
  <i>-- assumes that the scalar functions zip() and unzip() have been (or</i>
  <i>-- will be) added to the database handle.</i>
  CREATE VIRTUAL TABLE papers USING fts4(author, document, compress=zip, uncompress=unzip);
</codeblock>

<tcl>hd_fragment fts4prefix {FTS4 prefix option}</tcl>
<h2 tags="fts4 prefix option">Prefix</h2>

<p>
  ^The FTS4 prefix option causes FTS to index term prefixes of specified lengths
  in the same way that it always indexes complete terms.  ^The prefix option
  must be set to a comma separated list of positive non-zero integers. 
  ^For each value N in the list, prefixes of length N bytes (when encoded 
  using UTF-8) are indexed.  ^FTS4 uses term prefix indexes to speed up
  [prefix queries]. The cost, of course, is that indexing term prefixes as
  well as complete terms increases the database size and slows down write 
  operations on the FTS4 table.

<p>
  Prefix indexes may be used to optimize [prefix queries] in two cases.
  If the query is for a prefix of N bytes, then a prefix index created
  with "prefix=N" provides the best optimization. Or, if no "prefix=N"
  index is available, a "prefix=N+1" index imay be used instead. 
  Using a "prefix=N+1" index is less
  efficient than a "prefix=N" index, but is better than no prefix index at all.

<codeblock>
  <i>-- Create an FTS4 table with indexes to optimize 2 and 4 byte prefix queries.</i>
  CREATE VIRTUAL TABLE t1 USING fts4(c1, c2, prefix="2,4");

  <i>-- The following two queries are both optimized using the prefix indexes.</i>
  SELECT * FROM t1 WHERE t1 MATCH 'ab*';
  SELECT * FROM t1 WHERE t1 MATCH 'abcd*';

  <i>-- The following two queries are both partially optimized using the prefix</i>
  <i>-- indexes. The optimization is not as pronounced as it is for the queries</i>
  <i>-- above, but still an improvement over no prefix indexes at all.</i>
  SELECT * FROM t1 WHERE t1 MATCH 'a*';
  SELECT * FROM t1 WHERE t1 MATCH 'abc*';
</codeblock>

<h1 id=tokenizer tags="tokenizer">Tokenizers</h1>

<p>
  An FTS tokenizer is a set of rules for extracting terms from a document 
  or basic FTS full-text query. 

<p>
  Unless a specific tokenizer is specified as part of the CREATE 
  VIRTUAL TABLE statement used to create the FTS table, the default 
  tokenizer, "simple", is used. The simple tokenizer extracts tokens from
  a document or basic FTS full-text query according to the following 
  rules:

<ul>
  <li><p> A term is a contiguous sequence of eligible characters, where 
    eligible characters are all alphanumeric characters, the "_" character,
    and all characters with UTF codepoints greater than or equal to 128.
    All other characters are discarded when splitting a document into terms.
    Their only contribution is to separate adjacent terms.

  <li><p> All uppercase characters within the ASCII range (UTF codepoints less 
    than 128), are transformed to their lowercase equivalents as part of the
    tokenization process. Thus, full-text queries are case-insensitive when
    using the simple tokenizer.
</ul>

<p>
  For example, when a document containing the text "Right now, they're very
  frustrated.", the terms extracted from the document and added to the 
  full-text index are, in order, "right now they re very frustrated". Such
  a document would match a full-text query such as "MATCH 'Frustrated'", 
  as the simple tokenizer transforms the term in the query to lowercase
  before searching the full-text index.

<p>
  As well as the "simple" tokenizer, the FTS source code features a tokenizer 
  that uses the <a href="http://tartarus.org/~martin/PorterStemmer/">Porter 
  Stemming algorithm</a>. This tokenizer uses the same rules to separate
  the input document into terms, but as well as folding all terms to lower
  case it uses the Porter Stemming algorithm to reduce related English language
  words to a common root. For example, using the same input document as in the
  paragraph above, the porter tokenizer extracts the following tokens:
  "right now thei veri frustrat". Even though some of these terms are not even
  English words, in some cases using them to build the full-text index is more
  useful than the more intelligible output produced by the simple tokenizer.
  Using the porter tokenizer, the document not only matches full-text queries
  such as "MATCH 'Frustrated'", but also queries such as "MATCH 'Frustration'",
  as the term "Frustration" is reduced by the Porter stemmer algorithm to 
  "frustrat" - just as "Frustrated" is. So, when using the porter tokenizer,
  FTS is able to find not just exact matches for queried terms, but matches
  against similar English language terms. For more information on the 
  Porter Stemmer algorithm, please refer to the page linked above.

<p>
  Example illustrating the difference between the "simple" and "porter"
  tokenizers:

<codeblock>
  <i>-- Create a table using the simple tokenizer. Insert a document into it.</i>
  CREATE VIRTUAL TABLE simple USING fts3(tokenize=simple);
  INSERT INTO simple VALUES('Right now they''re very frustrated');

  <i>-- The first of the following two queries matches the document stored in</i>
  <i>-- table "simple". The second does not.</i>
  SELECT * FROM simple WHERE simple MATCH 'Frustrated';
  SELECT * FROM simple WHERE simple MATCH 'Frustration';

  <i>-- Create a table using the porter tokenizer. Insert the same document into it</i>
  CREATE VIRTUAL TABLE porter USING fts3(tokenize=porter);
  INSERT INTO porter VALUES('Right now they''re very frustrated');

  <i>-- Both of the following queries match the document stored in table "porter".</i>
  SELECT * FROM porter WHERE porter MATCH 'Frustrated';
  SELECT * FROM porter WHERE porter MATCH 'Frustration';
</codeblock>

<p>
  If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor
  symbol defined, then there exists a built-in tokenizer named "icu"
  implemented using the ICU library. The first argument passed to the
  xCreate() method (see fts3_tokenizer.h) of this tokenizer may be
  an ICU locale identifier. For example "tr_TR" for Turkish as used
  in Turkey, or "en_AU" for English as used in Australia. For example:

<codeblock>
    CREATE VIRTUAL TABLE thai_text USING fts3(text, tokenize=icu th_TH)
</codeblock>

<p>
  The ICU tokenizer implementation is very simple. It splits the input
  text according to the ICU rules for finding word boundaries and discards
  any tokens that consist entirely of white-space. This may be suitable
  for some applications in some locales, but not all. If more complex
  processing is required, for example to implement stemming or
  discard punctuation, this can be done by creating a tokenizer
  implementation that uses the ICU tokenizer as part of its implementation.

<h2>Custom (User Implemented) Tokenizers</h2>

<p>
  As well as the built-in "simple", "porter" and (possibly) "icu" tokenizers,
  FTS exports an interface that allows users to implement custom tokenizers
  using C. The interface used to create a new tokenizer is defined and 
  described in the fts3_tokenizer.h source file.

<p>
  Registering a new FTS tokenizer is similar to registering a new
  virtual table module with SQLite. The user passes a pointer to a
  structure containing pointers to various callback functions that
  make up the implementation of the new tokenizer type. For tokenizers,
  the structure (defined in fts3_tokenizer.h) is called
  "sqlite3_tokenizer_module".

<p>
  FTS does not expose a C-function that users call to register new
  tokenizer types with a database handle. Instead, the pointer must
  be encoded as an SQL blob value and passed to FTS through the SQL
  engine by evaluating a special scalar function, "fts3_tokenizer()".
  The fts3_tokenizer() function may be called with one or two arguments,
  as follows:

<codeblock>
    SELECT fts3_tokenizer(&lt;tokenizer-name&gt;);
    SELECT fts3_tokenizer(&lt;tokenizer-name&gt;, &lt;sqlite3_tokenizer_module ptr&gt;);
</codeblock>

<p>
  Where <tokenizer-name> is a string identifying the tokenizer and
  <sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module
  structure encoded as an SQL blob. If the second argument is present,
  it is registered as tokenizer <tokenizer-name> and a copy of it
  returned. If only one argument is passed, a pointer to the tokenizer
  implementation currently registered as <tokenizer-name> is returned,
  encoded as a blob. Or, if no such tokenizer exists, an SQL exception
  (error) is raised.

<p>
  <b>SECURITY WARNING</b>: If the fts3/4 extension is used in an environment
  where potentially malicious users may execute arbitrary SQL, they should 
  be prevented from invoking the fts3_tokenizer() function, possibly using 
  the [sqlite3_set_authorizer()|authorisation callback].

<p>
  The following block contains an example of calling the fts3_tokenizer()
  function from C code:

<codeblock>
  <i>/*
  ** Register a tokenizer implementation with FTS3 or FTS4.
  */</i>
  int registerTokenizer(
    sqlite3 *db,
    char *zName,
    const sqlite3_tokenizer_module *p
  ){
    int rc;
    sqlite3_stmt *pStmt;
    const char *zSql = "SELECT fts3_tokenizer(?, ?)";

    rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
    if( rc!=SQLITE_OK ){
      return rc;
    }

    sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
    sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC);
    sqlite3_step(pStmt);

    return sqlite3_finalize(pStmt);
  }

  <i>/*
  ** Query FTS for the tokenizer implementation named zName.
  */</i>
  int queryTokenizer(
    sqlite3 *db,
    char *zName,
    const sqlite3_tokenizer_module **pp
  ){
    int rc;
    sqlite3_stmt *pStmt;
    const char *zSql = "SELECT fts3_tokenizer(?)";

    *pp = 0;
    rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
    if( rc!=SQLITE_OK ){
      return rc;
    }

    sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
    if( SQLITE_ROW==sqlite3_step(pStmt) ){
      if( sqlite3_column_type(pStmt, 0)==SQLITE_BLOB ){
        memcpy(pp, sqlite3_column_blob(pStmt, 0), sizeof(*pp));
      }
    }

    return sqlite3_finalize(pStmt);
  }
</codeblock>

  
<h1 tags="segment btree">Data Structures</h1>

<p>
  This section describes at a high-level the way the FTS module stores its
  index and content in the database. It is <b>not necessary to read or 
  understand the material in this section in order to use FTS</b> in an 
  application. However, it may be useful to application developers attempting 
  to analyze and understand FTS performance characteristics, or to developers 
  contemplating enhancements to the existing FTS feature set.

<tcl>hd_fragment *shadowtab {FTS shadow tables}</tcl>
<p>
  For each FTS virtual table in a database, three to five real (non-virtual) tables
  are created to store the underlying data.  These real tables are called "shadow tables".
  The real tables are named "%_content",
  "%_segdir", "%_segments", "%_stat", and "%_docsize", where "%" is replaced by the name
  of the FTS virtual table.

<p>
  The leftmost column of the "%_content" table is an INTEGER PRIMARY KEY field
  named "docid". Following this is one column for each column of the FTS
  virtual table as declared by the user, named by prepending the column name
  supplied by the user with "c<i>N</i>", where <i>N</i> is the index of the 
  column within the table, numbered from left to right starting with 0. Data
  types supplied as part of the virtual table declaration are not used as
  part of the %_content table declaration. For example:

<codeblock>
  <i>-- Virtual table declaration</i>
  CREATE VIRTUAL TABLE abc USING fts4(a NUMBER, b TEXT, c);

  <i>-- Corresponding %_content table declaration</i>
  CREATE TABLE abc_content(docid INTEGER PRIMARY KEY, c0a, c1b, c2c);
</codeblock>

<p>
  The %_content table contains the unadulterated data inserted by the user 
  into the FTS virtual table by the user. If the user does not explicitly
  supply a "docid" value when inserting records, one is selected automatically
  by the system.

<p>
  The %_stat and %_docsize tables are only created if the FTS table uses the
  FTS4 module, not FTS3. Furthermore, the %_docsize table is omitted if the
  FTS4 table is created with the [matchinfo_fts3|"matchinfo=fts3"] directive
  specified as part of the CREATE VIRTUAL TABLE statement. If they are created,
  the schema of the two tables is as follows:
<codeblock>
  CREATE TABLE %_stat(
    id INTEGER PRIMARY KEY, 
    value BLOB
  );

  CREATE TABLE %_docsize(
    docid INTEGER PRIMARY KEY,
    size BLOB
  );
</codeblock>

<p>
  For each row in the FTS table, the %_docsize table contains a corresponding
  row with the same "docid" value. The "size" field contains a blob consisting
  of <i>N</i> FTS varints, where <i>N</i> is the number of user-defined columns
  in the table. Each varint in the "size" blob is the number of tokens in the
  corresponding column of the associated row in the FTS table. The %_stat table
  always contains a single row with the "id" column set to 0. The "value" 
  column contains a blob consisting of <i>N+1</i> FTS varints, where <i>N</i>
  is again the number of user-defined columns in the FTS table. The first
  varint in the blob is set to the total number of rows in the FTS table. The
  second and subsequent varints contain the total number of tokens stored in
  the corresponding column for all rows of the FTS table.

<p>
  The two remaining tables, %_segments and %_segdir, are used to store the 
  full-text index. Conceptually, this index is a lookup table that maps each 
  term (word) to the set of docid values corresponding to records in the 
  %_content table that contain one or more occurrences of the term. To
  retrieve all documents that contain a specified term, the FTS module
  queries this index to determine the set of docid values for records that
  contain the term, then retrieves the required documents from the %_content
  table. Regardless of the schema of the FTS virtual table, the %_segments
  and %_segdir tables are always created as follows:

<codeblock>
  CREATE TABLE %_segments(
    blockid INTEGER PRIMARY KEY,       <i>-- B-tree node id</i>
    block blob                         <i>-- B-tree node data</i>
  );

  CREATE TABLE %_segdir(
    level INTEGER,
    idx INTEGER,
    start_block INTEGER,               <i>-- Blockid of first node in %_segments</i>
    leaves_end_block INTEGER,          <i>-- Blockid of last leaf node in %_segments</i>
    end_block INTEGER,                 <i>-- Blockid of last node in %_segments</i>
    root BLOB,                         <i>-- B-tree root node</i>
    PRIMARY KEY(level, idx)
  );
</codeblock>

<p>
  The schema depicted above is not designed to store the full-text index 
  directly. Instead, it is used to one or more b-tree structures. There
  is one b-tree for each row in the %_segdir table. The %_segdir table
  row contains the root node and various meta-data associated with the
  b-tree structure, and the %_segments table contains all other (non-root)
  b-tree nodes. Each b-tree is referred to as a "segment". Once it has
  been created, a segment b-tree is never updated (although it may be
  deleted altogether).

<p>
  The keys used by each segment b-tree are terms (words). As well as the
  key, each segment b-tree entry has an associated "doclist" (document list).
  A doclist consists of zero or more entries, where each entry consists of:

<ul>
  <li> A docid (document id), and
  <li> A list of term offsets, one for each occurrence of the term within
       the document. A term offset indicates the number of tokens (words)
       that occur before the term in question, not the number of characters
       or bytes. For example, the term offset of the term "war" in the
       phrase "Ancestral voices prophesying war!" is 3.
</ul>

<p>
  Entries within a doclist are sorted by docid. Positions within a doclist
  entry are stored in ascending order.

<p>
  The contents of the logical full-text index is found by merging the
  contents of all segment b-trees. If a term is present in more than one
  segment b-tree, then it maps to the union of each individual doclist. If,
  for a single term, the same docid occurs in more than one doclist, then only
  the doclist that is part of the most recently created segment b-tree is 
  considered valid. 

<p>
  Multiple b-tree structures are used instead of a single b-tree to reduce
  the cost of inserting records into FTS tables. When a new record is 
  inserted into an FTS table that already contains a lot of data, it is
  likely that many of the terms in the new record are already present in
  a large number of existing records. If a single b-tree were used, then
  large doclist structures would have to be loaded from the database,
  amended to include the new docid and term-offset list, then written back
  to the database. Using multiple b-tree tables allows this to be avoided
  by creating a new b-tree which can be merged with the existing b-tree
  (or b-trees) later on. Merging of b-tree structures can be performed as
  a background task, or once a certain number of separate b-tree structures
  have been accumulated. Of course, this scheme makes queries more expensive
  (as the FTS code may have to look up individual terms in more than one
  b-tree and merge the results), but it has been found that in practice this
  overhead is often negligible.
  
<h2>Variable Length Integer (varint) Format</h2>

<p>
  Integer values stored as part of segment b-tree nodes are encoded using the
  FTS varint format. This encoding is similar, but <b>not identical</b>, to the
  the <a href="fileformat.html#varint_format">SQLite varint format</a>.

<p>
  An encoded FTS varint consumes between one and ten bytes of space. The
  number of bytes required is determined by the sign and magnitude of the
  integer value encoded. More accurately, the number of bytes used to store
  the encoded integer depends on the position of the most significant set bit
  in the 64-bit twos-complement representation of the integer value. Negative
  values always have the most significant bit set (the sign bit), and so are
  always stored using the full ten bytes. Positive integer values may be
  stored using less space.

<p>
  The final byte of an encoded FTS varint has its most significant bit 
  cleared. All preceding bytes have the most significant bit set. Data
  is stored in the remaining seven least significant bits of each byte.
  The first byte of the encoded representation contains the least significant
  seven bits of the encoded integer value. The second byte of the encoded
  representation, if it is present, contains the seven next least significant
  bits of the integer value, and so on. The following table contains examples
  of encoded integer values:

<table striped=1>
  <tr><th>Decimal<th>Hexadecimal<th width=100%>Encoded Representation
  <tr><td>43<td>0x000000000000002B<td>0x2B 
  <tr><td>200815<td>0x000000000003106F<td>0x9C 0xA0 0x0C
  <tr><td>-1<td>0xFFFFFFFFFFFFFFFF<td>0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0x01
</table>
  

<h2>Segment B-Tree Format</h2>

<p>
  Segment b-trees are prefix-compressed b+-trees. There is one segment b-tree
  for each row in the %_segdir table (see above). The root node of the segment
  b-tree is stored as a blob in the "root" field of the corresponding row
  of the %_segdir table. All other nodes (if any exist) are stored in the 
  "blob" column of the %_segments table. Nodes within the %_segments table are
  identified by the integer value in the blockid field of the corresponding
  row. The following table describes the fields of the %_segdir table:

<table striped=1>
  <tr><th>Column           <th width=100%>Interpretation
  <tr><td>level            <td> 
    Between them, the contents of the "level" and "idx" fields define the
    relative age of the segment b-tree. The smaller the value stored in the
    "level" field, the more recently the segment b-tree was created. If two
    segment b-trees are of the same "level", the segment with the larger
    value stored in the "idx" column is more recent. The PRIMARY KEY constraint
    on the %_segdir table prevents any two segments from having the same value
    for both the "level" and "idx" fields.
  <tr><td>idx              <td> See above.
  <tr><td>start_block      <td>
    The blockid that corresponds to the node with the smallest blockid that 
    belongs to this segment b-tree. Or zero if the entire segment b-tree
    fits on the root node. If it exists, this node is always a leaf node.
  <tr><td>leaves_end_block <td>
    The blockid that corresponds to the leaf node with the largest blockid 
    that belongs to this segment b-tree. Or zero if the entire segment b-tree
    fits on the root node.
  <tr><td>end_block <td>
    The blockid that corresponds to the interior node with the largest 
    blockid that belongs to this segment b-tree.  Or zero if the entire segment
    b-tree fits on the root node. If it exists, this node is always an
    interior node.
  <tr><td>root             <td>
    Blob containing the root node of the segment b-tree.
</table>

<p>
  Apart from the root node, the nodes that make up a single segment b-tree are
  always stored using a contiguous sequence of blockids. Furthermore, the
  nodes that make up a single level of the b-tree are themselves stored as
  a contiguous block, in b-tree order. The contiguous sequence of blockids
  used to store the b-tree leaves are allocated starting with the blockid
  value stored in the "start_block" column of the corresponding %_segdir row,
  and finishing at the blockid value stored in the "leaves_end_block"
  field of the same row. It is therefore possible to iterate through all the
  leaves of a segment b-tree, in key order, by traversing the %_segments 
  table in blockid order from "start_block" to "leaves_end_block".  

<h3>Segment B-Tree Leaf Nodes</h3>

<p>
  The following diagram depicts the format of a segment b-tree leaf node.

<center>
  <img src=images/fts3_leaf_node.png>
  <p> Segment B-Tree Leaf Node Format
</center>

<p>
  The first term stored on each node ("Term 1" in the figure above) is
  stored verbatim. Each subsequent term is prefix-compressed with respect
  to its predecessor. Terms are stored within a page in sorted (memcmp)
  order.

<h3>Segment B-Tree Interior Nodes</h3>

<p>
  The following diagram depicts the format of a segment b-tree interior 
  (non-leaf) node.

<center>
  <img src=images/fts3_interior_node.png>
  <p> Segment B-Tree Interior Node Format
</center>


<h2>Doclist Format</h2>

<p>
  A doclist consists of an array of 64-bit signed integers, serialized using
  the FTS varint format. Each doclist entry is made up of a series of two 
  or more integers, as follows:

<ol>
  <li> The docid value. The first entry in a doclist contains the literal docid
       value. The first field of each subsequent doclist entry contains the 
       difference between the new docid and the previous one (always a positive 
       number).
  <li> Zero or more term-offset lists. A term-offset list is present for each
       column of the FTS virtual table that contains the term. A term-offset
       list consists of the following:
     <ol>
       <li> Constant value 1. This field is omitted for any term-offset list
            associated with column 0.
       <li> The column number (1 for the second leftmost column, etc.). This
            field is omitted for any term-offset list associated with column 0.
       <li> A list of term-offsets, sorted from smallest to largest. Instead
            of storing the term-offset value literally, each integer stored 
            is the difference between the current term-offset and the previous 
            one (or zero if the current term-offset is the first), plus 2.
     </ol>
  <li> Constant value 0.
</ol>

<center>
  <img src=images/fts3_doclist2.png>
  <p> FTS3 Doclist Format
</center>

<center>
  <img src=images/fts3_doclist.png>
  <p> FTS Doclist Entry Format
</center>

<p>
  For doclists for which the term appears in more than one column of the FTS
  virtual table, term-offset lists within the doclist are stored in column 
  number order. This ensures that the term-offset list associated with 
  column 0 (if any) is always first, allowing the first two fields of the
  term-offset list to be omitted in this case.

<h1 id=appendix_a nonumber tags="search application tips">
  Appendix A: Search Application Tips
</h1>

<p>
  FTS is primarily designed to support Boolean full-text queries - queries
  to find the set of documents that match a specified criteria. However, many 
  (most?) search applications require that results are somehow ranked in order
  of "relevance", where "relevance" is defined as the likelihood that the user
  who performed the search is interested in a specific element of the returned
  set of documents. When using a search engine to find documents on the world
  wide web, the user expects that the most useful, or "relevant", documents 
  will be returned as the first page of results, and that each subsequent page 
  contains progressively less relevant results. Exactly how a machine can 
  determine document relevance based on a users query is a complicated problem
  and the subject of much ongoing research.

<p>
  One very simple scheme might be to count the number of instances of the 
  users search terms in each result document. Those documents that contain
  many instances of the terms are considered more relevant than those with
  a small number of instances of each term. In an FTS application, the 
  number of term instances in each result could be determined by counting
  the number of integers in the return value of the [offsets] function.
  The following example shows a query that could be used to obtain the
  ten most relevant results for a query entered by the user:

<codeblock>
  <i>-- This example (and all others in this section) assumes the following schema</i>
  CREATE VIRTUAL TABLE documents USING fts3(title, content);

  <i>-- Assuming the application has supplied an SQLite user function named "countintegers"</i>
  <i>-- that returns the number of space-separated integers contained in its only argument,</i>
  <i>-- the following query could be used to return the titles of the 10 documents that contain</i>
  <i>-- the greatest number of instances of the users query terms. Hopefully, these 10</i>
  <i>-- documents will be those that the users considers more or less the most "relevant".</i>
  SELECT title FROM documents 
    WHERE documents MATCH &lt;query&gt;
    ORDER BY countintegers(offsets(documents)) DESC
    LIMIT 10 OFFSET 0
</codeblock>

<p>
  The query above could be made to run faster by using the FTS [matchinfo]
  function to determine the number of query term instances that appear in each
  result. The matchinfo function is much more efficient than the offsets 
  function. Furthermore, the matchinfo function provides extra information
  regarding the overall number of occurrences of each query term in the entire
  document set (not just the current row) and the number of documents in which 
  each query term appears. This may be used (for example) to attach a higher
  weight to less common terms which may increase the overall computed relevancy 
  of those results the user considers more interesting.

<codeblock>
  <i>-- If the application supplies an SQLite user function called "rank" that</i>
  <i>-- interprets the blob of data returned by matchinfo and returns a numeric</i>
  <i>-- relevancy based on it, then the following SQL may be used to return the</i>
  <i>-- titles of the 10 most relevant documents in the dataset for a users query.</i>
  SELECT title FROM documents 
    WHERE documents MATCH &lt;query&gt;
    ORDER BY rank(matchinfo(documents)) DESC
    LIMIT 10 OFFSET 0
</codeblock>

<p>
  The SQL query in the example above uses less CPU than the first example
  in this section, but still has a non-obvious performance problem. SQLite
  satisfies this query by retrieving the value of the "title" column and
  matchinfo data from the FTS module for every row matched by the users
  query before it sorts and limits the results. Because of the way SQLite's
  virtual table interface works, retrieving the value of the "title" column
  requires loading the entire row from disk (including the "content" field,
  which may be quite large). This means that if the users query matches
  several thousand documents, many megabytes of "title" and "content" data
  may be loaded from disk into memory even though they will never be used
  for any purpose. 

<p>
  The SQL query in the following example block is one solution to this 
  problem. In SQLite, when a <a href="optoverview.html#flattening">sub-query 
  used in a join contains a LIMIT clause</a>, the results of the sub-query are
  calculated and stored in temporary table before the main query is executed.
  This means that SQLite will load only the docid and matchinfo data for each
  row matching the users query into memory, determine the docid values
  corresponding to the ten most relevant documents, then load only the title
  and content information for those 10 documents only. Because both the matchinfo
  and docid values are gleaned entirely from the full-text index, this results
  in dramatically less data being loaded from the database into memory.

<codeblock>
  SELECT title FROM documents JOIN ( 
      SELECT docid, rank(matchinfo(documents)) AS rank 
      FROM documents
      WHERE documents MATCH &lt;query&gt;
      ORDER BY rank DESC 
      LIMIT 10 OFFSET 0
  ) AS ranktable USING(docid)
  ORDER BY ranktable.rank DESC
</codeblock>

<p>
  The next block of SQL enhances the query with solutions to two other problems
  that may arise in developing search applications using FTS:

<ol>
  <li> <p>
       The [snippet] function cannot be used with the above query. Because
       the outer query does not include a "WHERE ... MATCH" clause, the snippet 
       function may not be used with it. One solution is to duplicate the WHERE
       clause used by the sub-query in the outer query. The overhead associated
       with this is usually negligible.
  <li> <p>
       The relevancy of a document may depend on something other than just
       the data available in the return value of matchinfo. For example
       each document in the database may be assigned a static weight based
       on factors unrelated to its content (origin, author, age, number
       of references etc.). These values can be stored by the application
       in a separate table that can be joined against the documents table
       in the sub-query so that the rank function may access them.
</ol>

<p>
  This version of the query is very similar to that used by the 
  <a href="http://www.sqlite.org/search?q=fts3">sqlite.org documentation search</a> 
  application.

<codeblock>
  <i>-- This table stores the static weight assigned to each document in FTS table</i>
  <i>-- "documents". For each row in the documents table there is a corresponding row</i>
  <i>-- with the same docid value in this table.</i>
  CREATE TABLE documents_data(docid INTEGER PRIMARY KEY, weight);

  <i>-- This query is similar to the one in the block above, except that:</i>
  <i>--</i>
  <i>--   1. It returns a "snippet" of text along with the document title for display. So</i>
  <i>--      that the snippet function may be used, the "WHERE ... MATCH ..." clause from</i>
  <i>--      the sub-query is duplicated in the outer query.</i>
  <i>--</i>
  <i>--   2. The sub-query joins the documents table with the document_data table, so that</i>
  <i>--      implementation of the rank function has access to the static weight assigned</i>
  <i>--      to each document.</i>
  SELECT title, snippet(documents) FROM documents JOIN ( 
      SELECT docid, rank(matchinfo(documents), documents_data.weight) AS rank
      FROM documents JOIN documents_data USING(docid)
      WHERE documents MATCH &lt;query&gt;
      ORDER BY rank DESC 
      LIMIT 10 OFFSET 0
  ) AS ranktable USING(docid)
  WHERE documents MATCH &lt;query&gt;
  ORDER BY ranktable.rank DESC
</codeblock>

<p>
  All the example queries above return the ten most relevant query results.
  By modifying the values used with the OFFSET and LIMIT clauses, a query 
  to return (say) the next ten most relevant results is easy to construct. 
  This may be used to obtain the data required for a search applications second
  and subsequent pages of results.

<p>
  The next block contains an example rank function that uses matchinfo data 
  implemented in C. Instead of a single weight, it allows a weight to be 
  externally assigned to each column of each document. It may be registered
  with SQLite like any other user function using [sqlite3_create_function].

<codeblock>
<i>/*</i>
<i>** SQLite user defined function to use with matchinfo() to calculate the</i>
<i>** relevancy of an FTS match. The value returned is the relevancy score</i>
<i>** (a real value greater than or equal to zero). A larger value indicates </i>
<i>** a more relevant document.</i>
<i>**</i>
<i>** The overall relevancy returned is the sum of the relevancies of each </i>
<i>** column value in the FTS table. The relevancy of a column value is the</i>
<i>** sum of the following for each reportable phrase in the FTS query:</i>
<i>**</i>
<i>**   (&lt;hit count&gt; / &lt;global hit count&gt) * &lt;column weight&gt;</i>
<i>**</i>
<i>** where &lt;hit count&gt; is the number of instances of the phrase in the</i>
<i>** column value of the current row and &lt;global hit count&gt; is the number</i>
<i>** of instances of the phrase in the same column of all rows in the FTS</i>
<i>** table. The &lt;column weight&gt; is a weighting factor assigned to each</i>
<i>** column by the caller (see below).</i>
<i>**</i>
<i>** The first argument to this function must be the return value of the FTS </i>
<i>** matchinfo() function. Following this must be one argument for each column </i>
<i>** of the FTS table containing a numeric weight factor for the corresponding </i>
<i>** column. Example:</i>
<i>**</i>
<i>**     CREATE VIRTUAL TABLE documents USING fts3(title, content)</i>
<i>**</i>
<i>** The following query returns the docids of documents that match the full-text</i>
<i>** query &lt;query&gt; sorted from most to least relevant. When calculating</i>
<i>** relevance, query term instances in the 'title' column are given twice the</i>
<i>** weighting of those in the 'content' column.</i>
<i>**</i>
<i>**     SELECT docid FROM documents </i>
<i>**     WHERE documents MATCH &lt;query&gt; </i>
<i>**     ORDER BY rank(matchinfo(documents), 1.0, 0.5) DESC</i>
<i>*/</i>
static void rankfunc(sqlite3_context *pCtx, int nVal, sqlite3_value **apVal){
  int *aMatchinfo;                <i>/* Return value of matchinfo() */</i>
  int nCol;                       <i>/* Number of columns in the table */</i>
  int nPhrase;                    <i>/* Number of phrases in the query */</i>
  int iPhrase;                    <i>/* Current phrase */</i>
  double score = 0.0;             <i>/* Value to return */</i>

  assert( sizeof(int)==4 );

<i>  /* Check that the number of arguments passed to this function is correct.</i>
<i>  ** If not, jump to wrong_number_args. Set aMatchinfo to point to the array</i>
<i>  ** of unsigned integer values returned by FTS function matchinfo. Set</i>
<i>  ** nPhrase to contain the number of reportable phrases in the users full-text</i>
<i>  ** query, and nCol to the number of columns in the table.</i>
<i>  */</i>
  if( nVal&lt;1 ) goto wrong_number_args;
  aMatchinfo = (unsigned int *)sqlite3_value_blob(apVal&#x5B;0&#x5D;);
  nPhrase = aMatchinfo&#x5B;0&#x5D;;
  nCol = aMatchinfo&#x5B;1&#x5D;;
  if( nVal!=(1+nCol) ) goto wrong_number_args;

<i>  /* Iterate through each phrase in the users query. */</i>
  for(iPhrase=0; iPhrase&lt;nPhrase; iPhrase++){
    int iCol;                     <i>/* Current column */</i>

<i>    /* Now iterate through each column in the users query. For each column,</i>
<i>    ** increment the relevancy score by:</i>
<i>    **</i>
<i>    **   (&lt;hit count&gt; / &lt;global hit count&gt) * &lt;column weight&gt;</i>
<i>    **</i>
<i>    ** aPhraseinfo&#x5B;&#x5D; points to the start of the data for phrase iPhrase. So</i>
<i>    ** the hit count and global hit counts for each column are found in </i>
<i>    ** aPhraseinfo&#x5B;iCol*3&#x5D; and aPhraseinfo&#x5B;iCol*3+1&#x5D;, respectively.</i>
<i>    */</i>
    int *aPhraseinfo = &aMatchinfo&#x5B;2 + iPhrase*nCol*3&#x5D;;
    for(iCol=0; iCol&lt;nCol; iCol++){
      int nHitCount = aPhraseinfo&#x5B;3*iCol&#x5D;;
      int nGlobalHitCount = aPhraseinfo&#x5B;3*iCol+1&#x5D;;
      double weight = sqlite3_value_double(apVal&#x5B;iCol+1&#x5D;);
      if( nHitCount>0 ){
        score += ((double)nHitCount / (double)nGlobalHitCount) * weight;
      }
    }
  }

  sqlite3_result_double(pCtx, score);
  return;

<i>  /* Jump here if the wrong number of arguments are passed to this function */</i>
wrong_number_args:
  sqlite3_result_error(pCtx, "wrong number of arguments to function rank()", -1);
}
</codeblock>