<title>SQLite FTS3 Extension</title>
<tcl>
hd_keywords *fts3 FTS3
source [file join $::DOC pages fancyformat.tcl]
fancyformat_document "SQLite FTS3 Extension" {} {
<h2 style="margin-left:1.0em"> Overview</h2>
<p>
FTS3 is an SQLite virtual table module that allows users to perform
full-text searches on a set of documents. The most common (and effective)
way to describe full-text searches is "what Google, Yahoo and Altavista do
with documents placed on the World Wide Web". Users input a term, or series
of terms, perhaps connected by a binary operator or grouped together into a
phrase, and the full-text query system finds the set of documents that best
matches those terms considering the operators and groupings the user has
specified. This document describes the deployment and usage of FTS3.
<p>
Portions of the original FTS3 code were contributed to the SQLite project
by Scott Hess of <a href="http://www.google.com">Google</a>. It is now
developed and maintained as part of SQLite.
[h1 "Introduction to FTS3"]
<p>
The FTS3 extension module allows users to create special tables with a
built-in full-text index (hereafter "FTS3 tables"). The full-text index
allows the user to efficiently query the database for all rows that contain
one or more instances specified word (hereafter a "token", even if the table
contains many large documents.
<p>
For example, if each of the 517430 documents in the
"<a href="http://www.cs.cmu.edu/~enron/">Enron E-Mail Dataset</a>"
is inserted into both the FTS3 table and the ordinary SQLite table
created using the following SQL script:
[Code {
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT); /* FTS3 table */
CREATE TABLE enrondata2(content TEXT); /* Ordinary table */
}]
<p>
Then either of the two queries below may be executed to find the number of
documents in the database that contain the word "linux" (351). Using one
desktop PC hardware configuration, the query on the FTS3 table returns in
approximately 0.03 seconds, versus 22.5 for querying the ordinary table.
[Code {
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux'; /* 0.03 seconds */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
}]
<p>
Of course, the two queries above are not entirely equivalent. For example
the LIKE query matches rows that contain terms such as "linuxophobe"
or "EnterpriseLinux" (as it happens, the Enron E-Mail Dataset does not
actually contain any such terms), whereas the MATCH query on the FTS3 table
selects only those rows that contain "linux" as a discrete token. Both
searches are case-insensitive. The FTS3 table consumes around 2006 MB on
disk compared to just 1453 MB for the ordinary table. Using the same
hardware configuration used to perform the SELECT queries above, the FTS3
table took just under 31 minutes to populate, versus 25 for the ordinary
table.
[h2 "Creating and Destroying FTS3 Tables"]
<p>
Like other virtual table types, new FTS3 tables are created using a
\[CREATE VIRTUAL TABLE\] statement. The module name, which follows
the USING keyword, is "fts3". The virtual table module arguments may
be left empty, in which case an FTS3 table with a single user-defined
column named "content" is created. Alternatively, the module arguments
may be passed a list of comma separated column names.
<p>
If column names are explicitly provided for the FTS3 table as part of
the CREATE VIRTUAL TABLE statement, then a datatype name may be optionally
specified for each column. However, this is pure syntactic sugar, the
supplied typenames are not used by FTS3 or the SQLite core for any
purpose. The same applies to any constraints specified along with an
FTS3 column name - they are parsed but not used or recorded by the system
in any way.
[Code {
<i>-- Create an FTS3 table named "data" with one column - "content":</i>
CREATE VIRTUAL TABLE data USING fts3();
<i>-- Create an FTS3 table named "pages" with three columns:</i>
CREATE VIRTUAL TABLE pages USING fts3(title, keywords, body);
<i>-- Create an FTS3 table named "mail" with two columns. Datatypes
-- and column constraints are specified along with each column. These
-- are completely ignored by FTS3 and SQLite. </i>
CREATE VIRTUAL TABLE mail USING fts3(
subject VARCHAR(256) NOT NULL,
body TEXT CHECK(length(body)<10240)
);
}]
<p>
As well as a list of columns, the module arguments passed to a CREATE
VIRTUAL TABLE statement used to create an FTS3 table may be used to specify
a \[tokenizer\]. This is done by specifying a string of the form
"tokenize=<tokenizer name> <tokenizer args<" in place of a column
name, where <tokenizer name> is the name of the tokenizer to use and
<tokenizer args> is an optional list of whitespace separated qualifiers
to pass to the tokenizer implementation. A tokenizer specification may be
placed anywhere in the column list, but at most one tokenizer declaration is
allowed for each CREATE VIRTUAL TABLE statement. The second and subsequent
tokenizer declaration are interpreted as column names.
\[tokenizer|See below\] for a detailed description of using (and, if
necessary, implementing) a tokenizer.
[Code {
<i>-- Create an FTS3 table named "papers" with two columns that uses</i>
<i>-- the tokenizer "porter".</i>
CREATE VIRTUAL TABLE papers USING fts3(author, document, tokenize=porter);
<i>-- Create an FTS3 table with a single column - "content" - that uses</i>
<i>-- the "simple" tokenizer.</i>
CREATE VIRTUAL TABLE data USING fts3(tokenize=simple);
<i>-- Create an FTS3 table with two columns that uses the "icu" tokenizer.</i>
<i>-- The qualifier "en_AU" is passed to the tokenizer implementation</i>
CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU);
}]
<p>
FTS3 tables may be dropped from the database using an ordinary \[DROP TABLE\]
statement. For example:
[Code {
<i>-- Create, then immediately drop, an FTS3 table.</i>
CREATE VIRTUAL TABLE data USING fts3();
DROP TABLE data;
}]
[h2 "Populating FTS3 Tables"]
<p>
FTS3 tables are populated using \[INSERT\], \[UPDATE\] and \[DELETE\]
statements in the same way as ordinary SQLite tables are.
<p>
As well as the columns named by the user (or the "content" column if no
module arguments where specified as part of the \[CREATE VIRTUAL TABLE\]
statement), each FTS3 table has a "rowid" column. The rowid of an FTS3
table behaves in the same way as the rowid column of an ordinary SQLite
table, except that the values stored in the rowid column of an FTS3 table
remain unchanged if the database is rebuilt using the \[VACUUM\] command.
For FTS3 tables, "docid" is allowed as an alias along with the usual "rowid",
"oid" and "_oid_" identifiers. Attempting to insert or update a row with a
docid value that already exists in the table is an error, just as it would
be with an ordinary SQLite table.
<p>
There is one other subtle difference between "docid" and the normal SQLite
aliases for the rowid column. Normally, if an INSERT or UPDATE statement
assigns discreet values to two or more aliases of the rowid column, SQLite
writes the rightmost of such values specified in the INSERT or UPDATE
statement to the database. However, assigning a non-NULL value to both
the "docid" and one or more of the SQLite rowid aliases when inserting or
updating an FTS3 table is considered an error. See below for an example.
[Code {
<i>-- Create an FTS3 table</i>
CREATE VIRTUAL TABLE pages USING fts3(title, body);
<i>-- Insert a row with a specific docid value.</i>
INSERT INTO pages(docid, title, body) VALUES(53, 'Home Page', 'SQLite is a software...');
<i>-- Insert a row and allow FTS3 to assign a docid value using the same algorithm as</i>
<i>-- SQLite uses for ordinary tables. In this case the new docid will be 54,</i>
<i>-- one greater than the largest docid currently present in the table.</i>
INSERT INTO pages(title, body) VALUES('Download', 'All SQLite source code...');
<i>-- Change the title of the row just inserted.</i>
UPDATE pages SET title = 'Download SQLite' WHERE rowid = 54;
<i>-- Delete the entire table contents.</i>
DELETE FROM pages;
<i>-- The following is an error. It is not possible to assign non-NULL values to both</i>
<i>-- the rowid and docid columns of an FTS3 table.</i>
INSERT INTO pages(rowid, docid, title, body) VALUES(1, 2, 'A title', 'A document body');
}]
<p>
To support full-text queries, FTS3 maintains an inverted index that maps
from each unique term or word that appears in the dataset to the locations
in which it appears within the table contents. For the curious, a
complete description of the \[segment btree|data structure\] used to store
this index within the database file is described below. A feature of
this data structure is that at any time the database may contain not
one index b-tree, but several different b-trees that are incrementally
merged as rows are inserted, updated and deleted. This technique improves
performance when writing to an FTS3 table, but causes some overhead for
full-text queries that use the index. Executing an SQL statement of the
form "INSERT INTO <fts3-table>(<fts3-table>) VALUES('optimize')"
causes FTS3 to merge all existing index b-trees into a single large
b-tree containing the entire index. This can be an expensive operation,
but may speed up future queries.
<p>
For example, to optimize the full-text index for an FTS3 table named
"docs":
[Code {
<i>-- Optimize the internal structure of FTS3 table "docs".</i>
INSERT INTO docs(docs) VALUES('optimize');
}]
<p>
The statement above may appear syntacticly incorrect to some. Refer to
the section describing the \[simple fts3 queries\] for an explanation.
<p>
There is another, deprecated, method for invoking the optimize
operation using a SELECT statement. New code should use statements
similar to the INSERT above to optimize FTS3 structures.
[h2 "Querying FTS3 Tables" {} {simple fts3 queries}]
<p>
As for all other SQLite tables, virtual or otherwise, data is retrieved
from FTS3 tables using a \[SELECT\] statement.
<p>
FTS3 tables can be queried efficiently using SELECT statements of two
different forms:
<ul>
<li><p>
<b>Query by rowid</b>. If the WHERE clause of the SELECT statement
contains a sub-clause of the form "rowid = ?", where ? is an SQL expression,
FTS3 is able to retreive the requested row directly using the equivalent
of an SQLite \[INTEGER PRIMARY KEY\] index.
<li><p>
<b>Full-text query</b>. If the WHERE clause of the SELECT statement contains
a sub-clause of the form "<column> MATCH ?", FTS3 is able to use
the built-in full-text index to restrict the search to those documents
that match the full-text query string specified as the right-hand operand
of the MATCH clause.
</ul>
<p>
If neither of the two query strategies enumerated above can be used, all
queries on FTS3 tables are implemented using a linear scan of the entire
table. If the table contains large amounts of data, this may be an
impractically approach (the first example on this page shows that a linear
scan of 1.5 GB of data takes around 30 seconds using a modern PC).
[Code {
<i>-- The examples in this block assume the following FTS3 table:</i>
CREATE VIRTUAL TABLE mail USING fts3(subject, body);
SELECT * FROM mail WHERE rowid = 15; <i>-- Fast. Rowid lookup.</i>
SELECT * FROM mail WHERE body MATCH 'sqlite'; <i>-- Fast. Full-text query.</i>
SELECT * FROM mail WHERE mail MATCH 'search'; <i>-- Fast. Full-text query.</i>
SELECT * FROM mail WHERE rowid BETWEEN 15 AND 20; <i>-- Slow. Linear scan.</i>
SELECT * FROM mail WHERE subject = 'database'; <i>-- Slow. Linear scan.</i>
SELECT * FROM mail WHERE subject MATCH 'database'; <i>-- Fast. Full-text query.</i>
}]
<p>
In all of the full-text queries above, the right-hand operand of the MATCH
operator is a string consisting of a single term. In this case, the MATCH
expression evaluates to true for all documents that contain one or more
instances of the specified word ("sqlite", "search" or "database", depending
on which example you look at). Specifying a single term as the right-hand
operand of the MATCH operator results in the simplest (and most common) type
of full-text query possible. However more complicated queries are possible,
including phrase searches, term-prefix searches and searches for documents
containing combinations of terms occuring within a defined proximity of each
other. The various ways in which the full-text index may be queried are
\[FTS3 MATCH|described below\].
<p>
Normally, full-text queries are case-insensitive. However, this is
is dependent on the specific \[tokenizer\] used by the FTS3 table
being queried. Refer to the section on \[tokenizer|tokenizers\] for details.
<p>
The paragraph above notes that a MATCH operator with a simple term as the
right-hand operand evaluates to true for all documents that contain the
specified term. In this context, the "document" may refer to either the
data stored in a single column of a row of an FTS3 table, or to the contents
of all columns in a single row, depending on the identifier used as the
left-hand operand to the MATCH operator. If the identifier specified as
the left-hand operand of the MATCH operator is an FTS3 table column name,
then the document that the search term must be contained in is the value
stored in the specified column. However, if the identifier is the name
of the FTS3 <i>table</i> itself, then the MATCH operator evaluates to true
for each row of the FTS3 table for which any column contains the search
term. The following example demonstrates this:
[Code {
<i>-- Example schema</i>
CREATE VIRTUAL TABLE mail USING fts3(subject, body);
<i>-- Example table population</i>
INSERT INTO mail(docid, subject, body) VALUES(1, 'software feedback', 'found it too slow');
INSERT INTO mail(docid, subject, body) VALUES(2, 'software feedback', 'no feedback');
INSERT INTO mail(docid, subject, body) VALUES(3, 'slow lunch order', 'was a software problem');
<i>-- Example queries</i>
SELECT * FROM mail WHERE subject MATCH 'software'; <i>-- Selects rows 1 and 2</i>
SELECT * FROM mail WHERE body MATCH 'feedback'; <i>-- Selects row 2</i>
SELECT * FROM mail WHERE mail MATCH 'software'; <i>-- Selects rows 1, 2 and 3</i>
SELECT * FROM mail WHERE mail MATCH 'slow'; <i>-- Selects rows 1 and 3</i>
}]
<p>
At first glance, the final two full-text queries in the example above seem
to be syntacticly incorrect, as there is a table name ("mail") used as
an SQL expression. The reason this is acceptable is that each FTS3 table
actually has a \[sqlite3_declare_vtab|HIDDEN\] column with the same name
as the table itself (in this case, "mail"). The value stored in this
column is not meaningful to the application, but can be used as the
left-hand operand to a MATCH operator. This special column may also be
passed as an argument to the \[snippet()|FTS3 auxillary functions\].
<p>
The following example illustrates the above. The expressions "docs",
"docs.docs" and "main.docs.docs" all refer to column "docs". However, the
expression "main.docs" does not refer to any column. It could be used to
refer to a table, but a table name is not allowed in the context in which
it is used below.
[Code {
<i>-- Example schema</i>
CREATE VIRTUAL TABLE docs USING fts3(content);
<i>-- Example queries</i>
SELECT * FROM docs WHERE docs MATCH 'sqlite'; <i>-- OK.</i>
SELECT * FROM docs WHERE docs.docs MATCH 'sqlite'; <i>-- OK.</i>
SELECT * FROM docs WHERE main.docs.docs MATCH 'sqlite'; <i>-- OK.</i>
SELECT * FROM docs WHERE main.docs MATCH 'sqlite'; <i>-- Error.</i>
}]
[h2 "Summary"]
<p>
From the users point of view, FTS3 tables are similar to ordinary SQLite
tables in many ways. Data may be added to, modified within and removed
from FTS3 tables using the INSERT, UPDATE and DELETE commands just as
it may be with ordinary tables. Similarly, the SELECT command may be used
to query data. The following list summarizes the differences between FTS3
and ordinary tables:
<ol>
<li><p>
As with all virtual table types, it is not possible to create indices or
triggers attached to FTS3 tables. Nor is it possible to use the ALTER TABLE
command to add extra columns to FTS3 tables (although it is possible to use
ALTER TABLE to rename an FTS3 table).
<li><p>
Data-types specified as part of the "CREATE VIRTUAL TABLE" statement
used to create an FTS3 table are ignored completely. Instead of the
normal rules for applying type \[affinity\] to inserted values, all
values inserted into FTS3 table columns (except the special rowid
column) are converted to type TEXT before being stored.
<li><p>
FTS3 tables permit the special alias "docid" to be used to refer to the
rowid column supported by all \[virtual tables\].
<li><p>
The \[FTS3 MATCH\] operator is supported for queries based on the built-in
full-text index.
<li><p>
The FTS3 auxillary functions, \[snippet|snippet() and offsets()\], are
available to support full-text queries.
<li><p>
Each FTS3 table has a \[sqlite3_declare_vtab()|HIDDEN column\] with the
same name as the table itself. The value contained in each row for the
special column is only useful when used on the left-hand side of a
\[FTS3 MATCH|MATCH\] operator, or when specified as an argument to one
of the \[snippet|FTS3 auxillary functions\].
</ol>
[h1 "Compiling and Enabling FTS3" {} {compile fts3}]
<p>
Although FTS3 is distributed as part of the SQLite source code, it is not
enabled by default. To build SQLite with FTS3 functionality enabled, define
the preprocessor macro \[SQLITE_ENABLE_FTS3\] when compiling. New applications
should also define the \[SQLITE_ENABLE_FTS3_PARENTHESIS\] macro to enable the
\[enhanced query syntax\] (see below). Usually, this is done by adding the
following two switches to the compiler command line:
[Code {
-DSQLITE_ENABLE_FTS3
-DSQLITE_ENABLE_FTS3_PARENTHESIS
}]
<p>
If using the amalgamation autoconf based build system, setting the CPPFLAGS
environment variable while running the 'configure' script is an easy
way to set these macros. For example, the following command:
[Code {
CPPFLAGS="-DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS" ./configure <configure options>
}]
<p>
where <i><configure options></i> are those options normally passed to
the configure script, if any.
<p>
Because FTS3 is a virtual table, it is incompatible with the
\[SQLITE_OMIT_VIRTUALTABLE\] option.
<p>
If an SQLite build does not include FTS3, then any attempt to prepare an
SQL statement to create an FTS3 table or to drop or access an existing
FTS3 table in any way will fail. The error message returned will be similar
to "no such module: fts3".
<p>
If the C version of the <a href=http://site.icu-project.org/>ICU library</a>
is available, then FTS3 may also be compiled with the SQLITE_ENABLE_ICU
pre-processor macro defined. Compiling with this macro enables an FTS3
\[tokenizer\] that uses the ICU library to split a document into terms
(words) using the conventions for a specified language and locale.
[Code {
-DSQLITE_ENABLE_ICU
}]
[h1 "Full-text Index Queries" {} {FTS3 MATCH}]
<p>
The most useful thing about FTS3 tables is the queries that may be
performed using the built-in full-text index. Full-text queries are
performed by specifying a clause of the form
"<column> MATCH <full-text query expression>" to the WHERE
clause of a SELECT statement that reads data from an FTS3 table.
\[simple fts3 queries|Simple FTS3 queries\] that return all documents that
contain a given term are described above. In that discussion the right-hand
operand of the MATCH operator was assumed to be a string consisting of a
single term. This section describes the more complex query types supported
by FTS3 tables, and how they may be utilized by specifying a more
complex query expression as the right-hand operand of a MATCH operator.
<p>
FTS3 tables support three basic query types:
<ul>
<li><p><b>Token or token prefix queries</b>.
An FTS3 table may be queried for all documents that contain a specified
term (the \[simple fts3 queries|simple case\] described above), or for
all documents that contain a term with a specified prefix. As we have
seen, the query expression for a specific term is simply the term itself.
The query expression used to search for a term prefix is the prefix
itself with a '*' character appended to it. For example:
</ul>
[Code {
<i>-- Virtual table declaration</i>
CREATE VIRTUAL TABLE docs USING fts3(title, body);
<i>-- Query for all documents containing the term "linux":</i>
SELECT * FROM docs WHERE docs MATCH 'linux';
<i>-- Query for all documents containing a term with the prefix "lin". This will match</i>
<i>-- all documents that contain "linux", but also those that contain terms "linear",</i>
<i>--"linker", "linguistic" and so on.</i>
SELECT * FROM docs WHERE docs MATCH 'lin*';
}]
<ul>
<li style="list-style:none"><p>
Normally, a token or token prefix query is matched against the FTS3 table
column specified as the right-hand side of the MATCH operator. Or, if the
special column with the same name as the FTS3 table itself is specified,
against all columns. This may be overridden by specifying a column-name
followed by a ":" character before a basic term query. There may be space
between the ":" and the term to query for, but not between the column-name
and the ":" character. For example:
</ul>
[Code {
<i>-- Query the database for documents for which the term "linux" appears in</i>
<i>-- the document title, and the term "problems" appears in either the title</i>
<i>-- or body of the document.</i>
SELECT * FROM docs WHERE docs MATCH 'title:linux problems';
<i>-- Query the database for documents for which the term "linux" appears in</i>
<i>-- the document title, and the term "driver" appears in the body of the document</i>
<i>-- ("driver" may also appear in the title, but this alone will not satisfy the</i>.
<i>-- query criteria).</i>
SELECT * FROM docs WHERE body MATCH 'title:linux driver';
}]
<ul>
<li><p><b>Phrase queries</b>.
A phrase query is a query that retrieves all documents that contain a
nominated set of terms or term prefixes in a specified order with no
intervening tokens. Phrase queries are specified by enclosing a space
separated sequence of terms or term prefixes in double quotes (").
For example:
</ul>
[Code {
<i>-- Query for all documents that contain the phrase "linux applications".</i>
SELECT * FROM docs WHERE docs MATCH '"linux applications"';
<i>-- Query for all documents that contain a phrase that matches "lin* app*". As well as</i>
<i>-- "linux applications", this will match common phrases such as "linoleum appliances"</i>
<i>-- or "link apprentice".</i>
SELECT * FROM docs WHERE docs MATCH '"lin* app*"';
}]
<ul>
<li><p><b>NEAR queries</b>.
A NEAR query is a query that returns documents that contain a two or
more nominated terms or phrases within a specified proximity of each
other (by default with 10 or less intervening terms). A NEAR query is
specified by putting the keyword "NEAR" between two phrase, term or
term prefix queries. To specify a proximity other than the default,
an operator of the form "NEAR/<i><N></i>" may be used, where
<i><N></i> is the maximum number of intervening terms allowed.
For example:
</ul>
[Code {
<i>-- Virtual table declaration.</i>
CREATE VIRTUAL TABLE docs USING fts3();
<i>-- Virtual table data.</i>
INSERT INTO docs VALUES('SQLite is an ACID compliant embedded relational database management system');
<i>-- Search for a document that contains the terms "sqlite" and "database" with</i>
<i>-- not more than 10 intervening terms. This matches the only document in</i>
<i>-- table docs (since there are only six terms between "SQLite" and "database"</i>
<i>-- in the document)</i>.
SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR database';
<i>-- Search for a document that contains the terms "sqlite" and "database" with</i>
<i>-- not more than 6 intervening terms. This also matches the only document in</i>
<i>-- table docs. Note that the order in which the terms appear in the document</i>
<i>-- does not have to be the same as the order in which they appear in the query.</i>
SELECT * FROM docs WHERE docs MATCH 'database NEAR/6 sqlite';
<i>-- Search for a document that contains the terms "sqlite" and "database" with</i>
<i>-- not more than 5 intervening terms. This query matches no documents.</i>
SELECT * FROM docs WHERE docs MATCH 'database NEAR/5 sqlite';
<i>-- Search for a document that contains the phrase "ACID compliant" and the term</i>
<i>-- "database" with not more than 2 terms separating the two. This matches the</i>
<i>-- document stored in table docs.</i>
SELECT * FROM docs WHERE docs MATCH 'database NEAR/2 "ACID compliant"';
<i>-- Search for a document that contains the phrase "ACID compliant" and the term</i>
<i>-- "sqlite" with not more than 2 terms separating the two. This also matches</i>
<i>-- the only document stored in table docs.</i>
SELECT * FROM docs WHERE docs MATCH '"ACID compliant" NEAR/2 sqlite';
}]
<ul>
<li style="list-style: none"><p>
More than one NEAR operator may appear in a single query. In this case each
pair of terms or phrases separated by a NEAR operator must appear within the
specified proximity of each other in the document. Using the same table and
data as in the block of examples above:
</ul>
[Code { <i>-- The following query selects documents that contains an instance of the term </i>
<i>-- "sqlite" separated by two or fewer terms from an instance of the term "acid",</i>
<i>-- which is in turn separated by two or fewer terms from an instance of the term</i>
<i>-- "relational". As it happens, the only document in table docs satisfies this criteria.</i>
SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR/2 acid NEAR/2 relational';
<i>-- This query matches no documents. There is an instance of the term "sqlite" with</i>
<i>-- sufficient proximity to an instance of "acid" but it is not sufficiently close</i>
<i>-- to an instance of the term "relational".</i>
SELECT * FROM docs WHERE docs MATCH 'acid NEAR/2 sqlite NEAR/2 relational';
}]
<p>
Phrase and NEAR queries may not span multiple columns within a row.
<p>
The three basic query types described above may be used to query the full-text
index for the set of documents that match the specified criteria. Using the
FTS3 query expression language it is possible to perform various set
operations on the results of basic queries. There are currently three
supported operations:
<ul>
<li> The AND operator determines the <b>intersection</b> of two sets of documents.
<li> The OR operator calculates the <b>union</b> of two sets of documents.
<li> The NOT operator (or, if using the standard syntax, a unary "-" operator)
may be used to compute the <b>relative complement</b> of one set of
documents with respect to another.
</ul>
<p>
The FTS3 module may be compiled to use one of two slightly different versions
of the full-text query syntax, the "standard" query syntax and the "enhanced"
query syntax. The basic term, term-prefix, phrase and NEAR queries described
above are the same in both versions of the syntax. The way in which set
operations are specified is slightly different. The following two sub-sections
describe the part of the two query syntaxes that pertains to set operations.
Refer to the description of how to \[compile fts3\] for compilation notes.
[h2 "Set Operations Using The Enhanced Query Syntax" {} {enhanced query syntax}]
<p>
The enhanced query syntax supports the AND, OR and NOT binary set operators.
Each of the two operands to an operator may be a basic FTS3 query, or the
result of another AND, OR or NOT set operation. Operators must be entered
using capital letters. Otherwise, they are interpreted as basic term queries
instead of set operators.
<p>
The AND operator may be implicitly specified. If two basic queries appear
with no operator separating them in an FTS3 query string, the results are
the same as if the two basic queries were separated by an AND operator.
For example, the query expression "implicit operator" is a more succinct
version of "implicit AND operator".
[Code {
<i>-- Virtual table declaration</i>
CREATE VIRTUAL TABLE docs USING fts3();
<i>-- Virtual table data</i>
INSERT INTO docs(docid, content) VALUES(1, 'a database is a software system');
INSERT INTO docs(docid, content) VALUES(2, 'sqlite is a software system');
INSERT INTO docs(docid, content) VALUES(3, 'sqlite is a database');
<i>-- Return the set of documents that contain the term "sqlite", and the</i>
<i>-- term "database". This query will return the document with docid 3 only.</i>
SELECT * FROM docs WHERE docs MATCH 'sqlite AND database';
<i>-- Again, return the set of documents that contain both "sqlite" and</i>
<i>-- "database". This time, use an implicit AND operator. Again, document</i>
<i>-- 3 is the only document matched by this query. </i>
SELECT * FROM docs WHERE docs MATCH 'database sqlite';
<i>-- Query for the set of documents that contains either "sqlite" or "database".</i>
<i>-- All three documents in the database are matched by this query.</i>
SELECT * FROM docs WHERE docs MATCH 'sqlite OR database';
<i>-- Query for all documents that contain the term "database", but do not contain</i>
<i>-- the term "sqlite". Document 1 is the only document that matches this criteria.</i>
SELECT * FROM docs WHERE docs MATCH 'database NOT sqlite';
<i>-- The following query matches no documents. Because "and" is in lowercase letters,</i>
<i>-- it is interpreted as a basic term query instead of an operator. Operators must</i>
<i>-- be specified using capital letters. In practice, this query will match any documents</i>
<i>-- that contain each of the three terms "database", "and" and "sqlite" at least once.</i>
<i>-- No documents in the example data above match this criteria.</i>
SELECT * FROM docs WHERE docs MATCH 'database and sqlite';
}]
<p>
The examples above all use basic full-text term queries as both operands of
the set operations demonstrated. Phrase and NEAR queries may also be used,
as may the results of other set operations. When more than one set operation
is present in an FTS3 query, the precedence of operators is as follows:
[Table]
[Tr]<th>Operator<th>Enhanced Query Syntax Precedence
[Tr]<td>NOT <td> Highest precedence (tightest grouping).
[Tr]<td>AND <td>
[Tr]<td>OR <td> Lowest precedence (loosest grouping).
</table>
<p>
When using the enhanced query syntax, parenthesis may be used to override
the default precedence of the various operators. For example:
[Code {
<i>-- Return the docid values associated with all documents that contain the</i>
<i>-- two terms "sqlite" and "database", and/or contain the term "library".</i>
SELECT docid FROM docs WHERE docs MATCH 'sqlite AND database OR library';
<i>-- This query is equivalent to the above.</i>
SELECT docid FROM docs WHERE docs MATCH 'sqlite AND database'
UNION
SELECT docid FROM docs WHERE docs MATCH 'library';
<i>-- Query for the set of documents that contains the term "linux", and at least</i>
<i>-- one of the phrases "sqlite database" and "sqlite library".</i>
SELECT docid FROM docs WHERE docs MATCH '("sqlite database" OR "sqlite library") AND linux';
<i>-- This query is equivalent to the above.</i>
SELECT docid FROM docs WHERE docs MATCH 'linux'
INTERSECT
SELECT docid FROM (
SELECT docid FROM docs WHERE docs MATCH '"sqlite library"'
UNION
SELECT docid FROM docs WHERE docs MATCH '"sqlite database"'
);
}]
[h2 "Set Operations Using The Standard Query Syntax"]
<p>
FTS3 query set operations using the standard query syntax are similar, but
not identical, to set operations with the enhanced query syntax. There
are four differences, as follows:
<ol>
<li value=1><p> Only the implicit version of the AND operator is supported.
Specifying the string "AND" as part of an standard query syntax query is
interpreted as a term query for the set of documents containing the term
"and".
</ol>
<ol>
<li value=2><p> Parenthesis are not supported.
</ol>
<ol>
<li value=3><p> The NOT operator is not supported. Instead of the NOT
operator, the standard query syntax supports a unary "-" operator that
may be applied to basic term and term-prefix queries (but not to phrase
or NEAR queries). A term or term-prefix that has a unary "-" operator
attached to it may not appear as an operand to an OR operator. An FTS3
query may not consist entirely of terms or term-prefix queries with unary
"-" operators attached to them.
</ol>
[Code {
<i>-- Search for the set of documents that contain the term "sqlite" but do</i>
<i>-- not contain the term "database".</i>
SELECT * FROM docs WHERE docs MATCH 'sqlite -database';
}]
<ol>
<li value=4><p> The relative precedence of the set operations is different.
In particular, using the standard query syntax the "OR" operator has a
higher precedence than "AND". The precedence of operators when using the
standard query syntax is:
</ol>
[Table]
[Tr]<th>Operator<th>Standard Query Syntax Precedence
[Tr]<td>Unary "-" <td> Highest precedence (tightest grouping).
[Tr]<td>OR <td>
[Tr]<td>AND <td> Lowest precedence (loosest grouping).
</table>
<ol><li style="list-style:none">
The following example illustrates precedence of operators using the standard
query syntax:
</ol>
[Code {
<i>-- Search for documents that contains at least one of the terms "database"</i>
<i>-- and "sqlite", and also contains the term "library". Because of the differences</i>
<i>-- in operator precedences, this query would have a different interpretation using</i>
<i>-- the enhanced query syntax.</i>
SELECT * FROM docs WHERE docs MATCH 'sqlite OR database library';
}]
[h1 "Auxillary functions - Snippets and Offsets" {} snippet offsets]
<p>
The FTS3 module provides two special SQL scalar functions that may be useful
to the developers of full-text query systems, "snippet" and "offsets". The
purpose of both functions is to allow the user to identify the location of
queried terms in the returned documents.
<p>
The first argument to both the snippet and offsets SQL scalar functions
must be the the special hidden column of an FTS3 table that has the same
name as the table (see above). For example, given an FTS3 table named
"mail":
[Code {
SELECT offsets(mail) FROM mail WHERE mail MATCH <full-text query expression>;
SELECT snippet(mail) FROM mail WHERE mail MATCH <full-text query expression>;
}]
<p>
The two auxillary functions are only useful within a SELECT statement that
uses the FTS3 table's full-text index. If used within a SELECT that uses
the "query by rowid" or "linear scan" strategies, both functions return
an empty string.
<p>
For a SELECT query that uses the full-text index, the offsets() function
returns a text value containing a series of space-separated integers. For
each occurence of a queried term in the document, there are four integers
in the returned list. Each set of four integers is interpreted as
follows:
[Table]
[Tr]<th>Integer <th>Interpretation
[Tr]<td>0
<td>The column number that the term instance occurs in (0 for the
leftmost column of the FTS3 table, 1 for the next leftmost, etc.).
[Tr]<td>1
<td>The term number of the matching term within the full-text query
expression. Terms within a query expression are numbered starting
from 0 in the order that they occur.
[Tr]<td>2
<td>The byte offset of the matching term within the column.
[Tr]<td>3
<td>The size of the matching term in bytes.
</table>
<p>
The following block contains examples that use the offsets function.
[Code {
CREATE VIRTUAL TABLE mail USING fts3(subject, body);
INSERT INTO mail VALUES('hello world', 'This message is a hello world message.');
INSERT INTO mail VALUES('urgent: serious', 'This mail is seen as a more serious mail');
<i>-- The following query returns a single row (as it matches only the first</i>
<i>-- entry in table "mail". The text returned by the offsets function is</i>
<i>-- "0 0 6 5 1 0 24 5".</i>
<i>--</i>
<i>-- The first set of four integers in the result indicate that column 0</i>
<i>-- contains an instance of term 0 ("world") at byte offset 6. The term instance</i>
<i>-- is 5 bytes in size. The second set of four integers shows that column 1</i>
<i>-- of the matched row contains an instance of term 0 ("world") at byte offset</i>
<i>-- 24. Again, the term instance is 5 bytes in size.</i>
SELECT offsets(mail) FROM mail WHERE mail MATCH 'world';
<i>-- The following query returns also matches only the first row in table "mail".</i>
<i>-- In this case the returned text is "1 0 5 7 1 0 30 7".</i>
SELECT offsets(mail) FROM mail WHERE mail MATCH 'message';
<i>-- The following query matches the second row in table "mail". It returns the</i>
<i>-- text "1 0 28 7 1 1 36 4". Only those occurences of terms "serious" and "mail"</i>
<i>-- that are part of an instance of the phrase "serious mail" are identified; the</i>
<i>-- other occurences of "serious" and "mail" are ignored.</i>
SELECT offsets(mail) FROM mail WHERE mail MATCH '"serious mail"';
}]
<p>
The snippet function is used to create formatted fragments of document text
for display as part of a full-text query results report. The snippet function
may be passed between one and four arguments, as follows:
[Table]
[Tr]<th>Argument <th>Default Value <th>Description
[Tr]<td>0 <td>N/A
<td> The first argument to the snippet function must always be the special
hidden column of the FTS3 table that takes the same name as the table
itself.
[Tr]<td>1 <td>"<b>"
<td> The "start match" text.
[Tr]<td>2 <td>"<b>"
<td> The "end match" text.
[Tr]<td>3 <td>"<b>...</b>"
<td> The "ellipses" text.
</table>
<p>
The snippet function returns a fragment of text from the original document
surrounding the term identified by the first four integers that would be
returned by the offsets function if it were used in a similar context. In
most cases, the selected fragment contains 40 or less bytes of text before
the identified term, and 40 or more bytes of text following the identified
term. Slightly less than 40 bytes of preceding or following text is provided
so that the fragment does not contain any partial terms. If the first term
(that would be) identified by the offsets function is less than 40 bytes
from the beginning or end of the document, then extra text may appear before
or after the identified term within the fragment to make up the difference.
<p>
If the returned fragment of text does not start at the start of the entire
document, then the "ellipses" text (see table above) is prepended to the
fragment before it is returned. Similarly, if the end of the returned fragment
is not also the end of the entire document, the "ellipses" text is appended
to it before it is returned.
<p>
Before it is returned, the "start match" text is inserted into the fragment
immediately before any terms within the fragment that would have been
identified by the offsets function (not just the first one) were it invoked
in the same context. The "end match" is inserted immediately following all
such terms.
[Code {
<b>Note: In this block of examples, newlines and whitespace characters have
been inserted into the document inserted into the FTS3 table, and the expected
results described in SQL comments. This is done to enhance readability only,
they would not be present in actual SQLite commands or output.</b>
<i>-- Create and populate an FTS3 table.</i>
CREATE VIRTUAL TABLE text USING fts3();
INSERT INTO text VALUES('
During 30 Nov-1 Dec, 2-3oC drops. Cool in the upper portion, minimum temperature 14-16oC
and cool elsewhere, minimum temperature 17-20oC. Cold to very cold on mountaintops,
minimum temperature 6-12oC. Northeasterly winds 15-30 km/hr. After that, temperature
increases. Northeasterly winds 15-30 km/hr.
');
<i>-- The following query returns the text value:</i>
<i>--</i>
<i>-- "<b>...</b> elsewhere, minimum temperature 17-20oC. <b>Cold</b> to very <b>cold</b> on</i>
<i>-- mountaintops, minimum <b>...</b>".</i>
<i>--</i>
SELECT snippet(text) FROM text WHERE text MATCH 'cold';
<i>-- The following query returns the text value:</i>
<i>--</i>
<i>-- "... 2-3oC drops. Cool in the upper portion, [minimum] [temperature] 14-16oC and cool</i>
<i>-- elsewhere, [minimum] ..."</i>
<i>--</i>
SELECT snippet(text, '[ ']', '...') FROM text WHERE text MATCH '"min* tem*"'
}]
[h1 "Tokenizers" tokenizer {tokenizer}]
<p>
An FTS3 tokenizer is a set of rules for extracting terms from a document
or basic FTS3 full-text query.
<p>
Unless a specific tokenizer is specified as part of the CREATE
VIRTUAL TABLE statement used to create the FTS3 table, the default
tokenizer, "simple", is used. The simple tokenizer extracts tokens from
a document or basic FTS3 full-text query according to the following
rules:
<ul>
<li><p> A term is a contiguous sequence of eligible characters, where
eligible characters are all alphanumeric characters, the "_" character,
and all characters with UTF codepoints greater than or equal to 128.
All other characters are discarded when splitting a document into terms.
They serve only to separate adjacent terms.
<li><p> All uppercase characters within the ASCII range (UTF codepoints less
than 128), are transformed to their lowercase equivalents as part of the
tokenization process. Thus, full-text queries are case-insensitive when
using the simple tokenizer.
</ul>
<p>
For example, when a document containing the text "Right now, they're very
frustrated.", the terms extracted from the document and added to the
full-text index are, in order, "right now they re very frustrated". Such
a document would match a full-text query such as "MATCH 'Frustrated'",
as the simple tokenizer transforms the term in the query to lowercase
before searching the full-text index.
<p>
As well as the "simple" tokenizer, the FTS3 source code features a tokenizer
that uses the <a href="http://tartarus.org/~martin/PorterStemmer/">Porter
Stemming algorithm</a>. This tokenizer uses the same rules to separate
the input document into terms, but as well as folding all terms to lower
case it uses the Porter Stemming algorithm to reduce related English language
words to a common root. For example, using the same input document as in the
paragraph above, the porter tokenizer extracts the following tokens:
"right now thei veri frustrat". Even though some of these terms are not even
English words, in some cases using them to build the full-text index is more
useful than the more intelligible output produced by the simple tokenizer.
Using the porter tokenizer, the document not only matches full-text queries
such as "MATCH 'Frustrated'", but also queries such as "MATCH 'Frustration'",
as the term "Frustration" is reduced by the Porter stemmer algorithm to
"frustrat" - just as "Frustrated" is. So, when using the porter tokenizer,
FTS3 is able to find not just exact matches for queried terms, but matches
against similar English language terms. For more information on the
Porter Stemmer algorithm, please refer to the page linked above.
<p>
Example illustrating the difference between the "simple" and "porter"
tokenizers:
[Code {
<i>-- Create a table using the simple tokenizer. Insert a document into it.</i>
CREATE VIRTUAL TABLE simple USING fts3(tokenize=simple);
INSERT INTO simple VALUES('Right now they''re very frustrated');
<i>-- The first of the following two queries matches the document stored in</i>
<i>-- table "simple". The second does not.</i>
SELECT * FROM simple WHERE simple MATCH 'Frustrated');
SELECT * FROM simple WHERE simple MATCH 'Frustration');
<i>-- Create a table using the porter tokenizer. Insert the same document into it</i>
CREATE VIRTUAL TABLE porter USING fts3(tokenize=porter);
INSERT INTO porter VALUES('Right now they''re very frustrated');
<i>-- Both of the following queries match the document stored in table "porter".</i>
SELECT * FROM porter WHERE porter MATCH 'Frustrated');
SELECT * FROM porter WHERE porter MATCH 'Frustration');
}]
<p>
If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor
symbol defined, then there exists a built-in tokenizer named "icu"
implemented using the ICU library. The first argument passed to the
xCreate() method (see fts3_tokenizer.h) of this tokenizer may be
an ICU locale identifier. For example "tr_TR" for Turkish as used
in Turkey, or "en_AU" for English as used in Australia. For example:
[Code {
CREATE VIRTUAL TABLE thai_text USING fts3(text, tokenize=icu th_TH)
}]
<p>
The ICU tokenizer implementation is very simple. It splits the input
text according to the ICU rules for finding word boundaries and discards
any tokens that consist entirely of white-space. This may be suitable
for some applications in some locales, but not all. If more complex
processing is required, for example to implement stemming or
discard punctuation, this can be done by creating a tokenizer
implementation that uses the ICU tokenizer as part of its implementation.
[h2 "Custom (User Implemented) Tokenizers"]
<p>
As well as the built-in "simple", "porter" and (possibly) "icu" tokenizers,
FTS3 exports an interface that allows users to implement custom tokenizers
using C. The interface used to create a new tokenizer is defined and
described in the fts3_tokenizer.h source file.
<p>
Registering a new FTS3 tokenizer is similar to registering a new
virtual table module with SQLite. The user passes a pointer to a
structure containing pointers to various callback functions that
make up the implementation of the new tokenizer type. For tokenizers,
the structure (defined in fts3_tokenizer.h) is called
"sqlite3_tokenizer_module".
<p>
FTS3 does not expose a C-function that users call to register new
tokenizer types with a database handle. Instead, the pointer must
be encoded as an SQL blob value and passed to FTS3 through the SQL
engine by evaluating a special scalar function, "fts3_tokenizer()".
The fts3_tokenizer() function may be called with one or two arguments,
as follows:
[Code {
SELECT fts3_tokenizer(<tokenizer-name>);
SELECT fts3_tokenizer(<tokenizer-name>, <sqlite3_tokenizer_module ptr>);
}]
<p>
Where <tokenizer-name> is a string identifying the tokenizer and
<sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module
structure encoded as an SQL blob. If the second argument is present,
it is registered as tokenizer <tokenizer-name> and a copy of it
returned. If only one argument is passed, a pointer to the tokenizer
implementation currently registered as <tokenizer-name> is returned,
encoded as a blob. Or, if no such tokenizer exists, an SQL exception
(error) is raised.
<p>
<b>SECURITY WARNING</b>: If the fts3 extension is used in an environment
where potentially malicious users may execute arbitrary SQL, they should
be prevented from invoking the fts3_tokenizer() function, possibly using
the \[sqlite3_set_authorizer()|authorisation callback\].
<p>
The following block contains an example of calling the fts3_tokenizer()
function from C code:
[Code {
<i>/*
** Register a tokenizer implementation with FTS3.
*/</i>
int registerTokenizer(
sqlite3 *db,
char *zName,
const sqlite3_tokenizer_module *p
){
int rc;
sqlite3_stmt *pStmt;
const char *zSql = "SELECT fts3_tokenizer(?, ?)";
rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
if( rc!=SQLITE_OK ){
return rc;
}
sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC);
sqlite3_step(pStmt);
return sqlite3_finalize(pStmt);
}
<i>/*
** Query FTS3 for the tokenizer implementation named zName.
*/</i>
int queryTokenizer(
sqlite3 *db,
char *zName,
const sqlite3_tokenizer_module **pp
){
int rc;
sqlite3_stmt *pStmt;
const char *zSql = "SELECT fts3_tokenizer(?)";
*pp = 0;
rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
if( rc!=SQLITE_OK ){
return rc;
}
sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
if( SQLITE_ROW==sqlite3_step(pStmt) ){
if( sqlite3_column_type(pStmt, 0)==SQLITE_BLOB ){
memcpy(pp, sqlite3_column_blob(pStmt, 0), sizeof(*pp));
}
}
return sqlite3_finalize(pStmt);
}
}]
[h1 "Data Structures" {} {segment btree}]
<p>
This section describes at a high-level the way the FTS3 module stores its
index and content in the database. It is <b>not necessary to read or
understand the material in this section in order to use FTS3</b> in an
application. However, it may be useful to application developers attempting
to analyze and understand FTS3 performance characteristics, or to developers
contemplating enhancements to the existing FTS3 feature set.
<p>
For each FTS3 virtual table in a database, three real (non-virtual) tables
are created to store the underlying data. The real tables are named "%_content",
"%_segdir" and "%_segments", where "%" is replaced by the name supplied by
the user for the FTS3 virtual table.
<p>
The leftmost column of the "%_content" table is an INTEGER PRIMARY KEY field
named "docid". Following this is one column for each column of the FTS3
virtual table as declared by the user, named by prepending the column name
supplied by the user with "c<i>N</i>", where <i>N</i> is the index of the
column within the table, numbered from left to right starting with 1. Data
types supplied as part of the virtual table declaration are not used as
part of the %_content table declaration. For example:
[Code {
<i>-- Virtual table declaration</i>
CREATE VIRTUAL TABLE abc USING FTS3(a NUMBER, b TEXT, c);
<i>-- Corresponding %_content table declaration</i>
CREATE TABLE abc_content(docid INTEGER PRIMARY KEY, c0a, c1b, c2c);
}]
<p>
The %_content table contains the unadulterated data inserted by the user
into the FTS3 virtual table by the user. If the user does not explicitly
supply a "docid" value when inserting records, one is selected automatically
by the system.
<p>
The two remaining tables, %_segments and %_segdir, are used to store the
full-text index. Conceptually, this index is a lookup table that maps each
term (word) to the set of docid values corresponding to records in the
%_content table that contain one or more occurrences of the term. To
retrieve all documents that contain a specified term, the FTS3 module
queries this index to determine the set of docid values for records that
contain the term, then retrieves the required documents from the %_content
table. Regardless of the schema of the FTS3 virtual table, the %_segments
and %_segdir tables are always created as follows:
[Code {
CREATE TABLE %_segments(
blockid INTEGER PRIMARY KEY, <i>-- B-tree node id</i>
block blob <i>-- B-tree node data</i>
);
CREATE TABLE %_segdir(
level INTEGER,
idx INTEGER,
start_block INTEGER, <i>-- Blockid of first node in %_segments</i>
leaves_end_block INTEGER, <i>-- Blockid of last leaf node in %_segments</i>
end_block INTEGER, <i>-- Blockid of last node in %_segments</i>
root BLOB, <i>-- B-tree root node</i>
PRIMARY KEY(level, idx)
);
}]
<p>
The schema depicted above is not designed to store the full-text index
directly. Instead, it is used to one or more b-tree structures. There
is one b-tree for each row in the %_segdir table. The %_segdir table
row contains the root node and various meta-data associated with the
b-tree structure, and the %_segments table contains all other (non-root)
b-tree nodes. Each b-tree is referred to as a "segment". Once it has
been created, a segment b-tree is never updated (although it may be
deleted altogether).
<p>
The keys used by each segment b-tree are terms (words). As well as the
key, each segment b-tree entry has an associated "doclist" (document list).
A doclist consists of zero or more entries, where each entry consists of:
<ul>
<li> A docid (document id), and
<li> A list of term offsets, one for each occurrence of the term within
the document. A term offset indicates the number of tokens (words)
that occur before the term in question, not the number of characters
or bytes. For example, the term offset of the term "war" in the
phrase "Ancestral voices prophesying war!" is 3.
</ul>
<p>
Entries within a doclist are sorted by docid. Positions within a doclist
entry are stored in ascending order.
<p>
The contents of the logical full-text index is found by merging the
contents of all segment b-trees. If a term is present in more than one
segment b-tree, then it maps to the union of each individual doclist. If,
for a single term, the same docid occurs in more than one doclist, then only
the doclist that is part of the most recently created segment b-tree is
considered valid.
<p>
Multiple b-tree structures are used instead of a single b-tree to reduce
the cost of inserting records into FTS3 tables. When a new record is
inserted into an FTS3 table that already contains a lot of data, it is
likely that many of the terms in the new record are already present in
a large number of existing records. If a single b-tree were used, then
large doclist structures would have to be loaded from the database,
amended to include the new docid and term-offset list, then written back
to the database. Using multiple b-tree tables allows this to be avoided
by creating a new b-tree which can be merged with the existing b-tree
(or b-trees) later on. Merging of b-tree structures can be performed as
a background task, or once a certain number of separate b-tree structures
have been accumulated. Of course, this scheme makes queries more expensive
(as the FTS3 code may have to look up individual terms in more than one
b-tree and merge the results), but it has been found that in practice this
overhead is often negligible.
[h2 "Variable Length Integer (varint) Format"]
<p>
Integer values stored as part of segment b-tree nodes are encoded using the
FTS3 varint format. This encoding is similar, but <b>not identical</b>, to the
the <a href="fileformat.html#varint_format">SQLite varint format</a>.
<p>
An encoded FTS3 varint consumes between one and ten bytes of space. The
number of bytes required is determined by the sign and magnitude of the
integer value encoded. More accurately, the number of bytes used to store
the encoded integer depends on the position of the most significant set bit
in the 64-bit twos-compliment representation of the integer value. Negative
values always have the most significant bit set (the sign bit), and so are
always stored using the full ten bytes. Positive integer values may be
stored using less space.
<p>
The final byte of an encoded FTS3 varint has its most significant bit
cleared. All preceding bytes have the most significant bit set. Data
is stored in the remaining seven least signficant bits of each byte.
The first byte of the encoded representation contains the least significant
seven bits of the encoded integer value. The second byte of the encoded
representation, if it is present, contains the seven next least significant
bits of the integer value, and so on. The following table contains examples
of encoded integer values:
[Table]
[Tr]<th>Decimal<th>Hexadecimal<th width=100%>Encoded Representation
[Tr]<td>43<td>0x000000000000002B<td>0x2B
[Tr]<td>200815<td>0x000000000003106F<td>0x9C 0xA0 0x0C
[Tr]<td>-1<td>0xFFFFFFFFFFFFFFFF<td>0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0x01
</table>
[h2 "Segment B-Tree Format"]
<p>
Segment b-trees are prefix-compressed b+-trees. There is one segment b-tree
for each row in the %_segdir table (see above). The root node of the segment
b-tree is stored as a blob in the "root" field of the corresponding row
of the %_segdir table. All other nodes (if any exist) are stored in the
"blob" column of the %_segments table. Nodes within the %_segments table are
identified by the integer value in the blockid field of the corresponding
row. The following table describes the fields of the %_segdir table:
[Table]
[Tr]<th>Column <th width=100%>Interpretion
[Tr]<td>level <td>
Between them, the contents of the "level" and "idx" fields define the
relative age of the segment b-tree. The smaller the value stored in the
"level" field, the more recently the segment b-tree was created. If two
segment b-trees are of the same "level", the segment with the larger
value stored in the "idx" column is more recent. The PRIMARY KEY constraint
on the %_segdir table prevents any two segments from having the same value
for both the "level" and "idx" fields.
[Tr]<td>idx <td> See above.
[Tr]<td>start_block <td>
The blockid that corresponds to the node with the smallest blockid that
belongs to this segment b-tree. Or zero if the entire segment b-tree
fits on the root node. If it exists, this node is always a leaf node.
[Tr]<td>leaves_end_block <td>
The blockid that corresponds to the leaf node with the largest blockid
that belongs to this segment b-tree. Or zero if the entire segment b-tree
fits on the root node.
[Tr]<td>end_block <td>
The blockid that corresponds to the interior node with the largest
blockid that belongs to this segment b-tree. Or zero if the entire segment
b-tree fits on the root node. If it exists, this node is always an
interior node.
[Tr]<td>root <td>
Blob containing the root node of the segment b-tree.
</table>
<p>
Apart from the root node, the nodes that make up a single segment b-tree are
always stored using a contiguous sequence of blockids. Furthermore, the
nodes that make up a single level of the b-tree are themselves stored as
a contiguous block, in b-tree order. The contiguous sequence of blockids
used to store the b-tree leaves are allocated starting with the blockid
value stored in the "start_block" column of the corresponding %_segdir row,
and finishing at the blockid value stored in the "leaves_end_block"
field of the same row. It is therefore possible to iterate through all the
leaves of a segment b-tree, in key order, by traversing the %_segments
table in blockid order from "start_block" to "leaves_end_block".
[h3 "Segment B-Tree Leaf Nodes"]
<p>
The following diagram depicts the format of a segment b-tree leaf node.
[Fig fts3_leaf_node.png "Segment B-Tree Leaf Node Format"]
<p>
The first term stored on each node ("Term 1" in the figure above) is
stored verbatim. Each subsequent term is prefix-compressed with respect
to its predecessor. Terms are stored within a page in sorted (memcmp)
order.
[h3 "Segment B-Tree Interior Nodes"]
<p>
The following diagram depicts the format of a segment b-tree interior
(non-leaf) node.
[Fig fts3_interior_node.png "Segment B-Tree Interior Node Format"]
[h2 "Doclist Format"]
<p>
A doclist consists of an array of 64-bit signed integers, serialized using
the FTS3 varint format. Each doclist entry is made up of a series of two
or more integers, as follows:
<ol>
<li> The docid value. The first entry in a doclist contains the literal docid
value. The first field of each subsequent doclist entry contains the
difference between the new docid and the previous one (always a positive
number).
<li> Zero or more term-offset lists. A term-offset list is present for each
column of the FTS3 virtual table that contains the term. A term-offset
list consists of the following:
<ol>
<li> Constant value 1. This field is omitted for any term-offset list
associated with column 0.
<li> The column number (1 for the second leftmost column, etc.). This
field is omitted for any term-offset list associated with column 0.
<li> A list of term-offsets, sorted from smallest to largest. Instead
of storing the term-offset value literally, each integer stored
is the difference between the current term-offset and the previous
one (or zero if the current term-offset is the first), plus 2.
</ol>
<li> Constant value 0.
</ol>
[Fig fts3_doclist2.png "FTS3 Doclist Format"]
[Fig fts3_doclist.png "FTS3 Doclist Entry Format"]
<p>
For doclists for which the term appears in more than one column of the FTS3
virtual table, term-offset lists within the doclist are stored in column
number order. This ensures that the term-offset list associated with
column 0 (if any) is always first, allowing the first two fields of the
term-offset list to be omitted in this case.
}