FTS3 and FTS4 are an SQLite virtual table modules that allows users to perform full-text searches on a set of documents. The most common (and effective) way to describe full-text searches is "what Google, Yahoo and Altavista do with documents placed on the World Wide Web". Users input a term, or series of terms, perhaps connected by a binary operator or grouped together into a phrase, and the full-text query system finds the set of documents that best matches those terms considering the operators and groupings the user has specified. This article describes the deployment and usage of FTS3 and FTS4.
FTS1 and FTS2 are obsolete full-text search modules for SQLite. There are known issues with these older modules and their use should be avoided. Portions of the original FTS3 code were contributed to the SQLite project by Scott Hess of Google. It is now developed and maintained as part of SQLite.
The FTS3 and FTS4 extension modules allows users to create special tables with a built-in full-text index (hereafter "FTS tables"). The full-text index allows the user to efficiently query the database for all rows that contain one or more words (hereafter "tokens"), even if the table contains many large documents.
For example, if each of the 517430 documents in the
"Enron E-Mail Dataset"
is inserted into both an FTS table and an ordinary SQLite table
created using the following SQL script:
Then either of the two queries below may be executed to find the number of
documents in the database that contain the word "linux" (351). Using one
desktop PC hardware configuration, the query on the FTS3 table returns in
approximately 0.03 seconds, versus 22.5 for querying the ordinary table.
Of course, the two queries above are not entirely equivalent. For example the LIKE query matches rows that contain terms such as "linuxophobe" or "EnterpriseLinux" (as it happens, the Enron E-Mail Dataset does not actually contain any such terms), whereas the MATCH query on the FTS3 table selects only those rows that contain "linux" as a discrete token. Both searches are case-insensitive. The FTS3 table consumes around 2006 MB on disk compared to just 1453 MB for the ordinary table. Using the same hardware configuration used to perform the SELECT queries above, the FTS3 table took just under 31 minutes to populate, versus 25 for the ordinary table.
FTS3 and FTS4 are nearly identical. They share most of their code in common, and their interfaces are the same. The differences are:
FTS4 contains query performance optimizations that may significantly improve the performance of full-text queries that contain terms that are very common (present in a large percentage of table rows).
FTS4 supports some additional options that may used with the [matchinfo()] function.
Because it stores extra information on disk in two new [FTS shadow tables|shadow tables] in order to support the performance optimizations and extra matchinfo() options, FTS4 tables may consume more disk space than the equivalent table created using FTS3. Usually the overhead is 1-2% or less, but may be as high as 10% if the documents stored in the FTS table are very small. The overhead may be reduced by specifying the directive [matchinfo_fts3|"matchinfo=fts3"] as part of the FTS4 table declaration, but this comes at the expense of sacrificing some of the extra supported matchinfo() options.
FTS4 provides hooks (the compress and uncompress [FTS4 options|options]) allowing data to be stored in a compressed form, reducing disk usage and IO.
FTS4 is an enhancement to FTS3. FTS3 has been available since SQLite [version 3.5.0] in 2007-09-04. The enhancements for FTS4 were added with SQLite [version 3.7.4] on 2010-12-08.
Which module, FTS3 or FTS4, should you use in your application? FTS4 is sometimes significantly faster than FTS3, even orders of magnitude faster depending on the query, though in the common case the performance of the two modules is similar. FTS4 also offers the enhanced [matchinfo()] outputs which can be useful in ranking the results of a [FTS MATCH|MATCH] operation. On the other hand, in the absence of a [matchinfo_fts3|matchinfo=fts3] directive FTS4 requires a little more disk space than FTS3, though only a percent of two in most cases.
For newer applications, FTS4 is recommended; though if compatibility with older versions of SQLite is important, then FTS3 will usually serve just as well.
Like other virtual table types, new FTS tables are created using a [CREATE VIRTUAL TABLE] statement. The module name, which follows the USING keyword, is either "fts3" or "fts4". The virtual table module arguments may be left empty, in which case an FTS table with a single user-defined column named "content" is created. Alternatively, the module arguments may be passed a list of comma separated column names.
If column names are explicitly provided for the FTS table as part of
the CREATE VIRTUAL TABLE statement, then a datatype name may be optionally
specified for each column. This is pure syntactic sugar, the
supplied typenames are not used by FTS or the SQLite core for any
purpose. The same applies to any constraints specified along with an
FTS column name - they are parsed but not used or recorded by the system
in any way.
As well as a list of columns, the module arguments passed to a CREATE
VIRTUAL TABLE statement used to create an FTS table may be used to specify
a [tokenizer]. This is done by specifying a string of the form
"tokenize=<tokenizer name> <tokenizer args>" in place of a column
name, where <tokenizer name> is the name of the tokenizer to use and
<tokenizer args> is an optional list of whitespace separated qualifiers
to pass to the tokenizer implementation. A tokenizer specification may be
placed anywhere in the column list, but at most one tokenizer declaration is
allowed for each CREATE VIRTUAL TABLE statement. [tokenizer|See below] for a
detailed description of using (and, if necessary, implementing) a tokenizer.
FTS tables may be dropped from the database using an ordinary [DROP TABLE]
statement. For example:
FTS tables are populated using [INSERT], [UPDATE] and [DELETE] statements in the same way as ordinary SQLite tables are.
As well as the columns named by the user (or the "content" column if no module arguments were specified as part of the [CREATE VIRTUAL TABLE] statement), each FTS table has a "rowid" column. The rowid of an FTS table behaves in the same way as the rowid column of an ordinary SQLite table, except that the values stored in the rowid column of an FTS table remain unchanged if the database is rebuilt using the [VACUUM] command. For FTS tables, "docid" is allowed as an alias along with the usual "rowid", "oid" and "_oid_" identifiers. Attempting to insert or update a row with a docid value that already exists in the table is an error, just as it would be with an ordinary SQLite table.
There is one other subtle difference between "docid" and the normal SQLite
aliases for the rowid column. Normally, if an INSERT or UPDATE statement
assigns discrete values to two or more aliases of the rowid column, SQLite
writes the rightmost of such values specified in the INSERT or UPDATE
statement to the database. However, assigning a non-NULL value to both
the "docid" and one or more of the SQLite rowid aliases when inserting or
updating an FTS table is considered an error. See below for an example.
To support full-text queries, FTS maintains an inverted index that maps from each unique term or word that appears in the dataset to the locations in which it appears within the table contents. For the curious, a complete description of the [segment btree|data structure] used to store this index within the database file appears below. A feature of this data structure is that at any time the database may contain not one index b-tree, but several different b-trees that are incrementally merged as rows are inserted, updated and deleted. This technique improves performance when writing to an FTS table, but causes some overhead for full-text queries that use the index. Executing an SQL statement of the form "INSERT INTO <fts-table>(<fts-table>) VALUES('optimize')" causes FTS to merge all existing index b-trees into a single large b-tree containing the entire index. This can be an expensive operation, but may speed up future queries.
For example, to optimize the full-text index for an FTS table named
"docs":
The statement above may appear syntactically incorrect to some. Refer to the section describing the [simple fts queries] for an explanation.
There is another, deprecated, method for invoking the optimize operation using a SELECT statement. New code should use statements similar to the INSERT above to optimize FTS structures.
As for all other SQLite tables, virtual or otherwise, data is retrieved from FTS tables using a [SELECT] statement.
FTS tables can be queried efficiently using SELECT statements of two different forms:
Query by rowid. If the WHERE clause of the SELECT statement contains a sub-clause of the form "rowid = ?", where ? is an SQL expression, FTS is able to retrieve the requested row directly using the equivalent of an SQLite [INTEGER PRIMARY KEY] index.
Full-text query. If the WHERE clause of the SELECT statement contains a sub-clause of the form "<column> MATCH ?", FTS is able to use the built-in full-text index to restrict the search to those documents that match the full-text query string specified as the right-hand operand of the MATCH clause.
If neither of these two query strategies can be used, all
queries on FTS tables are implemented using a linear scan of the entire
table. If the table contains large amounts of data, this may be an
impractical approach (the first example on this page shows that a linear
scan of 1.5 GB of data takes around 30 seconds using a modern PC).
In all of the full-text queries above, the right-hand operand of the MATCH operator is a string consisting of a single term. In this case, the MATCH expression evaluates to true for all documents that contain one or more instances of the specified word ("sqlite", "search" or "database", depending on which example you look at). Specifying a single term as the right-hand operand of the MATCH operator results in the simplest and most common type of full-text query possible. However more complicated queries are possible, including phrase searches, term-prefix searches and searches for documents containing combinations of terms occuring within a defined proximity of each other. The various ways in which the full-text index may be queried are [FTS MATCH|described below].
Normally, full-text queries are case-insensitive. However, this is is dependent on the specific [tokenizer] used by the FTS table being queried. Refer to the section on [tokenizer|tokenizers] for details.
The paragraph above notes that a MATCH operator with a simple term as the
right-hand operand evaluates to true for all documents that contain the
specified term. In this context, the "document" may refer to either the
data stored in a single column of a row of an FTS table, or to the contents
of all columns in a single row, depending on the identifier used as the
left-hand operand to the MATCH operator. If the identifier specified as
the left-hand operand of the MATCH operator is an FTS table column name,
then the document that the search term must be contained in is the value
stored in the specified column. However, if the identifier is the name
of the FTS table itself, then the MATCH operator evaluates to true
for each row of the FTS table for which any column contains the search
term. The following example demonstrates this:
At first glance, the final two full-text queries in the example above seem to be syntactically incorrect, as there is a table name ("mail") used as an SQL expression. The reason this is acceptable is that each FTS table actually has a [sqlite3_declare_vtab|HIDDEN] column with the same name as the table itself (in this case, "mail"). The value stored in this column is not meaningful to the application, but can be used as the left-hand operand to a MATCH operator. This special column may also be passed as an argument to the [snippet()|FTS auxiliary functions].
The following example illustrates the above. The expressions "docs",
"docs.docs" and "main.docs.docs" all refer to column "docs". However, the
expression "main.docs" does not refer to any column. It could be used to
refer to a table, but a table name is not allowed in the context in which
it is used below.
From the users point of view, FTS tables are similar to ordinary SQLite tables in many ways. Data may be added to, modified within and removed from FTS tables using the INSERT, UPDATE and DELETE commands just as it may be with ordinary tables. Similarly, the SELECT command may be used to query data. The following list summarizes the differences between FTS and ordinary tables:
As with all virtual table types, it is not possible to create indices or triggers attached to FTS tables. Nor is it possible to use the ALTER TABLE command to add extra columns to FTS tables (although it is possible to use ALTER TABLE to rename an FTS table).
Data-types specified as part of the "CREATE VIRTUAL TABLE" statement used to create an FTS table are ignored completely. Instead of the normal rules for applying type [affinity] to inserted values, all values inserted into FTS table columns (except the special rowid column) are converted to type TEXT before being stored.
FTS tables permit the special alias "docid" to be used to refer to the rowid column supported by all [virtual tables].
The [FTS MATCH] operator is supported for queries based on the built-in full-text index.
The [FTS auxiliary functions], [snippet()], [offsets()], and [matchinfo()] are available to support full-text queries.
Although FTS3 and FTS4 are included with the SQLite core source code, they are not
enabled by default. To build SQLite with FTS functionality enabled, define
the preprocessor macro [SQLITE_ENABLE_FTS3] when compiling. New applications
should also define the [SQLITE_ENABLE_FTS3_PARENTHESIS] macro to enable the
[enhanced query syntax] (see below). Usually, this is done by adding the
following two switches to the compiler command line:
Note that enabling FTS3 also makes FTS4 available. There is not a separate SQLITE_ENABLE_FTS4 compile-time option. An build of SQLite either supports both FTS3 and FTS4 or it supports neither.
If using the amalgamation autoconf based build system, setting the CPPFLAGS
environment variable while running the 'configure' script is an easy
way to set these macros. For example, the following command:
where <configure options> are those options normally passed to the configure script, if any.
Because FTS3 and FTS4 are virtual tables, The [SQLITE_ENABLE_FTS3] compile-time option is incompatible with the [SQLITE_OMIT_VIRTUALTABLE] option.
If a build of SQLite does not include the FTS modules, then any attempt to prepare an SQL statement to create an FTS3 or FTS4 table or to drop or access an existing FTS table in any way will fail. The error message returned will be similar to "no such module: ftsN" (where N is either 3 or 4).
If the C version of the ICU library
is available, then FTS may also be compiled with the SQLITE_ENABLE_ICU
pre-processor macro defined. Compiling with this macro enables an FTS
[tokenizer] that uses the ICU library to split a document into terms
(words) using the conventions for a specified language and locale.
The most useful thing about FTS tables is the queries that may be performed using the built-in full-text index. Full-text queries are performed by specifying a clause of the form "<column> MATCH <full-text query expression>" as part of the WHERE clause of a SELECT statement that reads data from an FTS table. [simple fts queries|Simple FTS queries] that return all documents that contain a given term are described above. In that discussion the right-hand operand of the MATCH operator was assumed to be a string consisting of a single term. This section describes the more complex query types supported by FTS tables, and how they may be utilized by specifying a more complex query expression as the right-hand operand of a MATCH operator.
FTS tables support three basic query types:
Token or token prefix queries. An FTS table may be queried for all documents that contain a specified term (the [simple fts queries|simple case] described above), or for all documents that contain a term with a specified prefix. As we have seen, the query expression for a specific term is simply the term itself. The query expression used to search for a term prefix is the prefix itself with a '*' character appended to it. For example:
Normally, a token or token prefix query is matched against the FTS table column specified as the right-hand side of the MATCH operator. Or, if the special column with the same name as the FTS table itself is specified, against all columns. This may be overridden by specifying a column-name followed by a ":" character before a basic term query. There may be space between the ":" and the term to query for, but not between the column-name and the ":" character. For example:
Phrase queries. A phrase query is a query that retrieves all documents that contain a nominated set of terms or term prefixes in a specified order with no intervening tokens. Phrase queries are specified by enclosing a space separated sequence of terms or term prefixes in double quotes ("). For example:
NEAR queries. A NEAR query is a query that returns documents that contain a two or more nominated terms or phrases within a specified proximity of each other (by default with 10 or less intervening terms). A NEAR query is specified by putting the keyword "NEAR" between two phrase, term or term prefix queries. To specify a proximity other than the default, an operator of the form "NEAR/<N>" may be used, where <N> is the maximum number of intervening terms allowed. For example:
More than one NEAR operator may appear in a single query. In this case each pair of terms or phrases separated by a NEAR operator must appear within the specified proximity of each other in the document. Using the same table and data as in the block of examples above:
Phrase and NEAR queries may not span multiple columns within a row.
The three basic query types described above may be used to query the full-text index for the set of documents that match the specified criteria. Using the FTS query expression language it is possible to perform various set operations on the results of basic queries. There are currently three supported operations:
The FTS modules may be compiled to use one of two slightly different versions of the full-text query syntax, the "standard" query syntax and the "enhanced" query syntax. The basic term, term-prefix, phrase and NEAR queries described above are the same in both versions of the syntax. The way in which set operations are specified is slightly different. The following two sub-sections describe the part of the two query syntaxes that pertains to set operations. Refer to the description of how to [compile fts] for compilation notes.
The enhanced query syntax supports the AND, OR and NOT binary set operators. Each of the two operands to an operator may be a basic FTS query, or the result of another AND, OR or NOT set operation. Operators must be entered using capital letters. Otherwise, they are interpreted as basic term queries instead of set operators.
The AND operator may be implicitly specified. If two basic queries appear
with no operator separating them in an FTS query string, the results are
the same as if the two basic queries were separated by an AND operator.
For example, the query expression "implicit operator" is a more succinct
version of "implicit AND operator".
The examples above all use basic full-text term queries as both operands of the set operations demonstrated. Phrase and NEAR queries may also be used, as may the results of other set operations. When more than one set operation is present in an FTS query, the precedence of operators is as follows:
Operator | Enhanced Query Syntax Precedence |
---|---|
NOT | Highest precedence (tightest grouping). |
AND | |
OR | Lowest precedence (loosest grouping). |
When using the enhanced query syntax, parenthesis may be used to override
the default precedence of the various operators. For example:
FTS query set operations using the standard query syntax are similar, but not identical, to set operations with the enhanced query syntax. There are four differences, as follows:
Only the implicit version of the AND operator is supported. Specifying the string "AND" as part of an standard query syntax query is interpreted as a term query for the set of documents containing the term "and".
Parenthesis are not supported.
The NOT operator is not supported. Instead of the NOT operator, the standard query syntax supports a unary "-" operator that may be applied to basic term and term-prefix queries (but not to phrase or NEAR queries). A term or term-prefix that has a unary "-" operator attached to it may not appear as an operand to an OR operator. An FTS query may not consist entirely of terms or term-prefix queries with unary "-" operators attached to them.
The relative precedence of the set operations is different. In particular, using the standard query syntax the "OR" operator has a higher precedence than "AND". The precedence of operators when using the standard query syntax is:
Operator | Standard Query Syntax Precedence |
---|---|
Unary "-" | Highest precedence (tightest grouping). |
OR | |
AND | Lowest precedence (loosest grouping). |
The FTS3 and FTS4 modules provide three special SQL scalar functions that may be useful to the developers of full-text query systems: "snippet", "offsets" and "matchinfo". The purpose of the "snippet" and "offsets" functions is to allow the user to identify the location of queried terms in the returned documents. The "matchinfo" function provides the user with metrics that may be useful for filtering or sorting query results according to relevance.
The first argument to all three special SQL scalar functions
must be the [FTS hidden column] of the FTS table that the function is
applied to. The [FTS hidden column] is an automatically-generated column found on
all FTS tables that has the same name as the FTS table itself.
For example, given an FTS table named "mail":
The three auxiliary functions are only useful within a SELECT statement that uses the FTS table's full-text index. ^If used within a SELECT that uses the "query by rowid" or "linear scan" strategies, then the snippet and offsets both return an empty string, and the matchinfo function returns a blob value zero bytes in size.
All three auxiliary functions extract a set of "matchable phrases" from the FTS query expression to work with. The set of matchable phrases for a given query consists of all phrases (including unquoted tokens and token prefixes) in the expression except those that are prefixed with a unary "-" operator (standard syntax) or are part of a sub-expression that is used as the right-hand operand of a NOT operator.
With the following provisos, each series of tokens in the FTS table that matches one of the matchable phrases in the query expression is known as a "phrase match":
For a SELECT query that uses the full-text index, the offsets() function returns a text value containing a series of space-separated integers. For each term in each phrase match of the current row, there are four integers in the returned list. Each set of four integers is interpreted as follows:
Integer | Interpretation |
---|---|
0 | The column number that the term instance occurs in (0 for the leftmost column of the FTS table, 1 for the next leftmost, etc.). |
1 | The term number of the matching term within the full-text query expression. Terms within a query expression are numbered starting from 0 in the order that they occur. |
2 | The byte offset of the matching term within the column. |
3 | The size of the matching term in bytes. |
The following block contains examples that use the offsets function.
The snippet function is used to create formatted fragments of document text for display as part of a full-text query results report. The snippet function may be passed between one and six arguments, as follows:
Argument | Default Value | Description |
---|---|---|
0 | N/A | The first argument to the snippet function must always be the [FTS hidden column] of the FTS table being queried and from which the snippet is to be taken. The [FTS hidden column] is an automatically generated column with the same name as the FTS table itself. |
1 | "<b>" | The "start match" text. |
2 | "</b>" | The "end match" text. |
3 | "<b>...</b>" | The "ellipses" text. |
4 | -1 | The FTS table column number to extract the returned fragments of text from. Columns are numbered from left to right starting with zero. A negative value indicates that the text may be extracted from any column. |
5 | -15 | The absolute value of this integer argument is used as the (approximate) number of tokens to include in the returned text value. The maximum allowable absolute value is 64. The value of this argument is referred to as N in the discussion below. |
The snippet function first attempts to find a fragment of text consisting of |N| tokens within the current row that contains at least one phrase match for each matchable phrase matched somewhere in the current row, where |N| is the absolute value of the sixth argument passed to the snippet function. If the text stored in a single column contains less than |N| tokens, then the entire column value is considered. Text fragments may not span multiple columns.
If such a text fragment can be found, it is returned with the following modifications:
If more than one such fragment can be found, then fragments that contain a larger number of "extra" phrase matches are favoured. The start of the selected text fragment may be moved a few tokens forward or backward to attempt to concentrate the phrase matches toward the center of the fragment.
Assuming N is a positive value, if no fragments can be found that contain an phrase match corresponding to each matchable phrase, the snippet function attempts to find two fragments of approximately N/2 tokens that between them contain at least one phrase match for each matchable phrase matched by the current row. If this fails, attempts are made to find three fragments of N/3 tokens each and finally four N/4 token fragments. If a set of four fragments cannot be found that encompasses the required phrase matches, the four fragments of N/4 tokens that provide the best coverage are selected.
If N is a negative value, and no single fragment can be found containing the required phrase matches, the snippet function searches for two fragments of |N| tokens each, then three, then four. In other words, if the specified value of N is negative, the sizes of the fragments is not decreased if more than one fragment is required to provide the desired phrase match coverage.
After the M fragments have been located, where M is between
two and four as described in the paragraphs above, they are joined together
in sorted order with the "ellipses" text separating them. The three
modifications enumerated earlier are performed on the text before it is
returned.
The matchinfo function returns a blob value. If it is used within a query that does not use the full-text index (a "query by rowid" or "linear scan"), then the blob is zero bytes in size. Otherwise, the blob consists of zero or more 32-bit unsigned integers in machine byte-order. The exact number of integers in the returned array depends on both the query and the value of the second argument (if any) passed to the matchinfo function.
The matchinfo function is called with either one or two arguments. As for all auxiliary functions, the first argument must be the special [FTS hidden column]. The second argument, if it is specified, must be a text value comprised only of the characters 'p', 'c', 'n', 'a', 'l', 's' and 'x'. If no second argument is explicitly supplied, it defaults to "pcx". The second argument is refered to as the "format string" below.
Characters in the matchinfo format string are processed from left to right. Each character in the format string causes one or more 32-bit unsigned integer values to be added to the returned array. The "values" column in the following table contains the number of integer values appended to the output buffer for each supported format string character. In the formula given, cols is the number of columns in the FTS table, and phrases is the number of matchable phrases in the query.
Character | Values | Description |
---|---|---|
p | 1 | The number of matchable phrases in the query. |
c | 1 | The number of user defined columns in the FTS table (i.e. not including the docid or the [FTS hidden column]). |
x | 3 * cols * phrases |
For each distinct combination of a phrase and table column, the
following three values:
hits_this_row = array[3 * (c + p*cols) + 0] hits_all_rows = array[3 * (c + p*cols) + 1] docs_with_hits = array[3 * (c + p*cols) + 2] |
n | 1 | The number of rows in the FTS4 table. This value is only available when querying FTS4 tables, not FTS3. |
a | cols | For each column, the average number of tokens in the text values stored in the column (considering all rows in the FTS4 table). This value is only available when querying FTS4 tables, not FTS3. |
l | cols | For each column, the length of the value stored in the current row of the FTS4 table, in tokens. This value is only available when querying FTS4 tables, not FTS3. And only if the "matchinfo=fts3" directive was not specified as part of the "CREATE VIRTUAL TABLE" statement used to create the FTS4 table. |
s | cols | For each column, the length of the longest subsequence of phrase matches that the column value has in common with the query text. For example, if a table column contains the text 'a b c d e' and the query is 'a c "d e"', then the length of the longest common subsequence is 2 (phrase "c" followed by phrase "d e"). |
For example:
The matchinfo function is much faster than either the snippet or offsets
functions. This is because the implementation of both snippet and offsets
is required to retrieve the documents being analyzed from disk, whereas
all data required by matchinfo is available as part of the same portions
of the full-text index that are required to implement the full-text query
itself. This means that of the following two queries, the first may be
an order of magnitude faster than the second:
The matchinfo function provides all the information required to calculate probabilistic "bag-of-words" relevancy scores such as Okapi BM25/BM25F that may be used to order results in a full-text search application. Appendix A of this document, "[search application tips]", contains an example of using the matchinfo() function efficiently.
As of version 3.7.6, SQLite includes a new virtual table module called "fts4aux", which can be used to inspect the full-text index of an exiting FTS table directly. Despite its name, fts4aux works just as well with FTS3 tables as it does with FTS4 tables. Fts4aux tables are read-only. The only way to modify the contents of an fts4aux table is by modifying the contents of the associated FTS table. The fts4aux module is automatically included in all [compile fts|builds that include FTS].
An fts4aux virtual table is constructed with a single argument - the
unqualified name of the FTS table that it will be used to access.
For example:
For each term present in the FTS table, there are between 2 and N+1 rows in the fts4aux table, where N is the number of user-defined columns in the associated FTS table. An fts4aux table always has the same four columns, as follows, from left to right:
Column Name | Column Contents |
---|---|
term | Contains the text of the term for this row. |
col | This column may contain either the text value '*' (i.e. a single character, UTF codepoint 42) or an integer between 0 and N-1, where N is again the number of user-defined columns in the corresponding FTS table. |
documents |
This column always contains an integer value greater than zero.
If the "col" column contains the value '*', then this column contains the number of rows of the FTS table that contain at least one instance of the term (in any column). If col contains an integer value, then this column contains the number of rows of the FTS table that contain at least one instance of the term in the column identified by the col value. As usual, the columns of the FTS table are numbered from left to right, starting with zero. |
occurrences |
This column also always contains an integer value greater than zero.
If the "col" column contains the value '*', then this column contains the total number of instances of the term in all rows of the FTS table (in any column). Otherwise, if col contains an integer value, then this column contains the total number of instances of the term that appear in the FTS table column identified by the col value. |
For example, using the tables created above:
In the example, the values in the "term" column are all lower case, even though they were inserted into table "ft" in mixed case. This is because an fts3aux table contains the terms as extracted from the document text by the [tokenizer]. In this case, since table "ft" uses the [tokenizer|simple tokenizer], this means all terms have been folded to lower case. Also, there is (for example) no row with column "term" set to "apple" and column "col" set to 1. Since there are no instances of the term "apple" in column 1, no row is present in the fts4aux table.
During a transaction, some of the data written to an FTS table may be cached in memory and written to the database only when the transaction is committed. However the implementation of the fts4aux module is only able to read data from the database. In practice this means that if an fts4aux table is queried from within a transaction in which the associated FTS table has been modified, the results of the query are likely to reflect only a (possibly empty) subset of the changes made.
FTS4 currently supports the following three options:
Option | Interpretation |
---|---|
matchinfo | This option may only be set to the value "fts3". Attempting to set it otherwise is an error. If this option is specified, then some of the extra information stored by FTS4 is omitted. This reduces the amount of disk space consumed by an FTS4 table until it is almost the same as the amount that would be used by the equivalent FTS3 table, but also means that the data accessed by passing the 'l' flag to the [matchinfo()] function is not available. |
compress | This option is used to specify the compress function. It is an error to specify a compress function without also specifying an uncompress function. [fts4 compress option|See below] for details. |
uncompress | This option is used to specify the uncompress function. It is an error to specify an uncompress function without also specifying a compress function. [fts4 compress option|See below] for details. |
order | The "order" option may be set to either "DESC" or "ASC" (in upper or lower case). If it is set to "DESC", then FTS4 stores its data in such a way as to optimize returning results in descending order by docid. If it is set to "ASC" (the default), then the data structures are optimized for returning results in ascending order by docid. In other words, if many of the queries run against the FTS4 table use "ORDER BY docid DESC", then it may improve performance to add the "order=desc" option to the CREATE VIRTUAL TABLE statement. |
prefix | This option may be set to a comma-separated list of positive non-zero integers. For each integer N in the list, a separate index is created in the database file to optimize [FTS MATCH|term prefix] queries where the query term is N bytes in length, not including the '*' character, when encoded using UTF-8. [fts4 prefix option|See below] for details. |
When using FTS4, specifying a column name that contains an "=" character
and is not either a "tokenize=*" specification or a recognized FTS4 option
is an error. With FTS3, the first token in the unrecognized directive is
interpreted as a column name. Similarly, specifying multiple "tokenize=*"
directives in a single table declaration is an error when using FTS4, whereas
the second and subsequent "tokenize=*" directives are interpreted as column
names by FTS3. For example:
The compress and uncompress options allow FTS4 content to be stored in the database in a compressed form. Both options should be set to the name of an SQL scalar function registered using [sqlite3_create_function()] that accepts a single argument.
The compress function should return a compressed version of the value passed to it as an argument. Each time data is written to the FTS4 table, each column value is passed to the compress function and the result value stored in the database. The compress function may return any type of SQLite value (blob, text, real, integer or null).
The uncompress function should uncompress data previously compressed by the compress function. In other words, for all SQLite values X, it should be true that uncompress(compress(X)) equals X. When data that has been compressed by the compress function is read from the database by FTS4, it is passed to the uncompress function before it is used.
If the specified compress or uncompress functions do not exist, the table
may still be created. An error is not returned until the FTS4 table is
read (if the uncompress function does not exist) or written (if it is the
compress function that does not exist).
The FTS4 prefix option causes FTS to index term prefixes of specified lengths in the same way that it always indexes complete terms. The prefix option must be set to a comma separated list of positive non-zero integers. For each value N in the list, prefixes of length N bytes (when encoded using UTF-8) are indexed. FTS4 uses term prefix indexes to speed up term prefix queries. The cost, of course, is that indexing term prefixes as well as complete terms increases the database size and slows down write operations on the FTS4 table.
Prefix indexes may be used to optimize term prefix queries in two cases. If the
query is for a prefix of N bytes, then a prefix index created with "prefix=N"
provides the best optimization. Or, if no "prefix=N" index is available, a
"prefix=N+1" index imay be used instead. Using a "prefix=N+1" index is less
efficient than a "prefix=N" index, but is better than no prefix index at all.
An FTS tokenizer is a set of rules for extracting terms from a document or basic FTS full-text query.
Unless a specific tokenizer is specified as part of the CREATE VIRTUAL TABLE statement used to create the FTS table, the default tokenizer, "simple", is used. The simple tokenizer extracts tokens from a document or basic FTS full-text query according to the following rules:
A term is a contiguous sequence of eligible characters, where eligible characters are all alphanumeric characters, the "_" character, and all characters with UTF codepoints greater than or equal to 128. All other characters are discarded when splitting a document into terms. Their only contribution is to separate adjacent terms.
All uppercase characters within the ASCII range (UTF codepoints less than 128), are transformed to their lowercase equivalents as part of the tokenization process. Thus, full-text queries are case-insensitive when using the simple tokenizer.
For example, when a document containing the text "Right now, they're very frustrated.", the terms extracted from the document and added to the full-text index are, in order, "right now they re very frustrated". Such a document would match a full-text query such as "MATCH 'Frustrated'", as the simple tokenizer transforms the term in the query to lowercase before searching the full-text index.
As well as the "simple" tokenizer, the FTS source code features a tokenizer that uses the Porter Stemming algorithm. This tokenizer uses the same rules to separate the input document into terms, but as well as folding all terms to lower case it uses the Porter Stemming algorithm to reduce related English language words to a common root. For example, using the same input document as in the paragraph above, the porter tokenizer extracts the following tokens: "right now thei veri frustrat". Even though some of these terms are not even English words, in some cases using them to build the full-text index is more useful than the more intelligible output produced by the simple tokenizer. Using the porter tokenizer, the document not only matches full-text queries such as "MATCH 'Frustrated'", but also queries such as "MATCH 'Frustration'", as the term "Frustration" is reduced by the Porter stemmer algorithm to "frustrat" - just as "Frustrated" is. So, when using the porter tokenizer, FTS is able to find not just exact matches for queried terms, but matches against similar English language terms. For more information on the Porter Stemmer algorithm, please refer to the page linked above.
Example illustrating the difference between the "simple" and "porter"
tokenizers:
If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor
symbol defined, then there exists a built-in tokenizer named "icu"
implemented using the ICU library. The first argument passed to the
xCreate() method (see fts3_tokenizer.h) of this tokenizer may be
an ICU locale identifier. For example "tr_TR" for Turkish as used
in Turkey, or "en_AU" for English as used in Australia. For example:
The ICU tokenizer implementation is very simple. It splits the input text according to the ICU rules for finding word boundaries and discards any tokens that consist entirely of white-space. This may be suitable for some applications in some locales, but not all. If more complex processing is required, for example to implement stemming or discard punctuation, this can be done by creating a tokenizer implementation that uses the ICU tokenizer as part of its implementation.
As well as the built-in "simple", "porter" and (possibly) "icu" tokenizers, FTS exports an interface that allows users to implement custom tokenizers using C. The interface used to create a new tokenizer is defined and described in the fts3_tokenizer.h source file.
Registering a new FTS tokenizer is similar to registering a new virtual table module with SQLite. The user passes a pointer to a structure containing pointers to various callback functions that make up the implementation of the new tokenizer type. For tokenizers, the structure (defined in fts3_tokenizer.h) is called "sqlite3_tokenizer_module".
FTS does not expose a C-function that users call to register new
tokenizer types with a database handle. Instead, the pointer must
be encoded as an SQL blob value and passed to FTS through the SQL
engine by evaluating a special scalar function, "fts3_tokenizer()".
The fts3_tokenizer() function may be called with one or two arguments,
as follows:
Where
SECURITY WARNING: If the fts3/4 extension is used in an environment
where potentially malicious users may execute arbitrary SQL, they should
be prevented from invoking the fts3_tokenizer() function, possibly using
the [sqlite3_set_authorizer()|authorisation callback].
The following block contains an example of calling the fts3_tokenizer()
function from C code:
This section describes at a high-level the way the FTS module stores its
index and content in the database. It is not necessary to read or
understand the material in this section in order to use FTS in an
application. However, it may be useful to application developers attempting
to analyze and understand FTS performance characteristics, or to developers
contemplating enhancements to the existing FTS feature set.
For each FTS virtual table in a database, three to five real (non-virtual) tables
are created to store the underlying data. These real tables are called "shadow tables".
The real tables are named "%_content",
"%_segdir", "%_segments", "%_stat", and "%_docsize", where "%" is replaced by the name
of the FTS virtual table.
The leftmost column of the "%_content" table is an INTEGER PRIMARY KEY field
named "docid". Following this is one column for each column of the FTS
virtual table as declared by the user, named by prepending the column name
supplied by the user with "cN", where N is the index of the
column within the table, numbered from left to right starting with 0. Data
types supplied as part of the virtual table declaration are not used as
part of the %_content table declaration. For example:
The %_content table contains the unadulterated data inserted by the user
into the FTS virtual table by the user. If the user does not explicitly
supply a "docid" value when inserting records, one is selected automatically
by the system.
The %_stat and %_docsize tables are only created if the FTS table uses the
FTS4 module, not FTS3. Furthermore, the %_docsize table is omitted if the
FTS4 table is created with the [matchinfo_fts3|"matchinfo=fts3"] directive
specified as part of the CREATE VIRTUAL TABLE statement. If they are created,
the schema of the two tables is as follows:
For each row in the FTS table, the %_docsize table contains a corresponding
row with the same "docid" value. The "size" field contains a blob consisting
of N FTS varints, where N is the number of user-defined columns
in the table. Each varint in the "size" blob is the number of tokens in the
corresponding column of the associated row in the FTS table. The %_stat table
always contains a single row with the "id" column set to 0. The "value"
column contains a blob consisting of N+1 FTS varints, where N
is again the number of user-defined columns in the FTS table. The first
varint in the blob is set to the total number of rows in the FTS table. The
second and subsequent varints contain the total number of tokens stored in
the corresponding column for all rows of the FTS table.
The two remaining tables, %_segments and %_segdir, are used to store the
full-text index. Conceptually, this index is a lookup table that maps each
term (word) to the set of docid values corresponding to records in the
%_content table that contain one or more occurrences of the term. To
retrieve all documents that contain a specified term, the FTS module
queries this index to determine the set of docid values for records that
contain the term, then retrieves the required documents from the %_content
table. Regardless of the schema of the FTS virtual table, the %_segments
and %_segdir tables are always created as follows:
The schema depicted above is not designed to store the full-text index
directly. Instead, it is used to one or more b-tree structures. There
is one b-tree for each row in the %_segdir table. The %_segdir table
row contains the root node and various meta-data associated with the
b-tree structure, and the %_segments table contains all other (non-root)
b-tree nodes. Each b-tree is referred to as a "segment". Once it has
been created, a segment b-tree is never updated (although it may be
deleted altogether).
The keys used by each segment b-tree are terms (words). As well as the
key, each segment b-tree entry has an associated "doclist" (document list).
A doclist consists of zero or more entries, where each entry consists of:
Entries within a doclist are sorted by docid. Positions within a doclist
entry are stored in ascending order.
The contents of the logical full-text index is found by merging the
contents of all segment b-trees. If a term is present in more than one
segment b-tree, then it maps to the union of each individual doclist. If,
for a single term, the same docid occurs in more than one doclist, then only
the doclist that is part of the most recently created segment b-tree is
considered valid.
Multiple b-tree structures are used instead of a single b-tree to reduce
the cost of inserting records into FTS tables. When a new record is
inserted into an FTS table that already contains a lot of data, it is
likely that many of the terms in the new record are already present in
a large number of existing records. If a single b-tree were used, then
large doclist structures would have to be loaded from the database,
amended to include the new docid and term-offset list, then written back
to the database. Using multiple b-tree tables allows this to be avoided
by creating a new b-tree which can be merged with the existing b-tree
(or b-trees) later on. Merging of b-tree structures can be performed as
a background task, or once a certain number of separate b-tree structures
have been accumulated. Of course, this scheme makes queries more expensive
(as the FTS code may have to look up individual terms in more than one
b-tree and merge the results), but it has been found that in practice this
overhead is often negligible.
Integer values stored as part of segment b-tree nodes are encoded using the
FTS varint format. This encoding is similar, but not identical, to the
the SQLite varint format.
An encoded FTS varint consumes between one and ten bytes of space. The
number of bytes required is determined by the sign and magnitude of the
integer value encoded. More accurately, the number of bytes used to store
the encoded integer depends on the position of the most significant set bit
in the 64-bit twos-complement representation of the integer value. Negative
values always have the most significant bit set (the sign bit), and so are
always stored using the full ten bytes. Positive integer values may be
stored using less space.
The final byte of an encoded FTS varint has its most significant bit
cleared. All preceding bytes have the most significant bit set. Data
is stored in the remaining seven least significant bits of each byte.
The first byte of the encoded representation contains the least significant
seven bits of the encoded integer value. The second byte of the encoded
representation, if it is present, contains the seven next least significant
bits of the integer value, and so on. The following table contains examples
of encoded integer values:
Segment b-trees are prefix-compressed b+-trees. There is one segment b-tree
for each row in the %_segdir table (see above). The root node of the segment
b-tree is stored as a blob in the "root" field of the corresponding row
of the %_segdir table. All other nodes (if any exist) are stored in the
"blob" column of the %_segments table. Nodes within the %_segments table are
identified by the integer value in the blockid field of the corresponding
row. The following table describes the fields of the %_segdir table:
Apart from the root node, the nodes that make up a single segment b-tree are
always stored using a contiguous sequence of blockids. Furthermore, the
nodes that make up a single level of the b-tree are themselves stored as
a contiguous block, in b-tree order. The contiguous sequence of blockids
used to store the b-tree leaves are allocated starting with the blockid
value stored in the "start_block" column of the corresponding %_segdir row,
and finishing at the blockid value stored in the "leaves_end_block"
field of the same row. It is therefore possible to iterate through all the
leaves of a segment b-tree, in key order, by traversing the %_segments
table in blockid order from "start_block" to "leaves_end_block".
The following diagram depicts the format of a segment b-tree leaf node.
Segment B-Tree Leaf Node Format
The first term stored on each node ("Term 1" in the figure above) is
stored verbatim. Each subsequent term is prefix-compressed with respect
to its predecessor. Terms are stored within a page in sorted (memcmp)
order.
The following diagram depicts the format of a segment b-tree interior
(non-leaf) node.
Segment B-Tree Interior Node Format
A doclist consists of an array of 64-bit signed integers, serialized using
the FTS varint format. Each doclist entry is made up of a series of two
or more integers, as follows:
FTS3 Doclist Format
FTS Doclist Entry Format
For doclists for which the term appears in more than one column of the FTS
virtual table, term-offset lists within the doclist are stored in column
number order. This ensures that the term-offset list associated with
column 0 (if any) is always first, allowing the first two fields of the
term-offset list to be omitted in this case.
FTS is primarily designed to support Boolean full-text queries - queries
to find the set of documents that match a specified criteria. However, many
(most?) search applications require that results are somehow ranked in order
of "relevance", where "relevance" is defined as the likelihood that the user
who performed the search is interested in a specific element of the returned
set of documents. When using a search engine to find documents on the world
wide web, the user expects that the most useful, or "relevant", documents
will be returned as the first page of results, and that each subsequent page
contains progressively less relevant results. Exactly how a machine can
determine document relevance based on a users query is a complicated problem
and the subject of much ongoing research.
One very simple scheme might be to count the number of instances of the
users search terms in each result document. Those documents that contain
many instances of the terms are considered more relevant than those with
a small number of instances of each term. In an FTS application, the
number of term instances in each result could be determined by counting
the number of integers in the return value of the [offsets] function.
The following example shows a query that could be used to obtain the
ten most relevant results for a query entered by the user:
The query above could be made to run faster by using the FTS [matchinfo]
function to determine the number of query term instances that appear in each
result. The matchinfo function is much more efficient than the offsets
function. Furthermore, the matchinfo function provides extra information
regarding the overall number of occurrences of each query term in the entire
document set (not just the current row) and the number of documents in which
each query term appears. This may be used (for example) to attach a higher
weight to less common terms which may increase the overall computed relevancy
of those results the user considers more interesting.
The SQL query in the example above uses less CPU than the first example
in this section, but still has a non-obvious performance problem. SQLite
satisfies this query by retrieving the value of the "title" column and
matchinfo data from the FTS module for every row matched by the users
query before it sorts and limits the results. Because of the way SQLite's
virtual table interface works, retrieving the value of the "title" column
requires loading the entire row from disk (including the "content" field,
which may be quite large). This means that if the users query matches
several thousand documents, many megabytes of "title" and "content" data
may be loaded from disk into memory even though they will never be used
for any purpose.
The SQL query in the following example block is one solution to this
problem. In SQLite, when a sub-query
used in a join contains a LIMIT clause, the results of the sub-query are
calculated and stored in temporary table before the main query is executed.
This means that SQLite will load only the docid and matchinfo data for each
row matching the users query into memory, determine the docid values
corresponding to the ten most relevant documents, then load only the title
and content information for those 10 documents only. Because both the matchinfo
and docid values are gleaned entirely from the full-text index, this results
in dramatically less data being loaded from the database into memory.
The next block of SQL enhances the query with solutions to two other problems
that may arise in developing search applications using FTS:
The [snippet] function cannot be used with the above query. Because
the outer query does not include a "WHERE ... MATCH" clause, the snippet
function may not be used with it. One solution is to duplicate the WHERE
clause used by the sub-query in the outer query. The overhead associated
with this is usually negligible.
The relevancy of a document may depend on something other than just
the data available in the return value of matchinfo. For example
each document in the database may be assigned a static weight based
on factors unrelated to its content (origin, author, age, number
of references etc.). These values can be stored by the application
in a separate table that can be joined against the documents table
in the sub-query so that the rank function may access them.
This version of the query is very similar to that used by the
sqlite.org documentation search
application.
All the example queries above return the ten most relevant query results.
By modifying the values used with the OFFSET and LIMIT clauses, a query
to return (say) the next ten most relevant results is easy to construct.
This may be used to obtain the data required for a search applications second
and subsequent pages of results.
The next block contains an example rank function that uses matchinfo data
implemented in C. Instead of a single weight, it allows a weight to be
externally assigned to each column of each document. It may be registered
with SQLite like any other user function using [sqlite3_create_function].
Data Structures
Variable Length Integer (varint) Format
Decimal Hexadecimal Encoded Representation
43 0x000000000000002B 0x2B
200815 0x000000000003106F 0x9C 0xA0 0x0C
-1 0xFFFFFFFFFFFFFFFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0x01
Segment B-Tree Format
Column Interpretation
level
Between them, the contents of the "level" and "idx" fields define the
relative age of the segment b-tree. The smaller the value stored in the
"level" field, the more recently the segment b-tree was created. If two
segment b-trees are of the same "level", the segment with the larger
value stored in the "idx" column is more recent. The PRIMARY KEY constraint
on the %_segdir table prevents any two segments from having the same value
for both the "level" and "idx" fields.
idx See above.
start_block
The blockid that corresponds to the node with the smallest blockid that
belongs to this segment b-tree. Or zero if the entire segment b-tree
fits on the root node. If it exists, this node is always a leaf node.
leaves_end_block
The blockid that corresponds to the leaf node with the largest blockid
that belongs to this segment b-tree. Or zero if the entire segment b-tree
fits on the root node.
end_block
The blockid that corresponds to the interior node with the largest
blockid that belongs to this segment b-tree. Or zero if the entire segment
b-tree fits on the root node. If it exists, this node is always an
interior node.
root
Blob containing the root node of the segment b-tree.
Segment B-Tree Leaf Nodes
Segment B-Tree Interior Nodes
Doclist Format
Appendix A: Search Application Tips