FTS5 is an SQLite [virtual table module] that provides full-text search functionality to database applications. In their most elementary form, full-text search engines allow the user to efficiently search a large collection of documents for the subset that contain one or more instances of a search term. The search functionality provided to world wide web users by Google is, among other things, a full-text search engine, as it allows users to search for all documents on the web that contain, for example, the term "fts5".
To use FTS5, the user creates an FTS5 virtual table with one or more
columns. For example:
It is an error to add types, constraints or [PRIMARY KEY] declarations to a CREATE VIRTUAL TABLE statement used to create an FTS5 table. Once created, an FTS5 table may be populated using [INSERT], [UPDATE] or [DELETE] statements like any other table. Like any other table with no PRIMARY KEY declaration, an FTS5 table has an implicit INTEGER PRIMARY KEY field named rowid.
Not shown in the example above is that there are also [FTS5 CREATE TABLE Options | various options] that may be provided to FTS5 as part of the CREATE VIRTUAL TABLE statement to configure various aspects of the new table. These may be used to modify the way in which the FTS5 table extracts terms from documents and queries, to create extra indexes on disk to speed up prefix queries, or to create an FTS5 table that acts as an index on content stored elsewhere.
Once populated, there are three ways to execute a full-text query against the contents of an FTS5 table:
If using the MATCH or = operators, the expression to the left of the MATCH
operator is usually the name of the FTS5 table (the exception is when
[FTS5 column filters | specifying a column-filter]). The expression on the right
must be a text value specifying the term to search for. For the table-valued
function syntax, the term to search for is specified as the first table argument.
For example:
By default, FTS5 full-text searches are case-independent. Like any other
SQL query that does not contain an ORDER BY clause, the example above returns
results in an arbitrary order. To sort results by relevance (most to least
relevant), an ORDER BY may be added to a full-text query as follows:
As well as the column values and rowid of a matching row, an application
may use [FTS5 auxiliary functions] to retrieve extra information regarding
the matched row. For example, an auxiliary function may be used to retrieve
a copy of a column value for a matched row with all instances of the matched
term surrounded by html <b></b> tags. Auxiliary functions are
invoked in the same way as SQLite [corefunc | scalar functions], except that the name
of the FTS5 table is specified as the first argument. For example:
A description of the available auxiliary functions, and more details regarding configuration of the special "rank" column, are [FTS5 auxiliary functions | available below]. [FTS5 custom auxiliary functions| Custom auxiliary functions] may also be implemented in C and registered with FTS5, just as custom SQL functions may be registered with the SQLite core.
As well as searching for all rows that contain a term, FTS5 allows the user to search for rows that contain:
Such advanced searches are requested by providing a more complicated FTS5 query string as the text to the right of the MATCH operator (or = operator, or as the first argument to a table-valued function syntax). The full query syntax is [FTS5 query syntax | described here].
As of [version 3.9.0] ([dateof:3.9.0]), FTS5 is included as part of the SQLite [amalgamation]. It is disabled by default. If using the two autoconf build system, it is enabled by specifying the "--enable-fts5" option when running the configure script.
Or, if sqlite3.c is compiled using some other build system, by arranging for the SQLITE_ENABLE_FTS5 pre-processor symbol to be defined.
Alternatively, FTS5 may be built as a loadable extension.
The canonical FTS5 source code consists of a series of *.c and other files in the "ext/fts5" directory of the SQLite source tree. A build process reduces this to just two files - "fts5.c" and "fts5.h" - which may be used to build an SQLite loadable extension.
The code in "fts5.c" may then be compiled into a loadable extension or statically linked into an application as described in [Compiling Loadable Extensions]. There are two entry points defined, both of which do the same thing:
The other file, "fts5.h", is not required to compile the FTS5 extension. It is used by applications that implement [Extending FTS5 | custom FTS5 tokenizers or auxiliary functions].
The following block contains a summary of the FTS query syntax in BNF form.
A detailed explanation follows.
Within an FTS expression a string may be specified in one of two ways:
By enclosing it in double quotes ("). Within a string, any embedded double quote characters may be escaped SQL-style - by adding a second double-quote character.
As an FTS5 bareword that is not "AND", "OR" or "NOT" (case sensitive). An FTS5 bareword is a string of one or more consecutive characters that are all either:
FTS queries are made up of phrases. A phrase is an ordered list of
one or more tokens. A string is transformed into a phrase by passing it to
the FTS table tokenizer. Two phrases can be concatenated into a single
large phrase using the "+" operator. For example, assuming the tokenizer
module being used tokenizes the input "one.two.three" to three separate
tokens, the following three queries all specify the same phrase:
A phrase matches a document if the document contains at least one sub-sequence of tokens that matches the sequence of tokens that make up the phrase.
If a "*" character follows a string within an FTS expression, then the final
token extracted from the string is marked as a prefix token. As you
might expect, a prefix token matches any document token of which it is a
prefix. For example, the first two queries in the following block will match
any document that contains the token "one" immediately followed by the token
"two" and then any token that begins with "thr".
The final query in the block above may not work as expected. Because the "*" character is inside the double-quotes, it will be passed to the tokenizer, which will likely discard it (or perhaps, depending on the specific tokenizer in use, include it as part of the final token) instead of recognizing it as a special FTS character.
Two or more phrases may be grouped into a NEAR group. A NEAR group
is specified by the token "NEAR" (case sensitive) followed by an open
parenthesis character, followed by two or more whitespace separated phrases, optionally followed by a comma and the numeric parameter N, followed by
a close parenthesis. For example:
If no N parameter is supplied, it defaults to 10. A NEAR group matches a document if the document contains at least one clump of tokens that:
For example:
A single phrase or NEAR group may be restricted to matching text within a
specified column of the FTS table by prefixing it with the column name
followed by a colon character. Or to a set of columns by prefixing it
with a whitespace separated list of column names enclosed in parenthesis
("curly brackets") followed by a colon character. Column names may be specified
using either of the two forms described for strings above. Unlike strings that
are part of phrases, column names are not passed to the tokenizer module.
Column names are case-insensitive in the usual way for SQLite column names -
upper/lower case equivalence is understood for ASCII-range characters only.
If a column filter specification is preceded by a "-" character, then
it is interpreted as a list of column not to match against. For example:
Column filter specifications may also be applied to arbitrary expressions
enclosed in parenthesis. In this case the column filter applies to all
phrases within the expression. Nested column filter operations may only
further restrict the subset of columns matched, they can not be used to
re-enable filtered columns. For example:
Finally, a column filter for a single column may be specified by using
the column name as the LHS of a MATCH operator (instead of the usual
table name). For example:
Phrases and NEAR groups may be arranged into expressions using boolean operators. In order of precedence, from highest (tightest grouping) to lowest (loosest grouping), the operators are:
Operator | Function |
---|---|
<query1> NOT <query2>
| Matches if query1 matches and query2 does not match. |
<query1> AND <query2>
| Matches if both query1 and query2 match. |
<query1> OR <query2>
| Matches if either query1 or query2 match. |
Parenthesis may be used to group expressions in order to modify operator
precedence in the usual ways. For example:
Phrases and NEAR groups may also be connected by implicit AND operators.
For simplicity, these are not shown in the BNF grammar above. Essentially, any
sequence of phrases or NEAR groups (including those restricted to matching
specified columns) separated only by whitespace are handled as if there were an
implicit AND operator between each pair of phrases or NEAR groups. Implicit
AND operators are never inserted after or before an expression enclosed in
parenthesis. For example:
Each argument specified as part of a "CREATE VIRTUAL TABLE ... USING fts5 ..." statement is either a column declaration or a configuration option. A column declaration consists of one or more whitespace separated FTS5 barewords or string literals quoted in any manner acceptable to SQLite.
The first string or bareword in a column declaration is the column name. It is an error to attempt to name an fts5 table column "rowid" or "rank", or to assign the same name to a column as is used by the table itself. This is not supported.
Each subsequent string or bareword in a column declaration is a column option that modifies the behaviour of that column. Column options are case-independent. Unlike the SQLite core, FTS5 considers unrecognized column options to be errors. Currently, the only option recognized is [unindexed | "UNINDEXED" (see below)].
A configuration option consists of an FTS5 bareword - the option name -
followed by an "=" character, followed by the option value. The option value is
specified using either a single FTS5 bareword or a string literal, again quoted
in any manner acceptable to the SQLite core. For example:
There are currently the following configuration options:
The contents of columns qualified with the UNINDEXED column option are not added to the FTS index. This means that for the purposes of MATCH queries and [FTS5 auxiliary functions], the column contains no matchable tokens.
For example, to avoid adding the contents of the "uuid" field to the FTS
index:
By default, FTS5 maintains a single index recording the location of each token instance within the document set. This means that querying for complete tokens is fast, as it requires a single lookup, but querying for a prefix token can be slow, as it requires a range scan. For example, to query for the prefix token "abc*" requires a range scan of all tokens greater than or equal to "abc" and less than "abd".
A prefix index is a separate index that records the location of all instances of prefix tokens of a certain length in characters used to speed up queries for prefix tokens. For example, optimizing a query for prefix token "abc*" requires a prefix index of three-character prefixes.
To add prefix indexes to an FTS5 table, the "prefix" option is set to
either a single positive integer or a text value containing a white-space
separated list of one or more positive integer values. A prefix index is
created for each integer specified. If more than one "prefix" option is
specified as part of a single CREATE VIRTUAL TABLE statement, all apply.
The CREATE VIRTUAL TABLE "tokenize" option is used to configure the specific tokenizer used by the FTS5 table. The option argument must be either an FTS5 bareword, or an SQL text literal. The text of the argument is itself treated as a white-space series of one or more FTS5 barewords or SQL text literals. The first of these is the name of the tokenizer to use. The second and subsequent list elements, if they exist, are arguments passed to the tokenizer implementation.
Unlike option values and column names, SQL text literals intended as
tokenizers must be quoted using single quote characters. For example:
FTS5 features three built-in tokenizer modules, described in subsequent sections:
It is also possible to create custom tokenizers for FTS5. The API for doing so is [custom tokenizers | described here].
The unicode tokenizer classifies all unicode characters as either "separator" or "token" characters. By default all space and punctuation characters, as defined by Unicode 6.1, are considered separators, and all other characters as token characters. Each contiguous run of one or more token characters is considered to be a token. The tokenizer is case-insensitive according to the rules defined by Unicode 6.1.
By default, diacritics are removed from all Latin script characters. This means, for example, that "A", "a", "À", "à", "Â" and "â" are all considered to be equivalent.
Any arguments following "unicode61" in the token specification are treated as a list of alternating option names and values. Unicode61 supports the following options:
Option | Usage |
---|---|
remove_diacritics | This option should be set to "0" or "1". If it is set (the default), diacritics are removed from all latin script characters as described above. If it is clear, they are not. |
tokenchars | This option is used to specify additional unicode characters that should be considered token characters, even if they are white-space or punctuation characters according to Unicode 6.1. All characters in the string that this option is set to are considered token characters. |
separators | This option is used to specify additional unicode characters that should be considered as separator characters, even if they are token characters according to Unicode 6.1. All characters in the string that this option is set to are considered separators. |
For example:
The fts5 unicode61 tokenizer is byte-for-byte compatible with the fts3/4 unicode61 tokenizer.
The Ascii tokenizer is similar to the Unicode61 tokenizer, except that:
For example:
The porter tokenizer is a wrapper tokenizer. It takes the output of some other tokenizer and applies the porter stemming algorithm to each token before it returns it to FTS5. This allows search terms like "correction" to match similar words such as "corrected" or "correcting". The porter stemmer algorithm is designed for use with English language terms only - using it with other languages may or may not improve search utility.
By default, the porter tokenizer operates as a wrapper around the default
tokenizer (unicode61). Or, if one or more extra arguments are added to the
"tokenize" option following "porter", they are treated as a specification for
the underlying tokenizer that the porter stemmer uses. For example:
Normally, when a row is inserted into an FTS5 table, as well as the various full-text index entries and other data a copy of the row is stored in a private table managed by the FTS5 module. When column values are requested from the FTS5 table by the user or by an auxiliary function implementation, they are read from this private table. The "content" option may be used to create an FTS5 table that stores only FTS full-text index entries. Because the column values themselves are usually much larger than the associated full-text index entries, this can save significant database space.
There are two ways to use the "content" option:
A contentless FTS5 table is created by setting the "content" option to
an empty string. For example:
Contentless FTS5 tables do not support UPDATE or DELETE statements, or INSERT statements that do not supply a non-NULL value for the rowid field. Contentless tables do not support REPLACE conflict handling. REPLACE and INSERT OR REPLACE statements are treated as regular INSERT statements. Rows may be deleted from a contentless table using an [FTS5 delete command].
Attempting to read any column value except the rowid from a contentless FTS5 table returns an SQL NULL value.
An external content FTS5 table is created by setting the content
option to the name of a table, virtual table or view (hereafter the "content
table") within the same database. Whenever column values are required by
FTS5, it queries the content table as follows, with the rowid of the row
for which values are required bound to the SQL variable:
In the above, <content> is replaced by the name of the content table.
By default, <content_rowid> is replaced by the literal text "rowid". Or,
if the "content_rowid" option is set within the CREATE VIRTUAL TABLE statement,
by the value of that option. <cols> is replaced by a comma-separated list
of the FTS5 table column names. For example:
The content table may also be queried as follows:
It is still the responsibility of the user to ensure that the contents of
an external content FTS5 table are kept up to date with the content table.
One way to do this is with triggers. For example:
Like contentless tables, external content tables do not support REPLACE conflict handling. Any operations that specify REPLACE conflict handling are handled using ABORT.
Normally, FTS5 maintains a special backing table within the database that
stores the size of each column value in tokens inserted into the main FTS5
table in a separate table. This backing table is used by the
xColumnSize API function, which is in turn used by
the built-in [FTS5 bm25 | bm25 ranking function] (and is likely to be useful
to other ranking functions as well).
In order to save space, this backing table may be omitted by setting the
columnsize option to zero. For example:
It is an error to set the columnsize option to any value other than
0 or 1.
If an FTS5 table is configured with columnsize=0 but is not a
[FTS5 contentless tables | contentless table], the xColumnSize API function
still works, but runs much more slowly. In this case, instead of reading
the value to return directly from the database, it reads the text value
itself and count the tokens within it on demand.
Or, if the table is also a [FTS5 contentless tables | contentless table],
then the following apply:
The xColumnSize API always returns -1. There is no way to determine
the number of tokens in a value stored within a contentless FTS5 table
configured with columnsize=0.
Each inserted row must be accompanied by an explicitly specified rowid
value. If a contentless table is configured with columnsize=0,
attempting to insert a NULL value into the rowid is an SQLITE_MISMATCH
error.
All queries on the table must be full-text queries. In other words,
they must use the MATCH or = operator with the table-name column as the
left-hand operand, or else use the table-valued function syntax. Any
query that is not a full-text query results in an error.
The name of the table in which the xColumnSize values are stored
(unless columnsize=0 is specified) is "<name>_docsize", where
<name> is the name of the FTS5 table itself. The
sqlite3_analyzer
tool may be used on an existing database in order to determine how much
space might be saved by recreating an FTS5 table using columnsize=0.
For each term in a document, the FTS index maintained by FTS5
stores the rowid of the document, the column number of the column that contains
the term and the offset of the term within the column value. The "detail"
option may be used to omit some of this information. This reduces the space
that the index consumes within the database file, but also reduces the
capability and efficiency of the system.
The detail option may be set to "full" (the default value), "column" or
"none". For example:
If the detail option is set to column, then for each term the FTS
index records the rowid and column number only, omitting the term offset
information. This results in the following restrictions:
If the detail option is set to none, then for each term the FTS
index records just the rowid is stored. Both column and offset information
are omitted. As well as the restrictions itemized above for detail=column
mode, this imposes the following extra limitations:
In one test that indexed a large set of emails (1636 MiB on disk), the FTS
index was 743 MiB on disk with detail=full, 340 MiB with detail=column and 134
MiB with detail=none.
Auxiliary functions are similar to [corefunc | SQL scalar functions],
except that they may only be used within full-text queries (those that use
the MATCH operator) on an FTS5 table. Their results are calculated based not
only on the arguments passed to them, but also on the current match and
matched row. For example, an auxiliary function may return a numeric value
indicating the accuracy of the match (see the [FTS5 bm25| bm25()] function),
or a fragment of text from the matched row that contains one or more
instances of the search terms (see the [FTS5 snippet | snippet()] function).
To invoke an auxiliary function, the name of the FTS5 table should be
specified as the first argument. Other arguments may follow the first,
depending on the specific auxiliary function being invoked. For example, to
invoke the "highlight" function:
The built-in auxiliary functions provided as part of FTS5 are described in
the following section. Applications may also implement
[FTS5 custom auxiliary functions | custom auxiliary functions in C].
FTS5 provides three built-in auxiliary functions:
The built-in auxiliary function bm25() returns a real value indicating
how well the current row matches the full-text query. The better the match,
the numerically smaller the value returned. A query such as the following may
be used to return matches in order from best to worst match:
In order to calculate a documents score, the full-text query is separated
into its component phrases. The bm25 score for document D and
query Q is then calculated as follows:
In the above, nPhrase is the number of phrases in the query.
|D| is the number of tokens in the current document, and
avgdl is the average number of tokens in all documents within the
FTS5 table. k1 and b are both constants,
hard-coded at 1.2 and 0.75 respectively.
The "-1" term at the start of the formula is not found in most
implementations of the BM25 algorithm. Without it, a better match is assigned
a numerically higher BM25 score. Since the default sorting order is
"ascending", this means that appending "ORDER BY bm25(fts)" to a query would
cause results to be returned in order from worst to best. The "DESC" keyword
would be required in order to return the best matches first. In order to
avoid this pitfall, the FTS5 implementation of BM25 multiplies the result
by -1 before returning it, ensuring that better matches are assigned
numerically lower scores.
IDF(qi) is the inverse-document-frequency of query
phrase i. It is calculated as follows, where N is the total
number of rows in the FTS5 table and n(qi) is the total
number of rows that contain at least one instance of phrase i:
Finally, f(qi,D) is the phrase frequency of phrase
i. By default, this is simply the number of occurrences of the phrase
within the current row. However, by passing extra real value arguments to
the bm25() SQL function, each column of the table may be assigned a different
weight and the phrase frequency calculated as follows:
where wc is the weight assigned to column c and
n(qi,c) is the number of occurrences of phrase i in
column c of the current row. The first argument passed to bm25()
following the table name is the weight assigned to the leftmost column of
the FTS5 table. The second is the weight assigned to the second leftmost
column, and so on. If there are not enough arguments for all table columns,
remaining columns are assigned a weight of 1.0. If there are too many
trailing arguments, the extras are ignored. For example:
Refer to wikipedia for
more information regarding
BM25 and its variants.
The highlight() function returns a copy of the text from a specified
column of the current row with extra markup text inserted to mark the start
and end of phrase matches.
The highlight() must be invoked with exactly three arguments following
the table name. To be interpreted as follows:
For example:
In cases where two or more phrase instances overlap (share one or more
tokens in common), a single open and close marker is inserted for each set
of overlapping phrases. For example:
The snippet() function is similar to highlight(), except that instead of
returning entire column values, it automatically selects and extracts a
short fragment of document text to process and return. The snippet() function
must be passed five parameters following the table name argument:
All FTS5 tables feature a special hidden column named "rank". If the
current query is not a full-text query (i.e. if it does not include a MATCH
operator), the value of the "rank" column is always NULL. Otherwise, in a
full-text query, column rank contains by default the same value as would be
returned by executing the bm25() auxiliary function with no trailing
arguments.
The difference between reading from the rank column and using the bm25()
function directly within the query is only significant when sorting by the
returned value. In this case, using "rank" is faster than using bm25().
Instead of using bm25() with no trailing arguments, the specific auxiliary
function mapped to the rank column may be configured either on a per-query
basis, or by setting a different persistent default for the FTS table.
In order to change the mapping of the rank column for a single query,
a term similar to either of the following is added to the WHERE clause of a
query:
The right-hand-side of the MATCH or = operator must be a constant
expression that evaluates to a string consisting of the auxiliary function to
invoke, followed by zero or more comma separated arguments within parenthesis.
Arguments must be SQL literals. For example:
The table-valued function syntax may also be used to specify an alternative
ranking function. In this case the text describing the ranking function should
be specified as the second table-valued function argument. The following three
queries are equivalent:
The default mapping of the rank column for a table may be modified
using the [FTS5 rank configuration option].
Instead of using a single data structure on disk to store the full-text
index, FTS5 uses a series of b-trees. Each time a new transaction is
committed, a new b-tree containing the contents of the committed transaction
is written into the database file. When the full-text index is queried, each
b-tree must be queried individually and the results merged before being
returned to the user.
In order to prevent the number of b-trees in the database from becoming too
large (slowing down queries), smaller b-trees are periodically merged into
single larger b-trees containing the same data. By default, this happens
automatically within INSERT, UPDATE or DELETE statements that modify the
full-text index. The 'automerge' parameter determines how many smaller
b-trees are merged together at a time. Setting it to a small value can
speed up queries (as they have to query and merge the results from fewer
b-trees), but can also slow down writing to the database (as each INSERT,
UPDATE or DELETE statement has to do more work as part of the automatic
merging process).
Each of the b-trees that make up the full-text index is assigned to a "level"
based on its size. Level-0 b-trees are the smallest, as they contain the
contents of a single transaction. Higher level b-trees are the result of
merging two or more level-0 b-trees together and so they are larger. FTS5
begins to merge b-trees together once there exist M or more b-trees
with the same level, where M is the value of the 'automerge'
parameter.
The maximum allowed value for the 'automerge' parameter is 16. The default
value is 4. Setting the 'automerge' parameter to 0 disables the automatic
incremental merging of b-trees altogether.
The 'crisismerge' option is similar to 'automerge', in that it determines
how and how often the component b-trees that make up the full-text index are
merged together. Once there exist C or more b-trees on a single level
within the full-text index, where C is the value of the 'crisismerge'
option, all b-trees on the level are immediately merged into a single b-tree.
The difference between this option and the 'automerge' option is that when
the 'automerge' limit is reached FTS5 only begins to merge the b-trees
together. Most of the work is performed as part of subsequent INSERT,
UPDATE or DELETE operations. Whereas when the 'crisismerge' limit is reached,
the offending b-trees are all merged immediately. This means that an INSERT,
UPDATE or DELETE that triggers a crisis-merge may take a long time to
complete.
The default 'crisismerge' value is 16. There is no maximum limit. Attempting
to set the 'crisismerge' parameter to a value of 0 or 1 is equivalent to
setting it to the default value (16). It is an error to attempt to set the
'crisismerge' option to a negative value.
This command is only available with [FTS5 external content tables |
external content] and [FTS5 contentless tables | contentless] tables. It
is used to delete the index entries associated with a single row from the
full-text index. This command and the [FTS5 delete-all command | delete-all]
command are the only ways to remove entries from the full-text index of a
contentless table.
In order to use this command to delete a row, the text value 'delete'
must be inserted into the special column with the same name as the table.
The rowid of the row to delete is inserted into the rowid column. The
values inserted into the other columns must match the values currently
stored in the table. For example:
If the values "inserted" into the text columns as part of a 'delete'
command are not the same as those currently stored within the table, the
results may be unpredictable.
The reason for this is easy to understand: When a document is inserted
into the FTS5 table, an entry is added to the full-text index to record the
position of each token within the new document. When a document is removed,
the original data is required in order to determine the set of entries that
need to be removed from the full-text index. So if the data supplied to FTS5
when a row is deleted using this command is different from that used to
determine the set of token instances when it was inserted, some full-text
index entries may not be correctly deleted, or FTS5 may try to remove index
entries that do not exist. This can leave the full-text index in an
unpredictable state, making future query results unreliable.
This command is only available with [FTS5 external content tables |
external content] and [FTS5 contentless tables | contentless] tables. It
deletes all entries from the full-text index.
This command is used to verify that the full-text index is consistent
with the contents of the FTS5 table or [FTS5 external content tables | content
table]. It is not available with [FTS5 contentless tables | contentless tables].
The integrity-check command is invoked by inserting the text value
'integrity-check' into the special column with the same name as the FTS5
table. For example:
If the full-text index is consistent with the contents of the table, the
INSERT used to invoke the integrity-check command succeeds. Or, if any
discrepancy is found, it fails with an [SQLITE_CORRUPT_VTAB] error.
This command merges b-tree structures together until roughly N pages
of merged data have been written to the database, where N is the absolute
value of the parameter specified as part of the 'merge' command. The size of
each page is as configured by the [FTS5 pgsz option].
If the parameter is a postive value, B-tree structures are only eligible
for merging if one of the following is true:
It is possible to tell whether or not the 'merge' command found any
b-trees to merge together by checking the value returned by the
[sqlite3_total_changes()] API before and after the command is executed. If
the difference between the two values is 2 or greater, then work was performed.
If the difference is less than 2, then the 'merge' command was a no-op. In this
case there is no reason to execute the same 'merge' command again, at least
until after the FTS table is next updated.
If the parameter is negative, and there are B-tree structures on more than
one level within the FTS index, all B-tree structures are assigned to the same
level before the merge operation is commenced. Additionally, if the parameter
is negative, the value of the usermerge configuration option is not
respected - as few as two b-trees from the same level may be merged together.
The above means that executing the 'merge' command with a negative
parameter until the before and after difference in the return value of
[sqlite3_total_changes()] is less than two optimizes the FTS index in the
same way as the [FTS5 optimize command]. However, if a new b-tree is added
to the FTS index while this process is ongoing, FTS5 will move the new
b-tree to the same level as the existing b-trees and restart the merge. To
avoid this, only the first call to 'merge' should specify a negative parameter.
Each subsequent call to 'merge' should specify a postive value so that the
merge started by the first call is run to completion even if new b-trees are
added to the FTS index.
This command merges all individual b-trees that currently make up the
full-text index into a single large b-tree structure. This ensures that the
full-text index consumes the minimum space within the database and is in the
fastest form to query.
Refer to the documentation for the [FTS5 automerge option] for more details
regarding the relationship between the full-text index and its component
b-trees.
Because it reorganizes the entire FTS index, the optimize command can
take a long time to run. The [FTS5 merge command] can be used to divide
the work of optimizing the FTS index into multiple steps. To do this:
where N is the number of pages of data to merge within each invocation of
the merge command. The application should stop invoking merge when the
difference in the value returned by the sqlite3_total_changes() function before
and after the merge command drops to below two. The merge commands may be
issued as part of the same or separate transactions, and by the same or
different database clients. Refer to the documentation for the
[FTS5 merge command | merge command] for further details.
This command is used to set the persistent "pgsz" option.
The full-text index maintained by FTS5 is stored as a series of fixed-size
blobs in a database table. It is not strictly necessary for all blobs that make
up a full-text index to be the same size. The pgsz option determines the size
of all blobs created by subsequent index writers. The default value is 1000.
This command is used to set the persistent "rank" option.
The rank option is used to change the default auxiliary function mapping
for the rank column. The option should be set to a text value in the same
format as described for [auxiliary function mapping | "rank MATCH ?"] terms
above. For example:
This command first deletes the entire full-text index, then rebuilds it
based on the contents of the table or [FTS5 external content tables | content
table]. It is not available with [FTS5 contentless tables | contentless
tables].
This command is used to set the persistent "usermerge" option.
The usermerge option is similar to the automerge and crisismerge options.
It is the minimum number of b-tree segments that will be merged together by
a 'merge' command with a positive parameter. For example:
The default value of the usermerge option is 4. The minimum allowed value
is 2, and the maximum 16.
FTS5 features APIs allowing it to be extended by:
The built-in tokenizers and auxiliary functions described in this
document are all implemented using the publicly available API described
below.
Before a new auxiliary function or tokenizer implementation may be
registered with FTS5, an application must obtain a pointer to the "fts5_api"
structure. There is one fts5_api structure for each database connection with
which the FTS5 extension is registered. To obtain the pointer, the application
invokes the SQL user-defined function fts5() with a single argument. That
argument must be set to a pointer to a pointer to an fts5_api object
using the [sqlite3_bind_pointer()] interface.
The following example code demonstrates the technique:
Backwards Compatibility Warning:
Prior to SQLite version 3.20.0 ([dateof:3.20.0]), the fts5() worked slightly
differently. Older applications that extend FTS5 must be revised to use
the new technique shown above.
The fts5_api structure is defined as follows. It exposes three methods,
one each for registering new auxiliary functions and tokenizers, and one for
retrieving existing tokenizer. The latter is intended to facilitate the
implementation of "tokenizer wrappers" similar to the built-in
porter tokenizer.
To invoke a method of the fts5_api object, the fts5_api pointer itself
should be passed as the methods first argument followed by the other, method
specific, arguments. For example:
The fts5_api structure methods are described individually in the following
sections.
To create a custom tokenizer, an application must implement three
functions: a tokenizer constructor (xCreate), a destructor (xDelete) and a
function to do the actual tokenization (xTokenize). The type of each
function is as for the member variables of the fts5_tokenizer struct:
When an FTS5 table uses the custom tokenizer, the FTS5 core calls xCreate()
once to create a tokenizer, then xTokenize() zero or more times to tokenize
strings, then xDelete() to free any resources allocated by xCreate(). More
specifically:
Implementing a custom auxiliary function is similar to implementing a
[application-defined SQL function | scalar SQL function]. The implementation
should be a C function of type fts5_extension_function, defined as follows:
The implementation is registered with the FTS5 module by calling the
xCreateFunction() method of the fts5_api object. If there is already an
auxiliary function with the same name, it is replaced by the new function.
If a non-NULL xDestroy parameter is passed to xCreateFunction(), it is invoked
with a copy of the pContext pointer passed as the only argument when the
database handle is closed or when the registered auxiliary function is
replaced.
The final three arguments passed to the auxiliary function callback are
similar to the three arguments passed to the implementation of a scalar SQL
function. All arguments except the first passed to the auxiliary function are
available to the implementation in the apVal[] array. The
implementation should return a result or error via the content handle pCtx.
The first argument passed to an auxiliary function callback is a pointer
to a structure containing methods that may be invoked in order to obtain
information regarding the current query or row. The second argument is an
opaque handle that should be passed as the first argument to any such method
invocation. For example, the following auxiliary function definition returns
the total number of tokens in all columns of the current row:
The following section describes the API offered to auxiliary function
implementations in detail. Further examples may be found in the "fts5_aux.c"
file of the source code.
The fts5vocab virtual table module allows users to extract information from
an FTS5 full-text index directly. The fts5vocab module is a part of FTS5 - it
is available whenever FTS5 is.
Each fts5vocab table is associated with a single FTS5 table. An fts5vocab
table is usually created by specifying two arguments in place of column names
in the CREATE VIRTUAL TABLE statement - the name of the associated FTS5 table
and the type of fts5vocab table. Currently there are two types of fts5vocab
table, "row" and "col". Unless the fts5vocab table is created within the "temp"
database, it must be part of the same database as the associated FTS5 table.
If an fts5vocab table is created in the temp database, it may be associated
with an FTS5 table in any attached database. In order to attach the fts5vocab
table to an FTS5 table located in a database other than "temp", the name of the
database is inserted before the FTS5 table name in the CREATE VIRTUAL TABLE
arguments. For example:
Specifying three arguments when creating an fts5vocab table in any database
other than "temp" results in an error.
An fts5vocab table of type "row" contains one row for each distinct term
in the associated FTS5 table. The table columns are as follows:
The Detail Option
Auxiliary Functions
Built-in Auxiliary Functions
The bm25() function
The highlight() function
The snippet() function
Sorting by Auxiliary Function Results
Special INSERT Commands
The 'automerge' Configuration Option
The 'crisismerge' Configuration Option
The 'delete' Command
The 'delete-all' Command
The 'integrity-check' Command
The 'merge' Command
The 'optimize' Command
The 'pgsz' Configuration Option
The 'rank' Configuration Option
The 'rebuild' Command
The 'usermerge' Configuration Option
Extending FTS5
Custom Tokenizers
Custom Auxiliary Functions
Custom Auxiliary Functions API Reference
The fts5vocab Virtual Table Module
Column | Contents |
---|---|
term | The term, as stored in the FTS5 index. |
doc | The number of rows that contain at least one instance of the term. |
cnt | The total number of instances of the term in the entire FTS5 table. |
An fts5vocab table of type "col" contains one row for each distinct term/column combination in the associated FTS5 table. Table columns are as follows:
Column | Contents |
---|---|
term | The term, as stored in the FTS5 index. |
col | The name of the FTS5 table column that contains the term. |
doc | The number of rows in the FTS5 table for which column $col contains at least one instance of the term. |
cnt | The total number of instances of the term that appear in column $col of the FTS5 table (considering all rows). |
Example:
Also available is the similar but more mature [fts3 | FTS3/4] module. FTS5 is a new version of FTS4 that includes various fixes and solutions for problems that could not be fixed in FTS4 without sacrificing backwards compatibility. Some of these problems are [fts5 technical differences | described below].
In order to use FTS5 instead of FTS3 or FTS4, applications usually require minimal modifications. Most of these fall into three categories - changes required to the CREATE VIRTUAL TABLE statement used to create the FTS table, changes required to SELECT queries used to execute queries against the table, and changes required to applications that use [FTS auxiliary functions].
The module name must be changed from "fts3" or "fts4" to "fts5".
All type information or constraint specifications must be removed from column definitions. FTS3/4 ignores everything following the column name in a column definition, FTS5 attempts to parse it (and will report an error if it fails to).
The "matchinfo=fts3" option is not available. The [FTS5 columnsize option | "columnsize=0"] option is equivalent.
The notindexed= option is not available. Adding [unindexed | UNINDEXED] to the column definition is equivalent.
The ICU tokenizer is not available.
The compress=, uncompress= and languageid= options are not available. There is as of yet no equivalent for their functionality.
The "docid" alias does not exist. Applications must use "rowid" instead.
The behaviour of queries when a column-filter is specified both as
part of the FTS query and by using a column as the LHS of a MATCH
operator is slightly different. For a table with columns "a" and "b"
and a query similar to:
FTS3/4 searches for matches in column "b". However, FTS5 always returns zero rows, as results are first filtered for column "b", then for column "a", leaving no results. In other words, in FTS3/4 the inner filter overrides the outer, in FTS5 both filters are applied.
The FTS query syntax (right hand side of the MATCH operator) has changed in some ways. The FTS5 syntax is quite close to the FTS4 "enhanced syntax". The main difference is that FTS5 is fussier about unrecognized punctuation characters and similar within query strings. Most queries that work with FTS3/4 should also work with FTS5, and those that do not should return parse errors.
FTS5 has no matchinfo() or offsets() function, and the snippet() function is not as fully-featured as in FTS3/4. However, since FTS5 does provide an API allowing applications to create [custom auxiliary functions], any required functionality may be implemented within the application code.
The set of built-in auxiliary functions provided by FTS5 may be improved upon in the future.
The functionality provided by the fts4aux module is now provided by [fts5vocab]. The schema of these two tables is slightly different.
The FTS3/4 "merge=X,Y" command has been replaced by the [FTS5 merge command].
The FTS3/4 "automerge=X" command has been replaced by the [FTS5 automerge option].
FTS5 is similar to FTS3/4 in that the primary task of each is to maintain
an index mapping from each unique token to a list of instances of that token
within a set of documents, where each instance is identified by the document
in which it appears and its position within that document. For example:
In the example above, each triple identifies the location of a token instance by rowid, column number (columns are numbered sequentially starting at 0 from left to right) and position within the column value (the first token in a column value is 0, the second is 1, and so on). Using this index, FTS5 is able to provide timely answers to queries such as "the set of all documents that contain the token 'A'", or "the set of all documents that contain the sequence 'Y Z'". The list of instances associated with a single token is called an "instance-list".
The principle difference between FTS3/4 and FTS5 is that in FTS3/4, each instance-list is stored as a single large database record, whereas in FTS5 large instance-lists are divided between multiple database records. This has the following implications for dealing with large databases that contain large lists:
FTS5 is able to load instance-lists into memory incrementally in order to reduce memory usage and peak allocation size. FTS3/4 very often loads entire instance-lists into memory.
When processing queries that feature more than one token, FTS5 is sometimes able to determine that the query can be answered by inspecting a subset of a large instance-list. FTS3/4 almost always has to traverse entire instance-lists.
For these reasons, many complex queries may use less memory and run faster using FTS5.
Some other ways in which FTS5 differs from FTS3/4 are:
FTS5 supports "ORDER BY rank" for returning results in order of decreasing relevancy.
FTS5 features an API allowing users to create custom auxiliary functions for advanced ranking and text processing applications. The special "rank" column may be mapped to a custom auxiliary function so that adding "ORDER BY rank" to a query works as expected.
FTS5 recognizes unicode separator characters and case equivalence by default. This is also possible using FTS3/4, but must be explicitly enabled.
The query syntax has been revised where necessary to remove ambiguities and to make it possible to escape special characters in query terms.
By default, FTS3/4 occasionally merges together two or more of the b-trees that make up its full-text index within an INSERT, UPDATE or DELETE statement executed by the user. This means that any operation on an FTS3/4 table may turn out to be surprisingly slow, as FTS3/4 may unpredictably choose to merge together two or more large b-trees within it. FTS5 uses incremental merging by default, which limits the amount of processing that may take place within any given INSERT, UPDATE or DELETE operation.