Each argument specified as part of a "CREATE VIRTUAL TABLE ... USING fts5 ..." statement is either a column name or a configuration option. A column name consists of a single FTS5 bareword or a single string literal quoted in any manner acceptable to SQLite. A configuration option consists of an FTS5 bareword - the option name - followed by an "=" character, followed by the option value. The option value is specified using either a single FTS5 bareword or a string literal, again quoted in any manner acceptable to the SQLite core. Anything else is a syntax error.
It is an error to attempt to name an fts5 table column "rowid" or "rank". This is not supported.
A configuration option consists of an FTS5 bareword - the option name -
followed by an "=" character, followed by a either an FTS5 bareword or a
string literal. For example:
There are currently the following configuration options:
The following block contains a summary of the FTS query syntax in BNF form.
A detailed explanation follows.
Within an FTS expression a string may be specified in one of two ways:
By enclosing it in double quotes ("). Within a string, any embedded double quote characters may be escaped SQL-style - by adding a second double-quote character.
As a bareword that includes no whitespace or reserved characters, and is not "AND", "OR" or "NOT" (case sensitive). Reserved characters are:
: ~ ! @ # $ % ^ & * ( ) + , =In other words, the top row of a regular US keyboard, the plus sign, comma and colon characters. Strings that include any of these characters must be quoted.
FTS queries are made up of phrases. A phrase is an ordered list of
one or more tokens. A string is transformed into a phrase by passing it to
the FTS table tokenizer. Two phrases can be concatenated into a single
large phrase using the "+" operator. For example, assuming the tokenizer
module being used tokenizes the input "one.two.three" to three separate
tokens, the following three queries all specify the same phrase:
A phrase matches a document if the document contains at least one sub-sequence of tokens that matches the sequence of tokens that make up the phrase.
If a "*" character follows a string within an FTS expression, then the final
token extracted from the string is marked as a prefix token. As you
might expect, a prefix token matches any document token of which it is a
prefix. For example, the first two queries in the following block will match
any document that contains the token "one" immediately followed by the token
"two" and then any token that begins with "thr".
The final query in the block above may not work as expected. Because the "*" character is inside the double-quotes, it will be passed to the tokenizer, which will likely discard it (or perhaps, depending on the specific tokenizer in use, include it as part of the final token) instead of recognizing it as a special FTS character.
Two or more phrases may be grouped into a NEAR group. A NEAR group
is specified by the token "NEAR" (case sensitive) followed by an open
parenthesis character, followed by two or more whitespace separated phrases, optionally followed by a comma and the numeric parameter N, followed by
a close parenthesis. For example:
If no N parameter is supplied, it defaults to 10. A NEAR group matches a document if the document contains at least one clump of tokens that:
For example:
A single phrase or NEAR group may be restricted to matching text within a
specified column of the FTS table by prefixing it with the column name
followed by a colon character. Column names may be specified using either
of the two forms described for strings above. Unlike strings that are part
of phrases, column names are not passed to the tokenizer module. Column
names are case-insensitive in the usual way for SQLite column names -
upper/lower case equivalence is understood for ASCII-range characters only.
Phrases and NEAR groups may be arranged into expressions using boolean operators. In order of precedence, from highest to lowest, the operators are:
Operator | Function |
---|---|
<query1> AND <query2>
| Matches if both query1 and query2 match. |
<query1> OR <query2>
| Matches if either query1 or query2 match. |
<query1> NOT <query2>
| Matches if query1 matches and query2 does not match. |
Parenthesis may be used to group expressions in order to modify operator
precedence in the usual ways. For example:
Phrases and NEAR groups may also be connected by implicit AND operators.
For simplicity, these are not shown in the BNF grammar above. Essentially, any
sequence of phrases or NEAR groups (including those restricted to matching
specified columns) separated only by whitespace are handled as if there were an
implicit AND operator between each pair of phrases or NEAR groups. Implicit
AND operators are never inserted after or before an expression enclosed in
parenthesis. For example:
By default, FTS5 maintains a single index recording the location of each token instance within the document set. This means that querying for complete tokens is fast, as it requires a single lookup, but querying for a prefix token can be slow, as it requires a range scan. For example, to query for the prefix token "abc*" requires a range scan of all tokens greater than or equal to "abc" and less than "abd".
A prefix index is a separate index that records the location of all instances of prefix tokens of a certain length in characters used to speed up queries for prefix tokens. For example, optimizing a query for prefix token "abc*" requires a prefix index of three-character prefixes.
To add prefix indexes to an FTS5 table, the "prefix" option is set to
either a single positive integer or a text value containing a white-space
separated list of one or more positive integer values. A prefix index is
created for each integer specified. If more than one "prefix" option is
specified as part of a single CREATE VIRTUAL TABLE statement, all apply.
The CREATE VIRTUAL TABLE "tokenize" option is used to configure the specific tokenizer used by the FTS5 table. The option argument must be either an FTS5 bareword, or an SQL text literal. The text of the argument is itself treated as a white-space series of one or more FTS5 barewords or SQL text literals. The first of these is the name of the tokenizer to use. The second and subsequent list elements, if they exist, are arguments passed to the tokenizer implementation.
Unlike option values and column names, SQL text literals intended as
tokenizers must be quoted using single quote characters. For example:
FTS5 features three built-in tokenizer modules, described in subsequent sections:
It is also possible to create custom tokenizers for FTS5. The API for doing so is [custom tokenizers | described here].
The unicode tokenizer classifies all unicode characters as either "separator" or "token" characters. By default all space and punctuation characters, as defined by Unicode 6.1, are considered separators, and all other characters as token characters. Each contiguous run of one or more token characters is considered to be a token. The tokenizer is case-insensitive according to the rules defined by Unicode 6.1.
By default, diacritics are removed from all Latin script characters. This means, for example, that "A", "a", "À", "à", "Â" and "â" are all considered to be equivalent.
Any arguments following "unicode61" in the token specification are treated as a list of alternating option names and values. Unicode61 supports the following options:
Option | Usage |
---|---|
remove_diacritics | This option should be set to "0" or "1". If it is set (the default), diacritics are removed from all latin script characters as described above. If it is clear, they are not. |
tokenchars | This option is used to specify additional unicode characters that should be considered token characters, even if they are white-space or punctuation characters according to Unicode 6.1. All characters in the string that this option is set to are considered token characters. |
separators | This option is used to specify additional unicode characters that should be considered as separator characters, even if they are token characters according to Unicode 6.1. All characters in the string that this option is set to are considered separators. |
For example:
The Ascii tokenizer is similar to the Unicode61 tokenizer, except that:
For example:
The porter tokenizer is a wrapper tokenizer. It takes the output of some other tokenizer and applies the porter stemming algorithm to each token before it returns it to FTS5. This allows search terms like "correction" to match similar words such as "corrected" or "correcting". The porter stemmer algorithm is designed for use with English language terms only - using it with other languages may or may not improve search utility.
By default, the porter tokenizer operates as a wrapper around the default
tokenizer (unicode61). Or, if one or more extra arguments are added to the
"tokenize" option following "porter", they are treated as a specification for
the underlying tokenizer that the porter stemmer uses. For example:
Normally, when a row is inserted into an FTS5 table, as well as the various full-text index entries and other data a copy of the row is stored in a private table managed by the FTS5 module. When column values are requested from the FTS5 table by the user or by an auxiliary function implementation, they are read from this private table. The "content" option may be used to create an FTS5 table that stores only FTS full-text index entries. Because the column values themselves are usually much larger than the associated full-text index entries, this can save significant database space.
There are two ways to use the "content" option:
A contentless FTS5 table is created by setting the "content" option to
an empty string. For example:
Contentless FTS5 tables do not support UPDATE or DELETE statements, or INSERT statements that do not supply a non-NULL value for the rowid field. Rows may be deleted from a contentless table using an [FTS5 delete command].
Attempting to read any column value except the rowid from a contentless FTS5 table returns an SQL NULL value.
An external content FTS5 table is created by setting the content
option to the name of a table, virtual table or view (hereafter the "content
table") within the same database. Whenever column values are required by
FTS5, it queries the content table as follows, with the rowid of the row
for which values are required bound to the SQL variable:
In the above, <content> is replaced by the name of the content table. By default, <content_rowid> is replaced by the literal text "rowid". Or, if the "content_rowid" option is set within the CREATE VIRTUAL TABLE statement, by the value of that option.
The "*" in the above query must expand to a set of columns consisting of the <column_rowid> column followed by each indexed column, in the same order as they are present in the external content fts5 table.
The content table may also be queried as follows:
It is still the responsibility of the user to ensure that the contents of
an external content FTS5 table are kept up to date with the content table.
One way to do this is with triggers. For example:
The built-in auxiliary function bm25() returns a real value indicating
how well the current row matches the full-text query. The better the match,
the larger the value returned. A query such as the following may be used
to return matches in order from best to worst match:
In order to calculate a documents score, the full-text query is separated into its component phrases. The bm25 score for document D and query Q is then calculated as follows:
In the above, nPhrase is the number of phrases in the query. |D| is the number of tokens in the current document, and avgdl is the average number of tokens in all documents within the FTS5 table. k1 and b are both constants, hard-coded at 1.2 and 0.75 respectively.
IDF(qi) is the inverse-document-frequency of query phrase i. It is calculated as follows, where N is the total number of rows in the FTS5 table and n(qi) is the total number of rows that contain at least one instance of phrase i:
Finally, f(qi,D) is the phrase frequency of phrase i. By default, this is simply the number of occurrences of the phrase within the current row. However, by passing extra real value arguments to the bm25() SQL function, each column of the table may be assigned a different weight and the phrase frequency calculated as follows:
where wc is the weight assigned to column c and
n(qi,c) is the number of occurrences of phrase i in
column c of the current row. The first argument passed to bm25()
following the table name is the weight assigned to the leftmost column of
the FTS5 table. The second is the weight assigned to the second leftmost
column, and so on. If there are not enough arguments for all table columns,
remaining columns are assigned a weight of 1.0. If there are too many
trailing arguments, the extras are ignored. For example:
Refer to wikipedia for more information regarding BM25 and its variants.
The highlight() function returns a copy of the text from a specified column of the current row with extra markup text inserted to mark the start and end of phrase matches.
The highlight() must be invoked with exactly three arguments following the table name. To be interpreted as follows:
For example:
In cases where two or more phrase instances overlap (share one or more
tokens in common), a single open and close marker is inserted for each set
of overlapping phrases. For example:
The snippet() function is similar to highlight(), except that instead of returning entire column values, it automatically selects and extracts a short fragment of document text to process and return. The snippet() function must be passed five parameters following the table name argument:
All FTS5 tables feature a special hidden column named "rank". If the current query is not a full-text query (i.e. if it does not include a MATCH operator), the value of the "rank" column is always NULL. Otherwise, in a full-text query, column rank contains by default the same value as would be returned by executing the bm25() auxiliary function with no trailing arguments.
The difference between reading from the rank column and using the bm25()
function directly within the query is only significant when sorting by the
returned value. In this case, using "rank" is faster than using bm25().
Instead of using bm25() with no trailing arguments, the specific auxiliary function mapped to the rank column may be configured either on a per-query basis, or by setting a different persistent default for the FTS table.
In order to change the mapping of the rank column for a single query,
a term similar to the following is added to the WHERE clause of a query:
The right-hand-side of the MATCH clause must be a constant expression that
evaluates to a string consisting of the auxiliary function to invoke, followed
by zero or more comma separated arguments within parenthesis. Arguments must
be SQL literals. For example:
The default mapping of the rank column for a table may be modified using the [FTS5 rank configuration option].
Instead of using a single data structure on disk to store the full-text index, FTS5 uses a series of b-trees. Each time a new transaction is committed, a new b-tree containing the contents of the committed transaction is written into the database file. When the full-text index is queried, each b-tree must be queried individually and the results merged before being returned to the user.
In order to prevent the number of b-trees in the database from becoming too large (slowing down queries), smaller b-trees are periodically merged into single larger b-trees containing the same data. By default, this happens automatically within INSERT, UPDATE or DELETE statements that modify the full-text index. The 'automerge' parameter determines how many smaller b-trees are merged together at a time. Setting it to a small value can speed up queries (as they have to query and merge the results from fewer b-trees), but can also slow down writing to the database (as each INSERT, UPDATE or DELETE statement has to do more work as part of the automatic merging process).
Each of the b-trees that make up the full-text index is assigned to a "level" based on its size. Level-0 b-trees are the smallest, as they contain the contents of a single transaction. Higher level b-trees are the result of merging two or more level-0 b-trees together and so they are larger. FTS5 begins to merge b-trees together once there exist M or more b-trees with the same level, where M is the value of the 'automerge' parameter.
The maximum allowed value for the 'automerge' parameter is 16. The default
value is 4. Setting the 'automerge' parameter to 0 disables the automatic
incremental merging of b-trees altogether.
The 'crisismerge' option is similar to 'automerge', in that it determines how and how often the component b-trees that make up the full-text index are merged together. Once there exist C or more b-trees on a single level within the full-text index, where C is the value of the 'crisismerge' option, all b-trees on the level are immediately merged into a single b-tree.
The difference between this option and the 'automerge' option is that when the 'automerge' limit is reached FTS5 only begins to merge the b-trees together. Most of the work is performed as part of subsequent INSERT, UPDATE or DELETE operations. Whereas when the 'crisismerge' limit is reached, the offending b-trees are all merged immediately. This means that an INSERT, UPDATE or DELETE that triggers a crisis-merge may take a long time to complete.
The default 'crisismerge' value is 16. There is no maximum limit. Attempting
to set the 'crisismerge' parameter to a value of 0 or 1 is equivalent to
setting it to the default value (16). It is an error to attempt to set the
'crisismerge' option to a negative value.
This command is only available with [FTS5 external content tables | external content] and [FTS5 contentless tables | contentless] tables. It is used to delete the index entries associated with a single row from the full-text index. This command and the [FTS5 delete-all command | delete-all] command are the only ways to remove entries from the full-text index of a contentless table.
In order to use this command to delete a row, the text value 'delete'
must be inserted into the special column with the same name as the table.
The rowid of the row to delete is inserted into the rowid column. The
values inserted into the other columns must match the values currently
stored in the table. For example:
If the values "inserted" into the text columns as part of a 'delete' command are not the same as those currently stored within the table, the results may be unpredictable.
todo: explain why
This command is only available with [FTS5 external content tables |
external content] and [FTS5 contentless tables | contentless] tables. It
deletes all entries from the full-text index.
This command is used to verify that the full-text index is consistent with the contents of the FTS5 table or [FTS5 external content tables | content table]. It is not available with [FTS5 contentless tables | contentless tables].
The integrity-check command is invoked by inserting the text value
'integrity-check' into the special column with the same name as the FTS5
table. For example:
If the full-text index is consistent with the contents of the table, the INSERT used to invoke the integrity-check command succeeds. Or, if any discrepancy is found, it fails with an [SQLITE_CORRUPT_VTAB] error.
This command merges all individual b-trees that currently make up the full-text index into a single large b-tree structure. This ensures that the full-text index consumes the mimimum space within the database and is in the fastest form to query.
Refer to the documentation for the [FTS5 automerge option] for more details
regarding the relationship between the full-text index and its component
b-trees.
This command is used to set the persistent "pgsz" option.
The full-text index maintained by FTS5 is stored as a series of fixed-size
blobs in a database table. It is not strictly necessary for all blobs that make
up a full-text index to be the same size. The pgsz option determines the size
of all blobs created by subsequent index writers. The default value is 1000.
This command is used to set the persistent "rank" option.
The rank option is used to change the default auxiliary function mapping
for the rank column. The option should be set to a text value in the same
format as described for [auxiliary function mapping | "rank MATCH ?"] terms
above. For example:
This command first deletes the entire full-text index, then rebuilds it
based on the contents of the table or [FTS5 external content tables | content
table]. It is not available with [FTS5 contentless tables | contentless
tables].
FTS5 features APIs allowing it to be extended by:
The built-in tokenizers and auxiliary functions described in this documented are all implemented using the publicly available API described below.
Before a new auxiliary function or tokenizer implementation may be
registered with FTS5, an application must obtain a pointer to the "fts5_api"
structure. There is one fts5_api structure for each database connection with
which the FTS5 extension is registered. To obtain the pointer, the application
invokes the SQL user-defined function fts5(), which returns a blob value
containing the pointer to the fts5_api structure for the connection. The
following example code demonstrates the technique:
The fts5_api structure is defined as follows. It exposes three methods,
one each for registering new auxiliary functions and tokenizers, and one for
retrieving existing tokenizer. The latter is intended to facilitate the
implementation of "tokenizer wrappers" similar to the built-in
porter tokenizer.
To invoke a method of the fts5_api object, the fts5_api pointer itself
should be passed as the methods first argument followed by the other, method
specific, arguments. For example:
The fts5_api structure methods are described individually in the following sections.
To create a custom tokenizer, an application must implement three
functions: a tokenizer constructor (xCreate), a destructor (xDelete) and a
function to do the actual tokenization (xTokenize). The type of each
function is as for the member variables of the fts5_tokenizer struct:
When an FTS5 table uses the custom tokenizer, the FTS5 core calls xCreate()
once to create a tokenizer, then xTokenize() zero or more times to tokenize
strings, then xDelete() to free any resources allocated by xCreate(). More
specifically:
Implementing a custom auxiliary function is similar to implementing an
[application-defined SQL function | scalar SQL function]. The implementation
should be a C function of type fts5_extension_function, defined as follows:
The implementation is registered with the FTS5 module by calling the xCreateFunction() method of the fts5_api object. If there is already an auxiliary function with the same name, it is replaced by the new function. If a non-NULL xDestroy parameter is passed to xCreateFunction(), it is invoked with a copy of the pContext pointer passed as the only argument when the database handle is closed or when the registered auxiliary function is replaced.
The final three arguments passed to the auxiliary function callback are similar to the three arguments passed to the implementation of a scalar SQL function. All arguments except the first passed to the auxiliary function are available to the implementation in the apVal[] array. The implementation should return a result or error via the content handle pCtx.
The first argument passed to an auxiliary function callback is a pointer
to a structure containing methods that may be invoked in order to obtain
information regarding the current query or row. The second argument is an
opaque handle that should be passed as the first argument to any such method
invocation. For example, the following auxiliary function definition returns
the total number of tokens in all columns of the current row:
The following section describes the API offered to auxiliary function implementations in detail. Further examples may be found in the "fts5_aux.c" file of the source code.