Index: pages/fts3.in
==================================================================
--- pages/fts3.in
+++ pages/fts3.in
@@ -764,18 +764,18 @@
The following example illustrates precedence of operators using the standard
query syntax:
The FTS3 module provides three special SQL scalar functions that may be useful to the developers of full-text query systems: "snippet", "offsets" and "matchinfo". The purpose of the "snippet" and "offsets" functions is to allow @@ -999,55 +999,104 @@
- The matchinfo function returns a blob value. If used within a query that - uses the full-text index (not a "query by rowid" or "linear scan"), then - the blob consists of (2 + C * P * 3) 32-bit unsigned - integers in machine byte-order, where C is the number of columns - in the FTS3 table being queried, and P is the number of - matchable phrases in the query. - -
- Phrases and columns are both numbered from left to right starting from - zero. + The matchinfo function returns a blob value. If it is used within a query + that does not use the full-text index (a "query by rowid" or "linear scan"), + then the blob is zero bytes in size. Otherwise, the blob consists of zero + or more 32-bit unsigned integers in machine byte-order. The exact number + of integers in the returned array depends on both the query and the value + of the second argument (if any) passed to the matchinfo function. + +
+ The matchinfo function is called with either one or two arguments. As for + all auxiliary functions, the first argument must be the special hidden + column of an FTS3/4 table that has the same name as the table itself (see + above). The second argument, if it is specified, must be a text value + comprised only of the characters 'p', 'c', 'n', 'a', 'l', 's' and 'x'. + If no second argument is explicitly supplied, it defaults to "pcx". The + second argument is refered to as the "format string" below. + +
+ Characters in the matchinfo format string are processed from left to right. + Each character in the format string causes one or more 32-bit unsigned + integer values to be added to the returned array. The "values" column in + the following table contains the number of integer values appended to the + output buffer for each supported format string character. In the formula + given, cols is the number of columns in the FTS3/4 table, and + phrases is the number of matchable phrases in + the query.
Array Element | Interpretation - | |
---|---|---|
0 | - Number of matchable phrases in the query expression (value P in - the formula below). - | |
1 | - Number of columns in the FTS3 table being queried (value C in - the formula below). - - | |
2 + 3 * (c + C*p) + 0 | - Number of phrase matches for matchable phrase p in column - c of the current FTS3 table row. - | |
2 + 3 * (c + C*p) + 1 | - Sum of the number of phrase matches for matchable phrase p in - column c for all rows of the FTS3 table. - | |
2 + 3 * (c + C*p) + 2 | - Number of rows of the FTS3 table for which column c contains at - least one phrase match for matchable phrase p. + | |
Character | Values | Description + |
p | 1 | The number of matchable phrases in the query. + |
c | 1 | The number of user defined columns in the FTS3/4 + table (i.e. not including the docid or the hidden column with the same + name as the table). + |
x | 3 * cols * phrases + |
+ For each distinct combination of a phrase and table column, the
+ following three values:
+
+ hits_this_row = array[3 * (c + p*cols) + 0] + hits_all_rows = array[3 * (c + p*cols) + 1] + docs_with_hits = array[3 * (c + p*cols) + 2] ++ + |
n | 1 | The number of rows in the FTS4 table. This value is + only available when querying FTS4 tables, not FTS3. + |
a | cols | For each column, the average number of + tokens in the text values stored in the column (considering all rows in + the FTS4 table). This value is only available when querying FTS4 tables, + not FTS3. + |
l | cols | + For each column, the length of the value stored in the current row of the + FTS4 table, in tokens. This value is only available when querying + FTS4 tables, not FTS3. And only if the "matchinfo=fts3" directive was not + specified as part of the "CREATE VIRTUAL TABLE" statement used to create + the FTS4 table. + |
s | cols | For each column, the length of the longest + subsequence of phrase matches that the column value has in common + the query text. For example, if a table column contains the text + 'a b c d e' and the query is 'a c "d e"', then the length of the longest + common subsequence is 2 (phrase "c" followed by phrase "d e"). + + |
For example:
The matchinfo function is much faster than either the snippet or offsets functions. This is because the implementation of both snippet and offsets @@ -1077,22 +1138,16 @@ SELECT docid, matchinfo(tbl) FROM tbl WHERE tbl MATCH <query expression>; SELECT docid, offsets(tbl) FROM tbl WHERE tbl MATCH <query expression>;
- The matchinfo function provides much of the information required to calculate + The matchinfo function provides all the information required to calculate probabilistic "bag-of-words" relevancy scores such as Okapi BM25/BM25F that may - be used to order results in a full-text search application. Also often - used in such functions is the length or relative length of each document - or document field. Unfortunately, this information is not made available - by the matchinfo function as it would require loading extra data from the - database, potentially slowing matchinfo() down by an order of magnitude. - One solution is for the application to store the lengths of each document - or document field in a separate table for use in calculating relevancy - scores. Appendix A of this document, "[search application tips]", contains - an example of using the matchinfo() function efficiently. + be used to order results in a full-text search application. Appendix A of this + document, "[search application tips]", contains an example of using the + matchinfo() function efficiently.
An FTS3 tokenizer is a set of rules for extracting terms from a document