Index: pages/fts3.in ================================================================== --- pages/fts3.in +++ pages/fts3.in @@ -764,18 +764,18 @@ The following example illustrates precedence of operators using the standard query syntax: - -- Search for documents that contains at least one of the terms "database" - -- and "sqlite", and also contains the term "library". Because of the differences + -- Search for documents that contain at least one of the terms "database" + -- and "sqlite", and also contain the term "library". Because of the differences -- in operator precedences, this query would have a different interpretation using -- the enhanced query syntax. SELECT * FROM docs WHERE docs MATCH 'sqlite OR database library'; -

auxiliary functions - Snippet, Offsets and Matchinfo

+

Auxiliary Functions - Snippet, Offsets and Matchinfo

The FTS3 module provides three special SQL scalar functions that may be useful to the developers of full-text query systems: "snippet", "offsets" and "matchinfo". The purpose of the "snippet" and "offsets" functions is to allow @@ -999,55 +999,104 @@

The Matchinfo Function

- The matchinfo function returns a blob value. If used within a query that - uses the full-text index (not a "query by rowid" or "linear scan"), then - the blob consists of (2 + C * P * 3) 32-bit unsigned - integers in machine byte-order, where C is the number of columns - in the FTS3 table being queried, and P is the number of - matchable phrases in the query. - -

- Phrases and columns are both numbered from left to right starting from - zero. + The matchinfo function returns a blob value. If it is used within a query + that does not use the full-text index (a "query by rowid" or "linear scan"), + then the blob is zero bytes in size. Otherwise, the blob consists of zero + or more 32-bit unsigned integers in machine byte-order. The exact number + of integers in the returned array depends on both the query and the value + of the second argument (if any) passed to the matchinfo function. + +

+ The matchinfo function is called with either one or two arguments. As for + all auxiliary functions, the first argument must be the special hidden + column of an FTS3/4 table that has the same name as the table itself (see + above). The second argument, if it is specified, must be a text value + comprised only of the characters 'p', 'c', 'n', 'a', 'l', 's' and 'x'. + If no second argument is explicitly supplied, it defaults to "pcx". The + second argument is refered to as the "format string" below. + +

+ Characters in the matchinfo format string are processed from left to right. + Each character in the format string causes one or more 32-bit unsigned + integer values to be added to the returned array. The "values" column in + the following table contains the number of integer values appended to the + output buffer for each supported format string character. In the formula + given, cols is the number of columns in the FTS3/4 table, and + phrases is the number of matchable phrases in + the query. -
Array Element Interpretation -
0 - Number of matchable phrases in the query expression (value P in - the formula below). -
1 - Number of columns in the FTS3 table being queried (value C in - the formula below). - -
2 + 3 * (c + C*p) + 0 - Number of phrase matches for matchable phrase p in column - c of the current FTS3 table row. -
2 + 3 * (c + C*p) + 1 - Sum of the number of phrase matches for matchable phrase p in - column c for all rows of the FTS3 table. -
2 + 3 * (c + C*p) + 2 - Number of rows of the FTS3 table for which column c contains at - least one phrase match for matchable phrase p. +
CharacterValuesDescription +
p 1 The number of matchable phrases in the query. +
c 1 The number of user defined columns in the FTS3/4 + table (i.e. not including the docid or the hidden column with the same + name as the table). +
x 3 * cols * phrases + + For each distinct combination of a phrase and table column, the + following three values: +
    +
  • In the current row, the number of times the phrase appears in + the column. +
  • The total number of times the phrase appears in the column in + all rows in the FTS3/4 table. +
  • The total number of rows in the FTS3/4 table for which the + column contains at least one instance of the phrase. +
+ The first set of three values corresponds to the left-most column + of the table (column 0) and the left-most matchable phrase in the + query (phrase 0). If the table has more than one column, the second + set of three values in the output array correspond to phrase 0 and + column 1. Followed by phrase 0, column 2 and so on for all columns of + the table. And so on for phrase 1, column 0, then phrase 1, column 1 + etc. In other words, the data for occurences of phrase p in + column c may be found using the following formula: +
+          hits_this_row  = array[3 * (c + p*cols) + 0]
+          hits_all_rows  = array[3 * (c + p*cols) + 1]
+          docs_with_hits = array[3 * (c + p*cols) + 2]
+
+ +
n 1 The number of rows in the FTS4 table. This value is + only available when querying FTS4 tables, not FTS3. +
a cols For each column, the average number of + tokens in the text values stored in the column (considering all rows in + the FTS4 table). This value is only available when querying FTS4 tables, + not FTS3. +
l cols + For each column, the length of the value stored in the current row of the + FTS4 table, in tokens. This value is only available when querying + FTS4 tables, not FTS3. And only if the "matchinfo=fts3" directive was not + specified as part of the "CREATE VIRTUAL TABLE" statement used to create + the FTS4 table. +
s cols For each column, the length of the longest + subsequence of phrase matches that the column value has in common + the query text. For example, if a table column contains the text + 'a b c d e' and the query is 'a c "d e"', then the length of the longest + common subsequence is 2 (phrase "c" followed by phrase "d e"). + +

For example: - -- Create and populate an FTS3 table with two columns: - CREATE VIRTUAL TABLE t1 USING fts3(a, b); + -- Create and populate an FTS4 table with two columns: + CREATE VIRTUAL TABLE t1 USING fts4(a, b); INSERT INTO t1 VALUES('transaction default models default', 'Non transaction reads'); INSERT INTO t1 VALUES('the default transaction', 'these semantics present'); INSERT INTO t1 VALUES('single request', 'default data'); - -- The following query returns a single row consisting of a single blob - -- value 80 bytes in size (20 32-bit integers). If each block of 4 bytes in - -- the blob is interpreted as an unsigned integer in machine byte-order, - -- the integers will be: + -- In the following query, no format string is specified and so it defaults + -- to "pcx". It therefore returns a single row consisting of a single blob + -- value 80 bytes in size (20 32-bit integers - 1 for "p", 1 for "c" and + -- 3*2*3 for "x"). If each block of 4 bytes in the blob is interpreted + -- as an unsigned integer in machine byte-order, the values will be: -- -- 3 2 1 3 2 0 1 1 1 2 2 0 1 1 0 0 0 1 1 1 -- -- The row returned corresponds to the second entry inserted into table t1. -- The first two integers in the blob show that the query contained three @@ -1060,10 +1109,22 @@ -- The next set of three integers (0 1 1) pertain to the hits for "default" -- in column 1 of the table (0 in this row, 1 in all rows, spread across -- 1 rows). -- SELECT matchinfo(t1) FROM t1 WHERE t1 MATCH 'default transaction "these semantics"'; + + -- The format string for this query is "nl". The output array will therefore + -- contain 3 integer values - 1 for "n" and 2 for "l". The query returns + -- two rows (the first two rows in the table match). The values returned are: + -- + -- 3 1 1 + -- 3 2 0 + -- + -- The first value in the matchinfo array returned for both rows is 3 (the + -- number of rows in the table). The following two values are the lengths + -- of the longest common subsequence of phrase matches in each column. + SELECT matchinfo(t1, 'nl') FROM t1 WHERE t1 MATCH 'default transaction';

The matchinfo function is much faster than either the snippet or offsets functions. This is because the implementation of both snippet and offsets @@ -1077,22 +1138,16 @@ SELECT docid, matchinfo(tbl) FROM tbl WHERE tbl MATCH <query expression>; SELECT docid, offsets(tbl) FROM tbl WHERE tbl MATCH <query expression>;

- The matchinfo function provides much of the information required to calculate + The matchinfo function provides all the information required to calculate probabilistic "bag-of-words" relevancy scores such as Okapi BM25/BM25F that may - be used to order results in a full-text search application. Also often - used in such functions is the length or relative length of each document - or document field. Unfortunately, this information is not made available - by the matchinfo function as it would require loading extra data from the - database, potentially slowing matchinfo() down by an order of magnitude. - One solution is for the application to store the lengths of each document - or document field in a separate table for use in calculating relevancy - scores. Appendix A of this document, "[search application tips]", contains - an example of using the matchinfo() function efficiently. + be used to order results in a full-text search application. Appendix A of this + document, "[search application tips]", contains an example of using the + matchinfo() function efficiently.

Tokenizers

An FTS3 tokenizer is a set of rules for extracting terms from a document