Documentation Source Text

Check-in [4deb70dd76]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Update documentation for fts3 matchinfo function.
Downloads: Tarball | ZIP archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 4deb70dd769f52143faf4beed00d5db046ab6de8
User & Date: dan 2010-11-26 18:23:56.000
Context
2010-11-26
21:06
Updates to the FTS3 documentation to include additional discussion and links to FTS4 concepts. (check-in: b8865a0011 user: drh tags: trunk)
18:23
Update documentation for fts3 matchinfo function. (check-in: 4deb70dd76 user: dan tags: trunk)
15:18
Further updates to pragma documentation. Mark deprecated pragmas as such. Tag pragmas used for debugging only. (check-in: e3dca045a2 user: drh tags: trunk)
Changes
Unified Diff Ignore Whitespace Patch
Changes to pages/fts3.in.
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783

<ol><li style="list-style:none">
  The following example illustrates precedence of operators using the standard 
  query syntax:
</ol>

<codeblock>
  <i>-- Search for documents that contains at least one of the terms "database"</i>
  <i>-- and "sqlite", and also contains the term "library". Because of the differences</i>
  <i>-- in operator precedences, this query would have a different interpretation using</i>
  <i>-- the enhanced query syntax.</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite OR database library';
</codeblock>

<h1 tags="snippet, offsets">auxiliary functions - Snippet, Offsets and Matchinfo</h1>

<p>
  The FTS3 module provides three special SQL scalar functions that may be useful
  to the developers of full-text query systems: "snippet", "offsets" and
  "matchinfo". The purpose of the "snippet" and "offsets" functions is to allow
  the user to identify the location of queried terms in the returned documents.
  The "matchinfo" function provides the user with metrics that may be useful







|
|





|







762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783

<ol><li style="list-style:none">
  The following example illustrates precedence of operators using the standard 
  query syntax:
</ol>

<codeblock>
  <i>-- Search for documents that contain at least one of the terms "database"</i>
  <i>-- and "sqlite", and also contain the term "library". Because of the differences</i>
  <i>-- in operator precedences, this query would have a different interpretation using</i>
  <i>-- the enhanced query syntax.</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite OR database library';
</codeblock>

<h1 tags="snippet, offsets">Auxiliary Functions - Snippet, Offsets and Matchinfo</h1>

<p>
  The FTS3 module provides three special SQL scalar functions that may be useful
  to the developers of full-text query systems: "snippet", "offsets" and
  "matchinfo". The purpose of the "snippet" and "offsets" functions is to allow
  the user to identify the location of queried terms in the returned documents.
  The "matchinfo" function provides the user with metrics that may be useful
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008

1009








1010
1011
1012






1013
1014
1015
1016
1017
1018


1019

1020






1021


1022













1023
1024
1025




1026
1027






1028
1029


1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044

1045
1046

1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064












1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
  <i>--</i>
  SELECT snippet(text, '&#91; '&#93;', '...') FROM text WHERE text MATCH '"min* tem*"'
</codeblock>

<h2 id=matchinfo tags=matchinfo>The Matchinfo Function</h2>

<p>
  The matchinfo function returns a blob value. If used within a query that
  uses the full-text index (not a "query by rowid" or "linear scan"), then
  the blob consists of (2 + <i>C</i> * <i>P</i> * 3) 32-bit unsigned 
  integers in machine byte-order, where <i>C</i> is the number of columns 
  in the FTS3 table being queried, and <i>P</i> is the number of 

  <a href=#matchable>matchable phrases</a> in the query. 









  <p>
    Phrases and columns are both numbered from left to right starting from 






    zero.

<table striped=1>
  <tr><th>Array Element <th>Interpretation
  <tr><td>0 <td> 
    Number of matchable phrases in the query expression (value <i>P</i> in 


    the formula below).

  <tr><td>1 <td>






    Number of columns in the FTS3 table being queried (value <i>C</i> in 


    the formula below).














  <tr><td style="white-space:nowrap">2 + 3 * (<i>c</i> + <i>C</i>*<i>p</i>) + 0<td>
    Number of phrase matches for matchable phrase <i>p</i> in column 




    <i>c</i> of the current FTS3 table row.
  <tr><td style="white-space:nowrap">2 + 3 * (<i>c</i> + <i>C</i>*<i>p</i>) + 1<td>






    Sum of the number of phrase matches for matchable phrase <i>p</i> in 
    column <i>c</i> for all rows of the FTS3 table.


  <tr><td style="white-space:nowrap">2 + 3 * (<i>c</i> + <i>C</i>*<i>p</i>) + 2<td>
    Number of rows of the FTS3 table for which column <i>c</i> contains at 
    least one phrase match for matchable phrase <i>p</i>.
</table>

<p>
  For example:

<codeblock>
  <i>-- Create and populate an FTS3 table with two columns:</i>
  CREATE VIRTUAL TABLE t1 USING fts3(a, b);
  INSERT INTO t1 VALUES('transaction default models default', 'Non transaction reads');
  INSERT INTO t1 VALUES('the default transaction', 'these semantics present');
  INSERT INTO t1 VALUES('single request', 'default data');


  <i>-- The following query returns a single row consisting of a single blob</i>
  <i>-- value 80 bytes in size (20 32-bit integers). If each block of 4 bytes in</i>

  <i>-- the blob is interpreted as an unsigned integer in machine byte-order,</i>
  <i>-- the integers will be:</i>
  <i>--</i>
  <i>--     3 2  1 3 2  0 1 1  1 2 2  0 1 1  0 0 0  1 1 1</i>
  <i>--</i>
  <i>-- The row returned corresponds to the second entry inserted into table t1.</i>
  <i>-- The first two integers in the blob show that the query contained three</i>
  <i>-- phrases and the table being queried has two columns. The next block of</i>
  <i>-- three integers describes column 0 (in this case column "a") and phrase</i>
  <i>-- 0 (in this case "default"). The current row contains 1 hit for "default"</i>
  <i>-- in column 0, of a total of 3 hits for "default" that occur in column</i>
  <i>-- 0 of any table row. The 3 hits are spread across 2 different rows.</i>
  <i>--</i>
  <i>-- The next set of three integers (0 1 1) pertain to the hits for "default"</i>
  <i>-- in column 1 of the table (0 in this row, 1 in all rows, spread across </i>
  <i>-- 1 rows).</i>
  <i>--</i>
  SELECT matchinfo(t1) FROM t1 WHERE t1 MATCH 'default transaction "these semantics"';












</codeblock>

<p>
  The matchinfo function is much faster than either the snippet or offsets
  functions. This is because the implementation of both snippet and offsets
  is required to retrieve the documents being analyzed from disk, whereas
  all data required by matchinfo is available as part of the same portions
  of the full-text index that are required to implement the full-text query
  itself. This means that of the following two queries, the first may be
  an order of magnitude faster than the second:

<codeblock>
  SELECT docid, matchinfo(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;;
  SELECT docid, offsets(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;;
</codeblock>

<p>
  The matchinfo function provides much of the information required to calculate
  probabilistic "bag-of-words" relevancy scores such as 
  <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may
  be used to order results in a full-text search application. Also often
  used in such functions is the length or relative length of each document
  or document field. Unfortunately, this information is not made available
  by the matchinfo function as it would require loading extra data from the
  database, potentially slowing matchinfo() down by an order of magnitude.
  One solution is for the application to store the lengths of each document
  or document field in a separate table for use in calculating relevancy
  scores. Appendix A of this document, "[search application tips]", contains
  an example of using the matchinfo() function efficiently.

<h1 id=tokenizer tags="tokenizer">Tokenizers</h1>

<p>
  An FTS3 tokenizer is a set of rules for extracting terms from a document 
  or basic FTS3 full-text query. 








|
|
|
|
|
>
|
>
>
>
>
>
>
>
>

|
|
>
>
>
>
>
>
|


<
|
|
>
>
|
>
|
>
>
>
>
>
>
|
>
>
|
>
>
>
>
>
>
>
>
>
>
>
>
>

<
|
>
>
>
>
|
|
>
>
>
>
>
>
|
|
>
>
|
|
<






|
|




>
|
|
>
|
<
















>
>
>
>
>
>
>
>
>
>
>
>

















|


|
<
<
<
<
<
<
|
|







997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030

1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061

1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080

1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097

1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146






1147
1148
1149
1150
1151
1152
1153
1154
1155
  <i>--</i>
  SELECT snippet(text, '&#91; '&#93;', '...') FROM text WHERE text MATCH '"min* tem*"'
</codeblock>

<h2 id=matchinfo tags=matchinfo>The Matchinfo Function</h2>

<p>
  The matchinfo function returns a blob value. If it is used within a query
  that does not use the full-text index (a "query by rowid" or "linear scan"),
  then the blob is zero bytes in size. Otherwise, the blob consists of zero
  or more 32-bit unsigned integers in machine byte-order. The exact number
  of integers in the returned array depends on both the query and the value
  of the second argument (if any) passed to the matchinfo function.

<p>
  The matchinfo function is called with either one or two arguments. As for
  all auxiliary functions, the first argument must be the special hidden
  column of an FTS3/4 table that has the same name as the table itself (see 
  above). The second argument, if it is specified, must be a text value
  comprised only of the characters 'p', 'c', 'n', 'a', 'l', 's' and 'x'.
  If no second argument is explicitly supplied, it defaults to "pcx". The
  second argument is refered to as the "format string" below.

<p>
  Characters in the matchinfo format string are processed from left to right. 
  Each character in the format string causes one or more 32-bit unsigned
  integer values to be added to the returned array. The "values" column in
  the following table contains the number of integer values appended to the
  output buffer for each supported format string character. In the formula
  given, <i>cols</i> is the number of columns in the FTS3/4 table, and 
  <i>phrases</i> is the number of <a href=#matchable>matchable phrases</a> in 
  the query. 

<table striped=1>

  <tr><th>Character<th>Values<th>Description
  <tr><td>p <td>1 <td>The number of matchable phrases in the query.
  <tr><td>c <td>1 <td>The number of user defined columns in the FTS3/4
    table (i.e. not including the docid or the hidden column with the same 
    name as the table).
  <tr><td>x <td style="white-space:nowrap">3 * <i>cols</i> * <i>phrases</i> 
    <td>
      For each distinct combination of a phrase and table column, the
      following three values:
      <ul>
        <li> In the current row, the number of times the phrase appears in 
             the column.
        <li> The total number of times the phrase appears in the column in
             all rows in the FTS3/4 table.
        <li> The total number of rows in the FTS3/4 table for which the 
             column contains at least one instance of the phrase.
      </ul>
      The first set of three values corresponds to the left-most column
      of the table (column 0) and the left-most matchable phrase in the
      query (phrase 0). If the table has more than one column, the second
      set of three values in the output array correspond to phrase 0 and
      column 1. Followed by phrase 0, column 2 and so on for all columns of
      the table. And so on for phrase 1, column 0, then phrase 1, column 1
      etc. In other words, the data for occurences of phrase <i>p</i> in
      column <i>c</i> may be found using the following formula:
<pre>
          hits_this_row  = array&#91;3 * (c + p*cols) + 0&#93;
          hits_all_rows  = array&#91;3 * (c + p*cols) + 1&#93;
          docs_with_hits = array&#91;3 * (c + p*cols) + 2&#93;
</pre>


  <tr><td>n <td>1 <td>The number of rows in the FTS4 table. This value is
    only available when querying FTS4 tables, not FTS3.
  <tr><td>a <td><i>cols</i> <td>For each column, the average number of
    tokens in the text values stored in the column (considering all rows in
    the FTS4 table). This value is only available when querying FTS4 tables,
    not FTS3.  
  <tr><td>l <td><i>cols</i> <td>
    For each column, the length of the value stored in the current row of the
    FTS4 table, in tokens.  This value is only available when querying
    FTS4 tables, not FTS3. And only if the "matchinfo=fts3" directive was not
    specified as part of the "CREATE VIRTUAL TABLE" statement used to create
    the FTS4 table.
  <tr><td>s <td><i>cols</i> <td>For each column, the length of the longest 
    subsequence of phrase matches that the column value has in common
    the query text. For example, if a table column contains the text
    'a b c d e' and the query is 'a c "d e"', then the length of the longest
    common subsequence is 2 (phrase "c" followed by phrase "d e").
    
    

</table>

<p>
  For example:

<codeblock>
  <i>-- Create and populate an FTS4 table with two columns:</i>
  CREATE VIRTUAL TABLE t1 USING fts4(a, b);
  INSERT INTO t1 VALUES('transaction default models default', 'Non transaction reads');
  INSERT INTO t1 VALUES('the default transaction', 'these semantics present');
  INSERT INTO t1 VALUES('single request', 'default data');

  <i>-- In the following query, no format string is specified and so it defaults</i>
  <i>-- to "pcx". It therefore returns a single row consisting of a single blob</i>
  <i>-- value 80 bytes in size (20 32-bit integers - 1 for "p", 1 for "c" and</i>
  <i>-- 3*2*3 for "x"). If each block of 4 bytes in</i> the blob is interpreted</i> 
  <i>-- as an unsigned integer in machine byte-order, the values will be:</i>

  <i>--</i>
  <i>--     3 2  1 3 2  0 1 1  1 2 2  0 1 1  0 0 0  1 1 1</i>
  <i>--</i>
  <i>-- The row returned corresponds to the second entry inserted into table t1.</i>
  <i>-- The first two integers in the blob show that the query contained three</i>
  <i>-- phrases and the table being queried has two columns. The next block of</i>
  <i>-- three integers describes column 0 (in this case column "a") and phrase</i>
  <i>-- 0 (in this case "default"). The current row contains 1 hit for "default"</i>
  <i>-- in column 0, of a total of 3 hits for "default" that occur in column</i>
  <i>-- 0 of any table row. The 3 hits are spread across 2 different rows.</i>
  <i>--</i>
  <i>-- The next set of three integers (0 1 1) pertain to the hits for "default"</i>
  <i>-- in column 1 of the table (0 in this row, 1 in all rows, spread across </i>
  <i>-- 1 rows).</i>
  <i>--</i>
  SELECT matchinfo(t1) FROM t1 WHERE t1 MATCH 'default transaction "these semantics"';

  <i>-- The format string for this query is "nl". The output array will therefore</i>
  <i>-- contain 3 integer values - 1 for "n" and 2 for "l". The query returns</i>
  <i>-- two rows (the first two rows in the table match). The values returned are:</i>
  <i>--</i>
  <i>--     3  1 1</i>
  <i>--     3  2 0</i>
  <i>--</i>
  <i>-- The first value in the matchinfo array returned for both rows is 3 (the </i>
  <i>-- number of rows in the table). The following two values are the lengths </i>
  <i>-- of the longest common subsequence of phrase matches in each column.</i>
  SELECT matchinfo(t1, 'nl') FROM t1 WHERE t1 MATCH 'default transaction';
</codeblock>

<p>
  The matchinfo function is much faster than either the snippet or offsets
  functions. This is because the implementation of both snippet and offsets
  is required to retrieve the documents being analyzed from disk, whereas
  all data required by matchinfo is available as part of the same portions
  of the full-text index that are required to implement the full-text query
  itself. This means that of the following two queries, the first may be
  an order of magnitude faster than the second:

<codeblock>
  SELECT docid, matchinfo(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;;
  SELECT docid, offsets(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;;
</codeblock>

<p>
  The matchinfo function provides all the information required to calculate
  probabilistic "bag-of-words" relevancy scores such as 
  <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may
  be used to order results in a full-text search application. Appendix A of this 






  document, "[search application tips]", contains an example of using the
  matchinfo() function efficiently.

<h1 id=tokenizer tags="tokenizer">Tokenizers</h1>

<p>
  An FTS3 tokenizer is a set of rules for extracting terms from a document 
  or basic FTS3 full-text query.