Documentation Source Text

Check-in [4deb70dd76]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Update documentation for fts3 matchinfo function.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 4deb70dd769f52143faf4beed00d5db046ab6de8
User & Date: dan 2010-11-26 18:23:56
Context
2010-11-26
21:06
Updates to the FTS3 documentation to include additional discussion and links to FTS4 concepts. check-in: b8865a0011 user: drh tags: trunk
18:23
Update documentation for fts3 matchinfo function. check-in: 4deb70dd76 user: dan tags: trunk
15:18
Further updates to pragma documentation. Mark deprecated pragmas as such. Tag pragmas used for debugging only. check-in: e3dca045a2 user: drh tags: trunk
Changes
Hide Diffs Side-by-Side Diffs Ignore Whitespace Patch

Changes to pages/fts3.in.

   762    762   
   763    763   <ol><li style="list-style:none">
   764    764     The following example illustrates precedence of operators using the standard 
   765    765     query syntax:
   766    766   </ol>
   767    767   
   768    768   <codeblock>
   769         -  <i>-- Search for documents that contains at least one of the terms "database"</i>
   770         -  <i>-- and "sqlite", and also contains the term "library". Because of the differences</i>
          769  +  <i>-- Search for documents that contain at least one of the terms "database"</i>
          770  +  <i>-- and "sqlite", and also contain the term "library". Because of the differences</i>
   771    771     <i>-- in operator precedences, this query would have a different interpretation using</i>
   772    772     <i>-- the enhanced query syntax.</i>
   773    773     SELECT * FROM docs WHERE docs MATCH 'sqlite OR database library';
   774    774   </codeblock>
   775    775   
   776         -<h1 tags="snippet, offsets">auxiliary functions - Snippet, Offsets and Matchinfo</h1>
          776  +<h1 tags="snippet, offsets">Auxiliary Functions - Snippet, Offsets and Matchinfo</h1>
   777    777   
   778    778   <p>
   779    779     The FTS3 module provides three special SQL scalar functions that may be useful
   780    780     to the developers of full-text query systems: "snippet", "offsets" and
   781    781     "matchinfo". The purpose of the "snippet" and "offsets" functions is to allow
   782    782     the user to identify the location of queried terms in the returned documents.
   783    783     The "matchinfo" function provides the user with metrics that may be useful
................................................................................
   997    997     <i>--</i>
   998    998     SELECT snippet(text, '&#91; '&#93;', '...') FROM text WHERE text MATCH '"min* tem*"'
   999    999   </codeblock>
  1000   1000   
  1001   1001   <h2 id=matchinfo tags=matchinfo>The Matchinfo Function</h2>
  1002   1002   
  1003   1003   <p>
  1004         -  The matchinfo function returns a blob value. If used within a query that
  1005         -  uses the full-text index (not a "query by rowid" or "linear scan"), then
  1006         -  the blob consists of (2 + <i>C</i> * <i>P</i> * 3) 32-bit unsigned 
  1007         -  integers in machine byte-order, where <i>C</i> is the number of columns 
  1008         -  in the FTS3 table being queried, and <i>P</i> is the number of 
  1009         -  <a href=#matchable>matchable phrases</a> in the query. 
         1004  +  The matchinfo function returns a blob value. If it is used within a query
         1005  +  that does not use the full-text index (a "query by rowid" or "linear scan"),
         1006  +  then the blob is zero bytes in size. Otherwise, the blob consists of zero
         1007  +  or more 32-bit unsigned integers in machine byte-order. The exact number
         1008  +  of integers in the returned array depends on both the query and the value
         1009  +  of the second argument (if any) passed to the matchinfo function.
  1010   1010   
  1011         -  <p>
  1012         -    Phrases and columns are both numbered from left to right starting from 
  1013         -    zero.
         1011  +<p>
         1012  +  The matchinfo function is called with either one or two arguments. As for
         1013  +  all auxiliary functions, the first argument must be the special hidden
         1014  +  column of an FTS3/4 table that has the same name as the table itself (see 
         1015  +  above). The second argument, if it is specified, must be a text value
         1016  +  comprised only of the characters 'p', 'c', 'n', 'a', 'l', 's' and 'x'.
         1017  +  If no second argument is explicitly supplied, it defaults to "pcx". The
         1018  +  second argument is refered to as the "format string" below.
         1019  +
         1020  +<p>
         1021  +  Characters in the matchinfo format string are processed from left to right. 
         1022  +  Each character in the format string causes one or more 32-bit unsigned
         1023  +  integer values to be added to the returned array. The "values" column in
         1024  +  the following table contains the number of integer values appended to the
         1025  +  output buffer for each supported format string character. In the formula
         1026  +  given, <i>cols</i> is the number of columns in the FTS3/4 table, and 
         1027  +  <i>phrases</i> is the number of <a href=#matchable>matchable phrases</a> in 
         1028  +  the query. 
  1014   1029   
  1015   1030   <table striped=1>
  1016         -  <tr><th>Array Element <th>Interpretation
  1017         -  <tr><td>0 <td> 
  1018         -    Number of matchable phrases in the query expression (value <i>P</i> in 
  1019         -    the formula below).
  1020         -  <tr><td>1 <td>
  1021         -    Number of columns in the FTS3 table being queried (value <i>C</i> in 
  1022         -    the formula below).
  1023         -
  1024         -  <tr><td style="white-space:nowrap">2 + 3 * (<i>c</i> + <i>C</i>*<i>p</i>) + 0<td>
  1025         -    Number of phrase matches for matchable phrase <i>p</i> in column 
  1026         -    <i>c</i> of the current FTS3 table row.
  1027         -  <tr><td style="white-space:nowrap">2 + 3 * (<i>c</i> + <i>C</i>*<i>p</i>) + 1<td>
  1028         -    Sum of the number of phrase matches for matchable phrase <i>p</i> in 
  1029         -    column <i>c</i> for all rows of the FTS3 table.
  1030         -  <tr><td style="white-space:nowrap">2 + 3 * (<i>c</i> + <i>C</i>*<i>p</i>) + 2<td>
  1031         -    Number of rows of the FTS3 table for which column <i>c</i> contains at 
  1032         -    least one phrase match for matchable phrase <i>p</i>.
         1031  +  <tr><th>Character<th>Values<th>Description
         1032  +  <tr><td>p <td>1 <td>The number of matchable phrases in the query.
         1033  +  <tr><td>c <td>1 <td>The number of user defined columns in the FTS3/4
         1034  +    table (i.e. not including the docid or the hidden column with the same 
         1035  +    name as the table).
         1036  +  <tr><td>x <td style="white-space:nowrap">3 * <i>cols</i> * <i>phrases</i> 
         1037  +    <td>
         1038  +      For each distinct combination of a phrase and table column, the
         1039  +      following three values:
         1040  +      <ul>
         1041  +        <li> In the current row, the number of times the phrase appears in 
         1042  +             the column.
         1043  +        <li> The total number of times the phrase appears in the column in
         1044  +             all rows in the FTS3/4 table.
         1045  +        <li> The total number of rows in the FTS3/4 table for which the 
         1046  +             column contains at least one instance of the phrase.
         1047  +      </ul>
         1048  +      The first set of three values corresponds to the left-most column
         1049  +      of the table (column 0) and the left-most matchable phrase in the
         1050  +      query (phrase 0). If the table has more than one column, the second
         1051  +      set of three values in the output array correspond to phrase 0 and
         1052  +      column 1. Followed by phrase 0, column 2 and so on for all columns of
         1053  +      the table. And so on for phrase 1, column 0, then phrase 1, column 1
         1054  +      etc. In other words, the data for occurences of phrase <i>p</i> in
         1055  +      column <i>c</i> may be found using the following formula:
         1056  +<pre>
         1057  +          hits_this_row  = array&#91;3 * (c + p*cols) + 0&#93;
         1058  +          hits_all_rows  = array&#91;3 * (c + p*cols) + 1&#93;
         1059  +          docs_with_hits = array&#91;3 * (c + p*cols) + 2&#93;
         1060  +</pre>
         1061  +
         1062  +  <tr><td>n <td>1 <td>The number of rows in the FTS4 table. This value is
         1063  +    only available when querying FTS4 tables, not FTS3.
         1064  +  <tr><td>a <td><i>cols</i> <td>For each column, the average number of
         1065  +    tokens in the text values stored in the column (considering all rows in
         1066  +    the FTS4 table). This value is only available when querying FTS4 tables,
         1067  +    not FTS3.  
         1068  +  <tr><td>l <td><i>cols</i> <td>
         1069  +    For each column, the length of the value stored in the current row of the
         1070  +    FTS4 table, in tokens.  This value is only available when querying
         1071  +    FTS4 tables, not FTS3. And only if the "matchinfo=fts3" directive was not
         1072  +    specified as part of the "CREATE VIRTUAL TABLE" statement used to create
         1073  +    the FTS4 table.
         1074  +  <tr><td>s <td><i>cols</i> <td>For each column, the length of the longest 
         1075  +    subsequence of phrase matches that the column value has in common
         1076  +    the query text. For example, if a table column contains the text
         1077  +    'a b c d e' and the query is 'a c "d e"', then the length of the longest
         1078  +    common subsequence is 2 (phrase "c" followed by phrase "d e").
         1079  +    
         1080  +    
  1033   1081   </table>
  1034   1082   
  1035   1083   <p>
  1036   1084     For example:
  1037   1085   
  1038   1086   <codeblock>
  1039         -  <i>-- Create and populate an FTS3 table with two columns:</i>
  1040         -  CREATE VIRTUAL TABLE t1 USING fts3(a, b);
         1087  +  <i>-- Create and populate an FTS4 table with two columns:</i>
         1088  +  CREATE VIRTUAL TABLE t1 USING fts4(a, b);
  1041   1089     INSERT INTO t1 VALUES('transaction default models default', 'Non transaction reads');
  1042   1090     INSERT INTO t1 VALUES('the default transaction', 'these semantics present');
  1043   1091     INSERT INTO t1 VALUES('single request', 'default data');
  1044   1092   
  1045         -  <i>-- The following query returns a single row consisting of a single blob</i>
  1046         -  <i>-- value 80 bytes in size (20 32-bit integers). If each block of 4 bytes in</i>
  1047         -  <i>-- the blob is interpreted as an unsigned integer in machine byte-order,</i>
  1048         -  <i>-- the integers will be:</i>
         1093  +  <i>-- In the following query, no format string is specified and so it defaults</i>
         1094  +  <i>-- to "pcx". It therefore returns a single row consisting of a single blob</i>
         1095  +  <i>-- value 80 bytes in size (20 32-bit integers - 1 for "p", 1 for "c" and</i>
         1096  +  <i>-- 3*2*3 for "x"). If each block of 4 bytes in</i> the blob is interpreted</i> 
         1097  +  <i>-- as an unsigned integer in machine byte-order, the values will be:</i>
  1049   1098     <i>--</i>
  1050   1099     <i>--     3 2  1 3 2  0 1 1  1 2 2  0 1 1  0 0 0  1 1 1</i>
  1051   1100     <i>--</i>
  1052   1101     <i>-- The row returned corresponds to the second entry inserted into table t1.</i>
  1053   1102     <i>-- The first two integers in the blob show that the query contained three</i>
  1054   1103     <i>-- phrases and the table being queried has two columns. The next block of</i>
  1055   1104     <i>-- three integers describes column 0 (in this case column "a") and phrase</i>
................................................................................
  1058   1107     <i>-- 0 of any table row. The 3 hits are spread across 2 different rows.</i>
  1059   1108     <i>--</i>
  1060   1109     <i>-- The next set of three integers (0 1 1) pertain to the hits for "default"</i>
  1061   1110     <i>-- in column 1 of the table (0 in this row, 1 in all rows, spread across </i>
  1062   1111     <i>-- 1 rows).</i>
  1063   1112     <i>--</i>
  1064   1113     SELECT matchinfo(t1) FROM t1 WHERE t1 MATCH 'default transaction "these semantics"';
         1114  +
         1115  +  <i>-- The format string for this query is "nl". The output array will therefore</i>
         1116  +  <i>-- contain 3 integer values - 1 for "n" and 2 for "l". The query returns</i>
         1117  +  <i>-- two rows (the first two rows in the table match). The values returned are:</i>
         1118  +  <i>--</i>
         1119  +  <i>--     3  1 1</i>
         1120  +  <i>--     3  2 0</i>
         1121  +  <i>--</i>
         1122  +  <i>-- The first value in the matchinfo array returned for both rows is 3 (the </i>
         1123  +  <i>-- number of rows in the table). The following two values are the lengths </i>
         1124  +  <i>-- of the longest common subsequence of phrase matches in each column.</i>
         1125  +  SELECT matchinfo(t1, 'nl') FROM t1 WHERE t1 MATCH 'default transaction';
  1065   1126   </codeblock>
  1066   1127   
  1067   1128   <p>
  1068   1129     The matchinfo function is much faster than either the snippet or offsets
  1069   1130     functions. This is because the implementation of both snippet and offsets
  1070   1131     is required to retrieve the documents being analyzed from disk, whereas
  1071   1132     all data required by matchinfo is available as part of the same portions
................................................................................
  1075   1136   
  1076   1137   <codeblock>
  1077   1138     SELECT docid, matchinfo(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;;
  1078   1139     SELECT docid, offsets(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;;
  1079   1140   </codeblock>
  1080   1141   
  1081   1142   <p>
  1082         -  The matchinfo function provides much of the information required to calculate
         1143  +  The matchinfo function provides all the information required to calculate
  1083   1144     probabilistic "bag-of-words" relevancy scores such as 
  1084   1145     <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may
  1085         -  be used to order results in a full-text search application. Also often
  1086         -  used in such functions is the length or relative length of each document
  1087         -  or document field. Unfortunately, this information is not made available
  1088         -  by the matchinfo function as it would require loading extra data from the
  1089         -  database, potentially slowing matchinfo() down by an order of magnitude.
  1090         -  One solution is for the application to store the lengths of each document
  1091         -  or document field in a separate table for use in calculating relevancy
  1092         -  scores. Appendix A of this document, "[search application tips]", contains
  1093         -  an example of using the matchinfo() function efficiently.
         1146  +  be used to order results in a full-text search application. Appendix A of this 
         1147  +  document, "[search application tips]", contains an example of using the
         1148  +  matchinfo() function efficiently.
  1094   1149   
  1095   1150   <h1 id=tokenizer tags="tokenizer">Tokenizers</h1>
  1096   1151   
  1097   1152   <p>
  1098   1153     An FTS3 tokenizer is a set of rules for extracting terms from a document 
  1099   1154     or basic FTS3 full-text query. 
  1100   1155