Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Update documentation for fts3 matchinfo function. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
4deb70dd769f52143faf4beed00d5db0 |
User & Date: | dan 2010-11-26 18:23:56.000 |
Context
2010-11-26
| ||
21:06 | Updates to the FTS3 documentation to include additional discussion and links to FTS4 concepts. (check-in: b8865a0011 user: drh tags: trunk) | |
18:23 | Update documentation for fts3 matchinfo function. (check-in: 4deb70dd76 user: dan tags: trunk) | |
15:18 | Further updates to pragma documentation. Mark deprecated pragmas as such. Tag pragmas used for debugging only. (check-in: e3dca045a2 user: drh tags: trunk) | |
Changes
Changes to pages/fts3.in.
︙ | ︙ | |||
762 763 764 765 766 767 768 | <ol><li style="list-style:none"> The following example illustrates precedence of operators using the standard query syntax: </ol> <codeblock> | | | | | 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 | <ol><li style="list-style:none"> The following example illustrates precedence of operators using the standard query syntax: </ol> <codeblock> <i>-- Search for documents that contain at least one of the terms "database"</i> <i>-- and "sqlite", and also contain the term "library". Because of the differences</i> <i>-- in operator precedences, this query would have a different interpretation using</i> <i>-- the enhanced query syntax.</i> SELECT * FROM docs WHERE docs MATCH 'sqlite OR database library'; </codeblock> <h1 tags="snippet, offsets">Auxiliary Functions - Snippet, Offsets and Matchinfo</h1> <p> The FTS3 module provides three special SQL scalar functions that may be useful to the developers of full-text query systems: "snippet", "offsets" and "matchinfo". The purpose of the "snippet" and "offsets" functions is to allow the user to identify the location of queried terms in the returned documents. The "matchinfo" function provides the user with metrics that may be useful |
︙ | ︙ | |||
997 998 999 1000 1001 1002 1003 | <i>--</i> SELECT snippet(text, '[ ']', '...') FROM text WHERE text MATCH '"min* tem*"' </codeblock> <h2 id=matchinfo tags=matchinfo>The Matchinfo Function</h2> <p> | | | | | | > | > > > > > > > > | | > > > > > > | < | | > > | > | > > > > > > | > > | > > > > > > > > > > > > > < | > > > > | | > > > > > > | | > > | | < | | > | | > | < > > > > > > > > > > > > | | < < < < < < | | | 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 | <i>--</i> SELECT snippet(text, '[ ']', '...') FROM text WHERE text MATCH '"min* tem*"' </codeblock> <h2 id=matchinfo tags=matchinfo>The Matchinfo Function</h2> <p> The matchinfo function returns a blob value. If it is used within a query that does not use the full-text index (a "query by rowid" or "linear scan"), then the blob is zero bytes in size. Otherwise, the blob consists of zero or more 32-bit unsigned integers in machine byte-order. The exact number of integers in the returned array depends on both the query and the value of the second argument (if any) passed to the matchinfo function. <p> The matchinfo function is called with either one or two arguments. As for all auxiliary functions, the first argument must be the special hidden column of an FTS3/4 table that has the same name as the table itself (see above). The second argument, if it is specified, must be a text value comprised only of the characters 'p', 'c', 'n', 'a', 'l', 's' and 'x'. If no second argument is explicitly supplied, it defaults to "pcx". The second argument is refered to as the "format string" below. <p> Characters in the matchinfo format string are processed from left to right. Each character in the format string causes one or more 32-bit unsigned integer values to be added to the returned array. The "values" column in the following table contains the number of integer values appended to the output buffer for each supported format string character. In the formula given, <i>cols</i> is the number of columns in the FTS3/4 table, and <i>phrases</i> is the number of <a href=#matchable>matchable phrases</a> in the query. <table striped=1> <tr><th>Character<th>Values<th>Description <tr><td>p <td>1 <td>The number of matchable phrases in the query. <tr><td>c <td>1 <td>The number of user defined columns in the FTS3/4 table (i.e. not including the docid or the hidden column with the same name as the table). <tr><td>x <td style="white-space:nowrap">3 * <i>cols</i> * <i>phrases</i> <td> For each distinct combination of a phrase and table column, the following three values: <ul> <li> In the current row, the number of times the phrase appears in the column. <li> The total number of times the phrase appears in the column in all rows in the FTS3/4 table. <li> The total number of rows in the FTS3/4 table for which the column contains at least one instance of the phrase. </ul> The first set of three values corresponds to the left-most column of the table (column 0) and the left-most matchable phrase in the query (phrase 0). If the table has more than one column, the second set of three values in the output array correspond to phrase 0 and column 1. Followed by phrase 0, column 2 and so on for all columns of the table. And so on for phrase 1, column 0, then phrase 1, column 1 etc. In other words, the data for occurences of phrase <i>p</i> in column <i>c</i> may be found using the following formula: <pre> hits_this_row = array[3 * (c + p*cols) + 0] hits_all_rows = array[3 * (c + p*cols) + 1] docs_with_hits = array[3 * (c + p*cols) + 2] </pre> <tr><td>n <td>1 <td>The number of rows in the FTS4 table. This value is only available when querying FTS4 tables, not FTS3. <tr><td>a <td><i>cols</i> <td>For each column, the average number of tokens in the text values stored in the column (considering all rows in the FTS4 table). This value is only available when querying FTS4 tables, not FTS3. <tr><td>l <td><i>cols</i> <td> For each column, the length of the value stored in the current row of the FTS4 table, in tokens. This value is only available when querying FTS4 tables, not FTS3. And only if the "matchinfo=fts3" directive was not specified as part of the "CREATE VIRTUAL TABLE" statement used to create the FTS4 table. <tr><td>s <td><i>cols</i> <td>For each column, the length of the longest subsequence of phrase matches that the column value has in common the query text. For example, if a table column contains the text 'a b c d e' and the query is 'a c "d e"', then the length of the longest common subsequence is 2 (phrase "c" followed by phrase "d e"). </table> <p> For example: <codeblock> <i>-- Create and populate an FTS4 table with two columns:</i> CREATE VIRTUAL TABLE t1 USING fts4(a, b); INSERT INTO t1 VALUES('transaction default models default', 'Non transaction reads'); INSERT INTO t1 VALUES('the default transaction', 'these semantics present'); INSERT INTO t1 VALUES('single request', 'default data'); <i>-- In the following query, no format string is specified and so it defaults</i> <i>-- to "pcx". It therefore returns a single row consisting of a single blob</i> <i>-- value 80 bytes in size (20 32-bit integers - 1 for "p", 1 for "c" and</i> <i>-- 3*2*3 for "x"). If each block of 4 bytes in</i> the blob is interpreted</i> <i>-- as an unsigned integer in machine byte-order, the values will be:</i> <i>--</i> <i>-- 3 2 1 3 2 0 1 1 1 2 2 0 1 1 0 0 0 1 1 1</i> <i>--</i> <i>-- The row returned corresponds to the second entry inserted into table t1.</i> <i>-- The first two integers in the blob show that the query contained three</i> <i>-- phrases and the table being queried has two columns. The next block of</i> <i>-- three integers describes column 0 (in this case column "a") and phrase</i> <i>-- 0 (in this case "default"). The current row contains 1 hit for "default"</i> <i>-- in column 0, of a total of 3 hits for "default" that occur in column</i> <i>-- 0 of any table row. The 3 hits are spread across 2 different rows.</i> <i>--</i> <i>-- The next set of three integers (0 1 1) pertain to the hits for "default"</i> <i>-- in column 1 of the table (0 in this row, 1 in all rows, spread across </i> <i>-- 1 rows).</i> <i>--</i> SELECT matchinfo(t1) FROM t1 WHERE t1 MATCH 'default transaction "these semantics"'; <i>-- The format string for this query is "nl". The output array will therefore</i> <i>-- contain 3 integer values - 1 for "n" and 2 for "l". The query returns</i> <i>-- two rows (the first two rows in the table match). The values returned are:</i> <i>--</i> <i>-- 3 1 1</i> <i>-- 3 2 0</i> <i>--</i> <i>-- The first value in the matchinfo array returned for both rows is 3 (the </i> <i>-- number of rows in the table). The following two values are the lengths </i> <i>-- of the longest common subsequence of phrase matches in each column.</i> SELECT matchinfo(t1, 'nl') FROM t1 WHERE t1 MATCH 'default transaction'; </codeblock> <p> The matchinfo function is much faster than either the snippet or offsets functions. This is because the implementation of both snippet and offsets is required to retrieve the documents being analyzed from disk, whereas all data required by matchinfo is available as part of the same portions of the full-text index that are required to implement the full-text query itself. This means that of the following two queries, the first may be an order of magnitude faster than the second: <codeblock> SELECT docid, matchinfo(tbl) FROM tbl WHERE tbl MATCH <query expression>; SELECT docid, offsets(tbl) FROM tbl WHERE tbl MATCH <query expression>; </codeblock> <p> The matchinfo function provides all the information required to calculate probabilistic "bag-of-words" relevancy scores such as <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may be used to order results in a full-text search application. Appendix A of this document, "[search application tips]", contains an example of using the matchinfo() function efficiently. <h1 id=tokenizer tags="tokenizer">Tokenizers</h1> <p> An FTS3 tokenizer is a set of rules for extracting terms from a document or basic FTS3 full-text query. |
︙ | ︙ |