Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Add a description of matchinfo() and update the description of snippet() in fts3.html. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
c31f4ab8b8f9da1b3bfb66e9951ef490 |
User & Date: | dan 2010-01-13 18:09:22.000 |
Context
2010-01-13
| ||
22:15 | Add fragments to headings of longer documents. (check-in: ad70832674 user: drh tags: trunk) | |
18:09 | Add a description of matchinfo() and update the description of snippet() in fts3.html. (check-in: c31f4ab8b8 user: dan tags: trunk) | |
2010-01-12
| ||
17:35 | Add many new anchors to the testing.html document. (check-in: 9203c7ec38 user: drh tags: trunk) | |
Changes
Changes to pages/fts3.in.
︙ | ︙ | |||
768 769 770 771 772 773 774 | <i>-- Search for documents that contains at least one of the terms "database"</i> <i>-- and "sqlite", and also contains the term "library". Because of the differences</i> <i>-- in operator precedences, this query would have a different interpretation using</i> <i>-- the enhanced query syntax.</i> SELECT * FROM docs WHERE docs MATCH 'sqlite OR database library'; }] | | | | | | > > | > | | | > > > > > > > > > > > > > > > > > > > > > > > > > > > > | | | | 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 | <i>-- Search for documents that contains at least one of the terms "database"</i> <i>-- and "sqlite", and also contains the term "library". Because of the differences</i> <i>-- in operator precedences, this query would have a different interpretation using</i> <i>-- the enhanced query syntax.</i> SELECT * FROM docs WHERE docs MATCH 'sqlite OR database library'; }] [h1 "Auxillary functions - Snippet, Offsets and Matchinfo" {} snippet offsets] <p> The FTS3 module provides three special SQL scalar functions that may be useful to the developers of full-text query systems: "snippet", "offsets" and "matchinfo". The purpose of the "snippet" and "offsets" functions is to allow the user to identify the location of queried terms in the returned documents. The "matchinfo" function provides the user with metrics that may be useful for filtering or sorting query results according to relevance. <p> The first argument to all three special SQL scalar functions must be the the special hidden column of an FTS3 table that has the same name as the table (see above). For example, given an FTS3 table named "mail": [Code { SELECT offsets(mail) FROM mail WHERE mail MATCH <full-text query expression>; SELECT snippet(mail) FROM mail WHERE mail MATCH <full-text query expression>; SELECT matchinfo(mail) FROM mail WHERE mail MATCH <full-text query expression>; }] <p> The three auxillary functions are only useful within a SELECT statement that uses the FTS3 table's full-text index. If used within a SELECT that uses the "query by rowid" or "linear scan" strategies, then the snippet and offsets both return an an empty string, and the matchinfo function returns a blob value zero bytes in size. <p id=matchable> All three auxillary functions extract a set of "matchable phrases" from the FTS3 query expression to work with. The set of matchable phrases for a given query consists of all phrases (including unquoted tokens and token prefixes) in the expression except those that are prefixed with a unary "-" operator (standard syntax) or are part of a sub-expression that is used as the right-hand operand of a NOT operator. <p> With the following provisos, each series of tokens in the FTS3 table that matches one of the matchable phrases in the query expression is known as a "phrase match": <ol> <li> If a matchable phrase is part of a series of phrases connected by NEAR operators in the FTS3 query expression, then each phrase match must be sufficiently close to other phrase matches of the relevant types to satisfy the NEAR condition. <li> If the matchable phrase in the FTS3 query is restricted to matching data in a specified FTS3 table column, then only phrase matches that occur within that column are considered. </ol> [h2 "The Offsets Function"] <p> For a SELECT query that uses the full-text index, the offsets() function returns a text value containing a series of space-separated integers. For each term in each <a href=#matchable>phrase match</a> of the current row, there are four integers in the returned list. Each set of four integers is interpreted as follows: [Table] [Tr]<th>Integer <th>Interpretation [Tr]<td>0 <td>The column number that the term instance occurs in (0 for the leftmost column of the FTS3 table, 1 for the next leftmost, etc.). [Tr]<td>1 |
︙ | ︙ | |||
845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 | <i>-- The following query matches the second row in table "mail". It returns the</i> <i>-- text "1 0 28 7 1 1 36 4". Only those occurences of terms "serious" and "mail"</i> <i>-- that are part of an instance of the phrase "serious mail" are identified; the</i> <i>-- other occurences of "serious" and "mail" are ignored.</i> SELECT offsets(mail) FROM mail WHERE mail MATCH '"serious mail"'; }] <p> The snippet function is used to create formatted fragments of document text for display as part of a full-text query results report. The snippet function may be passed between one and four arguments, as follows: [Table] [Tr]<th>Argument <th>Default Value <th>Description [Tr]<td>0 <td>N/A <td> The first argument to the snippet function must always be the special hidden column of the FTS3 table that takes the same name as the table itself. [Tr]<td>1 <td>"<b>" <td> The "start match" text. [Tr]<td>2 <td>"<b>" <td> The "end match" text. [Tr]<td>3 <td>"<b>...</b>" <td> The "ellipses" text. </table> <p> | > > > > > > > > > > > > | | > > | > > | > > > | > > | | > > > > > | < > > > | > > | < > > > > | < > > | | > > > | > | < > > > > > | | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 | <i>-- The following query matches the second row in table "mail". It returns the</i> <i>-- text "1 0 28 7 1 1 36 4". Only those occurences of terms "serious" and "mail"</i> <i>-- that are part of an instance of the phrase "serious mail" are identified; the</i> <i>-- other occurences of "serious" and "mail" are ignored.</i> SELECT offsets(mail) FROM mail WHERE mail MATCH '"serious mail"'; }] [h2 "The Snippet Function"] <p> The snippet function is used to create formatted fragments of document text for display as part of a full-text query results report. The snippet function may be passed between one and four arguments, as follows: [Table] [Tr]<th>Argument <th>Default Value <th>Description [Tr]<td>0 <td>N/A <td> The first argument to the snippet function must always be the special hidden column of the FTS3 table that takes the same name as the table itself. [Tr]<td>1 <td>"<b>" <td> The "start match" text. [Tr]<td>2 <td>"<b>" <td> The "end match" text. [Tr]<td>3 <td>"<b>...</b>" <td> The "ellipses" text. [Tr]<td>4 <td>-1 <td> The FTS3 table column number to extract the returned fragments of text from. Columns are numbered from left to right starting with zero. A negative value indicates that the text may be extracted from any column. [Tr]<td>5 <td>-15 <td> The absolute value of this integer argument is used as the (approximate) number of tokens to include in the returned text value. The maximum allowable absolute value is 64. The value of this argument is refered to as <i>N</i> in the discussion below. </table> <p> The snippet function first attempts to find a fragment of text consisting of <i>|N|</i> tokens within the current row that contains at least one phrase match for each matchable phrase matched somewhere in the current row, where <i>|N|</i> is the absolute value of the sixth argument passed to the snippet function. If the text stored in a single column contains less than <i>|N|</i> tokens, then the entire column value is considered. Text fragments may not span multiple columns. <p> If such a text fragment can be found, it is returned with the following modifications: <ul> <li> If the text fragment does not begin at the start of a column value, the "ellipses" text is prepended to it. <li> If the text fragment does not finish at the end of a column value, the "ellipses" text is appended to it. <li> For each token in the text fragment that is part of a phrase match, the "start match" text is inserted into the fragment before the token, and the "end match" text is inserted immediately after it. </ul> <p> If more than one such fragment can be found, then fragments that contain a larger number of "extra" phrase matches are favoured. The start of the selected text fragment may be moved a few tokens forward or backward to attempt to concentrate the phrase matches toward the center of the fragment. <p> Assuming <i>N</i> is a positive value, if no fragments can be found that contain an phrase match corresponding to each matchable phrase, the snippet function attempts to find two fragments of approximately <i>N</i>/2 tokens that between them contain at least one phrase match for each matchable phrase matched by the current row. If this fails, attempts are made to find three fragments of <i>N</i>/3 tokens each and finally four <i>N</i>/4 token fragments. If a set of four fragments cannot be found that encompasses the required phrase matches, the four fragments of <i>N</i>/4 tokens that provide the best coverage are selected. <p> If <i>N</i> is a negative value, and no single fragment can be found containing the required phrase matches, the snippet function searches for two fragments of <i>|N|</i> tokens each, then three, then four. In other words, if the specified value of <i>N</i> is negative, the sizes of the fragments is not decreased if more than one fragment is required to provide the desired phrase match coverage. <p> After the <i>M</i> fragments have been located, where <i>M</i> is between two and four as described in the paragraphs above, they are joined together in sorted order with the "ellipses" text separating them. The three modifications enumerated earlier are performed on the text before it is returned. [Code { <b>Note: In this block of examples, newlines and whitespace characters have been inserted into the document inserted into the FTS3 table, and the expected results described in SQL comments. This is done to enhance readability only, they would not be present in actual SQLite commands or output.</b> <i>-- Create and populate an FTS3 table.</i> CREATE VIRTUAL TABLE text USING fts3(); INSERT INTO text VALUES(' During 30 Nov-1 Dec, 2-3oC drops. Cool in the upper portion, minimum temperature 14-16oC and cool elsewhere, minimum temperature 17-20oC. Cold to very cold on mountaintops, minimum temperature 6-12oC. Northeasterly winds 15-30 km/hr. After that, temperature increases. Northeasterly winds 15-30 km/hr. '); <i>-- The following query returns the text value:</i> <i>--</i> <i>-- "<b>...</b>cool elsewhere, minimum temperature 17-20oC. <b>Cold</b> to very </i> <i>-- <b>cold</b> on mountaintops, minimum temperature 6<b>...</b>".</i> <i>--</i> SELECT snippet(text) FROM text WHERE text MATCH 'cold'; <i>-- The following query returns the text value:</i> <i>--</i> <i>-- "...the upper portion, [minimum] [temperature] 14-16oC and cool elsewhere,</i> <i>-- [minimum] [temperature] 17-20oC. Cold..."</i> <i>--</i> SELECT snippet(text, '[ ']', '...') FROM text WHERE text MATCH '"min* tem*"' }] [h2 "The Matchinfo Function"] <p> The matchinfo function returns a blob value. If used within a query that uses the full-text index (not a "query by rowid" or "linear scan"), then the blob consists of (2 + <i>C</i> * <i>P</i> * 3) 32-bit unsigned integers in machine byte-order, where <i>C</i> is the number of columns in the FTS3 table being queried, and <i>P</i> is the number of <a href=#matchable>matchable phrases</a> in the query. <p> Phrases and columns are both numbered from left to right starting from zero. [Table] [Tr]<th>Array Element <th>Interpretation [Tr]<td>0 <td> Number of matchable phrases in the query expression (value <i>P</i> in the formula below). [Tr]<td>1 <td> Number of columns in the FTS3 table being queried (value <i>C</i> in the formula below). [Tr]<td style="white-space:nowrap">2 + 3 * (<i>c</i> + <i>C</i>*<i>p</i>) + 0<td> Number of phrase matches for matchable phrase <i>p</i> in column <i>c</i> of the current FTS3 table row. [Tr]<td style="white-space:nowrap">2 + 3 * (<i>c</i> + <i>C</i>*<i>p</i>) + 1<td> Sum of the number of phrase matches for matchable phrase <i>p</i> in column <i>c</i> for all rows of the FTS3 table. [Tr]<td style="white-space:nowrap">2 + 3 * (<i>c</i> + <i>C</i>*<i>p</i>) + 2<td> Number of rows of the FTS3 table for which column <i>c</i> contains at least one phrase match for matchable phrase <i>p</i>. </table> <p> For example: [Code { <i>-- Create and populate an FTS3 table with two columns:</i> CREATE VIRTUAL TABLE t1 USING fts3(a, b); INSERT INTO t1 VALUES('transaction default models default', 'Non transaction reads'); INSERT INTO t1 VALUES('the default transaction', 'these semantics present'); INSERT INTO t1 VALUES('single request', 'default data'); <i>-- The following query returns a single row consisting of a single blob</i> <i>-- value 80 bytes in size (20 32-bit integers). If each block of 4 bytes in</i> <i>-- the blob is interpreted as an unsigned integer in machine byte-order,</i> <i>-- the integers will be:</i> <i>--</i> <i>-- 3 2 1 3 2 0 1 1 1 2 2 0 1 1 0 0 0 1 1 0</i> <i>--</i> <i>-- The row returned corresponds to the second entry inserted into table t1.</i> <i>-- The first two integers in the blob show that the query contained three</i> <i>-- phrases and the table being queried has two columns. The next block of</i> <i>-- three integers describes column 0 (in this case column "a") and phrase</i> <i>-- 0 (in this case "default"). The current row contains 1 hit for "default"</i> <i>-- in column 0, of a total of 3 hits for "default" that occur in column</i> <i>-- 0 of any table row. The 3 hits are spread across 2 different rows.</i> <i>--</i> <i>-- The next set of three integers (0 1 1) pertain to the hits for "default"</i> <i>-- in column 1 of the table (0 in this row, 1 in all rows, spread across </i> <i>-- 1 rows).</i> <i>--</i> SELECT matchinfo(t1) FROM t1 WHERE t1 MATCH 'default transaction "these semantics"'; }] <p> The matchinfo function is much faster than either the snippet or offsets functions. This is because the implementation of both snippet and offsets is required to retrieve the documents being analyzed from disk, whereas all data required by matchinfo is available as part of the same portions of the full-text index that are required to implement the full-text query itself. This means that of the following two queries, the first may be an order of magnitude faster than the second: [Code { SELECT docid, matchinfo(tbl) FROM tbl WHERE tbl MATCH <query expression>; SELECT docid, offsets(tbl) FROM tbl WHERE tbl MATCH <query expression>; }] <p> The TODO page contains an example of how to take advantage of this in a full-text search application. <p> The matchinfo function provides much of the information required to calculate probabalistic "bag-of-words" relevancy scores such as <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may be used to order results in a full-text search application. Also often used in such functions is the length or relative length of each document or document field. Unfortunately, this information is not made available by the matchinfo function as it would require loading extra data from the database, potentially slowing matchinfo() down by an order of magnitude. One solution is for the application to store the lengths of each document or document field in a separate table for use in calculating relevancy scores. The TODO page contains an example of this technique. [h1 "Tokenizers" tokenizer {tokenizer}] <p> An FTS3 tokenizer is a set of rules for extracting terms from a document or basic FTS3 full-text query. <p> |
︙ | ︙ |