Documentation Source Text

Check-in [c31f4ab8b8]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Add a description of matchinfo() and update the description of snippet() in fts3.html.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: c31f4ab8b8f9da1b3bfb66e9951ef490e7bf2e51
User & Date: dan 2010-01-13 18:09:22
Context
2010-01-13
22:15
Add fragments to headings of longer documents. check-in: ad70832674 user: drh tags: trunk
18:09
Add a description of matchinfo() and update the description of snippet() in fts3.html. check-in: c31f4ab8b8 user: dan tags: trunk
2010-01-12
17:35
Add many new anchors to the testing.html document. check-in: 9203c7ec38 user: drh tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/fts3.in.

768
769
770
771
772
773
774
775
776
777
778
779
780
781


782
783
784

785
786
787
788
789
790
791

792
793
794
795
796
797
798

799








800



















801
802
803

804
805
806
807
808
809
810
811
812
...
845
846
847
848
849
850
851


852
853
854
855
856
857
858
...
862
863
864
865
866
867
868










869
870
871
872
873
874
875
876
877
878
879
880
881







882
883
884





885
886

887
888




889
890
891
892
893
894
895































896
897
898
899
900
901
902
...
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926

































































































927
928
929
930
931
932
933
  <i>-- Search for documents that contains at least one of the terms "database"</i>
  <i>-- and "sqlite", and also contains the term "library". Because of the differences</i>
  <i>-- in operator precedences, this query would have a different interpretation using</i>
  <i>-- the enhanced query syntax.</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite OR database library';
}]

[h1 "Auxillary functions - Snippets and Offsets" {} snippet offsets]

<p>
  The FTS3 module provides two special SQL scalar functions that may be useful
  to the developers of full-text query systems, "snippet" and "offsets". The
  purpose of both functions is to allow the user to identify the location of
  queried terms in the returned documents.



<p>
  The first argument to both the snippet and offsets SQL scalar functions

  must be the the special hidden column of an FTS3 table that has the same
  name as the table (see above). For example, given an FTS3 table named 
  "mail":

[Code {
  SELECT offsets(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;
  SELECT snippet(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;

}]

<p>
  The two auxillary functions are only useful within a SELECT statement that
  uses the FTS3 table's full-text index. If used within a SELECT that uses
  the "query by rowid" or "linear scan" strategies, both functions return 
  an empty string.










<p>



















  For a SELECT query that uses the full-text index, the offsets() function 
  returns a text value containing a series of space-separated integers. For
  each occurence of a queried term in the document, there are four integers

  in the returned list. Each set of four integers is interpreted as 
  follows:

[Table]
  [Tr]<th>Integer <th>Interpretation
  [Tr]<td>0 
      <td>The column number that the term instance occurs in (0 for the
          leftmost column of the FTS3 table, 1 for the next leftmost, etc.).
  [Tr]<td>1
................................................................................
  <i>-- The following query matches the second row in table "mail". It returns the</i>
  <i>-- text "1 0 28 7 1 1 36 4". Only those occurences of terms "serious" and "mail"</i>
  <i>-- that are part of an instance of the phrase "serious mail" are identified; the</i>
  <i>-- other occurences of "serious" and "mail" are ignored.</i>
  SELECT offsets(mail) FROM mail WHERE mail MATCH '"serious mail"';
}]



<p>
  The snippet function is used to create formatted fragments of document text
  for display as part of a full-text query results report. The snippet function 
  may be passed between one and four arguments, as follows:

[Table]
  [Tr]<th>Argument <th>Default Value <th>Description
................................................................................
           itself.
  [Tr]<td>1 <td>"&lt;b&gt;"
      <td> The "start match" text.
  [Tr]<td>2 <td>"&lt;b&gt;"
      <td> The "end match" text.
  [Tr]<td>3 <td>"&lt;b&gt;...&lt;/b&gt;"
      <td> The "ellipses" text.










</table>

<p>
  The snippet function returns a fragment of text from the original document 
  surrounding the term identified by the first four integers that would be
  returned by the offsets function if it were used in a similar context. In
  most cases, the selected fragment contains 40 or less bytes of text before
  the identified term, and 40 or more bytes of text following the identified
  term. Slightly less than 40 bytes of preceding or following text is provided
  so that the fragment does not contain any partial terms. If the first term 
  (that would be) identified by the offsets function is less than 40 bytes 
  from the beginning or end of the document, then extra text may appear before
  or after the identified term within the fragment to make up the difference.








<p>
  If the returned fragment of text does not start at the start of the entire





  document, then the "ellipses" text (see table above) is prepended to the
  fragment before it is returned. Similarly, if the end of the returned fragment

  is not also the end of the entire document, the "ellipses" text is appended
  to it before it is returned.





<p>
  Before it is returned, the "start match" text is inserted into the fragment
  immediately before any terms within the fragment that would have been
  identified by the offsets function (not just the first one) were it invoked
  in the same context. The "end match" is inserted immediately following all
  such terms.
































[Code {
  <b>Note: In this block of examples, newlines and whitespace characters have
  been inserted into the document inserted into the FTS3 table, and the expected
  results described in SQL comments. This is done to enhance readability only,
  they would not be present in actual SQLite commands or output.</b>

................................................................................
    and cool elsewhere, minimum temperature 17-20oC. Cold to very cold on mountaintops, 
    minimum temperature 6-12oC. Northeasterly winds 15-30 km/hr. After that, temperature 
    increases. Northeasterly winds 15-30 km/hr.     
  ');

  <i>-- The following query returns the text value:</i>
  <i>--</i>
  <i>--   "&lt;b&gt;...&lt;/b&gt;cool elsewhere, minimum temperature 17-20oC. &lt;b&gt;Cold&lt;/b&gt; to very 
  <i>--    &lt;b&gt;cold&lt;/b&gt; on mountaintops, minimum temperature 6&lt;b&gt;...&lt;/b&gt;".</i>
  <i>--</i>
  SELECT snippet(text) FROM text WHERE text MATCH 'cold';

  <i>-- The following query returns the text value:</i>
  <i>--</i>
  <i>--   "...the upper portion, &#91;minimum&#93; &#91;temperature&#93; 14-16oC and cool elsewhere,</i>
  <i>--    &#91;minimum&#93; &#91;temperature&#93; 17-20oC. Cold..."</i>
  <i>--</i>
  SELECT snippet(text, '&#91; '&#93;', '...') FROM text WHERE text MATCH '"min* tem*"'
}]


































































































[h1 "Tokenizers" tokenizer {tokenizer}]

<p>
  An FTS3 tokenizer is a set of rules for extracting terms from a document 
  or basic FTS3 full-text query. 

<p>







|


|
|
|
|
>
>


<
>







>



|

|
|
>

>
>
>
>
>
>
>
>

>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


<
>
|
|







 







>
>







 







>
>
>
>
>
>
>
>
>
>



<
<
<
<
<
<
<
<
<
<
>
>
>
>
>
>
>


<
>
>
>
>
>
|
<
>
|
<
>
>
>
>


<
<
<
<
<
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>







 







|












>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>







768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785

786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833

834
835
836
837
838
839
840
841
842
843
...
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
...
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914










915
916
917
918
919
920
921
922
923

924
925
926
927
928
929

930
931

932
933
934
935
936
937





938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
...
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
  <i>-- Search for documents that contains at least one of the terms "database"</i>
  <i>-- and "sqlite", and also contains the term "library". Because of the differences</i>
  <i>-- in operator precedences, this query would have a different interpretation using</i>
  <i>-- the enhanced query syntax.</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite OR database library';
}]

[h1 "Auxillary functions - Snippet, Offsets and Matchinfo" {} snippet offsets]

<p>
  The FTS3 module provides three special SQL scalar functions that may be useful
  to the developers of full-text query systems: "snippet", "offsets" and
  "matchinfo". The purpose of the "snippet" and "offsets" functions is to allow
  the user to identify the location of queried terms in the returned documents.
  The "matchinfo" function provides the user with metrics that may be useful
  for filtering or sorting query results according to relevance.

<p>

  The first argument to all three special SQL scalar functions
  must be the the special hidden column of an FTS3 table that has the same
  name as the table (see above). For example, given an FTS3 table named 
  "mail":

[Code {
  SELECT offsets(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;
  SELECT snippet(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;
  SELECT matchinfo(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;
}]

<p>
  The three auxillary functions are only useful within a SELECT statement that
  uses the FTS3 table's full-text index. If used within a SELECT that uses
  the "query by rowid" or "linear scan" strategies, then the snippet and
  offsets both return an an empty string, and the matchinfo function returns
  a blob value zero bytes in size.

<p id=matchable>
  All three auxillary functions extract a set of "matchable phrases" from
  the FTS3 query expression to work with. The set of matchable phrases for
  a given query consists of all phrases (including unquoted tokens and
  token prefixes) in the expression except those that are prefixed with
  a unary "-" operator (standard syntax) or are part of a sub-expression 
  that is used as the right-hand operand of a NOT operator.

<p>
  With the following provisos, each series of tokens in the FTS3 table that
  matches one of the matchable phrases in the query expression is known as a
  "phrase match":

<ol>
  <li> If a matchable phrase is part of a series of phrases connected by
       NEAR operators in the FTS3 query expression, then each phrase match
       must be sufficiently close to other phrase matches of the relevant
       types to satisfy the NEAR condition.

  <li> If the matchable phrase in the FTS3 query is restricted to matching
       data in a specified FTS3 table column, then only phrase matches that 
       occur within that column are considered.
</ol>
 

[h2 "The Offsets Function"]

<p>
  For a SELECT query that uses the full-text index, the offsets() function 
  returns a text value containing a series of space-separated integers. For

  each term in each <a href=#matchable>phrase match</a> of the current row, 
  there are four integers in the returned list. Each set of four integers is 
  interpreted as follows:

[Table]
  [Tr]<th>Integer <th>Interpretation
  [Tr]<td>0 
      <td>The column number that the term instance occurs in (0 for the
          leftmost column of the FTS3 table, 1 for the next leftmost, etc.).
  [Tr]<td>1
................................................................................
  <i>-- The following query matches the second row in table "mail". It returns the</i>
  <i>-- text "1 0 28 7 1 1 36 4". Only those occurences of terms "serious" and "mail"</i>
  <i>-- that are part of an instance of the phrase "serious mail" are identified; the</i>
  <i>-- other occurences of "serious" and "mail" are ignored.</i>
  SELECT offsets(mail) FROM mail WHERE mail MATCH '"serious mail"';
}]

[h2 "The Snippet Function"]

<p>
  The snippet function is used to create formatted fragments of document text
  for display as part of a full-text query results report. The snippet function 
  may be passed between one and four arguments, as follows:

[Table]
  [Tr]<th>Argument <th>Default Value <th>Description
................................................................................
           itself.
  [Tr]<td>1 <td>"&lt;b&gt;"
      <td> The "start match" text.
  [Tr]<td>2 <td>"&lt;b&gt;"
      <td> The "end match" text.
  [Tr]<td>3 <td>"&lt;b&gt;...&lt;/b&gt;"
      <td> The "ellipses" text.
  [Tr]<td>4 <td>-1
      <td> The FTS3 table column number to extract the returned fragments of
           text from. Columns are numbered from left to right starting with
           zero. A negative value indicates that the text may be extracted
           from any column.
  [Tr]<td>5 <td>-15
      <td> The absolute value of this integer argument is used as the 
           (approximate) number of tokens to include in the returned text 
           value. The maximum allowable absolute value is 64. The value of
           this argument is refered to as <i>N</i> in the discussion below.
</table>

<p>










  The snippet function first attempts to find a fragment of text consisting
  of <i>|N|</i> tokens within the current row that contains at least one phrase 
  match for each matchable phrase matched somewhere in the current row, 
  where <i>|N|</i> is the absolute value of the sixth argument passed to the
  snippet function. If the text stored in a single column contains less than
  <i>|N|</i> tokens, then the entire column value is considered. Text fragments 
  may not span multiple columns.

<p>

  If such a text fragment can be found, it is returned with the following
  modifications:

<ul>
  <li> If the text fragment does not begin at the start of a column value,
       the "ellipses" text is prepended to it.

  <li> If the text fragment does not finish at the end of a column value,
       the "ellipses" text is appended to it.

  <li> For each token in the text fragment that is part of a phrase match,
       the "start match" text is inserted into the fragment before the token,
       and the "end match" text is inserted immediately after it.
</ul>

<p>





  If more than one such fragment can be found, then fragments that contain
  a larger number of "extra" phrase matches are favoured. The start of
  the selected text fragment may be moved a few tokens forward or backward
  to attempt to concentrate the phrase matches toward the center of the
  fragment.

<p>
  Assuming <i>N</i> is a positive value, if no fragments can be found that
  contain an phrase match corresponding to each matchable phrase, the snippet
  function attempts to find two fragments of approximately <i>N</i>/2 tokens
  that between them contain at least one phrase match for each matchable phrase
  matched by the current row. If this fails, attempts are made to find three
  fragments of <i>N</i>/3 tokens each and finally four <i>N</i>/4 token
  fragments. If a set of four fragments cannot be found that encompasses the
  required phrase matches, the four fragments of <i>N</i>/4 tokens that provide
  the best coverage are selected.

<p>
  If <i>N</i> is a negative value, and no single fragment can be found 
  containing the required phrase matches, the snippet function searches
  for two fragments of <i>|N|</i> tokens each, then three, then four. In
  other words, if the specified value of <i>N</i> is negative, the sizes
  of the fragments is not decreased if more than one fragment is required
  to provide the desired phrase match coverage.

<p>
  After the <i>M</i> fragments have been located, where <i>M</i> is between
  two and four as described in the paragraphs above, they are joined together
  in sorted order with the "ellipses" text separating them. The three 
  modifications enumerated earlier are performed on the text before it is 
  returned.

[Code {
  <b>Note: In this block of examples, newlines and whitespace characters have
  been inserted into the document inserted into the FTS3 table, and the expected
  results described in SQL comments. This is done to enhance readability only,
  they would not be present in actual SQLite commands or output.</b>

................................................................................
    and cool elsewhere, minimum temperature 17-20oC. Cold to very cold on mountaintops, 
    minimum temperature 6-12oC. Northeasterly winds 15-30 km/hr. After that, temperature 
    increases. Northeasterly winds 15-30 km/hr.     
  ');

  <i>-- The following query returns the text value:</i>
  <i>--</i>
  <i>--   "&lt;b&gt;...&lt;/b&gt;cool elsewhere, minimum temperature 17-20oC. &lt;b&gt;Cold&lt;/b&gt; to very </i>
  <i>--    &lt;b&gt;cold&lt;/b&gt; on mountaintops, minimum temperature 6&lt;b&gt;...&lt;/b&gt;".</i>
  <i>--</i>
  SELECT snippet(text) FROM text WHERE text MATCH 'cold';

  <i>-- The following query returns the text value:</i>
  <i>--</i>
  <i>--   "...the upper portion, &#91;minimum&#93; &#91;temperature&#93; 14-16oC and cool elsewhere,</i>
  <i>--    &#91;minimum&#93; &#91;temperature&#93; 17-20oC. Cold..."</i>
  <i>--</i>
  SELECT snippet(text, '&#91; '&#93;', '...') FROM text WHERE text MATCH '"min* tem*"'
}]

[h2 "The Matchinfo Function"]

<p>
  The matchinfo function returns a blob value. If used within a query that
  uses the full-text index (not a "query by rowid" or "linear scan"), then
  the blob consists of (2 + <i>C</i> * <i>P</i> * 3) 32-bit unsigned 
  integers in machine byte-order, where <i>C</i> is the number of columns 
  in the FTS3 table being queried, and <i>P</i> is the number of 
  <a href=#matchable>matchable phrases</a> in the query. 

  <p>
    Phrases and columns are both numbered from left to right starting from 
    zero.

[Table]
  [Tr]<th>Array Element <th>Interpretation
  [Tr]<td>0 <td> 
    Number of matchable phrases in the query expression (value <i>P</i> in 
    the formula below).
  [Tr]<td>1 <td>
    Number of columns in the FTS3 table being queried (value <i>C</i> in 
    the formula below).

  [Tr]<td style="white-space:nowrap">2 + 3 * (<i>c</i> + <i>C</i>*<i>p</i>) + 0<td>
    Number of phrase matches for matchable phrase <i>p</i> in column 
    <i>c</i> of the current FTS3 table row.
  [Tr]<td style="white-space:nowrap">2 + 3 * (<i>c</i> + <i>C</i>*<i>p</i>) + 1<td>
    Sum of the number of phrase matches for matchable phrase <i>p</i> in 
    column <i>c</i> for all rows of the FTS3 table.
  [Tr]<td style="white-space:nowrap">2 + 3 * (<i>c</i> + <i>C</i>*<i>p</i>) + 2<td>
    Number of rows of the FTS3 table for which column <i>c</i> contains at 
    least one phrase match for matchable phrase <i>p</i>.
</table>

<p>
  For example:

[Code {
  <i>-- Create and populate an FTS3 table with two columns:</i>
  CREATE VIRTUAL TABLE t1 USING fts3(a, b);
  INSERT INTO t1 VALUES('transaction default models default', 'Non transaction reads');
  INSERT INTO t1 VALUES('the default transaction', 'these semantics present');
  INSERT INTO t1 VALUES('single request', 'default data');

  <i>-- The following query returns a single row consisting of a single blob</i>
  <i>-- value 80 bytes in size (20 32-bit integers). If each block of 4 bytes in</i>
  <i>-- the blob is interpreted as an unsigned integer in machine byte-order,</i>
  <i>-- the integers will be:</i>
  <i>--</i>
  <i>--     3 2  1 3 2  0 1 1  1 2 2  0 1 1  0 0 0  1 1 0</i>
  <i>--</i>
  <i>-- The row returned corresponds to the second entry inserted into table t1.</i>
  <i>-- The first two integers in the blob show that the query contained three</i>
  <i>-- phrases and the table being queried has two columns. The next block of</i>
  <i>-- three integers describes column 0 (in this case column "a") and phrase</i>
  <i>-- 0 (in this case "default"). The current row contains 1 hit for "default"</i>
  <i>-- in column 0, of a total of 3 hits for "default" that occur in column</i>
  <i>-- 0 of any table row. The 3 hits are spread across 2 different rows.</i>
  <i>--</i>
  <i>-- The next set of three integers (0 1 1) pertain to the hits for "default"</i>
  <i>-- in column 1 of the table (0 in this row, 1 in all rows, spread across </i>
  <i>-- 1 rows).</i>
  <i>--</i>
  SELECT matchinfo(t1) FROM t1 WHERE t1 MATCH 'default transaction "these semantics"';
}]

<p>
  The matchinfo function is much faster than either the snippet or offsets
  functions. This is because the implementation of both snippet and offsets
  is required to retrieve the documents being analyzed from disk, whereas
  all data required by matchinfo is available as part of the same portions
  of the full-text index that are required to implement the full-text query
  itself. This means that of the following two queries, the first may be
  an order of magnitude faster than the second:

[Code {
  SELECT docid, matchinfo(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;;
  SELECT docid, offsets(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;;
}]

<p>
  The TODO page contains an example of how to take advantage of this in
  a full-text search application.

<p>
  The matchinfo function provides much of the information required to calculate
  probabalistic "bag-of-words" relevancy scores such as 
  <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may
  be used to order results in a full-text search application. Also often
  used in such functions is the length or relative length of each document
  or document field. Unfortunately, this information is not made available
  by the matchinfo function as it would require loading extra data from the
  database, potentially slowing matchinfo() down by an order of magnitude.
  One solution is for the application to store the lengths of each document
  or document field in a separate table for use in calculating relevancy
  scores. The TODO page contains an example of this technique.

[h1 "Tokenizers" tokenizer {tokenizer}]

<p>
  An FTS3 tokenizer is a set of rules for extracting terms from a document 
  or basic FTS3 full-text query. 

<p>