Documentation Source Text

Check-in [a358d9babf]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Improve the documentation of fts5 auxiliary functions.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: a358d9babfbaa5004ef344a38b272f52ac8b5e8b
User & Date: dan 2015-03-07 11:16:26
Context
2015-03-09
19:43
Update the change log to mention VxWorks support. check-in: ba1c4ee6c3 user: drh tags: trunk
2015-03-07
11:16
Improve the documentation of fts5 auxiliary functions. check-in: a358d9babf user: dan tags: trunk
2015-03-06
19:29
Further enhancements to fts5.in. check-in: 984c3483df user: dan tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to images/fts5_formula.odf.

cannot compute difference between binary files

Changes to images/fts5_formula1.png.

cannot compute difference between binary files

Changes to images/fts5_formula2.png.

cannot compute difference between binary files

Changes to images/fts5_formula3.png.

cannot compute difference between binary files

Changes to pages/fts5.in.

623
624
625
626
627
628
629






















630
631



















632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653










654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
...
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
...
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
...
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
...
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
    INSERT INTO fts_idx(rowid, b, c) VALUES (new.a, new.b, new.c);
  END;
</codeblock>


<h1 tags="FTS5 auxiliary functions"> Auxiliary Functions </h1>























<h2>Built-in Auxiliary Functions</h2>




















<h3>The bm25() function</h3>

<p> The built-in auxiliary function bm25() returns a real value indicating
how well the current row matches the full-text query. The better the match,
the larger the value returned. A query such as the following may be used
to return matches in order from best to worst match:

<codeblock>
  SELECT * FROM fts WHERE fts MATCH ? ORDER BY bm25(fts) DESC
</codeblock>

<p> In order to calculate a documents score, the full-text query is separated
    into its component phrases. The bm25 score for document <i>D</i> and 
    query <i>Q</i> is then calculated as follows:

<p> <img src="images/fts5_formula1.png" style="width:55ex;margin-left:5ex">

<p> In the above, <i>nPhrase</i> is the number of phrases in the query.
    <i>|D|</i> is the number of tokens in the current document, and
    <i>avgdl</i> is the average number of tokens in all documents within the
    FTS5 table.  <i>k<sub>1</sub></i> and <i>b</i> are both constants,
    hard-coded at 1.2 and 0.75 respectively.











<p> <i>IDF(q<sub>i</sub>)</i> is the inverse-document-frequency of query 
    phrase <i>i</i>. It is calculated as follows, where <i>N</i> is the total
    number of rows in the FTS5 table and <i>n(q<sub>i</sub>)</i> is the total
    number of rows that contain at least one instance of phrase <i>i</i>:

<p> <img src="images/fts5_formula2.png" style="width:55ex;margin-left:5ex">

<p> Finally, <i>f(q<sub>i</sub>,D)</i> is the phrase frequency of phrase 
<i>i</i>. By default, this is simply the number of occurrences of the phrase
within the current row. However, by passing extra real value arguments to 
the bm25() SQL function, each column of the table may be assigned a different
weight and the phrase frequency calculated as follows:

<p> <img src="images/fts5_formula3.png" style="width:55ex;margin-left:5ex">

<p> where <i>w<sub>c</sub></i> is the weight assigned to column <i>c</i> and
<i>n(q<sub>i</sub>,c)</i> is the number of occurrences of phrase <i>i</i> in
column <i>c</i> of the current row. The first argument passed to bm25()
following the table name is the weight assigned to the leftmost column of
the FTS5 table. The second is the weight assigned to the second leftmost
column, and so on. If there are not enough arguments for all table columns,
................................................................................
  <i>-- Assuming the following schema:</i>
  CREATE VIRTUAL TABLE email USING fts5(sender, title, body);

  <i>-- Return results in bm25 order, with each phrase hit in the "sender"</i>
  <i>-- column considered the equal of 10 hits in the "body" column, and</i>
  <i>-- each hit in the "title" column considered as valuable as 5 hits in</i>
  <i>-- the "body" column.</i>
  SELECT * FROM email WHERE email MATCH ? ORDER BY bm25(email, 10.0, 5.0) DESC;
</codeblock>

<p>Refer to wikipedia for 
<a href="http://en.wikipedia.org/wiki/Okapi_BM25">more information regarding
BM25</a> and its variants.

<h3>The highlight() function</h3>

<p> The highlight() function returns a copy of the text from a specified 
column of the current row with extra markup text inserted to mark the start 
and end of phrase matches. 

<p>The highlight() must be invoked with exactly three arguments following 
the table name. To be interpreted as follows:
................................................................................
  <i>-- The following SELECT statement returns these three rows:</i>
  <i>--   '&#91;a b c&#93; x &#91;c d e&#93;'</i>
  <i>--   '&#91;a b c&#93; &#91;c d e&#93;'</i>
  <i>--   '&#91;a b c d e&#93;'</i>
  SELECT highlight(ft, 0, '&#91;', '&#93;') FROM ft WHERE ft MATCH 'a+b+c AND c+d+e';
</codeblock>

<h3>The snippet() function</h3>

<p>The snippet() function is similar to highlight(), except that instead of
returning entire column values, it automatically selects and extracts a
short fragment of document text to process and return. The snippet() function
must be passed five parameters following the table name argument:

<ol>
................................................................................
returned value. In this case, using "rank" is faster than using bm25().

<codeblock>
  <i>-- The following queries are logically equivalent. But the second may</i>
  <i>-- be faster, particularly if the caller abandons the query before</i>
  <i>-- all rows have been returned (or if the queries were modified to </i>
  <i>-- include LIMIT clauses).</i>
  SELECT * FROM fts WHERE fts MATCH ? ORDER BY bm25(fts) DESC;
  SELECT * FROM fts WHERE fts MATCH ? ORDER BY rank DESC;
</codeblock>

<p> Instead of using bm25() with no trailing arguments, the specific auxiliary
function mapped to the rank column may be configured either on a per-query
basis, or by setting a different persistent default for the FTS table.

<p> In order to change the mapping of the rank column for a single query, 
................................................................................
evaluates to a string consisting of the auxiliary function to invoke, followed
by zero or more comma separated arguments within parenthesis. Arguments must
be SQL literals. For example:

<codeblock>
  <i>-- The following queries are logically equivalent. But the second may</i>
  <i>-- be faster. See above. </i>
  SELECT * FROM fts WHERE fts MATCH ? ORDER BY bm25(fts, 10.0, 5.0) DESC;
  SELECT * FROM fts WHERE fts MATCH ? AND rank MATCH 'bm25(10.0, 5.0)' ORDER BY rank DESC;
</codeblock>

<p> The default mapping of the rank column for a table may be modified 
using the [FTS5 rank configuration option].

<h1>Special INSERT Commands</h1>








>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
|







|






|






>
>
>
>
>
>
>
>
>
>






|







|







 







|






|







 







|







 







|
|







 







|
|







623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
...
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
...
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
...
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
...
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
    INSERT INTO fts_idx(rowid, b, c) VALUES (new.a, new.b, new.c);
  END;
</codeblock>


<h1 tags="FTS5 auxiliary functions"> Auxiliary Functions </h1>

<p> Auxiliary functions are similar to [corefunc | SQL scalar functions],
except that they may only be used within full-text queries (those that use
the MATCH operator) on an FTS5 table. Their results are calculated based not
only on the arguments passed to them, but also on the current match and 
matched row. For example, an auxiliary function may return a numeric value
indicating the accuracy of the match (see the [FTS5 bm25| bm25()] function), 
or a fragment of text from the matched row that contains one or more 
instances of the search terms (see the [FTS5 snippet | snippet()] function).

<p>To invoke an auxiliary function, the name of the FTS5 table should be
specified as the first argument. Other arguments may follow the first,
depending on the specific auxiliary function being invoked. For example, to
invoke the "highlight" function:

<codeblock>
  SELECT highlight(email, 2, '&lt;b&gt;', '&lt;/b&gt;') FROM email WHERE email MATCH 'fts5'
</codeblock>

<p>The built-in auxiliary functions provided as part of FTS5 are described in
the following section. Applications may also implement 
[FTS5 custom auxiliary functions | custom auxiliary functions in C].

<h2>Built-in Auxiliary Functions</h2>

<p> FTS5 provides three built-in auxiliary functions:

<ul>
  <li> The [FTS5 bm25 | bm25() auxiliary function] returns a real value
       reflecting the accuracy of the current match. Better matches are
       assigned numerically lower values.

  <li> The [FTS5 highlight | highlight() auxiliary function] returns a copy
       of the text from one of the columns of the current match with each
       instance of a queried term within the result surrounded by specified
       markup (for example "&lt;b&gt;" and "&lt;/b&gt;").

  <li> The [FTS5 snippet | snippet() auxiliary function] selects a short
       fragment of text from one of the columns of the matched row and returns
       it with each instance of a queried term surrounded by markup in
       the same manner as the highlight() function. The fragment of text is
       selected so as to maximize the number of queried terms it contains.
</ul>

<h3 tags="FTS5 bm25">The bm25() function</h3>

<p> The built-in auxiliary function bm25() returns a real value indicating
how well the current row matches the full-text query. The better the match,
the larger the value returned. A query such as the following may be used
to return matches in order from best to worst match:

<codeblock>
  SELECT * FROM fts WHERE fts MATCH ? ORDER BY bm25(fts)
</codeblock>

<p> In order to calculate a documents score, the full-text query is separated
    into its component phrases. The bm25 score for document <i>D</i> and 
    query <i>Q</i> is then calculated as follows:

<p> <img src="images/fts5_formula1.png" style="width:75ex;margin-left:5ex">

<p> In the above, <i>nPhrase</i> is the number of phrases in the query.
    <i>|D|</i> is the number of tokens in the current document, and
    <i>avgdl</i> is the average number of tokens in all documents within the
    FTS5 table.  <i>k<sub>1</sub></i> and <i>b</i> are both constants,
    hard-coded at 1.2 and 0.75 respectively.

<p> The "-1" term at the start of the formula is not found in most
implementations of the BM25 algorithm. Without it, a better match is assigned
a numerically higher BM25 score. Since the default sorting order is
"ascending", this means that appending "ORDER BY bm25(fts)" to a query would
cause results to be returned in order from worst to best. The "DESC" keyword
would be required in order to return the best matches first. In order to
avoid this pitfall, the FTS5 implementation of BM25 multiplies the result
by -1 before returning it, ensuring that better matches are assigned
numerically lower scores.

<p> <i>IDF(q<sub>i</sub>)</i> is the inverse-document-frequency of query 
    phrase <i>i</i>. It is calculated as follows, where <i>N</i> is the total
    number of rows in the FTS5 table and <i>n(q<sub>i</sub>)</i> is the total
    number of rows that contain at least one instance of phrase <i>i</i>:

<p> <img src="images/fts5_formula2.png" style="width:75ex;margin-left:5ex">

<p> Finally, <i>f(q<sub>i</sub>,D)</i> is the phrase frequency of phrase 
<i>i</i>. By default, this is simply the number of occurrences of the phrase
within the current row. However, by passing extra real value arguments to 
the bm25() SQL function, each column of the table may be assigned a different
weight and the phrase frequency calculated as follows:

<p> <img src="images/fts5_formula3.png" style="width:75ex;margin-left:5ex">

<p> where <i>w<sub>c</sub></i> is the weight assigned to column <i>c</i> and
<i>n(q<sub>i</sub>,c)</i> is the number of occurrences of phrase <i>i</i> in
column <i>c</i> of the current row. The first argument passed to bm25()
following the table name is the weight assigned to the leftmost column of
the FTS5 table. The second is the weight assigned to the second leftmost
column, and so on. If there are not enough arguments for all table columns,
................................................................................
  <i>-- Assuming the following schema:</i>
  CREATE VIRTUAL TABLE email USING fts5(sender, title, body);

  <i>-- Return results in bm25 order, with each phrase hit in the "sender"</i>
  <i>-- column considered the equal of 10 hits in the "body" column, and</i>
  <i>-- each hit in the "title" column considered as valuable as 5 hits in</i>
  <i>-- the "body" column.</i>
  SELECT * FROM email WHERE email MATCH ? ORDER BY bm25(email, 10.0, 5.0);
</codeblock>

<p>Refer to wikipedia for 
<a href="http://en.wikipedia.org/wiki/Okapi_BM25">more information regarding
BM25</a> and its variants.

<h3 tags="FTS5 highlight">The highlight() function</h3>

<p> The highlight() function returns a copy of the text from a specified 
column of the current row with extra markup text inserted to mark the start 
and end of phrase matches. 

<p>The highlight() must be invoked with exactly three arguments following 
the table name. To be interpreted as follows:
................................................................................
  <i>-- The following SELECT statement returns these three rows:</i>
  <i>--   '&#91;a b c&#93; x &#91;c d e&#93;'</i>
  <i>--   '&#91;a b c&#93; &#91;c d e&#93;'</i>
  <i>--   '&#91;a b c d e&#93;'</i>
  SELECT highlight(ft, 0, '&#91;', '&#93;') FROM ft WHERE ft MATCH 'a+b+c AND c+d+e';
</codeblock>

<h3 tags="FTS5 snippet">The snippet() function</h3>

<p>The snippet() function is similar to highlight(), except that instead of
returning entire column values, it automatically selects and extracts a
short fragment of document text to process and return. The snippet() function
must be passed five parameters following the table name argument:

<ol>
................................................................................
returned value. In this case, using "rank" is faster than using bm25().

<codeblock>
  <i>-- The following queries are logically equivalent. But the second may</i>
  <i>-- be faster, particularly if the caller abandons the query before</i>
  <i>-- all rows have been returned (or if the queries were modified to </i>
  <i>-- include LIMIT clauses).</i>
  SELECT * FROM fts WHERE fts MATCH ? ORDER BY bm25(fts);
  SELECT * FROM fts WHERE fts MATCH ? ORDER BY rank;
</codeblock>

<p> Instead of using bm25() with no trailing arguments, the specific auxiliary
function mapped to the rank column may be configured either on a per-query
basis, or by setting a different persistent default for the FTS table.

<p> In order to change the mapping of the rank column for a single query, 
................................................................................
evaluates to a string consisting of the auxiliary function to invoke, followed
by zero or more comma separated arguments within parenthesis. Arguments must
be SQL literals. For example:

<codeblock>
  <i>-- The following queries are logically equivalent. But the second may</i>
  <i>-- be faster. See above. </i>
  SELECT * FROM fts WHERE fts MATCH ? ORDER BY bm25(fts, 10.0, 5.0);
  SELECT * FROM fts WHERE fts MATCH ? AND rank MATCH 'bm25(10.0, 5.0)' ORDER BY rank;
</codeblock>

<p> The default mapping of the rank column for a table may be modified 
using the [FTS5 rank configuration option].

<h1>Special INSERT Commands</h1>