Documentation Source Text

Check-in [0917dec067]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Add a section to fts3.in to explain how to use the matchinfo() function efficiently.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 0917dec0676131c0ec080c1e8ecb3ccf17b13e8b
User & Date: dan 2010-01-18 14:12:05
Context
2010-01-20
05:52
Fix typo in fts3.html. check-in: 897b36953b user: dan tags: trunk
2010-01-18
14:12
Add a section to fts3.in to explain how to use the matchinfo() function efficiently. check-in: 0917dec067 user: dan tags: trunk
2010-01-15
05:16
Fix another typo in custombuild.html. check-in: 06a55cbcb5 user: dan tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/fts3.in.

993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
....
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095

1096
1097
1098
1099
1100
1101
1102
....
1560
1561
1562
1563
1564
1565
1566

1567




























































































































































































































































  <i>--</i>
  <i>--   "...the upper portion, &#91;minimum&#93; &#91;temperature&#93; 14-16oC and cool elsewhere,</i>
  <i>--    &#91;minimum&#93; &#91;temperature&#93; 17-20oC. Cold..."</i>
  <i>--</i>
  SELECT snippet(text, '&#91; '&#93;', '...') FROM text WHERE text MATCH '"min* tem*"'
}]

[h2 "The Matchinfo Function"]

<p>
  The matchinfo function returns a blob value. If used within a query that
  uses the full-text index (not a "query by rowid" or "linear scan"), then
  the blob consists of (2 + <i>C</i> * <i>P</i> * 3) 32-bit unsigned 
  integers in machine byte-order, where <i>C</i> is the number of columns 
  in the FTS3 table being queried, and <i>P</i> is the number of 
................................................................................
  an order of magnitude faster than the second:

[Code {
  SELECT docid, matchinfo(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;;
  SELECT docid, offsets(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;;
}]

<p>
  The TODO page contains an example of how to take advantage of this in
  a full-text search application.

<p>
  The matchinfo function provides much of the information required to calculate
  probabalistic "bag-of-words" relevancy scores such as 
  <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may
  be used to order results in a full-text search application. Also often
  used in such functions is the length or relative length of each document
  or document field. Unfortunately, this information is not made available
  by the matchinfo function as it would require loading extra data from the
  database, potentially slowing matchinfo() down by an order of magnitude.
  One solution is for the application to store the lengths of each document
  or document field in a separate table for use in calculating relevancy
  scores. The TODO page contains an example of this technique.


[h1 "Tokenizers" tokenizer {tokenizer}]

<p>
  An FTS3 tokenizer is a set of rules for extracting terms from a document 
  or basic FTS3 full-text query. 

................................................................................
<p>
  For doclists for which the term appears in more than one column of the FTS3
  virtual table, term-offset lists within the doclist are stored in column 
  number order. This ensures that the term-offset list associated with 
  column 0 (if any) is always first, allowing the first two fields of the
  term-offset list to be omitted in this case.


}



































































































































































































































































|







 







<
<
<
<











|
>







 







>
|
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
....
1073
1074
1075
1076
1077
1078
1079




1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
....
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
1665
1666
1667
1668
1669
1670
1671
1672
1673
1674
1675
1676
1677
1678
1679
1680
1681
1682
1683
1684
1685
1686
1687
1688
1689
1690
1691
1692
1693
1694
1695
1696
1697
1698
1699
1700
1701
1702
1703
1704
1705
1706
1707
1708
1709
1710
1711
1712
1713
1714
1715
1716
1717
1718
1719
1720
1721
1722
1723
1724
1725
1726
1727
1728
1729
1730
1731
1732
1733
1734
1735
1736
1737
1738
1739
1740
1741
1742
1743
1744
1745
1746
1747
1748
1749
1750
1751
1752
1753
1754
1755
1756
1757
1758
1759
1760
1761
1762
1763
1764
1765
1766
1767
1768
1769
1770
1771
1772
1773
1774
1775
1776
1777
1778
1779
1780
1781
1782
1783
1784
1785
1786
1787
1788
1789
1790
1791
1792
1793
1794
1795
1796
1797
1798
1799
1800
1801
1802
1803
1804
1805
1806
1807
1808
1809
1810
1811
1812
1813
1814
1815
1816
1817
  <i>--</i>
  <i>--   "...the upper portion, &#91;minimum&#93; &#91;temperature&#93; 14-16oC and cool elsewhere,</i>
  <i>--    &#91;minimum&#93; &#91;temperature&#93; 17-20oC. Cold..."</i>
  <i>--</i>
  SELECT snippet(text, '&#91; '&#93;', '...') FROM text WHERE text MATCH '"min* tem*"'
}]

[h2 "The Matchinfo Function" matchinfo matchinfo]

<p>
  The matchinfo function returns a blob value. If used within a query that
  uses the full-text index (not a "query by rowid" or "linear scan"), then
  the blob consists of (2 + <i>C</i> * <i>P</i> * 3) 32-bit unsigned 
  integers in machine byte-order, where <i>C</i> is the number of columns 
  in the FTS3 table being queried, and <i>P</i> is the number of 
................................................................................
  an order of magnitude faster than the second:

[Code {
  SELECT docid, matchinfo(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;;
  SELECT docid, offsets(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;;
}]





<p>
  The matchinfo function provides much of the information required to calculate
  probabalistic "bag-of-words" relevancy scores such as 
  <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may
  be used to order results in a full-text search application. Also often
  used in such functions is the length or relative length of each document
  or document field. Unfortunately, this information is not made available
  by the matchinfo function as it would require loading extra data from the
  database, potentially slowing matchinfo() down by an order of magnitude.
  One solution is for the application to store the lengths of each document
  or document field in a separate table for use in calculating relevancy
  scores. Appendix A of this document, "\[search application tips\]", contains
  an example of using the matchinfo() function efficiently.

[h1 "Tokenizers" tokenizer {tokenizer}]

<p>
  An FTS3 tokenizer is a set of rules for extracting terms from a document 
  or basic FTS3 full-text query. 

................................................................................
<p>
  For doclists for which the term appears in more than one column of the FTS3
  virtual table, term-offset lists within the doclist are stored in column 
  number order. This ensures that the term-offset list associated with 
  column 0 (if any) is always first, allowing the first two fields of the
  term-offset list to be omitted in this case.

[h1 "Appendix A: Search Application Tips" {} "search application tips"]

<p>
  FTS3 is primarily designed to support Boolean full-text queries - queries
  to find the set of documents that match a specified criteria. However, many 
  (most?) search applications require that results are somehow ranked in order
  of "relevance", where "relevance" is defined as the likelihood that the user
  who performed the search is interested in a specific element of the returned
  set of documents. When using a search engine to find documents on the world
  wide web, the user expects that the most useful, or "relevant", documents 
  will be returned as the first page of results, and that each subsequent page 
  contains progressively less relevant results. Exactly how a machine can 
  determine document relevance based on a users query is a complicated problem
  and the subject of much ongoing research.

<p>
  One very simple scheme might be to count the number of instances of the 
  users search terms in each result document. Those documents that contain
  many instances of the terms are considered more relevant than those with
  a small number of instances of each term. In an FTS3 application, the 
  number of term instances in each result could be determined by counting
  the number of integers in the return value of the \[offsets\] function.
  The following example shows a query that could be used to obtain the
  ten most relevant results for a query entered by the user:

[Code {
  <i>-- This example (and all others in this section) assumes the following schema</i>
  CREATE VIRTUAL TABLE documents USING fts3(title, content);

  <i>-- Assuming the application has supplied an SQLite user function named "countintegers"</i>
  <i>-- that returns the number of space-separated integers contained in its only argument,</i>
  <i>-- the following query could be used to return the titles of the 10 documents that contain</i>
  <i>-- the greatest number of instances of the users query terms. Hopefully, these 10</i>
  <i>-- documents will be those that the users considers more or less the most "relevant".</i>
  SELECT title FROM documents 
    WHERE documents MATCH &lt;query&gt;
    ORDER BY countintegers(offsets(document)) DESC
    OFFSET 0 LIMIT 10
}]

<p>
  The query above could be made to run faster by using the FTS3 \[matchinfo\]
  function to determine the number of query term instances that appear in each
  result. The matchinfo function is much more efficient than the offsets 
  function. Furthermore, the matchinfo function provides extra information
  regarding the overall number of occurences of each query term in the entire
  document set (not just the current row) and the number of documents in which 
  each query term appears. This may be used (for example) to attach a higher
  weight to less common terms which may increase the overall computed relevancy 
  of those results the user considers more interesting.

[Code {
  <i>-- If the application supplies an SQLite user function called "rank" that</i>
  <i>-- interprets the blob of data returned by matchinfo and returns a numeric</i>
  <i>-- relevancy based on it, then the following SQL may be used to return the</i>
  <i>-- titles of the 10 most relevant documents in the dataset for a users query.</i>
  SELECT title FROM documents 
    WHERE documents MATCH &lt;query&gt;
    ORDER BY rank(matchinfo(document)) DESC
    OFFSET 0 LIMIT 10
}]

<p>
  The SQL query in the example above uses less CPU than the first example
  in this section, but still has a non-obvious performance problem. SQLite
  satisfies this query by retreiving the value of the "title" column and
  matchinfo data from the FTS3 module for every row matched by the users
  query before it sorts and limits the results. Because of the way SQLite's
  virtual table interface works, retrieving the value of the "title" column
  requires loading the entire row from disk (including the "content" field,
  which may be quite large). This means that if the users query matches
  several thousand documents, many megabytes of "title" and "content" data
  may be loaded from disk into memory even though they will never be used
  for any purpose. 

<p>
  The SQL query in the following example block is one solution to this 
  problem. In SQLite, when a <a href="optoverview.html#flattening">sub-query 
  used in a join contains a LIMIT clause</a>, the results of the sub-query are
  calcuated and stored in temporary table before the main query is executed.
  This means that SQLite will load only the docid and matchinfo data for each
  row matching the users query into memory, determine the docid values
  corresponding to the ten most relevant documents, then load only the title
  and content information for those 10 documents only. Because both the matchinfo
  and docid values are gleaned entirely from the full-text index, this results
  in dramatically less data being loaded from the database into memory.

[Code {
  SELECT title FROM documents JOIN ( 
      SELECT docid, rank(matchinfo(document)) AS rank 
      FROM documents
      WHERE documents MATCH &lt;query&gt;
      ORDER BY rank DESC 
      OFFSET 0 LIMIT 10
  ) AS ranktable USING(docid)
  ORDER BY ranktable.rank DESC
}]

<p>
  The next block of SQL enhances the query with solutions to two other problems
  that may arise in developing search applications using FTS3:

<ol>
  <li> <p>
       The \[snippet\] function cannot be used with the above query. Because
       the outer query does not include a "WHERE ... MATCH" clause, the snippet 
       function may not be used with it. One solution is to duplicate the WHERE
       clause used by the sub-query in the outer query. The overhead associated
       with this is usually negligible.
  <li> <p>
       The relevancy of a document may depend on something other than just
       the data available in the return value of matchinfo. For example
       each document in the database may be assigned a static weight based
       on factors unrelated to its content (origin, author, age, number
       of references etc.). These values can be stored by the application
       in a separate table that can be joined against the documents table
       in the sub-query so that the rank function may access them.
</ol>

<p>
  This version of the query is very similar to that used by the 
  <a href="http://www.sqlite.org/search?q=fts3">sqlite.org documentation search</a> 
  application.

[Code {
  <i>-- This table stores the static weight assigned to each document in FTS3 table</i>
  <i>-- "documents". For each row in the documents table there is a corresponding row</i>
  <i>-- with the same docid value in this table.</i>
  CREATE TABLE documents_data(docid INTEGER PRIMARY KEY, weight);

  <i>-- This query is similar to the one in the block above, except that:</i>
  <i>--</i>
  <i>--   1. It returns a "snippet" of text along with the document title for display. So</i>
  <i>--      that the snippet function may be used, the "WHERE ... MATCH ..." clause from</i>
  <i>--      the sub-query is duplicated in the outer query.</i>
  <i>--</i>
  <i>--   2. The sub-query joins the documents table with the document_data table, so that</i>
  <i>--      implementation of the rank function has access to the static weight assigned</i>
  <i>--      to each document.</i>
  SELECT title, snippet(documents) FROM documents JOIN ( 
      SELECT docid, rank(matchinfo(document), documents_data.weight) AS rank
      FROM documents JOIN documents_data USING(docid)
      WHERE documents MATCH &lt;query&gt;
      ORDER BY rank DESC 
      OFFSET 0 LIMIT 10
  ) AS ranktable USING(docid)
  WHERE documents MATCH &lt;query&gt;
  ORDER BY ranktable.rank DESC
}]

<p>
  All the example queries above return the ten most relevant query results.
  By modifying the values used with the OFFSET and LIMIT clauses, a query 
  to return (say) the next ten most relevant results is easy to construct. 
  This may be used to obtain the data required for a search applications second
  and subsequent pages of results.

<p>
  The next block contains an example rank function that uses matchinfo data 
  implemented in C. Instead of a single weight, it allows a weight to be 
  externally assigned to each column of each document. It may be registered
  with SQLite like any other user function using \[sqlite3_create_function()\].

[Code [string map {[ &#x5B; ] &#x5D;} {
<i>/*</i>
<i>** SQLite user defined function to use with matchinfo() to calculate the</i>
<i>** relevancy of an FTS3 match. The value returned is the relevancy score</i>
<i>** (a real value greater than or equal to zero). A larger value indicates </i>
<i>** a more relevant document.</i>
<i>**</i>
<i>** The overall relevancy returned is the sum of the relevancies of each </i>
<i>** column value in the FTS3 table. The relevancy of a column value is the</i>
<i>** sum of the following for each reportable phrase in the FTS3 query:</i>
<i>**</i>
<i>**   (&lt;hit count&gt; / &lt;global hit count&gt) * &lt;column weight&gt;</i>
<i>**</i>
<i>** where &lt;hit count&gt; is the number of instances of the phrase in the</i>
<i>** column value of the current row and &lt;global hit count&gt; is the number</i>
<i>** of instances of the phrase in the same column of all rows in the FTS3</i>
<i>** table. The &lt;column weight&gt; is a weighting factor assigned to each</i>
<i>** column by the caller (see below).</i>
<i>**</i>
<i>** The first argument to this function must be the return value of the FTS3 </i>
<i>** matchinfo() function. Following this must be one argument for each column </i>
<i>** of the FTS3 table containing a numeric weight factor for the corresponding </i>
<i>** column. Example:</i>
<i>**</i>
<i>**     CREATE VIRTUAL TABLE documents USING fts3(title, content)</i>
<i>**</i>
<i>** The following query returns the docids of documents that match the full-text</i>
<i>** query &lt;query&gt; sorted from most to least relevant. When calculating</i>
<i>** relevance, query term instances in the 'title' column are given twice the</i>
<i>** weighting of those in the 'content' column.</i>
<i>**</i>
<i>**     SELECT docid FROM documents </i>
<i>**     WHERE documents MATCH &lt;query&gt; </i>
<i>**     ORDER BY rank(matchinfo(documents), 1.0, 0.5) DESC</i>
<i>*/</i>
static void rankfunc(sqlite3_context *pCtx, int nVal, sqlite3_value **apVal){
  int *aMatchinfo;                <i>/* Return value of matchinfo() */</i>
  int nCol;                       <i>/* Number of columns in the table */</i>
  int nPhrase;                    <i>/* Number of phrases in the query */</i>
  int iPhrase;                    <i>/* Current phrase */</i>
  double score = 0.0;             <i>/* Value to return */</i>

  assert( sizeof(int)==4 );

<i>  /* Check that the number of arguments passed to this function is correct.</i>
<i>  ** If not, jump to wrong_number_args. Set aMatchinfo to point to the array</i>
<i>  ** of unsigned integer values returned by FTS3 function matchinfo. Set</i>
<i>  ** nPhrase to contain the number of reportable phrases in the users full-text</i>
<i>  ** query, and nCol to the number of columns in the table.</i>
<i>  */</i>
  if( nVal&lt;1 ) goto wrong_number_args;
  aMatchinfo = (unsigned int *)sqlite3_value_blob(apVal[0]);
  nPhrase = aMatchinfo[0];
  nCol = aMatchinfo[1];
  if( nVal!=(1+nCol) ) goto wrong_number_args;

<i>  /* Iterate through each phrase in the users query. */</i>
  for(iPhrase=0; iPhrase&lt;nPhrase; iPhrase++){
    int iCol;                     <i>/* Current column */</i>

<i>    /* Now iterate through each column in the users query. For each column,</i>
<i>    ** increment the relevancy score by:</i>
<i>    **</i>
<i>    **   (&lt;hit count&gt; / &lt;global hit count&gt) * &lt;column weight&gt;</i>
<i>    **</i>
<i>    ** aPhraseinfo[] points to the start of the data for phrase iPhrase. So</i>
<i>    ** the hit count and global hit counts for each column are found in </i>
<i>    ** aPhraseinfo[iCol*3] and aPhraseinfo[iCol*3+1], respectively.</i>
<i>    */</i>
    int *aPhraseinfo = &aMatchinfo[2 + iPhrase*nCol*3];
    for(iCol=0; iCol&lt;nCol; iCol++){
      int nHitCount = aPhraseinfo[3*iCol];
      int nGlobalHitCount = aPhraseinfo[3*iCol+1];
      double weight = sqlite3_value_double(apVal[iCol+1]);
      if( nHitCount>0 ){
        score += ((double)nHitCount / (double)nGlobalHitCount) * weight;
      }
    }
  }

  sqlite3_result_double(pCtx, score);
  return;

<i>  /* Jump here if the wrong number of arguments are passed to this function */</i>
wrong_number_args:
  sqlite3_result_error(pCtx, "wrong number of arguments to function rank()", -1);
}
}]]

}