Documentation Source Text
Check-in [0917dec067]
Not logged in

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
SHA1 Hash:0917dec0676131c0ec080c1e8ecb3ccf17b13e8b
Date: 2010-01-18 14:12:05
User: dan
Comment:Add a section to fts3.in to explain how to use the matchinfo() function efficiently.
Tags And Properties
Changes
hide diffs unified diffs patch

Changes to pages/fts3.in

993 <i>--</i> 993 <i>--</i> 994 <i>-- "...the upper portion, &#91;minimum&#93; &#91;temperature&#93; 14-16oC 994 <i>-- "...the upper portion, &#91;minimum&#93; &#91;temperature&#93; 14-16oC 995 <i>-- &#91;minimum&#93; &#91;temperature&#93; 17-20oC. Cold..."</i> 995 <i>-- &#91;minimum&#93; &#91;temperature&#93; 17-20oC. Cold..."</i> 996 <i>--</i> 996 <i>--</i> 997 SELECT snippet(text, '&#91; '&#93;', '...') FROM text WHERE text MATCH '"min* 997 SELECT snippet(text, '&#91; '&#93;', '...') FROM text WHERE text MATCH '"min* 998 }] 998 }] 999 999 1000 [h2 "The Matchinfo Function"] | 1000 [h2 "The Matchinfo Function" matchinfo matchinfo] 1001 1001 1002 <p> 1002 <p> 1003 The matchinfo function returns a blob value. If used within a query that 1003 The matchinfo function returns a blob value. If used within a query that 1004 uses the full-text index (not a "query by rowid" or "linear scan"), then 1004 uses the full-text index (not a "query by rowid" or "linear scan"), then 1005 the blob consists of (2 + <i>C</i> * <i>P</i> * 3) 32-bit unsigned 1005 the blob consists of (2 + <i>C</i> * <i>P</i> * 3) 32-bit unsigned 1006 integers in machine byte-order, where <i>C</i> is the number of columns 1006 integers in machine byte-order, where <i>C</i> is the number of columns 1007 in the FTS3 table being queried, and <i>P</i> is the number of 1007 in the FTS3 table being queried, and <i>P</i> is the number of ................................................................................................................................................................................ 1073 an order of magnitude faster than the second: 1073 an order of magnitude faster than the second: 1074 1074 1075 [Code { 1075 [Code { 1076 SELECT docid, matchinfo(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt; 1076 SELECT docid, matchinfo(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt; 1077 SELECT docid, offsets(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;; 1077 SELECT docid, offsets(tbl) FROM tbl WHERE tbl MATCH &lt;query expression&gt;; 1078 }] 1078 }] 1079 1079 1080 <p> < 1081 The TODO page contains an example of how to take advantage of this in < 1082 a full-text search application. < 1083 < 1084 <p> 1080 <p> 1085 The matchinfo function provides much of the information required to calculate 1081 The matchinfo function provides much of the information required to calculate 1086 probabalistic "bag-of-words" relevancy scores such as 1082 probabalistic "bag-of-words" relevancy scores such as 1087 <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may 1083 <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may 1088 be used to order results in a full-text search application. Also often 1084 be used to order results in a full-text search application. Also often 1089 used in such functions is the length or relative length of each document 1085 used in such functions is the length or relative length of each document 1090 or document field. Unfortunately, this information is not made available 1086 or document field. Unfortunately, this information is not made available 1091 by the matchinfo function as it would require loading extra data from the 1087 by the matchinfo function as it would require loading extra data from the 1092 database, potentially slowing matchinfo() down by an order of magnitude. 1088 database, potentially slowing matchinfo() down by an order of magnitude. 1093 One solution is for the application to store the lengths of each document 1089 One solution is for the application to store the lengths of each document 1094 or document field in a separate table for use in calculating relevancy 1090 or document field in a separate table for use in calculating relevancy 1095 scores. The TODO page contains an example of this technique. | 1091 scores. Appendix A of this document, "\[search application tips\]", contains > 1092 an example of using the matchinfo() function efficiently. 1096 1093 1097 [h1 "Tokenizers" tokenizer {tokenizer}] 1094 [h1 "Tokenizers" tokenizer {tokenizer}] 1098 1095 1099 <p> 1096 <p> 1100 An FTS3 tokenizer is a set of rules for extracting terms from a document 1097 An FTS3 tokenizer is a set of rules for extracting terms from a document 1101 or basic FTS3 full-text query. 1098 or basic FTS3 full-text query. 1102 1099 ................................................................................................................................................................................ 1560 <p> 1557 <p> 1561 For doclists for which the term appears in more than one column of the FTS3 1558 For doclists for which the term appears in more than one column of the FTS3 1562 virtual table, term-offset lists within the doclist are stored in column 1559 virtual table, term-offset lists within the doclist are stored in column 1563 number order. This ensures that the term-offset list associated with 1560 number order. This ensures that the term-offset list associated with 1564 column 0 (if any) is always first, allowing the first two fields of the 1561 column 0 (if any) is always first, allowing the first two fields of the 1565 term-offset list to be omitted in this case. 1562 term-offset list to be omitted in this case. 1566 1563 > 1564 [h1 "Appendix A: Search Application Tips" {} "search application tips"] > 1565 > 1566 <p> > 1567 FTS3 is primarily designed to support Boolean full-text queries - queries > 1568 to find the set of documents that match a specified criteria. However, many > 1569 (most?) search applications require that results are somehow ranked in order > 1570 of "relevance", where "relevance" is defined as the likelihood that the user > 1571 who performed the search is interested in a specific element of the returned > 1572 set of documents. When using a search engine to find documents on the world > 1573 wide web, the user expects that the most useful, or "relevant", documents > 1574 will be returned as the first page of results, and that each subsequent page > 1575 contains progressively less relevant results. Exactly how a machine can > 1576 determine document relevance based on a users query is a complicated problem > 1577 and the subject of much ongoing research. > 1578 > 1579 <p> > 1580 One very simple scheme might be to count the number of instances of the > 1581 users search terms in each result document. Those documents that contain > 1582 many instances of the terms are considered more relevant than those with > 1583 a small number of instances of each term. In an FTS3 application, the > 1584 number of term instances in each result could be determined by counting > 1585 the number of integers in the return value of the \[offsets\] function. > 1586 The following example shows a query that could be used to obtain the > 1587 ten most relevant results for a query entered by the user: > 1588 > 1589 [Code { > 1590 <i>-- This example (and all others in this section) assumes the following sche > 1591 CREATE VIRTUAL TABLE documents USING fts3(title, content); > 1592 > 1593 <i>-- Assuming the application has supplied an SQLite user function named "cou > 1594 <i>-- that returns the number of space-separated integers contained in its onl > 1595 <i>-- the following query could be used to return the titles of the 10 documen > 1596 <i>-- the greatest number of instances of the users query terms. Hopefully, th > 1597 <i>-- documents will be those that the users considers more or less the most " > 1598 SELECT title FROM documents > 1599 WHERE documents MATCH &lt;query&gt; > 1600 ORDER BY countintegers(offsets(document)) DESC > 1601 OFFSET 0 LIMIT 10 > 1602 }] > 1603 > 1604 <p> > 1605 The query above could be made to run faster by using the FTS3 \[matchinfo\] > 1606 function to determine the number of query term instances that appear in each > 1607 result. The matchinfo function is much more efficient than the offsets > 1608 function. Furthermore, the matchinfo function provides extra information > 1609 regarding the overall number of occurences of each query term in the entire > 1610 document set (not just the current row) and the number of documents in which > 1611 each query term appears. This may be used (for example) to attach a higher > 1612 weight to less common terms which may increase the overall computed relevancy > 1613 of those results the user considers more interesting. > 1614 > 1615 [Code { > 1616 <i>-- If the application supplies an SQLite user function called "rank" that</ > 1617 <i>-- interprets the blob of data returned by matchinfo and returns a numeric< > 1618 <i>-- relevancy based on it, then the following SQL may be used to return the< > 1619 <i>-- titles of the 10 most relevant documents in the dataset for a users quer > 1620 SELECT title FROM documents > 1621 WHERE documents MATCH &lt;query&gt; > 1622 ORDER BY rank(matchinfo(document)) DESC > 1623 OFFSET 0 LIMIT 10 > 1624 }] > 1625 > 1626 <p> > 1627 The SQL query in the example above uses less CPU than the first example > 1628 in this section, but still has a non-obvious performance problem. SQLite > 1629 satisfies this query by retreiving the value of the "title" column and > 1630 matchinfo data from the FTS3 module for every row matched by the users > 1631 query before it sorts and limits the results. Because of the way SQLite's > 1632 virtual table interface works, retrieving the value of the "title" column > 1633 requires loading the entire row from disk (including the "content" field, > 1634 which may be quite large). This means that if the users query matches > 1635 several thousand documents, many megabytes of "title" and "content" data > 1636 may be loaded from disk into memory even though they will never be used > 1637 for any purpose. > 1638 > 1639 <p> > 1640 The SQL query in the following example block is one solution to this > 1641 problem. In SQLite, when a <a href="optoverview.html#flattening">sub-query > 1642 used in a join contains a LIMIT clause</a>, the results of the sub-query are > 1643 calcuated and stored in temporary table before the main query is executed. > 1644 This means that SQLite will load only the docid and matchinfo data for each > 1645 row matching the users query into memory, determine the docid values > 1646 corresponding to the ten most relevant documents, then load only the title > 1647 and content information for those 10 documents only. Because both the matchinf > 1648 and docid values are gleaned entirely from the full-text index, this results > 1649 in dramatically less data being loaded from the database into memory. > 1650 > 1651 [Code { > 1652 SELECT title FROM documents JOIN ( > 1653 SELECT docid, rank(matchinfo(document)) AS rank > 1654 FROM documents > 1655 WHERE documents MATCH &lt;query&gt; > 1656 ORDER BY rank DESC > 1657 OFFSET 0 LIMIT 10 > 1658 ) AS ranktable USING(docid) > 1659 ORDER BY ranktable.rank DESC > 1660 }] > 1661 > 1662 <p> > 1663 The next block of SQL enhances the query with solutions to two other problems > 1664 that may arise in developing search applications using FTS3: > 1665 > 1666 <ol> > 1667 <li> <p> > 1668 The \[snippet\] function cannot be used with the above query. Because > 1669 the outer query does not include a "WHERE ... MATCH" clause, the snippet > 1670 function may not be used with it. One solution is to duplicate the WHERE > 1671 clause used by the sub-query in the outer query. The overhead associated > 1672 with this is usually negligible. > 1673 <li> <p> > 1674 The relevancy of a document may depend on something other than just > 1675 the data available in the return value of matchinfo. For example > 1676 each document in the database may be assigned a static weight based > 1677 on factors unrelated to its content (origin, author, age, number > 1678 of references etc.). These values can be stored by the application > 1679 in a separate table that can be joined against the documents table > 1680 in the sub-query so that the rank function may access them. > 1681 </ol> > 1682 > 1683 <p> > 1684 This version of the query is very similar to that used by the > 1685 <a href="http://www.sqlite.org/search?q=fts3">sqlite.org documentation search< > 1686 application. > 1687 > 1688 [Code { > 1689 <i>-- This table stores the static weight assigned to each document in FTS3 ta > 1690 <i>-- "documents". For each row in the documents table there is a correspondin > 1691 <i>-- with the same docid value in this table.</i> > 1692 CREATE TABLE documents_data(docid INTEGER PRIMARY KEY, weight); > 1693 > 1694 <i>-- This query is similar to the one in the block above, except that:</i> > 1695 <i>--</i> > 1696 <i>-- 1. It returns a "snippet" of text along with the document title for di > 1697 <i>-- that the snippet function may be used, the "WHERE ... MATCH ..." cl > 1698 <i>-- the sub-query is duplicated in the outer query.</i> > 1699 <i>--</i> > 1700 <i>-- 2. The sub-query joins the documents table with the document_data tabl > 1701 <i>-- implementation of the rank function has access to the static weight > 1702 <i>-- to each document.</i> > 1703 SELECT title, snippet(documents) FROM documents JOIN ( > 1704 SELECT docid, rank(matchinfo(document), documents_data.weight) AS rank > 1705 FROM documents JOIN documents_data USING(docid) > 1706 WHERE documents MATCH &lt;query&gt; > 1707 ORDER BY rank DESC > 1708 OFFSET 0 LIMIT 10 > 1709 ) AS ranktable USING(docid) > 1710 WHERE documents MATCH &lt;query&gt; > 1711 ORDER BY ranktable.rank DESC > 1712 }] > 1713 > 1714 <p> > 1715 All the example queries above return the ten most relevant query results. > 1716 By modifying the values used with the OFFSET and LIMIT clauses, a query > 1717 to return (say) the next ten most relevant results is easy to construct. > 1718 This may be used to obtain the data required for a search applications second > 1719 and subsequent pages of results. > 1720 > 1721 <p> > 1722 The next block contains an example rank function that uses matchinfo data > 1723 implemented in C. Instead of a single weight, it allows a weight to be > 1724 externally assigned to each column of each document. It may be registered > 1725 with SQLite like any other user function using \[sqlite3_create_function()\]. > 1726 > 1727 [Code [string map {[ &#x5B; ] &#x5D;} { > 1728 <i>/*</i> > 1729 <i>** SQLite user defined function to use with matchinfo() to calculate the</i> > 1730 <i>** relevancy of an FTS3 match. The value returned is the relevancy score</i> > 1731 <i>** (a real value greater than or equal to zero). A larger value indicates </i > 1732 <i>** a more relevant document.</i> > 1733 <i>**</i> > 1734 <i>** The overall relevancy returned is the sum of the relevancies of each </i> > 1735 <i>** column value in the FTS3 table. The relevancy of a column value is the</i> > 1736 <i>** sum of the following for each reportable phrase in the FTS3 query:</i> > 1737 <i>**</i> > 1738 <i>** (&lt;hit count&gt; / &lt;global hit count&gt) * &lt;column weight&gt;</i > 1739 <i>**</i> > 1740 <i>** where &lt;hit count&gt; is the number of instances of the phrase in the</i > 1741 <i>** column value of the current row and &lt;global hit count&gt; is the number > 1742 <i>** of instances of the phrase in the same column of all rows in the FTS3</i> > 1743 <i>** table. The &lt;column weight&gt; is a weighting factor assigned to each</i > 1744 <i>** column by the caller (see below).</i> > 1745 <i>**</i> > 1746 <i>** The first argument to this function must be the return value of the FTS3 < > 1747 <i>** matchinfo() function. Following this must be one argument for each column > 1748 <i>** of the FTS3 table containing a numeric weight factor for the corresponding > 1749 <i>** column. Example:</i> > 1750 <i>**</i> > 1751 <i>** CREATE VIRTUAL TABLE documents USING fts3(title, content)</i> > 1752 <i>**</i> > 1753 <i>** The following query returns the docids of documents that match the full-te > 1754 <i>** query &lt;query&gt; sorted from most to least relevant. When calculating</ > 1755 <i>** relevance, query term instances in the 'title' column are given twice the< > 1756 <i>** weighting of those in the 'content' column.</i> > 1757 <i>**</i> > 1758 <i>** SELECT docid FROM documents </i> > 1759 <i>** WHERE documents MATCH &lt;query&gt; </i> > 1760 <i>** ORDER BY rank(matchinfo(documents), 1.0, 0.5) DESC</i> > 1761 <i>*/</i> > 1762 static void rankfunc(sqlite3_context *pCtx, int nVal, sqlite3_value **apVal){ > 1763 int *aMatchinfo; <i>/* Return value of matchinfo() */</i> > 1764 int nCol; <i>/* Number of columns in the table */</i> > 1765 int nPhrase; <i>/* Number of phrases in the query */</i> > 1766 int iPhrase; <i>/* Current phrase */</i> > 1767 double score = 0.0; <i>/* Value to return */</i> > 1768 > 1769 assert( sizeof(int)==4 ); > 1770 > 1771 <i> /* Check that the number of arguments passed to this function is correct.</ > 1772 <i> ** If not, jump to wrong_number_args. Set aMatchinfo to point to the array< > 1773 <i> ** of unsigned integer values returned by FTS3 function matchinfo. Set</i> > 1774 <i> ** nPhrase to contain the number of reportable phrases in the users full-te > 1775 <i> ** query, and nCol to the number of columns in the table.</i> > 1776 <i> */</i> > 1777 if( nVal&lt;1 ) goto wrong_number_args; > 1778 aMatchinfo = (unsigned int *)sqlite3_value_blob(apVal[0]); > 1779 nPhrase = aMatchinfo[0]; > 1780 nCol = aMatchinfo[1]; > 1781 if( nVal!=(1+nCol) ) goto wrong_number_args; > 1782 > 1783 <i> /* Iterate through each phrase in the users query. */</i> > 1784 for(iPhrase=0; iPhrase&lt;nPhrase; iPhrase++){ > 1785 int iCol; <i>/* Current column */</i> > 1786 > 1787 <i> /* Now iterate through each column in the users query. For each column,</ > 1788 <i> ** increment the relevancy score by:</i> > 1789 <i> **</i> > 1790 <i> ** (&lt;hit count&gt; / &lt;global hit count&gt) * &lt;column weight&gt > 1791 <i> **</i> > 1792 <i> ** aPhraseinfo[] points to the start of the data for phrase iPhrase. So</ > 1793 <i> ** the hit count and global hit counts for each column are found in </i> > 1794 <i> ** aPhraseinfo[iCol*3] and aPhraseinfo[iCol*3+1], respectively.</i> > 1795 <i> */</i> > 1796 int *aPhraseinfo = &aMatchinfo[2 + iPhrase*nCol*3]; > 1797 for(iCol=0; iCol&lt;nCol; iCol++){ > 1798 int nHitCount = aPhraseinfo[3*iCol]; > 1799 int nGlobalHitCount = aPhraseinfo[3*iCol+1]; > 1800 double weight = sqlite3_value_double(apVal[iCol+1]); > 1801 if( nHitCount>0 ){ > 1802 score += ((double)nHitCount / (double)nGlobalHitCount) * weight; > 1803 } > 1804 } > 1805 } > 1806 > 1807 sqlite3_result_double(pCtx, score); > 1808 return; > 1809 > 1810 <i> /* Jump here if the wrong number of arguments are passed to this function * > 1811 wrong_number_args: > 1812 sqlite3_result_error(pCtx, "wrong number of arguments to function rank()", -1) > 1813 } > 1814 }]] > 1815 1567 } 1816 } > 1817