Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Add a section to fts3.in to explain how to use the matchinfo() function efficiently. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
0917dec0676131c0ec080c1e8ecb3ccf |
User & Date: | dan 2010-01-18 14:12:05.000 |
Context
2010-01-20
| ||
05:52 | Fix typo in fts3.html. (check-in: 897b36953b user: dan tags: trunk) | |
2010-01-18
| ||
14:12 | Add a section to fts3.in to explain how to use the matchinfo() function efficiently. (check-in: 0917dec067 user: dan tags: trunk) | |
2010-01-15
| ||
05:16 | Fix another typo in custombuild.html. (check-in: 06a55cbcb5 user: dan tags: trunk) | |
Changes
Changes to pages/fts3.in.
︙ | ︙ | |||
993 994 995 996 997 998 999 | <i>--</i> <i>-- "...the upper portion, [minimum] [temperature] 14-16oC and cool elsewhere,</i> <i>-- [minimum] [temperature] 17-20oC. Cold..."</i> <i>--</i> SELECT snippet(text, '[ ']', '...') FROM text WHERE text MATCH '"min* tem*"' }] | | | 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 | <i>--</i> <i>-- "...the upper portion, [minimum] [temperature] 14-16oC and cool elsewhere,</i> <i>-- [minimum] [temperature] 17-20oC. Cold..."</i> <i>--</i> SELECT snippet(text, '[ ']', '...') FROM text WHERE text MATCH '"min* tem*"' }] [h2 "The Matchinfo Function" matchinfo matchinfo] <p> The matchinfo function returns a blob value. If used within a query that uses the full-text index (not a "query by rowid" or "linear scan"), then the blob consists of (2 + <i>C</i> * <i>P</i> * 3) 32-bit unsigned integers in machine byte-order, where <i>C</i> is the number of columns in the FTS3 table being queried, and <i>P</i> is the number of |
︙ | ︙ | |||
1073 1074 1075 1076 1077 1078 1079 | an order of magnitude faster than the second: [Code { SELECT docid, matchinfo(tbl) FROM tbl WHERE tbl MATCH <query expression>; SELECT docid, offsets(tbl) FROM tbl WHERE tbl MATCH <query expression>; }] | < < < < > | | 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 | an order of magnitude faster than the second: [Code { SELECT docid, matchinfo(tbl) FROM tbl WHERE tbl MATCH <query expression>; SELECT docid, offsets(tbl) FROM tbl WHERE tbl MATCH <query expression>; }] <p> The matchinfo function provides much of the information required to calculate probabalistic "bag-of-words" relevancy scores such as <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may be used to order results in a full-text search application. Also often used in such functions is the length or relative length of each document or document field. Unfortunately, this information is not made available by the matchinfo function as it would require loading extra data from the database, potentially slowing matchinfo() down by an order of magnitude. One solution is for the application to store the lengths of each document or document field in a separate table for use in calculating relevancy scores. Appendix A of this document, "\[search application tips\]", contains an example of using the matchinfo() function efficiently. [h1 "Tokenizers" tokenizer {tokenizer}] <p> An FTS3 tokenizer is a set of rules for extracting terms from a document or basic FTS3 full-text query. |
︙ | ︙ | |||
1560 1561 1562 1563 1564 1565 1566 | <p> For doclists for which the term appears in more than one column of the FTS3 virtual table, term-offset lists within the doclist are stored in column number order. This ensures that the term-offset list associated with column 0 (if any) is always first, allowing the first two fields of the term-offset list to be omitted in this case. | > | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 | <p> For doclists for which the term appears in more than one column of the FTS3 virtual table, term-offset lists within the doclist are stored in column number order. This ensures that the term-offset list associated with column 0 (if any) is always first, allowing the first two fields of the term-offset list to be omitted in this case. [h1 "Appendix A: Search Application Tips" {} "search application tips"] <p> FTS3 is primarily designed to support Boolean full-text queries - queries to find the set of documents that match a specified criteria. However, many (most?) search applications require that results are somehow ranked in order of "relevance", where "relevance" is defined as the likelihood that the user who performed the search is interested in a specific element of the returned set of documents. When using a search engine to find documents on the world wide web, the user expects that the most useful, or "relevant", documents will be returned as the first page of results, and that each subsequent page contains progressively less relevant results. Exactly how a machine can determine document relevance based on a users query is a complicated problem and the subject of much ongoing research. <p> One very simple scheme might be to count the number of instances of the users search terms in each result document. Those documents that contain many instances of the terms are considered more relevant than those with a small number of instances of each term. In an FTS3 application, the number of term instances in each result could be determined by counting the number of integers in the return value of the \[offsets\] function. The following example shows a query that could be used to obtain the ten most relevant results for a query entered by the user: [Code { <i>-- This example (and all others in this section) assumes the following schema</i> CREATE VIRTUAL TABLE documents USING fts3(title, content); <i>-- Assuming the application has supplied an SQLite user function named "countintegers"</i> <i>-- that returns the number of space-separated integers contained in its only argument,</i> <i>-- the following query could be used to return the titles of the 10 documents that contain</i> <i>-- the greatest number of instances of the users query terms. Hopefully, these 10</i> <i>-- documents will be those that the users considers more or less the most "relevant".</i> SELECT title FROM documents WHERE documents MATCH <query> ORDER BY countintegers(offsets(document)) DESC OFFSET 0 LIMIT 10 }] <p> The query above could be made to run faster by using the FTS3 \[matchinfo\] function to determine the number of query term instances that appear in each result. The matchinfo function is much more efficient than the offsets function. Furthermore, the matchinfo function provides extra information regarding the overall number of occurences of each query term in the entire document set (not just the current row) and the number of documents in which each query term appears. This may be used (for example) to attach a higher weight to less common terms which may increase the overall computed relevancy of those results the user considers more interesting. [Code { <i>-- If the application supplies an SQLite user function called "rank" that</i> <i>-- interprets the blob of data returned by matchinfo and returns a numeric</i> <i>-- relevancy based on it, then the following SQL may be used to return the</i> <i>-- titles of the 10 most relevant documents in the dataset for a users query.</i> SELECT title FROM documents WHERE documents MATCH <query> ORDER BY rank(matchinfo(document)) DESC OFFSET 0 LIMIT 10 }] <p> The SQL query in the example above uses less CPU than the first example in this section, but still has a non-obvious performance problem. SQLite satisfies this query by retreiving the value of the "title" column and matchinfo data from the FTS3 module for every row matched by the users query before it sorts and limits the results. Because of the way SQLite's virtual table interface works, retrieving the value of the "title" column requires loading the entire row from disk (including the "content" field, which may be quite large). This means that if the users query matches several thousand documents, many megabytes of "title" and "content" data may be loaded from disk into memory even though they will never be used for any purpose. <p> The SQL query in the following example block is one solution to this problem. In SQLite, when a <a href="optoverview.html#flattening">sub-query used in a join contains a LIMIT clause</a>, the results of the sub-query are calcuated and stored in temporary table before the main query is executed. This means that SQLite will load only the docid and matchinfo data for each row matching the users query into memory, determine the docid values corresponding to the ten most relevant documents, then load only the title and content information for those 10 documents only. Because both the matchinfo and docid values are gleaned entirely from the full-text index, this results in dramatically less data being loaded from the database into memory. [Code { SELECT title FROM documents JOIN ( SELECT docid, rank(matchinfo(document)) AS rank FROM documents WHERE documents MATCH <query> ORDER BY rank DESC OFFSET 0 LIMIT 10 ) AS ranktable USING(docid) ORDER BY ranktable.rank DESC }] <p> The next block of SQL enhances the query with solutions to two other problems that may arise in developing search applications using FTS3: <ol> <li> <p> The \[snippet\] function cannot be used with the above query. Because the outer query does not include a "WHERE ... MATCH" clause, the snippet function may not be used with it. One solution is to duplicate the WHERE clause used by the sub-query in the outer query. The overhead associated with this is usually negligible. <li> <p> The relevancy of a document may depend on something other than just the data available in the return value of matchinfo. For example each document in the database may be assigned a static weight based on factors unrelated to its content (origin, author, age, number of references etc.). These values can be stored by the application in a separate table that can be joined against the documents table in the sub-query so that the rank function may access them. </ol> <p> This version of the query is very similar to that used by the <a href="http://www.sqlite.org/search?q=fts3">sqlite.org documentation search</a> application. [Code { <i>-- This table stores the static weight assigned to each document in FTS3 table</i> <i>-- "documents". For each row in the documents table there is a corresponding row</i> <i>-- with the same docid value in this table.</i> CREATE TABLE documents_data(docid INTEGER PRIMARY KEY, weight); <i>-- This query is similar to the one in the block above, except that:</i> <i>--</i> <i>-- 1. It returns a "snippet" of text along with the document title for display. So</i> <i>-- that the snippet function may be used, the "WHERE ... MATCH ..." clause from</i> <i>-- the sub-query is duplicated in the outer query.</i> <i>--</i> <i>-- 2. The sub-query joins the documents table with the document_data table, so that</i> <i>-- implementation of the rank function has access to the static weight assigned</i> <i>-- to each document.</i> SELECT title, snippet(documents) FROM documents JOIN ( SELECT docid, rank(matchinfo(document), documents_data.weight) AS rank FROM documents JOIN documents_data USING(docid) WHERE documents MATCH <query> ORDER BY rank DESC OFFSET 0 LIMIT 10 ) AS ranktable USING(docid) WHERE documents MATCH <query> ORDER BY ranktable.rank DESC }] <p> All the example queries above return the ten most relevant query results. By modifying the values used with the OFFSET and LIMIT clauses, a query to return (say) the next ten most relevant results is easy to construct. This may be used to obtain the data required for a search applications second and subsequent pages of results. <p> The next block contains an example rank function that uses matchinfo data implemented in C. Instead of a single weight, it allows a weight to be externally assigned to each column of each document. It may be registered with SQLite like any other user function using \[sqlite3_create_function()\]. [Code [string map {[ [ ] ]} { <i>/*</i> <i>** SQLite user defined function to use with matchinfo() to calculate the</i> <i>** relevancy of an FTS3 match. The value returned is the relevancy score</i> <i>** (a real value greater than or equal to zero). A larger value indicates </i> <i>** a more relevant document.</i> <i>**</i> <i>** The overall relevancy returned is the sum of the relevancies of each </i> <i>** column value in the FTS3 table. The relevancy of a column value is the</i> <i>** sum of the following for each reportable phrase in the FTS3 query:</i> <i>**</i> <i>** (<hit count> / <global hit count>) * <column weight></i> <i>**</i> <i>** where <hit count> is the number of instances of the phrase in the</i> <i>** column value of the current row and <global hit count> is the number</i> <i>** of instances of the phrase in the same column of all rows in the FTS3</i> <i>** table. The <column weight> is a weighting factor assigned to each</i> <i>** column by the caller (see below).</i> <i>**</i> <i>** The first argument to this function must be the return value of the FTS3 </i> <i>** matchinfo() function. Following this must be one argument for each column </i> <i>** of the FTS3 table containing a numeric weight factor for the corresponding </i> <i>** column. Example:</i> <i>**</i> <i>** CREATE VIRTUAL TABLE documents USING fts3(title, content)</i> <i>**</i> <i>** The following query returns the docids of documents that match the full-text</i> <i>** query <query> sorted from most to least relevant. When calculating</i> <i>** relevance, query term instances in the 'title' column are given twice the</i> <i>** weighting of those in the 'content' column.</i> <i>**</i> <i>** SELECT docid FROM documents </i> <i>** WHERE documents MATCH <query> </i> <i>** ORDER BY rank(matchinfo(documents), 1.0, 0.5) DESC</i> <i>*/</i> static void rankfunc(sqlite3_context *pCtx, int nVal, sqlite3_value **apVal){ int *aMatchinfo; <i>/* Return value of matchinfo() */</i> int nCol; <i>/* Number of columns in the table */</i> int nPhrase; <i>/* Number of phrases in the query */</i> int iPhrase; <i>/* Current phrase */</i> double score = 0.0; <i>/* Value to return */</i> assert( sizeof(int)==4 ); <i> /* Check that the number of arguments passed to this function is correct.</i> <i> ** If not, jump to wrong_number_args. Set aMatchinfo to point to the array</i> <i> ** of unsigned integer values returned by FTS3 function matchinfo. Set</i> <i> ** nPhrase to contain the number of reportable phrases in the users full-text</i> <i> ** query, and nCol to the number of columns in the table.</i> <i> */</i> if( nVal<1 ) goto wrong_number_args; aMatchinfo = (unsigned int *)sqlite3_value_blob(apVal[0]); nPhrase = aMatchinfo[0]; nCol = aMatchinfo[1]; if( nVal!=(1+nCol) ) goto wrong_number_args; <i> /* Iterate through each phrase in the users query. */</i> for(iPhrase=0; iPhrase<nPhrase; iPhrase++){ int iCol; <i>/* Current column */</i> <i> /* Now iterate through each column in the users query. For each column,</i> <i> ** increment the relevancy score by:</i> <i> **</i> <i> ** (<hit count> / <global hit count>) * <column weight></i> <i> **</i> <i> ** aPhraseinfo[] points to the start of the data for phrase iPhrase. So</i> <i> ** the hit count and global hit counts for each column are found in </i> <i> ** aPhraseinfo[iCol*3] and aPhraseinfo[iCol*3+1], respectively.</i> <i> */</i> int *aPhraseinfo = &aMatchinfo[2 + iPhrase*nCol*3]; for(iCol=0; iCol<nCol; iCol++){ int nHitCount = aPhraseinfo[3*iCol]; int nGlobalHitCount = aPhraseinfo[3*iCol+1]; double weight = sqlite3_value_double(apVal[iCol+1]); if( nHitCount>0 ){ score += ((double)nHitCount / (double)nGlobalHitCount) * weight; } } } sqlite3_result_double(pCtx, score); return; <i> /* Jump here if the wrong number of arguments are passed to this function */</i> wrong_number_args: sqlite3_result_error(pCtx, "wrong number of arguments to function rank()", -1); } }]] } |