Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Updates to the FTS4 document to describe recent merge=X,Y, automerge=B, and integrity-check commands. Updates to the change log for the same and for other recent changes. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
2c33993bdc914aa9bc5f9fb11f7019de |
User & Date: | drh 2012-04-02 14:47:19.656 |
Context
2012-04-02
| ||
15:00 | Enforce the 80-character line limit on the new documentation source text for FTS4 commands. (check-in: f1d54965d0 user: drh tags: trunk) | |
14:47 | Updates to the FTS4 document to describe recent merge=X,Y, automerge=B, and integrity-check commands. Updates to the change log for the same and for other recent changes. (check-in: 2c33993bdc user: drh tags: trunk) | |
2012-03-31
| ||
12:10 | Fix typo in the fileformat2 document. (check-in: 912e06a446 user: drh tags: trunk) | |
Changes
Changes to pages/changes.in.
︙ | ︙ | |||
42 43 44 45 46 47 48 49 50 51 52 53 54 55 | } } chng {2012 May 1 (3.7.12)} { <li>Add the [SQLITE_DBSTATUS_CACHE_WRITE] option for [sqlite3_db_status()]. <li>Optimize the [typeof()] and [length()] SQL functions so that they avoid unnecessary reading of database content from disk. } chng {2012 March 20 (3.7.11)} { <li>Enhance the [INSERT] syntax to allow multiple rows to be inserted via the VALUES clause. <li>Enhance the [CREATE VIRTUAL TABLE] command to support the IF NOT EXISTS clause. | > > > > > > > > > > > > | 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 | } } chng {2012 May 1 (3.7.12)} { <li>Add the [SQLITE_DBSTATUS_CACHE_WRITE] option for [sqlite3_db_status()]. <li>Optimize the [typeof()] and [length()] SQL functions so that they avoid unnecessary reading of database content from disk. <li>Add the [FTS4 "merge" command], the [FTS4 "automerge" command], and the [FTS4 "integrity-check" command]. <li>Report the name of specific [CHECK] constraints that fail. <li>In the command-line shell, use popen() instead of fopen() if the first character of the argument to the ".output" command is "|". <li>Make use of OVERLAPPED in the windows [VFS] to avoid some system calls and thereby obtain a performance improvement. <li>More aggressive optimization of the AND operator when one side or the other is always false. <li>Bug fix: Fix the [RELEASE] command so that it does not cancel pending queries. This fixes a problem introduced in 3.7.11. } chng {2012 March 20 (3.7.11)} { <li>Enhance the [INSERT] syntax to allow multiple rows to be inserted via the VALUES clause. <li>Enhance the [CREATE VIRTUAL TABLE] command to support the IF NOT EXISTS clause. |
︙ | ︙ |
Changes to pages/fts3.in.
1 2 3 4 5 6 7 8 9 | <tcl>hd_keywords *fts3 FTS3 {full-text search}</tcl> <title>SQLite FTS3 and FTS4 Extensions</title> <table_of_contents> <h2 style="margin-left:1.0em" notoc> Overview</h2> <p> | | | | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | <tcl>hd_keywords *fts3 FTS3 {full-text search}</tcl> <title>SQLite FTS3 and FTS4 Extensions</title> <table_of_contents> <h2 style="margin-left:1.0em" notoc> Overview</h2> <p> FTS3 and FTS4 are SQLite virtual table modules that allows users to perform full-text searches on a set of documents. The most common (and effective) way to describe full-text searches is "what Google, Yahoo, and Bing do with documents placed on the World Wide Web". Users input a term, or series of terms, perhaps connected by a binary operator or grouped together into a phrase, and the full-text query system finds the set of documents that best matches those terms considering the operators and groupings the user has specified. This article describes the deployment and usage of FTS3 and FTS4. <p> |
︙ | ︙ | |||
240 241 242 243 244 245 246 | in which it appears within the table contents. For the curious, a complete description of the [segment btree|data structure] used to store this index within the database file appears below. A feature of this data structure is that at any time the database may contain not one index b-tree, but several different b-trees that are incrementally merged as rows are inserted, updated and deleted. This technique improves performance when writing to an FTS table, but causes some overhead for | | > | | 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 | in which it appears within the table contents. For the curious, a complete description of the [segment btree|data structure] used to store this index within the database file appears below. A feature of this data structure is that at any time the database may contain not one index b-tree, but several different b-trees that are incrementally merged as rows are inserted, updated and deleted. This technique improves performance when writing to an FTS table, but causes some overhead for full-text queries that use the index. Evaluating the special ["optimize" command], an SQL statement of the form "INSERT INTO <fts-table>(<fts-table>) VALUES('optimize')", causes FTS to merge all existing index b-trees into a single large b-tree containing the entire index. This can be an expensive operation, but may speed up future queries. <p> For example, to optimize the full-text index for an FTS table named "docs": |
︙ | ︙ | |||
991 992 993 994 995 996 997 | <li> For each token in the text fragment that is part of a phrase match, the "start match" text is inserted into the fragment before the token, and the "end match" text is inserted immediately after it. </ul> <p> If more than one such fragment can be found, then fragments that contain | | | 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 | <li> For each token in the text fragment that is part of a phrase match, the "start match" text is inserted into the fragment before the token, and the "end match" text is inserted immediately after it. </ul> <p> If more than one such fragment can be found, then fragments that contain a larger number of "extra" phrase matches are favored. The start of the selected text fragment may be moved a few tokens forward or backward to attempt to concentrate the phrase matches toward the center of the fragment. <p> Assuming <i>N</i> is a positive value, if no fragments can be found that contain a phrase match corresponding to each matchable phrase, the snippet |
︙ | ︙ | |||
1636 1637 1638 1639 1640 1641 1642 | values in order to update the full-text index. And the INSERT trigger must be fired after the new row is inserted, so as to handle the case where the rowid is assigned automatically within the system. The UPDATE trigger must be split into two parts, one fired before and one after the update of the content table, for the same reasons. <p> | | | | | | | 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 | values in order to update the full-text index. And the INSERT trigger must be fired after the new row is inserted, so as to handle the case where the rowid is assigned automatically within the system. The UPDATE trigger must be split into two parts, one fired before and one after the update of the content table, for the same reasons. <p> The [FTS4 "rebuild" command] deletes the entire full-text index and rebuilds it based on the current set of documents in the content table. Assuming again that "t3" is the name of the external content FTS4 table, the rebuild command looks like this: <codeblock> INSERT INTO t3(t3) VALUES('rebuild'); </codeblock> <p> This command may also be used with ordinary FTS4 tables, for example if the implementation of the tokenizer changes. It is an error to attempt to rebuild the full-text index maintained by a contentless FTS4 table, since no content will be available to do the rebuilding. <tcl>hd_fragment *fts4languageid {FTS4 languageid option}</tcl> <h2 tags="fts4 languageid option">The languageid= option</h2> <p> When the languageid option is present, it specifies the name of |
︙ | ︙ | |||
1732 1733 1734 1735 1736 1737 1738 | well as complete terms increases the database size and slows down write operations on the FTS4 table. <p> Prefix indexes may be used to optimize [prefix queries] in two cases. If the query is for a prefix of N bytes, then a prefix index created with "prefix=N" provides the best optimization. Or, if no "prefix=N" | | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 | well as complete terms increases the database size and slows down write operations on the FTS4 table. <p> Prefix indexes may be used to optimize [prefix queries] in two cases. If the query is for a prefix of N bytes, then a prefix index created with "prefix=N" provides the best optimization. Or, if no "prefix=N" index is available, a "prefix=N+1" index may be used instead. Using a "prefix=N+1" index is less efficient than a "prefix=N" index, but is better than no prefix index at all. <codeblock> <i>-- Create an FTS4 table with indexes to optimize 2 and 4 byte prefix queries.</i> CREATE VIRTUAL TABLE t1 USING fts4(c1, c2, prefix="2,4"); <i>-- The following two queries are both optimized using the prefix indexes.</i> SELECT * FROM t1 WHERE t1 MATCH 'ab*'; SELECT * FROM t1 WHERE t1 MATCH 'abcd*'; <i>-- The following two queries are both partially optimized using the prefix</i> <i>-- indexes. The optimization is not as pronounced as it is for the queries</i> <i>-- above, but still an improvement over no prefix indexes at all.</i> SELECT * FROM t1 WHERE t1 MATCH 'a*'; SELECT * FROM t1 WHERE t1 MATCH 'abc*'; </codeblock> <h1 id=commands tags="commands">Special Commands For FTS3 and FTS4</h1> <p> Special INSERT operates can be used to issue commands to FTS3 and FTS4 tables. Every FTS3 and FTS4 has a hidden, read-only column which is the same name as the table itself. INSERTs into this hidden column are interpreted as commands to the FTS3/4 table. For a table with the name "xyz" the following commands are supported: <ul> <li><p>INSERT INTO xyz(xyz) VALUES('optimize');</p> <li><p>INSERT INTO xyz(xyz) VALUES('rebuild');</p> <li><p>INSERT INTO xyz(xyz) VALUES('integrity-check');</p> <li><p>INSERT INTO xyz(xyz) VALUES('merge=X,Y');</p> <li><p>INSERT INTO xyz(xyz) VALUES('automerge=B');</p> </ul> <tcl>hd_fragment *fts4optcmd {FTS4 "optimize" command} {"optimize" command}</tcl> <h2 id=optimize>The "optimize" command</h2> <p> The "optimize" command causes FTS3/4 to merge together all of its inverted index b-trees into one large and complete b-tree. Doing an optimize will make subsequent queries run faster since there are fewer b-trees to search, and it may reduce disk usage by coalescing redundant entries. However, for a large FTS table, running optimize can be as expensive as running [VACUUM]. The optimize command essentially has to read and write the entire FTS table, resulting in a large transaction. <p> In batch-mode operation, where an FTS table is initially built up using a large number of INSERT operations, then queried repeatedly without further changes, it is often a good idea to run "optimize" after the last INSERT and before the first query. <tcl>hd_fragment *fts4rebuidcmd {FTS4 "rebuild" command} {"rebuild" command}</tcl> <h2 id=rebuild>The "rebuild" command</h2> <p> The "rebuild" command causes SQLite to discard the entire FTS3/4 table and then rebuild it again from original text. The concept is similar to [REINDEX], only that it applies to an FTS3/4 table instead of an ordinary index. <p> The "rebuild" command should be run whenever the implementation of a custom tokenizer changes, so that all content can be retokenized. The "rebuild" command is also useful when using the [FTS4 content option] after changes have been made to the original content table. <tcl>hd_fragment *fts4ickcmd {FTS4 "integrity-check" command}</tcl> <h2 id=integcheck>The "integrity-check" command</h2> <p> The "integrity-check" command causes SQLite to read and verify the accuracy of all inverted indices in an FTS3/4 table by comparing those inverted indices against the original content. The "integrity-check" command silently succeeds if the inverted indices are all ok, but will fail with an SQLITE_CORRUPT error if any problems are found. <p> The "integrity-check" command is similar in concept to [PRAGMA integrity_check]. In a working system, the "integrity-command" should aways be successful. Possible causes of integrity-check failures include: <ul> <li> The application has made changes to the [FTS shadow tables] directly, without using the FTS3/4 virtual table, causing the shadow tables to become out of sync with each other. <li> Using the [FTS4 content option] and failing to manually keep the content in sync with the FTS4 inverted indices. <li> Bugs in the FTS3/4 virtual table. (The "integrity-check" command was original conceived as part of the test suite for FTS3/4.) <li> Corruption to the underlying SQLite database file. (See documentation on [how to corrupt] and SQLite database for additional information.) </ul> <tcl>hd_fragment *fts4mergecmd {FTS4 "merge" command} {"merge" command}</tcl> <h2 id="mergecmd">The "merge=X,Y" command</h2> <p> The "merge=X,Y" command (where X and Y are integers) causes SQLite to do a limited amount of work toward merging the various inverted index b-trees of an FTS3/4 table together into one large b-tree. The X value is the target number of "blocks" to be merged, and Y is the minimum number of b-tree segments on a level required before merging will be applied to that level. The value of Y should be between 2 and 16 with a recommended value of 8. The value of X can be any positive integer but values on the order of 100 to 300 are recommended. <p> When an FTS table accumulates 16 b-tree segments at the same level, the next INSERT into that table will cause all 16 segments to be merged into a single b-tree segment at the next higher level. The effect of these level merges is that most INSERTs into an FTS table are very fast and take minimal memory, but an occasional INSERT is slow and generates a large transaction because of the need to do merging. This results in "spiky" performance of INSERTs. <p> To avoid spiky INSERT performance, an application can run the "merge=X,Y" command periodically, possibly in an idle thread or idle process, to ensure that the FTS table never accumulates too many b-tree segments at the same level. INSERT performance spikes can generally be avoided, and performance of FTS3/4 can be maximized, by running "merge=X,Y" after every few thousand document inserts. Each "merge=X,Y" command will run in a separate transaction (unless they are grouped together using [BEGIN]...[COMMIT], of course). The transactions can be kept small by choosing a value for X in the range of 100 to 300. The idle thread that is running the merge commands can know when it is done by checking the difference in [sqlite3_total_changes()] before and after each "merge=X,Y" command and stopping the loop when the difference drops below two. <tcl>hd_fragment *fts4automergecmd {FTS4 "automerge" command} {"automerge" command}</tcl> <h2 id=automerge">The "automerge=B" command</h2> <p> The "automerge=B" command (where B is either "1" or "0") disables or enables automatic incremental inverted index merging for an FTS3/4 table. The default for new tables is for automatic incremental merging to be disabled. The "automerge=B" command changes this setting. The change is persistent and continues to be in effect for all subsequent database connections to the same database. <p> Enabling automatic incremental merge causes SQLite to do a small amount of inverted index merging after every INSERT operation. The amount of merging performed is designed so that the FTS3/4 table never reaches a point where it has 16 segments at the same level and hence has to do a large merge in order to complete an insert. In other words, automatic incremental merging is designed to prevent spiky INSERT performance. <p> The downside of automatic incremental merging is that it makes every INSERT, UPDATE, and DELETE operation on an FTS3/4 table run a little slower, since extra time must be used to do the incremental merge. For maximum performance, it is recommended that applications disable automatic incremental merge and instead use the ["merge" command] in an idle process to keep the inverted indices well merged. But if the structure of an application does not easily allow for idle processes, the use of automatic incremental merge is a very reasonable fallback solution. <h1 id=tokenizer tags="tokenizer">Tokenizers</h1> <p> An FTS tokenizer is a set of rules for extracting terms from a document or basic FTS full-text query. |
︙ | ︙ | |||
1891 1892 1893 1894 1895 1896 1897 | encoded as a blob. Or, if no such tokenizer exists, an SQL exception (error) is raised. <p> <b>SECURITY WARNING</b>: If the fts3/4 extension is used in an environment where potentially malicious users may execute arbitrary SQL, they should be prevented from invoking the fts3_tokenizer() function, possibly using | | | 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 | encoded as a blob. Or, if no such tokenizer exists, an SQL exception (error) is raised. <p> <b>SECURITY WARNING</b>: If the fts3/4 extension is used in an environment where potentially malicious users may execute arbitrary SQL, they should be prevented from invoking the fts3_tokenizer() function, possibly using the [sqlite3_set_authorizer()|authorization callback]. <p> The following block contains an example of calling the fts3_tokenizer() function from C code: <codeblock> <i>/* |
︙ | ︙ | |||
1962 1963 1964 1965 1966 1967 1968 | This section describes at a high-level the way the FTS module stores its index and content in the database. It is <b>not necessary to read or understand the material in this section in order to use FTS</b> in an application. However, it may be useful to application developers attempting to analyze and understand FTS performance characteristics, or to developers contemplating enhancements to the existing FTS feature set. | | > | 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 | This section describes at a high-level the way the FTS module stores its index and content in the database. It is <b>not necessary to read or understand the material in this section in order to use FTS</b> in an application. However, it may be useful to application developers attempting to analyze and understand FTS performance characteristics, or to developers contemplating enhancements to the existing FTS feature set. <tcl>hd_fragment *shadowtab {FTS shadow tables} {shadow tables}</tcl> <h2 tags="shadowtabs">Shadow Tables</h2> <p> For each FTS virtual table in a database, three to five real (non-virtual) tables are created to store the underlying data. These real tables are called "shadow tables". The real tables are named "%_content", "%_segdir", "%_segments", "%_stat", and "%_docsize", where "%" is replaced by the name of the FTS virtual table. |
︙ | ︙ |