Documentation Source Text

Check-in [2c33993bdc]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Updates to the FTS4 document to describe recent merge=X,Y, automerge=B, and integrity-check commands. Updates to the change log for the same and for other recent changes.
Downloads: Tarball | ZIP archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 2c33993bdc914aa9bc5f9fb11f7019de3cb57589
User & Date: drh 2012-04-02 14:47:19.656
Context
2012-04-02
15:00
Enforce the 80-character line limit on the new documentation source text for FTS4 commands. (check-in: f1d54965d0 user: drh tags: trunk)
14:47
Updates to the FTS4 document to describe recent merge=X,Y, automerge=B, and integrity-check commands. Updates to the change log for the same and for other recent changes. (check-in: 2c33993bdc user: drh tags: trunk)
2012-03-31
12:10
Fix typo in the fileformat2 document. (check-in: 912e06a446 user: drh tags: trunk)
Changes
Unified Diff Ignore Whitespace Patch
Changes to pages/changes.in.
42
43
44
45
46
47
48












49
50
51
52
53
54
55
  }
}

chng {2012 May 1 (3.7.12)} {
<li>Add the [SQLITE_DBSTATUS_CACHE_WRITE] option for [sqlite3_db_status()].
<li>Optimize the [typeof()] and [length()] SQL functions so that they avoid
    unnecessary reading of database content from disk.












}

chng {2012 March 20 (3.7.11)} {
<li>Enhance the [INSERT] syntax to allow multiple rows to be inserted
    via the VALUES clause.
<li>Enhance the [CREATE VIRTUAL TABLE] command to support the
    IF NOT EXISTS clause.







>
>
>
>
>
>
>
>
>
>
>
>







42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
  }
}

chng {2012 May 1 (3.7.12)} {
<li>Add the [SQLITE_DBSTATUS_CACHE_WRITE] option for [sqlite3_db_status()].
<li>Optimize the [typeof()] and [length()] SQL functions so that they avoid
    unnecessary reading of database content from disk.
<li>Add the [FTS4 "merge" command], the [FTS4 "automerge" command], and
    the [FTS4 "integrity-check" command].
<li>Report the name of specific [CHECK] constraints that fail.
<li>In the command-line shell, use popen() instead of fopen() if the first
    character of the argument to the ".output" command is "|".
<li>Make use of OVERLAPPED in the windows [VFS] to avoid some system calls
    and thereby obtain a performance improvement.
<li>More aggressive optimization of the AND operator when one side or the
    other is always false.
<li>Bug fix: Fix the [RELEASE] command so that it does not cancel pending
    queries.  This fixes a problem introduced in 3.7.11.

}

chng {2012 March 20 (3.7.11)} {
<li>Enhance the [INSERT] syntax to allow multiple rows to be inserted
    via the VALUES clause.
<li>Enhance the [CREATE VIRTUAL TABLE] command to support the
    IF NOT EXISTS clause.
Changes to pages/fts3.in.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

<tcl>hd_keywords *fts3 FTS3 {full-text search}</tcl>
<title>SQLite FTS3 and FTS4 Extensions</title>

<table_of_contents>

<h2 style="margin-left:1.0em" notoc> Overview</h2>

<p>
  FTS3 and FTS4 are an SQLite virtual table modules that allows users to perform 
  full-text searches on a set of documents. The most common (and effective) 
  way to describe full-text searches is "what Google, Yahoo and Altavista do
  with documents placed on the World Wide Web". Users input a term, or series 
  of terms, perhaps connected by a binary operator or grouped together into a 
  phrase, and the full-text query system finds the set of documents that best 
  matches those terms considering the operators and groupings the user has 
  specified. This article describes the deployment and usage of FTS3 and FTS4.

<p>









|

|







1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

<tcl>hd_keywords *fts3 FTS3 {full-text search}</tcl>
<title>SQLite FTS3 and FTS4 Extensions</title>

<table_of_contents>

<h2 style="margin-left:1.0em" notoc> Overview</h2>

<p>
  FTS3 and FTS4 are SQLite virtual table modules that allows users to perform 
  full-text searches on a set of documents. The most common (and effective) 
  way to describe full-text searches is "what Google, Yahoo, and Bing do
  with documents placed on the World Wide Web". Users input a term, or series 
  of terms, perhaps connected by a binary operator or grouped together into a 
  phrase, and the full-text query system finds the set of documents that best 
  matches those terms considering the operators and groupings the user has 
  specified. This article describes the deployment and usage of FTS3 and FTS4.

<p>
240
241
242
243
244
245
246
247

248
249
250
251
252
253
254
255
    in which it appears within the table contents. For the curious, a 
    complete description of the [segment btree|data structure] used to store
    this index within the database file appears below. A feature of
    this data structure is that at any time the database may contain not
    one index b-tree, but several different b-trees that are incrementally
    merged as rows are inserted, updated and deleted. This technique improves 
    performance when writing to an FTS table, but causes some overhead for
    full-text queries that use the index. Executing an SQL statement of the

    form "INSERT INTO &lt;fts-table&gt;(&lt;fts-table&gt;) VALUES('optimize')"
    causes FTS to merge all existing index b-trees into a single large
    b-tree containing the entire index. This can be an expensive operation,
    but may speed up future queries. 

  <p>
    For example, to optimize the full-text index for an FTS table named
    "docs":







|
>
|







240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
    in which it appears within the table contents. For the curious, a 
    complete description of the [segment btree|data structure] used to store
    this index within the database file appears below. A feature of
    this data structure is that at any time the database may contain not
    one index b-tree, but several different b-trees that are incrementally
    merged as rows are inserted, updated and deleted. This technique improves 
    performance when writing to an FTS table, but causes some overhead for
    full-text queries that use the index. Evaluating the special ["optimize" command], 
    an SQL statement of the
    form "INSERT INTO &lt;fts-table&gt;(&lt;fts-table&gt;) VALUES('optimize')",
    causes FTS to merge all existing index b-trees into a single large
    b-tree containing the entire index. This can be an expensive operation,
    but may speed up future queries. 

  <p>
    For example, to optimize the full-text index for an FTS table named
    "docs":
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
  <li> For each token in the text fragment that is part of a phrase match,
       the "start match" text is inserted into the fragment before the token,
       and the "end match" text is inserted immediately after it.
</ul>

<p>
  If more than one such fragment can be found, then fragments that contain
  a larger number of "extra" phrase matches are favoured. The start of
  the selected text fragment may be moved a few tokens forward or backward
  to attempt to concentrate the phrase matches toward the center of the
  fragment.

<p>
  Assuming <i>N</i> is a positive value, if no fragments can be found that
  contain a phrase match corresponding to each matchable phrase, the snippet







|







992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
  <li> For each token in the text fragment that is part of a phrase match,
       the "start match" text is inserted into the fragment before the token,
       and the "end match" text is inserted immediately after it.
</ul>

<p>
  If more than one such fragment can be found, then fragments that contain
  a larger number of "extra" phrase matches are favored. The start of
  the selected text fragment may be moved a few tokens forward or backward
  to attempt to concentrate the phrase matches toward the center of the
  fragment.

<p>
  Assuming <i>N</i> is a positive value, if no fragments can be found that
  contain a phrase match corresponding to each matchable phrase, the snippet
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
  values in order to update the full-text index. And the INSERT trigger must
  be fired after the new row is inserted, so as to handle the case where the
  rowid is assigned automatically within the system. The UPDATE trigger must
  be split into two parts, one fired before and one after the update of the
  content table, for the same reasons.

<p>
  FTS4 features a special command similar to the 'optimize' command that
  deletes the entire full-text index and rebuilds it based on the current
  set of documents in the content table. Assuming again that "t3" is the
  name of the external content FTS4 table, the command is:

<codeblock>
  INSERT INTO t3(t3) VALUES('rebuild');
</codeblock>

<p>
  This command may also be used with ordinary FTS4 tables, although it may
  only be useful if the full-text index has somehow become corrupt. It is an
  error to attempt to rebuild the full-text index maintained by a contentless
  FTS4 table.


<tcl>hd_fragment *fts4languageid {FTS4 languageid option}</tcl>
<h2 tags="fts4 languageid option">The languageid= option</h2>

<p>
  When the languageid option is present, it specifies the name of







|


|






|
|

|







1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
  values in order to update the full-text index. And the INSERT trigger must
  be fired after the new row is inserted, so as to handle the case where the
  rowid is assigned automatically within the system. The UPDATE trigger must
  be split into two parts, one fired before and one after the update of the
  content table, for the same reasons.

<p>
  The [FTS4 "rebuild" command]
  deletes the entire full-text index and rebuilds it based on the current
  set of documents in the content table. Assuming again that "t3" is the
  name of the external content FTS4 table, the rebuild command looks like this:

<codeblock>
  INSERT INTO t3(t3) VALUES('rebuild');
</codeblock>

<p>
  This command may also be used with ordinary FTS4 tables, for example if
  the implementation of the tokenizer changes.  It is an
  error to attempt to rebuild the full-text index maintained by a contentless
  FTS4 table, since no content will be available to do the rebuilding.


<tcl>hd_fragment *fts4languageid {FTS4 languageid option}</tcl>
<h2 tags="fts4 languageid option">The languageid= option</h2>

<p>
  When the languageid option is present, it specifies the name of
1732
1733
1734
1735
1736
1737
1738
1739
1740
1741
1742
1743
1744
1745
1746
1747
1748
1749
1750
1751
1752
1753
1754
1755
1756





























































































































1757
1758
1759
1760
1761
1762
1763
  well as complete terms increases the database size and slows down write 
  operations on the FTS4 table.

<p>
  Prefix indexes may be used to optimize [prefix queries] in two cases.
  If the query is for a prefix of N bytes, then a prefix index created
  with "prefix=N" provides the best optimization. Or, if no "prefix=N"
  index is available, a "prefix=N+1" index imay be used instead. 
  Using a "prefix=N+1" index is less
  efficient than a "prefix=N" index, but is better than no prefix index at all.

<codeblock>
  <i>-- Create an FTS4 table with indexes to optimize 2 and 4 byte prefix queries.</i>
  CREATE VIRTUAL TABLE t1 USING fts4(c1, c2, prefix="2,4");

  <i>-- The following two queries are both optimized using the prefix indexes.</i>
  SELECT * FROM t1 WHERE t1 MATCH 'ab*';
  SELECT * FROM t1 WHERE t1 MATCH 'abcd*';

  <i>-- The following two queries are both partially optimized using the prefix</i>
  <i>-- indexes. The optimization is not as pronounced as it is for the queries</i>
  <i>-- above, but still an improvement over no prefix indexes at all.</i>
  SELECT * FROM t1 WHERE t1 MATCH 'a*';
  SELECT * FROM t1 WHERE t1 MATCH 'abc*';
</codeblock>






























































































































<h1 id=tokenizer tags="tokenizer">Tokenizers</h1>

<p>
  An FTS tokenizer is a set of rules for extracting terms from a document 
  or basic FTS full-text query. 








|

















>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>







1733
1734
1735
1736
1737
1738
1739
1740
1741
1742
1743
1744
1745
1746
1747
1748
1749
1750
1751
1752
1753
1754
1755
1756
1757
1758
1759
1760
1761
1762
1763
1764
1765
1766
1767
1768
1769
1770
1771
1772
1773
1774
1775
1776
1777
1778
1779
1780
1781
1782
1783
1784
1785
1786
1787
1788
1789
1790
1791
1792
1793
1794
1795
1796
1797
1798
1799
1800
1801
1802
1803
1804
1805
1806
1807
1808
1809
1810
1811
1812
1813
1814
1815
1816
1817
1818
1819
1820
1821
1822
1823
1824
1825
1826
1827
1828
1829
1830
1831
1832
1833
1834
1835
1836
1837
1838
1839
1840
1841
1842
1843
1844
1845
1846
1847
1848
1849
1850
1851
1852
1853
1854
1855
1856
1857
1858
1859
1860
1861
1862
1863
1864
1865
1866
1867
1868
1869
1870
1871
1872
1873
1874
1875
1876
1877
1878
1879
1880
1881
1882
1883
1884
1885
1886
1887
1888
1889
  well as complete terms increases the database size and slows down write 
  operations on the FTS4 table.

<p>
  Prefix indexes may be used to optimize [prefix queries] in two cases.
  If the query is for a prefix of N bytes, then a prefix index created
  with "prefix=N" provides the best optimization. Or, if no "prefix=N"
  index is available, a "prefix=N+1" index may be used instead. 
  Using a "prefix=N+1" index is less
  efficient than a "prefix=N" index, but is better than no prefix index at all.

<codeblock>
  <i>-- Create an FTS4 table with indexes to optimize 2 and 4 byte prefix queries.</i>
  CREATE VIRTUAL TABLE t1 USING fts4(c1, c2, prefix="2,4");

  <i>-- The following two queries are both optimized using the prefix indexes.</i>
  SELECT * FROM t1 WHERE t1 MATCH 'ab*';
  SELECT * FROM t1 WHERE t1 MATCH 'abcd*';

  <i>-- The following two queries are both partially optimized using the prefix</i>
  <i>-- indexes. The optimization is not as pronounced as it is for the queries</i>
  <i>-- above, but still an improvement over no prefix indexes at all.</i>
  SELECT * FROM t1 WHERE t1 MATCH 'a*';
  SELECT * FROM t1 WHERE t1 MATCH 'abc*';
</codeblock>

<h1 id=commands tags="commands">Special Commands For FTS3 and FTS4</h1>

<p>
  Special INSERT operates can be used to issue commands to FTS3 and FTS4 tables.
  Every FTS3 and FTS4 has a hidden, read-only column which is the same name as
  the table itself.  INSERTs into this hidden column are interpreted as commands
  to the FTS3/4 table.  For a table with the name "xyz" the following commands
  are supported:

<ul>
<li><p>INSERT INTO xyz(xyz) VALUES('optimize');</p>
<li><p>INSERT INTO xyz(xyz) VALUES('rebuild');</p>
<li><p>INSERT INTO xyz(xyz) VALUES('integrity-check');</p>
<li><p>INSERT INTO xyz(xyz) VALUES('merge=X,Y');</p>
<li><p>INSERT INTO xyz(xyz) VALUES('automerge=B');</p>
</ul>

<tcl>hd_fragment *fts4optcmd {FTS4 "optimize" command} {"optimize" command}</tcl>
<h2 id=optimize>The "optimize" command</h2>

<p>
  The "optimize" command causes FTS3/4 to merge together all of its inverted index b-trees into
  one large and complete b-tree.  Doing an optimize will make subsequent queries
  run faster since there are fewer b-trees to search, and it may reduce disk usage
  by coalescing redundant entries.  However, for a large FTS table, running optimize can be as
  expensive as running [VACUUM].  The optimize command essentially has to read and write the
  entire FTS table, resulting in a large transaction.

<p>
  In batch-mode operation, where an FTS table is initially built up using a large number of
  INSERT operations, then queried repeatedly without further changes, it is often a good idea
  to run "optimize" after the last INSERT and before the first query.

<tcl>hd_fragment *fts4rebuidcmd {FTS4 "rebuild" command} {"rebuild" command}</tcl>
<h2 id=rebuild>The "rebuild" command</h2>

<p>
  The "rebuild" command causes SQLite to discard the entire FTS3/4 table and then rebuild it
  again from original text.  The concept is similar to [REINDEX], only that it applies to an
  FTS3/4 table instead of an ordinary index.

<p>
  The "rebuild" command should be run whenever the implementation of a custom tokenizer changes,
  so that all content can be retokenized.  The "rebuild" command is also useful when using
  the [FTS4 content option] after changes have been made to the original content table.

<tcl>hd_fragment *fts4ickcmd {FTS4 "integrity-check" command}</tcl>
<h2 id=integcheck>The "integrity-check" command</h2>

<p>
  The "integrity-check" command causes SQLite to read and verify the accuracy of all inverted
  indices in an FTS3/4 table by comparing those inverted indices against the original content.
  The "integrity-check" command silently succeeds if the inverted indices are all ok, but will
  fail with an SQLITE_CORRUPT error if any problems are found.

<p>
  The "integrity-check" command is similar in concept to [PRAGMA integrity_check].  
  In a working system, the "integrity-command" should aways be successful.  Possible causes
  of integrity-check failures include:
  <ul>
  <li> The application has made changes to the [FTS shadow tables] directly, without using the FTS3/4
       virtual table, causing the shadow tables to become out of sync with each other.
  <li> Using the [FTS4 content option] and failing to manually keep the content in sync with the
       FTS4 inverted indices.
  <li> Bugs in the FTS3/4 virtual table.  (The "integrity-check" command was original conceived
       as part of the test suite for FTS3/4.)
  <li> Corruption to the underlying SQLite database file.  (See documentation on
       [how to corrupt] and SQLite database for additional information.)
  </ul>

<tcl>hd_fragment *fts4mergecmd {FTS4 "merge" command} {"merge" command}</tcl>
<h2 id="mergecmd">The "merge=X,Y" command</h2>

<p>
  The "merge=X,Y" command (where X and Y are integers) causes SQLite to do a limited amount of work
  toward merging the various inverted index b-trees of an FTS3/4 table together into one large b-tree.
  The X value is the target number of "blocks" to be merged, and Y is the minimum number of b-tree
  segments on a level required before merging will be applied to that level.  The value of Y should
  be between 2 and 16 with a recommended value of 8.  The value of X can be any positive integer but
  values on the order of 100 to 300 are recommended.

<p>
  When an FTS table accumulates 16 b-tree segments at the same level, the next INSERT into that
  table will cause all 16 segments to be merged into a single b-tree segment at the next higher level.
  The effect of these level merges is that most INSERTs into an FTS table are very fast and take minimal
  memory, but an occasional INSERT is slow and generates a large transaction because of the need to
  do merging. This results in "spiky" performance of INSERTs.

<p>
  To avoid spiky INSERT performance, an application can run the "merge=X,Y" command periodically,
  possibly in an idle thread or idle process, to ensure that the FTS table never accumulates
  too many b-tree segments at the same level.  INSERT performance spikes can generally be avoided, and
  performance of FTS3/4 can be maximized, by running "merge=X,Y" after every few thousand
  document inserts.  Each "merge=X,Y" command will run in a separate transaction (unless they
  are grouped together using [BEGIN]...[COMMIT], of course).  The transactions can be kept
  small by choosing a value for X in the range of 100 to 300.  The idle thread that is running
  the merge commands can know when it is done by checking the difference in
  [sqlite3_total_changes()] before and after each "merge=X,Y" command and stopping the loop
  when the difference drops below two.

<tcl>hd_fragment *fts4automergecmd {FTS4 "automerge" command} {"automerge" command}</tcl>
<h2 id=automerge">The "automerge=B" command</h2>

<p>
  The "automerge=B" command (where B is either "1" or "0") disables or enables automatic incremental
  inverted index merging for an FTS3/4 table.  The default for new tables is for automatic incremental
  merging to be disabled.  The "automerge=B" command changes this setting.  The change is persistent
  and continues to be in effect for all subsequent database connections to the same database.

<p>
  Enabling automatic incremental merge causes SQLite to do a small amount of inverted index merging
  after every INSERT operation.  The amount of merging performed is designed so that the FTS3/4 table
  never reaches a point where it has 16 segments at the same level and hence has to do a large merge
  in order to complete an insert.  In other words, automatic incremental merging is designed to
  prevent spiky INSERT performance.

<p>
  The downside of automatic incremental merging is that it makes every INSERT, UPDATE, and DELETE
  operation on an FTS3/4 table run a little slower, since extra time must be used to do the incremental
  merge.  For maximum performance, it is recommended that applications disable automatic incremental
  merge and instead use the ["merge" command] in an idle process to keep the inverted indices well
  merged.  But if the structure of an application does not easily allow for idle processes, the use
  of automatic incremental merge is a very reasonable fallback solution.


<h1 id=tokenizer tags="tokenizer">Tokenizers</h1>

<p>
  An FTS tokenizer is a set of rules for extracting terms from a document 
  or basic FTS full-text query. 

1891
1892
1893
1894
1895
1896
1897
1898
1899
1900
1901
1902
1903
1904
1905
  encoded as a blob. Or, if no such tokenizer exists, an SQL exception
  (error) is raised.

<p>
  <b>SECURITY WARNING</b>: If the fts3/4 extension is used in an environment
  where potentially malicious users may execute arbitrary SQL, they should 
  be prevented from invoking the fts3_tokenizer() function, possibly using 
  the [sqlite3_set_authorizer()|authorisation callback].

<p>
  The following block contains an example of calling the fts3_tokenizer()
  function from C code:

<codeblock>
  <i>/*







|







2017
2018
2019
2020
2021
2022
2023
2024
2025
2026
2027
2028
2029
2030
2031
  encoded as a blob. Or, if no such tokenizer exists, an SQL exception
  (error) is raised.

<p>
  <b>SECURITY WARNING</b>: If the fts3/4 extension is used in an environment
  where potentially malicious users may execute arbitrary SQL, they should 
  be prevented from invoking the fts3_tokenizer() function, possibly using 
  the [sqlite3_set_authorizer()|authorization callback].

<p>
  The following block contains an example of calling the fts3_tokenizer()
  function from C code:

<codeblock>
  <i>/*
1962
1963
1964
1965
1966
1967
1968
1969

1970
1971
1972
1973
1974
1975
1976
  This section describes at a high-level the way the FTS module stores its
  index and content in the database. It is <b>not necessary to read or 
  understand the material in this section in order to use FTS</b> in an 
  application. However, it may be useful to application developers attempting 
  to analyze and understand FTS performance characteristics, or to developers 
  contemplating enhancements to the existing FTS feature set.

<tcl>hd_fragment *shadowtab {FTS shadow tables}</tcl>

<p>
  For each FTS virtual table in a database, three to five real (non-virtual) tables
  are created to store the underlying data.  These real tables are called "shadow tables".
  The real tables are named "%_content",
  "%_segdir", "%_segments", "%_stat", and "%_docsize", where "%" is replaced by the name
  of the FTS virtual table.








|
>







2088
2089
2090
2091
2092
2093
2094
2095
2096
2097
2098
2099
2100
2101
2102
2103
  This section describes at a high-level the way the FTS module stores its
  index and content in the database. It is <b>not necessary to read or 
  understand the material in this section in order to use FTS</b> in an 
  application. However, it may be useful to application developers attempting 
  to analyze and understand FTS performance characteristics, or to developers 
  contemplating enhancements to the existing FTS feature set.

<tcl>hd_fragment *shadowtab {FTS shadow tables} {shadow tables}</tcl>
<h2 tags="shadowtabs">Shadow Tables</h2>
<p>
  For each FTS virtual table in a database, three to five real (non-virtual) tables
  are created to store the underlying data.  These real tables are called "shadow tables".
  The real tables are named "%_content",
  "%_segdir", "%_segments", "%_stat", and "%_docsize", where "%" is replaced by the name
  of the FTS virtual table.