Documentation Source Text

Check-in [5ab479563e]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Add some extra details to fts3.in.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 5ab479563e4697dade62d315158ba1a03398ad7e
User & Date: dan 2010-11-27 15:55:03
Context
2010-11-29
12:05
Change a sentence in lang_droptable.html. check-in: 8c82296b21 user: dan tags: trunk
2010-11-27
15:55
Add some extra details to fts3.in. check-in: 5ab479563e user: dan tags: trunk
11:46
Fix typos in fts3.in. check-in: bead827e51 user: dan tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/fts3.in.

62
63
64
65
66
67
68
69
70

71
72
73
74
75


76
77
















78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
...
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
...
152
153
154
155
156
157
158











































159
160
161
162
163
164
165
....
1423
1424
1425
1426
1427
1428
1429































1430
1431
1432
1433
1434
1435
1436
1437
  selects only those rows that contain "linux" as a discrete token. Both 
  searches are case-insensitive. The FTS3 table consumes around 2006 MB on
  disk compared to just 1453 MB for the ordinary table. Using the same
  hardware configuration used to perform the SELECT queries above, the FTS3
  table took just under 31 minutes to populate, versus 25 for the ordinary
  table.

<tcl>hd_fragment fts4 FTS4</tcl>
<h2>Differences between FTS3 and FTS4</h2>


<p>
  FTS3 and FTS4 are nearly identical.  They share most of their code in common, and their
  interfaces are the same.  The only difference is that FTS4 stores some additional information
  about the document collection in two of new [FTS shadow tables].  This additional information allows


  FTS4 to use certain query performance optimizations that FTS3 cannot use.  And the added information
  permits some additional useful output options in the [matchinfo()] function.

















<p>
  FTS4 is an enhancement to FTS3.  FTS3 has been available since SQLite [version 3.5.0] in
  2007-09-04.  The enhancements for FTS4 were added with SQLite [version 3.7.4] on 2010-12-08.

<p>
  Which module, FTS3 or FTS4, should you use in your application?
  FTS4 is sometimes significantly faster than FTS3, even orders
  of magnitude faster depending on the query, though in the common case the performance of the
  two modules is similar.  FTS4 also offers the enhanced [matchinfo()] outputs which can be
  useful in ranking the results of a [FTS MATCH|MATCH] operation.  On the other hand, because FTS4
  stores additional information, FTS4 requires a little more disk space than FTS3, though only a 
  percent of two in most cases.

<p>
  For newer applications, FTS4 is recommended; though if minimal disk usage or compatibility
  with older versions of SQLite are important, then FTS3 will usually serve just as well.  

<h2>Creating and Destroying FTS Tables</h2>

<p>
  Like other virtual table types, new FTS tables are created using a 
  [CREATE VIRTUAL TABLE] statement. The module name, which follows
  the USING keyword, is either "fts3" or "fts4". The virtual table module arguments may
................................................................................
  VIRTUAL TABLE statement used to create an FTS table may be used to specify
  a [tokenizer]. This is done by specifying a string of the form
  "tokenize=&lt;tokenizer name&gt; &lt;tokenizer args&gt;" in place of a column
  name, where &lt;tokenizer name&gt; is the name of the tokenizer to use and
  &lt;tokenizer args&gt; is an optional list of whitespace separated qualifiers
  to pass to the tokenizer implementation. A tokenizer specification may be
  placed anywhere in the column list, but at most one tokenizer declaration is
  allowed for each CREATE VIRTUAL TABLE statement.  The second and subsequent
  tokenizer declaration are interpreted as column names. 
  [tokenizer|See below] for a detailed description of using (and, if
  necessary, implementing) a tokenizer.

<codeblock>
  <i>-- Create an FTS table named "papers" with two columns that uses</i>
  <i>-- the tokenizer "porter".</i>
  CREATE VIRTUAL TABLE papers USING fts3(author, document, tokenize=porter);

  <i>-- Create an FTS table with a single column - "content" - that uses</i>
................................................................................
  CREATE VIRTUAL TABLE data USING fts4(tokenize=simple);

  <i>-- Create an FTS table with two columns that uses the "icu" tokenizer.</i>
  <i>-- The qualifier "en_AU" is passed to the tokenizer implementation</i>
  CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU);
</codeblock>












































<p>
  FTS tables may be dropped from the database using an ordinary [DROP TABLE]
  statement. For example:

<codeblock>
  <i>-- Create, then immediately drop, an FTS4 table.</i>
  CREATE VIRTUAL TABLE data USING fts4();
................................................................................
<p>
  The %_content table contains the unadulterated data inserted by the user 
  into the FTS virtual table by the user. If the user does not explicitly
  supply a "docid" value when inserting records, one is selected automatically
  by the system.

<p>































  The two tables, %_segments and %_segdir, are used to store the 
  full-text index. Conceptually, this index is a lookup table that maps each 
  term (word) to the set of docid values corresponding to records in the 
  %_content table that contain one or more occurrences of the term. To
  retrieve all documents that contain a specified term, the FTS module
  queries this index to determine the set of docid values for records that
  contain the term, then retrieves the required documents from the %_content
  table. Regardless of the schema of the FTS virtual table, the %_segments







<

>


|
|
<
>
>
|
<
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>






|
|
|
|
|
|
|


|
|







 







|
<
<
|







 







>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>







 







>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
|







62
63
64
65
66
67
68

69
70
71
72
73
74

75
76
77

78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
...
149
150
151
152
153
154
155
156


157
158
159
160
161
162
163
164
...
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
....
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
  selects only those rows that contain "linux" as a discrete token. Both 
  searches are case-insensitive. The FTS3 table consumes around 2006 MB on
  disk compared to just 1453 MB for the ordinary table. Using the same
  hardware configuration used to perform the SELECT queries above, the FTS3
  table took just under 31 minutes to populate, versus 25 for the ordinary
  table.


<h2>Differences between FTS3 and FTS4</h2>
<tcl>hd_fragment fts4 FTS4</tcl>

<p>
  FTS3 and FTS4 are nearly identical. They share most of their code in common,
  and their interfaces are the same. The differences are:


<ul>
  <li> <p>FTS4 contains query performance optimizations that may significantly

       improve the performance of full-text queries that contain terms that are
       very common (present in a large percentage of table rows).

  <li> <p>FTS4 supports some additional options that may used with the [matchinfo()]
       function. 

  <li> <p>Because it stores extra information on disk in two new 
       [FTS shadow tables|shadow tables] in order to support the performance
       optimizations and extra matchinfo() options, FTS4 tables may consume more
       disk space than the equivalent table created using FTS3. Usually the overhead
       is 1-2% or less, but may be as high as 10% if the documents stored in the
       FTS table are very small. The overhead may be reduced by specifying the
       directive [matchinfo_fts3|"matchinfo=fts3"] as part of the FTS4 table
       declaration, but this comes at the expense of sacrificing some of the
       extra supported matchinfo() options.
</ul>

<p>
  FTS4 is an enhancement to FTS3.  FTS3 has been available since SQLite [version 3.5.0] in
  2007-09-04.  The enhancements for FTS4 were added with SQLite [version 3.7.4] on 2010-12-08.

<p>
  Which module, FTS3 or FTS4, should you use in your application?  FTS4 is
  sometimes significantly faster than FTS3, even orders of magnitude faster
  depending on the query, though in the common case the performance of the two
  modules is similar. FTS4 also offers the enhanced [matchinfo()] outputs which
  can be useful in ranking the results of a [FTS MATCH|MATCH] operation.  On the
  other hand, in the absence of a [matchinfo_fts3|matchinfo=fts3] directive FTS4 requires a little
  more disk space than FTS3, though only a percent of two in most cases.

<p>
  For newer applications, FTS4 is recommended; though if compatibility with older 
  versions of SQLite is important, then FTS3 will usually serve just as well.  

<h2>Creating and Destroying FTS Tables</h2>

<p>
  Like other virtual table types, new FTS tables are created using a 
  [CREATE VIRTUAL TABLE] statement. The module name, which follows
  the USING keyword, is either "fts3" or "fts4". The virtual table module arguments may
................................................................................
  VIRTUAL TABLE statement used to create an FTS table may be used to specify
  a [tokenizer]. This is done by specifying a string of the form
  "tokenize=&lt;tokenizer name&gt; &lt;tokenizer args&gt;" in place of a column
  name, where &lt;tokenizer name&gt; is the name of the tokenizer to use and
  &lt;tokenizer args&gt; is an optional list of whitespace separated qualifiers
  to pass to the tokenizer implementation. A tokenizer specification may be
  placed anywhere in the column list, but at most one tokenizer declaration is
  allowed for each CREATE VIRTUAL TABLE statement. [tokenizer|See below] for a 


  detailed description of using (and, if necessary, implementing) a tokenizer.

<codeblock>
  <i>-- Create an FTS table named "papers" with two columns that uses</i>
  <i>-- the tokenizer "porter".</i>
  CREATE VIRTUAL TABLE papers USING fts3(author, document, tokenize=porter);

  <i>-- Create an FTS table with a single column - "content" - that uses</i>
................................................................................
  CREATE VIRTUAL TABLE data USING fts4(tokenize=simple);

  <i>-- Create an FTS table with two columns that uses the "icu" tokenizer.</i>
  <i>-- The qualifier "en_AU" is passed to the tokenizer implementation</i>
  CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU);
</codeblock>

<p>
<tcl>hd_fragment *matchinfo_fts3 {matchinfo_fts3}</tcl>
  If the "CREATE VIRTUAL TABLE" statement specifies module FTS4 (not FTS3), 
  then the special directive "matchinfo=fts3" may also appear in place of
  a column name. If it does, then some of the extra information stored by
  FTS4 is omitted. This reduces the amount of disk space consumed by an FTS4 
  table until it is almost the same as the amount that would be used by
  the equivalent FTS3 table, but also means that the data accessed by passing
  the 'l' flag to the [matchinfo()] function is not available. For example:

<codeblock>
  <i>-- Create a reduced-footprint FTS4 table.</i>
  CREATE VIRTUAL TABLE papers USING fts4(author, document, matchinfo=fts3);
</codeblock>

<p>
  When using FTS4, specifying a column name that contains an "=" character
  and is not either a "tokenize=*" or "matchinfo=fts3" directive is an error.
  With FTS3, the first token in the unrecognized directive is interpreted
  as a column name. Similarly, specifying multiple "tokenize=*" directives
  in a single table declaration is an error when using FTS4, whereas the
  second and subsequent "tokenize=*" directives are interpreted as column
  names by FTS3. For example:

<codeblock>
  <i>-- An error. FTS4 does not recognize the directive "xyz=abc".</i>
  CREATE VIRTUAL TABLE papers USING fts4(author, document, xyz=abc);

  <i>-- Create an FTS3 table with three columns - "author", "document"</i>
  <i>-- and "xyz".</i>
  CREATE VIRTUAL TABLE papers USING fts3(author, document, xyz=abc);

  <i>-- An error. FTS4 does not allow multiple tokenize=* directives</i>
  CREATE VIRTUAL TABLE papers USING fts4(tokenize=porter, tokenize=simple);

  <i>-- Create an FTS3 table with a single column named "tokenize". The</i>
  <i>-- table uses the "porter" tokenizer.</i>
  CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple);

  <i>-- An error. Cannot create a table with two columns named "tokenize".</i>
  CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple, tokenize=icu);
</codeblock>

<p>
  FTS tables may be dropped from the database using an ordinary [DROP TABLE]
  statement. For example:

<codeblock>
  <i>-- Create, then immediately drop, an FTS4 table.</i>
  CREATE VIRTUAL TABLE data USING fts4();
................................................................................
<p>
  The %_content table contains the unadulterated data inserted by the user 
  into the FTS virtual table by the user. If the user does not explicitly
  supply a "docid" value when inserting records, one is selected automatically
  by the system.

<p>
  The %_stat and %_docsize tables are only created if the FTS table uses the
  FTS4 module, not FTS3. Furthermore, the %_docsize table is omitted if the
  FTS4 table is created with the [matchinfo_fts3|"matchinfo=fts3"] directive
  specified as part of the CREATE VIRTUAL TABLE statement. If they are created,
  the schema of the two tables is as follows:
<codeblock>
  CREATE TABLE %_stat(
    id INTEGER PRIMARY KEY, 
    value BLOB
  );

  CREATE TABLE %_docsize(
    docid INTEGER PRIMARY KEY,
    size BLOB
  );
</codeblock>

<p>
  For each row in the FTS table, the %_docsize table contains a corresponding
  row with the same "docid" value. The "size" field contains a blob consisting
  of <i>N</i> FTS varints, where <i>N</i> is the number of user-defined columns
  in the table. Each varint in the "size" blob is the number of tokens in the
  corresponding column of the associated row in the FTS table. The %_stat table
  always contains a single row with the "id" column set to 0. The "value" 
  column contains a blob consisting of <i>N+1</i> FTS varints, where <i>N</i>
  is again the number of user-defined columns in the FTS table. The first
  varint in the blob is set to the total number of rows in the FTS table. The
  second and subsequent varints contain the total number of tokens stored in
  the corresponding column for all rows of the FTS table.

<p>
  The two remaining tables, %_segments and %_segdir, are used to store the 
  full-text index. Conceptually, this index is a lookup table that maps each 
  term (word) to the set of docid values corresponding to records in the 
  %_content table that contain one or more occurrences of the term. To
  retrieve all documents that contain a specified term, the FTS module
  queries this index to determine the set of docid values for records that
  contain the term, then retrieves the required documents from the %_content
  table. Regardless of the schema of the FTS virtual table, the %_segments