Documentation Source Text

Check-in [bead827e51]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Fix typos in fts3.in.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: bead827e51d77c15f059ff9ab009f779916adf58
User & Date: dan 2010-11-27 11:46:46
Context
2010-11-27
15:55
Add some extra details to fts3.in. check-in: 5ab479563e user: dan tags: trunk
11:46
Fix typos in fts3.in. check-in: bead827e51 user: dan tags: trunk
2010-11-26
22:46
Update the download page to use a new, consistent file naming scheme. Download files end with a version number of the form WXXYYZZ where W is 3, XX is the major version, YY is the minor version, and ZZ is the patch number. check-in: 8b77a71a8c user: drh tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/fts3.in.

31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
..
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
...
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
...
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
...
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
...
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
...
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
...
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
...
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
...
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
...
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
....
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
....
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
....
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
  allows the user to efficiently query the database for all rows that contain
  one or more words (hereafter "tokens"), even if the table
  contains many large documents.

<p>
  For example, if each of the 517430 documents in the 
  "<a href="http://www.cs.cmu.edu/~enron/">Enron E-Mail Dataset</a>"
  is inserted into both the FTS table and the ordinary SQLite table
  created using the following SQL script:

<codeblock>
  CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);     /* FTS3 table */
  CREATE TABLE enrondata2(content TEXT);                        /* Ordinary table */
</codeblock>

................................................................................

<p>
  FTS4 is an enhancement to FTS3.  FTS3 has been available since SQLite [version 3.5.0] in
  2007-09-04.  The enhancements for FTS4 were added with SQLite [version 3.7.4] on 2010-12-08.

<p>
  Which module, FTS3 or FTS4, should you use in your application?
  FTS4 can sometimes is significantly faster than FTS3, even orders
  of magnitude faster, depending on the query, though in the common case the performance of the
  two modules is similar.  FTS4 also offers the enhanced [matchinfo()] outputs which can be
  useful in ranking the results of a [FTS MATCH|MATCH] operation.  On the other hand, because FTS4
  stores additional information, FTS4 requires a little more disk space than FTS3, though only a 
  percent of two in most cases.

<p>
  For newer applications, FTS4 is recommended; though if minimal disk usage or compatibility
................................................................................
  be left empty, in which case an FTS table with a single user-defined 
  column named "content" is created. Alternatively, the module arguments
  may be passed a list of comma separated column names. 

<p>
  If column names are explicitly provided for the FTS table as part of
  the CREATE VIRTUAL TABLE statement, then a datatype name may be optionally 
  specified for each column. However, this is pure syntactic sugar, the
  supplied typenames are not used by FTS or the SQLite core for any
  purpose. The same applies to any constraints specified along with an
  FTS column name - they are parsed but not used or recorded by the system
  in any way.

<codeblock>
  <i>-- Create an FTS table named "data" with one column - "content":</i>
................................................................................
</codeblock>

<p>
  FTS tables may be dropped from the database using an ordinary [DROP TABLE]
  statement. For example:

<codeblock>
  <i>-- Create, then immediately drop, an FTS3 table.</i>
  CREATE VIRTUAL TABLE data USING fts4();
  DROP TABLE data;
</codeblock>

<h2>Populating FTS Tables</h2>

  <p>
    FTS tables are populated using [INSERT], [UPDATE] and [DELETE]
    statements in the same way as ordinary SQLite tables are.

  <p>
    As well as the columns named by the user (or the "content" column if no
    module arguments where specified as part of the [CREATE VIRTUAL TABLE] 
    statement), each FTS table has a "rowid" column. The rowid of an FTS
    table behaves in the same way as the rowid column of an ordinary SQLite 
    table, except that the values stored in the rowid column of an FTS table 
    remain unchanged if the database is rebuilt using the [VACUUM] command. 
    For FTS tables, "docid" is allowed as an alias along with the usual "rowid",
    "oid" and "_oid_" identifiers. Attempting to insert or update a row with a 
    docid value that already exists in the table is an error, just as it would 
................................................................................
</codeblock>

  <p>
    To support full-text queries, FTS maintains an inverted index that maps
    from each unique term or word that appears in the dataset to the locations
    in which it appears within the table contents. For the curious, a 
    complete description of the [segment btree|data structure] used to store
    this index within the database file is described below. A feature of
    this data structure is that at any time the database may contain not
    one index b-tree, but several different b-trees that are incrementally
    merged as rows are inserted, updated and deleted. This technique improves 
    performance when writing to an FTS table, but causes some overhead for
    full-text queries that use the index. Executing an SQL statement of the
    form "INSERT INTO &lt;fts-table&gt;(&lt;fts-table&gt;) VALUES('optimize')"
    causes FTS to merge all existing index b-trees into a single large
................................................................................
    a sub-clause of the form "&lt;column&gt; MATCH ?", FTS is able to use 
    the built-in full-text index to restrict the search to those documents 
    that match the full-text query string specified as the right-hand operand
    of the MATCH clause.
</ul>

<p>
  If neither of the two query strategies enumerated above can be used, all
  queries on FTS tables are implemented using a linear scan of the entire
  table. If the table contains large amounts of data, this may be an 
  impractically approach (the first example on this page shows that a linear
  scan of 1.5 GB of data takes around 30 seconds using a modern PC).

<codeblock>
  <i>-- The examples in this block assume the following FTS table:</i>
  CREATE VIRTUAL TABLE mail USING fts3(subject, body);

  SELECT * FROM mail WHERE rowid = 15;                <i>-- Fast. Rowid lookup.</i>
................................................................................

<p>
  In all of the full-text queries above, the right-hand operand of the MATCH
  operator is a string consisting of a single term. In this case, the MATCH
  expression evaluates to true for all documents that contain one or more 
  instances of the specified word ("sqlite", "search" or "database", depending 
  on which example you look at). Specifying a single term as the right-hand
  operand of the MATCH operator results in the simplest (and most common) type 
  of full-text query possible. However more complicated queries are possible,
  including phrase searches, term-prefix searches and searches for documents 
  containing combinations of terms occuring within a defined proximity of each
  other. The various ways in which the full-text index may be queried are
  [FTS MATCH|described below].

<p>
................................................................................

</ol>


<h1 tags="compile fts">Compiling and Enabling FTS3 and FTS4</h1>

<p>
  Although the FTS3 and FTS4 is included with the SQLite core source code, it is not
  enabled by default. To build SQLite with FTS functionality enabled, define
  the preprocessor macro [SQLITE_ENABLE_FTS3] when compiling. New applications
  should also define the [SQLITE_ENABLE_FTS3_PARENTHESIS] macro to enable the
  [enhanced query syntax] (see below). Usually, this is done by adding the 
  following two switches to the compiler command line:

<codeblock>
................................................................................
  the configure script, if any.

<p>
  Because FTS3 and FTS4 are virtual tables, The [SQLITE_ENABLE_FTS3] compile-time option
  is incompatible with the [SQLITE_OMIT_VIRTUALTABLE] option.

<p>
  If a build of SQLite does not include FTS3, then any attempt to prepare an
  SQL statement to create an FTS3 or FTS4 table or to drop or access an existing 
  FTS table in any way will fail. The error message returned will be similar 
  to "no such module: ftsN" (where N is either 3 or 4).

<p>
  If the C version of the <a href=http://site.icu-project.org/>ICU library</a>
  is available, then FTS may also be compiled with the SQLITE_ENABLE_ICU
................................................................................

<h1 tags="FTS MATCH">Full-text Index Queries</h1>

<p>
  The most useful thing about FTS tables is the queries that may be 
  performed using the built-in full-text index. Full-text queries are 
  performed by specifying a clause of the form 
  "&lt;column&gt; MATCH &lt;full-text query expression&gt;" to the WHERE 
  clause of a SELECT statement that reads data from an FTS table. 
  [simple fts queries|Simple FTS queries] that return all documents that 
  contain a given term are described above. In that discussion the right-hand
  operand of the MATCH operator was assumed to be a string consisting of a
  single term. This section describes the more complex query types supported 
  by FTS tables, and how they may be utilized by specifying a more
  complex query expression as the right-hand operand of a MATCH operator.
................................................................................

  <li> The NOT operator (or, if using the standard syntax, a unary "-" operator)
       may be used to compute the <b>relative complement</b> of one set of
       documents with respect to another.
</ul>

<p>
  The FTS module may be compiled to use one of two slightly different versions
  of the full-text query syntax, the "standard" query syntax and the "enhanced" 
  query syntax. The basic term, term-prefix, phrase and NEAR queries described 
  above are the same in both versions of the syntax. The way in which set 
  operations are specified is slightly different. The following two sub-sections 
  describe the part of the two query syntaxes that pertains to set operations. 
  Refer to the description of how to [compile fts] for compilation notes.

................................................................................
  then the blob is zero bytes in size. Otherwise, the blob consists of zero
  or more 32-bit unsigned integers in machine byte-order. The exact number
  of integers in the returned array depends on both the query and the value
  of the second argument (if any) passed to the matchinfo function.

<p>
  The matchinfo function is called with either one or two arguments. As for
  all auxiliary functions, the first argument must be the special hidden
  column of an FTS table that has the same name as the table itself (see 
  above). The second argument, if it is specified, must be a text value
  comprised only of the characters 'p', 'c', 'n', 'a', 'l', 's' and 'x'.
  If no second argument is explicitly supplied, it defaults to "pcx". The
  second argument is refered to as the "format string" below.

<p>
  Characters in the matchinfo format string are processed from left to right. 
  Each character in the format string causes one or more 32-bit unsigned
................................................................................
    For each column, the length of the value stored in the current row of the
    FTS4 table, in tokens.  This value is only available when querying
    FTS4 tables, not FTS3. And only if the "matchinfo=fts3" directive was not
    specified as part of the "CREATE VIRTUAL TABLE" statement used to create
    the FTS4 table.
  <tr><td>s <td><i>cols</i> <td>For each column, the length of the longest 
    subsequence of phrase matches that the column value has in common
    the query text. For example, if a table column contains the text
    'a b c d e' and the query is 'a c "d e"', then the length of the longest
    common subsequence is 2 (phrase "c" followed by phrase "d e").
    
    
</table>

<p>
................................................................................
  rules:

<ul>
  <li><p> A term is a contiguous sequence of eligible characters, where 
    eligible characters are all alphanumeric characters, the "_" character,
    and all characters with UTF codepoints greater than or equal to 128.
    All other characters are discarded when splitting a document into terms.
    They serve only to separate adjacent terms.

  <li><p> All uppercase characters within the ASCII range (UTF codepoints less 
    than 128), are transformed to their lowercase equivalents as part of the
    tokenization process. Thus, full-text queries are case-insensitive when
    using the simple tokenizer.
</ul>








|







 







|
|







 







|







 







|












|







 







|







 







|


|







 







|







 







|







 







|







 







|







 







|







 







|
<
|







 







|







 







|







31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
..
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
...
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
...
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
...
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
...
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
...
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
...
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
...
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
...
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
...
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
....
1045
1046
1047
1048
1049
1050
1051
1052

1053
1054
1055
1056
1057
1058
1059
1060
....
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
....
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
  allows the user to efficiently query the database for all rows that contain
  one or more words (hereafter "tokens"), even if the table
  contains many large documents.

<p>
  For example, if each of the 517430 documents in the 
  "<a href="http://www.cs.cmu.edu/~enron/">Enron E-Mail Dataset</a>"
  is inserted into both an FTS table and an ordinary SQLite table
  created using the following SQL script:

<codeblock>
  CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);     /* FTS3 table */
  CREATE TABLE enrondata2(content TEXT);                        /* Ordinary table */
</codeblock>

................................................................................

<p>
  FTS4 is an enhancement to FTS3.  FTS3 has been available since SQLite [version 3.5.0] in
  2007-09-04.  The enhancements for FTS4 were added with SQLite [version 3.7.4] on 2010-12-08.

<p>
  Which module, FTS3 or FTS4, should you use in your application?
  FTS4 is sometimes significantly faster than FTS3, even orders
  of magnitude faster depending on the query, though in the common case the performance of the
  two modules is similar.  FTS4 also offers the enhanced [matchinfo()] outputs which can be
  useful in ranking the results of a [FTS MATCH|MATCH] operation.  On the other hand, because FTS4
  stores additional information, FTS4 requires a little more disk space than FTS3, though only a 
  percent of two in most cases.

<p>
  For newer applications, FTS4 is recommended; though if minimal disk usage or compatibility
................................................................................
  be left empty, in which case an FTS table with a single user-defined 
  column named "content" is created. Alternatively, the module arguments
  may be passed a list of comma separated column names. 

<p>
  If column names are explicitly provided for the FTS table as part of
  the CREATE VIRTUAL TABLE statement, then a datatype name may be optionally 
  specified for each column. This is pure syntactic sugar, the
  supplied typenames are not used by FTS or the SQLite core for any
  purpose. The same applies to any constraints specified along with an
  FTS column name - they are parsed but not used or recorded by the system
  in any way.

<codeblock>
  <i>-- Create an FTS table named "data" with one column - "content":</i>
................................................................................
</codeblock>

<p>
  FTS tables may be dropped from the database using an ordinary [DROP TABLE]
  statement. For example:

<codeblock>
  <i>-- Create, then immediately drop, an FTS4 table.</i>
  CREATE VIRTUAL TABLE data USING fts4();
  DROP TABLE data;
</codeblock>

<h2>Populating FTS Tables</h2>

  <p>
    FTS tables are populated using [INSERT], [UPDATE] and [DELETE]
    statements in the same way as ordinary SQLite tables are.

  <p>
    As well as the columns named by the user (or the "content" column if no
    module arguments were specified as part of the [CREATE VIRTUAL TABLE] 
    statement), each FTS table has a "rowid" column. The rowid of an FTS
    table behaves in the same way as the rowid column of an ordinary SQLite 
    table, except that the values stored in the rowid column of an FTS table 
    remain unchanged if the database is rebuilt using the [VACUUM] command. 
    For FTS tables, "docid" is allowed as an alias along with the usual "rowid",
    "oid" and "_oid_" identifiers. Attempting to insert or update a row with a 
    docid value that already exists in the table is an error, just as it would 
................................................................................
</codeblock>

  <p>
    To support full-text queries, FTS maintains an inverted index that maps
    from each unique term or word that appears in the dataset to the locations
    in which it appears within the table contents. For the curious, a 
    complete description of the [segment btree|data structure] used to store
    this index within the database file appears below. A feature of
    this data structure is that at any time the database may contain not
    one index b-tree, but several different b-trees that are incrementally
    merged as rows are inserted, updated and deleted. This technique improves 
    performance when writing to an FTS table, but causes some overhead for
    full-text queries that use the index. Executing an SQL statement of the
    form "INSERT INTO &lt;fts-table&gt;(&lt;fts-table&gt;) VALUES('optimize')"
    causes FTS to merge all existing index b-trees into a single large
................................................................................
    a sub-clause of the form "&lt;column&gt; MATCH ?", FTS is able to use 
    the built-in full-text index to restrict the search to those documents 
    that match the full-text query string specified as the right-hand operand
    of the MATCH clause.
</ul>

<p>
  If neither of these two query strategies can be used, all
  queries on FTS tables are implemented using a linear scan of the entire
  table. If the table contains large amounts of data, this may be an 
  impractical approach (the first example on this page shows that a linear
  scan of 1.5 GB of data takes around 30 seconds using a modern PC).

<codeblock>
  <i>-- The examples in this block assume the following FTS table:</i>
  CREATE VIRTUAL TABLE mail USING fts3(subject, body);

  SELECT * FROM mail WHERE rowid = 15;                <i>-- Fast. Rowid lookup.</i>
................................................................................

<p>
  In all of the full-text queries above, the right-hand operand of the MATCH
  operator is a string consisting of a single term. In this case, the MATCH
  expression evaluates to true for all documents that contain one or more 
  instances of the specified word ("sqlite", "search" or "database", depending 
  on which example you look at). Specifying a single term as the right-hand
  operand of the MATCH operator results in the simplest and most common type 
  of full-text query possible. However more complicated queries are possible,
  including phrase searches, term-prefix searches and searches for documents 
  containing combinations of terms occuring within a defined proximity of each
  other. The various ways in which the full-text index may be queried are
  [FTS MATCH|described below].

<p>
................................................................................

</ol>


<h1 tags="compile fts">Compiling and Enabling FTS3 and FTS4</h1>

<p>
  Although FTS3 and FTS4 are included with the SQLite core source code, they are not
  enabled by default. To build SQLite with FTS functionality enabled, define
  the preprocessor macro [SQLITE_ENABLE_FTS3] when compiling. New applications
  should also define the [SQLITE_ENABLE_FTS3_PARENTHESIS] macro to enable the
  [enhanced query syntax] (see below). Usually, this is done by adding the 
  following two switches to the compiler command line:

<codeblock>
................................................................................
  the configure script, if any.

<p>
  Because FTS3 and FTS4 are virtual tables, The [SQLITE_ENABLE_FTS3] compile-time option
  is incompatible with the [SQLITE_OMIT_VIRTUALTABLE] option.

<p>
  If a build of SQLite does not include the FTS modules, then any attempt to prepare an
  SQL statement to create an FTS3 or FTS4 table or to drop or access an existing 
  FTS table in any way will fail. The error message returned will be similar 
  to "no such module: ftsN" (where N is either 3 or 4).

<p>
  If the C version of the <a href=http://site.icu-project.org/>ICU library</a>
  is available, then FTS may also be compiled with the SQLITE_ENABLE_ICU
................................................................................

<h1 tags="FTS MATCH">Full-text Index Queries</h1>

<p>
  The most useful thing about FTS tables is the queries that may be 
  performed using the built-in full-text index. Full-text queries are 
  performed by specifying a clause of the form 
  "&lt;column&gt; MATCH &lt;full-text query expression&gt;" as part of the WHERE 
  clause of a SELECT statement that reads data from an FTS table. 
  [simple fts queries|Simple FTS queries] that return all documents that 
  contain a given term are described above. In that discussion the right-hand
  operand of the MATCH operator was assumed to be a string consisting of a
  single term. This section describes the more complex query types supported 
  by FTS tables, and how they may be utilized by specifying a more
  complex query expression as the right-hand operand of a MATCH operator.
................................................................................

  <li> The NOT operator (or, if using the standard syntax, a unary "-" operator)
       may be used to compute the <b>relative complement</b> of one set of
       documents with respect to another.
</ul>

<p>
  The FTS modules may be compiled to use one of two slightly different versions
  of the full-text query syntax, the "standard" query syntax and the "enhanced" 
  query syntax. The basic term, term-prefix, phrase and NEAR queries described 
  above are the same in both versions of the syntax. The way in which set 
  operations are specified is slightly different. The following two sub-sections 
  describe the part of the two query syntaxes that pertains to set operations. 
  Refer to the description of how to [compile fts] for compilation notes.

................................................................................
  then the blob is zero bytes in size. Otherwise, the blob consists of zero
  or more 32-bit unsigned integers in machine byte-order. The exact number
  of integers in the returned array depends on both the query and the value
  of the second argument (if any) passed to the matchinfo function.

<p>
  The matchinfo function is called with either one or two arguments. As for
  all auxiliary functions, the first argument must be the special 

  [FTS hidden column]. The second argument, if it is specified, must be a text value
  comprised only of the characters 'p', 'c', 'n', 'a', 'l', 's' and 'x'.
  If no second argument is explicitly supplied, it defaults to "pcx". The
  second argument is refered to as the "format string" below.

<p>
  Characters in the matchinfo format string are processed from left to right. 
  Each character in the format string causes one or more 32-bit unsigned
................................................................................
    For each column, the length of the value stored in the current row of the
    FTS4 table, in tokens.  This value is only available when querying
    FTS4 tables, not FTS3. And only if the "matchinfo=fts3" directive was not
    specified as part of the "CREATE VIRTUAL TABLE" statement used to create
    the FTS4 table.
  <tr><td>s <td><i>cols</i> <td>For each column, the length of the longest 
    subsequence of phrase matches that the column value has in common
    with the query text. For example, if a table column contains the text
    'a b c d e' and the query is 'a c "d e"', then the length of the longest
    common subsequence is 2 (phrase "c" followed by phrase "d e").
    
    
</table>

<p>
................................................................................
  rules:

<ul>
  <li><p> A term is a contiguous sequence of eligible characters, where 
    eligible characters are all alphanumeric characters, the "_" character,
    and all characters with UTF codepoints greater than or equal to 128.
    All other characters are discarded when splitting a document into terms.
    Their only contribution is to separate adjacent terms.

  <li><p> All uppercase characters within the ASCII range (UTF codepoints less 
    than 128), are transformed to their lowercase equivalents as part of the
    tokenization process. Thus, full-text queries are case-insensitive when
    using the simple tokenizer.
</ul>