Documentation Source Text

Check-in [074255f9eb]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Update fts3.html with the new "order" and "prefix" options.
Downloads: Tarball | ZIP archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 074255f9eb46a364d5cac4281731bda6ac57e8e7
User & Date: dan 2011-06-14 19:38:18.048
Context
2011-06-15
12:23
Update the limits.html document for the new upper bound on SQLITE_MAX_ATTACHED and to describe the maximum database size and the maximum number of rows in a table. (check-in: b91558ccbf user: drh tags: trunk)
2011-06-14
19:38
Update fts3.html with the new "order" and "prefix" options. (check-in: 074255f9eb user: dan tags: trunk)
2011-06-08
15:11
Fix a typo in the how-to-corrupt document. (check-in: a1408d26c6 user: drh tags: trunk)
Changes
Unified Diff Ignore Whitespace Patch
Changes to pages/fts3.in.
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298

<codeblock>
  <i>-- Create, then immediately drop, an FTS4 table.</i>
  CREATE VIRTUAL TABLE data USING fts4();
  DROP TABLE data;
</codeblock>

<h3 id=fts4_options tags="FTS4 options">FTS4 Options</h3>

<p>
<tcl>hd_fragment *matchinfo_fts3 {matchinfo_fts3}</tcl>

  If the "CREATE VIRTUAL TABLE" statement specifies module FTS4 (not FTS3), 
  then special directives - FTS4 options - similar to the "tokenize=*" option
  may also appear in place of column names. An FTS4 option consists of the
  option name, followed by an "=" character, followed by the option value.
  The option value may optionally be enclosed in single or double quotes, with
  embedded quote characters escaped in the same way as for SQL literals. There
  may not be whitespace on either side of the "=" character. For example,
  to create an FTS4 table with the value of option "matchinfo" set to "fts3":

<codeblock>
  <i>-- Create a reduced-footprint FTS4 table.</i>
  CREATE VIRTUAL TABLE papers USING fts4(author, document, matchinfo=fts3);
</codeblock>

<p>
  FTS4 currently supports the following three options:

<table striped=1>
  <tr><th>Option<th>Interpretation
  <tr><td>matchinfo<td> This option may only be set to the value "fts3".
    Attempting to set it otherwise is an error. If this option is specified,
    then some of the extra information stored by FTS4 is omitted. This reduces
    the amount of disk space consumed by an FTS4 table until it is almost the
    same as the amount that would be used by the equivalent FTS3 table, but 
    also means that the data accessed by passing the 'l' flag to the 
    [matchinfo()] function is not available.

  <tr><td>compress<td>
    This option is used to specify the compress function. It is an error to
    specify a compress function without also specifying an uncompress
    function.
    
  <tr><td>uncompress<td>
    This option is used to specify the uncompress function. It is an error to
    specify an uncompress function without also specifying a compress
    function.
</table>

<tcl>hd_fragment *fts4compression {compressed FTS4 content}</tcl>
<p>
  The compress and uncompress options allow FTS4 content to be stored in
  the database in a compressed form. Both options should be set to the name
  of an SQL scalar function registered using [sqlite3_create_function()]
  that accepts a single argument. 

<p>
  The compress function should return a compressed version of the value 
  passed to it as an argument. Each time data is written to the FTS4 table, 
  each column value is passed to the compress function and the result value 
  stored in the database. The compress function may return any type of SQLite 
  value (blob, text, real, integer or null).

<p>
  The uncompress function should uncompress data previously compressed by
  the compress function. In other words, for all SQLite values X, it should
  be true that uncompress(compress(X)) equals X. When data that has been
  compressed by the compress function is read from the database by FTS4, it
  is passed to the uncompress function before it is used.

<p>
  If the specified compress or uncompress functions do not exist, the table
  may still be created. An error is not returned until the FTS4 table is
  read (if the uncompress function does not exist) or written (if it is the 
  compress function that does not exist).

<codeblock>
  <i>-- Create an FTS4 table that stores data in compressed form. This</i>
  <i>-- assumes that the scalar functions zip() and unzip() have been (or</i>
  <i>-- will be) added to the database handle.</i>
  CREATE VIRTUAL TABLE papers USING fts4(author, document, compress=zip, uncompress=unzip);
</codeblock>

<p>
  When using FTS4, specifying a column name that contains an "=" character
  and is not either a "tokenize=*" specification or a recognized FTS4 option
  is an error. With FTS3, the first token in the unrecognized directive is 
  interpreted as a column name. Similarly, specifying multiple "tokenize=*"
  directives in a single table declaration is an error when using FTS4, whereas
  the second and subsequent "tokenize=*" directives are interpreted as column
  names by FTS3. For example:

<codeblock>
  <i>-- An error. FTS4 does not recognize the directive "xyz=abc".</i>
  CREATE VIRTUAL TABLE papers USING fts4(author, document, xyz=abc);

  <i>-- Create an FTS3 table with three columns - "author", "document"</i>
  <i>-- and "xyz".</i>
  CREATE VIRTUAL TABLE papers USING fts3(author, document, xyz=abc);

  <i>-- An error. FTS4 does not allow multiple tokenize=* directives</i>
  CREATE VIRTUAL TABLE papers USING fts4(tokenize=porter, tokenize=simple);

  <i>-- Create an FTS3 table with a single column named "tokenize". The</i>
  <i>-- table uses the "porter" tokenizer.</i>
  CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple);

  <i>-- An error. Cannot create a table with two columns named "tokenize".</i>
  CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple, tokenize=icu);
</codeblock>

<h2>Populating FTS Tables</h2>

  <p>
    FTS tables are populated using [INSERT], [UPDATE] and [DELETE]
    statements in the same way as ordinary SQLite tables are.

  <p>







<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<







180
181
182
183
184
185
186









































































































187
188
189
190
191
192
193

<codeblock>
  <i>-- Create, then immediately drop, an FTS4 table.</i>
  CREATE VIRTUAL TABLE data USING fts4();
  DROP TABLE data;
</codeblock>










































































































<h2>Populating FTS Tables</h2>

  <p>
    FTS tables are populated using [INSERT], [UPDATE] and [DELETE]
    statements in the same way as ordinary SQLite tables are.

  <p>
1408
1409
1410
1411
1412
1413
1414






























































































































































1415
1416
1417
1418
1419
1420
1421
  During a transaction, some of the data written to an FTS table may be 
  cached in memory and written to the database only when the transaction is 
  committed. However the implementation of the fts4aux module is only able 
  to read data from the database. In practice this means that if an fts4aux 
  table is queried from within a transaction in which the associated 
  FTS table has been modified, the results of the query are likely to reflect 
  only a (possibly empty) subset of the changes made.































































































































































<h1 id=tokenizer tags="tokenizer">Tokenizers</h1>

<p>
  An FTS tokenizer is a set of rules for extracting terms from a document 
  or basic FTS full-text query. 








>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>







1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
  During a transaction, some of the data written to an FTS table may be 
  cached in memory and written to the database only when the transaction is 
  committed. However the implementation of the fts4aux module is only able 
  to read data from the database. In practice this means that if an fts4aux 
  table is queried from within a transaction in which the associated 
  FTS table has been modified, the results of the query are likely to reflect 
  only a (possibly empty) subset of the changes made.

<h1 id=fts4_options tags="FTS4 options">FTS4 Options</h1>

<p>
<tcl>hd_fragment *matchinfo_fts3 {matchinfo_fts3}</tcl>

  If the "CREATE VIRTUAL TABLE" statement specifies module FTS4 (not FTS3), 
  then special directives - FTS4 options - similar to the "tokenize=*" option
  may also appear in place of column names. An FTS4 option consists of the
  option name, followed by an "=" character, followed by the option value.
  The option value may optionally be enclosed in single or double quotes, with
  embedded quote characters escaped in the same way as for SQL literals. There
  may not be whitespace on either side of the "=" character. For example,
  to create an FTS4 table with the value of option "matchinfo" set to "fts3":

<codeblock>
  <i>-- Create a reduced-footprint FTS4 table.</i>
  CREATE VIRTUAL TABLE papers USING fts4(author, document, matchinfo=fts3);
</codeblock>

<p>
  FTS4 currently supports the following three options:

<table striped=1>
  <tr><th>Option<th>Interpretation
  <tr><td>matchinfo<td> This option may only be set to the value "fts3".
    Attempting to set it otherwise is an error. If this option is specified,
    then some of the extra information stored by FTS4 is omitted. This reduces
    the amount of disk space consumed by an FTS4 table until it is almost the
    same as the amount that would be used by the equivalent FTS3 table, but 
    also means that the data accessed by passing the 'l' flag to the 
    [matchinfo()] function is not available.

  <tr><td>compress<td>
    This option is used to specify the compress function. It is an error to
    specify a compress function without also specifying an uncompress
    function. [fts4 compress option|See below] for details.
    
  <tr><td>uncompress<td>
    This option is used to specify the uncompress function. It is an error to
    specify an uncompress function without also specifying a compress
    function. [fts4 compress option|See below] for details.

  <tr><td>order<td>
    The "order" option may be set to either "DESC" or "ASC" (in upper or
    lower case). If it is set to "DESC", then FTS4 stores its data in such
    a way as to optimize returning results in descending order by docid.
    If it is set to "ASC" (the default), then the data structures are 
    optimized for returning results in ascending order by docid. In other
    words, if many of the queries run against the FTS4 table use "ORDER BY
    docid DESC", then it may improve performance to add the "order=desc" 
    option to the CREATE VIRTUAL TABLE statement.

  <tr><td>prefix<td>
    This option may be set to a comma-separated list of positive non-zero 
    integers. For each integer N in the list, a separate index is created
    in the database file to optimize [FTS MATCH|term prefix] queries where
    the query term is N bytes in length, not including the '*' character,
    when encoded using UTF-8. [fts4 prefix option|See below] for details.

</table>

<p>
  When using FTS4, specifying a column name that contains an "=" character
  and is not either a "tokenize=*" specification or a recognized FTS4 option
  is an error. With FTS3, the first token in the unrecognized directive is 
  interpreted as a column name. Similarly, specifying multiple "tokenize=*"
  directives in a single table declaration is an error when using FTS4, whereas
  the second and subsequent "tokenize=*" directives are interpreted as column
  names by FTS3. For example:

<codeblock>
  <i>-- An error. FTS4 does not recognize the directive "xyz=abc".</i>
  CREATE VIRTUAL TABLE papers USING fts4(author, document, xyz=abc);

  <i>-- Create an FTS3 table with three columns - "author", "document"</i>
  <i>-- and "xyz".</i>
  CREATE VIRTUAL TABLE papers USING fts3(author, document, xyz=abc);

  <i>-- An error. FTS4 does not allow multiple tokenize=* directives</i>
  CREATE VIRTUAL TABLE papers USING fts4(tokenize=porter, tokenize=simple);

  <i>-- Create an FTS3 table with a single column named "tokenize". The</i>
  <i>-- table uses the "porter" tokenizer.</i>
  CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple);

  <i>-- An error. Cannot create a table with two columns named "tokenize".</i>
  CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple, tokenize=icu);
</codeblock>

<tcl>hd_fragment *fts4compression {compressed FTS4 content}</tcl>
<h2 tags="fts4 compress option"> Compress/Uncompress </h2>

<p>
  The compress and uncompress options allow FTS4 content to be stored in
  the database in a compressed form. Both options should be set to the name
  of an SQL scalar function registered using [sqlite3_create_function()]
  that accepts a single argument. 

<p>
  The compress function should return a compressed version of the value 
  passed to it as an argument. Each time data is written to the FTS4 table, 
  each column value is passed to the compress function and the result value 
  stored in the database. The compress function may return any type of SQLite 
  value (blob, text, real, integer or null).

<p>
  The uncompress function should uncompress data previously compressed by
  the compress function. In other words, for all SQLite values X, it should
  be true that uncompress(compress(X)) equals X. When data that has been
  compressed by the compress function is read from the database by FTS4, it
  is passed to the uncompress function before it is used.

<p>
  If the specified compress or uncompress functions do not exist, the table
  may still be created. An error is not returned until the FTS4 table is
  read (if the uncompress function does not exist) or written (if it is the 
  compress function that does not exist).

<codeblock>
  <i>-- Create an FTS4 table that stores data in compressed form. This</i>
  <i>-- assumes that the scalar functions zip() and unzip() have been (or</i>
  <i>-- will be) added to the database handle.</i>
  CREATE VIRTUAL TABLE papers USING fts4(author, document, compress=zip, uncompress=unzip);
</codeblock>

<h2 tags="fts4 prefix option">Prefix</h2>

<p>
  The FTS4 prefix option causes FTS to index term prefixes of specified lengths
  in the same way that it always indexes complete terms. The prefix option must be set 
  to a comma separated list of positive non-zero integers. For each value N in the 
  list, prefixes of length N bytes (when encoded using UTF-8) are indexed. FTS4 uses
  term prefix indexes to speed up term prefix queries. The cost, of course, is that
  indexing term prefixes as well as complete terms increases the database size and
  slows down write operations on the FTS4 table.

<p>
  Prefix indexes may be used to optimize term prefix queries in two cases. If the
  query is for a prefix of N bytes, then a prefix index created with "prefix=N" 
  provides the best optimization. Or, if no "prefix=N" index is available, a
  "prefix=N+1" index imay be used instead. Using a "prefix=N+1" index is less
  efficient than a "prefix=N" index, but is better than no prefix index at all.

<codeblock>
  <i>-- Create an FTS4 table with indexes to optimize 2 and 4 byte term prefix queries.</i>
  CREATE VIRTUAL TABLE t1 USING fts4(c1, c2, prefix="2,4");

  <i>-- The following two queries are both optimized using the prefix indexes.</i>
  SELECT * FROM t1 WHERE t1 MATCH 'ab*';
  SELECT * FROM t1 WHERE t1 MATCH 'abcd*';

  <i>-- The following two queries are both partially optimized using the prefix</i>
  <i>-- indexes. The optimization is not as pronounced as it is for the queries</i>
  <i>-- above, but still an improvement over no prefix indexes at all.</i>
  SELECT * FROM t1 WHERE t1 MATCH 'a*';
  SELECT * FROM t1 WHERE t1 MATCH 'abc*';
</codeblock>

<h1 id=tokenizer tags="tokenizer">Tokenizers</h1>

<p>
  An FTS tokenizer is a set of rules for extracting terms from a document 
  or basic FTS full-text query.