Documentation Source Text

Check-in [ead2cc60ef]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Update fts3.html to describe the fts4aux module and the fts4 compress/uncompress options.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: ead2cc60ef6b652f0eb80f1d98625e32809bf6c6
User & Date: dan 2011-02-03 19:06:21
Context
2011-02-04
00:52
Update the ATTACH documentation to allow an expression as the filename parameter. Core ticket http://www.sqlite.org/src/info/9013e13dba. check-in: be0c879c12 user: drh tags: trunk
2011-02-03
19:06
Update fts3.html to describe the fts4aux module and the fts4 compress/uncompress options. check-in: ead2cc60ef user: dan tags: trunk
2011-01-31
14:41
Change the date on version 3.7.5 to February 1. check-in: ee5bc9089c user: drh tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/fts3.in.

86
87
88
89
90
91
92




93
94
95
96
97
98
99
...
166
167
168
169
170
171
172
173












174

175

176


















177
178
179
180
181
182





































183
184



185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
...
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
....
1236
1237
1238
1239
1240
1241
1242















































































1243
1244
1245
1246
1247
1248
1249
       optimizations and extra matchinfo() options, FTS4 tables may consume more
       disk space than the equivalent table created using FTS3. Usually the overhead
       is 1-2% or less, but may be as high as 10% if the documents stored in the
       FTS table are very small. The overhead may be reduced by specifying the
       directive [matchinfo_fts3|"matchinfo=fts3"] as part of the FTS4 table
       declaration, but this comes at the expense of sacrificing some of the
       extra supported matchinfo() options.




</ul>

<p>
  FTS4 is an enhancement to FTS3.  FTS3 has been available since SQLite [version 3.5.0] in
  2007-09-04.  The enhancements for FTS4 were added with SQLite [version 3.7.4] on 2010-12-08.

<p>
................................................................................
  CREATE VIRTUAL TABLE data USING fts4(tokenize=simple);

  <i>-- Create an FTS table with two columns that uses the "icu" tokenizer.</i>
  <i>-- The qualifier "en_AU" is passed to the tokenizer implementation</i>
  CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU);
</codeblock>

<p>












<tcl>hd_fragment *matchinfo_fts3 {matchinfo_fts3}</tcl>

  If the "CREATE VIRTUAL TABLE" statement specifies module FTS4 (not FTS3), 

  then the special directive "matchinfo=fts3" may also appear in place of


















  a column name. If it does, then some of the extra information stored by
  FTS4 is omitted. This reduces the amount of disk space consumed by an FTS4 
  table until it is almost the same as the amount that would be used by
  the equivalent FTS3 table, but also means that the data accessed by passing
  the 'l' flag to the [matchinfo()] function is not available. For example:






































<codeblock>
  <i>-- Create a reduced-footprint FTS4 table.</i>



  CREATE VIRTUAL TABLE papers USING fts4(author, document, matchinfo=fts3);
</codeblock>

<p>
  When using FTS4, specifying a column name that contains an "=" character
  and is not either a "tokenize=*" or "matchinfo=fts3" directive is an error.
  With FTS3, the first token in the unrecognized directive is interpreted
  as a column name. Similarly, specifying multiple "tokenize=*" directives
  in a single table declaration is an error when using FTS4, whereas the
  second and subsequent "tokenize=*" directives are interpreted as column
  names by FTS3. For example:

<codeblock>
  <i>-- An error. FTS4 does not recognize the directive "xyz=abc".</i>
  CREATE VIRTUAL TABLE papers USING fts4(author, document, xyz=abc);

  <i>-- Create an FTS3 table with three columns - "author", "document"</i>
................................................................................
  <i>-- table uses the "porter" tokenizer.</i>
  CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple);

  <i>-- An error. Cannot create a table with two columns named "tokenize".</i>
  CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple, tokenize=icu);
</codeblock>

<p>
  FTS tables may be dropped from the database using an ordinary [DROP TABLE]
  statement. For example:

<codeblock>
  <i>-- Create, then immediately drop, an FTS4 table.</i>
  CREATE VIRTUAL TABLE data USING fts4();
  DROP TABLE data;
</codeblock>

<h2>Populating FTS Tables</h2>

  <p>
    FTS tables are populated using [INSERT], [UPDATE] and [DELETE]
    statements in the same way as ordinary SQLite tables are.

  <p>
................................................................................
<p>
  The matchinfo function provides all the information required to calculate
  probabilistic "bag-of-words" relevancy scores such as 
  <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may
  be used to order results in a full-text search application. Appendix A of this 
  document, "[search application tips]", contains an example of using the
  matchinfo() function efficiently.
















































































<h1 id=tokenizer tags="tokenizer">Tokenizers</h1>

<p>
  An FTS tokenizer is a set of rules for extracting terms from a document 
  or basic FTS full-text query. 








>
>
>
>







 








>
>
>
>
>
>
>
>
>
>
>
>

>

>
|
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
|
|
|
|
|

>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>

<
>
>
>
|




|
|
|
|
|







 







<
<
<
<
<
<
<
<
<
<







 







>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>







86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
...
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256

257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
...
284
285
286
287
288
289
290










291
292
293
294
295
296
297
....
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
       optimizations and extra matchinfo() options, FTS4 tables may consume more
       disk space than the equivalent table created using FTS3. Usually the overhead
       is 1-2% or less, but may be as high as 10% if the documents stored in the
       FTS table are very small. The overhead may be reduced by specifying the
       directive [matchinfo_fts3|"matchinfo=fts3"] as part of the FTS4 table
       declaration, but this comes at the expense of sacrificing some of the
       extra supported matchinfo() options.

  <li> <p>FTS4 provides hooks (the compress and uncompress 
       [FTS4 options|options]) allowing data to be stored in a compressed 
       form, reducing disk usage and IO.
</ul>

<p>
  FTS4 is an enhancement to FTS3.  FTS3 has been available since SQLite [version 3.5.0] in
  2007-09-04.  The enhancements for FTS4 were added with SQLite [version 3.7.4] on 2010-12-08.

<p>
................................................................................
  CREATE VIRTUAL TABLE data USING fts4(tokenize=simple);

  <i>-- Create an FTS table with two columns that uses the "icu" tokenizer.</i>
  <i>-- The qualifier "en_AU" is passed to the tokenizer implementation</i>
  CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU);
</codeblock>

<p>
  FTS tables may be dropped from the database using an ordinary [DROP TABLE]
  statement. For example:

<codeblock>
  <i>-- Create, then immediately drop, an FTS4 table.</i>
  CREATE VIRTUAL TABLE data USING fts4();
  DROP TABLE data;
</codeblock>

<h3 id=fts4_options tags="FTS4 options">FTS4 Options</h3>

<p>
<tcl>hd_fragment *matchinfo_fts3 {matchinfo_fts3}</tcl>

  If the "CREATE VIRTUAL TABLE" statement specifies module FTS4 (not FTS3), 
  then special directives - FTS4 options - similar to the "tokenize=*" option
  may also appear in place of column names. An FTS4 option consists of the
  option name, followed by an "=" character, followed by the option value.
  The option value may optionally be enclosed in single or double quotes, with
  embedded quote characters escaped in the same way as for SQL literals. There
  may not be whitespace on either side of the "=" character. For example,
  to create an FTS4 table with the value of option "matchinfo" set to "fts3":

<codeblock>
  <i>-- Create a reduced-footprint FTS4 table.</i>
  CREATE VIRTUAL TABLE papers USING fts4(author, document, matchinfo=fts3);
</codeblock>

<p>
  FTS4 currently supports the following three options:

<table striped=1>
  <tr><th>Option<th>Interpretation
  <tr><td>matchinfo<td> This option may only be set to the value "fts3".
    Attempting to set it otherwise is an error. If this option is specified,
    then some of the extra information stored by FTS4 is omitted. This reduces
    the amount of disk space consumed by an FTS4 table until it is almost the
    same as the amount that would be used by the equivalent FTS3 table, but 
    also means that the data accessed by passing the 'l' flag to the 
    [matchinfo()] function is not available. For example:

  <tr><td>compress<td>
    This option is used to specify the compress function. It is an error to
    specify a compress function without also specifying an uncompress
    function.
    
  <tr><td>uncompress<td>
    This option is used to specify the uncompress function. It is an error to
    specify an uncompress function without also specifying a compress
    function.
</table>

<p>
  The compress and uncompress options allow FTS4 content to be stored in
  the database in a compressed form. Both options should be set to the name
  of an SQL scalar function registered using [sqlite3_create_function()]
  that accepts a single argument. 

<p>
  The compress function should return a compressed version of the value 
  passed to it as an argument. Each time data is written to the FTS4 table, 
  each column value is passed to the compress function and the result value 
  stored in the database. The compress function may return any type of SQLite 
  value (blob, text, real, integer or null).

<p>
  The uncompress function should uncompress data previously compressed by
  the compress function. In other words, for all SQLite values X, it should
  be true that uncompress(compress(X)) equals X. When data that has been
  compressed by the compress function is read from the database by FTS4, it
  is passed to the uncompress function before it is used.

<p>
  If the specified compress or uncompress functions do not exist, the table
  may still be created. An error is not returned until the FTS4 table is
  read (if the uncompress function does not exist) or written (if it is the 
  compress function that does not exist).

<codeblock>

  <i>-- Create an FTS4 table that stores data in compressed form. This</i>
  <i>-- assumes that the scalar functions zip() and unzip() have been (or</i>
  <i>-- will be) added to the database handle.</i>
  CREATE VIRTUAL TABLE papers USING fts4(author, document, compress=zip, uncompress=unzip);
</codeblock>

<p>
  When using FTS4, specifying a column name that contains an "=" character
  and is not either a "tokenize=*" specification or a recognized FTS4 option
  is an error. With FTS3, the first token in the unrecognized directive is 
  interpreted as a column name. Similarly, specifying multiple "tokenize=*"
  directives in a single table declaration is an error when using FTS4, whereas
  the second and subsequent "tokenize=*" directives are interpreted as column
  names by FTS3. For example:

<codeblock>
  <i>-- An error. FTS4 does not recognize the directive "xyz=abc".</i>
  CREATE VIRTUAL TABLE papers USING fts4(author, document, xyz=abc);

  <i>-- Create an FTS3 table with three columns - "author", "document"</i>
................................................................................
  <i>-- table uses the "porter" tokenizer.</i>
  CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple);

  <i>-- An error. Cannot create a table with two columns named "tokenize".</i>
  CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple, tokenize=icu);
</codeblock>











<h2>Populating FTS Tables</h2>

  <p>
    FTS tables are populated using [INSERT], [UPDATE] and [DELETE]
    statements in the same way as ordinary SQLite tables are.

  <p>
................................................................................
<p>
  The matchinfo function provides all the information required to calculate
  probabilistic "bag-of-words" relevancy scores such as 
  <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may
  be used to order results in a full-text search application. Appendix A of this 
  document, "[search application tips]", contains an example of using the
  matchinfo() function efficiently.

<h1 id=fts4aux tags="fts4aux">Fts4aux - Direct Access to the Full-Text Index</h1>

<p>
  As of version 3.7.6, SQLite includes a new virtual table module called 
  "fts4aux", which can be used to inspect the full-text index of an exiting
  FTS3 or FTS4 table directly (despite its name, fts4aux works just as well
  with FTS3 tables as it does with FTS4 tables). Fts4aux tables are read-only.
  The only way to modify the contents of an fts4aux table is by modifying the
  contents of the associated FTS table. The fts4aux module is automatically
  included in all [compile fts|builds that include FTS].

<p>
  An fts4aux virtual table is constructed with a single argument - the 
  unqualified name of the FTS table that it will be used to access.
  For example:

<codeblock>
  <i>-- Create an FTS4 table</i>
  CREATE VIRTUAL TABLE ft USING fts4;

  <i>-- Create an fts4aux table to access the full-text index for table "ft"</i>
  CREATE VIRTUAL TABLE ft_terms USING fts4aux(ft);
</codeblock>

<p>
  There is one row in an fts4aux table for each distinct term in the
  associated FTS table. An fts4aux table always has the same three columns, 
  as follows, from left to right:

<table striped=1>
  <tr><th>Column Name<th>Column Contents
  <tr><td>term<td> 
    Contains the text of the term for this row.
  <tr><td>documents<td>
    Contains the number of rows of the FTS table that contain at 
    least one instance of the term (in any column). This column always 
    contains an integer value.
  <tr><td>occurrences<td>
    Contains the total number of instances of the term in all rows of the 
    FTS table. This column also always contains an integer value.
</table>

<p>
  For example, using the tables created above:

<codeblock>
  INSERT INTO ft VALUES('Apple banana Cherry');
  INSERT INTO ft VALUES('Banana Date Fig');
  INSERT INTO ft VALUES('Cherry Grapefruit Cherry');

  <i>-- The following query returns this data:</i>
  <i>--</i>
  <i>--     apple       |  1  |  1</i>
  <i>--     banana      |  2  |  2</i>
  <i>--     cherry      |  2  |  3</i>
  <i>--     date        |  1  |  1</i>
  <i>--     fig         |  1  |  1</i>
  <i>--     grapefruit  |  1  |  1</i>
  <i>--</i>
  SELECT term, documents, occurrences FROM ft_terms;
</codeblock>

<p>
  In the example, the values in the "term" column are all lower case, even
  though they were inserted into table "ft" in mixed case. This is because
  an fts3aux table contains the terms as extracted from the document text
  by the [tokenizer]. In this case, since table "ft" uses the 
  [tokenizer|simple tokenizer], this means all terms have been folded to
  lower case.

<p>
  During a transaction, some of the data written to an FTS table may be 
  cached in memory and written to the database only when the transaction is 
  committed. However the implementation of the fts4aux module is only able 
  to read data from the database. In practice this means that if an fts4aux 
  table is queried from within a transaction in which the associated 
  FTS table has been modified, the results of the query are likely to reflect 
  only a (possibly empty) subset of the changes made.

<h1 id=tokenizer tags="tokenizer">Tokenizers</h1>

<p>
  An FTS tokenizer is a set of rules for extracting terms from a document 
  or basic FTS full-text query.