Documentation Source Text

Check-in [f650ee44b0]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Add an "application porting guide" describing the steps required to port fts3/4 applciations to fts5. Fix some other deficiencies in fts5.html.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: f650ee44b0c626168c35f7dad15d13361c01d68c
User & Date: dan 2015-07-10 17:04:56
Context
2015-07-13
22:07
First cut at OTA documentation. check-in: a5dfd4a26b user: drh tags: trunk
2015-07-10
17:04
Add an "application porting guide" describing the steps required to port fts3/4 applciations to fts5. Fix some other deficiencies in fts5.html. check-in: f650ee44b0 user: dan tags: trunk
2015-07-02
20:03
Add "Compiling and Using FTS5" section to fts5.html. check-in: 59ff0eaf28 user: dan tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/fts5.in.

14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
..
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
....
1163
1164
1165
1166
1167
1168
1169














1170
1171
1172
1173
1174
1175
1176
....
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
....
1375
1376
1377
1378
1379
1380
1381

1382

1383
1384
1385
1386
1387
1388
1389
....
1392
1393
1394
1395
1396
1397
1398


































































































































































































full-text search engines allow the user to efficiently search a large 
collection of documents for the subset that contain one or more instances of a
search term. The search functionality provided to world wide web users by
<a href=www.google.com>Google</a> is, amongst other things, a full-text search
engine, as it allows users to search for all documents on the web that contain,
for example, the term "fts5".

<h2>Overview of Functionality</h2>

<p>To use FTS5, the user creates an FTS5 virtual table with one or more
columns. For example:

<codeblock>
  CREATE VIRTUAL TABLE email USING fts5(sender, title, body);
</codeblock>

................................................................................
  <li> boolean combinations of any of the above.
</ul>

<p> Such advanced searches are requested by providing a more complicated 
FTS5 query string as the text to the right of the MATCH operator. The full
query syntax is [FTS5 query syntax | described here].

<h2>Differences between FTS5 and FTS3/4</h2>

<p> Also available is the similar but more mature [fts3 | FTS3/4] module.
FTS5 is similar to FTS3/4 in that the primary task of each is to maintain
an index mapping from each unique token to a list of instances of that token 
within a set of documents, where each instance is identified by the document 
in which it appears and its position within that document. For example:

<codeblock>
  <i>-- Given the following SQL:</i>
  CREATE VIRTUAL TABLE ft USING fts5(a, b);
  INSERT INTO ft(rowid, a, b) VALUES(1, 'X Y', 'Y Z');
  INSERT INTO ft(rowid, a, b) VALUES(2, 'A Z', 'Y Y');

  <i>-- The FTS5 module creates the following mapping on disk:</i>
  A --&gt; (2, 0, 0)
  X --&gt; (1, 0, 0)
  Y --&gt; (1, 1, 0) (2, 1, 0) (2, 1, 1)
  Z --&gt; (2, 1, 1) (1, 1, 1)
</codeblock>

<p>In the example above, each triple identifies the location of a token
instance by rowid, column number (columns are numbered sequentially
starting at 0 from left to right) and position within the column value (the
first token in a column value is 0, the second is 1, and so on). Using this
index, FTS5 is able to provide timely answers to queries such as "the set
of all documents that contain the token 'A'", or "the set of all documents
that contain the sequence 'Y Z'". The list of instances associated with a
single token is called an "instance-list".

<p>The principle difference between FTS3/4 and FTS5 is that in FTS3/4,
each instance-list is stored as a single large database record, whereas
in FTS5 large instance-lists are divided betwen multiple database records.
This has the following implications for dealing with large databases that
contain large lists:

<ul>
  <li> <p>FTS5 is able to load instance-lists into memory incrementally in
       order to reduce memory usage and peak allocation size. FTS3/4 very
       often loads entire instance-lists into memory.

  <li> <p>When processing queries that feature more than one token, FTS5 is
       sometimes able to determine that the query can be answered by
       inspecting a subset of a large instance-list. FTS3/4 almost always
       has to traverse entire instance-lists.

  <li> If an instance-list grows so large that it exceeds
       the [SQLITE_MAX_LENGTH] limit, FTS3/4 is unable to handle it. FTS5
       does not have this problem. 
</ul>

<p>For these reasons, many complex queries may use less memory and run faster 
using FTS5.

<p>Some other ways in which FTS5 differs from FTS3/4 are:

<ul>
  <li> <p>FTS5 supports "ORDER BY rank" for returning results in order of
       decreasing relevancy.

  <li> <p>FTS5 features an API allowing users to create custom auxiliary 
       functions for advanced ranking and text processing applications. The
       special "rank" column may be mapped to a custom auxiliary function
       so that adding "ORDER BY rank" to a query works as expected.

  <li> <p>FTS5 recognizes unicode separator characters and case equivalence by
       default. This is also possible using FTS3/4, but must be explicitly
       enabled.

  <li> <p>The query syntax has been revised where necessary to remove
       ambiguities and to make it possible to escape special characters
       in query terms.

  <li> <p>By default, FTS3/4 occasionally merges together two or more of the
       b-trees that make up its full-text index within an INSERT, UPDATE or
       DELETE statement executed by the user. This means that any operation
       on an FTS3/4 table may turn out to be surprisingly slow, as FTS3/4 
       may unpredictably choose to merge together two or more large b-trees
       within it. FTS5 uses incremental merging by default, which limits
       the amount of processing that may take place within any given 
       INSERT, UPDATE or DELETE operation.
</ul>

<h1 tags="FTS5 building">Compiling and Using FTS5</h1>

<p>The FTS5 source code consists of a series of *.c and other files in the
"ext/fts5" directory of the SQLite source tree. A build process reduces this
to just two files - "fts5.c" and "fts5.h" - which are used to build an SQLite
loadable extension or statically linked into an application.

................................................................................
  INSERT INTO ft(ft) VALUES('integrity-check');
</codeblock>

<p>If the full-text index is consistent with the contents of the table, the
INSERT used to invoke the integrity-check command succeeds. Or, if any
discrepancy is found, it fails with an [SQLITE_CORRUPT_VTAB] error.















<h2 tags="FTS5 optimize command">The 'optimize' Command</h2>

<p>This command merges all individual b-trees that currently make up the
full-text index into a single large b-tree structure. This ensures that the
full-text index consumes the minimum space within the database and is in the
fastest form to query.

................................................................................
regarding the relationship between the full-text index and its component
b-trees.

<codeblock>
  INSERT INTO ft(ft) VALUES('optimize');
</codeblock>

<h2>The 'pgsz' Configuration Option</h2>

<p> This command is used to set the persistent "pgsz" option.

<p> The full-text index maintained by FTS5 is stored as a series of fixed-size
blobs in a database table. It is not strictly necessary for all blobs that make
up a full-text index to be the same size. The pgsz option determines the size
of all blobs created by subsequent index writers. The default value is 1000.
................................................................................
}
</codeblock>

<p>The following section describes the API offered to auxiliary function
implementations in detail. Further examples may be found in the "fts5_aux.c"
file of the source code.


<h3>Custom Auxiliary Functions API Reference</h3>


<codeblock>
<tclscript>
  set res ""
  set ::extract_api_docs_mode Fts5ExtensionApi
  catch { set res [source [file join $::SRC ext/fts5/extract_api_docs.tcl]] }
  set res
................................................................................

<tclscript>
  set res ""
  unset -nocomplain ::extract_api_docs_mode 
  catch { set res [source [file join $::SRC ext/fts5/extract_api_docs.tcl]] }
  set res
</tclscript>









































































































































































































<
<







 







<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<







 







>
>
>
>
>
>
>
>
>
>
>
>
>
>







 







|







 







>
|
>







 







>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
14
15
16
17
18
19
20


21
22
23
24
25
26
27
..
95
96
97
98
99
100
101



















































































102
103
104
105
106
107
108
....
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
....
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
....
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
....
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
full-text search engines allow the user to efficiently search a large 
collection of documents for the subset that contain one or more instances of a
search term. The search functionality provided to world wide web users by
<a href=www.google.com>Google</a> is, amongst other things, a full-text search
engine, as it allows users to search for all documents on the web that contain,
for example, the term "fts5".



<p>To use FTS5, the user creates an FTS5 virtual table with one or more
columns. For example:

<codeblock>
  CREATE VIRTUAL TABLE email USING fts5(sender, title, body);
</codeblock>

................................................................................
  <li> boolean combinations of any of the above.
</ul>

<p> Such advanced searches are requested by providing a more complicated 
FTS5 query string as the text to the right of the MATCH operator. The full
query syntax is [FTS5 query syntax | described here].




















































































<h1 tags="FTS5 building">Compiling and Using FTS5</h1>

<p>The FTS5 source code consists of a series of *.c and other files in the
"ext/fts5" directory of the SQLite source tree. A build process reduces this
to just two files - "fts5.c" and "fts5.h" - which are used to build an SQLite
loadable extension or statically linked into an application.

................................................................................
  INSERT INTO ft(ft) VALUES('integrity-check');
</codeblock>

<p>If the full-text index is consistent with the contents of the table, the
INSERT used to invoke the integrity-check command succeeds. Or, if any
discrepancy is found, it fails with an [SQLITE_CORRUPT_VTAB] error.

<h2 tags="FTS5 merge command">The 'merge' Command</h2>

<codeblock>
  INSERT INTO ft(ft, rank) VALUES('merge', 500);
</codeblock>

<p> This command merges b-tree structures together until roughly N pages
of merged data have been written to the database. The size of each page is
as configured by the [FTS5 pgsz option].

<p> B-tree structures are only eligible for merging if they would also be
eligible for automatic merging as configured by the [FTS5 automerge option].
If no such b-trees are found when this command is invoked, it is a no-op.

<h2 tags="FTS5 optimize command">The 'optimize' Command</h2>

<p>This command merges all individual b-trees that currently make up the
full-text index into a single large b-tree structure. This ensures that the
full-text index consumes the minimum space within the database and is in the
fastest form to query.

................................................................................
regarding the relationship between the full-text index and its component
b-trees.

<codeblock>
  INSERT INTO ft(ft) VALUES('optimize');
</codeblock>

<h2 tags="FTS5 pgsz option">The 'pgsz' Configuration Option</h2>

<p> This command is used to set the persistent "pgsz" option.

<p> The full-text index maintained by FTS5 is stored as a series of fixed-size
blobs in a database table. It is not strictly necessary for all blobs that make
up a full-text index to be the same size. The pgsz option determines the size
of all blobs created by subsequent index writers. The default value is 1000.
................................................................................
}
</codeblock>

<p>The following section describes the API offered to auxiliary function
implementations in detail. Further examples may be found in the "fts5_aux.c"
file of the source code.

<h3 tags="custom auxiliary functions">
  Custom Auxiliary Functions API Reference
</h3>

<codeblock>
<tclscript>
  set res ""
  set ::extract_api_docs_mode Fts5ExtensionApi
  catch { set res [source [file join $::SRC ext/fts5/extract_api_docs.tcl]] }
  set res
................................................................................

<tclscript>
  set res ""
  unset -nocomplain ::extract_api_docs_mode 
  catch { set res [source [file join $::SRC ext/fts5/extract_api_docs.tcl]] }
  set res
</tclscript>

<h1 id=appendix_a nonumber tags="comparison with fts4">
  Appendix A: Comparison with FTS3/4
</h1>

<p> Also available is the similar but more mature [fts3 | FTS3/4] module. 
FTS5 is a new version of FTS4 that includes various fixes and solutions for 
problems that could not be fixed in FTS4 without sacrificing backwards 
compatibility. Some of these problems are 
[fts5 technical differences | described below].

<h2 nonumber> Application Porting Guide </h2>

<p> In order to use FTS5 instead of FTS3 or FTS4, applications usually require
minimal modifications. Most of these fall into three categories - changes
required to the CREATE VIRTUAL TABLE statement used to create the FTS table,
changes required to SELECT queries used to execute queries against the table,
and changes required to applications that use [FTS auxiliary functions].

<h3 nonumber> Changes to CREATE VIRTUAL TABLE statements </h3>

<ol>
<li> <p>The module name must be changed from "fts3" or "fts4" to "fts5".

<li> <p>All type information or constraint specifications must be removed from
     column definitions. FTS3/4 ignores everything following the column name in
     a column definition, FTS5 attempts to parse it (and will report an error
     if it fails to).

<li> <p>The "matchinfo=fts3" option is not available. The 
     [FTS5 columnsize option | "columnsize=0"] option is equivalent.

<li> <p>The notindexed= option is not available. Adding [unindexed | UNINDEXED]
     to the column definition is equivalent.

<li> <p>The ICU tokenizer is not available.

<li> <p>The compress=, uncompress= and languageid= options are not available.
     There is as of yet no equivalent for their functionality.
</ol>

<codeblock>
  <i> -- FTS3/4 statement </i>
  CREATE VIRTUAL TABLE t1 USING fts4(
    linkid INTEGER,
    header CHAR(20),
    text VARCHAR,
    notindexed=linkid,
    matchinfo=fts3,
    tokenizer=unicode61
  );

  <i> -- FTS5 equivalent (note - the "tokenizer=unicode61" option is not</i>
  <i> -- required as this is the default for FTS5 anyway)</i>
  CREATE VIRTUAL TABLE t1 USING fts5(
    linkid UNINDEXED,
    header,
    text,
    columnsize=0
  );
</codeblock>

<h3 nonumber> Changes to SELECT statements </h3>

<ol>
  <li> <p>The "docid" alias does not exist. Applications must use "rowid"
          instead.

  <li> <p>The left hand side of the MATCH operator in a full-text query must be
          the table name - not any column name as in FTS4.

  <li> <p>The FTS query syntax (right hand side of the MATCH operator) has
          changed in some ways. The FTS5 syntax is quite close to the FTS4
          "enhanced syntax". The main difference is that FTS5 is fussier 
          about unrecognized punctuation characters and similar within query
          strings. Most queries that work with FTS3/4 should also work with
          FTS5, and those that do not should return parse errors.
</ol>

<codeblock>
  <i>-- FTS3/4 query </i>
  SELECT docid FROM t1 WHERE text MATCH 'token';

  <i>-- FTS5 equivalent </i>
  SELECT rowid FROM t1 WHERE t1 MATCH 'text:token';
</codeblock>

<h3 nonumber> Auxiliary Function Changes </h3>

<p> FTS5 has no matchinfo() or offsets() function, and the snippet() function
is not as fully-featured as in FTS3/4. However, since FTS5 does provide 
an API allowing applications to create [custom auxiliary functions], any
required functionality may be implemented within the application code.

<p> The set of built-in auxiliary functions provided by FTS5 may be 
improved upon in the future.

<h3 nonumber> Other Issues</h3>

<ol>
  <li><p> FTS5 does not currently provide an equivalent to the fts4aux table.

  <li><p> The FTS3/4 "merge=X,Y" command has been replaced by the 
          [FTS5 merge command].

  <li><p> The FTS3/4 "automerge=X" command has been replaced by the 
          [FTS5 automerge option].
</ol>

<h2 nonumber tags="fts5 technical differences"> 
  Summary of Technical Differences 
</h2>

<p>FTS5 is similar to FTS3/4 in that the primary task of each is to maintain
an index mapping from each unique token to a list of instances of that token 
within a set of documents, where each instance is identified by the document 
in which it appears and its position within that document. For example:

<codeblock>
  <i>-- Given the following SQL:</i>
  CREATE VIRTUAL TABLE ft USING fts5(a, b);
  INSERT INTO ft(rowid, a, b) VALUES(1, 'X Y', 'Y Z');
  INSERT INTO ft(rowid, a, b) VALUES(2, 'A Z', 'Y Y');

  <i>-- The FTS5 module creates the following mapping on disk:</i>
  A --&gt; (2, 0, 0)
  X --&gt; (1, 0, 0)
  Y --&gt; (1, 1, 0) (2, 1, 0) (2, 1, 1)
  Z --&gt; (2, 1, 1) (1, 1, 1)
</codeblock>

<p>In the example above, each triple identifies the location of a token
instance by rowid, column number (columns are numbered sequentially
starting at 0 from left to right) and position within the column value (the
first token in a column value is 0, the second is 1, and so on). Using this
index, FTS5 is able to provide timely answers to queries such as "the set
of all documents that contain the token 'A'", or "the set of all documents
that contain the sequence 'Y Z'". The list of instances associated with a
single token is called an "instance-list".

<p>The principle difference between FTS3/4 and FTS5 is that in FTS3/4,
each instance-list is stored as a single large database record, whereas
in FTS5 large instance-lists are divided betwen multiple database records.
This has the following implications for dealing with large databases that
contain large lists:

<ul>
  <li> <p>FTS5 is able to load instance-lists into memory incrementally in
       order to reduce memory usage and peak allocation size. FTS3/4 very
       often loads entire instance-lists into memory.

  <li> <p>When processing queries that feature more than one token, FTS5 is
       sometimes able to determine that the query can be answered by
       inspecting a subset of a large instance-list. FTS3/4 almost always
       has to traverse entire instance-lists.

  <li> If an instance-list grows so large that it exceeds
       the [SQLITE_MAX_LENGTH] limit, FTS3/4 is unable to handle it. FTS5
       does not have this problem. 
</ul>

<p>For these reasons, many complex queries may use less memory and run faster 
using FTS5.

<p>Some other ways in which FTS5 differs from FTS3/4 are:

<ul>
  <li> <p>FTS5 supports "ORDER BY rank" for returning results in order of
       decreasing relevancy.

  <li> <p>FTS5 features an API allowing users to create custom auxiliary 
       functions for advanced ranking and text processing applications. The
       special "rank" column may be mapped to a custom auxiliary function
       so that adding "ORDER BY rank" to a query works as expected.

  <li> <p>FTS5 recognizes unicode separator characters and case equivalence by
       default. This is also possible using FTS3/4, but must be explicitly
       enabled.

  <li> <p>The query syntax has been revised where necessary to remove
       ambiguities and to make it possible to escape special characters
       in query terms.

  <li> <p>By default, FTS3/4 occasionally merges together two or more of the
       b-trees that make up its full-text index within an INSERT, UPDATE or
       DELETE statement executed by the user. This means that any operation
       on an FTS3/4 table may turn out to be surprisingly slow, as FTS3/4 
       may unpredictably choose to merge together two or more large b-trees
       within it. FTS5 uses incremental merging by default, which limits
       the amount of processing that may take place within any given 
       INSERT, UPDATE or DELETE operation.
</ul>