Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Add an "application porting guide" describing the steps required to port fts3/4 applciations to fts5. Fix some other deficiencies in fts5.html. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
f650ee44b0c626168c35f7dad15d1336 |
User & Date: | dan 2015-07-10 17:04:56.403 |
Context
2015-07-13
| ||
22:07 | First cut at OTA documentation. (check-in: a5dfd4a26b user: drh tags: trunk) | |
2015-07-10
| ||
17:04 | Add an "application porting guide" describing the steps required to port fts3/4 applciations to fts5. Fix some other deficiencies in fts5.html. (check-in: f650ee44b0 user: dan tags: trunk) | |
2015-07-02
| ||
20:03 | Add "Compiling and Using FTS5" section to fts5.html. (check-in: 59ff0eaf28 user: dan tags: trunk) | |
Changes
Changes to pages/fts5.in.
︙ | ︙ | |||
14 15 16 17 18 19 20 | full-text search engines allow the user to efficiently search a large collection of documents for the subset that contain one or more instances of a search term. The search functionality provided to world wide web users by <a href=www.google.com>Google</a> is, amongst other things, a full-text search engine, as it allows users to search for all documents on the web that contain, for example, the term "fts5". | < < | 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | full-text search engines allow the user to efficiently search a large collection of documents for the subset that contain one or more instances of a search term. The search functionality provided to world wide web users by <a href=www.google.com>Google</a> is, amongst other things, a full-text search engine, as it allows users to search for all documents on the web that contain, for example, the term "fts5". <p>To use FTS5, the user creates an FTS5 virtual table with one or more columns. For example: <codeblock> CREATE VIRTUAL TABLE email USING fts5(sender, title, body); </codeblock> |
︙ | ︙ | |||
97 98 99 100 101 102 103 | <li> boolean combinations of any of the above. </ul> <p> Such advanced searches are requested by providing a more complicated FTS5 query string as the text to the right of the MATCH operator. The full query syntax is [FTS5 query syntax | described here]. | < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < | 95 96 97 98 99 100 101 102 103 104 105 106 107 108 | <li> boolean combinations of any of the above. </ul> <p> Such advanced searches are requested by providing a more complicated FTS5 query string as the text to the right of the MATCH operator. The full query syntax is [FTS5 query syntax | described here]. <h1 tags="FTS5 building">Compiling and Using FTS5</h1> <p>The FTS5 source code consists of a series of *.c and other files in the "ext/fts5" directory of the SQLite source tree. A build process reduces this to just two files - "fts5.c" and "fts5.h" - which are used to build an SQLite loadable extension or statically linked into an application. |
︙ | ︙ | |||
1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 | INSERT INTO ft(ft) VALUES('integrity-check'); </codeblock> <p>If the full-text index is consistent with the contents of the table, the INSERT used to invoke the integrity-check command succeeds. Or, if any discrepancy is found, it fails with an [SQLITE_CORRUPT_VTAB] error. <h2 tags="FTS5 optimize command">The 'optimize' Command</h2> <p>This command merges all individual b-trees that currently make up the full-text index into a single large b-tree structure. This ensures that the full-text index consumes the minimum space within the database and is in the fastest form to query. <p>Refer to the documentation for the [FTS5 automerge option] for more details regarding the relationship between the full-text index and its component b-trees. <codeblock> INSERT INTO ft(ft) VALUES('optimize'); </codeblock> | > > > > > > > > > > > > > > | | 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 | INSERT INTO ft(ft) VALUES('integrity-check'); </codeblock> <p>If the full-text index is consistent with the contents of the table, the INSERT used to invoke the integrity-check command succeeds. Or, if any discrepancy is found, it fails with an [SQLITE_CORRUPT_VTAB] error. <h2 tags="FTS5 merge command">The 'merge' Command</h2> <codeblock> INSERT INTO ft(ft, rank) VALUES('merge', 500); </codeblock> <p> This command merges b-tree structures together until roughly N pages of merged data have been written to the database. The size of each page is as configured by the [FTS5 pgsz option]. <p> B-tree structures are only eligible for merging if they would also be eligible for automatic merging as configured by the [FTS5 automerge option]. If no such b-trees are found when this command is invoked, it is a no-op. <h2 tags="FTS5 optimize command">The 'optimize' Command</h2> <p>This command merges all individual b-trees that currently make up the full-text index into a single large b-tree structure. This ensures that the full-text index consumes the minimum space within the database and is in the fastest form to query. <p>Refer to the documentation for the [FTS5 automerge option] for more details regarding the relationship between the full-text index and its component b-trees. <codeblock> INSERT INTO ft(ft) VALUES('optimize'); </codeblock> <h2 tags="FTS5 pgsz option">The 'pgsz' Configuration Option</h2> <p> This command is used to set the persistent "pgsz" option. <p> The full-text index maintained by FTS5 is stored as a series of fixed-size blobs in a database table. It is not strictly necessary for all blobs that make up a full-text index to be the same size. The pgsz option determines the size of all blobs created by subsequent index writers. The default value is 1000. |
︙ | ︙ | |||
1375 1376 1377 1378 1379 1380 1381 | } </codeblock> <p>The following section describes the API offered to auxiliary function implementations in detail. Further examples may be found in the "fts5_aux.c" file of the source code. | > | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 | } </codeblock> <p>The following section describes the API offered to auxiliary function implementations in detail. Further examples may be found in the "fts5_aux.c" file of the source code. <h3 tags="custom auxiliary functions"> Custom Auxiliary Functions API Reference </h3> <codeblock> <tclscript> set res "" set ::extract_api_docs_mode Fts5ExtensionApi catch { set res [source [file join $::SRC ext/fts5/extract_api_docs.tcl]] } set res </tclscript> </codeblock> <tclscript> set res "" unset -nocomplain ::extract_api_docs_mode catch { set res [source [file join $::SRC ext/fts5/extract_api_docs.tcl]] } set res </tclscript> <h1 id=appendix_a nonumber tags="comparison with fts4"> Appendix A: Comparison with FTS3/4 </h1> <p> Also available is the similar but more mature [fts3 | FTS3/4] module. FTS5 is a new version of FTS4 that includes various fixes and solutions for problems that could not be fixed in FTS4 without sacrificing backwards compatibility. Some of these problems are [fts5 technical differences | described below]. <h2 nonumber> Application Porting Guide </h2> <p> In order to use FTS5 instead of FTS3 or FTS4, applications usually require minimal modifications. Most of these fall into three categories - changes required to the CREATE VIRTUAL TABLE statement used to create the FTS table, changes required to SELECT queries used to execute queries against the table, and changes required to applications that use [FTS auxiliary functions]. <h3 nonumber> Changes to CREATE VIRTUAL TABLE statements </h3> <ol> <li> <p>The module name must be changed from "fts3" or "fts4" to "fts5". <li> <p>All type information or constraint specifications must be removed from column definitions. FTS3/4 ignores everything following the column name in a column definition, FTS5 attempts to parse it (and will report an error if it fails to). <li> <p>The "matchinfo=fts3" option is not available. The [FTS5 columnsize option | "columnsize=0"] option is equivalent. <li> <p>The notindexed= option is not available. Adding [unindexed | UNINDEXED] to the column definition is equivalent. <li> <p>The ICU tokenizer is not available. <li> <p>The compress=, uncompress= and languageid= options are not available. There is as of yet no equivalent for their functionality. </ol> <codeblock> <i> -- FTS3/4 statement </i> CREATE VIRTUAL TABLE t1 USING fts4( linkid INTEGER, header CHAR(20), text VARCHAR, notindexed=linkid, matchinfo=fts3, tokenizer=unicode61 ); <i> -- FTS5 equivalent (note - the "tokenizer=unicode61" option is not</i> <i> -- required as this is the default for FTS5 anyway)</i> CREATE VIRTUAL TABLE t1 USING fts5( linkid UNINDEXED, header, text, columnsize=0 ); </codeblock> <h3 nonumber> Changes to SELECT statements </h3> <ol> <li> <p>The "docid" alias does not exist. Applications must use "rowid" instead. <li> <p>The left hand side of the MATCH operator in a full-text query must be the table name - not any column name as in FTS4. <li> <p>The FTS query syntax (right hand side of the MATCH operator) has changed in some ways. The FTS5 syntax is quite close to the FTS4 "enhanced syntax". The main difference is that FTS5 is fussier about unrecognized punctuation characters and similar within query strings. Most queries that work with FTS3/4 should also work with FTS5, and those that do not should return parse errors. </ol> <codeblock> <i>-- FTS3/4 query </i> SELECT docid FROM t1 WHERE text MATCH 'token'; <i>-- FTS5 equivalent </i> SELECT rowid FROM t1 WHERE t1 MATCH 'text:token'; </codeblock> <h3 nonumber> Auxiliary Function Changes </h3> <p> FTS5 has no matchinfo() or offsets() function, and the snippet() function is not as fully-featured as in FTS3/4. However, since FTS5 does provide an API allowing applications to create [custom auxiliary functions], any required functionality may be implemented within the application code. <p> The set of built-in auxiliary functions provided by FTS5 may be improved upon in the future. <h3 nonumber> Other Issues</h3> <ol> <li><p> FTS5 does not currently provide an equivalent to the fts4aux table. <li><p> The FTS3/4 "merge=X,Y" command has been replaced by the [FTS5 merge command]. <li><p> The FTS3/4 "automerge=X" command has been replaced by the [FTS5 automerge option]. </ol> <h2 nonumber tags="fts5 technical differences"> Summary of Technical Differences </h2> <p>FTS5 is similar to FTS3/4 in that the primary task of each is to maintain an index mapping from each unique token to a list of instances of that token within a set of documents, where each instance is identified by the document in which it appears and its position within that document. For example: <codeblock> <i>-- Given the following SQL:</i> CREATE VIRTUAL TABLE ft USING fts5(a, b); INSERT INTO ft(rowid, a, b) VALUES(1, 'X Y', 'Y Z'); INSERT INTO ft(rowid, a, b) VALUES(2, 'A Z', 'Y Y'); <i>-- The FTS5 module creates the following mapping on disk:</i> A --> (2, 0, 0) X --> (1, 0, 0) Y --> (1, 1, 0) (2, 1, 0) (2, 1, 1) Z --> (2, 1, 1) (1, 1, 1) </codeblock> <p>In the example above, each triple identifies the location of a token instance by rowid, column number (columns are numbered sequentially starting at 0 from left to right) and position within the column value (the first token in a column value is 0, the second is 1, and so on). Using this index, FTS5 is able to provide timely answers to queries such as "the set of all documents that contain the token 'A'", or "the set of all documents that contain the sequence 'Y Z'". The list of instances associated with a single token is called an "instance-list". <p>The principle difference between FTS3/4 and FTS5 is that in FTS3/4, each instance-list is stored as a single large database record, whereas in FTS5 large instance-lists are divided betwen multiple database records. This has the following implications for dealing with large databases that contain large lists: <ul> <li> <p>FTS5 is able to load instance-lists into memory incrementally in order to reduce memory usage and peak allocation size. FTS3/4 very often loads entire instance-lists into memory. <li> <p>When processing queries that feature more than one token, FTS5 is sometimes able to determine that the query can be answered by inspecting a subset of a large instance-list. FTS3/4 almost always has to traverse entire instance-lists. <li> If an instance-list grows so large that it exceeds the [SQLITE_MAX_LENGTH] limit, FTS3/4 is unable to handle it. FTS5 does not have this problem. </ul> <p>For these reasons, many complex queries may use less memory and run faster using FTS5. <p>Some other ways in which FTS5 differs from FTS3/4 are: <ul> <li> <p>FTS5 supports "ORDER BY rank" for returning results in order of decreasing relevancy. <li> <p>FTS5 features an API allowing users to create custom auxiliary functions for advanced ranking and text processing applications. The special "rank" column may be mapped to a custom auxiliary function so that adding "ORDER BY rank" to a query works as expected. <li> <p>FTS5 recognizes unicode separator characters and case equivalence by default. This is also possible using FTS3/4, but must be explicitly enabled. <li> <p>The query syntax has been revised where necessary to remove ambiguities and to make it possible to escape special characters in query terms. <li> <p>By default, FTS3/4 occasionally merges together two or more of the b-trees that make up its full-text index within an INSERT, UPDATE or DELETE statement executed by the user. This means that any operation on an FTS3/4 table may turn out to be surprisingly slow, as FTS3/4 may unpredictably choose to merge together two or more large b-trees within it. FTS5 uses incremental merging by default, which limits the amount of processing that may take place within any given INSERT, UPDATE or DELETE operation. </ul> |