Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Documentation for the content= and langaugeid= options for FTS4. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
16b58c9eb47d4fda6b8e56a77984ed62 |
User & Date: | drh 2012-03-05 19:14:01.813 |
Context
2012-03-05
| ||
19:40 | Fix typos in the FTS4 languageid option documentation. Add entries to the 3.7.11 change log. (check-in: eb26f05574 user: drh tags: trunk) | |
19:14 | Documentation for the content= and langaugeid= options for FTS4. (check-in: 16b58c9eb4 user: drh tags: trunk) | |
2012-02-27
| ||
07:06 | Fix a bug in the description of the 'simple' FTS tokenizer. Underscores (codepoint 95) are divider characters not token characters. (check-in: 73a0dac584 user: dan tags: trunk) | |
Changes
Changes to pages/fts3.in.
︙ | ︙ | |||
83 84 85 86 87 88 89 | <li> <p>Because it stores extra information on disk in two new [FTS shadow tables|shadow tables] in order to support the performance optimizations and extra matchinfo() options, FTS4 tables may consume more disk space than the equivalent table created using FTS3. Usually the overhead is 1-2% or less, but may be as high as 10% if the documents stored in the FTS table are very small. The overhead may be reduced by specifying the | | | | 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 | <li> <p>Because it stores extra information on disk in two new [FTS shadow tables|shadow tables] in order to support the performance optimizations and extra matchinfo() options, FTS4 tables may consume more disk space than the equivalent table created using FTS3. Usually the overhead is 1-2% or less, but may be as high as 10% if the documents stored in the FTS table are very small. The overhead may be reduced by specifying the directive [FTS4 matchinfo option|"matchinfo=fts3"] as part of the FTS4 table declaration, but this comes at the expense of sacrificing some of the extra supported matchinfo() options. <li> <p>FTS4 provides hooks (the compress and uncompress [FTS4 options|options]) allowing data to be stored in a compressed form, reducing disk usage and IO. </ul> <p> FTS4 is an enhancement to FTS3. FTS3 has been available since SQLite [version 3.5.0] in 2007-09-04. The enhancements for FTS4 were added with SQLite [version 3.7.4] on 2010-12-08. <p> Which module, FTS3 or FTS4, should you use in your application? FTS4 is sometimes significantly faster than FTS3, even orders of magnitude faster depending on the query, though in the common case the performance of the two modules is similar. FTS4 also offers the enhanced [matchinfo()] outputs which can be useful in ranking the results of a [FTS MATCH|MATCH] operation. On the other hand, in the absence of a [FTS4 matchinfo option|matchinfo=fts3] directive FTS4 requires a little more disk space than FTS3, though only a percent of two in most cases. <p> For newer applications, FTS4 is recommended; though if compatibility with older versions of SQLite is important, then FTS3 will usually serve just as well. <h2>Creating and Destroying FTS Tables</h2> |
︙ | ︙ | |||
1308 1309 1310 1311 1312 1313 1314 | table is queried from within a transaction in which the associated FTS table has been modified, the results of the query are likely to reflect only a (possibly empty) subset of the changes made. <h1 id=fts4_options tags="FTS4 options">FTS4 Options</h1> <p> | < < | < < < < < < < < | | | > > > | > > > > > | | > > > > > > > > | 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 | table is queried from within a transaction in which the associated FTS table has been modified, the results of the query are likely to reflect only a (possibly empty) subset of the changes made. <h1 id=fts4_options tags="FTS4 options">FTS4 Options</h1> <p> If the "CREATE VIRTUAL TABLE" statement specifies module FTS4 (not FTS3), then special directives - FTS4 options - similar to the "tokenize=*" option may also appear in place of column names. An FTS4 option consists of the option name, followed by an "=" character, followed by the option value. The option value may optionally be enclosed in single or double quotes, with embedded quote characters escaped in the same way as for SQL literals. There may not be whitespace on either side of the "=" character. For example, to create an FTS4 table with the value of option "matchinfo" set to "fts3": <codeblock> <i>-- Create a reduced-footprint FTS4 table.</i> CREATE VIRTUAL TABLE papers USING fts4(author, document, matchinfo=fts3); </codeblock> <p> FTS4 currently supports the following options: <table striped=1> <tr><th>Option<th>Interpretation <tr><td>compress<td> ^The compress option is used to specify the compress function. ^It is an error to specify a compress function without also specifying an uncompress function. [fts4 compress option|See below] for details. <tr><td>content<td> ^The content allows the text being indexed to stored in a separate table distinct from the FTS4 table, or or even outside of SQLite. <tr><td>languageid<td> ^The languageid option causes the FTS4 table to have an additional hidden integer column that identifies the language of the text contained in each row. The use of the languageid option allows the same FTS4 table to hold text in multiple languages or scripts, each with different tokenizer rules, and to query each language independently of the others. <tr><td>matchinfo<td> When set to the value "fts3", the matchinfo option reduces the amount of information stored by FTS4 with the consequence that the "l" option of [matchinfo()] is no longer available. <tr><td>order<td> <tcl>hd_fragment fts4order {FTS4 order option}</tcl> ^The "order" option may be set to either "DESC" or "ASC" (in upper or lower case). ^If it is set to "DESC", then FTS4 stores its data in such a way as to optimize returning results in descending order by docid. ^If it is set to "ASC" (the default), then the data structures are optimized for returning results in ascending order by docid. ^In other words, if many of the queries run against the FTS4 table use "ORDER BY docid DESC", then it may improve performance to add the "order=desc" option to the CREATE VIRTUAL TABLE statement. <tr><td>prefix<td> This option may be set to a comma-separated list of positive non-zero integers. For each integer N in the list, a separate index is created in the database file to optimize [prefix queries] where the query term is N bytes in length, not including the '*' character, when encoded using UTF-8. [fts4 prefix option|See below] for details. <tr><td>uncompress<td> This option is used to specify the uncompress function. It is an error to specify an uncompress function without also specifying a compress function. [fts4 compress option|See below] for details. </table> <p> When using FTS4, specifying a column name that contains an "=" character and is not either a "tokenize=*" specification or a recognized FTS4 option is an error. With FTS3, the first token in the unrecognized directive is interpreted as a column name. Similarly, specifying multiple "tokenize=*" |
︙ | ︙ | |||
1396 1397 1398 1399 1400 1401 1402 | CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple); <i>-- An error. Cannot create a table with two columns named "tokenize".</i> CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple, tokenize=icu); </codeblock> <tcl>hd_fragment *fts4compression {compressed FTS4 content}</tcl> | | | 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 | CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple); <i>-- An error. Cannot create a table with two columns named "tokenize".</i> CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple, tokenize=icu); </codeblock> <tcl>hd_fragment *fts4compression {compressed FTS4 content}</tcl> <h2 tags="fts4 compress option">The compress= and uncompress= options</h2> <p> The compress and uncompress options allow FTS4 content to be stored in the database in a compressed form. Both options should be set to the name of an SQL scalar function registered using [sqlite3_create_function()] that accepts a single argument. |
︙ | ︙ | |||
1440 1441 1442 1443 1444 1445 1446 1447 1448 | pay attention to data types. Specifically, when a user reads a value from a compressed FTS table, the value returned by FTS is exactly the same as the value returned by the uncompress function, including the data type. If that data type is not the same as the data type of the original value as passed to the compress function (for example if the uncompress function is returning BLOB when compress was originally passed TEXT), then the users query may not function as expected. <tcl>hd_fragment fts4prefix {FTS4 prefix option}</tcl> | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | | 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 | pay attention to data types. Specifically, when a user reads a value from a compressed FTS table, the value returned by FTS is exactly the same as the value returned by the uncompress function, including the data type. If that data type is not the same as the data type of the original value as passed to the compress function (for example if the uncompress function is returning BLOB when compress was originally passed TEXT), then the users query may not function as expected. <tcl>hd_fragment *fts4content {FTS4 content option}</tcl> <h2 tags="fts4 content option">The content= option </h2> <p> The content option allows FTS4 to forego storing the text being indexed. The content option can be used in two ways: <ul> <li><p> The indexed documents are not stored within the SQLite database at all (a "contentless" FTS4 table), or <li><p> The indexed documents are stored in a database table created and managed by the user (an "external content" FTS4 table). </ul> <p> Because the indexed documents themselves are usually much larger than the full-text index, the content option can be used to achieve significant space savings. <h3> Contentless FTS4 Tables </h3> <p> In order to create an FTS4 table that does not store a copy of the indexed documents at all, the content option should be set to an empty string. For example, the following SQL creates such an FTS4 table with three columns - "a", "b", and "c": <codeblock> CREATE VIRTUAL TABLE t1 USING fts4(content="", a, b, c); </codeblock> <p> Data can be inserted into such an FTS4 table using an INSERT statements. However, unlike ordinary FTS4 tables, the user must supply an explicit integer docid value. For example: <codeblock> <i>-- This statement is Ok:</i> INSERT INTO t1(docid, a, b, c) VALUES(1, 'a b c', 'd e f', 'g h i'); <i>-- This statement causes an error, as no docid value has been provided:</i> INSERT INTO t1(a, b, c) VALUES('j k l', 'm n o', 'p q r'); </codeblock> <p> It is not possible to UPDATE or DELETE a row stored in a contentless FTS4 table. Attempting to do so is an error. <p> Contentless FTS4 tables also support SELECT statements. However, it is an error to attempt to retrieve the value of any table column other than the docid column. The auxiliary function matchinfo() may be used, but snippet() and offsets() may not. For example: <codeblock> <i>-- The following statements are Ok:</i> SELECT docid FROM t1 WHERE t1 MATCH 'xxx'; SELECT docid FROM t1 WHERE a MATCH 'xxx'; SELECT matchinfo(t1) FROM t1 WHERE t1 MATCH 'xxx'; <i>-- The following statements all cause errors, as the value of columns</i> <i>-- other than docid are required to evaluate them.</i> SELECT * FROM t1; SELECT a, b FROM t1 WHERE t1 MATCH 'xxx'; SELECT docid FROM t1 WHERE a LIKE 'xxx%'; SELECT snippet(t1) FROM t1 WHERE t1 MATCH 'xxx'; </codeblock> <p> Errors related to attempting to retrieve column values other than docid are runtime errors that occur within sqlite3_step(). In some cases, for example if the MATCH expression in a SELECT query matches zero rows, there may be no error at all even if a statement does refer to column values other than docid. <h3> External Content FTS4 Tables </h3> <p> An "external content" FTS4 table is similar to a contentless table, except that if evaluation of a query requires the value of a column other than docid, FTS4 attempts to retrieve that value from a table (or view, or virtual table) nominated by the user (hereafter referred to as the "content table"). The FTS4 module never writes to the content table, and writing to the content table does not affect the full-text index. It is the responsibility of the user to ensure that the content table and the full-text index are consistent. <p> An external content FTS4 table is created by setting the content option to the name of a table (or view, or virtual table) that may be queried by FTS4 to retrieve column values when required. If the nominated table does not exist, then an external content table behaves in the same way as a contentless table. For example: <codeblock> CREATE TABLE t2(id INTEGER PRIMARY KEY, a, b, c); CREATE VIRTUAL TABLE t3 USING fts4(content="t2", a, c); </codeblock> <p> Assuming the nominated table does exist, then its columns must be the same as or a superset of those defined for the FTS table. <p> When a users query on the FTS table requires a column value other than docid, FTS attempts to read this value from the corresponding column of the row in the content table with a rowid value equal to the current FTS docid. Or, if such a row cannot be found in the content table, a NULL value is used instead. For example: <codeblock> CREATE TABLE t2(id INTEGER PRIMARY KEY, a, b, c, d); CREATE VIRTUAL TABLE t3 USING fts4(content="t2", b, c); INSERT INTO t2 VALUES(2, 'a b', 'c d', 'e f'); INSERT INTO t2 VALUES(3, 'g h', 'i j', 'k l'); INSERT INTO t3(docid, b, c) SELECT id, b, c FROM t2; <i>-- The following query returns a single row with two columns containing -- the text values "i j" and "k l". -- -- The query uses the full-text index to discover that the MATCH -- term matches the row with docid=3. It then retrieves the values -- of columns b and c from the row with rowid=3 in the content table -- to return. --</i> SELECT * FROM t3 WHERE t3 MATCH 'k'; <i>-- Following the UPDATE, the query still returns a single row, this -- time containing the text values "xxx" and "yyy". This is because the -- full-text index still indicates that the row with docid=3 matches -- the FTS4 query 'k', even though the documents stored in the content -- table have been modified. --</i> UPDATE t2 SET b = 'xxx', c = 'yyy' WHERE rowid = 3; SELECT * FROM t3 WHERE t3 MATCH 'k'; <i>-- Following the DELETE below, the query returns one row containing two -- NULL values. NULL values are returned because FTS is unable to find -- a row with rowid=3 within the content table. --</i> DELETE FROM t2; SELECT * FROM t3 WHERE t3 MATCH 'k'; </codeblock> <p> When a row is deleted from an external content FTS4 table, FTS4 needs to retrieve the column values of the row being deleted from the content table. This is so that FTS4 can update the full-text index entries for each token that occurs within the deleted row to indicate that that row has been deleted. If the content table row cannot be found, or if it contains values inconsistent with the contents of the FTS index, the results can be difficult to predict. The FTS index may be left containing entries corresponding to the deleted row, which can lead to seemingly nonsensical results being returned by subsequent SELECT queries. The same applies when a row is updated, as internally an UPDATE is the same as a DELETE followed by an INSERT. <p> Instead of writing separately to the full-text index and the content table, some users may wish to use database triggers to keep the full-text index up to date with respect to the set of documents stored in the content table. For example, using the tables from earlier examples: <codeblock> CREATE TRIGGER t2_bu BEFORE UPDATE ON t2 BEGIN DELETE FROM t3 WHERE docid=old.rowid; END; CREATE TRIGGER t2_bd BEFORE DELETE ON t2 BEGIN DELETE FROM t3 WHERE docid=old.rowid; END; CREATE TRIGGER t2_bu AFTER UPDATE ON t2 BEGIN INSERT INTO t3(docid, b, c) VALUES(new.rowid, new.b, new.c); END; CREATE TRIGGER t2_bd AFTER INSERT ON t2 BEGIN INSERT INTO t3(docid, b, c) VALUES(new.rowid, new.b, new.c); END; </codeblock> <p> The DELETE trigger must be fired before the actual delete takes place on the content table. This is so that FTS4 can still retrieve the original values in order to update the full-text index. And the INSERT trigger must be fired after the new row is inserted, so as to handle the case where the rowid is assigned automatically within the system. The UPDATE trigger must be split into two parts, one fired before and one after the update of the content table, for the same reasons. <p> FTS4 features a special command similar to the 'optimize' command that deletes the entire full-text index and rebuilds it based on the current set of documents in the content table. Assuming again that "t3" is the name of the external content FTS4 table, the command is: <codeblock> INSERT INTO t3(t3) VALUES('rebuild'); </codeblock> <p> This command may also be used with ordinary FTS4 tables, although it may only be useful if the full-text index has somehow become corrupt. It is an error to attempt to rebuild the full-text index maintained by a contentless FTS4 table. <tcl>hd_fragment *fts4langaugeid {FTS4 langaugeid option}</tcl> <h2 tags="fts4 languageid option">The languageid= option</h2> <p> When the langaugeid option is present, it specifies the name of another [hidden column] that is added to the FTS4 table and which is used to specify the language stored in each row of the FTS4 table. The name of the languageid hidden column must be distinct from all other column names in the FTS4 table. Example: <codeblock> CREATE VIRTUAL TABLE t1 USING fts4(x, y, languageid="lid") </codeblock> <p> The default value of a languageid column is 0. Any value inserted into a languageid column is converted to a 32-bit (not 64) signed integer. <p> By default, FTS queries (those that use the MATCH operator) consider only those rows with the languageid column set to 0. To query for rows with other languageid values, a constraint of the form "<language-id> = <integer>" must be added to the queries WHERE clause. For example: <codeblock> SELECT * FROM t1 WHERE t1 MATCH 'abc' AND lid=5; </codeblock> <p> It is not possible for a single FTS query to return rows with different languageid values. The results of adding WHERE clauses that use other operators (e.g. lid!=5, or lid<=5) are undefined. <p> If the content option is used along with the languageid option, then the named languageid column must exist in the content= table (subject to the usual rules - if a query never needs to read the content table then this restriction does not apply). <p> When the languageid option is used, SQLite invokes the xLanguageid() on the sqlite3_tokenizer_module object immediately after the object is created in order to pass in the language id that particular tokenizer should use. The xLanguageid() method will never be called more than once for any single tokenizer object. <tcl>hd_fragment fts4matchinfo {FTS4 matchinfo option}</tcl> <h2 tags="fts4 matchinfo option">The matchinfo= option</h2> <p> The matchinfo option may only be set to the value "fts3". Attempting to set matchinfo to anything other than "fts3" is an error. If this option is specified, then some of the extra information stored by FTS4 is omitted. This reduces the amount of disk space consumed by an FTS4 table until it is almost the same as the amount that would be used by the equivalent FTS3 table, but also means that the data accessed by passing the 'l' flag to the [matchinfo()] function is not available. <tcl>hd_fragment fts4prefix {FTS4 prefix option}</tcl> <h2 tags="fts4 prefix option">The prefix= option</h2> <p> ^The FTS4 prefix option causes FTS to index term prefixes of specified lengths in the same way that it always indexes complete terms. ^The prefix option must be set to a comma separated list of positive non-zero integers. ^For each value N in the list, prefixes of length N bytes (when encoded using UTF-8) are indexed. ^FTS4 uses term prefix indexes to speed up |
︙ | ︙ | |||
1722 1723 1724 1725 1726 1727 1728 | into the FTS virtual table by the user. If the user does not explicitly supply a "docid" value when inserting records, one is selected automatically by the system. <p> The %_stat and %_docsize tables are only created if the FTS table uses the FTS4 module, not FTS3. Furthermore, the %_docsize table is omitted if the | | | 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 | into the FTS virtual table by the user. If the user does not explicitly supply a "docid" value when inserting records, one is selected automatically by the system. <p> The %_stat and %_docsize tables are only created if the FTS table uses the FTS4 module, not FTS3. Furthermore, the %_docsize table is omitted if the FTS4 table is created with the [FTS4 matchinfo option|"matchinfo=fts3"] directive specified as part of the CREATE VIRTUAL TABLE statement. If they are created, the schema of the two tables is as follows: <codeblock> CREATE TABLE %_stat( id INTEGER PRIMARY KEY, value BLOB ); |
︙ | ︙ |