Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Fix typos in fts3.in. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
bead827e51d77c15f059ff9ab009f779 |
User & Date: | dan 2010-11-27 11:46:46.000 |
Context
2010-11-27
| ||
15:55 | Add some extra details to fts3.in. (check-in: 5ab479563e user: dan tags: trunk) | |
11:46 | Fix typos in fts3.in. (check-in: bead827e51 user: dan tags: trunk) | |
2010-11-26
| ||
22:46 | Update the download page to use a new, consistent file naming scheme. Download files end with a version number of the form WXXYYZZ where W is 3, XX is the major version, YY is the minor version, and ZZ is the patch number. (check-in: 8b77a71a8c user: drh tags: trunk) | |
Changes
Changes to pages/fts3.in.
︙ | ︙ | |||
31 32 33 34 35 36 37 | allows the user to efficiently query the database for all rows that contain one or more words (hereafter "tokens"), even if the table contains many large documents. <p> For example, if each of the 517430 documents in the "<a href="http://www.cs.cmu.edu/~enron/">Enron E-Mail Dataset</a>" | | | 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 | allows the user to efficiently query the database for all rows that contain one or more words (hereafter "tokens"), even if the table contains many large documents. <p> For example, if each of the 517430 documents in the "<a href="http://www.cs.cmu.edu/~enron/">Enron E-Mail Dataset</a>" is inserted into both an FTS table and an ordinary SQLite table created using the following SQL script: <codeblock> CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT); /* FTS3 table */ CREATE TABLE enrondata2(content TEXT); /* Ordinary table */ </codeblock> |
︙ | ︙ | |||
78 79 80 81 82 83 84 | <p> FTS4 is an enhancement to FTS3. FTS3 has been available since SQLite [version 3.5.0] in 2007-09-04. The enhancements for FTS4 were added with SQLite [version 3.7.4] on 2010-12-08. <p> Which module, FTS3 or FTS4, should you use in your application? | | | | 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 | <p> FTS4 is an enhancement to FTS3. FTS3 has been available since SQLite [version 3.5.0] in 2007-09-04. The enhancements for FTS4 were added with SQLite [version 3.7.4] on 2010-12-08. <p> Which module, FTS3 or FTS4, should you use in your application? FTS4 is sometimes significantly faster than FTS3, even orders of magnitude faster depending on the query, though in the common case the performance of the two modules is similar. FTS4 also offers the enhanced [matchinfo()] outputs which can be useful in ranking the results of a [FTS MATCH|MATCH] operation. On the other hand, because FTS4 stores additional information, FTS4 requires a little more disk space than FTS3, though only a percent of two in most cases. <p> For newer applications, FTS4 is recommended; though if minimal disk usage or compatibility |
︙ | ︙ | |||
102 103 104 105 106 107 108 | be left empty, in which case an FTS table with a single user-defined column named "content" is created. Alternatively, the module arguments may be passed a list of comma separated column names. <p> If column names are explicitly provided for the FTS table as part of the CREATE VIRTUAL TABLE statement, then a datatype name may be optionally | | | 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 | be left empty, in which case an FTS table with a single user-defined column named "content" is created. Alternatively, the module arguments may be passed a list of comma separated column names. <p> If column names are explicitly provided for the FTS table as part of the CREATE VIRTUAL TABLE statement, then a datatype name may be optionally specified for each column. This is pure syntactic sugar, the supplied typenames are not used by FTS or the SQLite core for any purpose. The same applies to any constraints specified along with an FTS column name - they are parsed but not used or recorded by the system in any way. <codeblock> <i>-- Create an FTS table named "data" with one column - "content":</i> |
︙ | ︙ | |||
157 158 159 160 161 162 163 | </codeblock> <p> FTS tables may be dropped from the database using an ordinary [DROP TABLE] statement. For example: <codeblock> | | | | 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 | </codeblock> <p> FTS tables may be dropped from the database using an ordinary [DROP TABLE] statement. For example: <codeblock> <i>-- Create, then immediately drop, an FTS4 table.</i> CREATE VIRTUAL TABLE data USING fts4(); DROP TABLE data; </codeblock> <h2>Populating FTS Tables</h2> <p> FTS tables are populated using [INSERT], [UPDATE] and [DELETE] statements in the same way as ordinary SQLite tables are. <p> As well as the columns named by the user (or the "content" column if no module arguments were specified as part of the [CREATE VIRTUAL TABLE] statement), each FTS table has a "rowid" column. The rowid of an FTS table behaves in the same way as the rowid column of an ordinary SQLite table, except that the values stored in the rowid column of an FTS table remain unchanged if the database is rebuilt using the [VACUUM] command. For FTS tables, "docid" is allowed as an alias along with the usual "rowid", "oid" and "_oid_" identifiers. Attempting to insert or update a row with a docid value that already exists in the table is an error, just as it would |
︙ | ︙ | |||
217 218 219 220 221 222 223 | </codeblock> <p> To support full-text queries, FTS maintains an inverted index that maps from each unique term or word that appears in the dataset to the locations in which it appears within the table contents. For the curious, a complete description of the [segment btree|data structure] used to store | | | 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 | </codeblock> <p> To support full-text queries, FTS maintains an inverted index that maps from each unique term or word that appears in the dataset to the locations in which it appears within the table contents. For the curious, a complete description of the [segment btree|data structure] used to store this index within the database file appears below. A feature of this data structure is that at any time the database may contain not one index b-tree, but several different b-trees that are incrementally merged as rows are inserted, updated and deleted. This technique improves performance when writing to an FTS table, but causes some overhead for full-text queries that use the index. Executing an SQL statement of the form "INSERT INTO <fts-table>(<fts-table>) VALUES('optimize')" causes FTS to merge all existing index b-trees into a single large |
︙ | ︙ | |||
272 273 274 275 276 277 278 | a sub-clause of the form "<column> MATCH ?", FTS is able to use the built-in full-text index to restrict the search to those documents that match the full-text query string specified as the right-hand operand of the MATCH clause. </ul> <p> | | | | | 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 | a sub-clause of the form "<column> MATCH ?", FTS is able to use the built-in full-text index to restrict the search to those documents that match the full-text query string specified as the right-hand operand of the MATCH clause. </ul> <p> If neither of these two query strategies can be used, all queries on FTS tables are implemented using a linear scan of the entire table. If the table contains large amounts of data, this may be an impractical approach (the first example on this page shows that a linear scan of 1.5 GB of data takes around 30 seconds using a modern PC). <codeblock> <i>-- The examples in this block assume the following FTS table:</i> CREATE VIRTUAL TABLE mail USING fts3(subject, body); SELECT * FROM mail WHERE rowid = 15; <i>-- Fast. Rowid lookup.</i> SELECT * FROM mail WHERE body MATCH 'sqlite'; <i>-- Fast. Full-text query.</i> SELECT * FROM mail WHERE mail MATCH 'search'; <i>-- Fast. Full-text query.</i> SELECT * FROM mail WHERE rowid BETWEEN 15 AND 20; <i>-- Slow. Linear scan.</i> SELECT * FROM mail WHERE subject = 'database'; <i>-- Slow. Linear scan.</i> SELECT * FROM mail WHERE subject MATCH 'database'; <i>-- Fast. Full-text query.</i> </codeblock> <p> In all of the full-text queries above, the right-hand operand of the MATCH operator is a string consisting of a single term. In this case, the MATCH expression evaluates to true for all documents that contain one or more instances of the specified word ("sqlite", "search" or "database", depending on which example you look at). Specifying a single term as the right-hand operand of the MATCH operator results in the simplest and most common type of full-text query possible. However more complicated queries are possible, including phrase searches, term-prefix searches and searches for documents containing combinations of terms occuring within a defined proximity of each other. The various ways in which the full-text index may be queried are [FTS MATCH|described below]. <p> |
︙ | ︙ | |||
417 418 419 420 421 422 423 | </ol> <h1 tags="compile fts">Compiling and Enabling FTS3 and FTS4</h1> <p> | | | 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 | </ol> <h1 tags="compile fts">Compiling and Enabling FTS3 and FTS4</h1> <p> Although FTS3 and FTS4 are included with the SQLite core source code, they are not enabled by default. To build SQLite with FTS functionality enabled, define the preprocessor macro [SQLITE_ENABLE_FTS3] when compiling. New applications should also define the [SQLITE_ENABLE_FTS3_PARENTHESIS] macro to enable the [enhanced query syntax] (see below). Usually, this is done by adding the following two switches to the compiler command line: <codeblock> |
︙ | ︙ | |||
452 453 454 455 456 457 458 | the configure script, if any. <p> Because FTS3 and FTS4 are virtual tables, The [SQLITE_ENABLE_FTS3] compile-time option is incompatible with the [SQLITE_OMIT_VIRTUALTABLE] option. <p> | | | 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 | the configure script, if any. <p> Because FTS3 and FTS4 are virtual tables, The [SQLITE_ENABLE_FTS3] compile-time option is incompatible with the [SQLITE_OMIT_VIRTUALTABLE] option. <p> If a build of SQLite does not include the FTS modules, then any attempt to prepare an SQL statement to create an FTS3 or FTS4 table or to drop or access an existing FTS table in any way will fail. The error message returned will be similar to "no such module: ftsN" (where N is either 3 or 4). <p> If the C version of the <a href=http://site.icu-project.org/>ICU library</a> is available, then FTS may also be compiled with the SQLITE_ENABLE_ICU |
︙ | ︙ | |||
475 476 477 478 479 480 481 | <h1 tags="FTS MATCH">Full-text Index Queries</h1> <p> The most useful thing about FTS tables is the queries that may be performed using the built-in full-text index. Full-text queries are performed by specifying a clause of the form | | | 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 | <h1 tags="FTS MATCH">Full-text Index Queries</h1> <p> The most useful thing about FTS tables is the queries that may be performed using the built-in full-text index. Full-text queries are performed by specifying a clause of the form "<column> MATCH <full-text query expression>" as part of the WHERE clause of a SELECT statement that reads data from an FTS table. [simple fts queries|Simple FTS queries] that return all documents that contain a given term are described above. In that discussion the right-hand operand of the MATCH operator was assumed to be a string consisting of a single term. This section describes the more complex query types supported by FTS tables, and how they may be utilized by specifying a more complex query expression as the right-hand operand of a MATCH operator. |
︙ | ︙ | |||
641 642 643 644 645 646 647 | <li> The NOT operator (or, if using the standard syntax, a unary "-" operator) may be used to compute the <b>relative complement</b> of one set of documents with respect to another. </ul> <p> | | | 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 | <li> The NOT operator (or, if using the standard syntax, a unary "-" operator) may be used to compute the <b>relative complement</b> of one set of documents with respect to another. </ul> <p> The FTS modules may be compiled to use one of two slightly different versions of the full-text query syntax, the "standard" query syntax and the "enhanced" query syntax. The basic term, term-prefix, phrase and NEAR queries described above are the same in both versions of the syntax. The way in which set operations are specified is slightly different. The following two sub-sections describe the part of the two query syntaxes that pertains to set operations. Refer to the description of how to [compile fts] for compilation notes. |
︙ | ︙ | |||
1045 1046 1047 1048 1049 1050 1051 | then the blob is zero bytes in size. Otherwise, the blob consists of zero or more 32-bit unsigned integers in machine byte-order. The exact number of integers in the returned array depends on both the query and the value of the second argument (if any) passed to the matchinfo function. <p> The matchinfo function is called with either one or two arguments. As for | | < | | 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 | then the blob is zero bytes in size. Otherwise, the blob consists of zero or more 32-bit unsigned integers in machine byte-order. The exact number of integers in the returned array depends on both the query and the value of the second argument (if any) passed to the matchinfo function. <p> The matchinfo function is called with either one or two arguments. As for all auxiliary functions, the first argument must be the special [FTS hidden column]. The second argument, if it is specified, must be a text value comprised only of the characters 'p', 'c', 'n', 'a', 'l', 's' and 'x'. If no second argument is explicitly supplied, it defaults to "pcx". The second argument is refered to as the "format string" below. <p> Characters in the matchinfo format string are processed from left to right. Each character in the format string causes one or more 32-bit unsigned |
︙ | ︙ | |||
1107 1108 1109 1110 1111 1112 1113 | For each column, the length of the value stored in the current row of the FTS4 table, in tokens. This value is only available when querying FTS4 tables, not FTS3. And only if the "matchinfo=fts3" directive was not specified as part of the "CREATE VIRTUAL TABLE" statement used to create the FTS4 table. <tr><td>s <td><i>cols</i> <td>For each column, the length of the longest subsequence of phrase matches that the column value has in common | | | 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 | For each column, the length of the value stored in the current row of the FTS4 table, in tokens. This value is only available when querying FTS4 tables, not FTS3. And only if the "matchinfo=fts3" directive was not specified as part of the "CREATE VIRTUAL TABLE" statement used to create the FTS4 table. <tr><td>s <td><i>cols</i> <td>For each column, the length of the longest subsequence of phrase matches that the column value has in common with the query text. For example, if a table column contains the text 'a b c d e' and the query is 'a c "d e"', then the length of the longest common subsequence is 2 (phrase "c" followed by phrase "d e"). </table> <p> |
︙ | ︙ | |||
1199 1200 1201 1202 1203 1204 1205 | rules: <ul> <li><p> A term is a contiguous sequence of eligible characters, where eligible characters are all alphanumeric characters, the "_" character, and all characters with UTF codepoints greater than or equal to 128. All other characters are discarded when splitting a document into terms. | | | 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 | rules: <ul> <li><p> A term is a contiguous sequence of eligible characters, where eligible characters are all alphanumeric characters, the "_" character, and all characters with UTF codepoints greater than or equal to 128. All other characters are discarded when splitting a document into terms. Their only contribution is to separate adjacent terms. <li><p> All uppercase characters within the ASCII range (UTF codepoints less than 128), are transformed to their lowercase equivalents as part of the tokenization process. Thus, full-text queries are case-insensitive when using the simple tokenizer. </ul> |
︙ | ︙ |