Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Update fts3.html to describe the fts4aux module and the fts4 compress/uncompress options. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
ead2cc60ef6b652f0eb80f1d98625e32 |
User & Date: | dan 2011-02-03 19:06:21.455 |
Context
2011-02-04
| ||
00:52 | Update the ATTACH documentation to allow an expression as the filename parameter. Core ticket http://www.sqlite.org/src/info/9013e13dba. (check-in: be0c879c12 user: drh tags: trunk) | |
2011-02-03
| ||
19:06 | Update fts3.html to describe the fts4aux module and the fts4 compress/uncompress options. (check-in: ead2cc60ef user: dan tags: trunk) | |
2011-01-31
| ||
14:41 | Change the date on version 3.7.5 to February 1. (check-in: ee5bc9089c user: drh tags: trunk) | |
Changes
Changes to pages/fts3.in.
︙ | ︙ | |||
86 87 88 89 90 91 92 93 94 95 96 97 98 99 | optimizations and extra matchinfo() options, FTS4 tables may consume more disk space than the equivalent table created using FTS3. Usually the overhead is 1-2% or less, but may be as high as 10% if the documents stored in the FTS table are very small. The overhead may be reduced by specifying the directive [matchinfo_fts3|"matchinfo=fts3"] as part of the FTS4 table declaration, but this comes at the expense of sacrificing some of the extra supported matchinfo() options. </ul> <p> FTS4 is an enhancement to FTS3. FTS3 has been available since SQLite [version 3.5.0] in 2007-09-04. The enhancements for FTS4 were added with SQLite [version 3.7.4] on 2010-12-08. <p> | > > > > | 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 | optimizations and extra matchinfo() options, FTS4 tables may consume more disk space than the equivalent table created using FTS3. Usually the overhead is 1-2% or less, but may be as high as 10% if the documents stored in the FTS table are very small. The overhead may be reduced by specifying the directive [matchinfo_fts3|"matchinfo=fts3"] as part of the FTS4 table declaration, but this comes at the expense of sacrificing some of the extra supported matchinfo() options. <li> <p>FTS4 provides hooks (the compress and uncompress [FTS4 options|options]) allowing data to be stored in a compressed form, reducing disk usage and IO. </ul> <p> FTS4 is an enhancement to FTS3. FTS3 has been available since SQLite [version 3.5.0] in 2007-09-04. The enhancements for FTS4 were added with SQLite [version 3.7.4] on 2010-12-08. <p> |
︙ | ︙ | |||
166 167 168 169 170 171 172 173 174 175 | CREATE VIRTUAL TABLE data USING fts4(tokenize=simple); <i>-- Create an FTS table with two columns that uses the "icu" tokenizer.</i> <i>-- The qualifier "en_AU" is passed to the tokenizer implementation</i> CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU); </codeblock> <p> <tcl>hd_fragment *matchinfo_fts3 {matchinfo_fts3}</tcl> If the "CREATE VIRTUAL TABLE" statement specifies module FTS4 (not FTS3), | > > > > > > > > > > > > > | | | < < > > | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | | | | | | 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 | CREATE VIRTUAL TABLE data USING fts4(tokenize=simple); <i>-- Create an FTS table with two columns that uses the "icu" tokenizer.</i> <i>-- The qualifier "en_AU" is passed to the tokenizer implementation</i> CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU); </codeblock> <p> FTS tables may be dropped from the database using an ordinary [DROP TABLE] statement. For example: <codeblock> <i>-- Create, then immediately drop, an FTS4 table.</i> CREATE VIRTUAL TABLE data USING fts4(); DROP TABLE data; </codeblock> <h3 id=fts4_options tags="FTS4 options">FTS4 Options</h3> <p> <tcl>hd_fragment *matchinfo_fts3 {matchinfo_fts3}</tcl> If the "CREATE VIRTUAL TABLE" statement specifies module FTS4 (not FTS3), then special directives - FTS4 options - similar to the "tokenize=*" option may also appear in place of column names. An FTS4 option consists of the option name, followed by an "=" character, followed by the option value. The option value may optionally be enclosed in single or double quotes, with embedded quote characters escaped in the same way as for SQL literals. There may not be whitespace on either side of the "=" character. For example, to create an FTS4 table with the value of option "matchinfo" set to "fts3": <codeblock> <i>-- Create a reduced-footprint FTS4 table.</i> CREATE VIRTUAL TABLE papers USING fts4(author, document, matchinfo=fts3); </codeblock> <p> FTS4 currently supports the following three options: <table striped=1> <tr><th>Option<th>Interpretation <tr><td>matchinfo<td> This option may only be set to the value "fts3". Attempting to set it otherwise is an error. If this option is specified, then some of the extra information stored by FTS4 is omitted. This reduces the amount of disk space consumed by an FTS4 table until it is almost the same as the amount that would be used by the equivalent FTS3 table, but also means that the data accessed by passing the 'l' flag to the [matchinfo()] function is not available. For example: <tr><td>compress<td> This option is used to specify the compress function. It is an error to specify a compress function without also specifying an uncompress function. <tr><td>uncompress<td> This option is used to specify the uncompress function. It is an error to specify an uncompress function without also specifying a compress function. </table> <p> The compress and uncompress options allow FTS4 content to be stored in the database in a compressed form. Both options should be set to the name of an SQL scalar function registered using [sqlite3_create_function()] that accepts a single argument. <p> The compress function should return a compressed version of the value passed to it as an argument. Each time data is written to the FTS4 table, each column value is passed to the compress function and the result value stored in the database. The compress function may return any type of SQLite value (blob, text, real, integer or null). <p> The uncompress function should uncompress data previously compressed by the compress function. In other words, for all SQLite values X, it should be true that uncompress(compress(X)) equals X. When data that has been compressed by the compress function is read from the database by FTS4, it is passed to the uncompress function before it is used. <p> If the specified compress or uncompress functions do not exist, the table may still be created. An error is not returned until the FTS4 table is read (if the uncompress function does not exist) or written (if it is the compress function that does not exist). <codeblock> <i>-- Create an FTS4 table that stores data in compressed form. This</i> <i>-- assumes that the scalar functions zip() and unzip() have been (or</i> <i>-- will be) added to the database handle.</i> CREATE VIRTUAL TABLE papers USING fts4(author, document, compress=zip, uncompress=unzip); </codeblock> <p> When using FTS4, specifying a column name that contains an "=" character and is not either a "tokenize=*" specification or a recognized FTS4 option is an error. With FTS3, the first token in the unrecognized directive is interpreted as a column name. Similarly, specifying multiple "tokenize=*" directives in a single table declaration is an error when using FTS4, whereas the second and subsequent "tokenize=*" directives are interpreted as column names by FTS3. For example: <codeblock> <i>-- An error. FTS4 does not recognize the directive "xyz=abc".</i> CREATE VIRTUAL TABLE papers USING fts4(author, document, xyz=abc); <i>-- Create an FTS3 table with three columns - "author", "document"</i> |
︙ | ︙ | |||
209 210 211 212 213 214 215 | <i>-- table uses the "porter" tokenizer.</i> CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple); <i>-- An error. Cannot create a table with two columns named "tokenize".</i> CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple, tokenize=icu); </codeblock> | < < < < < < < < < < | 284 285 286 287 288 289 290 291 292 293 294 295 296 297 | <i>-- table uses the "porter" tokenizer.</i> CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple); <i>-- An error. Cannot create a table with two columns named "tokenize".</i> CREATE VIRTUAL TABLE papers USING fts3(tokenize=porter, tokenize=simple, tokenize=icu); </codeblock> <h2>Populating FTS Tables</h2> <p> FTS tables are populated using [INSERT], [UPDATE] and [DELETE] statements in the same way as ordinary SQLite tables are. <p> |
︙ | ︙ | |||
1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 | <p> The matchinfo function provides all the information required to calculate probabilistic "bag-of-words" relevancy scores such as <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may be used to order results in a full-text search application. Appendix A of this document, "[search application tips]", contains an example of using the matchinfo() function efficiently. <h1 id=tokenizer tags="tokenizer">Tokenizers</h1> <p> An FTS tokenizer is a set of rules for extracting terms from a document or basic FTS full-text query. | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 | <p> The matchinfo function provides all the information required to calculate probabilistic "bag-of-words" relevancy scores such as <a href=http://en.wikipedia.org/wiki/Okapi_BM25>Okapi BM25/BM25F</a> that may be used to order results in a full-text search application. Appendix A of this document, "[search application tips]", contains an example of using the matchinfo() function efficiently. <h1 id=fts4aux tags="fts4aux">Fts4aux - Direct Access to the Full-Text Index</h1> <p> As of version 3.7.6, SQLite includes a new virtual table module called "fts4aux", which can be used to inspect the full-text index of an exiting FTS3 or FTS4 table directly (despite its name, fts4aux works just as well with FTS3 tables as it does with FTS4 tables). Fts4aux tables are read-only. The only way to modify the contents of an fts4aux table is by modifying the contents of the associated FTS table. The fts4aux module is automatically included in all [compile fts|builds that include FTS]. <p> An fts4aux virtual table is constructed with a single argument - the unqualified name of the FTS table that it will be used to access. For example: <codeblock> <i>-- Create an FTS4 table</i> CREATE VIRTUAL TABLE ft USING fts4; <i>-- Create an fts4aux table to access the full-text index for table "ft"</i> CREATE VIRTUAL TABLE ft_terms USING fts4aux(ft); </codeblock> <p> There is one row in an fts4aux table for each distinct term in the associated FTS table. An fts4aux table always has the same three columns, as follows, from left to right: <table striped=1> <tr><th>Column Name<th>Column Contents <tr><td>term<td> Contains the text of the term for this row. <tr><td>documents<td> Contains the number of rows of the FTS table that contain at least one instance of the term (in any column). This column always contains an integer value. <tr><td>occurrences<td> Contains the total number of instances of the term in all rows of the FTS table. This column also always contains an integer value. </table> <p> For example, using the tables created above: <codeblock> INSERT INTO ft VALUES('Apple banana Cherry'); INSERT INTO ft VALUES('Banana Date Fig'); INSERT INTO ft VALUES('Cherry Grapefruit Cherry'); <i>-- The following query returns this data:</i> <i>--</i> <i>-- apple | 1 | 1</i> <i>-- banana | 2 | 2</i> <i>-- cherry | 2 | 3</i> <i>-- date | 1 | 1</i> <i>-- fig | 1 | 1</i> <i>-- grapefruit | 1 | 1</i> <i>--</i> SELECT term, documents, occurrences FROM ft_terms; </codeblock> <p> In the example, the values in the "term" column are all lower case, even though they were inserted into table "ft" in mixed case. This is because an fts3aux table contains the terms as extracted from the document text by the [tokenizer]. In this case, since table "ft" uses the [tokenizer|simple tokenizer], this means all terms have been folded to lower case. <p> During a transaction, some of the data written to an FTS table may be cached in memory and written to the database only when the transaction is committed. However the implementation of the fts4aux module is only able to read data from the database. In practice this means that if an fts4aux table is queried from within a transaction in which the associated FTS table has been modified, the results of the query are likely to reflect only a (possibly empty) subset of the changes made. <h1 id=tokenizer tags="tokenizer">Tokenizers</h1> <p> An FTS tokenizer is a set of rules for extracting terms from a document or basic FTS full-text query. |
︙ | ︙ |