Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Further enhancements to fts5.in. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
984c3483dfdaeeca9a3dff93abab0e1d |
User & Date: | dan 2015-03-06 19:29:24.179 |
Context
2015-03-07
| ||
11:16 | Improve the documentation of fts5 auxiliary functions. (check-in: a358d9babf user: dan tags: trunk) | |
2015-03-06
| ||
19:29 | Further enhancements to fts5.in. (check-in: 984c3483df user: dan tags: trunk) | |
2015-03-05
| ||
21:00 | Add a couple of details to fts5.in. (check-in: 23989471bd user: dan tags: trunk) | |
Changes
Changes to pages/fts5.in.
1 2 3 4 5 6 | <tcl>hd_keywords *fts5 FTS5</tcl> <title>SQLite FTS5 Extension</title> <table_of_contents> | | | | > > > > > > > > | > | > > | > > > | > > | > > | < > > > > > > > > > > > > | > > > > > < > | > > > > > > > | < < > > > > > > | | > > > | > > > > > > > > > > > > > | > > > | > | > > > | > > > | > > > > | > > > | > > > > > > > > > > > > > > > | | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 | <tcl>hd_keywords *fts5 FTS5</tcl> <title>SQLite FTS5 Extension</title> <table_of_contents> <h2 style="margin-left:1.0em" notoc> Overview</h2> <h1>Overview of FTS5</h1> <p>FTS5 is an SQLite [virtual table module] that provides <a href=http://en.wikipedia.org/wiki/Full_text_search>full-text search</a> functionality to database applications. In their most elementary form, full-text search engines allow the user to efficiently search a large collection of documents for the subset that contain one or more instances of a search term. The search functionality provided to world wide web users by <a href=www.google.com>Google</a> is, amongst other things, a full-text search engine, as it allows users to search for all documents on the web that contain, for example, the term "fts5". <h2>Overview of Functionality</h2> <p>To use FTS5, the user creates an FTS5 virtual table with one or more columns. For example: <codeblock> CREATE VIRTUAL TABLE email USING fts5(sender, title, body); </codeblock> <p>It is an error to add types, constraints or [PRIMARY KEY] declarations to a CREATE VIRTUAL TABLE statement used to create an FTS5 table. Once created, an FTS5 table may be populated using [INSERT], [UPDATE] or [DELETE] statements like any other table. Like any other table with no PRIMARY KEY declaration, an FTS5 table has an implicit INTEGER PRIMARY KEY field named rowid. <p>Not shown in the example above is that there are also [FTS5 CREATE TABLE Options | various options] that may be provided to FTS5 as part of the CREATE VIRTUAL TABLE statement to configure various aspects of the new table. These may be used to modify the way in which the FTS5 table extracts terms from documents and queries, to create extra indexes on disk to speed up prefix queries, or to create an FTS5 table that acts as an index on content stored elsewhere. <p>Once populated, a full-text query may be executed on the contents of an FTS5 table by adding a MATCH constraint to the WHERE clause of a SELECT query. The expression to the right of the MATCH operator must be the name of the FTS5 table. The expression on the left must be a text value specifying the term to search for. For example: <codeblock> <i>-- Query for all rows that contain at least once instance of the term</i> <i>-- "fts5" (in any column).</i> SELECT * FROM email WHERE email MATCH 'fts5'; </codeblock> <p> By default, FTS5 full-text searches are case-independent. Like any other SQL query that does not contain an ORDER BY clause, the example above returns results in an arbitrary order. To sort results by relevance (most to least relevant), an ORDER BY may be added to a full-text query as follows: <codeblock> <i>-- Query for all rows that contain at least once instance of the term</i> <i>-- "fts5" (in any column). Return results in order from best to worst</i> <i>-- match. </i> SELECT * FROM email WHERE email MATCH 'fts5' ORDER BY rank; </codeblock> <p> As well as the column values and rowid of a matching row, an application may use [FTS5 auxiliary functions] to retrieve extra information regarding the matched row. For example, an auxiliary function may be used to retrieve a copy of a column value for a matched row with all instances of the matched term surrounded by html <b></b> tags. Auxiliary functions are invoked in the same way as SQLite [corefunc | scalar functions], except that the name of the FTS5 table is specified as the first argument. For example: <codeblock> <i>-- Query for rows that match "fts5". Return a copy of the "body" column</i> <i>-- of each row with the matches surrounded by <b></b> tags.</i> SELECT highlight(email, 2, '<b>', '</b>') FROM email WHERE email MATCH 'fts5' </codeblock> <p>A description of the available auxiliary functions, and more details regarding configuration of the special "rank" column, are [FTS5 auxiliary functions | available below]. [FTS5 custom auxiliary functions| Custom auxiliary functions] may also be implemented in C and registered with FTS5, just as custom SQL functions may be registered with the SQLite core. <p> As well as seaching for all rows that contain a term, FTS5 allows the user to search for rows that contain: <ul> <li> any terms that begin with a specified prefix, <li> "phrases" - sequences of terms or prefix terms that must feature in a document for it to match the query, <li> sets of terms, prefix terms or phrases that appear within a specified proximity of each other (these are called "NEAR queries"), or <li> boolean combinations of any of the above. </ul> <p> Such advanced searches are requested by providing a more complicated FTS5 query string as the text to the right of the MATCH operator. The full query syntax is [FTS5 query syntax | described here]. <h2>Differences between FTS5 and FTS3/4</h2> <p> Also available is the similar but more mature [fts3 | FTS3/4] module. Apart from the exciting new name, FTS5 differs from FTS3/4 in the following ways: <ul> <li> <p>FTS5 supports "ORDER BY rank" for returning results in order of decreasing relevancy. <li> <p>FTS5 features an API allowing users to create custom auxiliary functions for advanced ranking and text processing applications. The special "rank" column may be mapped to a custom auxiliary function so that adding "ORDER BY rank" to a query works as expected. <li> <p>FTS5 recognizes unicode separator characters and case equivalence by default. This is also possible using FTS3/4, but must be explicitly enabled. <li> <p>The query syntax has been revised where necessary to remove ambiguities and to make it possible to escape special characters in query terms. <li> <p>By default, FTS3/4 occasionally merges together two or more of the b-trees that make up its full-text index within an INSERT, UPDATE or DELETE statement executed by the user. This means that any operation on an FTS3/4 table may turn out to be surprisingly slow, as FTS3/4 may unpredictably choose to merge together two or more large b-trees within it. FTS5 uses incremental merging by default, which limits the amount of processing that may take place within any given INSERT, UPDATE or DELETE operation. <li> <p>FTS5 uses significantly less memory when one or more terms in a query match a very large number of documents. </ul> <h1 tags="FTS5 query syntax">Full-text Query Syntax</h1> <p> The following block contains a summary of the FTS query syntax in BNF form. A detailed explanation follows. <codeblock> <phrase> := string [*] |
︙ | ︙ | |||
222 223 224 225 226 227 228 | ... MATCH 'NEAR(one two) three' <i>-- 'NEAR(one two) AND three'</i> ... MATCH 'one OR two three' <i>-- 'one OR two AND three'</i> ... MATCH '(one OR two) three' <i>-- Syntax error!</i> ... MATCH 'func(one two)' <i>-- Syntax error!</i> </codeblock> | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | | 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 | ... MATCH 'NEAR(one two) three' <i>-- 'NEAR(one two) AND three'</i> ... MATCH 'one OR two three' <i>-- 'one OR two AND three'</i> ... MATCH '(one OR two) three' <i>-- Syntax error!</i> ... MATCH 'func(one two)' <i>-- Syntax error!</i> </codeblock> <h1 tags="FTS5 CREATE TABLE Options">FTS5 Table Creation and Initialization</h1> <p>Each argument specified as part of a "CREATE VIRTUAL TABLE ... USING fts5 ..." statement is either a column name or a configuration option. A column name consists of a single FTS5 bareword or a single string literal quoted in any manner acceptable to SQLite. A configuration option consists of an FTS5 bareword - the option name - followed by an "=" character, followed by the option value. The option value is specified using either a single FTS5 bareword or a string literal, again quoted in any manner acceptable to the SQLite core. Anything else is a syntax error. <p>It is an error to attempt to name an fts5 table column "rowid" or "rank", or to assign the same name to a column as is used by the table itself. This is not supported. <p>It is also an error to add a type, constraint specification or PRIMARY KEY clause to a column name in an FTS5 CREATE VIRTUAL TABLE statement. <p> A configuration option consists of an FTS5 bareword - the option name - followed by an "=" character, followed by a either an FTS5 bareword or a string literal. For example: <codeblock> CREATE VIRTUAL TABLE mail USING fts5(sender, title, body, tokenize = 'porter ascii'); </codeblock> <p> There are currently the following configuration options: <ul> <li> The "tokenize" option, used to configure a [FTS5 tokenizers | custom tokenizer]. <li> The "prefix" option, used to add [FTS5 prefix indexes | prefix indexes] to an FTS5 table. <li> The "content" option, used to make the FTS5 table an [FTS5 content option | external content or contentless table]. <li> The "content_rowid" option, used to set the rowid field of an [FTS5 external content tables | external content table]. </ul> <h2 tags="FTS5 prefix indexes">Prefix Indexes</h2> <p> By default, FTS5 maintains a single index recording the location of each token instance within the document set. This means that querying for complete tokens is fast, as it requires a single lookup, but querying for a prefix token can be slow, as it requires a range scan. For example, to query for the prefix token "abc*" requires a range scan of all tokens greater than or equal to "abc" and less than "abd". |
︙ | ︙ | |||
249 250 251 252 253 254 255 | <codeblock> <i>-- Two ways to create an FTS5 table that maintains prefix indexes for -- two and three character prefix tokens.</i> CREATE VIRTUAL TABLE ft USING fts5(a, b, prefix='2 3'); CREATE VIRTUAL TABLE ft USING fts5(a, b, prefix=2, prefix=3); </codeblock> | | | 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 | <codeblock> <i>-- Two ways to create an FTS5 table that maintains prefix indexes for -- two and three character prefix tokens.</i> CREATE VIRTUAL TABLE ft USING fts5(a, b, prefix='2 3'); CREATE VIRTUAL TABLE ft USING fts5(a, b, prefix=2, prefix=3); </codeblock> <h2 tags="FTS5 tokenizers">Tokenizers</h2> <p> The CREATE VIRTUAL TABLE "tokenize" option is used to configure the specific tokenizer used by the FTS5 table. The option argument must be either an FTS5 bareword, or an SQL text literal. The text of the argument is itself treated as a white-space series of one or more FTS5 barewords or SQL text literals. The first of these is the name of the tokenizer to use. The second and subsequent list elements, if they exist, are arguments passed to the |
︙ | ︙ | |||
294 295 296 297 298 299 300 | <li> The <b>porter</b> tokenizer, which implements the <a href=http://tartarus.org/martin/PorterStemmer/>porter stemming algorithm</a>. </ul> <p> It is also possible to create custom tokenizers for FTS5. The API for doing so is [custom tokenizers | described here]. | | | 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 | <li> The <b>porter</b> tokenizer, which implements the <a href=http://tartarus.org/martin/PorterStemmer/>porter stemming algorithm</a>. </ul> <p> It is also possible to create custom tokenizers for FTS5. The API for doing so is [custom tokenizers | described here]. <h3>Unicode61 Tokenizer</h3> <p> The unicode tokenizer classifies all unicode characters as either "separator" or "token" characters. By default all space and punctuation characters, as defined by Unicode 6.1, are considered separators, and all other characters as token characters. Each contiguous run of one or more token characters is considered to be a token. The tokenizer is case-insensitive according to the rules defined by Unicode 6.1. |
︙ | ︙ | |||
342 343 344 345 346 347 348 | -- script characters, and that considers hyphens and underscore characters -- to be part of tokens. </i> CREATE VIRTUAL TABLE ft USING fts5(a, b, tokenize = "unicode61 remove_diacritics 0 tokenchars '-_'" ); </codeblock> | | | 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 | -- script characters, and that considers hyphens and underscore characters -- to be part of tokens. </i> CREATE VIRTUAL TABLE ft USING fts5(a, b, tokenize = "unicode61 remove_diacritics 0 tokenchars '-_'" ); </codeblock> <h3>Ascii Tokenizer</h3> <p> The Ascii tokenizer is similar to the Unicode61 tokenizer, except that: <ul> <li> All non-ASCII characters (those with codepoints greater than 127) are always considered token characters. If any non-ASCII characters are specified as part of the separators option, they are ignored. |
︙ | ︙ | |||
367 368 369 370 371 372 373 | <i>-- Create an FTS5 table that uses the ascii tokenizer, but does not -- consider numeric characters to be part of tokens.</i> CREATE VIRTUAL TABLE ft USING fts5(a, b, tokenize = "ascii separators '0123456789'" ); </codeblock> | | | 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 | <i>-- Create an FTS5 table that uses the ascii tokenizer, but does not -- consider numeric characters to be part of tokens.</i> CREATE VIRTUAL TABLE ft USING fts5(a, b, tokenize = "ascii separators '0123456789'" ); </codeblock> <h3>Porter Tokenizer</h3> <p> The porter tokenizer is a wrapper tokenizer. It takes the output of some other tokenizer and applies the <a href=http://tartarus.org/martin/PorterStemmer/>porter stemming algorithm</a> to each token before it returns it to FTS5. This allows search terms like "correction" to match similar words such as "corrected" or "correcting". The porter stemmer algorithm is designed for use with English language terms |
︙ | ︙ | |||
393 394 395 396 397 398 399 | CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter unicode61'); <i>-- A porter tokenizer used to stem the output of the unicode61 tokenizer, -- with diacritics removed before stemming.</i> CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter unicode61 remove_diacritics 1'); </codeblock> | | | 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 | CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter unicode61'); <i>-- A porter tokenizer used to stem the output of the unicode61 tokenizer, -- with diacritics removed before stemming.</i> CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter unicode61 remove_diacritics 1'); </codeblock> <h2 tags="FTS5 content option">External Content and Contentless Tables</h2> <p> Normally, when a row is inserted into an FTS5 table, as well as the various full-text index entries and other data a copy of the row is stored in a private table managed by the FTS5 module. When column values are requested from the FTS5 table by the user or by an auxiliary function implementation, they are read from this private table. The "content" option may be used to create an |
︙ | ︙ | |||
423 424 425 426 427 428 429 | values. This is known as an "external content" table. In this case all FTS5 functionality may be used, but it is the responsibility of the user to ensure that the contents of the full-text index are consistent with the named database object. If they are not, query results may be unpredictable. </ul> | | | | 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 | values. This is known as an "external content" table. In this case all FTS5 functionality may be used, but it is the responsibility of the user to ensure that the contents of the full-text index are consistent with the named database object. If they are not, query results may be unpredictable. </ul> <h3 tags="FTS5 contentless tables">Contentless Tables</h3> <p> A contentless FTS5 table is created by setting the "content" option to an empty string. For example: <codeblock> CREATE VIRTUAL TABLE f1 USING fts5(a, b, c, content=''); </codeblock> <p> Contentless FTS5 tables do not support UPDATE or DELETE statements, or INSERT statements that do not supply a non-NULL value for the rowid field. Rows may be deleted from a contentless table using an [FTS5 delete command]. <p> Attempting to read any column value except the rowid from a contentless FTS5 table returns an SQL NULL value. <h3 tags="FTS5 external content tables">External Content Tables</h3> <p> An external content FTS5 table is created by setting the content option to the name of a table, virtual table or view (hereafter the "content table") within the same database. Whenever column values are required by FTS5, it queries the content table as follows, with the rowid of the row for which values are required bound to the SQL variable: |
︙ | ︙ | |||
490 491 492 493 494 495 496 | CREATE TRIGGER tbl_au AFTER UPDATE ON tbl BEGIN INSERT INTO fts_idx(fts_idx, rowid, b, c) VALUES('delete', old.a, old.b, old.c); INSERT INTO fts_idx(rowid, b, c) VALUES (new.a, new.b, new.c); END; </codeblock> | | | 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 | CREATE TRIGGER tbl_au AFTER UPDATE ON tbl BEGIN INSERT INTO fts_idx(fts_idx, rowid, b, c) VALUES('delete', old.a, old.b, old.c); INSERT INTO fts_idx(rowid, b, c) VALUES (new.a, new.b, new.c); END; </codeblock> <h1 tags="FTS5 auxiliary functions"> Auxiliary Functions </h1> <h2>Built-in Auxiliary Functions</h2> <h3>The bm25() function</h3> <p> The built-in auxiliary function bm25() returns a real value indicating how well the current row matches the full-text query. The better the match, |
︙ | ︙ | |||
955 956 957 958 959 960 961 | <tclscript> set res "" set ::extract_api_docs_mode tokenizer_api catch { set res [source [file join $::SRC ext/fts5/extract_api_docs.tcl]] } set res </tclscript> | | | 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 | <tclscript> set res "" set ::extract_api_docs_mode tokenizer_api catch { set res [source [file join $::SRC ext/fts5/extract_api_docs.tcl]] } set res </tclscript> <h2 tags="FTS5 custom auxiliary functions">Custom Auxiliary Functions</h2> <p> Implementing a custom auxiliary function is similar to implementing an [application-defined SQL function | scalar SQL function]. The implementation should be a C function of type fts5_extension_function, defined as follows: <codeblock> <tclscript> |
︙ | ︙ |