Overview
Comment: Add documentation for the fts3tokenize table.
SHA1: a6e655aa62475c7cba9d808bceb78f4e
User & Date: drh 2013-04-26 14:38:59.079
Context

2013-04-26
  14:47  Fix typos and add clarification in the fts3tokenize documentation. (check-in: aee1b746ba user: drh tags: trunk)
  14:38  Add documentation for the fts3tokenize table. (check-in: a6e655aa62 user: drh tags: trunk)
2013-04-25
  11:35  Update the FTS3 documentation to make it clearer that external content tables must be in the same database as the FTS virtual table. (check-in: 35441559aa user: drh tags: trunk)
Changes
Changes to pages/changes.in.
︙

 <li>Add support for [memory-mapped I/O].
 <li>Add the [sqlite3_strglob()] convenience interface.
 <li>Report rollback recovery in the [error log] as SQLITE_NOTICE_RECOVER_ROLLBACK.
     Change the error log code for WAL recover from SQLITE_OK to
     SQLITE_NOTICE_RECOVER_WAL.
 <li>Report the risky uses of [unlinked database files] and
     [database filename aliasing] as SQLITE_WARNING messages in the [error log].
+<li>Added the [SQLITE_TRACE_SIZE_LIMIT] compile-time option.
+<li>Increase the default value of [SQLITE_MAX_SCHEMA_RETRY] to 50 and make
+    sure that it is honored in every place that a schema change might force
+    a statement retry.
+<li>Add a new test harness called "mptester" used to verify correct operation
+    when multiple processes are using the same database file at the same time.
+<li>Only consider AS names from the result set as candidates for resolving
+    identifiers in the WHERE clause if there are no other matches.  In the
+    ORDER BY clause, AS names take priority over any column names.
+<li>Enhance the [extension loading] mechanism to be more flexible (while
+    still maintaining backwards compatibility).
+<li>Added many new loadable extensions to the source tree, including amatch,
+    closure, fuzzer, ieee754, nextchar, regexp, spellfix, and wholenumber.
+    See header comments on each extension source file for further
+    information about what that extension does.
+<li>Added the [fts3tokenize virtual table] to the [full-text search] logic.
+<li>Prevent excessive stack usage in the [full-text search] engine when
+    processing full-text queries with many thousands of search terms.
 }

 chng {2013-04-12 (3.7.16.2)} {
 <li>Fix a bug (present since version 3.7.13) that could result in database
     corruption on windows if two or more processes try to access the same
     database file at the same time and immediately after third process
     crashed in the middle of committing to that same file.  See ticket

︙
Changes to pages/compile.in.
︙

The SQLITE_MAX_SCHEMA_RETRY description now reads:

 COMPILE_OPTION {SQLITE_MAX_SCHEMA_RETRY=<i>N</i>} {
   Whenever the database schema changes, prepared statements are automatically
   reprepared to accommodate the new schema.  There is a race condition here
   in that if one thread is constantly changing the schema, another thread
   might spin on reparses and repreparations of a prepared statement and
   never get any real work done.  This parameter prevents an infinite loop
   by forcing the spinning thread to give up after a fixed number of attempts
   at recompiling the prepared statement.  The default setting is 50, which
   is more than adequate for most applications.
 }

 COMPILE_OPTION {SQLITE_POWERSAFE_OVERWRITE=<i><0 or 1></i>} {
   This option changes the default assumption about [powersafe overwrite]
   for the underlying filesystems for the unix and windows [VFSes].
   Setting SQLITE_POWERSAFE_OVERWRITE to 1 causes SQLite to assume that
   application-level writes cannot change bytes outside the range of

︙

 [PRAGMA temp_store] command to override</td></tr>
 <tr><td align="center">3</td><td>Always use memory</td></tr>
 </table>
 The default setting is 1.  Additional information can be found in
 [tempstore | tempfiles.html].
 }

+COMPILE_OPTION {SQLITE_TRACE_SIZE_LIMIT=<i>N</i>} {
+  If this macro is defined to a positive integer <i>N</i>, then the length
+  of strings and BLOBs that are expanded into parameters in the output of
+  [sqlite3_trace()] is limited to <i>N</i> bytes.
+}
+
 COMPILE_OPTION {SQLITE_USE_URI} {
   This option causes the [URI filename] process logic to be enabled by
   default.
 }
 </tcl>
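Hypothetically, enabling the new trace limit when compiling the amalgamation might look like the following (file names, paths, and accompanying flags are illustrative, not taken from this check-in):

```shell
# Illustrative build line: cap traced SQL parameter expansion at 64 bytes.
# sqlite3.c is the amalgamation source; adjust paths for your project.
gcc -DSQLITE_TRACE_SIZE_LIMIT=64 -c sqlite3.c -o sqlite3.o
```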
︙
Changes to pages/fts3.in.
︙
 as the simple tokenizer transforms the term in the query to lowercase
 before searching the full-text index.

 <p>
   As well as the "simple" tokenizer, the FTS source code features a tokenizer
   that uses the
   <a href="http://tartarus.org/~martin/PorterStemmer/">Porter Stemming
   algorithm</a>.  This tokenizer uses the same rules to separate the input
   document into terms, including folding all terms into lower case, but also
   uses the Porter Stemming algorithm to reduce related English language words
   to a common root.  For example, using the same input document as in the
   paragraph above, the porter tokenizer extracts the following tokens:
   "right now thei veri frustrat".  Even though some of these terms are not
   even English words, in some cases using them to build the full-text index
   is more useful than the more intelligible output produced by the simple
   tokenizer.  Using the porter tokenizer, the document not only matches
   full-text queries such as "MATCH 'Frustrated'", but also queries such as
   "MATCH 'Frustration'",
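To get a feel for the suffix stripping described above, here is a deliberately tiny Python sketch. It is NOT the real Porter algorithm (which has many more rules and several rewrite steps); it only applies two toy rules that happen to reproduce the tokens quoted in the paragraph:

```python
def toy_stem(word):
    # Toy illustration of Porter-style suffix stripping -- NOT the real
    # Porter algorithm.  Two rules only: strip a trailing "ed" from
    # longer words, and rewrite a trailing "y" as "i".
    w = word.lower()
    if len(w) > 4 and w.endswith("ed"):
        w = w[:-2]
    if w.endswith("y"):
        w = w[:-1] + "i"
    return w

# Example terms from the documentation's sample sentence:
stemmed = [toy_stem(t) for t in ["Right", "now", "they", "very", "frustrated"]]
```

Here `stemmed` comes out as `["right", "now", "thei", "veri", "frustrat"]`, matching the porter tokenizer output quoted in the text; on other inputs this toy diverges from the real stemmer quickly.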
︙
 unicode space and punctuation characters and uses those to separate tokens.
 The simple tokenizer only does case folding of ASCII characters and only
 recognizes ASCII space and punctuation characters as token separators.

 <h2>Custom (User Implemented) Tokenizers</h2>

 <p>
   As well as the built-in "simple", "porter" and (possibly) "icu" and
   "unicode61" tokenizers, FTS exports an interface that allows users to
   implement custom tokenizers using C.  The interface used to create a new
   tokenizer is defined and described in the fts3_tokenizer.h source file.

 <p>
   Registering a new FTS tokenizer is similar to registering a new
   virtual table module with SQLite.  The user passes a pointer to a
︙
   }
 }
 return sqlite3_finalize(pStmt);
 }
 </codeblock>

+<tcl>hd_fragment fts3tok {fts3tokenize} {fts3tokenize virtual table}</tcl>
+
+<h2>Querying Tokenizers</h2>
+
+<p>The "fts3tokenize" virtual table can be used to directly access any
+tokenizer.  The following SQL demonstrates how to create an instance of
+the fts3tokenize virtual table:
+
+<codeblock>
+CREATE VIRTUAL TABLE tok1 USING fts3tokenize('porter');
+</codeblock>
+
+<p>The name of the desired tokenizer should be substituted in place of
+'porter' in the example, of course.  Once the virtual table is created,
+it can be queried as follows:
+
+<codeblock>
+SELECT token, start, end, position
+  FROM tok1
+ WHERE input='This is a test sentence.';
+</codeblock>
+
+<p>The virtual table will return one row of output for each token in the
+input string.  The "token" column is the text of the token.  The "start"
+and "end" columns are the byte offsets of the beginning and end of the
+token in the original input string.  The "position" column is the sequence
+number of the token in the original input string.  The example above
+generates the following output:
+
+<codeblock>
+thi|0|4|0
+is|5|7|1
+a|8|9|2
+test|10|14|3
+sentenc|15|23|4
+</codeblock>
+
+<p>Notice that the tokens in the result set from the fts3tokenize virtual
+table have been transformed according to the rules of the tokenizer.
+Since this example used the "porter" tokenizer, the "This" token was
+converted into "thi".
+<p>If the original text of the token is desired, it can be retrieved
+using the "start" and "end" columns with the [substr()] function.
+For example:
+
+<codeblock>
+SELECT substr(input, start+1, end-start), token, position
+  FROM tok1
+ WHERE input='This is a test sentence.';
+</codeblock>
+
+<p>The fts3tokenize virtual table can be used with any tokenizer,
+regardless of whether or not there exists an FTS3 or FTS4 table that
+actually uses that tokenizer.

 <h1 tags="segment btree">Data Structures</h1>

 <p>
   This section describes at a high level the way the FTS module stores its
   index and content in the database.  It is <b>not necessary to read or
   understand the material in this section in order to use FTS</b> in an
   application.  However, it may be useful to application developers
   attempting
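The offset arithmetic used by fts3tokenize can be illustrated outside of SQLite. The following Python sketch is an emulation written for this note, not SQLite's actual C tokenizer: it applies the "simple" tokenizer's rules (split on non-alphanumerics, fold to lower case) to produce the same `(token, start, end, position)` columns, then recovers each token's original text from the offsets the way `substr(input, start+1, end-start)` does:

```python
import re

def simple_tokenize(text):
    # Rough emulation of the FTS "simple" tokenizer's output columns:
    # (token, start, end, position).  start/end index the original string,
    # and the token itself is folded to lower case.
    return [(m.group().lower(), m.start(), m.end(), pos)
            for pos, m in enumerate(re.finditer(r"[A-Za-z0-9]+", text))]

text = "This is a test sentence."
rows = simple_tokenize(text)
# Recover each token's original (unfolded) text from its offsets,
# mirroring substr(input, start+1, end-start) in the SQL example.
originals = [text[start:end] for _, start, end, _ in rows]
```

With the documentation's sample input, `rows` reproduces the offsets shown in the output above (e.g. `("sentence", 15, 23, 4)`), and `originals` yields the unfolded words `['This', 'is', 'a', 'test', 'sentence']`.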
︙