Documentation Source Text

Check-in [a6e655aa62]

Overview
Comment: Add documentation for the fts3tokenize table.
SHA1: a6e655aa62475c7cba9d808bceb78f4edc5e13c2
User & Date: drh 2013-04-26 14:38:59.079
Context
2013-04-26
14:47
Fix typos and add clarification in the fts3tokenize documentation. (check-in: aee1b746ba user: drh tags: trunk)
14:38
Add documentation for the fts3tokenize table. (check-in: a6e655aa62 user: drh tags: trunk)
2013-04-25
11:35
Update the FTS3 documentation to make it clearer that external content tables must be in the same database as the FTS virtual table. (check-in: 35441559aa user: drh tags: trunk)
Changes
Changes to pages/changes.in.
<li>Add support for [memory-mapped I/O].
<li>Add the [sqlite3_strglob()] convenience interface.
<li>Report rollback recovery in the [error log] as SQLITE_NOTICE_RECOVER_ROLLBACK.
    Change the error log code for WAL recovery from
    SQLITE_OK to SQLITE_NOTICE_RECOVER_WAL.
<li>Report the risky uses of [unlinked database files] and 
   [database filename aliasing] as SQLITE_WARNING messages in the [error log].
}

chng {2013-04-12 (3.7.16.2)} {
<li>Fix a bug (present since version 3.7.13) that could result in database corruption
    on Windows if two or more processes try to access the same database file at the
    same time and immediately after a third process crashed in the middle of committing
    to that same file.  See ticket 







<li>Add support for [memory-mapped I/O].
<li>Add the [sqlite3_strglob()] convenience interface.
<li>Report rollback recovery in the [error log] as SQLITE_NOTICE_RECOVER_ROLLBACK.
    Change the error log code for WAL recovery from
    SQLITE_OK to SQLITE_NOTICE_RECOVER_WAL.
<li>Report the risky uses of [unlinked database files] and 
   [database filename aliasing] as SQLITE_WARNING messages in the [error log].
<li>Added the [SQLITE_TRACE_SIZE_LIMIT] compile-time option.
<li>Increase the default value of [SQLITE_MAX_SCHEMA_RETRY] to 50 and make sure
    that it is honored in every place that a schema change might force a statement
    retry.
<li>Add a new test harness called "mptester" used to verify correct operation
    when multiple processes are using the same database file at the same time.
<li>Only consider AS names from the result set as candidates for resolving
    identifiers in the WHERE clause if there are no other matches. In the 
    ORDER BY clause, AS names take priority over any column names.
<li>Enhance the [extension loading] mechanism to be more flexible (while
    still maintaining backwards compatibility).
<li>Added many new loadable extensions to the source tree, including
    amatch, closure, fuzzer, ieee754, nextchar, regexp, spellfix,
    and wholenumber.  See header comments on each extension source file 
    for further information about what that extension does.
<li>Added the [fts3tokenize virtual table] to the [full-text search] logic.
<li>Prevent excessive stack usage in the [full-text search] engine when processing
    full-text queries with many thousands of search terms.
}

chng {2013-04-12 (3.7.16.2)} {
<li>Fix a bug (present since version 3.7.13) that could result in database corruption
    on Windows if two or more processes try to access the same database file at the
    same time and immediately after a third process crashed in the middle of committing
    to that same file.  See ticket 
Changes to pages/compile.in.
COMPILE_OPTION {SQLITE_MAX_SCHEMA_RETRY=<i>N</i>} {
  Whenever the database schema changes, prepared statements are automatically
  reprepared to accommodate the new schema.  There is a race condition here
  in that if one thread is constantly changing the schema, another thread
  might spin on reparses and repreparations of a prepared statement and
  never get any real work done.  This parameter prevents an infinite loop
  by forcing the spinning thread to give up after a fixed number of attempts
  at recompiling the prepared statement.  The default setting is 5 which is
  more than adequate for most applications.  But in some obscure cases, it
  is useful to raise this parameter to 100 or more to prevent spurious
  [SQLITE_SCHEMA] errors when running [sqlite3_step()].
}

COMPILE_OPTION {SQLITE_POWERSAFE_OVERWRITE=<i>&lt;0 or 1&gt;</i>} {
  This option changes the default assumption about [powersafe overwrite]
  for the underlying filesystems for the unix and windows [VFSes].
  Setting SQLITE_POWERSAFE_OVERWRITE to 1 causes SQLite to assume that
  application-level writes cannot change bytes outside the range of







COMPILE_OPTION {SQLITE_MAX_SCHEMA_RETRY=<i>N</i>} {
  Whenever the database schema changes, prepared statements are automatically
  reprepared to accommodate the new schema.  There is a race condition here
  in that if one thread is constantly changing the schema, another thread
  might spin on reparses and repreparations of a prepared statement and
  never get any real work done.  This parameter prevents an infinite loop
  by forcing the spinning thread to give up after a fixed number of attempts
  at recompiling the prepared statement.  The default setting is 50 which is
  more than adequate for most applications.
}

COMPILE_OPTION {SQLITE_POWERSAFE_OVERWRITE=<i>&lt;0 or 1&gt;</i>} {
  This option changes the default assumption about [powersafe overwrite]
  for the underlying filesystems for the unix and windows [VFSes].
  Setting SQLITE_POWERSAFE_OVERWRITE to 1 causes SQLite to assume that
  application-level writes cannot change bytes outside the range of
  [PRAGMA temp_store] command to override</td></tr>
  <tr><td align="center">3</td><td>Always use memory</td></tr>
  </table>

  The default setting is 1.  
  Additional information can be found in [tempstore | tempfiles.html].
}







COMPILE_OPTION {SQLITE_USE_URI} {
  This option causes the [URI filename] processing logic to be enabled by
  default.  
}

</tcl>







  [PRAGMA temp_store] command to override</td></tr>
  <tr><td align="center">3</td><td>Always use memory</td></tr>
  </table>

  The default setting is 1.  
  Additional information can be found in [tempstore | tempfiles.html].
}

COMPILE_OPTION {SQLITE_TRACE_SIZE_LIMIT=<i>N</i>} {
  If this macro is defined to a positive integer <i>N</i>, then the length of
  strings and BLOBs that are expanded into parameters in the output of
  [sqlite3_trace()] is limited to <i>N</i> bytes.  
}

COMPILE_OPTION {SQLITE_USE_URI} {
  This option causes the [URI filename] processing logic to be enabled by
  default.  
}

</tcl>
Changes to pages/fts3.in.
  as the simple tokenizer transforms the term in the query to lowercase
  before searching the full-text index.

<p>
  As well as the "simple" tokenizer, the FTS source code features a tokenizer 
  that uses the <a href="http://tartarus.org/~martin/PorterStemmer/">Porter 
  Stemming algorithm</a>. This tokenizer uses the same rules to separate
  the input document into terms, but as well as folding all terms to lower
  case it uses the Porter Stemming algorithm to reduce related English language
  words to a common root. For example, using the same input document as in the
  paragraph above, the porter tokenizer extracts the following tokens:
  "right now thei veri frustrat". Even though some of these terms are not even
  English words, in some cases using them to build the full-text index is more
  useful than the more intelligible output produced by the simple tokenizer.
  Using the porter tokenizer, the document not only matches full-text queries
  such as "MATCH 'Frustrated'", but also queries such as "MATCH 'Frustration'",







  as the simple tokenizer transforms the term in the query to lowercase
  before searching the full-text index.

<p>
  As well as the "simple" tokenizer, the FTS source code features a tokenizer 
  that uses the <a href="http://tartarus.org/~martin/PorterStemmer/">Porter 
  Stemming algorithm</a>. This tokenizer uses the same rules to separate
  the input document into terms including folding all terms into lower case,
  but also uses the Porter Stemming algorithm to reduce related English language
  words to a common root. For example, using the same input document as in the
  paragraph above, the porter tokenizer extracts the following tokens:
  "right now thei veri frustrat". Even though some of these terms are not even
  English words, in some cases using them to build the full-text index is more
  useful than the more intelligible output produced by the simple tokenizer.
  Using the porter tokenizer, the document not only matches full-text queries
  such as "MATCH 'Frustrated'", but also queries such as "MATCH 'Frustration'",
  unicode space and punctuation characters and uses those to separate tokens.
  The simple tokenizer only does case folding of ASCII characters and only
  recognizes ASCII space and punctuation characters as token separators.

<h2>Custom (User Implemented) Tokenizers</h2>

<p>
  As well as the built-in "simple", "porter" and (possibly) "icu" tokenizers,
  FTS exports an interface that allows users to implement custom tokenizers
  using C. The interface used to create a new tokenizer is defined and 
  described in the fts3_tokenizer.h source file.

<p>
  Registering a new FTS tokenizer is similar to registering a new
  virtual table module with SQLite. The user passes a pointer to a







  unicode space and punctuation characters and uses those to separate tokens.
  The simple tokenizer only does case folding of ASCII characters and only
  recognizes ASCII space and punctuation characters as token separators.

<h2>Custom (User Implemented) Tokenizers</h2>

<p>
  As well as the built-in "simple", "porter" and (possibly) "icu" and
  "unicode61" tokenizers,
  FTS exports an interface that allows users to implement custom tokenizers
  using C. The interface used to create a new tokenizer is defined and 
  described in the fts3_tokenizer.h source file.

<p>
  Registering a new FTS tokenizer is similar to registering a new
  virtual table module with SQLite. The user passes a pointer to a
      }
    }

    return sqlite3_finalize(pStmt);
  }
</codeblock>

<h1 tags="segment btree">Data Structures</h1>

<p>
  This section describes at a high level the way the FTS module stores its
  index and content in the database. It is <b>not necessary to read or 
  understand the material in this section in order to use FTS</b> in an 
  application. However, it may be useful to application developers attempting 







      }
    }

    return sqlite3_finalize(pStmt);
  }
</codeblock>


<tcl>hd_fragment fts3tok {fts3tokenize} {fts3tokenize virtual table}</tcl>
<h2>Querying Tokenizers</h2>

<p>The "fts3tokenize" virtual table can be used to directly access any
   tokenizer.  The following SQL demonstrates how to create an instance 
   of the fts3tokenize virtual table:

<codeblock>
CREATE VIRTUAL TABLE tok1 USING fts3tokenize('porter');
</codeblock>

<p>The name of the desired tokenizer should be substituted in place of
   'porter' in the example.  Once the virtual table is created,
   it can be queried as follows:

<codeblock>
SELECT token, start, end, position 
  FROM tok1
 WHERE input='This is a test sentence.';
</codeblock>

<p>The virtual table will return one row of output for each token in the
   input string.  The "token" column is the text of the token.  The "start"
   and "end" columns are the byte offsets of the beginning and end of the
   token in the original input string.  The "position" column is the sequence
   number of the token in the original input string.  The example above generates
   the following output:

<codeblock>
thi|0|4|0
is|5|7|1
a|8|9|2
test|10|14|3
sentenc|15|23|4
</codeblock>
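
<p>The same query can be issued from a host language.  The following is a
   minimal sketch using Python's standard sqlite3 module, not part of the
   checked-in documentation.  The helper name tokenize() is hypothetical, and
   the sketch assumes the SQLite library linked into Python was built with
   FTS3 support; if it was not, the helper returns None instead of rows.

```python
import sqlite3


def tokenize(text, tokenizer="porter"):
    """Return (token, start, end, position) rows for text.

    Returns None when the statement cannot be prepared, for example
    because the linked SQLite build lacks the fts3tokenize module.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # The tokenizer name is a module argument, not a bindable parameter,
        # so only trusted names should be interpolated here.
        conn.execute(
            f"CREATE VIRTUAL TABLE tok1 USING fts3tokenize('{tokenizer}')"
        )
        # The input text, in contrast, can be bound as an ordinary parameter.
        return conn.execute(
            "SELECT token, start, end, position FROM tok1 WHERE input = ?",
            (text,),
        ).fetchall()
    except sqlite3.OperationalError:
        return None
    finally:
        conn.close()


rows = tokenize("This is a test sentence.")
for row in rows or []:
    print(row)
```

<p>Passing a different tokenizer name, such as "simple", to the hypothetical
   helper makes it possible to compare how two tokenizers split the same input.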

<p>Notice that the tokens in the result set from the fts3tokenize virtual
   table have been transformed according to the rules of the tokenizer.
   Since this example used the "porter" tokenizer, the "This" token was
   converted into "thi".  If the original text of the token is desired,
   it can be retrieved using the "start" and "end" columns with the
   [substr()] function.  For example:

<codeblock>
SELECT substr(input, start+1, end-start), token, position
  FROM tok1
 WHERE input='This is a test sentence.';
</codeblock>
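
<p>The "start+1" in the expression above is needed because [substr()] counts
   from 1 while the "start" offset counts from 0.  The sketch below, which is
   an illustration rather than part of the checked-in documentation, applies
   the same arithmetic to the sample rows shown earlier using Python's
   standard sqlite3 module.  The rows are copied by hand, so the sketch does
   not require fts3tokenize itself.  Note that substr() counts characters for
   text values while "start" and "end" are byte offsets, so the two agree only
   when the input is single-byte (ASCII) text, as it is here.

```python
import sqlite3

text = "This is a test sentence."

# (token, start, end, position) rows copied from the sample output above.
sample_rows = [
    ("thi", 0, 4, 0),
    ("is", 5, 7, 1),
    ("a", 8, 9, 2),
    ("test", 10, 14, 3),
    ("sentenc", 15, 23, 4),
]

conn = sqlite3.connect(":memory:")
originals = []
for _token, start, end, _position in sample_rows:
    # substr(X, Y, Z): Y is a 1-based start index, Z is a length.
    (original,) = conn.execute(
        "SELECT substr(?, ? + 1, ? - ?)", (text, start, end, start)
    ).fetchone()
    originals.append(original)
conn.close()

print(originals)  # ['This', 'is', 'a', 'test', 'sentence']
```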

<p>The fts3tokenize virtual table can be used on any tokenizer, regardless
   of whether or not there exists an FTS3 or FTS4 table that actually uses
   that tokenizer.

 
<h1 tags="segment btree">Data Structures</h1>

<p>
  This section describes at a high level the way the FTS module stores its
  index and content in the database. It is <b>not necessary to read or 
  understand the material in this section in order to use FTS</b> in an 
  application. However, it may be useful to application developers attempting