Documentation Source Text

Check-in [a6e655aa62]

Overview
Comment: Add documentation for the fts3tokenize table.
SHA1: a6e655aa62475c7cba9d808bceb78f4edc5e13c2
User & Date: drh 2013-04-26 14:38:59
Context
2013-04-26
14:47
Fix typos and add clarification in the fts3tokenize documentation. check-in: aee1b746ba user: drh tags: trunk
14:38
Add documentation for the fts3tokenize table. check-in: a6e655aa62 user: drh tags: trunk
2013-04-25
11:35
Update the FTS3 documentation to make it clearer that external content tables must be in the same database as the FTS virtual table. check-in: 35441559aa user: drh tags: trunk
Changes

Changes to pages/changes.in.

    45     45   <li>Add support for [memory-mapped I/O].
    46     46   <li>Add the [sqlite3_strglob()] convenience interface.
    47     47   <li>Report rollback recovery in the [error log] as SQLITE_NOTICE_RECOVER_ROLLBACK.
    48     48       Change the error log code for WAL recover from 
    49     49       SQLITE_OK to SQLITE_NOTICE_RECOVER_WAL.
    50     50   <li>Report the risky uses of [unlinked database files] and 
    51     51      [database filename aliasing] as SQLITE_WARNING messages in the [error log].
           52  +<li>Added the [SQLITE_TRACE_SIZE_LIMIT] compile-time option.
           53  +<li>Increase the default value of [SQLITE_MAX_SCHEMA_RETRY] to 50 and make sure
           54  +    that it is honored in every place that a schema change might force a statement
           55  +    retry.
           56  +<li>Add a new test harness called "mptester" used to verify correct operation
           57  +    when multiple processes are using the same database file at the same time.
           58  +<li>Only consider AS names from the result set as candidates for resolving
           59  +    identifiers in the WHERE clause if there are no other matches. In the 
           60  +    ORDER BY clause, AS names take priority over any column names.
           61  +<li>Enhance the [extension loading] mechanism to be more flexible (while
           62  +    still maintaining backwards compatibility).
           63  +<li>Added many new loadable extensions to the source tree, including
           64  +    amatch, closure, fuzzer, ieee754, nextchar, regexp, spellfix,
           65  +    and wholenumber.  See header comments on each extension source file 
           66  +    for further information about what that extension does.
           67  +<li>Added the [fts3tokenize virtual table] to the [full-text search] logic.
           68  +<li>Prevent excessive stack usage in the [full-text search] engine when processing
           69  +    full-text queries with many thousands of search terms.
    52     70   }
    53     71   
    54     72   chng {2013-04-12 (3.7.16.2)} {
    55     73   <li>Fix a bug (present since version 3.7.13) that could result in database corruption
    56     74       on windows if two or more processes try to access the same database file at the
    57     75       same time and immediately after a third process crashed in the middle of committing
    58     76       to that same file.  See ticket 

Changes to pages/compile.in.

   160    160   COMPILE_OPTION {SQLITE_MAX_SCHEMA_RETRY=<i>N</i>} {
   161    161     Whenever the database schema changes, prepared statements are automatically
   162    162     reprepared to accommodate the new schema.  There is a race condition here
   163    163     in that if one thread is constantly changing the schema, another thread
   164    164     might spin on reparses and repreparations of a prepared statement and
   165    165     never get any real work done.  This parameter prevents an infinite loop
   166    166     by forcing the spinning thread to give up after a fixed number of attempts
   167         -  at recompiling the prepared statement.  The default setting is 5 which is
   168         -  more than adequate for most applications.  But in some obscure cases, it
   169         -  is useful to raise this parameter to 100 or more to prevent spurious
   170         -  [SQLITE_SCHEMA] errors when running [sqlite3_step()].
          167  +  at recompiling the prepared statement.  The default setting is 50, which is
          168  +  more than adequate for most applications.
   171    169   }
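<p>As a rough illustration, here is a minimal C sketch of the retry
   problem this limit bounds, assuming the legacy sqlite3_prepare()
   interface (statements from [sqlite3_prepare_v2()] are reprepared
   automatically, so they never need this loop):

<codeblock>
/* Hedged sketch: with legacy sqlite3_prepare(), a schema change is
** reported as SQLITE_SCHEMA and the application re-prepares by hand.
** With sqlite3_prepare_v2(), SQLite retries internally, giving up
** after SQLITE_MAX_SCHEMA_RETRY attempts.  Function name is
** illustrative. */
#include &lt;sqlite3.h&gt;

static int exec_with_retry(sqlite3 *db, const char *zSql){
  sqlite3_stmt *pStmt;
  int rc;
  do{
    rc = sqlite3_prepare(db, zSql, -1, &amp;pStmt, 0);
    if( rc!=SQLITE_OK ) return rc;
    while( sqlite3_step(pStmt)==SQLITE_ROW ){ /* consume rows */ }
    rc = sqlite3_finalize(pStmt);   /* SQLITE_SCHEMA if the schema moved */
  }while( rc==SQLITE_SCHEMA );      /* re-prepare against the new schema */
  return rc;
}
</codeblock>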
   172    170   
   173    171   COMPILE_OPTION {SQLITE_POWERSAFE_OVERWRITE=<i>&lt;0 or 1&gt;</i>} {
   174    172     This option changes the default assumption about [powersafe overwrite]
   175    173     for the underlying filesystems for the unix and windows [VFSes].
   176    174     Setting SQLITE_POWERSAFE_OVERWRITE to 1 causes SQLite to assume that
   177    175     application-level writes cannot change bytes outside the range of
................................................................................
   334    332     [PRAGMA temp_store] command to override</td></tr>
   335    333     <tr><td align="center">3</td><td>Always use memory</td></tr>
   336    334     </table>
   337    335   
   338    336     The default setting is 1.  
   339    337     Additional information can be found in [tempstore | tempfiles.html].
   340    338   }
          339  +
          340  +COMPILE_OPTION {SQLITE_TRACE_SIZE_LIMIT=<i>N</i>} {
          341  +  If this macro is defined to a positive integer <i>N</i>, then the length of
          342  +  strings and BLOBs that are expanded into parameters in the output of
          343  +  [sqlite3_trace()] is limited to <i>N</i> bytes.  
          344  +}
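<p>For context, a short C sketch of where this limit applies: the text
   handed to an [sqlite3_trace()] callback has bound parameter values
   expanded in place, and SQLITE_TRACE_SIZE_LIMIT caps how many bytes
   of each expanded string or BLOB appear in that text.  The callback
   name below is illustrative:

<codeblock>
#include &lt;stdio.h&gt;
#include &lt;sqlite3.h&gt;

/* Receives each statement's SQL text with parameter values expanded
** (and, if SQLITE_TRACE_SIZE_LIMIT is set, truncated to N bytes). */
static void traceCallback(void *pNotUsed, const char *zSql){
  fprintf(stderr, "SQL: %s\n", zSql);
}

/* After opening the database connection:
**   sqlite3_trace(db, traceCallback, 0);
*/
</codeblock>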
   341    345   
   342    346   COMPILE_OPTION {SQLITE_USE_URI} {
   343    347     This option causes the [URI filename] processing logic to be enabled by 
   344    348     default.  
   345    349   }
   346    350   
   347    351   </tcl>

Changes to pages/fts3.in.

  1961   1961     as the simple tokenizer transforms the term in the query to lowercase
  1962   1962     before searching the full-text index.
  1963   1963   
  1964   1964   <p>
  1965   1965     As well as the "simple" tokenizer, the FTS source code features a tokenizer 
  1966   1966     that uses the <a href="http://tartarus.org/~martin/PorterStemmer/">Porter 
  1967   1967     Stemming algorithm</a>. This tokenizer uses the same rules to separate
  1968         -  the input document into terms, but as well as folding all terms to lower
  1969         -  case it uses the Porter Stemming algorithm to reduce related English language
         1968  +  the input document into terms, including folding all terms into lower case,
         1969  +  but also uses the Porter Stemming algorithm to reduce related English language
  1970   1970     words to a common root. For example, using the same input document as in the
  1971   1971     paragraph above, the porter tokenizer extracts the following tokens:
  1972   1972     "right now thei veri frustrat". Even though some of these terms are not even
  1973   1973     English words, in some cases using them to build the full-text index is more
  1974   1974     useful than the more intelligible output produced by the simple tokenizer.
  1975   1975     Using the porter tokenizer, the document not only matches full-text queries
  1976   1976     such as "MATCH 'Frustrated'", but also queries such as "MATCH 'Frustration'",
................................................................................
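<p>As a hedged SQL sketch of this stemming behavior (the table and
   column names are illustrative):

<codeblock>
CREATE VIRTUAL TABLE docs USING fts4(content, tokenize=porter);
INSERT INTO docs(content) VALUES('Right now they are very frustrated');
-- Both queries match, since "Frustrated" and "Frustration"
-- stem to the same token "frustrat":
SELECT * FROM docs WHERE content MATCH 'Frustrated';
SELECT * FROM docs WHERE content MATCH 'Frustration';
</codeblock>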
  2032   2032     unicode space and punctuation characters and uses those to separate tokens.
  2033   2033     The simple tokenizer only does case folding of ASCII characters and only
  2034   2034     recognizes ASCII space and punctuation characters as token separators.
  2035   2035   
  2036   2036   <h2>Custom (User Implemented) Tokenizers</h2>
  2037   2037   
  2038   2038   <p>
  2039         -  As well as the built-in "simple", "porter" and (possibly) "icu" tokenizers,
         2039  +  As well as the built-in "simple", "porter" and (possibly) "icu" and
         2040  +  "unicode61" tokenizers,
  2040   2041     FTS exports an interface that allows users to implement custom tokenizers
  2041   2042     using C. The interface used to create a new tokenizer is defined and 
  2042   2043     described in the fts3_tokenizer.h source file.
  2043   2044   
  2044   2045   <p>
  2045   2046     Registering a new FTS tokenizer is similar to registering a new
  2046   2047     virtual table module with SQLite. The user passes a pointer to a
................................................................................
  2132   2133         }
  2133   2134       }
  2134   2135   
  2135   2136       return sqlite3_finalize(pStmt);
  2136   2137     }
  2137   2138   </codeblock>
  2138   2139   
  2139         -  
         2140  +
         2141  +<tcl>hd_fragment fts3tok {fts3tokenize} {fts3tokenize virtual table}</tcl>
         2142  +<h2>Querying Tokenizers</h2>
         2143  +
         2144  +<p>The "fts3tokenize" virtual table can be used to directly access any
         2145  +   tokenizer.  The following SQL demonstrates how to create an instance 
         2146  +   of the fts3tokenize virtual table:
         2147  +
         2148  +<codeblock>
         2149  +CREATE VIRTUAL TABLE tok1 USING fts3tokenize('porter');
         2150  +</codeblock>
         2151  +
          2152  +<p>The name of the desired tokenizer should be substituted in place of
         2153  +   'porter' in the example, of course.  Once the virtual table is created,
         2154  +   it can be queried as follows:
         2155  +
         2156  +<codeblock>
         2157  +SELECT token, start, end, position 
         2158  +  FROM tok1
         2159  + WHERE input='This is a test sentence.';
         2160  +</codeblock>
         2161  +
         2162  +<p>The virtual table will return one row of output for each token in the
         2163  +   input string.  The "token" column is the text of the token.  The "start"
          2164  +   and "end" columns are the byte offsets of the beginning and end of the
          2165  +   token in the original input string.  The "position" column is the sequence number
         2166  +   of the token in the original input string.  The example above generates
         2167  +   the following output:
         2168  +
         2169  +<codeblock>
         2170  +thi|0|4|0
         2171  +is|5|7|1
         2172  +a|8|9|2
         2173  +test|10|14|3
         2174  +sentenc|15|23|4
         2175  +</codeblock>
         2176  +
         2177  +<p>Notice that the tokens in the result set from the fts3tokenize virtual
         2178  +   table have been transformed according to the rules of the tokenizer.
         2179  +   Since this example used the "porter" tokenizer, the "This" token was
         2180  +   converted into "thi".  If the original text of the token is desired,
         2181  +   it can be retrieved using the "start" and "end" columns with the
         2182  +   [substr()] function.  For example:
         2183  +
         2184  +<codeblock>
         2185  +SELECT substr(input, start+1, end-start), token, position
         2186  +  FROM tok1
         2187  + WHERE input='This is a test sentence.';
         2188  +</codeblock>
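<p>Assuming the same input sentence, that query should produce output
   along these lines, pairing the original text with each stemmed token:

<codeblock>
This|thi|0
is|is|1
a|a|2
test|test|3
sentence|sentenc|4
</codeblock>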
         2189  +
          2190  +<p>The fts3tokenize virtual table can be used with any tokenizer, regardless
         2191  +   of whether or not there exists an FTS3 or FTS4 table that actually uses
         2192  +   that tokenizer.
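<p>For example (a hedged illustration, assuming the build includes the
   "unicode61" tokenizer), a tokenizer can be examined even though no
   FTS table uses it:

<codeblock>
CREATE VIRTUAL TABLE tok2 USING fts3tokenize('unicode61');
SELECT token, start, end, position FROM tok2 WHERE input='Right now!';
-- right|0|5|0
-- now|6|9|1
</codeblock>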
         2193  +
         2194  + 
  2140   2195   <h1 tags="segment btree">Data Structures</h1>
  2141   2196   
  2142   2197   <p>
  2143   2198     This section describes at a high-level the way the FTS module stores its
  2144   2199     index and content in the database. It is <b>not necessary to read or 
  2145   2200     understand the material in this section in order to use FTS</b> in an 
  2146   2201     application. However, it may be useful to application developers attempting