Documentation Source Text

Check-in [00c7833d65]

Overview
Comment: Add documentation for the fts4 unicode61 "tokenchars" and "separators" options.
SHA1: 00c7833d650959518a9bde013aaae6b98059f420
User & Date: dan 2013-09-13 14:52:34.536
Context
2013-09-13 22:03: Add documentation for unlikely(), likelihood() and the soft_heap_limit pragma. (check-in: dc4eae98aa user: drh tags: trunk)
2013-09-13 14:52: Add documentation for the fts4 unicode61 "tokenchars" and "separators" options. (check-in: 00c7833d65 user: dan tags: trunk)
2013-09-02 14:24: Merge change-log enhancements from the 3.8.0 branch into trunk. (check-in: 265fde6eac user: drh tags: trunk)
Changes
Changes to pages/fts3.in.
Old version (lines 2071-2089):

<codeblock>
    <i>-- Create tables that remove diacritics from Latin script characters</i>
    <i>-- as part of tokenization.</i>
    CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
    CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");

    <i>-- Create a tables that does not remove diacritics from Latin script</i>
    <i>-- characters as part of tokenization.</i>
    CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
</codeblock>

<h2>Custom (User Implemented) Tokenizers</h2>

<p>
  As well as the built-in "simple", "porter" and (possibly) "icu" and
  "unicode61" tokenizers,
  FTS exports an interface that allows users to implement custom tokenizers
  using C. The interface used to create a new tokenizer is defined and

New version (lines 2071-2133):

<codeblock>
    <i>-- Create tables that remove diacritics from Latin script characters</i>
    <i>-- as part of tokenization.</i>
    CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
    CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");

    <i>-- Create a table that does not remove diacritics from Latin script</i>
    <i>-- characters as part of tokenization.</i>
    CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
</codeblock>

<p>
  It is also possible to customize the set of codepoints that unicode61 treats
  as separator characters. The "separators=" option may be used to specify one
  or more extra characters that should be treated as separator characters, and
  the "tokenchars=" option may be used to specify one or more extra characters
  that should be treated as part of tokens instead of as separator characters.
  For example:

<codeblock>
    <i>-- Create a table that uses the unicode61 tokenizer, but considers "."</i>
    <i>-- and "=" characters to be part of tokens, and capital "X" characters to</i>
    <i>-- function as separators.</i>
    CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "tokenchars=.=" "separators=X");

    <i>-- Create a table that considers space characters (codepoint 32) to be</i>
    <i>-- token characters.</i>
    CREATE VIRTUAL TABLE txt4 USING fts4(tokenize=unicode61 "tokenchars= ");
</codeblock>

<p>
  If a character specified as part of the argument to "tokenchars=" is considered
  to be a token character by default, it is ignored. This is true even if it has
  been marked as a separator by an earlier "separators=" option. Similarly, if
  a character specified as part of a "separators=" option is treated as a separator
  character by default, it is ignored. If multiple "tokenchars=" or "separators="
  options are specified, all are processed. For example:

<codeblock>
    <i>-- Create a table that uses the unicode61 tokenizer, but considers "."</i>
    <i>-- and "=" characters to be part of tokens, and capital "X" characters to</i>
    <i>-- function as separators. Both of the "tokenchars=" options are processed.</i>
    <i>-- The "separators=" option ignores the "." passed to it, as "." is by</i>
    <i>-- default a separator character, even though it has been marked as a token</i>
    <i>-- character by an earlier "tokenchars=" option.</i>
    CREATE VIRTUAL TABLE txt5 USING fts4(
        tokenize=unicode61 "tokenchars=." "separators=X." "tokenchars=="
    );
</codeblock>
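
<p>
  The precedence rule described above can be easier to follow as a small
  model. The Python sketch below is not SQLite's implementation: it only
  restates the documented rule, and it assumes that str.isalnum() is an
  acceptable stand-in for the default unicode61 classification. The helper
  names (default_is_token, build_exceptions, is_token_char) are hypothetical.

```python
def default_is_token(ch):
    # Assumption: str.isalnum() approximates the default unicode61
    # classification (alphanumeric = token character, anything else = separator).
    return ch.isalnum()


def build_exceptions(options):
    """Collect the characters whose default classification is flipped.

    `options` is a list of (name, chars) pairs in the order they were given,
    where name is "tokenchars" or "separators".
    """
    exceptions = set()
    for name, chars in options:
        if name not in ("tokenchars", "separators"):
            raise ValueError("unknown option: " + name)
        want_token = name == "tokenchars"
        for ch in chars:
            # A character that already has the requested classification by
            # default is ignored, regardless of any earlier option.
            if default_is_token(ch) != want_token:
                exceptions.add(ch)
    return exceptions


def is_token_char(ch, exceptions):
    # An exception flips the default classification of a character.
    return default_is_token(ch) != (ch in exceptions)


# The same options as the txt5 example above.
opts = [("tokenchars", "."), ("separators", "X."), ("tokenchars", "=")]
exc = build_exceptions(opts)
```

  Under this model "." and "=" end up as token characters and "X" as a
  separator, which agrees with the comments in the txt5 example: the "."
  passed to "separators=" is dropped because "." is a separator by default.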

<p>
  The arguments passed to the "tokenchars=" or "separators=" options are 
  case-sensitive. In the example above, specifying that "X" is a separator
  character does not affect the way "x" is handled.
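
<p>
  One way to observe the case-sensitivity is to query a small table from
  Python's sqlite3 module. This is a sketch only: it assumes the SQLite
  library linked into Python was built with FTS4 and the unicode61
  tokenizer, and the table name "demo" and column name "body" are
  hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Capital "X" is declared a separator; lowercase "x" is left untouched.
conn.execute(
    'CREATE VIRTUAL TABLE demo USING fts4(body, tokenize=unicode61 "separators=X")'
)
conn.executemany(
    "INSERT INTO demo(docid, body) VALUES (?, ?)",
    [(1, "alphaXbeta"), (2, "alphaxbeta")],
)


def matching_docids(term):
    # Return the docids of all rows whose text matches the full-text term.
    rows = conn.execute(
        "SELECT docid FROM demo WHERE demo MATCH ? ORDER BY docid", (term,)
    )
    return [docid for (docid,) in rows]


# Row 1 is split at "X" into the tokens "alpha" and "beta".
split_at_upper = matching_docids("alpha")

# Row 2 stays a single token because "x" is not a separator.
whole_lower = matching_docids("alphaxbeta")
```

  If the options behave as described, only row 1 matches "alpha" and only
  row 2 matches "alphaxbeta".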

<h2>Custom (User Implemented) Tokenizers</h2>

<p>
  As well as the built-in "simple", "porter" and (possibly) "icu" and
  "unicode61" tokenizers,
  FTS exports an interface that allows users to implement custom tokenizers
  using C. The interface used to create a new tokenizer is defined and