Documentation Source Text
Check-in [bc38355689]
Not logged in

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
SHA1 Hash:bc38355689db1c27731ac30d6df2b5ce0286259f
Date: 2013-09-03 17:20:48
User: dan
Comment:Add documentation for the fts4 unicode61 tokenizer option "remove_diacritics=0".
Tags And Properties
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/fts3.in

2060
2061
2062
2063
2064
2065
2066
















2067
2068
2069
2070
2071
2072
2073
  The "unicode61" tokenizer is available beginning with SQLite [version 3.7.13].
  Unicode61 works very much like "simple" except that it does full unicode
  case folding according to rules in Unicode Version 6.1 and it recognizes
  unicode space and punctuation characters and uses those to separate tokens.
  The simple tokenizer only does case folding of ASCII characters and only
  recognizes ASCII space and punctuation characters as token separators.

















<h2>Custom (User Implemented) Tokenizers</h2>

<p>
  As well as the built-in "simple", "porter" and (possibly) "icu" and
  "unicode61" tokenizers,
  FTS exports an interface that allows users to implement custom tokenizers
  using C. The interface used to create a new tokenizer is defined and 







>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>







2060
2061
2062
2063
2064
2065
2066
2067
2068
2069
2070
2071
2072
2073
2074
2075
2076
2077
2078
2079
2080
2081
2082
2083
2084
2085
2086
2087
2088
2089
  The "unicode61" tokenizer is available beginning with SQLite [version 3.7.13].
  Unicode61 works very much like "simple" except that it does full unicode
  case folding according to rules in Unicode Version 6.1 and it recognizes
  unicode space and punctuation characters and uses those to separate tokens.
  The simple tokenizer only does case folding of ASCII characters and only
  recognizes ASCII space and punctuation characters as token separators.

<p>
  By default, "unicode61" also removes all diacritics from Latin script
  characters. This behaviour can be overriden by adding the tokenizer argument
  "remove_diacritics=0". For example:

<codeblock>
    <i>-- Create tables that remove diacritics from Latin script characters</i>
    <i>-- as part of tokenization.</i>
    CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
    CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");

    <i>-- Create a tables that does not remove diacritics from Latin script</i>
    <i>-- characters as part of tokenization.</i>
    CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
</codeblock>

<h2>Custom (User Implemented) Tokenizers</h2>

<p>
  As well as the built-in "simple", "porter" and (possibly) "icu" and
  "unicode61" tokenizers,
  FTS exports an interface that allows users to implement custom tokenizers
  using C. The interface used to create a new tokenizer is defined and