Documentation Source Text

Check-in [bc38355689]

Overview
Comment: Add documentation for the fts4 unicode61 tokenizer option "remove_diacritics=0".
SHA1: bc38355689db1c27731ac30d6df2b5ce0286259f
User & Date: dan 2013-09-03 17:20:48.567
Context
2013-09-03
17:22
Add the sha1sum and SQLITE_SOURCE_ID for 3.8.0.2. (check-in: 4069a1b264 user: dan tags: branch-3.8.0)
17:20
Add documentation for the fts4 unicode61 tokenizer option "remove_diacritics=0". (check-in: bc38355689 user: dan tags: branch-3.8.0)
16:06
Changes for the 3.8.0.2 release. (check-in: d705052eb2 user: drh tags: branch-3.8.0)
Changes
Changes to pages/fts3.in.
  The "unicode61" tokenizer is available beginning with SQLite [version 3.7.13].
  Unicode61 works very much like "simple" except that it does full unicode
  case folding according to rules in Unicode Version 6.1 and it recognizes
  unicode space and punctuation characters and uses those to separate tokens.
  The simple tokenizer only does case folding of ASCII characters and only
  recognizes ASCII space and punctuation characters as token separators.
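
<p>
  The difference in case folding can be seen with a small Python sketch
  that indexes one upper-case Cyrillic word with each tokenizer and then
  queries it in lower case. This is an illustration only: the helper name
  count_matches, the table name and the sample word are hypothetical, and
  the sketch assumes that Python's sqlite3 module is linked against an
  SQLite build with FTS4 and "unicode61" enabled.

```python
import sqlite3

def count_matches(tokenizer, text, query):
    """Index `text` in a throwaway FTS4 table and count rows matching `query`.

    `tokenizer` must be a trusted, hard-coded name such as "simple" or
    "unicode61"; it is pasted into the SQL because it cannot be bound.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE VIRTUAL TABLE docs USING fts4(tokenize=%s)" % tokenizer
        )
        conn.execute("INSERT INTO docs(content) VALUES (?)", (text,))
        row = conn.execute(
            "SELECT count(*) FROM docs WHERE docs MATCH ?", (query,)
        ).fetchone()
        return row[0]
    finally:
        conn.close()

# Sample word (hypothetical data): "MOSKVA" in Cyrillic, upper and lower case.
UPPER = "\u041c\u041e\u0421\u041a\u0412\u0410"
LOWER = "\u043c\u043e\u0441\u043a\u0432\u0430"

if __name__ == "__main__":
    for name in ("simple", "unicode61"):
        print(name, count_matches(name, UPPER, LOWER))
```

<p>
  Because "simple" folds only ASCII characters, the lower-case query is not
  expected to match under that tokenizer, whereas "unicode61" folds both the
  indexed word and the query to the same token.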

<p>
  By default, "unicode61" also removes all diacritics from Latin script
  characters. This behaviour can be overridden by adding the tokenizer argument
  "remove_diacritics=0". For example:

<codeblock>
    <i>-- Create tables that remove diacritics from Latin script characters</i>
    <i>-- as part of tokenization.</i>
    CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
    CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");

    <i>-- Create a table that does not remove diacritics from Latin script</i>
    <i>-- characters as part of tokenization.</i>
    CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
</codeblock>
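
<p>
  The effect of the option on queries can be sketched in Python as follows.
  This is an illustration under stated assumptions, not part of FTS itself:
  the helper name count_hits, the table name and the sample text are
  hypothetical, and Python's sqlite3 module is assumed to be linked against
  an SQLite build with FTS4 and "unicode61" enabled.

```python
import sqlite3

def count_hits(tokenize_clause, text, query):
    """Index `text` in a throwaway FTS4 table and count rows matching `query`.

    `tokenize_clause` must be trusted, hard-coded SQL (it cannot be bound
    as a parameter), for example: tokenize=unicode61 "remove_diacritics=0"
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE txt USING fts4(%s)" % tokenize_clause)
        conn.execute("INSERT INTO txt(content) VALUES (?)", (text,))
        row = conn.execute(
            "SELECT count(*) FROM txt WHERE txt MATCH ?", (query,)
        ).fetchone()
        return row[0]
    finally:
        conn.close()

# The two tokenizer configurations described in the text above.
STRIP = 'tokenize=unicode61'                          # default: remove diacritics
KEEP = 'tokenize=unicode61 "remove_diacritics=0"'     # keep diacritics

# Sample text (hypothetical data) using the precomposed character U+00E9.
DOC = "Caf\u00e9 au lait"

if __name__ == "__main__":
    for label, clause in (("strip", STRIP), ("keep", KEEP)):
        for query in ("cafe", "caf\u00e9"):
            print(label, query, count_hits(clause, DOC, query))
```

<p>
  With the default setting, both the indexed text and the query lose their
  diacritics, so either spelling of the query should find the row. With
  "remove_diacritics=0" only the accented spelling is expected to match.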

<h2>Custom (User Implemented) Tokenizers</h2>

<p>
  As well as the built-in "simple", "porter" and (possibly) "icu" and
  "unicode61" tokenizers,
  FTS exports an interface that allows users to implement custom tokenizers
  using C. The interface used to create a new tokenizer is defined and