Documentation Source Text

Check-in [bc38355689]

Overview
Comment: Add documentation for the fts4 unicode61 tokenizer option "remove_diacritics=0".
SHA1: bc38355689db1c27731ac30d6df2b5ce0286259f
User & Date: dan 2013-09-03 17:20:48
Context
2013-09-03
17:22 Add the sha1sum and SQLITE_SOURCE_ID for 3.8.0.2. check-in: 4069a1b264 user: dan tags: branch-3.8.0
17:20 Add documentation for the fts4 unicode61 tokenizer option "remove_diacritics=0". check-in: bc38355689 user: dan tags: branch-3.8.0
16:06 Changes for the 3.8.0.2 release. check-in: d705052eb2 user: drh tags: branch-3.8.0
Changes

Changes to pages/fts3.in.

  2060   2060     The "unicode61" tokenizer is available beginning with SQLite [version 3.7.13].
  2061   2061     Unicode61 works very much like "simple" except that it does full Unicode
  2062   2062     case folding according to rules in Unicode Version 6.1 and it recognizes
  2063   2063     Unicode space and punctuation characters and uses those to separate tokens.
  2064   2064     The simple tokenizer only does case folding of ASCII characters and only
  2065   2065     recognizes ASCII space and punctuation characters as token separators.
  2066   2066   
         2067  +<p>
         2068  +  By default, "unicode61" also removes all diacritics from Latin script
         2069  +  characters. This behaviour can be overridden by adding the tokenizer argument
         2070  +  "remove_diacritics=0". For example:
         2071  +
         2072  +<codeblock>
         2073  +    <i>-- Create tables that remove diacritics from Latin script characters</i>
         2074  +    <i>-- as part of tokenization.</i>
         2075  +    CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
         2076  +    CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");
         2077  +
         2078  +    <i>-- Create a table that does not remove diacritics from Latin script</i>
         2079  +    <i>-- characters as part of tokenization.</i>
         2080  +    CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
         2081  +</codeblock>
         2082  +
  2067   2083   <h2>Custom (User Implemented) Tokenizers</h2>
  2068   2084   
  2069   2085   <p>
  2070   2086     As well as the built-in "simple", "porter" and (possibly) "icu" and
  2071   2087     "unicode61" tokenizers,
  2072   2088     FTS exports an interface that allows users to implement custom tokenizers
  2073   2089     using C. The interface used to create a new tokenizer is defined and