Overview
Comment: | Add documentation for the fts4 unicode61 tokenizer option "remove_diacritics=0". |
---|---|
SHA1: | bc38355689db1c27731ac30d6df2b5ce |
User & Date: | dan 2013-09-03 17:20:48.567 |
Context
2013-09-03
17:22 | Add the sha1sum and SQLITE_SOURCE_ID for 3.8.0.2. (check-in: 4069a1b264 user: dan tags: branch-3.8.0) | |
17:20 | Add documentation for the fts4 unicode61 tokenizer option "remove_diacritics=0". (check-in: bc38355689 user: dan tags: branch-3.8.0) | |
16:06 | Changes for the 3.8.0.2 release. (check-in: d705052eb2 user: drh tags: branch-3.8.0) | |
Changes
Changes to pages/fts3.in.
︙ | ︙ | |||
(Lines marked ">" were added by this check-in; unmarked lines are unchanged context.)

  The "unicode61" tokenizer is available beginning with SQLite [version 3.7.13].
  Unicode61 works very much like "simple" except that it does full unicode
  case folding according to rules in Unicode Version 6.1 and it recognizes
  unicode space and punctuation characters and uses those to separate tokens.
  The simple tokenizer only does case folding of ASCII characters and only
  recognizes ASCII space and punctuation characters as token separators.

> <p>
> By default, "unicode61" also removes all diacritics from Latin script
> characters. This behaviour can be overridden by adding the tokenizer argument
> "remove_diacritics=0". For example:
>
> <codeblock>
>   <i>-- Create tables that remove diacritics from Latin script characters</i>
>   <i>-- as part of tokenization.</i>
>   CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
>   CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");
>
>   <i>-- Create a table that does not remove diacritics from Latin script</i>
>   <i>-- characters as part of tokenization.</i>
>   CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
> </codeblock>
>
  <h2>Custom (User Implemented) Tokenizers</h2>

  <p>
  As well as the built-in "simple", "porter" and (possibly) "icu" and
  "unicode61" tokenizers, FTS exports an interface that allows users to
  implement custom tokenizers using C. The interface used to create a new
  tokenizer is defined and
︙ | ︙ |
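To see what the documented option changes in practice, the following is a minimal sketch using Python's standard sqlite3 module. It assumes the linked SQLite library was built with FTS4 and the unicode61 tokenizer. The helper name `matches_unaccented`, the table name `txt`, and the sample text are illustrative and not part of the check-in.

```python
import sqlite3


def matches_unaccented(tokenizer_args: str) -> bool:
    """Return True if searching for 'cafe' finds a row that contains 'café'.

    tokenizer_args is appended after "tokenize=unicode61", for example
    '"remove_diacritics=0"'. Hypothetical helper for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # With no column list, an fts4 table gets a single column named "content".
        conn.execute(
            f"CREATE VIRTUAL TABLE txt USING fts4(tokenize=unicode61 {tokenizer_args})"
        )
        # \u00e9 is the precomposed Latin small letter e with acute accent.
        conn.execute("INSERT INTO txt(content) VALUES (?)", ("caf\u00e9 au lait",))
        row = conn.execute(
            "SELECT count(*) FROM txt WHERE txt MATCH 'cafe'"
        ).fetchone()
        return row[0] == 1
    finally:
        conn.close()


if __name__ == "__main__":
    print("default:", matches_unaccented(""))
    print("remove_diacritics=1:", matches_unaccented('"remove_diacritics=1"'))
    print("remove_diacritics=0:", matches_unaccented('"remove_diacritics=0"'))
```

With the default setting and with "remove_diacritics=1", the unaccented query should match the accented text. With "remove_diacritics=0" the diacritic is kept in the indexed token, so the same query should find nothing.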