Overview
Comment: Add documentation for the fts4 unicode61 "tokenchars" and "separators" options.
SHA1: 00c7833d650959518a9bde013aaae6b9
User & Date: dan 2013-09-13 14:52:34.536
Context
2013-09-13
22:03 Add documentation for unlikely(), likelihood() and the soft_heap_limit pragma. (check-in: dc4eae98aa user: drh tags: trunk)
14:52 Add documentation for the fts4 unicode61 "tokenchars" and "separators" options. (check-in: 00c7833d65 user: dan tags: trunk)
2013-09-02
14:24 Merge change-log enhancements from the 3.8.0 branch into trunk. (check-in: 265fde6eac user: drh tags: trunk)
Changes
Changes to pages/fts3.in.
︙ | ︙ | |||
<codeblock>
  <i>-- Create tables that remove diacritics from Latin script characters</i>
  <i>-- as part of tokenization.</i>
  CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
  CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");

  <i>-- Create a table that does not remove diacritics from Latin script</i>
  <i>-- characters as part of tokenization.</i>
  CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
</codeblock>

<p>
It is also possible to customize the set of codepoints that unicode61 treats
as separator characters. The "separators=" option may be used to specify one
or more extra characters that should be treated as separator characters, and
the "tokenchars=" option may be used to specify one or more extra characters
that should be treated as part of tokens instead of as separator characters.
For example:

<codeblock>
  <i>-- Create a table that uses the unicode61 tokenizer, but considers "."</i>
  <i>-- and "=" characters to be part of tokens, and capital "X" characters to</i>
  <i>-- function as separators.</i>
  CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "tokenchars=.=" "separators=X");

  <i>-- Create a table that considers space characters (codepoint 32) to be</i>
  <i>-- a token character</i>
  CREATE VIRTUAL TABLE txt4 USING fts4(tokenize=unicode61 "tokenchars= ");
</codeblock>

<p>
If a character specified as part of the argument to "tokenchars=" is considered
to be a token character by default, it is ignored. This is true even if it has
been marked as a separator by an earlier "separators=" option. Similarly, if
a character specified as part of a "separators=" option is treated as a
separator character by default, it is ignored. If multiple "tokenchars=" or
"separators=" options are specified, all are processed. For example:

<codeblock>
  <i>-- Create a table that uses the unicode61 tokenizer, but considers "."</i>
  <i>-- and "=" characters to be part of tokens, and capital "X" characters to</i>
  <i>-- function as separators. Both of the "tokenchars=" options are processed.</i>
  <i>-- The "separators=" option ignores the "." passed to it, as "." is by</i>
  <i>-- default a separator character, even though it has been marked as a token</i>
  <i>-- character by an earlier "tokenchars=" option.</i>
  CREATE VIRTUAL TABLE txt5 USING fts4(
      tokenize=unicode61 "tokenchars=." "separators=X." "tokenchars=="
  );
</codeblock>

<p>
The arguments passed to the "tokenchars=" or "separators=" options are
case-sensitive. In the example above, specifying that "X" is a separator
character does not affect the way "x" is handled.

<h2>Custom (User Implemented) Tokenizers</h2>

<p>
  As well as the built-in "simple", "porter" and (possibly) "icu" and
  "unicode61" tokenizers, FTS exports an interface that allows users to
  implement custom tokenizers using C. The interface used to create a new
  tokenizer is defined and
︙ | ︙ |
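As a rough way to see the documented "tokenchars=" and "separators=" behaviour in practice, the sketch below indexes the same text with the default unicode61 tokenizer and with a customized one, then compares which queries match. It is only an illustration, not part of the check-in: it assumes Python's bundled SQLite was built with FTS4 and a unicode61 tokenizer recent enough to accept these options, and the helper name `count_matches`, the table name `t`, and the column name `body` are all hypothetical.

```python
import sqlite3

# Tokenizer specifications, written exactly as they would appear after "tokenize=".
DEFAULT = 'unicode61'
CUSTOM = 'unicode61 "tokenchars=." "separators=X"'


def count_matches(tokenize_spec, text, query):
    """Index `text` in a throwaway FTS4 table and count rows matching `query`.

    Hypothetical helper for illustration. `tokenize_spec` is spliced into the
    DDL verbatim, so only trusted strings should be passed.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            f"CREATE VIRTUAL TABLE t USING fts4(body, tokenize={tokenize_spec})"
        )
        conn.execute("INSERT INTO t(body) VALUES (?)", (text,))
        row = conn.execute(
            "SELECT count(*) FROM t WHERE t MATCH ?", (query,)
        ).fetchone()
        return row[0]
    finally:
        conn.close()


if __name__ == "__main__":
    text = "example.com fooXbar"
    for spec in (DEFAULT, CUSTOM):
        # "com" is its own token only while "." still acts as a separator;
        # "foo" is its own token only once capital "X" is made a separator.
        print(spec, count_matches(spec, text, "com"), count_matches(spec, text, "foo"))
```

With the default tokenizer, "." splits "example.com" into two tokens while "fooXbar" stays a single token; with the customized tokenizer the situation is reversed, which is why the two queries swap between matching and not matching.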