Overview
Comment: Add documentation for the fts3/4/5 remove_diacritic options.
SHA3-256: 34f973966af70008ac21ee9df18e7bd2
User & Date: dan 2019-02-11 13:21:49.239
Context
2019-02-11
13:24 Fix "asterix" typo in fts5.in. (check-in: 569262e571 user: dan tags: trunk)
13:21 Add documentation for the fts3/4/5 remove_diacritic options. (check-in: 34f973966a user: dan tags: trunk)
2019-02-08
16:06 The change of removing deprecated PRAGMAs with SQLITE_OMIT_DEPRECATED was backed out, so remove it from the change log. (check-in: 0ed8a559a1 user: drh tags: trunk)
Changes
Changes to pages/fts3.in.
︙ | ︙ | |||
Unicode61 works very much like "simple" except that it does simple unicode
case folding according to rules in Unicode Version 6.1 and it recognizes
unicode space and punctuation characters and uses those to separate tokens.
The simple tokenizer only does case folding of ASCII characters and only
recognizes ASCII space and punctuation characters as token separators.

<p>
By default, "unicode61" attempts to remove diacritics from Latin script
characters. This behaviour can be overridden by adding the tokenizer argument
"remove_diacritics=0". For example:

<codeblock>
  <i>-- Create tables that remove <b>all</b>diacritics from Latin script characters</i>
  <i>-- as part of tokenization.</i>
  CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
  CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=2");

  <i>-- Create a table that does not remove diacritics from Latin script</i>
  <i>-- characters as part of tokenization.</i>
  CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
</codeblock>

<p>The remove_diacritics option may be set to "0", "1" or "2". The default
value is "1". If it is set to "1" or "2", then diacritics are removed from
Latin script characters as described above. However, if it is set to "1",
then diacritics are not removed in the fairly uncommon case where a single
unicode codepoint is used to represent a character with more that one
diacritic. For example, diacritics are not removed from codepoint 0x1ED9
("LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW"). This is technically
a bug, but cannot be fixed without creating backwards compatibility problems.
If this option is set to "2", then diacritics are correctly removed from all
Latin characters.

<p>
It is also possible to customize the set of codepoints that unicode61 treats
as separator characters. The "separators=" option may be used to specify one
or more extra characters that should be treated as separator characters, and
the "tokenchars=" option may be used to specify one or more extra characters
that should be treated as part of tokens instead of as separator characters.
︙ | ︙ |
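The distinction drawn above between a codepoint carrying one diacritic and one carrying several can be made concrete with a short sketch. The Python below uses the standard `unicodedata` module to decompose a character and drop its combining marks. It is only a rough stand-in for the idea of diacritic removal, not SQLite's own implementation, and the helper names `strip_diacritics` and `combining_marks` are hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose to NFD and drop combining marks.

    Illustration only: SQLite's unicode61 tokenizer uses its own
    lookup tables, which this sketch does not attempt to reproduce.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def combining_marks(ch):
    """Return the Unicode names of the combining marks in a character's NFD form."""
    decomposed = unicodedata.normalize("NFD", ch)
    return [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]


single = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE: one diacritic
double = "\u1ed9"  # LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW: two

for ch in (single, double):
    print(repr(ch), "->", strip_diacritics(ch), combining_marks(ch))
```

Codepoints such as 0x1ED9, whose decomposition contains more than one combining mark, are the case the documentation says is left untouched under "remove_diacritics=1" and handled under "remove_diacritics=2".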
Changes to pages/fts5.in.
︙ | ︙ | |||
<p> Any arguments following "unicode61" in the token specification are treated
as a list of alternating option names and values. Unicode61 supports the
following options:

<table striped=1>
  <tr><th> Option <th> Usage
  <tr><td> remove_diacritics
  <td>This option should be set to "0", "1" or "2". The default value is "1".
  If it is set to "1" or "2", then diacritics are removed from Latin script
  characters as described above. However, if it is set to "1", then diacritics
  are not removed in the fairly uncommon case where a single unicode codepoint
  is used to represent a character with more that one diacritic. For example,
  diacritics are not removed from codepoint 0x1ED9 ("LATIN SMALL LETTER O WITH
  CIRCUMFLEX AND DOT BELOW"). This is technically a bug, but cannot be fixed
  without creating backwards compatibility problems. If this option is set to
  "2", then diacritics are correctly removed from all Latin characters.
  <tr><td> categories
  <td>This option may be used to modify the set of Unicode general categories
  that are considered to correspond to token characters. The argument must
  consist of a space separated list of two-character general category
  abbreviations (e.g. "Lu" or "Nd"), or of the same with the second character
  replaced with an asterix ("*"), interpreted as a glob pattern. The default
︙ | ︙ |
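One way to observe the three settings described in the table is to build one FTS5 table per value and query each for the plain letter "o" after inserting codepoint 0x1ED9. The sketch below does this with Python's standard `sqlite3` module. It assumes the linked SQLite library was compiled with FTS5 and is at least version 3.27.0 (the release that, to my understanding, introduced "remove_diacritics 2"); when either assumption fails it returns `None` instead of raising. The function name `diacritic_matches` and the table names are hypothetical.

```python
import sqlite3


def diacritic_matches(levels=(0, 1, 2)):
    """Map each remove_diacritics level to whether the query 'o' matches U+1ED9.

    Returns None when the linked SQLite cannot run the demo
    (assumed causes: a library older than 3.27.0, or no FTS5 support).
    """
    if sqlite3.sqlite_version_info < (3, 27, 0):
        return None
    conn = sqlite3.connect(":memory:")
    results = {}
    try:
        for level in levels:
            table = f"demo{level}"  # hypothetical table name
            conn.execute(
                f"CREATE VIRTUAL TABLE {table} USING fts5("
                f"body, tokenize = 'unicode61 remove_diacritics {level}')"
            )
            # Store the single two-diacritic codepoint as the document text.
            conn.execute(f"INSERT INTO {table}(body) VALUES (?)", ("\u1ed9",))
            (count,) = conn.execute(
                f"SELECT count(*) FROM {table} WHERE {table} MATCH ?", ("o",)
            ).fetchone()
            results[level] = count > 0
    except sqlite3.OperationalError:
        return None  # for example, FTS5 is not compiled into this build
    finally:
        conn.close()
    return results


print(diacritic_matches())
```

Going by the option description above, only the table created with "remove_diacritics 2" should report a match for this codepoint. A character with a single diacritic, such as 0x00E9, would be expected to match its base letter under both "1" and "2".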