Documentation Source Text

Check-in [34f973966a]

Overview
Comment: Add documentation for the fts3/4/5 remove_diacritic options.
SHA3-256: 34f973966af70008ac21ee9df18e7bd2b497cc53be81025113ae90843f195e7a
User & Date: dan 2019-02-11 13:21:49.239
Context
2019-02-11
13:24
Fix "asterix" typo in fts5.in. (check-in: 569262e571 user: dan tags: trunk)
13:21
Add documentation for the fts3/4/5 remove_diacritic options. (check-in: 34f973966a user: dan tags: trunk)
2019-02-08
16:06
The change of removing deprecated PRAGMAs with SQLITE_OMIT_DEPRECATED was backed out, so remove it from the change log. (check-in: 0ed8a559a1 user: drh tags: trunk)
Changes
Changes to pages/fts3.in.
Old version, lines 2213-2240:
  Unicode61 works very much like "simple" except that it does simple unicode
  case folding according to rules in Unicode Version 6.1 and it recognizes
  unicode space and punctuation characters and uses those to separate tokens.
  The simple tokenizer only does case folding of ASCII characters and only
  recognizes ASCII space and punctuation characters as token separators.

<p>
  By default, "unicode61" also removes all diacritics from Latin script
  characters. This behaviour can be overridden by adding the tokenizer argument
  "remove_diacritics=0". For example:

<codeblock>
    <i>-- Create tables that remove diacritics from Latin script characters</i>
    <i>-- as part of tokenization.</i>
    CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
    CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");

    <i>-- Create a table that does not remove diacritics from Latin script</i>
    <i>-- characters as part of tokenization.</i>
    CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
</codeblock>

<p>
  It is also possible to customize the set of codepoints that unicode61 treats
  as separator characters. The "separators=" option may be used to specify one
  or more extra characters that should be treated as separator characters, and
  the "tokenchars=" option may be used to specify one or more extra characters
  that should be treated as part of tokens instead of as separator characters.

New version, lines 2213-2251:
  Unicode61 works very much like "simple" except that it does simple unicode
  case folding according to rules in Unicode Version 6.1 and it recognizes
  unicode space and punctuation characters and uses those to separate tokens.
  The simple tokenizer only does case folding of ASCII characters and only
  recognizes ASCII space and punctuation characters as token separators.

<p>
  By default, "unicode61" attempts to remove diacritics from Latin script
  characters. This behaviour can be overridden by adding the tokenizer argument
  "remove_diacritics=0". For example:

<codeblock>
    <i>-- Create tables that remove <b>all</b> diacritics from Latin script characters</i>
    <i>-- as part of tokenization.</i>
    CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
    CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=2");

    <i>-- Create a table that does not remove diacritics from Latin script</i>
    <i>-- characters as part of tokenization.</i>
    CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
</codeblock>

<p>The remove_diacritics option may be set to "0", "1" or "2". The default
   value is "1".  If it is set to "1" or "2", then diacritics are removed from
   Latin script characters as described above. However, if it is set to "1",
   then diacritics are not removed in the fairly uncommon case where a single
   unicode codepoint is used to represent a character with more than one
   diacritic. For example, diacritics are not removed from codepoint 0x1ED9
   ("LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW"). This is technically
   a bug, but cannot be fixed without creating backwards compatibility
   problems. If this option is set to "2", then diacritics are correctly
   removed from all Latin characters.
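
The difference between the "1" and "2" settings can be observed with a short script. The following is a minimal sketch, not part of the documented text, assuming Python's standard `sqlite3` module is linked against an SQLite build that has FTS4 enabled and recognizes "remove_diacritics=2". The function name `diacritics_demo` and the table names `t1` and `t2` are hypothetical, chosen only for this illustration. On builds that reject the tokenizer arguments, the sketch returns `None` instead of failing.

```python
import sqlite3


def diacritics_demo():
    """Count matches for the plain letter 'o' against a row holding U+1ED9.

    Returns a pair (hits with remove_diacritics=1, hits with
    remove_diacritics=2), or None if this SQLite build lacks FTS4 or
    rejects the tokenizer arguments.
    """
    db = sqlite3.connect(":memory:")
    try:
        for level in (1, 2):
            # t1 / t2 are hypothetical table names used only in this sketch.
            db.execute(
                f"CREATE VIRTUAL TABLE t{level} USING fts4("
                f'body, tokenize=unicode61 "remove_diacritics={level}")'
            )
            # U+1ED9: LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW,
            # a single codepoint that carries two diacritics.
            db.execute(f"INSERT INTO t{level}(body) VALUES (?)", ("\u1ed9",))
        return tuple(
            db.execute(
                f"SELECT count(*) FROM t{level} WHERE body MATCH 'o'"
            ).fetchone()[0]
            for level in (1, 2)
        )
    except sqlite3.OperationalError:
        # Raised when FTS4 is missing or the option value is not recognized.
        return None
    finally:
        db.close()


if __name__ == "__main__":
    print(diacritics_demo())
```

If the text above is accurate, the table created with "1" keeps the codepoint intact, so a search for the plain letter finds nothing, while the table created with "2" indexes the stripped letter and the search finds the row.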

<p>
  It is also possible to customize the set of codepoints that unicode61 treats
  as separator characters. The "separators=" option may be used to specify one
  or more extra characters that should be treated as separator characters, and
  the "tokenchars=" option may be used to specify one or more extra characters
  that should be treated as part of tokens instead of as separator characters.
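
The effect of the two options can be seen by indexing the same text with and without them. This is a minimal sketch under the same assumptions as any `sqlite3` script (FTS4 compiled in, unicode61 available). The function name `separators_demo`, the table names `plain` and `custom`, and the sample text are hypothetical. Here "X" is added as a separator and "." is added as a token character.

```python
import sqlite3


def separators_demo():
    """Compare default unicode61 tokenization with customized separators.

    Returns {table: {term: match_count}}, or None if this SQLite build
    lacks FTS4 or the unicode61 tokenizer.
    """
    db = sqlite3.connect(":memory:")
    try:
        # Hypothetical table names: "plain" uses defaults, "custom" treats
        # "." as part of tokens and capital "X" as a separator.
        db.execute(
            "CREATE VIRTUAL TABLE plain USING fts4(body, tokenize=unicode61)"
        )
        db.execute(
            "CREATE VIRTUAL TABLE custom USING fts4("
            'body, tokenize=unicode61 "tokenchars=." "separators=X")'
        )
        counts = {}
        for table in ("plain", "custom"):
            db.execute(
                f"INSERT INTO {table}(body) VALUES (?)", ("alphaXbeta v1.2",)
            )
            counts[table] = {
                term: db.execute(
                    f"SELECT count(*) FROM {table} WHERE body MATCH ?",
                    (term,),
                ).fetchone()[0]
                for term in ("beta", "2")
            }
        return counts
    except sqlite3.OperationalError:
        # Raised when FTS4 or the unicode61 tokenizer is unavailable.
        return None
    finally:
        db.close()


if __name__ == "__main__":
    print(separators_demo())
```

With the defaults, "alphaXbeta" is a single token and "v1.2" splits at the period, so only "2" matches. With the custom options, "X" splits the first word and the period stays inside "v1.2", so only "beta" matches.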
Changes to pages/fts5.in.
Old version, lines 585-601:
<p> Any arguments following "unicode61" in the token specification are treated
as a list of alternating option names and values. Unicode61 supports the
following options:

<table striped=1>
  <tr><th> Option <th> Usage
  <tr><td> remove_diacritics
  <td>This option should be set to "0" or "1". If it is set (the default),
  diacritics are removed from all latin script characters as described above.
  If it is clear, they are not.

  <tr><td> categories
  <td>This option may be used to modify the set of Unicode general categories
  that are considered to correspond to token characters. The argument must
  consist of a space separated list of two-character general category
  abbreviations (e.g. "Lu" or "Nd"), or of the same with the second character
  replaced with an asterix ("*"), interpreted as a glob pattern. The default

New version, lines 585-607:
<p> Any arguments following "unicode61" in the token specification are treated
as a list of alternating option names and values. Unicode61 supports the
following options:

<table striped=1>
  <tr><th> Option <th> Usage
  <tr><td> remove_diacritics
  <td>This option should be set to "0", "1" or "2". The default value is "1".
  If it is set to "1" or "2", then diacritics are removed from Latin script
  characters as described above. However, if it is set to "1", then diacritics
  are not removed in the fairly uncommon case where a single unicode codepoint
  is used to represent a character with more than one diacritic. For example,
  diacritics are not removed from codepoint 0x1ED9 ("LATIN SMALL LETTER O WITH
  CIRCUMFLEX AND DOT BELOW"). This is technically a bug, but cannot be fixed
  without creating backwards compatibility problems. If this option is set to
  "2", then diacritics are correctly removed from all Latin characters.

  <tr><td> categories
  <td>This option may be used to modify the set of Unicode general categories
  that are considered to correspond to token characters. The argument must
  consist of a space separated list of two-character general category
  abbreviations (e.g. "Lu" or "Nd"), or of the same with the second character
  replaced with an asterix ("*"), interpreted as a glob pattern. The default