Documentation Source Text

Check-in [34f973966a]

Overview
Comment: Add documentation for the fts3/4/5 remove_diacritic options.
SHA3-256: 34f973966af70008ac21ee9df18e7bd2b497cc53be81025113ae90843f195e7a
User & Date: dan 2019-02-11 13:21:49
Context
2019-02-11
13:24 Fix "asterix" typo in fts5.in. check-in: 569262e571 user: dan tags: trunk
13:21 Add documentation for the fts3/4/5 remove_diacritic options. check-in: 34f973966a user: dan tags: trunk
2019-02-08
16:06 The change of removing deprecated PRAGMAs with SQLITE_OMIT_DEPRECATED was backed out, so remove it from the change log. check-in: 0ed8a559a1 user: drh tags: trunk
Changes

Changes to pages/fts3.in.

  2213   2213     Unicode61 works very much like "simple" except that it does simple unicode
  2214   2214     case folding according to rules in Unicode Version 6.1 and it recognizes
  2215   2215     unicode space and punctuation characters and uses those to separate tokens.
  2216   2216     The simple tokenizer only does case folding of ASCII characters and only
  2217   2217     recognizes ASCII space and punctuation characters as token separators.
  2218   2218   
  2219   2219   <p>
  2220         -  By default, "unicode61" also removes all diacritics from Latin script
         2220  +  By default, "unicode61" attempts to remove diacritics from Latin script
  2221   2221     characters. This behaviour can be overridden by adding the tokenizer argument
  2222   2222     "remove_diacritics=0". For example:
  2223   2223   
  2224   2224   <codeblock>
  2225         -    <i>-- Create tables that remove diacritics from Latin script characters</i>
         2225  +    <i>-- Create tables that remove <b>all</b>diacritics from Latin script characters</i>
  2226   2226       <i>-- as part of tokenization.</i>
  2227   2227       CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
  2228         -    CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");
         2228  +    CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=2");
  2229   2229   
  2230   2230       <i>-- Create a table that does not remove diacritics from Latin script</i>
  2231   2231       <i>-- characters as part of tokenization.</i>
  2232   2232       CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
  2233   2233   </codeblock>
         2234  +
         2235  +<p>The remove_diacritics option may be set to "0", "1" or "2". The default
         2236  +   value is "1".  If it is set to "1" or "2", then diacritics are removed from
         2237  +   Latin script characters as described above. However, if it is set to "1",
         2238  +   then diacritics are not removed in the fairly uncommon case where a single
         2239  +   unicode codepoint is used to represent a character with more that one
         2240  +   diacritic. For example, diacritics are not removed from codepoint 0x1ED9
         2241  +   ("LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW"). This is technically
         2242  +   a bug, but cannot be fixed without creating backwards compatibility
         2243  +   problems. If this option is set to "2", then diacritics are correctly
         2244  +   removed from all Latin characters.
  2234   2245   
  2235   2246   <p>
  2236   2247     It is also possible to customize the set of codepoints that unicode61 treats
  2237   2248     as separator characters. The "separators=" option may be used to specify one
  2238   2249     or more extra characters that should be treated as separator characters, and
  2239   2250     the "tokenchars=" option may be used to specify one or more extra characters
  2240   2251     that should be treated as part of tokens instead of as separator characters.
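
Reviewer note: the paragraph added to fts3.in above singles out codepoints that carry more than one diacritic, with 0x1ED9 as its example. The sketch below only illustrates what makes that codepoint different from a single-diacritic one. It uses Python's standard unicodedata module and does not call SQLite or reproduce its tokenizer; the helper names combining_marks and strip_marks are hypothetical and chosen for this illustration.

```python
import unicodedata


def combining_marks(ch):
    """Names of the combining marks in the canonical decomposition (NFD) of ch."""
    decomposed = unicodedata.normalize("NFD", ch)
    return [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]


def strip_marks(text):
    """Drop every combining mark after canonical decomposition.

    This is a generic Unicode technique, assumed here only as an analogy for
    what the diff calls removing diacritics; it is not the unicode61 code.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


single = "\u00f4"  # LATIN SMALL LETTER O WITH CIRCUMFLEX: one diacritic
double = "\u1ed9"  # the codepoint named in the diff: two diacritics

print(unicodedata.name(double))
print(len(combining_marks(single)), combining_marks(single))
print(len(combining_marks(double)), combining_marks(double))
print(strip_marks(double))
```

Per the text in this check-in, a codepoint like the second one is left untouched when remove_diacritics is "1" and is reduced to its base letter only when the option is "2".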

Changes to pages/fts5.in.

   585    585   <p> Any arguments following "unicode61" in the token specification are treated
   586    586   as a list of alternating option names and values. Unicode61 supports the
   587    587   following options:
   588    588   
   589    589   <table striped=1>
   590    590     <tr><th> Option <th> Usage
   591    591     <tr><td> remove_diacritics
   592         -  <td>This option should be set to "0" or "1". If it is set (the default),
   593         -  diacritics are removed from all latin script characters as described above.
   594         -  If it is clear, they are not. 
          592  +  <td>This option should be set to "0", "1" or "2". The default value is "1".
          593  +  If it is set to "1" or "2", then diacritics are removed from Latin script
          594  +  characters as described above. However, if it is set to "1", then diacritics
          595  +  are not removed in the fairly uncommon case where a single unicode codepoint
          596  +  is used to represent a character with more that one diacritic. For example,
          597  +  diacritics are not removed from codepoint 0x1ED9 ("LATIN SMALL LETTER O WITH
          598  +  CIRCUMFLEX AND DOT BELOW"). This is technically a bug, but cannot be fixed
          599  +  without creating backwards compatibility problems. If this option is set to
          600  +  "2", then diacritics are correctly removed from all Latin characters.
   595    601   
   596    602     <tr><td> categories
   597    603     <td>This option may be used to modify the set of Unicode general categories
   598    604     that are considered to correspond to token characters. The argument must
   599    605     consist of a space separated list of two-character general category
   600    606     abbreviations (e.g. "Lu" or "Nd"), or of the same with the second character
   601    607     replaced with an asterix ("*"), interpreted as a glob pattern. The default