Documentation Source Text

Check-in [37a01760c6]

Overview
Comment: Fix the description of the case folding performed by the unicode61 tokenizer in FTS3.
SHA1: 37a01760c60712bd4441ba0d92896e04621ea3ee
User & Date: drh 2016-03-21 20:02:27
Context
2016-03-26 23:12  Update TH3 license information. Leaf check-in: 84f9b8afc2 user: drh tags: branch-3.11
2016-03-22 17:23  Merge fixes off of the 3.11 branch. check-in: ad0172e592 user: drh tags: trunk
2016-03-21 20:02  Fix the description of the case folding performed by the unicode61 tokenizer in FTS3. check-in: 37a01760c6 user: drh tags: branch-3.11
2016-03-18 14:56  Update the wal.html document at the bigwal anchor to reflect improvements to large transactions in WAL mode added to version 3.11.0. check-in: e273b7fdae user: drh tags: branch-3.11
Changes

Changes to pages/fts3.in.

  2200   2200     processing is required, for example to implement stemming or
  2201   2201     discard punctuation, this can be done by creating a tokenizer
  2202   2202     implementation that uses the ICU tokenizer as part of its implementation.
  2203   2203   
  2204   2204   <tcl>hd_fragment unicode61 unicode61</tcl>
  2205   2205   <p>
  2206   2206     The "unicode61" tokenizer is available beginning with SQLite [version 3.7.13].
  2207         -  Unicode61 works very much like "simple" except that it does full unicode
         2207  +  Unicode61 works very much like "simple" except that it does simple unicode
  2208   2208     case folding according to rules in Unicode Version 6.1 and it recognizes
  2209   2209     unicode space and punctuation characters and uses those to separate tokens.
  2210   2210     The simple tokenizer only does case folding of ASCII characters and only
  2211   2211     recognizes ASCII space and punctuation characters as token separators.
  2212   2212   
  2213   2213   <p>
  2214   2214     By default, "unicode61" also removes all diacritics from Latin script
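
The one-word change above, from "full" to "simple", refers to two kinds of Unicode case folding: simple folding maps each code point to exactly one code point, while full folding may expand one code point into several. The following is a minimal sketch of that distinction using only Python's standard library. It does not call SQLite or the unicode61 tokenizer. The helper names `full_fold` and `simple_fold_approx` are hypothetical, the simple variant is only an approximation (it leaves a character unchanged whenever its full folding is longer than one code point), and Python's Unicode tables are newer than Version 6.1.

```python
# Illustration only: contrasts one-to-one ("simple") with one-to-many
# ("full") case folding. This is not the unicode61 tokenizer's code.

def full_fold(text: str) -> str:
    """Full case folding: the result may be longer than the input."""
    return text.casefold()


def simple_fold_approx(text: str) -> str:
    """Approximate simple case folding, one code point at a time.

    Assumption: a character whose full folding expands to several code
    points is kept unchanged. Real simple folding data may instead map
    such a character to a different single code point.
    """
    out = []
    for ch in text:
        folded = ch.casefold()
        out.append(folded if len(folded) == 1 else ch)
    return "".join(out)


if __name__ == "__main__":
    for word in ("Straße", "STRASSE"):
        print(word, "->", simple_fold_approx(word), "/", full_fold(word))
```

Under full folding, "Straße" and "STRASSE" both become "strasse" and would compare equal. Under the one-to-one approximation, "ß" is left alone, so the two words stay distinct.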