Documentation Source Text

Check-in [e32e877119]

Overview
Comment: Add docs for fts5 unicode61 tokenizer option "categories".
SHA3-256: e32e87711917c5f124548bc46d956e599fa9e948b4c63016514183025aacff91
User & Date: dan 2018-07-13 20:31:10
Context
2018-07-23
10:51
Changes to information on support packages. check-in: 7e685f86a9 user: drh tags: trunk
2018-07-13
20:31
Add docs for fts5 unicode61 tokenizer option "categories". check-in: e32e877119 user: dan tags: trunk
2018-07-11
11:08
Update the keyword list with all of the new keywords added for UPSERT and window functions. check-in: 824b38d28e user: drh tags: trunk
Changes

Changes to pages/fts5.in.

   563    563   <p> It is also possible to create custom tokenizers for FTS5. The API for doing so is [custom tokenizers | described here].
   564    564   
   565    565   <h3>Unicode61 Tokenizer</h3>
   566    566   
   567    567   <p> The unicode tokenizer classifies all unicode characters as either 
   568    568   "separator" or "token" characters. By default all space and punctuation
   569    569   characters, as defined by Unicode 6.1, are considered separators, and all 
   570         -other characters as token characters. Each contiguous run of one or more 
   571         -token characters is considered to be a token. The tokenizer is case-insensitive
   572         -according to the rules defined by Unicode 6.1.
          570  +other characters as token characters. More specifically, all unicode 
          571  +characters assigned to a 
          572  +<a href=https://en.wikipedia.org/wiki/Unicode_character_property#General_Category>
          573  +general category</a> beginning with "L" or "N" (letters and numbers,
          574  +respectively) or to category "Co" ("other, private use") are considered tokens.
          575  +All other characters are separators.
          576  + 
          577  +<p>Each contiguous run of one or more token characters is considered to be a
          578  +token. The tokenizer is case-insensitive according to the rules defined by
          579  +Unicode 6.1.
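
The classification rule added in this hunk can be illustrated with a rough Python sketch. It is not the FTS5 implementation. The helper names `is_token_char` and `tokenize` are hypothetical, Python's `unicodedata` module follows a newer Unicode version than 6.1, and `str.lower()` only approximates the tokenizer's case folding. Diacritic removal is left out.

```python
import unicodedata


def is_token_char(ch):
    """Return True if ch is in a general category starting with L or N, or in Co.

    Hypothetical helper mirroring the default rule described in the text.
    """
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "N") or cat == "Co"


def tokenize(text):
    """Split text into lower-cased contiguous runs of token characters."""
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch):
            current.append(ch)
        elif current:
            # A separator ends the current run of token characters.
            tokens.append("".join(current).lower())
            current = []
    if current:
        tokens.append("".join(current).lower())
    return tokens


print(tokenize("Hello, World-2018!"))  # ['hello', 'world', '2018']
```

Under the default rule the hyphen is a separator (category "Pd"), so "World-2018" yields two tokens.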
   573    580   
   574    581   <p> By default, diacritics are removed from all Latin script characters. This
   575    582   means, for example, that "A", "a", "&#192;", "&#224;", "&#194;" and "&#226;"
   576    583   are all considered to be equivalent.
   577    584   
   578    585   <p> Any arguments following "unicode61" in the token specification are treated
   579    586   as a list of alternating option names and values. Unicode61 supports the
................................................................................
   582    589   <table striped=1>
   583    590     <tr><th> Option <th> Usage
   584    591     <tr><td> remove_diacritics
   585    592     <td>This option should be set to "0" or "1". If it is set (the default),
   586    593     diacritics are removed from all Latin script characters as described above.
   587    594     If it is clear, they are not. 
   588    595   
          596  +  <tr><td> categories
          597  +  <td>This option may be used to modify the set of Unicode general categories
          598  +  that are considered to correspond to token characters. The argument must
          599  +  consist of a space separated list of two-character general category
          600  +  abbreviations (e.g. "Lu" or "Nd"), or of the same with the second character
          601  +  replaced with an asterisk ("*"), interpreted as a glob pattern. The default
          602  +  value is "L* N* Co".
          603  +
   589    604     <tr><td> tokenchars
   590    605     <td> This option is used to specify additional unicode characters that 
   591    606     should be considered token characters, even if they are white-space or
   592    607     punctuation characters according to Unicode 6.1. All characters in the
   593    608     string that this option is set to are considered token characters.
   594    609   
   595    610     <tr><td> separators
................................................................................
   606    621     -- script characters, and that considers hyphens and underscore characters
   607    622     -- to be part of tokens. </i>
   608    623     CREATE VIRTUAL TABLE ft USING fts5(a, b, 
   609    624         tokenize = "unicode61 remove_diacritics 0 tokenchars '-_'"
   610    625     );
   611    626   </codeblock>
   612    627   
          628  +<p> or:
          629  +
          630  +<codeblock>
          631  +  <i>-- Create an FTS5 table that, as well as the default token character classes,</i>
          632  +  <i>-- considers characters in class "Mn" to be token characters.</i>
          633  +  CREATE VIRTUAL TABLE ft USING fts5(a, b, 
          634  +      tokenize = "unicode61 categories 'L* N* Co Mn'"
          635  +  );
          636  +</codeblock>
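
To show how a "categories" value such as 'L* N* Co Mn' might be interpreted, here is a small Python sketch. It is an illustration only: `parse_categories` and `is_token_char` are hypothetical names, the validation is an assumption about how strictly a parser might behave, and Python's `unicodedata` tables are newer than Unicode 6.1.

```python
import unicodedata


def parse_categories(spec):
    """Split a space-separated categories value into two-character patterns."""
    patterns = spec.split()
    for p in patterns:
        # Assumption: every entry is exactly two characters, e.g. "Lu" or "L*".
        if len(p) != 2:
            raise ValueError("bad category pattern: %r" % p)
    return patterns


def is_token_char(ch, patterns):
    """Return True if the general category of ch matches any pattern."""
    cat = unicodedata.category(ch)
    for p in patterns:
        if p == cat or (p[1] == "*" and p[0] == cat[0]):
            return True
    return False


default = parse_categories("L* N* Co")
extended = parse_categories("L* N* Co Mn")

combining_acute = "\u0301"  # COMBINING ACUTE ACCENT, general category "Mn"
print(is_token_char(combining_acute, default))   # False
print(is_token_char(combining_acute, extended))  # True
```

With "Mn" added, non-spacing marks are treated as token characters rather than as separators.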
          637  +
   613    638   <p> The fts5 unicode61 tokenizer is byte-for-byte compatible with the fts3/4
   614    639   unicode61 tokenizer.
   615    640   
   616    641   <h3>Ascii Tokenizer</h3>
   617    642   
   618    643   <p> The Ascii tokenizer is similar to the Unicode61 tokenizer, except that:
   619    644