Documentation Source Text

Check-in [e32e877119]

Overview
Comment: Add docs for fts5 unicode61 tokenizer option "categories".
SHA3-256: e32e87711917c5f124548bc46d956e599fa9e948b4c63016514183025aacff91
User & Date: dan 2018-07-13 20:31:10.656
Context
2018-07-23 10:51 Changes to information on support packages. (check-in: 7e685f86a9 user: drh tags: trunk)
2018-07-13 20:31 Add docs for fts5 unicode61 tokenizer option "categories". (check-in: e32e877119 user: dan tags: trunk)
2018-07-11 11:08 Update the keyword list with all of the new keywords added for UPSERT and window functions. (check-in: 824b38d28e user: drh tags: trunk)
Changes
Changes to pages/fts5.in.
<p> It is also possible to create custom tokenizers for FTS5. The API for doing so is [custom tokenizers | described here].

<h3>Unicode61 Tokenizer</h3>

<p> The unicode tokenizer classifies all unicode characters as either 
"separator" or "token" characters. By default all space and punctuation
characters, as defined by Unicode 6.1, are considered separators, and all 
other characters as token characters. Each contiguous run of one or more 
token characters is considered to be a token. The tokenizer is case-insensitive
according to the rules defined by Unicode 6.1.


<p> By default, diacritics are removed from all Latin script characters. This
means, for example, that "A", "a", "&#192;", "&#224;", "&#194;" and "&#226;"
are all considered to be equivalent.

<p> Any arguments following "unicode61" in the token specification are treated
as a list of alternating option names and values. Unicode61 supports the
following options:

<table striped=1>
  <tr><th> Option <th> Usage
  <tr><td> remove_diacritics
  <td>This option should be set to "0" or "1". If it is set (the default),
  diacritics are removed from all Latin script characters as described above.
  If it is clear, they are not. 









  <tr><td> tokenchars
  <td> This option is used to specify additional unicode characters that 
  should be considered token characters, even if they are white-space or
  punctuation characters according to Unicode 6.1. All characters in the
  string that this option is set to are considered token characters.

  <tr><td> separators







<p> It is also possible to create custom tokenizers for FTS5. The API for doing so is [custom tokenizers | described here].

<h3>Unicode61 Tokenizer</h3>

<p> The unicode tokenizer classifies all unicode characters as either 
"separator" or "token" characters. By default all space and punctuation
characters, as defined by Unicode 6.1, are considered separators, and all 
other characters as token characters. More specifically, all unicode 
characters assigned to a 
<a href=https://en.wikipedia.org/wiki/Unicode_character_property#General_Category>
general category</a> beginning with "L" or "N" (letters and numbers,
specifically) or to category "Co" ("other, private use") are considered token
characters. All other characters are separators.
 
<p>Each contiguous run of one or more token characters is considered to be a
token. The tokenizer is case-insensitive according to the rules defined by
Unicode 6.1.
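
<p> The classification rule above can be illustrated with a short Python
sketch. This is not the FTS5 implementation: the names
<code>is_token_char</code> and <code>tokenize</code> are hypothetical, Python's
<code>unicodedata</code> module reflects whichever Unicode version ships with
the interpreter rather than Unicode 6.1, <code>str.lower()</code> only
approximates the tokenizer's case folding, and diacritic removal is not
modelled at all.

```python
import unicodedata
from fnmatch import fnmatchcase

# Default token categories as described in the text: letters, numbers and
# private-use characters. The pattern syntax here is an illustration only.
DEFAULT_CATEGORIES = "L* N* Co"


def is_token_char(ch, categories=DEFAULT_CATEGORIES):
    """Return True if the general category of ch matches any pattern."""
    cat = unicodedata.category(ch)  # two-letter code such as "Lu" or "Nd"
    return any(fnmatchcase(cat, pattern) for pattern in categories.split())


def tokenize(text, categories=DEFAULT_CATEGORIES):
    """Split text into lower-cased runs of token characters."""
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch, categories):
            current.append(ch)
        elif current:
            # A separator ends the current run of token characters.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    # Approximation of case-insensitivity; the real tokenizer follows
    # the Unicode 6.1 rules.
    return [token.lower() for token in tokens]


if __name__ == "__main__":
    print(tokenize("Hello, World-2018!"))
    print(tokenize("cafe\u0301 au lait"))
    print(tokenize("cafe\u0301 au lait", "L* N* Co Mn"))
```

<p> In this sketch punctuation, white-space and combining marks (category
"Mn") all end a token under the default categories, while passing
"L* N* Co Mn" keeps a combining mark attached to the preceding letters.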

<p> By default, diacritics are removed from all Latin script characters. This
means, for example, that "A", "a", "&#192;", "&#224;", "&#194;" and "&#226;"
are all considered to be equivalent.

<p> Any arguments following "unicode61" in the token specification are treated
as a list of alternating option names and values. Unicode61 supports the
following options:

<table striped=1>
  <tr><th> Option <th> Usage
  <tr><td> remove_diacritics
  <td>This option should be set to "0" or "1". If it is set (the default),
  diacritics are removed from all Latin script characters as described above.
  If it is clear, they are not. 

  <tr><td> categories
  <td>This option may be used to modify the set of Unicode general categories
  that are considered to correspond to token characters. The argument must
  consist of a space-separated list of two-character general category
  abbreviations (e.g. "Lu" or "Nd"), or of the same with the second character
  replaced with an asterisk ("*"), interpreted as a glob pattern. The default
  value is "L* N* Co".

  <tr><td> tokenchars
  <td> This option is used to specify additional unicode characters that 
  should be considered token characters, even if they are white-space or
  punctuation characters according to Unicode 6.1. All characters in the
  string that this option is set to are considered token characters.

  <tr><td> separators
  -- script characters, and that considers hyphens and underscore characters
  -- to be part of tokens. </i>
  CREATE VIRTUAL TABLE ft USING fts5(a, b, 
      tokenize = "unicode61 remove_diacritics 0 tokenchars '-_'"
  );
</codeblock>











<p> The fts5 unicode61 tokenizer is byte-for-byte compatible with the fts3/4
unicode61 tokenizer.

<h3>Ascii Tokenizer</h3>

<p> The Ascii tokenizer is similar to the Unicode61 tokenizer, except that:








  -- script characters, and that considers hyphens and underscore characters
  -- to be part of tokens. </i>
  CREATE VIRTUAL TABLE ft USING fts5(a, b, 
      tokenize = "unicode61 remove_diacritics 0 tokenchars '-_'"
  );
</codeblock>

<p> or:

<codeblock>
  <i>-- Create an FTS5 table that, as well as the default token character classes,</i>
  <i>-- considers characters in class "Mn" to be token characters.</i>
  CREATE VIRTUAL TABLE ft USING fts5(a, b, 
      tokenize = "unicode61 categories 'L* N* Co Mn'"
  );
</codeblock>

<p> The fts5 unicode61 tokenizer is byte-for-byte compatible with the fts3/4
unicode61 tokenizer.

<h3>Ascii Tokenizer</h3>

<p> The Ascii tokenizer is similar to the Unicode61 tokenizer, except that: