Overview
Comment: Add docs for fts5 unicode61 tokenizer option "categories".
SHA3-256: e32e87711917c5f124548bc46d956e59
User & Date: dan 2018-07-13 20:31:10.656
Context
2018-07-23
10:51 Changes to information on support packages. (check-in: 7e685f86a9 user: drh tags: trunk)
2018-07-13
20:31 Add docs for fts5 unicode61 tokenizer option "categories". (check-in: e32e877119 user: dan tags: trunk)
2018-07-11
11:08 Update the keyword list with all of the new keywords added for UPSERT and window functions. (check-in: 824b38d28e user: drh tags: trunk)
Changes
Changes to pages/fts5.in.
︙
<p> It is also possible to create custom tokenizers for FTS5. The API for doing
so is [custom tokenizers | described here].

<h3>Unicode61 Tokenizer</h3>

<p> The unicode tokenizer classifies all unicode characters as either
"separator" or "token" characters. By default all space and punctuation
characters, as defined by Unicode 6.1, are considered separators, and all
other characters as token characters. More specifically, all unicode
characters assigned to a
<a href=https://en.wikipedia.org/wiki/Unicode_character_property#General_Category>
general category</a> beginning with "L" or "N" (letters and numbers,
specifically) or to category "Co" ("other, private use") are considered tokens.
All other characters are separators.

<p>Each contiguous run of one or more token characters is considered to be a
token. The tokenizer is case-insensitive according to the rules defined by
Unicode 6.1.

<p> By default, diacritics are removed from all Latin script characters. This
means, for example, that "A", "a", "À", "à", "Â" and "â"
are all considered to be equivalent.

<p> Any arguments following "unicode61" in the token specification are treated
as a list of alternating option names and values. Unicode61 supports the
following options:

<table striped=1>
  <tr><th> Option <th> Usage
  <tr><td> remove_diacritics
  <td>This option should be set to "0" or "1". If it is set (the default),
  diacritics are removed from all latin script characters as described above.
  If it is clear, they are not.

  <tr><td> categories
  <td>This option may be used to modify the set of Unicode general categories
  that are considered to correspond to token characters. The argument must
  consist of a space separated list of two-character general category
  abbreviations (e.g. "Lu" or "Nd"), or of the same with the second character
  replaced with an asterisk ("*"), interpreted as a glob pattern. The default
  value is "L* N* Co".

  <tr><td> tokenchars
  <td> This option is used to specify additional unicode characters that
  should be considered token characters, even if they are white-space or
  punctuation characters according to Unicode 6.1. All characters in the
  string that this option is set to are considered token characters.

  <tr><td> separators
︙
-- script characters, and that considers hyphens and underscore characters
-- to be part of tokens. </i>
CREATE VIRTUAL TABLE ft USING fts5(a, b,
    tokenize = "unicode61 remove_diacritics 0 tokenchars '-_'"
);
</codeblock>

<p> or:

<codeblock>
<i>-- Create an FTS5 table that, as well as the default token character classes,</i>
<i>-- considers characters in class "Mn" to be token characters.</i>
CREATE VIRTUAL TABLE ft USING fts5(a, b,
    tokenize = "unicode61 categories 'L* N* Co Mn'"
);
</codeblock>

<p> The fts5 unicode61 tokenizer is byte-for-byte compatible with the fts3/4
unicode61 tokenizer.

<h3>Ascii Tokenizer</h3>

<p> The Ascii tokenizer is similar to the Unicode61 tokenizer, except that:
︙
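
The "categories" text added in this check-in describes a classification rule: a character is a token character if its Unicode general category matches one of the space-separated patterns, where a trailing "*" acts as a glob. A rough Python sketch of that rule follows. It is an illustration only, not FTS5 code: `is_token_char` and `split_tokens` are hypothetical helper names, Python's `unicodedata` tables track a newer Unicode version than 6.1, and the sketch deliberately ignores case folding, diacritic removal, and the tokenchars/separators options.

```python
import fnmatch
import unicodedata

# Default pattern list quoted from the documentation text in this check-in.
DEFAULT_CATEGORIES = "L* N* Co"


def is_token_char(ch, categories=DEFAULT_CATEGORIES):
    """Return True if ch's general category matches any pattern in categories.

    Each pattern is a two-character abbreviation such as "Lu", or the same
    with the second character replaced by "*" (treated here as a glob).
    """
    cat = unicodedata.category(ch)
    return any(fnmatch.fnmatchcase(cat, pat) for pat in categories.split())


def split_tokens(text, categories=DEFAULT_CATEGORIES):
    """Split text into contiguous runs of token characters.

    Assumption: this models only the separator/token split. The real
    tokenizer also folds case and handles diacritics, which is omitted here.
    """
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch, categories):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    # "-" (Pd) and "_" (Pc) are punctuation, so they separate tokens by default.
    print(split_tokens("foo-bar baz_9"))
    # "$" is in category Sc; adding "Sc" to the list keeps it inside the token.
    print(split_tokens("price $100"))
    print(split_tokens("price $100", "L* N* Co Sc"))
```

Under these assumptions, widening the pattern list (for example with "Sc", or with "Mn" as in the second CREATE VIRTUAL TABLE example) only changes which characters extend a token; everything else in the list keeps its default meaning.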