Documentation Source Text

Check-in [3cc1ff72a4]

Overview
Comment: Update fts5.html to note that the unicode61 tokenizer is compatible with the fts3 tokenizer of the same name.
SHA1: 3cc1ff72a41ca47fb4324676c8ab5f3aac201222
User & Date: dan 2016-01-15 08:33:20.421
Context
2016-01-25
13:55
Merge 3.10.2 changes. Add the SQLITE_EXTRA_DURABLE=1 compile-time option. (check-in: d39c6c7cfc user: drh tags: trunk)
2016-01-15
08:33
Update fts5.html to note that the unicode61 tokenizer is compatible with the fts3 tokenizer of the same name. (check-in: 3cc1ff72a4 user: dan tags: trunk)
2016-01-14
18:26
Merge updates from the 3.10 branch. (check-in: 885e891911 user: drh tags: trunk)
Changes
Changes to pages/fts5.in.
  <i>-- Create an FTS5 table that does not remove diacritics from Latin
  -- script characters, and that considers hyphens and underscore characters
  -- to be part of tokens. </i>
  CREATE VIRTUAL TABLE ft USING fts5(a, b, 
      tokenize = "unicode61 remove_diacritics 0 tokenchars '-_'"
  );
</codeblock>




<h3>Ascii Tokenizer</h3>

<p> The Ascii tokenizer is similar to the Unicode61 tokenizer, except that:

<ul>
  <li> All non-ASCII characters (those with codepoints greater than 127) are







  <i>-- Create an FTS5 table that does not remove diacritics from Latin
  -- script characters, and that considers hyphens and underscore characters
  -- to be part of tokens. </i>
  CREATE VIRTUAL TABLE ft USING fts5(a, b, 
      tokenize = "unicode61 remove_diacritics 0 tokenchars '-_'"
  );
</codeblock>
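
<p> The effect of the options in the statement above can be checked with a
minimal sketch using Python's standard sqlite3 module. The helper names
(match_count, tokenizer_demo) and the sample row are hypothetical, chosen only
for illustration, and the sketch assumes the underlying SQLite library was
built with FTS5; if it was not, the function returns None.

```python
import sqlite3


def match_count(conn, query):
    """Return how many rows of ft match the given FTS5 query string."""
    row = conn.execute(
        "SELECT count(*) FROM ft WHERE ft MATCH ?", (query,)
    ).fetchone()
    return row[0]


def tokenizer_demo():
    """Build the table from the example above and probe how text is tokenized."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            """CREATE VIRTUAL TABLE ft USING fts5(a, b,
                   tokenize = "unicode61 remove_diacritics 0 tokenchars '-_'")"""
        )
    except sqlite3.OperationalError:
        # Assumed here to mean that FTS5 is not compiled into this SQLite build.
        conn.close()
        return None

    # Hypothetical sample row: a hyphenated word, an accented word, and an
    # identifier containing an underscore.
    conn.execute(
        "INSERT INTO ft(a, b) VALUES (?, ?)",
        ("well-known caf\u00e9", "snake_case"),
    )

    queries = [
        '"well-known"',  # hyphen is a token character: matches as one token
        "well",          # not a token on its own, so no match
        "snake_case",    # underscore is a token character: one token
        "snake",         # not a token on its own, so no match
        "caf\u00e9",     # diacritic preserved in the index: matches
        "cafe",          # diacritics are not removed, so no match
    ]
    results = {q: match_count(conn, q) for q in queries}
    conn.close()
    return results


if __name__ == "__main__":
    print(tokenizer_demo())
```

<p> The hyphenated query is wrapped in double quotes because a hyphen is not
allowed in an unquoted FTS5 query term; the quoted string is passed through
the same tokenizer and so becomes a single token.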

<p> The fts5 unicode61 tokenizer is byte-for-byte compatible with the fts3/4
unicode61 tokenizer.

<h3>Ascii Tokenizer</h3>

<p> The Ascii tokenizer is similar to the Unicode61 tokenizer, except that:

<ul>
  <li> All non-ASCII characters (those with codepoints greater than 127) are