Documentation Source Text

Check-in [73a0dac584]

Overview
Comment: Fix a bug in the description of the 'simple' FTS tokenizer. Underscores (codepoint 95) are divider characters not token characters.
SHA1: 73a0dac5840af9808ae908dfa6029f127ffc0e62
User & Date: dan 2012-02-27 07:06:24.152
Context
2012-03-05 19:14  Documentation for the content= and languageid= options for FTS4. (check-in: 16b58c9eb4 user: drh tags: trunk)
2012-02-27 07:06  Fix a bug in the description of the 'simple' FTS tokenizer. Underscores (codepoint 95) are divider characters not token characters. (check-in: 73a0dac584 user: dan tags: trunk)
2012-02-23 14:40  Documentation of the SQLITE_FCNTL_PRAGMA file-control. Point out that disabling compound SELECT statements also disables multi-value INSERT. (check-in: cf86dcee73 user: drh tags: trunk)
Changes
Changes to pages/fts3.in (lines 1492-1509):

   VIRTUAL TABLE statement used to create the FTS table, the default
   tokenizer, "simple", is used. The simple tokenizer extracts tokens from
   a document or basic FTS full-text query according to the following
   rules:

 <ul>
   <li><p> A term is a contiguous sequence of eligible characters, where
-    eligible characters are all alphanumeric characters, the "_" character,
-    and all characters with UTF codepoints greater than or equal to 128.
-    All other characters are discarded when splitting a document into terms.
-    Their only contribution is to separate adjacent terms.
+    eligible characters are all alphanumeric characters and all characters with
+    UTF codepoints greater than or equal to 128. All other characters are
+    discarded when splitting a document into terms. Their only contribution is
+    to separate adjacent terms.

   <li><p> All uppercase characters within the ASCII range (UTF codepoints less
     than 128), are transformed to their lowercase equivalents as part of the
     tokenization process. Thus, full-text queries are case-insensitive when
     using the simple tokenizer.
 </ul>
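
To see the corrected rule in action, here is a small sketch using the default "simple" tokenizer. The table name, column, and data are invented for illustration and are not part of the check-in:

  -- "simple" is the default tokenizer when no tokenize= option is given.
  CREATE VIRTUAL TABLE pages USING fts4(body);

  -- The underscore (codepoint 95) is a divider, so this row is indexed as
  -- the tokens "alpha" and "beta"; "Gamma" is indexed as "gamma".
  INSERT INTO pages(body) VALUES('alpha_beta Gamma');

  SELECT * FROM pages WHERE pages MATCH 'alpha';  -- matches: "_" separates "alpha" from "beta"
  SELECT * FROM pages WHERE pages MATCH 'gamma';  -- matches: ASCII upper-case is folded to lower-case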