Documentation Source Text

Check-in [4d11f30804]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Enhancements to the FTS3/4 documentation about the limitations of BOMs in string literals with the simple tokenizer in a UTF16 database.
Downloads: Tarball | ZIP archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA3-256: 4d11f30804f6b8ae8cd364732ce5a435eb3243f87003b8d52194ec99f3543951
User & Date: drh 2020-01-29 23:04:30.305
Context
2020-01-31
20:15
Attempt to clarify the CREATE TABLE documentation. In the generated column documentation, make it clear that SQLite deliberately ignores the collating sequence of the AS expression when determining the collating sequence of the generated column. (check-in: 91fa93da86 user: drh tags: trunk)
2020-01-29
23:04
Enhancements to the FTS3/4 documentation about the limitations of BOMs in string literals with the simple tokenizer in a UTF16 database. (check-in: 4d11f30804 user: drh tags: trunk)
21:29
Add the SQLITE_OMIT_AUTOINIT compile-time option to the set of recommended compile-time options. (check-in: f250d55692 user: drh tags: trunk)
Changes
Unified Diff Ignore Whitespace Patch
Changes to pages/fts3.in.
2796
2797
2798
2799
2800
2801
2802
2803

2804
2805
2806
2807
2808
2809
2810
2811
2812
2813
2814
2815
2816
2817
2818
2819
2820
2821
2822
2823
2824
2825
2826
2827
2828
2829
2830


2831
2832
2833











2834
2835
2836
2837
2838
2839
2840
  term-offset list to be omitted in this case.

<h1 tags="bugs">Limitations</h1>

<h2> UTF-16 byte-order-mark problem </h2>

For UTF-16 databases, when using the "simple" tokenizer, it is possible to use
malformed unicode strings to cause the integrity-check to falsely report

corruption, or for auxiliary functions to return incorrect results. More
specifically, the bug can be triggered by any of the following:

<ul>
  <li><p>A UTF-16 byte-order-mark is embedded at the beginning of an SQL string
       literal value inserted into an FTS3 table. For example:

<codeblock>
    INSERT INTO fts_table(col) VALUES('<b>{BOM}</b>text...');
</codeblock>
      <p>where {BOM} is a UTF-16 byte-order-mark, a 16-bit integer value 0xFFFE
      in either big or little endian format.

  <li><p>Malformed UTF-8 that SQLite converts to a UTF-16 byte-order-mark is
       embedded at the beginning of an SQL string literal value inserted 
       into an FTS3 table.

  <li><p>A text value created by casting a blob that begins with the two
       bytes 0xFF and 0xFE, in either possible order, is inserted into an
       FTS3 table. For example:
       
<codeblock>
    INSERT INTO fts_table(col) VALUES(CAST(X'FEFF' AS TEXT));
</codeblock>
</ul>

No problems occur if all unicode strings used with FTS3/4 are well-formed.


UTF-16 byte-order-marks may be safely used at the start of strings passed
to [sqlite3_bind_text16()], [sqlite3_prepare16()] and other similar APIs.













<h1 id=appendix_a nonumber tags="search application tips">
  Appendix A: Search Application Tips
</h1>

<p>
  FTS is primarily designed to support Boolean full-text queries - queries







|
>
|
|


|



|

<
<














|
>
>
|
|
|
>
>
>
>
>
>
>
>
>
>
>







2796
2797
2798
2799
2800
2801
2802
2803
2804
2805
2806
2807
2808
2809
2810
2811
2812
2813
2814


2815
2816
2817
2818
2819
2820
2821
2822
2823
2824
2825
2826
2827
2828
2829
2830
2831
2832
2833
2834
2835
2836
2837
2838
2839
2840
2841
2842
2843
2844
2845
2846
2847
2848
2849
2850
2851
2852
  term-offset list to be omitted in this case.

<h1 tags="bugs">Limitations</h1>

<h2> UTF-16 byte-order-mark problem </h2>

For UTF-16 databases, when using the "simple" tokenizer, it is possible to use
malformed unicode strings to cause the 
<a href="#integcheck">integrity-check special command</a> to falsely report
corruption, or for <a href="#snippet">auxiliary functions</a> to return
incorrect results. More specifically, the bug can be triggered by any of the following:

<ul>
  <li><p>A UTF-16 byte-order-mark (BOM) is embedded at the beginning of an SQL string
       literal value inserted into an FTS3 table. For example:

<codeblock>
    INSERT INTO fts_table(col) VALUES(char(0xfeff)||'text...');
</codeblock>



  <li><p>Malformed UTF-8 that SQLite converts to a UTF-16 byte-order-mark is
       embedded at the beginning of an SQL string literal value inserted 
       into an FTS3 table.

  <li><p>A text value created by casting a blob that begins with the two
       bytes 0xFF and 0xFE, in either possible order, is inserted into an
       FTS3 table. For example:
       
<codeblock>
    INSERT INTO fts_table(col) VALUES(CAST(X'FEFF' AS TEXT));
</codeblock>
</ul>

Everything works correctly if any of the following are true:
<ul>
<li> The [encoding|database encoding] is UTF-8.
<li> All text strings are insert using one of the
     [sqlite3_bind_text()] family of functions.
<li> Literal strings contain no byte-order-marks.
<li> A tokenizer is used that recognizes byte-order-marks
     as whitespace.  (The default "simple" tokenizer for
     FTS3/4 does not think that BOMs are whitespace, but
     the unicode tokenizer does.)
</ul>
All of the above conditions must be false in order for problems
to occur.  And even if all of the conditiona above are false,
most things will still operator correctly.  Only the
<a href="#integcheck">integrity-check</a> command and the
<a href="#snippet">auxiliary functions</a> might given
unexpected results.

<h1 id=appendix_a nonumber tags="search application tips">
  Appendix A: Search Application Tips
</h1>

<p>
  FTS is primarily designed to support Boolean full-text queries - queries
3097
3098
3099
3100
3101
3102
3103
3104
3105
  return;

<i>  /* Jump here if the wrong number of arguments are passed to this function */</i>
wrong_number_args:
  sqlite3_result_error(pCtx, "wrong number of arguments to function rank()", -1);
}
</codeblock>









<
<
3109
3110
3111
3112
3113
3114
3115


  return;

<i>  /* Jump here if the wrong number of arguments are passed to this function */</i>
wrong_number_args:
  sqlite3_result_error(pCtx, "wrong number of arguments to function rank()", -1);
}
</codeblock>