|Title:||FTS4 integrity-check results in "database disk image is malformed" for UTF-16 encoding|
|Last Modified:||2020-01-29 23:19:04|
|Version Found In:||3.30.0|
mrigger added on 2019-12-27 11:08:03:
Consider the following test case:
PRAGMA encoding = 'UTF-16'; CREATE VIRTUAL TABLE vt0 USING fts4(c0); INSERT INTO vt0 VALUES (''); INSERT INTO vt0(vt0) VALUES('integrity-check'); -- database disk image is malformed
Unexpectedly, the integrity check indicates an error. It seems that FTS5 is not affected.
dan added on 2019-12-27 14:14:52:
The INSERT in the script above is equivalent to this:
INSERT INTO vt0 VALUES (CAST(X'FEFF' AS TEXT));
In utf-16, the sequence 0xFE 0xFF is a BOM - a byte-order-mark. SQLite leaves this byte sequence unchanged when it converts the text value to utf-16 to pass to FTS4. Then FTS4 tokenizes it. But, when FTS4 binds the value to a statement to insert into the %_content table, SQLite strips the BOM away before writing the value to disk. Then, when the integrity check is run, the contents of the FTS index are slightly different to the index that would be produced by rebuilding based on the contents of the %_content table. Hence the error.
dan added on 2019-12-27 14:16:15:
The unicode61 tokenizer handles this, as it ignores the BOM. But the default tokenizer ("simple") fails as above.
mrigger added on 2020-01-29 10:10:51:
I noticed that this bug report is still open. Will this issue still be addressed?
drh added on 2020-01-29 23:19:04:
This is an obscure corner case in an obsolete extension that is unlikely to ever be encountered in practice, and a fix is very difficult. So rather than fix it, we merely document the issue and ask users to avoid the unusual set of circumstances that can bring it to pass. And the circumstances that lead to this problem are quite ununsual and so we do not believe anybody has ever hit them before or is likely to hit them in the furture. (Unless they are running a fuzzer. ;-))