/ View Ticket
Ticket UUID: bb1977f6e9f23108b6b02ddc79817904b907f45b
Title: Concatenation can create very invalid strings
Status: Closed Type: Code_Defect
Severity: Severe Priority: Immediate
Subsystem: Unknown Resolution: Rejected
Last Modified: 2010-04-22 12:08:03
Version Found In:
With a UTF16 encoded database:
    sqlite> insert into foo values('CD' || x'00' || 'E');
    sqlite> select cast(x as blob) from foo;
    INSERT INTO "table" VALUES(X'43004400004500');
    sqlite> select typeof(x) from foo;
    sqlite> select length(x) from foo;

Note that the string is 7 bytes long which is very invalid for UTF16 encoding. It is unlikely all code paths have been tested to ensure they cope with this correctly, don't read the 8th byte, deal with potential issues such as introducing surrogate pairs this way that are not valid etc.

Suggestions include making || with blobs on either side result in a blob, more careful blob to string coercion that always results in a valid string, giving an error etc.

drh added on 2010-04-22 12:08:03:
After a due consideration, the developers are unanimous that the behavior of strings and blobs and the concatenate operator should remain unchanged. With over a billion deployments, a behavior change like the one recommended here is too likely to break something somewhere.

One might argue that the suggested behavior makes more sense. But we could argue back: Garbage in, garbage out.

This ticket does, however, point out the desirability of very carefully testing the behavior when UTF16 strings have an odd number of bytes - to verify that there are no buffer overrun bugs elsewhere in the code. We'll be undertaking those tests soon.