- File www/data_encoding.wiki — part of check-in [0275ee48db] at 2013-07-26 16:20:07 on branch trunk — Change the data encoding (again) to make content-in-key use fewer bytes, since one suspects that this will become a common encoding. FILE FORMAT CHANGE. (user: drh size: 4048)
The key for each row of a table is the defined PRIMARY KEY in the key encoding. Or, if the table does not have a defined PRIMARY KEY, then it has an implicit INTEGER PRIMARY KEY as the first column.
The content consists of all columns of the table, in order, in the data encoding defined here. The PRIMARY KEY column or columns are repeated in the data, due to the difficulty in decoding the key format.
The data consists of a header area followed by a content area. The data begins with a single varint which is the size of the header area. The initial varint itself is not considered part of the header. The header is composed of one or two varints for each column in the table. These varints determine the datatype and size of the value for that column:
Type     Header code (N)   Content bytes   Description
NULL     0                 0               NULL
ZERO     1                 0               Zero
ONE      2                 0               One
INT      3..10             N-2             Signed integer
NUM      11..21            N-9             Floating-point number
STRING   22+4*K            K               String
BLOB     23+4*K            K               Inline blob
KEY      24+4*K            0               Content in key at offset K/2. Floating-point if LSB of K is 1.
TYPED    25+4*K            K               Typed blob, followed by a single varint type code
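The ranges in the table above can be turned into a small classifier. The following is a minimal sketch, not the reference implementation: the function names `classify_header_code` and `key_location` are hypothetical, and K/2 is assumed to mean integer division.

```python
from typing import Optional, Tuple


def classify_header_code(n: int) -> Tuple[str, int, Optional[int]]:
    """Map a header varint value to (type name, content bytes, K).

    K is None for the fixed codes (NULL, ZERO, ONE, INT, NUM).
    """
    if n < 0:
        raise ValueError("header codes are non-negative")
    if n == 0:
        return ("NULL", 0, None)
    if n == 1:
        return ("ZERO", 0, None)
    if n == 2:
        return ("ONE", 0, None)
    if n <= 10:
        return ("INT", n - 2, None)   # 1..8 content bytes
    if n <= 21:
        return ("NUM", n - 9, None)   # 2..12 content bytes
    # Codes 22 and up cycle through STRING, BLOB, KEY, TYPED.
    k, kind = divmod(n - 22, 4)
    if kind == 0:
        return ("STRING", k, k)
    if kind == 1:
        return ("BLOB", k, k)
    if kind == 2:
        return ("KEY", 0, k)          # the value lives in the key
    return ("TYPED", k, k)


def key_location(k: int) -> Tuple[int, bool]:
    """For a KEY code, return (byte offset into the key, is_float).

    Assumes K/2 is integer division, so the low bit of K is the flag.
    """
    return (k // 2, bool(k & 1))
```

For example, header code 27 classifies as a one-byte BLOB, which matches the blob description in the text.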
Header codes NULL, ZERO, and ONE are self-describing and have no content in the payload area.
Strings (STRING) can be either UTF8, UTF16le, or UTF16be. If the first byte of the payload is 0x00, 0x01, or 0x02 then that byte is ignored and the remaining bytes are UTF8, UTF16le, or UTF16be respectively. If the first byte is 0x03 or larger, then the entire string including the first byte is UTF8. An empty string consists of the header code 22 and no payload.
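The first-byte rule for strings can be sketched as follows. This is an illustration only; `decode_string_payload` is a hypothetical name, and the payload is assumed to have already been sliced out of the content area.

```python
def decode_string_payload(payload: bytes) -> str:
    """Decode a STRING payload according to its first byte."""
    if not payload:
        return ""  # header code 22: empty string, no payload
    first = payload[0]
    if first == 0x00:
        return payload[1:].decode("utf-8")
    if first == 0x01:
        return payload[1:].decode("utf-16-le")
    if first == 0x02:
        return payload[1:].decode("utf-16-be")
    # First byte is 0x03 or larger: the whole payload is UTF8.
    return payload.decode("utf-8")
```

Under this rule, ordinary UTF8 text needs no prefix byte, because such text rarely begins with a byte below 0x03.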
Blobs (BLOB) are stored as a sequence of bytes with no encoding. An empty blob is header code 23. A one-byte blob is header code 27. And so forth.
The KEY header code indicates that the actual content of the column is in the key portion of the key/value pair, at an offset of K/2 bytes from the beginning of the key. Numeric values should be interpreted as floating point if K is odd and as integers if K is even.
A "typed blob" (TYPED) is a sequence of bytes in an application-defined type. The type is determined by a varint that immediately follows the initial varint. Hence, a typed blob uses two varints in the header whereas all other types use a single varint.
The content of an INT is the signed integer stored in the number of bytes given by its header code (N-2, so 1 to 8 bytes), most significant byte first.
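A minimal sketch of the INT content follows. The text says only that the integer is signed and stored most significant byte first; two's-complement representation is an assumption made here, and `encode_int` and `decode_int` are hypothetical names.

```python
def encode_int(value: int) -> bytes:
    """Encode value in the fewest big-endian bytes (1..8).

    Assumes two's complement, which the text does not state.
    """
    for size in range(1, 9):
        try:
            return value.to_bytes(size, "big", signed=True)
        except OverflowError:
            continue  # does not fit, try one more byte
    raise ValueError("value does not fit in 8 bytes")


def decode_int(content: bytes) -> int:
    """Decode INT content of 1..8 bytes, most significant byte first."""
    if not 1 <= len(content) <= 8:
        raise ValueError("INT content must be 1 to 8 bytes")
    return int.from_bytes(content, "big", signed=True)
```

The header code for the result would be `len(content) + 2`. An encoder would presumably use the ZERO and ONE codes for the values 0 and 1 rather than a one-byte INT.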
The content of a number (NUM) is two varints encoding an exponent e and a mantissa m, where the value is m × 10^e. The first varint has the value abs(e)*4 + (e<0)*2 + (m<0). The second varint is abs(m). The maximum e is 999, which gives a maximum first varint value of 3999, encoded as 0xf906af, so the first varint is at most 3 bytes. Exponents greater than 999 (used for Inf and NaN) are represented as e = -0, that is, abs(e) = 0 with the e<0 bit set. The second varint can be a full 9 bytes. Example values:
0.123 → e=-3, m=123 → 0e,7b (2 bytes)
3.14159 → e=-5, m=314159 → 16,fa04cb2f (5 bytes)
-1.2e+99 → e=98, m=-12 → f199,0c (3 bytes)
+Inf → e=-0, m=1 → 02,01 (2 bytes)
-Inf → e=-0, m=-1 → 03,01 (2 bytes)
NaN → e=-0, m=0 → 02,00 (2 bytes)
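The example values above can be reproduced with a short sketch. The varint format is not defined in this section; the `encode_varint` helper below assumes the SQLite4-style variable-length integer layout, which agrees with every byte sequence in the examples. Both function names and the `negative_zero_exponent` flag are hypothetical.

```python
def encode_varint(n: int) -> bytes:
    """Encode an unsigned 64-bit integer (assumed SQLite4-style varint)."""
    if n < 0 or n >= 1 << 64:
        raise ValueError("varint must fit in an unsigned 64-bit integer")
    if n <= 240:
        return bytes([n])
    if n <= 2287:
        n -= 240
        return bytes([241 + n // 256, n % 256])
    if n <= 67823:
        n -= 2288
        return bytes([249, n // 256, n % 256])
    # Larger values: lead byte 250..255, then 3..8 big-endian bytes.
    for size in range(3, 9):
        if n < 1 << (8 * size):
            return bytes([247 + size]) + n.to_bytes(size, "big")
    raise ValueError("unreachable for 64-bit input")


def encode_num(e: int, m: int, negative_zero_exponent: bool = False) -> bytes:
    """Encode NUM content for the value m * 10**e.

    Pass e=0 with negative_zero_exponent=True for Inf and NaN (e = -0).
    """
    if abs(e) > 999:
        raise ValueError("exponent out of range; use e = -0 for Inf/NaN")
    e_negative = e < 0 or negative_zero_exponent
    first = abs(e) * 4 + (2 if e_negative else 0) + (1 if m < 0 else 0)
    return encode_varint(first) + encode_varint(abs(m))


print(encode_num(-3, 123).hex())     # 0e7b
print(encode_num(-5, 314159).hex())  # 16fa04cb2f
print(encode_num(98, -12).hex())     # f1990c
print(encode_num(0, 1, True).hex())  # 0201 (+Inf)
```

The header code for a NUM is the content length plus 9, so the two-byte encoding of 0.123 would carry header code 11.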
Initially, the following typed blobs are defined:
0 external blob
1 big int
2 date/time