Artifact 39548040ed02b3094227fa3cd0717f02a4b3f755:

Data Encoding

The key for each row of a table is the table's declared PRIMARY KEY, stored in the key encoding. If the table has no declared PRIMARY KEY, it has an implicit INTEGER PRIMARY KEY as its first column.

The content consists of all columns of the table, in order, in the data encoding defined here. The PRIMARY KEY column or columns are repeated in the data because the key format is difficult to decode.

The data consists of a header area followed by a content area. The data begins with a single varint which is the size of the header area. The initial varint itself is not considered part of the header. The header is composed of one or two varints for each column in the table. These varints determine the datatype and size of the value for that column:

Type     Code (N)   Bytes of payload   Description
NULL     0          0                  NULL
ZERO     1          0                  Zero
ONE      2          0                  One
INT      3..10      N-2                Signed integer
NUM      11..21     N-9                Floating-point number
STRING   22+4*K     K                  String
BLOB     23+4*K     K                  Inline blob
KEY      24+4*K     0                  Content in key at offset K/2. Floating-point if LSB of K is 1.
TYPED    25+4*K     K                  Typed blob, followed by a single varint type code

Header codes NULL, ZERO, and ONE are self-describing and have no content in the payload area.
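The code ranges in the table can be turned into a small classifier. The following is a minimal sketch, not a reference implementation: the function name `classify_header_code` is hypothetical, and NULL is assumed to be code 0, since it is the only value left below ZERO.

```python
def classify_header_code(n):
    """Return (type_name, payload_bytes, k) for header code n.

    k is None for the fixed-range codes and the K parameter otherwise.
    """
    if n < 0:
        raise ValueError("header codes are non-negative")
    if n == 0:
        return ("NULL", 0, None)       # assumption: NULL is code 0
    if n == 1:
        return ("ZERO", 0, None)
    if n == 2:
        return ("ONE", 0, None)
    if n <= 10:
        return ("INT", n - 2, None)    # 1..8 payload bytes
    if n <= 21:
        return ("NUM", n - 9, None)    # 2..12 payload bytes
    # Codes from 22 upward cycle through four types, each carrying K.
    k, kind = divmod(n - 22, 4)
    name = ("STRING", "BLOB", "KEY", "TYPED")[kind]
    payload = 0 if name == "KEY" else k   # KEY has no payload of its own
    return (name, payload, k)
```

For instance, code 27 is 23+4*1, so it classifies as a BLOB with a one-byte payload.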

Strings (STRING) can be UTF8, UTF16le, or UTF16be. If the first byte of the payload is 0x00, 0x01, or 0x02 then that byte is ignored and the remaining bytes are UTF8, UTF16le, or UTF16be respectively. If the first byte is 0x03 or larger, then the entire string including the first byte is UTF8. An empty string consists of the header code 22 and no payload.
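The first-byte rule for strings can be sketched as follows. The function name `decode_string` is hypothetical, and the sketch assumes the payload has already been sliced out of the content area.

```python
def decode_string(payload):
    """Decode a STRING payload according to its leading marker byte."""
    if not payload:
        return ""                                # header code 22, no payload
    marker = payload[0]
    if marker == 0x00:
        return payload[1:].decode("utf-8")       # marker byte is skipped
    if marker == 0x01:
        return payload[1:].decode("utf-16-le")
    if marker == 0x02:
        return payload[1:].decode("utf-16-be")
    return payload.decode("utf-8")               # 0x03 or larger: all UTF8
```

A consequence of this scheme is that a UTF8 string whose first character is below U+0003 needs the explicit 0x00 prefix, while ordinary text needs no prefix at all.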

Blobs (BLOB) are stored as a sequence of bytes with no encoding. An empty blob is header code 23. A one-byte blob is header code 27. And so forth.

The KEY header code indicates that the actual content of the column is in the key portion of the key/value pair at an offset of K/2 bytes from the beginning of the key. Numeric values should be interpreted as floating point if K is odd and as integers if K is even.
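As an illustration, a KEY header code can be split into its offset and its floating-point flag. This is a sketch under one assumption: K/2 is taken to round down, so that the low bit of K acts purely as the flag. The name `decode_key_code` is hypothetical.

```python
def decode_key_code(n):
    """Return (byte_offset_in_key, is_floating_point) for a KEY code."""
    if n < 24 or (n - 24) % 4 != 0:
        raise ValueError("not a KEY header code")
    k = (n - 24) // 4
    # Assumption: offset is K/2 rounded down; the LSB of K selects float.
    return (k // 2, k % 2 == 1)
```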

A "typed blob" (TYPED) is a sequence of bytes in an application-defined type. The type is determined by a varint that immediately follows the initial varint. Hence, a typed blob uses two varints in the header whereas all other types use a single varint.
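Walking the header area, including the extra varint carried by TYPED entries, might look like the sketch below. The varint format is not defined in this section; the decoder here assumes the layout that is consistent with the example bytes elsewhere in the text (a first byte of 0..240 is the value itself, 241..248 starts a two-byte form, 249 a three-byte form, and 250..255 are followed by a 3 to 8 byte big-endian integer). The names `decode_varint` and `parse_header` are hypothetical.

```python
def decode_varint(buf, pos):
    """Decode one varint at buf[pos]; return (value, next_pos).

    Assumed format, see the note above.
    """
    a0 = buf[pos]
    if a0 <= 240:
        return a0, pos + 1
    if a0 <= 248:
        return 240 + 256 * (a0 - 241) + buf[pos + 1], pos + 2
    if a0 == 249:
        return 2288 + 256 * buf[pos + 1] + buf[pos + 2], pos + 3
    n = a0 - 247                       # 3..8 big-endian bytes follow
    return int.from_bytes(buf[pos + 1:pos + 1 + n], "big"), pos + 1 + n


def parse_header(data):
    """Return ([(code, type_code_or_None), ...], content_start)."""
    header_size, pos = decode_varint(data, 0)
    end = pos + header_size            # size excludes the initial varint
    columns = []
    while pos < end:
        code, pos = decode_varint(data, pos)
        type_code = None
        if code >= 22 and (code - 22) % 4 == 3:
            # TYPED (25+4*K): a second varint gives the type code.
            type_code, pos = decode_varint(data, pos)
        columns.append((code, type_code))
    return columns, end
```

Usage: for a record whose header holds ZERO, a one-byte BLOB, and a one-byte typed blob of type 2, `parse_header(bytes([4, 1, 27, 29, 2, 0xAA, 0xBB]))` yields the three column entries and the offset 5 at which the content area starts.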

The content of INT is the specified number of bytes for the signed integer. The most significant bytes are first.
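A minimal sketch of reading an INT payload follows. The text says only that the integer is signed and stored most significant byte first; two's complement is assumed here. The name `decode_int` is hypothetical.

```python
def decode_int(code, payload):
    """Decode the payload of an INT column with header code 3..10."""
    if not 3 <= code <= 10:
        raise ValueError("not an INT header code")
    if len(payload) != code - 2:
        raise ValueError("payload length does not match header code")
    # Assumption: two's-complement, big-endian.
    return int.from_bytes(payload, "big", signed=True)
```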

The content of a number (NUM) is two varints. Here e is a decimal exponent and m an integer mantissa, so the value is m times 10 to the power e. The first varint has the value abs(e)*4 + (e<0)*2 + (m<0). The second varint is abs(m). The maximum e is 999, which gives a maximum first-varint value of 999*4 + 2 + 1 = 3999, encoded as the three bytes f9 06 af. Values of e greater than 999 (used for Inf and NaN) are represented as e = -0, that is, abs(e) = 0 with the e<0 bit set. The second varint can be a full 9 bytes. Example values:

0.123      e=-3, m=123      0e,7b        (2 bytes)
3.14159    e=-5, m=314159   16,fa04cb2f  (5 bytes)
-1.2e+99   e=98, m=-12      f199,0c      (3 bytes)
+Inf       e=-0, m=1        02,01        (2 bytes)
-Inf       e=-0, m=-1       03,01        (2 bytes)
NaN        e=-0, m=0        02,00        (2 bytes)
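The example encodings can be reproduced with a short sketch. The varint format is not defined in this section; the encoder below assumes the layout that matches the bytes shown (values up to 240 take one byte, up to 2287 two bytes, up to 67823 three bytes, and larger values a length byte 250..255 followed by a big-endian integer). The names `encode_varint` and `encode_num` are hypothetical, as is the `e_negative` parameter, which exists only because a Python int cannot carry the sign of -0.

```python
def encode_varint(v):
    """Encode a non-negative integer below 2**64 (assumed varint format)."""
    if v <= 240:
        return bytes([v])
    if v <= 2287:
        v -= 240
        return bytes([241 + v // 256, v % 256])
    if v <= 67823:
        v -= 2288
        return bytes([249, v // 256, v % 256])
    n = (v.bit_length() + 7) // 8      # 3..8 payload bytes
    return bytes([247 + n]) + v.to_bytes(n, "big")


def encode_num(e, m, e_negative=None):
    """Encode a NUM payload for the value m * 10**e.

    Pass e=0, e_negative=True for the special e = -0 used by Inf and NaN.
    """
    if e_negative is None:
        e_negative = e < 0
    first = abs(e) * 4 + (2 if e_negative else 0) + (1 if m < 0 else 0)
    return encode_varint(first) + encode_varint(abs(m))
```

Usage: `encode_num(-3, 123).hex()` gives `0e7b` and `encode_num(98, -12).hex()` gives `f1990c`, matching the 0.123 and -1.2e+99 rows.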

Initially, the following typed blobs are defined:

0 external blob
1 big int
2 date/time