<title>SQLite FTS3 Extension</title>
<tcl>
hd_keywords {fts3}
source [file join $::DOC pages fancyformat.tcl]
fancyformat_document "SQLite FTS3 Extension" {} {
<h2 style="margin-left:1.0em"> Overview</h2>
[h1 "Introduction to FTS3"]
<p>
The FTS3 extension module allows users to create special tables with a
built-in full-text index (hereafter "FTS3 tables"). The full-text index
allows the user to efficiently query the database for all rows that contain
one or more instances specified word (hereafter a "token", even if the table
contains many large documents.
<p>
For example, if each of the 517430 documents in the
"<a href="http://www.cs.cmu.edu/~enron/">Enron E-Mail Dataset</a>"
is inserted into both the FTS3 table and the ordinary SQLite table
created using the following SQL script:
[Code {
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT); /* FTS3 table */
CREATE TABLE enrondata2(content TEXT); /* Ordinary table */
}]
<p>
Then either of the two queries below may be executed to find the number of
documents in the database that contain the word "linux" (351). Using one
desktop PC hardware configuration, the query on the FTS3 table returns in
approximately 0.03 seconds, versus 22.5 for querying the ordinary table.
[Code {
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux'; /* 0.03 seconds */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
}]
<p>
Of course, the two queries above are not entirely equivalent. For example
the LIKE query matches rows that contain terms such as "linuxophobe"
or "EnterpriseLinux" (as it happens, the Enron E-Mail Dataset does not
actually contain any such terms), whereas the MATCH query on the FTS3 table
selects only those rows that contain "linux" as a discreet token. Both
searches are case-insensitive. The FTS3 table consumes around 2006 MB on
disk compared to just 1453 MB for the ordinary table. Using the same
hardware configuration used to perform the SELECT queries above, the FTS3
table took just over 39 minutes to populate, versus 25 for the ordinary
table.
[h1 "Compiling and Enabling FTS3"]
[h1 "Custom Tokenizers" customtokenizer {custom tokenizer}]
[h1 "MATCH Expression Syntax"]
[h2 "Advanced Query Syntax"]
[h2 "Legacy Query Syntax"]
[h1 "Data Structures"]
<p>
This section describes at a high-level the way the FTS3 module stores its
index and content in the database. It is <b>not necessary to read or
understand the material in this section in order to use FTS3</b> in an
application. However, it may be useful to application developers attempting
to analyze and understand FTS3 performance characteristics, or to developers
contemplating enhancements to the existing FTS3 feature set.
<p>
For each FTS3 virtual table in a database, three real (non-virtual) tables
are created to store the underlying data. The real tables are named "%_content",
"%_segdir" and "%_segments", where "%" is replaced by the name supplied by
the user for the FTS3 virtual table.
<p>
The leftmost column of the "%_content" table is an INTEGER PRIMARY KEY field
named "docid". Following this is one column for each column of the FTS3
virtual table as declared by the user, named by prepending the column name
supplied by the user with "c<i>N</i>", where <i>N</i> is the index of the
column within the table, numbered from left to right starting with 1. Data
types supplied as part of the virtual table declaration are not used as
part of the %_content table declaration. For example:
[Code {
/* Virtual table declaration */
CREATE VIRTUAL TABLE abc USING FTS3(a NUMBER, b TEXT, c);
/* Corresponding %_content table declaration */
CREATE TABLE abc_content(docid INTEGER PRIMARY KEY, c1a, c2b, c2c);
}]
<p>
The %_content table contains the unadulterated data inserted by the user
into the FTS3 virtual table by the user. If the user does not explicitly
supply a "docid" value when inserting records, one is selected automatically
by the system.
<p>
The two remaining tables, %_segments and %_segdir, are used to store the
full-text index. Conceptually, this index is a lookup table that maps each
term (word) to the set of docid values corresponding to records in the
%_content table that contain one or more occurrences of the term. To
retrieve all documents that contain a specified term, the FTS3 module
queries this index to determine the set of docid values for records that
contain the term, then retrieves the required documents from the %_content
table. Regardless of the schema of the FTS3 virtual table, the %_segments
and %_segdir tables are always created as follows:
[Code {
CREATE TABLE %_segments(
blockid INTEGER PRIMARY KEY, /* B-tree node id */
block blob /* B-tree node data */
);
CREATE TABLE %_segdir(
level INTEGER,
idx INTEGER,
start_block INTEGER, /* Blockid of first node in %_segments */
leaves_end_block INTEGER, /* Blockid of last leaf node in %_segments */
end_block INTEGER, /* Blockid of last node in %_segments */
root BLOB, /* B-tree root node */
PRIMARY KEY(level, idx)
);
}]
<p>
The schema depicted above is not designed to store the full-text index
directly. Instead, it is used to one or more b-tree structures. There
is one b-tree for each row in the %_segdir table. The %_segdir table
row contains the root node and various meta-data associated with the
b-tree structure, and the %_segments table contains all other (non-root)
b-tree nodes. Each b-tree is referred to as a "segment". Once it has
been created, a segment b-tree is never updated (although it may be
deleted altogether).
<p>
The keys used by each segment b-tree are terms (words). As well as the
key, each segment b-tree entry has an associated "doclist" (document list).
A doclist consists of zero or more entries, where each entry consists of:
<ul>
<li> A docid (document id), and
<li> A list of term offsets, one for each occurrence of the term within
the document. A term offset indicates the number of tokens (words)
that occur before the term in question, not the number of characters
or bytes. For example, the term offset of the term "war" in the
phrase "Ancestral voices prophesying war!" is 3.
</ul>
<p>
Entries within a doclist are sorted by docid. Positions within a doclist
entry are stored in ascending order.
<p>
The contents of the logical full-text index is found by merging the
contents of all segment b-trees. If a term is present in more than one
segment b-tree, then it maps to the union of each individual doclist. If,
for a single term, the same docid occurs in more than one doclist, then only
the doclist that is part of the most recently created segment b-tree is
considered valid.
<p>
Multiple b-tree structures are used instead of a single b-tree to reduce
the cost of inserting records into FTS3 tables. When a new record is
inserted into an FTS3 table that already contains a lot of data, it is
likely that many of the terms in the new record are already present in
a large number of existing records. If a single b-tree were used, then
large doclist structures would have to be loaded from the database,
amended to include the new docid and term-offset list, then written back
to the database. Using multiple b-tree tables allows this to be avoided
by creating a new b-tree which can be merged with the existing b-tree
(or b-trees) later on. Merging of b-tree structures can be performed as
a background task, or once a certain number of separate b-tree structures
have been accumulated. Of course, this scheme makes queries more expensive
(as the FTS3 code may have to look up individual terms in more than one
b-tree and merge the results), but it has been found that in practice this
overhead is often negligible.
[h2 "Variable Length Integer (varint) Format"]
<p>
Integer values stored as part of segment b-tree nodes are encoded using the
FTS3 varint format. This encoding is similar, but <b>not identical</b>, to the
the <a href="fileformat.html#varint_format">SQLite varint format</a>.
<p>
An encoded FTS3 varint consumes between one and ten bytes of space. The
number of bytes required is determined by the sign and magnitude of the
integer value encoded. More accurately, the number of bytes used to store
the encoded integer depends on the position of the most significant set bit
in the 64-bit twos-compliment representation of the integer value. Negative
values always have the most significant bit set (the sign bit), and so are
always stored using the full ten bytes. Positive integer values may be
stored using less space.
<p>
The final byte of an encoded FTS3 varint has its most significant bit
cleared. All preceding bytes have the most significant bit set. Data
is stored in the remaining seven least signficant bits of each byte.
The first byte of the encoded representation contains the least significant
seven bits of the encoded integer value. The second byte of the encoded
representation, if it is present, contains the seven next least significant
bits of the integer value, and so on. The following table contains examples
of encoded integer values:
[Table]
[Tr]<th>Decimal<th>Hexadecimal<th width=100%>Encoded Representation
[Tr]<td>43<td>0x000000000000002B<td>0x2B
[Tr]<td>200815<td>0x000000000003106F<td>0x9C 0xA0 0x0C
[Tr]<td>-1<td>0xFFFFFFFFFFFFFFFF<td>0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0x01
</table>
[h2 "Segment B-Tree Format"]
<p>
Segment b-trees are prefix-compressed b+-trees. There is one segment b-tree
for each row in the %_segdir table (see above). The root node of the segment
b-tree is stored as a blob in the "root" field of the corresponding row
of the %_segdir table. All other nodes (if any exist) are stored in the
"blob" column of the %_segments table. Nodes within the %_segments table are
identified by the integer value in the blockid field of the corresponding
row. The following table describes the fields of the %_segdir table:
[Table]
[Tr]<th>Column <th width=100%>Interpretion
[Tr]<td>level <td>
Between them, the contents of the "level" and "idx" fields define the
relative age of the segment b-tree. The smaller the value stored in the
"level" field, the more recently the segment b-tree was created. If two
segment b-trees are of the same "level", the segment with the larger
value stored in the "idx" column is more recent. The PRIMARY KEY constraint
on the %_segdir table prevents any two segments from having the same value
for both the "level" and "idx" fields.
[Tr]<td>idx <td> See above.
[Tr]<td>start_block <td>
The blockid that corresponds to the node with the smallest blockid that
belongs to this segment b-tree. Or zero if the entire segment b-tree
fits on the root node. If it exists, this node is always a leaf node.
[Tr]<td>leaves_end_block <td>
The blockid that corresponds to the leaf node with the largest blockid
that belongs to this segment b-tree. Or zero if the entire segment b-tree
fits on the root node.
[Tr]<td>end_block <td>
The blockid that corresponds to the interior node with the largest
blockid that belongs to this segment b-tree. Or zero if the entire segment
b-tree fits on the root node. If it exists, this node is always an
interior node.
[Tr]<td>root <td>
Blob containing the root node of the segment b-tree.
</table>
<p>
Apart from the root node, the nodes that make up a single segment b-tree are
always stored using a contiguous sequence of blockids. Furthermore, the
nodes that make up a single level of the b-tree are themselves stored as
a contiguous block, in b-tree order. The contiguous sequence of blockids
used to store the b-tree leaves are allocated starting with the blockid
value stored in the "start_block" column of the corresponding %_segdir row,
and finishing at the blockid value stored in the "leaves_end_block"
field of the same row. It is therefore possible to iterate through all the
leaves of a segment b-tree, in key order, by traversing the %_segments
table in blockid order from "start_block" to "leaves_end_block".
[h3 "Segment B-Tree Leaf Nodes"]
<p>
The following diagram depicts the format of a segment b-tree leaf node.
[Fig fts3_leaf_node.png "Segment B-Tree Leaf Node Format"]
<p>
The first term stored on each node ("Term 1" in the figure above) is
stored verbatim. Each subsequent term is prefix-compressed with respect
to its predecessor. Terms are stored within a page in sorted (memcmp)
order.
[h3 "Segment B-Tree Interior Nodes"]
<p>
The following diagram depicts the format of a segment b-tree interior
(non-leaf) node.
[Fig fts3_interior_node.png "Segment B-Tree Interior Node Format"]
[h2 "Doclist Format"]
<p>
A doclist consists of an array of 64-bit signed integers, serialized using
the FTS3 varint format. Each doclist entry is made up of a series of two
or more integers, as follows:
<ol>
<li> The docid value. The first entry in a doclist contains the literal docid
value. The first field of each subsequent doclist entry contains the
difference between the new docid and the previous one (always a positive
number).
<li> Zero or more term-offset lists. A term-offset list is present for each
column of the FTS3 virtual table that contains the term. A term-offset
list consists of the following:
<ol>
<li> Constant value 1. This field is omitted for any term-offset list
associated with column 0.
<li> The column number (1 for the second leftmost column, etc.). This
field is omitted for any term-offset list associated with column 0.
<li> A list of term-offsets, sorted from smallest to largest. Instead
of storing the term-offset value literally, each integer stored
is the difference between the current term-offset and the previous
one (or zero if the current term-offset is the first), plus 2.
</ol>
<li> Constant value 0.
</ol>
[Fig fts3_doclist2.png "FTS3 Doclist Format"]
[Fig fts3_doclist.png "FTS3 Doclist Entry Format"]
<p>
For doclists for which the term appears in more than one column of the FTS3
virtual table, term-offset lists within the doclist are stored in column
number order. This ensures that the term-offset list associated with
column 0 (if any) is always first, allowing the first two fields of the
term-offset list to be omitted in this case.
}