Documentation Source Text

Artifact Content
Login

Artifact 1b83041ce5f4538d64cb79a53a16318d840ae0e6:


<title>The SQLite R*Tree Module</title>
<tcl>hd_keywords *rtree *RTREE {R-Tree extension} {R-Trees}</tcl>

<h1>1.0 Overview</h1>

<p>
An [http://en.wikipedia.org/wiki/R-tree | R-Tree] is a special
index that is designed for doing range queries.  R-Trees are most commonly
used in geospatial systems where each entry is a rectangle with minimum and
maximum X and Y coordinates.  Given a query rectangle, an R-Tree is able
to quickly find all entries that are contained within the query rectangle
or which overlap the query rectangle.  This idea is easily extended to
three dimensions for use in CAD systems.  R-Trees also find use in time-domain
range look-ups.  For example, suppose a database records the starting and
ending times for a large number of events.  A R-Tree is able to quickly
find all events, for example, that were active at any time during a given
time interval, or all events that started during a particular time interval,
or all events that both started and ended within a given time interval.
And so forth.
</p>

<p>
The R-Tree concept originated with 
[http://www.baymoon.com/~tg2/ | Toni Guttman]: 
<em>R-Trees: A Dynamic Index Structure for Spatial Searching</em>,
Proc. 1984 ACM SIGMOD International Conference on Management of Data,
pp. 47-57.
The implementation found in SQLite is a refinement of Guttman's original
idea, commonly called "R*Trees", that was described by
Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger:
<em>The R*-Tree: An Efficient and Robust Access Method for Points
and Rectangles.</em> SIGMOD Conference 1990: 322-331.
</p>

<h1>2.0 Compiling The R*Tree Module</h1>

<p>
The source code to the SQLite R*Tree module is included as part
of the [amalgamation] but is disabled by default.  To enable the
R*Tree module, simply compile with the [SQLITE_ENABLE_RTREE] 
C-preprocessor macro defined.  With many compilers, this is accomplished
by adding the option "-DSQLITE_ENABLE_RTREE=1" to the compiler
command-line.
</p>

<h1>3.0 Using the R*Tree Module</h1>

<p>
The SQLite R*Tree module is implemented as a
[sqlite3_create_module | virtual table].  ^Each R*Tree index is a
virtual table with an odd number of columns between 3 and 11.
^The first column is always a 64-bit signed integer primary key.
^The other columns are minimum- and maximum-value pairs (stored as
32-bit floating point numbers) for each
dimension.  ^A 1-dimensional R*Tree thus has 3 columns.  
^A 2-dimensional R*Tree (the most common case) has 5 columns.
^A 5-dimensional R*Tree has 11 columns.  The SQLite R*Tree implementation
does not support R*Trees wider than 5 dimensions.
</p>

<p>
^The first column of an SQLite R*Tree must always be an integer
primary key.
^The min/max-value pair columns are always stored as
32-bit floating point values.  ^Unlike regular SQLite tables which
can store data in a variety of datatypes and formats, the R*Tree
indices rigidly enforce these two storage types.  ^Attempts to insert
something other than an integer into the first column, or something
other than a floating point value into the other columns, will result
in an error.
</p>

<h2>3.1 Creating An R*Tree Index</h2>

^(<p>
A new R*Tree index is created as follows:
</p>

<blockquote><pre>
CREATE VIRTUAL TABLE <em>&lt;name&gt;</em> USING rtree(<em>&lt;column-names&gt;</em>);
</pre></blockquote>)^

<p>
The <em>&lt;name&gt;</em> is the name your application chooses for the
R*Tree index and <em>&lt;column-names&gt;</em> is a comma separated list
of between 3 and 11 columns.
^(The virtual &lt;name&gt; table creates three "shadow" tables to actually
store its content.  The names of these shadow tables are:
</p>

<blockquote>
<em>&lt;name&gt;</em><strong>_node</strong><br>
<em>&lt;name&gt;</em><strong>_rowid</strong><br>
<em>&lt;name&gt;</em><strong>_parent</strong>
</blockquote>)^

<p>
^The shadow tables are ordinary SQLite data tables.  You can query them 
directly if you like, though this unlikely to reveal anything particularly
useful. 
^And you can [UPDATE], [DELETE], [INSERT] or even [DROP TABLE | DROP] 
the shadow tables, though doing so will corrupt your R*Tree index.
So it is best to simply ignore the shadow tables.  Recognize that they
are there to hold your R*Tree index information and let it go as that.
</p>

<p>
^(As an example, consider creating a two-dimensional R*Tree index for use in 
spatial queries:
</p>

<blockquote><pre>
CREATE VIRTUAL TABLE demo_index USING rtree(
   id,              -- Integer primary key
   minX, maxX,      -- Minimum and maximum X coordinate
   minY, maxY       -- Minimum and maximum Y coordinate
);
</pre></blockquote>)^

<h2>3.2 Populating An R*Tree Index</h2>

<p>
^The usual [INSERT], [UPDATE], and [DELETE] commands work on an R*Tree
index just like on regular tables.  ^(So to insert some data into our sample
R*Tree index, we can do something like this:
</p>

<blockquote><pre>
INSERT INTO demo_index VALUES(
    1,                   -- Primary key
    -80.7749, -80.7747,  -- Longitude range
    30.3776, 30.3778     -- Latitude range
);
INSERT INTO demo_index VALUES(
    2,
    -81.0, -79.6,
    35.0, 36.2
);
</pre></blockquote>)^

<p>
The entries above might represent (for example) a bounding box around
the offices for SQLite.org and bounding box around the
12th Congressional District of North Carolina in which SQLite.org is located.  
</p>

<h2>3.3 Querying An R*Tree Index</h2>

<p>
^Any valid query will work against an R*Tree index.  But the R*Tree
implementation is designed to make two kinds of queries especially
efficient.  ^(First, queries against the primary key are efficient:
</p>

<blockquote><pre>
SELECT * FROM demo_index WHERE id=1;
</pre></blockquote>)^

<p>
Of course, an ordinary SQLite table will do a query against its
integer primary key efficiently, so the previous is no big deal.
The real reason for using an R*Tree in the first place is so that
you can efficiently do inequality queries against the coordinate
ranges.  ^(To find all elements of the index that are contained within
the vicinity of Charlotte, North Carolina, one might do:
</p>

<blockquote><pre>
SELECT id FROM demo_index
 WHERE minX>=-81.08 AND maxX<=-80.58
   AND minY>=35.00  AND maxY<=35.44;
</pre></blockquote>)^

<p>
^The query above would very quickly locate the id of 1 even if the
R*Tree contained millions of entries.  The previous is an example
of a "contained-within" query.  The R*Tree also supports "overlapping"
queries.  ^(For example, to find all bounding boxes that overlap the
Charlotte area:
</p>

<blockquote><pre>
SELECT id FROM demo_index
 WHERE maxX>=-81.08 AND minX<=-80.58
   AND maxY>=35.00  AND minY<=35.44;
</pre></blockquote>)^

<p>
^(This second query would find both entry 1 (the SQLite.org office) which
is entirely contained within the query box and also
the 12th Congressional District which extends well outside the
query box but still overlaps the query box.)^
</p>

<p>
^Note that is not necessary for all coordinates in an R*Tree index
to be constrained in order for the index search to be efficient.
^(One might, for example, want to query all objects that overlap with
the 35th parallel:
</p>

<blockquote><pre>
SELECT id FROM demo_index
 WHERE maxY>=35.0  AND minY<=35.0;
</pre></blockquote>)^

<p>
But, generally speaking, the more constraints that the R*Tree module
has to work with, and the smaller the bounding box, the faster the
results will come back.
</p>

<h1>4.0 Using R*Trees Effectively</h1>

<p>
The only information that an R*Tree index stores about an object is
its integer ID and its bounding box.  Additional information needs to
be stored in separate tables and related to the R*Tree index using
the primary key.  ^(For the example above, one might create an auxiliary
table as follows:
</p>

<blockquote><pre>
CREATE TABLE demo_data(
  id INTEGER PRIMARY KEY,  -- primary key
  objname TEXT,            -- name of the object
  objtype TEXT,            -- object type
  boundary BLOB            -- detailed boundary of object
);
</pre></blockquote>)^

<p>
In this example, the demo_data.boundary field is intended to hold some
kind of binary representation of the precise boundaries of the object.
The R*Tree index only holds an axis-aligned rectangular boundary for the
object.  The R*Tree boundary is just an approximation of the true object
boundary.  So what typically happens is that the R*Tree index is used to
narrow a search down to a list of candidate objects and then more detailed
and expensive computations are done on each candidate to find if the
candidate truly meets the search criteria.
</p>

<blockquote><p>
<strong>Key Point:</strong>
An R*Tree index does not normally provide the exact answer but merely
reduces the set of potential answers from millions to dozens.
</p></blockquote>

<p>
Suppose the demo_data.boundary field holds some proprietary data description
of a complex two-dimensional boundary for an object and suppose that the
application has used the [sqlite3_create_function()] interface to
created application-defined functions "contained_in" and
"overlaps" accepting two demo_data.boundary objects and return true or false.
One may assume that "contained_in" and "overlaps" are relatively slow
functions that we do not want to invoke too frequently.
^(Then an efficient way to find the name of all objects located within
the North Carolina 12th District, one may be to run a query like this:
</p>

<blockquote><pre>
SELECT objname FROM demo_data, demo_index
 WHERE demo_data.id=demo_index.id
   AND contained_in(demo_data.boundary, :boundary)
   AND minX>=-81.0 AND maxX<=-79.6
   AND minY>=35.0 AND maxY<=36.2;
</pre></blockquote>)^

<p>In the query above, one would presumably bind the binary BLOB 
description of the precise boundary of the 12th district to the
":boundary" parameter.</p>

<p>Notice how the query above works:  The R*Tree index runs in the outer
loop to find entries that are contained within the bounding box
of longitude -81..-79.6 and latitude 35.0..36.2. 
For each object identifier found, SQLite looks up
the corresponding entry in the demo_data table.  It then uses the boundary
field from the demo_data table as a parameter to the contained_in()
function and if that function returns true, the objname field from
the demo_data table is returned as the next row of query result.</p>

<p>One would get the same answer without the use of the R*Tree index
using the following simpler query:</p>

<blockquote><pre>
SELECT objname FROM demo_data
 WHERE contained_in(demo_data.boundary, :boundary);
</pre></blockquote>

<p>The problem with this latter query is that it must apply the
contained_in() function to millions of entries in the demo_data table.
The use of the R*Tree in the penultimate query reduces the number of
calls to contained_in() function to a small subset of the entire table.
The R*Tree index did not find the exact answer itself, it merely
limited the search space.</p>

<tcl>hd_fragment {customquery} {custom r-tree queries}</tcl>
<h1>5.0 Custom R-Tree Queries</h1>

<p>By using standard SQL expressions in the WHERE clause of a SELECT query,
a user may query for all r-tree entries that intersect a specified
bounding-box, or for all entries that are completely encapsulated by a
specified bounding-box. Custom r-tree queries, which use the special MATCH
operator in the WHERE clause of a SELECT, allow the user to query for the set
of r-tree entries that intersect any arbitrarily defined region.

<p>Regions for custom r-tree queries are defined by r-tree geometry callbacks
implemented by the application and registered with SQLite via a call to the 
following API:

<blockquote><pre>
int sqlite3_rtree_geometry_callback(
  sqlite3 *db,
  const char *zGeom,
  int (*xGeom)(sqlite3_rtree_geometry *, int nCoord, double *aCoord, int *pRes),
  void *pContext
);
</pre></blockquote>

<p>A call to the above API registers an r-tree geometry callback named zGeom.
If an r-tree geometry callback or ordinary SQL user function named zGeom 
already exists when sqlite3_rtree_geometry_callback() is called, it is replaced
by the new r-tree geometry callback. If the xGeom parameter is passed a NULL
value, then any existing r-tree geometry callback or SQL user function is
removed from the system, but no new r-tree geometry callback is registered.

<p>When the r-tree geometry callback is used in a custom r-tree query, the
registered callback is invoked one or more times by SQLite to test whether or
not the user-defined region intersects with specific bounding boxes.
The bounding box being tested is defined by the contents of the aCoord[] 
array (size nCoord) passed to the callback. The aCoord[] array always contains
the same number of entries as there are coordinate columns in the r-tree table
(one less than the total number of columns, since the object id column does not
contain a coordinate). They define a bounding-box in the same way as each row
of the r-tree table itself does - the first scalar coordinate contains the
minimum value for the first dimension, followed by the maximum value for the
first dimension, followed by the minimum value for the second dimension, and so
on. If the specified bounding box intersects with the custom query region, then
the implementation of the callback must set the output parameter *pRes to
non-zero and return SQLITE_OK. If the specified bounding box does not intersect
the queried region, *pRes should be set to zero before returning SQLITE_OK. If
an error occurs, the callback may return an SQLite error code other than
SQLITE_OK, in which case the value of *pRes is ignored by SQLite and the query
abandoned.

<p>A registered r-tree geometry callback is used in an r-tree query by adding
a MATCH condition to the WHERE clause of a SELECT statement. For example,
assuming a custom geometry callback named "circle" has been registered, it may
be used in a query on the two-dimensional r-tree table "demo_index" defined in
earlier examples as follows: 

<blockquote><pre>
SELECT * FROM demo_index WHERE id MATCH circle(45.3, 22.9, 5.0)
</blockquote></pre>

<p>The left-hand side of the MATCH operator may be any column from the r-tree
table, including the object id column. It makes no difference which column is
used. The right-hand side of the MATCH operator is passed the results of an
SQL function with the same name as the r-tree geometry callback. Zero or more
function parameters may be specified by the user. Parameters are always
interpreted as 64-bit real values. If a text, integer or blob value is passed
as a parameter to an r-tree geometry callback function, it is converted to
a real value as if by a [CAST expression]. If an SQL NULL value is passed to
an r-tree geometry callback function, it is converted to the value 0.0.

<p>Parameters passed to r-tree geometry callback functions may be used by the
implementation of the r-tree geometry callback to define the specified
region in any way. For example, the three parameters passed to the "circle"
geometry callback above could identify the center point and radius of the
circular region of interest.

<p>The first argument to each invocation of an r-tree geometry callback is
a pointer to a structure of the following type. The contents of the structure
are not modified between multiple calls to the r-tree geometry callback
associated with a single query (unless the pUser or xDelUser member variables
are modified by the callback implementation - see below).

<blockquote><pre>
typedef struct sqlite3_rtree_geometry sqlite3_rtree_geometry;
struct sqlite3_rtree_geometry {
  void *pContext;                 /* Copy of pContext passed to s_r_g_c() */
  int nParam;                     /* Size of array aParam */
  double *aParam;                 /* Parameters passed to SQL geom function */
  void *pUser;                    /* Callback implementation user data */
  void (*xDelUser)(void *);       /* Called by SQLite to clean up pUser */
};
</pre></blockquote>

<p>The pContext member of the structure is always set to a copy of the pContext
argument passed to sqlite3_rtree_geometry_callback() when the r-tree geometry
callback is registered. The aParam[] array (size nParam) contains the parameter
values passed to the r-tree geometry callback as part of the SQL query. In
the example "circle" query above, nParam would be set to 3 and the aParam[]
array would contain the three values 45.3, 22.9 and 5.0.

<p>The pUser and xDelUser members of the sqlite3_rtree_geometry structure are
initially set to NULL. The pUser variable may be set by the callback
implementation to any arbitrary value that may be useful to subsequent
invocations of the callback within the same custom r-tree query (for example, a
pointer to a complicated data structure used to test for region intersection).
If the xDelUser variable is set to a non-NULL value, then after the custom
r-tree query has finished running SQLite automatically invokes it with the
value of the pUser variable as the only argument. In other words, xDelUser
may be set to a destructor function for the pUser value.

<p>Example code implementing the "circle" r-tree geometry callback may be
found in the file 
<a href=http://www.sqlite.org/src/doc/trunk/src/test_rtree.c>test_rtree.c</a>.