Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Updates to the queryplanner.html document. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
33d6d684cbaef7a12ab2278a3068c764 |
User & Date: | drh 2010-08-30 00:13:59.000 |
Context
2010-08-30
| ||
00:42 | Copy the spell-check exception list into the buid directory. Add "SQLite" to the spell-check exception list. (check-in: 99a59a73ff user: drh tags: trunk) | |
00:13 | Updates to the queryplanner.html document. (check-in: 33d6d684cb user: drh tags: trunk) | |
2010-08-28
| ||
05:48 | Fix documentation typos. (check-in: 78fe166463 user: shaneh tags: trunk) | |
Changes
Added images/qp/fruitobstate0.gif.
cannot compute difference between binary files
Changes to pages/changes.in.
︙ | ︙ | |||
37 38 39 40 41 42 43 44 45 46 47 48 49 50 | <a href="http://www.sqlite.org/src/timeline"> http://www.sqlite.org/src/timeline</a>.</p> } hd_close_aux hd_enable_main 1 } } chng {2010 August 24 (3.7.2)} { <li> Fix an <a href="http://www.sqlite.org/src/info/5e10420e8d"> old and very obscure bug</a> that can lead to corruption of the database [free-page list] when [incremental_vacuum] is used. } | > > > > > > > > > > | 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 | <a href="http://www.sqlite.org/src/timeline"> http://www.sqlite.org/src/timeline</a>.</p> } hd_close_aux hd_enable_main 1 } } chng {2010 October 15 (3.7.3)} { <li> Added the [sqlite3_create_function_v2()] interface that includes a destructor callback. <li> The default page cache strives more diligently to avoid using memory beyond what is allocated to it by [SQLITE_CONFIG_PAGECACHE]. Or if using page cache is allocating from the heap, it strives to avoid going over the [sqlite3_soft_heap_limit()], even if [SQLITE_ENABLE_MEMORY_MANAGEMENT] is not set. } chng {2010 August 24 (3.7.2)} { <li> Fix an <a href="http://www.sqlite.org/src/info/5e10420e8d"> old and very obscure bug</a> that can lead to corruption of the database [free-page list] when [incremental_vacuum] is used. } |
︙ | ︙ |
Changes to pages/queryplanner.in.
︙ | ︙ | |||
19 20 21 22 23 24 25 | <p> The best feature of SQL (in <u>all</u> its implementations, not just SQLite) is that it is a <i>declarative</i> language, not a <i>procedural</i> language. When programming in SQL you tell the system <i>what</i> you want to compute, not <i>how</i> to compute it. The task of figuring out the <i>how</i> is delegated to the <i>query planner</i> subsystem within the SQL database engine. | | | | | 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 | <p> The best feature of SQL (in <u>all</u> its implementations, not just SQLite) is that it is a <i>declarative</i> language, not a <i>procedural</i> language. When programming in SQL you tell the system <i>what</i> you want to compute, not <i>how</i> to compute it. The task of figuring out the <i>how</i> is delegated to the <i>query planner</i> subsystem within the SQL database engine. Relieving the programmer from the chore of designing specific query algorithms reduces the workload on the programmer and reduces the number of opportunities for the programmer to make mistakes. </p> <p> Most of the time, the query planner in SQLite does a good job on its own and without outside help. However, the query planner needs indices to work with and it usually falls to the programmer to add indices to the schema that are sufficient for the query planner to accomplish its task. </p> <p> This document is intended to provide programmers who are new to SQL with background information to help them understand what is going on behind the scenes with SQLite, which in turn should make it easier for programmers to create the indices that will help the SQLite query planner to pick the best plans. </p> <h2>1.0 Searching</h2> <h3>1.1 Tables Without Indices</h3> <p> Every table in SQLite consists of zero or more rows with a unique integer key (the [rowid] or [INTEGER PRIMARY KEY]) followed by content. The rows are logically stored in order of increasing rowid. As an example, this |
︙ | ︙ | |||
206 207 208 209 210 211 212 | </tcl> <p> In this case, SQLite still does a single binary search to find the first entry of the index where fruit='Orange'. Then it extracts the rowid from the index and uses that rowid to lookup the original table entry via binary search and output the price from the original table. But instead | | | 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 | </tcl> <p> In this case, SQLite still does a single binary search to find the first entry of the index where fruit='Orange'. Then it extracts the rowid from the index and uses that rowid to lookup the original table entry via binary search and output the price from the original table. But instead of quitting, the database engine then advances to the next row of index to repeat the process for next fruit='Orange' entry. Advancing to the next row of an index (or table) is much less costly than doing a binary search since the next row is often located on the same database page as the current row. In fact, the cost of advancing to the next row is so cheap in comparison to a binary search that we usually ignore it. So our estimate for the total cost of this query is 3 binary searches. If the number of rows of output is K and the number of rows in the table |
︙ | ︙ | |||
359 360 361 362 363 364 365 366 367 368 369 370 371 372 | <p> Hence, a good rule of thumb is that your database schema should never contain two indices where one index is a prefix of the other. Drop the index with fewer columns. SQLite will still be able to do efficient lookups with the longer index. </p> <h3>1.7 Covering Indices</h3> <p> The "price of California oranges" query was made more efficient through the use of a two-column index. But SQLite can do even better with a three-column index that also includes the "price" column: </p> | > | 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 | <p> Hence, a good rule of thumb is that your database schema should never contain two indices where one index is a prefix of the other. Drop the index with fewer columns. SQLite will still be able to do efficient lookups with the longer index. </p> <tcl>hd_fragment covidx {covering index} {covering indices}</tcl> <h3>1.7 Covering Indices</h3> <p> The "price of California oranges" query was made more efficient through the use of a two-column index. But SQLite can do even better with a three-column index that also includes the "price" column: </p> |
︙ | ︙ | |||
454 455 456 457 458 459 460 | by AND, by using an intersect operator in place of union. Many SQL database engines will do just that. But the performance gain over using just a single index is slight and so SQLite does not implement that technique at this time. However, a future version SQLite might be enhanced to support AND-by-INTERSECT. </p> | | | 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 | by AND, by using an intersect operator in place of union. Many SQL database engines will do just that. But the performance gain over using just a single index is slight and so SQLite does not implement that technique at this time. However, a future version SQLite might be enhanced to support AND-by-INTERSECT. </p> <tcl>hd_fragment sorting sorting</tcl> <h2>2.0 Sorting</h2> <p> SQLite (like all other SQL database engines) can also use indices to satisfy the ORDER BY clauses in a query, in addition to expediting lookup. In other words, indices can be used to speed up sorting as well as searching. |
︙ | ︙ | |||
579 580 581 582 583 584 585 586 587 588 | </tcl> <p> With a covering index, SQLite can simply walk the index from one end to the other and deliver the output in time proportional to N and without having allocate a large buffer to hold the result set. </p> <h2>To Be Continued...</h2> | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | < | | 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 | </tcl> <p> With a covering index, SQLite can simply walk the index from one end to the other and deliver the output in time proportional to N and without having allocate a large buffer to hold the result set. </p> <h2>3.0 Searching And Sorting At The Same Time</h2> <p> The previous discussion has treated searching and sorting as separate topics. But in practice, it is often the case that one wants to search and sort at the same time. Fortunately, it is possible to do this using a single index. </p> <h3>3.1 Searching And Sorting With A Multi-Column Index</h3> <p> Suppose we want to find the prices of all kinds of oranges sorted in order of the state where they are grown. The query is this: </p> <tcl> code {SELECT price FROM fruitforsale WHERE fruit='Orange' ORDER BY state} </tcl> <p> The query contains both a search restriction in the WHERE clause and a sort order in the ORDER BY clause. Both the search and the sort can be accomplished at the same time using the two-column index Idx3. </p> <tcl> figure 20 #fig20 fruitobstate0.gif {Search And Sort By Multi-Column Index} </tcl> <p> The query does a binary search on the index to find the subset of rows that have fruit='Orange'. (Because the fruit column is the left-most column of the index and the rows of the index are in sorted order, all such rows will be adjacent.) Then it scans the matching index rows from top to bottom to get the rowids for the original table, and for each rowid does a binary search on the original table to find the price. </p> <p> You will notice that there is no "sort" box anywhere in the above diagram. The ORDER BY clause of the query has become a no-op. No sorting has to be done here because the output order is by the state column and the state column also happens to be the first column after the fruit column in the index. So, if we scan entries of the index that have the same value for the fruit column from top to bottom, those index entries are guaranteed to be ordered by the state column. </p> <tcl>hd_fragment {srchsortcovidx}</tcl> <h3>3.2 Searching And Sorting With A Covering Index</h3> <p> A [covering index] can also be used to search and sort at the same time. Consider the following: </p> <tcl> code {SELECT * FROM fruitforsale WHERE fruit='Orange' ORDER BY state} figure 21 #fig21 fruitobstate.gif {Search And Sort By Covering Index} </tcl> <p> As before, SQLite does single binary search for the range of rows in the covering index that satisfy the WHERE clause, the scans that range from top to bottom to get the desired results. The rows that satisfy the WHERE clause are guaranteed to be adjacent since the WHERE clause is an equality constraint on the left-most column of the index. And by scanning the matching index rows from top to bottom, the output is guaranteed to be order by state since the state column is the very next column to the right of the fruit column. And so the resulting query is very efficient. </p> <p> SQLite can pull a similar trick for a descending ORDER BY: </p> <tcl> code {SELECT * FROM fruitforsale WHERE fruit='Orange' ORDER BY state DESC} </tcl> <p> The same basic algorithm is followed, except this time the matching rows of the index are scanned from bottom to top instead of from top to bottom, so that the states will appear in descending order. </p> <h2>To Be Continued...</h2> <p>Further topics:</p> <p><ul> <li>How to construct a good index <li>Joins and join-order <li>Automatic transient indices <li>How to determine what query plan SQLite is using <li>Automatic detection of inefficient query plans <li>Techniques for influencing what query plan SQLite chooses </ul> </p> |
Changes to pages/rtree.in.
︙ | ︙ | |||
43 44 45 46 47 48 49 | command-line. </p> <h1>3.0 Using the R*Tree Module</h1> <p> The SQLite R*Tree module is implemented as a | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 | command-line. </p> <h1>3.0 Using the R*Tree Module</h1> <p> The SQLite R*Tree module is implemented as a [sqlite3_create_module | virtual table]. ^Each R*Tree index is a virtual table with an odd number of columns between 3 and 11. ^The first column is always a 64-bit signed integer primary key. ^The other columns are minimum- and maximum-value pairs (stored as 32-bit floating point numbers) for each dimension. ^A 1-dimensional R*Tree thus has 3 columns. ^A 2-dimensional R*Tree (the most common case) has 5 columns. ^A 5-dimensional R*Tree has 11 columns. The SQLite R*Tree implementation does not support R*Trees wider than 5 dimensions. </p> <p> ^The first column of an SQLite R*Tree must always be an integer primary key. ^The min/max-value pair columns are always stored as 32-bit floating point values. ^Unlike regular SQLite tables which can store data in a variety of datatypes and formats, the R*Tree indices rigidly enforce these two storage types. ^Attempts to insert something other than an integer into the first column, or something other than a floating point value into the other columns, will result in an error. </p> <h2>3.1 Creating An R*Tree Index</h2> ^(<p> A new R*Tree index is created as follows: </p> <blockquote><pre> CREATE VIRTUAL TABLE <em><name></em> USING rtree(<em><column-names></em>); </pre></blockquote>)^ <p> The <em><name></em> is the name your application chooses for the R*Tree index and <em><column-names></em> is a comma separated list of between 3 and 11 columns. ^(The virtual <name> table creates three "shadow" tables to actually store its content. The names of these shadow tables are: </p> <blockquote> <em><name></em><strong>_node</strong><br> <em><name></em><strong>_rowid</strong><br> <em><name></em><strong>_parent</strong> </blockquote>)^ <p> ^The shadow tables are ordinary SQLite data tables. You can query them directly if you like, though this unlikely to to reveal anything particularly useful. ^And you can [UPDATE], [DELETE], [INSERT] or even [DROP TABLE | DROP] the shadow tables, though doing so will corrupt your R*Tree index. So it is best to simply ignore the shadow tables. Recognize that they are there to hold your R*Tree index information and let it go as that. </p> <p> ^(As an example, consider creating a two-dimensional R*Tree index for use in spatial queries: </p> <blockquote><pre> CREATE VIRTUAL TABLE demo_index USING rtree( id, -- Integer primary key minX, maxX, -- Minimum and maximum X coordinate minY, maxY -- Minimum and maximum Y coordinate ); </pre></blockquote>)^ <h2>3.2 Populating An R*Tree Index</h2> <p> ^The usual [INSERT], [UPDATE], and [DELETE] commands work on an R*Tree index just like on regular tables. ^(So to insert some data into our sample R*Tree index, we can do something like this: </p> <blockquote><pre> INSERT INTO demo_index VALUES( 1, -- Primary key -80.7749, -80.7747, -- Longitude range 30.3776, 30.3778 -- Latitude range ); INSERT INTO demo_index VALUES( 2, -81.0, -79.6, 35.0, 36.2 ); </pre></blockquote>)^ <p> The entries above might represent (for example) a bounding box around the offices for SQLite.org and bounding box around the 12th Congressional District of North Carolina in which SQLite.org is located. </p> <h2>3.3 Querying An R*Tree Index</h2> <p> ^Any valid query will work against an R*Tree index. But the R*Tree implementation is designed to make two kinds of queries especially efficient. ^(First, queries against the primary key are efficient: </p> <blockquote><pre> SELECT * FROM demo_index WHERE id=1; </pre></blockquote>)^ <p> Of course, an ordinary SQLite table will do a query against its integer primary key efficiently, so the previous is no big deal. The real reason for using an R*Tree in the first place is so that you can efficiently do inequality queries against the coordinate ranges. ^(To find all elements of the index that are contained within the vicinity of Charlotte, North Carolina, one might do: </p> <blockquote><pre> SELECT id FROM demo_index WHERE minX>=-81.08 AND maxX<=-80.58 AND minY>=35.00 AND maxY<=35.44; </pre></blockquote>)^ <p> ^The query above would very quickly locate the id of 1 even if the R*Tree contained millions of entries. The previous is an example of a "contained-within" query. The R*Tree also supports "overlapping" queries. ^(For example, to find all bounding boxes that overlap the Charlotte area: </p> <blockquote><pre> SELECT id FROM demo_index WHERE maxX>=-81.08 AND minX<=-80.58 AND maxY>=35.00 AND minY<=35.44; </pre></blockquote>)^ <p> ^(This second query would find both entry 1 (the SQLite.org office) which is entirely contained within the query box and also the 12th Congressional District which extends well outside the query box but still overlaps the query box.)^ </p> <p> ^Note that is not necessary for all coordinates in an R*Tree index to be constrained in order for the index search to be efficient. ^(One might, for example, want to query all objects that overlap with the 35th parallel: </p> <blockquote><pre> SELECT id FROM demo_index WHERE maxY>=35.0 AND minY<=35.0; </pre></blockquote>)^ <p> But, generally speaking, the more constraints that the R*Tree module has to work with, and the smaller the bounding box, the faster the results will come back. </p> <h1>4.0 Using R*Trees Effectively</h1> <p> The only information that an R*Tree index stores about an object is its integer ID and its bounding box. Additional information needs to be stored in separate tables and related to the R*Tree index using the primary key. ^(For the example above, one might create an auxiliary table as follows: </p> <blockquote><pre> CREATE TABLE demo_data( id INTEGER PRIMARY KEY, -- primary key objname TEXT, -- name of the object objtype TEXT, -- object type boundary BLOB -- detailed boundary of object ); </pre></blockquote>)^ <p> In this example, the demo_data.boundary field is intended to hold some kind of binary representation of the precise boundaries of the object. The R*Tree index only holds an axis-aligned rectangular boundary for the object. The R*Tree boundary is just an approximation of the true object boundary. So what typically happens is that the R*Tree index is used to |
︙ | ︙ | |||
247 248 249 250 251 252 253 | </p></blockquote> <p> Suppose the demo_data.boundary field holds some proprietary data description of a complex two-dimensional boundary for an object and suppose that the application has used the [sqlite3_create_function()] interface to created application-defined functions "contained_in" and | | | | | | 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 | </p></blockquote> <p> Suppose the demo_data.boundary field holds some proprietary data description of a complex two-dimensional boundary for an object and suppose that the application has used the [sqlite3_create_function()] interface to created application-defined functions "contained_in" and "overlaps" accepting two demo_data.boundary objects and return true or false. One may assume that "contained_in" and "overlaps" are relatively slow functions that we do not want to invoke too frequently. ^(Then an efficient way to find the name of all objects located within the North Carolina 12th District, one may be to run a query like this: </p> <blockquote><pre> SELECT objname FROM demo_data, demo_index WHERE demo_data.id=demo_index.id AND contained_in(demo_data.boundary, :boundary) AND minX>=-81.0 AND max<=-79.6 AND minY>=35.0 AND maxY<=36.2; </pre></blockquote>)^ <p>In the query above, one would presumably bind the binary BLOB description of the precise boundary of the 12th district to the ":boundary" parameter.</p> <p>Notice how the query above works: The R*Tree index runs in the outer loop to find entries that are contained within the bounding box |
︙ | ︙ |