Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Clarifications and typo-fixes in the NGQP document. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
5cbddcc9456089e4edc3060acb789d1d |
User & Date: | drh 2013-08-07 00:21:03.497 |
Context
2013-08-07
| ||
15:39 | Merge the changes from the 3.7.17 branch into trunk. (check-in: 69c4153640 user: drh tags: trunk) | |
00:21 | Clarifications and typo-fixes in the NGQP document. (check-in: 5cbddcc945 user: drh tags: trunk) | |
2013-08-03
| ||
18:42 | Always publish the most recent changes in releaselog/current.html. (check-in: 27f4c1dbbc user: drh tags: trunk) | |
Changes
Changes to pages/queryplanner-ng.in.
︙ | ︙ | |||
48 49 50 51 52 53 54 | But for larger and more complex queries, such as multi-way joins with many indices and subqueries, there can be hundreds, thousands, or millions of reasonable algorithms for computing the result. The job of the query planner is to chose a single "best" query plan from this multitude of possibilities.</p> | < | | 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 | But for larger and more complex queries, such as multi-way joins with many indices and subqueries, there can be hundreds, thousands, or millions of reasonable algorithms for computing the result. The job of the query planner is to chose a single "best" query plan from this multitude of possibilities.</p> <p>Query planner are what make SQL database engines so amazingly useful and powerful. (This is true of all SQL database engines, not just SQLite.) The query planner frees the programmer from the chore of selecting a particular query plan, and thereby allows the programmer to focus more mental energy on higher-level application issues and on providing more value to the end user. For simple queries where the choice of query plan is obvious, this is convenient but not hugely important. But as applications and schemas and queries grow more complex, a |
︙ | ︙ | |||
77 78 79 80 81 82 83 | find the correct solution often enough that application programmers only rarely need to get involved.</p> <h3>2.1 Query Planning In SQLite</h3> <p>SQLite computes joins using nested loops, one loop for each table | | | > | | | | 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 | find the correct solution often enough that application programmers only rarely need to get involved.</p> <h3>2.1 Query Planning In SQLite</h3> <p>SQLite computes joins using nested loops, one loop for each table in the join. (Additional loops might be inserted for IN and OR operators in the WHERE clause. SQLite considers those too, but for simplicity we will ignore them in this essay.) One or more indices might be used on each loop to speed the search, or a loop might be a "full table scan" that reads every row in the table. Thus query planning decomposes into two subtasks:</p> <ol> <li> Picking the nested order of the various loops <li> Choosing good indices for each loop </ol> <p>Picking the nesting order is generally the more challenging problem. Once the nesting order of the join is established, the choice of indices for each loop is normally obvious.</p> <tcl>hd_fragment qpstab {query planner stability guarantee}</tcl> <h3>2.2 The SQLite Query Planner Stability Guarantee</h3> <p>SQLite will always pick the same query plan for any given SQL statement as long as: <ol type="a"> <li>the database schema does not change in significant ways such as adding or dropping indices,</li> <li>the ANALYZE command is not rerun, </li> <li>SQLite is not compiled with [SQLITE_ENABLE_STAT3], and</li> <li>the same version of SQLite is used.</li> </ol> The SQLite stability guarantee means that if all of your queries run efficiently during testing, and if your application does not change the schema, then SQLite will not suddenly decide to start using a different |
︙ | ︙ | |||
129 130 131 132 133 134 135 | in an embedded database like SQLite, and hence SQLite is careful to ensure that plans do not change unexpectedly after deployment.</p> <p>The SQLite stability guarantee applies equally to the legacy query planner and to the NGQP.</p> <p>It is important to note that changing versions of SQLite might cause | | | | 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 | in an embedded database like SQLite, and hence SQLite is careful to ensure that plans do not change unexpectedly after deployment.</p> <p>The SQLite stability guarantee applies equally to the legacy query planner and to the NGQP.</p> <p>It is important to note that changing versions of SQLite might cause changes in query plans. The same version of SQLite it will always pick the same query plan, but if you relink your application to use a different version of SQLite, then query plans might change. In rare cases, an SQLite version change might lead to a performance regression. This is one reason you should consider statically linking your applications against SQLite rather than use a system-wide SQLite shared library which might change without your knowledge or control.</p> <h2>3.0 A Difficult Case</h2> <p> "TPC-H Q8" is a test query from the <a href="http://www.tpc.org/tpch/">Transaction Processing Performance |
︙ | ︙ | |||
191 192 193 194 195 196 197 | being outside of it.</p> <p>The arrows from the small circles labeled with "*" indicate the cost of running each loop with no dependencies. The outer loop must use this *-cost. Inner loops have the option of using the *-cost or a cost assuming one of the other terms is in an outer loop, whichever gives the best result. One can think of the *-costs as a short-hand notation indicating | | | 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 | being outside of it.</p> <p>The arrows from the small circles labeled with "*" indicate the cost of running each loop with no dependencies. The outer loop must use this *-cost. Inner loops have the option of using the *-cost or a cost assuming one of the other terms is in an outer loop, whichever gives the best result. One can think of the *-costs as a short-hand notation indicating multiple arcs, one from each of the other nodes in the graph. The graph is is therefore "complete", meaning that there are arcs (some explicit and some implied) in both directions between every pair of nodes in the graph.</p> <p>The problem of finding the best query plan is equivalent to finding a minimum-cost path through the graph that visits each node exactly once.</p> |
︙ | ︙ | |||
302 303 304 305 306 307 308 | N2 (cost: 5.52) <br> P (cost: 7.71) <br> </blockquote> <p>The second step finds the four shortest paths to visit two nodes beginning with one of the four paths from the previous step. In the case where two or more paths are equivalent (they have the same set of | | | 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 | N2 (cost: 5.52) <br> P (cost: 7.71) <br> </blockquote> <p>The second step finds the four shortest paths to visit two nodes beginning with one of the four paths from the previous step. In the case where two or more paths are equivalent (they have the same set of visited nodes, though possibly in a different order) only the first and lowest-cost path is retained. We have:</p> <blockquote> R-N1 (cost: 7.03) <br> R-N2 (cost: 9.08) <br> N2-N1 (cost: 11.04) <br> R-P {cost: 11.27} <br> |
︙ | ︙ | |||
331 332 333 334 335 336 337 | times. In the general case of a K-way join, the storage requirements is O(N) and the computation time is O(K*N), which is significantly faster than the O(2<small><sup>K</sup></small>) exact solution.</p> <p>But what value to choose for N? One might try N=K. This makes the algorithm O(K<small><sup>2</sup></small>) which is actually still quite efficient, since the | | | 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 | times. In the general case of a K-way join, the storage requirements is O(N) and the computation time is O(K*N), which is significantly faster than the O(2<small><sup>K</sup></small>) exact solution.</p> <p>But what value to choose for N? One might try N=K. This makes the algorithm O(K<small><sup>2</sup></small>) which is actually still quite efficient, since the maximum value of K is 64 and K rarely exceeds 10. But that is not enough for the TPC-H Q8 problem. With N=8 on TPC-H Q8 the N3 algorithm finds the solution R-N1-C-O-L-S-N2-P with a cost of 29.78. That is a big improvement over NN, but it is still not optimal. N3 finds the optimal solution for TPC-H Q8 when N is 10 or greater.</p> |
︙ | ︙ | |||
456 457 458 459 460 461 462 | <li> The check-in has a "trunk" tag. <li> The check-in has a child that has a "trunk" tag. <li> The check-in has a parent that has a "trunk" tag. </ol></p> <p> The first condition causes all of the trunk check-ins to be displayed and | | | 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 | <li> The check-in has a "trunk" tag. <li> The check-in has a child that has a "trunk" tag. <li> The check-in has a parent that has a "trunk" tag. </ol></p> <p> The first condition causes all of the trunk check-ins to be displayed and the second and third cause check-ins that merge into or fork from the trunk to also be included. The three conditions are implemented by the three OR-connected EXISTS statements in the WHERE clause of the query. The slowdown that occurred with the NGQP was caused by the second and third conditions. The problem is the same in each, so we will examine just the second one. The subquery of the second condition can be rewritten (with minor |
︙ | ︙ | |||
672 673 674 675 676 677 678 | and comment the use of CROSS JOIN carefully so that you can take it out later if possible. Avoid using CROSS JOIN early in the development cycle as doing so is a premature optimization, which is well known to be <a href="http://c2.com/cgi/wiki?PrematureOptimization">the root of all evil</a>.</p> <li><p><b>Use unary "+" operators to disqualify WHERE clause terms.</b> | | | 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 | and comment the use of CROSS JOIN carefully so that you can take it out later if possible. Avoid using CROSS JOIN early in the development cycle as doing so is a premature optimization, which is well known to be <a href="http://c2.com/cgi/wiki?PrematureOptimization">the root of all evil</a>.</p> <li><p><b>Use unary "+" operators to disqualify WHERE clause terms.</b> If the query planner insists on selecting a poor-quality index for a particular query when a much higher-quality index is available, then [upluscontrol | careful use of unary "+" operators] in the WHERE clause can force the query planner away from the poor-quality index. Avoid using this trick if at all possible, and especially avoid it early in the application development cycle. Beware that adding a unary "+" operator to an equality expression might change the result of that expression if |
︙ | ︙ |