Documentation Source Text

Check-in [e8152063fb]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Work on the queryplanner.html document.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: e8152063fbd71c6a778e6a636c47848025411391
User & Date: drh 2010-08-12 16:26:36
Context
2010-08-12
17:55
Initial identification of requirements in the fileformat2.html document. check-in: 8925c8c2e1 user: drh tags: trunk
16:26
Work on the queryplanner.html document. check-in: e8152063fb user: drh tags: trunk
14:38
Update the file format documentation for the new 64K page size. Add a caution to the WAL document. Omit annoying echos in the script that removes requirement marks from the documentation. check-in: 0ec9f2cb8f user: drh tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/queryplanner.in.

1
2
3
4
5
6
7
8
..
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

26

27
28
29
30
31
32
33
34
35
36
37
38
39
40
41


42
43
44
45
46
47
48
49
50
51
52
53

54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73





74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
..
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
...
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
...
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
...
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
...
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
...
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
...
264
265
266
267
268
269
270
271
272
273
274
275















276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
...
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
...
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
...
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
...
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
...
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416



















417
418
419
420
421
422
423

424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445


446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
...
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
...
539
540
541
542
543
544
545















<title>How Indexing Works</title>
<tcl>
hd_keywords {indexing} {indexing tutorial}
proc figure {fignum tag img title} {
  hd_puts "<p><center>\n"
  hd_puts "<img src=\"images/qp/$img\" alt=\"figure $fignum\"><br>\n"
  hd_puts "Figure $fignum: $title\n"
  hd_puts "</center></p>\n"
................................................................................
proc code {txt} {
  hd_puts "<center><table><tr><td><pre>\n"
  regsub {^[ \n]*\n} $txt {} txt
  hd_puts [string trimright $txt]\n
  hd_puts "</pre></table></center>\n"
}
</tcl>
<h1 align="center">How Indexing Works</h1>

<p>
The best feature of SQL (in <u>all</u> its implementations, not just SQLite)
is that it is a <i>declarative</i> language, not a <i>procedural</i>
language.  In other words, with SQL you tell the system <i>what</i> you
want to compute, not <i>how</i> to compute it.  The task of figuring out
the <i>how</i> is delegated to the <i>query optimizer</i> or
<i>query planner</i> subsystem within the SQL database engine, significantly

reducing the workload on the programmer and the opportunity for the

programmer to make mistakes.
</p>

<p>
Most of the time, the query planner does a good job without any help
and the programmer does not
have to think about it.  However, the query planner needs indices to
work with and it usually falls to the programmer to add indices to the
schema that are sufficient for the query planner to accomplish its task.
</p>

<p>
This document is intended to provide programmers who are new to SQL
and database with the background information needed to help them determine
the best indices for use on their project.


</p>

<h2>1.0 Querying</h2>

<h3>1.1 Tables Without Indices</h3>

<p>
Every table in SQLite consists of zero or more rows with a unique integer
key (the [rowid] or [INTEGER PRIMARY KEY]) followed by content.  The rows
are logically stored in order of increasing rowid.  As an example, this
note uses a table named "tab" which relates various fruits to the state
where it is grown and its unit price.  The schema is this:

</p>

<tcl>code {
CREATE TABLE tab(
  Fruit TEXT,
  State TEXT,
  Price REAL
);
}</tcl>

<p>
With some (arbitrary) data, such a table might be logically stored on disk
as shown in figure 1:
</p>

<tcl>figure 1 #fig1 tab.gif {Logical Layout Of Table "Tab"}</tcl>

<p>
Key features to notice about this example are that the rows are sorted by
rowid and that the order of the content is arbitrary.





</p>

<p>
Now suppose you want to look up the price of peaches.  The query would
be as follows:
</p>

<tcl>code {
SELECT price FROM tab WHERE fruit='Peach';
}</tcl>

<p>
In order to satisfy this query, SQLite has to read every row out of the
table, check to see if the "fruit" column has the value of "Peach" and if
so, output the "price" column from that row.  The process is illustrated
by <a href="#fig2">figure 2</a> below.
................................................................................
<tcl>figure 2 #fig2 fullscan.gif {Full Table Scan}</tcl>

<h3>1.2 Lookup By Rowid</h3>

<p>
One technique for avoiding a full table scan is to do lookups by
rowid (or by the equivalent INTEGER PRIMARY KEY).   To lookup the
prices of peaches, one would query for the entry with a rowid of 4:
</p>

<tcl>code {
SELECT price FROM tab WHERE rowid=4;
}</tcl>

<p>
Since the information is stored in the table in rowid order, SQLite
can find the correct row using doing a binary search on the rowid.
If the table contains N element, the time required to look up the
desired row is proportional to logN rather than being proportional
................................................................................

<tcl>figure 3 #fig3 rowidlu.gif {Lookup By Rowid}</tcl>

<h3>1.3 Lookup By Index</h3>
<p>
The problem with looking up information by rowid is that you probably
do not care what the price of "item 4" is - you want to know the price
of peaches.  And so a rowid lookup is often not an option.
</p>

<p>
To make the original query more effcient, we can add an index on the
"fruit" column of the "tab" table like this:
</p>

<tcl>code {
CREATE INDEX idx1 ON tab(fruit);
}</tcl>

<p>
An index is another table similar to the original "tab" table
but with the content (the fruit column in this case) stored in front of the
rowid and with all rows in content order.
<a href="#fig4">Figure 4</a> gives a logical view of the Idx1 index.
The "fruit" column is the primary key used to order the elements of the
table and the "rowid" is the secondary key used to break the tie when
two or more rows have the same "fruit".  In the example, the rowid
has to be used as a tie-breaker for the "Orange" rows.
................................................................................

<p>
With the new index in place, we can devise an alternative plan for the
original "Price of Peaches" query.
</p>

<tcl>code {
SELECT price FROM tab WHERE fruit='Peach';
}</tcl>

<p>
The query starts by doing a binary search on the Idx1 index for entries
that have fruit='Peach'.  SQLite can do this binary search on the Idx1 index
but not on the original Tab table because the rows in Idx1 are sorted
by the "fruit" column.  Having found a row in the Idx1 index that has
fruit='Peach', the database engine can extract the rowid for that row,
then do a separate binary search on the original Tab table to find the
original row that contains fruit='Peach'.  From the row in the Tab table,
SQLite can extract the value of the price column.
This procedure is illustrated by <a href="#fig5">figure 5</a>.
</p>

<tcl>figure 5 #fig5 idx1lu1.gif {Indexed Lookup For The Price Of Peaches}</tcl>

<p>
................................................................................
In the previous query the fruit='Peach' constraint narrowed the result
down to a single row.  But the same technique works even if multiple
rows are obtained.  Suppose we looked up the price of Oranges instead 
Peaches:
</p>

<tcl>
code {SELECT price FROM tab WHERE fruit='Orange'}
figure 6 #fig6 idx1lu2.gif {Indexed Lookup For The Price Of Oranges}
</tcl>

<p>
In this case, SQLite still does a single binary search to find the first
entry of the index where fruit='Orange'.  Then it extracts the rowid from
the index and uses that rowid to lookup the original table entry via
................................................................................
<p>
Next, suppose that you want to look up the price of not just any orange,
but specifically California-grown oranges.  The appropriate query would
be as follows:
</p>

<tcl>
code {SELECT price FROM tab WHERE fruit='Orange' AND state='CA'}
figure 7 #fig7 idx1lu3.gif {Indexed Lookup Of California Oranges}
</tcl>

<p>
One approach to this query is to use the fruit='Orange' term of the WHERE
clause to find all rows dealing with oranges, then filter those rows
by rejecting any that are from states other than California.  This
................................................................................

<p>
Suppose that in addition to the index on "fruit" there was also
an index on "state".
</p>

<tcl>
code {CREATE INDEX Idx2 ON tab(state);}
figure 8 #fig8 idx2.gif {Index On The State Column}
</tcl>

<p>
The "state" index works just like the "fruit" index in that it is a
new table with an extra column in front of the rowid and sorted by
that extra column as the primary key.  The only difference is that
................................................................................
</p>

<tcl>figure 9 #fig9 idx2lu1.gif {Indexed Lookup Of California Oranges}</tcl>

<p>
Using Idx2 instead of Idx1 causes SQLite to examine a different set of
rows, but it gets the same answer in the end (which is very important -
remember that indices should never change the answer, only cause us to
get to the answer more quickly) and it does the same amount of work.
So the Idx2 index did not help performance in this case.
</p>
















<h3>1.6 Multi-Column Indices</h3>

<p>
To get the maximum performance out of a query with multiple AND-connected
terms in the WHERE clause, you really want a multi-column index with
columns for each of the AND terms.  In this case we create a new index
on the "fruit" and "state" columns of Tab1:
</p>

<tcl>
code {CREATE INDEX Idx3 ON Tab(fruit, state);}
figure 1 #fig10 idx3.gif {A Two-Column Index}
</tcl>

<p>
A multi-column index follows the same pattern as a single-column index;
the indexed columns are added in front of the rowid.  The only difference
is that now multiple columns are added.  The left-most column is the
................................................................................

<p>
Given the new multi-column Idx3 index, it is now possible for SQLite
to find the price of California oranges using only 2 binary searches:
</p>

<tcl>
code {SELECT price FROM tab WHERE fruit='Orange' AND state='CA'}
figure 11 #fig11 idx3lu1.gif {Lookup Using A Two-Column Index}
</tcl>

<p>
With the Idx3 index on both columns that are constrained by the WHERE clause,
SQLite can do a single binary search against Idx3 to find the one rowid
for California oranges, then do a single binary search to find the price
................................................................................
Note that Idx3 contains all the same information as the original 
<a href="#fig3">Idx1</a>.  And so if we have Idx3, we do not really need Idx1
any more.  The "price of peaches" query can be satisfied using Idx3
by simply ignoring the "state" column of Idx3:
</p>

<tcl>
code {SELECT price FROM tab WHERE fruit='Peach'}
figure 12 #fig12 idx3lu2.gif {Single-Column Lookup On A Multi-Column Index}
</tcl>

<p>
Hence, a good rule of thumb is that your database schema should never
contain two indices where one index is a prefix of the other.  Drop the
index with fewer columns.  SQLite will still be able to do efficient
................................................................................
<p>
The "price of California oranges" query was made more efficient through
the use of a two-column index.  But SQLite can do even better with a
three-column index that also includes the "price" column:
</p>

<tcl>
code {CREATE INDEX Idx4 ON Tab(fruit, state, price);}
figure 13 #fig13 idx4.gif {A Covering Index}
</tcl>

<p>
This new index contains all the columns of the original Tab table that
are used by the query - both the search terms and the output.  We call
this a "covering index".  Because all of the information needed is in
the covering index, SQLite never needs to consult the original table
in order to find the price.
</p>

<tcl>
code {SELECT price FROM tab WHERE fruit='Orange' AND state='CA';}
figure 14 #fig14 idx4lu1.gif {Query Using A Covering Index}
</tcl>

<p>
Hence, by adding extra "output" columns onto the end of an index, one
can avoid having to reference the original table and thereby
cut the number of binary search for a query in half.  This is a
constant-factor improvement in performance (roughly a doubling of
the speed).  But on the other hand, it is also just a refinement;
A two-fold performance increase is not nearly as dramatic as the
one-millon-fold increase seen when the table was first indexed.
And for most queries, the difference between 1 microsecond and
2 microseconds is unlikely to be noticed.
</p>
................................................................................
So Idx3 and Idx4 are helpful when the search is for items that
are both Oranges and grown in California, but neither index would
be that useful if we wanted all items that were either oranges
<i>or</i> are grown in California.
</p>

<tcl>code {
SELECT price FROM Tab1 WHERE fruit='Orange' OR state='CA';
}</tcl>

<p>
When confronted with OR-connected terms in a WHERE clause, SQLite 
examines each OR term separately and tries to use an index to
find the rowids associated with each term.
It then takes the union of the resulting rowid sets to find
................................................................................
<tcl>figure 15 #fig15 orquery.gif {Query With OR Constraints}</tcl>

<p>
The diagram above implies that SQLite computes all of the rowids first
and then combines them with a union operation before starting to do
rowid lookups on the original table.  In reality, the rowid lookups
are interspersed with rowid computations.  SQLite uses one index at
a time to find rowids but it remembers which rowids it has seen
before so as to avoid duplicates.  But that is just an implementation
detail.  The diagram, while not 100% accurate, provides a good
overview of what is happening.
</p>

<p><i>TBD:  (1) Complex subexpressions connected by OR.
(2) What happens if any OR term is not indexable</i></p>




















<h2>2.0 Sorting</h2>

<p>
SQLite (and all other SQL database engines) can use indices to
speed up sorting, to satisfy ORDER BY clauses in queries, in addition
using them for queries.

</p>

<p>
When no appropriate indices are available, a query with an ORDER BY
clause must be sorted as a separate step.  Consider this query:
</p>

<tcl>code {SELECT * FROM tab ORDER BY fruit;}</tcl>

<p>
SQLite processes this by gather the output of partial query that
omits the ORDER BY clause, then running the output through a sorter.
</p>

<tcl>figure 16 #fig16 obfruitnoidx.gif {Sorting Without An Index}</tcl>

<p>
If the number of output rows is K, then the time needed to sort is
proportional to KlogK.  If K is small, the sorting time is usually
not a factor, but in a query such as the above where K==N, the time
needed to sort can be much greater than the time needed to run the
original query.  Furthermore, the entire output set is accumulated 


in memory, which can mean that a lot of memory is required to complete
the query.
</p>

<h3>2.1 Sorting By Rowid</h3>

<p>
Because sorting can be expensive, SQLite works hard to convert ORDER BY
clauses into no-ops.  If SQLite is able to determine that output will
naturally appear in the order specified, then no sorting is done.
So, for example, if you request the output in rowid order, no sorting
will be done:
</p>

<tcl>
code {SELECT * FROM tab ORDER BY rowid;}
figure 17 #fig17 obrowid.gif {Sorting By Rowid}
</tcl>

<p>
You can also request a reverse-order sort like this:
</p>

<tcl>code {SELECT * FROM tab ORDER BY rowid DESC;}</tcl>

<p>
SQLite will still omit the sorting step.  But in order for output to
appear in the correct order, SQLite will do the table scan starting at
the end and working toward the beginning, rather than starting at the
beginning and working toward the end as shown in 
<a href="#fig17">figure 17</a>.
</p>

<h3>2.2 Sorting By Index</h3>

<p>
Of course, ordering the output of a query by rowid is seldom useful.
Usually one wants to order the output by some other column, and so
the natural output order will not be helpful.
</p>

<p>
If an index is available on the ORDER BY column, that index can be used
for sorting.  Consider the request for all items sorted by "fruit":
</p>

<tcl>code {SELECT * FROM tab ORDER BY fruit;}</tcl>

<p>
If the <a href="#fig4">Idx1</a> index is available, then it can be used
to cause the output to appear in sorted order naturally.  The algorithm
would look something like this:
</p>

<tcl>figure 18 #fig18 obfruitidx1.gif {Sorting With An Index}</tcl>

<p>
The Idx1 index is scanned from top to bottom (or from bottom to top if
"ORDER BY fruit DESC" is used) in order to find the rowids for each item
in order by fruit.  Then for each rowid, a binary search is done to lookup
................................................................................
<p>
SQLite uses a cost-based query planner.  When there are two or more ways
of solving the same query, SQLite tries to estimate the total amount of
time needed to run the query using each plan, and then uses the plan with
the lowest estimated cost.  A cost is computed mostly from the estimated
time, and so this case could go either way depending on the table size and
what WHERE clause constraints were available, and so forth.  But generally
speaking, the indexed sort would probably be select, if for no other
reason, because it does not need to accumulate the entire result set in
memory before sorting and thus uses a lot less memory.
</p>

<h3>2.3 Sorting By Covering Index</h3>

<p>
If a covering index can be used for a query, then the multiple rowid looks
can be avoided and the cost of the query drops dramatically.
................................................................................
</tcl>

<p>
With a covering index, SQLite can simply walk the index from one end to the
other and deliver the output in time proportional to N and without having
allocate a large buffer to hold the result set.
</p>















|







 







|




|

|
|
>
|
>




|
|
|






|
|
>
>










|
|
>



|











|


|
|
>
>
>
>
>








|







 







|



|







 







|




|



|



|







 







|





|


|
|







 







|







 







|







 







|







 







|




>
>
>
>
>
>
>
>
>
>
>
>
>
>
>






|



|







 







|







 







|







 







|




|







|






|







 







|







 







|
|
|



|
|
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>




|
|
|
>







|


|
|








|
|
>
>
|







|






|







|













|
<







|
<
<
<
<
<
<







 







|

|







 







>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
1
2
3
4
5
6
7
8
..
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
...
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
...
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
...
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
...
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
...
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
...
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
...
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
...
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
...
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
...
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
...
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
...
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530

531
532
533
534
535
536
537
538






539
540
541
542
543
544
545
...
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
...
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
<title>Query Planning</title>
<tcl>
hd_keywords {indexing} {indexing tutorial}
proc figure {fignum tag img title} {
  hd_puts "<p><center>\n"
  hd_puts "<img src=\"images/qp/$img\" alt=\"figure $fignum\"><br>\n"
  hd_puts "Figure $fignum: $title\n"
  hd_puts "</center></p>\n"
................................................................................
proc code {txt} {
  hd_puts "<center><table><tr><td><pre>\n"
  regsub {^[ \n]*\n} $txt {} txt
  hd_puts [string trimright $txt]\n
  hd_puts "</pre></table></center>\n"
}
</tcl>
<h1 align="center">Query Planning</h1>

<p>
The best feature of SQL (in <u>all</u> its implementations, not just SQLite)
is that it is a <i>declarative</i> language, not a <i>procedural</i>
language.  When programming in SQL you tell the system <i>what</i> you
want to compute, not <i>how</i> to compute it.  The task of figuring out
the <i>how</i> is delegated to the <i>query planner</i> subsystem within 
the SQL database engine.
Releaving the programmer from the chore of designing specific
query algorithms reduces the workload on the programmer and 
reduces the number of opportunities for the
programmer to make mistakes.
</p>

<p>
Most of the time, the query planner in SQLite does a good job on its
own and without outside help.
However, the query planner needs indices to
work with and it usually falls to the programmer to add indices to the
schema that are sufficient for the query planner to accomplish its task.
</p>

<p>
This document is intended to provide programmers who are new to SQL
and databases with the background information to help them understand
what is going on behind the scenes with SQLite, which in turn should make
it easier for programmers to create the indices that will help the SQLite
query planner to pick the best plans.
</p>

<h2>1.0 Querying</h2>

<h3>1.1 Tables Without Indices</h3>

<p>
Every table in SQLite consists of zero or more rows with a unique integer
key (the [rowid] or [INTEGER PRIMARY KEY]) followed by content.  The rows
are logically stored in order of increasing rowid.  As an example, this
article uses a table named "FruitsForSale" which relates various fruits 
to the state
where they are grown and their unit price at market.  The schema is this:
</p>

<tcl>code {
CREATE TABLE FruitsForSale(
  Fruit TEXT,
  State TEXT,
  Price REAL
);
}</tcl>

<p>
With some (arbitrary) data, such a table might be logically stored on disk
as shown in figure 1:
</p>

<tcl>figure 1 #fig1 tab.gif {Logical Layout Of Table "FruitsForSale"}</tcl>

<p>
The key features to notice in this example is that the rowids are not
consecutive but they are ordered.  SQLite usually creates rowids beginning
with one and increasing by one with each added row.  But if rows are 
deleted, gaps can appear in the sequence.  And the application can control
the rowid assigned if desired, so that rows are not necessarily inserted 
at the bottom.  But regardless of what happens, the rowids are always 
unique and in strictly ascending order.
</p>

<p>
Now suppose you want to look up the price of peaches.  The query would
be as follows:
</p>

<tcl>code {
SELECT price FROM fruitsforsale WHERE fruit='Peach';
}</tcl>

<p>
In order to satisfy this query, SQLite has to read every row out of the
table, check to see if the "fruit" column has the value of "Peach" and if
so, output the "price" column from that row.  The process is illustrated
by <a href="#fig2">figure 2</a> below.
................................................................................
<tcl>figure 2 #fig2 fullscan.gif {Full Table Scan}</tcl>

<h3>1.2 Lookup By Rowid</h3>

<p>
One technique for avoiding a full table scan is to do lookups by
rowid (or by the equivalent INTEGER PRIMARY KEY).   To lookup the
price of peaches, one would query for the entry with a rowid of 4:
</p>

<tcl>code {
SELECT price FROM fruitsforsale WHERE rowid=4;
}</tcl>

<p>
Since the information is stored in the table in rowid order, SQLite
can find the correct row using doing a binary search on the rowid.
If the table contains N element, the time required to look up the
desired row is proportional to logN rather than being proportional
................................................................................

<tcl>figure 3 #fig3 rowidlu.gif {Lookup By Rowid}</tcl>

<h3>1.3 Lookup By Index</h3>
<p>
The problem with looking up information by rowid is that you probably
do not care what the price of "item 4" is - you want to know the price
of peaches.  And so a rowid lookup is is not helpful.
</p>

<p>
To make the original query more effcient, we can add an index on the
"fruit" column of the "fruitsforsale" table like this:
</p>

<tcl>code {
CREATE INDEX idx1 ON fruitsforsale(fruit);
}</tcl>

<p>
An index is another table similar to the original "fruitsforsale" table
but with the content (the fruit column in this case) stored in front of the
rowid and with all rows in content order.
<a href="#fig4">Figure 4</a> gives a logical view of the Idx1 index.
The "fruit" column is the primary key used to order the elements of the
table and the "rowid" is the secondary key used to break the tie when
two or more rows have the same "fruit".  In the example, the rowid
has to be used as a tie-breaker for the "Orange" rows.
................................................................................

<p>
With the new index in place, we can devise an alternative plan for the
original "Price of Peaches" query.
</p>

<tcl>code {
SELECT price FROM fruitsforsale WHERE fruit='Peach';
}</tcl>

<p>
The query starts by doing a binary search on the Idx1 index for entries
that have fruit='Peach'.  SQLite can do this binary search on the Idx1 index
but not on the original FruitsForSale table because the rows in Idx1 are sorted
by the "fruit" column.  Having found a row in the Idx1 index that has
fruit='Peach', the database engine can extract the rowid for that row,
then do a separate binary search on the original FruitsForSale table to find the
original row that contains fruit='Peach'.  From the row in the FruitsForSale table,
SQLite can extract the value of the price column.
This procedure is illustrated by <a href="#fig5">figure 5</a>.
</p>

<tcl>figure 5 #fig5 idx1lu1.gif {Indexed Lookup For The Price Of Peaches}</tcl>

<p>
................................................................................
In the previous query the fruit='Peach' constraint narrowed the result
down to a single row.  But the same technique works even if multiple
rows are obtained.  Suppose we looked up the price of Oranges instead 
Peaches:
</p>

<tcl>
code {SELECT price FROM fruitsforsale WHERE fruit='Orange'}
figure 6 #fig6 idx1lu2.gif {Indexed Lookup For The Price Of Oranges}
</tcl>

<p>
In this case, SQLite still does a single binary search to find the first
entry of the index where fruit='Orange'.  Then it extracts the rowid from
the index and uses that rowid to lookup the original table entry via
................................................................................
<p>
Next, suppose that you want to look up the price of not just any orange,
but specifically California-grown oranges.  The appropriate query would
be as follows:
</p>

<tcl>
code {SELECT price FROM fruitsforsale WHERE fruit='Orange' AND state='CA'}
figure 7 #fig7 idx1lu3.gif {Indexed Lookup Of California Oranges}
</tcl>

<p>
One approach to this query is to use the fruit='Orange' term of the WHERE
clause to find all rows dealing with oranges, then filter those rows
by rejecting any that are from states other than California.  This
................................................................................

<p>
Suppose that in addition to the index on "fruit" there was also
an index on "state".
</p>

<tcl>
code {CREATE INDEX Idx2 ON fruitsforsale(state);}
figure 8 #fig8 idx2.gif {Index On The State Column}
</tcl>

<p>
The "state" index works just like the "fruit" index in that it is a
new table with an extra column in front of the rowid and sorted by
that extra column as the primary key.  The only difference is that
................................................................................
</p>

<tcl>figure 9 #fig9 idx2lu1.gif {Indexed Lookup Of California Oranges}</tcl>

<p>
Using Idx2 instead of Idx1 causes SQLite to examine a different set of
rows, but it gets the same answer in the end (which is very important -
remember that indices should never change the answer, only help SQLite to
get to the answer more quickly) and it does the same amount of work.
So the Idx2 index did not help performance in this case.
</p>

<p>
The last two queries take the same amount of time, in our example.
So which index, Idx1 or Idx2, will SQLite choose?  If the
[ANALYZE] command has been run on the database, so that SQLite has
had an opportunity to gather statistics about the available indices,
then SQLite will know that the Idx1 table usually narrows the search
down to a single item (our example of fruit='Orange' is the exception
to this rule) where as the Idx table will normally only narrow the 
search down to two rows.  So, if all else is equal, SQLite will
choose Idx1 with the hope of narrowing the search to as small
a number of rows as possible.  This choice is only possible because
of the statistics provided by [ANALYZE].  If [ANALYZE] has not been
run then the choice of which index to use is arbitrary.
</p>

<h3>1.6 Multi-Column Indices</h3>

<p>
To get the maximum performance out of a query with multiple AND-connected
terms in the WHERE clause, you really want a multi-column index with
columns for each of the AND terms.  In this case we create a new index
on the "fruit" and "state" columns of FruitsForSale:
</p>

<tcl>
code {CREATE INDEX Idx3 ON FruitsForSale(fruit, state);}
figure 1 #fig10 idx3.gif {A Two-Column Index}
</tcl>

<p>
A multi-column index follows the same pattern as a single-column index;
the indexed columns are added in front of the rowid.  The only difference
is that now multiple columns are added.  The left-most column is the
................................................................................

<p>
Given the new multi-column Idx3 index, it is now possible for SQLite
to find the price of California oranges using only 2 binary searches:
</p>

<tcl>
code {SELECT price FROM fruitsforsale WHERE fruit='Orange' AND state='CA'}
figure 11 #fig11 idx3lu1.gif {Lookup Using A Two-Column Index}
</tcl>

<p>
With the Idx3 index on both columns that are constrained by the WHERE clause,
SQLite can do a single binary search against Idx3 to find the one rowid
for California oranges, then do a single binary search to find the price
................................................................................
Note that Idx3 contains all the same information as the original 
<a href="#fig3">Idx1</a>.  And so if we have Idx3, we do not really need Idx1
any more.  The "price of peaches" query can be satisfied using Idx3
by simply ignoring the "state" column of Idx3:
</p>

<tcl>
code {SELECT price FROM fruitsforsale WHERE fruit='Peach'}
figure 12 #fig12 idx3lu2.gif {Single-Column Lookup On A Multi-Column Index}
</tcl>

<p>
Hence, a good rule of thumb is that your database schema should never
contain two indices where one index is a prefix of the other.  Drop the
index with fewer columns.  SQLite will still be able to do efficient
................................................................................
<p>
The "price of California oranges" query was made more efficient through
the use of a two-column index.  But SQLite can do even better with a
three-column index that also includes the "price" column:
</p>

<tcl>
code {CREATE INDEX Idx4 ON FruitsForSale(fruit, state, price);}
figure 13 #fig13 idx4.gif {A Covering Index}
</tcl>

<p>
This new index contains all the columns of the original FruitsForSale table that
are used by the query - both the search terms and the output.  We call
this a "covering index".  Because all of the information needed is in
the covering index, SQLite never needs to consult the original table
in order to find the price.
</p>

<tcl>
code {SELECT price FROM fruitsforsale WHERE fruit='Orange' AND state='CA';}
figure 14 #fig14 idx4lu1.gif {Query Using A Covering Index}
</tcl>

<p>
Hence, by adding extra "output" columns onto the end of an index, one
can avoid having to reference the original table and thereby
cut the number of binary searchs for a query in half.  This is a
constant-factor improvement in performance (roughly a doubling of
the speed).  But on the other hand, it is also just a refinement;
A two-fold performance increase is not nearly as dramatic as the
one-millon-fold increase seen when the table was first indexed.
And for most queries, the difference between 1 microsecond and
2 microseconds is unlikely to be noticed.
</p>
................................................................................
So Idx3 and Idx4 are helpful when the search is for items that
are both Oranges and grown in California, but neither index would
be that useful if we wanted all items that were either oranges
<i>or</i> are grown in California.
</p>

<tcl>code {
SELECT price FROM FruitsForSale WHERE fruit='Orange' OR state='CA';
}</tcl>

<p>
When confronted with OR-connected terms in a WHERE clause, SQLite 
examines each OR term separately and tries to use an index to
find the rowids associated with each term.
It then takes the union of the resulting rowid sets to find
................................................................................
<tcl>figure 15 #fig15 orquery.gif {Query With OR Constraints}</tcl>

<p>
The diagram above implies that SQLite computes all of the rowids first
and then combines them with a union operation before starting to do
rowid lookups on the original table.  In reality, the rowid lookups
are interspersed with rowid computations.  SQLite uses one index at
a time to find rowids while remembering which rowids it has seen
before so as to avoid duplicates.  That is just an implementation
detail, though.  The diagram, while not 100% accurate, provides a good
overview of what is happening.
</p>

<p>
In order for the OR-by-UNION technique shown above to be useful, there
must be an index available that helps resolve every OR-connected term
in the WHERE clause.  If even a single OR-connected term is not indexed,
then a full table scan would have to be done in order to find the rowids
generated by the one term, and if SQLite has to do a full table scan, it
might as well do it on the original table and get all of the results in
a single pass without having to mess with union operations and follow-on
binary searches.
</p>

<p>
One can see how the OR-by-UNION technique could also be leveraged to
use multiple indices on query where the WHERE clause has terms connected
by AND, by using an intersect operator in place of union.  Many SQL
database engines will do just that.  But the performance gain over using
just a single index is slight and so SQLite does not implement that technique
at this time.  However, a future version SQLite might be enhanced to support
AND-by-INTERSECT.
</p>


<h2>2.0 Sorting</h2>

<p>
SQLite (like all other SQL database engines) can also use indices to
satisfy the ORDER BY clauses in a query, in addition to expediting
lookup.  In other words, indices can be used to speed up sorting as
well as searching.
</p>

<p>
When no appropriate indices are available, a query with an ORDER BY
clause must be sorted as a separate step.  Consider this query:
</p>

<tcl>code {SELECT * FROM fruitsforsale ORDER BY fruit;}</tcl>

<p>
SQLite processes this by gather all the output of query and then
running that output through a sorter.
</p>

<tcl>figure 16 #fig16 obfruitnoidx.gif {Sorting Without An Index}</tcl>

<p>
If the number of output rows is K, then the time needed to sort is
proportional to KlogK.  If K is small, the sorting time is usually
not a factor, but in a query such as the above where K==N, the time
needed to sort can be much greater than the time needed to do a
full table scan.  Furthermore, the entire output is accumulated in
temporary storage (which might be either in main memory or on disk,
depending on various compile-time and run-time settings)
which can mean that a lot of temporary storage is required to complete
the query.
</p>

<h3>2.1 Sorting By Rowid</h3>

<p>
Because sorting can be expensive, SQLite works hard to convert ORDER BY
clauses into no-ops.  If SQLite determines that output will
naturally appear in the order specified, then no sorting is done.
So, for example, if you request the output in rowid order, no sorting
will be done:
</p>

<tcl>
code {SELECT * FROM fruitsforsale ORDER BY rowid;}
figure 17 #fig17 obrowid.gif {Sorting By Rowid}
</tcl>

<p>
You can also request a reverse-order sort like this:
</p>

<tcl>code {SELECT * FROM fruitsforsale ORDER BY rowid DESC;}</tcl>

<p>
SQLite will still omit the sorting step.  But in order for output to
appear in the correct order, SQLite will do the table scan starting at
the end and working toward the beginning, rather than starting at the
beginning and working toward the end as shown in 
<a href="#fig17">figure 17</a>.
</p>

<h3>2.2 Sorting By Index</h3>

<p>
Of course, ordering the output of a query by rowid is seldom useful.
Usually one wants to order the output by some other column.

</p>

<p>
If an index is available on the ORDER BY column, that index can be used
for sorting.  Consider the request for all items sorted by "fruit":
</p>

<tcl>code {SELECT * FROM fruitsforsale ORDER BY fruit;}</tcl>







<tcl>figure 18 #fig18 obfruitidx1.gif {Sorting With An Index}</tcl>

<p>
The Idx1 index is scanned from top to bottom (or from bottom to top if
"ORDER BY fruit DESC" is used) in order to find the rowids for each item
in order by fruit.  Then for each rowid, a binary search is done to lookup
................................................................................
<p>
SQLite uses a cost-based query planner.  When there are two or more ways
of solving the same query, SQLite tries to estimate the total amount of
time needed to run the query using each plan, and then uses the plan with
the lowest estimated cost.  A cost is computed mostly from the estimated
time, and so this case could go either way depending on the table size and
what WHERE clause constraints were available, and so forth.  But generally
speaking, the indexed sort would probably be chosen, if for no other
reason, because it does not need to accumulate the entire result set in
temporary before sorting and thus uses much less temporary storage.
</p>

<h3>2.3 Sorting By Covering Index</h3>

<p>
If a covering index can be used for a query, then the multiple rowid looks
can be avoided and the cost of the query drops dramatically.
................................................................................
</tcl>

<p>
With a covering index, SQLite can simply walk the index from one end to the
other and deliver the output in time proportional to N and without having
allocate a large buffer to hold the result set.
</p>

<h2>To Be Continued...</h2>

<p>Additional topics to discuss include:</b></p>

<p><ul>
<li>Search and sorting at the same time
<li>How to construct a good index
<li>Joins and join-order
<li>Automatic transient indices
<li>How to determine what query plan SQLite is using
<li>Automatic detection of inefficient query plans
<li>Techniques for influcing what query plan SQLite chooses
</ul>
</p>