Query Optimizer Changes Meaning of Query

(1.1) By Qingyao Sun (nalzok) on 2020-08-06 09:06:13 edited from 1.0 [source]

Say I have a list of natural numbers from 0 to 99

CREATE TABLE IF NOT EXISTS natural_numbers AS
WITH RECURSIVE natural_numbers(x) AS (
    SELECT 0
    UNION
    SELECT x + 1
    FROM natural_numbers
    LIMIT 100
)
SELECT x
FROM natural_numbers;

I want to draw 50 random numbers to be "lucky numbers", and count how many lucky numbers there are in the table natural_numbers. Here is the query

WITH lucky_numbers(lucky_number) AS (
    SELECT x
    FROM natural_numbers
    ORDER BY RANDOM()
    LIMIT 50
)
SELECT SUM(EXISTS(
        SELECT 1
        FROM lucky_numbers
        WHERE x = +lucky_number  -- disable automatic index on lucky_numbers(lucky_number)
))
FROM natural_numbers;

The result is a random variable bouncing around 50 (you may need to re-run the query multiple times to see the effect). This can be surprising to some, but it's actually documented

An ordinary common table expression works as if it were a view that exists for the duration of a single statement.

Since lucky_numbers is merely a view, or a pre-packaged SELECT statement, it is naturally re-run for each row in natural_numbers. Consequently, each of the 100 natural numbers has an independent 50% chance of being lucky.

Now, here is the bug I'm reporting： if you replace the clause WHERE x = +lucky_number with WHERE x = lucky_number, the result will always be 50, which means the AUTOMATIC COVERING INDEX (lucky_number=?) added by the query optimizer changes the outcome of a query. I suspect this is because the common table expression lucky_numbers is no longer re-run for each row, but instead cached in a covering index.

(2) By Keith Medcalf (kmedcalf) on 2020-08-06 07:57:30 in reply to 1.0 [link] [source]

The current tip of trunk appears to already do this.

(3) By Gunter Hick (gunter_hick) on 2020-08-06 07:57:39 in reply to 1.0 [link] [source]

I find the opposite effect:

asql> explain query plan with lucky(lucky) as (select val from numbers order by random() limit 50) select sum(exists(select 1 from lucky where val = lucky)) from numbers;
id    parent         notu  deta
----  -------------  ----  ----
3     0              0     SCAN TABLE numbers
5     0              0     CORRELATED SCALAR SUBQUERY
8     5              0     CO-ROUTINE 0x19668760
12    8              0     SCAN TABLE numbers
24    8              0     USE TEMP B-TREE FOR ORDER BY
41    5              0     SEARCH SUBQUERY 0x19668760 USING AUTOMATIC COVERING INDEX (lucky=?)
asql> explain query plan with lucky(lucky) as (select val from numbers order by random() limit 50) select sum(exists(select 1 from lucky where val = +lucky)) from numbers;
id    parent         notu  deta
----  -------------  ----  ----
3     0              0     SCAN TABLE numbers
5     0              0     CORRELATED SCALAR SUBQUERY
8     5              0     CO-ROUTINE 0x19667CE0
12    8              0     SCAN TABLE numbers
24    8              0     USE TEMP B-TREE FOR ORDER BY
31    5              0     SCAN SUBQUERY 0x19667CE0

https://sqlite.org/optoverview.html states

"Terms of the WHERE clause can be manually disqualified for use with indices by prepending a unary *+* operator to the column name."

(4) By Qingyao Sun (nalzok) on 2020-08-06 09:08:29 in reply to 3 [link] [source]

Oops, my bad! The queries didn't match their respective results because I was clueless. The original post has been edited now.

(5) By Qingyao Sun (nalzok) on 2020-08-06 09:11:24 in reply to 2 [link] [source]

I am using

3.31.1 2020-01-27 19:55:54 3bfa9cc97da10598521b342961df8f5f68c7388fa117345eeb516eaa837bb4d6

So I'd say the current tip of trunk appears to still do this :)

(6) By anonymous on 2020-08-06 17:08:57 in reply to 1.1 [link] [source]

It should be possible to make the query optimizer to automatically figure out whether or not a view is deterministic. However, this doesn't work for virtual tables, unless it is enhanced by adding a SQLITE_INDEX_SCAN_DETERMINISTIC flag (or a SQLITE_VTAB_DETERMINISTIC option).

(I think there are other messages on this forum describing proposed improvement of virtual table.)

(7) By Richard Hipp (drh) on 2020-08-06 21:48:52 in reply to 1.1 [link] [source]

The query

   SELECT x FROM natural_numbers ORDER BY random() LIMIT 50;

Returns a table of 50 distinct random numbers. The question then is, if that query is really a subquery in a larger SQL statement, and the result set is used multiple times within the outer query, is the SQL engine compelled to rerun the subquery, or is it allowed to cache the results of the first run and reuse them.

I contend that the query planner is free to do it either way. It can either cache the results of the first run and reuse them. Or it can rerun the query. Or (if it wanted to) it could cache the results and use them two or three times and then rerun the query and use those results two or three times, then run the subquery again, and so forth.

In other words, the programmer cannot make any assumptions about whether or not the results of a (non-correlated) subquery are cached and reused or if the subquery is run multiple times. The SQL engine is free to do whatever it wants. Any query that depends on whether or not subquery results are cached returns unpredictable output. It is akin to running a query without an ORDER BY clause - the engine is free to return the results in any order it wants. Just because it returns the results in the order you want today does not mean it will continue to do so tomorrow.

Constructing a query like this that depends on whether or not subquery results are cached is like writing a multiple-threaded program that depends on the order in which the threads are executed. The output can vary from one run to the next. Don't do that.

I checked with the PostgreSQL developers and am told that PostgreSQL works the same way. The equivalent query in PostgreSQL might cache and reuse the subquery results, or it might rerun the subquery. You never know.

As currently implemented, SQLite only caches the subquery results if it wants to build an automatic index. But that decision might change tomorrow, so you cannot depend on it.

(8) By Qingyao Sun (nalzok) on 2020-08-07 02:56:52 in reply to 7 [link] [source]

In other words, the programmer cannot make any assumptions about whether or not the results of a (non-correlated) subquery are cached and reused or if the subquery is run multiple times

Thanks for the great explanation! I actually have a follow-up question about how SQLite decides if a subquery is correlated or not, but let's discuss that in a new thread.