/ Check-in [3e0590de]
Login
SQLite training in Houston TX on 2019-11-05 (details)
Part of the 2019 Tcl Conference

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:When estimating the number of rows visited by a range scan for which the keys consist of more than one field, consider prefixes of stat4 samples as well as the full samples. This generates more accurate estimates.
Downloads: Tarball | ZIP archive | SQL archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 3e0590dee0e68cc1599858757c650a7378026170
User & Date: dan 2015-03-16 16:28:43
References
2017-05-24
04:32
Fix a problem in STAT4 equality estimation for multi-column indexes introduced by check-in [3e0590dee0e68cc1599]. check-in: 19dad0a7 user: drh tags: branch-3.8.9
04:18
Fix a problem in STAT4 equality estimation for multi-column indexes introduced by check-in [3e0590dee0e68cc1599]. check-in: cfb0d9e0 user: drh tags: trunk
Context
2015-03-16
16:44
When deleting the master journal to commit a multi-database transaction, do not sync the directory if PRAGMA synchronous=OFF for all participating database files. check-in: 018d7671 user: drh tags: trunk
16:28
When estimating the number of rows visited by a range scan for which the keys consist of more than one field, consider prefixes of stat4 samples as well as the full samples. This generates more accurate estimates. check-in: 3e0590de user: dan tags: trunk
13:48
Use #ifdef to omit code that is only used for STAT3 and STAT4. check-in: f2c9c5b5 user: drh tags: trunk
09:21
Another test case for the planner change on this branch. Closed-Leaf check-in: f2207a06 user: dan tags: stat4-change
Changes
Hide Diffs Side-by-Side Diffs Ignore Whitespace Patch

Changes to src/where.c.

  1927   1927   #endif /* !defined(SQLITE_OMIT_VIRTUALTABLE) */
  1928   1928   
  1929   1929   #ifdef SQLITE_ENABLE_STAT3_OR_STAT4
  1930   1930   /*
  1931   1931   ** Estimate the location of a particular key among all keys in an
  1932   1932   ** index.  Store the results in aStat as follows:
  1933   1933   **
  1934         -**    aStat[0]      Est. number of rows less than pVal
  1935         -**    aStat[1]      Est. number of rows equal to pVal
         1934  +**    aStat[0]      Est. number of rows less than pRec
         1935  +**    aStat[1]      Est. number of rows equal to pRec
  1936   1936   **
  1937   1937   ** Return the index of the sample that is the smallest sample that
  1938         -** is greater than or equal to pRec.
         1938  +** is greater than or equal to pRec. Note that this index is not an index
         1939  +** into the aSample[] array - it is an index into a virtual set of samples
         1940  +** based on the contents of aSample[] and the number of fields in record 
         1941  +** pRec. 
  1939   1942   */
  1940   1943   static int whereKeyStats(
  1941   1944     Parse *pParse,              /* Database connection */
  1942   1945     Index *pIdx,                /* Index to consider domain of */
  1943   1946     UnpackedRecord *pRec,       /* Vector of values to consider */
  1944   1947     int roundUp,                /* Round up if true.  Round down if false */
  1945   1948     tRowcnt *aStat              /* OUT: stats written here */
  1946   1949   ){
  1947   1950     IndexSample *aSample = pIdx->aSample;
  1948   1951     int iCol;                   /* Index of required stats in anEq[] etc. */
         1952  +  int i;                      /* Index of first sample >= pRec */
         1953  +  int iSample;                /* Smallest sample larger than or equal to pRec */
  1949   1954     int iMin = 0;               /* Smallest sample not yet tested */
  1950         -  int i = pIdx->nSample;      /* Smallest sample larger than or equal to pRec */
  1951   1955     int iTest;                  /* Next sample to test */
  1952   1956     int res;                    /* Result of comparison operation */
         1957  +  int nField;                 /* Number of fields in pRec */
         1958  +  tRowcnt iLower = 0;         /* anLt[] + anEq[] of largest sample pRec is > */
  1953   1959   
  1954   1960   #ifndef SQLITE_DEBUG
  1955   1961     UNUSED_PARAMETER( pParse );
  1956   1962   #endif
  1957   1963     assert( pRec!=0 );
  1958         -  iCol = pRec->nField - 1;
  1959   1964     assert( pIdx->nSample>0 );
  1960         -  assert( pRec->nField>0 && iCol<pIdx->nSampleCol );
         1965  +  assert( pRec->nField>0 && pRec->nField<=pIdx->nSampleCol );
         1966  +
         1967  +  /* Do a binary search to find the first sample greater than or equal
         1968  +  ** to pRec. If pRec contains a single field, the set of samples to search
         1969  +  ** is simply the aSample[] array. If the samples in aSample[] contain more
         1970  +  ** than one fields, all fields following the first are ignored.
         1971  +  **
         1972  +  ** If pRec contains N fields, where N is more than one, then as well as the
         1973  +  ** samples in aSample[] (truncated to N fields), the search also has to
         1974  +  ** consider prefixes of those samples. For example, if the set of samples
         1975  +  ** in aSample is:
         1976  +  **
         1977  +  **     aSample[0] = (a, 5) 
         1978  +  **     aSample[1] = (a, 10) 
         1979  +  **     aSample[2] = (b, 5) 
         1980  +  **     aSample[3] = (c, 100) 
         1981  +  **     aSample[4] = (c, 105)
         1982  +  **
         1983  +  ** Then the search space should ideally be the samples above and the 
         1984  +  ** unique prefixes [a], [b] and [c]. But since that is hard to organize, 
         1985  +  ** the code actually searches this set:
         1986  +  **
         1987  +  **     0: (a) 
         1988  +  **     1: (a, 5) 
         1989  +  **     2: (a, 10) 
         1990  +  **     3: (a, 10) 
         1991  +  **     4: (b) 
         1992  +  **     5: (b, 5) 
         1993  +  **     6: (c) 
         1994  +  **     7: (c, 100) 
         1995  +  **     8: (c, 105)
         1996  +  **     9: (c, 105)
         1997  +  **
         1998  +  ** For each sample in the aSample[] array, N samples are present in the
         1999  +  ** effective sample array. In the above, samples 0 and 1 are based on 
         2000  +  ** sample aSample[0]. Samples 2 and 3 on aSample[1] etc.
         2001  +  **
         2002  +  ** Often, sample i of each block of N effective samples has (i+1) fields.
         2003  +  ** Except, each sample may be extended to ensure that it is greater than or
         2004  +  ** equal to the previous sample in the array. For example, in the above, 
         2005  +  ** sample 2 is the first sample of a block of N samples, so at first it 
         2006  +  ** appears that it should be 1 field in size. However, that would make it 
         2007  +  ** smaller than sample 1, so the binary search would not work. As a result, 
         2008  +  ** it is extended to two fields. The duplicates that this creates do not 
         2009  +  ** cause any problems.
         2010  +  */
         2011  +  nField = pRec->nField;
         2012  +  iCol = 0;
         2013  +  iSample = pIdx->nSample * nField;
  1961   2014     do{
  1962         -    iTest = (iMin+i)/2;
  1963         -    res = sqlite3VdbeRecordCompare(aSample[iTest].n, aSample[iTest].p, pRec);
         2015  +    int iSamp;                    /* Index in aSample[] of test sample */
         2016  +    int n;                        /* Number of fields in test sample */
         2017  +
         2018  +    iTest = (iMin+iSample)/2;
         2019  +    iSamp = iTest / nField;
         2020  +    if( iSamp>0 ){
         2021  +      /* The proposed effective sample is a prefix of sample aSample[iSamp].
         2022  +      ** Specifically, the shortest prefix of at least (1 + iTest%nField) 
         2023  +      ** fields that is greater than the previous effective sample.  */
         2024  +      for(n=(iTest % nField) + 1; n<nField; n++){
         2025  +        if( aSample[iSamp-1].anLt[n-1]!=aSample[iSamp].anLt[n-1] ) break;
         2026  +      }
         2027  +    }else{
         2028  +      n = iTest + 1;
         2029  +    }
         2030  +
         2031  +    pRec->nField = n;
         2032  +    res = sqlite3VdbeRecordCompare(aSample[iSamp].n, aSample[iSamp].p, pRec);
  1964   2033       if( res<0 ){
         2034  +      iLower = aSample[iSamp].anLt[n-1] + aSample[iSamp].anEq[n-1];
         2035  +      iMin = iTest+1;
         2036  +    }else if( res==0 && n<nField ){
         2037  +      iLower = aSample[iSamp].anLt[n-1];
  1965   2038         iMin = iTest+1;
         2039  +      res = -1;
  1966   2040       }else{
  1967         -      i = iTest;
         2041  +      iSample = iTest;
         2042  +      iCol = n-1;
  1968   2043       }
  1969         -  }while( res && iMin<i );
         2044  +  }while( res && iMin<iSample );
         2045  +  i = iSample / nField;
  1970   2046   
  1971   2047   #ifdef SQLITE_DEBUG
  1972   2048     /* The following assert statements check that the binary search code
  1973   2049     ** above found the right answer. This block serves no purpose other
  1974   2050     ** than to invoke the asserts.  */
  1975         -  if( res==0 ){
  1976         -    /* If (res==0) is true, then sample $i must be equal to pRec */
  1977         -    assert( i<pIdx->nSample );
  1978         -    assert( 0==sqlite3VdbeRecordCompare(aSample[i].n, aSample[i].p, pRec)
  1979         -         || pParse->db->mallocFailed );
  1980         -  }else{
  1981         -    /* Otherwise, pRec must be smaller than sample $i and larger than
  1982         -    ** sample ($i-1).  */
  1983         -    assert( i==pIdx->nSample 
  1984         -         || sqlite3VdbeRecordCompare(aSample[i].n, aSample[i].p, pRec)>0
  1985         -         || pParse->db->mallocFailed );
  1986         -    assert( i==0
  1987         -         || sqlite3VdbeRecordCompare(aSample[i-1].n, aSample[i-1].p, pRec)<0
  1988         -         || pParse->db->mallocFailed );
         2051  +  if( pParse->db->mallocFailed==0 ){
         2052  +    if( res==0 ){
         2053  +      /* If (res==0) is true, then pRec must be equal to sample i. */
         2054  +      assert( i<pIdx->nSample );
         2055  +      assert( iCol==nField-1 );
         2056  +      pRec->nField = nField;
         2057  +      assert( 0==sqlite3VdbeRecordCompare(aSample[i].n, aSample[i].p, pRec) 
         2058  +           || pParse->db->mallocFailed 
         2059  +      );
         2060  +    }else{
         2061  +      /* Unless i==pIdx->nSample, indicating that pRec is larger than
         2062  +      ** all samples in the aSample[] array, pRec must be smaller than the
         2063  +      ** (iCol+1) field prefix of sample i.  */
         2064  +      assert( i<=pIdx->nSample && i>=0 );
         2065  +      pRec->nField = iCol+1;
         2066  +      assert( i==pIdx->nSample 
         2067  +           || sqlite3VdbeRecordCompare(aSample[i].n, aSample[i].p, pRec)>0
         2068  +           || pParse->db->mallocFailed );
         2069  +
         2070  +      /* if i==0 and iCol==0, then record pRec is smaller than all samples
         2071  +      ** in the aSample[] array. Otherwise, if (iCol>0) then pRec must
         2072  +      ** be greater than or equal to the (iCol) field prefix of sample i.
         2073  +      ** If (i>0), then pRec must also be greater than sample (i-1).  */
         2074  +      if( iCol>0 ){
         2075  +        pRec->nField = iCol;
         2076  +        assert( sqlite3VdbeRecordCompare(aSample[i].n, aSample[i].p, pRec)<=0
         2077  +             || pParse->db->mallocFailed );
         2078  +      }
         2079  +      if( i>0 ){
         2080  +        pRec->nField = nField;
         2081  +        assert( sqlite3VdbeRecordCompare(aSample[i-1].n, aSample[i-1].p, pRec)<0
         2082  +             || pParse->db->mallocFailed );
         2083  +      }
         2084  +    }
  1989   2085     }
  1990   2086   #endif /* ifdef SQLITE_DEBUG */
  1991   2087   
  1992         -  /* At this point, aSample[i] is the first sample that is greater than
  1993         -  ** or equal to pVal.  Or if i==pIdx->nSample, then all samples are less
  1994         -  ** than pVal.  If aSample[i]==pVal, then res==0.
  1995         -  */
  1996   2088     if( res==0 ){
         2089  +    /* Record pRec is equal to sample i */
         2090  +    assert( iCol==nField-1 );
  1997   2091       aStat[0] = aSample[i].anLt[iCol];
  1998   2092       aStat[1] = aSample[i].anEq[iCol];
  1999   2093     }else{
  2000         -    tRowcnt iLower, iUpper, iGap;
  2001         -    if( i==0 ){
  2002         -      iLower = 0;
  2003         -      iUpper = aSample[0].anLt[iCol];
         2094  +    /* At this point, the (iCol+1) field prefix of aSample[i] is the first 
         2095  +    ** sample that is greater than pRec. Or, if i==pIdx->nSample then pRec
         2096  +    ** is larger than all samples in the array. */
         2097  +    tRowcnt iUpper, iGap;
         2098  +    if( i>=pIdx->nSample ){
         2099  +      iUpper = sqlite3LogEstToInt(pIdx->aiRowLogEst[0]);
  2004   2100       }else{
  2005         -      i64 nRow0 = sqlite3LogEstToInt(pIdx->aiRowLogEst[0]);
  2006         -      iUpper = i>=pIdx->nSample ? nRow0 : aSample[i].anLt[iCol];
  2007         -      iLower = aSample[i-1].anEq[iCol] + aSample[i-1].anLt[iCol];
         2101  +      iUpper = aSample[i].anLt[iCol];
  2008   2102       }
  2009         -    aStat[1] = pIdx->aAvgEq[iCol];
         2103  +
  2010   2104       if( iLower>=iUpper ){
  2011   2105         iGap = 0;
  2012   2106       }else{
  2013   2107         iGap = iUpper - iLower;
  2014   2108       }
  2015   2109       if( roundUp ){
  2016   2110         iGap = (iGap*2)/3;
  2017   2111       }else{
  2018   2112         iGap = iGap/3;
  2019   2113       }
  2020   2114       aStat[0] = iLower + iGap;
         2115  +    aStat[1] = pIdx->aAvgEq[iCol];
  2021   2116     }
         2117  +
         2118  +  /* Restore the pRec->nField value before returning.  */
         2119  +  pRec->nField = nField;
  2022   2120     return i;
  2023   2121   }
  2024   2122   #endif /* SQLITE_ENABLE_STAT3_OR_STAT4 */
  2025   2123   
  2026   2124   /*
  2027   2125   ** If it is not NULL, pTerm is a term that provides an upper or lower
  2028   2126   ** bound on a range scan. Without considering pTerm, it is estimated 

Changes to test/analyze9.test.

  1129   1129     }
  1130   1130     do_eqp_test 25.4.2 { 
  1131   1131       SELECT * FROM t6 WHERE a < 20 AND (b BETWEEN ? AND 60)
  1132   1132     } {
  1133   1133       0 0 0 {SEARCH TABLE t6 USING INDEX bb (b>? AND b<?)}
  1134   1134     }
  1135   1135   }
         1136  +
         1137  +#-------------------------------------------------------------------------
         1138  +# Check that a problem in they way stat4 data is used has been 
         1139  +# resolved (see below).
         1140  +#
         1141  +reset_db
         1142  +do_test 26.1.1 {
         1143  +  db transaction {
         1144  +    execsql { 
         1145  +      CREATE TABLE t1(x, y, z);
         1146  +      CREATE INDEX t1xy ON t1(x, y);
         1147  +      CREATE INDEX t1z ON t1(z);
         1148  +    }
         1149  +    for {set i 0} {$i < 10000} {incr i} {
         1150  +      execsql { INSERT INTO t1(x, y) VALUES($i, $i) }
         1151  +    }
         1152  +    for {set i 0} {$i < 10} {incr i} {
         1153  +      execsql {
         1154  +        WITH cnt(x) AS (SELECT 1 UNION ALL SELECT x+1 FROM cnt WHERE x<100)
         1155  +        INSERT INTO t1(x, y) SELECT 10000+$i, x FROM cnt;
         1156  +        INSERT INTO t1(x, y) SELECT 10000+$i, 100;
         1157  +      }
         1158  +    }
         1159  +    execsql {
         1160  +      UPDATE t1 SET z = rowid / 20;
         1161  +      ANALYZE;
         1162  +    }
         1163  +  }
         1164  +} {}
         1165  +
         1166  +do_execsql_test 26.1.2 {
         1167  +  SELECT count(*) FROM t1 WHERE x = 10000 AND y < 50;
         1168  +} {49}
         1169  +do_execsql_test 26.1.3 {
         1170  +  SELECT count(*) FROM t1 WHERE z = 444;
         1171  +} {20}
         1172  +
         1173  +# The analyzer knows that any (z=?) expression matches 20 rows. So it
         1174  +# will use index "t1z" if the estimate of hits for (x=10000 AND y<50)
         1175  +# is greater than 20 rows.
         1176  +#
         1177  +# And it should be. The analyzer has a stat4 sample as follows:
         1178  +#
         1179  +#   sample=(x=10000, y=100) nLt=(10000 10099)
         1180  +#
         1181  +# There should be no other samples that start with (x=10000). So it knows 
         1182  +# that (x=10000 AND y<50) must match somewhere between 0 and 99 rows, but
         1183  +# know more than that. Guessing less than 20 is therefore unreasonable.
         1184  +#
         1185  +# At one point though, due to a problem in whereKeyStats(), the planner was
         1186  +# estimating that (x=10000 AND y<50) would match only 2 rows.
         1187  +#
         1188  +do_eqp_test 26.1.4 {
         1189  +  SELECT * FROM t1 WHERE x = 10000 AND y < 50 AND z = 444;
         1190  +} {
         1191  +  0 0 0 {SEARCH TABLE t1 USING INDEX t1z (z=?)}
         1192  +}
         1193  +
         1194  +
         1195  +# This test - 26.2.* - tests that another manifestation of the same problem
         1196  +# is no longer present in the library. Assuming:
         1197  +# 
         1198  +#   CREATE INDEX t1xy ON t1(x, y)
         1199  +#
         1200  +# and that have samples for index t1xy as follows:
         1201  +#
         1202  +#
         1203  +#   sample=('A', 70)        nEq=(100, 2)        nLt=(900, 970)
         1204  +#   sample=('B', 70)        nEq=(100, 2)        nLt=(1000, 1070)    
         1205  +#
         1206  +# the planner should estimate that (x = 'B' AND y > 25) matches 76 rows
         1207  +# (70 * 2/3 + 30). Before, due to the problem, the planner was estimating 
         1208  +# that this matched 100 rows.
         1209  +# 
         1210  +reset_db
         1211  +do_execsql_test 26.2.1 {
         1212  +  BEGIN;
         1213  +    CREATE TABLE t1(x, y, z);
         1214  +    CREATE INDEX i1 ON t1(x, y);
         1215  +    CREATE INDEX i2 ON t1(z);
         1216  +  
         1217  +    WITH 
         1218  +    cnt(y) AS (SELECT 0 UNION ALL SELECT y+1 FROM cnt WHERE y<99),
         1219  +    letters(x) AS (
         1220  +      SELECT 'A' UNION SELECT 'B' UNION SELECT 'C' UNION SELECT 'D'
         1221  +    )
         1222  +    INSERT INTO t1(x, y) SELECT x, y FROM letters, cnt;
         1223  +  
         1224  +    WITH
         1225  +    letters(x) AS (
         1226  +      SELECT 'A' UNION SELECT 'B' UNION SELECT 'C' UNION SELECT 'D'
         1227  +    )
         1228  +    INSERT INTO t1(x, y) SELECT x, 70 FROM letters;
         1229  +  
         1230  +    WITH
         1231  +    cnt(i) AS (SELECT 0 UNION ALL SELECT i+1 FROM cnt WHERE i<9999)
         1232  +    INSERT INTO t1(x, y) SELECT i, i FROM cnt;
         1233  +  
         1234  +    UPDATE t1 SET z = (rowid / 95);
         1235  +    ANALYZE;
         1236  +  COMMIT;
         1237  +}
         1238  +
         1239  +do_eqp_test 26.2.2 {
         1240  +  SELECT * FROM t1 WHERE x='B' AND y>25 AND z=?;
         1241  +} {
         1242  +  0 0 0 {SEARCH TABLE t1 USING INDEX i1 (x=? AND y>?)}
         1243  +}
         1244  +
  1136   1245   
  1137   1246   finish_test
         1247  +
         1248  +
         1249  +