/ Check-in [eef60f1b]
Login
SQLite training in Houston TX on 2019-11-05 (details)
Part of the 2019 Tcl Conference

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Add a big introductory comment to vdbesort.c explaining its operation at a high level. Also adjust some symbolic names and fix other comment issues in that file.
Downloads: Tarball | ZIP archive | SQL archive
Timelines: family | ancestors | descendants | both | threads
Files: files | file ages | folders
SHA1: eef60f1bf54fcdc7b32f96ebb87a9a0bf0776e8b
User & Date: drh 2014-04-02 18:58:49
Context
2014-04-03
02:54
Refactor local object and method names in vdbesort.c so that their names more closely reflect their actual use. check-in: d284e30e user: drh tags: threads
2014-04-02
18:58
Add a big introductory comment to vdbesort.c explaining its operation at a high level. Also adjust some symbolic names and fix other comment issues in that file. check-in: eef60f1b user: drh tags: threads
15:15
Fix some problems with OOM handling in vdbesort.c. check-in: 47e702bd user: dan tags: threads
Changes
Hide Diffs Side-by-Side Diffs Ignore Whitespace Patch

Changes to src/vdbesort.c.

     1      1   /*
     2         -** 2011 July 9
            2  +** 2011-07-09
     3      3   **
     4      4   ** The author disclaims copyright to this source code.  In place of
     5      5   ** a legal notice, here is a blessing:
     6      6   **
     7      7   **    May you do good and not evil.
     8      8   **    May you find forgiveness for yourself and forgive others.
     9      9   **    May you share freely, never taking more than you give.
    10     10   **
    11     11   *************************************************************************
    12     12   ** This file contains code for the VdbeSorter object, used in concert with
    13         -** a VdbeCursor to sort large numbers of keys (as may be required, for
    14         -** example, by CREATE INDEX statements on tables too large to fit in main
    15         -** memory).
           13  +** a VdbeCursor to sort large numbers of keys for CREATE TABLE statements
           14  +** or by SELECT statements with ORDER BY clauses that cannot be satisfied
           15  +** using indexes and without LIMIT clauses.
           16  +**
           17  +** The VdbeSorter object implements a multi-threaded external merge sort
           18  +** algorithm that is efficient even if the number of element being sorted
           19  +** exceeds the available memory.
           20  +**
           21  +** Here is the (internal, non-API) interface between this module and the
           22  +** rest of the SQLite system:
           23  +**
           24  +**    sqlite3VdbeSorterInit()       Create a new VdbeSorter object.
           25  +**
           26  +**    sqlite3VdbeSorterWrite()      Add a single new row to the VdbeSorter
           27  +**                                  object.  The row is a binary blob in the
           28  +**                                  OP_MakeRecord format that contains both
           29  +**                                  the ORDER BY key columns and result columns
           30  +**                                  in the case of a SELECT w/ ORDER BY, or
           31  +**                                  the complete record for an index entry
           32  +**                                  in the case of a CREATE INDEX.
           33  +**
           34  +**    sqlite3VdbeSorterRewind()     Sort all content previously added.
           35  +**                                  Position the read cursor on the
           36  +**                                  first sorted element.
           37  +**
           38  +**    sqlite3VdbeSorterNext()       Advance the read cursor to the next sorted
           39  +**                                  element.
           40  +**
           41  +**    sqlite3VdbeSorterRowkey()     Return the complete binary blob for the
           42  +**                                  row currently under the read cursor.
           43  +**
           44  +**    sqlite3VdbeSorterCompare()    Compare the binary blob for the row
           45  +**                                  currently under the read cursor against
           46  +**                                  another binary blob X and report if
           47  +**                                  X is strictly less than the read cursor.
           48  +**                                  Used to enforce uniqueness in a
           49  +**                                  CREATE UNIQUE INDEX statement.
           50  +**
           51  +**    sqlite3VdbeSorterClose()      Close the VdbeSorter object and reclaim
           52  +**                                  all resources.
           53  +**
           54  +**    sqlite3VdbeSorterReset()      Refurbish the VdbeSorter for reuse.  This
           55  +**                                  is like Close() followed by Init() only
           56  +**                                  much faster.
           57  +**
           58  +** The interfaces above must be called in a particular order.  Write() can 
           59  +** only occur in between Init()/Reset() and Rewind().  Next(), Rowkey(), and
           60  +** Compare() can only occur in between Rewind() and Close()/Reset().
           61  +**
           62  +** Algorithm:
           63  +**
           64  +** Records to be sorted are initially held in memory, in the order in
           65  +** which they arrive from Write().  When the amount of memory needed exceeds
           66  +** a threshold, all in-memory records are sorted and then appended to
           67  +** a temporary file as a "Packed-Memory-Array" or "PMA" and the memory is
           68  +** reset.  There is a single temporary file used for all PMAs.  The PMAs
           69  +** are packed one after another in the file.  The VdbeSorter object keeps
           70  +** track of the number of PMAs written.
           71  +**
           72  +** When the Rewind() is seen, any records still held in memory are sorted.
           73  +** If no PMAs have been written (if all records are still held in memory)
           74  +** then subsequent Rowkey(), Next(), and Compare() operations work directly
           75  +** from memory.  But if PMAs have been written things get a little more
           76  +** complicated.
           77  +**
           78  +** When Rewind() is seen after PMAs have been written, any records still
           79  +** in memory are sorted and written as a final PMA.  Then all the PMAs
           80  +** are merged together into a single massive PMA that Next(), Rowkey(),
           81  +** and Compare() walk to extract the records in sorted order.
           82  +**
           83  +** If SQLITE_MAX_WORKER_THREADS is non-zero, various steps of the above
           84  +** algorithm might be performed in parallel by separate threads.  Threads
           85  +** are only used when one or more PMA spill to disk.  If the sort is small
           86  +** enough to fit entirely in memory, everything happens on the main thread.
    16     87   */
    17         -
    18     88   #include "sqliteInt.h"
    19     89   #include "vdbeInt.h"
    20     90   
    21         -
           91  +/*
           92  +** Private objects used by the sorter
           93  +*/
    22     94   typedef struct VdbeSorterIter VdbeSorterIter;
    23     95   typedef struct SortSubtask SortSubtask;
    24     96   typedef struct SorterRecord SorterRecord;
    25     97   typedef struct SorterMerger SorterMerger;
    26     98   typedef struct FileWriter FileWriter;
    27     99   
    28    100   
    29    101   /*
    30    102   ** Candidate values for SortSubtask.eWork
    31    103   */
    32         -#define SORTER_THREAD_SORT   1  /* Sort records on pList */
    33         -#define SORTER_THREAD_TO_PMA 2  /* Xfer pList to Packed-Memory-Array pFile */
    34         -#define SORTER_THREAD_CONS   3  /* Consolidate multiple PMAs */
          104  +#define SORT_SUBTASK_SORT   1     /* Sort records on pList */
          105  +#define SORT_SUBTASK_TO_PMA 2     /* Xfer pList to Packed-Memory-Array pTemp1 */
          106  +#define SORT_SUBTASK_CONS   3     /* Consolidate multiple PMAs */
    35    107   
    36    108   /*
    37         -** Much of the work performed in this module to sort the list of records is 
    38         -** broken down into smaller units that may be peformed in parallel. In order
    39         -** to perform such a unit of work, an instance of the following structure
    40         -** is configured and passed to vdbeSortSubtaskMain() - either directly by 
    41         -** the main thread or via a background thread.
          109  +** Sorting is divided up into smaller subtasks.  Each subtask is controlled
          110  +** by an instance of this object.  Subtask might run in either the main thread
          111  +** or in a background thread.
    42    112   **
    43         -** Exactly SortSubtask.nThread instances of this structure are allocated
          113  +** Exactly VdbeSorter.nThread instances of this object are allocated
    44    114   ** as part of each VdbeSorter object. Instances are never allocated any other
    45         -** way. SortSubtask.nThread is set to the number of worker threads allowed
          115  +** way. VdbeSorter.nThread is set to the number of worker threads allowed
    46    116   ** (see SQLITE_CONFIG_WORKER_THREADS) plus one (the main thread).
    47    117   **
    48    118   ** When a background thread is launched to perform work, SortSubtask.bDone
    49    119   ** is set to 0 and the SortSubtask.pThread variable set to point to the
    50    120   ** thread handle. SortSubtask.bDone is set to 1 (to indicate to the main
    51    121   ** thread that joining SortSubtask.pThread will not block) before the thread
    52    122   ** exits. SortSubtask.pThread and bDone are always cleared after the 
    53    123   ** background thread has been joined.
    54    124   **
    55         -** One object (specifically, VdbeSorter.aThread[SortSubtask.nThread-1])
          125  +** One object (specifically, VdbeSorter.aThread[VdbeSorter.nThread-1])
    56    126   ** is reserved for the foreground thread.
    57    127   **
    58    128   ** The nature of the work performed is determined by SortSubtask.eWork,
    59    129   ** as follows:
    60    130   **
    61         -**   SORTER_THREAD_SORT:
          131  +**   SORT_SUBTASK_SORT:
    62    132   **     Sort the linked list of records at SortSubtask.pList.
    63    133   **
    64         -**   SORTER_THREAD_TO_PMA:
          134  +**   SORT_SUBTASK_TO_PMA:
    65    135   **     Sort the linked list of records at SortSubtask.pList, and write
    66    136   **     the results to a new PMA in temp file SortSubtask.pTemp1. Open
    67    137   **     the temp file if it is not already open.
    68    138   **
    69         -**   SORTER_THREAD_CONS:
          139  +**   SORT_SUBTASK_CONS:
    70    140   **     Merge existing PMAs until SortSubtask.nConsolidate or fewer
    71    141   **     remain in temp file SortSubtask.pTemp1.
    72    142   */
    73    143   struct SortSubtask {
    74    144     SQLiteThread *pThread;          /* Thread handle, or NULL */
    75    145     int bDone;                      /* Set to true by pThread when finished */
    76    146   
    77    147     sqlite3_vfs *pVfs;              /* VFS used to open temporary files */
    78    148     KeyInfo *pKeyInfo;              /* How to compare records */
    79    149     UnpackedRecord *pUnpacked;      /* Space to unpack a record */
    80    150     int pgsz;                       /* Main database page size */
    81    151   
    82         -  u8 eWork;                       /* One of the SORTER_THREAD_* constants */
    83         -  int nConsolidate;               /* For THREAD_CONS, max final PMAs */
          152  +  u8 eWork;                       /* One of the SORT_SUBTASK_* constants */
          153  +  int nConsolidate;               /* For SORT_SUBTASK_CONS, max final PMAs */
    84    154     SorterRecord *pList;            /* List of records for pThread to sort */
    85    155     int nInMemory;                  /* Expected size of PMA based on pList */
    86    156     u8 *aListMemory;                /* Records memory (or NULL) */
    87    157   
    88    158     int nPMA;                       /* Number of PMAs currently in pTemp1 */
    89    159     i64 iTemp1Off;                  /* Offset to write to in pTemp1 */
    90    160     sqlite3_file *pTemp1;           /* File to write PMAs to, or NULL */
................................................................................
  1091   1161   /*
  1092   1162   ** The main routine for sorter-thread operations.
  1093   1163   */
  1094   1164   static void *vdbeSortSubtaskMain(void *pCtx){
  1095   1165     int rc = SQLITE_OK;
  1096   1166     SortSubtask *pThread = (SortSubtask*)pCtx;
  1097   1167   
  1098         -  assert( pThread->eWork==SORTER_THREAD_SORT
  1099         -       || pThread->eWork==SORTER_THREAD_TO_PMA
  1100         -       || pThread->eWork==SORTER_THREAD_CONS
         1168  +  assert( pThread->eWork==SORT_SUBTASK_SORT
         1169  +       || pThread->eWork==SORT_SUBTASK_TO_PMA
         1170  +       || pThread->eWork==SORT_SUBTASK_CONS
  1101   1171     );
  1102   1172     assert( pThread->bDone==0 );
  1103   1173   
  1104   1174     if( pThread->pUnpacked==0 ){
  1105   1175       char *pFree;
  1106   1176       pThread->pUnpacked = sqlite3VdbeAllocUnpackedRecord(
  1107   1177           pThread->pKeyInfo, 0, 0, &pFree
................................................................................
  1111   1181         rc = SQLITE_NOMEM;
  1112   1182         goto thread_out;
  1113   1183       }
  1114   1184       pThread->pUnpacked->nField = pThread->pKeyInfo->nField;
  1115   1185       pThread->pUnpacked->errCode = 0;
  1116   1186     }
  1117   1187   
  1118         -  if( pThread->eWork==SORTER_THREAD_CONS ){
         1188  +  if( pThread->eWork==SORT_SUBTASK_CONS ){
  1119   1189       assert( pThread->pList==0 );
  1120   1190       while( pThread->nPMA>pThread->nConsolidate && rc==SQLITE_OK ){
  1121   1191         int nIter = MIN(pThread->nPMA, SORTER_MAX_MERGE_COUNT);
  1122   1192         sqlite3_file *pTemp2 = 0;     /* Second temp file to use */
  1123   1193         SorterMerger *pMerger;        /* Object for reading/merging PMA data */
  1124   1194         i64 iReadOff = 0;             /* Offset in pTemp1 to read from */
  1125   1195         i64 iWriteOff = 0;            /* Offset in pTemp2 to write to */
................................................................................
  1184   1254         pThread->iTemp1Off = iWriteOff;
  1185   1255       }
  1186   1256     }else{
  1187   1257       /* Sort the pThread->pList list */
  1188   1258       rc = vdbeSorterSort(pThread);
  1189   1259   
  1190   1260       /* If required, write the list out to a PMA. */
  1191         -    if( rc==SQLITE_OK && pThread->eWork==SORTER_THREAD_TO_PMA ){
         1261  +    if( rc==SQLITE_OK && pThread->eWork==SORT_SUBTASK_TO_PMA ){
  1192   1262   #ifdef SQLITE_DEBUG
  1193   1263         i64 nExpect = pThread->nInMemory
  1194   1264           + sqlite3VarintLen(pThread->nInMemory)
  1195   1265           + pThread->iTemp1Off;
  1196   1266   #endif
  1197   1267         rc = vdbeSorterListToPMA(pThread);
  1198   1268         assert( rc!=SQLITE_OK || (nExpect==pThread->iTemp1Off) );
................................................................................
  1254   1324     if( pThread==0 ){
  1255   1325       pThread = &pSorter->aThread[nWorker];
  1256   1326     }
  1257   1327     pSorter->iPrev = (pThread - pSorter->aThread);
  1258   1328   
  1259   1329     if( rc==SQLITE_OK ){
  1260   1330       assert( pThread->pThread==0 && pThread->bDone==0 );
  1261         -    pThread->eWork = SORTER_THREAD_TO_PMA;
         1331  +    pThread->eWork = SORT_SUBTASK_TO_PMA;
  1262   1332       pThread->pList = pSorter->pRecord;
  1263   1333       pThread->nInMemory = pSorter->nInMemory;
  1264   1334       pSorter->nInMemory = 0;
  1265   1335       pSorter->pRecord = 0;
  1266   1336   
  1267   1337       if( pSorter->aMemory ){
  1268   1338         u8 *aMem = pThread->aListMemory;
................................................................................
  1302   1372   }
  1303   1373   
  1304   1374   /*
  1305   1375   ** Add a record to the sorter.
  1306   1376   */
  1307   1377   int sqlite3VdbeSorterWrite(
  1308   1378     sqlite3 *db,                    /* Database handle */
  1309         -  const VdbeCursor *pCsr,               /* Sorter cursor */
         1379  +  const VdbeCursor *pCsr,         /* Sorter cursor */
  1310   1380     Mem *pVal                       /* Memory cell containing record */
  1311   1381   ){
  1312   1382     VdbeSorter *pSorter = pCsr->pSorter;
  1313   1383     int rc = SQLITE_OK;             /* Return Code */
  1314   1384     SorterRecord *pNew;             /* New list element */
  1315   1385   
  1316   1386     int bFlush;                     /* True to flush contents of memory to PMA */
................................................................................
  1419   1489     ** sort the VdbeSorter.pRecord list. The vdbe layer will read data directly
  1420   1490     ** from the in-memory list.  */
  1421   1491     if( pSorter->bUsePMA==0 ){
  1422   1492       if( pSorter->pRecord ){
  1423   1493         SortSubtask *pThread = &pSorter->aThread[0];
  1424   1494         *pbEof = 0;
  1425   1495         pThread->pList = pSorter->pRecord;
  1426         -      pThread->eWork = SORTER_THREAD_SORT;
         1496  +      pThread->eWork = SORT_SUBTASK_SORT;
  1427   1497         assert( pThread->aListMemory==0 );
  1428   1498         pThread->aListMemory = pSorter->aMemory;
  1429   1499         rc = vdbeSorterRunThread(pThread);
  1430   1500         pThread->aListMemory = 0;
  1431   1501         pSorter->pRecord = pThread->pList;
  1432   1502         pThread->pList = 0;
  1433   1503       }else{
................................................................................
  1448   1518     ** some of them together so that this is no longer the case. */
  1449   1519     if( vdbeSorterCountPMA(pSorter)>SORTER_MAX_MERGE_COUNT ){
  1450   1520       int i;
  1451   1521       for(i=0; rc==SQLITE_OK && i<pSorter->nThread; i++){
  1452   1522         SortSubtask *pThread = &pSorter->aThread[i];
  1453   1523         if( pThread->pTemp1 ){
  1454   1524           pThread->nConsolidate = SORTER_MAX_MERGE_COUNT / pSorter->nThread;
  1455         -        pThread->eWork = SORTER_THREAD_CONS;
         1525  +        pThread->eWork = SORT_SUBTASK_CONS;
  1456   1526   
  1457   1527   #if SQLITE_MAX_WORKER_THREADS>0
  1458   1528           if( i<(pSorter->nThread-1) ){
  1459   1529             void *pCtx = (void*)pThread;
  1460   1530             rc = sqlite3ThreadCreate(&pThread->pThread,vdbeSortSubtaskMain,pCtx);
  1461   1531           }else
  1462   1532   #endif
................................................................................
  1572   1642   
  1573   1643   /*
  1574   1644   ** Compare the key in memory cell pVal with the key that the sorter cursor
  1575   1645   ** passed as the first argument currently points to. For the purposes of
  1576   1646   ** the comparison, ignore the rowid field at the end of each record.
  1577   1647   **
  1578   1648   ** If the sorter cursor key contains any NULL values, consider it to be
  1579         -** less than pVal. Evn if pVal also contains NULL values.
         1649  +** less than pVal. Even if pVal also contains NULL values.
  1580   1650   **
  1581   1651   ** If an error occurs, return an SQLite error code (i.e. SQLITE_NOMEM).
  1582   1652   ** Otherwise, set *pRes to a negative, zero or positive value if the
  1583   1653   ** key in pVal is smaller than, equal to or larger than the current sorter
  1584   1654   ** key.
         1655  +**
         1656  +** This routine forms the core of the OP_SorterCompare opcode, which in
         1657  +** turn is used to verify uniqueness when constructing a UNIQUE INDEX.
  1585   1658   */
  1586   1659   int sqlite3VdbeSorterCompare(
  1587   1660     const VdbeCursor *pCsr,         /* Sorter cursor */
  1588   1661     Mem *pVal,                      /* Value to compare to current sorter key */
  1589   1662     int nIgnore,                    /* Ignore this many fields at the end */
  1590   1663     int *pRes                       /* OUT: Result of comparison */
  1591   1664   ){