Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Pager optimization: do not write or journal free pages. This results in a 2x performance gain for large INSERTs and a 5x performance gain for large DELETEs. (CVS 410) |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
cf1ebcfb741786f84a596c406f4c492f |
User & Date: | drh 2002-03-02 20:41:58.000 |
Context
2002-03-03
| ||
02:49 | Bug fixes and additional tests for the subquery flattener. (CVS 411) (check-in: 2c05389eda user: drh tags: trunk) | |
2002-03-02
| ||
20:41 | Pager optimization: do not write or journal free pages. This results in a 2x performance gain for large INSERTs and a 5x performance gain for large DELETEs. (CVS 410) (check-in: cf1ebcfb74 user: drh tags: trunk) | |
19:00 | Change the btree node balancers to sort nodes into accending order. This improves insert and delete speed by 25%. (CVS 409) (check-in: abbb999d4f user: drh tags: trunk) | |
Changes
Changes to src/btree.c.
1 2 3 4 5 6 7 8 9 10 11 | /* ** 2001 September 15 ** ** The author disclaims copyright to this source code. In place of ** a legal notice, here is a blessing: ** ** May you do good and not evil. ** May you find forgiveness for yourself and forgive others. ** May you share freely, never taking more than you give. ** ************************************************************************* | | | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | /* ** 2001 September 15 ** ** The author disclaims copyright to this source code. In place of ** a legal notice, here is a blessing: ** ** May you do good and not evil. ** May you find forgiveness for yourself and forgive others. ** May you share freely, never taking more than you give. ** ************************************************************************* ** $Id: btree.c,v 1.57 2002/03/02 20:41:58 drh Exp $ ** ** This file implements a external (disk-based) database using BTrees. ** For a detailed discussion of BTrees, refer to ** ** Donald E. Knuth, THE ART OF COMPUTER PROGRAMMING, Volume 3: ** "Sorting And Searching", pages 473-480. Addison-Wesley ** Publishing Company, Reading, Massachusetts. |
︙ | ︙ | |||
60 61 62 63 64 65 66 67 68 69 70 71 72 73 | typedef struct PageOne PageOne; typedef struct MemPage MemPage; typedef struct PageHdr PageHdr; typedef struct Cell Cell; typedef struct CellHdr CellHdr; typedef struct FreeBlk FreeBlk; typedef struct OverflowPage OverflowPage; /* ** All structures on a database page are aligned to 4-byte boundries. ** This routine rounds up a number of bytes to the next multiple of 4. ** ** This might need to change for computer architectures that require ** and 8-byte alignment boundry for structures. | > | 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 | typedef struct PageOne PageOne; typedef struct MemPage MemPage; typedef struct PageHdr PageHdr; typedef struct Cell Cell; typedef struct CellHdr CellHdr; typedef struct FreeBlk FreeBlk; typedef struct OverflowPage OverflowPage; typedef struct FreelistInfo FreelistInfo; /* ** All structures on a database page are aligned to 4-byte boundries. ** This routine rounds up a number of bytes to the next multiple of 4. ** ** This might need to change for computer architectures that require ** and 8-byte alignment boundry for structures. |
︙ | ︙ | |||
243 244 245 246 247 248 249 250 251 252 253 254 255 256 | ** page number of the first page in a linked list of unused database ** pages. */ struct OverflowPage { Pgno iNext; char aPayload[OVERFLOW_SIZE]; }; /* ** For every page in the database file, an instance of the following structure ** is stored in memory. The u.aDisk[] array contains the raw bits read from ** the disk. The rest is auxiliary information held in memory only. The ** auxiliary info is only valid for regular database pages - it is not ** used for overflow pages and pages on the freelist. | > > > > > > > > > > > > | 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 | ** page number of the first page in a linked list of unused database ** pages. */ struct OverflowPage { Pgno iNext; char aPayload[OVERFLOW_SIZE]; }; /* ** The PageOne.freeList field points to a linked list of overflow pages ** hold information about free pages. The aPayload section of each ** overflow page contains an instance of the following structure. The ** aFree[] array holds the page number of nFree unused pages in the disk ** file. */ struct FreelistInfo { int nFree; Pgno aFree[(OVERFLOW_SIZE-sizeof(int))/sizeof(Pgno)]; }; /* ** For every page in the database file, an instance of the following structure ** is stored in memory. The u.aDisk[] array contains the raw bits read from ** the disk. The rest is auxiliary information held in memory only. The ** auxiliary info is only valid for regular database pages - it is not ** used for overflow pages and pages on the freelist. |
︙ | ︙ | |||
1473 1474 1475 1476 1477 1478 1479 1480 1481 | ** Do not invoke sqlitepager_unref() on *ppPage if an error is returned. */ static int allocatePage(Btree *pBt, MemPage **ppPage, Pgno *pPgno){ PageOne *pPage1 = pBt->page1; int rc; if( pPage1->freeList ){ OverflowPage *pOvfl; rc = sqlitepager_write(pPage1); if( rc ) return rc; | > > | > > > | < | > > > > > > > > > > | 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 | ** Do not invoke sqlitepager_unref() on *ppPage if an error is returned. */ static int allocatePage(Btree *pBt, MemPage **ppPage, Pgno *pPgno){ PageOne *pPage1 = pBt->page1; int rc; if( pPage1->freeList ){ OverflowPage *pOvfl; FreelistInfo *pInfo; rc = sqlitepager_write(pPage1); if( rc ) return rc; pPage1->nFree--; rc = sqlitepager_get(pBt->pPager, pPage1->freeList, (void**)&pOvfl); if( rc ) return rc; rc = sqlitepager_write(pOvfl); if( rc ){ sqlitepager_unref(pOvfl); return rc; } pInfo = (FreelistInfo*)pOvfl->aPayload; if( pInfo->nFree==0 ){ *pPgno = pPage1->freeList; pPage1->freeList = pOvfl->iNext; *ppPage = (MemPage*)pOvfl; }else{ pInfo->nFree--; *pPgno = pInfo->aFree[pInfo->nFree]; rc = sqlitepager_get(pBt->pPager, *pPgno, (void**)ppPage); sqlitepager_unref(pOvfl); if( rc==SQLITE_OK ){ sqlitepager_dont_rollback(*ppPage); rc = sqlitepager_write(*ppPage); } } }else{ *pPgno = sqlitepager_pagecount(pBt->pPager) + 1; rc = sqlitepager_get(pBt->pPager, *pPgno, (void**)ppPage); if( rc ) return rc; rc = sqlitepager_write(*ppPage); } return rc; |
︙ | ︙ | |||
1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 | assert( pOvfl!=0 ); pgno = sqlitepager_pagenumber(pOvfl); } assert( pgno>2 ); rc = sqlitepager_write(pPage1); if( rc ){ return rc; } if( pOvfl==0 ){ assert( pgno>0 ); rc = sqlitepager_get(pBt->pPager, pgno, (void**)&pOvfl); if( rc ) return rc; needUnref = 1; } rc = sqlitepager_write(pOvfl); if( rc ){ if( needUnref ) sqlitepager_unref(pOvfl); return rc; } pOvfl->iNext = pPage1->freeList; pPage1->freeList = pgno; | > > > > > > > > > > > > > > > > > > > < | 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 | assert( pOvfl!=0 ); pgno = sqlitepager_pagenumber(pOvfl); } assert( pgno>2 ); rc = sqlitepager_write(pPage1); if( rc ){ return rc; } pPage1->nFree++; if( pPage1->nFree>0 && pPage1->freeList ){ OverflowPage *pFreeIdx; rc = sqlitepager_get(pBt->pPager, pPage1->freeList, (void**)&pFreeIdx); if( rc==SQLITE_OK ){ FreelistInfo *pInfo = (FreelistInfo*)pFreeIdx->aPayload; if( pInfo->nFree<(sizeof(pInfo->aFree)/sizeof(pInfo->aFree[0])) ){ rc = sqlitepager_write(pFreeIdx); if( rc==SQLITE_OK ){ pInfo->aFree[pInfo->nFree] = pgno; pInfo->nFree++; sqlitepager_unref(pFreeIdx); sqlitepager_dont_write(pBt->pPager, pgno); return rc; } } sqlitepager_unref(pFreeIdx); } } if( pOvfl==0 ){ assert( pgno>0 ); rc = sqlitepager_get(pBt->pPager, pgno, (void**)&pOvfl); if( rc ) return rc; needUnref = 1; } rc = sqlitepager_write(pOvfl); if( rc ){ if( needUnref ) sqlitepager_unref(pOvfl); return rc; } pOvfl->iNext = pPage1->freeList; pPage1->freeList = pgno; memset(pOvfl->aPayload, 0, OVERFLOW_SIZE); pMemPage = (MemPage*)pPage; pMemPage->isInit = 0; if( pMemPage->pParent ){ sqlitepager_unref(pMemPage->pParent); pMemPage->pParent = 0; } |
︙ | ︙ | |||
2699 2700 2701 2702 2703 2704 2705 | return (pCheck->anRef[iPage]++)>1; } /* ** Check the integrity of the freelist or of an overflow page list. ** Verify that the number of pages on the list is N. */ | | > > > > > > > | > > > > > > > | 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 | return (pCheck->anRef[iPage]++)>1; } /* ** Check the integrity of the freelist or of an overflow page list. ** Verify that the number of pages on the list is N. */ static void checkList( IntegrityCk *pCheck, /* Integrity checking context */ int isFreeList, /* True for a freelist. False for overflow page list */ int iPage, /* Page number for first page in the list */ int N, /* Expected number of pages in the list */ char *zContext /* Context for error messages */ ){ int i; char zMsg[100]; while( N-- > 0 ){ OverflowPage *pOvfl; if( iPage<1 ){ sprintf(zMsg, "%d pages missing from overflow list", N+1); checkAppendMsg(pCheck, zContext, zMsg); break; } if( checkRef(pCheck, iPage, zContext) ) break; if( sqlitepager_get(pCheck->pPager, (Pgno)iPage, (void**)&pOvfl) ){ sprintf(zMsg, "failed to get page %d", iPage); checkAppendMsg(pCheck, zContext, zMsg); break; } if( isFreeList ){ FreelistInfo *pInfo = (FreelistInfo*)pOvfl->aPayload; for(i=0; i<pInfo->nFree; i++){ checkRef(pCheck, pInfo->aFree[i], zMsg); } N -= pInfo->nFree; } iPage = (int)pOvfl->iNext; sqlitepager_unref(pOvfl); } } /* ** Return negative if zKey1<zKey2. |
︙ | ︙ | |||
2814 2815 2816 2817 2818 2819 2820 | /* Check payload overflow pages */ nKey2 = NKEY(pCell->h); sz = nKey2 + NDATA(pCell->h); sprintf(zContext, "On page %d cell %d: ", iPage, i); if( sz>MX_LOCAL_PAYLOAD ){ int nPage = (sz - MX_LOCAL_PAYLOAD + OVERFLOW_SIZE - 1)/OVERFLOW_SIZE; | | | 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 | /* Check payload overflow pages */ nKey2 = NKEY(pCell->h); sz = nKey2 + NDATA(pCell->h); sprintf(zContext, "On page %d cell %d: ", iPage, i); if( sz>MX_LOCAL_PAYLOAD ){ int nPage = (sz - MX_LOCAL_PAYLOAD + OVERFLOW_SIZE - 1)/OVERFLOW_SIZE; checkList(pCheck, 0, pCell->ovfl, nPage, zContext); } /* Check that keys are in the right order */ cur.idx = i; zKey2 = sqliteMalloc( nKey2+1 ); getPayload(&cur, 0, nKey2, zKey2); |
︙ | ︙ | |||
2919 2920 2921 2922 2923 2924 2925 | sCheck.anRef = sqliteMalloc( (sCheck.nPage+1)*sizeof(sCheck.anRef[0]) ); sCheck.anRef[1] = 1; for(i=2; i<=sCheck.nPage; i++){ sCheck.anRef[i] = 0; } sCheck.zErrMsg = 0; /* Check the integrity of the freelist */ | | > | 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 | sCheck.anRef = sqliteMalloc( (sCheck.nPage+1)*sizeof(sCheck.anRef[0]) ); sCheck.anRef[1] = 1; for(i=2; i<=sCheck.nPage; i++){ sCheck.anRef[i] = 0; } sCheck.zErrMsg = 0; /* Check the integrity of the freelist */ checkList(&sCheck, 1, pBt->page1->freeList, pBt->page1->nFree, "Main freelist: "); /* Check all the tables. */ for(i=0; i<nRoot; i++){ checkTreePage(&sCheck, aRoot[i], 0, "List of tree roots: ", 0,0,0,0); } |
︙ | ︙ |
Changes to src/pager.c.
︙ | ︙ | |||
14 15 16 17 18 19 20 | ** The pager is used to access a database disk file. It implements ** atomic commit and rollback through the use of a journal file that ** is separate from the database file. The pager also implements file ** locking to prevent two processes from writing the same database ** file simultaneously, or one process from reading the database while ** another is writing. ** | | | 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | ** The pager is used to access a database disk file. It implements ** atomic commit and rollback through the use of a journal file that ** is separate from the database file. The pager also implements file ** locking to prevent two processes from writing the same database ** file simultaneously, or one process from reading the database while ** another is writing. ** ** @(#) $Id: pager.c,v 1.41 2002/03/02 20:41:59 drh Exp $ */ #include "sqliteInt.h" #include "pager.h" #include "os.h" #include <assert.h> #include <string.h> |
︙ | ︙ | |||
1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 | ** to sqlitepager_write(). In other words, return TRUE if it is ok ** to change the content of the page. */ int sqlitepager_iswriteable(void *pData){ PgHdr *pPg = DATA_TO_PGHDR(pData); return pPg->dirty; } /* ** Commit all changes to the database and release the write lock. ** ** If the commit fails for any reason, a rollback attempt is made ** and an error code is returned. If the commit worked, SQLITE_OK ** is returned. | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 | ** to sqlitepager_write(). In other words, return TRUE if it is ok ** to change the content of the page. */ int sqlitepager_iswriteable(void *pData){ PgHdr *pPg = DATA_TO_PGHDR(pData); return pPg->dirty; } /* ** A call to this routine tells the pager that it is not necessary to ** write the information on page "pgno" back to the disk, even though ** that page might be marked as dirty. ** ** The overlying software layer calls this routine when all of the data ** on the given page is unused. The pager marks the page as clean so ** that it does not get written to disk. ** ** Tests show that this optimization, together with the ** sqlitepager_dont_rollback() below, more than double the speed ** of large INSERT operations and quadruple the speed of large DELETEs. */ void sqlitepager_dont_write(Pager *pPager, Pgno pgno){ PgHdr *pPg; pPg = pager_lookup(pPager, pgno); if( pPg && pPg->dirty ){ pPg->dirty = 0; } } /* ** A call to this routine tells the pager that if a rollback occurs, ** it is not necessary to restore the data on the given page. This ** means that the pager does not have to record the given page in the ** rollback journal. */ void sqlitepager_dont_rollback(void *pData){ PgHdr *pPg = DATA_TO_PGHDR(pData); Pager *pPager = pPg->pPager; if( pPager->state!=SQLITE_WRITELOCK || pPager->journalOpen==0 ) return; if( !pPg->inJournal && (int)pPg->pgno <= pPager->origDbSize ){ assert( pPager->aInJournal!=0 ); pPager->aInJournal[pPg->pgno/8] |= 1<<(pPg->pgno&7); pPg->inJournal = 1; if( pPager->ckptOpen ){ pPager->aInCkpt[pPg->pgno/8] |= 1<<(pPg->pgno&7); pPg->inCkpt = 1; } } if( pPager->ckptOpen && !pPg->inCkpt && (int)pPg->pgno<=pPager->ckptSize ){ assert( pPg->inJournal || (int)pPg->pgno>pPager->origDbSize ); assert( pPager->aInCkpt!=0 ); pPager->aInCkpt[pPg->pgno/8] |= 1<<(pPg->pgno&7); pPg->inCkpt = 1; } } /* ** Commit all changes to the database and release the write lock. ** ** If the commit fails for any reason, a rollback attempt is made ** and an error code is returned. If the commit worked, SQLITE_OK ** is returned. |
︙ | ︙ |
Changes to src/pager.h.
︙ | ︙ | |||
9 10 11 12 13 14 15 | ** May you share freely, never taking more than you give. ** ************************************************************************* ** This header file defines the interface that the sqlite page cache ** subsystem. The page cache subsystem reads and writes a file a page ** at a time and provides a journal for rollback. ** | | | 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | ** May you share freely, never taking more than you give. ** ************************************************************************* ** This header file defines the interface that the sqlite page cache ** subsystem. The page cache subsystem reads and writes a file a page ** at a time and provides a journal for rollback. ** ** @(#) $Id: pager.h,v 1.15 2002/03/02 20:41:59 drh Exp $ */ /* ** The size of one page ** ** You can change this value to another (reasonable) power of two ** such as 512, 2048, 4096, or 8192 and things will still work. But |
︙ | ︙ | |||
61 62 63 64 65 66 67 68 69 70 71 72 73 | int sqlitepager_pagecount(Pager*); int sqlitepager_commit(Pager*); int sqlitepager_rollback(Pager*); int sqlitepager_isreadonly(Pager*); int sqlitepager_ckpt_begin(Pager*); int sqlitepager_ckpt_commit(Pager*); int sqlitepager_ckpt_rollback(Pager*); int *sqlitepager_stats(Pager*); #ifdef SQLITE_TEST void sqlitepager_refdump(Pager*); int pager_refinfo_enable; #endif | > > | 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 | int sqlitepager_pagecount(Pager*); int sqlitepager_commit(Pager*); int sqlitepager_rollback(Pager*); int sqlitepager_isreadonly(Pager*); int sqlitepager_ckpt_begin(Pager*); int sqlitepager_ckpt_commit(Pager*); int sqlitepager_ckpt_rollback(Pager*); void sqlitepager_dont_rollback(void*); void sqlitepager_dont_write(Pager*, Pgno); int *sqlitepager_stats(Pager*); #ifdef SQLITE_TEST void sqlitepager_refdump(Pager*); int pager_refinfo_enable; #endif |