Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Add a large comment to wal.c describing the WAL and wal-index file formats. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
a71a22b52f4570e934063553a81b3926 |
User & Date: | drh 2010-05-19 01:53:54.000 |
Context
2010-05-19
| ||
17:49 | Refactoring some variable names in wal.c. (check-in: 1d201ff51f user: drh tags: trunk) | |
01:53 | Add a large comment to wal.c describing the WAL and wal-index file formats. (check-in: a71a22b52f user: drh tags: trunk) | |
2010-05-18
| ||
23:29 | Update the wal-index hash format so that hash-table space is reused following a rollback, thus preventing hash table overflows. Add assert()s to verify that hash tables do not overfill. Further refactoring of the wal-index code. (check-in: ada9a8c7b6 user: drh tags: trunk) | |
Changes
Changes to src/wal.c.
1 2 3 4 5 6 7 8 9 10 11 12 | /* ** 2010 February 1 ** ** The author disclaims copyright to this source code. In place of ** a legal notice, here is a blessing: ** ** May you do good and not evil. ** May you find forgiveness for yourself and forgive others. ** May you share freely, never taking more than you give. ** ************************************************************************* ** | | | | | | | | | | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 | /* ** 2010 February 1 ** ** The author disclaims copyright to this source code. In place of ** a legal notice, here is a blessing: ** ** May you do good and not evil. ** May you find forgiveness for yourself and forgive others. ** May you share freely, never taking more than you give. ** ************************************************************************* ** ** This file contains the implementation of a write-ahead log (WAL) used in ** "journal_mode=WAL" mode. ** ** WRITE-AHEAD LOG (WAL) FILE FORMAT ** ** A wal file consists of a header followed by zero or more "frames". ** Each frame records the revised content of a single page from the ** database file. All changes to the database are recorded by writing ** frames into the WAL. Transactions commit when a frame is written that ** contains a commit marker. A single WAL can and usually does record ** multiple transactions. Periodically, the content of the WAL is ** transferred back into the database file in an operation called a ** "checkpoint". ** ** A single WAL file can be used multiple times. In other words, the ** WAL can fill up with frames and then be checkpointed and then new ** frames can overwrite the old ones. A WAL always grows from beginning ** toward the end. Checksums and counters attached to each frame are ** used to determine which frames within the WAL are valid and which ** are leftovers from prior checkpoints. ** ** The WAL header is 12 bytes in size and consists of the following three ** big-endian 32-bit unsigned integer values: ** ** 0: Database page size, ** 4: Randomly selected salt value 1, ** 8: Randomly selected salt value 2. ** ** Immediately following the header are zero or more frames. Each ** frame consists of a 16-byte header followed by a <page-size> bytes ** of page data. The header is broken into 4 big-endian 32-bit unsigned ** integer values, as follows: ** ** 0: Page number. ** 4: For commit records, the size of the database image in pages ** after the commit. For all other records, zero. ** 8: Checksum value 1. ** 12: Checksum value 2. ** ** READER ALGORITHM ** ** To read a page from the database (call it page number P), a reader ** first checks the WAL to see if it contains page P. If so, then the ** last valid instance of page P that is or is followed by a commit frame ** become the value read. If the WAL contains no copies of page P that ** are valid and which are or are followed by a commit frame, then page ** P is read from the database file. ** ** The reader algorithm in the previous paragraph works correctly, but ** because frames for page P can appear anywhere within the WAL, the ** reader has to scan the entire WAL looking for page P frames. If the ** WAL is large (multiple megabytes is typical) that scan can be slow, ** and read performance suffers. To overcome this problem, a separate ** data structure called the wal-index is maintained to expedite the ** search for frames of a particular page. ** ** WAL-INDEX FORMAT ** ** Conceptually, the wal-index is shared memory, though VFS implementations ** might choose to implement the wal-index using a mmapped file. Because ** the wal-index is shared memory, SQLite does not support journal_mode=WAL |
︙ | ︙ | |||
86 87 88 89 90 91 92 | ** The purpose of the wal-index is to answer this question quickly: Given ** a page number P, return the index of the last frame for page P in the WAL, ** or return NULL if there are no frames for page P in the WAL. ** ** The wal-index consists of a header region, followed by an one or ** more index blocks. ** | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | > > > > > > > > > > > > > > > > > > > > > > > > > > > | 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 | ** The purpose of the wal-index is to answer this question quickly: Given ** a page number P, return the index of the last frame for page P in the WAL, ** or return NULL if there are no frames for page P in the WAL. ** ** The wal-index consists of a header region, followed by an one or ** more index blocks. ** ** The wal-index header contains the total number of frames within the WAL ** in the the mxFrame field. Each index block contains information on ** HASHTABLE_NPAGE frames. Each index block contains two sections, a ** mapping which is a database page number for each frame, and a hash ** table used to look up frames by page number. The mapping section is ** an array of HASHTABLE_NPAGE 32-bit page numbers. The first entry on the ** array is the page number for the first frame; the second entry is the ** page number for the second frame; and so forth. The last index block ** holds a total of (mxFrame%HASHTABLE_NPAGE) page numbers. All index ** blocks other than the last are completely full with HASHTABLE_NPAGE ** page numbers. All index blocks are the same size; the mapping section ** of the last index block merely contains unused entries if mxFrame is ** not an even multiple of HASHTABLE_NPAGE. ** ** Even without using the hash table, the last frame for page P ** can be found by scanning the mapping sections of each index block ** starting with the last index block and moving toward the first, and ** within each index block, starting at the end and moving toward the ** beginning. The first entry that equals P corresponds to the frame ** holding the content for that page. ** ** The hash table consists of HASHTABLE_NSLOT 16-bit unsigned integers. ** HASHTABLE_NSLOT = 2*HASHTABLE_NPAGE, and there is one entry in the ** hash table for each page number in the mapping section, so the hash ** table is never more than half full. The expected number of collisions ** prior to finding a match is 1. Each entry of the hash table is an ** 1-based index of an entry in the mapping section of the same ** index block. Let K be the 1-based index of the largest entry in ** the mapping section. (For index blocks other than the last, K will ** always be exactly HASHTABLE_NPAGE (4096) and for the last index block ** K will be (mxFrame%HASHTABLE_NPAGE).) Unused slots of the hash table ** contain a value greater than K. Note that no hash table slot ever ** contains a zero value. ** ** To look for page P in the hash table, first compute a hash iKey on ** P as follows: ** ** iKey = (P * 383) % HASHTABLE_NSLOT ** ** Then start scanning entries of the hash table, starting with iKey ** (wrapping around to the beginning when the end of the hash table is ** reached) until an unused hash slot is found. Let the first unused slot ** be at index iUnused. (iUnused might be less than iKey if there was ** wrap-around.) Because the hash table is never more than half full, ** the search is guaranteed to eventually hit an unused entry. Let ** iMax be the value between iKey and iUnused, closest to iUnused, ** where aHash[iMax]==P. If there is no iMax entry (if there exists ** no hash slot such that aHash[i]==p) then page P is not in the ** current index block. Otherwise the iMax-th mapping entry of the ** current index block corresponds to the last entry that references ** page P. ** ** A hash search begins with the last index block and moves toward the ** first index block, looking for entries corresponding to page P. On ** average, only two or three slots in each index block need to be ** examined in order to either find the last entry for page P, or to ** establish that no such entry exists in the block. Each index block ** holds over 4000 entries. So two or three index blocks are sufficient ** to cover a typical 10 megabyte WAL file, assuming 1K pages. 8 or 10 ** comparisons (on average) suffice to either locate a frame in the ** WAL or to establish that the frame does not exist in the WAL. This ** is much faster than scanning the entire 10MB WAL. ** ** Note that entries are added in order of increasing K. Hence, one ** reader might be using some value K0 and a second reader that started ** at a later time (after additional transactions were added to the WAL ** and to the wal-index) might be using a different value K1, where K1>K0. ** Both readers can use the same hash table and mapping section to get ** the correct result. There may be entries in the hash table with ** K>K0 but to the first reader, those entries will appear to be unused ** slots in the hash table and so the first reader will get an answer as ** if no values greater than K0 had ever been inserted into the hash table ** in the first place - which is what reader one wants. Meanwhile, the ** second reader using K1 will see additional values that were inserted ** later, which is exactly what reader two wants. ** ** When a rollback occurs, the value of K is decreased. This has the ** effect of automatically removing entries from the hash table. */ #ifndef SQLITE_OMIT_WAL #include "wal.h" /* Object declarations */ |
︙ | ︙ | |||
108 109 110 111 112 113 114 | ** Member variables iCheck1 and iCheck2 contain the checksum for the ** last frame written to the wal, or 2 and 3 respectively if the log ** is currently empty. */ struct WalIndexHdr { u32 iChange; /* Counter incremented each transaction */ u32 pgsz; /* Database page size in bytes */ | | | 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 | ** Member variables iCheck1 and iCheck2 contain the checksum for the ** last frame written to the wal, or 2 and 3 respectively if the log ** is currently empty. */ struct WalIndexHdr { u32 iChange; /* Counter incremented each transaction */ u32 pgsz; /* Database page size in bytes */ u32 mxFrame; /* Index of last valid frame in the WAL */ u32 nPage; /* Size of database in pages */ u32 iCheck1; /* Checkpoint value 1 */ u32 iCheck2; /* Checkpoint value 2 */ }; /* Size of serialized WalIndexHdr object. */ #define WALINDEX_HDR_NFIELD (sizeof(WalIndexHdr) / sizeof(u32)) |
︙ | ︙ | |||
616 617 618 619 620 621 622 | rc = walIndexAppend(pWal, ++iFrame, pgno); if( rc!=SQLITE_OK ) break; /* If nTruncate is non-zero, this is a commit record. */ if( nTruncate ){ hdr.iCheck1 = aCksum[0]; hdr.iCheck2 = aCksum[1]; | | | | 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 | rc = walIndexAppend(pWal, ++iFrame, pgno); if( rc!=SQLITE_OK ) break; /* If nTruncate is non-zero, this is a commit record. */ if( nTruncate ){ hdr.iCheck1 = aCksum[0]; hdr.iCheck2 = aCksum[1]; hdr.mxFrame = iFrame; hdr.nPage = nTruncate; hdr.pgsz = nPgsz; } } sqlite3_free(aFrame); }else{ hdr.iCheck1 = 2; hdr.iCheck2 = 3; } finished: if( rc==SQLITE_OK && hdr.mxFrame==0 ){ rc = walIndexRemap(pWal, WALINDEX_MMAP_INCREMENT); } if( rc==SQLITE_OK ){ walIndexWriteHdr(pWal, &hdr); memcpy(&pWal->hdr, &hdr, sizeof(hdr)); } return rc; |
︙ | ︙ | |||
837 838 839 840 841 842 843 | int i; /* Iterator variable */ int nFinal; /* Number of unindexed entries */ u8 *aTmp; /* Temp space used by merge-sort */ int rc; /* Return code of walIndexMap() */ u8 *aSpace; /* Surplus space on the end of the allocation */ /* Make sure the wal-index is mapped into local memory */ | | | | 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 | int i; /* Iterator variable */ int nFinal; /* Number of unindexed entries */ u8 *aTmp; /* Temp space used by merge-sort */ int rc; /* Return code of walIndexMap() */ u8 *aSpace; /* Surplus space on the end of the allocation */ /* Make sure the wal-index is mapped into local memory */ rc = walIndexMap(pWal, walMappingSize(pWal->hdr.mxFrame)); if( rc!=SQLITE_OK ){ return rc; } /* This routine only runs while holding SQLITE_SHM_CHECKPOINT. No other ** thread is able to write to shared memory while this routine is ** running (or, indeed, while the WalIterator object exists). Hence, ** we can cast off the volatile qualifacation from shared memory */ assert( pWal->lockState==SQLITE_SHM_CHECKPOINT ); aData = (u32*)pWal->pWiData; /* Allocate space for the WalIterator object */ iLast = pWal->hdr.mxFrame; nSegment = (iLast >> 8) + 1; nFinal = (iLast & 0x000000FF); nByte = sizeof(WalIterator) + (nSegment+1)*(sizeof(struct WalSegment)+256); p = (WalIterator *)sqlite3_malloc(nByte); if( !p ){ return SQLITE_NOMEM; } |
︙ | ︙ | |||
911 912 913 914 915 916 917 | int pgsz = pWal->hdr.pgsz; /* Database page-size */ WalIterator *pIter = 0; /* Wal iterator context */ u32 iDbpage = 0; /* Next database page to write */ u32 iFrame = 0; /* Wal frame containing data for iDbpage */ /* Allocate the iterator */ rc = walIteratorInit(pWal, &pIter); | | | 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 | int pgsz = pWal->hdr.pgsz; /* Database page-size */ WalIterator *pIter = 0; /* Wal iterator context */ u32 iDbpage = 0; /* Next database page to write */ u32 iFrame = 0; /* Wal frame containing data for iDbpage */ /* Allocate the iterator */ rc = walIteratorInit(pWal, &pIter); if( rc!=SQLITE_OK || pWal->hdr.mxFrame==0 ){ goto out; } if( pWal->hdr.pgsz!=nBuf ){ rc = SQLITE_CORRUPT_BKPT; goto out; } |
︙ | ︙ | |||
945 946 947 948 949 950 951 | if( rc!=SQLITE_OK ) goto out; /* Sync the database file. If successful, update the wal-index. */ if( sync_flags ){ rc = sqlite3OsSync(pWal->pDbFd, sync_flags); if( rc!=SQLITE_OK ) goto out; } | | | 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 | if( rc!=SQLITE_OK ) goto out; /* Sync the database file. If successful, update the wal-index. */ if( sync_flags ){ rc = sqlite3OsSync(pWal->pDbFd, sync_flags); if( rc!=SQLITE_OK ) goto out; } pWal->hdr.mxFrame = 0; pWal->hdr.iCheck1 = 2; pWal->hdr.iCheck2 = 3; walIndexWriteHdr(pWal, &pWal->hdr); /* TODO: If a crash occurs and the current log is copied into the ** database there is no problem. However, if a crash occurs while ** writing the next transaction into the start of the log, such that: |
︙ | ︙ | |||
1202 1203 1204 1205 1206 1207 1208 | Pgno pgno, /* Database page number to read data for */ int *pInWal, /* OUT: True if data is read from WAL */ int nOut, /* Size of buffer pOut in bytes */ u8 *pOut /* Buffer to write page data to */ ){ int rc; /* Return code */ u32 iRead = 0; /* If !=0, WAL frame to return data from */ | | | 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 | Pgno pgno, /* Database page number to read data for */ int *pInWal, /* OUT: True if data is read from WAL */ int nOut, /* Size of buffer pOut in bytes */ u8 *pOut /* Buffer to write page data to */ ){ int rc; /* Return code */ u32 iRead = 0; /* If !=0, WAL frame to return data from */ u32 iLast = pWal->hdr.mxFrame; /* Last page in WAL for this reader */ int iHash; /* Used to loop through N hash tables */ /* If the "last page" field of the wal-index header snapshot is 0, then ** no data will be read from the wal under any circumstances. Return early ** in this case to avoid the walIndexMap/Unmap overhead. */ if( iLast==0 ){ |
︙ | ︙ | |||
1378 1379 1380 1381 1382 1383 1384 | ** Otherwise, if the callback function does not return an error, this ** function returns SQLITE_OK. */ int sqlite3WalUndo(Wal *pWal, int (*xUndo)(void *, Pgno), void *pUndoCtx){ int rc = SQLITE_OK; if( pWal->lockState==SQLITE_SHM_WRITE ){ int unused; | | | | | | 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 | ** Otherwise, if the callback function does not return an error, this ** function returns SQLITE_OK. */ int sqlite3WalUndo(Wal *pWal, int (*xUndo)(void *, Pgno), void *pUndoCtx){ int rc = SQLITE_OK; if( pWal->lockState==SQLITE_SHM_WRITE ){ int unused; Pgno iMax = pWal->hdr.mxFrame; Pgno iFrame; assert( pWal->pWiData==0 ); rc = walIndexReadHdr(pWal, &unused); for(iFrame=pWal->hdr.mxFrame+1; rc==SQLITE_OK && iFrame<=iMax; iFrame++){ assert( pWal->lockState==SQLITE_SHM_WRITE ); rc = xUndo(pUndoCtx, pWal->pWiData[walIndexEntry(iFrame)]); } walIndexUnmap(pWal); } return rc; } /* Return an integer that records the current (uncommitted) write ** position in the WAL */ u32 sqlite3WalSavepoint(Wal *pWal){ assert( pWal->lockState==SQLITE_SHM_WRITE ); return pWal->hdr.mxFrame; } /* Move the write position of the WAL back to iFrame. Called in ** response to a ROLLBACK TO command. */ int sqlite3WalSavepointUndo(Wal *pWal, u32 iFrame){ int rc = SQLITE_OK; u8 aCksum[8]; assert( pWal->lockState==SQLITE_SHM_WRITE ); pWal->hdr.mxFrame = iFrame; if( iFrame>0 ){ i64 iOffset = walFrameOffset(iFrame, pWal->hdr.pgsz) + sizeof(u32)*2; rc = sqlite3OsRead(pWal->pWalFd, aCksum, sizeof(aCksum), iOffset); pWal->hdr.iCheck1 = sqlite3Get4byte(&aCksum[0]); pWal->hdr.iCheck2 = sqlite3Get4byte(&aCksum[4]); } |
︙ | ︙ | |||
1449 1450 1451 1452 1453 1454 1455 | assert( pWal->pWiData==0 ); /* If this is the first frame written into the log, write the WAL ** header to the start of the WAL file. See comments at the top of ** this source file for a description of the WAL header format. */ assert( WAL_FRAME_HDRSIZE>=WAL_HDRSIZE ); | | | 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 | assert( pWal->pWiData==0 ); /* If this is the first frame written into the log, write the WAL ** header to the start of the WAL file. See comments at the top of ** this source file for a description of the WAL header format. */ assert( WAL_FRAME_HDRSIZE>=WAL_HDRSIZE ); iFrame = pWal->hdr.mxFrame; if( iFrame==0 ){ sqlite3Put4byte(aFrame, nPgsz); sqlite3_randomness(8, &aFrame[4]); pWal->hdr.iCheck1 = sqlite3Get4byte(&aFrame[4]); pWal->hdr.iCheck2 = sqlite3Get4byte(&aFrame[8]); rc = sqlite3OsWrite(pWal->pWalFd, aFrame, WAL_HDRSIZE, 0); if( rc!=SQLITE_OK ){ |
︙ | ︙ | |||
1521 1522 1523 1524 1525 1526 1527 | assert( pWal->pWiData==0 ); /* Append data to the wal-index. It is not necessary to lock the ** wal-index to do this as the SQLITE_SHM_WRITE lock held on the wal-index ** guarantees that there are no other writers, and no data that may ** be in use by existing readers is being overwritten. */ | | | | 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 | assert( pWal->pWiData==0 ); /* Append data to the wal-index. It is not necessary to lock the ** wal-index to do this as the SQLITE_SHM_WRITE lock held on the wal-index ** guarantees that there are no other writers, and no data that may ** be in use by existing readers is being overwritten. */ iFrame = pWal->hdr.mxFrame; for(p=pList; p && rc==SQLITE_OK; p=p->pDirty){ iFrame++; rc = walIndexAppend(pWal, iFrame, p->pgno); } while( nLast>0 && rc==SQLITE_OK ){ iFrame++; nLast--; rc = walIndexAppend(pWal, iFrame, pLast->pgno); } if( rc==SQLITE_OK ){ /* Update the private copy of the header. */ pWal->hdr.pgsz = nPgsz; pWal->hdr.mxFrame = iFrame; if( isCommit ){ pWal->hdr.iChange++; pWal->hdr.nPage = nTruncate; } pWal->hdr.iCheck1 = aCksum[0]; pWal->hdr.iCheck2 = aCksum[1]; |
︙ | ︙ |