Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Add some more to fileio.html. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
3a3fd027228fcf4538f8834be9cc3067 |
User & Date: | dan 2008-09-23 18:06:48.000 |
Context
2008-09-24
| ||
17:49 | Fix a bug in the CREATE TRIGGER documentation. (check-in: 2f8b2b9ead user: drh tags: trunk) | |
2008-09-23
| ||
18:06 | Add some more to fileio.html. (check-in: 3a3fd02722 user: dan tags: trunk) | |
2008-09-22
| ||
12:42 | Add the observation that the unary "+" operator destroys type affinity. (check-in: b46eae7d62 user: drh tags: trunk) | |
Changes
Changes to images/fileformat/rtdocs.css.
︙ | ︙ | |||
57 58 59 60 61 62 63 | .todo:before { content: 'TODO:' } p.todo { border: solid #AA3333 1px; padding: 1ex } cite { font-style: normal; font-weight: normal } cite a { color: inherit; text-decoration: none } :link:hover,:visited:hover { background: wheat } | > | > > | 57 58 59 60 61 62 63 64 65 66 67 | .todo:before { content: 'TODO:' } p.todo { border: solid #AA3333 1px; padding: 1ex } cite { font-style: normal; font-weight: normal } cite a { color: inherit; text-decoration: none } :link:hover,:visited:hover { background: wheat } img { display:block; page-break-inside: avoid; } |
Changes to pages/fileio.in.
︙ | ︙ | |||
84 85 86 87 88 89 90 | Although it would be easy to design a system that uses the <i>VFS</i> interface to read and update the content of a database file stored within a file-system, there are several complicated issues that need to be addressed by such a system: <ol> <li><p>SQLite is required to <b>implement atomic and durable | | | | | | | | | | | 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 | Although it would be easy to design a system that uses the <i>VFS</i> interface to read and update the content of a database file stored within a file-system, there are several complicated issues that need to be addressed by such a system: <ol> <li><p>SQLite is required to <b>implement atomic and durable transactions</b> (the 'A' and 'D' from the ACID acronym), even if an application, operating system or power failure occurs midway through or shortly after updating a database file. <p>To implement atomic transactions in the face of potential application, operating system or power failures, database writers write a copy of those portions of the database file that they are going to modify into a second file, the <i>journal file</i>, before writing to the database file. If a failure does occur while modifying the database file, SQLite can reconstruct the original database (before the modifications were attempted) based on the contents of the <i>journal file</i>. <li><p>SQLite is required to <b>implement isolated transactions</b> (the 'I' from the ACID acronym). <p>This is done by using the file locking facililities provided by the VFS adaptor to serialize writers (write transactions) and preventing readers (read transactions) from accessing database files while writers are midway through updating them. <li><p>For performance reasons, it is advantageous to <b>minimize the quantity of data read and written</b> to and from the file-system. <p>As one might expect, the amount of data read from the database file is minimized by caching portions of the database file in main |
︙ | ︙ | |||
355 356 357 358 359 360 361 | of the following four types of operation: <ul> <li> <b>Create file</b> operations. SQLite may create new files within the file-system by invoking the xOpen() method of the sqlite3_io_methods object. <li> <b>Delete file</b> operations. SQLite may remove files from the | | | 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 | of the following four types of operation: <ul> <li> <b>Create file</b> operations. SQLite may create new files within the file-system by invoking the xOpen() method of the sqlite3_io_methods object. <li> <b>Delete file</b> operations. SQLite may remove files from the file system by calling the xDelete() method of the sqlite3_io_methods object. <li> <b>Truncate file</b> operations. SQLite may truncate existing files by invoking the xTruncate() method of the sqlite3_file object. <li> <b>Write file</b> operations. SQLite may modify the contents and increase the size of a file by files by invoking the xWrite() method of the sqlite3_file object. |
︙ | ︙ | |||
645 646 647 648 649 650 651 | <p> Opening a new database connection is a two-step process: <ol> <li> A file-handle is opened on the database file. <li> If step 1 was successful, an attempt is made to read the | | | 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 | <p> Opening a new database connection is a two-step process: <ol> <li> A file-handle is opened on the database file. <li> If step 1 was successful, an attempt is made to read the <i>database file header</i> from the database file using the new file-handle. </ol> <p> In step 2 of the procedure above, the database file is not locked before it is read from. This is the only exception to the locking rules described in section <cite>reading_data</cite>. |
︙ | ︙ | |||
726 727 728 729 730 731 732 | <i>read/write transaction</i>) on the database before data may be read from the database file. <p> The two exceptions are: <ul> <li> When an attempt is made to read the 100 byte <i>database file | | | | | | 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 | <i>read/write transaction</i>) on the database before data may be read from the database file. <p> The two exceptions are: <ul> <li> When an attempt is made to read the 100 byte <i>database file header</i> immediately after opening the <i>database connection</i> (see section <cite>open_new_connection</cite>). When this occurs no lock is held on the database file. <li> Data read while in the process of opening a read-only transaction (see section <cite>open_read_only_trans</cite>). These read operations occur after a <i>shared lock</i> is held on the database file. </ul> <p> Once a transaction has been opened, reading data from a database connection is a simple operation. Using the xRead() method of the file-handle open on the database file, the required database file pages are read one at a time. SQLite never reads partial pages and |
︙ | ︙ | |||
814 815 816 817 818 819 820 | <ol> <li>A <i>shared-lock</i> is obtained on the database file. <li>The connection checks if a <i>hot journal file</i> exists in the file-system. If one does, then it is rolled back before continuing. <li>The connection checks if the data in the <i>page cache</i> may still be trusted. If not, all page cache data is discarded. <li>If the file-size is not zero bytes and the page cache does not | | | 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 | <ol> <li>A <i>shared-lock</i> is obtained on the database file. <li>The connection checks if a <i>hot journal file</i> exists in the file-system. If one does, then it is rolled back before continuing. <li>The connection checks if the data in the <i>page cache</i> may still be trusted. If not, all page cache data is discarded. <li>If the file-size is not zero bytes and the page cache does not contain valid data for the first page of the database, then the data for the first page must be read from the database. </ol> <p> Of course, an error may occur while attempting any of the 4 steps enumerated above. If this happens, then the <i>shared-lock</i> is released (if it was obtained) and an error returned to the user. |
︙ | ︙ | |||
908 909 910 911 912 913 914 | complex. The following steps take place: <ol class=spacedlist> <li>Using the VFS xAccess() method, SQLite queries the file-system to see if the journal file associated with the database exists. If it does not, then there is no hot-journal file. | | | | | | | | | | | | | | | > > | 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 | complex. The following steps take place: <ol class=spacedlist> <li>Using the VFS xAccess() method, SQLite queries the file-system to see if the journal file associated with the database exists. If it does not, then there is no hot-journal file. <li>By invoking the xCheckReservedLock() method of the file-handle opened on the database file, SQLite checks if some other connection holds a <i>reserved lock</i> or greater. If some other connection does hold a <i>reserved lock</i>, this indicates that the other connection is midway through a <i>read/write transaction</i> (see section <cite>writing_data</cite>). In this case the <i>journal file</i> is not a <i>hot-journal</i> and must not be rolled back. <li>Using the xFileSize() method of the file-handle opened on the database file, SQLite checks if the database file is 0 bytes in size. If it is, the journal file is not considered to be a <i>hot journal</i> file. Instead of rolling back the journal file, in this case it is deleted from the file-system by calling the VFS xDelete() method. <span class=todo>Technically, there is a race condition here. This step should be moved to after the exclusive lock is held.</span> <li>An attempt is made to upgrade to an <i>exclusive lock</i> on the database file. If the attempt fails, then all locks, including the recently obtained <i>shared lock</i> are dropped. The attempt to open a <i>read-only transaction</i> has failed. This occurs when some other connection is also attempting to open a <i>read-only transaction</i> and the attempt to gain the <i>exclusive lock</i> fails because the other connection is also holding a <i>shared lock</i>. It is left to the other connection to roll back the <i>hot journal</i>. <div style="margin-top:0.5em"></div> It is important that the file-handle lock is upgraded directly from <i>shared</i> to <i>exclusive</i> in this case, instead of first upgrading to <i>reserved</i> or </i>pending</i> locks as is required when obtaining an <i>exclusive lock</i> to write to the database file (section <cite>writing_data</cite>). If SQLite were to first upgrade to a <i>reserved</i> or <i>pending</i> lock in this scenario, then a second process also trying to open a <i>read-transaction</i> on the database file might detect the <i>reserved</i> lock in step 2 of this process, conclude that there was no <i>hot journal</i>, and commence reading data from the <i>database file</i>. <li>The xAccess() method is invoked again to detect if the journal file is still in the file system. If it is, then it is a hot-journal file and SQLite tries to roll it back (see section <cite>rollback</cite>). </ol> <p class=todo> Master journal file pointers? <p> The following requirements describe step 1 of the above procedure in more detail. REQ H21014 When required to attempt to detect a <i>hot-journal file</i>, SQLite |
︙ | ︙ | |||
1117 1118 1119 1120 1121 1122 1123 | a <i>database connection</i> unlocks the database file, it stores the current value of the <i>file change counter</i>. Later, while opening a new <i>read-only transaction</i>, SQLite checks the value of the <i>file change counter</i> stored in the database file. If the value has not changed since the database file was unlocked, then the <i>page cache</i> entries can be trusted. If the value has changed, then the <i>page cache</i> entries cannot be trusted and all entries associated with | | | 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 | a <i>database connection</i> unlocks the database file, it stores the current value of the <i>file change counter</i>. Later, while opening a new <i>read-only transaction</i>, SQLite checks the value of the <i>file change counter</i> stored in the database file. If the value has not changed since the database file was unlocked, then the <i>page cache</i> entries can be trusted. If the value has changed, then the <i>page cache</i> entries cannot be trusted and all entries associated with the current <i>database connection</i> are discarded. REQ H21018 When a file-handle open on a database file is unlocked, if the <i>page cache</i> contains one or more entries belonging to the associated <i>database connection</i>, SQLite shall store the value of the <i>file change counter</i> internally. |
︙ | ︙ | |||
1215 1216 1217 1218 1219 1220 1221 | calling the xUnlock() method of the file-handle. <p> See also requirements H21018 and H21021 above. <h1 id=writing_data>Writing Data</h1> <p> | > | > | < > > | < > > > > > | > > | | < | | | | > > > | | | > > > | > > | | > | > | < < | | | < | < | | | | < > > | | < | > | > | < | | | | < | > > > > > > > > > | | > > | 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 | calling the xUnlock() method of the file-handle. <p> See also requirements H21018 and H21021 above. <h1 id=writing_data>Writing Data</h1> <p> Using DDL or DML SQL statements, SQLite users may modify the contents and size of a database file. From the point of view of the sub-systems described in this document, each DDL or DML statement executed results in the contents of zero or more of the pages that make up the database file (see <cite>ff_sqlitert_requirements</cite>) being overwritten with new data. A DDL or DML statement may also append or truncate one or more pages to or from the end of the database file. One or more DDL and/or DML statements are grouped together to make up a single <i>write transaction</i>. A <i>write transaction</i> is required to have the special properties described in section <cite>overview</cite>: a <i>write transaction</i> must be isolated, durable and atomic. <p> SQLite accomplishes these goals using the following techniques: <ul> <li><p> To ensure that <i>write transactions</i> are <b>isolated</b>, before beginning to modify the contents of the database file to reflect the results of a <i>write transaction</i>, SQLite obtains an <i>exclusive lock</i> on the <i>database file</i>. The lock is not relinquished until the <i>write transaction</i> is concluded. Because reading from the <i>database file</i> requires a <i>shared lock</i> (see section <cite>reading_data</cite>), and because holding an <i>exclusive lock</i> guarantees that no other <i>database connection</i> is holding or can obtain a <i>shared lock</i>, this ensures that no other connection may read data from the <i>database file</i> at a point when a <i>write transaction</i> has been partially applied. <li><p> So that <i>write transactions</i> are <b>durable</b> in the face of a system failure, SQLite executes a <i>sync file</i> operation on the database file before concluding the <i>write transaction</i> <li><p>Ensuring that <i>write transactions</i> are <b>atomic</b> is the most complex task required of the system. In this case, <i>atomic</i> means that even if a system failure occurs, an attempt to commit a <i>write transaction</i> to the database file either results in all changes that are a part of the transaction being successfully applied to the database file, or none of the changes are successfully applied. There is no chance that a subset of the changes only are applied. Hence from the point of view of an external observer, the <i>write transaction</i> appears to be an atomic event. <p> Of course, it is usually not possible to atomically apply all the changes required by a <i>write transaction</i> to a database file within the file-system. For example, if a <i>write transaction</i> requires ten pages of a database file to be modified, and a power outage causes a system failure after sqlite has modified only five pages, then the database file will almost certainly be in an inconsistent state following system recovery. <p> SQLite solves this problem by using a <i>journal file</i>. In almost all cases, before the <i>database file</i> is modified in any way, SQLite stores sufficient information in the <i>journal file</i> to allow the original the database file to be reconstructed if a system failure occurs while the database file is being updated to reflect the modifications made by the <i>write transaction</i>. Each time SQLite opens a database file, it checks if such a system failure has occured and, if so, reconstructs the database file based on the contents of the journal file. The procedure used to detect whether or not this process, coined <i>hot journal rollback</i>, is required is described in section <cite>hot_journal_detection</cite>. <i>Hot journal rollback </i> itself is described in section <cite>hot_journal_rollback</cite>. <p> The same technique ensures that an SQLite database file cannot be corrupted by a system failure that occurs at an inopportune moment. If a system failure does occur before SQLite has had a chance to execute sufficient <i>sync file</i> operations to ensure that the changes that make up a <i>write transaction</i> have made it safely to persistent storage, then the <i>journal file</i> will be used to restore the database to a known good state following system recovery. </ul> <p> The <i>page cache</i> is used to buffer modifications to the database file image before they are written to the <i>database file</i>. Often, all changes for an entire <i>write transaction</i> are accumulated within the <i>page cache</i>. In this case no write operations are performed on the database file until the user commits the transaction. <p class=todo> this section is suspect from this point on. <p> Even if an application or system failure does not occur while a <i>write transaction</i> is in progress, a rollback operation to restore the database file to the state that it was in before the transaction started may be required. This may occur if the user explicitly requests transaction rollback (i.e. by issuing a "ROLLBACK" command), or |
︙ | ︙ | |||
1343 1344 1345 1346 1347 1348 1349 1350 | entire <i>write transaction</i>. This is intended to be illustrative only, many operations are omitted. <center><img src="images/fileformat/write_transaction.gif"> <p><i>Figure <span class=fig id=figure_write_transaction></span> - Progression of a Write Transaction</i> </center> | > > > | > > > > > > > > > > > > | | > > > | | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 | entire <i>write transaction</i>. This is intended to be illustrative only, many operations are omitted. <center><img src="images/fileformat/write_transaction.gif"> <p><i>Figure <span class=fig id=figure_write_transaction></span> - Progression of a Write Transaction</i> </center> <h2>Journal File Format</h2> <p> This section describes the format used by an SQLite <i>journal file</i>. <p> A journal file consists of one or more <i>journal headers</i>, zero or more <i>journal records</i> and optionally a <i>master journal pointer</i>. Each journal file always begins with a <i>journal header</i>, followed by zero or more <i>journal records</i>. Following this may be a second <i>journal header</i> followed by a seconds set of zero or more <i>journal records</i> and so on. There is no limit to the number of <i>journal headers</i> a journal file may contain. Following the <i>journal headers</i> and their accompanying sets of <i>journal records</i> may be the optional <i>master journal pointer</i>. Or, the file may simply end. <h3 id=journal_header_format>Journal Header Format</h3> <p> A <i>journal header</i> is <i>sector-size</i> bytes in size, where <i> sector-size</i> is the value returned by the xSectorSize method of the file handle opened on the database file. Only the first 28 bytes of the <i>journal header</i> are used, the remainder may contain garbage data. The first 28 bytes of each <i>journal header</i> consists of an eight byte block set to a well-known value, followed by five big-endian 32-bit unsigned integer fields. <center><img src="images/fileformat/journal_header.gif"> <p><i>Figure <span class=fig id=figure_journal_header></span> - Journal Header Format</i> </center> <p> Figure <cite>figure_journal_header</cite> graphically depicts the layout of a <i>journal header</i>. The individual fields are described in the following table. The offsets in the 'byte offset' column of the table are relative to the start of the <i>journal header</i>. <table class=striped> <tr><th>Byte offset<th>Size in bytes<th width=100%>Description <tr><td>0<td>8<td>The <b>journal magic</b> field always contains a well-known 8-byte string value used to identify SQLite journal files. The well-known sequence of byte values is: <pre>0xd9 0xd5 0x05 0xf9 0x20 0xa1 0x63 0xd7</pre> <tr><td>8<td>4<td>This field, the <b>record count</b>, is set to the number of <i>journal records</i> that follow this <i>journal header</i> in the <i>journal file</i>. <tr><td>12<td>4<td>This field, the <b>database page count</b>, is set to the number of pages that the <i>database file</i> contained before any modifications associated with <i>write transaction</i> are applied. <tr><td>16<td>4<td>The <b>checksum initializer</b> field is set to a pseudo-random value. It is used as part of the algorithm to calculate the checksum for all <i>journal records</i> that follow this <i>journal header</i>. <tr><td>20<td>4<td>This field, the <b>sector size</b>, is set to the <i>sector size</i> of the device on which the <i>journal file</i> was created, in bytes. This value is required when reading the journal file to determine the size of each <i>journal header</i>. <tr><td>24<td>4<td>The <b>page size</b> field contains the database page size used by the corresponding <i>database file</i> when the <i>journal file</i> was created, in bytes. </table> <p> All <i>journal headers</i> are positioned in the file so that they start at a <i>sector size</i> aligned offset. To achieve this, unused space may be left between the start of the second and subsequent <i>journal headers</i> and the end of the <i>journal records</i> associated with the previous header. <h3 id=journal_record_format>Journal Record Format</h3> <p> Each <i>journal record</i> contains the original data for a database page modified by the <i>write transaction</i>. If a rollback is required, then this data may be used to restore the contents of the database page to the state it was in before the <i>write transaction</i> was started. <center><img src="images/fileformat/journal_record.gif"> <p><i>Figure <span class=fig id=figure_journal_record></span> - Journal Record Format</i> </center> <p> A <i>journal record</i>, depicted graphically by figure <cite>figure_journal_record</cite>, contains three fields, as described in the following table. Byte offsets are relative to the start of the <i>journal record</i>. <table class=striped> <tr><th>Byte offset<th>Size in bytes<th width=100%>Description <tr><td>0<td>4<td>The page number of the database page associated with this <i>journal record</i>, stored as a 4 byte big-endian unsigned integer. <tr><td>4<td><i>page-size<td> This field contains the original data for the page, exactly as it appeared in the database file before the <i>write transaction</i> began. <tr><td style="white-space: nowrap">4 + <i>page-size</i><td>4<td> </table> <h3>Master Journal Pointer</h3> <h2>Write Transactions</h2> <h3>Beginning a Write Transaction</h3> <p> Before any database pages may be modified within the <i>page cache</i>, the <i>database connection</i> must open a <i>write transaction</i>. |
︙ | ︙ |