SQLite

Check-in [e05a7a552f]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Update the documentation for the new journal format to be introduced in version 2.8.0. (CVS 865)
Downloads: Tarball | ZIP archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: e05a7a552f694158ee449d8682f5c137f1c2f2ac
User & Date: drh 2003-02-13 02:54:03.000
Context
2003-02-13
02:54
Update the documentation for the new journal format to be introduced in version 2.8.0. (CVS 1729) (check-in: dc53d92141 user: drh tags: trunk)
02:54
Update the documentation for the new journal format to be introduced in version 2.8.0. (CVS 865) (check-in: e05a7a552f user: drh tags: trunk)
01:58
Fix a bug in the rollback logic for the new journal format. (CVS 864) (check-in: 7c22aa3f81 user: drh tags: trunk)
Changes
Unified Diff Ignore Whitespace Patch
Changes to www/changes.tcl.
20
21
22
23
24
25
26






27
28
29
30
31
32
33
}


proc chng {date desc} {
  puts "<DT><B>$date</B></DT>"
  puts "<DD><P><UL>$desc</UL></P></DD>"
}







chng {2003 Jan 25 (2.7.6)} {
<li>Performance improvements.  The library is now much faster.</li>
<li>Added the <b>sqlite_set_authorizer()</b> API.  Formal documentation has
    not been written - see the source code comments for instructions on
    how to use this function.</li>
<li>Fix a bug in the GLOB operator that was preventing it from working







>
>
>
>
>
>







20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
}


proc chng {date desc} {
  puts "<DT><B>$date</B></DT>"
  puts "<DD><P><UL>$desc</UL></P></DD>"
}

chng {2003 Feb 14 (2.8.0)} {
<li>Modified the journal file format to make it more resistant to corruption
    that can occur after an OS crash or power failure.</li>
<li>Added a new C/C++ API that does not use callback for returning data.</li>
}

chng {2003 Jan 25 (2.7.6)} {
<li>Performance improvements.  The library is now much faster.</li>
<li>Added the <b>sqlite_set_authorizer()</b> API.  Formal documentation has
    not been written - see the source code comments for instructions on
    how to use this function.</li>
<li>Fix a bug in the GLOB operator that was preventing it from working
Changes to www/fileformat.tcl.
1
2
3
4
5
6
7
8
9
10
11
#
# Run this script to generated a fileformat.html output file
#
set rcsid {$Id: fileformat.tcl,v 1.7 2003/02/12 14:09:45 drh Exp $}

puts {<html>
<head>
  <title>SQLite Database File Format</title>
</head>
<body bgcolor="white">
<h1 align="center">



|







1
2
3
4
5
6
7
8
9
10
11
#
# Run this script to generated a fileformat.html output file
#
set rcsid {$Id: fileformat.tcl,v 1.8 2003/02/13 02:54:04 drh Exp $}

puts {<html>
<head>
  <title>SQLite Database File Format</title>
</head>
<body bgcolor="white">
<h1 align="center">
68
69
70
71
72
73
74
75
76
77
78
79





80
81
82
83



84
85
86



87





88
89
90
91
92
93
94
95
96
97
98
99
100

101
102
103
104
105
106
107
108
109

















110
111
112
113
114
115


116
117
118
119
120

121
122
123
124
125
126
127
main database file.  As far as the pager is concerned, each page
contains 1024 bytes of arbitrary data.  But there is structure to
the journal file.
</p>

<p>
A journal file begins with 8 bytes as follows:
0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, and 0xd5.
Processes that are attempting to rollback a journal use these 8 bytes
as a sanity check to make sure the file they think is a journal really
is a valid journal.  There is no significance to the choice of
bytes here - the values were obtained from /dev/random. 





</p>

<p>
Following the 8 byte prefix is a single 4-byte integer that is the



original size of the main database file before the transaction was
started.  The main database file is truncated back to this size
as part of the rollback process.



The size is expressed in pages (1024 bytes per page) and is





a big-endian number.  That means that the most significant byte
occurs first.  All multi-byte integers in the journal file are
written as big-endian numbers.  That way, a journal file that is
originally created on one machine can be rolled back by another
machine that uses a different byte order.  So, for example, a
transaction that failed to complete on your big-endian SparcStation
can still be rolled back on your little-endian Linux box.
</p>

<p>
After the 8-byte prefix and the 4-byte initial database size, the
journal file consists of zero or more page records.  Each page
record is a 4-byte (big-endian) page number followed by 1024 bytes

of data.  The data is the original content of the database page
before the transaction was started.  So to roll back the transaction,
the data is simply written into the corresponding page of the
main database file.  Pages can appear in the journal in any order,
but they are guaranteed to appear only once. All page numbers will be
between 1 and the maximum specified by the page size integer that
appeared at the beginning of the journal.
</p>


















<p>
Here is a summary of the journal file format:
</p>

<ul>
<li>8 byte prefix: 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, x0d5</li>


<li>4 byte initial database page count, big-endian.</li>
<li>Zero or more instances of the following:
   <ul>
   <li>4 byte page number - big-endian</li>
   <li>1024 bytes of original data for the page</li>

   </ul>
</li>
</ul>

<h2>3.0 &nbsp; The B-Tree Layer</h2>

<p>







|


|
|
>
>
>
>
>



|
>
>
>

|
|
>
>
>
|
>
>
>
>
>
|
<
|







|


>
|








>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>





|
>
>
|


|

>







68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104

105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
main database file.  As far as the pager is concerned, each page
contains 1024 bytes of arbitrary data.  But there is structure to
the journal file.
</p>

<p>
A journal file begins with 8 bytes as follows:
0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, and 0xd6.
Processes that are attempting to rollback a journal use these 8 bytes
as a sanity check to make sure the file they think is a journal really
is a valid journal.  Prior version of SQLite used different journal
file formats.  The magic numbers for these prior formats is differ
so that if a new version of the library attempts to rollback a journal
created by an earlier version, it can detect that the journal uses
an obsolete format and make the necessary adjustments.  This article
describes only the newest journal format - supported as of version
2.8.0.
</p>

<p>
Following the 8 byte prefix is a three 4-byte integers that tell us
the number of pages that have been committed to the journal,
a magic number used for
sanity checking each page, and the
original size of the main database file before the transaction was
started.  The number of committed pages is used to limit how far
into the journal to read.  The use of the checksum magic number is
described below.
The original size of the database is used to restore the database
file back to its original size.
The size is expressed in pages (1024 bytes per page).
</p>

<p>
All three integers in the journal header and all other multi-byte
numbers used in the journal file are big-endian.
That means that the most significant byte

occurs first.  That way, a journal file that is
originally created on one machine can be rolled back by another
machine that uses a different byte order.  So, for example, a
transaction that failed to complete on your big-endian SparcStation
can still be rolled back on your little-endian Linux box.
</p>

<p>
After the 8-byte prefix and the three 4-byte integers, the
journal file consists of zero or more page records.  Each page
record is a 4-byte (big-endian) page number followed by 1024 bytes
of data and a 4-byte checksum.  
The data is the original content of the database page
before the transaction was started.  So to roll back the transaction,
the data is simply written into the corresponding page of the
main database file.  Pages can appear in the journal in any order,
but they are guaranteed to appear only once. All page numbers will be
between 1 and the maximum specified by the page size integer that
appeared at the beginning of the journal.
</p>

<p>
The so-called checksum at the end of each record is not really a
checksum - it is the sum of the page number and the magic number which
was the second integer in the journal header.  The purpose of this
value is to try to detect journal corruption that might have occurred
because of a power loss or OS crash that occurred which the journal
file was being written to disk.  It could have been the case that the
meta-data for the journal file, specifically the size of the file, had
been written to the disk so that when the machine reboots it appears that
file is large enough to hold the current record.  But even though the
file size has changed, the data for the file might not have made it to
the disk surface at the time of the OS crash or power loss.  This means
that after reboot, the end of the journal file will contain quasi-random
garbage data.  The checksum is an attempt to detect such corruption.  If
the checksum does not match, that page of the journal is not rolled back.
</p>

<p>
Here is a summary of the journal file format:
</p>

<ul>
<li>8 byte prefix: 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, 0xd6</li>
<li>4 byte number of records in journal</li>
<li>4 byte magic number used for page checksums</li>
<li>4 byte initial database page count</li>
<li>Zero or more instances of the following:
   <ul>
   <li>4 byte page number</li>
   <li>1024 bytes of original data for the page</li>
   <li>4 byte checksum</li>
   </ul>
</li>
</ul>

<h2>3.0 &nbsp; The B-Tree Layer</h2>

<p>
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
</p>

<ul>
<li>48 byte header string</li>
<li>4 byte integer used to determine the byte-order</li>
<li>4 byte integer which is the first page of the freelist</li>
<li>4 byte integer which is the number of pages on the freelist</li>
<li>16 bytes of meta-data arranged as four 4-byte integers</li>
<li>948 bytes of unused space</li>
</ul>

<h3>3.2 &nbsp; Structure Of A Single B-Tree Page</h3>

<p>
Conceptually, a b-tree page contains N database entries and N+1 pointers
to other b-tree pages.







|
|







267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
</p>

<ul>
<li>48 byte header string</li>
<li>4 byte integer used to determine the byte-order</li>
<li>4 byte integer which is the first page of the freelist</li>
<li>4 byte integer which is the number of pages on the freelist</li>
<li>36 bytes of meta-data arranged as nine 4-byte integers</li>
<li>928 bytes of unused space</li>
</ul>

<h3>3.2 &nbsp; Structure Of A Single B-Tree Page</h3>

<p>
Conceptually, a b-tree page contains N database entries and N+1 pointers
to other b-tree pages.
737
738
739
740
741
742
743
744






745
746
747
748
749
750
751
752
753
754
755
by the DEFAULT_CACHE_SIZE pragma.  If the value is positive it
means that synchronous behavior is enable (via the DEFAULT_SYNCHRONOUS
pragma) and if negative it means that synchronous behavior is
disabled.
</p>

<p>
The fourth meta-value is currently unused.






</p>

}

puts {
<p><hr /></p>
<p><a href="index.html"><img src="/goback.jpg" border=0 />
Back to the SQLite Home Page</a>
</p>

</body></html>}







|
>
>
>
>
>
>











773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
by the DEFAULT_CACHE_SIZE pragma.  If the value is positive it
means that synchronous behavior is enable (via the DEFAULT_SYNCHRONOUS
pragma) and if negative it means that synchronous behavior is
disabled.
</p>

<p>
The fourth meta-value is safety level added in version 2.8.0.
A value of 1 corresponds to a SYNCHRONOUS setting of OFF.  In other
words, SQLite does not pause to wait for journal data to reach the disk
surface before overwriting pages of the database.  A value of 2 corresponds
to a SYNCHRONOUS setting of NORMAL.  A value of 3 corresponds to a
SYNCHRONOUS setting of FULL. If the value is 0, that means it has not
been initialized so the default synchronous setting of NORMAL is used.
</p>

}

puts {
<p><hr /></p>
<p><a href="index.html"><img src="/goback.jpg" border=0 />
Back to the SQLite Home Page</a>
</p>

</body></html>}
Changes to www/formatchng.tcl.
1
2
3
4
5
6
7
8
9
10
11
#
# Run this Tcl script to generate the formatchng.html file.
#
set rcsid {$Id: formatchng.tcl,v 1.7 2002/08/13 23:02:59 drh Exp $ }

puts {<html>
<head>
  <title>File Format Changes in SQLite</title>
</head>
<body bgcolor=white>
<h1 align=center>



|







1
2
3
4
5
6
7
8
9
10
11
#
# Run this Tcl script to generate the formatchng.html file.
#
set rcsid {$Id: formatchng.tcl,v 1.8 2003/02/13 02:54:04 drh Exp $ }

puts {<html>
<head>
  <title>File Format Changes in SQLite</title>
</head>
<body bgcolor=white>
<h1 align=center>
153
154
155
156
157
158
159


















160
161
162
163
164
165
166
  <p>Because "text" columns have a different sort order from numeric,
  indices on "text" columns occur in a different order for version
  2.7.0 and later database.  Hence version 2.6.3 and earlier of SQLite 
  will be unable to read a 2.7.0 or later database.  But version 2.7.0
  and later of SQLite will read earlier database version.</p>
  </td>
</tr>


















</table>
</blockquote>

<p>
To perform a database reload, have ready versions of the
<b>sqlite</b> command-line utility for both the old and new
version of SQLite.  Call these two executables "<b>sqlite-old</b>"







>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>







153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
  <p>Because "text" columns have a different sort order from numeric,
  indices on "text" columns occur in a different order for version
  2.7.0 and later database.  Hence version 2.6.3 and earlier of SQLite 
  will be unable to read a 2.7.0 or later database.  But version 2.7.0
  and later of SQLite will read earlier database version.</p>
  </td>
</tr>
<tr>
  <td valign="top">2.7.6 to 2.8.0</td>
  <td valign="top">2003-Feb-14</td>
  <td><p>Version 2.8.0 introduces a change to the format of the rollback
  journal file.  The main database file format is unchanged.  Versions
  2.7.6 and earlier can read and write 2.8.0 databases and vice versa.
  Version 2.8.0 can rollback a transation that was started by version
  2.7.6 and earlier.  But version 2.7.6 and earlier cannot rollback a
  transaction started by version 2.8.0 or later.</p>

  <p>The only time this would ever be an issue is when you have a program
  using version 2.8.0 or later that crashes with an incomplete
  transaction, then you try to examine the database using version 2.7.6 or
  earlier.  The 2.7.6 code will not be able to read the journal file
  and thus will not be able to rollback the incomplete transaction
  to restore the database.</p>
  </td>
</tr>
</table>
</blockquote>

<p>
To perform a database reload, have ready versions of the
<b>sqlite</b> command-line utility for both the old and new
version of SQLite.  Call these two executables "<b>sqlite-old</b>"
Changes to www/lang.tcl.
1
2
3
4
5
6
7
8
9
10
11
#
# Run this Tcl script to generate the sqlite.html file.
#
set rcsid {$Id: lang.tcl,v 1.49 2003/01/29 22:58:27 drh Exp $}

puts {<html>
<head>
  <title>Query Language Understood By SQLite</title>
</head>
<body bgcolor=white>
<h1 align=center>



|







1
2
3
4
5
6
7
8
9
10
11
#
# Run this Tcl script to generate the sqlite.html file.
#
set rcsid {$Id: lang.tcl,v 1.50 2003/02/13 02:54:04 drh Exp $}

puts {<html>
<head>
  <title>Query Language Understood By SQLite</title>
</head>
<body bgcolor=white>
<h1 align=center>
1076
1077
1078
1079
1080
1081
1082
1083

1084
1085
1086
1087
1088

1089

1090
1091
1092
1093





1094
1095
1096
1097

1098
1099
1100
1101
1102
1103
1104
1105
1106
    will hold in memory at once.  Each page uses about 1.5K of memory.
    This pragma works like the <b>cache_size</b> pragma with the addition
    feature that it changes the cache size persistently.  With this pragma,
    you can set the cache size once and that setting is retained and reused
    everytime you reopen the database.</p></li>

<li><p><b>PRAGMA default_synchronous;
       <br>PRAGMA default_synchronous = ON;

       <br>PRAGMA default_synchronous = OFF;</b></p>
    <p>Query or change the setting of the "synchronous" flag in
    the database.  When synchronous is on (the default), the SQLite database
    engine will pause at critical moments to make sure that data has actually
    be written to the disk surface.  (In other words, it invokes the

    equivalent of the <b>fsync()</b> system call.)  In synchronous mode,

    an SQLite database should be fully recoverable even if the operating
    system crashes or power is interrupted unexpectedly.  The penalty for
    this assurance is that some database operations take longer because the
    engine has to wait on the (relatively slow) disk drive.  The alternative





    is to turn synchronous off.  With synchronous off, SQLite continues
    processing as soon as it has handed data off to the operating system.
    If the application running SQLite crashes, the data will be safe, but
    the database could (in theory) become corrupted if the operating system

    crashes or the computer suddenly loses power.  On the other hand, some
    operations are as much as 50 or more times faster with synchronous off.
    </p>
    <p>This pragma changes the synchronous mode persistently.  Once changed,
    the mode stays as set even if the database is closed and reopened.  The
    <b>synchronous</b> pragma does the same thing but only applies the setting
    to the current session.</p>

<a name="pragma_empty_result_callbacks">







|
>


|
|
|
>
|
>
|
<
<
|
>
>
>
>
>
|
|

|
>
|
|







1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093


1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
    will hold in memory at once.  Each page uses about 1.5K of memory.
    This pragma works like the <b>cache_size</b> pragma with the addition
    feature that it changes the cache size persistently.  With this pragma,
    you can set the cache size once and that setting is retained and reused
    everytime you reopen the database.</p></li>

<li><p><b>PRAGMA default_synchronous;
       <br>PRAGMA default_synchronous = FULL;
       <br>PRAGMA default_synchronous = NORMAL;
       <br>PRAGMA default_synchronous = OFF;</b></p>
    <p>Query or change the setting of the "synchronous" flag in
    the database.  When synchronous is FULL, the SQLite database engine will
    pause at critical moments to make sure that data has actually been 
    written to the disk surface before continuing.  This ensures that if
    the operating system crashes or if there is a power failure, the database
    will be uncorrupted after rebooting.  FULL synchronous is very 
    safe, but it is also slow.  
    When synchronous is NORMAL (the default), the SQLite database


    engine will still pause at the most critical moments, but less often
    than in FULL mode.  There is a very small (though non-zero) chance that
    a power failure at just the wrong time could corrupt the database in
    NORMAL mode.  But in practice, you are more likely to suffer
    a catastrophic disk failure or some other unrecoverable hardware
    fault.  So NORMAL is the default mode.
    With synchronous OFF, SQLite continues without pausing
    as soon as it has handed data off to the operating system.
    If the application running SQLite crashes, the data will be safe, but
    the database might become corrupted if the operating system
    crashes or the computer loses power before that data has been written
    to the disk surface.  On the other hand, some
    operations are as much as 50 or more times faster with synchronous OFF.
    </p>
    <p>This pragma changes the synchronous mode persistently.  Once changed,
    the mode stays as set even if the database is closed and reopened.  The
    <b>synchronous</b> pragma does the same thing but only applies the setting
    to the current session.</p>

<a name="pragma_empty_result_callbacks">
1175
1176
1177
1178
1179
1180
1181
1182

1183
1184
1185
1186
1187
1188
1189
       azCol[3] = "TEXT";<br>
       azCol[4] = "NUMERIC";<br>
       azCol[5] = "TEXT";<br>
       azCol[6] = 0;
    </td></table></blockquote></li>

<li><p><b>PRAGMA synchronous;
       <br>PRAGMA synchronous = ON;

       <br>PRAGMA synchronous = OFF;</b></p>
    <p>Query or change the setting of the "synchronous" flag in
    the database for the duration of the current database connect.
    The synchronous flag reverts to its default value when the database
    is closed and reopened.  For additional information on the synchronous
    flag, see the description of the <b>default_synchronous</b> pragma.</p>
    </li>







|
>







1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
       azCol[3] = "TEXT";<br>
       azCol[4] = "NUMERIC";<br>
       azCol[5] = "TEXT";<br>
       azCol[6] = 0;
    </td></table></blockquote></li>

<li><p><b>PRAGMA synchronous;
       <br>PRAGMA synchronous = FULL;
       <br>PRAGMA synchronous = NORMAL;
       <br>PRAGMA synchronous = OFF;</b></p>
    <p>Query or change the setting of the "synchronous" flag in
    the database for the duration of the current database connect.
    The synchronous flag reverts to its default value when the database
    is closed and reopened.  For additional information on the synchronous
    flag, see the description of the <b>default_synchronous</b> pragma.</p>
    </li>