Documentation Source Text

Check-in [ac2c5476d0]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Start enumerating the assumptions sqlite makes related to the state of the file system following a power failure or OS crash.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: ac2c5476d0b4c598aafd06383050394f87e2fc65
User & Date: dan 2008-07-28 15:01:48
Context
2008-07-29
16:42
Add definitions for the "atomic-write" "safe-append" and "sequential-write" VFS properties to fileio.html. check-in: 26da8bcd0b user: dan tags: trunk
2008-07-28
15:01
Start enumerating the assumptions sqlite makes related to the state of the file system following a power failure or OS crash. check-in: ac2c5476d0 user: dan tags: trunk
2008-07-23
15:38
Updates to system requirements. check-in: a1807b9496 user: drh tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/fileio.in.

1
2




3
4
5


6
7
8
9
10
11
12






13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
..
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132






133
134
135
136
137
138
139
140











































































































































































































141
142

143
144
145
146
147
148
149
...
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
...
201
202
203
204
205
206
207

208
209
210
211
212
213
214

<tcl>




proc process {text} {
  set zOut ""
  set isReq 0


  foreach zLine [split $text "\n"] {

    switch -regexp $zLine {
      {^ *REQ *[^ ][^ ]* *$} {
        regexp { *REQ *([^ ]+) *} $zLine -> zRecid
        append zOut "<p class=req id=$zRecid>"
        set isReq 1






        set zRecText ""
      }
      {^ *$} {
        if {$isReq} {
          hd_requirement $zRecid $zRecText
          set isReq 0
          append zOut </p>
        }
      }
      default {
        if {$isReq} {
          if {[regexp {^ *\. *$} $zLine]} {set zLine ""}
          append zRecText "$zLine\n"
        }
        append zOut "$zLine\n"
      }
    }

................................................................................
      <li>xOpen
      <li>xFullPathname

      <li>xClose

    </ul>

<h1>VFS Adaptor Implementations</h1>

  <h2>SQLite File-System Usage </h2>
    <p>
      SQLite uses the file-system service made available via the 
      <i>VFS adaptor</i> to create, read and modify files for a variety
      of purposes, including:

    <ul>
      <li> database files,
      <li> journal files,
      <li> master journal files, and
      <li> statement journal files.
    </ul>

    <p>
      A database file is a file-system file used to store an SQLite database 
      image (see <cite>ff_sqlitert_requirements</cite>).

  <h2 id=fs_characteristics>File-System Characteristics</h2>
    <p>
      In the event of an operating system or power failure, the various 
      combinations of file-system and storage hardware available provide
      varying levels of guarantee as to the integrity of the data written
      to the file system just before or during the failure. The exact
      combination of IO operations that SQLite is required to perform
      in order to safely modify a database file depend on the exact 
      characteristics of the target platform.

    <p>






      SQLite queries an implementation for file-system characteristics
      using the xDeviceCharacteristics() and xSectorSize() methods of the
      database file file-handle. These two methods are only ever called
      on file-handles open on database files. They are not called for 
      <i>journal files</i>, <i>master-journal files</i> or 
      <i>temporary database files</i>.

    <p>











































































































































































































      The return value of the xSectorSize() method, the <i>sector-size</i>, is
      expected by SQLite to be a power of 2 value greater than or equal to 512.

    <p class=todo> 
      What does it do if this is not the case? If the sector size is less
      than 512 then 512 is used instead. How about a non power-of-two value?
      UPDATE: How this situation is handled should be described in the API
      requirements. Here we can just refer to the other document.

    <p>
................................................................................
          occured may be trusted.
    </ul>

    <p class=todo>
      What do we assume about the other three file-system write 
      operations - xTruncate(), xDelete() and "create file"?

    <p>
      For example, if the <i>sector-size</i> of a given file-system is
      2048 bytes, and SQLite opens a file and writes a 1024 byte block
      of data to offset 3072 of the file, then according to the model 
      the second sector of the file is in the transient state. If a 
      power failure or operating system crash occurs before or during
      the next call to xSync() on the file handle, then following system
      recovery SQLite assumes that all file data between byte offsets 2048 
      and 4095, inclusive, is invalid. It also assumes that since the first
      sector of the file, containing the data from byte offset 0 to 2047 
      inclusive, is valid, since it was not in a transient state when the 
      crash occured.



    <p>
      The xDeviceCharacteristics() method returns a set of flags, 
      indicating which of the following properties (if any) the 
      file-system provides:

................................................................................
      <li>The <b><i>safe-append</i></b> property.
      <li>The <b><i>atomic write</i></b> property.
    </ul>

    <p class=todo>
      Write an explanation as to how the file-system properties influence
      the model used to predict file damage after a catastrophy.

 

<h1>Database Connections</h1>

  <p>
    Within this document, the term <i>database connection</i> has a slightly
    different meaning from that which one might assume. The handles returned


>
>
>
>


<
>
>






|
>
>
>
>
>
>



|
|
|




|







 







|




|
|



|
|
<






|










>
>
>
>
>
>








>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


>







 







<
<
<
<
<
<
<
<
<
<
<
<
<







 







>







1
2
3
4
5
6
7
8

9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
...
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125

126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
...
390
391
392
393
394
395
396













397
398
399
400
401
402
403
...
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422

<tcl>
proc hd_assumption {id text} {
  hd_requirement $id $text
}

proc process {text} {
  set zOut ""

  set zSpecial ""

  foreach zLine [split $text "\n"] {

    switch -regexp $zLine {
      {^ *REQ *[^ ][^ ]* *$} {
        regexp { *REQ *([^ ]+) *} $zLine -> zRecid
        append zOut "<p class=req id=$zRecid>"
        set zSpecial hd_requirement
        set zRecText ""
      }
      {^ *ASSUMPTION *[^ ][^ ]* *$} {
        regexp { *ASSUMPTION *([^ ]+) *} $zLine -> zRecid
        append zOut "<p class=req id=$zRecid>"
        set zSpecial hd_assumption
        set zRecText ""
      }
      {^ *$} {
        if {$zSpecial ne ""} {
          $zSpecial $zRecid $zRecText
          set zSpecial ""
          append zOut </p>
        }
      }
      default {
        if {$zSpecial ne ""} {
          if {[regexp {^ *\. *$} $zLine]} {set zLine ""}
          append zRecText "$zLine\n"
        }
        append zOut "$zLine\n"
      }
    }

................................................................................
      <li>xOpen
      <li>xFullPathname

      <li>xClose

    </ul>

<h1>VFS Adaptor Related Assumptions</h1>

  <h2>SQLite File-System Usage </h2>
    <p>
      SQLite uses the file-system service made available via the 
      <i>VFS adaptor</i> to create, read and modify three different types
      of files, as follows:

    <ul>
      <li> database files,
      <li> journal files, and
      <li> master journal files.

    </ul>

    <p>
      A database file is a file-system file used to store an SQLite database 
      image (see <cite>ff_sqlitert_requirements</cite>).

  <h2 id=fs_characteristics>System Failure Related Assumptions</h2>
    <p>
      In the event of an operating system or power failure, the various 
      combinations of file-system and storage hardware available provide
      varying levels of guarantee as to the integrity of the data written
      to the file system just before or during the failure. The exact
      combination of IO operations that SQLite is required to perform
      in order to safely modify a database file depend on the exact 
      characteristics of the target platform.

    <p>
      This section describes the assumptions that SQLite makes about the
      the content of a file-system following a power or system failure. In
      other words, it describes the extent of file and file-system corruption
      that such an event may cause.

    <p>
      SQLite queries an implementation for file-system characteristics
      using the xDeviceCharacteristics() and xSectorSize() methods of the
      database file file-handle. These two methods are only ever called
      on file-handles open on database files. They are not called for 
      <i>journal files</i>, <i>master-journal files</i> or 
      <i>temporary database files</i>.

    <p>
      The file-system <i>sector size</i> value determined by calling the
      xSectorSize() method is a power of 2 value between 512 and 32768, 
      inclusive <span class=todo>reference to exactly how this is
      determined</span>. SQLite assumes that the underlying storage
      device stores data in blocks of <i>sector-size</i> bytes each, 
      sectors. It is also assumed that each aligned block of 
      <i>sector-size</i> bytes of each file is stored in a single device
      sector. If the file is not an exact multiple of <i>sector-size</i>
      bytes in size, then the final device sector is partially empty.

    <p>
      Normally, SQLite assumes that if a power failure occurs while 
      updating any portion of a sector then the contents of the entire 
      device sector is suspect following recovery. After writing to
      any part of a sector within a file, it is assumed that the modified
      sector contents are held in a volatile buffer somewhere within
      the system (main memory, disk cache etc.). SQLite does not assume
      that the updated data has reached the persistent storage media, until
      after it has successfully <i>synced</i> the corresponding file by
      invoking the VFS xSync() method. <i>Syncing</i> a file causes all
      modifications to the file up until that point to be committed to
      persistent storage.

    <p>
      Based on the above, SQLite usually uses an internal model of the
      file-system whereby any sector of a file written to is considered to be
      in a transient state until after the file has been successfully 
      <i>synced</i>. Should a power or system failure occur while a sector 
      is in a transient state, it is impossible to predict its contents
      following recovery. It may be written correctly, not written at all,
      overwritten with random data, or any combination thereof.

    <p>
      For example, if the <i>sector-size</i> of a given file-system is
      2048 bytes, and SQLite opens a file and writes a 1024 byte block
      of data to offset 3072 of the file, then according to the model 
      the second sector of the file is in the transient state. If a 
      power failure or operating system crash occurs before or during
      the next call to xSync() on the file handle, then following system
      recovery SQLite assumes that all file data between byte offsets 2048 
      and 4095, inclusive, is invalid. It also assumes that since the first
      sector of the file, containing the data from byte offset 0 to 2047 
      inclusive, is valid, since it was not in a transient state when the 
      crash occured.

    <p>
      Assuming that any and all sectors in the transient state may be 
      corrupted following a power or system failure is a very pessimistic
      approach. Some modern systems provide more sophisticated guarantees
      than this. SQLite allows the VFS implementation to specify at runtime
      that the current platform supports one or more of the following 
      properties:

    <ul>
      <li>The <b>safe-append</b> property. Details...
      <li>The <b>sequential-write</b> property. Details...
      <li>The <b>atomic-write</b> property. Details...
    </ul>

    <h3>Details</h3>

    <p>
      This section describes how the assumptions presented in the parent
      section apply to the individual API functions and operations provided 
      by the VFS to SQLite for the purposes of modifying the contents of the
      file-system.

    <p>
      SQLite manipulates the contents of the file-system using a combination
      of the following four types of operation:

    <ul>
      <li> <b>Create file</b> operations. SQLite may create new files
           within the file-system by invoking the xOpen() method of
           the sqlite3_io_methods object.
      <li> <b>Delete file</b> operations. SQLite may remove files from the
	   file system by calling the xDelete() method of the
           sqlite3_io_methods object.
      <li> <b>Truncate file</b> operations. SQLite may truncate existing 
           files by invoking the xTruncate() method of the sqlite3_file 
           object.
      <li> <b>Write file</b> operations. SQLite may modify the contents
           and increase the size of a file by files by invoking the xWrite() 
           method of the sqlite3_file object.
    </ul>

    <p>
      Additionally, all VFS implementations are required to provide the
      <i>sync file</i> operation, accessed via the xSync() method of the
      sqlite3_file object, used to flush create, write and truncate operations
      on a file to the persistent storage medium.

    <p>
      The formalized assumptions in this section refer to <i>system failure</i>
      events.  In this context, this should be interpreted as any failure that
      causes the system to stop operating. For example a power failure or
      operating system crash.

    <p>
      SQLite does not assume that a <b>create file</b> operation has actually
      modified the file-system records within perisistent storage until
      after the file has been successfully <i>synced</i>.

    ASSUMPTION A21001
      If a system failure occurs during or after a "create file"
      operation, but before the created file has been <i>synced</i>, then 
      SQLite assumes that it is possible that the created file may not
      exist following system recovery.

    <p>
      Of course, it is also possible that it does exist following system
      recovery.

    ASSUMPTION A21002
      If a "create file" operation is executed by SQLite, and then the
      created file <i>synced</i>, then SQLite assumes that the file-system
      modifications corresponding to the "create file" operation have been
      committed to persistent media. It is assumed that if a system
      failure occurs any time after the file has been successfully 
      <i>synced</i>, then the file is guaranteed to appear in the file-system
      following system recovery.

    <p>
      A <b>delete file</b> operation (invoked by a call to the VFS xDelete() 
      method) is assumed to be an atomic and durable operation. 
    </p>

    ASSUMPTION A21003
      If a system failure occurs at any time after a "delete file" 
      operation (call to the VFS xDelete() method) returns successfully, it is
      assumed that the file-system will not contain the deleted file following
      system recovery.

    ASSUMPTION A21004
      If a system failure occurs during a "delete file" operation,
      it is assumed that following system recovery the file-system will 
      either contain the file being deleted in the state it was in before
      the operation was attempted, or not contain the file at all. It is 
      assumed that it is not possible for the file to have become corrupted
      purely as a result of a failure occuring during a "delete file" 
      operation.

    <p>
      The effects of a <b>truncate file</b> operation are not assumed to
      be made persistent until after the corresponding file has been
      <i>synced</i>.

    ASSUMPTION A21005
      If a system failure occurs during or after a "truncate file"
      operation, but before the truncated file has been <i>synced</i>, then 
      SQLite assumes that the size of the truncated file is either as large
      or larger than the size that it was to be truncated to.

    ASSUMPTION A21006
      If a system failure occurs during or after a "truncate file"
      operation, but before the truncated file has been <i>synced</i>, then 
      it is assumed that the contents of the file up to the size that the
      file was to be truncated to are not corrupted.

    <p>
      The above two assumptions may be interpreted to mean that if a 
      system failure occurs after file truncation but before the truncated
      file is <i>synced</i>, the contents of the file following the point
      at which it was to be truncated may not be trusted. They may contain 
      the original file data, or may contain garbage.

    ASSUMPTION A21007
      If a "truncate file" operation is executed by SQLite, and then the
      truncated file <i>synced</i>, then SQLite assumes that the file-system
      modifications corresponding to the "truncate file" operation have been
      committed to persistent media. It is assumed that if a system
      failure occurs any time after the file has been successfully 
      <i>synced</i>, then the effects of the file truncation are guaranteed
      to appear in the file system following recovery.

    <p>
      A <b>write file</b> operation modifies the contents of an existing file
      within the file-system. It may also increase the size of the file.
      The effects of a <i>write file</i> operation are not assumed to
      be made persistent until after the corresponding file has been
      <i>synced</i>.

    ASSUMPTION A21008
      If a system failure occurs during or after a "write file"
      operation, but before the corresponding file has been <i>synced</i>, 
      then it is assumed that the content of all sectors spanned by the
      <i>write file</i> operation is assumed to be untrustworthy following
      system recovery. This includes regions of the sectors that were not
      actually modified by the write file operation. It is assumed that it
      is possible that the sector data was written correctly, partially,
      or not at all, or that the sector has been completely or partially
      filled with random data.

    ASSUMPTION A21009
      If a system failure occurs during or after a "write file"
      operation that causes the file to grow, but before the corresponding 
      file has been <i>synced</i>, then it is assumed 
      that the size of the file following recovery is as large or larger
      than it was before the "write file" operation that, if successful,
      would cause the file to grow.

<!--
    <p>
      The return value of the xSectorSize() method, the <i>sector-size</i>, is
      expected by SQLite to be a power of 2 value greater than or equal to 512.

    <p class=todo> 
      What does it do if this is not the case? If the sector size is less
      than 512 then 512 is used instead. How about a non power-of-two value?
      UPDATE: How this situation is handled should be described in the API
      requirements. Here we can just refer to the other document.

    <p>
................................................................................
          occured may be trusted.
    </ul>

    <p class=todo>
      What do we assume about the other three file-system write 
      operations - xTruncate(), xDelete() and "create file"?
















    <p>
      The xDeviceCharacteristics() method returns a set of flags, 
      indicating which of the following properties (if any) the 
      file-system provides:

................................................................................
      <li>The <b><i>safe-append</i></b> property.
      <li>The <b><i>atomic write</i></b> property.
    </ul>

    <p class=todo>
      Write an explanation as to how the file-system properties influence
      the model used to predict file damage after a catastrophy.
 -->
 

<h1>Database Connections</h1>

  <p>
    Within this document, the term <i>database connection</i> has a slightly
    different meaning from that which one might assume. The handles returned