SQLite

View Ticket
Login
Ticket Hash: ff5be73dee08608976bd32851fa07b60381609ca
Title: Database corruption following power-loss in WAL mode
Status: Fixed Type: Code_Defect
Severity: Severe Priority: Immediate
Subsystem: Unknown Resolution: Fixed
Last Modified: 2012-01-14 15:35:02
Version Found In: 3.7.9
Description:
A carefully timed power loss on a filesystem that does out-of-order writes can cause database corruption when operating in WAL mode. The setup for the corruption scenario is as follows:
  1. Write many small transactions
  2. Checkpoint
  3. Begin writing a large transaction
  4. Power-loss with out-of-order write prior to the transaction fsync

If transaction fsyncs are disabled because PRAGMA synchronous is set to NORMAL, then the power-loss can occur at any time after SQLite begins overwriting the WAL file again from the beginning.

If out-of-order writing causes disk sectors after the first transaction of the previous cycle of the WAL to be overwritten and earlier sectors to remain unchanged, then after reboot the first process to connect will find a valid WAL file containing one or more transactions from the previous cycle of the WAL and it will replay those transactions. But subsequent transactions in that same the WAL cycle will be lost due to the overwrite and will not be replayed. The later transactions will already exist in the database file, however, meaning that transactions will occur out-of-order. This can easily result in database corruption.

This problem has existed in all prior versions of SQLite that support WAL mode. This problem was detected during internal stress testing and has not been observed in the wild. We believe that the behavior of real-world filesystems is such that this fault, while theoretically possible, is unlikely to occur in practice.


Further notes:

In most filesystems, space to hold a file is allocated in sequential sectors on disk. Filesystems do this because it is usually faster to read sectors sequentially rather than randomly. There can be discontinuities in the sector sequence for larger files, but the beginning of the file in particular is usually arranged contiguously on disk.

Similarly, when writing to disk, filesystems tend to order sector writes sequentially. (See, for example, the elevator algorithm.)

Disk controllers might also reorder writes, but they too seem likely to perform writes in sector order when possible.

When you consider the three factors above, it seems that the common case behavior when overwriting a file sequentially will be that content will reach disk oxide sequentially as well. And when that happens, the failure scenario described by this ticket will not occur.

Combine this with the fact that database damage also requires a well-timed power loss or hard reset (which we also hope is a rare event) and one begins to understand that the failure scenario described by this ticket is a very unlikely thing to happen.

The problem is still worth fixing, to be sure. But in consideration of the above factors, we have downgraded the severity from Critical to Severe. We will push forward the next release of SQLite in order to get a fix into circulation sooner rather than later. But we do not feel compelled to rush out a patch release nor take heroic measures to make the 3.7.10 release during the Christmas holiday.