How To Compile And Use ZIPVFS

1.0 Introduction

This document describes an extended version of SQLite that is able to read and write compressed database files.

2.0 License

The core SQLite library is in the public domain. However, the ZIPVFS extension needed to read and write compressed database files is licensed software. You should only be able to see the ZIPVFS source code if you have a license. If you do not have a valid license you must delete the ZIPVFS source code.

Your license is perpetual. You have paid a one-time fee that allows you to use and modify the software forever. You can ship as many copied of the software to your customers as you want so long as you ensure that only compiled binaries are shipped (you cannot distribute source code) and that your customers cannot make additional copies of the software to use for other purposes.

You can create multiple products that use this software as long as all products are developed and maintained by the same team. For the purposes of this paragraph, a "team" is a work unit where everybody knows each other's names. If you are in a large company where this product is used by multiple teams, then each team should acquire their own separate license.

These restrictions applies to the ZIPVFS source code. Anyone is allowed to view this documentation. However, this documentation is protected by copyright and should not be reproduced without written permission of the copyright holder.

3.0 Overview

This extended version of SQLite is able to read and write ordinary SQLite database files just like the public domain version. But this extended version also supports the ability to read and write compressed databases using an application-supplied compression function. Features include:

The ability to both read and write to compressed database files.

Protection against database loss and/or corruption even if a power failure occurs in the middle of a transaction commit.

Updates and changes to the database require temporary disk space the is proportional to the size of the update, not the size of the database.

Applications can register their own compression functions.

The application-defined compression functions can also do encryption, if desired.

Multiple compression algorithms can be registered and SQLite will chose the correct one automatically when opening an existing database.

Works seamlessly with the Multiplexor and with the 8+3 filename option.

4.0 Converting An Application From Ordinary SQLite to ZIPVFS

If you already have an application that is built using the public-domain "sqlite3.c" source file, then the easiest way to get up and running with ZIPVFS it to substitute the "sqlite3-zipvfs.c" source file included with the ZIPVFS source tree in place of the public-domain "sqlite3.c" source file and recompile. The "sqlite3-zipvfs.c" file makes calls to the zLib compression library, so you might need to add an option like "-lz" to the final linkage step of your compilation, but otherwise there should not be any required changes to your build process.

4.1 About The sqlite3-zipvfs.c Source File

The "sqlite3-zipvfs.c" file is constructed from other files using a script. However, the latest "sqlite3-zipvfs.c" is also checked into the source, for the convenience of users who do not want to build it themselves. The version of "sqlite3-zipvfs.c" that is found in the source tree is probably sufficient for your needs. But if you want to rebuild the file (perhaps after making enchancements to "algorithms.c") then do so using this script:

tclsh mksqlite3zipvfsc.tcl

The script reads in the public-domain "sqlite3.c" source file, the core ZIPVFS library in "zipvfs.c", the virtual table interface from "zipvfs_vtab.c", and the "algorithms.c" file that implements a few sample compression and extension functions, and combines these three files into a single large source file with the same interface as public-domain SQLite.

4.2 Taking Advantage Of Applications Built Using sqlite3-zipvfs.c

After recompiling an application using sqlite3-zipvfs.c in place of sqlite3.c, the application should continue to operate the same as it did before. In other words, it should continue to read and write ordinary unencrypted and uncompressed SQLite database files. To take advantage of ZIPVFS, you will need to modify the application slightly to turn ZIPVFS usage on.

When creating a new database file, in order to make that file a ZIPVFS database instead of an ordinary uncompressed database, open the file using a URI Filename with an extra query parameter "zv=zlib". For example:

sqlite3_open_v2("file:demo1.db?zv=zlib", &db,
    SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_URI, 0);

The "zv=zlib" query parameter tells SQLite to use ZIPVFS with the ZLib compression mechanism for the new database. When opening an existing database, the "zv=zlib" query parameter is not required because SQLite will be able to figure out for itself that the database is ZIPVFS compressed using ZLib. The query parameter is only needed when creating a new database.

To enable encryption, add an additional query parameter with the encryption key. For AES128 encryption use "password=KEY" and for AES256 use "password256=KEY". For example:

sqlite3_open_v2("file:demo1.db?zv=zlib&password=abc123", &db,
    SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_URI, 0);

Obviously, you should use an encryption key that is more secure than "abc123". The example above is merely for illustration purposes.

Note that unlike the "zv=zlib" query parameter which is only required when the database is initially created, the encryption key must be supplied every time you open or reopen the database file.

4.3 Adding New Compression And Encryption Algorithms

The compression and encryption algorithms for sqlite3-zipvfs.c are found in the "algorithms.c" source file. The "algorithms.c" source file implements a general-purpose framework that make it easy to add new and customized compression and encryption functions. Extension comments in the "algorithms.c" file explain what to do.

To add custom compression and encryption algorithms, first modify the algorithm.c file to implement your new methods. Then run the "mksqlite3zipvfsc.tcl" script to rebuild sqlite3-zipvfs.c. Then recompile your application and test it.

5.0 Building ZIPVFS

The core ZIPVFS functionality is contained in a single source file named "zipvfs.c". This source file must be appended to a standard public-domain SQLite amalgamation file, "sqlite3.c". A typical command to accomplish this is:

cat sqlite3.c zipvfs.c >zsqlite3.c

Then use the combined file, "zsqlite3.c", in place of the amalgamation file "sqlite3.c" in the application. In order to enable ZIPVFS, the zsqlite3.c module must be compiled with the SQLITE_ENABLE_ZIPVFS pre-processor symbol defined.

Instructions for building ZIPVFS for use with System.Data.SQLite (the .NET wrapper for SQLite) are available separately.

6.0 The Command-line Shell

6.1 Building The Command-line Shell

To build a stand-alone command-line shell for ZIPVFS, use the "shell.c" file contained in the ZIPVFS source tree (not the "shell.c" that comes in the public-domain SQLite source tree) and compile it with "sqlite3-zipvfs.c". (NB: The "shell.c" source files does a #include of the "zipvfs.h" header file, so that file must be in the build directory when compiling shell.c.) For example:

gcc shell.c sqlite3-zipvfs.c

The examples above show the essential core of the compilation command, though you will probably also need to specify various additional compiler options and libraries. A complete command-line might look more like this:

gcc -I. shell.c sqlite3-zipvfs.c -ldl -lz -lpthread

The "-I." option is needed so that shell.c can find the "sqlite3.h" and "zipvfs.h" files that it tries to #include. The "-ldl" option causes the dynamic loader library to be added (which is needed by the sqlite3_load_extension() interface of SQLite and the ".load" command of the shell). The "-lpthread" option links against the Pthreads library, which SQLite requires unless it is compiled with -DSQLITE_THREADSAFE=0. The "-lz" cause the Zlib compression library to be linked.

In order to enable command-line editing on unix systems, you might also want to add options like:

... -DHAVE_READLINE ... -lreadline -lncurses

Other standard options can also be included. As an example, the following command is used (on Linux) to build a ZIPVFS command-line shell that includes support for FTS4, RTree, Multiplexor, VFS-Trace, command-line editing, and default 8+3 filenames:

gcc -o zsqlite3 -g -I. \
   -DSQLITE_THREADSAFE=0 \
   -DSQLITE_ENABLE_ZIPVFS \
   -DSQLITE_ENABLE_8_3_NAMES=2 \
   -DSQLITE_ENABLE_VFSTRACE \
   -DSQLITE_ENABLE_MULTIPLEX \
   -DSQLITE_MULTIPLEX_EXT_OVWR \
   -DSQLITE_ENABLE_STAT2 \
   -DSQLITE_ENABLE_FTS3 \
   -DSQLITE_ENABLE_RTREE \
   -DHAVE_READLINE \
   -DHAVE_USLEEP=1 \
   shell.c test_vfstrace.c test_multiplex.c \
   sqlite3-zipvfs.c -ldl -lreadline -lncurses -lz

6.2 Using The Command-line Shell

The ZIPVFS-enhanced command-line shell works like the standard public-domain command-line shell for SQLite, with a few additions to support ZIPVFS.

To enable compression on a new database file using the ZIPVFS-enhanced shell, specify the name of the database file using URI notation and include "zv" query parameter to specify the compression algorithm. For example:

./zsqlite3 'file:database.db?zv=zlib'

If the database file already exists, then no special options or switches are required to open it. SQLite will automatically detect whether or not the database file is compressed, and if it is compressed will automatically detect which compression algorithm to use. So once a database file exists, all one has to do is specify the name of that file on the command-line and SQLite will do the right thing. For example:

./zsqlite3 database.cb

6.3 Using The Command-line Shell As An Example Of How To Integrate ZIPVFS Into An Existing Program

The command-line shell program is implemented using a single C source file named "shell.c". The version of the command-line shell program used by public-domain SQLite is found in the src/baseline/shell.c source file and a modified version of this program appropriate for use with ZIPVFS if found in the src/shell.c source file. By looking at the differences between these two files, you can get a better idea of what changes are needed in a program in order to convert it from using public-domain SQLite over to using ZIPVFS.

7.0 Using ZIPVFS In Applications

The steps for using ZIPVFS in an application will make more sense if you first read and understand How ZIPVFS Works.

To use ZIPVFS in an application, the first step is to register the appropriate VFS using a call this this interface:

int zipvfs_create_vfs_v3(
  char const *zName,
  char const *zParent,
  void *pCtx,
  int (*xAutoDetect)(void*, const char *zFile, const char *zHdr, ZipvfsMethods*)
);

The function above creates a new VFS that implements the ZipVFS functionality. The name of the new VFS is given by second parameter (zName). The new VFS is not the default. To make it the default VFS, run:

sqlite3_vfs_register(sqlite3_vfs_find(zName), 1);

The fifth parameter (xAutoDetect) is a callback that is invoked as each database is opened. The xAutoDetect callback must determine which compression algorithm to use and fill in its fourth parameter with pointers to the appropriate compression and decompression routines. The zipvfsAutoDetect() routine in the shellv3.c source file shows an example of how to implement the xAutoDetect callback.

ZIPVFS can also be activated using routines zipvfs_create_vfs() and zipvfs_create_vfs_v2(). But these older interfaces are not recommended. They are retained only for backwards compatibility.

8.0 ZipVFS Dictionary Store

In some circumstances, compression algorithms can be more effective if they are supplied with a pre-computed static dictionary. This is especially common with smaller units of compression.

In order to facilitate storing such a dictionary or dictionaries inside a ZipVFS database file, ZipVFS provides a "dictionary store". The dictionary is a simple, non-compressed, key-value store that maps non-negative 32-bit signed integer keys with associated blob data. The dictionary store is optimized for a small (say, a few dozen or so) number of entries.

There are two interfaces to the dictionary store. A read-only interface used by compression/decompression function implementations, and a read-write interface accessed by the application code.

8.1 ZipVFS Dictionary Store Virtual Table Interface

Applications use a virtual table interface - module name "zipvfs_dictstore". Virtual tables may be created in the temp schema only, using the following syntax:

CREATE VIRTUAL TABLE temp.kv zipvfs_dictstore(<db-name>);

where <db-name> is the name of the database to be accessed (e.g. "main"). The schema of the virtual table is equivalent to:

CREATE TABLE temp.kv(id INTEGER, val BLOB);

It is an error to attempt to store an "id" value that is not an integer between 0 and 2147483647 or a "val" value that is not a BLOB. The values in the "id" column must be unique.

8.2 ZipVFS Dictionary Store C Interface

The implementations of xAutoDetect, xCompress and xUncompress callbacks may use the zipvfs_dictstore_get() function to read entries from a ZipVFS dictionary store.

8.3 Notes on Usage

When creating a database, unless the compression algorithm also supports a dictionaryless mode, a dictionary must be supplied before any data needs to be compressed. This is done as follows:

Database is opened using an SQLite URI, specifying the compression algorithm as a URI parameter.
Various database parameters (e.g. page-size) that must be set before the database is created are attended to.
The initial dictionary is added to the database, using commands like:

CREATE VIRTUAL TABLE temp.dictstore USING zipvfs_dictstore('main');
INSERT INTO dictstore VALUES(1, X'...dictionary-blob...');

The INSERT command creates the database file on disk and adds the initial dictionary to it. Page 1 of the new database is compressed using the supplied dictionary.

Usually the first (say) byte of each compressed page will be used to identify the dictionary store entry that it was compressed with. New pages should be compressed using the newest dictionary in the dictstore, which has the numerically largest id value.

The xUncompress() method therefore, when required to uncompress a page, determines the required dictionary by inspecting the first byte of the compressed image. If the required dictionary has not already been loaded into memory, the xUncompress() method loads it using the zipvfs_dictstore_get() method.

Similarly, if it is not already cached, the xCompress() implementation can retreive the dictionary associated with the largest numerical id in the dictstore by calling zipvfs_dictstore_get() with a negative value as the second parameter.

After a new dictionary is added to the dictstore, any existing connections will need to be closed and reopened to ensure that they begin using the new dictionary.

9.0 Limitations And Caveats

9.1 Using WAL mode

It is possible to use SQLite's WAL mode to read and write a ZIPVFS database. However, the following caveats apply:

To activate WAL mode, use "PRAGMA zipvfs_journal_mode = WAL" instead of the usual "PRAGMA journal_mode = WAL".
The sqlite3_wal_checkpoint(), sqlite3_wal_autocheckpoint() and sqlite3_wal_hook() API functions may not be used. Instead, the application must use the "PRAGMA wal_checkpoint" and "PRAGMA wal_autocheckpoint" statements to schedule or request checkpoint operations.

Both of these restrictions is related to the fact that a ZIPVFS configuration features two pager objects, an upper level one that manages uncompressed pages and a lower level one that deals with reading and writing the file on disk.

It is not possible (and nor would it be useful) to set the upper level pager to use WAL mode. Setting the lower level pager to use WAL mode may improve database performance and concurrency in the same way as it may for non-ZIPVFS databases.

The "PRAGMA journal_mode" command operates on the upper level pager, whereas "PRAGMA zipvfs_journal_mode" sets the journal mode for the lower level pager. Similarly, the sqlite3_wal_xxx() APIs operate on the upper level pager only. On the other hand, in a ZIPVFS configuration the "PRAGMA wal_checkpoint" and "PRAGMA wal_autocheckpoint" commands are routed through to the lower level pager.

9.2 Unsupported features

The following PRAGMAs are incompatible with ZIPVFS. Use of these PRAGMAs might lead to crashes and segfaults:

PRAGMA locking_mode=EXCLUSIVE;

To use ZIPVFS with exclusive locking mode, either set the locking mode using sqlite3_file_control() with the ZIPVFS_CTRL_LOCKING_MODE opcode or specify a query parameter "excl" as part of a URI filename for ZIPVFS.

9.3 Multi-file commits are not atomic

Normally, when multiple database files are attached to a single connection using the ATTACH command and a transaction makes changes to tables in multiple files, then the COMMIT is atomic across all files. However, when ZIPVFS is used this is no longer the case. Transactions are still atomic within each separate file. But if a crash or power failure occurs at just the wrong moment within the COMMIT, then it might be that changes are committed to one database file but not the other.

SQLite has previously always had this limitation when using the WAL and MEMORY journal modes. The same limitation is carried over into ZIPVFS.

9.4 Database fragmentation

SQLite database files are page-oriented. In other words, the database file is an array of uniformly sized pages. The ZIPVFS extension compresses each page separately, resulting in pages that are all of different sizes, depending on how well each page compresses. This creates several problems for storage:

Database pages are no longer aligned to sector boundaries of the underlying storage media. This means that if a write is occurring during a power outage or OS crash, the built-in journaling mechanism of SQLite might not be able to fully rollback the transaction after restart.

Pages can no longer be located simply by seeking to a spot in the file which is the page-size times the page-number.

Changes to database content might make a page grow or shrink, meaning that it no longer fits back into the slot from which it came.

In the ZIPVFS extension, the first problem is addressed by maintaining a secondary journaling mechanism that provides sufficient information to completely rollback a transaction even when page boundaries are not aligned on sectors. The second and third problems are addressed by maintaining a separate "page-map" that allows pages to placed arbitrarily in the file, wherever they will fit.

The page-map mechanism can lead to fragmentation and reduced compression performance over time. For example, consider if a page is original N bytes in size but shrinks to N-5 bytes following an update. If the page is written back into its old slot, 5 extra bytes of space at the end of that slot are left unused, and are wasted. Or, if the update causes the page to grow to N+5 byte, it will no longer fit in its original slot and has to be moved either to another larger vacant slot or else to the end of the file. The entire original slot of N bytes is then left unused and wasted.

Unused space will be tracked and reused for subsequent inserts, if possible. But it is usually the case that a newly compressed page will not exactly fit in an available slot, and hence there is some wasted space left in the file. The amount of unused space will tend to grow the more the database file is modified. The entire ZIPVFS database file can be "compacted" down to its minimal size (with no unused or wasted space) using the following API call:

sqlite3_file_control(db, 0, ZIPVFS_CTRL_COMPACT, 0);

The above assumes that the ZIPVFS database is the "main" database. If the database is ATTACHed, then replace the second argument with a string which is the UTF8 name of the attached database. Note that compacting the ZIPVFS database is relatively fast, but it does require a lot of temporary disk space.

Alternatively, a PRAGMA command may be used to compact the database in the same way.

PRAGMA zipvfs_compact;
PRAGMA zipvfs_compact = N;

The first form above attempts to compact the entire database. The second limits the amount of compaction done so that no more than N bytes of data are moved. In both cases, a single integer value is returned - the remaining number of bytes of data that must be moved to complete the compaction operation. If a database is not completely compacted by an evaluation of "PRAGMA zipvfs_compact", then any subsequent "PRAGMA zipvfs_compact" or ZIPVFS_CTRL_COMPACT call automatically resumes the operation from where it left off.

Compiling And Using