*** DRAFT ***

What If OpenDocument Used SQLite?

Introduction

Suppose the OpenDocument file format, and specifically the "ODP" OpenDocument Presentation format, where built around SQLite. Benefits would include:

Note that this is only a thought experiment. We are not suggesting that OpenDocument be changed. Nor is this article a criticism of the current OpenDocument design. The point of this essay is to suggest ways to improve future file format designs.

About OpenDocument And OpenDocument Presentation

The OpenDocument file format is used for office applications: word processors, spreadsheets, and presentations. It was originally designed for the OpenOffice suite but has since been incorporated into other desktop application suites. The OpenOffice application has been forked and renamed a few times. This author's primary use for OpenDocument is building slide presentations with either NeoOffice on Mac, or LibreOffice on Linux and Windows.

An OpenDocument Presentation or "ODP" file is a ZIP archive containing XML files describing presentation slides and separate image files for the various images that are included as part of the presentation. (OpenDocument word processor and spreedsheet files are similarly structured but are not considered by this article.) The reader can easily see the content of an ODP file by using the "zip -l" command. For example, the following is the "zip -l" output from a 49-slide presentation about SQLite from the 2014 SouthEast LinuxFest conference:

Archive:  self2014.odp
  Length      Date    Time    Name
---------  ---------- -----   ----
       47  2014-06-21 12:34   mimetype
        0  2014-06-21 12:34   Configurations2/statusbar/
        0  2014-06-21 12:34   Configurations2/accelerator/current.xml
        0  2014-06-21 12:34   Configurations2/floater/
        0  2014-06-21 12:34   Configurations2/popupmenu/
        0  2014-06-21 12:34   Configurations2/progressbar/
        0  2014-06-21 12:34   Configurations2/menubar/
        0  2014-06-21 12:34   Configurations2/toolbar/
        0  2014-06-21 12:34   Configurations2/images/Bitmaps/
    54702  2014-06-21 12:34   Pictures/10000000000001F40000018C595A5A3D.png
    46269  2014-06-21 12:34   Pictures/100000000000012C000000A8ED96BFD9.png
... 58 other pictures omitted...
    13013  2014-06-21 12:34   Pictures/10000000000000EE0000004765E03BA8.png
  1005059  2014-06-21 12:34   Pictures/10000000000004760000034223EACEFD.png
   211831  2014-06-21 12:34   content.xml
    46169  2014-06-21 12:34   styles.xml
     1001  2014-06-21 12:34   meta.xml
     9291  2014-06-21 12:34   Thumbnails/thumbnail.png
    38705  2014-06-21 12:34   Thumbnails/thumbnail.pdf
     9664  2014-06-21 12:34   settings.xml
     9704  2014-06-21 12:34   META-INF/manifest.xml
---------                     -------
 10961006                     78 files

The ODP ZIP archive contains four different XML files: content.xml, styles.xml, meta.xml, and settings.xml. Those four files define the slide layout, text content, and styling. This particular presentation contains 62 images, ranging from full-screen pictures to tiny icons, each stored as a separate file in the Pictures folder. The "mimetype" file contains a single line of text that says:

application/vnd.oasis.opendocument.presentation

The purpose of the other files and folders is presently unknown to the author but is probably not difficult to figure out.

Limitations Of The OpenDocument Presentation Format

The use of a ZIP archive to encapsulate XML files plus resources is an elegant approach to an application file format. It is clearly superior to a custom binary file format. But using an SQLite database as the container, instead of ZIP, would be more elegant still.

A ZIP archive is basically a key/value database, optimized for the case of write-once/read-many and for a relatively small number of distinct keys (a few hundred to a few thousand) each with a large BLOB as its value. A ZIP archive can be viewed as a "pile-of-files" database. This works, but it has some shortcomings relative to an SQLite database, as follows:

  1. Incremental update is hard.

    It is difficult to update individual entries in a ZIP archive. It is especially difficult to update individual entries in a ZIP archive in a way that does not destroy the entire document if the computer loses power and/or crashes in the middle of the update. It is not impossible to do this, but it is sufficiently difficult that nobody actually does it. Instead, whenever the user selects "File/Save", the entire ZIP archive is rewritten. Hence, "File/Save" takes longer than it ought, especially on older hardware. Newer machines are faster, but it is still bothersome that changing a single character in a 50 megabyte presentation causes one to burn through 50 megabytes of the finite write life on the SSD.

  2. Startup is slow.

    In keeping with the pile-of-files theme, OpenDocument stores all slide content in a single big XML file named "content.xml". LibreOffice reads and parses this entire file just to display the first slide. LibreOffice also seems to read all images into memory as well, which makes sense seeing as when the user does "File/Save" it is going to have to write them all back out again, even though none of them changed. The net effect is that start-up is slow. Double-clicking an OpenDocument file brings up a progress bar rather than the first slide. This results in a bad user experience. The situation grows ever more annoying as the document size increases.

  3. More memory is required.

    Because ZIP archives are optimized for storing big chunks of content, they encourage a style of programming where the entire document is read into memory at startup, all editing occurs in memory, then the entire document is written to disk during "File/Save". OpenOffice and its descendants embrace that pattern.

    One might argue that it is ok, in this era of multi-gigabyte desktops, to read the entire document into memory. But it is not ok. For one, the amount of memory used far exceeds the (compressed) file size on disk. So a 50MB presentation might take 200MB or more RAM. That still is not a problem if one only edits a single document at a time. But when working on a talk, this author will typically have 10 or 15 different presentations up all at the same time (to facilitate copy/paste of slides from past presentation) and so gigabytes of memory are required. Add in an open web browser or two and a few other desktop apps, and suddenly the disk is whirling and the machine is swapping. And even having just a single document is a problem when working on an inexpensive Chromebook retrofitted with Ubuntu. Using less memory is always better.

  4. Crash recovery is difficult.

    The descendants of OpenOffice tend to segfault more often than commercial competitors. Perhaps for this reason, the OpenOffice forks make periodic backups of their in-memory documents so that users do not lose all pending edits when the inevitable application crash does occur. This causes frustrating pauses in the application for the few seconds while each backup is being made. After restarting from a crash, the user is presented with a dialog box that walks them through the recovery process. Managing the crash recovery this way involves lots of extra application logic and is generally an annoyance to the user.

  5. Content is inaccessible.

    One cannot easily view, change, or extract the content of an OpenDocument presentation using generic tools. The only reasonable way to view or edit an OpenDocument document is to open it up using an application that is specifically designed to read or write OpenDocument (read: LibreOffice or one of its cousins). The situation could be worse. One can extract and view individual images (say) from a presentation using just the "zip" archiver tool. But it is not reasonable try to extract the text from a slide. Remember that all content is stored in a single "context.xml" file. That file is XML, so it is a text file. But it is not a text file that can be managed with an ordinary text editor. For the example presentation above, the content.xml file consist of exactly two lines. The first line of the file is just:

    <?xml version="1.0" encoding="UTF-8"?>
    

    The second line of the file contains 211792 characters of impenetrable XML. Yes, 211792 characters all on one line. This file is a good stress-test for a text editor. Thankfully, the file is not some obscure binary format, but in terms of accessibility, it might as well be written in Klingon.

First Improvement: Replace ZIP with SQLite

Let us suppose that instead of using a ZIP archive to store its files, OpenDocument used a very simple SQLite database with the following single-table schema:

CREATE TABLE OpenDocTree(
  filename TEXT PRIMARY KEY,  -- Name of file
  filesize BIGINT,            -- Size of file after decompression
  content BLOB                -- Compressed file content
);

For this first experiment, nothing else about the file format is changed. The OpenDocument is still a pile-of-files, only now each file is a row in an SQLite database rather than an entry in a ZIP archive. This simple change does not use the power of a relational database. Even so, this simple change shows some improvements.

Surprisingly, using SQLite in place of ZIP makes the presentation file smaller. Really. One would think that a relational database file would be larger than a ZIP archive, but at least in the case of NeoOffice that is not so. The following is an actual screen-scrape showing the sizes of the same NeoOffice presentation, both in its original ZIP archive format as generated by NeoOffice (self2014.odp), and as repacked as an SQLite database using the SQLAR utility:

-rw-r--r--  1 drh  staff  10514994 Jun  8 14:32 self2014.odp
-rw-r--r--  1 drh  staff  10464256 Jun  8 14:37 self2014.sqlar
-rw-r--r--  1 drh  staff  10416644 Jun  8 14:40 zip.odp

The SQLite database file ("self2014.sqlar") is about a half percent smaller than the equivalent ODP file! How can this be? Apparently the ZIP archive generator logic in NeoOffice is not as efficient as it could be, because when the same pile-of-files is recompressed using the command-line "zip" utility, one gets a file ("zip.odp") that is smaller still, by another half percent, as seen in the third line above. So, a well-written ZIP archive can be slightly smaller than the equivalent SQLite database, as one would expect. But the difference is slight. The key take-away is that an SQLite database is size-competitive with a ZIP archive.

The other advantage to using SQLite in place of ZIP is that the document can now be updated incrementally, without risk of corrupting the document if a power loss or other crash occurs in the middle of the update. (Remember that writes to SQLite databases are atomic.) True, all the content is still kept in a single big XML file ("content.xml") which must be completely rewritten if so much as a single character changes. But with SQLite, only that one file needs to change. The other 77 files in the repository can remain unaltered. They do not all have to be rewritten, which in turn makes "File/Save" run much faster and saves wear on SSDs.

Second Improvement: Split content into smaller pieces

A pile-of-files encourages content to be stored in a few large chunks. In the case of ODP, there are just four XML files that define the layout off all slides in a presentation. An SQLite database allows storing information in a few large chunks, but SQLite is also adept and efficient at storing information in numerous smaller pieces.

So then, instead of storing all content for all slides in a single oversized XML file ("content.xml"), suppose there was a separate table for storing the content of each slide separately. The table schema might look something like this:

CREATE TABLE slide(
  pageNumber INTEGER,   -- The slide page number
  slideContent TEXT     -- Slide content as XML or JSON
);
CREATE INDEX slide_pgnum ON slide(pageNumber); -- Optional

The content of each slide could still be stored as compressed XML. But now each page is stored separately. So when opening a new document, the application could simply run:

SELECT slideContent FROM slide WHERE pageNumber=1;

This query will quickly and efficiently return the content of the first slide, which could then be speedily parsed and displayed to the user. Only one page needs to be read and parsed in order render the first screen, which means that the first screen appears much faster and there is no longer a need for an annoying progress bar.

If the application wanted to keep all content in memory, it could continue reading and parsing the other pages using a background thread after drawing the first page. Or, since reading from SQLite is so efficient, the application might instead choose to reduce its memory footprint and only keep a single slide in memory at a time. Or maybe it keeps the current slide and the next slide in memory, to facility rapid transitions to the next slide.

Notice that dividing up the content into smaller pieces using an SQLite table gives flexibility to the implementation. The application can choose to read all content into memory at startup. Or it can read just a few pages into memory and keep the rest on disk. Or it can read just single page into memory at a time. And different versions of the application can make different choices without having to make any changes to the file format. Such options are not available when all content is in a single big XML file in a ZIP archive.

Splitting content into smaller pieces also helps File/Save operations to go faster. Instead of having to write back the content of all pages when doing a File/Save, the application only has to write back those pages that have actually changed.

One minor downside of splitting content into smaller pieces is that compression does not work as well on shorter texts and so the size of the document might increase. But as the bulk of the document space is used to store images, a small reduction in the compression efficiency of the text content will hardly be noticeable, and is a small price to pay for an improved user experience.

Third Improvement: Versioning

Once one is comfortable with the concept of storing each slide separately, it is a small step to support versioning of the presentation. Consider the following schema:

CREATE TABLE slide(
  slideId INTEGER PRIMARY KEY,
  derivedFrom INTEGER REFERENCES slide,
  content TEXT     -- XML or JSON or whatever
);
CREATE TABLE version(
  versionId INTEGER PRIMARY KEY,
  priorVersion INTEGER REFERENCES version,
  checkinTime DATETIME,   -- When this version was saved
  comment TEXT,           -- Description of this version
  manifest TEXT           -- List of integer slideIds
);

In this schema, instead of each slide having a page number that determines its order within the presentation, each slide has a unique integer identifier that is unrelated to where it occurs in sequence. The order of slides in the presentation is determined by a list of slideIds, stored as a text string in the MANIFEST column of the VERSION table. Since multiple entries are allowed in the VERSION table, that means that multiple presentations can be stored in the same document.

On startup, the application first decides which version it wants to display. Since the versionId will naturally increase in time and one would normally want to see the latest version, an appropriate query might be:

SELECT manifest, versionId FROM version ORDER BY versionId DESC LIMIT 1;

Or perhaps the application would rather use the most recent checkinTime:

SELECT manifest, versionId, max(checkinTime) FROM version;

Using a single query such as the above, the application obtains a list of the slideIds for all slides in the presentation. The application then queries for the content of the first slide, and parses and displays that content, as before.

(Aside: Yes, that second query above that uses "max(checkinTime)" really does work and really does return a well-defined answer in SQLite. Such a query either returns an undefined answer or generates an error in many other SQL database engines, but in SQLite it does what you would expect: it returns the manifest and versionId of the entry that has the maximum checkinTime.)

When the user does a "File/Save", instead of overwriting the modified slides, the application can now make new entries in the SLIDE table for just those slides that have been added or altered. Then it creates a new entry in the VERSION table containing the revised manifest.

The VERSION table shown above has columns to record a check-in comment (presumably supplied by the user) and the time and date at which the File/Save action occurred. It also records the parent version to record the history of changes. Perhaps the manifest could be stored as a delta from the parent version, though typically the manifest will be small enough that storing a delta might be more trouble than it is worth. The SLIDE table also contains a derivedFrom column which could be used for delta encoding if it is determined that saving the slide content as a delta from its previous version is a worthwhile optimization.

So with this simple change, the ODP file now stores not just the most recent edit to the presentation, but a history of all historic edits. The user would normally want to see just the most recent edition of the presentation, but if desired, the user can now go backwards in time to see historical versions of the same presentation.

Or, multiple presentations could be stored within the same document.

With such a schema, the application would no longer need to make periodic backups of the unsaved changes to a separate file to avoid lost work in the event of a crash. Instead, a special "pending" version could be allocated and unsaved changes could be written into the pending version. Because only changes would need to be written, not the entire document, saving the pending changes would only involve writing a few kilobytes of content, not multiple megabytes, and would take milliseconds instead of seconds, and so it could be done frequently and silently in the background. Then when a crash occurs and the user reboots, all (or almost all) of their work is retained. If the user decides to discard unsaved changes, they simply go back to the previous version.

There are details to fill in here. Perhaps a screen can be provided that displays a history changes (perhaps with a graph) allowing the user to select which version they want to view or edit. Perhaps some facility can be provided to merge forks that might occur in the version history. And perhaps the application should provide a means to purge old and unwanted versions. The key point is that using an SQLite database to store the content, rather than a ZIP archive, makes all of these features much, much easier to implement, which increases the possibility that they will eventually get implemented.

And So Forth...

In the previous sections, we have seen how moving from a key/value store implemented as a ZIP archive to a simple SQLite database with just three tables can add significant capabilities to an application file format. We could continue to enhance the schema with new tables, with indexes added for performance, with triggers and views for programming convenience, and constraints to enforce consistency of content even in the face of programming errors. Further enhancement ideas include:

An SQLite database has a lot of capability, which this essay has only begun to touch upon. But hopefully this quick glimpse has convinced some readers that using an SQL database as an application file format is worth a second look.

Some readers might resist using SQLite as an application file format due to prior exposure to enterprise SQL databases and the caveats and limitations of those other systems. For example, many enterprise database engines advise against storing large strings or BLOBs in the database and instead suggest that large strings and BLOBs be stored as separate files and the filename stored in the database. But SQLite is not like that. Any column of an SQLite database can hold a string or BLOB up to about a gigabyte in size. And for strings and BLOBs of 100 kilobytes or less, I/O performance is better than using separate files.

Some readers might be reluctant to consider SQLite as an application file format because they have been inculcated with the idea that all SQL database schemas must be factored into third normal form and store only small primitive data types such as strings and integers. Certainly relational theory is important and designers should strive to understand it. But, as demonstrated above, it is often quite acceptable to store complex information as XML or JSON in text fields of a database. Do what works, not what your database professor said you ought to do.

Review Of The Benefits Of Using SQLite

In summary, the claim of this essay is that using SQLite as a container for an application file format like OpenDocument and storing lots of smaller objects in that container works out much better than using a ZIP archive holding a few larger objects. To wit:

  1. An SQLite database file is approximately the same size, and in some cases smaller, than a ZIP archive holding the same information.

  2. The atomic update capabilities of SQLite allow small incremental changes to be safely written into the document. This reduces total disk I/O and improves File/Save performance, enhancing the user experience.

  3. Startup time is reduced by allowing the application to read in only the content shown for the initial screen. This largely eliminates the need to show a progress bar when opening a new document. The document just pops up immediately, further enhancing the user experience.

  4. The memory footprint of the application can be dramatically reduced by only loading content that is relevant to the current display and keeping the bulk of the content on disk. The fast query capability of SQLite make this a viable alternative to keeping all content in memory at all times. And when applications use less memory, it makes the entire computer more responsive, further enhancing the user experience.

  5. The schema of an SQL database is able to represent information more directly and succinctly than a key/value database such as a ZIP archive. This makes the document content more accessible to third-party applications and scripts and facilitates advanced features such as built-in document versioning, and incremental saving of work in progress for recovery after a crash.

These are just a few of the benefits of using SQLite as an application file format — the benefits that seem most likely to improve the user experience for applications like OpenOffice. Other applications might benefit from SQLite in different ways. See the Application File Format document for additional ideas.

Finally, let us reiterate that this essay is a thought experiment. The OpenDocument format is well-established and already well-designed. Nobody really believes that OpenDocument should be changed to use SQLite as its container instead of ZIP. Nor is this article a criticism of OpenDocument for not choosing SQLite as its container since OpenDocument predates SQLite. Rather, the point of this article is to use OpenDocument as a concrete example of how SQLite can be used to build better application file formats for future projects.

*** DRAFT ***