Documentation Source Text: Artifact [cbee5540a9]

Artifact cbee5540a91bc64873bd888cb3150402bbf78e586a9ac4c94c67d08424206424:

File pages/affcase1.in — part of check-in [297fc5035b] at 2018-11-30 21:04:18 on branch trunk — Fix documentation typos. (user: drh size: 25761) [more...]
<tcl>hd_keywords {What If OpenOffice Used SQLite} \
{OpenOffice case study}</tcl>
<title>What If OpenDocument Used SQLite?</title>

<h1 align="center">
What If OpenDocument Used SQLite?</h1>

<h2>Introduction</h2>

<p>Suppose the
[http://en.wikipedia.org/wiki/OpenDocument|OpenDocument] file format,
and specifically the "ODP" OpenDocument Presentation format, were
built around SQLite.  Benefits would include:
<ul>
<li>Smaller documents
<li>Faster File/Save times
<li>Faster startup times
<li>Less memory used
<li>Document versioning
<li>A better user experience
</ul>

<p>
Note that this is only a thought experiment.
We are not suggesting that OpenDocument be changed.
Nor is this article a criticism of the current OpenDocument
design.  The point of this essay is to suggest ways to improve
future file format designs.

<h2>About OpenDocument And OpenDocument Presentation</h2>

<p>
The OpenDocument file format is used for office applications:
word processors, spreadsheets, and presentations.  It was originally
designed for the OpenOffice suite but has since been incorporated into
other desktop application suites.  The OpenOffice application has been
forked and renamed a few times.  This author's primary use for OpenDocument is 
building slide presentations with either 
[https://www.neooffice.org/neojava/en/index.php|NeoOffice] on Mac, or
[http://www.libreoffice.org/|LibreOffice] on Linux and Windows.

<p>
An OpenDocument Presentation or "ODP" file is a
[http://en.wikipedia.org/wiki/Zip_%28file_format%29|ZIP archive] containing
XML files describing presentation slides and separate image files for the
various images that are included as part of the presentation.
(OpenDocument word processor and spreadsheet files are similarly
structured but are not considered by this article.) The reader can
easily see the content of an ODP file by using the "zip -l" command.
For example, the following is the "zip -l" output from a 49-slide presentation
about SQLite from the 2014
<a href="http://southeastlinuxfest.org/">SouthEast LinuxFest</a>
conference:

<blockquote><pre>
Archive:  self2014.odp
  Length      Date    Time    Name
---------  ---------- -----   ----
       47  2014-06-21 12:34   mimetype
        0  2014-06-21 12:34   Configurations2/statusbar/
        0  2014-06-21 12:34   Configurations2/accelerator/current.xml
        0  2014-06-21 12:34   Configurations2/floater/
        0  2014-06-21 12:34   Configurations2/popupmenu/
        0  2014-06-21 12:34   Configurations2/progressbar/
        0  2014-06-21 12:34   Configurations2/menubar/
        0  2014-06-21 12:34   Configurations2/toolbar/
        0  2014-06-21 12:34   Configurations2/images/Bitmaps/
    54702  2014-06-21 12:34   Pictures/10000000000001F40000018C595A5A3D.png
    46269  2014-06-21 12:34   Pictures/100000000000012C000000A8ED96BFD9.png
<i>... 58 other pictures omitted...</i>
    13013  2014-06-21 12:34   Pictures/10000000000000EE0000004765E03BA8.png
  1005059  2014-06-21 12:34   Pictures/10000000000004760000034223EACEFD.png
   211831  2014-06-21 12:34   content.xml
    46169  2014-06-21 12:34   styles.xml
     1001  2014-06-21 12:34   meta.xml
     9291  2014-06-21 12:34   Thumbnails/thumbnail.png
    38705  2014-06-21 12:34   Thumbnails/thumbnail.pdf
     9664  2014-06-21 12:34   settings.xml
     9704  2014-06-21 12:34   META-INF/manifest.xml
---------                     -------
 10961006                     78 files
</pre></blockquote>

<p>
The ODP ZIP archive contains four different XML files:
content.xml, styles.xml, meta.xml, and settings.xml.  Those four files
define the slide layout, text content, and styling.  This particular
presentation contains 62 images, ranging from full-screen pictures to
tiny icons, each stored as a separate file in the Pictures
folder.  The "mimetype" file contains a single line of text that says:

<blockquote><pre>
application/vnd.oasis.opendocument.presentation
</pre></blockquote>

<p>The purpose of the other files and folders is presently 
unknown to the author but is probably not difficult to figure out.

<h2>Limitations Of The OpenDocument Presentation Format</h2>

<p>
The use of a ZIP archive to encapsulate XML files plus resources is an
elegant approach to an application file format.
It is clearly superior to a custom binary file format.
But using an SQLite database as the
container, instead of ZIP, would be more elegant still.

<p>A ZIP archive is basically a key/value database, optimized for
the case of write-once/read-many and for a relatively small number
of distinct keys (a few hundred to a few thousand) each with a large BLOB
as its value.  A ZIP archive can be viewed as a "pile-of-files"
database.  This works, but it has some shortcomings relative to an
SQLite database, as follows:

<ol>
<li><p><b>Incremental update is hard.</b>
<p>
It is difficult to update individual entries in a ZIP archive.
It is especially difficult to update individual entries in a ZIP
archive in a way that does not destroy
the entire document if the computer loses power and/or crashes
in the middle of the update.  It is not impossible to do this, but
it is sufficiently difficult that nobody actually does it.  Instead, whenever
the user selects "File/Save", the entire ZIP archive is rewritten.  
Hence, "File/Save" takes longer than it ought, especially on
older hardware.  Newer machines are faster, but it is still bothersome
that changing a single character in a 50 megabyte presentation causes one
to burn through 50 megabytes of the finite write life on the SSD.

<li><p><b>Startup is slow.</b>
<p>
In keeping with the pile-of-files theme, OpenDocument stores all slide 
content in a single big XML file named "content.xml".  
LibreOffice reads and parses this entire file just to display
the first slide.
LibreOffice also seems to
read all images into memory as well, which makes sense seeing as when
the user does "File/Save" it is going to have to write them all back out
again, even though none of them changed.  The net effect is that
start-up is slow.  Double-clicking an OpenDocument file brings up a
progress bar rather than the first slide.
This results in a bad user experience.
The situation grows ever more annoying as
the document size increases.

<li><p><b>More memory is required.</b>
<p>
Because ZIP archives are optimized for storing big chunks of content, they
encourage a style of programming where the entire document is read into
memory at startup, all editing occurs in memory, then the entire document
is written to disk during "File/Save".  OpenOffice and its descendants
embrace that pattern.

<p>
One might argue that it is ok, in this era of multi-gigabyte desktops, to
read the entire document into memory.
But it is not ok.
For one, the amount of memory used far exceeds the (compressed) file size
on disk.  So a 50MB presentation might take 200MB or more RAM.  
That still is not a problem if one only edits a single document at a time.  
But when working on a talk, this author will typically have 10 or 15 different 
presentations up all at the same
time (to facilitate copy/paste of slides from past presentation) and so
gigabytes of memory are required.
Add in an open web browser or two and a few other 
desktop apps, and suddenly the disk is whirling and the machine is swapping.
And even having just a single document is a problem when working
on an inexpensive Chromebook retrofitted with Ubuntu.
Using less memory is always better.
</p>

<li><p><b>Crash recovery is difficult.</b>
<p>
The descendants of OpenOffice tend to segfault more often than commercial
competitors.  Perhaps for this reason, the OpenOffice forks make
periodic backups of their in-memory documents so that users do not lose
all pending edits when the inevitable application crash does occur.
This causes frustrating pauses in the application for the few seconds
while each backup is being made.
After restarting from a crash, the user is presented with a dialog box
that walks them through the recovery process.  Managing the crash
recovery this way involves lots of extra application logic and is
generally an annoyance to the user.

<li><p><b>Content is inaccessible.</b>
<p>
One cannot easily view, change, or extract the content of an 
OpenDocument presentation using generic tools.
The only reasonable way to view or edit an OpenDocument document is to open
it up using an application that is specifically designed to read or write
OpenDocument (read: LibreOffice or one of its cousins).  The situation
could be worse.  One can extract and view individual images (say) from
a presentation using just the "zip" archiver tool.  But it is not reasonable
try to extract the text from a slide.  Remember that all content is stored
in a single "context.xml" file.  That file is XML, so it is a text file.
But it is not a text file that can be managed with an ordinary text
editor.  For the example presentation above, the content.xml file
consist of exactly two lines. The first line of the file is just:

<blockquote><pre>
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
</pre></blockquote>

<p>The second line of the file contains 211792 characters of
impenetrable XML.  Yes, 211792 characters all on one line.
This file is a good stress-test for a text editor.
Thankfully, the file is not some obscure
binary format, but in terms of accessibility, it might as well be
written in Klingon.
</ol>

<h2>First Improvement:  Replace ZIP with SQLite</h2>

<p>
Let us suppose that instead of using a ZIP archive to store its files,
OpenDocument used a very simple SQLite database with the following
single-table schema:

<blockquote><pre>
CREATE TABLE OpenDocTree(
  filename TEXT PRIMARY KEY,  -- Name of file
  filesize BIGINT,            -- Size of file after decompression
  content BLOB                -- Compressed file content
);
</pre></blockquote>

<p>
For this first experiment, nothing else about the file format is changed.
The OpenDocument is still a pile-of-files, only now each file is a row
in an SQLite database rather than an entry in a ZIP archive.
This simple change does not use the power of a relational
database.  Even so, this simple change shows some improvements.

<tcl>hd_fragment smaller {SQLite Archive smaller than ZIP}</tcl>
<p>
Surprisingly, using SQLite in place of ZIP makes the presentation
file smaller.  Really.  One would think that a relational database file
would be larger than a ZIP archive, but at least in the case of NeoOffice
that is not so.  The following is an actual screen-scrape showing
the sizes of the same NeoOffice presentation, both in its original 
ZIP archive format as generated by NeoOffice (self2014.odp), and 
as repacked as an SQLite database using the 
[http://www.sqlite.org/sqlar/doc/trunk/README.md|SQLAR] utility:

<blockquote><pre>
-rw-r--r--  1 drh  staff  10514994 Jun  8 14:32 self2014.odp
-rw-r--r--  1 drh  staff  10464256 Jun  8 14:37 self2014.sqlar
-rw-r--r--  1 drh  staff  10416644 Jun  8 14:40 zip.odp
</pre></blockquote>

<p>
The SQLite database file ("self2014.sqlar") is about a
half percent smaller than the equivalent ODP file!  How can this be?
Apparently the ZIP archive generator logic in NeoOffice
is not as efficient as it could be, because when the same pile-of-files
is recompressed using the command-line "zip" utility, one gets a file
("zip.odp") that is smaller still, by another half percent, as seen
in the third line above.  So, a well-written ZIP archive
can be slightly smaller than the equivalent SQLite database, as one would
expect.  But the difference is slight.  The key take-away is that an
SQLite database is size-competitive with a ZIP archive.

<p>
The other advantage to using SQLite in place of
ZIP is that the document can now be updated incrementally, without risk
of corrupting the document if a power loss or other crash occurs in the
middle of the update.  (Remember that writes to 
[atomic commit|SQLite databases are atomic].)   True, all the
content is still kept in a single big XML file ("content.xml") which must
be completely rewritten if so much as a single character changes.  But
with SQLite, only that one file needs to change.  The other 77 files in the
repository can remain unaltered.  They do not all have to be rewritten,
which in turn makes "File/Save" run much faster and saves wear on SSDs.

<h2>Second Improvement:  Split content into smaller pieces</h2>

<p>
A pile-of-files encourages content to be stored in a few large chunks.
In the case of ODP, there are just four XML files that define the layout
off all slides in a presentation.  An SQLite database allows storing
information in a few large chunks, but SQLite is also adept and efficient
at storing information in numerous smaller pieces.

<p>
So then, instead of storing all content for all slides in a single
oversized XML file ("content.xml"), suppose there was a separate table
for storing the content of each slide separately.  The table schema
might look something like this:

<blockquote><pre>
CREATE TABLE slide(
  pageNumber INTEGER,   -- The slide page number
  slideContent TEXT     -- Slide content as XML or JSON
);
CREATE INDEX slide_pgnum ON slide(pageNumber); -- Optional
</pre></blockquote>

<p>The content of each slide could still be stored as compressed XML.
But now each page is stored separately.  So when opening a new document,
the application could simply run:

<blockquote><pre>
SELECT slideContent FROM slide WHERE pageNumber=1;
</pre></blockquote>

<p>This query will quickly and efficiently return the content of the first
slide, which could then be speedily parsed and displayed to the user.
Only one page needs to be read and parsed in order render the first screen,
which means that the first screen appears much faster and
there is no longer a need for an annoying progress bar.

<p>If the application wanted
to keep all content in memory, it could continue reading and parsing the
other pages using a background thread after drawing the first page.  Or,
since reading from SQLite is so efficient, the application might 
instead choose to reduce its memory footprint and only keep a single
slide in memory at a time.  Or maybe it keeps the current slide and the
next slide in memory, to facility rapid transitions to the next slide.

<p>
Notice that dividing up the content into smaller pieces using an SQLite
table gives flexibility to the implementation.  The application can choose
to read all content into memory at startup.  Or it can read just a
few pages into memory and keep the rest on disk.  Or it can read just
single page into memory at a time.  And different versions of the application
can make different choices without having to make any changes to the
file format.  Such options are not available when all content is in
a single big XML file in a ZIP archive.

<p>
Splitting content into smaller pieces also helps File/Save operations
to go faster.  Instead of having to write back the content of all pages
when doing a File/Save, the application only has to write back those
pages that have actually changed.

<p>
One minor downside of splitting content into smaller pieces is that
compression does not work as well on shorter texts and so the size of
the document might increase.  But as the bulk of the document space 
is used to store images, a small reduction in the compression efficiency 
of the text content will hardly be noticeable, and is a small price 
to pay for an improved user experience.

<h2>Third Improvement:  Versioning</h2>

<p>
Once one is comfortable with the concept of storing each slide separately,
it is a small step to support versioning of the presentation.  Consider
the following schema:

<blockquote><pre>
CREATE TABLE slide(
  slideId INTEGER PRIMARY KEY,
  derivedFrom INTEGER REFERENCES slide,
  content TEXT     -- XML or JSON or whatever
);
CREATE TABLE version(
  versionId INTEGER PRIMARY KEY,
  priorVersion INTEGER REFERENCES version,
  checkinTime DATETIME,   -- When this version was saved
  comment TEXT,           -- Description of this version
  manifest TEXT           -- List of integer slideIds
);
</pre></blockquote>

<p>
In this schema, instead of each slide having a page number that determines
its order within the presentation, each slide has a unique
integer identifier that is unrelated to where it occurs in sequence.
The order of slides in the presentation is determined by a list of
slideIds, stored as a text string in the MANIFEST column of the VERSION
table.
Since multiple entries are allowed in the VERSION table, that means that
multiple presentations can be stored in the same document.

<p>
On startup, the application first decides which version it
wants to display.  Since the versionId will naturally increase in time
and one would normally want to see the latest version, an appropriate
query might be:

<blockquote><pre>
SELECT manifest, versionId FROM version ORDER BY versionId DESC LIMIT 1;
</pre></blockquote>

<p>
Or perhaps the application would rather use the
most recent checkinTime:

<blockquote><pre>
SELECT manifest, versionId, max(checkinTime) FROM version;
</pre></blockquote>

<p>
Using a single query such as the above, the application obtains a list
of the slideIds for all slides in the presentation.  The application then
queries for the content of the first slide, and parses and displays that
content, as before.

<p>(Aside:  Yes, that second query above that uses "max(checkinTime)"
really does work and really does return a well-defined answer in SQLite.
Such a query either returns an undefined answer or generates an error
in many other SQL database engines, but in SQLite it does what you would 
expect: it returns the manifest and versionId of the entry that has the
maximum checkinTime.)

<p>When the user does a "File/Save", instead of overwriting the modified
slides, the application can now make new entries in the SLIDE table for
just those slides that have been added or altered.  Then it creates a
new entry in the VERSION table containing the revised manifest.

<p>The VERSION table shown above has columns to record a check-in comment
(presumably supplied by the user) and the time and date at which the File/Save
action occurred.  It also records the parent version to record the history
of changes.  Perhaps the manifest could be stored as a delta from the
parent version, though typically the manifest will be small enough that
storing a delta might be more trouble than it is worth.  The SLIDE table
also contains a derivedFrom column which could be used for delta encoding
if it is determined that saving the slide content as a delta from its
previous version is a worthwhile optimization.

<p>So with this simple change, the ODP file now stores not just the most
recent edit to the presentation, but a history of all historic edits.  The
user would normally want to see just the most recent edition of the
presentation, but if desired, the user can now go backwards in time to 
see historical versions of the same presentation.

<p>Or, multiple presentations could be stored within the same document.

<p>With such a schema, the application would no longer need to make
periodic backups of the unsaved changes to a separate file to avoid lost
work in the event of a crash.  Instead, a special "pending" version could
be allocated and unsaved changes could be written into the pending version.
Because only changes would need to be written, not the entire document,
saving the pending changes would only involve writing a few kilobytes of
content, not multiple megabytes, and would take milliseconds instead of
seconds, and so it could be done frequently and silently in the background.
Then when a crash occurs and the user reboots, all (or almost all)
of their work is retained.  If the user decides to discard unsaved changes, 
they simply go back to the previous version.

<p>
There are details to fill in here.
Perhaps a screen can be provided that displays a history changes
(perhaps with a graph) allowing the user to select which version they
want to view or edit.  Perhaps some facility can be provided to merge
forks that might occur in the version history.  And perhaps the
application should provide a means to purge old and unwanted versions.
The key point is that using an SQLite database to store the content,
rather than a ZIP archive, makes all of these features much, much easier
to implement, which increases the possibility that they will eventually
get implemented.

<h2>And So Forth...</h2>

<p>
In the previous sections, we have seen how moving from a key/value
store implemented as a ZIP archive to a simple SQLite database
with just three tables can add significant capabilities to an application
file format.
We could continue to enhance the schema with new tables, with indexes
added for performance, with triggers and views for programming convenience,
and constraints to enforce consistency of content even in the face of
programming errors.  Further enhancement ideas include:
<ul>
<li> Store an [automated undo/redo stack] in a database table so that
     Undo could go back into prior edit sessions.
<li> Add [FTS4|full text search] capabilities to the slide deck, or across
     multiple slide decks.
<li> Decompose the "settings.xml" file into an SQL table that
     is more easily viewed and edited by separate applications.
<li> Break out the "Presentor Notes" from each slide into a separate
     table, for easier access from third-party applications and/or scripts.
<li> Enhance the presentation concept beyond the simple linear sequence of
     slides to allow for side-tracks and excursions to be taken depending on
     how the audience is responding.
</ul>

<p>
An SQLite database has a lot of capability, which
this essay has only begun to touch upon.  But hopefully this quick glimpse
has convinced some readers that using an SQL database as an application
file format is worth a second look.

<p>
Some readers might resist using SQLite as an application
file format due to prior exposure to enterprise SQL databases and
the caveats and limitations of those other systems.  
For example, many enterprise database
engines advise against storing large strings or BLOBs in the database
and instead suggest that large strings and BLOBs be stored as separate
files and the filename stored in the database.  But SQLite 
is not like that.  Any column of an SQLite database can hold
a string or BLOB up to about a gigabyte in size.  And for strings and
BLOBs of 100 kilobytes or less, 
[BLOB I/O performance|I/O performance is better] than using separate
files.

<p>
Some readers might be reluctant to consider SQLite as an application
file format because they have been inculcated with the idea that all
SQL database schemas must be factored into third normal form and store
only small primitive data types such as strings and integers.  Certainly
relational theory is important and designers should strive to understand
it.  But, as demonstrated above, it is often quite acceptable to store
complex information as XML or JSON in text fields of a database.
Do what works, not what your database professor said you ought to do.

<h2>Review Of The Benefits Of Using SQLite</h2>

<p>
In summary,
the claim of this essay is that using SQLite as a container for an application
file format like OpenDocument
and storing lots of smaller objects in that container
works out much better than using a ZIP archive holding a few larger objects.
To wit:

<ol>
<li><p>
An SQLite database file is approximately the same size, and in some cases
smaller, than a ZIP archive holding the same information.

<li><p>
The [atomic commit|atomic update capabilities]
of SQLite allow small incremental changes
to be safely written into the document.  This reduces total disk I/O
and improves File/Save performance, enhancing the user experience.

<li><p>
Startup time is reduced by allowing the application to read in only the
content shown for the initial screen.  This largely eliminates the
need to show a progress bar when opening a new document.  The document
just pops up immediately, further enhancing the user experience.

<li><p>
The memory footprint of the application can be dramatically reduced by
only loading content that is relevant to the current display and keeping
the bulk of the content on disk.  The fast query capability of SQLite
make this a viable alternative to keeping all content in memory at all times.
And when applications use less memory, it makes the entire computer more
responsive, further enhancing the user experience.

<li><p>
The schema of an SQL database is able to represent information more directly
and succinctly than a key/value database such as a ZIP archive.  This makes
the document content more accessible to third-party applications and scripts
and facilitates advanced features such as built-in document versioning, and
incremental saving of work in progress for recovery after a crash.
</ol>

<p>
These are just a few of the benefits of using SQLite as an application file
format &mdash; the benefits that seem most likely to improve the user
experience for applications like OpenOffice.  Other applications might
benefit from SQLite in different ways. See the [Application File Format]
document for additional ideas.

<p>
Finally, let us reiterate that this essay is a thought experiment.
The OpenDocument format is well-established and already well-designed.
Nobody really believes that OpenDocument should be changed to use SQLite
as its container instead of ZIP.  Nor is this article a criticism of
OpenDocument for not choosing SQLite as its container since OpenDocument
predates SQLite.  Rather, the point of this article is to use OpenDocument
as a concrete example of how SQLite can be used to build better 
application file formats for future projects.