Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | First complete draft of the new application file format document. Integrate with the rest of the documentation via hyperlinks. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
6d257b8d928cfd003a0a10eb10c0e4d5 |
User & Date: | drh 2014-03-13 00:43:56.100 |
Context
2014-03-13
| ||
15:38 | Improvements to the application file format document. (check-in: 9e12f0cedd user: drh tags: trunk) | |
00:43 | First complete draft of the new application file format document. Integrate with the rest of the documentation via hyperlinks. (check-in: 6d257b8d92 user: drh tags: trunk) | |
2014-03-12
| ||
18:12 | First draft of a document that advocates for using SQLite as an application file format. (check-in: aa63d85794 user: drh tags: trunk) | |
Changes
Changes to pages/appfileformat.in.
1 2 3 4 5 | <title>SQLite As An Application File Format</title> <h2>What Is An Application File Format?</h2> <p> | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | | | > | | | | | | | | | | < | > | | | > | | > | | | > > > | > > | > > > > > | | > | | < < < < > > > < | | | > | | > | > > > > > | | | > > > > > > > | | | > > > > > > > | | > > > > > > | | > | | | > > | | | > | > > > > > > > > > > > | | | | | | > > > > > > > > > > | | | > > | > | | > > > > | | | > > > > | > > | | | > > > > > > > > > > > > | | | | | > > | | | > > > > > > > > | | | > > > > > > > > | > | > > > > > > > > > > > > > > > > > > > > | > > > > > > > | | > > > > > > > | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 | <tcl>hd_keywords *appformat {application file-format}</tcl> <title>SQLite As An Application File Format</title> <h1 align="center"> SQLite As An Application File Format </h1> <h2>Executive Summary</h2> <p>This essay advocates for the use of SQLite as the file format in new applications. SQLite is often a better choice for an application file format than other techniques in common use. These are some of the reasons to prefer SQLite: <ol> <li> Ease Of Development <li> Single-File Documents <li> High-Level Query Language <li> Accessible Content <li> Cross-Platform <li> Atomic Transactions <li> Incremental And Continuous Updates <li> Easily Extensible <li> Performance <li> Concurrent Use By Multiple Processes <li> Multiple Programming Languages <li> Better Applications </ol> <p>Each of these reasons will be described in more detail following a brief discussion of what exactly this article means by "application file format". <h2>What Is An Application File Format?</h2> <p> An "application file format" is the file format used to persist application state to disk or to exchange information between programs. There are thousands of application file formats in use today. Here are a few examples: <ul> <li>DOC - Word Perfect and Microsoft Office documents <li>DWG - AutoCAD drawings <li>PDF - Portable Document Format from Adobe <li>XLS - Microsoft Excel Spreadsheet <li>GIT - Git source code repository <li>EPUB - The Electronic Publication format used by non-Kindle eBooks <li>ODT - The Open Document format used by OpenOffice and others <li>PPT - Microsoft PowerPoint presentations <li>ODP - The Open Document presentation format used by OpenOffice and others </ul> <p>Many application file formats fit into one of these three categories: <ol> <li><p><b>Fully Custom Formats.</b> Custom formats are specifically designed for a single application. DOC, DWG, PDF, XLS, and PPT are examples of custom formats. Custom formats are usually contained within a single file, for ease of transport. They are also usually binary, though the DWG format is a notable exception. Custom file formats require specialized application code to read and write and are not normally accessible from commonly available tools such as unix command-line programs and text editors. In other words, custom formats are often "opaque blobs". Generally speaking, to access the content of a custom application file format, you have to have a tool that is specifically engineered to read and/or write that format. <li><p><b>Pile-of-Files Formats.</b> Sometimes the application state is stored as a hierarchy of files. Git is a prime example of this, though the phenomenon occurs frequently in one-off and bespoke applications. A pile-of-files format essentially uses the filesystem as a key/value database, storing small chunks of information into separate files. This gives the advantage of making the content more accessible to common utility programs such as text editors or "awk" or "grep". But even if many of the files in a pile-of-files format are easily readable, there are usually some files that have their own custom format (example: Git "Packfiles") and are hence "opaque blobs" that are not readable or writable without specialized tools. It is also much less convenient to move a pile-of-files from one place or machine to another, than it is to move a single file. And it is hard to make a pile-of-files document into an email attachment, for example. Finally, a pile-of-files format breaks the "document metaphor": there is no one file that a user can point to that is the "document". <li><p><b>ZIP-ed Pile-of-Files Formats.</b> Some applications use a Pile-of-Files that is then encapsulated into a ZIP archive. EPUB, ODT,and ODP are examples of this approach. An EPUB book is really just a ZIP archive that contains various XHTML files for the text of book chapters, GIF and JPEG images for the artwork, and a specialized catalog file that tells the eBook reader how all the XML and image files fit together. OpenOffice documents (ODT and ODP) are also ZIP archives containing XML and images that represent their content as well as "catalog" files that show the interrelationships between the component parts. <p>A ZIP-ed pile-of-files format is a compromise between a full custom file format and a pure pile-of-files format. A ZIP-ed pile-of-files format is not an opaque blob in the same sense as a custom file format, since the component parts can still be accessed using any common ZIP archiver, but the format is not quite as accessible as a pure pile-of-files format because one does still need the ZIP archiver, and one cannot normally use command-line tools like "find" on the file hierarchy without first un-zipping it. On the other hand, a ZIP-ed pile-of-files format does preserve the document metaphor by putting all content into a single disk file. And because it is compressed, the ZIP-ed pile-of-files format tends to be more compact. <p>As with custom file formats, and unlike pure pile-of-file formats, a ZIP-ed pile-of-files format is not as easy to edit, since one most normally rewrite the entire file to change any component part. </ol> <p>The purpose of this document is to argue in favor of a fourth new catagory of application file format: An SQLite database file. <h2>SQLite As The Application File Format</h2> <p> For many applications, the use of SQLite as the application file format has a dozen or more compelling advantages over custom file formats, pile-of-file formats, and ZIP-ed pile-of-file formats. To wit: </p> <ol> <li><p><b>Ease Of Development.</b> No code is needed for reading or writing the application file. One has merely to link against the SQLite library, or include the [amalgamation | single "sqlite3.c" source file] with the rest of the application C code, and SQLite will take care of all of the application file I/O. This can reduce application code size by many thousands of lines, with corresponding saving in development and maintenance costs. <p>SQLite is widely used and stable. There are literally billions of SQLite database files in use daily, on smartphones and in desktop applications. SQLite is [tested | carefully tested] and proven reliable. It is not a component that needs much tuning or debugging, allowing developers to stay focused on application logic. <li><p><b>Single-File Documents.</b> An SQLite database is contained in a single file, which is easily copied or moved or attached. The "document" metaphor is preserved. SQLite does not have any file naming requirements and so the application can use any custom file suffix that it wants to help identify the file as "belonging" to the application. <p>SQLite database files contain a 4-byte [Application ID] in their headers that can be set to an application-defined value and then used to identify the "type" of the document for utility programs such as [http://linux.die.net/man/1/file | file(1)]. <li><p><b>High-Level Query Language.</b> SQLite is a complete relational database engine, which means that the application can access content using high-level queries. Application developers need not spend time thinking about "how" to retrieve the information they need from a document. Developers write SQL that expresses "what" information they want and let the database engine to figure out how to best retrieve that content. This helps developers operate "heads up" and remain focused on solving the user's problem, and avoid time spent "heads down" fiddling with low-level file formatting details. <p>A pile-of-files format can viewed as a key/value database. A key/value database is better than no database at all. But without transactions or indices or a high-level query language or a proper schema, it much harder and more error prone to use a key/value database than a relational database. <li><p><b>Accessible Content.</b> Information held in an SQLite database file is accessible using commonly available open-source command-line tools - tools that are installed by default on Mac and Linux systems and that are freely available as a self-contained EXE file on Windows. Unlike custom file formats, application-specific programs are <u>not</u> required to read or write content in an SQLite database. An SQLite database file is <u>not</u> an opaque blob. It is true that command-line tools such as text editors or "grep" or "awk" are not useful on an SQLite database, but the SQL query language is a much more powerful and convenient way for examining the the content, so the inability to use "grep" and "awk" and the like is not seen as a loss. <p>An SQLite database is a [file format | well-defined and well-documented] file format that is in widespread use by literally hundreds of thousands of applications and is backwards compatible to its inception in 2004 and which promises to continue to be compatible in years to come. The longevity of SQLite database files is particularly important to bespoke applications, since it allows the document content to be accessed years or decades in the future, long after all traces of the original application have been lost. Data lives longer than code. <li><p><b>Cross-Platform.</b> SQLite database files are portable between 32-bit and 64-bit machines and between big-endian and little-endian architectures and between any of the various flavors of windows and unix-like operating systems. The application using an SQLite application file format can store binary numeric data without having to worry about the byte-order of integers or floating point numbers. Text content can be read or written as UTF-8, UTF-16LE, or UTF-16BE and SQLite will automatically perform any necessary translations on-the-fly. <li><p><b>Atomic Transactions.</b> Writes to an SQLite database are [atomic commit | atomic]. They either happen completely or not at all, even during system crashes or power failures. So there is no danger of corrupting a document just because the power happened to go out at the same instant that a change was being written to disk. <p>SQLite is transactional, meaning that multiple changes can be grouped together such that either all or none of them occur, and so that the changes can be rolled back if a problem is found prior to to commit. This allows an application to make a change incrementally, then run various sanity and consistency checks on the resulting data prior to committing the changes to disk. The [http://www.fossil-scm.org/ | Fossil] DVCS [http://www.fossil-scm.org/fossil/doc/tip/www/selfcheck.wiki|uses this technique] to verify that no repository history has been lost prior to each change to the repository. <li><p><b>Incremental And Continuous Updates.</b> When writing to an SQLite database file, only those parts of the file that actually change are written out to disk. This makes the writing happen faster and saves wear on SSDs. This is an enormous advantage over custom and ZIP-ed pile-of-files formats, both of which must completely rewrite the entire document in order to change a single byte. Pure pile-of-files formats can also do incremental updates to some extent, though the granularity of writes is usually larger with pile-of-file formats (a single file) than with SQLite (a single page). <p>A desktop application built on SQLite can also do continuous update. In other words, instead of collecting changes in memory and then writing them to disk only on a File/Save action, changes can be written back to the disk as they occur. The avoids loss of work on a system crash or power failure. The undo/redo stack can also be kept in the on-disk database, meaning that undo/redo can occur across session boundaries. And maintenance of the undo/redo stack can often be automated using triggers. <li><p><b>Easily Extensible.</b> As an application grows, new features can be added to an SQLite application file format simply by adding new tables to the schema or by adding new columns to existing tables. Adding columns or tables does not change the meaning of prior queries against the database, so with a modicum of care to ensuring that the meaning of the columns and tables are preserved, backwards compatibility is maintained. <p>It is possible to extend custom or pile-of-files formats too, of course, but doing is often much harder. If indices are added, then all application code that changes the corresponding tables must be located and modified to keep those indices up-to-date. If columns are added, then all application code that accesses the corresponding table must be located and modified to take into account the new columns. <li><p><b>Performance.</b> In many cases, an SQLite application file format will be faster than a custom or pile-of-files format. In the case of a custom format, SQLite often dramatically improves start-up times because instead of having to read and parse the entire document into memory, the application can do queries to extract only the information needed for the initial screen. As the application progresses, it only needs to load as much material as is needed to draw the next screen, and can discard information from prior screens that is no longer in use. This helps keep the memory footprint of the application under control. <p>A pile-of-files format can be read incrementally just like SQLite. But many developers are surprised to learn that SQLite can read and write smaller BLOBs (less than about 100KB in size) from its database faster than those same blobs can be read or written as separate files from the filesystem. (See [Internal Versus External BLOBs] for further information.) There is overhead associated with operating a relational database engine, however one should not assume that direct file I/O is faster than SQLite database I/O, as often it is not. <p>In either case, if performance problems do arise in an SQLite application those problems can often be resolved by adding one or two [CREATE INDEX] statements to the schema or perhaps running [ANALYZE] one time and without having to touch a single line of application code. But if a performance problem comes up in a custom or pile-of-files format, the the fix will often require extensive changes to application code to add and maintain new indices or to extract information using different algorithms. <li><p><b>Concurrent Use By Multiple Processes.</b> SQLite automatically coordinates concurrent access to the same document from multiple threads and/or processes. Two or more applications can connect and read from the same document at the same time. Writes are serialized, but as writes normally only take milliseconds, applications simply take turns writing. SQLite automatically ensures that the low-level format of the document is uncorrupted. Accomplishing the same with a custom or pile-of-files format, in contrast, requires extensive support in the application. And the application logic needed to support concurrency is a notorious bug-magnet. <li><p><b>Multiple Programming Languages.</b> Though SQLite is itself written in ANSI-C, interfaces exist for just about every other programming language you can think of: C++, C#, Objective-C, Java, Tcl, Perl, Python, Ruby, Erlang, JavaScript, and so forth. So programmers can develop in whatever language they are most comfortable with and which best matches the needs of the application. <p>An SQLite application file format is a great choice in cases where there is a collection or "federation" of separate programs, often written in different languages and by different development teams. This comes up commonly in research or laboratory environments where one team is responsible for data acquisition and one or more different teams are responsible for analysis of the data. Each team can use whatever hardware, operating system, programming language and development methodology that they are most comfortable with, and as long as all programs use an SQLite database with a common schema as their application file format, they can all interoperate. <li><p><b>Better Applications.</b> If the application file format is an SQLite database, the complete documentation for that file format consists of the database schema, with perhaps a few extra words about what each table and column represent. The description of a custom file format, on the other hand, typically runs on for hundreds of pages. A pile-of-files format, while much simpler and easier to describe than a fully custom format, still tends to be much larger and more complex than an SQL schema dump, since the names and format for the individual files must still be described. <p>This is not a trivial point. A clear, concise, and easy to understand file format is a crucial part of any application design. Fred Brooks, in his all-time best-selling computer science text, <i>The Mythical Man-Month</i> says: <blockquote><i>Representation is the essence of computer programming.<br />...<br /> Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious.</i></blockquote> <p>Rob Pike, in his <i>Rules of Programming</i> expresses the same idea this way: <blockquote> <i>Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.</i></blockquote> <p> Linus Torvalds used different words to say much the same thing on the Git mailing list on 2006-06-27: <blockquote> <i>Bad programmers worry about the code. Good programmers worry about data structures and their relationships.</i> </blockquote> <p>The point is this: an SQL database schema almost always does a far better job of defining and organizing the tables and data structures and their relationships. And having clear, concise, and well-defined representation almost always results in an application that performs better, has fewer problems, and is easier to develop and maintain. </ol> <h2>Conclusion</h2> <p> SQLite is not the perfect application file format for every situation. But in many cases, SQLite is a far better choice than either a custom file format, a pile-of-files, or a ZIP-ed pile-of-files. SQLite is a high-level, stable, reliable, cross-platform, widely-deployed, extensible, performant, accessible, concurrent file format. It deserves your consideration as the standard file format on your next application design. |
Changes to pages/whentouse.in.
︙ | ︙ | |||
52 53 54 55 56 57 58 | It is designed to replace [http://man.he.net/man3/fopen | fopen()]. </p> <h2>Situations Where SQLite Works Well</h2> <ul> | | > > | 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 | It is designed to replace [http://man.he.net/man3/fopen | fopen()]. </p> <h2>Situations Where SQLite Works Well</h2> <ul> <tcl>hd_fragment appfileformat</tcl> <li><p><b>Application File Format</b></p> <p>(Further details [application file-format | here].)</p> <p> SQLite has been used with great success as the on-disk file format for desktop applications such as version control systems, financial analysis tools, media cataloging and editing suites, CAD packages, record keeping programs, and so forth. The traditional File/Open operation calls sqlite3_open() to attach to the database |
︙ | ︙ |