Documentation Source Text

Artifact Content
Login

Artifact afbdb32535d14665db065052cb0c9f37f76ef5235b23af7d3bb17645bc98721b:


<title>The Lemon LALR(1) Parser Generator</title>
<tcl>hd_keywords {Lemon parser generator} {Lemon} \
                 {Lemon LALR parser generator}</tcl>

<table_of_contents>

<h1>Overview</h1>

<p>The SQL language parser for SQLite is generated using a code-generator
program called "Lemon".  The Lemon program reads a grammar of the input
language and emits C-code to implement a parser for that language.


<h2>Lemon Source Files And Documentation</h2>

<p>Lemon does not have its own source repository.  Rather, Lemon consists
of a few files in the SQLite source tree:

<ul>
<li><p>
     [https://sqlite.org/src/doc/trunk/doc/lemon.html|lemon.html] &rarr;
     The original detailed usage documentation and programmers reference
     for Lemon.
<li><p>
     [https://sqlite.org/src/file/tool/lemon.c|lemon.c] &rarr; The source code
     for the utility program that reads a grammar file and generates 
     corresponding parser C-code.
<li><p>
     [https://sqlite.org/src/file/tool/lempar.c|lempar.c] &rarr; A template
     for the generated parser C-code.  The "lemon" utility program reads this
     template and inserts additional code in order to generate a parser.
</ul>

<h1>Advantages of Lemon</h1>

<p>Lemon generates an LALR(1) parser.  It's operation is similar to the
more familiar tools [https://en.wikipedia.org/wiki/Yacc|Yacc] and
[https://en.wikipedia.org/wiki/GNU_bison|Bison], but Lemon adds important
improvements, including:

<ul>
<li><p>
     The grammar syntax is less error prone - using symbol names for
     semantic values rather that the "$1"-style positional notation
     of Yacc.
<li><p>
     In Lemon, the tokenizer calls the parser.  Yacc operates the other
     way around, with the parser calling the tokenizer.  The Lemon
     approach is reentrant and threadsafe, whereas Yacc uses global 
     variables and is therefore neither.  Reentrancy is especially
     important for SQLite since some SQL statements make recursive calls
     to the parser.  For example, when parsing a CREATE TABLE statement,
     SQLite invokes the parser recursively to generate an INSERT statement
     to make a new entry in the [sqlite_master] table.
<li><p>
     Lemon has the concept of a non-terminal destructor that can be
     used to reclaim memory or other resources following a syntax error
     or other aborted parse.
</ul>

<h2>Use of Lemon Within SQLite</h2>

<p>Lemon is used in two places in SQLite.

<p>The primary use of Lemon is to create the SQL language parser.
A grammar file ([https://sqlite.org/src/file/src/parse.y|parse.y]) is
compiled by Lemon into parse.c and parse.h.  The parse.c file is
incorporated into the [amalgamation] without further modification.
The parse.h file is post-processed by the
[https://sqlite.org/src/file/tool/addopcodes.tcl|addopcodes.tcl] script
before being incorporated into the [amalgamation].

<p>Lemon is also used to generate parse for the query pattern
expressions in the [FTS5] extension.  In this case, the input grammar
file is [https://sqlite.org/src/file/ext/fts5/fts5parse.y|fts5parse.y].

<h2>Lemon Customizations Especially For SQLite</h2>

<p>One of the advantages of hosting code generator tool as part of
the project is that the tools can be optimized to serve specific needs of
the overall project.  Lemon has benefited from this effect. Over the years,
the Lemon parser generator has been extended and enhanced to provide
new capabilities and improved performance to SQLite.  A few of the
specific enhancements to Lemon that are specifically designed for use
by SQLite include:

<ul>
<li><p>
Lemon has the concept of a "fallback" tokens.
The SQL language contains a large number of keywords and these keywords
have the potential to collide with identifier names.
Lemon has the ability to designate some keywords has being able to
"fallback" to an identifier.  If the keyword appears in the input token
stream in a context that would otherwise be a syntax error, the token
is automatically transformed into its fallback before the syntax error
is raised.  This feature allows the parser to be very forgiving of
reserved words used as identifiers, which is a problem that comes up
frequently in the SQL language.

<li><p>
In support of the [MC/DC|100% MC/DC testing] goal for SQLite, 
the parser code generated by Lemon has no unreachable branches,
and contains extra (compile-time selected) instrumentation useful
for measuring test coverage.

<li><p>
Lemon supports conditional compilation of grammar file rules, so that
a different parser can be generated depending on compile-time options.

<li><p>
As a performance optimization, reduce actions in the Lemon input grammar
are allowed to contain comments of the form "/*A-overwrites-Z*/" to indicate
that the semantic value "A" on the right-hand side of the rule is allowed
to directly overwrite the semantic value "Z" on the left-hand side.
This simple optimization reduces the number of stack operations in the
push-down automaton used to parse the input grammar, and thus improve
performance of the parser.  It also makes the generated code a little smaller.
</ul>

<p>The parsing of SQL statements is a significant consumer of CPU cycles 
in any SQL database engine.  On-going efforts to optimize SQLite have caused
the developers to spend a lot of time tweaking Lemon to generate faster
parsers.  These efforts have benefited all users of the Lemon parser generator,
not just SQLite.  But if Lemon had been a separately maintained tool, it
would have been more difficulty to make coordinated changes to both SQLite
and Lemon, and as a result not as much optimization would have been
accomplished.  Hence, the fact that the parser generator tool is included
in the source tree for SQLite has turned out to be a net benefit for both
the tool itself and for SQLite.

<h1>History Of Lemon</h1>

<p>Lemon was original written by D. Richard Hipp (also the creator of SQLite)
while he was in graduate school at Duke University between 1987 and 1992.
The original creation date of Lemon has been lost, but was probably sometime
around 1990.  Lemon generates an LALR(1) parser.  There was companion 
LL(1) parser generator tool named "Lime", but the source code for Lime
has been lost.

<p>The Lemon source code was originally written as separate source files,
and only later merged into a single "lemon.c" source file.

<p>The author of Lemon and SQLite (Hipp) reports that his C programming
skills were greatly enhanced by studying John Ousterhout's original
source code to Tcl.  Hipp discovered and studied Tcl in 1993.  Lemon
was written before then, and SQLite afterwards.  There is a clear
difference in the coding styles of these two products, with SQLite seeming
to be cleaner, more readable, and easier to maintain.