*** DRAFT ***

Requirements For The SQLite Tokenizer

When processing SQL statements, SQLite (like every other SQL database engine) breaks each statement up into tokens, which are then forwarded to the parser component. This splitting is done by the "tokenizer" component of SQLite. This document specifies requirements that precisely define the operation of the SQLite tokenizer.

Character classes

SQL statements are composed of Unicode characters. Specific individual characters may be described using a notation consisting of the character "u" followed by four hexadecimal digits. For example, the lower-case letter "a" can be expressed as "u0061" and the dollar sign can be expressed as "u0024". For notational convenience, the following character classes are defined:

WHITESPACE

One of these five characters: u0009, u000a, u000c, u000d, or u0020

ALPHABETIC

Any of the characters in the range u0041 through u005a (letters "A" through "Z") or in the range u0061 through u007a (letters "a" through "z") or the character u005f ("_") or any other character larger than u007f.

NUMERIC

Any of the characters in the range u0030 through u0039 (digits "0" through "9")

ALPHANUMERIC

Any character which is either ALPHABETIC or NUMERIC

HEXADECIMAL

Any NUMERIC character or a character in the range u0041 through u0046 ("A" through "F") or in the range u0061 through u0066 ("a" through "f")

SPECIAL

Any character that is not WHITESPACE, ALPHABETIC, or NUMERIC
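
The character classes above can be written as simple predicates. The following is a minimal sketch in Python, not SQLite's implementation; the function names are hypothetical and each function expects a single-character string.

```python
# Hypothetical helper names; each function expects a one-character string.

def is_whitespace(ch: str) -> bool:
    # u0009, u000a, u000c, u000d, u0020 (note: u000b is not included)
    return ch in "\t\n\f\r "

def is_alphabetic(ch: str) -> bool:
    # A-Z, a-z, underscore, or any character larger than u007f
    return "A" <= ch <= "Z" or "a" <= ch <= "z" or ch == "_" or ord(ch) > 0x7F

def is_numeric(ch: str) -> bool:
    return "0" <= ch <= "9"

def is_alphanumeric(ch: str) -> bool:
    return is_alphabetic(ch) or is_numeric(ch)

def is_hexadecimal(ch: str) -> bool:
    return is_numeric(ch) or "A" <= ch <= "F" or "a" <= ch <= "f"

def is_special(ch: str) -> bool:
    # Everything that is not WHITESPACE, ALPHABETIC, or NUMERIC
    return not (is_whitespace(ch) or is_alphabetic(ch) or is_numeric(ch))
```

Under these definitions the dollar sign is SPECIAL rather than ALPHABETIC, which is why the identifier requirements mention it separately.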

Token requirements

Processing is left-to-right. This seems obvious, but it needs to be explicitly stated.

H41010: SQLite shall divide input SQL text into tokens working from left to right.

The standard practice in SQL, as with most context-free grammar based programming languages, is to resolve ambiguities in tokenizing by selecting the option that results in the longest tokens.

H41020: At each step in the SQL tokenization process, SQLite shall extract the longest possible token from the remaining input text.
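
The longest-token rule can be illustrated with a small sketch. The rule set below is a hypothetical, deliberately tiny subset chosen only to show ambiguous prefixes; the FLOAT pattern in particular is simplified and the function name is an assumption, not part of SQLite.

```python
import re

# Hypothetical, tiny rule set used only to show how ambiguity is resolved.
RULES = [
    ("INTEGER", re.compile(r"[0-9]+")),
    ("FLOAT", re.compile(r"[0-9]+\.[0-9]*")),  # simplified for illustration
    ("LT", re.compile(r"<")),
    ("LE", re.compile(r"<=")),
    ("LSHIFT", re.compile(r"<<")),
]

def longest_token(text, pos=0):
    """Return (name, lexeme) for the longest rule matching at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best
```

With these rules, the input "<=1" yields an LE token rather than LT followed by something else, and "12.5" yields a single FLOAT rather than an INTEGER followed by more tokens.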

The tokenizer recognizes tokens one by one and passes them on to the parser, except that whitespace is ignored. The only use for whitespace is as a separator between tokens.

H41030: The tokenizer shall pass each non-WHITESPACE token seen on to the parser in the order in which the tokens are seen.

The tokenizer appends a semicolon to the end of input if necessary. This ensures that every SQL statement is terminated by a semicolon.

H41040: When the tokenizer reaches the end of input where the last token sent to the parser was not a SEMI token, it shall send a SEMI token to the parser.
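
The two rules above (drop whitespace, supply a final SEMI) can be sketched as a filter over a token stream. This is an illustration only: it assumes tokens arrive as (kind, text) pairs from some earlier scanning step, the function name is hypothetical, and sending a SEMI for completely empty input is an assumption about a case the requirement does not spell out.

```python
def tokens_for_parser(raw_tokens):
    """Yield the tokens the parser should see, in order.

    raw_tokens is an iterable of (kind, text) pairs (an assumed shape).
    """
    last_kind = None
    for kind, text in raw_tokens:
        if kind == "WHITESPACE":
            continue  # whitespace only separates tokens; never forwarded
        last_kind = kind
        yield kind, text
    if last_kind != "SEMI":
        # Synthesized terminator; the ";" text is an illustrative choice.
        yield "SEMI", ";"
```

For example, the raw sequence SELECT, WHITESPACE, INTEGER reaches the parser as SELECT, INTEGER, SEMI.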

An unrecognized token generates an immediate error and aborts the parse.

H41050: When the tokenizer encounters text that is not a valid token, it shall cause an error to be returned to the application.

Whitespace tokens

Whitespace has the usual definition.

H41100: SQLite shall recognize a sequence of one or more WHITESPACE characters as a WHITESPACE token.

An SQL comment runs from "--" through the end of the line and is treated as whitespace.

H41110: SQLite shall recognize as a WHITESPACE token the two-character sequence "--" (u002d, u002d) followed by any sequence of non-zero characters up through and including the first u000a character or until end of input.

A C-style comment "/*...*/" is also recognized as white-space.

H41120: SQLite shall recognize as a WHITESPACE token the two-character sequence "/*" (u002f, u002a) followed by any sequence of zero or more non-zero characters through the first "*/" (u002a, u002f) sequence or until end of input.
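
A minimal sketch of how the two comment forms might be scanned follows. The function name is hypothetical, and the sketch assumes the input string contains no u0000 character, so the "non-zero characters" condition is not checked.

```python
def scan_comment(text, pos=0):
    """Return the index just past the comment starting at pos.

    Returns pos unchanged if no comment starts there. An unterminated
    comment extends to the end of input, as the requirements allow.
    """
    if text.startswith("--", pos):
        newline = text.find("\n", pos + 2)
        return len(text) if newline == -1 else newline + 1
    if text.startswith("/*", pos):
        close = text.find("*/", pos + 2)
        return len(text) if close == -1 else close + 2
    return pos
```

Note that the search for "*/" starts after the opening "/*", so the three-character input "/*/" is an unterminated comment rather than a complete one.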

Identifier tokens

Identifiers follow the usual rules with the exception that SQLite allows the dollar-sign symbol in the interior of an identifier. The dollar-sign is for compatibility with Microsoft SQL-Server and is not part of the SQL standard.

H41130: SQLite shall recognize as an ID token any sequence of characters that begins with an ALPHABETIC character and continues with zero or more ALPHANUMERIC characters and/or "$" (u0024) characters and which is not a keyword token.
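
As a rough sketch, the shape of an unquoted identifier can be expressed as a regular expression. The names below are hypothetical, and the sketch does not perform the keyword check that H41130 also requires; a matched word may still turn out to be a keyword.

```python
import re

# First character: ALPHABETIC (A-Z, a-z, "_", or above u007f).
# Following characters: ALPHANUMERIC or "$".
BARE_WORD_RE = re.compile(
    r"[A-Za-z_\u0080-\U0010FFFF][A-Za-z0-9_$\u0080-\U0010FFFF]*"
)

def scan_bare_word(text, pos=0):
    """Return the identifier-shaped word starting at pos, or None."""
    m = BARE_WORD_RE.match(text, pos)
    return m.group() if m else None
```

The dollar sign is accepted only in the interior or at the end of the word, so "foo$bar" is one word while "$x" does not match at all.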

Identifiers can be arbitrary character strings within square brackets. This feature is also for compatibility with Microsoft SQL-Server and not a part of the SQL standard.

H41140: SQLite shall recognize as an ID token any sequence of non-zero characters that begins with "[" (u005b) and continues through the first "]" (u005d) character.

The standard way of quoting SQL identifiers is to use double-quotes.

H41150: SQLite shall recognize as an ID token any sequence of characters that begins with a double-quote (u0022), is followed by zero or more non-zero characters and/or pairs of double-quotes (u0022) and terminates with a double-quote (u0022) that is not part of a pair.

MySQL allows identifiers to be quoted using the grave accent character. SQLite supports this for interoperability.

H41160: SQLite shall recognize as an ID token any sequence of characters that begins with a grave accent (u0060), is followed by zero or more non-zero characters and/or pairs of grave accents (u0060) and terminates with a grave accent (u0060) that is not part of a pair.
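
The three quoted identifier forms might be scanned along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the input is assumed to contain no u0000 character, and returning None for an unterminated token is an illustrative choice.

```python
def scan_quoted(text, pos, quote):
    """Scan a token that starts with `quote` at text[pos].

    A doubled quote inside the token counts as a pair and does not end
    it. Returns the index just past the closing quote, or None if the
    token is unterminated.
    """
    if text[pos] != quote:
        raise ValueError("token does not start with the given quote")
    i = pos + 1
    while i < len(text):
        if text[i] == quote:
            if i + 1 < len(text) and text[i + 1] == quote:
                i += 2  # a pair of quotes: stay inside the token
                continue
            return i + 1  # a lone quote terminates the token
        i += 1
    return None

def scan_bracketed(text, pos):
    """Scan "[...]"; there is no escape, so the first "]" ends the token."""
    close = text.find("]", pos + 1)
    return None if close == -1 else close + 1
```

For example, scan_quoted('"a""b" c', 0, '"') stops after the sixth character, treating the doubled quote in the middle as part of the identifier.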

Literals

This is the usual definition of string literals for SQL. SQL uses the classic Pascal string literal format.

H41200: SQLite shall recognize as a STRING token a sequence of characters that begins with a single-quote (u0027), is followed by zero or more non-zero characters and/or pairs of single-quotes (u0027) and terminates with a single-quote (u0027) that is not part of a pair.

Blob literals are similar to string literals except that they begin with a single "X" character and contain hexadecimal data.

H41210: SQLite shall recognize as a BLOB token an upper or lower-case "X" (u0058 or u0078) followed by a single-quote (u0027) followed by a number of HEXADECIMAL characters that is a multiple of two and terminated by a single-quote (u0027).
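
The blob literal rule lends itself to a short regular-expression sketch. The names are hypothetical, and decoding the hexadecimal digits into bytes goes one step beyond tokenization; it is included only to make the example easier to check. The sketch treats zero digits as a valid multiple of two.

```python
import re

# "X" or "x", a single-quote, an even number of hex digits, a single-quote.
BLOB_RE = re.compile(r"[xX]'((?:[0-9A-Fa-f]{2})*)'")

def match_blob(text, pos=0):
    """Return the decoded bytes of a blob literal at pos, or None."""
    m = BLOB_RE.match(text, pos)
    return bytes.fromhex(m.group(1)) if m else None
```

An odd number of digits, as in x'ABC', or a non-hexadecimal character between the quotes means the text is not a BLOB token under this rule.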

Integer literals are a string of digits. The plus or minus sign that might optionally precede an integer is not part of the integer token.

H41220: SQLite shall recognize as an INTEGER token any sequence of one or more NUMERIC characters.

An "exponentiation suffix" is defined to be an upper or lower case "E" (u0045 or u0065) followed by one or more NUMERIC characters. The "E" and the NUMERIC characters may optionally be separated by a plus-sign (u002b) or a minus-sign (u002d). An exponentiation suffix is part of the definition of a FLOAT token:

H41230: SQLite shall recognize as a FLOAT token a sequence of one or more NUMERIC characters together with zero or one period (u002e) and followed by an exponentiation suffix.
H41240: SQLite shall recognize as a FLOAT token a sequence of one or more NUMERIC characters that includes exactly one period (u002e) character.
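
The numeric literal rules can be sketched with one regular expression plus a classification step. The names are hypothetical. The sketch assumes the period may appear before, between, or after the digits, which is one reading of the requirement wording, and it leaves any sign to the parser, as stated above.

```python
import re

# Digits with an optional period, then an optional exponentiation suffix.
NUMBER_RE = re.compile(r"(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][+-]?[0-9]+)?")

def scan_number(text, pos=0):
    """Return (kind, lexeme) for a numeric literal at pos, or None."""
    m = NUMBER_RE.match(text, pos)
    if m is None:
        return None
    lexeme = m.group()
    # Only digits: INTEGER. A period or an exponent makes it a FLOAT.
    kind = "INTEGER" if lexeme.isdigit() else "FLOAT"
    return kind, lexeme
```

Because the exponent needs at least one digit, the input "1e" produces the INTEGER "1" and leaves the "e" for the next token.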

Variables

Variables are used as placeholders in SQL statements for constant values that are to be bound at start-time.

H40310: SQLite shall recognize as a VARIABLE token a question-mark (u003f) followed by zero or more NUMERIC characters.

A "parameter name" is defined to be a sequence of one or more characters that consists of ALPHANUMERIC characters and/or dollar-signs (u0024) intermixed with pairs of colons (u003a) and optionally followed by any sequence of non-zero, non-WHITESPACE characters enclosed in parentheses (u0028 and u0029).

H40320: SQLite shall recognize as a VARIABLE token one of the characters at-sign (u0040), dollar-sign (u0024), or colon (u003a) followed by a parameter name.
H40330: SQLite shall recognize as a VARIABLE token the sharp-sign (u0023) followed by a parameter name that does not begin with a NUMERIC character.

The REGISTER token is a special token used internally. It does not appear as part of the published user interface. Hence, the following is a low-level requirement:

L42040: SQLite shall recognize as a REGISTER token a sharp-sign (u0023) followed by one or more NUMERIC characters.
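
The VARIABLE and REGISTER rules can be sketched as follows. This is a simplified illustration with hypothetical names: it handles ALPHANUMERIC characters, dollar-signs, and pairs of colons in a parameter name, but deliberately omits the optional parenthesized suffix.

```python
import re

# Simplified parameter name: ALPHANUMERIC, "$", or "::" pairs, one or more.
# The optional "(...)" suffix from the definition above is NOT handled here.
NAME = r"(?:[A-Za-z0-9_$\u0080-\U0010FFFF]|::)+"

REGISTER_RE = re.compile(r"#[0-9]+")
VARIABLE_RE = re.compile(
    r"\?[0-9]*"                  # "?" followed by zero or more digits
    r"|[@$:]" + NAME +           # "@", "$" or ":" followed by a name
    r"|#(?![0-9])" + NAME        # "#" and a name not starting with a digit
)

def scan_variable(text, pos=0):
    """Return (kind, lexeme) for a VARIABLE or REGISTER at pos, or None."""
    m = REGISTER_RE.match(text, pos)
    if m:
        return "REGISTER", m.group()
    m = VARIABLE_RE.match(text, pos)
    if m:
        return "VARIABLE", m.group()
    return None
```

The sharp-sign is the only prefix shared by two token types; the first character after it decides between REGISTER and VARIABLE.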

Operator tokens

The following sequences of special characters are recognized as tokens:

H41403: SQLite shall recognize the 1-character sequence "-" (u002d) as token MINUS
H41406: SQLite shall recognize the 1-character sequence "(" (u0028) as token LP
H41409: SQLite shall recognize the 1-character sequence ")" (u0029) as token RP
H41412: SQLite shall recognize the 1-character sequence ";" (u003b) as token SEMI
H41415: SQLite shall recognize the 1-character sequence "+" (u002b) as token PLUS
H41418: SQLite shall recognize the 1-character sequence "*" (u002a) as token STAR
H41421: SQLite shall recognize the 1-character sequence "/" (u002f) as token SLASH
H41424: SQLite shall recognize the 1-character sequence "%" (u0025) as token REM
H41427: SQLite shall recognize the 1-character sequence "=" (u003d) as token EQ
H41430: SQLite shall recognize the 2-character sequence "==" (u003d u003d) as token EQ
H41433: SQLite shall recognize the 2-character sequence "<=" (u003c u003d) as token LE
H41436: SQLite shall recognize the 2-character sequence "<>" (u003c u003e) as token NE
H41439: SQLite shall recognize the 2-character sequence "<<" (u003c u003c) as token LSHIFT
H41442: SQLite shall recognize the 1-character sequence "<" (u003c) as token LT
H41445: SQLite shall recognize the 2-character sequence ">=" (u003e u003d) as token GE
H41448: SQLite shall recognize the 2-character sequence ">>" (u003e u003e) as token RSHIFT
H41451: SQLite shall recognize the 1-character sequence ">" (u003e) as token GT
H41454: SQLite shall recognize the 2-character sequence "!=" (u0021 u003d) as token NE
H41457: SQLite shall recognize the 1-character sequence "," (u002c) as token COMMA
H41460: SQLite shall recognize the 1-character sequence "&" (u0026) as token BITAND
H41463: SQLite shall recognize the 1-character sequence "~" (u007e) as token BITNOT
H41466: SQLite shall recognize the 1-character sequence "|" (u007c) as token BITOR
H41469: SQLite shall recognize the 2-character sequence "||" (u007c u007c) as token CONCAT
H41472: SQLite shall recognize the 1-character sequence "." (u002e) as token DOT
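
The operator list above maps naturally onto a lookup table. The sketch below is one possible arrangement, with hypothetical names, that tries the 2-character form before the 1-character form so that the longest operator wins. It assumes comments ("--" and "/*") have already been checked for before this function is called.

```python
# Operator text -> token name, taken from the requirements above.
OPERATORS = {
    "-": "MINUS", "(": "LP", ")": "RP", ";": "SEMI", "+": "PLUS",
    "*": "STAR", "/": "SLASH", "%": "REM", "=": "EQ", "==": "EQ",
    "<=": "LE", "<>": "NE", "<<": "LSHIFT", "<": "LT", ">=": "GE",
    ">>": "RSHIFT", ">": "GT", "!=": "NE", ",": "COMMA", "&": "BITAND",
    "~": "BITNOT", "|": "BITOR", "||": "CONCAT", ".": "DOT",
}

def scan_operator(text, pos=0):
    """Return (token_name, lexeme) for the operator at pos, or None."""
    for length in (2, 1):  # prefer the longer operator
        lexeme = text[pos:pos + length]
        if lexeme in OPERATORS:
            return OPERATORS[lexeme], lexeme
    return None
```

Two spellings can share a token name: both "=" and "==" produce EQ, and both "<>" and "!=" produce NE. A lone "!" is not an operator and yields None.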

Keyword tokens

The following keywords are recognized as distinct tokens:

H41503: SQLite shall recognize the 5-character sequence "ABORT" in any combination of upper and lower case letters as the keyword token ABORT.
H41506: SQLite shall recognize the 3-character sequence "ADD" in any combination of upper and lower case letters as the keyword token ADD.
H41509: SQLite shall recognize the 5-character sequence "AFTER" in any combination of upper and lower case letters as the keyword token AFTER.
H41512: SQLite shall recognize the 3-character sequence "ALL" in any combination of upper and lower case letters as the keyword token ALL.
H41515: SQLite shall recognize the 5-character sequence "ALTER" in any combination of upper and lower case letters as the keyword token ALTER.
H41518: SQLite shall recognize the 7-character sequence "ANALYZE" in any combination of upper and lower case letters as the keyword token ANALYZE.
H41521: SQLite shall recognize the 3-character sequence "AND" in any combination of upper and lower case letters as the keyword token AND.
H41524: SQLite shall recognize the 2-character sequence "AS" in any combination of upper and lower case letters as the keyword token AS.
H41527: SQLite shall recognize the 3-character sequence "ASC" in any combination of upper and lower case letters as the keyword token ASC.
H41530: SQLite shall recognize the 6-character sequence "ATTACH" in any combination of upper and lower case letters as the keyword token ATTACH.
H41533: SQLite shall recognize the 13-character sequence "AUTOINCREMENT" in any combination of upper and lower case letters as the keyword token AUTOINCR.
H41536: SQLite shall recognize the 6-character sequence "BEFORE" in any combination of upper and lower case letters as the keyword token BEFORE.
H41539: SQLite shall recognize the 5-character sequence "BEGIN" in any combination of upper and lower case letters as the keyword token BEGIN.
H41542: SQLite shall recognize the 7-character sequence "BETWEEN" in any combination of upper and lower case letters as the keyword token BETWEEN.
H41545: SQLite shall recognize the 2-character sequence "BY" in any combination of upper and lower case letters as the keyword token BY.
H41548: SQLite shall recognize the 7-character sequence "CASCADE" in any combination of upper and lower case letters as the keyword token CASCADE.
H41551: SQLite shall recognize the 4-character sequence "CASE" in any combination of upper and lower case letters as the keyword token CASE.
H41554: SQLite shall recognize the 4-character sequence "CAST" in any combination of upper and lower case letters as the keyword token CAST.
H41557: SQLite shall recognize the 5-character sequence "CHECK" in any combination of upper and lower case letters as the keyword token CHECK.
H41560: SQLite shall recognize the 7-character sequence "COLLATE" in any combination of upper and lower case letters as the keyword token COLLATE.
H41563: SQLite shall recognize the 6-character sequence "COLUMN" in any combination of upper and lower case letters as the keyword token COLUMNKW.
H41566: SQLite shall recognize the 6-character sequence "COMMIT" in any combination of upper and lower case letters as the keyword token COMMIT.
H41569: SQLite shall recognize the 8-character sequence "CONFLICT" in any combination of upper and lower case letters as the keyword token CONFLICT.
H41572: SQLite shall recognize the 10-character sequence "CONSTRAINT" in any combination of upper and lower case letters as the keyword token CONSTRAINT.
H41575: SQLite shall recognize the 6-character sequence "CREATE" in any combination of upper and lower case letters as the keyword token CREATE.
H41578: SQLite shall recognize the 5-character sequence "CROSS" in any combination of upper and lower case letters as the keyword token JOIN_KW.
H41581: SQLite shall recognize the 12-character sequence "CURRENT_DATE" in any combination of upper and lower case letters as the keyword token CTIME_KW.
H41584: SQLite shall recognize the 12-character sequence "CURRENT_TIME" in any combination of upper and lower case letters as the keyword token CTIME_KW.
H41587: SQLite shall recognize the 17-character sequence "CURRENT_TIMESTAMP" in any combination of upper and lower case letters as the keyword token CTIME_KW.
H41590: SQLite shall recognize the 8-character sequence "DATABASE" in any combination of upper and lower case letters as the keyword token DATABASE.
H41593: SQLite shall recognize the 7-character sequence "DEFAULT" in any combination of upper and lower case letters as the keyword token DEFAULT.
H41596: SQLite shall recognize the 8-character sequence "DEFERRED" in any combination of upper and lower case letters as the keyword token DEFERRED.
H41599: SQLite shall recognize the 10-character sequence "DEFERRABLE" in any combination of upper and lower case letters as the keyword token DEFERRABLE.
H41602: SQLite shall recognize the 6-character sequence "DELETE" in any combination of upper and lower case letters as the keyword token DELETE.
H41605: SQLite shall recognize the 4-character sequence "DESC" in any combination of upper and lower case letters as the keyword token DESC.
H41608: SQLite shall recognize the 6-character sequence "DETACH" in any combination of upper and lower case letters as the keyword token DETACH.
H41611: SQLite shall recognize the 8-character sequence "DISTINCT" in any combination of upper and lower case letters as the keyword token DISTINCT.
H41614: SQLite shall recognize the 4-character sequence "DROP" in any combination of upper and lower case letters as the keyword token DROP.
H41617: SQLite shall recognize the 3-character sequence "END" in any combination of upper and lower case letters as the keyword token END.
H41620: SQLite shall recognize the 4-character sequence "EACH" in any combination of upper and lower case letters as the keyword token EACH.
H41623: SQLite shall recognize the 4-character sequence "ELSE" in any combination of upper and lower case letters as the keyword token ELSE.
H41626: SQLite shall recognize the 6-character sequence "ESCAPE" in any combination of upper and lower case letters as the keyword token ESCAPE.
H41629: SQLite shall recognize the 6-character sequence "EXCEPT" in any combination of upper and lower case letters as the keyword token EXCEPT.
H41632: SQLite shall recognize the 9-character sequence "EXCLUSIVE" in any combination of upper and lower case letters as the keyword token EXCLUSIVE.
H41635: SQLite shall recognize the 6-character sequence "EXISTS" in any combination of upper and lower case letters as the keyword token EXISTS.
H41638: SQLite shall recognize the 7-character sequence "EXPLAIN" in any combination of upper and lower case letters as the keyword token EXPLAIN.
H41641: SQLite shall recognize the 4-character sequence "FAIL" in any combination of upper and lower case letters as the keyword token FAIL.
H41644: SQLite shall recognize the 3-character sequence "FOR" in any combination of upper and lower case letters as the keyword token FOR.
H41647: SQLite shall recognize the 7-character sequence "FOREIGN" in any combination of upper and lower case letters as the keyword token FOREIGN.
H41650: SQLite shall recognize the 4-character sequence "FROM" in any combination of upper and lower case letters as the keyword token FROM.
H41653: SQLite shall recognize the 4-character sequence "FULL" in any combination of upper and lower case letters as the keyword token JOIN_KW.
H41656: SQLite shall recognize the 4-character sequence "GLOB" in any combination of upper and lower case letters as the keyword token LIKE_KW.
H41659: SQLite shall recognize the 5-character sequence "GROUP" in any combination of upper and lower case letters as the keyword token GROUP.
H41662: SQLite shall recognize the 6-character sequence "HAVING" in any combination of upper and lower case letters as the keyword token HAVING.
H41665: SQLite shall recognize the 2-character sequence "IF" in any combination of upper and lower case letters as the keyword token IF.
H41668: SQLite shall recognize the 6-character sequence "IGNORE" in any combination of upper and lower case letters as the keyword token IGNORE.
H41671: SQLite shall recognize the 9-character sequence "IMMEDIATE" in any combination of upper and lower case letters as the keyword token IMMEDIATE.
H41674: SQLite shall recognize the 2-character sequence "IN" in any combination of upper and lower case letters as the keyword token IN.
H41677: SQLite shall recognize the 5-character sequence "INDEX" in any combination of upper and lower case letters as the keyword token INDEX.
H41680: SQLite shall recognize the 9-character sequence "INITIALLY" in any combination of upper and lower case letters as the keyword token INITIALLY.
H41683: SQLite shall recognize the 5-character sequence "INNER" in any combination of upper and lower case letters as the keyword token JOIN_KW.
H41686: SQLite shall recognize the 6-character sequence "INSERT" in any combination of upper and lower case letters as the keyword token INSERT.
H41689: SQLite shall recognize the 7-character sequence "INSTEAD" in any combination of upper and lower case letters as the keyword token INSTEAD.
H41692: SQLite shall recognize the 9-character sequence "INTERSECT" in any combination of upper and lower case letters as the keyword token INTERSECT.
H41695: SQLite shall recognize the 4-character sequence "INTO" in any combination of upper and lower case letters as the keyword token INTO.
H41698: SQLite shall recognize the 2-character sequence "IS" in any combination of upper and lower case letters as the keyword token IS.
H41701: SQLite shall recognize the 6-character sequence "ISNULL" in any combination of upper and lower case letters as the keyword token ISNULL.
H41704: SQLite shall recognize the 4-character sequence "JOIN" in any combination of upper and lower case letters as the keyword token JOIN.
H41707: SQLite shall recognize the 3-character sequence "KEY" in any combination of upper and lower case letters as the keyword token KEY.
H41710: SQLite shall recognize the 4-character sequence "LEFT" in any combination of upper and lower case letters as the keyword token JOIN_KW.
H41713: SQLite shall recognize the 4-character sequence "LIKE" in any combination of upper and lower case letters as the keyword token LIKE_KW.
H41716: SQLite shall recognize the 5-character sequence "LIMIT" in any combination of upper and lower case letters as the keyword token LIMIT.
H41719: SQLite shall recognize the 5-character sequence "MATCH" in any combination of upper and lower case letters as the keyword token MATCH.
H41722: SQLite shall recognize the 7-character sequence "NATURAL" in any combination of upper and lower case letters as the keyword token JOIN_KW.
H41725: SQLite shall recognize the 3-character sequence "NOT" in any combination of upper and lower case letters as the keyword token NOT.
H41728: SQLite shall recognize the 7-character sequence "NOTNULL" in any combination of upper and lower case letters as the keyword token NOTNULL.
H41731: SQLite shall recognize the 4-character sequence "NULL" in any combination of upper and lower case letters as the keyword token NULL.
H41734: SQLite shall recognize the 2-character sequence "OF" in any combination of upper and lower case letters as the keyword token OF.
H41737: SQLite shall recognize the 6-character sequence "OFFSET" in any combination of upper and lower case letters as the keyword token OFFSET.
H41740: SQLite shall recognize the 2-character sequence "ON" in any combination of upper and lower case letters as the keyword token ON.
H41743: SQLite shall recognize the 2-character sequence "OR" in any combination of upper and lower case letters as the keyword token OR.
H41746: SQLite shall recognize the 5-character sequence "ORDER" in any combination of upper and lower case letters as the keyword token ORDER.
H41749: SQLite shall recognize the 5-character sequence "OUTER" in any combination of upper and lower case letters as the keyword token JOIN_KW.
H41752: SQLite shall recognize the 4-character sequence "PLAN" in any combination of upper and lower case letters as the keyword token PLAN.
H41755: SQLite shall recognize the 6-character sequence "PRAGMA" in any combination of upper and lower case letters as the keyword token PRAGMA.
H41758: SQLite shall recognize the 7-character sequence "PRIMARY" in any combination of upper and lower case letters as the keyword token PRIMARY.
H41761: SQLite shall recognize the 5-character sequence "QUERY" in any combination of upper and lower case letters as the keyword token QUERY.
H41764: SQLite shall recognize the 5-character sequence "RAISE" in any combination of upper and lower case letters as the keyword token RAISE.
H41767: SQLite shall recognize the 10-character sequence "REFERENCES" in any combination of upper and lower case letters as the keyword token REFERENCES.
H41770: SQLite shall recognize the 6-character sequence "REGEXP" in any combination of upper and lower case letters as the keyword token LIKE_KW.
H41773: SQLite shall recognize the 7-character sequence "REINDEX" in any combination of upper and lower case letters as the keyword token REINDEX.
H41776: SQLite shall recognize the 6-character sequence "RENAME" in any combination of upper and lower case letters as the keyword token RENAME.
H41779: SQLite shall recognize the 7-character sequence "REPLACE" in any combination of upper and lower case letters as the keyword token REPLACE.
H41782: SQLite shall recognize the 8-character sequence "RESTRICT" in any combination of upper and lower case letters as the keyword token RESTRICT.
H41785: SQLite shall recognize the 5-character sequence "RIGHT" in any combination of upper and lower case letters as the keyword token JOIN_KW.
H41788: SQLite shall recognize the 8-character sequence "ROLLBACK" in any combination of upper and lower case letters as the keyword token ROLLBACK.
H41791: SQLite shall recognize the 3-character sequence "ROW" in any combination of upper and lower case letters as the keyword token ROW.
H41794: SQLite shall recognize the 6-character sequence "SELECT" in any combination of upper and lower case letters as the keyword token SELECT.
H41797: SQLite shall recognize the 3-character sequence "SET" in any combination of upper and lower case letters as the keyword token SET.
H41800: SQLite shall recognize the 5-character sequence "TABLE" in any combination of upper and lower case letters as the keyword token TABLE.
H41803: SQLite shall recognize the 4-character sequence "TEMP" in any combination of upper and lower case letters as the keyword token TEMP.
H41806: SQLite shall recognize the 9-character sequence "TEMPORARY" in any combination of upper and lower case letters as the keyword token TEMP.
H41809: SQLite shall recognize the 4-character sequence "THEN" in any combination of upper and lower case letters as the keyword token THEN.
H41812: SQLite shall recognize the 2-character sequence "TO" in any combination of upper and lower case letters as the keyword token TO.
H41815: SQLite shall recognize the 11-character sequence "TRANSACTION" in any combination of upper and lower case letters as the keyword token TRANSACTION.
H41818: SQLite shall recognize the 7-character sequence "TRIGGER" in any combination of upper and lower case letters as the keyword token TRIGGER.
H41821: SQLite shall recognize the 5-character sequence "UNION" in any combination of upper and lower case letters as the keyword token UNION.
H41824: SQLite shall recognize the 6-character sequence "UNIQUE" in any combination of upper and lower case letters as the keyword token UNIQUE.
H41827: SQLite shall recognize the 6-character sequence "UPDATE" in any combination of upper and lower case letters as the keyword token UPDATE.
H41830: SQLite shall recognize the 5-character sequence "USING" in any combination of upper and lower case letters as the keyword token USING.
H41833: SQLite shall recognize the 6-character sequence "VACUUM" in any combination of upper and lower case letters as the keyword token VACUUM.
H41836: SQLite shall recognize the 6-character sequence "VALUES" in any combination of upper and lower case letters as the keyword token VALUES.
H41839: SQLite shall recognize the 4-character sequence "VIEW" in any combination of upper and lower case letters as the keyword token VIEW.
H41842: SQLite shall recognize the 7-character sequence "VIRTUAL" in any combination of upper and lower case letters as the keyword token VIRTUAL.
H41845: SQLite shall recognize the 4-character sequence "WHEN" in any combination of upper and lower case letters as the keyword token WHEN.
H41848: SQLite shall recognize the 5-character sequence "WHERE" in any combination of upper and lower case letters as the keyword token WHERE.
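
Keyword recognition can be sketched as a case-insensitive table lookup applied to an identifier-shaped word. The table below is only a small subset of the list above, chosen to show that several keywords share a token and that some token names differ from the keyword text; the function name and the "ID" fallback are illustrative assumptions.

```python
# Subset of the keyword list above: keyword text -> token name.
KEYWORDS = {
    "SELECT": "SELECT", "FROM": "FROM", "WHERE": "WHERE",
    "AUTOINCREMENT": "AUTOINCR", "COLUMN": "COLUMNKW",
    "CROSS": "JOIN_KW", "LEFT": "JOIN_KW", "NATURAL": "JOIN_KW",
    "LIKE": "LIKE_KW", "GLOB": "LIKE_KW", "REGEXP": "LIKE_KW",
    "CURRENT_DATE": "CTIME_KW", "TEMP": "TEMP", "TEMPORARY": "TEMP",
}

def classify_word(word):
    """Return the keyword token name for word, or "ID" if it is not one."""
    # Only ASCII letters are case-folded, so a non-ASCII look-alike such
    # as a dotless "i" can never be mistaken for a keyword letter.
    if not word.isascii():
        return "ID"
    return KEYWORDS.get(word.upper(), "ID")
```

The whole word must match: "selects" is an ID, not SELECT followed by "s", which follows from the longest-token rule.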


This page last modified 2008/10/28 21:57:06 UTC