Documentation Source Text

Check-in [b8865a0011]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Updates to the FTS3 documentation to include additional discussion and links to FTS4 concepts.
Downloads: Tarball | ZIP archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: b8865a0011a76e624c57d2e61dcd9bbc9abcacbf
User & Date: drh 2010-11-26 21:06:49.000
Context
2010-11-26
22:46
Update the download page to use a new, consistent file naming scheme. Download files end with a version number of the form WXXYYZZ where W is 3, XX is the major version, YY is the minor version, and ZZ is the patch number. (check-in: 8b77a71a8c user: drh tags: trunk)
21:06
Updates to the FTS3 documentation to include additional discussion and links to FTS4 concepts. (check-in: b8865a0011 user: drh tags: trunk)
18:23
Update documentation for fts3 matchinfo function. (check-in: 4deb70dd76 user: dan tags: trunk)
Changes
Unified Diff Ignore Whitespace Patch
Changes to pages/changes.in.
50
51
52
53
54
55
56


57
58
59
60
61
62
63
<li> [sqlite3_vfs | VFSes] that do not support shared memory are allowed
     to access [WAL] databases if [PRAGMA locking_mode] is set to EXCLUSIVE.
<li> Enhancements to [EXPLAIN QUERY PLAN].
<li> Added the [sqlite3_stmt_readonly()] interface.
<li> Added [PRAGMA checkpoint_fullfsync].
<li> Added the [SQLITE_FCNTL_FILE_POINTER] option
     to [sqlite3_file_control()].


}

chng {2010 October 08 (3.7.3)} {
<li> Added the [sqlite3_create_function_v2()] interface that includes a
     destructor callback.
<li> Added support for [custom r-tree queries] using application-supplied
     callback routines to define the boundary of the query region.







>
>







50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
<li> [sqlite3_vfs | VFSes] that do not support shared memory are allowed
     to access [WAL] databases if [PRAGMA locking_mode] is set to EXCLUSIVE.
<li> Enhancements to [EXPLAIN QUERY PLAN].
<li> Added the [sqlite3_stmt_readonly()] interface.
<li> Added [PRAGMA checkpoint_fullfsync].
<li> Added the [SQLITE_FCNTL_FILE_POINTER] option
     to [sqlite3_file_control()].
<li> Added support for [FTS4] and enchancements 
     to the FTS [matchinfo()] function.
}

chng {2010 October 08 (3.7.3)} {
<li> Added the [sqlite3_create_function_v2()] interface that includes a
     destructor callback.
<li> Added support for [custom r-tree queries] using application-supplied
     callback routines to define the boundary of the query region.
Changes to pages/fts3.in.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19


20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43

<tcl>hd_keywords *fts3 FTS3 {full-text search}</tcl>
<title>SQLite FTS3 Extension</title>

<table_of_contents>

<h2 style="margin-left:1.0em" notoc> Overview</h2>

<p>
  FTS3 is an SQLite virtual table module that allows users to perform 
  full-text searches on a set of documents. The most common (and effective) 
  way to describe full-text searches is "what Google, Yahoo and Altavista do
  with documents placed on the World Wide Web". Users input a term, or series 
  of terms, perhaps connected by a binary operator or grouped together into a 
  phrase, and the full-text query system finds the set of documents that best 
  matches those terms considering the operators and groupings the user has 
  specified. This document describes the deployment and usage of FTS3.

<p>


  Portions of the original FTS3 code were contributed to the SQLite project 
  by Scott Hess of <a href="http://www.google.com">Google</a>. It is now 
  developed and maintained as part of SQLite.

<h1>Introduction to FTS3</h1>

<p>
  The FTS3 extension module allows users to create special tables with a 
  built-in full-text index (hereafter "FTS3 tables"). The full-text index
  allows the user to efficiently query the database for all rows that contain
  one or more instances specified word (hereafter a "token", even if the table
  contains many large documents.

<p>
  For example, if each of the 517430 documents in the 
  "<a href="http://www.cs.cmu.edu/~enron/">Enron E-Mail Dataset</a>"
  is inserted into both the FTS3 table and the ordinary SQLite table
  created using the following SQL script:

<codeblock>
  CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);     /* FTS3 table */
  CREATE TABLE enrondata2(content TEXT);                        /* Ordinary table */
</codeblock>



|






|






|


>
>




|


|
|

|





|







1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45

<tcl>hd_keywords *fts3 FTS3 {full-text search}</tcl>
<title>SQLite FTS3 and FTS4 Extensions</title>

<table_of_contents>

<h2 style="margin-left:1.0em" notoc> Overview</h2>

<p>
  FTS3 and FTS4 are an SQLite virtual table modules that allows users to perform 
  full-text searches on a set of documents. The most common (and effective) 
  way to describe full-text searches is "what Google, Yahoo and Altavista do
  with documents placed on the World Wide Web". Users input a term, or series 
  of terms, perhaps connected by a binary operator or grouped together into a 
  phrase, and the full-text query system finds the set of documents that best 
  matches those terms considering the operators and groupings the user has 
  specified. This article describes the deployment and usage of FTS3 and FTS4.

<p>
  FTS1 and FTS2 are obsolete full-text search modules for SQLite.  There are known
  issues with this older modules and their use should be avoided.
  Portions of the original FTS3 code were contributed to the SQLite project 
  by Scott Hess of <a href="http://www.google.com">Google</a>. It is now 
  developed and maintained as part of SQLite.

<h1>Introduction to FTS3 and FTS4</h1>

<p>
  The FTS3 and FTS4 extension modules allows users to create special tables with a 
  built-in full-text index (hereafter "FTS tables"). The full-text index
  allows the user to efficiently query the database for all rows that contain
  one or more words (hereafter "tokens"), even if the table
  contains many large documents.

<p>
  For example, if each of the 517430 documents in the 
  "<a href="http://www.cs.cmu.edu/~enron/">Enron E-Mail Dataset</a>"
  is inserted into both the FTS table and the ordinary SQLite table
  created using the following SQL script:

<codeblock>
  CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);     /* FTS3 table */
  CREATE TABLE enrondata2(content TEXT);                        /* Ordinary table */
</codeblock>

60
61
62
63
64
65
66



























67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
  selects only those rows that contain "linux" as a discrete token. Both 
  searches are case-insensitive. The FTS3 table consumes around 2006 MB on
  disk compared to just 1453 MB for the ordinary table. Using the same
  hardware configuration used to perform the SELECT queries above, the FTS3
  table took just under 31 minutes to populate, versus 25 for the ordinary
  table.




























<h2>Creating and Destroying FTS3 Tables</h2>

<p>
  Like other virtual table types, new FTS3 tables are created using a 
  [CREATE VIRTUAL TABLE] statement. The module name, which follows
  the USING keyword, is "fts3". The virtual table module arguments may
  be left empty, in which case an FTS3 table with a single user-defined 
  column named "content" is created. Alternatively, the module arguments
  may be passed a list of comma separated column names. 

<p>
  If column names are explicitly provided for the FTS3 table as part of
  the CREATE VIRTUAL TABLE statement, then a datatype name may be optionally 
  specified for each column. However, this is pure syntactic sugar, the
  supplied typenames are not used by FTS3 or the SQLite core for any
  purpose. The same applies to any constraints specified along with an
  FTS3 column name - they are parsed but not used or recorded by the system
  in any way.

<codeblock>
  <i>-- Create an FTS3 table named "data" with one column - "content":</i>
  CREATE VIRTUAL TABLE data USING fts3();

  <i>-- Create an FTS3 table named "pages" with three columns:</i>
  CREATE VIRTUAL TABLE pages USING fts3(title, keywords, body);

  <i>-- Create an FTS3 table named "mail" with two columns. Datatypes
  -- and column constraints are specified along with each column. These
  -- are completely ignored by FTS3 and SQLite. </i>
  CREATE VIRTUAL TABLE mail USING fts3(
    subject VARCHAR(256) NOT NULL,
    body TEXT CHECK(length(body)&lt;10240)
  );
</codeblock>

<p>
  As well as a list of columns, the module arguments passed to a CREATE
  VIRTUAL TABLE statement used to create an FTS3 table may be used to specify
  a [tokenizer]. This is done by specifying a string of the form
  "tokenize=&lt;tokenizer name&gt; &lt;tokenizer args&gt;" in place of a column
  name, where &lt;tokenizer name&gt; is the name of the tokenizer to use and
  &lt;tokenizer args&gt; is an optional list of whitespace separated qualifiers
  to pass to the tokenizer implementation. A tokenizer specification may be
  placed anywhere in the column list, but at most one tokenizer declaration is
  allowed for each CREATE VIRTUAL TABLE statement.  The second and subsequent
  tokenizer declaration are interpreted as column names. 
  [tokenizer|See below] for a detailed description of using (and, if
  necessary, implementing) a tokenizer.

<codeblock>
  <i>-- Create an FTS3 table named "papers" with two columns that uses</i>
  <i>-- the tokenizer "porter".</i>
  CREATE VIRTUAL TABLE papers USING fts3(author, document, tokenize=porter);

  <i>-- Create an FTS3 table with a single column - "content" - that uses</i>
  <i>-- the "simple" tokenizer.</i>
  CREATE VIRTUAL TABLE data USING fts3(tokenize=simple);

  <i>-- Create an FTS3 table with two columns that uses the "icu" tokenizer.</i>
  <i>-- The qualifier "en_AU" is passed to the tokenizer implementation</i>
  CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU);
</codeblock>

<p>
  FTS3 tables may be dropped from the database using an ordinary [DROP TABLE]
  statement. For example:

<codeblock>
  <i>-- Create, then immediately drop, an FTS3 table.</i>
  CREATE VIRTUAL TABLE data USING fts3();
  DROP TABLE data;
</codeblock>

<h2>Populating FTS3 Tables</h2>

  <p>
    FTS3 tables are populated using [INSERT], [UPDATE] and [DELETE]
    statements in the same way as ordinary SQLite tables are.

  <p>
    As well as the columns named by the user (or the "content" column if no
    module arguments where specified as part of the [CREATE VIRTUAL TABLE] 
    statement), each FTS3 table has a "rowid" column. The rowid of an FTS3
    table behaves in the same way as the rowid column of an ordinary SQLite 
    table, except that the values stored in the rowid column of an FTS3 table 
    remain unchanged if the database is rebuilt using the [VACUUM] command. 
    For FTS3 tables, "docid" is allowed as an alias along with the usual "rowid",
    "oid" and "_oid_" identifiers. Attempting to insert or update a row with a 
    docid value that already exists in the table is an error, just as it would 
    be with an ordinary SQLite table.

  <p>
    There is one other subtle difference between "docid" and the normal SQLite
    aliases for the rowid column. Normally, if an INSERT or UPDATE statement 
    assigns discrete values to two or more aliases of the rowid column, SQLite 
    writes the rightmost of such values specified in the INSERT or UPDATE
    statement to the database. However, assigning a non-NULL value to both
    the "docid" and one or more of the SQLite rowid aliases when inserting or
    updating an FTS3 table is considered an error. See below for an example.

<codeblock>
  <i>-- Create an FTS3 table</i>
  CREATE VIRTUAL TABLE pages USING fts3(title, body);

  <i>-- Insert a row with a specific docid value.</i>
  INSERT INTO pages(docid, title, body) VALUES(53, 'Home Page', 'SQLite is a software...');

  <i>-- Insert a row and allow FTS3 to assign a docid value using the same algorithm as</i>
  <i>-- SQLite uses for ordinary tables. In this case the new docid will be 54,</i>
  <i>-- one greater than the largest docid currently present in the table.</i>
  INSERT INTO pages(title, body) VALUES('Download', 'All SQLite source code...');

  <i>-- Change the title of the row just inserted.</i>
  UPDATE pages SET title = 'Download SQLite' WHERE rowid = 54;

  <i>-- Delete the entire table contents.</i>
  DELETE FROM pages;

  <i>-- The following is an error. It is not possible to assign non-NULL values to both</i>
  <i>-- the rowid and docid columns of an FTS3 table.</i>
  INSERT INTO pages(rowid, docid, title, body) VALUES(1, 2, 'A title', 'A document body');
</codeblock>

  <p>
    To support full-text queries, FTS3 maintains an inverted index that maps
    from each unique term or word that appears in the dataset to the locations
    in which it appears within the table contents. For the curious, a 
    complete description of the [segment btree|data structure] used to store
    this index within the database file is described below. A feature of
    this data structure is that at any time the database may contain not
    one index b-tree, but several different b-trees that are incrementally
    merged as rows are inserted, updated and deleted. This technique improves 
    performance when writing to an FTS3 table, but causes some overhead for
    full-text queries that use the index. Executing an SQL statement of the
    form "INSERT INTO &lt;fts3-table&gt;(&lt;fts3-table&gt;) VALUES('optimize')"
    causes FTS3 to merge all existing index b-trees into a single large
    b-tree containing the entire index. This can be an expensive operation,
    but may speed up future queries. 

  <p>
    For example, to optimize the full-text index for an FTS3 table named
    "docs":

<codeblock>
  <i>-- Optimize the internal structure of FTS3 table "docs".</i>
  INSERT INTO docs(docs) VALUES('optimize');
</codeblock>

  <p>
    The statement above may appear syntactically incorrect to some. Refer to
    the section describing the [simple fts3 queries] for an explanation.

  <p>
    There is another, deprecated, method for invoking the optimize 
    operation using a SELECT statement. New code should use statements
    similar to the INSERT above to optimize FTS3 structures.

<h2 tags="simple fts3 queries">Simple FTS3 Queries</h2>

<p>
  As for all other SQLite tables, virtual or otherwise, data is retrieved
  from FTS3 tables using a [SELECT] statement.

<p>
  FTS3 tables can be queried efficiently using SELECT statements of two
  different forms:

<ul>
  <li><p>
    <b>Query by rowid</b>. If the WHERE clause of the SELECT statement
    contains a sub-clause of the form "rowid = ?", where ? is an SQL expression,
    FTS3 is able to retrieve the requested row directly using the equivalent 
    of an SQLite [INTEGER PRIMARY KEY] index.

  <li><p>
    <b>Full-text query</b>. If the WHERE clause of the SELECT statement contains
    a sub-clause of the form "&lt;column&gt; MATCH ?", FTS3 is able to use 
    the built-in full-text index to restrict the search to those documents 
    that match the full-text query string specified as the right-hand operand
    of the MATCH clause.
</ul>

<p>
  If neither of the two query strategies enumerated above can be used, all
  queries on FTS3 tables are implemented using a linear scan of the entire
  table. If the table contains large amounts of data, this may be an 
  impractically approach (the first example on this page shows that a linear
  scan of 1.5 GB of data takes around 30 seconds using a modern PC).

<codeblock>
  <i>-- The examples in this block assume the following FTS3 table:</i>
  CREATE VIRTUAL TABLE mail USING fts3(subject, body);

  SELECT * FROM mail WHERE rowid = 15;                <i>-- Fast. Rowid lookup.</i>
  SELECT * FROM mail WHERE body MATCH 'sqlite';       <i>-- Fast. Full-text query.</i>
  SELECT * FROM mail WHERE mail MATCH 'search';       <i>-- Fast. Full-text query.</i>
  SELECT * FROM mail WHERE rowid BETWEEN 15 AND 20;   <i>-- Slow. Linear scan.</i>
  SELECT * FROM mail WHERE subject = 'database';      <i>-- Slow. Linear scan.</i>







>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
|


|

|
|




|


|

|



|


|
|

|

|








|












|



|

|

|





|




|



|


|





|

|

|











|


|
|




|











|




|







|

|
|




|



|





|




|

|



|


|






|




|







|





|







62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
  selects only those rows that contain "linux" as a discrete token. Both 
  searches are case-insensitive. The FTS3 table consumes around 2006 MB on
  disk compared to just 1453 MB for the ordinary table. Using the same
  hardware configuration used to perform the SELECT queries above, the FTS3
  table took just under 31 minutes to populate, versus 25 for the ordinary
  table.

<tcl>hd_fragment fts4 FTS4</tcl>
<h2>Differences between FTS3 and FTS4</h2>

<p>
  FTS3 and FTS4 are nearly identical.  They share most of their code in common, and their
  interfaces are the same.  The only difference is that FTS4 stores some additional information
  about the document collection in two of new [FTS shadow tables].  This additional information allows
  FTS4 to use certain query performance optimizations that FTS3 cannot use.  And the added information
  permits some additional useful output options in the [matchinfo()] function.

<p>
  FTS4 is an enhancement to FTS3.  FTS3 has been available since SQLite [version 3.5.0] in
  2007-09-04.  The enhancements for FTS4 were added with SQLite [version 3.7.4] on 2010-12-08.

<p>
  Which module, FTS3 or FTS4, should you use in your application?
  FTS4 can sometimes is significantly faster than FTS3, even orders
  of magnitude faster, depending on the query, though in the common case the performance of the
  two modules is similar.  FTS4 also offers the enhanced [matchinfo()] outputs which can be
  useful in ranking the results of a [FTS MATCH|MATCH] operation.  On the other hand, because FTS4
  stores additional information, FTS4 requires a little more disk space than FTS3, though only a 
  percent of two in most cases.

<p>
  For newer applications, FTS4 is recommended; though if minimal disk usage or compatibility
  with older versions of SQLite are important, then FTS3 will usually serve just as well.  

<h2>Creating and Destroying FTS Tables</h2>

<p>
  Like other virtual table types, new FTS tables are created using a 
  [CREATE VIRTUAL TABLE] statement. The module name, which follows
  the USING keyword, is either "fts3" or "fts4". The virtual table module arguments may
  be left empty, in which case an FTS table with a single user-defined 
  column named "content" is created. Alternatively, the module arguments
  may be passed a list of comma separated column names. 

<p>
  If column names are explicitly provided for the FTS table as part of
  the CREATE VIRTUAL TABLE statement, then a datatype name may be optionally 
  specified for each column. However, this is pure syntactic sugar, the
  supplied typenames are not used by FTS or the SQLite core for any
  purpose. The same applies to any constraints specified along with an
  FTS column name - they are parsed but not used or recorded by the system
  in any way.

<codeblock>
  <i>-- Create an FTS table named "data" with one column - "content":</i>
  CREATE VIRTUAL TABLE data USING fts3();

  <i>-- Create an FTS table named "pages" with three columns:</i>
  CREATE VIRTUAL TABLE pages USING fts4(title, keywords, body);

  <i>-- Create an FTS table named "mail" with two columns. Datatypes
  -- and column constraints are specified along with each column. These
  -- are completely ignored by FTS and SQLite. </i>
  CREATE VIRTUAL TABLE mail USING fts3(
    subject VARCHAR(256) NOT NULL,
    body TEXT CHECK(length(body)&lt;10240)
  );
</codeblock>

<p>
  As well as a list of columns, the module arguments passed to a CREATE
  VIRTUAL TABLE statement used to create an FTS table may be used to specify
  a [tokenizer]. This is done by specifying a string of the form
  "tokenize=&lt;tokenizer name&gt; &lt;tokenizer args&gt;" in place of a column
  name, where &lt;tokenizer name&gt; is the name of the tokenizer to use and
  &lt;tokenizer args&gt; is an optional list of whitespace separated qualifiers
  to pass to the tokenizer implementation. A tokenizer specification may be
  placed anywhere in the column list, but at most one tokenizer declaration is
  allowed for each CREATE VIRTUAL TABLE statement.  The second and subsequent
  tokenizer declaration are interpreted as column names. 
  [tokenizer|See below] for a detailed description of using (and, if
  necessary, implementing) a tokenizer.

<codeblock>
  <i>-- Create an FTS table named "papers" with two columns that uses</i>
  <i>-- the tokenizer "porter".</i>
  CREATE VIRTUAL TABLE papers USING fts3(author, document, tokenize=porter);

  <i>-- Create an FTS table with a single column - "content" - that uses</i>
  <i>-- the "simple" tokenizer.</i>
  CREATE VIRTUAL TABLE data USING fts4(tokenize=simple);

  <i>-- Create an FTS table with two columns that uses the "icu" tokenizer.</i>
  <i>-- The qualifier "en_AU" is passed to the tokenizer implementation</i>
  CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU);
</codeblock>

<p>
  FTS tables may be dropped from the database using an ordinary [DROP TABLE]
  statement. For example:

<codeblock>
  <i>-- Create, then immediately drop, an FTS3 table.</i>
  CREATE VIRTUAL TABLE data USING fts4();
  DROP TABLE data;
</codeblock>

<h2>Populating FTS Tables</h2>

  <p>
    FTS tables are populated using [INSERT], [UPDATE] and [DELETE]
    statements in the same way as ordinary SQLite tables are.

  <p>
    As well as the columns named by the user (or the "content" column if no
    module arguments where specified as part of the [CREATE VIRTUAL TABLE] 
    statement), each FTS table has a "rowid" column. The rowid of an FTS
    table behaves in the same way as the rowid column of an ordinary SQLite 
    table, except that the values stored in the rowid column of an FTS table 
    remain unchanged if the database is rebuilt using the [VACUUM] command. 
    For FTS tables, "docid" is allowed as an alias along with the usual "rowid",
    "oid" and "_oid_" identifiers. Attempting to insert or update a row with a 
    docid value that already exists in the table is an error, just as it would 
    be with an ordinary SQLite table.

  <p>
    There is one other subtle difference between "docid" and the normal SQLite
    aliases for the rowid column. Normally, if an INSERT or UPDATE statement 
    assigns discrete values to two or more aliases of the rowid column, SQLite 
    writes the rightmost of such values specified in the INSERT or UPDATE
    statement to the database. However, assigning a non-NULL value to both
    the "docid" and one or more of the SQLite rowid aliases when inserting or
    updating an FTS table is considered an error. See below for an example.

<codeblock>
  <i>-- Create an FTS table</i>
  CREATE VIRTUAL TABLE pages USING fts4(title, body);

  <i>-- Insert a row with a specific docid value.</i>
  INSERT INTO pages(docid, title, body) VALUES(53, 'Home Page', 'SQLite is a software...');

  <i>-- Insert a row and allow FTS to assign a docid value using the same algorithm as</i>
  <i>-- SQLite uses for ordinary tables. In this case the new docid will be 54,</i>
  <i>-- one greater than the largest docid currently present in the table.</i>
  INSERT INTO pages(title, body) VALUES('Download', 'All SQLite source code...');

  <i>-- Change the title of the row just inserted.</i>
  UPDATE pages SET title = 'Download SQLite' WHERE rowid = 54;

  <i>-- Delete the entire table contents.</i>
  DELETE FROM pages;

  <i>-- The following is an error. It is not possible to assign non-NULL values to both</i>
  <i>-- the rowid and docid columns of an FTS table.</i>
  INSERT INTO pages(rowid, docid, title, body) VALUES(1, 2, 'A title', 'A document body');
</codeblock>

  <p>
    To support full-text queries, FTS maintains an inverted index that maps
    from each unique term or word that appears in the dataset to the locations
    in which it appears within the table contents. For the curious, a 
    complete description of the [segment btree|data structure] used to store
    this index within the database file is described below. A feature of
    this data structure is that at any time the database may contain not
    one index b-tree, but several different b-trees that are incrementally
    merged as rows are inserted, updated and deleted. This technique improves 
    performance when writing to an FTS table, but causes some overhead for
    full-text queries that use the index. Executing an SQL statement of the
    form "INSERT INTO &lt;fts-table&gt;(&lt;fts-table&gt;) VALUES('optimize')"
    causes FTS to merge all existing index b-trees into a single large
    b-tree containing the entire index. This can be an expensive operation,
    but may speed up future queries. 

  <p>
    For example, to optimize the full-text index for an FTS table named
    "docs":

<codeblock>
  <i>-- Optimize the internal structure of FTS table "docs".</i>
  INSERT INTO docs(docs) VALUES('optimize');
</codeblock>

  <p>
    The statement above may appear syntactically incorrect to some. Refer to
    the section describing the [simple fts queries] for an explanation.

  <p>
    There is another, deprecated, method for invoking the optimize 
    operation using a SELECT statement. New code should use statements
    similar to the INSERT above to optimize FTS structures.

<h2 tags="simple fts queries">Simple FTS Queries</h2>

<p>
  As for all other SQLite tables, virtual or otherwise, data is retrieved
  from FTS tables using a [SELECT] statement.

<p>
  FTS tables can be queried efficiently using SELECT statements of two
  different forms:

<ul>
  <li><p>
    <b>Query by rowid</b>. If the WHERE clause of the SELECT statement
    contains a sub-clause of the form "rowid = ?", where ? is an SQL expression,
    FTS is able to retrieve the requested row directly using the equivalent 
    of an SQLite [INTEGER PRIMARY KEY] index.

  <li><p>
    <b>Full-text query</b>. If the WHERE clause of the SELECT statement contains
    a sub-clause of the form "&lt;column&gt; MATCH ?", FTS is able to use 
    the built-in full-text index to restrict the search to those documents 
    that match the full-text query string specified as the right-hand operand
    of the MATCH clause.
</ul>

<p>
  If neither of the two query strategies enumerated above can be used, all
  queries on FTS tables are implemented using a linear scan of the entire
  table. If the table contains large amounts of data, this may be an 
  impractically approach (the first example on this page shows that a linear
  scan of 1.5 GB of data takes around 30 seconds using a modern PC).

<codeblock>
  <i>-- The examples in this block assume the following FTS table:</i>
  CREATE VIRTUAL TABLE mail USING fts3(subject, body);

  SELECT * FROM mail WHERE rowid = 15;                <i>-- Fast. Rowid lookup.</i>
  SELECT * FROM mail WHERE body MATCH 'sqlite';       <i>-- Fast. Full-text query.</i>
  SELECT * FROM mail WHERE mail MATCH 'search';       <i>-- Fast. Full-text query.</i>
  SELECT * FROM mail WHERE rowid BETWEEN 15 AND 20;   <i>-- Slow. Linear scan.</i>
  SELECT * FROM mail WHERE subject = 'database';      <i>-- Slow. Linear scan.</i>
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
  instances of the specified word ("sqlite", "search" or "database", depending 
  on which example you look at). Specifying a single term as the right-hand
  operand of the MATCH operator results in the simplest (and most common) type 
  of full-text query possible. However more complicated queries are possible,
  including phrase searches, term-prefix searches and searches for documents 
  containing combinations of terms occuring within a defined proximity of each
  other. The various ways in which the full-text index may be queried are
  [FTS3 MATCH|described below].

<p>
  Normally, full-text queries are case-insensitive. However, this is
  is dependent on the specific [tokenizer] used by the FTS3 table
  being queried. Refer to the section on [tokenizer|tokenizers] for details.

<p>
  The paragraph above notes that a MATCH operator with a simple term as the
  right-hand operand evaluates to true for all documents that contain the
  specified term. In this context, the "document" may refer to either the 
  data stored in a single column of a row of an FTS3 table, or to the contents
  of all columns in a single row, depending on the identifier used as the
  left-hand operand to the MATCH operator. If the identifier specified as
  the left-hand operand of the MATCH operator is an FTS3 table column name,
  then the document that the search term must be contained in is the value
  stored in the specified column. However, if the identifier is the name
  of the FTS3 <i>table</i> itself, then the MATCH operator evaluates to true
  for each row of the FTS3 table for which any column contains the search 
  term. The following example demonstrates this:

<codeblock>
  <i>-- Example schema</i>
  CREATE VIRTUAL TABLE mail USING fts3(subject, body);

  <i>-- Example table population</i>







|



|






|


|


|
|







301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
  instances of the specified word ("sqlite", "search" or "database", depending 
  on which example you look at). Specifying a single term as the right-hand
  operand of the MATCH operator results in the simplest (and most common) type 
  of full-text query possible. However more complicated queries are possible,
  including phrase searches, term-prefix searches and searches for documents 
  containing combinations of terms occuring within a defined proximity of each
  other. The various ways in which the full-text index may be queried are
  [FTS MATCH|described below].

<p>
  Normally, full-text queries are case-insensitive. However, this is
  is dependent on the specific [tokenizer] used by the FTS table
  being queried. Refer to the section on [tokenizer|tokenizers] for details.

<p>
  The paragraph above notes that a MATCH operator with a simple term as the
  right-hand operand evaluates to true for all documents that contain the
  specified term. In this context, the "document" may refer to either the 
  data stored in a single column of a row of an FTS table, or to the contents
  of all columns in a single row, depending on the identifier used as the
  left-hand operand to the MATCH operator. If the identifier specified as
  the left-hand operand of the MATCH operator is an FTS table column name,
  then the document that the search term must be contained in is the value
  stored in the specified column. However, if the identifier is the name
  of the FTS <i>table</i> itself, then the MATCH operator evaluates to true
  for each row of the FTS table for which any column contains the search 
  term. The following example demonstrates this:

<codeblock>
  <i>-- Example schema</i>
  CREATE VIRTUAL TABLE mail USING fts3(subject, body);

  <i>-- Example table population</i>
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380

381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405





406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
  SELECT * FROM mail WHERE mail    MATCH 'software';    <i>-- Selects rows 1, 2 and 3</i>
  SELECT * FROM mail WHERE mail    MATCH 'slow';        <i>-- Selects rows 1 and 3</i>
</codeblock>
  
<p>
  At first glance, the final two full-text queries in the example above seem
  to be syntactically incorrect, as there is a table name ("mail") used as
  an SQL expression. The reason this is acceptable is that each FTS3 table
  actually has a [sqlite3_declare_vtab|HIDDEN] column with the same name
  as the table itself (in this case, "mail"). The value stored in this
  column is not meaningful to the application, but can be used as the 
  left-hand operand to a MATCH operator. This special column may also be
  passed as an argument to the [snippet()|FTS3 auxiliary functions].

<p>
  The following example illustrates the above. The expressions "docs", 
  "docs.docs" and "main.docs.docs" all refer to column "docs". However, the 
  expression "main.docs" does not refer to any column. It could be used to 
  refer to a table, but a table name is not allowed in the context in which
  it is used below.

<codeblock>
  <i>-- Example schema</i>
  CREATE VIRTUAL TABLE docs USING fts3(content);

  <i>-- Example queries</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite';              <i>-- OK.</i>
  SELECT * FROM docs WHERE docs.docs MATCH 'sqlite';         <i>-- OK.</i>
  SELECT * FROM docs WHERE main.docs.docs MATCH 'sqlite';    <i>-- OK.</i>
  SELECT * FROM docs WHERE main.docs MATCH 'sqlite';         <i>-- Error.</i>
</codeblock>
 
<h2>Summary</h2>

<p>
  From the users point of view, FTS3 tables are similar to ordinary SQLite
  tables in many ways. Data may be added to, modified within and removed 
  from FTS3 tables using the INSERT, UPDATE and DELETE commands just as 
  it may be with ordinary tables. Similarly, the SELECT command may be used 
  to query data. The following list summarizes the differences between FTS3
  and ordinary tables:

<ol>
  <li><p> 
    As with all virtual table types, it is not possible to create indices or
    triggers attached to FTS3 tables. Nor is it possible to use the ALTER TABLE
    command to add extra columns to FTS3 tables (although it is possible to use
    ALTER TABLE to rename an FTS3 table).

  <li><p> 
    Data-types specified as part of the "CREATE VIRTUAL TABLE" statement
    used to create an FTS3 table are ignored completely. Instead of the
    normal rules for applying type [affinity] to inserted values, all
    values inserted into FTS3 table columns (except the special rowid
    column) are converted to type TEXT before being stored.

  <li><p> 
    FTS3 tables permit the special alias "docid" to be used to refer to the
    rowid column supported by all [virtual tables].

  <li><p> 
    The [FTS3 MATCH] operator is supported for queries based on the built-in
    full-text index. 

  <li><p> 
    The FTS3 auxiliary functions, [snippet|snippet() and offsets()], are 
    available to support full-text queries.

  <li><p> 

    Each FTS3 table has a [sqlite3_declare_vtab()|HIDDEN column] with the 
    same name as the table itself. The value contained in each row for the
    special column is only useful when used on the left-hand side of a 
    [FTS3 MATCH|MATCH] operator, or when specified as an argument to one 
    of the [snippet|FTS3 auxiliary functions].
    

</ol>


<h1 tags="compile fts3">Compiling and Enabling FTS3</h1>

<p>
  Although FTS3 is distributed as part of the SQLite source code, it is not
  enabled by default. To build SQLite with FTS3 functionality enabled, define
  the preprocessor macro [SQLITE_ENABLE_FTS3] when compiling. New applications
  should also define the [SQLITE_ENABLE_FTS3_PARENTHESIS] macro to enable the
  [enhanced query syntax] (see below). Usually, this is done by adding the 
  following two switches to the compiler command line:

<codeblock>
  -DSQLITE_ENABLE_FTS3
  -DSQLITE_ENABLE_FTS3_PARENTHESIS
</codeblock>






<p>
  If using the amalgamation autoconf based build system, setting the CPPFLAGS
  environment variable while running the 'configure' script is an easy
  way to set these macros. For example, the following command:

<codeblock>
  CPPFLAGS="-DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS" ./configure &lt;configure options&gt;
</codeblock>

<p>
  where <i>&lt;configure options&gt;</i> are those options normally passed to
  the configure script, if any.

<p>
  Because FTS3 is a virtual table, it is incompatible with the
  [SQLITE_OMIT_VIRTUALTABLE] option.

<p>
  If an SQLite build does not include FTS3, then any attempt to prepare an
  SQL statement to create an FTS3 table or to drop or access an existing 
  FTS3 table in any way will fail. The error message returned will be similar 
  to "no such module: fts3".

<p>
  If the C version of the <a href=http://site.icu-project.org/>ICU library</a>
  is available, then FTS3 may also be compiled with the SQLITE_ENABLE_ICU
  pre-processor macro defined. Compiling with this macro enables an FTS3
  [tokenizer] that uses the ICU library to split a document into terms
  (words) using the conventions for a specified language and locale.

<codeblock>
  -DSQLITE_ENABLE_ICU
</codeblock>
  

<h1 tags="FTS3 MATCH">Full-text Index Queries</h1>

<p>
  The most useful thing about FTS3 tables is the queries that may be 
  performed using the built-in full-text index. Full-text queries are 
  performed by specifying a clause of the form 
  "&lt;column&gt; MATCH &lt;full-text query expression&gt;" to the WHERE 
  clause of a SELECT statement that reads data from an FTS3 table. 
  [simple fts3 queries|Simple FTS3 queries] that return all documents that 
  contain a given term are described above. In that discussion the right-hand
  operand of the MATCH operator was assumed to be a string consisting of a
  single term. This section describes the more complex query types supported 
  by FTS3 tables, and how they may be utilized by specifying a more
  complex query expression as the right-hand operand of a MATCH operator.

<p>
  FTS3 tables support three basic query types:

<ul>
  <li><p><b>Token or token prefix queries</b>. 
    An FTS3 table may be queried for all documents that contain a specified
    term (the [simple fts3 queries|simple case] described above), or for
    all documents that contain a term with a specified prefix. As we have
    seen, the query expression for a specific term is simply the term itself.
    The query expression used to search for a term prefix is the prefix
    itself with a '*' character appended to it. For example:
</ul>

<codeblock>







|




|










|











|

|

|





|
|
|



|

|



|



|



|


|
>
|

|
|
|





|


|
|










>
>
>
>
>














|
|


|
|
|
|



|
|








|


|



|
|



|



|



|
|







341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
  SELECT * FROM mail WHERE mail    MATCH 'software';    <i>-- Selects rows 1, 2 and 3</i>
  SELECT * FROM mail WHERE mail    MATCH 'slow';        <i>-- Selects rows 1 and 3</i>
</codeblock>
  
<p>
  At first glance, the final two full-text queries in the example above seem
  to be syntactically incorrect, as there is a table name ("mail") used as
  an SQL expression. The reason this is acceptable is that each FTS table
  actually has a [sqlite3_declare_vtab|HIDDEN] column with the same name
  as the table itself (in this case, "mail"). The value stored in this
  column is not meaningful to the application, but can be used as the 
  left-hand operand to a MATCH operator. This special column may also be
  passed as an argument to the [snippet()|FTS auxiliary functions].

<p>
  The following example illustrates the above. The expressions "docs", 
  "docs.docs" and "main.docs.docs" all refer to column "docs". However, the 
  expression "main.docs" does not refer to any column. It could be used to 
  refer to a table, but a table name is not allowed in the context in which
  it is used below.

<codeblock>
  <i>-- Example schema</i>
  CREATE VIRTUAL TABLE docs USING fts4(content);

  <i>-- Example queries</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite';              <i>-- OK.</i>
  SELECT * FROM docs WHERE docs.docs MATCH 'sqlite';         <i>-- OK.</i>
  SELECT * FROM docs WHERE main.docs.docs MATCH 'sqlite';    <i>-- OK.</i>
  SELECT * FROM docs WHERE main.docs MATCH 'sqlite';         <i>-- Error.</i>
</codeblock>
 
<h2>Summary</h2>

<p>
  From the users point of view, FTS tables are similar to ordinary SQLite
  tables in many ways. Data may be added to, modified within and removed 
  from FTS tables using the INSERT, UPDATE and DELETE commands just as 
  it may be with ordinary tables. Similarly, the SELECT command may be used 
  to query data. The following list summarizes the differences between FTS
  and ordinary tables:

<ol>
  <li><p> 
    As with all virtual table types, it is not possible to create indices or
    triggers attached to FTS tables. Nor is it possible to use the ALTER TABLE
    command to add extra columns to FTS tables (although it is possible to use
    ALTER TABLE to rename an FTS table).

  <li><p> 
    Data-types specified as part of the "CREATE VIRTUAL TABLE" statement
    used to create an FTS table are ignored completely. Instead of the
    normal rules for applying type [affinity] to inserted values, all
    values inserted into FTS table columns (except the special rowid
    column) are converted to type TEXT before being stored.

  <li><p> 
    FTS tables permit the special alias "docid" to be used to refer to the
    rowid column supported by all [virtual tables].

  <li><p> 
    The [FTS MATCH] operator is supported for queries based on the built-in
    full-text index. 

  <li><p> 
    The [FTS auxiliary functions], [snippet()], [offsets()], and [matchinfo()] are 
    available to support full-text queries.

  <li><p>
    <tcl>hd_fragment hiddencol {FTS hidden column}</tcl>
    Every FTS table has a [hidden column] with the 
    same name as the table itself. The value contained in each row for the
    hidden column is a blob that is only useful as the left operand of a
    [FTS MATCH|MATCH] operator, or as the left-most argument to one
    of the [FTS auxiliary functions].
    

</ol>


<h1 tags="compile fts">Compiling and Enabling FTS3 and FTS4</h1>

<p>
  Although the FTS3 and FTS4 is included with the SQLite core source code, it is not
  enabled by default. To build SQLite with FTS functionality enabled, define
  the preprocessor macro [SQLITE_ENABLE_FTS3] when compiling. New applications
  should also define the [SQLITE_ENABLE_FTS3_PARENTHESIS] macro to enable the
  [enhanced query syntax] (see below). Usually, this is done by adding the 
  following two switches to the compiler command line:

<codeblock>
  -DSQLITE_ENABLE_FTS3
  -DSQLITE_ENABLE_FTS3_PARENTHESIS
</codeblock>

<p>
  Note that enabling FTS3 also makes FTS4 available.  There is not a separate
  SQLITE_ENABLE_FTS4 compile-time option.  An build of SQLite either supports
  both FTS3 and FTS4 or it supports neither.

<p>
  If using the amalgamation autoconf based build system, setting the CPPFLAGS
  environment variable while running the 'configure' script is an easy
  way to set these macros. For example, the following command:

<codeblock>
  CPPFLAGS="-DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS" ./configure &lt;configure options&gt;
</codeblock>

<p>
  where <i>&lt;configure options&gt;</i> are those options normally passed to
  the configure script, if any.

<p>
  Because FTS3 and FTS4 are virtual tables, The [SQLITE_ENABLE_FTS3] compile-time option
  is incompatible with the [SQLITE_OMIT_VIRTUALTABLE] option.

<p>
  If a build of SQLite does not include FTS3, then any attempt to prepare an
  SQL statement to create an FTS3 or FTS4 table or to drop or access an existing 
  FTS table in any way will fail. The error message returned will be similar 
  to "no such module: ftsN" (where N is either 3 or 4).

<p>
  If the C version of the <a href=http://site.icu-project.org/>ICU library</a>
  is available, then FTS may also be compiled with the SQLITE_ENABLE_ICU
  pre-processor macro defined. Compiling with this macro enables an FTS
  [tokenizer] that uses the ICU library to split a document into terms
  (words) using the conventions for a specified language and locale.

<codeblock>
  -DSQLITE_ENABLE_ICU
</codeblock>
  

<h1 tags="FTS MATCH">Full-text Index Queries</h1>

<p>
  The most useful thing about FTS tables is the queries that may be 
  performed using the built-in full-text index. Full-text queries are 
  performed by specifying a clause of the form 
  "&lt;column&gt; MATCH &lt;full-text query expression&gt;" to the WHERE 
  clause of a SELECT statement that reads data from an FTS table. 
  [simple fts queries|Simple FTS queries] that return all documents that 
  contain a given term are described above. In that discussion the right-hand
  operand of the MATCH operator was assumed to be a string consisting of a
  single term. This section describes the more complex query types supported 
  by FTS tables, and how they may be utilized by specifying a more
  complex query expression as the right-hand operand of a MATCH operator.

<p>
  FTS tables support three basic query types:

<ul>
  <li><p><b>Token or token prefix queries</b>. 
    An FTS table may be queried for all documents that contain a specified
    term (the [simple fts queries|simple case] described above), or for
    all documents that contain a term with a specified prefix. As we have
    seen, the query expression for a specific term is simply the term itself.
    The query expression used to search for a term prefix is the prefix
    itself with a '*' character appended to it. For example:
</ul>

<codeblock>
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
  <i>-- all documents that contain "linux", but also those that contain terms "linear",</i>
  <i>--"linker", "linguistic" and so on.</i>
  SELECT * FROM docs WHERE docs MATCH 'lin*';
</codeblock>

<ul>
  <li style="list-style:none"><p>
    Normally, a token or token prefix query is matched against the FTS3 table 
    column specified as the right-hand side of the MATCH operator. Or, if the
    special column with the same name as the FTS3 table itself is specified,
    against all columns. This may be overridden by specifying a column-name
    followed by a ":" character before a basic term query. There may be space
    between the ":" and the term to query for, but not between the column-name
    and the ":" character. For example:
</ul>
   
<codeblock>







|

|







512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
  <i>-- all documents that contain "linux", but also those that contain terms "linear",</i>
  <i>--"linker", "linguistic" and so on.</i>
  SELECT * FROM docs WHERE docs MATCH 'lin*';
</codeblock>

<ul>
  <li style="list-style:none"><p>
    Normally, a token or token prefix query is matched against the FTS table 
    column specified as the right-hand side of the MATCH operator. Or, if the
    special column with the same name as the FTS table itself is specified,
    against all columns. This may be overridden by specifying a column-name
    followed by a ":" character before a basic term query. There may be space
    between the ":" and the term to query for, but not between the column-name
    and the ":" character. For example:
</ul>
   
<codeblock>
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
    an operator of the form "NEAR/<i>&lt;N&gt;</i>" may be used, where
    <i>&lt;N&gt;</i> is the maximum number of intervening terms allowed.
    For example:
</ul>

<codeblock>
  <i>-- Virtual table declaration.</i>
  CREATE VIRTUAL TABLE docs USING fts3();

  <i>-- Virtual table data.</i>
  INSERT INTO docs VALUES('SQLite is an ACID compliant embedded relational database management system');

  <i>-- Search for a document that contains the terms "sqlite" and "database" with</i>
  <i>-- not more than 10 intervening terms. This matches the only document in</i>
  <i>-- table docs (since there are only six terms between "SQLite" and "database"</i> 







|







567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
    an operator of the form "NEAR/<i>&lt;N&gt;</i>" may be used, where
    <i>&lt;N&gt;</i> is the maximum number of intervening terms allowed.
    For example:
</ul>

<codeblock>
  <i>-- Virtual table declaration.</i>
  CREATE VIRTUAL TABLE docs USING fts4();

  <i>-- Virtual table data.</i>
  INSERT INTO docs VALUES('SQLite is an ACID compliant embedded relational database management system');

  <i>-- Search for a document that contains the terms "sqlite" and "database" with</i>
  <i>-- not more than 10 intervening terms. This matches the only document in</i>
  <i>-- table docs (since there are only six terms between "SQLite" and "database"</i> 
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640

<p>
  Phrase and NEAR queries may not span multiple columns within a row.

<p>
  The three basic query types described above may be used to query the full-text
  index for the set of documents that match the specified criteria. Using the
  FTS3 query expression language it is possible to perform various set 
  operations on the results of basic queries. There are currently three 
  supported operations:

<ul>
  <li> The AND operator determines the <b>intersection</b> of two sets of documents.

  <li> The OR operator calculates the <b>union</b> of two sets of documents.

  <li> The NOT operator (or, if using the standard syntax, a unary "-" operator)
       may be used to compute the <b>relative complement</b> of one set of
       documents with respect to another.
</ul>

<p>
  The FTS3 module may be compiled to use one of two slightly different versions
  of the full-text query syntax, the "standard" query syntax and the "enhanced" 
  query syntax. The basic term, term-prefix, phrase and NEAR queries described 
  above are the same in both versions of the syntax. The way in which set 
  operations are specified is slightly different. The following two sub-sections 
  describe the part of the two query syntaxes that pertains to set operations. 
  Refer to the description of how to [compile fts3] for compilation notes.

<h2 tags="enhanced query syntax">
  Set Operations Using The Enhanced Query Syntax</h2>

<p>
  The enhanced query syntax supports the AND, OR and NOT binary set operators.
  Each of the two operands to an operator may be a basic FTS3 query, or the
  result of another AND, OR or NOT set operation. Operators must be entered
  using capital letters. Otherwise, they are interpreted as basic term queries
  instead of set operators.

<p>
  The AND operator may be implicitly specified. If two basic queries appear 
  with no operator separating them in an FTS3 query string, the results are
  the same as if the two basic queries were separated by an AND operator.
  For example, the query expression "implicit operator" is a more succinct
  version of "implicit AND operator".

<codeblock>
  <i>-- Virtual table declaration</i>
  CREATE VIRTUAL TABLE docs USING fts3();







|














|





|






|






|







626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675

<p>
  Phrase and NEAR queries may not span multiple columns within a row.

<p>
  The three basic query types described above may be used to query the full-text
  index for the set of documents that match the specified criteria. Using the
  FTS query expression language it is possible to perform various set 
  operations on the results of basic queries. There are currently three 
  supported operations:

<ul>
  <li> The AND operator determines the <b>intersection</b> of two sets of documents.

  <li> The OR operator calculates the <b>union</b> of two sets of documents.

  <li> The NOT operator (or, if using the standard syntax, a unary "-" operator)
       may be used to compute the <b>relative complement</b> of one set of
       documents with respect to another.
</ul>

<p>
  The FTS module may be compiled to use one of two slightly different versions
  of the full-text query syntax, the "standard" query syntax and the "enhanced" 
  query syntax. The basic term, term-prefix, phrase and NEAR queries described 
  above are the same in both versions of the syntax. The way in which set 
  operations are specified is slightly different. The following two sub-sections 
  describe the part of the two query syntaxes that pertains to set operations. 
  Refer to the description of how to [compile fts] for compilation notes.

<h2 tags="enhanced query syntax">
  Set Operations Using The Enhanced Query Syntax</h2>

<p>
  The enhanced query syntax supports the AND, OR and NOT binary set operators.
  Each of the two operands to an operator may be a basic FTS query, or the
  result of another AND, OR or NOT set operation. Operators must be entered
  using capital letters. Otherwise, they are interpreted as basic term queries
  instead of set operators.

<p>
  The AND operator may be implicitly specified. If two basic queries appear 
  with no operator separating them in an FTS query string, the results are
  the same as if the two basic queries were separated by an AND operator.
  For example, the query expression "implicit operator" is a more succinct
  version of "implicit AND operator".

<codeblock>
  <i>-- Virtual table declaration</i>
  CREATE VIRTUAL TABLE docs USING fts3();
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
  SELECT * FROM docs WHERE docs MATCH 'database and sqlite';
</codeblock>

<p>
  The examples above all use basic full-text term queries as both operands of 
  the set operations demonstrated. Phrase and NEAR queries may also be used,
  as may the results of other set operations. When more than one set operation
  is present in an FTS3 query, the precedence of operators is as follows:

<table striped=1>
  <tr><th>Operator<th>Enhanced Query Syntax Precedence
  <tr><td>NOT <td> Highest precedence (tightest grouping).
  <tr><td>AND <td>
  <tr><td>OR  <td> Lowest precedence (loosest grouping).
</table>







|







704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
  SELECT * FROM docs WHERE docs MATCH 'database and sqlite';
</codeblock>

<p>
  The examples above all use basic full-text term queries as both operands of 
  the set operations demonstrated. Phrase and NEAR queries may also be used,
  as may the results of other set operations. When more than one set operation
  is present in an FTS query, the precedence of operators is as follows:

<table striped=1>
  <tr><th>Operator<th>Enhanced Query Syntax Precedence
  <tr><td>NOT <td> Highest precedence (tightest grouping).
  <tr><td>AND <td>
  <tr><td>OR  <td> Lowest precedence (loosest grouping).
</table>
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
  );
</codeblock>


<h2>Set Operations Using The Standard Query Syntax</h2>

<p>
  FTS3 query set operations using the standard query syntax are similar, but
  not identical, to set operations with the enhanced query syntax. There
  are four differences, as follows:

<ol>
  <li value=1><p> Only the implicit version of the AND operator is supported.
    Specifying the string "AND" as part of an standard query syntax query is
    interpreted as a term query for the set of documents containing the term 
    "and".
</ol>

<ol>
  <li value=2><p> Parenthesis are not supported.
</ol>

<ol>
  <li value=3><p> The NOT operator is not supported. Instead of the NOT 
    operator, the standard query syntax supports a unary "-" operator that
    may be applied to basic term and term-prefix queries (but not to phrase
    or NEAR queries). A term or term-prefix that has a unary "-" operator
    attached to it may not appear as an operand to an OR operator. An FTS3
    query may not consist entirely of terms or term-prefix queries with unary
    "-" operators attached to them.
</ol>

<codeblock>
  <i>-- Search for the set of documents that contain the term "sqlite" but do</i>
  <i>-- not contain the term "database".</i>







|



















|







746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
  );
</codeblock>


<h2>Set Operations Using The Standard Query Syntax</h2>

<p>
  FTS query set operations using the standard query syntax are similar, but
  not identical, to set operations with the enhanced query syntax. There
  are four differences, as follows:

<ol>
  <li value=1><p> Only the implicit version of the AND operator is supported.
    Specifying the string "AND" as part of an standard query syntax query is
    interpreted as a term query for the set of documents containing the term 
    "and".
</ol>

<ol>
  <li value=2><p> Parenthesis are not supported.
</ol>

<ol>
  <li value=3><p> The NOT operator is not supported. Instead of the NOT 
    operator, the standard query syntax supports a unary "-" operator that
    may be applied to basic term and term-prefix queries (but not to phrase
    or NEAR queries). A term or term-prefix that has a unary "-" operator
    attached to it may not appear as an operand to an OR operator. An FTS
    query may not consist entirely of terms or term-prefix queries with unary
    "-" operators attached to them.
</ol>

<codeblock>
  <i>-- Search for the set of documents that contain the term "sqlite" but do</i>
  <i>-- not contain the term "database".</i>
769
770
771
772
773
774
775

776
777
778
779
780
781
782
783
784
785
786
787


788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
  <i>-- Search for documents that contain at least one of the terms "database"</i>
  <i>-- and "sqlite", and also contain the term "library". Because of the differences</i>
  <i>-- in operator precedences, this query would have a different interpretation using</i>
  <i>-- the enhanced query syntax.</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite OR database library';
</codeblock>


<h1 tags="snippet, offsets">Auxiliary Functions - Snippet, Offsets and Matchinfo</h1>

<p>
  The FTS3 module provides three special SQL scalar functions that may be useful
  to the developers of full-text query systems: "snippet", "offsets" and
  "matchinfo". The purpose of the "snippet" and "offsets" functions is to allow
  the user to identify the location of queried terms in the returned documents.
  The "matchinfo" function provides the user with metrics that may be useful
  for filtering or sorting query results according to relevance.

<p>
  The first argument to all three special SQL scalar functions


  must be the special hidden column of an FTS3 table that has the same
  name as the table (see above). For example, given an FTS3 table named 
  "mail":

<codeblock>
  SELECT offsets(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;
  SELECT snippet(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;
  SELECT matchinfo(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;
</codeblock>

<p>
  The three auxiliary functions are only useful within a SELECT statement that
  uses the FTS3 table's full-text index. ^If used within a SELECT that uses
  the "query by rowid" or "linear scan" strategies, then the snippet and
  offsets both return an empty string, and the matchinfo function returns
  a blob value zero bytes in size.

<p id=matchable>
  All three auxiliary functions extract a set of "matchable phrases" from
  the FTS3 query expression to work with. The set of matchable phrases for
  a given query consists of all phrases (including unquoted tokens and
  token prefixes) in the expression except those that are prefixed with
  a unary "-" operator (standard syntax) or are part of a sub-expression 
  that is used as the right-hand operand of a NOT operator.

<p>
  With the following provisos, each series of tokens in the FTS3 table that
  matches one of the matchable phrases in the query expression is known as a
  "phrase match":

<ol>
  <li> If a matchable phrase is part of a series of phrases connected by
       NEAR operators in the FTS3 query expression, then each phrase match
       must be sufficiently close to other phrase matches of the relevant
       types to satisfy the NEAR condition.

  <li> If the matchable phrase in the FTS3 query is restricted to matching
       data in a specified FTS3 table column, then only phrase matches that 
       occur within that column are considered.
</ol>
 

<h2>The Offsets Function</h2>

<p>
  For a SELECT query that uses the full-text index, the offsets() function 
  returns a text value containing a series of space-separated integers. For
  each term in each <a href=#matchable>phrase match</a> of the current row, 
  there are four integers in the returned list. Each set of four integers is 
  interpreted as follows:

<table striped=1>
  <tr><th>Integer <th>Interpretation
  <tr><td>0 
      <td>The column number that the term instance occurs in (0 for the
          leftmost column of the FTS3 table, 1 for the next leftmost, etc.).
  <tr><td>1
      <td>The term number of the matching term within the full-text query
          expression. Terms within a query expression are numbered starting
          from 0 in the order that they occur.
  <tr><td>2
      <td>The byte offset of the matching term within the column.
  <tr><td>3







>
|


|








>
>
|
|
<









|






|






|





|



|
|



|













|







804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827

828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
  <i>-- Search for documents that contain at least one of the terms "database"</i>
  <i>-- and "sqlite", and also contain the term "library". Because of the differences</i>
  <i>-- in operator precedences, this query would have a different interpretation using</i>
  <i>-- the enhanced query syntax.</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite OR database library';
</codeblock>

<tcl>hd_fragment snippet {FTS auxiliary functions}</tcl>
<h1>Auxiliary Functions - Snippet, Offsets and Matchinfo</h1>

<p>
  The FTS3 and FTS4 modules provide three special SQL scalar functions that may be useful
  to the developers of full-text query systems: "snippet", "offsets" and
  "matchinfo". The purpose of the "snippet" and "offsets" functions is to allow
  the user to identify the location of queried terms in the returned documents.
  The "matchinfo" function provides the user with metrics that may be useful
  for filtering or sorting query results according to relevance.

<p>
  The first argument to all three special SQL scalar functions
  must be the [FTS hidden column] of the FTS table that the function is
  applied to.  The [FTS hidden column] is an automatically-generated column found on
  all FTS tables that has the same name as the FTS table itself.
  For example, given an FTS table named "mail":


<codeblock>
  SELECT offsets(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;
  SELECT snippet(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;
  SELECT matchinfo(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;
</codeblock>

<p>
  The three auxiliary functions are only useful within a SELECT statement that
  uses the FTS table's full-text index. ^If used within a SELECT that uses
  the "query by rowid" or "linear scan" strategies, then the snippet and
  offsets both return an empty string, and the matchinfo function returns
  a blob value zero bytes in size.

<p id=matchable>
  All three auxiliary functions extract a set of "matchable phrases" from
  the FTS query expression to work with. The set of matchable phrases for
  a given query consists of all phrases (including unquoted tokens and
  token prefixes) in the expression except those that are prefixed with
  a unary "-" operator (standard syntax) or are part of a sub-expression 
  that is used as the right-hand operand of a NOT operator.

<p>
  With the following provisos, each series of tokens in the FTS table that
  matches one of the matchable phrases in the query expression is known as a
  "phrase match":

<ol>
  <li> If a matchable phrase is part of a series of phrases connected by
       NEAR operators in the FTS query expression, then each phrase match
       must be sufficiently close to other phrase matches of the relevant
       types to satisfy the NEAR condition.

  <li> If the matchable phrase in the FTS query is restricted to matching
       data in a specified FTS table column, then only phrase matches that 
       occur within that column are considered.
</ol>
 
<tcl>hd_fragment offsets offsets</tcl>
<h2>The Offsets Function</h2>

<p>
  For a SELECT query that uses the full-text index, the offsets() function 
  returns a text value containing a series of space-separated integers. For
  each term in each <a href=#matchable>phrase match</a> of the current row, 
  there are four integers in the returned list. Each set of four integers is 
  interpreted as follows:

<table striped=1>
  <tr><th>Integer <th>Interpretation
  <tr><td>0 
      <td>The column number that the term instance occurs in (0 for the
          leftmost column of the FTS table, 1 for the next leftmost, etc.).
  <tr><td>1
      <td>The term number of the matching term within the full-text query
          expression. Terms within a query expression are numbered starting
          from 0 in the order that they occur.
  <tr><td>2
      <td>The byte offset of the matching term within the column.
  <tr><td>3
877
878
879
880
881
882
883

884
885
886
887
888
889
890
891
892
893
894

895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
  <i>-- The following query matches the second row in table "mail". It returns the</i>
  <i>-- text "1 0 28 7 1 1 36 4". Only those occurrences of terms "serious" and "mail"</i>
  <i>-- that are part of an instance of the phrase "serious mail" are identified; the</i>
  <i>-- other occurrences of "serious" and "mail" are ignored.</i>
  SELECT offsets(mail) FROM mail WHERE mail MATCH '"serious mail"';
</codeblock>


<h2>The Snippet Function</h2>

<p>
  The snippet function is used to create formatted fragments of document text
  for display as part of a full-text query results report. The snippet function 
  may be passed between one and six arguments, as follows:

<table striped=1>
  <tr><th>Argument <th>Default Value <th>Description
  <tr><td>0 <td>N/A
      <td> The first argument to the snippet function must always be the special

           hidden column of the FTS3 table that takes the same name as the table
           itself.
  <tr><td>1 <td>"&lt;b&gt;"
      <td> The "start match" text.
  <tr><td>2 <td>"&lt;/b&gt;"
      <td> The "end match" text.
  <tr><td>3 <td>"&lt;b&gt;...&lt;/b&gt;"
      <td> The "ellipses" text.
  <tr><td>4 <td>-1
      <td> The FTS3 table column number to extract the returned fragments of
           text from. Columns are numbered from left to right starting with
           zero. A negative value indicates that the text may be extracted
           from any column.
  <tr><td>5 <td>-15
      <td> The absolute value of this integer argument is used as the 
           (approximate) number of tokens to include in the returned text 
           value. The maximum allowable absolute value is 64. The value of







>










|
>
|
|







|







914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
  <i>-- The following query matches the second row in table "mail". It returns the</i>
  <i>-- text "1 0 28 7 1 1 36 4". Only those occurrences of terms "serious" and "mail"</i>
  <i>-- that are part of an instance of the phrase "serious mail" are identified; the</i>
  <i>-- other occurrences of "serious" and "mail" are ignored.</i>
  SELECT offsets(mail) FROM mail WHERE mail MATCH '"serious mail"';
</codeblock>

<tcl>hd_fragment snippet snippet</tcl>
<h2>The Snippet Function</h2>

<p>
  The snippet function is used to create formatted fragments of document text
  for display as part of a full-text query results report. The snippet function 
  may be passed between one and six arguments, as follows:

<table striped=1>
  <tr><th>Argument <th>Default Value <th>Description
  <tr><td>0 <td>N/A
      <td> The first argument to the snippet function must always be the [FTS hidden column]
           of the FTS table being queried and from which the snippet is to be taken.  The
           [FTS hidden column] is an automatically generated column with the same name as the
           FTS table itself.
  <tr><td>1 <td>"&lt;b&gt;"
      <td> The "start match" text.
  <tr><td>2 <td>"&lt;/b&gt;"
      <td> The "end match" text.
  <tr><td>3 <td>"&lt;b&gt;...&lt;/b&gt;"
      <td> The "ellipses" text.
  <tr><td>4 <td>-1
      <td> The FTS table column number to extract the returned fragments of
           text from. Columns are numbered from left to right starting with
           zero. A negative value indicates that the text may be extracted
           from any column.
  <tr><td>5 <td>-15
      <td> The absolute value of this integer argument is used as the 
           (approximate) number of tokens to include in the returned text 
           value. The maximum allowable absolute value is 64. The value of
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
  two and four as described in the paragraphs above, they are joined together
  in sorted order with the "ellipses" text separating them. The three 
  modifications enumerated earlier are performed on the text before it is 
  returned.

<codeblock>
  <b>Note: In this block of examples, newlines and whitespace characters have
  been inserted into the document inserted into the FTS3 table, and the expected
  results described in SQL comments. This is done to enhance readability only,
  they would not be present in actual SQLite commands or output.</b>

  <i>-- Create and populate an FTS3 table.</i>
  CREATE VIRTUAL TABLE text USING fts3();
  INSERT INTO text VALUES('
    During 30 Nov-1 Dec, 2-3oC drops. Cool in the upper portion, minimum temperature 14-16oC 
    and cool elsewhere, minimum temperature 17-20oC. Cold to very cold on mountaintops, 
    minimum temperature 6-12oC. Northeasterly winds 15-30 km/hr. After that, temperature 
    increases. Northeasterly winds 15-30 km/hr.     
  ');








|



|
|







1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
  two and four as described in the paragraphs above, they are joined together
  in sorted order with the "ellipses" text separating them. The three 
  modifications enumerated earlier are performed on the text before it is 
  returned.

<codeblock>
  <b>Note: In this block of examples, newlines and whitespace characters have
  been inserted into the document inserted into the FTS table, and the expected
  results described in SQL comments. This is done to enhance readability only,
  they would not be present in actual SQLite commands or output.</b>

  <i>-- Create and populate an FTS table.</i>
  CREATE VIRTUAL TABLE text USING fts4();
  INSERT INTO text VALUES('
    During 30 Nov-1 Dec, 2-3oC drops. Cool in the upper portion, minimum temperature 14-16oC 
    and cool elsewhere, minimum temperature 17-20oC. Cold to very cold on mountaintops, 
    minimum temperature 6-12oC. Northeasterly winds 15-30 km/hr. After that, temperature 
    increases. Northeasterly winds 15-30 km/hr.     
  ');

1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
  or more 32-bit unsigned integers in machine byte-order. The exact number
  of integers in the returned array depends on both the query and the value
  of the second argument (if any) passed to the matchinfo function.

<p>
  The matchinfo function is called with either one or two arguments. As for
  all auxiliary functions, the first argument must be the special hidden
  column of an FTS3/4 table that has the same name as the table itself (see 
  above). The second argument, if it is specified, must be a text value
  comprised only of the characters 'p', 'c', 'n', 'a', 'l', 's' and 'x'.
  If no second argument is explicitly supplied, it defaults to "pcx". The
  second argument is refered to as the "format string" below.

<p>
  Characters in the matchinfo format string are processed from left to right. 
  Each character in the format string causes one or more 32-bit unsigned
  integer values to be added to the returned array. The "values" column in
  the following table contains the number of integer values appended to the
  output buffer for each supported format string character. In the formula
  given, <i>cols</i> is the number of columns in the FTS3/4 table, and 
  <i>phrases</i> is the number of <a href=#matchable>matchable phrases</a> in 
  the query. 

<table striped=1>
  <tr><th>Character<th>Values<th>Description
  <tr><td>p <td>1 <td>The number of matchable phrases in the query.
  <tr><td>c <td>1 <td>The number of user defined columns in the FTS3/4
    table (i.e. not including the docid or the hidden column with the same 
    name as the table).
  <tr><td>x <td style="white-space:nowrap">3 * <i>cols</i> * <i>phrases</i> 
    <td>
      For each distinct combination of a phrase and table column, the
      following three values:
      <ul>
        <li> In the current row, the number of times the phrase appears in 
             the column.
        <li> The total number of times the phrase appears in the column in
             all rows in the FTS3/4 table.
        <li> The total number of rows in the FTS3/4 table for which the 
             column contains at least one instance of the phrase.
      </ul>
      The first set of three values corresponds to the left-most column
      of the table (column 0) and the left-most matchable phrase in the
      query (phrase 0). If the table has more than one column, the second
      set of three values in the output array correspond to phrase 0 and
      column 1. Followed by phrase 0, column 2 and so on for all columns of







|











|






|
|
<








|
|







1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073

1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
  or more 32-bit unsigned integers in machine byte-order. The exact number
  of integers in the returned array depends on both the query and the value
  of the second argument (if any) passed to the matchinfo function.

<p>
  The matchinfo function is called with either one or two arguments. As for
  all auxiliary functions, the first argument must be the special hidden
  column of an FTS table that has the same name as the table itself (see 
  above). The second argument, if it is specified, must be a text value
  comprised only of the characters 'p', 'c', 'n', 'a', 'l', 's' and 'x'.
  If no second argument is explicitly supplied, it defaults to "pcx". The
  second argument is refered to as the "format string" below.

<p>
  Characters in the matchinfo format string are processed from left to right. 
  Each character in the format string causes one or more 32-bit unsigned
  integer values to be added to the returned array. The "values" column in
  the following table contains the number of integer values appended to the
  output buffer for each supported format string character. In the formula
  given, <i>cols</i> is the number of columns in the FTS table, and 
  <i>phrases</i> is the number of <a href=#matchable>matchable phrases</a> in 
  the query. 

<table striped=1>
  <tr><th>Character<th>Values<th>Description
  <tr><td>p <td>1 <td>The number of matchable phrases in the query.
  <tr><td>c <td>1 <td>The number of user defined columns in the FTS
    table (i.e. not including the docid or the [FTS hidden column]).

  <tr><td>x <td style="white-space:nowrap">3 * <i>cols</i> * <i>phrases</i> 
    <td>
      For each distinct combination of a phrase and table column, the
      following three values:
      <ul>
        <li> In the current row, the number of times the phrase appears in 
             the column.
        <li> The total number of times the phrase appears in the column in
             all rows in the FTS table.
        <li> The total number of rows in the FTS table for which the 
             column contains at least one instance of the phrase.
      </ul>
      The first set of three values corresponds to the left-most column
      of the table (column 0) and the left-most matchable phrase in the
      query (phrase 0). If the table has more than one column, the second
      set of three values in the output array correspond to phrase 0 and
      column 1. Followed by phrase 0, column 2 and so on for all columns of
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
  be used to order results in a full-text search application. Appendix A of this 
  document, "[search application tips]", contains an example of using the
  matchinfo() function efficiently.

<h1 id=tokenizer tags="tokenizer">Tokenizers</h1>

<p>
  An FTS3 tokenizer is a set of rules for extracting terms from a document 
  or basic FTS3 full-text query. 

<p>
  Unless a specific tokenizer is specified as part of the CREATE 
  VIRTUAL TABLE statement used to create the FTS3 table, the default 
  tokenizer, "simple", is used. The simple tokenizer extracts tokens from
  a document or basic FTS3 full-text query according to the following 
  rules:

<ul>
  <li><p> A term is a contiguous sequence of eligible characters, where 
    eligible characters are all alphanumeric characters, the "_" character,
    and all characters with UTF codepoints greater than or equal to 128.
    All other characters are discarded when splitting a document into terms.







|
|



|

|







1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
  be used to order results in a full-text search application. Appendix A of this 
  document, "[search application tips]", contains an example of using the
  matchinfo() function efficiently.

<h1 id=tokenizer tags="tokenizer">Tokenizers</h1>

<p>
  An FTS tokenizer is a set of rules for extracting terms from a document 
  or basic FTS full-text query. 

<p>
  Unless a specific tokenizer is specified as part of the CREATE 
  VIRTUAL TABLE statement used to create the FTS table, the default 
  tokenizer, "simple", is used. The simple tokenizer extracts tokens from
  a document or basic FTS full-text query according to the following 
  rules:

<ul>
  <li><p> A term is a contiguous sequence of eligible characters, where 
    eligible characters are all alphanumeric characters, the "_" character,
    and all characters with UTF codepoints greater than or equal to 128.
    All other characters are discarded when splitting a document into terms.
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
  frustrated.", the terms extracted from the document and added to the 
  full-text index are, in order, "right now they re very frustrated". Such
  a document would match a full-text query such as "MATCH 'Frustrated'", 
  as the simple tokenizer transforms the term in the query to lowercase
  before searching the full-text index.

<p>
  As well as the "simple" tokenizer, the FTS3 source code features a tokenizer 
  that uses the <a href="http://tartarus.org/~martin/PorterStemmer/">Porter 
  Stemming algorithm</a>. This tokenizer uses the same rules to separate
  the input document into terms, but as well as folding all terms to lower
  case it uses the Porter Stemming algorithm to reduce related English language
  words to a common root. For example, using the same input document as in the
  paragraph above, the porter tokenizer extracts the following tokens:
  "right now thei veri frustrat". Even though some of these terms are not even
  English words, in some cases using them to build the full-text index is more
  useful than the more intelligible output produced by the simple tokenizer.
  Using the porter tokenizer, the document not only matches full-text queries
  such as "MATCH 'Frustrated'", but also queries such as "MATCH 'Frustration'",
  as the term "Frustration" is reduced by the Porter stemmer algorithm to 
  "frustrat" - just as "Frustrated" is. So, when using the porter tokenizer,
  FTS3 is able to find not just exact matches for queried terms, but matches
  against similar English language terms. For more information on the 
  Porter Stemmer algorithm, please refer to the page linked above.

<p>
  Example illustrating the difference between the "simple" and "porter"
  tokenizers:








|













|







1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
  frustrated.", the terms extracted from the document and added to the 
  full-text index are, in order, "right now they re very frustrated". Such
  a document would match a full-text query such as "MATCH 'Frustrated'", 
  as the simple tokenizer transforms the term in the query to lowercase
  before searching the full-text index.

<p>
  As well as the "simple" tokenizer, the FTS source code features a tokenizer 
  that uses the <a href="http://tartarus.org/~martin/PorterStemmer/">Porter 
  Stemming algorithm</a>. This tokenizer uses the same rules to separate
  the input document into terms, but as well as folding all terms to lower
  case it uses the Porter Stemming algorithm to reduce related English language
  words to a common root. For example, using the same input document as in the
  paragraph above, the porter tokenizer extracts the following tokens:
  "right now thei veri frustrat". Even though some of these terms are not even
  English words, in some cases using them to build the full-text index is more
  useful than the more intelligible output produced by the simple tokenizer.
  Using the porter tokenizer, the document not only matches full-text queries
  such as "MATCH 'Frustrated'", but also queries such as "MATCH 'Frustration'",
  as the term "Frustration" is reduced by the Porter stemmer algorithm to 
  "frustrat" - just as "Frustrated" is. So, when using the porter tokenizer,
  FTS is able to find not just exact matches for queried terms, but matches
  against similar English language terms. For more information on the 
  Porter Stemmer algorithm, please refer to the page linked above.

<p>
  Example illustrating the difference between the "simple" and "porter"
  tokenizers:

1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
  discard punctuation, this can be done by creating a tokenizer
  implementation that uses the ICU tokenizer as part of its implementation.

<h2>Custom (User Implemented) Tokenizers</h2>

<p>
  As well as the built-in "simple", "porter" and (possibly) "icu" tokenizers,
  FTS3 exports an interface that allows users to implement custom tokenizers
  using C. The interface used to create a new tokenizer is defined and 
  described in the fts3_tokenizer.h source file.

<p>
  Registering a new FTS3 tokenizer is similar to registering a new
  virtual table module with SQLite. The user passes a pointer to a
  structure containing pointers to various callback functions that
  make up the implementation of the new tokenizer type. For tokenizers,
  the structure (defined in fts3_tokenizer.h) is called
  "sqlite3_tokenizer_module".

<p>
  FTS3 does not expose a C-function that users call to register new
  tokenizer types with a database handle. Instead, the pointer must
  be encoded as an SQL blob value and passed to FTS3 through the SQL
  engine by evaluating a special scalar function, "fts3_tokenizer()".
  The fts3_tokenizer() function may be called with one or two arguments,
  as follows:

<codeblock>
    SELECT fts3_tokenizer(&lt;tokenizer-name&gt;);
    SELECT fts3_tokenizer(&lt;tokenizer-name&gt;, &lt;sqlite3_tokenizer_module ptr&gt;);
</codeblock>

<p>
  Where <tokenizer-name> is a string identifying the tokenizer and
  <sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module
  structure encoded as an SQL blob. If the second argument is present,
  it is registered as tokenizer <tokenizer-name> and a copy of it
  returned. If only one argument is passed, a pointer to the tokenizer
  implementation currently registered as <tokenizer-name> is returned,
  encoded as a blob. Or, if no such tokenizer exists, an SQL exception
  (error) is raised.

<p>
  <b>SECURITY WARNING</b>: If the fts3 extension is used in an environment
  where potentially malicious users may execute arbitrary SQL, they should 
  be prevented from invoking the fts3_tokenizer() function, possibly using 
  the [sqlite3_set_authorizer()|authorisation callback].

<p>
  The following block contains an example of calling the fts3_tokenizer()
  function from C code:

<codeblock>
  <i>/*
  ** Register a tokenizer implementation with FTS3.
  */</i>
  int registerTokenizer(
    sqlite3 *db,
    char *zName,
    const sqlite3_tokenizer_module *p
  ){
    int rc;







|




|







|

|




















|










|







1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
  discard punctuation, this can be done by creating a tokenizer
  implementation that uses the ICU tokenizer as part of its implementation.

<h2>Custom (User Implemented) Tokenizers</h2>

<p>
  As well as the built-in "simple", "porter" and (possibly) "icu" tokenizers,
  FTS exports an interface that allows users to implement custom tokenizers
  using C. The interface used to create a new tokenizer is defined and 
  described in the fts3_tokenizer.h source file.

<p>
  Registering a new FTS tokenizer is similar to registering a new
  virtual table module with SQLite. The user passes a pointer to a
  structure containing pointers to various callback functions that
  make up the implementation of the new tokenizer type. For tokenizers,
  the structure (defined in fts3_tokenizer.h) is called
  "sqlite3_tokenizer_module".

<p>
  FTS does not expose a C-function that users call to register new
  tokenizer types with a database handle. Instead, the pointer must
  be encoded as an SQL blob value and passed to FTS through the SQL
  engine by evaluating a special scalar function, "fts3_tokenizer()".
  The fts3_tokenizer() function may be called with one or two arguments,
  as follows:

<codeblock>
    SELECT fts3_tokenizer(&lt;tokenizer-name&gt;);
    SELECT fts3_tokenizer(&lt;tokenizer-name&gt;, &lt;sqlite3_tokenizer_module ptr&gt;);
</codeblock>

<p>
  Where <tokenizer-name> is a string identifying the tokenizer and
  <sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module
  structure encoded as an SQL blob. If the second argument is present,
  it is registered as tokenizer <tokenizer-name> and a copy of it
  returned. If only one argument is passed, a pointer to the tokenizer
  implementation currently registered as <tokenizer-name> is returned,
  encoded as a blob. Or, if no such tokenizer exists, an SQL exception
  (error) is raised.

<p>
  <b>SECURITY WARNING</b>: If the fts3/4 extension is used in an environment
  where potentially malicious users may execute arbitrary SQL, they should 
  be prevented from invoking the fts3_tokenizer() function, possibly using 
  the [sqlite3_set_authorizer()|authorisation callback].

<p>
  The following block contains an example of calling the fts3_tokenizer()
  function from C code:

<codeblock>
  <i>/*
  ** Register a tokenizer implementation with FTS3 or FTS4.
  */</i>
  int registerTokenizer(
    sqlite3 *db,
    char *zName,
    const sqlite3_tokenizer_module *p
  ){
    int rc;
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
    sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC);
    sqlite3_step(pStmt);

    return sqlite3_finalize(pStmt);
  }

  <i>/*
  ** Query FTS3 for the tokenizer implementation named zName.
  */</i>
  int queryTokenizer(
    sqlite3 *db,
    char *zName,
    const sqlite3_tokenizer_module **pp
  ){
    int rc;







|







1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
    sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC);
    sqlite3_step(pStmt);

    return sqlite3_finalize(pStmt);
  }

  <i>/*
  ** Query FTS for the tokenizer implementation named zName.
  */</i>
  int queryTokenizer(
    sqlite3 *db,
    char *zName,
    const sqlite3_tokenizer_module **pp
  ){
    int rc;
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360

1361
1362
1363

1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
  }
</codeblock>

  
<h1 tags="segment btree">Data Structures</h1>

<p>
  This section describes at a high-level the way the FTS3 module stores its
  index and content in the database. It is <b>not necessary to read or 
  understand the material in this section in order to use FTS3</b> in an 
  application. However, it may be useful to application developers attempting 
  to analyze and understand FTS3 performance characteristics, or to developers 
  contemplating enhancements to the existing FTS3 feature set.


<p>
  For each FTS3 virtual table in a database, three real (non-virtual) tables 
  are created to store the underlying data. The real tables are named "%_content",

  "%_segdir" and "%_segments", where "%" is replaced by the name supplied by
  the user for the FTS3 virtual table.

<p>
  The leftmost column of the "%_content" table is an INTEGER PRIMARY KEY field
  named "docid". Following this is one column for each column of the FTS3
  virtual table as declared by the user, named by prepending the column name
  supplied by the user with "c<i>N</i>", where <i>N</i> is the index of the 
  column within the table, numbered from left to right starting with 0. Data
  types supplied as part of the virtual table declaration are not used as
  part of the %_content table declaration. For example:

<codeblock>
  <i>-- Virtual table declaration</i>
  CREATE VIRTUAL TABLE abc USING FTS3(a NUMBER, b TEXT, c);

  <i>-- Corresponding %_content table declaration</i>
  CREATE TABLE abc_content(docid INTEGER PRIMARY KEY, c0a, c1b, c2c);
</codeblock>

<p>
  The %_content table contains the unadulterated data inserted by the user 
  into the FTS3 virtual table by the user. If the user does not explicitly
  supply a "docid" value when inserting records, one is selected automatically
  by the system.

<p>
  The two remaining tables, %_segments and %_segdir, are used to store the 
  full-text index. Conceptually, this index is a lookup table that maps each 
  term (word) to the set of docid values corresponding to records in the 
  %_content table that contain one or more occurrences of the term. To
  retrieve all documents that contain a specified term, the FTS3 module
  queries this index to determine the set of docid values for records that
  contain the term, then retrieves the required documents from the %_content
  table. Regardless of the schema of the FTS3 virtual table, the %_segments
  and %_segdir tables are always created as follows:

<codeblock>
  CREATE TABLE %_segments(
    blockid INTEGER PRIMARY KEY,       <i>-- B-tree node id</i>
    block blob                         <i>-- B-tree node data</i>
  );







|

|

|
|

>

|
|
>
|
|



|








|







|




|



|


|







1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
  }
</codeblock>

  
<h1 tags="segment btree">Data Structures</h1>

<p>
  This section describes at a high-level the way the FTS module stores its
  index and content in the database. It is <b>not necessary to read or 
  understand the material in this section in order to use FTS</b> in an 
  application. However, it may be useful to application developers attempting 
  to analyze and understand FTS performance characteristics, or to developers 
  contemplating enhancements to the existing FTS feature set.

<tcl>hd_fragment *shadowtab {FTS shadow tables}</tcl>
<p>
  For each FTS virtual table in a database, three to five real (non-virtual) tables
  are created to store the underlying data.  These real tables are called "shadow tables".
  The real tables are named "%_content",
  "%_segdir", "%_segments", "%_stat", and "%_docsize", where "%" is replaced by the name
  of the FTS virtual table.

<p>
  The leftmost column of the "%_content" table is an INTEGER PRIMARY KEY field
  named "docid". Following this is one column for each column of the FTS
  virtual table as declared by the user, named by prepending the column name
  supplied by the user with "c<i>N</i>", where <i>N</i> is the index of the 
  column within the table, numbered from left to right starting with 0. Data
  types supplied as part of the virtual table declaration are not used as
  part of the %_content table declaration. For example:

<codeblock>
  <i>-- Virtual table declaration</i>
  CREATE VIRTUAL TABLE abc USING fts4(a NUMBER, b TEXT, c);

  <i>-- Corresponding %_content table declaration</i>
  CREATE TABLE abc_content(docid INTEGER PRIMARY KEY, c0a, c1b, c2c);
</codeblock>

<p>
  The %_content table contains the unadulterated data inserted by the user 
  into the FTS virtual table by the user. If the user does not explicitly
  supply a "docid" value when inserting records, one is selected automatically
  by the system.

<p>
  The two tables, %_segments and %_segdir, are used to store the 
  full-text index. Conceptually, this index is a lookup table that maps each 
  term (word) to the set of docid values corresponding to records in the 
  %_content table that contain one or more occurrences of the term. To
  retrieve all documents that contain a specified term, the FTS module
  queries this index to determine the set of docid values for records that
  contain the term, then retrieves the required documents from the %_content
  table. Regardless of the schema of the FTS virtual table, the %_segments
  and %_segdir tables are always created as follows:

<codeblock>
  CREATE TABLE %_segments(
    blockid INTEGER PRIMARY KEY,       <i>-- B-tree node id</i>
    block blob                         <i>-- B-tree node data</i>
  );
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
  segment b-tree, then it maps to the union of each individual doclist. If,
  for a single term, the same docid occurs in more than one doclist, then only
  the doclist that is part of the most recently created segment b-tree is 
  considered valid. 

<p>
  Multiple b-tree structures are used instead of a single b-tree to reduce
  the cost of inserting records into FTS3 tables. When a new record is 
  inserted into an FTS3 table that already contains a lot of data, it is
  likely that many of the terms in the new record are already present in
  a large number of existing records. If a single b-tree were used, then
  large doclist structures would have to be loaded from the database,
  amended to include the new docid and term-offset list, then written back
  to the database. Using multiple b-tree tables allows this to be avoided
  by creating a new b-tree which can be merged with the existing b-tree
  (or b-trees) later on. Merging of b-tree structures can be performed as
  a background task, or once a certain number of separate b-tree structures
  have been accumulated. Of course, this scheme makes queries more expensive
  (as the FTS3 code may have to look up individual terms in more than one
  b-tree and merge the results), but it has been found that in practice this
  overhead is often negligible.
  
<h2>Variable Length Integer (varint) Format</h2>

<p>
  Integer values stored as part of segment b-tree nodes are encoded using the
  FTS3 varint format. This encoding is similar, but <b>not identical</b>, to the
  the <a href="fileformat.html#varint_format">SQLite varint format</a>.

<p>
  An encoded FTS3 varint consumes between one and ten bytes of space. The
  number of bytes required is determined by the sign and magnitude of the
  integer value encoded. More accurately, the number of bytes used to store
  the encoded integer depends on the position of the most significant set bit
  in the 64-bit twos-complement representation of the integer value. Negative
  values always have the most significant bit set (the sign bit), and so are
  always stored using the full ten bytes. Positive integer values may be
  stored using less space.

<p>
  The final byte of an encoded FTS3 varint has its most significant bit 
  cleared. All preceding bytes have the most significant bit set. Data
  is stored in the remaining seven least significant bits of each byte.
  The first byte of the encoded representation contains the least significant
  seven bits of the encoded integer value. The second byte of the encoded
  representation, if it is present, contains the seven next least significant
  bits of the integer value, and so on. The following table contains examples
  of encoded integer values:







|
|









|







|



|









|







1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
  segment b-tree, then it maps to the union of each individual doclist. If,
  for a single term, the same docid occurs in more than one doclist, then only
  the doclist that is part of the most recently created segment b-tree is 
  considered valid. 

<p>
  Multiple b-tree structures are used instead of a single b-tree to reduce
  the cost of inserting records into FTS tables. When a new record is 
  inserted into an FTS table that already contains a lot of data, it is
  likely that many of the terms in the new record are already present in
  a large number of existing records. If a single b-tree were used, then
  large doclist structures would have to be loaded from the database,
  amended to include the new docid and term-offset list, then written back
  to the database. Using multiple b-tree tables allows this to be avoided
  by creating a new b-tree which can be merged with the existing b-tree
  (or b-trees) later on. Merging of b-tree structures can be performed as
  a background task, or once a certain number of separate b-tree structures
  have been accumulated. Of course, this scheme makes queries more expensive
  (as the FTS code may have to look up individual terms in more than one
  b-tree and merge the results), but it has been found that in practice this
  overhead is often negligible.
  
<h2>Variable Length Integer (varint) Format</h2>

<p>
  Integer values stored as part of segment b-tree nodes are encoded using the
  FTS varint format. This encoding is similar, but <b>not identical</b>, to the
  the <a href="fileformat.html#varint_format">SQLite varint format</a>.

<p>
  An encoded FTS varint consumes between one and ten bytes of space. The
  number of bytes required is determined by the sign and magnitude of the
  integer value encoded. More accurately, the number of bytes used to store
  the encoded integer depends on the position of the most significant set bit
  in the 64-bit twos-complement representation of the integer value. Negative
  values always have the most significant bit set (the sign bit), and so are
  always stored using the full ten bytes. Positive integer values may be
  stored using less space.

<p>
  The final byte of an encoded FTS varint has its most significant bit 
  cleared. All preceding bytes have the most significant bit set. Data
  is stored in the remaining seven least significant bits of each byte.
  The first byte of the encoded representation contains the least significant
  seven bits of the encoded integer value. The second byte of the encoded
  representation, if it is present, contains the seven next least significant
  bits of the integer value, and so on. The following table contains examples
  of encoded integer values:
1582
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
</center>


<h2>Doclist Format</h2>

<p>
  A doclist consists of an array of 64-bit signed integers, serialized using
  the FTS3 varint format. Each doclist entry is made up of a series of two 
  or more integers, as follows:

<ol>
  <li> The docid value. The first entry in a doclist contains the literal docid
       value. The first field of each subsequent doclist entry contains the 
       difference between the new docid and the previous one (always a positive 
       number).
  <li> Zero or more term-offset lists. A term-offset list is present for each
       column of the FTS3 virtual table that contains the term. A term-offset
       list consists of the following:
     <ol>
       <li> Constant value 1. This field is omitted for any term-offset list
            associated with column 0.
       <li> The column number (1 for the second leftmost column, etc.). This
            field is omitted for any term-offset list associated with column 0.
       <li> A list of term-offsets, sorted from smallest to largest. Instead







|








|







1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
</center>


<h2>Doclist Format</h2>

<p>
  A doclist consists of an array of 64-bit signed integers, serialized using
  the FTS varint format. Each doclist entry is made up of a series of two 
  or more integers, as follows:

<ol>
  <li> The docid value. The first entry in a doclist contains the literal docid
       value. The first field of each subsequent doclist entry contains the 
       difference between the new docid and the previous one (always a positive 
       number).
  <li> Zero or more term-offset lists. A term-offset list is present for each
       column of the FTS virtual table that contains the term. A term-offset
       list consists of the following:
     <ol>
       <li> Constant value 1. This field is omitted for any term-offset list
            associated with column 0.
       <li> The column number (1 for the second leftmost column, etc.). This
            field is omitted for any term-offset list associated with column 0.
       <li> A list of term-offsets, sorted from smallest to largest. Instead
1613
1614
1615
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
<center>
  <img src=images/fts3_doclist2.png>
  <p> FTS3 Doclist Format
</center>

<center>
  <img src=images/fts3_doclist.png>
  <p> FTS3 Doclist Entry Format
</center>

<p>
  For doclists for which the term appears in more than one column of the FTS3
  virtual table, term-offset lists within the doclist are stored in column 
  number order. This ensures that the term-offset list associated with 
  column 0 (if any) is always first, allowing the first two fields of the
  term-offset list to be omitted in this case.

<h1 id=appendix_a nonumber tags="search application tips">
  Appendix A: Search Application Tips
</h1>

<p>
  FTS3 is primarily designed to support Boolean full-text queries - queries
  to find the set of documents that match a specified criteria. However, many 
  (most?) search applications require that results are somehow ranked in order
  of "relevance", where "relevance" is defined as the likelihood that the user
  who performed the search is interested in a specific element of the returned
  set of documents. When using a search engine to find documents on the world
  wide web, the user expects that the most useful, or "relevant", documents 
  will be returned as the first page of results, and that each subsequent page 
  contains progressively less relevant results. Exactly how a machine can 
  determine document relevance based on a users query is a complicated problem
  and the subject of much ongoing research.

<p>
  One very simple scheme might be to count the number of instances of the 
  users search terms in each result document. Those documents that contain
  many instances of the terms are considered more relevant than those with
  a small number of instances of each term. In an FTS3 application, the 
  number of term instances in each result could be determined by counting
  the number of integers in the return value of the [offsets] function.
  The following example shows a query that could be used to obtain the
  ten most relevant results for a query entered by the user:

<codeblock>
  <i>-- This example (and all others in this section) assumes the following schema</i>







|



|










|















|







1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
1665
1666
1667
1668
1669
1670
1671
1672
1673
1674
1675
1676
1677
1678
1679
1680
1681
1682
1683
1684
1685
1686
1687
1688
1689
1690
1691
1692
1693
1694
1695
1696
1697
1698
<center>
  <img src=images/fts3_doclist2.png>
  <p> FTS3 Doclist Format
</center>

<center>
  <img src=images/fts3_doclist.png>
  <p> FTS Doclist Entry Format
</center>

<p>
  For doclists for which the term appears in more than one column of the FTS
  virtual table, term-offset lists within the doclist are stored in column 
  number order. This ensures that the term-offset list associated with 
  column 0 (if any) is always first, allowing the first two fields of the
  term-offset list to be omitted in this case.

<h1 id=appendix_a nonumber tags="search application tips">
  Appendix A: Search Application Tips
</h1>

<p>
  FTS is primarily designed to support Boolean full-text queries - queries
  to find the set of documents that match a specified criteria. However, many 
  (most?) search applications require that results are somehow ranked in order
  of "relevance", where "relevance" is defined as the likelihood that the user
  who performed the search is interested in a specific element of the returned
  set of documents. When using a search engine to find documents on the world
  wide web, the user expects that the most useful, or "relevant", documents 
  will be returned as the first page of results, and that each subsequent page 
  contains progressively less relevant results. Exactly how a machine can 
  determine document relevance based on a users query is a complicated problem
  and the subject of much ongoing research.

<p>
  One very simple scheme might be to count the number of instances of the 
  users search terms in each result document. Those documents that contain
  many instances of the terms are considered more relevant than those with
  a small number of instances of each term. In an FTS application, the 
  number of term instances in each result could be determined by counting
  the number of integers in the return value of the [offsets] function.
  The following example shows a query that could be used to obtain the
  ten most relevant results for a query entered by the user:

<codeblock>
  <i>-- This example (and all others in this section) assumes the following schema</i>
1666
1667
1668
1669
1670
1671
1672
1673
1674
1675
1676
1677
1678
1679
1680
  SELECT title FROM documents 
    WHERE documents MATCH &lt;query&gt;
    ORDER BY countintegers(offsets(document)) DESC
    OFFSET 0 LIMIT 10
</codeblock>

<p>
  The query above could be made to run faster by using the FTS3 [matchinfo]
  function to determine the number of query term instances that appear in each
  result. The matchinfo function is much more efficient than the offsets 
  function. Furthermore, the matchinfo function provides extra information
  regarding the overall number of occurrences of each query term in the entire
  document set (not just the current row) and the number of documents in which 
  each query term appears. This may be used (for example) to attach a higher
  weight to less common terms which may increase the overall computed relevancy 







|







1706
1707
1708
1709
1710
1711
1712
1713
1714
1715
1716
1717
1718
1719
1720
  SELECT title FROM documents 
    WHERE documents MATCH &lt;query&gt;
    ORDER BY countintegers(offsets(document)) DESC
    OFFSET 0 LIMIT 10
</codeblock>

<p>
  The query above could be made to run faster by using the FTS [matchinfo]
  function to determine the number of query term instances that appear in each
  result. The matchinfo function is much more efficient than the offsets 
  function. Furthermore, the matchinfo function provides extra information
  regarding the overall number of occurrences of each query term in the entire
  document set (not just the current row) and the number of documents in which 
  each query term appears. This may be used (for example) to attach a higher
  weight to less common terms which may increase the overall computed relevancy 
1691
1692
1693
1694
1695
1696
1697
1698
1699
1700
1701
1702
1703
1704
1705
    OFFSET 0 LIMIT 10
</codeblock>

<p>
  The SQL query in the example above uses less CPU than the first example
  in this section, but still has a non-obvious performance problem. SQLite
  satisfies this query by retrieving the value of the "title" column and
  matchinfo data from the FTS3 module for every row matched by the users
  query before it sorts and limits the results. Because of the way SQLite's
  virtual table interface works, retrieving the value of the "title" column
  requires loading the entire row from disk (including the "content" field,
  which may be quite large). This means that if the users query matches
  several thousand documents, many megabytes of "title" and "content" data
  may be loaded from disk into memory even though they will never be used
  for any purpose. 







|







1731
1732
1733
1734
1735
1736
1737
1738
1739
1740
1741
1742
1743
1744
1745
    OFFSET 0 LIMIT 10
</codeblock>

<p>
  The SQL query in the example above uses less CPU than the first example
  in this section, but still has a non-obvious performance problem. SQLite
  satisfies this query by retrieving the value of the "title" column and
  matchinfo data from the FTS module for every row matched by the users
  query before it sorts and limits the results. Because of the way SQLite's
  virtual table interface works, retrieving the value of the "title" column
  requires loading the entire row from disk (including the "content" field,
  which may be quite large). This means that if the users query matches
  several thousand documents, many megabytes of "title" and "content" data
  may be loaded from disk into memory even though they will never be used
  for any purpose. 
1725
1726
1727
1728
1729
1730
1731
1732
1733
1734
1735
1736
1737
1738
1739
      OFFSET 0 LIMIT 10
  ) AS ranktable USING(docid)
  ORDER BY ranktable.rank DESC
</codeblock>

<p>
  The next block of SQL enhances the query with solutions to two other problems
  that may arise in developing search applications using FTS3:

<ol>
  <li> <p>
       The [snippet] function cannot be used with the above query. Because
       the outer query does not include a "WHERE ... MATCH" clause, the snippet 
       function may not be used with it. One solution is to duplicate the WHERE
       clause used by the sub-query in the outer query. The overhead associated







|







1765
1766
1767
1768
1769
1770
1771
1772
1773
1774
1775
1776
1777
1778
1779
      OFFSET 0 LIMIT 10
  ) AS ranktable USING(docid)
  ORDER BY ranktable.rank DESC
</codeblock>

<p>
  The next block of SQL enhances the query with solutions to two other problems
  that may arise in developing search applications using FTS:

<ol>
  <li> <p>
       The [snippet] function cannot be used with the above query. Because
       the outer query does not include a "WHERE ... MATCH" clause, the snippet 
       function may not be used with it. One solution is to duplicate the WHERE
       clause used by the sub-query in the outer query. The overhead associated
1750
1751
1752
1753
1754
1755
1756
1757
1758
1759
1760
1761
1762
1763
1764

<p>
  This version of the query is very similar to that used by the 
  <a href="http://www.sqlite.org/search?q=fts3">sqlite.org documentation search</a> 
  application.

<codeblock>
  <i>-- This table stores the static weight assigned to each document in FTS3 table</i>
  <i>-- "documents". For each row in the documents table there is a corresponding row</i>
  <i>-- with the same docid value in this table.</i>
  CREATE TABLE documents_data(docid INTEGER PRIMARY KEY, weight);

  <i>-- This query is similar to the one in the block above, except that:</i>
  <i>--</i>
  <i>--   1. It returns a "snippet" of text along with the document title for display. So</i>







|







1790
1791
1792
1793
1794
1795
1796
1797
1798
1799
1800
1801
1802
1803
1804

<p>
  This version of the query is very similar to that used by the 
  <a href="http://www.sqlite.org/search?q=fts3">sqlite.org documentation search</a> 
  application.

<codeblock>
  <i>-- This table stores the static weight assigned to each document in FTS table</i>
  <i>-- "documents". For each row in the documents table there is a corresponding row</i>
  <i>-- with the same docid value in this table.</i>
  CREATE TABLE documents_data(docid INTEGER PRIMARY KEY, weight);

  <i>-- This query is similar to the one in the block above, except that:</i>
  <i>--</i>
  <i>--   1. It returns a "snippet" of text along with the document title for display. So</i>
1791
1792
1793
1794
1795
1796
1797
1798
1799
1800
1801
1802
1803
1804
1805
1806
1807
1808
1809
1810
1811
1812
1813
1814
1815
1816
1817
1818
1819
1820
1821
1822
1823
  implemented in C. Instead of a single weight, it allows a weight to be 
  externally assigned to each column of each document. It may be registered
  with SQLite like any other user function using [sqlite3_create_function].

<codeblock>
<i>/*</i>
<i>** SQLite user defined function to use with matchinfo() to calculate the</i>
<i>** relevancy of an FTS3 match. The value returned is the relevancy score</i>
<i>** (a real value greater than or equal to zero). A larger value indicates </i>
<i>** a more relevant document.</i>
<i>**</i>
<i>** The overall relevancy returned is the sum of the relevancies of each </i>
<i>** column value in the FTS3 table. The relevancy of a column value is the</i>
<i>** sum of the following for each reportable phrase in the FTS3 query:</i>
<i>**</i>
<i>**   (&lt;hit count&gt; / &lt;global hit count&gt) * &lt;column weight&gt;</i>
<i>**</i>
<i>** where &lt;hit count&gt; is the number of instances of the phrase in the</i>
<i>** column value of the current row and &lt;global hit count&gt; is the number</i>
<i>** of instances of the phrase in the same column of all rows in the FTS3</i>
<i>** table. The &lt;column weight&gt; is a weighting factor assigned to each</i>
<i>** column by the caller (see below).</i>
<i>**</i>
<i>** The first argument to this function must be the return value of the FTS3 </i>
<i>** matchinfo() function. Following this must be one argument for each column </i>
<i>** of the FTS3 table containing a numeric weight factor for the corresponding </i>
<i>** column. Example:</i>
<i>**</i>
<i>**     CREATE VIRTUAL TABLE documents USING fts3(title, content)</i>
<i>**</i>
<i>** The following query returns the docids of documents that match the full-text</i>
<i>** query &lt;query&gt; sorted from most to least relevant. When calculating</i>
<i>** relevance, query term instances in the 'title' column are given twice the</i>







|




|
|





|



|

|







1831
1832
1833
1834
1835
1836
1837
1838
1839
1840
1841
1842
1843
1844
1845
1846
1847
1848
1849
1850
1851
1852
1853
1854
1855
1856
1857
1858
1859
1860
1861
1862
1863
  implemented in C. Instead of a single weight, it allows a weight to be 
  externally assigned to each column of each document. It may be registered
  with SQLite like any other user function using [sqlite3_create_function].

<codeblock>
<i>/*</i>
<i>** SQLite user defined function to use with matchinfo() to calculate the</i>
<i>** relevancy of an FTS match. The value returned is the relevancy score</i>
<i>** (a real value greater than or equal to zero). A larger value indicates </i>
<i>** a more relevant document.</i>
<i>**</i>
<i>** The overall relevancy returned is the sum of the relevancies of each </i>
<i>** column value in the FTS table. The relevancy of a column value is the</i>
<i>** sum of the following for each reportable phrase in the FTS query:</i>
<i>**</i>
<i>**   (&lt;hit count&gt; / &lt;global hit count&gt) * &lt;column weight&gt;</i>
<i>**</i>
<i>** where &lt;hit count&gt; is the number of instances of the phrase in the</i>
<i>** column value of the current row and &lt;global hit count&gt; is the number</i>
<i>** of instances of the phrase in the same column of all rows in the FTS</i>
<i>** table. The &lt;column weight&gt; is a weighting factor assigned to each</i>
<i>** column by the caller (see below).</i>
<i>**</i>
<i>** The first argument to this function must be the return value of the FTS </i>
<i>** matchinfo() function. Following this must be one argument for each column </i>
<i>** of the FTS table containing a numeric weight factor for the corresponding </i>
<i>** column. Example:</i>
<i>**</i>
<i>**     CREATE VIRTUAL TABLE documents USING fts3(title, content)</i>
<i>**</i>
<i>** The following query returns the docids of documents that match the full-text</i>
<i>** query &lt;query&gt; sorted from most to least relevant. When calculating</i>
<i>** relevance, query term instances in the 'title' column are given twice the</i>
1834
1835
1836
1837
1838
1839
1840
1841
1842
1843
1844
1845
1846
1847
1848
  int iPhrase;                    <i>/* Current phrase */</i>
  double score = 0.0;             <i>/* Value to return */</i>

  assert( sizeof(int)==4 );

<i>  /* Check that the number of arguments passed to this function is correct.</i>
<i>  ** If not, jump to wrong_number_args. Set aMatchinfo to point to the array</i>
<i>  ** of unsigned integer values returned by FTS3 function matchinfo. Set</i>
<i>  ** nPhrase to contain the number of reportable phrases in the users full-text</i>
<i>  ** query, and nCol to the number of columns in the table.</i>
<i>  */</i>
  if( nVal&lt;1 ) goto wrong_number_args;
  aMatchinfo = (unsigned int *)sqlite3_value_blob(apVal&#x5B;0&#x5D;);
  nPhrase = aMatchinfo&#x5B;0&#x5D;;
  nCol = aMatchinfo&#x5B;1&#x5D;;







|







1874
1875
1876
1877
1878
1879
1880
1881
1882
1883
1884
1885
1886
1887
1888
  int iPhrase;                    <i>/* Current phrase */</i>
  double score = 0.0;             <i>/* Value to return */</i>

  assert( sizeof(int)==4 );

<i>  /* Check that the number of arguments passed to this function is correct.</i>
<i>  ** If not, jump to wrong_number_args. Set aMatchinfo to point to the array</i>
<i>  ** of unsigned integer values returned by FTS function matchinfo. Set</i>
<i>  ** nPhrase to contain the number of reportable phrases in the users full-text</i>
<i>  ** query, and nCol to the number of columns in the table.</i>
<i>  */</i>
  if( nVal&lt;1 ) goto wrong_number_args;
  aMatchinfo = (unsigned int *)sqlite3_value_blob(apVal&#x5B;0&#x5D;);
  nPhrase = aMatchinfo&#x5B;0&#x5D;;
  nCol = aMatchinfo&#x5B;1&#x5D;;
Changes to pages/pragma.in.
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
}

Pragma writable_schema {
    <p>^(<b>PRAGMA writable_schema  = </b><i>boolean</i><b>;</b></p>

    <p>When this pragma is on, the SQLITE_MASTER tables in which database
    can be changed using ordinary [UPDATE], [INSERT], and [DELETE]
    statements.  ^Warning:  misuse of this pragma can easily result in
    a corrupt database file.</p>
}

Section {List Of PRAGMAs} {toc} {{pragma list}}
</tcl>
<table border=0 width="100%" cellpadding=10>
<tr><td valign="top" align="left"><ul>







|







1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
}

Pragma writable_schema {
    <p>^(<b>PRAGMA writable_schema  = </b><i>boolean</i><b>;</b></p>

    <p>When this pragma is on, the SQLITE_MASTER tables in which database
    can be changed using ordinary [UPDATE], [INSERT], and [DELETE]
    statements.)^  ^Warning:  misuse of this pragma can easily result in
    a corrupt database file.</p>
}

Section {List Of PRAGMAs} {toc} {{pragma list}}
</tcl>
<table border=0 width="100%" cellpadding=10>
<tr><td valign="top" align="left"><ul>
Changes to pages/vtab.in.
317
318
319
320
321
322
323





















324
325
326
327
328
329
330
The name of the table in this CREATE TABLE statement is ignored, 
as are all constraints. Only the column names and datatypes matter.
The CREATE TABLE statement string need not to be 
held in persistent memory.  The string can be
deallocated and/or reused as soon as the [sqlite3_declare_vtab()]
routine returns.






















<p>If a column datatype contains the special keyword "HIDDEN"
(in any combination of upper and lower case letters) then that keyword
it is omitted from the column datatype name and the column is marked 
as a hidden column internally. 
A hidden column differs from a normal column in three respects:

<p>







>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>







317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
The name of the table in this CREATE TABLE statement is ignored, 
as are all constraints. Only the column names and datatypes matter.
The CREATE TABLE statement string need not to be 
held in persistent memory.  The string can be
deallocated and/or reused as soon as the [sqlite3_declare_vtab()]
routine returns.

<p>The xCreate method need not initialize the pModule, nRef, and zErrMsg
fields of the [sqlite3_vtab] object.  The SQLite core will take care of 
that chore.

<p>The xCreate must should return [SQLITE_OK] if it is successful in 
creating the new virtual table, or [SQLITE_ERROR] if it is not successful.
If not successful, the [sqlite3_vtab] structure must not be allocated. 
An error message may optionally be returned in *pzErr if unsuccessful.
Space to hold the error message string must be allocated using
an SQLite memory allocation function like 
[sqlite3_malloc()] or [sqlite3_mprintf()] as the SQLite core will
attempt to free the space using [sqlite3_free()] after the error has
been reported up to the application.

<p>The xCreate method is required for every virtual table implementation, 
though the xCreate and [xConnect] pointers of the [sqlite3_module] object
may point to the same function the virtual table does not need to initialize
backing store.

<tcl>hd_fragment hiddencol {hidden column}</tcl>
<h4>2.1.1 Hidden columns in virtual tables</h4>
<p>If a column datatype contains the special keyword "HIDDEN"
(in any combination of upper and lower case letters) then that keyword
it is omitted from the column datatype name and the column is marked 
as a hidden column internally. 
A hidden column differs from a normal column in three respects:

<p>
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
<blockquote><pre>
   CREATE TABLE x(a HIDDEN VARCHAR(12), b INTEGER, c INTEGER Hidden);
</pre></blockquote>

<p>Then the virtual table would be created with two hidden columns,
and with datatypes of "VARCHAR(12)" and "INTEGER".

<p>The xCreate method need not initialize the pModule, nRef, and zErrMsg
fields of the [sqlite3_vtab] object.  The SQLite core will take care of 
that chore.

<p>The xCreate must should return [SQLITE_OK] if it is successful in 
creating the new virtual table, or [SQLITE_ERROR] if it is not successful.
If not successful, the [sqlite3_vtab] structure must not be allocated. 
An error message may optionally be returned in *pzErr if unsuccessful.
Space to hold the error message string must be allocated using
an SQLite memory allocation function like 
[sqlite3_malloc()] or [sqlite3_mprintf()] as the SQLite core will
attempt to free the space using [sqlite3_free()] after the error has
been reported up to the application.

<p>The xCreate method is required for every virtual table implementation, 
though the xCreate and [xConnect] pointers of the [sqlite3_module] object
may point to the same function the virtual table does not need to initialize
backing store.

<tcl>############################################################# xConnect
hd_fragment xconnect {sqlite3_module.xConnect} {xConnect}</tcl>
<h3>2.2 The xConnect Method</h3>

<blockquote><pre>
  int (*xConnect)(sqlite3*, void *pAux,







|
|
<
|
<
|
<
<
<
<
<
<
<

<
<
<
<







363
364
365
366
367
368
369
370
371

372

373







374




375
376
377
378
379
380
381
<blockquote><pre>
   CREATE TABLE x(a HIDDEN VARCHAR(12), b INTEGER, c INTEGER Hidden);
</pre></blockquote>

<p>Then the virtual table would be created with two hidden columns,
and with datatypes of "VARCHAR(12)" and "INTEGER".

<p>An example use of hidden columns can be seen in the [FTS3] virtual 
table implementation, where every FTS virtual table

contains an [FTS hidden column] that is used to pass information from the

virtual table into [FTS auxiliary functions] and to the [FTS MATCH] operator.













<tcl>############################################################# xConnect
hd_fragment xconnect {sqlite3_module.xConnect} {xConnect}</tcl>
<h3>2.2 The xConnect Method</h3>

<blockquote><pre>
  int (*xConnect)(sqlite3*, void *pAux,