Documentation Source Text

Check-in [79a7fb1e3f]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Fix typos.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 79a7fb1e3fff0f67b2bf765a264c454fd57ae421
User & Date: drh 2015-07-16 00:01:14
Context
2015-07-16
03:22
Add Sibsankar's second book to the books.html page. check-in: cf7fe0f52c user: drh tags: trunk
00:01
Fix typos. check-in: 79a7fb1e3f user: drh tags: trunk
2015-07-14
20:48
Clarify and correct details of the OTA documentation. Make it clear that the FTS5 extension is still experimental and subject to file format changes. check-in: 709d5967ce user: drh tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/fts5.in.

10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
...
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
....
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
....
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
....
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
....
1515
1516
1517
1518
1519
1520
1521
1522
1523

<p>FTS5 is an SQLite [virtual table module] that provides 
<a href=http://en.wikipedia.org/wiki/Full_text_search>full-text search</a>
functionality to database applications. In their most elementary form, 
full-text search engines allow the user to efficiently search a large 
collection of documents for the subset that contain one or more instances of a
search term. The search functionality provided to world wide web users by
<a href=www.google.com>Google</a> is, amongst other things, a full-text search
engine, as it allows users to search for all documents on the web that contain,
for example, the term "fts5".

<p>To use FTS5, the user creates an FTS5 virtual table with one or more
columns. For example:

<codeblock>
................................................................................
  ... lots of output ...
  $ ls fts5.& 91;ch]
  fts5.c        fts5.h
</codeblock>

<p>
  The code in "fts5.c" may then be compiled into a loadable extension or
  staticly linked into an application as described in 
  [Compiling Loadable Extensions]. There are two entry points defined, both
  of which do the same thing:

<ul>
  <li> sqlite3_fts_init
  <li> sqlite3_fts5_init
</ul>
................................................................................
containing the pointer to the fts5_api structure for the connection. The
following example code demonstrates the technique:

<codeblock>
  <i>/*
  ** Return a pointer to the fts5_api pointer for database connection db.
  ** If an error occurs, return NULL and leave an error in the database 
  ** handle (accessible using sqlite3_errcode()/errmsg().
  */</i>
  fts5_api *fts5_api_from_db(sqlite3 *db){
    fts5_api *pRet = 0;
    sqlite3_stmt *pStmt = 0;

    if( SQLITE_OK==sqlite3_prepare(db, "SELECT fts5()", -1, &pStmt, 0)
     && SQLITE_ROW==sqlite3_step(pStmt) 
................................................................................
  set ::extract_api_docs_mode tokenizer_api
  catch { set res [source [file join $::SRC ext/fts5/extract_api_docs.tcl]] }
  set res
</tclscript>

<h2 tags="FTS5 custom auxiliary functions">Custom Auxiliary Functions</h2>

<p> Implementing a custom auxiliary function is similar to implementing an
[application-defined SQL function | scalar SQL function]. The implementation
should be a C function of type fts5_extension_function, defined as follows:

<codeblock>
<tclscript>
  set res ""
  set ::extract_api_docs_mode fts5_extension
................................................................................
index, FTS5 is able to provide timely answers to queries such as "the set
of all documents that contain the token 'A'", or "the set of all documents
that contain the sequence 'Y Z'". The list of instances associated with a
single token is called an "instance-list".

<p>The principle difference between FTS3/4 and FTS5 is that in FTS3/4,
each instance-list is stored as a single large database record, whereas
in FTS5 large instance-lists are divided betwen multiple database records.
This has the following implications for dealing with large databases that
contain large lists:

<ul>
  <li> <p>FTS5 is able to load instance-lists into memory incrementally in
       order to reduce memory usage and peak allocation size. FTS3/4 very
       often loads entire instance-lists into memory.
................................................................................
       DELETE statement executed by the user. This means that any operation
       on an FTS3/4 table may turn out to be surprisingly slow, as FTS3/4 
       may unpredictably choose to merge together two or more large b-trees
       within it. FTS5 uses incremental merging by default, which limits
       the amount of processing that may take place within any given 
       INSERT, UPDATE or DELETE operation.
</ul>









|







 







|







 







|







 







|







 







|







 







<
<
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
...
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
....
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
....
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
....
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
....
1515
1516
1517
1518
1519
1520
1521



<p>FTS5 is an SQLite [virtual table module] that provides 
<a href=http://en.wikipedia.org/wiki/Full_text_search>full-text search</a>
functionality to database applications. In their most elementary form, 
full-text search engines allow the user to efficiently search a large 
collection of documents for the subset that contain one or more instances of a
search term. The search functionality provided to world wide web users by
<a href=www.google.com>Google</a> is, among other things, a full-text search
engine, as it allows users to search for all documents on the web that contain,
for example, the term "fts5".

<p>To use FTS5, the user creates an FTS5 virtual table with one or more
columns. For example:

<codeblock>
................................................................................
  ... lots of output ...
  $ ls fts5.& 91;ch]
  fts5.c        fts5.h
</codeblock>

<p>
  The code in "fts5.c" may then be compiled into a loadable extension or
  statically linked into an application as described in 
  [Compiling Loadable Extensions]. There are two entry points defined, both
  of which do the same thing:

<ul>
  <li> sqlite3_fts_init
  <li> sqlite3_fts5_init
</ul>
................................................................................
containing the pointer to the fts5_api structure for the connection. The
following example code demonstrates the technique:

<codeblock>
  <i>/*
  ** Return a pointer to the fts5_api pointer for database connection db.
  ** If an error occurs, return NULL and leave an error in the database 
  ** handle (accessible using sqlite3_errcode()/errmsg()).
  */</i>
  fts5_api *fts5_api_from_db(sqlite3 *db){
    fts5_api *pRet = 0;
    sqlite3_stmt *pStmt = 0;

    if( SQLITE_OK==sqlite3_prepare(db, "SELECT fts5()", -1, &pStmt, 0)
     && SQLITE_ROW==sqlite3_step(pStmt) 
................................................................................
  set ::extract_api_docs_mode tokenizer_api
  catch { set res [source [file join $::SRC ext/fts5/extract_api_docs.tcl]] }
  set res
</tclscript>

<h2 tags="FTS5 custom auxiliary functions">Custom Auxiliary Functions</h2>

<p> Implementing a custom auxiliary function is similar to implementing a
[application-defined SQL function | scalar SQL function]. The implementation
should be a C function of type fts5_extension_function, defined as follows:

<codeblock>
<tclscript>
  set res ""
  set ::extract_api_docs_mode fts5_extension
................................................................................
index, FTS5 is able to provide timely answers to queries such as "the set
of all documents that contain the token 'A'", or "the set of all documents
that contain the sequence 'Y Z'". The list of instances associated with a
single token is called an "instance-list".

<p>The principle difference between FTS3/4 and FTS5 is that in FTS3/4,
each instance-list is stored as a single large database record, whereas
in FTS5 large instance-lists are divided between multiple database records.
This has the following implications for dealing with large databases that
contain large lists:

<ul>
  <li> <p>FTS5 is able to load instance-lists into memory incrementally in
       order to reduce memory usage and peak allocation size. FTS3/4 very
       often loads entire instance-lists into memory.
................................................................................
       DELETE statement executed by the user. This means that any operation
       on an FTS3/4 table may turn out to be surprisingly slow, as FTS3/4 
       may unpredictably choose to merge together two or more large b-trees
       within it. FTS5 uses incremental merging by default, which limits
       the amount of processing that may take place within any given 
       INSERT, UPDATE or DELETE operation.
</ul>


Changes to pages/ota.in.

76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
       database to prevent this.</p></li>
</ul>


<h2>Preparing An OTA Update File</h2>

<p>All changes to be applied by OTA are stored in a separate SQLite database
called the "OTA database".  The database that is to be modifed is called
the "target database".

<p>
For each table in the target database, the OTA database should contain a table
named "data_&lt;<i>target-table-name</i>&gt;" with 
the all the same columns as the
target table, plus one additional column named "ota_control".







|







76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
       database to prevent this.</p></li>
</ul>


<h2>Preparing An OTA Update File</h2>

<p>All changes to be applied by OTA are stored in a separate SQLite database
called the "OTA database".  The database that is to be modified is called
the "target database".

<p>
For each table in the target database, the OTA database should contain a table
named "data_&lt;<i>target-table-name</i>&gt;" with 
the all the same columns as the
target table, plus one additional column named "ota_control".

Changes to pages/spellfix1.in.

102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
...
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
...
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
...
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
...
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
adding a "top=N" term to the WHERE clause of your query, where N
is the new maximum.  For example, to see the 5 best matches:

<blockquote><pre>
SELECT word FROM demo WHERE word MATCH 'kennes*' AND top=5;
</pre></blockquote>

<p>Each entry in the spellfix1 virtual table is associated with a
a particular language, identified by the integer "langid" column.
The default langid is 0 and if no other actions are taken, the
entire vocabulary is a part of the 0 language.  But if your application
needs to operate in multiple languages, then you can specify different
vocabulary items for each language by specifying the langid field
when populating the table.  For example:

................................................................................
<dt><p><b>rowid</b><dd>
A unique integer number associated with each
vocabulary item in the table.  This can be used
as a foreign key on other tables in the database.

<dt><p><b>word</b><dd>
The text of the word that matches the pattern.
Both word and pattern can contains unicode characters
and can be mixed case.

<dt><p><b>rank</b><dd>
This is the rank of the word, as specified in the
original INSERT statement.


<dt><p><b>distance</b><dd>
This is an edit distance or Levensthein distance going
from the pattern to the word.

<dt><p><b>langid</b><dd>
This is the language-id of the word.  All queries are
against a single language-id, which defaults to 0.
For any given query this value is the same on all rows.

................................................................................
"scope=N" in the WHERE clause of the query.  Increasing
the scope will make the query run faster, but will reduce
the possible corrections.

<dt><p><b>srchcnt</b><dd>
(HIDDEN)  For any query, this value is the same on all
rows.  This value is an integer which is the number of
of words examined using the edit-distance algorithm to
find the top matches that are ultimately displayed.  This
value is for diagnostic use only.

<dt><p><b>soundslike</b><dd>
(HIDDEN)  When inserting vocabulary entries, this field
can be set to a spelling that matches what the word
sounds like.  See the DEALING WITH UNUSUAL AND DIFFICULT
................................................................................
that are further apart.  In this implementation, the maximum cost
of any single-character edit (delete, insert, or substitute) is 100,
with lower costs for some edits (such as transforming vowels).

<p>The "score" for a comparison is the edit distance between the pattern
and the word, adjusted down by the base-2 logarithm of the word rank.
For example, a match with distance 100 but rank 1000 would have a
score of 122 (= 100 - log2(1000) + 32) where as a match with distance
100 with a rank of 1 would have a score of 131 (100 - log2(1) + 32).
(NB:  The constant 32 is added to each score to keep it from going
negative in case the edit distance is zero.)  In this way, frequently
used words get a slightly lower cost which tends to move them toward
the top of the list of alternative spellings.

<p>A straightforward implementation of a spelling corrector would be
................................................................................
the edit distance coefficients.  It is the presence of the "edit_cost_table="
parameter to the spellfix1 module name that causes editdist3() to be used
in place of the built-in edit distance function.  If APPCOST is an empty
string, then the built-in Wagner edit-distance function is used.

<p>The edit distance coefficients are normally read from the APPCOST table
once and there after stored in memory.  Hence, run-time changes to the
APPCOST table will not normally effect the edit distance results.
However, inserting the special string 'reset' into the "command" column of the
virtual table causes the edit distance coefficients to be reread the
APPCOST table.  Hence, applications should run a SQL statement similar
to the following when changes to the APPCOST table occur:

<blockquote>
INSERT INTO demo2(command) VALUES("reset");







|







 







|








|







 







|







 







|







 







|







102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
...
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
...
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
...
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
...
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
adding a "top=N" term to the WHERE clause of your query, where N
is the new maximum.  For example, to see the 5 best matches:

<blockquote><pre>
SELECT word FROM demo WHERE word MATCH 'kennes*' AND top=5;
</pre></blockquote>

<p>Each entry in the spellfix1 virtual table is associated with
a particular language, identified by the integer "langid" column.
The default langid is 0 and if no other actions are taken, the
entire vocabulary is a part of the 0 language.  But if your application
needs to operate in multiple languages, then you can specify different
vocabulary items for each language by specifying the langid field
when populating the table.  For example:

................................................................................
<dt><p><b>rowid</b><dd>
A unique integer number associated with each
vocabulary item in the table.  This can be used
as a foreign key on other tables in the database.

<dt><p><b>word</b><dd>
The text of the word that matches the pattern.
Both word and pattern can contain unicode characters
and can be mixed case.

<dt><p><b>rank</b><dd>
This is the rank of the word, as specified in the
original INSERT statement.


<dt><p><b>distance</b><dd>
This is an edit distance or Levenshtein distance going
from the pattern to the word.

<dt><p><b>langid</b><dd>
This is the language-id of the word.  All queries are
against a single language-id, which defaults to 0.
For any given query this value is the same on all rows.

................................................................................
"scope=N" in the WHERE clause of the query.  Increasing
the scope will make the query run faster, but will reduce
the possible corrections.

<dt><p><b>srchcnt</b><dd>
(HIDDEN)  For any query, this value is the same on all
rows.  This value is an integer which is the number of
words examined using the edit-distance algorithm to
find the top matches that are ultimately displayed.  This
value is for diagnostic use only.

<dt><p><b>soundslike</b><dd>
(HIDDEN)  When inserting vocabulary entries, this field
can be set to a spelling that matches what the word
sounds like.  See the DEALING WITH UNUSUAL AND DIFFICULT
................................................................................
that are further apart.  In this implementation, the maximum cost
of any single-character edit (delete, insert, or substitute) is 100,
with lower costs for some edits (such as transforming vowels).

<p>The "score" for a comparison is the edit distance between the pattern
and the word, adjusted down by the base-2 logarithm of the word rank.
For example, a match with distance 100 but rank 1000 would have a
score of 122 (= 100 - log2(1000) + 32) whereas a match with distance
100 with a rank of 1 would have a score of 131 (100 - log2(1) + 32).
(NB:  The constant 32 is added to each score to keep it from going
negative in case the edit distance is zero.)  In this way, frequently
used words get a slightly lower cost which tends to move them toward
the top of the list of alternative spellings.

<p>A straightforward implementation of a spelling corrector would be
................................................................................
the edit distance coefficients.  It is the presence of the "edit_cost_table="
parameter to the spellfix1 module name that causes editdist3() to be used
in place of the built-in edit distance function.  If APPCOST is an empty
string, then the built-in Wagner edit-distance function is used.

<p>The edit distance coefficients are normally read from the APPCOST table
once and there after stored in memory.  Hence, run-time changes to the
APPCOST table will not normally affect the edit distance results.
However, inserting the special string 'reset' into the "command" column of the
virtual table causes the edit distance coefficients to be reread the
APPCOST table.  Hence, applications should run a SQL statement similar
to the following when changes to the APPCOST table occur:

<blockquote>
INSERT INTO demo2(command) VALUES("reset");