Documentation Source Text

Check-in [d91a4c7249]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Add text to fts3.html describing the snippet and offsets functions.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: d91a4c72493186fafeee787ef7e14ca3fe4c03d4
User & Date: dan 2009-11-30 08:41:45
Context
2009-11-30
11:52
Fix some problems with fts3.html. check-in: 0b949f6763 user: dan tags: trunk
08:41
Add text to fts3.html describing the snippet and offsets functions. check-in: d91a4c7249 user: dan tags: trunk
2009-11-28
19:17
Add a description of tokenizers to fts3.in. check-in: f1940858bf user: dan tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to pages/fts3.in.

106
107
108
109
110
111
112




113
114
115
116
117
118
119
...
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
...
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
...
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
...
400
401
402
403
404
405
406











407
408
409
410
411
412
413
...
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
...
685
686
687
688
689
690
691
692


693
694
695
696
697
698
699
700
...
709
710
711
712
713
714
715
716






















































































































































717
718
719
720
721
722
723
  <i>-- Create an FTS3 table named "papers" with two columns that uses</i>
  <i>-- the tokenizer "porter".</i>
  CREATE VIRTUAL TABLE papers USING fts3(author, document, tokenize=porter);

  <i>-- Create an FTS3 table with a single column - "content" - that uses</i>
  <i>-- the "simple" tokenizer.</i>
  CREATE VIRTUAL TABLE data USING fts3(tokenize=simple);




}]

<p>
  FTS3 tables may be dropped from the database using an ordinary \[DROP TABLE\]
  statement. For example:

[Code {
................................................................................
[Code {
  <i>-- Optimize the internal structure of FTS3 table "docs".</i>
  SELECT optimize(docs) FROM docs LIMIT 1;
}]

  <p>
    The statement above may appear syntacticly incorrect to some. Refer to
    the section describing the \[snippets()|FTS3 auxillary functions\] for
    a description of why it is not.

  <p>
    The optimize() function returns a text value. If the index was already
    optimized when it was called the text is "Index already optimal". 
    Otherwise if the index was not already optimized, it is made so and
    the text "Index optimized" returned.

................................................................................
  At first glance, the final two full-text queries in the example above seem
  to be syntacticly incorrect, as there is a table name ("mail") used as
  an SQL expression. The reason this is acceptable is that each FTS3 table
  actually has a \[sqlite3_declare_vtab|HIDDEN\] column with the same name
  as the table itself (in this case, "mail"). The value stored in this
  column is not meaningful to the application, but can be used as the 
  left-hand operand to a MATCH operator. This special column may also be
  passed as an argument to the \[snippets()|FTS3 auxillary functions\].

<p>
  The following example illustrates the above. The expressions "docs", 
  "docs.docs" and "main.docs.docs" all refer to column "docs". However, the 
  expression "main.docs" does not refer to any column. It could be used to 
  refer to a table, but a table name is not allowed in the context in which
  it is used below.
................................................................................
    rowid column supported by all \[virtual tables\].

  <li><p> 
    The \[FTS3 MATCH\] operator is supported for queries based on the built-in
    full-text index. 

  <li><p> 
    The FTS3 auxillary functions, \[snippets|snippets() and offsets()\], are 
    available to support full-text queries.

  <li><p> 
    Each FTS3 table has a \[sqlite3_declare_vtab()|HIDDEN column\] with the 
    same name as the table itself. The value contained in each row for the
    special column is only useful when used on the left-hand side of a 
    \[FTS3 MATCH|MATCH\] operator, or when specified as an argument to one 
    of the \[snippets|FTS3 auxillary functions\].
    

</ol>


[h1 "Compiling and Enabling FTS3" {} {compile fts3}]

<p>
  Although FTS3 is distributed as part of the SQLite source code, it is not
  enabled by default. To build SQLite with FTS3 functionality enabled, define
  the preprocessor macro \[SQLITE_ENABLE_FTS3\] when compiling. New applications
  should also define the \[SQLITE_ENABLE_FTS3_PARENTHESIS\] macro to enable the
  advanced query syntax (see below). Usually, this is done by adding the 
  following two switches to the compiler command line:

<p class=todo>
  Mention the ICU Tokenizer.

[Code {
  -DSQLITE_ENABLE_FTS3
  -DSQLITE_ENABLE_FTS3_PARENTHESIS
}]

<p>
  If using the amalgamation autoconf based build system, setting the CPPFLAGS
................................................................................
  \[SQLITE_OMIT_VIRTUALTABLE\] option.

<p>
  If an SQLite build does not include FTS3, then any attempt to prepare an
  SQL statement to create an FTS3 table or to drop or access an existing 
  FTS3 table in any way will fail. The error message returned will be similar 
  to "no such module: fts3".











  

[h1 "Full-text Index Queries" {} {FTS3 MATCH}]

<p>
  The most useful thing about FTS3 tables is the queries that may be 
  performed using the built-in full-text index. Full-text queries are 
................................................................................
  of the full-text query syntax, the "standard" query syntax and the "enhanced" 
  query syntax. The basic term, term-prefix, phrase and NEAR queries described 
  above are the same in both versions of the syntax. The way in which set 
  operations are specified is slightly different. The following two sub-sections 
  describe the part of the two query syntaxes that pertains to set operations. 
  Refer to the description of how to \[compile fts3\] for compilation notes.

[h2 "Set Operations Using The Enhanced Query Syntax"]

<p>
  The enhanced query syntax supports the AND, OR and NOT binary set operators.
  Each of the two operands to an operator may be a basic FTS3 query, or the
  result of another AND, OR or NOT set operation. Operators must be entered
  using capital letters. Otherwise, they are interpreted as basic term queries
  instead of set operators.
................................................................................
[Code {
  <i>-- Search for the set of documents that contain the term "sqlite" but do</i>
  <i>-- not contain the term "database".</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite -database';
}]

<ol>
  <li value=4><p> The relative precedence of the set operations is different. The


   precedence of operators when using the standard query syntax is:
</ol>

[Table]
  [Tr]<th>Operator<th>Standard Query Syntax Precedence
  [Tr]<td>Unary "-" <td> Highest precedence (tightest grouping).
  [Tr]<td>OR  <td>
  [Tr]<td>AND <td> Lowest precedence (loosest grouping).
................................................................................
  <i>-- Search for documents that contains at least one of the terms "database"</i>
  <i>-- and "sqlite", and also contains the term "library". Because of the differences</i>
  <i>-- in operator precedences, this query would have a different interpretation using</i>
  <i>-- the enhanced query syntax.</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite OR database library';
}]

[h1 "Auxillary functions - Snippets and Offsets" {} snippets offsets]























































































































































[h1 "Tokenizers" tokenizer {tokenizer}]

<p>
  An FTS3 tokenizer is a set of rules for extracting terms from a document 
  or basic FTS3 full-text query. 








>
>
>
>







 







|
<







 







|







 







|







|












|


<
<
<







 







>
>
>
>
>
>
>
>
>
>
>







 







|







 







|
>
>
|







 







|
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>







106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
...
189
190
191
192
193
194
195
196

197
198
199
200
201
202
203
...
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
...
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379



380
381
382
383
384
385
386
...
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
...
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
...
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
...
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
  <i>-- Create an FTS3 table named "papers" with two columns that uses</i>
  <i>-- the tokenizer "porter".</i>
  CREATE VIRTUAL TABLE papers USING fts3(author, document, tokenize=porter);

  <i>-- Create an FTS3 table with a single column - "content" - that uses</i>
  <i>-- the "simple" tokenizer.</i>
  CREATE VIRTUAL TABLE data USING fts3(tokenize=simple);

  <i>-- Create an FTS3 table with two columns that uses the "icu" tokenizer.</i>
  <i>-- The qualifier "en_AU" is passed to the tokenizer implementation</i>
  CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU);
}]

<p>
  FTS3 tables may be dropped from the database using an ordinary \[DROP TABLE\]
  statement. For example:

[Code {
................................................................................
[Code {
  <i>-- Optimize the internal structure of FTS3 table "docs".</i>
  SELECT optimize(docs) FROM docs LIMIT 1;
}]

  <p>
    The statement above may appear syntacticly incorrect to some. Refer to
    the section describing the \[simple fts3 queries\] for an explanation.


  <p>
    The optimize() function returns a text value. If the index was already
    optimized when it was called the text is "Index already optimal". 
    Otherwise if the index was not already optimized, it is made so and
    the text "Index optimized" returned.

................................................................................
  At first glance, the final two full-text queries in the example above seem
  to be syntacticly incorrect, as there is a table name ("mail") used as
  an SQL expression. The reason this is acceptable is that each FTS3 table
  actually has a \[sqlite3_declare_vtab|HIDDEN\] column with the same name
  as the table itself (in this case, "mail"). The value stored in this
  column is not meaningful to the application, but can be used as the 
  left-hand operand to a MATCH operator. This special column may also be
  passed as an argument to the \[snippet()|FTS3 auxillary functions\].

<p>
  The following example illustrates the above. The expressions "docs", 
  "docs.docs" and "main.docs.docs" all refer to column "docs". However, the 
  expression "main.docs" does not refer to any column. It could be used to 
  refer to a table, but a table name is not allowed in the context in which
  it is used below.
................................................................................
    rowid column supported by all \[virtual tables\].

  <li><p> 
    The \[FTS3 MATCH\] operator is supported for queries based on the built-in
    full-text index. 

  <li><p> 
    The FTS3 auxillary functions, \[snippet|snippet() and offsets()\], are 
    available to support full-text queries.

  <li><p> 
    Each FTS3 table has a \[sqlite3_declare_vtab()|HIDDEN column\] with the 
    same name as the table itself. The value contained in each row for the
    special column is only useful when used on the left-hand side of a 
    \[FTS3 MATCH|MATCH\] operator, or when specified as an argument to one 
    of the \[snippet|FTS3 auxillary functions\].
    

</ol>


[h1 "Compiling and Enabling FTS3" {} {compile fts3}]

<p>
  Although FTS3 is distributed as part of the SQLite source code, it is not
  enabled by default. To build SQLite with FTS3 functionality enabled, define
  the preprocessor macro \[SQLITE_ENABLE_FTS3\] when compiling. New applications
  should also define the \[SQLITE_ENABLE_FTS3_PARENTHESIS\] macro to enable the
  \[enhanced query syntax\] (see below). Usually, this is done by adding the 
  following two switches to the compiler command line:




[Code {
  -DSQLITE_ENABLE_FTS3
  -DSQLITE_ENABLE_FTS3_PARENTHESIS
}]

<p>
  If using the amalgamation autoconf based build system, setting the CPPFLAGS
................................................................................
  \[SQLITE_OMIT_VIRTUALTABLE\] option.

<p>
  If an SQLite build does not include FTS3, then any attempt to prepare an
  SQL statement to create an FTS3 table or to drop or access an existing 
  FTS3 table in any way will fail. The error message returned will be similar 
  to "no such module: fts3".

<p>
  If the C version of the <a href=http://site.icu-project.org/>ICU library</a>
  is available, then FTS3 may also be compiled with the SQLITE_ENABLE_ICU
  pre-processor macro defined. Compiling with this macro enables an FTS3
  \[tokenizer\] that uses the ICU library to split a document into terms
  (words) using the conventions for a specified language and locale.

[Code {
  -DSQLITE_ENABLE_ICU
}]
  

[h1 "Full-text Index Queries" {} {FTS3 MATCH}]

<p>
  The most useful thing about FTS3 tables is the queries that may be 
  performed using the built-in full-text index. Full-text queries are 
................................................................................
  of the full-text query syntax, the "standard" query syntax and the "enhanced" 
  query syntax. The basic term, term-prefix, phrase and NEAR queries described 
  above are the same in both versions of the syntax. The way in which set 
  operations are specified is slightly different. The following two sub-sections 
  describe the part of the two query syntaxes that pertains to set operations. 
  Refer to the description of how to \[compile fts3\] for compilation notes.

[h2 "Set Operations Using The Enhanced Query Syntax" {} {enhanced query syntax}]

<p>
  The enhanced query syntax supports the AND, OR and NOT binary set operators.
  Each of the two operands to an operator may be a basic FTS3 query, or the
  result of another AND, OR or NOT set operation. Operators must be entered
  using capital letters. Otherwise, they are interpreted as basic term queries
  instead of set operators.
................................................................................
[Code {
  <i>-- Search for the set of documents that contain the term "sqlite" but do</i>
  <i>-- not contain the term "database".</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite -database';
}]

<ol>
  <li value=4><p> The relative precedence of the set operations is different. 
   In particular, using the standard query syntax the "OR" operator has a
   higher precedence than "AND". The precedence of operators when using the
   standard query syntax is: 
</ol>

[Table]
  [Tr]<th>Operator<th>Standard Query Syntax Precedence
  [Tr]<td>Unary "-" <td> Highest precedence (tightest grouping).
  [Tr]<td>OR  <td>
  [Tr]<td>AND <td> Lowest precedence (loosest grouping).
................................................................................
  <i>-- Search for documents that contains at least one of the terms "database"</i>
  <i>-- and "sqlite", and also contains the term "library". Because of the differences</i>
  <i>-- in operator precedences, this query would have a different interpretation using</i>
  <i>-- the enhanced query syntax.</i>
  SELECT * FROM docs WHERE docs MATCH 'sqlite OR database library';
}]

[h1 "Auxillary functions - Snippets and Offsets" {} snippet offsets]

<p>
  The FTS3 module provides two special SQL scalar functions that may be useful
  to the developers of full-text query systems, "snippet" and "offsets". The
  purpose of both functions is to allow the user to identify the location of
  queried terms in the returned documents.

<p>
  The first argument to both the snippet and offsets SQL scalar functions
  must be the the special hidden column of an FTS3 table that has the same
  name as the table (see above). For example, given an FTS3 table named 
  "mail":

[Code {
  SELECT offsets(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;
  SELECT snippet(mail) FROM mail WHERE mail MATCH &lt;full-text query expression&gt;;
}]

<p>
  The two auxillary functions are only useful within a SELECT statement that
  uses the FTS3 table's full-text index. If used within a SELECT that uses
  the "query by rowid" or "linear scan" strategies, both functions return 
  an empty string.

<p>
  For a SELECT query that uses the full-text index, the offsets() function 
  returns a text value containing a series of space-separated integers. For
  each occurence of a queried term in the document, there are four integers
  in the returned list. Each set of four integers is interpreted as 
  follows:

[Table]
  [Tr]<th>Integer <th>Interpretation
  [Tr]<td>0 
      <td>The column number that the term instance occurs in (0 for the
          leftmost column of the FTS3 table, 1 for the next leftmost, etc.).
  [Tr]<td>1
      <td>The term number of the matching term within the full-text query
          expression. Terms within a query expression are numbered starting
          from 0 in the order that they occur.
  [Tr]<td>2
      <td>The byte offset of the matching term within the column.
  [Tr]<td>3
      <td>The size of the matching term in bytes.
</table>

<p>
  The following block contains examples that use the offsets function.

[Code {
  CREATE VIRTUAL TABLE mail USING fts3(subject, body);
  INSERT INTO mail VALUES('hello world', 'This message is a hello world message.');
  INSERT INTO mail VALUES('urgent: serious', 'This mail is seen as a more serious mail');

  <i>-- The following query returns a single row (as it matches only the first</i>
  <i>-- entry in table "mail". The text returned by the offsets function is</i>
  <i>-- "0 0 6 5 1 0 24 5".</i>
  <i>--</i>
  <i>-- The first set of four integers in the result indicate that column 0</i>
  <i>-- contains an instance of term 0 ("world") at byte offset 6. The term instance</i>
  <i>-- is 5 bytes in size. The second set of four integers shows that column 1</i>
  <i>-- of the matched row contains an instance of term 0 ("world") at byte offset</i>
  <i>-- 24. Again, the term instance is 5 bytes in size.</i>
  SELECT offsets(mail) FROM mail WHERE mail MATCH 'world';

  <i>-- The following query returns also matches only the first row in table "mail".</i>
  <i>-- In this case the returned text is "1 0 5 7 1 0 30 7".</i>
  SELECT offsets(mail) FROM mail WHERE mail MATCH 'message';

  <i>-- The following query matches the second row in table "mail". It returns the</i>
  <i>-- text "1 0 28 7 1 1 36 4". Only those occurences of terms "serious" and "mail"</i>
  <i>-- that are part of an instance of the phrase "serious mail" are identified; the</i>
  <i>-- other occurences of "serious" and "mail" are ignored.</i>
  SELECT offsets(mail) FROM mail WHERE mail MATCH '"serious mail"';
}]

<p>
  The snippet function is used to create formatted fragments of document text
  for display as part of a full-text query results report. The snippet function 
  may be passed between one and four arguments, as follows:

[Table]
  [Tr]<th>Argument <th>Default Value <th>Description
  [Tr]<td>0 <td>N/A
      <td> The first argument to the snippet function must always be the special
           hidden column of the FTS3 table that takes the same name as the table
           itself.
  [Tr]<td>1 <td>"&lt;b&gt;"
      <td> The "start match" text.
  [Tr]<td>2 <td>"&lt;b&gt;"
      <td> The "end match" text.
  [Tr]<td>3 <td>"&lt;b&gt;...&lt;/b&gt;"
      <td> The "ellipses" text.
</table>

<p>
  The snippet function returns a fragment of text from the original document 
  surrounding the term identified by the first four integers that would be
  returned by the offsets function if it were used in a similar context. In
  most cases, the selected fragment contains 40 or less bytes of text before
  the identified term, and 40 or more bytes of text following the identified
  term. Slightly less than 40 bytes of preceding or following text is provided
  so that the fragment does not contain any partial terms. If the first term 
  (that would be) identified by the offsets function is less than 40 bytes 
  from the beginning or end of the document, then extra text may appear before
  or after the identified term within the fragment to make up the difference.

<p>
  If the returned fragment of text does not start at the start of the entire
  document, then the "ellipses" text (see table above) is prepended to the
  fragment before it is returned. Similarly, if the end of the returned fragment
  is not also the end of the entire document, the "ellipses" text is appended
  to it before it is returned.

<p>
  Before it is returned, the "start match" text is inserted into the fragment
  immediately before any terms within the fragment that would have been
  identified by the offsets function (not just the first one) were it invoked
  in the same context. The "end match" is inserted immediately following all
  such terms.

[Code {
  <b>Note: In this block of examples, newlines and whitespace characters have
  been inserted into the document inserted into the FTS3 table, and the expected
  results described in SQL comments. This is done to enhance readability only,
  they would not be present in actual SQLite commands or output.</b>

  <i>-- Create and populate an FTS3 table.</i>
  CREATE VIRTUAL TABLE text USING fts3();
  INSERT INTO text VALUES('
    During 30 Nov-1 Dec, 2-3oC drops. Cool in the upper portion, minimum temperature 14-16oC 
    and cool elsewhere, minimum temperature 17-20oC. Cold to very cold on mountaintops, 
    minimum temperature 6-12oC. Northeasterly winds 15-30 km/hr. After that, temperature 
    increases. Northeasterly winds 15-30 km/hr.     
  ');

  <i>-- The following query returns the text value:</i>
  <i>--</i>
  <i>--   "&lt;b&gt;...&lt;/b&gt; elsewhere, minimum temperature 17-20oC. &lt;b&gt;Cold&lt;/b&gt; to very &lt;b&gt;cold&lt;/b&gt; on</i>
  <i>--    mountaintops, minimum &lt;b&gt;...&lt;/b&gt;".</i>
  <i>--</i>
  SELECT snippet(text) FROM text WHERE text MATCH 'cold';

  <i>-- The following query returns the text value:</i>
  <i>--</i>
  <i>--   "... 2-3oC drops. Cool in the upper portion, &#91;minimum&#93; &#91;temperature&#93; 14-16oC and cool</i>
  <i>--    elsewhere, &#91;minimum&#93; ..."</i>
  <i>--</i>
  SELECT snippet(text, '&#91; '&#93;', '...') FROM text WHERE text MATCH '"min* tem*"'
}]

[h1 "Tokenizers" tokenizer {tokenizer}]

<p>
  An FTS3 tokenizer is a set of rules for extracting terms from a document 
  or basic FTS3 full-text query.