/ Check-in [e424a030]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Update comments in fts3.c to more accurately describe the doclist format.
Downloads: Tarball | ZIP archive | SQL archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: e424a0307359fee6875424c10ecad1a10acfba0e
User & Date: drh 2010-01-08 23:01:33
Context
2010-01-09
07:33
Fix handling of an OOM error in the fts3 offsets() function. Fix a couple of snippet related test cases in e_fts3.test. check-in: 14dc46a7 user: dan tags: trunk
2010-01-08
23:01
Update comments in fts3.c to more accurately describe the doclist format. check-in: e424a030 user: drh tags: trunk
04:50
Added option to restore_jrnl.tcl utility to hex dump journal pages. check-in: 08c545f0 user: shaneh tags: trunk
Changes
Hide Diffs Side-by-Side Diffs Ignore Whitespace Patch

Changes to ext/fts3/fts3.c.

    19     19   **     * The FTS3 module is being built as an extension
    20     20   **       (in which case SQLITE_CORE is not defined), or
    21     21   **
    22     22   **     * The FTS3 module is being built into the core of
    23     23   **       SQLite (in which case SQLITE_ENABLE_FTS3 is defined).
    24     24   */
    25     25   
    26         -/* TODO(shess) Consider exporting this comment to an HTML file or the
    27         -** wiki.
    28         -*/
    29     26   /* The full-text index is stored in a series of b+tree (-like)
    30     27   ** structures called segments which map terms to doclists.  The
    31     28   ** structures are like b+trees in layout, but are constructed from the
    32     29   ** bottom up in optimal fashion and are not updatable.  Since trees
    33     30   ** are built from the bottom up, things will be described from the
    34     31   ** bottom up.
    35     32   **
................................................................................
    44     41   **         B = 1xxxxxxx    7 bits of data and one flag bit
    45     42   **
    46     43   **  7 bits - A
    47     44   ** 14 bits - BA
    48     45   ** 21 bits - BBA
    49     46   ** and so on.
    50     47   **
    51         -** This is identical to how sqlite encodes varints (see util.c).
           48  +** This is similar in concept to how sqlite encodes "varints" but
           49  +** the encoding is not the same.  SQLite varints are big-endian
           50  +** are are limited to 9 bytes in length whereas FTS3 varints are
           51  +** little-endian and can be upt to 10 bytes in length (in theory).
           52  +**
           53  +** Example encodings:
           54  +**
           55  +**     1:    0x01
           56  +**   127:    0x7f
           57  +**   128:    0x81 0x00
    52     58   **
    53     59   **
    54     60   **** Document lists ****
    55     61   ** A doclist (document list) holds a docid-sorted list of hits for a
    56     62   ** given term.  Doclists hold docids, and can optionally associate
    57         -** token positions and offsets with docids.
           63  +** token positions and offsets with docids.  A position is the index
           64  +** of a word within the document.  The first word of the document has
           65  +** a position of 0.
           66  +**
           67  +** FTS3 used to optionally store character offsets using a compile-time
           68  +** option.  But that functionality is no longer supported.
    58     69   **
    59     70   ** A DL_POSITIONS_OFFSETS doclist is stored like this:
    60     71   **
    61     72   ** array {
    62     73   **   varint docid;
    63     74   **   array {                (position list for column 0)
    64     75   **     varint position;     (delta from previous position plus POS_BASE)
    65         -**     varint startOffset;  (delta from previous startOffset)
    66         -**     varint endOffset;    (delta from startOffset)
    67     76   **   }
    68     77   **   array {
    69     78   **     varint POS_COLUMN;   (marks start of position list for new column)
    70     79   **     varint column;       (index of new column)
    71     80   **     array {
    72     81   **       varint position;   (delta from previous position plus POS_BASE)
    73         -**       varint startOffset;(delta from previous startOffset)
    74         -**       varint endOffset;  (delta from startOffset)
    75     82   **     }
    76     83   **   }
    77     84   **   varint POS_END;        (marks end of positions for this document.
    78     85   ** }
    79     86   **
    80     87   ** Here, array { X } means zero or more occurrences of X, adjacent in
    81     88   ** memory.  A "position" is an index of a token in the token stream
    82         -** generated by the tokenizer, while an "offset" is a byte offset,
    83         -** both based at 0.  Note that POS_END and POS_COLUMN occur in the
    84         -** same logical place as the position element, and act as sentinals
    85         -** ending a position list array.
           89  +** generated by the tokenizer. Note that POS_END and POS_COLUMN occur 
           90  +** in the same logical place as the position element, and act as sentinals
           91  +** ending a position list array.  POS_END is 0.  POS_COLUMN is 1.
           92  +** The positions numbers are not stored literally but rather as two more
           93  +** the difference from the prior position, or the just the position plus
           94  +** 2 for the first position.  Example:
           95  +**
           96  +**   label:       A B C D E  F  G H   I  J K
           97  +**   value:     123 5 9 1 1 14 35 0 234 72 0
           98  +**
           99  +** The 123 value is the first docid.  For column zero in this document
          100  +** there are two matches at positions 3 and 10 (5-2 and 9-2+3).  The 1
          101  +** at D signals the start of a new column; the 1 at E indicates that the
          102  +** new column is column number 1.  There are two positions at 12 and 45
          103  +** (14-2 and 35-2+12).  The 0 at H indicate the end-of-document.  The
          104  +** 234 at I is the next docid.  It has one position 72 (72-2) and then
          105  +** terminates with the 0 at K.
    86    106   **
    87    107   ** A DL_POSITIONS doclist omits the startOffset and endOffset
    88    108   ** information.  A DL_DOCIDS doclist omits both the position and
    89    109   ** offset information, becoming an array of varint-encoded docids.
    90    110   **
    91    111   ** On-disk data is stored as type DL_DEFAULT, so we don't serialize
    92    112   ** the type.  Due to how deletion is implemented in the segmentation
................................................................................
   384    404           z[iOut++] = z[iIn++];
   385    405         }
   386    406       }
   387    407       z[iOut] = '\0';
   388    408     }
   389    409   }
   390    410   
          411  +/*
          412  +** Read a single varint from the doclist at *pp and advance *pp to point
          413  +** to the next element of the varlist.  Add the value of the varint
          414  +** to *pVal.
          415  +*/
   391    416   static void fts3GetDeltaVarint(char **pp, sqlite3_int64 *pVal){
   392    417     sqlite3_int64 iVal;
   393    418     *pp += sqlite3Fts3GetVarint(*pp, &iVal);
   394    419     *pVal += iVal;
   395    420   }
   396    421   
          422  +/*
          423  +** As long as *pp has not reached its end (pEnd), then do the same
          424  +** as fts3GetDeltaVarint(): read a single varint and add it to *pVal.
          425  +** But if we have reached the end of the varint, just set *pp=0 and
          426  +** leave *pVal unchanged.
          427  +*/
   397    428   static void fts3GetDeltaVarint2(char **pp, char *pEnd, sqlite3_int64 *pVal){
   398    429     if( *pp>=pEnd ){
   399    430       *pp = 0;
   400    431     }else{
   401    432       fts3GetDeltaVarint(pp, pVal);
   402    433     }
   403    434   }
................................................................................
   777    808     *ppCsr = pCsr = (sqlite3_vtab_cursor *)sqlite3_malloc(sizeof(Fts3Cursor));
   778    809     if( !pCsr ){
   779    810       return SQLITE_NOMEM;
   780    811     }
   781    812     memset(pCsr, 0, sizeof(Fts3Cursor));
   782    813     return SQLITE_OK;
   783    814   }
   784         -
   785         -/****************************************************************/
   786         -/****************************************************************/
   787         -/****************************************************************/
   788         -/****************************************************************/
   789         -
   790    815   
   791    816   /*
   792    817   ** Close the cursor.  For additional information see the documentation
   793    818   ** on the xClose method of the virtual table interface.
   794    819   */
   795    820   static int fulltextClose(sqlite3_vtab_cursor *pCursor){
   796    821     Fts3Cursor *pCsr = (Fts3Cursor *)pCursor;