Index: ext/fts3/fts3.c ================================================================== --- ext/fts3/fts3.c +++ ext/fts3/fts3.c @@ -21,13 +21,10 @@ ** ** * The FTS3 module is being built into the core of ** SQLite (in which case SQLITE_ENABLE_FTS3 is defined). */ -/* TODO(shess) Consider exporting this comment to an HTML file or the -** wiki. -*/ /* The full-text index is stored in a series of b+tree (-like) ** structures called segments which map terms to doclists. The ** structures are like b+trees in layout, but are constructed from the ** bottom up in optimal fashion and are not updatable. Since trees ** are built from the bottom up, things will be described from the @@ -46,45 +43,68 @@ ** 7 bits - A ** 14 bits - BA ** 21 bits - BBA ** and so on. ** -** This is identical to how sqlite encodes varints (see util.c). +** This is similar in concept to how sqlite encodes "varints" but +** the encoding is not the same. SQLite varints are big-endian +** are are limited to 9 bytes in length whereas FTS3 varints are +** little-endian and can be upt to 10 bytes in length (in theory). +** +** Example encodings: +** +** 1: 0x01 +** 127: 0x7f +** 128: 0x81 0x00 ** ** **** Document lists **** ** A doclist (document list) holds a docid-sorted list of hits for a ** given term. Doclists hold docids, and can optionally associate -** token positions and offsets with docids. +** token positions and offsets with docids. A position is the index +** of a word within the document. The first word of the document has +** a position of 0. +** +** FTS3 used to optionally store character offsets using a compile-time +** option. But that functionality is no longer supported. ** ** A DL_POSITIONS_OFFSETS doclist is stored like this: ** ** array { ** varint docid; ** array { (position list for column 0) ** varint position; (delta from previous position plus POS_BASE) -** varint startOffset; (delta from previous startOffset) -** varint endOffset; (delta from startOffset) ** } ** array { ** varint POS_COLUMN; (marks start of position list for new column) ** varint column; (index of new column) ** array { ** varint position; (delta from previous position plus POS_BASE) -** varint startOffset;(delta from previous startOffset) -** varint endOffset; (delta from startOffset) ** } ** } ** varint POS_END; (marks end of positions for this document. ** } ** ** Here, array { X } means zero or more occurrences of X, adjacent in ** memory. A "position" is an index of a token in the token stream -** generated by the tokenizer, while an "offset" is a byte offset, -** both based at 0. Note that POS_END and POS_COLUMN occur in the -** same logical place as the position element, and act as sentinals -** ending a position list array. +** generated by the tokenizer. Note that POS_END and POS_COLUMN occur +** in the same logical place as the position element, and act as sentinals +** ending a position list array. POS_END is 0. POS_COLUMN is 1. +** The positions numbers are not stored literally but rather as two more +** the difference from the prior position, or the just the position plus +** 2 for the first position. Example: +** +** label: A B C D E F G H I J K +** value: 123 5 9 1 1 14 35 0 234 72 0 +** +** The 123 value is the first docid. For column zero in this document +** there are two matches at positions 3 and 10 (5-2 and 9-2+3). The 1 +** at D signals the start of a new column; the 1 at E indicates that the +** new column is column number 1. There are two positions at 12 and 45 +** (14-2 and 35-2+12). The 0 at H indicate the end-of-document. The +** 234 at I is the next docid. It has one position 72 (72-2) and then +** terminates with the 0 at K. ** ** A DL_POSITIONS doclist omits the startOffset and endOffset ** information. A DL_DOCIDS doclist omits both the position and ** offset information, becoming an array of varint-encoded docids. ** @@ -386,16 +406,27 @@ } z[iOut] = '\0'; } } +/* +** Read a single varint from the doclist at *pp and advance *pp to point +** to the next element of the varlist. Add the value of the varint +** to *pVal. +*/ static void fts3GetDeltaVarint(char **pp, sqlite3_int64 *pVal){ sqlite3_int64 iVal; *pp += sqlite3Fts3GetVarint(*pp, &iVal); *pVal += iVal; } +/* +** As long as *pp has not reached its end (pEnd), then do the same +** as fts3GetDeltaVarint(): read a single varint and add it to *pVal. +** But if we have reached the end of the varint, just set *pp=0 and +** leave *pVal unchanged. +*/ static void fts3GetDeltaVarint2(char **pp, char *pEnd, sqlite3_int64 *pVal){ if( *pp>=pEnd ){ *pp = 0; }else{ fts3GetDeltaVarint(pp, pVal); @@ -779,16 +810,10 @@ return SQLITE_NOMEM; } memset(pCsr, 0, sizeof(Fts3Cursor)); return SQLITE_OK; } - -/****************************************************************/ -/****************************************************************/ -/****************************************************************/ -/****************************************************************/ - /* ** Close the cursor. For additional information see the documentation ** on the xClose method of the virtual table interface. */