Overview
Comment: Change the name of the "unicode" tokenizer to "unicode61" to emphasize that the case folding and separator-character identification routines are based on unicode version 6.1.
SHA1: 8f3e60aa2253f21bcee5d03982cfdd7f
User & Date: dan 2012-05-26 14:54:50.307
Context
2012-05-26
15:44  Add fault-injection tests that use the unicode61 tokenizer. Fix a problem revealed by the same. (check-in: ed28c48a3d user: dan tags: fts4-unicode)
14:54  Change the name of the "unicode" tokenizer to "unicode61" to emphasize that the case folding and separator-character identification routines are based on unicode version 6.1. (check-in: 8f3e60aa22 user: dan tags: fts4-unicode)
2012-05-25
19:50  Add special fast paths to sqlite3FtsUnicodeTolower() and Isalnum() for codepoints in the ASCII range. (check-in: cf7b25d476 user: dan tags: fts4-unicode)
Changes
Changes to ext/fts3/README.tokenizers.
The README's list of built-in tokenizer names is updated to use "unicode61" in place of "unicode". The affected section of the file now reads:

  1. FTS3 Tokenizers

    When creating a new full-text table, FTS3 allows the user to select
    the text tokenizer implementation to be used when indexing text by
    specifying a "tokenize" clause as part of the CREATE VIRTUAL TABLE
    statement:

      CREATE VIRTUAL TABLE <table-name> USING fts3(
        <columns ...> [, tokenize <tokenizer-name> [<tokenizer-args>]]
      );

    The built-in tokenizers (valid values to pass as <tokenizer name>) are
    "simple", "porter" and "unicode61".

    <tokenizer-args> should consist of zero or more white-space separated
    arguments to pass to the selected tokenizer implementation. The
    interpretation of the arguments, if any, depends on the individual
    tokenizer.

  2. Custom Tokenizers
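For illustration only, a minimal sketch of selecting the renamed tokenizer with the CREATE VIRTUAL TABLE syntax quoted above; the table and column names here are hypothetical and not part of this check-in:

  CREATE VIRTUAL TABLE mail USING fts3(
    subject, body, tokenize unicode61
  );

  -- The fts4 tests below use the equivalent "tokenize=unicode61" spelling.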
Changes to ext/fts3/fts3.c.
In fts3.c, the hash-table entry for the unicode tokenizer is now registered under the key "unicode61" (10 bytes, i.e. strlen plus the nul terminator, matching the convention used for the other entries):

      sqlite3Fts3HashInit(pHash, FTS3_HASH_STRING, 1);
    }

    /* Load the built-in tokenizers into the hash table */
    if( rc==SQLITE_OK ){
      if( sqlite3Fts3HashInsert(pHash, "simple", 7, (void *)pSimple)
       || sqlite3Fts3HashInsert(pHash, "porter", 7, (void *)pPorter)
       || sqlite3Fts3HashInsert(pHash, "unicode61", 10, (void *)pUnicode)
  #ifdef SQLITE_ENABLE_ICU
       || (pIcu && sqlite3Fts3HashInsert(pHash, "icu", 4, (void *)pIcu))
  #endif
      ){
        rc = SQLITE_NOMEM;
      }
    }
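Once that entry exists, the tokenizer can be looked up by its new name from SQL. A sketch, assuming the build exposes the fts3_tokenizer() scalar function that FTS3 provides for its custom-tokenizer interface (see section 2 of the README above); after this change the old name should no longer resolve:

  -- fts3_tokenizer(<name>) returns the tokenizer module pointer as a blob.
  SELECT hex(fts3_tokenizer('unicode61'));  -- succeeds after this check-in
  SELECT hex(fts3_tokenizer('unicode'));    -- fails: name no longer registered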
Changes to test/fts4unicode.test.
In test/fts4unicode.test the tokenizer is requested by its new name, "unicode61", both in the fts3_tokenizer_test() helper and in the tokenize= option of the test table:

  source $testdir/tester.tcl
  ifcapable !fts3 { finish_test ; return }
  set ::testprefix fts4unicode

  proc do_unicode_token_test {tn input res} {
    set input [string map {' ''} $input]
    uplevel [list do_execsql_test $tn "
      SELECT fts3_tokenizer_test('unicode61', '$input');
    " [list [list {*}$res]]]
  }

  do_unicode_token_test 1.0 {a B c D} {0 a a 1 b B 2 c c 3 d D}
  do_unicode_token_test 1.1 {Ä Ö Ü} {0 ä Ä 1 ö Ö 2 ü Ü}
  do_unicode_token_test 1.2 {xÄx xÖx xÜx} {0 xäx xÄx 1 xöx xÖx 2 xüx xÜx}

  # 0x00DF is a small "sharp s". 0x1E9E is a capital sharp s.
  do_unicode_token_test 1.3 "\uDF" "0 \uDF \uDF"
  do_unicode_token_test 1.4 "\u1E9E" "0 ß \u1E9E"
  do_unicode_token_test 1.5 "\u1E9E" "0 \uDF \u1E9E"

  do_unicode_token_test 1.6 "The quick brown fox" {
    0 the The 1 quick quick 2 brown brown 3 fox fox
  }
  do_unicode_token_test 1.7 "The\u00bfquick\u224ebrown\u2263fox" {
    0 the The 1 quick quick 2 brown brown 3 fox fox
  }

  #-------------------------------------------------------------------------
  #
  set docs [list {
    Enhance the INSERT syntax to allow multiple rows to be inserted via the
    VALUES clause.
  } {
    Enhance the CREATE VIRTUAL TABLE command to support the IF NOT EXISTS
    clause.
  } {
    Added the sqlite3_stricmp() interface as a counterpart to
    sqlite3_strnicmp().
  } {
    Added the sqlite3_db_readonly() interface.
  } {
    Added the SQLITE_FCNTL_PRAGMA file control, giving VFS implementations
    the ability to add new PRAGMA statements or to override built-in PRAGMAs.
  } {
    Queries of the form: "SELECT max(x), y FROM table" returns the value of y
    on the same row that contains the maximum x value.
  } {
    Added support for the FTS4 languageid option.
  } {
    Documented support for the FTS4 content option. This feature has actually
    been in the code since version 3.7.9 but is only now considered to be
    officially supported.
  } {
    Pending statements no longer block ROLLBACK. Instead, the pending
    statement will return SQLITE_ABORT upon next access after the ROLLBACK.
  } {
    Improvements to the handling of CSV inputs in the command-line shell
  } {
    Fix a bug introduced in version 3.7.10 that might cause a LEFT JOIN to be
    incorrectly converted into an INNER JOIN if the WHERE clause indexable
    terms connected by OR.
  }]

  set map(a) [list "\u00C4" "\u00E4"]  ;# LATIN LETTER A WITH DIAERESIS
  set map(e) [list "\u00CB" "\u00EB"]  ;# LATIN LETTER E WITH DIAERESIS
  set map(i) [list "\u00CF" "\u00EF"]  ;# LATIN LETTER I WITH DIAERESIS
  set map(o) [list "\u00D6" "\u00F6"]  ;# LATIN LETTER O WITH DIAERESIS
  set map(u) [list "\u00DC" "\u00FC"]  ;# LATIN LETTER U WITH DIAERESIS
  set map(y) [list "\u0178" "\u00FF"]  ;# LATIN LETTER Y WITH DIAERESIS
  set map(h) [list "\u1E26" "\u1E27"]  ;# LATIN LETTER H WITH DIAERESIS
  set map(w) [list "\u1E84" "\u1E85"]  ;# LATIN LETTER W WITH DIAERESIS
  set map(x) [list "\u1E8C" "\u1E8D"]  ;# LATIN LETTER X WITH DIAERESIS
  foreach k [array names map] {
    lappend mappings [string toupper $k] [lindex $map($k) 0]
    lappend mappings $k [lindex $map($k) 1]
  }
  proc mapdoc {doc} {
    set doc [regsub -all {[[:space:]]+} $doc " "]
    string map $::mappings [string trim $doc]
  }

  do_test 2.0 {
    execsql { CREATE VIRTUAL TABLE t2 USING fts4(tokenize=unicode61, x); }
    foreach doc $docs {
      set d [mapdoc $doc]
      execsql { INSERT INTO t2 VALUES($d) }
    }
  } {}

  do_test 2.1 {
    set q [mapdoc "row"]
    execsql { SELECT * FROM t2 WHERE t2 MATCH $q }
  } [list [mapdoc {
    Queries of the form: "SELECT max(x), y FROM table" returns the value of y
    on the same row that contains the maximum x value.
  }]]

  foreach {tn query snippet} {
    2 "row" {
      ...returns the value of y on the same [row] that contains the maximum
      x value.
    }
    3 "ROW" {
      ...returns the value of y on the same [row] that contains the maximum
      x value.
    }
    4 "rollback" {
      ...[ROLLBACK]. Instead, the pending statement will return SQLITE_ABORT
      upon next access after the [ROLLBACK].
    }
    5 "rOllback" {
      ...[ROLLBACK]. Instead, the pending statement will return SQLITE_ABORT
      upon next access after the [ROLLBACK].
    }
    6 "lang*" {
      Added support for the FTS4 [languageid] option.
    }
  } {
    do_test 2.$tn {
      set q [mapdoc $query]
      execsql { SELECT snippet(t2, '[', ']', '...') FROM t2 WHERE t2 MATCH $q }
    } [list [mapdoc $snippet]]
  }

  finish_test
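The behaviour the 2.x tests rely on can also be seen interactively. A short SQL sketch (hypothetical table name, not part of the check-in) of the case folding that tests 1.1 and 2.2-2.5 above expect from unicode61:

  CREATE VIRTUAL TABLE t3 USING fts4(tokenize=unicode61, x);
  INSERT INTO t3 VALUES('Pending statements return SQLITE_ABORT after a ROLLBACK');

  -- unicode61 case-folds terms at index and query time, so a lower-case
  -- query term matches the upper-case text ("rollback" matches "ROLLBACK");
  -- per test 1.1, folding also covers non-ASCII letters such as Ä -> ä.
  SELECT snippet(t3, '[', ']', '...') FROM t3 WHERE t3 MATCH 'rollback';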