Overview
| SHA1 Hash: | 3a79eaa3ec9f41c73806f73edb09c06661555f1f |
|---|---|
| Date: | 2012-11-26 15:25:02 |
| User: | drh |
| Comment: | First draft of documentation for the spellfix1 virtual table. |
Tags And Properties
- branch=trunk inherited from [b2e03e19ab]
- sym-trunk inherited from [b2e03e19ab]
Changes
Added pages/spellfix1.in
<title>The spellfix1 virtual table</title>
<tcl>
hd_keywords {spellfix1}</tcl>
<h1 align='center'>The Spellfix1 Virtual Table</h1>

<p>This spellfix1 [virtual table] can be used to search
a large vocabulary for close matches. For example, spellfix1
can be used to suggest corrections to misspelled words. Or,
it could be used with [FTS4] to do full-text search using potentially
misspelled words.

<p>The implementation for the spellfix1 virtual table is held in the
canonical SQLite source tree in the file src/test_spellfix1.c. The
spellfix1 virtual table is not included in the SQLite [amalgamation]
and is not a part of any standard SQLite build. Applications that
want to make use of spellfix1 should obtain a copy of the src/test_spellfix1.c
source file and compile it as a shared library or DLL. Then use the
[sqlite3_load_extension()] interface at run-time to load this extension
into the application.

<p>Once the extension is loaded, an instance of the spellfix1 virtual table
is created like this:

<blockquote><pre>
CREATE VIRTUAL TABLE demo USING spellfix1;
</pre></blockquote>

<p>The "spellfix1" term is the name of this module and must be entered as
shown. The "demo" term is the
name of the virtual table you will be creating and can be altered
to suit the needs of your application. The virtual table is initially
empty. In order for the virtual table to be useful, you will need to
populate it with your vocabulary. Suppose you
have a list of words in a table named "big_vocabulary".
Then do this:

<blockquote><pre>
INSERT INTO demo(word) SELECT word FROM big_vocabulary;
</pre></blockquote>

<p>If you intend to use this virtual table in cooperation with an FTS4
table (for spelling correction of search terms) then you might extract
the vocabulary using an fts4aux table:

<blockquote><pre>
INSERT INTO demo(word) SELECT term FROM search_aux WHERE col='*';
</pre></blockquote>

<p>You can also provide the virtual table with a "rank" for each word.
The "rank" is an estimate of how common the word is. Larger numbers
mean the word is more common. If you omit the rank when populating
the table, then a rank of 1 is assumed. But if you have rank
information, you can supply it and the virtual table will show a
slight preference for selecting more commonly used terms. To
populate the rank from an fts4aux table "search_aux" do something
like this:

<blockquote><pre>
INSERT INTO demo(word,rank)
SELECT term, documents FROM search_aux WHERE col='*';
</pre></blockquote>

<p>To query the virtual table, include a MATCH operator in the WHERE
clause. For example:

<blockquote><pre>
SELECT word FROM demo WHERE word MATCH 'kennasaw';
</pre></blockquote>

<p>Using a dataset of American place names (derived from
[http://geonames.usgs.gov/domestic/download_data.htm]) the query above
returns 20 results beginning with:

<blockquote><pre>
kennesaw
kenosha
kenesaw
kenaga
keanak
</pre></blockquote>

<p>If you append the character '*' to the end of the pattern, then
a prefix search is performed.
For example:

<blockquote><pre>
SELECT word FROM demo WHERE word MATCH 'kennes*';
</pre></blockquote>

<p>Yields 20 results beginning with:

<blockquote><pre>
kennesaw
kennestone
kenneson
kenneys
keanes
keenes
</pre></blockquote>

<h2>Search Refinements</h2>

<p>By default, the spellfix1 table returns no more than 20 results.
(It might return fewer than 20 if there were fewer good matches.)
You can change the upper bound on the number of returned rows by
adding a "top=N" term to the WHERE clause of your query, where N
is the new maximum. For example, to see the 5 best matches:

<blockquote><pre>
SELECT word FROM demo WHERE word MATCH 'kennes*' AND top=5;
</pre></blockquote>

<p>Each entry in the spellfix1 virtual table is associated with
a particular language, identified by the integer "langid" column.
The default langid is 0 and if no other actions are taken, the
entire vocabulary is a part of the 0 language. But if your application
needs to operate in multiple languages, then you can specify different
vocabulary items for each language by specifying the langid field
when populating the table.
For example:

<blockquote><pre>
INSERT INTO demo(word,langid) SELECT word, 0 FROM en_vocabulary;
INSERT INTO demo(word,langid) SELECT word, 1 FROM de_vocabulary;
INSERT INTO demo(word,langid) SELECT word, 2 FROM fr_vocabulary;
INSERT INTO demo(word,langid) SELECT word, 3 FROM ru_vocabulary;
INSERT INTO demo(word,langid) SELECT word, 4 FROM cn_vocabulary;
</pre></blockquote>

<p>After the virtual table has been populated with items from multiple
languages, specify the language of interest using a "langid=N" term
in the WHERE clause of the query:

<blockquote><pre>
SELECT word FROM demo WHERE word MATCH 'hildes*' AND langid=1;
</pre></blockquote>

<p>Note that if you do not include the "langid=N" term in the WHERE clause,
the search will be against language 0 (English in the example above).
All spellfix1 searches are against a single language id. There is no
way to search all languages at once.


<h2>Virtual Table Details</h2>

<p>The virtual table actually has a unique rowid with seven columns plus five
extra hidden columns. The columns are as follows:

<blockquote><dl>
<dt><p><b>rowid</b><dd>
A unique integer number associated with each
vocabulary item in the table. This can be used
as a foreign key on other tables in the database.

<dt><p><b>word</b><dd>
The text of the word that matches the pattern.
Both word and pattern can contain unicode characters
and can be mixed case.

<dt><p><b>rank</b><dd>
This is the rank of the word, as specified in the
original INSERT statement.


<dt><p><b>distance</b><dd>
This is an edit distance or Levenshtein distance going
from the pattern to the word.

<dt><p><b>langid</b><dd>
This is the language-id of the word.
All queries are
against a single language-id, which defaults to 0.
For any given query this value is the same on all rows.

<dt><p><b>score</b><dd>
The score is a combination of rank and distance. The
idea is that a lower score is better. The virtual table
attempts to find words with the lowest score and
by default (unless overridden by ORDER BY) returns
results in order of increasing score.

<dt><p><b>matchlen</b><dd>
In a prefix search, the matchlen is the number of characters in
the string that match against the prefix. For a non-prefix search,
this is the same as length(word).

<dt><p><b>phonehash</b><dd>
This column shows the phonetic hash prefix that was used to restrict
the search. For any given query, this column should be the same for
every row. This information is available for diagnostic purposes and
is not normally considered useful in real applications.

<dt><p><b>top</b><dd>
(HIDDEN) For any query, this value is the same on all
rows. It is an integer which is the maximum number of
rows that will be output. The actual number of rows
output might be less than this number, but it will never
be greater. The default value for top is 20, but that
can be changed for each query by including a term of
the form "top=N" in the WHERE clause of the query.

<dt><p><b>scope</b><dd>
(HIDDEN) For any query, this value is the same on all
rows. The scope is a measure of how widely the virtual
table looks for matching words. Smaller values of
scope cause a broader search. The scope is normally
chosen automatically and is capped at 4. Applications
can change the scope by including a term of the form
"scope=N" in the WHERE clause of the query. Increasing
the scope will make the query run faster, but will reduce
the possible corrections.

<dt><p><b>srchcnt</b><dd>
(HIDDEN) For any query, this value is the same on all
rows. This value is an integer which is the number of
words examined using the edit-distance algorithm to
find the top matches that are ultimately displayed. This
value is for diagnostic use only.

<dt><p><b>soundslike</b><dd>
(HIDDEN) When inserting vocabulary entries, this field
can be set to a spelling that matches what the word
sounds like. See the DEALING WITH UNUSUAL AND DIFFICULT
SPELLINGS section below for details.

<dt><p><b>command</b><dd>
(HIDDEN) The value of the "command" column is always NULL. However,
applications can insert special strings into the "command" column in order
to provoke certain behaviors in the spellfix1 virtual table.
For example, inserting the string 'reset' into the "command" column
will cause the virtual table to reread its edit distance weights
(if there are any).
</dl></blockquote>

<h2>Algorithm</h2>

<p>The spellfix1 virtual table creates a single
shadow table named "%_vocab" (where the % is replaced by the name of
the virtual table; Ex: "demo_vocab" for the "demo" virtual table).
The shadow table contains the following columns:

<blockquote><dl>
<dt><p><b>id</b><dd>
The unique id (INTEGER PRIMARY KEY)

<dt><p><b>rank</b><dd>
The rank of word.

<dt><p><b>langid</b><dd>
The language id for this entry.

<dt><p><b>word</b><dd>
The original UTF8 text of the vocabulary word

<dt><p><b>k1</b><dd>
The word transliterated into lower-case ASCII.
There is a standard table of mappings from non-ASCII
characters into ASCII. Examples: "æ" -> "ae",
"þ" -> "th", "ß" -> "ss", "á" -> "a", ...
The
accessory function spellfix1_translit(X) will do
the non-ASCII to ASCII mapping. The built-in lower(X)
function will convert to lower-case. Thus:
k1 = lower(spellfix1_translit(word)).

<dt><p><b>k2</b><dd>
This field holds a phonetic code derived from k1. Letters
that have similar sounds are mapped into the same symbol.
For example, all vowels and vowel clusters become the
single symbol "A". And the letters "p", "b", "f", and
"v" all become "B". All nasal sounds are represented
as "N". And so forth. The mapping is based on
ideas found in Soundex, Metaphone, and other
long-standing phonetic matching systems. This key can
be generated by the function spellfix1_phonehash(X).
Hence: k2 = spellfix1_phonehash(k1)
</dl></blockquote>

<p>There is also a function for computing the Wagner edit distance or the
Levenshtein distance between a pattern and a word. This function
is exposed as spellfix1_editdist(X,Y). The edit distance function
returns the "cost" of converting X into Y. Some transformations
cost more than others. Changing one vowel into a different vowel,
for example, is relatively cheap, as is doubling a consonant, or
omitting the second character of a double-consonant. Other transformations
are more expensive. The idea is that the edit distance function returns
a low cost for words that are similar and a higher cost for words
that are further apart. In this implementation, the maximum cost
of any single-character edit (delete, insert, or substitute) is 100,
with lower costs for some edits (such as transforming vowels).

<p>The "score" for a comparison is the edit distance between the pattern
and the word, adjusted down by the base-2 logarithm of the word rank.
For example, a match with distance 100 but rank 1000 would have a
score of 122 (= 100 - log2(1000) + 32) whereas a match with distance
100 with a rank of 1 would have a score of 131 (100 - log2(1) + 32).
(NB: The constant 32 is added to each score to keep it from going
negative in case the edit distance is zero.) In this way, frequently
used words get a slightly lower cost which tends to move them toward
the top of the list of alternative spellings.

<p>A straightforward implementation of a spelling corrector would be
to compare the search term against every word in the vocabulary
and select the 20 with the lowest scores. However, there will
typically be hundreds of thousands or millions of words in the
vocabulary, and so this approach is not fast enough.

<p>Suppose the term that is being spell-corrected is X. To limit
the search space, X is converted to a k2-like key using the
equivalent of:

<blockquote><pre>
key = spellfix1_phonehash(lower(spellfix1_translit(X)))
</pre></blockquote>

<p>This key is then limited to "scope" characters. The default scope
value is 4, but an alternative scope can be specified using the
"scope=N" term in the WHERE clause. After the key has been truncated,
the edit distance is run against every term in the vocabulary that
has a k2 value that begins with the abbreviated key.

<p>For example, suppose the input word is "Paskagula". The phonetic
key is "BACACALA" which is then truncated to 4 characters "BACA".
The edit distance is then run on the 4980 entries (out of
272,597 entries total) of the vocabulary whose k2 values begin with
BACA, yielding "Pascagoula" as the best match.

<p>Only terms of the vocabulary with a matching langid are searched.
Hence, the same table can contain entries from multiple languages
and only the requested language will be used. The default langid
is 0.

<h2>Configurable Edit Distance</h2>

<p>The built-in Wagner edit-distance function with fixed weights can be
replaced by the [editdist3()] edit-distance function
with application-defined weights and support for unicode, by specifying
the "edit_cost_table=<i>TABLENAME</i>" parameter to the spellfix1 module
when the virtual table is created.
For example:

<blockquote><pre>
CREATE VIRTUAL TABLE demo2 USING spellfix1(edit_cost_table=APPCOST);
</pre></blockquote>

<p>In the example above, the APPCOST table would be interrogated to find
the edit distance coefficients. It is the presence of the "edit_cost_table="
parameter to the spellfix1 module name that causes editdist3() to be used
in place of the built-in edit distance function.

<p>The edit distance coefficients are normally read from the APPCOST table
once and thereafter stored in memory. Hence, run-time changes to the
APPCOST table will not normally affect the edit distance results.
However, inserting the special string 'reset' into the "command" column of the
virtual table causes the edit distance coefficients to be reread from the
APPCOST table. Hence, applications should run a SQL statement similar
to the following when changes to the APPCOST table occur:

<blockquote><pre>
INSERT INTO demo2(command) VALUES('reset');
</pre></blockquote>

<h2>Dealing With Unusual And Difficult Spellings</h2>

<p>The algorithm above works quite well for most cases, but there are
exceptions. These exceptions can be dealt with by making additional
entries in the virtual table using the "soundslike" column.

<p>For example, many words of Greek origin begin with letters "ps" where
the "p" is silent. Ex: psalm, pseudonym, psoriasis, psyche. In
another example, many Scottish surnames can be spelled with an
initial "Mac" or "Mc". Thus, "MacKay" and "McKay" are both pronounced
the same.

<p>Accommodation can be made for words that are not spelled as they
sound by making additional entries into the virtual table for the
same word, but adding an alternative spelling in the "soundslike"
column. For example, the canonical entry for "psalm" would be this:

<blockquote><pre>
INSERT INTO demo(word) VALUES('psalm');
</pre></blockquote>

<p>To enhance the ability to correct the spelling of "salm" into
"psalm", make an additional entry like this:

<blockquote><pre>
INSERT INTO demo(word,soundslike) VALUES('psalm','salm');
</pre></blockquote>

<p>It is ok to make multiple entries for the same word as long as
each entry has a different soundslike value. Note that if no
soundslike value is specified, the soundslike defaults to the word
itself.

<p>Listed below are some cases where it might make sense to add additional
soundslike entries. The specific entries will depend on the application
and the target language.

<ul>
<li>Silent "p" in words beginning with "ps": psalm, psyche
<li>Silent "p" in words beginning with "pn": pneumonia, pneumatic
<li>Silent "p" in words beginning with "pt": pterodactyl, ptolemaic
<li>Silent "d" in words beginning with "dj": djinn, Djikarta
<li>Silent "k" in words beginning with "kn": knight, Knuthson
<li>Silent "g" in words beginning with "gn": gnarly, gnome, gnat
<li>"Mac" versus "Mc" beginning Scottish surnames
<li>"Tch" sounds in Slavic words: Tchaikovsky vs.
Chaykovsky
<li>The letter "j" pronounced like "h" in Spanish: LaJolla
<li>Words beginning with "wr" versus "r": write vs. rite
<li>Miscellaneous problem words such as "debt", "tsetse",
"Nguyen", "Van Nuyes".
</ul>

<h2>Auxiliary Functions</h2>

<p>The source code module that implements the spellfix1 virtual table also
implements several SQL functions that might be useful to applications
that employ spellfix1 or for testing or diagnostic work while developing
applications that use spellfix1. The following auxiliary functions are
available:

<blockquote><dl>
<dt><p><b>editdist3(P,W)<br>editdist3(P,W,L)<br>editdist3(T)</b><dd>
These routines provide direct access to the version of the Wagner
edit-distance function that allows for application-defined weights
on edit operations. The first two forms of this function compare
pattern P against word W and return the edit distance. In the first
function, the langid is assumed to be 0 and in the second, the
langid is given by the L parameter. The third form of this function
reloads edit distance coefficients from the table named by T.

<dt><p><b>spellfix1_editdist(P,W)</b><dd>
This routine provides access to the built-in Wagner edit-distance
function that uses default, fixed costs. The value returned is
the edit distance needed to transform W into P.

<dt><p><b>spellfix1_phonehash(X)</b><dd>
This routine constructs a phonetic hash of the pure ascii input word X
and returns that hash. This routine is used internally by spellfix1 in
order to transform the K1 column of the shadow table into the K2
column.

<dt><p><b>spellfix1_scriptcode(X)</b><dd>
Given an input string X, this routine attempts to determine the dominant
script of that input and returns the ISO-15924 numeric code for that
script.
The current implementation understands the following scripts:
<ul>
<li> 215 - Latin
<li> 220 - Cyrillic
<li> 200 - Greek
</ul>
Additional language codes might be added in future releases.

<dt><p><b>spellfix1_translit(X)</b><dd>
This routine transliterates unicode text into pure ascii, returning
the pure ascii representation of the input text X. This is the function
that is used internally to transform vocabulary words into the K1
column of the shadow table.

</dl></blockquote>

<tcl>hd_fragment editdist3 editdist3</tcl>
<h2>The editdist3 function</h2>

<p>The editdist3 algorithm is a function that computes the minimum edit
distance (a.k.a. the Levenshtein distance) between two input strings.
The editdist3 algorithm is a configurable alternative to the default
edit distance function of spellfix1.
Features of editdist3 include:

<ul>
<li><p>It works with unicode (UTF8) text.

<li><p>A table of insertion, deletion, and substitution costs can be
provided by the application.

<li><p>Multi-character insertions, deletions, and substitutions can be
enumerated in the cost table.
</ul>

<h2>The editdist3 COST table</h2>

<p>To program the costs of editdist3, create a table such as the following:

<blockquote><pre>
CREATE TABLE editcost(
  iLang INT,   -- The language ID
  cFrom TEXT,  -- Convert text from this
  cTo   TEXT,  -- Convert text into this
  iCost INT    -- The cost of doing the conversion
);
</pre></blockquote>

<p>The cost table can be named anything you want - it does not have to be
called "editcost". And the table can contain additional columns.
The only requirement is that the
table must contain the four columns shown above, with exactly the names shown.
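
<p>As a minimal sketch (not part of spellfix1 itself), an application could
check up front that a candidate cost table exposes the four required columns.
The Python helper below uses only the standard sqlite3 module. The helper name
missing_cost_columns and the table name "appcost" are hypothetical, and the
comparison is deliberately case-sensitive on the assumption that "exactly the
names shown" is meant literally.

```python
import sqlite3

# The four column names the cost table is documented to require.
REQUIRED_COLUMNS = {"iLang", "cFrom", "cTo", "iCost"}


def missing_cost_columns(conn, table):
    """Return the set of required column names that `table` lacks.

    Hypothetical helper for illustration; spellfix1 does not provide it.
    """
    # PRAGMA arguments cannot be bound parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    rows = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
    # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
    present = {row[1] for row in rows}
    return REQUIRED_COLUMNS - present


conn = sqlite3.connect(":memory:")
# Extra columns (here: note) are allowed alongside the required four.
conn.execute(
    "CREATE TABLE appcost(iLang INT, cFrom TEXT, cTo TEXT, iCost INT, note TEXT)"
)
conn.execute("CREATE TABLE incomplete(iLang INT, cFrom TEXT)")

ok_missing = missing_cost_columns(conn, "appcost")
bad_missing = missing_cost_columns(conn, "incomplete")
```

<p>An empty result means the table has the expected shape; a non-empty result
names the columns that still need to be added.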

<p>The iLang column is a non-negative integer that identifies a set of costs
appropriate for a particular language. The editdist3 function will only use
a single iLang value for any given edit-distance computation. The default
value is 0. It is recommended that applications that only need to use a
single language always use iLang==0 for all entries.

<p>The iCost column is the numeric cost of transforming cFrom into cTo. This
value should be a non-negative integer, and should probably be less than 100.
The default single-character insertion and deletion costs are 100 and the
default single-character to single-character substitution cost is 150. A
cost of 10000 or more is considered "infinite" and causes the rule to be
ignored.

<p>The cFrom and cTo columns show edit transformation strings. Either or both
columns may contain more than one character. Or either column (but not both)
may hold an empty string. When cFrom is empty, that is the cost of inserting
cTo. When cTo is empty, that is the cost of deleting cFrom.

<p>In the spellfix1 algorithm, cFrom is the text as the user entered it and
cTo is the correctly spelled text as it exists in the database. The goal
of the editdist3 algorithm is to determine how close the user-entered text is
to the dictionary text.

<p>There are three special-case entries in the cost table:

<table border=1>
<tr><th>cFrom</th><th>cTo</th><th>Meaning</th></tr>
<tr><td>''</td><td>'?'</td><td>The default insertion cost</td></tr>
<tr><td>'?'</td><td>''</td><td>The default deletion cost</td></tr>
<tr><td>'?'</td><td>'?'</td><td>The default substitution cost</td></tr>
</table>

<p>If any of the special-case entries shown above are omitted, then the
value of 100 is used for insertion and deletion and 150 is used for
substitution.
To disable the default insertion, deletion, and/or substitution
set their respective cost to 10000 or more.

<p>Other entries in the cost table specify transforms for particular
characters.
The cost of specific transforms should be less than the default costs, or else
the default costs will take precedence and the specific transforms will never
be used.

<p>Some example cost table entries:

<blockquote><pre>
INSERT INTO editcost(iLang, cFrom, cTo, iCost)
VALUES(0, 'a', 'ä', 5);
</pre></blockquote>

<p>The rule above says that the letter "a" in user input can be matched against
the letter "ä" in the dictionary with a penalty of 5.

<blockquote><pre>
INSERT INTO editcost(iLang, cFrom, cTo, iCost)
VALUES(0, 'ss', 'ß', 8);
</pre></blockquote>

<p>The number of characters in cFrom and cTo does not need to be the same. The
rule above says that "ss" on user input will match "ß" with a penalty of 8.

<h2>Experimenting with the editdist3() function</h2>

<p>The spellfix1 virtual table
uses editdist3 if the "edit_cost_table=TABLE" option
is specified as an argument when the spellfix1 virtual table is created.
But editdist3 can also be tested directly using the built-in "editdist3()"
SQL function. The editdist3() SQL function has 3 forms:

<ol>
<li> editdist3('TABLENAME');
<li> editdist3('string1', 'string2');
<li> editdist3('string1', 'string2', langid);
</ol>

<p>The first form loads the edit distance coefficients from a table called
'TABLENAME'. Any prior coefficients are discarded. So when experimenting
with weights and the weight table changes, simply rerun the single-argument
form of editdist3() to reload revised coefficients.
Note that the
edit distance
weights used by the editdist3() SQL function are independent from the
weights used by the spellfix1 virtual table.

<p>The second and third forms return the computed edit distance between strings
'string1' and 'string2'. In the second form, a language id of 0 is used.
The language id is specified in the third form.
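
<p>To make the role of the cost table more concrete, the following Python
sketch computes a weighted edit distance driven by (cFrom, cTo) rules. It is
an illustration of the idea only and is not the editdist3 implementation: the
function name weighted_edit_distance is hypothetical, a single language is
assumed, and the three '?' special-case entries are modeled as keyword
arguments that default to the documented values of 100, 100, and 150.

```python
def weighted_edit_distance(src, dst, rules,
                           ins_cost=100, del_cost=100, sub_cost=150):
    """Cheapest cost of rewriting src (user input) into dst (dictionary word).

    rules maps (cFrom, cTo) -> iCost for one language. Illustrative sketch
    only; results are not guaranteed to match editdist3().
    """
    IGNORE_AT = 10000  # documented threshold: such rules are ignored
    usable = {pair: cost for pair, cost in rules.items() if cost < IGNORE_AT}

    inf = float("inf")
    n, m = len(src), len(dst)
    # best[i][j] = cheapest known cost of turning src[:i] into dst[:j]
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0

    for i in range(n + 1):
        for j in range(m + 1):
            cur = best[i][j]
            if cur == inf:
                continue
            # Default single-character steps: match/substitute, delete, insert.
            if i < n and j < m:
                step = 0 if src[i] == dst[j] else sub_cost
                best[i + 1][j + 1] = min(best[i + 1][j + 1], cur + step)
            if i < n:
                best[i + 1][j] = min(best[i + 1][j], cur + del_cost)
            if j < m:
                best[i][j + 1] = min(best[i][j + 1], cur + ins_cost)
            # Application-defined rules, possibly spanning several characters.
            for (c_from, c_to), cost in usable.items():
                if src.startswith(c_from, i) and dst.startswith(c_to, j):
                    ni, nj = i + len(c_from), j + len(c_to)
                    best[ni][nj] = min(best[ni][nj], cur + cost)
    return best[n][m]


# The two example rules from the cost table discussion above.
rules = {("a", "ä"): 5, ("ss", "ß"): 8}
d_umlaut = weighted_edit_distance("a", "ä", rules)
d_eszett = weighted_edit_distance("strasse", "straße", rules)
d_insert = weighted_edit_distance("cat", "cart", {})
d_subst = weighted_edit_distance("cat", "cut", {})
```

<p>With these rules the sketch charges 5 for "a" against "ä" and 8 for
"strasse" against "straße", while the rule-free calls fall back to the default
insertion cost (100) and substitution cost (150). A rule that costs more than
the default path is never chosen, which mirrors the advice that specific
transforms should be cheaper than the defaults.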