/ Check-in [adcf7890]

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Comment:Remove spellfix virtual table documentation from the source tree. Reference the separate documentation on the website instead.
Downloads: Tarball | ZIP archive | SQL archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: adcf78909ff9064b6e3c4dd15ccd3245c8cf270b
User & Date: drh 2013-04-27 18:06:40
Update 'fuzzerfault' test for its new module loading command. Fix several test names in 'io.test' and make sure the database gets closed between tests. check-in: e81e9ca1 user: mistachkin tags: trunk
Initial work on reforming usage of OSTRACE by the Windows VFS. check-in: b347b5d1 user: mistachkin tags: winOsTrace
Remove spellfix virtual table documentation from the source tree. Reference the separate documentation on the website instead. check-in: adcf7890 user: drh tags: trunk
Untested fix for building on VxWorks. check-in: f14d55cf user: drh tags: trunk
Hide Diffs Side-by-Side Diffs Ignore Whitespace Patch

Deleted ext/misc/editdist3.wiki.

     1         -<title>The editdist3 algorithm</title>
     2         -
     3         -The editdist3 algorithm is a function that computes the minimum edit distance
     4         -(a.k.a. the Levenshtein distance) between two input strings.  Features of
     5         -editdist3 include:
     6         -
     7         -   *   It works with unicode (UTF8) text.
     8         -
     9         -   *   A table of insertion, deletion, and substitution costs can be 
    10         -       provided by the application.
    11         -
    12         -   *   Multi-character insertsions, deletions, and substitutions can be
    13         -       enumerated in the cost table.
    14         -
    15         -<h2>The COST table</h2>
    16         -
    17         -To program the costs of editdist3, create a table such as the following:
    18         -
    19         -<blockquote><pre>
    20         -CREATE TABLE editcost(
    21         -  iLang INT,   -- The language ID
    22         -  cFrom TEXT,  -- Convert text from this
    23         -  cTo   TEXT,  -- Convert text into this
    24         -  iCost INT    -- The cost of doing the conversionnn
    25         -);
    26         -</pre></blockquote>
    27         -
    28         -The cost table can be named anything you want - it does not have to be called
    29         -"editcost".  And the table can contain additional columns.  However, it the
    30         -table must contain the four columns show above, with exactly the names shown.
    31         -
    32         -The iLang column is a non-negative integer that identifies a set of costs
    33         -appropriate for a particular language.  The editdist3 function will only use
    34         -a single iLang value for any given edit-distance computation.  The default
    35         -value is 0.  It is recommended that applications that only need to use a
    36         -single langauge always use iLang==0 for all entries.
    37         -
    38         -The iCost column is the numeric cost of transforming cFrom into cTo.  This
    39         -value should be a non-negative integer, and should probably be less than 100.
    40         -The default single-character insertion and deletion costs are 100 and the
    41         -default single-character to single-character substitution cost is 150.  A
    42         -cost of 10000 or more is considered "infinite" and causes the rule to be
    43         -ignored.
    44         -
    45         -The cFrom and cTo columns show edit transformation strings.  Either or both
    46         -columns may contain more than one character.  Or either column (but not both)
    47         -may hold an empty string.  When cFrom is empty, that is the cost of inserting
    48         -cTo.  When cTo is empty, that is the cost of deleting cFrom.
    49         -
    50         -In the spellfix1 algorithm, cFrom is the text as the user entered it and
    51         -cTo is the correctly spelled text as it exists in the database.  The goal
    52         -of the editdist3 algorithm is to determine how close the user-entered text is
    53         -to the dictionary text.
    54         -
    55         -There are three special-case entries in the cost table:
    56         -
    57         -<table border=1>
    58         -<tr><th>cFrom</th><th>cTo</th><th>Meaning</th></tr>
    59         -<tr><td>''</td><td>'?'</td><td>The default insertion cost</td></tr>
    60         -<tr><td>'?'</td><td>''</td><td>The default deletion cost</td></tr>
    61         -<tr><td>'?'</td><td>'?'</td><td>The default substitution cost</td></tr>
    62         -</table>
    63         -
    64         -If any of the special-case entries shows above are omitted, then the
    65         -value of 100 is used for insertion and deletion and 150 is used for
    66         -substitution.  To disable the default insertion, deletion, and/or substitution
    67         -set their respective cost to 10000 or more.
    68         -
    69         -Other entries in the cost table specific transforms for particular characters.
    70         -The cost of specific transforms should be less than the default costs, or else
    71         -the default costs will take precedence and the specific transforms will never 
    72         -be used.
    73         -
    74         -Some example, cost table entries:
    75         -
    76         -<blockquote><pre>
    77         -INSERT INTO editcost(iLang, cFrom, cTo, iCost)
    78         -VALUES(0, 'a', 'ä', 5);
    79         -</pre></blockquote>
    80         -
    81         -The rule above says that the letter "a" in user input can be matched against
    82         -the letter "ä" in the dictionary with a penalty of 5.
    83         -
    84         -<blockquote><pre>
    85         -INSERT INTO editcost(iLang, cFrom, cTo, iCost)
    86         -VALUES(0, 'ss', 'ß', 8);
    87         -</pre></blockquote>
    88         -
    89         -The number of characters in cFrom and cTo do not need to be the same.  The
    90         -rule above says that "ss" on user input will match "ß" with a penalty of 8.
    91         -
    92         -<h2>Experimenting with the editcost3() function</h2>
    93         -
    94         -The [./spellfix1.wiki | spellfix1 virtual table]
    95         -uses editdist3 if the "edit_cost_table=TABLE" option
    96         -is specified as an argument when the spellfix1 virtual table is created.  
    97         -But editdist3 can also be tested directly using the built-in "editdist3()"
    98         -SQL function.  The editdist3() SQL function has 3 forms:
    99         -
   100         -  1.  editdist3('TABLENAME');
   101         -  2.  editdist3('string1', 'string2');
   102         -  3.  editdist3('string1', 'string2', langid);
   103         -
   104         -The first form loads the edit distance coefficients from a table called
   105         -'TABLENAME'.  Any prior coefficients are discarded.  So when experimenting
   106         -with weights and the weight table changes, simply rerun the single-argument
   107         -form of editdist3() to reload revised coefficients.  Note that the 
   108         -edit distance
   109         -weights used by the editdist3() SQL function are independent from the
   110         -weights used by the spellfix1 virtual table.
   111         -
   112         -The second and third forms return the computed edit distance between strings
   113         -'string1' and "string2'.  In the second form, an language id of 0 is used.
   114         -The language id is specified in the third form.

Changes to ext/misc/spellfix.c.

     8      8   **    May you find forgiveness for yourself and forgive others.
     9      9   **    May you share freely, never taking more than you give.
    10     10   **
    11     11   *************************************************************************
    12     12   **
    13     13   ** This module implements the spellfix1 VIRTUAL TABLE that can be used
    14     14   ** to search a large vocabulary for close matches.  See separate
    15         -** documentation files (spellfix1.wiki and editdist3.wiki) for details.
           15  +** documentation (http://www.sqlite.org/spellfix1.html) for details.
    16     16   */
    17     17   #include "sqlite3ext.h"
    18     18   SQLITE_EXTENSION_INIT1
    19     19   
    20     20   #ifndef SQLITE_AMALGAMATION
    21     21   # include <string.h>
    22     22   # include <stdio.h>

Deleted ext/misc/spellfix1.wiki.

     1         -<title>The Spellfix1 Virtual Table</title>
     2         -
     3         -This spellfix1 virtual table is used to search
     4         -a large vocabulary for close matches.  For example, spellfix1
     5         -can be used to suggest corrections to misspelled words.  Or,
     6         -it could be used with FTS4 to do full-text search using potentially
     7         -misspelled words.
     8         -
     9         -Create an instance of the spellfix1 virtual table like this:
    10         -
    11         -<blockquote><pre>
    12         -CREATE VIRTUAL TABLE demo USING spellfix1;
    13         -</pre></blockquote>
    14         -
    15         -The "spellfix1" term is the name of this module and must be entered as
    16         -shown.  The "demo" term is the
    17         -name of the virtual table you will be creating and can be altered
    18         -to suit the needs of your application.  The virtual table is initially
    19         -empty.  In order for the virtual table to be useful, you will need to
    20         -populate it with your vocabulary.  Suppose you
    21         -have a list of words in a table named "big_vocabulary".  Then do this:
    22         -
    23         -<blockquote><pre>
    24         -INSERT INTO demo(word) SELECT word FROM big_vocabulary;
    25         -</pre></blockquote>
    26         -
    27         -If you intend to use this virtual table in cooperation with an FTS4
    28         -table (for spelling correctly of search terms) then you might extract
    29         -the vocabulary using an fts3aux table:
    30         -
    31         -<blockquote><pre>
    32         -INSERT INTO demo(word) SELECT term FROM search_aux WHERE col='*';
    33         -</pre></blockquote>
    34         -
    35         -You can also provide the virtual table with a "rank" for each word.
    36         -The "rank" is an estimate of how common the word is.  Larger numbers
    37         -mean the word is more common.  If you omit the rank when populating
    38         -the table, then a rank of 1 is assumed.  But if you have rank 
    39         -information, you can supply it and the virtual table will show a
    40         -slight preference for selecting more commonly used terms.  To
    41         -populate the rank from an fts4aux table "search_aux" do something
    42         -like this:
    43         -
    44         -<blockquote><pre>
    45         -INSERT INTO demo(word,rank)
    46         -   SELECT term, documents FROM search_aux WHERE col='*';
    47         -</pre></blockquote>
    48         -
    49         -To query the virtual table, include a MATCH operator in the WHERE
    50         -clause.  For example:
    51         -
    52         -<blockquote><pre>
    53         -SELECT word FROM demo WHERE word MATCH 'kennasaw';
    54         -</pre></blockquote>
    55         -
    56         -Using a dataset of American place names (derived from
    57         -[http://geonames.usgs.gov/domestic/download_data.htm]) the query above
    58         -returns 20 results beginning with:
    59         -
    60         -<blockquote><pre>
    61         -kennesaw
    62         -kenosha
    63         -kenesaw
    64         -kenaga
    65         -keanak
    66         -</pre></blockquote>
    67         -
    68         -If you append the character '*' to the end of the pattern, then
    69         -a prefix search is performed.  For example:
    70         -
    71         -<blockquote><pre>
    72         -SELECT word FROM demo WHERE word MATCH 'kennes*';
    73         -</pre></blockquote>
    74         -
    75         -Yields 20 results beginning with:
    76         -
    77         -<blockquote><pre>
    78         -kennesaw
    79         -kennestone
    80         -kenneson
    81         -kenneys
    82         -keanes
    83         -keenes
    84         -</pre></blockquote>
    85         -
    86         -<h2>Search Refinements</h2>
    87         -
    88         -By default, the spellfix1 table returns no more than 20 results.
    89         -(It might return less than 20 if there were fewer good matches.)
    90         -You can change the upper bound on the number of returned rows by
    91         -adding a "top=N" term to the WHERE clause of your query, where N
    92         -is the new maximum.  For example, to see the 5 best matches:
    93         -
    94         -<blockquote><pre>
    95         -SELECT word FROM demo WHERE word MATCH 'kennes*' AND top=5;
    96         -</pre></blockquote>
    97         -
    98         -Each entry in the spellfix1 virtual table is associated with a
    99         -a particular language, identified by the integer "langid" column.
   100         -The default langid is 0 and if no other actions are taken, the
   101         -entire vocabulary is a part of the 0 language.  But if your application
   102         -needs to operate in multiple languages, then you can specify different
   103         -vocabulary items for each language by specifying the langid field
   104         -when populating the table.  For example:
   105         -
   106         -<blockquote><pre>
   107         -INSERT INTO demo(word,langid) SELECT word, 0 FROM en_vocabulary;
   108         -INSERT INTO demo(word,langid) SELECT word, 1 FROM de_vocabulary;
   109         -INSERT INTO demo(word,langid) SELECT word, 2 FROM fr_vocabulary;
   110         -INSERT INTO demo(word,langid) SELECT word, 3 FROM ru_vocabulary;
   111         -INSERT INTO demo(word,langid) SELECT word, 4 FROM cn_vocabulary;
   112         -</pre></blockquote>
   113         -
   114         -After the virtual table has been populated with items from multiple
   115         -languages, specify the language of interest using a "langid=N" term
   116         -in the WHERE clause of the query:
   117         -
   118         -<blockquote><pre>
   119         -SELECT word FROM demo WHERE word MATCH 'hildes*' AND langid=1;
   120         -</pre></blockquote>
   121         -
   122         -Note that if you do not include the "langid=N" term in the WHERE clause,
   123         -the search will be against language 0 (English in the example above.)
   124         -All spellfix1 searches are against a single language id.  There is no
   125         -way to search all languages at once.
   126         - 
   127         -
   128         -<h2>Virtual Table Details</h2>
   129         -
   130         -The virtual table actually has a unique rowid with seven columns plus five
   131         -extra hidden columns.  The columns are as follows:
   132         -
   133         -<blockquote><dl>
   134         -<dt><b>rowid</b><dd>
   135         -A unique integer number associated with each
   136         -vocabulary item in the table.  This can be used
   137         -as a foreign key on other tables in the database.
   138         -
   139         -<dt><b>word</b><dd>
   140         -The text of the word that matches the pattern.
   141         -Both word and pattern can contains unicode characters
   142         -and can be mixed case.
   143         -
   144         -<dt><b>rank</b><dd>
   145         -This is the rank of the word, as specified in the
   146         -original INSERT statement.
   147         -
   148         -
   149         -<dt><b>distance</b><dd>
   150         -This is an edit distance or Levensthein distance going
   151         -from the pattern to the word.
   152         -
   153         -<dt><b>langid</b><dd>
   154         -This is the language-id of the word.  All queries are
   155         -against a single language-id, which defaults to 0.
   156         -For any given query this value is the same on all rows.
   157         -
   158         -<dt><b>score</b><dd>
   159         -The score is a combination of rank and distance.  The
   160         -idea is that a lower score is better.  The virtual table
   161         -attempts to find words with the lowest score and 
   162         -by default (unless overridden by ORDER BY) returns
   163         -results in order of increasing score.
   164         -
   165         -<dt><b>matchlen</b><dd>
   166         -In a prefix search, the matchlen is the number of characters in
   167         -the string that match against the prefix.  For a non-prefix search,
   168         -this is the same as length(word).
   169         -
   170         -<dt><b>phonehash</b><dd>
   171         -This column shows the phonetic hash prefix that was used to restrict
   172         -the search.  For any given query, this column should be the same for
   173         -every row.  This information is available for diagnostic purposes and
   174         -is not normally considered useful in real applications.
   175         -
   176         -<dt><b>top</b><dd>
   177         -(HIDDEN)  For any query, this value is the same on all
   178         -rows.  It is an integer which is the maximum number of
   179         -rows that will be output.  The actually number of rows
   180         -output might be less than this number, but it will never
   181         -be greater.  The default value for top is 20, but that
   182         -can be changed for each query by including a term of
   183         -the form "top=N" in the WHERE clause of the query.
   184         -
   185         -<dt><b>scope</b><dd>
   186         -(HIDDEN)  For any query, this value is the same on all
   187         -rows.  The scope is a measure of how widely the virtual
   188         -table looks for matching words.  Smaller values of
   189         -scope cause a broader search.  The scope is normally
   190         -choosen automatically and is capped at 4.  Applications
   191         -can change the scope by including a term of the form
   192         -"scope=N" in the WHERE clause of the query.  Increasing
   193         -the scope will make the query run faster, but will reduce
   194         -the possible corrections.
   195         -
   196         -<dt><b>srchcnt</b><dd>
   197         -(HIDDEN)  For any query, this value is the same on all
   198         -rows.  This value is an integer which is the number of
   199         -of words examined using the edit-distance algorithm to
   200         -find the top matches that are ultimately displayed.  This
   201         -value is for diagnostic use only.
   202         -
   203         -<dt><b>soundslike</b><dd>
   204         -(HIDDEN)  When inserting vocabulary entries, this field
   205         -can be set to an spelling that matches what the word
   206         -sounds like.  See the DEALING WITH UNUSUAL AND DIFFICULT
   207         -SPELLINGS section below for details.
   208         -
   209         -<dt><b>command</b><dd>
   210         -(HIDDEN)  The value of the "command" column is always NULL.  However,
   211         -applications can insert special strings into the "command" column in order
   212         -to provoke certain behaviors in the spellfix1 virtual table.
   213         -For example, inserting the string 'reset' into the "command" column
   214         -will cause the virtual table will reread its edit distance weights
   215         -(if there are any).
   216         -</dl></blockquote>
   217         -
   218         -<h2>Algorithm</h2>
   219         -
   220         -The spellfix1 virtual table creates a single
   221         -shadow table named "%_vocab" (where the % is replaced by the name of
   222         -the virtual table; Ex: "demo_vocab" for the "demo" virtual table).  
   223         -the shadow table contains the following columns:
   224         -
   225         -<blockquote><dl>
   226         -<dt><b>id</b><dd>
   227         -The unique id (INTEGER PRIMARY KEY)
   228         -
   229         -<dt><b>rank</b><dd>
   230         -The rank of word.
   231         -
   232         -<dt><b>langid</b><dd>
   233         -The language id for this entry.
   234         -
   235         -<dt><b>word</b><dd>
   236         -The original UTF8 text of the vocabulary word
   237         -
   238         -<dt><b>k1</b><dd>
   239         -The word transliterated into lower-case ASCII.  
   240         -There is a standard table of mappings from non-ASCII
   241         -characters into ASCII.  Examples: "æ" -> "ae",
   242         -"þ" -> "th", "ß" -> "ss", "á" -> "a", ...  The
   243         -accessory function spellfix1_translit(X) will do
   244         -the non-ASCII to ASCII mapping.  The built-in lower(X)
   245         -function will convert to lower-case.  Thus:
   246         -k1 = lower(spellfix1_translit(word)).
   247         -
   248         -<dt><b>k2</b><dd>
   249         -This field holds a phonetic code derived from k1.  Letters
   250         -that have similar sounds are mapped into the same symbol.
   251         -For example, all vowels and vowel clusters become the
   252         -single symbol "A".  And the letters "p", "b", "f", and
   253         -"v" all become "B".  All nasal sounds are represented
   254         -as "N".  And so forth.  The mapping is base on
   255         -ideas found in Soundex, Metaphone, and other
   256         -long-standing phonetic matching systems.  This key can
   257         -be generated by the function spellfix1_phonehash(X).  
   258         -Hence: k2 = spellfix1_phonehash(k1)
   259         -</dl></blockquote>
   260         -
   261         -There is also a function for computing the Wagner edit distance or the
   262         -Levenshtein distance between a pattern and a word.  This function
   263         -is exposed as spellfix1_editdist(X,Y).  The edit distance function
   264         -returns the "cost" of converting X into Y.  Some transformations
   265         -cost more than others.  Changing one vowel into a different vowel,
   266         -for example is relatively cheap, as is doubling a constant, or
   267         -omitting the second character of a double-constant.  Other transformations
   268         -or more expensive.  The idea is that the edit distance function returns
   269         -a low cost of words that are similar and a higher cost for words
   270         -that are futher apart.  In this implementation, the maximum cost
   271         -of any single-character edit (delete, insert, or substitute) is 100,
   272         -with lower costs for some edits (such as transforming vowels).
   273         -
   274         -The "score" for a comparison is the edit distance between the pattern
   275         -and the word, adjusted down by the base-2 logorithm of the word rank.
   276         -For example, a match with distance 100 but rank 1000 would have a
   277         -score of 122 (= 100 - log2(1000) + 32) where as a match with distance
   278         -100 with a rank of 1 would have a score of 131 (100 - log2(1) + 32).
   279         -(NB:  The constant 32 is added to each score to keep it from going
   280         -negative in case the edit distance is zero.)  In this way, frequently
   281         -used words get a slightly lower cost which tends to move them toward
   282         -the top of the list of alternative spellings.
   283         -
   284         -A straightforward implementation of a spelling corrector would be
   285         -to compare the search term against every word in the vocabulary
   286         -and select the 20 with the lowest scores.  However, there will 
   287         -typically be hundreds of thousands or millions of words in the
   288         -vocabulary, and so this approach is not fast enough.
   289         -
   290         -Suppose the term that is being spell-corrected is X.  To limit
   291         -the search space, X is converted to a k2-like key using the
   292         -equivalent of:
   293         -
   294         -<blockquote><pre>
   295         -   key = spellfix1_phonehash(lower(spellfix1_translit(X)))
   296         -</pre></blockquote>
   297         -
   298         -This key is then limited to "scope" characters.  The default scope
   299         -value is 4, but an alternative scope can be specified using the
   300         -"scope=N" term in the WHERE clause.  After the key has been truncated,
   301         -the edit distance is run against every term in the vocabulary that
   302         -has a k2 value that begins with the abbreviated key.
   303         -
   304         -For example, suppose the input word is "Paskagula".  The phonetic 
   305         -key is "BACACALA" which is then truncated to 4 characters "BACA".
   306         -The edit distance is then run on the 4980 entries (out of
   307         -272,597 entries total) of the vocabulary whose k2 values begin with
   308         -BACA, yielding "Pascagoula" as the best match.
   309         -
   310         -Only terms of the vocabulary with a matching langid are searched.
   311         -Hence, the same table can contain entries from multiple languages
   312         -and only the requested language will be used.  The default langid
   313         -is 0.
   314         -
   315         -<h2>Configurable Edit Distance</h2>
   316         -
   317         -The built-in Wagner edit-distance function with fixed weights can be
   318         -replaced by the [./editdist3.wiki | editdist3()] edit-distance function
   319         -with application-defined weights and support for unicode, by specifying
   320         -the "edit_cost_table=<i>TABLENAME</i>" parameter to the spellfix1 module
   321         -when the virtual table is created.
   322         -For example:
   323         -
   324         -<blockquote><pre>
   325         -CREATE VIRTUAL TABLE demo2 USING spellfix1(edit_cost_table=APPCOST);
   326         -</pre></blockquote>
   327         -
   328         -In the example above, the APPCOST table would be interrogated to find
   329         -the edit distance coefficients.  It is the presence of the "edit_cost_table="
   330         -parameter to the spellfix1 module name that causes editdist3() to be used
   331         -in place of the built-in edit distance function.
   332         -
   333         -The edit distance coefficients are normally read from the APPCOST table
   334         -once and there after stored in memory.  Hence, run-time changes to the
   335         -APPCOST table will not normally effect the edit distance results.
   336         -However, inserting the special string 'reset' into the "command" column of the
   337         -virtual table causes the edit distance coefficients to be reread the
   338         -APPCOST table.  Hence, applications should run a SQL statement similar
   339         -to the following when changes to the APPCOST table occur:
   340         -
   341         -<blockquote>
   342         -INSERT INTO demo2(command) VALUES('reset');
   343         -</blockquote>
   344         -
   345         -The tables used for edit distance costs can be changed using a command
   346         -like the following:
   347         -
   348         -<blockquote>
   349         -INSERT INTO demo2(command) VALUES('edit_cost_table=APPCOST2');
   350         -</blockquote>
   351         -
   352         -In the example above, any prior edit distance costs would be discarded and
   353         -all future queries would use the costs found in the APPCOST2 table.  If the
   354         -name of the table specified by the "edit_cost_table" command is "NULL", then
   355         -theh built-in Wagner edit-distance function will be used instead of the
   356         -editdist3() function in all future queries.
   357         -
   358         -<h2>Dealing With Unusual And Difficult Spellings</h2>
   359         -
   360         -The algorithm above works quite well for most cases, but there are
   361         -exceptions.  These exceptions can be dealt with by making additional
   362         -entries in the virtual table using the "soundslike" column.
   363         -
   364         -For example, many words of Greek origin begin with letters "ps" where
   365         -the "p" is silent.  Ex:  psalm, pseudonym, psoriasis, psyche.  In
   366         -another example, many Scottish surnames can be spelled with an
   367         -initial "Mac" or "Mc".  Thus, "MacKay" and "McKay" are both pronounced
   368         -the same.
   369         -
   370         -Accommodation can be made for words that are not spelled as they
   371         -sound by making additional entries into the virtual table for the
   372         -same word, but adding an alternative spelling in the "soundslike"
   373         -column.  For example, the canonical entry for "psalm" would be this:
   374         -
   375         -<blockquote><pre>
   376         -  INSERT INTO demo(word) VALUES('psalm');
   377         -</pre></blockquote>
   378         -
   379         -To enhance the ability to correct the spelling of "salm" into
   380         -"psalm", make an addition entry like this:
   381         -
   382         -<blockquote><pre>
   383         -  INSERT INTO demo(word,soundslike) VALUES('psalm','salm');
   384         -</pre></blockquote>
   385         -
   386         -It is ok to make multiple entries for the same word as long as
   387         -each entry has a different soundslike value.  Note that if no
   388         -soundslike value is specified, the soundslike defaults to the word
   389         -itself.
   390         -
   391         -Listed below are some cases where it might make sense to add additional
   392         -soundslike entries.  The specific entries will depend on the application
   393         -and the target language.
   394         -
   395         -  *   Silent "p" in words beginning with "ps":  psalm, psyche
   396         -
   397         -  *   Silent "p" in words beginning with "pn":  pneumonia, pneumatic
   398         -
   399         -  *   Silent "p" in words beginning with "pt":  pterodactyl, ptolemaic
   400         -
   401         -  *   Silent "d" in words beginning with "dj":  djinn, Djikarta
   402         -
   403         -  *   Silent "k" in words beginning with "kn":  knight, Knuthson
   404         -
   405         -  *   Silent "g" in words beginning with "gn":  gnarly, gnome, gnat
   406         -
   407         -  *   "Mac" versus "Mc" beginning Scottish surnames
   408         -
   409         -  *   "Tch" sounds in Slavic words:  Tchaikovsky vs. Chaykovsky
   410         -
   411         -  *   The letter "j" pronounced like "h" in Spanish:  LaJolla
   412         -
   413         -  *   Words beginning with "wr" versus "r":  write vs. rite
   414         -
   415         -  *   Miscellanous problem words such as "debt", "tsetse",
   416         -      "Nguyen", "Van Nuyes".
   417         -
   418         -<h2>Auxiliary Functions</h2>
   419         -
   420         -The source code module that implements the spellfix1 virtual table also
   421         -implements several SQL functions that might be useful to applications
   422         -that employ spellfix1 or for testing or diagnostic work while developing
   423         -applications that use spellfix1.  The following auxiliary functions are
   424         -available:
   425         -
   426         -<blockquote><dl>
   427         -<dt><b>editdist3(P,W)<br>editdist2(P,W,L)<br>editdist3(T)</b><dd>
   428         -These routines provide direct access to the version of the Wagner
   429         -edit-distance function that allows for application-defined weights
   430         -on edit operations.  The first two forms of this function compare
   431         -pattern P against word W and return the edit distance.  In the first
   432         -function, the langid is assumed to be 0 and in the second, the
   433         -langid is given by the L parameter.  The third form of this function
   434         -reloads edit distance coefficience from the table named by T.
   435         -
   436         -<dt><b>spellfix1_editdist(P,W)</b><dd>
   437         -This routine provides access to the built-in Wagner edit-distance
   438         -function that uses default, fixed costs.  The value returned is
   439         -the edit distance needed to transform W into P.
   440         -
   441         -<dt><b>spellfix1_phonehash(X)</b><dd>
   442         -This routine constructs a phonetic hash of the pure ascii input word X
   443         -and returns that hash.  This routine is used internally by spellfix1 in
   444         -order to transform the K1 column of the shadow table into the K2
   445         -column.
   446         -
   447         -<dt><b>spellfix1_scriptcode(X)</b><dd>
   448         -Given an input string X, this routine attempts to determin the dominant
   449         -script of that input and returns the ISO-15924 numeric code for that
   450         -script.  The current implementation understands the following scripts:
   451         -<ul>
   452         -<li> 215 - Latin
   453         -<li> 220 - Cyrillic
   454         -<li> 200 - Greek
   455         -</ul>
   456         -Additional language codes might be added in future releases.
   457         -
   458         -<dt><b>spellfix1_translit(X)</b><dd>
   459         -This routine transliterates unicode text into pure ascii, returning
   460         -the pure ascii representation of the input text X.  This is the function
   461         -that is used internally to transform vocabulary words into the K1
   462         -column of the shadow table.
   463         -
   464         -</dl></blockquote>