/ Check-in [6c77c2d5]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Just don't run tolower() on hi-bit characters. This shouldn't cause us to break any UTF-8 code points, unless they were already broken in the input. (CVS 3376)
Downloads: Tarball | ZIP archive | SQL archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1:6c77c2d5e15e9d3efed3e274bc93cd5a4868f574
User & Date: shess 2006-08-30 21:40:30
Context
2006-08-31
15:07
Refactor the FTS1 module so that its name is "fts1" instead of "fulltext", so that all symbols with external linkage begin with "sqlite3Fts1", and so that all filenames begin with "fts1". (CVS 3377) check-in: e1891f0d user: drh tags: trunk
2006-08-30
21:40
Just don't run tolower() on hi-bit characters. This shouldn't cause us to break any UTF-8 code points, unless they were already broken in the input. (CVS 3376) check-in: 6c77c2d5 user: shess tags: trunk
2006-08-29
18:46
Bug fix: Get INSERT INTO ... SELECT working when the target is a virtual table. (CVS 3375) check-in: 7cdc41e7 user: drh tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to ext/fts1/simple_tokenizer.c.

58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
...
130
131
132
133
134
135
136



137

138
139
140
141
142
143
144
  ** track such information in the database, then we'd only want this
  ** information on the initial create.
  */
  if( argc>1 ){
    t->zDelim = string_dup(argv[1]);
  } else {
    /* Build a string excluding alphanumeric ASCII characters */
    char zDelim[256];               /* nul-terminated, so nul not a member */
    int i, j;
    for(i=1, j=0; i<0x100; i++){
      if( i>=0x80 || !isalnum(i) ){
        zDelim[j++] = i;
      }
    }
    zDelim[j++] = '\0';
    assert( j<=sizeof(zDelim) );
    t->zDelim = string_dup(zDelim);
  }
................................................................................
  while( c->pCurrent-c->pInput<c->nBytes ){
    int n = (int) strcspn(c->pCurrent, t->zDelim);
    if( n>0 ){
      if( n+1>c->nTokenAllocated ){
        c->zToken = realloc(c->zToken, n+1);
      }
      for(ii=0; ii<n; ii++){



        c->zToken[ii] = tolower(c->pCurrent[ii]);

      }
      c->zToken[n] = '\0';
      *ppToken = c->zToken;
      *pnBytes = n;
      *piStartOffset = (int) (c->pCurrent-c->pInput);
      *piEndOffset = *piStartOffset+n;
      *piPosition = c->iToken++;







|

|
|







 







>
>
>
|
>







58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
...
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
  ** track such information in the database, then we'd only want this
  ** information on the initial create.
  */
  if( argc>1 ){
    t->zDelim = string_dup(argv[1]);
  } else {
    /* Build a string excluding alphanumeric ASCII characters */
    char zDelim[0x80];               /* nul-terminated, so nul not a member */
    int i, j;
    for(i=1, j=0; i<0x80; i++){
      if( !isalnum(i) ){
        zDelim[j++] = i;
      }
    }
    zDelim[j++] = '\0';
    assert( j<=sizeof(zDelim) );
    t->zDelim = string_dup(zDelim);
  }
................................................................................
  while( c->pCurrent-c->pInput<c->nBytes ){
    int n = (int) strcspn(c->pCurrent, t->zDelim);
    if( n>0 ){
      if( n+1>c->nTokenAllocated ){
        c->zToken = realloc(c->zToken, n+1);
      }
      for(ii=0; ii<n; ii++){
        /* TODO(shess) This needs expansion to handle UTF-8
        ** case-insensitivity.
        */
        char ch = c->pCurrent[ii];
        c->zToken[ii] = (unsigned char)ch<0x80 ? tolower(ch) : ch;
      }
      c->zToken[n] = '\0';
      *ppToken = c->zToken;
      *pnBytes = n;
      *piStartOffset = (int) (c->pCurrent-c->pInput);
      *piEndOffset = *piStartOffset+n;
      *piPosition = c->iToken++;