Fujii Masao
masao****@gmail*****
2013年 10月 3日 (木) 19:36:34 JST
On Thu, Oct 3, 2013 at 5:53 PM, Beena Emerson <memis****@gmail*****> wrote: > On Thu, Oct 3, 2013 at 2:11 PM, Fujii Masao <masao****@gmail*****> wrote: >> >> On Thu, Oct 3, 2013 at 3:48 PM, Beena Emerson <memis****@gmail*****> >> wrote: >> > On Thu, Oct 3, 2013 at 7:26 AM, Fujii Masao <masao****@gmail*****> >> > wrote: >> >> >> >> On Mon, Sep 30, 2013 at 12:34 PM, Amit Langote >> >> <amitl****@gmail*****> >> >> wrote: >> >> > On Mon, Sep 30, 2013 at 12:22 PM, Fujii Masao <masao****@gmail*****> >> >> > wrote: >> >> >> On Sat, Sep 28, 2013 at 6:16 PM, Beena Emerson >> >> >> <memis****@gmail*****> wrote: >> >> >>> Please find attached the updated patch to rebase against the new >> >> >>> HEAD >> >> >>> and >> >> >>> also implements the comments I had given before. >> >> >> >> I extracted the similarity function code from the patch as the separate >> >> one >> >> so that we can more easily review and commit it. I'd like to work on >> >> this >> >> first. >> >> After committing it, I'd like to work on the remaining similarity >> >> search >> >> code. >> >> >> >> Attached patch just implements pg_bigm version of similarity function. >> >> This is in WIP yet. The description of bigm_similarity function must be >> >> added into the document. The regression test must be updated. >> >> >> >> >> >> While reviewing the similartiy function, I found that there is one big >> >> problem >> >> in bigm_similarity(). That is, bigm_similarity() is case-sensitive, but >> >> pg_trgm >> >> version of similarity function is not. Please see the following >> >> example: >> >> >> >> =# select similarity('wow', 'WOW'); >> >> similarity >> >> ------------ >> >> 1 >> >> (1 row) >> >> >> >> =# select bigm_similarity('wow', 'WOW'); >> >> bigm_similarity >> >> ----------------- >> >> 0 >> >> (1 row) >> >> >> >> Should we implement the *case-insensitive* bigm_similarity()? >> >> >> > >> > The pg_bigm code is case sensitive, even the show_bigm and show_trgm >> > behave >> > differently. >> > >> > =# SELECT show_bigm ('ABC'); >> > show_bigm >> > ------------------- >> > {" A",AB,BC,"C "} >> > (1 row) >> > >> > =# SELECT show_trgm ('ABC'); >> > show_trgm >> > ------------------------- >> > {" a"," ab",abc,"bc "} >> > (1 row) >> > >> > This is because of the difference in code of generate_trgm and >> > generate_bigm. >> > >> > fn generate_trgm: trgm_op.c ln 213 - 228 >> > while ((bword = find_word(eword, slen - (eword - str), &eword, >> > &charlen)) >> > != NULL) >> > { >> > #ifdef IGNORECASE >> > bword = lowerstr_with_len(bword, eword - bword); >> > bytelen = strlen(bword); >> > #else >> > bytelen = eword - bword; >> > #endif >> > >> > memcpy(buf + LPADDING, bword, bytelen); >> > >> > #ifdef IGNORECASE >> > pfree(bword); >> > #endif >> > buf[LPADDING + bytelen] = ' '; >> > buf[LPADDING + bytelen + 1] = ' '; >> > .... >> > >> > fn generate_bigm: bigm_op.c ln 247 - 253 >> > while ((bword = find_word(eword, slen - (eword - str), &eword, >> > &charlen)) >> > != NULL) >> > { >> > bytelen = eword - bword; >> > memcpy(buf + LPADDING, bword, bytelen); >> > >> > buf[LPADDING + bytelen] = ' '; >> > buf[LPADDING + bytelen + 1] = ' '; >> > .... >> > >> > Since similarity function uses this generate_bigm to get the bigrams and >> > then compare it, there is a difference in behavior. >> > >> > So the way to make similarity function case-insensitive would be to >> > change >> > generate_bigm and not the similarity code itself. Also, the change will >> > make >> > the show_bigm function behave differently. >> >> Yes, generate_bigm would need to be updated to make bigm_similarity >> case-sensitive. >> >> *From a user point of view*, bigm_similarity() and upcoming similarity >> search should be case-sensitive? If yes, we should change generate_bigm, >> but its change must not affect the behavior of the full-text search at >> all. >> >> Or we should just implement both case-sensitive and -insensitive >> bigm_similarity() and similarity search? >> >> Thought? > > > I am not sure I understood what you are trying to say. > > Are you suggesting that bigm similarity functions should be case insensitive > and the current functions and search method should remain case sensitive? If most users think that the similarity search should be case-sensitive, we should just implement case-sensitive bigm_similarity() function and similarity search feature, i.e., we don't need to change generate_bigm() function. If most users think that the similarity search should be case-insensitive as well as pg_trgm's does, we should just implement case-insensitive bigm_similarity() function and similarity search feature. In this case, as you suggested, we probably need to change generate_bigm() function, but which might break the current behavior of normal full-text search. I think that we should avoid such a change the current users. So, we need to change generate_bigm() function in the way that it doesn't affect the current behavior of normal full-text search. Regards, -- Fujii Masao