Adding similarity() and similarity_op(), '%' to pg_bigm (Pgbigm-hackers) - pg_bigm(ピージーバイグラム)

On Thu, Oct 3, 2013 at 5:53 PM, Beena Emerson <memis****@gmail*****> wrote:
> On Thu, Oct 3, 2013 at 2:11 PM, Fujii Masao <masao****@gmail*****> wrote:
>>
>> On Thu, Oct 3, 2013 at 3:48 PM, Beena Emerson <memis****@gmail*****>
>> wrote:
>> > On Thu, Oct 3, 2013 at 7:26 AM, Fujii Masao <masao****@gmail*****>
>> > wrote:
>> >>
>> >> On Mon, Sep 30, 2013 at 12:34 PM, Amit Langote
>> >> <amitl****@gmail*****>
>> >> wrote:
>> >> > On Mon, Sep 30, 2013 at 12:22 PM, Fujii Masao <masao****@gmail*****>
>> >> > wrote:
>> >> >> On Sat, Sep 28, 2013 at 6:16 PM, Beena Emerson
>> >> >> <memis****@gmail*****> wrote:
>> >> >>> Please find attached the updated patch to rebase against the new
>> >> >>> HEAD
>> >> >>> and
>> >> >>> also implements the comments I had given before.
>> >>
>> >> I extracted the similarity function code from the patch as the separate
>> >> one
>> >> so that we can more easily review and commit it. I'd like to work on
>> >> this
>> >> first.
>> >> After committing it, I'd like to work on the remaining similarity
>> >> search
>> >> code.
>> >>
>> >> Attached patch just implements pg_bigm version of similarity function.
>> >> This is in WIP yet. The description of bigm_similarity function must be
>> >> added into the document. The regression test must be updated.
>> >>
>> >>
>> >> While reviewing the similartiy function, I found that there is one big
>> >> problem
>> >> in bigm_similarity(). That is, bigm_similarity() is case-sensitive, but
>> >> pg_trgm
>> >> version of similarity function is not. Please see the following
>> >> example:
>> >>
>> >> =# select similarity('wow', 'WOW');
>> >>  similarity
>> >> ------------
>> >>           1
>> >> (1 row)
>> >>
>> >> =# select bigm_similarity('wow', 'WOW');
>> >>  bigm_similarity
>> >> -----------------
>> >>                0
>> >> (1 row)
>> >>
>> >> Should we implement the *case-insensitive* bigm_similarity()?
>> >>
>> >
>> > The pg_bigm code is case sensitive, even the show_bigm and show_trgm
>> > behave
>> > differently.
>> >
>> > =# SELECT show_bigm ('ABC');
>> >      show_bigm
>> > -------------------
>> >  {" A",AB,BC,"C "}
>> > (1 row)
>> >
>> > =# SELECT show_trgm ('ABC');
>> >         show_trgm
>> > -------------------------
>> >  {"  a"," ab",abc,"bc "}
>> > (1 row)
>> >
>> > This is because of the difference in code of generate_trgm and
>> > generate_bigm.
>> >
>> > fn generate_trgm: trgm_op.c ln 213 - 228
>> >  while ((bword = find_word(eword, slen - (eword - str), &eword,
>> > &charlen))
>> > != NULL)
>> >         {
>> > #ifdef IGNORECASE
>> >                 bword = lowerstr_with_len(bword, eword - bword);
>> >                 bytelen = strlen(bword);
>> > #else
>> >                 bytelen = eword - bword;
>> > #endif
>> >
>> >                 memcpy(buf + LPADDING, bword, bytelen);
>> >
>> > #ifdef IGNORECASE
>> >                 pfree(bword);
>> > #endif
>> >                 buf[LPADDING + bytelen] = ' ';
>> >                 buf[LPADDING + bytelen + 1] = ' ';
>> > ....
>> >
>> > fn generate_bigm: bigm_op.c ln 247 - 253
>> >  while ((bword = find_word(eword, slen - (eword - str), &eword,
>> > &charlen))
>> > != NULL)
>> >         {
>> >                 bytelen = eword - bword;
>> >                 memcpy(buf + LPADDING, bword, bytelen);
>> >
>> >                 buf[LPADDING + bytelen] = ' ';
>> >                 buf[LPADDING + bytelen + 1] = ' ';
>> > ....
>> >
>> > Since similarity function uses this generate_bigm to get the bigrams and
>> > then compare it, there is a difference in behavior.
>> >
>> > So the way to make similarity function case-insensitive would be to
>> > change
>> > generate_bigm and not the similarity code itself. Also, the change will
>> > make
>> > the show_bigm function behave differently.
>>
>> Yes, generate_bigm would need to be updated to make bigm_similarity
>> case-sensitive.
>>
>> *From a user point of view*, bigm_similarity() and upcoming similarity
>> search should be case-sensitive? If yes, we should change generate_bigm,
>> but its change must not affect the behavior of the full-text search at
>> all.
>>
>> Or we should just implement both case-sensitive and -insensitive
>> bigm_similarity() and similarity search?
>>
>> Thought?
>
>
>  I am not sure I understood what you are trying to say.
>
> Are you suggesting that bigm similarity functions should be case insensitive
> and the current functions and search method should remain case sensitive?

If most users think that the similarity search should be case-sensitive,
we should just implement case-sensitive bigm_similarity() function and
similarity search feature, i.e., we don't need to change generate_bigm()
function.

If most users think that the similarity search should be case-insensitive
as well as pg_trgm's does, we should just implement case-insensitive
bigm_similarity() function and similarity search feature. In this case,
as you suggested, we probably need to change generate_bigm() function,
but which might break the current behavior of normal full-text search.
I think that we should avoid such a change the current users. So,
we need to change generate_bigm() function in the way that it doesn't
affect the current behavior of normal full-text search.

Regards,

-- 
Fujii Masao

pg_bigm(ピージーバイグラム)
Fork

[Pgbigm-hackers] Adding similarity() and similarity_op(), '%' to pg_bigm

pg_bigm(ピージーバイグラム) Fork

[Pgbigm-hackers] Adding similarity() and similarity_op(), '%' to pg_bigm

pg_bigm(ピージーバイグラム)
Fork