[Pgbigm-hackers] Increasing Indexed Column Size

アーカイブの一覧に戻る

Fujii Masao masao****@gmail*****
2015年 3月 20日 (金) 01:42:30 JST


On Wed, Mar 18, 2015 at 5:19 AM, Fujii Masao <masao****@gmail*****> wrote:
> On Tue, Mar 17, 2015 at 5:53 PM, Sawada Masahiko <sawad****@gmail*****> wrote:
>> On Mon, Mar 16, 2015 at 4:10 PM, Beena Emerson <memis****@gmail*****> wrote:
>>> Hello,
>>>
>>> In the functions generate_bigm and generate_wildcard_bigm, the current
>>> formula used to allocate space for the generated bigrams is picked up from
>>> pg_trgm:
>>> bgm = (BIGM *) palloc(VARHDRSZ + sizeof(bigm) * (Size) (slen / 2 + 1) * 3);
>>> This limits the current indexed column size to 43 MB.
>>>
>>> However, this formula allocates more memory than required.
>>> Consider a string with w words of varying length (w - 1 special characters
>>> between words).
>>> Slen = Sum(word_length) + w - 1 ....(1)
>>> Number of bigrams generated by 1 word = word_length + 1 bigrams.
>>>
>>> For the whole string the total number of bigrams generated
>>> = (word_length1 + 1)  + (word_length2 + 1) .. .for w words.
>>> = sum(word_length) + w
>>> = (slen - w + 1) + w   ...[from equation 1]
>>> = slen + 1
>>>
>>> Hence the maximum number of bigrams generated by  a string is slen + 1.
>
> This formula is true only when LPADDING = 1 and RPADDING = 1 basically.
> That is, if they are changed, the formula also needs to be updated. I think
> we should add the comment about this fact.
>
> Don't we need to check whether palloc() can allocate "slen + 1" memory or not
> before actually trying to call the palloc()? You can see that pg_trgm has this
> kind of safeguard, and an error "out of memory" is emit if the requested size
> is too large and palloc() cannnot allocate such memory.
>
>>>
>>> The attached patches update the code and documentation
>>> 01_update formula.patch - This updates the formula used in the functions
>>> generate_bigm and generate_wildcard_bigm. This increases the indexed column
>>> size to 64 MB.
>>>
>>> 02_update_datatype.patch - Updates the data type of bytelen in the struct
>>> bigm from int to int8. This further increases the indexed column size to
>>> 102MB.
>>>
>>> 03_update_documentation.patch - Updates the indexed column size values in
>>> html/pg_bigm_en-1-1.html file.
>>>
>>
>> Thank you for your patch!
>>
>> These are good patch for making relax the pg_bigm restriction.
>> Also, 02_update_datatype.patch reduces GIN index size.
>
> Oh, really? I was thinking that GIN key entry is stored in text (i.e., varlena)
> data type, so changing the data type of the field in bigm struct doesn't
> affect the GIN key entry directly at all.
>
>> For more increasing indexed column size, when allocation memory size
>> exceeds limit we can have following two ways,
>>
>> 1.  Separate the input string into several strings, and concatenate
>> bigms after generating each bigm.
>> 2.  Recalculate multi-byte word length, and re-try to allocate memory.
>>      (Currently pg_bigm always use byte length, but multi-byte word
>> could be more than 1 bytes.)
>>
>> #1 eliminates the restrictions, #2 would more increase indexed column size.
>> But pg_bigm allocates memory using by traditional way at first time
>> for performance reason.
>>
>> Thought?
>
> What about the third one? ;) The idea is to change the bigm struct as follows.
>
> typedef struct
> {
>                 bool    pmatch;
>                 int8    bytelen;
>                 char    *str;
> } bigm;
>
> Currently every two characters are stored in the bigm struct directly.
> But ISTM that we can change the struct so that it stores just the pointer to
> two characters instead because whole characters are already stored in
> palloc()'d memory area.
>
> Which versions should we apply these changes? I'm thinking to do that only on
> the master because they (i.e., alleviating the column size limitation) look new
> features. OTOH, Beena's second patch (i.e., change from int to int8) is really
> simple, small but very useful. So I'm tempted to apply that to even older
> versions. Thought?

On second thought, the first patch is also simple but quite useful. So I feel
inclined to apply it to all supported versions rather than only master. Thought?

Regards,

-- 
Fujii Masao




Pgbigm-hackers メーリングリストの案内
アーカイブの一覧に戻る