Fujii Masao
masao****@gmail*****
Fri Mar 20 01:42:30 JST 2015
On Wed, Mar 18, 2015 at 5:19 AM, Fujii Masao <masao****@gmail*****> wrote:
> On Tue, Mar 17, 2015 at 5:53 PM, Sawada Masahiko <sawad****@gmail*****> wrote:
>> On Mon, Mar 16, 2015 at 4:10 PM, Beena Emerson <memis****@gmail*****> wrote:
>>> Hello,
>>>
>>> In the functions generate_bigm and generate_wildcard_bigm, the current
>>> formula used to allocate space for the generated bigrams is picked up
>>> from pg_trgm:
>>>
>>>     bgm = (BIGM *) palloc(VARHDRSZ + sizeof(bigm) * (Size) (slen / 2 + 1) * 3);
>>>
>>> This limits the indexed column size to 43 MB.
>>>
>>> However, this formula allocates more memory than required. Consider a
>>> string of w words of varying length, with w - 1 special characters
>>> between the words:
>>>
>>>     slen = sum(word_length) + w - 1    ...(1)
>>>
>>> One word of length word_length generates word_length + 1 bigrams.
>>>
>>> For the whole string, the total number of bigrams generated
>>>     = (word_length1 + 1) + (word_length2 + 1) + ...  for w words
>>>     = sum(word_length) + w
>>>     = (slen - w + 1) + w    ...[from equation (1)]
>>>     = slen + 1
>>>
>>> Hence the maximum number of bigrams generated by a string is slen + 1.
>
> This formula holds only when LPADDING = 1 and RPADDING = 1. That is, if
> those settings are changed, the formula also needs to be updated. I think
> we should add a comment about this fact.
>
> Don't we need to check whether palloc() can allocate "slen + 1" entries
> before actually calling palloc()? You can see that pg_trgm has this kind
> of safeguard: an "out of memory" error is emitted if the requested size
> is too large for palloc() to allocate.
>
>>> The attached patches update the code and documentation.
>>>
>>> 01_update_formula.patch - Updates the formula used in the functions
>>> generate_bigm and generate_wildcard_bigm. This increases the indexed
>>> column size limit to 64 MB.
>>>
>>> 02_update_datatype.patch - Changes the data type of bytelen in the
>>> struct bigm from int to int8. This further increases the indexed column
>>> size limit to 102 MB.
>>>
>>> 03_update_documentation.patch - Updates the indexed column size values
>>> in the html/pg_bigm_en-1-1.html file.
>>
>> Thank you for your patch!
>>
>> These patches nicely relax the pg_bigm size restriction.
>> Also, 02_update_datatype.patch reduces the GIN index size.
>
> Oh, really? I was thinking that a GIN key entry is stored as a text
> (i.e., varlena) value, so changing the data type of a field in the bigm
> struct doesn't affect the GIN key entry directly at all.
>
>> To increase the indexed column size even further when the allocation
>> size exceeds the limit, we have the following two options:
>>
>> 1. Split the input string into several strings, and concatenate the
>>    bigms after generating them for each piece.
>> 2. Recalculate the length in characters rather than bytes, and retry
>>    the memory allocation. (Currently pg_bigm always uses the byte
>>    length, but a multi-byte character occupies more than one byte.)
>>
>> #1 eliminates the restriction entirely; #2 would increase the indexed
>> column size further. Either way, pg_bigm would still allocate memory in
>> the traditional way on the first attempt, for performance reasons.
>>
>> Thoughts?
>
> What about a third option? ;) The idea is to change the bigm struct as
> follows:
>
>     typedef struct
>     {
>         bool  pmatch;
>         int8  bytelen;
>         char *str;
>     } bigm;
>
> Currently every two characters are stored in the bigm struct directly.
> But ISTM that we can change the struct so that it stores just a pointer
> to the two characters instead, because the whole string is already
> stored in a palloc()'d memory area.
>
> Which versions should we apply these changes to? I'm thinking to do that
> only on master, because they (i.e., alleviating the column size
> limitation) look like new features. OTOH, Beena's second patch (i.e.,
> the change from int to int8) is really simple, small, but very useful.
> So I'm tempted to apply it to even older versions. Thoughts?

On second thought, the first patch is also simple but quite useful. So I
feel inclined to apply it to all supported versions rather than only
master. Thoughts?

Regards,

-- 
Fujii Masao