Hans Zeller <hans.zeller@hp.com> committed on 14 Apr 15
Fix for bug 1442932 and bug 1442966, encoding for varchar

Submitting this before finishing regressions on workstation, in the
interest of time.

Key encodings for VARCHAR values used to put a varchar length indicator
in front of the encoded value. The value written was the maximum length
of the varchar, and the indicator was 2 or 4 bytes long, depending on
the length of the indicator in the source field. That length used to
depend only on the maximum number of bytes in the field: for more than
32767 bytes we would use a 4-byte VC length indicator.

Now, with the introduction of long rows, the varchar indicator length
for varchars in aligned rows is always 4 bytes, regardless of the
character length. This causes a problem for the key encoding.
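
As an illustration only, the two sizing rules described above can be written side by side. The function names below are hypothetical and are not taken from the Trafodion sources. The sketch assumes the old rule is purely a function of the maximum byte length and that aligned rows always use 4 bytes, as stated in this message. The two rules disagree for every field of at most 32767 bytes, which is presumably where the key encoding ran into trouble.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: the older rule, where the indicator size depends
// only on the maximum number of bytes in the field.
inline std::size_t legacyVcIndLen(std::uint32_t maxBytes)
{
  assert(maxBytes > 0);  // a zero-length varchar field is not considered here
  return (maxBytes > 32767) ? 4 : 2;
}

// Hypothetical helper: the rule for varchars in aligned rows after the
// introduction of long rows, always 4 bytes.
inline std::size_t alignedRowVcIndLen(std::uint32_t /*maxBytes*/)
{
  return 4;
}

// True when the two rules give different indicator sizes for a field.
inline bool vcIndLenRulesDisagree(std::uint32_t maxBytes)
{
  return legacyVcIndLen(maxBytes) != alignedRowVcIndLen(maxBytes);
}
```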

We could have computed the encoded VC indicator length from the field
length. Anoop suggested a better solution: do not include the VC
indicator at all, since it is unnecessary. Note that for HBase row
keys stored on disk, we already remove the VC indicator by converting
such keys from varchar to fixed char. Therefore, the issue happens
only for encodings needed within a query, for example when sorting or
in a merge join or union.
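
A minimal sketch of why the indicator can be dropped, under the assumption that the encoded value is padded with blanks to the maximum length (as in a varchar to fixed char conversion). The names encodeVarcharKey and compareKeys are hypothetical and do not exist in the code base. Every key for a given column then has the same width, so a plain byte comparison works without a length prefix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical sketch: encode a VARCHAR value as a fixed-width key of
// maxLen bytes, blank padded, with no VC length indicator in front.
std::vector<unsigned char> encodeVarcharKey(const std::string &value,
                                            std::size_t maxLen)
{
  assert(value.size() <= maxLen);
  std::vector<unsigned char> key(maxLen, ' ');
  if (!value.empty())
    std::memcpy(key.data(), value.data(), value.size());
  return key;
}

// Compare two encoded keys of the same column byte by byte.
// Returns <0, 0 or >0, like memcmp.
int compareKeys(const std::vector<unsigned char> &a,
                const std::vector<unsigned char> &b)
{
  assert(a.size() == b.size());
  if (a.empty())
    return 0;
  return std::memcmp(a.data(), b.data(), a.size());
}
```

Because the buffer width is known from the column definition, the consumer of the key (sort, merge join, union) never needs to read a length from it.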

Description of the fix:

1. Change CompEncode::synthType not to include the VC length
  indicator in the encoded buffer. This change also includes
  some minor code clean-up.

2. Change the assert in CompEncode::codeGen so that it no longer
  includes the VC indicator length.

3. Changes in ex_function_encode::encodeKeyValue():
  a) Read 2 and 4 byte VC length indicators for VARCHAR/NVARCHAR.
  b) Small code cleanup: don't copy the buffer for case-insensitive
     encode, since that is not necessary.
  c) Don't write the max length as VC length indicator into the
     target, and adjust target offsets accordingly (for
     VARCHAR/NVARCHAR).

4. Other changes in sql/exp/exp_function.cpp:
  d) Handle 2 and 4 byte VC length indicators in the hash function
     and the Hive hash function (problems unrelated to the LP bugs
     fixed here).
  e) Add some asserts for cases where we assume the VC length
     indicator is a 2 byte integer.
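
For illustration, reading a VC length indicator that may be 2 or 4 bytes long (items 3a and 4d) could look roughly like the sketch below. This is not the code in exp_function.cpp. The helper names readVcLen and vcData are hypothetical, and the sketch assumes the indicator is stored in native byte order directly in front of the data bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: read the actual varchar length from a field that
// starts with a 2 or 4 byte length indicator in native byte order.
inline std::uint32_t readVcLen(const char *field, std::size_t vcIndLen)
{
  assert(vcIndLen == 2 || vcIndLen == 4);
  if (vcIndLen == 2)
    {
      std::uint16_t len16;
      std::memcpy(&len16, field, sizeof(len16));  // avoids unaligned access
      return len16;
    }
  std::uint32_t len32;
  std::memcpy(&len32, field, sizeof(len32));
  return len32;
}

// Hypothetical helper: pointer to the first data byte after the indicator.
inline const char *vcData(const char *field, std::size_t vcIndLen)
{
  return field + vcIndLen;
}
```

A caller such as an encode or hash routine would pass the indicator length taken from the source attribute, rather than assuming 2 bytes.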

CompDecode is not yet changed. Filed bug 1444134 to do that for
the next release, since that change is less urgent.

Patch set 2: Copyright notice changes only.

Patch set 3: Updated expected regression test file that
             prints out encoded key in hex.

Change-Id: Idab3ed488f8c1b9aabedba4689bfb8d7286b9538
