hans zeller <> in Trafodion

TRAFODION-13 Multiple MTD-related bug fixes, UDF and RI constraint fixes

Bug fixes for Multi-temperature Data aka MTD aka divisioning and a related bug for RI constraints. These will be converted to JIRAs soon.

Also added an Excel tool to draw query plans; it has only limited support.

LaunchPad 1466209 Referential constraint column correspondence not recorded correctly in metadata

LaunchPad 1417739 Multi-temperature data: should not allow divisioning on float datatype.

LaunchPad 1417741 multi-temperature data: divisioning on (date_part('year', add_months(...)) fails with ERROR[3044], ERROR[1135]

LaunchPad 1417743 multi-temperature data: able to define a RI constraint on a system column.

LaunchPad 1427527 Multi-temperature data: create-table-like-with-constrains-without-division table should not be divisioned

LaunchPad 1442774 TMUDF: Compiling a TPCH query with TMUDF returns internal assertion in BaseTypes.cpp:118


LaunchPad 1466209 Referential constraint column correspondence not recorded correctly in metadata

For an RI constraint, we record a list of foreign key columns in the

KEYS metadata table, as well as the UID of the referenced uniqueness

constraint. The foreign key column list is matched with the unique key

columns. We needed to add logic to reorder the lists in case the

unique constraint was specified in a different order. For example, in


alter table test020t9 add constraint test020t9ri4

foreign key (r1,r2) references test020t8 (c3,c2);

This will trigger the reorder logic to convert it to

foreign key (r2,r1) references test020t8 (c2,c3)

to match the index on test020t8 (c2,c3).
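The reordering described above can be sketched as follows. This is a minimal illustration, not the code in this change: `reorderToUniqueKey` and `ColumnPair` are hypothetical names, and the sketch assumes that every unique key column appears exactly once in the REFERENCES list.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// (foreign key column, referenced column)
using ColumnPair = std::pair<std::string, std::string>;

// Hypothetical helper: pair each foreign key column with the referenced
// column it corresponds to, then emit the pairs in the order in which the
// referenced columns appear in the unique key (index).
std::vector<ColumnPair> reorderToUniqueKey(
    const std::vector<std::string>& foreignCols,
    const std::vector<std::string>& referencedCols,
    const std::vector<std::string>& uniqueKeyCols)
{
  assert(foreignCols.size() == referencedCols.size());
  assert(referencedCols.size() == uniqueKeyCols.size());

  std::vector<ColumnPair> result;
  for (const std::string& keyCol : uniqueKeyCols) {
    // find the position of this key column in the REFERENCES list
    auto it = std::find(referencedCols.begin(), referencedCols.end(), keyCol);
    assert(it != referencedCols.end());  // assumption of this sketch
    const std::size_t pos =
        static_cast<std::size_t>(it - referencedCols.begin());
    result.emplace_back(foreignCols[pos], keyCol);
  }
  return result;
}
```

Called with foreign key (r1,r2), references (c3,c2) and unique key (c2,c3), the sketch returns the pairs (r2,c2) and (r1,c3), which corresponds to the reordered constraint in the example.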

LaunchPad 1417739 Multi-temperature data: should not allow divisioning on float datatype.

Added a check and a new error number 4257. Note that we do allow float

as key columns, and that could have similar issues that rounding

errors could cause us to find a row sometimes and not other times,

although those issues should be rare. Nevertheless, I prefer not to

take that chance, so divisioning on floating point columns is not

allowed (it does not make much sense anyway).

LaunchPad 1417741 multi-temperature data: divisioning on (date_part('year', add_months(...)) fails with ERROR[3044], ERROR[1135]

There are really two issues here. First, the CAST operator that is

internally used in ADD_MONTHS in the parser generates a nullable

result, since in the parser we don't know whether the result is

nullable or not. We need to fix that in the binder to avoid the error

1135. Second, there were some cases where the interval leading

precision of a data type was not initialized correctly (error 3044).

LaunchPad 1417743 multi-temperature data: able to define a RI constraint on a system column.

This uncovered a bigger issue in the integration between MTD and RI

constraints. We need to ignore salt and divisioning columns in the

keys we consider, since these columns are redundant and not

user-visible. Added code to do that and while working on it, also

found the first bug in this list.

LaunchPad 1427527 Multi-temperature data: create-table-like-with-constrains-without-division table should not be divisioned

A typo in the parser code led us to ignore the WITHOUT DIVISION clause.

LaunchPad 1442774 TMUDF: Compiling a TPCH query with TMUDF returns internal assertion in BaseTypes.cpp:118

Used incorrect set of values in preCodeGen.

Costing and statistics compiler interfaces for UDFs

blueprint cmp-tmudf-compile-time-interface

bug 1433192

This change adds compiler interfaces for UDFs that give information

about statistics of the result table and also a cost estimate. It also

has more code for the upcoming Java UDF feature, retrieving updated

invocation infos and returning them back to the executor/compiler C++


Description of the changes in more detail:

- Addressed remaining review comments from my last checkin,

- Make sure that user-generated exceptions during deallocation of

a routine are reported. This happens in the destructor of the

object derived from tmudr::UDR. For Java, we may need a deallocate


- Java and JNI code to serialize the updated UDRInvocationInfo and

UDRPlanInfo object after calling the user code and return them back

through the JNI interface to the calling C++ code.

- The cost method source files had some inline methods defined in

the .cpp file and used an include file that included other .cpp

files. Make didn't pick up changes made in these files. Removed

this code and changed it to regular methods and inlines.

- Replaced some Context * parameters in costing with PlanWorkSpace *,

to be able to get to UDF-related info that's stored in a special


- Changed the behavior of isBigMemoryOperator() for TMUDFs. If the

UDF writer specifies the DoP for the UDF invocation, then consider

it a BMO.

- If possible, synthesize the HASH2 partitioning function of a TMUDF's

child as the partitioning function of the UDF. This can be done if

the partitioning key gets passed through the UDF.

- Statistics interface for TMUDFs:

- TMUDF now populates statistics field in the UDRInvocationInfo

object and calls the describeStatistics() method.

- Added an estimated # of partitions for partitioned input tables

of TMUDFs. Also changed row count methods to "estimated" row count.

- Added code to incorporate the information on row count and UEC

provided by the UDF writer into statistics of the TMUDF. This code

is not well suited to serve as the default implementation

of describeStatistics(). Therefore, the default implementation of

describeStatistics() does nothing, but the compiler applies some

heuristics in case the UDF writer provides no statistics.

- Changed cost method for TMUDFs to incorporate an estimated cost

per row from the UDF writer. There is no special compiler interface

call to ask for the cost; it can be set from the

describeDesiredDegreeOfParallelism() call and, once supported, from

the describePlanProperties() call. Note that we don't have immediate

plans to support describePlanProperties(); that might come after 2.0.

Patch Set 3: Addressed Dave's review comments.

Patch Set 4: Fixed misplaced copyright in expected file.

Change-Id: Ia9ae076b7ae1fc2968c3d253d6d2d0e1d9a2ea40

Using the language manager for UDF compiler interface

blueprint cmp-tmudf-compile-time-interface

This change includes new CLI calls, to be used in the compiler to

invoke routines. Right now, only trusted routines are supported,

executed in the same process as the caller, but in the future we may

extend this to isolated routines. Using a CLI call allows us to share

the language manager between compiler and executor, since language

manager resources such as the JVM and loaded DLLs exist only once per

process. This change is in preparation for Java UDFs.

Changes in a bit more detail:

- Added 4 new CLI calls to allocate a routine, invoke it, retrieve

updated invocation and plan infos and deallocate (put) the routine.

The CLI globals now have a C/C++ and a Java language manager that

is allocated on demand.

- The compiler no longer loads a DLL for the UDF compiler interface,

it uses the new CLI calls instead.

- DDL syntax is changed to allow TMUDFs in Java (not officially

supported, so don't use it quite yet).

- TMUDFs in C are no longer supported, only C++ and Java are.

Converted remaining TMUDF tests to C++.

- C++ TMUDFs now do a basic verification at DDL time, so errors

like missing entry points are detected earlier. Validation for

Java TMUDFs is also done through the CLI.

- Make sure we have no memory or resource leaks:

- CmpContext keeps track of UDF-related objects allocated on

system heap and in the CLI, cleaned up at the end of a statement

- CLI keeps a list of allocated trusted routines, cleaned up

when a CLI context is deallocated

- Using ExeCliInterface class to make the new CLI calls (4 new calls


- Removed CmpCli class in the optimizer directory and converted

tracking compiler to use ExeCliInterface as well.

- Compile-time parameter values are no longer baked into the

UDRInvocationInfo. Instead, they are provided as an input row, the

same way as they are provided at runtime.

- Bug fixes in C++ UDR code, mostly related to serialization and

to multiple interactions with the UDF through serialized objects.

- Added more info to UDRInvocationInfo (SQL access type, etc.).

- Since there are multiple plans per invocation, each of which

can have multiple interactions with the UDF, plans need to be

numbered so the UDF side can tell them apart to attach the

right state (owned by the UDF) to it.

- The language manager needs some functions that are provided by

the process it's running in. Added those (empty, for now) functions

as cli/CliImplLmExtFunc.cpp.

- Added a new class for Java TMUDFs, LmRoutineJavaObj. Added methods

to allocate such routines and to load their class as well as to

create Java objects by invoking the default constructor through JNI.

- Java TMUDFs use the new UDR interface (to be provided by Suresh and

Pavani). In the language manager, the container is the class of

the UDF, the external path is the fully qualified jar name. The

Java method name is <init>, the default constructor, with signature

"()V". Some code changes were required to do this.

- Created a new directory trafodion/core/sql/src for Java sources in

the sql engine. Right now, only language manager java

sources are in this directory, but I am planning to move the other

java sources under sql in a future checkin. Suresh and Pavani

will add their UDF-related Java files there as well.

- Renamed the udr jar to trafodion-sql-<version>.jar, in anticipation

of combining all the sql Java sources into this jar.

- Created a maven project file trafodion/core/sql/pom.xml and

changed makefiles to invoke maven to build java sources.

- More work to separate new UDR interface from older SPInfo object,

so that we can get rid of SPInfo if/when we don't support the older

style anymore.

- Small fix to odb makefile: make clean failed when executed twice.

Patch set 2: Adding a custom filter for test regress/udr/TEST108.

Change-Id: Ic827a42ac25505fb1ee451b79636c0f9349d8841

Fix for bug 1442932 and bug 1442966, encoding for varchar

Submitting this before finishing regressions on workstation, in the

interest of time.

Key encodings for VARCHAR values used to put a varchar length indicator

in front of the encoded value. The value was the max. length of the

varchar and the indicator was 2 or 4 bytes long, depending on the

length of the indicator in the source field. That length used to

depend only on the max number of bytes in the field, for >32767

bytes we would use a 4 byte VC length indicator.

Now, with the introduction of long rows, the varchar indicator length

for varchars in aligned rows is always 4 bytes, regardless of the

character length. This causes a problem for the key encoding.

We could have computed the encoded VC indicator length from the field

length. Anoop suggested a better solution, not to include the VC

indicator at all, since that is unnecessary. Note that for HBase row

keys stored on disk, we already remove the VC indicator by converting

such keys from varchar to fixed char. Therefore, the issue happens

only for encoding needed in a query, for example when sorting or in a

merge join or union.

Description of the fix:

1. Change CompEncode::synthType not to include the VC length

indicator in the encoded buffer. This change also includes

some minor code clean-up.

2. Change the assert in CompEncode::codeGen not to include the

VC indicator length anymore.

3. Changes in ex_function_encode::encodeKeyValue():

a) Read 2 and 4 byte VC length indicators for VARCHAR/NVARCHAR.

b) Small code cleanup, don't copy buffer for case-insensitive

encode, since that is not necessary.

c) Don't write max length as VC length indicator into target

and adjust target offsets accordingly (for VARCHAR/NVARCHAR).

4. Other changes in sql/exp/exp_function.cpp:

d) Handle 2 and 4 byte VC len indicators in hash function

and Hive hash function (problems unrelated to LP bugs fixed).

e) Add some asserts for cases where we assume VC length indicator

is a 2 byte integer.
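Reading a VC length indicator that may be 2 or 4 bytes long (item 3a above) can be sketched like this. It is an illustration only: `readVcLength` is a hypothetical name, and the sketch assumes the indicator is stored in host byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: read a varchar length indicator of 2 or 4 bytes.
// Assumes host byte order; memcpy avoids unaligned reads from the buffer.
std::uint32_t readVcLength(const char* indicator, int indicatorLen)
{
  assert(indicatorLen == 2 || indicatorLen == 4);

  if (indicatorLen == 2) {
    std::uint16_t len16;
    std::memcpy(&len16, indicator, sizeof(len16));
    return len16;
  }

  std::uint32_t len32;
  std::memcpy(&len32, indicator, sizeof(len32));
  return len32;
}
```

Code that hard-wires a 2 byte read would misinterpret the first half of a 4 byte indicator, which is the kind of problem the asserts in item 4e are meant to catch.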

CompDecode is not yet changed. Filed bug 1444134 to do that for

the next release, since that change is less urgent.

Patch set 2: Copyright notice changes only.

Patch set 3: Updated expected regression test file that

prints out encoded key in hex.

Change-Id: Idab3ed488f8c1b9aabedba4689bfb8d7286b9538

Fix for bug 1441932 TMUDF: setLong() trouble handling decimal

Fixed computation of fractional part and used a temp buffer to

generate the result string, since the NUL terminator added by snprintf

could overwrite part of the next column in the record.
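A minimal sketch of the temp-buffer technique, under stated assumptions: `formatDecimalField` is a hypothetical helper, not the sqludr code, and it assumes the target is a fixed-width, right-aligned, blank-padded character field inside a record. It formats into a local buffer and copies only the characters, so the NUL terminator written by snprintf never reaches the record.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper: write a scaled integer such as value=12345, scale=2
// as "123.45" into a fixed-width field of a record. The field is NOT
// NUL-terminated. Returns false if the text does not fit into the field.
bool formatDecimalField(char* field, std::size_t fieldLen,
                        long long value, int scale)
{
  assert(scale >= 0 && scale <= 18);

  unsigned long long divisor = 1;
  for (int i = 0; i < scale; i++)
    divisor *= 10;

  // work on the magnitude so the fractional part is right for negative values
  const bool negative = value < 0;
  const unsigned long long magnitude =
      negative ? 0ULL - static_cast<unsigned long long>(value)
               : static_cast<unsigned long long>(value);

  // format into a temp buffer; snprintf appends its NUL terminator here
  char temp[64];
  int len;
  if (scale > 0)
    len = std::snprintf(temp, sizeof(temp), "%s%llu.%0*llu",
                        negative ? "-" : "",
                        magnitude / divisor, scale, magnitude % divisor);
  else
    len = std::snprintf(temp, sizeof(temp), "%s%llu",
                        negative ? "-" : "", magnitude);

  if (len < 0 || static_cast<std::size_t>(len) >= sizeof(temp) ||
      static_cast<std::size_t>(len) > fieldLen)
    return false;

  // copy only the characters, never the terminator, into the record
  const std::size_t textLen = static_cast<std::size_t>(len);
  std::memset(field, ' ', fieldLen - textLen);
  std::memcpy(field + fieldLen - textLen, temp, textLen);
  return true;
}
```

Calling snprintf directly on a 6 byte field with a 6 character result would need a 7 byte buffer, and the terminator would land on the first byte of the next column.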

Change-Id: I2e820d205d00e4f285a02c88aa0c28d74382bad5

Fix for bug 1425745: Error 8421 when using an OR predicate.

Patch set 2: Fix copyright and add comment for allocation on stmt heap.

Change-Id: Ia51c492a0507f8ff48b6fd2c547485aa6a1672ad

TIMESERIES UDF for repository queries and UDF bug fixes

Bug fixes:

bug 1436593 TMUDF: getScale() returns a wrong scale for the TIME column

bug 1400812 Name resolution for predefined table mapping functions may need to be improved

bug 1436963 TMUDF: Unsigned numeric is mapped to TypeInfo::NUMERIC

bug 1436450 TMUDF: copyPassThruData() fails to pad nchar data properly

Added a predefined UDF to do timeseries queries. "Predefined" means that

like a built-in function it is not registered in the metadata. It is still

a UDF, though, using the SDK for UDFs.

Here is how to invoke the UDF:

  select ...
  from udf(timeseries(table(select ...
                            from ...
                            [partition by ...]
                            order by <tscol>),
                      <name of time slice column>,
                      <time slice width>
                      [ { , <col name>, <instr> } ... ]
                     ));


<tscol> is a date, time or timestamp column from the input table

that describes the time dimension of the data. The data

can optionally be partitioned into multiple time series

that are independent of each other, using a PARTITION BY.

<name of time slice column> is the name of the generated output

column that contains the starting time of each time slice.

<time slice width> is an interval literal that determines how wide

each time slice is.

An optional list of pairs of <col name> and <instr> indicates

column values to be interpolated, according to the instructions.


Instruction  Value at   Interpolation  Ignore nulls
-----------  ---------  -------------  ------------
FC           beginning  constant       no
LC           end        constant       no
FCI          beginning  constant       yes
LCI          end        constant       yes
FL           beginning  linear         no
LL           end        linear         no
FLI          beginning  linear         yes
LLI          end        linear         yes


select *

from udf(timeseries(table(select cust_id,



from e_meters

partition by cust_id

order by tstamp),

'HOURLY_READING', -- name of time slice column

interval '1' hour, -- time slice width

'KWH', 'FL', -- value of KWH column at beginning

-- of time slice, use linear

-- interpolation

'KWH', 'LCI')); -- end value, constant interpolation,

-- ignore NULL values

This will chop the time range of each customer into time slices,

1 hour wide, and will use linear interpolation for the meter readings

(assume we have readings for cust1 at 8:00 for 1000 and 10:30 for 1002

and readings for cust2 at 8:00 for 400 and at 9:30 with a NULL value).


-------  -------------------  --------  -------
cust1    2015-03-20 08:00:00   1000.00     1000
cust1    2015-03-20 09:00:00   1000.80     1000
cust1    2015-03-20 10:00:00   1001.60     1002
cust2    2015-03-20 08:00:00    400.00      400
cust2    2015-03-20 09:00:00         ?      400
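The FL values for cust1 follow from ordinary linear interpolation between the two readings. A minimal sketch, assuming times are expressed as minutes since midnight (a simplification made only for this example; `interpolateLinear` is a hypothetical name, and NULL handling is not modeled):

```cpp
#include <cassert>
#include <cmath>

// Linear interpolation between two readings (t0, v0) and (t1, v1),
// evaluated at time t. All times use the same unit (here: minutes).
double interpolateLinear(double t0, double v0,
                         double t1, double v1,
                         double t)
{
  assert(t1 > t0);
  return v0 + (v1 - v0) * (t - t0) / (t1 - t0);
}
```

With readings of 1000 at 8:00 (minute 480) and 1002 at 10:30 (minute 630), the value at 9:00 (minute 540) is 1000 + 2 * 60/150 = 1000.8 and the value at 10:00 (minute 600) is 1000 + 2 * 120/150 = 1001.6.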

Other changes:

- Added DCS GUI support to install_local_hadoop. If installed with

non-standard ports, see file $MY_SQROOT/sql/scripts/swurls.html

for the port numbers to use. I would recommend that you bookmark

this file in the browser you are using locally on your workstation.

- Addressed comments made by Dave B. in earlier checkins:

- Make error message 3286 easier to understand.

- Change name resolution rules for predefined UDRs such that

real (user-defined) UDRs take precedence.

- Add comments to Trafodion engine files where some logic

is duplicated in the UDR SDK (file sqludr/sqludr.cpp and

in the future the equivalent Java file).

- Fix for "orphan entries in up queue" assert when canceling

a TMUDF while it is still reading data from its table-valued


Patch set 2: The jenkins build flagged some warnings as errors

that were not flagged on my workstation.

Change-Id: I1b806e35e2b2e91a42318fbbfd788e92d8cba070

Fix for updated tpcds_kit file.

Steve pointed out that when we download the latest TPCDS zip file with

the tools to build data files, the directory structure has

changed. Our script can now handle both versions of the file. Also

added a check for the existence of This could be missing

if we download the hadoop tar file from the Internet instead of using

one that was generated by another Trafodion developer, with a


Change-Id: Ib473ebc902063f05eee37284d208fe316424aea6

Fixes for TMUDF bugs

Bugs fixed in this change:

bug 1430034 TMUDF: processData() fails to handle several data type

bug 1430438 TMUDF: A TMUDF with nonexisting external name crashes

sqlci with a core

bug 1430453 TMUDF: A TMUDF missing 'language cpp' crashes sqlci with a core

bug 1430484 TMUDF: User defined error number for UDRException() not

returned at run time

bug 1404053 Core with SIGSEGV during CREATE PROCEDURE statement when

procedure validation fails

Summary of changes:

- Support for more data types, including char types with character set UCS2

and interval types. Unlike in other types of UDFs, intervals are

represented as integers, but that representation is not directly

exposed to the UDF writer. CLOBs and BLOBs are only supported when

they are mapped to VARCHARS.

- Support to get and set values by specifying a C++ data type of time_t

for datetime and certain interval values.

- The default language for creating UDFs is changing. It used to be

C; the new rule depends on the type of the library. If we can determine

the library to be a jar file, the default language is Java. For

other libraries, the default is C for scalar UDFs, C++ for table-valued

UDFs. Also, the PARAMETER STYLE clause is no longer used for TMUDFs; the

parameter style is computed automatically for now. PARAMETER STYLE is

now optional for Java stored procedures.

- Error handling is somewhat improved, fewer internal errors occur when

a UDR raises an exception. Still needs work.

- Making interfaces of OrderInfo and PartitionInfo more similar

by giving them similar methods.
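The new default-language rule from the summary above can be sketched as a small decision function. The names are hypothetical, and one detail is an assumption: the change does not say how a library is determined to be a jar file, so this sketch simply checks for a ".jar" suffix.

```cpp
#include <cassert>
#include <string>

enum class RoutineLanguage { C, Cpp, Java };
enum class RoutineKind { ScalarUdf, TableMappingUdf };

// Assumption of this sketch: a library counts as a jar if its file name
// ends in ".jar".
bool looksLikeJar(const std::string& libraryFile)
{
  const std::string suffix = ".jar";
  return libraryFile.size() >= suffix.size() &&
         libraryFile.compare(libraryFile.size() - suffix.size(),
                             suffix.size(), suffix) == 0;
}

// Default language when the CREATE statement has no LANGUAGE clause:
// Java for jar files, otherwise C for scalar UDFs and C++ for TMUDFs.
RoutineLanguage defaultLanguage(const std::string& libraryFile,
                                RoutineKind kind)
{
  assert(!libraryFile.empty());
  if (looksLikeJar(libraryFile))
    return RoutineLanguage::Java;
  return kind == RoutineKind::TableMappingUdf ? RoutineLanguage::Cpp
                                              : RoutineLanguage::C;
}
```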

Change-Id: I0bd0cebcf32f2185907c7d3573d4a511b17ead3e

Normalizer interface for TMUDFs, blueprint cmp-tmudf-compile-time-interface

Added new compiler interfaces:

describeParamsAndColumns has been extended to allow updating

PARTITION BY and ORDER BY clauses specified for input tables.

describeDataFlowAndPredicates() allows the TMUDF to eliminate

columns not needed by the query and to push predicates into

the TMUDF or to the children.

describeConstraints() allows the TMUDF to see cardinality and

uniqueness constraints of the table-valued inputs (children) and

to synthesize cardinality and uniqueness constraints on the

TMUDF result.

TMUDFs now have 3 function types:

GENERIC - makes most conservative assumptions in the compiler

MAPPER - assumes TMUDF carries no state between input rows

REDUCER - assumes TMUDF carries no state between input partitions

defined by PARTITION BY clause

Query id and user id are now available to the UDR.

Added doxygen documentation for the C++ UDR interface. The resulting

web page will be published on the wiki. To generate the documentation

yourself, do the following:

cd $MY_SQROOT/../sql/sqludr

doxygen doxygen_tmudr.1.6.config

# now open tmudr_1.0/html/index.html in a web browser

Patch set 2: Updated copyrights in 2 files.

Change-Id: I3735eb3dd7e5292ba308ac00332fdebdf66a7472

C++ run-time interface for TMUDFs

blueprint cmp-tmudf-compile-time-interface

- Support for C++ run-time interface:

- A new language, C++, is added to langman; the existing

LanguageManagerC handles both C and C++

- Two new parameter styles got added, C++ and Java

object-oriented parameter styles. Routines written in C++

use the new object-oriented C++ parameter style. The compiler

interface is only supported for that style (and in the future

for the Java object-oriented style).

- Also added one more compile time interface, the "completeDescription()"

call in the generator. Added logic to extract the UDRPlanInfo of

the optimal plan.

- Changes to UDRInvocationInfo and UDRPlanInfo classes:

- UDRInvocationInfo and UDRPlanInfo objects can now be serialized

and they are added to generated plans, as part of the UDR TDB.

- Split TableInfo into TupleInfo and TableInfo classes. TupleInfo

is now the common base class for describing both parameters and

input/output tables.

- TypeInfo now has offsets for data, null indicator and varchar


- New get<type> and set<type> methods on class TupleInfo, to be

used at compile time for parameters and at runtime for parameters,

input and output tables.

- Added a "call phase" member, to be able to throw exceptions when

certain methods are called at the wrong time (e.g. trying to modify

compile time members at runtime).

- Routine class in langman now has a new subclass, LmRoutineCppObj

and a new method, invokeRoutineMethod, that is used to invoke

the object-oriented methods, requiring UDRInvocationInfo and

UDRPlanInfo as parameters.

- Fixed some executor issues with error handling for UDFs; this is

still not very well supported.

- Emitting the EOD row in the UDF is no longer required, and no longer

supported or even possible.

- UDRPlanInfo is now part of the physical properties, so that we

can extract it from the optimal plan.

- Disabling TMUDF as the inner of a nested join - for now.

We might support this "routine join" at a later time.

- regress/udr/TEST001:

- SESSIONIZE_STATIC remains in C, but other TMUDFs are now

rewritten in C++ (the runtime part that was not yet in C++)

- SESSIONIZE_DYNAMIC is now the same as the example on the wiki

- regress/udr/TEST002: Added some tests for event log reader UDF,

but can't add the part that copies a sample log file, since

in Jenkins, we don't have $MY_SQROOT set. Tried the test on my

workstation, though. Steve tells me $MY_SQROOT should be available,

so in a future checkin I'll enable this code again.

- For patch set 2: Removed fix for LP bug 1420539 and addressed

other review comments.

Change-Id: I008ad68a8f25f1aaee94e1c45bbf097a267129bb

Splitting install_local_hadoop into two scripts.

This makes it possible to call the part that sets up a Hive TPC-DS

database from the Jenkins regression test environment.

Change-Id: If776fb5fb79d62450b7377be1e8a3ee1f23becbd

Fix for bug 1412983 Create table like view cores

The problem is that a view doesn't have a clustering index, and the

HBase create options hang off the clustering index. Need to skip the

HBase table options case when we are describing a view (describe is

internally used by create table like).

Change-Id: I623e45f961207c7df960c4915bdb64b1dd44c930

Fix for bug 1412642 tm.log_0_xxxx.log is not processed by the udf

These file names contain the string ".log" twice and the check

in the UDF did not handle that correctly.
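The original check is not shown in this change, so the following is only a guess at the kind of mistake involved. One plausible way to trip over a name like tm.log_0_xxxx.log is to look at the first occurrence of ".log" instead of the end of the name. Both functions below are hypothetical.

```cpp
#include <cassert>
#include <string>

// Fragile variant (hypothetical): only looks at the FIRST ".log" in the name.
bool firstLogIsSuffix(const std::string& name)
{
  const std::string ext = ".log";
  const std::string::size_type pos = name.find(ext);
  return pos != std::string::npos && pos + ext.size() == name.size();
}

// Robust variant: compares the end of the name with the suffix.
bool endsWithLog(const std::string& name)
{
  const std::string ext = ".log";
  return name.size() >= ext.size() &&
         name.compare(name.size() - ext.size(), ext.size(), ext) == 0;
}
```

For tm.log_0_xxxx.log the first variant finds ".log" at offset 2 and rejects the file, while the second variant accepts it.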

Change-Id: Id802bf76fe01207181339d4b5856581c33caf150

Fix for bug 1412630, not all logs on all nodes processed by UDF

Co-locating the tdm_udrserv process with its parent process.

Change-Id: Ia285e5b3a1ac1340c838c8f40d327d8a5203a93d

Fix for bug 1409939 create table like does not handle salt option

showddl does not display salt clause for table that was created via

'create using n partitions'.

There are really two issues here. First, the condition to salt the

target table in CREATE TABLE LIKE is too restrictive (requires a WITH

PARTITIONS clause). Fixed that. Second, we accept additional table

attributes in CREATE TABLE LIKE but we don't actually process them -

yet. Added a new error to indicate that.

Change-Id: Ia82322b4f587674257e888a080420e3d41031e31

Fix build errors for SQL compiler GUI

Recently, the compiler GUI started giving error messages during the build.

Note that this does not cause the overall build to fail.

Thanks to Howard for suggesting the fix, making sure that we use

/usr/bin/g++ to compile the GUI as we do for the rest of the code.

I tried using $(CC) instead of /usr/bin/g++ but that did not work.

Here are some of the error messages we saw, complaining about C++

syntax that was either outdated or relatively new:

In file included from ../export/NAStringDef.h:64,

from ../common/NAString.h:39,

from ../common/ComObjectName.h:55,

from ../common/CmpCommon.h:48,

from ../optimizer/ObjectNames.h:41,

from ../optimizer/ValueDesc.h:38,

from ../optimizer/GroupAttr.h:41,

from CommonSqlCmpDbg.h:34,

from MainWindow.h:24,

from MainWindow.cpp:18:

../export/FBString.h: In function ‘void folly::fbstring_detail::pod_fill(Pod*, Pod*, T)’:

../export/FBString.h:133: error: ISO C++ forbids declaration of ‘ee’ with no type

auto const ee = b + ((e - b) & ~7u);

In file included from ../common/ComSecurityKey.h:24,

from ../optimizer/NATable.h:43,

from ../optimizer/ColStatDesc.h:42,

from ../optimizer/EstLogProp.h:49,

from ../optimizer/GroupAttr.h:42,

from CommonSqlCmpDbg.h:34,

from MainWindow.h:24,

from MainWindow.cpp:18:

../sqlcomp/PrivMgrDefs.h: At global scope:

../sqlcomp/PrivMgrDefs.h:61: error: expected identifier before ‘class’

enum class PrivAuthClass {

Change-Id: I5e220b031af988077271179a969bc0724e2ee584

Fix for bug 1408187 missing commas in SQL event log files

The QRLogger::log method did not provide values for SQLCode and Query ID

that the event log reader TMUDF expects. In some cases, the calls to

QRLogger::log provided these values in the text, or they provided

two commas to indicate NULL values, but there are hundreds of other

calls to QRLogger::log not doing either of these. Added a second

QRLogger::log method and handled SQLCode and QueryID logging inside

those methods, so only a few calls had to be changed.

Change-Id: I9d147398df3056ed5c8b04be20ceebd3f42fe006

Log reading TMUDF, phase 3

blueprint cmp-tmudf-compile-time-interface

- Addressed review comments from phase 2. See

- Added a "parse_status" column to the TMUDF, see

updated syntax below

- Added versioning info to new DLL

- EVENT_LOG_READER TMUDF now should choose the correct

degree of parallelism without the need for CQDs

- Brought back the REPLICATE PARTITION keyword, which

is used in the TMUDF syntax. This should fix the failure

in regression test udf/TEST108.

- Some remaining issues:

- Newlines in the error message are not handled well,

at best the additional lines are lost, at worst

they will cause parse errors

- log_file_node output column is always 0

- Code is not yet integrated with changes to event


- Not yet tested on clusters

Updated syntax for the log reader TMUDF:

SQL Syntax to invoke this function:

select * from udf(event_log_reader( [options] ));

The optional [options] argument is a character constant. The

following options are supported:

f: add file name output columns (see below)

t: turn on tracing

d: loop in the runtime code, to be able to attach a debugger

(debug build only)

p: force parallel execution on workstation environment with

virtual nodes (debug build only)

Returned columns:

log_ts timestamp(6),

severity char(10 bytes) character set utf8,

component char(24 bytes) character set utf8,

node_number integer,

cpu integer,

pin integer,

process_name char(12 bytes) character set utf8,

sql_code integer,

query_id varchar(200 bytes) character set utf8,

message varchar(4000 bytes) character set utf8

if option "f" was specified, we have four more columns:

log_file_node integer not null,

log_file_name varchar(200 bytes) character set utf8 not null,

log_file_line integer not null,

parse_status char(2 bytes) character set utf8 not null

(log_file_node, log_file_name, log_file_line) form a unique key

in the result table. parse_status indicates whether there were

any errors reading the information:

' ' (two blanks): no errors

'E' (as first or second character): parse error

'T' (as first or second character): truncation or over/underflow


'C' (as first or second character): character conversion error
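A client that reads the parse_status column could decode it along these lines. This is a sketch based only on the description above; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <string>

// Flags decoded from the two-character parse_status column.
struct ParseStatus {
  bool parseError;       // 'E'
  bool truncation;       // 'T': truncation or over/underflow
  bool conversionError;  // 'C': character conversion error
};

// Hypothetical helper: each of the two characters is either a blank or
// one of the status letters, in either position.
ParseStatus decodeParseStatus(const std::string& status)
{
  assert(status.size() == 2);

  ParseStatus result = {false, false, false};
  for (char c : status) {
    switch (c) {
      case 'E': result.parseError = true; break;
      case 'T': result.truncation = true; break;
      case 'C': result.conversionError = true; break;
      default: break;  // blank: nothing to report for this position
    }
  }
  return result;
}
```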

Change-Id: Iee3fc8383d4125f0f9b6c6035aa90bb82ceee92e

Phase 2 for log reader TMUDF

blueprint cmp-tmudf-compile-time-interface

Log reader TMUDF is mostly working now.

Still need to set cqd NUM_PARALLEL_ESPS '<num of nodes>' on clusters.

Still needs more work and more testing.

Still seeing some issues with non-ASCII characters.

// SQL Syntax to invoke this function:


// select * from udf(event_log_reader( [options] ));


// The optional [options] argument is a character constant. The

// following options are supported:

// f: add file name output columns (see below)

// t: turn on tracing

// d: loop in the runtime code, to be able to attach a debugger

// (debug build only)

// p: force parallel execution on workstation environment with

// virtual nodes (debug build only)

More detailed explanation of changes:

- PredefUdrReadfile.cpp: Work on event log reader TMUDF

- sqludr.*: New method to add formal parameters, allows TMUDF to

accept optional parameters.

- OptPhysRelExpr.cpp:

Made some changes for TMUDFs with arity 0 to avoid asserts

and to be able to call okToAttemptESPParallelism in method

RelExpr::synthPhysicalProperty(). This is needed for leaf

operators (arity 0) that want to initiate parallel execution

and TMUDFs seem to be first in that situation.

Changed TableMappingUDF::synthPhysicalProperty to generate

a partitioning function with multiple partitions (and no

partitioning key, so far) if required.

- ExUdr.cpp,








Addressed review comments from last phase, got rid of ALLOW_UDF CQD

- Rel*.h


OptPhysRelExpr.cpp (has other changes as well)

Simple but messy change to add one more parameter to


Change-Id: I5549e47c0f019beefd4ec1695ae7abf8c3bd43e3

TMUDF C++ compiler interface, part of log-reading TMUDF

This is the infrastructure for a new C++ interface for TMUDFs

(table-mapping UDFs). It is used by a new log-reading TMUDF that

is not yet complete, but should be finished in the next few days.

See blueprint cmp-tmudf-compile-time-interface for more info.

Change-Id: I5a74e461462313b6d9722ac0deb21cd16c4b02ce

Computed column key predicates for MDAM

Moving the generation of computed column predicates out of the

SearchKey logic and making it available as a static method on

class ScanKey. This allows us to compute these predicates before

we create the Disjuncts data structure that is used in a file

scan, where it will go into a SearchKey or an MdamKey.

Also fixing a bug that stopped after the first predicate found

on a computed column, so it failed to produce both a begin and

an end key value when selecting a range of values

(removed a "break" in ScanKey::createComputedColumnPredicates)
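The effect of the stray "break" can be illustrated with a simplified loop. This is not the ScanKey code; `KeyPredicate` and `predicatesOnColumn` are hypothetical stand-ins for the real data structures.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for a key predicate such as "col >= 10".
struct KeyPredicate {
  std::string column;
  std::string op;
  int value;
};

// Collect ALL predicates on the given column.
std::vector<KeyPredicate> predicatesOnColumn(
    const std::vector<KeyPredicate>& preds,
    const std::string& column)
{
  std::vector<KeyPredicate> found;
  for (const KeyPredicate& p : preds) {
    if (p.column == column) {
      found.push_back(p);
      // A "break" at this point would stop after the first match, so a
      // range would keep only its begin (or only its end) predicate.
    }
  }
  return found;
}
```

For a range such as cc >= 10 and cc <= 20, the loop has to visit both predicates to produce a begin and an end key value.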

Change set 2: Addressed reviewer comments. Moved computation of

computed preds to Scan::addIndexInfo and moved the ValueIdSet

that stores these preds from FileScan to Scan.

Change-Id: I4297d789ded8522eb67d5441ac281657ff90e774

Support for divisioning (multi-temperature data)

This is the initial support for divisioning. See

blueprint cmp-divisioning for more information:

Also, this change fixes the following LaunchPad bugs:

Bug 1388458 insert using primary key default value into a salted

table asserts in generator

Bug 1385543 salt clause on a table with large number of primary

key columns returns error

Bug 1392450 Internal error 2005 when querying a Hive table with

an unsupported data type

In addition, it changes the following behavior:

- The _SALT_ column now gets added as the last column in the

CREATE TABLE statement, rather than the first column after

SYSKEY. The position of _SALT_ in the clustering key does

not change. This will cause some differences in INVOKE and

in the column number assigned to columns.

- For CREATE TABLE LIKE, the defaults of the WITH clauses

are changing. CREATE TABLE LIKE now copies constraints,

SALT and DIVISION clauses by default. The WITH CONSTRAINTS

clause is now the default and should no longer be used.


DIVISIONING clauses are supported.

- For CREATE INDEX ... SALT LIKE TABLE, we now give a

warning instead of an error if the table is not salted.

- Also added an optimization for BETWEEN predicates. If

part or all of them can be converted to an equals predicate,

we do this now. Example:

(a,b,c,d) between (1,2,3,4) and (1,2,5,6)

is transformed into

a=1 and b=2 and (c,d) between (3,4) and (5,6).
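A minimal sketch of the BETWEEN rewrite described above, assuming integer values and lexicographic comparison of the column lists. SplitBetween and splitBetween are hypothetical names for illustration; the actual transformation is done on item expressions in optimizer/NormItemExpr.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for illustration; not a Trafodion class.
struct SplitBetween {
    std::vector<std::string> eqCols;    // columns turned into "col = value"
    std::vector<int>         eqVals;    // the values for those columns
    std::vector<std::string> rangeCols; // columns left in the BETWEEN
    std::vector<int>         low, high; // remaining lower and upper bounds
};

// Precondition: cols, low and high all have the same length.
SplitBetween splitBetween(const std::vector<std::string>& cols,
                          const std::vector<int>& low,
                          const std::vector<int>& high)
{
    SplitBetween r;
    std::size_t i = 0;
    // Leading positions where both bounds agree can only match that one
    // value, so they become equals predicates.
    while (i < cols.size() && low[i] == high[i]) {
        r.eqCols.push_back(cols[i]);
        r.eqVals.push_back(low[i]);
        ++i;
    }
    // Everything from the first differing position on stays a BETWEEN.
    for (; i < cols.size(); ++i) {
        r.rangeCols.push_back(cols[i]);
        r.low.push_back(low[i]);
        r.high.push_back(high[i]);
    }
    return r;
}
```

Only a leading run of equal bounds is converted. Once the bounds differ at some position, later positions must stay in the range predicate even if their bounds happen to be equal.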

More detailed description of changes:

- arkcmp/CmoStoredProc.cpp


+ other files

Using the new FLAGS column in the COLUMNS metadata table to store

whether a column is a salt or divisioning column. Note that since

there may be existing salted tables without this flag set, the flag

is so far only reliable for divisioning columns.

- comexe/ComTdb.h




Changed the column class field in struct

ComTdbVirtTableColumnInfo from a string to the corresponding

enum. Sorry, this caused lots of small changes (deleting "_LIT"

from the initializers). Also added the column flags.

- executor/hiveHook.cpp: Added a check for partitioned tables

(having multiple SDs). This is part of the fix for

bug 1353632.

- GenRelUpdate.cpp: When generating the key encoding expression

for an insert inside a MERGE operation, we assumed the new

record expression was in the order of the key columns. Added

a step to sort by key column, so we can pass the expression

in any order.

- optimizer/ItemExpr.cpp


Added a named NATypeToItem item expression.

This is used to do a primitive "bind" operation of an item expression

when processing a DDL statement. Specifically, to bind the DIVISION BY

clause in a CREATE TABLE statement.

- optimizer/ItemFunc.h

optimizer/SynthType.cpp: The DDL time "binder" gets expressions as

they come out of the parser, e.g. a ZZZBinderFunction. Need to add

type synthesis for some cases of the ZZZBinderFunction.

- optimizer/NATable.cpp

Removing some dead code. Adding an error message when we encounter

a Hive column type we can't handle yet. Bug 1392450.

- optimizer/TableDesc.*

Method TableDesc::validateDivisionByClauseForDDL() got moved

to CmpSeabaseDDL::validateDivisionByExprForDDL().

- optimizer/NormItemExpr.cpp

BETWEEN transformation described above.

- optimizer/ValueDesc.cpp

Avoid hard-coding the "_SALT_" name and add a comment about

the possibility of using the flag in the future.

- parser

Lots of small changes for salt and divisioning option changes.

Simplifying the syntax for salt options somewhat. I think the older

syntax was so complex because it needed to record the starting and

ending position of the divisioning clause, something we don't need


- regress: Adding new test

- sqlcomp/CmpDescribe.cpp: Support for describing DIVISION BY clause

and also supporting the new WITHOUT SALT | DIVISION options

for CREATE TABLE LIKE, which relies on the describe feature.

- sqlcomp/CmpSeabaseDDLcommon.cpp


+ Handling the new column flags and making sure they are not

confused with the HBase column flags (e.g. for serialization).

+ Setting the new COLUMNS.FLAGS when writing metadata.

+ Also, writing the computed column text to the TEXT table.

+ For DROP TABLE, unconditionally deleting TEXT rows, since the

table could contain computed columns.

+ When building ColInfoArray, check system column flags, since

system columns can now appear at any position.

+ Add method to "bind" an item expression during DDL processing

without going through the full binder. This replaces any column

reference with a named NATypeToItem node, since all we really

need is the type and the name for unparsing.

+ Method TableDesc::validateDivisionByClauseForDDL() got moved

to CmpSeabaseDDL::validateDivisionByExprForDDL() with some minor

adjustments, since it used to be called on a bound ItemExpr, now

it gets called on something that came out of the parser and went

through the DDL time "binder".

- sqlcomp/CmpSeabaseDDLindex.cpp:

Support for CREATE INDEX ... DIVISION LIKE TABLE. If this is

set, add the division columns in front of the index key, otherwise


- sqlcomp/CmpSeabaseDDLtable.cpp:

+ Code to make sure column flags and column class is set and propagated.

+ Fix for bug 1385543: Now that we use the TEXT table for computed

column text, we no longer have a length limit. This is true for both

divisioning and salt expressions.

+ When processing the column list in seabaseCreateTable() we have a

bit of a chicken and egg problem: We need the column list to validate

the DIVISION BY expressions, but the DIVISION BY columns need to be part

of the column list. So, we do this a first time without divisioning

columns, then we add those, and produce the final list in a second


+ getTextFromMD method now takes a sub-id as an input parameter. That's

the column number for computed column text.

+ Read computed column text from the TEXT table. Note: This also needs

to handle older tables where the computed column text is stored in

the default value.

Change-Id: I7c3ebe39a950c1d01f31855bdc92cbb98e5eb275

Bug 1383491, Some problems remaining with incorrect ESP boundaries

When making up missing columns from HBase split keys that are

shorter than the actual key, there were a couple of issues:

These keys were created as decoded key values, but the code

below decoded them a second time.

Also, the code did not handle nullable columns properly.

These two fixes also solved the remaining issue with interval columns,

mentioned in bug 1375902.
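To illustrate why decoding a key value a second time gives wrong boundaries, here is a small sketch. It assumes an order-preserving key encoding that flips the sign bit of a 32-bit integer; encodeKey and decodeKey are hypothetical helpers, not the functions used in the actual fix.

```cpp
#include <cassert>
#include <cstdint>

// Assumed encoding: flipping the sign bit makes signed integers compare
// correctly as unsigned binary values (negative values sort first).
inline std::uint32_t encodeKey(std::int32_t v)
{
    return static_cast<std::uint32_t>(v) ^ 0x80000000u;
}

// Inverse of encodeKey. Relies on two's complement conversion, which
// mainstream compilers provide (and C++20 guarantees).
inline std::int32_t decodeKey(std::uint32_t e)
{
    return static_cast<std::int32_t>(e ^ 0x80000000u);
}
```

Under this assumption, decodeKey(encodeKey(v)) returns v, but applying decodeKey to a value that is already decoded flips the sign bit again and produces a different number.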

Change-Id: Ie311fcc33c1a6920b68227fc2fff43a386f3c2e8

Two minor bug fixes.

Sourcing in a second time takes a couple of seconds.

The fix makes this much faster.

The topMatch method of a rule needs to call Rule::topMatch first

before doing anything else, like casting the RelExpr passed in

to a specific node.
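A simplified sketch of the topMatch ordering. Only the names Rule::topMatch and RelExpr come from the description above; the signatures are reduced for illustration and OpType, Scan, Join and InnerJoinRule are hypothetical.

```cpp
#include <cassert>

enum class OpType { Scan, Join };

struct RelExpr {
    OpType op;
    explicit RelExpr(OpType o) : op(o) {}
    virtual ~RelExpr() = default;
};

struct Scan : RelExpr {
    Scan() : RelExpr(OpType::Scan) {}
};

struct Join : RelExpr {
    bool isInner;
    explicit Join(bool inner) : RelExpr(OpType::Join), isInner(inner) {}
};

class Rule {
public:
    explicit Rule(OpType pattern) : pattern_(pattern) {}
    virtual ~Rule() = default;
    // Base check: does the top node have the operator this rule expects?
    virtual bool topMatch(const RelExpr* expr) const
    {
        return expr->op == pattern_;
    }
private:
    OpType pattern_;
};

class InnerJoinRule : public Rule {
public:
    InnerJoinRule() : Rule(OpType::Join) {}
    bool topMatch(const RelExpr* expr) const override
    {
        // Call the base class first. Only after it succeeds do we know
        // that expr is a Join and that the cast below is valid.
        if (!Rule::topMatch(expr))
            return false;
        const Join* join = static_cast<const Join*>(expr);
        return join->isInner;
    }
};
```

If the cast came first, passing a Scan node to this rule would read a Join member from an object that is not a Join.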

Change-Id: I62ba69c6f7434fa3184ea981e15f1ed5ae2b2e01

Bug 1376922 Union query on a view returns wrong results

Qifan investigated this bug and found the problem in replacePivs().

Rather than forcing the parent's partitioning function on the child,

the fix takes the child's function and only replaces the PIVs in it.

Additional changes:

- The replacePivs() method used a ValueIdSet for the PIVs. This should

be a list, since we use multiple PIVs often by position in the list.

- We don't need the code in replacePivs() that fixes predicates in scan

nodes, since we call replacePivs() before calling preCodeGen() on the

child and therefore the child node does not yet have predicates that

refer to PIVs.

- We don't need to replace the partitioning expression anymore, since

it does not refer to any PIVs and we leave the partitioning key

predicates almost unchanged.

- Fixing a small, unrelated thing: When sourcing in twice,

it reported an error message, due to a shell variable that didn't

get initialized to an empty string (workstation environment only).

Change-Id: Id8a20c0d958d8ce13edd59849a1418d252b5691d

Bug 1375902 Incorrect range boundaries for queries using ESPs

Reading unsalted, multi-region tables in parallel was very imbalanced,

except for the very first time a table was accessed in a session.

The problem was that NATable caching involves unparsing the region

boundaries, and we created these region boundary values in a way

that could not be unparsed. So, when we used an NATable object from

the NATable cache, the region boundaries had dummy values instead

of the real ones.

The fix adds code to create valid SQL literals from binary values,

to be used in the unparse method. Moving to HBase 0.98 also required

an additional fix, to handle partial values in HBase region start keys.

A region start key in an HBase 0.98 table can be a prefix of an

actual row key.

There is still a problem with keys that use the interval type,

but it is probably a day 1 issue. Since it is late in the 0.9

release, I'm deferring that to the next release.

Change-Id: Id0788a87641e7201723b5f8014215a36c87b7e23

Bug 1335477: Group by query returns different results in 2 builds

Reset the noExePreds flag in the MDAM key when we add

partitioning key predicates during preCodeGen. Those added

predicates can contain executor predicates that must not be


Also fixed an issue found in bug 1306836, where the syntax for

selecting specific partitions (salt buckets) was ignored for

coprocessor aggregation (simple count(*)). This syntax now disables

the coprocessor aggregate.

Change-Id: Id8dc0b7769f31a700bb459c424caada2f447a6ae

Making Maven build output less verbose

Maven emits warnings about using an environment variable as the

version. We want to keep doing this, since we have multiple Maven

projects that should all have the same version, stored centrally

in the file.

Suppressing most Maven output, including the warning, on stdout,

but preserving the entire output in the log file

(except for the "clean" target).

Change-Id: Ifee00e65971b20b485fc2eb65fff9135250d6fe3

Fix for bug 1348211 Generator error 7000 "root can't produce these values"

For joins involving repartitioning of data to match the other table,

we got this error in some cases, usually seen with salted tables.

Fix is to avoid storing the PIVs (partition input values) in the logical

scan node. Instead they are picked up in the preCodeGen phase, and this

means that if the PIVs have to be rewritten, we pick up the new and

correct values.

Patch set 3: Fixed regression issue by moving PIV logic so

availableValues includes the PIVs.

Change-Id: Ia8f9a83894e504f8d65d37d1589760258cdb8976