Clone Tools
Constraints: committers
Constraints: files
Constraints: dates
[TRAFODION-2733] BMO quota changes

To ensure BMO quota assignment works even when the estimates are way off.


estimate exceeds BMO_MEMORY_LIMIT_PER_NODE_IN_MB by this factor,

the BMO memory estimate for the operator is reset to


Then BMO memory estimate and the quota is assigned based on the

revised estimate.

  1. … 27 more files in changeset.
Merge branch 'master' of git:// into bmo_memory_quota








  1. … 20 more files in changeset.
TRAFODION-2731 CodeCleanup: Remove obsolete, legacy and unused code

This phase handles the following:

-- removal of code that dealt with:

-- mpalias, NSK, MP, mploc, resource fork, rfork

-- ARLIB, DISK, VOLUME, PFS, compiler version info

-- interpretasrow/IAR, AuditImage, ExtractColumns functions

-- ARKCMP_SINGLE_PROCESS and oneProcess()

-- recompControl, remoteDefaults, rtdu, module

-- latebind thru nsk defines, guardian names, nametype nsk

-- SHADOW implementation


-- older sqlcat ReadTableDef


-- internal cli methods no longer used by any caller

Code within the following defines is removed if it is obsolete

or the define itself is removed if that feature is always on:

-- removed NA_EIDPROC

-- removed SQLEXP_LIB_FUNC

-- removed NA_CMPDLL


-- removed SQ_NEW_PHANDLE

-- removed __EID

-- removed ARKFS_OPEN

-- removed STAND_ALONE

-- removed __TANDEM

-- removed NA_C89

-- removed NA_NSK


-- removed SQLCLI_LIB_FUNC

-- removed CLI_PRIV_SRL

-- removed PRIV_SRL

-- removed NA_LINUX

-- removed NA_HSC_LINUX

-- removed NA_UNIX

-- removed NA_WINNT

-- removed HAVE_MMAP

-- removed NA_NO_C_RUNTIME

-- removed NA_DEBUG_C_RUNTIME(replaced with _DEBUG)

-- removed NA_64BIT usage except in sqlcli.h

-- removed dg64

-- removed SQLEXPORT_LIB

-- removed NA_ARKFS

-- removed NA_IEEE_FLOAT

-- removed NA_GUARDIAN_MSG

-- removed NA_HSC


-- removed ERROR

-- removed ERROR_STATE


Contents of these files have been removed.

Next checkin fill remove the files itself from git:


executor/ExMeas.h, ExMeas.cpp

executor/tempfile.h, .cpp


executor/stubs.cpp, stubs2.cpp


cli/rtdu.h, rtdu2.h, rtdu.cpp, rtdu.cpp












sqlcat/ReadTableDef.h, cpp

sqlcat/readRealArk.h, cpp

  1. … 460 more files in changeset.
Following changes are done in BMO memory quota

1) Enabled memory quota per node. The CQD BMO_MEMORY_LIMIT_PER_NODE

(renamed from EXE_MEMORY_LIMIT_PER_CPU) is set to 10240 MB by default.

Old attribute Old value Renamed Attribute New value






2) Changes in EXPLAIN

Estimated memory per node for all BMOs at ROOT operator

Estimated memory per instance for every BMO operator

Memory quota per instance for every BMO operator

3) BMO TDB contains the memory quota per esp instance now.

4) Root TDB now contains the limit per node and estimated memory per node.

This can be used by WMS to change the memory allocation during

runtime without compilation. - Not yet implemented.

4) Added a CQD BMO_MEMORY_LIMIT_UPPER_BOUND to gap the memory

consumed by BMO by the same queries with less number of


5) The unused memory quota is yielded to other fragments in the process


6) Removed the code to limit the ESPs from being assigned to a fragement

based on the BMO memory quota.

7) Added a new CQD BMO_MEMORY_ESTIMATE_RATIO_CAP to gap the memory

estimate skew by any one BMO operator to 0.7.

8) To disable the memory quota per node, set BMO_MEMORY_LIMIT_PER_NODE to 0.

9) This memory quota is distributed proportionally based on the estimated memory

taking into consideration the number of bmo instances per operator and

the number of nodes available in the cluster to host these instances.

Hence, this memory quota should be valid in multi-fragments independent of the

number of fragments in an ESP.


11) Fixed BMO stats WM to be at least the allocated memory.

12) Changed the sort operator to account the bmo memory correctly.

(cherry picked from commit ba19c04a58890fdd845b03f8d915abdd487b6407)















  1. … 44 more files in changeset.
[TRAFODION-2420] RMS enhancements

Improved BMO operation accounting. Two new counters are added

a) scratchIOSize - Size in KB of the scratch IO

b) scratchIOMaxTime - Time in microseconds taken by any long pole

ESP instance to do scratch IO operations

The ScrBufSize, ScrBufRead and ScrBufWritten are no longer displayed

in the formatted get statistics command outputs.

Reduced the default value of CQD HBASE_NUM_CACHE_ROWS_MAX to 1024 from

10000 to reduce stress on the client JVM memory.

Disabled the memory pressure triggering in the BMO operators by increasing

the threshold GEN_MEM_PRESSURE_THRESHOLD to 10000 from 100. The memory

pressure detection code was inadvertently enabled when the cap to limit

memory pressure constant to 100 was removed.

Memory pressure triggers in BMO operators will be enabled later when

we understand the memory pressure detection better.

  1. … 14 more files in changeset.
[TRAFODION-2317] Infrastructure for common subexpressions

This is a first set of changes to allow us to make use of CTEs

(Common Table Expressions) declared in WITH clauses and to create

a temp table for them that is then read multiple times in the query.

This also includes a fix for

[TRAFODION-2280] Need to remove salt columns from uniqueness constraints

Summary of changes:

- Adding a unique statement number in CmpContext

- Moving the execHiveSQL method from the OSIM code to CmpContext

- Adding a list of common subexpressions and their references

to CmpStatement

- Adding the ability to the Hive Truncate operator to drop the

table when the TCB gets deallocated

- Adding the ability to the HDFS scan to compute scan ranges at

runtime. Those are usually determined in the compiler. This is

only supported for simple, non-partitioned, delimited tables.

We need this because we populate the temp table and read from

in in the same statement, without the option of compiling

after we inserted into the temp table.

- Special handling in the MapValueIds node of common subexpressions.

See the comment in MapValueId::preCodeGen().

- Moved the binder code to create a FastExtract node into a new

method FastExtract::makeFastExtractTree(), to be able to call

it from another place.

- MapValueIds no longer looks at the "used by MVQR flag" to determine

the method for VEGRewrite. Instead it checks whether a list of

values has been provided to do this.

- Adding a new method, RelExpr::prepareMeForCSESharing, that is

kind of an "unnormalizer", undoing some of the normalizer


- Implementing the steps for common subexpression materializations

described below.

- Adding the ability to suppress the Hive timestamp modification

check when truncating a Hive table

- Adding an optimizer rule to eliminate CommonSubExprRef nodes.

These nodes should not normally survive past the SQO phase, but

if the SQO phase gets interrupted by an exception, that could

happen, since we then fall back to a copy of the tree before

SQO. In the future, we can consider cost-based decision on

what to do with common subexpressions.

- Adding CommonSubExprRef nodes in the parser whenever we expand

a CTE reference.

- Adding cleanup code to the "cleanup obsolete volatile tables"

command that removes obsolete Hive tables used for common


Other changes contained in this change set:

- Optimization for empty scans, like select * from t where 1=0

This now generates a cardinality constraint with 0 rows, which

can be used later to eliminate parts of a tree.

(file OptLogRelExpr.cpp)

- [TRAFODION-2280] Need to remove salt columns from uniqueness

constraints generated on salted tables.

(file OptLogRelExpr.cpp)

- Got rid of the now meaningless "seamonster" display in EXPLAIN.

(file GenExplain.cpp and misc. expected files)

- Suppress display of "zombies" in the cstat command. Otherwise,

these zombies (marked as <defunct>) prevent Trafodion from

starting, because they are incorrectly considered "orphan"

processes. This could require a reboot when no reboot is necessary.

(file core/sqf/sql/scripts/pstat)

Incomplete list of things left to be done:

- TRAFODION-2316: Hive temp tables are not secure. Use volatile

tables instead.

- TRAFODION-2315: Add heuristics to decide when to use the temp table


- TRAFODION-2320: Make subquery unnesting work with common subexpressions.

Generated Plans


The resulting query plan for a query Q with n common

subexpressions CSE1 ... CSEn looks like this:






/ \

Union Q

/ \

... CTn



/ \


Each of the CTi variables looks like the following, an



/ \

Truncate FastExtract TEMPi



The original query Q has the common subexpressions replaced

with the following:



scan TEMPi

Here is a simple query and its explain:

prepare s from

with cse1 as (select d_date_sk, d_date, d_year, d_dow, d_moy from date_dim)

select x.d_year, y.d_date

from cse1 x join cse1 y on x.d_date_sk = y.d_date_sk

where x.d_moy = 3;

>>explain options 'f' s;


---- ---- ---- -------------------- -------- -------------------- ---------

11 . 12 root 1.46E+005

5 10 11 blocked_union 1.46E+005

7 9 10 merge_join 7.30E+004

8 . 9 sort 1.00E+002

. . 8 hive_scan CSE_TEMP_CSE1_MXID11 1.00E+002

6 . 7 sort 5.00E+001

. . 6 hive_scan CSE_TEMP_CSE1_MXID11 5.00E+001

1 4 5 blocked_union 7.30E+004

2 . 4 hive_insert CSE_TEMP_CSE1_MXID11 7.30E+004

. . 2 hive_scan DATE_DIM 7.30E+004

. . 1 hive_truncate 1.00E+000

--- SQL operation complete.


CQDs to control common subexpressions


CSE_FOR_WITH is the master switch.

CQD Value Default Behavior

--------------------- --------- ------- ---------------------------------------


ON Insert a CommonSubExprRef node in the

tree whenever we reference a CTE

(table defined in a WITH clause)

CSE_USE_TEMP OFF Y Disable creation of temp tables

for common subexpressions

SYSTEM Same as OFF for now

ON Always create a temp table for

common subexpressions where possible


ON Emit diagnostic warnings that show why

we didn't create temp tables for

common subexpressions


ON Cleanup Hive tables used for CSEs with

the "cleanup obsolete volatile tables"


CommonSubExprRef relational operators


This is a new RelExpr class that is introduced. It marks the common

subexpressions in a RelExpr tree. This operator remembers the name of

a common subexpression (e.g. the name used in the WITH clause).

Multiple such operators can reference to the same name. Each of

these references has a copy of the tree.

Right now, these operators are created in the parser when we expand a

CTE (Common Table Expression), declared in a WITH clause. If the CTE

is referenced only once, then the CommonSubExprRef operator is removed

in the binder - it also doesn't live up to its name in this case.

The remaining CommonSubExprRef operators keep track of various changes

to their child trees, during the binder and normalizer phases. In

particular, it tracks which predicates are pushed down into the child

tree and which outputs are eliminated.

The CmpStatement object keeps a global list of all the

CommonSubExprRef operators in a statement, so the individual operators

have a way to communicate with their siblings:

- A statement can have zero or more named common subexpressions.

- Each reference to a common subexpression is marked in the RelExpr

tree with a CommonSubExprRef node.

- In the binder and normalizer, common subexpressions are expanded,

meaning that multiple copies of them exist, one copy per


- Common subexpressions can reference other common subexpressions,

so they, together with the main query, for a DAG (directed

acyclic graph) of dependencies.

- Note that CTEs declared in a WITH clause but not referenced are

ignored and are not part of the query tree.

In the semantic query optimization phase (SQO), the current code makes

a heuristic decision what to do with common subexpressions - to

evaluate them multiple times (expand) or to create a temporary table

once and read that table multiple times.

If we decide to expand, the action is simple: Remove the

CommonSubExprRef operator from the tree and replace it with its child.

If we decide to create a temp table, things become much more difficult.

We need to do several steps:

- Pick one of the child trees of the CommonSubExprRefs as the one to


- Undo any normalization steps that aren't compatible with the other

CommonSubExprRefs. That means pulling out predicates that are not

common among the references and adding back outputs that are

required by other references. If that process fails, we fall back

and expand the expressions.

- Create a temp table tmp.

- Prepare an INSERT OVERWRITE TABLE tmp SELECT * FROM cse tree

that materializes the common subexpression in a table.

- Replace the CommonSubExprRef nodes with scans of the temp table.

- Hook up the insert query tree with the main query, such that it

is executed before the main query starts.

Temporary tables


At this time, temporary tables are created as Hive tables, with a

fabricated, unique name, including the session id, a unique statement

number within the session, and a unique identifier of the common

subexpression within the statement. The temporary table is created at

compile time. The query plan contains an operator to truncate the

table before populating it. The "temporary" Hive table is dropped when

the executor TCB is deallocated.

Several issues are remaining with this approach:

- If the process exits before executing and deallocating the statement,

the Hive table is not cleaned up.

Solution (TBD): Clean up these tables like we clean up left-over

volatile tables. Both are identified by the session id.

- If the executor runs into memory pressure and deallocates the TCB,

then allocates it again at a later time, the temp table is no longer


Solution (TBD): Use AQR to recompile the query and create a new table.

- Query cache: If we cache a query, multiple queries may end up with

the same temporary table. This works ok as long as these queries are

executed serially, but it fails if both queries are executed at the

same time (e.g. open two cursors and fetch from both, alternating).

Solution (TBD): Add a CQD that disables caching queries with temp tables

for common subexpressions.

In the future we also want to support volatile tables. However, those also

have issues:

- Volatile tables aren't cleaned up until the session ends. If we run

many statements with common subexpressions, that is undesirable. So,

we have a similar cleanup problem as with Hive tables.

- Volatile tables take a relatively long time to create.

- Insert and scan rates on volatile Trafodion tables are slower than

those on Hive tables.

To-do items are marked with "Todo: CSE: " in the code.

  1. … 78 more files in changeset.
jira TRAFODION-2234 turn aligned format on, phase 1

This is phase 1 of aligned format change.

Dev regressions now run with aligned format tables.

To test both hbase and aligned format, some tests have explicit

specification to create hbase format tables.

Tests that test for features that are currently only available

with hbase format (like pushdown sel expr) create hbase format tables.

Many expected files have been updated to reflect aligned format,

mostly in showddl and explain output.

  1. … 35 more files in changeset.
ddl xns turned on by default (plus couple more fixes)

-- cqd ddl_transactions is ON by default

-- disabled init/drop trafodion and drop schema from

running in a single transaction.

This will be changed once couple of dtm issues are fixed

-- error 8616 is retuned in case of commit conflict

-- fixed begin/commit/set stmts to not retry in case of an error

  1. … 14 more files in changeset.
ddl xns default plus handling of errors with default constants placement

  1. … 15 more files in changeset.
JIRA TRAFODION-1798 (ddl xns) and few other fixes, details below.

-- support for sql part of ddl xns. Section 1 of JIRA TRAFODION-1798

-- cqd ddl_transactions to enable or disable ddl xns.

Default is currently off. Once it is tested, it will be turned on.

Dev regressions are run with cqd set to ON

-- get stmts run with read committed to get changes in current xns

-- support for where preds with get stmts

-- scan to pass in transid even if running with read uncommitted access.

This enables rows modified in current xn to be returned.

-- cleanup no longer return multiple duplicate error messages if

objects id is not found.

-- cleanup no longer includes internallay created schemas (_HV_ , _HB_)

during cleanup operations.

-- Correct error msg was not getting returned if an invalid index

existed in table and the same index was created again.

-- init traf, drop md views was giving an error if views didnt exist.

That has been fixed.

-- regressions with -diff option now show original file timestamps

instead of the timestamp when the diff command was run.

  1. … 72 more files in changeset.
Implement TRAFODION-1420 Use ClientSmallScanner for small scans to improve perdormance Hbase implements an optimization for small scan (defined as scanning less than a data block ie 64Kb) resulting in 3X performance improvement. The underlying trick is about cutting down RPC calls from 3 (OpenScan/Next/Close) to 1, and use pread stateless instead of seek/read state-full and locking method to read data. This JIRA is about improving the compiler who can be aware if a scan will be acting on single data block (small) or not, and pass this data to executor so that it can use the right parameter for scan. (scan.setSmall(boolean)). reference:

  1. … 27 more files in changeset.
First commit for advanced predicate pushdown feature (also known as pushdown V2) associated JIRA TRAFODION-1662 Predicate push down revisited (V2). The JIRA contains a blueprint document, useful to understand what the code is supposed to do. This code is enabled using CQD hbase_filter_preds '2', and bypassed otherwise. Except for the change implemented in ValueDesc.cpp that is a global bug fix whereValueIdSet are supposed to contain set of valueID ANDed together, and should not contain any ValueID with operator ITM_AND.

  1. … 16 more files in changeset.
ESP-colocation fixes (Ravisha and Qifan)

  1. … 6 more files in changeset.
Merge remote branch 'core/master'

  1. … 2817 more files in changeset.
Move core into subdir to combine repos

  1. … 10768 more files in changeset.
Move core into subdir to combine repos

  1. … 10622 more files in changeset.
Move core into subdir to combine repos

Use: git log --follow -- <file>

to view file history thru renames.

  1. … 10837 more files in changeset.