ComTdbHbaseAccess.h

initial support for returning multiple versions and column timestamps

This feature is not yet externalized.

Support added to:

-- return multiple versions of rows

-- select * from table {versions N | MAX | ALL}

-- get hbase timestamp of a column

-- select hbase_timestamp(col) from t;

-- get version number of a column.

-- select hbase_version(col) from t
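A hedged sketch of the new syntax in use (table and column names are hypothetical):

select * from t {versions 2};

select hbase_timestamp(c1), hbase_version(c1) from t;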

Change-Id: I37921681fc606a22c19d2c0cb87a35dee5491e1e

  1. … 48 more files in changeset.
Move core into subdir to combine repos

  1. … 10768 more files in changeset.
Move core into subdir to combine repos

  1. … 10622 more files in changeset.
Move core into subdir to combine repos

Use: git log --follow -- <file>

to view file history through renames.

  1. … 10837 more files in changeset.
Changes in Patchset2

Fixed issues found during review.

Most of the changes are related to disabling this change for unique indexes.

When unique indexes are found, they alone are disabled during the load.

Other indexes are online and are handled as described below. Once the base

table and regular indexes have been loaded, unique indexes are loaded from

scratch using a new command "populate all unique indexes on <tab-name>".

A similar command "alter table <tab-name> disable all unique indexes"

is used to disable all unique indexes on a table at the start of load.
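
A minimal sketch of how the two new statements bracket the load (table name hypothetical):

alter table trafodion.sch.t1 disable all unique indexes;

-- the load runs here against the base table and regular indexes

populate all unique indexes on trafodion.sch.t1;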

The cqd change setting allow_incompatible_assignment is unrelated and fixes an

issue related to loading timestamp types from hive.

The odb change gets rid of minor warnings.

Thanks to all three reviewers for their helpful comments.

-----------------------------------

Adding support for incremental index maintenance during bulk load.

Previously, when bulk loading into a table with indexes, the indexes were first

disabled, the base table was loaded, and then the indexes were populated from

scratch one by one. This could take a long time when the table has significant

data prior to the load.

Using a design by Hans this change allows indexes to be loaded in the same

query tree as the base table. The query tree looks like this:

Root
  |
NestedJoin
 /        \
Sort       Traf_load_prep (into index1)
  |
Exchange
  |
NestedJoin
 /        \
Sort       Traf_load_prep (i.e. bulk insert) (into base table)
  |
Exchange
  |
Hive scan

This design and change set allows multiple indexes to be on the same tree.

Only one index is shown here for simplicity. LOAD CLEANUP and LOAD COMPLETE

statements also now perform these tasks for the base table along with all

enabled indexes.

This change is enabled by default. If a table has indexes it will be

incrementally maintained during bulk load.

The WITH NO POPULATE INDEX option has been removed.

A new option WITH REBUILD INDEXES has been added. With this option we get

the old behaviour of disabling all indexes before the load into the table and

then populating all of them from scratch.
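
A hedged sketch of the two modes (table names hypothetical):

-- default: indexes are incrementally maintained during the load
load into trafodion.sch.t1 select * from hive.hive.src_t1;

-- old behaviour: disable all indexes, load, then rebuild from scratch
load with rebuild indexes into trafodion.sch.t1 select * from hive.hive.src_t1;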

Change-Id: Ib5491649e753b81e573d96dfe438c2cf8481ceca

  1. … 35 more files in changeset.
Enabling Bulk load and Hive Scan error logging/skip feature

Also fixed the hanging issue with Hive scan (ExHdfsScan operator) when there

is an error in data conversion.

ExHbaseAccessBulkLoadPrepSQTcb was not releasing all the resources when there

was an error or when the last buffer had some rows.

Error logging/skip feature can be enabled in

hive scan using CQDs and in bulk load using the command line options.

For Hive Scan

CQD TRAF_LOAD_CONTINUE_ON_ERROR 'ON' to skip errors

CQD TRAF_LOAD_LOG_ERROR_ROWS 'ON' to log the error rows in HDFS files.

For Bulk load

LOAD WITH CONTINUE ON ERROR [TO <location>] – to skip error rows

LOAD WITH LOG ERROR ROWS – to log the error rows in hdfs files.

The default parent error logging directory in hdfs is /bulkload/logs. The error

rows are logged in subdirectory ERR_<date>_<time>. A separate hdfs file is

created for every process/operator involved in the bulk load in this directory.

Error rows in hive scan are logged in

<sourceHiveTableName>_hive_scan_err_<inst_id>

Error rows in bulk upsert are logged in

<destTrafTableName>_traf_upsert_err_<inst_id>

Bulk load can also be aborted after a certain number of error rows are seen using

LOAD WITH LOG ERROR ROWS, STOP AFTER <n> ERROR ROWS option
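
A minimal sketch of both interfaces (table names hypothetical, error limit illustrative):

cqd TRAF_LOAD_CONTINUE_ON_ERROR 'ON';

cqd TRAF_LOAD_LOG_ERROR_ROWS 'ON';

load with log error rows, stop after 100 error rows into trafodion.sch.t1 select * from hive.hive.src_t1;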

Change-Id: Ief44ebb9ff74b0cef2587705158094165fca07d3

  1. … 33 more files in changeset.
Changes to enable Rowset select - Fix for bug 1423327

HBase always returns an empty result set when the row is not found. Trafodion

has been changed to exploit this behavior to project no data in a rowset select.

The optimizer has now been enabled to choose a plan involving Rowset Select

wherever possible. This can result in plan changes for queries:

nested join plan instead of hash join,

vsbb delete instead of delete,

vsbb insert instead of regular insert.

A new CQD HBASE_ROWSET_VSBB_SIZE has been added to control the hbase rowset size.

The default value is 1000.
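
For example (value illustrative):

cqd HBASE_ROWSET_VSBB_SIZE '5000';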

Change-Id: Id76c2e6abe01f2d1a7b6387f917825cac2004081

  1. … 19 more files in changeset.
Eliminate manual steps in load/ustat integration

The fix achieves full integration of the bulk load utility with

Update Statistics. The Hive backing sample table is now created

automatically (formerly, we only wrote the HDFS files to be

used by the Hive external table), the correct sampling percentage

for the sample table is calculated, and the ustat command is

launched from the executor as one of the steps in execution of

the bulk load utility.

Change-Id: I9d5600c65f0752cbc7386b1c78cd10a091903015

Closes-Bug: #1436939

  1. … 26 more files in changeset.
New ustat algorithm and bulk load integration

This is the initial phase of the Update Statistics change to use

counting Bloom filters and a Hive backing sample table created

during bulk load and amended whenever a new HFile for a table

is created. All changes are currently disabled pending some

needed fixes and testing.

blueprint ustat-bulk-load

Change-Id: I32af5ce110b0f6359daa5d49a3b787ab518295fa

  1. … 16 more files in changeset.
Merge "fix in the tdb to return the correct name for bulk load"

  1. … 1 more file in changeset.
fix in the tdb to return the correct name for bulk load

fix so that the correct name for bulk load preparation phase is

returned in the output

Change-Id: I8e93c9b5bb953c647f2b669ea966e5b9db5db434

  1. … 1 more file in changeset.
Snapshot Scan changes

The changes in this delivery include:

-decoupling the snapshot scan from the bulk unload feature. Setup of the

temporary space and folders before running the query and cleanup afterwards

used to be done by the bulk unload operator because snapshot scan was specific

to bulk unload. In order to make snapshot scan independent from bulk unload

and use it in any query the setup and cleanup tasks are now done by the query

itself at run time (the scan and root operators).

-caching of the snapshot information in NATable to optimize compilation time

Rework for caching: when the user sets TRAF_TABLE_SNAPSHOT_SCAN to LATEST

we flush the metadata and then we set the caching back to on so that metadata

gets cached again. If newer snapshots are created after setting the cqd they

won't be seen if they are already cached unless the user issues a command/cqd

to invalidate or flush the cache. One way for doing that can be to issue

"cqd TRAF_TABLE_SNAPSHOT_SCAN 'latest';" again

-code cleanup

Below is a description of the CQDs used with snapshot scan:

TRAF_TABLE_SNAPSHOT_SCAN

This CQD can be set to:

NONE --> (default) Snapshot scan is disabled and regular scan is used,

SUFFIX --> Snapshot scan is enabled for the bulk unload (bulk unload

behavior is not changed)

LATEST --> Snapshot Scan is enabled independently from bulk unload and

the latest snapshot is used if it exists. If no snapshot exists

the regular scan is used. For this phase of the project the user

needs to create the snapshots using hbase shell or other tools.

In the next phase of the project new commands to create,

delete and manage snapshots will be added.

TRAF_TABLE_SNAPSHOT_SCAN_SNAP_SUFFIX

This CQD is used with bulk unload and its value is used to build the

snapshot name as the table name followed by the suffix string

TRAF_TABLE_SNAPSHOT_SCAN_TABLE_SIZE_THRESHOLD

When the estimated table size is below the threshold (in MBs) defined by

this CQD the regular scan is used instead of snapshot scan. This CQD

does not apply to bulk unload, which maintains the old behavior.

TRAF_TABLE_SNAPSHOT_SCAN_TIMEOUT

The timeout beyond which we give up trying to create the snapshot scanner

TRAF_TABLE_SNAPSHOT_SCAN_TMP_LOCATION

Location for temporary links and files produced by snapshot scan
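
A minimal sketch of using snapshot scan for a regular query with these CQDs (table name hypothetical):

cqd TRAF_TABLE_SNAPSHOT_SCAN 'LATEST';

cqd TRAF_TABLE_SNAPSHOT_SCAN_TMP_LOCATION '/bulkload/temp_scan_dir/';

select count(*) from trafodion.sch.t1;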

Change-Id: Ifede88bdf36049bac8452a7522b413fac2205251

  1. … 44 more files in changeset.
Remove code and cqds related to Thrift interface

ExpHbaseInterface_Thrift class was removed a few months ago. Completing

that cleanup work. exp/Hbase_types.{cpp,h} still remain. These are Thrift

generated files but we use the structs/classes generated for JNI access.

Change-Id: I7bc2ead6cc8d6025fb38f86fbdf7ed452807c445

  1. … 19 more files in changeset.
Bulk unload optimization using snapshot scan

resubmitting after facing git issues

The changes consist of:

*implementing the snapshot scan optimization in the Trafodion scan operator

*changes to bulk unload to use the new snapshot scan.

*Changes to scripts and permissions (using ACLs)

*Rework based on review

Details:

*Snapshot Scan:

----------------------

**Added support for snapshot scan to Trafodion scan

**The scan expects the hbase snapshots themselves to be created before running

the query. When used with bulk unload the snapshots can be created by bulk unload

**The snapshot scan implementation can be used without the bulk-unload. To use

the snapshot scan outside bulk-unload we need to use the below cqds

cqd TRAF_TABLE_SNAPSHOT_SCAN 'on';

-- the snapshot name will be the table name concatenated with the suffix-string

cqd TRAF_TABLE_SNAPSHOT_SCAN_SNAP_SUFFIX 'suffix-string';

-- temp dir needed for the hbase snapshot scan

cqd TRAF_TABLE_SNAPSHOT_SCAN_TMP_LOCATION '/bulkload/temp_scan_dir/';

**snapshot scan can be used with table scan, index scans etc…

*Bulk unload utility:

-------------------------------

**The bulk unload optimization is due to the newly added support for snapshot scan.

By default bulk unload uses the regular scan. But when snapshot scan is

specified it will use snapshot scan instead of regular scan

**To use snapshot scan with Bulk unload we need to specify the new options in

the bulk unload syntax: NEW|EXISTING SNAPSHOT HAVING SUFFIX QUOTED_STRING

***using NEW in the above syntax means the bulk unload tool will create new

snapshots while using EXISTING means bulk unload expects the snapshots to

exist already.

***The snapshot names are based on the table names in the select statement. The

snapshot name needs to start with the table name and have a suffix QUOTED-STRING

***For example, for "unload with NEW SNAPSHOT HAVING SUFFIX 'SNAP111' into 'tmp'

select * from cat.sch.table1;" the unload utility will create a snapshot

CAT.SCH.TABLE1_SNAP111, and for "unload with EXISTING SNAPSHOT HAVING SUFFIX

'SNAP111' into 'tmp' select * from cat.sch.table1;" the unload utility will

expect a snapshot CAT.SCH.TABLE1_SNAP111 to exist already. Otherwise

an error is produced.

***If this newly added option is not used in the syntax bulk unload will use

the regular scan instead of snapshot scan

**The bulk unload queries the explain plan virtual table to get the list of

Trafodion tables that will be scanned and based on the case it either creates

the snapshots for those tables or verifies whether they already exist

*Configuration changes

--------------------------------

**Enable ACLs in hdfs


*Testing

--------

**All developer regression tests were run and all passed

**bulk unload and snapshot scan were tested on the cluster

*Examples:

**Example of using snapshot scan without bulk unload:

(we need to create the snapshot first)

>>cqd TRAF_TABLE_SNAPSHOT_SCAN 'on';

--- SQL operation complete.

>>cqd TRAF_TABLE_SNAPSHOT_SCAN_SNAP_SUFFIX 'SNAP777';

--- SQL operation complete.

>>cqd TRAF_TABLE_SNAPSHOT_SCAN_TMP_LOCATION '/bulkload/temp_scan_dir/';

--- SQL operation complete.

>>select [first 5] c1,c2 from tt10;

C1                     C2
---------------------  --------------------
                  .00                     0
                  .01                     1
                  .02                     2
                  .03                     3
                  .04                     4

--- 5 row(s) selected.

**Example of using snapshot scan with unload:

UNLOAD

WITH PURGEDATA FROM TARGET

NEW SNAPSHOT HAVING SUFFIX 'SNAP778'

INTO '/bulkload/unload_TT14_3' select * from seabase.TT20 ;

Change-Id: Idb1d1807850787c6717ab0aa604dfc9a37f43dce

  1. … 35 more files in changeset.
Reworked fix for LP bug 1404951

The scan cache size for an mdam probe is now set to the hbase default of 100.

Setting it to values like 1 or 2 resulted in intermittent failures. The cqd

COMP_BOOL_184 can be set ON to get a cache size of 1 for mdam probes.

Root cause for this intermittent failure will be investigated later.
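
A sketch of reverting to the old cache size of 1 for mdam probes:

cqd COMP_BOOL_184 'ON';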

Change-Id: Ic05a77ecb0deeb260784f156de251a0f0dbdf49c

  1. … 5 more files in changeset.
Support for divisioning (multi-temperature data)

This is the initial support for divisioning. See

blueprint cmp-divisioning for more information:

https://blueprints.launchpad.net/trafodion/+spec/cmp-divisioning

Also, this change fixes the following LaunchPad bugs:

Bug 1388458 insert using primary key default value into a salted

table asserts in generator

Bug 1385543 salt clause on a table with large number of primary

key columns returns error

Bug 1392450 Internal error 2005 when querying a Hive table with

an unsupported data type

In addition, it changes the following behavior:

- The _SALT_ column now gets added as the last column in the

CREATE TABLE statement, rather than the first column after

SYSKEY. The position of _SALT_ in the clustering key does

not change. This will cause some differences in INVOKE and

in the column number assigned to columns.

- For CREATE TABLE LIKE, the defaults of the WITH clauses

are changing. CREATE TABLE LIKE now copies constraints,

SALT and DIVISION clauses by default. The WITH CONSTRAINTS

clause is now the default and should no longer be used.

Instead, WITHOUT CONSTRAINTS, WITHOUT SALT and WITHOUT

DIVISIONING clauses are supported.

- For CREATE INDEX ... SALT LIKE TABLE, we now give a

warning instead of an error if the table is not salted.

- Also added an optimization for BETWEEN predicates. If

part or all of them can be converted to an equals predicate,

we do this now. Example:

(a,b,c,d) between (1,2,3,4) and (1,2,5,6)

is transformed into

a=1 and b=2 and (c,d) between (3,4) and (5,6).
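
A hedged sketch of the salt and divisioning clauses described above (table and divisioning expression are illustrative, not from this change; see the blueprint for the full syntax):

create table t1 (ts timestamp not null, id int not null, primary key (ts, id))

salt using 4 partitions

division by (date_trunc('month', ts));

create table t2 like t1 without salt without divisioning;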

More detailed description of changes:

- arkcmp/CmoStoredProc.cpp

sqlcat/desc.h

+ other files

Using the new FLAGS column in the COLUMNS metadata table to store

whether a column is a salt or divisioning column. Note that since

there may be existing salted tables without this flag set, the flag

is so far only reliable for divisioning columns.

- comexe/ComTdb.h

comexe/*.h

generator/Generator.cpp

sqlcomp/CmpSeabaseDDLmd.h:

Changed the column class field in struct

ComTdbVirtTableColumnInfo from a string to the corresponding

enum. Sorry, this caused lots of small changes (deleting "_LIT"

from the initializers). Also added the column flags.

- executor/hiveHook.cpp: Added a check for partitioned tables

(having multiple SDs). This is part of the fix for

bug 1353632.

- GenRelUpdate.cpp: When generating the key encoding expression

for an insert inside a MERGE operation, we assumed the new

record expression was in the order of the key columns. Added

a step to sort by key column, so we can pass the expression

in any order.

- optimizer/ItemExpr.cpp

optimizer/ItemNAType.h:

Added a named NATypeToItem item expression.

This is used to do a primitive "bind" operation of an item expression

when processing a DDL statement. Specifically, to bind the DIVISION BY

clause in a CREATE TABLE statement.

- optimizer/ItemFunc.h

optimizer/SynthType.cpp: The DDL time "binder" gets expressions as

they come out of the parser, e.g. a ZZZBinderFunction. Need to add

type synthesis for some cases of the ZZZBinderFunction.

- optimizer/NATable.cpp

Removing some dead code. Adding an error message when we encounter

a Hive column type we can't handle yet. Bug 1392450.

- optimizer/TableDesc.*

Method TableDesc::validateDivisionByClauseForDDL() got moved

to CmpSeabaseDDL::validateDivisionByExprForDDL().

- optimizer/NormItemExpr.cpp

BETWEEN transformation described above.

- optimizer/ValueDesc.cpp

Avoid hard-coding the "_SALT_" name and add a comment about the

possibility of using the flag in the future.

- parser

Lots of small changes for salt and divisioning option changes.

Simplifying the syntax for salt options somewhat. I think the older

syntax was so complex because it needed to record the starting and

ending position of the divisioning clause, something we don't need

anymore.

- regress: Adding new test

- sqlcomp/CmpDescribe.cpp: Support for describing DIVISION BY clause

and also supporting the new WITHOUT SALT | DIVISION options

for CREATE TABLE LIKE, which relies on the describe feature.

- sqlcomp/CmpSeabaseDDLcommon.cpp

sqlcomp/CmpSeabaseDDL.h

+ Handling the new column flags and making sure they are not

confused with the HBase column flags (e.g. for serialization).

+ Setting the new COLUMNS.FLAGS when writing metadata.

+ Also, writing the computed column text to the TEXT table.

+ For DROP TABLE, unconditionally deleting TEXT rows, since the

table could contain computed columns.

+ When building ColInfoArray, check system column flags, since

system columns can now appear at any position.

+ Add method to "bind" an item expression during DDL processing

without going through the full binder. This replaces any column

reference with a named NATypeToItem node, since all we really

need is the type and the name for unparsing.

+ Method TableDesc::validateDivisionByClauseForDDL() got moved

to CmpSeabaseDDL::validateDivisionByExprForDDL() with some minor

adjustments, since it used to be called on a bound ItemExpr, now

it gets called on something that came out of the parser and went

through the DDL time "binder".

- sqlcomp/CmpSeabaseDDLindex.cpp:

Support for CREATE INDEX ... DIVISION LIKE TABLE. If this is

set, add the division columns in front of the index key, otherwise

don't.

- sqlcomp/CmpSeabaseDDLtable.cpp:

+ Code to make sure column flags and column class is set and propagated.

+ Fix for bug 1385543: Now that we use the TEXT table for computed

column text, we no longer have a length limit. This is true for both

divisioning and salt expressions.

+ When processing the column list in seabaseCreateTable() we have a

bit of a chicken and egg problem: We need the column list to validate

the DIVISION BY expressions, but the DIVISION BY columns need to be part

of the column list. So, we do this a first time without divisioning

columns, then we add those, and produce the final list in a second

iteration.

+ getTextFromMD method now takes a sub-id as an input parameter. That's

the column number for computed column text.

+ read computed column text from the TEXT table. Note: This also needs

to handle older tables where the computed column text is stored in

the default value.

Change-Id: I7c3ebe39a950c1d01f31855bdc92cbb98e5eb275

  1. … 50 more files in changeset.
Various LP fixes, bugs and code cleanup.

-- removed obsolete code (label create/alter/delete, get disk/label/buffer stats,

dp2 scan)

-- metadata structs are now created as classes and initialized during

creation. LP 1394649

-- warnings are now being returned from compiler to executor after DDL operations.

-- duplicate constraint names now return error.

-- handle NOT ENFORCED constraints: give warning during creation and not enforce

during use. LP 1361784

-- drop all indexes (enabled and disabled indexes) on a table during drop table

and schema now works. LP 1384380

-- drop constraint on disabled index succeeds. LP 1384479

-- string truncation error is now returned if default value doesn't fit in

column. LP 1394780

-- fixed issue where a failure during multiple constraints creation in a create

stmt was not cleaning up metadata. LP 1389871

-- update where current of is now supported. LP 1324679
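
A minimal sketch of "update where current of" in embedded-SQL style (cursor and table names hypothetical):

declare c1 cursor for select a, b from t1 for update of b;

open c1;

fetch c1;

update t1 set b = b + 1 where current of c1;

close c1;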

Change-Id: Iec1b0b4fc6a8161a33b7f69228c0f1e3f441f330

  1. … 54 more files in changeset.
Enabling runtime stats for hbase tables and operators

This is the third set of changes to collect the runtime stats info. Part of it

is to address the comments and suggestions from the last review.

1) Instead of passing the hbase access stats entry to every htable

call, set the pointer in the EXP hbase interface layer with the first init

call in the tcb work methods (not the task work methods), then

eventually to the htable JNI layer from getHTableClient()

(sql/exp/ExpHbaseInterface.cpp).

2) Rewrite the way to construct the hbase operator names from one

methord and use it for display both tdb contents and tcb stats.

3) Populate the hbase I/O bytes counter for both read and insert

(sql/executor/HBaseClient_JNI.cpp).

4) Fix the problem that parsing stats variable text string could go

beyond the end of the string (getSubstrInfo() in

sql/executor/ExExeUtilGetStats.cpp).

Change-Id: I62618b57894039bc1ca5bc0f3c9b89efec5cc42e

  1. … 15 more files in changeset.
Enabling runtime stats for hbase operators

This is the first set of changes to collect the runtime stats info for

hbase tables and operators. It contains:

1) Populate the estimated row count in hbase access TDB.

2) Collect the hbase access time and accessed row count at the JNI layer

(only for select operations now).

Partially reviewed by Mike H. and Selva G.

Removed the part that devides the estimated rows by number of ESPs based on the comments

Change-Id: I5a98a8ae9c4462aa53ad889edfe4cd8563502477

  1. … 11 more files in changeset.
Changes to support OSS poc.

This checkin contains multiple changes that were added to support OSS poc.

These changes are enabled through a special cqd mode_special_4 and not

yet externalized for general use.

A separate spec contains details of these changes.

These changes have been contributed and pre-reviewed by Suresh, Jim C,

Ravisha, Mike H, Selva and Khaled.

All dev regressions have been run and passed.

Change-Id: I2281c1b4ce7e7e6a251bbea3bf6dc391168f3ca3

  1. … 143 more files in changeset.
fix for LP 1359906

Change-Id: I3a8e40b25d3ce3a9261cfd999c116d1b6538d84d

  1. … 1 more file in changeset.
Trafodion bulk load changes

The changes include:

-A way to specify the maximum size of the Hfiles beyond which the file

will be split.

-Adding the "upsert using load ..." statement to run under the load utility

so that it can take advantage of disabling and populating indexes and so on.

The syntax is: load with upsert using load into <trafodion table> select ...

from <table>. "Upsert using load" can still be used separately from the load

utility. (A sketch follows after this list.)

-Checks in the compiler to make sure indexes and constraints are disabled

before running the "upsert using load" statement

-Moving seabase tests 015 and 017 to the hive suite as they use hive tables.
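
A hedged sketch of "upsert using load" under the load utility (table names hypothetical):

load with upsert using load into trafodion.sch.t1 select * from hive.hive.src_t1;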

Change-Id: I80303e4471d2179718e050c98d954ef56cd4cc4f

  1. … 26 more files in changeset.
added support for externalized Sequence numbers.

-- sql statements: create/drop/alter/get/showddl sequence for sequence objects

-- function 'seqnum' to retrieve sequence numbers.
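
A minimal sketch (object names hypothetical):

create sequence trafodion.sch.seq1;

insert into trafodion.sch.t1 values (seqnum(trafodion.sch.seq1, next), 'a');

showddl sequence trafodion.sch.seq1;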

An external spec has been created.

Launchpad #1349985

Code reviewed by Joanie C, Suresh S, Selva and Sandhya.

Full dev regressions run and passed.

Change-Id: Ie11dbab4d24ff6a1106697f7e2253ea895e6c873

  1. … 71 more files in changeset.
Improve IUD performance by using direct buffers

All IUD operations except the Bulk load operations now use direct buffers.

There is a

- Direct buffer for multiple rowIDs

- Direct buffer for one or more rows containing the column names and

column values

Format for RowID direct buffer

numRowIds + rowId + rowId + …

rowId len is passed as a parameter to Java functions

Format for Row direct buffer:

numCols
              __
colNameLen   |
colName      |   For each non-null column value
colValueLen  |
colValue   __|

The colValue will have a one-byte null indicator (0 or -1) if the column is nullable.

The numRowIds, numCols, colNameLen and colValueLen are byte swapped because

Java always uses BIG_ENDIAN format. But the column value need not be byte

swapped because the column value is always interpreted on the JNI side.

These direct buffers are wrapped as ByteBuffer and parsed in the java functions

to do the corresponding Hbase operations. Using direct buffers avoids the

following

- Creating temporary Mutation objects

- Creating RowToInsert java objects from these Mutation objects

- Transitioning from JNI to java call to add columns to RowToInsert object

- Iterating through the columns objects to do put on the java side.

Fix for seabase/TEST022 with the merged version. Rowwise format has been

changed recently and a corresponding change is made in the direct row buffer.

Multiple row delete was not part of the transaction earlier because RMInterface

wasn't exposing an API to do delete using a List. In patchset 1, multiple

deletes were converted into single-row deletes done iteratively using a transaction.

In patchset 2, RMInterface is changed to support multiple row delete and hence

Trafodion HbaseClient uses this new interface to do multiple row delete with

a transaction.

Change-Id: I1617b4236a823040ec3dfefbc66508f3f01868fc

  1. … 10 more files in changeset.
Bulk load changes

changes include:

A. Added new options to the load syntax:

Load [with option[,option,….]] into <table> select .. from <table>

Option can be:

* Truncate Table: By default the target table is not truncated before

loading data. If truncate table option is specified the target table

is truncated before loading.

* No recovery: by default load handles recovery using snapshots.

If no recovery option is specified then snapshots are not used.

This option was called no rollback before.

* Log errors: (not implemented yet). Will be used to log error rows

into a file.

* No Populate indexes: by default indexes are handled by load.

In this case indexes are disabled before starting the load and

populated afterwards

If no populate indexes option is specified, indexes are not handled

by load.

* Constraints: (not implemented yet) will be used to handle constraints.

* No duplicate check: by default an error is generated when duplicates

are detected.

If no duplicate check option is specified then duplicates are ignored.

* Stop after N errors: (not implemented yet) will be used to fail the

load after N errors

* No output: by default the load command prints status messages listing the

steps that are being executed.

If no output is specified then no status messages are displayed.

B. Support for duplicate detection at runtime. By default an error is generated

when duplicates are detected.

If no duplicate check is specified then duplicates are ignored.

C. Added index handling to load

By default, indexes are disabled before starting the load and populated

afterwards.

D. Changes in the optimizer to make sure the data is always sorted.

When loading small data sets we noticed that the optimizer does not always

add a sort node.

E. Added status messages listing the different steps that are taking place.

To disable the status messages, specify the "no output" option.

F. Added checks so that the user cannot specify an option more than once.
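
A hedged sketch combining several of the options above (table names hypothetical):

load with truncate table, no recovery, no output into trafodion.sch.t1 select * from hive.hive.src_t1;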

Change-Id: I6c0596eb38763def2f23c7452ae400d9fccb008e

  1. … 25 more files in changeset.
Update Statistics performance improved by sampling in HBase

Update Statistics is much slower on HBase tables than it was for

Seaquest. A recent performance analysis revealed that much of the

deficit is due to the time spent retrieving the data from HBase that

is used to derive the histograms. Typically, Update Statistics uses a

1% random sample of a table’s rows for this purpose. All rows of the

table are retrieved from HBase, and the random selection of which rows

to use is done in Trafodion.

To reduce the number of rows flowing from HBase to Trafodion for queries

using a SAMPLE clause that specifies random sampling, the sampling logic

was pushed into the HBase layer using a RandomRowFilter, one of the

built-in filters provided by HBase. In the typical case of a 1% sample,

this reduces the number of rows passed from HBase to Trafodion by 99%.

After the fix was implemented, Update Stats on various tables was 2 to 4

times faster than before, when using a 1% random sample.
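
With this change, a typical sampling ustat statement (table name hypothetical) pushes the row selection into HBase automatically:

update statistics for table trafodion.sch.t1 on every column sample random 1 percent;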

Change-Id: Icd40e4db1dde444dec76165c215596755afae96c

  1. … 13 more files in changeset.
trafodion bulk load changes (disabled)

fixing a conflict in install_local_hadoop

these changes include:

*changes to the install scripts (install_local_hadoop and other files)

**changed hbase to run on top of hdfs instead of the local file system. This change may require running install_local_hadoop again when you rebase and initialize trafodion again.

**You may lose your tables. If you have tables that you need to keep, please use extract and load: extract the data before you rebase and then load it back after you rebase.

*adding a coprocessor to support a secure way of doing load using hidden folders (works with non-secure hbase). Secure load is disabled by default.

*recovery using snapshots (disabled by default): when enabled, a snapshot is taken before the load starts and restored if something goes wrong. Otherwise it is deleted after the data is loaded.

*changes to Makefiles to build the coprocessors in java 7 and 6

Change-Id: If496afc874c842e3e02b2c3426b71c1090cbeca9

  1. … 28 more files in changeset.
Bulk Load code drop (feature disabled)

Main reason for this delivery is to test on the cluster

These changes are for the trafodion bulk load which is currently disabled. The changes include:

*using utilities infrastructure for the bulk load: Previously the bulk load (prototype) used a blocked union to sequence the 2 main steps: hfiles preparation and hfiles loading. But because of some limitations, bulk load is now implemented with multiple steps using the utilities infrastructure; some of the steps are:

**Detect whether table is empty or not

**Truncate table if requested by user

**Generate the HFiles

**Load/merge the Hfiles

*new syntax

**Load [with option,option,….] into table select .. from table

**Options can be:

***Truncate:

***No rollback: no rollback is not supported yet for non-empty tables

***Log errors: not implemented yet

*Changes to load query plan in the optimizer:

**Use of range repartitioning if the table is salted

**Use of non-random repartitioning if the table is not salted; the primary key is used as the partitioning key now

*“Quasi-secure” bulk load using coprocessors: Implementation of the hidden directory mechanism similar to what the secure version of hbase implements (referred to as “quasi-secure” here). This mechanism is used when the users under which hbase and trafodion run are different. It is implemented using a coprocessor which is not added to hbase settings yet (disabled for now).

Change-Id: I6407c82249dc2e1e29267b92df0ab1d04ba3ab31

  1. … 36 more files in changeset.
Initial code drop of Trafodion

    +855 -0 ./ComTdbHbaseAccess.h
  1. … 4886 more files in changeset.