DRILL-6145: Implement Hive MapR-DB JSON handler

closes #1158

  1. … 3 more files in changeset.
DRILL-6204: Pass tables columns without partition columns to empty Hive reader

closes #1146

    • -5
    • +5
    ./store/hive/HiveDrillNativeScanBatchCreator.java
DRILL-6195: Querying Hive non-partitioned transactional tables via Drill

closes #1140

DRILL-6164: Heap memory leak during parquet scan and OOM

closes #1122

    • -1
    • +2
    ./store/hive/HiveDrillNativeScanBatchCreator.java
  1. … 14 more files in changeset.
DRILL-6436: Storage Plugin to have name and context moved to AbstractStoragePlugin

closes #1282

    • -14
    • +4
    ./store/hive/HiveStoragePlugin.java
  1. … 11 more files in changeset.
DRILL-6130: Fix NPE during physical plan submission for various storage plugins

1. Fixed ser / de issues for the Hive, Kafka and HBase plugins.

2. Added physical plan submission unit test for all storage plugins in contrib module.

3. Refactoring.

closes #1108

    • -7
    • +7
    ./planner/sql/HivePartitionDescriptor.java
    • -7
    • +7
    ./planner/sql/logical/ConvertHiveParquetScanToDrillParquetScan.java
    • -11
    • +11
    ./store/hive/HiveDrillNativeParquetScan.java
    • -2
    • +2
    ./store/hive/HiveDrillNativeParquetSubScan.java
  1. … 21 more files in changeset.
DRILL-5730: Mock testing improvements and interface improvements

closes #1045

    • -3
    • +4
    ./store/hive/HiveDrillNativeScanBatchCreator.java
    • -3
    • +3
    ./store/hive/HiveScanBatchCreator.java
  1. … 220 more files in changeset.
DRILL-3993: Changes to support Calcite 1.15.

Fix AssertionError: type mismatch for tests with aggregate functions.

Fix VARIANCE agg function

Remove using deprecated Subtype enum

Fix 'Failure while loading table a in database hbase' error

Fix 'Field ordinal 1 is invalid for type '(DrillRecordRow[*])'' unit test failures

  1. … 17 more files in changeset.
DRILL-5978: Updating of Apache and MapR Hive libraries to 2.3.2 and 2.1.2-mapr-1710 versions respectively

* Improvements to allow reading Hive bucketed transactional ORC tables;

* Updating Hive properties for tests and resolving dependency and API conflicts:

- Fix for "hive.metastore.schema.verification" MetaException (message: Version information not found in metastore, see https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool): the METASTORE_SCHEMA_VERIFICATION="false" property is added

- Added the METASTORE_AUTO_CREATE_ALL="true" property to tests, because some additional tables are necessary in the Hive metastore

- Disabled Calcite CBO (Hive's CalcitePlanner) for tests via the HIVE_CBO_ENABLED="false" property, because it conflicts with Drill's Calcite version in Drill unit tests

- jackson and parquet libraries are relocated in hive-exec-shade module

- The Drill version of org.apache.parquet:parquet-column is added to "hive-exec" to allow using a Parquet empty group at the MessageType level (PARQUET-278)

- Removed the commons-codec exclusion from Hive core; this dependency is necessary for hive-exec and hive-metastore.

- Setting Hive internal properties for transactional scan:

HiveConf.HIVE_TRANSACTIONAL_TABLE_SCAN and for schema evolution: HiveConf.HIVE_SCHEMA_EVOLUTION,

IOConstants.SCHEMA_EVOLUTION_COLUMNS, IOConstants.SCHEMA_EVOLUTION_COLUMNS_TYPES

- "io.dropwizard.metrics:metrics-core" at its latest version, 4.0.2, is added to the dependencyManagement block in the Drill root POM

- Exclusion of "hive-exec" in "hive-hbase-handler" is already in Drill root dependencyManagement POM

- Hive Calcite libraries are excluded (Calcite CBO was disabled)

- "jackson-core" dependency is added to DependencyManagement block in Drill root POM file

- For the MapR Hive 2.1 client, an older "com.fasterxml.jackson.core:jackson-databind" is included

- "log4j:log4j" dependency is excluded from "hive-exec", "hive-metastore", "hive-hbase-handler".

close apache/drill#1111

    • -0
    • +1
    ./store/hive/HiveMetadataProvider.java
    • -21
    • +36
    ./store/hive/readers/HiveAbstractReader.java
  1. … 12 more files in changeset.
DRILL-5941: Skip header / footer improvements for Hive storage plugin

Overview:

1. When a table has a header / footer, process input splits of the same file in one reader (bug fix for DRILL-5941).

2. Apply skip-header logic only once, during reader initialization, to avoid checks while reading the data (DRILL-5106).

3. Apply skip-footer logic only when the footer count is greater than 0; otherwise default processing is done without buffering data in a queue (DRILL-5106).

Code changes:

1. AbstractReadersInitializer was introduced to factor out common logic during reader initialization.

It will have two implementations:

a. Default (each input split group gets its own reader);

b. Empty (for empty tables);

2. AbstractRecordsInspector was introduced to improve performance when the table's footer count is less than or equal to 0.

It will have two implementations:

a. Default (records will be processed one by one without buffering);

b. SkipFooter (a queue will be used to buffer the N records that should be skipped at the end of file processing).

3. When a text table has a header / footer, each table file should be read as one unit. When a file is read as several input splits, they should be grouped.

For this purpose the LogicalInputSplit class was introduced, replacing the InputSplitWrapper class. The new class stores a list of grouped input splits and returns information about splits at the group level.

Please note that during planning, input splits are grouped only when data is read from a text table with a header / footer; otherwise each input split is treated separately.

4. Allow HiveAbstractReader to have multiple input splits instead of one.

This closes #1030

    • -358
    • +0
    ./store/hive/HiveAbstractReader.java
    • -2
    • +2
    ./store/hive/HiveDrillNativeParquetSubScan.java
    • -48
    • +51
    ./store/hive/HiveDrillNativeScanBatchCreator.java
    • -34
    • +156
    ./store/hive/HiveMetadataProvider.java
    • -63
    • +6
    ./store/hive/HiveScanBatchCreator.java
    • -0
    • +417
    ./store/hive/readers/HiveAbstractReader.java
  1. … 7 more files in changeset.
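The skip-footer buffering described above (point 2b of the code changes) can be sketched generically. This is an illustrative model under assumed names, not Drill's actual inspector classes: records pass through a bounded queue of size N, so the last N records (the footer) are never emitted, and a footer count of 0 bypasses the queue entirely.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the skip-footer idea: buffer the last N records in a
// bounded queue; a record is emitted only once N newer records have arrived,
// so the final N records (the footer) are never emitted.
public class SkipFooterSketch {
    public static List<String> readSkippingFooter(Iterator<String> records, int footerCount) {
        if (footerCount <= 0) {
            // Default path described in the commit: no buffering at all.
            List<String> out = new ArrayList<>();
            records.forEachRemaining(out::add);
            return out;
        }
        Deque<String> buffer = new ArrayDeque<>(footerCount);
        List<String> out = new ArrayList<>();
        while (records.hasNext()) {
            String rec = records.next();
            if (buffer.size() == footerCount) {
                out.add(buffer.removeFirst()); // oldest buffered record is now safe to emit
            }
            buffer.addLast(rec);
        }
        // Whatever remains in the buffer is the footer: drop it.
        return out;
    }
}
```

The design choice this illustrates is why the commit splits the inspectors in two: the queue adds per-record overhead, so it is only worth paying when a footer actually has to be skipped.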
DRILL-5857: Fix NumberFormatException in Hive unit tests

closes #980

    • -4
    • +5
    ./store/hive/HiveMetadataProvider.java
DRILL-4264: Allow field names to include dots

    • -2
    • +2
    ./planner/sql/logical/ConvertHiveParquetScanToDrillParquetScan.java
    • -3
    • +3
    ./store/hive/HiveDrillNativeScanBatchCreator.java
  1. … 97 more files in changeset.
DRILL-5546: Handle schema change exception failure caused by empty input or empty batch.

1. Modify ScanBatch's logic when it iterates list of RecordReader.

1) Skip a RecordReader if it returns 0 rows and presents the same schema. A new schema (detected by calling Mutator.isNewSchema()) means either a new top-level field was added, a field within a nested field was added, or an existing field's type changed.

2) Implicit columns are presumed to have constant schema, and are added to outgoing container before any regular column is added in.

3) ScanBatch will return NONE directly (called a "fast NONE") if all its RecordReaders have empty input and thus are skipped, instead of returning OK_NEW_SCHEMA first.

2. Modify IteratorValidatorBatchIterator to allow

1) a fast NONE (before seeing an OK_NEW_SCHEMA)

2) a batch with an empty list of columns.

2. Modify JsonRecordReader when it gets 0 rows: do not insert a nullable-int column for 0-row input. Together with ScanBatch, Drill will skip empty JSON files.

3. Modify binary operators such as join and union to handle fast NONE for either one side or both sides. The logic is abstracted in AbstractBinaryRecordBatch, except for MergeJoin, as its implementation is quite different from the others.

4. Fix and refactor union all operator.

1) Correct the union operator's handling of 0 input rows. Previously it ignored inputs with 0 rows and put a nullable-int into the output schema, which caused various schema change issues in downstream operators. The new behavior is to take a schema with 0 rows into account when determining the output schema, in the same way as inputs with > 0 rows. By doing that, we ensure the Union operator will not behave like a schema-lossy operator.

2) Add a UnionInputIterator to simplify the logic that iterates over the left/right inputs, removing a significant chunk of duplicated code from the previous implementation.

The new union-all operator reduces the code size by half compared to the old one.

5. Introduce UntypedNullVector to handle convertFromJson() function, when the input batch contains 0 row.

Problem: The function convertFromJSON() is different from other regular functions in that it only knows the output schema after evaluation is performed. When the input has 0 rows, Drill essentially has no way to know the output type, and previously assumed Map type. That worked under the assumption that other operators like Union would ignore batches with 0 rows, which is no longer the case in the current implementation.

Solution: Use MinorType.NULL as the output type for convertFromJSON() when the input contains 0 rows. The new UntypedNullVector is used to represent a column with MinorType.NULL.

6. HBaseGroupScan converts the star column into a list of row_key and column families. HBaseRecordReader should reject the star column since it expects star to have been converted elsewhere.

In HBase a column family always has map type, and a non-rowkey column always has nullable varbinary type. This ensures that HBaseRecordReaders across different HBase regions will have the same top-level schema, even if a region is empty or prunes all of its rows due to filter pushdown optimization. In other words, we will not see different top-level schemas from different HBaseRecordReaders for the same table.

However, this change cannot handle a hard schema change: c1 exists in cf1 in one region, but not in another. Further work is required to handle hard schema changes.

7. Modify scan cost estimation when the query involves the * column. This removes planning randomness, since previously two different operators could have the same cost.

8. Add a new flag 'outputProj' to the Project operator, to indicate whether the Project is for the query's final output. Such a Project is added by TopProjectVisitor to handle fast NONE when all the inputs to the query are empty and are skipped.

1) the star column is replaced with an empty list

2) a regular column reference is replaced with a nullable-int column

3) an expression will go through ExpressionTreeMaterializer, and the type of the materialized expression is used as the output type

4) Return OK_NEW_SCHEMA with the schema built using the above logic, then return NONE to the downstream operator.

9. Add unit test to test operators handling empty input.

10. Add unit test to test query when inputs are all empty.

DRILL-5546: Revise code based on review comments.

Handle implicit column in scan batch. Change interface in ScanBatch's constructor.

1) Ensure that either the implicit column list is empty, or all the readers have the same set of implicit columns.

2) We can skip the implicit columns when checking whether a schema change is coming from the record reader.

3) ScanBatch accepts a list instead of an iterator, since we may need to go through the implicit column list multiple times and verify that the sizes of the two lists are the same.

ScanBatch code review comments. Add more unit tests.

Share code path in ProjectBatch to handle normal setupNewSchema() and handleNullInput().

- Move SimpleRecordBatch out of TopNBatch to make it sharable across different places.

- Add Unit test verify schema for star column query against multilevel tables.

Unit test framework change

- Fix memory leak in unit test framework.

- Allow SchemaTestBuilder to pass in BatchSchema.

close #906

    • -3
    • +3
    ./store/hive/HiveDrillNativeScanBatchCreator.java
    • -1
    • +1
    ./store/hive/HiveScanBatchCreator.java
  1. … 65 more files in changeset.
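The "fast NONE" rule from point 1 above can be modeled minimally. The names below are mine, not Drill's real ScanBatch API; the sketch only shows the decision rule: a reader contributes a batch when it has rows or changes the schema, and if every reader is empty with an unchanged schema, the scan reports NONE directly.

```java
import java.util.List;

// Minimal stand-in model of the "fast NONE" decision (not Drill's actual classes).
public class FastNoneSketch {
    public enum Outcome { OK_NEW_SCHEMA, NONE }

    /** Stand-in for a RecordReader's first batch: row count plus a schema-change flag. */
    public static final class ReaderBatch {
        final int rowCount;
        final boolean newSchema;
        public ReaderBatch(int rowCount, boolean newSchema) {
            this.rowCount = rowCount;
            this.newSchema = newSchema;
        }
    }

    public static Outcome firstOutcome(List<ReaderBatch> readers) {
        for (ReaderBatch r : readers) {
            if (r.rowCount > 0 || r.newSchema) {
                return Outcome.OK_NEW_SCHEMA; // a real batch (or new schema) to report
            }
            // 0 rows with an unchanged schema: skip this reader entirely.
        }
        return Outcome.NONE; // every reader was empty: report "fast NONE" directly
    }
}
```

This is the behavior that downstream binary operators (join, union) then have to tolerate, which is what points 3 and 4 of the commit address.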
DRILL-3250: Drill fails to compare multi-byte characters from hive table - A small refactoring of the original fix for this issue (DRILL-4039); - Added a test for the fix.

    • -7
    • +6
    ./store/hive/schema/DrillHiveTable.java
  1. … 3 more files in changeset.
DRILL-5399: Fix race condition in DrillComplexWriterFuncHolder

  1. … 10 more files in changeset.
DRILL-5496: Fix for failed Hive connection

If the Hive server restarts, Drill either hangs or continually reports

errors when retrieving schemas. The problem is that the Hive plugin

tries to handle connection failures, but does not do so correctly for

the secure connection case. The problem is complex, see DRILL-5496 for

details.

This is a workaround: we discard the entire Hive schema cache when we

encounter an unhandled connection exception, then we rebuild a new one.

This is not a proper fix; for that we'd have to restructure the code.

This will, however, solve the immediate problem until we do the needed

restructuring.

    • -2
    • +14
    ./store/hive/DrillHiveMetaStoreClient.java
    • -3
    • +65
    ./store/hive/HiveStoragePlugin.java
    • -5
    • +18
    ./store/hive/schema/HiveSchemaFactory.java
  1. … 3 more files in changeset.
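The discard-and-rebuild workaround described above can be illustrated with a generic cache sketch. The names and the loader function are hypothetical; Drill's real DrillHiveMetaStoreClient and HiveSchemaFactory are more involved. The point is the strategy: on an unhandled connection exception, throw away the entire cached map rather than trying to repair connection state, and let the next lookup rebuild it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical illustration of the workaround: discard the whole schema cache
// on a connection failure, then rebuild it lazily on the retry.
public class SchemaCacheSketch {
    private volatile Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> loader; // stands in for a metastore call

    public SchemaCacheSketch(Function<String, String> loader) {
        this.loader = loader;
    }

    public String getSchema(String name) {
        try {
            return cache.computeIfAbsent(name, loader);
        } catch (RuntimeException connectionFailure) {
            cache = new ConcurrentHashMap<>(); // throw the whole cache away
            return cache.computeIfAbsent(name, loader); // rebuild on a fresh attempt
        }
    }
}
```

As the commit message itself notes, this trades efficiency (the full cache is lost) for correctness until the code can be restructured for a proper fix.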
DRILL-4039: Query fails when non-ascii characters are used in string literals

closes #825

    • -2
    • +3
    ./store/hive/schema/DrillHiveTable.java
  1. … 1 more file in changeset.
DRILL-5419: Calculate return string length for literals & some string functions

1. Revisited calculation logic for string literals and some string functions

(cast, upper, lower, initcap, reverse, concat, concat operator, rpad, lpad, case statement,

coalesce, first_value, last_value, lag, lead).

Synchronized return type length calculation logic between limit 0 and regular queries.

2. Deprecated width and changed it to precision for string types in MajorType.

3. Revisited FunctionScope and split it into FunctionScope and ReturnType.

FunctionScope will indicate only the function's usage in terms of the number of in / out rows (n -> 1, 1 -> 1, 1 -> n).

A new UDF annotation, ReturnType, will indicate which return type strategy should be used.

4. Changed MAX_VARCHAR_LENGTH from 65536 to 65535.

5. Updated calculation of precision and display size for INTERVALYEAR & INTERVALDAY.

6. Refactored part of function code-gen logic (ValueReference, WorkspaceReference, FunctionAttributes, DrillFuncHolder).

This closes #819

  1. … 78 more files in changeset.
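As a rough illustration of precision-based return-length calculation for string functions: the strategies shown below for concat and lpad are my own assumptions for the sketch, and the class and method names are hypothetical; only the 65535 maximum comes from the commit message (point 4).

```java
// Illustrative sketch (names are mine, not Drill's) of computing an output
// precision for string functions instead of a fixed width.
public class StringPrecisionSketch {
    static final int MAX_VARCHAR_LENGTH = 65535; // cap from point 4 of the commit

    // Assumed strategy: concat's output precision is the sum of input
    // precisions, capped at the maximum.
    public static int concatPrecision(int... inputPrecisions) {
        long sum = 0;
        for (int p : inputPrecisions) {
            sum += p;
        }
        return (int) Math.min(sum, MAX_VARCHAR_LENGTH);
    }

    // Assumed strategy: lpad's output precision is the requested pad length,
    // clamped to the valid range.
    public static int lpadPrecision(int requestedLength) {
        return Math.min(Math.max(requestedLength, 0), MAX_VARCHAR_LENGTH);
    }
}
```

Computing lengths this way is what lets limit-0 queries report the same return types as regular queries, which is the synchronization point 1 describes.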
DRILL-5043: Function that returns a unique id per session/connection similar to MySQL's CONNECTION_ID() #685

    • -2
    • +2
    ./planner/sql/logical/ConvertHiveParquetScanToDrillParquetScan.java
  1. … 27 more files in changeset.
DRILL-5081: Lower logging level for corrupt dates message

* introduced in DRILL-4203

closes #691

    • -1
    • +3
    ./store/hive/HiveDrillNativeScanBatchCreator.java
  1. … 2 more files in changeset.
DRILL-5034: Select timestamp from hive generated parquet always return in UTC

- The TIMESTAMP_IMPALA function is reverted to retain the local timezone

- TIMESTAMP_IMPALA_LOCALTIMEZONE is deleted

- Retain the local timezone for INT96 timestamp values in the Parquet files while the PARQUET_READER_INT96_AS_TIMESTAMP option is on

Minor changes according to the review

Fix for the test, which relies on particular timezone

close #656

    • -1
    • +1
    ./planner/sql/logical/ConvertHiveParquetScanToDrillParquetScan.java
  1. … 6 more files in changeset.
Revert "DRILL-4373: Drill and Hive have incompatible timestamp representations in parquet - added sys/sess option "store.parquet.int96_as_timestamp"; - added int96 to timestamp converter for both readers; - added unit tests;"

This reverts commit 7e7214b40784668d1599f265067f789aedb6cf86.

    • -2
    • +1
    ./planner/sql/logical/ConvertHiveParquetScanToDrillParquetScan.java
  1. … 13 more files in changeset.
DRILL-5009: Skip reading of empty row groups while reading Parquet metadata

+ We will no longer attempt to scan such row groups.

closes #651

    • -0
    • +4
    ./store/hive/HiveDrillNativeScanBatchCreator.java
  1. … 1 more file in changeset.
DRILL-4982: Separate Hive reader classes for different data formats to improve performance.

1. Separating the Hive reader classes allows optimizations to be applied to different classes in targeted ways. This separation effectively avoids the performance degradation of the scan.

2. Do not apply the skip footer/header mechanism to most Hive formats, since this skip mechanism introduces extra checks on each incoming record.

close apache/drill#638

    • -0
    • +361
    ./store/hive/HiveAbstractReader.java
    • -1
    • +1
    ./store/hive/HiveDrillNativeScanBatchCreator.java
    • -515
    • +0
    ./store/hive/HiveRecordReader.java
    • -19
    • +39
    ./store/hive/HiveScanBatchCreator.java
  1. … 3 more files in changeset.
DRILL-5032: Drill query on hive parquet table failed with OutOfMemoryError: Java heap space

close apache/drill#654

    • -6
    • +6
    ./planner/sql/HivePartitionDescriptor.java
    • -7
    • +7
    ./planner/sql/logical/ConvertHiveParquetScanToDrillParquetScan.java
    • -0
    • +95
    ./store/hive/ColumnListsCache.java
    • -12
    • +28
    ./store/hive/DrillHiveMetaStoreClient.java
    • -2
    • +2
    ./store/hive/HiveDrillNativeParquetScan.java
    • -4
    • +2
    ./store/hive/HiveDrillNativeScanBatchCreator.java
    • -6
    • +5
    ./store/hive/HiveMetadataProvider.java
    • -0
    • +61
    ./store/hive/HivePartition.java
    • -5
    • +3
    ./store/hive/HiveScanBatchCreator.java
    • -0
    • +76
    ./store/hive/HiveTableWithColumnCache.java
  1. … 8 more files in changeset.
DRILL-4964: Make Drill reconnect to hive metastore after hive metastore is restarted.

Drill fails to connect to the Hive metastore after the metastore is restarted, unless the drillbits are restarted.

Changes: For the methods DrillHiveMetaStoreClient.getAllDatabases() and DrillHiveMetaStoreClient.getAllTables(), the HiveMetaStoreClient wraps both MetaException and TException into MetaException. A connection failure, which is thrown as a TException, is therefore difficult to categorize at the DrillClient level. The fix is to close the old connection and reconnect in the case of these two APIs. In all other cases a proper set of exceptions is thrown, and we can handle each one individually.

close apache/drill#628

    • -5
    • +13
    ./store/hive/DrillHiveMetaStoreClient.java
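The close-then-reconnect fix can be sketched generically. The names below are hypothetical, not DrillHiveMetaStoreClient's real API, and a RuntimeException stands in for the wrapped TException; the sketch shows only the strategy for the two listing calls: close the stale connection, open a fresh one, and retry once.

```java
import java.util.List;
import java.util.function.Supplier;

// Generic sketch of close-then-reconnect-and-retry (hypothetical names).
public class ReconnectSketch {
    /** Stand-in for a metastore client connection. */
    public interface Connection {
        List<String> listDatabases();
        void close();
    }

    private final Supplier<Connection> connector;
    private Connection conn;

    public ReconnectSketch(Supplier<Connection> connector) {
        this.connector = connector;
        this.conn = connector.get();
    }

    public List<String> getAllDatabases() {
        try {
            return conn.listDatabases();
        } catch (RuntimeException maybeConnectionFailure) {
            conn.close();               // drop the stale connection first
            conn = connector.get();     // then reconnect
            return conn.listDatabases(); // and retry once on the fresh connection
        }
    }
}
```

Retrying only these calls, as the commit describes, avoids masking the more specific exceptions that the other metastore APIs surface and that can be handled individually.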
DRILL-4203: Parquet file date is stored wrongly - Added a new extra field to the Parquet meta info, "is.date.correct = true"; - Removed an unnecessary double conversion of the value with the Julian day; - Added the ability to correct corrupted dates for Parquet files with the second version of the metadata cache file as well.

This closes #595

    • -0
    • +1
    ./store/hive/HiveDrillNativeScanBatchCreator.java
  1. … 21 more files in changeset.
DRILL-4373: Drill and Hive have incompatible timestamp representations in parquet - added sys/sess option "store.parquet.int96_as_timestamp"; - added int96 to timestamp converter for both readers; - added unit tests;

This closes #600

    • -1
    • +2
    ./planner/sql/logical/ConvertHiveParquetScanToDrillParquetScan.java
  1. … 14 more files in changeset.
DRILL-4826: Query against INFORMATION_SCHEMA.TABLES degrades as the number of views increases

This closes #592

    • -0
    • +26
    ./store/hive/DrillHiveMetaStoreClient.java
    • -17
    • +15
    ./store/hive/schema/HiveDatabaseSchema.java
  1. … 7 more files in changeset.
DRILL-4768: Fix leaking hive meta store connection in Drill's hive metastore client call.

- do not call reconnect if the connection is still alive and the error is caused by either an UnknownTableException or an access error.

- call close() explicitly before reconnect(), and check whether client.close() hits an exception.

- make DrillHiveMetaStoreClient closeable.

close apache/drill#543

    • -11
    • +65
    ./store/hive/DrillHiveMetaStoreClient.java
    • -1
    • +1
    ./store/hive/schema/HiveSchemaFactory.java