Jinfeng Ni

DRILL-5546: Handle schema change exception failures caused by empty input or empty batches.

1. Modify ScanBatch's logic when it iterates its list of RecordReaders.

1) Skip a RecordReader if it returns 0 rows and presents the same schema. A new schema (detected by calling Mutator.isNewSchema()) means a new top-level field was added, a field was added inside a nested field, or an existing field's type changed.

2) Implicit columns are presumed to have a constant schema and are added to the outgoing container before any regular columns.

3) ScanBatch returns NONE directly (a "fast NONE") if all its RecordReaders have empty input and are therefore skipped, instead of returning OK_NEW_SCHEMA first.
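
A minimal sketch of the revised iteration logic, collapsing ScanBatch's per-call state machine into one loop; RecordReader, Mutator, and IterOutcome here are reduced stand-ins, not Drill's actual interfaces:

    import java.util.List;

    enum IterOutcome { OK, OK_NEW_SCHEMA, NONE }

    interface RecordReader {
      int next();              // number of rows read into the current batch
    }

    interface Mutator {
      boolean isNewSchema();   // true if a top-level or nested field was added,
                               // or an existing field's type changed
    }

    class ScanLoopSketch {
      IterOutcome next(List<RecordReader> readers, Mutator mutator) {
        for (RecordReader reader : readers) {
          int rows = reader.next();
          boolean newSchema = mutator.isNewSchema();
          if (rows == 0 && !newSchema) {
            continue;          // skip an empty reader with an unchanged schema
          }
          return newSchema ? IterOutcome.OK_NEW_SCHEMA : IterOutcome.OK;
        }
        // All readers were empty and skipped: return NONE directly
        // ("fast NONE") without emitting OK_NEW_SCHEMA first.
        return IterOutcome.NONE;
      }
    }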

2. Modify IteratorValidatorBatchIterator to allow:

1) a fast NONE (before seeing an OK_NEW_SCHEMA);

2) a batch with an empty list of columns.

3. Modify JsonRecordReader when it gets 0 rows: do not insert a nullable-int column for 0-row input. Together with the ScanBatch change, Drill will skip empty JSON files.

4. Modify binary operators such as join and union to handle a fast NONE from one side or both sides. Abstract the logic into AbstractBinaryRecordBatch, except for MergeJoin, whose implementation differs considerably from the others.
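
A sketch of the early-termination decision that AbstractBinaryRecordBatch centralizes, assuming a simplified outcome enum; whether one empty side suffices to terminate is an operator-specific assumption (e.g., inner join vs. union all):

    enum Outcome { OK_NEW_SCHEMA, NONE }

    class BinaryPrefetchSketch {
      // Returns true if the operator can emit a fast NONE without doing
      // any real work.
      static boolean terminatesEarly(Outcome left, Outcome right,
                                     boolean emptySideYieldsEmptyResult) {
        if (left == Outcome.NONE && right == Outcome.NONE) {
          return true;   // both inputs empty: propagate the fast NONE
        }
        // For operators like inner join, one empty side already implies an
        // empty result; union all must still process the non-empty side.
        return emptySideYieldsEmptyResult
            && (left == Outcome.NONE || right == Outcome.NONE);
      }
    }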

5. Fix and refactor the union all operator.

1) Correct the union operator's handling of 0 input rows. Previously it would ignore inputs with 0 rows and put a nullable-int column into the output schema, which caused various schema change issues in downstream operators. The new behavior takes 0-row schemas into account when determining the output schema, exactly as it does for inputs with more than 0 rows. This ensures the union operator no longer behaves like a schema-lossy operator.

2) Add a UnionInputIterator to simplify the logic that iterates the left/right inputs, removing a significant chunk of duplicated code from the previous implementation.

The new union all operator reduces the code size by half compared to the old one.
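
A minimal sketch of the schema-merge rule described above, with Field as a simplified stand-in (not Drill's MaterializedField); it shows a 0-row side contributing to the output schema exactly like a non-empty side:

    import java.util.ArrayList;
    import java.util.List;

    class Field {
      final String name;
      final String type;      // e.g. "INT", "VARCHAR"
      final boolean nullable;
      Field(String name, String type, boolean nullable) {
        this.name = name; this.type = type; this.nullable = nullable;
      }
    }

    class UnionSchemaSketch {
      // Both sides contribute to the output schema even with 0 rows, so the
      // union operator is no longer schema-lossy.
      static List<Field> outputSchema(List<Field> left, List<Field> right) {
        if (left.size() != right.size()) {
          throw new IllegalStateException("union inputs must have the same column count");
        }
        List<Field> out = new ArrayList<>();
        for (int i = 0; i < left.size(); i++) {
          Field l = left.get(i);
          Field r = right.get(i);
          // Simplification: require equal types; Drill would insert an
          // implicit cast to a common type instead.
          if (!l.type.equals(r.type)) {
            throw new IllegalStateException("type mismatch: " + l.type + " vs " + r.type);
          }
          out.add(new Field(l.name, l.type, l.nullable || r.nullable));
        }
        return out;
      }
    }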

6. Introduce UntypedNullVector to handle the convertFromJSON() function when the input batch contains 0 rows.

Problem: convertFromJSON() differs from regular functions in that it only knows its output schema after evaluation is performed. When the input has 0 rows, Drill essentially has no way to know the output type, and previously assumed Map type. That worked under the assumption that operators such as union would ignore 0-row batches, which is no longer the case in the current implementation.

Solution: Use MinorType.NULL as the output type for convertFromJSON() when the input contains 0 rows. The new UntypedNullVector represents a column with MinorType.NULL.
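
A sketch of the output-type decision, with a reduced MinorType enum (Drill's real types live in org.apache.drill.common.types); the MAP branch approximates the type discovered during evaluation:

    enum MinorType { MAP, NULL /* backed by the new UntypedNullVector */ }

    class ConvertFromJsonTypeSketch {
      static MinorType outputType(int incomingRowCount) {
        // With 0 rows there is nothing to evaluate, so instead of guessing
        // MAP, emit MinorType.NULL and let downstream operators treat the
        // column as typeless.
        return incomingRowCount == 0 ? MinorType.NULL : MinorType.MAP;
      }
    }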

7. HBaseGroupScan converts the star column into a list of row_key and column families. HBaseRecordReader should reject a star column, since it expects the star to have been converted elsewhere.

In HBase, a column family always has map type and a non-rowkey column always has nullable varbinary type. This ensures that HBaseRecordReaders across different HBase regions produce the same top-level schema, even if a region is empty or all of its rows are pruned by filter pushdown optimization. In other words, we will not see different top-level schemas from different HBaseRecordReaders for the same table.

However, this change cannot handle a hard schema change: c1 exists in cf1 in one region but not in another. Further work is required to handle hard schema changes.
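
A sketch of the star expansion in the group scan, written against the HBase 1.x client API (HTableDescriptor/HColumnDescriptor); the method itself is illustrative, not Drill's actual HBaseGroupScan code:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;

    class HBaseStarExpansionSketch {
      static List<String> expandStar(HTableDescriptor table) {
        List<String> columns = new ArrayList<>();
        columns.add("row_key");
        // Each column family becomes one map-typed column, so every reader
        // for this table shares the same top-level schema.
        for (HColumnDescriptor family : table.getFamilies()) {
          columns.add(family.getNameAsString());
        }
        return columns;
      }
    }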

8. Modify scan cost estimation when the query involves a * column. This removes planning randomness, since previously two different operators could have the same cost.

9. Add a new flag 'outputProj' to the Project operator to indicate whether the Project produces the query's final output. Such a Project is added by TopProjectVisitor to handle a fast NONE when all inputs to the query are empty and are skipped:

1) A star column is replaced with an empty list.

2) A regular column reference is replaced with a nullable-int column.

3) An expression goes through ExpressionTreeMaterializer, and the type of the materialized expression is used as the output type.

4) Return OK_NEW_SCHEMA with a schema built from the above rules, then return NONE to the downstream operator.
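
A sketch of these schema rules, with strings standing in for Drill's expression and type classes (the real code goes through ExpressionTreeMaterializer and the project record batch):

    import java.util.ArrayList;
    import java.util.List;

    class NullInputProjectSketch {
      static List<String> outputColumns(List<String> projections) {
        List<String> out = new ArrayList<>();
        for (String p : projections) {
          if (p.equals("*")) {
            continue;                           // star -> empty column list
          } else if (isSimpleColumnRef(p)) {
            out.add(p + " : nullable INT");     // plain reference -> nullable-int
          } else {
            // Expression: materialize it and use the resulting type
            // (approximated here by a placeholder string).
            out.add(p + " : materialized type");
          }
        }
        return out;                             // sent with OK_NEW_SCHEMA, then NONE
      }

      static boolean isSimpleColumnRef(String p) {
        return p.matches("[A-Za-z_][A-Za-z0-9_]*");
      }
    }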

10. Add unit tests for operators handling empty input.

11. Add unit tests for queries whose inputs are all empty.

DRILL-5546: Revise code based on review comments.

Handle implicit columns in ScanBatch. Change the interface of ScanBatch's constructor.

1) Ensure that either the implicit column list is empty, or all the readers have the same set of implicit columns.

2) Skip the implicit columns when checking whether a schema change is coming from the record reader.

3) ScanBatch accepts a list instead of an iterator, since we may need to go through the implicit column list multiple times and verify that the two lists have the same size.
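
A sketch of the constructor-time verification, with Map standing in for a reader's implicit-column name/value pairs:

    import java.util.List;
    import java.util.Map;

    class ImplicitColumnCheckSketch {
      static void verify(List<?> readers, List<Map<String, String>> implicitColumns) {
        if (implicitColumns.isEmpty()) {
          return;                                // no implicit columns at all
        }
        // A list (not an iterator) allows both this size check and later
        // re-traversal when building the output container.
        if (implicitColumns.size() != readers.size()) {
          throw new IllegalArgumentException("one implicit-column entry per reader expected");
        }
        Map<String, String> first = implicitColumns.get(0);
        for (Map<String, String> cols : implicitColumns) {
          if (!cols.keySet().equals(first.keySet())) {
            throw new IllegalArgumentException("readers must share the same implicit columns");
          }
        }
      }
    }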

Address ScanBatch code review comments. Add more unit tests.

Share a code path in ProjectBatch between the normal setupNewSchema() and handleNullInput().

- Move SimpleRecordBatch out of TopNBatch to make it shareable across different places.

- Add a unit test verifying the schema for star-column queries against multilevel tables.

Unit test framework changes:

- Fix memory leak in unit test framework.

- Allow SchemaTestBuilder to pass in BatchSchema.

close #906

DRILL-5459: Extend physical operator test framework to test mini plans consisting of multiple operators.

This closes #823

DRILL-5378: Put more information into the schema change exception in the hash join, hash agg, streaming agg, and sort operators.

close #801

DRILL-5359: Fix ClassCastException when Drill pushes down filter on the output of flatten operator.

- Move findItemOrFlatten into DrillRelOptUtil as a static method.

- Exclude filter conditions if they contain the item/flatten operator.
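
A sketch of that exclusion test using Calcite's RexVisitorImpl; the operator-name check is an assumption about how ITEM/FLATTEN calls appear in the condition, not Drill's exact code:

    import org.apache.calcite.rex.RexCall;
    import org.apache.calcite.rex.RexNode;
    import org.apache.calcite.rex.RexVisitorImpl;

    class ItemFlattenFinder extends RexVisitorImpl<Void> {
      boolean found = false;

      ItemFlattenFinder() {
        super(true);                 // deep: visit nested operands
      }

      @Override
      public Void visitCall(RexCall call) {
        String name = call.getOperator().getName().toUpperCase();
        if (name.equals("ITEM") || name.equals("FLATTEN")) {
          found = true;
        }
        return super.visitCall(call);
      }

      // Conditions for which this returns true are not pushed past flatten.
      static boolean containsItemOrFlatten(RexNode condition) {
        ItemFlattenFinder finder = new ItemFlattenFinder();
        condition.accept(finder);
        return finder.found;
      }
    }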

close apache/drill#786

Update version to 1.11.0-SNAPSHOT

[maven-release-plugin] prepare release drill-1.10.0

Bump the max size of the jdbc-all jar to accommodate the increased jar file size due to new code.

DRILL-5159: Drill's ProjectMergeRule should operate on RelNodes with the same convention trait.

close apache/drill#705

DRILL-1950: Parquet rowgroup level filter pushdown in query planning time.

Implement Parquet rowgroup level filter pushdown. The filter pushdown is performed in the Drill physical planning phase.

Only a local filter, which refers to columns in a single table, is qualified for filter pushdown. A filter may be qualified if it is a simple comparison filter, or a compound "and/or" filter consisting of simple comparison filters. Data types allowed in comparison filters are int, bigint, float, double, date, timestamp, and time. Comparison operators are =, !=, <, <=, >, >=. Operands have to be a column of one of the above data types, an explicit or implicit cast function, or a constant expression.
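
A sketch of the "is this filter qualified for pushdown?" test over a tiny expression model (not Drill's LogicalExpression tree); operator and type names follow the description above:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class FilterQualificationSketch {
      static final Set<String> COMPARISONS =
          new HashSet<>(Arrays.asList("=", "!=", "<", "<=", ">", ">="));
      static final Set<String> TYPES = new HashSet<>(Arrays.asList(
          "INT", "BIGINT", "FLOAT", "DOUBLE", "DATE", "TIMESTAMP", "TIME"));

      interface Expr {}
      static class Column implements Expr {
        final String type;
        Column(String type) { this.type = type; }
      }
      static class Constant implements Expr {}
      static class Call implements Expr {
        final String op;
        final List<Expr> args;
        Call(String op, List<Expr> args) { this.op = op; this.args = args; }
      }

      // A filter qualifies if it is a simple comparison, or an and/or tree
      // whose leaves are all qualified comparisons.
      static boolean qualified(Expr e) {
        if (!(e instanceof Call)) {
          return false;
        }
        Call c = (Call) e;
        if (c.op.equals("and") || c.op.equals("or")) {
          return c.args.stream().allMatch(FilterQualificationSketch::qualified);
        }
        return COMPARISONS.contains(c.op)
            && c.args.stream().allMatch(FilterQualificationSketch::operandOk);
      }

      // Operand: a typed column, a cast around a valid operand, or a constant.
      static boolean operandOk(Expr e) {
        if (e instanceof Constant) {
          return true;
        }
        if (e instanceof Column) {
          return TYPES.contains(((Column) e).type);
        }
        Call c = (Call) e;
        return c.op.equals("cast") && c.args.size() == 1 && operandOk(c.args.get(0));
      }
    }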

This closes #637

DRILL-1950: Update parquet metadata cache format to include both min/max and additional column type information.

Parquet metadata cache format changes:

1. Include both min/max in ColumnMetaData if column statistics are available.

2. Include precision/scale/repetitionLevel/definitionLevel in ColumnTypeMetaData (precision/scale/definitionLevel are for future use).
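
A sketch of the extended cache entries as plain Java value classes; the field names are illustrative, and the real cache is JSON serialized from Drill's Metadata classes:

    // Per-column type information, shared across row groups.
    class ColumnTypeMetadataSketch {
      String[] name;          // column path
      String primitiveType;   // Parquet primitive type
      String originalType;    // Parquet original (logical) type
      int precision;          // for future use
      int scale;              // for future use
      int repetitionLevel;
      int definitionLevel;    // for future use
    }

    // Per-row-group column statistics.
    class ColumnMetadataSketch {
      String[] name;
      Object minValue;        // only present when statistics are available
      Object maxValue;
      Long nulls;
    }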

DRILL-4911: Avoid plan serialization in SimpleParallelizer when debug logging is not enabled.
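
The change applies the standard slf4j guard: serialize the plan only when debug output would actually be emitted. A minimal sketch, where serializeToJson() is a hypothetical stand-in for the expensive serialization call:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class PlanLoggingSketch {
      private static final Logger logger = LoggerFactory.getLogger(PlanLoggingSketch.class);

      void logPlan(Object fragment) {
        // Guard the expensive serialization so it only runs when debug
        // logging is enabled.
        if (logger.isDebugEnabled()) {
          logger.debug("Fragment plan: {}", serializeToJson(fragment));
        }
      }

      // Hypothetical stand-in for the costly plan-serialization call.
      String serializeToJson(Object fragment) {
        return fragment.toString();
      }
    }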

DRILL-4967: Add template_name to source code generated using Freemarker templates.

close apache/drill#629

Update version to 1.9.0-SNAPSHOT.

[maven-release-plugin] prepare release drill-1.8.0

DRILL-4825: Fix incorrect results caused by partition pruning when the same table is queried multiple times with different filters in the query.

1) Introduce DirPrunedEnumerableTableScan, which takes the file selection as part of its digest.

2) When directory-based pruning happens, create an instance of DirPrunedEnumerableTableScan.
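
A sketch of why the file selection must be part of the digest: Calcite deduplicates RelNodes by digest, so two scans of the same table with different pruned directories must not compare equal. The class below is a stand-in, not the actual DirPrunedEnumerableTableScan:

    class DirPrunedScanSketch {
      final String tableName;
      final String selectionDigest;   // e.g. the list of surviving files/directories

      DirPrunedScanSketch(String tableName, String selectionDigest) {
        this.tableName = tableName;
        this.selectionDigest = selectionDigest;
      }

      // Including selectionDigest keeps differently-pruned scans of the same
      // table distinct in the planner's memo.
      String digest() {
        return "DirPrunedEnumerableTableScan(table=" + tableName
            + ", selection=" + selectionDigest + ")";
      }
    }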

Add Jinfeng's GPG key.

DRILL-4715: Fix IOBE in concurrent query planning by applying the fix from CALCITE-1009.

The fix is done in CALCITE-1009; thanks to huntersjm (huntersjm@163.com) for the analysis.

DRILL-4768: Fix leaking Hive metastore connections in Drill's Hive metastore client calls.

- Do not call reconnect() if the connection is still alive and the error is caused by either an UnknownTableException or an access error.

- Call close() explicitly before reconnect() and check whether client.close() hits an exception.

- Make DrillHiveMetaStoreClient closeable.

close apache/drill#543

DRILL-4715: Fix Java compilation error in run-time generated code when a query has a large number of expressions.

Refactor unit tests around drillbit context initialization and pass in the option manager.

close apache/drill#521

DRILL-4707: Fix memory leak or incorrect query results when two column names are identical ignoring case.

The fix is mainly in CALCITE-528.

Close apache/drill#515

DRILL-4592: Explain plan statement should show plan in WebUI

DRILL-4531: Add a Drill customized rule for pushing filter past aggregate

DRILL-4474: Ensure that ConvertCountToDirectScan does not push through a project when the nullable input of count is not a RexInputRef. This closes #416

DRILL-4589: Reduce planning time for file system partition pruning by reducing filter evaluation overhead

DRILL-4392: Fix CTAS partitioning to remove an unnecessary internal field in generated Parquet files.

DRILL-4387: GroupScan or ScanBatchCreator should not use the star column for a skipAll query.

The skipAll query should be handled in the RecordReader.

DRILL-4363: Row-count-based pruning for Parquet tables used in a LIMIT n query.

Modify two existing unit test cases:

1) TestPartitionFilter.testMainQueryFalseCondition(): row-count pruning is applied after a false condition is transformed into LIMIT 0.

2) TestLimitWithExchanges.testPushLimitPastUnionExchange(): modify the test case to use a JSON source so that it does not interact with PushLimitIntoScanRule.
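
A sketch of the row-count pruning idea: keep leading row groups until their cumulative row count covers the limit. A plain long list stands in for Drill's row-group metadata:

    import java.util.List;

    class LimitPruningSketch {
      // Number of leading row groups needed to satisfy "LIMIT n".
      static int rowGroupsNeeded(List<Long> rowGroupRowCounts, long limit) {
        long covered = 0;
        int needed = 0;
        for (long rows : rowGroupRowCounts) {
          if (covered >= limit) {
            break;
          }
          covered += rows;
          needed++;
        }
        return Math.max(needed, 1);   // keep at least one row group for schema
      }
    }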

DRILL-4339: Reverse the function signature change made to AbstractRecordReader.setColumns() in DRILL-4279.

DRILL-4279: Improve performance for skipAll query against Text/JSON/Parquet table.

DRILL-2517: Move directory-based partition pruning to Calcite logical planning phase.

1) Make the directory-based pruning rule work in both the Calcite logical and Drill logical planning phases.

2) Only apply directory-based pruning in the logical phase when there is no metadata cache.

3) Make the FileSelection constructor public, since FileSelection.create() would modify selectionRoot.