Clone Tools
  • last updated 16 mins ago
Constraints: committers
Constraints: files
Constraints: dates
DRILL-7502: Invalid codegen for typeof() with UNION

Also fixes DRILL-6362: typeof() reports NULL for primitive

columns with a NULL value.

typeof() is meant to return "NULL" if a UNION has a NULL

value, but the column type when known, such as for non-UNION


Also fixes DRILL-7499: sqltypeof() function with an array returns

"ARRAY", not type. This was due to treating REPEATED like LIST.

Handling of the Union vector in code gen is problematic

with about three special cases. Existing code handled two

of the cases. This change handles the third case.

Figuring out the change required poking around quite a bit

of unclear code. Added comments and restructuring to make

that code a bit more clear.

The fix modified code gen for the Union Holder. It can now

"go back in time" to add the union reader at the point we

need it.

closes #1945

  1. … 54 more files in changeset.
DRILL-7503: Refactor the project operator

Breaks the big "setup" function into its own class, and

separates out physical vector setup from logical projection

planning. No functional change; just rearranging existing


closes #1944

    • -0
    • +141
    • -284
    • +242
    • -0
    • +626
  1. … 4 more files in changeset.
DRILL-7487: Removes the unused OUT_OF_MEMORY iterator status

See JIRA ticket for full explanation.

closes #1930

  1. … 42 more files in changeset.
DRILL-7456: Batch count fixes for 12 operators

Enables batch validation for 12 additional operators:

* MergingRecordBatch

* OrderedPartitionRecordBatch

* RangePartitionRecordBatch

* TraceRecordBatch

* UnionAllRecordBatch

* UnorderedReceiverBatch

* UnpivotMapsRecordBatch

* WindowFrameRecordBatch

* TopNBatch

* HashJoinBatch

* ExternalSortBatch

* WriterRecordBatch

Fixes issues found with those checks so that this set of

operators passes all checks.

Includes code cleanup in many files touched during this


closes #1906

  1. … 45 more files in changeset.
DRILL-7450: Improve performance for ANALYZE command

- Implement two-phase aggregation for the lowest metadata aggregate to optimize performance

- Allow using complex functions with hash aggregate

- Use hash aggregation for PHASE_1of2 for ANALYZE to reduce memory usage and avoid sorting non-aggregated data

- Add sort above hash aggregation to fix correctness of merge exchange and stream aggregate

closes #1907

  1. … 59 more files in changeset.
DRILL-7436: Fix record count, vector structure issues in several operators

Adds additional vector checks to the BatchValidator.

Enables checking for the following operators:

* FilterRecordBatch

* PartitionLimitRecordBatch

* UnnestRecordBatch

* HashAggBatch

* RemovingRecordBatch

Fixes vector count issues for each of these.

Fixes empty-batch (record count = 0) handling in several of the

above operators. Added a method to VectorContainer to correctly

create an empty batch. (An empty batch, counter-intuitively,

needs vectors allocated to hold the 0 value in the first

position of each offset vector.)

Disables verbose logging for MongoDB tests. Details are written to

the log rather than the console.

Disables two invalid Mongo tests. See DRILL-7428.

Adjusts the expression tree materializer to not add the LATE type

to Union vectors. (See DRILL-7435.)

Ensures that Union vectors contain valid vectors for each subtype.

The present fix is a work-around, see DRILL-7434 for a better

long-term fix.

Cleans up code formatting and other minor issues in each file touched

during the fixes in this PR.

  1. … 35 more files in changeset.
DRILL-7424: Project operator fails to set the container row count

Enabled the "batch validator" for the Project operator. Ran tests.

Exceptions occurred because, in some paths, the Project operator

fails to set the container row count.

Fixes the project operator. Cleans up formatting issues in files

touched during the investigation. Cleaned up batch-related issues

in Project.

  1. … 8 more files in changeset.
DRILL-7155: Create a standard logging message for batch sizes generated by individual operators. This is needed for QA verification of the Batch Size feature DRILL-6238. closes #1716

  1. … 10 more files in changeset.
DRILL-6952: Host compliant text reader on the row set framework

The result set loader allows controlling batch sizes. The new scan framework

built on top of that framework handles projection, implicit columns, null

columns and more. This commit converts the "new" ("compliant") text reader

to use the new framework. Options select the use of the V2 ("new") or V3

(row-set based) versions. Unit tests demonstrate V3 functionality.

closes #1683

  1. … 58 more files in changeset.
DRILL-6773: The renamed schema with aliases is not shown for queries on empty directories

closes #1492

  1. … 17 more files in changeset.
DRILL-6768: Improve to_date, to_time and to_timestamp and corresponding cast functions to handle empty string when option is enabled closes #1494

  1. … 24 more files in changeset.
DRILL-6724: Dump operator context to logs when error occurs during query execution

closes #1455

  1. … 101 more files in changeset.
DRILL-6422: Replace guava imports with shaded ones

  1. … 979 more files in changeset.
DRILL-6688 Data batches for Project operator exceed the maximum specified (#1442)

This change separates the metadata-width and data-width of a variable-width column such that the data-width is used in all intermediate calculations and the meta-data width is added finally when the column's width is accumulated into the total width.

  1. … 1 more file in changeset.
DRILL-6709: Extended the batch stats utility to other operators

closes #1444

  1. … 14 more files in changeset.
DRILL-6594: Data batches for Project operator are not being split properly and exceed the maximum specified

This change fixes the incorrect accounting in the case where a columns is being projected more than once

closes #1375

DRILL-6461: Added basic data correctness tests for hash agg, and improved operator unit testing framework.

git closes #1344

  1. … 36 more files in changeset.
DRILL-6418: Handle Schema change in Unnest And Lateral for unnest field / non-unnest field

Note: Changed Lateral to handle non-empty right batch with OK_NEW_SCHEMA

closes #1271

  1. … 10 more files in changeset.
DRILL-6320: Fixed license headers.

closes #1207

  1. … 2064 more files in changeset.
DRILL-6327: Update unary operators to handle IterOutcome.EMIT Note: Handles for Non-Blocking Unary operators (like Filter/Project/etc) with EMIT Iter.Outcome

closes #1240

  1. … 16 more files in changeset.
DRILL-6340 Output Batch Control in Project using the RecordBatchSizer

Changes required to implement Output Batch Sizing in Project using the RecordBatchSizer.

closes #1302

    • -0
    • +46
    • -0
    • +147
    • -0
    • +278
    • -0
    • +37
    • -0
    • +310
  1. … 49 more files in changeset.
DRILL-5730: Mock testing improvements and interface improvements

closes #1045

  1. … 221 more files in changeset.
DRILL-6118: Handle item star columns during project / filter push down and directory pruning

1. Added DrillFilterItemStarReWriterRule to re-write item star fields to regular field references.

2. Refactored DrillPushProjectIntoScanRule to handle item star fields, factored out helper classes and methods from PreUitl.class.

3. Fixed issue with dynamic star usage (after Calcite upgrade old usage of star was still present, replaced WILDCARD -> DYNAMIC_STAR for clarity).

4. Added unit tests to check project / filter push down and directory pruning with item star.

  1. … 26 more files in changeset.
DRILL-6049: Misc. hygiene and code cleanup changes

close apache/drill#1085

  1. … 123 more files in changeset.
DRILL-5783, DRILL-5841, DRILL-5894: Rationalize test temp directories

This change includes:


- A unit test is created for the priority queue in the TopN operator.

- The code generation classes passed around a completely unused function registry reference in some places so it is removed.

- The priority queue had unused parameters for some of its methods so it is removed.


- Created standardized temp directory classes DirTestWatcher, SubDirTestWatcher, and BaseDirTestWatcher. And updated all unit tests to use them.


- Removed the dfs_test storage plugin for tests and replaced it with the already existing dfs storage plugin.


- General code cleanup.

- Removed unnecessary use of String.format in the tests.

This closes #984

  1. … 365 more files in changeset.
DRILL-4735: ConvertCountToDirectScan rule enhancements

1. ConvertCountToDirectScan rule will be applicable for 2 or more COUNT aggregates.

To achieve this DynamicPojoRecordReader was added which accepts any number of columns,

on the contrary with PojoRecordReader which depends on class fields.

AbstractPojoRecordReader class was added to factor out common logic for these two readers.

2. ConvertCountToDirectScan will distinguish between missing, directory and implicit columns.

For missing columns count will be set 0, for implicit to the total records count

since implicit columns are based on files and there is no data without a file.

If directory column will be encountered, rule won't be applied.

CountsCollector class was introduced to encapsulate counts collection logic.

3. MetadataDirectGroupScan class was introduced to indicate to the user when metadata was used

during calculation and for which files it was applied.

DRILL-4735: Changes after code review.

close #900

  1. … 22 more files in changeset.
DRILL-4264: Allow field names to include dots

  1. … 98 more files in changeset.
DRILL-5546: Handle schema change exception failure caused by empty input or empty batch.

1. Modify ScanBatch's logic when it iterates list of RecordReader.

1) Skip RecordReader if it returns 0 row && present same schema. A new schema (by calling Mutator.isNewSchema() ) means either a new top level field is added, or a field in a nested field is added, or an existing field type is changed.

2) Implicit columns are presumed to have constant schema, and are added to outgoing container before any regular column is added in.

3) ScanBatch will return NONE directly (called as "fast NONE"), if all its RecordReaders haver empty input and thus are skipped, in stead of returing OK_NEW_SCHEMA first.

2. Modify IteratorValidatorBatchIterator to allow

1) fast NONE ( before seeing a OK_NEW_SCHEMA)

2) batch with empty list of columns.

2. Modify JsonRecordReader when it get 0 row. Do not insert a nullable-int column for 0 row input. Together with ScanBatch, Drill will skip empty json files.

3. Modify binary operators such as join, union to handle fast none for either one side or both sides. Abstract the logic in AbstractBinaryRecordBatch, except for MergeJoin as its implementation is quite different from others.

4. Fix and refactor union all operator.

1) Correct union operator hanndling 0 input rows. Previously, it will ignore inputs with 0 row and put nullable-int into output schema, which causes various of schema change issue in down-stream operator. The new behavior is to take schema with 0 into account

in determining the output schema, in the same way with > 0 input rows. By doing that, we ensure Union operator will not behave like a schema-lossy operator.

2) Add a UnionInputIterator to simplify the logic to iterate the left/right inputs, removing significant chunk of duplicate codes in previous implementation.

The new union all operator reduces the code size into half, comparing the old one.

5. Introduce UntypedNullVector to handle convertFromJson() function, when the input batch contains 0 row.

Problem: The function convertFromJSon() is different from other regular functions in that it only knows the output schema after evaluation is performed. When input has 0 row, Drill essentially does not have

a way to know the output type, and previously will assume Map type. That works under the assumption other operators like Union would ignore batch with 0 row, which is no longer

the case in the current implementation.

Solution: Use MinorType.NULL at the output type for convertFromJSON() when input contains 0 row. The new UntypedNullVector is used to represent a column with MinorType.NULL.

6. HBaseGroupScan convert star column into list of row_key and column family. HBaseRecordReader should reject column star since it expectes star has been converted somewhere else.

In HBase a column family always has map type, and a non-rowkey column always has nullable varbinary type, this ensures that HBaseRecordReader across different HBase regions will have the same top level schema, even if the region is

empty or prune all the rows due to filter pushdown optimization. In other words, we will not see different top level schema from different HBaseRecordReader for the same table.

However, such change will not be able to handle hard schema change : c1 exists in cf1 in one region, but not in another region. Further work is required to handle hard schema change.

7. Modify scan cost estimation when the query involves * column. This is to remove the planning randomness since previously two different operators could have same cost.

8. Add a new flag 'outputProj' to Project operator, to indicate if Project is for the query's final output. Such Project is added by TopProjectVisitor, to handle fast NONE when all the inputs to the query are empty

and are skipped.

1) column star is replaced with empty list

2) regular column reference is replaced with nullable-int column

3) An expression will go through ExpressionTreeMaterializer, and use the type of materialized expression as the output type

4) Return an OK_NEW_SCHEMA with the schema using the above logic, then return a NONE to down-stream operator.

9. Add unit test to test operators handling empty input.

10. Add unit test to test query when inputs are all empty.

DRILL-5546: Revise code based on review comments.

Handle implicit column in scan batch. Change interface in ScanBatch's constructor.

1) Ensure either the implicit column list is empty, or all the reader has the same set of implicit columns.

2) We could skip the implicit columns when check if there is a schema change coming from record reader.

3) ScanBatch accept a list in stead of iterator, since we may need go through the implicit column list multiple times, and verify the size of two lists are same.

ScanBatch code review comments. Add more unit tests.

Share code path in ProjectBatch to handle normal setupNewSchema() and handleNullInput().

- Move SimpleRecordBatch out of TopNBatch to make it sharable across different places.

- Add Unit test verify schema for star column query against multilevel tables.

Unit test framework change

- Fix memory leak in unit test framework.

- Allow SchemaTestBuilder to pass in BatchSchema.

close #906

  1. … 67 more files in changeset.
DRILL-5399: Fix race condition in DrillComplexWriterFuncHolder

  1. … 10 more files in changeset.
DRILL-5419: Calculate return string length for literals & some string functions

1. Revisited calculation logic for string literals and some string functions

(cast, upper, lower, initcap, reverse, concat, concat operator, rpad, lpad, case statement,

coalesce, first_value, last_value, lag, lead).

Synchronized return type length calculation logic between limit 0 and regular queries.

2. Deprecated width and changed it to precision for string types in MajorType.

3. Revisited FunctionScope and splitted it into FunctionScope and ReturnType.

FunctionScope will indicate only function usage in term of number of in / out rows, (n -> 1, 1 -> 1, 1->n).

New annotation in UDFs ReturnType will indicate which return type strategy should be used.

4. Changed MAX_VARCHAR_LENGTH from 65536 to 65535.

5. Updated calculation of precision and display size for INTERVALYEAR & INTERVALDAY.

6. Refactored part of function code-gen logic (ValueReference, WorkspaceReference, FunctionAttributes, DrillFuncHolder).

This closes #819

  1. … 78 more files in changeset.