DRILL-6321: Lateral Join and Unnest - rules, options, logical plan supports

Included changes:

* Add planner.enable_unnest_lateral option. Default value set to false.

* Enable FilterCorrectRule

* Add support to logical plan

* Fix rebase errors for DRILL-6321 commits

  1. … 18 more files in changeset.
DRILL-6027:

- Added fallback option for HashJoin.

- No copy of incoming for single partition, and avoid HT resize.

- Fix memory leak when cancelling while spill file is read

- get correct schema when probe side is empty

- Re-create the HashJoinProbe

  1. … 40 more files in changeset.
DRILL-6422: Update guava to 23.0 and shade it

- Fix compilation errors for new version of Guava.

- Remove usage of deprecated API

- Shade guava and add dependencies to the shaded version

- Ban unshaded package

- Introduce drill-shaded module and move guava-shaded under it

- Add methods to convert shaded guava lists to the unshaded ones

- Add instruction for publishing artifacts to the Apache repository

  1. … 81 more files in changeset.
DRILL-6320: Fixed license headers.

closes #1207

  1. … 2052 more files in changeset.
DRILL-6321: Lateral Join and Unnest - initial implementation for parser and planning

  1. … 25 more files in changeset.
DRILL-6284: Add operator metrics for batch sizing for flatten

  1. … 6 more files in changeset.
DRILL-6381: (Part 3) Planner and Execution implementation to support Secondary Indexes

  1. Index Planning Rules and Plan generators

    - DbScanToIndexScanRule: Top level physical planning rule that drives index planning for several relational algebra patterns.

    - DbScanSortRemovalRule: Physical planning rule for index planning for Sort-based operations.

    - Plan Generators: Covering, Non-Covering and Intersect physical plan generators.

    - Support planning with functional indexes such as CAST functions.

    - Enhance PlannerSettings with several configuration options for indexes.

  2. Index Selection and Statistics

    - An IndexSelector that supports cost-based index selection of covering and non-covering indexes using statistics and collation properties.

    - Costing of index intersection for comparison with single-index plans.

  3. Planning and execution operators

    - Support RangePartitioning physical operator during query planning and execution.

    - Support RowKeyJoin physical operator during query planning and execution.

    - HashTable and HashJoin changes to support RowKeyJoin and Index Intersection.

    - Enhance Materializer to keep track of subscan association with a particular rowkey join.

  4. Index Planning utilities

    - Utility classes to perform RexNode analysis, including conversion to and from SchemaPath.

    - Utility class to analyze filter condition and an input collation to determine output collation.

    - Helper classes to maintain index contexts for logical and physical planning phase.

    - IndexPlanUtils utility class for various helper methods.

  5. Miscellaneous

    - Separate physical rel for DirectScan.

    - Modify LimitExchangeTranspose rule to handle SingleMergeExchange.

    - MD-3880: Return correct status from RangePartitionRecordBatch setupNewSchema

Co-authored-by: Aman Sinha <asinha@maprtech.com>

Co-authored-by: chunhui-shi <cshi@maprtech.com>

Co-authored-by: Gautam Parai <gparai@maprtech.com>

Co-authored-by: Padma Penumarthy <ppenumar97@yahoo.com>

Co-authored-by: Hanumath Rao Maduri <hmaduri@maprtech.com>

Conflicts:

exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashJoinPOP.java

exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java

exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashPartition.java

exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java

exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java

exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java

exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillRelOptUtil.java

exec/java-exec/src/main/java/org/apache/drill/exec/planner/fragment/Materializer.java

exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillMergeProjectRule.java

exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillOptiq.java

exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushProjectIntoScanRule.java

exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillScanRel.java

exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/BroadcastExchangePrel.java

exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/DrillDistributionTrait.java

exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/HashJoinPrel.java

exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/PrelUtil.java

exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java

exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetPushDownFilter.java

exec/java-exec/src/main/resources/drill-module.conf

logical/src/main/java/org/apache/drill/common/logical/StoragePluginConfig.java

Resolve merge conflicts and compilation issues.

    • -0
    • +60
    ./RangePartitionExchange.java
    • -0
    • +96
    ./RowKeyJoinPOP.java
  1. … 90 more files in changeset.
DRILL-6324: Unnest - Add tests with real Unnest and real Lateral.

Code cleanup, more comments, fix license headers, and more logging.

Refactor Unnest to allow setting the incoming batch after construction

fix compilation after rebase

This closes #1223

  1. … 13 more files in changeset.
DRILL-6323: Lateral Join - Refactor Join PopConfigs

  1. … 3 more files in changeset.
DRILL-6115: SingleMergeExchange is not scaling up when many minor fragments are allocated for a query.

DRILL-6115: Refactoring the existing code.

close apache/drill#1110

    • -0
    • +52
    ./OrderedMuxExchange.java
  1. … 15 more files in changeset.
DRILL-6027:

- Added memory calculator

- Added unit tests and docs.

- Fixed IOB caused by output vector allocation.

- Don't double count records that were spilled in HashJoin

  1. … 55 more files in changeset.
DRILL-6381: (Part 1) Secondary Index framework

  1. Secondary Index planning interfaces and abstract classes like DBGroupScan, DbSubScan, IndexDescriptor etc.

  2. Statistics and Cost model interfaces/classes: PluginCost, Statistics, StatisticsPayload, AbstractIndexStatistics

  3. ScanBatch and RecordReader to support repeatable scan

  4. Secondary Index execution related interfaces: RangePartitionSender, RowKeyJoin, PartitionFunction

  5. MD-3979: Query using cast index plan fails with NPE

Co-authored-by: Aman Sinha <asinha@maprtech.com>

Co-authored-by: chunhui-shi <cshi@maprtech.com>

Co-authored-by: Gautam Parai <gparai@maprtech.com>

Co-authored-by: Padma Penumarthy <ppenumar97@yahoo.com>

Co-authored-by: Hanumath Rao Maduri <hmaduri@maprtech.com>

Conflicts:

exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java

exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillRelOptUtil.java

exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillTable.java

protocol/src/main/java/org/apache/drill/exec/proto/UserBitShared.java

protocol/src/main/java/org/apache/drill/exec/proto/beans/CoreOperatorType.java

protocol/src/main/protobuf/UserBitShared.proto

    • -0
    • +73
    ./RangePartitionSender.java
  1. … 43 more files in changeset.
DRILL-6324: Unnest - Initial Implementation

- Based on Flatten

- Implement unnestRecords in UnnestTemplate

- Remove unnecessary code inherited from Flatten/Project. Add schema change handling.

- Fix build failure after rebase since RecordBatchSizer used by UNNEST was relocated to a different package

- Add unit tests

- Handling of input row splitting across multiple batches. Also do not kill incoming in killIncoming.

- Schema change generated by Unnest

  1. … 10 more files in changeset.
DRILL-6323: Lateral Join - Initial implementation

    • -0
    • +55
    ./LateralJoinPOP.java
  1. … 10 more files in changeset.
DRILL-6882: Handle the cases where RowKeyJoin's left pipeline is called multiple times.

close apache/drill#1562

  1. … 4 more files in changeset.
DRILL-5830: Resolve regressions to MapR DB from DRILL-5546

- Back out HBase changes

- Code cleanup

- Test utilities

- Fix for DRILL-5829

closes #968

  1. … 22 more files in changeset.
DRILL-5457: Spill implementation for Hash Aggregate

closes #822

  1. … 34 more files in changeset.
DRILL-5546: Handle schema change exception failure caused by empty input or empty batch.

1. Modify ScanBatch's logic when it iterates list of RecordReader.

1) Skip a RecordReader if it returns 0 rows and presents the same schema. A new schema (detected by calling Mutator.isNewSchema()) means that a new top-level field is added, a field in a nested field is added, or an existing field type is changed.

2) Implicit columns are presumed to have constant schema, and are added to outgoing container before any regular column is added in.

3) ScanBatch will return NONE directly (called "fast NONE") if all its RecordReaders have empty input and thus are skipped, instead of returning OK_NEW_SCHEMA first.

2. Modify IteratorValidatorBatchIterator to allow

1) fast NONE (before seeing an OK_NEW_SCHEMA)

2) batch with empty list of columns.

2. Modify JsonRecordReader when it gets 0 rows. Do not insert a nullable-int column for 0-row input. Together with ScanBatch, Drill will skip empty JSON files.

3. Modify binary operators such as join, union to handle fast none for either one side or both sides. Abstract the logic in AbstractBinaryRecordBatch, except for MergeJoin as its implementation is quite different from others.

4. Fix and refactor union all operator.

1) Correct union operator handling of 0 input rows. Previously, it ignored inputs with 0 rows and put nullable-int into the output schema, which caused various schema change issues in downstream operators. The new behavior takes the schema of 0-row inputs into account when determining the output schema, in the same way as for inputs with > 0 rows. By doing that, we ensure the Union operator does not behave like a schema-lossy operator.

2) Add a UnionInputIterator to simplify the logic to iterate the left/right inputs, removing a significant chunk of duplicate code from the previous implementation.

The new union all operator cuts the code size in half compared with the old one.

5. Introduce UntypedNullVector to handle convertFromJson() function, when the input batch contains 0 rows.

Problem: The function convertFromJSon() is different from other regular functions in that it only knows the output schema after evaluation is performed. When the input has 0 rows, Drill essentially has no way to know the output type, and previously assumed Map type. That worked under the assumption that other operators like Union would ignore batches with 0 rows, which is no longer the case in the current implementation.

Solution: Use MinorType.NULL as the output type for convertFromJSON() when the input contains 0 rows. The new UntypedNullVector is used to represent a column with MinorType.NULL.

6. HBaseGroupScan converts the star column into a list of row_key and column families. HBaseRecordReader should reject the star column, since it expects the star to have been converted elsewhere.

In HBase a column family always has map type, and a non-rowkey column always has nullable varbinary type. This ensures that HBaseRecordReaders across different HBase regions have the same top-level schema, even if a region is empty or prunes all the rows due to filter pushdown optimization. In other words, we will not see different top-level schemas from different HBaseRecordReaders for the same table.

However, this change cannot handle a hard schema change: c1 exists in cf1 in one region, but not in another region. Further work is required to handle hard schema changes.

7. Modify scan cost estimation when the query involves * column. This is to remove the planning randomness since previously two different operators could have same cost.

8. Add a new flag 'outputProj' to the Project operator, to indicate whether the Project is for the query's final output. Such a Project is added by TopProjectVisitor, to handle fast NONE when all the inputs to the query are empty and are skipped.

1) column star is replaced with empty list

2) regular column reference is replaced with nullable-int column

3) An expression will go through ExpressionTreeMaterializer, and use the type of materialized expression as the output type

4) Return an OK_NEW_SCHEMA with the schema using the above logic, then return a NONE to down-stream operator.

9. Add unit test to test operators handling empty input.

10. Add unit test to test query when inputs are all empty.

DRILL-5546: Revise code based on review comments.

Handle implicit column in scan batch. Change interface in ScanBatch's constructor.

1) Ensure either the implicit column list is empty, or all the readers have the same set of implicit columns.

2) We can skip the implicit columns when checking whether there is a schema change coming from a record reader.

3) ScanBatch accepts a list instead of an iterator, since we may need to go through the implicit column list multiple times, and verify that the sizes of the two lists are the same.

ScanBatch code review comments. Add more unit tests.

Share code path in ProjectBatch to handle normal setupNewSchema() and handleNullInput().

- Move SimpleRecordBatch out of TopNBatch to make it sharable across different places.

- Add Unit test verify schema for star column query against multilevel tables.

Unit test framework change

- Fix memory leak in unit test framework.

- Allow SchemaTestBuilder to pass in BatchSchema.

close #906

  1. … 67 more files in changeset.
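The "fast NONE" rule that DRILL-5546 describes for ScanBatch can be pictured with a small Java sketch. All names here (FastNoneSketch, Reader, Outcome) are hypothetical stand-ins, not Drill's actual ScanBatch or RecordReader API; only the skip rule and the outcome names are taken from the commit message.

```java
import java.util.List;

/** Simplified model of the reader-skipping logic; not Drill's real classes. */
final class FastNoneSketch {
    enum Outcome { OK_NEW_SCHEMA, OK, NONE }

    /** A reader reduced to what the sketch needs: its row count and whether it changed the schema. */
    static final class Reader {
        final int rows;
        final boolean newSchema;

        Reader(int rows, boolean newSchema) {
            this.rows = rows;
            this.newSchema = newSchema;
        }
    }

    private final List<Reader> readers;
    private int nextReader = 0;
    private boolean schemaSent = false;

    FastNoneSketch(List<Reader> readers) {
        this.readers = readers;
    }

    /** Returns the outcome for the next reader that is not skipped, or NONE once all readers are consumed. */
    Outcome next() {
        while (nextReader < readers.size()) {
            Reader r = readers.get(nextReader++);
            if (r.rows == 0 && !r.newSchema) {
                continue; // empty reader presenting the same schema: skip it
            }
            if (r.newSchema || !schemaSent) {
                schemaSent = true;
                return Outcome.OK_NEW_SCHEMA;
            }
            return Outcome.OK;
        }
        // If every reader was skipped, this is the "fast NONE": no OK_NEW_SCHEMA was sent first.
        return Outcome.NONE;
    }
}
```

With two empty readers the first call already returns NONE, which is the case that downstream operators and IteratorValidatorBatchIterator had to learn to accept.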
DRILL-5325: Unit tests for the managed sort

Uses the sub-operator test framework (DRILL-5318), including the test

row set abstraction (DRILL-5323) to enable unit testing of the

“managed” external sort. This PR allows early review of the code, but

cannot be pulled until the dependencies (mentioned above) are pulled.

Refactors the external sort code into small chunks that can be unit

tested, then “wraps” that code in tests for all interesting data types,

record batch sizes, and so on.

Refactors some of the operator definitions to more easily allow

programmatic setup in the unit tests.

Fixes a number of bugs discovered by the unit tests. The biggest

changes were in the new code: the code that computes spilling and

merging based on memory levels.

Otherwise, although GitHub will show many files change, most of the

changes are simply moving blocks of code around to create smaller units

that can be tested independently.

Includes a refactoring of the code that does spilling, along with a

complete set of low-level unit tests.

Excludes long-running sort tests.

Defines a test category for long-running tests.

First attempt to provide a way to run such tests from Maven.

closes #808

  1. … 50 more files in changeset.
DRILL-5375: Nested loop join: return correct result for left join

closes #794

  1. … 17 more files in changeset.
DRILL-5080: Memory-managed version of external sort

Please see JIRA entry for reasons for revision, design spec and list of

changes.

This PR covers the changes to the external sort itself. Tests for this

operator require the test framework in DRILL-5126 and the mock data

source in DRILL-5152. Tests for this operator will be issued as a

separate PR once those two dependencies are committed.

Until then, the new operator is disabled by default. It can be enabled

using drill.sort.external.disable_managed: false.

The operator now spills before receiving a new batch. Revised memory calcs and

merge calcs to make them a bit clearer and provide more margin of error

for the power-of-two allocations used when allocating vectors.

We have two external sort implementations, but only one operator code

for both. They can use only one Metrics enum between them. When adding

new metrics to the new version, didn’t add matching metrics to the old

one. This fixes that issue. (The issue will go away once the old one is

retired.)

Revised memory calculations to reflect limit of 16 MB per vector.

Current revision limits to 16 MB per output batch to be safe. Next

revision will enforce per-vector limits to allow the overall batch to

be larger when possible.

Also simplified the merge-time calculations.

Original code provided only crude methods to learn the size of a record

batch. Adds a "RecordBatchSizer" to provide detailed analysis so the

sort can know the amount of memory used to buffer a batch, the number

of rows, and the expected row width once the rows are copied to a

spill file or the output.

Moved generic spill classes to a separate package.

Created parameters for spill batch size and merge batch size. Separated

these values in code. Deprecated the min, max spill parameters as they

no longer add much value. Minor code rearranging.

Bug fix

Fixes a corner case of merging spilled files in a low-memory condition.

Fixes from code review

close apache/drill#717

  1. … 21 more files in changeset.
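The memory notes in DRILL-5080 mention power-of-two vector allocations and a 16 MB cap per output batch. The arithmetic behind both can be sketched as follows. The class and method names are hypothetical and this is not the sort operator's actual code; the row width is assumed to come from something like the RecordBatchSizer estimate mentioned above.

```java
/** Illustrative arithmetic only; names are assumptions, not Drill's API. */
final class SortBatchMathSketch {
    /** 16 MB cap per output batch, as stated in the commit message. */
    static final long MAX_BATCH_BYTES = 16L * 1024 * 1024;

    /** Rounds a requested size up to the next power of two, as a power-of-two allocator would. */
    static long roundUpToPowerOfTwo(long bytes) {
        if (bytes <= 1) {
            return 1;
        }
        return Long.highestOneBit(bytes - 1) << 1;
    }

    /** Number of rows that fit in one output batch for an estimated row width, never less than one row. */
    static int rowsPerBatch(int rowWidthBytes) {
        if (rowWidthBytes <= 0) {
            throw new IllegalArgumentException("row width must be positive");
        }
        return (int) Math.max(1, MAX_BATCH_BYTES / rowWidthBytes);
    }
}
```

A request for 5 bytes rounds up to 8, so a buffer can occupy almost twice the memory its data needs. That gap is why the revised calculations leave extra margin.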
DRILL-5116: Enable generated code debugging in each Drill operator

DRILL-5052 added the ability to debug generated code. The reviewer suggested

permitting the technique to be used for all Drill operators. This PR provides

the required fixes. Most were small changes, others dealt with the rather

clever way that the existing byte-code merge converted static nested classes

to non-static inner classes, with the way that constructors were inserted

at the byte-code level and so on. See the JIRA for the details.

This code passed the unit tests twice: once with the traditional byte-code

manipulations, a second time using "plain-old Java" code compilation.

Plain-old Java is turned off by default, but can be turned on for all

operators with a single config change: see the JIRA for info. Consider

the plain-old Java option to be experimental: very handy for debugging,

perhaps not quite tested enough for production use.

close apache/drill#716

  1. … 62 more files in changeset.
DRILL-4446: Support mandatory work assignment to endpoint requirements of operators

  1. … 23 more files in changeset.
DRILL-4445: Standardize the Physical and Logical plan nodes to use Lists instead of arrays for their inputs

Remove some extra translation logic used to move between the

two representations.

TODO - look back at the Join logical node: it has two JsonCreator annotations,

but only one will be used. Not sure if the behavior of which is chosen

is considered documented behavior; we should just fix it on our end.

  1. … 26 more files in changeset.
DRILL-4260: Adding support for some custom window frames

this includes the following JIRAs:

DRILL-4261: Add support for RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

DRILL-4262: add support for ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

DRILL-4263: add support for RANGE BETWEEN CURRENT ROW AND CURRENT ROW

this closes #340

  1. … 22 more files in changeset.
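The three window frames listed under DRILL-4260 differ only in which rows each output row aggregates over. The Java sketch below applies standard SQL frame semantics to SUM over a single partition that is already sorted by its ORDER BY key. It is an illustration with hypothetical names, not Drill's window operator.

```java
import java.util.Arrays;

/** Standard SQL frame semantics for SUM over one sorted partition; not Drill code. */
final class WindowFrameSketch {
    /** RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: every row sees the whole partition. */
    static long[] sumWholePartition(long[] values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        long[] out = new long[values.length];
        Arrays.fill(out, total);
        return out;
    }

    /** ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: running total up to and including each row. */
    static long[] sumRowsToCurrent(long[] values) {
        long[] out = new long[values.length];
        long running = 0;
        for (int i = 0; i < values.length; i++) {
            running += values[i];
            out[i] = running;
        }
        return out;
    }

    /** RANGE BETWEEN CURRENT ROW AND CURRENT ROW: each row sees only its peers (rows with an equal ORDER BY key). */
    static long[] sumPeers(long[] sortedKeys, long[] values) {
        long[] out = new long[values.length];
        int start = 0;
        while (start < values.length) {
            int end = start;
            long sum = 0;
            // Extend the run while the ORDER BY key stays the same.
            while (end < values.length && sortedKeys[end] == sortedKeys[start]) {
                sum += values[end];
                end++;
            }
            for (int i = start; i < end; i++) {
                out[i] = sum;
            }
            start = end;
        }
        return out;
    }
}
```

For keys {1, 1, 2} and values {10, 20, 30}, the three methods return {60, 60, 60}, {10, 30, 60} and {30, 30, 30} respectively.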
DRILL-3012: Fix issue where remote values rel wasn't losing operatorId.

Also enhance Union rule to avoid more than 2-way inputs

  1. … 3 more files in changeset.
DRILL-2936: Use SpoolingRawBatchBuffer for HashToMergeExchange in order to avoid deadlocks

Refactored common code in UnlimitedRawBatchBuffer and SpoolingRawBatchBuffer

into BaseRawBatchBuffer

Removed reflection-based construction of RawBatchBuffer. Now choose the implementation

based on plan

Updated SpoolingRawBatchBuffer to use a separate thread for spooling

  1. … 13 more files in changeset.
DRILL-2695: Add support for large IN conditions through the use of the Values operator.

- Update JSON reader to support reading Extended JSON.

- Update JSON writer to support writing extended JSON data.

- Update JSON reader to automatically unwrap a file that includes a single top-level array (used by values).

- Update Options manager to use getOption(<Type>Validator) to directly retrieve typed value.

- Remove JSON rewinding.

- Add support for CONVERT_TO( [], 'SIMPLEJSON') to disable extended types as part of udf use.

  1. … 65 more files in changeset.
DRILL-2210 Introducing multithreading capability to PartitonerSender

  1. … 15 more files in changeset.
DRILL-133: LocalExchange planning and exec.

    • -0
    • +143
    ./AbstractDeMuxExchange.java
    • -0
    • +125
    ./AbstractMuxExchange.java
  1. … 55 more files in changeset.