DRILL-7326: Support repeated lists for CTAS parquet format

closes #1844

  1. … 4 more files in changeset.
DRILL-7350: Move RowSet related classes from test folder

    • -2 +2 ./PartitionLimit/TestPartitionLimitBatch.java
    • -2 +2 ./agg/TestStreamingAggEmitOutcome.java
    • -1 +1 ./filter/TestFilterBatchEmitOutcome.java
    • -2 +2 ./join/TestLateralJoinCorrectness.java
    • -2 +2 ./join/TestLateralJoinCorrectnessBatchProcessing.java
    • -1 +1 ./limit/TestLimitBatchEmitOutcome.java
  1. … 278 more files in changeset.
DRILL-7337: Add vararg UDFs support

  1. … 37 more files in changeset.
DRILL-7314: Use TupleMetadata instead of concrete implementation

1. Add ser / de implementation for TupleMetadata interface based on types.

2. Replace TupleSchema usage where possible.

3. Move patcher classes into commons.

4. Upgrade some dependencies and general refactoring.

  1. … 40 more files in changeset.
DRILL-7310: Move schema-related classes from exec module to be able to use them in metastore module

closes #1816

    • -2 +0 ./agg/TestStreamingAggEmitOutcome.java
    • -13 +19 ./protocol/TestOperatorRecordBatch.java
    • -11 +19 ./scan/TestFileScanFramework.java
    • -11 +13 ./scan/TestScanOrchestratorEarlySchema.java
    • -3 +2 ./scan/TestScanOrchestratorLateSchema.java
    • -9 +8 ./scan/TestScanOrchestratorMetadata.java
    • -23 +23 ./scan/project/TestRowBatchMerger.java
    • -2 +5 ./svremover/AbstractGenericCopierTest.java
    • -21 +21 ./validate/TestBatchValidator.java
  1. … 88 more files in changeset.
DRILL-7306: Disable schema-only batch for new scan framework

The EVF framework is set up to return a "fast schema" empty batch with only schema as its first batch because, when the code was written, it seemed that's how we wanted operators to work. However, DRILL-7305 notes that many operators cannot handle empty batches. Since the empty-batch bugs show that Drill does not, in fact, provide a "fast schema" batch, this ticket asks to disable the feature in the new scan framework. The feature is disabled with a config option; it can be re-enabled if ever it is needed.

SQL differentiates between two subtle cases, and both are supported by this change:

1. Empty results: the query found a schema, but no rows are returned. If no reader returns any rows, but at least one reader provides a schema, then the scan returns an empty batch with the schema.

2. Null results: the query found no schema or rows. No schema is returned. If no reader returns rows or schema, then the scan returns no batch: it instead immediately returns a DONE status.

For CSV, an empty file with headers returns the null result set (because we don't know the schema). An empty CSV file without headers returns an empty result set because we do know the schema: it will always be the columns array.

Old tests validate the original schema-batch mode; new tests were added to validate the no-schema-batch mode.
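The two result cases reduce to a simple decision rule. A minimal standalone sketch of that rule follows; it is illustrative only, and the class and method names are hypothetical, not Drill's actual scan API:

```java
import java.util.List;

// Hypothetical sketch: decide what a scan returns when no reader
// produced any rows. Not Drill's actual API.
public class ScanResultSketch {
    public enum Outcome { EMPTY_WITH_SCHEMA, DONE }

    // One flag per reader: did it discover a schema?
    public static Outcome resolve(List<Boolean> readersProvidedSchema) {
        for (boolean hasSchema : readersProvidedSchema) {
            if (hasSchema) {
                // Empty results: schema but no rows -> empty batch with schema.
                return Outcome.EMPTY_WITH_SCHEMA;
            }
        }
        // Null results: no schema, no rows -> no batch, just DONE.
        return Outcome.DONE;
    }

    public static void main(String[] args) {
        System.out.println(resolve(List.of(true, false)));  // EMPTY_WITH_SCHEMA
        System.out.println(resolve(List.of(false, false))); // DONE
    }
}
```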

    • -12 +12 ./protocol/TestOperatorRecordBatch.java
    • -12 +18 ./scan/BaseScanOperatorExecTest.java
    • -3 +49 ./scan/TestScanOperExecEarlySchema.java
    • -0 +44 ./scan/TestScanOperExecLateSchema.java
    • -0 +21 ./scan/TestScanOperExecSmoothing.java
    • -13 +14 ./scan/TestScanOrchestratorEarlySchema.java
    • -2 +3 ./scan/TestScanOrchestratorLateSchema.java
    • -5 +6 ./scan/TestScanOrchestratorMetadata.java
    • -1 +2 ./scan/project/TestSchemaSmoothing.java
  1. … 31 more files in changeset.
DRILL-7302: Bump Apache Avro to 1.9.0

Apache Avro 1.9.0 brings a lot of new features:

* Deprecate Joda-Time in favor of Java 8 JSR-310 and set it as the default
* Remove support for Hadoop 1.x
* Move from Jackson 1.x to 2.9
* Add ZStandard Codec
* Lots of updates on the dependencies to fix CVEs
* Remove Jackson classes from public API
* Apache Avro is built by default with Java 8
* Apache Avro is compiled and tested with Java 11 to guarantee compatibility
* Apache Avro MapReduce is compiled and tested with Hadoop 3
* Apache Avro is now leaner; multiple dependencies were removed: guava, paranamer, commons-codec, and commons-logging
* and many, many more!

close apache/drill#1812

    • -1 +1 ./join/TestLateralJoinCorrectness.java
    • -1 +1 ./join/TestLateralJoinCorrectnessBatchProcessing.java
  1. … 3 more files in changeset.
DRILL-6951: Merge row set based mock data source

The mock data source is used in several tests to generate a large volume of sample data, such as when testing spilling. The mock data source also lets us try new plugin features in a very simple context. During the development of the row set framework, the mock data source was converted to use the new framework to verify functionality. This commit upgrades the mock data source with that work.

The work changes none of the functionality. It does, however, improve memory usage. Batches are limited, by default, to 10 MB in size. The row set framework minimizes internal fragmentation in the largest vector. (Previously, internal fragmentation averaged 25% but could be as high as 50%.)

As it turns out, the hash aggregate tests depended on the internal fragmentation: without it, the hash agg no longer spilled for the same row count. Adjusted the generated row counts to recreate a data volume that caused spilling.

One test in particular always failed due to assertions in the hash agg code. These seem to be true bugs and are described in DRILL-7301. After multiple failed attempts to get the test to work, it was disabled until DRILL-7301 is fixed.

Added a new unit test to sanity check the mock data source. (No test already existed for this functionality except as verified via other unit tests.)
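The fragmentation figures quoted above (average 25%, worst case near 50%) follow from power-of-two buffer allocation: the unused tail of the largest vector's buffer is internal fragmentation. A small illustrative sketch of the arithmetic, with hypothetical names, not Drill code:

```java
// Hypothetical illustration: value vectors allocate power-of-two buffers,
// so the unused tail of the buffer is internal fragmentation.
public class FragmentationSketch {
    public static int nextPowerOfTwo(int n) {
        int p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    // Fraction of the allocated buffer left unused by the written bytes.
    public static double fragmentation(int bytesWritten) {
        int allocated = nextPowerOfTwo(bytesWritten);
        return (allocated - bytesWritten) / (double) allocated;
    }

    public static void main(String[] args) {
        // Just past a power of two is the worst case: nearly half wasted.
        System.out.println(fragmentation(1025)); // ~0.4995 (1023/2048)
        // A full buffer wastes nothing.
        System.out.println(fragmentation(1024)); // 0.0
    }
}
```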

    • -7 +16 ./scan/BaseScanOperatorExecTest.java
  1. … 20 more files in changeset.
DRILL-7156: Support empty Parquet files creation

closes #1836

    • -22 +101 ./writer/TestParquetWriterEmptyFiles.java
  1. … 1 more file in changeset.
DRILL-7293: Convert the regex ("log") plugin to use EVF

Converts the log format plugin (which uses a regex for parsing) to work

with the Extended Vector Format.

User-visible behavior changes added to the README file.

* Use the plugin config object to pass config to the Easy framework.

* Use the EVF scan mechanism in place of the legacy "ScanBatch"

mechanism.

* Minor code and README cleanup.

* Replace ad-hoc type conversion with builtin conversions

The provided schema support in the enhanced vector framework (EVF)

provides automatic conversions from VARCHAR to most types. The log

format plugin was created before EVF was available and provided its own

conversion mechanism. This commit removes the ad-hoc conversion code and

instead uses the log plugin config schema information to create an

"output schema" just as if it was provided by the provided schema

framework.

Because we need the schema in the plugin (rather than the reader), moved

the schema-parsing code out of the reader into the plugin. The plugin

creates two schemas: an "output schema" with the desired output types,

and a "reader schema" that uses only VARCHAR. This causes the EVF to

perform conversions.

* Enable provided schema support

Allows the user to specify types using either the format config (as

previously) or a provided schema. If a schema is provided, it will match

columns using names specified in the format config.

The provided schema can specify both types and modes (nullable or not

null.)

If a schema is provided, then the types specified in the plugin config

are ignored. No attempt is made to merge schemas.

If a schema is provided, but a column is omitted from the schema, the

type defaults to VARCHAR.

* Added ability to specify regex in table properties

Allows the user to specify the regex, and the column schema,

using a CREATE SCHEMA statement. The README file provides the details.

Unit tests demonstrate and verify the functionality.

* Used the custom error context provided by EVF to enhance the log format

reader error messages.

* Added user name to default EVF error context

* Added support for table functions

Can set the regex and maxErrors fields, but not the schema.

Schema will default to "field_0", "field_1", etc. of type

VARCHAR.

* Added unit tests to verify the functionality.

* Added a check, and a test, for a regex with no groups.

* Added columns array support

When the log regex plugin is given no schema, it previously

created a list of columns "field_0", "field_1", etc. After

this change, the plugin instead follows the pattern set by

the text plugin: it will place all fields into the columns

array. (The two special fields are still separate.)

A few adjustments were necessary to the columns array

framework to allow use of the special columns along with

the `columns` column.

Modified unit tests and the README to reflect this change.

The change should be backward compatible because few users

are likely relying on the dummy field names.

Added unit tests to verify that schema-based table

functions work. A test shows that, due to the unforunate

config property name "schema", users of this plugin cannot

combine a config table function with the schema attribute

in the way promised in DRILL-6965.
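The columns-array behavior described above — capture groups from the user's regex become entries in a single array of VARCHAR-like values — can be sketched with plain `java.util.regex`. This is a standalone illustrative sketch with hypothetical names, not the plugin's actual code; it also mirrors the new check for a regex with no groups:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: match each log line against a user-supplied regex
// and place the capture groups into a `columns`-style array of strings.
public class LogRegexSketch {
    public static List<String> parseLine(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        if (m.groupCount() == 0) {
            // Mirrors the new check: a regex with no groups yields no columns.
            throw new IllegalArgumentException("Regex must define at least one group");
        }
        List<String> columns = new ArrayList<>();
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                columns.add(m.group(i)); // every field as an untyped string
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(\\d{4}-\\d{2}-\\d{2}) (\\w+) (.*)");
        System.out.println(parseLine(p, "2019-07-20 INFO started"));
    }
}
```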

    • -12 +12 ./scan/TestColumnsArrayFramework.java
    • -16 +16 ./scan/TestColumnsArrayParser.java
  1. … 16 more files in changeset.
DRILL-7278: Refactor result set loader projection mechanism

Drill 1.16 added an enhanced scan framework based on the row set mechanisms, and a "provisioned schema" feature built on top of that framework. Conversion of the log reader plugin to use the framework identified additional features we wish to add, such as marking a column as "special" (not expanded in a wildcard query).

This work identified that the code added for provisioned schemas in Drill 1.16 worked, but is a bit overly complex, making it hard to add the desired new feature.

This patch refactors the "reader" projection code:

* Create a "projection set" mechanism that the reader can query to ask: "the caller just added a column; should it be projected or not?"

* Unifies the type conversion mechanism added as part of provisioned schemas.

* Added the "special column" property for both "reader" and "provided" schemas.

* Verified that provisioned schemas work with maps (at least on the scan framework side).

* Replaced the previous "schema transformer" mechanism with a new "type conversion" mechanism that unifies type conversion, provided schemas and an optional custom type conversion mechanism.

* Column writers can report if they are projected. Moved this query from metadata to the column writer itself.

* Extended and clarified documentation of the feature.

* Revised and/or added unit tests.

closes #1797
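The "projection set" question the reader asks can be sketched as a small interface with two obvious implementations, one for wildcard queries and one for an explicit projection list. All names here are hypothetical, illustrating the idea rather than Drill's actual classes:

```java
import java.util.Set;

// Hypothetical sketch of the "projection set" idea: the reader asks, for each
// column it is about to create, whether the query actually projects it.
public class ProjectionSetSketch {
    public interface ProjectionSet {
        boolean isProjected(String colName);
    }

    // Wildcard query: every column the reader offers is projected.
    public static final ProjectionSet PROJECT_ALL = col -> true;

    // Explicit projection list: only the named columns are projected.
    public static ProjectionSet explicit(Set<String> cols) {
        return cols::contains;
    }

    public static void main(String[] args) {
        ProjectionSet p = explicit(Set.of("a", "b"));
        System.out.println(p.isProjected("a")); // true
        System.out.println(p.isProjected("c")); // false: use a dummy writer
    }
}
```

An unprojected column would get a no-op ("dummy") writer, so the reader's code path is the same either way.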

    • -10 +65 ./scan/TestScanOperExecOuputSchema.java
    • -4 +4 ./scan/TestScanOrchestratorEarlySchema.java
    • -57 +151 ./scan/project/TestScanLevelProjection.java
    • -0 +625 ./scan/project/projSet/TestProjectionSet.java
  1. … 68 more files in changeset.
DRILL-7181: Improve V3 text reader (row set) error messages

Adds an error context to the User Error mechanism. The context allows information to be passed through an intermediate layer and applied when errors are raised in lower-level code, without the need for that low-level code to know the details of the error context information.

Modifies the scan framework and V3 text plugin to use the framework to improve error messages.

Refines how the `columns` column can be used with the text reader. If headers are used, then `columns` is just another column. An error is raised, however, if `columns[x]` is used when headers are enabled.

Added another builder abstraction where a constructor argument list became too long.

Added the drill file system and split to the file schema negotiator to simplify reader construction.

Added additional unit tests to fully define the `columns` column behavior.
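The error-context pattern — low-level code raises the error, the context supplied by an intermediate layer decorates it — can be sketched as a callback. This is a speculative illustration with hypothetical names, not the actual User Error API:

```java
import java.util.function.Consumer;

// Hypothetical sketch: an "error context" lets an intermediate layer attach
// information (file name, reader, etc.) to errors raised by low-level code
// that knows nothing about that context.
public class ErrorContextSketch {
    public static class UserError extends RuntimeException {
        public final StringBuilder context = new StringBuilder();
        public UserError(String message) { super(message); }
    }

    public interface ErrorContext extends Consumer<UserError> { }

    // Low-level code builds the error and applies whatever context it was given.
    public static UserError dataReadError(String message, ErrorContext ctx) {
        UserError e = new UserError(message);
        ctx.accept(e);
        return e;
    }

    public static void main(String[] args) {
        ErrorContext fileCtx = e -> e.context.append("file: /logs/app.log");
        UserError e = dataReadError("Malformed line", fileCtx);
        System.out.println(e.getMessage() + " (" + e.context + ")");
    }
}
```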

    • -20 +20 ./scan/TestColumnsArrayParser.java
    • -17 +18 ./scan/TestFileMetadataColumnParser.java
    • -6 +6 ./scan/TestFileMetadataProjection.java
    • -14 +14 ./scan/project/TestReaderLevelProjection.java
    • -15 +15 ./scan/project/TestSchemaSmoothing.java
  1. … 26 more files in changeset.
DRILL-7183: TPCDS query 10, 35, 69 take longer with sf 1000 when Statistics are disabled. This commit reverts the changes done for DRILL-6997.

  1. … 5 more files in changeset.
DRILL-7143: Support default value for empty columns

Modifies the prior work to add default values for columns. The prior work added defaults when the entire column is missing from a reader (the old Nullable Int column). The Row Set mechanism now will also "fill empty" slots with the default value.

Added default support for the column writers. The writers automatically obtain the default value from the column schema. The default can also be set explicitly on the column writer.

Updated the null column mechanism to use this feature rather than the ad-hoc implementation in the prior commit.

Semantics changed a bit. Only Required columns take a default. The default value is ignored for nullable columns since nullable columns already have a default: NULL.

Other changes:

* Updated the CSV-with-schema tests to illustrate the new behavior.

* Made multiple fixes for Boolean and Decimal columns and added unit tests.

* Upgraded FreeMarker to version 2.3.28 to allow use of the continue statement.

* Reimplemented the Bit column reader and writer to use the BitVector directly since this vector is rather special.

* Added get/set Boolean methods for column accessors

* Moved the BooleanType class to the common package

* Added more CSV unit tests to explore decimal types, booleans, and defaults

* Add special handling for blank fields in from-string conversions

* Added options to the conversion factory to specify blank-handling behavior. CSV uses a mapping of blanks to null (nullable) or default value (non-nullable)

closes #1726
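The "fill empty" behavior described above can be sketched with a toy column writer: when the caller skips rows, the intervening slots are backfilled with the column's default. A minimal sketch with hypothetical names, not the actual accessor classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "fill empty": when a writer skips rows, the
// intervening slots are backfilled with the column's default value.
public class FillEmptySketch {
    public static class IntColumnWriter {
        private final List<Integer> values = new ArrayList<>();
        private final int defaultValue; // would come from the column schema

        public IntColumnWriter(int defaultValue) { this.defaultValue = defaultValue; }

        // Write a value at a row, filling any skipped rows with the default.
        public void setInt(int rowIndex, int value) {
            while (values.size() < rowIndex) {
                values.add(defaultValue); // fill-empty with the schema default
            }
            values.add(value);
        }

        public List<Integer> values() { return values; }
    }

    public static void main(String[] args) {
        IntColumnWriter w = new IntColumnWriter(0);
        w.setInt(0, 10);
        w.setInt(3, 40); // rows 1 and 2 were never written
        System.out.println(w.values()); // [10, 0, 0, 40]
    }
}
```

A nullable column would instead fill the skipped slots with NULL, which is why the explicit default applies only to Required columns.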

    • -12 +17 ./scan/project/TestNullColumnLoader.java
  1. … 72 more files in changeset.
DRILL-7096: Develop vector for canonical Map<K,V>

- Added new type DICT;

- Created value vectors for the type for single and repeated modes;

- Implemented corresponding FieldReaders and FieldWriters;

- Made changes in EvaluationVisitor to be able to read values from the map by key;

- Made changes to DrillParquetGroupConverter to be able to read Parquet's MAP type;

- Added an option `store.parquet.reader.enable_map_support` to disable reading MAP type as DICT from Parquet files;

- Updated AvroRecordReader to use new DICT type for Avro's MAP;

- Added support of the new type to ParquetRecordWriter.

  1. … 108 more files in changeset.
DRILL-7011: Support schema in scan framework

* Adds schema support to the row set-based scan framework and to the "V3" text reader based on that framework.

* Adding the schema made clear that passing options as a long list of constructor arguments was not sustainable. Refactored code to use a builder pattern instead.

* Added support for default values in the "null column loader", which required adding a "setValue" method to the column accessors.

* Added unit tests for all new or changed functionality. See TestCsvWithSchema for the overall test of the entire integrated mechanism.

* Added tests for explicit projection with schema

* Better handling of date/time in column accessors

* Converted recent column metadata work from Java 8 date/time to Joda.

* Added more CSV-with-schema unit tests

* Removed the ID fields from "resolved columns", used "instanceof" instead.

* Added wildcard projection with an output schema. Handles both "lenient" and "strict" schemas.

* Tagged projection columns with their output schema, when available.

* Scan projection added modes for wildcard with an output schema. The reader projection added support for merging reader and output schemas.

* Includes refactoring of scan operator tests (the test file grew too large.)

* Renamed some classes to avoid confusing reader schemas with output schemas.

* Added unit tests for the new functionality.

* Added "lenient" wildcard with schema test for CSV

* Added more type conversions: string-to-bit, many-to-string

* Fixed bug in column writer for VarDecimal

* Added missing unit tests, and fixed bugs, in Bit column reader/writer

* Cleaned up a number of unneeded "SuppressWarnings"

closes #1711

    • -7 +0 ./mergereceiver/TestMergingReceiver.java
    • -0 +180 ./scan/BaseScanOperatorExecTest.java
    • -32 +64 ./scan/TestColumnsArrayFramework.java
    • -14 +19 ./scan/TestColumnsArrayParser.java
    • -106 +91 ./scan/TestFileMetadataColumnParser.java
    • -24 +26 ./scan/TestFileMetadataProjection.java
    • -76 +90 ./scan/TestFileScanFramework.java
    • -0 +398 ./scan/TestScanOperExecBasics.java
    • -0 +260 ./scan/TestScanOperExecEarlySchema.java
    • -0 +402 ./scan/TestScanOperExecLateSchema.java
    • -0 +253 ./scan/TestScanOperExecOuputSchema.java
  1. … 210 more files in changeset.
DRILL-7106: Fix IntelliJ warning for FieldSchemaNegotiator

closes #1698

  1. … 4 more files in changeset.
DRILL-2326: Fix scalar replacement for the case when a static method which does not return a value is called

- Fix check for return function value to handle the case when a created object is returned without assigning it to a local variable

closes #1687

  1. … 3 more files in changeset.
DRILL-7074: Scan framework fixes and enhancements

Roll-up of fixes and enhancements that emerged from the effort to host the CSV reader on the new framework.

closes #1676

    • -2 +17 ./scan/TestColumnsArrayFramework.java
    • -153 +0 ./scan/TestConstantColumnLoader.java
    • -8 +162 ./scan/TestFileMetadataColumnParser.java
    • -1 +3 ./scan/TestFileMetadataProjection.java
    • -281 +0 ./scan/TestNullColumnLoader.java
    • -223 +0 ./scan/TestScanLevelProjection.java
    • -170 +85 ./scan/TestScanOrchestratorEarlySchema.java
    • -1 +5 ./scan/TestScanOrchestratorLateSchema.java
    • -49 +28 ./scan/TestScanOrchestratorMetadata.java
  1. … 27 more files in changeset.
DRILL-7063: Separate metadata cache file into summary, file metadata

closes #1723

    • -2 +2 ./writer/TestCorruptParquetDateCorrection.java
  1. … 18 more files in changeset.
DRILL-7068: Support memory adjustment framework for resource management with Queues

closes #1677

    • -6 +8 ./partitionsender/TestPartitionSender.java
  1. … 36 more files in changeset.
DRILL-6952: Host compliant text reader on the row set framework

The result set loader allows controlling batch sizes. The new scan framework built on top of that framework handles projection, implicit columns, null columns and more. This commit converts the "new" ("compliant") text reader to use the new framework. Options select the use of the V2 ("new") or V3 (row-set based) versions. Unit tests demonstrate V3 functionality.

closes #1683

    • -0 +3 ./protocol/TestOperatorRecordBatch.java
    • -1 +28 ./scan/TestFileMetadataColumnParser.java
    • -0 +9 ./scan/TestFileMetadataProjection.java
    • -0 +15 ./scan/TestScanOrchestratorMetadata.java
    • -0 +12 ./scan/project/TestSchemaSmoothing.java
  1. … 50 more files in changeset.
DRILL-7200: Update Calcite to 1.19.0 / 1.20.0

    • -1 +1 ./limit/TestLateLimit0Optimization.java
  1. … 46 more files in changeset.
DRILL-6855: Do not load schema if there is an IOException

closes #1626

  1. … 2 more files in changeset.
DRILL-5603: Replace String file paths with Hadoop Path

- replaced all String path representations with org.apache.hadoop.fs.Path
- added PathSerDe.Se JSON serializer
- refactored DFSPartitionLocation code by leveraging existing listPartitionValues() functionality

closes #1657

  1. … 83 more files in changeset.
DRILL-7019: Add check for redundant imports

close apache/drill#1629

    • -3 +0 ./scan/project/TestNullColumnLoader.java
    • -5 +0 ./scan/project/TestRowBatchMerger.java
    • -7 +0 ./scan/project/TestSchemaSmoothing.java
  1. … 19 more files in changeset.
DRILL-7016: Wrong query result with RuntimeFilter enabled when order of join and filter condition is swapped

close apache/drill#1628

    • -0 +178 ./join/TestHashJoinJPPDCorrectness.java
  1. … 1 more file in changeset.
DRILL-7024: Refactor ColumnWriter to simplify type-conversion shim

DRILL-7006 added a type conversion "shim" within the row set framework. Basically, we insert a "shim" column writer that takes data in one form (String, say), and does reader-specific conversions to a target format (INT, say).

The code works fine, but the shim class ends up needing to override a bunch of methods which it then passes along to the base writer. This PR refactors the code so that the conversion shim is simpler.

closes #1633
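The shim idea — a wrapper writer that accepts data in the reader's format, converts it, and delegates to the target-type writer — can be sketched in a few lines. All names below are hypothetical stand-ins for the actual column writer classes:

```java
// Hypothetical sketch of the type-conversion "shim": a wrapper writer accepts
// data in the source format (String) and converts it before delegating to the
// underlying target-type writer (INT here).
public class ConversionShimSketch {
    public interface ScalarWriter {
        void setInt(int v);
        default void setString(String v) {
            throw new UnsupportedOperationException("setString");
        }
    }

    public static class IntWriter implements ScalarWriter {
        public int last;
        public void setInt(int v) { last = v; }
    }

    // The shim overrides only the conversion entry point; other calls
    // pass straight through to the base writer.
    public static class StringToIntShim implements ScalarWriter {
        private final ScalarWriter base;
        public StringToIntShim(ScalarWriter base) { this.base = base; }
        public void setInt(int v) { base.setInt(v); }
        public void setString(String v) { base.setInt(Integer.parseInt(v)); }
    }

    public static void main(String[] args) {
        IntWriter base = new IntWriter();
        ScalarWriter shim = new StringToIntShim(base);
        shim.setString("42");
        System.out.println(base.last); // 42
    }
}
```

The refactoring's complaint is visible even at this scale: the shim must re-declare pass-through methods like `setInt` for every operation in the writer interface, which is what the simplified design avoids.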

    • -0 +3 ./scan/project/TestNullColumnLoader.java
    • -0 +5 ./scan/project/TestRowBatchMerger.java
    • -0 +3 ./scan/project/TestSchemaSmoothing.java
  1. … 61 more files in changeset.
DRILL-7007: Use verify method in row set tests

Many of the early RowSet-based tests used the pattern:

    new RowSetComparison(expected)
        .verifyAndClearAll(result);

Revise this to use the simplified form:

    RowSetUtilities.verify(expected, result);

The original form is retained when tests use additional functionality, such as the ability to perform multiple verifications on the same expected batch.

closes #1624

    • -4 +2 ./xsort/managed/SortTestUtilities.java
  1. … 8 more files in changeset.
DRILL-6950: Row set-based scan framework

Adds the "plumbing" that connects the scan operator to the result set loader and the scan projection framework. See the various package-info.java files for the technical details. Also adds a large number of tests.

This PR does not yet introduce an actual scan operator: that will follow in subsequent PRs.

closes #1618

    • -0 +190 ./scan/TestColumnsArray.java
    • -0 +209 ./scan/TestColumnsArrayFramework.java
    • -0 +259 ./scan/TestColumnsArrayParser.java
    • -0 +153 ./scan/TestConstantColumnLoader.java
    • -0 +262 ./scan/TestFileMetadataColumnParser.java
    • -0 +328 ./scan/TestFileMetadataProjection.java
    • -0 +546 ./scan/TestFileScanFramework.java
    • -0 +281 ./scan/TestNullColumnLoader.java
    • -0 +459 ./scan/TestRowBatchMerger.java
    • -0 +122 ./scan/TestScanBatchWriters.java
    • -0 +223 ./scan/TestScanLevelProjection.java
    • -0 +1600 ./scan/TestScanOperatorExec.java
    • -0 +1137 ./scan/TestScanOrchestratorEarlySchema.java
    • -0 +155 ./scan/TestScanOrchestratorLateSchema.java
  1. … 47 more files in changeset.