DRILL-7502: Invalid codegen for typeof() with UNION

Also fixes DRILL-6362: typeof() reports NULL for primitive

columns with a NULL value.

typeof() is meant to return "NULL" if a UNION has a NULL

value, but the column type when known, such as for non-UNION

columns.
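
The intended rule can be sketched as a small function. This is only an illustration, not Drill's generated code: the class and method names are hypothetical, and it assumes that a UNION holding a value reports the type of that value.

```java
// Illustrative sketch only: hypothetical names, not Drill's codegen.
final class TypeOfSketch {
    /**
     * @param declaredType column type name, for example "INT" or "UNION"
     * @param unionSubtype for a UNION, the type of the current value, or
     *                     null when the UNION holds no value for this row
     */
    static String typeOf(String declaredType, String unionSubtype) {
        if (!"UNION".equals(declaredType)) {
            // Non-UNION column: the type is known even when the value is NULL.
            return declaredType;
        }
        // UNION column: assumed to report the value's type, or "NULL" if unset.
        return unionSubtype == null ? "NULL" : unionSubtype;
    }

    public static void main(String[] args) {
        System.out.println(typeOf("INT", null));
        System.out.println(typeOf("UNION", null));
        System.out.println(typeOf("UNION", "VARCHAR"));
    }
}
```

The first call models the DRILL-6362 case: a primitive column with a NULL value still reports its column type.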

Also fixes DRILL-7499: sqltypeof() function with an array returns

"ARRAY", not type. This was due to treating REPEATED like LIST.

Handling of the Union vector in code gen is problematic

with about three special cases. Existing code handled two

of the cases. This change handles the third case.

Figuring out the change required poking around quite a bit

of unclear code. Added comments and restructuring to make

that code a bit more clear.

The fix modified code gen for the Union Holder. It can now

"go back in time" to add the union reader at the point we

need it.

closes #1945

    • -8
    • +8
    ./vector/complex/impl/AbstractBaseReader.java
    • -0
    • +5
    ./vector/complex/impl/SingleDictReaderImpl.java
    • -11
    • +0
    ./vector/complex/reader/FieldReader.java
  1. … 52 more files in changeset.
DRILL-7456: Batch count fixes for 12 operators

Enables batch validation for 12 additional operators:

* MergingRecordBatch

* OrderedPartitionRecordBatch

* RangePartitionRecordBatch

* TraceRecordBatch

* UnionAllRecordBatch

* UnorderedReceiverBatch

* UnpivotMapsRecordBatch

* WindowFrameRecordBatch

* TopNBatch

* HashJoinBatch

* ExternalSortBatch

* WriterRecordBatch

Fixes issues found with those checks so that this set of

operators passes all checks.

Includes code cleanup in many files touched during this

work.

closes #1906

    • -1
    • +1
    ./vector/accessor/impl/VectorPrinter.java
  1. … 45 more files in changeset.
DRILL-7442: Create multi-batch row set reader

Adds a ResultSetReader that works across multiple batches

in a result set. Reuses the same row set and readers if the

schema is unchanged; creates a new set if the schema changes.

Adds a unit test for the result set reader.

Adds a "rebind" capability to the row set readers to focus

on new buffers under an existing set of vectors. Used when

a new batch arrives, if the schema is unchanged.

Extends row set classes to be aware of the BatchAccessor class

which encapsulates a container and optional selection vector,

and tracks schema changes.

Moves row set tests into the same package as the row sets.

(Row set classes were moved a while back, but the tests were

not moved.)

Renames some BatchAccessor methods.

closes #1897

    • -0
    • +7
    ./vector/accessor/ColumnReaderIndex.java
    • -0
    • +7
    ./vector/accessor/reader/AbstractTupleReader.java
    • -2
    • +12
    ./vector/accessor/reader/ArrayReaderImpl.java
    • -1
    • +23
    ./vector/accessor/reader/BaseScalarReader.java
    • -0
    • +1
    ./vector/accessor/reader/ReaderEvents.java
    • -2
    • +11
    ./vector/accessor/reader/UnionReaderImpl.java
  1. … 56 more files in changeset.
DRILL-7445: Create batch copier based on result set framework

The result set framework now provides both a reader and writer.

This PR provides a copier that copies batches using this

framework. Such a copier can:

- Copy selected records

- Copy all records, such as for an SV2 or SV4

The copier uses the result set loader to create uniformly-sized

output batches from input batches of any size. It does this

by merging or splitting input batches as needed.
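
The merge-or-split idea can be sketched with plain lists. This is a simplified illustration with hypothetical names, not the copier itself: it sizes output batches by row count only, whereas the result set loader may also apply memory limits.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: hypothetical names, row-count sizing assumed.
final class RebatchSketch {
    /** Regroups rows from input batches of any size into batches of targetSize rows. */
    static <T> List<List<T>> rebatch(List<List<T>> inputs, int targetSize) {
        if (targetSize <= 0) {
            throw new IllegalArgumentException("targetSize must be positive");
        }
        List<List<T>> outputs = new ArrayList<>();
        List<T> current = new ArrayList<>(targetSize);
        for (List<T> batch : inputs) {
            for (T row : batch) {
                current.add(row);
                if (current.size() == targetSize) {
                    // Output batch is full: emit it and start a new one.
                    outputs.add(current);
                    current = new ArrayList<>(targetSize);
                }
            }
        }
        if (!current.isEmpty()) {
            outputs.add(current); // final batch may be shorter than targetSize
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<List<Integer>> inputs =
            List.of(List.of(1, 2, 3), List.of(4), List.of(5, 6, 7, 8, 9));
        System.out.println(rebatch(inputs, 4));
    }
}
```

Small inputs are merged into one output batch and a large input is split across several, so only the last output batch can be short.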

Since the result set reader handles both SV2 and SV4s, the

copier can filter or reorder rows based on the SV associated

with the input batch.

This version assumes a single stream of input batches, and handles

any schema changes in that input by creating output batches

that track the input schema. This would be used in, say, the

selection vector remover. A different design is needed for merging

such as in the merging receiver.

Adds a "copy" method to the column writers. Copy is implemented

by doing a direct memory copy from source to destination vectors.

A unit test verifies functionality for various use cases

and data types.

closes #1899

    • -8
    • +8
    ./vector/accessor/reader/AbstractTupleReader.java
    • -0
    • +12
    ./vector/accessor/reader/BaseScalarReader.java
    • -0
    • +1
    ./vector/accessor/reader/OffsetVectorReader.java
    • -0
    • +14
    ./vector/accessor/writer/AbstractArrayWriter.java
    • -4
    • +18
    ./vector/accessor/writer/AbstractTupleWriter.java
    • -4
    • +4
    ./vector/accessor/writer/BaseVarWidthWriter.java
  1. … 25 more files in changeset.
DRILL-7441: Fix issues with fillEmpties, offset vectors

Fixes subtle issues with offset vectors and "fill empties"

logic.

Drill has an informal standard that if a batch has no rows, then

offset vectors within that batch should have zero size. Contrast

this with batches of size 1 that should have offset vectors of

size 2. Changed to enforce this rule throughout.
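
The sizing rule can be written down as a short sketch. The class and method names are hypothetical, and a plain int array stands in for an offset vector.

```java
// Illustrative sketch only: an int[] stands in for an offset vector.
final class OffsetVectorRule {
    /** Expected number of offset entries for a batch with rowCount rows. */
    static int expectedOffsetCount(int rowCount) {
        if (rowCount < 0) {
            throw new IllegalArgumentException("negative row count");
        }
        // Zero rows: zero-size offset vector. Otherwise one more entry than rows.
        return rowCount == 0 ? 0 : rowCount + 1;
    }

    /** Builds offsets for variable-width values with the given byte lengths. */
    static int[] buildOffsets(int[] valueLengths) {
        int[] offsets = new int[expectedOffsetCount(valueLengths.length)];
        for (int i = 0; i < valueLengths.length; i++) {
            // offsets[0] is 0; each later entry is the end of the previous value.
            offsets[i + 1] = offsets[i] + valueLengths[i];
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(buildOffsets(new int[0]).length);
        System.out.println(java.util.Arrays.toString(buildOffsets(new int[] {2, 0, 5})));
    }
}
```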

Nullable, repeated and variable-width vectors have "fill empties"

logic that is used in two places: when setting the value count and

when preparing to write a new value. The current logic is not

quite right for either case. Added tests and fixed the code to

properly handle each case.
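
A toy model of the two "fill empties" call sites, under stated assumptions: a growable int array stands in for the offset vector, values are written in index order, and all names are hypothetical rather than Drill's mutator API.

```java
import java.util.Arrays;

// Illustrative sketch only: hypothetical names, not Drill's vector classes.
final class FillEmptiesSketch {
    private int[] offsets = new int[16]; // offsets[i + 1] is the end of value i
    private int lastSet = -1;            // index of the last position with a valid offset

    /** Writes a value of the given byte length at index, filling any gap first. */
    void write(int index, int length) {
        if (index <= lastSet) {
            throw new IllegalArgumentException("values must be written in order");
        }
        fillEmpties(index);              // case 1: preparing to write a new value
        ensureCapacity(index + 2);
        offsets[index + 1] = offsets[index] + length;
        lastSet = index;
    }

    /** Finishes the batch and returns the offsets in use. */
    int[] setValueCount(int count) {
        if (count == 0) {
            return new int[0];           // zero-size rule for empty batches
        }
        fillEmpties(count);              // case 2: setting the value count
        return Arrays.copyOf(offsets, count + 1);
    }

    // Marks positions after lastSet and before upTo as empty by repeating the
    // previous end offset, so each skipped value has zero length.
    private void fillEmpties(int upTo) {
        ensureCapacity(upTo + 1);
        for (int i = lastSet + 1; i < upTo; i++) {
            offsets[i + 1] = offsets[i];
        }
        lastSet = Math.max(lastSet, upTo - 1);
    }

    private void ensureCapacity(int size) {
        if (size > offsets.length) {
            offsets = Arrays.copyOf(offsets, Math.max(size, offsets.length * 2));
        }
    }

    public static void main(String[] args) {
        FillEmptiesSketch writer = new FillEmptiesSketch();
        writer.write(0, 3);
        writer.write(3, 2);              // positions 1 and 2 were skipped
        System.out.println(Arrays.toString(writer.setValueCount(6)));
    }
}
```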

Revised the batch validator to enforce the rule that offset vectors have

length 0 in 0-sized batches. The result was much simpler code.

Added tools to easily print a batch, restoring some code that

was recently lost when the RowSet classes were moved.

Code cleanup in all files touched.

Added logic to "dirty" allocated buffers when testing to ensure

logic is not sensitive to the "pristine" state of new buffers.

Added logic to the column writers to enforce the zero-size-batch rule

for offset vectors. Added unit tests for this case.

Fixed the column writers to set the "lastSet" mutator value for

nullable types since other code relies on this value.

Removed the "setCount" field in nullable vectors: turns out

it is not actually used.

closes #1896

    • -0
    • +9
    ./vector/accessor/impl/VectorPrinter.java
    • -15
    • +17
    ./vector/complex/AbstractRepeatedMapVector.java
    • -15
    • +27
    ./vector/complex/EmptyValuePopulator.java
  1. … 37 more files in changeset.
DRILL-7440: Failure during loading of RepeatedCount functions

closes #1894

    • -1
    • +2
    ./expr/holders/RepeatedDictHolder.java
  1. … 3 more files in changeset.
DRILL-7439: Batch count fixes for six additional operators

Enables vector checks, and fixes batch count and vector issues for:

* StreamingAggBatch

* RuntimeFilterRecordBatch

* FlattenRecordBatch

* MergeJoinBatch

* NestedLoopJoinBatch

* LimitRecordBatch

Also fixes a zero-size batch validity issue for the CSV reader when

all files contain no data.

Includes code cleanup for files touched in this PR.

closes #1893

  1. … 20 more files in changeset.
DRILL-7436: Fix record count, vector structure issues in several operators

Adds additional vector checks to the BatchValidator.

Enables checking for the following operators:

* FilterRecordBatch

* PartitionLimitRecordBatch

* UnnestRecordBatch

* HashAggBatch

* RemovingRecordBatch

Fixes vector count issues for each of these.

Fixes empty-batch (record count = 0) handling in several of the

above operators. Added a method to VectorContainer to correctly

create an empty batch. (An empty batch, counter-intuitively,

needs vectors allocated to hold the 0 value in the first

position of each offset vector.)

Disables verbose logging for MongoDB tests. Details are written to

the log rather than the console.

Disables two invalid Mongo tests. See DRILL-7428.

Adjusts the expression tree materializer to not add the LATE type

to Union vectors. (See DRILL-7435.)

Ensures that Union vectors contain valid vectors for each subtype.

The present fix is a work-around, see DRILL-7434 for a better

long-term fix.

Cleans up code formatting and other minor issues in each file touched

during the fixes in this PR.

  1. … 35 more files in changeset.
DRILL-7414: EVF incorrectly sets buffer writer index after rollover

Enabling the vector validator on the "new" scan operator, in cases

in which overflow occurs, identified that the DrillBuf writer index

was not properly set for repeated vectors.

Enables such checking, adds unit tests, and fixes the writer index

issue.

closes #1878

  1. … 5 more files in changeset.
DRILL-7412: Minor unit test improvements

Many tests intentionally trigger errors. A debug-only log setting

sent those errors to stdout. The resulting stack dumps simply cluttered

the test output, so disabled error output to the console.

Drill can apply bounds checks to vectors. Tests run via Maven

enable bounds checking. Now, bounds checking is also enabled in

"debug mode" (when assertions are enabled, as in an IDE.)

Drill contains two test frameworks. The older BaseTestQuery was

marked as deprecated, but many tests still use it and are unlikely

to be changed soon. So, removed the deprecated marker to reduce the

number of spurious warnings.

Also includes a number of minor clean-ups.

closes #1876

    • -16
    • +13
    ./vector/complex/RepeatedValueVector.java
  1. … 14 more files in changeset.
DRILL-7359: Add support for DICT type in RowSet Framework

closes #1870

    • -0
    • +261
    ./record/metadata/DictBuilder.java
    • -3
    • +8
    ./record/metadata/DictColumnMetadata.java
    • -4
    • +32
    ./record/metadata/MetadataUtils.java
    • -0
    • +11
    ./record/metadata/RepeatedListBuilder.java
    • -0
    • +16
    ./record/metadata/SchemaBuilder.java
    • -1
    • +11
    ./record/metadata/UnionBuilder.java
    • -0
    • +98
    ./vector/accessor/AbstractKeyAccessor.java
    • -0
    • +36
    ./vector/accessor/DictReader.java
  1. … 68 more files in changeset.
DRILL-7377: Nested schemas for dynamic EVF columns

The Result Set Loader (part of EVF) allows adding columns up-front

before reading rows (so-called "early schema.") Such schemas allow

nested columns (maps with members, repeated lists with a type, etc.)

The Result Set Loader also allows adding columns dynamically

while loading data (so-called "late schema".) Previously, the code

assumed that columns would be added top-down: first the map, then

the map's contents, etc.

Charles found a need to allow adding a nested column (a repeated

list with a declared list type.)

This patch revises the code to use the same mechanism in both the

early- and late-schema cases, allowing adding nested columns at

any time.

Testing: Added a new unit test case for the repeated list late

schema with content case.

    • -5
    • +21
    ./vector/accessor/writer/AbstractTupleWriter.java
  1. … 5 more files in changeset.
DRILL-7254: Read Hive union w/o nulls

    • -2
    • +4
    ./vector/complex/impl/PromotableWriter.java
  1. … 20 more files in changeset.
DRILL-7373: Fix problems involving reading from DICT type

- Fixed FieldIdUtil to resolve reading from DICT for some complex cases;

- optimized reading from DICT given a key by passing an appropriate Object type to DictReader#find(...) and DictReader#read(...) methods when schema is known (e.g. when reading from Hive tables) instead of generating it on the fly based on int or String path and key type;

- fixed error when accessing a value by a nonexistent key in an Avro table.

    • -4
    • +21
    ./vector/complex/impl/SingleDictReaderImpl.java
  1. … 10 more files in changeset.
DRILL-7252: Read Hive map using Dict<K,V> vector

  1. … 16 more files in changeset.
DRILL-7168: Implement ALTER SCHEMA ADD / REMOVE commands

    • -0
    • +5
    ./record/metadata/AbstractPropertied.java
  1. … 13 more files in changeset.
DRILL-7350: Move RowSet related classes from test folder

    • -2
    • +2
    ./vector/accessor/ColumnReaderIndex.java
    • -1
    • +1
    ./vector/accessor/writer/AbstractTupleWriter.java
    • -1
    • +1
    ./vector/accessor/writer/WriterEvents.java
  1. … 289 more files in changeset.
DRILL-7341: Vector reAlloc may fail after exchange

closes #1838

  1. … 3 more files in changeset.
DRILL-7314: Use TupleMetadata instead of concrete implementation

1. Add ser / de implementation for TupleMetadata interface based on types.

2. Replace TupleSchema usage where possible.

3. Move patcher classes into commons.

4. Upgrade some dependencies and general refactoring.

    • -1
    • +50
    ./record/metadata/TupleMetadata.java
  1. … 39 more files in changeset.
DRILL-7315: Revise precision and scale order in the method arguments

    • -2
    • +2
    ./vector/complex/impl/MapOrListWriterImpl.java
  1. … 27 more files in changeset.
DRILL-7313: Use Hive schema for MaprDB native reader when field was empty

- Added all_text_mode option for hive maprDB Json

- Improved logic to convert Hive's schema into Drill's

- Added unit tests for schema conversion

    • -0
    • +30
    ./record/metadata/RepeatedListBuilder.java
  1. … 27 more files in changeset.
DRILL-7310: Move schema-related classes from exec module to be able to use them in metastore module

closes #1816

    • -0
    • +329
    ./record/metadata/AbstractColumnMetadata.java
    • -0
    • +66
    ./record/metadata/ColumnBuilder.java
    • -0
    • +179
    ./record/metadata/MapBuilder.java
    • -0
    • +139
    ./record/metadata/MapColumnMetadata.java
    • -0
    • +198
    ./record/metadata/MetadataUtils.java
    • -0
    • +301
    ./record/metadata/PrimitiveColumnMetadata.java
    • -0
    • +105
    ./record/metadata/RepeatedListBuilder.java
    • -0
    • +108
    ./record/metadata/RepeatedListColumnMetadata.java
    • -0
    • +200
    ./record/metadata/SchemaBuilder.java
    • -0
    • +27
    ./record/metadata/SchemaContainer.java
    • -0
    • +150
    ./record/metadata/TupleBuilder.java
    • -0
    • +230
    ./record/metadata/TupleSchema.java
    • -0
    • +110
    ./record/metadata/UnionBuilder.java
    • -0
    • +136
    ./record/metadata/VariantColumnMetadata.java
    • -0
    • +210
    ./record/metadata/VariantSchema.java
  1. … 88 more files in changeset.
DRILL-7273: Introduce operators for handling metadata

closes #1886

    • -1
    • +7
    ./vector/complex/impl/SingleMapReaderImpl.java
  1. … 150 more files in changeset.
DRILL-7306: Disable schema-only batch for new scan framework

The EVF framework is set up to return a "fast schema" empty batch

with only schema as its first batch because, when the code was

written, it seemed that's how we wanted operators to work. However,

DRILL-7305 notes that many operators cannot handle empty batches.

Since the empty-batch bugs show that Drill does not, in fact,

provide a "fast schema" batch, this ticket asks to disable the

feature in the new scan framework. The feature is disabled with

a config option; it can be re-enabled if ever it is needed.

SQL differentiates between two subtle cases, and both are

supported by this change.

1. Empty results: the query found a schema, but no rows

are returned. If no reader returns any rows, but at

least one reader provides a schema, then the scan

returns an empty batch with the schema.

2. Null results: the query found no schema or rows. No

schema is returned. If no reader returns rows or

schema, then the scan returns no batch: it instead

immediately returns a DONE status.
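
The two cases can be summarized as a small decision function. This is a sketch with hypothetical names. It models only the choice described above, not the scan operator.

```java
// Illustrative sketch only: hypothetical names, not the scan framework's API.
final class ScanOutcomeSketch {
    enum Outcome {
        DATA,              // at least one row was read
        EMPTY_WITH_SCHEMA, // "empty results": schema known, no rows
        DONE_NO_SCHEMA     // "null results": no schema, no rows, no batch
    }

    static Outcome classify(boolean anyReaderHadSchema, long totalRows) {
        if (totalRows > 0) {
            return Outcome.DATA;
        }
        return anyReaderHadSchema ? Outcome.EMPTY_WITH_SCHEMA : Outcome.DONE_NO_SCHEMA;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, 0));
        System.out.println(classify(false, 0));
    }
}
```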

For CSV, an empty file with headers returns the null result set

(because we don't know the schema.) An empty CSV file without headers

returns an empty result set because we do know the schema: it will

always be the columns array.

Old tests validate the original schema-batch mode; new tests were

added to validate the no-schema-batch mode.

    • -3
    • +14
    ./record/metadata/ColumnBuilder.java
  1. … 42 more files in changeset.
DRILL-6951: Merge row set based mock data source

The mock data source is used in several tests to generate a large volume

of sample data, such as when testing spilling. The mock data source also

lets us try new plugin features in a very simple context. During the

development of the row set framework, the mock data source was converted

to use the new framework to verify functionality. This commit upgrades

the mock data source with that work.

The work changes none of the functionality. It does, however, improve

memory usage. Batches are limited, by default, to 10 MB in size. The row

set framework minimizes internal fragmentation in the largest vector.

(Previously, internal fragmentation averaged 25% but could be as high as

50%.)

As it turns out, the hash aggregate tests depended on the internal

fragmentation: without it, the hash agg no longer spilled for the same

row count. Adjusted the generated row counts to recreate a data volume

that caused spilling.

One test in particular always failed due to assertions in the hash agg

code. These seem to be true bugs and are described in DRILL-7301. After

multiple failed attempts to get the test to work, it was disabled until

DRILL-7301 is fixed.

Added a new unit test to sanity check the mock data source. (No test

already existed for this functionality except as verified via other unit

tests.)

  1. … 21 more files in changeset.
DRILL-7293: Convert the regex ("log") plugin to use EVF

Converts the log format plugin (which uses a regex for parsing) to work

with the Enhanced Vector Framework (EVF).

User-visible behavior changes added to the README file.

* Use the plugin config object to pass config to the Easy framework.

* Use the EVF scan mechanism in place of the legacy "ScanBatch"

mechanism.

* Minor code and README cleanup.

* Replace ad-hoc type conversion with builtin conversions

The provided schema support in the enhanced vector framework (EVF)

provides automatic conversions from VARCHAR to most types. The log

format plugin was created before EVF was available and provided its own

conversion mechanism. This commit removes the ad-hoc conversion code and

instead uses the log plugin config schema information to create an

"output schema" just as if it was provided by the provided schema

framework.

Because we need the schema in the plugin (rather than the reader), moved

the schema-parsing code out of the reader into the plugin. The plugin

creates two schemas: an "output schema" with the desired output types,

and a "reader schema" that uses only VARCHAR. This causes the EVF to

perform conversions.

* Enable provided schema support

Allows the user to specify types using either the format config (as

previously) or a provided schema. If a schema is provided, it will match

columns using names specified in the format config.

The provided schema can specify both types and modes (nullable or not

null.)

If a schema is provided, then the types specified in the plugin config

are ignored. No attempt is made to merge schemas.

If a schema is provided, but a column is omitted from the schema, the

type defaults to VARCHAR.

* Added ability to specify regex in table properties

Allows the user to specify the regex, and the column schema,

using a CREATE SCHEMA statement. The README file provides the details.

Unit tests demonstrate and verify the functionality.

* Used the custom error context provided by EVF to enhance the log format

reader error messages.

* Added user name to default EVF error context

* Added support for table functions

Can set the regex and maxErrors fields, but not the schema.

Schema will default to "field_0", "field_1", etc. of type

VARCHAR.

* Added unit tests to verify the functionality.

* Added a check, and a test, for a regex with no groups.

* Added columns array support

When the log regex plugin is given no schema, it previously

created a list of columns "field_0", "field_1", etc. After

this change, the plugin instead follows the pattern set by

the text plugin: it will place all fields into the columns

array. (The two special fields are still separate.)
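
The columns-array behavior can be approximated with the standard java.util.regex classes. This is a sketch with hypothetical names. It assumes the whole line must match, and it does not model the two special fields.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: hypothetical names, not the log plugin's reader.
final class ColumnsArraySketch {
    /** Returns the regex groups of a line as a columns array, or null if the line does not match. */
    static String[] toColumns(Pattern pattern, String line) {
        Matcher matcher = pattern.matcher(line);
        if (!matcher.matches()) {
            return null;
        }
        String[] columns = new String[matcher.groupCount()];
        for (int i = 0; i < columns.length; i++) {
            columns[i] = matcher.group(i + 1); // group 0 is the whole match
        }
        return columns;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(\\d{4})-(\\d{2}) (\\w+)");
        System.out.println(Arrays.toString(toColumns(pattern, "2019-11 INFO")));
    }
}
```

Every group lands in one array rather than in per-group fields such as "field_0" and "field_1", which mirrors how the text plugin exposes columns.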

A few adjustments were necessary to the columns array

framework to allow use of the special columns along with

the `columns` column.

Modified unit tests and the README to reflect this change.

The change should be backward compatible because few users

are likely relying on the dummy field names.

Added unit tests to verify that schema-based table

functions work. A test shows that, due to the unfortunate

config property name "schema", users of this plugin cannot

combine a config table function with the schema attribute

in the way promised in DRILL-6965.

    • -0
    • +16
    ./record/metadata/AbstractPropertied.java
  1. … 17 more files in changeset.
DRILL-7258: Remove field width limit for text reader

The V2 text reader enforced a limit of 64K characters when using

column headers, but not when using the columns[] array. The V3 reader

enforced the 64K limit in both cases.

This patch removes the limit in both cases. The limit now is the

16MB vector size limit. With headers, no one column can exceed 16MB.

With the columns[] array, no one row can exceed 16MB. (The 16MB

limit is set by the Netty memory allocator.)

Added an "appendBytes()" method to the scalar column writer which adds

additional bytes to those already written for a specific column or

array element value. The method is implemented for VarChar, Var16Char

and VarBinary vectors. It throws an exception for all other types.
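
The set-then-append contract can be illustrated with a toy writer. This is a sketch with hypothetical names that buffers values on the heap. The real column writers write into vectors and handle overflow, which is not modeled here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: hypothetical names, heap buffers instead of vectors.
final class AppendBytesSketch {
    private final List<ByteArrayOutputStream> values = new ArrayList<>();

    /** Starts a new value with the first length bytes of the array. */
    void setBytes(byte[] bytes, int length) {
        ByteArrayOutputStream value = new ByteArrayOutputStream();
        value.write(bytes, 0, length);
        values.add(value);
    }

    /** Appends more bytes to the value most recently started by setBytes(). */
    void appendBytes(byte[] bytes, int length) {
        if (values.isEmpty()) {
            throw new IllegalStateException("setBytes() must be called first");
        }
        values.get(values.size() - 1).write(bytes, 0, length);
    }

    byte[] value(int index) {
        return values.get(index).toByteArray();
    }

    public static void main(String[] args) {
        AppendBytesSketch writer = new AppendBytesSketch();
        writer.setBytes("abc".getBytes(StandardCharsets.UTF_8), 3);
        writer.appendBytes("defg".getBytes(StandardCharsets.UTF_8), 2);
        System.out.println(new String(writer.value(0), StandardCharsets.UTF_8));
    }
}
```

A reader that sees a long field in chunks can call setBytes() for the first chunk and appendBytes() for the rest, so no single intermediate buffer has to hold the whole field.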

When used with a type conversion shim, the appendBytes() method throws

an exception. This should be OK because the previous setBytes() should

have failed because a huge value is not acceptable for numeric or date

type conversions.

Added unit tests of the append feature, and for the append feature in

the batch overflow case (when appending bytes causes the vector or

batch to overflow.) Also added tests to verify the lack of column width

limit with the text reader, both with and without headers.

closes #1802

    • -2
    • +11
    ./vector/accessor/ColumnWriterIndex.java
    • -1
    • +6
    ./vector/accessor/writer/AbstractArrayWriter.java
    • -0
    • +7
    ./vector/accessor/writer/BaseVarWidthWriter.java
    • -0
    • +1
    ./vector/accessor/writer/MapWriter.java
    • -0
    • +3
    ./vector/accessor/writer/ScalarArrayWriter.java
  1. … 14 more files in changeset.
DRILL-7279: Enable provided schema for text files without headers

* Allows a provided schema for text files without headers. The

provided schema columns replace the `columns` column that is

normally used.

* Allows customizing text format properties using table properties.

The table properties "override" properties set in the plugin config.

* Added unit tests for the newly supported use cases.

* Fixed bug in quote escape handling.

closes #1798

    • -3
    • +10
    ./record/metadata/AbstractPropertied.java
  1. … 13 more files in changeset.
DRILL-7278: Refactor result set loader projection mechanism

Drill 1.16 added an enhanced scan framework based on the row set

mechanisms, and a "provisioned schema" feature built on top

of that framework. Conversion of the log reader plugin to use

the framework identified additional features we wish to add,

such as marking a column as "special" (not expanded in a wildcard

query.)

This work identified that the code added for provisioned schemas in

Drill 1.16 worked, but is a bit overly complex, making it hard to add

the desired new feature.

This patch refactors the "reader" projection code:

* Create a "projection set" mechanism that the reader can query to ask,

"the caller just added a column. Should it be projected or not?"

* Unifies the type conversion mechanism added as part of provisioned

schemas.

* Added the "special column" property for both "reader" and "provided"

schemas.

* Verified that provisioned schemas work with maps (at least on the scan

framework side.)

* Replaced the previous "schema transformer" mechanism with a new "type

conversion" mechanism that unifies type conversion, provided schemas

and an optional custom type conversion mechanism.

* Column writers can report if they are projected. Moved this query

from metadata to the column writer itself.

* Extended and clarified documentation of the feature.

* Revised and/or added unit tests.
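
The "projection set" question and the "special column" property can be sketched together. This is an illustration with hypothetical names, not the interface added by the patch. It assumes that a special column is skipped by a wildcard but still projected when named explicitly, and it compares names case-sensitively for simplicity.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: hypothetical names, not the patch's interfaces.
final class ProjectionSetSketch {
    private final Set<String> projected; // null stands for a wildcard (SELECT *)
    private final Set<String> special;   // columns not expanded by a wildcard

    ProjectionSetSketch(Set<String> projected, Set<String> special) {
        this.projected = projected == null ? null : new HashSet<>(projected);
        this.special = new HashSet<>(special);
    }

    /** Answers: "the caller just added a column; should it be projected?" */
    boolean isProjected(String column) {
        if (projected == null) {
            // Wildcard: everything except special columns.
            return !special.contains(column);
        }
        // Explicit list: only the named columns, special or not.
        return projected.contains(column);
    }

    public static void main(String[] args) {
        ProjectionSetSketch wildcard = new ProjectionSetSketch(null, Set.of("_raw"));
        System.out.println(wildcard.isProjected("a"));
        System.out.println(wildcard.isProjected("_raw"));
    }
}
```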

closes #1797

    • -0
    • +9
    ./record/metadata/AbstractPropertied.java
    • -15
    • +15
    ./record/metadata/ColumnMetadata.java
    • -102
    • +0
    ./record/metadata/ProjectionType.java
    • -0
    • +10
    ./vector/accessor/ColumnWriter.java
    • -0
    • +3
    ./vector/accessor/writer/AbstractArrayWriter.java
    • -9
    • +3
    ./vector/accessor/writer/AbstractTupleWriter.java
    • -9
    • +11
    ./vector/accessor/writer/MapWriter.java
  1. … 58 more files in changeset.
DRILL-7257: Set nullable var-width vector lastSet value

Turns out this is due to a subtle issue with variable-width nullable

vectors. Such vectors have a lastSet attribute in the Mutator class.

When using "transfer pairs" to copy values, the code somehow decides

to zero-fill from the lastSet value to the record count. The row set

framework did not set this value, meaning that the RemovingRecordBatch

zero-filled the dir0 column when it chose to use transfer pairs rather

than copying values. The use of transfer pairs occurs when all rows in

a batch pass the filter prior to the removing record batch.

Modified the nullable vector writer to properly set the lastSet value at

the end of each batch. Added a unit test to verify the value is set

correctly.

Includes a bit of code clean-up.

  1. … 7 more files in changeset.