DRILL-7442: Create multi-batch row set reader

Adds a ResultSetReader that works across multiple batches

in a result set. Reuses the same row set and readers if

schema is unchanged, creates a new set if the schema changes.

Adds a unit test for the result set reader.

Adds a "rebind" capability to the row set readers to focus

on new buffers under an existing set of vectors. Used when

a new batch arrives, if the schema is unchanged.

Extends row set classes to be aware of the BatchAccessor class

which encapsulates a container and optional selection vector,

and tracks schema changes.

Moves row set tests into the same package as the row sets.

(Row set classes were moved a while back, but the tests were

not moved.)

Renames some BatchAccessor methods.

closes #1897
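
The reuse-or-rebuild behavior described in this commit can be pictured with a minimal sketch. All class and method names below (`RebindSketch`, `Batch`, `BatchReader`, `MultiBatchReader`, `rebind`, `start`) are hypothetical stand-ins chosen for illustration, not Drill's actual classes; the schema is reduced to a version number and the vectors to an `int[]`.

```java
// Hypothetical stand-ins for illustration; these are not Drill's classes.
class RebindSketch {

  /** One incoming batch: a schema version plus its data buffer. */
  static final class Batch {
    final int schemaVersion;
    final int[] values;

    Batch(int schemaVersion, int[] values) {
      this.schemaVersion = schemaVersion;
      this.values = values;
    }
  }

  /** A reader that can be pointed at a new buffer without being rebuilt. */
  static final class BatchReader {
    private int[] values;

    BatchReader(int[] values) {
      this.values = values;
    }

    /** Focuses the existing reader on a new buffer. */
    void rebind(int[] newValues) {
      this.values = newValues;
    }

    int sum() {
      int total = 0;
      for (int v : values) {
        total += v;
      }
      return total;
    }
  }

  /** Reuses the reader while the schema is unchanged; rebuilds it on a change. */
  static final class MultiBatchReader {
    private BatchReader reader;
    private int schemaVersion;
    int readersCreated;

    BatchReader start(Batch batch) {
      if (reader == null || batch.schemaVersion != schemaVersion) {
        // First batch or schema change: build a fresh reader.
        reader = new BatchReader(batch.values);
        schemaVersion = batch.schemaVersion;
        readersCreated++;
      } else {
        // Same schema: keep the reader, swap the buffer underneath it.
        reader.rebind(batch.values);
      }
      return reader;
    }
  }

  public static void main(String[] args) {
    MultiBatchReader multi = new MultiBatchReader();
    multi.start(new Batch(1, new int[] {1, 2}));
    multi.start(new Batch(1, new int[] {4}));
    multi.start(new Batch(2, new int[] {5, 5}));
    System.out.println("readers created: " + multi.readersCreated);
  }
}
```

In this sketch the second batch shares the first batch's schema version, so only the third batch forces a new reader.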

  1. … 48 more files in changeset.
DRILL-7445: Create batch copier based on result set framework

The result set framework now provides both a reader and writer.

This PR provides a copier that copies batches using this

framework. Such a copier can:

- Copy selected records

- Copy all records, such as for an SV2 or SV4

The copier uses the result set loader to create uniformly-sized

output batches from input batches of any size. It does this

by merging or splitting input batches as needed.

Since the result set reader handles both SV2 and SV4s, the

copier can filter or reorder rows based on the SV associated

with the input batch.

This version assumes a single stream of input batches, and handles

any schema changes in that input by creating output batches

that track the input schema. This would be used in, say, the

selection vector remover. A different design is needed for merging

such as in the merging receiver.

Adds a "copy" method to the column writers. Copy is implemented

by doing a direct memory copy from source to destination vectors.

A unit test verifies functionality for various use cases

and data types.

closes #1899
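
As a rough illustration of the merge-or-split idea, the sketch below copies rows from input batches of any size into output batches of a fixed target size, and applies a selection vector as a list of row indexes. It assumes batches can be modeled as `List<Integer>`; the names `RebatchSketch`, `select`, and `rebatch` are invented for this example and do not reflect the copier's real API or its direct memory copies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: a batch is a list of rows, a selection vector is an int[].
class RebatchSketch {

  /** Returns the rows named by the selection vector, in selection order. */
  static List<Integer> select(List<Integer> rows, int[] selection) {
    List<Integer> picked = new ArrayList<>();
    for (int index : selection) {
      picked.add(rows.get(index));
    }
    return picked;
  }

  /** Copies all input rows into output batches of at most targetSize rows. */
  static List<List<Integer>> rebatch(List<List<Integer>> inputs, int targetSize) {
    if (targetSize <= 0) {
      throw new IllegalArgumentException("targetSize must be positive");
    }
    List<List<Integer>> outputs = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    for (List<Integer> batch : inputs) {
      for (Integer row : batch) {
        current.add(row);
        if (current.size() == targetSize) {
          // Output batch is full: emit it and start the next one.
          outputs.add(current);
          current = new ArrayList<>();
        }
      }
    }
    if (!current.isEmpty()) {
      // Final, possibly short, batch.
      outputs.add(current);
    }
    return outputs;
  }

  public static void main(String[] args) {
    List<List<Integer>> inputs =
        List.of(List.of(1, 2, 3), List.of(4), List.of(5, 6, 7, 8));
    System.out.println(rebatch(inputs, 3));
    System.out.println(select(List.of(10, 20, 30), new int[] {2, 0}));
  }
}
```

Small inputs are merged and large inputs are split by the same loop, because the output batch boundary depends only on the running row count.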

  1. … 35 more files in changeset.
DRILL-7441: Fix issues with fillEmpties, offset vectors

Fixes subtle issues with offset vectors and "fill empties" logic.


Drill has an informal standard that if a batch has no rows, then

offset vectors within that batch should have zero size. Contrast

this with batches of size 1 that should have offset vectors of

size 2. Changed to enforce this rule throughout.

Nullable, repeated and variable-width vectors have "fill empties"

logic that is used in two places: when setting the value count and

when preparing to write a new value. The current logic is not

quite right for either case. Added tests and fixed the code to

properly handle each case.

Revised the batch validator to enforce the rule that offset vectors

have length 0 in 0-sized batches. The result was much simpler code.

Added tools to easily print a batch, restoring some code that

was recently lost when the RowSet classes were moved.

Code cleanup in all files touched.

Added logic to "dirty" allocated buffers when testing to ensure

logic is not sensitive to the "pristine" state of new buffers.

Added logic to the column writers to enforce the zero-size-batch rule

for offset vectors. Added unit tests for this case.

Fixed the column writers to set the "lastSet" mutator value for

nullable types since other code relies on this value.

Removed the "setCount" field in nullable vectors because it is not

actually used.

closes #1896
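
The offset-vector rule and the "fill empties" step can be sketched on a plain `int[]`. This is a simplified model under stated assumptions, not Drill's vector code: `OffsetSketch`, `buildOffsets`, and `fillEmpties` are hypothetical names, and entry `row + 1` of the array is taken to hold the end offset of `row`.

```java
// Simplified model of an offset vector; names are hypothetical.
class OffsetSketch {

  /**
   * Builds offsets for variable-width values. A batch with no rows gets a
   * zero-length offset array; a batch with n rows gets n + 1 entries.
   */
  static int[] buildOffsets(int[] valueLengths) {
    int rowCount = valueLengths.length;
    if (rowCount == 0) {
      return new int[0];
    }
    int[] offsets = new int[rowCount + 1];
    for (int row = 0; row < rowCount; row++) {
      offsets[row + 1] = offsets[row] + valueLengths[row];
    }
    return offsets;
  }

  /**
   * Fills in rows that were skipped between lastWrittenRow and nextRow.
   * Each skipped row is empty, so its end offset repeats the previous one.
   * This overwrites whatever the buffer held before, so it also works on
   * buffers that were not zeroed on allocation.
   */
  static void fillEmpties(int[] offsets, int lastWrittenRow, int nextRow) {
    for (int row = lastWrittenRow + 1; row < nextRow; row++) {
      offsets[row + 1] = offsets[row];
    }
  }

  public static void main(String[] args) {
    // Row 0 holds 3 bytes; rows 1 and 2 were skipped; 99 marks stale data.
    int[] offsets = {0, 3, 99, 99, 99};
    fillEmpties(offsets, 0, 3);
    System.out.println(java.util.Arrays.toString(offsets));
  }
}
```

Seeding the array with a nonzero marker mirrors the "dirty" buffer testing mentioned above: the skipped rows only come out right if the fill step writes them explicitly.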

  1. … 29 more files in changeset.
DRILL-7439: Batch count fixes for six additional operators

Enables vector checks, and fixes batch count and vector issues for:

* StreamingAggBatch

* RuntimeFilterRecordBatch

* FlattenRecordBatch

* MergeJoinBatch

* NestedLoopJoinBatch

* LimitRecordBatch

Also fixes a zero-size batch validity issue for the CSV reader when

all files contain no data.

Includes code cleanup for files touched in this PR.

closes #1893

  1. … 20 more files in changeset.
DRILL-7409: Moving test with huge test data to the drill-test-framework.

closes #1891

DRILL-7436: Fix record count, vector structure issues in several operators

Adds additional vector checks to the BatchValidator.

Enables checking for the following operators:

* FilterRecordBatch

* PartitionLimitRecordBatch

* UnnestRecordBatch

* HashAggBatch

* RemovingRecordBatch

Fixes vector count issues for each of these.

Fixes empty-batch (record count = 0) handling in several of the

above operators. Added a method to VectorContainer to correctly

create an empty batch. (An empty batch, counter-intuitively,

needs vectors allocated to hold the 0 value in the first

position of each offset vector.)

Disables verbose logging for MongoDB tests. Details are written to

the log rather than the console.

Disables two invalid Mongo tests. See DRILL-7428.

Adjusts the expression tree materializer to not add the LATE type

to Union vectors. (See DRILL-7435.)

Ensures that Union vectors contain valid vectors for each subtype.

The present fix is a work-around, see DRILL-7434 for a better

long-term fix.

Cleans up code formatting and other minor issues in each file touched

during the fixes in this PR.

  1. … 34 more files in changeset.
DRILL-7372: MethodAnalyzer consumes too much memory

closes #1887

  1. … 1 more file in changeset.
DRILL-7391: Wrong result when doing left outer join on CSV table

  1. … 1 more file in changeset.
DRILL-7177: Format Plugin for Excel Files

closes #1749

  1. … 17 more files in changeset.
DRILL-7424: Project operator fails to set the container row count

Enabled the "batch validator" for the Project operator. Ran tests.

Exceptions occurred because, in some paths, the Project operator

fails to set the container row count.

Fixes the project operator. Cleans up formatting issues in files

touched during the investigation. Cleaned up batch-related issues

in Project.

  1. … 5 more files in changeset.
DRILL-1709: Add desc alias for describe command

closes #1881

  1. … 1 more file in changeset.
DRILL-7418: MetadataDirectGroupScan improvements

1. Replaced files listing with selection root information to reduce query plan size in MetadataDirectGroupScan.

2. Fixed MetadataDirectGroupScan ser / de issues.

3. Added PlanMatcher to QueryBuilder for more convenient plan matching.

4. Rewrote TestConvertCountToDirectScan to use ClusterTest.

5. Refactoring and code cleanup.

  1. … 9 more files in changeset.
DRILL-7414: EVF incorrectly sets buffer writer index after rollover

Enabling the vector validator on the "new" scan operator, in cases

in which overflow occurs, identified that the DrillBuf writer index

was not properly set for repeated vectors.

Enables such checking, adds unit tests, and fixes the writer index.


closes #1878

  1. … 4 more files in changeset.
DRILL-7413: Test and fix scan operator vectors

Enables vector validation tests for the ScanBatch and all

EasyFormat plugins. Fixes a bug in scan batch that failed to set

the record count in the output container.

Fixes a number of formatting and other issues found while adding

the tests.

  1. … 5 more files in changeset.
DRILL-7412: Minor unit test improvements

Many tests intentionally trigger errors. A debug-only log setting

sent those errors to stdout. The resulting stack dumps simply cluttered

the test output, so disabled error output to the console.

Drill can apply bounds checks to vectors. Tests run via Maven

enable bounds checking. Now, bounds checking is also enabled in

"debug mode" (when assertions are enabled, as in an IDE).

Drill contains two test frameworks. The older BaseTestQuery was

marked as deprecated, but many tests still use it and are unlikely

to be changed soon. So, removed the deprecated marker to reduce the

number of spurious warnings.

Also includes a number of minor clean-ups.

closes #1876
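
One common way to tie a check to "debug mode" in Java is to detect whether the JVM was started with assertions enabled. The sketch below uses the standard assert-with-side-effect idiom; `DebugModeSketch` and `boundsCheckingEnabled` are hypothetical names, and this is not necessarily how Drill wires up its own bounds-checking switch.

```java
// Illustrative only; the class and method names are assumptions.
class DebugModeSketch {

  /** Returns true when the JVM runs with assertions enabled (-ea). */
  static boolean assertionsEnabled() {
    boolean enabled = false;
    // The assignment inside the assert only executes when assertions are on.
    assert enabled = true;
    return enabled;
  }

  /** Checks are on if explicitly requested or if running in "debug mode". */
  static boolean boundsCheckingEnabled(boolean explicitlyRequested) {
    return explicitlyRequested || assertionsEnabled();
  }

  public static void main(String[] args) {
    System.out.println("assertions enabled: " + assertionsEnabled());
    System.out.println("bounds checking: " + boundsCheckingEnabled(false));
  }
}
```

Run with `java -ea` to see both values flip to true without any explicit setting.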

  1. … 12 more files in changeset.
DRILL-5674: Support ZIP compression

1. Added ZipCodec implementation which can read / write a single file.

2. Revisited Drill plugin formats to ensure 'openPossiblyCompressedStream' method is used in those which support compression.

3. Added unit tests.

4. General refactoring.
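
For readers unfamiliar with single-file ZIP handling, here is a minimal round-trip sketch using only the JDK's `java.util.zip` classes. It is not the ZipCodec from this commit: `ZipSketch`, `zipSingle`, and `unzipFirst` are hypothetical names, and the sketch works on in-memory byte arrays rather than Hadoop streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Illustrative helper; not the codec added in this change.
class ZipSketch {

  /** Writes data as the only entry of a ZIP archive. */
  static byte[] zipSingle(String entryName, byte[] data) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
      zip.putNextEntry(new ZipEntry(entryName));
      zip.write(data);
      zip.closeEntry();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Reads the content of the first entry and ignores any others. */
  static byte[] unzipFirst(byte[] zipped) {
    try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipped))) {
      if (zip.getNextEntry() == null) {
        throw new IOException("archive has no entries");
      }
      // Reading stops at the end of the current entry.
      return zip.readAllBytes();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] original = "a,b\n1,2\n".getBytes(StandardCharsets.UTF_8);
    byte[] restored = unzipFirst(zipSingle("data.csv", original));
    System.out.print(new String(restored, StandardCharsets.UTF_8));
  }
}
```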

  1. … 17 more files in changeset.
DRILL-7402: Suppress batch dumps for expected failures in tests

Drill provides a way to dump the last few batches when an error

occurs. However, in tests, we often deliberately cause something

to fail. In this case, the batch dump is unnecessary.

This enhancement adds a config property, disabled in tests, that

controls the dump activity. The option is enabled in the one test

that needs it enabled.

closes #1872

  1. … 2 more files in changeset.
DRILL-7403: Validate batch checks, vector integrity in unit tests

Enhances the existing record batch checks to check all the various

batch record counts, and to more fully validate all vector types.

This code revealed that virtually all record batches have

problems: they omit setting some record count or other, or they

introduce some form of vector corruption.

Since we want things to work as we make fixes, this change enables

the checks for only one record batch: the "new" scan. Others are

to come as they are fixed.

closes #1871

  1. … 3 more files in changeset.
DRILL-7385: Convert PCAP Format Plugin to EVF

  1. … 7 more files in changeset.
DRILL-6096: Provide mechanism to configure text writer configuration

1. The format plugin configuration allows specifying line and field delimiters, quotes and escape characters.

2. System / session options allow specifying whether the writer should add headers and force quotes.

closes #1873
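
To make the configurable pieces concrete, the sketch below formats one row of text using a field delimiter, line delimiter, quote character, escape character, and a force-quotes flag. It is an assumption-laden illustration: `TextWriterSketch` and its methods are invented here and do not correspond to Drill's text writer or to its actual option names.

```java
// Hypothetical row formatter; not Drill's text writer.
class TextWriterSketch {
  private final String fieldDelimiter;
  private final String lineDelimiter;
  private final char quote;
  private final char escape;
  private final boolean forceQuotes;

  TextWriterSketch(String fieldDelimiter, String lineDelimiter,
                   char quote, char escape, boolean forceQuotes) {
    this.fieldDelimiter = fieldDelimiter;
    this.lineDelimiter = lineDelimiter;
    this.quote = quote;
    this.escape = escape;
    this.forceQuotes = forceQuotes;
  }

  /** Joins the formatted fields and terminates the row. */
  String formatRow(String... fields) {
    StringBuilder row = new StringBuilder();
    for (int i = 0; i < fields.length; i++) {
      if (i > 0) {
        row.append(fieldDelimiter);
      }
      row.append(formatField(fields[i]));
    }
    return row.append(lineDelimiter).toString();
  }

  /** Quotes a field when forced or when it contains a special sequence. */
  String formatField(String value) {
    boolean needsQuotes = forceQuotes
        || value.contains(fieldDelimiter)
        || value.contains(lineDelimiter)
        || value.indexOf(quote) >= 0;
    if (!needsQuotes) {
      return value;
    }
    StringBuilder field = new StringBuilder().append(quote);
    for (char c : value.toCharArray()) {
      if (c == quote || c == escape) {
        // Prefix quote and escape characters with the escape character.
        field.append(escape);
      }
      field.append(c);
    }
    return field.append(quote).toString();
  }

  public static void main(String[] args) {
    TextWriterSketch csv = new TextWriterSketch(",", "\n", '"', '"', false);
    System.out.print(csv.formatRow("a", "b,c", "say \"hi\""));
  }
}
```

Using the quote character as its own escape, as in the `main` example, yields the familiar doubled-quote CSV style; a backslash escape gives the other common convention.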

  1. … 17 more files in changeset.
DRILL-7377: Nested schemas for dynamic EVF columns

The Result Set Loader (part of EVF) allows adding columns up-front

before reading rows (so-called "early schema.") Such schemas allow

nested columns (maps with members, repeated lists with a type, etc.)

The Result Set Loader also allows adding columns dynamically

while loading data (so-called "late schema".) Previously, the code

assumed that columns would be added top-down: first the map, then

the map's contents, etc.

Charles found a need to allow adding a nested column (a repeated

list with a declared list type.)

This patch revises the code to use the same mechanism in both the

early- and late-schema cases, allowing adding nested columns at

any time.

Testing: Added a new unit test for the late-schema repeated list

with content case.

  1. … 5 more files in changeset.
DRILL-7358: Fix COUNT(*) for empty text files

Fixes a subtle error when a text file has a header (and so has a

schema), but is in a COUNT(*) query, so that no columns are

projected. Ensures that, in this case, an empty schema is

treated as a valid result set.

Tests: updated CSV tests to include this case.

closes #1867

  1. … 9 more files in changeset.
DRILL-7357: Expose Drill Metastore data through information_schema

1. Add additional columns to TABLES and COLUMNS tables.

2. Add PARTITIONS table.

3. General refactoring to adjust information_schema data retrieval from multiple sources.

closes #1860

  1. … 28 more files in changeset.
DRILL-7373: Fix problems involving reading from DICT type

- Fixed FieldIdUtil to resolve reading from DICT for some complex cases;

- optimized reading from DICT given a key by passing an appropriate Object type to DictReader#find(...) and DictReader#read(...) methods when the schema is known (e.g. when reading from Hive tables) instead of generating it on the fly based on int or String path and key type;

- fixed error when accessing a value by a non-existent key in an Avro table.

  1. … 10 more files in changeset.
DRILL-7368: Fix Iceberg Metastore failure when filter column contains nulls

  1. … 9 more files in changeset.
DRILL-7168: Implement ALTER SCHEMA ADD / REMOVE commands

  1. … 14 more files in changeset.
DRILL-7362: COUNT(*) on JSON with outer list results in JsonParse error

closes #1849

  1. … 3 more files in changeset.
DRILL-7326: Support repeated lists for CTAS parquet format

closes #1844

  1. … 3 more files in changeset.
DRILL-7350: Move RowSet related classes from test folder

  1. … 278 more files in changeset.
DRILL-4517: Support reading empty Parquet files

1. Modified flat and complex parquet readers to output only the schema when the requested number of records to read is 0. In this case, readers are not initialized, which improves performance.

2. Allowed reading the requested number of rows instead of all rows in the row group (DRILL-6528).

3. Fixed determination of the number of nulls in the row group (fixed IsPredicate#isAllNulls method).

4. Allowed reading empty parquet files by adding an empty / fake row group.

5. General refactoring and unit tests.

6. Parquet tests categorization.

closes #1839

  1. … 34 more files in changeset.