DRILL-7292: Remove V1 and V2 text readers

Drill 1.16 introduced the "V3" text reader based on the row set
and provided schema mechanisms. V3 was available by system/session
option as the functionality was considered experimental.

The functionality has now undergone thorough testing. This commit makes
the V3 text reader available by default, and removes the code for the
original "V1" and the "new" (compliant, "V2") text reader.

The system/session options that controlled reader selection are retained
for backward compatibility, but they no longer do anything.

Specific changes:

* Removed the V2 "compliant" text reader.
* Moved the "V3" reader to replace the "compliant" version.
* Renamed the "compliant" package to "reader."
* Removed the V1 text reader.
* Moved the V1 text writer (still used with the V2 and V3 readers)
  into a new "writer" package adjacent to the reader.
* Removed the CSV tests for the V2 reader, including those that
  demonstrated bugs in V2.
* V2 did not properly handle the quote escape character. One or two unit
  tests depended on the broken behavior. Fixed them for the correct
  behavior.
* Behavior of "messy quotes" (those that appear in a non-quoted field)
  was undefined for the text reader. Added a test to clearly demonstrate
  the (somewhat odd) behavior. The behavior itself was not changed.

Reran all unit tests to ensure that they work with the now-default V3
text reader.

closes #1806

    ./CompliantTextRecordReader.java (-256, +0)
    ./StreamFinishedPseudoException.java (-29, +0)
    ./v3/CompliantTextBatchReader.java (-272, +0)
  1. … 45 more files in changeset.
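
As a rough illustration of the quote escape handling mentioned in DRILL-7292 above, the sketch below decodes the body of a quoted field. It is a generic example, not Drill's parser: the class `QuotedFieldDecoder`, the method `decode`, and the rule that the escape character only takes effect directly before a quote are all assumptions made for illustration. It does not model the "messy quotes" case.

```java
// Hypothetical helper, not part of Drill.
final class QuotedFieldDecoder {
  // Decodes the text found between the opening and closing quotes of a field.
  // Assumption: the escape character matters only when it directly precedes
  // a quote; anywhere else it is copied through unchanged.
  static String decode(String body, char quote, char escape) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < body.length(); i++) {
      char c = body.charAt(i);
      if (c == escape && i + 1 < body.length() && body.charAt(i + 1) == quote) {
        out.append(quote);  // escaped quote becomes a literal quote
        i++;                // skip the quote that was just consumed
      } else {
        out.append(c);
      }
    }
    return out.toString();
  }
}
```

The same method covers the doubled-quote convention (quote and escape are both `"`) and a distinct escape character such as a backslash.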
DRILL-7258: Remove field width limit for text reader

The V2 text reader enforced a limit of 64K characters when using
column headers, but not when using the columns[] array. The V3 reader
enforced the 64K limit in both cases.

This patch removes the limit in both cases. The limit now is the
16MB vector size limit. With headers, no one column can exceed 16MB.
With the columns[] array, no one row can exceed 16MB. (The 16MB
limit is set by the Netty memory allocator.)

Added an "appendBytes()" method to the scalar column writer which adds
additional bytes to those already written for a specific column or
array element value. The method is implemented for VarChar, Var16Char
and VarBinary vectors. It throws an exception for all other types.

When used with a type conversion shim, the appendBytes() method throws
an exception. This should be OK because the previous setBytes() call
should already have failed: a huge value is not acceptable for numeric
or date type conversions.

Added unit tests of the append feature, and for the append feature in
the batch overflow case (when appending bytes causes the vector or
batch to overflow.) Also added tests to verify the lack of column width
limit with the text reader, both with and without headers.

closes #1802

  1. … 21 more files in changeset.
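
The set-then-append idea in DRILL-7258 above can be sketched roughly as follows. This is not Drill's column writer: `VarCharValueWriter`, its growable `byte[]` buffer, and the `MAX_VALUE_BYTES` constant are stand-ins chosen for illustration, with the cap assumed to mirror the 16MB limit mentioned in the commit message.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for a variable-width column writer.
final class VarCharValueWriter {
  // Assumed cap, mirroring the 16MB vector limit named in the commit message.
  static final int MAX_VALUE_BYTES = 16 * 1024 * 1024;

  private byte[] buf = new byte[64];
  private int length;

  // Starts a new value, discarding anything written so far.
  void setBytes(byte[] src, int len) {
    length = 0;
    appendBytes(src, len);
  }

  // Adds more bytes to the value that has already been written.
  void appendBytes(byte[] src, int len) {
    if ((long) length + len > MAX_VALUE_BYTES) {
      throw new IllegalStateException("value exceeds " + MAX_VALUE_BYTES + " bytes");
    }
    if (length + len > buf.length) {
      // Grow geometrically, but always enough to hold the new bytes.
      buf = Arrays.copyOf(buf, Math.max(buf.length * 2, length + len));
    }
    System.arraycopy(src, 0, buf, length, len);
    length += len;
  }

  String value() {
    return new String(buf, 0, length, StandardCharsets.UTF_8);
  }
}
```

Under these assumptions, a reader that sees a long field in chunks could call `setBytes()` for the first chunk and `appendBytes()` for each later one, so no fixed per-field buffer is needed.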
DRILL-7279: Enable provided schema for text files without headers

* Allows a provided schema for text files without headers. The
  provided schema columns replace the `columns` column that is
  normally used.
* Allows customizing text format properties using table properties.
  The table properties "override" properties set in the plugin config.
* Added unit tests for the newly supported use cases.
* Fixed bug in quote escape handling.

closes #1798

    ./v3/TextParsingSettingsV3.java (-158, +124)
  1. … 11 more files in changeset.
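
The "override" relationship described in DRILL-7279 above amounts to layering one set of properties over another. A minimal sketch, assuming both sources can be viewed as string maps; the class `TextFormatProperties` and the property names used in the usage note are hypothetical, not Drill's actual keys:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Drill.
final class TextFormatProperties {
  // Returns the plugin settings with table-level values layered on top.
  // Neither input map is modified.
  static Map<String, String> effective(Map<String, String> pluginConfig,
                                       Map<String, String> tableProps) {
    Map<String, String> merged = new HashMap<>(pluginConfig);
    merged.putAll(tableProps);  // a table property wins when both define a key
    return merged;
  }
}
```

For example, if the plugin config sets a hypothetical `fieldDelimiter` to `,` and the table properties set it to `|`, the effective value is `|`, while keys set only in the plugin config pass through unchanged.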
DRILL-7278: Refactor result set loader projection mechanism

Drill 1.16 added an enhanced scan framework based on the row set
mechanisms, and a "provisioned schema" feature built on top
of that framework. Conversion of the log reader plugin to use
the framework identified additional features we wish to add,
such as marking a column as "special" (not expanded in a wildcard
query.)

This work identified that the code added for provisioned schemas in
Drill 1.16 works, but is overly complex, making it hard to add
the desired new feature.

This patch refactors the "reader" projection code:

* Created a "projection set" mechanism that the reader can query to ask,
  "the caller just added a column. Should it be projected or not?"
* Unified the type conversion mechanism added as part of provisioned
  schemas.
* Added the "special column" property for both "reader" and "provided"
  schemas.
* Verified that provisioned schemas work with maps (at least on the scan
  framework side.)
* Replaced the previous "schema transformer" mechanism with a new "type
  conversion" mechanism that unifies type conversion, provided schemas
  and an optional custom type conversion mechanism.
* Column writers can report if they are projected. Moved this query
  from metadata to the column writer itself.
* Extended and clarified documentation of the feature.
* Revised and/or added unit tests.

closes #1797

  1. … 71 more files in changeset.
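
The "projection set" question from DRILL-7278 above ("the caller just added a column. Should it be projected or not?") can be sketched as a small interface. All names here (`ProjectionSet`, `WildcardProjection`, `ExplicitProjection`) are hypothetical, and the case-insensitive name matching is an assumption; the real classes in the commit are likely richer.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical interface: the reader asks it about each column it adds.
interface ProjectionSet {
  boolean isProjected(String columnName, boolean special);
}

// SELECT * : everything is projected except columns marked "special".
final class WildcardProjection implements ProjectionSet {
  @Override
  public boolean isProjected(String columnName, boolean special) {
    return !special;
  }
}

// SELECT a, b : only the named columns are projected, special or not.
final class ExplicitProjection implements ProjectionSet {
  private final Set<String> names = new HashSet<>();

  ExplicitProjection(String... columns) {
    for (String c : columns) {
      names.add(c.toLowerCase(Locale.ROOT));  // assumed: names match case-insensitively
    }
  }

  @Override
  public boolean isProjected(String columnName, boolean special) {
    return names.contains(columnName.toLowerCase(Locale.ROOT));
  }
}
```

In this sketch a special column stays out of a wildcard query but is still returned when the query names it explicitly.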
DRILL-7181: Improve V3 text reader (row set) error messages

Adds an error context to the User Error mechanism. The context allows
information to be passed through an intermediate layer and applied when
errors are raised in lower-level code, without the need for that
low-level code to know the details of the error context information.

Modifies the scan framework and V3 text plugin to use this mechanism to
improve error messages.

Refines how the `columns` column can be used with the text reader. If
headers are used, then `columns` is just another column. An error is
raised, however, if `columns[x]` is used when headers are enabled.

Added another builder abstraction where a constructor argument list
became too long.

Added the drill file system and split to the file schema negotiator
to simplify reader construction.

Added additional unit tests to fully define the `columns` column
behavior.

  1. … 34 more files in changeset.
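
The error context idea in DRILL-7181 above can be sketched as follows: a higher layer supplies a context object, and low-level code attaches it to any error it raises without knowing what the context contains. The types `ErrorContext`, `ReaderException`, and `LowLevelParser` are hypothetical and are not Drill's User Error classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical: supplied by a higher layer that knows the file, plugin, etc.
interface ErrorContext {
  void addContext(List<String> details);
}

// Hypothetical exception that folds the context into its message.
final class ReaderException extends RuntimeException {
  ReaderException(String message, ErrorContext context) {
    super(format(message, context));
  }

  private static String format(String message, ErrorContext context) {
    List<String> details = new ArrayList<>();
    context.addContext(details);
    return details.isEmpty()
        ? message
        : message + " [" + String.join(", ", details) + "]";
  }
}

// Low-level code: holds the context but never inspects it.
final class LowLevelParser {
  private final ErrorContext context;

  LowLevelParser(ErrorContext context) {
    this.context = context;
  }

  int parseInt(String text) {
    try {
      return Integer.parseInt(text.trim());
    } catch (NumberFormatException e) {
      throw new ReaderException("Not a number: " + text, context);
    }
  }
}
```

The parser stays ignorant of files and plugins, while the error message still carries whatever details the higher layer chose to supply.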
DRILL-7011: Support schema in scan framework

* Adds schema support to the row set-based scan framework and to the "V3" text reader based on that framework.
* Adding the schema made clear that passing options as a long list of constructor arguments was not sustainable. Refactored code to use a builder pattern instead.
* Added support for default values in the "null column loader", which required adding a "setValue" method to the column accessors.
* Added unit tests for all new or changed functionality. See TestCsvWithSchema for the overall test of the entire integrated mechanism.
* Added tests for explicit projection with schema
* Better handling of date/time in column accessors
* Converted recent column metadata work from Java 8 date/time to Joda.
* Added more CSV-with-schema unit tests
* Removed the ID fields from "resolved columns", used "instanceof" instead.
* Added wildcard projection with an output schema. Handles both "lenient" and "strict" schemas.
* Tagged projection columns with their output schema, when available.
* Scan projection added modes for wildcard with an output schema. The reader projection added support for merging reader and output schemas.
* Includes refactoring of scan operator tests (the test file grew too large.)
* Renamed some classes to avoid confusing reader schemas with output schemas.
* Added unit tests for the new functionality.
* Added "lenient" wildcard with schema test for CSV
* Added more type conversions: string-to-bit, many-to-string
* Fixed bug in column writer for VarDecimal
* Added missing unit tests, and fixed bugs, in Bit column reader/writer
* Cleaned up a number of unneeded "SuppressWarnings"

closes #1711

  1. … 224 more files in changeset.
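
One way to picture the "lenient" and "strict" wildcard modes from DRILL-7011 above is as two ways of expanding `SELECT *` against an output schema. The commit message does not define the modes, so the semantics below are an assumption: a strict schema yields only the schema's columns, while a lenient schema also admits reader columns the schema does not name. The class `WildcardExpansion` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Drill.
final class WildcardExpansion {
  // Schema columns come first, in schema order. When the schema is not
  // strict (assumed meaning of "lenient"), reader columns that the schema
  // does not mention are appended in reader order.
  static List<String> expand(List<String> schemaCols,
                             List<String> readerCols,
                             boolean strict) {
    List<String> out = new ArrayList<>(schemaCols);
    if (!strict) {
      for (String col : readerCols) {
        if (!containsIgnoreCase(out, col)) {
          out.add(col);
        }
      }
    }
    return out;
  }

  private static boolean containsIgnoreCase(List<String> cols, String name) {
    for (String c : cols) {
      if (c.equalsIgnoreCase(name)) {
        return true;
      }
    }
    return false;
  }
}
```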
DRILL-7106: Fix Intellij warning for FieldSchemaNegotiator

closes #1698

  1. … 4 more files in changeset.
DRILL-6952: Host compliant text reader on the row set framework

The result set loader allows controlling batch sizes. The new scan framework
built on top of the result set loader handles projection, implicit columns, null
columns and more. This commit converts the "new" ("compliant") text reader
to use the new framework. Options select the use of the V2 ("new") or V3
(row-set based) versions. Unit tests demonstrate V3 functionality.

closes #1683

    ./v3/BaseFieldOutput.java (-0, +165)
    ./v3/CompliantTextBatchReader.java (-0, +269)
    ./v3/FieldVarCharOutput.java (-0, +62)
    ./v3/HeaderBuilder.java (-0, +267)
    ./v3/RepeatedVarCharOutput.java (-0, +137)
    ./v3/StreamFinishedPseudoException.java (-0, +29)
    ./v3/TextInput.java (-0, +368)
    ./v3/TextOutput.java (-0, +88)
    ./v3/TextParsingContext.java (-0, +126)
    ./v3/TextParsingSettingsV3.java (-0, +305)
    ./v3/TextReader.java (-0, +508)
    ./v3/package-info.java (-0, +22)
  1. … 46 more files in changeset.
DRILL-6950: Row set-based scan framework

Adds the "plumbing" that connects the scan operator to the result set loader and the scan projection framework. See the various package-info.java files for the technical details. Also adds a large number of tests.

This PR does not yet introduce an actual scan operator: that will follow in subsequent PRs.

closes #1618

  1. … 60 more files in changeset.
DRILL-6759: Make columns array name for csv data case insensitive

closes #1485

  1. … 1 more file in changeset.
DRILL-6724: Dump operator context to logs when error occurs during query execution

closes #1455

  1. … 101 more files in changeset.
DRILL-6422: Replace guava imports with shaded ones

  1. … 980 more files in changeset.
DRILL-6656: Disallow extra semicolons and multiple statements on the same line.

closes #1415

  1. … 143 more files in changeset.
DRILL-6386: Remove unused imports and star imports.

  1. … 228 more files in changeset.
DRILL-6389: Fixed building javadocs

* Added documentation about how to build javadocs
* Fixed some of the javadoc warnings

closes #1276

  1. … 65 more files in changeset.
DRILL-6320: Fixed license headers.

closes #1207

  1. … 2060 more files in changeset.
DRILL-5730: Mock testing improvements and interface improvements

closes #1045

  1. … 223 more files in changeset.
DRILL-6118: Handle item star columns during project / filter push down and directory pruning

1. Added DrillFilterItemStarReWriterRule to re-write item star fields to regular field references.
2. Refactored DrillPushProjectIntoScanRule to handle item star fields, factored out helper classes and methods from PrelUtil.class.
3. Fixed issue with dynamic star usage (after Calcite upgrade old usage of star was still present, replaced WILDCARD -> DYNAMIC_STAR for clarity).
4. Added unit tests to check project / filter push down and directory pruning with item star.

  1. … 26 more files in changeset.
DRILL-6049: Misc. hygiene and code cleanup changes

close apache/drill#1085

  1. … 122 more files in changeset.
DRILL-6004: Direct buffer bounds checking should be disabled by default

This closes #1070

  1. … 7 more files in changeset.
DRILL-5590: Bugs in CSV field matching, null columns

Please see the problem and solution descriptions in DRILL-5590.

Also cleaned up some dead code left over from DRILL-5498.

close #855

  1. … 1 more file in changeset.
DRILL-5504: Add vector validator to diagnose offset vector issues

Validates offset vectors in VarChar and repeated vectors. Validates the
special case of repeated VarChar vectors (two layers of offsets.)

Provides two new session variables to turn on validation. One enables
the existing operator (iterator) validation, the other adds vector
validation. This allows validation to occur in a “production” Drill
(without restarting Drill with assertions, as previously required.)

Unit tests validate the validator. Another test validates the
integration, but requires manual steps, so is ignored by default.

This version is first-cut: all work is done within a single class,
which allows back-porting to an earlier version to solve a specific
issue. A revision should move some of the work into generated code (or
refactor vectors to allow outside access), since offset vectors appear
for each subclass, not on a base class that would allow generic
operations.

* Added boot-time options to allow enabling vector validation in Maven
  unit tests.
* Code cleanup per suggestions.
* Additional (manual) tests for boot-time options and default options.

closes #832

  1. … 9 more files in changeset.
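
The kind of check DRILL-5504 above describes can be sketched on a plain `int[]` standing in for an offset vector. The specific invariants tested here (first offset is zero, offsets never decrease, no offset points past the data) are the usual ones for offset-encoded variable-width data, but whether the commit checks exactly these is an assumption; `OffsetVectorValidator` is a hypothetical name.

```java
// Hypothetical validator operating on a plain array instead of a Drill vector.
final class OffsetVectorValidator {
  // Returns null when the offsets look valid, otherwise a description of
  // the first problem found. A vector of N values needs N + 1 offsets.
  static String validate(int[] offsets, int valueCount, int dataLength) {
    if (offsets.length < valueCount + 1) {
      return "offset vector has " + offsets.length + " entries, need " + (valueCount + 1);
    }
    if (offsets[0] != 0) {
      return "first offset is " + offsets[0] + ", expected 0";
    }
    for (int i = 1; i <= valueCount; i++) {
      if (offsets[i] < offsets[i - 1]) {
        return "offset " + i + " (" + offsets[i] + ") is less than offset "
            + (i - 1) + " (" + offsets[i - 1] + ")";
      }
      if (offsets[i] > dataLength) {
        return "offset " + i + " (" + offsets[i] + ") is past the data length " + dataLength;
      }
    }
    return null;
  }
}
```

For a repeated VarChar vector, with its two layers of offsets, the same check would presumably be applied once per layer, with the inner layer's value count taking the place of `dataLength` for the outer layer.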
DRILL-5498: Better handling of CSV column headers

See DRILL-5498 for details.

Replaced the repeated varchar reader for reading columns with a
purpose-built column parser. Implemented rules to recover from invalid
column headers.

Added missing test method

Changes re code review comments

Back out testing-only change

close apache/drill#830

    ./HeaderBuilder.java (-0, +274)
  1. … 3 more files in changeset.
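
DRILL-5498 above mentions rules to recover from invalid column headers but does not list them here. The sketch below shows what such recovery could look like under invented rules: blank headers get a positional name and duplicates get a numeric suffix. The class `HeaderCleaner`, the `column_N` naming, and the `_2`, `_3` suffixes are assumptions for illustration, not the rules in `HeaderBuilder.java`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper with invented recovery rules.
final class HeaderCleaner {
  static List<String> clean(List<String> raw) {
    List<String> out = new ArrayList<>();
    Set<String> seen = new HashSet<>();  // lower-cased names already used
    for (int i = 0; i < raw.size(); i++) {
      String name = raw.get(i) == null ? "" : raw.get(i).trim();
      if (name.isEmpty()) {
        name = "column_" + (i + 1);  // assumed rule: positional name for blanks
      }
      String candidate = name;
      int suffix = 2;
      // Assumed rule: duplicates (ignoring case) get _2, _3, ... appended.
      while (!seen.add(candidate.toLowerCase(Locale.ROOT))) {
        candidate = name + "_" + suffix++;
      }
      out.add(candidate);
    }
    return out;
  }
}
```

With these rules the header row `a`, blank, `A`, `a` becomes `a`, `column_2`, `A_2`, `a_3`.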
DRILL-5355: Misc. code cleanup

closes #784

  1. … 22 more files in changeset.
DRILL-5273: CompliantTextReader exhausts 4 GB memory when reading 5000 small files

Please see JIRA for details of problem and fix.

closes #750

DRILL-4919: Fix select count(1) / count(*) on csv with header

This closes #714

  1. … 1 more file in changeset.
DRILL-3178: csv reader should allow newlines inside quotes

This closes #593

  1. … 2 more files in changeset.
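
The requirement in DRILL-3178 above, that a newline inside quotes does not end the record, can be sketched with a splitter that tracks whether it is inside a quoted section. This is a generic illustration rather than Drill's reader; `QuotedRecordSplitter` is a hypothetical name, and the record delimiter is assumed to be a single `\n`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Drill.
final class QuotedRecordSplitter {
  // Splits text into records on '\n', except where the newline falls
  // between an opening and a closing double quote.
  static List<String> splitRecords(String text) {
    List<String> records = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c == '"') {
        // A doubled quote toggles twice, leaving the state unchanged.
        inQuotes = !inQuotes;
      }
      if (c == '\n' && !inQuotes) {
        records.add(current.toString());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    if (current.length() > 0) {
      records.add(current.toString());  // last record without trailing newline
    }
    return records;
  }
}
```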
DRILL-4746: Verification Failures (Decimal values) in drill's regression tests.

This closes #545

DRILL-3149: TextReader should support multibyte line delimiters

  1. … 2 more files in changeset.
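
Supporting a multibyte line delimiter, as in DRILL-3149 above, means matching a byte sequence rather than a single byte. A minimal sketch of that search follows; `LineDelimiterScanner` is a hypothetical name, and the naive scan is chosen for clarity rather than taken from Drill's reader.

```java
// Hypothetical helper, not part of Drill.
final class LineDelimiterScanner {
  // Returns the index of the first occurrence of the delimiter at or after
  // `from`, or -1 if the delimiter does not occur.
  static int indexOf(byte[] data, byte[] delimiter, int from) {
    if (delimiter.length == 0) {
      throw new IllegalArgumentException("empty delimiter");
    }
    outer:
    for (int i = Math.max(from, 0); i <= data.length - delimiter.length; i++) {
      for (int j = 0; j < delimiter.length; j++) {
        if (data[i + j] != delimiter[j]) {
          continue outer;  // mismatch: try the next start position
        }
      }
      return i;
    }
    return -1;
  }
}
```

A partial match, such as a lone `\r` when the delimiter is `\r\n`, is not treated as a line end.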
DRILL-4317: Exceptions on SELECT and CTAS with large CSV files

this closes #432