DRILL-7442: Create multi-batch row set reader

Adds a ResultSetReader that works across multiple batches

in a result set. Reuses the same row set and readers if

schema is unchanged, creates a new set if the schema changes.

Adds a unit test for the result set reader.

Adds a "rebind" capability to the row set readers to focus

on new buffers under an existing set of vectors. Used when

a new batch arrives, if the schema is unchanged.

Extends row set classes to be aware of the BatchAccessor class

which encapsulates a container and optional selection vector,

and tracks schema changes.

Moves row set tests into the same package as the row sets.

(Row set classes were moved a while back, but the tests were

not moved.)

Renames some BatchAccessor methods.

closes #1897
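
The reuse-or-rebuild decision described in this commit message might be sketched as follows. This is a minimal, self-contained illustration and not Drill's code: `BatchSketch`, `ReaderSketch` and `MultiBatchReaderSketch` are hypothetical names, and a plain schema version counter is assumed as a stand-in for whatever schema change tracking the BatchAccessor performs.

```java
import java.util.List;

/** Hypothetical stand-in for an incoming batch: a schema version plus one column of data. */
final class BatchSketch {
    final int schemaVersion;
    final List<String> values;

    BatchSketch(int schemaVersion, List<String> values) {
        this.schemaVersion = schemaVersion;
        this.values = values;
    }
}

/** Hypothetical reader that can be pointed at new data without being rebuilt. */
final class ReaderSketch {
    private List<String> values;

    ReaderSketch(List<String> values) {
        this.values = values;
    }

    /** "Rebind": keep the reader object, focus it on the new batch's data. */
    void rebind(List<String> newValues) {
        this.values = newValues;
    }

    String get(int row) {
        return values.get(row);
    }
}

/** Reuses one reader across batches while the (assumed) schema version is unchanged. */
final class MultiBatchReaderSketch {
    private ReaderSketch reader;
    private int lastSchemaVersion = -1;
    private int readersBuilt = 0;

    ReaderSketch start(BatchSketch batch) {
        if (reader == null || batch.schemaVersion != lastSchemaVersion) {
            // Schema changed (or first batch): build a fresh reader.
            reader = new ReaderSketch(batch.values);
            readersBuilt++;
        } else {
            // Same schema: keep the existing reader and rebind it.
            reader.rebind(batch.values);
        }
        lastSchemaVersion = batch.schemaVersion;
        return reader;
    }

    int readersBuilt() {
        return readersBuilt;
    }
}
```

In this sketch, two batches with the same version share one reader instance, while a batch with a new version causes a second reader to be built.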

  1. … 57 more files in changeset.
DRILL-7350: Move RowSet related classes from test folder

    • -0
    • +74
    ./AbstractRowSet.java
    • -0
    • +69
    ./AbstractSingleRowSet.java
    • -0
    • +149
    ./DirectRowSet.java
    • -0
    • +180
    ./HyperRowSetImpl.java
    • -0
    • +145
    ./IndirectRowSet.java
    • -0
    • +165
    ./RowSetBuilder.java
    • -0
    • +123
    ./RowSetFormatter.java
    • -0
    • +84
    ./RowSetReaderImpl.java
  1. … 278 more files in changeset.
DRILL-7314: Use TupleMetadata instead of concrete implementation

1. Add ser / de implementation for TupleMetadata interface based on types.

2. Replace TupleSchema usage where possible.

3. Move patcher classes into commons.

4. Upgrade some dependencies and general refactoring.

    • -1
    • +1
    ./model/hyper/HyperSchemaInference.java
  1. … 38 more files in changeset.
DRILL-7310: Move schema-related classes from exec module to be able to use them in metastore module

closes #1816

  1. … 102 more files in changeset.
DRILL-7258: Remove field width limit for text reader

The V2 text reader enforced a limit of 64K characters when using

column headers, but not when using the columns[] array. The V3 reader

enforced the 64K limit in both cases.

This patch removes the limit in both cases. The limit now is the

16MB vector size limit. With headers, no one column can exceed 16MB.

With the columns[] array, no one row can exceed 16MB. (The 16MB

limit is set by the Netty memory allocator.)

Added an "appendBytes()" method to the scalar column writer which adds

additional bytes to those already written for a specific column or

array element value. The method is implemented for VarChar, Var16Char

and VarBinary vectors. It throws an exception for all other types.

When used with a type conversion shim, the appendBytes() method throws

an exception. This should be OK because the previous setBytes() should

have failed because a huge value is not acceptable for numeric or date

type conversions.

Added unit tests of the append feature, and for the append feature in

the batch overflow case (when appending bytes causes the vector or

batch to overflow.) Also added tests to verify the lack of column width

limit with the text reader, both with and without headers.

closes #1802
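
The set-then-append behavior described above can be illustrated with a small sketch. The class names here (`ScalarWriterSketch`, `VarCharWriterSketch`, `IntWriterSketch`) and the method signatures are assumptions made for illustration; the real column writers operate on Drill vectors, not on a byte stream, and may throw a different exception type.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical scalar writer: appending is unsupported unless a subclass overrides it. */
abstract class ScalarWriterSketch {
    abstract void setBytes(byte[] value, int len);

    void appendBytes(byte[] value, int len) {
        throw new UnsupportedOperationException("appendBytes: not a variable-width type");
    }
}

/** Variable-width writer: setBytes starts a value, appendBytes extends it. */
final class VarCharWriterSketch extends ScalarWriterSketch {
    private final ByteArrayOutputStream current = new ByteArrayOutputStream();

    @Override
    void setBytes(byte[] value, int len) {
        current.reset();               // start a new value
        current.write(value, 0, len);
    }

    @Override
    void appendBytes(byte[] value, int len) {
        current.write(value, 0, len);  // add to the bytes already written
    }

    String value() {
        return new String(current.toByteArray(), StandardCharsets.UTF_8);
    }
}

/** Fixed-width writer: inherits the appendBytes() that throws. */
final class IntWriterSketch extends ScalarWriterSketch {
    int value;

    @Override
    void setBytes(byte[] value, int len) {
        this.value = Integer.parseInt(new String(value, 0, len, StandardCharsets.UTF_8));
    }
}
```

A reader that receives a long field in chunks would call `setBytes()` for the first chunk and `appendBytes()` for each later chunk of the same value.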

  1. … 24 more files in changeset.
DRILL-7278: Refactor result set loader projection mechanism

Drill 1.16 added an enhanced scan framework based on the row set

mechanisms, and a "provisioned schema" feature built on top

of that framework. Conversion of the log reader plugin to use

the framework identified additional features we wish to add,

such as marking a column as "special" (not expanded in a wildcard

query.)

This work identified that the code added for provisioned schemas in

Drill 1.16 worked, but is a bit overly complex, making it hard to add

the desired new feature.

This patch refactors the "reader" projection code:

* Create a "projection set" mechanism that the reader can query to ask,

"the caller just added a column. Should it be projected or not?"

* Unifies the type conversion mechanism added as part of provisioned

schemas.

* Added the "special column" property for both "reader" and "provided"

schemas.

* Verified that provisioned schemas work with maps (at least on the scan

framework side.)

* Replaced the previous "schema transformer" mechanism with a new "type

conversion" mechanism that unifies type conversion, provided schemas

and an optional custom type conversion mechanism.

* Column writers can report if they are projected. Moved this query

from metadata to the column writer itself.

* Extended and clarified documentation of the feature.

* Revised and/or added unit tests.

closes #1797
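
The "should this column be projected?" query might look roughly like the sketch below. It is not the contents of ProjectionSet.java: the interface shape, the class names, and the rule that an explicitly named special column is still projected are all assumptions chosen for illustration. Only the idea that a special column is left out of wildcard expansion comes from the commit message.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical query interface: the reader asks about each column it adds. */
interface ProjectionSetSketch {
    boolean isProjected(String columnName, boolean special);
}

/** Wildcard (SELECT *) projection: everything except "special" columns. */
final class WildcardProjectionSketch implements ProjectionSetSketch {
    @Override
    public boolean isProjected(String columnName, boolean special) {
        return !special;
    }
}

/** Explicit (SELECT a, b) projection: only the named columns. */
final class ExplicitProjectionSketch implements ProjectionSetSketch {
    private final Set<String> names;

    ExplicitProjectionSketch(String... columns) {
        this.names = new HashSet<>(Arrays.asList(columns));
    }

    @Override
    public boolean isProjected(String columnName, boolean special) {
        // Assumption: naming a special column explicitly still projects it.
        return names.contains(columnName);
    }
}
```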

    • -0
    • +105
    ./ProjectionSet.java
    • -77
    • +0
    ./impl/DefaultSchemaTransformer.java
    • -194
    • +0
    ./impl/SchemaTransformerImpl.java
  1. … 58 more files in changeset.
DRILL-7181: Improve V3 text reader (row set) error messages

Adds an error context to the User Error mechanism. The context allows

information to be passed through an intermediate layer and applied when

errors are raised in lower-level code, without the need for that

low-level code to know the details of the error context information.

Modifies the scan framework and V3 text plugin to use the framework to

improve error messages.

Refines how the `columns` column can be used with the text reader. If

headers are used, then `columns` is just another column. An error is

raised, however, if `columns[x]` is used when headers are enabled.

Added another builder abstraction where a constructor argument list

became too long.

Added the drill file system and split to the file schema negotiator

to simplify reader construction.

Added additional unit tests to fully define the `columns` column

behavior.
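
The error-context idea described at the top of this commit message can be sketched as follows. All names are hypothetical (`ErrorContextSketch`, `FileErrorContextSketch`, `LowLevelParserSketch`), and a plain `IllegalStateException` is used in place of Drill's User Error mechanism. The point is only that the low-level code applies a context it was handed without knowing what the context contains.

```java
/** Hypothetical context: knows how to decorate an error message. */
interface ErrorContextSketch {
    void addContext(StringBuilder message);
}

/** Context created by an upper layer that knows which file is being read. */
final class FileErrorContextSketch implements ErrorContextSketch {
    private final String fileName;

    FileErrorContextSketch(String fileName) {
        this.fileName = fileName;
    }

    @Override
    public void addContext(StringBuilder message) {
        message.append(" [file: ").append(fileName).append("]");
    }
}

/** Low-level code: raises errors and applies whatever context it was given. */
final class LowLevelParserSketch {
    private final ErrorContextSketch context;

    LowLevelParserSketch(ErrorContextSketch context) {
        this.context = context;
    }

    int parseInt(String text) {
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            StringBuilder msg = new StringBuilder("Invalid integer: ").append(text);
            context.addContext(msg);   // details come from the context, not from here
            throw new IllegalStateException(msg.toString(), e);
        }
    }
}
```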

  1. … 30 more files in changeset.
DRILL-7143: Support default value for empty columns

Modifies the prior work to add default values for columns. The prior work added defaults

when the entire column is missing from a reader (the old Nullable Int column). The Row

Set mechanism now will also "fill empty" slots with the default value.

Added default support for the column writers. The writers automatically obtain the

default value from the column schema. The default can also be set explicitly on

the column writer.

Updated the null column mechanism to use this feature rather than the ad-hoc

implementation in the prior commit.

Semantics changed a bit. Only Required columns take a default. The default value

is ignored for nullable columns since nullable columns already have a file default: NULL.

Other changes:

* Updated the CSV-with-schema tests to illustrate the new behavior.

* Made multiple fixes for Boolean and Decimal columns and added unit tests.

* Upgraded FreeMarker to version 2.3.28 to allow use of the continue statement.

* Reimplemented the Bit column reader and writer to use the BitVector directly since this vector is rather special.

* Added get/set Boolean methods for column accessors

* Moved the BooleanType class to the common package

* Added more CSV unit tests to explore decimal types, booleans, and defaults

* Add special handling for blank fields in from-string conversions

* Added options to the conversion factory to specify blank-handling behavior.

CSV uses a mapping of blanks to null (nullable) or default value (non-nullable)

closes #1726
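
The "fill empty" behavior for Required columns might be pictured as in the sketch below. `RequiredIntWriterSketch` is a hypothetical class backed by a Java list, not a Drill vector, and it assumes forward-only writes. It shows only the idea that skipped slots receive the column's default value.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical writer for a Required (non-nullable) INT column with a default value. */
final class RequiredIntWriterSketch {
    private final List<Integer> vector = new ArrayList<>();
    private final int defaultValue;

    RequiredIntWriterSketch(int defaultValue) {
        this.defaultValue = defaultValue;
    }

    /** Writes a value at rowIndex, first filling any skipped rows with the default. */
    void setInt(int rowIndex, int value) {
        if (rowIndex < vector.size()) {
            throw new IllegalArgumentException("Rows must be written in increasing order");
        }
        fillEmpties(rowIndex);
        vector.add(value);
    }

    /** At the end of the batch, trailing unwritten rows also get the default. */
    void endBatch(int rowCount) {
        fillEmpties(rowCount);
    }

    private void fillEmpties(int upTo) {
        while (vector.size() < upTo) {
            vector.add(defaultValue);
        }
    }

    List<Integer> values() {
        return vector;
    }
}
```

With a default of -1, writing rows 0 and 3 of a five-row batch leaves rows 1, 2 and 4 holding -1.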

    • -49
    • +41
    ./impl/SchemaTransformerImpl.java
  1. … 69 more files in changeset.
DRILL-7011: Support schema in scan framework

* Adds schema support to the row set-based scan framework and to the "V3" text reader based on that framework.

* Adding the schema made clear that passing options as a long list of constructor arguments was not sustainable. Refactored code to use a builder pattern instead.

* Added support for default values in the "null column loader", which required adding a "setValue" method to the column accessors.

* Added unit tests for all new or changed functionality. See TestCsvWithSchema for the overall test of the entire integrated mechanism.

* Added tests for explicit projection with schema

* Better handling of date/time in column accessors

* Converted recent column metadata work from Java 8 date/time to Joda.

* Added more CSV-with-schema unit tests

* Removed the ID fields from "resolved columns", used "instanceof" instead.

* Added wildcard projection with an output schema. Handles both "lenient" and "strict" schemas.

* Tagged projection columns with their output schema, when available.

* Scan projection added modes for wildcard with an output schema. The reader projection added support for merging reader and output schemas.

* Includes refactoring of scan operator tests (the test file grew too large.)

* Renamed some classes to avoid confusing reader schemas with output schemas.

* Added unit tests for the new functionality.

* Added "lenient" wildcard with schema test for CSV

* Added more type conversions: string-to-bit, many-to-string

* Fixed bug in column writer for VarDecimal

* Added missing unit tests, and fixed bugs, in Bit column reader/writer

* Cleaned up a number of unneeded "SuppressWarnings"

closes #1711

  1. … 219 more files in changeset.
DRILL-7086: Output schema for row set mechanism

Enhances the row set mechanism to take an "output schema" that describes the vectors to

create. The "input schema" describes the type that the reader would like to write. A

conversion mechanism inserts a conversion shim to convert from the input to output type.

Provides a set of implicit type conversions, including string-to-date/time conversions

which use the new format property stored in column metadata. Includes unit tests for

the new functionality.

closes #1690

    • -0
    • +73
    ./impl/DefaultSchemaTransformer.java
    • -0
    • +44
    ./impl/SchemaTransformer.java
    • -0
    • +197
    ./impl/SchemaTransformerImpl.java
    • -2
    • +10
    ./model/single/BaseWriterBuilder.java
  1. … 55 more files in changeset.
DRILL-7074: Scan framework fixes and enhancements

Roll-up of fixes and enhancements that emerged from the effort to host the CSV reader on the new framework.

closes #1676

  1. … 40 more files in changeset.
DRILL-7024: Refactor ColumnWriter to simplify type-conversion shim

DRILL-7006 added a type conversion "shim" within the row set framework. Basically, we insert a "shim" column writer that takes data in one form (String, say), and does reader-specific conversions to a target format (INT, say).

The code works fine, but the shim class ends up needing to override a bunch of methods which it then passes along to the base writer. This PR refactors the code so that the conversion shim is simpler.

closes #1633
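
The String-to-INT case mentioned above can be reduced to a very small sketch of the shim pattern. The names `IntColumnWriterSketch` and `StringToIntShimSketch` are hypothetical, and the trimming of whitespace is an assumption. The actual ColumnWriter interfaces are considerably wider, which is the problem this refactoring addresses.

```java
/** Hypothetical base writer for an INT column. */
interface IntColumnWriterSketch {
    void setInt(int value);
}

/** Shim: accepts data in one form (String) and forwards it in the target form (INT). */
final class StringToIntShimSketch {
    private final IntColumnWriterSketch base;

    StringToIntShimSketch(IntColumnWriterSketch base) {
        this.base = base;
    }

    void setString(String value) {
        // Reader-specific conversion; here simply trim and parse.
        base.setInt(Integer.parseInt(value.trim()));
    }
}
```

A reader that produces text would call `setString()` on the shim, and the underlying INT writer never sees the String form.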

  1. … 59 more files in changeset.
DRILL-6950: Row set-based scan framework

Adds the "plumbing" that connects the scan operator to the result set loader and the scan projection framework. See the various package-info.java files for the technical details. Also adds a large number of tests.

This PR does not yet introduce an actual scan operator: that will follow in subsequent PRs.

closes #1618

  1. … 60 more files in changeset.
DRILL-6903: SchemaBuilder code improvements

1. ColumnBuilder: setPrecisionAndScale method

2. SchemaContainer: addColumn method parameter AbstractColumnMetadata was changed to ColumnMetadata

3. MapBuilder / RepeatedListBuilder / UnionBuilder: added constructors without parent, made buildColumn method public

4. TupleMetadata: added toMetadataList method

5. Other refactoring

    • -3
    • +1
    ./model/single/SingleSchemaInference.java
  1. … 22 more files in changeset.
DRILL-6809: Handle repeated map in schema inference

It turns out that the RowSet utilities build a repeated map without including the hidden $offsets$ vector in the metadata for the map. But, other parts in Drill do include this vector.

The RowSet behavior might be a bug which can be addressed in another PR.

This PR:

* Adds unit tests for map accessors at the row set level. Looks like these were never added originally. They are a simplified form of the ResultSetLoader map tests.

* Verified that the schema inference can infer a schema from a repeated map (using the RowSet style.)

* Added a test to reproduce the case from the bug.

* Made a tweak to the RowSetBuilder to allow access to the RowSetWriter which is needed by the new tests.

* Couple of minor clean-ups.

closes #1513

    • -1
    • +1
    ./model/single/SingleSchemaInference.java
  1. … 5 more files in changeset.
DRILL-6791: Scan projection framework

The "schema projection" mechanism:

* Handles none (SELECT COUNT(*)), some (SELECT a, b, x) and all (SELECT *) projection.

* Handles null columns (for projecting a column "x" that does not exist in the base table.)

* Handles constant columns as used for file metadata (AKA "implicit" columns).

* Handle schema persistence: the need to reuse the same vectors across different scanners

* Provides a framework for consuming externally-supplied metadata

* Since we don't yet have a way to provide "real" metadata, obtains metadata hints from

previous batches and from the projection list (a.b implies that "a" is a map, c[0]

implies that "c" is an array, etc.)

* Handles merging the set of data source columns and null columns to create the final output batch.

* Running tests found a failure due to an uninitialized "bits" vector. Added code to explicitly fill

the bits vectors with zeros in the "result set loader."
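
The metadata hints taken from the projection list (a.b implies that "a" is a map, c[0] implies that "c" is an array) might be derived along the lines of the sketch below. `ProjectionHintSketch` is a hypothetical helper written for illustration. It looks only at the first qualifier after the root column name and does not reflect how the scan projection framework actually parses projection items.

```java
/** Hypothetical helper that infers a type hint for the root column of a projection item. */
final class ProjectionHintSketch {
    enum Kind { MAP, ARRAY, UNKNOWN }

    /** Returns the root column name: the text before the first '.' or '['. */
    static String rootName(String item) {
        for (int i = 0; i < item.length(); i++) {
            char c = item.charAt(i);
            if (c == '.' || c == '[') {
                return item.substring(0, i);
            }
        }
        return item;
    }

    /** "a.b" hints that a is a map, "c[0]" hints that c is an array. */
    static Kind infer(String item) {
        for (int i = 0; i < item.length(); i++) {
            char c = item.charAt(i);
            if (c == '.') {
                return Kind.MAP;
            }
            if (c == '[') {
                return Kind.ARRAY;
            }
        }
        return Kind.UNKNOWN;   // a bare name such as "x" carries no hint
    }
}
```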

  1. … 30 more files in changeset.
DRILL-6676: Add Union, List and Repeated List types to Result Set Loader

Adds required functionality to the list and repeated list vectors.

Row set accessor changes

Adds a "variant" type to model both unions and (non-repeated) lists (which can act as a repeated union, among other things.)

Adds union, list and repeated list support to the result set loader and associated classes.

Copied much of the general documentation from my private Wiki into mark-down files.

closes #1429

    • -0
    • +415
    ./impl/ListState.java
    • -0
    • +239
    ./impl/RepeatedListState.java
    • -0
    • +202
    ./impl/UnionState.java
    • -0
    • +40
    ./model/hyper/BaseReaderBuilder.java
    • -10
    • +120
    ./model/single/BaseReaderBuilder.java
    • -11
    • +106
    ./model/single/BaseWriterBuilder.java
    • -4
    • +84
    ./model/single/BuildVectorsFromMetadata.java
    • -5
    • +55
    ./model/single/SingleSchemaInference.java
  1. … 53 more files in changeset.
DRILL-6656: Disallow extra semicolons and multiple statements on the same line.

closes #1415

  1. … 143 more files in changeset.
DRILL-6386: Remove unused imports and star imports.

  1. … 228 more files in changeset.
DRILL-6389: Fixed building javadocs - Added documentation about how to build javadocs - Fixed some of the javadoc warnings

closes #1276

  1. … 64 more files in changeset.
DRILL-6373:

- Adds code to return the proper vector type given the actual vector, adjusting metadata as needed.

- Refactor result set loader

- Revised projection & vector cache

closes #1244

    • -0
    • +99
    ./impl/BuildFromSchema.java
    • -0
    • +250
    ./impl/ColumnBuilder.java
    • -0
    • +175
    ./impl/ContainerState.java
    • -0
    • +111
    ./impl/LoaderInternals.java
    • -0
    • +16
    ./impl/NullResultVectorCacheImpl.java
    • -112
    • +0
    ./impl/PrimitiveColumnState.java
  1. … 25 more files in changeset.
DRILL-6335: Column accessor refactoring

closes #1218

  1. … 36 more files in changeset.
DRILL-6320: Fixed license headers.

closes #1207

  1. … 2064 more files in changeset.
DRILL-6230: Extend row set readers to handle hyper vectors

closes #1161

    • -0
    • +45
    ./model/AbstractReaderBuilder.java
    • -51
    • +91
    ./model/hyper/BaseReaderBuilder.java
    • -0
    • +72
    ./model/hyper/HyperSchemaInference.java
    • -34
    • +47
    ./model/single/BaseReaderBuilder.java
    • -0
    • +40
    ./model/single/DirectRowIndex.java
    • -0
    • +65
    ./model/single/SingleSchemaInference.java
  1. … 57 more files in changeset.
DRILL-6138: Move RecordBatchSizer to org.apache.drill.exec.record package

This closes #1115

  1. … 10 more files in changeset.
DRILL-6114: Metadata revisions

Support for union vectors, list vectors, repeated list vectors. Refactored metadata classes.

closes #1112

  1. … 59 more files in changeset.
DRILL-5657: Size-aware vector writer structure

- Vector and accessor layer

- Row Set layer

- Tuple and column models

- Revised write-time metadata

- "Result set loader" layer

this closes #914

    • -0
    • +204
    ./ResultSetLoader.java
    • -0
    • +33
    ./ResultVectorCache.java
    • -0
    • +153
    ./RowSetLoader.java
    • -0
    • +358
    ./impl/ColumnState.java
    • -0
    • +41
    ./impl/NullProjectionSet.java
    • -0
    • +41
    ./impl/NullResultVectorCacheImpl.java
    • -0
    • +52
    ./impl/NullVectorState.java
    • -0
    • +108
    ./impl/NullableVectorState.java
    • -0
    • +134
    ./impl/OptionBuilder.java
    • -0
    • +105
    ./impl/PrimitiveColumnState.java
    • -0
    • +48
    ./impl/ProjectionSet.java
    • -0
    • +136
    ./impl/ProjectionSetImpl.java
    • -0
    • +168
    ./impl/RepeatedVectorState.java
    • -0
    • +775
    ./impl/ResultSetLoaderImpl.java
    • -0
    • +186
    ./impl/ResultVectorCacheImpl.java
  1. … 173 more files in changeset.