DRILL-7380: Query of a field inside of an array of structs returns null

1. Fixed parquet reader projection for Logical lists (DrillParquetReader.java)

2. Fixed projection pushdown for RexFieldAccess (ProjectFieldsVisitor.java)

3. DrillParquetReader.getProjection(...) split into several methods

4. Added javadocs for PathSegment and SchemaPath
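
For orientation (not part of the changeset): a minimal sketch of how a projected column such as arr[2].name decomposes into the SchemaPath / PathSegment chain that the reader's projection code walks. The class names are from the Drill codebase; the exact constructor signatures used here are an assumption.

    import org.apache.drill.common.expression.PathSegment;
    import org.apache.drill.common.expression.SchemaPath;

    public class PathSegmentSketch {
      public static void main(String[] args) {
        // arr[2].name -> NameSegment("arr") -> ArraySegment(2) -> NameSegment("name")
        PathSegment.ArraySegment index =
            new PathSegment.ArraySegment(2, new PathSegment.NameSegment("name"));
        PathSegment.NameSegment root = new PathSegment.NameSegment("arr", index);
        SchemaPath column = new SchemaPath(root);

        // Projection code walks the chain segment by segment.
        for (PathSegment seg = column.getRootSegment(); seg != null; seg = seg.getChild()) {
          System.out.println(seg.isNamed() ? "name segment" : "array segment");
        }
      }
    }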

  1. … 6 more files in changeset.
DRILL-4517: Support reading empty Parquet files

1. Modified the flat and complex Parquet readers to output only the schema when the requested number of records to read is 0. In this case the readers are not initialized, which improves performance.

2. Allowed reading the requested number of rows instead of all rows in the row group (DRILL-6528).

3. Fixed an issue with determining the number of nulls in a row group (fixed the IsPredicate#isAllNulls method).

4. Allowed reading empty Parquet files by adding an empty / fake row group.

5. General refactoring and unit tests.

6. Categorized the Parquet tests.

closes #1839

  1. … 48 more files in changeset.
DRILL-7315: Revise precision and scale order in the method arguments

  1. … 28 more files in changeset.
DRILL-7268: Read Hive array with parquet native reader

1. Fixed preserving of the group originalType for the projected schema in DrillParquetReader.

2. Added reading of the LIST logical type to DrillParquetGroupConverter. An intermediate no-op converter is used to skip writing for the next nested repeated field once the parent field is recognized as a LIST; for this, skipRepeated = 'true' is passed to the child converter's constructor (see the sketch below).

close apache/drill#1805
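
To illustrate the intermediate no-op converter idea above, here is a sketch only, not the code added by this changeset: a pass-through converter built on Parquet's converter API that writes nothing itself and forwards to the converter of the single repeated child, so the extra LIST wrapper level in the Parquet schema is skipped.

    import org.apache.parquet.io.api.Converter;
    import org.apache.parquet.io.api.GroupConverter;

    // Hypothetical stand-in for the intermediate no-op converter.
    class NoopListWrapperConverter extends GroupConverter {
      private final Converter child; // converter for the nested repeated field

      NoopListWrapperConverter(Converter child) {
        this.child = child;
      }

      @Override
      public Converter getConverter(int fieldIndex) {
        return child; // a LIST wrapper group has exactly one repeated child
      }

      @Override
      public void start() { } // no value is written for the wrapper itself

      @Override
      public void end() { }
    }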

    • -117
    • +232
    ./DrillParquetGroupConverter.java
  1. … 4 more files in changeset.
DRILL-7062: Initial implementation of run-time rowgroup pruning

closes #1738

  1. … 24 more files in changeset.
DRILL-7096: Develop vector for canonical Map<K,V>

- Added new type DICT;

- Created value vectors for the type for single and repeated modes;

- Implemented corresponding FieldReaders and FieldWriters;

- Made changes in EvaluationVisitor to be able to read values from the map by key;

- Made changes to DrillParquetGroupConverter to be able to read Parquet's MAP type;

- Added an option `store.parquet.reader.enable_map_support` to disable reading MAP type as DICT from Parquet files (see the example after this list);

- Updated AvroRecordReader to use new DICT type for Avro's MAP;

- Added support of the new type to ParquetRecordWriter.
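
A hedged usage example for the option mentioned above, not taken from the changeset: toggling `store.parquet.reader.enable_map_support` for the current session from a JDBC client. The connection URL is an assumption; adjust it to the deployment.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class MapSupportToggle {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement()) {
          // Turn off reading Parquet MAP columns as the new DICT type for this session.
          stmt.execute("ALTER SESSION SET `store.parquet.reader.enable_map_support` = false");
        }
      }
    }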

    • -52
    • +50
    ./DrillParquetGroupConverter.java
    • -0
    • +118
    ./DrillParquetMapGroupConverter.java
  1. … 106 more files in changeset.
DRILL-7011: Support schema in scan framework

* Adds schema support to the row set-based scan framework and to the "V3" text reader based on that framework.

* Adding the schema made clear that passing options as a long list of constructor arguments was not sustainable. Refactored the code to use a builder pattern instead (see the sketch after this list).

* Added support for default values in the "null column loader", which required adding a "setValue" method to the column accessors.

* Added unit tests for all new or changed functionality. See TestCsvWithSchema for the overall test of the entire integrated mechanism.

* Added tests for explicit projection with schema

* Better handling of date/time in column accessors

* Converted recent column metadata work from Java 8 date/time to Joda.

* Added more CSV-with-schema unit tests

* Removed the ID fields from "resolved columns", used "instanceof" instead.

* Added wildcard projection with an output schema. Handles both "lenient" and "strict" schemas.

* Tagged projection columns with their output schema, when available.

* Scan projection added modes for wildcard with an output schema. The reader projection added support for merging reader and output schemas.

* Includes refactoring of scan operator tests (the test file grew too large.)

* Renamed some classes to avoid confusing reader schemas with output schemas.

* Added unit tests for the new functionality.

* Added "lenient" wildcard with schema test for CSV

* Added more type conversions: string-to-bit, many-to-string

* Fixed bug in column writer for VarDecimal

* Added missing unit tests, and fixed bugs, in Bit column reader/writer

* Cleaned up a number of unneeded "SuppressWarnings"

closes #1711
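
To illustrate the builder-pattern refactoring mentioned above, a minimal, hypothetical sketch follows; the class and option names are invented for illustration and are not the actual scan-framework classes.

    // Hypothetical classes illustrating the refactoring; not the actual framework code.
    class ScanOptions {
      final boolean allowSchemaChange;
      final int batchSizeLimit;
      final String nullColumnDefault;

      private ScanOptions(Builder b) {
        this.allowSchemaChange = b.allowSchemaChange;
        this.batchSizeLimit = b.batchSizeLimit;
        this.nullColumnDefault = b.nullColumnDefault;
      }

      static class Builder {
        private boolean allowSchemaChange = true;
        private int batchSizeLimit = 4096;
        private String nullColumnDefault;

        Builder allowSchemaChange(boolean flag) { allowSchemaChange = flag; return this; }
        Builder batchSizeLimit(int limit) { batchSizeLimit = limit; return this; }
        Builder nullColumnDefault(String value) { nullColumnDefault = value; return this; }

        ScanOptions build() { return new ScanOptions(this); }
      }
    }

    // Usage: options are named and optional rather than positional constructor arguments.
    // ScanOptions options = new ScanOptions.Builder().batchSizeLimit(1024).nullColumnDefault("N/A").build();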

  1. … 224 more files in changeset.
DRILL-6852: Adapt current Parquet Metadata cache implementation to use Drill Metastore API

Co-authored-by: Volodymyr Vysotskyi <vvovyk@gmail.com>

Co-authored-by: Vitalii Diravka <vitalii@apache.org>

close apache/drill#1646

  1. … 66 more files in changeset.
DRILL-5603: Replace String file paths with Hadoop Path

- replaced all String path representations with org.apache.hadoop.fs.Path

- added the PathSerDe.Se JSON serializer (see the sketch below)

- refactored DFSPartitionLocation code by leveraging the existing listPartitionValues() functionality

closes #1657
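
A minimal sketch of what a PathSerDe.Se-style Jackson serializer looks like; this is an illustration only, and the exact string form the committed serializer writes may differ.

    import java.io.IOException;

    import com.fasterxml.jackson.core.JsonGenerator;
    import com.fasterxml.jackson.databind.JsonSerializer;
    import com.fasterxml.jackson.databind.SerializerProvider;
    import org.apache.hadoop.fs.Path;

    public class PathSerDeSketch {
      public static class Se extends JsonSerializer<Path> {
        @Override
        public void serialize(Path path, JsonGenerator gen, SerializerProvider provider)
            throws IOException {
          // Hadoop's Path is not a JSON-friendly bean, so write it as a plain string.
          gen.writeString(path.toUri().getPath());
        }
      }
    }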

  1. … 83 more files in changeset.
DRILL-6853: Make the complex parquet reader batch max row size configurable

  1. … 3 more files in changeset.
DRILL-6724: Dump operator context to logs when error occurs during query execution

closes #1455

  1. … 102 more files in changeset.
DRILL-6422: Replace guava imports with shaded ones

  1. … 983 more files in changeset.
DRILL-6670: Align Parquet TIMESTAMP_MICROS logical type handling with earlier versions + minor fixes

closes #1428

  1. … 15 more files in changeset.
DRILL-6386: Remove unused imports and star imports.

  1. … 231 more files in changeset.
DRILL-6320: Fixed license headers.

closes #1207

  1. … 2066 more files in changeset.
DRILL-6094: Decimal data type enhancements

Add ExprVisitors for VARDECIMAL

Modify writers/readers to support VARDECIMAL

- Added usage of VarDecimal for parquet, hive, maprdb, jdbc;

- Added options to store decimals as int32 and int64 or fixed_len_byte_array or binary (see the worked example after this list);

Add UDFs for VARDECIMAL data type

- modify type inference rules

- remove UDFs for obsolete DECIMAL types

Enable DECIMAL data type by default

Add unit tests for DECIMAL data type

Fix mapping for NLJ when literal with non-primitive type is used in join conditions

Refresh protobuf C++ source files

Changes in C++ files

Add support for decimal logical type in Avro.

Add support for date, time and timestamp logical types.

Update Avro version to 1.8.2.
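
Background for the decimal storage options listed above: the Parquet DECIMAL logical type stores the unscaled value (in an int32, an int64, a fixed_len_byte_array, or binary), with precision and scale carried in the schema metadata. A small worked example:

    import java.math.BigDecimal;
    import java.math.BigInteger;

    public class DecimalStorageExample {
      public static void main(String[] args) {
        BigDecimal value = new BigDecimal("123.45"); // precision 5, scale 2
        BigInteger unscaled = value.unscaledValue(); // 12345

        System.out.println("precision = " + value.precision());              // 5
        System.out.println("scale     = " + value.scale());                  // 2
        System.out.println("unscaled  = " + unscaled);                       // fits an int32 or int64
        System.out.println("bytes     = " + unscaled.toByteArray().length);  // or a (fixed-length) byte array
      }
    }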

    • -131
    • +55
    ./DrillParquetGroupConverter.java
  1. … 201 more files in changeset.
DRILL-6164: Heap memory leak during parquet scan and OOM

closes #1122

    • -6
    • +11
    ./DrillParquetRecordMaterializer.java
  1. … 14 more files in changeset.
DRILL-6118: Handle item star columns during project / filter push down and directory pruning

1. Added DrillFilterItemStarReWriterRule to re-write item star fields to regular field references.

2. Refactored DrillPushProjectIntoScanRule to handle item star fields, factored out helper classes and methods from the PrelUtil class.

3. Fixed an issue with dynamic star usage (after the Calcite upgrade the old star usage was still present; replaced WILDCARD with DYNAMIC_STAR for clarity).

4. Added unit tests to check project / filter push down and directory pruning with item star.

  1. … 26 more files in changeset.
DRILL-5832: Change OperatorFixture to use system option manager

- Rename FixtureBuilder to ClusterFixtureBuilder

- Provide alternative way to reset system/session options

- Fix for DRILL-5833: random failure in TestParquetWriter

- Provide strict, but clear, errors for missing options

closes #970

    • -11
    • +22
    ./DrillParquetGroupConverter.java
  1. … 49 more files in changeset.
DRILL-5971: Fix INT64, INT32 logical types in complex parquet reader

Added the following types: ENUM (Binary annotated as ENUM), INT96 (dictionary encoded)

Fixed an issue with the dictionary-encoded fixed-width reader

Added test file generator

This closes #1049

  1. … 12 more files in changeset.
DRILL-4264: Allow field names to include dots

  1. … 98 more files in changeset.
DRILL-5356: Refactor Parquet Record Reader

The Parquet reader is Drill's premier data source and has worked very well for many years. As with any piece of code, it has grown in complexity over that time and has become hard to understand and maintain.

In work on another project, we found that Parquet is accidentally creating "low density" batches: record batches with little actual data compared to the amount of memory allocated. We'd like to fix that.

However, the current complexity of the reader code creates a barrier to making improvements: the code is so complex that it is often better to leave bugs unfixed, or risk spending large amounts of time struggling to make small changes.

This commit offers to help revitalize the Parquet reader. Functionality is identical to the code in master, but the code has been pulled apart into various classes, each of which focuses on one part of the task: building up a schema, keeping track of read state, a strategy for reading various combinations of records, and so on. The idea is that it is easier to understand several small, focused classes than one huge, complex class. Indeed, the idea of small, focused classes is common in the industry; it is nothing new.

Unit tests pass with the change. Since no logic has changed and we only moved lines of code, that is a good indication that everything still works.

Also includes fixes based on review comments.

closes #789

  1. … 16 more files in changeset.
DRILL-5034: Select timestamp from hive generated parquet always return in UTC

- The TIMESTAMP_IMPALA function is reverted to retain the local timezone

- TIMESTAMP_IMPALA_LOCALTIMEZONE is deleted

- Retain the local timezone for INT96 timestamp values in Parquet files while the PARQUET_READER_INT96_AS_TIMESTAMP option is on

Minor changes according to the review

Fix for the test, which relies on a particular timezone

close #656

  1. … 6 more files in changeset.
Revert "DRILL-4373: Drill and Hive have incompatible timestamp representations in parquet - added sys/sess option "store.parquet.int96_as_timestamp"; - added int96 to timestamp converter for both readers; - added unit tests;"

This reverts commit 7e7214b40784668d1599f265067f789aedb6cf86.

  1. … 14 more files in changeset.
DRILL-4203: Parquet file: dates are stored incorrectly

- Added a new extra field in the Parquet meta info: "is.date.correct = true";

- Removed the unnecessary double conversion of the value with the Julian day;

- Added the ability to correct corrupted dates for Parquet files with the second-version old metadata cache file as well.

This closes #595

  1. … 21 more files in changeset.
DRILL-4373: Drill and Hive have incompatible timestamp representations in parquet

- added sys/sess option "store.parquet.int96_as_timestamp";

- added an INT96-to-timestamp converter for both readers;

- added unit tests;

This closes #600
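
For reference, a sketch of the conventional Impala/Hive INT96 timestamp layout such a converter has to decode: 8 bytes of nanoseconds-within-day followed by a 4-byte Julian day, both little-endian, where 2440588 is the Julian day number of the Unix epoch. The committed converter may differ in details (for example, the timezone handling later revisited in DRILL-5034).

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class Int96ToTimestamp {
      private static final long JULIAN_DAY_OF_EPOCH = 2440588L;
      private static final long MILLIS_PER_DAY = 86_400_000L;

      public static long toEpochMillis(byte[] int96) {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();              // first 8 bytes
        long julianDay = buf.getInt() & 0xFFFFFFFFL;  // last 4 bytes
        return (julianDay - JULIAN_DAY_OF_EPOCH) * MILLIS_PER_DAY + nanosOfDay / 1_000_000L;
      }
    }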

  1. … 15 more files in changeset.
DRILL-4184: Support variable length decimal fields in parquet

  1. … 68 more files in changeset.
DRILL-4382: Remove dependency on drill-logical from vector package

  1. … 80 more files in changeset.
DRILL-4203: Fix date values written in parquet files created by Drill

Drill was writing non-standard dates into Parquet files for all releases before 1.9.0. The values have been read correctly by Drill, but external tools like Spark reading the files will see corrupted values for all dates that have been written by Drill.

This change corrects the behavior of the Drill Parquet writer to correctly store dates in the format given in the Parquet specification.

To maintain compatibility with old files, the Parquet reader code has been updated to check for the old format and automatically shift the corrupted values into corrected ones.

The test cases included here should ensure that all files produced by historical versions of Drill will continue to return the same values they had in previous releases. For compatibility with external tools, any old files with corrupted dates can be re-written using the CREATE TABLE AS command (as the writer will now only produce specification-compliant values, even when reading from older corrupt files).

While the old behavior was a consistent shift into a range unlikely to be used in a modern database (over 10,000 years in the future), these are still valid date values. In the case where these may have been written into files intentionally, and we cannot be certain from the metadata that Drill produced the files, an option is included to turn off the auto-correction. Use of this option is assumed to be extremely unlikely, but it is included for completeness.

This patch was originally written against version 1.5.0; when rebasing, the corruption threshold was updated to 1.9.0.

Added regenerated binary files and updated metadata cache files accordingly.

One small fix in the ParquetGroupScan to accommodate changes in master that changed when metadata is read.

Tests for bugs revealed by the regression suite.

Fix drill version number in metadata file generation
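
A worked sketch of the auto-correction described above. The widely documented shift for this bug is twice the Julian day number of the Unix epoch (2 x 2440588 days), which is what pushes the written values thousands of years into the future; the exact constant and detection threshold below are assumptions for illustration, and the shipped reader also consults file metadata before correcting.

    public class CorruptDateCorrection {
      private static final int CORRUPT_DATE_SHIFT = 2 * 2440588;          // days (assumed constant)
      private static final int CORRUPTION_THRESHOLD = CORRUPT_DATE_SHIFT; // anything this large is suspect

      public static int correct(int daysSinceEpoch, boolean autoCorrect) {
        if (autoCorrect && daysSinceEpoch >= CORRUPTION_THRESHOLD) {
          return daysSinceEpoch - CORRUPT_DATE_SHIFT; // undo the writer's shift
        }
        return daysSinceEpoch; // already valid, or correction disabled by the option
      }

      public static void main(String[] args) {
        int written = 17000 + CORRUPT_DATE_SHIFT; // a days-since-epoch value as written by an affected Drill
        System.out.println(correct(written, true)); // prints 17000, the intended date value
      }
    }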

  1. … 84 more files in changeset.
DRILL-3987: (REFACTOR) Common and Vector modules building.

- Extract Accountor interface from Implementation

- Separate FMPP modules to separate out Vector Needs versus external needs

- Separate out Vector classes from those that are VectorAccessible.

- Cleanup Memory Exception hierarchy

  1. … 105 more files in changeset.