1_9_0_partitioned_no_corruption

DRILL-4203: Parquet File. Date is stored wrongly

- Added a new extra field in the parquet meta info: "is.date.correct = true";
- Removed an unnecessary double conversion of the value with the Julian day;
- Added the ability to correct corrupted dates for parquet files that have an old (second version) metadata cache file as well.
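As a rough illustration of how a reader might consult the new "is.date.correct" field, here is a minimal sketch. Only the key name and its "true" value come from the commit message; the class name, the method name, and the representation of the file's key/value metadata as a plain `Map<String, String>` are assumptions made for this example and are not taken from the Drill code base.

```java
import java.util.Map;

// Hypothetical helper, not Drill's actual implementation.
final class DateFlagSketch {
    // Key named in the commit message.
    static final String IS_DATE_CORRECT_KEY = "is.date.correct";

    // Returns true only when the metadata explicitly marks dates as correct.
    // A missing key means the file may predate the fix, so the caller would
    // still have to inspect the values themselves.
    static boolean datesKnownCorrect(Map<String, String> keyValueMetadata) {
        return "true".equalsIgnoreCase(keyValueMetadata.get(IS_DATE_CORRECT_KEY));
    }
}
```

A file without the flag is not necessarily corrupt; the absence only means the metadata alone cannot settle the question.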

This closes #595

DRILL-4203: Fix date values written in parquet files created by Drill

Drill was writing non-standard dates into parquet files for all releases
before 1.9.0. Drill itself read these values back correctly, but
external tools like Spark reading the files will see corrupted values for
all dates that have been written by Drill.

This change fixes the Drill parquet writer so that it stores dates in
the format given in the parquet specification.

To maintain compatibility with old files, the parquet reader code has
been updated to check for the old format and automatically shift the
corrupted values into corrected ones.

The test cases included here should ensure that all files produced by
historical versions of Drill continue to return the same values they
had in previous releases. For compatibility with external tools, any old
files with corrupted dates can be re-written using the CREATE TABLE AS
command: the writer now produces only specification-compliant values,
even when the data is read from older corrupt files.

While the old behavior was a consistent shift into a range unlikely to
be used in a modern database (over 10,000 years in the future), these are
still valid date values. Because such values may have been written into
files intentionally, and we cannot always be certain from the metadata
whether Drill produced the files, an option is included to turn off the
auto-correction. Use of this option is assumed to be extremely unlikely,
but it is included for completeness.
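The shift-and-correct idea described above can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: it assumes the corrupt values are offset by twice the Julian day number of the Unix epoch (2440588), which matches the "double conversion" wording and the "over 10,000 years" figure, and it uses an arbitrary cut-off at the year 5000 to decide whether a value looks shifted. The class, method, constant, and parameter names are hypothetical; `autoCorrect` stands in for the option that turns the correction off.

```java
import java.time.LocalDate;

// Hypothetical sketch, not Drill's actual reader code.
final class DateShiftSketch {
    // Julian day number of 1970-01-01.
    static final long JULIAN_DAY_OF_UNIX_EPOCH = 2440588L;

    // Assumed size of the corruption: the Julian-day offset applied twice.
    // 4,881,176 days is roughly 13,000 years.
    static final long ASSUMED_SHIFT_DAYS = 2 * JULIAN_DAY_OF_UNIX_EPOCH;

    // Assumed cut-off: anything after 5000-01-01 is treated as shifted.
    static final long ASSUMED_THRESHOLD_DAYS = LocalDate.of(5000, 1, 1).toEpochDay();

    // daysSinceEpoch is the raw parquet DATE value (days since 1970-01-01).
    static long correct(long daysSinceEpoch, boolean autoCorrect) {
        if (autoCorrect && daysSinceEpoch > ASSUMED_THRESHOLD_DAYS) {
            return daysSinceEpoch - ASSUMED_SHIFT_DAYS;
        }
        // Either correction is disabled or the value already looks sane.
        return daysSinceEpoch;
    }
}
```

With auto-correction disabled, a far-future value passes through untouched, which is the behavior someone who wrote such dates intentionally would want.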

This patch was originally written against version 1.5.0; when rebasing,
the corruption threshold was updated to 1.9.0.

Added regenerated binary files, updated metadata cache files accordingly.

One small fix in the ParquetGroupScan to accommodate changes in master
that changed when metadata is read.

Tests for bugs revealed by the regression suite.

Fix drill version number in metadata file generation
