paul rogers <progers@maprtech.com> in drill

DRILL-6080: Sort incorrectly limits batch size to 65535 records

closes #1090

* Sort incorrectly limits batch size to 65535 records rather than 65536.

* This PR also includes a few code cleanup items.

* Fix for overflow in offset vector in row set writer

* Performance tool update

* Replace "unsafe" methods with "set" methods

* Also fixes an indexing issue with nullable writers

* Removed debug & timing code

* Increase strictness for batch size
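
The off-by-one in DRILL-6080 can be illustrated with a minimal sketch. It assumes the limit comes from 16-bit record indexes (0 through 65535), which allow 65536 records per batch. The class and method names here are hypothetical and are not Drill's actual code.

```java
final class BatchLimit {
  // Assumption: record indexes are unsigned 16-bit values (0..65535),
  // so a batch may hold up to 65536 records.
  static final int MAX_ROW_COUNT = Character.MAX_VALUE + 1; // 65536

  // Buggy form: treats the largest index as the largest count.
  static boolean fitsBuggy(int recordCount) {
    return recordCount <= Character.MAX_VALUE;
  }

  // Corrected form: a count of exactly 65536 is still valid.
  static boolean fits(int recordCount) {
    return recordCount <= MAX_ROW_COUNT;
  }
}
```

With this sketch, a batch of exactly 65536 records is rejected by `fitsBuggy` but accepted by `fits`.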

DRILL-6049: Misc. hygiene and code cleanup changes

close apache/drill#1085

  1. … 109 more files in changeset.
DRILL-5993: Adding generic copiers which do not require codegen

DRILL-1170: YARN integration for Drill

closes #1011

    /distribution/src/resources/drill-am-log.xml (-0, +54)
    /distribution/src/resources/drill-am.sh (-0, +137)
    /distribution/src/resources/drill-on-yarn-example.conf (-0, +204)
    /distribution/src/resources/drill-on-yarn.sh (-0, +74)
    /distribution/src/resources/yarn-client-log.xml (-0, +44)
    /distribution/src/resources/yarn-drillbit.sh (-0, +178)
    /drill-yarn/README.md (-0, +190)
    /drill-yarn/USAGE.md (-0, +941)
    /drill-yarn/img/am-overview.png (binary)
    /drill-yarn/img/client-classes.png (binary)
    /drill-yarn/img/controller-classes.png (binary)
    /drill-yarn/img/overview.png (binary)
    /drill-yarn/pom.xml (-0, +144)
  1. … 111 more files in changeset.
DRILL-5842: Refactor fragment, operator contexts

This closes #978

  1. … 23 more files in changeset.
DRILL-5832: Change OperatorFixture to use system option manager

- Rename FixtureBuilder to ClusterFixtureBuilder

- Provide alternative way to reset system/session options

- Fix for DRILL-5833: random failure in TestParquetWriter

- Provide strict, but clear, errors for missing options

closes #970

  1. … 37 more files in changeset.
DRILL-5830: Resolve regressions to MapR DB from DRILL-5546

- Back out HBase changes

- Code cleanup

- Test utilities

- Fix for DRILL-5829

closes #968

  1. … 8 more files in changeset.
DRILL-5815: Option to set query memory as percent of total

closes #960

DRILL-5808: Reduce memory allocator strictness for "managed" operators

closes #958

DRILL-5443: Rollup of external sort fixes

- DRILL-5758: the “record batch sizer” did not handle repeated columns correctly.

- Enabled managed sort by default

- Fix check style warning

- Fix for DRILL-5670

Estimation for size of spill batch read from disk was off. For some

reason, Drill needs an amount of memory 2x the data size. The previous

estimate was 1.5x. That error, accumulated over 47 columns, was enough

to cause an OOM.

- Code cleanup discovered during the investigation.

- Exception if reAlloc tries to double a zero-size vector

- DRILL-5804: Fixes issues with zero-length vector allocations.

- Estimates array cardinality better when it is fractional.

- Uses fractional cardinality to allocate new arrays.

- Prevents an infinite loop on reAlloc if the array starts empty.

- Fixed unit test issue

- Change batch size variables from int to long
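
The zero-size reAlloc problem noted in the list above can be sketched as follows. Doubling a capacity of zero yields zero, so a grow loop that starts from an empty buffer never terminates. This is an illustration only: `ReallocSketch`, `MIN_CAPACITY`, and both method names are assumptions, not Drill's vector code.

```java
final class ReallocSketch {
  // Hypothetical starting size used when the buffer is empty.
  static final long MIN_CAPACITY = 4096;

  // Doubling zero would return zero forever, so fail loudly instead.
  static long doubled(long capacity) {
    if (capacity <= 0) {
      throw new IllegalStateException("cannot double a zero-size vector");
    }
    // multiplyExact throws ArithmeticException on overflow rather than wrapping.
    return Math.multiplyExact(capacity, 2L);
  }

  // Grows by doubling until the capacity covers the needed size.
  // An empty buffer starts from MIN_CAPACITY, which avoids the infinite loop.
  static long grow(long capacity, long needed) {
    long result = (capacity == 0) ? MIN_CAPACITY : capacity;
    while (result < needed) {
      result = doubled(result);
    }
    return result;
  }
}
```

The sketch uses `long` throughout, in the spirit of the int-to-long change listed above, so that large sizes do not wrap around silently.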

closes #932

  1. … 6 more files in changeset.
DRILL-5716: Queue-driven memory allocation

* Creates new core resource management and query queue abstractions.

* Adds queue information to the Protobuf layer.

* Foreman and Planner changes

- Abstracts memory management out to the new resource management layer.

This means deferring generating the physical plan JSON to later in the

process after memory planning.

* Web UI changes

* Adds queue information to the main page and the profile page to each

query.

* Also sorts the list of options displayed in the Web UI.

- Added memory reserve

A new config parameter, exec.queue.memory_reserve_ratio, sets aside a

slice of total memory for operators that do not participate in the

memory assignment process. The default is 20%; testing will tell us if

that value should be larger or smaller.

* Additional minor fixes

- Code cleanup.

- Added mechanism to abandon lease release during shutdown.

- Log queue configuration only when the config changes, rather than on

every query.

- Apply Boaz’ option to enforce a minimum memory allocation per

operator.

- Additional logging to help testers see what is happening.
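
The effect of exec.queue.memory_reserve_ratio described above can be sketched with simple arithmetic. The sketch assumes the ratio is expressed as a fraction (0.20 for the 20% default). The class and method names are hypothetical.

```java
final class MemoryReserve {
  // Returns the memory left for operators that take part in memory assignment,
  // after setting aside the reserve for operators that do not.
  static long assignable(long totalBytes, double reserveRatio) {
    if (totalBytes < 0 || reserveRatio < 0.0 || reserveRatio >= 1.0) {
      throw new IllegalArgumentException("invalid total or reserve ratio");
    }
    long reserved = (long) (totalBytes * reserveRatio);
    return totalBytes - reserved;
  }
}
```

Under these assumptions, a query with 1000 bytes of total memory and a 0.20 ratio has 800 bytes available for assignment.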

closes #928

  1. … 43 more files in changeset.
DRILL-5740: Ensure spill directories are unique

A recent change added the node name and port to make the spill path

unique. Turns out we need to add this information to the single spill

directory name. The previous change used the node ID as a parent

directory, which turns out not to work well in practice.
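
The difference between the two layouts can be sketched as below. The naming pattern (host, port, and operator directory joined with `-` and `_`) is an assumption made for illustration. The actual format used by Drill is not given in this commit message.

```java
final class SpillDirName {
  // Revised layout (hypothetical format): node identity is folded into the
  // single spill directory name, so no extra parent directory is created.
  static String unique(String host, int port, String operatorDir) {
    return host + "-" + port + "_" + operatorDir;
  }

  // Earlier layout, for contrast: node identity as a parent directory.
  static String nested(String host, int port, String operatorDir) {
    return host + "-" + port + "/" + operatorDir;
  }
}
```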

closes #924

DRILL-5657: Size-aware vector writer structure

- Vector and accessor layer

- Row Set layer

- Tuple and column models

- Revised write-time metadata

- "Result set loader" layer

this closes #914

  1. … 173 more files in changeset.
DRILL-5709: Provide a value vector method to convert a vector to nullable

Please see DRILL-5709 for an explanation and example.

close apache/drill#901

DRILL-5590: Bugs in CSV field matching, null columns

Please see the problem and solution descriptions in DRILL-5590.

Also cleaned up some dead code left over from DRILL-5498.

close #855

DRILL-5518: Test framework enhancements

* Create a SubOperatorTest base class to do routine setup and shutdown.

* Additional methods to simplify creating complex schemas with field

widths.

* Define a test workspace with plugin-specific options (as for the CSV

storage plugin)

* When verifying row sets, add methods to verify and release just the

"actual" batch in addition to the existing method to verify and free

both the actual and expected batches.

* Allow reading of row set values as object for generic comparisons.

* "Column builder" within schema builder to simplify building a single

MaterializedField for tests.

* Misc. code cleanup.

closes #851

  1. … 3 more files in changeset.
DRILL-5517: Size-aware set methods in value vectors

Please see DRILL-5517 for an explanation.

Also includes a workaround for DRILL-5529.

Implements a setEmpties method for repeated and non-nullable

variable-width types in support of the revised column accessors.

Unit test included. Without the setEmpties call, the tests fail with

vector corruption. With the call, things work properly.

closes #840

  1. … 11 more files in changeset.
DRILL-5514: Enhance VectorContainer to merge two row sets

Adds ability to merge two schemas and to merge two vector containers,

in each case producing a new, merged result. See DRILL-5514 for details.

Also provides a handy constructor to create a vector container given a

pre-defined schema.

closes #837

DRILL-5512: Standardize error handling in ScanBatch

Standardizes error handling to throw a UserException. Prior code threw

various exceptions, called the fail() method, or returned a variety of

status codes.

closes #838

    /protocol/src/main/protobuf/Types.proto (-5, +2)
DRILL-5496: Fix for failed Hive connection

If the Hive server restarts, Drill either hangs or continually reports

errors when retrieving schemas. The problem is that the Hive plugin

tries to handle connection failures, but does not do so correctly for

the secure connection case. The problem is complex; see DRILL-5496 for

details.

This is a workaround: we discard the entire Hive schema cache when we

encounter an unhandled connection exception, then we rebuild a new one.

This is not a proper fix; for that we'd have to restructure the code.

This will, however, solve the immediate problem until we do the needed

restructuring.
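
The discard-and-rebuild workaround described above can be sketched generically. This is not the Hive plugin's code: `RebuildingCache`, its loader function, and the single-retry policy are assumptions chosen to show the idea of dropping the whole cache on an unhandled failure.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class RebuildingCache<K, V> {
  private volatile Map<K, V> cache = new ConcurrentHashMap<>();
  private final Function<K, V> loader;

  RebuildingCache(Function<K, V> loader) {
    this.loader = loader;
  }

  V get(K key) {
    try {
      return cache.computeIfAbsent(key, loader);
    } catch (RuntimeException e) {
      // Unhandled failure: discard every cached entry, not just this key,
      // then retry once against the fresh cache.
      cache = new ConcurrentHashMap<>();
      return cache.computeIfAbsent(key, loader);
    }
  }

  int size() {
    return cache.size();
  }
}
```

If the retry also fails, the exception propagates to the caller. A real implementation would also have to decide which exceptions count as connection failures.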

DRILL-5504: Add vector validator to diagnose offset vector issues

Validates offset vectors in VarChar and repeated vectors. Validates the

special case of repeated VarChar vectors (two layers of offsets.)
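
A minimal sketch of the kind of check involved, assuming the usual offset-vector layout: a vector of n values has n + 1 offsets, the first offset is zero, offsets never decrease, and the last offset does not exceed the data length. The class name, method signature, and messages are hypothetical. The validator added by this commit works on Drill's vector classes rather than plain arrays.

```java
final class OffsetValidator {
  // Returns null when the offsets are consistent, otherwise a description
  // of the first problem found.
  static String validate(int[] offsets, int valueCount, int dataLength) {
    if (valueCount < 0) {
      return "negative value count";
    }
    if (offsets.length < valueCount + 1) {
      return "offset vector too short";
    }
    if (offsets[0] != 0) {
      return "first offset is not zero";
    }
    for (int i = 1; i <= valueCount; i++) {
      if (offsets[i] < offsets[i - 1]) {
        return "offset decreases at position " + i;
      }
    }
    if (offsets[valueCount] > dataLength) {
      return "last offset exceeds data length";
    }
    return null;
  }
}
```

For a repeated VarChar vector, the same check would be applied twice under these assumptions: once to the outer offsets, which index into the inner values, and once to the inner offsets, which index into the data bytes.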

Provides two new session variables to turn on validation. One enables

the existing operator (iterator) validation, the other adds vector

validation. This allows validation to occur in a “production” Drill

(without restarting Drill with assertions, as previously required.)

Unit tests validate the validator. Another test validates the

integration, but requires manual steps, so is ignored by default.

This version is first-cut: all work is done within a single class.

This allows back-porting to an earlier version to solve specific issues. A

revision should move some of the work into generated code (or refactor

vectors to allow outside access), since offset vectors appear for each

subclass; not on a base class that would allow generic operations.

* Added boot-time options to allow enabling vector validation in Maven

unit tests.

* Code cleanup per suggestions.

* Additional (manual) tests for boot-time options and default options.

closes #832

DRILL-5498: Better handling of CSV column headers

See DRILL-5498 for details.

Replaced the repeated varchar reader for reading columns with a

purpose-built column parser. Implemented rules to recover from invalid column

headers.

Added missing test method

Changes re code review comments

Back out testing-only change

close apache/drill#830

DRILL-5428: submit_plan fails after Drill 1.8 script revisions

When the other scripts were updated, submit_plan was not corrected.

After Drill 1.8, drill-config.sh consumes all command line arguments,

finds the --config and --site options, removes them, and places the rest

in the new args array.

This PR updates submit_plan to use the new args array.

The fix was tested on a test cluster: we verified that a physical plan

was submitted and ran.

closes #816

    /distribution/src/resources/submit_plan (-1, +1)
DRILL-5423: Refactor ScanBatch to allow unit testing record readers

Refactors ScanBatch to allow unit testing of record reader

implementations, especially the “writer” classes.

See JIRA for details.

closes #811

DRILL-5325: Unit tests for the managed sort

Uses the sub-operator test framework (DRILL-5318), including the test

row set abstraction (DRILL-5323) to enable unit testing of the

“managed” external sort. This PR allows early review of the code, but

cannot be pulled until the dependencies (mentioned above) are pulled.

Refactors the external sort code into small chunks that can be unit

tested, then “wraps” that code in tests for all interesting data types,

record batch sizes, and so on.

Refactors some of the operator definitions to more easily allow

programmatic setup in the unit tests.

Fixes a number of bugs discovered by the unit tests. The biggest

changes were in the new code: the code that computes spilling and

merging based on memory levels.

Otherwise, although GitHub will show many files changed, most of the

changes are simply moving blocks of code around to create smaller units

that can be tested independently.

Includes a refactoring of the code that does spilling, along with a

complete set of low-level unit tests.

Excludes long-running sort tests.

Defines a test category for long-running tests.

First attempt to provide a way to run such tests from Maven.

closes #808

    /common/src/main/java/org/apache/drill/test/SecondaryTest.java (-0, +70)
  1. … 36 more files in changeset.
DRILL-5601: Rollup of external sort fixes and improvements

- DRILL-5513: Managed External Sort : OOM error during the merge phase

- DRILL-5519: Sort fails to spill and results in an OOM

- DRILL-5522: OOM during the merge and spill process of the managed external sort

- DRILL-5594: Excessive buffer reallocations during merge phase of external sort

- DRILL-5597: Incorrect "bits" vector allocation in nullable vectors allocateNew()

- DRILL-5602: Repeated List Vector fails to initialize the offset vector

- DRILL-5617: Spill file name collisions when spill file is on a shared file system

- DRILL-5445: bug in repeated map vector deserialization

- Workaround for DRILL-5656: Streaming Agg Batch forces sort to retain in-memory batches past NONE

- Fixes for the "record batch sizer" to handle UNION, MAP, LIST types

Fixes a longstanding bug in the deserialization of a repeated map

vector read from a spill file. A few minor code cleanups also.

All of the bugs have to do with handling low-memory conditions, and with

correctly estimating the sizes of vectors, even when those vectors come

from the spill file or from an exchange. Hence, the changes for all of

the above issues are interrelated.

Also includes some fixes for tests:

* Certain tests require the ability to enforce the output size of the

memory merge/sort. Restored this option.

* Resolve issue with TestDrillbitResilience

* A particular test injects a fault during in-memory sort, but used a

single-batch input (which does not need the merge phase.)

Rather than introduce a new config property (the earlier solution),

altered the test to use input that returns more than one batch.

Two fixes for BasicPhysicalOpUnitTest, a test that uses JMockit to create a “fake”

fragment context.

1. No drillbit endpoint is available, so the SpillSet change that adds

the node to the spill path failed. Solution was to omit the node path

segment in such tests.

2. The Dynamic UDF registry is null causing a crash. This has nothing

to do with sort. Perhaps some pre-existing error? Anyway, added a check

for this condition.

close #860

  1. … 49 more files in changeset.
DRILL-5385: Vector serializer fails to read saved SV2

Unit testing revealed that the VectorAccessorSerializable class claims

to serialize SV2s, but, in fact, does not. Actually, it writes them,

but does not read them, resulting in corrupted data on read.

Fortunately, no code appears to serialize sv2s at present. Still, it is

a bug and needs to be fixed.

First task is to add serialization code for the sv2.

That revealed that the recently-added code to save DrillBufs using a

shared buffer had a bug: it relied on the writer index to know how much

data is in the buffer. Turns out sv2 buffers don’t set this index. So,

new versions of the write function take a write length.

Then, closer inspection of the read code revealed duplicated code. So,

DrillBuf allocation moved into a version of the read function that now

does reading and DrillBuf allocation.

Turns out that value vectors, but not SV2s, can be built from a

DrillBuf. Added a matching constructor to the SV2 class.

Finally, cleaned up the code a bit to make it easier to follow. Also

allowed test code to access the handy timer already present in the code.

closes #800

DRILL-5234: External sort's spilling functionality does not work when the spilled columns contain a map type column

closes #799

DRILL-5319: Refactor "contexts" for unit testing

closes #787

This PR is purely a refactoring: no functionality is added or changed.

The refactoring splits various context and related classes into a set

of new interfaces with the methods needed for operator-level unit tests. The other,

Drillbit-related methods are left in the original interfaces. Most code

need not change.

The changes here allow operator-level unit tests to mock up the

exec-time methods so we can use them without firing up a Drillbit (or

using mocking libraries).

A later PR will provide the sub-operator test framework that uses this

refactoring.

Changes include:

* The OptionManager is split, with read-only methods moving to a new

OptionSet interface.

* The FragmentContext is split, with an exec-only FragmentExecContext

providing low-level methods.

* OperatorStats is split, with a new OperatorStatReceiver class

providing write-only support to operators.

* Several places that accepted an OperatorContext or FragmentContext,

but needed only an allocator, are changed to accept the allocator

directly.
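
The OptionManager split listed above can be sketched as a read-only interface extended by a full manager. All type and method names below carry a `Sketch` suffix or are otherwise hypothetical, as is the option name. They show the shape of the refactoring, not the actual Drill interfaces.

```java
import java.util.HashMap;
import java.util.Map;

// Read-only view: all that operator code needs at execution time.
interface OptionSetSketch {
  Object getOption(String name);
}

// Full manager: adds the mutating methods used by the server.
interface OptionManagerSketch extends OptionSetSketch {
  void setOption(String name, Object value);
}

// Map-backed implementation, usable in a unit test without a Drillbit.
final class MapOptions implements OptionManagerSketch {
  private final Map<String, Object> values = new HashMap<>();

  @Override
  public Object getOption(String name) {
    return values.get(name);
  }

  @Override
  public void setOption(String name, Object value) {
    values.put(name, value);
  }
}

final class OperatorSketch {
  // Depends only on the read-only interface, so a test can pass any
  // OptionSetSketch without mocking the full manager.
  static long memoryLimit(OptionSetSketch options) {
    Object value = options.getOption("sort.memory_limit"); // hypothetical option name
    return value == null ? 0L : ((Number) value).longValue();
  }
}
```

Because `OperatorSketch` accepts the narrower interface, an operator-level test can supply options from a plain map.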

Includes fixes for code review comments

Adds more comments. Postpones the suggested rename until all affected

code is in master, else it will be difficult to synchronize the rename

across multiple branches.

  1. … 15 more files in changeset.
DRILL-5356: Refactor Parquet Record Reader

The Parquet reader is Drill's premier data source and has worked very well

for many years. As with any piece of code, it has grown in complexity over

that time and has become hard to understand and maintain.

In work in another project, we found that Parquet is accidentally creating

"low density" batches: record batches with little actual data compared to

the amount of memory allocated. We'd like to fix that.
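
One way to make "low density" concrete is the ratio of actual data bytes to allocated bytes. The sketch below is an assumption about how such a measure could be computed. The class and method names are hypothetical and are not taken from the Parquet reader.

```java
final class BatchDensity {
  // Fraction of the allocated memory that holds actual data.
  static double density(long dataBytes, long allocatedBytes) {
    if (allocatedBytes <= 0 || dataBytes < 0 || dataBytes > allocatedBytes) {
      throw new IllegalArgumentException("invalid byte counts");
    }
    return (double) dataBytes / allocatedBytes;
  }
}
```

Under this definition, a batch holding 256 bytes of data in 1024 bytes of allocated memory has a density of 0.25.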

However, the current complexity of the reader code creates a barrier to

making improvements: the code is so complex that it is often better to

leave bugs unfixed, or risk spending large amounts of time struggling to

make small changes.

This commit offers to help revitalize the Parquet reader. Functionality is

identical to the code in master; but code has been pulled apart into

various classes each of which focuses on one part of the task: building

up a schema, keeping track of read state, a strategy for reading various

combinations of records, etc. The idea is that it is easier to understand

several small, focused classes than one huge, complex class. Indeed, the

idea of small, focused classes is common in the industry; it is nothing new.

Unit tests pass with the change. Since no logic has changed (we only moved

lines of code), that is a good indication that everything still works.

Also includes fixes based on review comments.

closes #789

    /exec/java-exec/src/test/resources/parquet/expected/bogus.csv (-0, +20)
    /exec/java-exec/src/test/resources/parquet/expected/star.csv (-0, +20)