gautam parai <> in drill

DRILL-7245: Cap NDV at row count after applying filters

closes #1786
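
Illustration only, not the code from this patch: the title suggests that once a filter has reduced the estimated row count, the number of distinct values (NDV) cannot sensibly exceed it. A minimal sketch of that clamp, using a hypothetical helper name `cap_ndv`:

```python
def cap_ndv(ndv: float, row_count: float) -> float:
    """Clamp an NDV estimate so it never exceeds the filtered row count.

    Hypothetical helper for illustration; Drill's own logic lives in its
    Java metadata providers and is not reproduced here.
    """
    if ndv < 0 or row_count < 0:
        raise ValueError("estimates must be non-negative")
    return min(ndv, row_count)
```

For example, an NDV of 1000 against 50 surviving rows is capped to 50, while an NDV of 10 is left unchanged.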

DRILL-7227: Fix predicate check in DrillRelOptUtil.analyzeSimpleEquiJoin

closes #1775

DRILL-7182: Incorrect access specifier for Join DrillRelMdDistinctRowCount

DRILL-7076: Fix NPE in StatsMaterializationVisitor

closes #1722

DRILL-7121: Use the NDV guess (same as before) when statistics is disabled

closes #1718

DRILL-7148: Use improved join cardinality and ndv estimation with statistics

closes #1744

DRILL-7109: Apply selectivity calculations to single column filter predicates

closes #1701

DRILL-7108: Improve selectivity estimates for (NOT)LIKE, NOT_EQUALS, IS NOT NULL predicates

closes #1699

DRILL-7085: Fix table-path check in AnalyzeTableHandler

closes #1685

DRILL-6878: Use DrillPushRowKeyJoinToScan rule on DrillJoin pattern to account for DrillSemiJoin

closes #1568

DRILL-6770: JsonTableGroupScan should use new MapRDB 6.1.0 APIs

closes #1489

DRILL-6824: Handle schema changes in MapRDBJsonRecordReader

closes #1518

DRILL-6589: Push transitive closure predicate(s) past aggregates

closes #1372

DRILL-6487: Limit estimateRowCount should not return negative rowcount

closes #1322
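
Illustration only, assuming the negative estimate arises when the offset exceeds the input row count (an assumption, not taken from the patch). A sketch with a hypothetical function `limit_row_count` that floors the estimate at zero:

```python
from typing import Optional


def limit_row_count(input_rows: float, offset: int = 0,
                    fetch: Optional[int] = None) -> float:
    """Estimate rows produced by LIMIT/OFFSET without going negative.

    Hypothetical sketch; parameter names are illustrative, not Drill's.
    """
    # Rows left after skipping `offset`; never below zero.
    remaining = max(0.0, input_rows - offset)
    if fetch is None:
        return remaining
    # LIMIT can only return as many rows as remain.
    return min(float(fetch), remaining)
```

With 10 input rows and an offset of 20, the estimate is 0 rather than -10.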

DRILL-6463 : Fix integer overflow in MockGroupScanPOP

closes #1303

DRILL-3964 : Fix NPE in WriterRecordBatch when 0 rows

closes #1290

DRILL-6375 : Support for ANY_VALUE aggregate function

closes #1256

DRILL-6099: Push limit past flatten(project) without pushdown into scan

closes #1096

DRILL-6093 : Account for simple columns in project cpu costing

close apache/drill#1093

DRILL-6833: Support for pushdown of rowkey based joins

closes #1532

DRILL-5853 : Update Calcite to get NULL direction for sort removal

closes #979

DRILL-4771: Drill should avoid doing the same join twice if count(distinct) exists

close apache/drill#588

DRILL-4795: Nested aggregate windowed query fails - IllegalStateException

close apache/drill#563

DRILL-3710: New option for the IN LIST size to convert into join

Cosmetic changes

close apache/drill#552

DRILL-4743: Allow new options to control filter selectivity (min/max bounds)

Addressed review comments 2

Addressed review comments 3

close apache/drill#534

DRILL-2330: Support for nested aggregates within window functions (merge CALCITE-750), add unit tests.

close apache/drill#529

DRILL-4665: Fix partition pruning for filters containing LIKE (or other similar) predicate on non-partition column.

close apache/drill#526

DRILL-1328: Support table statistics - Part 2

Add support for avg row-width and major type statistics.

Parallelize the ANALYZE implementation and stats UDF implementation to improve stats collection performance.

Update/fix rowcount, selectivity and ndv computations to improve plan costing.

Add options for configuring collection/usage of statistics.

Add new APIs and implementation for stats writer (as a precursor to Drill Metastore APIs).

Fix several stats/costing related issues identified while running TPC-H and TPC-DS queries.

Add support for CPU sampling and nested scalar columns.

Add more testcases for collection and usage of statistics and fix remaining unit/functional test failures.

Thanks to Venki Korukanti (@vkorukanti) for the description below (modified to account for new changes). He graciously agreed to rebase the patch to the latest master, fixed a few issues, and added a few tests.

FUNCS: Statistics functions as UDFs:


Currently using FieldReader to ensure consistent output type so that Unpivot doesn't get confused. All stats columns should be Nullable, so that stats functions can return NULL when N/A.

* custom versions of "count" that always return BigInt

* HyperLogLog-based NDV that returns BigInt and works only on VarChars

* HyperLogLog with binary output that only works on VarChars
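
Illustration only: a textbook HyperLogLog over strings, sketched to show why a mergeable binary sketch is useful alongside a plain NDV count. It is not the implementation used by these UDFs; the class name, the SHA-1 hash, and the precision `p=10` are all assumptions made for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog over strings (illustrative, not Drill's UDF)."""

    def __init__(self, p: int = 10):
        self.p = p                    # index bits; m = 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> None:
        """Union of two sketches: register-wise maximum."""
        if self.p != other.p:
            raise ValueError("precision mismatch")
        self.registers = [max(a, b)
                          for a, b in zip(self.registers, other.registers)]

    def estimate(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)       # valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return round(self.m * math.log(self.m / zeros))
        return round(raw)
```

Because `merge` is a register-wise maximum, sketches built independently on separate fragments can be combined into the sketch of their union, which is the property a merge operator over partial statistics would rely on.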

OPS: Updated protobufs for new ops

OPS: Implemented StatisticsMerge

OPS: Implemented StatisticsUnpivot

ANALYZE: AnalyzeTable functionality

* JavaCC syntax more-or-less copied from LucidDB.

* (Basic) AnalyzePrule: DrillAnalyzeRel -> UnpivotPrel StatsMergePrel FilterPrel(for sampling) StatsAggPrel ScanPrel

ANALYZE: Add getMetadataTable() to AbstractSchema

USAGE: Change field access in QueryWrapper

USAGE: Add getDrillTable() to DrillScanRelBase and ScanPrel

* since ScanPrel does not inherit from DrillScanRelBase, this requires adding a DrillTable to the constructor

* This is done so that a custom ReflectiveRelMetadataProvider can access the DrillTable associated with Logical/Physical scans.

USAGE: Attach DrillStatsTable to DrillTable.

* DrillStatsTable represents the data scanned from a corresponding ".stats.drill" table

* In order to avoid doing query execution right after the ".stats.drill" table is found, metadata is not actually collected until the MaterializationVisitor is used.

** Currently, the metadata source must be a string (so that a SQL query can be created). Doing this with a table is probably more complicated.

** Query is set up to extract only the most recent statistics results for each column.
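
Illustration only: the "most recent statistics results for each column" selection can be pictured as keeping, per column, the row with the latest computation time. The field names `column`, `computed`, and `ndv` below are hypothetical and do not reflect the actual layout of a ".stats.drill" table.

```python
def latest_stats_per_column(rows):
    """Return {column: row}, keeping the row with the greatest `computed`.

    `rows` is an iterable of dicts with hypothetical keys
    "column" and "computed" (any comparable timestamp).
    """
    latest = {}
    for row in rows:
        current = latest.get(row["column"])
        if current is None or row["computed"] > current["computed"]:
            latest[row["column"]] = row
    return latest
```

Given two runs of statistics for the same column, only the later run's row survives.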

closes #729
