[HiveSQL] ON vs. WHERE in joins: differences and efficiency comparison

Test environment: Hive on Spark

Spark version: 3.3.1

I. Execution timing

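In a SQL join, WHERE is a filter condition applied to the result set of the join, so in theory it is evaluated after the join. ON is the join condition: it determines which rows can be matched with each other, so in theory it is evaluated during the join.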

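However, most database systems apply optimizations to improve efficiency. The common idea is to move the filter conditions in WHERE, or the join conditions in ON, as close to the data source as possible, so that fewer rows take part in the join. As a result, the actual evaluation point usually differs from the theoretical one.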
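The idea of filtering before the join can be sketched with a toy model in plain Python. The tables, the `inner_join` helper and the `keep` predicate below are made up for illustration and are not how Spark implements a join; the point is only that, for an inner join, filtering at the source gives the same rows while feeding fewer rows into the join.

```python
# Toy model: rows are dicts; the data and helper names are hypothetical.
t1 = [{"bid": "100", "name": "a"}, {"bid": "101", "name": "b"}, {"bid": "200", "name": "c"}]
t2 = [{"bid": "100", "pv": 150}, {"bid": "101", "pv": 50}, {"bid": "200", "pv": 300}]

def inner_join(left, right):
    """Hash join on bid; returns (joined rows, number of rows fed into the join)."""
    index = {}
    for r in right:
        index.setdefault(r["bid"], []).append(r)
    out = [{**l, "pv": r["pv"]} for l in left for r in index.get(l["bid"], [])]
    return out, len(left) + len(right)

def keep(row):
    """The filter from the test queries: bid like '1%' and pv > 100."""
    return row["bid"].startswith("1") and row["pv"] > 100

# Logical order: join first, then filter the joined rows.
joined, fed_late = inner_join(t1, t2)
late = [row for row in joined if keep(row)]

# Pushed down: filter t2 at the source, then join.
early, fed_early = inner_join(t1, [r for r in t2 if keep(r)])

print(late == early, fed_late, fed_early)
```

Both orders return the same single row, but the pushed-down version feeds fewer rows into the join.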

II. Effect on the result set

In an inner join, placing a condition in WHERE or in ON makes no difference to the result set.

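In an outer join (taking the left outer join as the example), the two placements can give different results. A left outer join keeps every row of the left table, and ON takes effect during the join, so the final result set still contains all left-table rows; a left row with no match that satisfies the ON condition is padded with NULL. WHERE operates on the joined result set, so it removes rows, and the two result sets differ.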
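The difference can be reproduced on a small scale. The sketch below uses Python's built-in sqlite3 module rather than Hive or Spark, with made-up three-row tables named t1 and t2; the `run` helper is hypothetical. It only illustrates the standard SQL semantics described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (bid TEXT, name TEXT);
    CREATE TABLE t2 (bid TEXT, pv INTEGER);
    INSERT INTO t1 VALUES ('100', 'a'), ('101', 'b'), ('200', 'c');
    INSERT INTO t2 VALUES ('100', 150), ('101', 50), ('200', 300);
""")

def run(join_type, placement):
    """Run the same query with the filter placed in ON or in WHERE."""
    glue = "AND" if placement == "on" else "WHERE"
    sql = (f"SELECT t1.bid, t2.pv FROM t1 {join_type} JOIN t2 "
           f"ON t1.bid = t2.bid {glue} t2.pv > 100 ORDER BY t1.bid")
    return conn.execute(sql).fetchall()

inner_on, inner_where = run("INNER", "on"), run("INNER", "where")
left_on, left_where = run("LEFT", "on"), run("LEFT", "where")

print(inner_on == inner_where)  # inner join: placement does not matter
print(left_on)                  # left join + ON: bid 101 kept, pv is NULL
print(left_where)               # left join + WHERE: bid 101 removed
```

With ON, the row for bid 101 survives with a NULL pv; with WHERE, it is dropped after the join.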

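III. Efficiency comparison

The test data volumes are as follows:

poi_data.poi_res table: 83 million+ rows

bi_report.mon_ronghe_pv table: partitioned, 12 billion+ rows in total. This test joins against partition 20240522, which holds 59 million+ rows, of which 1.2 million+ satisfy bid like '1%' and pv>100.

The join column has no duplicate values in either table.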

1. Inner join

1) on

select
    t1.bid,
    t1.name,
    t1.point_x,
    t1.point_y,
    t2.pv
from poi_data.poi_res t1
join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid
and t2.bid like '1%' and t2.pv>100;

== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin Inner (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
         :        +- Exchange (3)
         :           +- * Filter (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin Inner (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- Filter (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)


(1) Scan hive poi_data.poi_res
Output [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297, name#299, point_x#316, point_y#317], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#297, type#298, name#299, address#300, phone#301, alias#302, post_code#303, catalog_id#304, c..., Partition Cols: []]

(2) Filter [codegen id : 1]
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))

(3) Exchange
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: hashpartitioning(bid#297, 600), ENSURE_REQUIREMENTS, [plan_id=774]

(4) ShuffleQueryStage
Output [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#334, pv#335, event_day#338]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#338), (event_day#338 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#334, pv#335, event_day#338]

(9) Filter [codegen id : 2]
Input [3]: [bid#334, pv#335, event_day#338]
Condition : (((isnotnull(bid#334) AND isnotnull(pv#335)) AND StartsWith(bid#334, 1)) AND (pv#335 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#334, pv#335]
Input [3]: [bid#334, pv#335, event_day#338]

(11) Exchange
Input [2]: [bid#334, pv#335]
Arguments: hashpartitioning(bid#334, 600), ENSURE_REQUIREMENTS, [plan_id=799]

(12) ShuffleQueryStage
Output [2]: [bid#334, pv#335]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#334, pv#335]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#334, pv#335]
Arguments: [bid#334 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#297]
Right keys [1]: [bid#334]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Input [6]: [bid#297, name#299, point_x#316, point_y#317, bid#334, pv#335]

(17) CollectLimit
Input [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: 1000

(18) Filter
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))

(19) Exchange
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: hashpartitioning(bid#297, 600), ENSURE_REQUIREMENTS, [plan_id=759]

(20) Sort
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#334, pv#335, event_day#338]
Condition : (((isnotnull(bid#334) AND isnotnull(pv#335)) AND StartsWith(bid#334, 1)) AND (pv#335 > 100))

(22) Project
Output [2]: [bid#334, pv#335]
Input [3]: [bid#334, pv#335, event_day#338]

(23) Exchange
Input [2]: [bid#334, pv#335]
Arguments: hashpartitioning(bid#334, 600), ENSURE_REQUIREMENTS, [plan_id=760]

(24) Sort
Input [2]: [bid#334, pv#335]
Arguments: [bid#334 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#297]
Right keys [1]: [bid#334]
Join condition: None

(26) Project
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Input [6]: [bid#297, name#299, point_x#316, point_y#317, bid#334, pv#335]

(27) CollectLimit
Input [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: isFinalPlan=true

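The physical plan shows that the Filter in step (2) applies Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297)) to t1 right after the scan, and that step (7) uses predicate pushdown to apply PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)] while scanning t2. The query only states the bid condition on t2; the optimizer infers the same condition for t1.bid from the join equality t1.bid=t2.bid. Both tables are therefore filtered at the data-source side, which reduces the amount of data that is shuffled and joined.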

2) where

select
    t1.bid,
    t1.name,
    t1.point_x,
    t1.point_y,
    t2.pv
from poi_data.poi_res t1
join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid
where t2.bid like '1%' and t2.pv>100;

== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin Inner (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
         :        +- Exchange (3)
         :           +- * Filter (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin Inner (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- Filter (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)


(1) Scan hive poi_data.poi_res
Output [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350, name#352, point_x#369, point_y#370], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#350, type#351, name#352, address#353, phone#354, alias#355, post_code#356, catalog_id#357, c..., Partition Cols: []]

(2) Filter [codegen id : 1]
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Condition : (StartsWith(bid#350, 1) AND isnotnull(bid#350))

(3) Exchange
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: hashpartitioning(bid#350, 600), ENSURE_REQUIREMENTS, [plan_id=908]

(4) ShuffleQueryStage
Output [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#387, pv#388, event_day#391]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#391), (event_day#391 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#387, pv#388, event_day#391]

(9) Filter [codegen id : 2]
Input [3]: [bid#387, pv#388, event_day#391]
Condition : (((isnotnull(bid#387) AND isnotnull(pv#388)) AND StartsWith(bid#387, 1)) AND (pv#388 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#387, pv#388]
Input [3]: [bid#387, pv#388, event_day#391]

(11) Exchange
Input [2]: [bid#387, pv#388]
Arguments: hashpartitioning(bid#387, 600), ENSURE_REQUIREMENTS, [plan_id=933]

(12) ShuffleQueryStage
Output [2]: [bid#387, pv#388]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#387, pv#388]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#387, pv#388]
Arguments: [bid#387 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#350]
Right keys [1]: [bid#387]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Input [6]: [bid#350, name#352, point_x#369, point_y#370, bid#387, pv#388]

(17) CollectLimit
Input [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: 1000

(18) Filter
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Condition : (StartsWith(bid#350, 1) AND isnotnull(bid#350))

(19) Exchange
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: hashpartitioning(bid#350, 600), ENSURE_REQUIREMENTS, [plan_id=893]

(20) Sort
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#387, pv#388, event_day#391]
Condition : (((isnotnull(bid#387) AND isnotnull(pv#388)) AND StartsWith(bid#387, 1)) AND (pv#388 > 100))

(22) Project
Output [2]: [bid#387, pv#388]
Input [3]: [bid#387, pv#388, event_day#391]

(23) Exchange
Input [2]: [bid#387, pv#388]
Arguments: hashpartitioning(bid#387, 600), ENSURE_REQUIREMENTS, [plan_id=894]

(24) Sort
Input [2]: [bid#387, pv#388]
Arguments: [bid#387 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#350]
Right keys [1]: [bid#387]
Join condition: None

(26) Project
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Input [6]: [bid#350, name#352, point_x#369, point_y#370, bid#387, pv#388]

(27) CollectLimit
Input [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: isFinalPlan=true

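Apart from the expression IDs and plan_id values, which are assigned per query, the physical plan is unchanged. So when the engine supports predicate pushdown, it makes no difference for an inner join whether the filter is written in WHERE or in ON: either way the data is filtered at the data-source side, which reduces the amount of data that takes part in the join.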
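Comparing two plan dumps by eye is error-prone because the expression IDs (such as #297) and plan_id values differ between runs. One way to check that two plans are structurally the same is to mask those identifiers first. The `normalize_plan` helper below is a hypothetical sketch, and the two strings are abbreviated excerpts of steps (2) and (3) from the plans above.

```python
import re

def normalize_plan(plan_text: str) -> str:
    """Replace run-specific identifiers so two plan dumps can be compared."""
    ids = {}

    def rename(match):
        # Number expression IDs by order of first appearance: #297 -> #e0, ...
        return ids.setdefault(match.group(0), f"#e{len(ids)}")

    text = re.sub(r"#\d+", rename, plan_text)
    return re.sub(r"plan_id=\d+", "plan_id=?", text)

plan_on = """(2) Filter [codegen id : 1]
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))
(3) Exchange
Arguments: hashpartitioning(bid#297, 600), ENSURE_REQUIREMENTS, [plan_id=774]"""

plan_where = """(2) Filter [codegen id : 1]
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Condition : (StartsWith(bid#350, 1) AND isnotnull(bid#350))
(3) Exchange
Arguments: hashpartitioning(bid#350, 600), ENSURE_REQUIREMENTS, [plan_id=908]"""

print(plan_on == plan_where)                                   # raw text differs
print(normalize_plan(plan_on) == normalize_plan(plan_where))   # structure matches
```

Renaming IDs by order of first appearance, instead of deleting them, keeps the check sensitive to a plan that refers to a different column at the same position.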

2. Outer join

1) on

select
    t1.bid,
    t1.name,
    t1.point_x,
    t1.point_y,
    t2.pv
from poi_data.poi_res t1
left join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid
and t2.bid like '1%' and t2.pv>100;

== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin LeftOuter (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=36.5 MiB, rowCount=3.07E+5)
         :        +- Exchange (3)
         :           +- * LocalLimit (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin LeftOuter (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- LocalLimit (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)


(1) Scan hive poi_data.poi_res
Output [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403, name#405, point_x#422, point_y#423], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#403, type#404, name#405, address#406, phone#407, alias#408, post_code#409, catalog_id#410, c..., Partition Cols: []]

(2) LocalLimit [codegen id : 1]
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 1000

(3) Exchange
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: hashpartitioning(bid#403, 600), ENSURE_REQUIREMENTS, [plan_id=1043]

(4) ShuffleQueryStage
Output [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#440, pv#441, event_day#444]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#444), (event_day#444 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#440, pv#441, event_day#444]

(9) Filter [codegen id : 2]
Input [3]: [bid#440, pv#441, event_day#444]
Condition : (((isnotnull(bid#440) AND isnotnull(pv#441)) AND StartsWith(bid#440, 1)) AND (pv#441 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#440, pv#441]
Input [3]: [bid#440, pv#441, event_day#444]

(11) Exchange
Input [2]: [bid#440, pv#441]
Arguments: hashpartitioning(bid#440, 600), ENSURE_REQUIREMENTS, [plan_id=1067]

(12) ShuffleQueryStage
Output [2]: [bid#440, pv#441]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#440, pv#441]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#440, pv#441]
Arguments: [bid#440 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#403]
Right keys [1]: [bid#440]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Input [6]: [bid#403, name#405, point_x#422, point_y#423, bid#440, pv#441]

(17) CollectLimit
Input [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: 1000

(18) LocalLimit
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 1000

(19) Exchange
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: hashpartitioning(bid#403, 600), ENSURE_REQUIREMENTS, [plan_id=1029]

(20) Sort
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#440, pv#441, event_day#444]
Condition : (((isnotnull(bid#440) AND isnotnull(pv#441)) AND StartsWith(bid#440, 1)) AND (pv#441 > 100))

(22) Project
Output [2]: [bid#440, pv#441]
Input [3]: [bid#440, pv#441, event_day#444]

(23) Exchange
Input [2]: [bid#440, pv#441]
Arguments: hashpartitioning(bid#440, 600), ENSURE_REQUIREMENTS, [plan_id=1030]

(24) Sort
Input [2]: [bid#440, pv#441]
Arguments: [bid#440 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#403]
Right keys [1]: [bid#440]
Join condition: None

(26) Project
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Input [6]: [bid#403, name#405, point_x#422, point_y#423, bid#440, pv#441]

(27) CollectLimit
Input [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: isFinalPlan=true

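Because this is a left join, the conditions in ON are join conditions and the result must keep every left-table row, so no Filter is applied to t1, while t2 is still filtered through predicate pushdown. Note that step (2) on the t1 side is a LocalLimit of 1000 rather than a Filter. The plan ends in CollectLimit 1000, which does not appear in the SQL text and was presumably added by the query client. In a left outer join every left row produces at least one output row, so Spark can push that limit down to the left side. This is why the t1 shuffle stage reports only rowCount=3.07E+5; without the limit, t1 would be shuffled in full.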
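Why a limit can be applied to the left side of a left outer join before the join can be illustrated with a single-partition toy model. The data and the `left_outer_join` generator below are hypothetical; in a real distributed plan the LocalLimit runs per partition and row order is not deterministic, so this is only a sketch of the reasoning.

```python
from itertools import islice

def left_outer_join(left, right_index):
    """Yield (bid, pv) pairs; unmatched left rows are padded with None."""
    for bid in left:
        matches = right_index.get(bid)
        if matches:
            for pv in matches:
                yield bid, pv
        else:
            yield bid, None

left = [f"{i:03d}" for i in range(1000)]          # hypothetical left table
right_index = {"001": [150], "003": [300, 400]}   # hypothetical right table
limit = 5

# Join everything, then keep the first `limit` rows.
late = list(islice(left_outer_join(left, right_index), limit))

# Keep only the first `limit` left rows, then join and apply the limit again.
early = list(islice(left_outer_join(left[:limit], right_index), limit))

print(late == early)
```

Each left row contributes at least one output row, so the first `limit` output rows can only come from the first `limit` left rows. The limit still has to be applied again after the join, because a left row may match more than one right row.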

2) where

select
    t1.bid,
    t1.name,
    t1.point_x,
    t1.point_y,
    t2.pv
from poi_data.poi_res t1
left join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid
where t2.bid like '1%' and t2.pv>100;

== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin Inner (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
         :        +- Exchange (3)
         :           +- * Filter (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin Inner (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- Filter (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)


(1) Scan hive poi_data.poi_res
Output [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456, name#458, point_x#475, point_y#476], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#456, type#457, name#458, address#459, phone#460, alias#461, post_code#462, catalog_id#463, c..., Partition Cols: []]

(2) Filter [codegen id : 1]
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Condition : (StartsWith(bid#456, 1) AND isnotnull(bid#456))

(3) Exchange
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: hashpartitioning(bid#456, 600), ENSURE_REQUIREMENTS, [plan_id=1176]

(4) ShuffleQueryStage
Output [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#493, pv#494, event_day#497]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#497), (event_day#497 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#493, pv#494, event_day#497]

(9) Filter [codegen id : 2]
Input [3]: [bid#493, pv#494, event_day#497]
Condition : (((isnotnull(bid#493) AND isnotnull(pv#494)) AND StartsWith(bid#493, 1)) AND (pv#494 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#493, pv#494]
Input [3]: [bid#493, pv#494, event_day#497]

(11) Exchange
Input [2]: [bid#493, pv#494]
Arguments: hashpartitioning(bid#493, 600), ENSURE_REQUIREMENTS, [plan_id=1201]

(12) ShuffleQueryStage
Output [2]: [bid#493, pv#494]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#493, pv#494]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#493, pv#494]
Arguments: [bid#493 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#456]
Right keys [1]: [bid#493]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Input [6]: [bid#456, name#458, point_x#475, point_y#476, bid#493, pv#494]

(17) CollectLimit
Input [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: 1000

(18) Filter
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Condition : (StartsWith(bid#456, 1) AND isnotnull(bid#456))

(19) Exchange
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: hashpartitioning(bid#456, 600), ENSURE_REQUIREMENTS, [plan_id=1161]

(20) Sort
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#493, pv#494, event_day#497]
Condition : (((isnotnull(bid#493) AND isnotnull(pv#494)) AND StartsWith(bid#493, 1)) AND (pv#494 > 100))

(22) Project
Output [2]: [bid#493, pv#494]
Input [3]: [bid#493, pv#494, event_day#497]

(23) Exchange
Input [2]: [bid#493, pv#494]
Arguments: hashpartitioning(bid#493, 600), ENSURE_REQUIREMENTS, [plan_id=1162]

(24) Sort
Input [2]: [bid#493, pv#494]
Arguments: [bid#493 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#456]
Right keys [1]: [bid#493]
Join condition: None

(26) Project
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Input [6]: [bid#456, name#458, point_x#475, point_y#476, bid#493, pv#494]

(27) CollectLimit
Input [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: isFinalPlan=true

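WHERE is a filter condition and changes the final result of the left join. Here the join node in the plan is SortMergeJoin Inner, not LeftOuter: the WHERE conditions on t2.bid and t2.pv can never be true for the NULL-padded rows that a left join produces for unmatched t1 rows, so the optimizer rewrites the left join as an inner join. From that point the plan is the same as in the inner-join case: the t2 conditions are pushed down to the scan, and step (2) filters t1 by bid before the join, with the condition inferred from t1.bid=t2.bid.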
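The rewrite from left join to inner join is only valid when the WHERE predicate rejects NULL on the right-table columns. A small sqlite3 sketch (not Spark; the tables and the `query` helper are made up) shows both cases: a predicate that rejects NULL, and one that accepts it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (bid TEXT);
    CREATE TABLE t2 (bid TEXT, pv INTEGER);
    INSERT INTO t1 VALUES ('100'), ('101'), ('200');
    INSERT INTO t2 VALUES ('100', 150), ('101', 50);
""")

def query(join_type, where):
    """Run the same join with a given join type and WHERE predicate."""
    sql = (f"SELECT t1.bid, t2.pv FROM t1 {join_type} JOIN t2 "
           f"ON t1.bid = t2.bid WHERE {where} ORDER BY t1.bid")
    return conn.execute(sql).fetchall()

# Null-rejecting predicate: NULL-padded rows can never pass, so LEFT == INNER.
rejecting = "t2.pv > 100"
same = query("LEFT", rejecting) == query("INNER", rejecting)

# Predicate that accepts NULL: the padded rows survive, so LEFT != INNER.
accepting = "t2.pv IS NULL"
left_rows, inner_rows = query("LEFT", accepting), query("INNER", accepting)

print(same)        # the two join types agree
print(left_rows)   # the unmatched bid 200 is returned
print(inner_rows)  # no rows
```

A condition such as t2.pv IS NULL is the usual way to find left rows without a match, and an engine cannot turn that query into an inner join.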

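IV. Summary

Assuming the database system supports predicate pushdown:

  • Inner join: both inner-join plans apply PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)] to t2 and Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297)) to t1, so for an inner join WHERE and ON make no difference in execution efficiency.
  • Outer join: taking the left outer join again, conditions on the right table are pushed down in both cases. Whether the left table is filtered early depends on whether the condition is in WHERE or in ON, and on which table it refers to. 1) With ON, the left table cannot be filtered, because every left row must be kept, so the cost depends mainly on the size of the left table (in this test the pushed-down LocalLimit of 1000 reduced that cost). 2) With WHERE, a condition on left-table columns is applied to the left table before the join. A condition on right-table columns that rejects NULL, as in this test, turns the query into an inner join, and a condition on the join key is then also inferred for the left table. Otherwise the left table is still read in full.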
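The case of a left-table condition in a left join can be checked the same way. This is again a sqlite3 sketch with made-up two-row tables, not the Hive tables from the test; it shows why an engine cannot pre-filter the left table when the condition sits in ON, but can when it sits in WHERE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (bid TEXT, name TEXT);
    CREATE TABLE t2 (bid TEXT, pv INTEGER);
    INSERT INTO t1 VALUES ('100', 'a'), ('200', 'b');
    INSERT INTO t2 VALUES ('100', 150), ('200', 300);
""")

base = "SELECT t1.bid, t2.pv FROM t1 LEFT JOIN t2 ON t1.bid = t2.bid"

# Left-table predicate in ON: every t1 row is still returned.
in_on = conn.execute(base + " AND t1.bid LIKE '1%' ORDER BY t1.bid").fetchall()

# Same predicate in WHERE: t1 rows that fail it are removed.
in_where = conn.execute(base + " WHERE t1.bid LIKE '1%' ORDER BY t1.bid").fetchall()

print(in_on)     # bid 200 is kept, with pv set to NULL
print(in_where)  # only bid 100 remains
```

With the condition in ON, bid 200 must still appear in the output, so dropping it from t1 before the join would change the result.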

PS

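In the inner-join physical plan, the filter on poi_res is a separate Filter step (2) with Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297)), while the filter on mon_ronghe_pv appears in the scan at step (7) as PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]. What is the difference? According to the material I found, PushedFilters can be understood as filtering while the data is being read: data that does not satisfy the condition is not read at all. Filter reads the data first and then checks each row against the condition to decide whether it takes part in later computation.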

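Since both filter at the data-source side, why can't Filter check the condition while reading, as PushedFilters does, and so reduce the amount of data read and improve efficiency? That was my initial question. According to the material I found, whether data can be filtered during the scan depends mainly on the data source. Storage formats in big data are mainly row-oriented or column-oriented, and column-oriented formats, with their layout and rich metadata, support predicate pushdown much better. In this test, mon_ronghe_pv is stored as Parquet and poi_res is stored as text.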
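How metadata helps can be sketched with a toy model of a columnar file in which each row group carries min/max statistics for the pv column. The structure and function names below are illustrative assumptions, not Parquet's actual file layout or API.

```python
# Toy model of a columnar file: each row group stores min/max statistics
# for the pv column. The layout and names are illustrative, not Parquet's API.
row_groups = [
    {"min_pv": 1,   "max_pv": 90,  "pv": [1, 40, 90]},
    {"min_pv": 95,  "max_pv": 250, "pv": [95, 120, 250]},
    {"min_pv": 300, "max_pv": 900, "pv": [300, 500, 900]},
]

def scan_with_pushed_filter(groups, threshold):
    """Skip row groups whose statistics rule out pv > threshold."""
    rows_read, result = 0, []
    for g in groups:
        if g["max_pv"] <= threshold:
            continue                      # whole group skipped, nothing read
        rows_read += len(g["pv"])
        # Surviving groups can still contain non-matching rows.
        result += [pv for pv in g["pv"] if pv > threshold]
    return result, rows_read

def scan_then_filter(groups, threshold):
    """Row-oriented text file: every row is read, then filtered."""
    rows = [pv for g in groups for pv in g["pv"]]
    return [pv for pv in rows if pv > threshold], len(rows)

pushed, read_pushed = scan_with_pushed_filter(row_groups, 100)
plain, read_plain = scan_then_filter(row_groups, 100)
print(pushed == plain, read_pushed, read_plain)
```

The statistics only rule out whole row groups, so rows inside a surviving group still have to be checked one by one. That matches the plans above, where Filter (9) repeats the same condition after the Parquet scan in step (7).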
