【HiveSQL】join关联on和where的区别及效率对比

测试环境:hive on spark

spark版本:3.3.1

一、执行时机

sql连接中,where属于过滤条件,用于对join的结果集进行过滤,所以理论上的执行时机在join之后。on属于关联条件,决定了满足什么样条件的数据才可以被关联到一起,因此理论上的执行时机在join时。

但是,大多数数据库系统为了提升效率都采用了一些优化技术,思想都是将where中的筛选条件或是on中的关联条件尽可能的提前到数据源侧进行筛选,目的是减少参与关联的数据量。因此它们实际的执行时机大多时候和理论上的不同。

二、对结果集的影响

内连接中,条件放在where或者on中对结果集无影响。

外连接中(以左外连接为例),因为左外连接是完全保留左表记录,on在join时生效,因此最终的结果集也会保留左表的全部记录。where是对join后的结果集进行操作,所以会过滤掉一些数据导致二者的结果集不相同。

三、效率对比

测试数据量如下:

poi_data.poi_res表:数据量8300W+

bi_report.mon_ronghe_pv表:分区表,总数据量120E+,本次采用分区20240522的数据关联,数据量5900W+,其中 bid like '1%' & pv>100 的数据量120W+

两表的关联字段均无重复值。

1.内连接

1)on

sql 复制代码
select
	t1.bid,
    t1.name,
    t1.point_x,
    t1.point_y,
	t2.pv
from poi_data.poi_res t1 
join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid 
and t2.bid like '1%' and t2.pv>100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin Inner (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
         :        +- Exchange (3)
         :           +- * Filter (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin Inner (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- Filter (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)


(1) Scan hive poi_data.poi_res
Output [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297, name#299, point_x#316, point_y#317], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#297, type#298, name#299, address#300, phone#301, alias#302, post_code#303, catalog_id#304, c..., Partition Cols: []]

(2) Filter [codegen id : 1]
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))

(3) Exchange
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: hashpartitioning(bid#297, 600), ENSURE_REQUIREMENTS, [plan_id=774]

(4) ShuffleQueryStage
Output [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#334, pv#335, event_day#338]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#338), (event_day#338 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#334, pv#335, event_day#338]

(9) Filter [codegen id : 2]
Input [3]: [bid#334, pv#335, event_day#338]
Condition : (((isnotnull(bid#334) AND isnotnull(pv#335)) AND StartsWith(bid#334, 1)) AND (pv#335 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#334, pv#335]
Input [3]: [bid#334, pv#335, event_day#338]

(11) Exchange
Input [2]: [bid#334, pv#335]
Arguments: hashpartitioning(bid#334, 600), ENSURE_REQUIREMENTS, [plan_id=799]

(12) ShuffleQueryStage
Output [2]: [bid#334, pv#335]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#334, pv#335]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#334, pv#335]
Arguments: [bid#334 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#297]
Right keys [1]: [bid#334]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Input [6]: [bid#297, name#299, point_x#316, point_y#317, bid#334, pv#335]

(17) CollectLimit
Input [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: 1000

(18) Filter
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))

(19) Exchange
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: hashpartitioning(bid#297, 600), ENSURE_REQUIREMENTS, [plan_id=759]

(20) Sort
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#334, pv#335, event_day#338]
Condition : (((isnotnull(bid#334) AND isnotnull(pv#335)) AND StartsWith(bid#334, 1)) AND (pv#335 > 100))

(22) Project
Output [2]: [bid#334, pv#335]
Input [3]: [bid#334, pv#335, event_day#338]

(23) Exchange
Input [2]: [bid#334, pv#335]
Arguments: hashpartitioning(bid#334, 600), ENSURE_REQUIREMENTS, [plan_id=760]

(24) Sort
Input [2]: [bid#334, pv#335]
Arguments: [bid#334 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#297]
Right keys [1]: [bid#334]
Join condition: None

(26) Project
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Input [6]: [bid#297, name#299, point_x#316, point_y#317, bid#334, pv#335]

(27) CollectLimit
Input [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: isFinalPlan=true

从物理执行计划可以看到第(2)步中的Filter使用条件Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))在t1表读取源数据时进行了过滤,在第(7)步中通过谓词下推在t2表scan源数据时使用条件PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]进行了过滤,两表都是在数据源侧进行的数据过滤,减少了shuffle和参与join的数据量。

2)where

sql 复制代码
select
	t1.bid,
    t1.name,
    t1.point_x,
    t1.point_y,
	t2.pv
from poi_data.poi_res t1 
join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid 
where t2.bid like '1%' and t2.pv>100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin Inner (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
         :        +- Exchange (3)
         :           +- * Filter (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin Inner (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- Filter (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)


(1) Scan hive poi_data.poi_res
Output [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350, name#352, point_x#369, point_y#370], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#350, type#351, name#352, address#353, phone#354, alias#355, post_code#356, catalog_id#357, c..., Partition Cols: []]

(2) Filter [codegen id : 1]
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Condition : (StartsWith(bid#350, 1) AND isnotnull(bid#350))

(3) Exchange
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: hashpartitioning(bid#350, 600), ENSURE_REQUIREMENTS, [plan_id=908]

(4) ShuffleQueryStage
Output [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#387, pv#388, event_day#391]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#391), (event_day#391 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#387, pv#388, event_day#391]

(9) Filter [codegen id : 2]
Input [3]: [bid#387, pv#388, event_day#391]
Condition : (((isnotnull(bid#387) AND isnotnull(pv#388)) AND StartsWith(bid#387, 1)) AND (pv#388 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#387, pv#388]
Input [3]: [bid#387, pv#388, event_day#391]

(11) Exchange
Input [2]: [bid#387, pv#388]
Arguments: hashpartitioning(bid#387, 600), ENSURE_REQUIREMENTS, [plan_id=933]

(12) ShuffleQueryStage
Output [2]: [bid#387, pv#388]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#387, pv#388]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#387, pv#388]
Arguments: [bid#387 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#350]
Right keys [1]: [bid#387]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Input [6]: [bid#350, name#352, point_x#369, point_y#370, bid#387, pv#388]

(17) CollectLimit
Input [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: 1000

(18) Filter
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Condition : (StartsWith(bid#350, 1) AND isnotnull(bid#350))

(19) Exchange
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: hashpartitioning(bid#350, 600), ENSURE_REQUIREMENTS, [plan_id=893]

(20) Sort
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#387, pv#388, event_day#391]
Condition : (((isnotnull(bid#387) AND isnotnull(pv#388)) AND StartsWith(bid#387, 1)) AND (pv#388 > 100))

(22) Project
Output [2]: [bid#387, pv#388]
Input [3]: [bid#387, pv#388, event_day#391]

(23) Exchange
Input [2]: [bid#387, pv#388]
Arguments: hashpartitioning(bid#387, 600), ENSURE_REQUIREMENTS, [plan_id=894]

(24) Sort
Input [2]: [bid#387, pv#388]
Arguments: [bid#387 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#350]
Right keys [1]: [bid#387]
Join condition: None

(26) Project
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Input [6]: [bid#350, name#352, point_x#369, point_y#370, bid#387, pv#388]

(27) CollectLimit
Input [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: isFinalPlan=true

物理执行计划没有变化,因此可以说,当数据库支持谓词下推时,筛选条件用where还是on没有区别,数据库都会在数据源侧进行数据过滤,减少参与关联的数据量。

2.外连接

1)on

sql 复制代码
select
	t1.bid,
    t1.name,
    t1.point_x,
    t1.point_y,
	t2.pv
from poi_data.poi_res t1 
left join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid 
and t2.bid like '1%' and t2.pv>100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin LeftOuter (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=36.5 MiB, rowCount=3.07E+5)
         :        +- Exchange (3)
         :           +- * LocalLimit (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin LeftOuter (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- LocalLimit (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)


(1) Scan hive poi_data.poi_res
Output [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403, name#405, point_x#422, point_y#423], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#403, type#404, name#405, address#406, phone#407, alias#408, post_code#409, catalog_id#410, c..., Partition Cols: []]

(2) LocalLimit [codegen id : 1]
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 1000

(3) Exchange
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: hashpartitioning(bid#403, 600), ENSURE_REQUIREMENTS, [plan_id=1043]

(4) ShuffleQueryStage
Output [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#440, pv#441, event_day#444]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#444), (event_day#444 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#440, pv#441, event_day#444]

(9) Filter [codegen id : 2]
Input [3]: [bid#440, pv#441, event_day#444]
Condition : (((isnotnull(bid#440) AND isnotnull(pv#441)) AND StartsWith(bid#440, 1)) AND (pv#441 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#440, pv#441]
Input [3]: [bid#440, pv#441, event_day#444]

(11) Exchange
Input [2]: [bid#440, pv#441]
Arguments: hashpartitioning(bid#440, 600), ENSURE_REQUIREMENTS, [plan_id=1067]

(12) ShuffleQueryStage
Output [2]: [bid#440, pv#441]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#440, pv#441]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#440, pv#441]
Arguments: [bid#440 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#403]
Right keys [1]: [bid#440]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Input [6]: [bid#403, name#405, point_x#422, point_y#423, bid#440, pv#441]

(17) CollectLimit
Input [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: 1000

(18) LocalLimit
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 1000

(19) Exchange
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: hashpartitioning(bid#403, 600), ENSURE_REQUIREMENTS, [plan_id=1029]

(20) Sort
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#440, pv#441, event_day#444]
Condition : (((isnotnull(bid#440) AND isnotnull(pv#441)) AND StartsWith(bid#440, 1)) AND (pv#441 > 100))

(22) Project
Output [2]: [bid#440, pv#441]
Input [3]: [bid#440, pv#441, event_day#444]

(23) Exchange
Input [2]: [bid#440, pv#441]
Arguments: hashpartitioning(bid#440, 600), ENSURE_REQUIREMENTS, [plan_id=1030]

(24) Sort
Input [2]: [bid#440, pv#441]
Arguments: [bid#440 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#403]
Right keys [1]: [bid#440]
Join condition: None

(26) Project
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Input [6]: [bid#403, name#405, point_x#422, point_y#423, bid#440, pv#441]

(27) CollectLimit
Input [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: isFinalPlan=true

因为左关联,on中的条件属于连接条件,结果需要保留左表全部记录,所以t1表全量读取,t2表使用了谓词下推过滤。

2)where

sql 复制代码
select
	t1.bid,
    t1.name,
    t1.point_x,
    t1.point_y,
	t2.pv
from poi_data.poi_res t1 
left join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid 
where t2.bid like '1%' and t2.pv>100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin Inner (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
         :        +- Exchange (3)
         :           +- * Filter (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin Inner (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- Filter (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)


(1) Scan hive poi_data.poi_res
Output [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456, name#458, point_x#475, point_y#476], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#456, type#457, name#458, address#459, phone#460, alias#461, post_code#462, catalog_id#463, c..., Partition Cols: []]

(2) Filter [codegen id : 1]
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Condition : (StartsWith(bid#456, 1) AND isnotnull(bid#456))

(3) Exchange
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: hashpartitioning(bid#456, 600), ENSURE_REQUIREMENTS, [plan_id=1176]

(4) ShuffleQueryStage
Output [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#493, pv#494, event_day#497]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#497), (event_day#497 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#493, pv#494, event_day#497]

(9) Filter [codegen id : 2]
Input [3]: [bid#493, pv#494, event_day#497]
Condition : (((isnotnull(bid#493) AND isnotnull(pv#494)) AND StartsWith(bid#493, 1)) AND (pv#494 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#493, pv#494]
Input [3]: [bid#493, pv#494, event_day#497]

(11) Exchange
Input [2]: [bid#493, pv#494]
Arguments: hashpartitioning(bid#493, 600), ENSURE_REQUIREMENTS, [plan_id=1201]

(12) ShuffleQueryStage
Output [2]: [bid#493, pv#494]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#493, pv#494]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#493, pv#494]
Arguments: [bid#493 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#456]
Right keys [1]: [bid#493]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Input [6]: [bid#456, name#458, point_x#475, point_y#476, bid#493, pv#494]

(17) CollectLimit
Input [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: 1000

(18) Filter
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Condition : (StartsWith(bid#456, 1) AND isnotnull(bid#456))

(19) Exchange
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: hashpartitioning(bid#456, 600), ENSURE_REQUIREMENTS, [plan_id=1161]

(20) Sort
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#493, pv#494, event_day#497]
Condition : (((isnotnull(bid#493) AND isnotnull(pv#494)) AND StartsWith(bid#493, 1)) AND (pv#494 > 100))

(22) Project
Output [2]: [bid#493, pv#494]
Input [3]: [bid#493, pv#494, event_day#497]

(23) Exchange
Input [2]: [bid#493, pv#494]
Arguments: hashpartitioning(bid#493, 600), ENSURE_REQUIREMENTS, [plan_id=1162]

(24) Sort
Input [2]: [bid#493, pv#494]
Arguments: [bid#493 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#456]
Right keys [1]: [bid#493]
Join condition: None

(26) Project
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Input [6]: [bid#456, name#458, point_x#475, point_y#476, bid#493, pv#494]

(27) CollectLimit
Input [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: isFinalPlan=true

where属于过滤条件,影响左关联的最终结果,所以执行计划第(2)步中将where提前到join关联之前按照bid对t1表进行过滤。

四、总结

假设数据库系统支持谓词下推的前提下,

  • 内连接:内连接的两个执行计划中,对t2表都使用了PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)],对t1表都使用了Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297)) ,因此可以说,内连接中where和on在执行效率上没区别。
  • 外连接:还是拿左外连接来说,右表相关的条件会使用谓词下推,而左表是否会提前过滤数据,取决于where还是on以及筛选条件是否与左表相关,1)当为on时,左表的数据必须全量读取,此时效率的差别主要取决于左表的数据量。2)当为where时,如果筛选条件涉及到左表,则会进行数据的提前过滤,否则左表仍然全量读取。

PS

在内连接的物理执行计划中,对poi_res表的过滤单独作为一个Filter步骤(2)Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297)),而对mon_ronghe_pv表的过滤在第(7)步scan中PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)] ,二者有什么区别?查了一些资料,说的是可以将PushedFilters理解为在读取数据时的过滤,不满足条件的数据直接不读取。Filter时将数据读取之后,再判断是否满足条件,决定是否参与后续计算。

既然都是在数据源侧进行数据过滤,为什么Filter不能像PushedFilters那样,直接在读取数据的时候判断,减少读入的数据量呢,这样也可以提升效率,这是一开始个人的疑问。查了一些资料,说的是是否支持在scan时filter数据,主要受数据源的影响。大数据中的存储方式主要分为行式存储和列式存储,列式存储的数据存储方式和丰富的元数据对谓词下推技术有更好的支持。当前测试中,mon_ronghe_pv表的存储格式为parquet,poi_res表存储格式text。

相关推荐
数新网络1 小时前
《深入浅出Apache Spark》系列②:Spark SQL原理精髓全解析
大数据·sql·spark
师太,答应老衲吧3 小时前
SQL实战训练之,力扣:2020. 无流量的帐户数(递归)
数据库·sql·leetcode
NiNg_1_2345 小时前
高级 SQL 技巧详解
sql
全栈开发圈13 小时前
新书速览|Spark SQL大数据分析快速上手
sql·数据分析·spark
WeeJot嵌入式13 小时前
数据库----SQL
数据库·sql
矛取矛求13 小时前
高级sql技巧
数据库·sql·mysql
Mephisto.java16 小时前
【大数据学习 | kafka高级部分】kafka的快速读写
大数据·redis·sql·kafka·flume
python15616 小时前
SQL中的IN语句和EXISTS语句
数据库·sql
超栈17 小时前
蓝桥杯-网络安全比赛题目-遗漏的压缩包
前端·网络·sql·安全·web安全·职场和发展·蓝桥杯
十二同学啊19 小时前
Mybatis拦截器中获取@RequestBody表单的值修改查询SQL
数据库·sql·mybatis