Hive中left join 中的where 和 on的区别

一、知识点

left join中关于where和on条件的知识点：

多表left join 是会生成一张临时表。
on后面：一般是对left join 的右表进行条件过滤，会返回左表中的所有行，而右表中没有匹配上的数据直接补为null。
on后面：如果存在对左表的筛选条件，无论条件真假，仍返回左表中的所有行，只是会影响右表的匹配值。即：on中针对左表的限制条件只会影响右表的匹配内容，并不影响返回的左表行数。
where后面：是对两表连接后的数据进行筛选，一般是针对左表的筛选条件。

例如下面的sql：关联表（右表）的筛选条件如果放置在on后面，则下面sql的执行顺序是：B先按照条件进行过滤，再与A表关联；

sql 复制代码

#主表：A； 关联表：B  关联条件：A.id = B.id  筛选条件：B.id >1
A left join B on A.id = B.id and B.id >1;

筛选条件如果放置在where后面，则下面sql的执行顺序是：A和B先关联，基于生成的临时表再进行where条件过滤。

sql 复制代码

#主表：A； 关联表：B  关联条件：A.id = B.id  过滤条件
A left join B on A.id = B.id  where  A.id <> 1

二、测试验证

创建t1 ,t2 两张表

sql 复制代码

create table if not exists test1( id int comment '用户id',name string comment '用户姓名');

insert overwrite table test1
values (1,'a1'),
       (2,'a2'),
       (3,'a3'),
       (4,'a4');

create table if not exists test2( id int comment '用户id',province string comment '用户所在省份');

insert overwrite table test1
values (1,'hainan'),
       (2,'beijing'),
       (5,'sichuan'),
       (6,'chongqing');

复制代码

测试1：返回左表所有行，右表符合on条件可匹配上，不满足条件的直接补null

sql 复制代码

select
    test1.id,
    test1.name,
    test2.id,
    test2.province
from test1
left join test2
 on test1.id = test2.id;

复制代码

测试2：on后面增加对右表的限制条件；结论：左表记录全部返回，右表筛选条件生效

sql 复制代码

select
    test1.id,
    test1.name,
    test2.id,
    test2.province
from test1
left join test2
  on test1.id = test2.id and test2.province = 'beijing';

复制代码

测试3：where后面增加对右表的限制条件：where test2.province = 'beijing';
   结论：where后面是对最后的临时表进行记录筛选，行数可能会减少

sql 复制代码

select
    test1.id,
    test1.name,
    test2.id,
    test2.province
from test1
left join test2
  on test1.id = test2.id  where test2.province = 'beijing';

复制代码

测试4：on中增加对左表的限制条件，不影响返回的行数，只影响右表的匹配内容。

sql 复制代码

select
    test1.id,
    test1.name,
    test2.id,
    test2.province
from test1
left join test2
  on test1.id = test2.id and test1.name = 'a1';

复制代码

测试5：where后面增加对左表的限制条件：where test1.name ='a4' 
  结论：where条件是在最后临时表的基础上进行筛选，返回满足条件的行

sql 复制代码

select
    test1.id,
    test1.name,
    test2.id,
    test2.province
from test1
left join test2
  on test1.id = test2.id
 where test1.name = 'a4';

三、引申

由上述的where和on之间的区别，引申出来的面试题： t1表和t2表的字段如图，计算以下两个sql的值。

sql 复制代码

--1.输出结果
select  t1.id,t2.id
from t1
left join t2 on t1.id = t2.id and t2.id <>2;

--2.输出结果
select  t1.id,t2.id
from t1
left join t2 on t1.id = t2.id 
where t2.id <>2;

解析：

（1） on后面：跟着的是对右表的限制条件 t2.id <>2;

结论：左表记录全部返回，右表筛选条件生效

sql 复制代码

  
with t1 as (
  select 1 as id union all
  select 2 as id union all
  select 3 as id
),
    t2 as (
        select 1 as id union all
        select 2 as id union all
        select 2 as id
    )

select  t1.id,t2.id
from t1
left join t2 on t1.id = t2.id and t2.id <>2;

输出结果为：

（2）where 后面：跟着的是对右表的限制条件 where t2.id <>2;

结论：最后临时表的基础上进行筛选，最终返回符合where过滤条件的行；

ps: sql中比较的结果一般有：true, false, null；而where条件只会过滤出true的结果。

sql 复制代码

with t1 as (
  select 1 as id union all
  select 2 as id union all
  select 3 as id
),
    t2 as (
        select 1 as id union all
        select 2 as id union all
        select 2 as id
    )

select  t1.id,t2.id
from t1
left join t2 on t1.id = t2.id
where t2.id <>2;

输出结果为：

补充：

sql 复制代码

select (1>2)  --> FALSE
select (1<2)  --> true
select (1<null) --> null

--所以，sql比较的结果会有三种：false, true, null