SparkSQL 子查询 IN/NOT IN 对 NULL 值的处理
官网:https://spark.apache.org/docs/4.0.0/sql-ref-functions.html
https://spark.apache.org/docs/4.0.0/sql-ref-null-semantics.html#innot-in-subquery

Unlike the
EXISTS
expression,IN
expression can return aTRUE
,FALSE
orUNKNOWN (NULL)
value.与
EXISTS
不同,IN
表达式可能返回三种布尔状态:
TRUE
:当前值在集合中;FALSE
:当前值不在集合中;UNKNOWN
(即NULL
):当表达式中涉及了NULL
值时,无法确定真假。
Conceptually a
IN
expression is semantically equivalent to a set of equality condition separated by a disjunctive operator (OR
).For example,c1 IN (1, 2, 3)
is semantically equivalent to(c1 = 1 OR c1 = 2 OR c1 = 3)
.
从语义上讲,IN
表达式等价于多个等于条件用 OR
连接起来。
c1 IN (1, 2, 3)
相当于:
c1 = 1 OR c1 = 2 OR c1 = 3
As far as handling
NULL
values are concerned, the semantics can be deduced from theNULL
value handling in comparison operators(=) and logical operators(OR
).
对于 NULL
值的处理方式,可以基于比较运算符(如 =
)和逻辑运算符(如 OR
)的行为来推导。 也就是说,IN
的行为是建立在底层 SQL 对 NULL 处理规则之上的。
To summarize, below are the rules for computing the result of an
IN
expression.
TRUE
is returned when the non-NULL value in question is found in the listFALSE
is returned when the non-NULL value is not found in the list and the list does not containNULL
valuesUNKNOWN
is returned when the value isNULL
, or the non-NULL value is not found in the list and the list contains at least oneNULL
value
IN
表达式的计算规则如下:
情况 | 结果 |
---|---|
当前值不为 NULL,并且存在于列表中 | TRUE |
当前值不为 NULL,但不在列表中,且列表中没有 NULL值 | FALSE |
当前值为 NULL,或者列表中有 NULL值但当前值不在其中 | UNKNOWN |
只要列表中包含 NULL
,即使当前值不在列表中,也不能简单地返回 FALSE
,而是返回 UNKNOWN
。
IN Demo:
1:子查询结果只有 NULL
%sql
WITH person AS (
SELECT * FROM VALUES
('a', 25),
('b', 30),
('c', 35),
('d', 40),
('e', 50),
('d', 50)
AS person(name, age)
)
SELECT * FROM person
WHERE age IN (SELECT null);
空表✅

IN (NULL)
返回的是UNKNOWN
,不会匹配任何行。
2:子查询包含 NULL
和有效值
%sql
WITH person AS (
SELECT * FROM VALUES
('a', 25),
('b', 30),
('c', 35),
('d', 40),
('e', 50),
('f', 50)
AS person(name, age)
)
SELECT * FROM person
WHERE age IN (
SELECT age FROM VALUES
(50), (NULL)
AS sub(age)
);
-- 虽然子查询里有 NULL,但只要匹配到具体值就会返回;
只有 age = 50
的记录被选中。

3:子查询包含 NULL
和多个值
%sql
WITH person AS (
SELECT * FROM VALUES
('a', 25),
('b', 30),
('c', 35),
('d', 40),
('e', 50),
('f', 50)
AS person(name, age)
)
SELECT * FROM person
WHERE age IN (
SELECT age FROM VALUES
(25), (30), (NULL)
AS sub(age)
);
-- 虽然子查询里有 NULL,但只要匹配到具体值就会返回;

4: 主表中存在 NULL
值,同时子查询结果也包含 NULL

条件 | 是否被选中 | 原因 |
---|---|---|
age = 25 | ✅ 是 | 匹配列表中的值 |
age = NULL | ❌ 否 | NULL IN (...) → UNKNOWN |
age = 35/40/50 | ❌ 否 | 不匹配列表中的非 NULL 值,且列表中有 NULL → UNKNOWN |
NOT IN Demo:
只要 NOT IN
后面的子查询包含 NULL
,整个条件就会变成 UNKNOWN
, 没有任何行被返回。
避免这个问题,需要在子查询中加上 WHERE age IS NOT NULL
。
1:子查询结果只有 NULL
-
NOT IN
(只要有null)不返回任何结果%sql
WITH person AS (
SELECT * FROM VALUES
('a', 25),
('b', 30),
('c', 35),
('d', 40),
('e', 50),
('d', 50)
AS person(name, age)
)
SELECT * FROM person
WHERE age NOT IN (SELECT null);

子查询中包含 NULL
,NOT IN
整体返回 UNKNOWN
,SQL 不会将其视为 TRUE
,所以没有行满足条件。
2: 子查询包含 NULL
和有效值
-
NOT IN
(只要有null)不返回任何结果%sql
WITH person AS (
SELECT * FROM VALUES
('a', 25),
('b', 30),
('c', 35),
('d', 40),
('e', 50),
('f', 50)
AS person(name, age)
)
SELECT * FROM person
WHERE age NOT IN (
SELECT age FROM VALUES
(50), (NULL)
AS sub(age)
);

3: 主表中存在 NULL
值,同时子查询结果也包含 NULL
-
NOT IN
(只要有null)不返回任何结果%sql
WITH person AS (
SELECT * FROM VALUES
('a', 25),
('b', 30),
('c', null),
('d', 40),
('e', 50),
('f', 50)
AS person(name, age)
)
SELECT * FROM person
WHERE age NOT IN (
SELECT age FROM VALUES
(50), (NULL)
AS sub(age)
);

Spark官方对于各种函数处理null值的说明:
https://spark.apache.org/docs/4.0.0/sql-ref-null-semantics.html
