The Many Faces of DISTINCT in PostgreSQL
Published: May 11, 2017
Original article: https://hakibenita.com/the-many-faces-of-distinct-in-postgre-sql
Three interesting uses of DISTINCT in PostgreSQL
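I started my programming career as an Oracle DBA. A few years in, I got fed up with the corporate world and went out on my own.
Once I no longer had the comfort of Oracle Enterprise Edition, I discovered PostgreSQL. After I got over the lack of proper partitioning and a MERGE statement (aka UPSERT), I found some nice features that are unique to PostgreSQL. Oddly enough, a lot of them contain the word DISTINCT.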
DISTINCT
Using mock data from an online generator, I created a simple employee table with name, department, and salary columns:
sql
haki=# \d employee
Column | Type | Modifiers
------------+-----------------------+-----------
id | integer | not null
name | character varying(30) |
department | character varying(30) |
salary | integer |
haki=# select * from employee limit 5;
id | name | department | salary
----+----------------+----------------------+--------
1 | Carl Frazier | Engineering | 3052
2 | Richard Fox | Product Management | 13449
3 | Carolyn Carter | Engineering | 8366
4 | Benjamin Brown | Business Development | 7386
5 | Diana Fisher | Services | 10419
What is DISTINCT?
SELECT DISTINCT eliminates duplicate rows from the result.
The simplest use is, for example, to get a list of unique departments:
sql
haki=# SELECT DISTINCT department FROM employee;
department
--------------------------
Services
Support
Training
Accounting
Business Development
Marketing
Product Management
Human Resources
Engineering
Sales
Research and Development
Legal
(CS students, I know it's not normalized...)
We can do the same thing with GROUP BY:
sql
SELECT department FROM employee GROUP BY department;
But we are talking about DISTINCT here.
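One detail worth spelling out: plain DISTINCT applies to the whole select list, not to a single column. The sketch below illustrates this with a handful of rows taken from the sample data; the inline `VALUES` list is only there so the query runs without the `employee` table.

```sql
-- Inline sample rows (a small subset of the article's data) so the
-- query is runnable on its own in psql.
WITH employee (id, name, department, salary) AS (
    VALUES
        (10, 'James Cunningham', 'Legal',       3706),
        (14, 'Chris Phillips',   'Legal',       3706),
        (1,  'Carl Frazier',     'Engineering', 3052),
        (3,  'Carolyn Carter',   'Engineering', 8366)
)
-- DISTINCT de-duplicates on the (department, salary) pair:
-- the two Legal rows collapse into one, the two Engineering rows do not.
SELECT DISTINCT
    department,
    salary
FROM
    employee
ORDER BY
    department,
    salary;
```

With these four rows the query should return three rows: two for Engineering and one for Legal.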
DISTINCT ON
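A classic job interview question is to find the employee with the highest salary in each department.
This is what they teach in university: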
sql
haki=# SELECT
    *
FROM
    employee
WHERE
    (department, salary) IN (
        SELECT
            department,
            MAX(salary)
        FROM
            employee
        GROUP BY
            department
    )
ORDER BY
    department;
id | name | department | salary
----+------------------+--------------------------+--------
30 | Sara Roberts | Accounting | 13845
4 | Benjamin Brown | Business Development | 7386
3 | Carolyn Carter | Engineering | 8366
20 | Janet Hall | Human Resources | 2826
14 | Chris Phillips | Legal | 3706
10 | James Cunningham | Legal | 3706
11 | Richard Bradley | Marketing | 11272
2 | Richard Fox | Product Management | 13449
25 | Evelyn Rodriguez | Research and Development | 10628
17 | Benjamin Carter | Sales | 6197
24 | Jessica Elliott | Services | 14542
7 | Bonnie Robertson | Support | 12674
8 | Jean Bailey | Training | 13230
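The Legal department has two employees with the same top salary, so it appears twice. Depending on the use case, this query can get pretty nasty.
If you graduated a while back, already know a thing or two about databases, and have heard of analytic and window functions, you might do this: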
sql
WITH ranked_employees AS (
    SELECT
        ROW_NUMBER() OVER (
            PARTITION BY department ORDER BY salary DESC
        ) AS rn,
        *
    FROM
        employee
)
SELECT
    *
FROM
    ranked_employees
WHERE
    rn = 1
ORDER BY
    department;
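The result is the same, but without the duplicates (ROW_NUMBER assigns exactly one rn = 1 per department, so only one of the two tied Legal employees is kept):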
rn | id | name | department | salary
----+----+------------------+--------------------------+--------
1 | 30 | Sara Roberts | Accounting | 13845
1 | 4 | Benjamin Brown | Business Development | 7386
1 | 3 | Carolyn Carter | Engineering | 8366
1 | 20 | Janet Hall | Human Resources | 2826
1 | 14 | Chris Phillips | Legal | 3706
1 | 11 | Richard Bradley | Marketing | 11272
...
Up until now, this is what I would have done.
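Which of the two tied Legal employees ends up with rn = 1 is not determined by the query above, because the window is ordered by salary alone. A minimal sketch of one way to make the choice repeatable is to add a tie-breaker to the window's ORDER BY; using `id` for that is an assumption made for illustration, not something the original query does. The inline `VALUES` rows stand in for the `employee` table.

```sql
-- Inline sample rows so the query runs without the employee table.
WITH employee (id, name, department, salary) AS (
    VALUES
        (10, 'James Cunningham', 'Legal',       3706),
        (14, 'Chris Phillips',   'Legal',       3706),
        (1,  'Carl Frazier',     'Engineering', 3052),
        (3,  'Carolyn Carter',   'Engineering', 8366)
),
ranked_employees AS (
    SELECT
        -- salary DESC picks the top earner; id breaks ties, so the
        -- lowest id among equally paid employees gets rn = 1.
        ROW_NUMBER() OVER (
            PARTITION BY department ORDER BY salary DESC, id
        ) AS rn,
        *
    FROM
        employee
)
SELECT
    *
FROM
    ranked_employees
WHERE
    rn = 1
ORDER BY
    department;
```

If the ties should be kept rather than broken, replacing ROW_NUMBER() with RANK() (and dropping `id` from the ordering) gives every top earner rank 1, which brings back both Legal rows.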
Now for the real treat: PostgreSQL has a special nonstandard clause for finding the first row in a group:
sql
SELECT DISTINCT ON (department)
*
FROM
employee
ORDER BY
department,
salary DESC;
This is wild! Why did nobody ever tell me this was possible?
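The docs explain DISTINCT ON:
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal.
And the reason I had never heard of it:
Nonstandard clauses
DISTINCT ON ( ... ) is an extension of the SQL standard.
PostgreSQL does all the heavy lifting for us. The only requirement is that we ORDER BY the field we group by (department in this case); more precisely, the DISTINCT ON expressions must match the leftmost ORDER BY expressions. It also allows "grouping" by more than one field, which makes this clause even more powerful.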
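As a rough sketch of "grouping" by more than one field, the query below picks the top earner per department and title. The `title` column, its values, and the row with id 31 are hypothetical: the article's `employee` table has no such column, so they are made up here purely for illustration. The trailing `id` in the ORDER BY is likewise an assumed tie-breaker.

```sql
-- Hypothetical data: the title column and the id 31 row are invented
-- for this example and do not exist in the article's table.
WITH employee (id, name, department, title, salary) AS (
    VALUES
        (1,  'Carl Frazier',     'Engineering', 'Developer', 3052),
        (3,  'Carolyn Carter',   'Engineering', 'Developer', 8366),
        (31, 'Example Manager',  'Engineering', 'Manager',   9000),
        (10, 'James Cunningham', 'Legal',       'Counsel',   3706),
        (14, 'Chris Phillips',   'Legal',       'Counsel',   3706)
)
-- One row is kept per (department, title) pair. The leftmost ORDER BY
-- expressions repeat the DISTINCT ON list; the remaining ones decide
-- which row counts as "first" within each pair.
SELECT DISTINCT ON (department, title)
    department,
    title,
    name,
    salary
FROM
    employee
ORDER BY
    department,
    title,
    salary DESC,
    id;
```

With this made-up data the query should keep three rows: one Engineering developer, one Engineering manager, and one Legal counsel (the tie is resolved in favor of the lower id).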
IS DISTINCT FROM
Comparing values in SQL can produce three outcomes: true, false, or unknown:
sql
WITH t AS (
SELECT 1 AS a, 1 AS b UNION ALL
SELECT 1, 2 UNION ALL
SELECT NULL, 1 UNION ALL
SELECT NULL, NULL
)
SELECT
a,
b,
a = b as equal
FROM
t;
a | b | equal
------+------+-------
1 | 1 | t
1 | 2 | f
NULL | 1 | NULL
NULL | NULL | NULL
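The result of comparing NULL with NULL using equality (=) is UNKNOWN (shown as NULL in the table).
In SQL, 1 = 1 is true and NULL IS NULL is true, but NULL = NULL is not true: it evaluates to unknown.
It is important to be aware of this subtlety, because comparing nullable fields can yield unexpected results.
The full condition that yields either true or false when comparing nullable fields is: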
sql
(a is null and b is null)
or
(a is not null and b is not null and a = b)
And the result:
a | b | equal | full_condition
------+------+-------+----------------
1 | 1 | t | t
1 | 2 | f | f
NULL | 1 | NULL | f
NULL | NULL | NULL | t
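For reference, the condition above can be dropped into a complete query along these lines. It reuses the `t` CTE from the earlier example; the column alias `full_condition` is taken from the result table.

```sql
WITH t AS (
    SELECT 1 AS a, 1 AS b UNION ALL
    SELECT 1, 2 UNION ALL
    SELECT NULL, 1 UNION ALL
    SELECT NULL, NULL
)
SELECT
    a,
    b,
    a = b AS equal,
    -- Either both sides are NULL, or both are non-NULL and equal.
    -- Unlike a = b, this never evaluates to unknown.
    (
        (a IS NULL AND b IS NULL)
        OR
        (a IS NOT NULL AND b IS NOT NULL AND a = b)
    ) AS full_condition
FROM
    t;
```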
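This is the result we want, but it is very long-winded. Is there a better way?
PostgreSQL implements the SQL-standard IS DISTINCT FROM predicate for safely comparing nullable fields: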
sql
haki=# SELECT
a,
b,
a = b as equal,
a IS DISTINCT FROM b AS is_distinct_from
FROM
t;
a | b | equal | is_distinct_from
------+------+-------+------------------
1 | 1 | t | f
1 | 2 | f | t
NULL | 1 | NULL | t
NULL | NULL | NULL | f
The PostgreSQL wiki explains IS DISTINCT FROM:
IS DISTINCT FROM and IS NOT DISTINCT FROM ... treat NULL as if it was a known value, rather than a special case for unknown.
Much better: short and clear.
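The negated form, IS NOT DISTINCT FROM, behaves like a NULL-safe equality and is often the one wanted in a WHERE clause. A small sketch, reusing the same `t` CTE:

```sql
WITH t AS (
    SELECT 1 AS a, 1 AS b UNION ALL
    SELECT 1, 2 UNION ALL
    SELECT NULL, 1 UNION ALL
    SELECT NULL, NULL
)
SELECT
    a,
    b
FROM
    t
WHERE
    -- Matches rows where a and b hold the same value OR are both NULL.
    -- A plain "a = b" would silently drop the (NULL, NULL) row, because
    -- WHERE only keeps rows for which the condition is true.
    a IS NOT DISTINCT FROM b;
```

This should return the (1, 1) and (NULL, NULL) rows, whereas `WHERE a = b` would return only (1, 1).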
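How do other databases handle this?
- MySQL: has a special operator, <=>, with similar functionality.
- Oracle: provides a function called LNNVL for comparing nullable fields (good luck with that...).
- MSSQL: I could not find a similar function.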
ARRAY_AGG (DISTINCT)
Back when I was still migrating from Oracle, ARRAY_AGG was one of the major selling points of PostgreSQL.
ARRAY_AGG aggregates values into an array:
sql
haki=# SELECT
department,
ARRAY_AGG(name) AS employees
FROM
employee
GROUP BY
department;
department | employees
----------------------+-------------------------------------
Services | {"Diana Fisher","Jessica Elliott"}
Support | {"Bonnie Robertson"}
Training | {"Jean Bailey"}
Accounting | {"Phillip Reynolds","Sean Franklin"}
Business Development | {"Benjamin Brown","Brian Hayes"}
Marketing | {"Richard Bradley","Arthur Moreno"}
Product Management | {"Richard Fox","Randy Wells"}
Human Resources | {"Janet Hall"}
Engineering | {"Carl Frazier","Carolyn Carter"}
Sales | {"Benjamin Carter"}
Research and Develo.. | {"Donna Reynolds","Ann Boyd"}
Legal | {"James Cunningham","George Hanson"}
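I find ARRAY_AGG useful mostly in the CLI, for getting a quick view of the data, or when working with an ORM.
PostgreSQL takes this one step further and also implements the DISTINCT option for this aggregate function. Using DISTINCT we can, for example, quickly view the unique salaries in each department: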
sql
haki=# SELECT
department,
ARRAY_AGG(DISTINCT salary) AS salaries
FROM
employee
GROUP BY
department;
department | salaries
--------------------------+---------------
Accounting | {11203}
Business Development | {2196,7386}
Engineering | {1542,3052}
Human Resources | {2826}
Legal | {1079,3706}
Marketing | {5740}
Product Management | {9101,13449}
Research and Development | {6451,10628}
Sales | {6197}
Services | {2119}
Support | {12674}
Training | {13230}
We can immediately see that everyone in the Support department earns the same salary.
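Rather than spotting this by eye, the same check can be expressed in the query itself. The sketch below combines ARRAY_AGG(DISTINCT ...) with an ORDER BY inside the aggregate and a COUNT(DISTINCT ...) flag; the `same_salary` alias and the inline `VALUES` rows are assumptions made so the example runs on its own.

```sql
-- Inline sample rows so the query runs without the employee table.
WITH employee (id, name, department, salary) AS (
    VALUES
        (10, 'James Cunningham', 'Legal',       3706),
        (14, 'Chris Phillips',   'Legal',       3706),
        (1,  'Carl Frazier',     'Engineering', 3052),
        (3,  'Carolyn Carter',   'Engineering', 8366)
)
SELECT
    department,
    -- With DISTINCT, the aggregate's ORDER BY may only use the
    -- aggregated expression itself, here salary.
    ARRAY_AGG(DISTINCT salary ORDER BY salary DESC) AS salaries,
    -- True when every employee in the department earns the same salary.
    COUNT(DISTINCT salary) = 1 AS same_salary
FROM
    employee
GROUP BY
    department
ORDER BY
    department;
```

With these four rows, Engineering should come back with two salaries and `same_salary` false, and Legal with a single salary and `same_salary` true.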
How do other databases handle this?
- MySQL: has a similar function called GROUP_CONCAT (https://dev.mysql.com/doc/refman/5.6/en/group-by-functions.html#function_group-concat).
- Oracle: has an aggregate function called ListAgg. It does not support DISTINCT. Oracle introduced the function in version 11.2; until then, the internet was full of custom implementations (https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions089.htm#SQLRF30030).
- MSSQL: the closest thing I found is a function called STUFF, which accepts an expression (https://docs.microsoft.com/en-us/sql/t-sql/functions/stuff-transact-sql).
Final words
The takeaway from this article is that you should always go back to the basics!