Spark SQL----GROUP BY子句

Spark SQL----GROUP BY子句

一、描述

GROUP BY子句用于根据一组指定的分组表达式对行进行分组,并根据一个或多个指定的聚合函数计算行组上的聚合。Spark还支持高级聚合,通过GROUPING SETS、CUBE、ROLLUP子句对同一输入记录集进行多个聚合。分组表达式和高级聚合可以混合在GROUP BY子句中,也可以嵌套在GROUPING SETS子句中。请参阅Mixed/Nested Grouping Analytics部分中的更多详细信息。当FILTER子句附加到聚合函数时,只有匹配的行被传递给该函数。

二、语法

sql 复制代码
GROUP BY group_expression [ , group_expression [ , ... ] ] [ WITH { ROLLUP | CUBE } ]

GROUP BY { group_expression | { ROLLUP | CUBE | GROUPING SETS } (grouping_set [ , ...]) } [ , ... ]

而聚合函数定义为

sql 复制代码
aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_expression ) ]

三、参数

  • group_expression
    指定将行分组在一起所依据的条件。行的分组是基于分组表达式的结果值来执行的。分组表达式可以是类似GROUP BY A的列名、类似GROUP BY 0的列位置或类似GROUP BY a + b的表达式。
  • grouping_set
    grouping set由括号中的零个或多个逗号分隔的表达式指定。当分组集只有一个元素时,可以省略括号。例如,分组集((a), (b))与分组集(a, b)相同。
    语法:{ ( [ expression [ , ... ] ] ) | expression }
  • GROUPING SETS
    对GROUPING SETS之后指定的每个分组集的行进行分组。例如,GROUP BY GROUPING SETS ((warehouse), (product))在语义上等效于GROUP BY warehouse 和 GROUP BY product的结果的并集。此子句是UNION ALL的简写,其中UNION ALL运算符的每个分支执行GROUPING SETS子句中指定的每个分组集的聚合。类似地,GROUP BY GROUPING SETS ((warehouse, product), (product), ()) 在语义上等价于GROUP BY warehouse, product, GROUP BY product 和 global aggregate的结果的并集。
    注意:为了Hive兼容性,Spark允许GROUP BY ... GROUPING SETS (...)。GROUP BY表达式通常被忽略,但如果它包含比GROUPING SETS表达式更多的表达式,则这些额外的表达式将包含在分组表达式中,并且值始终为null。例如,SELECT a, b, c FROM ... GROUP BY a, b, c GROUPING SETS (a, b),列c的输出始终为null。
  • ROLLUP
    在单个语句中指定多个级别的聚合。此子句用于基于多个grouping sets计算聚合。ROLLUP是GROUPING SETS的简写。例如,GROUP BY warehouse, product WITH ROLLUP 或者GROUP BY ROLLUP(warehouse, product) 等效于GROUP BY GROUPING SETS((warehouse, product), (warehouse), ())。GROUP BY ROLLUP(warehouse, product, (warehouse, location))等效于GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse), ())。ROLLUP specification的N个元素产生N+1个GROUPING集合。
  • CUBE
    CUBE子句用于根据GROUP BY子句中指定的分组列的组合执行聚合。CUBE是GROUPING SETS的简写。例如,GROUP BY warehouse, product WITH CUBE 或 GROUP BY CUBE(warehouse, product)等效于GROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ())。GROUP BY CUBE(warehouse, product, (warehouse, location)) 等效于GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (product, warehouse, location), (warehouse), (product), (warehouse, product), ())。CUBE specification的N个元素产生2^N个分组集。
  • Mixed/Nested Grouping Analytics
    GROUP BY子句可以包括多个group_expressions和多个 CUBE|ROLLUP|GROUPING SETSs。GROUPING SETS也可以具有嵌套的CUBE|ROLLUP|GROUPING SETS子句,例如GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location)), GROUPING SETS(warehouse, GROUPING SETS(location, GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))))。CUBE|ROLLUP只是GROUPING SETS的语法糖,请参阅上面的部分,了解如何将CUBE|ROLLUP转换为GROUPING SETS。在此上下文中,group_expression可以被视为单个组GROUPING SETS。对于GROUP BY子句中的多个GROUPING SETS,我们通过对原始GROUP集进行cross-product来生成单个GROUPING SETS。对于GROUPING SETS子句中嵌套的GROUPING SETS,我们只需取其分组集并将其剥离即可。例如,GROUP BY warehouse, GROUPING SETS((product), ()), GROUPING SETS((location, size), (location), (size), ()) 和 GROUP BY warehouse, ROLLUP(product), CUBE(location, size) 等价于GROUP BY GROUPING SETS( (warehouse, product, location, size), (warehouse, product, location), (warehouse, product, size), (warehouse, product), (warehouse, location, size), (warehouse, location), (warehouse, size), (warehouse))。
    GROUP BY GROUPING SETS(GROUPING SETS(warehouse), GROUPING SETS((warehouse, product)))等价于GROUP BY GROUPING SETS((warehouse), (warehouse, product))。
  • aggregate_name
    指定聚合函数名称(MIN、MAX、COUNT、SUM、AVG等)。
  • DISTINCT
    在将输入行中的重复项传递给聚合函数之前,移除这些行。
  • FILTER
    过滤WHERE子句中boolean_expression计算结果为true并传递给聚合函数的输入行;其他行被丢弃。

四、例子

sql 复制代码
CREATE TABLE dealer (id INT, city STRING, car_model STRING, quantity INT);
INSERT INTO dealer VALUES
    (100, 'Fremont', 'Honda Civic', 10),
    (100, 'Fremont', 'Honda Accord', 15),
    (100, 'Fremont', 'Honda CRV', 7),
    (200, 'Dublin', 'Honda Civic', 20),
    (200, 'Dublin', 'Honda Accord', 10),
    (200, 'Dublin', 'Honda CRV', 3),
    (300, 'San Jose', 'Honda Civic', 5),
    (300, 'San Jose', 'Honda Accord', 8);

-- Sum of quantity per dealership. Group by `id`.
SELECT id, sum(quantity) FROM dealer GROUP BY id ORDER BY id;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100|           32|
|200|           33|
|300|           13|
+---+-------------+

-- Use column position in GROUP by clause.
SELECT id, sum(quantity) FROM dealer GROUP BY 1 ORDER BY 1;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100|           32|
|200|           33|
|300|           13|
+---+-------------+

-- Multiple aggregations.
-- 1. Sum of quantity per dealership.
-- 2. Max quantity per dealership.
SELECT id, sum(quantity) AS sum, max(quantity) AS max FROM dealer GROUP BY id ORDER BY id;
+---+---+---+
| id|sum|max|
+---+---+---+
|100| 32| 15|
|200| 33| 20|
|300| 13|  8|
+---+---+---+

-- Count the number of distinct dealer cities per car_model.
SELECT car_model, count(DISTINCT city) AS count FROM dealer GROUP BY car_model;
+------------+-----+
|   car_model|count|
+------------+-----+
| Honda Civic|    3|
|   Honda CRV|    2|
|Honda Accord|    3|
+------------+-----+

-- Sum of only 'Honda Civic' and 'Honda CRV' quantities per dealership.
SELECT id, sum(quantity) FILTER (
            WHERE car_model IN ('Honda Civic', 'Honda CRV')
        ) AS `sum(quantity)` FROM dealer
    GROUP BY id ORDER BY id;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100|           17|
|200|           23|
|300|            5|
+---+-------------+

-- Aggregations using multiple sets of grouping columns in a single statement.
-- Following performs aggregations based on four sets of grouping columns.
-- 1. city, car_model
-- 2. city
-- 3. car_model
-- 4. Empty grouping set. Returns quantities for all city and car models.
SELECT city, car_model, sum(quantity) AS sum FROM dealer
    GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
    ORDER BY city;
+---------+------------+---+
|     city|   car_model|sum|
+---------+------------+---+
|     null|        null| 78|
|     null| HondaAccord| 33|
|     null|    HondaCRV| 10|
|     null|  HondaCivic| 35|
|   Dublin|        null| 33|
|   Dublin| HondaAccord| 10|
|   Dublin|    HondaCRV|  3|
|   Dublin|  HondaCivic| 20|
|  Fremont|        null| 32|
|  Fremont| HondaAccord| 15|
|  Fremont|    HondaCRV|  7|
|  Fremont|  HondaCivic| 10|
| San Jose|        null| 13|
| San Jose| HondaAccord|  8|
| San Jose|  HondaCivic|  5|
+---------+------------+---+

-- Group by processing with `ROLLUP` clause.
-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), ())
SELECT city, car_model, sum(quantity) AS sum FROM dealer
    GROUP BY city, car_model WITH ROLLUP
    ORDER BY city, car_model;
+---------+------------+---+
|     city|   car_model|sum|
+---------+------------+---+
|     null|        null| 78|
|   Dublin|        null| 33|
|   Dublin| HondaAccord| 10|
|   Dublin|    HondaCRV|  3|
|   Dublin|  HondaCivic| 20|
|  Fremont|        null| 32|
|  Fremont| HondaAccord| 15|
|  Fremont|    HondaCRV|  7|
|  Fremont|  HondaCivic| 10|
| San Jose|        null| 13|
| San Jose| HondaAccord|  8|
| San Jose|  HondaCivic|  5|
+---------+------------+---+

-- Group by processing with `CUBE` clause.
-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
SELECT city, car_model, sum(quantity) AS sum FROM dealer
    GROUP BY city, car_model WITH CUBE
    ORDER BY city, car_model;
+---------+------------+---+
|     city|   car_model|sum|
+---------+------------+---+
|     null|        null| 78|
|     null| HondaAccord| 33|
|     null|    HondaCRV| 10|
|     null|  HondaCivic| 35|
|   Dublin|        null| 33|
|   Dublin| HondaAccord| 10|
|   Dublin|    HondaCRV|  3|
|   Dublin|  HondaCivic| 20|
|  Fremont|        null| 32|
|  Fremont| HondaAccord| 15|
|  Fremont|    HondaCRV|  7|
|  Fremont|  HondaCivic| 10|
| San Jose|        null| 13|
| San Jose| HondaAccord|  8|
| San Jose|  HondaCivic|  5|
+---------+------------+---+

--Prepare data for ignore nulls example
CREATE TABLE person (id INT, name STRING, age INT);
INSERT INTO person VALUES
    (100, 'Mary', NULL),
    (200, 'John', 30),
    (300, 'Mike', 80),
    (400, 'Dan', 50);

--Select the first row in column age
SELECT FIRST(age) FROM person;
+--------------------+
| first(age, false)  |
+--------------------+
| NULL               |
+--------------------+

--Get the first row in column `age` ignore nulls,last row in column `id` and sum of column `id`.
SELECT FIRST(age IGNORE NULLS), LAST(id), SUM(id) FROM person;
+-------------------+------------------+----------+
| first(age, true)  | last(id, false)  | sum(id)  |
+-------------------+------------------+----------+
| 30                | 400              | 1000     |
+-------------------+------------------+----------+
相关推荐
数智化精益手记局42 分钟前
拆解物料管理erp系统的核心功能,看物料管理erp系统如何解决库存积压与缺料难题
大数据·网络·人工智能·安全·信息可视化·精益工程
Elastic 中国社区官方博客2 小时前
使用 Observability Migration Platform 将 Datadog 和 Grafana 的仪表板与告警迁移到 Kibana
大数据·elasticsearch·搜索引擎·信息可视化·全文检索·grafana·datalog
jkyy20143 小时前
AI运动数字化:以技术重塑场景,健康有益赋能全域运动健康管理
大数据·人工智能·健康医疗
金融小师妹3 小时前
4月30日多因子共振节点:鲍威尔“收官效应”与权力结构重塑的预期重构
大数据·人工智能·重构·逻辑回归
2601_949925183 小时前
AI Agent如何重构跨境物流的决策?
大数据·人工智能·重构·ai agent·geo优化·物流科技
苍煜3 小时前
分布式事务生产实战选型对比
分布式
xiaoduo AI3 小时前
客服机器人问题解决率怎么统计?Agent系统自动判断是否解决,比人工回访准?
大数据·人工智能·机器人
猫的玖月4 小时前
(一)MY SQL概述
数据库·sql
小五兄弟4 小时前
YouTube 肖像检测扩展背后:短剧出海版权保护的技术实现与实战策略
大数据·人工智能
阿瑞说项目管理5 小时前
2026 实战入门指南:企业 Agent 到底能解决哪些工作问题?
大数据·人工智能·agent·智能体·企业级ai