ClickHouse Syntax Optimization Rules

ClickHouse's SQL optimization is rule-based (RBO, Rule-Based Optimization). Some of these rules are described below.

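1 Preparing the test tables

1 Upload the official datasets

Upload visits_v1.tar and hits_v1.tar to the virtual machine and extract them into the ClickHouse data directory.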

```
# Extract into the ClickHouse data directory
sudo tar -xvf hits_v1.tar -C /var/lib/clickhouse
sudo tar -xvf visits_v1.tar -C /var/lib/clickhouse

# Change the owner
sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/data/datasets
sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/metadata/datasets
```

2 Restart clickhouse-server

sudo clickhouse restart

3 Run the queries

clickhouse-client --query "SELECT COUNT(*) FROM datasets.hits_v1"

clickhouse-client --query "SELECT COUNT(*) FROM datasets.visits_v1"

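Note: the official tar packages contain the database and table DDL as well as the data, so there is no need to create the database or tables by hand. This is the most convenient approach.

The hits_v1 table has more than 130 columns and more than 8.8 million rows.

The visits_v1 table has more than 180 columns and more than 1.6 million rows.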

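2 COUNT optimization

When the count function is called as count() or count(*) with no WHERE condition, ClickHouse reads total_rows from system.tables directly instead of scanning the table, for example: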

```
EXPLAIN SELECT count() FROM datasets.hits_v1;

Union
  Expression (Projection)
    Expression (Before ORDER BY and SELECT)
      MergingAggregated
        ReadNothing (Optimized trivial count)
```

Note the "Optimized trivial count" step in the plan: this is the count optimization.

If count is applied to a specific column, this optimization is not used:

```
EXPLAIN SELECT count(CounterID) FROM datasets.hits_v1;

Union
  Expression (Projection)
    Expression (Before ORDER BY and SELECT)
      Aggregating
        Expression (Before GROUP BY)
          ReadFromStorage (Read from MergeTree)
```
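The difference between the two plans can be pictured with a toy model in Python. This is only a sketch, not ClickHouse's implementation: ToyTable, count_all, and count_column are hypothetical names, the rows are made up, and the only idea carried over is that a row count kept as metadata can answer count() without a scan, while count(column) has to look at the values because NULLs are not counted.

```python
class ToyTable:
    """Hypothetical stand-in for a table that tracks its row count as metadata."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.total_rows = len(self.rows)  # maintained on write, like total_rows
        self.rows_scanned = 0             # how many rows queries had to read

    def count_all(self):
        # count() / count(*) without WHERE: answered from metadata, no scan
        return self.total_rows

    def count_column(self, column):
        # count(column): every row is read, and NULL (None) values are skipped
        matched = 0
        for row in self.rows:
            self.rows_scanned += 1
            if row[column] is not None:
                matched += 1
        return matched


# Made-up rows; the None only illustrates how NULLs are treated.
table = ToyTable([{"CounterID": 57}, {"CounterID": None}, {"CounterID": 62}])
trivial = table.count_all()
scanned_after_trivial = table.rows_scanned
per_column = table.count_column("CounterID")
print(trivial, scanned_after_trivial, per_column, table.rows_scanned)
```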

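3 Eliminating duplicate columns in subqueries

In the statement below, the subquery selects UserID twice (the second time under the alias HaHa). The duplicate is removed: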

```
EXPLAIN SYNTAX SELECT
    a.UserID,
    b.VisitID,
    a.URL,
    b.UserID
FROM hits_v1 AS a
LEFT JOIN (
    SELECT
        UserID,
        UserID AS HaHa,
        VisitID
    FROM visits_v1) AS b
USING (UserID)
LIMIT 3;

-- Optimized statement returned:
SELECT
    UserID,
    VisitID,
    URL,
    b.UserID
FROM hits_v1 AS a
ALL LEFT JOIN
(
    SELECT
        UserID,
        VisitID
    FROM visits_v1
) AS b USING (UserID)
LIMIT 3
```

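4 Predicate pushdown

When a GROUP BY has a HAVING clause but is not modified by WITH CUBE, WITH ROLLUP, or WITH TOTALS, the HAVING filter is pushed down into WHERE so that rows are filtered earlier. In the query below, HAVING UserID becomes WHERE UserID, and the filter runs before the GROUP BY: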

```
EXPLAIN SYNTAX SELECT UserID FROM hits_v1 GROUP BY UserID HAVING UserID = '8585742290196126178';

-- Optimized statement returned:
SELECT UserID
FROM hits_v1
WHERE UserID = \'8585742290196126178\'
GROUP BY UserID
```
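Why this rewrite is safe can be checked on a small scale: when the HAVING condition only refers to the grouping key, filtering before grouping yields the same groups as filtering after. The Python sketch below uses made-up data and hypothetical helper names (group_then_having, where_then_group). It models the equivalence only, not how ClickHouse executes the query.

```python
from collections import Counter


def group_then_having(user_ids, wanted):
    # GROUP BY UserID over all rows, then HAVING UserID = wanted
    groups = Counter(user_ids)
    return {key: n for key, n in groups.items() if key == wanted}


def where_then_group(user_ids, wanted):
    # WHERE UserID = wanted first, then GROUP BY on the surviving rows
    filtered = [u for u in user_ids if u == wanted]
    return dict(Counter(filtered))


rows = ["a", "b", "a", "c", "b", "a"]  # made-up UserID values
late_filter = group_then_having(rows, "a")
early_filter = where_then_group(rows, "a")
print(late_filter == early_filter)
```

The early filter builds a single group instead of three, which is the point of pushing the predicate down.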

Predicate pushdown also works for subqueries:

```
EXPLAIN SYNTAX
SELECT *
FROM
(
    SELECT UserID
    FROM visits_v1
)
WHERE UserID = '8585742290196126178'

-- Optimized statement returned:
SELECT UserID
FROM
(
    SELECT UserID
    FROM visits_v1
    WHERE UserID = \'8585742290196126178\'
)
WHERE UserID = \'8585742290196126178\'
```

A more complex example (only the optimized statement is shown):

```
-- Optimized statement returned:
SELECT UserID
FROM
(
    SELECT UserID
    FROM
    (
        SELECT UserID
        FROM visits_v1
        WHERE UserID = \'8585742290196126178\'
    )
    WHERE UserID = \'8585742290196126178\'
    UNION ALL
    SELECT UserID
    FROM
    (
        SELECT UserID
        FROM visits_v1
        WHERE UserID = \'8585742290196126178\'
    )
    WHERE UserID = \'8585742290196126178\'
)
WHERE UserID = \'8585742290196126178\'
```

5 Moving arithmetic out of aggregate functions

Arithmetic inside an aggregate function is moved outside it, for example:

```
EXPLAIN SYNTAX
SELECT sum(UserID * 2)
FROM visits_v1

-- Optimized statement returned:
SELECT sum(UserID) * 2
FROM visits_v1
```
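The rewrite relies on multiplication distributing over addition: multiplying each row by 2 and summing gives the same total as summing once and multiplying the result. A minimal Python check with made-up values (Python integers do not overflow, so this illustrates only the arithmetic identity):

```python
user_ids = [3, 5, 8]  # made-up values standing in for the UserID column

inside = sum(u * 2 for u in user_ids)  # sum(UserID * 2): one multiplication per row
outside = sum(user_ids) * 2            # sum(UserID) * 2: one multiplication in total

print(inside, outside)
```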

6 Aggregate function elimination

If min, max, or any is applied to an aggregation key (that is, a GROUP BY key), the function is eliminated, for example:

```
EXPLAIN SYNTAX
SELECT
    sum(UserID * 2),
    max(VisitID),
    max(UserID)
FROM visits_v1
GROUP BY UserID

-- Optimized statement returned:
SELECT
    sum(UserID) * 2,
    max(VisitID),
    UserID
FROM visits_v1
GROUP BY UserID
```
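The elimination works because every row in a group carries the same value for the grouping key, so min, max, and any of that key all equal the key itself. The Python sketch below groups a few made-up (UserID, VisitID) pairs to show this. The names are illustrative only.

```python
from collections import defaultdict

# Made-up (UserID, VisitID) rows
visits = [(1, 10), (1, 12), (2, 7)]

groups = defaultdict(list)
for user_id, visit_id in visits:
    groups[user_id].append((user_id, visit_id))  # GROUP BY UserID

# For each group compute max(UserID) and max(VisitID)
result = {
    key: (max(u for u, _ in rows), max(v for _, v in rows))
    for key, rows in groups.items()
}

# max(UserID) always equals the grouping key; max(VisitID) still needs computing
key_equals_max = all(max_user == key for key, (max_user, _) in result.items())
print(result, key_equals_max)
```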

7 Removing duplicate ORDER BY keys

In the statement below, the repeated sort keys UserID and VisitID are deduplicated:

```
EXPLAIN SYNTAX
SELECT *
FROM visits_v1
ORDER BY
    UserID ASC,
    UserID ASC,
    VisitID ASC,
    VisitID ASC

-- Optimized statement returned:
select
    ......
FROM visits_v1
ORDER BY
    UserID ASC,
    VisitID ASC
```
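A repeated key can never break a tie that its first occurrence left unresolved, so dropping it does not change the ordering. The Python sketch below shows an order-preserving deduplication and checks that sorting by the duplicated key list and by the deduplicated one agree. dedupe_keys is a hypothetical helper and the rows are made up. The same reasoning applies to the LIMIT BY and USING keys covered in the following sections.

```python
from operator import itemgetter


def dedupe_keys(keys):
    # Keep the first occurrence of each key, preserving order
    return list(dict.fromkeys(keys))


rows = [
    {"UserID": 2, "VisitID": 1},
    {"UserID": 1, "VisitID": 9},
    {"UserID": 1, "VisitID": 3},
]

duplicated = ["UserID", "UserID", "VisitID", "VisitID"]
deduped = dedupe_keys(duplicated)

sorted_with_duplicates = sorted(rows, key=itemgetter(*duplicated))
sorted_deduped = sorted(rows, key=itemgetter(*deduped))
print(deduped, sorted_with_duplicates == sorted_deduped)
```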

8 Removing duplicate LIMIT BY keys

In the statement below, the VisitID column declared twice in LIMIT BY is deduplicated:

```
EXPLAIN SYNTAX
SELECT *
FROM visits_v1
LIMIT 3 BY
    VisitID,
    VisitID
LIMIT 10

-- Optimized statement returned:
select
    ......
FROM visits_v1
LIMIT 3 BY VisitID
LIMIT 10
```

9 Removing duplicate USING keys

In the statement below, the repeated join key UserID is deduplicated:

```
EXPLAIN SYNTAX
SELECT
    a.UserID,
    a.UserID,
    b.VisitID,
    a.URL,
    b.UserID
FROM hits_v1 AS a
LEFT JOIN visits_v1 AS b USING (UserID, UserID)

-- Optimized statement returned:
SELECT
    UserID,
    UserID,
    VisitID,
    URL,
    b.UserID
FROM hits_v1 AS a
ALL LEFT JOIN visits_v1 AS b USING (UserID)
```

10 Scalar replacement

If a subquery returns only a single row, references to it are replaced with a scalar value, as with total_disk_usage in the statement below:

```
EXPLAIN SYNTAX
WITH
    (
        SELECT sum(bytes)
        FROM system.parts
        WHERE active
    ) AS total_disk_usage
SELECT
    (sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
    table
FROM system.parts
GROUP BY table
ORDER BY table_disk_usage DESC
LIMIT 10;

-- Optimized statement returned:
WITH CAST(0, \'UInt64\') AS total_disk_usage
SELECT
    (sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
    table
FROM system.parts
GROUP BY table
ORDER BY table_disk_usage DESC
LIMIT 10
```
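The effect of the replacement can be sketched in Python: the single-row subquery is evaluated once up front, and the outer query then works with a plain constant. The part sizes below are made up and the variable names are illustrative. The sketch models the idea only, not the contents of system.parts.

```python
# Made-up (table, bytes) rows standing in for active parts
parts = [("hits_v1", 600), ("visits_v1", 250), ("hits_v1", 150)]

# The scalar subquery: evaluated once, then treated as a constant
total_disk_usage = sum(size for _, size in parts)

# The outer query: GROUP BY table, using the constant in each group's expression
bytes_per_table = {}
for table, size in parts:
    bytes_per_table[table] = bytes_per_table.get(table, 0) + size

table_disk_usage = {
    table: (size / total_disk_usage) * 100
    for table, size in bytes_per_table.items()
}
print(total_disk_usage, table_disk_usage)
```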

11 Ternary operator optimization

If the optimize_if_chain_to_multiif setting is enabled, chained ternary operators are replaced with the multiIf function, for example:

```
EXPLAIN SYNTAX
SELECT number = 1 ? 'hello' : (number = 2 ? 'world' : 'atguigu')
FROM numbers(10)
SETTINGS optimize_if_chain_to_multiif = 1;

-- Optimized statement returned:
SELECT multiIf(number = 1, \'hello\', number = 2, \'world\', \'atguigu\')
FROM numbers(10)
SETTINGS optimize_if_chain_to_multiif = 1
```
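The two forms are equivalent because a chain of nested conditionals and a flat list of (condition, value) pairs with a final default pick the same branch. The Python sketch below uses a hypothetical multi_if helper written for illustration. It is not ClickHouse's multiIf, and unlike a nested conditional it evaluates all of its arguments before choosing.

```python
def multi_if(*args):
    # Arguments: cond1, value1, cond2, value2, ..., default
    *pairs, default = args
    for cond, value in zip(pairs[0::2], pairs[1::2]):
        if cond:
            return value  # first true condition wins
    return default


def nested(number):
    # Same logic as: number = 1 ? 'hello' : (number = 2 ? 'world' : 'atguigu')
    return "hello" if number == 1 else ("world" if number == 2 else "atguigu")


flat = [multi_if(n == 1, "hello", n == 2, "world", "atguigu") for n in range(10)]
chained = [nested(n) for n in range(10)]
print(flat == chained)
```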