ClickHouse 的 SQL 优化规则是基于RBO(Rule Based Optimization),下面是一些优化规则
1 准备测试用表
1 ) 上传官方的数据集
将visits_v1.tar和hits_v1.tar上传到虚拟机,解压到clickhouse数据路径下
// 解压到clickhouse数据路径
sudo tar -xvf hits_v1.tar -C /var/lib/clickhouse`
`sudo tar -xvf visits_v1.tar -C /var/lib/clickhouse`
`//修改所属用户`
`sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/data/datasets`
`sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/metadata/datasets`
`
2 ) 重启 clickhouse-server
sudo clickhouse restart
3 ) 执行查询
clickhouse-client --query "SELECT COUNT(*) FROM datasets.hits_v1"
clickhouse-client --query "SELECT COUNT(*) FROM datasets.visits_v1"
注意:官方的tar包,包含了建库、建表语句、数据内容,这种方式不需要手动建库、建表,最方便。
hits _v1 表有1 30 多个字段,8 80 多万条数据
visits_ v1 表有1 80 多个字段,1 60 多万条数据
2 COUNT 优化
在调用 count 函数时,如果使用的是 count() 或者 count(*),且没有 where 条件,则会直接使用 system.tables 的 total_rows,例如:
EXPLAIN SELECT count()FROM datasets.hits_v1;`
`Union`
` Expression (Projection)`
` Expression (Before ORDER BY and SELECT)`
` MergingAggregated`
` ReadNothing (Optimized trivial count)`
`
注意 Optimized trivial count ,这是对 count 的优化。
如果 count 具体的列字段,则不会使用此项优化:
EXPLAIN SELECT count(CounterID) FROM datasets.hits_v1;`
`Union`
` Expression (Projection)`
` Expression (Before ORDER BY and SELECT)`
` Aggregating`
` Expression (Before GROUP BY)`
` ReadFromStorage (Read from MergeTree)`
`
3 消除子查询重复字段
下面语句子查询中有两个重复的 id 字段,会被去重:
EXPLAIN SYNTAX SELECT `
` a.UserID,`
` b.VisitID,`
` a.URL,`
` b.UserID`
` FROM`
` hits_v1 AS a `
` LEFT JOIN (`
` SELECT `
` UserID,`
` UserID as HaHa,`
` VisitID `
` FROM visits_v1) AS b `
` USING (UserID)`
` limit 3;`
`//返回优化语句:`
`SELECT `
` UserID,`
` VisitID,`
` URL,`
` b.UserID`
`FROM hits_v1 AS a`
`ALL LEFT JOIN `
`(`
` SELECT `
` UserID,`
` VisitID`
` FROM visits_v1`
`) AS b USING (UserID)`
`LIMIT 3`
`
4 谓词下推
当group by有having子句,但是没有with cube、with rollup 或者with totals修饰的时候,having过滤会下推到where提前过滤。例如下面的查询,HAVING name变成了WHERE name,在group by之前过滤:
EXPLAIN SYNTAX SELECT UserID FROM hits_v1 GROUP BY UserID HAVING UserID =` `'8585742290196126178';`
`//返回优化语句`
`SELECT UserID`
`FROM hits_v1`
`WHERE UserID = \'8585742290196126178\'`
`GROUP BY UserID`
`
子查询也支持谓词下推:
EXPLAIN SYNTAX`
`SELECT *`
`FROM `
`(`
` SELECT UserID`
` FROM visits_v1`
`)`
`WHERE UserID =` `'8585742290196126178'`
`//返回优化后的语句`
`SELECT UserID`
`FROM `
`(`
` SELECT UserID`
` FROM visits_v1`
` WHERE UserID = \'8585742290196126178\'`
`)`
`WHERE UserID = \'8585742290196126178\'
再来一个复杂例子:
//返回优化后的语句`
`SELECT UserID`
`FROM `
`(`
` SELECT UserID`
` FROM `
`(`
` SELECT UserID`
` FROM visits_v1`
` WHERE UserID = \'8585742290196126178\'`
`)`
` WHERE UserID = \'8585742290196126178\'`
` UNION ALL`
` SELECT UserID`
` FROM `
`(`
` SELECT UserID`
` FROM visits_v1`
` WHERE UserID = \'8585742290196126178\'`
`)`
` WHERE UserID = \'8585742290196126178\'`
`)`
`WHERE UserID = \'8585742290196126178\'`
`
5 聚合计算外推
聚合函数内的计算,会外推,例如:
EXPLAIN SYNTAX`
`SELECT sum(UserID *` `2)`
`FROM visits_v1`
`//返回优化后的语句`
`SELECT sum(UserID)` `*` `2`
`FROM visits_v1`
`
6 聚合函数消除
如果对聚合键,也就是 group by key 使用 min、max、any 聚合函数,则将函数消除,例如:
EXPLAIN SYNTAX`
`SELECT`
`sum(UserID *` `2),`
`max(VisitID),`
`max(UserID)`
`FROM visits_v1`
`GROUP BY UserID`
`//返回优化后的语句`
`SELECT `
`sum(UserID)` `*` `2,`
`max(VisitID),`
` UserID`
`FROM visits_v1`
`GROUP BY UserID`
`
7 删除重复的 order by key
例如下面的语句,重复的聚合键 id 字段会被去重:
EXPLAIN SYNTAX`
`SELECT *`
`FROM visits_v1`
`ORDER BY`
` UserID ASC,`
` UserID ASC,`
` VisitID ASC,`
`VisitID ASC`
`//返回优化后的语句:`
`select`
` ......`
`FROM visits_v1`
`ORDER BY `
` UserID ASC,`
`VisitID ASC`
`
8 删除重复的 limit by key
例如下面的语句,重复声明的 name 字段会被去重:
EXPLAIN SYNTAX`
`SELECT *`
`FROM visits_v1`
`LIMIT 3 BY`
` VisitID,`
` VisitID`
`LIMIT 10`
`//返回优化后的语句:`
`select`
` ......`
`FROM visits_v1`
`LIMIT 3 BY VisitID`
`LIMIT 10`
`
9 删除重复的 USING Key
例如下面的语句,重复的关联键 id 字段会被去重:
EXPLAIN SYNTAX`
`SELECT`
` a.UserID,`
` a.UserID,`
` b.VisitID,`
` a.URL,`
` b.UserID`
`FROM hits_v1 AS a`
`LEFT JOIN visits_v1 AS b USING (UserID, UserID)`
`//返回优化后的语句:`
`SELECT `
` UserID,`
` UserID,`
` VisitID,`
` URL,`
` b.UserID`
`FROM hits_v1 AS a`
`ALL LEFT JOIN visits_v1 AS b USING (UserID)`
`
10 标量替换
如果子查询只返回一行数据,在被引用的时候用标量替换,例如下面语句中的 total_disk_usage 字段:
EXPLAIN SYNTAX`
`WITH `
`(`
` SELECT sum(bytes)`
` FROM system.parts`
` WHERE active`
`) AS total_disk_usage`
`SELECT`
`(sum(bytes)` `/ total_disk_usage)` `*` `100 AS table_disk_usage,`
` table`
`FROM system.parts`
`GROUP BY table`
`ORDER BY table_disk_usage DESC`
`LIMIT 10;`
`//返回优化后的语句:`
`WITH CAST(0, \'UInt64\') AS total_disk_usage`
`SELECT `
`(sum(bytes)` `/ total_disk_usage)` `*` `100 AS table_disk_usage,`
` table`
`FROM system.parts`
`GROUP BY table`
`ORDER BY table_disk_usage DESC`
`LIMIT 10`
`
11 三元运算优化
如果开启了 optimize_if_chain_to_multiif 参数,三元运算符会被替换成 multiIf 函数,例如:
EXPLAIN SYNTAX `
`SELECT number =` `1` `?` `'hello'` `:` `(number =` `2` `?` `'world'` `:` `'atguigu')`
`FROM numbers(10)`
`settings optimize_if_chain_to_multiif =` `1;`
`// 返回优化后的语句:`
`SELECT multiIf(number =` `1, \'hello\', number =` `2, \'world\', \'atguigu\')`
`FROM numbers(10)`
`SETTINGS optimize_if_chain_to_multiif =` `1`
`