Contents
- Abstract
- 1. SQL Query Basics
  - 1.1 Query Syntax Overview
  - 1.2 Preparing Test Data
- 2. Basic Queries
  - 2.1 The SELECT Statement
  - 2.2 Filtering with WHERE
  - 2.3 Sorting with ORDER BY
  - 2.4 Limiting Rows with LIMIT
- 3. Aggregate Queries
  - 3.1 Basic Aggregate Functions
  - 3.2 Grouping with GROUP BY
  - 3.3 Filtering Groups with HAVING
  - 3.4 Deduplication with DISTINCT
- 4. Multi-Table Joins
  - 4.1 Join Types
  - 4.2 Join Query Examples
  - 4.3 Join Performance Tuning
- 5. Subqueries
  - 5.1 Scalar Subqueries
  - 5.2 List Subqueries
  - 5.3 Table Subqueries
  - 5.4 EXISTS Subqueries
- 6. Window Functions
  - 6.1 Ranking Functions
  - 6.2 Aggregate Window Functions
  - 6.3 Partitioned Window Functions
  - 6.4 Window Function Use Cases
- 7. CASE Expressions
  - 7.1 Simple CASE
  - 7.2 Searched CASE
  - 7.3 The IIF Function
- 8. Industrial IoT Case Studies
  - 8.1 Device Status Monitoring
  - 8.2 Abnormal Device Detection
  - 8.3 Device Health Score
  - 8.4 Trend Analysis
- 9. Query Performance Tuning
  - 9.1 Execution Plan Analysis
  - 9.2 Partition Pruning
  - 9.3 Tuning Recommendations
  - 9.4 Query Cache
- 10. Summary
- References
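Abstract
This article walks through DolphinDB's SQL query capabilities, from basic query syntax to advanced features. It covers conditional filtering, grouping and aggregation, multi-table joins, subqueries, and window functions, with numerous worked examples drawn from industrial IoT scenarios, and it closes with query performance tuning techniques. It is intended for engineers who do data analysis and report development on DolphinDB.
1. SQL Query Basics
1.1 Query Syntax Overview
DolphinDB supports standard SQL syntax and extends it with time-series processing capabilities. A query is built from the following clauses, written in this order:
| Clause | Purpose |
|---|---|
| SELECT | Columns to return |
| FROM | Data source |
| WHERE | Row filter |
| GROUP BY | Grouping |
| HAVING | Group filter |
| ORDER BY | Sorting |
| LIMIT | Row-count limit |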
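To make the role of each clause concrete, the following is a minimal plain-Python sketch (not DolphinDB code) of the conventional logical evaluation order in standard SQL: WHERE filters rows, GROUP BY collects them, HAVING filters groups, and ORDER BY and LIMIT shape the final result. The function `run_query` and the sample rows are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a sensor table.
ROWS = [
    {"device_id": 1, "temperature": 31.0},
    {"device_id": 1, "temperature": 29.0},
    {"device_id": 2, "temperature": 24.0},
    {"device_id": 2, "temperature": 26.0},
    {"device_id": 3, "temperature": 35.0},
]


def run_query(rows, min_temp, min_count, limit):
    """Mimic: select device_id, avg(temperature) as avg_temp
    where temperature > min_temp group by device_id
    having count(*) >= min_count order by avg_temp desc limit `limit`."""
    # WHERE: filter individual rows before any grouping happens.
    kept = [r for r in rows if r["temperature"] > min_temp]
    # GROUP BY: collect the temperatures of each device.
    groups = defaultdict(list)
    for r in kept:
        groups[r["device_id"]].append(r["temperature"])
    # HAVING filters whole groups; SELECT computes the output columns.
    result = [
        {"device_id": dev, "avg_temp": sum(vals) / len(vals)}
        for dev, vals in groups.items()
        if len(vals) >= min_count
    ]
    # ORDER BY, then LIMIT.
    result.sort(key=lambda r: r["avg_temp"], reverse=True)
    return result[:limit]
```

Raising `min_count` to 2 in this sketch drops devices 2 and 3, because only one of their rows survives the WHERE step; this is the practical difference between a row filter and a group filter.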
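1.2 Preparing Test Data
python
// Create the device sensor database, partitioned by date and by device ID.
// A COMPO database takes a list of database handles, one per partitioning level.
dbDate = database("", VALUE, 2024.01.01..2024.12.31)
dbDevice = database("", VALUE, 1..100)
db = database("dfs://sensor_db", COMPO, [dbDate, dbDevice])
// Define the table schema
schema = table(1:0,
    `device_id`timestamp`temperature`humidity`pressure`vibration`power`status,
    [INT, TIMESTAMP, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, SYMBOL])
// Create the partitioned (distributed) table
db.createPartitionedTable(schema, `sensor_data, `timestamp`device_id)
// Insert test data
n = 100000
data = table(
    rand(1..100, n) as device_id,
    2024.01.01T00:00:00.000 + rand(365 * 86400000l, n) as timestamp,  // random instant within 365 days
    20.0 + rand(15.0, n) as temperature,    // uniform in [20, 35)
    40.0 + rand(30.0, n) as humidity,       // uniform in [40, 70)
    1000.0 + rand(20.0, n) as pressure,     // uniform in [1000, 1020)
    rand(5.0, n) as vibration,              // uniform in [0, 5)
    100.0 + rand(400.0, n) as power,        // uniform in [100, 500)
    take(`normal`warning`error, n) as status
)
loadTable("dfs://sensor_db", "sensor_data").append!(data)
// Create the device information table (a dimension table) and load its rows
device_info = table(
    1..100 as device_id,
    "device" + string(1..100) as device_name,
    take(["车间A", "车间B", "车间C"], 100) as location,  // workshops A, B, and C
    take(["温度传感器", "湿度传感器", "压力传感器"], 100) as device_type  // temperature, humidity, and pressure sensors
)
db.createTable(device_info, `device_info).append!(device_info)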
2. Basic Queries
2.1 The SELECT Statement
python
// Select all columns
select * from loadTable("dfs://sensor_db", "sensor_data") limit 10
// Select specific columns
select device_id, timestamp, temperature
from loadTable("dfs://sensor_db", "sensor_data")
limit 10
// Column aliases
select device_id as device,
       timestamp as ts,
       temperature as temp
from loadTable("dfs://sensor_db", "sensor_data")
limit 10
// Computed column (Celsius to Fahrenheit)
select device_id,
       temperature,
       temperature * 9 / 5 + 32 as temperature_fahrenheit
from loadTable("dfs://sensor_db", "sensor_data")
limit 10
2.2 Filtering with WHERE
python
// Single condition
select * from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1
limit 10
// Multiple conditions combined with AND
select * from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1 and temperature > 25
limit 10
// Multiple conditions combined with OR
select * from loadTable("dfs://sensor_db", "sensor_data")
where temperature > 32 or humidity > 65
limit 10
// Range query (explicit TIMESTAMP bounds so that all of January 31 is included)
select * from loadTable("dfs://sensor_db", "sensor_data")
where timestamp between 2024.01.01T00:00:00.000 and 2024.01.31T23:59:59.999
limit 10
// IN condition
select * from loadTable("dfs://sensor_db", "sensor_data")
where device_id in [1, 2, 3, 4, 5]
limit 10
// LIKE pattern matching
select * from loadTable("dfs://sensor_db", "device_info")
where device_name like "device1%"
// NULL check
select * from loadTable("dfs://sensor_db", "sensor_data")
where temperature is not null
limit 10
2.3 Sorting with ORDER BY
python
// Sort by a single column
select * from loadTable("dfs://sensor_db", "sensor_data")
order by timestamp desc
limit 10
// Sort by multiple columns
select * from loadTable("dfs://sensor_db", "sensor_data")
order by device_id asc, timestamp desc
limit 10
// Sort by a computed column
select device_id, temperature, humidity, temperature + humidity as total
from loadTable("dfs://sensor_db", "sensor_data")
order by total desc
limit 10
2.4 Limiting Rows with LIMIT
python
// Limit the number of rows
select * from loadTable("dfs://sensor_db", "sensor_data")
limit 100
// Pagination (skip the first 100 rows, then take 100)
select * from loadTable("dfs://sensor_db", "sensor_data")
limit 100, 100
// First N rows (the top keyword)
select top 10 * from loadTable("dfs://sensor_db", "sensor_data")
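The two-argument form `limit offset, count` behaves like a slice over the result rows. The plain-Python sketch below (not DolphinDB code) shows that correspondence and how a 1-based page number maps to an offset; `paginate` and `page_bounds` are hypothetical helper names. As in SQL generally, pages are only stable when the query also fixes a row order.

```python
def paginate(rows, offset, count):
    """Return `count` rows after skipping `offset` rows, like `limit offset, count`."""
    return rows[offset:offset + count]


def page_bounds(page, page_size):
    """Translate a 1-based page number into an (offset, count) pair."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return (page - 1) * page_size, page_size
```

For example, page 2 with a page size of 100 maps to `limit 100, 100`.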
3. Aggregate Queries
3.1 Basic Aggregate Functions
python
// Common aggregate functions
select count(*) as total_records,
       sum(temperature) as sum_temp,
       avg(temperature) as avg_temp,
       max(temperature) as max_temp,
       min(temperature) as min_temp,
       std(temperature) as std_temp,
       var(temperature) as var_temp,
       med(temperature) as median_temp
from loadTable("dfs://sensor_db", "sensor_data")
| Aggregate function | Description | Example |
|---|---|---|
| count() | Row count | count(*) |
| sum() | Sum | sum(temperature) |
| avg() | Mean | avg(temperature) |
| max() | Maximum | max(temperature) |
| min() | Minimum | min(temperature) |
| std() | Standard deviation | std(temperature) |
| var() | Variance | var(temperature) |
| med() | Median | med(temperature) |
| first() | First value | first(temperature) |
| last() | Last value | last(temperature) |
3.2 Grouping with GROUP BY
python
// Group by a single column
select device_id,
       count(*) as record_count,
       avg(temperature) as avg_temp,
       max(temperature) as max_temp,
       min(temperature) as min_temp
from loadTable("dfs://sensor_db", "sensor_data")
group by device_id
order by avg_temp desc
// Group by multiple columns
select device_id,
       date(timestamp) as date,
       count(*) as record_count,
       avg(temperature) as avg_temp
from loadTable("dfs://sensor_db", "sensor_data")
group by device_id, date(timestamp)
// Group by time window (1H is a one-hour DURATION; 1h would be the SHORT integer 1)
select device_id,
       bar(timestamp, 1H) as hour,
       avg(temperature) as avg_temp,
       max(temperature) as max_temp,
       min(temperature) as min_temp
from loadTable("dfs://sensor_db", "sensor_data")
where timestamp between 2024.01.01T00:00:00.000 and 2024.01.02T00:00:00.000
group by device_id, bar(timestamp, 1H)
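Time-window grouping relies on `bar` flooring each timestamp to the start of its interval, which for a numeric interval amounts to `X - (X mod interval)`. The following plain-Python sketch (not DolphinDB code) reproduces that arithmetic on epoch milliseconds; the helper `to_ms` and the constant `HOUR_MS` are assumptions introduced for the example.

```python
from datetime import datetime, timezone

HOUR_MS = 3_600_000  # one hour in milliseconds


def bar(x, interval):
    """Floor x to the start of its bucket: x - (x mod interval)."""
    return x - x % interval


def to_ms(dt):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)


reading_time = to_ms(datetime(2024, 1, 1, 10, 47, 13, tzinfo=timezone.utc))
bucket_start = bar(reading_time, HOUR_MS)
```

Every reading taken between 10:00:00.000 and 10:59:59.999 lands in the same bucket, so grouping on the bucket start yields one row per device per hour.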
3.3 Filtering Groups with HAVING
python
// Filter the grouped result
select device_id,
       count(*) as record_count,
       avg(temperature) as avg_temp
from loadTable("dfs://sensor_db", "sensor_data")
group by device_id
having count(*) > 1000 and avg(temperature) > 27
order by avg_temp desc
// HAVING with several aggregate conditions
select device_id,
       avg(temperature) as avg_temp,
       std(temperature) as std_temp
from loadTable("dfs://sensor_db", "sensor_data")
group by device_id
having avg(temperature) > 25 and std(temperature) < 5
3.4 Deduplication with DISTINCT
python
// Distinct values of one column
select distinct device_id
from loadTable("dfs://sensor_db", "sensor_data")
// Distinct combinations of several columns
select distinct device_id, status
from loadTable("dfs://sensor_db", "sensor_data")
// Count of distinct values
select count(distinct device_id) as device_count
from loadTable("dfs://sensor_db", "sensor_data")
4. Multi-Table Joins
4.1 Join Types
A join combines table A with another table according to one of five join types to produce the result: INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN, and CROSS JOIN.
4.2 Join Query Examples
python
// Load the tables
t1 = loadTable("dfs://sensor_db", "sensor_data")
t2 = loadTable("dfs://sensor_db", "device_info")
// Left join (lj function); columns are referenced by name because lj() defines no table aliases
select device_id, device_name, location, temperature, timestamp
from lj(t1, t2, `device_id)
limit 10
// Inner join
select s.device_id, d.device_name, s.temperature
from t1 s
inner join t2 d on s.device_id = d.device_id
limit 10
// Full join (fj function)
select device_id, device_name, temperature
from fj(t1, t2, `device_id)
limit 10
// Joining more than two tables
t3 = table(1..100 as device_id,
    rand(0..100, 100) as health_score)
select device_id, device_name, health_score, temperature
from lj(lj(t1, t2, `device_id), t3, `device_id)
limit 10
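The join types differ only in which unmatched rows they keep. The plain-Python sketch below (not DolphinDB code) shows inner, left, and full joins on a single key, under the assumption that the key is unique in the right-hand table, as `device_id` is in `device_info`. The function names and sample rows are hypothetical.

```python
def inner_join(left, right, key):
    """Keep only left rows that have a match on the right."""
    index = {r[key]: r for r in right}  # assumes `key` is unique on the right
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


def left_join(left, right, key):
    """Keep every left row; unmatched right-hand columns become None."""
    index = {r[key]: r for r in right}
    right_cols = [c for c in right[0] if c != key] if right else []
    empty = dict.fromkeys(right_cols)
    return [{**l, **index.get(l[key], empty)} for l in left]


def full_join(left, right, key):
    """Left join plus the right rows that matched nothing on the left."""
    rows = left_join(left, right, key)
    seen = {l[key] for l in left}
    left_cols = [c for c in left[0] if c != key] if left else []
    for r in right:
        if r[key] not in seen:
            rows.append({**dict.fromkeys(left_cols), **r})
    return rows


readings = [
    {"device_id": 1, "temperature": 30.5},
    {"device_id": 2, "temperature": 22.0},
]
devices = [
    {"device_id": 1, "device_name": "device1"},
    {"device_id": 3, "device_name": "device3"},
]
```

With these rows the inner join returns one row, the left join two, and the full join three, the third being device 3 with no temperature.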
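4.3 Join Performance Tuning
| Recommendation | Description |
|---|---|
| Let the small table drive the large one | Put the smaller table on the left side of the join |
| Join on indexed columns | The join columns should be indexed |
| Avoid full joins | Prefer left joins |
| Reduce the number of joined tables | Try not to exceed five tables |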
5. Subqueries
5.1 Scalar Subqueries
python
// The subquery returns a single value (exec returns a scalar, whereas select returns a table)
select * from loadTable("dfs://sensor_db", "sensor_data")
where temperature > (
    exec avg(temperature) from loadTable("dfs://sensor_db", "sensor_data")
)
limit 10
5.2 List Subqueries
python
// IN subquery (exec returns the device_id column as a vector)
select * from loadTable("dfs://sensor_db", "sensor_data")
where device_id in (
    exec device_id from loadTable("dfs://sensor_db", "device_info")
    where location = "车间A"
)
limit 10
5.3 Table Subqueries
python
// Subquery in the FROM clause
select device_id, avg_temp
from (
    select device_id, avg(temperature) as avg_temp
    from loadTable("dfs://sensor_db", "sensor_data")
    group by device_id
)
where avg_temp > 27
order by avg_temp desc
5.4 EXISTS Subqueries
python
// EXISTS tests whether any matching row exists
select * from loadTable("dfs://sensor_db", "device_info") d
where exists (
    select * from loadTable("dfs://sensor_db", "sensor_data") s
    where s.device_id = d.device_id and s.temperature > 30
)
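An EXISTS filter acts as a semi-join: each outer row is returned at most once, however many inner rows match, which is what distinguishes it from an inner join. A minimal plain-Python sketch of that behaviour (not DolphinDB code; the function name and sample rows are hypothetical):

```python
def devices_with_hot_reading(devices, readings, threshold):
    """Return the devices that have at least one reading above `threshold`."""
    hot_ids = {r["device_id"] for r in readings if r["temperature"] > threshold}
    return [d for d in devices if d["device_id"] in hot_ids]


devices = [
    {"device_id": 1, "device_name": "device1"},
    {"device_id": 2, "device_name": "device2"},
]
readings = [
    {"device_id": 1, "temperature": 31.0},
    {"device_id": 1, "temperature": 33.0},
    {"device_id": 2, "temperature": 25.0},
]
```

Device 1 has two readings above 30 but appears only once in the result.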
6. Window Functions
6.1 Ranking Functions
python
// Note: the OVER clause is only available in DolphinDB releases that support SQL analytic functions
// ROW_NUMBER: sequential row number
select device_id, timestamp, temperature,
       row_number() over (order by timestamp) as row_num
from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1
limit 10
// RANK: ranking with gaps after ties
select device_id, temperature,
       rank() over (order by temperature desc) as temp_rank
from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1
limit 10
// DENSE_RANK: ranking without gaps
select device_id, temperature,
       dense_rank() over (order by temperature desc) as temp_rank
from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1
limit 10
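The three ranking functions differ only in how they treat ties. The plain-Python sketch below (not DolphinDB code) computes all three numbers for a list of values using the standard SQL definitions; the function `ranking` is a hypothetical helper.

```python
def ranking(values, descending=True):
    """Return (row_number, rank, dense_rank) for each value, in input order."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    out = [None] * len(values)
    previous = object()  # sentinel that equals no real value
    rank = dense = 0
    for position, i in enumerate(order, start=1):
        if values[i] != previous:
            rank = position  # rank jumps to the current position, leaving gaps
            dense += 1       # dense_rank advances by exactly one
            previous = values[i]
        out[i] = (position, rank, dense)
    return out


temperatures = [30.0, 32.0, 32.0, 28.0]
ranks = ranking(temperatures)
```

The two readings of 32.0 share rank 1; the next value gets rank 3 under `rank` but 2 under `dense_rank`, while `row_number` stays unique throughout.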
6.2 Aggregate Window Functions
python
// Cumulative aggregates: the cum* functions run within each context by group, in csort order
select device_id, timestamp, temperature,
       cumsum(temperature) as cum_sum,
       cumavg(temperature) as cum_avg,
       cummax(temperature) as cum_max,
       cummin(temperature) as cum_min
from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1
context by device_id csort timestamp
limit 20
// Moving-window aggregates: the m* functions take the window length in rows
select device_id, timestamp, temperature,
       mavg(temperature, 5) as moving_avg_5,
       msum(temperature, 10) as moving_sum_10,
       mmax(temperature, 5) as moving_max_5
from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1
context by device_id csort timestamp
limit 20
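As a rough illustration of what cumulative and moving aggregates compute, here is a plain-Python sketch (not DolphinDB code). It assumes a count-based window and assumes that positions with fewer than `window` preceding values yield a null, represented here as `None`; both function names mirror the DolphinDB ones only for readability.

```python
from itertools import accumulate


def cumsum(values):
    """Running total: element i is the sum of values[0..i]."""
    return list(accumulate(values))


def mavg(values, window):
    """Moving average over the last `window` values; None until the window fills."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # assumption: an incomplete window yields a null
        else:
            chunk = values[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out
```

Both functions depend on row order, which is why the data must be sorted by time before they are applied.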
6.3 Partitioned Window Functions
python
// Compute per device: context by partitions the rows by device_id.
// rank() is zero-based, hence the + 1; with context by, limit applies to each group.
select device_id, timestamp, temperature,
       rank(temperature, false) + 1 as device_rank,
       mavg(temperature, 5) as moving_avg
from loadTable("dfs://sensor_db", "sensor_data")
where timestamp between 2024.01.01T00:00:00.000 and 2024.01.02T00:00:00.000
context by device_id csort timestamp
limit 50
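6.4 Window Function Use Cases
| Function | Use case | Description |
|---|---|---|
| row_number() | Pagination, deduplication | Assigns a unique number to each row |
| rank() | Leaderboards | Equal values share a rank; gaps follow ties |
| dense_rank() | Consecutive ranking | Equal values share a rank; no gaps |
| mavg() | Moving average | Smoothing and trend analysis |
| cumsum() | Cumulative sum | Running totals |
| first()/last() | First and last values | Returns the first or last value in the window |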
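7. CASE Expressions
7.1 Simple CASE
python
// CASE WHEN expression. Threshold tests need the form `case when <condition>`;
// the form `case <expr> when <value>` only tests equality.
select device_id, temperature,
       case
           when temperature < 20 then "low"
           when temperature < 25 then "normal"
           when temperature < 30 then "elevated"
           else "high"
       end as temp_level
from loadTable("dfs://sensor_db", "sensor_data")
limit 10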
7.2 Searched CASE
python
// CASE with compound conditions; the first matching branch wins
select device_id, temperature, humidity, status,
       case
           when temperature > 32 and status = `error then "critical"
           when temperature > 30 or status = `warning then "warning"
           when temperature > 28 then "elevated"
           else "normal"
       end as alert_level
from loadTable("dfs://sensor_db", "sensor_data")
limit 20
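Because a searched CASE returns the first branch whose condition holds, the order of the branches is part of the logic. The plain-Python sketch below (not DolphinDB code) restates the alert rules as an if-chain so that edge cases can be checked; the English labels are illustrative stand-ins for the alert levels.

```python
def alert_level(temperature, status):
    """Evaluate the alert rules top to bottom, returning the first match."""
    if temperature > 32 and status == "error":
        return "critical"
    if temperature > 30 or status == "warning":
        return "warning"
    if temperature > 28:
        return "elevated"
    return "normal"
```

Walking through a few inputs shows one consequence worth reviewing: a device reporting `error` at 20 degrees falls through every branch and is labelled normal, because the `error` status is only tested together with a high temperature.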
7.3 The IIF Function
python
// IIF as a compact conditional expression
select device_id, temperature,
       iif(temperature > 28, "high", "normal") as temp_status
from loadTable("dfs://sensor_db", "sensor_data")
limit 10
// Nested IIF
select device_id, temperature,
       iif(temperature > 30, "high",
           iif(temperature > 25, "normal", "low")) as temp_status
from loadTable("dfs://sensor_db", "sensor_data")
limit 10
8. Industrial IoT Case Studies
8.1 Device Status Monitoring
python
// Latest status of each device, from today's records
select device_id,
       last(temperature) as current_temp,
       last(humidity) as current_humidity,
       last(status) as current_status,
       count(*) as today_records
from loadTable("dfs://sensor_db", "sensor_data")
where date(timestamp) = today()
group by device_id
order by current_status desc, current_temp desc
8.2 Abnormal Device Detection
python
// Detect abnormal devices (high-temperature or error counts above a threshold)
select device_id,
       count(*) as total_records,
       sum(iif(temperature > 30, 1, 0)) as high_temp_count,
       sum(iif(status = `error, 1, 0)) as error_count,
       avg(temperature) as avg_temp
from loadTable("dfs://sensor_db", "sensor_data")
where timestamp > now() - 3600 * 1000 // last hour
group by device_id
having sum(iif(temperature > 30, 1, 0)) > 10 or sum(iif(status = `error, 1, 0)) > 5
order by error_count desc, high_temp_count desc
8.3 Device Health Score
python
// Compute a health score per device.
// Weights: temperature stability (40%) + humidity reasonableness (30%) + power stability (30%)
select device_id,
       avg(temperature) as avg_temp,
       std(temperature) as temp_stability,
       avg(humidity) as avg_humidity,
       avg(power) as avg_power,
       (100 - std(temperature) * 10) * 0.4 +
       (100 - abs(avg(humidity) - 50) * 2) * 0.3 +
       (100 - std(power) * 0.5) * 0.3 as health_score
from loadTable("dfs://sensor_db", "sensor_data")
where timestamp > today()
group by device_id
order by health_score desc
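The scoring formula can be checked by hand on small inputs. The plain-Python sketch below (not DolphinDB code) reproduces the arithmetic, assuming `std` means the sample standard deviation, which is what `statistics.stdev` computes. The `clamp` helper is not part of the original query; it is a suggestion, since nothing in the formula keeps the score within 0 to 100.

```python
from statistics import mean, stdev


def health_score(temps, humidities, powers):
    """Weighted score: 40% temperature stability, 30% humidity, 30% power stability."""
    temp_part = (100 - stdev(temps) * 10) * 0.4
    humidity_part = (100 - abs(mean(humidities) - 50) * 2) * 0.3
    power_part = (100 - stdev(powers) * 0.5) * 0.3
    return temp_part + humidity_part + power_part


def clamp(score, low=0.0, high=100.0):
    """Hypothetical post-processing step that bounds the score."""
    return max(low, min(high, score))


steady = health_score([25.0, 25.0, 25.0], [50.0, 50.0], [300.0, 300.0, 300.0])
erratic = health_score([0.0, 100.0], [50.0, 50.0], [300.0, 300.0])
```

A perfectly steady device scores 100, while a large temperature spread drives the raw score below zero, so clamping or rescaling may be worth considering before the score is shown on a dashboard.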
8.4 Trend Analysis
python
// Hourly temperature trend (1H is a one-hour DURATION; 1h would be the SHORT integer 1)
select device_id,
       bar(timestamp, 1H) as hour,
       avg(temperature) as avg_temp,
       max(temperature) as max_temp,
       min(temperature) as min_temp,
       max(temperature) - min(temperature) as temp_range,
       count(*) as sample_count
from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1
  and timestamp between 2024.01.01T00:00:00.000 and 2024.01.02T00:00:00.000
group by device_id, bar(timestamp, 1H)
order by hour
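9. Query Performance Tuning
9.1 Execution Plan Analysis
python
// View the execution plan: with the [HINT_EXPLAIN] hint the query returns its plan as a JSON string instead of the result set
select [HINT_EXPLAIN] * from loadTable("dfs://sensor_db", "sensor_data")
where timestamp between 2024.01.01T00:00:00.000 and 2024.01.31T23:59:59.999
  and device_id = 1
// Plan for the same query without the device_id filter, for comparison
select [HINT_EXPLAIN] * from loadTable("dfs://sensor_db", "sensor_data")
where timestamp between 2024.01.01T00:00:00.000 and 2024.01.31T23:59:59.999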
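9.2 Partition Pruning
When a query's conditions include the partitioning keys, partition pruning restricts the scan to the relevant partitions and the query responds quickly. When they do not, the query falls back to a full table scan and performs poorly.
python
// Good: the query benefits from partition pruning
select * from loadTable("dfs://sensor_db", "sensor_data")
where timestamp between 2024.01.01T00:00:00.000 and 2024.01.31T23:59:59.999 // partitioning key
  and device_id = 1 // partitioning key
// Bad: the query cannot use partition pruning
select * from loadTable("dfs://sensor_db", "sensor_data")
where temperature > 30 // not a partitioning key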
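The effect of pruning can be estimated by counting partitions. The plain-Python sketch below (not DolphinDB code) models a hypothetical layout with one partition per (day, device) pair and counts how many partitions a filter leaves to scan; `all_partitions` and `prune` are illustrative names, and the real engine's pruning logic is more involved.

```python
from datetime import date, timedelta


def all_partitions(start, days, device_ids):
    """Enumerate one hypothetical partition per (day, device) pair."""
    return [(start + timedelta(d), dev) for d in range(days) for dev in device_ids]


def prune(partitions, date_range=None, device_ids=None):
    """Keep only the partitions that can contain rows matching the filters."""
    selected = []
    for day, dev in partitions:
        if date_range is not None and not (date_range[0] <= day <= date_range[1]):
            continue
        if device_ids is not None and dev not in device_ids:
            continue
        selected.append((day, dev))
    return selected


partitions = all_partitions(date(2024, 1, 1), 366, range(1, 101))
january_device_1 = prune(partitions, (date(2024, 1, 1), date(2024, 1, 31)), {1})
no_key_filter = prune(partitions)
```

In this model the January query for device 1 touches 31 of 36,600 partitions, whereas a filter on temperature alone cannot exclude any partition.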
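9.3 Tuning Recommendations
| Item | Recommendation | Effect |
|---|---|---|
| Partition pruning | Include the partitioning keys in the query conditions | Greatly reduces the data scanned |
| Column pruning | Select only the columns you need | Reduces I/O |
| Index usage | Index frequently queried columns | Speeds up lookups |
| Avoid full scans | Add WHERE conditions | Reduces the data processed |
| Batch queries | Merge multiple queries into one | Reduces network overhead |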
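9.4 Query Cache
python
// No call is needed to turn caching on: partition data read by a query
// stays cached in memory and can be reused by later queries
result = select * from loadTable("dfs://sensor_db", "sensor_data")
where timestamp between 2024.01.01T00:00:00.000 and 2024.01.31T23:59:59.999
// Clear the cache
clearAllCache()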
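10. Summary
This article covered DolphinDB's SQL query capabilities in detail. The key points are:
- Basic queries: SELECT, WHERE, ORDER BY, LIMIT
- Aggregate queries: grouping, aggregate functions, HAVING filters
- Multi-table joins: inner join, left join, full join
- Subqueries: scalar, list, table, and EXISTS subqueries
- Window functions: ranking, moving aggregates, partitioned windows
- Performance tuning: partition pruning, execution plans, query cache
Questions for reflection:
- How should a query be designed to take full advantage of partition pruning?
- What are the applications of window functions in time-series data analysis?
- How can a complex multi-table join query be optimized?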