DolphinDB SQL Queries: From Basics to Advanced

Contents

    • Abstract
    • 1. SQL Query Basics
      • 1.1 Query Syntax Overview
      • 1.2 Preparing Test Data
    • 2. Basic Queries
      • 2.1 The SELECT Statement
      • 2.2 Filtering with WHERE
      • 2.3 Sorting with ORDER BY
      • 2.4 Limiting Rows with LIMIT
    • 3. Aggregate Queries
      • 3.1 Basic Aggregate Functions
      • 3.2 Grouping with GROUP BY
      • 3.3 Filtering Groups with HAVING
      • 3.4 Deduplication with DISTINCT
    • 4. Multi-Table Joins
      • 4.1 Join Types
      • 4.2 Join Query Examples
      • 4.3 Join Performance Optimization
    • 5. Subqueries
      • 5.1 Scalar Subqueries
      • 5.2 List Subqueries
      • 5.3 Table Subqueries
      • 5.4 EXISTS Subqueries
    • 6. Window Functions
      • 6.1 Ranking Functions
      • 6.2 Aggregate Window Functions
      • 6.3 Partitioned Window Functions
      • 6.4 Window Function Use Cases
    • 7. CASE Expressions
      • 7.1 Simple CASE
      • 7.2 Searched CASE
      • 7.3 The IIF Function
    • 8. Industrial IoT Case Studies
      • 8.1 Device Status Monitoring
      • 8.2 Abnormal Device Detection
      • 8.3 Device Health Scoring
      • 8.4 Trend Analysis
    • 9. Query Performance Optimization
      • 9.1 Execution Plan Analysis
      • 9.2 Partition Pruning
      • 9.3 Optimization Recommendations
      • 9.4 Query Caching
    • 10. Summary

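Abstract

This article covers DolphinDB's SQL query capabilities, from basic query syntax to advanced features. It walks through condition filtering, grouping and aggregation, multi-table joins, subqueries, and window functions, with worked examples drawn from industrial IoT scenarios. It also covers query performance tuning to help readers write efficient SQL. The article is aimed at engineers doing data analysis and report development on DolphinDB.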


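1. SQL Query Basics

1.1 Query Syntax Overview

DolphinDB supports standard SQL syntax and extends it with time-series processing capabilities. A query is built from the following clauses:

| Clause | Purpose |
| --- | --- |
| SELECT | Columns to return |
| FROM | Data source |
| WHERE | Row filter conditions |
| GROUP BY | Grouping |
| HAVING | Group filter |
| ORDER BY | Sorting |
| LIMIT | Row limit |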
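
The clauses are written in the order above, but conceptually they are applied in a different order: rows are filtered first, then grouped and aggregated, then groups are filtered, sorted, and truncated. The Python sketch below illustrates that conceptual order on a few made-up rows. The `run_query` helper and the sample data are hypothetical, and the sketch says nothing about how DolphinDB's engine actually executes a query.

```python
from collections import defaultdict

# Made-up sample rows standing in for a sensor table
rows = [
    {"device_id": 1, "temperature": 31.0},
    {"device_id": 1, "temperature": 29.0},
    {"device_id": 2, "temperature": 26.0},
    {"device_id": 2, "temperature": 24.0},
    {"device_id": 3, "temperature": 33.0},
]

def run_query(rows, min_temp, min_avg, limit):
    # WHERE: filter individual rows before any grouping
    filtered = [r for r in rows if r["temperature"] > min_temp]
    # GROUP BY: bucket the surviving rows by device_id
    groups = defaultdict(list)
    for r in filtered:
        groups[r["device_id"]].append(r["temperature"])
    # SELECT with an aggregate: one output row per group
    result = [{"device_id": k, "avg_temp": sum(v) / len(v)} for k, v in groups.items()]
    # HAVING: filter on the aggregated value
    result = [r for r in result if r["avg_temp"] > min_avg]
    # ORDER BY avg_temp desc
    result.sort(key=lambda r: r["avg_temp"], reverse=True)
    # LIMIT: keep the first rows only
    return result[:limit]

top = run_query(rows, min_temp=25.0, min_avg=27.0, limit=2)
```

Because WHERE runs before grouping, the 24.0 reading never contributes to device 2's average, while HAVING removes device 2 only after its average has been computed.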

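1.2 Preparing Test Data

// Create the device sensor database. A COMPO database is built from
// one handle per partitioning dimension: dates by VALUE, device_id by HASH.
dbDate = database("", VALUE, 2024.01.01..2024.12.31)
dbDevice = database("", HASH, [INT, 10])
db = database("dfs://sensor_db", COMPO, [dbDate, dbDevice])

// Define the table schema
schema = table(1:0,
    `device_id`timestamp`temperature`humidity`pressure`vibration`power`status,
    [INT, TIMESTAMP, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, SYMBOL])

// Create the partitioned table
db.createPartitionedTable(schema, `sensor_data, `timestamp`device_id)

// Insert test data
n = 100000
data = table(
    rand(1..100, n) as device_id,
    2024.01.01T00:00:00.000 + rand(365 * 86400000l, n) as timestamp,  // random millisecond offset within 365 days
    20.0 + rand(15.0, n) as temperature,    // 20 to 35
    40.0 + rand(30.0, n) as humidity,       // 40 to 70
    1000.0 + rand(20.0, n) as pressure,     // 1000 to 1020
    rand(5.0, n) as vibration,              // 0 to 5
    100.0 + rand(400.0, n) as power,        // 100 to 500
    take(`normal`warning`error, n) as status
)
loadTable("dfs://sensor_db", "sensor_data").append!(data)

// Create the device information table
device_info = table(
    1..100 as device_id,
    "device" + string(1..100) as device_name,
    take(`车间A`车间B`车间C, 100) as location,                   // workshops A, B, C
    take(`温度传感器`湿度传感器`压力传感器, 100) as device_type  // temperature, humidity, pressure sensors
)
// createTable only uses device_info as a schema, so append the rows explicitly
db.createTable(device_info, `device_info).append!(device_info)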

2. Basic Queries

2.1 The SELECT Statement

// Select all columns
select * from loadTable("dfs://sensor_db", "sensor_data") limit 10

// Select specific columns
select device_id, timestamp, temperature 
from loadTable("dfs://sensor_db", "sensor_data") 
limit 10

// Column aliases
select device_id as device, 
       timestamp as ts, 
       temperature as temp
from loadTable("dfs://sensor_db", "sensor_data")
limit 10

// Calculated column: Celsius to Fahrenheit
select device_id, 
       temperature,
       temperature * 9 / 5 + 32 as temperature_fahrenheit
from loadTable("dfs://sensor_db", "sensor_data")
limit 10

2.2 Filtering with WHERE

// Single condition
select * from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1
limit 10

// Multiple conditions combined with AND
select * from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1 and temperature > 25
limit 10

// Multiple conditions combined with OR
select * from loadTable("dfs://sensor_db", "sensor_data")
where temperature > 32 or humidity > 65
limit 10

// Range query
select * from loadTable("dfs://sensor_db", "sensor_data")
where timestamp between 2024.01.01 and 2024.01.31
limit 10

// IN condition
select * from loadTable("dfs://sensor_db", "sensor_data")
where device_id in [1, 2, 3, 4, 5]
limit 10

// LIKE pattern matching
select * from loadTable("dfs://sensor_db", "device_info")
where device_name like "device1%"

// NULL check
select * from loadTable("dfs://sensor_db", "sensor_data")
where temperature is not null
limit 10

2.3 Sorting with ORDER BY

// Sort by a single column
select * from loadTable("dfs://sensor_db", "sensor_data")
order by timestamp desc
limit 10

// Sort by multiple columns
select * from loadTable("dfs://sensor_db", "sensor_data")
order by device_id asc, timestamp desc
limit 10

// Sort by a calculated column
select device_id, temperature, humidity, temperature + humidity as total
from loadTable("dfs://sensor_db", "sensor_data")
order by total desc
limit 10

2.4 Limiting Rows with LIMIT

// Limit the number of rows
select * from loadTable("dfs://sensor_db", "sensor_data")
limit 100

// Pagination: skip the first 100 rows, then return 100
select * from loadTable("dfs://sensor_db", "sensor_data")
limit 100, 100

// First N rows with the top keyword
select top 10 * from loadTable("dfs://sensor_db", "sensor_data")
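
The two-argument form `limit offset, count` is what a paging layer usually generates. As a small illustration, the Python sketch below converts a 1-based page number into that pair. The helper names `page_to_limit` and `limit_clause` are hypothetical and exist only for this example.

```python
def page_to_limit(page, page_size):
    """Return (offset, count) for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return (page - 1) * page_size, page_size

def limit_clause(page, page_size):
    """Render the pair as a 'limit offset, count' clause."""
    offset, count = page_to_limit(page, page_size)
    return f"limit {offset}, {count}"

# Page 2 with 100 rows per page skips the first 100 rows
second_page = limit_clause(2, 100)
```

Paging is only stable when the query also has an `order by`; without one, the rows that land on each page are not guaranteed to be the same between calls.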

3. Aggregate Queries

3.1 Basic Aggregate Functions

// Common aggregate functions
select count(*) as total_records,
       sum(temperature) as sum_temp,
       avg(temperature) as avg_temp,
       max(temperature) as max_temp,
       min(temperature) as min_temp,
       std(temperature) as std_temp,
       var(temperature) as var_temp,
       median(temperature) as median_temp
from loadTable("dfs://sensor_db", "sensor_data")

| Aggregate function | Description | Example |
| --- | --- | --- |
| count() | Row count | count(*) |
| sum() | Sum | sum(temperature) |
| avg() | Mean | avg(temperature) |
| max() | Maximum | max(temperature) |
| min() | Minimum | min(temperature) |
| std() | Standard deviation | std(temperature) |
| var() | Variance | var(temperature) |
| median() | Median | median(temperature) |
| first() | First value | first(temperature) |
| last() | Last value | last(temperature) |
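
To make the table concrete, the Python sketch below computes the same ten statistics for a short made-up list of temperatures using only the standard library. It assumes that `std()` and `var()` use the sample (n - 1) definitions; that assumption is worth checking against the DolphinDB function reference before comparing numbers.

```python
import statistics

# Made-up temperature readings in arrival order
temps = [20.0, 22.0, 24.0, 26.0, 28.0]

summary = {
    "count": len(temps),
    "sum": sum(temps),
    "avg": statistics.mean(temps),
    "max": max(temps),
    "min": min(temps),
    "std": statistics.stdev(temps),      # sample standard deviation (n - 1)
    "var": statistics.variance(temps),   # sample variance (n - 1)
    "median": statistics.median(temps),
    "first": temps[0],                   # first value in arrival order
    "last": temps[-1],                   # last value in arrival order
}
```

Note that `first` and `last` depend on row order, unlike the other aggregates, so they are only meaningful when the input order is well defined (for example, sorted by timestamp).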

3.2 Grouping with GROUP BY

// Group by a single column
select device_id,
       count(*) as record_count,
       avg(temperature) as avg_temp,
       max(temperature) as max_temp,
       min(temperature) as min_temp
from loadTable("dfs://sensor_db", "sensor_data")
group by device_id
order by avg_temp desc

// Group by multiple columns
select device_id,
       date(timestamp) as date,
       count(*) as record_count,
       avg(temperature) as avg_temp
from loadTable("dfs://sensor_db", "sensor_data")
group by device_id, date(timestamp)

// Group by time window (one-hour buckets)
select device_id,
       bar(timestamp, 1h) as hour,
       avg(temperature) as avg_temp,
       max(temperature) as max_temp,
       min(temperature) as min_temp
from loadTable("dfs://sensor_db", "sensor_data")
where timestamp between 2024.01.01 and 2024.01.02
group by device_id, bar(timestamp, 1h)
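
In the last query, `bar(timestamp, 1h)` maps every timestamp to the start of its one-hour bucket, so that all readings in the same hour share a group key. A rough Python equivalent is sketched below, assuming `bar` floors a value to a multiple of the interval (that is, `X - X % interval`). The Python function is a stand-in for illustration, not DolphinDB's implementation.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def bar(ts, interval):
    """Floor ts down to a multiple of interval, counted from the Unix epoch."""
    elapsed = ts - EPOCH
    return ts - (elapsed % interval)

# A reading taken at 10:37:15 falls into the bucket that starts at 10:00:00
bucket = bar(datetime(2024, 1, 1, 10, 37, 15), timedelta(hours=1))
```

The same function with `timedelta(minutes=5)` would produce five-minute buckets, which is how the interval argument controls the granularity of the grouping.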

3.3 Filtering Groups with HAVING

// Filter the grouped result
select device_id,
       count(*) as record_count,
       avg(temperature) as avg_temp
from loadTable("dfs://sensor_db", "sensor_data")
group by device_id
having count(*) > 1000 and avg(temperature) > 27
order by avg_temp desc

// HAVING with several aggregate conditions
select device_id,
       avg(temperature) as avg_temp,
       std(temperature) as std_temp
from loadTable("dfs://sensor_db", "sensor_data")
group by device_id
having avg(temperature) > 25 and std(temperature) < 5

3.4 Deduplication with DISTINCT

// Distinct values of one column
select distinct device_id 
from loadTable("dfs://sensor_db", "sensor_data")

// Distinct combinations of several columns
select distinct device_id, status
from loadTable("dfs://sensor_db", "sensor_data")

// Count of distinct values
select count(distinct device_id) as device_count
from loadTable("dfs://sensor_db", "sensor_data")

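4. Multi-Table Joins

4.1 Join Types

| Keyword | Join type | Rows returned |
| --- | --- | --- |
| INNER JOIN | Inner join | Only rows with a match in both tables |
| LEFT JOIN | Left join | All rows from the left table, plus matches from the right |
| RIGHT JOIN | Right join | All rows from the right table, plus matches from the left |
| FULL JOIN | Full join | All rows from both tables, matched where possible |
| CROSS JOIN | Cross join | Every combination of left and right rows |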
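
The difference between inner, left, and full joins is easiest to see on a tiny dataset. The Python sketch below joins three made-up sensor rows against a device-name lookup. It assumes the right-hand side has at most one row per `device_id`, which is why a plain dictionary is enough; the function names are hypothetical.

```python
# Left table: sensor readings (device 4 has no entry in the lookup)
sensor = [
    {"device_id": 1, "temperature": 30.5},
    {"device_id": 2, "temperature": 27.1},
    {"device_id": 4, "temperature": 25.0},
]
# Right table: device_id -> device_name (device 3 has no readings)
info = {1: "device1", 2: "device2", 3: "device3"}

def inner_join(left, right):
    # Keep only left rows whose key exists on the right
    return [{**row, "device_name": right[row["device_id"]]}
            for row in left if row["device_id"] in right]

def left_join(left, right):
    # Keep every left row; unmatched rows get device_name = None
    return [{**row, "device_name": right.get(row["device_id"])} for row in left]

def full_join(left, right):
    # Left join, plus right-side keys that never appeared on the left
    joined = left_join(left, right)
    seen = {row["device_id"] for row in left}
    joined += [{"device_id": k, "temperature": None, "device_name": v}
               for k, v in right.items() if k not in seen]
    return joined

inner = inner_join(sensor, info)
left = left_join(sensor, info)
full = full_join(sensor, info)
```

With this data the inner join returns two rows, the left join three (device 4 with an empty name), and the full join four (adding device 3 with an empty temperature).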

4.2 Join Query Examples

// Load the tables
t1 = loadTable("dfs://sensor_db", "sensor_data")
t2 = loadTable("dfs://sensor_db", "device_info")

// Left join with the lj function.
// The joined result has no table aliases, so columns are referenced by name.
select device_id, device_name, location, temperature, timestamp
from lj(t1, t2, `device_id)
limit 10

// Inner join
select s.device_id, d.device_name, s.temperature
from t1 s
inner join t2 d on s.device_id = d.device_id
limit 10

// Full join with the fj function
select device_id, device_name, temperature
from fj(t1, t2, `device_id)
limit 10

// Joining more than two tables
t3 = table(1..100 as device_id, 
           rand(0..100, 100) as health_score)
select device_id, device_name, health_score, temperature
from lj(lj(t1, t2, `device_id), t3, `device_id)
limit 10

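4.3 Join Performance Optimization

| Recommendation | Details |
| --- | --- |
| Let the small table drive the large one | Put the smaller table on the left side of the join |
| Join on indexed columns | The join column should be indexed |
| Avoid full joins | Prefer a left join where it is sufficient |
| Limit the number of joined tables | Try to stay at five tables or fewer |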

5. Subqueries

5.1 Scalar Subqueries

// The subquery returns a single value
select * from loadTable("dfs://sensor_db", "sensor_data")
where temperature > (
    select avg(temperature) from loadTable("dfs://sensor_db", "sensor_data")
)
limit 10

5.2 List Subqueries

// IN subquery
select * from loadTable("dfs://sensor_db", "sensor_data")
where device_id in (
    select device_id from loadTable("dfs://sensor_db", "device_info")
    where location = `车间A  // workshop A
)
limit 10

5.3 Table Subqueries

// Subquery in the FROM clause
select device_id, avg_temp
from (
    select device_id, avg(temperature) as avg_temp
    from loadTable("dfs://sensor_db", "sensor_data")
    group by device_id
)
where avg_temp > 27
order by avg_temp desc

5.4 EXISTS Subqueries

// EXISTS tests whether at least one matching row exists
select * from loadTable("dfs://sensor_db", "device_info") d
where exists (
    select * from loadTable("dfs://sensor_db", "sensor_data") s
    where s.device_id = d.device_id and s.temperature > 30
)
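
An EXISTS subquery like the one above behaves as a semi-join: each outer row is kept at most once, no matter how many inner rows match. The Python sketch below shows the idea with made-up rows; `devices_with_hot_reading` is a hypothetical helper name.

```python
def devices_with_hot_reading(devices, readings, threshold):
    """Keep each device that has at least one reading above threshold."""
    # Collect the matching keys once, then filter the outer table
    hot_ids = {r["device_id"] for r in readings if r["temperature"] > threshold}
    return [d for d in devices if d["device_id"] in hot_ids]

devices = [{"device_id": 1}, {"device_id": 2}, {"device_id": 3}]
readings = [
    {"device_id": 1, "temperature": 31.0},
    {"device_id": 1, "temperature": 33.5},  # second match does not duplicate device 1
    {"device_id": 2, "temperature": 28.0},
]
hot_devices = devices_with_hot_reading(devices, readings, threshold=30)
```

This is the main difference from an inner join on the same condition, which would return device 1 twice.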

6. Window Functions

6.1 Ranking Functions

// ROW_NUMBER: sequential row number
select device_id, timestamp, temperature,
       row_number() over (order by timestamp) as row_num
from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1
limit 10

// RANK: ranking with gaps after ties
select device_id, temperature,
       rank() over (order by temperature desc) as temp_rank
from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1
limit 10

// DENSE_RANK: ranking without gaps
select device_id, temperature,
       dense_rank() over (order by temperature desc) as temp_rank
from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1
limit 10
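
The three ranking functions differ only in how they treat ties. The Python sketch below computes all three for a descending sort, following the standard SQL definitions; `rankings` is a hypothetical helper and the input values are made up.

```python
def rankings(values):
    """Return (value, row_number, rank, dense_rank) tuples, sorted descending."""
    ordered = sorted(values, reverse=True)
    out = []
    rank = dense = 0
    prev = object()  # sentinel that never equals a real value
    for row_number, v in enumerate(ordered, start=1):
        if v != prev:
            rank = row_number  # rank jumps to the current position after a tie
            dense += 1         # dense rank advances by exactly one
            prev = v
        out.append((v, row_number, rank, dense))
    return out

ranked = rankings([30, 28, 30, 25])
```

For the two tied readings of 30, `row_number` still assigns 1 and 2, while `rank` and `dense_rank` both assign 1. The next value, 28, gets rank 3 (a gap) but dense rank 2 (no gap).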

6.2 Aggregate Window Functions

// Cumulative aggregates
select device_id, timestamp, temperature,
       cumsum(temperature) over (order by timestamp) as cum_sum,
       cumavg(temperature) over (order by timestamp) as cum_avg,
       cummax(temperature) over (order by timestamp) as cum_max,
       cummin(temperature) over (order by timestamp) as cum_min
from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1
limit 20

// Moving-window aggregates
select device_id, timestamp, temperature,
       mavg(temperature, 5) over (order by timestamp) as moving_avg_5,
       msum(temperature, 10) over (order by timestamp) as moving_sum_10,
       mmax(temperature, 5) over (order by timestamp) as moving_max_5
from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1
limit 20
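
Cumulative functions grow their window from the first row to the current one, while moving functions use a fixed-size trailing window. The Python sketch below mirrors `cumsum` and `mavg` on a made-up series. It assumes that a moving function yields an empty value until the window is full, which is my understanding of the default behaviour and should be checked against the function reference.

```python
from itertools import accumulate

def cumsum(xs):
    """Running total from the first element to the current one."""
    return list(accumulate(xs))

def mavg(xs, window):
    """Trailing moving average; None until `window` values are available."""
    out = []
    for i in range(len(xs)):
        if i + 1 < window:
            out.append(None)  # window not yet full
        else:
            chunk = xs[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
running_total = cumsum(series)
moving_avg_3 = mavg(series, 3)
```

A larger window smooths the series more but leaves more leading rows without a value, which matters when the result feeds an alerting rule.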

6.3 Partitioned Window Functions

// Compute within each device's partition
select device_id, timestamp, temperature,
       rank() over (partition by device_id order by temperature desc) as device_rank,
       mavg(temperature, 5) over (partition by device_id order by timestamp) as moving_avg
from loadTable("dfs://sensor_db", "sensor_data")
where timestamp between 2024.01.01 and 2024.01.02
limit 50

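6.4 Window Function Use Cases

| Function | Use case | Notes |
| --- | --- | --- |
| row_number() | Pagination, deduplication | Assigns a unique number to each row |
| rank() | Leaderboards | Ties share a rank; gaps follow ties |
| dense_rank() | Consecutive ranking | Ties share a rank; no gaps |
| mavg() | Moving average | Data smoothing, trend analysis |
| cumsum() | Cumulative sum | Running totals |
| first()/last() | First and last values | Returns the first or last value in the window |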

7. CASE Expressions

7.1 Simple CASE

// CASE WHEN on a single column.
// Each WHEN is a boolean condition, so CASE takes no operand here.
select device_id, temperature,
       case
           when temperature < 20 then "low"
           when temperature < 25 then "normal"
           when temperature < 30 then "elevated"
           else "high"
       end as temp_level
from loadTable("dfs://sensor_db", "sensor_data")
limit 10

7.2 Searched CASE

// CASE with compound conditions
select device_id, temperature, humidity, status,
       case
           when temperature > 32 and status = `error then "critical"
           when temperature > 30 or status = `warning then "warning"
           when temperature > 28 then "elevated"
           else "normal"
       end as alert_level
from loadTable("dfs://sensor_db", "sensor_data")
limit 20
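
A searched CASE evaluates its WHEN branches from top to bottom and returns the result of the first one that is true, so the order of the branches is part of the logic. The Python sketch below restates the alert rules above as an if-chain with English labels; `alert_level` is a hypothetical function written only to show the first-match behaviour.

```python
def alert_level(temperature, status):
    """Return the label of the first rule that matches, top to bottom."""
    if temperature > 32 and status == "error":
        return "critical"
    if temperature > 30 or status == "warning":
        return "warning"
    if temperature > 28:
        return "elevated"
    return "normal"

# 33 degrees without an error status misses the first rule but hits the second
level = alert_level(33, "normal")
```

If the second and third rules were swapped, a 33 degree reading with a normal status would be labelled as elevated rather than warning, which is why the most severe conditions are listed first.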

7.3 The IIF Function

// IIF as a compact conditional expression
select device_id, temperature,
       iif(temperature > 28, "high", "normal") as temp_status
from loadTable("dfs://sensor_db", "sensor_data")
limit 10

// Nested IIF
select device_id, temperature,
       iif(temperature > 30, "high",
           iif(temperature > 25, "normal", "low")) as temp_status
from loadTable("dfs://sensor_db", "sensor_data")
limit 10

8. Industrial IoT Case Studies

8.1 Device Status Monitoring

// Current status per device, based on today's records
select device_id,
       last(temperature) as current_temp,
       last(humidity) as current_humidity,
       last(status) as current_status,
       count(*) as today_records
from loadTable("dfs://sensor_db", "sensor_data")
where date(timestamp) = today()
group by device_id
order by current_status desc, current_temp desc

8.2 Abnormal Device Detection

// Flag devices whose high-temperature or error counts exceed a threshold
select device_id,
       count(*) as total_records,
       sum(iif(temperature > 30, 1, 0)) as high_temp_count,
       sum(iif(status = `error, 1, 0)) as error_count,
       avg(temperature) as avg_temp
from loadTable("dfs://sensor_db", "sensor_data")
where timestamp > now() - 3600 * 1000  // last hour, in milliseconds
group by device_id
having sum(iif(temperature > 30, 1, 0)) > 10 or sum(iif(status = `error, 1, 0)) > 5
order by error_count desc, high_temp_count desc

8.3 Device Health Scoring

// Compute a health score per device
select device_id,
       avg(temperature) as avg_temp,
       std(temperature) as temp_stability,
       avg(humidity) as avg_humidity,
       avg(power) as avg_power,
       // Health score: temperature stability (40%) + humidity in range (30%) + power stability (30%)
       (100 - std(temperature) * 10) * 0.4 +
       (100 - abs(avg(humidity) - 50) * 2) * 0.3 +
       (100 - std(power) * 0.5) * 0.3 as health_score
from loadTable("dfs://sensor_db", "sensor_data")
where timestamp > today()
group by device_id
order by health_score desc
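
The health score is a weighted sum of three sub-scores, each starting at 100 and losing points for instability or for drifting away from 50% humidity. As a worked example, the Python sketch below applies the same formula to a few made-up readings. It assumes a sample standard deviation, and the `health_score` function is a hypothetical restatement of the query expression.

```python
import statistics

def health_score(temps, humidities, powers):
    """Weighted score: temperature 40%, humidity 30%, power 30%."""
    temp_part = (100 - statistics.stdev(temps) * 10) * 0.4
    humidity_part = (100 - abs(statistics.mean(humidities) - 50) * 2) * 0.3
    power_part = (100 - statistics.stdev(powers) * 0.5) * 0.3
    return temp_part + humidity_part + power_part

# std(temps) = 1    -> (100 - 10) * 0.4 = 36.0
# avg(humidity) = 55 -> (100 - 10) * 0.3 = 27.0
# std(powers) = 10  -> (100 - 5)  * 0.3 = 28.5
score = health_score([24.0, 25.0, 26.0], [55.0, 55.0, 55.0], [290.0, 300.0, 310.0])
```

The formula is not clamped, so a device with a temperature standard deviation above 10 gets a negative temperature sub-score. Whether to clip each term to the 0 to 100 range is a design choice the query leaves open.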

8.4 Trend Analysis

// Hourly temperature trend for one device
select device_id,
       bar(timestamp, 1h) as hour,
       avg(temperature) as avg_temp,
       max(temperature) as max_temp,
       min(temperature) as min_temp,
       max(temperature) - min(temperature) as temp_range,
       count(*) as sample_count
from loadTable("dfs://sensor_db", "sensor_data")
where device_id = 1
and timestamp between 2024.01.01 and 2024.01.02
group by device_id, bar(timestamp, 1h)
order by hour

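9. Query Performance Optimization

9.1 Execution Plan Analysis

// View the execution plan: with the [HINT_EXPLAIN] hint the statement
// returns the plan as a JSON string instead of the query result
select [HINT_EXPLAIN] * from loadTable("dfs://sensor_db", "sensor_data")
where timestamp between 2024.01.01 and 2024.01.31
and device_id = 1

// For comparison, the plan of a query that filters on the date range only
select [HINT_EXPLAIN] * from loadTable("dfs://sensor_db", "sensor_data")
where timestamp between 2024.01.01 and 2024.01.31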

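9.2 Partition Pruning

How a query request is handled depends on its filter:

    • Filter includes a partitioning column -> partition pruning -> only the relevant partitions are scanned -> fast response
    • Filter does not include a partitioning column -> full table scan -> poor performance

// Good query: benefits from partition pruning
select * from loadTable("dfs://sensor_db", "sensor_data")
where timestamp between 2024.01.01 and 2024.01.31  // partitioning column
and device_id = 1                                  // partitioning column

// Poor query: cannot use partition pruning
select * from loadTable("dfs://sensor_db", "sensor_data")
where temperature > 30  // not a partitioning column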
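
The effect of pruning can be estimated by counting how many partitions a filter leaves in play. The Python sketch below does this for a hypothetical layout with one partition per calendar day of 2024; the layout and the `partitions_to_scan` helper are assumptions made for illustration, not a description of DolphinDB internals.

```python
from datetime import date, timedelta

def partitions_to_scan(all_partitions, start, end):
    """Keep only the daily partitions that overlap the [start, end] range."""
    return [p for p in all_partitions if start <= p <= end]

# Hypothetical layout: one partition per day of 2024 (a leap year)
all_days = [date(2024, 1, 1) + timedelta(days=i) for i in range(366)]

# A January date filter touches 31 of the 366 daily partitions
january = partitions_to_scan(all_days, date(2024, 1, 1), date(2024, 1, 31))
```

A filter on `temperature` alone gives the planner nothing to prune with, so in this layout all 366 daily partitions would have to be read.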

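9.3 Optimization Recommendations

| Item | Recommendation | Effect |
| --- | --- | --- |
| Partition pruning | Include partitioning columns in the query conditions | Greatly reduces the data scanned |
| Column pruning | Select only the columns you need | Less I/O |
| Index usage | Index the columns that are queried often | Faster lookups |
| Avoid full scans | Add a WHERE condition | Less data to process |
| Batch queries | Merge multiple queries into one | Less network overhead |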

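9.4 Query Caching

// Enable the cache with a capacity of 1 GB.
// Check this function name and its unit against the DolphinDB version in use.
setCacheEngineCapacity(1024)

// Queries are cached automatically
result = select * from loadTable("dfs://sensor_db", "sensor_data")
where timestamp between 2024.01.01 and 2024.01.31

// Clear the cache
clearAllCache()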

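10. Summary

This article covered DolphinDB's SQL query capabilities. The key points are:

  1. Basic queries: SELECT, WHERE, ORDER BY, LIMIT
  2. Aggregate queries: grouping, aggregate functions, HAVING filters
  3. Multi-table joins: inner, left, and full joins
  4. Subqueries: scalar, list, table, and EXISTS subqueries
  5. Window functions: ranking, moving aggregates, partitioned windows
  6. Performance optimization: partition pruning, execution plans, query caching

Questions for Reflection

  1. How would you design queries to take full advantage of partition pruning?
  2. What are the uses of window functions in time-series analysis?
  3. How would you optimize a complex multi-table join query?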
