Contents
- Abstract
- 1. JOIN Overview
  - 1.1 What Is a JOIN
  - 1.2 DolphinDB JOIN Types
  - 1.3 JOIN Performance Considerations
- 2. Basic JOIN Operations
  - 2.1 INNER JOIN
  - 2.2 LEFT JOIN
  - 2.3 RIGHT JOIN
  - 2.4 FULL JOIN
- 3. DolphinDB-Specific JOIN Functions
  - 3.1 lj: Left Join
  - 3.2 ej: Equi Join
  - 3.3 pj: Prefix Join
  - 3.4 aj: As-Of Join
  - 3.5 wj: Window Join
- 4. Multi-Table JOIN
  - 4.1 Joining Three Tables
  - 4.2 Self Join
  - 4.3 Joining on Multiple Conditions
- 5. Distributed JOIN
  - 5.1 Joining Distributed Tables
  - 5.2 Partition-Aligned JOIN
- 6. JOIN Optimization
  - 6.1 Let the Small Table Drive the Large Table
  - 6.2 Use Indexes
  - 6.3 Reduce the Data Being Joined
  - 6.4 Avoid Cartesian Products
- 7. JOIN Performance Monitoring
  - 7.1 Viewing the Execution Plan
  - 7.2 Performance Comparison
- 8. Practical Case
  - 8.1 Device Data Join Queries
- 9. Summary
- References
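Abstract
This article covers multi-table join queries in DolphinDB: the available JOIN types and their syntax, the DolphinDB-specific join functions, query optimization, and joins on distributed tables. Each topic comes with a code example so that readers can apply and tune JOIN queries themselves.
1. JOIN Overview
1.1 What Is a JOIN
A JOIN combines rows from two or more tables according to a join condition: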
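[Figure: How a JOIN works. Table A and Table B are matched on a join condition to produce a result set; the join type is one of INNER JOIN, LEFT JOIN, RIGHT JOIN, or FULL JOIN.]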
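1.2 DolphinDB JOIN Types
| JOIN type | Description | Typical use |
|---|---|---|
| INNER JOIN | Inner join | Keep only rows that have a match in both tables |
| LEFT JOIN | Left join | Keep every row of the left table |
| RIGHT JOIN | Right join | Keep every row of the right table |
| FULL JOIN | Full join | Keep every row of both tables |
| CROSS JOIN | Cross join | Cartesian product of the two tables |
1.3 JOIN Performance Considerations
| Factor | Effect |
|---|---|
| Table size | Let the smaller table drive the larger one |
| Join condition | Indexed join columns speed up matching |
| Partitioning scheme | Aligned partitions speed up the join |
| Data distribution | Joins on node-local data are faster |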
2. Basic JOIN Operations
2.1 INNER JOIN
```
// Create the test tables
t1 = table(
    1..5 as id,
    `A`B`C`D`E as name,
    10 20 30 40 50 as value
)
t2 = table(
    1..3 as id,
    100 200 300 as score
)
// Inner join: return only rows that have a match in both tables
select t1.id, t1.name, t1.value, t2.score
from t1
inner join t2 on t1.id = t2.id
/*
id name value score
1  A    10    100
2  B    20    200
3  C    30    300
*/
```
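To make the matching rule concrete, the inner join above can be modeled in a few lines of plain Python. This is only a sketch of the semantics, not of how DolphinDB executes the query; the function name `inner_join` and the list-of-dicts representation are assumptions made for illustration.

```python
def inner_join(left, right, key):
    """Hash join: index the right rows by key, then probe with each left row."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    result = []
    for lrow in left:
        # Left rows without a matching key contribute nothing to the result
        for rrow in index.get(lrow[key], []):
            result.append({**lrow, **rrow})
    return result


# Same data as the t1 and t2 tables in the example above
t1 = [{"id": i, "name": n, "value": v}
      for i, n, v in zip(range(1, 6), "ABCDE", (10, 20, 30, 40, 50))]
t2 = [{"id": i, "score": s} for i, s in zip(range(1, 4), (100, 200, 300))]

rows = inner_join(t1, t2, "id")
```

Only ids 1 to 3 appear in `rows`, because ids 4 and 5 have no counterpart in `t2`.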
2.2 LEFT JOIN
```
// Left join: keep every row of the left table
select t1.id, t1.name, t1.value, t2.score
from t1
left join t2 on t1.id = t2.id
/*
id name value score
1  A    10    100
2  B    20    200
3  C    30    300
4  D    40    NULL
5  E    50    NULL
*/
// Equivalent call using the lj function
select * from lj(t1, t2, `id)
```
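2.3 RIGHT JOIN
```
// Right join: keep every row of the right table
select t1.id, t1.name, t1.value, t2.score
from t1
right join t2 on t1.id = t2.id
// Function form: a right join is a left join with the tables swapped,
// so lj(t2, t1, ...) is used here instead of a dedicated right-join function
select * from lj(t2, t1, `id)
```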
2.4 FULL JOIN
```
// Full join: keep every row of both tables
select t1.id, t1.name, t1.value, t2.score
from t1
full join t2 on t1.id = t2.id
// Equivalent call using the fj function
select * from fj(t1, t2, `id)
```
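With the sample tables above, every id in `t2` also exists in `t1`, so the left and full joins return the same rows. The Python sketch below uses two small hypothetical tables whose keys only partly overlap, which makes the difference between left, right, and full joins visible. The `join` function and its `keep_left` and `keep_right` flags are illustrative names, not DolphinDB APIs.

```python
def join(left, right, key, keep_left=False, keep_right=False):
    """Equi-join two lists of dicts; optionally keep unmatched rows from either side."""
    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    result, matched_keys = [], set()
    for lrow in left:
        hits = index.get(lrow[key], [])
        if hits:
            matched_keys.add(lrow[key])
            result.extend({**lrow, **rrow} for rrow in hits)
        elif keep_left:
            # Unmatched left row: pad the right-hand columns with None
            result.append({**lrow, **dict.fromkeys(right_cols)})
    if keep_right:
        for rrow in right:
            if rrow[key] not in matched_keys:
                # Unmatched right row: pad the left-hand columns with None
                result.append({**dict.fromkeys(left_cols), **rrow})
    return result


# Hypothetical data: ids 2 and 3 overlap, 1 is left-only, 4 is right-only
left = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}, {"id": 3, "name": "C"}]
right = [{"id": 2, "score": 200}, {"id": 3, "score": 300}, {"id": 4, "score": 400}]

left_rows = join(left, right, "id", keep_left=True)                    # LEFT JOIN
right_rows = join(left, right, "id", keep_right=True)                  # RIGHT JOIN
full_rows = join(left, right, "id", keep_left=True, keep_right=True)   # FULL JOIN
```

The left join keeps id 1 with an empty score, the right join keeps id 4 with an empty name, and the full join keeps both.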
3. DolphinDB-Specific JOIN Functions
3.1 lj: Left Join
```
// lj: DolphinDB's left join function
t1 = table(
    1..5 as device_id,
    `A`B`C`D`E as name
)
t2 = table(
    1..3 as device_id,
    100.0 200.0 300.0 as temperature
)
// Join on a single column
select * from lj(t1, t2, `device_id)
// Join on several columns; the fourth argument names the right-table
// columns when they differ from the left-table ones
t3 = table(
    1..3 as device_id,
    `A`B`C as type,
    10 20 30 as value
)
select * from lj(t1, t3, `device_id`name, `device_id`type)
```
3.2 ej: Equi Join
```
// ej: equi join (the function form of INNER JOIN)
select * from ej(t1, t2, `device_id)
```
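3.3 pj: Prefix Join
```
// pj: in the DolphinDB function reference, pj is the prefix join, an inner join
// that matches rows on a string prefix of the join key.
// It is not a partition-specific join: joins on distributed tables
// are carried out partition by partition automatically.
```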
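3.4 aj: As-Of Join
```
// aj: as-of join (aligning time series)
t1 = table(
    1 1 1 2 2 as device_id,
    2024.01.01T00:00:00 2024.01.01T00:01:00 2024.01.01T00:02:00 2024.01.01T00:00:30 2024.01.01T00:01:30 as timestamp,
    25.0 26.0 27.0 28.0 29.0 as temperature
)
t2 = table(
    1 1 2 as device_id,
    2024.01.01T00:00:00 2024.01.01T00:01:30 2024.01.01T00:01:00 as timestamp,
    `A`B`C as status
)
// As-of join: for each left row, take the most recent right row of the same
// device whose timestamp is not later than the left timestamp.
// The right table must be sorted by timestamp within each device_id.
select * from aj(t1, t2, `device_id`timestamp)
```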
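The as-of rule is easy to misread as "nearest timestamp", so a small model helps. The Python sketch below attaches to each left row the latest right row of the same device whose time is not later than the left row's time. Timestamps are written as seconds after midnight to keep the example short, and `asof_join` with its parameters is a hypothetical helper, not a DolphinDB function.

```python
from bisect import bisect_right
from collections import defaultdict


def asof_join(left, right, by, on, take):
    """Attach right[take] from the latest right row with the same `by` value
    and an `on` value <= the left row's `on` value (None when there is none)."""
    groups = defaultdict(list)
    for row in sorted(right, key=lambda r: (r[by], r[on])):
        groups[row[by]].append(row)

    result = []
    for lrow in left:
        candidates = groups.get(lrow[by], [])
        times = [r[on] for r in candidates]
        pos = bisect_right(times, lrow[on])  # number of right rows with time <= left time
        match = candidates[pos - 1] if pos else None
        result.append({**lrow, take: match[take] if match else None})
    return result


# Same shape as t1 and t2 above; "ts" is seconds after 2024.01.01T00:00:00
readings = [
    {"device_id": 1, "ts": 0, "temperature": 25.0},
    {"device_id": 1, "ts": 60, "temperature": 26.0},
    {"device_id": 1, "ts": 120, "temperature": 27.0},
    {"device_id": 2, "ts": 30, "temperature": 28.0},
    {"device_id": 2, "ts": 90, "temperature": 29.0},
]
statuses = [
    {"device_id": 1, "ts": 0, "status": "A"},
    {"device_id": 1, "ts": 90, "status": "B"},
    {"device_id": 2, "ts": 60, "status": "C"},
]

joined = asof_join(readings, statuses, by="device_id", on="ts", take="status")
```

In this model the device 2 reading at 30 seconds gets no status, because the only status row for that device arrives later, at 60 seconds.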
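3.5 wj: Window Join
```
// wj: window join (aggregate right-table rows inside a time window)
t1 = table(
    1 1 1 1 as device_id,
    2024.01.01T00:00:00 2024.01.01T00:01:00 2024.01.01T00:02:00 2024.01.01T00:03:00 as timestamp,
    25.0 26.0 27.0 28.0 as temperature
)
t2 = table(
    1 1 1 as device_id,
    2024.01.01T00:00:30 2024.01.01T00:01:30 2024.01.01T00:02:30 as timestamp,
    `A`B`C as event
)
// Window join: count the events within 60 seconds before and after each reading.
// wj takes the aggregation as metacode in its fourth argument.
// The timestamps above are DATETIME values (second precision),
// so the window -60:60 is expressed in seconds.
select * from wj(t1, t2, -60:60, <count(event)>, `device_id`timestamp)
```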
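A window join differs from an as-of join in that it aggregates all right-table rows falling inside a window around each left row. The Python sketch below counts events within 60 seconds on either side of each reading. It assumes both window bounds are inclusive and writes timestamps as seconds after midnight; `window_join_count` is a hypothetical helper used only to model the idea.

```python
def window_join_count(left, right, by, on, lo, hi):
    """For each left row, count right rows of the same `by` group whose `on`
    value lies in [left.on + lo, left.on + hi], both bounds inclusive."""
    result = []
    for lrow in left:
        count = sum(
            1 for rrow in right
            if rrow[by] == lrow[by] and lrow[on] + lo <= rrow[on] <= lrow[on] + hi
        )
        result.append({**lrow, "event_count": count})
    return result


# Same shape as t1 and t2 above; "ts" is seconds after 2024.01.01T00:00:00
readings = [{"device_id": 1, "ts": t, "temperature": temp}
            for t, temp in zip((0, 60, 120, 180), (25.0, 26.0, 27.0, 28.0))]
events = [{"device_id": 1, "ts": t, "event": e}
          for t, e in zip((30, 90, 150), "ABC")]

windowed = window_join_count(readings, events, by="device_id", on="ts", lo=-60, hi=60)
```

The first and last readings each see one event, while the two middle readings see two, since their windows span an event on each side.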
4. Multi-Table JOIN
4.1 Joining Three Tables
```
// Create three tables
t1 = table(1..5 as id, `A`B`C`D`E as name)
t2 = table(1..5 as id, 100..104 as score)
t3 = table(1..5 as id, `X`Y`Z`X`Y as grade)
// Join the three tables with SQL syntax
select t1.id, t1.name, t2.score, t3.grade
from t1
inner join t2 on t1.id = t2.id
inner join t3 on t1.id = t3.id
// Chain lj calls: join t1 with t2, then join the result with t3
result = lj(t1, t2, `id)
result = lj(result, t3, `id)
select * from result
```
4.2 Self Join
```
// Self join: join a table with itself
t = table(
    1..5 as id,
    0 1 2 3 4 as parent_id,
    `A`B`C`D`E as name
)
// Look up each row's parent; id 1 has parent_id 0, which matches no row
select t1.id as child_id, t1.name as child_name,
    t2.id as parent_id, t2.name as parent_name
from t as t1
left join t as t2 on t1.parent_id = t2.id
```
4.3 Joining on Multiple Conditions
```
// Test tables keyed by device and date
t1 = table(
    1..5 as device_id,
    2024.01.01 + 0..4 as date,
    10 20 30 40 50 as value
)
t2 = table(
    1..3 as device_id,
    2024.01.01 + 0..2 as date,
    `A`B`C as status
)
// A row matches only when both device_id and date are equal
select t1.device_id, t1.date, t1.value, t2.status
from t1
left join t2 on t1.device_id = t2.device_id and t1.date = t2.date
```
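A join on several conditions behaves like a join on one composite key. The Python sketch below models this by using a tuple of the key columns as the lookup key. It assumes the key combination is unique in the right table, and `left_join_on` is a hypothetical helper rather than a DolphinDB function.

```python
from datetime import date, timedelta


def left_join_on(left, right, keys):
    """Left join on several columns, using a tuple of their values as the lookup key.
    Assumes each key combination appears at most once in `right`."""
    index = {tuple(row[k] for k in keys): row for row in right}
    right_cols = [c for c in right[0] if c not in keys]
    result = []
    for lrow in left:
        rrow = index.get(tuple(lrow[k] for k in keys))
        extra = {c: (rrow[c] if rrow else None) for c in right_cols}
        result.append({**lrow, **extra})
    return result


# Same data as the t1 and t2 tables in the example above
start = date(2024, 1, 1)
t1 = [{"device_id": i + 1, "date": start + timedelta(days=i), "value": 10 * (i + 1)}
      for i in range(5)]
t2 = [{"device_id": i + 1, "date": start + timedelta(days=i), "status": s}
      for i, s in enumerate("ABC")]

rows = left_join_on(t1, t2, ["device_id", "date"])

# A row with a known device but a different date does not match
probe = [{"device_id": 1, "date": start + timedelta(days=1), "value": 0}]
mismatch = left_join_on(probe, t2, ["device_id", "date"])
```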
5. Distributed JOIN
5.1 Joining Distributed Tables
```
// Create a distributed database, value-partitioned on device_id
db = database("dfs://join_db", VALUE, 1..100)
// Sensor data table (partitioned)
schema1 = table(1:0, `device_id`timestamp`temperature,
    [INT, TIMESTAMP, DOUBLE])
db.createPartitionedTable(schema1, `sensor_data, `device_id)
// Device information table (dimension table)
schema2 = table(1:0, `device_id`device_name`location,
    [INT, STRING, STRING])
db.createTable(schema2, `device_info)
// Insert data
loadTable("dfs://join_db", "sensor_data").append!(
    table(
        take(1..100, 10000) as device_id,
        take(now(), 10000) as timestamp,
        20.0 + rand(10.0, 10000) as temperature  // uniform values in [20, 30)
    )
)
loadTable("dfs://join_db", "device_info").append!(
    table(
        1..100 as device_id,
        "device_" + string(1..100) as device_name,
        take(`WorkshopA`WorkshopB`WorkshopC, 100) as location
    )
)
// Join the partitioned table with the dimension table
t1 = loadTable("dfs://join_db", "sensor_data")
t2 = loadTable("dfs://join_db", "device_info")
select t1.device_id, t1.timestamp, t1.temperature,
    t2.device_name, t2.location
from t1
left join t2 on t1.device_id = t2.device_id
limit 10
```
5.2 Partition-Aligned JOIN
```
// Partition-aligned join: tables with the same partitioning join more efficiently.
// Both tables are created in one database so that they share a partition scheme.
db = database("dfs://aligned_db", VALUE, 1..100)
schema = table(1:0, `device_id`timestamp`value,
    [INT, TIMESTAMP, DOUBLE])
db.createPartitionedTable(schema, `table1, `device_id)
db.createPartitionedTable(schema, `table2, `device_id)
// Join on the partition column (plus timestamp)
t1 = loadTable("dfs://aligned_db", "table1")
t2 = loadTable("dfs://aligned_db", "table2")
select t1.device_id, t1.value as value1, t2.value as value2
from t1
inner join t2 on t1.device_id = t2.device_id and t1.timestamp = t2.timestamp
```
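The benefit of aligned partitions is that each partition of one table only needs to be joined with the same partition of the other. The Python sketch below models that idea: rows are bucketed by the partition column and the join runs bucket by bucket. It says nothing about how DolphinDB schedules the work; `partition_by` and `partitioned_inner_join` are hypothetical helpers, and the sample rows are made up.

```python
from collections import defaultdict


def partition_by(rows, key):
    """Group rows into partitions keyed by the value of `key`."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return parts


def partitioned_inner_join(left, right, part_key, on):
    """Inner join that only compares rows living in the same partition."""
    left_parts = partition_by(left, part_key)
    right_parts = partition_by(right, part_key)
    result = []
    # Partitions present on only one side cannot produce matches and are skipped
    for pid in sorted(left_parts.keys() & right_parts.keys()):
        index = {tuple(r[c] for c in on): r for r in right_parts[pid]}
        for lrow in left_parts[pid]:
            rrow = index.get(tuple(lrow[c] for c in on))
            if rrow is not None:
                result.append({"device_id": lrow["device_id"],
                               "value1": lrow["value"],
                               "value2": rrow["value"]})
    return result


# Hypothetical rows; "ts" stands in for the timestamp column
table1 = [{"device_id": 1, "ts": 0, "value": 1.0},
          {"device_id": 1, "ts": 60, "value": 1.5},
          {"device_id": 2, "ts": 0, "value": 2.0}]
table2 = [{"device_id": 1, "ts": 0, "value": 10.0},
          {"device_id": 2, "ts": 0, "value": 20.0},
          {"device_id": 3, "ts": 0, "value": 30.0}]

aligned = partitioned_inner_join(table1, table2, "device_id", ["device_id", "ts"])
```

Because the join key includes the partition column, no match can cross a partition boundary, which is what allows each partition pair to be processed independently.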
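6. JOIN Optimization
6.1 Let the Small Table Drive the Large Table
```
// Principle: let the smaller table drive the larger one,
// i.e. put the small table on the left and the large table on the right.
// Swapping the sides of a LEFT JOIN changes which rows are kept, so the
// comparison uses an inner join, where both orders return the same matches.
// Not recommended: the large table drives the small one
select * from large_table
inner join small_table on large_table.id = small_table.id
// Recommended: the small table drives the large one
select * from small_table
inner join large_table on small_table.id = large_table.id
```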
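One common reason table order matters is the hash join: the join builds a lookup table from one input and scans the other, so building from the smaller input keeps the lookup table small. The Python sketch below illustrates that general technique and is not a description of DolphinDB's internal join algorithm. `hash_inner_join` and the sample tables are hypothetical.

```python
def hash_inner_join(a, b, key):
    """Inner join that always builds the hash index on the smaller input.
    Returns the joined rows and the number of rows used to build the index."""
    build, probe = (a, b) if len(a) <= len(b) else (b, a)
    index = {}
    for row in build:
        index.setdefault(row[key], []).append(row)
    result = []
    for prow in probe:
        for brow in index.get(prow[key], []):
            result.append({**brow, **prow})
    return result, len(build)


# Hypothetical data: a 3-row dimension table and a 1000-row fact table
small = [{"id": i, "name": f"dev{i}"} for i in (1, 2, 3)]
large = [{"id": i % 10, "reading": i} for i in range(1000)]

rows_ab, built_ab = hash_inner_join(small, large, "id")
rows_ba, built_ba = hash_inner_join(large, small, "id")
```

Whichever order the caller passes the tables in, the index is built from the 3-row table and the result is the same, which is only true because this is an inner join.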
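6.2 Use Indexes
```
// Joins run faster when the join columns are indexed.
// Partition columns and sort columns are indexed automatically.
// Use the partition column in the join condition
select * from t1
inner join t2 on t1.device_id = t2.device_id  // device_id is the partition column
```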
6.3 Reduce the Data Being Joined
```
// Option 1: filter first, then join the reduced table
filtered = select * from t1 where date(timestamp) = 2024.01.15
select filtered.device_id, filtered.temperature, t2.device_name
from filtered
left join t2 on filtered.device_id = t2.device_id
// Option 2: filter in the where clause of the join query
select t1.device_id, t1.temperature, t2.device_name
from t1
left join t2 on t1.device_id = t2.device_id
where date(t1.timestamp) = 2024.01.15
```
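When the filter only touches left-table columns, filtering before the join and filtering after it give the same rows; the difference is how many rows the join has to process. The Python sketch below checks this on made-up data. `left_join` is a hypothetical helper that assumes the key is unique in the right table, as in a dimension table.

```python
def left_join(left, right, key):
    """Left join assuming `key` is unique in `right` (a dimension-style lookup)."""
    index = {row[key]: row for row in right}
    return [{**lrow, **index.get(lrow[key], {})} for lrow in left]


# Hypothetical readings over two days (day 14 and day 15) for three devices
readings = [{"device_id": i % 3 + 1, "day": 14 + i % 2, "temperature": 20.0 + i}
            for i in range(6)]
devices = [{"device_id": d, "device_name": f"device_{d}"} for d in (1, 2, 3)]

# Filter first: the join only sees the rows for day 15
day15 = [r for r in readings if r["day"] == 15]
filter_first = left_join(day15, devices, "device_id")

# Join first: the join sees every row, and most of the work is then discarded
join_first = [r for r in left_join(readings, devices, "device_id") if r["day"] == 15]
```

Both results are identical, but the first variant joins 3 rows instead of 6; on real tables the gap is the ratio between one day of data and the whole table.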
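6.4 Avoid Cartesian Products
```
// Avoid an unconditional CROSS JOIN
// Not recommended: produces the Cartesian product of t1 and t2
select * from t1, t2
// Recommended: join with a condition
select * from t1
inner join t2 on t1.id = t2.id
```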
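The cost of a Cartesian product is simple arithmetic: two tables of m and n rows yield m × n rows. The short Python sketch below makes the gap concrete for two hypothetical tables of 1,000 rows each with unique ids.

```python
from itertools import product

# Hypothetical key columns of two 1,000-row tables with unique ids
t1_ids = range(1, 1001)
t2_ids = range(1, 1001)

# Cross join: every row of t1 paired with every row of t2
cross_rows = sum(1 for _ in product(t1_ids, t2_ids))

# Inner join on id: at most one partner per row, because ids are unique
keyed_rows = len(set(t1_ids) & set(t2_ids))
```

Here the cross join produces 1,000,000 rows against 1,000 for the keyed join, a factor that grows with table size.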
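7. JOIN Performance Monitoring
7.1 Viewing the Execution Plan
```
// View the execution plan of a join query.
// DolphinDB exposes the plan through the [HINT_EXPLAIN] hint after select.
select [HINT_EXPLAIN] t1.device_id, t1.temperature, t2.device_name
from t1
left join t2 on t1.device_id = t2.device_id
```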
7.2 Performance Comparison
```
// Compare the running time of two equivalent joins
def compareJoinPerformance() {
    t1 = loadTable("dfs://join_db", "sensor_data")
    t2 = loadTable("dfs://join_db", "device_info")
    // Test 1: the lj function
    timer {
        select * from lj(t1, t2, `device_id)
    }
    // Test 2: LEFT JOIN syntax
    timer {
        select * from t1 left join t2 on t1.device_id = t2.device_id
    }
}
compareJoinPerformance()
```
8. Practical Case
8.1 Device Data Join Queries
```
// Sensor data + device information + alerts
db = database("dfs://iot_join_db", VALUE, 1..100)
// Sensor data table
schema1 = table(1:0, `device_id`timestamp`temperature`humidity,
    [INT, TIMESTAMP, DOUBLE, DOUBLE])
db.createPartitionedTable(schema1, `sensor_data, `device_id)
// Device information table (dimension table)
schema2 = table(1:0, `device_id`device_name`location`install_date,
    [INT, STRING, STRING, DATE])
db.createTable(schema2, `device_info)
// Alert table
schema3 = table(1:0, `device_id`alert_time`alert_type`alert_level,
    [INT, TIMESTAMP, SYMBOL, INT])
db.createPartitionedTable(schema3, `alerts, `device_id)
// Load the tables for the join queries
t1 = loadTable("dfs://iot_join_db", "sensor_data")
t2 = loadTable("dfs://iot_join_db", "device_info")
t3 = loadTable("dfs://iot_join_db", "alerts")
// Sensor data + device information
select t1.device_id, t1.timestamp, t1.temperature,
    t2.device_name, t2.location
from t1
left join t2 on t1.device_id = t2.device_id
where date(t1.timestamp) = 2024.01.15
limit 100
// Sensor data + alerts raised on the same day for the same device
select t1.device_id, t1.timestamp, t1.temperature,
    t3.alert_type, t3.alert_level
from t1
left join t3 on t1.device_id = t3.device_id
    and date(t1.timestamp) = date(t3.alert_time)
where date(t1.timestamp) = 2024.01.15
limit 100
```
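9. Summary
This article covered multi-table join queries in DolphinDB:
- JOIN types: INNER, LEFT, RIGHT, FULL
- DolphinDB-specific functions: lj, ej, aj, wj
- Multi-table JOIN: three-table joins, self joins, joins on multiple conditions
- Distributed JOIN: joining distributed tables, partition alignment
- JOIN optimization: small table drives large table, index usage, reducing the data being joined
- Performance monitoring: execution plans, performance comparison
Questions to consider:
- How do you choose the right JOIN type?
- How can distributed JOIN performance be improved?
- Which scenarios suit an As-Of JOIN?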