
STDDEV_SAMP Function
Function Overview
sql
STDDEV_SAMP(expr)
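Description: Computes the sample standard deviation of a column in a table or supertable.
Version: v3.3.8.0
Return type: DOUBLE.
Applicable data types: numeric types.
Nested subquery support: can be used in both inner and outer queries.
Applies to: tables and supertables.
Usage Notes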
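- Difference between sample and population standard deviation:
  - Sample standard deviation (STDDEV_SAMP): divides by n-1 and is intended for data sampled from a larger population. Its square, the sample variance, is an unbiased estimate of the population variance.
  - Population standard deviation (STDDEV/STDDEV_POP): divides by n and is intended for data that covers the complete population.
  - Formulas:
    - Sample standard deviation: √[Σ(xi - x̄)² / (n-1)]
    - Population standard deviation: √[Σ(xi - x̄)² / n]
- NULL handling:
  - NULL values are ignored automatically during the calculation
  - If every value in the expr column is NULL, the function returns NULL
- Minimum data requirement:
  - At least 2 non-NULL values are required to compute a sample standard deviation
  - If there is only 1 non-NULL value, the function returns NULL
- Precision:
  - The result is a DOUBLE, so its precision is bounded by the system's floating-point arithmetic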
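The notes above can be condensed into a few lines of Python. This is only a minimal sketch of the described semantics, not TDengine's implementation: `stddev_samp` is a hypothetical helper, and `None` stands in for SQL NULL.

```python
import math
from typing import Iterable, Optional


def stddev_samp(values: Iterable[Optional[float]]) -> Optional[float]:
    """Sample standard deviation that skips None (standing in for SQL NULL)."""
    xs = [v for v in values if v is not None]
    n = len(xs)
    if n < 2:
        # Zero or one usable value: the n - 1 denominator is undefined.
        return None
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))


print(round(stddev_samp([1, 2, None, 3, 4, 5]), 4))  # 1.5811
print(stddev_samp([None, None]))                     # None
print(stddev_samp([7.0]))                            # None
```

The first call ignores the missing value and reproduces √2.5; the other two show the all-NULL and single-value cases.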
Basic Examples
Example 1: Sample standard deviation of a simple data set
sql
-- Create a test table
CREATE TABLE test_stddev (
ts TIMESTAMP,
id INT
);
-- Insert test data
INSERT INTO test_stddev VALUES
('2024-01-01 00:00:00', 1),
('2024-01-01 00:00:01', 2),
('2024-01-01 00:00:02', 3),
('2024-01-01 00:00:03', 4),
('2024-01-01 00:00:04', 5);
-- Query the data
SELECT id FROM test_stddev;
Output:
id |
==============
1 |
2 |
3 |
4 |
5 |
sql
-- Compute the sample standard deviation
SELECT STDDEV_SAMP(id) FROM test_stddev;
Output:
stddev_samp(id) |
============================
1.581138830084190 |
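Explanation:
- Mean: (1+2+3+4+5)/5 = 3
- Sample variance: [(1-3)²+(2-3)²+(3-3)²+(4-3)²+(5-3)²]/(5-1) = 10/4 = 2.5
- Sample standard deviation: √2.5 ≈ 1.5811
Example 2: Comparing sample and population standard deviation
sql
-- Compute the sample and population standard deviation in one query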
SELECT
STDDEV_SAMP(id) AS sample_stddev,
STDDEV_POP(id) AS population_stddev,
STDDEV(id) AS stddev_default
FROM test_stddev;
Output:
sample_stddev | population_stddev | stddev_default |
==============================================================================
1.581138830084190| 1.414213562373095| 1.414213562373095|
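Notes:
- The sample standard deviation (1.5811) is larger than the population standard deviation (1.4142) because it divides by n-1 instead of n
- STDDEV is equivalent to STDDEV_POP
Smart Meter Scenarios
Setup: create the smart meter tables
sql
-- Create the database
CREATE DATABASE IF NOT EXISTS power;
USE power;
-- Create the smart meter supertable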
CREATE STABLE IF NOT EXISTS meters (
ts TIMESTAMP,
current FLOAT,
voltage INT,
phase FLOAT
) TAGS (
groupid INT,
location VARCHAR(64)
);
-- Create the subtables
CREATE TABLE d1001 USING meters TAGS (1, 'Beijing.Chaoyang');
CREATE TABLE d1002 USING meters TAGS (1, 'Beijing.Haidian');
CREATE TABLE d1003 USING meters TAGS (2, 'Shanghai.Pudong');
-- Insert test data (simulated current readings over one day)
INSERT INTO d1001 VALUES
('2024-01-15 00:00:00', 10.2, 220, 0.95),
('2024-01-15 06:00:00', 12.5, 221, 0.96),
('2024-01-15 12:00:00', 15.8, 222, 0.97),
('2024-01-15 18:00:00', 18.3, 223, 0.96),
('2024-01-15 23:00:00', 11.5, 220, 0.95);
INSERT INTO d1002 VALUES
('2024-01-15 00:00:00', 9.8, 219, 0.94),
('2024-01-15 06:00:00', 11.2, 220, 0.95),
('2024-01-15 12:00:00', 14.5, 221, 0.96),
('2024-01-15 18:00:00', 17.1, 222, 0.95),
('2024-01-15 23:00:00', 10.8, 219, 0.94);
INSERT INTO d1003 VALUES
('2024-01-15 00:00:00', 10.5, 220, 0.96),
('2024-01-15 06:00:00', 13.2, 221, 0.97),
('2024-01-15 12:00:00', 16.5, 222, 0.98),
('2024-01-15 18:00:00', 19.8, 223, 0.97),
('2024-01-15 23:00:00', 12.3, 220, 0.96);
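Scenario 1: Assessing meter load volatility
Business requirement: The utility needs to assess how much the load on each meter fluctuates. A larger standard deviation means larger load swings and calls for closer monitoring.
sql
-- Compute the sample standard deviation of the current for each meter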
SELECT
tbname AS meter_id,
location,
COUNT(*) AS sample_count,
ROUND(AVG(current), 2) AS avg_current,
ROUND(STDDEV_SAMP(current), 3) AS current_stddev_samp,
ROUND(STDDEV_POP(current), 3) AS current_stddev_pop
FROM meters
WHERE ts >= '2024-01-15 00:00:00' AND ts < '2024-01-16 00:00:00'
GROUP BY tbname, location
ORDER BY current_stddev_samp DESC;
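Expected output:
meter_id | location         | sample_count | avg_current | current_stddev_samp | current_stddev_pop |
=========================================================================================================
d1003    | Shanghai.Pudong  | 5            | 14.46       | 3.695               | 3.305              |
d1001    | Beijing.Chaoyang | 5            | 13.66       | 3.320               | 2.970              |
d1002    | Beijing.Haidian  | 5            | 12.68       | 3.036               | 2.715              |
Business insights:
- Meter d1003 has the largest load fluctuation (sample standard deviation 3.695 A)
- Meters with high volatility deserve attention because their load may be unstable
- The sample standard deviation is never smaller than the population value, which makes it the more conservative choice for risk assessment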
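The per-meter figures can be cross-checked outside the database. The sketch below uses Python's `statistics` module on the readings from the INSERT statements; it assumes double-precision inputs, so the last digit could in principle differ from a query over the FLOAT column. The helper name `summarize` is illustrative only.

```python
import statistics

# Current readings (amperes) copied from the INSERT statements above.
readings = {
    "d1001": [10.2, 12.5, 15.8, 18.3, 11.5],
    "d1002": [9.8, 11.2, 14.5, 17.1, 10.8],
    "d1003": [10.5, 13.2, 16.5, 19.8, 12.3],
}


def summarize(values):
    """Return (mean, sample std dev, population std dev), rounded like the query."""
    return (
        round(statistics.mean(values), 2),
        round(statistics.stdev(values), 3),   # n - 1 in the denominator
        round(statistics.pstdev(values), 3),  # n in the denominator
    )


summary = {meter: summarize(values) for meter, values in readings.items()}

# Highest sample standard deviation first, mirroring ORDER BY ... DESC.
ranked = sorted(summary.items(), key=lambda item: item[1][1], reverse=True)
for meter, row in ranked:
    print(meter, *row)
```

Sorting on the second tuple element reproduces the ordering of the query, which ranks meters by their sample standard deviation.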
Scenario 2: Load stability by time of day
Business requirement: Analyze load stability in different periods of the day (early morning, morning, afternoon, evening).
sql
-- Summarize current fluctuation by time period
SELECT
CASE
WHEN HOUR(ts) BETWEEN 0 AND 5 THEN 'Early morning (00-05)'
WHEN HOUR(ts) BETWEEN 6 AND 11 THEN 'Morning (06-11)'
WHEN HOUR(ts) BETWEEN 12 AND 17 THEN 'Afternoon (12-17)'
ELSE 'Evening (18-23)'
END AS time_period,
COUNT(*) AS sample_count,
ROUND(AVG(current), 2) AS avg_current,
ROUND(MIN(current), 2) AS min_current,
ROUND(MAX(current), 2) AS max_current,
ROUND(STDDEV_SAMP(current), 3) AS stddev_samp
FROM meters
WHERE ts >= '2024-01-15 00:00:00' AND ts < '2024-01-16 00:00:00'
GROUP BY time_period
ORDER BY
CASE time_period
WHEN 'Early morning (00-05)' THEN 1
WHEN 'Morning (06-11)' THEN 2
WHEN 'Afternoon (12-17)' THEN 3
WHEN 'Evening (18-23)' THEN 4
END;
Expected output:
time_period           | sample_count | avg_current | min_current | max_current | stddev_samp |
================================================================================================
Early morning (00-05) | 3            | 10.17       | 9.80        | 10.50       | 0.351       |
Morning (06-11)       | 3            | 12.30       | 11.20       | 13.20       | 1.015       |
Afternoon (12-17)     | 3            | 15.60       | 14.50       | 16.50       | 1.015       |
Evening (18-23)       | 6            | 14.97       | 10.80       | 19.80       | 3.886       |
Business insights:
- The load is most stable in the early morning (standard deviation 0.351 A)
- The load fluctuates most in the evening (standard deviation 3.886 A), because that period contains both the 18:00 peak and the 23:00 low
- These figures can inform peak and off-peak pricing
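The same bucketing can be reproduced in Python as a sanity check. This sketch assumes the five readings per meter fall at hours 0, 6, 12, 18 and 23, as in the setup data; `period` is a hypothetical helper that mirrors the CASE expression, with shortened labels.

```python
import statistics
from collections import defaultdict

hours = [0, 6, 12, 18, 23]  # hour of day of each reading, per the setup data
currents = {
    "d1001": [10.2, 12.5, 15.8, 18.3, 11.5],
    "d1002": [9.8, 11.2, 14.5, 17.1, 10.8],
    "d1003": [10.5, 13.2, 16.5, 19.8, 12.3],
}


def period(hour):
    """Map an hour of day to the same four periods as the CASE expression."""
    if hour <= 5:
        return "00-05"
    if hour <= 11:
        return "06-11"
    if hour <= 17:
        return "12-17"
    return "18-23"


# Pool the readings of all meters into their time-of-day bucket.
buckets = defaultdict(list)
for values in currents.values():
    for hour, amps in zip(hours, values):
        buckets[period(hour)].append(amps)

summary = {
    label: (len(v), round(statistics.mean(v), 2), round(statistics.stdev(v), 3))
    for label, v in sorted(buckets.items())
}
for label, row in summary.items():
    print(label, *row)
```

The evening bucket holds six readings because both the 18:00 and the 23:00 samples of every meter land in it, which is why its spread is so much larger than that of the other periods.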
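Scenario 3: Identifying abnormal meters (load volatility anomaly detection)
Business requirement: Identify meters with abnormal load volatility. A meter whose standard deviation exceeds 1.5 times the average across all meters needs attention.
sql
-- Step 1: compute the standard deviation for each meter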
WITH meter_stats AS (
SELECT
tbname AS meter_id,
location,
ROUND(AVG(current), 2) AS avg_current,
ROUND(STDDEV_SAMP(current), 3) AS stddev_current
FROM meters
WHERE ts >= '2024-01-15 00:00:00' AND ts < '2024-01-16 00:00:00'
GROUP BY tbname, location
),
-- Step 2: compute the average standard deviation across all meters
avg_stddev AS (
SELECT AVG(stddev_current) AS avg_stddev_value
FROM meter_stats
)
-- Step 3: flag abnormal meters
SELECT
m.meter_id,
m.location,
m.avg_current,
m.stddev_current,
ROUND(a.avg_stddev_value, 3) AS overall_avg_stddev,
ROUND(m.stddev_current / a.avg_stddev_value, 2) AS deviation_ratio,
CASE
WHEN m.stddev_current > a.avg_stddev_value * 1.5 THEN 'Needs attention'
WHEN m.stddev_current > a.avg_stddev_value * 1.2 THEN 'Slightly abnormal'
ELSE 'Normal'
END AS status
FROM meter_stats m, avg_stddev a
ORDER BY deviation_ratio DESC;
Expected output:
meter_id | location         | avg_current | stddev_current | overall_avg_stddev | deviation_ratio | status |
==============================================================================================================
d1003    | Shanghai.Pudong  | 14.46       | 3.695          | 3.350              | 1.10            | Normal |
d1001    | Beijing.Chaoyang | 13.66       | 3.320          | 3.350              | 0.99            | Normal |
d1002    | Beijing.Haidian  | 12.68       | 3.036          | 3.350              | 0.91            | Normal |
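The three-step logic (per-meter standard deviation, fleet-wide average, ratio and label) can be sketched in Python as follows. The thresholds 1.5 and 1.2 come from the query; the `status` helper and its English labels are illustrative, and unlike the query the sketch does not round the standard deviations before averaging them.

```python
import statistics

readings = {
    "d1001": [10.2, 12.5, 15.8, 18.3, 11.5],
    "d1002": [9.8, 11.2, 14.5, 17.1, 10.8],
    "d1003": [10.5, 13.2, 16.5, 19.8, 12.3],
}


def status(sd, baseline):
    """Label a meter by how far its std dev sits above the fleet average."""
    if sd > baseline * 1.5:
        return "Needs attention"
    if sd > baseline * 1.2:
        return "Slightly abnormal"
    return "Normal"


# Step 1: sample standard deviation per meter.
per_meter = {meter: statistics.stdev(v) for meter, v in readings.items()}
# Step 2: average of those standard deviations.
baseline = statistics.mean(list(per_meter.values()))
# Step 3: ratio to the average, plus a label.
report = {
    meter: (round(sd / baseline, 2), status(sd, baseline))
    for meter, sd in per_meter.items()
}
for meter, row in sorted(report.items(), key=lambda item: item[1][0], reverse=True):
    print(meter, *row)
```

With only three similar meters every ratio stays close to 1, so no meter crosses the 1.2 threshold; a meter would need a standard deviation of roughly 4 A or more to be flagged in this data set.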
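Scenario 4: Windowed analysis of load stability trends
Business requirement: Track how the stability of a meter's load changes over time by aggregating the data in consecutive 2-hour windows.
sql
-- Compute the standard deviation of the current per 2-hour window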
SELECT
TIMETRUNCATE(ts, 2h) AS time_window,
tbname AS meter_id,
COUNT(*) AS sample_count,
ROUND(AVG(current), 2) AS avg_current,
ROUND(STDDEV_SAMP(current), 3) AS stddev_samp
FROM meters
WHERE ts >= '2024-01-15 00:00:00' AND ts < '2024-01-16 00:00:00'
AND tbname = 'd1001'
GROUP BY time_window, tbname
ORDER BY time_window;
Expected output:
time_window | meter_id | sample_count | avg_current | stddev_samp |
===================================================================================
2024-01-15 00:00:00.000| d1001 | 1 | 10.20 | NULL |
2024-01-15 06:00:00.000| d1001 | 1 | 12.50 | NULL |
2024-01-15 12:00:00.000| d1001 | 1 | 15.80 | NULL |
2024-01-15 18:00:00.000| d1001 | 1 | 18.30 | NULL |
2024-01-15 22:00:00.000| d1001 | 1 | 11.50 | NULL |
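Notes:
- When a window contains only 1 data point, the sample standard deviation is NULL (at least 2 points are required)
- Use a larger window or a higher sampling frequency to obtain a meaningful standard deviation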
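The effect of the window size can be illustrated with a small Python sketch. It buckets the d1001 readings by hour of day, which is a simplification: it assumes windows aligned to local midnight and is not a model of how TIMETRUNCATE aligns windows. `windowed_stddev` is a hypothetical helper.

```python
import statistics
from collections import defaultdict

# (hour of day, current in amperes) for meter d1001 on 2024-01-15.
samples = [(0, 10.2), (6, 12.5), (12, 15.8), (18, 18.3), (23, 11.5)]


def windowed_stddev(samples, window_hours):
    """Group samples into fixed windows and return the sample std dev of each."""
    buckets = defaultdict(list)
    for hour, amps in samples:
        start = hour - hour % window_hours  # start hour of the window
        buckets[start].append(amps)
    return {
        start: round(statistics.stdev(v), 3) if len(v) >= 2 else None
        for start, v in sorted(buckets.items())
    }


print(windowed_stddev(samples, 2))   # {0: None, 6: None, 12: None, 18: None, 22: None}
print(windowed_stddev(samples, 12))  # {0: 1.626, 12: 3.439}
```

With 2-hour windows every bucket holds a single reading, so nothing can be computed; widening the window to 12 hours puts two and three readings into the buckets and yields usable values.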
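Scenario 5: Voltage stability analysis
Business requirement: Assess the stability of the grid voltage. An excessive standard deviation may indicate a grid quality problem.
sql
-- Analyze voltage fluctuation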
SELECT
location,
COUNT(DISTINCT tbname) AS meter_count,
COUNT(*) AS sample_count,
ROUND(AVG(voltage), 2) AS avg_voltage,
ROUND(MIN(voltage), 2) AS min_voltage,
ROUND(MAX(voltage), 2) AS max_voltage,
ROUND(STDDEV_SAMP(voltage), 3) AS voltage_stddev_samp,
ROUND(STDDEV_SAMP(voltage) / AVG(voltage) * 100, 3) AS cv_percentage
FROM meters
WHERE ts >= '2024-01-15 00:00:00' AND ts < '2024-01-16 00:00:00'
GROUP BY location
ORDER BY voltage_stddev_samp DESC;
Expected output:
location | meter_count | sample_count | avg_voltage | min_voltage | max_voltage | voltage_stddev_samp | cv_percentage |
====================================================================================================================================
Beijing.Chaoyang | 1 | 5 | 221.20 | 220.00 | 223.00 | 1.304 | 0.589 |
Shanghai.Pudong | 1 | 5 | 221.20 | 220.00 | 223.00 | 1.304 | 0.589 |
Beijing.Haidian | 1 | 5 | 220.20 | 219.00 | 222.00 | 1.304 | 0.592 |
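Business insights:
- cv_percentage is the coefficient of variation (CV), which expresses the fluctuation relative to the mean
- The voltage CV is below 1% in every area, which indicates stable grid quality
- If the CV of an area exceeds 2%, the grid equipment should be inspected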
Scenario 6: Comparing load characteristics across groups
Business requirement: Compare the load characteristics of different regions (Beijing vs. Shanghai).
sql
-- Aggregate by region
SELECT
CASE
WHEN location LIKE 'Beijing%' THEN 'Beijing'
WHEN location LIKE 'Shanghai%' THEN 'Shanghai'
ELSE 'Other'
END AS region,
COUNT(DISTINCT tbname) AS meter_count,
COUNT(*) AS total_samples,
ROUND(AVG(current), 2) AS avg_current,
ROUND(STDDEV_SAMP(current), 3) AS stddev_samp,
ROUND(MIN(current), 2) AS min_current,
ROUND(MAX(current), 2) AS max_current,
ROUND((MAX(current) - MIN(current)) / STDDEV_SAMP(current), 2) AS range_stddev_ratio
FROM meters
WHERE ts >= '2024-01-15 00:00:00' AND ts < '2024-01-16 00:00:00'
GROUP BY region
ORDER BY stddev_samp DESC;
Expected output:
region   | meter_count | total_samples | avg_current | stddev_samp | min_current | max_current | range_stddev_ratio |
=====================================================================================================================
Shanghai | 1           | 5             | 14.46       | 3.695       | 10.50       | 19.80       | 2.52               |
Beijing  | 2           | 10            | 13.17       | 3.043       | 9.80        | 18.30       | 2.79               |
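A point that is easy to miss in the regional comparison: the standard deviation of a pooled group is computed over all of its readings and is not the average of the per-meter values. The Python sketch below illustrates this with the Beijing meters; it assumes double-precision inputs, and `range_to_stddev` is an illustrative helper.

```python
import statistics

readings = {
    "d1001": [10.2, 12.5, 15.8, 18.3, 11.5],  # Beijing.Chaoyang
    "d1002": [9.8, 11.2, 14.5, 17.1, 10.8],   # Beijing.Haidian
    "d1003": [10.5, 13.2, 16.5, 19.8, 12.3],  # Shanghai.Pudong
}
regions = {
    "Beijing": readings["d1001"] + readings["d1002"],
    "Shanghai": readings["d1003"],
}


def range_to_stddev(values):
    """(max - min) divided by the sample standard deviation."""
    return (max(values) - min(values)) / statistics.stdev(values)


# Pooled over all ten Beijing readings, as GROUP BY region does.
pooled = statistics.stdev(regions["Beijing"])
# Average of the two per-meter values, shown only for contrast.
average_of_parts = (
    statistics.stdev(readings["d1001"]) + statistics.stdev(readings["d1002"])
) / 2

print(round(pooled, 3), round(average_of_parts, 3))
for region, values in regions.items():
    print(region, round(range_to_stddev(values), 2))
```

The two Beijing meters overlap heavily in range, so pooling them gives a slightly smaller spread than averaging their individual standard deviations.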
Comparison with Other Statistical Functions
STDDEV_SAMP vs STDDEV_POP vs VAR_SAMP vs VAR_POP
sql
-- Compare the statistical functions side by side
SELECT
tbname,
COUNT(*) AS n,
ROUND(AVG(current), 2) AS avg_val,
ROUND(STDDEV_SAMP(current), 4) AS stddev_samp,
ROUND(STDDEV_POP(current), 4) AS stddev_pop,
ROUND(VAR_SAMP(current), 4) AS var_samp,
ROUND(VAR_POP(current), 4) AS var_pop,
ROUND(SQRT(VAR_SAMP(current)), 4) AS sqrt_var_samp
FROM meters
WHERE ts >= '2024-01-15 00:00:00' AND ts < '2024-01-16 00:00:00'
GROUP BY tbname;
Expected output:
tbname | n | avg_val | stddev_samp | stddev_pop | var_samp | var_pop | sqrt_var_samp |
==========================================================================================
d1001  | 5 | 13.66   | 3.3201      | 2.9696     | 11.0230  | 8.8184  | 3.3201        |
d1002  | 5 | 12.68   | 3.0360      | 2.7154     | 9.2170   | 7.3736  | 3.0360        |
d1003  | 5 | 14.46   | 3.6950      | 3.3049     | 13.6530  | 10.9224 | 3.6950        |
Relationships to verify:
- STDDEV_SAMP = √VAR_SAMP
- STDDEV_POP = √VAR_POP
- VAR_SAMP = VAR_POP × n/(n-1)
- STDDEV_SAMP > STDDEV_POP (when n > 1 and the values are not all identical)
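These identities hold for any data set and can be checked numerically. The sketch below uses the d1001 readings and Python's `statistics` module, whose `variance`/`stdev` functions use the n-1 denominator and whose `pvariance`/`pstdev` functions use n; the comparison relies on `math.isclose` to absorb floating-point rounding.

```python
import math
import statistics

values = [10.2, 12.5, 15.8, 18.3, 11.5]  # current readings of meter d1001
n = len(values)

var_samp = statistics.variance(values)   # divides by n - 1
var_pop = statistics.pvariance(values)   # divides by n
sd_samp = statistics.stdev(values)
sd_pop = statistics.pstdev(values)

print(math.isclose(sd_samp, math.sqrt(var_samp)))      # True
print(math.isclose(sd_pop, math.sqrt(var_pop)))        # True
print(math.isclose(var_samp, var_pop * n / (n - 1)))   # True
print(sd_samp > sd_pop)                                # True
```

The third identity also explains why the gap between the two estimators shrinks as n grows: the factor n/(n-1) approaches 1.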
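Best Practices
1. When to use STDDEV_SAMP
✅ Recommended:
- Analyzing sample data in order to draw conclusions about the population
- Assessing the volatility and stability of data
- Risk assessment and quality control
- Statistical analysis and hypothesis testing
❌ Not recommended:
- The data is known to be the complete population (use STDDEV_POP)
- Sample size n < 2 (the result is NULL and carries no information)
- You only need to describe the current data set exactly (use STDDEV_POP)
2. Performance recommendations
sql
-- ✅ Recommended: filter by time range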
SELECT STDDEV_SAMP(current)
FROM meters
WHERE ts >= '2024-01-15 00:00:00'
AND ts < '2024-01-16 00:00:00';
-- ✅ Recommended: combine with time-window grouping
SELECT
TIMETRUNCATE(ts, 1h) AS hour,
STDDEV_SAMP(current) AS hourly_stddev
FROM meters
WHERE ts >= '2024-01-15 00:00:00'
GROUP BY hour;
-- ✅ Recommended: compute together with other aggregate functions
SELECT
AVG(current) AS avg_val,
STDDEV_SAMP(current) AS stddev_val,
COUNT(*) AS sample_size
FROM meters;
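3. Data quality checks
sql
-- Check whether the sample size is sufficient
-- COUNT(current) counts only non-NULL values, which are the ones STDDEV_SAMP uses
SELECT
tbname,
COUNT(current) AS sample_size,
CASE
WHEN COUNT(current) < 2 THEN 'Insufficient sample'
WHEN COUNT(current) < 30 THEN 'Small sample'
ELSE 'Adequate sample'
END AS sample_status,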
STDDEV_SAMP(current) AS stddev_val
FROM meters
GROUP BY tbname;
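FAQ
Q1: Why does STDDEV_SAMP return NULL?
Possible causes:
- Every value in the column is NULL
- Sample size n < 2 (at least 2 non-NULL values are required)
- After grouping, some groups contain too few samples
Solution:
sql
-- Check the sample size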
SELECT
tbname,
COUNT(*) AS total_rows,
COUNT(current) AS non_null_count,
STDDEV_SAMP(current) AS stddev_val
FROM meters
GROUP BY tbname;
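Q2: How do I choose between STDDEV_SAMP and STDDEV_POP?
Selection criteria:
- Use STDDEV_SAMP when the data is a sample drawn from a population and you want to estimate the population standard deviation
- Use STDDEV_POP when the data is the complete population and you only need to describe the current data set
Smart meter examples:
- Analyzing data from all meters → use STDDEV_POP
- Analyzing data from a sampled subset of meters → use STDDEV_SAMP
Q3: What does a very large or very small standard deviation mean?
Interpretation:
- Large standard deviation: the data is widely dispersed, fluctuates strongly, and is unstable
- Small standard deviation: the data is concentrated, fluctuates little, and is stable
- Standard deviation of 0: all values are identical
Rules of thumb:
- Use the coefficient of variation (CV = standard deviation / mean) to compare series with different scales
- CV < 15%: low variability
- 15% ≤ CV < 30%: moderate variability
- CV ≥ 30%: high variability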
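The bands above translate directly into a small classifier. This is a sketch under two assumptions: the CV is built from the sample standard deviation (as in the voltage scenario), and the mean is non-zero. The function name `cv_band` and the band labels are illustrative.

```python
import statistics


def cv_band(values):
    """Return (CV in percent, band) using the 15% and 30% rules of thumb."""
    cv = statistics.stdev(values) / statistics.mean(values) * 100
    if cv < 15:
        band = "low"
    elif cv < 30:
        band = "moderate"
    else:
        band = "high"
    return round(cv, 1), band


print(cv_band([220, 221, 222, 223, 220]))       # voltage of d1001: (0.6, 'low')
print(cv_band([10.2, 12.5, 15.8, 18.3, 11.5]))  # current of d1001: (24.3, 'moderate')
print(cv_band([1, 10, 100]))                    # widely spread values: (148.0, 'high')
```

Because the CV is dimensionless, the voltage and current of the same meter can be compared directly even though their units and magnitudes differ.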
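Related Functions
- STDDEV / STDDEV_POP: population standard deviation
- STD: alias for the population standard deviation
- VARIANCE / VAR_POP: population variance
- VAR_SAMP: sample variance
- AVG: mean
- SPREAD: difference between the maximum and minimum values
Summary
STDDEV_SAMP is an important tool for assessing the volatility and stability of data. In smart meter scenarios it can be used for:
- ✅ Load volatility assessment
- ✅ Abnormal meter identification
- ✅ Grid quality monitoring
- ✅ Peak and off-peak period analysis
- ✅ Regional load comparison
Used appropriately, the sample standard deviation helps utilities manage risk and allocate resources more effectively.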
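About TDengine
TDengine is designed for IoT platforms and industrial big data platforms. TDengine TSDB is a high-performance, distributed time series database with built-in caching, stream processing, and data subscription. TDengine IDMP is an AI-native industrial data management platform that organizes data into a catalog using a tree hierarchy, standardizes and contextualizes the data, and uses AI to provide real-time analytics, visualization, event management, and alerting.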