
VARIANCE Function User Manual
Function Overview
sql
VARIANCE(expr)
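Description: computes the population variance of a column in a table or supertable.
Version: v3.3.3.0
Return type: DOUBLE.
Applicable data types: numeric types.
Nested subquery support: can be used in both inner and outer queries.
Applies to: tables and supertables.
Alias: VAR_POP
Usage Notes
- Population variance vs. sample variance:
  - Population variance (VARIANCE/VAR_POP): divides by n; appropriate when the data is the complete population
  - Sample variance (VAR_SAMP): divides by n-1; appropriate for sample data, where it gives an unbiased estimate
  - Formulas:
    - Population variance: Σ(xi - x̄)² / n
    - Sample variance: Σ(xi - x̄)² / (n-1)
- NULL handling:
  - NULL values are ignored automatically during the calculation
  - If every value in the expr column is NULL, the function returns NULL
- Minimum data requirement:
  - At least 1 non-NULL value is needed to compute the population variance (by the formula, a single value gives a variance of 0)
  - If there are no non-NULL values, the function returns NULL
- Precision:
  - The result is a DOUBLE, so its precision is limited by the system's floating-point arithmetic
- Relationship to standard deviation:
  - Variance = standard deviation²
  - VARIANCE(x) = STDDEV_POP(x)²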
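The rules above can be sketched in a few lines of Python. This only illustrates the arithmetic and is not TDengine's implementation: the helper names `variance_pop`, `variance_samp`, and `stddev_pop` are hypothetical, `None` stands in for SQL NULL, and returning `None` from `variance_samp` for fewer than two values is a choice made for this sketch.

```python
import math

def _non_null(values):
    # None stands in for SQL NULL and is skipped before any arithmetic.
    return [v for v in values if v is not None]

def variance_pop(values):
    """Population variance: sum of squared deviations divided by n."""
    xs = _non_null(values)
    if not xs:
        return None  # no non-NULL input -> NULL
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def variance_samp(values):
    """Sample variance: same numerator, divided by n-1."""
    xs = _non_null(values)
    if len(xs) < 2:
        return None  # this sketch's choice for fewer than two values
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def stddev_pop(values):
    """Population standard deviation: square root of the population variance."""
    var = variance_pop(values)
    return None if var is None else math.sqrt(var)

data = [1, 2, None, 3, 4, 5]
print(variance_pop(data))          # 2.0
print(variance_samp(data))         # 2.5
print(variance_pop([None, None]))  # None
print(variance_pop([7]))           # 0.0
```

The NULL entry changes neither result, because it is dropped before n is counted.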
Basic Examples
Example 1: Population variance of a simple data set
sql
-- Create the test table
CREATE TABLE test_variance (
ts TIMESTAMP,
id INT
);
-- Insert test data
INSERT INTO test_variance VALUES
('2024-01-01 00:00:00', 1),
('2024-01-01 00:00:01', 2),
('2024-01-01 00:00:02', 3),
('2024-01-01 00:00:03', 4),
('2024-01-01 00:00:04', 5);
-- Query the data
SELECT id FROM test_variance;
Output:
id |
==============
1 |
2 |
3 |
4 |
5 |
sql
-- Compute the population variance
SELECT VARIANCE(id) FROM test_variance;
Output:
variance(id) |
==========================
2.000000000000000|
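Explanation:
- Mean: (1+2+3+4+5)/5 = 3
- Variance: [(1-3)²+(2-3)²+(3-3)²+(4-3)²+(5-3)²]/5 = 10/5 = 2.0
Example 2: Comparing population variance and sample variance
sql
-- Compute the population and sample variance together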
SELECT
VARIANCE(id) AS population_var,
VAR_POP(id) AS var_pop,
VAR_SAMP(id) AS sample_var,
STDDEV_POP(id) AS population_stddev
FROM test_variance;
Output:
population_var | var_pop | sample_var | population_stddev |
==========================================================================================
2.000000000000000| 2.000000000000000| 2.500000000000000| 1.414213562373095|
Notes:
- Population variance (2.0) < sample variance (2.5)
- Population standard deviation = √(population variance) = √2.0 ≈ 1.4142
- VARIANCE is equivalent to VAR_POP
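The arithmetic behind Examples 1 and 2 can be replayed step by step. The sketch below uses plain Python and cross-checks the hand calculation against the standard library's `statistics` module; it does not query TDengine, and the variable names are illustrative.

```python
import statistics

ids = [1.0, 2.0, 3.0, 4.0, 5.0]

mean = sum(ids) / len(ids)                      # 3.0
squared_devs = [(x - mean) ** 2 for x in ids]   # [4.0, 1.0, 0.0, 1.0, 4.0]
sum_sq = sum(squared_devs)                      # 10.0

population_var = sum_sq / len(ids)              # 10 / 5 = 2.0
sample_var = sum_sq / (len(ids) - 1)            # 10 / 4 = 2.5

# Cross-check the hand calculation against the standard library.
assert population_var == statistics.pvariance(ids)
assert sample_var == statistics.variance(ids)

population_stddev = statistics.pstdev(ids)      # square root of 2.0
print(population_var, sample_var, population_stddev)
```

Only the divisor differs between the two variances; the sum of squared deviations (10) is shared.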
Smart Meter Application Scenarios
Scenario setup: create the smart meter tables
sql
-- Create the database
CREATE DATABASE IF NOT EXISTS power;
USE power;
-- Create the smart meter supertable
CREATE STABLE IF NOT EXISTS meters (
ts TIMESTAMP,
current FLOAT,
voltage INT,
phase FLOAT
) TAGS (
groupid INT,
location VARCHAR(64)
);
-- Create the child tables
CREATE TABLE d1001 USING meters TAGS (1, 'Beijing.Chaoyang');
CREATE TABLE d1002 USING meters TAGS (1, 'Beijing.Haidian');
CREATE TABLE d1003 USING meters TAGS (2, 'Shanghai.Pudong');
-- Insert test data (simulated current readings over one day)
INSERT INTO d1001 VALUES
('2024-01-15 00:00:00', 10.2, 220, 0.95),
('2024-01-15 06:00:00', 12.5, 221, 0.96),
('2024-01-15 12:00:00', 15.8, 222, 0.97),
('2024-01-15 18:00:00', 18.3, 223, 0.96),
('2024-01-15 23:00:00', 11.5, 220, 0.95);
INSERT INTO d1002 VALUES
('2024-01-15 00:00:00', 9.8, 219, 0.94),
('2024-01-15 06:00:00', 11.2, 220, 0.95),
('2024-01-15 12:00:00', 14.5, 221, 0.96),
('2024-01-15 18:00:00', 17.1, 222, 0.95),
('2024-01-15 23:00:00', 10.8, 219, 0.94);
INSERT INTO d1003 VALUES
('2024-01-15 00:00:00', 10.5, 220, 0.96),
('2024-01-15 06:00:00', 13.2, 221, 0.97),
('2024-01-15 12:00:00', 16.5, 222, 0.98),
('2024-01-15 18:00:00', 19.8, 223, 0.97),
('2024-01-15 23:00:00', 12.3, 220, 0.96);
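Scenario 1: Assessing meter load stability (population metric)
Business need: the utility wants to assess the load stability of each meter. A larger variance means larger load fluctuation and a higher operating risk for the grid.
sql
-- Compute the population variance of current for each meter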
SELECT
tbname AS meter_id,
location,
COUNT(*) AS sample_count,
ROUND(AVG(current), 2) AS avg_current,
ROUND(VARIANCE(current), 3) AS current_variance,
ROUND(VAR_SAMP(current), 3) AS current_var_samp,
ROUND(STDDEV_POP(current), 3) AS current_stddev
FROM meters
WHERE ts >= '2024-01-15 00:00:00' AND ts < '2024-01-16 00:00:00'
GROUP BY tbname, location
ORDER BY current_variance DESC;
Expected output:
meter_id | location | sample_count | avg_current | current_variance | current_var_samp | current_stddev |
======================================================================================================================
d1003 | Shanghai.Pudong | 5 | 14.46 | 10.922 | 13.653 | 3.305 |
d1001 | Beijing.Chaoyang | 5 | 13.66 | 8.818 | 11.023 | 2.970 |
d1002 | Beijing.Haidian | 5 | 12.68 | 7.374 | 9.217 | 2.715 |
Business insights:
- Meter d1003 has the largest load variance (10.922), so its load fluctuates the most
- The population variance is smaller than the sample variance because the sample variance divides by n-1
- Variance = standard deviation² (for d1003: 3.305² ≈ 10.922)
- When the data describes the complete population, the population variance is the appropriate measure
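As a cross-check on the per-meter figures, the same statistics can be recomputed offline from the inserted current readings. This is a sketch using Python's `statistics` module on double-precision values; the column is declared FLOAT in the table, so a real query may differ in the last displayed digit.

```python
import statistics

# Current readings copied from the INSERT statements in the scenario setup.
current = {
    "d1001": [10.2, 12.5, 15.8, 18.3, 11.5],
    "d1002": [9.8, 11.2, 14.5, 17.1, 10.8],
    "d1003": [10.5, 13.2, 16.5, 19.8, 12.3],
}

rows = []
for meter, xs in current.items():
    rows.append({
        "meter_id": meter,
        "avg_current": round(statistics.fmean(xs), 2),
        "current_variance": round(statistics.pvariance(xs), 3),  # divides by n
        "current_var_samp": round(statistics.variance(xs), 3),   # divides by n-1
        "current_stddev": round(statistics.pstdev(xs), 3),
    })

# Same ordering as ORDER BY current_variance DESC.
rows.sort(key=lambda r: r["current_variance"], reverse=True)
for r in rows:
    print(r)
```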
Scenario 2: Load stability by time of day
Business need: analyze load stability in different periods of the day to assess the grid's operating risk in each period.
sql
-- Current variance per time period
SELECT
CASE
WHEN HOUR(ts) BETWEEN 0 AND 5 THEN 'Early morning (00-05)'
WHEN HOUR(ts) BETWEEN 6 AND 11 THEN 'Morning (06-11)'
WHEN HOUR(ts) BETWEEN 12 AND 17 THEN 'Afternoon (12-17)'
ELSE 'Evening (18-23)'
END AS time_period,
COUNT(*) AS sample_count,
ROUND(AVG(current), 2) AS avg_current,
ROUND(MIN(current), 2) AS min_current,
ROUND(MAX(current), 2) AS max_current,
ROUND(VARIANCE(current), 3) AS variance_pop,
ROUND(VAR_SAMP(current), 3) AS variance_samp,
ROUND(STDDEV_POP(current), 3) AS stddev_pop
FROM meters
WHERE ts >= '2024-01-15 00:00:00' AND ts < '2024-01-16 00:00:00'
GROUP BY time_period
ORDER BY
CASE time_period
WHEN 'Early morning (00-05)' THEN 1
WHEN 'Morning (06-11)' THEN 2
WHEN 'Afternoon (12-17)' THEN 3
WHEN 'Evening (18-23)' THEN 4
END;
Expected output:
time_period | sample_count | avg_current | min_current | max_current | variance_pop | variance_samp | stddev_pop |
======================================================================================================================
Early morning (00-05) | 3 | 10.17 | 9.80 | 10.50 | 0.082 | 0.123 | 0.287 |
Morning (06-11) | 3 | 12.30 | 11.20 | 13.20 | 0.687 | 1.030 | 0.829 |
Afternoon (12-17) | 3 | 15.60 | 14.50 | 16.50 | 0.687 | 1.030 | 0.829 |
Evening (18-23) | 6 | 14.97 | 10.80 | 19.80 | 12.586 | 15.103 | 3.548 |
Business insights:
- The early-morning period has the smallest variance (0.082), so its load is the most stable
- The evening period has the largest variance (12.586) because it contains both peak (18:00) and off-peak (23:00) readings
- The variance of each period can inform period-specific grid dispatch strategies
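The period-level figures follow from bucketing every reading by its hour and taking the population variance of each bucket. The sketch below reproduces that grouping offline; the `period` helper and the short bucket keys are illustrative, and the values are double precision rather than the table's FLOAT.

```python
import statistics
from collections import defaultdict

# (hour of day, current) for all three meters, from the scenario setup.
readings = [
    (0, 10.2), (6, 12.5), (12, 15.8), (18, 18.3), (23, 11.5),  # d1001
    (0, 9.8), (6, 11.2), (12, 14.5), (18, 17.1), (23, 10.8),   # d1002
    (0, 10.5), (6, 13.2), (12, 16.5), (18, 19.8), (23, 12.3),  # d1003
]

def period(hour):
    """Map an hour to the same four buckets as the CASE expression."""
    if hour <= 5:
        return "00-05"
    if hour <= 11:
        return "06-11"
    if hour <= 17:
        return "12-17"
    return "18-23"

buckets = defaultdict(list)
for hour, value in readings:
    buckets[period(hour)].append(value)

summary = {
    key: (len(xs), round(statistics.pvariance(xs), 3))
    for key, xs in sorted(buckets.items())
}
for key, (count, var) in summary.items():
    print(key, count, var)
```

The 18-23 bucket holds six readings because both the 18:00 and the 23:00 samples fall into it, which is why its variance is so much larger than the others.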
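Scenario 3: Voltage stability assessment
Business need: assess the stability of grid voltage; an excessively large variance may indicate a power quality problem.
sql
-- Analyze voltage fluctuation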
SELECT
location,
COUNT(DISTINCT tbname) AS meter_count,
COUNT(*) AS sample_count,
ROUND(AVG(voltage), 2) AS avg_voltage,
ROUND(MIN(voltage), 2) AS min_voltage,
ROUND(MAX(voltage), 2) AS max_voltage,
ROUND(VARIANCE(voltage), 3) AS voltage_variance,
ROUND(STDDEV_POP(voltage), 3) AS voltage_stddev,
ROUND(STDDEV_POP(voltage) / AVG(voltage) * 100, 3) AS cv_percentage
FROM meters
WHERE ts >= '2024-01-15 00:00:00' AND ts < '2024-01-16 00:00:00'
GROUP BY location
ORDER BY voltage_variance DESC;
Expected output:
location | meter_count | sample_count | avg_voltage | min_voltage | max_voltage | voltage_variance | voltage_stddev | cv_percentage |
===================================================================================================================================================
Beijing.Chaoyang | 1 | 5 | 221.20 | 220.00 | 223.00 | 1.360 | 1.166 | 0.527 |
Shanghai.Pudong | 1 | 5 | 221.20 | 220.00 | 223.00 | 1.360 | 1.166 | 0.527 |
Beijing.Haidian | 1 | 5 | 220.20 | 219.00 | 222.00 | 1.360 | 1.166 | 0.530 |
Business insights:
- cv_percentage is the coefficient of variation (CV), which expresses fluctuation relative to the mean
- The voltage CV of every location is about 0.53%, well below the 1% inspection threshold, which indicates good power quality
- If a location's CV exceeds 1%, the grid equipment should be inspected
- A small voltage variance indicates a stable supply
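The coefficient of variation is the population standard deviation divided by the mean, expressed as a percentage. A minimal sketch of that calculation on the inserted voltage readings follows; `cv_percentage` is a hypothetical helper that mirrors the SQL expression, not a TDengine function.

```python
import statistics

# Voltage readings copied from the scenario setup, keyed by location.
voltage = {
    "Beijing.Chaoyang": [220, 221, 222, 223, 220],
    "Beijing.Haidian": [219, 220, 221, 222, 219],
    "Shanghai.Pudong": [220, 221, 222, 223, 220],
}

def cv_percentage(xs):
    """STDDEV_POP(x) / AVG(x) * 100, as in the query above."""
    return statistics.pstdev(xs) / statistics.fmean(xs) * 100

cv = {loc: cv_percentage(xs) for loc, xs in voltage.items()}
for loc, value in cv.items():
    # Flag locations above the 1% inspection threshold from the text.
    status = "inspect" if value > 1.0 else "ok"
    print(f"{loc}: variance={statistics.pvariance(xs := voltage[loc]):.3f} "
          f"cv={value:.3f}% {status}")
```

All three locations share the same variance (1.36) because their readings have the same shape; only the mean, and therefore the CV, differs slightly.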
Scenario 4: Comparing load characteristics by region
Business need: compare the load characteristics of different regions (Beijing vs. Shanghai).
sql
-- Statistics grouped by region
SELECT
CASE
WHEN location LIKE 'Beijing%' THEN 'Beijing'
WHEN location LIKE 'Shanghai%' THEN 'Shanghai'
ELSE 'Other'
END AS region,
COUNT(DISTINCT tbname) AS meter_count,
COUNT(*) AS total_samples,
ROUND(AVG(current), 2) AS avg_current,
ROUND(VARIANCE(current), 3) AS variance_pop,
ROUND(VAR_SAMP(current), 3) AS variance_samp,
ROUND(MIN(current), 2) AS min_current,
ROUND(MAX(current), 2) AS max_current,
ROUND((MAX(current) - MIN(current)) / STDDEV_POP(current), 2) AS range_stddev_ratio
FROM meters
WHERE ts >= '2024-01-15 00:00:00' AND ts < '2024-01-16 00:00:00'
GROUP BY region
ORDER BY variance_pop DESC;
Expected output:
region | meter_count | total_samples | avg_current | variance_pop | variance_samp | min_current | max_current | range_stddev_ratio |
============================================================================================================================================
Shanghai | 1 | 5 | 14.46 | 10.922 | 13.653 | 10.50 | 19.80 | 2.81 |
Beijing | 2 | 10 | 13.17 | 8.336 | 9.262 | 9.80 | 18.30 | 2.94 |
Business insights:
- Shanghai has the larger variance, so its load fluctuates more
- The variance can inform region-specific grid management strategies
- The ratio of range to standard deviation gives a rough measure of how spread out the data is
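One point worth making explicit: the variance of a region that pools several meters is not the average of the per-meter variances, because differences between the meters' means also contribute. The sketch below checks this decomposition (within-meter plus between-meter variance) for the two Beijing meters; it is an offline illustration in Python, with double-precision values.

```python
import math
import statistics

# Current readings of the two Beijing meters, from the scenario setup.
beijing = {
    "d1001": [10.2, 12.5, 15.8, 18.3, 11.5],
    "d1002": [9.8, 11.2, 14.5, 17.1, 10.8],
}

pooled = [x for xs in beijing.values() for x in xs]
n = len(pooled)
grand_mean = statistics.fmean(pooled)

# What VARIANCE(current) computes when both meters fall into one group.
pooled_var = statistics.pvariance(pooled)

# Size-weighted average of each meter's own population variance.
within = sum(len(xs) * statistics.pvariance(xs) for xs in beijing.values()) / n

# Size-weighted spread of the meter means around the regional mean.
between = sum(
    len(xs) * (statistics.fmean(xs) - grand_mean) ** 2 for xs in beijing.values()
) / n

print(round(pooled_var, 4), round(within, 4), round(between, 4))
assert math.isclose(pooled_var, within + between)
```

Here the pooled variance (about 8.336) exceeds the average per-meter variance (about 8.096) by the between-meter term (about 0.240).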
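Scenario 5: Variance trend across time windows
Business need: analyze the variance of a meter's load over fixed time windows to identify abnormal periods.
sql
-- Current variance per 4-hour window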
SELECT
TIMETRUNCATE(ts, 4h) AS time_window,
tbname AS meter_id,
COUNT(*) AS sample_count,
ROUND(AVG(current), 2) AS avg_current,
ROUND(VARIANCE(current), 3) AS variance_pop,
ROUND(STDDEV_POP(current), 3) AS stddev_pop
FROM meters
WHERE ts >= '2024-01-15 00:00:00' AND ts < '2024-01-16 00:00:00'
AND tbname = 'd1001'
GROUP BY time_window, tbname
ORDER BY time_window;
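Expected output (assuming windows aligned to 00:00):
time_window | meter_id | sample_count | avg_current | variance_pop | stddev_pop |
===============================================================================================
2024-01-15 00:00:00.000| d1001 | 1 | 10.20 | 0.000 | 0.000 |
2024-01-15 04:00:00.000| d1001 | 1 | 12.50 | 0.000 | 0.000 |
2024-01-15 12:00:00.000| d1001 | 1 | 15.80 | 0.000 | 0.000 |
2024-01-15 16:00:00.000| d1001 | 1 | 18.30 | 0.000 | 0.000 |
2024-01-15 20:00:00.000| d1001 | 1 | 11.50 | 0.000 | 0.000 |
Notes:
- The sample data has one reading roughly every 6 hours, so each 4-hour window holds a single reading and its variance is 0; more densely sampled data is needed for a meaningful trend
- The variance of each window reflects load stability within that window
- Windows with abnormally high variance can be singled out for closer monitoring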
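How the window width interacts with the sampling rate can be sketched offline. The helper below, `window_variance`, is hypothetical: it truncates each timestamp to the start of its window (assuming windows aligned to midnight and a width that divides 24 hours) and computes the population variance per window for meter d1001.

```python
import statistics
from collections import defaultdict
from datetime import datetime

# d1001 readings copied from the scenario setup.
d1001 = [
    ("2024-01-15 00:00:00", 10.2),
    ("2024-01-15 06:00:00", 12.5),
    ("2024-01-15 12:00:00", 15.8),
    ("2024-01-15 18:00:00", 18.3),
    ("2024-01-15 23:00:00", 11.5),
]

def window_variance(rows, window_hours):
    """Return {window_start: (sample_count, population_variance)}."""
    buckets = defaultdict(list)
    for ts, value in rows:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        # Truncate to the start of the window containing t.
        start = t.replace(hour=t.hour - t.hour % window_hours, minute=0, second=0)
        buckets[start].append(value)
    return {
        start: (len(xs), statistics.pvariance(xs))
        for start, xs in sorted(buckets.items())
    }

four_hour = window_variance(d1001, 4)     # one reading per window
twelve_hour = window_variance(d1001, 12)  # two and three readings per window
for start, (count, var) in twelve_hour.items():
    print(start, count, round(var, 4))
```

With 4-hour windows every bucket has one reading and a variance of 0; widening the window to 12 hours yields buckets of two and three readings with non-zero variances.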
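Scenario 6: Power factor stability analysis
Business need: assess the stability of the power factor; an excessively large variance may reduce supply efficiency.
sql
-- Analyze power factor fluctuation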
SELECT
location,
COUNT(DISTINCT tbname) AS meter_count,
ROUND(AVG(phase), 4) AS avg_phase,
ROUND(VARIANCE(phase), 6) AS phase_variance,
ROUND(STDDEV_POP(phase), 6) AS phase_stddev,
ROUND(MIN(phase), 4) AS min_phase,
ROUND(MAX(phase), 4) AS max_phase,
CASE
WHEN VARIANCE(phase) < 0.0001 THEN 'Very stable'
WHEN VARIANCE(phase) < 0.0005 THEN 'Stable'
WHEN VARIANCE(phase) < 0.001 THEN 'Fair'
ELSE 'Unstable'
END AS stability_level
FROM meters
WHERE ts >= '2024-01-15 00:00:00' AND ts < '2024-01-16 00:00:00'
GROUP BY location
ORDER BY phase_variance DESC;
Expected output:
location | meter_count | avg_phase | phase_variance | phase_stddev | min_phase | max_phase | stability_level |
==============================================================================================================================
Shanghai.Pudong | 1 | 0.9680 | 0.000056 | 0.007483 | 0.9600 | 0.9800 | Very stable |
Beijing.Chaoyang | 1 | 0.9580 | 0.000056 | 0.007483 | 0.9500 | 0.9700 | Very stable |
Beijing.Haidian | 1 | 0.9480 | 0.000056 | 0.007483 | 0.9400 | 0.9600 | Very stable |
Business insights:
- A small power factor variance indicates stable electrical efficiency
- The variance level can be used to grade power quality
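The threshold grading can be checked offline. The sketch below applies the same cut-offs as the CASE expression to the inserted phase readings; `stability_level` is a hypothetical helper, and the values are double precision rather than the table's FLOAT.

```python
import statistics

# Phase readings copied from the scenario setup, keyed by location.
phase = {
    "Beijing.Chaoyang": [0.95, 0.96, 0.97, 0.96, 0.95],
    "Beijing.Haidian": [0.94, 0.95, 0.96, 0.95, 0.94],
    "Shanghai.Pudong": [0.96, 0.97, 0.98, 0.97, 0.96],
}

def stability_level(variance):
    """Same thresholds as the CASE expression in the query."""
    if variance < 0.0001:
        return "Very stable"
    if variance < 0.0005:
        return "Stable"
    if variance < 0.001:
        return "Fair"
    return "Unstable"

levels = {}
for loc, xs in phase.items():
    var = statistics.pvariance(xs)
    levels[loc] = stability_level(var)
    print(loc, f"{var:.6f}", f"{statistics.pstdev(xs):.6f}", levels[loc])
```

Each location's population variance is about 0.000056, below the first threshold of 0.0001.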
Comparison with Other Statistical Functions
VARIANCE vs VAR_SAMP vs STDDEV vs STDDEV_SAMP
sql
-- Compare the statistical functions side by side
SELECT
tbname,
COUNT(*) AS n,
ROUND(AVG(current), 2) AS avg_val,
ROUND(VARIANCE(current), 4) AS variance_pop,
ROUND(VAR_POP(current), 4) AS var_pop,
ROUND(VAR_SAMP(current), 4) AS var_samp,
ROUND(STDDEV_POP(current), 4) AS stddev_pop,
ROUND(STDDEV_SAMP(current), 4) AS stddev_samp,
ROUND(SQRT(VARIANCE(current)), 4) AS sqrt_variance
FROM meters
WHERE ts >= '2024-01-15 00:00:00' AND ts < '2024-01-16 00:00:00'
GROUP BY tbname;
Expected output:
tbname | n | avg_val | variance_pop | var_pop | var_samp | stddev_pop | stddev_samp | sqrt_variance |
==========================================================================================================
d1001 | 5 | 13.66 | 8.8184 | 8.8184 | 11.0230 | 2.9696 | 3.3201 | 2.9696 |
d1002 | 5 | 12.68 | 7.3736 | 7.3736 | 9.2170 | 2.7154 | 3.0360 | 2.7154 |
d1003 | 5 | 14.46 | 10.9224 | 10.9224 | 13.6530 | 3.3049 | 3.6950 | 3.3049 |
Relationship checks:
- VARIANCE = VAR_POP (exactly equivalent)
- STDDEV_POP = √VARIANCE
- VAR_SAMP = VAR_POP × n/(n-1)
- VARIANCE < VAR_SAMP (when n > 1 and the values are not all equal)
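These identities can be verified numerically. The following sketch checks them on the d1001 current readings with Python's `statistics` module, and also covers the edge case where all values are equal, in which both variances are 0 and the strict inequality no longer holds.

```python
import math
import statistics

xs = [10.2, 12.5, 15.8, 18.3, 11.5]  # d1001 current readings
n = len(xs)

var_pop = statistics.pvariance(xs)   # divides by n
var_samp = statistics.variance(xs)   # divides by n-1

# VAR_SAMP = VAR_POP * n / (n - 1)
assert math.isclose(var_samp, var_pop * n / (n - 1))
# STDDEV_POP = sqrt(VARIANCE)
assert math.isclose(statistics.pstdev(xs), math.sqrt(var_pop))
# VARIANCE < VAR_SAMP for n > 1 when the values are not all equal
assert var_pop < var_samp

# Edge case: identical values give 0 for both, so the inequality is not strict.
flat = [5.0, 5.0, 5.0]
assert statistics.pvariance(flat) == statistics.variance(flat) == 0.0

print(round(var_pop, 4), round(var_samp, 4))
```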
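Best Practices
1. When to use VARIANCE
✅ Recommended:
- Analyzing the variability of complete population data
- Assessing the stability of equipment operation
- Quality control and risk assessment
- Grid load analysis
- Equipment performance monitoring
❌ Not recommended:
- The data is a sample drawn from a larger population (use VAR_SAMP)
- There are no non-NULL values (n < 1): the result is NULL and carries no information
- An unbiased estimate is required (use VAR_SAMP)
2. Performance recommendations
sql
-- ✅ Recommended: filter by time range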
SELECT VARIANCE(current)
FROM meters
WHERE ts >= '2024-01-15 00:00:00'
AND ts < '2024-01-16 00:00:00';
-- ✅ Recommended: combine with time-window grouping
SELECT
TIMETRUNCATE(ts, 1h) AS hour,
VARIANCE(current) AS hourly_variance
FROM meters
WHERE ts >= '2024-01-15 00:00:00'
GROUP BY hour;
-- ✅ Recommended: compute alongside other statistical functions
SELECT
AVG(current) AS avg_val,
VARIANCE(current) AS var_val,
STDDEV_POP(current) AS stddev_val,
COUNT(*) AS sample_size
FROM meters;
3. Data quality checks
sql
-- Check whether the sample size is sufficient
SELECT
tbname,
COUNT(*) AS sample_size,
CASE
WHEN COUNT(*) < 1 THEN 'No data'
WHEN COUNT(*) < 30 THEN 'Small sample'
ELSE 'Sufficient sample'
END AS sample_status,
VARIANCE(current) AS variance_val
FROM meters
GROUP BY tbname;
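FAQ
Q1: Why does VARIANCE return NULL?
Possible causes:
- Every value in the column is NULL, so there is no non-NULL value to aggregate
- After grouping, some groups contain no non-NULL data
Solution:
sql
-- Inspect the data: compare total rows with non-NULL rows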
SELECT
tbname,
COUNT(*) AS total_rows,
COUNT(current) AS non_null_count,
VARIANCE(current) AS variance_val
FROM meters
GROUP BY tbname;
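Q2: How do I choose between VARIANCE and VAR_SAMP?
Selection criteria:
- Use VARIANCE (VAR_POP): the data is the complete population and you only need to describe the current data set
- Use VAR_SAMP: the data is a sample drawn from a population and you need to infer the population variance
Smart meter examples:
- Analyzing data from all meters → use VARIANCE
- Analyzing data from a sampled subset of meters → use VAR_SAMP
Q3: How do I choose between variance and standard deviation?
Recommendations:
- Variance:
  - Pros: good mathematical properties, convenient for theoretical analysis
  - Cons: its unit is the square of the original unit, which is less intuitive
  - Best for: theoretical work and mathematical calculation
- Standard deviation:
  - Pros: same unit as the original data, more intuitive
  - Cons: involves a square root, so its mathematical properties are slightly more complicated
  - Best for: practical applications and data description
Practical advice:
sql
-- Reports: use standard deviation (more intuitive)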
SELECT
location,
ROUND(STDDEV_POP(current), 2) AS stddev
FROM meters
GROUP BY location;
-- Theoretical analysis: use variance (better mathematical properties)
SELECT
location,
ROUND(VARIANCE(current), 4) AS variance
FROM meters
GROUP BY location;
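Q4: What does a very large or very small variance mean?
Interpretation:
- Large variance: the data is widely dispersed, fluctuates strongly, and is unstable
- Small variance: the data is concentrated, fluctuates little, and is stable
- Variance of 0: all values are identical
Criteria:
- Use the coefficient of variation (CV = standard deviation / mean) for relative comparison; the thresholds below are rules of thumb
- CV < 15%: low variation
- 15% ≤ CV < 30%: medium variation
- CV ≥ 30%: high variation
sql
-- Compute the coefficient of variation to assess stability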
SELECT
tbname,
ROUND(AVG(current), 2) AS avg_current,
ROUND(STDDEV_POP(current), 2) AS stddev,
ROUND(STDDEV_POP(current) / AVG(current) * 100, 2) AS cv_percent,
CASE
WHEN STDDEV_POP(current) / AVG(current) < 0.15 THEN 'Low variation - stable'
WHEN STDDEV_POP(current) / AVG(current) < 0.30 THEN 'Medium variation'
ELSE 'High variation - unstable'
END AS stability
FROM meters
GROUP BY tbname;
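Applied to the sample current readings, the thresholds above can be exercised offline. In this sketch `classify_cv` is a hypothetical helper mirroring the CASE expression; the data is taken from the scenario setup and evaluated in double precision.

```python
import statistics

# Current readings copied from the scenario setup.
current = {
    "d1001": [10.2, 12.5, 15.8, 18.3, 11.5],
    "d1002": [9.8, 11.2, 14.5, 17.1, 10.8],
    "d1003": [10.5, 13.2, 16.5, 19.8, 12.3],
}

def classify_cv(cv):
    """cv is a ratio (not a percentage), as in the CASE expression."""
    if cv < 0.15:
        return "Low variation - stable"
    if cv < 0.30:
        return "Medium variation"
    return "High variation - unstable"

result = {}
for meter, xs in current.items():
    cv = statistics.pstdev(xs) / statistics.fmean(xs)
    result[meter] = (round(cv * 100, 2), classify_cv(cv))
    print(meter, result[meter])
```

Each meter's CV comes out between roughly 21% and 23%, so all three fall in the medium band under these thresholds.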
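Related Functions
- VAR_POP: alias for population variance, exactly equivalent to VARIANCE
- VAR_SAMP: sample variance
- STDDEV / STDDEV_POP: population standard deviation (the square root of the variance)
- STDDEV_SAMP: sample standard deviation
- AVG: mean
- SPREAD: difference between the maximum and minimum values
Summary
The VARIANCE function is an important tool for assessing the variability and stability of data. In smart meter scenarios it can be used for:
- ✅ Load stability assessment (population metric)
- ✅ Power quality monitoring
- ✅ Peak and off-peak period analysis
- ✅ Regional load comparison
- ✅ Equipment performance monitoring
- ✅ Anomaly detection and risk assessment
Choosing between VARIANCE and VAR_SAMP:
- Complete population data → use VARIANCE (VAR_POP)
- Sample data → use VAR_SAMP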