DolphinDB Machine Learning Functions: Built-in ML Capabilities

- Abstract
- 1. Machine Learning Overview
  - 1.1 DolphinDB ML Capabilities
  - 1.2 Built-in ML Functions
  - 1.3 Use Cases
- 2. Regression Analysis
  - 2.1 Linear Regression
  - 2.2 Regression Prediction
  - 2.3 Polynomial Regression
  - 2.4 Regression Evaluation
- 3. Classification Models
  - 3.1 Logistic Regression
  - 3.2 Classification Prediction
  - 3.3 Classification Evaluation
- 4. Cluster Analysis
  - 4.1 K-Means Clustering
  - 4.2 Cluster Visualization
  - 4.3 Cluster Evaluation
- 5. Time Series Forecasting
  - 5.1 ARIMA Model
  - 5.2 Time Series Forecasting
  - 5.3 Time Series Decomposition
- 6. Feature Engineering
  - 6.1 Feature Scaling
  - 6.2 Feature Encoding
  - 6.3 Feature Selection
- 7. Case Studies
  - 7.1 Equipment Failure Prediction
  - 7.2 Energy Consumption Prediction
- 8. Summary
- References

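Abstract

This article walks through the machine learning functions built into DolphinDB. It covers regression analysis, classification models, clustering, time series forecasting, feature engineering, and model evaluation, with a code example for each topic so that readers can apply the built-in ML capabilities directly.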

1. Machine Learning Overview

1.1 DolphinDB ML Capabilities

Diagram: DolphinDB ML capabilities

- Regression analysis: linear regression
- Classification models: logistic regression
- Clustering algorithms: K-Means
- Time series: ARIMA
- Characteristics: built-in functions, vectorized acceleration, distributed computing

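1.2 Built-in ML Functions

| Category | Function | Description |
| --- | --- | --- |
| Regression | ols | Ordinary least squares regression |
| Classification | logisticRegression | Logistic regression classifier |
| Clustering | kmeans | K-Means clustering |
| Forecasting | arima | ARIMA time series forecasting |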

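1.3 Use Cases

| Scenario | Description |
| --- | --- |
| Predictive maintenance | Predicting equipment failures |
| Quality control | Predicting and analyzing product quality |
| Energy forecasting | Forecasting energy consumption trends |
| Anomaly detection | Identifying abnormal data |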

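2. Regression Analysis

2.1 Linear Regression

// Generate sample data: y = 2*x1 + 3*x2 + uniform noise in [-1, 1)
n = 1000
x1 = rand(10.0, n)
x2 = rand(20.0, n)
y = 2 * x1 + 3 * x2 + rand(2.0, n) - 1.0

t = table(x1, x2, y)

// Ordinary least squares. With the default mode=0, ols returns a vector of
// coefficient estimates with the intercept first.
result = ols(y, [x1, x2])

// Inspect the result
result

// Reading the coefficients:
// result[0]: intercept (close to 0)
// result[1]: coefficient of x1 (close to 2)
// result[2]: coefficient of x2 (close to 3)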
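As an independent cross-check of what a least-squares fit computes, the normal equations (X'X)b = X'y can be solved directly. The sketch below is plain Python, not DolphinDB, and the name `ols_fit` is hypothetical; it assumes the feature columns are not collinear and returns the intercept first, followed by one slope per feature.

```python
def ols_fit(X, y):
    """Least-squares fit with an intercept via the normal equations (X'X)b = X'y.

    X is a list of rows (one list of feature values per observation).
    Returns [intercept, b1, b2, ...].
    """
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for k in range(p):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Noise-free data generated from y = 1 + 2*x1 + 3*x2, so the fit should
# recover the coefficients 1, 2 and 3 up to rounding error.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]]
y = [1 + 2 * a + 3 * b for a, b in X]
coef = ols_fit(X, y)
print([round(c, 6) for c in coef])
```

With noisy data such as the simulated series above, the recovered slopes would only be close to 2 and 3 rather than exact.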

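2.2 Regression Prediction

// Predict with the coefficient vector fitted in section 2.1
// (ols with the default mode=0 returns [intercept, b1, b2])
// New observations
newX1 = rand(10.0, 100)
newX2 = rand(20.0, 100)

// Prediction, term by term
predictions = result[0] + result[1] * newX1 + result[2] * newX2

// Or with a matrix product: result[1 2] picks the two slopes,
// matrix(...) turns them into a 2 x 1 column, and ** is the matrix product
newX = matrix(newX1, newX2)
predictions = flatten(newX ** matrix(result[1 2])) + result[0]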

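2.3 Polynomial Regression

// Polynomial regression: fit y = b0 + b1*x + b2*x^2
// Separate variable names keep the data from section 2.1 intact.
px = rand(10.0, 1000)
py = 2 * px + 3 * px * px + rand(2.0, 1000) - 1.0

// Build the quadratic feature
pxSq = px * px

// OLS on the expanded features; expect coefficients close to [0, 2, 3]
polyResult = ols(py, [px, pxSq])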

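2.4 Regression Evaluation

// Regression evaluation metrics
def evaluateRegression(actual, predicted) {
    // R-squared: 1 - SS_res / SS_tot (pow is used because ^ is not exponentiation)
    ssRes = sum(pow(actual - predicted, 2))
    ssTot = sum(pow(actual - avg(actual), 2))
    r2 = 1 - ssRes \ ssTot

    // RMSE
    rmse = sqrt(avg(pow(actual - predicted, 2)))

    // MAE
    mae = avg(abs(actual - predicted))

    // dict(keys, values) builds the result dictionary
    return dict(["R2", "RMSE", "MAE"], [r2, rmse, mae])
}

// Usage: regenerate the data from section 2.1 so the example does not depend
// on variables that other sections may have overwritten
n = 1000
x1 = rand(10.0, n)
x2 = rand(20.0, n)
y = 2 * x1 + 3 * x2 + rand(2.0, n) - 1.0
coef = ols(y, [x1, x2])
predictions = coef[0] + coef[1] * x1 + coef[2] * x2
evaluateRegression(y, predictions)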
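The three metrics are easy to verify by hand on a tiny input. The following plain-Python sketch (the name `evaluate_regression` is hypothetical, and it assumes the actual values are not all identical) uses four points where only the last prediction is off by 2, which gives SS_res = 4, SS_tot = 5, and therefore R2 = 0.2, RMSE = 1 and MAE = 0.5.

```python
import math


def evaluate_regression(actual, predicted):
    """Return R2, RMSE and MAE for two equally long sequences."""
    n = len(actual)
    mean = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
    }


metrics = evaluate_regression([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(metrics)
```

A single large error moves RMSE more than MAE, which is why the two are usually reported together.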

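3. Classification Models

3.1 Logistic Regression

// Generate binary classification data
n = 1000
x1 = rand(10.0, n)
x2 = rand(10.0, n)
y = iif(x1 + x2 > 10, 1, 0)

t = table(x1, x2, y)

// Logistic regression. logisticRegression takes a data source plus the names
// of the label column and the feature columns, not raw vectors.
result = logisticRegression(sqlDS(<select * from t>), `y, `x1`x2)

// Inspect the fitted model (a dictionary)
result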

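3.2 Classification Prediction

// Apply the fitted model to the feature table. Check on your server whether
// predict returns class labels or probabilities for a logistic model.
pred = predict(result, t)

// Manual probability: sigmoid of the linear score.
// Assumption: theta is ordered [intercept, x1, x2]; confirm the layout of the
// model dictionary on your version before relying on these indices.
theta = result.coefficients
prob = 1 / (1 + exp(-(theta[0] + theta[1] * x1 + theta[2] * x2)))

// Predicted class
predicted = iif(prob > 0.5, 1, 0)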
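The probability step is the logistic (sigmoid) function applied to a linear score, followed by a 0.5 threshold. A minimal plain-Python sketch is shown below; the parameter values are hypothetical, chosen so that the decision boundary is x1 + x2 = 10 as in the simulated data, and are not the output of any fitted model.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # avoids overflow for large negative z
    return ez / (1.0 + ez)


def predict_proba(intercept, coefs, row):
    """Probability of class 1 for one observation."""
    z = intercept + sum(b * x for b, x in zip(coefs, row))
    return sigmoid(z)


# Hypothetical parameters: score = -10 + x1 + x2, so the boundary is x1 + x2 = 10.
intercept, coefs = -10.0, [1.0, 1.0]
for row in ([2.0, 3.0], [5.0, 5.0], [8.0, 6.0]):
    p = predict_proba(intercept, coefs, row)
    print(row, round(p, 4), int(p > 0.5))
```

A point exactly on the boundary gets probability 0.5 and, with a strict `> 0.5` threshold, is assigned to class 0.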

3.3 Classification Evaluation

// Classification evaluation metrics
def evaluateClassification(actual, predicted) {
    // Confusion-matrix counts
    tp = sum((actual == 1) && (predicted == 1))
    tn = sum((actual == 0) && (predicted == 0))
    fp = sum((actual == 0) && (predicted == 1))
    fn = sum((actual == 1) && (predicted == 0))

    // Accuracy. Use \ (ratio): / performs integer division on integer operands
    accuracy = (tp + tn) \ (tp + tn + fp + fn)

    // Precision (undefined when no sample is predicted positive)
    precision = tp \ (tp + fp)

    // Recall
    recall = tp \ (tp + fn)

    // F1 score
    f1 = 2 * precision * recall \ (precision + recall)

    // dict(keys, values) builds the result dictionary
    return dict(["Accuracy", "Precision", "Recall", "F1"], [accuracy, precision, recall, f1])
}

// Usage
evaluateClassification(y, predicted)
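The same metrics can be checked by hand on a small label vector. The plain-Python sketch below (the name `evaluate_classification` is hypothetical) returns `None` for a ratio whose denominator is zero, which is one possible convention for the undefined cases. In the example there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so all four metrics equal 0.75.

```python
def evaluate_classification(actual, predicted):
    """Accuracy, precision, recall and F1 for binary labels coded as 0/1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else None  # undefined when the denominator is 0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {
        "Accuracy": ratio(tp + tn, len(pairs)),
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
scores = evaluate_classification(actual, predicted)
print(scores)
```

On imbalanced data a model that always predicts 0 can still reach high accuracy, so precision and recall are the more informative numbers there.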

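4. Cluster Analysis

4.1 K-Means Clustering

// Generate three well-separated groups of 100 points each,
// centered near 2.5, 12.5 and 22.5 on both axes
n = 300
x1 = join(join(rand(5.0, 100), 10 + rand(5.0, 100)), 20 + rand(5.0, 100))
x2 = join(join(rand(5.0, 100), 10 + rand(5.0, 100)), 20 + rand(5.0, 100))

// K-Means with k = 3; kmeans takes a table whose columns are the features
result = kmeans(table(x1, x2), 3)

// Inspect the result (a dictionary)
result

// Cluster centers
result.centers

// Cluster label of each row
result.labels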
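To make the two alternating steps of K-Means concrete, here is a minimal plain-Python sketch of Lloyd's algorithm. It is an illustration only and makes no claim about how DolphinDB implements `kmeans`; in particular, the deterministic farthest-point initialization is an assumption chosen so the example is reproducible, and all function names are hypothetical.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centers(points, k):
    """Farthest-point initialization: deterministic, spreads the seeds out."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    centers = init_centers(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move either
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


points = [(1, 1), (1, 2), (2, 1), (11, 11), (11, 12), (12, 11), (21, 21), (21, 22), (22, 21)]
centers, labels = kmeans(points, 3)
print(sorted(centers))
print(labels)
```

The label numbers themselves are arbitrary; only the grouping matters, which is why results are usually compared by cluster membership rather than by label value.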

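4.2 Cluster Visualization

// Attach the cluster labels to the data
t = table(x1, x2, result.labels as cluster)

// Per-cluster summary statistics
// (the group-by column is included in the output automatically)
select count(*) as cnt,
       avg(x1) as avg_x1,
       avg(x2) as avg_x2
from t
group by cluster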

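4.3 Cluster Evaluation

// Cluster evaluation: mean silhouette coefficient for two-dimensional data,
// using Euclidean distance. Requires at least two clusters.
// Call it with the two feature vectors and the cluster labels from section 4.1.
def silhouetteScore(x1, x2, labels) {
    n = labels.size()
    clusters = distinct(labels)
    scores = array(DOUBLE, n)

    for (i in 0..(n - 1)) {
        // Distance from point i to every point
        d = sqrt(pow(x1 - x1[i], 2) + pow(x2 - x2[i], 2))
        same = labels == labels[i]
        cnt = sum(same)
        if (cnt > 1) {
            // a: mean distance to the other members of the same cluster (d[i] is 0)
            a = sum(d[same]) \ (cnt - 1)

            // b: smallest mean distance to the points of any other cluster
            b = -1.0
            for (c in clusters) {
                if (c != labels[i]) {
                    m = avg(d[labels == c])
                    if (b < 0 || m < b) { b = m }
                }
            }
            scores[i] = (b - a) \ iif(a > b, a, b)
        }
        // A singleton cluster keeps the default score 0
    }

    return avg(scores)
}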
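For reference, the silhouette of a point is (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The plain-Python sketch below (hypothetical name `silhouette_score`, Euclidean distance, at least two clusters assumed, Python 3.8+ for `math.dist`) evaluates a good and a deliberately bad labelling of four points on a line.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    n = len(points)
    clusters = set(labels)
    total = 0.0
    for i in range(n):
        d = [math.dist(points[i], q) for q in points]
        same = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not same:
            continue  # singleton cluster: score 0 by convention
        a = sum(d[j] for j in same) / len(same)
        b = min(
            sum(d[j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / n


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(points, [0, 0, 1, 1])  # neighbors grouped together
bad = silhouette_score(points, [0, 1, 0, 1])   # neighbors split apart
print(round(good, 4), round(bad, 4))
```

Values close to 1 indicate compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.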

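5. Time Series Forecasting

5.1 ARIMA Model

// Generate a series: linear trend + 12-period seasonality + uniform noise in [-5, 5)
n = 200
t = table(
    1..n as time,
    100 + 0.1 * (1..n) + 10 * sin(2 * pi * (1..n) / 12) + rand(10.0, n) - 5.0 as value
)

// Fit ARIMA(1,1,1). The arima call is reproduced as given; confirm that your
// DolphinDB version or installed plugins provide it with this signature.
result = arima(t.value, 1, 1, 1)

// Inspect the result
result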

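5.2 Time Series Forecasting

// Forecast future values (arimaForecast is likewise reproduced as given;
// confirm its availability before use)
forecastSteps = 10
forecast = arimaForecast(result, forecastSteps)

// Print the forecast
print("Forecast for the next " + string(forecastSteps) + " periods:")
print(forecast)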
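To show what the "I" and "AR" parts of an ARIMA forecast do, the sketch below implements a deliberately simplified ARIMA(1,1,0) in plain Python: difference once, fit an AR(1) model to the differences by least squares, forecast the differences, and add them back up. It omits the moving-average term of the ARIMA(1,1,1) model used above and is not how any library fits ARIMA; the function names are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of d[t] = c + phi * d[t-1]; returns (c, phi)."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return my, 0.0  # constant differences: pure drift, no AR effect
    phi = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - phi * mx, phi


def forecast_arima110(values, steps):
    """Forecast `steps` values ahead with a least-squares ARIMA(1,1,0)."""
    diffs = [b - a for a, b in zip(values, values[1:])]  # d = 1: first difference
    c, phi = fit_ar1(diffs)
    level, last_diff, out = values[-1], diffs[-1], []
    for _ in range(steps):
        last_diff = c + phi * last_diff  # AR(1) step on the differenced scale
        level += last_diff               # integrate back to the original scale
        out.append(level)
    return out


# Differences 8, 4, 2, 1 follow d[t] = 0.5 * d[t-1] exactly.
values = [0.0, 8.0, 12.0, 14.0, 15.0]
forecast = forecast_arima110(values, 3)
print(forecast)
```

Because the differences halve at every step, the forecast increments here are 0.5, 0.25 and 0.125, so the forecasts level off instead of continuing in a straight line.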

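5.3 Time Series Decomposition

// Additive decomposition: value = trend + seasonal + residual
// Trend: 12-period trailing moving average (NULL for the first 11 rows)
trend = mavg(t.value, 12)

// Seasonality: average detrended value at each position in the 12-period cycle
detrended = t.value - trend
seasonal = contextby(avg, detrended, (t.time - 1) % 12)

// Residual
residual = t.value - trend - seasonal

// Result
table(t.time as time, t.value as value, trend, seasonal, residual)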
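A classical additive decomposition usually uses a centered moving average for the trend, which avoids the lag of a trailing window, and then averages the detrended values at each position in the cycle. The plain-Python sketch below follows that textbook recipe; the name `decompose` is hypothetical, and it assumes the series covers enough full periods that every cycle position has at least one detrended value.

```python
def decompose(values, period):
    """Additive decomposition into (trend, seasonal, residual) lists."""
    n = len(values)
    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = values[i - half:i + half + 1]
        if period % 2 == 0:
            # 2 x period centered average: the two end points get half weight.
            trend[i] = (sum(window) - 0.5 * (window[0] + window[-1])) / period
        else:
            trend[i] = sum(window) / period
    # Seasonal component: mean detrended value at each position in the cycle.
    buckets = [[] for _ in range(period)]
    for i, (v, t) in enumerate(zip(values, trend)):
        if t is not None:
            buckets[i % period].append(v - t)
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period  # force the seasonal cycle to sum to zero
    seasonal = [means[i % period] - offset for i in range(n)]
    residual = [None if t is None else v - t - s
                for v, t, s in zip(values, trend, seasonal)]
    return trend, seasonal, residual


# Linear trend plus a zero-sum pattern of period 4, no noise.
pattern = [3.0, -1.0, -3.0, 1.0]
values = [10 + 0.5 * i + pattern[i % 4] for i in range(24)]
trend, seasonal, residual = decompose(values, 4)
print(trend[5], seasonal[:4])
```

On this noise-free input the trend and the seasonal pattern are recovered up to rounding error and the residuals are essentially zero; with noisy data the residual carries whatever the two components do not explain.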

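6. Feature Engineering

6.1 Feature Scaling

// Min-max normalization to [0, 1]
// (\ is the ratio operator, so integer inputs are not truncated)
def normalize(data) {
    return (data - min(data)) \ (max(data) - min(data))
}

// Z-score standardization: zero mean, unit standard deviation
def standardize(data) {
    return (data - avg(data)) \ std(data)
}

// Usage
x = rand(100.0, 1000)
normalize(x)
standardize(x)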
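A three-value example makes the two scalings easy to verify by hand. This plain-Python sketch uses the sample standard deviation (divisor n - 1), which is an assumption about the convention; the guard for a constant feature is likewise an added design choice, since min-max scaling is undefined when all values are equal.

```python
import statistics


def normalize(data):
    """Min-max scaling to [0, 1]."""
    lo, hi = min(data), max(data)
    if hi == lo:
        return [0.0 for _ in data]  # constant feature: nothing to rescale
    return [(x - lo) / (hi - lo) for x in data]


def standardize(data):
    """Z-score scaling with the sample standard deviation (divisor n - 1)."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    return [(x - mean) / sd for x in data]


print(normalize([2.0, 4.0, 6.0]))
print(standardize([2.0, 4.0, 6.0]))
```

Min-max scaling is sensitive to outliers because a single extreme value sets the range, whereas standardization keeps outliers visible as large z-scores.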

6.2 Feature Encoding

// One-hot encoding: one 0/1 indicator column per distinct category
def oneHotEncode(categories) {
    uniqueVals = sort(distinct(categories))
    result = table(categories as category)
    for (v in uniqueVals) {
        // Indicator column named after the category, for example is_A
        result["is_" + string(v)] = int(categories == v)
    }
    return result
}

// Usage
categories = take(`A`B`C, 100)
oneHotEncode(categories)
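The encoding itself is a lookup from each category to a column position. A minimal plain-Python sketch, with the hypothetical name `one_hot_encode`, returns both the sorted category levels and the indicator rows so that the column order is explicit.

```python
def one_hot_encode(categories):
    """Return (levels, rows): sorted distinct levels and one 0/1 row per input."""
    levels = sorted(set(categories))
    index = {level: j for j, level in enumerate(levels)}
    rows = []
    for c in categories:
        row = [0] * len(levels)
        row[index[c]] = 1  # exactly one indicator is set per observation
        rows.append(row)
    return levels, rows


levels, rows = one_hot_encode(["A", "B", "C", "A"])
print(levels)
for row in rows:
    print(row)
```

When the encoded columns feed a linear model with an intercept, one indicator column is usually dropped to avoid perfect collinearity.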

6.3 Feature Selection

// Feature selection by correlation with the target
def correlationFilter(features, target, threshold = 0.1) {
    // corr{, target} fixes the second argument (partial application)
    correlations = each(corr{, target}, features)
    return abs(correlations) > threshold
}

// Usage
x1 = rand(10.0, 1000)
x2 = rand(10.0, 1000)
x3 = rand(10.0, 1000)
y = 2 * x1 + rand(2.0, 1000) - 1.0  // only x1 is related to y

correlationFilter([x1, x2, x3], y)
// Typical result: [true, false, false]
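The filter keeps a feature when the absolute Pearson correlation with the target exceeds the threshold. The plain-Python sketch below (hypothetical names, non-constant inputs assumed) uses one feature that is almost linear in the target and one constructed to have exactly zero correlation with it.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_filter(features, target, threshold=0.1):
    """True for every feature whose |correlation| with the target exceeds the threshold."""
    return [abs(pearson(f, target)) > threshold for f in features]


target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
related = [2.0, 4.1, 5.9, 8.0, 10.2, 11.9]    # roughly 2 * target
unrelated = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]  # zero covariance with target
keep = correlation_filter([related, unrelated], target)
print(keep)
```

Correlation only captures linear association, so a feature with a strong nonlinear relationship to the target can still be filtered out by this rule.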

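7. Case Studies

7.1 Equipment Failure Prediction

// ========== Equipment failure prediction ==========

// Simulated equipment data
n = 10000
t = table(
    1..n as device_id,
    rand(1000.0, n) as vibration,      // vibration
    rand(100.0, n) as temperature,     // temperature
    rand(50.0, n) as pressure,         // pressure
    rand(1000.0, n) as runtime,        // running time
    iif(rand(100.0, n) > 90, 1, 0) as failure  // failure label, about 10% positive
)

// Feature columns
features = `vibration`temperature`pressure`runtime

// Fit logistic regression on a data source built from the table
model = logisticRegression(sqlDS(<select * from t>), `failure, features)

// Predicted probability: sigmoid of the linear score.
// Assumption: theta is ordered [intercept, vibration, temperature, pressure, runtime];
// confirm the layout of the model dictionary on your version first.
theta = model.coefficients
prob = 1 / (1 + exp(-(theta[0] +
                       theta[1] * t.vibration +
                       theta[2] * t.temperature +
                       theta[3] * t.pressure +
                       theta[4] * t.runtime)))

predicted = iif(prob > 0.5, 1, 0)

// Evaluate with evaluateClassification from section 3.3.
// The simulated labels are independent of the features, so expect no real predictive power.
evaluateClassification(t.failure, predicted)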

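7.2 Energy Consumption Prediction

// ========== Energy consumption prediction ==========

// Simulated daily data for 365 days
n = 365
t = table(
    2024.01.01 + 0..(n-1) as date,
    1000.0 + rand(1000.0, n) as energy,      // uniform in [1000, 2000)
    10.0 + rand(25.0, n) as temperature,     // uniform in [10, 35)
    rand(2, n) as is_workday                 // 0 or 1
)

// Features: temperature and workday flag
features = [t.temperature, double(t.is_workday)]

// Linear regression; with the default mode=0, model is the coefficient vector
// [intercept, temperature, is_workday]
model = ols(t.energy, features)

// Prediction
predictions = model[0] +
              model[1] * t.temperature +
              model[2] * double(t.is_workday)

// Evaluate with evaluateRegression from section 2.4
evaluateRegression(t.energy, predictions)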

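8. Summary

This article covered the following DolphinDB machine learning topics:

Regression analysis: linear regression, polynomial regression, evaluation metrics
Classification models: logistic regression, classification prediction, evaluation metrics
Cluster analysis: K-Means clustering, cluster evaluation
Time series: ARIMA model, time series forecasting
Feature engineering: feature scaling, feature encoding, feature selection
Case studies: failure prediction, energy consumption prediction

Questions for further thought:

How do you choose a suitable machine learning model for a given problem?
How do you evaluate model performance?
Why does feature engineering matter?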
