DolphinDB File Data Ingestion: Bulk Import of Historical Data

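Abstract

This article explains how to ingest file data into DolphinDB. It covers CSV and JSON import, Parquet handling, Excel parsing, bulk import, data cleaning, and performance optimization. The code examples are meant to help readers build the core skills for importing historical data in bulk.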


1. Overview of File Data Ingestion

1.1 Supported File Formats

[Figure: DolphinDB file ingestion overview. File formats: CSV, JSON, Parquet, Excel. Features: bulk import, data cleaning, performance optimization.]

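1.2 File Format Characteristics

Format  | Advantages           | Disadvantages
CSV     | Universal, simple    | No type information; slow for large files
JSON    | Structured, flexible | Slow to parse; large files
Parquet | Columnar, compressed | Requires library support
Excel   | Easy to read         | Suitable only for small datasets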

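2. CSV File Import

2.1 Basic Import

// Load a CSV file into an in-memory table
data = loadText("/data/sensor_data.csv")

// Preview the first 10 rows
select top 10 * from data

// Inspect the table structure (column names and types)
schema(data)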
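
To try the basic import without real sensor data, a small synthetic file can be generated first. The following is a minimal Python sketch that runs outside DolphinDB. The function name `write_sample_csv`, the value ranges, and the device naming are illustrative assumptions. The timestamp is written in DolphinDB's literal style (`yyyy.MM.ddTHH:mm:ss.SSS`) on the assumption that loadText will recognize it as a TIMESTAMP.

```python
import csv
import os
import random
import tempfile
from datetime import datetime, timedelta

def write_sample_csv(path, n_rows=1000, n_devices=5, seed=42):
    """Write a synthetic sensor CSV and return the number of data rows."""
    rng = random.Random(seed)  # fixed seed keeps the file reproducible
    start = datetime(2024, 1, 1)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["device_id", "timestamp", "temperature", "humidity"])
        for i in range(n_rows):
            writer.writerow([
                f"D{i % n_devices + 1:03d}",  # cycles through D001..D005 by default
                (start + timedelta(seconds=i)).strftime("%Y.%m.%dT%H:%M:%S.000"),
                round(rng.uniform(20.0, 30.0), 2),  # assumed temperature range
                round(rng.uniform(30.0, 70.0), 2),  # assumed humidity range
            ])
    return n_rows

# Demo: write 10 rows to a temporary file
sample_path = os.path.join(tempfile.mkdtemp(), "sensor_data.csv")
rows_written = write_sample_csv(sample_path, n_rows=10)
```

The generated file uses the same four columns as the examples in this article, so its path can be passed to loadText directly.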

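2.2 Import with an Explicit Schema

// Specify column types with a schema table (columns: name, type).
// The variable is named schemaTb so that it does not shadow the schema() function.
schemaTb = table(
    `device_id`timestamp`temperature`humidity as name,
    `SYMBOL`TIMESTAMP`DOUBLE`DOUBLE as type
)

data = loadText("/data/sensor_data.csv", schema=schemaTb)

// Specify the delimiter (a comma is the default)
data = loadText("/data/sensor_data.csv", delimiter=",")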

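2.3 Chunked Import of Large Files

// Import a large CSV in chunks.
// chunkSizeMB is the chunk size in MB, as expected by textChunkDS.
def importLargeCSV(filePath, chunkSizeMB = 128) {
    dbPath = "dfs://csv_import_db"
    schemaTb = table(`device_id`timestamp`temperature`humidity as name,
                     `SYMBOL`TIMESTAMP`DOUBLE`DOUBLE as type)

    // Create the distributed table once. device_id is a SYMBOL column,
    // so the database is HASH-partitioned on it (10 buckets is an arbitrary choice).
    if (!existsDatabase(dbPath)) {
        db = database(dbPath, HASH, [SYMBOL, 10])
        model = table(1:0, `device_id`timestamp`temperature`humidity,
                      [SYMBOL, TIMESTAMP, DOUBLE, DOUBLE])
        db.createPartitionedTable(model, `sensor_data, `device_id)
    }

    // Split the file into chunk data sources and append them one by one.
    // parallel=false avoids concurrent writes to the same partition.
    ds = textChunkDS(filePath, chunkSizeMB, schema=schemaTb)
    mr(ds, append!{loadTable(dbPath, "sensor_data")},,, false)
    print("Chunks imported: " + string(ds.size()))
}

// Run the import
importLargeCSV("/data/large_sensor_data.csv")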
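
The idea behind a chunked import is to read a bounded number of rows, hand them to the writer, and repeat until the file is exhausted. The Python sketch below illustrates only that control flow outside DolphinDB; it is not DolphinDB API. The helper name `iter_csv_chunks` and the demo file are assumptions made for illustration.

```python
import csv
import os
import tempfile
from itertools import islice

def iter_csv_chunks(path, chunk_size):
    """Yield lists of at most chunk_size rows (as dicts), one chunk at a time."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)  # the first line is treated as the header
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk

# Demo: 10 data rows read in chunks of 4
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["device_id", "temperature"])
    for i in range(10):
        writer.writerow([f"D{i:03d}", 20 + i])

chunk_sizes = [len(chunk) for chunk in iter_csv_chunks(demo_path, 4)]
print(chunk_sizes)  # [4, 4, 2]
```

Because the generator holds only one chunk at a time, peak memory depends on the chunk size and not on the file size.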

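3. JSON File Import

3.1 Importing a JSON Object

// Read the whole file into one string.
// readLines takes a file handle and returns up to the requested number of lines.
f = file("/data/sensor_data.json")
jsonStr = concat(f.readLines(100000), "")
f.close()

// Parse standard JSON into a dictionary
// (fromStdJson requires a server version that provides it)
data = fromStdJson(jsonStr)

// Build a table; this assumes each key holds an array of equal length
t = table(
    data["device_id"] as device_id,
    data["timestamp"] as timestamp,
    data["temperature"] as temperature,
    data["humidity"] as humidity
)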

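3.2 Importing a JSON Array

// JSON array format
/*
[
    {"device_id": "D001", "temperature": 25.5},
    {"device_id": "D002", "temperature": 26.0}
]
*/

// Read the file through a handle and parse it
// (fromStdJson requires a server version that provides it)
f = file("/data/sensor_array.json")
jsonArr = fromStdJson(concat(f.readLines(100000), ""))
f.close()

// Each array element is assumed to be parsed as a dictionary; pull one field per column
t = table(
    each(def(d): d["device_id"], jsonArr) as device_id,
    each(def(d): d["temperature"], jsonArr) as temperature
)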
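
When a JSON array of flat records is awkward to parse on the server, one option is to flatten it to CSV beforehand and load the result with loadText. The Python sketch below shows this conversion outside DolphinDB. The function name `json_records_to_csv` is an assumption, and the sketch handles only flat objects without nesting.

```python
import csv
import io
import json

def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)
    # Collect column names in first-seen order so records with missing keys still fit
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

sample = '[{"device_id": "D001", "temperature": 25.5}, {"device_id": "D002", "temperature": 26.0}]'
csv_text = json_records_to_csv(sample)
print(csv_text)
```

Keys missing from a record become empty fields, which a CSV loader would typically read as missing values.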

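4. Parquet File Import

4.1 Importing Parquet

// loadParquet is provided by the Parquet plugin, which must be loaded first.
// loadPlugin("parquet") assumes the plugin is already installed on the server.
loadPlugin("parquet")

// Load a Parquet file into an in-memory table
data = parquet::loadParquet("/data/sensor_data.parquet")

// Preview the first 10 rows
select top 10 * from data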

4.2 Exporting Parquet

// Export a table to Parquet (requires the Parquet plugin to be loaded)
t = table(
    1..1000 as id,
    20.0 + rand(10.0, 1000) as temperature   // uniform values in [20, 30)
)

parquet::saveParquet(t, "/output/sensor_data.parquet")

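5. Excel File Import

5.1 Importing Excel

// Note: loadExcel does not appear among DolphinDB's documented built-in functions;
// it is assumed here to come from a plugin or a user-defined module.
// If it is unavailable, export the sheet to CSV and use loadText.

// Load an Excel file
data = loadExcel("/data/sensor_data.xlsx")

// Select a sheet
data = loadExcel("/data/sensor_data.xlsx", sheet="Sheet1")

// Select a cell range
data = loadExcel("/data/sensor_data.xlsx", range="A1:D1000")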

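5.2 Importing Multiple Sheets

// Load every sheet in a workbook.
// excelSheetNames and loadExcel are assumed to come from a plugin or a user-defined module.
sheets = excelSheetNames("/data/multi_sheet.xlsx")

for (sheet in sheets) {
    data = loadExcel("/data/multi_sheet.xlsx", sheet=sheet)
    print("Sheet: " + sheet + ", rows: " + string(data.rows()))
}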

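6. Data Cleaning

6.1 Handling Missing Values

// Handle missing values
def handleMissingValues(data) {
    // Strategy 1: drop rows whose device ID or temperature is missing
    dropped = select * from data
              where isValid(device_id) and isValid(temperature)

    // Strategy 2: keep all rows and fill missing measurements with the column mean
    filled = select device_id,
                    timestamp,
                    nullFill(temperature, avg(temperature)) as temperature,
                    nullFill(humidity, avg(humidity)) as humidity
             from data

    // Return the filled table; return dropped instead to apply strategy 1
    return filled
}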
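
Filling with the column mean relies on the mean being computed over the values that are present, with missing entries left out of both the sum and the count. The short Python sketch below works through that arithmetic outside DolphinDB. The function name `fill_with_mean` and the sample readings are illustrative assumptions.

```python
def fill_with_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to average; leave the column unchanged
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

temperatures = [20.0, None, 30.0, None]
filled = fill_with_mean(temperatures)
print(filled)  # [20.0, 25.0, 30.0, 25.0]
```

The mean here is (20.0 + 30.0) / 2 = 25.0, and both missing entries receive that value.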

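6.2 Handling Outliers

// Handle outliers
def handleOutliers(data) {
    // Strategy 1: drop rows outside the valid ranges
    filtered = select * from data
               where temperature >= -40 and temperature <= 100
               and humidity >= 0 and humidity <= 100

    // Strategy 2: keep all rows and replace out-of-range temperatures
    // with the mean of the in-range values
    validMean = exec avg(temperature) from data
                where temperature >= -40 and temperature <= 100
    replaced = select device_id,
                      timestamp,
                      iif(temperature < -40 or temperature > 100,
                          validMean, temperature) as temperature,
                      humidity
               from data

    // Return the replaced table; return filtered instead to apply strategy 1
    return replaced
}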
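
When out-of-range readings are replaced by a mean, it matters which rows the mean is computed over: a mean that includes the outliers is itself distorted. The Python sketch below compares the two on a tiny example outside DolphinDB. The function name `replace_outliers` and the sample readings are illustrative assumptions.

```python
def replace_outliers(values, low, high):
    """Replace values outside [low, high] with the mean of the in-range values."""
    in_range = [v for v in values if low <= v <= high]
    if not in_range:
        raise ValueError("no in-range values to average")
    mean = sum(in_range) / len(in_range)
    return [v if low <= v <= high else mean for v in values]

readings = [25.0, 26.0, 500.0]
naive_mean = sum(readings) / len(readings)  # pulled up by the outlier
cleaned = replace_outliers(readings, -40, 100)
print(round(naive_mean, 2))  # 183.67
print(cleaned)               # [25.0, 26.0, 25.5]
```

Substituting the naive mean of 183.67 would replace one out-of-range value with another, whereas the in-range mean of 25.5 stays within the valid interval.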

6.3 Data Type Conversion

// Cast columns to their target types
def convertTypes(data) {
    return select
        device_id,
        timestamp(timestamp) as timestamp,
        double(temperature) as temperature,
        double(humidity) as humidity
    from data
}
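
Values that cannot be parsed as numbers tend to surface only after the cast, so it can help to locate them in the source file beforehand. The Python sketch below scans CSV text for such values outside DolphinDB. The function name `find_unparseable` and the sample data are assumptions; empty fields are treated as missing values and skipped.

```python
import csv
import io

def find_unparseable(csv_text, numeric_columns):
    """Return (line_number, column, raw_value) for values that are not valid floats."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Data starts on line 2 because line 1 is the header
    for line_number, row in enumerate(reader, start=2):
        for column in numeric_columns:
            raw = row[column]
            if raw == "":
                continue  # empty means missing, which is handled separately
            try:
                float(raw)
            except ValueError:
                problems.append((line_number, column, raw))
    return problems

sample = "device_id,temperature,humidity\nD001,25.5,40\nD002,N/A,41\nD003,,abc\n"
issues = find_unparseable(sample, ["temperature", "humidity"])
print(issues)  # [(3, 'temperature', 'N/A'), (4, 'humidity', 'abc')]
```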

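7. Bulk Import Optimization

7.1 Parallel Import

// Load several files in parallel and return the total row count
def parallelImport(filePaths) {
    // peach runs the function on each path in parallel
    // and returns a vector of row counts
    counts = peach(def(filePath) {
        data = loadText(filePath)
        return data.rows()
    }, filePaths)

    return sum(counts)
}

// Run. files() returns a table of directory entries, and "%" is the wildcard in its pattern.
dataDir = "/data"
csvPaths = dataDir + "/" + (exec filename from files(dataDir, "%.csv"))
totalRows = parallelImport(csvPaths)
print("Total rows imported: " + string(totalRows))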

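7.2 Memory Optimization

// Memory-friendly import: process the file chunk by chunk

// Handle one chunk and return its row count
def processChunk(chunk) {
    // Clean the chunk and write it to the target table here
    return chunk.rows()
}

def batchImport(filePath, chunkSizeMB = 128) {
    // textChunkDS splits the file into data sources of about chunkSizeMB each,
    // so the whole file never has to be held in memory at once
    ds = textChunkDS(filePath, chunkSizeMB)

    // Apply processChunk to the chunks one at a time and add up the row counts
    return mr(ds, processChunk, add,, false)
}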

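8. Practical Example

8.1 Bulk Import of Historical Data

// ========== Bulk import of historical data ==========

// 1. Create the distributed table.
// device_id is a SYMBOL column, so the database is HASH-partitioned on it
// (20 buckets is an arbitrary choice; tune it to the data volume).
db = database("dfs://history_db", HASH, [SYMBOL, 20])
model = table(1:0,
    `device_id`timestamp`temperature`humidity`pressure,
    [SYMBOL, TIMESTAMP, DOUBLE, DOUBLE, DOUBLE])
db.createPartitionedTable(model, `sensor_history, `device_id)

// 2. Import function
def importHistoryData(filePath) {
    print("Importing: " + filePath)

    // Read the CSV
    data = loadText(filePath)

    // Clean: cast columns to the target types and drop rows without a device ID
    cleaned = select device_id,
                    timestamp(timestamp) as timestamp,
                    double(temperature) as temperature,
                    double(humidity) as humidity,
                    double(pressure) as pressure
             from data
             where isValid(device_id)

    // Write to the distributed table
    loadTable("dfs://history_db", "sensor_history").append!(cleaned)

    print("Imported: " + string(cleaned.rows()) + " rows")
    return cleaned.rows()
}

// 3. Import every CSV file in the directory.
// files() returns a table of directory entries, and "%" is the wildcard in its pattern.
dataDir = "/data/history"
csvNames = exec filename from files(dataDir, "%.csv")
totalRows = 0

for (csvName in csvNames) {
    totalRows += importHistoryData(dataDir + "/" + csvName)
}

print("Total rows imported: " + string(totalRows))

// 4. Verify
t = loadTable("dfs://history_db", "sensor_history")
select count(*) as total_rows from t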
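
The row count returned by the verification query is most useful when compared against an independently computed expectation. The Python sketch below counts, outside DolphinDB, the data rows that have a non-empty device_id across a set of CSV files, mirroring the filter applied during the import. The function name `count_importable_rows` and the demo files are illustrative assumptions.

```python
import csv
import glob
import os
import tempfile

def count_importable_rows(pattern, key_column="device_id"):
    """Count data rows with a non-empty key column across all files matching pattern."""
    total = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if row.get(key_column):  # skips rows the import would filter out
                    total += 1
    return total

# Demo with two small files in a temporary directory
demo_dir = tempfile.mkdtemp()
for name, ids in [("a.csv", ["D001", "", "D002"]), ("b.csv", ["D003"])]:
    with open(os.path.join(demo_dir, name), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["device_id", "temperature"])
        for device_id in ids:
            writer.writerow([device_id, 25.0])

expected_rows = count_importable_rows(os.path.join(demo_dir, "*.csv"))
print(expected_rows)  # 3
```

If the count from the source files and the count in the distributed table differ, the gap points to rows lost or duplicated during the import.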

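9. Summary

This article covered file data ingestion in DolphinDB:

  1. File formats: CSV, JSON, Parquet, Excel
  2. CSV import: basic import, explicit schema, chunked import
  3. JSON import: objects, arrays
  4. Parquet: import and export
  5. Excel import: single sheet, multiple sheets
  6. Data cleaning: missing values, outliers, type conversion
  7. Performance optimization: parallel import, memory optimization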

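Questions for Reflection

  1. How do you choose a suitable file format?
  2. How do you handle the import of large files?
  3. How do you ensure the quality of imported data?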
