Data formats
Text formats: CSV, JSON, TXT
Binary formats: Parquet, ORC, HDF5
Databases: MySQL, PostgreSQL, SQL Server
1.2 Import and Export Functions
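Function                 Description
loadText                 Load a text file into an in-memory table
saveText                 Save a table to a text file
loadTextEx               Load a text file into a distributed (DFS) table
saveText (on a query)    Export from a distributed table: query it, then save the result
parquet::loadParquet     Load a Parquet file (parquet plugin)
parquet::saveParquet     Save a table as a Parquet file (parquet plugin)
fromStdJson              Parse a standard JSON string
toStdJson                Serialize an object to a standard JSON string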
II. CSV File Operations
2.1 Importing CSV
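// Basic import: column names and types are inferred from the file
t = loadText("/data/sensor_data.csv")
// Specify the delimiter
t = loadText("/data/sensor_data.csv", delimiter=',')
// Specify the table schema: the schema table needs string columns "name" and "type"
schema = table(
    `device_id`timestamp`temperature`humidity as name,
    `INT`DATETIME`DOUBLE`DOUBLE as type
)
t = loadText("/data/sensor_data.csv", schema=schema)
// Skip the first line of the file (here, the header row)
t = loadText("/data/sensor_data.csv", skipRows=1)
// Specify the date format through the schema's optional "format" column
schema = table(
    `device_id`timestamp`temperature`humidity as name,
    `INT`DATETIME`DOUBLE`DOUBLE as type,
    ["", "yyyy-MM-dd HH:mm:ss", "", ""] as format
)
t = loadText("/data/sensor_data.csv", schema=schema)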
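When the column types are not known in advance, one option is to let DolphinDB infer a schema first and then adjust only the columns that need it. The sketch below assumes the built-in `extractTextSchema` function and reuses the `/data/sensor_data.csv` path from the examples above; forcing `device_id` to INT is purely an illustrative choice.

```dolphindb
// Infer the schema from the file; the result is a table with "name" and "type" columns
schema = extractTextSchema("/data/sensor_data.csv")
// Override a single column type (illustrative choice, not taken from the original article)
update schema set type = "INT" where name = "device_id"
// Load the file with the adjusted schema
t = loadText("/data/sensor_data.csv", schema=schema)
```

Starting from the inferred schema avoids typing out every column by hand and keeps the override limited to the columns where inference picks an unsuitable type.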
2.2 Exporting CSV
// Create test data
t = table(
    1..100 as device_id,
    2024.01.01 + 0..99 as date,
    20.0 + rand(10.0, 100) as temperature,
    40.0 + rand(20.0, 100) as humidity
)
// Export to CSV
saveText(t, "/output/sensor_data.csv")
// Specify the delimiter
saveText(t, "/output/sensor_data.csv", delimiter=',')
// Append to an existing file
saveText(t, "/output/sensor_data.csv", append=true)
// Omit the column names
saveText(t, "/output/sensor_data.csv", header=false)
2.3 Importing Large Files
// Load a large file in parallel
def importLargeCSV(filePath, batchSize=100000) {
    result = table(1:0, `device_id`timestamp`temperature`humidity,
        [INT, TIMESTAMP, DOUBLE, DOUBLE])
    // ploadText loads the file in parallel; batchSize is not used by this version
    data = ploadText(filePath)
    result.append!(data)
    return result
}
// Import into a distributed table; the partitioning column comes before the file name
db = database("dfs://iot_data", VALUE, 2024.01.01..2024.12.31)
loadTextEx(db, "sensor_data", `timestamp, "/data/sensor_data.csv")
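For files too large to load in one pass, a chunked import is one possible approach. The sketch below assumes the built-in `textChunkDS` and `mr` functions and that the `sensor_data` table already exists in `dfs://iot_data`; the 512 MB chunk size is an arbitrary illustrative value.

```dolphindb
// Split the file into data sources of roughly 512 MB each (chunkSize is given in MB)
ds = textChunkDS("/data/sensor_data.csv", 512)
// Append each chunk to the existing distributed table, one chunk at a time.
// parallel=false so that chunks touching the same partition are not written concurrently.
mr(ds, append!{loadTable("dfs://iot_data", "sensor_data")}, , , false)
```

Only one chunk needs to be held in memory at a time, which is the main reason to prefer this over loading the whole file and appending it in a single step.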
2.4 CSV Import in Practice
// Complete CSV import workflow
def importSensorData(csvPath, dbPath) {
    // 1. Define the schema, including the timestamp format
    schema = table(
        `device_id`timestamp`temperature`humidity`pressure`status as name,
        `INT`DATETIME`DOUBLE`DOUBLE`DOUBLE`SYMBOL as type,
        ["", "yyyy-MM-dd HH:mm:ss", "", "", "", ""] as format
    )
    // 2. Load the data
    t = loadText(csvPath, schema=schema)
    // 3. Clean the data: keep only readings within plausible ranges
    t = select * from t
        where temperature between -40 and 100
        and humidity between 0 and 100
    // 4. Append to the existing distributed table
    db = database(dbPath)
    loadTable(db, "sensor_data").append!(t)
    return t.rows()
}
// Run the import
importSensorData("/data/sensors.csv", "dfs://iot_data")
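The workflow above loads the whole file into memory before cleaning it. As an alternative sketch, the `transform` parameter of `loadTextEx` can apply the same filter to each chunk as it is loaded. The helper name `cleanSensorRows` is hypothetical, and the example assumes that the `sensor_data` table already exists and that the column types in the file are inferred correctly.

```dolphindb
// Hypothetical helper: keep only rows within plausible sensor ranges
def cleanSensorRows(t) {
    return select * from t where temperature between -40 and 100 and humidity between 0 and 100
}
db = database("dfs://iot_data")
// transform is applied to the loaded data before it is written to the distributed table
loadTextEx(dbHandle=db, tableName="sensor_data", partitionColumns=`timestamp,
    filename="/data/sensors.csv", transform=cleanSensorRows)
```

This keeps the cleaning rule in one small function and avoids materializing an intermediate in-memory copy of the full file.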
III. JSON File Operations
3.1 Importing JSON
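// Load a JSON file: read its text, then parse it as standard JSON
f = file("/data/sensor_data.json")
jsonText = concat(f.readLines(), "")
f.close()
data = fromStdJson(jsonText)
// Parse a JSON string
jsonStr = '{"device_id":1,"temperature":25.5,"humidity":50.0}'
data = fromStdJson(jsonStr)
// Parse a JSON array
jsonArr = '[{"id":1,"value":10},{"id":2,"value":20}]'
items = fromStdJson(jsonArr)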
3.2 Exporting JSON
// Create test data
t = table(
    1..5 as device_id,
    `A`B`C`D`E as name,
    10.0 20.0 30.0 40.0 50.0 as value
)
// Export to JSON: serialize the table to a standard JSON string, then write it to a file
saveTextFile(toStdJson(t), "/output/sensor_data.json")
// Clean data during import
def importAndClean(filePath) {
    // Load
    t = loadText(filePath)
    // Clean: drop missing or out-of-range readings
    t = select * from t
        where temperature is not null
        and temperature between -40 and 100
        and humidity between 0 and 100
    // Convert the timestamp strings to temporal values
    t[`timestamp] = temporalParse(t.timestamp, "yyyy-MM-dd HH:mm:ss")
    return t
}