Kettle (Pentaho Data Integration, PDI) can be integrated with applications in a number of ways. Below are 7 mainstream approaches with concrete implementation examples:
1. Command-line invocation (the most basic)
```bash
# Run a transformation
./pan.sh -file=/etl/order_import.ktr -param:INPUT_FILE=/data/orders.csv
# Run a job
./kitchen.sh -file=/etl/daily_sync.kjb -param:DATE=$(date +%Y-%m-%d)
```
Typical use cases: scheduled scripts, CI/CD pipelines. A Java wrapper for launching these scripts from an application is sketched below.
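When an application needs to launch these scripts itself rather than relying on cron, a thin ProcessBuilder wrapper is usually enough. A minimal sketch, assuming PDI is installed under /opt/pdi and the job takes a DATE parameter (both are placeholders):

```java
import java.io.IOException;

public class KitchenLauncher {

    /** Runs a Kettle job via kitchen.sh and reports success based on the exit code. */
    public static boolean runJob(String kjbPath, String date) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(
                "/opt/pdi/kitchen.sh",          // assumed PDI install path
                "-file=" + kjbPath,
                "-param:DATE=" + date)
            .inheritIO()                        // forward Kettle's console output to the application
            .start();
        return process.waitFor() == 0;          // Kitchen exits non-zero when the job fails
    }
}
```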
2. Java API integration (recommended)
1. Add the Maven dependencies
```xml
<!-- Kettle artifacts are published to the Pentaho public Maven repository, not Maven Central;
     use the version that matches your PDI installation -->
<dependency>
    <groupId>pentaho-kettle</groupId>
    <artifactId>kettle-core</artifactId>
    <version>9.4.0.0-365</version>
</dependency>
<dependency>
    <groupId>pentaho-kettle</groupId>
    <artifactId>kettle-engine</artifactId>
    <version>9.4.0.0-365</version>
</dependency>
```
2. Core code example
```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.exception.KettleException;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class KettleExecutor {
    public void runTransform(String ktrPath) throws KettleException {
        KettleEnvironment.init();                      // initialize the Kettle engine (once per JVM)
        TransMeta transMeta = new TransMeta(ktrPath);  // parse the .ktr definition
        Trans trans = new Trans(transMeta);
        // Pass a variable into the transformation
        trans.setVariable("CUSTOMER_ID", "10086");
        trans.execute(null);                           // start execution (asynchronous)
        trans.waitUntilFinished();                     // block until it completes
        if (trans.getErrors() > 0) {
            throw new RuntimeException("ETL execution failed");
        }
    }
}
```
Advantages: full control over the execution process, with access to live status and metrics. Jobs can be driven the same way, as sketched below.
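The snippet above runs a transformation (.ktr); jobs (.kjb) go through JobMeta/Job instead. A minimal sketch, assuming KettleEnvironment.init() has already been called and the job reads a DATE variable (the path and variable name are placeholders):

```java
import org.pentaho.di.core.exception.KettleException;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

public class KettleJobExecutor {
    public void runJob(String kjbPath) throws KettleException {
        JobMeta jobMeta = new JobMeta(kjbPath, null);  // load from file, no repository
        Job job = new Job(null, jobMeta);
        job.setVariable("DATE", "2024-01-01");         // placeholder parameter
        job.start();                                   // runs in its own thread
        job.waitUntilFinished();
        if (job.getErrors() > 0) {
            throw new RuntimeException("Job execution failed");
        }
    }
}
```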
3. REST API calls (via the Carte service)
1. Start the Carte service
```bash
./carte.sh 0.0.0.0 8080
```
2. Example call (Python)
```python
import requests

response = requests.post(
    "http://etl-server:8080/kettle/executeTrans/",
    params={"trans": "/jobs/data_clean.ktr", "name": "nightly_clean"},
    auth=("admin", "password")
)
print(response.text)  # Carte responds with XML, not JSON
```
Typical use cases: microservice architectures, cross-system integration. A Java client for the same Carte API is sketched below.
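From a Java service, the same Carte HTTP API can be used to check what the server is doing. A minimal sketch with the JDK 11 HttpClient, polling Carte's /kettle/status/ endpoint (host, port, and credentials mirror the example above):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class CarteStatusClient {
    public static void main(String[] args) throws Exception {
        String auth = Base64.getEncoder().encodeToString("admin:password".getBytes());
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://etl-server:8080/kettle/status/?xml=Y"))  // server-wide status as XML
            .header("Authorization", "Basic " + auth)
            .GET()
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // lists running/finished transformations and jobs
    }
}
```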
4. Database-driven integration
Store job requests in MySQL and trigger from there
```sql
-- Job trigger table
CREATE TABLE etl_triggers (
    job_name VARCHAR(100),
    params   JSON,
    status   VARCHAR(20) DEFAULT 'PENDING'
);

-- Insert a pending trigger once a day via the event scheduler
-- (requires event_scheduler = ON)
CREATE EVENT run_etl_job
ON SCHEDULE EVERY 1 DAY
DO
    INSERT INTO etl_triggers (job_name, params)
    VALUES ('daily_report', JSON_OBJECT('date', CURDATE()));
```
Java polling code
```java
// Poll the trigger table and launch any pending jobs
@Scheduled(fixedRate = 60000)
public void checkEtlTriggers() {
    List<TriggerRecord> triggers = jdbcTemplate.query(
        "SELECT * FROM etl_triggers WHERE status = 'PENDING'",
        (rs, rowNum) -> new TriggerRecord(rs.getString("job_name"), rs.getString("params"))
    );
    triggers.forEach(trigger -> {
        try {
            new ProcessBuilder("kitchen.sh", "-file=/jobs/" + trigger.jobName + ".kjb")
                .inheritIO()
                .start();
            // Mark the row as running so the next poll does not launch it again
            jdbcTemplate.update(
                "UPDATE etl_triggers SET status = 'RUNNING' WHERE job_name = ?", trigger.jobName);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    });
}
```
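TriggerRecord is not a Kettle class; it is just a hypothetical holder for the two columns read above:

```java
// Simple value object for one row of etl_triggers (illustrative only)
public class TriggerRecord {
    final String jobName;
    final String params;

    TriggerRecord(String jobName, String params) {
        this.jobName = jobName;
        this.params = params;
    }
}
```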
5. Message-queue integration (Kafka example)
```java
@KafkaListener(topics = "etl-events")
public void handleEtlEvent(ConsumerRecord<String, String> record) throws KettleException {
    JSONObject params = new JSONObject(record.value());
    TransMeta transMeta = new TransMeta("/jobs/" + params.getString("job") + ".ktr");
    Trans trans = new Trans(transMeta);
    // Map every field of the event payload onto a transformation variable
    params.keySet().forEach(key -> trans.setVariable(key, params.getString(key)));
    trans.execute(null);
    trans.waitUntilFinished();
    if (trans.getErrors() > 0) {
        throw new IllegalStateException("ETL failed for event: " + record.value());
    }
}
```
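On the producing side, any service only needs to publish a JSON payload whose job field names the .ktr file. A minimal sketch with Spring's KafkaTemplate (the topic name and payload shape follow the listener above; EtlEventPublisher is a hypothetical class):

```java
@Service
public class EtlEventPublisher {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public EtlEventPublisher(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    /** Ask the ETL consumer to run /jobs/<job>.ktr with the given date variable. */
    public void requestEtl(String job, String date) {
        String payload = "{\"job\":\"" + job + "\",\"DATE\":\"" + date + "\"}";
        kafkaTemplate.send("etl-events", job, payload);  // key = job name, value = JSON parameters
    }
}
```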
6. Deep Spring Boot integration
1. Configuration class
```java
@Configuration
public class KettleConfig {

    // Initialize the Kettle engine once, when the application context starts
    @PostConstruct
    public void initKettleEnvironment() throws KettleException {
        KettleEnvironment.init();
    }
}
```
2. Service-layer wrapper
```java
@Service
public class OrderEtlService {

    @Async
    public CompletableFuture<Void> syncOrders(LocalDate date) {
        try {
            Trans trans = new Trans(new TransMeta("/jobs/order_sync.ktr"));
            trans.setVariable("EXPORT_DATE", date.toString());
            trans.execute(null);
            trans.waitUntilFinished();  // keep the async thread alive until the ETL ends
            if (trans.getErrors() > 0) {
                return CompletableFuture.failedFuture(new IllegalStateException("Order sync failed"));
            }
            return CompletableFuture.completedFuture(null);
        } catch (KettleException e) {
            return CompletableFuture.failedFuture(e);
        }
    }
}
```
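A caller then just triggers the async method and returns immediately. A hedged sketch of a REST controller doing so (note that @Async only takes effect when @EnableAsync is declared on a configuration class; EtlController and its paths are placeholders):

```java
@RestController
@RequestMapping("/etl")
public class EtlController {

    private final OrderEtlService orderEtlService;

    public EtlController(OrderEtlService orderEtlService) {
        this.orderEtlService = orderEtlService;
    }

    @PostMapping("/orders/{date}")
    public ResponseEntity<String> triggerOrderSync(@PathVariable String date) {
        orderEtlService.syncOrders(LocalDate.parse(date));  // fire-and-forget; outcome handled asynchronously
        return ResponseEntity.accepted().body("Order sync started for " + date);
    }
}
```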
7. Cloud-native approach (Kubernetes)
```yaml
# k8s-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-etl
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: kettle-runner
            image: pentaho/pdi-ce:9.4.0
            command: ["/bin/sh", "-c"]
            args:
            - "kitchen.sh -file=/jobs/daily_sync.kjb -param:DATE=$(date +%Y-%m-%d)"
            volumeMounts:
            - name: etl-jobs
              mountPath: /jobs
          volumes:
          - name: etl-jobs
            configMap:
              name: etl-config
```
Best-practice recommendations
- Parameter management:
  - Store environment-specific settings in `.properties` files
  - Inject sensitive values via Vault/KeyVault
- Performance tuning:
  ```bash
  # Adjust the JVM settings used by PDI
  export PENTAHO_DI_JAVA_OPTIONS="-Xms2g -Xmx4g -XX:MaxMetaspaceSize=512m"
  ```
- Error handling:
  - Use Abort and Error Handling steps in your transformations
  - Write rejected rows to a dedicated error table
- Monitoring (a programmatic variant is sketched after this list):
  ```bash
  # Collect a log file per run
  kitchen.sh -file=job.kjb -logfile=/logs/etl_${TIMESTAMP}.log
  ```
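When the application embeds the Java API (section 2), the same log text can also be captured in process. A hedged sketch using KettleLogStore, assuming `trans` is a Trans instance that has finished running (the exact buffer API may differ slightly between PDI versions):

```java
import org.pentaho.di.core.logging.KettleLogStore;
import org.pentaho.di.trans.Trans;

public class EtlLogCapture {

    /** Returns the log lines produced by one transformation run. */
    public static String captureLog(Trans trans) {
        String logChannelId = trans.getLogChannelId();
        // KettleLogStore keeps an in-memory buffer of log lines per log channel
        return KettleLogStore.getAppender()
            .getBuffer(logChannelId, true)
            .toString();
    }
}
```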
Debugging tips
- Use the `-level=Debug` option during local development
- Use Step Metrics to locate performance bottlenecks
- Use a Mail step to send notifications on failure
Pick whichever approach best fits your technology stack. If you need a complete code example for a specific scenario, please describe your:
- Application architecture (monolith/microservices)
- Scheduling requirements (real-time/scheduled)
- Data volume