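I. Kettle Overview
1. What is Kettle
Kettle is an open-source ETL tool written in pure Java. It runs on Windows, Linux, and Unix, needs no installation (unpack and run), and extracts data efficiently and stably.
2. How Kettle projects are stored
(1) As XML files
(2) In a repository (a database repository or a file repository)
3. Kettle's two design types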
data:image/s3,"s3://crabby-images/569b4/569b4556724a4aacf703e73b837c9fb6aca77602" alt=""
4. Kettle components
data:image/s3,"s3://crabby-images/2cf53/2cf532355667c70a3dc5b13ce266fa91ee1b5579" alt=""
5. Kettle features
data:image/s3,"s3://crabby-images/a4f4b/a4f4b488173047dd24711752efe4a003799c04bc" alt=""
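II. Installing, Deploying, and Using Kettle
1. Kettle download locations
Official site: https://community.hitachivantara.com/docs/DOC-1009855
Download: https://sourceforge.net/projects/pentaho/files/Data%20Integration/
Course materials:
Link: https://pan.baidu.com/s/149fBww3eiD7vLN2p2egCxg
Extraction code: gyhr
2. Installing and using on Windows
(1) Overview
In practice, Kettle jobs and transformations are usually developed in a local environment. They can then be run locally or on a remote machine.
(2) Installation steps
Install the JDK.
Download the Kettle archive. Kettle needs no installer, so unpack it to any local path.
Double-click Spoon.bat to start the graphical tool, which is then ready to use.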
3. Case 1: sync the data in stu1 to stu2 by id; if stu2 already has a row with the same id, update that row
(1) Create two tables in MySQL
mysql> create database kettle;
Query OK, 1 row affected (0.00 sec)
mysql> use kettle;
Database changed
mysql> create table stu1(id int,name varchar(20),age int);
Query OK, 0 rows affected (0.01 sec)
mysql> create table stu2(id int,name varchar(20));
Query OK, 0 rows affected (0.00 sec)
(2) Insert some data into both tables
mysql> insert into stu1 values(1001,'zhangsan',20),(1002,'lisi',18), (1003,'wangwu',23);
Query OK, 3 rows affected (0.05 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> insert into stu2 values(1001,'wukong');
Query OK, 1 row affected (0.00 sec)
(3) Copy pdi-ce-8.2.0.0-342.zip to a directory on the Windows machine and unpack it
data:image/s3,"s3://crabby-images/1b89f/1b89fe2f11eb9455b8d2b1ac25da20d7025f4380" alt=""
data:image/s3,"s3://crabby-images/2400d/2400dc283b97200f3fdefde70f174af17c95f41c" alt=""
In Kettle, create a new transformation, choose Input ---> Table input, and double-click the Table input step
data:image/s3,"s3://crabby-images/cda17/cda1780e1b4bc2177365fbf35de73216780f0020" alt=""
data:image/s3,"s3://crabby-images/730cc/730ccaa207da3564ecae63024205f37caff93334" alt=""
In the database connection section, click New
data:image/s3,"s3://crabby-images/d7b0b/d7b0b9ce2f42778dc0f7baddc38c95f352b95ae0" alt=""
The error above means that mysql-connector-java-5.1.27-bin.jar is missing.
Fix:
Put mysql-connector-java-5.1.27-bin.jar into the data-integration\lib folder,
then restart Spoon and repeat the steps.
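Before restarting Spoon, it can help to confirm that the driver jar really sits in the lib folder. The following is a minimal sketch, not part of Kettle itself: the helper name `find_mysql_driver` and the example path in the comment are made up for illustration, and the file-name pattern assumes the 5.x connector naming used above.

```python
from pathlib import Path


def find_mysql_driver(lib_dir):
    """Return the names of MySQL connector jars found directly under lib_dir."""
    lib = Path(lib_dir)
    if not lib.is_dir():
        return []
    # Pattern assumes the "mysql-connector-java-<version>-bin.jar" naming.
    return sorted(p.name for p in lib.glob("mysql-connector-java-*.jar"))


# Example (hypothetical install path):
# find_mysql_driver(r"D:\kettle\data-integration\lib")
```

An empty list means the jar is still missing from that folder, which would match the connection error shown in the screenshot.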
data:image/s3,"s3://crabby-images/88b1b/88b1b5403477c601157dac200e32c89496acdf70" alt=""
data:image/s3,"s3://crabby-images/ee74c/ee74c12ab190331895f6bb8e96f590d696641a9e" alt=""
data:image/s3,"s3://crabby-images/7f6aa/7f6aafa1d01413433f1784a5c9679911c97f2fec" alt=""
The preview above shows that reading from stu1 works. Next, the rows read from stu1 need to be synced to stu2 through an output step.
data:image/s3,"s3://crabby-images/e197a/e197a0351a5e4976eedd5c8a0b06a6648aea96e2" alt=""
data:image/s3,"s3://crabby-images/565df/565df475e641851c3fa7506e195d02df7d99962a" alt=""
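Note: the hop (the line dragged between two steps) must be dark gray for the connection to be active; a light gray hop means the steps are not connected.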
data:image/s3,"s3://crabby-images/c0433/c0433948c8bd598cfd344ac31d40149c940de92d" alt=""
data:image/s3,"s3://crabby-images/32333/323330a4040eadad68ef66b13a7c4199e805f986" alt=""
data:image/s3,"s3://crabby-images/355b3/355b360f96a2b0e9b78590a1afa79492a4e80909" alt=""
data:image/s3,"s3://crabby-images/75df2/75df2402a245cbf6dfc4dd19ec2ad5cd8d29537b" alt=""
data:image/s3,"s3://crabby-images/73487/73487c54be51488f684503b992987a0dccc5fda3" alt=""
Save the transformation before running it
data:image/s3,"s3://crabby-images/dfc0d/dfc0d3e9d3e2d34c64ed31d70028069138f1d465" alt=""
Then check the data in stu2 from MySQL. Note: in this run the update flag of every field in the step was set to N.
mysql> select * from stu2;
+------+--------+
| id | name |
+------+--------+
| 1001 | wukong |
| 1002 | lisi |
| 1003 | wangwu |
+------+--------+
3 rows in set (0.00 sec)
If the setting is changed as follows:
data:image/s3,"s3://crabby-images/8e381/8e3814aaf790ab14fc0c10ab0d92c18de0bfbdc2" alt=""
// the query result now changes
mysql> select * from stu2;
+------+----------+
| id | name |
+------+----------+
| 1001 | zhangsan |
| 1002 | lisi |
| 1003 | wangwu |
+------+----------+
3 rows in set (0.00 sec)
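The two query results above differ only in row 1001, which suggests how the step behaves: new ids are inserted, and existing ids are overwritten only when updating is switched on. The sketch below reproduces that behavior with Python's built-in sqlite3 instead of MySQL and Kettle. It is an illustration of the semantics under that assumption, not Kettle's implementation; the function names are invented.

```python
import sqlite3


def insert_update(conn, update_existing):
    """Copy (id, name) from stu1 into stu2, matching rows on id.

    Ids missing from stu2 are inserted. Ids already present are
    overwritten only when update_existing is True.
    """
    cur = conn.cursor()
    for row_id, name in cur.execute("SELECT id, name FROM stu1").fetchall():
        found = cur.execute("SELECT 1 FROM stu2 WHERE id = ?", (row_id,)).fetchone()
        if found is None:
            cur.execute("INSERT INTO stu2 (id, name) VALUES (?, ?)", (row_id, name))
        elif update_existing:
            cur.execute("UPDATE stu2 SET name = ? WHERE id = ?", (name, row_id))
    conn.commit()


def demo(update_existing):
    """Rebuild the sample tables in memory, run the sync, return stu2."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE stu1(id INT, name VARCHAR(20), age INT);
        CREATE TABLE stu2(id INT, name VARCHAR(20));
        INSERT INTO stu1 VALUES (1001,'zhangsan',20),(1002,'lisi',18),(1003,'wangwu',23);
        INSERT INTO stu2 VALUES (1001,'wukong');
    """)
    insert_update(conn, update_existing)
    return conn.execute("SELECT id, name FROM stu2 ORDER BY id").fetchall()


print(demo(update_existing=False))  # row 1001 keeps 'wukong'
print(demo(update_existing=True))   # row 1001 becomes 'zhangsan'
```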
4. Case 2: use a job to run the transformation above, and additionally insert one row into table stu2
(1) Create a new job
data:image/s3,"s3://crabby-images/52d25/52d250650e9bcc6918a58bbb843dd16d2a39036d" alt=""
(2) Drag in the components as shown
data:image/s3,"s3://crabby-images/5caf2/5caf2c554fc80abcf3e176862917cfa6f2c11eb0" alt=""
(3) Double-click Start to edit it
data:image/s3,"s3://crabby-images/db316/db31696c64219d5a32377fcee3092af978437682" alt=""
(4) Double-click the Transformation entry and select the file saved in Case 1
data:image/s3,"s3://crabby-images/eb50c/eb50cc7ee7e174a65ff471471e6f9945daff81e6" alt=""
(5) Double-click the SQL entry and edit the SQL statement. First insert one row into the kettle database in MySQL:
mysql> insert into stu1 values(1004,'stu1',22);
Query OK, 1 row affected (0.01 sec)
data:image/s3,"s3://crabby-images/00ba8/00ba83c00c719abb5e53d881d61890764206ed8c" alt=""
data:image/s3,"s3://crabby-images/e280f/e280fd5faa08eda10c7250a22e0d001155dd7020" alt=""
data:image/s3,"s3://crabby-images/76807/76807f0123d218002478a20b6406a41bac2dcb1c" alt=""
Then add a Dummy entry, as shown:
data:image/s3,"s3://crabby-images/eceb9/eceb9ccc709e6cc941e27effd85bc5356e37046b" alt=""
Then save the job; otherwise the changes do not take effect.
Now the job can be run.
data:image/s3,"s3://crabby-images/fd865/fd8659b7a38e6fad70f34c7fb021e1ccdd9dd958" alt=""
Check the MySQL database again; the new rows are there:
mysql> select * from stu2;
+------+----------+
| id | name |
+------+----------+
| 1001 | zhangsan |
| 1002 | lisi |
| 1003 | wangwu |
| 1004 | stu1 |
| 1005 | kettle |
+------+----------+
5 rows in set (0.00 sec)
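5. Case 3: output data from Hive tables to HDFS
(1) Because this case reads and writes Hive and HBase, some configuration files must be changed first.
In plugin.properties under data-integration\plugins\pentaho-big-data-plugin in the unpacked directory, set active.hadoop.configuration=hdp26, then copy the following configuration files into data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp26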
data:image/s3,"s3://crabby-images/dcee3/dcee39e5a2838e5c9b4ce0664b37e1a8511eafd0" alt=""
(2) Start all HDFS, YARN, and HBase processes on the cluster, then start the hiveserver2 service
[root@hadoop1 ~]# /opt/module/hadoop-2.7.2/sbin/start-all.sh
Start ZooKeeper before starting HBase
[root@hadoop1 ~]# /opt/module/hbase-1.3.1/bin/start-hbase.sh
[root@hadoop1 ~]# /opt/module/hive/bin/hiveserver2
(3) Enter beeline and check that port 10000 is open
[root@hadoop1 ~]# /opt/module/hive/bin/beeline
Beeline version 1.2.1 by Apache Hive
beeline> !connect jdbc:hive2://hadoop1.x:10000 (press Enter)
Connecting to jdbc:hive2://hadoop1.x:10000
Enter username for jdbc:hive2://hadoop1.x:10000: root (enter root)
Enter password for jdbc:hive2://hadoop1.x:10000: (just press Enter)
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://hadoop1.x:10000> (reaching this prompt means port 10000 is open)
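When beeline is not at hand, for example on the Windows machine that runs Spoon, a plain TCP probe gives a rough equivalent of the port check. This is a sketch using only the Python standard library. The helper name `port_is_open` is invented, and a successful connection only shows that something is listening on the port, not that HiveServer2 accepts the login.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


# Example, using the host and port from the session above:
# port_is_open("hadoop1.x", 10000)
```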
(4) Create two tables, dept and emp
CREATE TABLE dept(deptno int, dname string,loc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
CREATE TABLE emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm int,
deptno int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
(5) Insert data
insert into dept values(10,'accounting','NEW YORK'),(20,'RESEARCH','DALLAS'),(30,'SALES','CHICAGO'),(40,'OPERATIONS','BOSTON');
insert into emp values(7369,'SMITH','CLERK',7902,'1980-12-17',800,NULL,20),(7499,'ALLEN','SALESMAN',7698,'1980-12-17',1600,300,30),(7521,'WARD','SALESMAN',7698,'1980-12-17',1250,500,30),(7566,'JONES','MANAGER',7839,'1980-12-17',2975,NULL,20);
(6) Build the flow as shown below
data:image/s3,"s3://crabby-images/e731f/e731fb4ca43adb72269c572b32616b4f3a8607fb" alt=""
(7) Configure the Table input steps and connect to Hive
data:image/s3,"s3://crabby-images/e89b3/e89b30ffe18f93ac9f796bcd7a11b073a09e90db" alt=""
data:image/s3,"s3://crabby-images/f394f/f394faec165c900bf6ff54a73b4d62eacb46d4a4" alt=""
(8) Configure the sort settings
data:image/s3,"s3://crabby-images/c7c3e/c7c3efbd6a2792b47f853e051a6e25c6e16d6e72" alt=""
(9) Configure the join settings
data:image/s3,"s3://crabby-images/a35da/a35da0d172f1c383f32a514a5b659f91c5e9702d" alt=""
(10) Configure the field selection
data:image/s3,"s3://crabby-images/56ad4/56ad480ff077e791dd7e372ecda6fa0f766c84c9" alt=""
(11) Configure the file output
data:image/s3,"s3://crabby-images/77c60/77c60668094dc9b48a14fa34ca9a137ee8265c10" alt=""
data:image/s3,"s3://crabby-images/04aaa/04aaab06cd220bb90bfe87f20d658dd31a1bf053" alt=""
data:image/s3,"s3://crabby-images/60622/60622facac8f54df3da773d4b6c9884f0a09075e" alt=""
data:image/s3,"s3://crabby-images/2f265/2f2659f517e90fb5003b822ccce23faae8696f32" alt=""
data:image/s3,"s3://crabby-images/0efd5/0efd5a211092c8808f2990fd4ef2f45288df93fb" alt=""
data:image/s3,"s3://crabby-images/072e7/072e77f003f50823a008c1b32801172fbc4ea5ef" alt=""
(12) Save, run, and check HDFS
data:image/s3,"s3://crabby-images/e348b/e348b7b0bfc23f090a0a3797f04d625e73411942" alt=""
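The flow in this case sorts both inputs, joins them, selects fields, and writes a text file. The sketch below walks through the same sequence in plain Python on the sample rows inserted earlier, to show why the sort comes before the join: a merge join walks both inputs in key order. The join key (deptno), the inner-join type, and the selected output fields are assumptions, since those settings appear only in the screenshots.

```python
from operator import itemgetter

dept = [(10, "accounting", "NEW YORK"), (20, "RESEARCH", "DALLAS"),
        (30, "SALES", "CHICAGO"), (40, "OPERATIONS", "BOSTON")]
emp = [(7369, "SMITH", "CLERK", 7902, "1980-12-17", 800.0, None, 20),
       (7499, "ALLEN", "SALESMAN", 7698, "1980-12-17", 1600.0, 300, 30),
       (7521, "WARD", "SALESMAN", 7698, "1980-12-17", 1250.0, 500, 30),
       (7566, "JONES", "MANAGER", 7839, "1980-12-17", 2975.0, None, 20)]


def merge_join(left, right, left_key, right_key):
    """Inner sort-merge join; both inputs must already be sorted on their key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair the current left row with every right row sharing this key.
            k = j
            while k < len(right) and right[k][right_key] == lk:
                out.append(left[i] + right[k])
                k += 1
            i += 1
    return out


# Sort rows: emp on deptno (index 7), dept on deptno (index 0).
emp_sorted = sorted(emp, key=itemgetter(7))
dept_sorted = sorted(dept, key=itemgetter(0))

# Merge join, then select empno, ename, sal, dname (dname is index 9 after joining).
joined = merge_join(emp_sorted, dept_sorted, 7, 0)
lines = ["\t".join(str(r[i]) for i in (0, 1, 5, 9)) for r in joined]
print("\n".join(lines))
```

Department 40 has no employees and department 10 has none in this sample, so neither appears in an inner join.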
6. Case 4: read a file from HDFS and save the rows with sal greater than 1000 to HBase
(1) Create a table in HBase to hold the data
[root@hadoop1 ~]$ /opt/module/hbase-1.3.1/bin/hbase shell
hbase(main):004:0> create 'people','info'
(2) Build the flow as shown below
data:image/s3,"s3://crabby-images/405bd/405bd607aac1d83e713ec894264897be55e5e203" alt=""
(3) Configure the file input and connect to HDFS
data:image/s3,"s3://crabby-images/444d1/444d1e36c9f8a27e1f432a01b30183f71fa7c007" alt=""
data:image/s3,"s3://crabby-images/e1d61/e1d61eee54f1d5ce5b0f5caf53c2a5a3faf77dbd" alt=""
(4) Configure the Filter rows step
data:image/s3,"s3://crabby-images/2c925/2c925b5c6f580120d1bbead1c4b41803dbb31a1e" alt=""
(5) Configure HBase output
data:image/s3,"s3://crabby-images/5c0cc/5c0cc812e9b1c5d7b4e0f0778ad071aca8f491de" alt=""
data:image/s3,"s3://crabby-images/0736e/0736e4606e51664cb2103dae6eb8041b7a049d3e" alt=""
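The data path in this case is: parse each text line, keep rows with sal above 1000, and map the remaining fields to columns in the info family of the people table. The sketch below shows that shaping step in plain Python without touching HDFS or HBase. The tab-separated layout matching the emp table, the use of empno as row key, and the helper name `to_hbase_puts` are assumptions made for illustration.

```python
def to_hbase_puts(lines, family="info"):
    """Parse tab-separated emp rows, keep sal > 1000, and shape them as
    {row_key: {"family:column": value}} dictionaries."""
    fields = ["empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno"]
    puts = {}
    for line in lines:
        row = dict(zip(fields, line.rstrip("\n").split("\t")))
        if float(row["sal"]) > 1000:
            key = row.pop("empno")  # assumed row key
            puts[key] = {f"{family}:{col}": val for col, val in row.items()}
    return puts


sample = [
    "7369\tSMITH\tCLERK\t7902\t1980-12-17\t800.0\t\\N\t20",  # \N is Hive's text NULL marker
    "7499\tALLEN\tSALESMAN\t7698\t1980-12-17\t1600.0\t300\t30",
]
print(to_hbase_puts(sample))
```

Only ALLEN passes the filter here, because SMITH's sal of 800 is not greater than 1000.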
Note: if an error reports no permission to write files to HDFS, add the following parameters at line 119 of Spoon.bat
"-DHADOOP_USER_NAME=atguigu" "-Dfile.encoding=UTF-8"
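III. Creating Repositories
1. Database repository
A database repository stores the information about jobs and transformations in a database. At run time Kettle reads the definitions straight from the database, which makes cross-platform use easy.
1) Click Connect in the upper-right corner and choose Other Repositories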
data:image/s3,"s3://crabby-images/f15ef/f15ef7f74b780fedd3ba35912d0b1c736b6fc8e0" alt=""
- Choose Database Repository
data:image/s3,"s3://crabby-images/064c3/064c3443fac786dccaf6ae5a91ab12b05f41dee0" alt=""
- Create a new connection
data:image/s3,"s3://crabby-images/a9b81/a9b816f3b61ad23c9f9982c3a1d2249beaab9bfe" alt=""
data:image/s3,"s3://crabby-images/677cc/677cc2bf7d05831d04f8a44df3ed916d8d256328" alt=""
- After filling in the fields, click Finish. Many tables are created in the chosen database, and the database repository is ready.
data:image/s3,"s3://crabby-images/7d8c0/7d8c082792be4e34ac8ffdc480cebbf3a10c3fed" alt=""
- Connect to the repository
The default username and password are both admin
data:image/s3,"s3://crabby-images/718dc/718dca94f1c3305ab75c3b8c6a17fa52fcf68eb8" alt=""
- Import the transformation built earlier into the repository
(1) Choose to import from an XML file
data:image/s3,"s3://crabby-images/66786/66786acc149fad99b610cc812d7e5489c2d2843d" alt=""
(2) Pick any transformation
data:image/s3,"s3://crabby-images/bf61b/bf61bf163370d43bba93fd3a200145a8b8ee911f" alt=""
(3) Click Save and choose the storage location and file name
data:image/s3,"s3://crabby-images/a7efb/a7efb8018ad3e058c3ec11107ad394ccb0545078" alt=""
(4) Open the repository to check the saved result
data:image/s3,"s3://crabby-images/55289/55289b62d6544703fed90c26f0914fb0a3f4f623" alt=""
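2. File repository
A file repository stores the information about jobs and transformations in a specified directory, which is essentially the same as the XML approach.
It is created in much the same way as a database repository, except that no username or password is needed to access it. Using it across platforms is more cumbersome.
1) Choose Connect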
data:image/s3,"s3://crabby-images/b2026/b20266ea8c7f272ebf3aad3a8b6b5d130ff9659c" alt=""
2) Click Add, then click Other Repositories
data:image/s3,"s3://crabby-images/6f60f/6f60f90db59b1c56336eb5aae8b3f95f87410f24" alt=""
3) Choose File Repository
data:image/s3,"s3://crabby-images/4839f/4839fdac11d17ee8242aac9f6157a34067e1197f" alt=""
4) Fill in the details
data:image/s3,"s3://crabby-images/d4031/d4031dd941cd98cdc353232a7d6bc46efb01ce9c" alt=""
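IV. Installing and Using on Linux
1. Single machine
1) Install the JDK
2) Upload the installation package to the server and unpack it
Note: 1. Copy the MySQL driver into the lib directory
2. Upload the whole hidden directory C:\Users\<your username>\.kettle from the local user's home directory to the Linux home directory /home/MrZhou/
3) Run a transformation from the database repository:
[root@hadoop1 data-integration]$ ./pan.sh -rep=my_repo -user=admin -pass=admin -trans=stu1tostu2 -dir=/
Parameters:
-rep repository name
-user repository username
-pass repository password
-trans name of the transformation to run
-dir directory (do not forget the leading /)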
data:image/s3,"s3://crabby-images/59ca9/59ca9c6e902b62ea1f1b3123853b6b8a4e2ae1b5" alt=""
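4) Run a job from the repository:
Remember to point the transformation entry inside the job at the transformation stored in the repository
[root@hadoop1 data-integration]$ ./kitchen.sh -rep=repo1 -user=admin -pass=admin -job=jobDemo1 -logfile=./logs/log.txt -dir=/
Parameters:
-rep repository name
-user repository username
-pass repository password
-job job name
-dir job directory
-logfile log file path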
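When these commands are launched from a scheduler or a wrapper script, building the argument list in one place avoids mistyped flags. The sketch below only assembles the list from the parameters described above; the function name `kettle_command` is invented, and the commented-out `subprocess.run` call and working directory are assumptions about how it might be used.

```python
def kettle_command(script, repo, user, password, name, directory="/", logfile=None):
    """Build the argument list for pan.sh (transformations) or kitchen.sh (jobs)."""
    if script.endswith("pan.sh"):
        target = f"-trans={name}"
    elif script.endswith("kitchen.sh"):
        target = f"-job={name}"
    else:
        raise ValueError(f"unsupported script: {script}")
    args = [script, f"-rep={repo}", f"-user={user}", f"-pass={password}",
            target, f"-dir={directory}"]
    if logfile:
        args.append(f"-logfile={logfile}")
    return args


print(kettle_command("./pan.sh", "my_repo", "admin", "admin", "stu1tostu2"))

# Possible use (assumed install directory):
# import subprocess
# result = subprocess.run(kettle_command("./kitchen.sh", "repo1", "admin", "admin",
#                                        "jobDemo1", logfile="./logs/log.txt"),
#                         cwd="/opt/module/data-integration")
# A non-zero result.returncode would indicate that the run failed.
```

Passing a list instead of a single shell string keeps values with spaces or special characters from being reinterpreted by the shell.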
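2. Cluster mode (for reference)
- Prepare three servers: hadoop1.x is the Kettle master server on port 8080; hadoop2.x and hadoop3.x are the two slave servers on ports 8081 and 8082.
- Install and deploy the JDK
- Set up a fully distributed Hadoop environment and start its processes (HDFS is needed)
- Upload and unpack the Kettle installation package
- Go to the /opt/module/data-integration/pwd directory and modify the configuration files
Modify the master server configuration file carte-config-master-8080.xml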
<slaveserver>
<name>master</name>
<hostname>hadoop1.x</hostname>
<port>8080</port>
<master>Y</master>
<username>cluster</username>
<password>cluster</password>
</slaveserver>
Modify the slave server configuration file carte-config-8081.xml
<masters>
<slaveserver>
<name>master</name>
<hostname>hadoop1.x</hostname>
<port>8080</port>
<username>cluster</username>
<password>cluster</password>
<master>Y</master>
</slaveserver>
</masters>
<report_to_masters>Y</report_to_masters>
<slaveserver>
<name>slave1</name>
<hostname>hadoop2.x</hostname>
<port>8081</port>
<username>cluster</username>
<password>cluster</password>
<master>N</master>
</slaveserver>
Modify the slave server configuration file carte-config-8082.xml
<masters>
<slaveserver>
<name>master</name>
<hostname>hadoop1.x</hostname>
<port>8080</port>
<username>cluster</username>
<password>cluster</password>
<master>Y</master>
</slaveserver>
</masters>
<report_to_masters>Y</report_to_masters>
<slaveserver>
<name>slave2</name>
<hostname>hadoop3.x</hostname>
<port>8082</port>
<username>cluster</username>
<password>cluster</password>
<master>N</master>
</slaveserver>
- Distribute the whole Kettle installation directory: xsync data-integration
- Start the processes by running the following on hadoop1.x, hadoop2.x, and hadoop3.x respectively
[root@hadoop1.x data-integration]# ./carte.sh hadoop1.x 8080
[root@hadoop2.x data-integration]# ./carte.sh hadoop2.x 8081
[root@hadoop3.x data-integration]# ./carte.sh hadoop3.x 8082
- Open the Carte web page
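The two slave configuration files differ only in the slave's name, host, and port, so they can be generated rather than edited by hand, which also keeps the master hostname consistent across files. The sketch below builds one such file with the standard library's ElementTree. The function names are invented, and the `slave_config` root element is an assumption, because the fragments above do not show a root element.

```python
import xml.etree.ElementTree as ET


def slave_server(name, hostname, port, master, username="cluster", password="cluster"):
    """Build one <slaveserver> element with the child tags used in the files above."""
    el = ET.Element("slaveserver")
    for tag, text in (("name", name), ("hostname", hostname), ("port", str(port)),
                      ("username", username), ("password", password),
                      ("master", "Y" if master else "N")):
        ET.SubElement(el, tag).text = text
    return el


def slave_config(master_host, master_port, name, hostname, port):
    """Build the configuration for one slave that reports to one master."""
    root = ET.Element("slave_config")  # assumed root element name
    masters = ET.SubElement(root, "masters")
    masters.append(slave_server("master", master_host, master_port, master=True))
    ET.SubElement(root, "report_to_masters").text = "Y"
    root.append(slave_server(name, hostname, port, master=False))
    return root


cfg = slave_config("hadoop1.x", 8080, "slave1", "hadoop2.x", 8081)
print(ET.tostring(cfg, encoding="unicode"))
```

Calling `slave_config("hadoop1.x", 8080, "slave2", "hadoop3.x", 8082)` would give the counterpart for the second slave.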
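3. Case: read the emp table in Hive, sort by id, and output the result to HDFS
Note: because this case reads and writes Hive and HBase, some configuration files must be changed first.
In plugin.properties under data-integration\plugins\pentaho-big-data-plugin in the unpacked directory, set active.hadoop.configuration=hdp26, then copy the following configuration files into data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp26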
data:image/s3,"s3://crabby-images/dcee3/dcee39e5a2838e5c9b4ce0664b37e1a8511eafd0" alt=""
(1) Create the transformation, edit the steps, and fill in the settings
data:image/s3,"s3://crabby-images/2e2c4/2e2c4ecefd2b19ccb8702bb3d90cbedc3a9711d5" alt=""
(2) Create the slave servers and fill in their settings so that they match the configuration on the cluster
data:image/s3,"s3://crabby-images/b7eb1/b7eb12438dc5e43a97e49c98a8fe4409de84f777" alt=""
data:image/s3,"s3://crabby-images/080bd/080bdf736ca59f30279e50887f6dba0283b0e6e4" alt=""
data:image/s3,"s3://crabby-images/87072/87072f946c8577ee4b2e505c15aa109968424197" alt=""
data:image/s3,"s3://crabby-images/33002/3300251cb57ccd47cac65a07db365f3b0d744d06" alt=""
(3) Create a cluster schema and select the servers from the previous step
data:image/s3,"s3://crabby-images/d108d/d108d18c8e04a3da3dd660efc4b42d83fa6d78ba" alt=""
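(4) For each step that should run on the cluster, right-click it, choose the cluster option, and select the cluster schema created in the previous step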
data:image/s3,"s3://crabby-images/95d80/95d8049b94ece63ad80be5f42fba53adf0df53d3" alt=""
(5) Create a Run Configuration, choose cluster mode, and run it
data:image/s3,"s3://crabby-images/bfd18/bfd188622bc7b21d0fa4fb17a4f7d3ee53ee374f" alt=""
data:image/s3,"s3://crabby-images/c97a4/c97a4439dff6241b449813e7b164c3cf87d1a3fd" alt=""
data:image/s3,"s3://crabby-images/9edd0/9edd02df71da62827cfc5283cb0148173f7ab4df" alt=""
V. Tuning
1. Tune performance by adjusting the JVM memory size: edit the Spoon script in the Kettle root directory.
data:image/s3,"s3://crabby-images/f69b7/f69b7b0cb5520790c32eac56980efbce0695701c" alt=""
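Parameter reference:
-Xmx2048m: sets the maximum JVM heap size to 2048 MB.
-Xms1024m: sets the initial JVM heap size to 1024 MB. It can be set equal to -Xmx so that the JVM does not have to resize the heap after each garbage collection.
-Xmn2g: sets the young generation size to 2 GB (an illustration of the syntax only; the value must be smaller than -Xmx). Total JVM memory = young generation + old generation + permanent generation. The permanent generation usually has a fixed size of 64 MB, so enlarging the young generation shrinks the old generation. This value has a large effect on performance; Sun's official recommendation is 3/8 of the whole heap. On Java 8 and later the permanent generation no longer exists; it was replaced by Metaspace.
-Xss128k: sets the stack size of each thread. Since JDK 5.0 the default stack size per thread is 1 MB; before that it was 256 KB. Adjust it according to how much memory the application's threads need. With the same physical memory, a smaller value allows more threads to be created, but the operating system still limits the number of threads per process, so they cannot be created without bound; an empirical figure is around 3000 to 5000.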
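2. Tune the commit size. Kettle's default commit size is 1000; depending on the data volume it can be set anywhere from 1000 to 50000.
3. Use a database connection pool where possible.
4. Raise the commit size for batch processing where possible.
5. Use caching where possible, and make the caches fairly large (mainly for text files and data streams).
6. Kettle is written in Java, so start it with generous memory parameters.
7. Do in SQL whatever can be done in SQL.
Group, merge, stream lookup, and split field operations are all relatively slow. Try to avoid them, and use SQL wherever possible.
8. Drop indexes before inserting large amounts of data.
9. Avoid update and delete operations where possible, especially update. If feasible, turn an update into a delete followed by an insert.
10. When truncate table can be used, do not use SQL that deletes all rows. Partition tables sensibly: if a delete targets a single partition, do not delete row by row (whether with delete SQL or a delete step); drop the partition and recreate it.
11. Keep the input data set as small as possible (incremental updates serve the same purpose).
12. Use the database's native way of loading text files where possible (Oracle's sqlloader, MySQL's bulk loader step).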
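The effect of the commit size in items 2 and 4 is easy to see outside Kettle: committing once per batch instead of once per row cuts the number of transactions. The sketch below demonstrates the batching pattern with Python's built-in sqlite3. It is an analogy to the Commit size setting, not Kettle code; the table layout follows stu2 from Case 1 and the function name is invented.

```python
import sqlite3


def load_in_batches(conn, rows, commit_size=1000):
    """Insert rows into stu2, committing once per commit_size rows.

    Returns the number of commits issued.
    """
    commits = 0
    for start in range(0, len(rows), commit_size):
        batch = rows[start:start + commit_size]
        conn.executemany("INSERT INTO stu2 (id, name) VALUES (?, ?)", batch)
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stu2(id INT, name VARCHAR(20))")
rows = [(i, f"name{i}") for i in range(2500)]
print(load_in_batches(conn, rows, commit_size=1000))  # 2500 rows in batches of 1000 -> 3 commits
```

A larger commit size means fewer commits but more uncommitted work lost if a batch fails, which is one reason the suitable value depends on the data volume.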