使用DataX实现mysql与hive数据互相导入导出

一、概论

1.1 什么是DataX

DataX 是阿里巴巴开源 的一个异构数据源离线同步工具,致力于实现包括关系型数据库(MySQL、Oracle 等)、HDFS、Hive、ODPS、HBase、FTP 等各种异构数据源之间稳定高效的数据同步功能。

1.2 DataX 的设计

为了解决异构数据源同步问题,DataX 将复杂的网状 的同步链路变成了星型 数据链路,DataX 作为中间传输载体负责连接各种数据源。当需要接入一个新的数据源的时候,只需要将此数据源对接到 DataX,便能跟已有的数据源做到无缝数据同步

1.3 框架设计

  • Reader:数据采集模块,负责采集数据源的数据,将数据发给Framework。
  • Wiriter: 数据写入模块,负责不断向Framwork取数据,并将数据写入到目的端。
  • Framework :用于连接read和writer,作为两者的数据传输通道,并处理缓冲,流控,并发,数据转换等核心技术问题。
    运行原理
  • Job:单个作业的管理节点,负责数据清理、子任务划分、TaskGroup监控管理。
  • Task:由Job切分而来,是DataX作业的最小单元,每个Task负责一部分数据的同步工作。
  • Schedule:将Task组成TaskGroup,单个TaskGroup的并发数量为5。
  • TaskGroup:负责启动Task。

1.4 Datax所支持的渠道

类型 数据源 读者 作家(写) 文件
RDBMS关系型数据库 MySQL
甲骨文
SQL服务器
PostgreSQL的
DRDS
通用RDBMS(支持所有关系型数据库)
阿里云数仓数据存储 ODPS
美国存托凭证
开源软件
OCS
NoSQL数据存储 OTS
Hbase0.94
Hbase1.1
凤凰4.x
凤凰5.x
MongoDB
蜂巢
卡桑德拉
无结构化数据存储 文本文件
的FTP
HDFS
弹性搜索
时间序列数据库 OpenTSDB
技术开发局

二、快速入门

2.1 环境搭建

下载地址: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
源码地址: https://github.com/alibaba/DataX

配置要求:

  • Linux
  • JDK(1.8以上 建议1.8) 下载
  • Python(推荐 Python2.6.X) 下载
    安装:

1) 将下载好的datax.tar.gz上传到服务器的任意节点,我这里上传到node01上的/exprot/soft
2)解压到/export/servers/

bash 复制代码
[root@node01 soft]# tar -zxvf datax.tar.gz  -C ../servers/

3)运行自检脚本

出现以下结果说明你得环境没有问题

/opt/module/datax/plugin/reader/._hbase094xreader/plugin.json\]不存在. 请检查您的配置文件. ![在这里插入图片描述](https://file.jishuzhan.net/article/1686903417424842754/ef9b9524323e4cde8a94fc4371f45a89.png) ## 2.2搭建环境注意事项 \[/opt/module/datax/plugin/reader/._hbase094xreader/plugin.json\]不存在. 请检查您的配置文件. 参考: ```bash find ./* -type f -name ".*er" | xargs rm -rf find: paths must precede expression: | Usage: find [-H] [-L] [-P] [-Olevel] [-D help|tree|search|stat|rates|opt|exec] [path...] [expression] find /datax/plugin/reader/ -type f -name "._*er" | xargs rm -rf find /datax/plugin/writer/ -type f -name "._*er" | xargs rm -rf 这里的/datax/plugin/writer/要改为你自己的目录 ``` 原文链接:https://blog.csdn.net/dz77dz/article/details/127055299 ## 2.3读取Mysql中的数据写入到HDFS **准备** 创建数据库和表并加载测试数据 ```prettyprint create database test; use test; create table c_s( id varchar(100) null, c_id int null, s_id varchar(20) null ); INSERT INTO test.c_s (id, c_id, s_id) VALUES ('123', 1, '201967'); INSERT INTO test.c_s (id, c_id, s_id) VALUES ('123', 2, '201967'); INSERT INTO test.c_s (id, c_id, s_id) VALUES ('123', 3, '201967'); INSERT INTO test.c_s (id, c_id, s_id) VALUES ('123', 5, '201967'); INSERT INTO test.c_s (id, c_id, s_id) VALUES ('123', 6, '201967'); ``` 查看官方提供的模板 ```bash [root@node01 datax]# bin/datax.py -r mysqlreader -w hdfswriter DataX (DATAX-OPENSOURCE-3.0), From Alibaba ! Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved. Please refer to the mysqlreader document: https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md Please refer to the hdfswriter document: https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md Please save the following configuration as a json file and use python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json to run the job. { "job": { "content": [ { "reader": { "name": "mysqlreader", "parameter": { "column": [], "connection": [ { "jdbcUrl": [], "table": [] } ], "password": "", "username": "", "where": "" } }, "writer": { "name": "hdfswriter", "parameter": { "column": [], "compress": "", "defaultFS": "", "fieldDelimiter": "", "fileName": "", "fileType": "", "path": "", "writeMode": "" } } } ], "setting": { "speed": { "channel": "" } } } } ``` 根据官网模板进行修改 ```bash [root@node01 datax]# vim job/mysqlToHDFS.json { "job": { "content": [ { "reader": { "name": "mysqlreader", "parameter": { "column": [ "id", "c_id", "s_id" ], "connection": [ { "jdbcUrl": [ "jdbc:mysql://node02:3306/test" ], "table": [ "c_s" ] } ], "password": "123456", "username": "root" } }, "writer": { "name": "hdfswriter", "parameter": { "column": [ { "name": "id", "type": "string" }, { "name": "c_id", "type": "int" }, { "name": "s_id", "type": "string" } ], "defaultFS": "hdfs://node01:8020", "fieldDelimiter": "\t", "fileName": "c_s.txt", "fileType": "text", "path": "/", "writeMode": "append" } } } ], "setting": { "speed": { "channel": "1" } } } } ``` HDFS的端口号注意版本,2.7.4 是9000;hdfs://node01:9000 MySQL的参数介绍 ![在这里插入图片描述](https://file.jishuzhan.net/article/1686903417424842754/164b586b94144f16b766a535437f5466.png) HDFS参数介绍 ![在这里插入图片描述](https://file.jishuzhan.net/article/1686903417424842754/e740938a421e4bdea0afb14b4f155e0a.png) 运行脚本 ```bash [root@node01 datax]# bin/datax.py job/mysqlToHDFS.json 2020-10-02 16:12:16.358 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /export/servers/datax/hook 2020-10-02 16:12:16.359 [job-0] INFO JobContainer - [total cpu info] => averageCpu | maxDeltaCpu | minDeltaCpu -1.00% | -1.00% | -1.00% [total gc info] => NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime PS MarkSweep | 1 | 1 | 1 | 0.245s | 0.245s | 0.245s PS Scavenge | 1 | 1 | 1 | 0.155s | 0.155s | 0.155s 2020-10-02 16:12:16.359 [job-0] INFO JobContainer - PerfTrace not enable! 2020-10-02 16:12:16.359 [job-0] INFO StandAloneJobContainerCommunicator - Total 5 records, 50 bytes | Speed 5B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00% 2020-10-02 16:12:16.360 [job-0] INFO JobContainer - 任务启动时刻 : 2020-10-02 16:12:04 任务结束时刻 : 2020-10-02 16:12:16 任务总计耗时 : 12s 任务平均流量 : 5B/s 记录写入速度 : 0rec/s 读出记录总数 : 5 读写失败总数 : 0 ``` ## 2.4 读取HDFS中的数据写入到Mysql **准备工作** ```prettyprint create database test; use test; create table c_s2( id varchar(100) null, c_id int null, s_id varchar(20) null ); ``` 查看官方提供的模板 ```bash [root@node01 datax]# bin/datax.py -r hdfsreader -w mysqlwriter DataX (DATAX-OPENSOURCE-3.0), From Alibaba ! Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved. Please refer to the hdfsreader document: https://github.com/alibaba/DataX/blob/master/hdfsreader/doc/hdfsreader.md Please refer to the mysqlwriter document: https://github.com/alibaba/DataX/blob/master/mysqlwriter/doc/mysqlwriter.md Please save the following configuration as a json file and use python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json to run the job. { "job": { "content": [ { "reader": { "name": "hdfsreader", "parameter": { "column": [], "defaultFS": "", "encoding": "UTF-8", "fieldDelimiter": ",", "fileType": "orc", "path": "" } }, "writer": { "name": "mysqlwriter", "parameter": { "column": [], "connection": [ { "jdbcUrl": "", "table": [] } ], "password": "", "preSql": [], "session": [], "username": "", "writeMode": "" } } } ], "setting": { "speed": { "channel": "" } } } } ``` 根据官方提供模板进行修改 ```bash [root@node01 datax]# vim job/hdfsTomysql.json { "job": { "content": [ { "reader": { "name": "hdfsreader", "parameter": { "column": [ "*" ], "defaultFS": "hdfs://node01:8020", "encoding": "UTF-8", "fieldDelimiter": "\t", "fileType": "text", "path": "/c_s.txt" } }, "writer": { "name": "mysqlwriter", "parameter": { "column": [ "id", "c_id", "s_id" ], "connection": [ { "jdbcUrl": "jdbc:mysql://node02:3306/test", "table": [ "c_s2" ] } ], "password": "123456", "username": "root", "writeMode": "replace" } } } ], "setting": { "speed": { "channel": "1" } } } } ``` 脚本运行 ```bash [root@node01 datax]# bin/datax.py job/hdfsTomysql.json [total cpu info] => averageCpu | maxDeltaCpu | minDeltaCpu -1.00% | -1.00% | -1.00% [total gc info] => NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime PS MarkSweep | 1 | 1 | 1 | 0.026s | 0.026s | 0.026s PS Scavenge | 1 | 1 | 1 | 0.015s | 0.015s | 0.015s 2020-10-02 16:57:13.152 [job-0] INFO JobContainer - PerfTrace not enable! 2020-10-02 16:57:13.152 [job-0] INFO StandAloneJobContainerCommunicator - Total 5 records, 50 bytes | Speed 5B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.033s | Percentage 100.00% 2020-10-02 16:57:13.153 [job-0] INFO JobContainer - 任务启动时刻 : 2020-10-02 16:57:02 任务结束时刻 : 2020-10-02 16:57:13 任务总计耗时 : 11s 任务平均流量 : 5B/s 记录写入速度 : 0rec/s 读出记录总数 : 5 读写失败总数 : 0 ``` ## 2.5将Mysql表导入Hive 1.在hive中建表 ```sql -- hive建表 CREATE TABLE student2 ( classNo string, stuNo string, score int) row format delimited fields terminated by ','; -- 构造点mysql数据 create table if not exists student2( classNo varchar ( 50 ), stuNo varchar ( 50 ), score int ) insert into student2 values('1001','1012ww10087',63); insert into student2 values('1002','1012aa10087',63); insert into student2 values('1003','1012bb10087',63); insert into student2 values('1004','1012cc10087',63); insert into student2 values('1005','1012dd10087',63); insert into student2 values('1006','1012ee10087',63); ``` 2.编写mysql2hive.json配置文件 ```python { "job": { "setting": { "speed": { "channel": 1 } }, "content": [ { "reader": { "name": "mysqlreader", "parameter": { "username": "root", "password": "root", "connection": [ { "table": [ "student2" ], "jdbcUrl": [ "jdbc:mysql://192.168.43.10:3306/mytestmysql" ] } ], "column": [ "classNo", "stuNo", "score" ] } }, "writer": { "name": "hdfswriter", "parameter": { "defaultFS": "hdfs://192.168.43.10:9000", "path": "/hive/warehouse/home/myhive.db/student2", "fileName": "myhive", "writeMode": "append", "fieldDelimiter": ",", "fileType": "text", "column": [ { "name": "classNo", "type": "string" }, { "name": "stuNo", "type": "string" }, { "name": "score", "type": "int" } ] } } } ] } } ``` 3.运行脚本 ```bash bin/datax.py job/mysql2hive.json ``` 4.查看hive表是否有数据 ![](https://file.jishuzhan.net/article/1686903417424842754/395148667da7447cb68cb4f79b98af4c.png) ## 2.6将Hive表数据导入Mysql 1.要先在mysql建好表 ```sql create table if not exists student( classNo varchar ( 50 ), stuNo varchar ( 50 ), score int ) ``` 2.hive2mysql.json配置文件 ```python { "job": { "setting": { "speed": { "channel": 3 } }, "content": [ { "reader": { "name": "hdfsreader", "parameter": { "path": "/hive/warehouse/home/myhive.db/student/*", "defaultFS": "hdfs://192.168.43.10:9000", "column": [ { "index": 0, "type": "string" }, { "index": 1, "type": "string" }, { "index": 2, "type": "Long" } ], "fileType": "text", "encoding": "UTF-8", "fieldDelimiter": "," } }, "writer": { "name": "mysqlwriter", "parameter": { "writeMode": "insert", "username": "root", "password": "root", "column": [ "classNo", "stuNo", "score" ], "preSql": [ "delete from student" ], "connection": [ { "jdbcUrl": "jdbc:mysql://192.168.43.10:3306/mytestmysql?useUnicode=true&characterEncoding=utf8", "table": [ "student" ] } ] } } } ] } } ``` 注意事项: ```bash 在Hive的ODS层建表语句中,以","为分隔符; fields terminated by ',' 在DataX的json文件中,也以","为分隔符。 "fieldDelimiter": "," 与hive表里面的分隔符保持一致即可 ``` 由于DataX不能完全支持所有Hive表的数据类型,应将DataX启动文件中的hdfsreader中的column字段的类型改成DataX支持的类型 ![](https://file.jishuzhan.net/article/1686903417424842754/de0738c9ce9041f29bd11270e00a6418.png)

相关推荐
@insist1236 小时前
信息安全工程师考点精讲:身份认证核心原理与分类体系(上篇)
大数据·网络·分类·信息安全工程师·软件水平考试
天辛大师7 小时前
AI助力旅游扩大化,五一旅游公园通游年票普惠研究
大数据·启发式算法·旅游
WordPress学习笔记7 小时前
镌刻中式美学的高端WordPress主题
大数据·人工智能·wordpress
数智化精益手记局8 小时前
拆解物料管理erp系统的核心功能,看物料管理erp系统如何解决库存积压与缺料难题
大数据·网络·人工智能·安全·信息可视化·精益工程
Elastic 中国社区官方博客10 小时前
使用 Observability Migration Platform 将 Datadog 和 Grafana 的仪表板与告警迁移到 Kibana
大数据·elasticsearch·搜索引擎·信息可视化·全文检索·grafana·datalog
jkyy201410 小时前
AI运动数字化:以技术重塑场景,健康有益赋能全域运动健康管理
大数据·人工智能·健康医疗
金融小师妹10 小时前
4月30日多因子共振节点:鲍威尔“收官效应”与权力结构重塑的预期重构
大数据·人工智能·重构·逻辑回归
2601_9499251810 小时前
AI Agent如何重构跨境物流的决策?
大数据·人工智能·重构·ai agent·geo优化·物流科技
xiaoduo AI11 小时前
客服机器人问题解决率怎么统计?Agent系统自动判断是否解决,比人工回访准?
大数据·人工智能·机器人
小五兄弟12 小时前
YouTube 肖像检测扩展背后:短剧出海版权保护的技术实现与实战策略
大数据·人工智能