DolphinScheduler + DataX in Practice (Part 1): Importing Data from a File into Hive

DataX supports reading and writing across many data sources and, as open-source software, provides offline data-ingestion capability that is handy when building systems. Along the way, though, you run into plenty of configuration issues that you have to work out on your own: free software still has a cost.

Configuration template

json
{
    "setting": {},
    "job": {
        "setting": {
            "speed": {
                "channel": 2
            }
        },
        "content": [
            {
                "reader": {
                    "name": "txtfilereader",
                    "parameter": {
                        "path": ["/data/test/test.txt"],
                        "encoding": "UTF-8",
                        "column": [
                            {
                                "index": 0,
                                "type": "string"
                            },
                            {
                                "index": 1,
                                "type": "string"
                            }
                        ],
                        "fieldDelimiter": "\t"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://****:9000",
                        "fileType": "TEXT",
                        "path": "/user/hive/warehouse/sz_center_devdb.db/cat",
                        "fileName": "catfile",
                        "column": [
                            {
                                "name": "cat_id",
                                "type": "STRING"
                            },
                            {
                                "name": "cat_name",
                                "type": "STRING"
                            }
                        ],
                        "writeMode": "append",
                        "fieldDelimiter": "\t",
                        "compress":"NONE"
                    }
                }
            }
        ]
    }
}

Note: the text file must be uploaded to the server where DataX runs.
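For reference, here is a minimal sketch of preparing a matching input file and launching the job. The file contents, the DataX install path (/opt/datax), and the job file name (file2hive.json) are assumptions for illustration; the two records mirror the ones that appear in the logs below.

bash
# Create a tab-delimited sample file matching the reader config above
# (two string columns separated by a real tab). Contents are illustrative.
printf '1\thello\n2\tcat\n' > /data/test/test.txt

# Launch the DataX job (install path and job file name assumed).
python /opt/datax/bin/datax.py /data/test/file2hive.json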

Error 1 during execution:

Hadoop permission exception

bash
      Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=default, access=WRITE, inode="/user/hive/warehouse/sz_center_devdb.db":anonymous:supergroup:drwxr-xr-x
                at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
                at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:336)
                at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:241)
                at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1909)
                at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1893)
                at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1852)
                at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:323)
                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2635)
                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2577)
                at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:807)
                at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:494)
                at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
                at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:532)
                at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
                at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1020)
                at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:948)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:422)
                at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
                at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2952)

                at org.apache.hadoop.ipc.Client.call(Client.java:1476)
                at org.apache.hadoop.ipc.Client.call(Client.java:1407)
                at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
                at com.sun.proxy.$Proxy9.create(Unknown Source)
                at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:498)
                at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
                at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
                at com.sun.proxy.$Proxy10.create(Unknown Source)
                at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1623)
                ... 18 more

This happens because the user has no write permission on the HDFS directory.

The job here runs as the user default.

There is nowhere in the DataX template to configure the user, so the first fix is to grant permissions on the Hadoop directories instead.

Configuring Hadoop directory permissions

bash
hdfs dfs -ls /
hdfs dfs -mkdir /user
hdfs dfs -mkdir /hbase
hdfs dfs -ls /
hadoop fs -chmod 777 /user
hadoop fs -chmod 777 /hbase
# Recursively set permissions on all subdirectories
hadoop fs -chmod -R 777 /hbase

The DataX job then runs successfully.
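As an aside, on clusters that use Hadoop simple authentication, the HDFS client takes its identity from the HADOOP_USER_NAME environment variable, so running DataX as a user that already has write access to the warehouse directory is an alternative to the blanket chmod 777. A minimal sketch (user name, install path, and job file name assumed):

bash
# Assuming simple authentication and a warehouse directory writable by 'hive':
# the HDFS client will act as the user named here.
export HADOOP_USER_NAME=hive
python /opt/datax/bin/datax.py /data/test/file2hive.json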

However, the data looked garbled when queried through the Hive connection; this comes from a mismatch between the Hive table's file type/field delimiter and what was actually written.

Going back over the logs, the reader had in fact reported problems with the text file:

bash
[WI-0][TI-0] - [INFO] 2024-11-06 16:16:36.229 +0800 o.a.d.p.t.a.AbstractTask:[169] -  -> 
        2024-11-06 16:16:35.230 [0-0-0-reader] INFO  TxtFileReader$Task - reading file : [/data/test/test.txt]
        2024-11-06 16:16:35.231 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
        2024-11-06 16:16:35.268 [0-0-0-writer] INFO  HdfsWriter$Task - begin do write...
        2024-11-06 16:16:35.268 [0-0-0-writer] INFO  HdfsWriter$Task - write to file : [hdfs://10.80.18.165:9000/user/hive/warehouse/sz_center_devdb.db/cat__f395492b_e42a_47e5_a52b_214ab8bf833a/catfile__d369974c_fdeb_4601_b118_67ae6e97e197]
        2024-11-06 16:16:35.341 [0-0-0-reader] INFO  UnstructuredStorageReaderUtil - CsvReader使用默认值[{"captureRawRecord":true,"columnCount":0,"comment":"#","currentRecord":-1,"delimiter":"\t","escapeMode":1,"headerCount":0,"rawRecord":"","recordDelimiter":"\u0000","safetySwitch":false,"skipEmptyRecords":true,"textQualifier":"\"","trimWhitespace":true,"useComments":false,"useTextQualifier":true,"values":[]}],csvReaderConfig值为[null]
        2024-11-06 16:16:35.351 [0-0-0-reader] WARN  UnstructuredStorageReaderUtil - 您尝试读取的列越界,源文件该行有 [1] 列,您尝试读取第 [2] 列, 数据详情[1 hello]
        2024-11-06 16:16:35.356 [0-0-0-reader] ERROR StdoutPluginCollector - 脏数据: 
        {"message":"您尝试读取的列越界,源文件该行有 [1] 列,您尝试读取第 [2] 列, 数据详情[1 hello]","record":[{"byteSize":7,"index":0,"rawData":"1 hello","type":"STRING"}],"type":"reader"}
        2024-11-06 16:16:35.357 [0-0-0-reader] WARN  UnstructuredStorageReaderUtil - 您尝试读取的列越界,源文件该行有 [1] 列,您尝试读取第 [2] 列, 数据详情[2 cat]
        2024-11-06 16:16:35.357 [0-0-0-reader] ERROR StdoutPluginCollector - 脏数据: 
        {"message":"您尝试读取的列越界,源文件该行有 [1] 列,您尝试读取第 [2] 列, 数据详情[2 cat]","record":[{"byteSize":5,"index":0,"rawData":"2 cat","type":"STRING"}],"type":"reader"}
        2024-11-06 16:16:35.793 [0-0-0-writer] INFO  HdfsWriter$Task - end do write
        2024-11-06 16:16:35.841 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[623]ms
        2024-11-06 16:16:35.841 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
[WI-0][TI-0] - [INFO] 2024-11-06 16:16:45.231 +0800 o.a.d.p.t.a.AbstractTask:[169] -  -> 
        2024-11-06 16:16:45.222 [job-0] INFO  StandAloneJobContainerCommunicator - Total 2 records, 12 bytes | Speed 1B/s, 0 records/s | Error 2 records, 12 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
        2024-11-06 16:16:45.222 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
        2024-11-06 16:16:45.223 [job-0] INFO  JobContainer - DataX Writer.Job [hdfswriter] do post work.
        2024-11-06 16:16:45.224 [job-0] INFO  HdfsWriter$Job - start rename file [hdfs://10.80.18.165:9000/user/hive/warehouse/sz_center_devdb.db/cat__f395492b_e42a_47e5_a52b_214ab8bf833a/catfile__d369974c_fdeb_4601_b118_67ae6e97e197] to file [hdfs://10.80.18.165:9000/user/hive/warehouse/sz_center_devdb.db/cat/catfile__d369974c_fdeb_4601_b118_67ae6e97e197].
[WI-0][TI-0] - [INFO] 2024-11-06 16:16:46.231 +0800 o.a.d.p.t.a.AbstractTask:[169] -  -> 

The warnings say each source line yields only one column ([1 hello], [2 cat]) while two columns were configured, which points at the delimiter in the file itself. It is not yet clear whether this is a source-file format problem, an encoding problem, or a job-configuration problem.

This will be updated once the cause is confirmed.
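One quick check worth doing (a sketch, assuming shell access to the DataX server): cat -A prints tabs as ^I, which immediately shows whether the file is really tab-delimited or uses spaces.

bash
# Tabs show up as ^I and line endings as $; spaces stay as spaces.
cat -A /data/test/test.txt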

Error 2 during execution

The data is empty, or the columns do not line up.

In this case the execution log shows no errors at all and the job result is "success", yet the target Hive table contains no data.

This is when you check the Hive table's delimiter configuration.

How to check a Hive table's delimiter

Run:

bash
show create table hello;

and look for:

'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' # the default SerDe; row delimiter: "\n", field delimiter: "^A"

The ^A character cannot be typed directly into the DataX JSON; it has to be written as an escape sequence, i.e. "fieldDelimiter": "\u0001".

The default storage format is textfile, which corresponds to "fileType": "TEXT" in the DataX JSON.
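To keep everything aligned, the simplest route is to create the Hive table with delimiters that match the DataX job above. A minimal sketch, assuming the hive CLI is available (database, table, and column names are taken from the job config; the rest is illustrative):

bash
# Create a textfile table whose field delimiter matches the writer's
# fieldDelimiter ("\t") and whose format matches "fileType": "TEXT".
hive -e "
CREATE TABLE IF NOT EXISTS sz_center_devdb.cat (
  cat_id   STRING,
  cat_name STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
"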

References: https://blog.csdn.net/mn525520/article/details/106876384

https://blog.csdn.net/u010520724/article/details/121999575

https://blog.csdn.net/qq_36039236/article/details/108101345

For the working Hive CREATE TABLE statement and the DataX JSON job configuration, see the configuration example at www.fancv.com.
