- Scenario:
Using Sqoop to import tens of millions of rows (roughly 18 million) of Oracle data into a multi-partition Hive table.
Cluster resources: 132 GB memory, 96 cores
Resources for the highway queue:
yarn.scheduler.capacity.root.highway.capacity=40
yarn.scheduler.capacity.root.highway.maximum-capacity=70
yarn.scheduler.capacity.root.highway.minimum-user-limit-percent=80
yarn.scheduler.capacity.root.highway.state=RUNNING
yarn.scheduler.capacity.root.highway.user-limit-factor=2
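To make those percentages concrete, here is a back-of-envelope conversion into absolute memory, assuming the capacity figures apply to the full 132 GB of cluster memory (the GB figures in the comments are my own arithmetic, not from the post):

```shell
# Assumption: highway's capacity/maximum-capacity are percentages of the
# whole 132 GB cluster, as is the default for a root.* queue.
CLUSTER_MB=$((132 * 1024))                      # 135168 MB total
GUARANTEED_MB=$((CLUSTER_MB * 40 / 100))        # capacity=40  -> 54067 MB (~52.8 GB)
MAX_MB=$((CLUSTER_MB * 70 / 100))               # maximum-capacity=70 -> 94617 MB (~92.4 GB)

echo "guaranteed: ${GUARANTEED_MB} MB, ceiling: ${MAX_MB} MB"
```

So 20 mappers at 4 GB each (80 GB) fit under the 70% elastic ceiling but exceed the 40% guaranteed share, meaning they rely on the queue borrowing idle capacity.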
Partition columns:
Original Sqoop script:
```bash
sqoop import -D mapred.job.queue.name=highway \
--connect "jdbc:oracle:thin:@//localhost:61521/LZY2" \
--username LZSHARE \
--password '123456' \
--query "SELECT
TO_CHAR(GCRQ, 'YYYY') AS gcrq_year,
TO_CHAR(GCRQ, 'MM') AS gcrq_month,
TO_CHAR(GCRQ, 'DD') AS gcrq_day,
YEAR,
TO_CHAR(GCRQ, 'YYYY-MM-DD HH24:MI:SS') AS GCRQ,
GCZBS,
HOUR,
MINUTE,
......
DELETE_BY,
TO_CHAR(DELETE_TIME, 'YYYY-MM-DD HH24:MI:SS') AS DELETE_TIME,
CREATE_BY,
TO_CHAR(CREATE_TIME, 'YYYY-MM-DD HH24:MI:SS') AS CREATE_TIME,
UPDATE_BY,
TO_CHAR(UPDATE_TIME, 'YYYY-MM-DD HH24:MI:SS') AS UPDATE_TIME,
TO_CHAR(INSERT_TIME, 'YYYY-MM-DD HH24:MI:SS') AS INSERT_TIME
FROM LZJHGX.dat_dcsj_time
WHERE TO_CHAR(GCRQ, 'YYYY-MM-DD') < TO_CHAR(SYSDATE, 'YYYY-MM-DD') AND \$CONDITIONS" \
--split-by MINUTE \
--hcatalog-database dw \
--hcatalog-table ods_pre_dat_dcsj_time \
--hcatalog-storage-stanza 'stored as orc' \
--num-mappers 5
```
Problem 1: Error: Java heap space (OutOfMemoryError)
Approach: first, analyze the --split-by column. This is the situation with MINUTE as the split column:
With that split, 5 mappers means each mapper processes roughly 4.5 million rows on average, which is clearly unreasonable, so a different split column had to be chosen (the table has no id or auto-increment key). The distribution of the new column, sjxh, looks like this:
its values range from 1 to 288, with 60,000-odd rows per value.
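A quick back-of-envelope check of that distribution, assuming ~18 million rows in total (Sqoop's integer splitter cuts the [min, max] range of the --split-by column into num-mappers roughly equal sub-ranges, so uniform values mean uniform mappers):

```shell
# Assumed figures: ~18M total rows (from the post), sjxh in [1, 288].
TOTAL_ROWS=18000000
MIN=1; MAX=288
MAPPERS=20

VALUES=$((MAX - MIN + 1))                      # 288 distinct values
ROWS_PER_VALUE=$((TOTAL_ROWS / VALUES))        # 62500 -> matches "60,000-odd"
ROWS_PER_MAPPER=$((TOTAL_ROWS / MAPPERS))      # 900000 rows per mapper if uniform

echo "${ROWS_PER_VALUE} rows/value, ${ROWS_PER_MAPPER} rows/mapper"
```

Each mapper then handles about 0.9 million rows instead of several million, which is what brings the per-task heap usage back under control.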
Second, increase the number of mappers and set the amount of memory each mapper can use:
-D mapreduce.map.memory.mb=4096 \
-D mapreduce.map.java.opts=-Xmx3072m \
--num-mappers 20
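As a sanity check on those two numbers: the JVM heap is deliberately smaller than the YARN container, leaving headroom for non-heap memory (thread stacks, metaspace, direct buffers). The 75-80% rule of thumb below is a common convention, not something stated in the post:

```shell
# mapreduce.map.memory.mb is the YARN container size;
# mapreduce.map.java.opts=-Xmx sets the JVM heap inside it.
CONTAINER_MB=4096
HEAP_MB=3072
PCT=$((HEAP_MB * 100 / CONTAINER_MB))
echo "heap is ${PCT}% of the container"   # 75% -> within the usual 75-80% guideline
```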
One more point: avoid complex queries in the import where possible (the per-row TO_CHAR conversions here all run on the Oracle side).
```bash
sqoop import -D mapred.job.queue.name=highway \
-D mapreduce.map.memory.mb=4096 \
-D mapreduce.map.java.opts=-Xmx3072m \
--connect "jdbc:oracle:thin:@//localhost:61521/LZY2" \
--username LZSHARE \
--password '123456' \
--query "SELECT
TO_CHAR(GCRQ, 'YYYY') AS gcrq_year,
TO_CHAR(GCRQ, 'MM') AS gcrq_month,
TO_CHAR(GCRQ, 'DD') AS gcrq_day,
YEAR,
TO_CHAR(GCRQ, 'YYYY-MM-DD HH24:MI:SS') AS GCRQ,
GCZBS,
.......
ERR_CODE,
ERR_DESC,
DELETE_BY,
TO_CHAR(DELETE_TIME, 'YYYY-MM-DD HH24:MI:SS') AS DELETE_TIME,
CREATE_BY,
TO_CHAR(CREATE_TIME, 'YYYY-MM-DD HH24:MI:SS') AS CREATE_TIME,
UPDATE_BY,
TO_CHAR(UPDATE_TIME, 'YYYY-MM-DD HH24:MI:SS') AS UPDATE_TIME,
TO_CHAR(INSERT_TIME, 'YYYY-MM-DD HH24:MI:SS') AS INSERT_TIME
FROM LZJHGX.dat_dcsj_time
WHERE TO_CHAR(GCRQ, 'YYYY-MM-DD') < TO_CHAR(SYSDATE, 'YYYY-MM-DD') AND \$CONDITIONS" \
--split-by sjxh \
--hcatalog-database dw \
--hcatalog-table ods_pre_dat_dcsj_time \
--hcatalog-storage-stanza 'stored as orc' \
--num-mappers 20
```
Running it again: it finished in about 4 minutes.
The import succeeded.
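For reference, the implied throughput, assuming ~18 million rows and the ~4 minutes reported above (both are approximate figures):

```shell
ROWS=18000000          # assumed total, per the post
ELAPSED_S=$((4 * 60))  # ~4 minutes
echo "$((ROWS / ELAPSED_S)) rows/s"   # 75000 rows/s across 20 mappers
```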