Big Data Components (3): Quick Start with the Real-Time Computing Platform Dinky

  • Dinky is an out-of-the-box, one-stop real-time computing platform (StreamPark is another option in this space). Built on Apache Flink, it connects to many data lake/warehouse frameworks and focuses on unified stream-batch and lakehouse practice.

  • Dinky is a lightweight Spring Boot application; it connects to and monitors Flink clusters non-intrusively, without modifying source code or adding extra plugins on any Flink cluster.

  • In this post we walk through a simple case to get a quick feel for Dinky:

1 Downloading and Installing Dinky

shell
# 1. Unpack the package and initialize the Dinky tables
[root@centos01 apps]# tar -zxvf dlink-release-0.7.3.tar.gz
[root@centos01 apps]# mv dlink-release-0.7.3 dlink-0.7.3

# Dinky uses MySQL as its backend store; MySQL 5.7 or later is required
# Create the database
mysql>CREATE DATABASE dinky;

# Create a user and allow remote login
mysql>create user 'dinky'@'%' IDENTIFIED WITH mysql_native_password by 'dinky';

# Grant privileges
mysql>grant ALL PRIVILEGES ON dinky.* to 'dinky'@'%';

mysql>flush privileges;

# Log in as the newly created dinky user and run the init SQL script
mysql -udinky -pdinky
mysql> use dinky;
mysql> source /opt/apps/dlink-0.7.3/sql/dinky.sql

# 2. Edit Dinky's MySQL connection settings
[root@centos01 dlink-0.7.3]# vim ./config/application.yml
spring:
  datasource:
    url: jdbc:mysql://${MYSQL_ADDR:centos01:3306}/${MYSQL_DATABASE:dinky}?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true&useSSL=false&zeroDateTimeBehavior=convertToNull&serverTimezone=Asia/Shanghai&allowPublicKeyRetrieval=true
    username: ${MYSQL_USERNAME:dinky}
    password: ${MYSQL_PASSWORD:dinky}
    driver-class-name: com.mysql.cj.jdbc.Driver
  application:
    name: dlink
  mvc:
    pathmatch:
      matching-strategy: ant_path_matcher
    format:
      date: yyyy-MM-dd HH:mm:ss
    # global JSON formatting settings
  jackson:
    time-zone: GMT+8
    date-format: yyyy-MM-dd HH:mm:ss

  main:
    allow-circular-references: true

# 3. Load dependencies
# Create a plugins/flink${FLINK_VERSION} folder under the Dinky root directory and upload the matching Flink dependencies into it
# Flink 1.14 is used here
cp /opt/apps/flink-1.14.4/lib/* /opt/apps/dlink-0.7.3/plugins/flink1.14


# In the current Dinky version, the YARN per-job and application execution modes depend on flink-shaded-hadoop, so the flink-shaded-hadoop-uber-3 JAR must be added as well.
[root@centos01 plugins]# ll
total 58212
drwxr-xr-x. 3 root root     4096 Jan  6 17:27 flink1.14
-rw-r--r--. 1 root root 59604787 Dec 23 18:24 flink-shaded-hadoop-3-uber-3.1.1.7.2.9.0-173-9.0.jar


# Note: geohash-1.4.0.jar and fastjson-1.2.75.jar are placed here as well,
# together with the Doris connector.
# They are needed later for the custom (UDF) functions.
[root@centos01 plugins]# ll flink1.14/
total 240560
-rw-r--r--. 1 root root     53820 Jan  3 23:34 commons-cli-1.4.jar
-rw-r--r--. 1 root root    655085 Jan  6 17:27 fastjson-1.2.75.jar
-rw-r--r--. 1 tom  1001     85584 Feb 25  2022 flink-csv-1.14.4.jar
-rw-r--r--. 1 tom  1001 136063964 Feb 25  2022 flink-dist_2.12-1.14.4.jar
-rw-r--r--. 1 root root   8077256 Dec 23 13:43 flink-doris-connector-1.14_2.12-1.1.1.jar
-rw-r--r--. 1 tom  1001    153145 Feb 25  2022 flink-json-1.14.4.jar
-rw-r--r--. 1 root root  59604787 Dec 23 18:25 flink-shaded-hadoop-3-uber-3.1.1.7.2.9.0-173-9.0.jar
-rw-r--r--. 1 tom  1001   7709731 Sep 10  2021 flink-shaded-zookeeper-3.4.14.jar
-rw-r--r--. 1 tom  1001  39635530 Feb 25  2022 flink-table_2.12-1.14.4.jar
-rw-r--r--. 1 root root     25422 Jan  6 15:48 geohash-1.4.0.jar
-rw-r--r--. 1 root root   1654887 Dec 24 21:16 hadoop-mapreduce-client-core-3.1.1.jar
-rw-r--r--. 1 tom  1001    208006 Jan 13  2022 log4j-1.2-api-2.17.1.jar
-rw-r--r--. 1 tom  1001    301872 Jan  7  2022 log4j-api-2.17.1.jar
-rw-r--r--. 1 tom  1001   1790452 Jan  7  2022 log4j-core-2.17.1.jar
-rw-r--r--. 1 tom  1001     24279 Jan  7  2022 log4j-slf4j-impl-2.17.1.jar



# Another caveat: every directory under plugins/ except flink1.14 has been deleted here, otherwise errors like the following show up at runtime:
Caused by: java.io.IOException: Cannot find any jar files for plugin in directory [plugins/flink1.11]. Please provide the jar files for the plugin or delete the directory.
	at org.apache.flink.core.plugin.DirectoryBasedPluginFinder.createJarURLsFromDirectory(DirectoryBasedPluginFinder.java:103)
	at org.apache.flink.core.plugin.DirectoryBasedPluginFinder.createPluginDescriptorForSubDirectory(DirectoryBasedPluginFinder.java:88)
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)


# 4. Upload JARs to HDFS
# When using Application mode, the Flink and Dinky JARs must be uploaded to HDFS
hdfs dfs -mkdir -p /dinky/jar/
hdfs dfs -put /opt/apps/dlink-0.7.3/jar/dlink-app-1.14-0.7.3-jar-with-dependencies.jar /dinky/jar

hdfs dfs -mkdir /flink-dist
hdfs dfs -put /opt/apps/flink-1.14.4/lib /flink-dist
hdfs dfs -put /opt/apps/flink-1.14.4/plugins /flink-dist


# 5. Start and stop
cd /opt/apps/dlink-0.7.3
[root@centos01 dlink-0.7.3]# ./auto.sh start 1.14
[root@centos01 dlink-0.7.3]# jps
1688 Dlink

# Web UI
Web UI: http://centos01:8888
Default username/password: admin/admin

# Stop command
[root@centos01 dlink-0.7.3]# ./auto.sh stop

2 Walkthrough: Getting Started with Dinky by Example

  • The case below shows how to develop a Flink SQL job on Dinky
  • We read log events from Kafka, lookup-join them against dimension tables in HBase, sink the widened records back to Kafka, and write a redundant copy to Doris
  • For further platform features, refer to the official docs: https://www.dinky.org.cn/docs/next/get_started/overview

2.1 Cluster Configuration

  • Cluster instance management is used to register Standalone, Yarn Session and Kubernetes Session clusters
  • Cluster configuration management is used for the Yarn Per-Job, Yarn Application and Kubernetes Application deployment types
shell
# For simplicity, a Standalone cluster is used here
[root@centos01 dlink-0.7.3]# /opt/apps/flink-1.14.4/bin/start-cluster.sh 
Starting cluster.
Starting standalonesession daemon on host centos01.
Starting taskexecutor daemon on host centos01.

# Web UI
http://centos01:8081
  • Register the Standalone cluster
  • Registration for Flink on YARN is shown here as well
  • In the Configuration Center, configure the path of the JAR used to submit FlinkSQL jobs

2.2 Registering the Data Source and Syncing MySQL to HBase

  • We need to sync the MySQL user data into HBase, and then join against it from Flink SQL

  • Here we simply use Flink CDC to sync the data

  • With Dinky, the Flink SQL DDL statements can be generated automatically

sql
-- Flink connection config: (shared parameters and sensitive values can be placed here)
     'hostname' = '192.168.42.104'
    ,'port' = '3306'
    ,'username' = 'root'
    ,'password' = '123456'

-- Flink connection template: mind the commas before and after the referenced variable; jobs using this must enable Global Variables in the right-hand panel
-- ${schemaName} resolves to the database name and ${tableName} to the table name dynamically
     'connector' = 'mysql-cdc'
    ,${Centos04-Mysql}
    ,'scan.incremental.snapshot.enabled' = 'true'
    ,'debezium.snapshot.mode'='latest-offset'
    ,'database-name' = '${schemaName}'
    ,'table-name' = '${tableName}'
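
-- After Dinky replaces the global variable, the effective WITH clause is equivalent to
-- (values taken verbatim from the connection config above):
--      'connector' = 'mysql-cdc'
--     ,'hostname' = '192.168.42.104'
--     ,'port' = '3306'
--     ,'username' = 'root'
--     ,'password' = '123456'
--     ,'scan.incremental.snapshot.enabled' = 'true'
--     ,'debezium.snapshot.mode'='latest-offset'
--     ,'database-name' = '${schemaName}'
--     ,'table-name' = '${tableName}'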
  • Once the data source is added, it can be browsed in the Metadata Center
  • Next, a Flink SQL job can be created in Data Development:

    • The ADD JAR statement adds a user JAR to the classpath. mysql-cdc needs a connector that is not present in flink/lib, so the JAR is added here;

    • For the execution mode we choose Standalone; the Flink cluster list then shows the cluster registered earlier;

    • The DefaultCatalog here is Dinky's own mysql-catalog implementation, not Flink's native in-memory catalog. As a result, the table metadata (under my_catalog.default_database by default) is still visible in the Structure panel after the job finishes;

    • Global Variables must be enabled on the right so that Dinky performs the variable substitution
    • Click Execution Config and the job is submitted to the Flink Standalone cluster; with the print stream enabled, a "select * from" returns the latest data, which is handy for debugging
    • The Operations Center shows the job details and lets you stop the job
sql
-- Add the mysql-cdc connector JAR
add jar '/opt/apps/bak_jars/flink-sql-connector-mysql-cdc-2.3.0.jar';

-- Read the CDC data from MySQL
drop table if exists ums_member_cdc;
CREATE TABLE IF NOT EXISTS ums_member_cdc (
    `id` BIGINT NOT NULL
    ,`username` STRING
    ,`phone` STRING
    ,`user_status` INT
    ,`create_time` TIMESTAMP
    ,`gender` INT
    ,`birthday` DATE
    ,`province` STRING
    ,`city` STRING
    ,`job` STRING
    ,`source_type` INT
    ,PRIMARY KEY ( `id` ) NOT ENFORCED
) WITH (
   'connector' = 'mysql-cdc'
    ,${Centos04-Mysql}
    ,'scan.incremental.snapshot.enabled' = 'true'
    ,'debezium.snapshot.mode'='latest-offset'
    ,'database-name' = 'mall'
    ,'table-name' = 'ums_member'
);

select * from ums_member_cdc;
  • Now we sync ums_member_cdc into HBase; the code is given below
sql
-- Add the mysql-cdc connector JAR
add jar '/opt/apps/bak_jars/flink-sql-connector-mysql-cdc-2.3.0.jar';

-- Read the CDC data from MySQL
drop table if exists ums_member_cdc;
CREATE TABLE IF NOT EXISTS ums_member_cdc (
    `id` BIGINT NOT NULL
    ,`username` STRING
    ,`phone` STRING
    ,`user_status` INT
    ,`create_time` TIMESTAMP
    ,`gender` INT
    ,`birthday` DATE
    ,`province` STRING
    ,`city` STRING
    ,`job` STRING
    ,`source_type` INT
    ,PRIMARY KEY ( `id` ) NOT ENFORCED
) WITH (
   'connector' = 'mysql-cdc'
    ,${Centos04-Mysql}
    ,'scan.incremental.snapshot.enabled' = 'true'
    ,'debezium.snapshot.mode'='latest-offset'
    ,'database-name' = 'mall'
    ,'table-name' = 'ums_member'
);


-- Add the HBase-related JARs
add jar '/opt/apps/bak_jars/flink-connector-hbase-base_2.12-1.14.4.jar'; 
add jar '/opt/apps/bak_jars/flink-connector-hbase-2.2_2.12-1.14.4.jar';
add jar '/opt/apps/hbase-2.2.5/lib/hbase-common-2.2.5.jar';
add jar '/opt/apps/hbase-2.2.5/lib/hbase-client-2.2.5.jar';
add jar '/opt/apps/hbase-2.2.5/lib/hbase-protocol-2.2.5.jar';
add jar '/opt/apps/hbase-2.2.5/lib/hbase-protocol-shaded-2.2.5.jar';
add jar '/opt/apps/hbase-2.2.5/lib/hbase-shaded-miscellaneous-2.2.1.jar';
add jar '/opt/apps/hbase-2.2.5/lib/hbase-shaded-netty-2.2.1.jar';
add jar '/opt/apps/hbase-2.2.5/lib/hbase-shaded-protobuf-2.2.1.jar';


-- hbase(main):029:0> create 'dim_user_info', 'f1'
-- Create the HBase mapping table
create table if not exists dim_user_hbase(
     id STRING,
     f1 ROW<
        id BIGINT,
        phone STRING,
        user_status INT,
        create_time TIMESTAMP(3),
        gender INT,
        birthday DATE,
        province STRING,
        city STRING,
        job STRING,
        source_type INT>
) WITH(
      'connector' = 'hbase-2.2',
      'table-name' = 'dim_user_info',
      'zookeeper.quorum' = 'centos01:2181'
);



-- Write the MySQL CDC records into HBase
insert into dim_user_hbase
select 
     cast(id as string) as id
   , row(id,phone,user_status,create_time,gender,birthday,province,city,job,source_type) as f1
from   
  ums_member_cdc;
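
Once the job is running, the target table can be spot-checked from the HBase shell; a minimal check (using the table and column family created above) might look like:

shell
hbase(main):001:0> scan 'dim_user_info', {LIMIT => 2, FORMATTER => 'toString'}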

Note (the related table definition):

sql
DROP TABLE IF EXISTS `ums_member`;
CREATE TABLE `ums_member` (
  `id` bigint(20) NOT NULL,
  `username` varchar(64) DEFAULT NULL,
  `phone` varchar(64) DEFAULT NULL,
  `user_status` int(11) DEFAULT NULL,
  `create_time` datetime DEFAULT NULL,
  `gender` int(11) DEFAULT NULL,
  `birthday` date DEFAULT NULL,
  `province` varchar(64) DEFAULT NULL,
  `city` varchar(64) DEFAULT NULL,
  `job` varchar(64) DEFAULT NULL,
  `source_type` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- ----------------------------
-- Records of ums_member
-- ----------------------------
INSERT INTO `ums_member` VALUES ('1', 'tom', '18616350000', '0', '2023-04-06 01:00:00', '0', '1995-02-18', '浙江省', '杭州市', '运营', '1');
INSERT INTO `ums_member` VALUES ('2', 'hank', '18616350001', '0', '2023-04-06 01:00:00', '1', '1995-03-18', '浙江省', '宁波市', '运营', '1');

2.3 Mocking Dimension Table Data

shell
[root@centos01 dlink-0.7.3]# hbase shell

hbase(main):001:0> create 'dim_page_info', 'f'

put 'dim_page_info', '/mall/', 'f:pt', '商品详情页'
put 'dim_page_info', '/mall/', 'f:sv', '商城服务'
put 'dim_page_info', '/mall/search/', 'f:pt', '搜索结果页'
put 'dim_page_info', '/mall/search/', 'f:sv', '搜索服务'
put 'dim_page_info', '/mall/promotion/', 'f:pt', '活动页'
put 'dim_page_info', '/mall/promotion/', 'f:sv', '商城服务'

hbase(main):001:0> list
TABLE
dim_geo_area
dim_page_info
dim_user_info
3 row(s)
                                                                                                                                                            

# dim_geo_area maps typical lat/lng points across the country (Baidu coordinate system) to geoHash codes
# Later, the user's lat/lng is converted to a geoHash code in the same way, to resolve the province/city/district the user is in
# A few rows can be mocked by hand; an example follows the get below
hbase(main):003:0> get 'dim_geo_area', 'w7w3j', {FORMATTER=>'toString'}
COLUMN                               CELL   
f:p                                     timestamp=1736131692022, value=海南省 
f:c                                     timestamp=1736131692022, value=海口市             
f:r                                     timestamp=1736131692022, value=秀英区                      
1 row(s)
Took 0.0084 seconds
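
# For reference, the row shown above could be mocked with puts like these
# (assuming the table is created with a single column family 'f', matching the qualifiers f:p / f:c / f:r above)
hbase(main):004:0> create 'dim_geo_area', 'f'
put 'dim_geo_area', 'w7w3j', 'f:p', '海南省'
put 'dim_geo_area', 'w7w3j', 'f:c', '海口市'
put 'dim_geo_area', 'w7w3j', 'f:r', '秀英区'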

2.4 Dimension Table Join

  • We use the geohash JAR (shown below) to define a UDF

    • Create a Java job and pick the java_udf template

    • Then replace the template with the following code

    java
    package com.yyds.udf;
    
    import org.apache.flink.table.functions.ScalarFunction;
    // Note: the JAR used here has been placed manually under the Flink installation's lib directory
    // and also copied into Dinky's /opt/apps/dlink-0.7.3/plugins/flink1.14 directory
    // (see "Downloading and Installing Dinky" above)
    import ch.hsr.geohash.GeoHash;
    
    public class GetGeoHash extends ScalarFunction {
        // Takes a GPS coordinate and returns its geohash string (the library expects latitude first)
        public String eval(Double lng, Double lat){
    
            String geohash = null;
            try{
                geohash = GeoHash.geoHashStringWithCharacterPrecision(lat,lng,5);
            } catch (Exception e){
    
            }
            return geohash;
        }
    }
    • The geohash algorithm comes from the following dependency
xml
<dependency>
    <groupId>ch.hsr</groupId>
    <artifactId>geohash</artifactId>
    <version>1.4.0</version>
</dependency>
  • The dimension-table join code is as follows:
sql
-- Set job parameters
set 'execution.checkpointing.interval'= '10000';
set 'state.checkpoints.dir' = 'hdfs://centos01:8020/flink_ck/dwd_sink2kafka';

-- Register the UDF
create function if not exists GetGeoHash as 'com.yyds.udf.GetGeoHash';


-- Add the Kafka-related JARs
add jar '/opt/apps/bak_jars/flink-connector-kafka_2.12-1.14.4.jar';
add jar '/opt/apps/bak_jars/kafka-clients-2.6.2.jar';
-- Add the HBase-related JARs
add jar '/opt/apps/bak_jars/flink-connector-hbase-2.2_2.12-1.14.4.jar';
add jar '/opt/apps/bak_jars/flink-connector-hbase-base_2.12-1.14.4.jar';
add jar '/opt/apps/hbase-2.2.5/lib/hbase-common-2.2.5.jar';
add jar '/opt/apps/hbase-2.2.5/lib/hbase-client-2.2.5.jar';
add jar '/opt/apps/hbase-2.2.5/lib/hbase-protocol-2.2.5.jar';
add jar '/opt/apps/hbase-2.2.5/lib/hbase-protocol-shaded-2.2.5.jar';
add jar '/opt/apps/hbase-2.2.5/lib/hbase-shaded-miscellaneous-2.2.1.jar';
add jar '/opt/apps/hbase-2.2.5/lib/hbase-shaded-netty-2.2.1.jar';
add jar '/opt/apps/hbase-2.2.5/lib/hbase-shaded-protobuf-2.2.1.jar';



-- 1. Kafka source (main) table
-- Create the topic:
-- /opt/apps/kafka_2.12-2.6.2/bin/kafka-topics.sh --create --replication-factor 1 --partitions 1 --bootstrap-server centos01:9092 --topic mall_logs
-- Start a console producer:
-- /opt/apps/kafka_2.12-2.6.2/bin/kafka-console-producer.sh --bootstrap-server centos01:9092 --topic mall_logs

drop table if exists mall_logs_kafka_source;
CREATE TABLE if not exists mall_logs_kafka_source (
  `carrier`           STRING,
  `deviceId`          STRING,
  `deviceType`        STRING,
  `eventId`           STRING,
  `id`                BIGINT,
  `isNew`             INT,
  `latitude`          DOUBLE,
  `longitude`         DOUBLE,
  `netType`           STRING,
  `osName`            STRING,
  `osVersion`         STRING,
  `properties`        map<string,string> ,
  `releaseChannel`    STRING,
  `resolution`        STRING,
  `sessionId`         STRING,
  `timestamp`         bigint,
   proc_time   AS PROCTIME()
) WITH (
  'connector' = 'kafka',
  'topic' = 'mall_logs',
  'properties.bootstrap.servers' = '192.168.42.101:9092',
  'properties.group.id' = 'flink_group', 
  'format' = 'json',
  'json.fail-on-missing-field' = 'false', 
  'scan.startup.mode' = 'earliest-offset',
  'value.fields-include' = 'EXCEPT_KEY'
);


-- 2. Dimension tables read from HBase: geo area and user info
-- Geo area dimension table
drop table if exists dim_geo_area_hbase;
create table if not exists dim_geo_area_hbase(                   
     geohash STRING,                        
     f ROW<   
           p STRING
         , c STRING
         , r STRING                              
        >                     
) WITH(                                    
      'connector' = 'hbase-2.2',             
      'table-name' = 'dim_geo_area',        
      'zookeeper.quorum' = 'centos01:2181'
);

-- User info dimension table
drop table if exists dim_user_info_hbase;
create table if not exists dim_user_info_hbase(
     id STRING,
     f1 ROW<
        id BIGINT,
        phone STRING,
        user_status INT,
        create_time TIMESTAMP(3),
        gender INT,
        birthday STRING,
        province STRING,
        city STRING,
        job STRING,
        source_type INT>
) WITH(
      'connector' = 'hbase-2.2',
      'table-name' = 'dim_user_info',
      'zookeeper.quorum' = 'centos01:2181'
);

-- Page info dimension table
drop table if exists dim_page_info_hbase;
create table if not exists dim_page_info_hbase(
    url_prefix STRING,
    f ROW<
       sv STRING,
       pt STRING>
) WITH(
     'connector' = 'hbase-2.2',
     'table-name' = 'dim_page_info',
     'zookeeper.quorum' = 'centos01:2181'
);


-- 3. Lookup join against the dimension tables
CREATE TEMPORARY VIEW tmp_wide_view AS 
SELECT 
       e.id             as user_id,
       e.isNew          as is_new,
       e.sessionId      as session_id,
       e.eventId        as event_id,
       e.`timestamp`    as action_time,
       e.longitude      as longitude,
       e.latitude       as latitude,
       GetGeoHash(e.longitude, e.latitude) as geohash,
       e.releaseChannel as release_channel,
       e.deviceType     as device_type,
       e.properties,
       e.netType        as net_type,
       e.osName         as os_name,
       -- user info
       u.f1.phone       as register_phone,
       u.f1.user_status as user_status,
       u.f1.create_time as register_time,
       u.f1.gender      as register_gender,
       u.f1.birthday    as register_birthday,
       u.f1.province    as register_province,
       u.f1.city        as register_city,
       u.f1.job         as register_job,
       u.f1.source_type as register_source_type,
       -- geo info
       g.f.p            as gps_province,
       g.f.c            as gps_city,
       g.f.r            as gps_region,
       -- page info
       p.f.pt           as page_type,
       p.f.sv           as page_service
FROM 
	mall_logs_kafka_source AS e
LEFT JOIN 
	dim_user_info_hbase     FOR SYSTEM_TIME AS OF e.proc_time AS u
ON cast(e.id as string) = cast(u.id as string)
LEFT JOIN 
	dim_geo_area_hbase      FOR SYSTEM_TIME AS OF e.proc_time AS g 
ON GetGeoHash(e.longitude, e.latitude) = g.geohash
LEFT JOIN 
	dim_page_info_hbase     FOR SYSTEM_TIME AS OF e.proc_time AS p 
ON regexp_extract(e.properties['url'], '^(.*/).*?') = p.url_prefix;




-- 4. Sink the widened table to Kafka
-- Create the topic: /opt/apps/kafka_2.12-2.6.2/bin/kafka-topics.sh --create --replication-factor 1 --partitions 1 --bootstrap-server centos01:9092 --topic dwd_user_details
drop table if exists dwd_user_details_kafka;
CREATE TABLE if not exists dwd_user_details_kafka (
     user_id           BIGINT,
     is_new            INT,
     session_id        STRING,
     event_id          STRING,
     action_time       BIGINT,
     longitude         DOUBLE,
     latitude          DOUBLE,
     geohash           STRING,
     release_channel   STRING,
     device_type       STRING,
     properties        MAP<STRING, STRING>,
     net_type          STRING,
     os_name           STRING,
     register_phone    STRING,
     user_status       INT,
     register_time     TIMESTAMP(3),
     register_gender   INT,
     register_birthday STRING,
     register_province STRING,
     register_city     STRING,
     register_job      STRING,
     register_source_type INT,
     gps_province      STRING,
     gps_city          STRING,
     gps_region        STRING,
     page_type         STRING,
     page_service      STRING
) WITH (
     'connector' = 'kafka',
     'topic' = 'dwd_user_details',
     'properties.bootstrap.servers' = 'centos01:9092',
     'properties.group.id' = 'testGroup2',
     'scan.startup.mode' = 'earliest-offset',
     'value.format' = 'json',
     'value.json.fail-on-missing-field' = 'false',   -- option under the correct ('value.json.*') namespace
     'value.json.ignore-parse-errors' = 'true'       -- option under the correct ('value.json.*') namespace
);


-- Insert into Kafka
insert into dwd_user_details_kafka select * from tmp_wide_view;
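
-- To check the sink, the topic can be consumed from the command line, e.g.:
-- /opt/apps/kafka_2.12-2.6.2/bin/kafka-console-consumer.sh --bootstrap-server centos01:9092 --topic dwd_user_details --from-beginning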

Note (a few sample Kafka records):

json
{"carrier":"中国电信","deviceId":"fae6233e-f973-42bb-abc6-8d39cbd56074","deviceType":"MI-NOTE","eventId":"app_launch","id":1,"isNew":0,"lastUpdate":2023,"latitude":29.30644391183944,"longitude":120.06729564866686,"netType":"WIFI","osName":"android","osVersion":"8.5","releaseChannel":"小米游戏中心","resolution":"1024*768","sessionId":"HvBGPAxJ3Bmi","timestamp":1618020104340}                                    
{"carrier":"中国电信","deviceId":"fae6233e-f973-42bb-abc6-8d39cbd56074","deviceType":"MI-NOTE","eventId":"mall_click","id":1,"isNew":0,"lastUpdate":2023,"latitude":29.30644391183944,"longitude":120.06729564866686,"netType":"WIFI","osName":"android","osVersion":"8.5","properties":{"url":"/mall/item-2.html","item_id":2},"releaseChannel":"小米游戏中心","resolution":"1024*768","sessionId":"HvBGPAxJ3Bmi","timestamp":1618020106190}
{"carrier":"中国电信","deviceId":"fae6233e-f973-42bb-abc6-8d39cbd56074","deviceType":"MI-NOTE","eventId":"mall_click","id":1,"isNew":0,"lastUpdate":2023,"latitude":29.30644391183944,"longitude":120.06729564866686,"netType":"WIFI","osName":"android","osVersion":"8.5","properties":{"url":"/mall/item-666.html","item_id":666},"releaseChannel":"小米游戏中心","resolution":"1024*768","sessionId":"HvBGPAxJ3Bmi","timestamp":1618020106190}
{"carrier":"中国移动","deviceId":"863b13cc-23d5-4b22-8c41-8af6f2bc2436","deviceType":"LEPHONE-6","eventId":"app_launch","id":2,"isNew":0,"lastUpdate":2021,"latitude":19.88434436079741,"longitude":110.26320040618856,"netType":"5G","osName":"android","osVersion":"8.5","releaseChannel":"优亿市场","resolution":"2048*1024","sessionId":"VrRAxksJfobh","timestamp":1618020104396}

2.5 kafka2doris

  • A redundant copy of the data is written to Doris; note that a UDF is registered here as well
  • The source table (dwd_user_details_kafka) is already registered in the mysql-catalog, so it can be referenced directly
sql
-- Set job parameters
set 'execution.checkpointing.interval'= '10000';
set 'state.checkpoints.dir' = 'hdfs://centos01:8020/flink_ck/dwd_kafka2doris';

-- Register the UDF
create function if not exists Map2JsonStr as 'com.yyds.udf.Map2JsonStr';


-- Add the Kafka-related JARs
add jar '/opt/apps/bak_jars/flink-connector-kafka_2.12-1.14.4.jar';
add jar '/opt/apps/bak_jars/kafka-clients-2.6.2.jar';
-- Add the Doris connector JAR
add jar '/opt/apps/flink-1.14.4/lib/flink-doris-connector-1.14_2.12-1.1.1.jar';


drop table if exists dwd_user_details_doris;
CREATE TABLE if not exists dwd_user_details_doris (
     gps_province         VARCHAR(16),
     gps_city             VARCHAR(16),
     gps_region           VARCHAR(16),
     dt                   DATE,
     user_id              BIGINT,
     is_new               INT,
     session_id           VARCHAR(20),
     event_id             VARCHAR(10),
     action_time          BIGINT,
     longitude            DOUBLE,
     latitude             DOUBLE,
     release_channel      VARCHAR(20),
     device_type          VARCHAR(20),
     properties           VARCHAR(80),  -- Note: Doris does not support the MAP type
     net_type             VARCHAR(10),
     os_name              VARCHAR(10),
     register_phone       VARCHAR(20),
     user_status          INT,
     register_time        TIMESTAMP(3),
     register_gender      INT,
     register_birthday    VARCHAR(20),
     register_province    VARCHAR(20),
     register_city        VARCHAR(20),
     register_job         VARCHAR(20),
     register_source_type INT,
     page_type            VARCHAR(20),
     page_service         VARCHAR(20)
) WITH (
     'connector' = 'doris',
     'fenodes' = 'centos01:8030',
     'table.identifier' = 'doris_database.dwd_user_details',
     'username' = 'root',
     'password' = '123456',
     'sink.label-prefix' = 'doris_sink_label-963'
);


-- Insert from Kafka into Doris (the redundant copy)
INSERT INTO dwd_user_details_doris
SELECT 
    gps_province,
    gps_city,
    gps_region,
    TO_DATE(DATE_FORMAT(TO_TIMESTAMP_LTZ(action_time, 3), 'yyyy-MM-dd')) AS dt,
    user_id,
    is_new,
    session_id,
    event_id,
    action_time,
    longitude,
    latitude,
    release_channel,
    device_type,
    Map2JsonStr(properties) AS properties, -- Note: Map2JsonStr must already be registered as a UDF
    net_type,
    os_name,
    register_phone,
    user_status,
    register_time,
    register_gender,
    cast(register_birthday as string) as register_birthday,
    register_province,
    register_city,
    register_job,
    register_source_type,
    page_type,
    page_service
FROM dwd_user_details_kafka;

Notes:

  • Doris result preview
  • The UDF definition:
java
package com.yyds.udf;

import org.apache.flink.table.functions.ScalarFunction;
// This also uses an external JAR (fastjson)
import com.alibaba.fastjson.JSON;

import java.util.Map;

public class Map2JsonStr extends ScalarFunction {
    public String eval(Map<String,String> properties) {
       return  JSON.toJSONString(properties);
    }
}
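  • Map2JsonStr relies on fastjson; the corresponding Maven dependency (matching the fastjson-1.2.75.jar that was placed under plugins/flink1.14 during installation) should be:

xml
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.75</version>
</dependency>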
  • DDL for the Doris result table
sql
CREATE TABLE IF NOT EXISTS dwd_user_details (
    gps_province         VARCHAR(16),
    gps_city             VARCHAR(16),
    gps_region           VARCHAR(16),
    dt                   DATE,
    user_id              BIGINT,
    is_new               INT,
    session_id           VARCHAR(20),
    event_id             VARCHAR(10),
    action_time          BIGINT,
    longitude            DOUBLE,
    latitude             DOUBLE,
    release_channel      VARCHAR(20),
    device_type          VARCHAR(20),
    properties           VARCHAR(80),  -- Note: Doris does not support the MAP type
    net_type             VARCHAR(10),
    os_name              VARCHAR(10),
    register_phone       VARCHAR(20),
    user_status          INT,
    register_time        DATETIME,     -- Doris has no TIMESTAMP(3); DATETIME is used instead
    register_gender      INT,
    register_birthday    VARCHAR(20),
    register_province    VARCHAR(20),
    register_city        VARCHAR(20),
    register_job         VARCHAR(20),
    register_source_type INT,
    page_type            VARCHAR(20),
    page_service         VARCHAR(20)
) ENGINE=OLAP
DUPLICATE KEY(gps_province, gps_city, gps_region, dt, user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10
PROPERTIES (
    "replication_num" = "1",
    "in_memory" = "false",
    "storage_format" = "DEFAULT"
);
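  • To spot-check the data that landed in Doris, a simple aggregation could be run there (illustrative only, using the column names from the DDL above):

sql
SELECT gps_province, gps_city, COUNT(*) AS events
FROM dwd_user_details
GROUP BY gps_province, gps_city;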
  • Other platform features, such as alert group configuration, whole-database sync with CDCSOURCE, running Flink JAR jobs, and Flink on YARN, are covered in the official docs and not repeated here.