DolphinScheduler + SeaTunnel + Spark + Hadoop

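This setup is for evaluation only, so nothing is deployed in distributed mode. A later production deployment will need to be distributed.

Install JDK 1.8 first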

tar -zxvf jdk-8u451-linux-x64.tar.gz

Configure /etc/profile

export JAVA_HOME=/data/jdk1.8.0_451

export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar

export PATH=$PATH:${JAVA_HOME}/bin

Then run source /etc/profile
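After sourcing the profile it is worth confirming that JAVA_HOME really points at a JDK. The following is a minimal sketch; check_java_home is a made-up helper name, not part of any tool mentioned here.

```shell
# check_java_home DIR
# Succeeds only if DIR is non-empty and DIR/bin/java is an executable file.
# (check_java_home is a hypothetical helper name.)
check_java_home() {
  if [ -z "$1" ]; then
    echo "JAVA_HOME is not set"
    return 1
  fi
  if [ ! -f "$1/bin/java" ] || [ ! -x "$1/bin/java" ]; then
    echo "no executable java found under $1/bin"
    return 1
  fi
  echo "JAVA_HOME looks usable: $1"
}

# Check the value exported in /etc/profile; '|| true' keeps a script going on failure.
check_java_home "${JAVA_HOME:-}" || true
```

If the check fails right after editing /etc/profile, the usual cause is a missing $ in front of a variable reference, which leaves a literal {JAVA_HOME} in the path.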

DolphinScheduler version 3.2.2 (standalone)

# Extract and start the Standalone Server
tar -xvzf apache-dolphinscheduler-3.2.2-bin.tar.gz
chmod -R 755 apache-dolphinscheduler-3.2.2-bin
cd apache-dolphinscheduler-3.2.2-bin
bash ./bin/dolphinscheduler-daemon.sh start standalone-server

Open http://192.168.56.120:12345/dolphinscheduler/ui in a browser to log in to the system UI.

The default username and password are admin/dolphinscheduler123

Starting and stopping the service

# Start the Standalone Server
bash ./bin/dolphinscheduler-daemon.sh start standalone-server
# Stop the Standalone Server
bash ./bin/dolphinscheduler-daemon.sh stop standalone-server
# Check the Standalone Server status
bash ./bin/dolphinscheduler-daemon.sh status standalone-server
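The standalone server takes a little while to answer after start, so a script that goes on to use the UI or API may want to wait for it. Below is a sketch of a generic retry loop, assuming curl is installed; wait_until is a hypothetical helper, and the URL is the one given above.

```shell
# wait_until TRIES DELAY CMD [ARGS...]
# Runs CMD up to TRIES times, sleeping DELAY seconds after each failure.
# Returns 0 as soon as CMD succeeds, 1 if it never does.
wait_until() {
  tries=$1
  delay=$2
  shift 2
  attempt=0
  while [ "$attempt" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Example (not run here): wait up to about a minute for the UI to answer.
# wait_until 30 2 curl -fsS http://192.168.56.120:12345/dolphinscheduler/ui
```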

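Switching the Standalone metadata database

MySQL is used here as the example of configuring an external database:

To use MySQL, download the mysql-connector-java driver (8.0.16) manually and move it into the libs directory of every DolphinScheduler module: api-server/libs, alert-server/libs, master-server/libs, worker-server/libs, and $DS_HOME/libs.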
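Copying the driver into several directories by hand is easy to get wrong, so a small loop can help. This is only a sketch: copy_driver is a hypothetical helper, the module names are the ones listed above, and the jar file name assumes the standard 8.0.16 artifact downloaded into the current directory.

```shell
# copy_driver JAR DS_HOME
# Copies JAR into the libs directory of each module that exists under DS_HOME.
# (copy_driver is a hypothetical helper; the module list follows the text above.)
copy_driver() {
  jar=$1
  ds_home=$2
  for module in api-server alert-server master-server worker-server; do
    if [ -d "$ds_home/$module/libs" ]; then
      cp "$jar" "$ds_home/$module/libs/" && echo "copied to $module/libs"
    else
      echo "skipped $module/libs (directory not found)"
    fi
  done
}

# Example (not run here), assuming the jar sits in the current directory:
# copy_driver ./mysql-connector-java-8.0.16.jar "$DS_HOME"
```

Directories that do not exist are skipped rather than created, so the output shows which modules the installation actually contains.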

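  • First, create and initialize the database by following the "initialize the database" step of the pseudo-cluster/cluster installation guide (that is, run the MySQL script under $DS_HOME/standalone/conf/sql/ to create the schema).

  • Then set the following environment variables in your shell, replacing {address}, {user}, and {password} with your database address, username, and password. They can go in /etc/profile or in $DS_HOME/bin/env/dolphinscheduler_env.sh.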

    export DATABASE=mysql
    export SPRING_PROFILES_ACTIVE=${DATABASE}
    export SPRING_DATASOURCE_URL="jdbc:mysql://192.168.43.210:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8&useSSL=false&serverTimezone=Asia/Shanghai"
    export SPRING_DATASOURCE_USERNAME=root
    export SPRING_DATASOURCE_PASSWORD=root
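The JDBC URL packs host, port, database name, and connection options into one string. The sketch below rebuilds the same URL from its parts, which makes it easier to see what to change for a different database. DB_HOST, DB_PORT, DB_NAME, and DB_OPTS are illustrative variable names only; DolphinScheduler does not read them.

```shell
# Illustrative helper variables (not read by DolphinScheduler itself).
DB_HOST=192.168.43.210
DB_PORT=3306
DB_NAME=dolphinscheduler
DB_OPTS="useUnicode=true&characterEncoding=UTF-8&useSSL=false&serverTimezone=Asia/Shanghai"

# Assemble the URL; the quotes keep the shell from interpreting '&' and '?'.
export SPRING_DATASOURCE_URL="jdbc:mysql://${DB_HOST}:${DB_PORT}/${DB_NAME}?${DB_OPTS}"
echo "$SPRING_DATASOURCE_URL"
```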

For MySQL 8:

mysql -uroot -p

mysql> CREATE DATABASE dolphinscheduler DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

# Replace {user} and {password} with the username and password you want
mysql> CREATE USER 'root'@'%' IDENTIFIED BY '{password}';
mysql> GRANT ALL PRIVILEGES ON dolphinscheduler.* TO 'root'@'%';
mysql> CREATE USER 'root'@'localhost' IDENTIFIED BY '{password}';
mysql> GRANT ALL PRIVILEGES ON dolphinscheduler.* TO 'root'@'localhost';
mysql> FLUSH PRIVILEGES;

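I simply used root here.

Making SeaTunnel visible to DolphinScheduler through environment variables

Normally SEATUNNEL_HOME is set in $DS_HOME/bin/env/dolphinscheduler_env.sh, but with that setup my SeaTunnel tasks never picked it up.

In the end I set SEATUNNEL_HOME in /etc/profile and ran source /etc/profile, and only then was it recognized.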
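When a variable is added to a shared file such as /etc/profile, it helps to make the edit repeatable so that running the setup twice does not add a second export line. The following is a sketch under assumptions: add_export is a hypothetical helper, and the SeaTunnel install path in the example is a guess that has to be adjusted to the real location.

```shell
# add_export FILE NAME VALUE
# Appends 'export NAME=VALUE' to FILE unless FILE already exports NAME.
# (add_export is a hypothetical helper name.)
add_export() {
  file=$1
  name=$2
  value=$3
  if grep -q "^export ${name}=" "$file" 2>/dev/null; then
    echo "$name is already exported in $file"
  else
    printf 'export %s=%s\n' "$name" "$value" >> "$file"
  fi
}

# Example (not run here; needs root, and the install path is an assumption):
# add_export /etc/profile SEATUNNEL_HOME /data/apache-seatunnel-2.3.13
# . /etc/profile
```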

Making Spark and Hadoop visible to DolphinScheduler through environment variables

Edit the file: vi $DS_HOME/bin/env/dolphinscheduler_env.sh

export HADOOP_HOME=${HADOOP_HOME:-/data/hadoop}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/data/hadoop/etc/hadoop}
export SPARK_HOME=${SPARK_HOME:-/data/spark}

export PATH=$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH
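The ${VAR:-default} form used in these exports keeps a value that is already set in the environment and only falls back to the path after :- when the variable is unset or empty. A short demonstration; /opt/spark-custom is a made-up path used purely for illustration.

```shell
# With the variable unset, the fallback after ':-' is used.
unset HADOOP_HOME
export HADOOP_HOME=${HADOOP_HOME:-/data/hadoop}
echo "$HADOOP_HOME"   # prints /data/hadoop

# With the variable already set, the existing value wins.
# (/opt/spark-custom is a made-up path for the demonstration.)
SPARK_HOME=/opt/spark-custom
export SPARK_HOME=${SPARK_HOME:-/data/spark}
echo "$SPARK_HOME"    # prints /opt/spark-custom
```

This means a HADOOP_HOME or SPARK_HOME exported earlier, for example in /etc/profile, takes precedence over the defaults in dolphinscheduler_env.sh.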

SeaTunnel version 2.3.13

Download the package and extract it

tar -xzvf apache-seatunnel-2.3.13-bin.tar.gz

Download the connector plugins

sh bin/install-plugin.sh 2.3.13

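You usually do not need every connector plugin. You can specify the ones you need in config/plugin_config. For example, for the sample application to work you need the connector-console and connector-fake plugins. Modify the plugin_config file as follows: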

--seatunnel-connectors--
connector-fake
connector-console
--end--

You can find all supported connectors and their corresponding plugin_config names in ${SEATUNNEL_HOME}/connectors/plugins-mapping.properties.
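The plugin_config file is just a list of connector names between two marker lines, so it can be generated rather than edited by hand. A minimal sketch follows; write_plugin_config is a hypothetical helper, and the marker lines are copied from the example above.

```shell
# write_plugin_config FILE CONNECTOR [CONNECTOR...]
# Writes a plugin_config that lists only the given connectors,
# wrapped in the same marker lines as the example above.
# (write_plugin_config is a hypothetical helper name.)
write_plugin_config() {
  out=$1
  shift
  {
    echo "--seatunnel-connectors--"
    for connector in "$@"; do
      echo "$connector"
    done
    echo "--end--"
  } > "$out"
}

# Example (not run here):
# write_plugin_config "${SEATUNNEL_HOME}/config/plugin_config" connector-fake connector-console
# Look up a connector name in the mapping file:
# grep -i 'fake' "${SEATUNNEL_HOME}/connectors/plugins-mapping.properties"
```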

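SeaTunnel Engine quick start

Single-node quick start (Local mode)

Add a job configuration file to define the job

Edit config/v2.batch.config.template. It determines how data is read, processed, and written once SeaTunnel starts. The example configuration below is the same as the sample application mentioned above.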

env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  FakeSource {
    plugin_output = "fake"
    row.num = 16
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
  }
}

transform {
  FieldMapper {
    plugin_input = "fake"
    plugin_output = "fake1"
    field_mapper = {
      age = age
      name = new_name
    }
  }
}

sink {
  Console {
    plugin_input = "fake1"
  }
}

Run the SeaTunnel application

cd apache-seatunnel-2.3.13

./bin/seatunnel.sh --config ./config/v2.batch.config.template -m local

The SeaTunnel console prints log output like the following:

2022-12-19 11:01:45,417 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - output rowType: name<STRING>, age<INT>

2022-12-19 11:01:46,489 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=1: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: CpiOd, 8520946

2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=2: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: eQqTs, 1256802974

2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=3: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: UsRgO, 2053193072

2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=4: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: jDQJj, 1993016602

2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=5: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: rqdKp, 1392682764

2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=6: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: wCoWN, 986999925

2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=7: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: qomTU, 72775247

2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=8: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: jcqXR, 1074529204

2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=9: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: AkWIO, 1961723427

2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=10: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: hBoib, 929089763

2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=11: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: GSvzm, 827085798

2022-12-19 11:01:46,491 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=12: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: NNAYI, 94307133

2022-12-19 11:01:46,491 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=13: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: EexFl, 1823689599

2022-12-19 11:01:46,491 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=14: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: CBXUb, 869582787

2022-12-19 11:01:46,491 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=15: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: Wbxtm, 1469371353

2022-12-19 11:01:46,491 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=16: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: mIJDt, 995616438
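The job above sets row.num = 16, so a quick sanity check is to count the row lines that the Console sink printed. This sketch assumes the output was saved to a file; count_rows and the file name job.log are hypothetical, and the pattern is taken from the log lines shown above.

```shell
# count_rows LOGFILE
# Prints how many ConsoleSinkWriter row lines LOGFILE contains.
# (count_rows is a hypothetical helper name.)
count_rows() {
  grep -c 'ConsoleSinkWriter - subtaskIndex=.* rowIndex=' "$1"
}

# Example (not run here): capture the job output, then compare with row.num.
# ./bin/seatunnel.sh --config ./config/v2.batch.config.template -m local 2>&1 | tee job.log
# [ "$(count_rows job.log)" -eq 16 ] && echo "all 16 rows reached the sink"
```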

Hadoop 3.1.3

sudo useradd -m hadoop -s /bin/bash

sudo passwd hadoop

The password I entered was hadoophadoop

sudo adduser hadoop sudo

sudo tar -zxf ~/下载/hadoop-3.1.3.tar.gz -C /data # extract into /data

cd /data

sudo mv ./hadoop-3.1.3/ ./hadoop # rename the directory to hadoop

sudo chown -R hadoop ./hadoop # change ownership of the files

Hadoop is ready to use once extracted. Run the following commands to check that it works; on success the Hadoop version information is printed:

cd /data/hadoop

./bin/hadoop version

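Hadoop standalone configuration (non-distributed)

Hadoop runs in non-distributed (local) mode by default and needs no further configuration. Non-distributed means a single Java process, which is convenient for debugging.

Now we can run an example to get a feel for Hadoop. Hadoop ships with many examples (running ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar lists them all), including wordcount, terasort, join, and grep.

Here we run the grep example: it takes every file in the input directory as input, finds the words matching the regular expression dfs[a-z.]+, counts their occurrences, and writes the result to the output directory.

cd /data/hadoop
mkdir ./input
cp ./etc/hadoop/*.xml ./input # use the configuration files as input
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar grep ./input ./output 'dfs[a-z.]+'
cat ./output/* # view the result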

On success the job prints its run information, and the result shows that the matching word dfsadmin appeared once.
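To see what the job computes without involving Hadoop, roughly the same count can be produced with ordinary shell tools. This is only a local analogue of the MapReduce grep example, not how Hadoop implements it; count_matches is a hypothetical helper name.

```shell
# count_matches FILE [FILE...]
# Extracts every match of dfs[a-z.]+ from the files and prints
# "<count><TAB><match>" for each distinct match.
# (count_matches is a hypothetical helper name.)
count_matches() {
  grep -ohE 'dfs[a-z.]+' "$@" | sort | uniq -c | awk '{print $1 "\t" $2}'
}

# Example (not run here), from the Hadoop directory:
# count_matches ./input/*.xml
```

Here -o prints only the matched text, -h drops the file names, and sort plus uniq -c does the counting that the MapReduce job does in its reduce step.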

Note: Hadoop does not overwrite existing results by default, so running the example again fails with an error. Delete ./output first.

rm -r ./output
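When the example is run repeatedly, the cleanup can be folded into a small wrapper so the stale output directory is never forgotten. A sketch only; run_fresh is a hypothetical helper name.

```shell
# run_fresh OUTPUT_DIR CMD [ARGS...]
# Deletes OUTPUT_DIR if it exists, then runs CMD with its arguments.
# (run_fresh is a hypothetical helper name.)
run_fresh() {
  out_dir=$1
  shift
  if [ -d "$out_dir" ]; then
    rm -r "$out_dir"
  fi
  "$@"
}

# Example (not run here), from the Hadoop directory:
# run_fresh ./output ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar grep ./input ./output 'dfs[a-z.]+'
```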

Spark 2.4.0

I want to evaluate the data quality feature in DolphinScheduler 3.2.2, which is why Spark needs to be deployed.

sudo tar -zxf ~/下载/spark-2.4.0-bin-without-hadoop.tgz -C /data

cd /data

sudo mv ./spark-2.4.0-bin-without-hadoop/ ./spark

sudo chown -R hadoop:hadoop ./spark # hadoop here is your username

cd spark

cp ./conf/spark-env.sh.template ./conf/spark-env.sh

Add the following to ./conf/spark-env.sh:

export JAVA_HOME=/data/jdk1.8.0_451
export HADOOP_CONF_DIR=/data/hadoop/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/data/hadoop/bin/hadoop classpath)

Note that I also added this to /etc/profile.

Verify that Spark is installed correctly

bin/run-example SparkPi

On success the log contains

Pi is roughly 3.140675703378517
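SparkPi writes a lot of INFO lines, so the result is easy to miss. A small filter can isolate it. pi_line is a hypothetical helper name; the 2>&1 in the usage example is there so that log lines sent to stderr go through the same filter.

```shell
# pi_line: reads Spark output on stdin and prints only the result line.
# (pi_line is a hypothetical helper name.)
pi_line() {
  grep 'Pi is roughly'
}

# Example (not run here), from the Spark directory:
# bin/run-example SparkPi 2>&1 | pi_line
```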

If Spark needs to work with MySQL, remember to put the mysql-connector-java driver (8.0.16) into $SPARK_HOME/jars.
