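This setup is for evaluation only, so nothing is deployed in distributed mode. Production will need a distributed deployment later.
Install JDK 1.8 first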
tar -zxvf jdk-8u451-linux-x64.tar.gz
Configure /etc/profile
export JAVA_HOME=/data/jdk1.8.0_451
export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
export PATH=$PATH:${JAVA_HOME}/bin
Then run source /etc/profile
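A quick way to confirm that the profile took effect is to check that ${JAVA_HOME}/bin really is an entry on PATH. The sketch below does that in plain shell. The helper name path_contains is made up for this note, and the fallback directory is simply the JDK path used above.

```shell
#!/usr/bin/env bash
# path_contains is a hypothetical helper name, not part of any tool.
# Returns 0 when DIR is one of the colon-separated entries of PATH_VALUE.
path_contains() {
  local dir="$1"
  local path_value="$2"
  case ":${path_value}:" in
    *":${dir}:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Fall back to the JDK directory from this note if JAVA_HOME is unset.
java_home="${JAVA_HOME:-/data/jdk1.8.0_451}"
if path_contains "${java_home}/bin" "${PATH}"; then
  echo "OK: ${java_home}/bin is on PATH"
else
  echo "MISSING: ${java_home}/bin is not on PATH"
fi
```

Forgetting the `$` in front of `PATH` or `{JAVA_HOME}` is an easy typo in the export lines, and it makes this check fail.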
DolphinScheduler (DS) version 3.2.2 (standalone)
Extract and run the Standalone Server
tar -xvzf apache-dolphinscheduler-3.2.2-bin.tar.gz
chmod -R 755 apache-dolphinscheduler-3.2.2-bin
cd apache-dolphinscheduler-3.2.2-bin
bash ./bin/dolphinscheduler-daemon.sh start standalone-server
Open http://192.168.56.120:12345/dolphinscheduler/ui in a browser
to log in to the system UI.
The default username and password are admin/dolphinscheduler123
Starting and stopping the service
# Start the Standalone Server service
bash ./bin/dolphinscheduler-daemon.sh start standalone-server
# Stop the Standalone Server service
bash ./bin/dolphinscheduler-daemon.sh stop standalone-server
# Check the Standalone Server status
bash ./bin/dolphinscheduler-daemon.sh status standalone-server
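The server can take a while to come up after the start command, so it may help to poll the UI address instead of refreshing the browser. This is a minimal sketch, assuming curl is installed; both function names are hypothetical helpers, and the host and port are the ones used in this note.

```shell
#!/usr/bin/env bash
# ds_ui_url and wait_for_ui are hypothetical helper names for this note.

# Build the UI URL from a host and an optional port (12345 is the port used above).
ds_ui_url() {
  printf 'http://%s:%s/dolphinscheduler/ui\n' "$1" "${2:-12345}"
}

# Poll the URL until curl gets a successful response or the attempts run out.
wait_for_ui() {
  local url="$1"
  local attempts="${2:-30}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS -o /dev/null --max-time 2 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  return 1
}

# Example (commented out so the snippet has no side effects when sourced):
# wait_for_ui "$(ds_ui_url 192.168.56.120)" && echo "UI is up"
```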
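Switching the Standalone metadata database
We use MySQL as the example for configuring an external database:
To use MySQL, manually download the mysql-connector-java driver (8.0.16) and move it into the libs directory of every DolphinScheduler module, namely
api-server/libs, alert-server/libs, master-server/libs, and worker-server/libs, as well as $DS_HOME/libs
- First, create and initialize the database as described under "Initialize the Database" in the pseudo-cluster/cluster installation guide (that is, run the MySQL scripts in $DS_HOME/standalone/conf/sql/ to create the schema)
- Set the following environment variables in your shell, replacing {address}, {user}, and {password} with your database address, username, and password. They can go in /etc/profile or in $DS_HOME/bin/env/dolphinscheduler_env.sh
export DATABASE=mysql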
export SPRING_PROFILES_ACTIVE=${DATABASE}
export SPRING_DATASOURCE_URL="jdbc:mysql://192.168.43.210:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8&useSSL=false&serverTimezone=Asia/Shanghai"
export SPRING_DATASOURCE_USERNAME=root
export SPRING_DATASOURCE_PASSWORD=root
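The JDBC URL is long and contains several `&` characters, which is why it has to be quoted in the export line. One way to avoid typos is to assemble it from its parts. This is only a sketch: build_jdbc_url is a hypothetical helper, and the host, port, and database name are the values used above.

```shell
#!/usr/bin/env bash
# build_jdbc_url is a hypothetical helper; arguments: host, port, database.
build_jdbc_url() {
  local host="$1" port="$2" db="$3"
  # Same connection parameters as the export line in this note.
  local params="useUnicode=true&characterEncoding=UTF-8&useSSL=false&serverTimezone=Asia/Shanghai"
  printf 'jdbc:mysql://%s:%s/%s?%s\n' "$host" "$port" "$db" "$params"
}

SPRING_DATASOURCE_URL="$(build_jdbc_url 192.168.43.210 3306 dolphinscheduler)"
export SPRING_DATASOURCE_URL
echo "$SPRING_DATASOURCE_URL"
```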
For MySQL 8:
mysql -uroot -p
mysql> CREATE DATABASE dolphinscheduler DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
# Replace {user} and {password} with the username and password you want
mysql> CREATE USER 'root'@'%' IDENTIFIED BY '{password}';
mysql> GRANT ALL PRIVILEGES ON dolphinscheduler.* TO 'root'@'%';
mysql> CREATE USER 'root'@'localhost' IDENTIFIED BY '{password}';
mysql> GRANT ALL PRIVILEGES ON dolphinscheduler.* TO 'root'@'localhost';
mysql> FLUSH PRIVILEGES;
I simply used root here
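For a setup that does not reuse root, the same statements can be generated for a dedicated account. The sketch below only prints SQL and does not touch the database. gen_ds_grants is a hypothetical helper, and ds_user in the example is a made-up account name; passwords containing a single quote are not handled.

```shell
#!/usr/bin/env bash
# gen_ds_grants is a hypothetical helper; arguments: user, password.
# Prints CREATE USER / GRANT statements for both '%' and 'localhost'.
gen_ds_grants() {
  local user="$1" password="$2" host
  for host in '%' 'localhost'; do
    printf "CREATE USER '%s'@'%s' IDENTIFIED BY '%s';\n" "$user" "$host" "$password"
    printf "GRANT ALL PRIVILEGES ON dolphinscheduler.* TO '%s'@'%s';\n" "$user" "$host"
  done
  printf 'FLUSH PRIVILEGES;\n'
}

# Example (ds_user is a made-up name; review the output before piping it to mysql):
# gen_ds_grants ds_user 'change-me' | mysql -uroot -p
```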
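Making SeaTunnel visible to DS through environment variables
Normally SEATUNNEL_HOME is set in $DS_HOME/bin/env/dolphinscheduler_env.sh, but with that setup my SeaTunnel tasks never picked it up.
In the end I set SEATUNNEL_HOME in /etc/profile and ran source /etc/profile, and only then was it recognized.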
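When experimenting with /etc/profile it is easy to append the same export line several times. The sketch below adds a line only if it is not already there. ensure_export is a hypothetical helper, and the SeaTunnel directory in the example is an assumption, since this note does not say where the package was extracted.

```shell
#!/usr/bin/env bash
# ensure_export is a hypothetical helper; arguments: file, variable name, value.
# Appends "export NAME=VALUE" only if that exact line is not already present.
ensure_export() {
  local file="$1" name="$2" value="$3"
  local line="export ${name}=${value}"
  if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$file"
  fi
}

# Example (needs root; the SeaTunnel path is an assumption, adjust it to yours):
# ensure_export /etc/profile SEATUNNEL_HOME /data/apache-seatunnel-2.3.13
# source /etc/profile
```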
Making Spark and Hadoop visible to DS through environment variables
Edit the file with vi $DS_HOME/bin/env/dolphinscheduler_env.sh
export HADOOP_HOME=${HADOOP_HOME:-/data/hadoop}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/data/hadoop/etc/hadoop}
export SPARK_HOME=${SPARK_HOME:-/data/spark}
export PATH=$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH
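The `${VAR:-default}` form in these lines keeps a value that is already set in the environment (for example from /etc/profile) and only falls back to the path after `:-` when the variable is unset or empty. A small demonstration, where demo_home is a hypothetical helper written for this note:

```shell
#!/usr/bin/env bash
# demo_home is a hypothetical helper that mimics ${VAR:-default}.
# $1 = value already in the environment (may be empty), $2 = fallback path.
demo_home() {
  local current="$1"
  printf '%s\n' "${current:-$2}"
}

demo_home "" /data/hadoop            # prints /data/hadoop
demo_home /opt/hadoop /data/hadoop   # prints /opt/hadoop
```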
SeaTunnel version 2.3.13
Download the package and extract it
tar -xzvf apache-seatunnel-2.3.13-bin.tar.gz
Download the connector plugins
sh bin/install-plugin.sh 2.3.13
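You usually do not need every connector plugin. You can list the ones you want in config/plugin_config. For example, the sample application only needs the connector-console and connector-fake plugins, so plugin_config can be edited as follows: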
--seatunnel-connectors--
connector-fake
connector-console
--end--
You can find all supported connectors and their plugin_config names in ${SEATUNNEL_HOME}/connectors/plugins-mapping.properties.
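Before running install-plugin.sh it can be useful to see which connectors plugin_config actually selects. The sketch below prints the connector names between the two marker lines shown above. list_connectors is a hypothetical helper, not something shipped with SeaTunnel.

```shell
#!/usr/bin/env bash
# list_connectors is a hypothetical helper; argument: path to plugin_config.
# Prints every "connector-..." line between the start and end markers.
list_connectors() {
  awk '
    /^--seatunnel-connectors--/ { inside = 1; next }
    /^--end--/                  { inside = 0 }
    inside && /^connector-/     { print }
  ' "$1"
}

# Example:
# list_connectors config/plugin_config
```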
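SeaTunnel engine quick start
Single-node quick start (Local mode)
Add a job configuration file to define the job
Edit config/v2.batch.config.template. It determines how data is read, processed, and written once SeaTunnel starts. Below is a sample configuration, the same one used by the example application mentioned above.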
env {
  parallelism = 1
  job.mode = "BATCH"
}
source {
  FakeSource {
    plugin_output = "fake"
    row.num = 16
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
  }
}
transform {
  FieldMapper {
    plugin_input = "fake"
    plugin_output = "fake1"
    field_mapper = {
      age = age
      name = new_name
    }
  }
}
sink {
  Console {
    plugin_input = "fake1"
  }
}
Run the SeaTunnel application
cd apache-seatunnel-2.3.13
./bin/seatunnel.sh --config ./config/v2.batch.config.template -m local
The SeaTunnel console prints log output like the following:
2022-12-19 11:01:45,417 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - output rowType: name<STRING>, age<INT>
2022-12-19 11:01:46,489 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=1: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: CpiOd, 8520946
2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=2: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: eQqTs, 1256802974
2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=3: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: UsRgO, 2053193072
2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=4: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: jDQJj, 1993016602
2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=5: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: rqdKp, 1392682764
2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=6: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: wCoWN, 986999925
2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=7: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: qomTU, 72775247
2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=8: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: jcqXR, 1074529204
2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=9: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: AkWIO, 1961723427
2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=10: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: hBoib, 929089763
2022-12-19 11:01:46,490 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=11: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: GSvzm, 827085798
2022-12-19 11:01:46,491 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=12: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: NNAYI, 94307133
2022-12-19 11:01:46,491 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=13: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: EexFl, 1823689599
2022-12-19 11:01:46,491 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=14: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: CBXUb, 869582787
2022-12-19 11:01:46,491 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=15: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: Wbxtm, 1469371353
2022-12-19 11:01:46,491 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=16: SeaTunnelRow#tableId=-1 SeaTunnelRow#kind=INSERT: mIJDt, 995616438
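Since the job sets row.num = 16, a simple sanity check is to count the ConsoleSinkWriter row lines in the captured output. This is a sketch under the assumption that the log was saved to a file; count_console_rows and the file name job.log are hypothetical.

```shell
#!/usr/bin/env bash
# count_console_rows is a hypothetical helper; argument: path to a captured log file.
# Counts the lines the Console sink printed for individual rows.
count_console_rows() {
  grep -c 'ConsoleSinkWriter.*rowIndex=' "$1"
}

# Example (job.log is a made-up file name):
# ./bin/seatunnel.sh --config ./config/v2.batch.config.template -m local 2>&1 | tee job.log
# [ "$(count_console_rows job.log)" -eq 16 ] && echo "all 16 rows printed"
```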
Hadoop 3.1.3
sudo useradd -m hadoop -s /bin/bash
sudo passwd hadoop
The password I entered was hadoophadoop
sudo adduser hadoop sudo
sudo tar -zxf ~/下载/hadoop-3.1.3.tar.gz -C /data # extract into /data
cd /data
sudo mv ./hadoop-3.1.3/ ./hadoop # rename the directory to hadoop
sudo chown -R hadoop ./hadoop # change ownership to the hadoop user
Hadoop is ready to use once extracted. Run the following commands to check that it works; on success they print the Hadoop version:
cd /data/hadoop
./bin/hadoop version
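Hadoop single-node configuration (non-distributed)
Hadoop runs in non-distributed (local) mode by default and needs no further configuration. Non-distributed means a single Java process, which is convenient for debugging.
We can now run an example to get a feel for Hadoop. Hadoop ships with many examples (run ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar to list them all), including wordcount, terasort, join, and grep.
Here we run the grep example: every file in the input directory is used as input, words matching the regular expression dfs[a-z.]+ are extracted and counted, and the result is written to the output directory
- cd /data/hadoop
- mkdir ./input
- cp ./etc/hadoop/*.xml ./input # use the configuration files as input
- ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar grep ./input ./output 'dfs[a-z.]+'
- cat ./output/* # view the result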
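After a successful run, the job prints its statistics, and the result shows that dfsadmin, the only word matching the regular expression, appears once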
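To see what the job computes without starting Hadoop, roughly the same count can be produced with standard tools. This is not how the MapReduce example is implemented, only a local stand-in for comparison; local_grep_count is a hypothetical helper.

```shell
#!/usr/bin/env bash
# local_grep_count is a hypothetical helper; arguments: extended regex, then input files.
# Prints "count<TAB>match" for every distinct match, most frequent first.
local_grep_count() {
  local pattern="$1"
  shift
  grep -ohE -- "$pattern" "$@" | sort | uniq -c | sort -rn | awk '{print $1 "\t" $2}'
}

# Example, run from the Hadoop directory after creating ./input:
# local_grep_count 'dfs[a-z.]+' ./input/*.xml
```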
Note that Hadoop does not overwrite result files by default, so running the example again fails with an error. Delete ./output first.
- rm -r ./output
Spark 2.4.0
I want to evaluate the data quality feature in DS 3.2.2, so Spark has to be deployed.
sudo tar -zxf ~/下载/spark-2.4.0-bin-without-hadoop.tgz -C /data
cd /data
sudo mv ./spark-2.4.0-bin-without-hadoop/ ./spark
sudo chown -R hadoop:hadoop ./spark # hadoop here is your username
cd spark
cp ./conf/spark-env.sh.template ./conf/spark-env.sh
Then add the following to ./conf/spark-env.sh:
export JAVA_HOME=/data/jdk1.8.0_451
export HADOOP_CONF_DIR=/data/hadoop/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/data/hadoop/bin/hadoop classpath)
Note that I also added these paths to /etc/profile
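Because this is the without-hadoop Spark build, SPARK_DIST_CLASSPATH is what hands Spark the Hadoop jars, and the `$(...)` substitution silently yields an empty value if the hadoop launcher is missing. A guarded variant might look like this; hadoop_classpath is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# hadoop_classpath is a hypothetical helper; argument: path to the hadoop launcher.
# Prints the classpath if the launcher is executable, otherwise fails with a message.
hadoop_classpath() {
  local hadoop_bin="$1"
  if [ -x "$hadoop_bin" ]; then
    "$hadoop_bin" classpath
  else
    echo "hadoop launcher not found or not executable: $hadoop_bin" >&2
    return 1
  fi
}

# Example for spark-env.sh (path taken from this note):
# SPARK_DIST_CLASSPATH="$(hadoop_classpath /data/hadoop/bin/hadoop)" && export SPARK_DIST_CLASSPATH
```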
Verify that Spark is installed correctly
bin/run-example SparkPi
On success the log contains
Pi is roughly 3.140675703378517
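The result line is easy to miss among Spark's INFO messages, so filtering the output helps. A minimal sketch, where extract_pi is a hypothetical helper that reads the log on standard input and prints only the number:

```shell
#!/usr/bin/env bash
# extract_pi is a hypothetical helper; reads log text on stdin.
# Prints the value that follows "Pi is roughly", if such a line exists.
extract_pi() {
  sed -n 's/.*Pi is roughly \([0-9.]*\).*/\1/p'
}

# Example, run from the Spark directory:
# bin/run-example SparkPi 2>&1 | extract_pi
```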
If Spark needs to access MySQL, remember to put the mysql-connector-java driver (8.0.16) into $SPARK_HOME/jars
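The same driver jar has to be copied into several directories across this setup (the DolphinScheduler libs directories listed earlier and $SPARK_HOME/jars), so a small loop saves repetition. This is a sketch: install_jar is a hypothetical helper, and the jar file name in the example is an assumption about what the download is called.

```shell
#!/usr/bin/env bash
# install_jar is a hypothetical helper; arguments: jar file, then target directories.
# Copies the jar into every target that exists and warns about the ones that do not.
install_jar() {
  local jar="$1" dir
  shift
  if [ ! -f "$jar" ]; then
    echo "jar not found: $jar" >&2
    return 1
  fi
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      cp "$jar" "$dir/"
    else
      echo "skipping missing directory: $dir" >&2
    fi
  done
}

# Example (the jar file name is an assumption):
# install_jar mysql-connector-java-8.0.16.jar "$SPARK_HOME/jars" \
#   "$DS_HOME/api-server/libs" "$DS_HOME/alert-server/libs" \
#   "$DS_HOME/master-server/libs" "$DS_HOME/worker-server/libs"
```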