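Deploying Spark on a Big Data Cluster

Spark deployment guide

Prerequisite: three virtual machines with Hadoop already installed and configured: hadoop102, hadoop103, and hadoop104.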

Download

https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz

Requirements

Python: 3.8 recommended
JDK: 1.8

Extract

Extract the downloaded Spark package:

tar -zxvf spark-3.2.0-bin-hadoop3.2.tgz -C /export/server/

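Environment variables

Spark needs the following five environment variables:

SPARK_HOME: the Spark installation path
PYSPARK_PYTHON: the Python interpreter Spark uses to run Python programs
JAVA_HOME: where Java is installed
HADOOP_CONF_DIR: where the Hadoop configuration files are
HADOOP_HOME: where Hadoop is installed

All five must be set in /etc/profile.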
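After editing /etc/profile and reloading it (for example by logging in again), a quick check can confirm that all five variables are visible to new processes. The following is a minimal sketch; the helper name missing_vars is hypothetical and introduced only for illustration.

```python
import os

# The five variables this guide says Spark needs.
REQUIRED_VARS = [
    "SPARK_HOME",
    "PYSPARK_PYTHON",
    "JAVA_HOME",
    "HADOOP_CONF_DIR",
    "HADOOP_HOME",
]


def missing_vars(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All five variables are set.")
```

Run it with the same Python interpreter that PYSPARK_PYTHON is meant to point to, so that the interpreter itself is exercised as well.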

Uploading the Spark package

Upload the downloaded spark-3.2.0-bin-hadoop3.2.tgz to the Linux server.

Then extract it. This guide extracts (installs) it into /export/server:

tar -zxvf spark-3.2.0-bin-hadoop3.2.tgz -C /export/server/

The Spark directory name is long, so create a symbolic link to it:

ln -s /export/server/spark-3.2.0-bin-hadoop3.2 /export/server/spark

Testing

bin/pyspark

bin/pyspark provides an interactive Python interpreter in which you can run both ordinary Python code and Spark code, for example:

sc.parallelize([1,2,3,4,5]).map(lambda x: x + 1).collect()

Note: parallelize and map are both APIs provided by Spark.
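Conceptually, the one-liner above splits the list into partitions, applies the function to every element, and gathers the results back into a single list. The plain-Python sketch below mimics that flow without Spark. The helper names and the chunking scheme are hypothetical and chosen for illustration; this is not how Spark implements these calls.

```python
def split_into_partitions(data, num_partitions):
    """Split data into num_partitions contiguous chunks (illustrative only)."""
    size, extra = divmod(len(data), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def map_partitions(chunks, func):
    """Apply func to every element of every chunk, chunk by chunk."""
    return [[func(x) for x in chunk] for chunk in chunks]


def collect(chunks):
    """Concatenate all chunks back into one list, preserving order."""
    return [x for chunk in chunks for x in chunk]


chunks = split_into_partitions([1, 2, 3, 4, 5], 2)
result = collect(map_partitions(chunks, lambda x: x + 1))
print(result)  # [2, 3, 4, 5, 6]
```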

Web UI (4040)

A running Spark program binds to port 4040 on the machine. If 4040 is already in use, it moves on to 4041, then 4042, and so on.

Port 4040 serves the web UI. To open it, enter server-ip:4040 in a browser.
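The fall-through from 4040 to 4041, 4042, and so on can be pictured as a simple probe loop. The sketch below finds the first port at or above 4040 that can currently be bound on the local machine, which can help guess where the next Spark UI will appear. The function names and the limit of 16 attempts are assumptions made for this illustration, not values taken from Spark.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(start=4040, max_tries=16, is_free=port_is_free):
    """Try start, start + 1, ... and return the first free port, or None."""
    for offset in range(max_tries):
        port = start + offset
        if is_free(port):
            return port
    return None


if __name__ == "__main__":
    print("Next free UI port candidate:", first_free_port())
```

The is_free parameter is injectable so the search logic can be exercised without opening real sockets.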

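The monitoring page shows that the program has only a single Driver. Because this is local mode, the Driver both manages the job and does the work. Running jps likewise shows a single process for local mode; that one process acts as both master and worker.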

bin/spark-shell (for reference only)

This is also an interpreter environment. Unlike bin/pyspark, it runs Scala code rather than Python code.

scala> sc.parallelize(Array(1,2,3,4,5)).map(x=> x + 1).collect()
res0: Array[Int] = Array(2, 3, 4, 5, 6)

It is enough to know that this exists, since it is the interpreter environment for the Scala language.

bin/spark-submit (Pi example)

Purpose: submit a given Spark program to a Spark environment to run.

Usage:

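# Syntax
bin/spark-submit [options] <path to jar or Python file> [application arguments]

# Example
bin/spark-submit /export/server/spark/examples/src/main/python/pi.py 10
# Runs the official Spark example that estimates pi. The trailing 10 is the argument passed to the main function; the larger the number, the more accurate the estimate.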
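To see why a larger argument gives a better estimate, it helps to look at the Monte Carlo idea in isolation: sample random points in a square and count how many land inside the inscribed circle. The sketch below is plain Python, not the source of the Spark example, and it assumes the example follows this general sampling approach. The function name estimate_pi and the seed parameter are introduced here for illustration.

```python
import random


def estimate_pi(num_samples, seed=None):
    """Estimate pi by sampling points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of the unit circle divided by area of the square is pi / 4.
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    for n in (1_000, 100_000):
        print(n, estimate_pi(n, seed=42))
```

More samples reduce the random error of the estimate, which is the effect of raising the number passed on the command line.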

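Comparison

Aspect	bin/spark-submit	bin/pyspark	bin/spark-shell
Function	Submits Java, Scala, or Python code to Spark to run	Provides a Python interpreter environment for running Spark programs written in Python	Provides a Scala interpreter environment for running Spark programs written in Scala
Characteristics	Used to submit code	Interpreter environment; each line runs as it is entered	Interpreter environment; each line runs as it is entered
Use case	Production use: formally submitting Spark programs to run	Testing, learning, line-by-line experiments, verifying code	Testing, learning, line-by-line experiments, verifying code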

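Installing Anaconda on Linux (single server)

Installation

Upload the installer:

Download page: https://www.anaconda.com/products/individual#Downloads

Upload Anaconda3-2021.05-Linux-x86_64.sh to the Linux server.

Install:

sh ./Anaconda3-2021.05-Linux-x86_64.sh

Answer yes at the prompts and the installation completes.

After it finishes, close the SecureCRT session and log in again.

If the shell prompt now begins with (base), the installation succeeded.

base is the default virtual environment.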

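Mirror inside China

If (base) does not appear after installation, check the conda initialization section in /root/.bashrc. Separately, to download packages from the Tsinghua mirror, add the following to conda's configuration file .condarc (for the root user, /root/.condarc). It is YAML channel configuration for conda rather than shell code, so it does not belong in .bashrc: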

channels:
  - defaults
show_channel_urls: true
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud

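Appendix 2: spark-submit and pyspark options

The client tools available are:

bin/pyspark: Spark environment with a Python interpreter
bin/spark-shell: Spark environment with a Scala interpreter
bin/spark-submit: tool for submitting a jar or Python file for execution
bin/spark-sql: Spark SQL client

These four tools share essentially the same options.

Taking spark-submit as an example: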

bin/spark-submit --master spark://node1:7077 xxx.py

Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Deploy mode, client or cluster (Default: client).
  --class CLASS_NAME          Main class to run (for Java / Scala apps).
  --name NAME                 Name of the application.
  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Other Python files that the Python program depends on.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).
  --archives ARCHIVES         Comma-separated list of archives to be extracted into the
                              working directory of each executor.

  --conf, -c PROP=VALUE       Set a configuration property explicitly.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory available to the Driver (Default: 1024M).
  --driver-java-options       Extra Java options for the Driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per Executor (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show the help message.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version.

 Cluster deploy mode only:
  --driver-cores NUM          Number of CPU cores available to the Driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, Spark may try to restart the Driver.

 Spark standalone, Mesos or K8s with cluster deploy mode only:
  --kill SUBMISSION_ID        Kill the program with the given ID.
  --status SUBMISSION_ID      Show the run status of the program with the given ID.

 Spark standalone, Mesos and Kubernetes only:
  --total-executor-cores NUM  Total number of CPU cores for all Executors of the job.

 Spark standalone, YARN and Kubernetes only:
  --executor-cores NUM        Number of CPU cores each Executor may use.

 Spark on YARN and Kubernetes only:
  --num-executors NUM         Number of Executors to launch.
  --principal PRINCIPAL       Principal to be used to login to KDC.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above.

 Spark on YARN only:
  --queue QUEUE_NAME          YARN queue to run in (Default: "default").
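When jobs are launched from scripts, it can be convenient to assemble the spark-submit command line programmatically instead of concatenating strings. The sketch below builds an argument list using only options that appear in the help text above. The function build_spark_submit and its parameter names are hypothetical helpers introduced for illustration, and the default path bin/spark-submit assumes the command is run from the Spark directory.

```python
import shlex


def build_spark_submit(app, app_args=(), master=None, deploy_mode=None,
                       executor_memory=None, executor_cores=None,
                       conf=None, spark_submit="bin/spark-submit"):
    """Return the spark-submit argument list for the given settings."""
    cmd = [spark_submit]
    if master:
        cmd += ["--master", master]
    if deploy_mode:
        cmd += ["--deploy-mode", deploy_mode]
    if executor_memory:
        cmd += ["--executor-memory", executor_memory]
    if executor_cores:
        cmd += ["--executor-cores", str(executor_cores)]
    # Each extra property becomes its own --conf PROP=VALUE pair.
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    # Options come before the application file; its arguments come after.
    cmd.append(app)
    cmd += [str(arg) for arg in app_args]
    return cmd


cmd = build_spark_submit(
    "/export/server/spark/examples/src/main/python/pi.py",
    app_args=[10],
    master="local[*]",
    executor_memory="1g",
)
# shlex.join quotes arguments so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting issues with values such as local[*].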