Connecting to Spark with DataGrip

zmd-zk 2024-11-17 20:33

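Overview:

Spark can be deployed in five modes:

1. Local mode

2. Cluster modes: standalone, YARN, Kubernetes (k8s), and Mesos

Spark itself is only a compute engine and has no database of its own, so the data is stored on HDFS and Hive supplies the databases. Since HDFS is already running, YARN is the natural choice of cluster manager, and standalone mode is not a good fit.

The configuration below is therefore based on HDFS + YARN + Spark.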
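
As a rough illustration of how the five modes differ in practice, the sketch below maps each mode to the value that would be passed to `--master`. The function name `master_url` is hypothetical, and the host names and ports for standalone, k8s, and Mesos are placeholders rather than values from this setup.

```shell
# Map a deployment mode to the value passed to --master.
# Host names and ports below are placeholders, not values from this article.
master_url() {
  case "$1" in
    local)      echo 'local[*]' ;;                        # all cores of one machine
    standalone) echo 'spark://bigdata01:7077' ;;          # Spark's own cluster manager
    yarn)       echo 'yarn' ;;                            # cluster found via HADOOP_CONF_DIR
    k8s)        echo 'k8s://https://k8s-apiserver:6443' ;;
    mesos)      echo 'mesos://mesos-master:5050' ;;
    *)          echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

master_url yarn
```

In YARN mode the value is just `yarn`; Spark locates the ResourceManager from the Hadoop configuration directory instead of from the URL.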

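I. Configuration

For Spark to see all the databases in Hive, Spark has to be connected to the Hive metastore service.

1. Add the following to hive-site.xml in the Hive installation: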

<property>
	<name>hive.metastore.schema.verification</name>
	<value>false</value>
</property>

2. Copy this file from Hive into Spark's conf directory:

cp /opt/installs/hive/conf/hive-site.xml /opt/installs/spark/conf
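
A quick check after the copy can catch a stale or incomplete file before Spark is started. The following is a minimal sketch; `copy_hive_site` is a hypothetical helper, and it only checks that the two files match and that the property from step 1 is present.

```shell
# Copy hive-site.xml into a Spark conf directory and confirm the property arrived.
# Both directories are supplied by the caller.
copy_hive_site() {
  src_dir="$1"
  dst_dir="$2"
  cp "$src_dir/hive-site.xml" "$dst_dir/" || return 1
  # cmp -s is silent and returns 0 only when the two files are byte-identical
  cmp -s "$src_dir/hive-site.xml" "$dst_dir/hive-site.xml" || return 1
  # Confirm the property added in step 1 is present in the copied file
  grep -q 'hive.metastore.schema.verification' "$dst_dir/hive-site.xml"
}

# Usage with the paths from this article:
# copy_hive_site /opt/installs/hive/conf /opt/installs/spark/conf && echo "hive-site.xml in place"
```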

3. Distribute the file in Spark's conf directory to the other nodes:

xsync.sh /opt/installs/spark/conf/hive-site.xml
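
The contents of xsync.sh are not shown here. For readers without such a script, the sketch below assumes it pushes a file to the same path on every other node with rsync. The function `sync_file`, the variables `SYNC_HOSTS` and `DRY_RUN`, and the host names bigdata02 and bigdata03 are all assumptions for illustration.

```shell
# Hypothetical stand-in for xsync.sh: push one file to the same path on other nodes.
# SYNC_HOSTS is an assumed worker list; DRY_RUN=1 (the default here) only prints the commands.
SYNC_HOSTS="${SYNC_HOSTS:-bigdata02 bigdata03}"
DRY_RUN="${DRY_RUN:-1}"

sync_file() {
  file="$1"
  dir="$(dirname "$file")"
  for host in $SYNC_HOSTS; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "rsync -av $file $host:$dir/"
    else
      rsync -av "$file" "$host:$dir/"
    fi
  done
}

sync_file /opt/installs/spark/conf/hive-site.xml
```

Setting `DRY_RUN=0` would run the rsync commands for real, which requires passwordless SSH between the nodes.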

II. Startup

1. Start Hadoop
2. Start the metastore
3. Start the Spark Thrift Server

Starting the Thrift Server: if HiveServer2 is already running, remember to change the port number below to 10001.

/opt/installs/spark/sbin/start-thriftserver.sh \
--hiveconf hive.server2.thrift.port=10000 \
--hiveconf hive.server2.thrift.bind.host=bigdata01 \
--master yarn \
--conf spark.sql.shuffle.partitions=2
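
Once the server is up, DataGrip connects over JDBC. The Spark Thrift Server speaks the HiveServer2 protocol, so the URL uses the `jdbc:hive2` scheme with the host and port given at startup. The sketch below builds that URL; `thrift_jdbc_url` is a hypothetical helper and the `default` database is an assumption.

```shell
# Build the JDBC URL for the Spark Thrift Server started above.
# The server speaks the HiveServer2 protocol, so the scheme is jdbc:hive2.
thrift_jdbc_url() {
  host="$1"
  port="${2:-10000}"   # pass 10001 if HiveServer2 already holds 10000
  db="${3:-default}"
  echo "jdbc:hive2://$host:$port/$db"
}

thrift_jdbc_url bigdata01 10000
# prints jdbc:hive2://bigdata01:10000/default

# With the server running, the same URL can be tried from a shell first:
# beeline -u "$(thrift_jdbc_url bigdata01 10000)" -n <user> -e 'SHOW DATABASES;'
```

The printed URL is what goes into the URL field of the DataGrip data source. If `SHOW DATABASES` lists the Hive databases, the metastore link from section I is working.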

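Differences between HiveServer2 and the Spark Thrift Server:

If you only need to access Hive tables and do not need Spark's compute engine:

Use HiveServer2. It is simpler and supports Hive features natively.

If you need to access Hive tables and want the queries to run on Spark's compute engine:

Use the Spark Thrift Server, which takes full advantage of Spark's distributed computing.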