1. Environment Preparation
Versions:
CDH 6.2.0
Hadoop: 3.0.0
Spark: 2.4.0
Hive: 2.1.1
Python: 3.7.0

Install the Python dependencies:
```shell
pip install pypandoc==1.5 --trusted-host mirrors.aliyun.com -i http://mirrors.aliyun.com/pypi/simple
pip install pandas==0.24.2 --trusted-host mirrors.aliyun.com -i http://mirrors.aliyun.com/pypi/simple
pip install pyspark==2.4.0 --trusted-host mirrors.aliyun.com -i http://mirrors.aliyun.com/pypi/simple
pip install hdfs
```
In addition, certifi, urllib3, and requests may be too new and need to be downgraded: pip uninstall the package, then pip install an older version. The versions that worked here are shown in the pip list output below.
```shell
(venv) D:\xxx\>python --version
Python 3.7.0

(venv) D:\xxx\>pip list
Package            Version
------------------ -----------
certifi            2018.8.13
chardet            3.0.4
charset-normalizer 3.4.0
decorator          5.1.1
docopt             0.6.2
gssapi             1.8.3
hdfs               2.7.3
idna               2.7
krbcontext         0.10
numpy              1.21.6
pandas             0.24.2
pip                24.0
py4j               0.10.7
pypandoc           1.5
pyspark            2.4.0
python-dateutil    2.9.0.post0
pytz               2024.2
requests           2.19.1
setuptools         68.0.0
six                1.16.0
urllib3            1.23
wheel              0.42.0
```
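If you want to confirm that the pinned versions are the ones actually in use, a quick runtime check (a minimal sketch; the expected values are taken from the listing above):

```python
# Sanity check of the pinned packages; expected values come from the pip list above.
import certifi
import pandas
import pyspark
import requests
import urllib3

print(pyspark.__version__)   # 2.4.0
print(pandas.__version__)    # 0.24.2
print(requests.__version__)  # 2.19.1
print(urllib3.__version__)   # 1.23
print(certifi.__version__)   # 2018.8.13
```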
2. Code
```python
# -*- coding: utf-8 -*-
import os  # needed if the os.environ lines below are uncommented

from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

keytab_file = 'D:/user.keytab'  # path to the keytab
principal = 'user@XXXXX.COM'

if __name__ == '__main__':
    # os.environ['HADOOP_HOME'] = 'D:\hadoop'
    # os.environ['HADOOP_CONF_DIR'] = 'D:\hadoop\etc\hadoop-curr'

    # Notes:
    #   - spark.sql.catalogImplementation=hive is the equivalent of .enableHiveSupport()
    #   - On newer Spark versions the hive.* properties below need a "spark." prefix
    #     (e.g. spark.hive.metastore.uris), and spark.yarn.keytab / spark.yarn.principal
    #     become spark.kerberos.keytab / spark.kerberos.principal.
    conf = SparkConf().setAppName("pyspark-sql") \
        .setMaster("local[*]") \
        .set("spark.sql.catalogImplementation", "hive") \
        .set("hive.metastore.uris",
             "thrift://ts01.xxx.com:9083,thrift://ts02.xxx.com:9083") \
        .set("hive.metastore.sasl.enabled", "true") \
        .set("hive.metastore.kerberos.principal", "hive/_HOST@XXXXX.COM") \
        .set("spark.driver.extraJavaOptions",
             "-Djava.security.krb5.conf=D:\\krb5.conf -Djava.security.krb5.realm=XXXXX.COM -Djava.security.krb5.kdc=management.xxx.com") \
        .set("spark.executor.extraJavaOptions",
             "-Djava.security.krb5.conf=D:\\krb5.conf -Djava.security.krb5.realm=XXXXX.COM -Djava.security.krb5.kdc=management.xxx.com") \
        .set("spark.yarn.keytab", keytab_file) \
        .set("spark.yarn.principal", principal)
    sc = SparkContext(conf=conf)
    sc.setLogLevel("INFO")
    spark = SparkSession(sc)

    # Equivalent SparkSession.builder form (the same notes about newer Spark apply):
    # spark = SparkSession.builder \
    #     .master("local[*]") \
    #     .config("hive.metastore.uris",
    #             "thrift://ts01.xxx.com:9083,thrift://ts02.xxx.com:9083") \
    #     .config("hive.metastore.sasl.enabled", "true") \
    #     .config("hive.metastore.kerberos.principal", "hive/_HOST@XXXXX.COM") \
    #     .config("spark.driver.extraJavaOptions",
    #             "-Djava.security.krb5.conf=D:\\krb5.conf -Djava.security.krb5.realm=XXXXX.COM -Djava.security.krb5.kdc=management.xxx.com") \
    #     .config("spark.executor.extraJavaOptions",
    #             "-Djava.security.krb5.conf=D:\\krb5.conf -Djava.security.krb5.realm=XXXXX.COM -Djava.security.krb5.kdc=management.xxx.com") \
    #     .config("spark.yarn.keytab", keytab_file) \
    #     .config("spark.yarn.principal", principal) \
    #     .enableHiveSupport() \
    #     .getOrCreate()
    # spark.sparkContext.setLogLevel("INFO")

    print("ready to run!!!")
    # Test reading a plain HDFS file first, then move on to Hive:
    # text = sc.textFile("hdfs://yourcluster/tmp/plain-text.txt")
    # print(text.collect())
    lines = spark.sql("select * from test.tmp_tbl").collect()
    print(lines)
    spark.stop()
```
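For reference, the comments above mention the property-name changes on newer Spark releases; the same configuration written with those newer keys would look roughly like this (a sketch only, not tested against this CDH 6.2.0 setup):

```python
from pyspark import SparkConf

keytab_file = 'D:/user.keytab'
principal = 'user@XXXXX.COM'

# On newer Spark: hive.* keys get a "spark." prefix, and spark.yarn.keytab /
# spark.yarn.principal become spark.kerberos.keytab / spark.kerberos.principal.
conf = SparkConf().setAppName("pyspark-sql") \
    .setMaster("local[*]") \
    .set("spark.sql.catalogImplementation", "hive") \
    .set("spark.hive.metastore.uris",
         "thrift://ts01.xxx.com:9083,thrift://ts02.xxx.com:9083") \
    .set("spark.hive.metastore.sasl.enabled", "true") \
    .set("spark.hive.metastore.kerberos.principal", "hive/_HOST@XXXXX.COM") \
    .set("spark.kerberos.keytab", keytab_file) \
    .set("spark.kerberos.principal", principal)
# The spark.driver/executor.extraJavaOptions for krb5 are unchanged from the code above.
```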
3. Launching
To launch from PyCharm, open Run/Debug Configurations and, under Environment variables, add HADOOP_HOME=D:\hadoop;HADOOP_CONF_DIR=D:\hadoop\etc\hadoop-curr.
Alternatively, uncomment these lines in the script:
# os.environ['HADOOP_HOME'] = 'D:\hadoop'
# os.environ['HADOOP_CONF_DIR'] = 'D:\hadoop\etc\hadoop-curr'
Notes on the environment variables

(1) Hadoop
HADOOP_HOME and HADOOP_CONF_DIR can be set either through Environment variables in Run/Debug Configurations or through os.environ in the code; both work (see the sketch below).
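A minimal sketch of the in-code variant (the paths are the example paths used above; adjust them to your layout):

```python
import os

# Must be set before the SparkContext is created.
os.environ['HADOOP_HOME'] = r'D:\hadoop'
os.environ['HADOOP_CONF_DIR'] = r'D:\hadoop\etc\hadoop-curr'
```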
(2) Hive
In testing, setting HIVE_CONF_DIR through any of these mechanisms had no effect; the Hive settings must be passed to Spark directly. Under Kerberos, hive.metastore.uris, hive.metastore.sasl.enabled, and hive.metastore.kerberos.principal are all required, as in the sketch below.
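A minimal sketch of just those settings (the metastore hosts and principal are the placeholders used throughout this article; on newer Spark the keys need the spark. prefix, e.g. spark.hive.metastore.uris):

```python
from pyspark import SparkConf

# Hive settings handed to Spark directly, since HIVE_CONF_DIR is ignored.
conf = SparkConf() \
    .set("spark.sql.catalogImplementation", "hive") \
    .set("hive.metastore.uris",
         "thrift://ts01.xxx.com:9083,thrift://ts02.xxx.com:9083") \
    .set("hive.metastore.sasl.enabled", "true") \
    .set("hive.metastore.kerberos.principal", "hive/_HOST@XXXXX.COM")
```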
(3) Kerberos
In testing, setting KRB5_CONFIG through any of these mechanisms had no effect either. The JVM options must be passed in through spark.driver.extraJavaOptions and spark.executor.extraJavaOptions. Placing krb5.ini or krb5.conf at C:\Windows\krb5.ini, C:\winnt\krb5.ini, or JAVA_HOME/lib/security/krb5.ini did not work. Passing only -Djava.security.krb5.conf fails with java.lang.IllegalArgumentException: Can't get Kerberos realm; -Djava.security.krb5.realm and -Djava.security.krb5.kdc must be added as well (see the sketch below).
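A minimal sketch of wiring those options up (the krb5.conf path, realm, and KDC host are the example values from the code in section 2):

```python
from pyspark import SparkConf

# Build the Kerberos JVM options once and hand them to both driver and executor.
krb5_opts = (
    "-Djava.security.krb5.conf=D:\\krb5.conf "
    "-Djava.security.krb5.realm=XXXXX.COM "
    "-Djava.security.krb5.kdc=management.xxx.com"
)
conf = SparkConf() \
    .set("spark.driver.extraJavaOptions", krb5_opts) \
    .set("spark.executor.extraJavaOptions", krb5_opts)
```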
4. Notes on the log output
The log will contain warnings such as the ones below. Do not remove the Hive-related settings because of them; without those settings the job fails.
```
Warning: Ignoring non-spark config property: hive.metastore.sasl.enabled=true
Warning: Ignoring non-spark config property: hive.metastore.uris=thrift://ts01.xxx.com:9083,thrift://ts02.xxx.com:9083
Warning: Ignoring non-spark config property: hive.metastore.kerberos.principal=hive/_HOST@XXXXX.COM
2024-11-21 16:20:00 INFO metastore:376 - Trying to connect to metastore with URI thrift://ts01.xxx.com:9083
2024-11-21 16:20:01 INFO metastore:472 - Connected to metastore.
```