Troubleshooting pyspark errors when calling it from Python

Preface

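PyCharm is configured here with an SSH server and the Anaconda Python interpreter. If you have not set this up yet, see "Big Data Single-Machine Learning Environment Setup (8): Installing Anaconda on a Single Linux Node and Connecting PyCharm".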

Script run from PyCharm

Run the following Python script, pyspark_model.py, which builds a SparkSession to execute Spark SQL.

"""
    Script name: pyspark test from PyCharm
    Purpose: run Spark SQL remotely from PyCharm
"""
from pyspark.sql import SparkSession
import os

os.environ['SPARK_HOME'] = '/opt/spark'
os.environ['JAVA_HOME'] = '/opt/jdk1.8'

spark = SparkSession.builder \
    .appName('pyspark_conda') \
    .master("yarn") \
    .config("spark.sql.warehouse.dir", "hdfs://bigdata01:8020/user/hive/warehouse") \
    .config("hive.metastore.uris", "thrift://bigdata01:9083") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql('select * from hostnames limit 10;').show()

spark.stop()

Error 1: pyspark version mismatch

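For example, my cluster runs Spark 3.0.0, while the pyspark package installed in Python is 3.5.0. I did not pin a version, so pip installed the latest release by default.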

The error message is [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number. The full output is as follows:

ssh://slash@bigdata01:22/opt/python3/bin/python3 -u /home/slash/etl/dwtool/pyspark/pyspark_script/pyspark_model.py
JAVA_HOME is not set
Traceback (most recent call last):
  File "/home/slash/etl/dwtool/pyspark/pyspark_script/pyspark_model.py", line 7, in <module>
    spark = SparkSession.builder \
  File "/opt/python3/lib/python3.8/site-packages/pyspark/sql/session.py", line 497, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 515, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 201, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 436, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/opt/python3/lib/python3.8/site-packages/pyspark/java_gateway.py", line 107, in launch_gateway
    raise PySparkRuntimeError(
pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.

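If you insist on keeping the mismatched pyspark version, other errors still appear even after JAVA_HOME has been set as described in Error 2, for example the Py4JError below. The most thorough fix is therefore to replace pyspark with a version that matches the Spark version.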

Traceback (most recent call last):
  File "/home/slash/etl/dwtool/pyspark/pyspark_script/pyspark_model.py", line 7, in <module>
    spark = SparkSession.builder \
  File "/opt/python3/lib/python3.8/site-packages/pyspark/sql/session.py", line 497, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 515, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 203, in __init__
    self._do_init(
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 316, in _do_init
    self._jvm.PythonUtils.getPythonAuthSocketTimeout(self._jsc)
  File "/opt/python3/lib/python3.8/site-packages/py4j/java_gateway.py", line 1549, in __getattr__
    raise Py4JError(
py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getPythonAuthSocketTimeout does not exist in the JVM
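
The version check described above can be sketched as a small helper that runs before the SparkSession is built. This is only an illustrative sketch: the helper names and the constant `CLUSTER_SPARK_VERSION` are hypothetical, and comparing only the major.minor part is an assumption about what counts as "matching" (the article itself used an exact match, 3.0.0 with 3.0.0).

```python
def major_minor(version):
    """Return (major, minor) as ints from a version string such as '3.0.0'."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def versions_match(pyspark_version, spark_version):
    """True when the pip package and the cluster share the same major.minor."""
    return major_minor(pyspark_version) == major_minor(spark_version)


# Hypothetical constant: the Spark version of the cluster used in this article.
CLUSTER_SPARK_VERSION = "3.0.0"

print(versions_match("3.5.0", CLUSTER_SPARK_VERSION))  # False: the combination that failed
print(versions_match("3.0.0", CLUSTER_SPARK_VERSION))  # True: the combination that worked
```

In a real script the first argument would typically come from `pyspark.__version__`, and the cluster version can be read from the output of `spark-submit --version` on the server.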

Error 2: JAVA_HOME is not set successfully

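The pyspark package has now been reinstalled as version 3.0.0 (pin the version when installing: pip install pyspark==3.0.0). The error messages are Java gateway process exited before sending its port number and JAVA_HOME is not set. The full output is as follows: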

ssh://slash@bigdata01:22/opt/python3/bin/python3 -u /home/slash/etl/dwtool/pyspark/pyspark_script/pyspark_model.py
JAVA_HOME is not set
Traceback (most recent call last):
  File "/home/slash/etl/dwtool/pyspark/pyspark_script/pyspark_model.py", line 7, in <module>
    spark = SparkSession.builder \
  File "/opt/python3/lib/python3.8/site-packages/pyspark/sql/session.py", line 186, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 371, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 128, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/opt/python3/lib/python3.8/site-packages/pyspark/context.py", line 320, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/opt/python3/lib/python3.8/site-packages/pyspark/java_gateway.py", line 105, in launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number

The fix is to set both variables in the script before the SparkSession is built:

# With pyspark 3.5.0, errors persist even after SPARK_HOME and JAVA_HOME are set
# With pyspark 3.0.0, the script runs successfully once they are set
os.environ['SPARK_HOME'] = '/opt/spark'
os.environ['JAVA_HOME'] = '/opt/jdk1.8'
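
Because a wrong path produces the same opaque "Java gateway process exited" error, it can help to validate the two variables before building the SparkSession. The following is a minimal sketch under that assumption. The helper names `java_home_is_usable` and `missing_env` are hypothetical, and checking for an executable `bin/java` is only a heuristic for a valid JDK directory.

```python
import os


def java_home_is_usable(java_home):
    """Return True if java_home contains an executable bin/java file."""
    if not java_home:
        return False
    java_bin = os.path.join(java_home, "bin", "java")
    return os.path.isfile(java_bin) and os.access(java_bin, os.X_OK)


def missing_env(required=("SPARK_HOME", "JAVA_HOME"), env=None):
    """List the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Report problems early instead of waiting for the gateway error.
    print("Missing variables:", missing_env())
    print("JAVA_HOME usable:", java_home_is_usable(os.environ.get("JAVA_HOME")))
```

Run over the same SSH interpreter that PyCharm uses, this shows what the remote Python process actually sees, which may differ from an interactive login shell on the server.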

Error 3: Python version problem

At first I installed the latest Anaconda, which ships Python 3.11. Using pyspark 3.0.0 in that environment also fails, with TypeError: code() argument 13 must be str, not int. The full output is as follows:

ssh://slash@bigdata01:22/opt/anaconda3/bin/python3.11 -u /home/slash/etl/dwtool/pyspark/pyspark_script/pyspark_model.py
Traceback (most recent call last):
  File "/home/slash/etl/dwtool/pyspark/pyspark_script/pyspark_model.py", line 1, in <module>
    from pyspark.sql import SparkSession
  File "/opt/anaconda3/lib/python3.11/site-packages/pyspark/__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "/opt/anaconda3/lib/python3.11/site-packages/pyspark/context.py", line 30, in <module>
    from pyspark import accumulators
  File "/opt/anaconda3/lib/python3.11/site-packages/pyspark/accumulators.py", line 97, in <module>
    from pyspark.serializers import read_int, PickleSerializer
  File "/opt/anaconda3/lib/python3.11/site-packages/pyspark/serializers.py", line 71, in <module>
    from pyspark import cloudpickle
  File "/opt/anaconda3/lib/python3.11/site-packages/pyspark/cloudpickle.py", line 209, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/pyspark/cloudpickle.py", line 172, in _make_cell_set_template_code
    return types.CodeType(
           ^^^^^^^^^^^^^^^
TypeError: code() argument 13 must be str, not int

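After deleting the /opt/anaconda3 directory, I reinstalled Anaconda from Anaconda3-2021.05-Linux-x86_64.sh, which ships Python 3.8. With the pyspark 3.0.0 package driving the Spark 3.0.0 engine, the SparkSession was built and the Spark SQL query ran successfully.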
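
The interpreter check can also be made explicit at the top of the script, before `pyspark` is imported, so the failure is reported in plain words rather than as a cloudpickle TypeError. This is a sketch based only on the two data points observed in this article (Python 3.8 works with pyspark 3.0.0, Python 3.11 does not). The function name `classify_python` and the "untested" category are hypothetical, and treating everything from 3.11 upward as unsupported is an assumption.

```python
import sys

# Observed in this article with pyspark 3.0.0.
KNOWN_GOOD = (3, 8)       # Python 3.8: SparkSession built successfully
KNOWN_BAD_FROM = (3, 11)  # Python 3.11: import fails inside cloudpickle


def classify_python(version_info):
    """Classify an interpreter version as 'ok', 'unsupported', or 'untested'."""
    major_minor = tuple(version_info[:2])
    if major_minor >= KNOWN_BAD_FROM:
        return "unsupported"
    if major_minor == KNOWN_GOOD:
        return "ok"
    return "untested"


if __name__ == "__main__":
    print("Current interpreter:", classify_python(sys.version_info))
```

Versions between 3.8 and 3.11 are reported as "untested" on purpose, since the article did not try them.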

