A First Look at the Spark SQL Catalog Cache

Conclusion up front: the table schema cached in the Spark SQL catalog is, in general, not refreshed automatically.

The experiment:

  1. In pg, create a table t1 with a single column c1 int (the pg-side statements are sketched after this list)
  2. Register the table in Spark SQL and query it
    1. ./bin/spark-sql --driver-class-path postgresql-42.7.1.jar --jars postgresql-42.7.1.jar
    2. spark-sql (default)> create table c1v using org.apache.spark.sql.jdbc options (url "jdbc:postgresql://localhost:5432/postgres", dbtable "public.t1", user 'postgres', password 'postgres');
    3. spark-sql (default)> select * from c1v;
    4. Result: one row
  3. In pg, add a column c2 to t1 and insert a row (2, 2)
  4. Query again in Spark SQL
    1. spark-sql (default)> select * from c1v;
    2. Result: two rows, but no c2 column
  5. In pg, drop column c1
  6. Query again in Spark SQL
    1. spark-sql (default)> select * from c1v;
    2. Result: error, column c1 does not exist
  7. The pg query log confirms this as well: Spark keeps sending SELECT "c1" FROM public.wkt, i.e. Spark is completely unaware of the schema changes above.
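
For reference, a sketch of the pg-side statements behind steps 1, 3 and 5 above. The initial INSERT is an assumption (the write-up only records that the first Spark query returned one row):

CREATE TABLE public.t1 (c1 int);           -- step 1: single column c1
INSERT INTO public.t1 VALUES (1);          -- assumed seed row behind step 2's result

ALTER TABLE public.t1 ADD COLUMN c2 int;   -- step 3: add c2
INSERT INTO public.t1 VALUES (2, 2);       -- step 3: insert row (2, 2)

ALTER TABLE public.t1 DROP COLUMN c1;      -- step 5: drop c1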

The Spark error output:

spark-sql (default)> select * from c1v;
24/01/09 20:18:04 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
org.postgresql.util.PSQLException: ERROR: column "c1" does not exist
  Hint: Perhaps you meant to reference the column "wkt.c2".
  Position: 8
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2712)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2400)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:367)
        at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:498)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:415)
        at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:190)
        at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:134)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:320)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
24/01/09 20:18:04 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5) (172.18.203.110 executor driver): org.postgresql.util.PSQLException: ERROR: column "c1" does not exist
  Hint: Perhaps you meant to reference the column "wkt.c2".
  Position: 8
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2712)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2400)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:367)
        at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:498)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:415)
        at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:190)
        at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:134)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:320)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

24/01/09 20:18:04 ERROR TaskSetManager: Task 0 in stage 5.0 failed 1 times; aborting job
Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5) (172.18.203.110 executor driver): org.postgresql.util.PSQLException: ERROR: column "c1" does not exist
  Hint: Perhaps you meant to reference the column "wkt.c2".
  Position: 8
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2712)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2400)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:367)
        at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:498)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:415)
        at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:190)
        at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:134)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:320)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5) (172.18.203.110 executor driver): org.postgresql.util.PSQLException: ERROR: column "c1" does not exist
  Hint: Perhaps you meant to reference the column "wkt.c2".
  Position: 8
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2712)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2400)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:367)
        at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:498)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:415)
        at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:190)
        at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:134)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:320)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720)
        at scala.collection.immutable.List.foreach(List.scala:333)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1206)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1206)
        at scala.Option.foreach(Option.scala:437)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1206)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2984)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2263)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2284)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2328)
        at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1019)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:1018)
        at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:448)
        at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:475)
        at org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$2(SparkSQLDriver.scala:69)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:69)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:415)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:533)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1$adapted(SparkSQLCLIDriver.scala:527)
        at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
        at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:926)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:527)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:307)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1020)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1111)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.postgresql.util.PSQLException: ERROR: column "c1" does not exist
  Hint: Perhaps you meant to reference the column "wkt.c2".
  Position: 8
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2712)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2400)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:367)
        at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:498)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:415)
        at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:190)
        at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:134)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:320)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

Looking further into the Spark SQL error, the failure point is org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:320). The code below shows that Spark assembles the pushed-down SQL statement directly from the table schema cached in the catalog; only when pg actually executes that statement does the fact that column c1 no longer exists surface as an error.

    // `columns` comes from the schema Spark already holds for this table,
    // not from a fresh look at the remote database
    var builder = dialect
      .getJdbcSQLQueryBuilder(options)
      .withColumns(columns)
      .withPredicates(predicates, part)
      .withSortOrders(sortOrders)
      .withLimit(limit)
      .withOffset(offset)

    groupByColumns.foreach { groupByKeys =>
      builder = builder.withGroupByColumns(groupByKeys)
    }

    sample.foreach { tableSampleInfo =>
      builder = builder.withTableSample(tableSampleInfo)
    }

    val sqlText = builder.build()
    stmt = conn.prepareStatement(sqlText,
        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
    stmt.setFetchSize(options.fetchSize)
    stmt.setQueryTimeout(options.queryTimeout)
    // JDBCRDD.scala:320 in the stack trace above: the stale column list
    // only fails here, once pg actually executes the generated query
    rs = stmt.executeQuery()
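
To close the loop, a minimal Spark SQL sketch of how to inspect the cached schema and get a fresh one in this setup. The table name and connection options are taken from the experiment above; since the schema is captured when the table is created, re-registering the table should make Spark re-infer it from pg, though this was not separately verified here:

-- the cached schema: still shows only c1, regardless of what pg now has
DESC c1v;

-- re-register the table so Spark re-infers the schema from PostgreSQL
DROP TABLE c1v;
create table c1v using org.apache.spark.sql.jdbc options (url "jdbc:postgresql://localhost:5432/postgres", dbtable "public.t1", user 'postgres', password 'postgres');

-- now reflects the current column set on the pg side
select * from c1v;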