Table of Contents
- Summary
- Troubleshooting
  - The immediate exception on the driver
  - [Task 1.0 in Stage 2.0 fails: an UnknownHostException sets off a butterfly effect](#Task 1.0 in Stage 2.0 fails: an UnknownHostException sets off a butterfly effect)
  - [Stage 2.0 gets retried](#Stage 2.0 gets retried)
  - [The "two-party race" between Task 3.0 in Stage 2's two attempts](#The "two-party race" between Task 3.0 in Stage 2's two attempts)
  - [The three-party race around Task 4.0 in Stage 2's two attempts](#The three-party race around Task 4.0 in Stage 2's two attempts)
- The fix: closing the race windows
- Takeaways
Summary
This post walks through how we diagnosed, analyzed, and finally fixed a failed Spark job.
Although the thing that broke was our own Spark job rather than a general piece of infrastructure, the process is the same basic methodology we use for every other problem: collecting scattered logs from a distributed environment, untangling a linear chain of cause and effect out of exceptions that appear to cause one another, searching logs, reading stack traces, resolving contradictions, summarizing the root cause, and fixing it.
In short, when troubleshooting a problem in a distributed system we need to:
- Collect logs from many machines and condense them into a single, causally ordered timeline.
- Recognize "noise" errors, i.e. errors and exceptions that have nothing to do with the final failure. In our case, because Task 3.0 happened to be scheduled onto the same machine in both Stage 2.0 and Stage 2.1, a local-write exception occurred. That exception is very misleading, yet it has nothing to do with the application failure and caused no data problem (Task 3.0 in Stage 2.0 had already succeeded).
- Accept that a task failure in a distributed system is usually triggered by one initial problem, while the follow-up actions (such as retries) introduce new errors. We therefore need to find and fix the initial error, and also evaluate how well our code copes when such an error occurs. In our case the initial error was a DNS problem, but the Stage retry it triggered exposed other weaknesses in our job, and those need fixing too.
The Spark job itself is very simple:
- read parquet from HDFS;
- repartition() to the required degree of parallelism so that the downstream executors receive an even share of the parquet data;
- on the executor side, run mapPartitions() per partition: write the data to a local parquet file, then use clickhouse-local to build a ClickHouse local part;
- scp the part into detached/ on the ClickHouse server;
- after collect() returns, the driver runs ALTER TABLE ... ATTACH PART for every part to land the data in the table.
Our Spark runs on YARN.
Drawn as a flow chart:
text
┌────────────────────────────┐
│ Spark Driver │
└──────────────┬─────────────┘
│
│ read HDFS parquet
v
┌──────────────────┐
│ Dataset[Row] │
└─────────┬────────┘
│ repartition(...)
v
┌──────────────────┐
│ toJavaRDD() │
└─────────┬────────┘
│ mapPartitions(ClickhouseSink)
v
┌────────────────────────────────────────────┐
│ Spark Executors │
│ (one per partition) │
└──────────────┬─────────────────────────────┘
│
│ 1) write local parquet
v
/corp/data/.../parquet-local/.../partition_N/data_partition_N.parquet
│
│ 2) clickhouse-local reads the parquet and builds a local part
v
/corp/data/.../clickhouse-local/.../_local/{table}/{part}
│
│ 3) scp to the ClickHouse server's detached
v
/corp/data/.../clickhouse/store/.../detached/{part}
│
│ 4) return AttachClickhouseInfo to the Driver (collect)
v
┌────────────────────────────┐
│ Spark Driver │
└──────────────┬─────────────┘
│
│ ALTER TABLE ... ATTACH PART
v
data becomes visible in the ClickHouse table
The diagram below shows how the Jobs/Stages of our Spark application depend on each other. As we will see, the root of all our problems lies in a task of Stage 2.0:
text
Driver: AppConverterLogic.execute()
┌───────────────────────────────────────────────────────────────────────┐
│ Job 0 │
│ ResultStage 0 (load at AppConverterLogic.java:196) │
│ Purpose: preparation for spark.read.format("parquet").load(hdfsPaths...) │
│ (listing files / reading footers / merging schemas, etc.) │
└───────────────────────────────────────────────────────────────────────┘
|
| Dataset<Row> parquetData
v
┌───────────────────────────────────────────────────────────────────────┐
│ Job 1 │
│ Stage 1: ShuffleMapStage 1 │
│ Trigger: repartition(partitionNum) inside readSourceParquet() │
│ Purpose: reshuffle the input data into the new partition count, producing shuffle blocks │
│ │
│ shuffle dependency │
│ (shuffleId = 0, mapId/reduceId in logs) │
│ │ │
│ v │
│ Stage 2: ResultStage 2 (collect at AppConverterLogic.java:106) │
│ Trigger: parquetData.toJavaRDD().mapPartitions(chSink).collect() │
│ Purpose: run ClickhouseSink for every partition │
│ - write local parquet │
│ - clickhouse-local builds a local part │
│ - scp to the ClickHouse detached/ dir │
│ Output: collect() returns List<AttachClickhouseInfo> to the driver │
└───────────────────────────────────────────────────────────────────────┘
|
| driver receives attachClickhouseInfos
v
┌───────────────────────────────────────────────────────────────────────┐
│ Driver side (not a Spark stage) │
│ attachClickhouseInfos.forEach(attachClickhouse(...)) │
│ Purpose: run ALTER TABLE ... ATTACH PART for every part │
└───────────────────────────────────────────────────────────────────────┘
The work done by each Job and Stage of the application is as follows:
- Job 0 / Stage 0 (ResultStage 0): essentially the preparation for reading the HDFS parquet — file listing, reading footers, and the necessary schema inference/merging (which is why it has only 1 task):
java
private Dataset<Row> readSourceParquet(int clickhouseShardNum) {
    ....
    Dataset<Row> parquetData = sparkSession.read()
            .format("parquet")
            .option("compression", "snappy")
            .load(paths.toArray(new String[0]));
    parquetData = parquetData.repartition(partitionNum); // this already belongs to the second job: the shuffle starts here
    ....
}
- Job 1 / Stage 1 (ShuffleMapStage 1): the map side of the shuffle introduced by parquetData.repartition(partitionNum) inside readSourceParquet(). It reshuffles the parquet data it has read (including the filter) into the new number of partitions and writes it out; the only reason a shuffle exists at all is the repartition(...):
java
private Dataset<Row> readSourceParquet(int clickhouseShardNum) {
    ....
    parquetData = parquetData.repartition(partitionNum); // this already belongs to the second job: the shuffle starts here
    ....
}
- Job 1 / Stage 2 (ResultStage 2): after the repartition in Job 1 / Stage 1, the data is evenly distributed to all executors of the next stage, which then processes the shuffled data as follows:
  - write the data into a local parquet file (Step 2 in the code);
  - for each ClickHouse server (usually one replica of one shard of the ClickHouse cluster):
    - invoke clickhouse-local to build the ClickHouse part data (Step 3 in the code);
    - scp the data into the detached directory on the remote ClickHouse server;
  - build the corresponding AttachClickhouseInfo object and hand it to the driver; once Job 1 has finished, the driver attaches the parts based on all the AttachClickhouseInfo objects it collected.
java
/** This implements FlatMapFunction and overrides call() */
public class ClickhouseSink implements FlatMapFunction<Iterator<Row>, AttachClickhouseInfo> {
    @Override
    public Iterator<AttachClickhouseInfo> call(Iterator<Row> rows) {
        try {
            log.info("Step 2. Write local parquet");
            long recordsWritten = writeLocalParquetByRdd(rows, partitionId, parquetLocalPath);
            List<String> hosts = ClickHouseUtil.selectHost(hostsList, partitionId);
            for (String host : hosts) {
                ScpUser scpUser = ConfigUtil.SCP_USER.copy(clickhouseServerPassword);
                try (IClickHouseOperation clickHouseOperation = bizType.getClickHouseOperation(entity, scpUser)) {
                    ClickHouseLocalTable clickHouseLocalTable = clickHouseOperation.getLocalTableEntity(host);
                    log.info("Step 3 Generate clickhouse data");
                    clickhouseLocalFolders = generateClickhouseLocalData(clickHouseOperation, partitionId, host,
                            clickhouseLocalPath, parquetLocalPath, clickHouseLocalTable);
                    log.info("Step 4 Sync to clickhouse server");
                    detachedPath = transferParquet(clickHouseOperation, partitionId, host,
                            clickHouseLocalTable.getDataPaths(), clickhouseLocalFolders);
                } catch (Exception e) {
                    throw new RuntimeException("error: generateLocalData failed:", e);
                }
                // This SCP info is collected by the driver, which then performs the attach in one place
                AttachClickhouseInfo attachClickhouseInfo = AttachClickhouseInfo.builder()
                        .clickhouseLocalFolders(clickhouseLocalFolders)
                        .clickhouseLocalPath(clickhouseLocalPath)
                        .detachedPath(detachedPath)
                        .host(host)
                        .partitionId(partitionId)
                        .records(recordsWritten)
                        .build();
                return Collections.singletonList(attachClickhouseInfo).iterator();
            }
        }
So, in short, this Spark job's purpose is straightforward: import a batch of parquet data from HDFS into ClickHouse and make it visible in the target table.
Architecturally, the job uses a dedicated Spark job to convert Parquet -> ClickHouse data and to ship the converted ClickHouse data to the ClickHouse machines, i.e. an offline write path; once that conversion-and-transfer job has completed, the driver attaches all of the detached parts in one pass, finishing the write.
The reason we let the executors produce the detached parts but leave the attach to the driver is that the attach — the step that finally makes the data visible to users — should happen on a single machine: either all parts get attached or none do. If each executor attached its own parts, retries and other unstable executor behavior would make the outcome nondeterministic.
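To make that division of labor concrete, here is a minimal sketch of the driver-side wiring; chSink and attachClickhouse() are placeholders standing in for our actual ClickhouseSink instance and attach routine, not the exact production code:
java
// Sketch only: assumes ClickhouseSink emits one AttachClickhouseInfo per (partition, ClickHouse host).
List<AttachClickhouseInfo> attachInfos = parquetData
        .toJavaRDD()
        .mapPartitions(chSink)   // executors: local parquet -> clickhouse-local part -> scp to detached/
        .collect();              // returns once Spark considers Job 1 finished

// Only after collect() returns does the driver make the data visible,
// so the parts are attached from a single place, all or nothing.
attachInfos.forEach(info -> attachClickhouse(info));  // ALTER TABLE ... ATTACH PART per part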
Of course, the job has multiple executors and our ClickHouse is a cluster, so we need a mapping from executor to ClickHouse server, i.e. which ClickHouse server a given executor's output should go to. We use a fixed hash-modulo scheme: knowing nothing more than the size of the ClickHouse cluster, each ClickHouse instance is assigned an equal number of executors' outputs, so the data stays balanced across the machines of the cluster.
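The fixed hash-modulo mapping can be as small as the sketch below (illustrative only; the real ClickHouseUtil.selectHost() returns a list of hosts, typically the replicas of the chosen shard, and may differ in other details):
java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public final class HostSelector {
    private HostSelector() {}

    /**
     * Deterministically map a Spark partition to a ClickHouse host. When the partition
     * count is a multiple of the host count, every host receives the same number of partitions.
     */
    public static String selectHost(List<String> hosts, int partitionId) {
        List<String> sorted = new ArrayList<>(hosts);
        Collections.sort(sorted);                    // stable order regardless of input order
        return sorted.get(partitionId % sorted.size());
    }
}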
Under normal circumstances, if the job that converts the data and lands it in the detached directories completes successfully, the driver-side attach should not fail on all ClickHouse servers, nor fail on some while succeeding on others, unless there is a network problem or a ClickHouse health problem.
In our case, however, the conversion and the write into detached went fine and the job succeeded, and the ClickHouse cluster was healthy, yet the final attach on the driver hit a "ClickHouse part not found" error on one ClickHouse server.
This post starts from how the problem was tracked down: we begin with the driver's exception, follow the trail to the executors and the ClickHouse detached directories, and then return to the source code to explain the fixes (a DNS fallback, atomic publication into detached, and retry isolation via local paths + remote naming).
So the overall data flow is:
text
HDFS Parquet
-> Spark Dataset.repartition(...)
-> toJavaRDD().mapPartitions(ClickhouseSink)
(each partition runs on an executor)
1) write local parquet: /parquet-local/.../partition_N/data_partition_N.parquet
2) clickhouse-local reads the local parquet and builds a local part: /clickhouse-local/.../_local/{table}/{part}
3) scp to the ClickHouse server's detached: {store}/{table}/detached/{part}
4) return AttachClickhouseInfo to the Driver
-> after collect() finishes, the Driver runs ALTER TABLE ... ATTACH PART for each detached part
Troubleshooting
The immediate exception on the driver
Naturally, the investigation starts with the logs.
The first thing we saw, on the YARN UI, was the attach failure:
shell
26/01/21 15:03:21 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: java.lang.RuntimeException: Failed to attach folder to clickhouse, host [rp506-2.iad7.prod.corp.com], clickhouseLocalFolders = [/corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_1_1_0_779054533,/corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_2_2_0_779054533]
at com.corp.storage.app.AppConverterLogic.attachClickhouse(AppConverterLogic.java:296)
at com.corp.storage.app.AppConverterLogic.lambda$execute$1(AppConverterLogic.java:138)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at com.corp.storage.app.AppConverterLogic.execute(AppConverterLogic.java:138)
at com.corp.storage.app.AppInsightsConverter.main(AppInsightsConverter.java:39)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:688)
Caused by: java.lang.RuntimeException: Retry failed.
at com.corp.storage.app.etl.core.utils.RetryUtil.retry(RetryUtil.java:37)
at com.corp.storage.app.etl.core.utils.RetryUtil.retry(RetryUtil.java:41)
at com.corp.storage.app.ch.ClickHouseOperation.attachToClickHouse(ClickHouseOperation.java:247)
at com.corp.storage.app.AppConverterLogic.attachClickhouse(AppConverterLogic.java:292)
... 19 more
Caused by: java.lang.RuntimeException: Clickhouse attach failed.
at com.corp.storage.app.ch.ClickHouseOperation.lambda$attachToClickHouse$1(ClickHouseOperation.java:266)
at com.corp.storage.app.etl.core.utils.RetryUtil.retry(RetryUtil.java:23)
... 22 more
This exception shows that attaching one part failed. As explained above, the attach only happens after every part has already been transferred into the detached directory of the remote ClickHouse server; reaching the driver's attach phase therefore means the conversion-and-transfer job had fully succeeded.
Looking first at the driver's attach log, we can see that a series of attaches before this one had already succeeded, and only this one failed:
shell
26/01/21 15:03:06 INFO ch.ClickHouseOperation: ClickHouse attach successfully.
26/01/21 15:03:06 INFO app.AppConverterLogic: AttachClickHouse, host: rp504-2.iad7.prod.corp.com, yarn local host: 10.72.1.146-oce1-spark-yarn-localssd-3.us-east4.prod.gcp.corp.com, clickhouseLocalFolders = /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_2/data/_local/oce_flow_pt1m_local/20260121_1_1_0_779054531,/corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_2/data/_local/oce_flow_pt1m_local/20260121_2_2_0_779054531, clickhouseLocalPath = /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_2, records = 440985
26/01/21 15:03:06 INFO ch.ClickHouseOperation: Attach command: [ALTER TABLE oce_flow_pt1m_local ATTACH PART '20260121_1_1_0_779054531'], host: rp504-2.iad7.prod.corp.com, detachedPath: /corp/data/nvme5n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/
26/01/21 15:03:06 INFO ch.ClickHouseOperation: ClickHouse attach successfully.
26/01/21 15:03:06 INFO ch.ClickHouseOperation: Attach command: [ALTER TABLE oce_flow_pt1m_local ATTACH PART '20260121_2_2_0_779054531'], host: rp504-2.iad7.prod.corp.com, detachedPath: /corp/data/nvme5n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/
26/01/21 15:03:06 INFO ch.ClickHouseOperation: ClickHouse attach successfully.
26/01/21 15:03:06 INFO app.AppConverterLogic: AttachClickHouse, host: rp505-1.iad7.prod.corp.com, yarn local host: 10.72.1.146-oce1-spark-yarn-localssd-3.us-east4.prod.gcp.corp.com, clickhouseLocalFolders = /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_3/data/_local/oce_flow_pt1m_local/20260121_1_1_0_779054532,/corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_3/data/_local/oce_flow_pt1m_local/20260121_2_2_0_779054532, clickhouseLocalPath = /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_3, records = 440986
26/01/21 15:03:06 INFO ch.ClickHouseOperation: Attach command: [ALTER TABLE oce_flow_pt1m_local ATTACH PART '20260121_1_1_0_779054532'], host: rp505-1.iad7.prod.corp.com, detachedPath: /corp/data/nvme6n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/
26/01/21 15:03:06 INFO ch.ClickHouseOperation: ClickHouse attach successfully.
26/01/21 15:03:06 INFO ch.ClickHouseOperation: Attach command: [ALTER TABLE oce_flow_pt1m_local ATTACH PART '20260121_2_2_0_779054532'], host: rp505-1.iad7.prod.corp.com, detachedPath: /corp/data/nvme6n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/
26/01/21 15:03:06 INFO ch.ClickHouseOperation: ClickHouse attach successfully.
26/01/21 15:03:06 INFO app.AppConverterLogic: AttachClickHouse, host: rp506-2.iad7.prod.corp.com, yarn local host: 10.72.1.146-oce1-spark-yarn-localssd-3.us-east4.prod.gcp.corp.com, clickhouseLocalFolders = /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_1_1_0_779054533,/corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_2_2_0_779054533, clickhouseLocalPath = /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4, records = 440988
26/01/21 15:03:06 INFO ch.ClickHouseOperation: Attach command: [ALTER TABLE oce_flow_pt1m_local ATTACH PART '20260121_1_1_0_779054533'], host: rp506-2.iad7.prod.corp.com, detachedPath: /corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/
26/01/21 15:03:06 WARN utils.RetryUtil: returnTime = 1, retry exception: java.lang.RuntimeException: Clickhouse attach failed.
26/01/21 15:03:07 INFO scheduler.TaskSetManager: Ignoring task-finished event for 6.0 in stage 2.1 because task 6 has already completed successfully
26/01/21 15:03:08 INFO ch.ClickHouseOperation: Attach command: [ALTER TABLE oce_flow_pt1m_local ATTACH PART '20260121_1_1_0_779054533'], host: rp506-2.iad7.prod.corp.com, detachedPath: /corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/
26/01/21 15:03:08 WARN utils.RetryUtil: returnTime = 2, retry exception: java.lang.RuntimeException: Clickhouse attach failed.
26/01/21 15:03:12 INFO ch.ClickHouseOperation: Attach command: [ALTER TABLE oce_flow_pt1m_local ATTACH PART '20260121_1_1_0_779054533'], host: rp506-2.iad7.prod.corp.com, detachedPath: /corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/
26/01/21 15:03:12 WARN utils.RetryUtil: returnTime = 3, retry exception: java.lang.RuntimeException: Clickhouse attach failed.
Let's take one of these log entries as an example and unpack what it means:
shell
26/01/21 15:03:06 INFO app.AppConverterLogic: AttachClickHouse, host: rp506-2.iad7.prod.corp.com, yarn local host: 10.72.1.146-oce1-spark-yarn-localssd-3.us-east4.prod.gcp.corp.com, clickhouseLocalFolders = /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_1_1_0_779054533,/corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_2_2_0_779054533, clickhouseLocalPath = /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4, records = 440988
26/01/21 15:03:06 INFO ch.ClickHouseOperation: Attach command: [ALTER TABLE oce_flow_pt1m_local ATTACH PART '20260121_1_1_0_779054533'], host: rp506-2.iad7.prod.corp.com, detachedPath: /corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/
26/01/21 15:03:06 WARN utils.RetryUtil: returnTime = 1, retry exception: java.lang.RuntimeException: Clickhouse attach failed.
These lines are simply the driver printing the detached-part information uploaded by the executors. From them the driver knows on which ClickHouse machine it needs to attach which part (the executor has already staged that part on the corresponding machine, in the detached state). Concretely, the entry says:
- The driver is attempting the attach on ClickHouse host rp506-2.iad7.prod.corp.com. These detached parts were obviously computed and scp'ed over by the corresponding executor, but which executor that was is not recorded here.
- The machine performing the attach is oce1-spark-yarn-localssd-3.us-east4.prod.gcp.corp.com, i.e. the current driver machine.
- The part being attached is named 20260121_1_1_0_779054533.
- The detached part sits in the ClickHouse detached directory /corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/. This directory was obtained by the executor by querying the ClickHouse instance; it is where ClickHouse requires detached parts to be placed. Once the executor has put the part there, the driver's later attach command does not need to provide the path; it only names the table and the part, and ClickHouse looks the part up in that table's detached directory itself:
sql
ALTER TABLE oce_flow_pt1m_local ATTACH PART '20260121_1_1_0_779054533'
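For reference, issuing that command over JDBC looks roughly like this (a sketch assuming the clickhouse-jdbc driver; the URL, port, and credentials are hypothetical, while our production code wraps this inside ClickHouseOperation.attachToClickHouse() behind RetryUtil):
java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AttachPartExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and credentials.
        String url = "jdbc:clickhouse://rp506-2.iad7.prod.corp.com:8123/default";
        try (Connection conn = DriverManager.getConnection(url, "default", "");
             Statement stmt = conn.createStatement()) {
            // No filesystem path is needed: ClickHouse searches the table's own detached/ directory.
            stmt.execute("ALTER TABLE oce_flow_pt1m_local ATTACH PART '20260121_1_1_0_779054533'");
        }
    }
}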
Furthermore, querying the ClickHouse database confirmed that partial data did exist (i.e. only part of the data had been ingested — the part brought in by the attach operations that succeeded).
Before finding the root cause we first had to delete the ClickHouse data and rerun the job (its input comes from HDFS, and the HDFS data has a retention policy of 1 day), so it was enough to remove the data manually and rerun the Spark job.
The rerun succeeded, which tells us the failed application died from some transient issue rather than a persistent machine outage, service unavailability, or disk failure.
With that out of the way, we started tracking down the cause.
Task 1.0 in Stage 2.0 fails: an UnknownHostException sets off a butterfly effect
First, in the driver log, the first conspicuous exception is a FetchFailed whose stack bottoms out in an UnknownHostException:
shell
26/01/21 15:02:09 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 23, rp408-25a.iad6.prod.corp.com, executor 10): FetchFailed(BlockManagerId(4, oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com, 7337, None), shuffleId=0, mapId=6, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com:7337
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:523)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:454)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
.....
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
at com.corp.storage.app.transform.engine.sink.IRddSinkTemplate$1.hasNext(IRddSinkTemplate.java:40)
at com.corp.storage.app.etl.core.utils.ParquetUtil.writeRecords(ParquetUtil.java:47)
at com.corp.storage.app.etl.core.utils.ParquetUtil.write(ParquetUtil.java:29)
at com.corp.storage.app.transform.engine.sink.Rdd2ParquetSinkTemplate.sink(Rdd2ParquetSinkTemplate.java:26)
at com.corp.storage.app.excutors.ClickhouseSink.writeLocalParquetByRdd(ClickhouseSink.java:180)
at com.corp.storage.app.excutors.ClickhouseSink.call(ClickhouseSink.java:89)
at com.corp.storage.app.excutors.ClickhouseSink.call(ClickhouseSink.java:35)
....
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com:7337
...
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
Caused by: java.net.UnknownHostException: oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com
at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
...
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
... 2 more
)
26/01/21 15:02:09 INFO scheduler.DAGScheduler: Marking ResultStage 2 (collect at AppConverterLogic.java:106) as failed due to a fetch failure from ShuffleMapStage 1 (toJavaRDD at AppConverterLogic.java:106)
This exception is printed on the driver, but that does not mean it happened on the driver: it happened on an executor and was collected by the driver. The natural next step is to go to the executor host it points at and dig through the logs there, because the line already says where the task was running: (TID 23, rp408-25a..., executor 10).
Two hosts appear in this line, which is easy to get stuck on: which machine is reporting the error, and which machine is it trying to reach? It breaks down like this:
- The exception is printed on the driver but is really the error information sent up by the executor; (TID 23, rp408-25a..., executor 10) means the failure happened on executor 10, whose host is rp408-25a....
- From BlockManagerId(4, oce1-spark-yarn-localssd-1..., 7337, ...) we can see that the reduce side needs to fetch shuffle data from another host, oce1-spark-yarn-localssd-1...:7337 (normally the port exposed by that node's BlockManager / external shuffle service), and that the fetch hit a java.net.UnknownHostException.
- So, overall, UnknownHostException oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com is a network error thrown on executor 10's machine while it tried to resolve/connect to oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com; the driver-side TaskSetManager merely "reports" it as the FetchFailed(...) of Lost task 1.0 in stage 2.0 and triggers the scheduler-level handling.
In the aggregated logs of executor host rp408-25a.iad6.prod.corp.com (where executor 10 ran) we did indeed find the same UnknownHostException (quoted verbatim):
shell
Caused by: java.net.UnknownHostException: oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com
at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
....
So at this point we can be certain of two things:
- shuffleId/mapId/reduceId tells us the failure happened while the reduce side was fetching shuffle blocks. Our code does contain a shuffle, because we use repartition() to get load balancing and a tunable degree of parallelism:
java
Dataset<Row> parquetData = sparkSession.read()
        .format("parquet")
        .option("compression", "snappy")
        .load(paths.toArray(new String[0]));
List<Integer> customerWhitelist = arguments.getCustomerWhitelist();
if (CollectionUtils.isNotEmpty(customerWhitelist)) {
    log.info("customerWhitelist is {}", StringUtils.join(",", customerWhitelist));
    parquetData = parquetData.filter(col("customerId").isin(customerWhitelist.toArray()));
} else {
    log.info("customerWhitelist is empty");
}
parquetData = parquetData.repartition(partitionNum);
- UnknownHostException tells us the problem is at the hostname-resolution level, not in ClickHouse or in the business logic. Our YARN cluster is heterogeneous: part of it sits in our own data center and part of it runs in remote GCP, and the network between the two is indeed occasionally unstable, so we did not dig into why DNS itself hiccuped — that kind of jitter happens regularly. What puzzled us was why a DNS blip could end in a "ClickHouse part not found" error at attach time. That is: since the driver received the information about the part it had to attach, that part must already have been converted and scp'ed on the executor side, so it should not be possible for the driver to be told "conversion + SCP succeeded" by the executor and then fail to find the part while attaching. That did not add up.
Stage 2.0 gets retried
On the Spark UI we did see the failed tasks, and in the driver log we saw that Stage 2 was indeed retried. In Spark, a stage attempt is written as [Stage Id].[Stage Attempt Id]; in the log we see Stage 2.0 and Stage 2.1, i.e. the first and second attempts of Stage 2.
These are the task launch entries of Stage 2.0:
shell
26/01/21 15:01:38 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 22, oce1-spark-yarn-localssd-3.us-east4.prod.gcp.corp.com, executor 3, partition 0, PROCESS_LOCAL, 7745 bytes)
26/01/21 15:01:38 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 2.0 (TID 23, rp408-25a.iad6.prod.corp.com, executor 10, partition 1, PROCESS_LOCAL, 7745 bytes)
26/01/21 15:01:38 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 2.0 (TID 24, oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com, executor 4, partition 2, PROCESS_LOCAL, 7745 bytes)
26/01/21 15:01:38 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 2.0 (TID 25, oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com, executor 8, partition 3, PROCESS_LOCAL, 7745 bytes)
26/01/21 15:01:38 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 2.0 (TID 26, oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com, executor 6, partition 4, PROCESS_LOCAL, 7745 bytes)
26/01/21 15:01:38 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 2.0 (TID 27, oce1-spark-yarn-localssd-4.us-east4.prod.gcp.corp.com, executor 2, partition 5, PROCESS_LOCAL, 7745 bytes)
26/01/21 15:01:38 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 2.0 (TID 28, oce1-spark-yarn-localssd-2.us-east4.prod.gcp.corp.com, executor 9, partition 6, PROCESS_LOCAL, 7745 bytes)
These are the task launch entries of Stage 2.1:
shell
26/01/21 15:02:22 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.1 (TID 31, 408-25a.iad6.prod.corp.com, executor 10, partition 0, PROCESS_LOCAL, 7745 bytes)
26/01/21 15:02:22 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 2.1 (TID 32, oce1-spark-yarn-lssd-4.us-east4.prod.gcp.corp.com, executor 2, partition 1, PROCESS_LOCAL, 7745 bytes)
26/01/21 15:02:22 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 2.1 (TID 33, 402-25a.iad6.prod.corp.com, executor 11, partition 2, PROCESS_LOCAL, 7745 bytes)
26/01/21 15:02:22 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 2.1 (TID 34, oce1-spark-yarn-lssd-1.us-east4.prod.gcp.corp.com, executor 6, partition 3, PROCESS_LOCAL, 7745 bytes)
26/01/21 15:02:22 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 2.1 (TID 35, oce1-spark-yarn-lssd-2.us-east4.prod.gcp.corp.com, executor 5, partition 4, PROCESS_LOCAL, 7745 bytes)
26/01/21 15:02:22 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 2.1 (TID 36, 406-25a.iad6.prod.corp.com, executor 7, partition 5, PROCESS_LOCAL, 7745 bytes)
26/01/21 15:02:22 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 2.1 (TID 37, oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com, executor 8, partition 6, PROCESS_LOCAL, 7745 bytes)
Because the retry happens at stage granularity, all tasks were rerun in Stage 2.1 even though only a single task had failed in Stage 2.0. As we will see later, the criterion for Job 1 succeeding is not that every task of the stage attempt Stage 2.1 succeeds, but that every task of Stage 2 succeeds — no matter whether a given task succeeded in Stage 2.0 or in Stage 2.1.
So the first thing to understand is why Stage 2.0 failed.
Searching the logs, we see that task 1.0 in stage 2.0 failed:
shell
26/01/21 15:02:09 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 23, rp408-25a.iad6.prod.corp.com, executor 10): FetchFailed(BlockManagerId(4, oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com, 7337, None), shuffleId=0, mapId=6, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com:7337
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:523)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:454)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
.....
So the DNS problem really was the instigator: it failed Stage 2.0, after which Spark launched another stage attempt, Stage 2.1.
It is important to note that when Spark launches Stage 2.1 it does not kill any task of Stage 2.0: after all tasks of Stage 2.1 have started, the tasks of Stage 2.0 keep running as before and still report their results to the Spark driver.
The Stage 2.0 retry triggered by the DNS failure became the root of all the problems that followed. Below we treat Stage 2.0 as the first stage attempt and Stage 2.1 as its retry attempt. The problems we can then run into include:
- Spark's criterion for declaring a job finished and moving on to the next step is not that every stage attempt of the job has finished, but that every task of the job has succeeded (regardless of whether a task belongs to Stage 2.0 or Stage 2.1). Those are two very different criteria.
  - For example, if Stage 2.0 is retried because Task 1.0 failed, then as soon as Task 1.0 succeeds in Stage 2.1 the job can be marked finished, even while the other tasks of Stage 2.1 are still running.
  - At that point, whatever the driver does next (in our case, starting the attach once the job is done) can be disturbed by the tasks that are still running.
- The launch of Stage 2.1 also does not wait for all tasks of Stage 2.0 to finish. If the tasks of Stage 2.1 and the tasks of Stage 2.0 are not properly isolated from each other, they can interfere in unpredictable ways, for example:
  - Tasks from the two attempts (and the driver) operate on the same remote resources, creating a race. In our case, the still-running Task 4.0 of Stage 2.1 deleted the detached part that the same task had already delivered in Stage 2.0, which is what ultimately broke the driver-side attach.
  - Tasks with the same index in the two attempts operate on the same local resources, creating a race. Remarkably, this also happened in our job: from the task launch entries of Stage 2.0 and Stage 2.1 above you can see that task 3.0 happened to be launched on the same machine, oce1-spark-yarn-lssd-1.us-east4.prod.gcp.corp.com, in both attempts, and both write files into the same directory, which leads to a conflict.
The "two-party race" between Task 3.0 in Stage 2's two attempts
Let's keep reading the logs.
Next we found that in Stage 2.1, Task 3.0 failed as well. Note that task 3.0 had succeeded in Stage 2.0:
shell
26/01/21 15:02:56 WARN scheduler.TaskSetManager: Lost task 3.0 in stage 2.1 (TID 34, oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com, executor 6): java.lang.RuntimeException: Job failed.
at com.corp.storage.app.excutors.ClickhouseSink.call(ClickhouseSink.java:141)
at com.corp.storage.app.excutors.ClickhouseSink.call(ClickhouseSink.java:35)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: error: generateLocalData failed:
at com.corp.storage.app.excutors.ClickhouseSink.call(ClickhouseSink.java:120)
... 14 more
Suppressed: java.lang.RuntimeException: error: generateLocalData failed:
... 15 more
Caused by: java.lang.RuntimeException: Error: execute shell command failed, host info: 10.72.0.218-oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com, errorMsg: Code: 107. DB::Exception: File /corp/data/ab-etl-workspace/parquet-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_3/data_partition_3.parquet doesn't exist. (FILE_DOESNT_EXIST)
at com.corp.storage.app.etl.core.utils.ShellCommandUtil.stderr(ShellCommandUtil.java:159)
at com.corp.storage.app.etl.core.utils.ShellCommandUtil.execute(ShellCommandUtil.java:118)
at com.corp.storage.app.ch.ClickHouseOperation.generateLocalData(ClickHouseOperation.java:311)
at com.corp.storage.app.excutors.ClickhouseSink.generateClickhouseLocalData(ClickhouseSink.java:200)
at com.corp.storage.app.excutors.ClickhouseSink.call(ClickhouseSink.java:114)
... 14 more
Caused by: java.lang.RuntimeException: Error: execute shell command failed, host info: 10.72.0.218-oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com, errorMsg: Code: 107. DB::Exception: File /corp/data/ab-etl-workspace/parquet-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_3/data_partition_3.parquet doesn't exist. (FILE_DOESNT_EXIST)
at com.corp.storage.app.etl.core.utils.ShellCommandUtil.stderr(ShellCommandUtil.java:159)
at com.corp.storage.app.etl.core.utils.ShellCommandUtil.execute(ShellCommandUtil.java:118)
at com.corp.storage.app.ch.ClickHouseOperation.generateLocalData(ClickHouseOperation.java:311)
at com.corp.storage.app.excutors.ClickhouseSink.generateClickhouseLocalData(ClickhouseSink.java:200)
at com.corp.storage.app.excutors.ClickhouseSink.call(ClickhouseSink.java:114)
... 14 more
Again, what this log tells us is:
- The failing task is task 3.0 in Stage 2.1, running on machine oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com.
- The failure happens inside generateLocalData(). From the code we can see that this is where the clickhouse-local command is invoked to turn the local parquet file (produced from the parquet source on remote HDFS after repartition()) into ClickHouse part files:
java
@Override
public void generateLocalData(ClickHouseLocalTable clickHouseLocalTable, String parquetPath, String clickHouseLocalPath) {
    ShellCommandUtil.rmDirectory(clickHouseLocalPath);
    log.info("Remove clickhouse previous old folder successfully, path is {}", clickHouseLocalPath);
    Map<String, String> clickHouseTableSchema = getClickHouseTableSchema(clickHouseLocalTable.getDatabase(), clickHouseLocalTable.getTableName(), true);
    log.info("clickHouseTableSchema = {}", JsonUtil.toJson(clickHouseTableSchema));
    String parquetSchemaStr = parquetSchemaStr(clickHouseTableSchema);
    String command = ConfigUtil.getClickHouseLocalBin() + " local --file " + parquetPath;
    command += " -S " + "\"" + parquetSchemaStr + "\"";
    command += " -N \"" + tmpTable + "\"";
    String ddl = clickHouseLocalTable.getCreateTableDDL();
    log.info("local table ddl: [{}]", ddl);
    String query = String.format(
            "\"%s; INSERT INTO TABLE %s SELECT %s FROM %s;\"",
            ddl,
            clickHouseLocalTable.getTableName(),
            String.join(",", selectStatement(clickHouseTableSchema)),
            tmpTable);
    log.info("Query is : [{}]", query);
    command += " -q " + query;
    command += " --path " + "\"" + clickHouseLocalPath + "\"";
    command += " --allow_experimental_object_type 1";
    log.info("Final clickhouse local data command is: [{}]", command);
    long begin = System.currentTimeMillis();
    ShellCommandUtil.execute(command);
    log.info("Generate clickhouse local folder successfully. Cost time is {}ms.", System.currentTimeMillis() - begin);
}
So the question becomes: why would this local file go missing?
Because the exception comes from task 3.0 of stage 2.1, our first suspicion was that the tasks of Stage 2 might simply not be retry-safe. For instance, if a task's input had to be a parquet file that was already generated locally, then once the task is retried on a different machine, its source file would never be found.
But from the stage breakdown above we know that the input of Stage 2 is the shuffle map output of Stage 1's repartition(), not any local file. When generateLocalData() builds ClickHouse data from a local parquet file, that file was produced independently by the task itself: Task 3.0 generates its own local parquet file (from Stage 1's repartition()) in both Stage 2.0 and Stage 2.1 and then invokes clickhouse-local.
Which raises the question: why did Task 3.0 in Stage 2.1 not end up with its local parquet file in place (either it was never written, or it was written and then deleted)?
Then, to our surprise, we discovered that by sheer coincidence task 3.0 had been scheduled onto the same executor machine in both Stage 2.0 and Stage 2.1: oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com
shell
26/01/21 15:01:38 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 2.0 (TID 25, oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com, executor 8, partition 3, PROCESS_LOCAL, 7745 bytes)
.....
26/01/21 15:02:22 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 2.1 (TID 34, oce1-spark-yarn-lssd-1.us-east4.prod.gcp.corp.com, executor 6, partition 3, PROCESS_LOCAL, 7745 bytes)
.....
Given that, we started to suspect that something Task 3.0 in Stage 2.0 did had side effects on Task 3.0 in Stage 2.1. To confirm it we had to lay out the concrete timestamps of the two tasks and check whether they overlapped.
The driver log below shows that Task 3.0 in Stage 2.1 did start before Task 3.0 in Stage 2.0 finished, so an overlap was indeed possible:
shell
26/01/21 15:01:38 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 2.0 (TID 25, oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com, executor 8, partition 3, PROCESS_LOCAL, 7745 bytes)
818:26/01/21 15:02:22 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 2.1 (TID 34, oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com, executor 6, partition 3, PROCESS_LOCAL, 7745 bytes)
839:26/01/21 15:02:24 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 2.0 (TID 25) in 45747 ms on oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com (executor 8) (3/7)
853:26/01/21 15:02:56 WARN scheduler.TaskSetManager: Lost task 3.0 in stage 2.1 (TID 34, oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com, executor 6): java.lang.RuntimeException: Job failed.
We then reconstructed Task 3.0's execution timeline in both stage attempts in detail, because we strongly suspected that the final cleanup of Task 3.0 in Stage 2.0 deleted exactly the input of Task 3.0 in Stage 2.1.
The picture below shows the race between the two Task 3.0s: Task 3.0 in Stage 2.0 deleted partition_3/data_partition_3.parquet right after Task 3.0 in Stage 2.1 had written it and right before Stage 2.1 got to read it.
text
time ↓
Stage 2.0 / Task 3.0 (TID 25) Stage 2.1 / Task 3.0 (TID 34)
─────────────────────────────────────────────────────────── ───────────────────────────────────────────────────────────
15:01:39 rm -rf .../partition_3/data_partition_3.parquet
15:01:39 Start write local .../partition_3/data_partition_3.parquet
15:02:13 Write local parquet ... successfully (440986 records)
15:02:13 clickhouse local --file .../partition_3/data_partition_3.parquet
... 15:02:22 rm -rf .../partition_3/data_partition_3.parquet
15:02:22 Start write local .../partition_3/data_partition_3.parquet
15:02:24 cleanLocalTempDir ... parquetLocalDir = .../partition_3/
15:02:24 rm -rf .../partition_3/ ───────────────► (this removes the directory itself; the write on the right is still in progress)
15:02:24 Finished task 3.0 in stage 2.0 (TID 25)
15:02:56 Write local parquet ... successfully (440986 records)
15:02:56 clickhouse local --file .../partition_3/data_partition_3.parquet
15:02:56 DB::Exception: File .../partition_3/data_partition_3.parquet doesn't exist. (FILE_DOESNT_EXIST)
15:02:56 generateLocalData failed
15:02:56 cleanLocalTempDir ... rm -rf .../partition_3/
So now the failure of Task 3.0 in Stage 2.1 is fully explained: Task 3.0 in Stage 2.0 deleted exactly the temporary parquet data that Task 3.0 in Stage 2.1 still needed.
Why did the job still succeed even though Task 1.0 in Stage 2.0 failed and Task 3.0 in Stage 2.1 failed too? Because Spark's rule for a successful job is: a stage is successful as long as every one of its tasks has succeeded, regardless of which attempt of the stage a task succeeded in.
The failure of Task 3.0 in Stage 2.1 therefore affects neither the success of the application nor the data: Task 3.0 in Stage 2.0 succeeded and had already transferred its part to the remote ClickHouse server, while Task 3.0 in Stage 2.1 failed as early as the clickhouse-local step. The part sitting on the ClickHouse server is the one produced by Task 3.0 in Stage 2.0 and is perfectly fine, and the driver later receives Task 3.0 in Stage 2.0's part information and attaches it.
The three-party race around Task 4.0 in Stage 2's two attempts
Now we finally get to the real question: why, when the driver eventually attaches part 20260121_1_1_0_779054533, does it get Detached part "20260121_1_1_0_779054533" not found — i.e. why does the part not exist on the ClickHouse server?
First, by searching the executor logs, we located the task responsible for this part: Task 4.0.
It turned out to be a "three-party race" between Task 4.0 in Stage 2.0, Task 4.0 in Stage 2.1, and the driver.
The race is sketched below:
text
time ↓
Stage 2.0 / Task 4.0 Stage 2.1 / Task 4.0 Driver (AM / AppConverterLogic)
oce1-spark-yarn-localssd-1 oce1-spark-yarn-localssd-2 oce1-spark-yarn-localssd-3
─────────────────────────────────────────────────── ──────────────────────────────────────────────────── ─────────────────────────────────────────────────────
15:02:18 renameFolders -> 20260121_1_1_0_779054533
15:02:18 cleanClickhouseRemoteData (rm remote part dir)
15:02:18 Transfer (scp) -> detached/
15:02:21 md5 check (stage2.0 side)
15:02:24 Finished task 4.0 in stage 2.0
15:03:02 renameFolders -> 20260121_1_1_0_779054533
15:03:02 cleanClickhouseRemoteData (rm remote part dir)
15:03:04 Remove remote .../detached/20260121_1_1_0_779054533
15:03:06 ATTACH PART '20260121_1_1_0_779054533' (attempt 1)
15:03:06 retry exception: Clickhouse attach failed
15:03:07 Transfer (scp) -> detached/
15:03:08 scp: .../detached//20260121_1_1_0_779054533/... No such file or directory
15:03:08 ATTACH PART '20260121_1_1_0_779054533' (attempt 2)
15:03:08 retry exception: Clickhouse attach failed
15:03:10 retry exception: MD5 does not match
15:03:12 cleanClickhouseRemoteData (rm remote part dir)
15:03:12 Remove remote .../detached/20260121_1_1_0_779054533 ◄─────┐
15:03:12 ATTACH PART '20260121_1_1_0_779054533' (attempt 3) │
15:03:12 retry exception: Clickhouse attach failed │
│
15:03:14 Transfer (scp) -> detached/ │
15:03:16 Finished task 4.0 in stage 2.1 │
│
15:03:21 ClickHouseException: Detached part \"20260121_1_1_0_779054533\" not found
From the race diagram we can see:
- Task 4.0 in Stage 2.0 actually ran to completion (its part had certainly been scp'ed to the remote ClickHouse machine), but because Stage 2 was retried as a whole due to Task 1.0, Task 4.0 was rerun as well, even though it had not failed in Stage 2.0;
- Task 4.0 in Stage 2.0 and Task 4.0 in Stage 2.1 ran on different machines, so the race between them and the driver is not over local files but over the remote ClickHouse detached directory;
- In Stage 2, before generating the ClickHouse data with clickhouse-local, we first wipe the target detached/part_name directory on the ClickHouse server to avoid conflicts. But because Stage 2 was retried, and because the driver does not need to wait for all tasks of Stage 2.1 before declaring Job 1 finished, the Spark driver marked Job 1 as done and started the attach while Task 4.0 in Stage 2.1 was still running. Stage 2.1's Task 4.0 then wiped the part directory on the ClickHouse server before scp'ing the part over again, and the attach fell right into that window and failed.
The key log entries of Task 4.0 in Stage 2.0, in Stage 2.1, and on the driver are:
- Stage 2.0 / Task 4.0 (on oce1-spark-yarn-localssd-1):
text
26/01/21 15:02:18 INFO ch.ClickHouseOperation: renameFolders - originalPath = /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_1_1_0, newPath = /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_1_1_0_779054533
26/01/21 15:02:18 INFO ch.ClickHouseOperation: cleanClickhouseRemoteData - removePaths = /corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/20260121_1_1_0_779054533, /corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/20260121_2_2_0_779054533
26/01/21 15:02:18 INFO utils.ScpCommandUtil: Remove remote [/corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/20260121_1_1_0_779054533], host: rp506-2.iad7.prod.corp.com
26/01/21 15:02:18 INFO utils.ScpCommandUtil: Transfer, local path: /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_1_1_0_779054533, remote path: /corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/
26/01/21 15:02:24 INFO executor.Executor: Finished task 4.0 in stage 2.0 (TID 26). 7062 bytes result sent to driver
- Stage 2.1 / Task 4.0 (on oce1-spark-yarn-localssd-2); the first scp fails, and the subsequent retry removes the remote directory again:
text
26/01/21 15:03:02 INFO ch.ClickHouseOperation: renameFolders - originalPath = /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_1_1_0, newPath = /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_1_1_0_779054533
26/01/21 15:03:04 INFO utils.ScpCommandUtil: Remove remote [/corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/20260121_1_1_0_779054533], host: rp506-2.iad7.prod.corp.com
26/01/21 15:03:07 INFO utils.ScpCommandUtil: Transfer, local path: /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_1_1_0_779054533, remote path: /corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/
26/01/21 15:03:08 WARN common.ScpHelper: validateCommandStatusCode(ScpHelper[ClientSessionImpl[clickhouse@rp506-2.iad7.prod.corp.com/10.12.69.102:22]])[/corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_1_1_0_779054533/metricFloatGroup7.size0.cmrk2] advisory ACK=1: scp: /corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached//20260121_1_1_0_779054533/metricFloatGroup7.size0.cmrk2: set times: No such file or directory for command=C0644 146 metricFloatGroup7.size0.cmrk2
26/01/21 15:03:10 WARN utils.RetryUtil: returnTime = 1, retry exception: java.lang.RuntimeException: Error: The MD5 of the local file and the remote file does not match. Local path: /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_1_1_0_779054533, remote path: /corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/20260121_1_1_0_779054533
26/01/21 15:03:12 INFO utils.ScpCommandUtil: Remove remote [/corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/20260121_1_1_0_779054533], host: rp506-2.iad7.prod.corp.com
26/01/21 15:03:14 INFO utils.ScpCommandUtil: Transfer, local path: /corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_1_1_0_779054533, remote path: /corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/
26/01/21 15:03:16 INFO executor.Executor: Finished task 4.0 in stage 2.1 (TID 35). 7019 bytes result sent to driver
- The driver retries the ATTACH of the same part and finally throws "Detached part not found":
text
26/01/21 15:03:06 INFO ch.ClickHouseOperation: Attach command: [ALTER TABLE oce_flow_pt1m_local ATTACH PART '20260121_1_1_0_779054533'], host: rp506-2.iad7.prod.corp.com, detachedPath: /corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/
26/01/21 15:03:06 WARN utils.RetryUtil: returnTime = 1, retry exception: java.lang.RuntimeException: Clickhouse attach failed.
26/01/21 15:03:08 INFO ch.ClickHouseOperation: Attach command: [ALTER TABLE oce_flow_pt1m_local ATTACH PART '20260121_1_1_0_779054533'], host: rp506-2.iad7.prod.corp.com, detachedPath: /corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/
26/01/21 15:03:08 WARN utils.RetryUtil: returnTime = 2, retry exception: java.lang.RuntimeException: Clickhouse attach failed.
26/01/21 15:03:12 INFO ch.ClickHouseOperation: Attach command: [ALTER TABLE oce_flow_pt1m_local ATTACH PART '20260121_1_1_0_779054533'], host: rp506-2.iad7.prod.corp.com, detachedPath: /corp/data/nvme7n1/clickhouse/store/57a/57aa2c6f-915b-4a4f-a080-ef8975632763/detached/
26/01/21 15:03:12 WARN utils.RetryUtil: returnTime = 3, retry exception: java.lang.RuntimeException: Clickhouse attach failed.
26/01/21 15:03:21 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: Failed to attach folder to clickhouse, host [rp506-2.iad7.prod.corp.com], clickhouseLocalFolders = [/corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_1_1_0_779054533,/corp/data/ab-etl-workspace/clickhouse-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_4/data/_local/oce_flow_pt1m_local/20260121_2_2_0_779054533]
Caused by: com.clickhouse.client.ClickHouseException: Code: 233. DB::Exception: Detached part "20260121_1_1_0_779054533" not found. (BAD_DATA_PART_NAME) (version 24.8.4.13 (official build))
The fix: closing the race windows
With the problem laid out, the goals of the fix are clear: either reduce the probability of entering a retry at all, or make retries and concurrent attempts safe when they do happen. The three changes below target exactly those weak points.
- **/etc/hosts: kill the UnknownHostException with deterministic resolution.** Since the trigger was a DNS resolution failure, we pin the hostname-to-IP mappings in /etc/hosts, reducing the chance of a resolution error.
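For example, pinning the shuffle-service host that failed to resolve (the IP below is taken from the host info string in the executor log above; treat the entry as illustrative):
text
# /etc/hosts entry on the YARN nodes (illustrative)
10.72.0.218   oce1-spark-yarn-localssd-1.us-east4.prod.gcp.corp.com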
- **Atomic publication into detached: upload to a hidden directory first, then a single mv into detached.** Second, the race analysis shows that Task 4.0 could race so easily across Stage 2.0 and Stage 2.1 because detached goes through a sizeable window of "delete the remote detached directory → scp the part to the remote ClickHouse server → checksum mismatch → retry". If the driver's attach happens to land in the window between a Stage 2.1 task deleting the remote part and regenerating/scp'ing it, the attach fails.
The idea is to turn detached from a "transfer staging area" into a "commit area": for each part, upload into a hidden directory first and publish it in a single step only once it is complete.
  - Create a hidden directory on the remote side: {detached}/.tmp_{part}_{uuid}/
  - scp into the hidden directory
  - After verification / permission fixes, run one remote command: mv {detached}/.tmp_{part}_{uuid} {detached}/{part}
If the mv is a directory rename within the same filesystem, it is close to atomic: the "delete → regenerate" window under detached collapses to the instant of a single rename, drastically shrinking the chance of the attach stepping into a hole. The change still lands at the same function entry points as before (sketch):
java
@Override
public void transferClickHouseData(String host, List<String> clickHouseLocalFolders, String detachedPath) {
    RetryUtil.retry(() -> {
        // 1) upload to {detached}/.tmp_{part}_{uuid}
        // 2) verify/chown
        // 3) mv tmp -> {detached}/{part}
        return true;
    });
}
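A slightly fuller sketch of that publish sequence (RemoteShell, scpDirectory and verifyChecksums below are hypothetical stand-ins for our ssh/scp utilities; only the ordering and the final rename matter):
java
import java.util.UUID;

public class AtomicDetachedPublish {
    // Hypothetical helper that runs a command on the ClickHouse host over ssh
    // (in production this role is played by our Scp/Shell command utilities).
    interface RemoteShell { void run(String host, String command); }

    /**
     * Publish a local part directory as {detachedPath}/{partName} without ever exposing
     * a half-copied part: everything is staged under a hidden directory and made
     * visible by a single same-filesystem rename.
     */
    public static void publish(RemoteShell shell, String host,
                               String localPartDir, String detachedPath, String partName) {
        String staging = detachedPath + "/.tmp_" + partName + "_" + UUID.randomUUID();
        shell.run(host, "mkdir -p " + staging);
        scpDirectory(localPartDir, host, staging);     // copy files into the hidden staging dir
        verifyChecksums(localPartDir, host, staging);  // a failure here leaves nothing visible
        shell.run(host, "chown -R clickhouse:clickhouse " + staging);
        // The only moment the part appears under detached/ is this rename.
        shell.run(host, "mv " + staging + " " + detachedPath + "/" + partName);
    }

    // Placeholders for the actual scp transfer and md5 verification steps.
    private static void scpDirectory(String localDir, String host, String remoteDir) { /* ... */ }
    private static void verifyChecksums(String localDir, String host, String remoteDir) { /* ... */ }
}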
- **Attempt isolation: both local paths and remote naming must include stageAttempt + taskAttempt.** We also saw that because Task 3.0 in Stage 2.0 and Task 3.0 in Stage 2.1 happened to land on the same machine, they collided while writing local directories. That is because the local paths were built without retries in mind; they only guarantee that different Spark partitions running on the same executor do not collide, by embedding the Spark partition in the path, for example:
shell
/corp/data/ab-etl-workspace/parquet-local/default/oce_flow_pt1m_local/2026/01/21/14/59/fcc00f88-564c-427a-917c-8499a52d6c99/partition_3/data_partition_3.parquet
So different attempts must never share the same "mutable object"; then even when they run on the same machine there is nothing to conflict over:
  - the local parquet path (PARQUET_LOCAL_PATH)
  - the local clickhouse-local path (CLICKHOUSE_TEMP_PATH)
  - the remote publish directory / part naming (so different attempts never use the same name)
We can derive an attemptKey (conceptually):
text
attemptKey = s<stageId>_sa<stageAttempt>_ta<taskAttemptId>
and inject the attemptKey into the paths and names (example):
text
.../parquet-local/.../<batchId>/<attemptKey>/partition_4/data_partition_4.parquet
.../clickhouse-local/.../<batchId>/<attemptKey>/partition_4/...
{detached}/.tmp_{part}_{attemptKey}/ -> mv -> {detached}/{part}_{attemptKey}
With this in place, even if the tasks of stage 2.0 and stage 2.1 run at the same time, each writes only its own directories and they cannot affect each other, and the driver's attach can no longer be broken by another attempt's rm.
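Everything the attemptKey needs is available from Spark's TaskContext inside a running task; a minimal sketch (the path layout below is illustrative, not our exact production layout):
java
import org.apache.spark.TaskContext;

public final class AttemptPaths {
    private AttemptPaths() {}

    /** Build a key that is unique per (stage, stage attempt, task attempt). */
    public static String attemptKey() {
        TaskContext tc = TaskContext.get(); // non-null only inside a running task
        return "s" + tc.stageId()
                + "_sa" + tc.stageAttemptNumber()
                + "_ta" + tc.taskAttemptId();
    }

    /** Local parquet path that two concurrent attempts can never share. */
    public static String localParquetPath(String baseDir, String batchId, int partitionId) {
        return baseDir + "/" + batchId + "/" + attemptKey()
                + "/partition_" + partitionId
                + "/data_partition_" + partitionId + ".parquet";
    }
}
Because taskAttemptId() is unique within a SparkContext, even speculative duplicates of the same task end up with their own directories.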
Takeaways
- What Spark's success criterion checks is "did the computation produce something it can return", not "are the external side effects consistent". When external writes live inside tasks, you must assume that once a retry kicks in, the same slice of data may be executed more than once and its external actions triggered more than once.
- A "delete first, then transfer" pattern for remote writes is inherently dangerous under retries: it opens a window in detached during which attach becomes a matter of luck.
- Isolating attempts is the simplest and most effective remedy: local paths, remote directories, and part names should all include stageAttempt + taskAttempt (or an equivalent unique key).
- Infrastructure stability (DNS / name resolution) is not a small matter: it pushes the system onto the retry path and thereby amplifies every non-idempotent side effect.