【spark】dataframe慎用limit

Code_LT2023-09-02 23:02

官方：limit通常和order by一起使用，保证结果是确定的

limit 会有两个步骤：

LocalLimit ，发生在每个partition
GlobalLimit，发生shuffle，聚合到一个parttion

当提取的n大时，第二步是比较耗时的

复制代码

== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand (5)
+- * GlobalLimit (4)
   +- Exchange (3)
      +- * LocalLimit (2)
         +- Scan csv  (1)

如果对取样顺序没有要求，可用tablesample替代，使用详解。

复制代码

== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand (3)
+- * Sample (2)
   +- Scan csv  (1)

参考

官方
 Stop using the LIMIT clause wrong with Spark
DataFrame orderBy followed by limit in Spark

上一篇：20个不可错过的VScode神级插件

下一篇：Java获取文件内容IO流