Spark read load Parquet Files

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.

Loading Data Programmatically

Using the data from the above example:

  • Python

  • Scala

  • Java

  • R

  • SQL

    peopleDF = spark.read.json("examples/src/main/resources/people.json")

    DataFrames can be saved as Parquet files, maintaining the schema information.

    peopleDF.write.parquet("people.parquet")

    Read in the Parquet file created above.

    Parquet files are self-describing so the schema is preserved.

    The result of loading a parquet file is also a DataFrame.

    parquetFile = spark.read.parquet("people.parquet")

    Parquet files can also be used to create a temporary view and then used in SQL statements.

    parquetFile.createOrReplaceTempView("parquetFile")
    teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
    teenagers.show()

    +------+

    | name|

    +------+

    |Justin|

    +------+

Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.

Schema Merging

Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by

  1. setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
  2. setting the global SQL option spark.sql.parquet.mergeSchema to true
  • Python

  • Scala

  • Java

  • R

    from pyspark.sql import Row

    spark is from the previous example.

    Create a simple DataFrame, stored into a partition directory

    sc = spark.sparkContext

    squaresDF = spark.createDataFrame(sc.parallelize(range(1, 6))
    .map(lambda i: Row(single=i, double=i ** 2)))
    squaresDF.write.parquet("data/test_table/key=1")

    Create another DataFrame in a new partition directory,

    adding a new column and dropping an existing column

    cubesDF = spark.createDataFrame(sc.parallelize(range(6, 11))
    .map(lambda i: Row(single=i, triple=i ** 3)))
    cubesDF.write.parquet("data/test_table/key=2")

    Read the partitioned table

    mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
    mergedDF.printSchema()

    The final schema consists of all 3 columns in the Parquet files together

    with the partitioning column appeared in the partition directory paths.

    root

    |-- double: long (nullable = true)

    |-- single: long (nullable = true)

    |-- triple: long (nullable = true)

    |-- key: integer (nullable = true)

    // This is used to implicitly convert an RDD to a DataFrame.
    import spark.implicits._

    // Create a simple DataFrame, store into a partition directory
    val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
    squaresDF.write.parquet("data/test_table/key=1")

    // Create another DataFrame in a new partition directory,
    // adding a new column and dropping an existing column
    val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
    cubesDF.write.parquet("data/test_table/key=2")

    // Read the partitioned table
    val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
    mergedDF.printSchema()

    // The final schema consists of all 3 columns in the Parquet files together
    // with the partitioning column appeared in the partition directory paths
    // root
    // |-- value: int (nullable = true)
    // |-- square: int (nullable = true)
    // |-- cube: int (nullable = true)
    // |-- key: int (nullable = true)

相关推荐
快乐非自愿8 分钟前
抛弃传统AI:OpenClaw与Skill重构AI生产力,技术范式不可逆
大数据·人工智能
网络研究员24 分钟前
Claude身份认证后还是被封?三条稳定防封策略
大数据·人工智能
TuCoder27 分钟前
2026年了,景区制作智慧导地图有哪些选择?
大数据
2601_9499251844 分钟前
基于 OpenClaw 打造货代行业 AI 智能体架构实战
大数据·人工智能·架构·ai智能体
zhengyquan1 小时前
7000mAh 电池 + 独立 AI 键,小米 18 Pro 是堆料还是突破?
大数据·人工智能
geneculture1 小时前
意识的多学科定义:从16个视域,到融智学统合——基于“意+识”框架且区分“意识≠心智”系统研究
大数据·人工智能·融智学的重要应用·哲学与科学统一性·融智时代(杂志)·意识=意+识·智=信息处理+选择用意
Ai173163915792 小时前
GB200 NVL72超节点深度解析:架构、生态与产业格局
大数据·服务器·人工智能·神经网络·机器学习·计算机视觉·架构
观远数据2 小时前
跨部门BI推广权限治理指南:如何避免数据泄露与权责混乱
大数据·人工智能·数据分析
xierui1231232 小时前
探索型 AI 与交付型 AI:两种截然不同的技术物种
大数据·人工智能·效率工具·ai工具·大模型应用·aiagent·agent架构
观远数据2 小时前
跨部门指标统一治理:如何消除数据口径歧义提升决策效率
大数据·人工智能·数据挖掘·数据分析