Spark read load Parquet Files

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.

Loading Data Programmatically

Using the data from the above example:

  • Python

  • Scala

  • Java

  • R

  • SQL

    peopleDF = spark.read.json("examples/src/main/resources/people.json")

    DataFrames can be saved as Parquet files, maintaining the schema information.

    peopleDF.write.parquet("people.parquet")

    Read in the Parquet file created above.

    Parquet files are self-describing so the schema is preserved.

    The result of loading a parquet file is also a DataFrame.

    parquetFile = spark.read.parquet("people.parquet")

    Parquet files can also be used to create a temporary view and then used in SQL statements.

    parquetFile.createOrReplaceTempView("parquetFile")
    teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
    teenagers.show()

    +------+

    | name|

    +------+

    |Justin|

    +------+

Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.

Schema Merging

Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by

  1. setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
  2. setting the global SQL option spark.sql.parquet.mergeSchema to true
  • Python

  • Scala

  • Java

  • R

    from pyspark.sql import Row

    spark is from the previous example.

    Create a simple DataFrame, stored into a partition directory

    sc = spark.sparkContext

    squaresDF = spark.createDataFrame(sc.parallelize(range(1, 6))
    .map(lambda i: Row(single=i, double=i ** 2)))
    squaresDF.write.parquet("data/test_table/key=1")

    Create another DataFrame in a new partition directory,

    adding a new column and dropping an existing column

    cubesDF = spark.createDataFrame(sc.parallelize(range(6, 11))
    .map(lambda i: Row(single=i, triple=i ** 3)))
    cubesDF.write.parquet("data/test_table/key=2")

    Read the partitioned table

    mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
    mergedDF.printSchema()

    The final schema consists of all 3 columns in the Parquet files together

    with the partitioning column appeared in the partition directory paths.

    root

    |-- double: long (nullable = true)

    |-- single: long (nullable = true)

    |-- triple: long (nullable = true)

    |-- key: integer (nullable = true)

    // This is used to implicitly convert an RDD to a DataFrame.
    import spark.implicits._

    // Create a simple DataFrame, store into a partition directory
    val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
    squaresDF.write.parquet("data/test_table/key=1")

    // Create another DataFrame in a new partition directory,
    // adding a new column and dropping an existing column
    val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
    cubesDF.write.parquet("data/test_table/key=2")

    // Read the partitioned table
    val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
    mergedDF.printSchema()

    // The final schema consists of all 3 columns in the Parquet files together
    // with the partitioning column appeared in the partition directory paths
    // root
    // |-- value: int (nullable = true)
    // |-- square: int (nullable = true)
    // |-- cube: int (nullable = true)
    // |-- key: int (nullable = true)

相关推荐
真上帝的左手14 小时前
19. 大数据- BI - AI 应用2-AI模型部署与企业落地
大数据·人工智能·ai·bi
东方巴黎~Sunsiny14 小时前
实战:RocketMQ 幂等 + Redis 分布式锁 + 异常重试 保姆级教程
redis·分布式·rocketmq
Herlie14 小时前
2026新品上新季:3款AI电商套图生成工具实测
大数据·人工智能
珠海西格电力14 小时前
西格电力零碳园区管理系统的技术架构是怎样的?
大数据·运维·人工智能·物联网·架构·能源
数幄科技15 小时前
电力装备制造业智能化转型】【数据基础设施篇】【5】数据采集 ETL 的可靠性设计
大数据·人工智能·算法·数据治理·数幄科技
海伯森技术15 小时前
海伯森3D线光谱共焦精密测量技术及产业化应用
大数据·人工智能·3d
打码人的日常分享15 小时前
信息化数据安全管理制度办法(Word)
大数据·运维·网络·云计算·制造
大大大大晴天️15 小时前
Hudi文件布局:COW与MOR表案例解析
大数据·hudi
库拉大叔15 小时前
大模型AI横评实测:GPT-4与Claude 3.5三大维度对比,落地选型怎么选?
大数据·人工智能
ModelHub XC信创模盒15 小时前
压力之下,重构赛道:从中美AI博弈到信创生态的深层跃迁
大数据·人工智能·重构·开源·信创·范式