Reading and Writing Parquet Files in Spark

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
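
A minimal sketch of that nullability conversion (assuming an existing SparkSession named spark; the column name and path are illustrative):

    from pyspark.sql.types import StructType, StructField, LongType

    # Build a DataFrame with an explicitly non-nullable column.
    schema = StructType([StructField("id", LongType(), nullable=False)])
    df = spark.createDataFrame([(1,), (2,)], schema)
    df.printSchema()   # id: long (nullable = false)

    # After a round trip through Parquet, the column comes back nullable.
    df.write.parquet("ids.parquet")
    spark.read.parquet("ids.parquet").printSchema()   # id: long (nullable = true)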

Loading Data Programmatically

Using the data from the above example:

Python

    peopleDF = spark.read.json("examples/src/main/resources/people.json")

    # DataFrames can be saved as Parquet files, maintaining the schema information.
    peopleDF.write.parquet("people.parquet")

    # Read in the Parquet file created above.
    # Parquet files are self-describing so the schema is preserved.
    # The result of loading a parquet file is also a DataFrame.
    parquetFile = spark.read.parquet("people.parquet")

    # Parquet files can also be used to create a temporary view and then used in SQL statements.
    parquetFile.createOrReplaceTempView("parquetFile")
    teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
    teenagers.show()
    # +------+
    # |  name|
    # +------+
    # |Justin|
    # +------+

Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
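
The same round trip also works through Spark's generic load/save functions; a minimal sketch (the copy path is illustrative):

    # Equivalent to spark.read.parquet(...) and df.write.parquet(...).
    df = spark.read.load("people.parquet", format="parquet")
    df.write.save("people_copy.parquet", format="parquet")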

Schema Merging

Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.

Since schema merging is a relatively expensive operation and is not a necessity in most cases, we turned it off by default starting from Spark 1.5.0. You may enable it by

  1. setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
  2. setting the global SQL option spark.sql.parquet.mergeSchema to true, as sketched below
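
For the second approach, a minimal sketch (assuming an existing SparkSession named spark; data/test_table is the partitioned table from the example below):

    # Enable schema merging globally instead of per read.
    spark.conf.set("spark.sql.parquet.mergeSchema", "true")
    mergedDF = spark.read.parquet("data/test_table")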
Python
    from pyspark.sql import Row

    # spark is from the previous example.
    # Create a simple DataFrame, stored into a partition directory
    sc = spark.sparkContext

    squaresDF = spark.createDataFrame(sc.parallelize(range(1, 6))
                                      .map(lambda i: Row(single=i, double=i ** 2)))
    squaresDF.write.parquet("data/test_table/key=1")

    # Create another DataFrame in a new partition directory,
    # adding a new column and dropping an existing column
    cubesDF = spark.createDataFrame(sc.parallelize(range(6, 11))
                                    .map(lambda i: Row(single=i, triple=i ** 3)))
    cubesDF.write.parquet("data/test_table/key=2")

    # Read the partitioned table
    mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
    mergedDF.printSchema()

    # The final schema consists of all 3 columns in the Parquet files together
    # with the partitioning column appearing in the partition directory paths.
    # root
    #  |-- double: long (nullable = true)
    #  |-- single: long (nullable = true)
    #  |-- triple: long (nullable = true)
    #  |-- key: integer (nullable = true)

Scala

    // This is used to implicitly convert an RDD to a DataFrame.
    import spark.implicits._

    // Create a simple DataFrame, store into a partition directory
    val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
    squaresDF.write.parquet("data/test_table/key=1")

    // Create another DataFrame in a new partition directory,
    // adding a new column and dropping an existing column
    val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
    cubesDF.write.parquet("data/test_table/key=2")

    // Read the partitioned table
    val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
    mergedDF.printSchema()

    // The final schema consists of all 3 columns in the Parquet files together
    // with the partitioning column appearing in the partition directory paths
    // root
    // |-- value: int (nullable = true)
    // |-- square: int (nullable = true)
    // |-- cube: int (nullable = true)
    // |-- key: int (nullable = true)
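
When the merged table is queried, rows that come from files lacking a column simply get null for that column; a short sketch reusing mergedDF from the Python example above:

    # Rows under key=1 were written without triple; rows under key=2 without double.
    mergedDF.createOrReplaceTempView("merged")
    spark.sql("SELECT single, double, triple, key FROM merged").show()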
