Spark: Reading and Loading Parquet Files

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL supports both reading and writing Parquet files and automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.

Loading Data Programmatically

The following Python example uses the people.json sample data from the Spark repository:

    peopleDF = spark.read.json("examples/src/main/resources/people.json")

    # DataFrames can be saved as Parquet files, maintaining the schema information.
    peopleDF.write.parquet("people.parquet")

    # Read in the Parquet file created above.
    # Parquet files are self-describing so the schema is preserved.
    # The result of loading a parquet file is also a DataFrame.
    parquetFile = spark.read.parquet("people.parquet")

    # Parquet files can also be used to create a temporary view and then used in SQL statements.
    parquetFile.createOrReplaceTempView("parquetFile")
    teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
    teenagers.show()
    # +------+
    # |  name|
    # +------+
    # |Justin|
    # +------+

Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.

Schema Merging

Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
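
As a rough illustration of what merging "different but mutually compatible" schemas means, the sketch below models each file's schema as a plain dict from column name to type name, takes the union, and raises an error if the same column appears with two different types. This is a conceptual toy in plain Python, not Spark's implementation; the `merge_schemas` helper and the dict representation are assumptions made only for this illustration.

```python
def merge_schemas(*schemas):
    """Union column -> type mappings, rejecting conflicting types (toy model)."""
    merged = {}
    for schema in schemas:
        for column, dtype in schema.items():
            if column in merged and merged[column] != dtype:
                raise ValueError(
                    f"incompatible types for column {column!r}: "
                    f"{merged[column]} vs {dtype}"
                )
            # Keep the first type seen for each column.
            merged.setdefault(column, dtype)
    return merged


# Two hypothetical file schemas that share one column.
squares_schema = {"single": "long", "double": "long"}
cubes_schema = {"single": "long", "triple": "long"}

merged = merge_schemas(squares_schema, cubes_schema)
print(merged)
```

The shared column appears once in the result, and each file contributes the columns only it has. Rows from a file that lacks a column would carry nulls for it, which is one reason every column is treated as nullable on read.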

Because schema merging is relatively expensive and unnecessary in most cases, it has been turned off by default since Spark 1.5.0. You can enable it by either:

  1. setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
  2. setting the global SQL option spark.sql.parquet.mergeSchema to true

    from pyspark.sql import Row

    # spark is from the previous example.
    # Create a simple DataFrame, stored into a partition directory
    sc = spark.sparkContext

    squaresDF = spark.createDataFrame(sc.parallelize(range(1, 6))
                                      .map(lambda i: Row(single=i, double=i ** 2)))
    squaresDF.write.parquet("data/test_table/key=1")

    # Create another DataFrame in a new partition directory,
    # adding a new column and dropping an existing column
    cubesDF = spark.createDataFrame(sc.parallelize(range(6, 11))
                                    .map(lambda i: Row(single=i, triple=i ** 3)))
    cubesDF.write.parquet("data/test_table/key=2")

    # Read the partitioned table
    mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
    mergedDF.printSchema()

    # The final schema consists of all 3 columns in the Parquet files together
    # with the partitioning column that appears in the partition directory paths.
    # root
    #  |-- double: long (nullable = true)
    #  |-- single: long (nullable = true)
    #  |-- triple: long (nullable = true)
    #  |-- key: integer (nullable = true)

The same example in Scala:

    // This is used to implicitly convert an RDD to a DataFrame.
    import spark.implicits._

    // Create a simple DataFrame, store into a partition directory
    val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
    squaresDF.write.parquet("data/test_table/key=1")

    // Create another DataFrame in a new partition directory,
    // adding a new column and dropping an existing column
    val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
    cubesDF.write.parquet("data/test_table/key=2")

    // Read the partitioned table
    val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
    mergedDF.printSchema()

    // The final schema consists of all 3 columns in the Parquet files together
    // with the partitioning column that appears in the partition directory paths
    // root
    // |-- value: int (nullable = true)
    // |-- square: int (nullable = true)
    // |-- cube: int (nullable = true)
    // |-- key: int (nullable = true)
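
The `key` column in the merged schema is not stored in the Parquet files. It comes from the `key=1` and `key=2` directory names. The plain-Python sketch below shows the basic idea of reading name=value segments out of a path. `partition_values` is a hypothetical helper and the file name `part-00000.parquet` is made up for the example. Spark's actual partition discovery does more, including inferring the value's type, which is why `key` shows up as an integer above while this toy keeps strings.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Collect name=value directory segments from a path (toy model)."""
    values = {}
    for segment in PurePosixPath(path).parts:
        name, sep, value = segment.partition("=")
        if sep and name:
            values[name] = value  # kept as a string; no type inference here
    return values


first = partition_values("data/test_table/key=1/part-00000.parquet")
second = partition_values("data/test_table/key=2/part-00000.parquet")
print(first, second)
```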
