idea开发delta.io数据湖

delta.io是三大数据湖之一,Iceberg 和hudi. 国内人用的比较多,delta国外的大厂用的比较多,主要来源与databrack . 像苹果,adobe,阿里等公司用的是delta.io,相对来说比较成熟一些。通过idea的spark 操作delta.

idea maven 的pom.xml

复制代码
 <dependency>
      <groupId>io.minio</groupId>
      <artifactId>minio</artifactId>
      <version>8.5.7</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>3.5.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>3.5.0</version> <!-- 根据实际情况选择版本号 -->
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>3.3.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle -->

    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3 -->

    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-core -->

    <!-- https://mvnrepository.com/artifact/io.delta/delta-spark -->
    <dependency>
      <groupId>io.delta</groupId>
      <artifactId>delta-spark_2.12</artifactId>
      <version>3.0.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle -->

    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk -->
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk</artifactId>
      <version>1.11.375</version>
    </dependency> -->

实现代码。

delta 存储用的是minio,没有用hadoop

复制代码
package spark.delta

import org.apache.spark.sql.SparkSession

object delta {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local").appName("test")
      .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
      .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
      .config("spark.hadoop.spark.hadoop.fs.s3a.endpoint", "http://127.0.0.1:9000")
      .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
      .config("spark.hadoop.fs.s3a.path.style.access", "true")
      .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
      .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
      .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
      //指定hadoop catalog,catalog名称为hadoop_prod
      .config("spark.sql.catalog.hadoop_prod", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.hadoop_prod.type", "hadoop")
      .config("spark.sql.catalog.hadoop_prod.hadoop.fs.s3a.access.key", "minioadmin")
      .config("spark.sql.catalog.hadoop_prod.hadoop.fs.s3a.secret.key", "minioadmin")
      .config("spark.sql.catalog.hadoop_prod.hadoop.fs.s3a.endpoint", "http://127.0.0.1:9000")
      .getOrCreate()
    //val tesflile="s3a://datalake/aa.txt"
   // val bucketName = "datalake"
    //val minioPath = "s3a://" + bucketName + "/common/outputData"
   // val df = spark.read.text(tesflile)




    //println("总共单词数量"+df.count())
    val data = spark.range(0, 5)
    data.write.format("delta").save("s3a://bat/zhangshan")
    //spark.range(500).write.format("delta").save("/tmp/delta-table")
    val df = spark.read.format("delta").load("/tmp/delta-table")
    df.show()
  }
}
相关推荐
m0_744724936 小时前
Tomcat相关
java·tomcat
AugustRed6 小时前
A2UI 完整学习指南(含 Java 后端 + 前端实战示例)
java·开发语言·前端
程序猿乐锅6 小时前
【MySQL | 第五篇】 MySQL 性能分析:如何查询慢 SQL
java·sql·mysql
lee_curry7 小时前
tomcat+springmvc+spring源码流通过程
java·spring·tomcat·springmvc
w1wi7 小时前
【兼职】边学边练的AI网站
java·人工智能·ai·ai编程·ai写作
basketball6167 小时前
C++进阶:1. 引用折叠规则
java·开发语言·c++
404号扳手7 小时前
Java 进阶知识(七)
java·后端
小马爱打代码7 小时前
Spring框架:介绍和快速入门
java·后端·spring
糖果店的幽灵7 小时前
LangChain 1.3 完全教程:从入门到精通-Part 7: Documents(文档处理)
java·python·langchain
Java_2017_csdn7 小时前
Java 策略模式(Strategy Pattern)-(三)
java·开发语言·servlet