idea开发delta.io数据湖

delta.io是三大数据湖之一,Iceberg 和hudi. 国内人用的比较多,delta国外的大厂用的比较多,主要来源与databrack . 像苹果,adobe,阿里等公司用的是delta.io,相对来说比较成熟一些。通过idea的spark 操作delta.

idea maven 的pom.xml

复制代码
 <dependency>
      <groupId>io.minio</groupId>
      <artifactId>minio</artifactId>
      <version>8.5.7</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>3.5.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>3.5.0</version> <!-- 根据实际情况选择版本号 -->
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>3.3.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle -->

    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3 -->

    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-core -->

    <!-- https://mvnrepository.com/artifact/io.delta/delta-spark -->
    <dependency>
      <groupId>io.delta</groupId>
      <artifactId>delta-spark_2.12</artifactId>
      <version>3.0.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle -->

    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk -->
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk</artifactId>
      <version>1.11.375</version>
    </dependency> -->

实现代码。

delta 存储用的是minio,没有用hadoop

复制代码
package spark.delta

import org.apache.spark.sql.SparkSession

object delta {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local").appName("test")
      .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
      .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
      .config("spark.hadoop.spark.hadoop.fs.s3a.endpoint", "http://127.0.0.1:9000")
      .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
      .config("spark.hadoop.fs.s3a.path.style.access", "true")
      .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
      .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
      .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
      //指定hadoop catalog,catalog名称为hadoop_prod
      .config("spark.sql.catalog.hadoop_prod", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.hadoop_prod.type", "hadoop")
      .config("spark.sql.catalog.hadoop_prod.hadoop.fs.s3a.access.key", "minioadmin")
      .config("spark.sql.catalog.hadoop_prod.hadoop.fs.s3a.secret.key", "minioadmin")
      .config("spark.sql.catalog.hadoop_prod.hadoop.fs.s3a.endpoint", "http://127.0.0.1:9000")
      .getOrCreate()
    //val tesflile="s3a://datalake/aa.txt"
   // val bucketName = "datalake"
    //val minioPath = "s3a://" + bucketName + "/common/outputData"
   // val df = spark.read.text(tesflile)




    //println("总共单词数量"+df.count())
    val data = spark.range(0, 5)
    data.write.format("delta").save("s3a://bat/zhangshan")
    //spark.range(500).write.format("delta").save("/tmp/delta-table")
    val df = spark.read.format("delta").load("/tmp/delta-table")
    df.show()
  }
}
相关推荐
怒放吧德德9 小时前
Netty 4.2 入门指南:从概念到第一个程序
java·后端·netty
雨中飘荡的记忆10 小时前
大流量下库存扣减的数据库瓶颈:Redis分片缓存解决方案
java·redis·后端
心之语歌13 小时前
基于注解+拦截器的API动态路由实现方案
java·后端
华仔啊14 小时前
Stream 代码越写越难看?JDFrame 让 Java 逻辑回归优雅
java·后端
ray_liang14 小时前
用六边形架构与整洁架构对比是伪命题?
java·架构
Ray Liang15 小时前
用六边形架构与整洁架构对比是伪命题?
java·python·c#·架构设计
Java水解15 小时前
Java 中间件:Dubbo 服务降级(Mock 机制)
java·后端
SimonKing20 小时前
OpenCode AI辅助编程,不一样的编程思路,不写一行代码
java·后端·程序员
FastBean20 小时前
Jackson View Extension Spring Boot Starter
java·后端
Seven9721 小时前
剑指offer-79、最⻓不含重复字符的⼦字符串
java