Developing a delta.io data lake in IDEA

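delta.io is one of the three major data lake formats, along with Iceberg and Hudi. Iceberg and Hudi are more widely used in China, while Delta is more widely used by large companies abroad and originates mainly from Databricks. Companies such as Apple, Adobe, and Alibaba use delta.io, and it is relatively mature. This post operates on Delta through Spark from IDEA.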

The Maven pom.xml for the IDEA project

    <dependency>
      <groupId>io.minio</groupId>
      <artifactId>minio</artifactId>
      <version>8.5.7</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>3.5.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>3.5.0</version> <!-- choose the version that matches your environment -->
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>3.3.4</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/io.delta/delta-spark -->
    <dependency>
      <groupId>io.delta</groupId>
      <artifactId>delta-spark_2.12</artifactId>
      <version>3.0.0</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk -->
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk</artifactId>
      <version>1.11.375</version>
    </dependency>

The implementation code.

The Delta table is stored in MinIO here; Hadoop (HDFS) is not used.
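
Pointing Spark at MinIO instead of HDFS comes down to a handful of `fs.s3a.*` Hadoop settings, each passed to Spark with a single `spark.hadoop.` prefix. The following is a minimal sketch that gathers them in one place so the prefix is applied exactly once. The names `MinioConf` and `s3aSettings` are hypothetical and not part of Spark or Delta, and the endpoint and credentials are the local MinIO defaults used in this post.

```scala
object MinioConf {
  // Hypothetical helper: builds the Spark settings that route s3a:// paths to a MinIO endpoint.
  def s3aSettings(endpoint: String, accessKey: String, secretKey: String): Map[String, String] = {
    // Assumption: SSL is wanted only when the endpoint URL uses https
    val sslEnabled = endpoint.startsWith("https://")
    Map(
      "fs.s3a.endpoint" -> endpoint,
      "fs.s3a.access.key" -> accessKey,
      "fs.s3a.secret.key" -> secretKey,
      "fs.s3a.connection.ssl.enabled" -> sslEnabled.toString,
      // MinIO is usually addressed as http://host:port/bucket, not http://bucket.host
      "fs.s3a.path.style.access" -> "true",
      "fs.s3a.impl" -> "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "fs.s3a.aws.credentials.provider" -> "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
    ).map { case (key, value) => s"spark.hadoop.$key" -> value } // prefix added once, here
  }

  def main(args: Array[String]): Unit = {
    // Print only the keys so the credentials do not end up in logs
    s3aSettings("http://127.0.0.1:9000", "minioadmin", "minioadmin")
      .keys.toSeq.sorted.foreach(println)
  }
}
```

The map can then be folded into a session builder, for example `MinioConf.s3aSettings(...).foldLeft(SparkSession.builder().master("local")) { case (b, (k, v)) => b.config(k, v) }`.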

package spark.delta

import org.apache.spark.sql.SparkSession

object delta {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local").appName("test")
      .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
      .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
      .config("spark.hadoop.fs.s3a.endpoint", "http://127.0.0.1:9000")
      .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
      .config("spark.hadoop.fs.s3a.path.style.access", "true")
      .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
      .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
      // Register Delta Lake's SQL extension and catalog with the session
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
    // val tesflile = "s3a://datalake/aa.txt"
    // val bucketName = "datalake"
    // val minioPath = "s3a://" + bucketName + "/common/outputData"
    // val df = spark.read.text(tesflile)
    // println("Total word count: " + df.count())

    val data = spark.range(0, 5)
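    // Write the five sample rows to the MinIO bucket as a Delta table
    data.write.format("delta").save("s3a://bat/zhangshan")
    // Local-filesystem alternative:
    // spark.range(500).write.format("delta").save("/tmp/delta-table")
    // Read the table back from the same path it was written to
    val df = spark.read.format("delta").load("s3a://bat/zhangshan")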
    df.show()
  }
}
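
Once a table has been written as above, each further write creates a new table version that can be read back by number. The following is a rough sketch of that behaviour, not part of the original program. It assumes a fresh local temporary directory in place of the MinIO bucket so it runs without extra services, and the names `DeltaTimeTravelSketch` and `writeAndCount` are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object DeltaTimeTravelSketch {
  // Writes two versions of a Delta table at `path` (which must not exist yet)
  // and returns (row count of version 0, row count of the latest version).
  def writeAndCount(spark: SparkSession, path: String): (Long, Long) = {
    // Version 0: ids 0 to 4
    spark.range(0, 5).write.format("delta").save(path)
    // Version 1: append ids 5 to 9
    spark.range(5, 10).write.format("delta").mode("append").save(path)

    // "versionAsOf" is Delta's reader option for reading an earlier table version
    val v0 = spark.read.format("delta").option("versionAsOf", "0").load(path).count()
    val latest = spark.read.format("delta").load(path).count()
    (v0, latest)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("delta-time-travel-sketch")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Assumption: a local temp directory; an s3a:// URI should work the same way
    // once the S3A settings from the main program are applied.
    val path = Files.createTempDirectory("delta-sketch").resolve("table").toString
    val (v0, latest) = writeAndCount(spark, path)
    println(s"version 0 rows: $v0, latest rows: $latest")
    spark.stop()
  }
}
```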