Reading HBase with Spark + Phoenix

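There should be plenty of articles online covering this, but it still took me a long time to get it working, so I am writing it down as a note to myself.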

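Phoenix is a SQL skin over HBase: it lets you work with HBase through ordinary SQL statements, which is far friendlier than calling the native HBase API directly. Spark's direct access to HBase also goes through that native API and is rather tedious to use. The steps below show how to combine Spark with Phoenix.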

I am using Scala. First, add the dependencies to pom.xml:

         <dependency>
            <groupId>org.apache.phoenix</groupId>
            <artifactId>phoenix-spark</artifactId>
            <version>5.0.0-HBase-2.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.phoenix</groupId>
            <artifactId>phoenix-core</artifactId>
            <version>5.0.0-HBase-2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>2.4.12</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>2.4.12</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-common</artifactId>
            <version>2.4.12</version>
        </dependency>

The versions you add here must match the HBase you are going to access!

Next, download the Phoenix release from the official Phoenix website (Overview | Apache Phoenix).

Unpack it and copy phoenix-server-hbase-2.4-5.1.3.jar into the hbase/lib/ directory, then restart HBase. Your version may differ from mine; it depends on the HBase version installed on your Hadoop cluster.

Then copy phoenix-client-hbase-2.4-5.1.3.jar from the unpacked archive into your project's resources directory, and put the Hadoop configuration files under resources/conf/. After that, start writing the code.

import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.phoenix.spark.datasource.v2.PhoenixDataSource

val spark = SparkSession
  .builder()
  .appName("phoenix-test")
  .master("local")
  .getOrCreate()

// Load data from TABLE1
val df = spark.sqlContext
  .read
  .format("phoenix")
  .options(Map("table" -> "TABLE1", PhoenixDataSource.ZOOKEEPER_URL -> "phoenix-server:2181"))
  .load

df.filter(df("COL1") === "test_row_1" && df("ID") === 1L)
  .select(df("ID"))
  .show

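This is the code from the official Phoenix site. It did not run for me: the class org.apache.phoenix.spark.datasource.v2.PhoenixDataSource could not be found. I do not know whether I pulled in the wrong dependency or something else was wrong, so my code makes a few changes on top of the example above.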
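One way to narrow down a "class not found" error like this is to ask the JVM directly whether a class is visible on the classpath. The following is only a minimal sketch: ClasspathCheck and isLoadable are hypothetical names introduced for illustration, and the check only reports what the current classloader can see, not why a class is missing.

```scala
object ClasspathCheck {
  // Returns true when the named class can be loaded from the current classpath.
  // The class is looked up without running its static initializers.
  def isLoadable(className: String): Boolean =
    try {
      Class.forName(className, false, getClass.getClassLoader)
      true
    } catch {
      case _: ClassNotFoundException | _: NoClassDefFoundError => false
    }

  def main(args: Array[String]): Unit = {
    val candidates = Seq(
      "org.apache.phoenix.spark.datasource.v2.PhoenixDataSource", // class used by the official example
      "org.apache.phoenix.jdbc.PhoenixDriver"                     // JDBC driver class
    )
    candidates.foreach { name =>
      val status = if (isLoadable(name)) "found" else "NOT found"
      println(s"$name -> $status")
    }
  }
}
```

If the data source class is reported missing while the JDBC driver is found, the plain JDBC route remains available, which is the approach the modified code takes.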

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.log4j.Logger
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SparkPhoenixHbase {
  @transient lazy val log = Logger.getLogger(this.getClass)
  def main(args: Array[String]): Unit = {

    readFromHBaseWithPhoenix()
  }

  def readFromHBaseWithPhoenix(): Unit = {

    val hadoopConf = new Configuration()
    hadoopConf.addResource(new Path("conf/core-site.xml"))
    hadoopConf.addResource(new Path("conf/hdfs-site.xml"))
    hadoopConf.addResource(new Path("conf/mapred-site.xml"))
    hadoopConf.addResource(new Path("conf/yarn-site.xml"))
    hadoopConf.addResource(new Path("conf/hbase-site.xml"))


    // Point the driver and executor classpath settings at the Phoenix client jar.
    val conf = new SparkConf()
      .setAppName("phoenix-spark-hbase")
      .setMaster("local[*]")
    conf.set("spark.driver.extraClassPath", "/resources/phoenix-client-hbase-2.4-5.1.3.jar")
    conf.set("spark.executor.extraClassPath", "/resources/phoenix-client-hbase-2.4-5.1.3.jar")

    val it = hadoopConf.iterator()
    while (it.hasNext){
      val entry = it.next()
      conf.set(entry.getKey, entry.getValue)
    }

    // The master and application name are taken from conf, so they are not repeated on the builder.
    val spark = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()

    val phoenixConfig = Map(
      "url" -> "jdbc:phoenix:10.12.4.51:2181", // address of the ZooKeeper installed on your Hadoop cluster
      "driver" -> "org.apache.phoenix.jdbc.PhoenixDriver"
    )

    val df = spark.read
      .format("jdbc")
      .options(phoenixConfig)
      .option("dbtable", "student")
      .load()

    df.show()

    spark.close()

  }
}
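The url option in the code above follows Phoenix's JDBC URL layout, jdbc:phoenix:<ZooKeeper quorum>:<port>:<root znode>, where the trailing parts are optional. Below is a small sketch of a helper that assembles such a URL. PhoenixUrl and build are hypothetical names, not part of Phoenix, and the hosts zk1, zk2, zk3 and the /hbase znode are example values only; the real znode is whatever zookeeper.znode.parent is set to in your hbase-site.xml.

```scala
object PhoenixUrl {
  // Assembles jdbc:phoenix:<quorum>:<port>[:<znode>].
  // quorum: ZooKeeper hosts, joined with commas in the URL.
  // port:   ZooKeeper client port (2181 is ZooKeeper's usual default).
  // znode:  optional HBase root znode, appended only when given.
  def build(quorum: Seq[String], port: Int = 2181, znode: Option[String] = None): String = {
    require(quorum.nonEmpty, "at least one ZooKeeper host is required")
    val base = s"jdbc:phoenix:${quorum.mkString(",")}:$port"
    znode.fold(base)(z => s"$base:$z")
  }

  def main(args: Array[String]): Unit = {
    // prints jdbc:phoenix:10.12.4.51:2181
    println(build(Seq("10.12.4.51")))
    // prints jdbc:phoenix:zk1,zk2,zk3:2181:/hbase
    println(build(Seq("zk1", "zk2", "zk3"), 2181, Some("/hbase")))
  }
}
```

The first call reproduces the URL used in the code above; the result can be passed as the "url" entry of the options map.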

It is best to configure logging in the project; otherwise you will not see the errors raised during execution.
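As a rough illustration, a console logger could be configured with a file like the one below. This is a sketch under assumptions, not the configuration actually used here: it assumes the log4j 1.x properties format and a file named log4j.properties on the classpath (for example under src/main/resources). Spark 3.3 and later use Log4j 2, which reads log4j2.properties with a different syntax.

```properties
# Assumed location: src/main/resources/log4j.properties (log4j 1.x syntax)
# Send everything at INFO and above to the console.
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
# timestamp, level, [logger name] : message
log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %p [%c] : %m%n
```

Lowering the root level to DEBUG gives more detail when a connection fails, at the cost of much noisier output.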

A successful run finally produces output like this:

2024-01-18 08:53:52,487 INFO [org.apache.spark.executor.Executor] : Finished task 0.0 in stage 0.0 (TID 0). 1509 bytes result sent to driver
2024-01-18 08:53:52,493 INFO [org.apache.spark.scheduler.TaskSetManager] : Finished task 0.0 in stage 0.0 (TID 0) in 580 ms on DESKTOP-FT30H9D (executor driver) (1/1)
2024-01-18 08:53:52,494 INFO [org.apache.spark.scheduler.TaskSchedulerImpl] : Removed TaskSet 0.0, whose tasks have all completed, from pool 
2024-01-18 08:53:52,500 INFO [org.apache.spark.scheduler.DAGScheduler] : ResultStage 0 (show at SparkPhoenixHbase.scala:70) finished in 0.774 s
2024-01-18 08:53:52,502 INFO [org.apache.spark.scheduler.DAGScheduler] : Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
2024-01-18 08:53:52,502 INFO [org.apache.spark.scheduler.TaskSchedulerImpl] : Killing all running tasks in stage 0: Stage finished
2024-01-18 08:53:52,504 INFO [org.apache.spark.scheduler.DAGScheduler] : Job 0 finished: show at SparkPhoenixHbase.scala:70, took 0.808840 s
2024-01-18 08:53:52,538 INFO [org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator] : Code generated in 14.3886 ms
+----+--------+---+-------+
|  ID|    NAME|AGE|   ADDR|
+----+--------+---+-------+
|1001|zhangsan| 10|tianjin|
+----+--------+---+-------+

// Seeing this table means it worked; my HBase student table contains only this one row.

2024-01-18 08:53:52,555 INFO [org.sparkproject.jetty.server.AbstractConnector] : Stopped Spark@4108fa66{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
2024-01-18 08:53:52,556 INFO [org.apache.spark.ui.SparkUI] : Stopped Spark web UI at http://DESKTOP-FT30H9D:4040
2024-01-18 08:53:52,566 INFO [org.apache.spark.MapOutputTrackerMasterEndpoint] : MapOutputTrackerMasterEndpoint stopped!
2024-01-18 08:53:52,581 INFO [org.apache.spark.storage.memory.MemoryStore] : MemoryStore cleared
2024-01-18 08:53:52,581 INFO [org.apache.spark.storage.BlockManager] : BlockManager stopped
2024-01-18 08:53:52,587 INFO [org.apache.spark.storage.BlockManagerMaster] : BlockManagerMaster stopped
2024-01-18 08:53:52,589 INFO [org.apache.spark.scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint] : OutputCommitCoordinator stopped!
2024-01-18 08:53:52,595 INFO [org.apache.spark.SparkContext] : Successfully stopped SparkContext
2024-01-18 08:53:59,207 INFO [org.apache.spark.util.ShutdownHookManager] : Shutdown hook called
2024-01-18 08:53:59,207 INFO [org.apache.spark.util.ShutdownHookManager] : Deleting directory C:\Users\shell\AppData\Local\Temp\spark-344ef832-7438-47dd-9126-725e6c2d8af4