Hive Chapter 5: Integration with HBase

Table of Contents

Preface
I. Basics of HBase and Hive Integration
- (1) HBase Introduction
- (2) Hive-HBase Integration
- (3) Why Use Hive with HBase?
  - 1. SQL-like querying on HBase
  - 2. Real-time data analysis
  - 3. Complex queries
  - 4. Combining strengths
- (4) Principles of Hive-HBase Integration
- (5) Hive Storage Handlers
- (6) Performance Considerations
  - 1. Network Overhead
  - 2. Multiple File Merges in HBase
  - 3. Sequential I/O
II. Verify Hive Integration with HBase
Summary

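Preface

HBase is a high-performance, distributed NoSQL key-value store for large volumes of data. It is written in Java and runs on top of HDFS, which lets it read and write large datasets quickly and with high throughput. Hive provides a storage handler mechanism and integrates with HBase through the HBaseStorageHandler class, which is used to create HBase tables managed by Hive. With this integration, Hive users can draw on HBase's low-latency read/write performance for near-real-time analysis of big data. The integration is still evolving, particularly with respect to performance and snapshot support.

In this chapter we discuss the setup that HBase-Hive integration requires, then test the integration by creating a test HBase table from the Hive shell, populating it from another Hive table, and finally verifying the contents in the HBase table.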

I. Basics of HBase and Hive Integration

(1) HBase Introduction

HBase is a high-performance, distributed NoSQL key-value store built on top of the Hadoop Distributed File System (HDFS). It is implemented in Java and designed for high-throughput, low-latency read/write access to large datasets, which makes it well suited to real-time processing of data at massive scale.

Hive, on the other hand, offers a SQL-like interface called HiveQL for running analytical queries over large datasets. Although Hive was originally built on top of HDFS, it can be integrated with HBase to leverage the strengths of both systems.

(2) Hive-HBase Integration

Hive offers the HiveQL interface, which simplifies the use of MapReduce, while HBase offers low-latency database access. Combining the two makes it possible to use MapReduce for offline computation and analysis over the large volumes of data stored in HBase.

Hive provides a storage handler mechanism that interfaces with HBase through the HBaseStorageHandler class. This integration enables Hive to:

- Create and manage HBase tables directly using Hive DDL statements.
- Synchronize table definitions across the Hive Metastore and the HBase catalog.
- Query HBase tables using HiveQL, enabling SQL-like access to NoSQL data.

(3) Why Use Hive with HBase?

Pairing Hive's query layer with HBase's low-latency storage brings the following benefits.

1. SQL-like querying on HBase

Use HiveQL to query HBase data, making it easier for analysts familiar with SQL to access NoSQL data.

2. Real-time data analysis

Leverage the low-latency performance of HBase to analyze big data as it arrives.

3. Complex queries

Execute advanced analytical operations, such as GROUP BY, JOIN, and ORDER BY, on HBase data.
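
As a rough illustration of such queries, the sketch below runs an aggregate and a join against hbase_table_emp, the HBase-backed table this chapter defines in its verification section. The emp_salary table and its columns are hypothetical and stand in for any native Hive table; neither query comes from the original text.

```sql
-- Count employees per role; the aggregation runs in Hive, not in HBase.
SELECT role, COUNT(*) AS headcount
FROM hbase_table_emp
GROUP BY role
ORDER BY headcount DESC;

-- Join the HBase-backed table with a native Hive table.
-- emp_salary(id int, salary double) is a hypothetical table used only for illustration.
SELECT e.id, e.name, s.salary AS salary
FROM hbase_table_emp e
JOIN emp_salary s ON e.id = s.id
ORDER BY salary DESC;
```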

4. Combining strengths

Hive's batch-processing capabilities complement HBase's scalable, low-latency storage, providing a powerful solution for large-scale analytics.

Real-time data often resides in HBase, where direct analysis is difficult. Hive bridges this gap by providing richer analytical capabilities for querying and processing that data.

(4) Principles of Hive-HBase Integration

The integration between Hive and HBase is implemented through Hive's storage handler mechanism. The handler ships in the hive-hbase-handler JAR, which lets Hive and HBase communicate through their external APIs.

The key class for this integration is:
org.apache.hadoop.hive.hbase.HBaseStorageHandler

This handler allows Hive DDL statements to manage table definitions in both the Hive Metastore and the HBase catalog at the same time, keeping the two systems consistent.
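
The original text does not list the setup steps, so the following is only a sketch of what a Hive session typically needs before it can reach HBase: the handler JAR on the classpath and the address of the ZooKeeper quorum that HBase uses. The JAR path and host names are placeholders, not values from this chapter, and many distributions already configure both through hive-site.xml, in which case neither statement is needed.

```sql
-- Placeholder path: point this at the hive-hbase-handler JAR shipped with your Hive version.
-- Depending on the installation, HBase client JARs may also have to be added.
ADD JAR /path/to/hive-hbase-handler.jar;

-- Placeholder hosts: the ZooKeeper quorum of the target HBase cluster.
SET hbase.zookeeper.quorum=zk1.example.com,zk2.example.com,zk3.example.com;
```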

(5) Hive Storage Handlers

HiveStorageHandler is the primary interface Hive uses to connect to NoSQL stores such as HBase, Cassandra, and others. An examination of the interface shows that an implementation must define a custom InputFormat, OutputFormat, and SerDe. The storage handler enables both reading from and writing to the underlying storage subsystem, which translates into running SELECT queries against the data system as well as writing into it, for example to store the results of a report.

Hive queries over NoSQL databases perform worse than ordinary Hive and MapReduce jobs on HDFS because of the overhead of the NoSQL system. The reasons include the socket connections to the server and the merging of multiple underlying files, whereas typical access to HDFS is purely sequential I/O, which is very fast on modern hard drives.

The three components play the following roles:

1. InputFormat

How data is read from the external storage system.

2. OutputFormat

How data is written to the external storage system.

3. SerDe

How data is serialized and deserialized in Hive.

By using storage handlers, Hive can read from and write to NoSQL systems like HBase. This enables users to run SQL-like queries on NoSQL data and perform actions such as generating reports.

(6) Performance Considerations

While Hive provides a convenient way to query HBase, there are performance trade-offs to consider:

1. Network Overhead

Queries over NoSQL systems such as HBase often incur higher latency than Hive queries on HDFS, because every remote call to the database server adds network socket overhead.
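
One common way to limit this overhead is to restrict the query by row key, so that less data has to cross the network. The sketch below assumes the hbase_table_emp table from this chapter's verification section and relies on Hive's storage-level predicate pushdown (hive.optimize.ppd.storage); whether a given predicate is actually pushed down to HBase depends on the Hive version and the key type, so the plan is worth checking with EXPLAIN.

```sql
-- Storage-handler predicate pushdown is enabled by default; shown here for clarity.
SET hive.optimize.ppd.storage=true;

-- id is mapped to the HBase row key (:key), so this filter can be served by a
-- key lookup rather than a full table scan when the predicate is pushed down.
EXPLAIN
SELECT id, name, role
FROM hbase_table_emp
WHERE id = 100;
```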

2. Multiple File Merges in HBase

HBase may need to merge multiple underlying storage files (HFiles) during read operations, which can slow down reads.

3. Sequential I/O

HDFS is optimized for large sequential reads, whereas NoSQL systems such as HBase are optimized for random access by key. Scan-heavy Hive queries therefore run against a store that is not tuned for this access pattern, which can become a performance bottleneck.
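
When the same HBase data is scanned repeatedly, one possible workaround is to copy it once into a native Hive table on HDFS and run the heavy queries there. This is a sketch rather than a recommendation from the original text; the table name emp_snapshot and the choice of ORC are assumptions, and the copy reflects the HBase data only as of the time it was taken.

```sql
-- Copy the HBase-backed table into a native, HDFS-resident Hive table.
-- emp_snapshot is a hypothetical name; ORC is one of several suitable formats.
CREATE TABLE emp_snapshot
STORED AS ORC
AS
SELECT id, name, role
FROM hbase_table_emp;

-- Later scans read sequential files on HDFS instead of going through HBase.
SELECT role, COUNT(*) AS headcount
FROM emp_snapshot
GROUP BY role;
```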

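II. Verify Hive Integration with HBase

(1) Table Creation Example

Below is a sample DDL statement that creates an HBase-backed table. It creates the hbase_table_emp table in Hive and the emp table in HBase. The Hive table has three columns: id int, name string, and role string. The name and role columns are mapped to the name and role columns of the cf1 column family. The :key entry in the hbase.columns.mapping property maps the HBase row key to the first column of the Hive table (id int).

```sql
CREATE TABLE hbase_table_emp(id int, name string, role string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:name,cf1:role")
TBLPROPERTIES ("hbase.table.name" = "emp");
```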

Running this statement in the Hive terminal creates the hbase_table_emp table in Hive and the emp table in HBase.

```
hive> CREATE TABLE hbase_table_emp(id int, name string, role string)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:name,cf1:role")
> TBLPROPERTIES ("hbase.table.name" = "emp");
OK
Time taken: 2.667 seconds
```
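
To complete the check described in the preface, the new table can be populated from an ordinary Hive table and read back. The sketch below assumes a Hive version that supports INSERT ... VALUES (0.14 or later); the staging table emp_staging and the sample rows are invented for illustration.

```sql
-- Hypothetical staging table holding a few sample rows.
CREATE TABLE emp_staging(id int, name string, role string);

INSERT INTO TABLE emp_staging VALUES
  (1, 'alice', 'engineer'),
  (2, 'bob', 'analyst');

-- Write the rows into HBase through the storage handler.
INSERT INTO TABLE hbase_table_emp
SELECT id, name, role FROM emp_staging;

-- Read them back through Hive; the rows come from the HBase table emp.
SELECT * FROM hbase_table_emp;
```

The same rows should then be visible from the HBase shell, for example with scan 'emp', where each Hive row appears as one row key with the cells cf1:name and cf1:role.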

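(2) Property Walkthrough

1. Declare that the Hive table uses HBaseStorageHandler

```sql
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
```

2. SERDEPROPERTIES

SERDEPROPERTIES passes settings to the SerDe so that Hive can interpret the layout of the HBase table.

```sql
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:name,cf1:role")
```

3. hbase.columns.mapping

Maps HBase columns (column family and column qualifier) to Hive columns. The entries are matched to the Hive columns by position, so the mapping must contain exactly one entry per Hive column, and :key denotes the HBase row key.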
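
Besides mapping individual qualifiers, the handler can also map a whole column family to a Hive MAP column by leaving the qualifier empty. The following sketch illustrates that variant; the table names hbase_table_emp_attrs and emp_attrs are hypothetical and do not appear in the original text.

```sql
-- "cf1:" (family with an empty qualifier) maps every column in cf1 to one Hive MAP:
-- the qualifiers become the map keys and the cell values become the map values.
CREATE TABLE hbase_table_emp_attrs(id int, attrs map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:")
TBLPROPERTIES ("hbase.table.name" = "emp_attrs");

-- Individual qualifiers are then read with map access syntax.
SELECT id, attrs['role'] AS role
FROM hbase_table_emp_attrs;
```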

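4. TBLPROPERTIES

Maps the Hive table to an HBase table. The hbase.table.name property gives the name of the table in HBase; if it is omitted, the HBase table takes the same name as the Hive table.

```sql
TBLPROPERTIES ("hbase.table.name" = "emp");
```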
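
The same property can also point Hive at an HBase table that already exists. The sketch below assumes the emp table with column family cf1 is already present in HBase and registers it under the hypothetical Hive name hbase_emp_ext; because the table is EXTERNAL, dropping it in Hive removes only the Hive metadata and leaves the HBase table in place.

```sql
-- Register an existing HBase table in Hive without taking ownership of it.
-- hbase_emp_ext is a hypothetical Hive-side name.
CREATE EXTERNAL TABLE hbase_emp_ext(id int, name string, role string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:name,cf1:role")
TBLPROPERTIES ("hbase.table.name" = "emp");

-- Inspect the stored definition, including the storage handler and table properties.
DESCRIBE FORMATTED hbase_emp_ext;
```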

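Summary

This chapter covered the basics of Hive-HBase integration, the reasons for using Hive on top of HBase, and the principles behind the integration. It should now be clear why the two systems are integrated and in which scenarios the integration is a good choice.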