Hive:数据仓库利器

1. 简介

Hive是一个基于Hadoop的开源数据仓库工具,可以用来存储、查询和分析大规模数据。Hive使用SQL-like的HiveQL语言来查询数据,并将其结果存储在Hadoop的文件系统中。

2. 基本概念

介绍 Hive 的核心概念,例如表、分区、桶、HQL 等。

2.1 架构

Design - Apache Hive - Apache Software Foundation

|----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 组成 | 详情 |
| UI | The user interfacefor users to submit queries and other operations to the system. As of 2011 the system had a command line interface and a web based GUI was being developed. |
| Driver | The component whichreceives the queries. This component implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces. |
| Compiler | The component thatparses the query, does semantic analysis on the different query blocks and query expressions and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore. |
| Metastore | The component that stores all the structure information of the various tables and partitions in the warehouse including column and column type information, the serializers and deserializers necessary to read and write data and the corresponding HDFS files where the data is stored. |
| Execution Engine | The component which executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these different stages of the plan and executes these stages on the appropriate system components. |

2.2 Data Model

|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 类型 | 详情 |
| Tables | These are analogous to Tables in Relational Databases. Tables can be filtered, projected, joined and unioned. Additionally all the data of a table is stored in a directory in HDFS. Hive also supports the notion of external tables wherein a table can be created on prexisting files or directories in HDFS by providing the appropriate location to the table creation DDL. The rows in a table are organized into typed columns similar to Relational Databases. |
| Partitions | Each Table can have one or more partition keys which determine how the data is stored, for example a table T with a date partition column ds had files with data for a particular date stored in the <table location>/ds=<date> directory in HDFS. Partitions allow the system to prune data to be inspected based on query predicates, for example a query that is interested in rows from T that satisfy the predicate T.ds = '2008-09-01' would only have to look at files in <table location>/ds=2008-09-01/ directory in HDFS. |
| Buckets | **Data in each partition may in turn be divided into Buckets based on the hash of a column in the table.**Each bucket is stored as a file in the partition directory. Bucketing allows the system to efficiently evaluate queries that depend on a sample of data (these are queries that use the SAMPLE clause on the table). |

3. 实践应用

3.1 数仓建设

4. 性能优化

介绍如何优化 Hive 的性能

5. 常见问题解答

5.1 常用SQL

|--------|--------------------------------|
| 场景 | SQL |
| 连续n天登录 | sql SELECT * FROM test; |
| | |

6. 总结

总结 Hive 的关键知识点,并提供学习资源和进一步研究方向。

相关推荐
大大大大晴天️13 小时前
浅聊Hadoop集群的主流安全方案(LDAP+Kerberos+Ranger)
大数据·hadoop·安全
roman_日积跬步-终至千里18 小时前
为什么 Hive 无法通过同步 JDBC 导出百万级数据?
数据仓库·hive·hadoop
WL_Aurora21 小时前
HDFS基础编程常用命令
大数据·hadoop·hdfs
大大大大晴天21 小时前
浅聊Hadoop集群的主流安全方案(LDAP+Kerberos+Ranger)
大数据·hadoop
roman_日积跬步-终至千里21 小时前
Hive JDBC vs MySQL JDBC:**“服务端推完就跑,客户端慢慢吃”**详解
数据仓库·hive·hadoop
计算机毕业编程指导师2 天前
【计算机毕设推荐】Python+Hadoop+Spark共享单车数据可视化分析系统 毕业设计 选题推荐 毕设选题 数据分析 机器学习 数据挖掘
大数据·hadoop·python·计算机·数据挖掘·spark·课程设计
计算机毕业编程指导师2 天前
【计算机毕设】基于Hadoop的共享单车订单数据分析系统+Python+Django全栈开发 毕业设计 选题推荐 毕设选题 数据分析 机器学习 数据挖掘
大数据·hadoop·python·计算机·数据挖掘·spark·django
计算机毕业编程指导师2 天前
【计算机毕设选题推荐】基于Hadoop+Spark的诺贝尔奖可视化分析系统源码 毕业设计 选题推荐 毕设选题 数据分析 机器学习 数据挖掘
大数据·hadoop·python·计算机·spark·毕业设计·诺贝尔奖
m0_716255002 天前
第二部分 电商离线数仓 全套项目代码(可直接在你伪分布式 Hive 运行)
hive·hadoop·分布式
kybs19913 天前
springboot租车系统--附源码68701
java·hadoop·spring boot·python·django·asp.net·php