Hive Statistics Commands: ANALYZE TABLE and DESCRIBE TABLE

At work I often came across statistics in the metadata of our tables, and I was curious how they were produced.

It turns out they are mostly gathered with ANALYZE TABLE, so this post shares what I found.

The end result is per-column statistics for a table: the number of nulls, the number of distinct values, the average column length, and so on.

For the authoritative syntax, see the Hive wiki: StatsDev - Apache Hive - Apache Software Foundation.

Command overview

ANALYZE TABLE and DESCRIBE TABLE are typically used together: ANALYZE TABLE computes statistics over a table or its partitions by launching an extra job that measures things like dataset size and row counts, while DESCRIBE TABLE displays the statistics that have been collected.
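
To make that division of labor concrete, here is a minimal sketch of the combination; the table name `web_logs` and its single partition column `ds` are hypothetical:

```sql
-- Hypothetical partitioned table, used only for illustration.
ANALYZE TABLE web_logs PARTITION(ds='2008-04-09') COMPUTE STATISTICS;              -- gather basic stats (runs an extra job)
ANALYZE TABLE web_logs PARTITION(ds='2008-04-09') COMPUTE STATISTICS FOR COLUMNS;  -- gather column stats (Hive 0.10.0+)
DESCRIBE EXTENDED web_logs PARTITION(ds='2008-04-09');                             -- read the stored stats back
```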

The official wiki introduces statistics as follows:

Statistics such as the number of rows of a table or partition, and the histograms of particular columns of interest, are important in many ways. One of the key use cases of statistics is query optimization: statistics serve as the input to the optimizer's cost functions so that it can compare different plans and choose among them. Statistics may sometimes satisfy the purpose of a user's query directly. Users can quickly get answers to some queries by reading only the stored statistics rather than firing long-running execution plans. Some examples are getting the quantiles of a user age distribution, the top 10 apps that people use, and the number of distinct sessions.
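
As a concrete illustration of "answering a query from stored statistics alone": once column statistics exist, a distinct count can be read from the metastore instead of scanning the data. The table and column names below are hypothetical:

```sql
-- Full scan: fires a long-running execution plan over the raw data.
SELECT COUNT(DISTINCT session_id) FROM user_sessions;

-- Statistics lookup: shows distinct_count from the metastore, assuming
-- ANALYZE TABLE user_sessions COMPUTE STATISTICS FOR COLUMNS ran earlier.
DESCRIBE FORMATTED user_sessions session_id;
```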

ANALYZE TABLE

ANALYZE TABLE works at both table and partition level. It covers the following basic items, which the wiki's summary table below expands on:

  • Number of rows
  • Number of files
  • Size in bytes

| Description | Stored in | Collected by | Since |
|---|---|---|---|
| Number of partitions the dataset consists of | Fictional metastore property: numPartitions | Computed during displaying the properties of a partitioned table | Hive 2.3 |
| Number of files the dataset consists of | Metastore table property: numFiles | Automatically during Metastore operations | |
| Total size of the dataset as it's seen at the filesystem level | Metastore table property: totalSize | Automatically during Metastore operations | |
| Uncompressed size of the dataset | Metastore table property: rawDataSize | Computed; these are the basic statistics. Calculated automatically when hive.stats.autogather is enabled; can be collected manually by ANALYZE TABLE ... COMPUTE STATISTICS | Hive 0.8 |
| Number of rows the dataset consists of | Metastore table property: numRows | Computed; these are the basic statistics. Calculated automatically when hive.stats.autogather is enabled; can be collected manually by ANALYZE TABLE ... COMPUTE STATISTICS | |
| Column level statistics | Metastore TAB_COL_STATS table | Computed; calculated automatically when hive.stats.column.autogather is enabled; can be collected manually by ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS | |
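
The "Collected by" column above describes two collection paths. Here is a short sketch of both, using the property names exactly as they appear in the table (whether autogather suits your workload is a tuning decision, not a given):

```sql
-- Automatic path: with these on, stats are gathered as data is written
-- (e.g. by INSERT OVERWRITE), so no separate ANALYZE run is needed.
SET hive.stats.autogather=true;          -- basic stats: numFiles, numRows, totalSize, rawDataSize
SET hive.stats.column.autogather=true;   -- column-level stats

-- Manual path: collect on demand for an existing table.
ANALYZE TABLE Table1 COMPUTE STATISTICS;              -- basic statistics
ANALYZE TABLE Table1 COMPUTE STATISTICS FOR COLUMNS;  -- column statistics
```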

Command syntax:

```sql
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]  -- (Note: Fully support qualified table name since Hive 1.2.0, see HIVE-10007.)
  COMPUTE STATISTICS
  [FOR COLUMNS]      -- (Note: Hive 0.10.0 and later.)
  [CACHE METADATA]   -- (Note: Hive 2.1.0 and later.)
  [NOSCAN];
```

The NOSCAN option

With NOSCAN, the job does not scan any files, so it completes as quickly as possible; the trade-off is that it cannot gather the full set of statistics and collects only the following:

  • Number of files
  • Physical size in bytes (the space the files occupy on HDFS)
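
A minimal NOSCAN example, reusing `Table1` from the examples further below:

```sql
-- Metadata-only pass: fast because no file contents are read; only
-- numFiles and totalSize are refreshed, not numRows or rawDataSize.
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS NOSCAN;
```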

The CACHE METADATA option

Feature not implemented: Hive Metastore on HBase was discontinued and removed in Hive 3.0.0. See HBaseMetastoreDevelopmentGuide.

This option cached computed file metadata in the HBase-backed metastore; it has not been supported since Hive 3.0.0. The wiki's original description:

When Hive metastore is configured to use HBase, this command explicitly caches file metadata in HBase metastore.

The goal of this feature is to cache file metadata (e.g. ORC file footers) to avoid reading lots of files from HDFS at split generation time, as well as potentially cache some information about splits (e.g. grouping based on location that would be good for some short time) to further speed up the generation and achieve better cache locality with consistent splits.
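
For completeness, here is one instantiation of the `[CACHE METADATA]` clause from the syntax above; this only ever applied to an HBase-backed metastore (Hive 2.1.0 through 2.x) and should not be expected to work on Hive 3.0.0 or later:

```sql
-- Historical form only; the HBase metastore was removed in Hive 3.0.0.
ANALYZE TABLE Table1 COMPUTE STATISTICS CACHE METADATA;
```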

Usage examples:

Suppose table Table1 has 4 partitions with the following specs:

  • Partition1: (ds='2008-04-08', hr=11)
  • Partition2: (ds='2008-04-08', hr=12)
  • Partition3: (ds='2008-04-09', hr=11)
  • Partition4: (ds='2008-04-09', hr=12)

and you issue the following command:

```sql
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS;
```

then statistics are gathered for partition3 (ds='2008-04-09', hr=11) only.

If you issue the command:

```sql
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS FOR COLUMNS;
```

then column statistics are gathered for all columns for partition3 (ds='2008-04-09', hr=11). This is available in Hive 0.10.0 and later.

If you issue the command:

```sql
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS;
```

then statistics are gathered for partitions 3 and 4 only (hr=11 and hr=12).

If you issue the command:

```sql
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS FOR COLUMNS;
```

then column statistics for all columns are gathered for partitions 3 and 4 only (Hive 0.10.0 and later).

If you issue the command:

```sql
ANALYZE TABLE Table1 PARTITION(ds, hr) COMPUTE STATISTICS;
```

then statistics are gathered for all four partitions.

If you issue the command:

```sql
ANALYZE TABLE Table1 PARTITION(ds, hr) COMPUTE STATISTICS FOR COLUMNS;
```

then column statistics for all columns are gathered for all four partitions (Hive 0.10.0 and later).

For a non-partitioned table, you can issue the command:

```sql
ANALYZE TABLE Table1 COMPUTE STATISTICS;
```

to gather statistics of the table.

For a non-partitioned table, you can issue the command:

```sql
ANALYZE TABLE Table1 COMPUTE STATISTICS FOR COLUMNS;
```

to gather column statistics of the table (Hive 0.10.0 and later).

DESCRIBE TABLE

Once statistics have been gathered with ANALYZE TABLE, we can view them with DESCRIBE TABLE.

Usage examples:

```sql
DESCRIBE EXTENDED TABLE1;
```

then among the output, the following would be displayed:

```
... , parameters:{numPartitions=4, numFiles=16, numRows=2000, totalSize=16384, ...}, ....
```

If you issue the command:

```sql
DESCRIBE EXTENDED TABLE1 PARTITION(ds='2008-04-09', hr=11);
```

then among the output, the following would be displayed:

```
... , parameters:{numFiles=4, numRows=500, totalSize=4096, ...}, ....
```

If you issue the command:

```sql
desc formatted concurrent_delete_different partition(ds='tomorrow') name;
```

the output would look like this:

```
+-----------------+--------------------+-------+-------+------------+-----------------+--------------+--------------+------------+-------------+------------+----------+
|    col_name     |     data_type      |  min  |  max  | num_nulls  | distinct_count  | avg_col_len  | max_col_len  | num_trues  | num_falses  | bitvector  | comment  |
+-----------------+--------------------+-------+-------+------------+-----------------+--------------+--------------+------------+-------------+------------+----------+
| col_name        | name               | NULL  | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL       | NULL     |
| data_type       | varchar(50)        | NULL  | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL       | NULL     |
| min             |                    | NULL  | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL       | NULL     |
| max             |                    | NULL  | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL       | NULL     |
| num_nulls       | 0                  | NULL  | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL       | NULL     |
| distinct_count  | 2                  | NULL  | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL       | NULL     |
| avg_col_len     | 5.0                | NULL  | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL       | NULL     |
| max_col_len     | 5                  | NULL  | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL       | NULL     |
| num_trues       |                    | NULL  | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL       | NULL     |
| num_falses      |                    | NULL  | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL       | NULL     |
| bitVector       |                    | NULL  | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL       | NULL     |
| comment         | from deserializer  | NULL  | NULL  | NULL       | NULL            | NULL         | NULL         | NULL       | NULL        | NULL       | NULL     |
+-----------------+--------------------+-------+-------+------------+-----------------+--------------+--------------+------------+-------------+------------+----------+
```

Notes

  • ANALYZE TABLE launches an extra MapReduce job to gather the statistics, so it is not free on large tables (see the sketch below for two ways to reduce that cost).
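
If that extra job is a concern on very large tables, two mitigations follow directly from the material above; a sketch (the staging table name is hypothetical):

```sql
-- 1. Metadata-only refresh: skips the scan job entirely (file count and size only).
ANALYZE TABLE Table1 COMPUTE STATISTICS NOSCAN;

-- 2. Gather stats while writing, so no separate ANALYZE run is needed afterwards.
SET hive.stats.autogather=true;
INSERT OVERWRITE TABLE Table1 SELECT * FROM staging_table1;  -- staging_table1 is hypothetical
```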