Statistical Analysis of HDFS Metadata and Audit Logs Combined with Hive Metadata
Taking the HDFS path as the primary element, this design combines FSImage metadata and audit-log records with Hive metadata for statistical analysis.
For each HDFS path it reports the number of subdirectories, the number of child files, the total size of child files, the most recent modification time, the most recent operation time, and, where applicable, the Hive database/table/partition whose location the path belongs to.
Table design
hive_hdfs_path
Mapping table between Hive and the HDFS paths it uses. It is built from the Hive metastore tables by taking each database's location, each table's location, and each partition's location; every location consists of an HDFS path.
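The effective location follows a precedence rule: a partition's location overrides its table's location, which overrides the database's location URI. Below is a condensed sketch of that rule, assuming the metastore snapshot tables listed in the appendix (the production statements are in the Source SQL section):

```sql
-- Sketch only: resolve each object's effective location with partition-over-
-- table-over-database precedence, then strip the cluster prefix to get the path.
select
  replace (
    coalesce (hms_p.location, hms_t.location, hmd.db_location_uri),
    'hdfs://routerprd/',
    '/'
  ) as hdfspath
from
  hive_meta_dbs hmd
  left join hive_meta_tbls hmt on hmt.db_id = hmd.db_id
  left join hive_meta_sds hms_t on hms_t.sd_id = hmt.sd_id
  left join hive_meta_partitions hmp on hmp.tbl_id = hmt.tbl_id
  left join hive_meta_sds hms_p on hms_p.sd_id = hmp.sd_id
where
  hmd.day = '20251030';
```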
Table fields
| Field | Type | Notes |
|---|---|---|
| hdfspath | string | HDFS path; starts with /, not with hdfs:// |
| db_name | string | Database name |
| tbl_name | string | Table name |
| tbl_type | string | Table type: MANAGED_TABLE, VIRTUAL_VIEW, EXTERNAL_TABLE, or INDEX_TABLE |
| part_name | string | Partition name |
| db_id | string | Hive database ID |
| tbl_id | string | Hive table ID |
| part_id | string | Hive table partition ID |
| tbl_sd_id | string | The table's ID in SDS |
| part_sd_id | string | The partition's ID in SDS |
| date | string | Partition field of this table, format yyyyMMdd |
Source query: see hive_hdfs_path in the Source SQL section below.
hdfs_file_count_size
Statistics of file counts and sizes for HDFS paths: for directories at every level of each path, count all child files and subdirectories beneath them and sum the total file size.
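The statement fans every fsimage row out to all of its ancestor prefixes and then groups by prefix. A minimal sketch of the fan-out with one literal path (the value is illustrative only; the full statement is in the Source SQL section):

```sql
-- One file row is emitted once per ancestor prefix, carrying its size along.
with
one_file as (
  select '/user/hive/warehouse/f.orc' as path, 100L as size
),
exploded as (
  select split (path, '/') as parts, size from one_file
)
select
  concat ('/', concat_ws ('/', slice (parts, 2, i))) as parent_path,
  size
from
  exploded lateral view posexplode (parts) pe as i, part
where
  i > 1;
-- Yields /user/hive, /user/hive/warehouse, /user/hive/warehouse/f.orc.
-- The path itself is emitted as its own prefix (the final aggregation offsets
-- this by subtracting 1 from the directory count), and with i > 1 the
-- top-level prefix /user is not emitted.
```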
Table fields
| Field | Type | Notes |
|---|---|---|
| path | string | Directory |
| file_count | long | Number of files under the directory |
| dir_count | long | Number of subdirectories under the directory |
| file_size | long | Total size of files under the directory, in bytes |
| last_modification_time | string | Most recent modification time under the path |
| date | string | Partition field of this table and the date the fsimage file belongs to, format yyyyMMdd |
Source query: see hdfs_file_count_size in the Source SQL section below.
hdfs_audit_cmd
Aggregates, from a given day's audit log, the most recent cmd time for paths at every directory level. Every cmd timestamp is treated as an operation on the path; no distinction is made between cmd types.
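For reference, how a raw audit record is normalized before the prefix fan-out, sketched with literal values (assuming the log4j-style timestamp and the src=/... field format seen in the Source SQL section):

```sql
-- Trim the timestamp to second precision and strip the 'src=' prefix.
select
  cast(substr ('2025-10-28 10:15:30,123', 0, 19) as timestamp) as time,
  replace ('src=/user/gudong/data', 'src=/', '/') as hdfspath;
-- -> 2025-10-28 10:15:30 | /user/gudong/data
```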
Table fields
| Field | Type | Notes |
|---|---|---|
| hdfspath | string | HDFS path |
| last_cmd_time | string | Time of the most recent cmd |
| date | string | Partition field of this table, format yyyyMMdd |
Source query: see hdfs_audit_cmd in the Source SQL section below.
hdfs_path_time
Stores, per day, each HDFS path joined with the most recent operation time from that day's audit log: if hdfs_audit_cmd has a last_cmd_time for the path, it is taken; otherwise the previous day's value from hdfs_path_time is carried forward.
What should last_cmd_time be when neither source has a value? The modification time? The current time? Some fixed value?
Decision: use the fixed value 2025-10-31 23:59:59, as sketched below.
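A minimal sketch of the carry-forward rule with the chosen fallback (table and column names as defined above; the full statement is in the Source SQL section):

```sql
-- Today's audit time wins; otherwise carry yesterday's value forward;
-- otherwise initialize with the fixed value decided above.
select
  f.path as hdfspath,
  coalesce (
    a.last_cmd_time,
    y.last_cmd_time,
    '2025-10-31 23:59:59'
  ) as last_cmd_time
from
  hdfs_file_count_size f
  left join hdfs_audit_cmd a on f.path = a.hdfspath
  and a.date = '20251102'
  left join hdfs_path_time y on f.path = y.hdfspath
  and y.date = '20251101'
where
  f.date = '20251102';
```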
Table fields
| Field | Type | Notes |
|---|---|---|
| hdfspath | string | HDFS path |
| last_cmd_time | string | Time of the most recent cmd |
| date | string | Partition field of this table, format yyyyMMdd |
Source query: see hdfs_path_time in the Source SQL section below.
hdfs_file_count_size_time
Combines, for a given day, each HDFS path's storage statistics with its operation statistics.
Table fields
| Field | Type | Notes |
|---|---|---|
| path | string | Directory |
| file_count | long | Number of files under the directory |
| dir_count | long | Number of subdirectories under the directory |
| file_size | long | Total size of files under the directory, in bytes |
| last_modification_time | string | Most recent modification time under the path |
| last_cmd_time | string | Most recent operation time on the path |
| date | string | Partition field of this table and the date the fsimage file belongs to, format yyyyMMdd |
Source query: see hdfs_file_count_size_time in the Source SQL section below.
hdfs_file_count_size_time_hive
Final result table: the per-path storage and operation statistics joined with the Hive ownership information from hive_hdfs_path.
Table fields
| Field | Type | Notes |
|---|---|---|
| path | string | Directory |
| file_count | long | Number of files under the directory |
| dir_count | long | Number of subdirectories under the directory |
| file_size | long | Total size of files under the directory, in bytes |
| last_modification_time | string | Most recent modification time under the path |
| last_cmd_time | string | Most recent operation time on the path |
| date | string | Partition field of this table and the date the fsimage file belongs to, format yyyyMMdd |
Source query: see hdfs_file_count_size_time_hive in the Source SQL section below.
Data flow diagram
All statistics are computed in the Hive layer; the Doris layer stores the results and serves point queries from the web service.
Both the path storage-and-operation-time statistics table and the Hive HDFS-path mapping table are copied to Doris; when the web service performs a point query on a path, the two tables are joined to produce the result.
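For illustration, a point query of the kind the web service might issue against the two Doris tables (the exact serving schema is an assumption; names as defined above):

```sql
-- Hypothetical point query: one path's storage/operation statistics plus the
-- Hive database/table/partition it belongs to, if any.
select
  t.path, t.file_count, t.dir_count, t.file_size,
  t.last_modification_time, t.last_cmd_time,
  h.db_name, h.tbl_name, h.part_name, h.tbl_type
from
  hdfs_file_count_size_time t
  left join hive_hdfs_path h on h.hdfspath = t.path
  and h.date = t.date
where
  t.path = '/user/gudong'
  and t.date = '20251102';
```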
[Diagram] Warehouse side: the Hive metastore snapshot tables (hive_meta_dbs, hive_meta_sds, hive_meta_tbls, hive_meta_partitions) feed hive_hdfs_path; the fsimage table fsimage_all_new feeds hdfs_file_count_size; the per-cluster audit-log Hive tables (hdfs_auditlog_suninghadoop2/3/4/7/9/11/13/15, plus deg_ads.hdfsauditlog_ultimately_max_time_read) feed hdfs_audit_cmd; hdfs_file_count_size and hdfs_audit_cmd feed hdfs_path_time, which yields hdfs_file_count_size_time and then hdfs_file_count_size_time_hive. Doris side: hdfs_file_count_size_time_hive and hive_hdfs_path are stored and queried by the web service.
Statistics jobs
Schedule: jobs run at daily granularity, with a T-1.5 latency.
Job type: Spark jobs.
Web
Front-end display
The list can be sorted by "last modified time", "last operation time", "subdirectory count", "file count", and "total file size". In the list's [Actions] column, a path that belongs to Hive gets an additional [hive] action: hovering over it shows the Hive details below, and clicking jumps to the Hive list page in dam.
| Sub-path | Last modified | Last operated | Subdirectories | Files | Total file size | User | Group | Permissions | Actions |
|---|---|---|---|---|---|---|---|---|---|
| /user/gudong | 2025-11-06 11:54:24 | 2025-11-06 11:54:24 | 12 | 12 | 222 | gudong | gudong | drwxr-xr-x | delete, hive |
- Hive details (not available for every path; missing items are shown as -):
  - Database name
  - Table name
  - Partition name
Deferred features
Export
Export of the offline statistics; export limits are still to be determined.
Source SQL
All statements below are Spark SQL; there is still room for optimization.
hive_hdfs_path
```sql
select
b.db_name,
b.db_location,
b.tbl_type,
b.tbl_name,
b.tbl_location,
c.part_name,
c.part_location,
-- TODO: confirm whether part_location is always present
replace (
case
when part_location is not null then part_location
else b.location
end,
'hdfs://routerprd/',
'/'
) as hdfspath
from
( -- resolve each table's location
select
a.db_name,
a.db_location,
a.tbl_type,
a.tbl_name,
a.tbl_id,
hms.location as tbl_location,
-- if the database has no table (or the table has no location), use the database location
coalesce (hms.location, a.db_location) as location
from
(
-- databases with their locations and the tables under them
select
hmd.name as db_name,
hmd.db_location_uri as db_location,
hmt.tbl_name,
hmt.tbl_type,
hmt.tbl_id,
hmt.sd_id tbl_sd_id
from
hive_meta_dbs hmd
-- keep the day filters in the join conditions so the left joins stay outer
left join hive_meta_tbls hmt on hmt.db_id = hmd.db_id
and hmt.day = '20251030'
where
hmd.day = '20251030'
) as a
left join hive_meta_sds hms on hms.sd_id = a.tbl_sd_id
and hms.day = '20251030'
) as b
left join (
-- partitions with their locations
select
hmp.part_name,
hmp.tbl_id,
hms.location as part_location
from
hive_meta_partitions hmp,
hive_meta_sds hms
where
hmp.day = '20251030'
and hms.day = '20251030'
and hms.sd_id = hmp.sd_id
) as c on c.tbl_id = b.tbl_id;
-- Alternative formulation: start from hive_meta_sds and union the partition-,
-- table-, and database-level locations.
select
replace (
a.location,
'hdfs://routerprd/',
'/'
) as hdfspath,
a.tbl_name,
a.tbl_type,
a.tbl_id,
a.tbl_sd_id,
a.part_name,
a.part_id,
a.part_sd_id,
a.db_name,
a.db_id
from
(
-- partition locations with their table and database info
select
hms.location,
hmt.tbl_name,
hmt.tbl_type,
hmt.tbl_id,
hmt.sd_id as tbl_sd_id,
hmp.part_name,
hmp.part_id,
hmp.sd_id as part_sd_id,
hmd.name as db_name,
hmd.db_id
from
hive_meta_sds hms
left join hive_meta_partitions hmp on hms.sd_id = hmp.sd_id
left join hive_meta_tbls hmt on hmp.tbl_id = hmt.tbl_id
left join hive_meta_dbs hmd on hmt.db_id = hmd.db_id
where
hms.day = '20251102'
and hms.location != ''
and hmt.day = '20251102'
and hmd.day = '20251102'
and hmp.day = '20251102'
union
-- table locations with their database info
select
hms.location,
hmt.tbl_name,
hmt.tbl_type,
hmt.tbl_id,
hmt.sd_id as tbl_sd_id,
null as part_name,
null as part_id,
null as part_sd_id,
hmd.name as db_name,
hmd.db_id
from
hive_meta_sds hms
left join hive_meta_tbls hmt on hms.sd_id = hmt.sd_id
left join hive_meta_dbs hmd on hmt.db_id = hmd.db_id
where
hms.day = '20251102'
and hms.location != ''
and hmt.day = '20251102'
and hmd.day = '20251102'
union
-- database locations
select
hmd.db_location_uri as location,
null as tbl_name,
null as tbl_type,
null as tbl_id,
null as tbl_sd_id,
null as part_name,
null as part_id,
null as part_sd_id,
hmd.name as db_name,
hmd.db_id
from
hive_meta_dbs hmd
where
hmd.day = '20251102'
) as a;
```
hdfs_file_count_size
```sql
with
exploded AS (
SELECT
path,
permission,
filesize as size,
modificationtime,
SPLIT (path, '/') AS parts
FROM
fsimage_all_new
WHERE
date = '20251030'
),
all_prefix AS (
SELECT
CONCAT ('/', CONCAT_WS ('/', SLICE (parts, 2, i))) AS parent_path,
permission,
size,
modificationtime
FROM
exploded LATERAL VIEW POSEXPLODE (parts) pe AS i,
part
WHERE
i > 1
)
SELECT
parent_path AS path,
-- each directory also emits itself as its own prefix, so subtract 1
COUNT_IF(startsWith (permission, 'd')) - 1 AS dir_count,
COUNT_IF(NOT startsWith (permission, 'd')) AS file_count,
SUM(size) AS file_size,
MAX(modificationtime) AS last_modification_time
FROM
all_prefix
GROUP BY
parent_path;
```
hdfs_audit_cmd
```sql
-- Every cluster's audit table shares one shape: parse the timestamp and the
-- src path, fan the path out to every ancestor prefix, then keep the latest
-- time per prefix.
with
auditlog_raw as (
  select time, src from hdfs_auditlog_suninghadoop2 where startsWith (date, '2025-10-28-')
  union all
  select time, src from hdfs_auditlog_suninghadoop3 where startsWith (date, '2025-10-28-')
  union all
  select time, src from hdfs_auditlog_suninghadoop4 where startsWith (date, '2025-10-28-')
  union all
  select time, src from hdfs_auditlog_suninghadoop7 where startsWith (date, '2025-10-28-')
  union all
  select time, src from hdfs_auditlog_suninghadoop9 where startsWith (date, '2025-10-28-')
  union all
  select time, src from hdfs_auditlog_suninghadoop11 where startsWith (date, '2025-10-28-')
  union all
  select time, src from hdfs_auditlog_suninghadoop13 where startsWith (date, '2025-10-28-')
  union all
  select time, src from hdfs_auditlog_suninghadoop15 where startsWith (date, '2025-10-28-')
),
auditlog_exploded as (
  select
    cast(substr (time, 0, 19) as timestamp) as time,
    SPLIT (replace (src, 'src=/', '/'), '/') as parts
  from
    auditlog_raw
),
auditlog_prefix as (
  select
    CONCAT ('/', CONCAT_WS ('/', SLICE (parts, 2, i))) as parent_path,
    time
  from
    auditlog_exploded LATERAL VIEW POSEXPLODE (parts) pe as i,
    part
  where
    i > 1
)
select
  parent_path as hdfspath,
  MAX(time) as last_cmd_time
from
  auditlog_prefix
group by
  parent_path;
```
hdfs_path_time
```sql
select
  a.path as hdfspath,
  -- today's audit time wins; otherwise keep yesterday's stored value;
  -- otherwise initialize with the fixed value 2025-10-31 23:59:59
  coalesce (
    b.last_cmd_time,
    c.last_cmd_time,
    '2025-10-31 23:59:59'
  ) as last_cmd_time
from
  hdfs_file_count_size as a
  left join hdfs_audit_cmd as b on a.path = b.hdfspath
  and b.date = '20251102'
  left join hdfs_path_time as c on a.path = c.hdfspath
  and c.date = '20251101'
where
  a.date = '20251102';
```
hdfs_file_count_size_time
```sql
select
  a.*,
  b.last_cmd_time
from
  hdfs_file_count_size a
  inner join hdfs_path_time b on a.path = b.hdfspath
where
  a.date = '20251102'
  and b.date = '20251102';
```
hdfs_file_count_size_time_hive
```sql
select
  a.path,
  a.file_count,
  a.dir_count,
  a.file_size,
  a.last_modification_time,
  a.last_cmd_time,
  b.db_name,
  b.tbl_name,
  b.part_name,
  b.tbl_type
from
  hdfs_file_count_size_time as a
  left join hive_hdfs_path as b on a.path = b.hdfspath
  and b.date = '20251102'
where
  a.date = '20251102';
```
Appendix
Hive metadata table relationships
The metastore snapshot tables and their key fields:
- hive_meta_dbs: db_id (long), name (string), db_location_uri (string)
- hive_meta_tbls: tbl_id (long), db_id (long), sd_id (long), tbl_name (string), owner (string), tbl_type (string)
- hive_meta_partitions: tbl_id (long), part_name (string), sd_id (long)
- hive_meta_sds: sd_id (long), location (string)

A database contains tables (via db_id), a table contains partitions (via tbl_id), and tables and partitions each point to their storage descriptor in hive_meta_sds (via sd_id).
Fields of the fsimage file (matching the output of the HDFS Offline Image Viewer's Delimited processor)
| Field | Description |
|---|---|
| Path | Directory path |
| Replication | Replication factor, i.e. the total number of stored copies |
| ModificationTime | Last modification (or creation) time |
| AccessTime | Last access time |
| PreferredBlockSize | Preferred block size, in bytes |
| BlocksCount | Number of blocks |
| FileSize | File size, in bytes |
| NSQUOTA | Namespace quota: limits the number of files and directories allowed under the directory |
| DSQUOTA | Disk space quota: limits the number of bytes allowed under the directory |
| Permission | Permissions |
| UserName | Owner |
| GroupName | Group |
cmd values in the audit log
| Operation | Description |
|---|---|
| metaSave | Dump all metadata to a specified file |
| listOpenFiles | List the open files under a path |
| setPermission | Set permissions on a path |
| setOwner | Set the owner of a path |
| open | Get block locations within a given range; opens a file |
| concat | Move and append all blocks in srcs to target; to avoid rollback, all parameters are validated before the move starts |
| setTimes | Store the modification and access times of a path; access time has hourly precision. If needed, the transaction is written to the edit log but not flushed immediately |
| truncate | Truncate a file to a shorter length; truncation cannot be rolled back/recovered, as it causes data loss |
| createSymlink | Create a symbolic link |
| setReplication | Set the replication factor of an existing file |
| setStoragePolicy | Set a storage policy on a file or directory |
| satisfyStoragePolicy | Make a file or directory satisfy its storage policy, i.e. actually apply the policy |
| unsetStoragePolicy | Unset the storage policy set on a file or directory |
| append | Append to an existing file in the namespace |
| getAdditionalBlock | The client requests an additional block for a file that is being written |
| completeFile | Complete the in-progress write of a file |
| rename | Rename a file |
| delete | Delete a file |
| isFileClosed | Check whether a file is closed |
| mkdirs | Create the necessary directories |
| contentSummary | Get the content summary of a file/directory |
| quotaUsage | Get the quota usage of a file/directory |
| listStatus | List the children of a directory |
| slowDataNodesReport | Report slow DataNodes |
| datanodeReport | DataNode report |
| getDatanodeStorageReport | Get DataNode storage reports |
| saveNamespace | Save the namespace image |
| finalizeUpgrade | Finalize an upgrade |
| refreshNodes | Refresh the nodes |
| setBalancerBandwidth | Set the bandwidth used by the balancer |
| rollEditLog | Roll the edit log |
| listCorruptFileBlocks | Query corrupt files/blocks; returns a list where each entry describes one corrupt file/block |
| getDelegationToken | Get a delegation token |
| renewDelegationToken | Renew the expiration time of a delegation token |
| cancelDelegationToken | Cancel a delegation token |
| allowSnapshot | Allow snapshots on a directory |
| disallowSnapshot | Disallow snapshots on a directory |
| createSnapshot | Create a snapshot |
| renameSnapshot | Rename a snapshot |
| listSnapshottableDirectory | List snapshottable directories owned by the current user; a superuser gets all snapshottable directories |
| ListSnapshot | List the snapshots of a snapshottable directory |
| computeSnapshotDiff | Compute the difference between two snapshots |
| deleteSnapshot | Delete a snapshot of a snapshottable directory |
| gcDeletedSnapshot | Garbage-collect deleted snapshots (i.e. physical cleanup) |
| queryRollingUpgrade | Query rolling-upgrade information |
| startRollingUpgrade | Start a rolling upgrade |
| finalizeRollingUpgrade | Finalize a rolling upgrade |
| addCacheDirective | Add a cache directive |
| modifyCacheDirective | Modify a cache directive |
| removeCacheDirective | Remove a cache directive |
| listCacheDirectives | List cache directives |
| addCachePool | Add a cache pool |
| modifyCachePool | Modify a cache pool |
| removeCachePool | Remove a cache pool |
| listCachePools | List cache pools |
| modifyAclEntries | Modify ACL entries |
| removeAclEntries | Remove ACL entries |
| removeDefaultAcl | Remove the default ACL entries |
| removeAcl | Remove the ACL of a path |
| setAcl | Set the ACL of a path |
| getAclStatus | Get the ACL of a path |
| createEncryptionZone | Create an encryption zone on directory src with the given key |
| getEZForPath | Get the encryption zone of a path |
| listEncryptionZones | List encryption zones |
| reencryptEncryptionZone | Re-encrypt an encryption zone |
| listReencryptionStatus | List the status of re-encryptions |
| setErasureCodingPolicy | Set an erasure coding policy on a path |
| addErasureCodingPolicies | Add multiple erasure coding policies to the ErasureCodingPolicyManager |
| removeErasureCodingPolicy | Remove an erasure coding policy |
| enableErasureCodingPolicy | Enable an erasure coding policy |
| disableErasureCodingPolicy | Disable an erasure coding policy |
| unsetErasureCodingPolicy | Unset the erasure coding policy of a path |
| getECTopologyResultForPolicies | Verify whether the given policies are supported by the cluster configuration; if no policies are specified, check all enabled policies |
| getErasureCodingPolicy | Get the erasure coding policy of a path |
| getErasureCodingPolicies | Get all erasure coding policies |
| getErasureCodingCodecs | Get the available erasure coding codecs and their coders |
| setXAttr | Set an extended attribute on a path |
| getXAttrs | Get the extended attributes of a path |
| listXAttrs | List the extended attributes of a path |
| removeXAttr | Remove an extended attribute from a path |
| checkAccess | Check access permissions for a path |