Statistical Analysis of HDFS Metadata and Audit Logs Combined with Hive Metadata

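This analysis is keyed on HDFS paths. It combines the metadata in the FSImage and the audit records in the audit log with Hive metastore metadata.

For each HDFS path, it reports the number of subdirectories, the number of files, the total file size, the most recent modification time, the most recent operation time, and the Hive database, table, or partition whose location the path is.

Table design

hive_hdfs_path

Maps the HDFS paths used by Hive to Hive objects. It is built by querying the Hive metadata tables for the location of each database, table, and partition; each location is an HDFS path.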

Columns

Column	Type	Description
hdfspath	string	HDFS path; starts with /, without the hdfs:// prefix
db_name	string	Database name
tbl_name	string	Table name
tbl_type	string	Table type; one of `MANAGED_TABLE`, `VIRTUAL_VIEW`, `EXTERNAL_TABLE`, `INDEX_TABLE`
part_name	string	Partition name
db_id	string	Hive database ID
tbl_id	string	Hive table ID
part_id	string	Hive partition ID
tbl_sd_id	string	ID of the table's row in sds
part_sd_id	string	ID of the partition's row in sds
date	string	Partition column of this table; format: yyyyMMdd
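The hdfspath convention above (starts with /, no hdfs:// prefix) can be illustrated with a small Python sketch. The nameservice prefix `hdfs://routerprd` is taken from the `replace()` call in the source SQL; the function name `to_hdfspath` and the trailing-slash handling are assumptions added for illustration, not part of the original design.

```python
from typing import Optional

# Nameservice prefix as it appears in the replace() call of the source SQL.
NAMESERVICE_PREFIX = "hdfs://routerprd"


def to_hdfspath(location: Optional[str]) -> Optional[str]:
    """Turn a Hive location URI into a path that starts with '/'."""
    if not location:
        return None
    if location.startswith(NAMESERVICE_PREFIX):
        location = location[len(NAMESERVICE_PREFIX):]
    # Assumption: drop trailing slashes (except for the root) so that
    # equality joins on hdfspath are not defeated by '/a/b/' vs '/a/b'.
    if len(location) > 1 and location.endswith("/"):
        location = location.rstrip("/")
    return location or "/"
```

Normalizing once, when hive_hdfs_path is built, keeps every downstream join a plain equality comparison on the path string.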

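Source query: see the SQL section below.

hdfs_file_count_size

File count and size statistics per HDFS path: for the directory at each level of a path, the number of all files beneath it and their total size.

Columns

Column	Type	Description
path	string	Directory
file_count	long	Number of files under the directory
dir_count	long	Number of subdirectories under the directory
file_size	long	Total size of the files under the directory, in bytes
last_modification_time	string	Most recent modification time under the path
date	string	Partition column of this table, which is also the date of the FSImage file; format: yyyyMMdd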

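Source query: see the SQL section below.

hdfs_audit_cmd

For a given day, records the most recent cmd time in the audit log for the path at each directory level. Every cmd is counted as an operation on the path; cmd types are not distinguished.

Columns

Column	Type	Description
hdfspath	string	HDFS path
last_cmd_time	string	Time of the last cmd
date	string	Partition column of this table; format: yyyyMMdd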

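Source query: see the SQL section below.

hdfs_path_time

Joins a day's HDFS paths with the most recent operation time seen for each path in that day's audit log. If hdfs_audit_cmd has a last_cmd_time for the path, that value is used; otherwise the last_cmd_time from the previous day's hdfs_path_time is kept.

Open question: what should last_cmd_time be for a path that has no value yet: its modification time, the current time, or some fixed value?

Decision: use the fixed value 2025-10-31 23:59:59.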
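The update rule can be sketched in Python as follows. This is a minimal illustration, not the production job: the function name `merge_last_cmd_time` and the dictionary inputs are hypothetical, and applying the fixed value only to paths that have neither a value for the day nor a carried-over value is one reading of the decision above.

```python
from typing import Dict, Iterable

# Fixed initial value chosen in the text.
DEFAULT_LAST_CMD_TIME = "2025-10-31 23:59:59"


def merge_last_cmd_time(
    paths: Iterable[str],
    today_audit: Dict[str, str],
    yesterday: Dict[str, str],
    default: str = DEFAULT_LAST_CMD_TIME,
) -> Dict[str, str]:
    """Pick last_cmd_time per path: today's audit value, else yesterday's, else the default."""
    merged = {}
    for path in paths:
        merged[path] = today_audit.get(path) or yesterday.get(path) or default
    return merged


# Usage: '/a' was touched today, '/b' carries over, '/c' has never been seen.
result = merge_last_cmd_time(
    ["/a", "/b", "/c"],
    today_audit={"/a": "2025-11-02 10:00:00"},
    yesterday={"/a": "2025-11-01 09:00:00", "/b": "2025-11-01 08:00:00"},
)
```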

Columns

Column	Type	Description
hdfspath	string	HDFS path
last_cmd_time	string	Time of the last cmd
date	string	Partition column of this table; format: yyyyMMdd

Source query: see the SQL section below.

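hdfs_file_count_size_time

Storage statistics and access/operation statistics of HDFS paths for a given day.

Columns

Column	Type	Description
path	string	Directory
file_count	long	Number of files under the directory
dir_count	long	Number of subdirectories under the directory
file_size	long	Total size of the files under the directory, in bytes
last_modification_time	string	Most recent modification time under the path
last_cmd_time	string	Most recent operation time under the path
date	string	Partition column of this table, which is also the date of the FSImage file; format: yyyyMMdd

Source query: see the SQL section below.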

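hdfs_file_count_size_time_hive

Columns

Column	Type	Description
path	string	Directory
file_count	long	Number of files under the directory
dir_count	long	Number of subdirectories under the directory
file_size	long	Total size of the files under the directory, in bytes
last_modification_time	string	Most recent modification time under the path
last_cmd_time	string	Most recent operation time under the path
date	string	Partition column of this table, which is also the date of the FSImage file; format: yyyyMMdd

Source query: see the SQL section below.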

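Data flow diagram

All statistics are computed in the Hive layer; the Doris layer stores the results and serves point queries from the web service.

Both the path storage-and-operation-time statistics table and the table of HDFS paths used by Hive are copied to Doris. When the web query service does a point lookup on a path, it joins the two tables to produce the result.

[Data flow diagram; only its labels survive in the extracted text.]
Groups: warehouse side, Doris side, Hive metadata, FSImage Hive table, HDFS audit log Hive tables, web.
Nodes: hdfs_file_count_size_time_hive, hive_hdfs_path, hdfs_file_count_size, hdfs_audit_cmd, deg_ads.hdfsauditlog_ultimately_max_time_read, hdfs_path_time, hdfs_file_count_size_time, hdfs_auditlog_suninghadoop2, hdfs_auditlog_suninghadoop3, hdfs_auditlog_suninghadoop4, hdfs_auditlog_suninghadoop7, hdfs_auditlog_suninghadoop9, hdfs_auditlog_suninghadoop11, hdfs_auditlog_suninghadoop13, hdfs_auditlog_suninghadoop15, fsimage_new_all, hive_meta_dbs, hive_meta_sds, hive_meta_tbls, hive_meta_partitions.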
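The query-time join can be sketched in Python, with in-memory dictionaries standing in for the two Doris tables. The function name `lookup_path` and the sample rows are hypothetical; a real service would issue the equivalent join against Doris.

```python
from typing import Any, Dict, List, Optional


def lookup_path(
    path: str,
    stats_by_path: Dict[str, Dict[str, Any]],
    hive_by_path: Dict[str, List[Dict[str, Any]]],
) -> Optional[Dict[str, Any]]:
    """Point lookup: path statistics joined with any Hive objects located at the path."""
    stats = stats_by_path.get(path)
    if stats is None:
        return None
    # Left-join semantics: a path with no Hive match still returns its statistics.
    return {**stats, "hive": hive_by_path.get(path, [])}


# Hypothetical sample rows.
stats_table = {
    "/user/hive/warehouse/ods.db/t1": {"file_count": 12, "dir_count": 3, "file_size": 222},
    "/user/gudong": {"file_count": 5, "dir_count": 1, "file_size": 40},
}
hive_table = {
    "/user/hive/warehouse/ods.db/t1": [{"db_name": "ods", "tbl_name": "t1", "part_name": None}],
}

hit = lookup_path("/user/hive/warehouse/ods.db/t1", stats_table, hive_table)
plain = lookup_path("/user/gudong", stats_table, hive_table)
```

Joining at query time keeps the two tables independent, so either one can be refreshed in Doris without rebuilding the other.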

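Statistics job

Schedule granularity: the job runs once per day, with a T-1.5 delay.

Job type: Spark job

Web

Front-end display

The list can be sorted by modification time, operation time, subdirectory count, file count, and total file size. In the [Actions] column, a path that belongs to Hive gets an extra [hive] entry, which shows the Hive details on hover and links to the Hive list page in dam.

Subpath	Last modified	Last operation	Subdirectories	Files	Total file size	User	Group	Path permissions	Actions
/user/gudong	2025-11-06 11:54:24	2025-11-06 11:54:24	12	12	222	gudong	gudong	drwxr-xr-x	Delete, hive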

Hive details (not every path has them; when a value is missing, the field shows -):
- Database name
- Table name
- Partition name
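As a small illustration of the placeholder rule, the following Python sketch fills the three detail fields and substitutes - for anything missing. The function name `hive_detail` is hypothetical; the field names follow the columns of hive_hdfs_path.

```python
from typing import Dict, Optional

DETAIL_FIELDS = ("db_name", "tbl_name", "part_name")


def hive_detail(row: Optional[Dict[str, Optional[str]]]) -> Dict[str, str]:
    """Return the Hive detail fields for display, using '-' for missing values."""
    row = row or {}
    return {field: (row.get(field) or "-") for field in DETAIL_FIELDS}


# A table-level location has no partition name; a non-Hive path has nothing at all.
table_level = hive_detail({"db_name": "ods", "tbl_name": "t1", "part_name": None})
not_hive = hive_detail(None)
```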

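Remaining features

Export

Export of the offline statistics. What limits should apply to exports?

Source SQL

The queries are written in Spark SQL and have room for optimization.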

hive_hdfs_path

select
    b.db_name,
    b.db_location,
    b.tbl_type,
    b.tbl_name,
    b.tbl_location,
    c.part_name,
    c.part_location,
    -- Prefer the partition location; otherwise fall back to the table (or database) location
    replace(
        coalesce(c.part_location, b.location),
        'hdfs://routerprd/',
        '/'
    ) as hdfspath
from
    (   -- Resolve the location of each table
        select
            a.db_name,
            a.db_location,
            a.tbl_type,
            a.tbl_name,
            a.tbl_id,
            hms.location as tbl_location,
            -- If the database has no tables, use the database location
            coalesce(hms.location, a.db_location) as location
        from
            (
                -- Each database's location and the tables under it
                select
                    hmd.name as db_name,
                    hmd.db_location_uri as db_location,
                    hmt.tbl_name,
                    hmt.tbl_type,
                    hmt.tbl_id,
                    hmt.sd_id as tbl_sd_id
                from
                    hive_meta_dbs hmd
                    -- The day filter sits in ON: in WHERE it would drop databases without tables
                    left join hive_meta_tbls hmt on hmt.db_id = hmd.db_id
                    and hmt.day = '20251030'
                where
                    hmd.day = '20251030'
            ) as a
            left join hive_meta_sds hms on hms.sd_id = a.tbl_sd_id
            and hms.day = '20251030'
    ) as b
    left join (
        -- Resolve the location of each partition
        select
            hmp.part_name,
            hmp.tbl_id,
            hms.location as part_location
        from
            hive_meta_partitions hmp
            join hive_meta_sds hms on hms.sd_id = hmp.sd_id
        where
            hmp.day = '20251030'
            and hms.day = '20251030'
    ) as c on c.tbl_id = b.tbl_id;
    
-- Emits one row per partition, table, and database location, with the ID columns of hive_hdfs_path
select
    replace(a.location, 'hdfs://routerprd/', '/') as hdfspath,
    a.tbl_name,
    a.tbl_type,
    a.tbl_id,
    a.tbl_sd_id,
    a.part_name,
    a.part_id,
    a.part_sd_id,
    a.db_name,
    a.db_id
from
    (
        -- Partition locations, with their table and database
        select
            hms.location,
            hmt.tbl_name,
            hmt.tbl_type,
            hmt.tbl_id,
            hmt.sd_id as tbl_sd_id,
            hmp.part_name,
            hmp.part_id,
            hmp.sd_id as part_sd_id,
            hmd.name as db_name,
            hmd.db_id
        from
            hive_meta_sds hms
            join hive_meta_partitions hmp on hms.sd_id = hmp.sd_id
            join hive_meta_tbls hmt on hmp.tbl_id = hmt.tbl_id
            join hive_meta_dbs hmd on hmt.db_id = hmd.db_id
        where
            hms.day = '20251102'
            and hms.location != ''
            and hmt.day = '20251102'
            and hmd.day = '20251102'
            and hmp.day = '20251102'
        union
        -- Table locations, with their database
        select
            hms.location,
            hmt.tbl_name,
            hmt.tbl_type,
            hmt.tbl_id,
            hmt.sd_id as tbl_sd_id,
            null as part_name,
            null as part_id,
            null as part_sd_id,
            hmd.name as db_name,
            hmd.db_id
        from
            hive_meta_sds hms
            join hive_meta_tbls hmt on hms.sd_id = hmt.sd_id
            join hive_meta_dbs hmd on hmt.db_id = hmd.db_id
        where
            hms.day = '20251102'
            and hms.location != ''
            and hmt.day = '20251102'
            and hmd.day = '20251102'
        union
        -- Database locations
        select
            hmd.db_location_uri as location,
            null as tbl_name,
            null as tbl_type,
            null as tbl_id,
            null as tbl_sd_id,
            null as part_name,
            null as part_id,
            null as part_sd_id,
            hmd.name as db_name,
            hmd.db_id
        from
            hive_meta_dbs hmd
        where
            hmd.day = '20251102'
    ) as a;

hdfs_file_count_size

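with
    -- Split every fsimage path into its components
    exploded AS (
        SELECT
            path,
            permission,
            filesize AS size,
            modificationtime,
            SPLIT(path, '/') AS parts
        FROM
            fsimage_all_new
        WHERE
            date = '20251030'
    ),
    -- One row per prefix of each path, including the path itself.
    -- i > 1 keeps prefixes of depth 2 and deeper; i > 0 would also include top-level directories.
    all_prefix AS (
        SELECT
            CONCAT('/', CONCAT_WS('/', SLICE(parts, 2, i))) AS parent_path,
            permission,
            size,
            modificationtime
        FROM
            exploded LATERAL VIEW POSEXPLODE(parts) pe AS i, part
        WHERE
            i > 1
    )
SELECT
    parent_path AS path,
    -- Subtract 1 so a directory does not count itself; clamp to 0 when the prefix is a file
    GREATEST(COUNT_IF(startsWith(permission, 'd')) - 1, 0) AS dir_count,
    COUNT_IF(NOT startsWith(permission, 'd')) AS file_count,
    SUM(size) AS file_size,
    MAX(modificationtime) AS last_modification_time
FROM
    all_prefix
GROUP BY
    parent_path;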
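To make the prefix expansion in the query above easier to follow, here is a minimal pure-Python sketch of the same roll-up on a handful of rows. The names `prefixes` and `rollup` and the sample entries are hypothetical; `min_depth=2` mirrors the `i > 1` filter, and the sketch skips the directory itself directly instead of subtracting 1 afterwards.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Entry = Tuple[str, str, int, str]  # (path, permission, size, modification time)


def prefixes(path: str, min_depth: int = 2) -> List[str]:
    """All prefixes of a path from min_depth down to the path itself."""
    parts = [p for p in path.split("/") if p]
    return ["/" + "/".join(parts[:d]) for d in range(min_depth, len(parts) + 1)]


def rollup(entries: Iterable[Entry]) -> Dict[str, Dict[str, object]]:
    """Aggregate file count, subdirectory count, size and latest mtime per prefix."""
    stats = defaultdict(
        lambda: {"file_count": 0, "dir_count": 0, "file_size": 0, "last_modification_time": ""}
    )
    for path, permission, size, mtime in entries:
        is_dir = permission.startswith("d")
        for prefix in prefixes(path):
            s = stats[prefix]
            if is_dir:
                if prefix != path:  # a directory is not its own subdirectory
                    s["dir_count"] += 1
            else:
                s["file_count"] += 1
                s["file_size"] += size
            # Timestamps in one fixed format compare correctly as strings.
            s["last_modification_time"] = max(s["last_modification_time"], mtime)
    return dict(stats)


entries = [
    ("/user/gudong", "drwxr-xr-x", 0, "2025-11-01 10:00"),
    ("/user/gudong/data", "drwxr-xr-x", 0, "2025-11-02 09:00"),
    ("/user/gudong/data/a.txt", "-rw-r--r--", 100, "2025-11-03 08:00"),
    ("/user/gudong/b.txt", "-rw-r--r--", 22, "2025-11-01 12:00"),
]
stats = rollup(entries)
```

Because every row is emitted once per ancestor, the intermediate data grows with path depth; that cost is the main thing to watch when running the SQL on a full fsimage.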

hdfs_path_hive_statistics

select
    a.path as hdfspath,
    a.file_count,
    a.dir_count,
    a.file_size,
    a.last_modification_time,
    b.db_name,
    b.tbl_name,
    b.part_name,
    b.tbl_type
from
    hdfs_file_count_size as a
    -- The date filter on b sits in ON: in WHERE it would drop paths that have no Hive match
    left join hive_hdfs_path as b on a.path = b.hdfspath
    and b.date = '20251102'
where
    a.date = '20251102';

hdfs_audit_cmd

with
    -- Audit records of the target day from the audit log tables of all eight clusters
    auditlog_raw AS (
        select time, src from hdfs_auditlog_suninghadoop2 where startsWith(date, '2025-10-28-')
        union all
        select time, src from hdfs_auditlog_suninghadoop3 where startsWith(date, '2025-10-28-')
        union all
        select time, src from hdfs_auditlog_suninghadoop4 where startsWith(date, '2025-10-28-')
        union all
        select time, src from hdfs_auditlog_suninghadoop7 where startsWith(date, '2025-10-28-')
        union all
        select time, src from hdfs_auditlog_suninghadoop9 where startsWith(date, '2025-10-28-')
        union all
        select time, src from hdfs_auditlog_suninghadoop11 where startsWith(date, '2025-10-28-')
        union all
        select time, src from hdfs_auditlog_suninghadoop13 where startsWith(date, '2025-10-28-')
        union all
        select time, src from hdfs_auditlog_suninghadoop15 where startsWith(date, '2025-10-28-')
    ),
    -- Truncate the time to seconds and split the src path into its components
    exploded AS (
        select
            cast(substr(time, 1, 19) as timestamp) as time,
            SPLIT(replace(src, 'src=/', '/'), '/') AS parts
        from
            auditlog_raw
    ),
    -- One row per prefix of each path (depth 2 and deeper), including the path itself
    all_prefix AS (
        SELECT
            CONCAT('/', CONCAT_WS('/', SLICE(parts, 2, i))) AS parent_path,
            time
        FROM
            exploded LATERAL VIEW POSEXPLODE(parts) pe AS i, part
        WHERE
            i > 1
    )
SELECT
    parent_path AS hdfspath,
    MAX(time) AS last_cmd_time
FROM
    all_prefix
GROUP BY
    parent_path;
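The per-record handling in the query above (truncate the time to seconds, strip the `src=` prefix, then keep the latest time per path prefix) can be sketched in Python. The function names and sample records are hypothetical, and the time layout `yyyy-MM-dd HH:mm:ss,SSS` is an assumption about the audit log format inferred from the 19-character truncation in the SQL.

```python
from datetime import datetime
from typing import Dict, Iterable, Tuple


def parse_record(time_field: str, src_field: str) -> Tuple[datetime, str]:
    """Parse one audit record into (timestamp, path)."""
    # Keep only 'yyyy-MM-dd HH:mm:ss' (19 characters), dropping the milliseconds.
    ts = datetime.strptime(time_field[:19], "%Y-%m-%d %H:%M:%S")
    path = src_field[len("src="):] if src_field.startswith("src=") else src_field
    return ts, path


def last_cmd_times(records: Iterable[Tuple[str, str]], min_depth: int = 2) -> Dict[str, datetime]:
    """Latest cmd time per path prefix; every cmd type counts as an operation."""
    latest: Dict[str, datetime] = {}
    for time_field, src_field in records:
        ts, path = parse_record(time_field, src_field)
        parts = [p for p in path.split("/") if p]
        for depth in range(min_depth, len(parts) + 1):
            prefix = "/" + "/".join(parts[:depth])
            if prefix not in latest or ts > latest[prefix]:
                latest[prefix] = ts
    return latest


records = [
    ("2025-10-28 08:00:01,123", "src=/user/gudong/a.txt"),
    ("2025-10-28 09:30:00,001", "src=/user/gudong/dir/b.txt"),
    ("2025-10-28 07:00:00,000", "src=/tmp"),
]
latest = last_cmd_times(records)
```

An operation on a file therefore also refreshes the operation time of every ancestor directory from depth 2 downward, which is what lets a directory-level view show recent activity.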

hdfs_path_time

select
    a.path as hdfspath,
    -- Use the day's audit time when present; otherwise carry over the previous day's value
    coalesce(b.last_cmd_time, c.last_cmd_time) as last_cmd_time
from
    hdfs_file_count_size as a
    left join hdfs_audit_cmd as b on a.path = b.hdfspath
    and b.date = '20251102'
    left join hdfs_path_time as c on a.path = c.hdfspath
    and c.date = '20251101'
where
    a.date = '20251102';

hdfs_file_count_size_time

select
    a.*,
    b.last_cmd_time
from
    hdfs_file_count_size a
    -- Join rows of the same day only
    inner join hdfs_path_time b on a.path = b.hdfspath
    and a.date = b.date
where
    a.date = '20251102';

hdfs_file_count_size_time_hive

select
    a.path,
    a.file_count,
    a.dir_count,
    a.file_size,
    a.last_modification_time,
    a.last_cmd_time,
    b.db_name,
    b.tbl_name,
    b.part_name,
    b.tbl_type
from
    hdfs_file_count_size_time as a
    -- The date filter on b sits in ON: in WHERE it would drop paths that have no Hive match
    left join hive_hdfs_path as b on a.path = b.hdfspath
    and b.date = '20251102'
where
    a.date = '20251102';

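Appendix

Hive metadata table relationships

Table	Column	Type
hive_meta_dbs	db_id	long
hive_meta_dbs	name	string
hive_meta_dbs	db_location_uri	string
hive_meta_tbls	db_id	long
hive_meta_tbls	sd_id	long
hive_meta_tbls	tbl_name	string
hive_meta_tbls	owner	string
hive_meta_tbls	tbl_id	long
hive_meta_tbls	tbl_type	string
hive_meta_partitions	tbl_id	long
hive_meta_partitions	part_name	string
hive_meta_partitions	sd_id	long
hive_meta_sds	id	long
hive_meta_sds	location	string

In the original diagram, these tables are linked by four "contains" relationships.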

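Fields of the fsimage file

Field	Description
Path	Directory path
Replication	Replication factor, i.e. the total number of stored copies
ModificationTime	Last modification (or creation) time
AccessTime	Last access time
PreferredBlockSize	Preferred block size, in bytes
BlocksCount	Number of blocks
FileSize	File size, in bytes
NSQUOTA	Name quota: limits the number of files and directories allowed under the directory
DSQUOTA	Space quota: limits the number of bytes allowed under the directory
Permission	Permissions
UserName	Owner
GroupName	Group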
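As an illustration of how one row with these fields might be read, the sketch below parses a single line of a delimited text export. It assumes a tab delimiter and the column order listed above; the function name `parse_fsimage_line` and the sample line are hypothetical.

```python
from typing import Dict, Union

FSIMAGE_COLUMNS = [
    "Path", "Replication", "ModificationTime", "AccessTime",
    "PreferredBlockSize", "BlocksCount", "FileSize",
    "NSQUOTA", "DSQUOTA", "Permission", "UserName", "GroupName",
]
INT_COLUMNS = {
    "Replication", "PreferredBlockSize", "BlocksCount",
    "FileSize", "NSQUOTA", "DSQUOTA",
}


def parse_fsimage_line(line: str, delimiter: str = "\t") -> Dict[str, Union[str, int]]:
    """Parse one delimited fsimage row into a dict keyed by field name."""
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(FSIMAGE_COLUMNS):
        raise ValueError(f"expected {len(FSIMAGE_COLUMNS)} fields, got {len(values)}")
    row: Dict[str, Union[str, int]] = dict(zip(FSIMAGE_COLUMNS, values))
    for column in INT_COLUMNS:
        row[column] = int(row[column])
    return row


sample = (
    "/user/gudong/a.txt\t3\t2025-11-06 11:54\t2025-11-06 11:54\t"
    "134217728\t1\t222\t0\t0\t-rw-r--r--\tgudong\tgudong\n"
)
row = parse_fsimage_line(sample)
```

Only Path, Permission, FileSize, and ModificationTime are needed for the path statistics; the remaining fields are carried along unchanged.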

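cmd values in the audit log

Operation	Description
metaSave	Dump all metadata to the specified file
listOpenFiles	List the open files under the specified path
setPermission	Set permissions on the specified path
setOwner	Set the owner of the specified path
open	Open a file by getting the block locations within the specified range
concat	Move all blocks of srcs and append them to target. To avoid rollbacks, all arguments are validated before the actual move starts.
setTimes	Store the modification and access time of the specified path. The access time is precise to the hour. The transaction, if needed, is written to the edit log but is not flushed immediately.
truncate	Truncate a file to a shorter length. Truncation cannot be reverted or recovered from, because it causes data loss.
createSymlink	Create a symbolic link
setReplication	Set the replication factor of an existing file
setStoragePolicy	Set the storage policy of a file or directory
satisfyStoragePolicy	Make a file or directory satisfy its storage policy, so that the policy takes effect
unsetStoragePolicy	Remove the storage policy set on a file or directory
append	Append to an existing file in the namespace
getAdditionalBlock	The client requests an additional block for the specified file name (which is being written to)
completeFile	Complete the in-progress write to the specified file
rename	Rename the specified file
delete	Delete the specified file
isFileClosed	Check whether a file is closed
mkdirs	Create all the necessary directories
contentSummary	Get the content summary of a file or directory
quotaUsage	Get the quota usage of a file or directory
listStatus	Get the listing of the specified directory
slowDataNodesReport	Report of slow DataNodes
datanodeReport	DataNode report
getDatanodeStorageReport	Get the DataNode storage report
saveNamespace	Save the namespace image
finalizeUpgrade	Finalize the upgrade
refreshNodes	Refresh the nodes
setBalancerBandwidth	Set the bandwidth used for data balancing
rollEditLog	Roll the edit log
listCorruptFileBlocks	Query corrupt blocks/files; returns a list in which each entry describes one corrupt file/block
getDelegationToken	Get a delegation token
renewDelegationToken	Renew a token, giving it a new expiration time
cancelDelegationToken	Cancel a token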
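allowSnapshot	Allow snapshots on the specified directory
disallowSnapshot	Disallow snapshots on the specified directory
createSnapshot	Create a snapshot
renameSnapshot	Rename a snapshot
listSnapshottableDirectory	Get the list of snapshottable directories owned by the current user. If the current user is a superuser, return all snapshottable directories.
ListSnapshot	Get the list of snapshots of the specified snapshottable directory
computeSnapshotDiff	Compute the differences between two snapshots
deleteSnapshot	Delete a snapshot of a snapshottable directory
gcDeletedSnapshot	Garbage-collect snapshots that have already been deleted (that is, physical cleanup)
queryRollingUpgrade	Query rolling upgrade information
startRollingUpgrade	Start a rolling upgrade
finalizeRollingUpgrade	Finalize a rolling upgrade
addCacheDirective	Add a cache directive
modifyCacheDirective	Modify a cache directive
removeCacheDirective	Remove a cache directive
listCacheDirectives	List cache directives
addCachePool	Add a cache pool
modifyCachePool	Modify a cache pool
removeCachePool	Remove a cache pool
listCachePools	List cache pools
modifyAclEntries	Modify ACL entries
removeAclEntries	Remove ACL entries
removeDefaultAcl	Remove the default ACL entries
removeAcl	Remove the ACL from the specified path
setAcl	Set the ACL of the specified path
getAclStatus	Get the ACL of the specified path
createEncryptionZone	Create an encryption zone on directory src using the specified key
getEZForPath	Get the encryption zone of the specified path
listEncryptionZones	List encryption zones
reencryptEncryptionZone	Re-encrypt an encryption zone
listReencryptionStatus	List re-encryption status
setErasureCodingPolicy	Set the erasure coding policy on the specified path
addErasureCodingPolicies	Add multiple erasure coding policies to the ErasureCodingPolicyManager
removeErasureCodingPolicy	Remove an erasure coding policy
enableErasureCodingPolicy	Enable an erasure coding policy
disableErasureCodingPolicy	Disable an erasure coding policy
unsetErasureCodingPolicy	Unset the erasure coding policy on the specified path
getECTopologyResultForPolicies	Verify whether the given policies are supported in the given cluster setup. If no policy is specified, check all enabled policies.
getErasureCodingPolicy	Get the erasure coding policy information of the specified path
getErasureCodingPolicies	Get all erasure coding policies
getErasureCodingCodecs	Get the available erasure coding codecs and their corresponding coders
setXAttr	Set an extended attribute on the specified path
getXAttrs	Get the extended attributes of the specified path
listXAttrs	List the extended attributes of the specified path
removeXAttr	Remove an extended attribute from the specified path
checkAccess	Check access permissions on the path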