【hudi】数据湖客户端运维工具Hudi-Cli实战

数据湖客户端运维工具Hudi-Cli实战

help

shell 复制代码
hudi:student_mysql_cdc_hudi_fl->help
AVAILABLE COMMANDS

Archived Commits Command
       trigger archival: trigger archival
       show archived commits: Read commits from archived files and show details
       show archived commit stats: Read commits from archived files and show details

Bootstrap Command
       bootstrap run: Run a bootstrap action for current Hudi table
       bootstrap index showmapping: Show bootstrap index mapping
       bootstrap index showpartitions: Show bootstrap indexed partitions

Built-In Commands
       help: Display help about available commands
       stacktrace: Display the full stacktrace of the last error.
       clear: Clear the shell screen.
       quit, exit: Exit the shell.
       history: Display or save the history of previously run commands
       version: Show version info
       script: Read and execute commands from a file.

Cleans Command
       cleans show: Show the cleans
       clean showpartitions: Show partition level details of a clean
       cleans run: run clean

Clustering Command
       clustering run: Run Clustering
       clustering scheduleAndExecute: Run Clustering. Make a cluster plan first and execute that plan immediately
       clustering schedule: Schedule Clustering

Commits Command
       commits compare: Compare commits with another Hoodie table
       commits sync: Sync commits with another Hoodie table
       commit showpartitions: Show partition level details of a commit
       commits show: Show the commits
       commits showarchived: Show the archived commits
       commit showfiles: Show file level details of a commit
       commit show_write_stats: Show write stats of a commit

Compaction Command
       compaction run: Run Compaction for given instant time
       compaction scheduleAndExecute: Schedule compaction plan and execute this plan
       compaction showarchived: Shows compaction details for a specific compaction instant
       compaction repair: Renames the files to make them consistent with the timeline as dictated by Hoodie metadata. Use when compaction unschedule fails partially.
       compaction schedule: Schedule Compaction
       compaction show: Shows compaction details for a specific compaction instant
       compaction unscheduleFileId: UnSchedule Compaction for a fileId
       compaction validate: Validate Compaction
       compaction unschedule: Unschedule Compaction
       compactions show all: Shows all compactions that are in active timeline
       compactions showarchived: Shows compaction details for specified time window

Diff Command
       diff partition: Check how file differs across range of commits. It is meant to be used only for partitioned tables.
       diff file: Check how file differs across range of commits

Export Command
       export instants: Export Instants and their metadata from the Timeline

File System View Command
       show fsview all: Show entire file-system view
       show fsview latest: Show latest file-system view

HDFS Parquet Import Command
       hdfsparquetimport: Imports Parquet table to a hoodie table

Hoodie Log File Command
       show logfile records: Read records from log files
       show logfile metadata: Read commit metadata from log files

Hoodie Sync Validate Command
       sync validate: Validate the sync by counting the number of records

Kerberos Authentication Command
       kerberos kdestroy: Destroy Kerberos authentication
       kerberos kinit: Perform Kerberos authentication

Markers Command
       marker delete: Delete the marker

Metadata Command
       metadata stats: Print stats about the metadata
       metadata list-files: Print a list of all files in a partition from the metadata
       metadata list-partitions: List all partitions from metadata
       metadata validate-files: Validate all files in all partitions from the metadata
       metadata delete: Remove the Metadata Table
       metadata create: Create the Metadata Table if it does not exist
       metadata init: Update the metadata table from commits since the creation
       metadata set: Set options for Metadata Table

Repairs Command
       repair deduplicate: De-duplicate a partition path contains duplicates & produce repaired files to replace with
       rename partition: Rename partition. Usage: rename partition --oldPartition <oldPartition> --newPartition <newPartition>
       repair overwrite-hoodie-props: Overwrite hoodie.properties with provided file. Risky operation. Proceed with caution!
       repair migrate-partition-meta: Migrate all partition meta file currently stored in text format to be stored in base file format. See HoodieTableConfig#PARTITION_METAFILE_USE_DATA_FORMAT.
       repair addpartitionmeta: Add partition metadata to a table, if not present
       repair deprecated partition: Repair deprecated partition ("default"). Re-writes data from the deprecated partition into __HIVE_DEFAULT_PARTITION__
       repair show empty commit metadata: show failed commits
       repair corrupted clean files: repair corrupted clean files

Rollbacks Command
       show rollback: Show details of a rollback instant
       commit rollback: Rollback a commit
       show rollbacks: List all rollback instants

Savepoints Command
       savepoint rollback: Savepoint a commit
       savepoints show: Show the savepoints
       savepoint create: Savepoint a commit
       savepoint delete: Delete the savepoint

Spark Env Command
       set: Set spark launcher env to cli
       show env: Show spark launcher env by key
       show envs all: Show spark launcher envs

Stats Command
       stats filesizes: File Sizes. Display summary stats on sizes of files
       stats wa: Write Amplification. Ratio of how many records were upserted to how many records were actually written

Table Command
       table update-configs: Update the table configs with configs with provided file.
       table recover-configs: Recover table configs, from update/delete that failed midway.
       refresh, metadata refresh, commits refresh, cleans refresh, savepoints refresh: Refresh table metadata
       create: Create a hoodie table if not present
       table delete-configs: Delete the supplied table configs from the table.
       fetch table schema: Fetches latest table schema
       connect: Connect to a hoodie table
       desc: Describe Hoodie Table properties

Temp View Command
       temp_query, temp query: query against created temp view
       temps_show, temps show: Show all views name
       temp_delete, temp delete: Delete view name

Timeline Command
       metadata timeline show incomplete: List all incomplete instants in active timeline of metadata table
       metadata timeline show active: List all instants in active timeline of metadata table
       timeline show incomplete: List all incomplete instants in active timeline
       timeline show active: List all instants in active timeline

Upgrade Or Downgrade Command
       downgrade table: Downgrades a table
       upgrade table: Upgrades a table

Utils Command
       utils loadClass: Load a class

kerberos

shell 复制代码
kerberos kinit --principal xxx@XXXXX.COM --keytab /xxx/kerberos/xxx.keytab

先看下样例表的表结构:

分区表哦!

sql 复制代码
-- FLink SQL建表语句
create table student_mysql_cdc_hudi_fl(
  `_hoodie_commit_time` string comment 'hoodie commit time',
  `_hoodie_commit_seqno` string comment 'hoodie commit seqno',
  `_hoodie_record_key` string comment 'hoodie record key',
  `_hoodie_partition_path` string comment 'hoodie partition path',
  `_hoodie_file_name` string comment 'hoodie file name',
  `s_id` bigint not null comment '主键',
  `s_name` string not null comment '姓名',
  `s_age` int comment '年龄',
  `s_sex` string comment '性别',
  `s_part` string not null comment '分区字段',
  `create_time` timestamp(6) not null comment '创建时间',
  `dl_ts` timestamp(6) not null,
  `dl_s_sex` string not null,
  PRIMARY KEY(s_id) NOT ENFORCED
)PARTITIONED BY (`dl_s_sex`) with ( 
,'connector' = 'hudi'
,'hive_sync.table' = 'student_mysql_cdc_hudi'
,'hoodie.datasource.write.drop.partition.columns' = 'true'
,'hoodie.datasource.write.hive_style_partitioning' = 'true'
,'hoodie.datasource.write.partitionpath.field' = 'dl_s_sex'
,'hoodie.datasource.write.precombine.field' = 'dl_ts'
,'path' = 'hdfs://xxx/hudi_db.db/student_mysql_cdc_hudi'
,'precombine.field' = 'dl_ts'
,'primaryKey' = 's_id'
)

table

connect

shell 复制代码
connect --path /xxx/hudi_db.db/student_mysql_cdc_hudi

desc

shell 复制代码
desc

refresh

shell 复制代码
refresh

fetch table schema

shell 复制代码
fetch table schema
json 复制代码
  "type" : "record",
  "name" : "student_mysql_cdc_hudi_fl_record",
  "namespace" : "hoodie.student_mysql_cdc_hudi_fl",
  "fields" : [ {
    "name" : "_hoodie_commit_time",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_commit_seqno",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_record_key",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_partition_path",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_file_name",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_operation",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "s_id",
    "type" : "long"
  }, {
    "name" : "s_name",
    "type" : "string"
  }, {
    "name" : "s_age",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "s_sex",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "s_part",
    "type" : "string"
  }, {
    "name" : "create_time",
    "type" : {
      "type" : "long",
      "logicalType" : "timestamp-micros"
    }
  }, {
    "name" : "dl_ts",
    "type" : {
      "type" : "long",
      "logicalType" : "timestamp-micros"
    }
  }, {
    "name" : "dl_s_sex",
    "type" : "string"
  } ]
}

commit

commits show

shell 复制代码
commits show --sortBy "Total Bytes Written" --desc true --limit 10

commits showarchived

shell 复制代码
commits showarchived

commit showfiles

shell 复制代码
commit showfiles --commit 20230915164442583
shell 复制代码
commit showfiles --commit 20230915164442583 --sortBy "Partition Path"

commit showpartitions

shell 复制代码
commit showpartitions --commit 20230915164442583
shell 复制代码
commit showpartitions --commit 20230915164442583 --sortBy "Total Bytes Written" --desc true --limit 10

commit show_write_stats

shell 复制代码
commit show_write_stats --commit 20230915164442583

File System View

show fsview all

shell 复制代码
show fsview all

show fsview latest

shell 复制代码
show fsview latest --partitionPath dl_s_sex=female

Log File

show logfile records

shell 复制代码
# 注意10 是需要取数据记录条数
show logfile records 10 /xxx/hudi_db.db/student_mysql_cdc_hudi/dl_s_sex=female/.bf4b06b4-e897-42df-8a3c-a3a2f737d367_20230915163856302.log.1_0-1-0

数据是json格式的:

json 复制代码
{
  "_hoodie_commit_time": "20230915163856302",
  "_hoodie_commit_seqno": "20230915163856302_0_83",
  "_hoodie_record_key": "88",
  "_hoodie_partition_path": "dl_s_sex=female",
  "_hoodie_file_name": "bf4b06b4-e897-42df-8a3c-a3a2f737d367",
  "_hoodie_operation": "I",
  "s_id": 88,
  "s_name": "傅亮",
  "s_age": 4,
  "s_sex": "female",
  "s_part": "2017/11/20",
  "create_time": 790128367000000,
  "dl_ts": -28800000000,
  "dl_s_sex": "female"
}

show logfile metadata

shell 复制代码
show logfile metadata /xxx/xxx/hive/hudi_db.db/student_mysql_cdc_hudi/dl_s_sex=female/dl_create_time_yyyy=1971/dl_create_time_mm=03/.dadac2dd-7e5e-46c3-9b27-f1f03e04a90c_20230915151426134.log.1_0

图片中还有FooterMetadata列没显示全

json 复制代码
{
  "SCHEMA": "{\"type\":\"record\",\"name\":\"student_mysql_cdc_hudi_fl_record\",\"namespace\":\"hoodie.student_mysql_cdc_hudi_fl\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_commit_seqno\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_record_key\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_partition_path\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_file_name\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_operation\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"s_id\",\"type\":\"long\"},{\"name\":\"s_name\",\"type\":\"string\"},{\"name\":\"s_age\",\"type\":[\"null\",\"int\"],\"default\":null},{\"name\":\"s_sex\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"s_part\",\"type\":\"string\"},{\"name\":\"create_time\",\"type\":{\"type\":\"long\",\"logicalType\":\"timestamp-micros\"}},{\"name\":\"dl_ts\",\"type\":{\"type\":\"long\",\"logicalType\":\"timestamp-micros\"}},{\"name\":\"dl_s_sex\",\"type\":\"string\"}]}",
  "INSTANT_TIME": "20230915164442583"
}

differ

diff partition

shell 复制代码
diff partition dl_s_sex=female
differ file
shell 复制代码
# 需要提供FileID。就是log文件的部分
# 如log文件:.bf4b06b4-e897-42df-8a3c-a3a2f737d367_20230915163856302.log.1_0-1-0
diff file bf4b06b4-e897-42df-8a3c-a3a2f737d367

rollbacks

show rollbacks

shell 复制代码
show rollbacks

stats

stats filesizes

shell 复制代码
stats filesizes --partitionPath dl_s_sex=female --sortBy "95th" --desc true --limit 3

stats wa

shell 复制代码
stats wa

compaction

compactions show all

shell 复制代码
compactions show all

compactions showarchived

shell 复制代码
compactions showarchived

compaction showarchived

shell 复制代码
compaction showarchived 20230915200042501

compaction show

shell 复制代码
compaction show 20230915174042680

参考文章:
Apache Hudi数据湖hudi-cli客户端使用

相关推荐
qq_12498707532 分钟前
基于协同过滤算法的在线教育资源推荐平台的设计与实现(源码+论文+部署+安装)
java·大数据·人工智能·spring boot·spring·毕业设计
程途拾光15842 分钟前
发展中国家的AI弯道超车:医疗AI的低成本本土化之路
大数据·人工智能
Mr-Apple1 小时前
记录一次git commit --amend的误操作
大数据·git·elasticsearch
寰天柚子2 小时前
大模型时代的技术从业者:核心能力重构与实践路径
大数据·人工智能
成长之路5142 小时前
【工具变量】上市公司西部陆海新通道DID数据(2010-2024年)
大数据
Hello.Reader3 小时前
Flink SQL UPDATE 语句批模式行级更新、连接器能力要求与实战避坑
大数据·sql·flink
毕设源码-赖学姐3 小时前
【开题答辩全过程】以 基于Spark的电商用户行为分析系统为例,包含答辩的问题和答案
大数据·分布式·spark
图导物联3 小时前
商场室内导航系统:政策适配 + 技术实现 + 代码示例,打通停车逛店全流程
大数据·人工智能·物联网
牛奔4 小时前
git本地提交后,解决push被拒绝 error: failed to push some refs to
大数据·git·elasticsearch·搜索引擎·全文检索
梦里不知身是客114 小时前
doris的优化器策略介绍
大数据