Kafka Bulk Deployment with Ansible + Helm, Load Testing, and Monitoring

1. Environment Deployment List

1.1 AWS MSK Deployment List

| Env  | Kafka Version | Spec             | Storage | Nodes        |
|------|---------------|------------------|---------|--------------|
| dev  | 3.6           | t3.small (2C2G)  | 500G    | 3            |
| sit  | 3.8.x         | m7g.large (2C8G) | 130G    | 3            |
| fat  | 3.8.x         | m7g.large (2C8G) | 150G    | 3            |
| qa   | 3.8.x         | m7g.large (2C8G) | 150G    | 4 (2 per AZ) |
| uat  | 3.7.x         | m7g.large (2C8G) | 1000G   | 3            |
| prod | 3.6.x         | m7g.large (2C8G) | 3100G   | 3            |

1.2 Self-Managed Kafka Deployment List

| Env/Version | Deployment      | Spec     | Storage | Notes              |
|-------------|-----------------|----------|---------|--------------------|
| dev/3.9.0   | Container (K8S) | 2C2G × 3 | 300G    | Tokyo zone A       |
| fat/3.9.0   | Container (K8S) | 2C8G × 3 | 300G    | Tokyo zone A       |
| qa/3.9.0    | Container (K8S) | 2C8G × 3 | 300G    | Tokyo zone A       |
| sit/3.9.0   | EC2             | 4C8G × 3 | 300G    | Better performance |
| uat/3.9.0   | EC2             | 4C8G × 3 | 300G    | Better performance |
| prod/3.9.0  | EC2             | 4C8G × 3 | 500G    | To be deployed     |


2. EC2 Deployment (Ansible Automation)

2.1 Control Script (Recommended)

Deploy the cluster

./scripts/kafka-ctl.sh sit deploy

Start the service

./scripts/kafka-ctl.sh sit start

Stop the service

./scripts/kafka-ctl.sh sit stop

Rolling restart

./scripts/kafka-ctl.sh sit restart

Check status

./scripts/kafka-ctl.sh sit status

Health check

./scripts/kafka-ctl.sh sit health

List topics

./scripts/kafka-ctl.sh sit topics
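The real `scripts/kafka-ctl.sh` lives in the repository shown at the end of this document. As a sketch of the pattern it follows, a wrapper like this maps `<env> <action>` onto the playbook invocations from section 2.2 — the mapping here is illustrative, not the script's actual contents, and it prints the command instead of executing it:

```shell
#!/usr/bin/env bash
# Sketch of the kafka-ctl.sh dispatch: map <env> <action> onto
# ansible-playbook calls. Prints the command it would run; a real
# script would exec it instead of echoing.
kafka_ctl() {
  local env="$1" action="$2"
  local inventory="inventories/${env}/hosts.yml"
  case "$action" in
    deploy|start|stop|restart|status)
      echo "ansible-playbook -i ${inventory} playbooks/${action}.yml" ;;
    health)
      echo "ansible-playbook -i ${inventory} playbooks/maintenance.yml -e task=check_health" ;;
    topics)
      echo "ansible-playbook -i ${inventory} playbooks/topic-management.yml -e action=list" ;;
    *)
      echo "unknown action: ${action}" >&2; return 1 ;;
  esac
}

kafka_ctl sit deploy   # -> ansible-playbook -i inventories/sit/hosts.yml playbooks/deploy.yml
```

Keeping the environment as the first argument makes it hard to run a destructive action against the wrong inventory by accident.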

2.2 Ansible Playbooks

Deploy the cluster

ansible-playbook -i inventories/sit/hosts.yml playbooks/deploy.yml

Start the service

ansible-playbook -i inventories/sit/hosts.yml playbooks/start.yml

Stop the service

ansible-playbook -i inventories/sit/hosts.yml playbooks/stop.yml

Rolling restart

ansible-playbook -i inventories/sit/hosts.yml playbooks/restart.yml

Check status

ansible-playbook -i inventories/sit/hosts.yml playbooks/status.yml

Single-node operation

ansible-playbook -i inventories/sit/hosts.yml playbooks/restart.yml --limit kafka-sit-1

2.3 Maintenance

Health check

ansible-playbook -i inventories/sit/hosts.yml playbooks/maintenance.yml -e task=check_health

Clean up logs

ansible-playbook -i inventories/sit/hosts.yml playbooks/maintenance.yml -e task=cleanup_logs

Back up configuration

ansible-playbook -i inventories/sit/hosts.yml playbooks/maintenance.yml -e task=backup_config

2.4 Topic Management

List all topics

ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml -e "action=list"

Create a topic

ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
  -e "action=create topic_name=my-topic partitions=3 replication_factor=2"

Describe a topic

ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
  -e "action=describe topic_name=my-topic"

Delete a topic

ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
  -e "action=delete topic_name=my-topic"

2.5 Upgrading Kafka

ansible-playbook -i inventories/sit/hosts.yml playbooks/upgrade.yml -e kafka_version=3.10.0


3. Manual Deployment (KRaft Mode)

3.1 Prerequisites

1. Configure /etc/hosts (avoids DNS resolution issues)

sudo tee -a /etc/hosts << EOF
10.17.7.40 kafka-node-0
10.18.118.130 kafka-node-1
10.18.17.213 kafka-node-2
EOF

2. Download and extract Kafka

wget https://archive.apache.org/dist/kafka/3.6.2/kafka_2.13-3.6.2.tgz

tar -zxvf kafka_2.13-3.6.2.tgz

ln -s kafka_2.13-3.6.2 kafka

3.2 Node Configuration (server.properties)

Node 0 configuration (broker.id=0, IP: 10.17.7.40):

# Basic identity
broker.id=0
node.id=0
cluster.id=kafka-cluster

# Listener configuration (KRaft mode)
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
advertised.listeners=PLAINTEXT://kafka-node-0:9092
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
controller.listener.names=CONTROLLER

# KRaft core configuration (no ZooKeeper)
process.roles=broker,controller
controller.quorum.voters=0@kafka-node-0:9093,1@kafka-node-1:9093,2@kafka-node-2:9093

# Data and log directories
log.dirs=/data/kafka/data
log.dir=/data/kafka/logs

# Topic defaults
num.partitions=3
default.replication.factor=3
min.insync.replicas=2
offsets.topic.replication.factor=3

# Performance tuning
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
group.initial.rebalance.delay.ms=0

Node 1/2 configuration: change only the following three settings

broker.id=1                                          # or 2
node.id=1                                            # or 2
advertised.listeners=PLAINTEXT://kafka-node-1:9092   # or kafka-node-2
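Since only three keys differ between nodes, both the shared `controller.quorum.voters` string and the per-node overrides can be generated from one host list instead of being hand-edited three times. A sketch, using the host names from the /etc/hosts step (the output format is ours):

```shell
#!/usr/bin/env bash
# Generate the shared quorum voters string and the per-node override
# lines from a single host list. Array index doubles as the node ID.
hosts=(kafka-node-0 kafka-node-1 kafka-node-2)

# Build controller.quorum.voters: id@host:9093, comma-separated.
voters=""
for i in "${!hosts[@]}"; do
  voters+="${voters:+,}${i}@${hosts[$i]}:9093"
done
echo "controller.quorum.voters=${voters}"

# The only three lines that differ between nodes.
for i in "${!hosts[@]}"; do
  printf 'node %s: broker.id=%s node.id=%s advertised.listeners=PLAINTEXT://%s:9092\n' \
    "$i" "$i" "$i" "${hosts[$i]}"
done
```

Templating these values is exactly what the `server.properties.j2` template in the Ansible repo below does for the EC2 deployments.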

3.3 Initialize the KRaft Cluster (run on node 0 only)

1. Generate a cluster ID

CLUSTER_ID=$(~/kafka/bin/kafka-storage.sh random-uuid)
echo "Cluster ID: $CLUSTER_ID"
# Example output: rL_902f6c1d13a1a2e8d6e3b5c8a7f2d1e0

2. Format node 0's storage directory

bin/kafka-storage.sh format -t $CLUSTER_ID -c /data/kafka/config/server.properties

3. Copy the cluster ID to nodes 1 and 2 (over SSH) and run the same format command there

bin/kafka-storage.sh format -t $CLUSTER_ID -c /data/kafka/config/server.properties

3.4 Start the Cluster

Create a startup script (run on every node):

#!/bin/bash
# JVM settings (2G initial, 4G max)
export KAFKA_HEAP_OPTS="-Xms2G -Xmx4G"
export KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:/data/kafka/config/log4j.properties"

# Start Kafka in the background
~/kafka/bin/kafka-server-start.sh -daemon /data/kafka/config/server.properties

# Verify startup
sleep 10
jps | grep Kafka  # a Kafka process ID in the output means startup succeeded

Important: start strictly in the order node 0 → node 1 → node 2 to avoid controller election problems.

3.5 Verify Cluster Status

List topics (confirms the brokers are reachable)

~/kafka/bin/kafka-topics.sh --bootstrap-server kafka-node-0:9092 --list

Inspect the cluster metadata and controller state

~/kafka/bin/kafka-metadata-shell.sh --snapshot /data/kafka/data/__cluster_metadata-0/00000000000000000000.log

Verify topic creation (test)

~/kafka/bin/kafka-topics.sh --bootstrap-server kafka-node-0:9092 --create \
  --topic test-topic --partitions 3 --replication-factor 3

~/kafka/bin/kafka-topics.sh --bootstrap-server kafka-node-0:9092 --describe --topic test-topic


4. K8S Deployment (Helm)

4.1 Deployment Steps

Search for available chart versions

helm search repo bitnami/kafka --versions 2>/dev/null | grep -E "3\.9\.|3\.8\.|3\.7\." | head -20

Deploy or update the cluster

cd /root/kafka
helm upgrade kafka-cluster bitnami/kafka --version 31.5.0 -f kafka-cluster-values.yaml -n kafka-cluster

4.2 Helm Values

# Image configuration
image:
  registry: 292309088324.dkr.ecr.ap-northeast-1.amazonaws.com
  repository: sretools
  tag: kafka-3.9.0
  pullPolicy: Always

# Global settings
global:
  storageClass: "gp2"
  security:
    allowInsecureImages: true

# Controller configuration (KRaft mode)
controller:
  replicaCount: 3
  automountServiceAccountToken: true
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
  persistence:
    enabled: true
    size: 300Gi
    storageClass: "gp2"
  heapOpts: "-Xmx2g -Xms2g"

# Broker configuration (can be 0 in KRaft mode: the controllers also act as brokers)
broker:
  replicaCount: 0

# Listener configuration
listeners:
  client:
    protocol: PLAINTEXT
  interbroker:
    protocol: PLAINTEXT
  controller:
    protocol: PLAINTEXT
  external:
    protocol: PLAINTEXT

# External access
externalAccess:
  enabled: true
  autoDiscovery:
    enabled: true
  controller:
    service:
      type: LoadBalancer
      ports:
        external: 9092
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: "external"
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
        service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"

# Core Kafka configuration
extraConfig: |
  num.partitions=3
  default.replication.factor=2
  min.insync.replicas=2
  offsets.topic.replication.factor=3
  transaction.state.log.replication.factor=3
  log.retention.hours=72
  auto.create.topics.enable=false

4.3 In-Cluster Testing (K8S)

Create a test topic

kubectl run kafka-test --rm -it --restart='Never' \
  --image=292309088324.dkr.ecr.ap-northeast-1.amazonaws.com/sretools:kafka-3.9.0 \
  --namespace kafka-cluster \
  -- /opt/bitnami/kafka/bin/kafka-topics.sh \
  --create \
  --topic test-internal \
  --partitions 3 \
  --replication-factor 2 \
  --bootstrap-server kafka-cluster.kafka-cluster.svc.cluster.local:9092

List topics

kubectl run kafka-test2 --rm -it --restart='Never' \
  --image=292309088324.dkr.ecr.ap-northeast-1.amazonaws.com/sretools:kafka-3.9.0 \
  --namespace kafka-cluster \
  -- /opt/bitnami/kafka/bin/kafka-topics.sh \
  --list \
  --bootstrap-server kafka-cluster.kafka-cluster.svc.cluster.local:9092

4.4 External Access Testing

Create a topic

bin/kafka-topics.sh --create --topic test-external --partitions 3 \
  --replication-factor 2 --bootstrap-server k8s-kafkaclu-kafkaclu-xxx.elb.ap-northeast-1.amazonaws.com:9094

Produce a message

echo "hello from external" | ./kafka-console-producer.sh \
  --broker-list k8s-kafkaclu-kafkaclu-xxx.elb.ap-northeast-1.amazonaws.com:9094 --topic test-external

Consume the message

./kafka-console-consumer.sh --bootstrap-server k8s-kafkaclu-kafkaclu-xxx.elb.ap-northeast-1.amazonaws.com:9094 \
  --topic test-external --from-beginning --max-messages 1

Output: hello from external


5. Detailed Broker Configuration

5.1 Full Configuration File

# 1. Basic identity and cluster
node.id=1
cluster.id=kafka-dev

# 2. Network listeners
listeners=PLAINTEXT://10.18.118.130:9092,CONTROLLER://10.18.118.130:9093
advertised.listeners=PLAINTEXT://10.18.118.130:9092
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
controller.listener.names=CONTROLLER

# 3. KRaft core configuration
process.roles=broker,controller
controller.quorum.voters=0@10.17.7.40:9093,1@10.18.118.130:9093,2@10.18.17.213:9093

# 4. Storage directories
log.dirs=/data/kafka/data

# 5. Threads and networking
num.network.threads=3
num.io.threads=4
background.threads=4
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=52428800

# 6. Topic defaults
num.partitions=3
num.replica.fetchers=2
default.replication.factor=3
min.insync.replicas=1

# 7. Log retention
log.retention.hours=24
log.retention.bytes=64424509440
log.segment.bytes=536870912
log.roll.ms=3600000
log.cleaner.enable=true
log.cleanup.policy=delete

# 8. Internal topic replication factors
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3

# 9. Operations and monitoring
log.cleaner.backoff.ms=15000
log.flush.interval.messages=10000
log.flush.interval.ms=1000
connections.max.idle.ms=300000
group.initial.rebalance.delay.ms=3000
auto.create.topics.enable=false
delete.topic.enable=true
compression.type=producer

# 10. Replica synchronization
replica.lag.time.max.ms=30000

5.2 Key Parameter Reference

| Parameter                  | Recommended | Notes                                                 |
|----------------------------|-------------|-------------------------------------------------------|
| num.partitions             | 3           | Default partition count                               |
| default.replication.factor | 3           | Default replication factor (must be ≥3 in production) |
| min.insync.replicas        | 2           | Minimum in-sync replicas (pair with acks=all)         |
| log.retention.hours        | 24-168      | Log retention time (hours)                            |
| log.segment.bytes          | 512MB-1GB   | Log segment size                                      |
| num.network.threads        | 3           | Network thread count                                  |
| num.io.threads             | 4-8         | I/O thread count                                      |
| auto.create.topics.enable  | false       | Must be disabled in production (no auto-created topics) |
| delete.topic.enable        | true        | Allow topic deletion                                  |
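Note that `log.retention.bytes` is enforced per partition, so the 64424509440 value in the configuration above is worth translating into GiB and comparing against the broker's disk size. A quick check — the four-replicas-per-broker figure is only an illustrative assumption:

```shell
# log.retention.bytes from section 5.1, expressed in GiB.
retention_bytes=64424509440
gib=$(( retention_bytes / 1024 / 1024 / 1024 ))
echo "per-partition cap: ${gib} GiB"

# Worst case if, say, 4 partition replicas land on one broker
# (example figure; compare against the 300G-500G volumes above).
replicas_per_broker=4
echo "worst case on one broker: $(( gib * replicas_per_broker )) GiB"
```

With time-based retention at 24 hours as well, whichever limit is hit first wins, but the byte cap is the one that bounds disk usage under a traffic spike.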


6. Monitoring

6.1 Monitoring Architecture

kafka_exporter + jmx_exporter → Prometheus → Grafana

6.2 Key Metrics

| Category | Metric                         | Alert Threshold                           |
|----------|--------------------------------|-------------------------------------------|
| Broker   | UnderReplicatedPartitions      | alert when > 0                            |
| Broker   | RequestHandlerAvgIdlePercent   | alert when < 30%                          |
| System   | Memory usage                   | alert when > 85%                          |
| System   | EBS volume usage               | alert when > 80%                          |
| System   | EBS queue depth (avg.queue_len)| sustained > 2 indicates a disk bottleneck |
| System   | CPU usage                      | sustained > 70%: consider an upgrade      |
| JVM      | Full GC frequency and duration | alert on frequent Full GCs                |
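The thresholds in the table can be encoded as Prometheus alert rules. A sketch, assuming the lowercased `kafka_server_<type>_<name>` naming that the jmx_exporter mapping in section 6.4 produces — verify the exact metric names your exporters actually emit before using this:

```yaml
# Sample alert rules for the first two table rows (assumed metric names).
groups:
  - name: kafka-alerts
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
      - alert: KafkaRequestHandlerIdleLow
        expr: kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent < 0.3
        for: 10m
        labels:
          severity: warning
```

The system-level rows (memory, EBS, CPU) are usually covered by node_exporter metrics rather than the Kafka exporters.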

6.3 kafka_exporter Deployment

Download

wget https://github.com/danielqsj/kafka_exporter/releases/download/v1.6.0/kafka_exporter-1.6.0.linux-amd64.tar.gz

Start

nohup ./kafka_exporter \
  --kafka.server=10.18.17.213:9092 \
  --kafka.server=10.18.118.130:9092 \
  --kafka.server=10.17.7.40:9092 \
  --kafka.version=3.9.0 \
  --web.listen-address=0.0.0.0:9308 \
  --log.level=info > kafka-exporter-node1.log 2>&1 &

Prometheus scrape configuration

- job_name: 'rapidx_kafka_cluster_dev'
  metrics_path: /metrics
  scrape_interval: 15s
  scrape_timeout: 15s
  static_configs:
    - targets: ['10.18.118.130:9308']
      labels:
        instance: "rapidx-kafka-cluster-dev"
        env: dev

6.4 jmx_exporter Deployment

Download

wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar

Create the configuration file jmx_exporter_config.yml:

lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  # Core Kafka metrics
  - pattern: 'kafka.server<type=(.+), name=(.+), topic=(.+), partition=(.+)><>Value'
    name: kafka_server_$1_$2
    labels:
      topic: $3
      partition: $4
  - pattern: 'kafka.server<type=(.+), name=(.+)><>Value'
    name: kafka_server_$1_$2
  - pattern: 'kafka.network<type=(.+), name=(.+)><>Value'
    name: kafka_network_$1_$2
  - pattern: 'kafka.controller<type=(.+), name=(.+)><>Value'
    name: kafka_controller_$1_$2
  # Basic JVM metrics
  - pattern: 'java.lang<type=Memory><>HeapMemoryUsage'
    name: jvm_heap_memory_usage
  - pattern: 'java.lang<type=Memory><>NonHeapMemoryUsage'
    name: jvm_nonheap_memory_usage
  - pattern: 'java.lang<type=GarbageCollector, name=(.+)><>CollectionCount'
    name: jvm_gc_collection_count
    labels:
      gc: $1
  - pattern: 'java.lang<type=GarbageCollector, name=(.+)><>CollectionTime'
    name: jvm_gc_collection_time_ms
    labels:
      gc: $1
  - pattern: 'java.lang<type=Threading><>ThreadCount'
    name: jvm_thread_count

Update the Kafka startup script:

#!/bin/bash
source /etc/profile

# Attach the JMX exporter agent
export KAFKA_OPTS="$KAFKA_OPTS -javaagent:/opt/kafka/jmx_exporter/jmx_prometheus_javaagent-0.20.0.jar=9999:/opt/kafka/jmx_exporter/jmx_exporter_config.yml"

# Start Kafka
/opt/kafka/kafka_2.12-3.9.0/bin/kafka-server-start.sh -daemon /data/kafka/config/server.properties

Prometheus JMX scrape configuration

- job_name: 'rapidx_kafka_cluster_dev-jmx'
  metrics_path: /metrics
  scrape_interval: 15s
  scrape_timeout: 15s
  static_configs:
    - targets: ['10.17.7.40:9999']
      labels:
        instance: "10.17.7.40:9999"
        env: dev
        broker_host: 'broker-0'
    - targets: ['10.18.118.130:9999']
      labels:
        instance: "10.18.118.130:9999"
        env: dev
        broker_host: 'broker-1'
    - targets: ['10.18.17.213:9999']
      labels:
        instance: "10.18.17.213:9999"
        env: dev
        broker_host: 'broker-2'


7. Common Maintenance Commands

7.1 Topic Operations

List all topics

bin/kafka-topics.sh --list --bootstrap-server 10.18.118.130:9092

Describe a topic

./kafka-topics.sh --describe --bootstrap-server 10.18.118.130:9092 --topic test-topic

Create a topic

bin/kafka-topics.sh --bootstrap-server 10.18.118.130:9092 --create \
  --topic test-topic --partitions 3 --replication-factor 3

Create a topic (with per-topic configuration)

./kafka-topics.sh \
  --create \
  --bootstrap-server <MSK_BROKER_LIST> \
  --topic my-msk-topic \
  --partitions 3 \
  --replication-factor 2 \
  --config retention.ms=86400000 \
  --config cleanup.policy=delete
# retention.ms=86400000: retain messages for 24 hours
# cleanup.policy=delete: delete expired messages (the default)

Delete a topic

./bin/kafka-topics.sh --bootstrap-server 10.18.118.130:9092 --delete --topic default.trading.event

7.2 Produce/Consume Tests

Produce a test message

echo "hello kafka" | /opt/kafka/bin/kafka-console-producer.sh \
  --bootstrap-server 10.17.9.79:9092 --topic test-topic --producer-property acks=all

Consume test messages

/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server 10.17.9.79:9092 \
  --topic test-topic --from-beginning --max-messages 10

7.3 Cluster Management

Inspect cluster metadata

/opt/kafka/bin/kafka-metadata.sh --snapshot \
  /data/kafka/data/__cluster_metadata-0/00000000000000000000.log --command describe

Stop the Kafka cluster (EC2)

bin/kafka-server-stop.sh
jps | grep Kafka  # no output means the broker has stopped

Start the Kafka cluster (EC2, in order: node 0, 1, 2)

start-kafka.sh
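Because start order matters (see section 3.4), the ordered start can be driven from one host over SSH. A dry-run sketch — the host names and the `/data/kafka/start-kafka.sh` path are assumptions based on earlier sections, and it only prints the commands it would run:

```shell
#!/usr/bin/env bash
# Dry-run of an ordered cluster start: node 0 -> node 1 -> node 2.
# Replace the printf with actual `ssh`/health-check calls to execute.
nodes=(kafka-node-0 kafka-node-1 kafka-node-2)

cmds=()
for n in "${nodes[@]}"; do
  cmds+=("ssh ${n} /data/kafka/start-kafka.sh")
done
printf '%s\n' "${cmds[@]}"
```

In a real rollout each `ssh` call should be followed by a wait for port 9093 before moving to the next node, which is what the Ansible `deploy.yml` playbook below does with `wait_for`.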

7.4 Helm Operations

Install or upgrade

helm upgrade --install kafka-cluster bitnami/kafka \
  --values kafka-cluster-values.yaml \
  --version 26.0.0 \
  -n kafka-cluster


8. Performance Load Testing

8.1 Test Tool

Kafka's bundled kafka-producer-perf-test.sh script.

8.2 Test Results

| Scenario           | Messages  | Throughput                        | Avg Latency | P99 Latency |
|--------------------|-----------|-----------------------------------|-------------|-------------|
| 10K × 1KB (acks=1) | 10,000    | 15,060 records/sec (14.71 MB/s)   | 93.83 ms    | 154 ms      |
| 100K × 1KB (acks=1)| 100,000   | 60,350 records/sec (58.94 MB/s)   | 280.75 ms   | 425 ms      |
| 1M × 1KB (acks=1)  | 1,000,000 | 137,438 records/sec (134.22 MB/s) | 92.40 ms    | 453 ms      |
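The MB/s column follows directly from the record rate: records/sec × 1,024-byte records, divided by 1024², since kafka-producer-perf-test.sh reports MB as 1024 × 1024 bytes. Checking the 1M-record row:

```shell
# Verify the reported throughput: 137,438 records/sec at 1 KiB per record.
records_per_sec=137438
record_size=1024
mb_per_sec=$(awk -v r="$records_per_sec" -v s="$record_size" \
  'BEGIN { printf "%.2f", r * s / (1024 * 1024) }')
echo "${mb_per_sec} MB/s"   # 134.22 MB/s, matching the table
```

The same arithmetic reproduces the 14.71 and 58.94 MB/s figures in the other rows.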

8.3 Test Commands

Send 10,000 1KB messages, unthrottled, acks=1

./bin/kafka-producer-perf-test.sh \
  --topic jame-topic1 \
  --num-records 10000 \
  --record-size 1024 \
  --throughput -1 \
  --producer-props bootstrap.servers=10.18.118.130:9092 acks=1 \
  --print-metrics

Send 1,000,000 1KB messages

./bin/kafka-producer-perf-test.sh \
  --topic jame-topic1 \
  --num-records 1000000 \
  --record-size 1024 \
  --throughput -1 \
  --producer-props bootstrap.servers=10.18.118.130:9092 acks=1 \
  --print-metrics


9. Monitoring Dashboards

  • Kafka topic dashboard: Kubernetes Kafka Topics

  • Kafka JVM dashboard: Kubernetes Kafka

Both the container and EC2 deployments are merged into these two dashboards, so everything can be viewed without switching templates.

Ansible bulk deployment:

kafka-ansible$ tree 
.
├── README.md
├── ansible.cfg
├── inventories
│   ├── sit
│   │   ├── group_vars
│   │   │   └── all.yml
│   │   ├── host_vars
│   │   └── hosts.yml
│   └── uat
│       ├── group_vars
│       │   └── all.yml
│       ├── host_vars
│       └── hosts.yml
├── playbooks
│   ├── deploy.yml
│   ├── enable-jmx-exporter.yml
│   ├── maintenance.yml
│   ├── restart.yml
│   ├── start.yml
│   ├── status.yml
│   ├── stop.yml
│   ├── topic-management.yml
│   └── upgrade.yml
├── roles
│   └── kafka
│       ├── defaults
│       │   └── main.yml
│       ├── files
│       ├── handlers
│       │   └── main.yml
│       ├── meta
│       │   └── main.yml
│       ├── tasks
│       │   ├── configure.yml
│       │   ├── install.yml
│       │   ├── main.yml
│       │   ├── metrics.yml
│       │   ├── prerequisites.yml
│       │   └── service.yml
│       ├── templates
│       │   ├── jmx-exporter.yml.j2
│       │   ├── kafka.env.j2
│       │   ├── kafka.service.j2
│       │   ├── log4j.properties.j2
│       │   └── server.properties.j2
│       └── vars
│           └── main.yml
└── scripts
    └── kafka-ctl.sh

--- /home/runner/kafka-ansible/README.md ---
# Kafka Ansible Deployment

An Ansible deployment for EC2 Kafka clusters, based on the Helm configuration, supporting bulk deployment and management of the SIT and UAT environments.

## Directory Structure

```
kafka-ansible/
├── ansible.cfg                 # Ansible configuration
├── inventories/
│   ├── sit/                    # SIT environment
│   │   ├── hosts.yml           # host inventory
│   │   └── group_vars/
│   │       └── all.yml         # environment variables
│   └── uat/                    # UAT environment
│       ├── hosts.yml           # host inventory
│       └── group_vars/
│           └── all.yml         # environment variables
├── roles/
│   └── kafka/                  # Kafka role
│       ├── defaults/           # default variables
│       ├── handlers/           # handlers
│       ├── meta/               # role metadata
│       ├── tasks/              # task files
│       ├── templates/          # configuration templates
│       └── vars/               # internal variables
├── playbooks/
│   ├── deploy.yml              # deploy
│   ├── start.yml               # start
│   ├── stop.yml                # stop
│   ├── restart.yml             # restart
│   ├── status.yml              # status check
│   ├── upgrade.yml             # upgrade
│   ├── maintenance.yml         # maintenance tasks
│   └── topic-management.yml    # topic management
├── scripts/
│   └── kafka-ctl.sh            # quick control script
└── README.md
```

## Environment Configuration

### SIT Environment
- IPs: 10.17.9.79, 10.17.9.57, 10.17.12.159
- Already configured

### UAT Environment
- IPs: to be provisioned
- See "Adding the UAT Environment" below for setup steps

## Quick Start

### 1. Configure the SSH Key

```bash
# Make sure the SSH key exists
ls /data/runner.key
```

### 2. Test Connectivity

```bash
cd kafka-ansible
ansible -i inventories/sit/hosts.yml kafka -m ping
```

### 3. Deploy the Cluster

```bash
# Using the control script
./scripts/kafka-ctl.sh sit deploy

# Or call ansible-playbook directly
ansible-playbook -i inventories/sit/hosts.yml playbooks/deploy.yml
```

## Common Commands

### Using the Control Script (Recommended)

```bash
# Deploy
./scripts/kafka-ctl.sh sit deploy

# Start
./scripts/kafka-ctl.sh sit start

# Stop
./scripts/kafka-ctl.sh sit stop

# Rolling restart
./scripts/kafka-ctl.sh sit restart

# Status check
./scripts/kafka-ctl.sh sit status

# Health check
./scripts/kafka-ctl.sh sit health

# List topics
./scripts/kafka-ctl.sh sit topics
```

### Using Playbooks Directly

```bash
# Deploy the cluster
ansible-playbook -i inventories/sit/hosts.yml playbooks/deploy.yml

# Start the service
ansible-playbook -i inventories/sit/hosts.yml playbooks/start.yml

# Stop the service
ansible-playbook -i inventories/sit/hosts.yml playbooks/stop.yml

# Rolling restart
ansible-playbook -i inventories/sit/hosts.yml playbooks/restart.yml

# Check status
ansible-playbook -i inventories/sit/hosts.yml playbooks/status.yml

# Single-node operation
ansible-playbook -i inventories/sit/hosts.yml playbooks/restart.yml --limit kafka-sit-1
```

### Maintenance

```bash
# Health check
ansible-playbook -i inventories/sit/hosts.yml playbooks/maintenance.yml -e task=check_health

# Clean up logs
ansible-playbook -i inventories/sit/hosts.yml playbooks/maintenance.yml -e task=cleanup_logs

# Back up configuration
ansible-playbook -i inventories/sit/hosts.yml playbooks/maintenance.yml -e task=backup_config
```

### Topic Management

```bash
# List all topics
ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml -e "action=list"

# Create a topic
ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
  -e "action=create topic_name=my-topic partitions=3 replication_factor=2"

# Describe a topic
ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
  -e "action=describe topic_name=my-topic"

# Delete a topic
ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
  -e "action=delete topic_name=my-topic"
```

### Upgrading Kafka

```bash
ansible-playbook -i inventories/sit/hosts.yml playbooks/upgrade.yml -e kafka_version=3.10.0
```

## Adding the UAT Environment

Once the UAT machines are provisioned, configure them as follows:

### 1. Update hosts.yml

Edit `inventories/uat/hosts.yml`:

```yaml
---
all:
  children:
    kafka:
      hosts:
        kafka-uat-1:
          ansible_host: 10.x.x.x    # replace with the actual IP
          kafka_broker_id: 1
          kafka_node_id: 1
        kafka-uat-2:
          ansible_host: 10.x.x.x    # replace with the actual IP
          kafka_broker_id: 2
          kafka_node_id: 2
        kafka-uat-3:
          ansible_host: 10.x.x.x    # replace with the actual IP
          kafka_broker_id: 3
          kafka_node_id: 3
      vars:
        kafka_cluster_id: "uat-kafka-cluster"
        environment_name: "uat"
```

### 2. Update group_vars (Optional)

If UAT uses a different SSH key, edit `inventories/uat/group_vars/all.yml`.

### 3. Deploy

```bash
./scripts/kafka-ctl.sh uat deploy
```

## Configuration Notes

### Directories
- Install dir: `/opt/kafka`
- Data dir: `/data/kafka/data`
- Log dir: `/data/kafka/logs`
- Config dir: `/opt/kafka/config`

### Ports
- Client: 9092
- Controller: 9093
- JMX: 9999
- Metrics Exporter: 9308

### JVM Settings (for 8G of RAM)
- Heap: 4G (-Xmx4g -Xms4g)
- GC: G1GC
- MaxGCPauseMillis: 20
- InitiatingHeapOccupancyPercent: 35

### Kafka Settings
- Network threads: 2
- I/O threads: 8
- Partitions: 3
- Replication factor: 2
- min.insync.replicas: 2
- Log retention: 24 hours

## Monitoring Integration

The Kafka cluster has the JMX Exporter configured; Prometheus can scrape metrics from:

```
http://<kafka-node-ip>:9308/metrics
```

## Troubleshooting

### Check Service Status
```bash
systemctl status kafka
journalctl -u kafka -f
```

### Check Logs
```bash
tail -f /data/kafka/logs/server.log
tail -f /data/kafka/logs/controller.log
```

### Check Cluster Metadata
```bash
/opt/kafka/bin/kafka-metadata.sh --snapshot /data/kafka/data/__cluster_metadata-0/00000000000000000000.log --command describe
```

## Notes

1. **Rolling operations**: all start, stop, and restart operations run one node at a time (serial: 1) to keep the cluster available
2. **Data directories**: the first deployment formats KRaft storage automatically; nodes that already hold data are skipped
3. **SSH key**: runs as the root user with key path /data/runner.key
4. **System initialization**: already done manually; the playbooks skip system configuration steps
5. **Firewall**: make sure ports 9092, 9093, 9999, and 9308 are open in the security group

--- /home/runner/kafka-ansible/ansible.cfg ---
[defaults]
# Inventory
inventory = inventories/sit/hosts.yml

# Roles path
roles_path = roles

# Host key checking
host_key_checking = False

# Timeout
timeout = 30

# SSH settings
remote_user = root
#private_key_file = /data/runner.key

# Retry files
retry_files_enabled = False

# Gathering
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
fact_caching_timeout = 86400

# Callback plugins
callback_whitelist = profile_tasks
stdout_callback = yaml

# Forks for parallel execution
forks = 10

# Display settings
display_skipped_hosts = False
display_ok_hosts = True

# Pipelining for performance
pipelining = True

[privilege_escalation]
become = False

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
control_path = %(directory)s/%%h-%%r
pipelining = True

--- /home/runner/kafka-ansible/inventories/sit/group_vars/all.yml ---
---
# SIT Environment Variables
ansible_user: root
#ansible_ssh_private_key_file: /data/runner.key
ansible_become: false

# Environment specific settings
env: sit
kafka_cluster_name: "kafka-sit"

# Network settings - SIT specific
kafka_advertised_listeners_prefix: "PLAINTEXT"

# Resource limits for SIT (can be lower than production)
kafka_heap_size: "4g"
kafka_min_heap_size: "4g"

# Replication settings
kafka_default_replication_factor: 2
kafka_min_insync_replicas: 2
kafka_offsets_topic_replication_factor: 3
kafka_transaction_state_log_replication_factor: 3
kafka_transaction_state_log_min_isr: 2

--- /home/runner/kafka-ansible/inventories/sit/hosts.yml ---
---
# SIT Environment Kafka Cluster
all:
  children:
    kafka:
      hosts:
        kafka-sit-1:
          ansible_host: 10.17.9.79
          kafka_broker_id: 1
          kafka_node_id: 1
        kafka-sit-2:
          ansible_host: 10.17.9.57
          kafka_broker_id: 2
          kafka_node_id: 2
        kafka-sit-3:
          ansible_host: 10.17.12.159
          kafka_broker_id: 3
          kafka_node_id: 3
      vars:
        # Cluster configuration
        kafka_cluster_id: "sit-kafka-cluster"
        environment_name: "sit"

--- /home/runner/kafka-ansible/inventories/uat/group_vars/all.yml ---
---
# UAT Environment Variables
ansible_user: root
ansible_become: false

# Environment specific settings
env: uat
kafka_cluster_name: "kafka-uat"

# Network settings - UAT specific
kafka_advertised_listeners_prefix: "PLAINTEXT"

# Resource limits for UAT (production-like)
kafka_heap_size: "4g"
kafka_min_heap_size: "4g"

# Replication settings
kafka_default_replication_factor: 2
kafka_min_insync_replicas: 2
kafka_offsets_topic_replication_factor: 3
kafka_transaction_state_log_replication_factor: 3
kafka_transaction_state_log_min_isr: 2

# UAT Node IPs
uat_node1_ip: "10.20.7.184"
uat_node2_ip: "10.20.13.223"
uat_node3_ip: "10.20.10.42"


--- /home/runner/kafka-ansible/inventories/uat/hosts.yml ---
---
# UAT Environment Kafka Cluster
# TODO: Update with actual IP addresses when machines are provisioned
all:
  children:
    kafka:
      hosts:
        kafka-uat-1:
          ansible_host: "{{ uat_node1_ip | default('PENDING') }}"
          kafka_broker_id: 1
          kafka_node_id: 1
        kafka-uat-2:
          ansible_host: "{{ uat_node2_ip | default('PENDING') }}"
          kafka_broker_id: 2
          kafka_node_id: 2
        kafka-uat-3:
          ansible_host: "{{ uat_node3_ip | default('PENDING') }}"
          kafka_broker_id: 3
          kafka_node_id: 3
      vars:
        # Cluster configuration
        kafka_cluster_id: "uat-kafka-cluster"
        environment_name: "uat"

--- /home/runner/kafka-ansible/playbooks/deploy.yml ---
---
# Kafka Deployment Playbook
# Usage:
#   SIT: ansible-playbook -i inventories/sit/hosts.yml playbooks/deploy.yml
#   UAT: ansible-playbook -i inventories/uat/hosts.yml playbooks/deploy.yml

# Phase 1: Install and configure on all nodes (parallel)
- name: Install and Configure Kafka
  hosts: kafka
  become: false

  pre_tasks:
    - name: Display deployment information
      ansible.builtin.debug:
        msg: |
          ========================================
          Deploying Kafka to: {{ inventory_hostname }}
          Environment: {{ environment_name | default('unknown') }}
          Node ID: {{ kafka_node_id }}
          Host IP: {{ ansible_host }}
          ========================================

    - name: Verify connectivity
      ansible.builtin.ping:

  roles:
    - role: kafka
      tags:
        - kafka
      vars:
        kafka_skip_start: true  # Don't start service yet

# Phase 2: Start all Kafka services together
- name: Start Kafka Cluster
  hosts: kafka
  become: false

  tasks:
    - name: Start Kafka service
      ansible.builtin.systemd:
        name: kafka
        state: started

    - name: Wait for controller port (9093)
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: 9093
        delay: 5
        timeout: 60
        state: started

# Phase 3: Verify cluster is healthy
- name: Verify Kafka Cluster
  hosts: kafka
  become: false
  serial: 1

  tasks:
    - name: Wait for Kafka broker port (9092)
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: "{{ kafka_client_port | default(9092) }}"
        delay: 10
        timeout: 120
        state: started

    - name: Verify Kafka is running
      ansible.builtin.command: "{{ kafka_install_dir | default('/opt/kafka') }}/bin/kafka-broker-api-versions.sh --bootstrap-server {{ ansible_host }}:{{ kafka_client_port | default(9092) }}"
      register: kafka_verify
      changed_when: false
      retries: 5
      delay: 10
      until: kafka_verify.rc == 0
      ignore_errors: yes

    - name: Display deployment status
      ansible.builtin.debug:
        msg: |
          ========================================
          Kafka deployed successfully on {{ inventory_hostname }}
          Bootstrap Server: {{ ansible_host }}:9092
          JMX Port: 9999
          ========================================


--- /home/runner/kafka-ansible/playbooks/enable-jmx-exporter.yml ---
---
# Enable JMX Exporter for Prometheus monitoring
# Usage: ansible-playbook -i inventories/sit/hosts.yml playbooks/enable-jmx-exporter.yml

- name: Enable JMX Prometheus Exporter
  hosts: kafka
  serial: 1

  vars:
    jmx_exporter_version: "1.0.1"
    jmx_exporter_port: 9308
    kafka_install_dir: "/opt/kafka"

  tasks:
    - name: Create metrics directory
      ansible.builtin.file:
        path: "{{ kafka_install_dir }}/metrics"
        state: directory
        mode: "0755"

    - name: Download JMX Prometheus agent
      ansible.builtin.get_url:
        url: "https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/{{ jmx_exporter_version }}/jmx_prometheus_javaagent-{{ jmx_exporter_version }}.jar"
        dest: "{{ kafka_install_dir }}/libs/jmx_prometheus_javaagent.jar"
        mode: "0644"

    - name: Deploy JMX exporter configuration
      ansible.builtin.copy:
        dest: "{{ kafka_install_dir }}/metrics/jmx-exporter.yml"
        mode: "0644"
        content: |
          lowercaseOutputName: true
          lowercaseOutputLabelNames: true
          rules:
            # Kafka server metrics
            - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
              name: kafka_server_$1_$2
              type: GAUGE
              labels:
                clientId: "$3"
                topic: "$4"
                partition: "$5"
            - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), brokerHost=(.+), brokerPort=(.+)><>Value
              name: kafka_server_$1_$2
              type: GAUGE
              labels:
                clientId: "$3"
                broker: "$4:$5"
            - pattern: kafka.server<type=(.+), name=(.+)><>Value
              name: kafka_server_$1_$2
              type: GAUGE
            - pattern: kafka.server<type=(.+), name=(.+)><>Count
              name: kafka_server_$1_$2_total
              type: COUNTER
            # Kafka network metrics
            - pattern: kafka.network<type=(.+), name=(.+), request=(.+), error=(.+)><>Count
              name: kafka_network_$1_$2_total
              type: COUNTER
              labels:
                request: "$3"
                error: "$4"
            - pattern: kafka.network<type=(.+), name=(.+), request=(.+)><>Count
              name: kafka_network_$1_$2_total
              type: COUNTER
              labels:
                request: "$3"
            - pattern: kafka.network<type=(.+), name=(.+)><>Value
              name: kafka_network_$1_$2
              type: GAUGE
            # Kafka controller metrics
            - pattern: kafka.controller<type=(.+), name=(.+)><>Value
              name: kafka_controller_$1_$2
              type: GAUGE
            - pattern: kafka.controller<type=(.+), name=(.+)><>Count
              name: kafka_controller_$1_$2_total
              type: COUNTER
            # KRaft Raft metrics
            - pattern: kafka.raft<type=(.+), name=(.+)><>Value
              name: kafka_raft_$1_$2
              type: GAUGE
            - pattern: kafka.raft<type=(.+), name=(.+)><>Count
              name: kafka_raft_$1_$2_total
              type: COUNTER
            # JVM metrics
            - pattern: java.lang<type=Memory><HeapMemoryUsage>(\w+)
              name: jvm_heap_memory_$1_bytes
              type: GAUGE
            - pattern: java.lang<type=Memory><NonHeapMemoryUsage>(\w+)
              name: jvm_nonheap_memory_$1_bytes
              type: GAUGE
            - pattern: java.lang<type=GarbageCollector, name=(.+)><CollectionCount>
              name: jvm_gc_collection_count_total
              type: COUNTER
              labels:
                gc: "$1"
            - pattern: java.lang<type=GarbageCollector, name=(.+)><CollectionTime>
              name: jvm_gc_collection_time_ms_total
              type: COUNTER
              labels:
                gc: "$1"
            - pattern: java.lang<type=Threading><ThreadCount>
              name: jvm_thread_count
              type: GAUGE

    - name: Enable JMX exporter in systemd service
      ansible.builtin.lineinfile:
        path: /etc/systemd/system/kafka.service
        regexp: '^#?Environment="KAFKA_OPTS=-javaagent'
        line: 'Environment="KAFKA_OPTS=-javaagent:{{ kafka_install_dir }}/libs/jmx_prometheus_javaagent.jar={{ jmx_exporter_port }}:{{ kafka_install_dir }}/metrics/jmx-exporter.yml"'
        insertafter: 'Environment="KAFKA_JMX_OPTS'

    - name: Reload systemd
      ansible.builtin.systemd:
        daemon_reload: yes

    - name: Restart Kafka
      ansible.builtin.systemd:
        name: kafka
        state: restarted

    - name: Wait for Kafka to be ready
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: 9092
        delay: 10
        timeout: 120

    - name: Wait for JMX exporter to be ready
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: "{{ jmx_exporter_port }}"
        delay: 5
        timeout: 60

    - name: Verify JMX exporter metrics
      ansible.builtin.uri:
        url: "http://{{ ansible_host }}:{{ jmx_exporter_port }}/metrics"
        return_content: no
        status_code: 200
      register: metrics_check

    - name: Display status
      ansible.builtin.debug:
        msg: "JMX Exporter enabled on {{ inventory_hostname }} - http://{{ ansible_host }}:{{ jmx_exporter_port }}/metrics"
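Once the exporter is listening on port 9308, Prometheus still needs a scrape job pointing at each broker. A minimal sketch of that fragment — the job name, interval, and target IPs are placeholders, not taken from this repo:

```yaml
# prometheus.yml (fragment, hypothetical) -- scrape the JMX exporter on each broker
scrape_configs:
  - job_name: "kafka-sit"             # assumed job name
    scrape_interval: 30s
    static_configs:
      - targets:                      # replace with the real broker addresses
          - "10.0.0.11:9308"
          - "10.0.0.12:9308"
          - "10.0.0.13:9308"
        labels:
          env: "sit"
```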


--- /home/runner/kafka-ansible/playbooks/maintenance.yml ---
---
# Kafka Maintenance Playbook
# Usage:
#   ansible-playbook -i inventories/sit/hosts.yml playbooks/maintenance.yml -e task=<task_name>
#
# Available tasks:
#   - cleanup_logs: Clean old log files
#   - backup_config: Backup Kafka configuration
#   - check_health: Comprehensive health check
#   - list_topics: List all topics
#   - describe_cluster: Show cluster details

- name: Kafka Maintenance Tasks
  hosts: kafka
  become: false

  vars:
    task: "check_health"  # Default task

  tasks:
    # Cleanup old logs
    - name: Cleanup old Kafka logs
      ansible.builtin.find:
        paths: /data/kafka/logs
        age: "7d"
        recurse: yes
        file_type: file
      register: old_logs
      when: task == "cleanup_logs"

    - name: Remove old log files
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_logs.files }}"
      when: task == "cleanup_logs" and old_logs.files | length > 0

    - name: Display cleanup result
      ansible.builtin.debug:
        msg: "Cleaned up {{ old_logs.files | length }} old log files on {{ inventory_hostname }}"
      when: task == "cleanup_logs"

    # Backup configuration
    - name: Create backup directory
      ansible.builtin.file:
        path: "/data/kafka/backups"
        state: directory
        owner: kafka
        group: kafka
        mode: "0755"
      when: task == "backup_config"

    - name: Backup Kafka configuration
      ansible.builtin.archive:
        path: /opt/kafka/config
        dest: "/data/kafka/backups/kafka_config_{{ ansible_date_time.date }}.tar.gz"
        owner: kafka
        group: kafka
      when: task == "backup_config"

    - name: Display backup result
      ansible.builtin.debug:
        msg: "Configuration backed up to /data/kafka/backups/kafka_config_{{ ansible_date_time.date }}.tar.gz"
      when: task == "backup_config"

    # Health check
    - name: Check Kafka process
      ansible.builtin.shell: pgrep -f 'kafka\.Kafka'
      register: kafka_pid
      changed_when: false
      failed_when: false
      when: task == "check_health"

    - name: Check Kafka port
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: 9092
        timeout: 5
      register: port_check
      ignore_errors: yes
      when: task == "check_health"

    - name: Check disk space
      ansible.builtin.shell: df -P /data/kafka | tail -1 | awk '{print $5}' | sed 's/%//'
      register: disk_usage
      changed_when: false
      when: task == "check_health"

    - name: Check memory usage
      ansible.builtin.shell: free -m | grep Mem | awk '{print int($3/$2*100)}'
      register: memory_usage
      changed_when: false
      when: task == "check_health"

    - name: Display health status
      ansible.builtin.debug:
        msg: |
          Health Check for {{ inventory_hostname }}:
          - Process: {{ 'Running (PID: ' + kafka_pid.stdout + ')' if kafka_pid.rc == 0 else 'NOT RUNNING' }}
          - Port 9092: {{ 'OK' if port_check is succeeded else 'FAILED' }}
          - Disk Usage: {{ disk_usage.stdout }}%
          - Memory Usage: {{ memory_usage.stdout }}%
          - Status: {{ 'HEALTHY' if kafka_pid.rc == 0 and port_check is succeeded and disk_usage.stdout|int < 85 else 'WARNING' }}
      when: task == "check_health"

# Cluster-wide tasks (run on first node only)
- name: Cluster-wide Maintenance Tasks
  hosts: kafka[0]
  become: false
  run_once: true

  vars:
    task: "check_health"

  tasks:
    # List topics
    - name: List all topics
      ansible.builtin.shell: /opt/kafka/bin/kafka-topics.sh --bootstrap-server {{ ansible_host }}:9092 --list
      register: topics
      changed_when: false
      when: task == "list_topics"

    - name: Display topics
      ansible.builtin.debug:
        msg: |
          Topics in cluster:
          {{ topics.stdout }}
      when: task == "list_topics"

    # Describe cluster
    - name: Get cluster info
      ansible.builtin.shell: /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server {{ ansible_host }}:9092 2>/dev/null | head -20
      register: cluster_info
      changed_when: false
      ignore_errors: yes
      when: task == "describe_cluster"

    - name: Display cluster info
      ansible.builtin.debug:
        msg: |
          Cluster Information:
          {{ cluster_info.stdout }}
      when: task == "describe_cluster"
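The health check's disk threshold logic is plain text parsing, so it can be exercised locally. A sketch running the same pipeline against a canned `df` output line (device and sizes are made up):

```shell
# Same pipeline as the check_health disk task, applied to a canned df line
line="/dev/nvme0n1p1 314572800 125829120 188743680 40% /data/kafka"
usage=$(echo "$line" | awk '{print $5}' | sed 's/%//')
echo "$usage"                                           # bare percentage: 40
# Same threshold the health summary uses (warn at >= 85%)
if [ "$usage" -lt 85 ]; then echo "HEALTHY"; else echo "WARNING"; fi
```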

--- /home/runner/kafka-ansible/playbooks/restart.yml ---
---
# Restart Kafka Service Playbook (Rolling Restart)
# Usage:
#   All nodes: ansible-playbook -i inventories/sit/hosts.yml playbooks/restart.yml
#   Single node: ansible-playbook -i inventories/sit/hosts.yml playbooks/restart.yml --limit kafka-sit-1

- name: Rolling Restart Kafka Cluster
  hosts: kafka
  become: false
  serial: 1  # Restart one node at a time

  tasks:
    - name: Display restart information
      ansible.builtin.debug:
        msg: "Starting rolling restart on {{ inventory_hostname }} ({{ ansible_host }})"

    - name: Stop Kafka service
      ansible.builtin.systemd:
        name: kafka
        state: stopped

    - name: Wait for Kafka to stop
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: 9092
        state: stopped
        timeout: 60

    - name: Pause before restart
      ansible.builtin.pause:
        seconds: 5

    - name: Start Kafka service
      ansible.builtin.systemd:
        name: kafka
        state: started

    - name: Wait for Kafka to be ready
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: 9092
        delay: 10
        timeout: 120
        state: started

    - name: Verify Kafka is running
      ansible.builtin.command: /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server {{ ansible_host }}:9092
      register: kafka_verify
      retries: 5
      delay: 10
      until: kafka_verify.rc == 0
      changed_when: false

    - name: Wait for cluster stabilization
      ansible.builtin.pause:
        seconds: 30
      when: groups['kafka'] | length > 1

    - name: Display restart status
      ansible.builtin.debug:
        msg: "Kafka restarted successfully on {{ inventory_hostname }}"
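Between node restarts it is worth confirming replication has caught up before moving to the next broker. A hedged task sketch (not part of this repo) that could sit before the stabilization pause:

```yaml
# Hypothetical gate: retry until no partition is under-replicated
- name: Wait for under-replicated partitions to clear
  ansible.builtin.shell: |
    /opt/kafka/bin/kafka-topics.sh --bootstrap-server {{ ansible_host }}:9092 \
      --describe --under-replicated-partitions
  register: urp_check
  changed_when: false
  retries: 10
  delay: 15
  until: urp_check.stdout | trim == ""
```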

--- /home/runner/kafka-ansible/playbooks/start.yml ---
---
# Start Kafka Service Playbook
# Usage:
#   All nodes: ansible-playbook -i inventories/sit/hosts.yml playbooks/start.yml
#   Single node: ansible-playbook -i inventories/sit/hosts.yml playbooks/start.yml --limit kafka-sit-1

- name: Start Kafka Cluster
  hosts: kafka
  become: false
  serial: 1  # Start one node at a time

  tasks:
    - name: Start Kafka service
      ansible.builtin.systemd:
        name: kafka
        state: started
        enabled: yes

    - name: Wait for Kafka to be ready
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: 9092
        delay: 5
        timeout: 120
        state: started

    - name: Verify Kafka is running
      ansible.builtin.command: systemctl status kafka
      register: kafka_status
      changed_when: false

    - name: Display Kafka status
      ansible.builtin.debug:
        msg: "Kafka is running on {{ inventory_hostname }} ({{ ansible_host }})"
      when: kafka_status.rc == 0

--- /home/runner/kafka-ansible/playbooks/status.yml ---
---
# Check Kafka Cluster Status Playbook
# Usage:
#   ansible-playbook -i inventories/sit/hosts.yml playbooks/status.yml

- name: Check Kafka Cluster Status
  hosts: kafka
  become: false
  gather_facts: yes

  tasks:
    - name: Check Kafka service status
      ansible.builtin.systemd:
        name: kafka
      register: kafka_service

    - name: Check Kafka port
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: 9092
        timeout: 5
        state: started
      register: port_check
      ignore_errors: yes

    - name: Check JMX port
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: 9999
        timeout: 5
        state: started
      register: jmx_check
      ignore_errors: yes

    - name: Get Kafka process info
      ansible.builtin.shell: ps aux | grep -v grep | grep 'kafka\.Kafka' || true
      register: kafka_process
      changed_when: false

    - name: Get disk usage
      ansible.builtin.shell: df -h /data/kafka
      register: disk_usage
      changed_when: false

    - name: Get Kafka data directory size
      ansible.builtin.shell: du -sh /data/kafka/data 2>/dev/null || echo "N/A"
      register: data_size
      changed_when: false

    - name: Display node status
      ansible.builtin.debug:
        msg: |
          ========================================
          Node: {{ inventory_hostname }} ({{ ansible_host }})
          ========================================
          Service Status: {{ 'RUNNING' if kafka_service.status.ActiveState == 'active' else 'STOPPED' }}
          Port 9092: {{ 'OPEN' if port_check is succeeded else 'CLOSED' }}
          JMX Port 9999: {{ 'OPEN' if jmx_check is succeeded else 'CLOSED' }}
          Data Size: {{ data_size.stdout }}
          ----------------------------------------
          Disk Usage:
          {{ disk_usage.stdout }}
          ========================================

- name: Cluster Summary
  hosts: kafka[0]
  become: false
  run_once: true

  tasks:
    - name: Check cluster metadata
      ansible.builtin.shell: |
        /opt/kafka/bin/kafka-metadata.sh --snapshot /data/kafka/data/__cluster_metadata-0/00000000000000000000.log --command describe 2>/dev/null | head -20 || echo "Metadata check skipped"
      register: cluster_metadata
      changed_when: false
      ignore_errors: yes

    - name: List topics
      ansible.builtin.shell: |
        /opt/kafka/bin/kafka-topics.sh --bootstrap-server {{ ansible_host }}:9092 --list 2>/dev/null || echo "No topics or cluster not ready"
      register: topics_list
      changed_when: false
      ignore_errors: yes

    - name: Display cluster summary
      ansible.builtin.debug:
        msg: |
          ========================================
          CLUSTER SUMMARY
          ========================================
          Cluster ID: {{ kafka_cluster_id | default('N/A') }}
          Environment: {{ environment_name | default('unknown') }}
          Total Nodes: {{ groups['kafka'] | length }}

          Topics:
          {{ topics_list.stdout }}

          Metadata (first 20 lines):
          {{ cluster_metadata.stdout }}
          ========================================

--- /home/runner/kafka-ansible/playbooks/stop.yml ---
---
# Stop Kafka Service Playbook
# Usage:
#   All nodes: ansible-playbook -i inventories/sit/hosts.yml playbooks/stop.yml
#   Single node: ansible-playbook -i inventories/sit/hosts.yml playbooks/stop.yml --limit kafka-sit-1

- name: Stop Kafka Cluster
  hosts: kafka
  become: false
  serial: 1  # Stop one node at a time for graceful shutdown

  tasks:
    - name: Display stop warning
      ansible.builtin.debug:
        msg: "Stopping Kafka on {{ inventory_hostname }} ({{ ansible_host }})"

    - name: Stop Kafka service gracefully
      ansible.builtin.systemd:
        name: kafka
        state: stopped

    - name: Wait for Kafka to stop
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: 9092
        state: stopped
        timeout: 60

    - name: Verify Kafka is stopped
      ansible.builtin.command: systemctl status kafka
      register: kafka_status
      failed_when: false
      changed_when: false

    - name: Display stop status
      ansible.builtin.debug:
        msg: "Kafka stopped successfully on {{ inventory_hostname }}"
      when: kafka_status.rc != 0

--- /home/runner/kafka-ansible/playbooks/topic-management.yml ---
---
# Kafka Topic Management Playbook
# Usage:
#   Create topic:
#     ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
#       -e "action=create topic_name=my-topic partitions=3 replication_factor=2"
#
#   Delete topic:
#     ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
#       -e "action=delete topic_name=my-topic"
#
#   Describe topic:
#     ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
#       -e "action=describe topic_name=my-topic"
#
#   List topics:
#     ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
#       -e "action=list"

- name: Kafka Topic Management
  hosts: kafka[0]
  become: false
  run_once: true

  vars:
    action: "list"
    topic_name: ""
    partitions: 3
    replication_factor: 2
    retention_ms: 86400000  # 24 hours
    bootstrap_server: "{{ ansible_host }}:9092"

  tasks:
    - name: Validate topic_name for create/delete/describe
      ansible.builtin.fail:
        msg: "topic_name is required for {{ action }} action"
      when: action in ['create', 'delete', 'describe'] and topic_name == ""

    # List topics
    - name: List all topics
      ansible.builtin.shell: |
        /opt/kafka/bin/kafka-topics.sh --bootstrap-server {{ bootstrap_server }} --list
      register: topics_list
      when: action == "list"
      changed_when: false

    - name: Display topics
      ansible.builtin.debug:
        msg: |
          ========================================
          Topics in cluster:
          ========================================
          {{ topics_list.stdout }}
      when: action == "list"

    # Create topic
    - name: Create topic
      ansible.builtin.shell: |
        /opt/kafka/bin/kafka-topics.sh --bootstrap-server {{ bootstrap_server }} \
          --create \
          --topic {{ topic_name }} \
          --partitions {{ partitions }} \
          --replication-factor {{ replication_factor }} \
          --config retention.ms={{ retention_ms }} \
          --if-not-exists
      register: create_result
      when: action == "create"

    - name: Display create result
      ansible.builtin.debug:
        msg: |
          Topic '{{ topic_name }}' created successfully
          Partitions: {{ partitions }}
          Replication Factor: {{ replication_factor }}
          Retention: {{ retention_ms }}ms
      when: action == "create" and create_result is succeeded

    # Delete topic
    - name: Confirm delete topic
      ansible.builtin.pause:
        prompt: "Are you sure you want to delete topic '{{ topic_name }}'? (yes/no)"
      register: confirm_delete
      when: action == "delete"

    - name: Delete topic
      ansible.builtin.shell: |
        /opt/kafka/bin/kafka-topics.sh --bootstrap-server {{ bootstrap_server }} \
          --delete \
          --topic {{ topic_name }}
      register: delete_result
      when: action == "delete" and confirm_delete.user_input == "yes"

    - name: Display delete result
      ansible.builtin.debug:
        msg: "Topic '{{ topic_name }}' deleted successfully"
      when: action == "delete" and delete_result is succeeded

    # Describe topic
    - name: Describe topic
      ansible.builtin.shell: |
        /opt/kafka/bin/kafka-topics.sh --bootstrap-server {{ bootstrap_server }} \
          --describe \
          --topic {{ topic_name }}
      register: describe_result
      when: action == "describe"
      changed_when: false

    - name: Display topic description
      ansible.builtin.debug:
        msg: |
          ========================================
          Topic Details: {{ topic_name }}
          ========================================
          {{ describe_result.stdout }}
      when: action == "describe"
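The `retention_ms` default above is plain millisecond arithmetic; a quick sketch of the conversion behind the 24-hour default:

```shell
# retention.ms is expressed in milliseconds: hours * 60 * 60 * 1000
hours=24
retention_ms=$((hours * 60 * 60 * 1000))
echo "$retention_ms"   # 86400000, the playbook default
```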

--- /home/runner/kafka-ansible/playbooks/upgrade.yml ---
---
# Kafka Rolling Upgrade Playbook
# Usage:
#   ansible-playbook -i inventories/sit/hosts.yml playbooks/upgrade.yml -e kafka_version=3.10.0

- name: Rolling Upgrade Kafka Cluster
  hosts: kafka
  become: false
  serial: 1  # Upgrade one node at a time

  vars_prompt:
    - name: confirm_upgrade
      prompt: "Are you sure you want to upgrade Kafka? (yes/no)"
      default: "no"
      private: no

  pre_tasks:
    - name: Abort if not confirmed
      ansible.builtin.fail:
        msg: "Upgrade cancelled by user"
      when: confirm_upgrade != "yes"

    - name: Display upgrade information
      ansible.builtin.debug:
        msg: |
          ========================================
          Upgrading Kafka on: {{ inventory_hostname }}
          Current Version: run 'readlink /opt/kafka' on the node to check
          Target Version: {{ kafka_version }}
          ========================================

  tasks:
    - name: Stop Kafka service
      ansible.builtin.systemd:
        name: kafka
        state: stopped

    - name: Wait for Kafka to stop
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: 9092
        state: stopped
        timeout: 60

    - name: Backup current Kafka installation
      ansible.builtin.shell: |
        TIMESTAMP=$(date +%Y%m%d_%H%M%S)
        if [ -L /opt/kafka ]; then
          # -L dereferences the symlink so the backup is a real copy of the
          # installation, not a copy of the link itself
          cp -rL /opt/kafka /opt/kafka_backup_${TIMESTAMP}
        fi
      ignore_errors: yes

    - name: Download new Kafka version
      ansible.builtin.get_url:
        url: "https://downloads.apache.org/kafka/{{ kafka_version }}/kafka_{{ kafka_scala_version | default('2.13') }}-{{ kafka_version }}.tgz"
        dest: "/tmp/kafka_{{ kafka_scala_version | default('2.13') }}-{{ kafka_version }}.tgz"
        mode: "0644"

    - name: Extract new Kafka version
      ansible.builtin.unarchive:
        src: "/tmp/kafka_{{ kafka_scala_version | default('2.13') }}-{{ kafka_version }}.tgz"
        dest: "/opt"
        remote_src: yes
        owner: kafka
        group: kafka

    - name: Update symbolic link
      ansible.builtin.file:
        src: "/opt/kafka_{{ kafka_scala_version | default('2.13') }}-{{ kafka_version }}"
        dest: "/opt/kafka"
        state: link
        owner: kafka
        group: kafka
        force: yes

    - name: Copy configuration from the latest backup to new version
      ansible.builtin.shell: |
        # ansible.builtin.copy does not expand globs in src, so resolve the
        # most recent backup directory in shell instead
        LATEST_BACKUP=$(ls -dt /opt/kafka_backup_* 2>/dev/null | head -1)
        if [ -n "$LATEST_BACKUP" ]; then
          cp -r "${LATEST_BACKUP}/config/." /opt/kafka/config/
          chown -R kafka:kafka /opt/kafka/config
        fi
      ignore_errors: yes

    - name: Start Kafka service
      ansible.builtin.systemd:
        name: kafka
        state: started

    - name: Wait for Kafka to be ready
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: 9092
        delay: 15
        timeout: 180
        state: started

    - name: Verify Kafka is running
      ansible.builtin.command: /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server {{ ansible_host }}:9092
      register: kafka_verify
      retries: 5
      delay: 15
      until: kafka_verify.rc == 0
      changed_when: false

    - name: Wait for cluster stabilization
      ansible.builtin.pause:
        seconds: 60

    - name: Display upgrade status
      ansible.builtin.debug:
        msg: "Kafka upgraded successfully on {{ inventory_hostname }}"
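The upgrade relies on the same symlink-flip pattern as install.yml: extract the new version beside the old one, then repoint `/opt/kafka`. A self-contained sketch of that mechanism (scratch paths and version numbers are illustrative):

```shell
# Simulate the version switch with ln -sfn in a scratch directory
tmp=$(mktemp -d)
mkdir "$tmp/kafka_2.13-3.9.0" "$tmp/kafka_2.13-3.9.1"
ln -sfn "$tmp/kafka_2.13-3.9.0" "$tmp/kafka"   # current version
ln -sfn "$tmp/kafka_2.13-3.9.1" "$tmp/kafka"   # flip to the new one
target=$(basename "$(readlink "$tmp/kafka")")
echo "$target"                                  # kafka_2.13-3.9.1
rm -rf "$tmp"
```

Because `ln -sfn` replaces the link in a single call, a rollback is just repointing the link at the backup directory and restarting the service.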

--- /home/runner/kafka-ansible/roles/kafka/defaults/main.yml ---
---
# Kafka Version and Download
kafka_version: "3.9.0"
kafka_scala_version: "2.13"
kafka_download_url: "https://downloads.apache.org/kafka/{{ kafka_version }}/kafka_{{ kafka_scala_version }}-{{ kafka_version }}.tgz"
kafka_checksum: ""  # Optional: Add SHA512 checksum for verification

# Directory Configuration
kafka_install_dir: "/opt/kafka"
kafka_data_dir: "/data/kafka/data"
kafka_log_dir: "/data/kafka/logs"
kafka_config_dir: "/opt/kafka/config"

# User and Group (using root as machines are pre-initialized)
kafka_user: "root"
kafka_group: "root"

# Network Configuration
kafka_client_port: 9092
kafka_controller_port: 9093
kafka_jmx_port: 9999
kafka_exporter_port: 9308

# Listener Configuration
kafka_listeners: "PLAINTEXT://:{{ kafka_client_port }},CONTROLLER://:{{ kafka_controller_port }}"
kafka_advertised_listeners: "PLAINTEXT://{{ ansible_host }}:{{ kafka_client_port }}"
kafka_listener_security_protocol_map: "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT"
kafka_inter_broker_listener_name: "PLAINTEXT"
kafka_controller_listener_names: "CONTROLLER"

# JVM Configuration (Based on 8G memory - heap set to 4G, ~50% of RAM)
kafka_heap_size: "4g"
kafka_min_heap_size: "4g"
kafka_jvm_opts: >-
  -Xmx{{ kafka_heap_size }}
  -Xms{{ kafka_min_heap_size }}
  -XX:+UseG1GC
  -XX:MaxGCPauseMillis=20
  -XX:InitiatingHeapOccupancyPercent=35
  -XX:+DisableExplicitGC
  -XX:+ParallelRefProcEnabled
  -Djava.awt.headless=true
  -Dcom.sun.management.jmxremote
  -Dcom.sun.management.jmxremote.port={{ kafka_jmx_port }}
  -Dcom.sun.management.jmxremote.local.only=false
  -Dcom.sun.management.jmxremote.authenticate=false
  -Dcom.sun.management.jmxremote.ssl=false

# Network and Connection Configuration
kafka_num_network_threads: 2
kafka_num_io_threads: 8
kafka_socket_send_buffer_bytes: -1
kafka_socket_receive_buffer_bytes: -1
kafka_socket_request_max_bytes: 157286400
kafka_queued_max_requests: 1000

# Replication Configuration
kafka_unclean_leader_election_enable: "false"
kafka_replica_lag_time_max_ms: 45000
kafka_replica_fetch_max_bytes: 16777216

# Topic Default Configuration
kafka_num_partitions: 3
kafka_default_replication_factor: 2
kafka_min_insync_replicas: 2
kafka_offsets_topic_replication_factor: 3
kafka_transaction_state_log_replication_factor: 3
kafka_transaction_state_log_min_isr: 2

# Performance Configuration
kafka_num_recovery_threads_per_data_dir: 1
kafka_log_retention_check_interval_ms: 1800000

# Partition Configuration
kafka_auto_leader_rebalance_enable: "false"
kafka_group_initial_rebalance_delay_ms: 5000
kafka_compression_type: "producer"

# Log Retention Configuration
kafka_log_retention_hours: 24
kafka_log_segment_bytes: 268435456
kafka_log_cleanup_policy: "delete"

# Operation Safety Configuration
kafka_auto_create_topics_enable: "false"
kafka_delete_topic_enable: "true"
kafka_controlled_shutdown_enable: "true"

# KRaft Mode Configuration (Kafka 3.x)
kafka_process_roles: "broker,controller"
kafka_controller_quorum_voters: ""  # Will be dynamically generated

# Metrics Configuration
kafka_jmx_enabled: false
kafka_exporter_enabled: false

# Java Configuration
java_home: "/opt/java"

# Systemd Configuration
kafka_service_name: "kafka"
kafka_service_restart_policy: "on-failure"
kafka_service_restart_sec: 10
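`kafka_controller_quorum_voters` is noted above as dynamically generated from the inventory. A hedged sketch of a matching `inventories/sit/hosts.yml` — hostnames, IPs, and the generating expression are illustrative, not taken from this repo:

```yaml
# inventories/sit/hosts.yml (hypothetical minimal shape)
all:
  children:
    kafka:
      hosts:
        kafka-sit-1: { ansible_host: 10.0.0.11, kafka_node_id: 1 }
        kafka-sit-2: { ansible_host: 10.0.0.12, kafka_node_id: 2 }
        kafka-sit-3: { ansible_host: 10.0.0.13, kafka_node_id: 3 }
      vars:
        # one voter per node: <node_id>@<host>:<controller_port>
        kafka_controller_quorum_voters: >-
          {% for h in groups['kafka'] -%}
          {{ hostvars[h].kafka_node_id }}@{{ hostvars[h].ansible_host }}:9093{{ "," if not loop.last }}
          {%- endfor %}
```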


--- /home/runner/kafka-ansible/roles/kafka/handlers/main.yml ---
---
# Handlers for Kafka role

- name: reload systemd
  ansible.builtin.systemd:
    daemon_reload: yes

- name: restart kafka
  ansible.builtin.systemd:
    name: kafka
    state: restarted
  listen: "restart kafka"

- name: stop kafka
  ansible.builtin.systemd:
    name: kafka
    state: stopped
  listen: "stop kafka"

- name: start kafka
  ansible.builtin.systemd:
    name: kafka
    state: started
  listen: "start kafka"
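These handlers (and the `lineinfile` edit in enable-jmx-exporter.yml) presuppose a `kafka.service` unit; its template falls outside this excerpt, so the sketch below is only an assumed shape consistent with the role defaults (EnvironmentFile path, user, restart policy):

```ini
# /etc/systemd/system/kafka.service (hypothetical sketch)
[Unit]
Description=Apache Kafka (KRaft mode)
After=network.target

[Service]
Type=simple
User=root
EnvironmentFile=/opt/kafka/config/kafka.env
Environment="KAFKA_JMX_OPTS=-Dcom.sun.management.jmxremote.port=9999"
# enable-jmx-exporter.yml inserts its KAFKA_OPTS javaagent line after KAFKA_JMX_OPTS
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/kraft/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```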

--- /home/runner/kafka-ansible/roles/kafka/meta/main.yml ---
---
galaxy_info:
  author: DevOps Team
  description: Ansible role for deploying Apache Kafka in KRaft mode
  company: LTP
  license: MIT
  min_ansible_version: "2.10"
  platforms:
    - name: Amazon
      versions:
        - "2023"
        - "2"
    - name: EL
      versions:
        - "8"
        - "9"
  galaxy_tags:
    - kafka
    - streaming
    - messaging
    - kraft

dependencies: []

--- /home/runner/kafka-ansible/roles/kafka/tasks/configure.yml ---
---
# Kafka configuration tasks

- name: Generate cluster ID (only on first node)
  ansible.builtin.shell: |
    {{ kafka_install_dir }}/bin/kafka-storage.sh random-uuid
  register: kafka_cluster_uuid
  run_once: true
  when: kafka_cluster_id is not defined or kafka_cluster_id == ""
  changed_when: false

- name: Set cluster ID fact
  ansible.builtin.set_fact:
    # default() only fires when the var is undefined, so an empty string
    # must fall through to the generated UUID explicitly
    kafka_final_cluster_id: "{{ kafka_cluster_id if (kafka_cluster_id is defined and kafka_cluster_id != '') else kafka_cluster_uuid.stdout }}"
  run_once: true

- name: Share cluster ID with all nodes
  ansible.builtin.set_fact:
    kafka_final_cluster_id: "{{ hostvars[groups['kafka'][0]]['kafka_final_cluster_id'] }}"

- name: Display cluster ID
  ansible.builtin.debug:
    msg: "Kafka Cluster ID: {{ kafka_final_cluster_id }}"

- name: Deploy Kafka KRaft configuration
  ansible.builtin.template:
    src: server.properties.j2
    dest: "{{ kafka_config_dir }}/kraft/server.properties"
    owner: "{{ kafka_user }}"
    group: "{{ kafka_group }}"
    mode: "{{ kafka_file_mode }}"
    backup: yes
  notify: restart kafka

- name: Check if storage is formatted
  ansible.builtin.stat:
    path: "{{ kafka_data_dir }}/meta.properties"
  register: kafka_storage_formatted

- name: Format Kafka storage (KRaft mode)
  ansible.builtin.shell: |
    {{ kafka_install_dir }}/bin/kafka-storage.sh format \
      -t {{ kafka_final_cluster_id }} \
      -c {{ kafka_config_dir }}/kraft/server.properties \
      --ignore-formatted
  when: not kafka_storage_formatted.stat.exists
  register: format_result
  changed_when: "'Formatting' in format_result.stdout"

- name: Deploy log4j configuration
  ansible.builtin.template:
    src: log4j.properties.j2
    dest: "{{ kafka_config_dir }}/log4j.properties"
    owner: "{{ kafka_user }}"
    group: "{{ kafka_group }}"
    mode: "{{ kafka_file_mode }}"
  notify: restart kafka
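`kafka-storage.sh random-uuid` emits a 22-character base64url id (16 random bytes, unpadded). An equivalent shell sketch, handy for pre-generating a `kafka_cluster_id` on a machine without a Kafka install — the format is an assumption based on Kafka's Uuid encoding, so treat it as a sketch:

```shell
# 16 random bytes -> base64url alphabet, padding stripped -> 22-char id
id=$(head -c16 /dev/urandom | base64 | tr '+/' '-_' | tr -d '=\n')
echo "$id"
echo "${#id}"   # 22
```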


--- /home/runner/kafka-ansible/roles/kafka/tasks/install.yml ---
---
# Kafka installation tasks

- name: Check if Kafka is already installed
  ansible.builtin.stat:
    path: "{{ kafka_install_dir }}/bin/kafka-server-start.sh"
  register: kafka_installed

- name: Check if /opt/kafka is a directory (not symlink)
  ansible.builtin.stat:
    path: "{{ kafka_install_dir }}"
  register: kafka_dir_check

- name: Remove /opt/kafka directory if it exists (will be replaced by symlink)
  ansible.builtin.file:
    path: "{{ kafka_install_dir }}"
    state: absent
  when:
    - kafka_dir_check.stat.exists
    - kafka_dir_check.stat.isdir
    - not kafka_dir_check.stat.islnk

- name: Download Kafka
  ansible.builtin.get_url:
    url: "{{ kafka_download_url }}"
    dest: "/tmp/kafka_{{ kafka_scala_version }}-{{ kafka_version }}.tgz"
    mode: "0644"
    checksum: "{{ kafka_checksum | default(omit) }}"
  when: not kafka_installed.stat.exists

- name: Extract Kafka
  ansible.builtin.unarchive:
    src: "/tmp/kafka_{{ kafka_scala_version }}-{{ kafka_version }}.tgz"
    dest: "/opt"
    remote_src: yes
    owner: "{{ kafka_user }}"
    group: "{{ kafka_group }}"
  when: not kafka_installed.stat.exists

- name: Create symbolic link to kafka
  ansible.builtin.file:
    src: "/opt/kafka_{{ kafka_scala_version }}-{{ kafka_version }}"
    dest: "{{ kafka_install_dir }}"
    state: link
    owner: "{{ kafka_user }}"
    group: "{{ kafka_group }}"
    force: yes

- name: Set ownership of Kafka installation
  ansible.builtin.file:
    path: "/opt/kafka_{{ kafka_scala_version }}-{{ kafka_version }}"
    owner: "{{ kafka_user }}"
    group: "{{ kafka_group }}"
    recurse: yes

- name: Clean up downloaded archive
  ansible.builtin.file:
    path: "/tmp/kafka_{{ kafka_scala_version }}-{{ kafka_version }}.tgz"
    state: absent

- name: Create Kafka environment file
  ansible.builtin.template:
    src: kafka.env.j2
    dest: "{{ kafka_config_dir }}/kafka.env"
    owner: "{{ kafka_user }}"
    group: "{{ kafka_group }}"
    mode: "{{ kafka_file_mode }}"
  notify: restart kafka


--- /home/runner/kafka-ansible/roles/kafka/tasks/main.yml ---
---
# Main tasks file for kafka role

- name: Include prerequisite tasks
  ansible.builtin.include_tasks: prerequisites.yml
  tags:
    - kafka
    - prerequisites

- name: Include installation tasks
  ansible.builtin.include_tasks: install.yml
  tags:
    - kafka
    - install

- name: Include configuration tasks
  ansible.builtin.include_tasks: configure.yml
  tags:
    - kafka
    - configure

- name: Include metrics tasks
  ansible.builtin.include_tasks: metrics.yml
  when: kafka_jmx_enabled or kafka_exporter_enabled
  tags:
    - kafka
    - metrics

- name: Include service tasks
  ansible.builtin.include_tasks: service.yml
  tags:
    - kafka
    - service

--- /home/runner/kafka-ansible/roles/kafka/tasks/metrics.yml ---
---
# Kafka metrics configuration tasks

- name: Create metrics directory
  ansible.builtin.file:
    path: "{{ kafka_install_dir }}/metrics"
    state: directory
    owner: "{{ kafka_user }}"
    group: "{{ kafka_group }}"
    mode: "{{ kafka_dir_mode }}"

- name: Deploy JMX exporter configuration
  ansible.builtin.template:
    src: jmx-exporter.yml.j2
    dest: "{{ kafka_install_dir }}/metrics/jmx-exporter.yml"
    owner: "{{ kafka_user }}"
    group: "{{ kafka_group }}"
    mode: "{{ kafka_file_mode }}"
  when: kafka_jmx_enabled
  notify: restart kafka

- name: Download JMX Prometheus agent (if needed)
  ansible.builtin.get_url:
    url: "https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/1.0.1/jmx_prometheus_javaagent-1.0.1.jar"
    dest: "{{ kafka_install_dir }}/libs/jmx_prometheus_javaagent.jar"
    owner: "{{ kafka_user }}"
    group: "{{ kafka_group }}"
    mode: "0644"
  when: kafka_jmx_enabled

- name: Create Prometheus metrics endpoint info
  ansible.builtin.debug:
    msg: |
      Kafka JMX metrics available at:
      - JMX Port: {{ kafka_jmx_port }}
      - Prometheus JMX Exporter: http://{{ ansible_host }}:{{ kafka_exporter_port }}/metrics

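Once the javaagent is attached, the exporter endpoint printed above serves Prometheus text-format metrics. A sketch of pulling one counter out of such output (the payload is inlined here so the example runs without a live broker; the metric name follows the rename rules in jmx-exporter.yml.j2):

```shell
# Parse a Prometheus text-format sample the way you might after
# `curl http://<broker>:<exporter_port>/metrics` (payload inlined
# so no live broker is needed). Comment lines start with '#'.
metrics='# HELP kafka_server_brokertopicmetrics_messagesinpersec_total ...
# TYPE kafka_server_brokertopicmetrics_messagesinpersec_total counter
kafka_server_brokertopicmetrics_messagesinpersec_total 1234'
value=$(printf '%s\n' "$metrics" \
  | awk '$1 == "kafka_server_brokertopicmetrics_messagesinpersec_total" {print $2}')
echo "$value"
```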
--- /home/runner/kafka-ansible/roles/kafka/tasks/prerequisites.yml ---
---
# Prerequisites for Kafka installation
# Note: System initialization (packages, users, sysctl) already done manually

- name: Verify Java installation
  ansible.builtin.command: java -version
  register: java_version_check
  changed_when: false

- name: Display Java version
  ansible.builtin.debug:
    msg: "{{ java_version_check.stderr_lines }}"

- name: Create Kafka directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
    owner: "{{ kafka_user }}"
    group: "{{ kafka_group }}"
    mode: "{{ kafka_dir_mode }}"
  loop: "{{ kafka_directories }}"

--- /home/runner/kafka-ansible/roles/kafka/tasks/service.yml ---
---
# Kafka service configuration tasks

- name: Deploy Kafka systemd service file
  ansible.builtin.template:
    src: kafka.service.j2
    dest: /etc/systemd/system/kafka.service
    owner: root
    group: root
    mode: "0644"
  notify:
    - reload systemd
    - restart kafka

- name: Reload systemd daemon
  ansible.builtin.systemd:
    daemon_reload: yes

- name: Enable Kafka service
  ansible.builtin.systemd:
    name: kafka
    enabled: yes

# Skip start/verify when kafka_skip_start is true (for cluster deployment)
- name: Start Kafka service
  ansible.builtin.systemd:
    name: kafka
    state: started
  register: kafka_service_start
  when: not (kafka_skip_start | default(false))

- name: Wait for Kafka to be ready
  ansible.builtin.wait_for:
    host: "{{ ansible_host }}"
    port: "{{ kafka_client_port }}"
    delay: 10
    timeout: 120
    state: started
  when:
    - not (kafka_skip_start | default(false))
    - kafka_service_start.changed | default(false)

- name: Verify Kafka is running
  ansible.builtin.command: "{{ kafka_install_dir }}/bin/kafka-broker-api-versions.sh --bootstrap-server {{ ansible_host }}:{{ kafka_client_port }}"
  register: kafka_verify
  changed_when: false
  retries: 5
  delay: 10
  until: kafka_verify.rc == 0
  ignore_errors: yes
  when: not (kafka_skip_start | default(false))

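The wait_for task above polls the client port with a 10-second delay and 120-second timeout; the same check can be sketched in plain bash using the bash-specific `/dev/tcp` device (host and port come from the inventory in practice):

```shell
# Poll a TCP port until it accepts connections or a deadline passes,
# mirroring the ansible.builtin.wait_for task. bash-only (/dev/tcp).
wait_for_port() {
  local host=$1 port=$2 timeout=${3:-120} delay=${4:-10}
  sleep "$delay"                                  # initial delay, as in the task
  local deadline=$(( $(date +%s) + timeout ))
  while ! (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
    (( $(date +%s) >= deadline )) && return 1     # timed out
    sleep 1
  done
  return 0
}
# e.g. wait_for_port "$broker_host" 9092
```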

--- /home/runner/kafka-ansible/roles/kafka/templates/jmx-exporter.yml.j2 ---
# {{ ansible_managed }}
# JMX Exporter Configuration for Kafka
# Prometheus metrics available at: http://{{ ansible_host }}:{{ kafka_exporter_port }}/metrics

lowercaseOutputName: true
lowercaseOutputLabelNames: true

rules:
  # Kafka server metrics
  - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
    name: kafka_server_$1_$2
    type: GAUGE
    labels:
      clientId: "$3"
      topic: "$4"
      partition: "$5"

  - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), brokerHost=(.+), brokerPort=(.+)><>Value
    name: kafka_server_$1_$2
    type: GAUGE
    labels:
      clientId: "$3"
      broker: "$4:$5"

  - pattern: kafka.server<type=(.+), name=(.+)><>Value
    name: kafka_server_$1_$2
    type: GAUGE

  - pattern: kafka.server<type=(.+), name=(.+)><>Count
    name: kafka_server_$1_$2_total
    type: COUNTER

  # Kafka network metrics
  - pattern: kafka.network<type=(.+), name=(.+), request=(.+), error=(.+)><>Count
    name: kafka_network_$1_$2_total
    type: COUNTER
    labels:
      request: "$3"
      error: "$4"

  - pattern: kafka.network<type=(.+), name=(.+), request=(.+)><>Count
    name: kafka_network_$1_$2_total
    type: COUNTER
    labels:
      request: "$3"

  - pattern: kafka.network<type=(.+), name=(.+)><>Value
    name: kafka_network_$1_$2
    type: GAUGE

  # Kafka log metrics
  - pattern: kafka.log<type=(.+), name=(.+), topic=(.+), partition=(.+)><>Value
    name: kafka_log_$1_$2
    type: GAUGE
    labels:
      topic: "$3"
      partition: "$4"

  # Kafka controller metrics
  - pattern: kafka.controller<type=(.+), name=(.+)><>Value
    name: kafka_controller_$1_$2
    type: GAUGE

  - pattern: kafka.controller<type=(.+), name=(.+)><>Count
    name: kafka_controller_$1_$2_total
    type: COUNTER

  # KRaft Raft metrics
  - pattern: kafka.raft<type=(.+), name=(.+)><>Value
    name: kafka_raft_$1_$2
    type: GAUGE

  - pattern: kafka.raft<type=(.+), name=(.+)><>Count
    name: kafka_raft_$1_$2_total
    type: COUNTER

  # JVM metrics
  - pattern: java.lang<type=Memory><HeapMemoryUsage>(\w+)
    name: jvm_heap_memory_$1_bytes
    type: GAUGE

  - pattern: java.lang<type=Memory><NonHeapMemoryUsage>(\w+)
    name: jvm_nonheap_memory_$1_bytes
    type: GAUGE

  - pattern: java.lang<type=GarbageCollector, name=(.+)><CollectionCount>
    name: jvm_gc_collection_count_total
    type: COUNTER
    labels:
      gc: "$1"

  - pattern: java.lang<type=GarbageCollector, name=(.+)><CollectionTime>
    name: jvm_gc_collection_time_ms_total
    type: COUNTER
    labels:
      gc: "$1"

  - pattern: java.lang<type=Threading><ThreadCount>
    name: jvm_thread_count
    type: GAUGE

  # Operating system metrics
  - pattern: java.lang<type=OperatingSystem><(\w+)>
    name: jvm_os_$1
    type: GAUGE

--- /home/runner/kafka-ansible/roles/kafka/templates/kafka.env.j2 ---
# {{ ansible_managed }}
# Kafka Environment Variables
# Environment: {{ environment_name | default('unknown') }}

# Java Home
JAVA_HOME=/opt/java

# Kafka Home
KAFKA_HOME={{ kafka_install_dir }}

# Kafka Heap Options
KAFKA_HEAP_OPTS="-Xmx{{ kafka_heap_size }} -Xms{{ kafka_min_heap_size }}"

# Kafka JVM Performance Options
KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -XX:+ParallelRefProcEnabled"

# JMX Options
KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port={{ kafka_jmx_port }} -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname={{ ansible_host }}"

# Log4j Configuration
KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:{{ kafka_config_dir }}/log4j.properties"

# Kafka Log Directory
LOG_DIR={{ kafka_log_dir }}

# Node ID
KAFKA_NODE_ID={{ kafka_node_id }}

# Cluster ID
KAFKA_CLUSTER_ID={{ kafka_final_cluster_id | default('') }}

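`kafka_heap_size` is normally set well below instance RAM so the remainder stays available to the OS page cache, which Kafka relies on heavily. A sketch of a common rule of thumb (roughly half of RAM, capped; the numbers are illustrative, not values from this repo):

```shell
# Rule-of-thumb heap sizing (illustrative): give Kafka about half of
# RAM, capped at 6G, and leave the rest for the OS page cache.
total_mb=8192                        # e.g. an 8G broker node
heap_mb=$(( total_mb / 2 ))
if (( heap_mb > 6144 )); then heap_mb=6144; fi
echo "KAFKA_HEAP_OPTS=\"-Xmx${heap_mb}m -Xms${heap_mb}m\""
```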
--- /home/runner/kafka-ansible/roles/kafka/templates/kafka.service.j2 ---
# {{ ansible_managed }}
# Kafka Systemd Service File
# Environment: {{ environment_name | default('unknown') }}

[Unit]
Description=Apache Kafka Server (KRaft Mode)
Documentation=https://kafka.apache.org/documentation/
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User={{ kafka_user }}
Group={{ kafka_group }}

# Environment variables
EnvironmentFile=-{{ kafka_config_dir }}/kafka.env

# JVM options
Environment="KAFKA_HEAP_OPTS=-Xmx{{ kafka_heap_size }} -Xms{{ kafka_min_heap_size }}"
Environment="KAFKA_JVM_PERFORMANCE_OPTS=-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -XX:+ParallelRefProcEnabled"
Environment="KAFKA_JMX_OPTS=-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port={{ kafka_jmx_port }} -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname={{ ansible_host }}"
{% if kafka_jmx_enabled %}
Environment="KAFKA_OPTS=-javaagent:{{ kafka_install_dir }}/libs/jmx_prometheus_javaagent.jar={{ kafka_exporter_port }}:{{ kafka_install_dir }}/metrics/jmx-exporter.yml"
{% endif %}

# Logging
Environment="KAFKA_LOG4J_OPTS=-Dlog4j.configuration=file:{{ kafka_config_dir }}/log4j.properties"
Environment="LOG_DIR={{ kafka_log_dir }}"

# Working directory
WorkingDirectory={{ kafka_install_dir }}

# Start command
ExecStart={{ kafka_install_dir }}/bin/kafka-server-start.sh {{ kafka_config_dir }}/kraft/server.properties

# Stop command
ExecStop={{ kafka_install_dir }}/bin/kafka-server-stop.sh

# Resource limits
LimitNOFILE=65536
LimitNPROC=65536

# Restart policy
Restart={{ kafka_service_restart_policy }}
RestartSec={{ kafka_service_restart_sec }}

# Timeouts
TimeoutStartSec=180
TimeoutStopSec=120

# Logging to journal
StandardOutput=journal
StandardError=journal
SyslogIdentifier=kafka

[Install]
WantedBy=multi-user.target

--- /home/runner/kafka-ansible/roles/kafka/templates/log4j.properties.j2 ---
# {{ ansible_managed }}
# Kafka Log4j Configuration

# Root logger option
log4j.rootLogger=INFO, stdout, kafkaAppender

# Console appender
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c)%n

# Kafka file appender
log4j.appender.kafkaAppender=org.apache.log4j.RollingFileAppender
log4j.appender.kafkaAppender.File={{ kafka_log_dir }}/server.log
log4j.appender.kafkaAppender.MaxFileSize=50MB
log4j.appender.kafkaAppender.MaxBackupIndex=3
log4j.appender.kafkaAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.kafkaAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

# State change logger
log4j.appender.stateChangeAppender=org.apache.log4j.RollingFileAppender
log4j.appender.stateChangeAppender.File={{ kafka_log_dir }}/state-change.log
log4j.appender.stateChangeAppender.MaxFileSize=20MB
log4j.appender.stateChangeAppender.MaxBackupIndex=2
log4j.appender.stateChangeAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.stateChangeAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

# Request logger
log4j.appender.requestAppender=org.apache.log4j.RollingFileAppender
log4j.appender.requestAppender.File={{ kafka_log_dir }}/kafka-request.log
log4j.appender.requestAppender.MaxFileSize=20MB
log4j.appender.requestAppender.MaxBackupIndex=2
log4j.appender.requestAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.requestAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

# Cleaner logger
log4j.appender.cleanerAppender=org.apache.log4j.RollingFileAppender
log4j.appender.cleanerAppender.File={{ kafka_log_dir }}/log-cleaner.log
log4j.appender.cleanerAppender.MaxFileSize=20MB
log4j.appender.cleanerAppender.MaxBackupIndex=2
log4j.appender.cleanerAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.cleanerAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

# Controller logger
log4j.appender.controllerAppender=org.apache.log4j.RollingFileAppender
log4j.appender.controllerAppender.File={{ kafka_log_dir }}/controller.log
log4j.appender.controllerAppender.MaxFileSize=50MB
log4j.appender.controllerAppender.MaxBackupIndex=3
log4j.appender.controllerAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.controllerAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

# Authorizer logger
log4j.appender.authorizerAppender=org.apache.log4j.RollingFileAppender
log4j.appender.authorizerAppender.File={{ kafka_log_dir }}/kafka-authorizer.log
log4j.appender.authorizerAppender.MaxFileSize=20MB
log4j.appender.authorizerAppender.MaxBackupIndex=2
log4j.appender.authorizerAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.authorizerAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

# Logger assignments
log4j.logger.kafka=INFO, kafkaAppender
log4j.logger.kafka.network.RequestChannel$=WARN, requestAppender
log4j.additivity.kafka.network.RequestChannel$=false
log4j.logger.kafka.request.logger=WARN, requestAppender
log4j.additivity.kafka.request.logger=false
log4j.logger.kafka.controller=TRACE, controllerAppender
log4j.additivity.kafka.controller=false
log4j.logger.kafka.log.LogCleaner=INFO, cleanerAppender
log4j.additivity.kafka.log.LogCleaner=false
log4j.logger.state.change.logger=INFO, stateChangeAppender
log4j.additivity.state.change.logger=false
log4j.logger.kafka.authorizer.logger=INFO, authorizerAppender
log4j.additivity.kafka.authorizer.logger=false

# Reduce noisy loggers
log4j.logger.org.apache.zookeeper=WARN
log4j.logger.org.apache.kafka=INFO

--- /home/runner/kafka-ansible/roles/kafka/templates/server.properties.j2 ---
# {{ ansible_managed }}
# Kafka KRaft Mode Configuration
# Environment: {{ environment_name | default('unknown') }}
# Generated on: {{ ansible_date_time.iso8601 }}

############################# Server Basics #############################

# The role of this server. Setting this puts us in KRaft mode
process.roles={{ kafka_process_roles }}

# The node id associated with this instance's roles
node.id={{ kafka_node_id }}

# The connect string for the controller quorum
controller.quorum.voters={{ kafka_quorum_voters_list | trim }}

############################# Socket Server Settings #############################

# The address the socket server listens on
listeners={{ kafka_listeners }}

# Listener name, hostname and port the broker will advertise to clients
advertised.listeners={{ kafka_advertised_listeners }}

# Maps listener names to security protocols
listener.security.protocol.map={{ kafka_listener_security_protocol_map }}

# Name of listener used for communication between brokers
inter.broker.listener.name={{ kafka_inter_broker_listener_name }}

# Name of controller listener
controller.listener.names={{ kafka_controller_listener_names }}

# Network threads
num.network.threads={{ kafka_num_network_threads }}

# IO threads
num.io.threads={{ kafka_num_io_threads }}

# Send buffer
socket.send.buffer.bytes={{ kafka_socket_send_buffer_bytes }}

# Receive buffer
socket.receive.buffer.bytes={{ kafka_socket_receive_buffer_bytes }}

# Maximum request size
socket.request.max.bytes={{ kafka_socket_request_max_bytes }}

# Maximum queued requests
queued.max.requests={{ kafka_queued_max_requests }}

############################# Log Basics #############################

# Log directories
log.dirs={{ kafka_data_dir }}

# Default number of partitions
num.partitions={{ kafka_num_partitions }}

# Number of threads for log recovery
num.recovery.threads.per.data.dir={{ kafka_num_recovery_threads_per_data_dir }}

############################# Replication Configuration #############################

# Default replication factor
default.replication.factor={{ kafka_default_replication_factor }}

# Minimum ISR
min.insync.replicas={{ kafka_min_insync_replicas }}

# Offsets topic replication factor
offsets.topic.replication.factor={{ kafka_offsets_topic_replication_factor }}

# Transaction state log replication factor
transaction.state.log.replication.factor={{ kafka_transaction_state_log_replication_factor }}

# Transaction state log min ISR
transaction.state.log.min.isr={{ kafka_transaction_state_log_min_isr }}

# Unclean leader election
unclean.leader.election.enable={{ kafka_unclean_leader_election_enable }}

# Replica lag time max
replica.lag.time.max.ms={{ kafka_replica_lag_time_max_ms }}

# Replica fetch max bytes
replica.fetch.max.bytes={{ kafka_replica_fetch_max_bytes }}

############################# Log Retention Policy #############################

# Log retention hours
log.retention.hours={{ kafka_log_retention_hours }}

# Log segment size
log.segment.bytes={{ kafka_log_segment_bytes }}

# Log retention check interval
log.retention.check.interval.ms={{ kafka_log_retention_check_interval_ms }}

# Log cleanup policy
log.cleanup.policy={{ kafka_log_cleanup_policy }}

############################# Group Coordinator Settings #############################

# Auto leader rebalance
auto.leader.rebalance.enable={{ kafka_auto_leader_rebalance_enable }}

# Group initial rebalance delay
group.initial.rebalance.delay.ms={{ kafka_group_initial_rebalance_delay_ms }}

############################# Compression #############################

# Compression type
compression.type={{ kafka_compression_type }}

############################# Topic Settings #############################

# Auto create topics
auto.create.topics.enable={{ kafka_auto_create_topics_enable }}

# Delete topic enable
delete.topic.enable={{ kafka_delete_topic_enable }}

############################# Shutdown Settings #############################

# Controlled shutdown
controlled.shutdown.enable={{ kafka_controlled_shutdown_enable }}

--- /home/runner/kafka-ansible/roles/kafka/vars/main.yml ---
---
# Internal variables - Do not modify unless necessary

# Generate controller quorum voters string dynamically
# Format: node_id@host:controller_port
kafka_quorum_voters_list: >-
  {% set voters = [] %}
  {% for host in groups['kafka'] %}
  {% set node_id = hostvars[host]['kafka_node_id'] %}
  {% set host_ip = hostvars[host]['ansible_host'] %}
  {% set _ = voters.append(node_id | string + '@' + host_ip + ':' + kafka_controller_port | string) %}
  {% endfor %}
  {{ voters | join(',') }}

# Required packages for Kafka
kafka_required_packages:
  - "{{ java_package }}"
  - tar
  - gzip
  - wget
  - curl
  - net-tools
  - nc
  - jq

# Directories to create (kafka_install_dir excluded - it's a symlink)
kafka_directories:
  - "{{ kafka_data_dir }}"
  - "{{ kafka_log_dir }}"
  - "/data/kafka"

# File permissions
kafka_dir_mode: "0755"
kafka_file_mode: "0644"
kafka_script_mode: "0755"

--- /home/runner/kafka-ansible/scripts/kafka-ctl.sh ---
#!/bin/bash
#
# Kafka Cluster Control Script
# Usage: ./kafka-ctl.sh <environment> <action> [options]
#
# Examples:
#   ./kafka-ctl.sh sit deploy
#   ./kafka-ctl.sh sit status
#   ./kafka-ctl.sh sit restart
#   ./kafka-ctl.sh uat deploy

set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

# Default values
ENV=""
ACTION=""
LIMIT=""
EXTRA_VARS=""

# Function to print usage
usage() {
    echo "Usage: $0 <environment> <action> [options]"
    echo ""
    echo "Environments:"
    echo "  sit     - SIT environment"
    echo "  uat     - UAT environment"
    echo ""
    echo "Actions:"
    echo "  deploy  - Deploy Kafka cluster"
    echo "  start   - Start Kafka services"
    echo "  stop    - Stop Kafka services"
    echo "  restart - Rolling restart Kafka services"
    echo "  status  - Check cluster status"
    echo "  upgrade - Upgrade Kafka version"
    echo "  health  - Run health check"
    echo "  topics  - List all topics"
    echo ""
    echo "Options:"
    echo "  -l, --limit <host>    Limit to specific host(s)"
    echo "  -e, --extra <vars>    Extra variables (key=value)"
    echo "  -v, --verbose         Enable verbose output"
    echo "  -h, --help            Show this help message"
    echo ""
    echo "Examples:"
    echo "  $0 sit deploy"
    echo "  $0 sit restart -l kafka-sit-1"
    echo "  $0 uat upgrade -e kafka_version=3.9.1"
    exit 1
}

# Function to check prerequisites
check_prerequisites() {
    if ! command -v ansible-playbook &> /dev/null; then
        echo -e "${RED}Error: ansible-playbook not found. Please install Ansible.${NC}"
        exit 1
    fi
}

# Function to run ansible playbook
run_playbook() {
    local playbook=$1
    local inventory="${PROJECT_DIR}/inventories/${ENV}/hosts.yml"

    if [[ ! -f "$inventory" ]]; then
        echo -e "${RED}Error: Inventory file not found: ${inventory}${NC}"
        exit 1
    fi

    local cmd="ansible-playbook -i ${inventory} ${PROJECT_DIR}/playbooks/${playbook}.yml"

    if [[ -n "$LIMIT" ]]; then
        cmd="${cmd} --limit ${LIMIT}"
    fi

    if [[ -n "$EXTRA_VARS" ]]; then
        cmd="${cmd} -e ${EXTRA_VARS}"
    fi

    if [[ "$VERBOSE" == "true" ]]; then
        cmd="${cmd} -v"
    fi

    echo -e "${GREEN}Running: ${cmd}${NC}"
    echo ""
    eval "$cmd"
}

# Parse arguments
if [[ $# -lt 2 ]]; then
    usage
fi

ENV=$1
ACTION=$2
shift 2

# Validate environment
if [[ "$ENV" != "sit" && "$ENV" != "uat" ]]; then
    echo -e "${RED}Error: Invalid environment '${ENV}'. Must be 'sit' or 'uat'.${NC}"
    usage
fi

# Parse options
while [[ $# -gt 0 ]]; do
    case $1 in
        -l|--limit)
            LIMIT="$2"
            shift 2
            ;;
        -e|--extra)
            EXTRA_VARS="$2"
            shift 2
            ;;
        -v|--verbose)
            VERBOSE="true"
            shift
            ;;
        -h|--help)
            usage
            ;;
        *)
            echo -e "${RED}Unknown option: $1${NC}"
            usage
            ;;
    esac
done

# Check prerequisites
check_prerequisites

# Run action
echo -e "${YELLOW}========================================${NC}"
echo -e "${YELLOW}Kafka Cluster Control - ${ENV^^} Environment${NC}"
echo -e "${YELLOW}Action: ${ACTION}${NC}"
echo -e "${YELLOW}========================================${NC}"
echo ""

case $ACTION in
    deploy)
        run_playbook "deploy"
        ;;
    start)
        run_playbook "start"
        ;;
    stop)
        run_playbook "stop"
        ;;
    restart)
        run_playbook "restart"
        ;;
    status)
        run_playbook "status"
        ;;
    upgrade)
        run_playbook "upgrade"
        ;;
    health)
        EXTRA_VARS="task=check_health"
        run_playbook "maintenance"
        ;;
    topics)
        EXTRA_VARS="action=list"
        run_playbook "topic-management"
        ;;
    *)
        echo -e "${RED}Error: Unknown action '${ACTION}'${NC}"
        usage
        ;;
esac

echo ""
echo -e "${GREEN}Done!${NC}"

--- kafka-exporter.yaml ---
---
# Kafka Monitoring for SIT Environment
# JMX Exporter 端口: 5556
# Kafka Exporter 端口: 9308

# JMX Exporter Config for Broker 0
apiVersion: v1
kind: ConfigMap
metadata:
  name: jmx-exporter-config-broker-0
  namespace: kafka-sit
data:
  config.yaml: |
    hostPort: 10.17.9.79:9999
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    rules:
      - pattern: ".*"

---
# JMX Exporter Config for Broker 1
apiVersion: v1
kind: ConfigMap
metadata:
  name: jmx-exporter-config-broker-1
  namespace: kafka-sit
data:
  config.yaml: |
    hostPort: 10.17.9.57:9999
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    rules:
      - pattern: ".*"

---
# JMX Exporter Config for Broker 2
apiVersion: v1
kind: ConfigMap
metadata:
  name: jmx-exporter-config-broker-2
  namespace: kafka-sit
data:
  config.yaml: |
    hostPort: 10.17.12.159:9999
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    rules:
      - pattern: ".*"

---
# JMX Exporter for Broker 0
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jmx-exporter-broker-0
  namespace: kafka-sit
  labels:
    app: jmx-exporter
    broker: broker-0
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jmx-exporter
      broker: broker-0
  template:
    metadata:
      labels:
        app: jmx-exporter
        broker: broker-0
        env: sit
    spec:
      initContainers:
        - name: download-jmx-exporter
          image: curlimages/curl:latest
          command:
            - sh
            - -c
            - |
              curl -sSL -o /jmx-exporter/jmx_prometheus_httpserver.jar \
                https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_httpserver/0.20.0/jmx_prometheus_httpserver-0.20.0.jar
          volumeMounts:
            - name: jmx-exporter-jar
              mountPath: /jmx-exporter
      containers:
        - name: jmx-exporter
          image: public.ecr.aws/amazoncorretto/amazoncorretto:11
          command:
            - java
            - -jar
            - /jmx-exporter/jmx_prometheus_httpserver.jar
            - "5556"
            - /config/config.yaml
          ports:
            - containerPort: 5556
              name: http-metrics
          volumeMounts:
            - name: config
              mountPath: /config
            - name: jmx-exporter-jar
              mountPath: /jmx-exporter
          resources:
            requests:
              cpu: 50m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /metrics
              port: 5556
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 10
          livenessProbe:
            httpGet:
              path: /metrics
              port: 5556
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
      volumes:
        - name: config
          configMap:
            name: jmx-exporter-config-broker-0
        - name: jmx-exporter-jar
          emptyDir: {}

---
# JMX Exporter for Broker 1
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jmx-exporter-broker-1
  namespace: kafka-sit
  labels:
    app: jmx-exporter
    broker: broker-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jmx-exporter
      broker: broker-1
  template:
    metadata:
      labels:
        app: jmx-exporter
        broker: broker-1
        env: sit
    spec:
      initContainers:
        - name: download-jmx-exporter
          image: curlimages/curl:latest
          command:
            - sh
            - -c
            - |
              curl -sSL -o /jmx-exporter/jmx_prometheus_httpserver.jar \
                https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_httpserver/0.20.0/jmx_prometheus_httpserver-0.20.0.jar
          volumeMounts:
            - name: jmx-exporter-jar
              mountPath: /jmx-exporter
      containers:
        - name: jmx-exporter
          image: public.ecr.aws/amazoncorretto/amazoncorretto:11
          command:
            - java
            - -jar
            - /jmx-exporter/jmx_prometheus_httpserver.jar
            - "5556"
            - /config/config.yaml
          ports:
            - containerPort: 5556
              name: http-metrics
          volumeMounts:
            - name: config
              mountPath: /config
            - name: jmx-exporter-jar
              mountPath: /jmx-exporter
          resources:
            requests:
              cpu: 50m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /metrics
              port: 5556
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 10
          livenessProbe:
            httpGet:
              path: /metrics
              port: 5556
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
      volumes:
        - name: config
          configMap:
            name: jmx-exporter-config-broker-1
        - name: jmx-exporter-jar
          emptyDir: {}

---
# JMX Exporter for Broker 2
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jmx-exporter-broker-2
  namespace: kafka-sit
  labels:
    app: jmx-exporter
    broker: broker-2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jmx-exporter
      broker: broker-2
  template:
    metadata:
      labels:
        app: jmx-exporter
        broker: broker-2
        env: sit
    spec:
      initContainers:
        - name: download-jmx-exporter
          image: curlimages/curl:latest
          command:
            - sh
            - -c
            - |
              curl -sSL -o /jmx-exporter/jmx_prometheus_httpserver.jar \
                https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_httpserver/0.20.0/jmx_prometheus_httpserver-0.20.0.jar
          volumeMounts:
            - name: jmx-exporter-jar
              mountPath: /jmx-exporter
      containers:
        - name: jmx-exporter
          image: public.ecr.aws/amazoncorretto/amazoncorretto:11
          command:
            - java
            - -jar
            - /jmx-exporter/jmx_prometheus_httpserver.jar
            - "5556"
            - /config/config.yaml
          ports:
            - containerPort: 5556
              name: http-metrics
          volumeMounts:
            - name: config
              mountPath: /config
            - name: jmx-exporter-jar
              mountPath: /jmx-exporter
          resources:
            requests:
              cpu: 50m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /metrics
              port: 5556
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 10
          livenessProbe:
            httpGet:
              path: /metrics
              port: 5556
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
      volumes:
        - name: config
          configMap:
            name: jmx-exporter-config-broker-2
        - name: jmx-exporter-jar
          emptyDir: {}

---
# Kafka Exporter
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-exporter
  namespace: kafka-sit
  labels:
    app: kafka-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kafka-exporter
  template:
    metadata:
      labels:
        app: kafka-exporter
        env: sit
    spec:
      containers:
        - name: kafka-exporter
          image: danielqsj/kafka-exporter:latest
          ports:
            - containerPort: 9308
              name: http-metrics
          args:
            - --kafka.server=10.17.9.79:9092
            - --kafka.server=10.17.9.57:9092
            - --kafka.server=10.17.12.159:9092
            - --web.listen-address=:9308
            - --web.telemetry-path=/metrics
            - --log.level=info
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 256Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 9308
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /healthz
              port: 9308
            initialDelaySeconds: 5
            periodSeconds: 5

---
# Service for JMX Exporter Broker 0
apiVersion: v1
kind: Service
metadata:
  name: kafka-cluster-jmx-metrics-0
  namespace: kafka-sit
  labels:
    app: jmx-exporter
    broker: broker-0
    job-name: kafka-cluster-jmx-metrics
spec:
  type: ClusterIP
  ports:
    - name: http-metrics
      port: 5556
      targetPort: 5556
  selector:
    app: jmx-exporter
    broker: broker-0

---
# Service for JMX Exporter Broker 1
apiVersion: v1
kind: Service
metadata:
  name: kafka-cluster-jmx-metrics-1
  namespace: kafka-sit
  labels:
    app: jmx-exporter
    broker: broker-1
    job-name: kafka-cluster-jmx-metrics
spec:
  type: ClusterIP
  ports:
    - name: http-metrics
      port: 5556
      targetPort: 5556
  selector:
    app: jmx-exporter
    broker: broker-1

---
# Service for JMX Exporter Broker 2
apiVersion: v1
kind: Service
metadata:
  name: kafka-cluster-jmx-metrics-2
  namespace: kafka-sit
  labels:
    app: jmx-exporter
    broker: broker-2
    job-name: kafka-cluster-jmx-metrics
spec:
  type: ClusterIP
  ports:
    - name: http-metrics
      port: 5556
      targetPort: 5556
  selector:
    app: jmx-exporter
    broker: broker-2

---
# Service for Kafka Exporter
apiVersion: v1
kind: Service
metadata:
  name: kafka-exporter
  namespace: kafka-sit
  labels:
    app: kafka-exporter
spec:
  type: ClusterIP
  ports:
    - name: http-metrics
      port: 9308
      targetPort: 9308
  selector:
    app: kafka-exporter

---
# ServiceMonitor for JMX Exporter
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-cluster-jmx-metrics
  namespace: kafka-sit
  labels:
    app: jmx-exporter
    release: kube-prom-stack
spec:
  jobLabel: job-name
  selector:
    matchLabels:
      app: jmx-exporter
  namespaceSelector:
    matchNames:
      - kafka-sit
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_broker]
          targetLabel: broker
        - sourceLabels: [__meta_kubernetes_pod_label_env]
          targetLabel: env

---
# ServiceMonitor for Kafka Exporter
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-exporter
  namespace: kafka-sit
  labels:
    app: kafka-exporter
    release: kube-prom-stack
spec:
  selector:
    matchLabels:
      app: kafka-exporter
  namespaceSelector:
    matchNames:
      - kafka-sit
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_env]
          targetLabel: env
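
The `relabelings` lists in the two ServiceMonitors above copy Kubernetes pod labels onto every scraped series. A minimal Python sketch of that mapping (illustrative only; real Prometheus relabeling also supports regex matching, `action` types, and multiple source labels):

```python
from typing import Dict, List

# Sketch of what the relabelings above do: copy selected service-discovery
# labels (e.g. the pod's "broker" and "env" labels) onto the target's labels.
def apply_relabelings(discovered: Dict[str, str], relabelings: List[dict]) -> Dict[str, str]:
    out = {}
    for rule in relabelings:
        src = rule["sourceLabels"][0]  # both rules above use a single source label
        if src in discovered:
            out[rule["targetLabel"]] = discovered[src]
    return out

relabelings = [
    {"sourceLabels": ["__meta_kubernetes_pod_label_broker"], "targetLabel": "broker"},
    {"sourceLabels": ["__meta_kubernetes_pod_label_env"], "targetLabel": "env"},
]
print(apply_relabelings(
    {"__meta_kubernetes_pod_label_broker": "broker-0",
     "__meta_kubernetes_pod_label_env": "sit"},
    relabelings))
# → {'broker': 'broker-0', 'env': 'sit'}
```

This is why the Grafana dashboards can filter per broker and per environment without the exporters emitting those labels themselves.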

[root@ip-10-18-75-168 kafka-sit]# cat jmx-kafka-exporter-configmap.yaml
---
# JMX Exporter Config for Broker 0
apiVersion: v1
kind: ConfigMap
metadata:
  name: jmx-exporter-config-broker-0
  namespace: kafka-sit
data:
  config.yaml: |
    hostPort: 10.17.9.79:9999
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    ssl: false
    whitelistObjectNames: ["kafka.controller:*","kafka.server:*","java.lang:*","kafka.network:*","kafka.log:*"]
    rules:
      - pattern: kafka.controller<type=(ControllerChannelManager), name=(QueueSize), broker-id=(\d+)><>(Value)
        name: kafka_controller_$1_$2_$4
        labels:
          broker_id: "$3"
      - pattern: kafka.controller<type=(ControllerChannelManager), name=(TotalQueueSize)><>(Value)
        name: kafka_controller_$1_$2_$3
      - pattern: kafka.controller<type=(KafkaController), name=(.+)><>(Value)
        name: kafka_controller_$1_$2_$3
      - pattern: kafka.controller<type=(ControllerStats), name=(.+)><>(Count)
        name: kafka_controller_$1_$2_$3
      - pattern : kafka.network<type=(Processor), name=(IdlePercent), networkProcessor=(.+)><>(Value)
        name: kafka_network_$1_$2_$4
        labels:
          network_processor: $3
      - pattern : kafka.network<type=(RequestMetrics), name=(.+), request=(.+)><>(Count|Value)
        name: kafka_network_$1_$2_$4
        labels:
          request: $3
      - pattern : kafka.network<type=(SocketServer), name=(.+)><>(Count|Value)
        name: kafka_network_$1_$2_$3
      - pattern : kafka.network<type=(RequestChannel), name=(.+)><>(Count|Value)
        name: kafka_network_$1_$2_$3
      - pattern: kafka.server<type=(.+), name=(.+), topic=(.+)><>(Count|OneMinuteRate)
        name: kafka_server_$1_$2_$4
        labels:
          topic: $3
      - pattern: kafka.server<type=(ReplicaFetcherManager), name=(.+), clientId=(.+)><>(Value)
        name: kafka_server_$1_$2_$4
        labels:
          client_id: "$3"
      - pattern: kafka.server<type=(DelayedOperationPurgatory), name=(.+), delayedOperation=(.+)><>(Value)
        name: kafka_server_$1_$2_$3_$4
      - pattern: kafka.server<type=(.+), name=(.+)><>(Count|Value|OneMinuteRate)
        name: kafka_server_$1_total_$2_$3
      - pattern: kafka.server<type=(.+)><>(queue-size)
        name: kafka_server_$1_$2
      - pattern: java.lang<type=(.+), name=(.+)><(.+)>(\w+)
        name: java_lang_$1_$4_$3_$2
      - pattern: java.lang<type=(.+), name=(.+)><>(\w+)
        name: java_lang_$1_$3_$2
      - pattern : java.lang<type=(.*)>
      - pattern: kafka.log<type=(.+), name=(.+), topic=(.+), partition=(.+)><>Value
        name: kafka_log_$1_$2
        labels:
          topic: $3
          partition: $4
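
Each rule above maps a JMX MBean/attribute name to a Prometheus metric name through regex capture groups (`$1`…`$4`). As an illustration only (this is not the JMX exporter's actual implementation), here is how the `kafka.server` per-topic rule rewrites a bean string:

```python
import re

# Illustrative re-implementation of ONE rule from the config above:
#   pattern: kafka.server<type=(.+), name=(.+), topic=(.+)><>(Count|OneMinuteRate)
#   name:    kafka_server_$1_$2_$4   (label topic=$3)
# lowercaseOutputName: true then lowercases the resulting metric name.
RULE = re.compile(r"kafka\.server<type=(.+), name=(.+), topic=(.+)><>(Count|OneMinuteRate)")

def rewrite(mbean_attr):
    m = RULE.fullmatch(mbean_attr)
    if m is None:
        return None  # rule does not apply; the exporter would try the next rule
    name = "kafka_server_{}_{}_{}".format(m.group(1), m.group(2), m.group(4)).lower()
    return name, {"topic": m.group(3)}

print(rewrite("kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec, topic=orders><>Count"))
# → ('kafka_server_brokertopicmetrics_messagesinpersec_count', {'topic': 'orders'})
```

The same mechanism explains the other rules: group positions in `name:` decide the metric name, and `labels:` turns the remaining groups into Prometheus labels.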

---
# JMX Exporter Config for Broker 1
apiVersion: v1
kind: ConfigMap
metadata:
  name: jmx-exporter-config-broker-1
  namespace: kafka-sit
data:
  config.yaml: |
    hostPort: 10.17.9.57:9999
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    ssl: false
    whitelistObjectNames: ["kafka.controller:*","kafka.server:*","java.lang:*","kafka.network:*","kafka.log:*"]
    rules:
      - pattern: kafka.controller<type=(ControllerChannelManager), name=(QueueSize), broker-id=(\d+)><>(Value)
        name: kafka_controller_$1_$2_$4
        labels:
          broker_id: "$3"
      - pattern: kafka.controller<type=(ControllerChannelManager), name=(TotalQueueSize)><>(Value)
        name: kafka_controller_$1_$2_$3
      - pattern: kafka.controller<type=(KafkaController), name=(.+)><>(Value)
        name: kafka_controller_$1_$2_$3
      - pattern: kafka.controller<type=(ControllerStats), name=(.+)><>(Count)
        name: kafka_controller_$1_$2_$3
      - pattern : kafka.network<type=(Processor), name=(IdlePercent), networkProcessor=(.+)><>(Value)
        name: kafka_network_$1_$2_$4
        labels:
          network_processor: $3
      - pattern : kafka.network<type=(RequestMetrics), name=(.+), request=(.+)><>(Count|Value)
        name: kafka_network_$1_$2_$4
        labels:
          request: $3
      - pattern : kafka.network<type=(SocketServer), name=(.+)><>(Count|Value)
        name: kafka_network_$1_$2_$3
      - pattern : kafka.network<type=(RequestChannel), name=(.+)><>(Count|Value)
        name: kafka_network_$1_$2_$3
      - pattern: kafka.server<type=(.+), name=(.+), topic=(.+)><>(Count|OneMinuteRate)
        name: kafka_server_$1_$2_$4
        labels:
          topic: $3
      - pattern: kafka.server<type=(ReplicaFetcherManager), name=(.+), clientId=(.+)><>(Value)
        name: kafka_server_$1_$2_$4
        labels:
          client_id: "$3"
      - pattern: kafka.server<type=(DelayedOperationPurgatory), name=(.+), delayedOperation=(.+)><>(Value)
        name: kafka_server_$1_$2_$3_$4
      - pattern: kafka.server<type=(.+), name=(.+)><>(Count|Value|OneMinuteRate)
        name: kafka_server_$1_total_$2_$3
      - pattern: kafka.server<type=(.+)><>(queue-size)
        name: kafka_server_$1_$2
      - pattern: java.lang<type=(.+), name=(.+)><(.+)>(\w+)
        name: java_lang_$1_$4_$3_$2
      - pattern: java.lang<type=(.+), name=(.+)><>(\w+)
        name: java_lang_$1_$3_$2
      - pattern : java.lang<type=(.*)>
      - pattern: kafka.log<type=(.+), name=(.+), topic=(.+), partition=(.+)><>Value
        name: kafka_log_$1_$2
        labels:
          topic: $3
          partition: $4

---
# JMX Exporter Config for Broker 2
apiVersion: v1
kind: ConfigMap
metadata:
  name: jmx-exporter-config-broker-2
  namespace: kafka-sit
data:
  config.yaml: |
    hostPort: 10.17.12.159:9999
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    ssl: false
    whitelistObjectNames: ["kafka.controller:*","kafka.server:*","java.lang:*","kafka.network:*","kafka.log:*"]
    rules:
      - pattern: kafka.controller<type=(ControllerChannelManager), name=(QueueSize), broker-id=(\d+)><>(Value)
        name: kafka_controller_$1_$2_$4
        labels:
          broker_id: "$3"
      - pattern: kafka.controller<type=(ControllerChannelManager), name=(TotalQueueSize)><>(Value)
        name: kafka_controller_$1_$2_$3
      - pattern: kafka.controller<type=(KafkaController), name=(.+)><>(Value)
        name: kafka_controller_$1_$2_$3
      - pattern: kafka.controller<type=(ControllerStats), name=(.+)><>(Count)
        name: kafka_controller_$1_$2_$3
      - pattern : kafka.network<type=(Processor), name=(IdlePercent), networkProcessor=(.+)><>(Value)
        name: kafka_network_$1_$2_$4
        labels:
          network_processor: $3
      - pattern : kafka.network<type=(RequestMetrics), name=(.+), request=(.+)><>(Count|Value)
        name: kafka_network_$1_$2_$4
        labels:
          request: $3
      - pattern : kafka.network<type=(SocketServer), name=(.+)><>(Count|Value)
        name: kafka_network_$1_$2_$3
      - pattern : kafka.network<type=(RequestChannel), name=(.+)><>(Count|Value)
        name: kafka_network_$1_$2_$3
      - pattern: kafka.server<type=(.+), name=(.+), topic=(.+)><>(Count|OneMinuteRate)
        name: kafka_server_$1_$2_$4
        labels:
          topic: $3
      - pattern: kafka.server<type=(ReplicaFetcherManager), name=(.+), clientId=(.+)><>(Value)
        name: kafka_server_$1_$2_$4
        labels:
          client_id: "$3"
      - pattern: kafka.server<type=(DelayedOperationPurgatory), name=(.+), delayedOperation=(.+)><>(Value)
        name: kafka_server_$1_$2_$3_$4
      - pattern: kafka.server<type=(.+), name=(.+)><>(Count|Value|OneMinuteRate)
        name: kafka_server_$1_total_$2_$3
      - pattern: kafka.server<type=(.+)><>(queue-size)
        name: kafka_server_$1_$2
      - pattern: java.lang<type=(.+), name=(.+)><(.+)>(\w+)
        name: java_lang_$1_$4_$3_$2
      - pattern: java.lang<type=(.+), name=(.+)><>(\w+)
        name: java_lang_$1_$3_$2
      - pattern : java.lang<type=(.*)>
      - pattern: kafka.log<type=(.+), name=(.+), topic=(.+), partition=(.+)><>Value
        name: kafka_log_$1_$2
        labels:
          topic: $3
          partition: $4

Integrating the EC2 deployment into the K8s monitoring stack.

Helm deployment:

 tree 
.
├── kafka-cluster-values.yaml
├── kafka-exporter.yaml
├── kafka-gp3-sc.yaml
└── start.sh

# kafka-cluster-values.yaml
global:
  storageClass: "kafka-gp3"
  security:
    allowInsecureImages: true

image:
  registry: 292309088324.dkr.ecr.ap-northeast-1.amazonaws.com
  repository: sretools
  tag: kafka-3.9.0
  pullPolicy: Always

controller:
  replicaCount: 3
  automountServiceAccountToken: true

  # Pin the deployment to the ap-northeast-1a availability zone
  nodeSelector:
    topology.kubernetes.io/zone: ap-northeast-1a

  # 2C8G resource allocation
  resources:
    requests:
      cpu: "200m"
      memory: "6Gi"
    limits:
      cpu: "2"
      memory: "8Gi"

  persistence:
    enabled: true
    size: 300Gi
    storageClass: "kafka-gp3"

  # JVM heap settings (with 8G of memory a 4G heap is recommended; stay under 70% of physical memory)
  heapOpts: "-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -XX:+ParallelRefProcEnabled"

broker:
  replicaCount: 0
  automountServiceAccountToken: true

listeners:
  client:
    protocol: PLAINTEXT
  interbroker:
    protocol: PLAINTEXT
  controller:
    protocol: PLAINTEXT
  external:
    protocol: PLAINTEXT

externalAccess:
  enabled: true
  autoDiscovery:
    enabled: true
    image:
      registry: 292309088324.dkr.ecr.ap-northeast-1.amazonaws.com
      repository: sretools
      tag: kubectl
  controller:
    service:
      type: LoadBalancer
      ports:
        external: 9092
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: "external"
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
        service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"

serviceAccount:
  create: true

service:
  type: ClusterIP
  ports:
    client: 9092
    controller: 9093

# ==================== AWS best-practice settings (tuned for 2C8G) ====================
extraConfig: |
  # ========== Network and connection settings ==========
  num.network.threads=2
  num.io.threads=8
  socket.send.buffer.bytes=-1
  socket.receive.buffer.bytes=-1
  socket.request.max.bytes=157286400
  queued.max.requests=1000

  # ========== Availability and replication settings ==========
  unclean.leader.election.enable=false
  replica.lag.time.max.ms=45000
  # With 8G of memory the replica fetch size can be raised moderately
  replica.fetch.max.bytes=16777216

  # ========== Topic defaults ==========
  num.partitions=3
  default.replication.factor=2
  min.insync.replicas=2
  offsets.topic.replication.factor=3
  transaction.state.log.replication.factor=3
  transaction.state.log.min.isr=2

  # ========== Performance and resource settings ==========
  num.recovery.threads.per.data.dir=1
  log.retention.check.interval.ms=1800000

  # ========== Topic and partition settings ==========
  auto.leader.rebalance.enable=false
  group.initial.rebalance.delay.ms=5000
  compression.type=producer

  # ========== Log retention settings ==========
  log.retention.hours=72
  log.segment.bytes=268435456
  log.cleanup.policy=delete

  # ========== Operational safety settings ==========
  auto.create.topics.enable=false
  delete.topic.enable=true
  controlled.shutdown.enable=true

sasl:
  client:
    users: []

rbac:
  create: true

metrics:
  jmx:
    enabled: true
    kafkaJmxPort: 9999
    image:
      registry: 292309088324.dkr.ecr.ap-northeast-1.amazonaws.com
      repository: sretools
      tag: jmx-exporter-1.1.0
  kafka:
    enabled: true
    image:
      registry: 292309088324.dkr.ecr.ap-northeast-1.amazonaws.com
      repository: sretools
      tag: kafka-exporter-v1.8.0
    certificatesSecret: ""
    tlsCert: ""
    tlsKey: ""
    tlsCaSecret: ""
    tlsCaCert: ""
    extraFlags: {}
    command: []
    args: []
    containerPorts:
      metrics: 9308
    resources:
      limits: {}
      requests: {}
    service:
      ports:
        metrics: 9308
      annotations: {}
  serviceMonitor:
    enabled: true
    namespace: "kafka-qa"
    labels:
      release: kube-prom-stack
  prometheusRule:
    enabled: false
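
The heap guidance in the values above (a 4G heap on an 8G broker, staying under 70% of physical memory so the remainder is left to the OS page cache) can be expressed as a quick rule-of-thumb helper. This is an illustrative sketch, not part of the deployment tooling:

```python
def suggest_heap_gib(physical_gib: float, cap_fraction: float = 0.70) -> int:
    """Rule of thumb from the values file above: give the Kafka JVM roughly
    half of physical memory, never more than cap_fraction of it; the rest is
    deliberately left to the OS page cache, which Kafka relies on heavily.
    Illustrative helper only."""
    half = max(1, int(physical_gib // 2))
    cap = int(physical_gib * cap_fraction)
    return min(half, cap)

# An 8 GiB broker gets a 4 GiB heap, matching -Xmx4g -Xms4g above.
print(suggest_heap_gib(8))  # → 4
```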

---
# kafka-gp3-sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kafka-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  fsType: ext4
  iops: "3000"
  throughput: "250"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true


---
# kafka-exporter.yaml
apiVersion: apps/v1
kind: Deployment                                                                                                                                                                                        
metadata:       
  name: kafka-exporter
  namespace: kafka-qa
  labels:
    app: kafka-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kafka-exporter
  template:
    metadata:
      labels:
        app: kafka-exporter
    spec:
      containers:
      - name: kafka-exporter
        image: 292309088324.dkr.ecr.ap-northeast-1.amazonaws.com/sretools:kafka-exporter-v1.8.0
        args:
          - --kafka.server=kafka-cluster-controller-headless.kafka-qa.svc.cluster.local:9092
        ports:
        - containerPort: 9308
          name: metrics
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
---
apiVersion: v1
kind: Service
metadata:
  name: kafka-exporter
  namespace: kafka-qa
  labels:
    app: kafka-exporter
spec:
  ports:
  - port: 9308
    targetPort: 9308
    name: metrics
  selector:
    app: kafka-exporter
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-exporter
  namespace: kafka-qa
  labels:
    release: kube-prom-stack
spec:
  selector:
    matchLabels:
      app: kafka-exporter
  endpoints:
  - port: metrics
    interval: 10s

Helm deployment directory

EC2/K8S deployment monitoring dashboards:
