1. Environment Deployment Inventory
1.1 AWS MSK Deployments
┌──────┬───────────────┬──────────────────┬─────────┬──────────────┐
│ Env  │ Kafka version │ Instance type    │ Storage │ Nodes        │
├──────┼───────────────┼──────────────────┼─────────┼──────────────┤
│ dev  │ 3.6           │ t3.small (2C2G)  │ 500G    │ 3            │
├──────┼───────────────┼──────────────────┼─────────┼──────────────┤
│ sit  │ 3.8.x         │ m7g.large (2C8G) │ 130G    │ 3            │
├──────┼───────────────┼──────────────────┼─────────┼──────────────┤
│ fat  │ 3.8.x         │ m7g.large (2C8G) │ 150G    │ 3            │
├──────┼───────────────┼──────────────────┼─────────┼──────────────┤
│ qa   │ 3.8.x         │ m7g.large (2C8G) │ 150G    │ 4 (2 per AZ) │
├──────┼───────────────┼──────────────────┼─────────┼──────────────┤
│ uat  │ 3.7.x         │ m7g.large (2C8G) │ 1000G   │ 3            │
├──────┼───────────────┼──────────────────┼─────────┼──────────────┤
│ prod │ 3.6.x         │ m7g.large (2C8G) │ 3100G   │ 3            │
└──────┴───────────────┴──────────────────┴─────────┴──────────────┘
1.2 Self-Managed Kafka Deployments
┌─────────────┬─────────────────┬──────────┬─────────┬────────────────────┐
│ Env/version │ Deployment      │ Spec     │ Storage │ Notes              │
├─────────────┼─────────────────┼──────────┼─────────┼────────────────────┤
│ dev/3.9.0   │ Container (K8s) │ 2C2G × 3 │ 300G    │ Tokyo zone A       │
├─────────────┼─────────────────┼──────────┼─────────┼────────────────────┤
│ fat/3.9.0   │ Container (K8s) │ 2C8G × 3 │ 300G    │ Tokyo zone A       │
├─────────────┼─────────────────┼──────────┼─────────┼────────────────────┤
│ qa/3.9.0    │ Container (K8s) │ 2C8G × 3 │ 300G    │ Tokyo zone A       │
├─────────────┼─────────────────┼──────────┼─────────┼────────────────────┤
│ sit/3.9.0   │ EC2             │ 4C8G × 3 │ 300G    │ Better performance │
├─────────────┼─────────────────┼──────────┼─────────┼────────────────────┤
│ uat/3.9.0   │ EC2             │ 4C8G × 3 │ 300G    │ Better performance │
├─────────────┼─────────────────┼──────────┼─────────┼────────────────────┤
│ prod/3.9.0  │ EC2             │ 4C8G × 3 │ 500G    │ Pending deployment │
└─────────────┴─────────────────┴──────────┴─────────┴────────────────────┘
2. EC2 Deployment (Ansible Automation)
2.1 Wrapper Script (Recommended)
Deploy the cluster
./scripts/kafka-ctl.sh sit deploy
Start the service
./scripts/kafka-ctl.sh sit start
Stop the service
./scripts/kafka-ctl.sh sit stop
Rolling restart
./scripts/kafka-ctl.sh sit restart
Check status
./scripts/kafka-ctl.sh sit status
Health check
./scripts/kafka-ctl.sh sit health
List topics
./scripts/kafka-ctl.sh sit topics
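The wrapper script itself is not reproduced here; as a rough sketch of how such a dispatcher maps `<env> <action>` onto the playbook commands in 2.2 (the mapping is an assumption inferred from the commands above, not the actual script contents):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a kafka-ctl.sh-style dispatcher. It only builds
# the ansible-playbook command line for an <env> <action> pair; the
# playbook names mirror section 2.2 but are assumptions.
set -euo pipefail

build_cmd() {
  local env="$1" action="$2"
  local inv="inventories/${env}/hosts.yml"
  case "$action" in
    deploy|start|stop|restart|status)
      echo "ansible-playbook -i ${inv} playbooks/${action}.yml" ;;
    health)
      echo "ansible-playbook -i ${inv} playbooks/maintenance.yml -e task=check_health" ;;
    topics)
      echo "ansible-playbook -i ${inv} playbooks/topic-management.yml -e action=list" ;;
    *)
      echo "usage: kafka-ctl.sh <env> <action>" >&2
      return 1 ;;
  esac
}

build_cmd sit deploy   # → ansible-playbook -i inventories/sit/hosts.yml playbooks/deploy.yml
```

The real script would `eval` or `exec` the built command; printing it first makes dry runs trivial.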
2.2 Ansible Playbook Invocation
Deploy the cluster
ansible-playbook -i inventories/sit/hosts.yml playbooks/deploy.yml
Start the service
ansible-playbook -i inventories/sit/hosts.yml playbooks/start.yml
Stop the service
ansible-playbook -i inventories/sit/hosts.yml playbooks/stop.yml
Rolling restart
ansible-playbook -i inventories/sit/hosts.yml playbooks/restart.yml
Check status
ansible-playbook -i inventories/sit/hosts.yml playbooks/status.yml
Single-node operation
ansible-playbook -i inventories/sit/hosts.yml playbooks/restart.yml --limit kafka-sit-1
2.3 Maintenance
Health check
ansible-playbook -i inventories/sit/hosts.yml playbooks/maintenance.yml -e task=check_health
Clean up logs
ansible-playbook -i inventories/sit/hosts.yml playbooks/maintenance.yml -e task=cleanup_logs
Back up configuration
ansible-playbook -i inventories/sit/hosts.yml playbooks/maintenance.yml -e task=backup_config
2.4 Topic Management
List all topics
ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml -e "action=list"
Create a topic
ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
  -e "action=create topic_name=my-topic partitions=3 replication_factor=2"
Describe a topic
ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
  -e "action=describe topic_name=my-topic"
Delete a topic
ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
  -e "action=delete topic_name=my-topic"
2.5 Upgrading Kafka
ansible-playbook -i inventories/sit/hosts.yml playbooks/upgrade.yml -e kafka_version=3.10.0
3. Manual Deployment (KRaft Mode)
3.1 Prerequisites
1. Configure /etc/hosts (avoids DNS resolution problems)
sudo tee -a /etc/hosts << EOF
10.17.7.40 kafka-node-0
10.18.118.130 kafka-node-1
10.18.17.213 kafka-node-2
EOF
2. Download and extract Kafka
wget https://archive.apache.org/dist/kafka/3.6.2/kafka_2.13-3.6.2.tgz
tar -zxvf kafka_2.13-3.6.2.tgz
ln -s kafka_2.13-3.6.2 kafka
3.2 Node Configuration (server.properties)
Node 0 (broker.id=0, IP: 10.17.7.40):
# Basic identity
broker.id=0
node.id=0
# Note: in KRaft mode the cluster ID is assigned by `kafka-storage.sh format`
# (section 3.3), not by server.properties, so no cluster.id entry is set here.
# Listener configuration (KRaft mode)
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
advertised.listeners=PLAINTEXT://kafka-node-0:9092
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
controller.listener.names=CONTROLLER
# KRaft core settings (no ZooKeeper)
process.roles=broker,controller
controller.quorum.voters=0@kafka-node-0:9093,1@kafka-node-1:9093,2@kafka-node-2:9093
# Data directory (application logs under /data/kafka/logs are controlled
# by log4j, not by server.properties)
log.dirs=/data/kafka/data
# Topic defaults
num.partitions=3
default.replication.factor=3
min.insync.replicas=2
offsets.topic.replication.factor=3
# Performance tuning
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
group.initial.rebalance.delay.ms=0
Nodes 1 and 2: change only these three settings
broker.id=1                                         # or 2
node.id=1                                           # or 2
advertised.listeners=PLAINTEXT://kafka-node-1:9092  # or kafka-node-2
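Since only three keys differ per node, the node 1/2 files can be generated from node 0's file instead of edited by hand. A sketch (file paths and hostnames follow the examples above; this helper is not part of the deployment):

```shell
# Sketch: derive a node's server.properties from node 0's file by
# rewriting only the three per-node keys.
gen_node_config() {
  local base="$1" id="$2" host="$3"   # base config file, node id, node hostname
  sed -e "s/^broker\.id=.*/broker.id=${id}/" \
      -e "s/^node\.id=.*/node.id=${id}/" \
      -e "s|^advertised\.listeners=.*|advertised.listeners=PLAINTEXT://${host}:9092|" \
      "$base"
}
# Usage: gen_node_config server.properties 1 kafka-node-1 > node1.properties
```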
3.3 Initialize KRaft Storage
1. Generate a cluster ID (node 0 only)
CLUSTER_ID=$(~/kafka/bin/kafka-storage.sh random-uuid)
echo "Cluster ID: $CLUSTER_ID"
Example output: rL_902f6c1d13a1a2e8d6e3b5c8a7f2d1e0
2. Format node 0's storage directory
bin/kafka-storage.sh format -t $CLUSTER_ID -c /data/kafka/config/server.properties
3. Copy the cluster ID to nodes 1 and 2 (over SSH) and run the same format command there
bin/kafka-storage.sh format -t $CLUSTER_ID -c /data/kafka/config/server.properties
3.4 Start the Cluster
Create a start script (run on every node)
#!/bin/bash
# JVM settings (2G initial heap, 4G max)
export KAFKA_HEAP_OPTS="-Xms2G -Xmx4G"
export KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:/data/kafka/config/log4j.properties"
# Start Kafka in the background
~/kafka/bin/kafka-server-start.sh -daemon /data/kafka/config/server.properties
# Verify startup
sleep 10
jps | grep Kafka   # a Kafka process ID in the output means startup succeeded
Important: start the nodes strictly in the order node 0 → node 1 → node 2 to avoid controller election problems.
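The ordered start lends itself to a small loop. A sketch, assuming passwordless SSH to each node (the remote command is only echoed here so the sequence is visible):

```shell
# Sketch: start the nodes strictly in order. In a real run you would
# block on each node's controller port before moving to the next.
start_in_order() {
  local node
  for node in kafka-node-0 kafka-node-1 kafka-node-2; do
    echo "ssh ${node} '~/kafka/bin/kafka-server-start.sh -daemon /data/kafka/config/server.properties'"
    # Real run: wait for the controller port before continuing, e.g.
    # until nc -z "$node" 9093; do sleep 2; done
  done
}
```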
3.5 Verify Cluster Status
List topics (confirms the brokers are reachable)
~/kafka/bin/kafka-topics.sh --bootstrap-server kafka-node-0:9092 --list
Inspect the cluster controller metadata
~/kafka/bin/kafka-metadata-shell.sh --snapshot /data/kafka/data/__cluster_metadata-0/00000000000000000000.log
Test topic creation
~/kafka/bin/kafka-topics.sh --bootstrap-server kafka-node-0:9092 --create \
--topic test-topic --partitions 3 --replication-factor 3
~/kafka/bin/kafka-topics.sh --bootstrap-server kafka-node-0:9092 --describe --topic test-topic
4. K8s Deployment (Helm)
4.1 Deployment Steps
Search available chart versions
helm search repo bitnami/kafka --versions 2>/dev/null | grep -E "3\.9\.|3\.8\.|3\.7\." | head -20
Deploy or upgrade the cluster
cd /root/kafka
helm upgrade kafka-cluster bitnami/kafka --version 31.5.0 -f kafka-cluster-values.yaml -n kafka-cluster
4.2 Helm Values
# Image
image:
  registry: 292309088324.dkr.ecr.ap-northeast-1.amazonaws.com
  repository: sretools
  tag: kafka-3.9.0
  pullPolicy: Always

# Global settings
global:
  storageClass: "gp2"
  security:
    allowInsecureImages: true

# Controller (KRaft mode)
controller:
  replicaCount: 3
  automountServiceAccountToken: true
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
  persistence:
    enabled: true
    size: 300Gi
    storageClass: "gp2"
  heapOpts: "-Xmx2g -Xms2g"

# Brokers (can be 0 in KRaft combined mode, where controllers also serve as brokers)
broker:
  replicaCount: 0

# Listeners
listeners:
  client:
    protocol: PLAINTEXT
  interbroker:
    protocol: PLAINTEXT
  controller:
    protocol: PLAINTEXT
  external:
    protocol: PLAINTEXT

# External access
externalAccess:
  enabled: true
  autoDiscovery:
    enabled: true
  controller:
    service:
      type: LoadBalancer
      ports:
        external: 9092
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: "external"
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
        service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"

# Kafka core configuration
extraConfig: |
  num.partitions=3
  default.replication.factor=2
  min.insync.replicas=2
  offsets.topic.replication.factor=3
  transaction.state.log.replication.factor=3
  log.retention.hours=72
  auto.create.topics.enable=false
4.3 In-Cluster Test (K8s)
Create a test topic
kubectl run kafka-test --rm -it --restart='Never' \
--image=292309088324.dkr.ecr.ap-northeast-1.amazonaws.com/sretools:kafka-3.9.0 \
--namespace kafka-cluster \
-- /opt/bitnami/kafka/bin/kafka-topics.sh \
--create \
--topic test-internal \
--partitions 3 \
--replication-factor 2 \
--bootstrap-server kafka-cluster.kafka-cluster.svc.cluster.local:9092
List topics
kubectl run kafka-test2 --rm -it --restart='Never' \
--image=292309088324.dkr.ecr.ap-northeast-1.amazonaws.com/sretools:kafka-3.9.0 \
--namespace kafka-cluster \
-- /opt/bitnami/kafka/bin/kafka-topics.sh \
--list \
--bootstrap-server kafka-cluster.kafka-cluster.svc.cluster.local:9092
4.4 External Access Test
Create a topic
bin/kafka-topics.sh --create --topic test-external --partitions 3 \
--replication-factor 2 --bootstrap-server k8s-kafkaclu-kafkaclu-xxx.elb.ap-northeast-1.amazonaws.com:9094
Produce a message (`--broker-list` is deprecated; use `--bootstrap-server`)
echo "hello from external" | ./kafka-console-producer.sh \
  --bootstrap-server k8s-kafkaclu-kafkaclu-xxx.elb.ap-northeast-1.amazonaws.com:9094 --topic test-external
Consume the message
./kafka-console-consumer.sh --bootstrap-server k8s-kafkaclu-kafkaclu-xxx.elb.ap-northeast-1.amazonaws.com:9094 \
--topic test-external --from-beginning --max-messages 1
Output: hello from external
5. Detailed Broker Configuration
5.1 Full Configuration File
# 1. Basic identity and cluster settings
node.id=1
# (the cluster ID is assigned by `kafka-storage.sh format`, not set here)
# 2. Network listeners
listeners=PLAINTEXT://10.18.118.130:9092,CONTROLLER://10.18.118.130:9093
advertised.listeners=PLAINTEXT://10.18.118.130:9092
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
controller.listener.names=CONTROLLER
# 3. KRaft core settings
process.roles=broker,controller
controller.quorum.voters=0@10.17.7.40:9093,1@10.18.118.130:9093,2@10.18.17.213:9093
# 4. Storage directory
log.dirs=/data/kafka/data
# 5. Core thread and network settings
num.network.threads=3
num.io.threads=4
background.threads=4
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=52428800
# 6. Topic defaults
num.partitions=3
num.replica.fetchers=2
default.replication.factor=3
min.insync.replicas=1
# 7. Log retention policy
log.retention.hours=24
log.retention.bytes=64424509440
log.segment.bytes=536870912
log.roll.ms=3600000
log.cleaner.enable=true
log.cleanup.policy=delete
# 8. Internal topic replication
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
# 9. Operations and monitoring
log.cleaner.backoff.ms=15000
log.flush.interval.messages=10000
log.flush.interval.ms=1000
connections.max.idle.ms=300000
group.initial.rebalance.delay.ms=3000
auto.create.topics.enable=false
delete.topic.enable=true
compression.type=producer
# 10. Replica synchronization
replica.lag.time.max.ms=30000
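Before rolling out a file like the one above, a quick pre-flight check for the KRaft keys it depends on can catch copy/paste omissions. A sketch (the helper is illustrative, not part of the deployment):

```shell
# Sketch: verify a server.properties contains the KRaft keys used above.
# Prints any missing key and returns non-zero if one is absent.
check_kraft_config() {
  local f="$1" key rc=0
  for key in node.id process.roles controller.quorum.voters \
             controller.listener.names listeners log.dirs; do
    grep -q "^${key}=" "$f" || { echo "missing: ${key}"; rc=1; }
  done
  return $rc
}
# Usage: check_kraft_config /data/kafka/config/server.properties
```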
5.2 Key Parameters
┌────────────────────────────┬─────────────┬───────────────────────────────────────────────┐
│ Parameter                  │ Recommended │ Description                                   │
├────────────────────────────┼─────────────┼───────────────────────────────────────────────┤
│ num.partitions             │ 3           │ Default partition count                       │
├────────────────────────────┼─────────────┼───────────────────────────────────────────────┤
│ default.replication.factor │ 3           │ Default replica count (must be ≥ 3 in prod)   │
├────────────────────────────┼─────────────┼───────────────────────────────────────────────┤
│ min.insync.replicas        │ 2           │ Minimum in-sync replicas (pair with acks=all) │
├────────────────────────────┼─────────────┼───────────────────────────────────────────────┤
│ log.retention.hours        │ 24-168      │ Log retention window (hours)                  │
├────────────────────────────┼─────────────┼───────────────────────────────────────────────┤
│ log.segment.bytes          │ 512MB-1GB   │ Log segment size                              │
├────────────────────────────┼─────────────┼───────────────────────────────────────────────┤
│ num.network.threads        │ 3           │ Network thread count                          │
├────────────────────────────┼─────────────┼───────────────────────────────────────────────┤
│ num.io.threads             │ 4-8         │ I/O thread count                              │
├────────────────────────────┼─────────────┼───────────────────────────────────────────────┤
│ auto.create.topics.enable  │ false       │ Must be disabled in production                │
├────────────────────────────┼─────────────┼───────────────────────────────────────────────┤
│ delete.topic.enable        │ true        │ Allow topic deletion                          │
└────────────────────────────┴─────────────┴───────────────────────────────────────────────┘
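The retention parameters translate directly into disk sizing: bytes kept ≈ ingest rate × retention window × replication factor. A back-of-envelope helper (the 10 MB/s ingest figure is an assumed example, not a measured rate):

```shell
# Sketch: cluster-wide disk implied by log.retention.hours.
# Integer GB, so treat the result as a floor.
retention_gb() {
  local mb_per_s="$1" hours="$2" rf="$3"
  echo $(( mb_per_s * hours * 3600 * rf / 1024 ))
}
retention_gb 10 24 3   # ≈ 2531 GB across the cluster for 10 MB/s ingest, 24h, RF=3
```

Note that `log.retention.bytes` is a per-partition cap, so it bounds this growth per partition rather than per cluster.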
6. Monitoring
6.1 Monitoring Architecture
kafka_exporter + jmx_exporter → Prometheus → Grafana
6.2 Key Metrics
┌──────────┬─────────────────────────────────┬────────────────────────────────────────────┐
│ Category │ Metric                          │ Alert threshold                            │
├──────────┼─────────────────────────────────┼────────────────────────────────────────────┤
│ Broker   │ UnderReplicatedPartitions       │ > 0                                        │
├──────────┼─────────────────────────────────┼────────────────────────────────────────────┤
│ Broker   │ RequestHandlerAvgIdlePercent    │ < 30%                                      │
├──────────┼─────────────────────────────────┼────────────────────────────────────────────┤
│ System   │ Memory usage                    │ > 85%                                      │
├──────────┼─────────────────────────────────┼────────────────────────────────────────────┤
│ System   │ EBS volume usage                │ > 80%                                      │
├──────────┼─────────────────────────────────┼────────────────────────────────────────────┤
│ System   │ EBS queue depth (avg.queue_len) │ sustained > 2 indicates a disk bottleneck  │
├──────────┼─────────────────────────────────┼────────────────────────────────────────────┤
│ System   │ CPU usage                       │ sustained > 70%: consider scaling up       │
├──────────┼─────────────────────────────────┼────────────────────────────────────────────┤
│ JVM      │ Full GC frequency and duration  │ alert when frequent                        │
└──────────┴─────────────────────────────────┴────────────────────────────────────────────┘
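In practice these thresholds live in Prometheus alert rules; as a toy illustration of the comparison they encode (names and values here are illustrative only):

```shell
# Toy evaluator for the thresholds above: compares a sampled percentage
# against its limit and prints ALERT or OK.
check_threshold() {
  local name="$1" value="$2" limit="$3"
  if [ "$value" -gt "$limit" ]; then
    echo "ALERT ${name}=${value}% (limit ${limit}%)"
  else
    echo "OK ${name}=${value}%"
  fi
}
check_threshold ebs_used_pct 85 80   # → ALERT ebs_used_pct=85% (limit 80%)
```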
6.3 Deploying kafka_exporter
Download
Start
nohup ./kafka_exporter \
--kafka.server=10.18.17.213:9092 \
--kafka.server=10.18.118.130:9092 \
--kafka.server=10.17.7.40:9092 \
--kafka.version=3.9.0 \
--web.listen-address=0.0.0.0:9308 \
--log.level=info > kafka-exporter-node1.log 2>&1 &
Prometheus scrape config:
- job_name: 'rapidx_kafka_cluster_dev'
  metrics_path: /metrics
  scrape_interval: 15s
  scrape_timeout: 15s
  static_configs:
    - targets: ['10.18.118.130:9308']
      labels:
        instance: "rapidx-kafka-cluster-dev"
        env: dev
6.4 Deploying jmx_exporter
Download
Create the config file jmx_exporter_config.yml
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  # Kafka core metrics
  - pattern: 'kafka.server<type=(.+), name=(.+), topic=(.+), partition=(.+)><>Value'
    name: kafka_server_$1_$2
    labels:
      topic: $3
      partition: $4
  - pattern: 'kafka.server<type=(.+), name=(.+)><>Value'
    name: kafka_server_$1_$2
  - pattern: 'kafka.network<type=(.+), name=(.+)><>Value'
    name: kafka_network_$1_$2
  - pattern: 'kafka.controller<type=(.+), name=(.+)><>Value'
    name: kafka_controller_$1_$2
  # JVM basics
  - pattern: 'java.lang<type=Memory><>HeapMemoryUsage'
    name: jvm_heap_memory_usage
  - pattern: 'java.lang<type=Memory><>NonHeapMemoryUsage'
    name: jvm_nonheap_memory_usage
  - pattern: 'java.lang<type=GarbageCollector, name=(.+)><>CollectionCount'
    name: jvm_gc_collection_count
    labels:
      gc: $1
  - pattern: 'java.lang<type=GarbageCollector, name=(.+)><>CollectionTime'
    name: jvm_gc_collection_time_ms
    labels:
      gc: $1
  - pattern: 'java.lang<type=Threading><>ThreadCount'
    name: jvm_thread_count
Modify the Kafka start script:
#!/bin/bash
source /etc/profile
# Attach the JMX exporter agent
export KAFKA_OPTS="$KAFKA_OPTS -javaagent:/opt/kafka/jmx_exporter/jmx_prometheus_javaagent-0.20.0.jar=9999:/opt/kafka/jmx_exporter/jmx_exporter_config.yml"
# Start Kafka
/opt/kafka/kafka_2.12-3.9.0/bin/kafka-server-start.sh -daemon /data/kafka/config/server.properties
Prometheus JMX scrape config:
- job_name: 'rapidx_kafka_cluster_dev-jmx'
  metrics_path: /metrics
  scrape_interval: 15s
  scrape_timeout: 15s
  static_configs:
    - targets: ['10.17.7.40:9999']
      labels:
        instance: "10.17.7.40:9999"
        env: dev
        broker_host: 'broker-0'
    - targets: ['10.18.118.130:9999']
      labels:
        instance: "10.18.118.130:9999"
        env: dev
        broker_host: 'broker-1'
    - targets: ['10.18.17.213:9999']
      labels:
        instance: "10.18.17.213:9999"
        env: dev
        broker_host: 'broker-2'
7. Common Maintenance Commands
7.1 Topic Operations
List all topics
bin/kafka-topics.sh --list --bootstrap-server 10.18.118.130:9092
Describe a topic
./kafka-topics.sh --describe --bootstrap-server 10.18.118.130:9092 --topic test-topic
Create a topic
bin/kafka-topics.sh --bootstrap-server 10.18.118.130:9092 --create \
  --topic test-topic --partitions 3 --replication-factor 3
Create a topic with per-topic configs
# retention.ms=86400000 keeps messages for 24 hours;
# cleanup.policy=delete removes expired messages (the default).
# Comments must stay outside the command: inline comments break the
# backslash line continuations.
./kafka-topics.sh \
  --create \
  --bootstrap-server <MSK_BROKER_LIST> \
  --topic my-msk-topic \
  --partitions 3 \
  --replication-factor 2 \
  --config retention.ms=86400000 \
  --config cleanup.policy=delete
Delete a topic
./bin/kafka-topics.sh --bootstrap-server 10.18.118.130:9092 --delete --topic default.trading.event
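Since `retention.ms` is in milliseconds, it is easy to mis-compute by hand; a tiny converter (illustrative helper, not part of the runbook tooling):

```shell
# Sketch: convert days to the retention.ms value used in --config flags.
days_to_retention_ms() {
  echo $(( $1 * 24 * 60 * 60 * 1000 ))
}
days_to_retention_ms 1   # → 86400000 (the 24-hour value used above)
```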
7.2 Produce/Consume Tests
Produce a test message (producer settings such as acks are passed with --producer-property, not --property)
echo "hello kafka" | /opt/kafka/bin/kafka-console-producer.sh \
  --bootstrap-server 10.17.9.79:9092 --topic test-topic --producer-property acks=all
Consume test messages
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server 10.17.9.79:9092 \
--topic test-topic --from-beginning --max-messages 10
7.3 Cluster Management
Inspect cluster metadata
/opt/kafka/bin/kafka-metadata-shell.sh --snapshot \
  /data/kafka/data/__cluster_metadata-0/00000000000000000000.log
Stop the Kafka cluster (EC2)
bin/kafka-server-stop.sh
jps | grep Kafka   # no output means Kafka has stopped
Start the Kafka cluster (EC2), in order: node 0 → node 1 → node 2
7.4 Helm Operations
Install or upgrade
helm upgrade --install kafka-cluster bitnami/kafka \
--values kafka-cluster-values.yaml \
--version 26.0.0 \
-n kafka-cluster
8. Performance Testing
8.1 Test Tool
Uses Kafka's bundled kafka-producer-perf-test.sh script.
8.2 Results
┌──────────────────────┬───────────┬───────────────────────────────────┬─────────────┬─────────────┐
│ Scenario             │ Records   │ Throughput                        │ Avg latency │ P99 latency │
├──────────────────────┼───────────┼───────────────────────────────────┼─────────────┼─────────────┤
│ 10K × 1KB (acks=1)   │ 10,000    │ 15,060 records/sec (14.71 MB/s)   │ 93.83ms     │ 154ms       │
├──────────────────────┼───────────┼───────────────────────────────────┼─────────────┼─────────────┤
│ 100K × 1KB (acks=1)  │ 100,000   │ 60,350 records/sec (58.94 MB/s)   │ 280.75ms    │ 425ms       │
├──────────────────────┼───────────┼───────────────────────────────────┼─────────────┼─────────────┤
│ 1M × 1KB (acks=1)    │ 1,000,000 │ 137,438 records/sec (134.22 MB/s) │ 92.40ms     │ 453ms       │
└──────────────────────┴───────────┴───────────────────────────────────┴─────────────┴─────────────┘
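The MB/s column follows directly from the 1 KiB record size: MB/s = records/sec ÷ 1024. A one-liner to sanity-check the table:

```shell
# Cross-check of the throughput column: with 1 KiB records,
# MB/s = records_per_sec / 1024.
mbps_from_rps() {
  awk -v r="$1" 'BEGIN { printf "%.2f", r / 1024 }'
}
mbps_from_rps 15060    # → 14.71, matching the first row
mbps_from_rps 137438   # → 134.22, matching the last row
```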
8.3 Test Commands
Send 10,000 1KB messages, unthrottled, acks=1
./bin/kafka-producer-perf-test.sh \
--topic jame-topic1 \
--num-records 10000 \
--record-size 1024 \
--throughput -1 \
--producer-props bootstrap.servers=10.18.118.130:9092 acks=1 \
--print-metrics
Send 1,000,000 1KB messages
./bin/kafka-producer-perf-test.sh \
--topic jame-topic1 \
--num-records 1000000 \
--record-size 1024 \
--throughput -1 \
--producer-props bootstrap.servers=10.18.118.130:9092 acks=1 \
--print-metrics
9. Dashboards
- Kafka topic dashboard: Kubernetes Kafka Topics
- Kafka JVM dashboard: Kubernetes Kafka
Both the container and EC2 deployments feed into these two dashboards, so everything is visible in one place without switching templates.
Ansible batch deployment layout:
kafka-ansible$ tree
.
├── README.md
├── ansible.cfg
├── inventories
│ ├── sit
│ │ ├── group_vars
│ │ │ └── all.yml
│ │ ├── host_vars
│ │ └── hosts.yml
│ └── uat
│ ├── group_vars
│ │ └── all.yml
│ ├── host_vars
│ └── hosts.yml
├── playbooks
│ ├── deploy.yml
│ ├── enable-jmx-exporter.yml
│ ├── maintenance.yml
│ ├── restart.yml
│ ├── start.yml
│ ├── status.yml
│ ├── stop.yml
│ ├── topic-management.yml
│ └── upgrade.yml
├── roles
│ └── kafka
│ ├── defaults
│ │ └── main.yml
│ ├── files
│ ├── handlers
│ │ └── main.yml
│ ├── meta
│ │ └── main.yml
│ ├── tasks
│ │ ├── configure.yml
│ │ ├── install.yml
│ │ ├── main.yml
│ │ ├── metrics.yml
│ │ ├── prerequisites.yml
│ │ └── service.yml
│ ├── templates
│ │ ├── jmx-exporter.yml.j2
│ │ ├── kafka.env.j2
│ │ ├── kafka.service.j2
│ │ ├── log4j.properties.j2
│ │ └── server.properties.j2
│ └── vars
│ └── main.yml
└── scripts
└── kafka-ctl.sh


--- /home/runner/kafka-ansible/README.md ---
# Kafka Ansible Deployment
An Ansible deployment for EC2 Kafka clusters, derived from the Helm configuration, supporting batch deployment and management of the SIT and UAT environments.
## Directory Layout
```
kafka-ansible/
├── ansible.cfg                # Ansible configuration
├── inventories/
│   ├── sit/                   # SIT environment
│   │   ├── hosts.yml          # host inventory
│   │   └── group_vars/
│   │       └── all.yml        # environment variables
│   └── uat/                   # UAT environment
│       ├── hosts.yml          # host inventory
│       └── group_vars/
│           └── all.yml        # environment variables
├── roles/
│   └── kafka/                 # Kafka role
│       ├── defaults/          # default variables
│       ├── handlers/          # handlers
│       ├── meta/              # role metadata
│       ├── tasks/             # task files
│       ├── templates/         # config templates
│       └── vars/              # internal variables
├── playbooks/
│   ├── deploy.yml             # deploy
│   ├── start.yml              # start
│   ├── stop.yml               # stop
│   ├── restart.yml            # rolling restart
│   ├── status.yml             # status check
│   ├── upgrade.yml            # upgrade
│   ├── maintenance.yml        # maintenance tasks
│   └── topic-management.yml   # topic management
├── scripts/
│   └── kafka-ctl.sh           # wrapper control script
└── README.md
```
## Environment Configuration
### SIT
- IPs: 10.17.9.79, 10.17.9.57, 10.17.12.159
- Fully configured
### UAT
- IPs: pending allocation
- See "Adding the UAT Environment" below for setup steps
## Quick Start
### 1. Configure the SSH Key
```bash
# Make sure the SSH key exists
ls /data/runner.key
```
### 2. Test Connectivity
```bash
cd kafka-ansible
ansible -i inventories/sit/hosts.yml kafka -m ping
```
### 3. Deploy the Cluster
```bash
# Using the wrapper script
./scripts/kafka-ctl.sh sit deploy
# Or call ansible-playbook directly
ansible-playbook -i inventories/sit/hosts.yml playbooks/deploy.yml
```
## Common Commands
### Wrapper Script (Recommended)
```bash
# Deploy
./scripts/kafka-ctl.sh sit deploy
# Start
./scripts/kafka-ctl.sh sit start
# Stop
./scripts/kafka-ctl.sh sit stop
# Rolling restart
./scripts/kafka-ctl.sh sit restart
# Check status
./scripts/kafka-ctl.sh sit status
# Health check
./scripts/kafka-ctl.sh sit health
# List topics
./scripts/kafka-ctl.sh sit topics
```
### Direct Playbook Usage
```bash
# Deploy the cluster
ansible-playbook -i inventories/sit/hosts.yml playbooks/deploy.yml
# Start the service
ansible-playbook -i inventories/sit/hosts.yml playbooks/start.yml
# Stop the service
ansible-playbook -i inventories/sit/hosts.yml playbooks/stop.yml
# Rolling restart
ansible-playbook -i inventories/sit/hosts.yml playbooks/restart.yml
# Check status
ansible-playbook -i inventories/sit/hosts.yml playbooks/status.yml
# Single-node operation
ansible-playbook -i inventories/sit/hosts.yml playbooks/restart.yml --limit kafka-sit-1
```
### Maintenance
```bash
# Health check
ansible-playbook -i inventories/sit/hosts.yml playbooks/maintenance.yml -e task=check_health
# Clean up logs
ansible-playbook -i inventories/sit/hosts.yml playbooks/maintenance.yml -e task=cleanup_logs
# Back up configuration
ansible-playbook -i inventories/sit/hosts.yml playbooks/maintenance.yml -e task=backup_config
```
### Topic Management
```bash
# List all topics
ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml -e "action=list"
# Create a topic
ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
  -e "action=create topic_name=my-topic partitions=3 replication_factor=2"
# Describe a topic
ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
  -e "action=describe topic_name=my-topic"
# Delete a topic
ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
  -e "action=delete topic_name=my-topic"
```
### Upgrading Kafka
```bash
ansible-playbook -i inventories/sit/hosts.yml playbooks/upgrade.yml -e kafka_version=3.10.0
```
## Adding the UAT Environment
Once the UAT machines are provisioned, configure them as follows:
### 1. Update hosts.yml
Edit `inventories/uat/hosts.yml`:
```yaml
---
all:
  children:
    kafka:
      hosts:
        kafka-uat-1:
          ansible_host: 10.x.x.x  # replace with the actual IP
          kafka_broker_id: 1
          kafka_node_id: 1
        kafka-uat-2:
          ansible_host: 10.x.x.x  # replace with the actual IP
          kafka_broker_id: 2
          kafka_node_id: 2
        kafka-uat-3:
          ansible_host: 10.x.x.x  # replace with the actual IP
          kafka_broker_id: 3
          kafka_node_id: 3
      vars:
        kafka_cluster_id: "uat-kafka-cluster"
        environment_name: "uat"
```
### 2. Update group_vars (optional)
If UAT uses a different SSH key, edit `inventories/uat/group_vars/all.yml`.
### 3. Deploy
```bash
./scripts/kafka-ctl.sh uat deploy
```
## Configuration Notes
### Directories
- Install dir: `/opt/kafka`
- Data dir: `/data/kafka/data`
- Log dir: `/data/kafka/logs`
- Config dir: `/opt/kafka/config`
### Ports
- Client: 9092
- Controller: 9093
- JMX: 9999
- Metrics exporter: 9308
### JVM Settings (for 8G of RAM)
- Heap: 4G (-Xmx4g -Xms4g)
- GC: G1GC
- MaxGCPauseMillis: 20
- InitiatingHeapOccupancyPercent: 35
### Kafka Settings
- Network threads: 2
- I/O threads: 8
- Partitions: 3
- Replication factor: 2
- min.insync.replicas: 2
- Log retention: 24 hours
## Monitoring Integration
The cluster is configured with the JMX Exporter; Prometheus can scrape metrics from:
```
http://<kafka-node-ip>:9308/metrics
```
## Troubleshooting
### Check Service Status
```bash
systemctl status kafka
journalctl -u kafka -f
```
### Check Logs
```bash
tail -f /data/kafka/logs/server.log
tail -f /data/kafka/logs/controller.log
```
### Check Cluster Metadata
```bash
/opt/kafka/bin/kafka-metadata-shell.sh --snapshot /data/kafka/data/__cluster_metadata-0/00000000000000000000.log
```
## Notes
1. **Rolling operations**: start, stop, and restart all run one node at a time (serial: 1) to keep the cluster available
2. **Data directories**: the first deployment formats KRaft storage automatically; nodes that already hold data are skipped
3. **SSH key**: connects as root with the key at /data/runner.key
4. **System initialization**: already done by hand, so the playbooks skip system-level configuration
5. **Firewall**: make sure ports 9092, 9093, 9999, and 9308 are open in the security group
--- /home/runner/kafka-ansible/ansible.cfg ---
[defaults]
# Inventory
inventory = inventories/sit/hosts.yml
# Roles path
roles_path = roles
# Host key checking
host_key_checking = False
# Timeout
timeout = 30
# SSH settings
remote_user = root
#private_key_file = /data/runner.key
# Retry files
retry_files_enabled = False
# Gathering
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
fact_caching_timeout = 86400
# Callback plugins
callback_whitelist = profile_tasks
stdout_callback = yaml
# Forks for parallel execution
forks = 10
# Display settings
display_skipped_hosts = False
display_ok_hosts = True
# Pipelining for performance
pipelining = True
[privilege_escalation]
become = False
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
control_path = %(directory)s/%%h-%%r
pipelining = True
--- /home/runner/kafka-ansible/inventories/sit/group_vars/all.yml ---
---
# SIT Environment Variables
ansible_user: root
#ansible_ssh_private_key_file: /data/runner.key
ansible_become: false
# Environment specific settings
env: sit
kafka_cluster_name: "kafka-sit"
# Network settings - SIT specific
kafka_advertised_listeners_prefix: "PLAINTEXT"
# Resource limits for SIT (can be lower than production)
kafka_heap_size: "4g"
kafka_min_heap_size: "4g"
# Replication settings
kafka_default_replication_factor: 2
kafka_min_insync_replicas: 2
kafka_offsets_topic_replication_factor: 3
kafka_transaction_state_log_replication_factor: 3
kafka_transaction_state_log_min_isr: 2
--- /home/runner/kafka-ansible/inventories/sit/hosts.yml ---
---
# SIT Environment Kafka Cluster
all:
  children:
    kafka:
      hosts:
        kafka-sit-1:
          ansible_host: 10.17.9.79
          kafka_broker_id: 1
          kafka_node_id: 1
        kafka-sit-2:
          ansible_host: 10.17.9.57
          kafka_broker_id: 2
          kafka_node_id: 2
        kafka-sit-3:
          ansible_host: 10.17.12.159
          kafka_broker_id: 3
          kafka_node_id: 3
      vars:
        # Cluster configuration
        kafka_cluster_id: "sit-kafka-cluster"
        environment_name: "sit"
--- /home/runner/kafka-ansible/inventories/uat/group_vars/all.yml ---
---
# UAT Environment Variables
ansible_user: root
ansible_become: false
# Environment specific settings
env: uat
kafka_cluster_name: "kafka-uat"
# Network settings - UAT specific
kafka_advertised_listeners_prefix: "PLAINTEXT"
# Resource limits for UAT (production-like)
kafka_heap_size: "4g"
kafka_min_heap_size: "4g"
# Replication settings
kafka_default_replication_factor: 2
kafka_min_insync_replicas: 2
kafka_offsets_topic_replication_factor: 3
kafka_transaction_state_log_replication_factor: 3
kafka_transaction_state_log_min_isr: 2
# UAT Node IPs
uat_node1_ip: "10.20.7.184"
uat_node2_ip: "10.20.13.223"
uat_node3_ip: "10.20.10.42"
--- /home/runner/kafka-ansible/inventories/uat/hosts.yml ---
---
# UAT Environment Kafka Cluster
# TODO: Update with actual IP addresses when machines are provisioned
all:
  children:
    kafka:
      hosts:
        kafka-uat-1:
          ansible_host: "{{ uat_node1_ip | default('PENDING') }}"
          kafka_broker_id: 1
          kafka_node_id: 1
        kafka-uat-2:
          ansible_host: "{{ uat_node2_ip | default('PENDING') }}"
          kafka_broker_id: 2
          kafka_node_id: 2
        kafka-uat-3:
          ansible_host: "{{ uat_node3_ip | default('PENDING') }}"
          kafka_broker_id: 3
          kafka_node_id: 3
      vars:
        # Cluster configuration
        kafka_cluster_id: "uat-kafka-cluster"
        environment_name: "uat"
--- /home/runner/kafka-ansible/playbooks/deploy.yml ---
---
# Kafka Deployment Playbook
# Usage:
#   SIT: ansible-playbook -i inventories/sit/hosts.yml playbooks/deploy.yml
#   UAT: ansible-playbook -i inventories/uat/hosts.yml playbooks/deploy.yml

# Phase 1: Install and configure on all nodes (parallel)
- name: Install and Configure Kafka
  hosts: kafka
  become: false
  pre_tasks:
    - name: Display deployment information
      ansible.builtin.debug:
        msg: |
          ========================================
          Deploying Kafka to: {{ inventory_hostname }}
          Environment: {{ environment_name | default('unknown') }}
          Node ID: {{ kafka_node_id }}
          Host IP: {{ ansible_host }}
          ========================================
    - name: Verify connectivity
      ansible.builtin.ping:
  roles:
    - role: kafka
      tags:
        - kafka
  vars:
    kafka_skip_start: true  # Don't start the service yet

# Phase 2: Start all Kafka services together
- name: Start Kafka Cluster
  hosts: kafka
  become: false
  tasks:
    - name: Start Kafka service
      ansible.builtin.systemd:
        name: kafka
        state: started
    - name: Wait for controller port (9093)
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: 9093
        delay: 5
        timeout: 60
        state: started

# Phase 3: Verify the cluster is healthy
- name: Verify Kafka Cluster
  hosts: kafka
  become: false
  serial: 1
  tasks:
    - name: Wait for Kafka broker port (9092)
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: "{{ kafka_client_port | default(9092) }}"
        delay: 10
        timeout: 120
        state: started
    - name: Verify Kafka is running
      ansible.builtin.command: "{{ kafka_install_dir | default('/opt/kafka') }}/bin/kafka-broker-api-versions.sh --bootstrap-server {{ ansible_host }}:{{ kafka_client_port | default(9092) }}"
      register: kafka_verify
      changed_when: false
      retries: 5
      delay: 10
      until: kafka_verify.rc == 0
      ignore_errors: yes
    - name: Display deployment status
      ansible.builtin.debug:
        msg: |
          ========================================
          Kafka deployed successfully on {{ inventory_hostname }}
          Bootstrap Server: {{ ansible_host }}:9092
          JMX Port: 9999
          ========================================
--- /home/runner/kafka-ansible/playbooks/enable-jmx-exporter.yml ---
---
# Enable JMX Exporter for Prometheus monitoring
# Usage: ansible-playbook -i inventories/sit/hosts.yml playbooks/enable-jmx-exporter.yml
- name: Enable JMX Prometheus Exporter
  hosts: kafka
  serial: 1
  vars:
    jmx_exporter_version: "1.0.1"
    jmx_exporter_port: 9308
    kafka_install_dir: "/opt/kafka"
  tasks:
    - name: Create metrics directory
      ansible.builtin.file:
        path: "{{ kafka_install_dir }}/metrics"
        state: directory
        mode: "0755"
    - name: Download JMX Prometheus agent
      ansible.builtin.get_url:
        url: "https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/{{ jmx_exporter_version }}/jmx_prometheus_javaagent-{{ jmx_exporter_version }}.jar"
        dest: "{{ kafka_install_dir }}/libs/jmx_prometheus_javaagent.jar"
        mode: "0644"
    - name: Deploy JMX exporter configuration
      ansible.builtin.copy:
        dest: "{{ kafka_install_dir }}/metrics/jmx-exporter.yml"
        mode: "0644"
        content: |
          lowercaseOutputName: true
          lowercaseOutputLabelNames: true
          rules:
            # Kafka server metrics
            - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
              name: kafka_server_$1_$2
              type: GAUGE
              labels:
                clientId: "$3"
                topic: "$4"
                partition: "$5"
            - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), brokerHost=(.+), brokerPort=(.+)><>Value
              name: kafka_server_$1_$2
              type: GAUGE
              labels:
                clientId: "$3"
                broker: "$4:$5"
            - pattern: kafka.server<type=(.+), name=(.+)><>Value
              name: kafka_server_$1_$2
              type: GAUGE
            - pattern: kafka.server<type=(.+), name=(.+)><>Count
              name: kafka_server_$1_$2_total
              type: COUNTER
            # Kafka network metrics
            - pattern: kafka.network<type=(.+), name=(.+), request=(.+), error=(.+)><>Count
              name: kafka_network_$1_$2_total
              type: COUNTER
              labels:
                request: "$3"
                error: "$4"
            - pattern: kafka.network<type=(.+), name=(.+), request=(.+)><>Count
              name: kafka_network_$1_$2_total
              type: COUNTER
              labels:
                request: "$3"
            - pattern: kafka.network<type=(.+), name=(.+)><>Value
              name: kafka_network_$1_$2
              type: GAUGE
            # Kafka controller metrics
            - pattern: kafka.controller<type=(.+), name=(.+)><>Value
              name: kafka_controller_$1_$2
              type: GAUGE
            - pattern: kafka.controller<type=(.+), name=(.+)><>Count
              name: kafka_controller_$1_$2_total
              type: COUNTER
            # KRaft Raft metrics
            - pattern: kafka.raft<type=(.+), name=(.+)><>Value
              name: kafka_raft_$1_$2
              type: GAUGE
            - pattern: kafka.raft<type=(.+), name=(.+)><>Count
              name: kafka_raft_$1_$2_total
              type: COUNTER
            # JVM metrics
            - pattern: java.lang<type=Memory><HeapMemoryUsage>(\w+)
              name: jvm_heap_memory_$1_bytes
              type: GAUGE
            - pattern: java.lang<type=Memory><NonHeapMemoryUsage>(\w+)
              name: jvm_nonheap_memory_$1_bytes
              type: GAUGE
            - pattern: java.lang<type=GarbageCollector, name=(.+)><CollectionCount>
              name: jvm_gc_collection_count_total
              type: COUNTER
              labels:
                gc: "$1"
            - pattern: java.lang<type=GarbageCollector, name=(.+)><CollectionTime>
              name: jvm_gc_collection_time_ms_total
              type: COUNTER
              labels:
                gc: "$1"
            - pattern: java.lang<type=Threading><ThreadCount>
              name: jvm_thread_count
              type: GAUGE
    - name: Enable JMX exporter in systemd service
      ansible.builtin.lineinfile:
        path: /etc/systemd/system/kafka.service
        regexp: '^#?Environment="KAFKA_OPTS=-javaagent'
        line: 'Environment="KAFKA_OPTS=-javaagent:{{ kafka_install_dir }}/libs/jmx_prometheus_javaagent.jar={{ jmx_exporter_port }}:{{ kafka_install_dir }}/metrics/jmx-exporter.yml"'
        insertafter: 'Environment="KAFKA_JMX_OPTS'
    - name: Reload systemd
      ansible.builtin.systemd:
        daemon_reload: yes
    - name: Restart Kafka
      ansible.builtin.systemd:
        name: kafka
        state: restarted
    - name: Wait for Kafka to be ready
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: 9092
        delay: 10
        timeout: 120
    - name: Wait for JMX exporter to be ready
      ansible.builtin.wait_for:
        host: "{{ ansible_host }}"
        port: "{{ jmx_exporter_port }}"
        delay: 5
        timeout: 60
    - name: Verify JMX exporter metrics
      ansible.builtin.uri:
        url: "http://{{ ansible_host }}:{{ jmx_exporter_port }}/metrics"
        return_content: no
        status_code: 200
      register: metrics_check
    - name: Display status
      ansible.builtin.debug:
        msg: "JMX Exporter enabled on {{ inventory_hostname }} - http://{{ ansible_host }}:{{ jmx_exporter_port }}/metrics"
--- /home/runner/kafka-ansible/playbooks/maintenance.yml ---
---
# Kafka Maintenance Playbook
# Usage:
# ansible-playbook -i inventories/sit/hosts.yml playbooks/maintenance.yml -e task=<task_name>
#
# Available tasks:
# - cleanup_logs: Clean old log files
# - backup_config: Backup Kafka configuration
# - check_health: Comprehensive health check
# - list_topics: List all topics
# - describe_cluster: Show cluster details
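#
# Per-host tasks can also be scoped to a single node, e.g. (hypothetical host name):
# ansible-playbook -i inventories/sit/hosts.yml playbooks/maintenance.yml -e task=check_health --limit kafka-sit-1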
- name: Kafka Maintenance Tasks
hosts: kafka
become: false
vars:
task: "check_health" # Default task
tasks:
# Cleanup old logs
- name: Cleanup old Kafka logs
ansible.builtin.find:
paths: /data/kafka/logs
age: "7d"
recurse: yes
file_type: file
register: old_logs
when: task == "cleanup_logs"
- name: Remove old log files
ansible.builtin.file:
path: "{{ item.path }}"
state: absent
loop: "{{ old_logs.files }}"
when: task == "cleanup_logs" and old_logs.files | length > 0
- name: Display cleanup result
ansible.builtin.debug:
msg: "Cleaned up {{ old_logs.files | length }} old log files on {{ inventory_hostname }}"
when: task == "cleanup_logs"
# Backup configuration
- name: Create backup directory
ansible.builtin.file:
path: "/data/kafka/backups"
state: directory
owner: kafka
group: kafka
mode: "0755"
when: task == "backup_config"
- name: Backup Kafka configuration
ansible.builtin.archive:
path: /opt/kafka/config
dest: "/data/kafka/backups/kafka_config_{{ ansible_date_time.date }}.tar.gz"
owner: kafka
group: kafka
when: task == "backup_config"
- name: Display backup result
ansible.builtin.debug:
msg: "Configuration backed up to /data/kafka/backups/kafka_config_{{ ansible_date_time.date }}.tar.gz"
when: task == "backup_config"
# Health check
- name: Check Kafka process
ansible.builtin.shell: pgrep -f kafka.Kafka
register: kafka_pid
changed_when: false
failed_when: false
when: task == "check_health"
- name: Check Kafka port
ansible.builtin.wait_for:
host: "{{ ansible_host }}"
port: 9092
timeout: 5
register: port_check
ignore_errors: yes
when: task == "check_health"
- name: Check disk space
ansible.builtin.shell: df -h /data/kafka | tail -1 | awk '{print $5}' | sed 's/%//'
register: disk_usage
changed_when: false
when: task == "check_health"
- name: Check memory usage
ansible.builtin.shell: free -m | grep Mem | awk '{print int($3/$2*100)}'
register: memory_usage
changed_when: false
when: task == "check_health"
- name: Display health status
ansible.builtin.debug:
msg: |
Health Check for {{ inventory_hostname }}:
- Process: {{ 'Running (PID: ' + kafka_pid.stdout + ')' if kafka_pid.rc == 0 else 'NOT RUNNING' }}
- Port 9092: {{ 'OK' if port_check is succeeded else 'FAILED' }}
- Disk Usage: {{ disk_usage.stdout }}%
- Memory Usage: {{ memory_usage.stdout }}%
- Status: {{ 'HEALTHY' if kafka_pid.rc == 0 and port_check is succeeded and disk_usage.stdout|int < 85 else 'WARNING' }}
when: task == "check_health"
# Cluster-wide tasks (run on first node only)
- name: Cluster-wide Maintenance Tasks
hosts: kafka[0]
become: false
run_once: true
vars:
task: "check_health"
tasks:
# List topics
- name: List all topics
ansible.builtin.shell: /opt/kafka/bin/kafka-topics.sh --bootstrap-server {{ ansible_host }}:9092 --list
register: topics
changed_when: false
when: task == "list_topics"
- name: Display topics
ansible.builtin.debug:
msg: |
Topics in cluster:
{{ topics.stdout }}
when: task == "list_topics"
# Describe cluster
- name: Get cluster info
ansible.builtin.shell: /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server {{ ansible_host }}:9092 2>/dev/null | head -20
register: cluster_info
changed_when: false
ignore_errors: yes
when: task == "describe_cluster"
- name: Display cluster info
ansible.builtin.debug:
msg: |
Cluster Information:
{{ cluster_info.stdout }}
when: task == "describe_cluster"
--- /home/runner/kafka-ansible/playbooks/restart.yml ---
---
# Restart Kafka Service Playbook (Rolling Restart)
# Usage:
# All nodes: ansible-playbook -i inventories/sit/hosts.yml playbooks/restart.yml
# Single node: ansible-playbook -i inventories/sit/hosts.yml playbooks/restart.yml --limit kafka-sit-1
- name: Rolling Restart Kafka Cluster
hosts: kafka
become: false
serial: 1 # Restart one node at a time
tasks:
- name: Display restart information
ansible.builtin.debug:
msg: "Starting rolling restart on {{ inventory_hostname }} ({{ ansible_host }})"
- name: Stop Kafka service
ansible.builtin.systemd:
name: kafka
state: stopped
- name: Wait for Kafka to stop
ansible.builtin.wait_for:
host: "{{ ansible_host }}"
port: 9092
state: stopped
timeout: 60
- name: Pause before restart
ansible.builtin.pause:
seconds: 5
- name: Start Kafka service
ansible.builtin.systemd:
name: kafka
state: started
- name: Wait for Kafka to be ready
ansible.builtin.wait_for:
host: "{{ ansible_host }}"
port: 9092
delay: 10
timeout: 120
state: started
- name: Verify Kafka is running
ansible.builtin.command: /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server {{ ansible_host }}:9092
register: kafka_verify
retries: 5
delay: 10
until: kafka_verify.rc == 0
changed_when: false
- name: Wait for cluster stabilization
ansible.builtin.pause:
seconds: 30
when: groups['kafka'] | length > 1
- name: Display restart status
ansible.builtin.debug:
msg: "Kafka restarted successfully on {{ inventory_hostname }}"
--- /home/runner/kafka-ansible/playbooks/start.yml ---
---
# Start Kafka Service Playbook
# Usage:
# All nodes: ansible-playbook -i inventories/sit/hosts.yml playbooks/start.yml
# Single node: ansible-playbook -i inventories/sit/hosts.yml playbooks/start.yml --limit kafka-sit-1
- name: Start Kafka Cluster
hosts: kafka
become: false
serial: 1 # Start one node at a time
tasks:
- name: Start Kafka service
ansible.builtin.systemd:
name: kafka
state: started
enabled: yes
- name: Wait for Kafka to be ready
ansible.builtin.wait_for:
host: "{{ ansible_host }}"
port: 9092
delay: 5
timeout: 120
state: started
- name: Verify Kafka is running
ansible.builtin.command: systemctl status kafka
register: kafka_status
changed_when: false
- name: Display Kafka status
ansible.builtin.debug:
msg: "Kafka is running on {{ inventory_hostname }} ({{ ansible_host }})"
when: kafka_status.rc == 0
--- /home/runner/kafka-ansible/playbooks/status.yml ---
---
# Check Kafka Cluster Status Playbook
# Usage:
# ansible-playbook -i inventories/sit/hosts.yml playbooks/status.yml
- name: Check Kafka Cluster Status
hosts: kafka
become: false
gather_facts: yes
tasks:
- name: Check Kafka service status
ansible.builtin.systemd:
name: kafka
register: kafka_service
- name: Check Kafka port
ansible.builtin.wait_for:
host: "{{ ansible_host }}"
port: 9092
timeout: 5
state: started
register: port_check
ignore_errors: yes
- name: Check JMX port
ansible.builtin.wait_for:
host: "{{ ansible_host }}"
port: 9999
timeout: 5
state: started
register: jmx_check
ignore_errors: yes
- name: Get Kafka process info
ansible.builtin.shell: ps aux | grep -v grep | grep kafka.Kafka || true
register: kafka_process
changed_when: false
- name: Get disk usage
ansible.builtin.shell: df -h /data/kafka
register: disk_usage
changed_when: false
- name: Get Kafka data directory size
ansible.builtin.shell: du -sh /data/kafka/data 2>/dev/null || echo "N/A"
register: data_size
changed_when: false
- name: Display node status
ansible.builtin.debug:
msg: |
========================================
Node: {{ inventory_hostname }} ({{ ansible_host }})
========================================
Service Status: {{ 'RUNNING' if kafka_service.status.ActiveState == 'active' else 'STOPPED' }}
Port 9092: {{ 'OPEN' if port_check is succeeded else 'CLOSED' }}
JMX Port 9999: {{ 'OPEN' if jmx_check is succeeded else 'CLOSED' }}
Data Size: {{ data_size.stdout }}
----------------------------------------
Disk Usage:
{{ disk_usage.stdout }}
========================================
- name: Cluster Summary
hosts: kafka[0]
become: false
run_once: true
tasks:
- name: Check cluster metadata
ansible.builtin.shell: |
/opt/kafka/bin/kafka-metadata-quorum.sh --bootstrap-server {{ ansible_host }}:9092 describe --status 2>/dev/null || echo "Metadata check skipped"
register: cluster_metadata
changed_when: false
ignore_errors: yes
- name: List topics
ansible.builtin.shell: |
/opt/kafka/bin/kafka-topics.sh --bootstrap-server {{ ansible_host }}:9092 --list 2>/dev/null || echo "No topics or cluster not ready"
register: topics_list
changed_when: false
ignore_errors: yes
- name: Display cluster summary
ansible.builtin.debug:
msg: |
========================================
CLUSTER SUMMARY
========================================
Cluster ID: {{ kafka_cluster_id | default('N/A') }}
Environment: {{ environment_name | default('unknown') }}
Total Nodes: {{ groups['kafka'] | length }}
Topics:
{{ topics_list.stdout }}
========================================
--- /home/runner/kafka-ansible/playbooks/stop.yml ---
---
# Stop Kafka Service Playbook
# Usage:
# All nodes: ansible-playbook -i inventories/sit/hosts.yml playbooks/stop.yml
# Single node: ansible-playbook -i inventories/sit/hosts.yml playbooks/stop.yml --limit kafka-sit-1
- name: Stop Kafka Cluster
hosts: kafka
become: false
serial: 1 # Stop one node at a time for graceful shutdown
tasks:
- name: Display stop warning
ansible.builtin.debug:
msg: "Stopping Kafka on {{ inventory_hostname }} ({{ ansible_host }})"
- name: Stop Kafka service gracefully
ansible.builtin.systemd:
name: kafka
state: stopped
- name: Wait for Kafka to stop
ansible.builtin.wait_for:
host: "{{ ansible_host }}"
port: 9092
state: stopped
timeout: 60
- name: Verify Kafka is stopped
ansible.builtin.command: systemctl status kafka
register: kafka_status
failed_when: false
changed_when: false
- name: Display stop status
ansible.builtin.debug:
msg: "Kafka stopped successfully on {{ inventory_hostname }}"
when: kafka_status.rc != 0
--- /home/runner/kafka-ansible/playbooks/topic-management.yml ---
---
# Kafka Topic Management Playbook
# Usage:
# Create topic:
# ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
# -e "action=create topic_name=my-topic partitions=3 replication_factor=2"
#
# Delete topic:
# ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
# -e "action=delete topic_name=my-topic"
#
# Describe topic:
# ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
# -e "action=describe topic_name=my-topic"
#
# List topics:
# ansible-playbook -i inventories/sit/hosts.yml playbooks/topic-management.yml \
# -e "action=list"
- name: Kafka Topic Management
hosts: kafka[0]
become: false
run_once: true
vars:
action: "list"
topic_name: ""
partitions: 3
replication_factor: 2
retention_ms: 86400000 # 24 hours
bootstrap_server: "{{ ansible_host }}:9092"
tasks:
- name: Validate topic_name for create/delete/describe
ansible.builtin.fail:
msg: "topic_name is required for {{ action }} action"
when: action in ['create', 'delete', 'describe'] and topic_name == ""
# List topics
- name: List all topics
ansible.builtin.shell: |
/opt/kafka/bin/kafka-topics.sh --bootstrap-server {{ bootstrap_server }} --list
register: topics_list
when: action == "list"
changed_when: false
- name: Display topics
ansible.builtin.debug:
msg: |
========================================
Topics in cluster:
========================================
{{ topics_list.stdout }}
when: action == "list"
# Create topic
- name: Create topic
ansible.builtin.shell: |
/opt/kafka/bin/kafka-topics.sh --bootstrap-server {{ bootstrap_server }} \
--create \
--topic {{ topic_name }} \
--partitions {{ partitions }} \
--replication-factor {{ replication_factor }} \
--config retention.ms={{ retention_ms }} \
--if-not-exists
register: create_result
when: action == "create"
- name: Display create result
ansible.builtin.debug:
msg: |
Topic '{{ topic_name }}' created successfully
Partitions: {{ partitions }}
Replication Factor: {{ replication_factor }}
Retention: {{ retention_ms }}ms
when: action == "create" and create_result is succeeded
# Delete topic
- name: Confirm delete topic
ansible.builtin.pause:
prompt: "Are you sure you want to delete topic '{{ topic_name }}'? (yes/no)"
register: confirm_delete
when: action == "delete"
- name: Delete topic
ansible.builtin.shell: |
/opt/kafka/bin/kafka-topics.sh --bootstrap-server {{ bootstrap_server }} \
--delete \
--topic {{ topic_name }}
register: delete_result
when: action == "delete" and confirm_delete.user_input == "yes"
- name: Display delete result
ansible.builtin.debug:
msg: "Topic '{{ topic_name }}' deleted successfully"
when: action == "delete" and delete_result is succeeded
# Describe topic
- name: Describe topic
ansible.builtin.shell: |
/opt/kafka/bin/kafka-topics.sh --bootstrap-server {{ bootstrap_server }} \
--describe \
--topic {{ topic_name }}
register: describe_result
when: action == "describe"
changed_when: false
- name: Display topic description
ansible.builtin.debug:
msg: |
========================================
Topic Details: {{ topic_name }}
========================================
{{ describe_result.stdout }}
when: action == "describe"
--- /home/runner/kafka-ansible/playbooks/upgrade.yml ---
---
# Kafka Rolling Upgrade Playbook
# Usage:
# ansible-playbook -i inventories/sit/hosts.yml playbooks/upgrade.yml -e kafka_version=3.10.0
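# The Scala build can be overridden if needed (defaults to 2.13):
# ansible-playbook -i inventories/sit/hosts.yml playbooks/upgrade.yml -e kafka_version=3.9.1 -e kafka_scala_version=2.13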
- name: Rolling Upgrade Kafka Cluster
hosts: kafka
become: false
serial: 1 # Upgrade one node at a time
vars_prompt:
- name: confirm_upgrade
prompt: "Are you sure you want to upgrade Kafka? (yes/no)"
default: "no"
private: no
pre_tasks:
- name: Abort if not confirmed
ansible.builtin.fail:
msg: "Upgrade cancelled by user"
when: confirm_upgrade != "yes"
- name: Display upgrade information
ansible.builtin.debug:
msg: |
========================================
Upgrading Kafka on: {{ inventory_hostname }}
Current Version: Check /opt/kafka/libs/
Target Version: {{ kafka_version }}
========================================
tasks:
- name: Stop Kafka service
ansible.builtin.systemd:
name: kafka
state: stopped
- name: Wait for Kafka to stop
ansible.builtin.wait_for:
host: "{{ ansible_host }}"
port: 9092
state: stopped
timeout: 60
- name: Backup current Kafka installation
ansible.builtin.shell: |
# Take a fresh timestamped backup on every upgrade run; -L dereferences the /opt/kafka symlink
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
if [ -L /opt/kafka ]; then
cp -rL /opt/kafka "/opt/kafka_backup_${TIMESTAMP}"
fi
ignore_errors: yes
# downloads.apache.org hosts only current releases; older versions move to archive.apache.org/dist/kafka
- name: Download new Kafka version
ansible.builtin.get_url:
url: "https://downloads.apache.org/kafka/{{ kafka_version }}/kafka_{{ kafka_scala_version | default('2.13') }}-{{ kafka_version }}.tgz"
dest: "/tmp/kafka_{{ kafka_scala_version | default('2.13') }}-{{ kafka_version }}.tgz"
mode: "0644"
- name: Extract new Kafka version
ansible.builtin.unarchive:
src: "/tmp/kafka_{{ kafka_scala_version | default('2.13') }}-{{ kafka_version }}.tgz"
dest: "/opt"
remote_src: yes
owner: kafka
group: kafka
- name: Update symbolic link
ansible.builtin.file:
src: "/opt/kafka_{{ kafka_scala_version | default('2.13') }}-{{ kafka_version }}"
dest: "/opt/kafka"
state: link
owner: kafka
group: kafka
force: yes
# ansible.builtin.copy does not expand wildcards in src, so restore from the newest backup with a shell copy
- name: Copy configuration to new version
ansible.builtin.shell: |
LATEST_BACKUP=$(ls -td /opt/kafka_backup_* 2>/dev/null | head -1)
if [ -n "$LATEST_BACKUP" ]; then
cp -r "$LATEST_BACKUP"/config/. /opt/kafka/config/
chown -R kafka:kafka /opt/kafka/config/
fi
ignore_errors: yes
- name: Start Kafka service
ansible.builtin.systemd:
name: kafka
state: started
- name: Wait for Kafka to be ready
ansible.builtin.wait_for:
host: "{{ ansible_host }}"
port: 9092
delay: 15
timeout: 180
state: started
- name: Verify Kafka is running
ansible.builtin.command: /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server {{ ansible_host }}:9092
register: kafka_verify
retries: 5
delay: 15
until: kafka_verify.rc == 0
changed_when: false
- name: Wait for cluster stabilization
ansible.builtin.pause:
seconds: 60
- name: Display upgrade status
ansible.builtin.debug:
msg: "Kafka upgraded successfully on {{ inventory_hostname }}"
--- /home/runner/kafka-ansible/roles/kafka/defaults/main.yml ---
---
# Kafka Version and Download
kafka_version: "3.9.0"
kafka_scala_version: "2.13"
kafka_download_url: "https://downloads.apache.org/kafka/{{ kafka_version }}/kafka_{{ kafka_scala_version }}-{{ kafka_version }}.tgz"
kafka_checksum: "" # Optional: Add SHA512 checksum for verification
# Directory Configuration
kafka_install_dir: "/opt/kafka"
kafka_data_dir: "/data/kafka/data"
kafka_log_dir: "/data/kafka/logs"
kafka_config_dir: "/opt/kafka/config"
# User and Group (using root as machines are pre-initialized)
kafka_user: "root"
kafka_group: "root"
# Network Configuration
kafka_client_port: 9092
kafka_controller_port: 9093
kafka_jmx_port: 9999
kafka_exporter_port: 9308
# Listener Configuration
kafka_listeners: "PLAINTEXT://:{{ kafka_client_port }},CONTROLLER://:{{ kafka_controller_port }}"
kafka_advertised_listeners: "PLAINTEXT://{{ ansible_host }}:{{ kafka_client_port }}"
kafka_listener_security_protocol_map: "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT"
kafka_inter_broker_listener_name: "PLAINTEXT"
kafka_controller_listener_names: "CONTROLLER"
# JVM Configuration (Based on 8G memory - heap set to 4G, ~50% of RAM)
kafka_heap_size: "4g"
kafka_min_heap_size: "4g"
kafka_jvm_opts: >-
-Xmx{{ kafka_heap_size }}
-Xms{{ kafka_min_heap_size }}
-XX:+UseG1GC
-XX:MaxGCPauseMillis=20
-XX:InitiatingHeapOccupancyPercent=35
-XX:+DisableExplicitGC
-XX:+ParallelRefProcEnabled
-Djava.awt.headless=true
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port={{ kafka_jmx_port }}
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
# Network and Connection Configuration
kafka_num_network_threads: 2
kafka_num_io_threads: 8
kafka_socket_send_buffer_bytes: -1
kafka_socket_receive_buffer_bytes: -1
kafka_socket_request_max_bytes: 157286400
kafka_queued_max_requests: 1000
# Replication Configuration
kafka_unclean_leader_election_enable: "false"
kafka_replica_lag_time_max_ms: 45000
kafka_replica_fetch_max_bytes: 16777216
# Topic Default Configuration
kafka_num_partitions: 3
kafka_default_replication_factor: 2
# Note: with a replication factor of 2, min.insync.replicas=2 means acks=all
# producers stall as soon as one replica of a partition goes offline
kafka_min_insync_replicas: 2
kafka_offsets_topic_replication_factor: 3
kafka_transaction_state_log_replication_factor: 3
kafka_transaction_state_log_min_isr: 2
# Performance Configuration
kafka_num_recovery_threads_per_data_dir: 1
kafka_log_retention_check_interval_ms: 1800000
# Partition Configuration
kafka_auto_leader_rebalance_enable: "false"
kafka_group_initial_rebalance_delay_ms: 5000
kafka_compression_type: "producer"
# Log Retention Configuration
kafka_log_retention_hours: 24
kafka_log_segment_bytes: 268435456
kafka_log_cleanup_policy: "delete"
# Operation Safety Configuration
kafka_auto_create_topics_enable: "false"
kafka_delete_topic_enable: "true"
kafka_controlled_shutdown_enable: "true"
# KRaft Mode Configuration (Kafka 3.x)
kafka_process_roles: "broker,controller"
kafka_controller_quorum_voters: "" # Will be dynamically generated
# Metrics Configuration
kafka_jmx_enabled: false
kafka_exporter_enabled: false
# Java Configuration
java_home: "/opt/java"
# Systemd Configuration
kafka_service_name: "kafka"
kafka_service_restart_policy: "on-failure"
kafka_service_restart_sec: 10
--- /home/runner/kafka-ansible/roles/kafka/handlers/main.yml ---
---
# Handlers for Kafka role
- name: reload systemd
ansible.builtin.systemd:
daemon_reload: yes
- name: restart kafka
ansible.builtin.systemd:
name: kafka
state: restarted
listen: "restart kafka"
- name: stop kafka
ansible.builtin.systemd:
name: kafka
state: stopped
listen: "stop kafka"
- name: start kafka
ansible.builtin.systemd:
name: kafka
state: started
listen: "start kafka"
--- /home/runner/kafka-ansible/roles/kafka/meta/main.yml ---
---
galaxy_info:
author: DevOps Team
description: Ansible role for deploying Apache Kafka in KRaft mode
company: LTP
license: MIT
min_ansible_version: "2.10"
platforms:
- name: Amazon
versions:
- "2023"
- "2"
- name: EL
versions:
- "8"
- "9"
galaxy_tags:
- kafka
- streaming
- messaging
- kraft
dependencies: []
--- /home/runner/kafka-ansible/roles/kafka/tasks/configure.yml ---
---
# Kafka configuration tasks
- name: Generate cluster ID (only on first node)
ansible.builtin.shell: |
{{ kafka_install_dir }}/bin/kafka-storage.sh random-uuid
register: kafka_cluster_uuid
run_once: true
when: kafka_cluster_id is not defined or kafka_cluster_id == ""
changed_when: false
- name: Set cluster ID fact
ansible.builtin.set_fact:
kafka_final_cluster_id: "{{ kafka_cluster_id | default(kafka_cluster_uuid.stdout) }}"
run_once: true
- name: Share cluster ID with all nodes
ansible.builtin.set_fact:
kafka_final_cluster_id: "{{ hostvars[groups['kafka'][0]]['kafka_final_cluster_id'] }}"
- name: Display cluster ID
ansible.builtin.debug:
msg: "Kafka Cluster ID: {{ kafka_final_cluster_id }}"
- name: Deploy Kafka KRaft configuration
ansible.builtin.template:
src: server.properties.j2
dest: "{{ kafka_config_dir }}/kraft/server.properties"
owner: "{{ kafka_user }}"
group: "{{ kafka_group }}"
mode: "{{ kafka_file_mode }}"
backup: yes
notify: restart kafka
- name: Check if storage is formatted
ansible.builtin.stat:
path: "{{ kafka_data_dir }}/meta.properties"
register: kafka_storage_formatted
- name: Format Kafka storage (KRaft mode)
ansible.builtin.shell: |
{{ kafka_install_dir }}/bin/kafka-storage.sh format \
-t {{ kafka_final_cluster_id }} \
-c {{ kafka_config_dir }}/kraft/server.properties \
--ignore-formatted
when: not kafka_storage_formatted.stat.exists
register: format_result
changed_when: "'Formatting' in format_result.stdout"
- name: Deploy log4j configuration
ansible.builtin.template:
src: log4j.properties.j2
dest: "{{ kafka_config_dir }}/log4j.properties"
owner: "{{ kafka_user }}"
group: "{{ kafka_group }}"
mode: "{{ kafka_file_mode }}"
notify: restart kafka
--- /home/runner/kafka-ansible/roles/kafka/tasks/install.yml ---
---
# Kafka installation tasks
- name: Check if Kafka is already installed
ansible.builtin.stat:
path: "{{ kafka_install_dir }}/bin/kafka-server-start.sh"
register: kafka_installed
- name: Check if /opt/kafka is a directory (not symlink)
ansible.builtin.stat:
path: "{{ kafka_install_dir }}"
register: kafka_dir_check
- name: Remove /opt/kafka directory if it exists (will be replaced by symlink)
ansible.builtin.file:
path: "{{ kafka_install_dir }}"
state: absent
when:
- kafka_dir_check.stat.exists
- kafka_dir_check.stat.isdir
- not kafka_dir_check.stat.islnk
- name: Download Kafka
ansible.builtin.get_url:
url: "{{ kafka_download_url }}"
dest: "/tmp/kafka_{{ kafka_scala_version }}-{{ kafka_version }}.tgz"
mode: "0644"
checksum: "{{ kafka_checksum | default(omit) }}"
when: not kafka_installed.stat.exists
- name: Extract Kafka
ansible.builtin.unarchive:
src: "/tmp/kafka_{{ kafka_scala_version }}-{{ kafka_version }}.tgz"
dest: "/opt"
remote_src: yes
owner: "{{ kafka_user }}"
group: "{{ kafka_group }}"
when: not kafka_installed.stat.exists
- name: Create symbolic link to kafka
ansible.builtin.file:
src: "/opt/kafka_{{ kafka_scala_version }}-{{ kafka_version }}"
dest: "{{ kafka_install_dir }}"
state: link
owner: "{{ kafka_user }}"
group: "{{ kafka_group }}"
force: yes
- name: Set ownership of Kafka installation
ansible.builtin.file:
path: "/opt/kafka_{{ kafka_scala_version }}-{{ kafka_version }}"
owner: "{{ kafka_user }}"
group: "{{ kafka_group }}"
recurse: yes
- name: Clean up downloaded archive
ansible.builtin.file:
path: "/tmp/kafka_{{ kafka_scala_version }}-{{ kafka_version }}.tgz"
state: absent
- name: Create Kafka environment file
ansible.builtin.template:
src: kafka.env.j2
dest: "{{ kafka_config_dir }}/kafka.env"
owner: "{{ kafka_user }}"
group: "{{ kafka_group }}"
mode: "{{ kafka_file_mode }}"
notify: restart kafka
--- /home/runner/kafka-ansible/roles/kafka/tasks/main.yml ---
---
# Main tasks file for kafka role
- name: Include prerequisite tasks
ansible.builtin.include_tasks: prerequisites.yml
tags:
- kafka
- prerequisites
- name: Include installation tasks
ansible.builtin.include_tasks: install.yml
tags:
- kafka
- install
- name: Include configuration tasks
ansible.builtin.include_tasks: configure.yml
tags:
- kafka
- configure
- name: Include metrics tasks
ansible.builtin.include_tasks: metrics.yml
when: kafka_jmx_enabled or kafka_exporter_enabled
tags:
- kafka
- metrics
- name: Include service tasks
ansible.builtin.include_tasks: service.yml
tags:
- kafka
- service
--- /home/runner/kafka-ansible/roles/kafka/tasks/metrics.yml ---
---
# Kafka metrics configuration tasks
- name: Create metrics directory
ansible.builtin.file:
path: "{{ kafka_install_dir }}/metrics"
state: directory
owner: "{{ kafka_user }}"
group: "{{ kafka_group }}"
mode: "{{ kafka_dir_mode }}"
- name: Deploy JMX exporter configuration
ansible.builtin.template:
src: jmx-exporter.yml.j2
dest: "{{ kafka_install_dir }}/metrics/jmx-exporter.yml"
owner: "{{ kafka_user }}"
group: "{{ kafka_group }}"
mode: "{{ kafka_file_mode }}"
when: kafka_jmx_enabled
notify: restart kafka
- name: Download JMX Prometheus agent (if needed)
ansible.builtin.get_url:
url: "https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/1.0.1/jmx_prometheus_javaagent-1.0.1.jar"
dest: "{{ kafka_install_dir }}/libs/jmx_prometheus_javaagent.jar"
owner: "{{ kafka_user }}"
group: "{{ kafka_group }}"
mode: "0644"
when: kafka_jmx_enabled
- name: Create Prometheus metrics endpoint info
ansible.builtin.debug:
msg: |
Kafka JMX metrics available at:
- JMX Port: {{ kafka_jmx_port }}
- Prometheus JMX Exporter: http://{{ ansible_host }}:{{ kafka_exporter_port }}/metrics
--- /home/runner/kafka-ansible/roles/kafka/tasks/prerequisites.yml ---
---
# Prerequisites for Kafka installation
# Note: System initialization (packages, users, sysctl) already done manually
- name: Verify Java installation
ansible.builtin.command: java -version
register: java_version_check
changed_when: false
- name: Display Java version
ansible.builtin.debug:
msg: "{{ java_version_check.stderr_lines }}"
- name: Create Kafka directories
ansible.builtin.file:
path: "{{ item }}"
state: directory
owner: "{{ kafka_user }}"
group: "{{ kafka_group }}"
mode: "{{ kafka_dir_mode }}"
loop: "{{ kafka_directories }}"
--- /home/runner/kafka-ansible/roles/kafka/tasks/service.yml ---
---
# Kafka service configuration tasks
- name: Deploy Kafka systemd service file
ansible.builtin.template:
src: kafka.service.j2
dest: /etc/systemd/system/kafka.service
owner: root
group: root
mode: "0644"
notify:
- reload systemd
- restart kafka
- name: Reload systemd daemon
ansible.builtin.systemd:
daemon_reload: yes
- name: Enable Kafka service
ansible.builtin.systemd:
name: kafka
enabled: yes
# Skip start/verify when kafka_skip_start is true (for cluster deployment)
- name: Start Kafka service
ansible.builtin.systemd:
name: kafka
state: started
register: kafka_service_start
when: not (kafka_skip_start | default(false))
- name: Wait for Kafka to be ready
ansible.builtin.wait_for:
host: "{{ ansible_host }}"
port: "{{ kafka_client_port }}"
delay: 10
timeout: 120
state: started
when:
- not (kafka_skip_start | default(false))
- kafka_service_start.changed | default(false)
- name: Verify Kafka is running
ansible.builtin.command: "{{ kafka_install_dir }}/bin/kafka-broker-api-versions.sh --bootstrap-server {{ ansible_host }}:{{ kafka_client_port }}"
register: kafka_verify
changed_when: false
retries: 5
delay: 10
until: kafka_verify.rc == 0
ignore_errors: yes
when: not (kafka_skip_start | default(false))
--- /home/runner/kafka-ansible/roles/kafka/templates/jmx-exporter.yml.j2 ---
# {{ ansible_managed }}
# JMX Exporter Configuration for Kafka
# Prometheus metrics available at: http://{{ ansible_host }}:{{ kafka_exporter_port }}/metrics
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
# Kafka server metrics
- pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
name: kafka_server_$1_$2
type: GAUGE
labels:
clientId: "$3"
topic: "$4"
partition: "$5"
- pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), brokerHost=(.+), brokerPort=(.+)><>Value
name: kafka_server_$1_$2
type: GAUGE
labels:
clientId: "$3"
broker: "$4:$5"
- pattern: kafka.server<type=(.+), name=(.+)><>Value
name: kafka_server_$1_$2
type: GAUGE
- pattern: kafka.server<type=(.+), name=(.+)><>Count
name: kafka_server_$1_$2_total
type: COUNTER
# Kafka network metrics
- pattern: kafka.network<type=(.+), name=(.+), request=(.+), error=(.+)><>Count
name: kafka_network_$1_$2_total
type: COUNTER
labels:
request: "$3"
error: "$4"
- pattern: kafka.network<type=(.+), name=(.+), request=(.+)><>Count
name: kafka_network_$1_$2_total
type: COUNTER
labels:
request: "$3"
- pattern: kafka.network<type=(.+), name=(.+)><>Value
name: kafka_network_$1_$2
type: GAUGE
# Kafka log metrics
- pattern: kafka.log<type=(.+), name=(.+), topic=(.+), partition=(.+)><>Value
name: kafka_log_$1_$2
type: GAUGE
labels:
topic: "$3"
partition: "$4"
# Kafka controller metrics
- pattern: kafka.controller<type=(.+), name=(.+)><>Value
name: kafka_controller_$1_$2
type: GAUGE
- pattern: kafka.controller<type=(.+), name=(.+)><>Count
name: kafka_controller_$1_$2_total
type: COUNTER
# KRaft Raft metrics
- pattern: kafka.raft<type=(.+), name=(.+)><>Value
name: kafka_raft_$1_$2
type: GAUGE
- pattern: kafka.raft<type=(.+), name=(.+)><>Count
name: kafka_raft_$1_$2_total
type: COUNTER
# JVM metrics
- pattern: java.lang<type=Memory><HeapMemoryUsage>(\w+)
name: jvm_heap_memory_$1_bytes
type: GAUGE
- pattern: java.lang<type=Memory><NonHeapMemoryUsage>(\w+)
name: jvm_nonheap_memory_$1_bytes
type: GAUGE
- pattern: java.lang<type=GarbageCollector, name=(.+)><>CollectionCount
name: jvm_gc_collection_count_total
type: COUNTER
labels:
gc: "$1"
- pattern: java.lang<type=GarbageCollector, name=(.+)><>CollectionTime
name: jvm_gc_collection_time_ms_total
type: COUNTER
labels:
gc: "$1"
- pattern: java.lang<type=Threading><>ThreadCount
name: jvm_thread_count
type: GAUGE
# Operating system metrics
- pattern: java.lang<type=OperatingSystem><>(\w+)
name: jvm_os_$1
type: GAUGE
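# A Prometheus scrape job for these exporter endpoints could look like this
# (hypothetical target hosts; adjust to the real inventory):
#   scrape_configs:
#     - job_name: kafka
#       static_configs:
#         - targets: ["kafka-1:{{ kafka_exporter_port }}", "kafka-2:{{ kafka_exporter_port }}", "kafka-3:{{ kafka_exporter_port }}"]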
--- /home/runner/kafka-ansible/roles/kafka/templates/kafka.env.j2 ---
# {{ ansible_managed }}
# Kafka Environment Variables
# Environment: {{ environment_name | default('unknown') }}
# Java Home
JAVA_HOME={{ java_home | default('/opt/java') }}
# Kafka Home
KAFKA_HOME={{ kafka_install_dir }}
# Kafka Heap Options
KAFKA_HEAP_OPTS="-Xmx{{ kafka_heap_size }} -Xms{{ kafka_min_heap_size }}"
# Kafka JVM Performance Options
KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -XX:+ParallelRefProcEnabled"
# JMX Options
KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port={{ kafka_jmx_port }} -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname={{ ansible_host }}"
# Log4j Configuration
KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:{{ kafka_config_dir }}/log4j.properties"
# Kafka Log Directory
LOG_DIR={{ kafka_log_dir }}
# Node ID
KAFKA_NODE_ID={{ kafka_node_id }}
# Cluster ID
KAFKA_CLUSTER_ID={{ kafka_final_cluster_id | default('') }}
--- /home/runner/kafka-ansible/roles/kafka/templates/kafka.service.j2 ---
# {{ ansible_managed }}
# Kafka Systemd Service File
# Environment: {{ environment_name | default('unknown') }}
[Unit]
Description=Apache Kafka Server (KRaft Mode)
Documentation=https://kafka.apache.org/documentation/
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User={{ kafka_user }}
Group={{ kafka_group }}
# Environment variables
EnvironmentFile=-{{ kafka_config_dir }}/kafka.env
# JVM options
Environment="KAFKA_HEAP_OPTS=-Xmx{{ kafka_heap_size }} -Xms{{ kafka_min_heap_size }}"
Environment="KAFKA_JVM_PERFORMANCE_OPTS=-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -XX:+ParallelRefProcEnabled"
Environment="KAFKA_JMX_OPTS=-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port={{ kafka_jmx_port }} -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname={{ ansible_host }}"
{% if kafka_jmx_enabled %}
Environment="KAFKA_OPTS=-javaagent:{{ kafka_install_dir }}/libs/jmx_prometheus_javaagent.jar={{ kafka_exporter_port }}:{{ kafka_install_dir }}/metrics/jmx-exporter.yml"
{% endif %}
# Logging
Environment="KAFKA_LOG4J_OPTS=-Dlog4j.configuration=file:{{ kafka_config_dir }}/log4j.properties"
Environment="LOG_DIR={{ kafka_log_dir }}"
# Working directory
WorkingDirectory={{ kafka_install_dir }}
# Start command
ExecStart={{ kafka_install_dir }}/bin/kafka-server-start.sh {{ kafka_config_dir }}/kraft/server.properties
# Stop command
ExecStop={{ kafka_install_dir }}/bin/kafka-server-stop.sh
# Resource limits
LimitNOFILE=65536
LimitNPROC=65536
# Restart policy
Restart={{ kafka_service_restart_policy }}
RestartSec={{ kafka_service_restart_sec }}
# Timeouts
TimeoutStartSec=180
TimeoutStopSec=120
# Logging to journal
StandardOutput=journal
StandardError=journal
SyslogIdentifier=kafka
[Install]
WantedBy=multi-user.target
--- /home/runner/kafka-ansible/roles/kafka/templates/log4j.properties.j2 ---
# {{ ansible_managed }}
# Kafka Log4j Configuration
# Root logger option
log4j.rootLogger=INFO, stdout, kafkaAppender
# Console appender
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c)%n
# Kafka file appender
log4j.appender.kafkaAppender=org.apache.log4j.RollingFileAppender
log4j.appender.kafkaAppender.File={{ kafka_log_dir }}/server.log
log4j.appender.kafkaAppender.MaxFileSize=50MB
log4j.appender.kafkaAppender.MaxBackupIndex=3
log4j.appender.kafkaAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.kafkaAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
# State change logger
log4j.appender.stateChangeAppender=org.apache.log4j.RollingFileAppender
log4j.appender.stateChangeAppender.File={{ kafka_log_dir }}/state-change.log
log4j.appender.stateChangeAppender.MaxFileSize=20MB
log4j.appender.stateChangeAppender.MaxBackupIndex=2
log4j.appender.stateChangeAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.stateChangeAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
# Request logger
log4j.appender.requestAppender=org.apache.log4j.RollingFileAppender
log4j.appender.requestAppender.File={{ kafka_log_dir }}/kafka-request.log
log4j.appender.requestAppender.MaxFileSize=20MB
log4j.appender.requestAppender.MaxBackupIndex=2
log4j.appender.requestAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.requestAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
# Cleaner logger
log4j.appender.cleanerAppender=org.apache.log4j.RollingFileAppender
log4j.appender.cleanerAppender.File={{ kafka_log_dir }}/log-cleaner.log
log4j.appender.cleanerAppender.MaxFileSize=20MB
log4j.appender.cleanerAppender.MaxBackupIndex=2
log4j.appender.cleanerAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.cleanerAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
# Controller logger
log4j.appender.controllerAppender=org.apache.log4j.RollingFileAppender
log4j.appender.controllerAppender.File={{ kafka_log_dir }}/controller.log
log4j.appender.controllerAppender.MaxFileSize=50MB
log4j.appender.controllerAppender.MaxBackupIndex=3
log4j.appender.controllerAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.controllerAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
# Authorizer logger
log4j.appender.authorizerAppender=org.apache.log4j.RollingFileAppender
log4j.appender.authorizerAppender.File={{ kafka_log_dir }}/kafka-authorizer.log
log4j.appender.authorizerAppender.MaxFileSize=20MB
log4j.appender.authorizerAppender.MaxBackupIndex=2
log4j.appender.authorizerAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.authorizerAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
# Logger assignments
log4j.logger.kafka=INFO, kafkaAppender
log4j.logger.kafka.network.RequestChannel$=WARN, requestAppender
log4j.additivity.kafka.network.RequestChannel$=false
log4j.logger.kafka.request.logger=WARN, requestAppender
log4j.additivity.kafka.request.logger=false
log4j.logger.kafka.controller=TRACE, controllerAppender
log4j.additivity.kafka.controller=false
log4j.logger.kafka.log.LogCleaner=INFO, cleanerAppender
log4j.additivity.kafka.log.LogCleaner=false
log4j.logger.state.change.logger=INFO, stateChangeAppender
log4j.additivity.state.change.logger=false
log4j.logger.kafka.authorizer.logger=INFO, authorizerAppender
log4j.additivity.kafka.authorizer.logger=false
# Reduce noisy loggers
log4j.logger.org.apache.zookeeper=WARN
log4j.logger.org.apache.kafka=INFO
--- /home/runner/kafka-ansible/roles/kafka/templates/server.properties.j2 ---
# {{ ansible_managed }}
# Kafka KRaft Mode Configuration
# Environment: {{ environment_name | default('unknown') }}
# Generated on: {{ ansible_date_time.iso8601 }}
############################# Server Basics #############################
# The role of this server. Setting this puts us in KRaft mode
process.roles={{ kafka_process_roles }}
# The node id associated with this instance's roles
node.id={{ kafka_node_id }}
# The connect string for the controller quorum
controller.quorum.voters={{ kafka_quorum_voters_list | trim }}
############################# Socket Server Settings #############################
# The address the socket server listens on
listeners={{ kafka_listeners }}
# Listener name, hostname and port the broker will advertise to clients
advertised.listeners={{ kafka_advertised_listeners }}
# Maps listener names to security protocols
listener.security.protocol.map={{ kafka_listener_security_protocol_map }}
# Name of listener used for communication between brokers
inter.broker.listener.name={{ kafka_inter_broker_listener_name }}
# Name of controller listener
controller.listener.names={{ kafka_controller_listener_names }}
# Network threads
num.network.threads={{ kafka_num_network_threads }}
# IO threads
num.io.threads={{ kafka_num_io_threads }}
# Send buffer
socket.send.buffer.bytes={{ kafka_socket_send_buffer_bytes }}
# Receive buffer
socket.receive.buffer.bytes={{ kafka_socket_receive_buffer_bytes }}
# Maximum request size
socket.request.max.bytes={{ kafka_socket_request_max_bytes }}
# Maximum queued requests
queued.max.requests={{ kafka_queued_max_requests }}
############################# Log Basics #############################
# Log directories
log.dirs={{ kafka_data_dir }}
# Default number of partitions
num.partitions={{ kafka_num_partitions }}
# Number of threads for log recovery
num.recovery.threads.per.data.dir={{ kafka_num_recovery_threads_per_data_dir }}
############################# Replication Configuration #############################
# Default replication factor
default.replication.factor={{ kafka_default_replication_factor }}
# Minimum ISR
min.insync.replicas={{ kafka_min_insync_replicas }}
# Offsets topic replication factor
offsets.topic.replication.factor={{ kafka_offsets_topic_replication_factor }}
# Transaction state log replication factor
transaction.state.log.replication.factor={{ kafka_transaction_state_log_replication_factor }}
# Transaction state log min ISR
transaction.state.log.min.isr={{ kafka_transaction_state_log_min_isr }}
# Unclean leader election
unclean.leader.election.enable={{ kafka_unclean_leader_election_enable }}
# Replica lag time max
replica.lag.time.max.ms={{ kafka_replica_lag_time_max_ms }}
# Replica fetch max bytes
replica.fetch.max.bytes={{ kafka_replica_fetch_max_bytes }}
############################# Log Retention Policy #############################
# Log retention hours
log.retention.hours={{ kafka_log_retention_hours }}
# Log segment size
log.segment.bytes={{ kafka_log_segment_bytes }}
# Log retention check interval
log.retention.check.interval.ms={{ kafka_log_retention_check_interval_ms }}
# Log cleanup policy
log.cleanup.policy={{ kafka_log_cleanup_policy }}
############################# Group Coordinator Settings #############################
# Auto leader rebalance
auto.leader.rebalance.enable={{ kafka_auto_leader_rebalance_enable }}
# Group initial rebalance delay
group.initial.rebalance.delay.ms={{ kafka_group_initial_rebalance_delay_ms }}
############################# Compression #############################
# Compression type
compression.type={{ kafka_compression_type }}
############################# Topic Settings #############################
# Auto create topics
auto.create.topics.enable={{ kafka_auto_create_topics_enable }}
# Delete topic enable
delete.topic.enable={{ kafka_delete_topic_enable }}
############################# Shutdown Settings #############################
# Controlled shutdown
controlled.shutdown.enable={{ kafka_controlled_shutdown_enable }}
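One step the templates imply but do not show: in KRaft mode every node's data directory must be formatted once with a shared cluster ID before the service first starts (this is where the `KAFKA_CLUSTER_ID` in kafka.env comes from). A sketch with placeholder paths; `kafka-storage.sh` ships with the Kafka distribution:

```shell
# One-time KRaft formatting; run on each node with the SAME cluster ID.
KAFKA_HOME=/opt/kafka                              # placeholder install dir
CONFIG=/opt/kafka/config/kraft/server.properties   # rendered template

# Generate the ID once (on one node), then reuse it on all the others:
KAFKA_CLUSTER_ID="$("$KAFKA_HOME/bin/kafka-storage.sh" random-uuid)"
"$KAFKA_HOME/bin/kafka-storage.sh" format -t "$KAFKA_CLUSTER_ID" -c "$CONFIG"
```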
--- /home/runner/kafka-ansible/roles/kafka/vars/main.yml ---
---
# Internal variables - Do not modify unless necessary
# Generate controller quorum voters string dynamically
# Format: node_id@host:controller_port
kafka_quorum_voters_list: >-
{% set voters = [] %}
{% for host in groups['kafka'] %}
{% set node_id = hostvars[host]['kafka_node_id'] %}
{% set host_ip = hostvars[host]['ansible_host'] %}
{% set _ = voters.append(node_id | string + '@' + host_ip + ':' + kafka_controller_port | string) %}
{% endfor %}
{{ voters | join(',') }}
# Required packages for Kafka
kafka_required_packages:
- "{{ java_package }}"
- tar
- gzip
- wget
- curl
- net-tools
- nc
- jq
# Directories to create (kafka_install_dir excluded - it's a symlink)
kafka_directories:
- "{{ kafka_data_dir }}"
- "{{ kafka_log_dir }}"
- "/data/kafka"
# File permissions
kafka_dir_mode: "0755"
kafka_file_mode: "0644"
kafka_script_mode: "0755"
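The `kafka_quorum_voters_list` template above is the trickiest part of this file; a minimal Python re-implementation of the same loop makes the result easy to verify (inventory values below are illustrative, reusing the SIT broker IPs that appear elsewhere on this page and the controller port 9093):

```python
# Mirrors the Jinja loop: "<node_id>@<host>:<controller_port>", joined by ",".
hostvars = {
    "kafka-sit-1": {"kafka_node_id": 1, "ansible_host": "10.17.9.79"},
    "kafka-sit-2": {"kafka_node_id": 2, "ansible_host": "10.17.9.57"},
    "kafka-sit-3": {"kafka_node_id": 3, "ansible_host": "10.17.12.159"},
}
kafka_controller_port = 9093

voters = ",".join(
    f"{v['kafka_node_id']}@{v['ansible_host']}:{kafka_controller_port}"
    for v in hostvars.values()
)
print(voters)  # 1@10.17.9.79:9093,2@10.17.9.57:9093,3@10.17.12.159:9093
```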
--- /home/runner/kafka-ansible/scripts/kafka-ctl.sh ---
#!/bin/bash
#
# Kafka Cluster Control Script
# Usage: ./kafka-ctl.sh <environment> <action> [options]
#
# Examples:
# ./kafka-ctl.sh sit deploy
# ./kafka-ctl.sh sit status
# ./kafka-ctl.sh sit restart
# ./kafka-ctl.sh uat deploy
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Default values
ENV=""
ACTION=""
LIMIT=""
EXTRA_VARS=""
VERBOSE="false"
# Function to print usage
usage() {
echo "Usage: $0 <environment> <action> [options]"
echo ""
echo "Environments:"
echo " sit - SIT environment"
echo " uat - UAT environment"
echo ""
echo "Actions:"
echo " deploy - Deploy Kafka cluster"
echo " start - Start Kafka services"
echo " stop - Stop Kafka services"
echo " restart - Rolling restart Kafka services"
echo " status - Check cluster status"
echo " upgrade - Upgrade Kafka version"
echo " health - Run health check"
echo " topics - List all topics"
echo ""
echo "Options:"
echo " -l, --limit <host> Limit to specific host(s)"
echo " -e, --extra <vars> Extra variables (key=value)"
echo " -v, --verbose Enable verbose output"
echo " -h, --help Show this help message"
echo ""
echo "Examples:"
echo " $0 sit deploy"
echo " $0 sit restart -l kafka-sit-1"
echo " $0 uat upgrade -e kafka_version=3.10.0"
exit 1
}
# Function to check prerequisites
check_prerequisites() {
if ! command -v ansible-playbook &> /dev/null; then
echo -e "${RED}Error: ansible-playbook not found. Please install Ansible.${NC}"
exit 1
fi
}
# Function to run ansible playbook
run_playbook() {
local playbook=$1
local inventory="${PROJECT_DIR}/inventories/${ENV}/hosts.yml"
if [[ ! -f "$inventory" ]]; then
echo -e "${RED}Error: Inventory file not found: ${inventory}${NC}"
exit 1
fi
local cmd="ansible-playbook -i ${inventory} ${PROJECT_DIR}/playbooks/${playbook}.yml"
if [[ -n "$LIMIT" ]]; then
cmd="${cmd} --limit ${LIMIT}"
fi
if [[ -n "$EXTRA_VARS" ]]; then
cmd="${cmd} -e ${EXTRA_VARS}"
fi
if [[ "$VERBOSE" == "true" ]]; then
cmd="${cmd} -v"
fi
echo -e "${GREEN}Running: ${cmd}${NC}"
echo ""
eval "$cmd"
}
# Parse arguments
if [[ $# -lt 2 ]]; then
usage
fi
ENV=$1
ACTION=$2
shift 2
# Validate environment
if [[ "$ENV" != "sit" && "$ENV" != "uat" ]]; then
echo -e "${RED}Error: Invalid environment '${ENV}'. Must be 'sit' or 'uat'.${NC}"
usage
fi
# Parse options
while [[ $# -gt 0 ]]; do
case $1 in
-l|--limit)
LIMIT="$2"
shift 2
;;
-e|--extra)
EXTRA_VARS="$2"
shift 2
;;
-v|--verbose)
VERBOSE="true"
shift
;;
-h|--help)
usage
;;
*)
echo -e "${RED}Unknown option: $1${NC}"
usage
;;
esac
done
# Check prerequisites
check_prerequisites
# Run action
echo -e "${YELLOW}========================================${NC}"
echo -e "${YELLOW}Kafka Cluster Control - ${ENV^^} Environment${NC}"
echo -e "${YELLOW}Action: ${ACTION}${NC}"
echo -e "${YELLOW}========================================${NC}"
echo ""
case $ACTION in
deploy)
run_playbook "deploy"
;;
start)
run_playbook "start"
;;
stop)
run_playbook "stop"
;;
restart)
run_playbook "restart"
;;
status)
run_playbook "status"
;;
upgrade)
run_playbook "upgrade"
;;
health)
EXTRA_VARS="task=check_health"
run_playbook "maintenance"
;;
topics)
EXTRA_VARS="action=list"
run_playbook "topic-management"
;;
*)
echo -e "${RED}Error: Unknown action '${ACTION}'${NC}"
usage
;;
esac
echo ""
echo -e "${GREEN}Done!${NC}"
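For reference, the command string that `run_playbook` assembles can be traced standalone. This sketch reproduces only the string-building (the project path is hypothetical and nothing is executed):

```shell
# Rebuild the command the way run_playbook does, without calling Ansible.
PROJECT_DIR="/opt/kafka-ansible"   # hypothetical checkout location
ENV="sit"
ACTION="restart"
LIMIT="kafka-sit-1"

cmd="ansible-playbook -i ${PROJECT_DIR}/inventories/${ENV}/hosts.yml ${PROJECT_DIR}/playbooks/${ACTION}.yml"
[ -n "$LIMIT" ] && cmd="${cmd} --limit ${LIMIT}"
echo "$cmd"
```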
ansible-kafka directory

kafka-exporter.yaml
---
# Kafka Monitoring for SIT Environment
# JMX Exporter port: 5556
# Kafka Exporter port: 9308
# JMX Exporter Config for Broker 0
apiVersion: v1
kind: ConfigMap
metadata:
name: jmx-exporter-config-broker-0
namespace: kafka-sit
data:
config.yaml: |
hostPort: 10.17.9.79:9999
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
- pattern: ".*"
---
# JMX Exporter Config for Broker 1
apiVersion: v1
kind: ConfigMap
metadata:
name: jmx-exporter-config-broker-1
namespace: kafka-sit
data:
config.yaml: |
hostPort: 10.17.9.57:9999
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
- pattern: ".*"
---
# JMX Exporter Config for Broker 2
apiVersion: v1
kind: ConfigMap
metadata:
name: jmx-exporter-config-broker-2
namespace: kafka-sit
data:
config.yaml: |
hostPort: 10.17.12.159:9999
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
- pattern: ".*"
---
# JMX Exporter for Broker 0
apiVersion: apps/v1
kind: Deployment
metadata:
name: jmx-exporter-broker-0
namespace: kafka-sit
labels:
app: jmx-exporter
broker: broker-0
spec:
replicas: 1
selector:
matchLabels:
app: jmx-exporter
broker: broker-0
template:
metadata:
labels:
app: jmx-exporter
broker: broker-0
env: sit
spec:
initContainers:
- name: download-jmx-exporter
image: curlimages/curl:latest
command:
- sh
- -c
- |
curl -sSL -o /jmx-exporter/jmx_prometheus_httpserver.jar \
https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_httpserver/0.20.0/jmx_prometheus_httpserver-0.20.0.jar
volumeMounts:
- name: jmx-exporter-jar
mountPath: /jmx-exporter
containers:
- name: jmx-exporter
image: public.ecr.aws/amazoncorretto/amazoncorretto:11
command:
- java
- -jar
- /jmx-exporter/jmx_prometheus_httpserver.jar
- "5556"
- /config/config.yaml
ports:
- containerPort: 5556
name: http-metrics
volumeMounts:
- name: config
mountPath: /config
- name: jmx-exporter-jar
mountPath: /jmx-exporter
resources:
requests:
cpu: 50m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
readinessProbe:
httpGet:
path: /metrics
port: 5556
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 10
livenessProbe:
httpGet:
path: /metrics
port: 5556
initialDelaySeconds: 60
periodSeconds: 30
timeoutSeconds: 10
volumes:
- name: config
configMap:
name: jmx-exporter-config-broker-0
- name: jmx-exporter-jar
emptyDir: {}
---
# JMX Exporter for Broker 1
apiVersion: apps/v1
kind: Deployment
metadata:
name: jmx-exporter-broker-1
namespace: kafka-sit
labels:
app: jmx-exporter
broker: broker-1
spec:
replicas: 1
selector:
matchLabels:
app: jmx-exporter
broker: broker-1
template:
metadata:
labels:
app: jmx-exporter
broker: broker-1
env: sit
spec:
initContainers:
- name: download-jmx-exporter
image: curlimages/curl:latest
command:
- sh
- -c
- |
curl -sSL -o /jmx-exporter/jmx_prometheus_httpserver.jar \
https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_httpserver/0.20.0/jmx_prometheus_httpserver-0.20.0.jar
volumeMounts:
- name: jmx-exporter-jar
mountPath: /jmx-exporter
containers:
- name: jmx-exporter
image: public.ecr.aws/amazoncorretto/amazoncorretto:11
command:
- java
- -jar
- /jmx-exporter/jmx_prometheus_httpserver.jar
- "5556"
- /config/config.yaml
ports:
- containerPort: 5556
name: http-metrics
volumeMounts:
- name: config
mountPath: /config
- name: jmx-exporter-jar
mountPath: /jmx-exporter
resources:
requests:
cpu: 50m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
readinessProbe:
httpGet:
path: /metrics
port: 5556
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 10
livenessProbe:
httpGet:
path: /metrics
port: 5556
initialDelaySeconds: 60
periodSeconds: 30
timeoutSeconds: 10
volumes:
- name: config
configMap:
name: jmx-exporter-config-broker-1
- name: jmx-exporter-jar
emptyDir: {}
---
# JMX Exporter for Broker 2
apiVersion: apps/v1
kind: Deployment
metadata:
name: jmx-exporter-broker-2
namespace: kafka-sit
labels:
app: jmx-exporter
broker: broker-2
spec:
replicas: 1
selector:
matchLabels:
app: jmx-exporter
broker: broker-2
template:
metadata:
labels:
app: jmx-exporter
broker: broker-2
env: sit
spec:
initContainers:
- name: download-jmx-exporter
image: curlimages/curl:latest
command:
- sh
- -c
- |
curl -sSL -o /jmx-exporter/jmx_prometheus_httpserver.jar \
https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_httpserver/0.20.0/jmx_prometheus_httpserver-0.20.0.jar
volumeMounts:
- name: jmx-exporter-jar
mountPath: /jmx-exporter
containers:
- name: jmx-exporter
image: public.ecr.aws/amazoncorretto/amazoncorretto:11
command:
- java
- -jar
- /jmx-exporter/jmx_prometheus_httpserver.jar
- "5556"
- /config/config.yaml
ports:
- containerPort: 5556
name: http-metrics
volumeMounts:
- name: config
mountPath: /config
- name: jmx-exporter-jar
mountPath: /jmx-exporter
resources:
requests:
cpu: 50m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
readinessProbe:
httpGet:
path: /metrics
port: 5556
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 10
livenessProbe:
httpGet:
path: /metrics
port: 5556
initialDelaySeconds: 60
periodSeconds: 30
timeoutSeconds: 10
volumes:
- name: config
configMap:
name: jmx-exporter-config-broker-2
- name: jmx-exporter-jar
emptyDir: {}
---
# Kafka Exporter
apiVersion: apps/v1
kind: Deployment
metadata:
name: kafka-exporter
namespace: kafka-sit
labels:
app: kafka-exporter
spec:
replicas: 1
selector:
matchLabels:
app: kafka-exporter
template:
metadata:
labels:
app: kafka-exporter
env: sit
spec:
containers:
- name: kafka-exporter
image: danielqsj/kafka-exporter:latest
ports:
- containerPort: 9308
name: http-metrics
args:
- --kafka.server=10.17.9.79:9092
- --kafka.server=10.17.9.57:9092
- --kafka.server=10.17.12.159:9092
- --web.listen-address=:9308
- --web.telemetry-path=/metrics
- --log.level=info
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 200m
memory: 256Mi
livenessProbe:
httpGet:
path: /healthz
port: 9308
initialDelaySeconds: 10
periodSeconds: 10
readinessProbe:
httpGet:
path: /healthz
port: 9308
initialDelaySeconds: 5
periodSeconds: 5
---
# Service for JMX Exporter Broker 0
apiVersion: v1
kind: Service
metadata:
name: kafka-cluster-jmx-metrics-0
namespace: kafka-sit
labels:
app: jmx-exporter
broker: broker-0
job-name: kafka-cluster-jmx-metrics
spec:
type: ClusterIP
ports:
- name: http-metrics
port: 5556
targetPort: 5556
selector:
app: jmx-exporter
broker: broker-0
---
# Service for JMX Exporter Broker 1
apiVersion: v1
kind: Service
metadata:
name: kafka-cluster-jmx-metrics-1
namespace: kafka-sit
labels:
app: jmx-exporter
broker: broker-1
job-name: kafka-cluster-jmx-metrics
spec:
type: ClusterIP
ports:
- name: http-metrics
port: 5556
targetPort: 5556
selector:
app: jmx-exporter
broker: broker-1
---
# Service for JMX Exporter Broker 2
apiVersion: v1
kind: Service
metadata:
name: kafka-cluster-jmx-metrics-2
namespace: kafka-sit
labels:
app: jmx-exporter
broker: broker-2
job-name: kafka-cluster-jmx-metrics
spec:
type: ClusterIP
ports:
- name: http-metrics
port: 5556
targetPort: 5556
selector:
app: jmx-exporter
broker: broker-2
---
# Service for Kafka Exporter
apiVersion: v1
kind: Service
metadata:
name: kafka-exporter
namespace: kafka-sit
labels:
app: kafka-exporter
spec:
type: ClusterIP
ports:
- name: http-metrics
port: 9308
targetPort: 9308
selector:
app: kafka-exporter
---
# ServiceMonitor for JMX Exporter
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: kafka-cluster-jmx-metrics
namespace: kafka-sit
labels:
app: jmx-exporter
release: kube-prom-stack
spec:
jobLabel: job-name
selector:
matchLabels:
app: jmx-exporter
namespaceSelector:
matchNames:
- kafka-sit
endpoints:
- port: http-metrics
path: /metrics
interval: 15s
scrapeTimeout: 10s
relabelings:
- sourceLabels: [__meta_kubernetes_pod_label_broker]
targetLabel: broker
- sourceLabels: [__meta_kubernetes_pod_label_env]
targetLabel: env
---
# ServiceMonitor for Kafka Exporter
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: kafka-exporter
namespace: kafka-sit
labels:
app: kafka-exporter
release: kube-prom-stack
spec:
selector:
matchLabels:
app: kafka-exporter
namespaceSelector:
matchNames:
- kafka-sit
endpoints:
- port: http-metrics
path: /metrics
interval: 15s
scrapeTimeout: 10s
relabelings:
- sourceLabels: [__meta_kubernetes_pod_label_env]
targetLabel: env
jmx-kafka-exporter-configmap.yaml
---
# JMX Exporter Config for Broker 0
apiVersion: v1
kind: ConfigMap
metadata:
name: jmx-exporter-config-broker-0
namespace: kafka-sit
data:
config.yaml: |
hostPort: 10.17.9.79:9999
lowercaseOutputName: true
lowercaseOutputLabelNames: true
ssl: false
whitelistObjectNames: ["kafka.controller:*","kafka.server:*","java.lang:*","kafka.network:*","kafka.log:*"]
rules:
- pattern: kafka.controller<type=(ControllerChannelManager), name=(QueueSize), broker-id=(\d+)><>(Value)
name: kafka_controller_$1_$2_$4
labels:
broker_id: "$3"
- pattern: kafka.controller<type=(ControllerChannelManager), name=(TotalQueueSize)><>(Value)
name: kafka_controller_$1_$2_$3
- pattern: kafka.controller<type=(KafkaController), name=(.+)><>(Value)
name: kafka_controller_$1_$2_$3
- pattern: kafka.controller<type=(ControllerStats), name=(.+)><>(Count)
name: kafka_controller_$1_$2_$3
- pattern : kafka.network<type=(Processor), name=(IdlePercent), networkProcessor=(.+)><>(Value)
name: kafka_network_$1_$2_$4
labels:
network_processor: $3
- pattern : kafka.network<type=(RequestMetrics), name=(.+), request=(.+)><>(Count|Value)
name: kafka_network_$1_$2_$4
labels:
request: $3
- pattern : kafka.network<type=(SocketServer), name=(.+)><>(Count|Value)
name: kafka_network_$1_$2_$3
- pattern : kafka.network<type=(RequestChannel), name=(.+)><>(Count|Value)
name: kafka_network_$1_$2_$3
- pattern: kafka.server<type=(.+), name=(.+), topic=(.+)><>(Count|OneMinuteRate)
name: kafka_server_$1_$2_$4
labels:
topic: $3
- pattern: kafka.server<type=(ReplicaFetcherManager), name=(.+), clientId=(.+)><>(Value)
name: kafka_server_$1_$2_$4
labels:
client_id: "$3"
- pattern: kafka.server<type=(DelayedOperationPurgatory), name=(.+), delayedOperation=(.+)><>(Value)
name: kafka_server_$1_$2_$3_$4
- pattern: kafka.server<type=(.+), name=(.+)><>(Count|Value|OneMinuteRate)
name: kafka_server_$1_total_$2_$3
- pattern: kafka.server<type=(.+)><>(queue-size)
name: kafka_server_$1_$2
- pattern: java.lang<type=(.+), name=(.+)><(.+)>(\w+)
name: java_lang_$1_$4_$3_$2
- pattern: java.lang<type=(.+), name=(.+)><>(\w+)
name: java_lang_$1_$3_$2
- pattern : java.lang<type=(.*)>
- pattern: kafka.log<type=(.+), name=(.+), topic=(.+), partition=(.+)><>Value
name: kafka_log_$1_$2
labels:
topic: $3
partition: $4
---
# JMX Exporter Config for Broker 1
apiVersion: v1
kind: ConfigMap
metadata:
name: jmx-exporter-config-broker-1
namespace: kafka-sit
data:
config.yaml: |
hostPort: 10.17.9.57:9999
lowercaseOutputName: true
lowercaseOutputLabelNames: true
ssl: false
whitelistObjectNames: ["kafka.controller:*","kafka.server:*","java.lang:*","kafka.network:*","kafka.log:*"]
rules:
- pattern: kafka.controller<type=(ControllerChannelManager), name=(QueueSize), broker-id=(\d+)><>(Value)
name: kafka_controller_$1_$2_$4
labels:
broker_id: "$3"
- pattern: kafka.controller<type=(ControllerChannelManager), name=(TotalQueueSize)><>(Value)
name: kafka_controller_$1_$2_$3
- pattern: kafka.controller<type=(KafkaController), name=(.+)><>(Value)
name: kafka_controller_$1_$2_$3
- pattern: kafka.controller<type=(ControllerStats), name=(.+)><>(Count)
name: kafka_controller_$1_$2_$3
- pattern : kafka.network<type=(Processor), name=(IdlePercent), networkProcessor=(.+)><>(Value)
name: kafka_network_$1_$2_$4
labels:
network_processor: $3
- pattern : kafka.network<type=(RequestMetrics), name=(.+), request=(.+)><>(Count|Value)
name: kafka_network_$1_$2_$4
labels:
request: $3
- pattern : kafka.network<type=(SocketServer), name=(.+)><>(Count|Value)
name: kafka_network_$1_$2_$3
- pattern : kafka.network<type=(RequestChannel), name=(.+)><>(Count|Value)
name: kafka_network_$1_$2_$3
- pattern: kafka.server<type=(.+), name=(.+), topic=(.+)><>(Count|OneMinuteRate)
name: kafka_server_$1_$2_$4
labels:
topic: $3
- pattern: kafka.server<type=(ReplicaFetcherManager), name=(.+), clientId=(.+)><>(Value)
name: kafka_server_$1_$2_$4
labels:
client_id: "$3"
- pattern: kafka.server<type=(DelayedOperationPurgatory), name=(.+), delayedOperation=(.+)><>(Value)
name: kafka_server_$1_$2_$3_$4
- pattern: kafka.server<type=(.+), name=(.+)><>(Count|Value|OneMinuteRate)
name: kafka_server_$1_total_$2_$3
- pattern: kafka.server<type=(.+)><>(queue-size)
name: kafka_server_$1_$2
- pattern: java.lang<type=(.+), name=(.+)><(.+)>(\w+)
name: java_lang_$1_$4_$3_$2
- pattern: java.lang<type=(.+), name=(.+)><>(\w+)
name: java_lang_$1_$3_$2
- pattern : java.lang<type=(.*)>
- pattern: kafka.log<type=(.+), name=(.+), topic=(.+), partition=(.+)><>Value
name: kafka_log_$1_$2
labels:
topic: $3
partition: $4
---
# JMX Exporter Config for Broker 2
apiVersion: v1
kind: ConfigMap
metadata:
name: jmx-exporter-config-broker-2
namespace: kafka-sit
data:
config.yaml: |
hostPort: 10.17.12.159:9999
lowercaseOutputName: true
lowercaseOutputLabelNames: true
ssl: false
whitelistObjectNames: ["kafka.controller:*","kafka.server:*","java.lang:*","kafka.network:*","kafka.log:*"]
rules:
- pattern: kafka.controller<type=(ControllerChannelManager), name=(QueueSize), broker-id=(\d+)><>(Value)
name: kafka_controller_$1_$2_$4
labels:
broker_id: "$3"
- pattern: kafka.controller<type=(ControllerChannelManager), name=(TotalQueueSize)><>(Value)
name: kafka_controller_$1_$2_$3
- pattern: kafka.controller<type=(KafkaController), name=(.+)><>(Value)
name: kafka_controller_$1_$2_$3
- pattern: kafka.controller<type=(ControllerStats), name=(.+)><>(Count)
name: kafka_controller_$1_$2_$3
- pattern : kafka.network<type=(Processor), name=(IdlePercent), networkProcessor=(.+)><>(Value)
name: kafka_network_$1_$2_$4
labels:
network_processor: $3
- pattern : kafka.network<type=(RequestMetrics), name=(.+), request=(.+)><>(Count|Value)
name: kafka_network_$1_$2_$4
labels:
request: $3
- pattern : kafka.network<type=(SocketServer), name=(.+)><>(Count|Value)
name: kafka_network_$1_$2_$3
- pattern : kafka.network<type=(RequestChannel), name=(.+)><>(Count|Value)
name: kafka_network_$1_$2_$3
- pattern: kafka.server<type=(.+), name=(.+), topic=(.+)><>(Count|OneMinuteRate)
name: kafka_server_$1_$2_$4
labels:
topic: $3
- pattern: kafka.server<type=(ReplicaFetcherManager), name=(.+), clientId=(.+)><>(Value)
name: kafka_server_$1_$2_$4
labels:
client_id: "$3"
- pattern: kafka.server<type=(DelayedOperationPurgatory), name=(.+), delayedOperation=(.+)><>(Value)
name: kafka_server_$1_$2_$3_$4
- pattern: kafka.server<type=(.+), name=(.+)><>(Count|Value|OneMinuteRate)
name: kafka_server_$1_total_$2_$3
- pattern: kafka.server<type=(.+)><>(queue-size)
name: kafka_server_$1_$2
- pattern: java.lang<type=(.+), name=(.+)><(.+)>(\w+)
name: java_lang_$1_$4_$3_$2
- pattern: java.lang<type=(.+), name=(.+)><>(\w+)
name: java_lang_$1_$3_$2
- pattern : java.lang<type=(.*)>
- pattern: kafka.log<type=(.+), name=(.+), topic=(.+), partition=(.+)><>Value
name: kafka_log_$1_$2
labels:
topic: $3
partition: $4
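To double-check the `kafka.server` topic rule used in the ConfigMaps above, the substitution can be replayed in Python (the sample MBean name is hypothetical; `lowercaseOutputName: true` would additionally lowercase the final metric name):

```python
import re

# kafka.server topic rule from the ConfigMaps above: $1/$2/$4 build the
# metric name, $3 becomes the "topic" label.
rule = re.compile(
    r"kafka\.server<type=(.+), name=(.+), topic=(.+)><>(Count|OneMinuteRate)"
)
mbean = "kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec, topic=orders><>Count"

m = rule.fullmatch(mbean)
name = f"kafka_server_{m.group(1)}_{m.group(2)}_{m.group(4)}"
labels = {"topic": m.group(3)}
print(name, labels)  # kafka_server_BrokerTopicMetrics_MessagesInPerSec_Count {'topic': 'orders'}
```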
Integrating the EC2 deployment into the K8s monitoring stack
Helm deployment:
tree
.
├── kafka-cluster-values.yaml
├── kafka-exporter.yaml
├── kafka-gp3-sc.yaml
└── start.sh

---kafka-cluster-values.yaml
global:
storageClass: "kafka-gp3"
security:
allowInsecureImages: true
image:
registry: 292309088324.dkr.ecr.ap-northeast-1.amazonaws.com
repository: sretools
tag: kafka-3.9.0
pullPolicy: Always
controller:
replicaCount: 3
automountServiceAccountToken: true
# Pin the deployment to the ap-northeast-1a availability zone
nodeSelector:
topology.kubernetes.io/zone: ap-northeast-1a
# 2C/8G resource sizing
resources:
requests:
cpu: "200m"
memory: "6Gi"
limits:
cpu: "2"
memory: "8Gi"
persistence:
enabled: true
size: 300Gi
storageClass: "kafka-gp3"
# JVM heap settings (with 8G of memory a 4G heap is recommended; do not exceed 70% of physical memory)
heapOpts: "-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -XX:+ParallelRefProcEnabled"
broker:
replicaCount: 0
automountServiceAccountToken: true
listeners:
client:
protocol: PLAINTEXT
interbroker:
protocol: PLAINTEXT
controller:
protocol: PLAINTEXT
external:
protocol: PLAINTEXT
externalAccess:
enabled: true
autoDiscovery:
enabled: true
image:
registry: 292309088324.dkr.ecr.ap-northeast-1.amazonaws.com
repository: sretools
tag: kubectl
controller:
service:
type: LoadBalancer
ports:
external: 9092
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: "external"
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
serviceAccount:
create: true
service:
type: ClusterIP
ports:
client: 9092
controller: 9093
# ==================== AWS best-practice settings (tuned for 2C/8G) ====================
extraConfig: |
# ========== Network & connection settings ==========
num.network.threads=2
num.io.threads=8
socket.send.buffer.bytes=-1
socket.receive.buffer.bytes=-1
socket.request.max.bytes=157286400
queued.max.requests=1000
# ========== Availability & replication settings ==========
unclean.leader.election.enable=false
replica.lag.time.max.ms=45000
# With 8G of memory the replica fetch size can be raised
replica.fetch.max.bytes=16777216
# ========== Topic defaults ==========
num.partitions=3
default.replication.factor=2
min.insync.replicas=2
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2
# ========== Performance & resource settings ==========
num.recovery.threads.per.data.dir=1
log.retention.check.interval.ms=1800000
# ========== Topic & partition settings ==========
auto.leader.rebalance.enable=false
group.initial.rebalance.delay.ms=5000
compression.type=producer
# ========== Log retention settings ==========
log.retention.hours=72
log.segment.bytes=268435456
log.cleanup.policy=delete
# ========== Operational safety settings ==========
auto.create.topics.enable=false
delete.topic.enable=true
controlled.shutdown.enable=true
sasl:
client:
users: []
rbac:
create: true
metrics:
jmx:
enabled: true
kafkaJmxPort: 9999
image:
registry: 292309088324.dkr.ecr.ap-northeast-1.amazonaws.com
repository: sretools
tag: jmx-exporter-1.1.0
kafka:
enabled: true
image:
registry: 292309088324.dkr.ecr.ap-northeast-1.amazonaws.com
repository: sretools
tag: kafka-exporter-v1.8.0
certificatesSecret: ""
tlsCert: ""
tlsKey: ""
tlsCaSecret: ""
tlsCaCert: ""
extraFlags: {}
command: []
args: []
containerPorts:
metrics: 9308
resources:
limits: {}
requests: {}
service:
ports:
metrics: 9308
annotations: {}
serviceMonitor:
enabled: true
namespace: "kafka-qa"
labels:
release: kube-prom-stack
prometheusRule:
enabled: false
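The heap size in `heapOpts` follows the rule stated in the values comment (fixed heap with `-Xms` == `-Xmx`, at most 70% of container memory); a tiny check of that arithmetic:

```python
# 70%-of-memory guideline from the values file comment above.
def max_heap_gib(container_mem_gib: float, cap: float = 0.70) -> float:
    """Upper bound for -Xmx under the cap."""
    return container_mem_gib * cap

chosen_heap_gib = 4  # -Xmx4g -Xms4g from heapOpts
assert chosen_heap_gib <= max_heap_gib(8)  # fits within the 8Gi limit
print(max_heap_gib(8))
```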
---kafka-gp3-sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: kafka-gp3
provisioner: ebs.csi.aws.com
parameters:
type: gp3
fsType: ext4
iops: "3000"
throughput: "250"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---kafka-exporter.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: kafka-exporter
namespace: kafka-qa
labels:
app: kafka-exporter
spec:
replicas: 1
selector:
matchLabels:
app: kafka-exporter
template:
metadata:
labels:
app: kafka-exporter
spec:
containers:
- name: kafka-exporter
image: 292309088324.dkr.ecr.ap-northeast-1.amazonaws.com/sretools:kafka-exporter-v1.8.0
args:
- --kafka.server=kafka-cluster-controller-headless.kafka-qa.svc.cluster.local:9092
ports:
- containerPort: 9308
name: metrics
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
---
apiVersion: v1
kind: Service
metadata:
name: kafka-exporter
namespace: kafka-qa
labels:
app: kafka-exporter
spec:
ports:
- port: 9308
targetPort: 9308
name: metrics
selector:
app: kafka-exporter
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: kafka-exporter
namespace: kafka-qa
labels:
release: kube-prom-stack
spec:
selector:
matchLabels:
app: kafka-exporter
endpoints:
- port: metrics
interval: 10s
Helm deployment directory
EC2/K8s deployment monitoring dashboards: