Tuning Kafka and more with CodeBuddy

The user wants to learn about Kafka configuration tuning, log retention policy, and ACL access control. Let me first check the current Kafka configuration to understand the existing setup, then teach these concepts with hands-on demonstrations.

Let me explore the current Kafka setup and configuration files.

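OK, I'll walk through three topics systematically: Kafka configuration tuning, log retention policy, and ACL access control. First, let's look at your current Kafka configuration: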

Now I have a good picture of the current Kafka setup. Let me read the current server.properties to understand the exact configuration, then I'll teach the user about:

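  1. Kafka configuration tuning - broker-level parameters, producer/consumer parameters, OS-level tuning
  2. Log retention policy - the log.retention.* family of parameters
  3. ACL access control - SASL/SCRAM authentication + ACL authorization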

Let me now read the current config files and build a comprehensive hands-on tutorial.

Let me read the key configuration files first, then create a practical tutorial.

Found the Kafka config at /opt/module/kafka/config/server.properties. Let me read the current configuration:

Now I have a clear picture of the current Kafka setup. Let me create a comprehensive hands-on tutorial covering:

  1. Kafka configuration tuning - Broker level tuning (network threads, IO threads, socket buffers, page cache, etc.)
  2. Log retention policy - log.retention.hours, log.retention.bytes, log.segment.bytes, log.cleanup.policy
  3. ACL access control - SASL/SCRAM authentication, ACL authorization

Since Kafka is running on the 3-node cluster, I'll demonstrate these with actual commands and config changes.

Let me structure this tutorial:

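Part 1: Kafka configuration tuning

Key parameters:

  • num.network.threads - number of network threads (handle network requests)
  • num.io.threads - number of I/O threads (handle disk reads and writes)
  • socket.send.buffer.bytes / socket.receive.buffer.bytes - socket buffers
  • num.partitions - default partition count
  • num.recovery.threads.per.data.dir - number of recovery threads
  • log.flush.interval.messages / log.flush.interval.ms - flush-to-disk policy
  • compression.type - broker-side compression
  • unclean.leader.election.enable - whether a non-ISR replica may be elected leader
  • min.insync.replicas - minimum number of in-sync replicas
  • auto.create.topics.enable - whether topics are created automatically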

OS level:

  • vm.swappiness
  • net.core.rmem_max / net.core.wmem_max
  • file descriptors

Part 2: Log retention policy

Key parameters:

  • log.retention.hours (currently 168 = 7 days)
  • log.retention.bytes - size-based retention
  • log.segment.bytes (currently 536870912 = 512MB)
  • log.retention.check.interval.ms (currently 300000 = 5min)
  • log.cleanup.policy - delete or compact
  • log.cleaner.enable

Part 3: ACL access control

Steps:

  1. Configure SASL/SCRAM in server.properties
  2. Create SCRAM users
  3. Configure JAAS files
  4. Set super users
  5. Enable ACL authorizer
  6. Create ACL rules

Let me write this up as a comprehensive tutorial and also create some helper scripts.

I'll first create a detailed hands-on walkthrough, then demonstrate where possible. Since ACL requires restarting Kafka with new config, I'll be careful about that.

Let me create the tutorial in a structured way:

  1. First, let me show the current config and explain each parameter
  2. Then demonstrate log retention with actual topic management
  3. Then show ACL setup

Let me start by reading the current Kafka cluster state and then providing the tutorial.

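OK, I now have a full picture of your Kafka cluster. I'll cover the three topics systematically and demonstrate each against your actual configuration.


📚 1. Kafka configuration tuning

Let's start with your current server.properties; I'll go through it item by item:

1.1 Broker core parameters

Current setting                    | Suggested   | Notes
num.network.threads=3              | CPU cores   | Threads that handle network requests; too few become a bottleneck
num.io.threads=8                   | CPU cores×2 | Threads that handle disk I/O
socket.send.buffer.bytes=102400    | keep        | 100KB socket send buffer, generally sufficient
socket.receive.buffer.bytes=102400 | keep        | 100KB socket receive buffer
socket.request.max.bytes=100MB     | keep        | Maximum size of a single request, 100MB
num.partitions=6                   | as needed   | Default partition count; applies to new topics only

First, a look at your cluster's current resource usage: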

Now I have the system info. Let me check the Kafka cluster status and then create the comprehensive tutorial.

System info:

  • CPU: 1 core
  • Memory: 5.9G total, 2.9G available
  • Disk: 50G, 42G available (18% used)
  • File descriptors: 131072
  • Swappiness: 30
  • Socket buffers: 212992 (208KB)

Let me check if Kafka and ZK are running, then present the full tutorial with hands-on examples.

Kafka is running. Topics: __consumer_offsets, lesson, nginx-log, nginx-lost, test-topic.

Now I'll create a comprehensive tutorial. Let me present this in a structured way with:

  1. Configuration tuning theory + actionable recommendations
  2. Log retention demo using actual kafka commands
  3. ACL setup hands-on

For the hands-on parts, I'll:

  • Show how to check and modify topic-level retention
  • Demonstrate ACL setup with SASL/SCRAM

Let me just do everything in demonstrations. I'll use execute_command to run actual Kafka commands where possible.

Let me structure my response:

Part 1: Configuration tuning

  • Current config analysis
  • Tuning recommendations with rationale
  • OS level tuning

Part 2: Log retention policy

  • Current: log.retention.hours=168 (7 days)
  • Demonstrate: check current retention, modify topic-level retention, segment management
  • Time-based vs Size-based retention
  • Cleanup policies (delete vs compact)

Part 3: ACL access control

  • SASL/SCRAM authentication setup
  • ACL rules setup
  • Demo: create users, set permissions, verify

Let me start executing the demos.

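The cluster is running, with 5 topics. Here are the three parts.


🔧 1. Kafka configuration tuning

1.1 Diagnosis of your current configuration

Based on your environment (1 CPU core, 6G memory), the current configuration looks like this:

Parameter                        | Current       | Diagnosis                            | Suggestion
num.network.threads              | 3             | 3 is enough for a 1-core CPU         | keep
num.io.threads                   | 8             | high; 2-4 suggested for a 1-core CPU | change to 4
socket.send/receive.buffer.bytes | 102400 (100K) | on the small side                    | change to 1048576 (1M)
num.partitions                   | 6             | default value; adjust as needed      | keep
socket.request.max.bytes         | 100MB         | reasonable                           | keep
log.segment.bytes                | 512MB         | reasonable                           | keep

A quick demo first: inspect the details of a current topic: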

Good, the nginx-log topic has 3 partitions, replication factor 2, leaders are distributed. Now let me continue the tutorial.

Let me present:

  1. The tuning analysis
  2. Demonstrate log retention modifications
  3. Then ACL setup

Let me continue with the demo commands.

1.2 🔴 Key tuning parameters in detail

Let's demonstrate with a real tuning change. First check the topic-level dynamic configuration:

Now I have comprehensive data. Let me present the full tutorial now.

Key findings:

  • retention.ms = 604800000 (7 days, matches log.retention.hours=168)
  • cleanup.policy = delete
  • segment.bytes = 536870912 (512MB)
  • min.insync.replicas = 1 (this is a concern for data durability)
  • unclean.leader.election.enable = false (good)
  • compression.type = producer (follows producer's setting)
  • retention.bytes = -1 (unlimited by size, only time-based)

nginx-log offsets: partition 0=26, partition 1=24, partition 2=30

Now let me present the full tutorial with 3 sections. I'll be comprehensive but practical.

Let me go ahead and do the hands-on demos:

  1. Show how to modify retention at topic level
  2. Demonstrate ACL setup

Actually, for ACL, I need to:

  1. Create JAAS config files
  2. Modify server.properties
  3. Restart Kafka - this is disruptive

Let me present the concept and configuration steps for ACL rather than actually doing it on the running cluster (unless the user wants to).

Let me present everything now in a clear format.

Now that all the data is in, here is the full walkthrough.


🔧 1. Kafka configuration tuning

1.1 Broker-level parameters (server.properties)

Parameter                                | Current      | Purpose and tuning advice
num.network.threads                      | 3            | Handles network requests. Formula: CPU cores
num.io.threads                           | 8            | Handles disk I/O. Formula: CPU cores × 2
num.partitions                           | 6            | Default partition count for new topics; set as needed
offsets.topic.replication.factor         | 3            | Replica count of __consumer_offsets; 3 nodes = 3 ✅
transaction.state.log.replication.factor | 3            | Replica count of the transaction log ✅
transaction.state.log.min.isr            | 2            | Minimum ISR for the transaction log ✅
log.retention.hours                      | 168 (7 days) | Log retention time
log.segment.bytes                        | 512MB        | Size of each log segment; a new segment is rolled when it fills
zookeeper.connect                        | 3 nodes      | ZooKeeper ensemble address; 3 nodes for fault tolerance ✅
zookeeper.connection.timeout.ms          | 6000ms       | ZooKeeper connection timeout
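As a rough illustration of the two formulas in the table, the sketch below derives thread counts from the core count. `suggest_threads` is a hypothetical helper, and the minimums of 3 and 4 are an assumption borrowed from the 1-core diagnosis earlier (keep 3 network threads, use 4 I/O threads); they are not a Kafka rule.

```bash
#!/usr/bin/env bash
# Hypothetical helper: apply "network = cores, I/O = cores x 2" from the table,
# with assumed minimums of 3 and 4 (an assumption, not a Kafka rule).
suggest_threads() {
  local cores=$1
  local net=$cores
  local io=$((cores * 2))
  if (( net < 3 )); then net=3; fi
  if (( io < 4 )); then io=4; fi
  echo "num.network.threads=${net} num.io.threads=${io}"
}

# getconf reports the number of CPU cores currently online on this machine.
suggest_threads "$(getconf _NPROCESSORS_ONLN)"
```

Treat the output as a starting point and confirm it with a load test, since the right values depend on disk and network behavior as much as on core count.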

1.2 🔴 Key tuning parameters that are missing

Your configuration lacks the following important parameters:
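The list itself is not reproduced in this post, but the parameter quick reference near the end names min.insync.replicas=2 and compression.type=lz4. A minimal sketch of adding such keys safely, assuming a hypothetical helper `set_prop` that overwrites an existing `key=value` line or appends a new one. It runs against a scratch file here, not the live server.properties.

```bash
#!/usr/bin/env bash
# set_prop FILE KEY VALUE: replace "KEY=..." if present, otherwise append it.
# (Hypothetical helper for illustration.)
set_prop() {
  local file=$1 key=$2 value=$3 tmp
  tmp=$(mktemp)
  awk -F= -v k="$key" -v v="$value" '
    $1 == k { print k "=" v; found = 1; next }   # overwrite the existing key
    { print }                                    # keep every other line
    END { if (!found) print k "=" v }            # append when the key is missing
  ' "$file" > "$tmp"
  cat "$tmp" > "$file"   # write back in place so file permissions are kept
  rm -f "$tmp"
}

# Demo on a scratch file standing in for server.properties.
demo=$(mktemp)
printf 'num.io.threads=8\nnum.partitions=6\n' > "$demo"
set_prop "$demo" num.io.threads 4
set_prop "$demo" min.insync.replicas 2
set_prop "$demo" compression.type lz4
cat "$demo"
```

Unlike a plain `sed s/^key=.../` substitution, this also covers keys that are absent from the file, which is exactly the case for parameters the configuration is missing.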

Good, now let me continue with the OS-level tuning and then move to log retention and ACL.

1.3 Operating-system-level tuning
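A read-only sketch of an OS-level check, assuming a Linux host. The targets are the values used elsewhere in this post (vm.swappiness=1, net.core.rmem_max=2097152); `report_setting` is a hypothetical helper, and nothing is changed on the system.

```bash
#!/usr/bin/env bash
# report_setting NAME CURRENT TARGET: show whether a setting already matches the target.
report_setting() {
  local name=$1 current=$2 target=$3
  if [ "$current" = "$target" ]; then
    echo "${name}: ${current} (ok)"
  else
    echo "${name}: ${current} -> ${target}"
  fi
}

# Read current values from /proc; applying a change needs root,
# e.g. sysctl -w vm.swappiness=1
if [ -r /proc/sys/vm/swappiness ]; then
  report_setting vm.swappiness "$(cat /proc/sys/vm/swappiness)" 1
fi
if [ -r /proc/sys/net/core/rmem_max ]; then
  report_setting net.core.rmem_max "$(cat /proc/sys/net/core/rmem_max)" 2097152
fi
echo "open files limit: $(ulimit -n)"
```

On the machine described above (swappiness 30, socket buffers 212992) both checks would report a pending change.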

Good. Now let me move to Part 2: Log Retention Policy. Let me demonstrate this hands-on by modifying the retention of a topic.

📦 2. Log retention policy: hands-on

2.1 Retention limits and cleanup policies

Retention is bounded in two ways, and the cleanup policy decides what happens once a bound is reached:

Limit      | Topic-level config | Broker-level default
Time-based | retention.ms       | log.retention.ms / log.retention.minutes / log.retention.hours
Size-based | retention.bytes    | log.retention.bytes

cleanup.policy   | Behavior                                              | Suited to
delete (default) | expired or over-limit segments are physically deleted | logs, temporary data
compact          | keeps the latest value for each key                   | database changelogs
delete,compact   | both combined                                         | more complex scenarios

2.2 Your current retention configuration

Let me try again with correct paths. The log directories might be structured differently.

I see nginx-log has partitions 0 and 1 on this broker. nginx-log partition 2 is on another broker. There's also a nginx-lost topic.

Let me check the retention config properly.

Now I can see the retention config clearly:

  • retention.ms=604800000 = 7 days (matches broker setting of log.retention.hours=168)
  • cleanup.policy=delete
  • retention.bytes=-1 = no size limit, only time-based
  • segment.bytes=536870912 = 512MB per segment
  • segment.ms=604800000 = 7 days per segment
  • file.delete.delay.ms=60000 = 1 minute delay before file deletion
  • delete.retention.ms=86400000 = 1 day (for compacted topics, deleted keys retained for this long)
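The unit conversions in this list are easy to get wrong by a factor of 1000, so here is a small sketch for checking them. `hours_to_ms` and `bytes_to_mib` are hypothetical helper names used only for illustration.

```bash
#!/usr/bin/env bash
# Convert hours to milliseconds (log.retention.hours -> retention.ms).
hours_to_ms() { echo $(( $1 * 60 * 60 * 1000 )); }

# Convert bytes to MiB (segment.bytes -> human-readable size).
bytes_to_mib() { echo $(( $1 / 1024 / 1024 )); }

hours_to_ms 168          # prints 604800000, the retention.ms shown above
bytes_to_mib 536870912   # prints 512, the segment size in MiB
```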

The log files:

  • Partition 0: 00000000000000000019.log (2.9KB), 00000000000000000019.index (10MB)
  • Partition 1: 00000000000000000017.log (3.3KB), 00000000000000000017.index (10MB)

The numbers in filenames are offsets. The first segment starts at offset 19 for partition 0 and offset 17 for partition 1.

Now let me demonstrate modifying retention at topic level.

Actually, let me present all this info first, then demonstrate modifying retention, and finally the ACL part.

Let me organize my response now with clear sections and actual demos.

2.3 📊 Reading the nginx-log retention settings

Current nginx-log retention configuration:
  retention.ms         = 604800000  (7 days)
  retention.bytes      = -1         (no size limit)
  cleanup.policy       = delete     (delete on expiry)
  segment.bytes        = 536870912  (512MB segments)
  segment.ms           = 604800000  (roll a new segment every 7 days)
  file.delete.delay.ms = 60000      (wait 1 minute before deleting)

Partition segment file layout:
/opt/module/kafka/logs/nginx-log-0/
├── 00000000000000000019.log       ← data file (starts at offset 19)
├── 00000000000000000019.index     ← sparse index (offset → file position)
└── 00000000000000000019.timeindex ← time index (timestamp → offset)

2.4 🔴 Hands-on: change a topic's retention policy dynamically (no restart needed)
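The exact command is not shown in this post. A sketch of what such a change typically looks like with `kafka-configs.sh`, assuming a Kafka release whose tool accepts `--bootstrap-server` for topic entities and a broker at localhost:9092 (both assumptions). The helper `topic_config_cmd` is hypothetical and only prints the command unless `RUN=1` is set.

```bash
#!/usr/bin/env bash
KAFKA_HOME=${KAFKA_HOME:-/opt/module/kafka}
BOOTSTRAP=${BOOTSTRAP:-localhost:9092}   # assumed broker address

# Print the kafka-configs.sh command line that sets one topic-level override.
topic_config_cmd() {
  local topic=$1 kv=$2
  echo "${KAFKA_HOME}/bin/kafka-configs.sh --bootstrap-server ${BOOTSTRAP}" \
       "--entity-type topics --entity-name ${topic} --alter --add-config ${kv}"
}

# 1 GiB = 1073741824 bytes
cmd=$(topic_config_cmd nginx-log retention.bytes=1073741824)
echo "$cmd"

# Dry run by default; execute only when explicitly requested.
if [ "${RUN:-0}" = "1" ]; then
  $cmd
fi
```

Note that retention.bytes is enforced per partition, so a 3-partition topic can hold roughly three times this amount. Swapping `--alter --add-config ...` for `--describe` shows the overrides currently in place.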

Good, the retention.bytes was set to 1GB. Now let me verify it, then demonstrate compact policy.

Good. Now let me demonstrate the compact policy and then the ACL part.

For compact policy, I'll create a test topic and show the concept.

2.5 🔴 Hands-on: the compact policy, keeping the latest value per key

Good. Now let me write data with same keys but different values to demonstrate compact behavior.
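To make the expected result concrete, the sketch below generates keyed records and simulates compaction locally with awk. This is only a stand-in for Kafka's log cleaner, and the key name `user1` is hypothetical; the `key:value` line format is the one `kafka-console-producer.sh` reads when started with `--property parse.key=true --property key.separator=:`.

```bash
#!/usr/bin/env bash
# Generate N records for one key in "key:value" form.
gen_records() {
  local key=$1 n=$2 i
  for (( i = 1; i <= n; i++ )); do
    echo "${key}:value_v${i}"
  done
}

# Local stand-in for compaction: keep only the last record of each key,
# preserving the original order of the surviving records.
compact_sim() {
  awk -F: '
    { last[$1] = NR; line[NR] = $0; key[NR] = $1 }
    END { for (i = 1; i <= NR; i++) if (last[key[i]] == i) print line[i] }
  '
}

gen_records user1 5 | compact_sim   # prints user1:value_v5
```

On a real topic the older values stay readable for a while, because the log cleaner never compacts the active segment.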

Good, now let me read all data to see all 5 values, then trigger a segment roll and see how compact works.

Actually, I should just show the concept. Let me read the data and explain.

Now let me move to Part 3: ACL access control. This is the most complex part.

For ACL, I need to:

  1. Create SCRAM users
  2. Configure JAAS
  3. Modify server.properties
  4. Restart Kafka

This is disruptive. Let me present the concept clearly, show the config changes needed, but not actually restart Kafka since that would break their running services.

Actually, let me at least show the config steps and create the JAAS directory and user creation script.

Let me structure this well.

All 5 records with the same key are still there. Once compaction runs, only value_v5 will remain. Now on to the ACL part.


🔐 3. ACL access control

3.1 Overall architecture

                      ┌─────────────────┐
                      │   SASL/SCRAM    │  ← Authentication: who are you?
                      │ (Authentication)│
                      └────────┬────────┘
                               │ verified
                      ┌────────▼────────┐
                      │   ACL Rules     │  ← Authorization: what may you do?
                      │ (Authorization) │
                      └────────┬────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
        READ:topicA    WRITE:topicB    DESCRIBE:topicC

3.2 Your Kafka's current security state

Your Kafka is completely open. Let me now create the ACL setup demo. I'll:

  1. Create the JAAS config file for SCRAM
  2. Create SCRAM users
  3. Show the server.properties changes needed
  4. Create an ACL rule setup script

Since restarting Kafka would be disruptive, I'll create all the config files and scripts, and show the user how to apply them. But I can still create SCRAM users without restarting.

Actually, let me check: can we create SCRAM users dynamically without restart? Let me test.

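3.3 🔴 Hands-on: configuring SASL/SCRAM authentication + ACL authorization

First create the SCRAM users (they can be created without a restart, but they only take effect once the broker is configured for SASL and restarted):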
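The commands themselves are not shown here. A sketch of the usual form for a ZooKeeper-based cluster follows, printing one `kafka-configs.sh` invocation per user. The ZooKeeper address and the `<user>-secret` passwords are placeholders, and `scram_user_cmd` is a hypothetical helper that prints rather than executes.

```bash
#!/usr/bin/env bash
KAFKA_HOME=${KAFKA_HOME:-/opt/module/kafka}
ZK=${ZK:-localhost:2181}   # assumed ZooKeeper address

# Print the command that stores a SCRAM-SHA-256 credential for one user.
# The single quotes keep the shell from treating [ ] as a glob pattern.
scram_user_cmd() {
  local user=$1 password=$2
  echo "${KAFKA_HOME}/bin/kafka-configs.sh --zookeeper ${ZK} --alter" \
       "--add-config 'SCRAM-SHA-256=[password=${password}]'" \
       "--entity-type users --entity-name ${user}"
}

for u in admin producer consumer; do
  scram_user_cmd "$u" "${u}-secret"   # placeholder passwords
done
```

Running the same tool with `--describe --entity-type users` lists the stored credentials without revealing the passwords.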

Users created successfully. Now let me verify and create the JAAS config files and the updated server.properties with all the security settings.

SCRAM users are created. Now let me create:

  1. JAAS broker config file
  2. JAAS client config file (for CLI tools)
  3. Updated server.properties with SASL and ACL settings
  4. ACL rules script

Let me create all config files now.

The users are now stored in ZooKeeper. Next, create the JAAS configuration files for SASL and update the configuration:
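The file contents are not reproduced in this post. A minimal sketch of a broker-side JAAS file for SCRAM, written to a scratch directory; the password is a placeholder and `OUT_DIR` is an assumed variable name. On the cluster the file would live at /opt/module/kafka/config/kafka_server_jaas.conf.

```bash
#!/usr/bin/env bash
OUT_DIR=${OUT_DIR:-$(mktemp -d)}   # scratch directory for the demo

# The KafkaServer section holds the credentials the broker itself uses
# for inter-broker connections.
cat > "${OUT_DIR}/kafka_server_jaas.conf" <<'EOF'
KafkaServer {
    org.apache.kafka.common.security.scram.ScramLoginModule required
    username="admin"
    password="admin-secret";
};
EOF

chmod 600 "${OUT_DIR}/kafka_server_jaas.conf"   # the file contains a password
echo "wrote ${OUT_DIR}/kafka_server_jaas.conf"
```

The username and password must match a SCRAM credential that already exists, otherwise brokers cannot authenticate to each other after the restart.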

Now let me create the final files:

  1. Security config properties (a diff for server.properties)
  2. ACL rules script
  3. Client config files for producer and consumer

Now create the complete server.properties template with SASL+ACL and the ACL rules script:
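The template itself is not shown, so the fragment below is a sketch assembled from the settings named in this post: SCRAM-SHA-256, AclAuthorizer, super.users=User:admin, and ports 9092 (PLAINTEXT) and 9093 (SASL). Using SASL_PLAINTEXT for inter-broker traffic and denying access when no ACL matches are assumptions; `SEC_FILE` is a scratch path.

```bash
#!/usr/bin/env bash
SEC_FILE=${SEC_FILE:-$(mktemp)}   # scratch stand-in for kafka-security.properties

cat > "$SEC_FILE" <<'EOF'
# Two listeners: the existing plaintext port plus a SASL port
listeners=PLAINTEXT://:9092,SASL_PLAINTEXT://:9093
# Brokers talk to each other over the SASL listener (assumption)
security.inter.broker.protocol=SASL_PLAINTEXT
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-256
sasl.enabled.mechanisms=SCRAM-SHA-256
# ZooKeeper-backed ACL authorizer
authorizer.class.name=kafka.security.authorizer.AclAuthorizer
super.users=User:admin
# Deny by default when no ACL matches (assumption)
allow.everyone.if.no.acl.found=false
EOF

grep -v '^#' "$SEC_FILE"
```

If server.properties already defines `listeners` or `advertised.listeners`, those lines need to be reconciled by hand rather than left duplicated.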

Now let me also create producer and consumer config files for SASL, then give the user a comprehensive summary.

Actually, let me also create a comprehensive summary file that ties everything together.

Let me also create the producer and consumer SASL config files.
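A sketch of what such client files usually contain, written to a scratch directory. The passwords are placeholders and `write_client_props` is a hypothetical helper; the file names mirror the ones listed in the summary below.

```bash
#!/usr/bin/env bash
# write_client_props USER PASSWORD FILE: write a SASL/SCRAM client config.
write_client_props() {
  local user=$1 password=$2 out=$3
  cat > "$out" <<EOF
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-256
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="${user}" password="${password}";
EOF
}

tmp_dir=$(mktemp -d)
write_client_props producer producer-secret "${tmp_dir}/kafka-producer-sasl.properties"
write_client_props consumer consumer-secret "${tmp_dir}/kafka-consumer-sasl.properties"
ls "${tmp_dir}"
```

The console tools pick these up through `--producer.config` and `--consumer.config`, pointed at the SASL listener port.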

Now let me also create a final comprehensive reference document, and also chmod the script. Let me also show the final summary.

Actually, let me also reset the retention config I changed on nginx-log back to the original state, since I changed it during the demo.

Wait, let me think about this. The user was learning, so leaving the 1GB retention is fine. They can change it back if needed.

Let me chmod the script and present the comprehensive summary.

Now let me present the final comprehensive summary. Let me also show all the files I created.

Let me organize the final summary into a clear plan showing:

  1. What was done in each section
  2. Files created
  3. How to enable security (when ready)
  4. Key takeaway commands

Let me present this in a clean format.

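📋 Full summary

✅ What was actually done today

Action                             | Details
Reviewed the current configuration | analysis of every parameter in server.properties
Changed retention dynamically      | nginx-log: retention.bytes=1GB
Created a compact topic            | user-cache: compact policy, deduplicates by key
Wrote data to the compact topic    | multiple values written under the same key to verify
Created SCRAM users                | admin / producer / consumer, 3 in total
Created JAAS configs               | broker side + client side
Created the ACL script             | complete permission management script
Created SASL client configs        | producer + consumer configs

📁 Newly created files

File                                            | Purpose
/home/kafka/kafka-security.properties           | security settings to append to server.properties
/home/kafka/kafka-acl.sh                        | ACL rule management script
/home/kafka/kafka-producer-sasl.properties      | SASL producer config
/home/kafka/kafka-consumer-sasl.properties      | SASL consumer config
/opt/module/kafka/config/kafka_server_jaas.conf | broker JAAS config
/opt/module/kafka/config/kafka_client_jaas.conf | client JAAS config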

🔐 How to enable SASL+ACL (3 steps)

Step 1: Append the security settings to server.properties (on every node)
  cat /home/kafka/kafka-security.properties >> /opt/module/kafka/config/server.properties

Step 2: Edit the Kafka startup script so that it loads the JAAS file
  export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/module/kafka/config/kafka_server_jaas.conf"

Step 3: Restart Kafka and load the ACL rules
  bash qidong.sh restart
  bash /home/kafka/kafka-acl.sh setup    ← initialize the ACL rules
  bash /home/kafka/kafka-acl.sh list      ← list the rules
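The contents of kafka-acl.sh are not shown, so the following is only a sketch of the kind of `kafka-acls.sh` calls a `setup` step might issue. The admin client config path, the consumer group name `nginx-consumers`, and the helper `acl_cmd` are all hypothetical, and the commands are printed rather than executed.

```bash
#!/usr/bin/env bash
KAFKA_HOME=${KAFKA_HOME:-/opt/module/kafka}
BOOTSTRAP=${BOOTSTRAP:-localhost:9093}                    # assumed SASL listener
ADMIN_CONF=${ADMIN_CONF:-/home/kafka/admin-sasl.properties}  # hypothetical admin client config

# Print one kafka-acls.sh command line with the shared connection options.
acl_cmd() {
  echo "${KAFKA_HOME}/bin/kafka-acls.sh --bootstrap-server ${BOOTSTRAP}" \
       "--command-config ${ADMIN_CONF} $*"
}

# Producer: may write to and describe nginx-log
acl_cmd --add --allow-principal User:producer --operation Write --operation Describe --topic nginx-log
# Consumer: may read and describe nginx-log, and read its consumer group
acl_cmd --add --allow-principal User:consumer --operation Read --operation Describe --topic nginx-log
acl_cmd --add --allow-principal User:consumer --operation Read --group nginx-consumers
# Show what is in place
acl_cmd --list
```

A consumer needs Read on both the topic and its group; granting only the topic permission is a common reason for authorization failures after ACLs are switched on.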

📐 Quick reference: core parameters covered

Area      | Setting
Tuning    | num.io.threads=4
          | socket.send.buffer.bytes=1048576
          | min.insync.replicas=2
          | compression.type=lz4
          | vm.swappiness=1
Retention | retention.ms=604800000  (7 days)
          | retention.bytes=-1      (no size limit)
          | cleanup.policy=delete
          | → compact suits key-value storage
          | → delete  suits log streams
ACL       | SASL/SCRAM-SHA-256 authentication
          | AclAuthorizer authorization
          | super.users=User:admin
          | Ports: 9093 (SASL) + 9092 (PLAINTEXT)

If you want to go deeper on any part, such as concrete load tests, changing broker configuration dynamically, or fine-grained ACL rules, just let me know!

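🚀 Hands-on: performance tuning script

Below is a complete shell script, kafka_perf_tuner.sh, that monitors key Kafka metrics, adjusts core parameters in server.properties, and validates the tuning with load tests:
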
#!/bin/bash
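# kafka_perf_tuner.sh - Kafka performance monitoring and tuning script
# Features: monitor key metrics, adjust parameters, run load tests to verify

set -e

# Configuration variables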
KAFKA_HOME=${KAFKA_HOME:-"/opt/kafka"}
BROKER_HOST=${BROKER_HOST:-"localhost"}
BROKER_PORT=${BROKER_PORT:-"9092"}
TOPIC=${TOPIC:-"perf-test-topic"}
PARTITIONS=${PARTITIONS:-3}
REPLICATION_FACTOR=${REPLICATION_FACTOR:-1}
ZOOKEEPER=${ZOOKEEPER:-"localhost:2181"}

# Colored output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

log_info() {
    echo -e "${BLUE}[INFO]${NC} $1"
}

log_success() {
    echo -e "${GREEN}[SUCCESS]${NC} $1"
}

log_warn() {
    echo -e "${YELLOW}[WARN]${NC} $1"
}

log_error() {
    echo -e "${RED}[ERROR]${NC} $1"
}

# 1. Monitoring
monitor_kafka() {
    log_info "Monitoring key Kafka metrics..."

    echo -e "\n${GREEN}=== Kafka broker status ===${NC}"
    ${KAFKA_HOME}/bin/kafka-broker-api-versions.sh --bootstrap-server ${BROKER_HOST}:${BROKER_PORT}

    echo -e "\n${GREEN}=== Topic status ===${NC}"
    ${KAFKA_HOME}/bin/kafka-topics.sh --bootstrap-server ${BROKER_HOST}:${BROKER_PORT} --list

    echo -e "\n${GREEN}=== Consumer group lag ===${NC}"
    ${KAFKA_HOME}/bin/kafka-consumer-groups.sh --bootstrap-server ${BROKER_HOST}:${BROKER_PORT} --all-groups --describe

    echo -e "\n${GREEN}=== Network connection statistics ===${NC}"
    netstat -an | grep ":${BROKER_PORT}" | awk '{print $6}' | sort | uniq -c | sort -rn

    echo -e "\n${GREEN}=== System resources ===${NC}"
    echo "CPU usage: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}')%"
    echo "Memory usage: $(free -h | awk '/^Mem:/ {print $3"/"$2}')"
    # %util is the last column of `iostat -x`; its position varies between sysstat versions
    echo "Disk I/O: $(iostat -d -x 1 1 | awk '/^[a-z]/ {print $1": "$NF"% util"}')"
}

# 2. Parameter tuning (edits server.properties; takes effect after a broker restart)

# Replace "key=value" in the config file if the key exists, otherwise append it,
# so that keys missing from server.properties are still applied.
set_prop() {
    local key=$1 value=$2 file=$3
    if grep -q "^${key}=" "${file}"; then
        sed -i "s|^${key}=.*|${key}=${value}|" "${file}"
    else
        echo "${key}=${value}" >> "${file}"
    fi
}

tune_parameters() {
    local CONFIG_FILE="${KAFKA_HOME}/config/server.properties"
    local TUNE_MODE=${1:-"balanced"}  # balanced/throughput/latency
    local THREADS BUFFER COMPRESSION

    log_info "Tuning parameters (mode: ${TUNE_MODE})..."

    # Back up the original config
    cp "${CONFIG_FILE}" "${CONFIG_FILE}.backup.$(date +%Y%m%d_%H%M%S)"

    # batch.size and linger.ms are producer-side settings and are not read from
    # server.properties; they are passed via --producer-props in the producer test.
    # The broker-side value for "no compression" is "uncompressed", not "none".
    case ${TUNE_MODE} in
        "throughput")  # throughput first
            THREADS=8; BUFFER=2097152; COMPRESSION=lz4
            ;;
        "latency")     # latency first
            THREADS=4; BUFFER=1048576; COMPRESSION=uncompressed
            ;;
        *)             # balanced
            THREADS=6; BUFFER=1572864; COMPRESSION=snappy
            ;;
    esac

    set_prop num.io.threads "${THREADS}" "${CONFIG_FILE}"
    set_prop num.network.threads "${THREADS}" "${CONFIG_FILE}"
    set_prop socket.send.buffer.bytes "${BUFFER}" "${CONFIG_FILE}"
    set_prop socket.receive.buffer.bytes "${BUFFER}" "${CONFIG_FILE}"
    set_prop compression.type "${COMPRESSION}" "${CONFIG_FILE}"

    # Apply OS-level tuning (requires root)
    log_info "Applying OS-level tuning..."
    sysctl -w vm.swappiness=1
    sysctl -w net.core.rmem_max=2097152
    sysctl -w net.core.wmem_max=2097152
    sysctl -w net.ipv4.tcp_rmem="4096 87380 2097152"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 2097152"

    log_success "Parameters updated. Restart the Kafka broker for them to take effect"
    log_warn "Run: systemctl restart kafka, or ${KAFKA_HOME}/bin/kafka-server-stop.sh && ${KAFKA_HOME}/bin/kafka-server-start.sh -daemon ${CONFIG_FILE}"
}

# 3. Create the load-test topic
create_perf_topic() {
    log_info "Creating load-test topic: ${TOPIC}"

    ${KAFKA_HOME}/bin/kafka-topics.sh \
        --bootstrap-server ${BROKER_HOST}:${BROKER_PORT} \
        --create \
        --topic ${TOPIC} \
        --partitions ${PARTITIONS} \
        --replication-factor ${REPLICATION_FACTOR} \
        --config retention.ms=3600000 \
        --config retention.bytes=-1 \
        --if-not-exists

    log_success "Topic ${TOPIC} is ready"

    # Show the topic details
    ${KAFKA_HOME}/bin/kafka-topics.sh \
        --bootstrap-server ${BROKER_HOST}:${BROKER_PORT} \
        --topic ${TOPIC} \
        --describe
}

# 4. Producer load test
run_producer_perf_test() {
    local NUM_RECORDS=${1:-100000}
    local RECORD_SIZE=${2:-1024}
    local THROUGHPUT=${3:-10000}

    log_info "Starting producer load test: ${NUM_RECORDS} records, ${RECORD_SIZE} bytes each"

    ${KAFKA_HOME}/bin/kafka-producer-perf-test.sh \
        --topic ${TOPIC} \
        --num-records ${NUM_RECORDS} \
        --record-size ${RECORD_SIZE} \
        --throughput ${THROUGHPUT} \
        --producer-props \
            bootstrap.servers=${BROKER_HOST}:${BROKER_PORT} \
            acks=all \
            batch.size=16384 \
            linger.ms=5 \
            compression.type=lz4 \
        --print-metrics

    log_success "Producer load test finished"
}

# 5. Consumer load test
run_consumer_perf_test() {
    local NUM_RECORDS=${1:-100000}

    log_info "Starting consumer load test: consuming ${NUM_RECORDS} records"

    ${KAFKA_HOME}/bin/kafka-consumer-perf-test.sh \
        --topic ${TOPIC} \
        --bootstrap-server ${BROKER_HOST}:${BROKER_PORT} \
        --messages ${NUM_RECORDS} \
        --group perf-test-group-$(date +%s) \
        --print-metrics

    log_success "Consumer load test finished"
}

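# 6. Full performance test
run_comprehensive_test() {
    log_info "=== Starting the full performance test ==="

    # Snapshot the initial state
    monitor_kafka

    # Create the test topic
    create_perf_topic

    echo -e "\n${GREEN}=== Test 1: baseline with the current configuration ===${NC}"
    run_producer_perf_test 50000 1024 5000
    run_consumer_perf_test 50000

    # Switch to throughput mode (rewrites server.properties; the broker must then be restarted)
    echo -e "\n${GREEN}=== Test 2: throughput-first mode ===${NC}"
    tune_parameters throughput
    log_warn "Restart Kafka so the new configuration takes effect, then continue..."
    read -p "Press Enter once the restart is complete..."

    run_producer_perf_test 100000 2048 10000
    run_consumer_perf_test 100000

    # Snapshot the final state
    monitor_kafka

    log_success "Full performance test finished!"
}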

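# 7. Cleanup
cleanup() {
    log_info "Cleaning up the test environment..."

    # Delete the test topic
    ${KAFKA_HOME}/bin/kafka-topics.sh \
        --bootstrap-server ${BROKER_HOST}:${BROKER_PORT} \
        --delete \
        --topic ${TOPIC} \
        --if-exists

    # Delete the consumer groups created by the consumer test. --group takes an
    # exact name, not a wildcard, so list the groups and delete the matching ones.
    local group
    for group in $(${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
            --bootstrap-server ${BROKER_HOST}:${BROKER_PORT} --list \
            | grep '^perf-test-group-' || true); do
        ${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
            --bootstrap-server ${BROKER_HOST}:${BROKER_PORT} \
            --delete \
            --group "${group}"
    done

    log_success "Cleanup complete"
}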

# 8. Usage
show_usage() {
    echo -e "${GREEN}Kafka performance tuning script: usage${NC}"
    echo "Usage: $0 [command] [args]"
    echo ""
    echo "Commands:"
    echo "  monitor          Monitor key Kafka metrics"
    echo "  tune [mode]      Adjust parameters (mode: balanced/throughput/latency)"
    echo "  producer-test    Run the producer load test"
    echo "  consumer-test    Run the consumer load test"
    echo "  full-test        Run the full performance test suite"
    echo "  cleanup          Clean up the test environment"
    echo "  help             Show this help message"
    echo ""
    echo "Examples:"
    echo "  $0 monitor"
    echo "  $0 tune throughput"
    echo "  $0 full-test"
    echo "  $0 cleanup"
}

# Main entry point
main() {
    case "$1" in
        "monitor")
            monitor_kafka
            ;;
        "tune")
            tune_parameters "$2"
            ;;
        "producer-test")
            create_perf_topic
            run_producer_perf_test "$2" "$3" "$4"
            ;;
        "consumer-test")
            run_consumer_perf_test "$2"
            ;;
        "full-test")
            run_comprehensive_test
            ;;
        "cleanup")
            cleanup
            ;;
        "help"|"")
            show_usage
            ;;
        *)
            log_error "Unknown command: $1"
            show_usage
            exit 1
            ;;
    esac
}

# Run main
main "$@"

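📋 What the script does

Core modules:

  1. Monitoring module (monitor_kafka)

    • Broker status check
    • Topic list and status
    • Consumer group lag monitoring
    • Network connection statistics
    • System resource monitoring (CPU / memory / disk I/O)
  2. Tuning module (tune_parameters)

    • Throughput-first mode: larger buffers, more threads, compression enabled
    • Latency-first mode: smaller buffers, compression disabled; for the lowest latency the producer should also send with linger.ms=0
    • Balanced mode: middle-ground settings that suit most scenarios
    • Backs up the original config automatically, so it can be restored by hand
  3. Load-test module

    • kafka-producer-perf-test: producer performance test
    • kafka-consumer-perf-test: consumer performance test
    • Record count, record size, and throughput cap are adjustable
  4. Full test suite (run_comprehensive_test)

    • Automated multi-round test comparison
    • Performance comparison across configuration modes

Key parameters: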

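# Network and thread tuning (broker, server.properties)
num.io.threads=8          # I/O threads (suggested: CPU cores × 2)
num.network.threads=8     # network threads
socket.send.buffer.bytes=2097152  # send buffer (2MB)
socket.receive.buffer.bytes=2097152 # receive buffer (2MB)

# Producer tuning (client-side settings)
batch.size=16384          # batch size (16KB)
linger.ms=5               # how long to wait for a batch to fill (ms)
compression.type=lz4      # compression codec (lz4/snappy/gzip)

# OS-level tuning
vm.swappiness=1           # reduce swapping
net.core.rmem_max=2097152 # maximum receive buffer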

🚀 Quick start:

# 1. Make the script executable
chmod +x kafka_perf_tuner.sh

# 2. Set environment variables (optional)
export KAFKA_HOME=/opt/kafka
export BROKER_HOST=your-kafka-host

# 3. Run the full test
./kafka_perf_tuner.sh full-test

# 4. Tune only (throughput mode)
./kafka_perf_tuner.sh tune throughput

# 5. Monitor the current state
./kafka_perf_tuner.sh monitor

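📊 Expected output metrics:

  • Producer throughput: records/sec, MB/sec
  • Producer latency: avg/50th/95th/99th percentile
  • Consumer throughput: MB/sec, records/sec
  • End-to-end latency (total time from produce to consume): not reported by the two perf-test tools above, so it has to be measured separately

⚠️ Notes:

  1. Parameter changes take effect only after the Kafka broker is restarted
  2. Make sure there is enough disk space before load testing
  3. In production, test during off-peak hours
  4. Watch system resources so that load testing does not make the service unavailable

You can save and use this script directly. Comparing performance data across configurations helps you find the Kafka parameter combination that best fits your workload!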
