Environment Overview
Host IP | Hostname | Node Role | Data Directory | Kubernetes Node Label |
---|---|---|---|---|
192.168.10.100 | zk1 | Master | /opt/zookeeper/data | zk-cluster=true |
192.168.10.101 | zk2 | Worker | /opt/zookeeper/data | zk-cluster=true |
192.168.10.102 | zk3 | Worker | /opt/zookeeper/data | zk-cluster=true |
192.168.10.103 | zk4 | Worker | /opt/zookeeper/data | zk-cluster=true |
192.168.10.104 | zk5 | Worker | /opt/zookeeper/data | zk-cluster=true |
I. Base Environment Setup (All Nodes)
1. System Configuration
bash
# Set the hostname (run on each machine with its own hostname)
sudo hostnamectl set-hostname zk1

# Edit the hosts file
sudo tee -a /etc/hosts <<EOF
192.168.10.100 zk1
192.168.10.101 zk2
192.168.10.102 zk3
192.168.10.103 zk4
192.168.10.104 zk5
EOF

# Disable SELinux
sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config

# Tune kernel parameters
sudo tee -a /etc/sysctl.conf <<EOF
net.core.somaxconn=65535
net.ipv4.tcp_max_syn_backlog=65535
vm.swappiness=1
EOF
sudo sysctl -p
2. Docker Installation
bash
# Install dependencies
sudo dnf install -y yum-utils device-mapper-persistent-data lvm2
# Add the Docker repository
sudo yum-config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo

# Install Docker
sudo dnf install -y docker-ce docker-ce-cli containerd.io

# Configure Docker
sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json <<EOF
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}
EOF

# Start and enable Docker
sudo systemctl start docker
sudo systemctl enable docker
3. Kubernetes Component Installation
bash
# Disable swap
sudo swapoff -a
sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab

# Install kubeadm/kubelet/kubectl
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
sudo dnf install -y kubelet kubeadm kubectl --disableexcludes=kubernetes
sudo systemctl enable --now kubelet

# Initialize the master node (run on zk1 only)
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 \
  --control-plane-endpoint="zk1:6443" \
  --upload-certs \
  --apiserver-advertise-address=192.168.10.100

# Configure kubectl (run on zk1)
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install the network plugin
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

# Join worker nodes (run on zk2-zk5)
# Use the join command printed by kubeadm init
kubeadm join zk1:6443 --token <token> --discovery-token-ca-cert-hash <hash>
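The ZookeeperCluster manifest later in this guide schedules pods through the node selector zk-cluster: "true" listed in the environment table, so that label has to be applied to the nodes first. A minimal sketch (node names follow the table; note that with a default kubeadm install zk1 still carries the control-plane NoSchedule taint, so pods will only land there if that taint is removed):

bash
# Label the nodes that should run ZooKeeper pods (run on zk1)
for node in zk1 zk2 zk3 zk4 zk5; do
  kubectl label node "$node" zk-cluster=true
done

# Verify the labels
kubectl get nodes -L zk-cluster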
II. Zookeeper Operator Deployment
1. Install the Zookeeper Operator
bash
# Create the namespace
kubectl create ns zookeeper-operator
# Deploy the Operator
kubectl apply -f https://raw.githubusercontent.com/pravega/zookeeper-operator/master/deploy/all_ns/rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/pravega/zookeeper-operator/master/deploy/all_ns/operator.yaml
# Verify the Operator status
kubectl get pods -n zookeeper-operator
2. Create the ZookeeperCluster Resource
zookeeper-cluster.yaml:
yaml
apiVersion: zookeeper.pravega.io/v1beta1
kind: ZookeeperCluster
metadata:
  name: zookeeper-cluster
  namespace: default
spec:
  replicas: 5
  image:
    repository: zookeeper
    tag: 3.8.0
  persistence:
    storageClassName: local-storage
    volumeReclaimPolicy: Retain
    size: 20Gi
  config:
    initLimit: 15
    syncLimit: 5
    tickTime: 2000
    autopurge:
      snapRetainCount: 10
      purgeInterval: 24
  pod:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - zookeeper
          topologyKey: kubernetes.io/hostname
    nodeSelector:
      zk-cluster: "true"
    securityContext:
      runAsUser: 1000
      fsGroup: 1000
    resources:
      requests:
        memory: "2Gi"
        cpu: "1"
      limits:
        memory: "4Gi"
        cpu: "2"
  security:
    enable: true
    jaasConfig:
      secretRef: zk-jaas-secret
    tlsConfig:
      enable: true
      secretRef: zk-tls-secret
  metrics:
    enable: true
    port: 7000
3. Create Security Configuration
bash
# JAAS authentication configuration
kubectl create secret generic zk-jaas-secret \
--from-literal=jaas-config="Server {
org.apache.zookeeper.server.auth.DigestLoginModule required
user_admin=\"adminpassword\"
  user_appuser=\"apppassword\";
};"

# TLS certificate configuration
# (generate keystore.jks in advance)
kubectl create secret generic zk-tls-secret \
  --from-file=keystore.jks=keystore.jks \
  --from-literal=keystore-password=changeit
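The keystore referenced above has to exist before the secret is created. A minimal sketch using the JDK keytool, assuming a self-signed certificate is acceptable (the alias, CN, validity, and the changeit password are placeholder assumptions):

bash
# Generate a self-signed keystore for ZooKeeper TLS (placeholder values)
keytool -genkeypair \
  -alias zookeeper \
  -keyalg RSA -keysize 2048 \
  -validity 365 \
  -dname "CN=zookeeper-cluster.default.svc.cluster.local" \
  -keystore keystore.jks \
  -storepass changeit -keypass changeit

# Inspect the result
keytool -list -v -keystore keystore.jks -storepass changeit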
4. Create the Storage Class
local-storage.yaml:
yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
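Because kubernetes.io/no-provisioner does not create volumes dynamically, a PersistentVolume has to be prepared on each labeled node ahead of time, pointing at the /opt/zookeeper/data directory from the environment table. A minimal per-node sketch (the PV name and 20Gi capacity mirror the cluster spec and are assumptions):

yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: zk-local-pv-zk1              # one PV per node, e.g. zk1..zk5
spec:
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /opt/zookeeper/data        # data directory from the environment table
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - zk1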
5. Deploy the Cluster
bash
kubectl apply -f local-storage.yaml
kubectl apply -f zookeeper-cluster.yaml
# Check cluster status
kubectl get zookeepercluster
kubectl get pods -l app=zookeeper
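Once all five pods are Running, it is worth confirming that a quorum has actually formed. A quick sketch using the same zkServer.sh status check used later in this guide (pod names assume the default <cluster-name>-N StatefulSet naming):

bash
# Ask each member for its role; exactly one should report "leader"
for i in 0 1 2 3 4; do
  kubectl exec zookeeper-cluster-$i -- zkServer.sh status
done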
III. Automated Operations
1. Automatic Scaling
bash
# Horizontal scaling
kubectl patch zk zookeeper-cluster --type='merge' -p '{"spec":{"replicas":7}}'

# Vertical scaling
kubectl patch zk zookeeper-cluster --type='merge' -p '{"spec":{"pod":{"resources":{"limits":{"memory":"8Gi"}}}}}'
2. Automatic Backup and Restore
zk-backup-job.yaml:
yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: zk-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: zookeeper:3.8.0
            command: ["/bin/sh", "-c"]
            args:
            - |
              echo "Connecting to ${ZK_SERVER}"
              # health check before archiving: "ruok" should return "imok"
              echo "ruok" | nc ${ZK_SERVER} 2181
              tar czf /backup/$(date +%Y%m%d).tar.gz -C /data .
            volumeMounts:
            - name: backup-volume
              mountPath: /backup
            - name: data-volume
              mountPath: /data
          restartPolicy: OnFailure
          volumes:
          - name: backup-volume
            persistentVolumeClaim:
              claimName: zk-backup-pvc
          - name: data-volume
            persistentVolumeClaim:
              claimName: $(ZK_PVC)   # data PVC of the member being backed up
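The CronJob above covers the backup half; the restore half can be expressed as a one-off Job that unpacks a chosen archive back into a member's data PVC. A minimal sketch (the Job name, archive date, and PVC names are placeholder assumptions; the target member should be stopped or scaled down before restoring so it does not overwrite the restored snapshot):

yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: zk-restore
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: restore
        image: zookeeper:3.8.0
        command: ["/bin/sh", "-c"]
        args:
        - |
          # unpack the chosen archive back into the member's data directory
          tar xzf /backup/20240101.tar.gz -C /data
        volumeMounts:
        - name: backup-volume
          mountPath: /backup
        - name: data-volume
          mountPath: /data
      volumes:
      - name: backup-volume
        persistentVolumeClaim:
          claimName: zk-backup-pvc
      - name: data-volume
        persistentVolumeClaim:
          claimName: data-zookeeper-cluster-0   # hypothetical PVC of the member being restored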
3. Automatic Monitoring and Alerting
prometheus-monitoring.yaml:
yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: zookeeper-monitor
spec:
  selector:
    matchLabels:
      app: zookeeper
  endpoints:
  - port: metrics
    interval: 15s
  namespaceSelector:
    any: true
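The ServiceMonitor only handles scraping; for the alerting half, a PrometheusRule can fire when a member stops responding. A minimal sketch (the rule name, job-label regex, and 5-minute threshold are assumptions; only the generic up metric is used, so it does not depend on which ZooKeeper metrics the exporter on port 7000 exposes):

yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: zookeeper-alerts
spec:
  groups:
  - name: zookeeper
    rules:
    - alert: ZookeeperMemberDown
      expr: up{job=~".*zookeeper.*"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "ZooKeeper target {{ $labels.instance }} has been unreachable for 5 minutes"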
4. Automatic Certificate Rotation
bash
# Rolling restart after the certificate is updated
kubectl patch zk zookeeper-cluster --type='merge' -p '{"spec":{"tlsConfig":{"certUpdated":true}}}'
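Before triggering the rolling restart, the zk-tls-secret has to be refreshed with the renewed keystore. One common pattern is to rebuild the secret in place (the keystore-new.jks filename and changeit password are assumptions, matching the earlier example):

bash
# Replace the TLS secret with the renewed keystore, then apply the patch above
kubectl create secret generic zk-tls-secret \
  --from-file=keystore.jks=keystore-new.jks \
  --from-literal=keystore-password=changeit \
  --dry-run=client -o yaml | kubectl apply -f -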
IV. Security Compliance and Disaster Recovery
1. Security Hardening
yaml
# Add security settings to the ZookeeperCluster spec
spec:
  security:
    enable: true
    jaasConfig:
      secretRef: zk-jaas-secret
    tlsConfig:
      enable: true
      secretRef: zk-tls-secret
  networkPolicy:
    enabled: true
    allowedClients:
    - 192.168.10.0/24
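If the operator version in use does not expose a networkPolicy field, the same restriction can be enforced with a plain Kubernetes NetworkPolicy. A minimal sketch limiting client traffic to 192.168.10.0/24 while keeping quorum traffic between members open (the policy name and port list are assumptions):

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: zookeeper-allow-clients
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: zookeeper
  policyTypes:
  - Ingress
  ingress:
  # client connections from the allowed subnet
  - from:
    - ipBlock:
        cidr: 192.168.10.0/24
    ports:
    - protocol: TCP
      port: 2181
  # peer traffic between members (quorum and leader-election ports)
  - from:
    - podSelector:
        matchLabels:
          app: zookeeper
    ports:
    - protocol: TCP
      port: 2888
    - protocol: TCP
      port: 3888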
2. Cross-Cluster Disaster Recovery
yaml
apiVersion: zookeeper.pravega.io/v1beta1
kind: ZookeeperCluster
metadata:
  name: zookeeper-dr
spec:
  replicas: 3
  config:
    # Run the DR members in observer mode
    peerType: observer
    # Connect to the primary cluster
    initConfig: |
      server.1=zk1:2888:3888:participant;2181
      server.2=zk2:2888:3888:participant;2181
      server.3=zk3:2888:3888:participant;2181
      server.4=dr-zk1:2888:3888:observer;2181
      server.5=dr-zk2:2888:3888:observer;2181
      server.6=dr-zk3:2888:3888:observer;2181
V. Routine Operations
1. Cluster Status Checks
bash
# View cluster status
kubectl get zookeepercluster
kubectl describe zk zookeeper-cluster
# Check node roles
kubectl exec zookeeper-cluster-0 -- zkServer.sh status
2. Log Management
bash
# Tail live logs
kubectl logs -f zookeeper-cluster-0
# Log archiving is managed automatically by the Operator
3. Hot Configuration Updates
bash
# Trigger an update after changing the configuration
kubectl patch zk zookeeper-cluster --type='merge' -p '{"spec":{"config":{"tickTime":"3000"}}}'
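A quick way to confirm the new value has actually rolled out is to read the rendered configuration back from a member. A minimal sketch (the /conf/zoo.cfg path assumes the stock zookeeper image layout, where ZOO_CONF_DIR is /conf):

bash
# Confirm the updated tickTime is present in the rendered config
kubectl exec zookeeper-cluster-0 -- grep tickTime /conf/zoo.cfg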
VI. Scaling and Upgrades
1. Cluster Upgrade Procedure
bash
# Rolling upgrade to a new version
kubectl patch zk zookeeper-cluster --type='merge' -p '{"spec":{"image":{"tag":"3.9.0"}}}'

# Monitor upgrade progress
kubectl get pods -w -l app=zookeeper
2. Multi-Cluster Management
bash
# Deploy multiple Zookeeper clusters
kubectl apply -f zookeeper-cluster-app1.yaml
kubectl apply -f zookeeper-cluster-app2.yaml
# Unified monitoring
kubectl apply -f zookeeper-global-monitor.yaml
VII. Backup and Recovery Plan
1. Full Cluster Backup with Velero
bash
# Install Velero
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.0.0 \
  --bucket zk-backups \
  --secret-file ./credentials-velero \
  --use-restic

# Create a backup
velero backup create zk-full-backup --include-namespaces default --selector app=zookeeper

# Disaster recovery
velero restore create --from-backup zk-full-backup
2. Data Migration
bash
# Using the zkTransfer tool
kubectl exec zookeeper-cluster-0 -- zkTransfer.sh \
--source zk1:2181 \
--target new-zk1:2181 \
  --path /critical_data \
  --parallel 8
Operations Checklist
Check Item | Frequency | Command / Method |
---|---|---|
Cluster health status | Daily | kubectl get zk |
Node resource usage | Daily | kubectl top pods |
Certificate expiry check | Monthly | keytool -list -v -keystore |
Backup/restore test | Quarterly | Velero restore drill |
Security vulnerability scan | Monthly | Trivy image scan |
Failover drill | Every six months | Simulate node failure |
Performance load test | Yearly | ZooKeeper benchmark tools |
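The daily items lend themselves to a small script run from a scheduled job; a minimal sketch combining the commands from the checklist above (output format and the metrics-server requirement are assumptions):

bash
#!/usr/bin/env bash
# Daily ZooKeeper cluster check: CR status, pod readiness, resource usage
set -euo pipefail

echo "== ZookeeperCluster resources =="
kubectl get zookeepercluster

echo "== Pod status =="
kubectl get pods -l app=zookeeper -o wide

echo "== Resource usage (requires metrics-server) =="
kubectl top pods -l app=zookeeper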
With a Kubernetes Operator handling full lifecycle automation of the Zookeeper cluster, Velero covering disaster recovery, and Prometheus providing monitoring, operational efficiency improves significantly. Recommendations for production environments:
- Manage secrets with HashiCorp Vault
- Deploy the cluster across multiple availability zones
- Integrate Open Policy Agent for policy management
- Manage configuration with a GitOps workflow (Argo CD); see the sketch below
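As a concrete starting point for the GitOps recommendation, an Argo CD Application can keep the ZooKeeper manifests in sync from a Git repository. A minimal sketch (the repository URL, path, and project are placeholders):

yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: zookeeper-cluster
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/zookeeper-config.git   # placeholder repository
    targetRevision: main
    path: clusters/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true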