Core Architecture
plaintext
┌────────────────────────────────────────────────────────────┐
│                     Kubernetes Cluster                     │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐                  │
│   │FluentBit│   │FluentBit│   │FluentBit│   (DaemonSet)    │
│   └───┬─────┘   └───┬─────┘   └───┬─────┘                  │
│       └─────────────┼─────────────┘                        │
│                     ▼                                      │
│             ┌────────────────┐                             │
│             │  Parse/Filter  │                             │
│             └───────┬────────┘                             │
└─────────────────────┼──────────────────────────────────────┘
                      ▼
┌────────────────────────────────────────────────────────────┐
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐      │
│   │Elasticsearch│   │   Kibana    │   │ Prometheus  │      │
│   │(log storage)│◄──│(dashboards) │   │  (metrics)  │      │
│   └─────────────┘   └─────────────┘   └─────────────┘      │
└────────────────────────────────────────────────────────────┘
1. Prometheus Deployment (kube-prometheus-stack)
1.1 Helm Installation
bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring
1.2 Custom Configuration
yaml
# values-prometheus.yaml
prometheus:
  prometheusSpec:
    replicas: 2
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: ssd
          resources:
            requests:
              storage: 100Gi
    resources:
      requests: {cpu: 500m, memory: 2Gi}
      limits: {cpu: 2, memory: 4Gi}
alertmanager:
  alertmanagerSpec:
    replicas: 2
grafana:
  adminPassword: "SecurePass123"  # placeholder; replace before deploying
  persistence:
    enabled: true
    storageClassName: ssd
    size: 10Gi
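The values file only takes effect once it is passed to Helm. A minimal sketch, assuming the release name `prometheus` and the `monitoring` namespace from section 1.1 and that the file is saved as `values-prometheus.yaml`; the function name `apply_prometheus_values` is made up for illustration:

```shell
# Hypothetical helper: apply a values file to the release from section 1.1.
apply_prometheus_values() {
  values_file="${1:-values-prometheus.yaml}"
  if [ ! -f "$values_file" ]; then
    echo "values file not found: $values_file" >&2
    return 1
  fi
  # `upgrade --install` updates the release if it exists and installs it otherwise.
  helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
    -n monitoring --create-namespace -f "$values_file"
}
```

Calling `apply_prometheus_values` without an argument uses `values-prometheus.yaml` in the current directory.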
1.3 ServiceMonitor Auto-Discovery
yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
  namespace: monitoring
  labels:
    release: prometheus  # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames: [production]
  endpoints:
  - port: metrics
    path: /metrics
    interval: 15s
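Whether `release: prometheus` is the right label depends on how the Prometheus resource was created. A sketch for comparing the two sides, assuming the stack runs in `monitoring` and the ServiceMonitor is named `app-monitor`; both function names are illustrative:

```shell
# Print the serviceMonitorSelector of every Prometheus object in a namespace.
show_servicemonitor_selector() {
  ns="${1:-monitoring}"
  kubectl get prometheus -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.serviceMonitorSelector}{"\n"}{end}'
}

# Print the labels that a given ServiceMonitor actually carries.
show_servicemonitor_labels() {
  name="${1:-app-monitor}"
  ns="${2:-monitoring}"
  kubectl get servicemonitor "$name" -n "$ns" --show-labels
}
```

If the labels printed by the second function do not satisfy the selector printed by the first, Prometheus will not pick up the ServiceMonitor.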
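1.4 Alerting Rules
yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: k8s-alerts
  namespace: monitoring
  labels:
    release: prometheus  # must match the Prometheus ruleSelector
spec:
  groups:
  - name: k8s-critical
    rules:
    - alert: PodCPUHigh
      # container!="" skips the pod-level cgroup series, which would double-count usage
      expr: |
        sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod)
          / sum(kube_pod_container_resource_limits{resource="cpu"}) by (namespace, pod) > 0.9
      for: 5m
      labels: {severity: warning}
      annotations:
        summary: "Pod {{ $labels.pod }} CPU > 90% of its limit"
    - alert: PodOOMKilled
      # a restart count alone does not prove an OOM kill, so also check the termination reason
      expr: |
        (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1)
          and on (namespace, pod, container)
        (increase(kube_pod_container_status_restarts_total[10m]) > 0)
      labels: {severity: critical}
    - alert: NodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 2m
      labels: {severity: critical}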
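2. EFK Stack Deployment
2.1 Elasticsearch (ECK)
yaml
# Requires the ECK operator and its CRDs to be installed in the cluster first.
apiVersion: v1
kind: Namespace
metadata: {name: logging}
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: efk-es
  namespace: logging
spec:
  version: 8.11.0
  nodeSets:
  - name: default
    count: 3
    config:
      # Security (TLS and authentication) is enabled and managed by ECK itself,
      # so xpack.security.enabled is not set here.
      # allow_mmap: false is only needed if the sysctl init container below is dropped.
      node.store.allow_mmap: false
    podTemplate:
      spec:
        initContainers:
        - name: sysctl
          # changing a kernel parameter requires a privileged container running as root
          securityContext: {privileged: true, runAsUser: 0}
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
        containers:
        - name: elasticsearch
          env: [{name: ES_JAVA_OPTS, value: "-Xms4g -Xmx4g"}]
          resources:
            requests: {memory: 8Gi, cpu: 2}
            limits: {memory: 16Gi, cpu: 4}
    volumeClaimTemplates:
    - metadata: {name: elasticsearch-data}
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ssd
        resources:
          requests: {storage: 500Gi}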
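After applying the manifest, it can take a few minutes for the cluster to become healthy. The sketch below checks the state that ECK reports and reads the generated password of the `elastic` user; it assumes the resource names used above (`efk-es` in `logging`), and the function names are illustrative:

```shell
# Show the health and phase that ECK reports for the cluster defined above.
es_status() {
  kubectl get elasticsearch efk-es -n logging
}

# Print the auto-generated password of the built-in `elastic` user.
es_password() {
  kubectl get secret efk-es-es-elastic-user -n logging \
    -o go-template='{{.data.elastic | base64decode}}'
}
```

The same Secret is what the Fluent Bit DaemonSet in section 2.2 reads through `secretKeyRef`.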
2.2 Fluent Bit DaemonSet
yaml
apiVersion: v1
kind: ServiceAccount
metadata: {name: fluent-bit, namespace: logging}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata: {name: fluent-bit}
rules:
- apiGroups: [""]
  resources: ["namespaces", "pods", "nodes"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata: {name: fluent-bit}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit
subjects: [{kind: ServiceAccount, name: fluent-bit, namespace: logging}]
---
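apiVersion: v1
kind: ConfigMap
metadata: {name: fluent-bit-config, namespace: logging}
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush           5
        Log_Level       info
        # load the parser definitions from the second key of this ConfigMap
        Parsers_File    parsers.conf
        http_server     On
        http_listen     0.0.0.0
        http_port       2020
        # exposes buffer statistics on /api/v1/storage
        storage.metrics On
    [INPUT]
        Name            tail
        Path            /var/log/containers/*.log
        # The docker parser expects Docker JSON logs; containerd/CRI-O nodes write CRI-format lines instead
        Parser          docker
        Tag             kube.*
        Mem_Buf_Limit   50MB
        Skip_Long_Lines On
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Merge_Log           On
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On
    [FILTER]
        Name         parser
        Match        kube.*
        Key_Name     log
        Parser       json
        Reserve_Data On
    [OUTPUT]
        Name               es
        Match              kube.*
        Host               efk-es-http.logging.svc
        Port               9200
        HTTP_User          elastic
        HTTP_Passwd        ${ELASTIC_PASSWORD}
        tls                On
        # Off skips certificate verification; in production, verify against the ECK CA via tls.ca_file
        tls.verify         Off
        Logstash_Format    On
        Logstash_Prefix    logs
        Replace_Dots       On
        # Elasticsearch 8.x rejects the _type field that the plugin sends by default
        Suppress_Type_Name On
  parsers.conf: |
    [PARSER]
        Name        json
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L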
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels: {app: fluent-bit}
  template:
    metadata:
      labels: {app: fluent-bit}
    spec:
      serviceAccountName: fluent-bit
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:2.1.10
        ports: [{name: metrics, containerPort: 2020}]
        env:
        - name: ELASTIC_PASSWORD
          valueFrom:
            secretKeyRef:
              name: efk-es-es-elastic-user
              key: elastic
        resources:
          requests: {cpu: 100m, memory: 100Mi}
          limits: {cpu: 500m, memory: 500Mi}
        volumeMounts:
        # mount the ConfigMap over the image's default config directory so it is actually used
        - {name: config, mountPath: /fluent-bit/etc/}
        - {name: varlogcontainers, mountPath: /var/log/containers, readOnly: true}
        - {name: varlogpods, mountPath: /var/log/pods, readOnly: true}
        - {name: varlogdocker, mountPath: /var/lib/docker/containers, readOnly: true}
      volumes:
      - {name: config, configMap: {name: fluent-bit-config}}
      - {name: varlogcontainers, hostPath: {path: /var/log/containers}}
      - {name: varlogpods, hostPath: {path: /var/log/pods}}
      - {name: varlogdocker, hostPath: {path: /var/lib/docker/containers}}
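Since a DaemonSet schedules one pod per node, it is worth confirming that the rollout finished everywhere before looking for logs in Elasticsearch. A small sketch using the names from the manifest above; the function names and the default timeout are illustrative choices:

```shell
# Block until every Fluent Bit pod is ready, or fail after the timeout.
wait_for_fluent_bit() {
  timeout="${1:-120s}"
  kubectl rollout status daemonset/fluent-bit -n logging --timeout="$timeout"
}

# List the Fluent Bit pods together with the node each one runs on.
list_fluent_bit_pods() {
  kubectl get pods -n logging -l app=fluent-bit -o wide
}
```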
2.3 Kibana
yaml
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: efk-kibana
  namespace: logging
spec:
  version: 8.11.0
  count: 2
  elasticsearchRef: {name: efk-es}
3. Common Commands
bash
# Prometheus (the pod name assumes the Helm release is called "prometheus")
kubectl get pods -n monitoring
kubectl exec -it prometheus-prometheus-kube-prometheus-prometheus-0 -n monitoring -c prometheus -- \
  wget -qO- localhost:9090/api/v1/targets
# PromQL, to be run in the Prometheus UI rather than in the shell:
#   Pod CPU usage in cores:
#     sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
#   Pod memory usage relative to its limit:
#     sum(container_memory_working_set_bytes{container!=""}) by (pod) / sum(kube_pod_container_resource_limits{resource="memory"}) by (pod)
# Fluent Bit logs
kubectl logs -n logging -l app=fluent-bit -f
# Elasticsearch health check
kubectl exec -it efk-es-es-default-0 -n logging -- \
  curl -k -u elastic:$(kubectl get secret efk-es-es-elastic-user -n logging -o jsonpath='{.data.elastic}' | base64 -d) \
  "https://localhost:9200/_cluster/health?pretty"
# Access the web UIs: Grafana at http://localhost:3000, Kibana at https://localhost:5601
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
kubectl port-forward -n logging svc/efk-kibana-kb-http 5601:5601
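The PromQL expressions listed among the commands above can also be evaluated from a shell through the Prometheus HTTP API. A minimal sketch, assuming Prometheus is reachable on `localhost:9090` (for example via a port-forward to the Prometheus service); the function name `promql_query` is made up for illustration:

```shell
# Evaluate a PromQL expression through the Prometheus HTTP API.
# Assumes Prometheus is reachable at $2 (default http://localhost:9090).
promql_query() {
  expr="$1"
  base_url="${2:-http://localhost:9090}"
  # -G sends the URL-encoded data as a query string on a GET request.
  curl -sG "$base_url/api/v1/query" --data-urlencode "query=$expr"
}
```

For example, `promql_query 'sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)'` returns the result as JSON.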
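4. Troubleshooting
4.1 Prometheus Target Down
bash
# Inspect the Prometheus pod and the ServiceMonitors it should pick up
# (the pod name assumes the Helm release is called "prometheus")
kubectl describe pod -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 | tail -20
kubectl get servicemonitor -n monitoring
# Common causes: ServiceMonitor labels do not match the selector, a NetworkPolicy blocks the scrape, or the target pod is not running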
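4.2 Fluent Bit Log Loss
bash
# Check the buffer state. The stock fluent-bit image ships without a shell or curl, so query the
# built-in HTTP server through a port-forward; /api/v1/storage requires storage.metrics On in [SERVICE].
kubectl port-forward -n logging fluent-bit-xxx 2020:2020 &
sleep 2
curl -s localhost:2020/api/v1/storage
kubectl logs -n logging -l app=fluent-bit | grep -E "(retry|error|full)"
# Common causes: buffer full (Mem_Buf_Limit too small), Elasticsearch writes blocking, records dropped after parse failures
# Fix: raise Mem_Buf_Limit to 100MB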
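4.3 Elasticsearch Storage Pressure
bash
# List indices. ECK serves HTTPS with a self-signed certificate, hence https:// and -k;
# "password" is a placeholder for the elastic user's password.
kubectl exec -it efk-es-es-default-0 -n logging -- \
  curl -sk -u elastic:password "https://localhost:9200/_cat/indices?v"
# ILM policy: roll over in the hot phase once a primary shard reaches 50 GB, move to warm after 7 days
# (shrink to one shard and force-merge), delete after 60 days.
# Run this inside the ES pod or through a port-forward. The policy only applies to indices whose settings
# reference it (index.lifecycle.name), and rollover additionally needs a write alias or a data stream.
curl -k -X PUT "https://localhost:9200/_ilm/policy/logs-policy" -u elastic:password -H 'Content-Type: application/json' -d'
{"policy":{"phases":{"hot":{"min_age":"0ms","actions":{"rollover":{"max_primary_shard_size":"50gb"}}},
"warm":{"min_age":"7d","actions":{"shrink":{"number_of_shards":1},"forcemerge":{"max_num_segments":1}}},
"delete":{"min_age":"60d","actions":{"delete":{}}}}}}'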
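5. Best Practices
- Storage: keep log disks separate from system disks; run Prometheus and ES on SSDs
- Resources: always set limits, e.g. Prometheus 4Gi / 2 cores and ES 16Gi / 4 cores as configured above
- ILM: always configure index lifecycle management in ES so that disks do not fill up
- High availability: at least 3 ES nodes and 2 Prometheus replicas; run Alertmanager in cluster mode
- Filtering: drop debug-level logs in production to reduce storage pressure
- Correlation: link Prometheus alerts to the matching Kibana log query (for example through an alert annotation) so that an alert leads straight to the relevant logs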
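For the ILM recommendation, one way to confirm that a policy is actually being applied is to ask Elasticsearch for the lifecycle state of each index. A sketch, assuming it runs where `localhost:9200` reaches the cluster (inside the ES pod or behind a port-forward), that the `ELASTIC_PASSWORD` environment variable holds the `elastic` user's password, and that indices follow the `logs-*` pattern produced by the Fluent Bit output; the function name is illustrative:

```shell
# Ask Elasticsearch which ILM policy, phase and step each matching index is in.
ilm_explain() {
  pattern="${1:-logs-*}"
  # -k accepts the self-signed certificate that ECK generates by default.
  curl -sk -u "elastic:${ELASTIC_PASSWORD}" \
    "https://localhost:9200/${pattern}/_ilm/explain?pretty"
}
```

Indices that are not managed by any policy are reported with `"managed": false`.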
One-Click Deployment
bash
#!/bin/bash
# Prerequisite: the ECK operator and its CRDs must already be installed,
# otherwise the Elasticsearch and Kibana resources below are rejected.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
kubectl create namespace logging
kubectl apply -f elasticsearch.yaml -f fluent-bit.yaml -f kibana.yaml
# Grafana: kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Kibana:  kubectl port-forward -n logging svc/efk-kibana-kb-http 5601:5601