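26.1 Docker Log Management
26.1.1 Logging Driver Overview
Docker supports multiple logging drivers, which determine where a container's stdout and stderr output is collected and stored.
Common logging drivers: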
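| Driver | Description | Typical use |
|---|---|---|
| json-file | Default; stores logs as JSON files on the host | Development environments |
| syslog | Sends logs to the system syslog | Centralized logging on the local host |
| journald | Writes to the systemd journal | systemd-based hosts |
| fluentd | Forwards logs to a Fluentd collector | Cloud-native architectures |
| splunk | Sends logs to Splunk | Enterprise environments |
| awslogs | Sends logs to AWS CloudWatch Logs | AWS cloud environments |
| none | Disables logging | Containers with no logging needs |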
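26.1.2 Configuring the Logging Driver
Global configuration (in /etc/docker/daemon.json; restart the Docker daemon for it to take effect, and note that only newly created containers pick it up):
json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "compress": "true"
  }
}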
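The rotation options above bound how much disk space one container's logs can take: roughly max-size multiplied by max-file. A minimal Python sketch of that arithmetic follows. It assumes the k/m/g suffixes are 1024-based multiples, and the helper names `parse_size` and `max_log_bytes` are illustrative, not part of Docker.

```python
def parse_size(value: str) -> int:
    """Convert a Docker size string such as "10m" or "1g" to bytes."""
    units = {"k": 1024, "m": 1024**2, "g": 1024**3}
    value = value.strip().lower()
    if value and value[-1] in units:
        return int(float(value[:-1]) * units[value[-1]])
    return int(value)  # plain byte count


def max_log_bytes(max_size: str, max_file: int) -> int:
    """Approximate upper bound on uncompressed log storage for one container."""
    return parse_size(max_size) * max_file


print(max_log_bytes("10m", 3))  # 31457280 bytes, i.e. 30 MiB
```

With compression enabled the rotated files take less space than this bound, so the figure is best read as a worst case per container.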
Per-container configuration:
bash
# Use the syslog driver
docker run -d \
  --log-driver syslog \
  --log-opt syslog-address=tcp://192.168.1.10:514 \
  nginx
# json-file with log rotation
docker run -d \
  --log-driver json-file \
  --log-opt max-size=10m \
  --log-opt max-file=3 \
  nginx
Compose configuration:
yaml
services:
  web:
    image: nginx
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
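26.1.3 Viewing Container Logs
bash
# Show the logs of the container named nginx
docker logs nginx
# Follow the logs in real time
docker logs -f nginx
# Show timestamps
docker logs -t nginx
# Last 100 lines
docker logs --tail 100 nginx
# Logs since a point in time or a relative duration
docker logs --since 2024-01-01T00:00:00 nginx
docker logs --since 1h nginx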
# Logs from several Compose services (with the legacy standalone binary, use docker-compose)
docker compose logs -f web db redis
26.2 Centralized Log Collection
26.2.1 Using the ELK Stack
Deploying ELK:
yaml
# elk-stack.yml
version: '3.8'
services:
  elasticsearch:
    image: elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
    volumes:
      - es-data:/usr/share/elasticsearch/data
  logstash:
    image: logstash:8.11.0
    ports:
      - "5000:5000/tcp"
      - "5000:5000/udp"
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    depends_on:
      - elasticsearch
  kibana:
    image: kibana:8.11.0
    ports:
      - "5601:5601"
    environment:
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200
    depends_on:
      - elasticsearch
volumes:
  es-data:
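Logstash configuration:
ruby
# logstash.conf
input {
  tcp {
    port => 5000
    # The Docker syslog driver sends plain syslog lines, not JSON,
    # so no json codec is set and each line becomes one event.
  }
}
filter {
  # Only applies when the shipper attaches Docker metadata to the event
  if [docker][container][name] {
    mutate {
      add_field => { "container_name" => "%{[docker][container][name]}" }
    }
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "docker-logs-%{+YYYY.MM.dd}"
  }
}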
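The index setting in the output section creates one Elasticsearch index per day. The sketch below shows, in Python, which index name a given event would land in. It assumes the date is taken from the event timestamp in UTC, and `logstash_index` is an illustrative helper, not a Logstash API.

```python
from datetime import datetime, timedelta, timezone


def logstash_index(event_time: datetime, prefix: str = "docker-logs-") -> str:
    """Return the daily index name for an event timestamp.

    The timestamp is converted to UTC before formatting, mirroring the
    %{+YYYY.MM.dd} pattern used in the output section.
    """
    return prefix + event_time.astimezone(timezone.utc).strftime("%Y.%m.%d")


utc_event = datetime(2024, 1, 15, 23, 30, tzinfo=timezone.utc)
print(logstash_index(utc_event))  # docker-logs-2024.01.15

# 07:30 on 16 Jan at UTC+8 is still 15 Jan in UTC
local_event = datetime(2024, 1, 16, 7, 30, tzinfo=timezone(timedelta(hours=8)))
print(logstash_index(local_event))  # docker-logs-2024.01.15
```

This matters when searching in Kibana: events logged shortly after local midnight may sit in the previous day's index.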
Application container configuration:
bash
# Send logs to Logstash through the syslog driver
docker run -d \
  --log-driver syslog \
  --log-opt syslog-address=tcp://localhost:5000 \
  --log-opt tag="{{.Name}}" \
  nginx
26.2.2 Using Grafana Loki
Deploying the Loki stack:
yaml
# loki-stack.yml
version: '3.8'
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki
  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - grafana-data:/var/lib/grafana
volumes:
  loki-data:
  grafana-data:
Promtail configuration:
yaml
# promtail-config.yml
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: docker
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - json:
          expressions:
            stream: stream
            log: log
      - output:
          source: log
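The json stage above reads the fields that the json-file driver writes for every log line. As a rough illustration of what gets extracted, here is a Python sketch that parses one such line. The sample line is made up, and `parse_json_file_line` is an illustrative helper rather than anything Promtail provides.

```python
import json


def parse_json_file_line(line: str) -> dict:
    """Extract the fields of one json-file log line."""
    record = json.loads(line)
    return {
        "stream": record.get("stream"),              # "stdout" or "stderr"
        "log": record.get("log", "").rstrip("\n"),   # the original message
        "time": record.get("time"),                  # timestamp written by Docker
    }


# Hypothetical line from /var/lib/docker/containers/<id>/<id>-json.log
sample = '{"log":"GET / HTTP/1.1 200\\n","stream":"stdout","time":"2024-01-01T00:00:00.000000000Z"}'
parsed = parse_json_file_line(sample)
print(parsed["stream"], "|", parsed["log"])  # stdout | GET / HTTP/1.1 200
```

The output stage then keeps only the `log` value as the line sent to Loki, which is why the JSON wrapper does not show up in Grafana.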
26.3 Monitoring with Prometheus
26.3.1 Monitoring Architecture
Complete monitoring stack:
yaml
# monitoring-stack.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Referenced by rule_files in prometheus.yml
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml
      # Needed by docker_sd_configs; the Prometheus user must be able to read the socket
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    privileged: true
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      # Change the default password outside local test setups
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana-dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
volumes:
  prometheus-data:
  grafana-data:
Prometheus configuration:
yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
rule_files:
  - 'alert-rules.yml'
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
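26.3.2 Alerting Rules
yaml
# alert-rules.yml
groups:
  - name: container-alerts
    interval: 30s
    rules:
      - alert: ContainerDown
        # up is reported per scrape target
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} is down"
      - alert: HighMemoryUsage
        # Containers without a memory limit report a limit of 0 and are excluded
        expr: (container_memory_usage_bytes{name!=""} / (container_spec_memory_limit_bytes{name!=""} > 0)) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} memory usage is above 90%"
      - alert: HighCPUUsage
        # 0.8 corresponds to 80% of one CPU core
        expr: sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} has sustained high CPU load"
      - alert: ContainerRestarting
        # container_start_time_seconds is a timestamp gauge, so count how often it changes;
        # the threshold of 2 restarts in 15 minutes is an example value
        expr: changes(container_start_time_seconds{name!=""}[15m]) > 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is restarting frequently"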
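The `for` clause in these rules delays firing until the expression has been true continuously for the given duration. The following Python sketch is a simplified model of that behavior, not how Prometheus implements it; the function name `alert_states` and the sample values are illustrative.

```python
def alert_states(samples, threshold, for_seconds):
    """Label each (timestamp, value) sample as inactive, pending, or firing.

    The condition must hold continuously for `for_seconds` before the
    alert moves from pending to firing; any sample at or below the
    threshold resets it.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None
            states.append("inactive")
    return states


# Memory ratio sampled every 60 s, checked against 0.9 with for: 5m
samples = [(0, 0.95), (60, 0.96), (120, 0.5), (180, 0.93), (240, 0.94),
           (300, 0.95), (360, 0.97), (420, 0.96), (480, 0.98)]
print(alert_states(samples, threshold=0.9, for_seconds=300))
```

In this run the dip at 120 s resets the timer, so the alert stays pending until 480 s, when the condition has held for a full 300 s. Short spikes therefore never reach Alertmanager.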
Alertmanager configuration:
yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'ops@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'  # placeholder; keep real credentials out of version control
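The group_by setting decides which alerts are bundled into a single notification. A minimal Python sketch of that grouping idea is shown below. It only models the bucketing by label values and ignores timing (group_wait, group_interval); the alert label sets are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "instance")):
    """Bucket alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [  # hypothetical label sets
    {"alertname": "HighCPUUsage", "instance": "cadvisor:8080", "name": "web"},
    {"alertname": "HighCPUUsage", "instance": "cadvisor:8080", "name": "db"},
    {"alertname": "ContainerDown", "instance": "node-exporter:9100"},
]
for key, members in group_alerts(alerts).items():
    print(key, len(members))
```

Here the two HighCPUUsage alerts share both label values and would arrive in one email, while ContainerDown forms its own group.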
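26.3.3 Common Queries
promql
# Container CPU usage (percent of one core)
sum by (name) (rate(container_cpu_usage_seconds_total{name="nginx"}[5m])) * 100
# Container memory usage (MiB)
container_memory_usage_bytes{name="nginx"} / 1024 / 1024
# Memory usage as a percentage of the limit (containers without a limit are excluded)
(container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0)) * 100
# Network receive rate (bytes/s)
rate(container_network_receive_bytes_total{name="nginx"}[5m])
# Disk write throughput (bytes/s)
rate(container_fs_writes_bytes_total{name="nginx"}[5m])
# Container restarts in the last hour (changes of the start timestamp)
changes(container_start_time_seconds{name="nginx"}[1h])
# Number of containers per image
count(container_last_seen{image!=""}) by (image)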
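Queries like these can also be run from scripts through the Prometheus HTTP API endpoint /api/v1/query. The Python sketch below builds the request URL and parses an instant-vector response. The base URL and the sample response body are assumptions for illustration, and `query_url` and `parse_vector` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode


def query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body: str) -> dict:
    """Map each series' `name` label to its value from an instant-vector response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    return {
        series["metric"].get("name", ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


print(query_url("http://localhost:9090", 'container_memory_usage_bytes{name="nginx"}'))

# Hypothetical response body in the documented response shape
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"name": "nginx"}, "value": [1704067200, "52428800"]},
    ]},
})
print(parse_vector(sample))  # {'nginx': 52428800.0}
```

To fetch live data, the URL can be passed to `urllib.request.urlopen` and the decoded body handed to `parse_vector`. Sample values arrive as strings, hence the `float` conversion.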
26.4 Visualization with Grafana
26.4.1 Data Source Configuration
yaml
# grafana-datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
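26.4.2 Importing Dashboards
bash
# Open Grafana
# http://localhost:3000 (admin/admin)
# Import dashboards from grafana.com
# 1. Click "+" -> Import
# 2. Enter a dashboard ID:
#    - 193: Docker monitoring
#    - 11074: Node Exporter
#    - 13946: Loki logs
# Some dashboards need extra panel plugins, which grafana-cli can install
# (assumes the container is named grafana; restart Grafana afterwards)
docker exec grafana grafana-cli plugins install grafana-piechart-panel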
26.4.3 Custom Dashboards
json
{
  "dashboard": {
    "title": "Docker Container Monitoring",
    "panels": [
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "rate(container_cpu_usage_seconds_total[5m]) * 100"
          }
        ],
        "type": "timeseries"
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "container_memory_usage_bytes / 1024 / 1024"
          }
        ],
        "type": "timeseries"
      }
    ]
  }
}
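When a dashboard has many similar panels, the JSON can be generated rather than written by hand. The Python sketch below builds a payload of the same shape as the example above from (title, expression) pairs. The helper name `make_dashboard` is illustrative, and the `timeseries` panel type is an assumption that fits recent Grafana releases.

```python
import json


def make_dashboard(title, panels):
    """Build the dashboard payload from (panel title, PromQL expression) pairs."""
    return {
        "dashboard": {
            "title": title,
            "panels": [
                {"title": name, "type": "timeseries", "targets": [{"expr": expr}]}
                for name, expr in panels
            ],
        }
    }


payload = make_dashboard("Docker Container Monitoring", [
    ("CPU Usage", "rate(container_cpu_usage_seconds_total[5m]) * 100"),
    ("Memory Usage", "container_memory_usage_bytes / 1024 / 1024"),
])
print(json.dumps(payload, indent=2))
```

A production dashboard needs more fields per panel (grid position, data source, units), so this is only a starting point for templating.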
26.5 Performance Analysis Tools
26.5.1 Enhancing docker stats
bash
#!/bin/bash
# enhanced-stats.sh - enhanced container statistics
while true; do
  clear
  echo "=== Docker container resource usage ($(date)) ==="
  docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.NetIO}}\t{{.BlockIO}}" | \
    awk 'NR==1 {print; next} {
      used = $3                                    # used memory, e.g. "512.3MiB"
      mem_mb = used + 0                            # numeric prefix of the value
      if (used ~ /GiB$/) mem_mb *= 1024            # normalize to MiB
      else if (used ~ /KiB$/) mem_mb /= 1024
      else if (used !~ /MiB$/) mem_mb /= 1048576   # plain bytes
      if (mem_mb > 500) print "\033[31m" $0 "\033[0m"   # red warning above 500 MiB
      else print $0
    }'
  echo ""
  echo "Total containers: $(docker ps -q | wc -l)"
  echo "Total CPU usage: $(docker stats --no-stream --format '{{.CPUPerc}}' | sed 's/%//' | awk '{sum+=$1} END {print sum "%"}')"
  sleep 5
done
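The memory column of docker stats mixes units (KiB, MiB, GiB), so a threshold check has to normalize the value first. The Python sketch below shows one way to do that normalization. It assumes the "used / limit" layout with binary unit suffixes, and `mem_usage_mib` is an illustrative helper name.

```python
import re

_UNITS = {"B": 1 / 1024**2, "KiB": 1 / 1024, "MiB": 1, "GiB": 1024, "TiB": 1024**2}


def mem_usage_mib(mem_usage: str) -> float:
    """Return the used part of a `docker stats` MemUsage value in MiB."""
    used = mem_usage.split("/")[0].strip()   # "512.3MiB / 1.9GiB" -> "512.3MiB"
    match = re.fullmatch(r"([0-9.]+)\s*([A-Za-z]+)", used)
    if not match:
        raise ValueError(f"unexpected MemUsage value: {mem_usage!r}")
    number, unit = match.groups()
    return float(number) * _UNITS[unit]


for value in ("512.3MiB / 1.9GiB", "1.2GiB / 7.6GiB", "900KiB / 64MiB"):
    flag = "HIGH" if mem_usage_mib(value) > 500 else "ok"
    print(f"{value:>20}  {flag}")
```

The first two sample values exceed the 500 MiB threshold and are flagged, while the 900 KiB container is not. A plain numeric comparison without unit handling would miss the 1.2 GiB case.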
26.5.2 Profiling Script
bash
#!/bin/bash
# container-profiling.sh - container performance profiling
# ps --sort and netstat must be available inside the container (procps, net-tools)
CONTAINER=$1
if [ -z "$CONTAINER" ]; then
  echo "Usage: $0 <container-name>"
  exit 1
fi
echo "=== Container basics ==="
# docker inspect returns a JSON array, so select its first element
docker inspect "$CONTAINER" | jq '.[0] | {
  Name: .Name,
  Status: .State.Status,
  Image: .Config.Image,
  Created: .Created
}'
echo -e "\n=== Resource usage ==="
docker stats --no-stream "$CONTAINER"
echo -e "\n=== Processes (by memory) ==="
docker exec "$CONTAINER" ps aux --sort=-rss | head -n 10
echo -e "\n=== Processes (by CPU) ==="
docker exec "$CONTAINER" ps aux --sort=-pcpu | head -n 10
echo -e "\n=== Network connections ==="
docker exec "$CONTAINER" netstat -tunlp 2>/dev/null | head -n 20
echo -e "\n=== Disk usage ==="
docker exec "$CONTAINER" df -h
echo -e "\n=== Recent logs ==="
docker logs --tail 50 "$CONTAINER"
26.5.3 Using ctop
bash
# Install ctop
sudo wget https://github.com/bcicen/ctop/releases/download/v0.7.7/ctop-0.7.7-linux-amd64 \
  -O /usr/local/bin/ctop
sudo chmod +x /usr/local/bin/ctop
# Run ctop
ctop
# Keyboard shortcuts:
# a - show all containers (including stopped ones)
# s - sort by CPU, memory, or name
# Enter - view container details
# l - view logs
# e - open a shell in the container
26.6 Distributed Tracing
26.6.1 Deploying Jaeger
yaml
# jaeger.yml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"
      - "14268:14268"
      - "14250:14250"
      - "9411:9411"
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
  app:
    image: myapp:latest
    environment:
      - JAEGER_AGENT_HOST=jaeger
      - JAEGER_AGENT_PORT=6831
      - JAEGER_SAMPLER_TYPE=const
      - JAEGER_SAMPLER_PARAM=1
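26.7 ROCm GPU Monitoring
26.7.1 Collecting GPU Metrics
bash
#!/bin/bash
# rocm-gpu-monitor.sh - ROCm GPU monitoring
while true; do
  echo "=== $(date) ==="
  # GPU status
  echo "GPU utilization:"
  rocm-smi --showuse
  echo -e "\nGPU temperature:"
  rocm-smi --showtemp
  echo -e "\nGPU memory:"
  rocm-smi --showmeminfo vram
  echo -e "\nGPU power:"
  rocm-smi --showpower
  # Resource usage of GPU containers
  echo -e "\nContainer resources:"
  ids=$(docker ps --filter "ancestor=rocm/pytorch" -q)
  if [ -n "$ids" ]; then
    docker stats --no-stream $ids
  else
    # Without this guard, docker stats would list every container
    echo "(no running rocm/pytorch containers)"
  fi
  echo "================================"
  sleep 10
done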
26.7.2 Prometheus GPU Exporter
yaml
# rocm-monitoring.yml
version: '3.8'
services:
  rocm-exporter:
    build:
      context: .
      dockerfile: Dockerfile.rocm-exporter
    ports:
      - "9400:9400"
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
    volumes:
      - /opt/rocm:/opt/rocm:ro
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
ROCm Exporter Dockerfile:
dockerfile
FROM rocm/dev-ubuntu-22.04:latest
RUN pip3 install prometheus-client
COPY rocm-exporter.py /app/
CMD ["python3", "/app/rocm-exporter.py"]
Exporter script:
python
#!/usr/bin/env python3
# rocm-exporter.py
import subprocess
import re
from prometheus_client import start_http_server, Gauge
import time
gpu_utilization = Gauge('rocm_gpu_utilization', 'GPU utilization', ['gpu_id'])
gpu_temperature = Gauge('rocm_gpu_temperature', 'GPU temperature', ['gpu_id'])
gpu_memory_used = Gauge('rocm_gpu_memory_used', 'GPU memory used', ['gpu_id'])
def collect_metrics():
    result = subprocess.run(['rocm-smi', '--showuse'],
                            capture_output=True, text=True)
    # Parse the rocm-smi output in result.stdout and update the gauges
    # (simplified example; a real exporter needs complete parsing)
if __name__ == '__main__':
    start_http_server(9400)
    while True:
        collect_metrics()
        time.sleep(15)
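The parsing step left open in the script could look roughly like the sketch below, which reads JSON-formatted rocm-smi output instead of scraping the text table. The per-card layout and the `GPU use (%)` key are assumptions that may differ between ROCm versions, so check them against the real output first. `parse_gpu_use` is an illustrative helper name.

```python
import json


def parse_gpu_use(json_text: str, key: str = "GPU use (%)") -> dict:
    """Return {gpu_id: utilization percent} from JSON-formatted rocm-smi output.

    The per-card layout and the `key` name are assumptions; compare them
    with the output of the installed rocm-smi version before relying on this.
    """
    usage = {}
    for card, fields in json.loads(json_text).items():
        if not card.startswith("card") or key not in fields:
            continue  # skip entries that are not GPU cards
        usage[card[len("card"):]] = float(fields[key])
    return usage


# Hypothetical output of: rocm-smi --showuse --json
sample = '{"card0": {"GPU use (%)": "37"}, "card1": {"GPU use (%)": "0"}}'
print(parse_gpu_use(sample))  # {'0': 37.0, '1': 0.0}
```

Inside `collect_metrics`, each result could then be published with `gpu_utilization.labels(gpu_id=gpu_id).set(value)`.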
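26.8 Logging and Monitoring Best Practices
26.8.1 Log Management Principles
- Structured logging: emit logs as JSON
- Log rotation: cap the size of log files
- Centralized collection: use ELK or Loki
- Retention policy: keep logs for 30-90 days
- Sensitive data: filter out passwords and tokens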
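The last principle is easiest to enforce when logs are already structured. A minimal Python sketch of redacting sensitive fields before a JSON log event is written is shown below. The key list and the sample event are illustrative assumptions, not a complete policy.

```python
import json

SENSITIVE_KEYS = {"password", "token", "authorization", "api_key"}  # illustrative list


def redact(value):
    """Recursively replace the values of sensitive keys with a placeholder."""
    if isinstance(value, dict):
        return {
            key: "***" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


event = {"level": "info", "msg": "login ok", "user": "alice",
         "password": "s3cret", "headers": {"Authorization": "Bearer abc"}}
print(json.dumps(redact(event)))
```

Both the top-level `password` and the nested `Authorization` header come out as `***`. Key-based redaction does not catch secrets embedded in free-text messages, which need pattern-based filtering in the log pipeline.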
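26.8.2 Monitoring Metrics
Core metrics:
- CPU usage (alert above 80%)
- Memory usage (alert above 85%)
- Disk I/O (watch above 100 MB/s)
- Network traffic (abnormal spikes)
- Container restart count
Business metrics:
- Response time
- Request success rate
- Concurrent connections
- Database connection pool usage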
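The thresholds listed for the core metrics can be kept in one place and checked programmatically. The Python sketch below does this with a plain dictionary. The metric names and sample readings are hypothetical, and only the threshold values come from the list above.

```python
THRESHOLDS = {
    "cpu_percent": 80.0,      # CPU usage above 80%
    "memory_percent": 85.0,   # memory usage above 85%
    "disk_io_mb_s": 100.0,    # disk I/O above 100 MB/s
}


def breached(metrics: dict) -> list:
    """Return the names of metrics whose value exceeds its threshold."""
    return sorted(
        name for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0.0) > limit
    )


# Hypothetical readings for one container
print(breached({"cpu_percent": 92.5, "memory_percent": 60.0, "disk_io_mb_s": 130.0}))
```

For these readings the check reports `cpu_percent` and `disk_io_mb_s`. In a Prometheus setup the same thresholds would normally live in alerting rules, with a script like this serving ad hoc checks.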
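26.8.3 Alerting Strategy
yaml
# Alert severity levels
severity_levels:
  - critical:   # P0 - handle immediately
      - Service completely unavailable
      - Risk of data loss
      - Security incident
  - warning:    # P1 - handle within 1 hour
      - Significant performance degradation
      - Resources close to exhaustion
      - Partial functionality impaired
  - info:       # P2 - handle during working hours
      - Performance tuning suggestions
      - Capacity planning reminders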
26.9 Hands-on Example
26.9.1 Complete Monitoring Setup
yaml
# production-monitoring.yml
version: '3.8'
services:
  # Application service
  web:
    image: nginx:alpine
    labels:
      - "logging=enabled"
      - "monitoring=enabled"
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
  # Log collection
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki
  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail.yml:/etc/promtail/config.yml
  # Monitoring
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    privileged: true
  # Visualization
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
volumes:
  loki-data:
  prometheus-data:
  grafana-data:
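26.10 Summary
26.10.1 Key Points
Log management:
- Configure a suitable logging driver and rotation policy
- Use ELK or Loki for centralized logging
- Structured logs are easier to query and analyze
Monitoring:
- Prometheus collects time-series metrics
- cAdvisor exposes container metrics
- Grafana provides visualization
- Alertmanager handles alerts
Performance analysis:
- docker stats for real-time monitoring
- ctop as an interactive tool
- Custom scripts for deeper analysis
26.10.2 Choosing Tools
| Need | Recommended tool | Notes |
|---|---|---|
| Log collection | Loki/ELK | Loki is more lightweight |
| Metrics monitoring | Prometheus | Cloud-native standard |
| Visualization | Grafana | Powerful dashboards |
| Tracing | Jaeger | Distributed tracing |
| Alerting | Alertmanager | Part of the Prometheus ecosystem |
Further resources:
- Prometheus official documentation
- Grafana dashboard library
- ELK Stack best practices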