A complete operational guide to the ELK Stack + Redis pipeline, from basic usage to advanced application scenarios.
📊 1. Basic Operations Workflow
1.1 Data Collection Configuration (Filebeat)
Scenario 1: Collecting Nginx access logs
Add a JSON log format to the Nginx configuration (for example with sudo nano /etc/nginx/nginx.conf):

```nginx
http {
log_format json_combined escape=json
'{'
'"@timestamp":"$time_iso8601",'
'"remote_addr":"$remote_addr",'
'"remote_user":"$remote_user",'
'"request":"$request",'
'"status": "$status",'
'"body_bytes_sent":"$body_bytes_sent",'
'"request_time":"$request_time",'
'"http_referrer":"$http_referer",'
'"http_user_agent":"$http_user_agent"'
'}';
access_log /var/log/nginx/access.log json_combined;
}
```
Filebeat configuration file:

```yaml
# /etc/filebeat/filebeat.yml
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/nginx/access.log
json.keys_under_root: true
json.add_error_key: true
    tags: ["nginx-access"]   # must match the conditionals in the Logstash pipeline below
- type: log
enabled: true
paths:
- /var/log/nginx/error.log
    tags: ["nginx-error"]
```
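This guide assumes Filebeat ships its events to a Redis list (the `filebeat-logs` key that the Logstash `redis` input in section 1.2 consumes). A quick way to confirm events are actually landing in that list is to check its length; a minimal sketch using redis-py, with connection details as placeholders:

```python
import redis

# Connection details are placeholders; match them to your Redis deployment.
r = redis.Redis(host="localhost", port=6379,
                password="YourSecurePassword123", decode_responses=True)

queue_len = r.llen("filebeat-logs")   # the list read by the Logstash redis input
print(f"filebeat-logs queue length: {queue_len}")

# Peek at one entry without consuming it
if queue_len:
    print(r.lrange("filebeat-logs", 0, 0)[0][:200])
```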
Scenario 2: Collecting Java application logs (multiline handling)

```yaml
filebeat.inputs:
- type: log
paths:
- /opt/myapp/logs/*.log
multiline.pattern: '^\d{4}-\d{2}-\d{2}'
multiline.negate: true
multiline.match: after
fields:
app: "my-java-app"
    env: "production"
```
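The multiline settings above mean: any line that does not start with a `YYYY-MM-DD` date (`negate: true`) is appended to the preceding dated line (`match: after`), so a Java stack trace stays attached to the log statement that produced it. A small illustrative Python sketch of that grouping logic (not Filebeat's actual implementation):

```python
import re

MULTILINE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}")  # same pattern as in filebeat.yml

def group_events(lines):
    """Group raw lines into events the way multiline.negate=true / match=after does."""
    events, current = [], []
    for line in lines:
        if MULTILINE_PATTERN.match(line):   # a dated line starts a new event
            if current:
                events.append("\n".join(current))
            current = [line]
        else:                               # continuation line (e.g. stack trace)
            current.append(line)
    if current:
        events.append("\n".join(current))
    return events

sample = [
    "2024-05-01 10:00:00 ERROR Something failed",
    "java.lang.NullPointerException",
    "    at com.example.Foo.bar(Foo.java:42)",
    "2024-05-01 10:00:01 INFO Recovered",
]
print(len(group_events(sample)))   # -> 2 events
```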
1.2 Data Enrichment and Processing (Logstash)
A more complex log-parsing pipeline:

```ruby
# /etc/logstash/conf.d/advanced.conf
input {
redis {
host => "redis-host"
key => "filebeat-logs"
    data_type => "list"
    password => "${REDIS_PASSWORD}"   # only needed if Redis auth is enabled
}
}
filter {
  # Route events by their source tags
if "nginx-access" in [tags] {
    # Parse the user-agent string
useragent {
source => "http_user_agent"
target => "user_agent"
}
    # GeoIP lookup
geoip {
source => "remote_addr"
target => "geo"
}
    # Parse the request line and URL parameters
if [request] {
grok {
        match => { "request" => "%{WORD:http_method} %{URIPATH:request_path}(?:%{URIPARAM:query_params})? HTTP/%{NUMBER:http_version}" }
}
      # Extract individual URL parameters
if [query_params] {
kv {
source => "query_params"
field_split => "&?"
target => "url_params"
}
}
}
    # Categorize status codes (2xx/3xx/4xx/5xx)
ruby {
code => '
status = event.get("status").to_i
if status >= 200 && status < 300
event.set("status_category", "2xx")
elsif status >= 300 && status < 400
event.set("status_category", "3xx")
elsif status >= 400 && status < 500
event.set("status_category", "4xx")
else
event.set("status_category", "5xx")
end
'
}
}
  # Flag Java exception stack traces
if [message] =~ /Exception|Error|at .+\.java:\d+\)/ {
    # Tag as an error
mutate {
add_tag => ["error", "stacktrace"]
}
}
  # Add business/metadata fields
mutate {
add_field => {
"[@metadata][index_suffix]" => "logs"
}
}
}
output {
  # Route to different indices based on tags
if "nginx-access" in [tags] {
elasticsearch {
hosts => ["es-host:9200"]
index => "nginx-access-%{+YYYY.MM.dd}"
user => "elastic"
password => "${ES_PASSWORD}"
}
} else if "error" in [tags] {
elasticsearch {
hosts => ["es-host:9200"]
index => "error-logs-%{+YYYY.MM.dd}"
}
} else {
elasticsearch {
hosts => ["es-host:9200"]
index => "application-%{+YYYY.MM.dd}"
}
}
  # Forward critical errors to the alerting system
if [level] == "ERROR" and [app] == "payment-service" {
http {
url => "http://alert-system/webhook"
http_method => "post"
format => "json"
}
}
}
```
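Once Logstash is consuming from Redis, you can confirm documents are actually landing in Elasticsearch. A minimal sketch using `requests` against today's indices (the host, credentials, and index names are assumptions taken from the examples above):

```python
import datetime
import requests

ES_HOST = "http://es-host:9200"      # assumption: same host as in the output block
AUTH = ("elastic", "YourPassword")   # replace with your credentials

today = datetime.date.today().strftime("%Y.%m.%d")
for index in (f"nginx-access-{today}", f"application-{today}", f"error-logs-{today}"):
    resp = requests.get(f"{ES_HOST}/{index}/_count", auth=AUTH)
    if resp.status_code == 200:
        print(f"{index}: {resp.json()['count']} docs")
    else:
        print(f"{index}: not found yet (HTTP {resp.status_code})")
```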
📈 2. Hands-On with Kibana
2.1 Creating an Index Pattern
1. Log in to Kibana for the first time (http://your-ip:5601)
   - Username: elastic
   - Password: the password you set
2. Side menu → Stack Management → Index Patterns → Create index pattern
3. Enter: nginx-access-* (matches all the Nginx access-log indices)
4. Time field: @timestamp
5. Click Create index pattern (this step can also be scripted; see the sketch below)
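If you prefer to script this step, Kibana 7.x exposes a saved objects API that can create the index pattern; a hedged sketch (the URL, credentials, and exact payload shape are assumptions to verify against your Kibana version):

```python
import requests

KIBANA = "http://your-ip:5601"
AUTH = ("elastic", "YourPassword")

resp = requests.post(
    f"{KIBANA}/api/saved_objects/index-pattern",
    auth=AUTH,
    headers={"kbn-xsrf": "true"},   # header required by the Kibana API
    json={"attributes": {"title": "nginx-access-*",
                         "timeFieldName": "@timestamp"}},
)
print(resp.status_code, resp.json().get("id"))
```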
2.2 Search and Analysis in Discover
Basic search queries (KQL):
# 1. Simple full-text search
message: "ERROR"
# 2. Field search
status: 500
http_method: "POST"
# 3. Range queries (use the time picker for relative time windows)
status >= 400 and status <= 599
response_time > 5
# 4. Boolean logic
(level: ERROR or level: FATAL) and app: "order-service"
# 5. Wildcards
host.name: web-server-*
# 6. Wildcard match on a path value
request_path: /api/*
# 7. Existence checks
user_id: * and not session_id: *
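The same kind of query can also be issued programmatically against Elasticsearch with the query DSL. As an illustration, counting 4xx/5xx responses over the last hour (host and credentials are placeholders):

```python
import requests

ES_HOST = "http://es-host:9200"
AUTH = ("elastic", "YourPassword")

query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"status": {"gte": 400, "lte": 599}}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    }
}
resp = requests.get(f"{ES_HOST}/nginx-access-*/_count", auth=AUTH, json=query)
print("4xx/5xx responses in the last hour:", resp.json()["count"])
```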
Saving a search:
1. Configure the search criteria on the Discover page
2. Click Save and enter a name (e.g. "Today's error logs")
3. Reload it later from Saved Objects
2.3 Building Visualization Dashboards
Creating an Nginx monitoring dashboard:
Step 1: Status-code distribution pie chart
Visualize → Create visualization → Pie
Index pattern: nginx-access-*
Slice by: Terms
Field: status_category
Metric: Count
Step 2: Request volume over time
Visualize → Create visualization → Line
Index pattern: nginx-access-*
X axis: Date Histogram (field: @timestamp, interval: 1 hour)
Y axis: Count
Step 3: Top 10 request paths
Visualize → Create visualization → Data Table
Index pattern: nginx-access-*
Split rows: Terms
Field: request_path.keyword
Size: 10
Order: Descending
Step 4: Geographic distribution map
Visualize → Create visualization → Region Map
Index pattern: nginx-access-*
Terms → field: geo.country_name.keyword
Metric: Count
Step 5: Combine into a dashboard
Dashboard → Create dashboard
Click "Add" → select all of the visualizations created above
Arrange the layout → save the dashboard
2.4 Advanced Visualization Example
Real-time business monitoring dashboard (illustrative pseudo-configuration; the expressions below are not an importable Kibana saved object):

```json
{
  "dashboard": {
    "title": "Business System Real-Time Monitoring",
    "panels": [
      {
        "type": "metric",
        "title": "QPS",
        "expression": "count(index=nginx-access-*) | timechart span=1m"
      },
      {
        "type": "timeseries",
        "title": "Response time trend",
        "expression": "avg(response_time) from nginx-access-* by app"
      },
      {
        "type": "gauge",
        "title": "Error rate",
        "expression": "count(status >= 500) / count(*) * 100"
      },
      {
        "type": "treemap",
        "title": "API call distribution",
        "expression": "count() by request_path | sort -count"
      }
    ]
  }
}
```
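As a concrete counterpart to the "Error rate" gauge above, the same figure can be computed directly from Elasticsearch counts; a sketch with assumed host and credentials:

```python
import requests

ES_HOST = "http://es-host:9200"
AUTH = ("elastic", "YourPassword")
INDEX = "nginx-access-*"

def count(query=None):
    # _count accepts an optional query body; without one it counts all documents
    body = {"query": query} if query else None
    return requests.get(f"{ES_HOST}/{INDEX}/_count", auth=AUTH, json=body).json()["count"]

total = count()
errors = count({"range": {"status": {"gte": 500}}})
print(f"error rate: {errors / max(total, 1) * 100:.2f}%")
```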
🚨 3. Alerting and Monitoring Configuration
3.1 Kibana Alerting Rules
Creating an error-rate alert:
```yaml
# In Kibana: Stack Management → Rules and Connectors
1. Create a rule:
   - Name: "High error rate alert"
   - Check interval: 5 minutes

2. Define the condition:
   Query:
     index: nginx-access-*
     time range: last 5 minutes
     aggregation: count
     group by: terms(app.keyword)
   Threshold:
     error requests / total requests > 0.05  (5% error rate)

3. Configure actions:
   - Type: Email / Slack / Webhook
   - Recipients: ops-team@company.com
   - Message template: "App {{context.group}} error rate {{context.value}}%, above the 5% threshold"
```
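If your license or version does not include Kibana Alerting, the same check can run from a cron job. A minimal sketch (the webhook URL and credentials are assumptions, and the 5% threshold mirrors the rule above):

```python
import requests

ES_HOST = "http://es-host:9200"
AUTH = ("elastic", "YourPassword")
WEBHOOK = "http://alert-system/webhook"   # same endpoint as the Logstash http output
THRESHOLD = 0.05                          # 5% error rate

def count(extra_filters):
    # Count documents in the last 5 minutes, optionally narrowed by extra filters
    body = {"query": {"bool": {"filter":
        [{"range": {"@timestamp": {"gte": "now-5m"}}}] + extra_filters}}}
    r = requests.get(f"{ES_HOST}/nginx-access-*/_count", auth=AUTH, json=body)
    return r.json()["count"]

total = count([])
errors = count([{"range": {"status": {"gte": 500}}}])
rate = errors / total if total else 0.0

if rate > THRESHOLD:
    requests.post(WEBHOOK, json={
        "alert": "High error rate",
        "error_rate": round(rate * 100, 2),
        "window": "last 5 minutes",
    })
```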
3.2 Common Alert Rule Examples (pseudo-rule notation)
Rule 1: Service anomaly detection
WHEN count() OVER logs
WHERE level = "ERROR" AND app = "payment-service"
GROUP BY host
HAVING COUNT() > 10 IN LAST 10m
THEN alert("Payment service anomaly")
Rule 2: Response time violations
WHEN avg(response_time) OVER nginx-access-*
WHERE app = "api-gateway"
GROUP BY api_endpoint
HAVING avg(response_time) > 2000 IN LAST 5m
THEN alert("API response timeout")
Rule 3: Traffic spike detection
WHEN derivative(count()) OVER nginx-access-*
WHERE @timestamp >= now()-1h
GROUP BY app
HAVING derivative > 200% COMPARE TO 1h ago
THEN alert("Traffic spike")
🔍 4. Log Analysis and Troubleshooting in Practice
4.1 Common Troubleshooting Scenarios
Scenario 1: Finding slow requests
# Kibana Discover search
index: nginx-access-*
response_time > 5
sort: response_time desc
# Analyze slow-request patterns (pipe syntax is pseudo notation; see the aggregation sketch below)
request_path: "/api/*" and response_time > 3
| stats avg(response_time), count() by request_path
| sort -avg(response_time)
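The pipe-style aggregation above is shorthand; against Elasticsearch it corresponds to a terms aggregation with an avg sub-aggregation. A sketch (index name, host, and credentials assumed from earlier sections):

```python
import requests

ES_HOST = "http://es-host:9200"
AUTH = ("elastic", "YourPassword")

body = {
    "size": 0,
    "query": {"range": {"response_time": {"gt": 3}}},
    "aggs": {
        "by_path": {
            "terms": {"field": "request_path.keyword", "size": 10,
                      "order": {"avg_time": "desc"}},
            "aggs": {"avg_time": {"avg": {"field": "response_time"}}},
        }
    },
}
resp = requests.get(f"{ES_HOST}/nginx-access-*/_search", auth=AUTH, json=body).json()
for bucket in resp["aggregations"]["by_path"]["buckets"]:
    print(f'{bucket["key"]}: avg {bucket["avg_time"]["value"]:.2f}s '
          f'over {bucket["doc_count"]} requests')
```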
Scenario 2: Correlating error logs
# Find errors and correlate the related log lines
level: "ERROR"
| transaction by trace_id    # pseudo notation: group by trace_id if distributed tracing is in place
| sort by @timestamp
Scenario 3: User behavior analysis
# Analyze a specific user's session
user_id: "12345"
| sort by @timestamp
# Follow the timeline to reconstruct the user's full sequence of actions
4.2 Interactive Analysis with Lens
# 1. Open Lens
Kibana → Analytics → Lens
# 2. Drag fields to analyze
- Drag @timestamp onto the X axis
- Drag status onto the color breakdown
- Drag request_path onto the breakdown/slice dimension
- Visualization type: area chart
# 3. Add a formula to compute the error rate
count(kql='status >= 400') / count()
📊 5. Performance Optimization and Maintenance
5.1 Elasticsearch Index Management
Creating an index lifecycle management (ILM) policy:

```json
PUT _ilm/policy/logs-policy
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_size": "50gb",
"max_age": "7d"
}
}
},
"warm": {
"min_age": "7d",
"actions": {
"forcemerge": {
"max_num_segments": 1
},
"shrink": {
"number_of_shards": 1
}
}
},
"cold": {
"min_age": "30d",
"actions": {
"searchable_snapshot": {
"snapshot_repository": "my_repository"
}
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
}
```
Applying the policy in an index template (note: ILM rollover on plain indices also requires index.lifecycle.rollover_alias):

```json
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-policy",
      "index.lifecycle.rollover_alias": "logs"
    },
    "mappings": {
      "dynamic_templates": [
        {
          "strings_as_keyword": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword"
            }
          }
        }
      ]
    }
  }
}
```
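A sketch of applying the policy and template over the REST API and then checking what ILM is doing to each matching index (host and credentials assumed; policy.json and template.json are hypothetical files containing the two bodies above):

```python
import json
import requests

ES_HOST = "http://es-host:9200"
AUTH = ("elastic", "YourPassword")

# Apply the ILM policy and the index template shown above
with open("policy.json") as f:
    requests.put(f"{ES_HOST}/_ilm/policy/logs-policy", auth=AUTH, json=json.load(f)).raise_for_status()
with open("template.json") as f:
    requests.put(f"{ES_HOST}/_index_template/logs-template", auth=AUTH, json=json.load(f)).raise_for_status()

# See which lifecycle phase each matching index is currently in
explain = requests.get(f"{ES_HOST}/logs-*/_ilm/explain", auth=AUTH).json()
for name, info in explain.get("indices", {}).items():
    print(name, info.get("phase"), info.get("action"))
```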
5.2 Monitoring the Health of the ELK Stack Itself
Building a system monitoring view:
# Check Elasticsearch cluster health
GET _cluster/health
# Check index status
GET _cat/indices?v&s=store.size:desc
# Check node status
GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu
# Kibana query example against the monitoring indices (pipe syntax is pseudo notation)
index: ".monitoring-es-*"
| timechart span=1h avg(node.jvm.mem.heap_used_percent) by node_name
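These checks are easy to automate. A hedged sketch that polls cluster health and node heap usage and warns when the thresholds from section 7.3 are exceeded (host and credentials are placeholders):

```python
import requests

ES_HOST = "http://es-host:9200"
AUTH = ("elastic", "YourPassword")

# Cluster-level status: anything other than green deserves a look
health = requests.get(f"{ES_HOST}/_cluster/health", auth=AUTH).json()
if health["status"] != "green":
    print(f"WARNING: cluster status is {health['status']}")

# Per-node heap usage via the _cat API in JSON form
nodes = requests.get(
    f"{ES_HOST}/_cat/nodes?format=json&h=name,heap.percent,ram.percent,cpu",
    auth=AUTH,
).json()
for node in nodes:
    heap = int(node["heap.percent"])
    if heap > 75:   # threshold from section 7.3
        print(f"WARNING: {node['name']} JVM heap at {heap}%")
```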
🛠️ 6. Useful Scripts and Tools
6.1 Log Data Simulator

```python
#!/usr/bin/env python3
"""
Generate simulated log data for testing the ELK pipeline.
"""
import json
import random
import time
import redis
from datetime import datetime
from faker import Faker
fake = Faker('zh_CN')
class LogGenerator:
def __init__(self, redis_host='localhost', redis_port=6379, password=None):
self.redis = redis.Redis(
host=redis_host,
port=redis_port,
password=password,
db=0,
decode_responses=True
)
    def generate_nginx_log(self):
        """Generate an Nginx access-log entry."""
log = {
"@timestamp": datetime.utcnow().isoformat() + "Z",
"remote_addr": fake.ipv4(),
"remote_user": fake.user_name() if random.random() > 0.7 else "-",
"request": f"{random.choice(['GET', 'POST', 'PUT', 'DELETE'])} "
f"{random.choice(['/api/users', '/api/orders', '/products', '/'])} "
f"HTTP/1.1",
"status": random.choice([200, 200, 200, 200, 404, 500]),
"body_bytes_sent": random.randint(100, 10000),
"request_time": round(random.uniform(0.1, 5.0), 3),
"http_referrer": fake.url() if random.random() > 0.5 else "",
"http_user_agent": fake.user_agent(),
"app": random.choice(['web-frontend', 'api-backend', 'mobile-api']),
"host": f"server-{random.randint(1, 10)}",
"level": "INFO"
}
return json.dumps(log)
    def generate_app_log(self):
        """Generate an application log entry."""
levels = ['INFO', 'DEBUG', 'WARN', 'ERROR']
log = {
"@timestamp": datetime.utcnow().isoformat() + "Z",
"level": random.choices(
levels,
weights=[0.7, 0.1, 0.15, 0.05]
)[0],
"message": random.choice([
"User login successful",
"Database query executed",
"Cache miss occurred",
f"Error processing request: {fake.sentence()}",
f"Starting transaction ID: {fake.uuid4()}"
]),
"logger": f"com.example.{random.choice(['UserService', 'OrderService', 'PaymentService'])}",
"thread": f"thread-{random.randint(100, 999)}",
"app": "order-service",
"env": "production"
}
return json.dumps(log)
    def start_generating(self, interval=1, count=1000):
        """Start generating log entries."""
        print(f"Generating logs: interval={interval}s, total={count}")
for i in range(count):
            # Randomly pick a log type
if random.random() > 0.3:
log_entry = self.generate_nginx_log()
log_type = "nginx"
else:
log_entry = self.generate_app_log()
log_type = "app"
            # Push to Redis (the list consumed by the Logstash redis input)
self.redis.lpush("filebeat-logs", log_entry)
if i % 100 == 0:
                print(f"Generated {i} logs, current queue length: {self.redis.llen('filebeat-logs')}")
time.sleep(interval)
        print("Log generation complete")
if __name__ == "__main__":
    # Usage example
generator = LogGenerator(
redis_host="localhost",
password="YourSecurePassword123"
)
    generator.start_generating(interval=0.1, count=5000)
```
6.2 ELK Maintenance Script

```bash
#!/bin/bash
# elk-maintenance.sh
# Clean up old indices
DAYS_TO_KEEP=30
ES_HOST="localhost:9200"
ES_USER="elastic"
ES_PASSWORD="YourPassword"

# Delete the indices from DAYS_TO_KEEP days ago
# (removes that single day only; see the Python sketch below for an "older than N days" cleanup)
OLD_DATE=$(date -d "$DAYS_TO_KEEP days ago" +%Y.%m.%d)
curl -u "$ES_USER:$ES_PASSWORD" -X DELETE \
  "$ES_HOST/nginx-access-$OLD_DATE,application-$OLD_DATE,error-logs-$OLD_DATE"
# Force-merge indices to reduce segment count
curl -u "$ES_USER:$ES_PASSWORD" -X POST \
"$ES_HOST/_all/_forcemerge?max_num_segments=1"
# Show index status
echo "=== Index status ==="
curl -s -u "$ES_USER:$ES_PASSWORD" "$ES_HOST/_cat/indices?v" | \
sort -k3
# Show cluster health
echo -e "\n=== Cluster health ==="
curl -s -u "$ES_USER:$ES_PASSWORD" "$ES_HOST/_cluster/health?pretty"
# Clean up old Logstash dead-letter-queue files
find /var/lib/logstash/dead_letter_queue -name "*.log" -mtime +7 -delete
```
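The date-math deletion above only removes one specific day. A more thorough variant deletes everything older than a cutoff based on each index's creation date; a sketch using the _cat/indices API (host and credentials are placeholders, and the patterns match the indices created earlier):

```python
import time
import requests

ES_HOST = "http://localhost:9200"
AUTH = ("elastic", "YourPassword")
DAYS_TO_KEEP = 30
PATTERNS = "nginx-access-*,application-*,error-logs-*"

cutoff_ms = (time.time() - DAYS_TO_KEEP * 86400) * 1000

# creation.date is returned as epoch milliseconds
indices = requests.get(
    f"{ES_HOST}/_cat/indices/{PATTERNS}?format=json&h=index,creation.date",
    auth=AUTH,
).json()

for idx in indices:
    if float(idx["creation.date"]) < cutoff_ms:
        print("deleting", idx["index"])
        requests.delete(f"{ES_HOST}/{idx['index']}", auth=AUTH)
```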
🎯 7. Best Practice Recommendations
7.1 Logging Standards
- Structured logging: always emit JSON
- Unified fields (a Python formatter sketch follows this list):

```json
{
  "@timestamp": "ISO 8601 timestamp",
  "level": "INFO/WARN/ERROR",
  "message": "descriptive message",
  "app": "application name",
  "env": "environment",
  "trace_id": "trace ID (for distributed tracing)",
  "user_id": "user ID (if available)"
}
```

- Avoid sensitive data: never log passwords, tokens, or secret keys
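As an illustration of emitting logs in exactly this shape from a Python service, a minimal stdlib-only formatter sketch (the field names follow the convention above; trace_id and user_id would come from your own request context, and the app/env values are placeholders):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON with the unified field set."""

    def format(self, record):
        entry = {
            "@timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "app": "order-service",   # placeholder: set per application
            "env": "production",
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None),
        }
        return json.dumps(entry, ensure_ascii=False)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("User login successful", extra={"user_id": "12345"})
```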
7.2 Search Optimization
- Use keyword fields for exact matches and aggregations
- Define mappings deliberately: declare field types up front
- Avoid unnecessary analysis: tokenize only the fields that need full-text search
7.3 Key Performance Metrics
# Key metrics to watch (a Redis memory check sketch follows this list)
1. Elasticsearch JVM heap usage < 75%
2. Indexing rate > 1000 docs/sec (adjust for your hardware)
3. Query latency < 100 ms (95th percentile)
4. Disk usage < 85%
5. Redis memory usage < 70%
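For the Redis side, the memory figure can be read from INFO memory; a sketch using redis-py (connection details are placeholders, and maxmemory must be configured for the ratio to be meaningful):

```python
import redis

r = redis.Redis(host="localhost", port=6379, password="YourSecurePassword123")

info = r.info("memory")
used = info["used_memory"]
maxmem = info.get("maxmemory", 0)

if maxmem:
    ratio = used / maxmem * 100
    print(f"Redis memory usage: {ratio:.1f}% ({used} / {maxmem} bytes)")
    if ratio > 70:   # threshold from the list above
        print("WARNING: Redis memory above 70%")
else:
    print(f"Redis used_memory: {used} bytes (maxmemory not set)")
```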
🚀 Quick-Start Checklist
- ✅ All services are up and running
- ✅ Filebeat is shipping logs to Redis
- ✅ The Redis queue contains data
- ✅ Logstash is consuming from Redis and processing events
- ✅ Elasticsearch is creating the indices
- ✅ The Kibana index pattern is configured
- ✅ A first visualization has been created
- ✅ Alert rules are set up
- ✅ Index lifecycle management is configured
- ✅ Snapshot backups run regularly
With this guide you can:
- ✅ Collect and analyze all kinds of logs
- ✅ Build real-time monitoring dashboards
- ✅ Set up intelligent alerting
- ✅ Troubleshoot faults and analyze performance
- ✅ Maintain and optimize the ELK cluster