[Redis] Monitoring and the Slow Query Log: slowlog, the INFO Command, and RedisInsight Visual Monitoring

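Introduction 🎯

Hi everyone, this is Lao Cao! Today we play "Redis doctor" and learn how to give Redis a full health check! 🏥

Many people know that Redis is fast, but how fast exactly? Where is it slow, and why? Answering those questions takes proper monitoring. Today we learn how to become Redis performance detectives!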

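Learning Goals 📚

🎯 Core skills

Understand the architecture of a Redis monitoring system
Analyze slow queries with slowlog
Use the INFO command to read key metrics
Work confidently with the RedisInsight visual tool

🎯 Hands-on abilities

Build a complete Redis monitoring system on your own
Locate performance bottlenecks quickly
Configure monitoring alerts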

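Redis Monitoring Architecture 🏗️

1. Monitoring layers

Redis monitoring system
  Infrastructure layer
    Server resources
    Network status
    Disk I/O
  Performance metrics layer
    Memory usage
    Connection counts
    Hit rate analysis
    Command execution statistics
  Business application layer
    Business response time
    Error rate monitoring
    Custom business metrics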

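2. Key monitoring metrics by category

Dimension	Metric	Normal range	Alert threshold
Performance	QPS/RPS	Depends on workload	>200% of baseline
	Response time	<10 ms	>50 ms
	CPU usage	<70%	>85%
Memory	Memory usage	<80%	>90%
	Memory fragmentation ratio	1.0-1.5	>2.0 or <1.0
	Cache hit rate	>95%	<90%
Connections	Current connections	Depends on configuration	>80% of max connections
	Rejected connections	0	>0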
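
The alert thresholds in the table can be expressed as a small table-driven check. This is only a sketch: it assumes the metrics have already been collected into a plain dict, and the metric keys, `THRESHOLDS`, and `check_metrics` are illustrative names rather than part of any Redis tooling. The QPS and connection rows are left out because they depend on a baseline and on configuration.

```python
# Alert thresholds from the table above. Each predicate returns True
# when the value should raise an alert. All names are illustrative.
THRESHOLDS = {
    "response_time_ms": lambda v: v > 50,
    "cpu_percent": lambda v: v > 85,
    "memory_percent": lambda v: v > 90,
    "mem_fragmentation_ratio": lambda v: v > 2.0 or v < 1.0,
    "hit_rate_percent": lambda v: v < 90,
    "rejected_connections": lambda v: v > 0,
}


def check_metrics(metrics):
    """Return the sorted names of metrics that breach their alert threshold."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in THRESHOLDS and THRESHOLDS[name](value)
    )


# Made-up sample values for demonstration only
sample = {
    "response_time_ms": 3,
    "cpu_percent": 40,
    "memory_percent": 93,
    "mem_fragmentation_ratio": 0.8,
    "hit_rate_percent": 97,
    "rejected_connections": 0,
}
print(check_metrics(sample))  # ['mem_fragmentation_ratio', 'memory_percent']
```

Keeping the thresholds in one mapping makes it easy to tune them per environment without touching the checking logic.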

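SlowLog: The Slow Query Log in Detail 🔍

3. How SlowLog works

Client -> Redis server: send command
Redis server: start timing, execute the command logic, stop timing
Redis server -> slow query log: record the command if execution time > slowlog-log-slower-than
Redis server -> client: return result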
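
To make the flow concrete, here is a toy model of the mechanism in Python. It is a simplified sketch and not how Redis is implemented internally: the `SlowLogSim` class and its fields are invented for illustration. It does mirror three documented behaviours: only execution time is measured, the log is a bounded in-memory list that drops its oldest entries, and a negative threshold disables logging while 0 logs every command.

```python
import time
from collections import deque


class SlowLogSim:
    """Toy model of the slow log: a bounded, newest-first list of slow commands."""

    def __init__(self, slower_than_us=10000, max_len=128):
        self.slower_than_us = slower_than_us
        self.entries = deque(maxlen=max_len)  # oldest entry is dropped when full
        self.next_id = 0

    def execute(self, command, func):
        """Run func, timing only its execution, then log it if it was slow."""
        start = time.perf_counter()
        result = func()
        duration_us = int((time.perf_counter() - start) * 1_000_000)
        self.record(command, duration_us)
        return result

    def record(self, command, duration_us):
        # A negative threshold disables logging; 0 logs every command.
        if self.slower_than_us >= 0 and duration_us >= self.slower_than_us:
            self.entries.appendleft(
                {"id": self.next_id, "duration": duration_us, "command": command}
            )
            self.next_id += 1


log = SlowLogSim(slower_than_us=10000, max_len=2)
log.record("GET a", 120)             # fast, not logged
log.record("KEYS *", 45000)          # logged with id 0
log.record("HGETALL big", 30000)     # logged with id 1
log.record("SMEMBERS big", 20000)    # logged with id 2, pushes out id 0
print([e["command"] for e in log.entries])  # ['SMEMBERS big', 'HGETALL big']
```

Because the log is bounded, a burst of slow commands can overwrite older evidence, which is why the entries should be collected regularly.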

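4. SlowLog configuration parameters

bash

# Redis configuration file: redis.conf
slowlog-log-slower-than 10000    # log commands that take longer than 10000 microseconds (10 ms)
slowlog-max-len 128              # keep at most 128 slow query entries
latency-monitor-threshold 100    # latency monitor threshold in milliseconds (separate from slowlog)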
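
The same two slowlog settings can also be changed on a running server with CONFIG SET. The helper below is a sketch using redis-py's `config_set` and `config_get`; the function name `apply_slowlog_config` and the `InMemoryClient` stand-in are assumptions made so the example runs without a server. Values set this way are lost on restart unless they are also written to redis.conf.

```python
def apply_slowlog_config(client, slower_than_us=10000, max_len=128):
    """Set the slow log parameters at runtime and return the values read back."""
    client.config_set("slowlog-log-slower-than", slower_than_us)
    client.config_set("slowlog-max-len", max_len)
    current = {}
    for name in ("slowlog-log-slower-than", "slowlog-max-len"):
        current.update(client.config_get(name))
    return current


class InMemoryClient:
    """Hypothetical stand-in exposing the two redis-py methods used above."""

    def __init__(self):
        self._config = {}

    def config_set(self, name, value):
        self._config[name] = str(value)

    def config_get(self, pattern):
        return {pattern: self._config[pattern]}


print(apply_slowlog_config(InMemoryClient(), slower_than_us=5000, max_len=256))

# Against a real server (requires redis-py and a reachable instance):
#   import redis
#   apply_slowlog_config(redis.Redis(host="localhost", port=6379))
```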

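5. SlowLog in practice

bash

# Show the 10 most recent slow log entries
redis-cli SLOWLOG GET 10

# Clear the slow log
redis-cli SLOWLOG RESET

# Show how many entries the slow log currently holds
redis-cli SLOWLOG LEN

# Continuously sample round-trip latency
# (this measures PING latency to the server, not individual slow queries)
redis-cli --latency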
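
To follow slow queries as they appear, one option is to poll SLOWLOG GET and keep only the entries you have not seen yet. Each entry carries an increasing `id`, so remembering the highest id seen is enough. The sketch below works on the list of dicts that redis-py's `slowlog_get()` returns; the function name and the sample entries are made up for illustration.

```python
def new_slowlog_entries(entries, last_seen_id):
    """Return (entries newer than last_seen_id, oldest first; new high-water mark)."""
    fresh = sorted(
        (entry for entry in entries if entry["id"] > last_seen_id),
        key=lambda entry: entry["id"],
    )
    newest_id = fresh[-1]["id"] if fresh else last_seen_id
    return fresh, newest_id


# Shape follows redis-py's slowlog_get(); the values are invented
polled = [
    {"id": 12, "start_time": 1700000030, "duration": 15000, "command": b"KEYS *"},
    {"id": 11, "start_time": 1700000020, "duration": 12000, "command": b"HGETALL user:1"},
    {"id": 10, "start_time": 1700000010, "duration": 30000, "command": b"SMEMBERS tags"},
]

fresh, last_id = new_slowlog_entries(polled, last_seen_id=10)
print([entry["id"] for entry in fresh], last_id)  # [11, 12] 12

# A polling loop against a real server might look like this:
#   while True:
#       fresh, last_id = new_slowlog_entries(r.slowlog_get(128), last_id)
#       ...report fresh entries...
#       time.sleep(5)
```

Polling more often than the log fills up matters here, because entries that are pushed out of the bounded log between two polls are gone.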

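6. Parsing slow log entries

python

# Parsing SlowLog entries with Python (redis-py)
from datetime import datetime

import redis


def format_timestamp(ts):
    """Convert a Unix timestamp into a readable local time string."""
    return datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')


def to_text(value):
    """redis-py may return bytes for these fields; decode them for display."""
    if isinstance(value, bytes):
        return value.decode('utf-8', errors='replace')
    return str(value)


def analyze_slowlog():
    r = redis.Redis(host='localhost', port=6379)

    # Fetch the 10 most recent slow queries
    slowlogs = r.slowlog_get(10)

    for log in slowlogs:
        # redis-py stores the timestamp under 'start_time' and returns the
        # command as one joined value; the client address may be missing
        # on older server or client versions, hence .get()
        print(f"""
        ID: {log['id']}
        Timestamp: {log['start_time']} ({format_timestamp(log['start_time'])})
        Duration: {log['duration']} microseconds
        Command: {to_text(log['command'])}
        Client address: {to_text(log.get('client_address', 'n/a'))}
        """)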
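
Once the entries are parsed, grouping them by command name quickly shows which commands cost the most. The sketch below assumes entries in the shape redis-py returns (`command` as bytes or str, `duration` in microseconds); `summarize_slowlog` and the sample data are illustrative.

```python
from collections import defaultdict


def summarize_slowlog(entries):
    """Group entries by command name.

    Returns (name, count, total_us, max_us) tuples, largest total time first.
    """
    stats = defaultdict(lambda: [0, 0, 0])  # count, total_us, max_us
    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        name = command.split(" ", 1)[0].upper() if command else "?"
        bucket = stats[name]
        bucket[0] += 1
        bucket[1] += entry["duration"]
        bucket[2] = max(bucket[2], entry["duration"])
    return sorted(
        ((name, count, total, peak) for name, (count, total, peak) in stats.items()),
        key=lambda row: row[2],
        reverse=True,
    )


# Invented sample entries
sample_entries = [
    {"id": 3, "duration": 45000, "command": b"KEYS *"},
    {"id": 2, "duration": 12000, "command": b"hgetall user:1"},
    {"id": 1, "duration": 20000, "command": "HGETALL user:2"},
]
for row in summarize_slowlog(sample_entries):
    print(row)
# ('KEYS', 1, 45000, 45000)
# ('HGETALL', 2, 32000, 20000)
```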

The INFO Command in Depth 📊

7. INFO sections

bash

# INFO command sections
redis-cli INFO all          # all sections
redis-cli INFO server       # general server information
redis-cli INFO clients      # client connections
redis-cli INFO memory       # memory usage
redis-cli INFO persistence  # persistence status
redis-cli INFO stats        # general statistics
redis-cli INFO replication  # master/replica replication status
redis-cli INFO cpu          # CPU usage
redis-cli INFO commandstats # per-command statistics
redis-cli INFO cluster      # cluster information
redis-cli INFO keyspace     # per-database keyspace statistics
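
The raw INFO output is plain text: lines of `key:value` grouped under `# Section` headers. redis-py's `info()` already parses it for you, but if you capture the output of redis-cli in a script, a few lines of Python are enough. The parser below is a minimal sketch; `parse_info` and the "unknown" fallback section are my own naming, and values are kept as strings.

```python
def parse_info(text):
    """Parse raw INFO output into {section: {key: value}} with string values."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # "# Memory" starts a new section
            current = sections.setdefault(line[1:].strip().lower(), {})
        elif ":" in line:
            if current is None:
                # key:value seen before any header
                current = sections.setdefault("unknown", {})
            key, _, value = line.partition(":")
            current[key] = value
    return sections


sample_text = """
# Memory
used_memory:1048576
used_memory_rss:2097152
mem_fragmentation_ratio:2.00

# Stats
instantaneous_ops_per_sec:1000
"""

parsed = parse_info(sample_text)
print(parsed["memory"]["used_memory"])             # 1048576
print(parsed["stats"]["instantaneous_ops_per_sec"])  # 1000
```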

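8. Reading the key metrics

bash

# Memory information
redis-cli INFO memory

# Sample output with explanations
used_memory:1048576           # bytes allocated by Redis for its data (the allocator's view)
used_memory_human:1.00M       # the same value in human-readable form
used_memory_rss:2097152       # resident set size in bytes (the operating system's view)
used_memory_peak:2097152      # peak memory usage in bytes
mem_fragmentation_ratio:2.00  # used_memory_rss / used_memory (an important metric!)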
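
The fragmentation ratio in the sample is just the RSS value divided by used_memory: 2097152 / 1048576 = 2.00. A small sketch makes the arithmetic and the interpretation explicit. The bands follow the threshold table earlier in this article (1.0-1.5 normal, above 2.0 or below 1.0 alerting); the function names and labels are illustrative.

```python
def fragmentation_ratio(info):
    """Compute used_memory_rss / used_memory from an INFO memory dict."""
    used = info["used_memory"]
    if used <= 0:
        raise ValueError("used_memory must be positive")
    return info["used_memory_rss"] / used


def describe_fragmentation(ratio):
    """Map a ratio to a rough label using this article's threshold table."""
    if ratio < 1.0:
        return "low"       # RSS below used memory: data may be swapped out
    if ratio <= 1.5:
        return "normal"
    if ratio <= 2.0:
        return "elevated"  # above the normal range, not yet alerting
    return "high"          # heavy fragmentation


sample_info = {"used_memory": 1048576, "used_memory_rss": 2097152}
ratio = fragmentation_ratio(sample_info)
print(ratio, describe_fragmentation(ratio))  # 2.0 elevated
```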

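bash

# Performance statistics
redis-cli INFO stats

# Key fields explained
total_connections_received:1000    # connections accepted since startup
total_commands_processed:50000     # commands processed since startup
instantaneous_ops_per_sec:1000     # operations per second right now (QPS)
rejected_connections:0             # connections rejected because maxclients was reached
sync_full:2                        # number of full resynchronizations with replicas
expired_keys:100                   # keys expired since startup
evicted_keys:50                    # keys evicted because of the maxmemory limit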
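
Most of these fields are counters that only grow since startup, so the useful signal is the change between two samples. The sketch below turns two INFO stats snapshots into per-second rates; `counter_rates` and the sample numbers are illustrative, and a negative delta is treated as a sign that the server restarted between samples.

```python
def counter_rates(previous, current, interval_seconds,
                  names=("total_commands_processed", "expired_keys", "evicted_keys")):
    """Return per-second rates for cumulative counters between two samples.

    A rate is None when the counter went backwards (for example after a restart).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    rates = {}
    for name in names:
        delta = current[name] - previous[name]
        rates[name] = None if delta < 0 else delta / interval_seconds
    return rates


# Two invented snapshots taken 30 seconds apart
before = {"total_commands_processed": 50000, "expired_keys": 100, "evicted_keys": 50}
after = {"total_commands_processed": 80000, "expired_keys": 160, "evicted_keys": 50}

print(counter_rates(before, after, interval_seconds=30))
# {'total_commands_processed': 1000.0, 'expired_keys': 2.0, 'evicted_keys': 0.0}
```

A steadily non-zero eviction rate is usually worth more attention than the absolute evicted_keys value, because it means the instance is running against its maxmemory limit right now.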

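RedisInsight Visual Monitoring 🖥️

9. RedisInsight features

yaml

# Main RedisInsight feature areas
# (availability differs between RedisInsight versions; check the release you run)
features:
  dashboard:
    real_time_metrics: real-time performance monitoring
    custom_dashboards: custom dashboards
    alert_rules: alert rule configuration
  
  performance:
    slow_queries: slow query analysis
    memory_analysis: memory usage analysis
    latency_monitoring: latency monitoring
  
  data_browser:
    key_explorer: key browsing and search
    data_visualization: data visualization
    bulk_operations: bulk operations
  
  cluster_management:
    topology_view: cluster topology view
    node_monitoring: node status monitoring
    configuration_management: configuration management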

10. Dashboard configuration example (illustrative layout)

json

{
  "dashboard": {
    "name": "Production Redis monitoring",
    "widgets": [
      {
        "type": "gauge",
        "title": "Memory usage",
        "metric": "used_memory_percent",
        "thresholds": {
          "warning": 80,
          "critical": 90
        }
      },
      {
        "type": "line_chart",
        "title": "QPS trend",
        "metrics": ["instantaneous_ops_per_sec"],
        "time_range": "1h"
      },
      {
        "type": "table",
        "title": "Top 10 slow queries",
        "query": "SLOWLOG GET 10"
      }
    ]
  }
}

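Monitoring Alert Configuration ⚠️

11. Designing alert rules

yaml

# Prometheus alerting rules example (Alertmanager handles the notifications)
# Metric names assume the redis_exporter naming scheme
groups:
  - name: redis_high_memory
    rules:
      - alert: RedisMemoryHigh
        # Only meaningful when maxmemory is set (redis_memory_max_bytes > 0)
        expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage is too high"
          description: "{{ $labels.instance }} memory usage has reached {{ $value }}%"

  - name: redis_slow_queries
    rules:
      - alert: RedisSlowQueries
        # redis_slowlog_length is a gauge, so use delta() rather than increase();
        # it stops growing once the log reaches slowlog-max-len
        expr: delta(redis_slowlog_length[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Redis is logging many slow queries"
          description: "{{ $labels.instance }} logged {{ $value }} new slow queries in 5 minutes"

  - name: redis_connection_issues
    rules:
      - alert: RedisConnectionsHigh
        expr: redis_connected_clients / redis_config_maxclients * 100 > 80
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Redis connections are close to the limit"
          description: "{{ $labels.instance }} connection usage has reached {{ $value }}%"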
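
The `for:` clause in these rules is what keeps short spikes from paging anyone: the expression has to stay true for the whole duration before the alert fires. The Python sketch below is a simplified model of that behaviour, not Prometheus' actual evaluation loop; `alert_is_firing` and the sample series are invented for illustration.

```python
def alert_is_firing(samples, threshold, for_seconds):
    """Simplified model of a rule with a `for:` clause.

    samples: list of (timestamp_seconds, value) in time order.
    Returns True if value > threshold has held without interruption for at
    least for_seconds, measured at the last sample.
    """
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
        else:
            pending_since = None           # any dip resets the timer
    if pending_since is None:
        return False
    return samples[-1][0] - pending_since >= for_seconds


# Memory usage in percent, sampled once a minute (made-up values)
spiky = [(0, 86), (60, 87), (120, 84), (180, 88), (240, 89), (300, 90)]
sustained = [(0, 86), (60, 87), (120, 88), (180, 88), (240, 89), (300, 90)]

print(alert_is_firing(spiky, threshold=85, for_seconds=300))      # False
print(alert_is_firing(sustained, threshold=85, for_seconds=300))  # True
```

In the spiky series the dip to 84% at the two-minute mark resets the timer, so only two minutes of continuous breach have accumulated by the last sample.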

12. Custom monitoring script

python

#!/usr/bin/env python3
# Redis health check script

import json
from datetime import datetime

import redis


class RedisMonitor:
    def __init__(self, host='localhost', port=6379):
        self.instance = f'{host}:{port}'
        self.redis_client = redis.Redis(host=host, port=port)
        self.thresholds = {
            'memory_usage': 85,      # memory usage threshold, %
            'hit_rate': 90,          # cache hit rate threshold, %
            'connections_ratio': 80  # connection usage threshold, %
        }

    def _get_maxclients(self, info):
        """INFO may not report maxclients on older servers; fall back to CONFIG GET."""
        if 'maxclients' in info:
            return int(info['maxclients'])
        return int(self.redis_client.config_get('maxclients')['maxclients'])

    def get_health_status(self):
        """Collect the health status of the Redis instance."""
        try:
            # The default INFO output already includes the stats section
            info = self.redis_client.info()

            health_data = {
                'timestamp': datetime.now().isoformat(),
                'status': 'healthy',
                'issues': [],
                'metrics': {}
            }

            # Memory check (maxmemory == 0 means "no limit", reported as 0%)
            maxmemory = info.get('maxmemory', 0)
            memory_pct = (info['used_memory'] / maxmemory * 100
                          if maxmemory > 0 else 0)
            health_data['metrics']['memory_usage'] = round(memory_pct, 2)

            if memory_pct > self.thresholds['memory_usage']:
                health_data['issues'].append({
                    'type': 'high_memory',
                    'message': f'Memory usage too high: {memory_pct:.2f}%'
                })
                health_data['status'] = 'warning'

            # Cache hit rate check
            hits = info.get('keyspace_hits', 0)
            misses = info.get('keyspace_misses', 0)
            if hits + misses > 0:
                hit_rate = hits / (hits + misses) * 100
                health_data['metrics']['hit_rate'] = round(hit_rate, 2)

                if hit_rate < self.thresholds['hit_rate']:
                    health_data['issues'].append({
                        'type': 'low_hit_rate',
                        'message': f'Cache hit rate is low: {hit_rate:.2f}%'
                    })
                    if health_data['status'] == 'healthy':
                        health_data['status'] = 'warning'

            # Connection check
            maxclients = self._get_maxclients(info)
            conn_ratio = info['connected_clients'] / maxclients * 100
            health_data['metrics']['connection_usage'] = round(conn_ratio, 2)

            if conn_ratio > self.thresholds['connections_ratio']:
                health_data['issues'].append({
                    'type': 'high_connections',
                    'message': f'Connection usage too high: {conn_ratio:.2f}%'
                })
                health_data['status'] = 'critical'

            return health_data

        except Exception as e:
            # Any failure (including connection errors) is reported as unhealthy
            return {
                'timestamp': datetime.now().isoformat(),
                'status': 'unhealthy',
                'issues': [{'type': 'connection_error', 'message': str(e)}],
                'metrics': {}
            }

    def export_to_prometheus(self):
        """Export the metrics in Prometheus text format."""
        health = self.get_health_status()
        label = f'{{instance="{self.instance}"}}'
        values = health['metrics']

        metrics = [
            # Basic metrics
            f'redis_up{label} {1 if health["status"] != "unhealthy" else 0}',
            f'redis_memory_usage_percent{label} {values.get("memory_usage", 0)}',
            f'redis_hit_rate_percent{label} {values.get("hit_rate", 0)}',
            f'redis_connection_usage_percent{label} {values.get("connection_usage", 0)}',
            # Number of detected issues
            f'redis_issues_total{label} {len(health["issues"])}',
        ]
        return '\n'.join(metrics)


# Usage example
if __name__ == "__main__":
    monitor = RedisMonitor()
    health_status = monitor.get_health_status()
    print(json.dumps(health_status, indent=2, ensure_ascii=False))

    # Export Prometheus metrics
    prometheus_metrics = monitor.export_to_prometheus()
    print("\nPrometheus Metrics:")
    print(prometheus_metrics)

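Top 10 Interview Questions 💯

13. Must-know interview questions explained

Q: How do you tell whether Redis has a performance problem?

A: Look mainly at these indicators:

Whether slowlog contains a large number of slow queries
Whether instantaneous_ops_per_sec in INFO stats is normal
Whether memory usage keeps growing
Whether the cache hit rate is below 90%

Q: What is a sensible value for slowlog-log-slower-than?

A: 10000 (10 ms) is a reasonable starting point; for latency-critical workloads it can be lowered to 1000 (1 ms).

Q: When does mem_fragmentation_ratio need attention?

A: Above 2.0 indicates heavy memory fragmentation; below 1.0 suggests that part of Redis' memory has been swapped out, usually because the host is short of memory.

Q: Which metrics should Redis monitoring focus on?

A: QPS, memory usage, cache hit rate, connection count, and persistence status.

Q: How do you troubleshoot Redis slowing down?

A: 1) Check slowlog to find the slow queries
2) Review the statistics in INFO stats
3) Analyze the size distribution of keys
4) Check network latency

Q: What advantages does RedisInsight have over command-line monitoring?

A: The graphical interface is more intuitive, it supports analysis of historical data, it can watch several instances at once, and it offers alerting. Which of these are available depends on the RedisInsight version you run.

Q: How do you set reasonable alert thresholds?

A: Build a baseline from historical data. Typical starting points:

Memory usage >85%: alert
QPS spike >200% of baseline: alert
Cache hit rate <90%: alert

Q: How do you monitor a Redis master-replica setup?

A: Monitor replication lag, synchronization status, and the health of the replica nodes.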
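
Replication lag can be read from INFO replication on the master: compare `master_repl_offset` with each replica's `offset`. The sketch below assumes the dict shape redis-py's `info('replication')` produces, where each `slaveN` entry is itself a dict; the function name, addresses, and numbers are made up.

```python
def replica_lag(info):
    """Summarize each replica's lag from a master's INFO replication dict."""
    master_offset = info["master_repl_offset"]
    report = []
    for index in range(info.get("connected_slaves", 0)):
        replica = info.get(f"slave{index}")
        if not isinstance(replica, dict):
            continue  # entry missing or not parsed into fields
        report.append({
            "replica": f'{replica["ip"]}:{replica["port"]}',
            "state": replica["state"],
            "bytes_behind": master_offset - replica["offset"],
            "seconds_since_ack": replica["lag"],
        })
    return report


# Invented sample in the shape redis-py returns
sample_replication = {
    "role": "master",
    "connected_slaves": 2,
    "master_repl_offset": 5000,
    "slave0": {"ip": "10.0.0.2", "port": 6379, "state": "online", "offset": 4800, "lag": 1},
    "slave1": {"ip": "10.0.0.3", "port": 6379, "state": "online", "offset": 5000, "lag": 0},
}

for row in replica_lag(sample_replication):
    print(row)
```

A replica whose `bytes_behind` keeps growing, or whose state is not "online", is the one to look at first.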

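Q: What collection frequency is recommended in production?

A: Collect key metrics every 30 seconds and watch slow queries in real time.

Q: How do you plan Redis capacity?

A: Start from the current data volume and its growth rate, and reserve at least 50% headroom.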
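
As a worked example of that rule, the sketch below projects memory need under compound monthly growth and then adds the 50% headroom. The compound-growth model, the function name, and the numbers (10 GB today, 5% growth per month, 12 months) are assumptions chosen for illustration; real growth is rarely this smooth.

```python
def plan_capacity(current_gb, monthly_growth_rate, months, headroom=0.5):
    """Return (projected_gb, provisioned_gb) assuming compound monthly growth."""
    if current_gb < 0 or months < 0 or headroom < 0:
        raise ValueError("inputs must be non-negative")
    projected = current_gb * (1 + monthly_growth_rate) ** months
    provisioned = projected * (1 + headroom)
    return projected, provisioned


projected, provisioned = plan_capacity(10, 0.05, 12)
print(f"projected: {projected:.2f} GB, provision: {provisioned:.2f} GB")
# projected: 17.96 GB, provision: 26.94 GB
```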

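Case Study 🌟

14. A Redis monitoring plan for an e-commerce system

Data flow: user requests -> application servers -> Redis cluster -> monitoring system -> alert notifications
Monitoring dimensions: business metrics, performance metrics, resource metrics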

15. Monitoring configuration best practices

yaml

# A complete Redis monitoring configuration plan
redis_monitoring:
  collection_interval: 30s          # collection interval
  retention_period: 30d             # data retention period
  
  metrics_to_collect:
    - name: basic_stats
      commands: [INFO stats, INFO memory, INFO clients]
      
    - name: slow_queries
      commands: [SLOWLOG GET 50]
      
    - name: key_patterns
      patterns: ["user:*", "order:*", "product:*"]
  
  alerting:
    channels:
      - type: email
        recipients: ["ops@company.com"]
      - type: webhook
        url: "https://alerts.company.com/webhook"
    
    rules:
      - metric: memory_usage
        operator: ">"
        threshold: 85
        duration: 5m
        
      - metric: hit_rate
        operator: "<"
        threshold: 90
        duration: 10m

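Summary and Best Practices 📝

16. Today's key points

✅ Key takeaways:

The complete architecture of a Redis monitoring system
Practical use of slowlog and the INFO command
Working with a visual monitoring tool
Configuring alerts and troubleshooting problems

✅ Golden rules of monitoring:

Prevention first, monitoring second
Quantify metrics, alert precisely
Analyze across dimensions, respond quickly
Keep optimizing, head off trouble early

✅ Recommendations for production:

Build a complete monitoring and alerting system
Review the slow query log regularly
Set sensible performance baselines
Plan capacity and prepare contingency plans

🎉 A word from Lao Cao: monitoring is a means, not an end! We monitor in order to find problems, fix problems, and prevent problems. Remember: good monitoring gets problems solved before they blow up!

In the next lesson we enter the world of Redis high availability and learn about master-replica replication! 🔄

Follow Lao Cao and become a real Redis operations expert! 🚀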
