实战：一次完整的网站故障排查记录（从用户访问到数据库）

1. 故障背景与现象描述

1.1 故障场景

某电商网站 shop.example.com 在促销活动期间出现以下问题：

用户反映网站访问缓慢，页面加载时间超过10秒
部分用户无法完成支付流程
后台管理系统频繁报错
监控系统显示服务器资源使用率异常

1.2 初始信息收集

创建信息收集脚本：initial_info_collection.sh

bash 复制代码

#!/bin/bash

# 初始信息收集脚本
LOG_FILE="/tmp/troubleshooting_$(date +%Y%m%d_%H%M%S).log"

# 记录开始时间
echo "=== 故障排查开始时间: $(date) ===" | tee -a $LOG_FILE

# 收集系统基本信息
echo -e "\n=== 系统基本信息 ===" | tee -a $LOG_FILE
echo "主机名: $(hostname)" | tee -a $LOG_FILE
echo "系统版本: $(cat /etc/redhat-release 2>/dev/null || cat /etc/os-release 2>/dev/null | grep PRETTY_NAME)" | tee -a $LOG_FILE
echo "内核版本: $(uname -r)" | tee -a $LOG_FILE
echo "当前用户: $(whoami)" | tee -a $LOG_FILE

# 收集时间信息
echo -e "\n=== 时间同步状态 ===" | tee -a $LOG_FILE
timedatectl status 2>/dev/null || date | tee -a $LOG_FILE
chronyc sources 2>/dev/null | head -10 | tee -a $LOG_FILE

# 收集网络信息
echo -e "\n=== 网络连接状态 ===" | tee -a $LOG_FILE
netstat -tulpn | grep -E ":(80|443|3306)" | tee -a $LOG_FILE

# 收集进程信息
echo -e "\n=== 相关进程状态 ===" | tee -a $LOG_FILE
ps aux | grep -E "(nginx|apache|mysql|php|java)" | head -20 | tee -a $LOG_FILE

echo -e "\n初始信息收集完成，详细日志: $LOG_FILE"

2. 用户端访问排查

2.1 模拟用户访问测试

创建访问测试脚本：user_access_test.sh

bash 复制代码

#!/bin/bash

# 用户端访问测试脚本
DOMAIN="shop.example.com"
TEST_URLS=(
    "https://$DOMAIN"
    "https://$DOMAIN/products"
    "https://$DOMAIN/checkout"
    "https://$DOMAIN/api/health"
)

# 设置颜色输出
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

# 测试函数
test_url() {
    local url=$1
    echo -e "\n测试 URL: $url"
    
    # 测试DNS解析
    local domain=$(echo $url | awk -F/ '{print $3}')
    echo "DNS解析测试:"
    nslookup $domain 2>&1 | grep -A5 "Name:"
    
    # 测试HTTP响应
    echo -e "\nHTTP响应测试:"
    local start_time=$(date +%s%N)
    
    # 使用curl进行详细测试
    local response=$(curl -L -s -w \
        "HTTP状态码: %{http_code}\n\
         总时间: %{time_total}s\n\
         DNS解析时间: %{time_namelookup}s\n\
         连接建立时间: %{time_connect}s\n\
         SSL握手时间: %{time_appconnect}s\n\
         开始传输时间: %{time_starttransfer}s\n\
         重定向次数: %{num_redirects}\n\
         重定向时间: %{time_redirect}s\n\
         下载速度: %{speed_download} bytes/sec\n" \
        -o /dev/null \
        "$url")
    
    local end_time=$(date +%s%N)
    local total_time=$(( (end_time - start_time) / 1000000 ))
    
    echo "$response"
    
    # 判断响应时间
    local http_code=$(echo "$response" | grep "HTTP状态码" | awk '{print $2}')
    local total_time_sec=$(echo "$response" | grep "总时间" | awk '{print $2}' | sed 's/s//')
    
    if [ "$http_code" -eq 200 ]; then
        if (( $(echo "$total_time_sec > 5" | bc -l) )); then
            echo -e "${YELLOW}警告: 响应时间过长${NC}"
        else
            echo -e "${GREEN}正常: 访问成功${NC}"
        fi
    else
        echo -e "${RED}错误: HTTP状态码 $http_code${NC}"
    fi
    
    # 测试TCP连接
    echo -e "\nTCP端口连接测试:"
    local port=443
    if echo "$url" | grep -q "http:"; then
        port=80
    fi
    
    timeout 5 bash -c "echo >/dev/tcp/$domain/$port" 2>/dev/null && \
        echo -e "${GREEN}端口 $port 连接成功${NC}" || \
        echo -e "${RED}端口 $port 连接失败${NC}"
}

# 执行所有测试
echo "=== 用户端访问测试开始 ==="
for url in "${TEST_URLS[@]}"; do
    test_url "$url"
    echo "----------------------------------------"
done

# 网络质量测试
echo -e "\n=== 网络质量测试 ==="
ping -c 5 $DOMAIN | tail -3
traceroute -m 15 $DOMAIN 2>/dev/null | head -10

2.2 浏览器端问题排查

创建浏览器诊断脚本：browser_diagnosis.html

html 复制代码

<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="UTF-8">
    <title>网站性能诊断工具</title>
    <style>
        body { 
            font-family: Arial, sans-serif; 
            margin: 40px; 
            background-color: #1e1e1e;
            color: #ffffff;
        }
        .test-section { 
            margin: 20px 0; 
            padding: 15px; 
            border: 1px solid #444; 
            border-radius: 5px;
        }
        button { 
            padding: 10px 20px; 
            margin: 5px; 
            background: #007acc; 
            color: white; 
            border: none; 
            border-radius: 3px; 
            cursor: pointer;
        }
        button:hover { background: #005a9e; }
        .result { 
            margin-top: 10px; 
            padding: 10px; 
            background: #2d2d30; 
            border-radius: 3px; 
            white-space: pre-wrap;
        }
        .success { color: #4ec9b0; }
        .warning { color: #ffcc00; }
        .error { color: #f44747; }
    </style>
</head>
<body>
    <h1>网站性能诊断工具</h1>
    
    <div class="test-section">
        <h3>1. 基础连接测试</h3>
        <button onclick="testDNS()">测试DNS解析</button>
        <button onclick="testPing()">测试网络延迟</button>
        <div id="dnsResult" class="result"></div>
    </div>

    <div class="test-section">
        <h3>2. 资源加载测试</h3>
        <button onclick="testResourceLoad()">测试资源加载时间</button>
        <div id="resourceResult" class="result"></div>
    </div>

    <div class="test-section">
        <h3>3. WebSocket测试</h3>
        <button onclick="testWebSocket()">测试WebSocket连接</button>
        <div id="websocketResult" class="result"></div>
    </div>

    <div class="test-section">
        <h3>4. 控制台错误检查</h3>
        <button onclick="checkConsoleErrors()">检查JavaScript错误</button>
        <div id="consoleResult" class="result"></div>
    </div>

<script>
// DNS解析测试
async function testDNS() {
    const domain = 'shop.example.com';
    const resultDiv = document.getElementById('dnsResult');
    resultDiv.innerHTML = '测试中...';
    
    try {
        const startTime = performance.now();
        const response = await fetch(`https://dns.google/resolve?name=${domain}`);
        const data = await response.json();
        const endTime = performance.now();
        
        let result = `DNS解析时间: ${(endTime - startTime).toFixed(2)}ms\n`;
        result += `解析结果: ${JSON.stringify(data.Answer || data.Authority, null, 2)}`;
        
        resultDiv.innerHTML = result;
        resultDiv.className = 'result success';
    } catch (error) {
        resultDiv.innerHTML = `DNS解析失败: ${error.message}`;
        resultDiv.className = 'result error';
    }
}

// 资源加载测试
function testResourceLoad() {
    const resources = [
        'https://shop.example.com/static/css/main.css',
        'https://shop.example.com/static/js/app.js',
        'https://shop.example.com/api/products'
    ];
    
    const resultDiv = document.getElementById('resourceResult');
    resultDiv.innerHTML = '';
    
    resources.forEach(url => {
        const startTime = performance.now();
        fetch(url, { method: 'HEAD' })
            .then(response => {
                const endTime = performance.now();
                const loadTime = endTime - startTime;
                
                const status = response.status === 200 ? '✅' : '❌';
                const line = `${status} ${url}: ${loadTime.toFixed(2)}ms (状态: ${response.status})`;
                
                resultDiv.innerHTML += line + '\n';
                
                if (loadTime > 1000) {
                    resultDiv.className = 'result warning';
                } else if (response.status === 200) {
                    resultDiv.className = 'result success';
                } else {
                    resultDiv.className = 'result error';
                }
            })
            .catch(error => {
                resultDiv.innerHTML += `❌ ${url}: 加载失败 - ${error.message}\n`;
                resultDiv.className = 'result error';
            });
    });
}

// WebSocket测试
function testWebSocket() {
    const resultDiv = document.getElementById('websocketResult');
    resultDiv.innerHTML = 'WebSocket连接测试中...';
    
    const ws = new WebSocket('wss://shop.example.com/ws');
    const startTime = performance.now();
    
    ws.onopen = function() {
        const endTime = performance.now();
        resultDiv.innerHTML = `✅ WebSocket连接成功\n连接时间: ${(endTime - startTime).toFixed(2)}ms`;
        resultDiv.className = 'result success';
        ws.close();
    };
    
    ws.onerror = function(error) {
        resultDiv.innerHTML = `❌ WebSocket连接失败: ${error.message}`;
        resultDiv.className = 'result error';
    };
    
    setTimeout(() => {
        if (ws.readyState !== WebSocket.OPEN) {
            resultDiv.innerHTML = '❌ WebSocket连接超时';
            resultDiv.className = 'result error';
            ws.close();
        }
    }, 5000);
}

// 控制台错误检查
function checkConsoleErrors() {
    const resultDiv = document.getElementById('consoleResult');
    
    // 重写console.error来捕获错误
    const originalError = console.error;
    const errors = [];
    
    console.error = function(...args) {
        errors.push(args.join(' '));
        originalError.apply(console, args);
    };
    
    // 触发可能的错误
    setTimeout(() => {
        console.error = originalError;
        
        if (errors.length > 0) {
            resultDiv.innerHTML = `发现 ${errors.length} 个错误:\n${errors.join('\n')}`;
            resultDiv.className = 'result error';
        } else {
            resultDiv.innerHTML = '✅ 未发现控制台错误';
            resultDiv.className = 'result success';
        }
    }, 2000);
}
</script>
</body>
</html>

3. 网络层排查

3.1 网络连通性深度测试

创建网络诊断脚本：network_diagnosis.sh

bash 复制代码

#!/bin/bash

# 网络层深度诊断脚本
DOMAIN="shop.example.com"
LOG_FILE="/tmp/network_diagnosis_$(date +%Y%m%d_%H%M%S).log"

echo "=== 网络层深度诊断 ===" | tee -a $LOG_FILE

# 1. 基础网络测试
echo -e "\n1. 基础网络连通性测试:" | tee -a $LOG_FILE
ping -c 10 -i 0.2 $DOMAIN | tee -a $LOG_FILE

# 2. MTU路径发现
echo -e "\n2. MTU路径发现:" | tee -a $LOG_FILE
for mtu in 1500 1400 1300 1200 1100 1000; do
    echo "测试MTU: $mtu" | tee -a $LOG_FILE
    ping -c 2 -M do -s $((mtu-28)) $DOMAIN 2>&1 | grep -E "bytes from|Packet" | tee -a $LOG_FILE
done

# 3. 路由跟踪
echo -e "\n3. 详细路由跟踪:" | tee -a $LOG_FILE
traceroute -w 2 -q 2 -m 20 $DOMAIN 2>/dev/null | tee -a $LOG_FILE

# 4. 端口扫描
echo -e "\n4. 关键端口扫描:" | tee -a $LOG_FILE
ports=(80 443 3306 6379 22)
for port in "${ports[@]}"; do
    timeout 3 bash -c "echo >/dev/tcp/$DOMAIN/$port" 2>/dev/null && \
        echo "端口 $port: 开放" | tee -a $LOG_FILE || \
        echo "端口 $port: 关闭" | tee -a $LOG_FILE
done

# 5. SSL证书检查
echo -e "\n5. SSL证书检查:" | tee -a $LOG_FILE
openssl s_client -connect $DOMAIN:443 -servername $DOMAIN < /dev/null 2>/dev/null | \
    openssl x509 -noout -dates | tee -a $LOG_FILE

# 6. HTTP/2支持检查
echo -e "\n6. HTTP/2支持检查:" | tee -a $LOG_FILE
curl -I --http2 -s https://$DOMAIN | grep -i "HTTP/" | tee -a $LOG_FILE

# 7. DNS解析详细检查
echo -e "\n7. DNS解析详细检查:" | tee -a $LOG_FILE
echo "A记录:" | tee -a $LOG_FILE
dig A $DOMAIN +short | tee -a $LOG_FILE
echo "AAAA记录:" | tee -a $LOG_FILE
dig AAAA $DOMAIN +short | tee -a $LOG_FILE
echo "CNAME记录:" | tee -a $LOG_FILE
dig CNAME $DOMAIN +short | tee -a $LOG_FILE

# 8. 网络性能测试
echo -e "\n8. 网络性能测试:" | tee -a $LOG_FILE
# 下载速度测试（小文件）
curl -o /dev/null -w "下载速度: %{speed_download} bytes/sec\n" \
    https://$DOMAIN/static/images/logo.png 2>&1 | tee -a $LOG_FILE

# 9. 连接数统计
echo -e "\n9. 当前连接数统计:" | tee -a $LOG_FILE
netstat -an | grep -E ":(80|443)" | grep ESTABLISHED | wc -l | \
    awk '{print "ESTABLISHED连接数:", $1}' | tee -a $LOG_FILE

echo -e "\n网络诊断完成，详细日志: $LOG_FILE"

3.2 网络流量分析

创建流量分析脚本：traffic_analysis.sh

bash 复制代码

#!/bin/bash

# 网络流量分析脚本
INTERFACE="eth0"
DURATION=60
LOG_FILE="/tmp/traffic_analysis_$(date +%Y%m%d_%H%M%S).log"

echo "=== 网络流量分析开始 ===" | tee -a $LOG_FILE
echo "监控网卡: $INTERFACE" | tee -a $LOG_FILE
echo "监控时长: ${DURATION}秒" | tee -a $LOG_FILE

# 1. 使用tcpdump捕获流量
echo -e "\n1. 开始捕获HTTP/HTTPS流量..." | tee -a $LOG_FILE
tcpdump -i $INTERFACE -w /tmp/traffic_capture.pcap -c 1000 port 80 or port 443 2>/dev/null &
TCPDUMP_PID=$!

# 2. 实时连接监控
echo -e "\n2. 实时连接监控:" | tee -a $LOG_FILE
for i in $(seq 1 $DURATION); do
    echo "--- 第 $i 秒 ---" >> $LOG_FILE
    netstat -tn | grep -E ":(80|443)" | grep ESTABLISHED | \
        awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -5 >> $LOG_FILE
    sleep 1
done

# 停止tcpdump
kill $TCPDUMP_PID 2>/dev/null
wait $TCPDUMP_PID 2>/dev/null

# 3. 分析捕获的流量
echo -e "\n3. 流量分析结果:" | tee -a $LOG_FILE
if [ -f /tmp/traffic_capture.pcap ]; then
    # 统计TOP IP
    echo "TOP来源IP:" | tee -a $LOG_FILE
    tcpdump -nn -r /tmp/traffic_capture.pcap 2>/dev/null | \
        awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head -10 | tee -a $LOG_FILE
    
    # 统计协议分布
    echo -e "\n协议分布:" | tee -a $LOG_FILE
    tcpdump -nn -r /tmp/traffic_capture.pcap 2>/dev/null | \
        awk '{print $5}' | cut -d. -f5 | sort | uniq -c | sort -rn | tee -a $LOG_FILE
    
    # 清理临时文件
    rm -f /tmp/traffic_capture.pcap
fi

# 4. 检查网络错误
echo -e "\n4. 网络错误统计:" | tee -a $LOG_FILE
netstat -i | grep $INTERFACE | tee -a $LOG_FILE

echo -e "\n流量分析完成，详细日志: $LOG_FILE"

4. 服务器层排查

4.1 系统资源监控

创建系统监控脚本：system_monitoring.sh

bash 复制代码

#!/bin/bash

# 系统资源监控脚本
DURATION=300  # 监控5分钟
INTERVAL=5
LOG_FILE="/tmp/system_monitoring_$(date +%Y%m%d_%H%M%S).log"

echo "=== 系统资源监控开始 ===" | tee -a $LOG_FILE
echo "监控时长: ${DURATION}秒" | tee -a $LOG_FILE
echo "采样间隔: ${INTERVAL}秒" | tee -a $LOG_FILE

# 监控函数
monitor_system() {
    local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    
    echo -e "\n--- $timestamp ---" >> $LOG_FILE
    
    # CPU使用率
    echo "CPU使用率:" >> $LOG_FILE
    mpstat 1 1 | grep -A5 "Average" >> $LOG_FILE
    
    # 内存使用
    echo -e "\n内存使用:" >> $LOG_FILE
    free -h >> $LOG_FILE
    
    # 磁盘IO
    echo -e "\n磁盘IO:" >> $LOG_FILE
    iostat -x 1 1 | grep -A10 "Device" >> $LOG_FILE
    
    # 负载情况
    echo -e "\n系统负载:" >> $LOG_FILE
    uptime >> $LOG_FILE
    
    # 进程资源使用TOP10
    echo -e "\n进程资源使用TOP10:" >> $LOG_FILE
    ps aux --sort=-%cpu | head -11 >> $LOG_FILE
    
    # 网络连接数
    echo -e "\n网络连接统计:" >> $LOG_FILE
    netstat -ant | awk '{print $6}' | sort | uniq -c | sort -rn >> $LOG_FILE
}

# 开始监控
END_TIME=$((SECONDS + DURATION))
while [ $SECONDS -lt $END_TIME ]; do
    monitor_system
    sleep $INTERVAL
done

# 生成总结报告
echo -e "\n=== 监控总结报告 ===" >> $LOG_FILE

# 检查是否有异常
echo -e "\n异常检查:" >> $LOG_FILE

# 检查负载
LOAD_AVG=$(uptime | awk -F'load average:' '{print $2}' | awk -F, '{print $1}' | tr -d ' ')
if (( $(echo "$LOAD_AVG > 10" | bc -l) )); then
    echo "警告: 系统负载过高: $LOAD_AVG" >> $LOG_FILE
fi

# 检查内存使用
MEM_FREE=$(free -m | grep Mem | awk '{print $7}')
if [ $MEM_FREE -lt 100 ]; then
    echo "警告: 可用内存不足: ${MEM_FREE}MB" >> $LOG_FILE
fi

# 检查磁盘空间
DISK_USAGE=$(df / | awk 'END{print $5}' | sed 's/%//')
if [ $DISK_USAGE -gt 90 ]; then
    echo "警告: 根分区使用率过高: ${DISK_USAGE}%" >> $LOG_FILE
fi

echo -e "\n监控完成，详细日志: $LOG_FILE"

4.2 Web服务器排查

创建Web服务器诊断脚本：webserver_diagnosis.sh

bash 复制代码

#!/bin/bash

# Web服务器诊断脚本
LOG_FILE="/tmp/webserver_diagnosis_$(date +%Y%m%d_%H%M%S).log"

echo "=== Web服务器诊断开始 ===" | tee -a $LOG_FILE

# 检测Web服务器类型
detect_webserver() {
    if systemctl is-active --quiet nginx; then
        echo "nginx"
    elif systemctl is-active --quiet apache2; then
        echo "apache"
    elif systemctl is-active --quiet httpd; then
        echo "httpd"
    else
        echo "unknown"
    fi
}

WEBSERVER=$(detect_webserver)
echo "检测到Web服务器: $WEBSERVER" | tee -a $LOG_FILE

# 通用检查
echo -e "\n1. Web服务状态:" | tee -a $LOG_FILE
systemctl status $WEBSERVER --no-pager | tee -a $LOG_FILE

echo -e "\n2. 监听端口:" | tee -a $LOG_FILE
netstat -tulpn | grep -E ":80|:443" | tee -a $LOG_FILE

echo -e "\n3. 活动连接数:" | tee -a $LOG_FILE
netstat -an | grep -E ":(80|443)" | grep ESTABLISHED | wc -l | tee -a $LOG_FILE

# 服务器特定检查
case $WEBSERVER in
    "nginx")
        echo -e "\n4. Nginx配置检查:" | tee -a $LOG_FILE
        nginx -t 2>&1 | tee -a $LOG_FILE
        
        echo -e "\n5. Nginx状态信息:" | tee -a $LOG_FILE
        # 检查是否有status模块
        curl -s http://localhost/nginx_status 2>/dev/null | tee -a $LOG_FILE
        
        echo -e "\n6. Nginx进程信息:" | tee -a $LOG_FILE
        ps aux | grep nginx | grep -v grep | tee -a $LOG_FILE
        
        echo -e "\n7. Nginx错误日志检查:" | tee -a $LOG_FILE
        tail -50 /var/log/nginx/error.log 2>/dev/null | tee -a $LOG_FILE
        ;;
        
    "apache"|"httpd")
        echo -e "\n4. Apache配置检查:" | tee -a $LOG_FILE
        apache2ctl -t 2>/dev/null || httpd -t 2>/dev/null | tee -a $LOG_FILE
        
        echo -e "\n5. Apache状态信息:" | tee -a $LOG_FILE
        curl -s http://localhost/server-status 2>/dev/null | head -20 | tee -a $LOG_FILE
        
        echo -e "\n6. Apache进程信息:" | tee -a $LOG_FILE
        ps aux | grep -E "apache2|httpd" | grep -v grep | tee -a $LOG_FILE
        
        echo -e "\n7. Apache错误日志检查:" | tee -a $LOG_FILE
        tail -50 /var/log/apache2/error.log 2>/dev/null || \
        tail -50 /var/log/httpd/error_log 2>/dev/null | tee -a $LOG_FILE
        ;;
        
    *)
        echo "未知的Web服务器" | tee -a $LOG_FILE
        ;;
esac

# 检查虚拟主机配置
echo -e "\n8. 虚拟主机配置:" | tee -a $LOG_FILE
if [ "$WEBSERVER" = "nginx" ]; then
    find /etc/nginx -name "*.conf" -exec grep -l "shop.example.com" {} \; 2>/dev/null | tee -a $LOG_FILE
elif [ "$WEBSERVER" = "apache" ] || [ "$WEBSERVER" = "httpd" ]; then
    find /etc/apache2 -name "*.conf" -exec grep -l "shop.example.com" {} \; 2>/dev/null | \
    find /etc/httpd -name "*.conf" -exec grep -l "shop.example.com" {} \; 2>/dev/null | tee -a $LOG_FILE
fi

# SSL证书检查
echo -e "\n9. SSL证书检查:" | tee -a $LOG_FILE
openssl s_client -connect shop.example.com:443 -servername shop.example.com < /dev/null 2>/dev/null | \
    openssl x509 -noout -text | grep -A5 "Validity" | tee -a $LOG_FILE

echo -e "\nWeb服务器诊断完成，详细日志: $LOG_FILE"

5. 应用层排查

5.1 应用程序日志分析

创建应用日志分析脚本：application_log_analysis.sh

bash 复制代码

#!/bin/bash

# 应用程序日志分析脚本
LOG_DIRS=("/var/log/nginx" "/var/log/apache2" "/var/log/httpd" "/var/www/logs" "/app/logs")
DATE_RANGE="1 hour ago"
LOG_FILE="/tmp/app_log_analysis_$(date +%Y%m%d_%H%M%S).log"

echo "=== 应用程序日志分析开始 ===" | tee -a $LOG_FILE

# 查找应用日志文件
find_log_files() {
    local log_files=()
    for dir in "${LOG_DIRS[@]}"; do
        if [ -d "$dir" ]; then
            while IFS= read -r file; do
                log_files+=("$file")
            done < <(find "$dir" -name "*.log" -type f -mtime -1 2>/dev/null)
        fi
    done
    printf '%s\n' "${log_files[@]}"
}

LOG_FILES=($(find_log_files))

echo "发现的日志文件:" | tee -a $LOG_FILE
printf '%s\n' "${LOG_FILES[@]}" | tee -a $LOG_FILE

# 分析每个日志文件
for log_file in "${LOG_FILES[@]}"; do
    echo -e "\n=== 分析日志文件: $log_file ===" | tee -a $LOG_FILE
    
    # 检查文件大小
    file_size=$(du -h "$log_file" | cut -f1)
    echo "文件大小: $file_size" | tee -a $LOG_FILE
    
    # 错误统计
    echo -e "\n错误统计 (最近1小时):" | tee -a $LOG_FILE
    if [[ "$log_file" == *"error"* ]] || [[ "$log_file" == *"Error"* ]]; then
        # 错误日志文件
        grep "$(date -d "$DATE_RANGE" '+%Y-%m-%d %H:')" "$log_file" 2>/dev/null | \
            awk '{print $1, $2, $3}' | sort | uniq -c | sort -rn | head -10 | tee -a $LOG_FILE
    else
        # 访问日志 - 查找错误状态码
        grep "$(date -d "$DATE_RANGE" '+%Y-%m-%d %H:')" "$log_file" 2>/dev/null | \
            awk '$9 >= 400 {print $9}' | sort | uniq -c | sort -rn | head -10 | tee -a $LOG_FILE
    fi
    
    # 慢请求分析（针对访问日志）
    if [[ "$log_file" == *"access"* ]]; then
        echo -e "\n慢请求分析 (响应时间 > 5秒):" | tee -a $LOG_FILE
        # 假设日志格式包含响应时间（需要根据实际日志格式调整）
        grep "$(date -d "$DATE_RANGE" '+%Y-%m-%d %H:')" "$log_file" 2>/dev/null | \
            awk '$NF > 5 {print $7, $NF "s"}' | sort -k2 -rn | head -10 | tee -a $LOG_FILE
    fi
    
    # 频繁访问IP
    echo -e "\n频繁访问IP TOP10:" | tee -a $LOG_FILE
    grep "$(date -d "$DATE_RANGE" '+%Y-%m-%d %H:')" "$log_file" 2>/dev/null | \
        awk '{print $1}' | sort | uniq -c | sort -rn | head -10 | tee -a $LOG_FILE
    
    # 热门URL
    echo -e "\n热门URL TOP10:" | tee -a $LOG_FILE
    grep "$(date -d "$DATE_RANGE" '+%Y-%m-%d %H:')" "$log_file" 2>/dev/null | \
        awk '{print $7}' | sort | uniq -c | sort -rn | head -10 | tee -a $LOG_FILE
done

# 检查应用进程
echo -e "\n=== 应用进程状态 ===" | tee -a $LOG_FILE
ps aux | grep -E "(php|java|python|node)" | grep -v grep | head -20 | tee -a $LOG_FILE

# 检查应用端口
echo -e "\n=== 应用端口监听 ===" | tee -a $LOG_FILE
netstat -tulpn | grep -E ":(8080|3000|9000)" | tee -a $LOG_FILE

echo -e "\n应用程序日志分析完成，详细日志: $LOG_FILE"

5.2 应用性能监控

创建应用性能监控脚本：application_performance.sh

bash 复制代码

#!/bin/bash

# 应用性能监控脚本
LOG_FILE="/tmp/app_performance_$(date +%Y%m%d_%H%M%S).log"

echo "=== 应用性能监控开始 ===" | tee -a $LOG_FILE

# 1. 检查应用响应时间
echo -e "\n1. 应用接口响应时间测试:" | tee -a $LOG_FILE
API_ENDPOINTS=(
    "https://shop.example.com/api/health"
    "https://shop.example.com/api/products"
    "https://shop.example.com/api/user/profile"
    "https://shop.example.com/api/order/checkout"
)

for endpoint in "${API_ENDPOINTS[@]}"; do
    echo -n "测试 $endpoint ... " | tee -a $LOG_FILE
    response_time=$(curl -o /dev/null -s -w "%{time_total}s\n" "$endpoint")
    echo "响应时间: $response_time" | tee -a $LOG_FILE
done

# 2. 数据库连接测试
echo -e "\n2. 数据库连接测试:" | tee -a $LOG_FILE
# 假设有数据库连接测试接口
curl -s "https://shop.example.com/api/db/health" | jq '.' 2>/dev/null | tee -a $LOG_FILE

# 3. 缓存状态检查
echo -e "\n3. 缓存状态检查:" | tee -a $LOG_FILE
curl -s "https://shop.example.com/api/cache/status" | jq '.' 2>/dev/null | tee -a $LOG_FILE

# 4. 外部服务依赖检查
echo -e "\n4. 外部服务依赖检查:" | tee -a $LOG_FILE
EXTERNAL_SERVICES=(
    "payment_gateway"
    "email_service" 
    "sms_service"
    "cdn_service"
)

for service in "${EXTERNAL_SERVICES[@]}"; do
    echo -n "检查 $service ... " | tee -a $LOG_FILE
    # 这里应该是实际的健康检查端点
    if curl -s "https://shop.example.com/api/dependencies/$service" | grep -q "healthy"; then
        echo "正常" | tee -a $LOG_FILE
    else
        echo "异常" | tee -a $LOG_FILE
    fi
done

# 5. 应用指标收集
echo -e "\n5. 应用指标收集:" | tee -a $LOG_FILE
APPLICATION_METRICS=(
    "active_sessions"
    "memory_usage"
    "request_queue"
    "error_rate"
)

for metric in "${APPLICATION_METRICS[@]}"; do
    value=$(curl -s "https://shop.example.com/api/metrics/$metric" 2>/dev/null)
    echo "$metric: $value" | tee -a $LOG_FILE
done

# 6. 垃圾回收统计（针对JVM应用）
echo -e "\n6. JVM应用状态检查:" | tee -a $LOG_FILE
JAVA_PID=$(pgrep -f java | head -1)
if [ -n "$JAVA_PID" ]; then
    echo "Java进程PID: $JAVA_PID" | tee -a $LOG_FILE
    jstat -gc $JAVA_PID 2>/dev/null | head -2 | tee -a $LOG_FILE
fi

echo -e "\n应用性能监控完成，详细日志: $LOG_FILE"

6. 数据库层排查

6.1 数据库性能分析

创建数据库诊断脚本：database_diagnosis.sh

bash 复制代码

#!/bin/bash

# 数据库性能诊断脚本
DB_TYPE="mysql"  # 可以是mysql或postgresql
LOG_FILE="/tmp/database_diagnosis_$(date +%Y%m%d_%H%M%S).log"

echo "=== 数据库性能诊断开始 ===" | tee -a $LOG_FILE

# 数据库连接测试
test_db_connection() {
    case $DB_TYPE in
        "mysql")
            mysql -h localhost -u root -p$DB_PASSWORD -e "SELECT 1 as connection_test;" 2>/dev/null
            ;;
        "postgresql")
            psql -h localhost -U postgres -c "SELECT 1 as connection_test;" 2>/dev/null
            ;;
    esac
}

echo "数据库连接测试:" | tee -a $LOG_FILE
if test_db_connection; then
    echo "✅ 数据库连接正常" | tee -a $LOG_FILE
else
    echo "❌ 数据库连接失败" | tee -a $LOG_FILE
    exit 1
fi

# MySQL特定诊断
if [ "$DB_TYPE" = "mysql" ]; then
    echo -e "\n1. MySQL状态检查:" | tee -a $LOG_FILE
    mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW GLOBAL STATUS;" 2>/dev/null | \
        grep -E "(Threads_connected|Threads_running|Queries|Slow_queries|Aborted_connects)" | tee -a $LOG_FILE
    
    echo -e "\n2. MySQL变量检查:" | tee -a $LOG_FILE
    mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW GLOBAL VARIABLES;" 2>/dev/null | \
        grep -E "(max_connections|innodb_buffer_pool_size|query_cache_size|tmp_table_size)" | tee -a $LOG_FILE
    
    echo -e "\n3. 进程列表:" | tee -a $LOG_FILE
    mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW PROCESSLIST;" 2>/dev/null | head -20 | tee -a $LOG_FILE
    
    echo -e "\n4. 表锁定状态:" | tee -a $LOG_FILE
    mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW ENGINE INNODB STATUS\G" 2>/dev/null | \
        grep -A10 "LATEST DETECTED DEADLOCK" | tee -a $LOG_FILE
    
    echo -e "\n5. 慢查询日志检查:" | tee -a $LOG_FILE
    mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW VARIABLES LIKE 'slow_query_log%';" 2>/dev/null | tee -a $LOG_FILE
    mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW VARIABLES LIKE 'long_query_time';" 2>/dev/null | tee -a $LOG_FILE
    
    echo -e "\n6. 数据库大小统计:" | tee -a $LOG_FILE
    mysql -h localhost -u root -p$DB_PASSWORD -e "
        SELECT table_schema as 'Database', 
        ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) as 'Size (MB)' 
        FROM information_schema.tables 
        GROUP BY table_schema;" 2>/dev/null | tee -a $LOG_FILE
    
    echo -e "\n7. 索引统计:" | tee -a $LOG_FILE
    mysql -h localhost -u root -p$DB_PASSWORD -e "
        SELECT TABLE_NAME, INDEX_NAME, COLUMN_NAME, SEQ_IN_INDEX
        FROM information_schema.STATISTICS 
        WHERE TABLE_SCHEMA = 'shop_db'
        ORDER BY TABLE_NAME, INDEX_NAME, SEQ_IN_INDEX;" 2>/dev/null | head -20 | tee -a $LOG_FILE

# PostgreSQL特定诊断
elif [ "$DB_TYPE" = "postgresql" ]; then
    echo -e "\n1. PostgreSQL连接统计:" | tee -a $LOG_FILE
    psql -h localhost -U postgres -c "
        SELECT datname, numbackends, xact_commit, xact_rollback 
        FROM pg_stat_database 
        WHERE datname = 'shop_db';" 2>/dev/null | tee -a $LOG_FILE
    
    echo -e "\n2. 活动查询:" | tee -a $LOG_FILE
    psql -h localhost -U postgres -c "
        SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state 
        FROM pg_stat_activity 
        WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';" 2>/dev/null | tee -a $LOG_FILE
    
    echo -e "\n3. 表大小统计:" | tee -a $LOG_FILE
    psql -h localhost -U postgres -c "
        SELECT schemaname, tablename, 
        pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
        FROM pg_tables 
        WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
        ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
        LIMIT 10;" 2>/dev/null | tee -a $LOG_FILE
fi

# 通用数据库检查
echo -e "\n8. 数据库连接数监控:" | tee -a $LOG_FILE
case $DB_TYPE in
    "mysql")
        mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW STATUS LIKE 'Threads_connected';" 2>/dev/null | tee -a $LOG_FILE
        ;;
    "postgresql")
        psql -h localhost -U postgres -c "SELECT count(*) FROM pg_stat_activity;" 2>/dev/null | tee -a $LOG_FILE
        ;;
esac

echo -e "\n9. 数据库性能计数器:" | tee -a $LOG_FILE
case $DB_TYPE in
    "mysql")
        mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW STATUS LIKE 'Innodb_buffer_pool_read%';" 2>/dev/null | tee -a $LOG_FILE
        ;;
    "postgresql")
        psql -h localhost -U postgres -c "SELECT * FROM pg_stat_bgwriter;" 2>/dev/null | tee -a $LOG_FILE
        ;;
esac

echo -e "\n数据库诊断完成，详细日志: $LOG_FILE"

6.2 慢查询分析

创建慢查询分析脚本：slow_query_analysis.sh

bash 复制代码

#!/bin/bash

# 慢查询分析脚本
LOG_FILE="/tmp/slow_query_analysis_$(date +%Y%m%d_%H%M%S).log"

echo "=== 慢查询分析开始 ===" | tee -a $LOG_FILE

# 查找慢查询日志
find_slow_logs() {
    find /var/lib/mysql /var/log/mysql /var/log -name "*slow*" -type f 2>/dev/null
}

SLOW_LOGS=($(find_slow_logs))

if [ ${#SLOW_LOGS[@]} -eq 0 ]; then
    echo "未找到慢查询日志文件" | tee -a $LOG_FILE
    exit 1
fi

echo "发现的慢查询日志文件:" | tee -a $LOG_FILE
printf '%s\n' "${SLOW_LOGS[@]}" | tee -a $LOG_FILE

for slow_log in "${SLOW_LOGS[@]}"; do
    echo -e "\n=== 分析慢查询日志: $slow_log ===" | tee -a $LOG_FILE
    
    # 检查文件是否可读
    if [ ! -r "$slow_log" ]; then
        echo "无法读取文件: $slow_log" | tee -a $LOG_FILE
        continue
    fi
    
    # 分析慢查询
    echo -e "\n1. 慢查询统计 (最近24小时):" | tee -a $LOG_FILE
    grep "$(date -d '24 hours ago' '+%Y-%m-%d')" "$slow_log" 2>/dev/null | \
        grep "Query_time" | wc -l | awk '{print "慢查询数量:", $1}' | tee -a $LOG_FILE
    
    echo -e "\n2. 最慢的10个查询:" | tee -a $LOG_FILE
    grep -A5 "Query_time" "$slow_log" 2>/dev/null | \
        awk '/Query_time/ {time=$3} /SET timestamp/ {timestamp=$3} /# Time:/ {print "时间:", timestamp, "查询时间:", time "s"}' | \
        sort -k4 -rn | head -10 | tee -a $LOG_FILE
    
    echo -e "\n3. 按模式分组的慢查询:" | tee -a $LOG_FILE
    # 提取SQL语句的前50个字符作为模式
    grep -A2 "Query_time" "$slow_log" 2>/dev/null | \
        grep -E "^(SELECT|UPDATE|DELETE|INSERT)" | \
        cut -c1-50 | sort | uniq -c | sort -rn | head -10 | tee -a $LOG_FILE
    
    echo -e "\n4. 按小时分布的慢查询:" | tee -a $LOG_FILE
    grep "Query_time" "$slow_log" 2>/dev/null | \
        awk '{print $1, $2}' | cut -d: -f1 | sort | uniq -c | tee -a $LOG_FILE
done

# 实时监控慢查询
echo -e "\n=== 实时慢查询监控 (60秒) ===" | tee -a $LOG_FILE
for i in {1..12}; do
    echo "--- 第 $((i*5)) 秒 ---" | tee -a $LOG_FILE
    case $DB_TYPE in
        "mysql")
            mysql -h localhost -u root -p$DB_PASSWORD -e "
                SHOW FULL PROCESSLIST;
            " 2>/dev/null | grep -v "Sleep" | tee -a $LOG_FILE
            ;;
        "postgresql")
            psql -h localhost -U postgres -c "
                SELECT pid, now() - pg_stat_activity.query_start AS duration, query 
                FROM pg_stat_activity 
                WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
                AND state = 'active';
            " 2>/dev/null | tee -a $LOG_FILE
            ;;
    esac
    sleep 5
done

echo -e "\n慢查询分析完成，详细日志: $LOG_FILE"

7. 故障排查流程图

graph TD A[用户报告网站访问慢] --> B{用户端访问测试} B --> C{DNS解析正常?} C -->|是| D{网络连通性正常?} C -->|否| E[DNS配置问题] D -->|是| F{服务器响应正常?} D -->|否| G[网络路由问题] F -->|是| H{应用服务正常?} F -->|否| I[服务器负载问题] H -->|是| J{数据库响应正常?} H -->|否| K[应用代码问题] J -->|是| L[缓存/外部服务问题] J -->|否| M[数据库性能问题] E --> N[修复DNS配置] G --> O[检查网络设备] I --> P[优化服务器配置] K --> Q[修复应用代码] M --> R[优化数据库] L --> S[检查缓存服务] N --> T[故障解决] O --> T P --> T Q --> T R --> T S --> T style A fill:#1e3a5f,color:#ffffff style B fill:#4a1e5f,color:#ffffff style C fill:#5f3a1e,color:#ffffff style D fill:#5f3a1e,color:#ffffff style E fill:#5f1e1e,color:#ffffff style F fill:#5f3a1e,color:#ffffff style G fill:#5f1e1e,color:#ffffff style H fill:#5f3a1e,color:#ffffff style I fill:#5f1e1e,color:#ffffff style J fill:#5f3a1e,color:#ffffff style K fill:#5f1e1e,color:#ffffff style L fill:#5f1e1e,color:#ffffff style M fill:#5f1e1e,color:#ffffff style N fill:#1e5f3a,color:#ffffff style O fill:#1e5f3a,color:#ffffff style P fill:#1e5f3a,color:#ffffff style Q fill:#1e5f3a,color:#ffffff style R fill:#1e5f3a,color:#ffffff style S fill:#1e5f3a,color:#ffffff style T fill:#1e3a5f,color:#ffffff

8. 故障解决方案与预防

8.1 创建综合修复脚本

创建综合修复脚本：comprehensive_fix.sh

bash 复制代码

#!/bin/bash

# 综合故障修复脚本
LOG_FILE="/tmp/comprehensive_fix_$(date +%Y%m%d_%H%M%S).log"

echo "=== 综合故障修复开始 ===" | tee -a $LOG_FILE

# 1. 清理系统资源
echo -e "\n1. 清理系统资源..." | tee -a $LOG_FILE

# 清理临时文件
find /tmp -type f -atime +1 -delete 2>/dev/null
echo "临时文件清理完成" | tee -a $LOG_FILE

# 清理日志文件
find /var/log -name "*.log" -type f -size +1G -exec truncate -s 0 {} \; 2>/dev/null
echo "大日志文件清理完成" | tee -a $LOG_FILE

# 2. 优化系统配置
echo -e "\n2. 优化系统配置..." | tee -a $LOG_FILE

# 调整文件描述符限制
echo "* soft nofile 65536" >> /etc/security/limits.conf
echo "* hard nofile 65536" >> /etc/security/limits.conf
echo "文件描述符限制已调整" | tee -a $LOG_FILE

# 3. 重启服务（按依赖顺序）
echo -e "\n3. 重启相关服务..." | tee -a $LOG_FILE

services=("mysql" "redis" "nginx" "php-fpm" "application-service")

for service in "${services[@]}"; do
    if systemctl is-active --quiet "$service"; then
        echo "重启服务: $service" | tee -a $LOG_FILE
        systemctl restart "$service"
        sleep 5
        
        # 验证服务状态
        if systemctl is-active --quiet "$service"; then
            echo "✅ $service 重启成功" | tee -a $LOG_FILE
        else
            echo "❌ $service 重启失败" | tee -a $LOG_FILE
        fi
    fi
done

# 4. 数据库优化
echo -e "\n4. 数据库优化..." | tee -a $LOG_FILE

# MySQL优化
if command -v mysql &> /dev/null; then
    mysql -h localhost -u root -p$DB_PASSWORD -e "
        FLUSH TABLES;
        FLUSH QUERY CACHE;
        RESET QUERY CACHE;
    " 2>/dev/null && echo "数据库缓存已刷新" | tee -a $LOG_FILE
fi

# 5. 应用缓存清理
echo -e "\n5. 清理应用缓存..." | tee -a $LOG_FILE

# 清理PHP OPcache
if command -v php &> /dev/null; then
    php -r "opcache_reset();" 2>/dev/null && echo "PHP OPcache已清理" | tee -a $LOG_FILE
fi

# 清理Redis缓存
if command -v redis-cli &> /dev/null; then
    redis-cli FLUSHALL 2>/dev/null && echo "Redis缓存已清理" | tee -a $LOG_FILE
fi

# 6. 监控配置检查
echo -e "\n6. 配置监控告警..." | tee -a $LOG_FILE

# 创建基础监控脚本
cat > /usr/local/bin/system_health_check.sh << 'EOF'
#!/bin/bash
# 系统健康检查脚本

# 检查条件
check_load() {
    local load=$(awk '{print $1}' /proc/loadavg)
    local cores=$(nproc)
    if (( $(echo "$load > $cores * 2" | bc -l) )); then
        echo "HIGH_LOAD: $load"
        return 1
    fi
    return 0
}

check_memory() {
    local mem_free=$(free -m | awk 'NR==2{print $4}')
    if [ $mem_free -lt 100 ]; then
        echo "LOW_MEMORY: ${mem_free}MB"
        return 1
    fi
    return 0
}

check_disk() {
    local disk_usage=$(df / | awk 'NR==2{print $5}' | sed 's/%//')
    if [ $disk_usage -gt 90 ]; then
        echo "HIGH_DISK_USAGE: ${disk_usage}%"
        return 1
    fi
    return 0
}

# 执行检查
check_load
check_memory
check_disk
EOF

chmod +x /usr/local/bin/system_health_check.sh

# 7. 创建定期维护任务
echo -e "\n7. 配置定期维护..." | tee -a $LOG_FILE

# 添加定时维护任务
(crontab -l 2>/dev/null; echo "0 2 * * * /usr/local/bin/system_health_check.sh >> /var/log/health_check.log") | crontab -
(crontab -l 2>/dev/null; echo "0 4 * * 0 /usr/local/bin/cleanup_script.sh") | crontab -

echo -e "\n综合故障修复完成" | tee -a $LOG_FILE
echo "详细日志: $LOG_FILE" | tee -a $LOG_FILE

8.2 创建预防性监控

创建预防性监控脚本：preventive_monitoring.sh

bash 复制代码

#!/bin/bash

# 预防性监控配置脚本
LOG_FILE="/tmp/preventive_monitoring_$(date +%Y%m%d_%H%M%S).log"

echo "=== 预防性监控配置开始 ===" | tee -a $LOG_FILE

# 1. 安装监控工具
echo -e "\n1. 安装监控工具..." | tee -a $LOG_FILE

if command -v apt-get &> /dev/null; then
    # Ubuntu/Debian
    apt-get update
    apt-get install -y htop iotop nethogs sysstat dstat
elif command -v yum &> /dev/null; then
    # CentOS/RHEL
    yum install -y htop iotop nethogs sysstat dstat
fi

# 2. 配置系统监控
echo -e "\n2. 配置系统监控..." | tee -a $LOG_FILE

# 启用sysstat数据收集
sed -i 's/ENABLED=\"false\"/ENABLED=\"true\"/' /etc/default/sysstat
systemctl enable sysstat
systemctl start sysstat

# 3. 创建自定义监控脚本
echo -e "\n3. 创建自定义监控脚本..." | tee -a $LOG_FILE

cat > /usr/local/bin/website_health_check.sh << 'EOF'
#!/bin/bash
# 网站健康检查脚本

BASE_URL="https://shop.example.com"
ALERT_EMAIL="admin@example.com"
LOG_FILE="/var/log/website_health.log"

# 检查函数
check_http_response() {
    local url=$1
    local response=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 10 "$url")
    if [ "$response" -ne 200 ]; then
        echo "HTTP_ERROR: $url returned $response"
        return 1
    fi
    return 0
}

check_response_time() {
    local url=$1
    local threshold=$2
    local response_time=$(curl -s -o /dev/null -w "%{time_total}" --connect-timeout 10 "$url")
    if (( $(echo "$response_time > $threshold" | bc -l) )); then
        echo "SLOW_RESPONSE: $url took ${response_time}s"
        return 1
    fi
    return 0
}

check_database() {
    # 数据库连接检查
    if ! mysql -h localhost -u root -p$DB_PASSWORD -e "SELECT 1;" &> /dev/null; then
        echo "DATABASE_ERROR: Cannot connect to database"
        return 1
    fi
    return 0
}

# 执行检查
timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$timestamp] Starting health check..." >> $LOG_FILE

errors=0
check_http_response "$BASE_URL" || ((errors++))
check_response_time "$BASE_URL" 3 || ((errors++))
check_http_response "$BASE_URL/api/health" || ((errors++))
check_response_time "$BASE_URL/api/health" 2 || ((errors++))
check_database || ((errors++))

if [ $errors -gt 0 ]; then
    echo "[$timestamp] Health check failed with $errors errors" >> $LOG_FILE
    # 发送告警邮件
    echo "Health check failed at $timestamp. Check $LOG_FILE for details." | mail -s "Website Health Alert" $ALERT_EMAIL
else
    echo "[$timestamp] Health check passed" >> $LOG_FILE
fi
EOF

chmod +x /usr/local/bin/website_health_check.sh

# 4. 配置日志轮转
echo -e "\n4. 配置日志轮转..." | tee -a $LOG_FILE

cat > /etc/logrotate.d/website_monitoring << 'EOF'
/var/log/website_health.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 644 root root
}
EOF

# 5. 设置监控定时任务
echo -e "\n5. 设置监控定时任务..." | tee -a $LOG_FILE

(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/local/bin/website_health_check.sh") | crontab -
(crontab -l 2>/dev/null; echo "0 * * * * /usr/local/bin/system_health_check.sh") | crontab -

# 6. 创建故障恢复脚本
echo -e "\n6. 创建故障恢复脚本..." | tee -a $LOG_FILE

cat > /usr/local/bin/emergency_recovery.sh << 'EOF'
#!/bin/bash
# 紧急故障恢复脚本

echo "=== 紧急故障恢复开始 ==="

# 快速服务重启
for service in nginx mysql php-fpm redis; do
    if systemctl is-active --quiet "$service"; then
        echo "重启服务: $service"
        systemctl restart "$service"
    fi
done

# 清理临时文件
find /tmp -type f -mtime +1 -delete

# 重启失败的容器（如果有）
if command -v docker &> /dev/null; then
    docker restart $(docker ps -q) 2>/dev/null
fi

# 清除缓存
sync
echo 3 > /proc/sys/vm/drop_caches

echo "紧急恢复完成"
EOF

chmod +x /usr/local/bin/emergency_recovery.sh

echo -e "\n预防性监控配置完成" | tee -a $LOG_FILE
echo "详细日志: $LOG_FILE" | tee -a $LOG_FILE

9. 总结与经验

通过这次完整的故障排查，我们建立了系统的排查流程：

9.1 排查流程总结

graph TB A[故障报告] --> B[信息收集] B --> C[用户端测试] C --> D[网络层分析] D --> E[服务器层检查] E --> F[应用层诊断] F --> G[数据库分析] G --> H[根因定位] H --> I[解决方案实施] I --> J[验证与监控] J --> K[文档记录] style A fill:#1e3a5f,color:#ffffff style B fill:#4a1e5f,color:#ffffff style C fill:#5f3a1e,color:#ffffff style D fill:#5f3a1e,color:#ffffff style E fill:#5f3a1e,color:#ffffff style F fill:#5f3a1e,color:#ffffff style G fill:#5f3a1e,color:#ffffff style H fill:#5f1e1e,color:#ffffff style I fill:#1e5f3a,color:#ffffff style J fill:#1e5f3a,color:#ffffff style K fill:#1e3a5f,color:#ffffff

9.2 关键经验

系统化排查：从外到内，从简单到复杂
工具准备：提前准备好诊断脚本和工具
监控先行：建立完善的监控体系
文档完整：记录每次故障的排查过程和解决方案
预防为主：通过定期维护预防故障发生