1. 故障背景与现象描述
1.1 故障场景
某电商网站 shop.example.com 在促销活动期间出现以下问题:
- 用户反映网站访问缓慢,页面加载时间超过10秒
- 部分用户无法完成支付流程
- 后台管理系统频繁报错
- 监控系统显示服务器资源使用率异常
1.2 初始信息收集
创建信息收集脚本:initial_info_collection.sh
bash
#!/bin/bash
# 初始信息收集脚本
LOG_FILE="/tmp/troubleshooting_$(date +%Y%m%d_%H%M%S).log"
# 记录开始时间
echo "=== 故障排查开始时间: $(date) ===" | tee -a $LOG_FILE
# 收集系统基本信息
echo -e "\n=== 系统基本信息 ===" | tee -a $LOG_FILE
echo "主机名: $(hostname)" | tee -a $LOG_FILE
echo "系统版本: $(cat /etc/redhat-release 2>/dev/null || cat /etc/os-release 2>/dev/null | grep PRETTY_NAME)" | tee -a $LOG_FILE
echo "内核版本: $(uname -r)" | tee -a $LOG_FILE
echo "当前用户: $(whoami)" | tee -a $LOG_FILE
# 收集时间信息
echo -e "\n=== 时间同步状态 ===" | tee -a $LOG_FILE
timedatectl status 2>/dev/null || date | tee -a $LOG_FILE
chronyc sources 2>/dev/null | head -10 | tee -a $LOG_FILE
# 收集网络信息
echo -e "\n=== 网络连接状态 ===" | tee -a $LOG_FILE
netstat -tulpn | grep -E ":(80|443|3306)" | tee -a $LOG_FILE
# 收集进程信息
echo -e "\n=== 相关进程状态 ===" | tee -a $LOG_FILE
ps aux | grep -E "(nginx|apache|mysql|php|java)" | head -20 | tee -a $LOG_FILE
echo -e "\n初始信息收集完成,详细日志: $LOG_FILE"
2. 用户端访问排查
2.1 模拟用户访问测试
创建访问测试脚本:user_access_test.sh
bash
#!/bin/bash
# 用户端访问测试脚本
DOMAIN="shop.example.com"
TEST_URLS=(
"https://$DOMAIN"
"https://$DOMAIN/products"
"https://$DOMAIN/checkout"
"https://$DOMAIN/api/health"
)
# 设置颜色输出
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
# 测试函数
test_url() {
local url=$1
echo -e "\n测试 URL: $url"
# 测试DNS解析
local domain=$(echo $url | awk -F/ '{print $3}')
echo "DNS解析测试:"
nslookup $domain 2>&1 | grep -A5 "Name:"
# 测试HTTP响应
echo -e "\nHTTP响应测试:"
local start_time=$(date +%s%N)
# 使用curl进行详细测试
local response=$(curl -L -s -w \
"HTTP状态码: %{http_code}\n\
总时间: %{time_total}s\n\
DNS解析时间: %{time_namelookup}s\n\
连接建立时间: %{time_connect}s\n\
SSL握手时间: %{time_appconnect}s\n\
开始传输时间: %{time_starttransfer}s\n\
重定向次数: %{num_redirects}\n\
重定向时间: %{time_redirect}s\n\
下载速度: %{speed_download} bytes/sec\n" \
-o /dev/null \
"$url")
local end_time=$(date +%s%N)
local total_time=$(( (end_time - start_time) / 1000000 ))
echo "$response"
# 判断响应时间
local http_code=$(echo "$response" | grep "HTTP状态码" | awk '{print $2}')
local total_time_sec=$(echo "$response" | grep "总时间" | awk '{print $2}' | sed 's/s//')
if [ "$http_code" -eq 200 ]; then
if (( $(echo "$total_time_sec > 5" | bc -l) )); then
echo -e "${YELLOW}警告: 响应时间过长${NC}"
else
echo -e "${GREEN}正常: 访问成功${NC}"
fi
else
echo -e "${RED}错误: HTTP状态码 $http_code${NC}"
fi
# 测试TCP连接
echo -e "\nTCP端口连接测试:"
local port=443
if echo "$url" | grep -q "http:"; then
port=80
fi
timeout 5 bash -c "echo >/dev/tcp/$domain/$port" 2>/dev/null && \
echo -e "${GREEN}端口 $port 连接成功${NC}" || \
echo -e "${RED}端口 $port 连接失败${NC}"
}
# 执行所有测试
echo "=== 用户端访问测试开始 ==="
for url in "${TEST_URLS[@]}"; do
test_url "$url"
echo "----------------------------------------"
done
# 网络质量测试
echo -e "\n=== 网络质量测试 ==="
ping -c 5 $DOMAIN | tail -3
traceroute -m 15 $DOMAIN 2>/dev/null | head -10
2.2 浏览器端问题排查
创建浏览器诊断脚本:browser_diagnosis.html
html
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<title>网站性能诊断工具</title>
<style>
body {
font-family: Arial, sans-serif;
margin: 40px;
background-color: #1e1e1e;
color: #ffffff;
}
.test-section {
margin: 20px 0;
padding: 15px;
border: 1px solid #444;
border-radius: 5px;
}
button {
padding: 10px 20px;
margin: 5px;
background: #007acc;
color: white;
border: none;
border-radius: 3px;
cursor: pointer;
}
button:hover { background: #005a9e; }
.result {
margin-top: 10px;
padding: 10px;
background: #2d2d30;
border-radius: 3px;
white-space: pre-wrap;
}
.success { color: #4ec9b0; }
.warning { color: #ffcc00; }
.error { color: #f44747; }
</style>
</head>
<body>
<h1>网站性能诊断工具</h1>
<div class="test-section">
<h3>1. 基础连接测试</h3>
<button onclick="testDNS()">测试DNS解析</button>
<button onclick="testPing()">测试网络延迟</button>
<div id="dnsResult" class="result"></div>
</div>
<div class="test-section">
<h3>2. 资源加载测试</h3>
<button onclick="testResourceLoad()">测试资源加载时间</button>
<div id="resourceResult" class="result"></div>
</div>
<div class="test-section">
<h3>3. WebSocket测试</h3>
<button onclick="testWebSocket()">测试WebSocket连接</button>
<div id="websocketResult" class="result"></div>
</div>
<div class="test-section">
<h3>4. 控制台错误检查</h3>
<button onclick="checkConsoleErrors()">检查JavaScript错误</button>
<div id="consoleResult" class="result"></div>
</div>
<script>
// DNS解析测试
async function testDNS() {
const domain = 'shop.example.com';
const resultDiv = document.getElementById('dnsResult');
resultDiv.innerHTML = '测试中...';
try {
const startTime = performance.now();
const response = await fetch(`https://dns.google/resolve?name=${domain}`);
const data = await response.json();
const endTime = performance.now();
let result = `DNS解析时间: ${(endTime - startTime).toFixed(2)}ms\n`;
result += `解析结果: ${JSON.stringify(data.Answer || data.Authority, null, 2)}`;
resultDiv.innerHTML = result;
resultDiv.className = 'result success';
} catch (error) {
resultDiv.innerHTML = `DNS解析失败: ${error.message}`;
resultDiv.className = 'result error';
}
}
// 资源加载测试
function testResourceLoad() {
const resources = [
'https://shop.example.com/static/css/main.css',
'https://shop.example.com/static/js/app.js',
'https://shop.example.com/api/products'
];
const resultDiv = document.getElementById('resourceResult');
resultDiv.innerHTML = '';
resources.forEach(url => {
const startTime = performance.now();
fetch(url, { method: 'HEAD' })
.then(response => {
const endTime = performance.now();
const loadTime = endTime - startTime;
const status = response.status === 200 ? '✅' : '❌';
const line = `${status} ${url}: ${loadTime.toFixed(2)}ms (状态: ${response.status})`;
resultDiv.innerHTML += line + '\n';
if (loadTime > 1000) {
resultDiv.className = 'result warning';
} else if (response.status === 200) {
resultDiv.className = 'result success';
} else {
resultDiv.className = 'result error';
}
})
.catch(error => {
resultDiv.innerHTML += `❌ ${url}: 加载失败 - ${error.message}\n`;
resultDiv.className = 'result error';
});
});
}
// WebSocket测试
function testWebSocket() {
const resultDiv = document.getElementById('websocketResult');
resultDiv.innerHTML = 'WebSocket连接测试中...';
const ws = new WebSocket('wss://shop.example.com/ws');
const startTime = performance.now();
ws.onopen = function() {
const endTime = performance.now();
resultDiv.innerHTML = `✅ WebSocket连接成功\n连接时间: ${(endTime - startTime).toFixed(2)}ms`;
resultDiv.className = 'result success';
ws.close();
};
ws.onerror = function(error) {
resultDiv.innerHTML = `❌ WebSocket连接失败: ${error.message}`;
resultDiv.className = 'result error';
};
setTimeout(() => {
if (ws.readyState !== WebSocket.OPEN) {
resultDiv.innerHTML = '❌ WebSocket连接超时';
resultDiv.className = 'result error';
ws.close();
}
}, 5000);
}
// 控制台错误检查
function checkConsoleErrors() {
const resultDiv = document.getElementById('consoleResult');
// 重写console.error来捕获错误
const originalError = console.error;
const errors = [];
console.error = function(...args) {
errors.push(args.join(' '));
originalError.apply(console, args);
};
// 触发可能的错误
setTimeout(() => {
console.error = originalError;
if (errors.length > 0) {
resultDiv.innerHTML = `发现 ${errors.length} 个错误:\n${errors.join('\n')}`;
resultDiv.className = 'result error';
} else {
resultDiv.innerHTML = '✅ 未发现控制台错误';
resultDiv.className = 'result success';
}
}, 2000);
}
</script>
</body>
</html>
3. 网络层排查
3.1 网络连通性深度测试
创建网络诊断脚本:network_diagnosis.sh
bash
#!/bin/bash
# 网络层深度诊断脚本
DOMAIN="shop.example.com"
LOG_FILE="/tmp/network_diagnosis_$(date +%Y%m%d_%H%M%S).log"
echo "=== 网络层深度诊断 ===" | tee -a $LOG_FILE
# 1. 基础网络测试
echo -e "\n1. 基础网络连通性测试:" | tee -a $LOG_FILE
ping -c 10 -i 0.2 $DOMAIN | tee -a $LOG_FILE
# 2. MTU路径发现
echo -e "\n2. MTU路径发现:" | tee -a $LOG_FILE
for mtu in 1500 1400 1300 1200 1100 1000; do
echo "测试MTU: $mtu" | tee -a $LOG_FILE
ping -c 2 -M do -s $((mtu-28)) $DOMAIN 2>&1 | grep -E "bytes from|Packet" | tee -a $LOG_FILE
done
# 3. 路由跟踪
echo -e "\n3. 详细路由跟踪:" | tee -a $LOG_FILE
traceroute -w 2 -q 2 -m 20 $DOMAIN 2>/dev/null | tee -a $LOG_FILE
# 4. 端口扫描
echo -e "\n4. 关键端口扫描:" | tee -a $LOG_FILE
ports=(80 443 3306 6379 22)
for port in "${ports[@]}"; do
timeout 3 bash -c "echo >/dev/tcp/$DOMAIN/$port" 2>/dev/null && \
echo "端口 $port: 开放" | tee -a $LOG_FILE || \
echo "端口 $port: 关闭" | tee -a $LOG_FILE
done
# 5. SSL证书检查
echo -e "\n5. SSL证书检查:" | tee -a $LOG_FILE
openssl s_client -connect $DOMAIN:443 -servername $DOMAIN < /dev/null 2>/dev/null | \
openssl x509 -noout -dates | tee -a $LOG_FILE
# 6. HTTP/2支持检查
echo -e "\n6. HTTP/2支持检查:" | tee -a $LOG_FILE
curl -I --http2 -s https://$DOMAIN | grep -i "HTTP/" | tee -a $LOG_FILE
# 7. DNS解析详细检查
echo -e "\n7. DNS解析详细检查:" | tee -a $LOG_FILE
echo "A记录:" | tee -a $LOG_FILE
dig A $DOMAIN +short | tee -a $LOG_FILE
echo "AAAA记录:" | tee -a $LOG_FILE
dig AAAA $DOMAIN +short | tee -a $LOG_FILE
echo "CNAME记录:" | tee -a $LOG_FILE
dig CNAME $DOMAIN +short | tee -a $LOG_FILE
# 8. 网络性能测试
echo -e "\n8. 网络性能测试:" | tee -a $LOG_FILE
# 下载速度测试(小文件)
curl -o /dev/null -w "下载速度: %{speed_download} bytes/sec\n" \
https://$DOMAIN/static/images/logo.png 2>&1 | tee -a $LOG_FILE
# 9. 连接数统计
echo -e "\n9. 当前连接数统计:" | tee -a $LOG_FILE
netstat -an | grep -E ":(80|443)" | grep ESTABLISHED | wc -l | \
awk '{print "ESTABLISHED连接数:", $1}' | tee -a $LOG_FILE
echo -e "\n网络诊断完成,详细日志: $LOG_FILE"
3.2 网络流量分析
创建流量分析脚本:traffic_analysis.sh
bash
#!/bin/bash
# 网络流量分析脚本
INTERFACE="eth0"
DURATION=60
LOG_FILE="/tmp/traffic_analysis_$(date +%Y%m%d_%H%M%S).log"
echo "=== 网络流量分析开始 ===" | tee -a $LOG_FILE
echo "监控网卡: $INTERFACE" | tee -a $LOG_FILE
echo "监控时长: ${DURATION}秒" | tee -a $LOG_FILE
# 1. 使用tcpdump捕获流量
echo -e "\n1. 开始捕获HTTP/HTTPS流量..." | tee -a $LOG_FILE
tcpdump -i $INTERFACE -w /tmp/traffic_capture.pcap -c 1000 port 80 or port 443 2>/dev/null &
TCPDUMP_PID=$!
# 2. 实时连接监控
echo -e "\n2. 实时连接监控:" | tee -a $LOG_FILE
for i in $(seq 1 $DURATION); do
echo "--- 第 $i 秒 ---" >> $LOG_FILE
netstat -tn | grep -E ":(80|443)" | grep ESTABLISHED | \
awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -5 >> $LOG_FILE
sleep 1
done
# 停止tcpdump
kill $TCPDUMP_PID 2>/dev/null
wait $TCPDUMP_PID 2>/dev/null
# 3. 分析捕获的流量
echo -e "\n3. 流量分析结果:" | tee -a $LOG_FILE
if [ -f /tmp/traffic_capture.pcap ]; then
# 统计TOP IP
echo "TOP来源IP:" | tee -a $LOG_FILE
tcpdump -nn -r /tmp/traffic_capture.pcap 2>/dev/null | \
awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head -10 | tee -a $LOG_FILE
# 统计协议分布
echo -e "\n协议分布:" | tee -a $LOG_FILE
tcpdump -nn -r /tmp/traffic_capture.pcap 2>/dev/null | \
awk '{print $5}' | cut -d. -f5 | sort | uniq -c | sort -rn | tee -a $LOG_FILE
# 清理临时文件
rm -f /tmp/traffic_capture.pcap
fi
# 4. 检查网络错误
echo -e "\n4. 网络错误统计:" | tee -a $LOG_FILE
netstat -i | grep $INTERFACE | tee -a $LOG_FILE
echo -e "\n流量分析完成,详细日志: $LOG_FILE"
4. 服务器层排查
4.1 系统资源监控
创建系统监控脚本:system_monitoring.sh
bash
#!/bin/bash
# 系统资源监控脚本
DURATION=300 # 监控5分钟
INTERVAL=5
LOG_FILE="/tmp/system_monitoring_$(date +%Y%m%d_%H%M%S).log"
echo "=== 系统资源监控开始 ===" | tee -a $LOG_FILE
echo "监控时长: ${DURATION}秒" | tee -a $LOG_FILE
echo "采样间隔: ${INTERVAL}秒" | tee -a $LOG_FILE
# 监控函数
monitor_system() {
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo -e "\n--- $timestamp ---" >> $LOG_FILE
# CPU使用率
echo "CPU使用率:" >> $LOG_FILE
mpstat 1 1 | grep -A5 "Average" >> $LOG_FILE
# 内存使用
echo -e "\n内存使用:" >> $LOG_FILE
free -h >> $LOG_FILE
# 磁盘IO
echo -e "\n磁盘IO:" >> $LOG_FILE
iostat -x 1 1 | grep -A10 "Device" >> $LOG_FILE
# 负载情况
echo -e "\n系统负载:" >> $LOG_FILE
uptime >> $LOG_FILE
# 进程资源使用TOP10
echo -e "\n进程资源使用TOP10:" >> $LOG_FILE
ps aux --sort=-%cpu | head -11 >> $LOG_FILE
# 网络连接数
echo -e "\n网络连接统计:" >> $LOG_FILE
netstat -ant | awk '{print $6}' | sort | uniq -c | sort -rn >> $LOG_FILE
}
# 开始监控
END_TIME=$((SECONDS + DURATION))
while [ $SECONDS -lt $END_TIME ]; do
monitor_system
sleep $INTERVAL
done
# 生成总结报告
echo -e "\n=== 监控总结报告 ===" >> $LOG_FILE
# 检查是否有异常
echo -e "\n异常检查:" >> $LOG_FILE
# 检查负载
LOAD_AVG=$(uptime | awk -F'load average:' '{print $2}' | awk -F, '{print $1}' | tr -d ' ')
if (( $(echo "$LOAD_AVG > 10" | bc -l) )); then
echo "警告: 系统负载过高: $LOAD_AVG" >> $LOG_FILE
fi
# 检查内存使用
MEM_FREE=$(free -m | grep Mem | awk '{print $7}')
if [ $MEM_FREE -lt 100 ]; then
echo "警告: 可用内存不足: ${MEM_FREE}MB" >> $LOG_FILE
fi
# 检查磁盘空间
DISK_USAGE=$(df / | awk 'END{print $5}' | sed 's/%//')
if [ $DISK_USAGE -gt 90 ]; then
echo "警告: 根分区使用率过高: ${DISK_USAGE}%" >> $LOG_FILE
fi
echo -e "\n监控完成,详细日志: $LOG_FILE"
4.2 Web服务器排查
创建Web服务器诊断脚本:webserver_diagnosis.sh
bash
#!/bin/bash
# Web服务器诊断脚本
LOG_FILE="/tmp/webserver_diagnosis_$(date +%Y%m%d_%H%M%S).log"
echo "=== Web服务器诊断开始 ===" | tee -a $LOG_FILE
# 检测Web服务器类型
detect_webserver() {
if systemctl is-active --quiet nginx; then
echo "nginx"
elif systemctl is-active --quiet apache2; then
echo "apache"
elif systemctl is-active --quiet httpd; then
echo "httpd"
else
echo "unknown"
fi
}
WEBSERVER=$(detect_webserver)
echo "检测到Web服务器: $WEBSERVER" | tee -a $LOG_FILE
# 通用检查
echo -e "\n1. Web服务状态:" | tee -a $LOG_FILE
systemctl status $WEBSERVER --no-pager | tee -a $LOG_FILE
echo -e "\n2. 监听端口:" | tee -a $LOG_FILE
netstat -tulpn | grep -E ":80|:443" | tee -a $LOG_FILE
echo -e "\n3. 活动连接数:" | tee -a $LOG_FILE
netstat -an | grep -E ":(80|443)" | grep ESTABLISHED | wc -l | tee -a $LOG_FILE
# 服务器特定检查
case $WEBSERVER in
"nginx")
echo -e "\n4. Nginx配置检查:" | tee -a $LOG_FILE
nginx -t 2>&1 | tee -a $LOG_FILE
echo -e "\n5. Nginx状态信息:" | tee -a $LOG_FILE
# 检查是否有status模块
curl -s http://localhost/nginx_status 2>/dev/null | tee -a $LOG_FILE
echo -e "\n6. Nginx进程信息:" | tee -a $LOG_FILE
ps aux | grep nginx | grep -v grep | tee -a $LOG_FILE
echo -e "\n7. Nginx错误日志检查:" | tee -a $LOG_FILE
tail -50 /var/log/nginx/error.log 2>/dev/null | tee -a $LOG_FILE
;;
"apache"|"httpd")
echo -e "\n4. Apache配置检查:" | tee -a $LOG_FILE
apache2ctl -t 2>/dev/null || httpd -t 2>/dev/null | tee -a $LOG_FILE
echo -e "\n5. Apache状态信息:" | tee -a $LOG_FILE
curl -s http://localhost/server-status 2>/dev/null | head -20 | tee -a $LOG_FILE
echo -e "\n6. Apache进程信息:" | tee -a $LOG_FILE
ps aux | grep -E "apache2|httpd" | grep -v grep | tee -a $LOG_FILE
echo -e "\n7. Apache错误日志检查:" | tee -a $LOG_FILE
tail -50 /var/log/apache2/error.log 2>/dev/null || \
tail -50 /var/log/httpd/error_log 2>/dev/null | tee -a $LOG_FILE
;;
*)
echo "未知的Web服务器" | tee -a $LOG_FILE
;;
esac
# 检查虚拟主机配置
echo -e "\n8. 虚拟主机配置:" | tee -a $LOG_FILE
if [ "$WEBSERVER" = "nginx" ]; then
find /etc/nginx -name "*.conf" -exec grep -l "shop.example.com" {} \; 2>/dev/null | tee -a $LOG_FILE
elif [ "$WEBSERVER" = "apache" ] || [ "$WEBSERVER" = "httpd" ]; then
find /etc/apache2 -name "*.conf" -exec grep -l "shop.example.com" {} \; 2>/dev/null | \
find /etc/httpd -name "*.conf" -exec grep -l "shop.example.com" {} \; 2>/dev/null | tee -a $LOG_FILE
fi
# SSL证书检查
echo -e "\n9. SSL证书检查:" | tee -a $LOG_FILE
openssl s_client -connect shop.example.com:443 -servername shop.example.com < /dev/null 2>/dev/null | \
openssl x509 -noout -text | grep -A5 "Validity" | tee -a $LOG_FILE
echo -e "\nWeb服务器诊断完成,详细日志: $LOG_FILE"
5. 应用层排查
5.1 应用程序日志分析
创建应用日志分析脚本:application_log_analysis.sh
bash
#!/bin/bash
# 应用程序日志分析脚本
LOG_DIRS=("/var/log/nginx" "/var/log/apache2" "/var/log/httpd" "/var/www/logs" "/app/logs")
DATE_RANGE="1 hour ago"
LOG_FILE="/tmp/app_log_analysis_$(date +%Y%m%d_%H%M%S).log"
echo "=== 应用程序日志分析开始 ===" | tee -a $LOG_FILE
# 查找应用日志文件
find_log_files() {
local log_files=()
for dir in "${LOG_DIRS[@]}"; do
if [ -d "$dir" ]; then
while IFS= read -r file; do
log_files+=("$file")
done < <(find "$dir" -name "*.log" -type f -mtime -1 2>/dev/null)
fi
done
printf '%s\n' "${log_files[@]}"
}
LOG_FILES=($(find_log_files))
echo "发现的日志文件:" | tee -a $LOG_FILE
printf '%s\n' "${LOG_FILES[@]}" | tee -a $LOG_FILE
# 分析每个日志文件
for log_file in "${LOG_FILES[@]}"; do
echo -e "\n=== 分析日志文件: $log_file ===" | tee -a $LOG_FILE
# 检查文件大小
file_size=$(du -h "$log_file" | cut -f1)
echo "文件大小: $file_size" | tee -a $LOG_FILE
# 错误统计
echo -e "\n错误统计 (最近1小时):" | tee -a $LOG_FILE
if [[ "$log_file" == *"error"* ]] || [[ "$log_file" == *"Error"* ]]; then
# 错误日志文件
grep "$(date -d "$DATE_RANGE" '+%Y-%m-%d %H:')" "$log_file" 2>/dev/null | \
awk '{print $1, $2, $3}' | sort | uniq -c | sort -rn | head -10 | tee -a $LOG_FILE
else
# 访问日志 - 查找错误状态码
grep "$(date -d "$DATE_RANGE" '+%Y-%m-%d %H:')" "$log_file" 2>/dev/null | \
awk '$9 >= 400 {print $9}' | sort | uniq -c | sort -rn | head -10 | tee -a $LOG_FILE
fi
# 慢请求分析(针对访问日志)
if [[ "$log_file" == *"access"* ]]; then
echo -e "\n慢请求分析 (响应时间 > 5秒):" | tee -a $LOG_FILE
# 假设日志格式包含响应时间(需要根据实际日志格式调整)
grep "$(date -d "$DATE_RANGE" '+%Y-%m-%d %H:')" "$log_file" 2>/dev/null | \
awk '$NF > 5 {print $7, $NF "s"}' | sort -k2 -rn | head -10 | tee -a $LOG_FILE
fi
# 频繁访问IP
echo -e "\n频繁访问IP TOP10:" | tee -a $LOG_FILE
grep "$(date -d "$DATE_RANGE" '+%Y-%m-%d %H:')" "$log_file" 2>/dev/null | \
awk '{print $1}' | sort | uniq -c | sort -rn | head -10 | tee -a $LOG_FILE
# 热门URL
echo -e "\n热门URL TOP10:" | tee -a $LOG_FILE
grep "$(date -d "$DATE_RANGE" '+%Y-%m-%d %H:')" "$log_file" 2>/dev/null | \
awk '{print $7}' | sort | uniq -c | sort -rn | head -10 | tee -a $LOG_FILE
done
# 检查应用进程
echo -e "\n=== 应用进程状态 ===" | tee -a $LOG_FILE
ps aux | grep -E "(php|java|python|node)" | grep -v grep | head -20 | tee -a $LOG_FILE
# 检查应用端口
echo -e "\n=== 应用端口监听 ===" | tee -a $LOG_FILE
netstat -tulpn | grep -E ":(8080|3000|9000)" | tee -a $LOG_FILE
echo -e "\n应用程序日志分析完成,详细日志: $LOG_FILE"
5.2 应用性能监控
创建应用性能监控脚本:application_performance.sh
bash
#!/bin/bash
# 应用性能监控脚本
LOG_FILE="/tmp/app_performance_$(date +%Y%m%d_%H%M%S).log"
echo "=== 应用性能监控开始 ===" | tee -a $LOG_FILE
# 1. 检查应用响应时间
echo -e "\n1. 应用接口响应时间测试:" | tee -a $LOG_FILE
API_ENDPOINTS=(
"https://shop.example.com/api/health"
"https://shop.example.com/api/products"
"https://shop.example.com/api/user/profile"
"https://shop.example.com/api/order/checkout"
)
for endpoint in "${API_ENDPOINTS[@]}"; do
echo -n "测试 $endpoint ... " | tee -a $LOG_FILE
response_time=$(curl -o /dev/null -s -w "%{time_total}s\n" "$endpoint")
echo "响应时间: $response_time" | tee -a $LOG_FILE
done
# 2. 数据库连接测试
echo -e "\n2. 数据库连接测试:" | tee -a $LOG_FILE
# 假设有数据库连接测试接口
curl -s "https://shop.example.com/api/db/health" | jq '.' 2>/dev/null | tee -a $LOG_FILE
# 3. 缓存状态检查
echo -e "\n3. 缓存状态检查:" | tee -a $LOG_FILE
curl -s "https://shop.example.com/api/cache/status" | jq '.' 2>/dev/null | tee -a $LOG_FILE
# 4. 外部服务依赖检查
echo -e "\n4. 外部服务依赖检查:" | tee -a $LOG_FILE
EXTERNAL_SERVICES=(
"payment_gateway"
"email_service"
"sms_service"
"cdn_service"
)
for service in "${EXTERNAL_SERVICES[@]}"; do
echo -n "检查 $service ... " | tee -a $LOG_FILE
# 这里应该是实际的健康检查端点
if curl -s "https://shop.example.com/api/dependencies/$service" | grep -q "healthy"; then
echo "正常" | tee -a $LOG_FILE
else
echo "异常" | tee -a $LOG_FILE
fi
done
# 5. 应用指标收集
echo -e "\n5. 应用指标收集:" | tee -a $LOG_FILE
APPLICATION_METRICS=(
"active_sessions"
"memory_usage"
"request_queue"
"error_rate"
)
for metric in "${APPLICATION_METRICS[@]}"; do
value=$(curl -s "https://shop.example.com/api/metrics/$metric" 2>/dev/null)
echo "$metric: $value" | tee -a $LOG_FILE
done
# 6. 垃圾回收统计(针对JVM应用)
echo -e "\n6. JVM应用状态检查:" | tee -a $LOG_FILE
JAVA_PID=$(pgrep -f java | head -1)
if [ -n "$JAVA_PID" ]; then
echo "Java进程PID: $JAVA_PID" | tee -a $LOG_FILE
jstat -gc $JAVA_PID 2>/dev/null | head -2 | tee -a $LOG_FILE
fi
echo -e "\n应用性能监控完成,详细日志: $LOG_FILE"
6. 数据库层排查
6.1 数据库性能分析
创建数据库诊断脚本:database_diagnosis.sh
bash
#!/bin/bash
# 数据库性能诊断脚本
DB_TYPE="mysql" # 可以是mysql或postgresql
LOG_FILE="/tmp/database_diagnosis_$(date +%Y%m%d_%H%M%S).log"
echo "=== 数据库性能诊断开始 ===" | tee -a $LOG_FILE
# 数据库连接测试
test_db_connection() {
case $DB_TYPE in
"mysql")
mysql -h localhost -u root -p$DB_PASSWORD -e "SELECT 1 as connection_test;" 2>/dev/null
;;
"postgresql")
psql -h localhost -U postgres -c "SELECT 1 as connection_test;" 2>/dev/null
;;
esac
}
echo "数据库连接测试:" | tee -a $LOG_FILE
if test_db_connection; then
echo "✅ 数据库连接正常" | tee -a $LOG_FILE
else
echo "❌ 数据库连接失败" | tee -a $LOG_FILE
exit 1
fi
# MySQL特定诊断
if [ "$DB_TYPE" = "mysql" ]; then
echo -e "\n1. MySQL状态检查:" | tee -a $LOG_FILE
mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW GLOBAL STATUS;" 2>/dev/null | \
grep -E "(Threads_connected|Threads_running|Queries|Slow_queries|Aborted_connects)" | tee -a $LOG_FILE
echo -e "\n2. MySQL变量检查:" | tee -a $LOG_FILE
mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW GLOBAL VARIABLES;" 2>/dev/null | \
grep -E "(max_connections|innodb_buffer_pool_size|query_cache_size|tmp_table_size)" | tee -a $LOG_FILE
echo -e "\n3. 进程列表:" | tee -a $LOG_FILE
mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW PROCESSLIST;" 2>/dev/null | head -20 | tee -a $LOG_FILE
echo -e "\n4. 表锁定状态:" | tee -a $LOG_FILE
mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW ENGINE INNODB STATUS\G" 2>/dev/null | \
grep -A10 "LATEST DETECTED DEADLOCK" | tee -a $LOG_FILE
echo -e "\n5. 慢查询日志检查:" | tee -a $LOG_FILE
mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW VARIABLES LIKE 'slow_query_log%';" 2>/dev/null | tee -a $LOG_FILE
mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW VARIABLES LIKE 'long_query_time';" 2>/dev/null | tee -a $LOG_FILE
echo -e "\n6. 数据库大小统计:" | tee -a $LOG_FILE
mysql -h localhost -u root -p$DB_PASSWORD -e "
SELECT table_schema as 'Database',
ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) as 'Size (MB)'
FROM information_schema.tables
GROUP BY table_schema;" 2>/dev/null | tee -a $LOG_FILE
echo -e "\n7. 索引统计:" | tee -a $LOG_FILE
mysql -h localhost -u root -p$DB_PASSWORD -e "
SELECT TABLE_NAME, INDEX_NAME, COLUMN_NAME, SEQ_IN_INDEX
FROM information_schema.STATISTICS
WHERE TABLE_SCHEMA = 'shop_db'
ORDER BY TABLE_NAME, INDEX_NAME, SEQ_IN_INDEX;" 2>/dev/null | head -20 | tee -a $LOG_FILE
# PostgreSQL特定诊断
elif [ "$DB_TYPE" = "postgresql" ]; then
echo -e "\n1. PostgreSQL连接统计:" | tee -a $LOG_FILE
psql -h localhost -U postgres -c "
SELECT datname, numbackends, xact_commit, xact_rollback
FROM pg_stat_database
WHERE datname = 'shop_db';" 2>/dev/null | tee -a $LOG_FILE
echo -e "\n2. 活动查询:" | tee -a $LOG_FILE
psql -h localhost -U postgres -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';" 2>/dev/null | tee -a $LOG_FILE
echo -e "\n3. 表大小统计:" | tee -a $LOG_FILE
psql -h localhost -U postgres -c "
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;" 2>/dev/null | tee -a $LOG_FILE
fi
# 通用数据库检查
echo -e "\n8. 数据库连接数监控:" | tee -a $LOG_FILE
case $DB_TYPE in
"mysql")
mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW STATUS LIKE 'Threads_connected';" 2>/dev/null | tee -a $LOG_FILE
;;
"postgresql")
psql -h localhost -U postgres -c "SELECT count(*) FROM pg_stat_activity;" 2>/dev/null | tee -a $LOG_FILE
;;
esac
echo -e "\n9. 数据库性能计数器:" | tee -a $LOG_FILE
case $DB_TYPE in
"mysql")
mysql -h localhost -u root -p$DB_PASSWORD -e "SHOW STATUS LIKE 'Innodb_buffer_pool_read%';" 2>/dev/null | tee -a $LOG_FILE
;;
"postgresql")
psql -h localhost -U postgres -c "SELECT * FROM pg_stat_bgwriter;" 2>/dev/null | tee -a $LOG_FILE
;;
esac
echo -e "\n数据库诊断完成,详细日志: $LOG_FILE"
6.2 慢查询分析
创建慢查询分析脚本:slow_query_analysis.sh
bash
#!/bin/bash
# 慢查询分析脚本
LOG_FILE="/tmp/slow_query_analysis_$(date +%Y%m%d_%H%M%S).log"
echo "=== 慢查询分析开始 ===" | tee -a $LOG_FILE
# 查找慢查询日志
find_slow_logs() {
find /var/lib/mysql /var/log/mysql /var/log -name "*slow*" -type f 2>/dev/null
}
SLOW_LOGS=($(find_slow_logs))
if [ ${#SLOW_LOGS[@]} -eq 0 ]; then
echo "未找到慢查询日志文件" | tee -a $LOG_FILE
exit 1
fi
echo "发现的慢查询日志文件:" | tee -a $LOG_FILE
printf '%s\n' "${SLOW_LOGS[@]}" | tee -a $LOG_FILE
for slow_log in "${SLOW_LOGS[@]}"; do
echo -e "\n=== 分析慢查询日志: $slow_log ===" | tee -a $LOG_FILE
# 检查文件是否可读
if [ ! -r "$slow_log" ]; then
echo "无法读取文件: $slow_log" | tee -a $LOG_FILE
continue
fi
# 分析慢查询
echo -e "\n1. 慢查询统计 (最近24小时):" | tee -a $LOG_FILE
grep "$(date -d '24 hours ago' '+%Y-%m-%d')" "$slow_log" 2>/dev/null | \
grep "Query_time" | wc -l | awk '{print "慢查询数量:", $1}' | tee -a $LOG_FILE
echo -e "\n2. 最慢的10个查询:" | tee -a $LOG_FILE
grep -A5 "Query_time" "$slow_log" 2>/dev/null | \
awk '/Query_time/ {time=$3} /SET timestamp/ {timestamp=$3} /# Time:/ {print "时间:", timestamp, "查询时间:", time "s"}' | \
sort -k4 -rn | head -10 | tee -a $LOG_FILE
echo -e "\n3. 按模式分组的慢查询:" | tee -a $LOG_FILE
# 提取SQL语句的前50个字符作为模式
grep -A2 "Query_time" "$slow_log" 2>/dev/null | \
grep -E "^(SELECT|UPDATE|DELETE|INSERT)" | \
cut -c1-50 | sort | uniq -c | sort -rn | head -10 | tee -a $LOG_FILE
echo -e "\n4. 按小时分布的慢查询:" | tee -a $LOG_FILE
grep "Query_time" "$slow_log" 2>/dev/null | \
awk '{print $1, $2}' | cut -d: -f1 | sort | uniq -c | tee -a $LOG_FILE
done
# 实时监控慢查询
echo -e "\n=== 实时慢查询监控 (60秒) ===" | tee -a $LOG_FILE
for i in {1..12}; do
echo "--- 第 $((i*5)) 秒 ---" | tee -a $LOG_FILE
case $DB_TYPE in
"mysql")
mysql -h localhost -u root -p$DB_PASSWORD -e "
SHOW FULL PROCESSLIST;
" 2>/dev/null | grep -v "Sleep" | tee -a $LOG_FILE
;;
"postgresql")
psql -h localhost -U postgres -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
AND state = 'active';
" 2>/dev/null | tee -a $LOG_FILE
;;
esac
sleep 5
done
echo -e "\n慢查询分析完成,详细日志: $LOG_FILE"
7. 故障排查流程图
graph TD
A[用户报告网站访问慢] --> B{用户端访问测试}
B --> C{DNS解析正常?}
C -->|是| D{网络连通性正常?}
C -->|否| E[DNS配置问题]
D -->|是| F{服务器响应正常?}
D -->|否| G[网络路由问题]
F -->|是| H{应用服务正常?}
F -->|否| I[服务器负载问题]
H -->|是| J{数据库响应正常?}
H -->|否| K[应用代码问题]
J -->|是| L[缓存/外部服务问题]
J -->|否| M[数据库性能问题]
E --> N[修复DNS配置]
G --> O[检查网络设备]
I --> P[优化服务器配置]
K --> Q[修复应用代码]
M --> R[优化数据库]
L --> S[检查缓存服务]
N --> T[故障解决]
O --> T
P --> T
Q --> T
R --> T
S --> T
style A fill:#1e3a5f,color:#ffffff
style B fill:#4a1e5f,color:#ffffff
style C fill:#5f3a1e,color:#ffffff
style D fill:#5f3a1e,color:#ffffff
style E fill:#5f1e1e,color:#ffffff
style F fill:#5f3a1e,color:#ffffff
style G fill:#5f1e1e,color:#ffffff
style H fill:#5f3a1e,color:#ffffff
style I fill:#5f1e1e,color:#ffffff
style J fill:#5f3a1e,color:#ffffff
style K fill:#5f1e1e,color:#ffffff
style L fill:#5f1e1e,color:#ffffff
style M fill:#5f1e1e,color:#ffffff
style N fill:#1e5f3a,color:#ffffff
style O fill:#1e5f3a,color:#ffffff
style P fill:#1e5f3a,color:#ffffff
style Q fill:#1e5f3a,color:#ffffff
style R fill:#1e5f3a,color:#ffffff
style S fill:#1e5f3a,color:#ffffff
style T fill:#1e3a5f,color:#ffffff
8. 故障解决方案与预防
8.1 创建综合修复脚本
创建综合修复脚本:comprehensive_fix.sh
bash
#!/bin/bash
# 综合故障修复脚本
LOG_FILE="/tmp/comprehensive_fix_$(date +%Y%m%d_%H%M%S).log"
echo "=== 综合故障修复开始 ===" | tee -a $LOG_FILE
# 1. 清理系统资源
echo -e "\n1. 清理系统资源..." | tee -a $LOG_FILE
# 清理临时文件
find /tmp -type f -atime +1 -delete 2>/dev/null
echo "临时文件清理完成" | tee -a $LOG_FILE
# 清理日志文件
find /var/log -name "*.log" -type f -size +1G -exec truncate -s 0 {} \; 2>/dev/null
echo "大日志文件清理完成" | tee -a $LOG_FILE
# 2. 优化系统配置
echo -e "\n2. 优化系统配置..." | tee -a $LOG_FILE
# 调整文件描述符限制
echo "* soft nofile 65536" >> /etc/security/limits.conf
echo "* hard nofile 65536" >> /etc/security/limits.conf
echo "文件描述符限制已调整" | tee -a $LOG_FILE
# 3. 重启服务(按依赖顺序)
echo -e "\n3. 重启相关服务..." | tee -a $LOG_FILE
services=("mysql" "redis" "nginx" "php-fpm" "application-service")
for service in "${services[@]}"; do
if systemctl is-active --quiet "$service"; then
echo "重启服务: $service" | tee -a $LOG_FILE
systemctl restart "$service"
sleep 5
# 验证服务状态
if systemctl is-active --quiet "$service"; then
echo "✅ $service 重启成功" | tee -a $LOG_FILE
else
echo "❌ $service 重启失败" | tee -a $LOG_FILE
fi
fi
done
# 4. 数据库优化
echo -e "\n4. 数据库优化..." | tee -a $LOG_FILE
# MySQL优化
if command -v mysql &> /dev/null; then
mysql -h localhost -u root -p$DB_PASSWORD -e "
FLUSH TABLES;
FLUSH QUERY CACHE;
RESET QUERY CACHE;
" 2>/dev/null && echo "数据库缓存已刷新" | tee -a $LOG_FILE
fi
# 5. 应用缓存清理
echo -e "\n5. 清理应用缓存..." | tee -a $LOG_FILE
# 清理PHP OPcache
if command -v php &> /dev/null; then
php -r "opcache_reset();" 2>/dev/null && echo "PHP OPcache已清理" | tee -a $LOG_FILE
fi
# 清理Redis缓存
if command -v redis-cli &> /dev/null; then
redis-cli FLUSHALL 2>/dev/null && echo "Redis缓存已清理" | tee -a $LOG_FILE
fi
# 6. 监控配置检查
echo -e "\n6. 配置监控告警..." | tee -a $LOG_FILE
# 创建基础监控脚本
cat > /usr/local/bin/system_health_check.sh << 'EOF'
#!/bin/bash
# 系统健康检查脚本
# 检查条件
check_load() {
local load=$(awk '{print $1}' /proc/loadavg)
local cores=$(nproc)
if (( $(echo "$load > $cores * 2" | bc -l) )); then
echo "HIGH_LOAD: $load"
return 1
fi
return 0
}
check_memory() {
local mem_free=$(free -m | awk 'NR==2{print $4}')
if [ $mem_free -lt 100 ]; then
echo "LOW_MEMORY: ${mem_free}MB"
return 1
fi
return 0
}
check_disk() {
local disk_usage=$(df / | awk 'NR==2{print $5}' | sed 's/%//')
if [ $disk_usage -gt 90 ]; then
echo "HIGH_DISK_USAGE: ${disk_usage}%"
return 1
fi
return 0
}
# 执行检查
check_load
check_memory
check_disk
EOF
chmod +x /usr/local/bin/system_health_check.sh
# 7. 创建定期维护任务
echo -e "\n7. 配置定期维护..." | tee -a $LOG_FILE
# 添加定时维护任务
(crontab -l 2>/dev/null; echo "0 2 * * * /usr/local/bin/system_health_check.sh >> /var/log/health_check.log") | crontab -
(crontab -l 2>/dev/null; echo "0 4 * * 0 /usr/local/bin/cleanup_script.sh") | crontab -
echo -e "\n综合故障修复完成" | tee -a $LOG_FILE
echo "详细日志: $LOG_FILE" | tee -a $LOG_FILE
8.2 创建预防性监控
创建预防性监控脚本:preventive_monitoring.sh
bash
#!/bin/bash
# 预防性监控配置脚本
LOG_FILE="/tmp/preventive_monitoring_$(date +%Y%m%d_%H%M%S).log"
echo "=== 预防性监控配置开始 ===" | tee -a $LOG_FILE
# 1. 安装监控工具
echo -e "\n1. 安装监控工具..." | tee -a $LOG_FILE
if command -v apt-get &> /dev/null; then
# Ubuntu/Debian
apt-get update
apt-get install -y htop iotop nethogs sysstat dstat
elif command -v yum &> /dev/null; then
# CentOS/RHEL
yum install -y htop iotop nethogs sysstat dstat
fi
# 2. 配置系统监控
echo -e "\n2. 配置系统监控..." | tee -a $LOG_FILE
# 启用sysstat数据收集
sed -i 's/ENABLED=\"false\"/ENABLED=\"true\"/' /etc/default/sysstat
systemctl enable sysstat
systemctl start sysstat
# 3. 创建自定义监控脚本
echo -e "\n3. 创建自定义监控脚本..." | tee -a $LOG_FILE
cat > /usr/local/bin/website_health_check.sh << 'EOF'
#!/bin/bash
# 网站健康检查脚本
BASE_URL="https://shop.example.com"
ALERT_EMAIL="admin@example.com"
LOG_FILE="/var/log/website_health.log"
# 检查函数
check_http_response() {
local url=$1
local response=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 10 "$url")
if [ "$response" -ne 200 ]; then
echo "HTTP_ERROR: $url returned $response"
return 1
fi
return 0
}
check_response_time() {
local url=$1
local threshold=$2
local response_time=$(curl -s -o /dev/null -w "%{time_total}" --connect-timeout 10 "$url")
if (( $(echo "$response_time > $threshold" | bc -l) )); then
echo "SLOW_RESPONSE: $url took ${response_time}s"
return 1
fi
return 0
}
check_database() {
# 数据库连接检查
if ! mysql -h localhost -u root -p$DB_PASSWORD -e "SELECT 1;" &> /dev/null; then
echo "DATABASE_ERROR: Cannot connect to database"
return 1
fi
return 0
}
# 执行检查
timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$timestamp] Starting health check..." >> $LOG_FILE
errors=0
check_http_response "$BASE_URL" || ((errors++))
check_response_time "$BASE_URL" 3 || ((errors++))
check_http_response "$BASE_URL/api/health" || ((errors++))
check_response_time "$BASE_URL/api/health" 2 || ((errors++))
check_database || ((errors++))
if [ $errors -gt 0 ]; then
echo "[$timestamp] Health check failed with $errors errors" >> $LOG_FILE
# 发送告警邮件
echo "Health check failed at $timestamp. Check $LOG_FILE for details." | mail -s "Website Health Alert" $ALERT_EMAIL
else
echo "[$timestamp] Health check passed" >> $LOG_FILE
fi
EOF
chmod +x /usr/local/bin/website_health_check.sh
# 4. 配置日志轮转
echo -e "\n4. 配置日志轮转..." | tee -a $LOG_FILE
cat > /etc/logrotate.d/website_monitoring << 'EOF'
/var/log/website_health.log {
daily
rotate 30
compress
delaycompress
missingok
notifempty
create 644 root root
}
EOF
# 5. 设置监控定时任务
echo -e "\n5. 设置监控定时任务..." | tee -a $LOG_FILE
(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/local/bin/website_health_check.sh") | crontab -
(crontab -l 2>/dev/null; echo "0 * * * * /usr/local/bin/system_health_check.sh") | crontab -
# 6. 创建故障恢复脚本
echo -e "\n6. 创建故障恢复脚本..." | tee -a $LOG_FILE
cat > /usr/local/bin/emergency_recovery.sh << 'EOF'
#!/bin/bash
# 紧急故障恢复脚本
echo "=== 紧急故障恢复开始 ==="
# 快速服务重启
for service in nginx mysql php-fpm redis; do
if systemctl is-active --quiet "$service"; then
echo "重启服务: $service"
systemctl restart "$service"
fi
done
# 清理临时文件
find /tmp -type f -mtime +1 -delete
# 重启失败的容器(如果有)
if command -v docker &> /dev/null; then
docker restart $(docker ps -q) 2>/dev/null
fi
# 清除缓存
sync
echo 3 > /proc/sys/vm/drop_caches
echo "紧急恢复完成"
EOF
chmod +x /usr/local/bin/emergency_recovery.sh
echo -e "\n预防性监控配置完成" | tee -a $LOG_FILE
echo "详细日志: $LOG_FILE" | tee -a $LOG_FILE
9. 总结与经验
通过这次完整的故障排查,我们建立了系统的排查流程:
9.1 排查流程总结
graph TB
A[故障报告] --> B[信息收集]
B --> C[用户端测试]
C --> D[网络层分析]
D --> E[服务器层检查]
E --> F[应用层诊断]
F --> G[数据库分析]
G --> H[根因定位]
H --> I[解决方案实施]
I --> J[验证与监控]
J --> K[文档记录]
style A fill:#1e3a5f,color:#ffffff
style B fill:#4a1e5f,color:#ffffff
style C fill:#5f3a1e,color:#ffffff
style D fill:#5f3a1e,color:#ffffff
style E fill:#5f3a1e,color:#ffffff
style F fill:#5f3a1e,color:#ffffff
style G fill:#5f3a1e,color:#ffffff
style H fill:#5f1e1e,color:#ffffff
style I fill:#1e5f3a,color:#ffffff
style J fill:#1e5f3a,color:#ffffff
style K fill:#1e3a5f,color:#ffffff
9.2 关键经验
- 系统化排查:从外到内,从简单到复杂
- 工具准备:提前准备好诊断脚本和工具
- 监控先行:建立完善的监控体系
- 文档完整:记录每次故障的排查过程和解决方案
- 预防为主:通过定期维护预防故障发生