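Python Network Automation in Practice: A Complete Solution for Batch Inspection of Huawei Switches
A practical guide to network operations automation: from manual SSH to automated batch inspection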
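1. Background and Pain Points
As an enterprise network grows, the number of switches climbs from a few hundred to more than a thousand. Traditional manual inspection has several problems:
Pain points of manual inspection
- Inefficient: logging in to each device over SSH and manually running commands such as display version and display interface brief is repetitive and time-consuming
- Error-prone: it is easy to mistype a command or overlook an abnormal port state
- Scattered data: the collected information ends up spread across assorted Excel sheets, which makes later analysis difficult
- No scheduling or on-demand runs: when a sudden fault occurs, a quick network-wide inspection is simply not feasible by hand
Advantages of automated inspection
- Efficient: scripts are available around the clock and run on a schedule
- Accurate: standardized operations avoid human error
- Consistent: structured reports are easy to analyze and archive
- Timely: results can be pushed to email or WeCom (企业微信) so anomalies are reported promptly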
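2. Environment Setup and Tool Selection
2.1 Python Environment
Python 3.8 or later is recommended; this article uses Python 3.11.
2.2 Core Dependencies
```bash
# Install the core libraries
pip install netmiko pandas openpyxl paramiko
# Optional: structured parsing of command output
pip install ntc-templates textfsm
```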
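| Library | Purpose | Notes |
|---|---|---|
| Netmiko | SSH connections and command execution | Good Huawei support; handles details such as paging |
| Pandas | Data processing and Excel report generation | Stores inspection results in structured form |
| openpyxl | Reading and writing Excel files | Engine pandas uses for .xlsx output |
| Paramiko | Low-level SSH library | The SSH layer underneath Netmiko |
| ntc-templates | Templates for parsing command output | Turns unstructured output into structured data |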
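2.3 Managing Device Information
Never hardcode IP addresses, usernames, or passwords in the script!
A CSV file is a convenient place to keep the device inventory. The password and secret values below are placeholders for illustration; in production, keep real credentials out of the file and load them as described in section 5.5:
```csv
ip,hostname,device_type,username,password,secret
192.168.1.1,SW-Core-01,huawei,admin,YourPass,enable_pass
192.168.1.2,SW-Access-01,huawei,admin,YourPass,enable_pass
192.168.1.3,SW-Access-02,huawei,admin,YourPass,enable_pass
```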
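A malformed inventory is a common source of confusing failures at run time, so it can help to validate the CSV before opening any SSH session. The following is a minimal sketch, not part of the original scripts: the function name `validate_inventory` and the choice of required columns are assumptions made for illustration.

```python
import csv
import io
import ipaddress

# Columns this sketch treats as mandatory (an assumption, not a rule from the article)
REQUIRED_COLUMNS = ("ip", "hostname", "device_type", "username")


def validate_inventory(csv_file):
    """Return (devices, problems) for an open CSV inventory file.

    Rows with an invalid or duplicate IP are skipped and reported in `problems`.
    """
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"inventory is missing columns: {missing}")

    devices, problems, seen = [], [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        ip = (row.get("ip") or "").strip()
        try:
            ipaddress.ip_address(ip)
        except ValueError:
            problems.append(f"line {line_no}: invalid IP {ip!r}")
            continue
        if ip in seen:
            problems.append(f"line {line_no}: duplicate IP {ip}")
            continue
        seen.add(ip)
        devices.append(row)
    return devices, problems


# Hand-written sample inventory with one duplicate and one invalid address
sample = io.StringIO(
    "ip,hostname,device_type,username,password,secret\n"
    "192.168.1.1,SW-Core-01,huawei,admin,,\n"
    "192.168.1.1,SW-Core-01-dup,huawei,admin,,\n"
    "192.168.1.300,SW-Bad,huawei,admin,,\n"
)
devices, problems = validate_inventory(sample)
print(len(devices), problems)
```

Reporting problems as a list instead of raising on the first bad row lets one run surface every inventory error at once.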
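2.4 Environment Notes
- Switch configuration: make sure SSH is enabled on every switch and the VTY user privileges are configured
- Account privileges: give the inspection account read-only access to reduce risk
- Network reachability: make sure the server running the script can reach every target device
- Concurrency control: do not open sessions to all 1000 devices at once; run in batches or rate-limit (50-100 concurrent sessions)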
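3. Common Inspection Commands for Huawei Switches
The following commands are the must-check list for routine inspection:
| Command | Description | What to check |
|---|---|---|
| display version | System version and uptime | Version consistency, uptime |
| display device | Board status | Operating state of each card |
| display fan | Fan status | Speed, alarms |
| display power | Power status | State of each power module |
| display cpu-usage | CPU usage | Investigate above 80% |
| display memory | Memory usage | Investigate above 80% |
| display interface brief | Interface summary | Interface UP/DOWN state |
| display interface description | Interface descriptions | Verify interface labels |
| display arp | ARP table | Number of ARP entries |
| display mac-address | MAC address table | Number of MAC entries |
| display logbuffer | Log buffer | Recent alarms |
| display temperature | Temperature | Investigate above the threshold |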
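4. The Complete Python Script
4.1 Project Structure
```
huawei_inspection/
├── devices.csv      # Device inventory
├── config.py        # Global configuration
├── connector.py     # Device connection and command execution
├── inspection.py    # Main program
├── parsers.py       # Output parsing functions
├── reporter.py      # Report generation
└── logs/            # Log directory
```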
4.2 Global Configuration File config.py
```python
# config.py - global configuration

# Inspection command list
INSPECTION_COMMANDS = [
    "display version",
    "display device",
    "display fan",
    "display power",
    "display cpu-usage",
    "display memory",
    "display interface brief",
    "display interface description",
    "display arp",
    "display mac-address",
    "display logbuffer",
    "display temperature",
]

# Concurrency
MAX_WORKERS = 50          # Maximum concurrent sessions; tune to the server's capacity

# Timeouts
CONNECT_TIMEOUT = 30      # SSH connection timeout (seconds)
READ_TIMEOUT = 30         # Per-command read timeout (seconds)

# Retries
MAX_RETRIES = 3           # Maximum number of attempts per device
RETRY_DELAY = 2           # Base delay between attempts (seconds)

# Output locations
REPORT_DIR = "./reports"  # Report output directory
LOG_DIR = "./logs"        # Log output directory

# Alarm thresholds
CPU_THRESHOLD = 80        # CPU usage alarm threshold (%)
MEM_THRESHOLD = 80        # Memory usage alarm threshold (%)
TEMP_THRESHOLD = 60       # Temperature alarm threshold (°C)
```
4.3 Device Connection and Command Execution Module connector.py
```python
# connector.py - device connection and command execution
import logging
import time

from netmiko import ConnectHandler
from netmiko.exceptions import NetmikoAuthenticationException, NetmikoTimeoutException

import config

logger = logging.getLogger(__name__)


def connect_huawei_device(device_info):
    """
    Connect to a Huawei switch.

    Args:
        device_info: dict with ip, username, password, etc.
    Returns:
        A Netmiko connection object, or None on failure
    """
    device = {
        'device_type': 'huawei',
        'host': device_info['ip'],
        'username': device_info['username'],
        'password': device_info['password'],
        'port': 22,
        'conn_timeout': config.CONNECT_TIMEOUT,  # connection establishment timeout
    }
    # Pass the secret through if the inventory provides one.
    # Huawei VRP has no Cisco-style enable mode, so conn.enable() is not called.
    if device_info.get('secret'):
        device['secret'] = device_info['secret']
    try:
        # Netmiko's huawei driver disables paging while it prepares the session
        return ConnectHandler(**device)
    except NetmikoAuthenticationException as e:
        logger.error(f"[{device_info['ip']}] Authentication failed: {e}")
        return None
    except NetmikoTimeoutException as e:
        logger.error(f"[{device_info['ip']}] Connection timed out: {e}")
        return None
    except Exception as e:
        logger.error(f"[{device_info['ip']}] Connection error: {e}")
        return None


def run_commands(conn, commands):
    """
    Run a list of commands on the device.

    Args:
        conn: Netmiko connection object
        commands: list of commands
    Returns:
        dict: mapping of command to output
    """
    results = {}
    for cmd in commands:
        try:
            # Let Netmiko detect the prompt itself. Huawei prompts look like
            # <hostname> or [hostname], so expect_string=r'#' would never match.
            results[cmd] = conn.send_command(cmd, read_timeout=config.READ_TIMEOUT)
        except Exception as e:
            logger.warning(f"Command failed [{cmd}]: {e}")
            results[cmd] = f"ERROR: {e}"
    return results


def inspect_single_device(device_info, commands=None):
    """
    Inspect one device, with retries.

    Args:
        device_info: device information dict
        commands: inspection commands (defaults to the list in config)
    Returns:
        dict: device IP, hostname, command outputs, status, and timing
    """
    if commands is None:
        commands = config.INSPECTION_COMMANDS
    ip = device_info['ip']
    hostname = device_info.get('hostname', ip)
    result = {
        'ip': ip,
        'hostname': hostname,
        'status': 'failure',
        'outputs': {},
        'error': None,
        'start_time': time.time(),
        'end_time': None,
    }
    for attempt in range(1, config.MAX_RETRIES + 1):
        logger.info(f"[{ip}] Connection attempt {attempt}...")
        conn = connect_huawei_device(device_info)
        if conn is None:
            result['error'] = "Connection failed; maximum number of attempts reached"
        else:
            try:
                logger.info(f"[{ip}] Connected, running inspection commands...")
                result['outputs'] = run_commands(conn, commands)
                result['status'] = 'success'
                result['error'] = None
                break
            except Exception as e:
                logger.error(f"[{ip}] Error during inspection: {e}")
                result['error'] = str(e)
            finally:
                # Always release the SSH session, whether or not the commands succeeded
                conn.disconnect()
        if attempt < config.MAX_RETRIES:
            # Exponential backoff: RETRY_DELAY, 2x, 4x, ...
            time.sleep(config.RETRY_DELAY * 2 ** (attempt - 1))
    result['end_time'] = time.time()
    if result['status'] == 'success':
        elapsed = result['end_time'] - result['start_time']
        logger.info(f"[{ip}] Inspection finished in {elapsed:.2f} s")
    return result
```
4.4 Output Parsing Module parsers.py
```python
# parsers.py - command output parsing functions
import re

import config

# Interface rows in display interface brief. Covers S-series names such as
# GigabitEthernet0/0/1 and CE-series names such as GE1/0/1 or 10GE1/0/1.
# Group 1 is the interface name, group 2 the PHY column.
INTERFACE_ROW = re.compile(
    r'^\s*((?:X?GigabitEthernet|\d*GE|Eth|Vlanif)\S*)\s+(\S+)',
    re.IGNORECASE,
)


def parse_version(output):
    """Parse display version output."""
    result = {
        'version': 'N/A',
        'uptime': 'N/A',
        'model': 'N/A',
    }
    # Software version
    match = re.search(r'Version\s+([^\s]+)', output, re.IGNORECASE)
    if match:
        result['version'] = match.group(1)
    # Uptime
    match = re.search(r'uptime\s+is\s+([^\n]+)', output, re.IGNORECASE)
    if match:
        result['uptime'] = match.group(1).strip()
    # Device model
    match = re.search(r'(S\d{4}|CE\d{4}|AR\d{4})', output, re.IGNORECASE)
    if match:
        result['model'] = match.group(1)
    return result


def parse_cpu(output):
    """Parse display cpu-usage output."""
    result = {
        'cpu_usage': 'N/A',
        'alarm': False,
    }
    # Try the known layouts in order; the first match wins
    match = re.search(r'CPU Usage\s*:\s*(\d+)%', output, re.IGNORECASE)
    if not match:
        # Also tolerates wording between the label and the value,
        # e.g. "CPU utilization for five seconds: 7%"
        match = re.search(r'CPU utilization[^\n\d]*?(\d+)%', output, re.IGNORECASE)
    if not match:
        # Last resort: a percentage at the end of a table row
        match = re.search(r'(\d+)%\s*\n', output)
    if match:
        cpu_val = int(match.group(1))
        result['cpu_usage'] = cpu_val
        result['alarm'] = cpu_val > config.CPU_THRESHOLD
    return result


def parse_memory(output):
    """Parse display memory output."""
    result = {
        'memory_usage': 'N/A',
        'total_memory': 'N/A',
        'used_memory': 'N/A',
        'alarm': False,
    }
    # Memory usage percentage
    match = re.search(r'Memory Using Percentage Is\s*:\s*(\d+)%', output, re.IGNORECASE)
    if not match:
        match = re.search(r'Memory utilization\s*[:\s]+(\d+)%', output, re.IGNORECASE)
    if match:
        mem_val = int(match.group(1))
        result['memory_usage'] = mem_val
        result['alarm'] = mem_val > config.MEM_THRESHOLD
    # Total and used memory
    total_match = re.search(r'Total Memory\s*[:\s]+(\d+)\s*MB', output, re.IGNORECASE)
    used_match = re.search(r'Used Memory\s*[:\s]+(\d+)\s*MB', output, re.IGNORECASE)
    if total_match:
        result['total_memory'] = int(total_match.group(1))
    if used_match:
        result['used_memory'] = int(used_match.group(1))
    return result


def parse_interface_brief(output):
    """Parse display interface brief output."""
    result = {
        'total_ports': 0,
        'up_ports': 0,
        'down_ports': 0,
        'admin_down_ports': 0,
        'abnormal_ports': [],
    }
    for line in output.split('\n'):
        match = INTERFACE_ROW.match(line)
        if not match:
            continue
        port_name = match.group(1)
        phy = match.group(2).lower()
        result['total_ports'] += 1
        if phy.startswith('up'):
            result['up_ports'] += 1
        elif phy.startswith('*'):
            # Huawei marks administratively down ports as *down
            result['admin_down_ports'] += 1
        elif 'down' in phy:
            result['down_ports'] += 1
            # Record ports that are down without being shut down on purpose
            result['abnormal_ports'].append(port_name)
    return result


def parse_power(output):
    """Parse display power output."""
    result = {
        'power_status': [],
        'alarm': False,
    }
    for line in output.split('\n'):
        if 'Power' in line or '电源' in line:
            status = 'normal'
            if 'abnormal' in line.lower() or 'fault' in line.lower() or '失败' in line:
                status = 'abnormal'
                result['alarm'] = True
            result['power_status'].append({
                'line': line.strip(),
                'status': status,
            })
    return result


def parse_fan(output):
    """Parse display fan output."""
    result = {
        'fan_status': [],
        'alarm': False,
    }
    for line in output.split('\n'):
        if 'Fan' in line or '风扇' in line:
            status = 'normal'
            if 'abnormal' in line.lower() or 'fault' in line.lower() or '失败' in line:
                status = 'abnormal'
                result['alarm'] = True
            result['fan_status'].append({
                'line': line.strip(),
                'status': status,
            })
    return result


def parse_temperature(output):
    """Parse display temperature output."""
    result = {
        'temperature': [],
        'alarm': False,
    }
    # Assumes readings carry a unit suffix such as "45C" or "45 °C".
    # Models that print bare numbers in a table need a column-based parser.
    matches = re.findall(r'(\d+)\s*°?C', output, re.IGNORECASE)
    for temp_str in matches:
        temp_val = int(temp_str)
        alarm = temp_val > config.TEMP_THRESHOLD
        if alarm:
            result['alarm'] = True
        result['temperature'].append({
            'value': temp_val,
            'alarm': alarm,
        })
    return result


def parse_logbuffer(output, max_lines=50):
    """Parse display logbuffer output and classify the first max_lines entries."""
    result = {
        'recent_logs': [],
        'error_count': 0,
        'warning_count': 0,
    }
    log_lines = [l for l in output.split('\n') if l.strip()][:max_lines]
    for line in log_lines:
        level = 'info'
        if 'error' in line.lower() or '错误' in line or 'ERR' in line:
            level = 'error'
            result['error_count'] += 1
        elif 'warning' in line.lower() or '警告' in line or 'WARN' in line:
            level = 'warning'
            result['warning_count'] += 1
        result['recent_logs'].append({
            'content': line.strip(),
            'level': level,
        })
    return result


def parse_all_outputs(outputs):
    """
    Parse every command output into structured data.

    Args:
        outputs: dict mapping command to output
    Returns:
        dict: parsed, structured data
    """
    parsed = {}
    for cmd, output in outputs.items():
        if 'display version' in cmd:
            parsed['version'] = parse_version(output)
        elif 'display cpu' in cmd:
            parsed['cpu'] = parse_cpu(output)
        elif 'display memory' in cmd:
            parsed['memory'] = parse_memory(output)
        elif 'display interface brief' in cmd:
            parsed['interface'] = parse_interface_brief(output)
        elif 'display power' in cmd:
            parsed['power'] = parse_power(output)
        elif 'display fan' in cmd:
            parsed['fan'] = parse_fan(output)
        elif 'display temperature' in cmd:
            parsed['temperature'] = parse_temperature(output)
        elif 'display logbuffer' in cmd:
            parsed['log'] = parse_logbuffer(output)
    return parsed
```
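Regex-based parsers tend to break silently when a software release changes the output layout, so it is worth keeping a few saved samples and checking the patterns against them. The sketch below shows the ordered-fallback idea used for CPU parsing in isolation. The helper name `extract_cpu_percent` is hypothetical, and the sample strings are hand-written to resemble the layouts the patterns target; they were not captured from a device.

```python
import re

# Patterns are tried in order; the first one that matches wins.
CPU_PATTERNS = [
    re.compile(r"CPU Usage\s*:\s*(\d+)%", re.IGNORECASE),
    re.compile(r"CPU utilization[^\n\d]*?(\d+)%", re.IGNORECASE),
]


def extract_cpu_percent(output):
    """Return the CPU percentage as an int, or None if no pattern matches."""
    for pattern in CPU_PATTERNS:
        match = pattern.search(output)
        if match:
            return int(match.group(1))
    return None


# Hand-written samples (illustrative only)
sample_a = "CPU Usage            : 12% Max: 95%\n"
sample_b = "CPU utilization for five seconds: 7%: one minute: 6%\n"
sample_c = "no CPU data here\n"

for sample in (sample_a, sample_b, sample_c):
    print(extract_cpu_percent(sample))
```

Returning None instead of a default number keeps "could not parse" distinguishable from "0% usage" in the report.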
4.5 Report Generation Module reporter.py
```python
# reporter.py - report generation
import os
from datetime import datetime

import pandas as pd

import config

# Column headers, sheet names, and cell values stay in Chinese because they are
# what readers of the workbook see; inspection.py also reads the statistics keys.


def generate_excel_report(results, output_path=None):
    """
    Generate the inspection report as an Excel workbook.

    Args:
        results: list of inspection results
        output_path: output file path (optional)
    Returns:
        str: path of the generated report
    """
    if output_path is None:
        os.makedirs(config.REPORT_DIR, exist_ok=True)
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        output_path = os.path.join(config.REPORT_DIR, f'巡检报告_{timestamp}.xlsx')

    # Build the summary rows
    summary_data = []
    for res in results:
        status = res['status']
        # Results built for crashed workers may lack timing fields
        start, end = res.get('start_time'), res.get('end_time')
        row = {
            'IP地址': res['ip'],
            '主机名': res['hostname'],
            '巡检状态': '成功' if status == 'success' else '失败',
            '错误信息': res.get('error') or '',
            '耗时(秒)': round(end - start, 2) if start and end else 'N/A',
        }
        if status == 'success' and res.get('parsed'):
            parsed = res['parsed']
            # Version
            if 'version' in parsed:
                row['设备型号'] = parsed['version'].get('model', 'N/A')
                row['系统版本'] = parsed['version'].get('version', 'N/A')
                row['运行时间'] = parsed['version'].get('uptime', 'N/A')
            # CPU
            if 'cpu' in parsed:
                row['CPU使用率(%)'] = parsed['cpu'].get('cpu_usage', 'N/A')
                row['CPU告警'] = '是' if parsed['cpu'].get('alarm') else '否'
            # Memory
            if 'memory' in parsed:
                row['内存使用率(%)'] = parsed['memory'].get('memory_usage', 'N/A')
                row['内存告警'] = '是' if parsed['memory'].get('alarm') else '否'
            # Interfaces
            if 'interface' in parsed:
                row['接口总数'] = parsed['interface'].get('total_ports', 'N/A')
                row['UP接口数'] = parsed['interface'].get('up_ports', 'N/A')
                row['DOWN接口数'] = parsed['interface'].get('down_ports', 'N/A')
            # Power
            if 'power' in parsed:
                row['电源告警'] = '是' if parsed['power'].get('alarm') else '否'
            # Fans
            if 'fan' in parsed:
                row['风扇告警'] = '是' if parsed['fan'].get('alarm') else '否'
            # Temperature
            if 'temperature' in parsed:
                row['温度告警'] = '是' if parsed['temperature'].get('alarm') else '否'
            # Logs
            if 'log' in parsed:
                row['错误日志数'] = parsed['log'].get('error_count', 0)
                row['警告日志数'] = parsed['log'].get('warning_count', 0)
        summary_data.append(row)

    df_summary = pd.DataFrame(summary_data)
    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        # Summary sheet
        df_summary.to_excel(writer, sheet_name='巡检汇总', index=False)
        # Statistics sheet
        df_stats = pd.DataFrame([generate_statistics(results)])
        df_stats.to_excel(writer, sheet_name='统计信息', index=False)
        # Abnormal devices sheet
        abnormal = [r for r in results if is_abnormal(r)]
        if abnormal:
            abnormal_data = [
                {
                    'IP地址': res['ip'],
                    '主机名': res['hostname'],
                    '异常类型': get_abnormal_type(res),
                    '详细描述': get_abnormal_detail(res),
                }
                for res in abnormal
            ]
            pd.DataFrame(abnormal_data).to_excel(writer, sheet_name='异常设备', index=False)
    print(f"Report written to: {output_path}")
    return output_path


def generate_statistics(results):
    """Build the statistics summary."""
    total = len(results)
    success = sum(1 for r in results if r['status'] == 'success')
    failure = total - success
    # Count devices per alarm type
    alarm_counts = {key: 0 for key in ('cpu', 'memory', 'power', 'fan', 'temperature')}
    for res in results:
        if res['status'] == 'success' and res.get('parsed'):
            for key in alarm_counts:
                if res['parsed'].get(key, {}).get('alarm'):
                    alarm_counts[key] += 1
    return {
        '总设备数': total,
        '巡检成功': success,
        '巡检失败': failure,
        '成功率(%)': round(success / total * 100, 2) if total > 0 else 0,
        'CPU告警设备': alarm_counts['cpu'],
        '内存告警设备': alarm_counts['memory'],
        '电源告警设备': alarm_counts['power'],
        '风扇告警设备': alarm_counts['fan'],
        '温度告警设备': alarm_counts['temperature'],
    }


def is_abnormal(result):
    """Return True if the device failed inspection or raised any alarm."""
    if result['status'] != 'success':
        return True
    parsed = result.get('parsed', {})
    return any(
        parsed.get(key, {}).get('alarm')
        for key in ('cpu', 'memory', 'power', 'fan', 'temperature')
    )


def get_abnormal_type(result):
    """Describe the kind of abnormality."""
    if result['status'] != 'success':
        return '连接失败'
    types = []
    parsed = result.get('parsed', {})
    if parsed.get('cpu', {}).get('alarm'):
        types.append('CPU高')
    if parsed.get('memory', {}).get('alarm'):
        types.append('内存高')
    if parsed.get('power', {}).get('alarm'):
        types.append('电源异常')
    if parsed.get('fan', {}).get('alarm'):
        types.append('风扇异常')
    if parsed.get('temperature', {}).get('alarm'):
        types.append('温度高')
    return '、'.join(types) if types else '未知异常'


def get_abnormal_detail(result):
    """Describe the abnormality in detail."""
    if result['status'] != 'success':
        # For failed devices the error message is the most useful detail
        return result.get('error') or ''
    details = []
    parsed = result.get('parsed', {})
    if parsed.get('cpu', {}).get('alarm'):
        details.append(f"CPU使用率: {parsed['cpu'].get('cpu_usage')}%")
    if parsed.get('memory', {}).get('alarm'):
        details.append(f"内存使用率: {parsed['memory'].get('memory_usage')}%")
    return '; '.join(details)
```
4.6 Main Program inspection.py
```python
# inspection.py - main program
import csv
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime

import config
import connector
import parsers
import reporter

# Module-level logger so every function can log, even when this file is imported
logger = logging.getLogger(__name__)


def setup_logging():
    """Configure logging to both a timestamped file and the console."""
    os.makedirs(config.LOG_DIR, exist_ok=True)
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    log_file = os.path.join(config.LOG_DIR, f'inspection_{timestamp}.log')
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s [%(levelname)s] %(message)s',
        handlers=[
            logging.FileHandler(log_file, encoding='utf-8'),
            logging.StreamHandler(),
        ]
    )


def load_devices(csv_path='devices.csv'):
    """
    Load the device inventory from a CSV file.

    Args:
        csv_path: path to the CSV file
    Returns:
        list: list of device information dicts
    """
    devices = []
    try:
        with open(csv_path, 'r', encoding='utf-8') as f:
            reader = csv.DictReader(f)
            for row in reader:
                devices.append(row)
        logger.info(f"Loaded {len(devices)} devices")
    except FileNotFoundError:
        logger.error(f"Device inventory not found: {csv_path}")
        raise
    except Exception as e:
        logger.error(f"Failed to load device inventory: {e}")
        raise
    return devices


def inspect_device_wrapper(device_info):
    """
    Wrapper for the thread pool: run the inspection, then parse the outputs.
    """
    result = connector.inspect_single_device(device_info)
    # Parse command outputs
    if result['status'] == 'success' and result.get('outputs'):
        try:
            result['parsed'] = parsers.parse_all_outputs(result['outputs'])
        except Exception as e:
            logger.error(f"[{result['ip']}] Failed to parse outputs: {e}")
            result['parsed'] = {}
    return result


def main():
    """Entry point."""
    logger.info("=" * 60)
    logger.info("Huawei switch automated inspection tool v1.0")
    logger.info("=" * 60)
    start_time = datetime.now()

    # Load the inventory
    logger.info("Loading device inventory...")
    devices = load_devices('devices.csv')

    # Inspect concurrently
    logger.info(f"Starting inspection of {len(devices)} devices, concurrency: {config.MAX_WORKERS}")
    results = []
    with ThreadPoolExecutor(max_workers=config.MAX_WORKERS) as executor:
        future_to_device = {
            executor.submit(inspect_device_wrapper, device): device
            for device in devices
        }
        completed = 0
        for future in as_completed(future_to_device):
            completed += 1
            device = future_to_device[future]
            try:
                results.append(future.result())
                # Progress
                progress = completed / len(devices) * 100
                logger.info(f"Progress: {completed}/{len(devices)} ({progress:.1f}%) - {device['ip']}")
            except Exception as exc:
                logger.error(f"{device['ip']} inspection error: {exc}")
                # Same shape as a normal result so the reporter can handle it
                results.append({
                    'ip': device['ip'],
                    'hostname': device.get('hostname', device['ip']),
                    'status': 'failure',
                    'error': str(exc),
                    'outputs': {},
                    'parsed': {},
                    'start_time': None,
                    'end_time': None,
                })

    # Generate the report
    logger.info("Inspection finished, generating report...")
    report_path = reporter.generate_excel_report(results)

    # Print statistics (keys are the Chinese labels defined in reporter.py)
    duration = (datetime.now() - start_time).total_seconds()
    stats = reporter.generate_statistics(results)
    logger.info("=" * 60)
    logger.info("Inspection statistics:")
    logger.info(f"  Total devices: {stats['总设备数']}")
    logger.info(f"  Succeeded: {stats['巡检成功']}")
    logger.info(f"  Failed: {stats['巡检失败']}")
    logger.info(f"  Success rate: {stats['成功率(%)']}%")
    logger.info(f"  CPU alarms: {stats['CPU告警设备']}")
    logger.info(f"  Memory alarms: {stats['内存告警设备']}")
    logger.info(f"  Power alarms: {stats['电源告警设备']}")
    logger.info(f"  Fan alarms: {stats['风扇告警设备']}")
    logger.info(f"  Temperature alarms: {stats['温度告警设备']}")
    logger.info(f"Total time: {duration:.2f} s")
    logger.info(f"Report path: {report_path}")
    logger.info("=" * 60)
    return results


if __name__ == "__main__":
    setup_logging()
    main()
```
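4.7 Scheduling
Use the system crontab to schedule inspections. Because inspection.py opens devices.csv and writes reports using relative paths, each job changes into the project directory first:
```bash
# Edit the crontab
crontab -e
# Run every Monday at 08:00
0 8 * * 1 cd /path/to && /usr/bin/python3 inspection.py >> /path/to/logs/cron.log 2>&1
# Run every day at 02:00
0 2 * * * cd /path/to && /usr/bin/python3 inspection.py
```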
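5. Common Problems and Solutions
5.1 Connection Timeouts and Failures
Cause: network jitter, a busy device, or a faulty SSH service
Solutions:
- Add a retry mechanism (connector.py makes up to 3 attempts by default)
- Back off between attempts so a struggling device is not hammered
- Check network reachability
```python
# Delay before the next attempt. Doubling it each time gives exponential
# backoff (2s, 4s, 8s, ...); RETRY_DELAY * attempt would only grow linearly.
time.sleep(RETRY_DELAY * 2 ** (attempt - 1))  # attempt is the 1-based attempt number
```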
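The difference between a linearly growing delay and exponential backoff is easy to see in a small helper. This is a sketch under stated assumptions: the function name `backoff_delay`, the 60-second cap, and the optional jitter are illustrative choices that do not appear in the original scripts.

```python
import random


def backoff_delay(attempt, base=2.0, cap=60.0, jitter=0.0, rng=random):
    """Return the wait in seconds before retry number `attempt` (1-based).

    The delay doubles with each attempt and is capped at `cap`. A non-zero
    `jitter` adds up to that fraction of the delay at random, which spreads
    out reconnects when many threads fail at the same moment.
    """
    delay = min(cap, base * 2 ** (attempt - 1))
    if jitter:
        delay += rng.uniform(0, jitter * delay)
    return delay


print([backoff_delay(n) for n in range(1, 6)])  # → [2.0, 4.0, 8.0, 16.0, 32.0]
print(backoff_delay(10))                        # → 60.0 (capped)
```

With 50 or more worker threads, a small jitter (for example `jitter=0.25`) helps avoid synchronized retry bursts against the same upstream device.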
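5.2 Paged Command Output
Cause: Huawei devices page long output by default
Solutions:
```python
# Option 1: rely on Netmiko. Its huawei driver disables paging for the session
# when the connection is established, so usually nothing more is needed.
conn = ConnectHandler(**device)
# Option 2: disable paging explicitly for the current session (user view)
conn.send_command("screen-length 0 temporary")
```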
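5.3 Permission Problems
Cause: the inspection account lacks sufficient privileges
Solution:
- Create a dedicated read-only account
```bash
# Configure a read-only account on the switch (in the AAA view)
local-user inspector password cipher YourPassword
local-user inspector privilege level 1
local-user inspector service-type ssh
```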
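5.4 Performance Bottlenecks
Cause: concurrency is set too high for the server's resources
Solutions:
- Lower the concurrency (MAX_WORKERS in config.py)
- Run in batches (grouped by region or device type)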
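Batching can be as simple as slicing the device list into fixed-size chunks and inspecting one chunk at a time. The helper below is a minimal sketch; the name `chunk_devices` and the batch size are assumptions for illustration.

```python
from itertools import islice


def chunk_devices(devices, size):
    """Yield successive lists of at most `size` devices."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(devices)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical inventory of seven devices, processed three at a time
devices = [{"ip": f"10.0.0.{n}"} for n in range(1, 8)]
sizes = [len(batch) for batch in chunk_devices(devices, 3)]
print(sizes)  # → [3, 3, 1]
```

Each batch can then be handed to its own thread pool run, with a short pause between batches if the network or the script server needs breathing room.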
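5.5 Password Security
Cause: storing passwords in plain text is a security risk
Solutions:
```python
# Option 1: read the password from an environment variable
import os
password = os.environ.get('DEVICE_PASSWORD')

# Option 2: use the keyring library (third-party: pip install keyring)
import keyring
password = keyring.get_password("network_devices", "admin")
```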
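For scripts that run both from cron and interactively, one possible arrangement is to try the environment variable first and fall back to a prompt. The sketch below uses only the standard library; the function name `resolve_password` is hypothetical, and `DEVICE_PASSWORD` is the variable name used in the snippet above.

```python
import getpass
import os


def resolve_password(username, env=None, prompt=getpass.getpass):
    """Return the device password from the environment, or ask for it.

    `env` and `prompt` are injectable so the function can be tested
    without touching the real environment or blocking on input.
    """
    env = os.environ if env is None else env
    password = env.get("DEVICE_PASSWORD")
    if password:
        return password
    # No variable set: fall back to an interactive prompt (input is not echoed)
    return prompt(f"Password for {username}: ")


# Non-interactive usage, e.g. under cron with the variable exported
print(resolve_password("admin", env={"DEVICE_PASSWORD": "example-only"}))
```

Under cron there is no terminal to prompt on, so scheduled runs should always have the variable set.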
6. Lessons from the Field
6.1 Batch Inspection Strategy
For large networks of 1000 devices or more, inspect in batches:
```python
# Batch by region
regions = {
    'region1': ['192.168.1.1', '192.168.1.2', ...],
    'region2': ['192.168.2.1', '192.168.2.2', ...],
}
for region, ips in regions.items():
    logger.info(f"Starting inspection for region: {region}")
    # Inspect the devices that belong to this region
    region_devices = [d for d in all_devices if d['ip'] in ips]
    # ...
```
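When regions line up with IP subnets, as in the 192.168.1.x / 192.168.2.x example, the groups can be derived from the addresses instead of being maintained by hand. This sketch assumes one region per /24 network, which may not hold in every environment; the function name `group_by_subnet` is hypothetical.

```python
import ipaddress
from collections import defaultdict


def group_by_subnet(devices, prefix_len=24):
    """Group device dicts by the network their 'ip' falls into."""
    groups = defaultdict(list)
    for device in devices:
        # strict=False lets a host address stand in for its network
        network = ipaddress.ip_network(f"{device['ip']}/{prefix_len}", strict=False)
        groups[str(network)].append(device)
    return dict(groups)


# Hypothetical inventory
all_devices = [
    {"ip": "192.168.1.1"},
    {"ip": "192.168.1.2"},
    {"ip": "192.168.2.1"},
]
groups = group_by_subnet(all_devices)
for subnet, members in groups.items():
    print(subnet, [d["ip"] for d in members])
```

If regions do not follow subnet boundaries, an explicit region column in the inventory is the more reliable source.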
6.2 Exception Handling Best Practice
```python
# Handle exceptions per device so one failure does not affect the rest
try:
    result = inspect_single_device(device)
except Exception as e:
    logger.error(f"{device['ip']} error: {e}")
    # Continue with the next device
```
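6.3 Verifying Results
After an inspection run, spot-check the results:
- Pick 5-10 devices at random and log in manually to compare
- Review the abnormal devices in the report and confirm the alarms are real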
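Drawing the random sample in code keeps the spot check unbiased and, with a fixed seed, reproducible. The sketch below assumes result dicts with `ip` and `status` keys like the ones the inspection produces; the function name `pick_spot_checks` and the sample data are made up for illustration.

```python
import random


def pick_spot_checks(results, k=5, seed=None):
    """Pick up to k successfully inspected devices for manual verification."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    successes = [r for r in results if r.get("status") == "success"]
    return rng.sample(successes, min(k, len(successes)))


# Made-up results: every fourth device failed
results = [
    {"ip": f"10.0.0.{n}", "status": "success" if n % 4 else "failure"}
    for n in range(1, 21)
]
to_verify = pick_spot_checks(results, k=5, seed=42)
print([r["ip"] for r in to_verify])
```

Failed devices are excluded here because they already appear on the abnormal-devices sheet and need attention regardless.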
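7. Summary
With Python-based automated inspection we achieved:
- Higher efficiency: inspecting 1000 devices went from several days to 2-4 hours
- Accuracy: standardized operations avoid human error
- Consistency: structured reports are easy to analyze and archive
- Scalability: new devices are covered by adding rows to the inventory
Appendix: Downloading the Complete Scripts
All scripts are organized as a complete project that can be downloaded and used directly. Adjust devices.csv and config.py for your environment, then run:
```bash
python inspection.py
```
Note: only run the tool within the scope you are authorized for, and back up existing configurations first. Network security is everyone's responsibility!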