1. Overall Technical Architecture Background
Bare Metal Creation Flow
User request → Nova API → Conductor → Scheduler → Compute node → Ironic API →
Ironic Conductor → PXE boot → Deploy image → Install OS → Reboot into local disk
Core OpenStack Components Involved
| Component | Responsibility | Bottleneck Risk |
|---|---|---|
| Nova | VM / bare-metal lifecycle management | API concurrency, message queue backlog |
| Ironic | Dedicated bare-metal service | Database locks, PXE concurrency limits |
| Neutron | Networking (PXE network, tenant networks) | IP address allocation, DHCP load |
| Keystone | Authentication | Token generation load |
| RabbitMQ | Message bus | Queue backlog, too few consumers |
| MySQL/Galera | Database | Row locks, deadlocks, write amplification |
| Glance | Image service | Image download bandwidth |
2. Bottleneck-by-Bottleneck Analysis and Solutions
Bottleneck 1: Nova Conductor Database Lock Contention 🔥
Symptom: beyond roughly 10 concurrent requests, creation tasks hang and the database connection pool is exhausted
Root cause:
python
# Simplified excerpt from nova/conductor/manager.py
@objects.object_compat
def build_instances(self, context, instances, image, filter_properties,
                    admin_password, injected_files, requested_networks,
                    security_groups, block_device_mapping):
    # Every create request INSERTs into the instances table and sets the
    # state to BUILDING; under high concurrency, InnoDB row-lock and
    # gap-lock contention spikes.
    for instance in instances:
        instance.create()  # <- severe contention here
Optimizations:
ini
# 1. Increase the database connection pool (nova.conf)
[database]
max_pool_size = 200   # default 50 -> 200
max_overflow = 400    # default 100 -> 400
pool_timeout = 60     # seconds to wait for a free connection

# 2. Tune InnoDB (MySQL side)
innodb_lock_wait_timeout = 600   # default 50 s is too short under load
innodb_buffer_pool_size = 32G    # size to available RAM, keeps hot data cached
innodb_read_io_threads = 16      # more background IO threads
innodb_write_io_threads = 16

python
# 3. Batch the writes - change the conductor logic so per-row
#    instance.create() calls become one bulk INSERT.
def batch_create_instances(self, context, instances_batch):
    # A single multi-row INSERT cuts round trips and lock hold time
    insert_values = []
    for inst in instances_batch:
        insert_values.append({
            'uuid': inst.uuid,
            'vm_state': 'building',
            'task_state': 'scheduling',
            # ...
        })
    # One SQL statement inserts all rows
    db.instance_create_batch(context, insert_values)
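Note that db.instance_create_batch is not an existing Nova DB API. A standalone sketch of the same multi-row INSERT idea with SQLAlchemy Core, with the schema trimmed to a few illustrative columns:

python
from sqlalchemy import Column, MetaData, String, Table, create_engine, insert

metadata = MetaData()
instances = Table(
    'instances', metadata,
    Column('uuid', String(36), primary_key=True),
    Column('vm_state', String(255)),
    Column('task_state', String(255)),
)

def instance_create_batch(engine, insert_values):
    # One multi-row INSERT: a single statement, a single lock window
    with engine.begin() as conn:
        conn.execute(insert(instances), insert_values)

# Usage (SQLite stands in for the Galera cluster here)
engine = create_engine('sqlite:///:memory:')
metadata.create_all(engine)
instance_create_batch(engine, [
    {'uuid': 'uuid-1', 'vm_state': 'building', 'task_state': 'scheduling'},
    {'uuid': 'uuid-2', 'vm_state': 'building', 'task_state': 'scheduling'},
])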
Bottleneck 2: Ironic Node State-Machine Contention 🔥
Symptom: frequent provision-state conflicts and a high PXE deployment failure rate
Root-cause analysis:
python
# ironic/conductor/manager.py (simplified)
def do_node_deploy(self, context, node_id):
    node = self.dbapi.get_node_by_id(node_id)
    # State-machine check - under high concurrency, many nodes attempt
    # the same transition at once
    if node.provision_state != states.AVAILABLE:
        raise exception.InvalidState(
            _("Cannot deploy, node %(node)s is in %(state)s state") %
            {'node': node.uuid, 'state': node.provision_state})
    # Move to DEPLOYING - this UPDATE takes a row lock
    node.provision_state = states.DEPLOYING
    node.save()  # <- database UPDATE contention
Solutions:
ini
# 1. Increase Ironic Conductor worker counts (ironic.conf)
[conductor]
workers_pool_size = 200         # default 100 -> 200
sync_power_state_workers = 50   # workers for power-state sync

python
# 2. Change the database access pattern: optimistic locking instead of
#    pessimistic locks (ironic/db/sqlalchemy/api.py)
def update_node(self, node_id, values):
    # Original: SELECT ... FOR UPDATE -> pessimistic row lock
    # Optimized: a version column implements an optimistic lock
    with _session_for_write() as session:
        query = model_query(models.Node).filter_by(id=node_id)
        node = query.first()
        # Optimistic-lock check: fail fast on concurrent writers
        if 'version' in values and values['version'] != node.version:
            raise exception.NodeConcurrentUpdate(node=node.uuid)
        values['version'] = node.version + 1
        node.update(values)
        return node

# 3. Bulk pre-check of node states - fewer database round trips
def bulk_check_nodes_available(self, node_ids):
    # One SQL query for all nodes instead of one query per node
    return self.dbapi.get_nodes_by_id_list(
        node_ids,
        filters={'provision_state': states.AVAILABLE}
    )
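An optimistic lock only pays off when callers retry on conflict. A minimal CAS-style retry wrapper, reusing the hypothetical update_node and NodeConcurrentUpdate from the snippet above:

python
import random
import time

def update_node_with_retry(dbapi, node_id, values, max_retries=5):
    # Classic compare-and-swap loop: re-read, re-apply, retry on conflict
    for attempt in range(max_retries):
        node = dbapi.get_node_by_id(node_id)
        try:
            # Pass the version we read; update_node rejects stale versions
            return dbapi.update_node(node_id,
                                     dict(values, version=node.version))
        except exception.NodeConcurrentUpdate:
            # Another writer won the race; back off briefly, then retry
            time.sleep(random.uniform(0.05, 0.2) * (attempt + 1))
    raise exception.NodeConcurrentUpdate(node=node_id)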
Bottleneck 3: PXE Boot Network Storm 🔥
Symptom: at 100+ concurrent deploys, PXE DHCP address acquisition becomes extremely slow and TFTP times out
Technical details:
python
# Problem: all bare-metal nodes PXE-boot at once, flooding the network
# with DHCP Discover broadcasts; switch ports coming up together make
# the broadcast storm worse.

# Solution 1: stagger PXE boot with jitter
# (change in ironic/drivers/modules/pxe.py)
import random
import time

def _start_pxe_boot(self, task):
    node = task.node
    # A random delay spreads boot times and flattens the DHCP burst
    delay = random.uniform(0, 30)  # 0-30 s
    time.sleep(delay)
    # ...continue the PXE flow

conf
# Solution 2: DHCP pool tuning
# dnsmasq config, /etc/dnsmasq.d/ironic-dhcp
# Bigger pool, bounded leases, authoritative mode for faster allocation
dhcp-range=10.0.0.10,10.0.255.254,255.255.0.0,1h
dhcp-lease-max=5000
dhcp-authoritative

# Solution 3: TFTP concurrency via xinetd
# /etc/xinetd.d/tftp
service tftp
{
    socket_type = dgram
    protocol    = udp
    wait        = no     # no -> allow multiple server instances
    user        = root
    server      = /usr/sbin/in.tftpd
    server_args = -s /var/lib/ironic/tftpboot -c
    disable     = no
    per_source  = 11     # max connections per source IP
    cps         = 100 2  # connections/second limit
    flags       = IPv4
}
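Random jitter can still cluster boots by chance. A deterministic variant, sketched below (not upstream Ironic code), hashes the node UUID into a stable slot so that retries of the same node keep the same delay:

python
import zlib

def pxe_boot_delay(node_uuid, window_seconds=30):
    # Map the UUID to a stable offset inside the jitter window;
    # the same node always lands in the same slot.
    return (zlib.crc32(node_uuid.encode()) % (window_seconds * 1000)) / 1000.0

# e.g. pxe_boot_delay('5f2c...') always returns the same value for a node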
Bottleneck 4: Message Queue (RabbitMQ) Backlog 🔥
Symptom: RPC calls between Nova and Ironic time out; task state falls out of sync
RabbitMQ tuning:
erlang
%% /etc/rabbitmq/rabbitmq.config
[
  {rabbit, [
    %% Memory threshold before flow control kicks in
    {vm_memory_high_watermark, 0.8},
    {vm_memory_high_watermark_paging_ratio, 0.5},
    %% Free-disk threshold
    {disk_free_limit, "50GB"},
    %% Connection tuning
    {tcp_listen_options, [
      {backlog, 4096},          % TCP accept-queue length
      {nodelay, true},
      {linger, {true, 0}}
    ]},
    %% Heartbeat and timeouts
    {heartbeat, 600},           % heartbeat interval, seconds
    {handshake_timeout, 30000}
  ]},
  {rabbitmq_management, [
    {rates_mode, basic}
  ]}
].
OpenStack-side MQ configuration:
ini
# nova.conf / ironic.conf
[DEFAULT]
rpc_backend = rabbit
rabbit_hosts = controller1:5672,controller2:5672,controller3:5672  # cluster
rabbit_ha_queues = true      # mirrored queues for HA
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
rabbit_max_retries = 0       # retry forever
# Key change: raise RPC timeouts
rpc_response_timeout = 600   # default 60 s -> 600 s
rpc_cast_timeout = 600
# Consumer concurrency
rpc_thread_pool_size = 200   # default 64 -> 200
rpc_conn_pool_size = 100     # connection pool
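To tell a genuine backlog apart from slow consumers, it helps to watch queue depth directly. A minimal sketch using the pika client (host, credentials, and queue name are placeholders for your deployment):

python
import pika

def queue_depth(host, queue_name):
    # A passive declare returns the queue's current message count
    # without creating or modifying the queue.
    conn = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    try:
        channel = conn.channel()
        ok = channel.queue_declare(queue=queue_name, passive=True)
        return ok.method.message_count
    finally:
        conn.close()

# e.g. queue_depth('controller1', 'conductor') stays near 0 when
# consumers keep up; a steadily growing value means backlog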
Bottleneck 5: Glance Image Download Bandwidth
Symptom: 1000 nodes pulling images at once saturate the storage network and the Glance nodes
Solutions:
ini
# 1. Image pre-distribution (e.g. P2P/BitTorrent) + iPXE (ironic.conf)
[pxe]
# iPXE boots over HTTP; combined with direct attach (e.g. iSCSI) it
# avoids every node pulling the full image over TFTP
ipxe_enabled = true
ipxe_boot_script = /var/lib/ironic/httpboot/boot.ipxe

# 2. Ceph RBD backend - lean on Ceph's parallel read throughput
[glance]
glance_api_servers = https://glance.internal:9292
allowed_direct_url_schemes = rbd   # allow direct RBD access

# 3. Local image cache on compute nodes (nova.conf)
[image_cache]
manager = nova.image.cache.ImageCacheManager
remove_unused_base_images = false  # keep base images cached
images_path = /var/lib/nova/instances/_base
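The cache can also be warmed before a large batch lands. A sketch that streams an image into a node's cache directory with python-glanceclient (endpoint, token, and the destination filename convention are assumptions, not Nova's actual cache naming):

python
import os
from glanceclient import Client

def prewarm_image(endpoint, token, image_id, cache_dir):
    # Stream the image in chunks to the local cache directory so the
    # first real deploy does not pay the download cost.
    glance = Client('2', endpoint=endpoint, token=token)
    dest = os.path.join(cache_dir, image_id)  # naming convention assumed
    with open(dest, 'wb') as f:
        for chunk in glance.images.data(image_id):
            f.write(chunk)

# e.g. prewarm_image('https://glance.internal:9292', TOKEN,
#                    image_uuid, '/var/lib/nova/instances/_base')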
3. Summary of Key Source-Code Changes
Modified files
| File | Change | Effect |
|---|---|---|
| nova/conductor/manager.py | Batch instance creation | ~50% fewer DB round trips |
| nova/compute/manager.py | Retry and backoff policy | Better fault tolerance |
| ironic/conductor/manager.py | Optimistic locking instead of pessimistic | Eliminates DB deadlocks |
| ironic/db/sqlalchemy/api.py | Bulk query interface | Fewer DB connections held |
| ironic/drivers/modules/pxe.py | Boot jitter | Spreads out the PXE storm |
| nova/conf/database.py | Expose connection-pool options | Tunable in the field |
Key snippet: the batch-create optimization
python
# nova/conductor/manager.py - proposed batch-create method
def build_instances_batch(self, context, build_requests,
                          request_spec, filter_properties):
    """Create instances in bulk to reduce database round trips."""
    instances = []
    try:
        # 1. Pre-allocate UUIDs and names up front, outside the DB
        for request in build_requests:
            instance = objects.Instance(context=context)
            instance.uuid = uuidutils.generate_uuid()
            instance.name = self._generate_instance_name(request)
            instances.append(instance)
        # 2. One bulk INSERT instead of N single-row INSERTs
        #    (simplified: assumes a session attached to the context)
        with context.session.begin():
            context.session.bulk_insert_mappings(
                models.Instance,
                [inst.obj_to_primitive() for inst in instances]
            )
        # 3. Hand each instance to the Scheduler asynchronously
        for instance in instances:
            self.scheduler_client.select_destinations(
                context, request_spec, filter_properties,
                instance_uuids=[instance.uuid]
            )
    except Exception:
        # Roll back everything created so far
        with excutils.save_and_reraise_exception():
            for instance in instances:
                instance.destroy()
    return instances
4. Monitoring and Validation
Key metrics
python
# Prometheus + Grafana; custom exporter metrics
from prometheus_client import Gauge

IRONIC_DEPLOY_GAUGE = Gauge(
    'ironic_concurrent_deploys',
    'Number of concurrent baremetal deployments',
    ['conductor_host']
)
NOVA_BUILD_GAUGE = Gauge(
    'nova_concurrent_builds',
    'Number of concurrent instance builds',
    ['nova_host']
)

yaml
# Alerting rules
groups:
- name: baremetal_alerts
  rules:
  - alert: HighConcurrentDeploys
    expr: ironic_concurrent_deploys > 800
    for: 5m
    annotations:
      summary: "Ironic concurrent deploys too high: {{ $value }}"
  - alert: DBLockWaitHigh
    expr: mysql_global_status_innodb_row_lock_waits > 1000
    for: 2m
    annotations:
      summary: "MySQL row-lock waits too high"
Load-test script
python
#!/usr/bin/env python
# concurrent_deploy_test.py
import threading
import time

from keystoneauth1 import session as ks_session
from keystoneauth1.identity import v3
from novaclient import client as nova_client

# Site-specific auth (placeholders)
auth = v3.Password(auth_url='http://controller:5000/v3',
                   username='admin', password='secret',
                   project_name='admin',
                   user_domain_id='default', project_domain_id='default')
nova = nova_client.Client('2.1', session=ks_session.Session(auth=auth))

def deploy_baremetal(node_id):
    start = time.time()
    try:
        server = nova.servers.create(
            name=f"baremetal-{node_id}",
            image=None,  # deploy image comes from the Ironic flavor/driver
            flavor="baremetal-flavor",
            nics=[{"net-id": "pxe-network"}]
        )
        # Poll until ACTIVE; 30-minute timeout
        server = nova.servers.get(server.id)
        while server.status != 'ACTIVE':
            if time.time() - start > 1800:
                raise TimeoutError(f"Node {node_id} deploy timeout")
            time.sleep(10)
            server = nova.servers.get(server.id)
        return True, time.time() - start
    except Exception as e:
        return False, str(e)

# Concurrency test: 1000 threads
threads = []
results = []

def worker(i):
    results.append(deploy_baremetal(i))

for i in range(1000):
    t = threading.Thread(target=worker, args=(i,))  # bind i explicitly
    threads.append(t)
    t.start()
    time.sleep(0.1)  # 100 ms spacing, closer to real traffic
for t in threads:
    t.join()

# Success statistics
success = sum(1 for r in results if r[0])
print(f"Success rate: {success}/1000 = {success/10}%")
print(f"Average time: {sum(r[1] for r in results if r[0])/success:.2f}s")
5. Suggested Interview Narrative
Answer structure (the STAR method)
Situation: "On Ant Group's hybrid-cloud platform, we had to deliver bare-metal compute for large-scale AI training, with a requirement of 1000 bare-metal servers created in one concurrent batch; initial tests could only run 4 creates concurrently"
Task: "I owned performance tuning for the whole IStack platform, with one month to reach 1000 concurrent creates at a success rate above 99%"
Action: "I used a layered diagnosis approach:
- Monitoring layer: deployed Prometheus across the full pipeline first, which pointed to Nova Conductor database locks and Ironic state-machine contention
- Database layer: raised the MySQL connection pool to 200+, lengthened the InnoDB lock-wait timeout, and replaced per-row INSERTs with batch writes to cut DB round trips
- Application layer: patched Nova and Ironic to replace pessimistic locks with optimistic locking and added batch APIs
- Network layer: added random jitter to PXE boot to avoid DHCP storms, and switched TFTP to multi-instance mode
- Messaging layer: scaled out the RabbitMQ cluster, enabled mirrored queues, and raised the RPC timeout to 600 seconds"
Result: "We reached 1000 concurrent creates, cut the average time from 4 hours (serial) to 35 minutes at a 99.2% success rate, and carried the Double 11 compute demand"
Likely follow-up questions
| Follow-up | Prepared answer |
|---|---|
| "How did you pinpoint the database locks?" | "SHOW ENGINE INNODB STATUS showed large numbers of lock-wait timeouts, and performance_schema showed the instances table had the worst lock contention" |
| "How is the optimistic lock implemented?" | "Add a version column; each UPDATE checks whether version changed and retries if it did - CAS-style, so no locks are held for long" |
| "What if there are still bottlenecks?" | "Shard the database by node-ID hash across multiple instances, or add a Redis cache layer for node state to cut DB reads" |
| "How do you guarantee consistency?" | "Critical state changes use database transactions; non-critical state relies on message-queue eventual consistency with periodic reconciliation jobs" |