I. Single-Agent Path Planning (Q-learning)
1. MATLAB Implementation Steps
- Environment modeling:

```matlab
% Create a grid map (0: free cell, 1: obstacle)
gridSize = 20;
gridMap = zeros(gridSize);      % start with all cells free
gridMap(5:8, 10) = 1;           % place an obstacle wall
```

- Q-learning parameter setup:

```matlab
Q = zeros(gridSize^2, 4);       % one row per grid cell, 4 actions: up/down/left/right
alpha = 0.1;                    % learning rate
gamma = 0.9;                    % discount factor
epsilon = 0.1;                  % exploration rate
```

- Training loop (the helper functions `isGoal`, `moveAgent`, and `calculateReward` used below are sketched after this list):

```matlab
for episode = 1:1000
    state = [startRow, startCol];            % start position
    while ~isGoal(state)
        % Choose an action (epsilon-greedy policy)
        if rand < epsilon
            action = randi(4);               % random action
        else
            [~, action] = max(Q(sub2ind([gridSize gridSize], state(1), state(2)), :));
        end
        % Execute the action and observe the next state
        nextState = moveAgent(state, action);
        % Compute the reward
        reward = calculateReward(state, nextState, gridMap);
        % Q-value update
        sIdx  = sub2ind([gridSize gridSize], state(1), state(2));
        nsIdx = sub2ind([gridSize gridSize], nextState(1), nextState(2));
        Q(sIdx, action) = (1-alpha)*Q(sIdx, action) + ...
            alpha*(reward + gamma*max(Q(nsIdx, :)));
        state = nextState;
    end
end
```
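The loop above relies on three helper functions that this section does not define. Below is a minimal sketch of what they might look like; the reward values, the action encoding, and the use of global variables to share the goal and grid size are illustrative assumptions, not part of the original.

```matlab
% Shared via global variables purely to keep this sketch short;
% in practice these would be passed as arguments or stored in a class/struct.
global gridSize goalPos
gridSize = 20;  goalPos = [20, 20];

function tf = isGoal(state)
% True when the agent occupies the goal cell.
global goalPos
tf = isequal(state, goalPos);
end

function nextState = moveAgent(state, action)
% Apply one grid step (1: up, 2: down, 3: left, 4: right) and clamp to the map.
global gridSize
moves = [-1 0; 1 0; 0 -1; 0 1];
nextState = min(max(state + moves(action, :), 1), gridSize);
end

function reward = calculateReward(state, nextState, gridMap)
% Step penalty, strong penalty for entering an obstacle cell, bonus at the goal.
global goalPos
if gridMap(nextState(1), nextState(2)) == 1
    reward = -10;        % hit an obstacle
elseif isequal(nextState, goalPos)
    reward = 100;        % reached the goal
else
    reward = -1;         % small per-step cost to encourage short paths
end
end
```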
II. Multi-Agent Path Planning (DDPG)
1. Algorithm Principles
- DDPG (Deep Deterministic Policy Gradient): suited to continuous action spaces, built on an Actor-Critic structure:
  - Actor: the policy network, which outputs a deterministic action a = μ(s)
  - Critic: the Q-value network, which estimates Q(s, a)
- Reward design:
  - Cooperative reward: the agents jointly approach their goals (e.g., the total remaining distance decreases)
  - Conflict penalty: a negative reward is applied when two agents get too close (a sketch of such a reward function follows this list)
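A minimal sketch of a reward function combining these two terms; the function name, the weights, and the safety distance dSafe are illustrative assumptions, not part of the original:

```matlab
function r = multiAgentReward(positions, goals, prevPositions, dSafe)
% positions, goals, prevPositions: numAgents-by-2 matrices of (row, col) coordinates.

% Cooperative term: reward the decrease in total distance to the goals.
distNow  = sum(vecnorm(positions - goals, 2, 2));
distPrev = sum(vecnorm(prevPositions - goals, 2, 2));
r = distPrev - distNow;                      % positive when agents move closer overall

% Conflict term: penalize any pair of agents closer than dSafe.
numAgents = size(positions, 1);
for i = 1:numAgents-1
    for j = i+1:numAgents
        if norm(positions(i,:) - positions(j,:)) < dSafe
            r = r - 10;                      % fixed penalty per near-collision
        end
    end
end
end
```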
2. MATLAB Implementation Steps
- Environment modeling:

```matlab
function env = createMultiAgentEnv(gridMap, starts, goals)
    numAgents = size(starts, 1);
    gridSize  = size(gridMap, 1);
    % Joint observation: (row, col) of every agent, stacked into one vector
    stateInfo  = rlNumericSpec([numAgents*2 1], ...
        'LowerLimit', 1, 'UpperLimit', gridSize);
    % Joint action: an (x, y) velocity command per agent, each in [-1, 1]
    actionInfo = rlNumericSpec([numAgents*2 1], ...
        'LowerLimit', -1, 'UpperLimit', 1);
    env = rl.env.MATLABEnvironment('ObservationInfo', stateInfo, ...
        'ActionInfo', actionInfo);
    env.GridMap = gridMap;
    env.Starts  = starts;
    env.Goals   = goals;
end
```

Note: rl.env.MATLABEnvironment is normally subclassed (or a custom environment is built with rlFunctionEnv and step/reset functions) rather than constructed directly; the call above is kept as a sketch of the intended interface.
- DDPG agent construction (the layer arrays below still need to be wrapped into actor/critic objects; see the sketch after this list):

```matlab
% Actor network
actorNet = [
    featureInputLayer(4)               % input: 4-D state (coordinates of the 2 agents)
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(2)             % output: 2-D action (x, y velocity)
    tanhLayer];                        % squashes actions into [-1, 1]

% Critic network (state and action are concatenated, then mapped to a scalar Q-value)
criticNet = [
    concatenationLayer(1, 2, 'Name', 'concat')   % merges the state and action inputs
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(1)];

agentOpts = rlDDPGAgentOptions('SampleTime', 0.1, 'DiscountFactor', 0.99);
agent = rlDDPGAgent(actorNet, criticNet, agentOpts);
```
- Multi-agent training (a sketch of post-training evaluation and trajectory extraction also follows this list):

```matlab
% Joint training of the agent against the multi-agent environment
simOpts   = rlSimulationOptions('MaxSteps', 500);   % used later with sim() for evaluation
trainOpts = rlTrainingOptions('MaxEpisodes', 1000, 'Verbose', false);
trainingStats = train(agent, env, trainOpts);
```
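As written, the critic layer array has no input layers, and rlDDPGAgent expects actor and critic function objects rather than raw layer arrays. A minimal sketch of how the networks might be wrapped, assuming Reinforcement Learning Toolbox R2022a or later (newer releases may prefer dlnetwork objects). Note also that the environment above declares a 4-D joint action while the actor outputs a 2-D action; the sketch follows the 2-D actor, i.e., a single agent's velocity command.

```matlab
% Observation and action specs (2 agents x 2 coordinates; 2 velocity components)
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1], 'LowerLimit', -1, 'UpperLimit', 1);

% Critic: separate state and action input paths joined by a concatenation layer
statePath  = [featureInputLayer(4, 'Name', 'state')
              fullyConnectedLayer(64, 'Name', 'fcState')];
actionPath = featureInputLayer(2, 'Name', 'action');
commonPath = [concatenationLayer(1, 2, 'Name', 'concat')
              reluLayer('Name', 'relu')
              fullyConnectedLayer(1, 'Name', 'qValue')];

criticGraph = layerGraph(statePath);
criticGraph = addLayers(criticGraph, actionPath);
criticGraph = addLayers(criticGraph, commonPath);
criticGraph = connectLayers(criticGraph, 'fcState', 'concat/in1');
criticGraph = connectLayers(criticGraph, 'action',  'concat/in2');

% Wrap the networks into actor/critic objects, then build the agent
actor  = rlContinuousDeterministicActor(actorNet, obsInfo, actInfo);
critic = rlQValueFunction(criticGraph, obsInfo, actInfo, ...
    'ObservationInputNames', 'state', 'ActionInputNames', 'action');

agentOpts = rlDDPGAgentOptions('SampleTime', 0.1, 'DiscountFactor', 0.99);
agent = rlDDPGAgent(actor, critic, agentOpts);
```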
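The visualization code in Section IV relies on per-agent trajectories (agentTrajectory), which the original never constructs. A minimal sketch of how they might be obtained from an evaluation run; the observation field name obs1 and the per-agent reshaping are assumptions that depend on the observation spec name and on how the environment packs its joint observation vector:

```matlab
% Run one evaluation episode with the trained agent
experience = sim(env, agent, simOpts);

% Observations are returned as timeseries; extract them as a T-by-(numAgents*2) matrix.
obsData = squeeze(experience.Observation.obs1.Data)';   % assumes the spec is named 'obs1'

% Split the joint observation into one (x, y) trajectory per agent
numAgents = size(obsData, 2) / 2;
agentTrajectory = cell(numAgents, 1);
for i = 1:numAgents
    agentTrajectory{i} = obsData(:, 2*i-1:2*i);          % columns [x_i, y_i] for agent i
end
```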
III. Optimization
- State-space extension:
  - Single agent: include recent path history in the state (e.g., the last 5 steps of the trajectory); a small sketch follows this list
  - Multi-agent: use a joint state (all agent positions plus the goal points)
- Reward-function improvements:
  - Dynamic weight adjustment: adjust the reward weights according to the task phase
  - Sparse-reward handling: introduce intermediate (virtual) reward points
- Conflict-avoidance mechanisms:
  - Potential-field method: add a repulsive potential term to the reward function
  - Communication: agents share their local observations
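A minimal sketch of the history-augmented state for the single-agent case; the buffer length of 5 and the variable names are illustrative assumptions:

```matlab
% Keep the last 5 visited cells and append them to the current state.
histLen = 5;
history = repmat(state, histLen, 1);         % initialize the buffer with the start cell

% ... inside the training loop, after each transition:
history = [history(2:end, :); nextState];    % drop the oldest entry, append the newest
augmentedState = [state, reshape(history', 1, [])];   % 2 + 2*histLen features
```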
IV. MATLAB Code Examples (Simplified)
Complete single-agent Q-learning code
```matlab
% Parameters
gridSize = 10;
start = [1,1]; goal = [10,10];
Q = zeros(gridSize^2, 4);

% Training loop (the move helper is sketched after this block)
for ep = 1:500
    state = start;
    while ~isequal(state, goal)
        % Choose an action (epsilon-greedy, epsilon = 0.1)
        if rand < 0.1
            action = randi(4);
        else
            [~, action] = max(Q(sub2ind([gridSize gridSize], state(1), state(2)), :));
        end
        % Execute the action
        nextState = move(state, action);
        % Reward: -1 per step, +100 on reaching the goal
        reward = -1 + 100*isequal(nextState, goal);
        % Q-value update (alpha = 0.1, gamma = 0.9)
        sIdx  = sub2ind([gridSize gridSize], state(1), state(2));
        nsIdx = sub2ind([gridSize gridSize], nextState(1), nextState(2));
        Q(sIdx, action) = Q(sIdx, action) + ...
            0.1*(reward + 0.9*max(Q(nsIdx, :)) - Q(sIdx, action));
        state = nextState;
    end
end
```
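The loop above calls a move helper that the original does not define. A minimal sketch, analogous to the moveAgent sketch in Section I; the action encoding and the hard-coded grid size are assumptions:

```matlab
function nextState = move(state, action)
% Apply one grid step (1: up, 2: down, 3: left, 4: right) and clamp to the 10x10 map.
moves = [-1 0; 1 0; 0 -1; 0 1];
nextState = min(max(state + moves(action, :), 1), 10);
end
```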
Multi-agent DDPG visualization code
```matlab
% Training curve
figure;
plot(trainingStats.EpisodeReward);          % per-episode return reported by train()
xlabel('Episode'); ylabel('Total Reward');

% Path visualization (numAgents and agentTrajectory come from the
% post-training simulation sketch in Section II)
figure;
hold on;
plot(env.Goals(:,1), env.Goals(:,2), 'go'); % goal positions
for i = 1:numAgents
    plot(agentTrajectory{i}(:,1), agentTrajectory{i}(:,2), 'r-o');
end
axis equal;
```
Reference code: single-agent and multi-agent path planning algorithms based on reinforcement learning, www.youwenfan.com/contentcsq/80525.html