I. Single-Agent Path Planning (Q-learning)
1. MATLAB Implementation Steps
- Environment modeling:

```matlab
% Create a grid map (0: free cell, 1: obstacle)
gridSize = 20;
gridMap = zeros(gridSize);      % start with all cells free
gridMap(5:8, 10) = 1;           % place an obstacle wall
```

- Q-learning parameter setup:

```matlab
Q = zeros(gridSize^2, 4);       % one row per grid cell, 4 actions: up/down/left/right
alpha = 0.1;                    % learning rate
gamma = 0.9;                    % discount factor
epsilon = 0.1;                  % exploration rate
```

- Training loop (the helper functions `isGoal`, `moveAgent`, and `calculateReward` used below are sketched after this list):

```matlab
for episode = 1:1000
    state = [startRow, startCol];            % start position
    while ~isGoal(state)
        % Choose an action (epsilon-greedy policy)
        if rand < epsilon
            action = randi(4);               % random action
        else
            [~, action] = max(Q(sub2ind([gridSize gridSize], state(1), state(2)), :));
        end
        % Execute the action and observe the next state
        nextState = moveAgent(state, action);
        % Compute the reward
        reward = calculateReward(state, nextState, gridMap);
        % Q-value update
        sIdx  = sub2ind([gridSize gridSize], state(1), state(2));
        nsIdx = sub2ind([gridSize gridSize], nextState(1), nextState(2));
        Q(sIdx, action) = (1-alpha)*Q(sIdx, action) + ...
            alpha*(reward + gamma*max(Q(nsIdx, :)));
        state = nextState;
    end
end
```
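The loop above relies on three helper functions that this section does not define. Below is a minimal sketch of what they might look like; the reward values, the action encoding, and the use of global variables to share the goal and grid size are illustrative assumptions, not part of the original.

```matlab
% Shared via global variables purely to keep this sketch short;
% in practice these would be passed as arguments or stored in a class/struct.
global gridSize goalPos
gridSize = 20;  goalPos = [20, 20];

function tf = isGoal(state)
% True when the agent occupies the goal cell.
global goalPos
tf = isequal(state, goalPos);
end

function nextState = moveAgent(state, action)
% Apply one grid step (1: up, 2: down, 3: left, 4: right) and clamp to the map.
global gridSize
moves = [-1 0; 1 0; 0 -1; 0 1];
nextState = min(max(state + moves(action, :), 1), gridSize);
end

function reward = calculateReward(state, nextState, gridMap)
% Step penalty, strong penalty for entering an obstacle cell, bonus at the goal.
global goalPos
if gridMap(nextState(1), nextState(2)) == 1
    reward = -10;        % hit an obstacle
elseif isequal(nextState, goalPos)
    reward = 100;        % reached the goal
else
    reward = -1;         % small per-step cost to encourage short paths
end
end
```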
II. Multi-Agent Path Planning (DDPG)
1. Algorithm Principles
- DDPG (Deep Deterministic Policy Gradient): suited to continuous action spaces, built on an Actor-Critic structure:
  - Actor: the policy network, which outputs a deterministic action a = μ(s)
  - Critic: the Q-value network, which estimates Q(s, a)
- Reward design:
  - Cooperative reward: the agents jointly approach their goals (e.g., the total remaining distance decreases)
  - Conflict penalty: a negative reward is applied when two agents get too close (a sketch of such a reward function follows this list)
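A minimal sketch of a reward function combining these two terms; the function name, the weights, and the safety distance dSafe are illustrative assumptions, not part of the original:

```matlab
function r = multiAgentReward(positions, goals, prevPositions, dSafe)
% positions, goals, prevPositions: numAgents-by-2 matrices of (row, col) coordinates.

% Cooperative term: reward the decrease in total distance to the goals.
distNow  = sum(vecnorm(positions - goals, 2, 2));
distPrev = sum(vecnorm(prevPositions - goals, 2, 2));
r = distPrev - distNow;                      % positive when agents move closer overall

% Conflict term: penalize any pair of agents closer than dSafe.
numAgents = size(positions, 1);
for i = 1:numAgents-1
    for j = i+1:numAgents
        if norm(positions(i,:) - positions(j,:)) < dSafe
            r = r - 10;                      % fixed penalty per near-collision
        end
    end
end
end
```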
2. MATLAB Implementation Steps
- Environment modeling:

```matlab
function env = createMultiAgentEnv(gridMap, starts, goals)
    numAgents = size(starts, 1);
    gridSize  = size(gridMap, 1);
    % Joint observation: (row, col) of every agent, stacked into one vector
    stateInfo  = rlNumericSpec([numAgents*2 1], ...
        'LowerLimit', 1, 'UpperLimit', gridSize);
    % Joint action: an (x, y) velocity command per agent, each in [-1, 1]
    actionInfo = rlNumericSpec([numAgents*2 1], ...
        'LowerLimit', -1, 'UpperLimit', 1);
    env = rl.env.MATLABEnvironment('ObservationInfo', stateInfo, ...
        'ActionInfo', actionInfo);
    env.GridMap = gridMap;
    env.Starts  = starts;
    env.Goals   = goals;
end
```

Note: rl.env.MATLABEnvironment is normally subclassed (or a custom environment is built with rlFunctionEnv and step/reset functions) rather than constructed directly; the call above is kept as a sketch of the intended interface.
- DDPG agent construction (the layer arrays below still need to be wrapped into actor/critic objects; see the sketch after this list):

```matlab
% Actor network
actorNet = [
    featureInputLayer(4)               % input: 4-D state (coordinates of the 2 agents)
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(2)             % output: 2-D action (x, y velocity)
    tanhLayer];                        % squashes actions into [-1, 1]

% Critic network (state and action are concatenated, then mapped to a scalar Q-value)
criticNet = [
    concatenationLayer(1, 2, 'Name', 'concat')   % merges the state and action inputs
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(1)];

agentOpts = rlDDPGAgentOptions('SampleTime', 0.1, 'DiscountFactor', 0.99);
agent = rlDDPGAgent(actorNet, criticNet, agentOpts);
```
- Multi-agent training (a sketch of post-training evaluation and trajectory extraction also follows this list):

```matlab
% Joint training of the agent against the multi-agent environment
simOpts   = rlSimulationOptions('MaxSteps', 500);   % used later with sim() for evaluation
trainOpts = rlTrainingOptions('MaxEpisodes', 1000, 'Verbose', false);
trainingStats = train(agent, env, trainOpts);
```
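As written, the critic layer array has no input layers, and rlDDPGAgent expects actor and critic function objects rather than raw layer arrays. A minimal sketch of how the networks might be wrapped, assuming Reinforcement Learning Toolbox R2022a or later (newer releases may prefer dlnetwork objects). Note also that the environment above declares a 4-D joint action while the actor outputs a 2-D action; the sketch follows the 2-D actor, i.e., a single agent's velocity command.

```matlab
% Observation and action specs (2 agents x 2 coordinates; 2 velocity components)
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1], 'LowerLimit', -1, 'UpperLimit', 1);

% Critic: separate state and action input paths joined by a concatenation layer
statePath  = [featureInputLayer(4, 'Name', 'state')
              fullyConnectedLayer(64, 'Name', 'fcState')];
actionPath = featureInputLayer(2, 'Name', 'action');
commonPath = [concatenationLayer(1, 2, 'Name', 'concat')
              reluLayer('Name', 'relu')
              fullyConnectedLayer(1, 'Name', 'qValue')];

criticGraph = layerGraph(statePath);
criticGraph = addLayers(criticGraph, actionPath);
criticGraph = addLayers(criticGraph, commonPath);
criticGraph = connectLayers(criticGraph, 'fcState', 'concat/in1');
criticGraph = connectLayers(criticGraph, 'action',  'concat/in2');

% Wrap the networks into actor/critic objects, then build the agent
actor  = rlContinuousDeterministicActor(actorNet, obsInfo, actInfo);
critic = rlQValueFunction(criticGraph, obsInfo, actInfo, ...
    'ObservationInputNames', 'state', 'ActionInputNames', 'action');

agentOpts = rlDDPGAgentOptions('SampleTime', 0.1, 'DiscountFactor', 0.99);
agent = rlDDPGAgent(actor, critic, agentOpts);
```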
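The visualization code in Section IV relies on per-agent trajectories (agentTrajectory), which the original never constructs. A minimal sketch of how they might be obtained from an evaluation run; the observation field name obs1 and the per-agent reshaping are assumptions that depend on the observation spec name and on how the environment packs its joint observation vector:

```matlab
% Run one evaluation episode with the trained agent
experience = sim(env, agent, simOpts);

% Observations are returned as timeseries; extract them as a T-by-(numAgents*2) matrix.
obsData = squeeze(experience.Observation.obs1.Data)';   % assumes the spec is named 'obs1'

% Split the joint observation into one (x, y) trajectory per agent
numAgents = size(obsData, 2) / 2;
agentTrajectory = cell(numAgents, 1);
for i = 1:numAgents
    agentTrajectory{i} = obsData(:, 2*i-1:2*i);          % columns [x_i, y_i] for agent i
end
```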
III. Optimization
- State-space extension:
  - Single agent: include recent path history in the state (e.g., the last 5 steps of the trajectory); a small sketch follows this list
  - Multi-agent: use a joint state (all agent positions plus the goal points)
- Reward-function improvements:
  - Dynamic weight adjustment: adjust the reward weights according to the task phase
  - Sparse-reward handling: introduce intermediate (virtual) reward points
- Conflict-avoidance mechanisms:
  - Potential-field method: add a repulsive potential term to the reward function
  - Communication: agents share their local observations
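A minimal sketch of the history-augmented state for the single-agent case; the buffer length of 5 and the variable names are illustrative assumptions:

```matlab
% Keep the last 5 visited cells and append them to the current state.
histLen = 5;
history = repmat(state, histLen, 1);         % initialize the buffer with the start cell

% ... inside the training loop, after each transition:
history = [history(2:end, :); nextState];    % drop the oldest entry, append the newest
augmentedState = [state, reshape(history', 1, [])];   % 2 + 2*histLen features
```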
IV. MATLAB Code Examples (Simplified)
Complete single-agent Q-learning code
```matlab
% Parameters
gridSize = 10;
start = [1,1]; goal = [10,10];
Q = zeros(gridSize^2, 4);

% Training loop (the move helper is sketched after this block)
for ep = 1:500
    state = start;
    while ~isequal(state, goal)
        % Choose an action (epsilon-greedy, epsilon = 0.1)
        if rand < 0.1
            action = randi(4);
        else
            [~, action] = max(Q(sub2ind([gridSize gridSize], state(1), state(2)), :));
        end
        % Execute the action
        nextState = move(state, action);
        % Reward: -1 per step, +100 on reaching the goal
        reward = -1 + 100*isequal(nextState, goal);
        % Q-value update (alpha = 0.1, gamma = 0.9)
        sIdx  = sub2ind([gridSize gridSize], state(1), state(2));
        nsIdx = sub2ind([gridSize gridSize], nextState(1), nextState(2));
        Q(sIdx, action) = Q(sIdx, action) + ...
            0.1*(reward + 0.9*max(Q(nsIdx, :)) - Q(sIdx, action));
        state = nextState;
    end
end
```
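The loop above calls a move helper that the original does not define. A minimal sketch, analogous to the moveAgent sketch in Section I; the action encoding and the hard-coded grid size are assumptions:

```matlab
function nextState = move(state, action)
% Apply one grid step (1: up, 2: down, 3: left, 4: right) and clamp to the 10x10 map.
moves = [-1 0; 1 0; 0 -1; 0 1];
nextState = min(max(state + moves(action, :), 1), 10);
end
```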
Multi-agent DDPG visualization code
```matlab
% Training curve
figure;
plot(trainingStats.EpisodeReward);          % per-episode return reported by train()
xlabel('Episode'); ylabel('Total Reward');

% Path visualization (numAgents and agentTrajectory come from the
% post-training simulation sketch in Section II)
figure;
hold on;
plot(env.Goals(:,1), env.Goals(:,2), 'go'); % goal positions
for i = 1:numAgents
    plot(agentTrajectory{i}(:,1), agentTrajectory{i}(:,2), 'r-o');
end
axis equal;
```
Reference code: single-agent and multi-agent path planning algorithms based on reinforcement learning, www.youwenfan.com/contentcsq/80525.html