Wireless Communications Based on Deep Reinforcement Learning

Intelligent Radio Resource Management Mechanisms and Algorithms for Non-Terrestrial Networks

[1] Y. Cao, "Research on Intelligent Radio Resource Management Mechanisms and Algorithms for Non-Terrestrial Networks," Ph.D. dissertation, University of Electronic Science and Technology of China, 2023. DOI: 10.27005/d.cnki.gdzku.2023.000168.

Heterogeneous Cellular Networks: User Association and Channel Allocation

N. Zhao, Y. -C. Liang, D. Niyato, Y. Pei, M. Wu and Y. Jiang, "Deep Reinforcement Learning for User Association and Resource Allocation in Heterogeneous Cellular Networks," in IEEE Transactions on Wireless Communications, vol. 18, no. 11, pp. 5141-5152, Nov. 2019, doi: 10.1109/TWC.2019.2933417.

L BSs → N UEs

K orthogonal channels

The joint user association and resource allocation optimization problem

variables: discrete

1. $b_{l,i}(t)=1$: the $i$-th UE chooses to associate with BS $l$ at time $t$.

2. $c_{i,k}(t)=1$: the $i$-th UE utilizes channel $C_k$ at time $t$.

constraints

1.each UE can only choose at most one BS at any time

2.each UE can only choose at most one channel at any time

3. The SINR of the $i$-th UE must be no smaller than its minimum QoS threshold $\Omega_i$, i.e., $\Gamma_i(t) \geq \Omega_i$.

a stochastic game

state

$s_i(t) \in \{0, 1\}$

$s_i(t)=0$ means that the $i$-th UE cannot meet its minimum QoS requirement, that is, $\Gamma_i(t) < \Omega_i$.

The number of possible global states is $2^N$.

action

$a_i^{l,k}(t) = \{b_{l,i}(t), c_{i,k}(t)\}$, where

1. $b_{l,i}(t)=1$: the $i$-th UE chooses to associate with BS $l$ at time $t$;

2. $c_{i,k}(t)=1$: the $i$-th UE utilizes channel $C_k$ at time $t$.

The number of possible actions of each UE is $LK$ (L ways to choose a BS × K ways to choose a channel).
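
To make the $LK$ action space concrete, here is a minimal sketch (an illustration for these notes, not code from the paper) that maps a flat action index to a (BS, channel) pair and back:

```python
# Minimal sketch (not from the paper): enumerating the L*K joint
# user-association / channel-selection actions of one UE.

def decode_action(action_index: int, L: int, K: int):
    """Map a flat action index in [0, L*K) to (BS index, channel index)."""
    bs_index, channel_index = divmod(action_index, K)
    return bs_index, channel_index

def encode_action(bs_index: int, channel_index: int, K: int) -> int:
    """Inverse mapping: (BS index, channel index) -> flat action index."""
    return bs_index * K + channel_index

# Example: with L = 3 BSs and K = 4 channels there are 12 actions.
L, K = 3, 4
assert decode_action(encode_action(2, 1, K), L, K) == (2, 1)
```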

reward

The long-term reward $\Phi_i$ = the weighted sum of the instantaneous rewards over a finite period $T$.

The reward of the $i$-th UE = the $i$-th UE's utility − the action-selection cost $\Psi_i$.

$\Psi_i > 0$; note that the negative reward ($-\Psi_i$) acts as a punishment.

To guarantee the minimum QoS of all UEs, this punishment should be set large enough.

The $i$-th UE's utility = $\rho_i$ × (the total transmission capacity of the $i$-th UE) − (the total transmission cost associated with the $i$-th UE).
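
One possible reading of the utility and reward described above, as a rough Python sketch; the function names, the default values of $\rho_i$ and $\Psi_i$, and the condition under which $\Psi_i$ is charged are assumptions of these notes, and the exact pricing terms in the paper may differ:

```python
# One possible reading of the reward described above (a sketch for these
# notes, not the paper's exact definition): the utility is rho_i times the
# UE's transmission capacity minus its transmission cost, and the
# action-selection cost psi_i (> 0) is charged as a punishment when the
# minimum QoS is violated, so psi_i should be set large enough.

def ue_utility(capacity: float, transmission_cost: float, rho_i: float = 1.0) -> float:
    return rho_i * capacity - transmission_cost

def ue_reward(sinr: float, qos_threshold: float, capacity: float,
              transmission_cost: float, rho_i: float = 1.0,
              psi_i: float = 10.0) -> float:
    utility = ue_utility(capacity, transmission_cost, rho_i)
    if sinr >= qos_threshold:
        return utility
    return utility - psi_i   # the -psi_i term acts as the punishment
```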

Multi-Agent Q-Learning Method

At the beginning of each training episode, the network state is initialized through message passing.

  1. Each UE is connected to the neighboring BS with the maximum received signal power.
    By using a pilot signal, each UE can measure the received power from the associated BS and the randomly-selected channel.
  2. Then, each UE reports its own current state to its current associated BS.
    By the message passing among the BSs through the backhaul communication link, the global state information of all UEs is obtained.
  3. Then, the BSs send this global state information to all UEs.

Each episode ends when the QoS of all UEs is satisfied or when the maximum step T is reached.

The total episode reward is the accumulation of instantaneous rewards of all steps within an episode.

$$Q_i(s, a_i) = Q_i(s, a_i) + \delta \left[ u_i(s, a_i, \pi_{-i}) + \gamma \max_{a_i' \in \mathcal{A}_i} Q_i(s', a_i') - Q_i(s, a_i) \right]$$
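
A minimal tabular sketch of the per-UE Q-learning update above, with $\delta$ as the learning rate and $\gamma$ as the discount factor; the epsilon-greedy exploration and the integer state encoding are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the per-agent (per-UE) Q-learning update quoted above.
# delta is the learning rate, gamma the discount factor. The 2^N global
# QoS-indicator states are encoded as integers for brevity.

class UEQLearner:
    def __init__(self, num_states: int, num_actions: int,
                 delta: float = 0.1, gamma: float = 0.9, epsilon: float = 0.1):
        self.q = np.zeros((num_states, num_actions))
        self.delta, self.gamma, self.epsilon = delta, gamma, epsilon

    def select_action(self, s: int) -> int:
        """Epsilon-greedy selection over the L*K joint actions."""
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.q.shape[1])
        return int(np.argmax(self.q[s]))

    def update(self, s: int, a: int, reward: float, s_next: int) -> None:
        """Q_i(s,a) <- Q_i(s,a) + delta * [u_i + gamma * max_a' Q_i(s',a') - Q_i(s,a)]."""
        td_target = reward + self.gamma * np.max(self.q[s_next])
        self.q[s, a] += self.delta * (td_target - self.q[s, a])
```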

Multi-Agent dueling double DQN Algorithm

dueling double deep Q-network (D3QN)

A NN function approximator $Q_i(s, a_i; \theta) \approx Q_i^*(s, a_i)$ with weights $\theta$ is used as the online network.

The DQN utilizes a target network alongside the online network to stabilize the overall network performance.

experience replay

During learning, instead of using only the current experience $(s, a_i, u_i(s, a_i), s')$, the NN can be trained by sampling mini-batches of experiences from the replay memory $\mathcal{D}$ uniformly at random.

By reducing the correlation among training examples, the experience replay strategy helps prevent the learned policy from being driven into a local minimum.
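
A minimal sketch of the replay memory $\mathcal{D}$ with uniform sampling, as described above; the capacity and interface are illustrative assumptions:

```python
import random
from collections import deque

# Minimal sketch of the replay memory D: experiences (s, a_i, u_i, s')
# are stored and mini-batches are sampled uniformly at random, which
# reduces the correlation among consecutive training examples.

class ReplayMemory:
    def __init__(self, capacity: int = 10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```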

double DQN

Since the same values are used both to select and to evaluate an action in Q-learning and DQN, the Q-value function may be over-optimistically estimated.

Thus, double DQN (DDQN) [44] is used to mitigate this problem.

dueling architecture

The advantage function $A(s, a_i)$ describes the advantage of action $a_i$ compared with the other possible actions.

This dueling architecture can lead to better policy evaluation.

$$L_i(\theta) = \mathbb{E}_{s, a_i, u_i(s, a_i), s'}\left[ \left( y_i^{DDQN} - Q_i(s, a_i; \theta) \right)^2 \right],$$

where the target value is

$$y_i^{DDQN} = u_i(s, a_i) + \gamma\, Q_i\!\left( s', \arg\max_{a_i' \in \mathcal{A}_i} Q_i(s', a_i'; \theta);\, \theta^- \right).$$
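
A hedged PyTorch sketch of the dueling architecture combined with the double-DQN target $y_i^{DDQN}$ above; the layer sizes and the mean-subtracted dueling combination are standard choices, not taken from the paper:

```python
import torch
import torch.nn as nn

# Sketch of a dueling Q-network plus the double-DQN target above.
# Layer sizes are illustrative only; they are not taken from the paper.

class DuelingQNet(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                # V(s)
        self.advantage = nn.Linear(hidden, num_actions)  # A(s, a)

    def forward(self, s):
        h = self.feature(s)
        v, adv = self.value(h), self.advantage(h)
        # Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)  (standard dueling combination)
        return v + adv - adv.mean(dim=1, keepdim=True)

def ddqn_target(reward, next_state, online_net, target_net, gamma: float = 0.9):
    """y = u + gamma * Q_target(s', argmax_a' Q_online(s', a'))."""
    with torch.no_grad():
        best_action = online_net(next_state).argmax(dim=1, keepdim=True)
        next_q = target_net(next_state).gather(1, best_action).squeeze(1)
    return reward + gamma * next_q

# Tiny usage example with random inputs.
net, tgt = DuelingQNet(8, 12), DuelingQNet(8, 12)
y = ddqn_target(torch.rand(5), torch.randn(5, 8), net, tgt)
print(y.shape)   # torch.Size([5])
```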

Distributed Dynamic Downlink Beamforming

J. Ge, Y. -C. Liang, J. Joung and S. Sun, "Deep Reinforcement Learning for Distributed Dynamic MISO Downlink-Beamforming Coordination," in IEEE Transactions on Communications, vol. 68, no. 10, pp. 6070-6085, Oct. 2020, doi: 10.1109/TCOMM.2020.3004524.

Code for the beamforming paper

multi-cell MISO-IC model

a downlink cellular network of K cells

no intra-cell interference

All the BSs are equipped with a uniform linear array having $N$ ($N \geq 1$) antenna elements.

$$\max_{\mathbf{W}(t)} \; \sum_{k=1}^{K} C_k(\mathbf{W}(t)) \tag{8a}$$
$$\mathrm{s.t.} \;\; 0 \leq \|\mathbf{w}_k(t)\|^2 \leq p_{\max}, \quad \forall k \in \mathcal{K} \tag{8b}$$
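
To make the objective concrete, here is a numpy sketch that evaluates $\sum_k C_k(\mathbf{W}(t))$ for given channels and beamformers, assuming the standard MISO-IC rate $C_k = \log_2(1 + \mathrm{SINR}_k)$ with single-antenna UEs; the variable names and array layout are assumptions of these notes, not the paper's code:

```python
import numpy as np

# Sketch: evaluating the sum-rate objective (8a) for a K-cell MISO-IC.
# H[j, k] is the N-dim channel vector from BS j to UE k; W[:, k] is the
# beamformer of BS k. C_k = log2(1 + SINR_k) is assumed.

def sum_rate(H: np.ndarray, W: np.ndarray, noise_power: float) -> float:
    K = W.shape[1]
    rates = []
    for k in range(K):
        signal = np.abs(np.vdot(H[k, k], W[:, k])) ** 2
        interference = sum(np.abs(np.vdot(H[j, k], W[:, j])) ** 2
                           for j in range(K) if j != k)
        sinr = signal / (interference + noise_power)
        rates.append(np.log2(1.0 + sinr))
    return float(sum(rates))

# Example with K = 3 cells, N = 4 antennas, random channels/beamformers.
K, N, p_max = 3, 4, 1.0
rng = np.random.default_rng(0)
H = rng.standard_normal((K, K, N)) + 1j * rng.standard_normal((K, K, N))
W = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))
W *= np.sqrt(p_max) / np.linalg.norm(W, axis=0)   # enforce ||w_k||^2 <= p_max
print(sum_rate(H, W, noise_power=1e-2))
```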

$\mathbf{w}_k(t)$: the beamformer of BS $k$

$p_{\max}$: the available maximum transmit power budget of each BS

Limited-Information Exchange Protocol

a downlink data transmission framework

The first phase (phase 1) is a preparation phase for the subsequent data transmission;

the second phase (phase 2) is for the downlink data transmission.

In the centralized approaches, the cascaded procedure of collecting global CSI, computing beamformers, and sending the beamformers to the corresponding BSs is supposed to be carried out within phase 1.

Designed limited-information exchange protocol in time slot t.

BSs are able to share their historical measurements and other information with their interferers and interfered neighbors.

  1. The received interference power from interferer $j \in \mathcal{I}_k(t)$ in time slot $t-1$, i.e., $|\mathbf{h}_{j,k}^{\dagger}(t-1)\mathbf{w}_j(t-1)|^2$.
  2. The total interference-plus-noise power of UE $k$ in time slot $t-1$, i.e., $\sum_{l \neq k} |\mathbf{h}_{l,k}^{\dagger}(t-1)\mathbf{w}_l(t-1)|^2 + \sigma^2$.
  3. The achievable rate of direct link $k$ in time slot $t-1$, i.e., $C_k(\mathbf{W}(t-1))$.
  4. The equivalent channel gain of direct link $k$ in time slot $t-1$, i.e., $|\mathbf{h}_{k,k}^{\dagger}(t-1)\bar{\mathbf{w}}_k(t-1)|^2$.

Distributed DRL-Based DTDE Scheme for DDBC

distributed-training-distributed-executing (DTDE)

distributed dynamic downlink-beamforming coordination (DDBC)

each BS is an independent agent

a multi-agent reinforcement learning problem

Illustration of the proposed distributed DRL-based DTDE scheme in the considered multi-agent system.

  1. Actions
    $\mathcal{A} = \{(p, \mathbf{c}),\ p \in \mathcal{P},\ \mathbf{c} \in \mathcal{C}\}$, where
    $\mathcal{P} = \left\{0,\ \tfrac{1}{Q_{\mathrm{pow}}-1}p_{\max},\ \tfrac{2}{Q_{\mathrm{pow}}-1}p_{\max},\ \cdots,\ p_{\max}\right\}$ and
    $\mathcal{C} = \left\{\mathbf{c}_0,\ \mathbf{c}_1,\ \cdots,\ \mathbf{c}_{Q_{\mathrm{code}}-1}\right\}$.
    Each action specifies 1. the transmit power of BS $k$ in time slot $t$ and 2. the code $\mathbf{c}_k(t)$.
    The total number of available actions is $Q = Q_{\mathrm{pow}} Q_{\mathrm{code}}$ (a sketch that builds this action set follows the list).
  2. States
    1. Local information
    2. Interferers' information
    3. Interfered neighbors' information
  3. Reward
    The achievable rate of agent $k$, minus a penalty: the penalty on BS $k$ is defined as the sum of the achievable-rate losses of the interfered neighbors $j \in \mathcal{O}_k(t+1)$ that are interfered by BS $k$.
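
As referenced in the action item above, here is a sketch that builds the discrete action set $\mathcal{A} = \mathcal{P} \times \mathcal{C}$; the DFT codebook used for $\mathcal{C}$ is an assumption for illustration and may differ from the codebook in the paper:

```python
import numpy as np
from itertools import product

# Sketch of the discrete action set A = P x C used by each BS agent.
# P: Q_pow evenly spaced transmit-power levels in [0, p_max].
# C: Q_code candidate beam codewords; a DFT codebook for an N-antenna
#    ULA is assumed here purely for illustration.

def build_action_set(p_max: float, q_pow: int, q_code: int, n_antennas: int):
    power_levels = np.linspace(0.0, p_max, q_pow)                         # the set P
    n = np.arange(n_antennas)[:, None]
    q = np.arange(q_code)[None, :]
    codebook = np.exp(2j * np.pi * n * q / q_code) / np.sqrt(n_antennas)  # columns = set C
    actions = [(power_levels[ip], codebook[:, ic])
               for ip, ic in product(range(q_pow), range(q_code))]
    return actions                                                        # |A| = Q_pow * Q_code

def beamformer_from_action(action):
    """w_k = sqrt(p) * c, so that ||w_k||^2 = p <= p_max."""
    p, c = action
    return np.sqrt(p) * c

actions = build_action_set(p_max=1.0, q_pow=5, q_code=8, n_antennas=4)
print(len(actions))   # 40 = Q_pow * Q_code
```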

DQN Training in the Distributed DTDE Scheme

In training step $t$, the prediction error is

$$L(\boldsymbol{\theta}) = \frac{1}{2M_b} \sum_{\langle s, a, r, s' \rangle \in \mathcal{D}} \left( r' - q(s, a; \boldsymbol{\theta}) \right)^2,$$

where the target value of the reward is

$$r' = r + \gamma \max_{a'} q(s', a'; \boldsymbol{\theta}^-).$$

The optimizer returns a set of gradients, shown in (22), to update the weights of the trained DQN through the back-propagation (BP) technique:

$$\frac{\partial L(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \frac{1}{M_b} \sum_{\langle s, a, r, s' \rangle \in \mathcal{D}} \left( r' - q(s, a; \boldsymbol{\theta}) \right) \nabla q(s, a; \boldsymbol{\theta}). \tag{22}$$
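
A PyTorch sketch of one such training step: compute the target $r'$ with the target network $\boldsymbol{\theta}^-$, form the squared prediction error over a mini-batch of size $M_b$, and back-propagate; the network sizes and the Adam optimizer are assumptions of these notes, not taken from the paper:

```python
import torch
import torch.nn as nn

# Sketch of one DQN training step in the DTDE scheme: the target r' is
# computed with the target network (theta^-) and the online network
# (theta) is updated by back-propagating the squared prediction error.
# Network sizes and the optimizer choice are illustrative assumptions.

state_dim, num_actions, batch_size, gamma = 10, 40, 32, 0.9
online_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
target_net.load_state_dict(online_net.state_dict())
optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-3)

def train_step(s, a, r, s_next):
    """s, s_next: (M_b, state_dim); a: (M_b,) long; r: (M_b,) float."""
    with torch.no_grad():
        r_target = r + gamma * target_net(s_next).max(dim=1).values   # r'
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # q(s, a; theta)
    loss = 0.5 * (r_target - q_sa).pow(2).mean()                       # L(theta)
    optimizer.zero_grad()
    loss.backward()                                                    # gradients as in (22)
    optimizer.step()
    return loss.item()

# Example with random tensors standing in for a mini-batch sampled from D.
s = torch.randn(batch_size, state_dim)
a = torch.randint(0, num_actions, (batch_size,))
r = torch.rand(batch_size)
s_next = torch.randn(batch_size, state_dim)
print(train_step(s, a, r, s_next))
```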

Distributed Multi-User Access Control in Air-Based Networks: BS and UE Association

Y. Cao, S. -Y. Lien and Y. -C. Liang, "Deep Reinforcement Learning For Multi-User Access Control in Non-Terrestrial Networks," in IEEE Transactions on Communications, vol. 69, no. 3, pp. 1605-1619, March 2021, doi: 10.1109/TCOMM.2020.3041347.

K fixed-wing UAVs act as NT-BSs, providing downlink transmission services to M mobile UEs within a given area.

$c_{i,j}(t)$ denotes the transmission rate from the $j$-th NT-BS to the $i$-th UE in time slot $t$.

Variable: $u_{i,j}(t)$ indicates whether the $i$-th UE accesses the $j$-th NT-BS in time slot $t$.

Constraint: each UE can access only one NT-BS in a single time slot.

Distributed DRL-Based User-Driven Access Control Algorithm

Each UE acts as an independent agent: a DQN-based local access-decision module is built at the UE side, and each UE autonomously completes its NT-BS selection using only local observations.

State space

$s_i(t)$ denotes the state of the $i$-th UE in time slot $t$ (4K+1 elements):

1. The index of the NT-BS accessed by the $i$-th UE in time slot $t-1$ (K elements)

2. The RSS received by the $i$-th UE from each NT-BS in time slots $t-1$ and $t$ (2K elements)

3. The number of UEs accessing each NT-BS in time slot $t-1$ (K elements)

4. The transmission rate achieved by the $i$-th UE in time slot $t-1$ (1 element)
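
A numpy sketch that assembles the (4K+1)-element local state listed above; encoding the previously accessed NT-BS index as a K-element one-hot vector is an assumption consistent with "K elements", not a detail confirmed by the paper:

```python
import numpy as np

# Sketch: assembling the (4K+1)-element local state of UE i at slot t.
# Encoding the previously accessed NT-BS index as a K-element one-hot
# vector is an assumption consistent with the "K elements" above.

def build_state(prev_bs_index: int, rss_prev: np.ndarray, rss_curr: np.ndarray,
                load_prev: np.ndarray, rate_prev: float, K: int) -> np.ndarray:
    one_hot = np.zeros(K)
    one_hot[prev_bs_index] = 1.0                 # K elements: NT-BS accessed at t-1
    return np.concatenate([
        one_hot,
        rss_prev,                                # K elements: RSS from each NT-BS at t-1
        rss_curr,                                # K elements: RSS from each NT-BS at t
        load_prev,                               # K elements: number of UEs per NT-BS at t-1
        [rate_prev],                             # 1 element: rate achieved at t-1
    ])

K = 6
state = build_state(2, np.random.rand(K), np.random.rand(K),
                    np.random.randint(0, 40, K), 1.5, K)
print(state.shape)   # (4K + 1,) = (25,)
```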

Action space

$a_i(t)$ denotes the action of the $i$-th UE in time slot $t$ (K elements)

$u_{i,j}(t)$ indicates whether the $i$-th UE accesses the $j$-th NT-BS in time slot $t$

Reward function

The reward of the $i$-th UE in time slot $t$:

$r_i(t) = r_i^-(t) - \eta \varphi_i(t)$

$r_i^-(t)$: the local reward obtained by the UE

$\varphi_i(t)$: the impact of this UE on the sum rate of the other UEs accessing the same NT-BS

The local reward of the $i$-th UE is determined by its transmission rate in the current time slot and the corresponding handover cost:

$r_i^-(t) = \omega_i(t)$, if $a_i(t) = a_i(t-1)$;

$r_i^-(t) = \omega_i(t) - C$, if $a_i(t) \neq a_i(t-1)$.

The impact of the $i$-th UE's access decision on the other UEs in the set $O_j(t)$ is

$\varphi_i(t) = \sum_{k \in O_j(t)} \left[ \omega_k^{-i}(t) - \omega_k(t) \right],$

where $\omega_k^{-i}(t)$ is the transmission rate the $k$-th UE would achieve if the $i$-th UE did not access the $j$-th NT-BS, and $\omega_k(t)$ is the actual transmission rate of the $k$-th UE.
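
A short sketch of the reward $r_i(t) = r_i^-(t) - \eta\varphi_i(t)$ defined above, including the handover cost $C$ and the rate-loss externality $\varphi_i(t)$; the default $\eta$ and the dictionary-based bookkeeping are illustrative assumptions:

```python
# Sketch of the reward r_i(t) = r_i^-(t) - eta * phi_i(t) described above.

def local_reward(rate_i: float, action_t, action_prev, switch_cost: float) -> float:
    """r_i^-(t): the achieved rate, minus the handover cost C if the UE
    switched NT-BS between slots t-1 and t."""
    return rate_i if action_t == action_prev else rate_i - switch_cost

def externality(rates_without_i: dict, rates_actual: dict) -> float:
    """phi_i(t): sum over the other UEs k in O_j(t) of the rate they would
    achieve without UE i minus the rate they actually achieve."""
    return sum(rates_without_i[k] - rates_actual[k] for k in rates_actual)

def reward(rate_i, action_t, action_prev, switch_cost,
           rates_without_i, rates_actual, eta: float = 0.5) -> float:
    return (local_reward(rate_i, action_t, action_prev, switch_cost)
            - eta * externality(rates_without_i, rates_actual))

# Example: UE i stays on the same NT-BS and slightly degrades two neighbours.
print(reward(rate_i=2.0, action_t=1, action_prev=1, switch_cost=0.5,
             rates_without_i={3: 1.2, 4: 0.9}, rates_actual={3: 1.0, 4: 0.8}))
```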

Algorithm: DQN at the UE side

User-driven intelligent access control scheme

Baseline algorithms

  1. RSS algorithm: in each time slot, each UE accesses the NT-BS that provides the strongest RSS.
  2. Q-learning algorithm [122]: each UE uses a tabular function (i.e., a Q-table) to estimate the Q-value of every state-action pair, and in each time slot selects an NT-BS according to the established Q-table.
  3. UCB algorithm [123]: each UE determines its NT-BS selection in each time slot according to an upper-confidence-bound rule (the exact formula is given in [123]; a generic sketch follows the list below).
  4. Random algorithm: each UE randomly selects an NT-BS to access in each time slot.
  5. Search algorithm (optimal result): to obtain the optimal result, this algorithm assumes a centralized decision node that collects global network information in real time.
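
The UCB baseline follows the formula given in [123]; purely as an illustration (not necessarily the paper's exact rule), the standard UCB1 selection that such a baseline typically uses can be sketched as:

```python
import numpy as np

# Illustration only: the standard UCB1 selection rule. The exact formula
# used by the UCB baseline in the paper is the one given in [123].

def ucb1_select(mean_reward: np.ndarray, pull_counts: np.ndarray, t: int,
                beta: float = 2.0) -> int:
    """Pick the NT-BS maximizing empirical mean + exploration bonus."""
    untried = np.where(pull_counts == 0)[0]
    if untried.size > 0:
        return int(untried[0])                       # try each NT-BS once first
    bonus = np.sqrt(beta * np.log(t) / pull_counts)
    return int(np.argmax(mean_reward + bonus))
```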

M = 40 UEs and K = 6 NT-BSs

Multi-User Channel Allocation for LEO Satellites in Space-Based Networks: RB Allocation

Y. Cao, S. -Y. Lien and Y. -C. Liang, "Multi-tier Collaborative Deep Reinforcement Learning for Non-terrestrial Network Empowered Vehicular Connections," 2021 IEEE 29th International Conference on Network Protocols (ICNP), Dallas, TX, USA, 2021, pp. 1-6, doi: 10.1109/ICNP52444.2021.9651962.

Multi-beam LEO satellite downlink transmission system

Each LEO satellite uses multi-beam technology to form N beam cells on the ground with a frequency reuse factor of 1, and each beam cell contains M mobile UEs.

In time slot $t$, the downlink rate achievable by the $i$-th UE in the $j$-th cell on the $k$-th RB is denoted by $c_{i,j,k}(t)$.

In time slot $t$, the downlink rate achievable by the $i$-th UE in the $j$-th cell is the sum of $c_{i,j,k}(t)$ over the RBs allocated to it.

UE satisfaction: describes the deviation between the number of RBs that the LEO satellite allocates to a UE and the UE's actual RB demand.

Rate-satisfaction utility function, where $w_r + w_s = 1$.

The minimum rate-satisfaction utility among UEs, averaged from time slot 0 to time slot $T-1$.

Objective: maximize the minimum rate-satisfaction utility among UEs, averaged from time slot 0 to time slot $T-1$.

Variable: $x_{i,j,k}^t = 1$ if, in time slot $t$, the $i$-th UE in the $j$-th beam cell accesses the $k$-th RB.

Constraint: within a single time slot, each RB can be allocated to only one UE.

Algorithm: Twin Delayed Deep Deterministic Policy Gradient (TD3)
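
TD3 is a generic continuous-control actor-critic algorithm; as a hedged sketch (not the paper's implementation), its core target computation, namely the clipped double-Q target with twin target critics and target-policy smoothing, can be written as follows. The delayed actor updates and the mapping from TD3's continuous outputs to discrete RB allocation decisions are not covered by this sketch:

```python
import torch
import torch.nn as nn

# Hedged sketch of the core TD3 target computation (not the paper's code):
# twin target critics, target-policy smoothing noise, and the clipped
# double-Q target y = r + gamma * min(Q1', Q2')(s', a~).

def td3_target(reward, next_state, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    with torch.no_grad():
        noise = (torch.randn_like(target_actor(next_state)) * noise_std
                 ).clamp(-noise_clip, noise_clip)
        next_action = (target_actor(next_state) + noise).clamp(-max_action, max_action)
        q1 = target_q1(torch.cat([next_state, next_action], dim=1))
        q2 = target_q2(torch.cat([next_state, next_action], dim=1))
        return reward.unsqueeze(1) + gamma * torch.min(q1, q2)

# Tiny example with random networks standing in for the target actor/critics.
state_dim, action_dim, batch = 8, 3, 16
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, action_dim), nn.Tanh())
q1 = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(), nn.Linear(32, 1))
q2 = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(), nn.Linear(32, 1))
y = td3_target(torch.rand(batch), torch.randn(batch, state_dim), actor, q1, q2)
print(y.shape)   # torch.Size([16, 1])
```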

LEO Earth-Fixed Cell Design in Space-Based Networks: Beam Angles, RB Allocation, and Pre-Configured RB Subsets

Y. Cao, S. -Y. Lien, Y. -C. Liang, D. Niyato and X. S. Shen, "Collaborative Deep Reinforcement Learning for Resource Optimization in Non-Terrestrial Networks," 2023 IEEE 34th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Toronto, ON, Canada, 2023, pp. 1-7, doi: 10.1109/PIMRC56721.2023.10294047.

LEO satellite earth-fixed cell scheme under a multi-timescale resource configuration mechanism

Objective: minimize the number of RBs used by the UE from time slot 0 to time slot $N-1$.

  1. LEO transmit beam angle
  2. UE receive beam angle
  3. RB allocation: $b_{n,m} = 1$ when the $m$-th RB is allocated to the UE in the $n$-th time slot.
  4. Pre-configured RB subset: $\mathcal{M}_n^{\mathrm{LEO}}$ denotes the pre-configured RB subset selected from the RB set $\mathcal{M}$ in time slot $n$.

Constraint: the UE's rate demand $D_n^{\mathrm{UE}}$ in each time slot.

Multi-Dimensional Resource Optimization in Integrated Space-Air Networks: Satellite Spectrum Allocation, UAV Trajectories, and UAV-UE Association

A multi-tier NTN consisting of a multi-beam GEO satellite and UAV relays

A single GEO satellite forms C non-overlapping beam cells to provide wireless coverage for N terrestrial UEs.

The frequency-domain resources of the GEO satellite are divided into K equal-width, non-overlapping spectrum blocks, and the time domain is divided into time slots of equal length.

M UAV relays

Number of UEs in each beam cell: $N_j$

The transmission rate from the GEO satellite to the $j$-th UAV relay in time slot $t$

The rate received by the $i$-th UE from the $j$-th UAV relay in time slot $t$

Handover cost $\Gamma_{i,j}(t)$

Variables:

  1. Spectrum allocation of the GEO satellite: $W_{j,t}$ denotes the number of spectrum blocks allocated by the GEO satellite to the $j$-th UAV relay (beam cell) in time slot $t$.
  2. UAV deployment trajectory: $p_j^a(t)$ denotes the 3D coordinates of the $j$-th UAV relay.
  3. UE access decision: $u_{i,j}^t = 1$ indicates that the $i$-th UE accesses the $j$-th UAV relay in time slot $t$.

Constraints:

  1. Each UE is allowed to access only one UAV relay in each time slot.
  2. The K spectrum blocks are of equal width and non-overlapping.