


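[1] Cao Yang. Research on Intelligent Radio Resource Management Mechanisms and Algorithms for Non-Terrestrial Networks [D]. University of Electronic Science and Technology of China, 2023. DOI: 10.27005/d.cnki.gdzku.2023.000168.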


N. Zhao, Y. -C. Liang, D. Niyato, Y. Pei, M. Wu and Y. Jiang, "Deep Reinforcement Learning for User Association and Resource Allocation in Heterogeneous Cellular Networks," in IEEE Transactions on Wireless Communications, vol. 18, no. 11, pp. 5141-5152, Nov. 2019, doi: 10.1109/TWC.2019.2933417.

L BSs → N UEs

K orthogonal channels

the Joint user association and resource allocation Optimization Problem

variables: discrete

1. bli(t)=1: the ith UE chooses to associate with BS l at time t

2. cik(t)=1: the ith UE utilizes channel Ck at time t


1.each UE can only choose at most one BS at any time

2.each UE can only choose at most one channel at any time

3. the SINR of the ith UE ≥ its minimum QoS requirement, that is, Γi(t) ≥ Ωi

a stochastic game


si(t) ∈{0, 1}

si(t)=0 means that the ith UE cannot meet its minimum QoS requirement, that is, Γi(t) < Ωi

the number of possible states is 2^N



1. bli(t)=1: the ith UE chooses to associate with BS l at time t

2. cik(t)=1: the ith UE utilizes channel Ck at time t

The number of possible actions of each UE is LK = (L BS choices) × (K channel choices)
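
The state and action counts above can be checked with a small enumeration. This is a minimal sketch: the function names and the sizes `L_BS`, `K_CH`, `N_UE` are illustrative assumptions, not values from the paper.

```python
from itertools import product

def build_action_space(num_bs, num_channels):
    """Enumerate every (BS index l, channel index k) pair one UE can pick."""
    return list(product(range(num_bs), range(num_channels)))

def encode_action(l, k, num_channels):
    """Flatten (l, k) into a single index in [0, L*K)."""
    return l * num_channels + k

def decode_action(index, num_channels):
    """Inverse of encode_action: recover (l, k) from the flat index."""
    return divmod(index, num_channels)

# Illustrative sizes (assumptions, not from the paper).
L_BS, K_CH, N_UE = 3, 4, 5
actions = build_action_space(L_BS, K_CH)   # L*K actions per UE
num_states = 2 ** N_UE                     # one QoS bit s_i(t) per UE
```

A flat index is convenient because a Q-table row or a DQN output layer then needs exactly L*K entries per UE.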


the long-term reward Φi = the weighted sum of the instantaneous rewards over a finite period T

the reward of the ith UE = the ith UE's utility - the action-selection cost Ψi

Ψi > 0. Note that the negative reward (−Ψi) acts as a punishment.

to guarantee the minimum QoS of all UEs, the magnitude of this negative reward should be set large enough.

the ith UE's utility = \rho_i * the total transmission capacity of the ith UE - the total transmission cost associated with the ith UE
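
The reward definitions above can be written out as a short sketch. It assumes that the weights of the long-term reward are powers of a discount factor `gamma`; the notes only say "weighted sum", so that choice and all function names are assumptions.

```python
def utility(rho, capacity, tx_cost):
    """Utility of one UE: rho_i * total capacity - total transmission cost."""
    return rho * capacity - tx_cost

def instantaneous_reward(rho, capacity, tx_cost, psi):
    """Reward of one UE: utility minus the action-selection cost psi_i > 0."""
    return utility(rho, capacity, tx_cost) - psi

def long_term_reward(rewards, gamma):
    """Weighted sum of instantaneous rewards over a finite period T.

    Assumption: the weight of step t is gamma**t.
    """
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```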

Multi-Agent Q-Learning Method

At the beginning of each training episode, the network state is initialized through message passing.

  1. Each UE is connected to the neighboring BS with the maximum received signal power.
    By using a pilot signal, each UE can measure the received power from the associated BS and the randomly-selected channel.
  2. Then, each UE reports its own current state to its current associated BS.
    By the message passing among the BSs through the backhaul communication link, the global state information of all UEs is obtained.
  3. Then, the BSs send this global state information to all UEs.

Each episode ends when the QoS of all UEs is satisfied or when the maximum step T is reached.

The total episode reward is the accumulation of instantaneous rewards of all steps within an episode.

$$Q_i(s,a_i) = Q_i(s,a_i) + \delta \left[ u_i(s,a_i,\pi_{-i}) + \gamma \max_{a_i' \in \mathcal{A}_i} Q_i(s',a_i') - Q_i(s,a_i) \right]$$
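
A minimal tabular sketch of this update for a single UE, assuming states and actions are already encoded as integer indices. The table size and the values of `delta` and `gamma` are illustrative.

```python
import numpy as np

def q_update(q_table, s, a, reward, s_next, delta=0.1, gamma=0.9):
    """Apply one Q-learning step to q_table[s, a] in place.

    delta is the learning rate and gamma the discount factor.
    """
    td_target = reward + gamma * np.max(q_table[s_next])
    q_table[s, a] += delta * (td_target - q_table[s, a])
    return q_table

# Toy table: 4 states x 3 actions, all zeros.
q = np.zeros((4, 3))
q_update(q, s=0, a=1, reward=1.0, s_next=2)
```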

Multi-Agent dueling double DQN Algorithm

dueling double deep Q-network (D3QN)

A NN function approximator $Q_i(s,a_i;\theta) \approx Q_i^{*}(s,a_i)$ with weights θ is used as an online network.

The DQN utilizes a target network alongside the online network to stabilize the overall network performance.

experience replay

During learning, instead of using only the current experience (s, ai,ui(s, ai),s′), the NN can be trained through sampling mini-batches of experiences from replay memory D uniformly at random.

By reducing the correlation among the training examples, the experience replay strategy helps prevent the learned policy from being driven into a local minimum.
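
A minimal sketch of a replay memory with uniform sampling. The class name and its interface are assumptions made for illustration.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size memory D of experiences (s, a_i, u_i, s')."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest entries are dropped first
        self.rng = random.Random(seed)

    def push(self, s, a, u, s_next):
        self.buffer.append((s, a, u, s_next))

    def sample(self, batch_size):
        """Draw a mini-batch uniformly at random, without replacement."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling from the whole memory rather than the latest step is what breaks the temporal correlation between consecutive training examples.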

double DQN

Since the same values are used to select and evaluate an action in Q-learning and DQN methods, the Q-value function may be over-optimistically estimated.

Thus, double DQN (DDQN) [44] is used to mitigate this overestimation.

dueling architecture

The advantage function A(s, ai) describes the advantage of the action ai compared with the other possible actions.

This dueling architecture can lead to better policy evaluation.
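
A sketch of how a dueling head can combine a state value with the advantages. It uses the widely used mean-subtracted aggregation; the notes do not state which aggregation the paper adopts, so treat that choice as an assumption.

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean(axis=-1, keepdims=True)
```

Subtracting the mean advantage fixes the otherwise arbitrary split between the value and advantage streams.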

$$L_i(\theta) = E_{s,a_i,u_i(s,a_i),s'}\left[ \left( y_i^{DQN} - Q_i(s,a_i;\theta) \right)^2 \right]$$

$$y_i^{DDQN} = u_i(s,a_i) + \gamma\, Q_i\left( s', \arg\max_{a_i' \in \mathcal{A}_i} Q_i(s',a_i';\theta);\ \theta^{-} \right)$$
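
The difference between the double-DQN target and a plain DQN target can be seen in a few lines. In this sketch the Q-values at the next state are passed in as arrays, and the plain-DQN form shown for comparison is the standard one rather than something quoted from the notes.

```python
import numpy as np

def ddqn_target(reward, q_online_next, q_target_next, gamma=0.9):
    """Select the action with the online network (theta),
    evaluate it with the target network (theta^-)."""
    a_star = int(np.argmax(q_online_next))
    return reward + gamma * q_target_next[a_star]

def dqn_target(reward, q_target_next, gamma=0.9):
    """Standard DQN target: the target network both selects and evaluates."""
    return reward + gamma * np.max(q_target_next)

# Toy Q-values at s' for three actions.
q_online = np.array([1.0, 3.0, 2.0])
q_target = np.array([0.5, 1.0, 4.0])
y_ddqn = ddqn_target(1.0, q_online, q_target)
y_dqn = dqn_target(1.0, q_target)
```

In this toy case the double-DQN target is smaller than the plain one, which is the overestimation-damping effect described above.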


J. Ge, Y. -C. Liang, J. Joung and S. Sun, "Deep Reinforcement Learning for Distributed Dynamic MISO Downlink-Beamforming Coordination," in IEEE Transactions on Communications, vol. 68, no. 10, pp. 6070-6085, Oct. 2020, doi: 10.1109/TCOMM.2020.3004524.


multi-cell MISO-IC model

a downlink cellular network of K cells

no intra-cell interference

all the BSs are equipped with a uniform linear array having N (N ≥ 1) antenna elements.

$$\max_{\mathbf{W}(t)} \ \sum_{k=1}^{K} C_k(\mathbf{W}(t)) \tag{8a}$$
$$\text{s.t.}\quad 0 \le \left\| \mathbf{w}_k(t) \right\|^2 \le p_{\max}, \quad \forall k \in \mathcal{K} \tag{8b}$$

the beamformer of BS k

the available maximum transmit power budget of each BS

Limited-Information Exchange Protocol

a downlink data transmission framework

The first phase (phase 1) is a preparing phase for the subsequent data transmission

the second phase (phase 2) is for the downlink data transmission.

in the centralized approaches, the cascade procedure of collecting global CSI, computing beamformers, and sending beamformers to the corresponding BSs is supposed to be carried out within phase 1.

Designed limited-information exchange protocol in time slot t.

BSs are able to share their historical measurements and other information with their interferers and interfered neighbors.

  1. The received interference power from interferer $j \in I_k(t)$ in time slot $t-1$, i.e., $\left| \mathbf{h}_{j,k}^{\dagger}(t-1)\, \mathbf{w}_j(t-1) \right|^2$.
  2. The total interference-plus-noise power of UE $k$ in time slot $t-1$, i.e., $\sum_{l \ne k} \left| \mathbf{h}_{l,k}^{\dagger}(t-1)\, \mathbf{w}_l(t-1) \right|^2 + \sigma^2$.
  3. The achievable rate of direct link $k$ in time slot $t-1$, i.e., $C_k(\mathbf{W}(t-1))$.
  4. The equivalent channel gain of direct link $k$ in time slot $t-1$, i.e., $\left| \mathbf{h}_{k,k}^{\dagger}(t-1)\, \bar{\mathbf{w}}_k(t-1) \right|^2$.
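
The four exchanged quantities can be computed from the channels and beamformers of one time slot, as sketched below. Several things are assumptions here: the rate is taken as the Shannon rate log2(1 + SINR), every other BS is treated as an interferer of UE k, and the normalized beamformer is taken as the unit-norm direction of the beamformer of BS k. The function name and array layout are hypothetical.

```python
import numpy as np

def exchanged_quantities(H, W, k, noise_power):
    """Return the four quantities for UE k in one time slot.

    H[j, k] is the channel vector from BS j to UE k (shape K x K x N),
    W[j] is the beamformer of BS j (shape K x N).
    """
    num_bs = W.shape[0]
    # np.vdot conjugates its first argument, giving |h_{j,k}^dagger w_j|^2.
    rx_power = np.array([abs(np.vdot(H[j, k], W[j])) ** 2 for j in range(num_bs)])
    interference = np.delete(rx_power, k)      # item 1: one entry per interferer
    ipn = interference.sum() + noise_power     # item 2: interference plus noise
    rate = np.log2(1.0 + rx_power[k] / ipn)    # item 3: assumed Shannon rate
    tx_power = np.vdot(W[k], W[k]).real        # squared norm of w_k
    # item 4: gain seen through the unit-norm beam direction (assumption).
    equiv_gain = rx_power[k] / tx_power if tx_power > 0 else 0.0
    return interference, ipn, rate, equiv_gain
```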

Distributed DRL-Based DTDE Scheme for DDBC

distributed-training-distributed-executing (DTDE)

distributed dynamic downlink-beamforming coordination (DDBC)

each BS is an independent agent

a multi-agent reinforcement learning problem

Illustration of the proposed distributed DRL-based DTDE scheme in the considered multi-agent system.

  1. Actions
    $\mathcal{A} = \{ (p, \mathbf{c}),\ p \in \mathcal{P},\ \mathbf{c} \in \mathcal{C} \}$
    $\mathcal{P} = \left\{ 0,\ \tfrac{1}{Q_{\mathrm{pow}}-1} p_{\max},\ \tfrac{2}{Q_{\mathrm{pow}}-1} p_{\max},\ \cdots,\ p_{\max} \right\}$
    $\mathcal{C} = \left\{ \mathbf{c}_0,\ \mathbf{c}_1,\ \cdots,\ \mathbf{c}_{Q_{\mathrm{code}}-1} \right\}$
    1. the transmit power of BS k in time slot t
    2. code $\mathbf{c}_k(t)$
    the total number of available actions is $Q = Q_{\mathrm{pow}} Q_{\mathrm{code}}$
  2. States
    1.Local Information
    2.Interferers' Information
    3.Interfered Neighbors' Information
  3. Reward
    the achievable rate of agent k
    the penalty on BS k is defined as the sum of the achievable rate losses of the interfered neighbors j ∈ Ok(t+1), which are interfered by BS k.
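
The discrete action set from item 1 can be enumerated as below. In this sketch the codebook is represented only by code indices 0..Qcode-1 because its contents are not given in these notes, and the function names are assumptions.

```python
from itertools import product

def power_levels(p_max, q_pow):
    """Uniformly quantized transmit powers {0, p_max/(Q_pow-1), ..., p_max}.

    Requires q_pow >= 2.
    """
    return [i * p_max / (q_pow - 1) for i in range(q_pow)]

def build_actions(p_max, q_pow, q_code):
    """All (power, code index) pairs; the list has Q_pow * Q_code entries."""
    return list(product(power_levels(p_max, q_pow), range(q_code)))
```

Each agent's DQN would then need one output per entry of this list.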

Distributed DRL-Based DTDE Scheme for DDBC

In training step t, the prediction error is
$$L(\boldsymbol{\theta}) = \frac{1}{2M_b} \sum_{\langle s,a,r,s' \rangle \in \mathcal{D}} \left( r' - q(s,a;\boldsymbol{\theta}) \right)^2$$

the target value of reward is
$$r' = r + \gamma \max_{a'} q(s',a';\boldsymbol{\theta}^{-})$$

the optimizer returns a set of gradients shown in (22) to update the weights of the trained DQN through the back-propagation (BP) technique
$$\frac{\partial L(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \frac{1}{M_b} \sum_{\langle s,a,r,s' \rangle \in \mathcal{D}} \left( r' - q(s,a;\boldsymbol{\theta}) \right) \nabla q(s,a;\boldsymbol{\theta}).$$
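
One training step on a mini-batch can be sketched with a linear Q-function standing in for the DQN; the real scheme uses a neural network and back-propagation. The code treats the batch-averaged sum in (22) as the direction along which the weights are moved. The linear model, the step size, and all names are illustrative assumptions.

```python
import numpy as np

def q_values(theta, s):
    """Linear stand-in for the DQN: q(s, .; theta) = theta @ s."""
    return theta @ s

def loss_and_direction(theta, theta_target, batch, gamma=0.9):
    """Return the prediction error L(theta) and the update direction of (22)."""
    m_b = len(batch)
    loss = 0.0
    direction = np.zeros_like(theta)
    for s, a, r, s_next in batch:
        target = r + gamma * np.max(q_values(theta_target, s_next))  # r'
        err = target - q_values(theta, s)[a]
        loss += err ** 2 / (2 * m_b)
        direction[a] += err * s / m_b  # for a linear q, grad wrt row a is s
    return loss, direction

def train_step(theta, theta_target, batch, lr=0.1, gamma=0.9):
    """Move theta along the direction, which lowers the prediction error."""
    _, direction = loss_and_direction(theta, theta_target, batch, gamma)
    return theta + lr * direction
```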


Y. Cao, S. -Y. Lien and Y. -C. Liang, "Deep Reinforcement Learning For Multi-User Access Control in Non-Terrestrial Networks," in IEEE Transactions on Communications, vol. 69, no. 3, pp. 1605-1619, March 2021, doi: 10.1109/TCOMM.2020.3041347.





Distributed DRL-based user-driven access control algorithm


















$r_i^{-}(t) = \omega_i(t) - C, \quad a_i(t) \neq a_i(t-1).$
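
The reward with a switching cost can be sketched as follows. The notes only give the case in which the UE changes its action; the other branch, where the reward is simply ωi(t), is an assumption, and the function name is hypothetical.

```python
def access_reward(omega, action, prev_action, switch_cost):
    """Reward of UE i in slot t, with cost C charged when the action changes."""
    if action != prev_action:
        return omega - switch_cost   # case given in the notes
    return omega                     # assumed: no cost when the action is kept
```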




Algorithm: UE-side DQN



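  1. RSS algorithm: in each time slot, each UE accesses the NT-BS that provides the strongest RSS.
  2. Q-learning algorithm [122]: each UE uses a tabular function (i.e., a Q-table) to estimate the Q-value of each state-action pair. In each time slot, each UE selects an NT-BS according to its Q-table.
  3. UCB algorithm [123]: in each time slot, each UE determines its NT-BS selection according to the UCB selection formula.
  4. Random algorithm: in each time slot, each UE randomly selects an NT-BS to access.
  5. Search algorithm (optimal result): to obtain the optimal result, this algorithm assumes a centralized decision node that collects global network information in real time.

M = 40 UEs and K = 6 NT-BSs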
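
The two simplest baselines, RSS and random selection, can be sketched in a few lines. The array layout `rss[i, k]` (RSS of NT-BS k measured at UE i) and the function names are assumptions.

```python
import numpy as np

def rss_select(rss):
    """RSS baseline: each UE picks the NT-BS with the strongest RSS."""
    return np.argmax(rss, axis=1)

def random_select(num_ue, num_bs, rng):
    """Random baseline: each UE picks an NT-BS uniformly at random."""
    return rng.integers(0, num_bs, size=num_ue)
```

Both return one NT-BS index per UE, so they can be dropped into the same evaluation loop as a learned policy.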


Y. Cao, S. -Y. Lien and Y. -C. Liang, "Multi-tier Collaborative Deep Reinforcement Learning for Non-terrestrial Network Empowered Vehicular Connections," 2021 IEEE 29th International Conference on Network Protocols (ICNP), Dallas, TX, USA, 2021, pp. 1-6, doi: 10.1109/ICNP52444.2021.9651962.











Algorithm: Twin Delayed Deep Deterministic Policy Gradient (TD3)


Y. Cao, S. -Y. Lien, Y. -C. Liang, D. Niyato and X. S. Shen, "Collaborative Deep Reinforcement Learning for Resource Optimization in Non-Terrestrial Networks," 2023 IEEE 34th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Toronto, ON, Canada, 2023, pp. 1-7, doi: 10.1109/PIMRC56721.2023.10294047.



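  1. LEO transmit beam angle
  2. UE receive beam angle
  3. RB allocation scheme: $b_{n,m} = 1$ when the m-th RB is allocated to the UE in the n-th time slot.
  4. Preconfigured RB subset: $\mathcal{M}_n^{\mathrm{LEO}}$ denotes the preconfigured RB subset selected from the RB set $\mathcal{M}$ in time slot n.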












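  1. Spectrum allocation scheme of the GEO satellite: $W_{j,t}$ denotes the number of spectrum blocks that the GEO satellite allocates to the j-th UAV relay (beam cell) in time slot t.
  2. UAV deployment trajectory: $p_j^a(t)$ denotes the three-dimensional coordinates of the j-th UAV relay.
  3. UE access decision: $u_{i,j}^t = 1$ indicates that the i-th UE accesses the j-th UAV relay in time slot t.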


  1. Each UE may access only one UAV relay in each time slot.
  2. The spectrum consists of K equal-width, non-overlapping spectrum blocks.
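
A feasibility check for one time slot can be sketched as follows. It assumes that the blocks allocated to all UAV relays may not exceed the K available blocks; the notes list the K blocks but do not spell out that budget constraint. The function name and argument layout are hypothetical.

```python
import numpy as np

def is_feasible(u, w, total_blocks):
    """Check one time slot.

    u[i][j] = 1 if UE i accesses UAV relay j, w[j] = blocks given to relay j.
    """
    u = np.asarray(u)
    w = np.asarray(w)
    binary = np.isin(u, (0, 1)).all()
    one_relay = (u.sum(axis=1) <= 1).all()   # at most one relay per UE
    # Assumed budget: allocations are non-negative and sum to at most K.
    spectrum_ok = (w >= 0).all() and w.sum() <= total_blocks
    return bool(binary and one_relay and spectrum_ok)
```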
