TCP UCP v1.0: A Non-Destructive Constraint Layer for BBR

File: tcp_ucp.c
Version: 1.0
License: Dual BSD/GPL
Author: PPP PRIVATE NETWORK™ X (an extension of Google BBRv1, not a modification)
Kernel constraint: struct ucp must be ≤ 104 bytes (ICSK_CA_PRIV_SIZE)


0. Runtime Flow (per-ACK pipeline)

TCP ACK arrives with rate_sample
  ucp_main (per-ACK entry)
    ucp_update_model
      ucp_update_bw
        Update minmax running-max of bandwidth (10 rounds)
        Maybe apply bandwidth floor (per-class, loss<5%)
      ucp_update_loss_ewma
        loss_ewma = (old*3 + instant_loss*1)/4; if no loss, decay *0.7
        Compressed to 9 bits (u8 + overflow bit)
      ucp_update_ack_aggregation
        Single-slot epoch: extra_acked u8 saturates
        Epoch length = max(min_rtt, 1ms)
        At epoch end: extra_acked_max = max(extra_acked_max*3/4, excess)
      ucp_update_net_condition
        Input: rate_change_ewma (7/8 smoothed), loss_ewma, ecn_ewma, rtt_inc
        Rate drop ≤ -15% AND (loss≥5% OR ECN>0) AND (rtt_inc≥20% OR loss≥10%) → CONGESTED
        Same rate drop but no high rtt_inc → RANDOM_LOSS
        Loss>0 but rate not dropping → RANDOM_LOSS
        No loss, rate not dropping → LIGHT_LOAD
        Entering CONGESTED needs 3 confirmations, exiting needs 2
      ucp_update_net_class
        Uses avg_rtt, jitter (max-min), loss_ewma, rtt_inc
        Decision order:
          loss>5% → CONGESTED
          RTT<5ms, jitter<3ms, loss<0.1% → LAN
          loss>3% AND jitter>20ms → MOBILE
          RTT>80ms AND loss>1% → LOSSY_FAT
          rtt_inc≥50% AND loss≥1% → CONGESTED
          RTT>60ms → VPN
          Otherwise DEFAULT
      ucp_update_cycle_phase
        If mode=PROBE_BW and phase_done() is true, advance cycle_idx (0..7)
        If the probe phase (idx 0) just ended, probe_gain_applied=1, and loss≥drain_thresh,
          set drain_pending = 1/2/3 based on loss severity
      ucp_check_full_bw_reached
        At round start: if current max_bw ≥ 1.25× full_bw snapshot → update snapshot, reset cnt
        Else cnt++; if cnt≥3 → full_bw_reached=1
      ucp_check_drain
        STARTUP → DRAIN when full_bw_reached=1
        DRAIN → PROBE_BW when inflight_at_edt ≤ 1.0×BDP
      ucp_update_min_rtt
        Update min_rtt: if sample < min_rtt, fast-fall check (3 consecutive samples <75% → take
          directly, else sticky floor *3/4)
        SRTT guard: srtt/8 < min_rtt*9/10 → update min_rtt
        If filter_expired and not idle_restart and not in PROBE_RTT → enter PROBE_RTT, save cwnd
        In PROBE_RTT: when inflight≤4, set done_stamp=now+200ms; on round_done and timeout →
          exit PROBE_RTT
      Set mode-specific base gains
        STARTUP:   pacing_gain = UCP_HIGH_GAIN (≈2.885x), cwnd_gain = same
        DRAIN:     pacing_gain = UCP_DRAIN_GAIN (≈0.346x), cwnd_gain = UCP_HIGH_GAIN
        PROBE_BW:  cwnd_gain = ucp_cwnd_gain_val (2.0x default), pacing_gain from cycle phase
        PROBE_RTT: pacing_gain = 1.0x, cwnd_gain = 1.0x
    ucp_apply_pacing_constraints (one-shot drain override)
      If drain_pending != 0, override pacing_gain = drain_gain_by_level(level), clear pending
    ucp_apply_cwnd_constraints
      If net_condition == CONGESTED: cap cwnd_gain based on congestion_severity
        (MILD/MODERATE/SEVERE → 1.75x/1.5x/1.25x BDP)
      If mode == STARTUP: loss≥hard_cap (2%) → cap both gains to cwnd_gain_val (2.0x);
        loss≥soft_drain (0.5%) → cap to startup_soft_gain_val (2.5x)
    ucp_set_pacing_rate
      rate = bw_to_pacing_rate(bw, pacing_gain) bytes/sec
      If full_bw_reached OR rate>current: apply 3:1 EWMA smoothing on increases,
        fast ramp if rate>2×current and round_start
      Set sk_pacing_rate = rate
    ucp_set_cwnd
      If nothing acked: jump to done
      If entering fast recovery: use recovery cwnd, skip the normal path
      target = ucp_bdp(bw, cwnd_gain) + quantization + ack_compensation
      Clamp target to [1.25×BDP_minrtt, 2.0×BDP_minrtt]
      If full_bw_reached: cwnd = min(cwnd+acked, target); else cwnd += acked
      cwnd = max(cwnd, 4)
      If just exiting PROBE_RTT: cwnd = max(cwnd, prior_cwnd)
      If mode == PROBE_RTT: cwnd = min(cwnd, 4)
      Set tp->snd_cwnd = min(cwnd, snd_cwnd_clamp)
      return


1. Core Goals and Limitations (a practical view)

What problem does UCP try to solve?

BBRv1 has the following problems on real Internet paths:

  • Over-sensitive probing: BBR's bandwidth filter (a 10-round running max) and fixed gain cycle (1.25× probe / 0.75× drain) keep probing even when there is no congestion, causing large swings in the send rate.
  • Over-reaction to loss: BBR does not react to loss directly (other than entering recovery), but its bandwidth estimate depends on the delivery rate; loss invalidates samples or biases them low, which can make the rate collapse.
  • Overly aggressive STARTUP: the 2.89× gain can instantly fill the buffer on loss-tolerant links (wireless, satellite) and trigger heavy loss, after which BBR enters DRAIN and oscillates.

The "non-destructive constraint layer" UCP adds tries to:

  1. Smooth rate fluctuations: bandwidth floors, the congestion state machine, probe skipping, and one-shot drains reduce unnecessary rate jumps.
  2. Tolerate background loss: loss is classified as random (RANDOM_LOSS) or congestive (CONGESTED); cwnd is tightened only in the latter case.
  3. Protect STARTUP: reduce the gain as soon as loss appears during STARTUP, avoiding overshoot.
  4. Adapt to the path: automatically recognize LAN, MOBILE, VPN, etc., and adjust probe intervals and bandwidth floors.

Real-World Effects and Costs

  • Throughput: in pure throughput tests (long iperf3 flows, no loss), UCP runs 2% to 7% below BBRv1. Reasons:
    • The cwnd ceiling itself does not bind when uncongested (cwnd_gain stays at 2.0×, same as BBR), but the bandwidth floor can hold the bandwidth estimate artificially high, skewing the cwnd target and adding retransmissions under mild congestion.
    • ACK aggregation compensation is a single-slot approximation; under ACK compression the bonus is smaller.
    • Probe skipping can leave a connection not probing for new bandwidth for long periods.
    • A falsely triggered one-shot drain temporarily lowers the rate.
  • Smoothness: UCP's send-rate curve fluctuates noticeably less than BBR's, especially on paths with high loss or high delay variation. This benefits streaming (Twitch, YouTube, Netflix): video players can predict bandwidth more reliably, with less rebuffering and fewer bitrate switches.
  • Loss behavior: the bandwidth floor keeps the send rate from dropping too far under light loss, so the loss rate can be slightly higher than BBR's (UCP keeps sending). This is intentional: trade a small amount of loss for higher, steadier throughput.
  • Lag: the condition classifier needs 2-3 confirmations before switching state, so it reacts slowly to fast network changes (e.g., a 4G-to-WiFi handover). In the meantime UCP may keep using the old class parameters, causing a mismatch.

Known New Problems UCP May Introduce

| Problem | Symptom | Cause |
|---|---|---|
| Lower throughput than BBR on clean networks | roughly 5% loss | bandwidth floor, probe skipping, simplified ACK compensation |
| Slow recovery from congestion | leaving CONGESTED needs 2 confirmation cycles | hysteresis by design |
| Bandwidth floor causes futile sends | under heavy loss UCP still sends at a high floor, aggravating loss | the floor only applies while loss < 5%, and that cap is tunable |
| Class misclassification | a stable high-latency link may be classed as VPN instead of DEFAULT, possibly disabling the floor | fixed thresholds |
| PROBE_RTT interval stretched too far | with MOBILE class plus high loss, min_rtt may not refresh for 10+5 = 15 s, the ucp_probe_rtt_max_sec cap (see the sketch below) | class extra + loss extra accumulate |
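
A minimal sketch of that interval stacking, using the section 4.3 defaults; the helper itself and its arguments are illustrative, not module code:

#include <stdint.h>
typedef uint32_t u32;

/* Stack the PROBE_RTT interval extras, then clamp at the maximum. */
static u32 probe_rtt_interval_sec(int class_is_mobile_or_lossy,
                                  u32 loss_permyriad)
{
	u32 sec = 10;                             /* ucp_probe_rtt_base_sec       */
	if (class_is_mobile_or_lossy) sec += 5;   /* ucp_probe_rtt_class_extra    */
	if (loss_permyriad >= 500)    sec += 0;   /* high-loss extra (default 0)  */
	else if (loss_permyriad >= 200) sec += 0; /* mid-loss extra (default 0)   */
	if (sec > 15) sec = 15;                   /* ucp_probe_rtt_max_sec        */
	return sec;
}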

2. Loading, Enabling, and Configuring

2.1 Building the module

Assuming the kernel build tree is at /lib/modules/$(uname -r)/build and the module source file is tcp_ucp.c, the Makefile is:

obj-m := tcp_ucp.o
KERNELDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
    $(MAKE) -C $(KERNELDIR) M=$(PWD) modules
clean:
    $(MAKE) -C $(KERNELDIR) M=$(PWD) clean

Run make to produce tcp_ucp.ko.

2.2 Loading the module

# root privileges required
insmod tcp_ucp.ko
# or via modprobe (if installed under /lib/modules/.../kernel/net/ipv4/)
depmod -a
modprobe tcp_ucp

Verify the module loaded:

lsmod | grep tcp_ucp
# example output: tcp_ucp               16384  0
sysctl net.ipv4.tcp_available_congestion_control
# should include "ucp"

2.3 Making UCP the system default congestion control

# takes effect immediately, non-persistent
echo ucp > /proc/sys/net/ipv4/tcp_congestion_control

# persistent (write to /etc/sysctl.conf or /etc/sysctl.d/)
echo "net.ipv4.tcp_congestion_control = ucp" >> /etc/sysctl.d/99-ucp.conf
sysctl -p /etc/sysctl.d/99-ucp.conf

2.4 Enabling UCP for a single connection (setsockopt from the application)

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>

int main(void) {
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    const char algo[] = "ucp";
    /* fails if the module is not loaded or "ucp" is not registered */
    if (setsockopt(sock, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo)) < 0)
        return 1;
    // ... connect
    return 0;
}

2.5 Changing parameters at runtime (no module unload required)

Method A: write directly through sysfs

All parameters live under /sys/module/tcp_ucp/parameters/.

Examples:

# switch to pure BBR-compatible mode
echo 1 > /sys/module/tcp_ucp/parameters/ucp_bbr_mode

# disable the bandwidth floor (for all classes)
echo 0 > /sys/module/tcp_ucp/parameters/ucp_bw_floor_default
echo 0 > /sys/module/tcp_ucp/parameters/ucp_bw_floor_mobile
# ... and likewise for the other classes

# raise the severe-congestion cwnd cap from 1.25x to 1.5x
echo 15000 > /sys/module/tcp_ucp/parameters/ucp_cwnd_cap_severe

# disable ACK aggregation compensation
echo 0 > /sys/module/tcp_ucp/parameters/ucp_extra_acked_gain

Method B: sysctl (persistable via /etc/sysctl.conf)

All parameters map to net.ucp.<param_name>. Examples:

# temporary change
sysctl -w net.ucp.ucp_bbr_mode=1
sysctl -w net.ucp.ucp_probe_gain=10500

# persistent configuration
echo "net.ucp.ucp_bbr_mode = 1" >> /etc/sysctl.d/99-ucp.conf
echo "net.ucp.ucp_probe_gain = 10500" >> /etc/sysctl.d/99-ucp.conf
sysctl -p /etc/sysctl.d/99-ucp.conf

Note: both methods modify the same global variables, and every write automatically calls ucp_init_module_params() to recompute all internal caches (permyriad → BBR_SCALE conversion, seconds → jiffies conversion). Existing connections pick up the new values on the next ACK they process.

2.6 Inspecting a live UCP connection's internal state

Use ss -ti --bbr. If the top bit (0x80000000) of the high 32 bandwidth bits (bbr_bw_hi) is set, the remaining bits encode UCP-specific fields:

bbr_bw_hi (32 bits) = 1 (bit31) | net_class (3 bits) | net_condition (2 bits) | drain_pending (2 bits) | loss_ewma (8 bits)

Decoding:

# suppose the ss -ti output contains "bbr_bw_hi:800a1234"
VAL=0x800a1234
NET_CLASS=$(( (VAL >> 28) & 0x7 ))
NET_COND=$(( (VAL >> 26) & 0x3 ))
DRAIN=$(( (VAL >> 24) & 0x3 ))
LOSS=$(( (VAL >> 16) & 0xFF ))
printf "net_class=%d net_cond=%d drain=%d loss_ewma=%d (0-256)\n" $NET_CLASS $NET_COND $DRAIN $LOSS

Enumeration values:

  • net_class: 0=DEFAULT, 1=LAN, 2=MOBILE, 3=LOSSY_FAT, 4=CONGESTED, 5=VPN
  • net_condition: 0=IDLE, 1=LIGHT_LOAD, 2=CONGESTED, 3=RANDOM_LOSS
  • drain_pending: 0=none, 1=light, 2=standard, 3=aggressive
  • loss_ewma: value / 256 = loss ratio; e.g. 128 = 50%

3. All Built-In Constants (not modifiable via module parameters)

The following constants are #defines; changing them requires editing the source and recompiling.

3.1 Fixed-point scaling constants

| Constant | Value | Description |
|---|---|---|
| BW_SCALE | 24 | fractional bits for bandwidth scaling; 1.0 = 2^24 |
| BW_UNIT | 1 << 24 = 16777216 | bandwidth fixed-point unit |
| BBR_SCALE | 8 | fractional bits for gain scaling; 1.0 = 256 |
| BBR_UNIT | 256 | gain fixed-point unit |

A short worked illustration of this convention follows.
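
A runnable userspace illustration of the fixed-point arithmetic (the numbers and helper variables here are examples, not module code):

#include <stdio.h>
#include <stdint.h>

#define BW_SCALE 24
#define BW_UNIT  (1u << BW_SCALE)   /* 1.0 pkt/us in bandwidth fixed point */
#define BBR_SCALE 8
#define BBR_UNIT  (1u << BBR_SCALE) /* 1.0x in gain fixed point */

int main(void) {
    /* 0.005 pkt/us (about 5000 pkts/s) expressed in BW_SCALE units */
    uint64_t bw = (uint64_t)(0.005 * BW_UNIT);   /* = 83886 */
    uint32_t gain = BBR_UNIT * 125 / 100;        /* 1.25x = 320 */

    /* bw * gain, then shift the gain's fractional bits back out */
    uint64_t scaled = (bw * gain) >> BBR_SCALE;
    printf("bw=%llu gain=%u -> scaled=%llu (still in BW_SCALE units)\n",
           (unsigned long long)bw, gain, (unsigned long long)scaled);
    return 0;
}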

3.2 Bandwidth filter and PROBE_BW cycle parameters

| Constant | Value | Description |
|---|---|---|
| UCP_BW_RTT_CYCLE_LEN | 8 | rounds per gain cycle |
| UCP_BW_RTTS | 10 | max-bandwidth filter window (8 + 2 guard rounds) |
| UCP_PROBE_RTT_MODE_MS | 200 | PROBE_RTT dwell time (ms) |
| UCP_MIN_TSO_RATE | 1200000 (bps) | below this rate, TSO uses only 1 segment |
| UCP_TSO_MAX_SEGS | 0x7F (127) | maximum TSO segment count |
| UCP_PACING_MARGIN_PERCENT | 1 | pacing rate reduced by 1% as a safety margin |
| UCP_PACING_MARGIN_DIV | 99 | margin factor: rate * 99 / 100 |
| UCP_RATE_MAX_SAFE | U64_MAX / 99 | overflow guard |
| UCP_HIGH_GAIN | (256 * 2885 / 1000 + 1) = 739 | STARTUP gain ≈ 2.885× |
| UCP_DRAIN_GAIN | (256 * 1000 / 2885) = 88 | DRAIN gain ≈ 0.346× |
| UCP_PROBE_BW_CYCLE_LEN | 8 | number of PROBE_BW phases |
| UCP_PROBE_BW_DRAIN_IDX | 1 | index of the drain phase within the cycle |
| UCP_PROBE_BW_CYCLE_RAND | 7 | random initial-phase range 0-7 |
| UCP_CWND_MIN_TARGET | 4 | minimum cwnd (packets) |
| UCP_FULL_BW_THRESH | (256 * 125 / 100) = 320 | pipe-full detection threshold, 1.25× |
| UCP_FULL_BW_CNT | 3 | rounds without 1.25× growth before the pipe is deemed full |

3.3 EWMA weights and decay

| Constant | Value | Description |
|---|---|---|
| UCP_LOSS_EWMA_RETAINED_WEIGHT | 3 | numerator of the retained portion of the loss EWMA |
| UCP_LOSS_EWMA_SAMPLE_WEIGHT | 1 | numerator of the new-sample weight |
| UCP_LOSS_EWMA_TOTAL_WEIGHT | 4 | total-weight denominator |
| UCP_LOSS_EWMA_IDLE_DECAY_NUM | 70 | decay numerator when no loss is seen |
| UCP_LOSS_EWMA_IDLE_DECAY_DEN | 100 | decay denominator (0.7×) |
| UCP_ECN_EWMA_* | same as above | same scheme for ECN |
| UCP_RATE_TREND_EWMA_WEIGHT | (256 * 7 / 8) = 224 | rate-trend smoothing weight (7/8 retained) |

3.4 State machine hysteresis counters

| Constant | Value | Description |
|---|---|---|
| UCP_COND_CONFIRM_ENTER | 3 | confirmations needed to enter CONGESTED |
| UCP_COND_CONFIRM_EXIT | 2 | confirmations needed to leave any non-IDLE condition |
| UCP_CLASS_CONFIRM_CNT | 2 | consistent samples needed to switch path class |
| UCP_CLASS_CONFIRM_MAX | 7 | class confirmation counter maximum (3 bits) |

3.5 cwnd bounds

| Constant | Value | Description |
|---|---|---|
| UCP_INFLIGHT_LOW_GAIN | (256 * 125 / 100) = 320 | cwnd lower-bound multiplier, 1.25× |
| UCP_INFLIGHT_HIGH_GAIN | (256 * 200 / 100) = 512 | cwnd upper-bound multiplier, 2.0× |

3.6 ACK aggregation compensation

| Constant | Value | Description |
|---|---|---|
| UCP_ACK_EPOCH_DECAY_NUM | 3 | extra_acked_max decay numerator |
| UCP_ACK_EPOCH_DECAY_DEN | 4 | denominator; 3/4 decay |
| UCP_U8_MAX | 0xFF | 255 |
| UCP_ACK_EPOCH_MIN_US | 1000 | minimum epoch, 1 ms |

3.7 RTT sample filtering

| Constant | Value | Description |
|---|---|---|
| UCP_RTT_SAMPLE_MAX_US | 500000 | absolute RTT cap, 500 ms |
| UCP_RTT_SAMPLE_MAX_MULT | 3 | dynamic cap = min_rtt * 3 |
| UCP_MINRTT_FAST_FALL_CNT | 3 | fast fall needs 3 consecutive samples below 75% |
| UCP_MINRTT_STICKY_FLOOR_NUM | 3 | gradual-fall numerator (×3/4) |
| UCP_MINRTT_STICKY_FLOOR_DEN | 4 | denominator |
| UCP_MINRTT_SRTT_GUARD_NUM | 9 | SRTT guard numerator |
| UCP_MINRTT_SRTT_GUARD_DEN | 10 | denominator (9/10) |

The min_rtt acceptance logic these constants drive is sketched below.
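
A userspace sketch of the min_rtt acceptance described in section 0 (fast fall, sticky floor, SRTT guard). Field and helper names are illustrative, and the exact branch structure is an assumption reconstructed from the outline, not the module's code:

#include <stdint.h>
typedef uint32_t u32;

struct minrtt { u32 min_rtt_us; u32 fast_fall_cnt; };

/* Assumes the sample already passed the 500 ms / 3x sanity caps above.
 * srtt_us is the smoothed RTT with the kernel's x8 scaling removed (srtt/8). */
static void minrtt_update(struct minrtt *m, u32 sample_us, u32 srtt_us)
{
	if (sample_us < m->min_rtt_us * 3 / 4) {       /* deep drop: < 75%     */
		if (++m->fast_fall_cnt >= 3) {         /* 3 in a row: trust it */
			m->min_rtt_us = sample_us;
			m->fast_fall_cnt = 0;
		} else {                               /* sticky floor: x3/4   */
			m->min_rtt_us = m->min_rtt_us * 3 / 4;
		}
	} else if (sample_us < m->min_rtt_us) {        /* modest drop: accept  */
		m->min_rtt_us = sample_us;
		m->fast_fall_cnt = 0;
	} else {
		m->fast_fall_cnt = 0;
	}
	/* SRTT guard: a smoothed RTT well below the floor forces it down */
	if (srtt_us < m->min_rtt_us * 9 / 10)
		m->min_rtt_us = srtt_us;
}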

3.8 BDP calculation bounds

| Constant | Value | Description |
|---|---|---|
| UCP_BDP_MIN_RTT_US | 1000 | minimum RTT, 1 ms |
| UCP_BDP_HI_MULT | 2 | model_rtt upper-bound multiplier |
| UCP_BDP_HI_FLOOR_US | 500000 | lower limit of the model_rtt upper bound, 500 ms |

3.9 TSO and cwnd quantization

| Constant | Value | Description |
|---|---|---|
| UCP_TSO_HEADROOM_SEGS | 3 | TSO headroom segments |
| UCP_PROBE_CWND_BONUS | 2 | extra cwnd segments during the probe phase |

3.10 Path classification thresholds

| Constant | Value (BBR_SCALE or us) | Description |
|---|---|---|
| UCP_CLASS_LAN_RTT_US | 5000 us | LAN RTT < 5 ms |
| UCP_CLASS_LAN_JITTER_US | 3000 us | LAN jitter < 3 ms |
| UCP_CLASS_LAN_LOSS_THRESH | 256 / 1000 = 0.256 | LAN loss < 0.1% |
| UCP_CLASS_MOBILE_LOSS_THRESH | 256 * 3 / 100 = 7.68 | MOBILE loss > 3% |
| UCP_CLASS_MOBILE_JITTER_US | 20000 us | MOBILE jitter > 20 ms |
| UCP_CLASS_LOSSY_RTT_US | 80000 us | LOSSY_FAT RTT > 80 ms |
| UCP_CLASS_LOSSY_LOSS_THRESH | 256 / 100 = 2.56 | LOSSY_FAT loss > 1% |
| UCP_CLASS_CONG_RINC_THRESH | 256 * 50 / 100 = 128 | CONGESTED class rtt_inc ≥ 50% |
| UCP_CLASS_VPN_RTT_US | 60000 us | VPN RTT > 60 ms |

3.11 Congestion level and condition thresholds (BBR_SCALE)

| Constant | Value | Description |
|---|---|---|
| UCP_RTT_EXTRA_HIGH_THRESH | 256 | RTT increase of 100% |
| UCP_RTT_EXTRA_MID_THRESH | 128 | RTT increase of 50% |
| UCP_CONG_SEVERE_RINC_THRESH | 256 | severe congestion: RTT increase 100% |
| UCP_CONG_MODERATE_RINC_THRESH | 128 | moderate congestion: RTT increase 50% |
| UCP_CONG_MILD_RINC_THRESH | 256 * 25 / 100 = 64 | mild: RTT increase 25% |
| UCP_COND_RATE_DROP_THRESH | -38 (≈ -15%) | rate-drop threshold |
| UCP_COND_LOSS_CONGEST_THRESH | 256 * 5 / 100 ≈ 12.8 | loss ≥ 5% counts as a congestion signal |
| UCP_COND_RINC_CONGEST_THRESH | 256 * 20 / 100 ≈ 51.2 | rtt_inc ≥ 20% confirms congestion |
| UCP_COND_LOSS_SEVERE_THRESH | 256 * 10 / 100 ≈ 25.6 | loss ≥ 10% is severe |

4. All Tunable Module Parameters (sysfs / sysctl)

All parameters are expressed in permyriad (1/10000) unless marked as seconds (sec) or boolean (0/1). Parameter names are identical in sysfs and sysctl (sysctl prefix net.ucp.).

4.1 Operating mode

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| ucp_bbr_mode | bool | 0 | 0/1 | 1 = pure BBR mode (disables all UCP constraints except ACK compensation) |

4.2 Bandwidth soft floor (percent of the non-congested peak)

| Parameter | Default (permyriad) | Range | Description |
|---|---|---|---|
| ucp_bw_floor_default | 2000 (20%) | 0-10000 | floor for the DEFAULT class |
| ucp_bw_floor_mobile | 2500 (25%) | 0-10000 | floor for MOBILE |
| ucp_bw_floor_lan | 0 (disabled) | 0-10000 | floor for LAN |
| ucp_bw_floor_vpn | 0 | 0-10000 | floor for VPN |
| ucp_bw_floor_lossy_fat | 2500 | 0-10000 | floor for LOSSY_FAT |
| ucp_bw_floor_congested | 2000 | 0-10000 | floor for the CONGESTED class (only applied while not congested) |
| ucp_bw_floor_loss_cap | 500 (5%) | 0-10000 | maximum loss at which the floor still applies; above this the floor is disabled |

4.3 PROBE_RTT interval adjustment (seconds)

| Parameter | Default (s) | Range | Description |
|---|---|---|---|
| ucp_probe_rtt_base_sec | 10 | ≥1 | base interval |
| ucp_probe_rtt_class_extra_sec | 5 | ≥0 | extra seconds for MOBILE/LOSSY_FAT |
| ucp_probe_rtt_high_loss_extra_sec | 0 | ≥0 | extra when loss ≥ 5% |
| ucp_probe_rtt_mid_loss_extra_sec | 0 | ≥0 | extra when 2% ≤ loss < 5% |
| ucp_probe_rtt_max_sec | 15 | ≥1 | maximum interval |

4.4 cwnd gain and congestion caps (permyriad of BDP)

| Parameter | Default (permyriad) | Multiplier | Range |
|---|---|---|---|
| ucp_cwnd_gain | 20000 | 2.00× | 1000-40000 |
| ucp_cwnd_cap_mild | 17500 | 1.75× | 1000-20000 |
| ucp_cwnd_cap_moderate | 15000 | 1.50× | same |
| ucp_cwnd_cap_severe | 12500 | 1.25× | same |

4.5 ACK aggregation compensation

| Parameter | Default (permyriad) | Range | Description |
|---|---|---|---|
| ucp_extra_acked_gain | 10000 | 0-20000 | compensation gain; 0 disables |

4.6 PROBE_BW probe gains

| Parameter | Default (permyriad) | Multiplier | Range |
|---|---|---|---|
| ucp_probe_gain | 11000 | 1.10× | 10000-15000 |
| ucp_probe_gain_mobile | 10000 | 1.00× | 10000-11000 |

4.7 Drain parameters

| Parameter | Default (permyriad) | Multiplier / threshold | Range |
|---|---|---|---|
| ucp_drain_loss_thresh | 100 | 1.0% loss | 0-1000 |
| ucp_drain_gain_light | 8500 | 0.85× | 5000-10000 |
| ucp_drain_gain_standard | 7500 | 0.75× | same |
| ucp_drain_gain_aggressive | 6500 | 0.65× | same |
| ucp_drain_loss_lvl2_thresh | 500 | 5.0% loss | 0-10000 |
| ucp_drain_loss_lvl3_thresh | 1000 | 10.0% loss | 0-10000 |

The drain-level selection these thresholds drive is sketched below.
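
A minimal sketch of how the three drain levels map onto these thresholds (per the ucp_update_cycle_phase step in section 0; the helper name is illustrative):

#include <stdint.h>
typedef uint32_t u32;

/* permyriad thresholds, mirroring the defaults above */
static const u32 drain_loss_thresh      = 100;  /* 1%  -> level 1 (light)      */
static const u32 drain_loss_lvl2_thresh = 500;  /* 5%  -> level 2 (standard)   */
static const u32 drain_loss_lvl3_thresh = 1000; /* 10% -> level 3 (aggressive) */

/* Returns the drain_pending level queued after a probe phase that saw loss. */
static u32 drain_level_for_loss(u32 loss_permyriad)
{
	if (loss_permyriad >= drain_loss_lvl3_thresh) return 3;
	if (loss_permyriad >= drain_loss_lvl2_thresh) return 2;
	if (loss_permyriad >= drain_loss_thresh)      return 1;
	return 0; /* below the drain threshold: no one-shot drain queued */
}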

4.8 Classification thresholds

| Parameter | Default (permyriad) | Percentage | Range |
|---|---|---|---|
| ucp_low_loss_thresh | 100 | 1.0% | 0-1000 |
| ucp_high_loss_thresh | 500 | 5.0% | 0-10000 |

4.9 Probe-skip thresholds

| Parameter | Default (permyriad) | Condition | Range |
|---|---|---|---|
| ucp_probe_skip_loss_thresh | 200 | loss ≥ 2% | 0-1000 |
| ucp_probe_skip_rtt_rise | 4000 | RTT rise ≥ 40% | 0-10000 |

4.10 STARTUP loss handling

| Parameter | Default (permyriad) | Condition / gain | Range |
|---|---|---|---|
| ucp_startup_soft_drain_thresh | 50 | loss ≥ 0.5% triggers the soft reduction | 0-1000 |
| ucp_startup_hard_cap_thresh | 200 | loss ≥ 2.0% triggers the hard cap | 0-1000 |
| ucp_startup_soft_gain | 25000 | gain after the soft reduction, 2.5× | 20000-28900 |

A sketch of the resulting STARTUP gain selection follows.
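
A minimal sketch of the STARTUP capping described in section 0 (ucp_apply_cwnd_constraints), assuming the defaults above; names are illustrative:

#include <stdint.h>
typedef uint32_t u32;

#define BBR_UNIT 256
static const u32 high_gain     = 739; /* UCP_HIGH_GAIN, ~2.885x            */
static const u32 soft_gain     = 640; /* 25000 permyriad -> 2.50x * 256    */
static const u32 cwnd_gain_val = 512; /* 20000 permyriad -> 2.00x * 256    */

/* loss >= 2.0% caps the gain to 2.0x; loss >= 0.5% softens it to 2.5x.
 * Thresholds are in permyriad. */
static u32 startup_gain(u32 loss_permyriad)
{
	if (loss_permyriad >= 200) return cwnd_gain_val; /* hard cap   */
	if (loss_permyriad >= 50)  return soft_gain;     /* soft drain */
	return high_gain;                                /* untouched  */
}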

5. struct ucp and the Exact 104-Byte Allocation

The field and bitfield layout below stays within ICSK_CA_PRIV_SIZE (104 bytes in practice). If the compile-time BUILD_BUG_ON fires, the struct has outgrown the budget.

5.1 Basic u32 fields (12 bytes)

u32 min_rtt_us;           // 4
u32 min_rtt_stamp;        // 4
u32 probe_rtt_done_stamp; // 4

5.2 struct minmax bw (24 bytes, defined by the kernel's win_minmax.h)

Holds the sliding-window array and indices.

5.3 Round counters and timestamps (16 bytes)

u32 rtt_cnt;              // 4
u32 next_rtt_delivered;   // 4
u32 cycle_mstamp_lo;      // 4
u32 cycle_mstamp_hi;      // 4

5.4 Bitfield word 1 (32 bits = 4 bytes)

| Field | Bits | Description |
|---|---|---|
| mode | 2 | UCP_STARTUP(0), DRAIN(1), PROBE_BW(2), PROBE_RTT(3) |
| prev_ca_state | 3 | previous TCP CA state |
| round_start | 1 | new-round-started flag |
| idle_restart | 1 | restarted-after-idle flag |
| probe_rtt_round_done | 1 | at least one full round completed in PROBE_RTT |
| fast_recovery | 1 | in non-congestion fast recovery |
| net_condition | 2 | 0=IDLE, 1=LIGHT_LOAD, 2=CONGESTED, 3=RANDOM_LOSS |
| net_class | 3 | 0=DEFAULT, 1=LAN, 2=MOBILE, 3=LOSSY_FAT, 4=CONGESTED, 5=VPN |
| rate_hist_idx | 1 | bandwidth ring-buffer write index |
| rtt_hist_idx | 1 | RTT ring-buffer write index |
| drain_pending | 2 | 0=none, 1=light, 2=standard, 3=aggressive |
| cond_confirm | 3 | net_condition confirmation counter |
| class_confirm | 3 | net_class confirmation counter |
| min_rtt_fast_fall_cnt | 2 | fast-fall counter |
| cycle_idx | 3 | current PROBE_BW phase, 0-7 |
| probe_gain_applied | 1 | whether the last probe phase used a gain > 1.0× |
| padding1 | 2 | unused |

Sum: 2+3+1+1+1+1+2+3+1+1+2+3+3+2+3+1+2 = 32 bits.

5.5 Bitfield word 2 (32 bits = 4 bytes)

| Field | Bits | Description |
|---|---|---|
| pacing_gain | 12 | current pacing gain (BBR_SCALE) |
| cwnd_gain | 10 | current cwnd gain (BBR_SCALE) |
| full_bw_reached | 1 | STARTUP pipe-full flag |
| full_bw_cnt | 2 | consecutive rounds without 1.25× growth |
| has_seen_rtt | 1 | a valid RTT sample has been seen |
| has_delayed_ack | 1 | current ACK is a delayed ACK |
| probe_rtt_restored | 1 | cwnd just restored after PROBE_RTT |
| loss_ewma_high | 1 | bit 8 of the loss EWMA |
| ecn_ewma_high | 1 | bit 8 of the ECN EWMA |
| padding2 | 2 | unused |

Sum: 12+10+1+2+1+1+1+1+1+2 = 32 bits.

5.6 Standalone u32 fields (8 bytes)

u32 prior_cwnd;   // 4
u32 full_bw;      // 4

5.7 Ring buffers (8 + 8 = 16 bytes)

u32 rtt_history[2];        // 8
u32 deliv_rate_hist[2];    // 8

5.8 Compact u8 and u32 fields (4 + 16 = 20 bytes)

u8 loss_ewma;               // 1
u8 ecn_ewma;                // 1
u8 extra_acked;             // 1
u8 extra_acked_max;         // 1
u32 ack_epoch_start_us;     // 4
u32 max_bw_non_congested;   // 4
s32 rate_change_ewma;       // 4
u32 last_delivered_ce;      // 4

Total: 12 + 24 + 16 + 4 + 4 + 8 + 16 + 20 = 104 bytes, filling the budget exactly.
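
The budget is typically enforced at compile time with a check along these lines (a minimal sketch in kernel context; ucp_register and tcp_ucp_cong_ops are illustrative names, not necessarily the module's):

/* Refuse to build if struct ucp outgrows the per-socket private area
 * (inet_csk_ca()) that the kernel grants congestion control modules. */
static int __init ucp_register(void)
{
	BUILD_BUG_ON(sizeof(struct ucp) > ICSK_CA_PRIV_SIZE);
	return tcp_register_congestion_control(&tcp_ucp_cong_ops);
}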


6. Function-by-Function Walkthrough (with numeric examples)

6.1 permyriad_to_bbr(val) -- unit conversion

return (u32)(((u64)BBR_UNIT * val) / 10000);

Example: val = 20000 → (256 × 20000) / 10000 = 512, i.e. 2.0×.

6.2 ucp_init_module_params() -- precomputed caches

All module parameters are read and converted:

  • permyriad → BBR_SCALE, stored in static ucp_xxx_val variables.
  • seconds → jiffies, stored in ucp_probe_rtt_xxx_jiffies.
  • Inputs are clamped with max(..., 0) to reject negative values.

Called at module load and after every sysfs or sysctl write.
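
A minimal kernel-context sketch of the two conversions (variable names are illustrative; the real function converts all parameters, and BBR_UNIT comes from the module's defines):

static int ucp_cwnd_gain = 20000;        /* permyriad, module_param */
static int ucp_probe_rtt_base_sec = 10;  /* seconds, module_param   */

static u32 ucp_cwnd_gain_val;                    /* BBR_SCALE cache */
static unsigned long ucp_probe_rtt_base_jiffies; /* jiffies cache   */

static void ucp_init_module_params_sketch(void)
{
	int gain = max(ucp_cwnd_gain, 0);
	int sec  = max(ucp_probe_rtt_base_sec, 1);

	/* permyriad -> BBR_SCALE: 20000/10000 * 256 = 512 (2.0x) */
	ucp_cwnd_gain_val = (u32)div_u64((u64)BBR_UNIT * gain, 10000);
	/* seconds -> jiffies, so the hot path never multiplies by HZ */
	ucp_probe_rtt_base_jiffies = (unsigned long)sec * HZ;
}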

6.3 ucp_update_bw() -- bandwidth estimation and floor

Key code:

if (!rs->is_app_limited || bw >= ucp_max_bw(sk)) {
    if (ucp->net_condition != UCP_COND_CONGESTED && (u32)bw > ucp->max_bw_non_congested)
        ucp->max_bw_non_congested = (u32)bw;
    // pct is selected per net_class
    if (pct && ucp_get_loss_ratio(sk) < ucp_bw_floor_loss_cap_val) {
        u64 floor_bw = (u64)ucp->max_bw_non_congested * pct / 10000;
        if (bw < floor_bw) bw = floor_bw;
    }
    minmax_running_max(&ucp->bw, UCP_BW_RTTS, ucp->rtt_cnt, (u32)bw);
}

Intent: keep BBR's bandwidth estimate from collapsing toward zero after brief idle or app-limited periods, while avoiding stale peak values during congestion.

Worked example:

  • max_bw_non_congested = 100000 (BW_SCALE), pct = 2500 → floor_bw = 100000 * 2500 / 10000 = 25000.
  • A current sample of bw = 20000 is raised to 25000 before the filter update.

6.4 ucp_update_loss_ewma() -- loss EWMA and compression

Formulas:

  • With loss: new = (old * 3 + instant * 1) / 4
  • Without loss: new = old * 70 / 100
  • ucp_set_loss_ewma() then compresses the 16-bit value as (high << 8) | low, where high holds only bit 8.

Worked example (integer division truncates):

  • old = 100 (≈39%), instant = 50 (≈19.5%) → new = (300 + 50) / 4 = 87 (≈34%).
  • old = 87, no loss → new = 87 * 70 / 100 = 60 (≈23.4%): exponential decay.
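
The same arithmetic as a runnable userspace sketch (the helper name is illustrative):

#include <stdio.h>
#include <stdint.h>

/* Loss EWMA update per section 6.4; all integer math truncates,
 * which is why 350/4 yields 87, not 88. */
static uint32_t loss_ewma_step(uint32_t old, uint32_t instant, int saw_loss)
{
	if (saw_loss)
		return (old * 3 + instant * 1) / 4; /* blend: 3/4 old, 1/4 sample */
	return old * 70 / 100;                      /* idle decay: x0.7           */
}

int main(void)
{
	uint32_t v = loss_ewma_step(100, 50, 1);            /* (300+50)/4 = 87 */
	printf("with loss: %u\n", v);
	printf("no loss:   %u\n", loss_ewma_step(v, 0, 0)); /* 87*70/100 = 60  */
	return 0;
}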

6.5 ucp_update_ack_aggregation() -- single-slot compensation

Epoch length = max(min_rtt_us, 1000 us). When now - epoch_start >= epoch_len, the epoch is settled:

  • expected = (bw_max * epoch_us) >> BW_SCALE
  • extra = (extra_acked > expected) ? (extra_acked - expected) : 0
  • extra_acked_max = max(extra_acked_max * 3 / 4, (u8)extra)
  • the epoch is reset, with extra_acked = this_ack_acked

Worked example:

  • bw_max = 83886 (≈10 Mbps), epoch_us = 20000 us → expected = 83886*20000 >> 24 = 83886*20000/16777216 = 100 packets.
  • extra_acked = 120 packets → extra = 20 packets.
  • extra_acked_max was 10; 10*3/4 = 7 (integer division); max(7, 20) = 20.
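
The epoch settlement as a small standalone sketch (names are illustrative; bw_max is in BW_SCALE pkt/us units):

#include <stdint.h>
typedef uint8_t u8;
typedef uint32_t u32;
typedef uint64_t u64;

#define BW_SCALE 24

static void ack_epoch_settle(u8 *extra_acked, u8 *extra_acked_max,
                             u64 bw_max, u32 epoch_us, u8 this_ack_acked)
{
	/* packets the measured bandwidth "should" have delivered this epoch */
	u32 expected = (u32)((bw_max * epoch_us) >> BW_SCALE);
	u32 extra = (*extra_acked > expected) ? *extra_acked - expected : 0;
	u32 decayed = *extra_acked_max * 3 / 4;     /* x0.75 per epoch     */

	*extra_acked_max = (u8)(extra > decayed ? extra : decayed);
	*extra_acked = this_ack_acked;              /* start the new epoch */
}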

6.6 ucp_update_net_condition() -- condition classification

rate_change_ewma first takes the raw trend from ucp_get_delivery_rate_trend(), then smooths it with a 7/8 weight.

rate_change = ucp->rate_change_ewma;
if (rate_change <= -38 && (loss_ratio >= 13 || ecn_ratio > 0)) {
    if (rinc >= 51 || loss_ratio >= 26) new = CONGESTED;
    else new = RANDOM_LOSS;
} else if (loss_ratio > 0) {
    new = RANDOM_LOSS;
} else {
    new = LIGHT_LOAD;
}

Hysteresis:

  • Moving from any other state to CONGESTED requires cond_confirm >= 3; cond_confirm increments on each disagreeing sample and resets on agreement.
  • Leaving CONGESTED requires only 2 consistent samples.

Intent: avoid flapping at the edge of a congestion signal, at the cost of slower response (hysteresis).
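
A minimal sketch of the 3-in / 2-out hysteresis (field and helper names are illustrative):

#include <stdint.h>
typedef uint32_t u32;

enum cond { COND_IDLE, COND_LIGHT_LOAD, COND_CONGESTED, COND_RANDOM_LOSS };

/* A transition only commits after enough consecutive samples disagree
 * with the current state: 3 to enter CONGESTED, 2 for everything else. */
struct cond_state { enum cond cur; u32 confirm; };

static void cond_apply(struct cond_state *s, enum cond sample)
{
	u32 need = (sample == COND_CONGESTED) ? 3 : 2; /* enter vs. exit   */

	if (sample == s->cur) {
		s->confirm = 0;                        /* agreement: reset */
		return;
	}
	if (++s->confirm >= need) {                    /* enough dissent   */
		s->cur = sample;
		s->confirm = 0;
	}
}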

6.7 ucp_update_net_class() -- path classification

The decision tree, evaluated in order:

if (loss > high_loss_thresh) candidate = CONGESTED;
else if (avg_rtt < 5000 && jitter < 3000 && loss < 0.256) candidate = LAN;
else if (loss > 7.68 && jitter > 20000) candidate = MOBILE;
else if (avg_rtt > 80000 && loss > 2.56) candidate = LOSSY_FAT;
else if (rinc >= 128 && loss >= low_loss_thresh) candidate = CONGESTED;
else if (avg_rtt > 60000) candidate = VPN;
else candidate = DEFAULT;

Switching classes requires class_confirm >= 2.

6.8 ucp_get_cycle_pacing_gain() -- choosing the PROBE_BW probe gain

if (cycle_idx == 0) {
    if (loss >= probe_skip_loss || rinc >= probe_skip_rtt_rise)
        return BBR_UNIT;
    if (!bbr_mode && net_class == MOBILE && loss >= drain_loss_thresh)
        return ucp_probe_gain_mobile_val;
    return ucp_probe_gain_val;
}
if (cycle_idx == 1) {
    return probe_gain_applied ? (BBR_UNIT*3/4) : BBR_UNIT;
}
return BBR_UNIT;

Worked example: with loss = 3% (7.68/256 ≈ 3%) and probe_skip_loss_val = 5.12 (2%), the skip condition fires and the probe phase returns 1.0×: no probing.

6.9 ucp_congestion_level() -- determining the congestion severity level

if (rinc >= 256 || loss >= high_loss_thresh) return SEVERE;
if (rinc >= 128 || loss >= drain_loss_thresh) return MODERATE;
if (rinc >= 64 || loss >= low_loss_thresh) return MILD;
return NONE;

6.10 ucp_set_cwnd() -- cwnd computation in detail

Steps:

  1. If acked == 0, skip to the end.
  2. Check recovery: when entering fast recovery, use the special recovery cwnd and skip the normal path.
  3. target = ucp_bdp(bw, cwnd_gain) + ucp_quantization_budget().
  4. ACK compensation: target += (extra_acked_max * extra_acked_gain_val) >> BBR_SCALE.
  5. From the min_rtt-based BDP: lo = max(4, bdp_minrtt * 1.25), hi = max(lo, bdp_minrtt * 2.0), target = clamp(target, lo, hi).
  6. If full_bw_reached: cwnd = min(cwnd + acked, target); otherwise cwnd += acked.
  7. Enforce cwnd >= 4.
  8. If probe_rtt_restored: cwnd = max(cwnd, prior_cwnd).
  9. If mode == PROBE_RTT: cwnd = min(cwnd, 4).
  10. Finally, tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp).

Note: during STARTUP (full_bw_reached still false), cwnd grows by acked with no ceiling, allowing exponential growth.
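
A userspace sketch of steps 3-5, the target computation and clamp (helper names and the exact helper split are illustrative):

#include <stdint.h>
typedef uint32_t u32;
typedef uint64_t u64;

#define BW_SCALE 24
#define BBR_UNIT 256

/* bw in BW_SCALE pkt/us, rtt in us -> packets after applying a BBR_SCALE gain */
static u32 bdp_pkts(u64 bw, u32 rtt_us, u32 gain)
{
	u64 bdp = (bw * rtt_us) >> BW_SCALE;   /* packets at 1.0x      */
	return (u32)((bdp * gain) >> 8);       /* apply BBR_SCALE gain */
}

static u32 cwnd_target(u64 bw, u32 model_rtt_us, u32 min_rtt_us,
                       u32 cwnd_gain, u32 ack_comp)
{
	u32 target = bdp_pkts(bw, model_rtt_us, cwnd_gain) + ack_comp;
	u32 lo = bdp_pkts(bw, min_rtt_us, BBR_UNIT * 125 / 100); /* 1.25x BDP */
	u32 hi = bdp_pkts(bw, min_rtt_us, BBR_UNIT * 200 / 100); /* 2.00x BDP */

	if (lo < 4) lo = 4;
	if (hi < lo) hi = lo;
	return target < lo ? lo : (target > hi ? hi : target);
}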


7. BBR vs UCP: Performance Differences in Detail

7.1 Why throughput drops by 2% to 7%

| Cause | Estimated contribution | Explanation |
|---|---|---|
| Bandwidth floor side effects | 1-3% | On clean links the floor never binds. Under light, non-congestive loss it lifts low samples, so the running-max filter stays high; the inflated estimate raises the cwnd target, and under mild congestion the extra inflight causes more loss and retransmission, so net goodput can fall even though the raw estimate is higher. |
| Single-slot ACK compensation decay | 0.5-2% | Compared with BBR's dual slot, UCP's extra_acked_max decays faster, so the cwnd bonus is smaller. |
| Probe skipping | 0.5-2% | With loss ≥ 2% or an RTT rise ≥ 40%, probing stops entirely and bandwidth growth may be missed. |
| Spurious one-shot drain | 0-1% | If loss immediately after a probe triggers a drain, the rate dips temporarily. |
| cwnd ceiling | 0 | Only binds in the CONGESTED state; no effect on clean networks. |
| Reduced STARTUP gain | 0-1% | Only triggers on loss; small impact on short flows. |

Measured scenario: on a 10 Mbps loss-free path, BBR settles at 9.8 Mbps while UCP may reach only 9.3 Mbps (about -5%). On a path with 1% random loss, BBR may drop to 6 Mbps with heavy oscillation while UCP holds a smoother 7 Mbps; there, UCP beats BBR.

7.2 Smoothness gains

  • BBR runs one probe/drain cycle every 8 RTTs; the rate waveform is a sawtooth swinging between 1.25× and 0.75× (roughly a 67% peak-to-trough swing).
  • UCP flattens the waveform through probe skipping, bandwidth floors, and the congestion state machine.
  • Video bitrate adaptation typically adjusts every 2-5 seconds; UCP's stability reduces bitrate switches and improves QoE.

7.3 Loss behavior differences

  • BBR: loss does not directly reduce the send rate (unless recovery is entered), but delivery-rate samples taken under loss can be biased low, dragging down the bandwidth estimate.
  • UCP: the bandwidth floor keeps the send rate up under light loss, so the loss rate can be slightly higher than BBR's. The designers consider this an acceptable trade: slightly more retransmission for a smooth rate.

7.4 Examples of hysteresis-induced problems

  1. When the path switches from high latency (100 ms) to low latency (10 ms), net_class needs 2 consistent samples before moving from VPN or DEFAULT to LAN. Meanwhile the floor may still be 0 (or low), yet min_rtt has already fallen and the BDP has shrunk, so cwnd may sit too low.
  2. After congestion clears, net_condition needs 2 consistent ACKs to move from CONGESTED to LIGHT_LOAD, so the cwnd ceiling still binds for an extra 1-2 RTTs.

8. Suitable and Unsuitable Scenarios

8.1 Suitable

  • Live and on-demand streaming (Twitch, YouTube, Netflix): a smooth rate curve matters more than peak throughput.
  • Wireless/mobile networks (4G/5G, WiFi): background loss without congestion; the bandwidth floor helps sustain the experience.
  • Satellite links (high latency, occasional loss): STARTUP loss protection and probe skipping avoid overshooting the path.
  • VPN tunnels (encapsulation overhead, stable latency): the floor is disabled by default, so behavior stays close to BBR.

8.2 Unsuitable, or use with caution

  • Intra-datacenter bulk transfer (10G+, loss-free): a ~5% throughput loss is unacceptable; use native BBR or BBRv2.
  • Very short flows (HTTP requests under 10 RTTs): hysteresis means the connection may finish before the correct class is chosen; little benefit.
  • Real-time audio/gaming (extremely loss-sensitive): UCP's insistence on sending can raise the loss rate, which hurts.
  • Bursty traffic peaks (e.g., upload spikes): probe skipping may make connections miss bandwidth opportunities.

8.3 Tuning

  • To get closer to BBR's throughput, set ucp_bbr_mode=1 and also set ucp_extra_acked_gain=0.
  • For smoothness on mobile networks, keep the defaults; optionally raise ucp_bw_floor_mobile to 3000 (30%).
  • If flows are frequently misclassified as RANDOM_LOSS, raise ucp_low_loss_thresh to 200 (2%) or higher.
  • To react faster to network changes, lower UCP_COND_CONFIRM_ENTER or UCP_CLASS_CONFIRM_CNT (requires editing the source).

9. Limitations and Trade-Offs

| Item | Description | Rationale |
|---|---|---|
| Simplified ACK compensation | single u8 slot, 3/4 decay per epoch | 104-byte limit: trade accuracy for space |
| Fixed-threshold classification | cannot adapt to a path's baseline | keeps state simple, at the risk of misclassification |
| Floor depends on the peak | the peak resets after congestion and must re-accumulate | avoids using a stale high peak |
| Discrete cwnd caps | only three levels, no continuous function | simple to implement, intuitive to tune |
| Hysteresis | state switches need multiple confirmations | avoids flapping, but responds slowly |
| No multi-flow coordination | each flow is independent | stays simple; relies on standard TCP fairness |
| Throughput below BBR | 2-7% loss | traded for smoothness and loss tolerance |
| Possibly more loss than BBR | caused by the bandwidth floor | accepted by design, for steadiness |

10. Common Debugging and Troubleshooting Commands

# show current module parameters
cat /sys/module/tcp_ucp/parameters/ucp_bbr_mode
sysctl net.ucp.ucp_bbr_mode

# monitor connection state
ss -ti --bbr | grep -E 'bbr_bw|ucp'

# modify a parameter live and observe (ucp_cwnd_cap_severe as an example)
echo 15000 > /sys/module/tcp_ucp/parameters/ucp_cwnd_cap_severe
# then capture packets, or watch cwnd with ss

# disable UCP features entirely (pure BBR mode + ACK compensation off)
echo 1 > /sys/module/tcp_ucp/parameters/ucp_bbr_mode
echo 0 > /sys/module/tcp_ucp/parameters/ucp_extra_acked_gain

# restore defaults
echo 0 > /sys/module/tcp_ucp/parameters/ucp_bbr_mode
echo 10000 > /sys/module/tcp_ucp/parameters/ucp_extra_acked_gain

11. Source Code

/* ---- tcp_ucp.c : TCP UCP Congestion Control Module v1.0 ------------------ */
/*
 * Universal Communication Protocol (UCP)
 * A Non-Destructive Constraint Layer for BBR Congestion Control
 * with Programmable Path-Aware Policies
 *
 * ---------------------------------------------------------------------------
 * 1.  ALGORITHM OVERVIEW
 * ---------------------------------------------------------------------------
 *
 * UCP is not a new congestion control algorithm.  It is a protective wrapper
 * around the existing BBRv1 state machine.  The core BBR engine (pacing-gain
 * cycle, bandwidth minmax filter, PROBE_RTT state machine, cwnd_gain-based
 * window control) runs unmodified.  UCP interposes lightweight constraints ---
 * ceilings and floors --- that engage only when runtime path classifiers detect
 * non-ideal conditions.  When the path behaves ideally, UCP is transparent
 * and BBR runs exactly as Google designed it.
 *
 * The fundamental premise: BBR's fixed gains (2.0x cwnd, 1.25x/0.75x probe
 * cycle, 2.89x STARTUP) are optimal for a statistically "typical" Internet
 * path, but can over-consume buffer or under-utilize bandwidth on hostile
 * links (high loss, cellular jitter, VPN encapsulation, satellite latency).
 * Rather than retuning BBR's internal gains --- which would require re-proving
 * the algorithm's stability across all path types --- UCP adds a detection and
 * correction layer that acts only when the path demonstrably strays from the
 * ideal.
 *
 * Per-ACK processing pipeline (ucp_main -> ucp_update_model):
 *
 *   TCP ACK arrives with rate_sample
 *          |
 *          v
 *   +------------------+    +-----------------------+
 *   | ucp_update_bw()  |--->| bw minmax filter      |
 *   +------------------+    | (10-round running max) |
 *          |                +-----------------------+
 *          v
 *   +------------------+    +-----------------------+
 *   | ucp_update_loss  |--->| loss_ewma (u8, 9-bit  |
 *   | _ewma()          |    |  compresses 0..256)   |
 *   +------------------+    +-----------------------+
 *          |
 *          v
 *   +------------------+    +-----------------------+
 *   | ucp_update_ack   |--->| extra_acked tracking  |
 *   | _aggregation()   |    | (single epoch slot,   |
 *   +------------------+    |  exponential decay)   |
 *          |                +-----------------------+
 *          v
 *   +------------------+    +-----------------------+
 *   | ucp_update_net   |--->| 4-condition classifier|
 *   | _condition()     |    | IDLE/LIGHT_LOAD/      |
 *   +------------------+    | CONGESTED/RANDOM_LOSS  |
 *          |                +-----------------------+
 *          v
 *   +------------------+    +-----------------------+
 *   | ucp_update_net   |--->| 6-class path          |
 *   | _class()         |    | DEFAULT/LAN/MOBILE/   |
 *   +------------------+    | LOSSY_FAT/CONGESTED/  |
 *          |                | VPN                   |
 *          v                +-----------------------+
 *   +------------------+
 *   | ucp_update_cycle |
 *   | _phase()         |---> 8-phase PROBE_BW gain cycle
 *   +------------------+
 *          |
 *          v
 *   +------------------+    +-----------------------+
 *   | ucp_check_full   |--->| STARTUP pipe-full      |
 *   | _bw_reached()    |    | detection (3 rounds)   |
 *   +------------------+    +-----------------------+
 *          |
 *          v
 *   +------------------+    +-----------------------+
 *   | ucp_update_min   |--->| min_rtt filter +      |
 *   | _rtt()           |    | PROBE_RTT state mchn  |
 *   +------------------+    +-----------------------+
 *          |
 *          v
 *   +------------------+
 *   | mode-specific    |---> pacing_gain / cwnd_gain assignment
 *   | gains            |
 *   +------------------+
 *          |
 *          v
 *   +------------------+    +-----------------------+
 *   | ucp_apply_pacing |--->| queued one-shot drain |
 *   | _constraints()   |    | (non-destructive #1)  |
 *   +------------------+    +-----------------------+
 *          |
 *          v
 *   +------------------+    +-----------------------+
 *   | ucp_apply_cwnd   |--->| graduated cwnd caps:  |
 *   | _constraints()   |    | 1.75x/1.50x/1.25x BDP |
 *   +------------------+    | (non-destructive #2)  |
 *          |                +-----------------------+
 *          v
 *   +------------------+
 *   | ucp_set_cwnd()   |---> BDP * gain + ACK compensation + quantize
 *   +------------------+
 *          |
 *          v
 *   +------------------+
 *   | ucp_set_pacing   |---> sk_pacing_rate = bw * pacing_gain
 *   | _rate()          |
 *   +------------------+
 *
 *
 * ---------------------------------------------------------------------------
 * 2.  LAYERED ARCHITECTURE
 * ---------------------------------------------------------------------------
 *
 *   Layer 0 --- BBR Engine (unmodified core)
 *   ---------------------------------------
 *   - Bandwidth estimation via minmax running-max (10-round window)
 *   - 8-phase PROBE_BW gain cycle: probe(1.10x) / drain(0.75x) / cruise(1.0x)
 *   - STARTUP (2.89x high gain) / DRAIN (0.346x) / PROBE_RTT (cwnd=4)
 *   - Pacing-based transmission with cwnd_gain = 2.0x BDP
 *   - min_rtt tracking with 10-second refresh interval
 *
 *   Layer 1 --- Runtime Path Classifiers (UCP-specific)
 *   --------------------------------------------------
 *   - net_condition: four-state classifier driven by:
 *       . delivery rate trend EWMA (signed, BBR_SCALE)
 *       . loss_ewma (smoothed packet loss ratio)
 *       . ecn_ewma (smoothed ECN marking ratio)
 *       . RTT increase ratio = (avg_rtt - min_rtt) / min_rtt
 *     Transitions use 2-3 sample hysteresis to prevent flapping.
 *   - net_class: six-way path taxonomy determined by:
 *       . RTT magnitude and stability
 *       . loss ratio and jitter
 *       . congestion persistence
 *     Each class has its own bandwidth floor percentage, probe gain
 *     override, and PROBE_RTT interval modifier.
 *
 *   Layer 2 --- Non-Destructive Constraints (UCP-specific)
 *   -----------------------------------------------------
 *   - cwnd ceilings: when net_condition == CONGESTED, cwnd_gain is
 *     capped to a graduated level based on congestion severity:
 *       Mild:     1.75x BDP  (default 17500 permyriad)
 *       Moderate: 1.50x BDP  (default 15000 permyriad)
 *       Severe:   1.25x BDP  (default 12500 permyriad)
 *   - Bandwidth floor: when net_condition != CONGESTED, the BtlBw
 *     estimate is floored to a configurable percentage of the peak
 *     bandwidth observed while non-congested.  This prevents BBR's
 *     bandwidth filter from decaying to zero on lossy paths.
 *     Floor is disabled when loss_ewma exceeds the loss cap (5%).
 *   - STARTUP loss capping: in STARTUP phase, if loss exceeds a soft
 *     threshold (0.5%) the pacing gain is reduced from 2.89x to 2.50x;
 *     if loss exceeds a hard threshold (2.0%) the gain is clamped to
 *     1.0x (no probe).  This prevents the exponential growth phase
 *     from overwhelming a path that cannot sustain it.
 *   - Early drain: if loss is detected after a PROBE_BW probe phase,
 *     a one-shot drain is queued to quickly shed the probe's queue.
 *     Three drain levels (light/standard/aggressive) correspond to
 *     increasing loss severity.
 *   - Probe skip: in PROBE_BW, the probe phase (index 0, gain 1.10x)
 *     is skipped when loss_ewma >= 2% or RTT rise >= 40%, preventing
 *     further queue buildup on an already-congested path.
 *
 *   Layer 3 --- ACK Aggregation Compensation (UCP-specific)
 *   ------------------------------------------------------
 *   Approximates Google BBR's extra_acked mechanism using a single-slot
 *   epoch with exponential decay (vs. BBR's dual-slot sliding window).
 *   Tracks excess ACK counts per min_rtt-scale epoch; the max excess
 *   is added to target_cwnd.  This prevents pacing stalls during
 *   delayed-ACK and ACK-compression bursts.  Disabled by default
 *   (ucp_extra_acked_gain = 0).
 *
 *
 * ---------------------------------------------------------------------------
 * 3.  KNOWN WEAKNESSES AND LIMITATIONS
 * ---------------------------------------------------------------------------
 *
 *   [L1] 104-byte struct bound (ICSK_CA_PRIV_SIZE)
 *   ------------------------------------------------
 *   This is the single most severe constraint in the entire design.
 *   struct ucp is limited to 104 bytes by the kernel's inet_csk_ca()
 *   allocation.  It forces:
 *     - u16 -> u8 compression for both loss_ewma and ecn_ewma, with
 *       1-bit overflow flags that must be manually reassembled on every
 *       read via (high_bit << 8) | low_byte.  This adds two conditional-
 *       free getter instructions per EWMA access.
 *     - ACK aggregation compensation limited to a single epoch slot
 *       (6 bytes total) instead of BBR's dual sliding window (16+ bytes).
 *     - No per-RTT RTT trend filter (would require 4-8 bytes).
 *     - No ECN loss-differentiation history (would require 4+ bytes).
 *     - No room for explicit cwnd reduction history or flow-level pacing
 *       state beyond what the TCP stack already provides.
 *     Any addition of per-flow state requires expanding ICSK_CA_PRIV_SIZE
 *     in the kernel, which is a non-trivial interface change.
 *
 *   [L2] ACK aggregation compensation is a single-slot approximation
 *   ----------------------------------------------------------------
 *   BBR's extra_acked uses a 2-element sliding window of epoch excess
 *   values and takes the max of the two.  UCP uses a single excess max
 *   that is exponentially decayed (x 0.75) each epoch, then compared
 *   with the current epoch's excess.  This means that:
 *     - A single large excess epoch followed by a quiet epoch will decay
 *       the compensation by 25% even if the second epoch should have no
 *       influence.  BBR's dual-slot window would retain the full excess
 *       for two full epochs.
 *     - The maximum compensatable excess is 255 packets (u8 saturation)
 *       vs. BBR's u32 counter (effectively unlimited).
 *     In practice, u8 saturation at 255 is rarely reached on typical
 *     (non-datacenter) paths, and the exponential decay converges to
 *     the correct steady-state value within 3-4 epochs.  The impact is
 *     a 1-3% throughput penalty on heavily ACK-compressed paths relative
 *     to full BBR.
 *
 *   [L3] Condition classifier uses EWMA + fixed thresholds
 *   -------------------------------------------------------
 *   The four net_condition states are separated by fixed permyriad
 *   thresholds (loss 1%/5%/10%, RTT rise 20%).  A fixed-threshold
 *   classifier cannot adapt to path-specific baseline loss or jitter
 *   without manual parameter tuning.  An adaptive Bayesian or CUSUM
 *   classifier would improve accuracy on non-stationary paths but was
 *   excluded due to the 104-byte limit (would require 12-20 bytes for
 *   sufficient history).
 *
 *   [L4] cwnd ceilings are graduated but linear
 *   --------------------------------------------
 *   The congestion severity levels (MILD/MODERATE/SEVERE) map to fixed
 *   cwnd_gain caps (1.75x/1.50x/1.25x).  These steps are arbitrary and
 *   may not match the actual convexity of the congestion signal.  A
 *   continuously-variable cap derived from loss magnitude would be more
 *   theoretically sound, but adds complexity and tuning surface.
 *
 *   [L5] Bandwidth floor is a percentage, not a measurement
 *   --------------------------------------------------------
 *   The floor is computed as a fixed percentage of the current non-
 *   congested peak bandwidth.  On paths where the peak itself is
 *   inaccurate (e.g., highly variable radio links), the floor
 *   inherits that inaccuracy.  A floor based on the 10th percentile
 *   of recent BW samples would be more robust, but requires history.
 *
 *   [L6] No multi-flow coordination
 *   --------------------------------
 *   UCP is a per-flow controller.  It does not coordinate across
 *   multiple flows sharing the same bottleneck.  The non-destructive
 *   constraints help prevent a single UCP flow from dominating a
 *   buffer, but UCP offers no mechanism for flow-aware fairness
 *   enforcement beyond what the TCP stack already provides.
 *
 *   [L7] Pure BBR mode is best-effort
 *   ----------------------------------
 *   ucp_bbr_mode=1 bypasses UCP's classifiers and constraints, but
 *   the BBR state machine still runs inside the same struct ucp.
 *   ACK aggregation compensation, when enabled, applies regardless
 *   of mode.  Users seeking bit-for-bit identical BBR behavior should
 *   use the kernel's built-in tcp_bbr module instead.
 *
 *
 * ---------------------------------------------------------------------------
 * 4.  PARAMETER SYSTEM
 * ---------------------------------------------------------------------------
 *
 * All user-facing parameters use a unified permyriad (per ten thousand)
 * scale:
 *   gain  2.0x = 20000 permyriad
 *   loss  1.0% =   100 permyriad
 *   floor  25% =  2500 permyriad
 *
 * Internal arithmetic uses fixed-point scales:
 *   BBR_SCALE (8 bits, 1.0 = 256) for gain/ratio operations
 *   BW_SCALE  (24 bits, 1 pkt/us = 2^24) for bandwidth
 *
 * Module parameters are exported to /sys/module/tcp_ucp/parameters/ for
 * runtime tuning.  All per-ACK hot-path values are pre-computed at module
 * init time (permyriad -> BBR_SCALE, seconds -> jiffies) to avoid division
 * in the datapath.
 *
 * RUNTIME RECONFIGURATION
 *
 * The module supports two dynamic configuration methods that do NOT require
 * unloading and reloading the module:
 *
 *   Method A --- sysfs (writes to /sys/module/tcp_ucp/parameters/)
 *   ------------------------------------------------------------
 *   All 33 parameters use module_param_cb() with a custom setter callback.
 *   Writing to /sys/module/tcp_ucp/parameters/<name> triggers an immediate
 *   call to ucp_init_module_params(), which recomputes all internal cached
 *   values (BBR_SCALE gains, jiffies intervals, etc.).  The new value is
 *   active on the very next ACK processed by any UCP connection.
 *
 *     # echo 10000 > /sys/module/tcp_ucp/parameters/ucp_extra_acked_gain
 *     # echo 1     > /sys/module/tcp_ucp/parameters/ucp_bbr_mode
 *
 *   Method B --- sysctl (writes to /proc/sys/net/ucp/ via sysctl -w)
 *   ----------------------------------------------------------------
 *   The module registers a sysctl table at module init via register_sysctl().
 *   Each parameter appears as net.ucp.ucp_<name> and uses a custom proc_handler
 *   that delegates to proc_dointvec and then calls ucp_init_module_params().
 *   This enables both temporary (sysctl -w) and persistent (/etc/sysctl.conf,
 *   /etc/sysctl.d/ conf snippet) configuration:
 *
 *     # sysctl -w net.ucp.ucp_extra_acked_gain=10000
 *     # sysctl -w net.ucp.ucp_bbr_mode=1
 *
 *     # echo "net.ucp.ucp_extra_acked_gain = 10000" >> /etc/sysctl.d/ucp.conf
 *     # echo "net.ucp.ucp_bbr_mode = 1"           >> /etc/sysctl.d/ucp.conf
 *     # sysctl -p /etc/sysctl.d/ucp.conf
 *
 *   The two methods are equivalent --- they modify the same underlying int variables
 *   and both call ucp_init_module_params() after a successful write.
 *
 *
 * ---------------------------------------------------------------------------
 * 5.  FUTURE DIRECTIONS (v1.1+)
 * ---------------------------------------------------------------------------
 *
 *   - Increase ICSK_CA_PRIV_SIZE to 128 in the target kernel.
 *   - Dual-slot extra_acked window for exact BBR compatibility.
 *   - Per-RTT RTT trend minmax filter for improved congestion detection.
 *   - ECN loss-differentiation history for accurate non-congestive loss
 *     classification on L4s-enabled paths.
 *   - Continuously-variable cwnd cap derived from excess queue
 *     measurement (BBR's "inflight cap" model).
 *
 *
 * Copyright (c) 2026 PPP PRIVATE NETWORK™ X
 * SPDX-License-Identifier: Dual BSD/GPL
 */

#include <linux/module.h>       /* Linux kernel module interface: module_param, MODULE_PARM_DESC, module_init, module_exit */
#include <net/tcp.h>            /* Core TCP types: tcp_sock, tcp_congestion_ops, rate_sample, tcp_register_congestion_control */
#include <linux/inet_diag.h>    /* Socket diagnostics interface: INET_DIAG_BBRINFO enum, union tcp_cc_info for ss -i output */
#include <linux/win_minmax.h>   /* Sliding window min/max filter: struct minmax, minmax_running_max function */
#include <linux/math64.h>       /* 64-bit math helpers: div64_u64, div64_long for safe integer division */
#include <linux/random.h>       /* Kernel PRNG: prandom_u32_max for randomized PROBE_BW phase start offset */
#include <linux/sysctl.h>       /* Sysctl interface: proc_dointvec, register_sysctl for sysctl -w support */

/* ---- Fixed-Point Scales (hardware-friendly constants, do NOT change) ---- */
#define BW_SCALE 24              /* Number of fractional bits in bandwidth fixed-point representation (2^24 = 16,777,216 units per 1.0); matches BBR BW_SCALE */
#define BW_UNIT  (1 << BW_SCALE) /* Unity (1.0) in bandwidth fixed-point domain: value 2^24 = 16,777,216; represents 1 packet per microsecond at this scale */
#define BBR_SCALE 8              /* Number of fractional bits in gain/ratio fixed-point representation (2^8 = 256 units per 1.0); matches BBR BBR_SCALE */
#define BBR_UNIT  (1 << BBR_SCALE)/* Unity (1.0x) in BBR gain domain: value 256; represents a multiplier of exactly 1.0; all gain values use this unit */

/* ---- UCP Finite State Machine Modes (mirror BBRv1) ---------------------- */
/**
 * enum ucp_mode - UCP congestion control operational modes
 * @UCP_STARTUP:   Exponential bandwidth search phase; pacing_gain ~= 2.89x to rapidly probe available bandwidth
 * @UCP_DRAIN:     One-shot queue drain phase after STARTUP completes; pacing_gain ~= 0.346x to drain excess inflight
 * @UCP_PROBE_BW:  Steady-state bandwidth probing; uses an 8-phase gain cycle to periodically test for more bandwidth
 * @UCP_PROBE_RTT: Min-RTT refresh phase; cwnd clamped to 4 packets for 200 ms to obtain an uncontaminated RTT sample
 */
enum ucp_mode {
	UCP_STARTUP   = 0, /* Value 0: Exponential bandwidth search, pacing_gain ~= 2.89x, same as BBR STARTUP; sends at ~2.89x estimated BDP to fill the pipe */
	UCP_DRAIN     = 1, /* Value 1: One-shot queue drain after pipe is full, pacing_gain ~= 0.346x (BBR_UNIT * 1000 / 2885); drains excess queue built in STARTUP */
	UCP_PROBE_BW  = 2, /* Value 2: Steady state, 8-phase gain cycle cycling between probe (1.10x), drain (0.75x), and cruise (1.0x) phases */
	UCP_PROBE_RTT = 3  /* Value 3: Refresh min_rtt every ~10 seconds, cwnd clamped to 4 pkts for 200 ms duration to drain queues for a clean min RTT sample */
};

/* ---- Network Condition Classification ---------------------------------- */
/**
 * enum ucp_net_cond - Classification of current network conditions
 * @UCP_COND_IDLE:        No delivery-rate history yet available (connection just started or was idle)
 * @UCP_COND_LIGHT_LOAD:  Low loss below low_loss_threshold and stable RTT; indicates uncongested path
 * @UCP_COND_CONGESTED:   Delivery rate is dropping AND either high loss or ECN marking is present; indicates true congestion
 * @UCP_COND_RANDOM_LOSS: Loss is present but RTT rise <= 20%; indicates non-congestive packet loss (e.g., wireless corruption)
 */
enum ucp_net_cond {
	UCP_COND_IDLE        = 0, /* Value 0: No delivery-rate history yet available; initial state before any ACK processing; no condition classification possible */
	UCP_COND_LIGHT_LOAD  = 1, /* Value 1: loss < low_loss_thresh, stable RTT; network is underutilized or lightly loaded; no congestion signals present */
	UCP_COND_CONGESTED   = 2, /* Value 2: rate dropping AND (high loss OR ECN present); standard congestion detection with multiple corroborating signals */
	UCP_COND_RANDOM_LOSS = 3  /* Value 3: loss present, RTT rise <= 20%, not congestion-related; e.g., wireless packet corruption or transient bit errors */
};

/* ---- Network Path Class (impacts ProbeRTT interval and BtlBw floor) ---- */
/**
 * enum ucp_net_class - Classification of the network path type
 * @UCP_CLASS_DEFAULT:   Unclassified or mixed characteristics; use default parameters
 * @UCP_CLASS_LAN:       Low-latency local network: RTT < 5 ms, jitter < 3 ms, loss < 0.1%
 * @UCP_CLASS_MOBILE:    Cellular/mobile path: loss > 3%, jitter > 20 ms; requires conservative probing
 * @UCP_CLASS_LOSSY_FAT: High-latency lossy path: RTT > 80 ms, background loss > 1%; typical of satellite links
 * @UCP_CLASS_CONGESTED: Persistently congested: RTT rise >= 50% AND loss >= 1%
 * @UCP_CLASS_VPN:       VPN tunnel: RTT > 60 ms with stable elevated latency; may have encapsulation overhead
 */
enum ucp_net_class {
	UCP_CLASS_DEFAULT   = 0, /* Value 0: Default/unclassified path; uses standard parameter set; no special handling */
	UCP_CLASS_LAN       = 1, /* Value 1: Local area network: RTT < 5 ms, jitter < 3 ms, loss < 0.1%; bandwidth floor disabled */
	UCP_CLASS_MOBILE    = 2, /* Value 2: Mobile/cellular: loss > 3%, jitter > 20 ms; reduce probe aggression, increase probe interval */
	UCP_CLASS_LOSSY_FAT = 3, /* Value 3: Lossy fat pipe: RTT > 80 ms, background loss > 1%; satellite-like links with high BDP */
	UCP_CLASS_CONGESTED = 4, /* Value 4: Congested: RTT rise >= 50% AND loss >= 1%; persistent queue buildup, conservative handling */
	UCP_CLASS_VPN       = 5  /* Value 5: VPN tunnel: RTT > 60 ms, stable elevated latency; tunnel encapsulation overhead expected */
};

/* ---- Per-Connection State (must fit within ICSK_CA_PRIV_SIZE = 104) ---- */
/**
 * struct ucp - Per-connection UCP congestion control state
 * @min_rtt_us:           Minimum round-trip time observed in the current measurement window (microseconds)
 * @min_rtt_stamp:        Jiffies timestamp (kernel timer ticks) when min_rtt_us was last updated; used for filter expiry
 * @probe_rtt_done_stamp: Absolute jiffies value when the PROBE_RTT phase is scheduled to end; 0 when not in PROBE_RTT
 * @bw:                   Running-maximum filter for bottleneck bandwidth (BtlBw), stored in BW_SCALE, spanning 10 packet-timed rounds
 * @rtt_cnt:              Number of packet-timed rounds completed since connection start; monotonically increasing counter
 * @next_rtt_delivered:   Value of tp->delivered at the start of the current packet-timed round; used to detect round boundaries
 * @cycle_mstamp_lo:      Lower 32 bits of the 64-bit timestamp (in microseconds) when the current PROBE_BW gain phase started
 * @cycle_mstamp_hi:      Upper 32 bits of the 64-bit PROBE_BW phase start timestamp
 * @mode:                 2-bit field: current ucp_mode value (0..3); controls gain selection and state machine transitions
 * @prev_ca_state:        3-bit field: previous TCP congestion algorithm state before the most recent state transition
 * @round_start:          1-bit flag: set to true on the first ACK of a new packet-timed round; cleared after bandwidth update
 * @idle_restart:         1-bit flag: set to true after an application-limited idle period; suppresses overly aggressive behavior
 * @probe_rtt_round_done: 1-bit flag: set to true after at least one full packet-timed round completes during PROBE_RTT
 * @fast_recovery:        1-bit flag: set to true while in non-congestion fast recovery (loss recovery without congestion signal)
 * @net_condition:        2-bit field: current ucp_net_condition classification (0..3); guides cap selection
 * @net_class:            3-bit field: current ucp_net_class classification (0..5); affects probe intervals and bandwidth floors
 * @rate_hist_idx:        1-bit field: index (0 or 1) into the 2-slot deliv_rate_hist[] circular buffer
 * @rtt_hist_idx:         1-bit field: index (0 or 1) into the 2-slot rtt_history[] circular buffer
 * @drain_pending:        2-bit field: queued drain level (0 = none, 1..3 = light/standard/aggressive); consumed by ucp_apply_pacing_constraints
 * @cond_confirm:         3-bit field: hysteresis counter for confirming network condition transitions (0..7)
 * @class_confirm:        3-bit field: hysteresis counter for confirming network class transitions (0..7)
 * @min_rtt_fast_fall_cnt: 2-bit field: consecutive RTT samples below 75% of current min_rtt (0..3); triggers fast min_rtt update
 * @cycle_idx:            3-bit field: current PROBE_BW gain phase index (0..7); selects pacing gain from the 8-phase cycle
 * @probe_gain_applied:   1-bit flag: set if the most recent PROBE_BW phase 0 applied a gain greater than 1.0x
 * @padding1:             2-bit explicit padding to complete the first 32-bit bitfield word (total field bits = 2+3+1+1+1+1+2+3+1+1+2+3+3+2+3+1+2 = 32); reserved, do not use
 * @pacing_gain:          12-bit field: current pacing gain in BBR_SCALE; controls inter-packet transmission spacing
 * @cwnd_gain:            10-bit field: current congestion window gain in BBR_SCALE; controls window growth relative to BDP
 * @full_bw_reached:      1-bit flag: set to true once STARTUP bandwidth growth has stalled (pipe considered full)
 * @full_bw_cnt:          2-bit field: consecutive packet-timed rounds without 1.25x bandwidth growth (0..3); triggers full_bw_reached
 * @has_seen_rtt:         1-bit flag: set to true on the first valid SRTT sample received; used to gate RTT-based initialization
 * @has_delayed_ack:      1-bit flag: copy of rs->is_ack_delayed for the current ACK; used to filter delayed-ACK RTT samples
 * @probe_rtt_restored:   1-bit flag: set to true right after cwnd is restored from PROBE_RTT clamping; cleared after restoration applied
 * @loss_ewma_high:       1-bit overflow (bit 8) of loss EWMA; combined with loss_ewma as (high << 8) | low to represent 0..BBR_UNIT (256 = 100% loss); compressed 9-bit encoding to save space
 * @ecn_ewma_high:        1-bit overflow (bit 8) of ECN EWMA; combined with ecn_ewma as (high << 8) | low to represent 0..BBR_UNIT (256 = 100% ECN marking); compressed 9-bit encoding to save space
 * @padding2:             2-bit explicit padding to complete the second 32-bit bitfield word (was 4 bits before loss_ewma_high + ecn_ewma_high); reserved, do not use
 * @prior_cwnd:           cwnd value saved before entering recovery or PROBE_RTT; used to restore cwnd upon exit (packets)
 * @full_bw:              Bandwidth snapshot (BW_SCALE) taken at the last confirmed bandwidth growth event in STARTUP
 * @rtt_history:          Two-element circular buffer of filtered RTT samples (microseconds each); used for P10 RTT estimation
 * @deliv_rate_hist:      Two-element circular buffer of most recent delivery rate samples (BW_SCALE); used for trend calculation
 * @loss_ewma:            Exponentially weighted moving average of packet loss ratio, compressed to u8; value = (loss_ewma_high << 8) | loss_ewma (0..BBR_UNIT)
 * @ecn_ewma:             Exponentially weighted moving average of ECN marking ratio, compressed to u8; value = (ecn_ewma_high << 8) | ecn_ewma (0..BBR_UNIT)
 * @max_bw_non_congested: Peak bandwidth (BW_SCALE) observed while in non-CONGESTED network condition; used for bandwidth floor computation
 * @rate_change_ewma:     Smoothed delivery-rate trend value (BBR_SCALE, signed); positive means rate increasing, negative means dropping
 * @last_delivered_ce:    Value of tp->delivered_ce at the previous ACK; used to compute ECN marking delta for the current ACK
 * @ack_epoch_start_us:   Microsecond timestamp (low 32 bits of delivered_mstamp) marking the start of the ACK compensation epoch; 0 = epoch not started
 * @extra_acked:          Cumulative excess acked packet count in the current compensation epoch (u8, saturates at 255)
 * @extra_acked_max:      Maximum excess acked over recent epochs; decayed by 3/4 each epoch; used as cwnd bonus in ucp_set_cwnd()
 *
 * NOTE: ACK aggregation compensation --- BBR vs UCP comparison
 *
 * Both BBR and UCP track excess ACK counts per RTT-scale epoch and add a
 * small cwnd bonus to prevent pacing stalls during delayed-ACK / ACK-
 * compression bursts.  Key differences due to UCP's 104-byte struct limit:
 *
 *   Feature              BBR (16 bytes)        UCP (6 bytes)
 *   -------------------  --------------------  ----------------------------
 *   Epoch timestamp      u64 (8 bytes)         u32 (4 bytes, low 32 bits)
 *   Excess window        u16 extra_acked[2]    u8 extra_acked + u8 max
 *                        (dual sliding slot)   (single slot, decay x 0.75)
 *   Control bitfield     ~4 bytes              none (reuses existing fields)
 *   Gain                 fixed BBR_UNIT        ucp_extra_acked_gain
 *                                             (default 0 = disabled)
 *
 * BBR's dual-slot retains full excess for two epochs; UCP decays at 0.75x
 * per epoch, converging to the same steady state within 3-4 epochs.
 * A 1ms epoch floor prevents degenerate per-ACK resets when min_rtt_us is
 * unrealistically small.  Throughput impact: <3% relative to full BBR on
 * typical internet paths.  u8 saturation at 255 pkts is rare outside DC.
 */
struct ucp {
	/* core measurement and tracking state */
	u32 min_rtt_us;           /* Minimum RTT observed in the current measurement window; unit: microseconds; updated when a lower RTT is seen or filter expires */
	u32 min_rtt_stamp;        /* Jiffies timestamp (kernel timer tick) when min_rtt_us was last updated; used to implement the min_rtt filter expiry logic */
	u32 probe_rtt_done_stamp; /* Absolute jiffies value when PROBE_RTT phase is scheduled to end; set to 0 when not actively in PROBE_RTT state */

	struct minmax bw;         /* Running-maximum filter structure for bottleneck bandwidth (BtlBw); window width: 10 packet-timed rounds; values in BW_SCALE */

	u32 rtt_cnt;              /* Monotonically increasing counter of packet-timed rounds completed since connection start; incremented each round boundary */
	u32 next_rtt_delivered;   /* Snapshot of tp->delivered taken at the start of the current packet-timed round; used to detect when the round ends */
	u32 cycle_mstamp_lo;      /* Lower 32 bits of the 64-bit microsecond timestamp marking the start of the current PROBE_BW gain phase */
	u32 cycle_mstamp_hi;      /* Upper 32 bits of the 64-bit microsecond timestamp marking the start of the current PROBE_BW gain phase */

	/*
	 * Bitfield word 1: mode, flags, small counters, PROBE_BW index
	 * Total 32 bits
	 */
	u32 mode : 2;               /* Bitfield: current ucp_mode value (0..3); controls which gain/behaviour rules apply; updated on state transitions */
	u32 prev_ca_state : 3;      /* Bitfield: previous TCP CA state (TCP_CA_Open, TCP_CA_Recovery, etc.); used to detect state transitions for cwnd management */
	u32 round_start : 1;        /* Bitfield: set to true on the first ACK that completes a new packet-timed round; cleared to 0 in ucp_update_bw() after round processing */
	u32 idle_restart : 1;       /* Bitfield: set to true when transmission resumes after an application-limited idle period; cleared on first data delivery */
	u32 probe_rtt_round_done : 1;/* Bitfield: set to true after at least one full packet-timed round completes during PROBE_RTT; gates exit from PROBE_RTT */
	u32 fast_recovery : 1;      /* Bitfield: set to true while in non-congestion fast recovery (TCP_CA_Recovery); cleared at round start after recovery ends */
	u32 net_condition : 2;      /* Bitfield: current ucp_net_condition classification (0..3); guides cwnd ceiling and bandwidth floor selection */
	u32 net_class : 3;          /* Bitfield: current ucp_net_class classification (0..5); affects PROBE_RTT interval and bandwidth floor multiplier */
	u32 rate_hist_idx : 1;      /* Bitfield: index (0 or 1) into the 2-element deliv_rate_hist[] circular buffer; toggled after each sample is stored */
	u32 rtt_hist_idx : 1;       /* Bitfield: index (0 or 1) into the 2-element rtt_history[] circular buffer; toggled after each RTT sample is stored */
	u32 drain_pending : 2;      /* Bitfield: non-zero value (1..3) indicates a queued drain level; 0 means no drain pending; consumed by ucp_apply_pacing_constraints() */
	u32 cond_confirm : 3;       /* Bitfield: hysteresis counter (0..7) for confirming net_condition transitions; increments on disagreement, resets on agreement */
	u32 class_confirm : 3;      /* Bitfield: hysteresis counter (0..7) for confirming net_class transitions; increments on disagreement, saturates at UCP_CLASS_CONFIRM_MAX */
	u32 min_rtt_fast_fall_cnt : 2; /* Bitfield: count (0..3) of consecutive RTT samples below 75% of current min_rtt; triggers fast min_rtt downward revision */
	u32 cycle_idx : 3;          /* Bitfield: current PROBE_BW gain cycle index (0..7); determines which gain value from the 8-phase cycle is used */
	u32 probe_gain_applied : 1; /* Bitfield: set to true if the last PROBE_BW phase index 0 applied a probe gain > 1.0x; determines whether drain phase is needed */
	u32 padding1 : 2;           /* Bitfield: explicit padding to align the first bitfield word to exactly 32 bits (total field bits = 2+3+1+1+1+1+2+3+1+1+2+3+3+2+3+1+2 = 32); unused, reserved for future use */

	/*
	 * Bitfield word 2: gains, binary flags, helper flags
	 * Total 32 bits
	 */
	u32 pacing_gain : 12;       /* Bitfield: current pacing gain in BBR_SCALE (range 0..4095, representing 0.0x to ~16.0x); controls transmission rate via sk_pacing_rate */
	u32 cwnd_gain : 10;         /* Bitfield: current cwnd gain in BBR_SCALE (range 0..1023, representing 0.0x to ~4.0x); controls window size relative to BDP */
	u32 full_bw_reached : 1;    /* Bitfield: set to true once STARTUP bandwidth growth has stalled (pipe considered full); gates transition from STARTUP to DRAIN */
	u32 full_bw_cnt : 2;        /* Bitfield: consecutive rounds (0..3) without 1.25x bandwidth growth; when >= UCP_FULL_BW_CNT (3), full_bw_reached is set */
	u32 has_seen_rtt : 1;       /* Bitfield: set to true on the first valid SRTT sample; used to avoid RTT-based calculations before the first measurement is available */
	u32 has_delayed_ack : 1;    /* Bitfield: copy of the rs->is_ack_delayed flag for the current ACK; used to reject RTT samples that include ACK delay */
	u32 probe_rtt_restored : 1; /* Bitfield: set to true right after cwnd is restored from PROBE_RTT clamping; ensures restoration only happens once */
	u32 loss_ewma_high : 1;     /* Bitfield: bit 8 (overflow) of the compressed loss EWMA; value = (loss_ewma_high << 8) | loss_ewma (0..BBR_UNIT) */
	u32 ecn_ewma_high : 1;      /* Bitfield: bit 8 (overflow) of the compressed ECN EWMA; value = (ecn_ewma_high << 8) | ecn_ewma (0..BBR_UNIT) */
	u32 padding2 : 2;           /* Bitfield: explicit padding to align the second bitfield word to exactly 32 bits; unused, reserved for future use */

	/* standalone u32 fields */
	u32 prior_cwnd;             /* Congestion window (in packets) saved before entering TCP recovery or PROBE_RTT; used to restore cwnd upon exit */
	u32 full_bw;                /* Bandwidth snapshot (BW_SCALE) recorded at the last confirmed bandwidth growth event in STARTUP; used for 1.25x growth comparison */

	u32 rtt_history[2];         /* Two-element circular buffer holding the most recent filtered RTT samples (microseconds each); index toggled by rtt_hist_idx */
	u32 deliv_rate_hist[2];     /* Two-element circular buffer holding the most recent delivery rate samples (BW_SCALE each); index toggled by rate_hist_idx */

	u8 loss_ewma;               /* EWMA of packet loss ratio (BBR_SCALE, 0-255), lower 8 bits of compressed u16; value = (loss_ewma_high << 8) | loss_ewma */
	u8 ecn_ewma;                /* EWMA of ECN marking ratio (BBR_SCALE, 0-255), lower 8 bits of compressed u16; value = (ecn_ewma_high << 8) | ecn_ewma */
	u8 extra_acked;             /* Cumulative excess acked packet count in the current compensation epoch; saturates at 255 */
	u8 extra_acked_max;         /* Maximum excess acked over recent epochs; decayed by 3/4 each epoch; used as cwnd bonus in ucp_set_cwnd() */
	u32 ack_epoch_start_us;     /* Microsecond timestamp (low 32 bits of tp->delivered_mstamp) marking the start of the current epoch window; 0 = epoch not started */
	u32 max_bw_non_congested;   /* Peak bandwidth (BW_SCALE) observed while net_condition != UCP_COND_CONGESTED; reset when congestion is entered; used for bandwidth floor */
	s32 rate_change_ewma;       /* Smoothed delivery-rate trend (BBR_SCALE, signed); positive indicates rate increasing, negative indicates rate dropping; used for congestion classification */
	u32 last_delivered_ce;      /* Value of tp->delivered_ce (number of packets with CE mark) at the previous ACK; used to compute ECN marking delta for the current ACK */
};
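
/*
 * Illustrative example of the 9-bit compressed EWMA encoding (not part of
 * the algorithm; values assume BBR_UNIT = 256): a loss EWMA of 5% is
 * 0.05 * 256 ~= 13, which fits in loss_ewma with loss_ewma_high = 0; the
 * maximum ratio of 100% is exactly BBR_UNIT = 0x100 and is stored as
 * loss_ewma_high = 1, loss_ewma = 0. Reconstruction is always
 * (loss_ewma_high << 8) | loss_ewma, as done by ucp_get_loss_ratio().
 */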

/* -------------------------------------------------------------------------
 * Module Parameters - all tunable, permyriad (per ten thousand) unless otherwise noted
 * Each parameter is exported to /sys/module/tcp_ucp/parameters/ for runtime tuning
 * ------------------------------------------------------------------------- */

/* ---- Module parameter sysctl-style dynamic update callback -------------- */
static void ucp_init_module_params(void); /* Forward declaration for the setter callback below; ucp_init_module_params is defined after all module_param declarations */

/**
 * @brief Custom setter for all module parameters --- recomputes cached values on every write.
 * This enables "sysctl -w" style runtime tuning: writing to /sys/module/tcp_ucp/parameters/
 * immediately propagates to the internal BBR_SCALE / jiffies cached variables used in the hot path.
 */
static int ucp_param_set_int(const char *val, const struct kernel_param *kp)
{
	int ret = param_set_int(val, kp);
	if (ret == 0)
		ucp_init_module_params();
	return ret;
}

static const struct kernel_param_ops ucp_param_ops = {
	.set = ucp_param_set_int,
	.get = param_get_int,
};

/* ---- Operation mode selector -------------------------------------------- */
static int ucp_bbr_mode = 0;   /* Operation mode: 0 = full UCP (non-destructive constraints, classifiers, bandwidth floor), 1 = pure BBR compatible (bypasses all UCP-specific logic); module_param for runtime switching */
module_param_cb(ucp_bbr_mode, &ucp_param_ops, &ucp_bbr_mode, 0644); /* Export as sysfs parameter with read-write permissions; echo 0 or 1 to switch mode at runtime */
MODULE_PARM_DESC(ucp_bbr_mode,
	"Operation mode: 0 = full UCP (default, all constraints active), 1 = pure BBR compatible (bypass UCP-specific logic)");

/* ---- Bandwidth soft floor (permyriad of non-congested peak) ------------- */
static int ucp_bw_floor_default = 2000; /* Default path bandwidth floor: 2000 permyriad = 20.00% of non-congested peak bandwidth; balanced for general internet paths; unit: permyriad (1/10000) */
module_param_cb(ucp_bw_floor_default, &ucp_param_ops, &ucp_bw_floor_default, 0644); /* Export as sysfs parameter with read-write permissions (owner/group/other = rw-r--r--) */
MODULE_PARM_DESC(ucp_bw_floor_default,
	"BtlBw floor for DEFAULT paths (permyriad of non-congested peak, 0 = disabled)"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_bw_floor_mobile = 2500;  /* Mobile/lossy path bandwidth floor: 2500 permyriad = 25.00% of non-congested peak bandwidth; slightly higher than default to absorb radio jitter; unit: permyriad (1/10000) */
module_param_cb(ucp_bw_floor_mobile, &ucp_param_ops, &ucp_bw_floor_mobile, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_bw_floor_mobile,
	"BtlBw floor for MOBILE / LOSSY_FAT paths (permyriad, 0 = disabled)"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_bw_floor_lan = 0;   /* LAN path bandwidth floor: 0 permyriad = disabled by default; LAN links are stable enough that a floor is unnecessary; unit: permyriad (1/10000) */
module_param_cb(ucp_bw_floor_lan, &ucp_param_ops, &ucp_bw_floor_lan, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_bw_floor_lan,
	"BtlBw floor for LAN paths (permyriad, 0 = disabled)"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_bw_floor_vpn = 0;   /* VPN path bandwidth floor: 0 permyriad = disabled by default; VPN encapsulation adds predictable overhead not needing a floor; unit: permyriad (1/10000) */
module_param_cb(ucp_bw_floor_vpn, &ucp_param_ops, &ucp_bw_floor_vpn, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_bw_floor_vpn,
	"BtlBw floor for VPN paths (permyriad, 0 = disabled)"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_bw_floor_lossy_fat = 2500;  /* LOSSY_FAT path bandwidth floor: 2500 permyriad = 25.00%; matches MOBILE default since lossy fat pipes share similar variability; unit: permyriad (1/10000) */
module_param_cb(ucp_bw_floor_lossy_fat, &ucp_param_ops, &ucp_bw_floor_lossy_fat, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_bw_floor_lossy_fat,
	"BtlBw floor for LOSSY_FAT paths (permyriad, default = mobile value)"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_bw_floor_congested = 2000;  /* CONGESTED path bandwidth floor: 2000 permyriad = 20.00%; matches DEFAULT default; floor is only active when net_condition is NOT congested, so this applies only briefly after congestion clears; unit: permyriad (1/10000) */
module_param_cb(ucp_bw_floor_congested, &ucp_param_ops, &ucp_bw_floor_congested, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_bw_floor_congested,
	"BtlBw floor for CONGESTED-class paths (permyriad, default = default value; note: floor only active when net_condition != CONGESTED)"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_bw_floor_loss_cap = 500; /* Maximum loss ratio for bandwidth floor activation: 500 permyriad = 5.0% loss; bandwidth floor disabled when loss exceeds this value; unit: permyriad */
module_param_cb(ucp_bw_floor_loss_cap, &ucp_param_ops, &ucp_bw_floor_loss_cap, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_bw_floor_loss_cap,
	"Max loss (permyriad) for which the bandwidth floor is active"); /* Human-readable description displayed by modinfo and in sysfs */

/* ---- ProbeRTT interval (seconds) ---------------------------------------- */
static int ucp_probe_rtt_base_sec = 10;            /* Base interval between PROBE_RTT entries: 10 seconds (matches BBR default); unit: seconds */
module_param_cb(ucp_probe_rtt_base_sec, &ucp_param_ops, &ucp_probe_rtt_base_sec, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_rtt_base_sec,
	"Base seconds between PROBE_RTT entries"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_probe_rtt_class_extra_sec = 5;      /* Additional seconds added for MOBILE or LOSSY_FAT path classes: +5 seconds; unit: seconds */
module_param_cb(ucp_probe_rtt_class_extra_sec, &ucp_param_ops, &ucp_probe_rtt_class_extra_sec, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_rtt_class_extra_sec,
	"Extra seconds for MOBILE / LOSSY_FAT paths"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_probe_rtt_high_loss_extra_sec = 0;  /* Additional seconds when loss >= high_loss_threshold: 0 extra seconds by default; unit: seconds */
module_param_cb(ucp_probe_rtt_high_loss_extra_sec, &ucp_param_ops, &ucp_probe_rtt_high_loss_extra_sec, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_rtt_high_loss_extra_sec,
	"Extra seconds when loss >= high_loss_thresh"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_probe_rtt_mid_loss_extra_sec = 0;   /* Additional seconds when loss >= probe_skip_thresh but below high_loss_thresh: 0 extra seconds by default; unit: seconds */
module_param_cb(ucp_probe_rtt_mid_loss_extra_sec, &ucp_param_ops, &ucp_probe_rtt_mid_loss_extra_sec, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_rtt_mid_loss_extra_sec,
	"Extra seconds when loss >= probe_skip_thresh but < high_loss_thresh"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_probe_rtt_max_sec = 15;             /* Hard upper cap on the PROBE_RTT interval: 15 seconds maximum regardless of other adjustments; unit: seconds */
module_param_cb(ucp_probe_rtt_max_sec, &ucp_param_ops, &ucp_probe_rtt_max_sec, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_rtt_max_sec,
	"Absolute maximum PROBE_RTT interval (seconds)"); /* Human-readable description displayed by modinfo and in sysfs */

/* ---- Congestion window gains (permyriad of BDP) ------------------------- */
static int ucp_cwnd_gain = 20000;        /* Steady-state cwnd gain: 20000 permyriad = 2.0000x of BDP; allows some queue buildup for throughput; unit: permyriad */
module_param_cb(ucp_cwnd_gain, &ucp_param_ops, &ucp_cwnd_gain, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_cwnd_gain,
	"Steady-state cwnd gain (permyriad of BDP, 20000 = 2.0x)"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_cwnd_cap_mild = 17500;    /* cwnd cap for MILD congestion: 17500 permyriad = 1.7500x of BDP; slightly reduces queue during mild congestion; unit: permyriad */
module_param_cb(ucp_cwnd_cap_mild, &ucp_param_ops, &ucp_cwnd_cap_mild, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_cwnd_cap_mild,
	"cwnd cap for MILD congestion (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_cwnd_cap_moderate = 15000;/* cwnd cap for MODERATE congestion: 15000 permyriad = 1.5000x of BDP; tighter cap for moderate congestion; unit: permyriad */
module_param_cb(ucp_cwnd_cap_moderate, &ucp_param_ops, &ucp_cwnd_cap_moderate, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_cwnd_cap_moderate,
	"cwnd cap for MODERATE congestion (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_cwnd_cap_severe = 12500;  /* cwnd cap for SEVERE congestion: 12500 permyriad = 1.2500x of BDP; most restrictive cap for severe congestion; unit: permyriad */
module_param_cb(ucp_cwnd_cap_severe, &ucp_param_ops, &ucp_cwnd_cap_severe, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_cwnd_cap_severe,
	"cwnd cap for SEVERE congestion (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */

/* ---- ACK aggregation compensation gain (permyriad) ----------------------- */
static int ucp_extra_acked_gain = 10000; /* ACK aggregation compensation gain: 10000 permyriad = 1.0x (BBR standard, default); added to cwnd as extra_acked_max * gain; unit: permyriad */
module_param_cb(ucp_extra_acked_gain, &ucp_param_ops, &ucp_extra_acked_gain, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_extra_acked_gain,
	"ACK aggregation compensation gain (permyriad, 0=disabled, 10000=1.0x); default 10000 matches BBR standard behavior"); /* Human-readable description displayed by modinfo and in sysfs */

/* ---- PROBE_BW probe gain (permyriad) ------------------------------------ */
static int ucp_probe_gain = 11000;       /* PROBE_BW probe phase pacing gain: 11000 permyriad = 1.1000x; conservative probe that paces 10% above the estimated bottleneck bandwidth to test for headroom; unit: permyriad */
module_param_cb(ucp_probe_gain, &ucp_param_ops, &ucp_probe_gain, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_gain,
	"Pacing gain for PROBE_BW probe phase (permyriad, 11000 = 1.10x)"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_probe_gain_mobile = 10000;/* Mobile path probe gain: 10000 permyriad = 1.0000x; no rate increase on mobile when loss >= drain_thresh; unit: permyriad */
module_param_cb(ucp_probe_gain_mobile, &ucp_param_ops, &ucp_probe_gain_mobile, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_gain_mobile,
	"Probe gain on MOBILE path when loss >= drain_thresh (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */

/* ---- Early drain triggers and gains (permyriad) ------------------------- */
static int ucp_drain_loss_thresh = 100;  /* Drain trigger loss threshold: 100 permyriad = 1.0% loss; early drain engaged when loss >= this value after a probe phase; unit: permyriad */
module_param_cb(ucp_drain_loss_thresh, &ucp_param_ops, &ucp_drain_loss_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_drain_loss_thresh,
	"Loss threshold (permyriad) to trigger early drain after probe"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_drain_gain_light = 8500;  /* Light drain pacing gain: 8500 permyriad = 0.8500x; gentle queue drain for level 1 (low loss after probe); unit: permyriad */
module_param_cb(ucp_drain_gain_light, &ucp_param_ops, &ucp_drain_gain_light, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_drain_gain_light,
	"Pacing gain for light drain (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_drain_gain_standard = 7500;/* Standard drain pacing gain: 7500 permyriad = 0.7500x; matches BBR standard drain gain of 0.75x; unit: permyriad */
module_param_cb(ucp_drain_gain_standard, &ucp_param_ops, &ucp_drain_gain_standard, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_drain_gain_standard,
	"Pacing gain for standard drain (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_drain_gain_aggressive = 6500;/* Aggressive drain pacing gain: 6500 permyriad = 0.6500x; rapid queue drain for level 3 (high loss after probe); unit: permyriad */
module_param_cb(ucp_drain_gain_aggressive, &ucp_param_ops, &ucp_drain_gain_aggressive, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_drain_gain_aggressive,
	"Pacing gain for aggressive drain (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */

/* drain level thresholds (permyriad loss) */
static int ucp_drain_loss_lvl2_thresh = 500; /* Level 2 (standard) drain loss threshold: 500 permyriad = 5.0% loss; triggers drain level 2; unit: permyriad */
module_param_cb(ucp_drain_loss_lvl2_thresh, &ucp_param_ops, &ucp_drain_loss_lvl2_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_drain_loss_lvl2_thresh,
	"Loss threshold (permyriad) for standard drain (level 2)"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_drain_loss_lvl3_thresh = 1000;/* Level 3 (aggressive) drain loss threshold: 1000 permyriad = 10.0% loss; triggers drain level 3; unit: permyriad */
module_param_cb(ucp_drain_loss_lvl3_thresh, &ucp_param_ops, &ucp_drain_loss_lvl3_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_drain_loss_lvl3_thresh,
	"Loss threshold (permyriad) for aggressive drain (level 3)"); /* Human-readable description displayed by modinfo and in sysfs */

/* ---- General loss thresholds (permyriad) -------------------------------- */
static int ucp_low_loss_thresh = 100;     /* Low loss threshold: 100 permyriad = 1.0% loss; boundary between LIGHT_LOAD and RANDOM_LOSS conditions; unit: permyriad */
module_param_cb(ucp_low_loss_thresh, &ucp_param_ops, &ucp_low_loss_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_low_loss_thresh,
	"Low loss threshold (permyriad) - LIGHT_LOAD boundary"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_high_loss_thresh = 500;    /* High loss threshold: 500 permyriad = 5.0% loss; used for class classification, drain level determination, and congestion severity; unit: permyriad */
module_param_cb(ucp_high_loss_thresh, &ucp_param_ops, &ucp_high_loss_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_high_loss_thresh,
	"High loss threshold (permyriad) - used for class and drain level"); /* Human-readable description displayed by modinfo and in sysfs */

/* ---- Probe safety-skip thresholds (permyriad) --------------------------- */
static int ucp_probe_skip_loss_thresh = 200; /* Probe skip loss threshold: 200 permyriad = 2.0% loss; PROBE_BW probe phase is skipped above this value to avoid worsening loss; unit: permyriad */
module_param_cb(ucp_probe_skip_loss_thresh, &ucp_param_ops, &ucp_probe_skip_loss_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_skip_loss_thresh,
	"Loss threshold (permyriad) above which PROBE_BW probe is skipped"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_probe_skip_rtt_rise = 4000;  /* Probe skip RTT rise threshold: 4000 permyriad = 40.00% RTT increase; probe phase skipped above this to avoid aggravating queue buildup; unit: permyriad */
module_param_cb(ucp_probe_skip_rtt_rise, &ucp_param_ops, &ucp_probe_skip_rtt_rise, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_skip_rtt_rise,
	"RTT increase threshold (permyriad) above which PROBE_BW probe is skipped"); /* Human-readable description displayed by modinfo and in sysfs */

/* ---- STARTUP loss-based gain reduction (permyriad) ---------------------- */
static int ucp_startup_soft_drain_thresh = 50;  /* STARTUP soft drain loss threshold: 50 permyriad = 0.5% loss; above this, STARTUP gain is reduced to soft_gain (2.5x); unit: permyriad */
module_param_cb(ucp_startup_soft_drain_thresh, &ucp_param_ops, &ucp_startup_soft_drain_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_startup_soft_drain_thresh,
	"Loss threshold (permyriad) to reduce STARTUP gain to soft_gain"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_startup_hard_cap_thresh = 200;   /* STARTUP hard cap loss threshold: 200 permyriad = 2.0% loss; above this, STARTUP gain capped at cwnd_gain_val (2.0x); unit: permyriad */
module_param_cb(ucp_startup_hard_cap_thresh, &ucp_param_ops, &ucp_startup_hard_cap_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_startup_hard_cap_thresh,
	"Loss threshold (permyriad) to cap STARTUP gain at cwnd_gain"); /* Human-readable description displayed by modinfo and in sysfs */

static int ucp_startup_soft_gain = 25000;       /* STARTUP soft gain: 25000 permyriad = 2.5000x; reduced gain used when loss is between soft_drain and hard_cap thresholds; unit: permyriad */
module_param_cb(ucp_startup_soft_gain, &ucp_param_ops, &ucp_startup_soft_gain, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_startup_soft_gain,
	"STARTUP pacing/cwnd gain when loss between soft and hard thresholds (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */

/* ---- Internal derived variables (populated at module init) -------------- */
static u32 ucp_bw_floor_default_val;   /* Bandwidth floor for DEFAULT paths: kept in permyriad (no scale conversion); applied as peak_bw * floor / 10000 */
static u32 ucp_bw_floor_mobile_val;    /* Bandwidth floor for MOBILE/LOSSY_FAT paths: kept in permyriad (no scale conversion); applied as peak_bw * floor / 10000 */
static u32 ucp_bw_floor_lan_val;       /* Bandwidth floor for LAN paths: stored as permyriad value; 0 = disabled; tunable via sysfs */
static u32 ucp_bw_floor_vpn_val;       /* Bandwidth floor for VPN paths: stored as permyriad value; 0 = disabled; tunable via sysfs */
static u32 ucp_bw_floor_lossy_fat_val; /* Bandwidth floor for LOSSY_FAT paths: stored as permyriad value; default 2500 = 25% */
static u32 ucp_bw_floor_congested_val; /* Bandwidth floor for CONGESTED paths: stored as permyriad value; default 2000 = 20%; only active when net_condition != CONGESTED */
static u32 ucp_bw_floor_loss_cap_val;  /* Max loss ratio for bandwidth floor activation in BBR_SCALE; bandwidth floor is disabled when loss EWMA exceeds this value */

static u32 ucp_probe_rtt_base_jiffies;          /* Base PROBE_RTT interval in jiffies (kernel timer ticks); calculated from ucp_probe_rtt_base_sec * HZ */
static u32 ucp_probe_rtt_class_extra_jiffies;   /* Extra PROBE_RTT interval for MOBILE/LOSSY_FAT classes in jiffies; calculated from ucp_probe_rtt_class_extra_sec * HZ */
static u32 ucp_probe_rtt_high_loss_extra_jiffies; /* Extra PROBE_RTT interval when loss >= high_loss_thresh in jiffies; calculated from ucp_probe_rtt_high_loss_extra_sec * HZ */
static u32 ucp_probe_rtt_mid_loss_extra_jiffies;  /* Extra PROBE_RTT interval when loss >= probe_skip_thresh but < high_loss_thresh in jiffies; calculated from ucp_probe_rtt_mid_loss_extra_sec * HZ */
static u32 ucp_probe_rtt_max_jiffies;             /* Hard maximum PROBE_RTT interval in jiffies; calculated from ucp_probe_rtt_max_sec * HZ; absolute cap */

static u32 ucp_cwnd_gain_val;                   /* Steady-state cwnd gain in BBR_SCALE; converted from ucp_cwnd_gain via permyriad_to_bbr(); default: 512 = 2.0x */
static u32 ucp_cwnd_cap_mild_val;               /* MILD congestion cwnd cap in BBR_SCALE; converted from ucp_cwnd_cap_mild via permyriad_to_bbr(); default: 448 = 1.75x */
static u32 ucp_cwnd_cap_moderate_val;           /* MODERATE congestion cwnd cap in BBR_SCALE; converted from ucp_cwnd_cap_moderate; default: 384 = 1.50x */
static u32 ucp_cwnd_cap_severe_val;             /* SEVERE congestion cwnd cap in BBR_SCALE; converted from ucp_cwnd_cap_severe; default: 320 = 1.25x */

static u32 ucp_probe_gain_val;                  /* PROBE_BW probe phase pacing gain in BBR_SCALE; converted from ucp_probe_gain; default: ~282 = 1.10x */
static u32 ucp_probe_gain_mobile_val;           /* Mobile path probe gain in BBR_SCALE; converted from ucp_probe_gain_mobile; default: 256 = 1.00x (no probe) */

static u32 ucp_drain_loss_thresh_val;           /* Drain trigger loss threshold in BBR_SCALE; converted from ucp_drain_loss_thresh; default: ~2.56 = 1.0% */
static u32 ucp_drain_gain_light_val;            /* Light drain pacing gain in BBR_SCALE; converted from ucp_drain_gain_light; default: ~217 = 0.85x */
static u32 ucp_drain_gain_standard_val;         /* Standard drain pacing gain in BBR_SCALE; converted from ucp_drain_gain_standard; default: 192 = 0.75x */
static u32 ucp_drain_gain_aggressive_val;       /* Aggressive drain pacing gain in BBR_SCALE; converted from ucp_drain_gain_aggressive; default: ~166 = 0.65x */
static u32 ucp_drain_lvl2_loss_thresh_val;      /* Level 2 (standard) drain loss threshold in BBR_SCALE; converted from ucp_drain_loss_lvl2_thresh; default: ~12.8 = 5.0% */
static u32 ucp_drain_lvl3_loss_thresh_val;      /* Level 3 (aggressive) drain loss threshold in BBR_SCALE; converted from ucp_drain_loss_lvl3_thresh; default: ~25.6 = 10.0% */

static u32 ucp_low_loss_thresh_val;             /* Low loss threshold in BBR_SCALE; converted from ucp_low_loss_thresh; default: ~2.56 = 1.0% */
static u32 ucp_high_loss_thresh_val;            /* High loss threshold in BBR_SCALE; converted from ucp_high_loss_thresh; default: ~12.8 = 5.0% */

static u32 ucp_probe_skip_loss_val;             /* Probe skip loss threshold in BBR_SCALE; converted from ucp_probe_skip_loss_thresh; default: ~5.12 = 2.0% */
static u32 ucp_probe_skip_rtt_rise_val;         /* Probe skip RTT rise threshold in BBR_SCALE; converted from ucp_probe_skip_rtt_rise; default: ~102.4 = 40% */

static u32 ucp_startup_soft_drain_val;          /* STARTUP soft drain loss threshold in BBR_SCALE; converted from ucp_startup_soft_drain_thresh; default: ~1.28 = 0.5% */
static u32 ucp_startup_hard_cap_val;            /* STARTUP hard cap loss threshold in BBR_SCALE; converted from ucp_startup_hard_cap_thresh; default: ~5.12 = 2.0% */
static u32 ucp_startup_soft_gain_val;           /* STARTUP soft gain in BBR_SCALE; converted from ucp_startup_soft_gain; default: 640 = 2.50x */

static u32 ucp_extra_acked_gain_val;            /* ACK aggregation compensation gain in BBR_SCALE; converted from ucp_extra_acked_gain via permyriad_to_bbr(); default: 256 = 1.0x (from 10000 permyriad); 0 disables compensation */

/* congestion severity thresholds (derived) */
static u32 ucp_cong_severe_loss_val;   /* Severe congestion loss threshold in BBR_SCALE; derived as copy of ucp_high_loss_thresh_val (5.0% default) */
static u32 ucp_cong_moderate_loss_val; /* Moderate congestion loss threshold in BBR_SCALE; derived as copy of ucp_drain_loss_thresh_val (1.0% default) */
static u32 ucp_cong_mild_loss_val;     /* Mild congestion loss threshold in BBR_SCALE; derived as copy of ucp_low_loss_thresh_val (1.0% default, same as moderate) */

/**
 * @brief Convert a permyriad value (1/10000) to BBR_SCALE fixed-point (1/256).
 * @param val  Input value in permyriad units; range typically 0..10000+
 * @return     Equivalent value in BBR_SCALE fixed-point where BBR_UNIT = 256
 *
 * Performs: (BBR_UNIT * val) / 10000
 * Example: 20000 permyriad * 256 / 10000 = 512, which represents 2.0x in BBR_SCALE
 */
static u32 permyriad_to_bbr(u32 val)
{
	/* Multiply by BBR_UNIT (256) using 64-bit arithmetic to avoid overflow on large values, then divide by 10000 permyriad base */
	return (u32)(((u64)BBR_UNIT * val) / 10000); /* Scaling: permyriad (1/10000) to BBR_SCALE (1/256) via (256 * val / 10000) */
}

/**
 * @brief Precompute all module parameter derived values at module load time.
 * Called once at module init to convert permyriad module parameters into BBR_SCALE gain values
 * and seconds into jiffies, so the per-ACK hot path avoids repeated conversions.
 */
static void ucp_init_module_params(void)
{
	/* Clamp all permyriad/seconds module parameters to non-negative: negative values via sysfs would wrap to large unsigned causing stalls or crashes */
	ucp_bw_floor_default       = max(ucp_bw_floor_default, 0);
	ucp_bw_floor_mobile        = max(ucp_bw_floor_mobile, 0);
	ucp_bw_floor_lan           = max(ucp_bw_floor_lan, 0);
	ucp_bw_floor_vpn           = max(ucp_bw_floor_vpn, 0);
	ucp_bw_floor_lossy_fat     = max(ucp_bw_floor_lossy_fat, 0);
	ucp_bw_floor_congested     = max(ucp_bw_floor_congested, 0);
	ucp_bw_floor_loss_cap      = max(ucp_bw_floor_loss_cap, 0);
	ucp_probe_rtt_base_sec               = max(ucp_probe_rtt_base_sec, 0);
	ucp_probe_rtt_class_extra_sec        = max(ucp_probe_rtt_class_extra_sec, 0);
	ucp_probe_rtt_high_loss_extra_sec    = max(ucp_probe_rtt_high_loss_extra_sec, 0);
	ucp_probe_rtt_mid_loss_extra_sec     = max(ucp_probe_rtt_mid_loss_extra_sec, 0);
	ucp_probe_rtt_max_sec                = max(ucp_probe_rtt_max_sec, 0);
	ucp_cwnd_gain              = max(ucp_cwnd_gain, 0);
	ucp_cwnd_cap_mild          = max(ucp_cwnd_cap_mild, 0);
	ucp_cwnd_cap_moderate      = max(ucp_cwnd_cap_moderate, 0);
	ucp_cwnd_cap_severe        = max(ucp_cwnd_cap_severe, 0);
	ucp_extra_acked_gain       = max(ucp_extra_acked_gain, 0);
	ucp_probe_gain             = max(ucp_probe_gain, 0);
	ucp_probe_gain_mobile      = max(ucp_probe_gain_mobile, 0);
	ucp_drain_loss_thresh      = max(ucp_drain_loss_thresh, 0);
	ucp_drain_gain_light       = max(ucp_drain_gain_light, 0);
	ucp_drain_gain_standard    = max(ucp_drain_gain_standard, 0);
	ucp_drain_gain_aggressive  = max(ucp_drain_gain_aggressive, 0);
	ucp_drain_loss_lvl2_thresh = max(ucp_drain_loss_lvl2_thresh, 0);
	ucp_drain_loss_lvl3_thresh = max(ucp_drain_loss_lvl3_thresh, 0);
	ucp_low_loss_thresh        = max(ucp_low_loss_thresh, 0);
	ucp_high_loss_thresh       = max(ucp_high_loss_thresh, 0);
	ucp_probe_skip_loss_thresh = max(ucp_probe_skip_loss_thresh, 0);
	ucp_probe_skip_rtt_rise    = max(ucp_probe_skip_rtt_rise, 0);
	ucp_startup_soft_drain_thresh = max(ucp_startup_soft_drain_thresh, 0);
	ucp_startup_hard_cap_thresh   = max(ucp_startup_hard_cap_thresh, 0);
	ucp_startup_soft_gain      = max(ucp_startup_soft_gain, 0);

	/* bandwidth floor: store the permyriad values unconverted for the later peak_bw * floor_pct / 10000 computation */
	ucp_bw_floor_default_val = ucp_bw_floor_default; /* Copy the default path floor permyriad (e.g., 2000 for 20.00%) without scale conversion; used as numerator in floor_pct/10000 */
	ucp_bw_floor_mobile_val  = ucp_bw_floor_mobile; /* Copy the mobile path floor permyriad (e.g., 2500 for 25.00%) without scale conversion; used as numerator in floor_pct/10000 */
	ucp_bw_floor_lan_val     = ucp_bw_floor_lan; /* Copy the LAN path floor permyriad (default 0 = disabled) without scale conversion */
	ucp_bw_floor_vpn_val     = ucp_bw_floor_vpn; /* Copy the VPN path floor permyriad (default 0 = disabled) without scale conversion */
	ucp_bw_floor_lossy_fat_val = ucp_bw_floor_lossy_fat; /* Copy the LOSSY_FAT path floor permyriad (default 2500 = 25%) without scale conversion */
	ucp_bw_floor_congested_val = ucp_bw_floor_congested; /* Copy the CONGESTED path floor permyriad (default 2000 = 20%) without scale conversion */
	ucp_bw_floor_loss_cap_val = permyriad_to_bbr(ucp_bw_floor_loss_cap); /* Convert loss cap threshold from permyriad to BBR_SCALE for direct comparison with loss_ewma */

	/* ProbeRTT intervals: multiply seconds by HZ (kernel ticks per second) to convert to timer ticks */
	ucp_probe_rtt_base_jiffies           = ucp_probe_rtt_base_sec * HZ; /* Base interval: 10 seconds * HZ; the minimum time that must elapse between PROBE_RTT entries */
	ucp_probe_rtt_class_extra_jiffies    = ucp_probe_rtt_class_extra_sec * HZ; /* Extra time for hostile path classes: 5 seconds * HZ; adds to base for MOBILE/LOSSY_FAT */
	ucp_probe_rtt_high_loss_extra_jiffies = ucp_probe_rtt_high_loss_extra_sec * HZ; /* Extra time when high loss >= 5%: configured seconds * HZ (default 0) */
	ucp_probe_rtt_mid_loss_extra_jiffies = ucp_probe_rtt_mid_loss_extra_sec * HZ; /* Extra time when medium loss >= 2%: configured seconds * HZ (default 0) */
	ucp_probe_rtt_max_jiffies           = ucp_probe_rtt_max_sec * HZ; /* Hard cap: 15 seconds * HZ; maximum allowed PROBE_RTT interval regardless of adjustments */

	/* cwnd gains: convert from permyriad to BBR_SCALE for efficient fixed-point arithmetic per ACK */
	ucp_cwnd_gain_val        = permyriad_to_bbr(ucp_cwnd_gain); /* Steady-state cwnd gain: 20000 permyriad -> 512 BBR_SCALE (2.0x BDP) */
	ucp_cwnd_cap_mild_val    = permyriad_to_bbr(ucp_cwnd_cap_mild); /* Mild congestion cap: 17500 permyriad -> 448 BBR_SCALE (1.75x BDP) */
	ucp_cwnd_cap_moderate_val = permyriad_to_bbr(ucp_cwnd_cap_moderate); /* Moderate congestion cap: 15000 permyriad -> 384 BBR_SCALE (1.50x BDP) */
	ucp_cwnd_cap_severe_val  = permyriad_to_bbr(ucp_cwnd_cap_severe); /* Severe congestion cap: 12500 permyriad -> 320 BBR_SCALE (1.25x BDP) */

	/* probe gains: convert to BBR_SCALE for use in the PROBE_BW 8-phase gain cycle */
	ucp_probe_gain_val        = permyriad_to_bbr(ucp_probe_gain); /* Standard probe gain: 11000 permyriad -> ~282 BBR_SCALE (1.10x BDP) */
	ucp_probe_gain_mobile_val = permyriad_to_bbr(ucp_probe_gain_mobile); /* Mobile probe gain: 10000 permyriad -> 256 BBR_SCALE (1.00x = no probe) */

	/* drain thresholds and gains: convert all drain-related permyriad parameters to BBR_SCALE */
	ucp_drain_loss_thresh_val          = permyriad_to_bbr(ucp_drain_loss_thresh); /* Drain trigger loss: 100 permyriad -> ~2.56 BBR_SCALE (1.0%) */
	ucp_drain_gain_light_val           = permyriad_to_bbr(ucp_drain_gain_light); /* Light drain gain: 8500 permyriad -> ~217 BBR_SCALE (0.85x) */
	ucp_drain_gain_standard_val        = permyriad_to_bbr(ucp_drain_gain_standard); /* Standard drain gain: 7500 permyriad -> 192 BBR_SCALE (0.75x) */
	ucp_drain_gain_aggressive_val      = permyriad_to_bbr(ucp_drain_gain_aggressive); /* Aggressive drain gain: 6500 permyriad -> ~166 BBR_SCALE (0.65x) */
	ucp_drain_lvl2_loss_thresh_val     = permyriad_to_bbr(ucp_drain_loss_lvl2_thresh); /* Level 2 drain loss: 500 permyriad -> ~12.8 BBR_SCALE (5.0%) */
	ucp_drain_lvl3_loss_thresh_val     = permyriad_to_bbr(ucp_drain_loss_lvl3_thresh); /* Level 3 drain loss: 1000 permyriad -> ~25.6 BBR_SCALE (10.0%) */

	/* general loss thresholds: convert to BBR_SCALE for consistent comparison with loss_ewma */
	ucp_low_loss_thresh_val  = permyriad_to_bbr(ucp_low_loss_thresh); /* Low loss threshold: 100 permyriad -> ~2.56 BBR_SCALE (1.0%) */
	ucp_high_loss_thresh_val = permyriad_to_bbr(ucp_high_loss_thresh); /* High loss threshold: 500 permyriad -> ~12.8 BBR_SCALE (5.0%) */

	/* probe skip thresholds: convert to BBR_SCALE for comparison with loss_ewma and RTT increase ratio */
	ucp_probe_skip_loss_val  = permyriad_to_bbr(ucp_probe_skip_loss_thresh); /* Probe skip loss: 200 permyriad -> ~5.12 BBR_SCALE (2.0%); probe phase skipped if loss_ewma >= this */
	ucp_probe_skip_rtt_rise_val = permyriad_to_bbr(ucp_probe_skip_rtt_rise); /* Probe skip RTT rise: 4000 permyriad -> ~102.4 BBR_SCALE (40%); probe skipped if rinc >= this */

	/* STARTUP thresholds: convert to BBR_SCALE for loss-based gain reduction in exponential growth phase */
	ucp_startup_soft_drain_val = permyriad_to_bbr(ucp_startup_soft_drain_thresh); /* Soft drain loss: 50 permyriad -> ~1.28 BBR_SCALE (0.5%); reduces gain above this */
	ucp_startup_hard_cap_val   = permyriad_to_bbr(ucp_startup_hard_cap_thresh); /* Hard cap loss: 200 permyriad -> ~5.12 BBR_SCALE (2.0%); hard caps gain above this */
	ucp_startup_soft_gain_val  = permyriad_to_bbr(ucp_startup_soft_gain); /* Soft gain: 25000 permyriad -> 640 BBR_SCALE (2.50x); reduced gain between soft and hard thresholds */

	/* ACK aggregation compensation gain: convert permyriad to BBR_SCALE for efficient per-ACK arithmetic */
	ucp_extra_acked_gain_val   = permyriad_to_bbr(ucp_extra_acked_gain); /* Compensation gain: default 10000 -> 256 BBR_SCALE (1.0x); 0 disables compensation */

	/* derived congestion severity loss thresholds: reuse established loss threshold values to define graduated congestion severity levels */
	ucp_cong_severe_loss_val   = ucp_high_loss_thresh_val;   /* Severe congestion loss threshold = high loss threshold (5.0%) */
	ucp_cong_moderate_loss_val = ucp_drain_loss_thresh_val;  /* Moderate congestion loss threshold = drain trigger loss (1.0%) */
	ucp_cong_mild_loss_val     = ucp_low_loss_thresh_val;    /* Mild congestion loss threshold = low loss threshold (1.0%); same as moderate in default config */
}
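
/*
 * Worked example of the runtime tuning path (illustrative):
 *   echo 12000 > /sys/module/tcp_ucp/parameters/ucp_probe_gain
 * invokes ucp_param_set_int(), which runs param_set_int() and then re-runs
 * ucp_init_module_params(), so ucp_probe_gain_val becomes
 * permyriad_to_bbr(12000) = 12000 * 256 / 10000 = 307 (~1.20x); the next
 * PROBE_BW probe phase paces at the new gain without reloading the module.
 */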

/* ---- Non-exported internal constants (structural, BBR-derived) ---------- */
#define UCP_BW_RTT_CYCLE_LEN           8  /* Number of packet-timed rounds in one BBR gain cycle; the minmax filter window is UCP_BW_RTTS = cycle + 2 guard rounds = 10 total */
#define UCP_BW_RTTS                    (UCP_BW_RTT_CYCLE_LEN + 2)  /* Total max filter window including 2 guard rounds: 10 rounds total; matches BBR's 10-round max filter */

#define UCP_PROBE_RTT_MODE_MS          200   /* Duration to stay in PROBE_RTT mode: 200 milliseconds; cwnd clamped to 4 packets during this period to drain the bottleneck queue for a clean min RTT sample */

#define UCP_MIN_TSO_RATE               1200000 /* Minimum pacing rate for TSO single-segment limit: 1,200,000 bps (1.2 Mbps); below this rate, only 1 TSO segment is used to avoid burstiness */
#define UCP_TSO_MAX_SEGS               0x7F    /* Hard maximum TSO segments per GSO burst: 127 segments (0x7F); prevents excessive burst size regardless of calculated goal */
#define UCP_PACING_MARGIN_PERCENT      1       /* Self-queue pacing safety margin: 1% rate reduction; reduces the pacing rate slightly to avoid building local qdisc queues */
#define UCP_PACING_MARGIN_DIV          99      /* Numerator for applying the 1% pacing margin: final rate = rate * 99 / 100, giving 1% headroom; also the divisor in the UCP_RATE_MAX_SAFE overflow bound below */
#define UCP_RATE_MAX_SAFE              (U64_MAX / UCP_PACING_MARGIN_DIV) /* Upper bound to prevent 64-bit overflow in the margin multiplication: U64_MAX / 99; any rate above this is capped */

#define UCP_HIGH_GAIN                  (BBR_UNIT * 2885 / 1000 + 1)  /* High gain for STARTUP phase: BBR_UNIT * 2885 / 1000 + 1 = ~2.885x (rounded up); same as BBR's 2/ln(2) ~ 2.885 high gain */
#define UCP_DRAIN_GAIN                 (BBR_UNIT * 1000 / 2885)       /* Drain gain for DRAIN phase: BBR_UNIT * 1000 / 2885 = ~0.346x (truncated); reciprocal of UCP_HIGH_GAIN to drain exactly the queue built during STARTUP */

#define UCP_PROBE_BW_CYCLE_LEN         8  /* Number of phases in the PROBE_BW gain cycle: 8 phases (probe, drain, 6 cruise) as in standard BBR */
#define UCP_PROBE_BW_DRAIN_IDX         1  /* Index of the drain phase within the PROBE_BW cycle: phase 1 immediately follows the probe phase (index 0) */
#define UCP_PROBE_BW_CYCLE_RAND        7  /* Randomization range for initial cycle phase: 0..7 random offset; modulo cycle length to randomize phase start position across connections */

#define UCP_CWND_MIN_TARGET            4  /* Absolute minimum congestion window target: 4 packets; applied during PROBE_RTT as the cwnd clamp and as a general cwnd floor */
#define UCP_FULL_BW_THRESH             (BBR_UNIT * 125 / 100)  /* Full bandwidth detection threshold: 1.25x in BBR_SCALE (256 * 125 / 100 = 320); STARTUP considered full when BW growth < 1.25x */
#define UCP_FULL_BW_CNT                3  /* Number of consecutive rounds without 1.25x BW growth to declare pipe full: 3 rounds; same as BBR's 3-round criterion for full pipe */
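
/*
 * Worked example (illustrative, arbitrary BW_SCALE units): with a full_bw
 * snapshot of 800, a round maximum of 1100 >= 800 * 320 / 256 = 1000 counts
 * as growth, so the snapshot becomes 1100 and full_bw_cnt resets to 0. If
 * the next three round maxima all stay below 1100 * 1.25 = 1375, full_bw_cnt
 * reaches UCP_FULL_BW_CNT, full_bw_reached is set, and the STARTUP -> DRAIN
 * transition is triggered.
 */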

#define UCP_LOSS_EWMA_RETAINED_WEIGHT  3  /* EWMA retained weight numerator for loss: 3 parts retained from the previous EWMA value */
#define UCP_LOSS_EWMA_SAMPLE_WEIGHT    1  /* EWMA sample weight numerator for loss: 1 part contributed by the new instantaneous loss ratio */
#define UCP_LOSS_EWMA_TOTAL_WEIGHT     4  /* EWMA total weight denominator for loss: 4 total parts (3 retained + 1 sample = new EWMA = 3/4 old + 1/4 new) */
#define UCP_LOSS_EWMA_IDLE_DECAY_NUM   70 /* Loss EWMA idle decay numerator: 70; when a sample has no losses, EWMA decays to EWMA * 70/100 = 0.7x previous */
#define UCP_LOSS_EWMA_IDLE_DECAY_DEN   100/* Loss EWMA idle decay denominator: 100; 70/100 = 0.7 decay factor applied when no losses in current sample */
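
/*
 * Worked example (illustrative, BBR_SCALE units): with a previous loss EWMA
 * of 40 (~15.6%) and an instantaneous loss ratio of 80 (~31.3%), the new
 * EWMA is (40 * 3 + 80) / 4 = 50 (~19.5%). A following loss-free sample
 * applies the idle decay instead: 50 * 70 / 100 = 35 (~13.7%), so the
 * estimate fades over a few clean rounds rather than collapsing at once.
 */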

#define UCP_ECN_EWMA_RETAINED_WEIGHT   3  /* ECN EWMA retained weight numerator: 3 parts retained from the previous EWMA value */
#define UCP_ECN_EWMA_SAMPLE_WEIGHT     1  /* ECN EWMA sample weight numerator: 1 part from the new instantaneous ECN marking ratio */
#define UCP_ECN_EWMA_TOTAL_WEIGHT      4  /* ECN EWMA total weight denominator: 4 total parts (3 retained + 1 sample = 3/4 old + 1/4 new) */
#define UCP_ECN_EWMA_IDLE_DECAY_NUM    70 /* ECN EWMA idle decay numerator: 70; when no CE marks in current sample, EWMA decays to EWMA * 70/100 */
#define UCP_ECN_EWMA_IDLE_DECAY_DEN    100/* ECN EWMA idle decay denominator: 100; 70/100 = 0.7 decay factor for ECN marking ratio when no new CE marks observed */

#define UCP_COND_CONFIRM_ENTER         3  /* Hysteresis confirm count threshold for entering CONGESTED condition: 3 consecutive samples agreeing on transition into congested */
#define UCP_COND_CONFIRM_EXIT          2  /* Hysteresis confirm count threshold for exiting CONGESTED condition: 2 consecutive samples needed to leave any non-idle condition */
#define UCP_CLASS_CONFIRM_CNT          2  /* Hysteresis confirm count threshold for network class transitions: need 2 consecutive class change suggestions to switch */
#define UCP_CLASS_CONFIRM_MAX          7  /* Maximum value for class hysteresis confirmation counter: saturates at 7 to avoid overflow in the 3-bit class_confirm bitfield */

#define UCP_INFLIGHT_LOW_GAIN          (BBR_UNIT * 125 / 100)  /* Lower bound gain for inflight cwnd clamping: 1.25x BDP in BBR_SCALE (320); prevents cwnd from dropping below 1.25x BDP */
#define UCP_INFLIGHT_HIGH_GAIN         (BBR_UNIT * 200 / 100)  /* Upper bound gain for inflight cwnd clamping: 2.00x BDP in BBR_SCALE (512); prevents cwnd from exceeding 2.0x BDP */

/* ACK aggregation compensation decay factor (exponential forgetting on extra_acked_max) */
#define UCP_ACK_EPOCH_DECAY_NUM        3  /* Numerator for epoch max decay: extra_acked_max = extra_acked_max * 3 / 4 each epoch */
#define UCP_ACK_EPOCH_DECAY_DEN        4  /* Denominator for epoch max decay: 3/4 = 0.75 exponential decay factor */
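
/*
 * Worked example (illustrative): with extra_acked_max = 40 packets, an
 * epoch ending with an excess of 20 yields max(40 * 3 / 4, 20) = 30, so
 * the aggregation bonus decays; an epoch ending instead with an excess of
 * 50 yields max(30, 50) = 50, so a burstier ACK pattern immediately raises
 * the cwnd compensation ceiling again.
 */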

/* Maximum value for u8 saturation (used in ACK aggregation u8 counters); prefixed UCP_ to avoid conflict with the kernel's U8_MAX in <linux/limits.h> */
#define UCP_U8_MAX                      0xFF /* 255; maximum representable value in an unsigned 8-bit integer */

/* Minimum epoch duration for ACK aggregation compensation (microseconds) */
#define UCP_ACK_EPOCH_MIN_US            1000 /* 1 ms floor; prevents degenerate epoch resets when min_rtt_us is unrealistically small */

#define UCP_RTT_SAMPLE_MAX_US          500000  /* Hard absolute ceiling for RTT sample rejection: 500,000 microseconds (500 ms); any RTT sample above this is unconditionally discarded as an outlier */
#define UCP_RTT_SAMPLE_MAX_MULT        3       /* Dynamic multiplier for per-connection RTT sample ceiling: samples > 3x min_rtt_us are rejected as outliers */

#define UCP_RATE_TREND_EWMA_WEIGHT     (BBR_UNIT * 7 / 8)  /* EWMA retained weight for rate trend smoothing: 7/8 retained (224/256 BBR_SCALE); strong smoothing to filter out noise in delivery rate measurements */

#define UCP_MINRTT_FAST_FALL_CNT       3  /* Count of consecutive sub-75% min_rtt samples needed to trigger a fast downward revision: 3 fast-fall samples trigger immediate min_rtt update */
#define UCP_MINRTT_STICKY_FLOOR_NUM    3  /* Sticky floor numerator for progressive min_rtt reduction: 3; when fast-fall count is active but below threshold, min_rtt is reduced to 3/4 of current */
#define UCP_MINRTT_STICKY_FLOOR_DEN    4  /* Sticky floor denominator for progressive min_rtt reduction: 4; min_rtt = min_rtt * 3/4 before full fast-fall trigger */

#define UCP_MINRTT_SRTT_GUARD_NUM      9  /* SRTT guard numerator: 9; if srtt/8 < min_rtt * 9/10, update min_rtt to srtt/8 as a safety guard against stale min_rtt */
#define UCP_MINRTT_SRTT_GUARD_DEN      10 /* SRTT guard denominator: 10; comparison is srtt/8 < min_rtt * 9/10; condition: (tp->srtt_us >> 3) < ucp->min_rtt_us * 9/10 */
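
/*
 * Worked example (illustrative): with min_rtt_us = 10000 and
 * tp->srtt_us = 64000 (SRTT is stored 8x-scaled, so srtt/8 = 8000 us),
 * the guard fires because 8000 < 10000 * 9 / 10 = 9000, and min_rtt_us
 * is pulled down to 8000 without waiting for the fast-fall counter,
 * protecting against a stale, inflated min_rtt estimate.
 */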

#define UCP_BDP_MIN_RTT_US            1000    /* Minimum RTT for BDP calculation: 1000 microseconds (1 ms); prevents division by zero or unrealistically small BDP values */
#define UCP_BDP_HI_MULT               2       /* BDP model RTT upper bound multiplier: 2; model_rtt is capped at max(min_rtt_us * 2, 500ms) to bound the BDP estimate */
#define UCP_BDP_HI_FLOOR_US           500000  /* BDP model RTT upper bound floor: 500,000 microseconds (500 ms); minimum value for the high bound, even when min_rtt is very small */

#define UCP_TSO_HEADROOM_SEGS         3  /* TSO headroom in segments: 3 extra TSO segments added to cwnd to prevent TCP segmentation offload from stalling the transmit pipeline */
#define UCP_PROBE_CWND_BONUS          2  /* Extra cwnd segments during PROBE_BW probe phase: 2 additional segments to ensure we fill the pipe during bandwidth probing */

#define UCP_BW_FLOOR_PCT_OFF          0  /* Special sentinel value that disables the bandwidth floor: 0 means floor is not applied (used for LAN/VPN paths) */

#define UCP_CLASS_LAN_RTT_US          5000    /* LAN classification RTT threshold: 5000 microseconds (5 ms); paths with average RTT below this qualify as LAN */
#define UCP_CLASS_LAN_JITTER_US       3000    /* LAN classification jitter threshold: 3000 microseconds (3 ms); max RTT variation for LAN classification */
#define UCP_CLASS_LAN_LOSS_THRESH     (BBR_UNIT / 1000)   /* LAN classification loss threshold: nominally 0.1% in BBR_SCALE, but integer division truncates 256/1000 to 0, so the effective threshold is stricter than 0.1%; very low loss expected on local networks */
#define UCP_CLASS_MOBILE_LOSS_THRESH  (BBR_UNIT * 3 / 100) /* Mobile classification loss threshold: 3% in BBR_SCALE (256*3/100 = 7.68); high loss characteristic of cellular */
#define UCP_CLASS_MOBILE_JITTER_US    (20 * USEC_PER_MSEC) /* Mobile classification jitter threshold: 20,000 microseconds (20 ms); high jitter typical of cellular networks */
#define UCP_CLASS_LOSSY_RTT_US        80000   /* Lossy fat pipe classification RTT threshold: 80,000 microseconds (80 ms); high latency typical of satellite links */
#define UCP_CLASS_LOSSY_LOSS_THRESH   (BBR_UNIT / 100)     /* Lossy fat pipe loss threshold: 1% in BBR_SCALE (256/100 = 2.56); significant background loss expected */
#define UCP_CLASS_CONG_RINC_THRESH    (BBR_UNIT * 50 / 100)/* Congested class RTT increase ratio threshold: 50% in BBR_SCALE (128); RTT has increased by 50% relative to min_rtt */
#define UCP_CLASS_VPN_RTT_US          60000   /* VPN classification RTT threshold: 60,000 microseconds (60 ms); elevated but stable latency typical of VPN tunnels */

#define UCP_RTT_EXTRA_HIGH_THRESH     (BBR_UNIT * 100 / 100) /* RTT increase ratio threshold for "high" classification: 100% in BBR_SCALE (256); RTT has doubled relative to min_rtt */
#define UCP_RTT_EXTRA_MID_THRESH      (BBR_UNIT * 50 / 100)  /* RTT increase ratio threshold for "mid" classification: 50% in BBR_SCALE (128); RTT has increased by 50% */

#define UCP_CONG_SEVERE_RINC_THRESH   UCP_RTT_EXTRA_HIGH_THRESH /* Severe congestion RTT increase threshold: same as RTT_EXTRA_HIGH (100%); congestion when RTT >= 2x min_rtt */
#define UCP_CONG_MODERATE_RINC_THRESH UCP_RTT_EXTRA_MID_THRESH  /* Moderate congestion RTT increase threshold: same as RTT_EXTRA_MID (50%) */
#define UCP_CONG_MILD_RINC_THRESH     (BBR_UNIT * 25 / 100)    /* Mild congestion RTT increase threshold: 25% in BBR_SCALE (64); minor RTT increase indicates mild queue buildup */

#define UCP_COND_RATE_DROP_THRESH        (-(s32)(BBR_UNIT * 15 / 100)) /* Rate EWMA drop threshold for congestion detection: -15% BBR_SCALE (signed -38); rate_change_ewma below this means significant rate decrease */
#define UCP_COND_LOSS_CONGEST_THRESH     (BBR_UNIT * 5 / 100)  /* Loss congestion threshold: 5% in BBR_SCALE (~12.8); loss above this + rate drop = CONGESTED condition */
#define UCP_COND_RINC_CONGEST_THRESH     (BBR_UNIT * 20 / 100) /* RTT increase congestion threshold: 20% in BBR_SCALE (~51.2); RTT rise above this with rate drop confirms congestion */
#define UCP_COND_LOSS_SEVERE_THRESH      (BBR_UNIT * 10 / 100) /* Severe loss threshold: 10% in BBR_SCALE (~25.6); loss this high substitutes for the RTT-increase confirmation when combined with a rate drop in the CONGESTED check */
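
/*
 * Worked example (illustrative, BBR_SCALE units, default thresholds): a
 * rate_change_ewma of -50 (~-19.5%, below the -15% drop threshold) with
 * loss_ewma = 16 (~6.3%, above 5%) and an RTT increase of 60 (~23.4%,
 * above 20%) satisfies all three clauses and proposes CONGESTED -- which
 * still needs UCP_COND_CONFIRM_ENTER = 3 agreeing samples before the
 * transition commits. The same rate drop and loss with an RTT increase of
 * only 20 (~7.8%) and loss below the 10% severe threshold is classified
 * as RANDOM_LOSS instead.
 */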

/* ---- Inline helpers for 64-bit cycle_mstamp access ---------------------- */
/**
 * @brief Reconstruct the 64-bit PROBE_BW phase start timestamp from two 32-bit halves.
 * @param ucp  Pointer to the per-connection UCP state structure
 * @return     64-bit microsecond timestamp of when the current PROBE_BW gain phase started
 *
 * Combines cycle_mstamp_hi (upper 32 bits) and cycle_mstamp_lo (lower 32 bits) into a single u64.
 * Stored as two halves to avoid requiring 64-bit aligned memory access on all CPU architectures.
 */
static inline u64 ucp_get_cycle_mstamp(const struct ucp *ucp)
{
	/* Shift the upper 32-bit half into the high word of a u64 and OR in the lower 32-bit half to reconstruct the full 64-bit timestamp */
	return ((u64)ucp->cycle_mstamp_hi << 32) | ucp->cycle_mstamp_lo;
}

/**
 * @brief Store a 64-bit timestamp as two 32-bit halves in the UCP state structure.
 * @param ucp  Pointer to the per-connection UCP state structure
 * @param val  64-bit microsecond timestamp to store (typically tp->delivered_mstamp)
 *
 * Splits the u64 value into upper and lower 32-bit halves for storage in cycle_mstamp_hi/lo.
 * Using two u32 fields avoids requiring 64-bit aligned memory access on 32-bit architectures.
 */
static inline void ucp_set_cycle_mstamp(struct ucp *ucp, u64 val)
{
	ucp->cycle_mstamp_hi = (u32)(val >> 32); /* Extract the upper 32 bits of the 64-bit timestamp by right-shifting 32 and casting to u32 */
	ucp->cycle_mstamp_lo = (u32)(val);        /* Extract the lower 32 bits of the 64-bit timestamp by truncating the u64 to u32 */
}
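
/*
 * Round-trip example (illustrative): ucp_set_cycle_mstamp(ucp,
 * 0x0000000123456789ULL) stores cycle_mstamp_hi = 0x00000001 and
 * cycle_mstamp_lo = 0x23456789; ucp_get_cycle_mstamp() then rebuilds
 * ((u64)0x1 << 32) | 0x23456789 -- the original timestamp -- using only
 * two naturally aligned u32 accesses.
 */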

/* ---- Forward declarations ----------------------------------------------- */
static u16 ucp_get_loss_ratio(const struct sock *sk);  /* Reconstruct loss EWMA from compressed u8 + overflow bit; forward-declared because called from ucp_update_loss_ewma before definition */
static void ucp_set_loss_ewma(struct ucp *ucp, u16 val); /* Compress loss EWMA to u8 + overflow bit; forward-declared because called from ucp_update_loss_ewma before definition */
static void ucp_check_probe_rtt_done(struct sock *sk);  /* Check if PROBE_RTT dwell time has elapsed and exit PROBE_RTT if so; forward-declared because called from ucp_cwnd_event before its definition */
static void ucp_update_model(struct sock *sk, const struct rate_sample *rs); /* Run the full UCP estimation pipeline: bandwidth, loss EWMA, net condition, net class, cycle phase, min_rtt */
static void ucp_apply_pacing_constraints(struct sock *sk); /* Apply any queued one-shot drain constraints to the pacing gain; forward-declared for ucp_update_model */
static void ucp_apply_cwnd_constraints(struct sock *sk); /* Apply congestion severity cwnd caps and STARTUP loss-based gain limits; forward-declared for ucp_update_model */

/**
 * @brief Test whether STARTUP has filled the pipe (full bandwidth reached).
 * @param sk  The TCP socket
 * @return    true if STARTUP bandwidth growth has stalled (pipe considered full), false otherwise
 *
 * Checks the full_bw_reached bitfield in the per-connection UCP state.
 * This flag gates the transition from STARTUP to DRAIN mode in the state machine.
 */
static bool ucp_full_bw_reached(const struct sock *sk)
{
	/* Retrieve the per-connection UCP private state via inet_csk_ca() and return the full_bw_reached bitfield value */
	return ((struct ucp *)inet_csk_ca(sk))->full_bw_reached;
}

/**
 * @brief Return the maximum filtered bandwidth (BtlBw) from the minmax running-max window.
 * @param sk  The TCP socket
 * @return    Maximum bandwidth in BW_SCALE (units of 1/2^24 packets per microsecond); 0 if no samples available
 *
 * Queries the minmax running-max filter for the peak bottleneck bandwidth estimate over the last 10 packet-timed rounds.
 * This is the UCP algorithm's primary bandwidth estimate, equivalent to BBR's BtlBw.
 */
static u32 ucp_max_bw(const struct sock *sk)
{
	/* Return the running maximum value from the minmax filter structure; bandwidth is in BW_SCALE (packets per microsecond * 2^24) */
	return minmax_get(&((struct ucp *)inet_csk_ca(sk))->bw);
}

/**
 * @brief Convert a BW_SCALE bandwidth value and a BBR_SCALE gain into bytes per second for pacing.
 * @param sk    The TCP socket (used to retrieve MSS cache)
 * @param rate  Bandwidth in BW_SCALE (packets per microsecond, fixed-point with 24 fractional bits)
 * @param gain  Gain multiplier in BBR_SCALE (1.0x = BBR_UNIT = 256)
 * @return      Pacing rate in bytes per second, with 1% self-queue margin already applied
 *
 * Algorithm steps:
 * 1. rate * MSS * gain / BBR_SCALE (converts to bytes per microsecond in BW_SCALE)
 * 2. Convert to bytes per second: integer part * USEC_PER_SEC + fractional part * USEC_PER_SEC / BW_UNIT
 * 3. Apply 1% margin: multiply by 99 then divide by 100
 * Uses 64-bit arithmetic throughout to avoid overflow on high-speed links (up to multi-gigabit rates).
 */
static u64 ucp_rate_bytes_per_sec(struct sock *sk, u64 rate, int gain)
{
	unsigned int mss = tcp_sk(sk)->mss_cache; /* Get the Maximum Segment Size (bytes) from the TCP socket for converting packets to bytes */
	u64 q, r, bytes_per_sec;                   /* q = integer part of the rate (whole bytes/usec), r = fractional BW_SCALE remainder (< 1 byte/usec), bytes_per_sec = final result */

	/* rate * MSS * gain >> BBR_SCALE: rate is in pkts/usec * 2^24 (BW_SCALE), multiply by MSS to get bytes/usec * 2^24, multiply by gain/256 to apply the pacing gain */
	rate = rate * mss * gain >> BBR_SCALE;  /* Result: (packets/usec * 2^24) * (bytes/packet) * gain / 256 = bytes/usec * 2^24 (still in BW_SCALE) */
	q = rate >> BW_SCALE;                   /* Extract integer part: shift right 24 bits to get whole bytes per microsecond */
	r = rate & (BW_UNIT - 1);               /* Extract fractional part: mask with (2^24 - 1) to get the fractional remainder in BW_SCALE */
	/* Convert to bytes per second: integer part * 1,000,000 (usec/sec) + fractional part * 1,000,000 >> BW_SCALE */
	bytes_per_sec = q * USEC_PER_SEC + ((r * USEC_PER_SEC) >> BW_SCALE);
	if (bytes_per_sec > UCP_RATE_MAX_SAFE) {/* Safety check: if bytes_per_sec would overflow when multiplied by UCP_PACING_MARGIN_DIV (99), cap it */
		bytes_per_sec = UCP_RATE_MAX_SAFE; /* Cap at maximum safe value to prevent overflow in the subsequent margin multiplication */
	}
	/* Apply the 1% self-queue pacing margin: multiply by 99 then divide by 100, which slightly reduces the rate to prevent local qdisc queue buildup */
	return bytes_per_sec * UCP_PACING_MARGIN_DIV / 100; /* Final pacing rate in bytes/sec with 1% headroom for the qdisc layer */
}
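
/*
 * Worked example (illustrative): for bw = BW_UNIT (exactly 1 packet/usec),
 * mss = 1000 bytes and gain = BBR_UNIT (1.0x), rate * mss * gain >> 8
 * leaves 1000 * 2^24, so q = 1000 bytes/usec and r = 0; the conversion
 * gives 1000 * USEC_PER_SEC = 1,000,000,000 bytes/sec, and the 1% margin
 * yields 990,000,000 bytes/sec (~7.92 Gbit/s) as the final pacing rate.
 */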

/**
 * @brief Convert BtlBw estimate and pacing gain to socket pacing rate, capped by sk_max_pacing_rate.
 * @param sk    The TCP socket
 * @param bw    Bottleneck bandwidth in BW_SCALE
 * @param gain  Pacing gain in BBR_SCALE
 * @return      Pacing rate in bytes per second, limited to sk->sk_max_pacing_rate (the socket's configured max)
 *
 * Wrapper around ucp_rate_bytes_per_sec() that clamps the result to the per-socket maximum pacing rate.
 * This prevents the pacing rate from exceeding any user-configured or system-imposed rate cap.
 */
static unsigned long ucp_bw_to_pacing_rate(struct sock *sk, u32 bw, int gain)
{
	/* Compute bytes/sec from BW and gain, then clamp to the socket's configured maximum pacing rate (sk_max_pacing_rate) */
	return min_t(u64, ucp_rate_bytes_per_sec(sk, bw, gain),
		     sk->sk_max_pacing_rate);
}

/**
 * @brief Initialize the pacing rate from cwnd and SRTT before any bandwidth samples are available.
 * @param sk  The TCP socket
 *
 * This is called at connection start and whenever pacing needs initialization before the first
 * delivery rate sample. It estimates bandwidth as cwnd / srtt and sets the initial pacing rate
 * using the high STARTUP gain (UCP_HIGH_GAIN ~= 2.89x) to probe for bandwidth.
 * If SRTT is not yet available, it assumes a 1 ms RTT as a fallback estimate.
 */
static void ucp_init_pacing_rate_from_rtt(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);   /* Get the TCP socket's internal state for snd_cwnd and srtt_us */
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for has_seen_rtt */
	u32 rtt_us;     /* RTT in microseconds for initial bandwidth estimation */
	u64 bw;         /* Estimated bandwidth in BW_SCALE (packets per microsecond * 2^24) */

	/* Check if a valid smoothed RTT measurement is available from the TCP stack */
	if (tp->srtt_us) {  /* srtt_us is present: use the smoothed RTT for the initial bandwidth estimate */
		rtt_us = max(tp->srtt_us >> 3, 1U); /* srtt_us is stored as 8x the actual RTT; shift right 3 to get microseconds, clamp to minimum 1 us to avoid division by zero */
		ucp->has_seen_rtt = 1; /* Record that a valid RTT sample has been observed; gates RTT-dependent logic elsewhere in the algorithm */
	} else {  /* No smoothed RTT measurement yet: use a conservative fallback */
		rtt_us = USEC_PER_MSEC;               /* Fallback to 1 millisecond (1000 microseconds) as a reasonable initial RTT assumption for modern networks */
	}

	/* Calculate initial bandwidth: cwnd (packets) / rtt_us (usec), converted to BW_SCALE */
	bw = (u64)tp->snd_cwnd * BW_UNIT;          /* Convert cwnd to BW_SCALE: multiply by 2^24 so the division yields (packets/usec) * 2^24 */
	do_div(bw, rtt_us);                        /* 64-bit division: bw = cwnd * 2^24 / rtt_us; result is bandwidth in packets per microsecond at BW_SCALE */
	/* Set the initial pacing rate to bandwidth * UCP_HIGH_GAIN (~2.89x) to aggressively probe during the STARTUP phase */
	sk->sk_pacing_rate = ucp_bw_to_pacing_rate(sk, bw, UCP_HIGH_GAIN);
}
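
/*
 * Illustrative example (assumed values, including MSS = 1448): with
 * snd_cwnd = 10 packets and srtt_us = 80000 (8x-scaled 10 ms, so
 * rtt_us = 10000):
 *   bw = 10 * 2^24 / 10000 ~= 16777   (~0.001 pkts/usec = 1 pkt/ms)
 * One packet per millisecond at MSS 1448 is ~1.45 MB/s, so the initial
 * pacing rate with UCP_HIGH_GAIN (~2.89x) lands near 4.2 MB/s before the
 * 1% pacing margin.
 */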

/**
 * @brief Apply pacing gain to the socket's pacing rate with 3:1 EWMA smoothing on increases.
 * @param sk    The TCP socket
 * @param bw    Current bottleneck bandwidth estimate in BW_SCALE
 * @param gain  Target pacing gain in BBR_SCALE (determines the multiplier applied to bw for the new rate)
 *
 * Smoothing behavior:
 * - Rate increases: apply 3:1 EWMA (75% old, 25% new) to avoid pacing jitter
 * - Fast-ramp bypass: if new rate > 2x current AND at round start, set directly (no smoothing)
 * - Rate decreases: applied immediately without smoothing (drains are instant)
 * - After full_bw_reached: always update pacing (both increases and decreases)
 * - Before full_bw_reached (i.e., while STARTUP is still probing): only update pacing if the rate is increasing
 */
static void ucp_set_pacing_rate(struct sock *sk, u32 bw, int gain)
{
	struct tcp_sock *tp = tcp_sk(sk);   /* Get the TCP socket's internal state for srtt_us (used in initialization fallback) */
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for has_seen_rtt and round_start */
	unsigned long rate = ucp_bw_to_pacing_rate(sk, bw, gain); /* Compute the target pacing rate from bandwidth and gain (result in bytes/sec) */

	/* If SRTT has become available but we haven't done the RTT-based initialization yet, run it now as a fallback */
	if (unlikely(!ucp->has_seen_rtt && tp->srtt_us)) { /* Unlikely path: no previous RTT initialization but SRTT just became available */
		ucp_init_pacing_rate_from_rtt(sk); /* Initialize pacing rate from cwnd and SRTT since we now have RTT data */
	}

	/* Conditionally update pacing: always update if pipe is full (post-STARTUP), or only on increases if pipe is not yet full */
	if (ucp_full_bw_reached(sk) || rate > sk->sk_pacing_rate) { /* After pipe is full, always apply new rate; before fullness, only allow increases */
		if (rate > sk->sk_pacing_rate) { /* Only apply smoothing on rate increases; decreases (drains) are set directly without smoothing */
			if (rate > sk->sk_pacing_rate * 2 && ucp->round_start) { /* Fast-ramp bypass: rate > 2x current at a round start indicates a significant change that should not be smoothed */
				sk->sk_pacing_rate = rate;   /* Fast-ramp: set the new rate directly without EWMA smoothing for quick response to improved conditions */
			} else { /* Normal rate increase: apply 3:1 EWMA smoothing (3/4 old rate + 1/4 new rate) to prevent pacing jitter and oscillations */
				rate = (sk->sk_pacing_rate * 3 + rate) / 4; /* EWMA: (old_rate * 3 + new_rate) / 4 = 75% old, 25% new contribution */
			}
		}
		sk->sk_pacing_rate = rate; /* Write the (potentially smoothed) pacing rate to the socket structure */
	}
}
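
/*
 * Illustrative smoothing example (assumed rates): with a current pacing rate
 * of 10,000,000 B/s and a target of 12,000,000 B/s, the 3:1 EWMA yields
 * (10,000,000 * 3 + 12,000,000) / 4 = 10,500,000 B/s. A target of
 * 25,000,000 B/s at a round start exceeds 2x the current rate, so the
 * fast-ramp bypass writes 25,000,000 B/s directly with no smoothing.
 */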

/**
 * @brief Determine the minimum number of TSO segments based on the current pacing rate.
 * @param sk  The TCP socket
 * @return    Minimum TSO segments: 1 for very low rates (< 150 KBps), 2 for normal rates
 *
 * Below approximately 150 KBps (UCP_MIN_TSO_RATE >> 3 = 150,000 bytes/sec), using more than 1 TSO segment
 * could cause excessive burstiness relative to the drain rate. Returns 1 segment for low rates to
 * minimize bursts, and 2 segments for higher rates where TSO batching is beneficial.
 */
static u32 ucp_min_tso_segs(struct sock *sk)
{
	/* Compare pacing rate against UCP_MIN_TSO_RATE / 8 (1,200,000 bps / 8 = 150,000 bytes/sec); below this threshold use 1 segment, otherwise use 2 */
	return sk->sk_pacing_rate < (UCP_MIN_TSO_RATE >> 3) ? 1 : 2;
}

/**
 * @brief Compute the desired TSO/GSO burst size in segments.
 * @param sk  The TCP socket
 * @return    The target number of segments per TSO burst, capped at UCP_TSO_MAX_SEGS (127)
 *
 * Calculates the ideal burst size based on the pacing rate and pacing shift configuration.
 * The goal is to emit segments at a rate that matches the pacing rate while staying within GSO_MAX_SIZE.
 * The result is clamped between ucp_min_tso_segs() and UCP_TSO_MAX_SEGS (127).
 */
static u32 ucp_tso_segs_goal(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for MSS cache and pacing shift */
	u32 bytes, segs; /* bytes = byte budget per GSO interval; segs = resulting segment count in the TSO burst */

	/* Compute the per-GSO-interval byte budget: pacing rate shifted right by pacing_shift gives bytes per GSO interval */
	bytes = min_t(unsigned long,
		      sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift), /* Bytes per GSO period = rate / (2^pacing_shift); uses READ_ONCE for lock-free read of the shift value */
		      GSO_MAX_SIZE - 1 - MAX_TCP_HEADER); /* Cap at maximum GSO segment size minus 1 byte and minus TCP/IP header overhead */
	/* Convert byte budget to segments: divide by MSS (bytes per segment), ensuring at least ucp_min_tso_segs() segments */
	segs = max_t(u32, bytes / tp->mss_cache, ucp_min_tso_segs(sk));
	return min(segs, UCP_TSO_MAX_SEGS); /* Cap segments at the hard maximum of 127 TSO segments per burst to prevent oversized bursts */
}
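
/*
 * Illustrative example (assumed values): at sk_pacing_rate = 12,500,000 B/s
 * with sk_pacing_shift = 10, the per-interval budget is
 * 12,500,000 >> 10 ~= 12207 bytes; with mss_cache = 1448 that is
 * 12207 / 1448 = 8 segments, comfortably inside the [2, 127] clamp.
 */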

/**
 * @brief Save the current cwnd as prior_cwnd before entering recovery or PROBE_RTT.
 * @param sk  The TCP socket
 *
 * Called before transitioning into TCP loss recovery (TCP_CA_Recovery) or PROBE_RTT state.
 * The saved cwnd is later used to restore the window when exiting those states.
 * If already in a recovery or PROBE_RTT state (detected via prev_ca_state), the maximum
 * of current cwnd and prior_cwnd is preserved to avoid shrinking the saved window.
 */
static void ucp_save_cwnd(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);   /* Get the TCP socket's internal state for snd_cwnd */
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for prior_cwnd and prev_ca_state */

	/* Only save cwnd if not already in a recovery or PROBE_RTT state; otherwise take the max to preserve the highest known good window */
	if (ucp->prev_ca_state < TCP_CA_Recovery && ucp->mode != UCP_PROBE_RTT) { /* Previously in a normal state (Open/Disorder/CWR): save current cwnd as prior_cwnd */
		ucp->prior_cwnd = tp->snd_cwnd; /* Save the current cwnd as the pre-restriction value for later restoration */
	} else { /* Already in a recovery or PROBE_RTT state: take the maximum to prevent prior_cwnd from shrinking */
		ucp->prior_cwnd = max(ucp->prior_cwnd, tp->snd_cwnd); /* Keep the larger of the previously saved value and current cwnd */
	}
}

/**
 * @brief Handle congestion events from the TCP stack (CWND_EVENT callback).
 * @param sk     The TCP socket
 * @param event  Type of congestion event (CA_EVENT_TX_START, etc.)
 *
 * Implements the tcp_congestion_ops.cwnd_event callback. On TX_START after being
 * application-limited (tp->app_limited is true), marks idle_restart and transitions the
 * pacing rate to 1.0x BtlBw (in PROBE_BW) or checks for early PROBE_RTT exit.
 * This prevents bursty behavior after idle periods.
 */
static void ucp_cwnd_event(struct sock *sk, enum tcp_ca_event event)
{
	struct tcp_sock *tp = tcp_sk(sk);   /* Get the TCP socket's internal state for the app_limited flag */
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */

	/* Check if this is a TX_START event (new data being sent) and the socket was application-limited */
	if (event == CA_EVENT_TX_START && tp->app_limited) { /* TX_START with app_limited means we are restarting transmission after an idle period */
		ucp->idle_restart = 1; /* Set flag indicating we just restarted from idle; affects min_rtt filter to avoid using stale RTT samples */
		if (ucp->mode == UCP_PROBE_BW) { /* In steady-state (PROBE_BW): reset pacing rate to 1.0x BtlBw to avoid a rate burst after idle */
			ucp_set_pacing_rate(sk, ucp_max_bw(sk), BBR_UNIT); /* Set pacing to 1.0x BtlBw to send at exactly the estimated bottleneck rate */
		} else if (ucp->mode == UCP_PROBE_RTT) { /* Currently in PROBE_RTT: new data arriving means we may want to exit PROBE_RTT early */
			ucp_check_probe_rtt_done(sk); /* Check if PROBE_RTT should end; may restore cwnd and transition to STARTUP or PROBE_BW */
		}
	}
}

/**
 * @brief Add a filtered RTT sample to the 2-slot circular history buffer.
 * @param sk     The TCP socket
 * @param rtt_us RTT sample value in microseconds (from the rate sample)
 *
 * Applies two levels of filtering before storing:
 * 1. Rejects samples above a ceiling (min_rtt * 3 when min_rtt exceeds the 1 ms floor, otherwise a hard 500 ms) or from delayed ACKs
 * 2. Rejects statistical outliers beyond min_rtt + 4 * rttvar
 * Valid samples are stored in the 2-element circular buffer indexed by rtt_hist_idx.
 */
static void ucp_add_rtt_sample(struct sock *sk, u32 rtt_us)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for min_rtt and rtt_hist_idx */
	u32 rttvar, ceiling; /* rttvar = RTT variance from TCP RTTM estimator; ceiling = maximum acceptable RTT sample value for this connection */

	/* Compute the per-connection RTT ceiling: if min_rtt > 1ms, use 3x min_rtt; otherwise use a hard 500ms cap */
	ceiling = ucp->min_rtt_us > UCP_BDP_MIN_RTT_US ? /* min_rtt is known and above 1ms floor: use a dynamic ceiling proportional to the connection's baseline RTT */
		  ucp->min_rtt_us * UCP_RTT_SAMPLE_MAX_MULT : /* Ceiling = min_rtt * 3 (3x the connection's observed minimum RTT) */
		  (u32)UCP_RTT_SAMPLE_MAX_US; /* Ceiling = hard 500ms absolute limit for connections with very low or unknown min_rtt */
	if (rtt_us > ceiling || ucp->has_delayed_ack) { /* Sample exceeds ceiling or was measured during a delayed ACK interval (which inflates RTT) */
		return; /* Discard this RTT sample: it is either an outlier or distorted by ACK delay */
	}

	rttvar = tcp_sk(sk)->rttvar_us; /* Get the smoothed RTT variance estimate from the TCP stack's RTTM (RTT Measurement) engine */
	/* Statistical outlier rejection: reject samples that exceed min_rtt + 4 * rttvar (the kernel's smoothed mean RTT deviation, used here as a spread estimate) */
	if (ucp->min_rtt_us && rttvar && /* Only apply this filter if both min_rtt and rttvar are available (non-zero) */
	    rtt_us > ucp->min_rtt_us + 4 * rttvar) { /* Sample is beyond 4x rttvar from min_rtt: a strong statistical outlier signal */
		return; /* Discard this RTT sample as a statistical outlier */
	}

	/* Store the validated RTT sample in the 2-slot circular buffer at the current write index */
	ucp->rtt_history[ucp->rtt_hist_idx] = rtt_us; /* Write RTT sample to the current history slot (index 0 or 1) */
	ucp->rtt_hist_idx ^= 1; /* Toggle the history index (XOR with 1) to advance to the other slot for the next sample */
}

/**
 * @brief Approximate the 10th percentile (P10) RTT from the two stored RTT samples.
 * @param sk  The TCP socket
 * @return    The lower (better) of the two stored RTT samples as the P10 approximation, or min_rtt_us if no samples exist
 *
 * With only 2 samples in the history buffer, the P10 is approximated by the minimum of the two.
 * Returns the minimum of the two samples if both are available, the single available sample if only
 * one exists, or falls back to min_rtt_us if the history is empty.
 * This low-percentile RTT is used as the "model RTT" in BDP calculations to avoid over-estimating BDP.
 */
static u32 ucp_get_p10_rtt(const struct sock *sk)
{
	const struct ucp *ucp = (const struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state (const access) */
	int s0 = !!ucp->rtt_history[0], s1 = !!ucp->rtt_history[1]; /* Convert sample presence to boolean values: !! maps non-zero to 1, zero to 0 */

	if (!s0 && !s1) { /* Neither RTT history slot has a valid sample */
		return ucp->min_rtt_us; /* Fall back to the minimum RTT estimate as the best available RTT approximation */
	}
	if (s0 && s1) { /* Both history slots have valid RTT samples */
		return min(ucp->rtt_history[0], ucp->rtt_history[1]); /* Approximate P10 as the smaller (better/less-congested) of the two samples */
	}
	return s0 ? ucp->rtt_history[0] : ucp->rtt_history[1]; /* Only one sample available: return whichever slot is populated */
}

/**
 * @brief Compute the RTT increase ratio as (average_RTT / min_rtt - 1) in BBR_SCALE.
 * @param sk  The TCP socket
 * @return    RTT increase ratio in BBR_SCALE: 0 = no increase, BBR_UNIT (256) = 100% increase
 *
 * Measures queuing delay by computing how much the current average RTT has increased above the
 * baseline minimum RTT. The formula is (avg_rtt - min_rtt) * BBR_UNIT / min_rtt.
 * This ratio is used for congestion severity classification, probe-skip decisions, and path class detection.
 */
static u32 ucp_get_rtt_increase_ratio(const struct sock *sk)
{
	const struct ucp *ucp = (const struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state (const access) */
	u32 avg = 0; /* Accumulator for computing the arithmetic mean of available RTT history samples */
	int v = 0;    /* Counter for the number of valid RTT samples in the history buffer */

	/* Sum the RTT history samples and count the valid entries */
	if (ucp->rtt_history[0]) { avg += ucp->rtt_history[0]; v++; } /* Add first history slot to running sum if it contains a non-zero sample */
	if (ucp->rtt_history[1]) { avg += ucp->rtt_history[1]; v++; } /* Add second history slot to running sum if it contains a non-zero sample */
	if (!v || !ucp->min_rtt_us) { /* No RTT history samples available OR min_rtt_us has not been established yet */
		return 0; /* Cannot compute a meaningful ratio; return 0 indicating no RTT increase */
	}

	avg /= v; /* Compute the arithmetic mean of the available RTT samples (sum / count) */
	if (avg <= ucp->min_rtt_us) { /* Average observed RTT does not exceed the baseline minimum: no measurable queuing delay */
		return 0; /* Return 0 to indicate no RTT increase (no queue buildup detected) */
	}
	/* Compute (avg_rtt - min_rtt) / min_rtt in BBR_SCALE: scaled 64-bit division for precision */
	return (u32)(((u64)(avg - ucp->min_rtt_us) * BBR_UNIT) /
		     ucp->min_rtt_us); /* Result represents 0..N in BBR_SCALE where 256 = 100% increase over baseline */
}
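
/*
 * Illustrative example (assumed samples): with rtt_history = {24000, 20000}
 * us and min_rtt_us = 20000, avg = 22000 and the ratio is
 * (22000 - 20000) * 256 / 20000 = 25 in BBR_SCALE (~10% RTT increase).
 */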

/**
 * @brief Compute the Bandwidth-Delay Product (BDP) in packets using the P10 RTT model.
 * @param sk    The TCP socket (for MSS and UCP state)
 * @param bw    Bottleneck bandwidth in BW_SCALE (packets per microsecond * 2^24)
 * @param gain  Gain multiplier in BBR_SCALE (1.0 = BBR_UNIT = 256)
 * @return      Congestion window target in packets (ceiling rounded): BDP * gain
 *
 * The "model RTT" is approximated by the P10 RTT from the 2-slot history buffer, clamped
 * between min_rtt_us and max(min_rtt_us * 2, 500ms). This prevents the BDP from being
 * inflated by transient RTT spikes. Returns TCP_INIT_CWND (10 segments) if min_rtt is invalid.
 * Final result: (bw * model_rtt * gain) >> (BBR_SCALE + BW_SCALE) with ceiling rounding.
 */
static u32 ucp_bdp(struct sock *sk, u32 bw, int gain)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
	u32 model_rtt, hi; /* model_rtt = P10 RTT estimate for BDP; hi = upper bound for model_rtt clamping (prevents BDP over-estimation) */
	u64 w; /* w = intermediate BDP product: bw * model_rtt (in BW_SCALE * microseconds = packets * 2^24) */

	/* Guard: if min_rtt_us is uninitialized (~0U) or below the 1ms minimum floor, return a safe default cwnd */
	if (unlikely(ucp->min_rtt_us == ~0U || /* Check for uninitialized min_rtt (all 32 bits set to 1, indicating never initialized) */
		     ucp->min_rtt_us < UCP_BDP_MIN_RTT_US)) { /* Check for min_rtt below the 1ms operational floor */
		return TCP_INIT_CWND; /* Return default initial cwnd (10 segments in modern TCP) as a safe fallback when RTT is not yet valid */
	}

	model_rtt = ucp_get_p10_rtt(sk); /* Get the approximate P10 (10th percentile) RTT from the 2-slot RTT history buffer */
	if (model_rtt < ucp->min_rtt_us) { /* P10 RTT should never be lower than the absolute minimum; protect against misestimation */
		model_rtt = ucp->min_rtt_us; /* Clamp the model RTT upward to at least the measured minimum RTT */
	}
	/* Compute the upper bound (hi) for model_rtt: the larger of 500ms and 2 * min_rtt_us */
	hi = (u32)max_t(u64, UCP_BDP_HI_FLOOR_US, /* Floor on the ceiling: hi never drops below 500,000 us (500 ms), even on very low-RTT paths */
			(u64)ucp->min_rtt_us * UCP_BDP_HI_MULT); /* Dynamic component: 2 * min_rtt_us, so the ceiling scales with the connection's baseline RTT */
	model_rtt = clamp(model_rtt, ucp->min_rtt_us, hi); /* Clamp model_rtt to the range [min_rtt_us, hi] to bound BDP estimates */

	/* Compute the raw BDP product: bw (pkts/usec * 2^24) * model_rtt (usec) yields packets * 2^24 */
	w = (u64)bw * model_rtt; /* Product is in units of (packets * 2^24); this is the bandwidth-delay product at BW_SCALE */
	/* Apply gain and convert from BW_SCALE back to packets with ceiling rounding: (w * gain / 256 + 2^24 - 1) / 2^24 */
	return ((w * gain >> BBR_SCALE) + BW_UNIT - 1) >> BW_SCALE; /* Final result: (bw * model_rtt * gain) >> (BBR_SCALE + BW_SCALE) with ceiling rounding for integer packets */
}
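
/*
 * Illustrative worked example (assumed values): bw = 167772 (~0.01 pkts/usec
 * at BW_SCALE), model_rtt = 20000 us, gain = 512 (2.0x):
 *   w = 167772 * 20000 = 3,355,440,000
 *   ((w * 512 >> 8) + 2^24 - 1) >> 24 = 400 packets
 * which matches the intuition 0.01 pkts/usec * 20000 us * 2 = 400.
 */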

/**
 * @brief Update the loss EWMA (Exponentially Weighted Moving Average) with data from a new rate sample.
 * @param sk  The TCP socket
 * @param rs  The rate sample structure containing acked_sacked and loss counters
 *
 * If the rate sample contains packet losses, computes an instantaneous loss ratio
 * (losses / total_packets) in BBR_SCALE and updates the EWMA with a 3:1 retained-to-sample ratio.
 * If the sample has no losses, the EWMA is decayed by multiplying by 70/100 (0.7x) to gradually
 * forget old loss events over loss-free periods.
 */
static void ucp_update_loss_ewma(struct sock *sk, const struct rate_sample *rs)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for loss_ewma */
	u32 total, instant; /* total = total packets in this sample (acked + lost); instant = instantaneous loss ratio in BBR_SCALE */
	u16 cur; /* Current loss EWMA value reconstructed from compressed u8 + overflow bit */

	total = (u32)(rs->acked_sacked + rs->losses); /* Compute total packets accounted for in this rate sample (acknowledged + lost) */
	cur = ucp_get_loss_ratio(sk); /* Reconstruct the current loss EWMA from compressed u8 + overflow bit for arithmetic */
	if (rs->losses == 0) { /* No packet losses in this sample: decay the loss EWMA to gradually reduce the estimate over clean periods */
		cur = cur * UCP_LOSS_EWMA_IDLE_DECAY_NUM / /* Multiply current EWMA by 70 (idle decay numerator) */
		      UCP_LOSS_EWMA_IDLE_DECAY_DEN; /* Divide by 100 (idle decay denominator); results in 0.7x exponential decay */
		ucp_set_loss_ewma(ucp, cur); /* Write back the decayed value through the compressed setter */
		return; /* No further update needed for a loss-free sample */
	}

	/* Compute instantaneous loss ratio: losses / total, scaled to BBR_SCALE (multiply by 256) */
	instant = (u32)(((u64)rs->losses * BBR_UNIT) / total); /* Instantaneous loss fraction = losses * 256 / total, result in BBR_SCALE */
	if (cur == 0) { /* First time a loss is observed: initialize EWMA directly to the instantaneous value without smoothing */
		ucp_set_loss_ewma(ucp, (u16)instant); /* Set initial EWMA to the first observed loss ratio (no prior history to smooth with) */
		return; /* Exit since the EWMA has been initialized */
	}

	/* Standard EWMA update with 3:1 weight ratio: (old_ewma * 3 + instant_ratio * 1) / 4 */
	cur = (cur * UCP_LOSS_EWMA_RETAINED_WEIGHT + /* Retained portion: current EWMA * 3 (75% retained memory) */
	       instant * UCP_LOSS_EWMA_SAMPLE_WEIGHT) / /* New sample portion: instantaneous loss ratio * 1 (25% new data) */
	      UCP_LOSS_EWMA_TOTAL_WEIGHT; /* Normalization: divide by total weight 4 to get the weighted average */
	ucp_set_loss_ewma(ucp, cur); /* Write back the updated EWMA value through the compressed setter */
}
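
/*
 * Illustrative EWMA trace (assumed values): with loss_ewma = 40 (~15.6% in
 * BBR_SCALE) and a sample of 2 losses out of 50 packets, instant =
 * 2 * 256 / 50 = 10, so the update gives (40 * 3 + 10) / 4 = 32. A
 * subsequent loss-free sample would decay that to 32 * 70 / 100 = 22.
 */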

/**
 * @brief Set the loss EWMA value, compressing to u8 with an overflow bit.
 * @param ucp  The per-connection UCP private state
 * @param val  Loss ratio in BBR_SCALE (0..BBR_UNIT, where BBR_UNIT = 256 = 100% loss)
 *
 * Encoded as a 9-bit value: loss_ewma_high carries bit 8, loss_ewma carries
 * bits 7-0.  Val is clamped to BBR_UNIT before splitting.
 */
static void ucp_set_loss_ewma(struct ucp *ucp, u16 val)
{
	if (val > BBR_UNIT) {
		val = BBR_UNIT;
	}
	ucp->loss_ewma_high = (val >> BBR_SCALE) & 1;
	ucp->loss_ewma = (u8)val;
}

/**
 * @brief Return the current loss ratio EWMA value in BBR_SCALE.
 * @param sk  The TCP socket
 * @return    Current smoothed loss ratio in BBR_SCALE (0 = 0% loss, BBR_UNIT = 256 = 100% loss)
 *
 * Reconstructs the full 9-bit value from the compressed representation:
 *   value = (loss_ewma_high << 8) | loss_ewma
 */
static inline u16 ucp_get_loss_ratio(const struct sock *sk)
{
	const struct ucp *ucp = (const struct ucp *)inet_csk_ca(sk);
	return ((u16)ucp->loss_ewma_high << 8) | ucp->loss_ewma;
}
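
/*
 * Illustrative encode/decode round trip for the 9-bit compression: setting
 * val = 300 first clamps to BBR_UNIT (256), storing loss_ewma_high = 1 and
 * loss_ewma = 0; the getter then reconstructs (1 << 8) | 0 = 256. A value
 * of 200 stores high = 0, low = 200 and reads back unchanged.
 */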

/**
 * @brief Set the ECN EWMA value, compressing to u8 with an overflow bit.
 * @param ucp  The per-connection UCP private state
 * @param val  ECN marking ratio in BBR_SCALE (0..BBR_UNIT, where BBR_UNIT = 256 = 100% marked)
 *
 * Encoded as a 9-bit value: ecn_ewma_high carries bit 8, ecn_ewma carries
 * bits 7-0.  Val is clamped to BBR_UNIT before splitting.
 */
static void ucp_set_ecn_ewma(struct ucp *ucp, u16 val)
{
	if (val > BBR_UNIT) {
		val = BBR_UNIT;
	}
	ucp->ecn_ewma_high = (val >> BBR_SCALE) & 1;
	ucp->ecn_ewma = (u8)val;
}

/**
 * @brief Return the current ECN marking ratio EWMA value in BBR_SCALE.
 * @param ucp  The per-connection UCP private state
 * @return     Current smoothed ECN ratio in BBR_SCALE (0 = 0% marked, BBR_UNIT = 256 = 100% marked)
 *
 * Reconstructs the full 9-bit value from the compressed representation:
 *   value = (ecn_ewma_high << 8) | ecn_ewma
 */
static u16 ucp_get_ecn_ratio(const struct ucp *ucp)
{
	return ((u16)ucp->ecn_ewma_high << 8) | ucp->ecn_ewma;
}

/**
 * @brief Store a delivery-rate bandwidth sample in the 2-slot circular history buffer.
 * @param sk      The TCP socket
 * @param rate_bw Delivery rate in BW_SCALE to store (packets per microsecond * 2^24)
 *
 * Writes the bandwidth value into the slot indexed by rate_hist_idx, then toggles the index
 * (XOR 1) so the next sample will overwrite the other slot. This 2-slot history enables
 * ucp_get_delivery_rate_trend() to compare the two most recent samples for rate direction.
 */
static void ucp_add_delivery_rate_sample(struct sock *sk, u32 rate_bw)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for rate_hist_idx */
	ucp->deliv_rate_hist[ucp->rate_hist_idx] = rate_bw; /* Store the delivery rate sample at the current write index in the 2-slot circular buffer */
	ucp->rate_hist_idx ^= 1; /* Toggle the write index (XOR with 1): switch to the other slot for the next sample */
}

/**
 * @brief Compute the signed delivery-rate trend in BBR_SCALE from the two most recent samples.
 * @param sk  The TCP socket
 * @return    Rate trend in BBR_SCALE: positive = rate increasing, negative = rate decreasing, 0 = insufficient data
 *
 * Computes (newer - older) / older as a signed ratio in BBR_SCALE. If the newer sample is less than
 * the older, the result is negative, indicating a bandwidth decrease. If the newer is greater,
 * the result is positive, indicating bandwidth growth. Returns 0 if either sample is unavailable
 * (uninitialized zero values).
 */
static s32 ucp_get_delivery_rate_trend(struct sock *sk)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for rate_hist_idx and delivery history */
	int ni = ucp->rate_hist_idx ^ 1; /* Calculate index of the "newer" sample: complement of current write index (the most recently written slot) */
	int oi = ucp->rate_hist_idx;     /* Index of the "older" sample: current write index points to the slot written two samples ago */
	u32 n, o; /* n = newer delivery rate sample (most recently written); o = older delivery rate sample (written before that) */

	n = ucp->deliv_rate_hist[ni]; /* Read the newer sample from the slot most recently written (the complement of the write index) */
	o = ucp->deliv_rate_hist[oi]; /* Read the older sample from the slot written previously (the current write index) */
	if (!n || !o) { /* Either sample is missing (zero = uninitialized, meaning not enough history accumulated yet) */
		return 0; /* Cannot compute a meaningful trend with fewer than 2 valid samples */
	}

	if (n < o) { /* Newer sample is less than older sample: bandwidth is decreasing */
		return -(s32)((u64)(o - n) * BBR_UNIT / o); /* Return negative ratio: -(old - new) / old, scaled to BBR_SCALE; negative indicates rate drop */
	}
	/* Newer sample is greater than or equal to older sample: bandwidth is increasing or stable */
	return (s32)((u64)(n - o) * BBR_UNIT / o); /* Return positive ratio: (new - old) / old, scaled to BBR_SCALE; positive indicates rate growth */
}
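
/*
 * Illustrative example (assumed samples): with older = 200 and newer = 170
 * (both in BW_SCALE), the trend is -(200 - 170) * 256 / 200 = -38 in
 * BBR_SCALE, i.e. a ~15% rate drop -- right at the magnitude the
 * UCP_COND_RATE_DROP_THRESH check in ucp_update_net_condition() looks for.
 */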

/**
 * @brief Update the network condition classification (IDLE, LIGHT_LOAD, CONGESTED, RANDOM_LOSS)
 *        using hysteresis for stability.
 * @param sk  The TCP socket
 * @param rs  The rate sample from the TCP stack (used for ECN delta and total packet count)
 *
 * Combines multiple signals to classify the network condition:
 * - Delivery rate trend (via EWMA): direction and magnitude of bandwidth change
 * - Loss ratio EWMA: smoothed packet loss percentage
 * - ECN marking EWMA: smoothed explicit congestion notification rate
 * - RTT increase ratio: queuing delay relative to baseline
 *
 * Classification rules:
 * 1. Rate drop (>15%) + (loss >= 5% or ECN present): signals congestion
 *    - With RTT rise >= 20% or loss >= 10%: CONGESTED
 *    - Otherwise: RANDOM_LOSS (rate drop without queue buildup)
 * 2. Any loss > 0 without strong rate drop: RANDOM_LOSS
 * 3. No loss, no significant rate drop: LIGHT_LOAD
 *
 * Hysteresis: entering CONGESTED requires 3 consecutive agreeing samples; other transitions need 2.
 * When entering CONGESTED, max_bw_non_congested is reset so the bandwidth floor will be re-established.
 */
static void ucp_update_net_condition(struct sock *sk,
				     const struct rate_sample *rs)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
	struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for delivered_ce ECN counter */
	s32 rate_change; /* Current smoothed delivery rate trend from EWMA (BBR_SCALE, signed: positive = increasing, negative = decreasing) */
	u32 loss_ratio, ecn_ratio; /* loss_ratio = loss EWMA in BBR_SCALE; ecn_ratio = ECN EWMA or instantaneous ratio in BBR_SCALE */
	u8 new_cond; /* Proposed new condition classification for this evaluation round */
	u32 ce_delta, total; /* ce_delta = newly marked CE delta; total = total packets in this sample (acked + lost) */
	u16 ecn_cur; /* Current ECN EWMA value reconstructed from compressed u8 + overflow bit */

	/* If the delivery rate history buffer is completely empty, we cannot classify yet */
	if (!ucp->deliv_rate_hist[0] && !ucp->deliv_rate_hist[1]) { /* Both history slots are zero (uninitialized) */
		ucp->net_condition = UCP_COND_IDLE; /* Set condition to IDLE: we have no rate history to evaluate */
		ucp->cond_confirm = 0; /* Reset the hysteresis confirmation counter for future transitions */
		return; /* No further classification is possible without delivery rate history */
	}

	rate_change = ucp_get_delivery_rate_trend(sk); /* Get the raw signed delivery rate trend from the 2-slot history buffer */
	if (ucp->rate_change_ewma == (s32)0x80000000) { /* First trend measurement: initialize the EWMA directly from the raw value (sentinel set in ucp_init) */
		ucp->rate_change_ewma = rate_change; /* Set initial EWMA to the first observed trend value (no prior history) */
	} else { /* Update the EWMA with strong smoothing (7/8 retained, 1/8 new sample) to filter noise */
		ucp->rate_change_ewma = (s32)(((s64)ucp->rate_change_ewma * /* Multiply retained EWMA by the 7/8 weight factor */
			UCP_RATE_TREND_EWMA_WEIGHT + (s64)rate_change * /* Add new sample weighted by (1 - 7/8) = 1/8 */
			(BBR_UNIT - UCP_RATE_TREND_EWMA_WEIGHT)) / BBR_UNIT); /* Divide by BBR_UNIT to normalize the weighted sum */
	}
	rate_change = ucp->rate_change_ewma; /* Use the smoothed EWMA rate change for classification decisions */

	loss_ratio = ucp_get_loss_ratio(sk); /* Get the current loss EWMA (BBR_SCALE) for loss-based classification */
	ce_delta = tp->delivered_ce - ucp->last_delivered_ce; /* Compute the change in CE-marked packet count since the last ACK */
	ucp->last_delivered_ce = tp->delivered_ce; /* Update the trailing CE counter to the current value for the next ACK's delta calculation */

	if (ce_delta) { /* New ECN Congestion Experienced marks were received in this ACK: update the ECN EWMA */
		total = max_t(u32, rs->acked_sacked + rs->losses, 1); /* Total packets acknowledged or lost in this sample; minimum 1 to avoid division by zero */
		ecn_ratio = (u32)(((u64)ce_delta * BBR_UNIT) / total); /* Compute instantaneous ECN marking ratio: CE-marked delta / total, scaled to BBR_SCALE */
		ecn_cur = ucp_get_ecn_ratio(ucp); /* Reconstruct the current ECN EWMA from compressed u8 + overflow bit */
		if (ecn_cur == 0) { /* First ECN sample observed: initialize EWMA directly from the instantaneous ratio */
			ucp_set_ecn_ewma(ucp, (u16)ecn_ratio); /* Set initial ECN EWMA to the first observed marking ratio through the compressed setter */
		} else { /* Update ECN EWMA with 3:1 smoothing (3/4 retained, 1/4 new sample) */
			ecn_cur = (ecn_cur * UCP_ECN_EWMA_RETAINED_WEIGHT + /* Retained portion: old EWMA * 3 (75%) */
				   ecn_ratio * UCP_ECN_EWMA_SAMPLE_WEIGHT) / /* New sample portion: instantaneous ratio * 1 (25%) */
				  UCP_ECN_EWMA_TOTAL_WEIGHT; /* Divide by total weight 4 to normalize */
			ucp_set_ecn_ewma(ucp, ecn_cur); /* Write back the updated ECN EWMA through the compressed setter */
		}
	} else { /* No new ECN marks in this ACK: decay the ECN EWMA to gradually reduce the estimate over clean periods */
		ecn_cur = ucp_get_ecn_ratio(ucp); /* Reconstruct the current ECN EWMA from compressed u8 + overflow bit for arithmetic */
		ecn_cur = ecn_cur * UCP_ECN_EWMA_IDLE_DECAY_NUM / /* Multiply current EWMA by 70 (idle decay numerator) for exponential decay */
			  UCP_ECN_EWMA_IDLE_DECAY_DEN; /* Divide by 100 (idle decay denominator); results in 0.7x per-sample decay */
		ucp_set_ecn_ewma(ucp, ecn_cur); /* Write back the decayed ECN EWMA through the compressed setter */
	}

	/* Primary classification decision tree based on combined signals */
	if (rate_change <= UCP_COND_RATE_DROP_THRESH && /* Smoothed rate trend indicates a significant drop (<= -15% in BBR_SCALE) */
	    (loss_ratio >= UCP_COND_LOSS_CONGEST_THRESH || ucp_get_ecn_ratio(ucp) > 0)) { /* AND congestion signals are present (loss >= 5% or ECN marks detected) */
		u32 rinc = ucp_get_rtt_increase_ratio(sk); /* Get the current RTT increase ratio to disambiguate true congestion from random loss */
		if (rinc >= UCP_COND_RINC_CONGEST_THRESH || /* RTT has increased by >= 20% (queue buildup detected) OR */
		    loss_ratio >= UCP_COND_LOSS_SEVERE_THRESH) { /* Loss rate is severe (>= 10%): strong indication of true congestion */
			new_cond = UCP_COND_CONGESTED; /* Strong congestion signals present: classify as truly congested with queue buildup */
		} else { /* Rate drop with congestion signals but without significant RTT rise or severe loss: likely random/corruption packet loss */
			new_cond = UCP_COND_RANDOM_LOSS; /* Classify as random loss: rate dropped but no queuing delay, suggesting non-congestive packet loss */
		}
	} else if (loss_ratio > 0) { /* Some packet loss is present but rate trend is not dropping significantly */
		new_cond = UCP_COND_RANDOM_LOSS; /* Moderate loss without strong rate drop or congestion signals: classify as random (non-congestive) loss */
	} else { /* No loss detected and no significant rate drop: network appears healthy and uncongested */
		new_cond = UCP_COND_LIGHT_LOAD; /* Clean conditions: classify as light load (uncongested, underutilized) */
	}

	/* Apply hysteresis to prevent rapid condition flapping: require multiple consecutive agreeing samples */
	if (new_cond == ucp->net_condition) { /* Proposed condition matches the current classification: condition is stable */
		ucp->cond_confirm = 0; /* Reset the confirmation counter since the classification is consistent */
	} else { /* Proposed condition differs from current: increment the confirmation counter to accumulate evidence */
		ucp->cond_confirm++; /* Count another consecutive sample that disagrees with the current condition */
		if (ucp->cond_confirm >= (new_cond == UCP_COND_CONGESTED ? /* Different thresholds for entering vs. leaving congestion */
					  UCP_COND_CONFIRM_ENTER : /* Entering CONGESTED requires UCP_COND_CONFIRM_ENTER (3) consecutive samples (conservative) */
					  UCP_COND_CONFIRM_EXIT)) { /* Leaving any non-idle condition requires UCP_COND_CONFIRM_EXIT (2) consecutive samples */
			ucp->net_condition = new_cond; /* Commit the transition to the new condition classification */
			ucp->cond_confirm = 0; /* Reset the confirmation counter for future transitions */
			if (new_cond == UCP_COND_CONGESTED) { /* When entering CONGESTED, reset the non-congested bandwidth peak so it will be re-established from scratch */
				ucp->max_bw_non_congested = 0; /* Clear the peak non-congested BW; it will be re-measured when conditions improve */
			}
		}
	}
}
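
/*
 * Illustrative trace of the 7/8 trend smoothing above (assuming the 7/8
 * weight is 224 in BBR_SCALE): with rate_change_ewma = -20 and a raw sample
 * of -60,
 *   (-20 * 224 + -60 * 32) / 256 = -25
 * so a single sharp drop moves the smoothed trend only modestly; several
 * consecutive such samples are needed to cross the ~-15% (~-38) drop
 * threshold, which pairs with the 3-sample CONGESTED entry hysteresis.
 */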

/**
 * @brief Update the network path class classification (DEFAULT, LAN, MOBILE, LOSSY_FAT, CONGESTED, VPN).
 * @param sk  The TCP socket
 *
 * Classifies the network path based on average RTT, jitter (RTT max-min range), loss ratio EWMA,
 * and RTT increase ratio. The classification influences PROBE_RTT interval, bandwidth floor
 * selection, and probe gain behavior. Uses hysteresis (class_confirm counter) to prevent rapid
 * class flapping. The decision tree evaluates conditions in order of specificity.
 */
static void ucp_update_net_class(struct sock *sk)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
	u32 avg = 0, mn = ~0U, mx = 0; /* avg = running sum for mean RTT calculation (microseconds); mn = minimum sample (init to max); mx = maximum sample (init to 0) */
	int v = 0; /* Number of valid RTT samples found in the history buffer */
	u32 jitter, loss, rinc; /* jitter = RTT range (mx - mn) in us; loss = loss ratio EWMA in BBR_SCALE; rinc = RTT increase ratio in BBR_SCALE */
	u8 candidate; /* Proposed new path class for this evaluation; will be compared to current with hysteresis */

	/* Accumulate RTT statistics from the 2-slot history buffer */
	if (ucp->rtt_history[0]) { /* First slot of the RTT history buffer has a valid sample */
		avg += ucp->rtt_history[0]; v++; /* Add to the running sum and increment the sample counter */
		mn = mx = ucp->rtt_history[0]; /* Initialize both min and max to the first sample's value for comparison with the second sample */
	}
	if (ucp->rtt_history[1]) { /* Second slot of the RTT history buffer has a valid sample */
		avg += ucp->rtt_history[1]; v++; /* Add to the running sum and increment the sample counter */
		mn = min(mn, ucp->rtt_history[1]); /* Update the minimum: keep the smaller of the existing min and this sample */
		mx = max(mx, ucp->rtt_history[1]); /* Update the maximum: keep the larger of the existing max and this sample */
	}

	if (v < 2) { /* Need at least 2 RTT samples for a reasonably reliable path classification */
		ucp->net_class = UCP_CLASS_DEFAULT; /* Insufficient data: revert to the DEFAULT unclassified path */
		ucp->class_confirm = 0; /* Reset the hysteresis confirmation counter since we are starting fresh */
		return; /* Cannot classify the path with fewer than 2 RTT samples */
	}

	avg /= v; /* Compute the arithmetic mean RTT from the available samples (sum / count) */
	jitter = mx - mn; /* Compute jitter as the range (max - min) of the two RTT samples; a simple measure of RTT variability */
	loss = ucp_get_loss_ratio(sk); /* Get the current smoothed loss ratio EWMA (BBR_SCALE) */
	rinc = ucp_get_rtt_increase_ratio(sk); /* Get the RTT increase ratio (BBR_SCALE): measures queuing delay relative to min_rtt */

	/* Decision tree for path class classification: evaluated in order from most significant to least specific */
	if (loss > ucp_high_loss_thresh_val) { /* High loss (> 5%) is the strongest and most definitive signal of persistent congestion */
		candidate = UCP_CLASS_CONGESTED; /* Loss exceeding the high threshold trumps all other signals: classify as persistently congested */
	} else if (avg < UCP_CLASS_LAN_RTT_US && /* Average RTT < 5 ms (very low latency, typical of local networks) */
		   jitter < UCP_CLASS_LAN_JITTER_US && /* Jitter < 3 ms (very low RTT variation) */
		   loss < UCP_CLASS_LAN_LOSS_THRESH) { /* Loss < 0.1% (nearly lossless) */
		candidate = UCP_CLASS_LAN; /* All three LAN criteria met: classify as a Local Area Network path */
	} else if (loss > UCP_CLASS_MOBILE_LOSS_THRESH && /* Loss > 3% (high loss typical of cellular links) */
		   jitter > UCP_CLASS_MOBILE_JITTER_US) { /* Jitter > 20 ms (high variability typical of mobile networks) */
		candidate = UCP_CLASS_MOBILE; /* Both mobile signatures present: classify as a MOBILE/cellular path */
	} else if (avg > UCP_CLASS_LOSSY_RTT_US && /* Average RTT > 80 ms (high latency, typical of satellite links) */
		   loss > UCP_CLASS_LOSSY_LOSS_THRESH) { /* Loss > 1% (significant background packet loss) */
		candidate = UCP_CLASS_LOSSY_FAT; /* High latency + significant loss: classify as a LOSSY_FAT (satellite-like) path */
	} else if (rinc >= UCP_CLASS_CONG_RINC_THRESH && /* RTT increase ratio >= 50% (significant queuing delay buildup) */
		   loss >= ucp_low_loss_thresh_val) { /* Loss >= 1% (some packet loss is occurring) */
		candidate = UCP_CLASS_CONGESTED; /* Significant RTT rise combined with loss: classify as temporarily CONGESTED */
	} else if (avg > UCP_CLASS_VPN_RTT_US) { /* Average RTT > 60 ms (elevated but stable latency) */
		candidate = UCP_CLASS_VPN; /* Elevated latency without loss or jitter: classify as a VPN tunnel path */
	} else { /* None of the specific path signatures are matched */
		candidate = UCP_CLASS_DEFAULT; /* Use the DEFAULT unclassified path: standard parameters apply */
	}

	/* Apply hysteresis to prevent rapid class flapping */
	if (candidate == ucp->net_class) { /* The proposed class matches the current classification: strengthen confidence */
		ucp->class_confirm = min_t(u32, ucp->class_confirm + 1, /* Increment confirmation counter, capped at UCP_CLASS_CONFIRM_MAX (7) to prevent overflow */
					   UCP_CLASS_CONFIRM_MAX); /* Saturating counter: confirms stability but cannot exceed the 3-bit field's safe maximum */
	} else { /* Proposed class differs from current: accumulate evidence for a transition */
		ucp->class_confirm = min_t(u32, ucp->class_confirm + 1, /* Increment with saturation to prevent 3-bit wraparound on mismatch after full-saturation */
					   UCP_CLASS_CONFIRM_CNT); /* Saturate at threshold to avoid overflow -> 0 reset that would discard accumulated confidence */
		if (ucp->class_confirm >= UCP_CLASS_CONFIRM_CNT) { /* Need UCP_CLASS_CONFIRM_CNT (2) consecutive disagreeing samples to change class */
			ucp->net_class = candidate; /* Commit the transition to the new path class */
			ucp->class_confirm = 0; /* Reset the confirmation counter after a successful transition */
		}
	}
}
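
/*
 * Illustrative classification (assumed samples): rtt_history = {3000, 4000}
 * us gives avg = 3500 us and jitter = 1000 us; with a zero loss EWMA all
 * three LAN criteria (<5 ms RTT, <3 ms jitter, <0.1% loss) hold, so the
 * candidate is UCP_CLASS_LAN and commits after the 2-sample hysteresis.
 */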

/**
 * @brief Get the pacing gain for the current PROBE_BW gain cycle phase.
 * @param sk  The TCP socket
 * @return    Pacing gain in BBR_SCALE: probe gain (>BBR_UNIT), drain gain (<BBR_UNIT), or 1.0x (cruise)
 *
 * Phase 0 (probe): Returns the configured probe gain (default 1.10x) unless:
 * - Loss >= probe_skip_loss_thresh (2%) OR RTT rise >= probe_skip_rtt_rise_thresh (40%): skip probe, return 1.0x
 * - MOBILE path with loss >= drain_thresh (1%): return reduced mobile_probe_gain (1.0x default)
 * Phase 1 (drain): Returns 0.75x (standard BBR drain gain) if the probe phase actually applied >1.0x gain;
 * otherwise returns 1.0x (no drain needed since probe was skipped).
 * Phases 2-7 (cruise): Returns 1.0x (neutral gain, send at estimated BtlBw).
 */
static u32 ucp_get_cycle_pacing_gain(const struct sock *sk)
{
	const struct ucp *ucp = (const struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state (const access) */

	if (ucp->cycle_idx == 0) { /* Phase 0: probe phase - attempt to send faster than BtlBw to test for newly available bandwidth */
		/* Safety check: skip the probe if the network is already under stress (high loss or significant RTT rise) */
		if (ucp_get_loss_ratio(sk) >= ucp_probe_skip_loss_val || /* Current loss EWMA >= 2% threshold: probing would likely cause more loss */
		    ucp_get_rtt_increase_ratio(sk) >= ucp_probe_skip_rtt_rise_val) { /* RTT increase >= 40%: probing would increase queue buildup */
			return BBR_UNIT; /* Return neutral gain (1.0x) to avoid aggravating the network condition */
		}
		/* On MOBILE paths with loss above the drain threshold, use a reduced probe (UCP-specific; BBR mode skips this) */
		if (ucp_bbr_mode == 0 && /* Full UCP mode: apply class-based probe gain reduction */
		    ucp->net_class == UCP_CLASS_MOBILE && /* Path is classified as MOBILE (cellular/high-loss wireless) */
		    ucp_get_loss_ratio(sk) >= ucp_drain_loss_thresh_val) { /* Loss >= 1% (drain threshold): condition is too lossy for aggressive probing */
			return ucp_probe_gain_mobile_val; /* Return the reduced mobile probe gain (default 1.0x = no probe) */
		}
		return ucp_probe_gain_val; /* Conditions are favorable: return the standard probe gain (default 1.10x; user-configurable via sysfs) */
	}

	if (ucp->cycle_idx == UCP_PROBE_BW_DRAIN_IDX) { /* Phase 1: drain phase - reduce rate to drain any excess inflight built during the probe */
		if (ucp->probe_gain_applied) { /* The preceding probe phase actually applied a gain > 1.0x, so there is excess inflight to drain */
			return (BBR_UNIT * 3 / 4);   /* Return the standard BBR drain gain: 0.75x BtlBw to drain the queue at a controlled rate */
		} else { /* The probe phase was skipped (returned neutral gain 1.0x), so no excess inflight was created and no drain is needed */
			return BBR_UNIT; /* Return neutral gain (1.0x) since there is no queue to drain */
		}
	}

	/* Phases 2 through 7: cruise phases - maintain the current rate at exactly the estimated BtlBw */
	return BBR_UNIT; /* Neutral gain (1.0x): send at the estimated bottleneck rate without adding or draining queue */
}

/**
 * @brief Advance the PROBE_BW gain cycle to the next phase and handle post-probe drain queuing.
 * @param sk  The TCP socket
 *
 * Advances cycle_idx to the next phase (modulo 8 = UCP_PROBE_BW_CYCLE_LEN). Before advancing,
 * if completing a probe phase (index 0) that actually applied >1.0x gain and the loss ratio
 * exceeds the drain threshold, an early drain level (1-3) is queued based on loss severity.
 * The phase start timestamp is updated, and the new phase's pacing gain is set.
 */
static void ucp_advance_cycle_phase(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);   /* Get the TCP socket's internal state for the delivered_mstamp timestamp */
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
	u32 gain; /* Pacing gain for the newly entered phase, evaluated once below */

	/* Before advancing from the probe phase (index 0), check if an early drain needs to be queued (UCP-specific: BBR does not do loss-triggered early drains) */
	if (ucp_bbr_mode == 0 && ucp->cycle_idx == 0 && ucp->probe_gain_applied && /* Only in UCP mode AND completing a probe phase that actually applied >1.0x gain */
	    ucp_get_loss_ratio(sk) >= ucp_drain_loss_thresh_val) { /* And current loss ratio >= drain trigger threshold (1%): probing caused detectable loss */
		u32 loss = ucp_get_loss_ratio(sk); /* Get the current loss EWMA to determine the drain severity level */
		if (loss >= ucp_drain_lvl3_loss_thresh_val) { /* Loss >= 10%: most aggressive drain needed for severe loss */
			ucp->drain_pending = 3; /* queue aggressive drain (level 3): pacing gain -> 0.65x (default) */
		} else if (loss >= ucp_drain_lvl2_loss_thresh_val) { /* Loss >= 5% but < 10%: moderate loss, standard drain */
			ucp->drain_pending = 2; /* queue standard drain (level 2): pacing gain -> 0.75x (default) */
		} else { /* Loss >= 1% (drain threshold) but < 5%: light loss, gentle drain */
			ucp->drain_pending = 1; /* queue light drain (level 1): pacing gain -> 0.85x (default) */
		}
	}

	/* Advance to the next phase in the 8-position gain cycle, wrapping around at the end */
	ucp->cycle_idx = (ucp->cycle_idx + 1) & (UCP_PROBE_BW_CYCLE_LEN - 1); /* Increment and wrap via mask: & 7 ensures modulo 8 operation */
	ucp_set_cycle_mstamp(ucp, tp->delivered_mstamp); /* Record the microsecond timestamp of when this new phase started */

	/* Evaluate the new phase's gain once so the probe flag and the applied gain stay consistent */
	gain = ucp_get_cycle_pacing_gain(sk);
	ucp->probe_gain_applied = (ucp->cycle_idx == 0 && /* Only the probe phase (index 0) can have >1.0x gain */
		gain > BBR_UNIT); /* Gain must strictly exceed 1.0x to be considered a probe */

	ucp->pacing_gain = gain; /* Set the current pacing gain to the gain value for the newly entered phase */
}
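
/*
 * Illustrative drain-level pick (assuming the default thresholds named
 * above): a post-probe loss EWMA of 15 in BBR_SCALE (~5.9%) is at or above
 * the ~5% level-2 threshold but below the ~10% level-3 threshold, so
 * drain_pending is queued at 2 and ucp_apply_pacing_constraints() will
 * apply the 0.75x standard drain gain exactly once.
 */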

/**
 * @brief Compute the adaptive PROBE_RTT interval in jiffies based on path conditions.
 * @param sk  The TCP socket
 * @return    PROBE_RTT interval in jiffies (kernel timer ticks), capped at ucp_probe_rtt_max_jiffies
 *
 * The interval starts from the base value (default 10 seconds) and adds extra time based on:
 * - Path class: MOBILE and LOSSY_FAT paths get class_extra seconds (default +5)
 * - High loss (>= high_loss_thresh, 5%): adds high_loss_extra seconds (default 0)
 * - Medium loss (>= probe_skip_loss, 2% but < high_loss): adds mid_loss_extra seconds (default 0)
 * The total is capped at ucp_probe_rtt_max_jiffies (default max 15 seconds).
 */
static u32 ucp_get_probe_rtt_interval(const struct sock *sk)
{
	const struct ucp *ucp = (const struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state (const access) */
	u32 interval = ucp_probe_rtt_base_jiffies; /* Start with the base interval (default 10 seconds converted to jiffies) */
	u32 loss = ucp_get_loss_ratio(sk); /* Get the current loss EWMA for the extra-time checks */

	/* MOBILE and LOSSY_FAT paths need longer intervals between PROBE_RTT entries (UCP-specific: BBR uses fixed interval) */
	if (ucp_bbr_mode == 0 && /* Only apply class-based extension in full UCP mode; BBR mode uses fixed interval */
	    (ucp->net_class == UCP_CLASS_MOBILE || /* Cellular paths have variable RTT; longer interval avoids unnecessary probes */
	    ucp->net_class == UCP_CLASS_LOSSY_FAT)) { /* Lossy fat pipes (e.g., satellite) have high latency; longer interval avoids excessive RTT samples */
		interval += ucp_probe_rtt_class_extra_jiffies; /* Add the class extra time (default 5 seconds in jiffies) */
	}

	/* Add extra time when loss is elevated: more loss = longer window between RTT refresh operations.
	 * In BBR mode (ucp_bbr_mode == 1), these UCP-specific extensions are skipped to keep standard BBR behavior. */
	if (ucp_bbr_mode == 0) {
		if (loss >= ucp_high_loss_thresh_val) { /* Loss EWMA >= 5% (high threshold): significant loss in progress, postpone PROBE_RTT */
			interval += ucp_probe_rtt_high_loss_extra_jiffies; /* Add high loss extra time (default 0 seconds, configurable via module parameter) */
		} else if (loss >= ucp_probe_skip_loss_val) { /* Loss EWMA >= 2% (probe skip threshold) but below high threshold: moderate loss */
			interval += ucp_probe_rtt_mid_loss_extra_jiffies; /* Add mid loss extra time (default 0 seconds, configurable via module parameter) */
		}
	}

	/* Clamp the total interval to the configured maximum to prevent excessively long intervals between RTT refreshes */
	return min_t(u32, interval, ucp_probe_rtt_max_jiffies); /* Cap at ucp_probe_rtt_max_jiffies (default max 15 seconds) */
}
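
/*
 * Illustrative interval computation (assuming the defaults noted above): on
 * a MOBILE path with a 3% loss EWMA, interval = 10s base + 5s class extra
 * + 0s mid-loss extra = 15s, which the final clamp leaves at the 15s
 * maximum.
 */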

/**
 * @brief Map a drain level to its corresponding pacing gain in BBR_SCALE.
 * @param level  Drain severity level: 1 = light, 2 = standard, 3 = aggressive
 * @return       Pacing gain in BBR_SCALE (all drain gains are < BBR_UNIT to reduce sending rate):
 *               Level 1: ~0.85x (light), Level 2: 0.75x (standard), Level 3: ~0.65x (aggressive)
 *               Invalid level or 0: returns BBR_UNIT (1.0x = no drain)
 */
static u32 ucp_drain_gain_by_level(int level)
{
	switch (level) { /* Select the appropriate drain pacing gain based on the queued drain severity level */
		case 1: return ucp_drain_gain_light_val;      /* Level 1: light drain with gentle pacing reduction (default 0.85x = ~217 BBR_SCALE) */
		case 2: return ucp_drain_gain_standard_val;   /* Level 2: standard drain with moderate pacing reduction (default 0.75x = 192 BBR_SCALE) */
		case 3: return ucp_drain_gain_aggressive_val; /* Level 3: aggressive drain with strong pacing reduction (default 0.65x = ~166 BBR_SCALE) */
		default: return BBR_UNIT; /* Invalid or zero level: return neutral gain (1.0x = 256 BBR_SCALE) meaning no drain effect */
	}
}

/**
 * @brief Apply any queued one-shot drain constraints to the pacing gain.
 * @param sk  The TCP socket
 *
 * This is called from the main ACK processing path (ucp_main) after ucp_update_model().
 * If ucp->drain_pending is non-zero (meaning ucp_advance_cycle_phase queued a drain),
 * this function:
 * 1. Overrides pacing_gain with the corresponding drain gain from ucp_drain_gain_by_level()
 * 2. Clears the pending flag (one-shot: the drain is applied exactly once)
 * 3. Records the phase start timestamp at the current delivery time
 * This provides a rapid queue-drain response to loss triggered by bandwidth probes.
 */
static void ucp_apply_pacing_constraints(struct sock *sk)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
	if (ucp->drain_pending) { /* Check if a drain has been queued (non-zero drain_pending means level 1, 2, or 3 is waiting) */
		ucp->pacing_gain = ucp_drain_gain_by_level(ucp->drain_pending); /* Override the cycling pacing gain with the drain-specific gain (0.65x-0.85x) */
		ucp->drain_pending = 0; /* Clear the pending flag: this is a one-shot drain that has now been applied */
		ucp_set_cycle_mstamp(ucp, tcp_sk(sk)->delivered_mstamp); /* Record the drain start timestamp at the current delivery time marking phase transition */
	}
}

/**
 * enum ucp_cong_level - Congestion severity levels for graduated cwnd ceiling application
 * @UCP_CONG_NONE:     No congestion detected; no cwnd cap beyond the standard cwnd gain (2.0x BDP)
 * @UCP_CONG_MILD:     Mild congestion; slight cwnd cap applied (1.75x BDP by default)
 * @UCP_CONG_MODERATE: Moderate congestion; tighter cwnd cap (1.50x BDP by default)
 * @UCP_CONG_SEVERE:   Severe congestion; most restrictive cwnd cap (1.25x BDP by default)
 */
enum ucp_cong_level {
	UCP_CONG_NONE = 0,      /* Value 0: No congestion signals detected; cwnd can use the full standard gain (2.0x BDP) */
	UCP_CONG_MILD = 1,      /* Value 1: Mild congestion; cwnd capped at ucp_cwnd_cap_mild_val (1.75x BDP default) */
	UCP_CONG_MODERATE = 2,  /* Value 2: Moderate congestion; cwnd capped at ucp_cwnd_cap_moderate_val (1.50x BDP default) */
	UCP_CONG_SEVERE = 3,    /* Value 3: Severe congestion; cwnd capped at ucp_cwnd_cap_severe_val (1.25x BDP default) */
};

/**
 * @brief Determine the current congestion severity level based on RTT increase and loss ratio.
 * @param sk  The TCP socket
 * @return    ucp_cong_level enum: NONE (0), MILD (1), MODERATE (2), or SEVERE (3)
 *
 * Combines two signals (RTT increase ratio and loss ratio) using an OR condition:
 * whichever signal indicates higher severity determines the result. This means either
 * queuing delay (RTT rise) or packet loss can independently trigger the congestion response.
 * The severity level is used by ucp_apply_cwnd_constraints() to select the cwnd cap.
 */
static enum ucp_cong_level ucp_congestion_level(const struct sock *sk)
{
	u32 rinc = ucp_get_rtt_increase_ratio(sk); /* Get the RTT increase ratio in BBR_SCALE: represents queuing delay relative to baseline */
	u32 loss = ucp_get_loss_ratio(sk); /* Get the smoothed loss ratio EWMA in BBR_SCALE: represents packet loss severity */

	/* Check the most severe level first: SEVERE - either RTT doubled or loss >= 5% */
	if (rinc >= UCP_CONG_SEVERE_RINC_THRESH || /* RTT increased by >= 100% (doubled): severe queuing delay */
	    loss >= ucp_cong_severe_loss_val) { /* Loss >= 5% (high threshold): severe packet loss */
		return UCP_CONG_SEVERE; /* Both conditions independently indicate severe congestion */
	}
	/* Check MODERATE: RTT up by >= 50% or loss >= 1% */
	if (rinc >= UCP_CONG_MODERATE_RINC_THRESH || /* RTT increased by >= 50%: moderate queuing delay */
	    loss >= ucp_cong_moderate_loss_val) { /* Loss >= 1% (drain threshold): moderate packet loss */
		return UCP_CONG_MODERATE; /* Either condition indicates moderate congestion */
	}
	/* Check MILD: RTT up by >= 25% or loss >= the mild threshold (1% by default, which the MODERATE check above already covers; the loss leg only becomes distinct if the two thresholds are tuned apart) */
	if (rinc >= UCP_CONG_MILD_RINC_THRESH || /* RTT increased by >= 25%: mild queuing delay */
	    loss >= ucp_cong_mild_loss_val) { /* Loss >= the mild (low) threshold: mild packet loss */
		return UCP_CONG_MILD; /* Either condition indicates mild congestion */
	}
	return UCP_CONG_NONE; /* No significant congestion signals: no cwnd cap needed */
}
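
/*
 * Illustrative severity pick (assuming default thresholds): rinc = 140 in
 * BBR_SCALE (~55% RTT rise) with zero loss clears the >=50% MODERATE bar
 * but not the >=100% SEVERE bar, so the function returns UCP_CONG_MODERATE
 * and ucp_apply_cwnd_constraints() would cap cwnd_gain at ~1.5x BDP.
 */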

/**
 * @brief Apply cwnd ceiling caps for congestion severity and STARTUP loss-based gain limits.
 * @param sk  The TCP socket
 *
 * Two-stage constraint application in order:
 * 1. CONGESTED condition cap: If net_condition is UCP_COND_CONGESTED, reduces cwnd_gain
 *    based on the current congestion severity level (NONE/MILD/MODERATE/SEVERE).
 * 2. STARTUP loss-cap: If in STARTUP mode, applies loss-based gain reduction:
 *    - Loss >= hard_cap threshold (2%): cap both cwnd_gain and pacing_gain at cwnd_gain_val (2.0x)
 *    - Loss >= soft_drain threshold (0.5%): cap both gains at startup_soft_gain_val (2.5x)
 */
static void ucp_apply_cwnd_constraints(struct sock *sk)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
	u32 cap = ucp_cwnd_gain_val; /* Start with the standard steady-state cwnd gain (2.0x BDP = 512 BBR_SCALE) as the base cap */

	/* BBR compatible mode: skip all UCP-specific cwnd constraints (congestion ceiling and STARTUP caps) */
	if (ucp_bbr_mode == 1) /* Pure BBR mode: do not apply graduated caps or STARTUP loss-based reduction */
		return; /* Return immediately: BBR has no non-destructive cwnd constraint layer */

	/* Stage 1: If network condition is classified as CONGESTED, apply graduated cwnd caps */
	if (ucp->net_condition == UCP_COND_CONGESTED) { /* Only apply congestion severity caps when the network is actively classified as congested */
		switch (ucp_congestion_level(sk)) { /* Determine the current congestion severity from RTT increase and loss ratio */
			case UCP_CONG_MILD: /* Mild congestion: slight reduction to keep some headroom */
				cap = ucp_cwnd_cap_mild_val;      break; /* Cap = 1.75x BDP (default 448 BBR_SCALE): modest window reduction */
			case UCP_CONG_MODERATE: /* Moderate congestion: more significant reduction */
				cap = ucp_cwnd_cap_moderate_val;  break; /* Cap = 1.50x BDP (default 384 BBR_SCALE): tighter window */
			case UCP_CONG_SEVERE: /* Severe congestion: aggressive cwnd reduction */
				cap = ucp_cwnd_cap_severe_val;    break; /* Cap = 1.25x BDP (default 320 BBR_SCALE): most restrictive */
			default: break; /* UCP_CONG_NONE: keep the default cap (ucp_cwnd_gain_val = 2.0x BDP) unchanged */
		}
	}
	ucp->cwnd_gain = min_t(u32, ucp->cwnd_gain, cap); /* Apply the cap: cwnd_gain is the minimum of its current value and the selected cap */

	/* Stage 2: STARTUP mode loss-based gain reduction (UCP-specific non-destructive constraint) */
	if (ucp->mode == UCP_STARTUP) { /* Only apply these limits during the exponential bandwidth probing phase */
		u32 loss = ucp_get_loss_ratio(sk); /* Get current loss EWMA for the loss threshold checks */
		if (loss >= ucp_startup_hard_cap_val) { /* Loss >= 2% (hard cap threshold): significant loss during STARTUP requires aggressive gain reduction */
			ucp->cwnd_gain   = min_t(u32, ucp->cwnd_gain, ucp_cwnd_gain_val); /* Cap cwnd gain at standard cwnd gain (2.0x); reduces from UCP_HIGH_GAIN (~2.89x) */
			ucp->pacing_gain = min_t(u32, ucp->pacing_gain, ucp_cwnd_gain_val); /* Cap pacing gain at standard cwnd gain (2.0x); reduces from UCP_HIGH_GAIN */
		} else if (loss >= ucp_startup_soft_drain_val) { /* Loss >= 0.5% (soft drain threshold): moderate loss during STARTUP requires moderate gain reduction */
			ucp->cwnd_gain   = min_t(u32, ucp->cwnd_gain, ucp_startup_soft_gain_val); /* Reduce cwnd gain to the soft gain value (2.5x default) */
			ucp->pacing_gain = min_t(u32, ucp->pacing_gain, ucp_startup_soft_gain_val); /* Reduce pacing gain to the soft gain value (2.5x default) */
		}
	}
}
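
/*
 * Worked example of the cap arithmetic (illustrative; BBR_UNIT = 256 = 1.0x,
 * and the gain values assume the documented defaults):
 *   In CONGESTED/MODERATE: cap = 384 (1.50x), so a cwnd_gain of
 *   UCP_HIGH_GAIN (~739, ~2.885x) becomes min(739, 384) = 384.
 *   In STARTUP with loss at 0.8% (>= 0.5% soft, < 2% hard): both gains are
 *   capped at startup_soft_gain_val = 640 (2.5x).
 */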

/**
 * @brief Apply quantization adjustments to the target cwnd: TSO headroom, even rounding, probe bonus.
 * @param sk   The TCP socket
 * @param cwnd The base target cwnd in packets (from BDP * gain calculation)
 * @return     The adjusted target cwnd in packets after applying all quantization effects
 *
 * Three adjustments applied in order:
 * 1. TSO headroom: adds UCP_TSO_HEADROOM_SEGS (3) * tso_segs_goal() segments to prevent TSO from starving the pipeline
 * 2. Even rounding: rounds up to the nearest even integer for better GSO/TSO alignment
 * 3. Probe bonus: adds UCP_PROBE_CWND_BONUS (2) extra segments if in PROBE_BW with >1.0x gain
 */
static u32 ucp_quantization_budget(struct sock *sk, u32 cwnd)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for mode and pacing_gain checks */
	cwnd += UCP_TSO_HEADROOM_SEGS * ucp_tso_segs_goal(sk); /* Add TSO headroom: 3 * TSO goal segments to ensure the TSO/GSO layer has enough budget and doesn't stall the pipeline */
	if (cwnd < U32_MAX) { /* Avoid overflow when rounding if cwnd is already at the maximum possible value */
		cwnd = (cwnd + 1) & ~1U; /* Round up to the nearest even integer: add 1 then clear the LSB; improves GSO alignment on NIC hardware */
	}
	if (ucp->mode == UCP_PROBE_BW && ucp->pacing_gain > BBR_UNIT) { /* In probe phase with >1.0x gain: add extra segments to ensure we fully probe for new bandwidth */
		cwnd += UCP_PROBE_CWND_BONUS; /* Add UCP_PROBE_CWND_BONUS (2) extra segments during probe to push the pipeline and detect newly available bandwidth */
	}
	return cwnd; /* Return the fully quantized cwnd target with all adjustments applied */
}
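
/*
 * Worked example (illustrative; assumes ucp_tso_segs_goal() returns 4 and the
 * documented constants UCP_TSO_HEADROOM_SEGS = 3, UCP_PROBE_CWND_BONUS = 2):
 *   base cwnd = 101
 *   + TSO headroom:  101 + 3 * 4         = 113
 *   + even rounding: (113 + 1) & ~1      = 114
 *   + probe bonus (PROBE_BW, gain > 1):  114 + 2 = 116
 */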

/**
 * @brief Compute target inflight data as BDP * gain with quantization adjustments.
 * @param sk   The TCP socket
 * @param bw   Bottleneck bandwidth in BW_SCALE
 * @param gain Gain multiplier in BBR_SCALE
 * @return     Target inflight cwnd in packets (quantized for TSO, even, and probe bonus)
 *
 * Convenience wrapper: calls ucp_bdp() to compute the base BDP * gain, then passes the result
 * through ucp_quantization_budget() for TSO headroom, even rounding, and probe bonus.
 */
static u32 ucp_inflight(struct sock *sk, u32 bw, int gain)
{
	/* Compute BDP * gain as the base target, then apply quantization (TSO headroom, even rounding, probe bonus) */
	return ucp_quantization_budget(sk, ucp_bdp(sk, bw, gain));
}

/**
 * @brief Estimate the number of packets still in flight at the Earliest Departure Time (EDT) of the next packet.
 * @param sk             The TCP socket
 * @param inflight_now   Current number of packets in flight (prior_in_flight from the rate sample)
 * @return               Projected inflight packets at EDT; 0 if the pipe will have drained completely
 *
 * Projects how many packets will still be in the network at the earliest departure time of the next
 * packet to be sent. This estimate is used in ucp_is_next_cycle_phase() to determine whether the
 * current PROBE_BW probe phase has done its job: once the projected inflight at EDT reaches the
 * target inflight for the probe gain (or losses appear), the probe has succeeded in raising
 * inflight to gain * BDP and the cycle advances; while it is still below target, the probe
 * continues building inflight.
 *
 * Calculation: inflight_now + TSO goal (if probing) - packets delivered between now and EDT
 */
static u32 ucp_packets_in_net_at_edt(struct sock *sk, u32 inflight_now)
{
	struct tcp_sock *tp = tcp_sk(sk);   /* Get the TCP socket's internal state for tcp_clock_cache and tcp_wstamp_ns */
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for pacing_gain */
	u64 now_ns = tp->tcp_clock_cache;  /* Current time in nanoseconds, cached from the TCP clock for efficiency */
	u64 edt_ns = max(tp->tcp_wstamp_ns, now_ns); /* Earliest Departure Time: the later of the current time and the pacing wakeup timestamp; when the next packet can actually be transmitted */
	u32 delivered = (u64)ucp_max_bw(sk) * /* Compute expected deliveries between now and EDT: bandwidth * time_delta converted to packets */
			div_u64(edt_ns - now_ns, NSEC_PER_USEC) >> BW_SCALE; /* Convert (edt - now) from ns to us, multiply by BW (BW_SCALE), shift to get integer packets */
	u32 inflight_at_edt = inflight_now; /* Start with the current inflight count; we will subtract deliveries occurring before EDT */

	if (ucp->pacing_gain > BBR_UNIT) { /* If currently in a probe phase with >1.0x gain, additional TSO segments may be scheduled */
		inflight_at_edt += ucp_tso_segs_goal(sk); /* Add the TSO goal segment count to account for TSO burst that will be queued before EDT */
	}
	if (delivered >= inflight_at_edt) { /* More packets will be delivered than are currently in flight: the pipe will drain completely before EDT */
		return 0; /* Return 0: no packets will be in net at EDT, the pipe is empty */
	}
	return inflight_at_edt - delivered; /* Return the projected inflight count: current + TSO burst - deliveries during the interval */
}
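
/*
 * Worked example (illustrative numbers): max_bw = 0.1 pkt/us, i.e. a raw
 * filter value of ~1,677,722 in BW_SCALE (0.1 * 2^24).  If the pacer's next
 * departure (EDT) is 500 us away, delivered = 1,677,722 * 500 >> 24 ~ 50
 * packets.  With inflight_now = 80 and no probe gain, the projection is
 * 80 - 50 = 30; had only 40 packets been in flight, the function would
 * return 0 (the pipe drains completely before EDT).
 */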

/**
 * @brief Handle cwnd adjustments for entry into and exit from TCP fast recovery.
 * @param sk        The TCP socket
 * @param rs        The rate sample (contains loss count)
 * @param acked     Number of packets acknowledged in this ACK
 * @param new_cwnd  Output parameter: the computed new cwnd value
 * @return          true if cwnd was set for recovery entry (caller should use cwnd and skip normal update);
 *                  false if no recovery entry action taken (caller continues with normal cwnd computation)
 *
 * On entry to TCP_CA_Recovery (state transition from non-recovery to recovery):
 * - Sets fast_recovery flag
 * - Reduces cwnd by the number of lost packets (packet conservation)
 * - Sets cwnd to max(reduced_cwnd, packets_in_flight + acked) to avoid window collapse
 * On exit from TCP_CA_Recovery:
 * - Restores cwnd to max(current_cwnd, prior_cwnd) to recover the pre-recovery window
 * - Clears fast_recovery flag
 */
static bool ucp_set_cwnd_to_recover_or_restore(
	struct sock *sk, const struct rate_sample *rs, u32 acked, u32 *new_cwnd)
{
	struct tcp_sock *tp = tcp_sk(sk);   /* Get the TCP socket's internal state for snd_cwnd */
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
	u8 ps = ucp->prev_ca_state;        /* The TCP CA state before the most recent transition; used to detect transitions */
	u8 st = inet_csk(sk)->icsk_ca_state; /* The current TCP CA state (Open, Disorder, CWR, Recovery, Loss) */
	u32 cwnd = tp->snd_cwnd;           /* Current congestion window in packets */

	/* Detect entry into fast recovery: state changed to Recovery and it was NOT in Recovery before */
	if (st == TCP_CA_Recovery && ps != TCP_CA_Recovery) { /* Transition detected: we just entered TCP_CA_Recovery */
		ucp->prev_ca_state = st; /* Update the saved previous state to current for next transition detection */
		ucp->fast_recovery = 1; /* Set the fast_recovery flag: indicates we are handling a non-congestion loss recovery */
		cwnd = max_t(u32, cwnd - (u32)rs->losses, 1); /* Reduce cwnd by the number of lost packets (packet conservation principle); minimum 1 */
		*new_cwnd = max(cwnd, tcp_packets_in_flight(tp) + acked); /* Ensure cwnd is at least what's still in flight plus what was just ACKed */
		return true; /* Signal to caller: cwnd has been set for recovery entry, skip normal BDP-based cwnd update */
	}

	/* Detect exit from recovery: previous state was Recovery (or higher) but current state is not */
	if (ps >= TCP_CA_Recovery && st < TCP_CA_Recovery) { /* Transition detected: we just exited TCP_CA_Recovery (back to Open/Disorder/CWR) */
		cwnd = max(cwnd, ucp->prior_cwnd); /* Restore cwnd to at least the saved prior_cwnd (pre-recovery window) */
		ucp->fast_recovery = 0; /* Clear the fast_recovery flag: recovery is complete */
	}

	ucp->prev_ca_state = st; /* Update the saved previous state to the current state for next transition detection */
	*new_cwnd = cwnd; /* Output the computed (or current) cwnd value */
	return false; /* Signal to caller: continue with normal cwnd computation */
}

/**
 * @brief Compute and set the congestion window for the current ACK.
 * @param sk     The TCP socket
 * @param rs     The rate sample from the TCP stack
 * @param acked  Number of packets acknowledged in this ACK
 * @param bw     Current bottleneck bandwidth estimate in BW_SCALE
 * @param gain   Current cwnd gain in BBR_SCALE (may have been capped by constraints)
 *
 * Implements the core cwnd update algorithm:
 * 1. If no packets acked, skip the update (cwnd unchanged)
 * 2. If entering fast recovery, use the recovery cwnd (from ucp_set_cwnd_to_recover_or_restore)
 * 3. Compute target = BDP * gain + quantization
 * 4. Clamp target to [1.25x BDP, 2.0x BDP] via inflight bounds
 * 5. If pipe is full (post-STARTUP): cwnd = min(cwnd + acked, target) (AIMD-like increase)
 * 6. If pipe is not yet full: cwnd += acked (exponential growth during STARTUP)
 * 7. Apply minimum cwnd floor (UCP_CWND_MIN_TARGET = 4)
 * 8. If just exiting PROBE_RTT, restore cwnd to prior_cwnd at minimum
 * 9. Clamp final cwnd to tp->snd_cwnd_clamp
 * 10. If in PROBE_RTT mode, force cwnd to 4 (UCP_CWND_MIN_TARGET)
 */
static void ucp_set_cwnd(struct sock *sk, const struct rate_sample *rs,
			 u32 acked, u32 bw, int gain)
{
	struct tcp_sock *tp = tcp_sk(sk);   /* Get the TCP socket's internal state for snd_cwnd and snd_cwnd_clamp */
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
	u32 cwnd = tp->snd_cwnd, target; /* cwnd = current cwnd; target = computed BDP-based target cwnd */

	/* If no packets were acknowledged in this ACK, there is no new information to update cwnd */
	if (!acked) {
		goto done; /* Skip all cwnd update logic and jump to the final clamping section */
	}

	/* Check if we are entering fast recovery: if so, use the recovery-special cwnd and skip normal update */
	if (ucp_set_cwnd_to_recover_or_restore(sk, rs, acked, &cwnd)) {
		goto done; /* Recovery entry handled cwnd; skip normal BDP-based computation */
	}

	/* Compute the target cwnd as BDP * gain with quantization (TSO headroom, even rounding, probe bonus) */
	target = ucp_quantization_budget(sk, ucp_bdp(sk, bw, gain));

	/* Clamp the target cwnd to a reasonable range based on min_rtt-based BDP bounds */
	if (ucp->min_rtt_us != ~0U && ucp->min_rtt_us > 0 && bw > 0) { /* Validate that all BDP inputs are available and valid */
		u64 bdp = (u64)bw * ucp->min_rtt_us >> BW_SCALE; /* Compute min-RTT-based BDP: bw * min_rtt in packets (bw is in BW_SCALE, shift right 24 for integer) */
		u32 lo = (u32)max_t(u64, TCP_INIT_CWND, /* Lower bound: at least the initial cwnd (10 segments) */
			 (bdp * UCP_INFLIGHT_LOW_GAIN) >> BBR_SCALE); /* Lower bound: BDP * 1.25x (in BBR_SCALE), prevents cwnd from dropping below 1.25x BDP */
		u32 hi = (u32)max_t(u64, lo, /* Upper bound: at least the lower bound */
			 (bdp * UCP_INFLIGHT_HIGH_GAIN) >> BBR_SCALE); /* Upper bound: BDP * 2.0x (in BBR_SCALE), prevents cwnd from exceeding 2.0x BDP */
		target = clamp(target, lo, hi); /* Clamp the computed target cwnd to the [1.25x BDP, 2.0x BDP] range */
	}

	/* Apply ACK aggregation compensation: add a cwnd bonus based on recent excess ACK counts.
	 * Equivalent to BBR's extra_acked logic but using a single-slot exponential-decay max
	 * instead of a dual-slot sliding window (see ucp_update_ack_aggregation for details).
	 * Intentionally placed after the inflight bounds [1.25x, 2.0x] clamping so the compensation
	 * can lift the effective target above the nominal steady-state ceiling.  This is safe because:
	 *   - extra_acked_max is u8-saturated (max 255 pkts) and decays exponentially
	 *   - compensation is bounded by tp->snd_cwnd_clamp (the socket's absolute max)
	 *   - the non-destructive constraint layer (ucp_apply_cwnd_constraints) overrides cwnd_gain
	 *     in CONGESTED conditions, compressing the effective window regardless of target
	 */
	if (ucp_extra_acked_gain_val > 0 && ucp->extra_acked_max > 0) { /* Compensation is enabled and there is a non-zero excess max recorded */
		u32 comp = ((u32)ucp->extra_acked_max * ucp_extra_acked_gain_val) >> BBR_SCALE; /* Compute compensation = extra_acked_max * gain, scaled down by BBR_SCALE */
		target = min_t(u32, target + comp, tp->snd_cwnd_clamp); /* Add compensation to target, clamped to socket max window */
	}

	/* Apply the cwnd update rule based on whether the pipe is full */
	if (ucp_full_bw_reached(sk)) { /* Pipe is full (post-STARTUP or PROBE_BW): additive increase toward target */
		cwnd = min(cwnd + acked, target); /* Increase cwnd by acked but not beyond the target (BDP-derived ceiling) */
	} else if (cwnd < target || tp->delivered < TCP_INIT_CWND) { /* Pipe not yet full OR very early in connection: exponential growth */
		cwnd += acked; /* Grow by the full acked amount with no target ceiling: since acked scales with cwnd, this yields exponential per-RTT growth during STARTUP */
	}

	/* Enforce the absolute minimum congestion window of UCP_CWND_MIN_TARGET (4 packets) */
	cwnd = max(cwnd, UCP_CWND_MIN_TARGET);

	/* If we just restored cwnd after exiting PROBE_RTT, ensure cwnd is at least prior_cwnd */
	if (unlikely(ucp->probe_rtt_restored)) { /* Unlikely path: probe_rtt_restored flag set meaning we just exited PROBE_RTT */
		cwnd = max(cwnd, ucp->prior_cwnd); /* Restore cwnd to at least the saved pre-PROBE_RTT window */
		ucp->probe_rtt_restored = 0; /* Clear the flag: restoration has been applied */
	}

done:
	tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp); /* Write the final cwnd, clamped to the socket's configured maximum (snd_cwnd_clamp) */

	if (ucp->mode == UCP_PROBE_RTT) { /* If currently in PROBE_RTT mode, override cwnd to the minimum target */
		tp->snd_cwnd = min(tp->snd_cwnd, UCP_CWND_MIN_TARGET); /* Force cwnd to 4 packets during PROBE_RTT to drain the bottleneck queue */
	}
}
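
/*
 * Worked example of the target clamp (illustrative; assumes the inflight
 * gains are the documented 1.25x/2.0x, i.e. 320/512 in BBR_SCALE):
 *   bw = 0.05 pkt/us (~838,861 in BW_SCALE), min_rtt_us = 20,000
 *   bdp = 838,861 * 20,000 >> 24 ~ 1,000 packets
 *   lo  = 1,000 * 320 >> 8 = 1,250    hi = 1,000 * 512 >> 8 = 2,000
 *   A quantized target of 2,300 is clamped to 2,000; 900 is raised to 1,250.
 */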

/**
 * @brief Check whether the current PROBE_BW gain phase should advance to the next phase.
 * @param sk  The TCP socket
 * @param rs  The rate sample (contains prior_in_flight)
 * @return    true if the current phase is complete and the cycle should advance; false to stay in the current phase
 *
 * For phases with pacing_gain <= 1.0x (drain/cruise): advances when the phase has lasted at least min_rtt_us
 * (one full RTT of real time). This ensures enough time for the queue to drain.
 *
 * For phases with pacing_gain > 1.0x (probe): advances only when both:
 * 1. One full RTT has elapsed since the phase started
 * 2. Losses were detected, or the estimated inflight at EDT has reached the target inflight
 *    (meaning the probe has succeeded in raising inflight to gain * BDP)
 * This prevents advancing prematurely while the probe is still filling the pipe.
 */
static bool ucp_is_next_cycle_phase(struct sock *sk,
				    const struct rate_sample *rs)
{
	struct tcp_sock *tp = tcp_sk(sk);   /* Get the TCP socket's internal state for delivered_mstamp */
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
	bool is_full_length = tcp_stamp_us_delta(tp->delivered_mstamp, /* Compute time elapsed (in microseconds) since the current phase started */
				ucp_get_cycle_mstamp(ucp)) > ucp->min_rtt_us; /* Compare elapsed time to min_rtt_us: true if the phase has lasted at least one full RTT */

	if (ucp->pacing_gain <= BBR_UNIT) { /* Drain or cruise phase (gain <= 1.0x): only the time-based condition matters */
		return is_full_length; /* Advance when the phase has run for at least one min_rtt interval */
	}

	/* Probe phase (gain > 1.0x): end when the time condition is met AND (losses detected OR inflight has reached the probe target) */
	return is_full_length && /* Phase has lasted at least one min_rtt */
	       (rs->losses || /* Packet loss detected: the path may not hold gain * BDP worth of inflight, so stop pushing */
		ucp_packets_in_net_at_edt(sk, rs->prior_in_flight) >= /* Estimated packets in the network at EDT of next packet */
		ucp_inflight(sk, ucp_max_bw(sk), ucp->pacing_gain)); /* >= target inflight for the probe gain: the probe has raised inflight to its goal, so the phase can end */
}

/**
 * @brief Update the PROBE_BW gain cycle phase if conditions are met.
 * @param sk  The TCP socket
 * @param rs  The rate sample from the TCP stack
 *
 * Called from ucp_update_model() as part of the per-ACK model update pipeline.
 * Only acts when in PROBE_BW mode and the current phase completion conditions are satisfied.
 * When ucp_is_next_cycle_phase() returns true, advances to the next gain phase.
 */
static void ucp_update_cycle_phase(struct sock *sk,
				   const struct rate_sample *rs)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for mode check */
	if (ucp->mode == UCP_PROBE_BW && ucp_is_next_cycle_phase(sk, rs)) { /* Only advance if in PROBE_BW mode AND the phase completion condition is met */
		ucp_advance_cycle_phase(sk); /* Advance to the next phase in the 8-phase gain cycle */
	}
}

/**
 * @brief Reset the operating mode after DRAIN or PROBE_RTT: transition to STARTUP or PROBE_BW.
 * @param sk  The TCP socket
 *
 * Called when DRAIN completes or after PROBE_RTT exits. Determines the next mode:
 * - If full_bw_reached is false (pipe never filled): transition to STARTUP (re-enter exponential growth)
 * - If full_bw_reached is true (pipe was full before): transition to PROBE_BW (steady-state cycling)
 *   with a randomized initial cycle phase to desynchronize multiple connections
 */
static void ucp_reset_mode(struct sock *sk)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */

	if (!ucp_full_bw_reached(sk)) { /* The pipe was never confirmed full: return to STARTUP for exponential bandwidth search */
		ucp->mode = UCP_STARTUP; /* Switch to STARTUP mode: begin probing for bandwidth with high gain (~2.89x) */
	} else { /* The pipe was previously full: enter steady-state PROBE_BW cycling */
		ucp->mode = UCP_PROBE_BW; /* Switch to PROBE_BW mode: begin the 8-phase steady-state gain cycle */
		/* Randomize the starting cycle index to desynchronize multiple concurrent connections */
		ucp->cycle_idx = UCP_PROBE_BW_CYCLE_LEN - 1 - /* Start near the end of the cycle: 7 - random(0..6), so the immediate advance below lands on a randomized phase */
				 prandom_u32_max(UCP_PROBE_BW_CYCLE_RAND); /* prandom_u32_max(7) returns a value in 0..6 (UCP_PROBE_BW_CYCLE_RAND = 7) for phase start randomization */
		ucp_advance_cycle_phase(sk); /* Immediately advance to the next phase (likely phase 0 probe) to begin the steady-state cycle */
	}
}
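
/*
 * Example of the randomization arithmetic: prandom_u32_max(7) yields r in
 * 0..6, so cycle_idx is set to 7 - r in {1..7}; the immediate
 * ucp_advance_cycle_phase() then steps one phase forward, giving a start
 * phase in {2..7, 0} and desynchronizing concurrent flows.
 */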

/**
 * @brief Update the bottleneck bandwidth (BtlBw) estimate with a new delivery rate sample.
 * @param sk  The TCP socket
 * @param rs  The rate sample containing delivery rate and timestamp info
 *
 * Core bandwidth estimation pipeline:
 * 1. Detects packet-timed round boundaries using delivered counter progression
 * 2. Computes the delivery rate sample: delivered / interval_us in BW_SCALE
 * 3. Stores the sample in the 2-slot delivery rate history
 * 4. Updates the non-congested peak bandwidth (max_bw_non_congested)
 * 5. Applies adaptive bandwidth floor based on path class and loss conditions
 * 6. Updates the running-max bandwidth filter (minmax) with the (potentially floored) sample
 *
 * The adaptive bandwidth floor prevents the BtlBw estimate from dropping too low during
 * transient idle or application-limited periods. The floor is calculated as a percentage
 * of the peak non-congested bandwidth, and is only applied when loss is below the loss cap.
 */
static void ucp_update_bw(struct sock *sk, const struct rate_sample *rs)
{
	struct tcp_sock *tp = tcp_sk(sk);   /* Get the TCP socket's internal state for delivered counter */
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
	u64 bw; /* Computed bandwidth sample in BW_SCALE; may be modified by floor logic before being added to the max filter */

	ucp->round_start = 0; /* Clear the round_start flag: will be re-set below if this ACK completes a new round */
	/* Validate the rate sample: must have positive delivered count and positive interval */
	if (rs->delivered < 0 || rs->interval_us <= 0) {
		return; /* Invalid rate sample: skip bandwidth update (no data to compute a rate) */
	}

	/* Check for packet-timed round boundary: a round completes when prior_delivered passes next_rtt_delivered */
	if (unlikely(ucp->next_rtt_delivered == 0)) { /* First ACK: initialize next_rtt_delivered without counting a spurious round (matches BBR behavior) */
		ucp->next_rtt_delivered = tp->delivered; /* Set the first round boundary to current delivered count */
		ucp->round_start = 1; /* Signal round start for initialization; rtt_cnt remains 0 to avoid a fake round increment */
	} else if (!before(rs->prior_delivered, ucp->next_rtt_delivered)) { /* The ACK's prior_delivered >= next_rtt_delivered: this ACK completes a round */
		ucp->next_rtt_delivered = tp->delivered; /* Set the next round boundary to the current total delivered count */
		ucp->rtt_cnt++; /* Increment the round counter: one more packet-timed round has been completed */
		ucp->round_start = 1; /* Set the round_start flag and signal that a new round has begun; used by pacing fast-ramp bypass */
	}

	{ /* Compute the delivery rate sample: delivered packets / interval in microseconds, scaled to BW_SCALE */
		s64 sample = div64_long((u64)rs->delivered * BW_UNIT, /* Multiply delivered count by BW_UNIT (2^24) to convert to BW_SCALE */
					rs->interval_us); /* Divide by the sample interval (microseconds); result is bandwidth in BW_SCALE (packets/usec * 2^24) */
		ucp_add_delivery_rate_sample(sk, (u32)sample); /* Store the rate sample in the 2-slot circular history */
		bw = sample; /* Set bw to the raw computed sample; may be modified by floor logic below */
	}

	/* Only update the bandwidth filter if this sample is not application-limited, or if it exceeds the current max */
	if (!rs->is_app_limited || bw >= ucp_max_bw(sk)) { /* Skip rate samples taken during application-limited periods unless they set a new max */
		/* Update the peak non-congested bandwidth if in a non-congested condition */
		if (ucp->net_condition != UCP_COND_CONGESTED && /* The network is not currently classified as congested */
		    (u32)bw > ucp->max_bw_non_congested) { /* And this sample exceeds the previous peak non-congested bandwidth */
			ucp->max_bw_non_congested = (u32)bw; /* Update the non-congested peak bandwidth to this higher sample */
		}

		/* Apply the adaptive bandwidth floor (UCP-specific non-destructive constraint: disabled in BBR compatible mode) */
		if (ucp_bbr_mode == 0 && /* Bandwidth floor is a UCP-specific feature; BBR mode skips all floor logic */
		    ucp->net_condition != UCP_COND_CONGESTED && /* Only apply the bandwidth floor when not in a congested condition */
		    ucp->max_bw_non_congested > 0) { /* A non-congested peak bandwidth has been established */
			/* Per-class bandwidth floor selection: each path class has its own configurable parameter */
			u32 pct; /* Floor permyriad value (1/10000); 0 disables the floor for that class */
			switch (ucp->net_class) { /* Select the floor permyriad based on the current path classification */
			case UCP_CLASS_LAN:       /* LAN paths: low latency, minimal variation; floor typically disabled */
				pct = ucp_bw_floor_lan_val; /* Use the LAN-specific floor parameter (default 0 = disabled) */
				break;
			case UCP_CLASS_VPN:       /* VPN paths: encapsulation adds overhead but bandwidth is usually stable */
				pct = ucp_bw_floor_vpn_val; /* Use the VPN-specific floor parameter (default 0 = disabled) */
				break;
			case UCP_CLASS_MOBILE:    /* Mobile paths: high jitter and variable radio conditions */
				pct = ucp_bw_floor_mobile_val; /* Use the mobile floor parameter (default 2500 = 25%) */
				break;
			case UCP_CLASS_LOSSY_FAT: /* Lossy fat paths: background loss can depress bandwidth measurements */
				pct = ucp_bw_floor_lossy_fat_val; /* Use the LOSSY_FAT-specific floor parameter (default 2500 = 25%) */
				break;
			case UCP_CLASS_CONGESTED: /* Congested-class paths: floor only active briefly after real-time congestion clears */
				pct = ucp_bw_floor_congested_val; /* Use the CONGESTED-class floor parameter (default 2000 = 20%) */
				break;
			default:                  /* DEFAULT paths: general internet or unclassified */
				pct = ucp_bw_floor_default_val; /* Use the default floor parameter (default 2000 = 20%) */
				break;
			}

			/* Apply the floor only if the current loss level is below the loss cap threshold */
			if (pct && ucp_get_loss_ratio(sk) < ucp_bw_floor_loss_cap_val) { /* Floor is enabled (non-zero) and loss is below the loss cap (5%) */
				u64 floor_bw = (u64)ucp->max_bw_non_congested * pct / 10000; /* Compute floor: non-congested peak * permyriad / 10000 */
				if (bw < floor_bw) { /* The current bandwidth sample is below the floor: apply the floor */
					bw = floor_bw; /* Replace the bandwidth sample with the computed floor value to prevent BtlBw from dropping too low */
				}
			}
		}

		/* Update the running-maximum filter with the (possibly floored) bandwidth sample */
		minmax_running_max(&ucp->bw, UCP_BW_RTTS, ucp->rtt_cnt, (u32)bw); /* Add the sample to the minmax filter with window size of 10 packet-timed rounds */
	}
}
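
/*
 * Worked example of the sample and floor arithmetic (illustrative numbers):
 *   rs->delivered = 10 packets over rs->interval_us = 5,000 us gives
 *   sample = 10 * 2^24 / 5,000 ~ 33,554 in BW_SCALE (0.002 pkt/us).
 *   If the path is classed MOBILE (pct = 2500 = 25%), loss is under the 5%
 *   cap, and max_bw_non_congested = 335,544 (0.02 pkt/us), the floor is
 *   335,544 * 2500 / 10000 ~ 83,886, so this low sample is lifted to the
 *   floor before it enters the max filter.
 */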

/**
 * @brief Detect whether the STARTUP phase has filled the bottleneck pipe.
 * @param sk  The TCP socket
 * @param rs  The rate sample (used for is_app_limited check)
 *
 * Implements the BBR full-pipe detection: if bandwidth growth does not exceed 1.25x for
 * UCP_FULL_BW_CNT (3) consecutive packet-timed rounds, the pipe is declared full.
 * The full_bw snapshot is updated each time growth exceeds the 1.25x threshold,
 * resetting the counter. Application-limited rounds are excluded from the check.
 * When full_bw_reached is set, it triggers the STARTUP -> DRAIN transition.
 */
static void ucp_check_full_bw_reached(struct sock *sk,
				      const struct rate_sample *rs)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
	u32 bw_thresh; /* The 1.25x growth threshold: full_bw * 125% in BBR_SCALE */

	/* Skip check if the pipe is already declared full, or this is not a round start, or the sample is app-limited */
	if (ucp_full_bw_reached(sk) || !ucp->round_start || rs->is_app_limited) {
		return; /* No update needed: pipe already full, not a new round, or application-limited sample is unreliable */
	}

	bw_thresh = (u64)ucp->full_bw * UCP_FULL_BW_THRESH >> BBR_SCALE; /* Compute 1.25x threshold: full_bw * 320 / 256 (UCP_FULL_BW_THRESH = 320 BBR_SCALE = 1.25x) */
	if (ucp_max_bw(sk) >= bw_thresh) { /* Current max BW >= 1.25x of the last confirmed full_bw snapshot: bandwidth is still growing */
		ucp->full_bw = ucp_max_bw(sk); /* Update the full_bw snapshot to the current max bandwidth (new growth high-water mark) */
		ucp->full_bw_cnt = 0; /* Reset the non-growth counter: we just saw significant growth */
		return; /* Pipe is not yet full; continue STARTUP probing */
	}

	ucp->full_bw_cnt++; /* Increment the non-growth counter: this round did not achieve 1.25x growth */
	ucp->full_bw_reached = (ucp->full_bw_cnt >= UCP_FULL_BW_CNT); /* Set full_bw_reached if 3 consecutive rounds without 1.25x growth: pipe is considered full */
}
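
/*
 * Worked example (illustrative): with full_bw = 40,000 (BW_SCALE),
 * bw_thresh = 40,000 * 320 >> 8 = 50,000.  If max_bw reaches 52,000 the
 * snapshot moves up and the counter resets; three consecutive round starts
 * with max_bw below 50,000 set full_bw_reached and trigger STARTUP -> DRAIN.
 */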

/**
 * @brief Handle the STARTUP -> DRAIN and DRAIN -> PROBE_BW state transitions.
 * @param sk  The TCP socket
 * @param rs  The rate sample (currently unused by DRAIN completion check)
 *
 * Two transitions:
 * 1. STARTUP -> DRAIN: triggered when full_bw_reached is set. Sets the slow start threshold
 *    to the current inflight at 1.0x BDP and transitions to DRAIN mode.
 * 2. DRAIN -> PROBE_BW (or back to STARTUP if pipe was never filled): triggered when the
 *    estimated inflight at EDT drops to or below the target inflight at 1.0x BDP,
 *    indicating the excess queue from STARTUP has been fully drained.
 */
static void ucp_check_drain(struct sock *sk, const struct rate_sample *rs)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */

	/* STARTUP -> DRAIN transition: when the pipe is declared full */
	if (ucp->mode == UCP_STARTUP && ucp_full_bw_reached(sk)) { /* Currently in STARTUP AND pipe is now full */
		ucp->mode = UCP_DRAIN; /* Transition to DRAIN mode: begin draining the excess queue built during STARTUP */
		tcp_sk(sk)->snd_ssthresh = ucp_inflight(sk, ucp_max_bw(sk), /* Set slow start threshold to the current BDP at 1.0x gain */
							BBR_UNIT); /* ssthresh = current inflight at neutral gain (1.0x BDP); used as a target for the drain */
	}

	/* DRAIN completion check: transition out of DRAIN when queue is fully drained */
	if (ucp->mode == UCP_DRAIN && /* Currently in DRAIN mode */
	    ucp_packets_in_net_at_edt(sk, tcp_packets_in_flight(tcp_sk(sk))) <= /* Projected inflight at EDT */
	    ucp_inflight(sk, ucp_max_bw(sk), BBR_UNIT)) { /* <= target inflight at 1.0x BDP: the excess queue has been drained */
		ucp_reset_mode(sk); /* Exit DRAIN: transition to PROBE_BW (or re-enter STARTUP if pipe was never fully filled) */
	}
}

/**
 * @brief Exit PROBE_RTT mode after the dwell time has elapsed and restore normal operation.
 * @param sk  The TCP socket
 *
 * Called from the per-ACK path (via ucp_update_min_rtt) or from the TX_START event callback.
 * The exit condition is that probe_rtt_done_stamp is set and the current jiffies value is
 * past the done stamp. On exit:
 * 1. Updates min_rtt_stamp to now (so the next PROBE_RTT entry is properly timed)
 * 2. Restores cwnd to at least prior_cwnd (max of current cwnd and pre-PROBE_RTT window)
 * 3. Sets probe_rtt_restored flag (so ucp_set_cwnd applies the restoration)
 * 4. Calls ucp_reset_mode() to transition to STARTUP or PROBE_BW
 */
static void ucp_check_probe_rtt_done(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);   /* Get the TCP socket's internal state for snd_cwnd */
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */

	/* Check if PROBE_RTT exit conditions are met: done_stamp must be set AND current time must be past the done stamp */
	if (!ucp->probe_rtt_done_stamp || /* No probe_rtt_done_stamp set: not yet scheduled or already exited */
	    !after(tcp_jiffies32, ucp->probe_rtt_done_stamp)) { /* Current jiffies not yet past the done stamp: PROBE_RTT dwell time has not expired */
		return; /* Stay in PROBE_RTT: exit conditions not yet satisfied */
	}

	ucp->min_rtt_stamp = tcp_jiffies32; /* Update the min_rtt timestamp to now: this marks the end of the PROBE_RTT filter reset */
	tp->snd_cwnd = max(tp->snd_cwnd, ucp->prior_cwnd); /* Restore cwnd to at least the saved prior_cwnd (pre-PROBE_RTT window) */
	ucp->probe_rtt_restored = 1; /* Set flag: cwnd restoration from PROBE_RTT needs to be applied in ucp_set_cwnd() */

	ucp_reset_mode(sk); /* Reset the operating mode: transition to STARTUP (if pipe never full) or PROBE_BW (steady-state) */
}

/**
 * @brief Update the minimum RTT estimate, RTT history, and manage PROBE_RTT state transitions.
 * @param sk  The TCP socket
 * @param rs  The rate sample (contains rtt_us and is_ack_delayed)
 *
 * Comprehensive min_rtt update logic:
 * 1. Records whether the current ACK was delayed (affects RTT sample filtering)
 * 2. Checks if the min_rtt filter has expired (time to consider lowering min_rtt)
 * 3. Updates min_rtt if a lower sample is received (with fast-fall and sticky-floor logic)
 *    - If sample < 75% of current min_rtt: increments fast-fall counter; at 3 consecutive
 *      fast-fall samples, immediately updates min_rtt; otherwise gradually reduces by 25%
 *    - Otherwise: directly updates min_rtt to the lower sample
 * 4. SRTT guard: if smoothed RTT is significantly lower than min_rtt, updates min_rtt from SRTT
 * 5. Adds valid RTT sample to the history buffer via ucp_add_rtt_sample()
 * 6. Enters PROBE_RTT if the filter has expired and idle_restart is not set
 * 7. Manages the PROBE_RTT state: sets app_limited, schedules done_stamp, tracks round completion
 * 8. Clears idle_restart when data delivery resumes
 */
static void ucp_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
{
	struct tcp_sock *tp = tcp_sk(sk);   /* Get the TCP socket's internal state for srtt_us and delivered */
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
	bool filter_expired; /* True if the min_rtt measurement window has expired (enough time has passed since last update) */

	ucp->has_delayed_ack = rs->is_ack_delayed; /* Record whether this ACK was a delayed ACK; used later in RTT sample filtering */
	/* Check if the min_rtt filter has expired: current time past (min_rtt_stamp + probe_rtt_interval) */
	filter_expired = after(tcp_jiffies32,
		ucp->min_rtt_stamp + ucp_get_probe_rtt_interval(sk)); /* True if the adaptive PROBE_RTT interval has elapsed since the last min_rtt update */

	/* Update min_rtt if a lower RTT sample is available (with fast-fall protection) */
	if (rs->rtt_us >= 0 && /* Valid (non-negative) RTT sample in the rate sample */
	    (rs->rtt_us < ucp->min_rtt_us || /* The new RTT sample is strictly lower than current min_rtt OR */
	     (filter_expired && !rs->is_ack_delayed))) { /* The filter has expired AND this is not a delayed ACK (allow min_rtt to relax upward on expiry) */
		/* Check for fast-fall condition: the new RTT is below 75% of current min_rtt (significant drop) */
		if (ucp->min_rtt_us && /* Current min_rtt is established (non-zero) */
		    rs->rtt_us < ucp->min_rtt_us * UCP_MINRTT_STICKY_FLOOR_NUM / /* Compare: rtt < min_rtt * 3/4 */
				 UCP_MINRTT_STICKY_FLOOR_DEN) { /* 3/4 = 75% threshold for fast-fall detection */
			ucp->min_rtt_fast_fall_cnt++; /* Increment the fast-fall counter: another consecutive sample below 75% */
			if (ucp->min_rtt_fast_fall_cnt >= UCP_MINRTT_FAST_FALL_CNT) { /* 3 consecutive fast-fall samples: confirm the downward trend */
				ucp->min_rtt_us = rs->rtt_us; /* Directly update min_rtt to the new lower value after confirmation */
				ucp->min_rtt_fast_fall_cnt = 0; /* Reset the fast-fall counter for the next detection cycle */
			} else { /* Fast-fall accumulating but not yet confirmed: apply a gradual sticky floor reduction */
				ucp->min_rtt_us = ucp->min_rtt_us * /* Reduce min_rtt to 3/4 of its current value as a progressive step */
						  UCP_MINRTT_STICKY_FLOOR_NUM /
						  UCP_MINRTT_STICKY_FLOOR_DEN; /* min_rtt = min_rtt * 3/4 (partial reduction toward the new lower value) */
			}
		} else { /* Normal (non-fast-fall) update: the new sample is lower but not drastically so */
			ucp->min_rtt_us = rs->rtt_us; /* Directly update min_rtt to the new lower RTT sample */
			ucp->min_rtt_fast_fall_cnt = 0; /* Reset the fast-fall counter since this is not a fast-fall pattern */
		}
		ucp->min_rtt_stamp = tcp_jiffies32; /* Update the timestamp: min_rtt was just modified, start the expiry countdown from now */
	}

	/* SRTT guard: if the smoothed RTT is significantly below min_rtt, it means min_rtt is stale */
	if (tp->srtt_us && ucp->min_rtt_us && /* Both SRTT and min_rtt are available */
	    (tp->srtt_us >> 3) < ucp->min_rtt_us * /* Compare srtt (de-scaled) < min_rtt * 9/10: SRTT guard condition */
	     UCP_MINRTT_SRTT_GUARD_NUM / UCP_MINRTT_SRTT_GUARD_DEN) { /* 9/10 = 0.9: if srtt < 90% of min_rtt, min_rtt is too high */
		ucp->min_rtt_us = tp->srtt_us >> 3; /* Update min_rtt to the de-scaled smoothed RTT (srtt is 8x the actual value) */
		ucp->min_rtt_stamp = tcp_jiffies32; /* Update the timestamp to record when min_rtt was last changed */
	}

	/* Add the RTT sample to the 2-slot history buffer for P10 RTT estimation (if valid) */
	if (rs->rtt_us > 0) { /* Only process strictly positive RTT samples */
		ucp_add_rtt_sample(sk, rs->rtt_us); /* Add the filtered sample to the circular history buffer */
	}

	/* Check whether to enter PROBE_RTT mode: filter expired, no idle restart, not already in PROBE_RTT */
	if (UCP_PROBE_RTT_MODE_MS > 0 && filter_expired && /* PROBE_RTT is enabled (dwell time > 0) AND min_rtt filter window has expired */
	    !ucp->idle_restart && ucp->mode != UCP_PROBE_RTT) { /* Not restarting from idle AND not already in PROBE_RTT */
		ucp->mode = UCP_PROBE_RTT; /* Enter PROBE_RTT mode: cwnd will be clamped to 4 packets to drain queues */
		ucp_save_cwnd(sk); /* Save the current cwnd as prior_cwnd before PROBE_RTT clamps it */
		ucp->probe_rtt_done_stamp = 0; /* Defer: timer starts later when inflight <= 4 (queue drained), matching BBR behavior */
	}

	/* Manage the PROBE_RTT state: app_limited marking and exit timing */
	if (ucp->mode == UCP_PROBE_RTT) { /* Currently in PROBE_RTT mode */
		tp->app_limited = (tp->delivered + tcp_packets_in_flight(tp)) ? : 1; /* Mark everything up to delivered + inflight as app-limited (the "?: 1" guards the zero case) so samples taken during PROBE_RTT cannot drag down the bandwidth estimate */
		if (!ucp->probe_rtt_done_stamp) { /* No done_stamp scheduled yet: waiting for inflight to drain */
			if (tcp_packets_in_flight(tp) <= UCP_CWND_MIN_TARGET) { /* Inflight is at or below the PROBE_RTT min target (4 packets): queue drained, start timer */
				ucp->probe_rtt_done_stamp = tcp_jiffies32 + /* Set the done_stamp: start the PROBE_RTT dwell timer now */
					msecs_to_jiffies(UCP_PROBE_RTT_MODE_MS); /* Dwell for 200ms in the low-inflight state */
				ucp->probe_rtt_round_done = 0; /* Reset the round-done flag; will be set when a full RTT passes */
				ucp->next_rtt_delivered = tp->delivered; /* Set the next round boundary to detect a full RTT in PROBE_RTT */
			} else if (ucp->round_start) { /* Safety: one full round elapsed with cwnd=4; inflight should have drained; force-start timer to prevent hang */
				ucp->probe_rtt_done_stamp = tcp_jiffies32 +
					msecs_to_jiffies(UCP_PROBE_RTT_MODE_MS);
				ucp->probe_rtt_round_done = 0;
				ucp->next_rtt_delivered = tp->delivered;
			}
		} else { /* Done stamp is already set: track round completion for exit */
			if (ucp->round_start) { /* A new packet-timed round has started while in PROBE_RTT */
				ucp->probe_rtt_round_done = 1; /* Mark that at least one full round has completed during PROBE_RTT */
			}
			if (ucp->probe_rtt_round_done) { /* At least one full round has completed and the done_stamp time has been reached */
				ucp_check_probe_rtt_done(sk); /* Attempt to exit PROBE_RTT; will succeed if done_stamp time is past */
			}
		}
	}

	/* Clear idle_restart when data delivery resumes */
	if (rs->delivered > 0) { /* At least one packet was delivered (acked) in this rate sample */
		ucp->idle_restart = 0; /* Clear the idle restart flag: we have seen data delivery since the idle period ended */
	}
}
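
/*
 * Worked example of the fast-fall path (3/4 sticky floor, 3-sample confirm;
 * illustrative numbers):
 *   min_rtt = 40,000 us; sample 25,000 (< 30,000 = 75%): cnt=1, min_rtt -> 30,000
 *   sample 21,000 (< 22,500):                            cnt=2, min_rtt -> 22,500
 *   sample 16,000 (< 16,875):                            cnt=3, confirmed -> 16,000
 * A single outlier below 75% therefore moves the floor only one 3/4 step
 * instead of collapsing min_rtt immediately.
 */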

/**
 * @brief Update the ACK aggregation compensation state.
 * @param sk  The TCP socket
 * @param rs  The rate sample (contains acked_sacked and delivered_mstamp)
 *
 * Tracks excess ACK counts over RTT-scale epochs.  At the end of each epoch
 * (time > max(min_rtt_us, 1ms) since start), expected deliveries are computed
 * from the bandwidth estimate and compared to the observed acked count.
 * The 1ms floor prevents degenerate per-ACK epoch resets when min_rtt_us is
 * unrealistically small (e.g. early connection or low-latency loopback).
 * The excess is accumulated and the running max (extra_acked_max) decays to 3/4 of
 * its value each epoch, then is consumed by ucp_set_cwnd() to add a cwnd bonus.
 *
 * DIFFERENCES FROM GOOGLE BBR's extra_acked:
 *   BBR                | UCP
 *   -------------------|-------------------------------------------------
 *   u64 epoch mstamp   | u32 ack_epoch_start_us (low 32 bits only)
 *   u16 extra_acked[2] | u8 extra_acked + u8 extra_acked_max (single slot)
 *   dual-slot window   | single-slot exponential decay (x 0.75/epoch)
 *   fixed gain 1.0x    | ucp_extra_acked_gain (default 0 = off)
 *   ~16 bytes total    | 6 bytes total
 *
 * Space constraint: struct ucp must fit within 104 bytes (ICSK_CA_PRIV_SIZE).
 * A full BBR extra_acked implementation would require ~16 bytes, which is
 * not available.  The single-slot decay approximation converges to the same
 * steady state within 3-4 epochs and incurs <3% throughput penalty on most
 * internet paths.  The u8 saturation at 255 is rarely reached outside
 * datacenter environments.
 */
static void ucp_update_ack_aggregation(struct sock *sk,
				       const struct rate_sample *rs)
{
	struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket state for delivered_mstamp */
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
	u32 now_us, epoch_us, expected, extra; /* now_us = current microsecond timestamp; epoch_us = elapsed epoch time; expected = expected deliveries; extra = excess */

	if (!ucp_extra_acked_gain_val) { /* Compensation gain is 0 (disabled): skip all tracking */
		return;
	}
	if (rs->delivered < 0 || rs->interval_us <= 0) { /* Invalid rate sample: no delivery or interval data to base compensation on */
		return;
	}
	if (!ucp_max_bw(sk)) { /* No bandwidth estimate yet (early connection): cannot compute expected deliveries; skip to prevent false excess */
		return;
	}
	if (!ucp->min_rtt_us) { /* min_rtt not yet established: epoch boundary detection (epoch_us > min_rtt_us) would be degenerate; skip compensation until a valid baseline RTT exists */
		return;
	}

	now_us = (u32)tp->delivered_mstamp; /* Lower 32 bits of the 64-bit microsecond delivery timestamp */
	if (ucp->ack_epoch_start_us == 0) { /* First ACK or epoch just ended: initialize a new epoch */
		ucp->ack_epoch_start_us = now_us; /* Record the start of this epoch */
		ucp->extra_acked = 0; /* Reset the cumulative excess counter for the new epoch */
		return; /* No comparison to make on the first sample of an epoch */
	}

	epoch_us = now_us - ucp->ack_epoch_start_us; /* Elapsed time since epoch start (handles 32-bit wraparound automatically) */

	if (epoch_us > max(ucp->min_rtt_us, (u32)UCP_ACK_EPOCH_MIN_US)) { /* Epoch has exceeded max(min_rtt, 1ms): time to close the window and compute excess; the 1ms floor prevents degenerate per-ACK resets when min_rtt_us is unrealistically small */
		expected = ((u64)ucp_max_bw(sk) * epoch_us) >> BW_SCALE; /* Expected deliveries = bw (pkts/usec * 2^24) * elapsed_us >> 24 */
		extra = (ucp->extra_acked > expected) ? /* If observed exceeds expected, compute the surplus */
			(ucp->extra_acked - expected) : 0; /* Excess = observed - expected, or 0 if observed <= expected */
		ucp->extra_acked_max = max((u8)((u32)ucp->extra_acked_max * UCP_ACK_EPOCH_DECAY_NUM / UCP_ACK_EPOCH_DECAY_DEN), /* Decay the previous max to 3/4 of its value (exponential forgetting) */
					   (u8)(extra > UCP_U8_MAX ? UCP_U8_MAX : extra)); /* New max is at least the current epoch's excess (saturated to u8) */
		/* Start a new epoch with the current ACK's acked count */
		ucp->ack_epoch_start_us = now_us; /* Reset epoch start to now */
		ucp->extra_acked = (u8)(rs->acked_sacked > UCP_U8_MAX ? UCP_U8_MAX : rs->acked_sacked); /* Seed the new epoch with this ACK's acked count (saturated) */
	} else {
		/* Still within the current epoch: accumulate the excess count */
		ucp->extra_acked = (u8)min_t(u32, UCP_U8_MAX, /* Accumulate and saturate to u8 */
					     (u32)ucp->extra_acked + rs->acked_sacked);
	}
}
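
/*
 * Worked example of an epoch close (illustrative numbers): with
 * max_bw ~ 167,772 (0.01 pkt/us) and epoch_us = 10,000, expected deliveries
 * are 167,772 * 10,000 >> 24 ~ 100 packets.  If 130 packets were acked in
 * the epoch, extra = 30; with a previous extra_acked_max of 48, the new max
 * is max(48 * 3 / 4, 30) = 36 (the decay dominates here).
 */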

/**
 * @brief Return the standard steady-state cwnd gain value.
 * @param sk  The TCP socket (unused in this function, required for API consistency)
 * @return    The precomputed steady-state cwnd gain in BBR_SCALE (default 512 = 2.0x)
 */
static u32 ucp_get_cwnd_gain(const struct sock *sk)
{
	(void)sk; /* Suppress unused parameter warning; parameter required by calling convention */
	return ucp_cwnd_gain_val; /* Return the precomputed steady-state cwnd gain value (2.0x BDP default) */
}

/**
 * @brief Run the complete UCP estimation pipeline for the current ACK.
 * @param sk  The TCP socket
 * @param rs  The rate sample from the TCP stack
 *
 * This is the main model update function, called once per ACK processing cycle.
 * The pipeline executes sub-updates in dependency order:
 * 1. ucp_update_bw() - Update bandwidth estimate and round tracking
 * 2. ucp_update_loss_ewma() - Update loss EWMA
 * 2a. ucp_update_ack_aggregation() - Track excess acked counts for ACK aggregation compensation
 * 3. ucp_update_net_condition() - Classify network condition (IDLE/LIGHT_LOAD/CONGESTED/RANDOM_LOSS)
 * 4. ucp_update_net_class() - Classify path type (LAN/MOBILE/LOSSY_FAT/CONGESTED/VPN/DEFAULT)
 * 5. ucp_update_cycle_phase() - Advance PROBE_BW gain cycle if phase conditions met
 * 6. ucp_check_full_bw_reached() - Detect STARTUP pipe-full condition
 * 7. ucp_check_drain() - Handle STARTUP/DRAIN/PROBE_BW transitions
 * 8. ucp_update_min_rtt() - Update min RTT and manage PROBE_RTT state
 * 9. Set mode-specific gains: STARTUP uses UCP_HIGH_GAIN, DRAIN uses UCP_DRAIN_GAIN,
 *    PROBE_BW uses the cycle-based cwnd gain, PROBE_RTT uses neutral gain
 */
static void ucp_update_model(struct sock *sk, const struct rate_sample *rs)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */

	ucp_update_bw(sk, rs);            /* Step 1: Update BtlBw estimate from current rate sample (with bandwidth floor) */
	ucp_update_loss_ewma(sk, rs);     /* Step 2: Update loss EWMA from current sample's acked and lost counts */

	ucp_update_ack_aggregation(sk, rs); /* Step 2a: Update ACK aggregation compensation (tracks excess acked counts per RTT epoch) */

	ucp_update_net_condition(sk, rs); /* Step 3: Classify network condition using rate trend, loss, ECN, and RTT rise */
	ucp_update_net_class(sk);         /* Step 4: Classify network path class using RTT stats and loss */
	ucp_update_cycle_phase(sk, rs);   /* Step 5: Advance PROBE_BW cycle phase if timing/inflight conditions are met */
	ucp_check_full_bw_reached(sk, rs);/* Step 6: Check if STARTUP bandwidth growth has stalled (3 rounds < 1.25x) */
	ucp_check_drain(sk, rs);          /* Step 7: Check for STARTUP->DRAIN and DRAIN->PROBE_BW transitions */
	ucp_update_min_rtt(sk, rs);       /* Step 8: Update min_rtt, RTT history, and manage PROBE_RTT enter/exit */

	/* Step 9: Set mode-specific base gains for pacing and cwnd */
	switch (ucp->mode) { /* Set default gains based on the current operating mode; these may be further modified by constraints */
		case UCP_STARTUP: /* Exponential bandwidth search phase: use high gain for both pacing and cwnd */
			ucp->pacing_gain = UCP_HIGH_GAIN; /* Set pacing gain to UCP_HIGH_GAIN (~2.885x BBR_SCALE) for aggressive bandwidth probing */
			ucp->cwnd_gain   = UCP_HIGH_GAIN; /* Set cwnd gain to UCP_HIGH_GAIN (~2.885x BBR_SCALE) for aggressive window growth */
			break; /* Exit the switch after setting STARTUP gains */
		case UCP_DRAIN: /* Queue drain phase: drain pacing (low gain), keep high cwnd gain for window restoration */
			ucp->pacing_gain = UCP_DRAIN_GAIN; /* Set pacing gain to UCP_DRAIN_GAIN (~0.346x BBR_SCALE) to drain the queue rapidly */
			ucp->cwnd_gain   = UCP_HIGH_GAIN; /* Keep cwnd gain at UCP_HIGH_GAIN to maintain the large window while pacing drains */
			break; /* Exit the switch after setting DRAIN gains */
		case UCP_PROBE_BW: /* Steady-state bandwidth probing: use the standard cwnd gain (from current phase) */
			ucp->cwnd_gain = ucp_get_cwnd_gain(sk); /* Set cwnd gain to the standard steady-state value (2.0x BDP default); pacing_gain was set by cycle phase logic */
			break; /* Exit the switch after setting PROBE_BW cwnd gain (pacing_gain is already managed by cycle phase) */
		case UCP_PROBE_RTT: /* Min RTT refresh: use neutral gain for both pacing and cwnd */
			ucp->pacing_gain = BBR_UNIT; /* Set pacing gain to 1.0x (neutral): no probe, no drain; just send at estimated BtlBw */
			ucp->cwnd_gain   = BBR_UNIT; /* Set cwnd gain to 1.0x (neutral); actual cwnd will be forced to 4 by ucp_set_cwnd() */
			break; /* Exit the switch after setting PROBE_RTT gains */
	}
}

/**
 * @brief Per-ACK entry point for the UCP congestion control algorithm.
 * @param sk  The TCP socket
 * @param rs  The rate sample from the TCP stack (contains delivery rate, RTT, loss, etc.)
 *
 * This is the main callback registered in struct tcp_congestion_ops as .cong_control.
 * Called by the TCP stack on every ACK that includes a rate sample. The execution order:
 * 1. Run the full model update pipeline (ucp_update_model)
 * 2. Apply any queued one-shot drain constraints (ucp_apply_pacing_constraints)
 * 3. Apply cwnd ceiling and STARTUP loss caps (ucp_apply_cwnd_constraints)
 * 4. Get the current max bandwidth and set the pacing rate with the current pacing gain
 * 5. Clear the fast_recovery flag at the start of a new round if recovering (handles recovery exit)
 * 6. Compute and set the new congestion window
 */
static void ucp_main(struct sock *sk, const struct rate_sample *rs)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
	u32 bw; /* Current maximum bandwidth (BtlBw) from the minmax filter, used for pacing rate and cwnd calculation */

	ucp_update_model(sk, rs); /* Step 1: Run the full estimation pipeline (BW, loss, conditions, class, phase, min_rtt, mode gains) */

	ucp_apply_pacing_constraints(sk); /* Step 2: Apply any one-shot drain queued by advance_cycle_phase (overrides pacing_gain if drain_pending) */
	ucp_apply_cwnd_constraints(sk);   /* Step 3: Apply CONGESTED cwnd caps and STARTUP loss-based gain limits */

	bw = ucp_max_bw(sk); /* Step 4: Get the current max bottleneck bandwidth estimate from the minmax filter */
	ucp_set_pacing_rate(sk, bw, ucp->pacing_gain); /* Set the socket's pacing rate = bw * pacing_gain (with EWMA smoothing on increases) */

	if (ucp->fast_recovery && ucp->round_start) { /* Step 5: If in fast recovery and a new round has started, recovery is effectively done */
		ucp->fast_recovery = 0; /* Clear the fast recovery flag: the recovery window has been applied and a new round has begun */
	}

	ucp_set_cwnd(sk, rs, rs->acked_sacked, bw, ucp->cwnd_gain); /* Step 6: Compute and apply the new congestion window = BDP * cwnd_gain + quantization */
}

/* ---- Module callbacks (registered in tcp_congestion_ops) ---------------- */

/**
 * @brief Initialize the per-connection UCP congestion control state.
 * @param sk  The TCP socket
 *
 * Called when a new TCP connection adopts the UCP congestion control algorithm.
 * Initializes: all state to zero (via memset), prev_ca_state to TCP_CA_Open,
 * min_rtt_us from the TCP stack's min RTT estimate, min_rtt_stamp to now,
 * and the initial pacing rate from cwnd/SRTT. Also requests kernel pacing support.
 */
static void ucp_init(struct sock *sk)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state area (pre-allocated by inet_csk_ca) */

	memset(ucp, 0, sizeof(*ucp)); /* Zero-initialize the entire UCP state structure to ensure deterministic startup */
	ucp->rate_change_ewma = (s32)0x80000000; /* Sentinel: marks uninitialized, distinct from any valid EWMA value (INT_MIN) */
	ucp->prev_ca_state = TCP_CA_Open; /* Set the initial previous CA state to Open (no prior recovery state) */
	ucp->min_rtt_us = tcp_min_rtt(tcp_sk(sk)); /* Initialize min_rtt from the TCP stack's minimum observed RTT (if available) */
	ucp->min_rtt_stamp = tcp_jiffies32; /* Set the min_rtt timestamp to the current jiffies to start the filter lifetime from now */
	ucp_init_pacing_rate_from_rtt(sk); /* Compute and set the initial pacing rate from current cwnd and SRTT using high STARTUP gain */
	cmpxchg(&sk->sk_pacing_status, SK_PACING_NONE, SK_PACING_NEEDED); /* Atomic compare-and-exchange to request kernel-side pacing support if not already enabled */
}

/**
 * @brief Return the send buffer expansion factor for UCP connections.
 * @param sk  The TCP socket (unused)
 * @return    Buffer expansion multiplier: 3x the default socket buffer
 *
 * Called by the TCP stack to determine how much to expand the socket's send buffer
 * beyond the default. UCP recommends 3x to accommodate the larger cwnd variations
 * and TSO burst requirements of a BBR-derived algorithm.
 */
static u32 ucp_sndbuf_expand(struct sock *sk)
{
	(void)sk; /* Suppress unused parameter warning; parameter required by callback API */
	return 3; /* Return expansion factor of 3x: recommend send buffer be 3x the default size */
}

/**
 * @brief Reset full_bw tracking on cwnd undo (after spurious loss recovery).
 * @param sk  The TCP socket
 * @return    The current cwnd to use after undo (unchanged from TCP stack's value)
 *
 * Called when the TCP stack detects that a loss recovery was spurious (e.g., due to
 * packet reordering or a false loss indication) and undoes the cwnd reduction.
 * Resets the full_bw tracking state so that bandwidth growth detection restarts
 * fresh after the undo event.
 */
static u32 ucp_undo_cwnd(struct sock *sk)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
	ucp->full_bw = 0; /* Reset the full_bw snapshot: bandwidth growth tracking will restart from zero */
	ucp->full_bw_cnt = 0; /* Reset the non-growth round counter: fresh start for full-pipe detection */
	ucp->full_bw_reached = 0; /* Clear full_bw_reached to force a fresh STARTUP bandwidth probe after spurious loss recovery undo */
	return tcp_sk(sk)->snd_cwnd; /* Return the current cwnd (which the TCP stack may have already restored during undo) */
}

/**
 * @brief Save current cwnd and return slow start threshold for loss recovery entry.
 * @param sk  The TCP socket
 * @return    The slow start threshold value (from the TCP socket's snd_ssthresh)
 *
 * Called by the TCP stack when entering loss recovery to get the slow start threshold.
 * Before returning, saves the current cwnd via ucp_save_cwnd() so that the pre-recovery
 * window can be restored when recovery exits.
 */
static u32 ucp_ssthresh(struct sock *sk)
{
	ucp_save_cwnd(sk); /* Save the current cwnd as prior_cwnd before recovery reduces it */
	return tcp_sk(sk)->snd_ssthresh; /* Return the current slow start threshold (may have been set by ucp_check_drain or default) */
}

/* ---- Diagnostic encoding for ss -i output ------------------------------ */
#define UCP_DIAG_MARKER       0x80000000U /* High bit marker to indicate UCP-specific diagnostic data in bbr_bw_hi field */
#define UCP_DIAG_CLASS_SHIFT  28          /* Shift offset within bbr_bw_hi for the net_class field (bits 28-30) */
#define UCP_DIAG_COND_SHIFT   26          /* Shift offset within bbr_bw_hi for the net_condition field (bits 26-27) */
#define UCP_DIAG_DRAIN_SHIFT  24          /* Shift offset within bbr_bw_hi for the drain_pending field (bits 24-25) */
#define UCP_DIAG_LOSS_SHIFT   16          /* Shift offset within bbr_bw_hi for the loss_ewma field (bits 16-23, 8-bit 0-255; a loss ratio of 256 = 100% in BBR_SCALE saturates to 0xFF, the closest displayable value) */

/**
 * @brief Return diagnostic information for the ss -i tool via INET_DIAG_BBRINFO.
 * @param sk    The TCP socket
 * @param ext   The diagnostic extension bitmask (checked for BBRINFO or VEGASINFO)
 * @param attr  Output parameter: set to INET_DIAG_BBRINFO to indicate which info is returned
 * @param info  Output union: filled with UCP state mapped to the bbr structure
 * @return      Size of the diagnostic structure (sizeof bbr), or 0 if no relevant extension requested
 *
 * Maps UCP internal state into the standard BBR diagnostic structure for display by ss -i.
 * When bandwidth fits in 32 bits, the bbr_bw_hi field is repurposed to encode UCP-specific
 * state (net_class, net_condition, drain_pending, loss_ewma) using bit fields with a marker
 * bit (0x80000000) to distinguish UCP data from actual high 32 bits of bandwidth.
 */
static size_t ucp_get_info(struct sock *sk, u32 ext, int *attr,
			   union tcp_cc_info *info)
{
	if (ext & (1 << (INET_DIAG_BBRINFO - 1)) || /* Check if the BBRINFO diagnostic extension is requested */
	    ext & (1 << (INET_DIAG_VEGASINFO - 1))) { /* Also accept VEGASINFO as some tools request it interchangeably */
		struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for MSS cache */
		struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
		u64 bw = (u64)ucp_max_bw(sk) * tp->mss_cache * /* Convert max BW from BW_SCALE units to bytes per second for diagnostic display; cast to u64 before multiplying to prevent 32-bit overflow on high-rate links */
			 USEC_PER_SEC >> BW_SCALE; /* bw = max_bw (pkts/usec) * MSS (bytes) * 1,000,000 (usec/sec) >> 24 = bytes/sec */

		memset(&info->bbr, 0, sizeof(info->bbr)); /* Zero-initialize the diagnostic structure before populating */
		info->bbr.bbr_bw_lo       = (u32)bw; /* Store the lower 32 bits of the bandwidth value (bytes/sec) */
		info->bbr.bbr_bw_hi       = (u32)(bw >> 32); /* Store the upper 32 bits of the bandwidth value (for high-speed links) */
		info->bbr.bbr_min_rtt     = ucp->min_rtt_us; /* Report the current minimum RTT estimate in microseconds */
		info->bbr.bbr_pacing_gain = ucp->pacing_gain; /* Report the current pacing gain (BBR_SCALE) */
		info->bbr.bbr_cwnd_gain   = ucp->cwnd_gain; /* Report the current cwnd gain (BBR_SCALE) */

		if (bw <= U32_MAX) { /* If bandwidth fits in 32 bits, bbr_bw_hi is free for UCP-specific diagnostic encoding */
			info->bbr.bbr_bw_hi = UCP_DIAG_MARKER | /* Set the high bit marker to identify UCP-encoded diagnostics */
				((u32)ucp->net_class << UCP_DIAG_CLASS_SHIFT) | /* Encode net_class at bits 28-30 */
				((u32)ucp->net_condition << UCP_DIAG_COND_SHIFT) | /* Encode net_condition at bits 26-27 */
				((u32)ucp->drain_pending << UCP_DIAG_DRAIN_SHIFT) | /* Encode drain_pending at bits 24-25 */
				((u32)min_t(u16, ucp_get_loss_ratio(sk), UCP_U8_MAX) /* Full loss EWMA (0..BBR_UNIT) reconstructed via the getter, then saturated to 8 bits (UCP_U8_MAX = 255, just below the 100% value of BBR_UNIT = 256) */
				 << UCP_DIAG_LOSS_SHIFT); /* Encode the 8-bit loss value at bits 16-23 for ss -i display */
		}

		*attr = INET_DIAG_BBRINFO; /* Set the output attribute type to BBRINFO to tell the stack what we filled */
		return sizeof(info->bbr); /* Return the size of the bbr info structure to indicate data was provided */
	}
	return 0; /* No matching extension requested: return 0 to indicate no data was filled */
}
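
#if 0	/* Illustrative userspace-side decoder (not compiled into the module):
	 * a sketch of how a consumer of INET_DIAG_BBRINFO might unpack the
	 * repurposed bbr_bw_hi field when the UCP_DIAG_MARKER bit is set.
	 * The function name is hypothetical; the shift/mask values mirror
	 * the UCP_DIAG_* defines above. */
#include <stdint.h>
#include <stdio.h>

static void ucp_decode_diag_bw_hi(uint32_t bw_hi)
{
	if (!(bw_hi & 0x80000000U)) {		/* No marker: genuine high 32 bits of bandwidth */
		printf("bw_hi (raw high bits) = %u\n", bw_hi);
		return;
	}
	printf("net_class=%u net_condition=%u drain_pending=%u loss_ewma=%u/256\n",
	       (bw_hi >> 28) & 0x7,		/* bits 28-30: net_class (bit 31 is the marker) */
	       (bw_hi >> 26) & 0x3,		/* bits 26-27: net_condition */
	       (bw_hi >> 24) & 0x3,		/* bits 24-25: drain_pending level */
	       (bw_hi >> 16) & 0xFF);		/* bits 16-23: loss_ewma, saturated at 255 */
}
#endif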

/**
 * @brief Handle TCP state changes (set_state callback).
 * @param sk         The TCP socket
 * @param new_state  The new TCP CA state (TCP_CA_Loss, TCP_CA_Recovery, etc.)
 *
 * Called by the TCP stack when the congestion algorithm state changes.
 * On entry to TCP_CA_Loss (RTO or retransmit timeout), resets the full_bw
 * and full_bw_reached flags to restart bandwidth growth detection, sets
 * round_start to trigger a fresh round, and clears fast_recovery.
 * Other state transitions are handled in ucp_set_cwnd_to_recover_or_restore.
 */
static void ucp_set_state(struct sock *sk, u8 new_state)
{
	struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */

	if (new_state == TCP_CA_Loss) { /* Entering TCP loss state (RTO or retransmission timeout) */
		ucp->full_bw = 0; /* Reset the full_bw snapshot: bandwidth growth tracking starts over after a timeout */
		ucp->full_bw_cnt = 0; /* Reset the non-growth round counter: fresh start after timeout */
		ucp->full_bw_reached = 0; /* Clear the full_bw_reached flag: pipe may no longer be full after a timeout */
		ucp->round_start = 1; /* Set round_start to force a new round boundary on the next ACK */
		ucp->fast_recovery = 0; /* Clear fast_recovery flag: recovery has been superseded by the loss event */
	}
}

/* ---- Register/unregister the congestion-control operations structure ---- */

/**
 * @brief UCP congestion control operations table.
 *
 * This structure is registered with the TCP stack via tcp_register_congestion_control().
 * It defines the callbacks that the TCP stack invokes at various points in the
 * connection lifecycle and per-ACK processing. The .cong_control callback (ucp_main)
 * is the primary per-ACK entry point for model updates and cwnd/pacing control.
 * The .flags field marks this algorithm as NON_RESTRICTED, allowing it to be used
 * with any TCP socket regardless of namespace restrictions.
 */
static struct tcp_congestion_ops tcp_ucp_cong_ops __read_mostly = {
	.flags          = TCP_CONG_NON_RESTRICTED, /* Allow this CC to be used in any network namespace without restriction */
	.name           = "ucp",                   /* String identifier used by the TCP stack and exposed via tcp_congestion_control sysctl */
	.owner          = THIS_MODULE,              /* Kernel module ownership: ties the ops lifetime to the module reference count */
	.init           = ucp_init,                 /* Callback: initialize per-connection UCP state when a connection adopts this CC */
	.cong_control   = ucp_main,                 /* Callback: per-ACK model update and control (the main UCP algorithm entry point) */
	.sndbuf_expand  = ucp_sndbuf_expand,        /* Callback: return the send buffer expansion multiplier (3x recommended) */
	.undo_cwnd      = ucp_undo_cwnd,            /* Callback: reset full_bw tracking after a spurious loss recovery undo */
	.cwnd_event     = ucp_cwnd_event,           /* Callback: handle congestion events (TX_START after idle, etc.) */
	.ssthresh       = ucp_ssthresh,             /* Callback: return slow start threshold on loss recovery entry; saves cwnd first */
	.min_tso_segs   = ucp_min_tso_segs,         /* Callback: return minimum TSO segments for the current pacing rate */
	.get_info       = ucp_get_info,             /* Callback: return diagnostic info for ss -i tool (maps to INET_DIAG_BBRINFO) */
	.set_state      = ucp_set_state,            /* Callback: handle TCP CA state changes (Loss, Recovery, etc.) */
};

/* ---- Sysctl interface for runtime tuning via sysctl -w / /etc/sysctl.conf ---- */
static struct ctl_table_header *ucp_ctl_header; /* Opaque handle returned by register_sysctl(); stored for cleanup in ucp_unregister */

/**
 * @brief Shared sysctl proc handler: delegates to proc_dointvec, then refreshes cached values on write.
 * @param ctl    Pointer to the sysctl table entry being accessed
 * @param write  1 = write operation (user is setting a value), 0 = read operation (e.g. sysctl -a)
 * @param buffer User-space buffer for the value being read or written
 * @param lenp   Pointer to buffer length; updated with bytes actually transferred
 * @param ppos   Current file position offset for partial reads/writes
 * @return       0 on success, negative errno on failure
 *
 * Every successful write triggers ucp_init_module_params() to recompute internal BBR_SCALE
 * and jiffies cached values, ensuring sysctl -w takes effect immediately on the next ACK.
 */
static int ucp_proc_handler(struct ctl_table *ctl, int write,
			    void *buffer, size_t *lenp, loff_t *ppos)
{
	/* Delegate to the standard int-vector proc handler; it parses the user string,
	 * writes the parsed int to *ctl->data, and returns 0 on success */
	int ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
	/* If this was a successful write, recompute all cached derived values so the
	 * new parameter takes effect on the very next ACK (no module reload needed) */
	if (write && ret == 0)
		ucp_init_module_params();
	return ret; /* Propagate the return value to the sysctl infrastructure */
}
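
#if 0	/* Illustrative sketch only -- the real ucp_init_module_params() is
	 * defined earlier in this file. This shows the kind of conversion the
	 * comments above describe: permyriad (1/10000) sysctl values into
	 * BBR_SCALE fixed point, and raw seconds into jiffies. The helper
	 * names are hypothetical. */
static u32 example_permyriad_to_fixed(int permyriad)
{
	/* 10000 permyriad == 1.0x == BBR_UNIT (1 << BBR_SCALE) */
	return (u32)div_u64((u64)permyriad * BBR_UNIT, 10000);
}

static u32 example_sec_to_jiffies(int sec)
{
	return (u32)sec * HZ;			/* HZ scheduler ticks per second */
}
#endif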

/* Sysctl table registered under /proc/sys/net/ucp/.
 * Each entry maps a sysctl name (e.g. "ucp_extra_acked_gain") to one of the module's
 * static int variables.  All 33 entries share the same ucp_proc_handler which triggers
 * ucp_init_module_params() on every write, making "sysctl -w net.ucp.XXX=YYY" dynamic. */
static struct ctl_table ucp_ctl_table[] = {
	/* UCP/BBR operation mode selector */
	{ .procname = "ucp_bbr_mode",                .data = &ucp_bbr_mode,                .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	/* Bandwidth floor per path class (permyriad of non-congested peak BDP) */
	{ .procname = "ucp_bw_floor_default",        .data = &ucp_bw_floor_default,        .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_bw_floor_mobile",         .data = &ucp_bw_floor_mobile,         .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_bw_floor_lan",            .data = &ucp_bw_floor_lan,            .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_bw_floor_vpn",            .data = &ucp_bw_floor_vpn,            .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_bw_floor_lossy_fat",      .data = &ucp_bw_floor_lossy_fat,      .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_bw_floor_congested",      .data = &ucp_bw_floor_congested,      .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_bw_floor_loss_cap",       .data = &ucp_bw_floor_loss_cap,       .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	/* PROBE_RTT interval tuning (raw seconds; converted to jiffies in ucp_init_module_params) */
	{ .procname = "ucp_probe_rtt_base_sec",      .data = &ucp_probe_rtt_base_sec,      .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_probe_rtt_class_extra_sec", .data = &ucp_probe_rtt_class_extra_sec, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_probe_rtt_high_loss_extra_sec", .data = &ucp_probe_rtt_high_loss_extra_sec, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_probe_rtt_mid_loss_extra_sec", .data = &ucp_probe_rtt_mid_loss_extra_sec, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_probe_rtt_max_sec",       .data = &ucp_probe_rtt_max_sec,       .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	/* Congestion window gain and graduated caps (permyriad) */
	{ .procname = "ucp_cwnd_gain",               .data = &ucp_cwnd_gain,               .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_cwnd_cap_mild",           .data = &ucp_cwnd_cap_mild,           .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_cwnd_cap_moderate",       .data = &ucp_cwnd_cap_moderate,       .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_cwnd_cap_severe",         .data = &ucp_cwnd_cap_severe,         .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	/* ACK aggregation compensation (permyriad, 0=disabled, 10000=1.0x BBR standard) */
	{ .procname = "ucp_extra_acked_gain",        .data = &ucp_extra_acked_gain,        .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	/* PROBE_BW probe pacing gain and mobile override (permyriad) */
	{ .procname = "ucp_probe_gain",              .data = &ucp_probe_gain,              .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_probe_gain_mobile",       .data = &ucp_probe_gain_mobile,       .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	/* Early drain trigger and drain gain levels (permyriad) */
	{ .procname = "ucp_drain_loss_thresh",       .data = &ucp_drain_loss_thresh,       .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_drain_gain_light",        .data = &ucp_drain_gain_light,        .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_drain_gain_standard",     .data = &ucp_drain_gain_standard,     .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_drain_gain_aggressive",   .data = &ucp_drain_gain_aggressive,   .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_drain_loss_lvl2_thresh",  .data = &ucp_drain_loss_lvl2_thresh,  .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_drain_loss_lvl3_thresh",  .data = &ucp_drain_loss_lvl3_thresh,  .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	/* General loss and RTT thresholds for condition/class classifiers (permyriad) */
	{ .procname = "ucp_low_loss_thresh",         .data = &ucp_low_loss_thresh,         .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_high_loss_thresh",        .data = &ucp_high_loss_thresh,        .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_probe_skip_loss_thresh",  .data = &ucp_probe_skip_loss_thresh,  .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_probe_skip_rtt_rise",     .data = &ucp_probe_skip_rtt_rise,     .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	/* STARTUP loss-based gain reduction thresholds and reduced gain value (permyriad) */
	{ .procname = "ucp_startup_soft_drain_thresh", .data = &ucp_startup_soft_drain_thresh, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_startup_hard_cap_thresh", .data = &ucp_startup_hard_cap_thresh, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ .procname = "ucp_startup_soft_gain",       .data = &ucp_startup_soft_gain,       .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
	{ } /* Sentinel: empty entry marks the end of the table (required by register_sysctl) */
};
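
#if 0	/* Illustrative userspace snippet (not part of the module): writing a
	 * tunable through /proc/sys/net/ucp/, equivalent to running
	 * "sysctl -w net.ucp.ucp_probe_gain=12500". The helper name is
	 * hypothetical; writing requires root (entries are mode 0644). */
#include <stdio.h>

static int ucp_write_tunable(const char *name, int value)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/sys/net/ucp/%s", name);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", value);	/* proc_dointvec parses this decimal string */
	return fclose(f);		/* ucp_proc_handler refreshes cached values on the write */
}
#endif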

/**
 * @brief Module initialization function.
 * @return 0 on success, negative error code if registration fails
 *
 * Called when the module is loaded (insmod/modprobe). Performs:
 * 1. Compile-time check that struct ucp fits within ICSK_CA_PRIV_SIZE (104 bytes)
 * 2. Pre-compute all module parameter derived values (permyriad to BBR_SCALE, seconds to jiffies)
 * 3. Register sysctl table under /proc/sys/net/ucp/ for sysctl -w support
 * 4. Register the UCP congestion control algorithm with the TCP stack
 */
static int __init ucp_register(void)
{
	int ret; /* Return value from tcp_register_congestion_control, propagated to module init */

	/* ASSERT: struct ucp must fit in 104 bytes (ICSK_CA_PRIV_SIZE).
	 * If this BUILD_BUG_ON fires, you need to shrink the struct or enlarge the kernel's ca_priv area */
	BUILD_BUG_ON(sizeof(struct ucp) > ICSK_CA_PRIV_SIZE);
	/* Pre-compute all permyriad -> BBR_SCALE and seconds -> jiffies derived values.
	 * This must run before registration so the hot-path variables are ready for the first ACK.
	 * Also called from ucp_proc_handler / module_param_cb to refresh on runtime writes */
	ucp_init_module_params();

	/* Register the sysctl table under /proc/sys/net/ucp/.  This enables:
	 *   sysctl -w net.ucp.ucp_extra_acked_gain=0
	 *   echo "net.ucp.ucp_bbr_mode=1" >> /etc/sysctl.d/ucp.conf
	 * The table's ucp_proc_handler calls ucp_init_module_params() on every write,
	 * so all cached values are refreshed immediately without unloading the module */
	ucp_ctl_header = register_sysctl("net/ucp", ucp_ctl_table);

	/* Register the UCP congestion control ops with the TCP stack.
	 * After this succeeds, new TCP connections can select "ucp" via the
	 * net.ipv4.tcp_congestion_control sysctl or the TCP_CONGESTION setsockopt */
	ret = tcp_register_congestion_control(&tcp_ucp_cong_ops);
	/* If CC registration failed (e.g. name conflict or out of memory),
	 * clean up the sysctl table to avoid orphan entries in /proc/sys/net/ucp/.
	 * Without this, the sysctl nodes would exist but point to a dead module */
	if (ret && ucp_ctl_header) {
		unregister_sysctl_table(ucp_ctl_header);
		ucp_ctl_header = NULL; /* Invalidate the handle so ucp_unregister won't double-free */
	}
	return ret; /* 0 = success, negative errno = failure (module load will be rejected) */
}
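
#if 0	/* Illustrative userspace snippet: opting a single socket into UCP via
	 * the TCP_CONGESTION socket option mentioned above, without changing
	 * the system-wide net.ipv4.tcp_congestion_control default. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

static int socket_use_ucp(int fd)
{
	static const char name[] = "ucp";
	/* Fails with ENOENT if the ucp module is not loaded/registered */
	return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, strlen(name));
}
#endif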

/**
 * @brief Module exit function.
 *
 * Called when the module is unloaded (rmmod). Unregisters the UCP congestion
 * control algorithm from the TCP stack and cleans up the sysctl table.
 * All connections using UCP will be transitioned to the system default
 * congestion control algorithm.
 */
static void __exit ucp_unregister(void)
{
	/* Unregister the CC ops: the TCP stack will switch any remaining UCP
	 * connections to the system default CC before this call returns.
	 * This must happen first to prevent new connections from selecting UCP
	 * while we're tearing down */
	tcp_unregister_congestion_control(&tcp_ucp_cong_ops);
	/* Remove /proc/sys/net/ucp/ sysctl entries created at module init.
	 * Safe to call with NULL (ucp_ctl_header stays NULL if registration
	 * failed in ucp_register, and is set to NULL after a failed CC reg) */
	if (ucp_ctl_header) {
		unregister_sysctl_table(ucp_ctl_header);
		ucp_ctl_header = NULL; /* Defensive: prevent double-unregister on any future code path */
	}
}

module_init(ucp_register);   /* Declare ucp_register as the module entry point, called on insmod */
module_exit(ucp_unregister); /* Declare ucp_unregister as the module exit point, called on rmmod */

MODULE_AUTHOR("PPP PRIVATE NETWORK™ X"); /* Module author: the organization that developed UCP */
MODULE_AUTHOR("Original BBR: Van Jacobson, Neal Cardwell, Yuchung Cheng, "
	      "Soheil Hassas Yeganeh (Google)"); /* Credit the original BBR algorithm authors whose work UCP is derived from */
MODULE_LICENSE("Dual BSD/GPL"); /* Dual licensing: BSD or GPL v2; allows use in both open-source and proprietary projects */
MODULE_DESCRIPTION("TCP UCP v1.0 - BBR-based congestion control with non-destructive constraint layer, "
		   "6-class path classifier, graduated cwnd ceilings, bandwidth floors, "
		   "ACK aggregation compensation, and pure BBR-compatible mode. "
		   "Fits within ICSK_CA_PRIV_SIZE=104 bytes."); /* Module description displayed by modinfo */
MODULE_VERSION("1.0"); /* Module version string: major.minor; aligned with the formal release version */