TCP UCP v1.0:BBR 的非破坏性约束层
文件 :tcp_ucp.c
版本 :1.0
许可证 :Dual BSD/GPL
作者 :PPP PRIVATE NETWORK™ X(对 Google BBRv1 的扩展,非修改)
内核约束 :struct ucp 必须 ≤ 104 字节(ICSK_CA_PRIV_SIZE)
0. 运行流程图
- TCP ACK arrives with rate_sample
- ucp_main (per-ACK entry)
  - ucp_update_model
    - ucp_update_bw
      - Update minmax running-max of bandwidth (10 rounds)
      - Maybe apply bandwidth floor (per-class, loss<5%)
    - ucp_update_loss_ewma
      - loss_ewma = (old*3 + instant_loss*1)/4; if no loss, decay *0.7
      - Compressed to 9 bits (u8 + overflow bit)
    - ucp_update_ack_aggregation
      - Single-slot epoch: extra_acked u8 saturates
      - Epoch length = max(min_rtt, 1ms)
      - At epoch end: extra_acked_max = max(extra_acked_max*3/4, excess)
    - ucp_update_net_condition
      - Input: rate_change_ewma (7/8 smoothed), loss_ewma, ecn_ewma, rtt_inc
      - Rate drop ≤ -15% AND (loss≥5% OR ECN>0) AND (rtt_inc≥20% OR loss≥10%) → CONGESTED
      - Same rate drop but no high rtt_inc → RANDOM_LOSS
      - Loss>0 but rate not dropping → RANDOM_LOSS
      - No loss, rate not dropping → LIGHT_LOAD
      - Enter CONGESTED needs 3 confirms, exit needs 2
    - ucp_update_net_class
      - Uses avg_rtt, jitter (max-min), loss_ewma, rtt_inc
      - Decision order: loss>5% → CONGESTED
      - RTT<5ms, jitter<3ms, loss<0.1% → LAN
      - loss>3% AND jitter>20ms → MOBILE
      - RTT>80ms AND loss>1% → LOSSY_FAT
      - rtt_inc≥50% AND loss≥1% → CONGESTED
      - RTT>60ms → VPN
      - Otherwise DEFAULT
    - ucp_update_cycle_phase
      - If mode=PROBE_BW and phase_done() true, advance cycle_idx (0..7)
      - If probe phase (idx 0) just ended and probe_gain_applied=1 and loss≥drain_thresh, set drain_pending = 1/2/3 based on loss severity
    - ucp_check_full_bw_reached
      - At round start: if current max_bw ≥ 1.25× full_bw snapshot → update snapshot, reset cnt
      - Else cnt++, if cnt≥3 → full_bw_reached=1
    - ucp_check_drain
      - STARTUP → DRAIN when full_bw_reached=1
      - DRAIN → PROBE_BW when inflight_at_edt ≤ 1.0×BDP
    - ucp_update_min_rtt
      - Update min_rtt: if sample < min_rtt: fast-fall check (<75% consecutive 3→direct, else sticky floor *3/4)
      - SRTT guard: srtt/8 < min_rtt*9/10 → update min_rtt
      - If filter_expired and not idle_restart and not in PROBE_RTT → enter PROBE_RTT, save cwnd
      - In PROBE_RTT: when inflight≤4, set done_stamp=now+200ms; on round_done and timeup → exit PROBE_RTT
  - Set mode-specific base gains
    - STARTUP: pacing_gain = UCP_HIGH_GAIN (≈2.885x), cwnd_gain = same
    - DRAIN: pacing_gain = UCP_DRAIN_GAIN (≈0.346x), cwnd_gain = UCP_HIGH_GAIN
    - PROBE_BW: cwnd_gain = ucp_cwnd_gain_val (2.0x default), pacing_gain from cycle phase
    - PROBE_RTT: pacing_gain = 1.0x, cwnd_gain = 1.0x
  - ucp_apply_pacing_constraints (one-shot drain override)
    - If drain_pending != 0, override pacing_gain = drain_gain_by_level(level), clear pending
  - ucp_apply_cwnd_constraints
    - If net_condition == CONGESTED: cap cwnd_gain based on congestion_severity (MILD/MODERATE/SEVERE) → 1.75x/1.5x/1.25x BDP
    - If mode == STARTUP: loss≥hard_cap (2%) → cap both gains to cwnd_gain_val (2.0x); loss≥soft_drain (0.5%) → cap to startup_soft_gain_val (2.5x)
  - ucp_set_pacing_rate
    - rate = bw_to_pacing_rate(bw, pacing_gain) bytes/sec
    - If full_bw_reached OR rate>current: apply 3:1 EWMA smoothing on increase, fast ramp if rate>2×current and round_start
    - Set sk_pacing_rate = rate
  - ucp_set_cwnd
    - If no acked: jump to done
    - If entering fast recovery: use recovery cwnd, skip normal
    - target = ucp_bdp(bw, cwnd_gain) + quantization + ack_compensation
    - Clamp target to [1.25×BDP_minrtt, 2.0×BDP_minrtt]
    - If full_bw_reached: cwnd = min(cwnd+acked, target); else cwnd += acked
    - cwnd = max(cwnd, 4)
    - If just exiting PROBE_RTT: cwnd = max(cwnd, prior_cwnd)
    - If mode == PROBE_RTT: cwnd = min(cwnd, 4)
    - Set tp->snd_cwnd = min(cwnd, snd_cwnd_clamp)
  - return
1. 核心目标与局限性(面向实际使用)
UCP 试图解决什么问题?
BBRv1 在真实互联网路径上存在以下问题:
- 过度敏感:BBR 的带宽滤波器(运行最大 10 轮)和固定的增益循环(1.25× probe / 0.75× drain)在没有拥塞时也会频繁探测,导致发送速率剧烈波动。
- 对丢包反应过度:BBR 本身不直接对丢包做反应(除进入 recovery 外),但它的带宽估计依赖 delivery rate,丢包会导致采样无效或偏低,从而引起速率骤降。
- STARTUP 阶段过于激进:2.89× 增益在某些丢包容忍链路(如无线、卫星)上会瞬间填满 buffer 并引发大量丢包,然后 BBR 进入 DRAIN,反复振荡。
UCP 添加的"非破坏性约束层"试图:
- 平滑速率波动:通过带宽下限、拥塞状态机、探针跳过、一次性排空等机制,减少不必要的速率跳变。
- 容忍背景丢包:将丢包区分为随机丢包(RANDOM_LOSS)和拥塞丢包(CONGESTED),仅在后者才收紧 cwnd。
- 保护 STARTUP:当 STARTUP 期间出现丢包就降增益,避免过度冲撞。
- 提供路径自适应:自动识别 LAN、MOBILE、VPN 等,调整探针间隔和带宽下限。
实际效果与代价
- 吞吐性能 :在纯吞吐测试(iperf3 长流,无丢包)下,UCP 比 BBRv1 低 2% 到 7% 。原因:
- 拥塞窗口上限(即使未拥塞,cwnd_gain 仍为 2.0×,与 BBR 相同;但带宽下限在某些情况下会使带宽估计偏低,从而影响 cwnd 目标)
- ACK 聚合补偿是单槽近似,在 ACK 压缩场景下加成较小。
- 探针跳过可能使连接长期不探测新带宽。
- 一次性排空若被误触发,会临时降低速率。
- 平滑性 :UCP 的发送速率曲线波动幅度明显小于 BBR,尤其在高丢包或高延迟变化链路。这对于 Twitch 直播、YouTube、Netflix 等流媒体非常有利,因为视频播放器可以更稳定地预测带宽,减少缓冲或码率切换。
- 丢包行为 :UCP 的带宽下限导致在轻度丢包时发送速率不会降得太低,因此 丢包率可能略高于 BBR(因为坚持发送)。但这是设计意图:牺牲微小丢包换取更高且平稳的吞吐。
- 滞后性:条件分类器需要 2‑3 次确认才切换状态,这意味着对快速变化的网络(如从 4G 切换 WiFi)反应较慢。期间 UCP 可能仍使用旧类参数,导致不匹配。
已知可能引入的新问题
| 问题 | 表现 | 原因 |
|---|---|---|
| 在良好网络下吞吐不如 BBR | 5% 左右损失 | 带宽下限、探针跳过、ACK 补偿简化 |
| 从拥塞恢复慢 | 退出 CONGESTED 需要 2 个确认周期 | 滞后性设计 |
| 带宽下限导致无效发送 | 高丢包时仍以高下限发送,加剧丢包 | 下限仅在 loss<5% 时生效,但阈值可调 |
| 类误判 | 将稳定高延迟链路判为 VPN 而非 DEFAULT,可能禁用下限 | 固定阈值 |
| PROBE_RTT 间隔被加长过多 | 在 MOBILE+高 loss 类时可能长达 15+5+0=20 秒才刷新 min_rtt | 类额外 + loss 额外累加 |
2. 加载、启用与配置
2.1 编译模块
假设内核源码位于 /lib/modules/$(uname -r)/build,模块源文件为 tcp_ucp.c,Makefile 内容:
makefile
obj-m := tcp_ucp.o
KERNELDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
	$(MAKE) -C $(KERNELDIR) M=$(PWD) modules
clean:
	$(MAKE) -C $(KERNELDIR) M=$(PWD) clean
执行 make 生成 tcp_ucp.ko。
2.2 加载模块
bash
# 需要 root 权限
insmod tcp_ucp.ko
# 或使用 modprobe(如果放在 /lib/modules/.../kernel/net/ipv4/ 下)
depmod -a
modprobe tcp_ucp
检查加载成功:
bash
lsmod | grep tcp_ucp
# 输出示例:tcp_ucp 16384 0
sysctl net.ipv4.tcp_available_congestion_control
# 应包含 "ucp"
2.3 启用 UCP 作为系统默认拥塞控制
bash
# 临时生效
echo ucp > /proc/sys/net/ipv4/tcp_congestion_control
# 永久生效(写入 /etc/sysctl.conf 或 /etc/sysctl.d/)
echo "net.ipv4.tcp_congestion_control = ucp" >> /etc/sysctl.d/99-ucp.conf
sysctl -p /etc/sysctl.d/99-ucp.conf
2.4 为单个连接启用(应用层 setsockopt)
c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

int main(void) {
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    const char algo[] = "ucp";
    if (setsockopt(sock, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo)) < 0)
        perror("setsockopt TCP_CONGESTION");
    // ... connect(sock, ...)
    return 0;
}
2.5 运行时修改参数(无需卸载模块)
方法 A:sysfs 直接写文件
所有参数位于 /sys/module/tcp_ucp/parameters/。
示例:
bash
# 切换到纯 BBR 兼容模式
echo 1 > /sys/module/tcp_ucp/parameters/ucp_bbr_mode
# 关闭带宽下限(对所有类)
echo 0 > /sys/module/tcp_ucp/parameters/ucp_bw_floor_default
echo 0 > /sys/module/tcp_ucp/parameters/ucp_bw_floor_mobile
# ... 其他类同理
# 将严重拥塞上限从 1.25× 改为 1.5×
echo 15000 > /sys/module/tcp_ucp/parameters/ucp_cwnd_cap_severe
# 禁用 ACK 聚合补偿
echo 0 > /sys/module/tcp_ucp/parameters/ucp_extra_acked_gain
方法 B:sysctl(支持 /etc/sysctl.conf 持久化)
所有参数映射为 net.ucp.<param_name>。示例:
bash
# 临时修改
sysctl -w net.ucp.ucp_bbr_mode=1
sysctl -w net.ucp.ucp_probe_gain=10500
# 永久配置
echo "net.ucp.ucp_bbr_mode = 1" >> /etc/sysctl.d/99-ucp.conf
echo "net.ucp.ucp_probe_gain = 10500" >> /etc/sysctl.d/99-ucp.conf
sysctl -p /etc/sysctl.d/99-ucp.conf
注意:两种方法修改的是同样的全局变量,每次写入都会自动调用 ucp_init_module_params() 重新计算所有内部缓存(如 permyriad → BBR_SCALE 转换、秒 → jiffies 转换)。已存在的连接在下一次 ACK 处理时会使用新值。
2.6 查看当前 UCP 连接内部状态
使用 ss -ti --bbr 命令。若带宽高 32 位(bbr_bw_hi)的最高位(0x80000000)被设置,则剩余位编码了 UCP 特有字段:
bbr_bw_hi (32 bits) = bit31 标志位 (=1) | bits 30‑28 net_class | bits 27‑26 net_condition | bits 25‑24 drain_pending | bits 23‑16 loss_ewma(bits 15‑0 保留)
解码:
bash
# 假设 ss -ti 输出中有 "bbr_bw_hi:800a1234"
VAL=0x800a1234
NET_CLASS=$(( (VAL >> 28) & 0x7 ))
NET_COND=$(( (VAL >> 26) & 0x3 ))
DRAIN=$(( (VAL >> 24) & 0x3 ))
LOSS=$(( (VAL >> 16) & 0xFF ))
printf "net_class=%d net_cond=%d drain=%d loss_ewma=%d (0-256)\n" $NET_CLASS $NET_COND $DRAIN $LOSS
枚举值:
- net_class: 0=DEFAULT, 1=LAN, 2=MOBILE, 3=LOSSY_FAT, 4=CONGESTED, 5=VPN
- net_condition: 0=IDLE, 1=LIGHT_LOAD, 2=CONGESTED, 3=RANDOM_LOSS
- drain_pending: 0=none, 1=light, 2=standard, 3=aggressive
- loss_ewma: 值 / 256 = 丢包率,如 128 = 50%
3. 全部内建常量(不可通过模块参数修改)
以下常量为 #define,若要改变需修改源码并重新编译。
3.1 定点数缩放常量
| 常量名 | 值 | 说明 |
|---|---|---|
| BW_SCALE | 24 | 带宽缩放的小数位数,1.0 = 2^24 |
| BW_UNIT | 1 << 24 = 16777216 | 带宽定点数单位 |
| BBR_SCALE | 8 | 增益缩放的小数位数,1.0 = 256 |
| BBR_UNIT | 256 | 增益定点数单位 |
3.2 带宽滤波器与 PROBE_BW 循环参数
| 常量名 | 值 | 说明 |
|---|---|---|
| UCP_BW_RTT_CYCLE_LEN | 8 | 一个增益循环的轮次数 |
| UCP_BW_RTTS | 10 | max 带宽滤波窗口长(8+2 guard rounds)|
| UCP_PROBE_RTT_MODE_MS | 200 | PROBE_RTT 停留时间(毫秒)|
| UCP_MIN_TSO_RATE | 1200000 (bps) | 低于此速率 TSO 只用 1 段 |
| UCP_TSO_MAX_SEGS | 0x7F (127) | TSO 最大段数 |
| UCP_PACING_MARGIN_PERCENT | 1 | pacing 速率减 1% 留安全余量 |
| UCP_PACING_MARGIN_DIV | 99 | 余量分母:rate * 99 / 100 |
| UCP_RATE_MAX_SAFE | U64_MAX / 99 | 防止溢出 |
| UCP_HIGH_GAIN | (256 * 2885 / 1000 + 1) = 739 | STARTUP 增益 ≈ 2.885x |
| UCP_DRAIN_GAIN | (256 * 1000 / 2885) = 88 | DRAIN 增益 ≈ 0.346x |
| UCP_PROBE_BW_CYCLE_LEN | 8 | PROBE_BW 相位数 |
| UCP_PROBE_BW_DRAIN_IDX | 1 | drain 相位在 cycle 中的索引 |
| UCP_PROBE_BW_CYCLE_RAND | 7 | 初始相位随机范围 0‑7 |
| UCP_CWND_MIN_TARGET | 4 | 最小 cwnd(packets)|
| UCP_FULL_BW_THRESH | (256 * 125 / 100) = 320 | 满管检测阈值 1.25x |
| UCP_FULL_BW_CNT | 3 | 连续无 1.25x 增长多少轮判满 |
3.3 EWMA 权重与衰减
| 常量名 | 值 | 说明 |
|---|---|---|
| UCP_LOSS_EWMA_RETAINED_WEIGHT | 3 | loss EWMA 保留部分分子 |
| UCP_LOSS_EWMA_SAMPLE_WEIGHT | 1 | 新样本权重分子 |
| UCP_LOSS_EWMA_TOTAL_WEIGHT | 4 | 总权重分母 |
| UCP_LOSS_EWMA_IDLE_DECAY_NUM | 70 | 无损失时衰减分子 |
| UCP_LOSS_EWMA_IDLE_DECAY_DEN | 100 | 衰减分母(0.7x)|
| UCP_ECN_EWMA_* | 同上 | ECN 同理 |
| UCP_RATE_TREND_EWMA_WEIGHT | (256 * 7 / 8) = 224 | 速率趋势平滑权重(7/8 保留)|
3.4 状态机滞后计数器
| 常量名 | 值 | 说明 |
|---|---|---|
| UCP_COND_CONFIRM_ENTER | 3 | 进入 CONGESTED 需要 3 次确认 |
| UCP_COND_CONFIRM_EXIT | 2 | 退出任何非 IDLE 条件需 2 次确认 |
| UCP_CLASS_CONFIRM_CNT | 2 | 切换路径类需要 2 次一致 |
| UCP_CLASS_CONFIRM_MAX | 7 | 类确认计数器最大值(3 位)|
3.5 cwnd 上下界
| 常量名 | 值 | 说明 |
|---|---|---|
| UCP_INFLIGHT_LOW_GAIN | (256 * 125 / 100) = 320 | cwnd 下界乘数 1.25x |
| UCP_INFLIGHT_HIGH_GAIN | (256 * 200 / 100) = 512 | cwnd 上界乘数 2.0x |
3.6 ACK 聚合补偿
| 常量名 | 值 | 说明 |
|---|---|---|
| UCP_ACK_EPOCH_DECAY_NUM | 3 | extra_acked_max 衰减分子 |
| UCP_ACK_EPOCH_DECAY_DEN | 4 | 分母,= 3/4 衰减 |
| UCP_U8_MAX | 0xFF | 255 |
| UCP_ACK_EPOCH_MIN_US | 1000 | 最小 epoch 1ms |
3.7 RTT 采样过滤
| 常量名 | 值 | 说明 |
|---|---|---|
| UCP_RTT_SAMPLE_MAX_US | 500000 | 绝对 RTT 上限 500ms |
| UCP_RTT_SAMPLE_MAX_MULT | 3 | 动态上限 = min_rtt * 3 |
| UCP_MINRTT_FAST_FALL_CNT | 3 | 快速下降需要连续 3 个 <75% 样本 |
| UCP_MINRTT_STICKY_FLOOR_NUM | 3 | 渐进下降分子(×3/4)|
| UCP_MINRTT_STICKY_FLOOR_DEN | 4 | 分母 |
| UCP_MINRTT_SRTT_GUARD_NUM | 9 | SRTT 保护分子 |
| UCP_MINRTT_SRTT_GUARD_DEN | 10 | 分母(9/10)|
3.8 BDP 计算边界
| 常量名 | 值 | 说明 |
|---|---|---|
| UCP_BDP_MIN_RTT_US | 1000 | 最小 RTT 1ms |
| UCP_BDP_HI_MULT | 2 | model_rtt 上界乘数 |
| UCP_BDP_HI_FLOOR_US | 500000 | model_rtt 上界下限 500ms |
3.9 TSO 与 cwnd 量化
| 常量名 | 值 | 说明 |
|---|---|---|
| UCP_TSO_HEADROOM_SEGS | 3 | TSO 头空间段数 |
| UCP_PROBE_CWND_BONUS | 2 | probe 阶段额外 cwnd 段数 |
3.10 路径分类阈值
| 常量名 | 值(BBR_SCALE 或 us) | 说明 |
|---|---|---|
| UCP_CLASS_LAN_RTT_US | 5000 us | LAN RTT <5ms |
| UCP_CLASS_LAN_JITTER_US | 3000 us | LAN jitter <3ms |
| UCP_CLASS_LAN_LOSS_THRESH | 256 / 1000 = 0.256 | LAN loss <0.1% |
| UCP_CLASS_MOBILE_LOSS_THRESH | 256 * 3 / 100 = 7.68 | MOBILE loss >3% |
| UCP_CLASS_MOBILE_JITTER_US | 20000 us | MOBILE jitter >20ms |
| UCP_CLASS_LOSSY_RTT_US | 80000 us | LOSSY_FAT RTT >80ms |
| UCP_CLASS_LOSSY_LOSS_THRESH | 256 / 100 = 2.56 | LOSSY_FAT loss >1% |
| UCP_CLASS_CONG_RINC_THRESH | 256 * 50 / 100 = 128 | CONGESTED 类 rtt_inc ≥50% |
| UCP_CLASS_VPN_RTT_US | 60000 us | VPN RTT >60ms |
3.11 拥塞等级与条件判定阈值(BBR_SCALE)
| 常量名 | 值 | 说明 |
|---|---|---|
| UCP_RTT_EXTRA_HIGH_THRESH | 256 | RTT 增加 100% |
| UCP_RTT_EXTRA_MID_THRESH | 128 | RTT 增加 50% |
| UCP_CONG_SEVERE_RINC_THRESH | 256 | 严重拥塞 RTT 增加 100% |
| UCP_CONG_MODERATE_RINC_THRESH | 128 | 中度拥塞 RTT 增加 50% |
| UCP_CONG_MILD_RINC_THRESH | 256 * 25 / 100 = 64 | 轻度 RTT 增加 25% |
| UCP_COND_RATE_DROP_THRESH | -38(≈ -15%)| 速率下降阈值 |
| UCP_COND_LOSS_CONGEST_THRESH | 256*5/100 ≈ 12.8 | loss ≥5% 视为拥塞信号 |
| UCP_COND_RINC_CONGEST_THRESH | 256*20/100 ≈ 51.2 | rtt_inc ≥20% 确认拥塞 |
| UCP_COND_LOSS_SEVERE_THRESH | 256*10/100 ≈ 25.6 | loss ≥10% 为严重 |
4. 全部可调模块参数(sysfs / sysctl)
以下参数都以 permyriad (1/10000) 为单位,除非标注为秒(sec)或布尔(0/1)。参数名在 sysfs 和 sysctl 中相同(sysctl 前缀 net.ucp.)。
4.1 操作模式
| 参数名 | 类型 | 默认 | 范围 | 说明 |
|---|---|---|---|---|
| ucp_bbr_mode | 布尔 | 0 | 0/1 | 1 = 纯 BBR 模式(禁用所有 UCP 约束,除 ACK 补偿外)|
4.2 带宽软下限(非拥塞峰值百分比)
| 参数名 | 默认(1/10000)| 范围 | 说明 |
|---|---|---|---|
| ucp_bw_floor_default | 2000 (20%) | 0‑10000 | DEFAULT 类带宽下限 |
| ucp_bw_floor_mobile | 2500 (25%) | 0‑10000 | MOBILE 类下限 |
| ucp_bw_floor_lan | 0(禁用)| 0‑10000 | LAN 类下限 |
| ucp_bw_floor_vpn | 0 | 0‑10000 | VPN 类下限 |
| ucp_bw_floor_lossy_fat | 2500 | 0‑10000 | LOSSY_FAT 类下限 |
| ucp_bw_floor_congested | 2000 | 0‑10000 | CONGESTED 类下限(仅在非拥塞状态有效)|
| ucp_bw_floor_loss_cap | 500 (5%) | 0‑10000 | 允许应用下限的最大 loss 值,高于此值下限禁用 |
4.3 PROBE_RTT 间隔调整(秒)
| 参数名 | 默认 (秒) | 范围 | 说明 |
|---|---|---|---|
| ucp_probe_rtt_base_sec | 10 | ≥1 | 基础间隔 |
| ucp_probe_rtt_class_extra_sec | 5 | ≥0 | MOBILE/LOSSY_FAT 额外增加的秒数 |
| ucp_probe_rtt_high_loss_extra_sec | 0 | ≥0 | loss ≥5% 额外增加 |
| ucp_probe_rtt_mid_loss_extra_sec | 0 | ≥0 | loss ≥2% 且 <5% 额外增加 |
| ucp_probe_rtt_max_sec | 15 | ≥1 | 最大间隔 |
4.4 cwnd 增益与拥塞上限(permyriad of BDP)
| 参数名 | 默认(1/10000)| 对应倍数 | 范围 |
|---|---|---|---|
| ucp_cwnd_gain | 20000 | 2.00× | 1000‑40000 |
| ucp_cwnd_cap_mild | 17500 | 1.75× | 1000‑20000 |
| ucp_cwnd_cap_moderate | 15000 | 1.50× | 同上 |
| ucp_cwnd_cap_severe | 12500 | 1.25× | 同上 |
4.5 ACK 聚合补偿
| 参数名 | 默认(1/10000)| 范围 | 说明 |
|---|---|---|---|
| ucp_extra_acked_gain | 10000 | 0‑20000 | 补偿增益,0 禁用 |
4.6 PROBE_BW 探测增益
| 参数名 | 默认(1/10000)| 对应倍数 | 范围 |
|---|---|---|---|
| ucp_probe_gain | 11000 | 1.10× | 10000‑15000 |
| ucp_probe_gain_mobile | 10000 | 1.00× | 10000‑11000 |
4.7 排空(Drain)相关
| 参数名 | 默认(1/10000)| 对应倍数/阈值 | 范围 |
|---|---|---|---|
| ucp_drain_loss_thresh | 100 | 1.0% loss | 0‑1000 |
| ucp_drain_gain_light | 8500 | 0.85× | 5000‑10000 |
| ucp_drain_gain_standard | 7500 | 0.75× | 同上 |
| ucp_drain_gain_aggressive | 6500 | 0.65× | 同上 |
| ucp_drain_loss_lvl2_thresh | 500 | 5.0% loss | 0‑10000 |
| ucp_drain_loss_lvl3_thresh | 1000 | 10.0% loss | 0‑10000 |
4.8 分类阈值
| 参数名 | 默认(1/10000)| 对应百分比 | 范围 |
|---|---|---|---|
| ucp_low_loss_thresh | 100 | 1.0% | 0‑1000 |
| ucp_high_loss_thresh | 500 | 5.0% | 0‑10000 |
4.9 探针跳过阈值
| 参数名 | 默认(1/10000)| 对应条件 | 范围 |
|---|---|---|---|
| ucp_probe_skip_loss_thresh | 200 | loss ≥2% | 0‑1000 |
| ucp_probe_skip_rtt_rise | 4000 | RTT 上升 ≥40% | 0‑10000 |
4.10 STARTUP 阶段 loss 处理
| 参数名 | 默认(1/10000)| 对应条件/增益 | 范围 |
|---|---|---|---|
| ucp_startup_soft_drain_thresh | 50 | loss ≥0.5% 触发软降 | 0‑1000 |
| ucp_startup_hard_cap_thresh | 200 | loss ≥2.0% 触发硬限 | 0‑1000 |
| ucp_startup_soft_gain | 25000 | 软降后增益 2.5× | 20000‑28900 |
5. 数据结构 struct ucp 与 104 字节限制的精确分配
以下是字段排列与位域的详细列表,确保不超过 ICSK_CA_PRIV_SIZE(实际 104 字节)。若编译时 BUILD_BUG_ON 触发,说明超出。
5.1 基本 u32 字段(12 字节)
c
u32 min_rtt_us; // 4
u32 min_rtt_stamp; // 4
u32 probe_rtt_done_stamp; // 4
5.2 struct minmax bw(占用 24 字节,由内核 win_minmax.h 定义)
包含滑动窗口数组和索引。
5.3 轮次和时间戳(16 字节)
c
u32 rtt_cnt; // 4
u32 next_rtt_delivered; // 4
u32 cycle_mstamp_lo; // 4
u32 cycle_mstamp_hi; // 4
5.4 位域 word 1(32 位 = 4 字节)
| 字段 | 位数 | 说明 |
|---|---|---|
| mode | 2 | UCP_STARTUP(0), DRAIN(1), PROBE_BW(2), PROBE_RTT(3) |
| prev_ca_state | 3 | 前一个 TCP CA 状态 |
| round_start | 1 | 新轮次开始标志 |
| idle_restart | 1 | 空闲后重启标志 |
| probe_rtt_round_done | 1 | PROBE_RTT 中已完成至少一轮 |
| fast_recovery | 1 | 处于非拥塞快速恢复 |
| net_condition | 2 | 0=IDLE,1=LIGHT_LOAD,2=CONGESTED,3=RANDOM_LOSS |
| net_class | 3 | 0=DEFAULT,1=LAN,2=MOBILE,3=LOSSY_FAT,4=CONGESTED,5=VPN |
| rate_hist_idx | 1 | 带宽环形缓冲写索引 |
| rtt_hist_idx | 1 | RTT 环形缓冲写索引 |
| drain_pending | 2 | 0=none,1=light,2=standard,3=aggressive |
| cond_confirm | 3 | net_condition 确认计数器 |
| class_confirm | 3 | net_class 确认计数器 |
| min_rtt_fast_fall_cnt | 2 | 快速下降计数器 |
| cycle_idx | 3 | PROBE_BW 当前相位 0‑7 |
| probe_gain_applied | 1 | 上一次 probe 阶段是否使用了 >1.0× 增益 |
| padding1 | 2 | 未使用 |
总和:2+3+1+1+1+1+2+3+1+1+2+3+3+2+3+1+2 = 32 位。
5.5 位域 word 2(32 位 = 4 字节)
| 字段 | 位数 | 说明 |
|---|---|---|
| pacing_gain | 12 | 当前 pacing 增益(BBR_SCALE) |
| cwnd_gain | 10 | 当前 cwnd 增益(BBR_SCALE) |
| full_bw_reached | 1 | STARTUP 满管标志 |
| full_bw_cnt | 2 | 连续无 1.25× 增长的轮数 |
| has_seen_rtt | 1 | 是否已获得有效 RTT 样本 |
| has_delayed_ack | 1 | 当前 ACK 是否为延迟 ACK |
| probe_rtt_restored | 1 | 刚从 PROBE_RTT 恢复 cwnd |
| loss_ewma_high | 1 | loss EWMA 的 bit8 |
| ecn_ewma_high | 1 | ecn EWMA 的 bit8 |
| padding2 | 2 | 未使用 |
总和:12+10+1+2+1+1+1+1+1+2 = 32 位。
5.6 独立 u32 字段(8 字节)
c
u32 prior_cwnd; // 4
u32 full_bw; // 4
5.7 环形缓冲区(8+8=16 字节)
c
u32 rtt_history[2]; // 8
u32 deliv_rate_hist[2]; // 8
5.8 紧凑 u8 与 u32 字段(4×u8 + 4×u32 = 4 + 16 = 20 字节)
c
u8 loss_ewma; // 1
u8 ecn_ewma; // 1
u8 extra_acked; // 1
u8 extra_acked_max; // 1
u32 ack_epoch_start_us; // 4
u32 max_bw_non_congested; // 4
s32 rate_change_ewma; // 4
u32 last_delivered_ce; // 4
总计:12 + 24 + 16 + 4 + 4 + 8 + 16 + 20 = 104 字节,精确占满。
6. 核心函数逐段解释(含数值示例)
6.1 permyriad_to_bbr(val) -- 单位转换
c
return (u32)(((u64)BBR_UNIT * val) / 10000);
例:val=20000 → (256*20000)/10000 = 512,表示 2.0×。
6.2 ucp_init_module_params() -- 预计算缓存
所有模块参数被读入,并转换:
- permyriad → BBR_SCALE,存入 `ucp_xxx_val` 静态变量。
- 秒 → jiffies,存入 `ucp_probe_rtt_xxx_jiffies`。
- 对输入值做 `max(..., 0)` 防止负数。
调用时机:模块加载、每次 sysfs 或 sysctl 写入后。
6.3 ucp_update_bw() -- 带宽估计与下限
关键代码段:
c
if (!rs->is_app_limited || bw >= ucp_max_bw(sk)) {
if (ucp->net_condition != UCP_COND_CONGESTED && (u32)bw > ucp->max_bw_non_congested)
ucp->max_bw_non_congested = (u32)bw;
// 根据 net_class 选择 pct
if (pct && ucp_get_loss_ratio(sk) < ucp_bw_floor_loss_cap_val) {
u64 floor_bw = (u64)ucp->max_bw_non_congested * pct / 10000;
if (bw < floor_bw) bw = floor_bw;
}
minmax_running_max(&ucp->bw, UCP_BW_RTTS, ucp->rtt_cnt, (u32)bw);
}
意图:防止 BBR 的带宽估计因短暂 idle 或应用限制而跌零,同时避免在拥塞时使用过时峰值。
数值例:
- `max_bw_non_congested` = 100000(BW_SCALE),`pct` = 2500 → `floor_bw = 100000 * 2500 / 10000 = 25000`。
- 当前样本 `bw` = 20000,被提升到 25000 后更新滤波器。
6.4 ucp_update_loss_ewma() -- 丢包 EWMA 与压缩
公式:
- 有丢包:`new = (old * 3 + instant * 1) / 4`
- 无丢包:`new = old * 70 / 100`
- 然后调用 `ucp_set_loss_ewma()` 将 9 位值(0‑256)压缩为 `(high << 8) | low`:`low` 为 u8,`high` 只占 1 个溢出位(bit8)。

数值例:
- `old` = 100(39.1%),`instant` = 50(19.5%)→ `new = (300 + 50) / 4 = 87`(整数截断,≈34.0%)。
- 若 `old` = 87,无丢包 → `new = 87 * 70 / 100 = 60`(≈23.4%),指数衰减。
6.5 ucp_update_ack_aggregation() -- 单槽补偿
epoch 长度 = max(min_rtt_us, 1000us)。当 now - epoch_start >= epoch_len 时结算:
- `expected = (bw_max * epoch_us) >> BW_SCALE`
- `extra = (extra_acked > expected) ? (extra_acked - expected) : 0`
- `extra_acked_max = max(extra_acked_max * 3 / 4, (u8)extra)`
- 重置 epoch,`extra_acked = this_ack_acked`。

数值例:
- `bw_max` = 83886(≈ 10 Mbps),`epoch_us` = 20000 us → `expected = 83886*20000 >> 24 = 99` 包(整数截断,≈100)。
- `extra_acked` = 120 包 → `extra = 120 - 99 = 21` 包。
- `extra_acked_max` 原为 10:`10*3/4 = 7`(截断),取 `max(7, 21) = 21`。
6.6 ucp_update_net_condition() -- 状态分类
rate_change_ewma 先通过 ucp_get_delivery_rate_trend() 得到原始趋势,再用 7/8 权重平滑。
c
rate_change = ucp->rate_change_ewma;
if (rate_change <= -38 && (loss_ratio >= 13 || ecn_ratio > 0)) {
if (rinc >= 51 || loss_ratio >= 26) new = CONGESTED;
else new = RANDOM_LOSS;
} else if (loss_ratio > 0) {
new = RANDOM_LOSS;
} else {
new = LIGHT_LOAD;
}
滞后:
- 从其他状态进入 CONGESTED 需要 `cond_confirm >= 3`:候选状态与本次判定一致时 `cond_confirm++`,不一致时清零。
- 退出 CONGESTED 只需要 2 次一致。
意图:避免在拥塞信号边缘频繁切换,但导致响应变慢(滞后性)。
6.7 ucp_update_net_class() -- 路径分类
决策树按顺序:
c
if (loss > high_loss_thresh) candidate = CONGESTED;
else if (avg_rtt < 5000 && jitter < 3000 && loss < 0.256) candidate = LAN;
else if (loss > 7.68 && jitter > 20000) candidate = MOBILE;
else if (avg_rtt > 80000 && loss > 2.56) candidate = LOSSY_FAT;
else if (rinc >= 128 && loss >= low_loss_thresh) candidate = CONGESTED;
else if (avg_rtt > 60000) candidate = VPN;
else candidate = DEFAULT;
切换需要 class_confirm >= 2。
6.8 ucp_get_cycle_pacing_gain() -- PROBE_BW probe 增益决定
c
if (cycle_idx == 0) {
if (loss >= probe_skip_loss || rinc >= probe_skip_rtt_rise)
return BBR_UNIT;
if (!bbr_mode && net_class == MOBILE && loss >= drain_loss_thresh)
return ucp_probe_gain_mobile_val;
return ucp_probe_gain_val;
}
if (cycle_idx == 1) {
return probe_gain_applied ? (BBR_UNIT*3/4) : BBR_UNIT;
}
return BBR_UNIT;
数值例 :如果 loss=3% (7.68/256≈3%),probe_skip_loss_val=5.12 (2%),条件触发,probe 相位返回 1.0×,不探测。
6.9 ucp_congestion_level() -- 判断严重等级
c
if (rinc >= 256 || loss >= high_loss_thresh) return SEVERE;
if (rinc >= 128 || loss >= drain_loss_thresh) return MODERATE;
if (rinc >= 64 || loss >= low_loss_thresh) return MILD;
return NONE;
6.10 ucp_set_cwnd() -- cwnd 计算细节
步骤:
- 若 `acked == 0`,跳过。
- 检查 recovery:若进入 fast recovery 则用特殊 cwnd。
- 计算 `target = ucp_bdp(bw, cwnd_gain) + ucp_quantization_budget()`。
- ACK 补偿:`target += (extra_acked_max * extra_acked_gain_val) >> BBR_SCALE`。
- 基于 min_rtt 的 BDP 计算 `lo = max(4, bdp_minrtt*1.25)`,`hi = max(lo, bdp_minrtt*2.0)`,`target = clamp(target, lo, hi)`。
- 若 `full_bw_reached`:`cwnd = min(cwnd + acked, target)`;否则 `cwnd += acked`。
- 强制 cwnd ≥ 4。
- 若 `probe_rtt_restored`:`cwnd = max(cwnd, prior_cwnd)`。
- 若 `mode == PROBE_RTT`:`cwnd = min(cwnd, 4)`。
- 最终 `tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp)`。
注意 :在 STARTUP 阶段(full_bw_reached 为 false),cwnd 直接加 acked,没有上限,可以指数增长。
7. BBR vs UCP 性能差异详解
7.1 吞吐量损失 -2% 到 -7% 的原因
| 原因 | 贡献百分比(估算) | 说明 |
|---|---|---|
| 带宽下限抬高 cwnd 目标 | 1‑3% | 下限只会抬高低于峰值一定比例的带宽样本,不会压低峰值,在良好链路上不起作用;但在轻度丢包、非拥塞的场景,它使带宽估计和 cwnd 目标偏高,可能加剧丢包与重传,净吞吐反而下降 |
| ACK 聚合补偿单槽衰减 | 0.5‑2% | 相比 BBR 的双槽,UCP 的 extra_acked_max 衰减更快,cwnd 补偿较小 |
| 探针跳过 | 0.5‑2% | 在 loss≥2% 或 RTT 上升≥40% 时完全停止探测,可能错过带宽增长机会 |
| 一次性排空误触发 | 0‑1% | 如果 probe 后丢包瞬时就触发 drain,会暂时降速 |
| cwnd 上限(在非拥塞状态不限制,所以不影响) | 0 | 只有进入 CONGESTED 状态才限制,良好网络无影响 |
| STARTUP 增益降低 | 0‑1% | 仅在丢包时触发,短连接影响小 |
实测场景:在 10 Mbps 无丢包路径上,BBR 稳定达到 9.8 Mbps,UCP 可能只有 9.3 Mbps(约 -5%)。在 1% 随机丢包路径上,BBR 可能降到 6 Mbps 且波动剧烈,UCP 保持在 7 Mbps 较平滑,此时 UCP 反而优于 BBR。
7.2 平滑性收益
- BBR 每 8 个 RTT 一个 probe/drain 循环,速率波形呈锯齿状,振幅 1.25× ↔ 0.75× (约 67% 波动)。
- UCP 通过探针跳过、带宽下限、拥塞状态机减少不必要的探测,波形更平坦。
- 对于视频流媒体,码率自适应算法通常每 2‑5 秒调整一次,UCP 的稳定性可减少码率切换次数,提升 QoE。
7.3 丢包行为差异
- BBR:丢包不会直接降低发送速率(除非进入 recovery),但 delivery rate 采样可能因丢包而偏低,导致后续带宽估计下降。
- UCP:带宽下限使发送速率在轻度丢包时保持较高,因此丢包率可能略高于 BBR。但设计者认为这是可接受的权衡(用略多的重传换平滑速率)。
7.4 滞后性导致的问题示例
- 网络从高延迟(100ms)切换到低延迟(10ms)时,UCP 的 `net_class` 需要 2 次一致的 RTT 采样才会从 VPN 或 DEFAULT 更新为 LAN,期间带宽下限可能仍然为 0(或低值);但 min_rtt 已经下降,BDP 变小,cwnd 可能偏低。
- 拥塞解除后,`net_condition` 需要 2 次 ACK 确认才能从 CONGESTED 转到 LIGHT_LOAD,cwnd 上限在额外 1‑2 个 RTT 内仍然受限。
8. 适用场景与不适用场景
8.1 适用场景
- 流媒体直播/点播(Twitch, YouTube, Netflix):平滑速率曲线优于峰值吞吐。
- 无线/移动网络(4G/5G,WiFi):有背景丢包但非拥塞,带宽下限有助于维持体验。
- 卫星链路(高延迟,偶尔丢包):STARTUP 丢包保护和探针跳过可避免过度冲击。
- VPN 隧道(有封装开销但延迟稳定):默认禁用下限,行为接近 BBR。
8.2 不适用或需谨慎使用的场景
- 数据中心内部高速传输(10G+,无丢包):吞吐损失 5% 不可接受,建议用原生 BBR 或 BBRv2。
- 极短流(HTTP 请求 <10 个 RTT):UCP 的滞后性可能导致连接结束前仍未切换到正确类,收益不明显。
- 实时音频/游戏(对丢包极度敏感):UCP 可能因坚持发送而增加丢包率,不利。
- 突发性流量高峰(如文件上传高峰):UCP 的探针跳过可能使连接错过带宽机会。
8.3 调优
- 若希望更接近 BBR 吞吐,设置 `ucp_bbr_mode=1` 并将 `ucp_extra_acked_gain` 置 0。
- 若希望在移动网络获得平滑,保持默认,可适度调高 `ucp_bw_floor_mobile` 到 3000 (30%)。
- 若发现频繁误入 RANDOM_LOSS,提高 `ucp_low_loss_thresh` 到 200 (2%) 或更高。
- 若要更快响应网络变化,降低 `UCP_COND_CONFIRM_ENTER` 和 `UCP_CLASS_CONFIRM_CNT`(编译期常量,需改源码重新编译)。
9. 局限性与取舍
| 项目 | 描述 | 取舍理由 |
|---|---|---|
| ACK 补偿简化 | 单槽 u8,每 epoch 衰减 3/4 | 104 字节限制,放弃精确性换空间 |
| 固定阈值分类 | 无法自适应路径基线 | 降低状态复杂度,但可能误判 |
| 带宽下限依赖峰值 | 拥塞后峰值清零,需重积累 | 防止使用过时的高峰值 |
| cwnd 上限离散 | 只有三级,无连续函数 | 实现简单,调参直观 |
| 滞后性 | 状态切换需多次确认 | 避免 flapping,但响应慢 |
| 无多流协调 | 每流独立 | 保持简单,依赖标准 TCP 公平性 |
| 吞吐低于 BBR | 2‑7% 损失 | 换平滑性和丢包容忍 |
| 可能比 BBR 丢包略高 | 带宽下限导致 | 设计接受,为了平稳 |
10. 调试与故障排查常用命令
bash
# 查看当前模块参数
cat /sys/module/tcp_ucp/parameters/ucp_bbr_mode
sysctl net.ucp.ucp_bbr_mode
# 监控连接状态
ss -ti --bbr | grep -E 'bbr_bw|ucp'
# 实时修改并观察效果(以 cwnd_cap_severe 为例)
echo 15000 > /sys/module/tcp_ucp/parameters/ucp_cwnd_cap_severe
# 然后抓包或用 ss 看 cwnd 变化
# 彻底禁用 UCP 特性(纯 BBR 模式 + 关闭 ACK 补偿)
echo 1 > /sys/module/tcp_ucp/parameters/ucp_bbr_mode
echo 0 > /sys/module/tcp_ucp/parameters/ucp_extra_acked_gain
# 恢复默认
echo 0 > /sys/module/tcp_ucp/parameters/ucp_bbr_mode
echo 10000 > /sys/module/tcp_ucp/parameters/ucp_extra_acked_gain
11. 源代码实现
c
/* ---- tcp_ucp.c : TCP UCP Congestion Control Module v1.0 ------------------ */
/*
* Universal Communication Protocol (UCP)
* A Non-Destructive Constraint Layer for BBR Congestion Control
* with Programmable Path-Aware Policies
*
* ---------------------------------------------------------------------------
* 1. ALGORITHM OVERVIEW
* ---------------------------------------------------------------------------
*
* UCP is not a new congestion control algorithm. It is a protective wrapper
* around the existing BBRv1 state machine. The core BBR engine (pacing-gain
* cycle, bandwidth minmax filter, PROBE_RTT state machine, cwnd_gain-based
* window control) runs unmodified. UCP interposes lightweight constraints ---
* ceilings and floors --- that engage only when runtime path classifiers detect
* non-ideal conditions. When the path behaves ideally, UCP is transparent
* and BBR runs exactly as Google designed it.
*
* The fundamental premise: BBR's fixed gains (2.0x cwnd, 1.25x/0.75x probe
* cycle, 2.89x STARTUP) are optimal for a statistically "typical" Internet
* path, but can over-consume buffer or under-utilize bandwidth on hostile
* links (high loss, cellular jitter, VPN encapsulation, satellite latency).
* Rather than retuning BBR's internal gains --- which would require re-proving
* the algorithm's stability across all path types --- UCP adds a detection and
* correction layer that acts only when the path demonstrably strays from the
* ideal.
*
* Per-ACK processing pipeline (ucp_main -> ucp_update_model):
*
* TCP ACK arrives with rate_sample
* |
* v
* +------------------+ +-----------------------+
* | ucp_update_bw() |--->| bw minmax filter |
* +------------------+ | (10-round running max) |
* | +-----------------------+
* v
* +------------------+ +-----------------------+
* | ucp_update_loss |--->| loss_ewma (u8, 9-bit |
* | _ewma() | | compresses 0..256) |
* +------------------+ +-----------------------+
* |
* v
* +------------------+ +-----------------------+
* | ucp_update_ack |--->| extra_acked tracking |
* | _aggregation() | | (single epoch slot, |
* +------------------+ | exponential decay) |
* | +-----------------------+
* v
* +------------------+ +-----------------------+
* | ucp_update_net |--->| 4-condition classifier|
* | _condition() | | IDLE/LIGHT_LOAD/ |
* +------------------+ | CONGESTED/RANDOM_LOSS |
* | +-----------------------+
* v
* +------------------+ +-----------------------+
* | ucp_update_net |--->| 6-class path |
* | _class() | | DEFAULT/LAN/MOBILE/ |
* +------------------+ | LOSSY_FAT/CONGESTED/ |
* | | VPN |
* v +-----------------------+
* +------------------+
* | ucp_update_cycle |
* | _phase() |---> 8-phase PROBE_BW gain cycle
* +------------------+
* |
* v
* +------------------+ +-----------------------+
* | ucp_check_full |--->| STARTUP pipe-full |
* | _bw_reached() | | detection (3 rounds) |
* +------------------+ +-----------------------+
* |
* v
* +------------------+ +-----------------------+
* | ucp_update_min |--->| min_rtt filter + |
* | _rtt() | | PROBE_RTT state mchn |
* +------------------+ +-----------------------+
* |
* v
* +------------------+
* | mode-specific |---> pacing_gain / cwnd_gain assignment
* | gains |
* +------------------+
* |
* v
* +------------------+ +-----------------------+
* | ucp_apply_pacing |--->| queued one-shot drain |
* | _constraints() | | (non-destructive #1) |
* +------------------+ +-----------------------+
* |
* v
* +------------------+ +-----------------------+
* | ucp_apply_cwnd |--->| graduated cwnd caps: |
* | _constraints() | | 1.75x/1.50x/1.25x BDP |
* +------------------+ | (non-destructive #2) |
* | +-----------------------+
* v
* +------------------+
* | ucp_set_cwnd() |---> BDP * gain + ACK compensation + quantize
* +------------------+
* |
* v
* +------------------+
* | ucp_set_pacing |---> sk_pacing_rate = bw * pacing_gain
* | _rate() |
* +------------------+
*
*
* ---------------------------------------------------------------------------
* 2. LAYERED ARCHITECTURE
* ---------------------------------------------------------------------------
*
* Layer 0 --- BBR Engine (unmodified core)
* ---------------------------------------
* - Bandwidth estimation via minmax running-max (10-round window)
* - 8-phase PROBE_BW gain cycle: probe(1.10x) / drain(0.75x) / cruise(1.0x)
* - STARTUP (2.89x high gain) / DRAIN (0.346x) / PROBE_RTT (cwnd=4)
* - Pacing-based transmission with cwnd_gain = 2.0x BDP
* - min_rtt tracking with 10-second refresh interval
*
* Layer 1 --- Runtime Path Classifiers (UCP-specific)
* --------------------------------------------------
* - net_condition: four-state classifier driven by:
* . delivery rate trend EWMA (signed, BBR_SCALE)
* . loss_ewma (smoothed packet loss ratio)
* . ecn_ewma (smoothed ECN marking ratio)
* . RTT increase ratio = (avg_rtt - min_rtt) / min_rtt
* Transitions use 2-3 sample hysteresis to prevent flapping.
* - net_class: six-way path taxonomy determined by:
* . RTT magnitude and stability
* . loss ratio and jitter
* . congestion persistence
* Each class has its own bandwidth floor percentage, probe gain
* override, and PROBE_RTT interval modifier.
*
* Layer 2 --- Non-Destructive Constraints (UCP-specific)
* -----------------------------------------------------
* - cwnd ceilings: when net_condition == CONGESTED, cwnd_gain is
* capped to a graduated level based on congestion severity:
* Mild: 1.75x BDP (default 17500 permyriad)
* Moderate: 1.50x BDP (default 15000 permyriad)
* Severe: 1.25x BDP (default 12500 permyriad)
* - Bandwidth floor: when net_condition != CONGESTED, the BtlBw
* estimate is floored to a configurable percentage of the peak
* bandwidth observed while non-congested. This prevents BBR's
* bandwidth filter from decaying to zero on lossy paths.
* Floor is disabled when loss_ewma exceeds the loss cap (5%).
* - STARTUP loss capping: in STARTUP phase, if loss exceeds a soft
* threshold (0.5%) the pacing gain is reduced from 2.89x to 2.50x;
* if loss exceeds a hard threshold (2.0%) the gain is clamped to
* 1.0x (no probe). This prevents the exponential growth phase
* from overwhelming a path that cannot sustain it.
* - Early drain: if loss is detected after a PROBE_BW probe phase,
* a one-shot drain is queued to quickly shed the probe's queue.
* Three drain levels (light/standard/aggressive) correspond to
* increasing loss severity.
* - Probe skip: in PROBE_BW, the probe phase (index 0, gain 1.10x)
* is skipped when loss_ewma >= 2% or RTT rise >= 40%, preventing
* further queue buildup on an already-congested path.
*
* Layer 3 --- ACK Aggregation Compensation (UCP-specific)
* ------------------------------------------------------
* Approximates Google BBR's extra_acked mechanism using a single-slot
* epoch with exponential decay (vs. BBR's dual-slot sliding window).
* Tracks excess ACK counts per min_rtt-scale epoch; the max excess
* is added to target_cwnd. This prevents pacing stalls during
* delayed-ACK and ACK-compression bursts. Disabled by default
* (ucp_extra_acked_gain = 0).
*
*
* ---------------------------------------------------------------------------
* 3. KNOWN WEAKNESSES AND LIMITATIONS
* ---------------------------------------------------------------------------
*
* [L1] 104-byte struct bound (ICSK_CA_PRIV_SIZE)
* ------------------------------------------------
* This is the single most severe constraint in the entire design.
* struct ucp is limited to 104 bytes by the kernel's inet_csk_ca()
* allocation. It forces:
* - u16 -> u8 compression for both loss_ewma and ecn_ewma, with
* 1-bit overflow flags that must be manually reassembled on every
* read via (high_bit << 8) | low_byte. This adds two conditional-
* free getter instructions per EWMA access.
* - ACK aggregation compensation limited to a single epoch slot
* (6 bytes total) instead of BBR's dual sliding window (16+ bytes).
* - No per-RTT RTT trend filter (would require 4-8 bytes).
* - No ECN loss-differentiation history (would require 4+ bytes).
* - No room for explicit cwnd reduction history or flow-level pacing
* state beyond what the TCP stack already provides.
* Any addition of per-flow state requires expanding ICSK_CA_PRIV_SIZE
* in the kernel, which is a non-trivial interface change.
*
* [L2] ACK aggregation compensation is a single-slot approximation
* ----------------------------------------------------------------
* BBR's extra_acked uses a 2-element sliding window of epoch excess
* values and takes the max of the two. UCP uses a single excess max
* that is exponentially decayed (x 0.75) each epoch, then compared
 * with the current epoch's excess. This means that:
* - A single large excess epoch followed by a quiet epoch will decay
* the compensation by 25% even if the second epoch should have no
* influence. BBR's dual-slot window would retain the full excess
* for two full epochs.
* - The maximum compensatable excess is 255 packets (u8 saturation)
* vs. BBR's u32 counter (effectively unlimited).
* In practice, u8 saturation at 255 is rarely reached on typical
* (non-datacenter) paths, and the exponential decay converges to
* the correct steady-state value within 3-4 epochs. The impact is
* a 1-3% throughput penalty on heavily ACK-compressed paths relative
* to full BBR.
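*
* The decay-and-compare step above can be sketched as follows (an
* illustrative user-space approximation of the epoch-end update and the
* saturating u8 counter, not the module's exact code):
*

```c
#include <stdint.h>

// Per-epoch excess counter saturates at 255 packets (u8 limit).
static inline uint8_t sat_add_u8(uint8_t a, uint32_t b)
{
    uint32_t s = (uint32_t)a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

// End-of-epoch: extra_acked_max = max(extra_acked_max * 3 / 4, excess)
static inline uint8_t epoch_end_update(uint8_t extra_acked_max, uint8_t excess)
{
    uint8_t decayed = (uint8_t)((extra_acked_max * 3) / 4);  // x 0.75 decay
    return excess > decayed ? excess : decayed;
}
```
*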
*
* [L3] Condition classifier uses EWMA + fixed thresholds
* -------------------------------------------------------
* The four net_condition states are separated by fixed permyriad
* thresholds (loss 1%/5%/10%, RTT rise 20%). A fixed-threshold
* classifier cannot adapt to path-specific baseline loss or jitter
* without manual parameter tuning. An adaptive Bayesian or CUSUM
* classifier would improve accuracy on non-stationary paths but was
* excluded due to the 104-byte limit (would require 12-20 bytes for
* sufficient history).
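*
* A minimal sketch of the fixed-threshold decision described above, in
* permyriad units (the helper name and exact branch order are
* illustrative reconstructions from the design notes, not the module's
* actual code):
*

```c
#include <stdint.h>

enum cond { COND_LIGHT_LOAD = 1, COND_CONGESTED = 2, COND_RANDOM_LOSS = 3 };

static inline enum cond classify(int32_t rate_change, uint32_t loss,
                                 uint32_t ecn, uint32_t rtt_inc)
{
    int rate_dropping = rate_change <= -1500;       // rate fell >= 15%
    if (rate_dropping && (loss >= 500 || ecn > 0)) {
        if (rtt_inc >= 2000 || loss >= 1000)        // queue also building
            return COND_CONGESTED;
        return COND_RANDOM_LOSS;                    // loss without RTT rise
    }
    if (loss > 0)
        return COND_RANDOM_LOSS;                    // loss but rate holding
    return COND_LIGHT_LOAD;
}
```
*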
*
* [L4] cwnd ceilings are graduated but linear
* --------------------------------------------
* The congestion severity levels (MILD/MODERATE/SEVERE) map to fixed
* cwnd_gain caps (1.75x/1.50x/1.25x). These steps are arbitrary and
* may not match the actual convexity of the congestion signal. A
* continuously-variable cap derived from loss magnitude would be more
* theoretically sound, but adds complexity and tuning surface.
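*
* The severity-to-cap mapping above amounts to the following BBR_SCALE
* sketch (enum and helper names are illustrative, not the module's
* actual identifiers):
*

```c
#include <stdint.h>

enum severity { SEV_MILD, SEV_MODERATE, SEV_SEVERE };

// Fixed cwnd_gain ceilings in BBR_SCALE fixed point (1.0x = 256).
static inline uint32_t cwnd_gain_cap(enum severity s)
{
    switch (s) {
    case SEV_MILD:     return (256 * 7) / 4;   // 1.75x = 448
    case SEV_MODERATE: return (256 * 3) / 2;   // 1.50x = 384
    default:           return (256 * 5) / 4;   // 1.25x = 320
    }
}
```
*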
*
* [L5] Bandwidth floor is a percentage, not a measurement
* --------------------------------------------------------
* The floor is computed as a fixed percentage of the current non-
* congested peak bandwidth. On paths where the peak itself is
* inaccurate (e.g., highly variable radio links), the floor
* inherits that inaccuracy. A floor based on the 10th percentile
* of recent BW samples would be more robust, but requires history.
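*
* A sketch of the percentage floor and its loss-cap gating (illustrative
* helper; 500 permyriad is the default ucp_bw_floor_loss_cap of 5%):
*

```c
#include <stdint.h>

// Floor = permyriad fraction of the non-congested peak; disabled when
// the floor is configured to 0 or loss exceeds the activation cap.
static inline uint64_t bw_floor(uint64_t peak_bw, uint32_t floor_permyriad,
                                uint32_t loss_permyriad)
{
    if (floor_permyriad == 0 || loss_permyriad > 500)
        return 0;                                    // floor inactive
    return (peak_bw * floor_permyriad) / 10000;      // e.g. 2000 -> 20%
}
```
*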
*
* [L6] No multi-flow coordination
* --------------------------------
* UCP is a per-flow controller. It does not coordinate across
* multiple flows sharing the same bottleneck. The non-destructive
* constraints help prevent a single UCP flow from dominating a
* buffer, but UCP offers no mechanism for flow-aware fairness
* enforcement beyond what the TCP stack already provides.
*
* [L7] Pure BBR mode is best-effort
* ----------------------------------
* ucp_bbr_mode=1 bypasses UCP's classifiers and constraints, but
* the BBR state machine still runs inside the same struct ucp.
* ACK aggregation compensation, when enabled, applies regardless
* of mode. Users seeking bit-for-bit identical BBR behavior should
* use the kernel's built-in tcp_bbr module instead.
*
*
* ---------------------------------------------------------------------------
* 4. PARAMETER SYSTEM
* ---------------------------------------------------------------------------
*
* All user-facing parameters use a unified permyriad (per ten thousand)
* scale:
* gain 2.0x = 20000 permyriad
* loss 1.0% = 100 permyriad
* floor 25% = 2500 permyriad
*
* Internal arithmetic uses fixed-point scales:
* BBR_SCALE (8 bits, 1.0 = 256) for gain/ratio operations
* BW_SCALE (24 bits, 1 pkt/us = 2^24) for bandwidth
*
* Module parameters are exported to /sys/module/tcp_ucp/parameters/ for
* runtime tuning. All per-ACK hot-path values are pre-computed at module
* init time (permyriad -> BBR_SCALE, seconds -> jiffies) to avoid division
* in the datapath.
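*
* The permyriad -> BBR_SCALE precomputation amounts to the following
* (illustrative helper name; the one multiply-divide runs once at init,
* never per ACK):
*

```c
#include <stdint.h>

#define BBR_SCALE 8
#define BBR_UNIT  (1 << BBR_SCALE)   // 256 = 1.0x

static inline uint32_t permyriad_to_bbr(uint32_t permyriad)
{
    // 20000 permyriad (2.0x) -> 20000 * 256 / 10000 = 512
    return (uint32_t)(((uint64_t)permyriad * BBR_UNIT) / 10000);
}
```
*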
*
* RUNTIME RECONFIGURATION
*
* The module supports two dynamic configuration methods that do NOT require
* unloading and reloading the module:
*
* Method A --- sysfs (writes to /sys/module/tcp_ucp/parameters/)
* ------------------------------------------------------------
* All 33 parameters use module_param_cb() with a custom setter callback.
* Writing to /sys/module/tcp_ucp/parameters/<name> triggers an immediate
* call to ucp_init_module_params(), which recomputes all internal cached
* values (BBR_SCALE gains, jiffies intervals, etc.). The new value is
* active on the very next ACK processed by any UCP connection.
*
* # echo 10000 > /sys/module/tcp_ucp/parameters/ucp_extra_acked_gain
* # echo 1 > /sys/module/tcp_ucp/parameters/ucp_bbr_mode
*
* Method B --- sysctl (writes to /proc/sys/net/ucp/ via sysctl -w)
* ----------------------------------------------------------------
* The module registers a sysctl table at module init via register_sysctl().
* Each parameter appears as net.ucp.ucp_<name> and uses a custom proc_handler
* that delegates to proc_dointvec and then calls ucp_init_module_params().
* This enables both temporary (sysctl -w) and persistent (/etc/sysctl.conf,
* /etc/sysctl.d/ conf snippet) configuration:
*
* # sysctl -w net.ucp.ucp_extra_acked_gain=10000
* # sysctl -w net.ucp.ucp_bbr_mode=1
*
* # echo "net.ucp.ucp_extra_acked_gain = 10000" >> /etc/sysctl.d/ucp.conf
* # echo "net.ucp.ucp_bbr_mode = 1" >> /etc/sysctl.d/ucp.conf
* # sysctl -p /etc/sysctl.d/ucp.conf
*
* The two methods are equivalent --- both modify the same underlying int
* variables and both call ucp_init_module_params() after a successful write.
*
*
* ---------------------------------------------------------------------------
* 5. FUTURE DIRECTIONS (v1.1+)
* ---------------------------------------------------------------------------
*
* - Increase ICSK_CA_PRIV_SIZE to 128 in the target kernel.
* - Dual-slot extra_acked window for exact BBR compatibility.
* - Per-RTT RTT trend minmax filter for improved congestion detection.
* - ECN loss-differentiation history for accurate non-congestive loss
* classification on L4s-enabled paths.
* - Continuously-variable cwnd cap derived from excess queue
* measurement (BBR's "inflight cap" model).
*
*
* Copyright (c) 2026 PPP PRIVATE NETWORK™ X
* SPDX-License-Identifier: Dual BSD/GPL
*/
#include <linux/module.h> /* Linux kernel module interface: module_param, MODULE_PARM_DESC, module_init, module_exit */
#include <net/tcp.h> /* Core TCP types: tcp_sock, tcp_congestion_ops, rate_sample, tcp_register_congestion_control */
#include <linux/inet_diag.h> /* Socket diagnostics interface: INET_DIAG_BBRINFO enum, union tcp_cc_info for ss -i output */
#include <linux/win_minmax.h> /* Sliding window min/max filter: struct minmax, minmax_running_max function */
#include <linux/math64.h> /* 64-bit math helpers: div64_u64, div64_long for safe integer division */
#include <linux/random.h> /* Kernel PRNG: prandom_u32_max for randomized PROBE_BW phase start offset */
#include <linux/sysctl.h> /* Sysctl interface: proc_dointvec, register_sysctl for sysctl -w support */
/* ---- Fixed-Point Scales (hardware-friendly constants, do NOT change) ---- */
#define BW_SCALE 24 /* Number of fractional bits in bandwidth fixed-point representation (2^24 = 16,777,216 units per 1.0); matches BBR BW_SCALE */
#define BW_UNIT (1 << BW_SCALE) /* Unity (1.0) in bandwidth fixed-point domain: value 2^24 = 16,777,216; represents 1 packet per microsecond at this scale */
#define BBR_SCALE 8 /* Number of fractional bits in gain/ratio fixed-point representation (2^8 = 256 units per 1.0); matches BBR BBR_SCALE */
#define BBR_UNIT (1 << BBR_SCALE)/* Unity (1.0x) in BBR gain domain: value 256; represents a multiplier of exactly 1.0; all gain values use this unit */
/* ---- UCP Finite State Machine Modes (mirror BBRv1) ---------------------- */
/**
* enum ucp_mode - UCP congestion control operational modes
* @UCP_STARTUP: Exponential bandwidth search phase; pacing_gain ~= 2.89x to rapidly probe available bandwidth
* @UCP_DRAIN: One-shot queue drain phase after STARTUP completes; pacing_gain ~= 0.346x to drain excess inflight
* @UCP_PROBE_BW: Steady-state bandwidth probing; uses an 8-phase gain cycle to periodically test for more bandwidth
* @UCP_PROBE_RTT: Min-RTT refresh phase; cwnd clamped to 4 packets for 200 ms to obtain an uncontaminated RTT sample
*/
enum ucp_mode {
UCP_STARTUP = 0, /* Value 0: Exponential bandwidth search, pacing_gain ~= 2.89x, same as BBR STARTUP; sends at ~2.89x estimated BDP to fill the pipe */
UCP_DRAIN = 1, /* Value 1: One-shot queue drain after pipe is full, pacing_gain ~= 0.346x (BBR_UNIT * 1000 / 2885); drains excess queue built in STARTUP */
UCP_PROBE_BW = 2, /* Value 2: Steady state, 8-phase gain cycle cycling between probe (1.10x), drain (0.75x), and cruise (1.0x) phases */
UCP_PROBE_RTT = 3 /* Value 3: Refresh min_rtt every ~10 seconds, cwnd clamped to 4 pkts for 200 ms duration to drain queues for a clean min RTT sample */
};
/* ---- Network Condition Classification ---------------------------------- */
/**
* enum ucp_net_cond - Classification of current network conditions
* @UCP_COND_IDLE: No delivery-rate history yet available (connection just started or was idle)
* @UCP_COND_LIGHT_LOAD: Low loss below low_loss_threshold and stable RTT; indicates uncongested path
* @UCP_COND_CONGESTED: Delivery rate is dropping AND either high loss or ECN marking is present; indicates true congestion
* @UCP_COND_RANDOM_LOSS: Loss is present but RTT rise <= 20%; indicates non-congestive packet loss (e.g., wireless corruption)
*/
enum ucp_net_cond {
UCP_COND_IDLE = 0, /* Value 0: No delivery-rate history yet available; initial state before any ACK processing; no condition classification possible */
UCP_COND_LIGHT_LOAD = 1, /* Value 1: loss < low_loss_thresh, stable RTT; network is underutilized or lightly loaded; no congestion signals present */
UCP_COND_CONGESTED = 2, /* Value 2: rate dropping AND (high loss OR ECN present); standard congestion detection with multiple corroborating signals */
UCP_COND_RANDOM_LOSS = 3 /* Value 3: loss present, RTT rise <= 20%, not congestion-related; e.g., wireless packet corruption or transient bit errors */
};
/* ---- Network Path Class (impacts ProbeRTT interval and BtlBw floor) ---- */
/**
* enum ucp_net_class - Classification of the network path type
* @UCP_CLASS_DEFAULT: Unclassified or mixed characteristics; use default parameters
* @UCP_CLASS_LAN: Low-latency local network: RTT < 5 ms, jitter < 3 ms, loss < 0.1%
* @UCP_CLASS_MOBILE: Cellular/mobile path: loss > 3%, jitter > 20 ms; requires conservative probing
* @UCP_CLASS_LOSSY_FAT: High-latency lossy path: RTT > 80 ms, background loss > 1%; typical of satellite links
* @UCP_CLASS_CONGESTED: Persistently congested: RTT rise >= 50% AND loss >= 1%
* @UCP_CLASS_VPN: VPN tunnel: RTT > 60 ms with stable elevated latency; may have encapsulation overhead
*/
enum ucp_net_class {
UCP_CLASS_DEFAULT = 0, /* Value 0: Default/unclassified path; uses standard parameter set; no special handling */
UCP_CLASS_LAN = 1, /* Value 1: Local area network: RTT < 5 ms, jitter < 3 ms, loss < 0.1%; bandwidth floor disabled */
UCP_CLASS_MOBILE = 2, /* Value 2: Mobile/cellular: loss > 3%, jitter > 20 ms; reduce probe aggression, increase probe interval */
UCP_CLASS_LOSSY_FAT = 3, /* Value 3: Lossy fat pipe: RTT > 80 ms, background loss > 1%; satellite-like links with high BDP */
UCP_CLASS_CONGESTED = 4, /* Value 4: Congested: RTT rise >= 50% AND loss >= 1%; persistent queue buildup, conservative handling */
UCP_CLASS_VPN = 5 /* Value 5: VPN tunnel: RTT > 60 ms, stable elevated latency; tunnel encapsulation overhead expected */
};
/* ---- Per-Connection State (must fit within ICSK_CA_PRIV_SIZE = 104) ---- */
/**
* struct ucp - Per-connection UCP congestion control state
* @min_rtt_us: Minimum round-trip time observed in the current measurement window (microseconds)
* @min_rtt_stamp: Jiffies timestamp (kernel timer ticks) when min_rtt_us was last updated; used for filter expiry
* @probe_rtt_done_stamp: Absolute jiffies value when the PROBE_RTT phase is scheduled to end; 0 when not in PROBE_RTT
* @bw: Running-maximum filter for bottleneck bandwidth (BtlBw), stored in BW_SCALE, spanning 10 packet-timed rounds
* @rtt_cnt: Number of packet-timed rounds completed since connection start; monotonically increasing counter
* @next_rtt_delivered: Value of tp->delivered at the start of the current packet-timed round; used to detect round boundaries
* @cycle_mstamp_lo: Lower 32 bits of the 64-bit timestamp (in microseconds) when the current PROBE_BW gain phase started
* @cycle_mstamp_hi: Upper 32 bits of the 64-bit PROBE_BW phase start timestamp
* @mode: 2-bit field: current ucp_mode value (0..3); controls gain selection and state machine transitions
* @prev_ca_state: 3-bit field: previous TCP congestion algorithm state before the most recent state transition
* @round_start: 1-bit flag: set to true on the first ACK of a new packet-timed round; cleared after bandwidth update
* @idle_restart: 1-bit flag: set to true after an application-limited idle period; suppresses overly aggressive behavior
* @probe_rtt_round_done: 1-bit flag: set to true after at least one full packet-timed round completes during PROBE_RTT
* @fast_recovery: 1-bit flag: set to true while in non-congestion fast recovery (loss recovery without congestion signal)
* @net_condition: 2-bit field: current ucp_net_condition classification (0..3); guides cap selection
* @net_class: 3-bit field: current ucp_net_class classification (0..5); affects probe intervals and bandwidth floors
* @rate_hist_idx: 1-bit field: index (0 or 1) into the 2-slot deliv_rate_hist[] circular buffer
* @rtt_hist_idx: 1-bit field: index (0 or 1) into the 2-slot rtt_history[] circular buffer
* @drain_pending: 2-bit field: queued drain level (0 = none, 1..3 = light/standard/aggressive); consumed by ucp_apply_pacing_constraints
* @cond_confirm: 3-bit field: hysteresis counter for confirming network condition transitions (0..7)
* @class_confirm: 3-bit field: hysteresis counter for confirming network class transitions (0..7)
* @min_rtt_fast_fall_cnt: 2-bit field: consecutive RTT samples below 75% of current min_rtt (0..3); triggers fast min_rtt update
* @cycle_idx: 3-bit field: current PROBE_BW gain phase index (0..7); selects pacing gain from the 8-phase cycle
* @probe_gain_applied: 1-bit flag: set if the most recent PROBE_BW phase 0 applied a gain greater than 1.0x
* @padding1: 2-bit explicit padding to complete the first 32-bit bitfield word (total field bits = 2+3+1+1+1+1+2+3+1+1+2+3+3+2+3+1+2 = 32); reserved, do not use
* @pacing_gain: 12-bit field: current pacing gain in BBR_SCALE; controls inter-packet transmission spacing
* @cwnd_gain: 10-bit field: current congestion window gain in BBR_SCALE; controls window growth relative to BDP
* @full_bw_reached: 1-bit flag: set to true once STARTUP bandwidth growth has stalled (pipe considered full)
* @full_bw_cnt: 2-bit field: consecutive packet-timed rounds without 1.25x bandwidth growth (0..3); triggers full_bw_reached
* @has_seen_rtt: 1-bit flag: set to true on the first valid SRTT sample received; used to gate RTT-based initialization
* @has_delayed_ack: 1-bit flag: copy of rs->is_ack_delayed for the current ACK; used to filter delayed-ACK RTT samples
* @probe_rtt_restored: 1-bit flag: set to true right after cwnd is restored from PROBE_RTT clamping; cleared after restoration applied
* @loss_ewma_high: 1-bit overflow (bit 8) of loss EWMA; combined with loss_ewma as (high << 8) | low to represent 0..BBR_UNIT (256 = 100% loss); compressed 9-bit encoding to save space
* @ecn_ewma_high: 1-bit overflow (bit 8) of ECN EWMA; combined with ecn_ewma as (high << 8) | low to represent 0..BBR_UNIT (256 = 100% ECN marking); compressed 9-bit encoding to save space
* @padding2: 2-bit explicit padding to complete the second 32-bit bitfield word (was 4 bits before loss_ewma_high + ecn_ewma_high); reserved, do not use
* @prior_cwnd: cwnd value saved before entering recovery or PROBE_RTT; used to restore cwnd upon exit (packets)
* @full_bw: Bandwidth snapshot (BW_SCALE) taken at the last confirmed bandwidth growth event in STARTUP
* @rtt_history: Two-element circular buffer of filtered RTT samples (microseconds each); used for P10 RTT estimation
* @deliv_rate_hist: Two-element circular buffer of most recent delivery rate samples (BW_SCALE); used for trend calculation
* @loss_ewma: Exponentially weighted moving average of packet loss ratio, compressed to u8; value = (loss_ewma_high << 8) | loss_ewma (0..BBR_UNIT)
* @ecn_ewma: Exponentially weighted moving average of ECN marking ratio, compressed to u8; value = (ecn_ewma_high << 8) | ecn_ewma (0..BBR_UNIT)
* @max_bw_non_congested: Peak bandwidth (BW_SCALE) observed while in non-CONGESTED network condition; used for bandwidth floor computation
* @rate_change_ewma: Smoothed delivery-rate trend value (BBR_SCALE, signed); positive means rate increasing, negative means dropping
* @last_delivered_ce: Value of tp->delivered_ce at the previous ACK; used to compute ECN marking delta for the current ACK
* @ack_epoch_start_us: Microsecond timestamp (low 32 bits of delivered_mstamp) marking the start of the ACK compensation epoch; 0 = epoch not started
* @extra_acked: Cumulative excess acked packet count in the current compensation epoch (u8, saturates at 255)
* @extra_acked_max: Maximum excess acked over recent epochs; decayed by 3/4 each epoch; used as cwnd bonus in ucp_set_cwnd()
*
* NOTE: ACK aggregation compensation --- BBR vs UCP comparison
*
* Both BBR and UCP track excess ACK counts per RTT-scale epoch and add a
* small cwnd bonus to prevent pacing stalls during delayed-ACK / ACK-
* compression bursts. Key differences due to UCP's 104-byte struct limit:
*
* Feature BBR (16 bytes) UCP (6 bytes)
* ------------------- -------------------- ----------------------------
* Epoch timestamp u64 (8 bytes) u32 (4 bytes, low 32 bits)
* Excess window u16 extra_acked[2] u8 extra_acked + u8 max
* (dual sliding slot) (single slot, decay x 0.75)
* Control bitfield ~4 bytes none (reuses existing fields)
* Gain fixed BBR_UNIT ucp_extra_acked_gain
* (default 0 = disabled)
*
* BBR's dual-slot retains full excess for two epochs; UCP decays at 0.75x
* per epoch, converging to the same steady state within 3-4 epochs.
* A 1ms epoch floor prevents degenerate per-ACK resets when min_rtt_us is
* unrealistically small. Throughput impact: <3% relative to full BBR on
* typical internet paths. u8 saturation at 255 pkts is rare outside DC.
*/
struct ucp {
/* core measurement and tracking state */
u32 min_rtt_us; /* Minimum RTT observed in the current measurement window; unit: microseconds; updated when a lower RTT is seen or filter expires */
u32 min_rtt_stamp; /* Jiffies timestamp (kernel timer tick) when min_rtt_us was last updated; used to implement the min_rtt filter expiry logic */
u32 probe_rtt_done_stamp; /* Absolute jiffies value when PROBE_RTT phase is scheduled to end; set to 0 when not actively in PROBE_RTT state */
struct minmax bw; /* Running-maximum filter structure for bottleneck bandwidth (BtlBw); window width: 10 packet-timed rounds; values in BW_SCALE */
u32 rtt_cnt; /* Monotonically increasing counter of packet-timed rounds completed since connection start; incremented each round boundary */
u32 next_rtt_delivered; /* Snapshot of tp->delivered taken at the start of the current packet-timed round; used to detect when the round ends */
u32 cycle_mstamp_lo; /* Lower 32 bits of the 64-bit microsecond timestamp marking the start of the current PROBE_BW gain phase */
u32 cycle_mstamp_hi; /* Upper 32 bits of the 64-bit microsecond timestamp marking the start of the current PROBE_BW gain phase */
/*
* Bitfield word 1: mode, flags, small counters, PROBE_BW index
* Total 32 bits
*/
u32 mode : 2; /* Bitfield: current ucp_mode value (0..3); controls which gain/behaviour rules apply; updated on state transitions */
u32 prev_ca_state : 3; /* Bitfield: previous TCP CA state (TCP_CA_Open, TCP_CA_Recovery, etc.); used to detect state transitions for cwnd management */
u32 round_start : 1; /* Bitfield: set to true on the first ACK that completes a new packet-timed round; cleared to 0 in ucp_update_bw() after round processing */
u32 idle_restart : 1; /* Bitfield: set to true when transmission resumes after an application-limited idle period; cleared on first data delivery */
u32 probe_rtt_round_done : 1;/* Bitfield: set to true after at least one full packet-timed round completes during PROBE_RTT; gates exit from PROBE_RTT */
u32 fast_recovery : 1; /* Bitfield: set to true while in non-congestion fast recovery (TCP_CA_Recovery); cleared at round start after recovery ends */
u32 net_condition : 2; /* Bitfield: current ucp_net_condition classification (0..3); guides cwnd ceiling and bandwidth floor selection */
u32 net_class : 3; /* Bitfield: current ucp_net_class classification (0..5); affects PROBE_RTT interval and bandwidth floor multiplier */
u32 rate_hist_idx : 1; /* Bitfield: index (0 or 1) into the 2-element deliv_rate_hist[] circular buffer; toggled after each sample is stored */
u32 rtt_hist_idx : 1; /* Bitfield: index (0 or 1) into the 2-element rtt_history[] circular buffer; toggled after each RTT sample is stored */
u32 drain_pending : 2; /* Bitfield: non-zero value (1..3) indicates a queued drain level; 0 means no drain pending; consumed by ucp_apply_pacing_constraints() */
u32 cond_confirm : 3; /* Bitfield: hysteresis counter (0..7) for confirming net_condition transitions; increments on disagreement, resets on agreement */
u32 class_confirm : 3; /* Bitfield: hysteresis counter (0..7) for confirming net_class transitions; increments on disagreement, saturates at UCP_CLASS_CONFIRM_MAX */
u32 min_rtt_fast_fall_cnt : 2; /* Bitfield: count (0..3) of consecutive RTT samples below 75% of current min_rtt; triggers fast min_rtt downward revision */
u32 cycle_idx : 3; /* Bitfield: current PROBE_BW gain cycle index (0..7); determines which gain value from the 8-phase cycle is used */
u32 probe_gain_applied : 1; /* Bitfield: set to true if the last PROBE_BW phase index 0 applied a probe gain > 1.0x; determines whether drain phase is needed */
u32 padding1 : 2; /* Bitfield: explicit padding to align the first bitfield word to exactly 32 bits (total field bits = 2+3+1+1+1+1+2+3+1+1+2+3+3+2+3+1+2 = 32); unused, reserved for future use */
/*
* Bitfield word 2: gains, binary flags, helper flags
* Total 32 bits
*/
u32 pacing_gain : 12; /* Bitfield: current pacing gain in BBR_SCALE (range 0..4095, representing 0.0x to ~16.0x); controls transmission rate via sk_pacing_rate */
u32 cwnd_gain : 10; /* Bitfield: current cwnd gain in BBR_SCALE (range 0..1023, representing 0.0x to ~4.0x); controls window size relative to BDP */
u32 full_bw_reached : 1; /* Bitfield: set to true once STARTUP bandwidth growth has stalled (pipe considered full); gates transition from STARTUP to DRAIN */
u32 full_bw_cnt : 2; /* Bitfield: consecutive rounds (0..3) without 1.25x bandwidth growth; when >= UCP_FULL_BW_CNT (3), full_bw_reached is set */
u32 has_seen_rtt : 1; /* Bitfield: set to true on the first valid SRTT sample; used to avoid RTT-based calculations before the first measurement is available */
u32 has_delayed_ack : 1; /* Bitfield: copy of the rs->is_ack_delayed flag for the current ACK; used to reject RTT samples that include ACK delay */
u32 probe_rtt_restored : 1; /* Bitfield: set to true right after cwnd is restored from PROBE_RTT clamping; ensures restoration only happens once */
u32 loss_ewma_high : 1; /* Bitfield: bit 8 (overflow) of the compressed loss EWMA; value = (loss_ewma_high << 8) | loss_ewma (0..BBR_UNIT) */
u32 ecn_ewma_high : 1; /* Bitfield: bit 8 (overflow) of the compressed ECN EWMA; value = (ecn_ewma_high << 8) | ecn_ewma (0..BBR_UNIT) */
u32 padding2 : 2; /* Bitfield: explicit padding to align the second bitfield word to exactly 32 bits; unused, reserved for future use */
/* standalone u32 fields */
u32 prior_cwnd; /* Congestion window (in packets) saved before entering TCP recovery or PROBE_RTT; used to restore cwnd upon exit */
u32 full_bw; /* Bandwidth snapshot (BW_SCALE) recorded at the last confirmed bandwidth growth event in STARTUP; used for 1.25x growth comparison */
u32 rtt_history[2]; /* Two-element circular buffer holding the most recent filtered RTT samples (microseconds each); index toggled by rtt_hist_idx */
u32 deliv_rate_hist[2]; /* Two-element circular buffer holding the most recent delivery rate samples (BW_SCALE each); index toggled by rate_hist_idx */
u8 loss_ewma; /* EWMA of packet loss ratio (BBR_SCALE, 0-255), lower 8 bits of compressed u16; value = (loss_ewma_high << 8) | loss_ewma */
u8 ecn_ewma; /* EWMA of ECN marking ratio (BBR_SCALE, 0-255), lower 8 bits of compressed u16; value = (ecn_ewma_high << 8) | ecn_ewma */
u8 extra_acked; /* Cumulative excess acked packet count in the current compensation epoch; saturates at 255 */
u8 extra_acked_max; /* Maximum excess acked over recent epochs; decayed by 3/4 each epoch; used as cwnd bonus in ucp_set_cwnd() */
u32 ack_epoch_start_us; /* Microsecond timestamp (low 32 bits of tp->delivered_mstamp) marking the start of the current epoch window; 0 = epoch not started */
u32 max_bw_non_congested; /* Peak bandwidth (BW_SCALE) observed while net_condition != UCP_COND_CONGESTED; reset when congestion is entered; used for bandwidth floor */
s32 rate_change_ewma; /* Smoothed delivery-rate trend (BBR_SCALE, signed); positive indicates rate increasing, negative indicates rate dropping; used for congestion classification */
u32 last_delivered_ce; /* Value of tp->delivered_ce (number of packets with CE mark) at the previous ACK; used to compute ECN marking delta for the current ACK */
};
/* -------------------------------------------------------------------------
* Module Parameters - all tunable, permyriad (per ten thousand) unless otherwise noted
* Each parameter is exported to /sys/module/tcp_ucp/parameters/ for runtime tuning
* ------------------------------------------------------------------------- */
/* ---- Module parameter sysctl-style dynamic update callback -------------- */
static void ucp_init_module_params(void); /* Forward declaration for the setter callback below; ucp_init_module_params is defined after all module_param declarations */
/**
* @brief Custom setter for all module parameters --- recomputes cached values on every write.
* This enables "sysctl -w" style runtime tuning: writing to /sys/module/tcp_ucp/parameters/
* immediately propagates to the internal BBR_SCALE / jiffies cached variables used in the hot path.
*/
static int ucp_param_set_int(const char *val, const struct kernel_param *kp)
{
int ret = param_set_int(val, kp);
if (ret == 0)
ucp_init_module_params();
return ret;
}
static const struct kernel_param_ops ucp_param_ops = {
.set = ucp_param_set_int,
.get = param_get_int,
};
/* ---- Operation mode selector -------------------------------------------- */
static int ucp_bbr_mode = 0; /* Operation mode: 0 = full UCP (non-destructive constraints, classifiers, bandwidth floor), 1 = pure BBR compatible (bypasses all UCP-specific logic); module_param for runtime switching */
module_param_cb(ucp_bbr_mode, &ucp_param_ops, &ucp_bbr_mode, 0644); /* Export as sysfs parameter with read-write permissions; echo 0 or 1 to switch mode at runtime */
MODULE_PARM_DESC(ucp_bbr_mode,
"Operation mode: 0 = full UCP (default, all constraints active), 1 = pure BBR compatible (bypass UCP-specific logic)");
/* ---- Bandwidth soft floor (permyriad of non-congested peak) ------------- */
static int ucp_bw_floor_default = 2000; /* Default path bandwidth floor: 2000 permyriad = 20.00% of non-congested peak bandwidth; balanced for general internet paths; unit: permyriad (1/10000) */
module_param_cb(ucp_bw_floor_default, &ucp_param_ops, &ucp_bw_floor_default, 0644); /* Export as sysfs parameter with read-write permissions (owner/group/other = rw-r--r--) */
MODULE_PARM_DESC(ucp_bw_floor_default,
"BtlBw floor for DEFAULT paths (permyriad of non-congested peak, 0 = disabled)"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_bw_floor_mobile = 2500; /* Mobile/lossy path bandwidth floor: 2500 permyriad = 25.00% of non-congested peak bandwidth; slightly higher than default to absorb radio jitter; unit: permyriad (1/10000) */
module_param_cb(ucp_bw_floor_mobile, &ucp_param_ops, &ucp_bw_floor_mobile, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_bw_floor_mobile,
"BtlBw floor for MOBILE / LOSSY_FAT paths (permyriad, 0 = disabled)"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_bw_floor_lan = 0; /* LAN path bandwidth floor: 0 permyriad = disabled by default; LAN links are stable enough that a floor is unnecessary; unit: permyriad (1/10000) */
module_param_cb(ucp_bw_floor_lan, &ucp_param_ops, &ucp_bw_floor_lan, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_bw_floor_lan,
"BtlBw floor for LAN paths (permyriad, 0 = disabled)"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_bw_floor_vpn = 0; /* VPN path bandwidth floor: 0 permyriad = disabled by default; VPN encapsulation adds predictable overhead not needing a floor; unit: permyriad (1/10000) */
module_param_cb(ucp_bw_floor_vpn, &ucp_param_ops, &ucp_bw_floor_vpn, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_bw_floor_vpn,
"BtlBw floor for VPN paths (permyriad, 0 = disabled)"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_bw_floor_lossy_fat = 2500; /* LOSSY_FAT path bandwidth floor: 2500 permyriad = 25.00%; matches MOBILE default since lossy fat pipes share similar variability; unit: permyriad (1/10000) */
module_param_cb(ucp_bw_floor_lossy_fat, &ucp_param_ops, &ucp_bw_floor_lossy_fat, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_bw_floor_lossy_fat,
"BtlBw floor for LOSSY_FAT paths (permyriad, default = mobile value)"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_bw_floor_congested = 2000; /* CONGESTED path bandwidth floor: 2000 permyriad = 20.00%; matches DEFAULT default; floor is only active when net_condition is NOT congested, so this applies only briefly after congestion clears; unit: permyriad (1/10000) */
module_param_cb(ucp_bw_floor_congested, &ucp_param_ops, &ucp_bw_floor_congested, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_bw_floor_congested,
"BtlBw floor for CONGESTED-class paths (permyriad, default = default value; note: floor only active when net_condition != CONGESTED)"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_bw_floor_loss_cap = 500; /* Maximum loss ratio for bandwidth floor activation: 500 permyriad = 5.0% loss; bandwidth floor disabled when loss exceeds this value; unit: permyriad */
module_param_cb(ucp_bw_floor_loss_cap, &ucp_param_ops, &ucp_bw_floor_loss_cap, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_bw_floor_loss_cap,
"Max loss (permyriad) for which the bandwidth floor is active"); /* Human-readable description displayed by modinfo and in sysfs */
/* ---- ProbeRTT interval (seconds) ---------------------------------------- */
static int ucp_probe_rtt_base_sec = 10; /* Base interval between PROBE_RTT entries: 10 seconds (matches BBR default); unit: seconds */
module_param_cb(ucp_probe_rtt_base_sec, &ucp_param_ops, &ucp_probe_rtt_base_sec, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_rtt_base_sec,
"Base seconds between PROBE_RTT entries"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_probe_rtt_class_extra_sec = 5; /* Additional seconds added for MOBILE or LOSSY_FAT path classes: +5 seconds; unit: seconds */
module_param_cb(ucp_probe_rtt_class_extra_sec, &ucp_param_ops, &ucp_probe_rtt_class_extra_sec, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_rtt_class_extra_sec,
"Extra seconds for MOBILE / LOSSY_FAT paths"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_probe_rtt_high_loss_extra_sec = 0; /* Additional seconds when loss >= high_loss_threshold: 0 extra seconds by default; unit: seconds */
module_param_cb(ucp_probe_rtt_high_loss_extra_sec, &ucp_param_ops, &ucp_probe_rtt_high_loss_extra_sec, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_rtt_high_loss_extra_sec,
"Extra seconds when loss >= high_loss_thresh"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_probe_rtt_mid_loss_extra_sec = 0; /* Additional seconds when loss >= probe_skip_thresh but below high_loss_thresh: 0 extra seconds by default; unit: seconds */
module_param_cb(ucp_probe_rtt_mid_loss_extra_sec, &ucp_param_ops, &ucp_probe_rtt_mid_loss_extra_sec, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_rtt_mid_loss_extra_sec,
"Extra seconds when loss >= probe_skip_thresh but < high_loss_thresh"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_probe_rtt_max_sec = 15; /* Hard upper cap on the PROBE_RTT interval: 15 seconds maximum regardless of other adjustments; unit: seconds */
module_param_cb(ucp_probe_rtt_max_sec, &ucp_param_ops, &ucp_probe_rtt_max_sec, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_rtt_max_sec,
"Absolute maximum PROBE_RTT interval (seconds)"); /* Human-readable description displayed by modinfo and in sysfs */
/* ---- Congestion window gains (permyriad of BDP) ------------------------- */
static int ucp_cwnd_gain = 20000; /* Steady-state cwnd gain: 20000 permyriad = 2.0000x of BDP; allows some queue buildup for throughput; unit: permyriad */
module_param_cb(ucp_cwnd_gain, &ucp_param_ops, &ucp_cwnd_gain, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_cwnd_gain,
"Steady-state cwnd gain (permyriad of BDP, 20000 = 2.0x)"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_cwnd_cap_mild = 17500; /* cwnd cap for MILD congestion: 17500 permyriad = 1.7500x of BDP; slightly reduces queue during mild congestion; unit: permyriad */
module_param_cb(ucp_cwnd_cap_mild, &ucp_param_ops, &ucp_cwnd_cap_mild, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_cwnd_cap_mild,
"cwnd cap for MILD congestion (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_cwnd_cap_moderate = 15000; /* cwnd cap for MODERATE congestion: 15000 permyriad = 1.5000x of BDP; tighter cap for moderate congestion; unit: permyriad */
module_param_cb(ucp_cwnd_cap_moderate, &ucp_param_ops, &ucp_cwnd_cap_moderate, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_cwnd_cap_moderate,
"cwnd cap for MODERATE congestion (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_cwnd_cap_severe = 12500; /* cwnd cap for SEVERE congestion: 12500 permyriad = 1.2500x of BDP; most restrictive cap for severe congestion; unit: permyriad */
module_param_cb(ucp_cwnd_cap_severe, &ucp_param_ops, &ucp_cwnd_cap_severe, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_cwnd_cap_severe,
"cwnd cap for SEVERE congestion (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */
/* ---- ACK aggregation compensation gain (permyriad) ----------------------- */
static int ucp_extra_acked_gain = 10000; /* ACK aggregation compensation gain: 10000 permyriad = 1.0x (BBR standard, default); added to cwnd as extra_acked_max * gain; unit: permyriad */
module_param_cb(ucp_extra_acked_gain, &ucp_param_ops, &ucp_extra_acked_gain, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_extra_acked_gain,
"ACK aggregation compensation gain (permyriad, 0=disabled, 10000=1.0x); default 10000 matches BBR standard behavior"); /* Human-readable description displayed by modinfo and in sysfs */
/* ---- PROBE_BW probe gain (permyriad) ------------------------------------ */
static int ucp_probe_gain = 11000; /* PROBE_BW probe phase pacing gain: 11000 permyriad = 1.1000x; conservative probe that adds 10% more than BDP to test for extra bandwidth */
module_param_cb(ucp_probe_gain, &ucp_param_ops, &ucp_probe_gain, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_gain,
"Pacing gain for PROBE_BW probe phase (permyriad, 11000 = 1.10x)"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_probe_gain_mobile = 10000; /* Mobile path probe gain: 10000 permyriad = 1.0000x; no rate increase on mobile when loss >= drain_thresh; unit: permyriad */
module_param_cb(ucp_probe_gain_mobile, &ucp_param_ops, &ucp_probe_gain_mobile, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_gain_mobile,
"Probe gain on MOBILE path when loss >= drain_thresh (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */
/* ---- Early drain triggers and gains (permyriad) ------------------------- */
static int ucp_drain_loss_thresh = 100; /* Drain trigger loss threshold: 100 permyriad = 1.0% loss; early drain engaged when loss >= this value after a probe phase; unit: permyriad */
module_param_cb(ucp_drain_loss_thresh, &ucp_param_ops, &ucp_drain_loss_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_drain_loss_thresh,
"Loss threshold (permyriad) to trigger early drain after probe"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_drain_gain_light = 8500; /* Light drain pacing gain: 8500 permyriad = 0.8500x; gentle queue drain for level 1 (low loss after probe); unit: permyriad */
module_param_cb(ucp_drain_gain_light, &ucp_param_ops, &ucp_drain_gain_light, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_drain_gain_light,
"Pacing gain for light drain (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_drain_gain_standard = 7500; /* Standard drain pacing gain: 7500 permyriad = 0.7500x; matches BBR standard drain gain of 0.75x; unit: permyriad */
module_param_cb(ucp_drain_gain_standard, &ucp_param_ops, &ucp_drain_gain_standard, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_drain_gain_standard,
"Pacing gain for standard drain (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_drain_gain_aggressive = 6500; /* Aggressive drain pacing gain: 6500 permyriad = 0.6500x; rapid queue drain for level 3 (high loss after probe); unit: permyriad */
module_param_cb(ucp_drain_gain_aggressive, &ucp_param_ops, &ucp_drain_gain_aggressive, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_drain_gain_aggressive,
"Pacing gain for aggressive drain (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */
/* drain level thresholds (permyriad loss) */
static int ucp_drain_loss_lvl2_thresh = 500; /* Level 2 (standard) drain loss threshold: 500 permyriad = 5.0% loss; triggers drain level 2; unit: permyriad */
module_param_cb(ucp_drain_loss_lvl2_thresh, &ucp_param_ops, &ucp_drain_loss_lvl2_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_drain_loss_lvl2_thresh,
"Loss threshold (permyriad) for standard drain (level 2)"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_drain_loss_lvl3_thresh = 1000; /* Level 3 (aggressive) drain loss threshold: 1000 permyriad = 10.0% loss; triggers drain level 3; unit: permyriad */
module_param_cb(ucp_drain_loss_lvl3_thresh, &ucp_param_ops, &ucp_drain_loss_lvl3_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_drain_loss_lvl3_thresh,
"Loss threshold (permyriad) for aggressive drain (level 3)"); /* Human-readable description displayed by modinfo and in sysfs */
/* ---- General loss thresholds (permyriad) -------------------------------- */
static int ucp_low_loss_thresh = 100; /* Low loss threshold: 100 permyriad = 1.0% loss; boundary between LIGHT_LOAD and RANDOM_LOSS conditions; unit: permyriad */
module_param_cb(ucp_low_loss_thresh, &ucp_param_ops, &ucp_low_loss_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_low_loss_thresh,
"Low loss threshold (permyriad) - LIGHT_LOAD boundary"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_high_loss_thresh = 500; /* High loss threshold: 500 permyriad = 5.0% loss; used for class classification, drain level determination, and congestion severity; unit: permyriad */
module_param_cb(ucp_high_loss_thresh, &ucp_param_ops, &ucp_high_loss_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_high_loss_thresh,
"High loss threshold (permyriad) - used for class and drain level"); /* Human-readable description displayed by modinfo and in sysfs */
/* ---- Probe safety-skip thresholds (permyriad) --------------------------- */
static int ucp_probe_skip_loss_thresh = 200; /* Probe skip loss threshold: 200 permyriad = 2.0% loss; PROBE_BW probe phase is skipped above this value to avoid worsening loss; unit: permyriad */
module_param_cb(ucp_probe_skip_loss_thresh, &ucp_param_ops, &ucp_probe_skip_loss_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_skip_loss_thresh,
"Loss threshold (permyriad) above which PROBE_BW probe is skipped"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_probe_skip_rtt_rise = 4000; /* Probe skip RTT rise threshold: 4000 permyriad = 40.00% RTT increase; probe phase skipped above this to avoid aggravating queue buildup; unit: permyriad */
module_param_cb(ucp_probe_skip_rtt_rise, &ucp_param_ops, &ucp_probe_skip_rtt_rise, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_probe_skip_rtt_rise,
"RTT increase threshold (permyriad) above which PROBE_BW probe is skipped"); /* Human-readable description displayed by modinfo and in sysfs */
/* ---- STARTUP loss-based gain reduction (permyriad) ---------------------- */
static int ucp_startup_soft_drain_thresh = 50; /* STARTUP soft drain loss threshold: 50 permyriad = 0.5% loss; above this, STARTUP gain is reduced to soft_gain (2.5x); unit: permyriad */
module_param_cb(ucp_startup_soft_drain_thresh, &ucp_param_ops, &ucp_startup_soft_drain_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_startup_soft_drain_thresh,
"Loss threshold (permyriad) to reduce STARTUP gain to soft_gain"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_startup_hard_cap_thresh = 200; /* STARTUP hard cap loss threshold: 200 permyriad = 2.0% loss; above this, STARTUP gain capped at cwnd_gain_val (2.0x); unit: permyriad */
module_param_cb(ucp_startup_hard_cap_thresh, &ucp_param_ops, &ucp_startup_hard_cap_thresh, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_startup_hard_cap_thresh,
"Loss threshold (permyriad) to cap STARTUP gain at cwnd_gain"); /* Human-readable description displayed by modinfo and in sysfs */
static int ucp_startup_soft_gain = 25000; /* STARTUP soft gain: 25000 permyriad = 2.5000x; reduced gain used when loss is between soft_drain and hard_cap thresholds; unit: permyriad */
module_param_cb(ucp_startup_soft_gain, &ucp_param_ops, &ucp_startup_soft_gain, 0644); /* Export as sysfs parameter with read-write permissions */
MODULE_PARM_DESC(ucp_startup_soft_gain,
"STARTUP pacing/cwnd gain when loss between soft and hard thresholds (permyriad)"); /* Human-readable description displayed by modinfo and in sysfs */
/* ---- Internal derived variables (populated at module init) -------------- */
static u32 ucp_bw_floor_default_val; /* Bandwidth floor for DEFAULT paths, kept in permyriad (no scale conversion); used in the bandwidth floor calculation as floor_pct/10000 */
static u32 ucp_bw_floor_mobile_val; /* Bandwidth floor for MOBILE paths, kept in permyriad (no scale conversion); used in the bandwidth floor calculation */
static u32 ucp_bw_floor_lan_val; /* Bandwidth floor for LAN paths: stored as permyriad value; 0 = disabled; tunable via sysfs */
static u32 ucp_bw_floor_vpn_val; /* Bandwidth floor for VPN paths: stored as permyriad value; 0 = disabled; tunable via sysfs */
static u32 ucp_bw_floor_lossy_fat_val; /* Bandwidth floor for LOSSY_FAT paths: stored as permyriad value; default 2500 = 25% */
static u32 ucp_bw_floor_congested_val; /* Bandwidth floor for CONGESTED paths: stored as permyriad value; default 2000 = 20%; only active when net_condition != CONGESTED */
static u32 ucp_bw_floor_loss_cap_val; /* Max loss ratio for bandwidth floor activation in BBR_SCALE; bandwidth floor is disabled when loss EWMA exceeds this value */
static u32 ucp_probe_rtt_base_jiffies; /* Base PROBE_RTT interval in jiffies (kernel timer ticks); calculated from ucp_probe_rtt_base_sec * HZ */
static u32 ucp_probe_rtt_class_extra_jiffies; /* Extra PROBE_RTT interval for MOBILE/LOSSY_FAT classes in jiffies; calculated from ucp_probe_rtt_class_extra_sec * HZ */
static u32 ucp_probe_rtt_high_loss_extra_jiffies; /* Extra PROBE_RTT interval when loss >= high_loss_thresh in jiffies; calculated from ucp_probe_rtt_high_loss_extra_sec * HZ */
static u32 ucp_probe_rtt_mid_loss_extra_jiffies; /* Extra PROBE_RTT interval when loss >= probe_skip_thresh but < high_loss_thresh in jiffies; calculated from ucp_probe_rtt_mid_loss_extra_sec * HZ */
static u32 ucp_probe_rtt_max_jiffies; /* Hard maximum PROBE_RTT interval in jiffies; calculated from ucp_probe_rtt_max_sec * HZ; absolute cap */
static u32 ucp_cwnd_gain_val; /* Steady-state cwnd gain in BBR_SCALE; converted from ucp_cwnd_gain via permyriad_to_bbr(); default: 512 = 2.0x */
static u32 ucp_cwnd_cap_mild_val; /* MILD congestion cwnd cap in BBR_SCALE; converted from ucp_cwnd_cap_mild via permyriad_to_bbr(); default: 448 = 1.75x */
static u32 ucp_cwnd_cap_moderate_val; /* MODERATE congestion cwnd cap in BBR_SCALE; converted from ucp_cwnd_cap_moderate; default: 384 = 1.50x */
static u32 ucp_cwnd_cap_severe_val; /* SEVERE congestion cwnd cap in BBR_SCALE; converted from ucp_cwnd_cap_severe; default: 320 = 1.25x */
static u32 ucp_probe_gain_val; /* PROBE_BW probe phase pacing gain in BBR_SCALE; converted from ucp_probe_gain; default: ~282 = 1.10x */
static u32 ucp_probe_gain_mobile_val; /* Mobile path probe gain in BBR_SCALE; converted from ucp_probe_gain_mobile; default: 256 = 1.00x (no probe) */
static u32 ucp_drain_loss_thresh_val; /* Drain trigger loss threshold in BBR_SCALE; converted from ucp_drain_loss_thresh; default: ~2.56 = 1.0% */
static u32 ucp_drain_gain_light_val; /* Light drain pacing gain in BBR_SCALE; converted from ucp_drain_gain_light; default: ~217 = 0.85x */
static u32 ucp_drain_gain_standard_val; /* Standard drain pacing gain in BBR_SCALE; converted from ucp_drain_gain_standard; default: 192 = 0.75x */
static u32 ucp_drain_gain_aggressive_val; /* Aggressive drain pacing gain in BBR_SCALE; converted from ucp_drain_gain_aggressive; default: ~166 = 0.65x */
static u32 ucp_drain_lvl2_loss_thresh_val; /* Level 2 (standard) drain loss threshold in BBR_SCALE; converted from ucp_drain_loss_lvl2_thresh; default: ~12.8 = 5.0% */
static u32 ucp_drain_lvl3_loss_thresh_val; /* Level 3 (aggressive) drain loss threshold in BBR_SCALE; converted from ucp_drain_loss_lvl3_thresh; default: ~25.6 = 10.0% */
static u32 ucp_low_loss_thresh_val; /* Low loss threshold in BBR_SCALE; converted from ucp_low_loss_thresh; default: ~2.56 = 1.0% */
static u32 ucp_high_loss_thresh_val; /* High loss threshold in BBR_SCALE; converted from ucp_high_loss_thresh; default: ~12.8 = 5.0% */
static u32 ucp_probe_skip_loss_val; /* Probe skip loss threshold in BBR_SCALE; converted from ucp_probe_skip_loss_thresh; default: ~5.12 = 2.0% */
static u32 ucp_probe_skip_rtt_rise_val; /* Probe skip RTT rise threshold in BBR_SCALE; converted from ucp_probe_skip_rtt_rise; default: ~102.4 = 40% */
static u32 ucp_startup_soft_drain_val; /* STARTUP soft drain loss threshold in BBR_SCALE; converted from ucp_startup_soft_drain_thresh; default: ~1.28 = 0.5% */
static u32 ucp_startup_hard_cap_val; /* STARTUP hard cap loss threshold in BBR_SCALE; converted from ucp_startup_hard_cap_thresh; default: ~5.12 = 2.0% */
static u32 ucp_startup_soft_gain_val; /* STARTUP soft gain in BBR_SCALE; converted from ucp_startup_soft_gain; default: 640 = 2.50x */
static u32 ucp_extra_acked_gain_val; /* ACK aggregation compensation gain in BBR_SCALE; converted from ucp_extra_acked_gain; default: 256 = 1.0x (BBR standard) */
/* congestion severity thresholds (derived) */
static u32 ucp_cong_severe_loss_val; /* Severe congestion loss threshold in BBR_SCALE; derived as copy of ucp_high_loss_thresh_val (5.0% default) */
static u32 ucp_cong_moderate_loss_val; /* Moderate congestion loss threshold in BBR_SCALE; derived as copy of ucp_drain_loss_thresh_val (1.0% default) */
static u32 ucp_cong_mild_loss_val; /* Mild congestion loss threshold in BBR_SCALE; derived as copy of ucp_low_loss_thresh_val (1.0% default, same as moderate) */
/**
* @brief Convert a permyriad value (1/10000) to BBR_SCALE fixed-point (1/256).
* @param val Input value in permyriad units; range typically 0..10000+
* @return Equivalent value in BBR_SCALE fixed-point where BBR_UNIT = 256
*
 * Performs: (BBR_UNIT * val) / 10000 using integer division, so results truncate
 * Example: 20000 permyriad * 256 / 10000 = 512, which represents 2.0x in BBR_SCALE;
 * small thresholds lose the fraction (100 permyriad * 256 / 10000 = 2.56 -> 2)
*/
static u32 permyriad_to_bbr(u32 val)
{
/* Multiply by BBR_UNIT (256) using 64-bit arithmetic to avoid overflow on large values, then divide by 10000 permyriad base */
return (u32)(((u64)BBR_UNIT * val) / 10000); /* Scaling: permyriad (1/10000) to BBR_SCALE (1/256) via (256 * val / 10000) */
}
/**
* @brief Precompute all module parameter derived values at module load time.
* Called once at module init to convert permyriad module parameters into BBR_SCALE gain values
* and seconds into jiffies, so the per-ACK hot path avoids repeated conversions.
*/
static void ucp_init_module_params(void)
{
/* Clamp all permyriad/seconds module parameters to non-negative: negative values via sysfs would wrap to large unsigned causing stalls or crashes */
ucp_bw_floor_default = max(ucp_bw_floor_default, 0);
ucp_bw_floor_mobile = max(ucp_bw_floor_mobile, 0);
ucp_bw_floor_lan = max(ucp_bw_floor_lan, 0);
ucp_bw_floor_vpn = max(ucp_bw_floor_vpn, 0);
ucp_bw_floor_lossy_fat = max(ucp_bw_floor_lossy_fat, 0);
ucp_bw_floor_congested = max(ucp_bw_floor_congested, 0);
ucp_bw_floor_loss_cap = max(ucp_bw_floor_loss_cap, 0);
ucp_probe_rtt_base_sec = max(ucp_probe_rtt_base_sec, 0);
ucp_probe_rtt_class_extra_sec = max(ucp_probe_rtt_class_extra_sec, 0);
ucp_probe_rtt_high_loss_extra_sec = max(ucp_probe_rtt_high_loss_extra_sec, 0);
ucp_probe_rtt_mid_loss_extra_sec = max(ucp_probe_rtt_mid_loss_extra_sec, 0);
ucp_probe_rtt_max_sec = max(ucp_probe_rtt_max_sec, 0);
ucp_cwnd_gain = max(ucp_cwnd_gain, 0);
ucp_cwnd_cap_mild = max(ucp_cwnd_cap_mild, 0);
ucp_cwnd_cap_moderate = max(ucp_cwnd_cap_moderate, 0);
ucp_cwnd_cap_severe = max(ucp_cwnd_cap_severe, 0);
ucp_extra_acked_gain = max(ucp_extra_acked_gain, 0);
ucp_probe_gain = max(ucp_probe_gain, 0);
ucp_probe_gain_mobile = max(ucp_probe_gain_mobile, 0);
ucp_drain_loss_thresh = max(ucp_drain_loss_thresh, 0);
ucp_drain_gain_light = max(ucp_drain_gain_light, 0);
ucp_drain_gain_standard = max(ucp_drain_gain_standard, 0);
ucp_drain_gain_aggressive = max(ucp_drain_gain_aggressive, 0);
ucp_drain_loss_lvl2_thresh = max(ucp_drain_loss_lvl2_thresh, 0);
ucp_drain_loss_lvl3_thresh = max(ucp_drain_loss_lvl3_thresh, 0);
ucp_low_loss_thresh = max(ucp_low_loss_thresh, 0);
ucp_high_loss_thresh = max(ucp_high_loss_thresh, 0);
ucp_probe_skip_loss_thresh = max(ucp_probe_skip_loss_thresh, 0);
ucp_probe_skip_rtt_rise = max(ucp_probe_skip_rtt_rise, 0);
ucp_startup_soft_drain_thresh = max(ucp_startup_soft_drain_thresh, 0);
ucp_startup_hard_cap_thresh = max(ucp_startup_hard_cap_thresh, 0);
ucp_startup_soft_gain = max(ucp_startup_soft_gain, 0);
/* bandwidth floor: store as permyriad values (same as permyriad input) for use in floor_pct/10000 division later */
ucp_bw_floor_default_val = ucp_bw_floor_default; /* Copy the default path floor permyriad (e.g., 2000 for 20.00%) without scale conversion; used as numerator in floor_pct/10000 */
ucp_bw_floor_mobile_val = ucp_bw_floor_mobile; /* Copy the mobile path floor permyriad (e.g., 2500 for 25.00%) without scale conversion; used as numerator in floor_pct/10000 */
ucp_bw_floor_lan_val = ucp_bw_floor_lan; /* Copy the LAN path floor permyriad (default 0 = disabled) without scale conversion */
ucp_bw_floor_vpn_val = ucp_bw_floor_vpn; /* Copy the VPN path floor permyriad (default 0 = disabled) without scale conversion */
ucp_bw_floor_lossy_fat_val = ucp_bw_floor_lossy_fat; /* Copy the LOSSY_FAT path floor permyriad (default 2500 = 25%) without scale conversion */
ucp_bw_floor_congested_val = ucp_bw_floor_congested; /* Copy the CONGESTED path floor permyriad (default 2000 = 20%) without scale conversion */
ucp_bw_floor_loss_cap_val = permyriad_to_bbr(ucp_bw_floor_loss_cap); /* Convert loss cap threshold from permyriad to BBR_SCALE for direct comparison with loss_ewma */
/* ProbeRTT intervals: multiply seconds by HZ (kernel ticks per second) to convert to timer ticks */
ucp_probe_rtt_base_jiffies = ucp_probe_rtt_base_sec * HZ; /* Base interval: 10 seconds * HZ; the minimum time that must elapse between PROBE_RTT entries */
ucp_probe_rtt_class_extra_jiffies = ucp_probe_rtt_class_extra_sec * HZ; /* Extra time for hostile path classes: 5 seconds * HZ; adds to base for MOBILE/LOSSY_FAT */
ucp_probe_rtt_high_loss_extra_jiffies = ucp_probe_rtt_high_loss_extra_sec * HZ; /* Extra time when high loss >= 5%: configured seconds * HZ (default 0) */
ucp_probe_rtt_mid_loss_extra_jiffies = ucp_probe_rtt_mid_loss_extra_sec * HZ; /* Extra time when medium loss >= 2%: configured seconds * HZ (default 0) */
ucp_probe_rtt_max_jiffies = ucp_probe_rtt_max_sec * HZ; /* Hard cap: 15 seconds * HZ; maximum allowed PROBE_RTT interval regardless of adjustments */
/* cwnd gains: convert from permyriad to BBR_SCALE for efficient fixed-point arithmetic per ACK */
ucp_cwnd_gain_val = permyriad_to_bbr(ucp_cwnd_gain); /* Steady-state cwnd gain: 20000 permyriad -> 512 BBR_SCALE (2.0x BDP) */
ucp_cwnd_cap_mild_val = permyriad_to_bbr(ucp_cwnd_cap_mild); /* Mild congestion cap: 17500 permyriad -> 448 BBR_SCALE (1.75x BDP) */
ucp_cwnd_cap_moderate_val = permyriad_to_bbr(ucp_cwnd_cap_moderate); /* Moderate congestion cap: 15000 permyriad -> 384 BBR_SCALE (1.50x BDP) */
ucp_cwnd_cap_severe_val = permyriad_to_bbr(ucp_cwnd_cap_severe); /* Severe congestion cap: 12500 permyriad -> 320 BBR_SCALE (1.25x BDP) */
/* probe gains: convert to BBR_SCALE for use in the PROBE_BW 8-phase gain cycle */
ucp_probe_gain_val = permyriad_to_bbr(ucp_probe_gain); /* Standard probe gain: 11000 permyriad -> ~282 BBR_SCALE (1.10x BDP) */
ucp_probe_gain_mobile_val = permyriad_to_bbr(ucp_probe_gain_mobile); /* Mobile probe gain: 10000 permyriad -> 256 BBR_SCALE (1.00x = no probe) */
/* drain thresholds and gains: convert all drain-related permyriad parameters to BBR_SCALE */
ucp_drain_loss_thresh_val = permyriad_to_bbr(ucp_drain_loss_thresh); /* Drain trigger loss: 100 permyriad -> ~2.56 BBR_SCALE (1.0%) */
ucp_drain_gain_light_val = permyriad_to_bbr(ucp_drain_gain_light); /* Light drain gain: 8500 permyriad -> ~217 BBR_SCALE (0.85x) */
ucp_drain_gain_standard_val = permyriad_to_bbr(ucp_drain_gain_standard); /* Standard drain gain: 7500 permyriad -> 192 BBR_SCALE (0.75x) */
ucp_drain_gain_aggressive_val = permyriad_to_bbr(ucp_drain_gain_aggressive); /* Aggressive drain gain: 6500 permyriad -> ~166 BBR_SCALE (0.65x) */
ucp_drain_lvl2_loss_thresh_val = permyriad_to_bbr(ucp_drain_loss_lvl2_thresh); /* Level 2 drain loss: 500 permyriad -> ~12.8 BBR_SCALE (5.0%) */
ucp_drain_lvl3_loss_thresh_val = permyriad_to_bbr(ucp_drain_loss_lvl3_thresh); /* Level 3 drain loss: 1000 permyriad -> ~25.6 BBR_SCALE (10.0%) */
/* general loss thresholds: convert to BBR_SCALE for consistent comparison with loss_ewma */
ucp_low_loss_thresh_val = permyriad_to_bbr(ucp_low_loss_thresh); /* Low loss threshold: 100 permyriad -> ~2.56 BBR_SCALE (1.0%) */
ucp_high_loss_thresh_val = permyriad_to_bbr(ucp_high_loss_thresh); /* High loss threshold: 500 permyriad -> ~12.8 BBR_SCALE (5.0%) */
/* probe skip thresholds: convert to BBR_SCALE for comparison with loss_ewma and RTT increase ratio */
ucp_probe_skip_loss_val = permyriad_to_bbr(ucp_probe_skip_loss_thresh); /* Probe skip loss: 200 permyriad -> ~5.12 BBR_SCALE (2.0%); probe phase skipped if loss_ewma >= this */
ucp_probe_skip_rtt_rise_val = permyriad_to_bbr(ucp_probe_skip_rtt_rise); /* Probe skip RTT rise: 4000 permyriad -> ~102.4 BBR_SCALE (40%); probe skipped if rinc >= this */
/* STARTUP thresholds: convert to BBR_SCALE for loss-based gain reduction in exponential growth phase */
ucp_startup_soft_drain_val = permyriad_to_bbr(ucp_startup_soft_drain_thresh); /* Soft drain loss: 50 permyriad -> ~1.28 BBR_SCALE (0.5%); reduces gain above this */
ucp_startup_hard_cap_val = permyriad_to_bbr(ucp_startup_hard_cap_thresh); /* Hard cap loss: 200 permyriad -> ~5.12 BBR_SCALE (2.0%); hard caps gain above this */
ucp_startup_soft_gain_val = permyriad_to_bbr(ucp_startup_soft_gain); /* Soft gain: 25000 permyriad -> 640 BBR_SCALE (2.50x); reduced gain between soft and hard thresholds */
/* ACK aggregation compensation gain: convert permyriad to BBR_SCALE for efficient per-ACK arithmetic */
ucp_extra_acked_gain_val = permyriad_to_bbr(ucp_extra_acked_gain); /* Compensation gain: default 10000 permyriad -> 256 BBR_SCALE (1.0x, BBR standard) */
/* derived congestion severity loss thresholds: reuse established loss threshold values to define graduated congestion severity levels */
ucp_cong_severe_loss_val = ucp_high_loss_thresh_val; /* Severe congestion loss threshold = high loss threshold (5.0%) */
ucp_cong_moderate_loss_val = ucp_drain_loss_thresh_val; /* Moderate congestion loss threshold = drain trigger loss (1.0%) */
ucp_cong_mild_loss_val = ucp_low_loss_thresh_val; /* Mild congestion loss threshold = low loss threshold (1.0%); same as moderate in default config */
}
/* ---- Non-exported internal constants (structural, BBR-derived) ---------- */
#define UCP_BW_RTT_CYCLE_LEN 8 /* Number of packet-timed rounds in one BBR gain cycle; the minmax filter window is UCP_BW_RTTS = cycle + 2 guard rounds = 10 total */
#define UCP_BW_RTTS (UCP_BW_RTT_CYCLE_LEN + 2) /* Total max filter window including 2 guard rounds: 10 rounds total; matches BBR's 10-round max filter */
#define UCP_PROBE_RTT_MODE_MS 200 /* Duration to stay in PROBE_RTT mode: 200 milliseconds; cwnd clamped to 4 packets during this period to drain the bottleneck queue for a clean min RTT sample */
#define UCP_MIN_TSO_RATE 1200000 /* Minimum pacing rate for TSO single-segment limit: 1,200,000 bps (1.2 Mbps); below this rate, only 1 TSO segment is used to avoid burstiness */
#define UCP_TSO_MAX_SEGS 0x7F /* Hard maximum TSO segments per GSO burst: 127 segments (0x7F); prevents excessive burst size regardless of calculated goal */
#define UCP_PACING_MARGIN_PERCENT 1 /* Self-queue pacing safety margin: 1% rate reduction; reduces the pacing rate slightly to avoid building local qdisc queues */
#define UCP_PACING_MARGIN_DIV 99 /* Divisor for applying the 1% pacing margin: 99/100 = 0.99; final rate = rate * 99 / 100, giving 1% headroom */
#define UCP_RATE_MAX_SAFE (U64_MAX / UCP_PACING_MARGIN_DIV) /* Upper bound to prevent 64-bit overflow in the margin multiplication: U64_MAX / 99; any rate above this is capped */
#define UCP_HIGH_GAIN (BBR_UNIT * 2885 / 1000 + 1) /* High gain for STARTUP phase: BBR_UNIT * 2885 / 1000 + 1 = ~2.885x (rounded up); same as BBR's 2/ln(2) ~ 2.885 high gain */
#define UCP_DRAIN_GAIN (BBR_UNIT * 1000 / 2885) /* Drain gain for DRAIN phase: BBR_UNIT * 1000 / 2885 = ~0.346x (truncated); reciprocal of UCP_HIGH_GAIN to drain exactly the queue built during STARTUP */
#define UCP_PROBE_BW_CYCLE_LEN 8 /* Number of phases in the PROBE_BW gain cycle: 8 phases (probe, drain, 6 cruise) as in standard BBR */
#define UCP_PROBE_BW_DRAIN_IDX 1 /* Index of the drain phase within the PROBE_BW cycle: phase 1 immediately follows the probe phase (index 0) */
#define UCP_PROBE_BW_CYCLE_RAND 7 /* Randomization range for the initial cycle phase: a random offset in [0, 7), taken modulo the cycle length, randomizes the phase start position across connections */
#define UCP_CWND_MIN_TARGET 4 /* Absolute minimum congestion window target: 4 packets; applied during PROBE_RTT as the cwnd clamp and as a general cwnd floor */
#define UCP_FULL_BW_THRESH (BBR_UNIT * 125 / 100) /* Full bandwidth detection threshold: 1.25x in BBR_SCALE (256 * 125 / 100 = 320); STARTUP considered full when BW growth < 1.25x */
#define UCP_FULL_BW_CNT 3 /* Number of consecutive rounds without 1.25x BW growth to declare pipe full: 3 rounds; same as BBR's 3-round criterion for full pipe */
#define UCP_LOSS_EWMA_RETAINED_WEIGHT 3 /* EWMA retained weight numerator for loss: 3 parts retained from the previous EWMA value */
#define UCP_LOSS_EWMA_SAMPLE_WEIGHT 1 /* EWMA sample weight numerator for loss: 1 part contributed by the new instantaneous loss ratio */
#define UCP_LOSS_EWMA_TOTAL_WEIGHT 4 /* EWMA total weight denominator for loss: 4 total parts (3 retained + 1 sample = new EWMA = 3/4 old + 1/4 new) */
#define UCP_LOSS_EWMA_IDLE_DECAY_NUM 70 /* Loss EWMA idle decay numerator: 70; when a sample has no losses, EWMA decays to EWMA * 70/100 = 0.7x previous */
#define UCP_LOSS_EWMA_IDLE_DECAY_DEN 100 /* Loss EWMA idle decay denominator: 100; 70/100 = 0.7 decay factor applied when no losses in current sample */
#define UCP_ECN_EWMA_RETAINED_WEIGHT 3 /* ECN EWMA retained weight numerator: 3 parts retained from the previous EWMA value */
#define UCP_ECN_EWMA_SAMPLE_WEIGHT 1 /* ECN EWMA sample weight numerator: 1 part from the new instantaneous ECN marking ratio */
#define UCP_ECN_EWMA_TOTAL_WEIGHT 4 /* ECN EWMA total weight denominator: 4 total parts (3 retained + 1 sample = 3/4 old + 1/4 new) */
#define UCP_ECN_EWMA_IDLE_DECAY_NUM 70 /* ECN EWMA idle decay numerator: 70; when no CE marks in current sample, EWMA decays to EWMA * 70/100 */
#define UCP_ECN_EWMA_IDLE_DECAY_DEN 100 /* ECN EWMA idle decay denominator: 100; 70/100 = 0.7 decay factor for ECN marking ratio when no new CE marks observed */
#define UCP_COND_CONFIRM_ENTER 3 /* Hysteresis confirm count threshold for entering CONGESTED condition: 3 consecutive samples agreeing on transition into congested */
#define UCP_COND_CONFIRM_EXIT 2 /* Hysteresis confirm count threshold for exiting CONGESTED condition: 2 consecutive samples needed to leave any non-idle condition */
#define UCP_CLASS_CONFIRM_CNT 2 /* Hysteresis confirm count threshold for network class transitions: need 2 consecutive class change suggestions to switch */
#define UCP_CLASS_CONFIRM_MAX 7 /* Maximum value for class hysteresis confirmation counter: saturates at 7 to avoid overflow in the 3-bit class_confirm bitfield */
#define UCP_INFLIGHT_LOW_GAIN (BBR_UNIT * 125 / 100) /* Lower bound gain for inflight cwnd clamping: 1.25x BDP in BBR_SCALE (320); prevents cwnd from dropping below 1.25x BDP */
#define UCP_INFLIGHT_HIGH_GAIN (BBR_UNIT * 200 / 100) /* Upper bound gain for inflight cwnd clamping: 2.00x BDP in BBR_SCALE (512); prevents cwnd from exceeding 2.0x BDP */
/* ACK aggregation compensation decay factor (exponential forgetting on extra_acked_max) */
#define UCP_ACK_EPOCH_DECAY_NUM 3 /* Numerator for epoch max decay: extra_acked_max = extra_acked_max * 3 / 4 each epoch */
#define UCP_ACK_EPOCH_DECAY_DEN 4 /* Denominator for epoch max decay: 3/4 = 0.75 exponential decay factor */
/* Maximum value for u8 saturation (used in ACK aggregation u8 counters); prefixed UCP_ to avoid clashing with the kernel's U8_MAX in <linux/limits.h> */
#define UCP_U8_MAX 0xFF /* 255; maximum representable value in an unsigned 8-bit integer */
/* Minimum epoch duration for ACK aggregation compensation (microseconds) */
#define UCP_ACK_EPOCH_MIN_US 1000 /* 1 ms floor; prevents degenerate epoch resets when min_rtt_us is unrealistically small */
#define UCP_RTT_SAMPLE_MAX_US 500000 /* Hard absolute ceiling for RTT sample rejection: 500,000 microseconds (500 ms); any RTT sample above this is unconditionally discarded as an outlier */
#define UCP_RTT_SAMPLE_MAX_MULT 3 /* Dynamic multiplier for per-connection RTT sample ceiling: samples > 3x min_rtt_us are rejected as outliers */
#define UCP_RATE_TREND_EWMA_WEIGHT (BBR_UNIT * 7 / 8) /* EWMA retained weight for rate trend smoothing: 7/8 retained (224/256 BBR_SCALE); strong smoothing to filter out noise in delivery rate measurements */
#define UCP_MINRTT_FAST_FALL_CNT 3 /* Count of consecutive sub-75% min_rtt samples needed to trigger a fast downward revision: 3 fast-fall samples trigger immediate min_rtt update */
#define UCP_MINRTT_STICKY_FLOOR_NUM 3 /* Sticky floor numerator for progressive min_rtt reduction: 3; when fast-fall count is active but below threshold, min_rtt is reduced to 3/4 of current */
#define UCP_MINRTT_STICKY_FLOOR_DEN 4 /* Sticky floor denominator for progressive min_rtt reduction: 4; min_rtt = min_rtt * 3/4 before full fast-fall trigger */
#define UCP_MINRTT_SRTT_GUARD_NUM 9 /* SRTT guard numerator: 9; if srtt/8 < min_rtt * 9/10, update min_rtt to srtt/8 as a safety guard against stale min_rtt */
#define UCP_MINRTT_SRTT_GUARD_DEN 10 /* SRTT guard denominator: 10; comparison is srtt/8 < min_rtt * 9/10; condition: (tp->srtt_us >> 3) < ucp->min_rtt_us * 9/10 */
#define UCP_BDP_MIN_RTT_US 1000 /* Minimum RTT for BDP calculation: 1000 microseconds (1 ms); prevents division by zero or unrealistically small BDP values */
#define UCP_BDP_HI_MULT 2 /* BDP model RTT upper bound multiplier: 2; model_rtt is capped at max(min_rtt_us * 2, 500ms) to bound the BDP estimate */
#define UCP_BDP_HI_FLOOR_US 500000 /* BDP model RTT upper bound floor: 500,000 microseconds (500 ms); minimum value for the high bound, even when min_rtt is very small */
#define UCP_TSO_HEADROOM_SEGS 3 /* TSO headroom in segments: 3 extra TSO segments added to cwnd to prevent TCP segmentation offload from stalling the transmit pipeline */
#define UCP_PROBE_CWND_BONUS 2 /* Extra cwnd segments during PROBE_BW probe phase: 2 additional segments to ensure we fill the pipe during bandwidth probing */
#define UCP_BW_FLOOR_PCT_OFF 0 /* Special sentinel value that disables the bandwidth floor: 0 means floor is not applied (used for LAN/VPN paths) */
#define UCP_CLASS_LAN_RTT_US 5000 /* LAN classification RTT threshold: 5000 microseconds (5 ms); paths with average RTT below this qualify as LAN */
#define UCP_CLASS_LAN_JITTER_US 3000 /* LAN classification jitter threshold: 3000 microseconds (3 ms); max RTT variation for LAN classification */
#define UCP_CLASS_LAN_LOSS_THRESH 1 /* LAN classification loss threshold: nominally 0.1%, but BBR_UNIT / 1000 truncates to 0 in integer math (256/1000 = 0), which would make "loss < threshold" unsatisfiable; 1 (~0.39%) is the smallest representable non-zero ratio */
#define UCP_CLASS_MOBILE_LOSS_THRESH (BBR_UNIT * 3 / 100) /* Mobile classification loss threshold: 3% in BBR_SCALE (256 * 3 / 100 truncates to 7); high loss characteristic of cellular */
#define UCP_CLASS_MOBILE_JITTER_US (20 * USEC_PER_MSEC) /* Mobile classification jitter threshold: 20,000 microseconds (20 ms); high jitter typical of cellular networks */
#define UCP_CLASS_LOSSY_RTT_US 80000 /* Lossy fat pipe classification RTT threshold: 80,000 microseconds (80 ms); high latency typical of satellite links */
#define UCP_CLASS_LOSSY_LOSS_THRESH (BBR_UNIT / 100) /* Lossy fat pipe loss threshold: 1% in BBR_SCALE (256 / 100 truncates to 2); significant background loss expected */
#define UCP_CLASS_CONG_RINC_THRESH (BBR_UNIT * 50 / 100)/* Congested class RTT increase ratio threshold: 50% in BBR_SCALE (128); RTT has increased by 50% relative to min_rtt */
#define UCP_CLASS_VPN_RTT_US 60000 /* VPN classification RTT threshold: 60,000 microseconds (60 ms); elevated but stable latency typical of VPN tunnels */
#define UCP_RTT_EXTRA_HIGH_THRESH (BBR_UNIT * 100 / 100) /* RTT increase ratio threshold for "high" classification: 100% in BBR_SCALE (256); RTT has doubled relative to min_rtt */
#define UCP_RTT_EXTRA_MID_THRESH (BBR_UNIT * 50 / 100) /* RTT increase ratio threshold for "mid" classification: 50% in BBR_SCALE (128); RTT has increased by 50% */
#define UCP_CONG_SEVERE_RINC_THRESH UCP_RTT_EXTRA_HIGH_THRESH /* Severe congestion RTT increase threshold: same as RTT_EXTRA_HIGH (100%); congestion when RTT >= 2x min_rtt */
#define UCP_CONG_MODERATE_RINC_THRESH UCP_RTT_EXTRA_MID_THRESH /* Moderate congestion RTT increase threshold: same as RTT_EXTRA_MID (50%) */
#define UCP_CONG_MILD_RINC_THRESH (BBR_UNIT * 25 / 100) /* Mild congestion RTT increase threshold: 25% in BBR_SCALE (64); minor RTT increase indicates mild queue buildup */
#define UCP_COND_RATE_DROP_THRESH (-(s32)(BBR_UNIT * 15 / 100)) /* Rate EWMA drop threshold for congestion detection: -15% BBR_SCALE (signed -38); rate_change_ewma below this means significant rate decrease */
#define UCP_COND_LOSS_CONGEST_THRESH (BBR_UNIT * 5 / 100) /* Loss congestion threshold: 5% in BBR_SCALE (256 * 5 / 100 truncates to 12); loss above this + rate drop = CONGESTED condition */
#define UCP_COND_RINC_CONGEST_THRESH (BBR_UNIT * 20 / 100) /* RTT increase congestion threshold: 20% in BBR_SCALE (256 * 20 / 100 truncates to 51); RTT rise above this with rate drop confirms congestion */
#define UCP_COND_LOSS_SEVERE_THRESH (BBR_UNIT * 10 / 100) /* Severe loss threshold: 10% in BBR_SCALE (256 * 10 / 100 truncates to 25); very high loss rate alone can trigger CONGESTED classification */
/* ---- Inline helpers for 64-bit cycle_mstamp access ---------------------- */
/**
* @brief Reconstruct the 64-bit PROBE_BW phase start timestamp from two 32-bit halves.
* @param ucp Pointer to the per-connection UCP state structure
* @return 64-bit microsecond timestamp of when the current PROBE_BW gain phase started
*
* Combines cycle_mstamp_hi (upper 32 bits) and cycle_mstamp_lo (lower 32 bits) into a single u64.
* Stored as two halves to avoid requiring 64-bit aligned memory access on all CPU architectures.
*/
static inline u64 ucp_get_cycle_mstamp(const struct ucp *ucp)
{
/* Shift the upper 32-bit half into the high word of a u64 and OR in the lower 32-bit half to reconstruct the full 64-bit timestamp */
return ((u64)ucp->cycle_mstamp_hi << 32) | ucp->cycle_mstamp_lo;
}
/**
* @brief Store a 64-bit timestamp as two 32-bit halves in the UCP state structure.
* @param ucp Pointer to the per-connection UCP state structure
* @param val 64-bit microsecond timestamp to store (typically tp->delivered_mstamp)
*
* Splits the u64 value into upper and lower 32-bit halves for storage in cycle_mstamp_hi/lo.
* Using two u32 fields avoids requiring 64-bit aligned memory access on 32-bit architectures.
*/
static inline void ucp_set_cycle_mstamp(struct ucp *ucp, u64 val)
{
ucp->cycle_mstamp_hi = (u32)(val >> 32); /* Extract the upper 32 bits of the 64-bit timestamp by right-shifting 32 and casting to u32 */
ucp->cycle_mstamp_lo = (u32)(val); /* Extract the lower 32 bits of the 64-bit timestamp by truncating the u64 to u32 */
}
/* ---- Forward declarations ----------------------------------------------- */
static u16 ucp_get_loss_ratio(const struct sock *sk); /* Reconstruct loss EWMA from compressed u8 + overflow bit; forward-declared because called from ucp_update_loss_ewma before definition */
static void ucp_set_loss_ewma(struct ucp *ucp, u16 val); /* Compress loss EWMA to u8 + overflow bit; forward-declared because called from ucp_update_loss_ewma before definition */
static void ucp_check_probe_rtt_done(struct sock *sk); /* Check if PROBE_RTT dwell time has elapsed and exit PROBE_RTT if so; forward-declared because called from ucp_cwnd_event before its definition */
static void ucp_update_model(struct sock *sk, const struct rate_sample *rs); /* Run the full UCP estimation pipeline: bandwidth, loss EWMA, net condition, net class, cycle phase, min_rtt */
static void ucp_apply_pacing_constraints(struct sock *sk); /* Apply any queued one-shot drain constraints to the pacing gain; forward-declared for ucp_update_model */
static void ucp_apply_cwnd_constraints(struct sock *sk); /* Apply congestion severity cwnd caps and STARTUP loss-based gain limits; forward-declared for ucp_update_model */
/**
* @brief Test whether STARTUP has filled the pipe (full bandwidth reached).
* @param sk The TCP socket
* @return true if STARTUP bandwidth growth has stalled (pipe considered full), false otherwise
*
* Checks the full_bw_reached bitfield in the per-connection UCP state.
* This flag gates the transition from STARTUP to DRAIN mode in the state machine.
*/
static bool ucp_full_bw_reached(const struct sock *sk)
{
/* Retrieve the per-connection UCP private state via inet_csk_ca() and return the full_bw_reached bitfield value */
return ((struct ucp *)inet_csk_ca(sk))->full_bw_reached;
}
/**
* @brief Return the maximum filtered bandwidth (BtlBw) from the minmax running-max window.
* @param sk The TCP socket
* @return Maximum bandwidth in BW_SCALE (units of 1/2^24 packets per microsecond); 0 if no samples available
*
* Queries the minmax running-max filter for the peak bottleneck bandwidth estimate over the last 10 packet-timed rounds.
* This is the UCP algorithm's primary bandwidth estimate, equivalent to BBR's BtlBw.
*/
static u32 ucp_max_bw(const struct sock *sk)
{
/* Return the running maximum value from the minmax filter structure; bandwidth is in BW_SCALE (packets per microsecond * 2^24) */
return minmax_get(&((struct ucp *)inet_csk_ca(sk))->bw);
}
/**
* @brief Convert a BW_SCALE bandwidth value and a BBR_SCALE gain into bytes per second for pacing.
* @param sk The TCP socket (used to retrieve MSS cache)
* @param rate Bandwidth in BW_SCALE (packets per microsecond, fixed-point with 24 fractional bits)
* @param gain Gain multiplier in BBR_SCALE (1.0x = BBR_UNIT = 256)
* @return Pacing rate in bytes per second, with 1% self-queue margin already applied
*
* Algorithm steps:
 * 1. rate * MSS * gain >> BBR_SCALE (divides by BBR_UNIT = 256 and yields bytes per microsecond, still in BW_SCALE)
* 2. Convert to bytes per second: integer part * USEC_PER_SEC + fractional part * USEC_PER_SEC / BW_UNIT
* 3. Apply 1% margin: multiply by 99 then divide by 100
* Uses 64-bit arithmetic throughout to avoid overflow on high-speed links (up to multi-gigabit rates).
*/
static u64 ucp_rate_bytes_per_sec(struct sock *sk, u64 rate, int gain)
{
unsigned int mss = tcp_sk(sk)->mss_cache; /* Get the Maximum Segment Size (bytes) from the TCP socket for converting packets to bytes */
u64 q, r, bytes_per_sec; /* q = integer part of the rate (whole bytes/usec), r = fractional remainder in BW_SCALE (< 1 byte/usec), bytes_per_sec = final result */
/* rate * MSS * gain >> BBR_SCALE: rate is in pkts/usec * 2^24 (BW_SCALE), multiply by MSS to get bytes/usec * 2^24, multiply by gain/256 to apply the pacing gain */
rate = rate * mss * gain >> BBR_SCALE; /* Result: (packets/usec * 2^24) * (bytes/packet) * gain / 256 = bytes/usec * 2^24 (still in BW_SCALE) */
q = rate >> BW_SCALE; /* Extract integer part: shift right 24 bits to get whole bytes per microsecond */
r = rate & (BW_UNIT - 1); /* Extract fractional part: mask with (2^24 - 1) to get the fractional remainder in BW_SCALE */
/* Convert to bytes per second: integer part * 1,000,000 (usec/sec) + fractional part * 1,000,000 >> BW_SCALE */
bytes_per_sec = q * USEC_PER_SEC + ((r * USEC_PER_SEC) >> BW_SCALE);
if (bytes_per_sec > UCP_RATE_MAX_SAFE) {/* Safety check: if bytes_per_sec would overflow when multiplied by UCP_PACING_MARGIN_DIV (99), cap it */
bytes_per_sec = UCP_RATE_MAX_SAFE; /* Cap at maximum safe value to prevent overflow in the subsequent margin multiplication */
}
/* Apply the 1% self-queue pacing margin: multiply by 99 then divide by 100, which slightly reduces the rate to prevent local qdisc queue buildup */
return bytes_per_sec * UCP_PACING_MARGIN_DIV / 100; /* Final pacing rate in bytes/sec with 1% headroom for the qdisc layer */
}
/**
* @brief Convert BtlBw estimate and pacing gain to socket pacing rate, capped by sk_max_pacing_rate.
* @param sk The TCP socket
* @param bw Bottleneck bandwidth in BW_SCALE
* @param gain Pacing gain in BBR_SCALE
* @return Pacing rate in bytes per second, limited to sk->sk_max_pacing_rate (the socket's configured max)
*
* Wrapper around ucp_rate_bytes_per_sec() that clamps the result to the per-socket maximum pacing rate.
* This prevents the pacing rate from exceeding any user-configured or system-imposed rate cap.
*/
static unsigned long ucp_bw_to_pacing_rate(struct sock *sk, u32 bw, int gain)
{
/* Compute bytes/sec from BW and gain, then clamp to the socket's configured maximum pacing rate (sk_max_pacing_rate) */
return min_t(u64, ucp_rate_bytes_per_sec(sk, bw, gain),
sk->sk_max_pacing_rate);
}
/**
* @brief Initialize the pacing rate from cwnd and SRTT before any bandwidth samples are available.
* @param sk The TCP socket
*
* This is called at connection start and whenever pacing needs initialization before the first
* delivery rate sample. It estimates bandwidth as cwnd / srtt and sets the initial pacing rate
* using the high STARTUP gain (UCP_HIGH_GAIN ~= 2.89x) to probe for bandwidth.
* If SRTT is not yet available, it assumes a 1 ms RTT as a fallback estimate.
*/
static void ucp_init_pacing_rate_from_rtt(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for snd_cwnd and srtt_us */
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for has_seen_rtt */
u32 rtt_us; /* RTT in microseconds for initial bandwidth estimation */
u64 bw; /* Estimated bandwidth in BW_SCALE (packets per microsecond * 2^24) */
/* Check if a valid smoothed RTT measurement is available from the TCP stack */
if (tp->srtt_us) { /* srtt_us is present: use the smoothed RTT for the initial bandwidth estimate */
rtt_us = max(tp->srtt_us >> 3, 1U); /* srtt_us is stored as 8x the actual RTT; shift right 3 to get microseconds, clamp to minimum 1 us to avoid division by zero */
ucp->has_seen_rtt = 1; /* Record that a valid RTT sample has been observed; gates RTT-dependent logic elsewhere in the algorithm */
} else { /* No smoothed RTT measurement yet: use a conservative fallback */
rtt_us = USEC_PER_MSEC; /* Fallback to 1 millisecond (1000 microseconds) as a reasonable initial RTT assumption for modern networks */
}
/* Calculate initial bandwidth: cwnd (packets) / rtt_us (usec), converted to BW_SCALE */
bw = (u64)tp->snd_cwnd * BW_UNIT; /* Convert cwnd to BW_SCALE: multiply by 2^24 so the division yields (packets/usec) * 2^24 */
do_div(bw, rtt_us); /* 64-bit division: bw = cwnd * 2^24 / rtt_us; result is bandwidth in packets per microsecond at BW_SCALE */
/* Set the initial pacing rate to bandwidth * UCP_HIGH_GAIN (~2.89x) to aggressively probe during the STARTUP phase */
sk->sk_pacing_rate = ucp_bw_to_pacing_rate(sk, bw, UCP_HIGH_GAIN);
}
/**
* @brief Apply pacing gain to the socket's pacing rate with 3:1 EWMA smoothing on increases.
* @param sk The TCP socket
* @param bw Current bottleneck bandwidth estimate in BW_SCALE
* @param gain Target pacing gain in BBR_SCALE (determines the multiplier applied to bw for the new rate)
*
* Smoothing behavior:
* - Rate increases: apply 3:1 EWMA (75% old, 25% new) to avoid pacing jitter
* - Fast-ramp bypass: if new rate > 2x current AND at round start, set directly (no smoothing)
* - Rate decreases: applied immediately without smoothing (drains are instant)
* - Post-STARTUP (full_bw_reached): always update pacing (both increases and decreases)
* - Pre-STARTUP: only update pacing if the rate is increasing
*/
static void ucp_set_pacing_rate(struct sock *sk, u32 bw, int gain)
{
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for srtt_us (used in initialization fallback) */
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for has_seen_rtt and round_start */
unsigned long rate = ucp_bw_to_pacing_rate(sk, bw, gain); /* Compute the target pacing rate from bandwidth and gain (result in bytes/sec) */
/* If SRTT has become available but we haven't done the RTT-based initialization yet, run it now as a fallback */
if (unlikely(!ucp->has_seen_rtt && tp->srtt_us)) { /* Unlikely path: no previous RTT initialization but SRTT just became available */
ucp_init_pacing_rate_from_rtt(sk); /* Initialize pacing rate from cwnd and SRTT since we now have RTT data */
}
/* Conditionally update pacing: always update if pipe is full (post-STARTUP), or only on increases if pipe is not yet full */
if (ucp_full_bw_reached(sk) || rate > sk->sk_pacing_rate) { /* After pipe is full, always apply new rate; before fullness, only allow increases */
if (rate > sk->sk_pacing_rate && /* Only smooth rate increases; decreases (drains) are applied immediately without smoothing */
    !(rate > sk->sk_pacing_rate * 2 && ucp->round_start)) { /* Fast-ramp bypass: a jump above 2x current at a round start is applied unsmoothed for quick response to improved conditions */
rate = (sk->sk_pacing_rate * 3 + rate) / 4; /* EWMA: (old_rate * 3 + new_rate) / 4 = 75% old, 25% new contribution */
}
sk->sk_pacing_rate = rate; /* Write the (potentially smoothed) pacing rate to the socket structure */
}
}
/**
* @brief Determine the minimum number of TSO segments based on the current pacing rate.
* @param sk The TCP socket
* @return Minimum TSO segments: 1 for very low rates (< 150 KBps), 2 for normal rates
*
* Below approximately 150 KBps (UCP_MIN_TSO_RATE >> 3 = 150,000 bytes/sec), using more than 1 TSO segment
* could cause excessive burstiness relative to the drain rate. Returns 1 segment for low rates to
* minimize bursts, and 2 segments for higher rates where TSO batching is beneficial.
*/
static u32 ucp_min_tso_segs(struct sock *sk)
{
/* Compare pacing rate against UCP_MIN_TSO_RATE / 8 (1,200,000 bps / 8 = 150,000 bytes/sec); below this threshold use 1 segment, otherwise use 2 */
return sk->sk_pacing_rate < (UCP_MIN_TSO_RATE >> 3) ? 1 : 2;
}
/**
* @brief Compute the desired TSO/GSO burst size in segments.
* @param sk The TCP socket
* @return The target number of segments per TSO burst, capped at UCP_TSO_MAX_SEGS (127)
*
* Calculates the ideal burst size based on the pacing rate and pacing shift configuration.
* The goal is to emit segments at a rate that matches the pacing rate while staying within GSO_MAX_SIZE.
* The result is clamped between ucp_min_tso_segs() and UCP_TSO_MAX_SEGS (127).
*/
static u32 ucp_tso_segs_goal(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for MSS cache and pacing shift */
u32 bytes, segs; /* bytes = byte budget per GSO interval; segs = resulting segment count in the TSO burst */
/* Compute the per-GSO-interval byte budget: pacing rate shifted right by pacing_shift gives bytes per GSO interval */
bytes = min_t(unsigned long,
sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift), /* Bytes per GSO period = rate / (2^pacing_shift); uses READ_ONCE for lock-free read of the shift value */
GSO_MAX_SIZE - 1 - MAX_TCP_HEADER); /* Cap at maximum GSO segment size minus 1 byte and minus TCP/IP header overhead */
/* Convert byte budget to segments: divide by MSS (bytes per segment), ensuring at least ucp_min_tso_segs() segments */
segs = max_t(u32, bytes / tp->mss_cache, ucp_min_tso_segs(sk));
return min(segs, UCP_TSO_MAX_SEGS); /* Cap segments at the hard maximum of 127 TSO segments per burst to prevent oversized bursts */
}
/**
* @brief Save the current cwnd as prior_cwnd before entering recovery or PROBE_RTT.
* @param sk The TCP socket
*
* Called before transitioning into TCP loss recovery (TCP_CA_Recovery) or PROBE_RTT state.
* The saved cwnd is later used to restore the window when exiting those states.
* If already in a recovery or PROBE_RTT state (detected via prev_ca_state), the maximum
* of current cwnd and prior_cwnd is preserved to avoid shrinking the saved window.
*/
static void ucp_save_cwnd(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for snd_cwnd */
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for prior_cwnd and prev_ca_state */
/* Only save cwnd if not already in a recovery or PROBE_RTT state; otherwise take the max to preserve the highest known good window */
if (ucp->prev_ca_state < TCP_CA_Recovery && ucp->mode != UCP_PROBE_RTT) { /* Previously in a normal state (Open/Disorder/CWR): save current cwnd as prior_cwnd */
ucp->prior_cwnd = tp->snd_cwnd; /* Save the current cwnd as the pre-restriction value for later restoration */
} else { /* Already in a recovery or PROBE_RTT state: take the maximum to prevent prior_cwnd from shrinking */
ucp->prior_cwnd = max(ucp->prior_cwnd, tp->snd_cwnd); /* Keep the larger of the previously saved value and current cwnd */
}
}
/**
* @brief Handle congestion events from the TCP stack (CWND_EVENT callback).
* @param sk The TCP socket
* @param event Type of congestion event (CA_EVENT_TX_START, etc.)
*
* Implements the tcp_congestion_ops.cwnd_event callback. On TX_START after being
* application-limited (tp->app_limited is true), marks idle_restart and transitions the
* pacing rate to 1.0x BtlBw (in PROBE_BW) or checks for early PROBE_RTT exit.
* This prevents bursty behavior after idle periods.
*/
static void ucp_cwnd_event(struct sock *sk, enum tcp_ca_event event)
{
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for the app_limited flag */
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
/* Check if this is a TX_START event (new data being sent) and the socket was application-limited */
if (event == CA_EVENT_TX_START && tp->app_limited) { /* TX_START with app_limited means we are restarting transmission after an idle period */
ucp->idle_restart = 1; /* Set flag indicating we just restarted from idle; affects min_rtt filter to avoid using stale RTT samples */
if (ucp->mode == UCP_PROBE_BW) { /* In steady-state (PROBE_BW): reset pacing rate to 1.0x BtlBw to avoid a rate burst after idle */
ucp_set_pacing_rate(sk, ucp_max_bw(sk), BBR_UNIT); /* Set pacing to 1.0x BtlBw to send at exactly the estimated bottleneck rate */
} else if (ucp->mode == UCP_PROBE_RTT) { /* Currently in PROBE_RTT: new data arriving means we may want to exit PROBE_RTT early */
ucp_check_probe_rtt_done(sk); /* Check if PROBE_RTT should end; may restore cwnd and transition to STARTUP or PROBE_BW */
}
}
}
/**
* @brief Add a filtered RTT sample to the 2-slot circular history buffer.
* @param sk The TCP socket
* @param rtt_us RTT sample value in microseconds (from the rate sample)
*
* Applies two levels of filtering before storing:
* 1. Rejects samples above a ceiling (min(min_rtt * 3, 500ms)) or from delayed ACKs
* 2. Rejects statistical outliers beyond min_rtt + 4 * rttvar
* Valid samples are stored in the 2-element circular buffer indexed by rtt_hist_idx.
*/
static void ucp_add_rtt_sample(struct sock *sk, u32 rtt_us)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for min_rtt and rtt_hist_idx */
u32 rttvar, ceiling; /* rttvar = RTT variance from TCP RTTM estimator; ceiling = maximum acceptable RTT sample value for this connection */
/* Compute the per-connection RTT ceiling: if min_rtt > 1ms, use 3x min_rtt; otherwise use a hard 500ms cap */
ceiling = ucp->min_rtt_us > UCP_BDP_MIN_RTT_US ? /* min_rtt is known and above 1ms floor: use a dynamic ceiling proportional to the connection's baseline RTT */
ucp->min_rtt_us * UCP_RTT_SAMPLE_MAX_MULT : /* Ceiling = min_rtt * 3 (3x the connection's observed minimum RTT) */
(u32)UCP_RTT_SAMPLE_MAX_US; /* Ceiling = hard 500ms absolute limit for connections with very low or unknown min_rtt */
if (rtt_us > ceiling || ucp->has_delayed_ack) { /* Sample exceeds ceiling or was measured during a delayed ACK interval (which inflates RTT) */
return; /* Discard this RTT sample: it is either an outlier or distorted by ACK delay */
}
rttvar = tcp_sk(sk)->rttvar_us; /* Get the smoothed RTT variance estimate from the TCP stack's RTTM (RTT Measurement) engine */
/* Statistical outlier rejection: reject samples more than 4x the smoothed RTT variation (rttvar, a mean deviation) above min_rtt */
if (ucp->min_rtt_us && rttvar && /* Only apply this filter if both min_rtt and rttvar are available (non-zero) */
rtt_us > ucp->min_rtt_us + 4 * rttvar) { /* Sample is beyond 4x rttvar from min_rtt: a strong statistical outlier signal */
return; /* Discard this RTT sample as a statistical outlier */
}
/* Store the validated RTT sample in the 2-slot circular buffer at the current write index */
ucp->rtt_history[ucp->rtt_hist_idx] = rtt_us; /* Write RTT sample to the current history slot (index 0 or 1) */
ucp->rtt_hist_idx ^= 1; /* Toggle the history index (XOR with 1) to advance to the other slot for the next sample */
}
/**
* @brief Approximate the 10th percentile (P10) RTT from the two stored RTT samples.
* @param sk The TCP socket
* @return The lower (better) of the two stored RTT samples as the P10 approximation, or min_rtt_us if no samples exist
*
* With only 2 samples in the history buffer, the P10 is approximated by the minimum of the two.
* Returns the minimum of the two samples if both are available, the single available sample if only
* one exists, or falls back to min_rtt_us if the history is empty.
* This low-percentile RTT is used as the "model RTT" in BDP calculations to avoid over-estimating BDP.
*/
static u32 ucp_get_p10_rtt(const struct sock *sk)
{
const struct ucp *ucp = (const struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state (const access) */
int s0 = !!ucp->rtt_history[0], s1 = !!ucp->rtt_history[1]; /* Convert sample presence to boolean values: !! maps non-zero to 1, zero to 0 */
if (!s0 && !s1) { /* Neither RTT history slot has a valid sample */
return ucp->min_rtt_us; /* Fall back to the minimum RTT estimate as the best available RTT approximation */
}
if (s0 && s1) { /* Both history slots have valid RTT samples */
return min(ucp->rtt_history[0], ucp->rtt_history[1]); /* Approximate P10 as the smaller (better/less-congested) of the two samples */
}
return s0 ? ucp->rtt_history[0] : ucp->rtt_history[1]; /* Only one sample available: return whichever slot is populated */
}
/**
* @brief Compute the RTT increase ratio as (average_RTT / min_rtt - 1) in BBR_SCALE.
* @param sk The TCP socket
* @return RTT increase ratio in BBR_SCALE: 0 = no increase, BBR_UNIT (256) = 100% increase
*
* Measures queuing delay by computing how much the current average RTT has increased above the
* baseline minimum RTT. The formula is (avg_rtt - min_rtt) * BBR_UNIT / min_rtt.
* This ratio is used for congestion severity classification, probe-skip decisions, and path class detection.
*/
static u32 ucp_get_rtt_increase_ratio(const struct sock *sk)
{
const struct ucp *ucp = (const struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state (const access) */
u32 avg = 0; /* Accumulator for computing the arithmetic mean of available RTT history samples */
int v = 0; /* Counter for the number of valid RTT samples in the history buffer */
/* Sum the RTT history samples and count the valid entries */
if (ucp->rtt_history[0]) { avg += ucp->rtt_history[0]; v++; } /* Add first history slot to running sum if it contains a non-zero sample */
if (ucp->rtt_history[1]) { avg += ucp->rtt_history[1]; v++; } /* Add second history slot to running sum if it contains a non-zero sample */
if (!v || !ucp->min_rtt_us) { /* No RTT history samples available OR min_rtt_us has not been established yet */
return 0; /* Cannot compute a meaningful ratio; return 0 indicating no RTT increase */
}
avg /= v; /* Compute the arithmetic mean of the available RTT samples (sum / count) */
if (avg <= ucp->min_rtt_us) { /* Average observed RTT does not exceed the baseline minimum: no measurable queuing delay */
return 0; /* Return 0 to indicate no RTT increase (no queue buildup detected) */
}
/* Compute (avg_rtt - min_rtt) / min_rtt in BBR_SCALE: scaled 64-bit division for precision */
return (u32)(((u64)(avg - ucp->min_rtt_us) * BBR_UNIT) /
ucp->min_rtt_us); /* Result represents 0..N in BBR_SCALE where 256 = 100% increase over baseline */
}
/**
* @brief Compute the Bandwidth-Delay Product (BDP) in packets using the P10 RTT model.
* @param sk The TCP socket (for MSS and UCP state)
* @param bw Bottleneck bandwidth in BW_SCALE (packets per microsecond * 2^24)
* @param gain Gain multiplier in BBR_SCALE (1.0 = BBR_UNIT = 256)
* @return Congestion window target in packets (ceiling rounded): BDP * gain
*
* The "model RTT" is approximated by the P10 RTT from the 2-slot history buffer, clamped
* between min_rtt_us and max(min_rtt_us * 2, 500ms). This prevents the BDP from being
* inflated by transient RTT spikes. Returns TCP_INIT_CWND (10 segments) if min_rtt is invalid.
* Final result: (bw * model_rtt * gain) >> (BBR_SCALE + BW_SCALE) with ceiling rounding.
*/
static u32 ucp_bdp(struct sock *sk, u32 bw, int gain)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
u32 model_rtt, hi; /* model_rtt = P10 RTT estimate for BDP; hi = upper bound for model_rtt clamping (prevents BDP over-estimation) */
u64 w; /* w = intermediate BDP product: bw * model_rtt (in BW_SCALE * microseconds = packets * 2^24) */
/* Guard: if min_rtt_us is uninitialized (~0U) or below the 1ms minimum floor, return a safe default cwnd */
if (unlikely(ucp->min_rtt_us == ~0U || /* Check for uninitialized min_rtt (all 32 bits set to 1, indicating never initialized) */
ucp->min_rtt_us < UCP_BDP_MIN_RTT_US)) { /* Check for min_rtt below the 1ms operational floor */
return TCP_INIT_CWND; /* Return default initial cwnd (10 segments in modern TCP) as a safe fallback when RTT is not yet valid */
}
model_rtt = ucp_get_p10_rtt(sk); /* Get the approximate P10 (10th percentile) RTT from the 2-slot RTT history buffer */
if (model_rtt < ucp->min_rtt_us) { /* P10 RTT should never be lower than the absolute minimum; protect against misestimation */
model_rtt = ucp->min_rtt_us; /* Clamp the model RTT upward to at least the measured minimum RTT */
}
/* Compute the upper bound for model_rtt: the larger of 500ms absolute floor or 2 * min_rtt_us */
hi = (u32)max_t(u64, UCP_BDP_HI_FLOOR_US, /* Absolute floor: 500,000 us (500 ms), prevents excessively tight BDP on very low-RTT paths */
(u64)ucp->min_rtt_us * UCP_BDP_HI_MULT); /* Dynamic ceiling: 2 * min_rtt_us, scales with the connection's baseline RTT */
model_rtt = clamp(model_rtt, ucp->min_rtt_us, hi); /* Clamp model_rtt to the range [min_rtt_us, hi] to bound BDP estimates */
/* Compute the raw BDP product: bw (pkts/usec * 2^24) * model_rtt (usec) yields packets * 2^24 */
w = (u64)bw * model_rtt; /* Product is in units of (packets * 2^24); this is the bandwidth-delay product at BW_SCALE */
/* Apply gain and convert from BW_SCALE back to packets with ceiling rounding: (w * gain / 256 + 2^24 - 1) / 2^24 */
return ((w * gain >> BBR_SCALE) + BW_UNIT - 1) >> BW_SCALE; /* Final result: (bw * model_rtt * gain) >> (BBR_SCALE + BW_SCALE) with ceiling rounding for integer packets */
}
/**
* @brief Update the loss EWMA (Exponentially Weighted Moving Average) with data from a new rate sample.
* @param sk The TCP socket
* @param rs The rate sample structure containing acked_sacked and loss counters
*
* If the rate sample contains packet losses, computes an instantaneous loss ratio
* (losses / total_packets) in BBR_SCALE and updates the EWMA with a 3:1 retained-to-sample ratio.
* If the sample has no losses, the EWMA is decayed by multiplying by 70/100 (0.7x) to gradually
* forget old loss events over loss-free periods.
*/
static void ucp_update_loss_ewma(struct sock *sk, const struct rate_sample *rs)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for loss_ewma */
u32 total, instant; /* total = total packets in this sample (acked + lost); instant = instantaneous loss ratio in BBR_SCALE */
u16 cur; /* Current loss EWMA value reconstructed from compressed u8 + overflow bit */
total = (u32)(rs->acked_sacked + rs->losses); /* Compute total packets accounted for in this rate sample (acknowledged + lost) */
cur = ucp_get_loss_ratio(sk); /* Reconstruct the current loss EWMA from compressed u8 + overflow bit for arithmetic */
if (rs->losses == 0) { /* No packet losses in this sample: decay the loss EWMA to gradually reduce the estimate over clean periods */
cur = cur * UCP_LOSS_EWMA_IDLE_DECAY_NUM / /* Multiply current EWMA by 70 (idle decay numerator) */
UCP_LOSS_EWMA_IDLE_DECAY_DEN; /* Divide by 100 (idle decay denominator); results in 0.7x exponential decay */
ucp_set_loss_ewma(ucp, cur); /* Write back the decayed value through the compressed setter */
return; /* No further update needed for a loss-free sample */
}
/* Compute instantaneous loss ratio: losses / total, scaled to BBR_SCALE (multiply by 256) */
instant = (u32)(((u64)rs->losses * BBR_UNIT) / total); /* Instantaneous loss fraction = losses * 256 / total, result in BBR_SCALE */
if (cur == 0) { /* First time a loss is observed: initialize EWMA directly to the instantaneous value without smoothing */
ucp_set_loss_ewma(ucp, (u16)instant); /* Set initial EWMA to the first observed loss ratio (no prior history to smooth with) */
return; /* Exit since the EWMA has been initialized */
}
/* Standard EWMA update with 3:1 weight ratio: (old_ewma * 3 + instant_ratio * 1) / 4 */
cur = (cur * UCP_LOSS_EWMA_RETAINED_WEIGHT + /* Retained portion: current EWMA * 3 (75% retained memory) */
instant * UCP_LOSS_EWMA_SAMPLE_WEIGHT) / /* New sample portion: instantaneous loss ratio * 1 (25% new data) */
UCP_LOSS_EWMA_TOTAL_WEIGHT; /* Normalization: divide by total weight 4 to get the weighted average */
ucp_set_loss_ewma(ucp, cur); /* Write back the updated EWMA value through the compressed setter */
}
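The 3:1 weighting and 0.7x idle decay can be sanity-checked with a userspace model (the `demo_` helper is illustrative; the weight and decay figures are the documented defaults from the constants above):

```c
#include <stdint.h>
#include <assert.h>

/* One loss-EWMA step: cur is the prior EWMA in 1/256 units;
 * losses/acked are the counts from one rate sample. */
static uint16_t demo_loss_ewma(uint16_t cur, uint32_t losses, uint32_t acked)
{
    uint32_t total = acked + losses;
    if (losses == 0)
        return (uint16_t)(cur * 70 / 100);    /* loss-free sample: 0.7x decay */
    uint32_t instant = (uint32_t)(((uint64_t)losses * 256) / total);
    if (cur == 0)
        return (uint16_t)instant;             /* first loss: seed directly */
    return (uint16_t)((cur * 3 + instant * 1) / 4);  /* 75% old, 25% new */
}
```

A steady 10% loss sample (instant = 25/256) seeds the EWMA at 25 and holds it there, while a single clean sample decays 25 down to 17.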
/**
* @brief Set the loss EWMA value, compressing to u8 with an overflow bit.
* @param ucp The per-connection UCP private state
* @param val Loss ratio in BBR_SCALE (0..BBR_UNIT, where BBR_UNIT = 256 = 100% loss)
*
* Encoded as a 9-bit value: loss_ewma_high carries bit 8, loss_ewma carries
* bits 7-0. Val is clamped to BBR_UNIT before splitting.
*/
static void ucp_set_loss_ewma(struct ucp *ucp, u16 val)
{
if (val > BBR_UNIT) {
val = BBR_UNIT;
}
ucp->loss_ewma_high = (val >> BBR_SCALE) & 1;
ucp->loss_ewma = (u8)val;
}
/**
* @brief Return the current loss ratio EWMA value in BBR_SCALE.
* @param sk The TCP socket
* @return Current smoothed loss ratio in BBR_SCALE (0 = 0% loss, BBR_UNIT = 256 = 100% loss)
*
* Reconstructs the full 9-bit value from the compressed representation:
* value = (loss_ewma_high << 8) | loss_ewma
*/
static inline u16 ucp_get_loss_ratio(const struct sock *sk)
{
const struct ucp *ucp = (const struct ucp *)inet_csk_ca(sk);
return ((u16)ucp->loss_ewma_high << 8) | ucp->loss_ewma;
}
/**
* @brief Set the ECN EWMA value, compressing to u8 with an overflow bit.
* @param ucp The per-connection UCP private state
* @param val ECN marking ratio in BBR_SCALE (0..BBR_UNIT, where BBR_UNIT = 256 = 100% marked)
*
* Encoded as a 9-bit value: ecn_ewma_high carries bit 8, ecn_ewma carries
* bits 7-0. Val is clamped to BBR_UNIT before splitting.
*/
static void ucp_set_ecn_ewma(struct ucp *ucp, u16 val)
{
if (val > BBR_UNIT) {
val = BBR_UNIT;
}
ucp->ecn_ewma_high = (val >> BBR_SCALE) & 1;
ucp->ecn_ewma = (u8)val;
}
/**
* @brief Return the current ECN marking ratio EWMA value in BBR_SCALE.
* @param ucp The per-connection UCP private state
* @return Current smoothed ECN ratio in BBR_SCALE (0 = 0% marked, BBR_UNIT = 256 = 100% marked)
*
* Reconstructs the full 9-bit value from the compressed representation:
* value = (ecn_ewma_high << 8) | ecn_ewma
*/
static u16 ucp_get_ecn_ratio(const struct ucp *ucp)
{
return ((u16)ucp->ecn_ewma_high << 8) | ucp->ecn_ewma;
}
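Both the loss and ECN EWMAs share the same 9-bit split-field encoding. A standalone round-trip sketch (the `demo_ewma9` struct is a hypothetical mirror of the high-bit/low-byte layout, not the module's actual fields):

```c
#include <stdint.h>
#include <assert.h>

struct demo_ewma9 {
    unsigned int high : 1;           /* bit 8: set only when the value is 256 */
    uint8_t low8;                    /* bits 7..0 */
};

static void demo_set9(struct demo_ewma9 *e, uint16_t val)
{
    if (val > 256)
        val = 256;                   /* clamp to BBR_UNIT (100%) */
    e->high = (val >> 8) & 1;
    e->low8 = (uint8_t)val;
}

static uint16_t demo_get9(const struct demo_ewma9 *e)
{
    return (uint16_t)((uint16_t)e->high << 8 | e->low8);
}
```

Any value above BBR_UNIT clamps to 256; everything in 0..256 survives the compress/expand round trip unchanged.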
/**
* @brief Store a delivery-rate bandwidth sample in the 2-slot circular history buffer.
* @param sk The TCP socket
* @param rate_bw Delivery rate in BW_SCALE to store (packets per microsecond * 2^24)
*
* Writes the bandwidth value into the slot indexed by rate_hist_idx, then toggles the index
* (XOR 1) so the next sample will overwrite the other slot. This 2-slot history enables
* ucp_get_delivery_rate_trend() to compare the two most recent samples for rate direction.
*/
static void ucp_add_delivery_rate_sample(struct sock *sk, u32 rate_bw)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for rate_hist_idx */
ucp->deliv_rate_hist[ucp->rate_hist_idx] = rate_bw; /* Store the delivery rate sample at the current write index in the 2-slot circular buffer */
ucp->rate_hist_idx ^= 1; /* Toggle the write index (XOR with 1): switch to the other slot for the next sample */
}
/**
* @brief Compute the signed delivery-rate trend in BBR_SCALE from the two most recent samples.
* @param sk The TCP socket
* @return Rate trend in BBR_SCALE: positive = rate increasing, negative = rate decreasing, 0 = insufficient data
*
* Computes (newer - older) / older as a signed ratio in BBR_SCALE. If the newer sample is less than
* the older, the result is negative, indicating a bandwidth decrease. If the newer is greater,
* the result is positive, indicating bandwidth growth. Returns 0 if either sample is unavailable
* (uninitialized zero values).
*/
static s32 ucp_get_delivery_rate_trend(struct sock *sk)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for rate_hist_idx and delivery history */
int ni = ucp->rate_hist_idx ^ 1; /* Calculate index of the "newer" sample: complement of current write index (the most recently written slot) */
int oi = ucp->rate_hist_idx; /* Index of the "older" sample: current write index points to the slot written two samples ago */
u32 n, o; /* n = newer delivery rate sample (most recently written); o = older delivery rate sample (written before that) */
n = ucp->deliv_rate_hist[ni]; /* Read the newer sample from the slot most recently written (the complement of the write index) */
o = ucp->deliv_rate_hist[oi]; /* Read the older sample from the slot written previously (the current write index) */
if (!n || !o) { /* Either sample is missing (zero = uninitialized, meaning not enough history accumulated yet) */
return 0; /* Cannot compute a meaningful trend with fewer than 2 valid samples */
}
if (n < o) { /* Newer sample is less than older sample: bandwidth is decreasing */
return -(s32)((u64)(o - n) * BBR_UNIT / o); /* Return negative ratio: -(old - new) / old, scaled to BBR_SCALE; negative indicates rate drop */
}
/* Newer sample is greater than or equal to older sample: bandwidth is increasing or stable */
return (s32)((u64)(n - o) * BBR_UNIT / o); /* Return positive ratio: (new - old) / old, scaled to BBR_SCALE; positive indicates rate growth */
}
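The 2-slot indexing convention (write index points at the older slot, its complement at the newer one) is easy to get backwards, so here is an isolated sketch of the trend math (`demo_` names are illustrative):

```c
#include <stdint.h>
#include <assert.h>

/* idx is the next write slot, so hist[idx ^ 1] holds the newest sample.
 * Returns (new - old) * 256 / old, signed; 0 without two valid samples. */
static int32_t demo_rate_trend(const uint32_t hist[2], int idx)
{
    uint32_t n = hist[idx ^ 1];      /* most recently written slot */
    uint32_t o = hist[idx];          /* slot written one sample earlier */
    if (!n || !o)
        return 0;
    if (n < o)
        return -(int32_t)((uint64_t)(o - n) * 256 / o);
    return (int32_t)((uint64_t)(n - o) * 256 / o);
}
```

A rise from 80 to 100 gives +64 (+25% in 1/256 units); the reverse drop gives -51 (about -20%), reflecting the asymmetry of ratios against different denominators.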
/**
* @brief Update the network condition classification (IDLE, LIGHT_LOAD, CONGESTED, RANDOM_LOSS)
* using hysteresis for stability.
* @param sk The TCP socket
* @param rs The rate sample from the TCP stack (used for ECN delta and total packet count)
*
* Combines multiple signals to classify the network condition:
* - Delivery rate trend (via EWMA): direction and magnitude of bandwidth change
* - Loss ratio EWMA: smoothed packet loss percentage
* - ECN marking EWMA: smoothed explicit congestion notification rate
* - RTT increase ratio: queuing delay relative to baseline
*
* Classification rules:
* 1. Rate drop (>15%) + (loss >= 5% or ECN present): signals congestion
* - With RTT rise >= 20% or loss >= 10%: CONGESTED
* - Otherwise: RANDOM_LOSS (rate drop without queue buildup)
* 2. Any loss > 0 without strong rate drop: RANDOM_LOSS
* 3. No loss, no significant rate drop: LIGHT_LOAD
*
* Hysteresis: entering CONGESTED requires 3 consecutive agreeing samples; other transitions need 2.
* When entering CONGESTED, max_bw_non_congested is reset so the bandwidth floor will be re-established.
*/
static void ucp_update_net_condition(struct sock *sk,
const struct rate_sample *rs)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for delivered_ce ECN counter */
s32 rate_change; /* Delivery rate trend in BBR_SCALE (signed: positive = increasing, negative = decreasing); holds the raw 2-slot value first, then the smoothed EWMA */
u32 loss_ratio, ecn_ratio; /* loss_ratio = loss EWMA in BBR_SCALE; ecn_ratio = ECN EWMA or instantaneous ratio in BBR_SCALE */
u8 new_cond; /* Proposed new condition classification for this evaluation round */
u32 ce_delta, total; /* ce_delta = newly marked CE delta; total = total packets in this sample (acked + lost) */
u16 ecn_cur; /* Current ECN EWMA value reconstructed from compressed u8 + overflow bit */
/* If the delivery rate history buffer is completely empty, we cannot classify yet */
if (!ucp->deliv_rate_hist[0] && !ucp->deliv_rate_hist[1]) { /* Both history slots are zero (uninitialized) */
ucp->net_condition = UCP_COND_IDLE; /* Set condition to IDLE: we have no rate history to evaluate */
ucp->cond_confirm = 0; /* Reset the hysteresis confirmation counter for future transitions */
return; /* No further classification is possible without delivery rate history */
}
rate_change = ucp_get_delivery_rate_trend(sk); /* Get the raw signed delivery rate trend from the 2-slot history buffer */
if (ucp->rate_change_ewma == (s32)0x80000000) { /* First trend measurement: initialize the EWMA directly from the raw value (sentinel set in ucp_init) */
ucp->rate_change_ewma = rate_change; /* Set initial EWMA to the first observed trend value (no prior history) */
} else { /* Update the EWMA with strong smoothing (7/8 retained, 1/8 new sample) to filter noise */
ucp->rate_change_ewma = (s32)(((s64)ucp->rate_change_ewma * /* Multiply retained EWMA by the 7/8 weight factor */
UCP_RATE_TREND_EWMA_WEIGHT + (s64)rate_change * /* Add new sample weighted by (1 - 7/8) = 1/8 */
(BBR_UNIT - UCP_RATE_TREND_EWMA_WEIGHT)) / BBR_UNIT); /* Divide by BBR_UNIT to normalize the weighted sum */
}
rate_change = ucp->rate_change_ewma; /* Use the smoothed EWMA rate change for classification decisions */
loss_ratio = ucp_get_loss_ratio(sk); /* Get the current loss EWMA (BBR_SCALE) for loss-based classification */
ce_delta = tp->delivered_ce - ucp->last_delivered_ce; /* Compute the change in CE-marked packet count since the last ACK */
ucp->last_delivered_ce = tp->delivered_ce; /* Update the trailing CE counter to the current value for the next ACK's delta calculation */
if (ce_delta) { /* New ECN Congestion Experienced marks were received in this ACK: update the ECN EWMA */
total = max_t(u32, rs->acked_sacked + rs->losses, 1); /* Total packets acknowledged or lost in this sample; minimum 1 to avoid division by zero */
ecn_ratio = (u32)(((u64)ce_delta * BBR_UNIT) / total); /* Compute instantaneous ECN marking ratio: CE-marked delta / total, scaled to BBR_SCALE */
ecn_cur = ucp_get_ecn_ratio(ucp); /* Reconstruct the current ECN EWMA from compressed u8 + overflow bit */
if (ecn_cur == 0) { /* First ECN sample observed: initialize EWMA directly from the instantaneous ratio */
ucp_set_ecn_ewma(ucp, (u16)ecn_ratio); /* Set initial ECN EWMA to the first observed marking ratio through the compressed setter */
} else { /* Update ECN EWMA with 3:1 smoothing (3/4 retained, 1/4 new sample) */
ecn_cur = (ecn_cur * UCP_ECN_EWMA_RETAINED_WEIGHT + /* Retained portion: old EWMA * 3 (75%) */
ecn_ratio * UCP_ECN_EWMA_SAMPLE_WEIGHT) / /* New sample portion: instantaneous ratio * 1 (25%) */
UCP_ECN_EWMA_TOTAL_WEIGHT; /* Divide by total weight 4 to normalize */
ucp_set_ecn_ewma(ucp, ecn_cur); /* Write back the updated ECN EWMA through the compressed setter */
}
} else { /* No new ECN marks in this ACK: decay the ECN EWMA to gradually reduce the estimate over clean periods */
ecn_cur = ucp_get_ecn_ratio(ucp); /* Reconstruct the current ECN EWMA from compressed u8 + overflow bit for arithmetic */
ecn_cur = ecn_cur * UCP_ECN_EWMA_IDLE_DECAY_NUM / /* Multiply current EWMA by 70 (idle decay numerator) for exponential decay */
UCP_ECN_EWMA_IDLE_DECAY_DEN; /* Divide by 100 (idle decay denominator); results in 0.7x per-sample decay */
ucp_set_ecn_ewma(ucp, ecn_cur); /* Write back the decayed ECN EWMA through the compressed setter */
}
/* Primary classification decision tree based on combined signals */
if (rate_change <= UCP_COND_RATE_DROP_THRESH && /* Smoothed rate trend indicates a significant drop (<= -15% in BBR_SCALE) */
(loss_ratio >= UCP_COND_LOSS_CONGEST_THRESH || ucp_get_ecn_ratio(ucp) > 0)) { /* AND congestion signals are present (loss >= 5% or ECN marks detected) */
u32 rinc = ucp_get_rtt_increase_ratio(sk); /* Get the current RTT increase ratio to disambiguate true congestion from random loss */
if (rinc >= UCP_COND_RINC_CONGEST_THRESH || /* RTT has increased by >= 20% (queue buildup detected) OR */
loss_ratio >= UCP_COND_LOSS_SEVERE_THRESH) { /* Loss rate is severe (>= 10%): strong indication of true congestion */
new_cond = UCP_COND_CONGESTED; /* Strong congestion signals present: classify as truly congested with queue buildup */
} else { /* Rate drop with congestion signals but without significant RTT rise or severe loss: likely random/corruption packet loss */
new_cond = UCP_COND_RANDOM_LOSS; /* Classify as random loss: rate dropped but no queuing delay, suggesting non-congestive packet loss */
}
} else if (loss_ratio > 0) { /* Some packet loss is present but rate trend is not dropping significantly */
new_cond = UCP_COND_RANDOM_LOSS; /* Moderate loss without strong rate drop or congestion signals: classify as random (non-congestive) loss */
} else { /* No loss detected and no significant rate drop: network appears healthy and uncongested */
new_cond = UCP_COND_LIGHT_LOAD; /* Clean conditions: classify as light load (uncongested, underutilized) */
}
/* Apply hysteresis to prevent rapid condition flapping: require multiple consecutive agreeing samples */
if (new_cond == ucp->net_condition) { /* Proposed condition matches the current classification: condition is stable */
ucp->cond_confirm = 0; /* Reset the confirmation counter since the classification is consistent */
} else { /* Proposed condition differs from current: increment the confirmation counter to accumulate evidence */
ucp->cond_confirm++; /* Count another consecutive sample that disagrees with the current condition */
if (ucp->cond_confirm >= (new_cond == UCP_COND_CONGESTED ? /* Different thresholds for entering vs. leaving congestion */
UCP_COND_CONFIRM_ENTER : /* Entering CONGESTED requires UCP_COND_CONFIRM_ENTER (3) consecutive samples (conservative) */
UCP_COND_CONFIRM_EXIT)) { /* Leaving any non-idle condition requires UCP_COND_CONFIRM_EXIT (2) consecutive samples */
ucp->net_condition = new_cond; /* Commit the transition to the new condition classification */
ucp->cond_confirm = 0; /* Reset the confirmation counter for future transitions */
if (new_cond == UCP_COND_CONGESTED) { /* When entering CONGESTED, reset the non-congested bandwidth peak so it will be re-established from scratch */
ucp->max_bw_non_congested = 0; /* Clear the peak non-congested BW; it will be re-measured when conditions improve */
}
}
}
}
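The enter-3/exit-2 hysteresis step can be distilled into a few lines (the `demo_` enum and helper are an illustrative distillation; the real code additionally clears max_bw_non_congested on entry to CONGESTED):

```c
#include <stdint.h>
#include <assert.h>

enum { DEMO_IDLE, DEMO_LIGHT, DEMO_CONGESTED, DEMO_RANDOM };

/* One hysteresis step: entering CONGESTED needs 3 consecutive proposals,
 * any other transition needs 2; agreement clears the evidence counter. */
static void demo_hysteresis(uint8_t *cond, uint8_t *confirm, uint8_t proposed)
{
    if (proposed == *cond) {
        *confirm = 0;                /* agreement: keep class, clear evidence */
        return;
    }
    (*confirm)++;
    if (*confirm >= (proposed == DEMO_CONGESTED ? 3 : 2)) {
        *cond = proposed;            /* enough consecutive evidence: commit */
        *confirm = 0;
    }
}
```

Starting from LIGHT_LOAD, three consecutive CONGESTED proposals are required to flip the state; flipping back only takes two.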
/**
* @brief Update the network path class classification (DEFAULT, LAN, MOBILE, LOSSY_FAT, CONGESTED, VPN).
* @param sk The TCP socket
*
* Classifies the network path based on average RTT, jitter (RTT max-min range), loss ratio EWMA,
* and RTT increase ratio. The classification influences PROBE_RTT interval, bandwidth floor
* selection, and probe gain behavior. Uses hysteresis (class_confirm counter) to prevent rapid
* class flapping. The decision tree evaluates conditions in order of specificity.
*/
static void ucp_update_net_class(struct sock *sk)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
u32 avg = 0, mn = ~0U, mx = 0; /* avg = running sum for mean RTT calculation (microseconds); mn = minimum sample (init to max); mx = maximum sample (init to 0) */
int v = 0; /* Number of valid RTT samples found in the history buffer */
u32 jitter, loss, rinc; /* jitter = RTT range (mx - mn) in us; loss = loss ratio EWMA in BBR_SCALE; rinc = RTT increase ratio in BBR_SCALE */
u8 candidate; /* Proposed new path class for this evaluation; will be compared to current with hysteresis */
/* Accumulate RTT statistics from the 2-slot history buffer */
if (ucp->rtt_history[0]) { /* First slot of the RTT history buffer has a valid sample */
avg += ucp->rtt_history[0]; v++; /* Add to the running sum and increment the sample counter */
mn = mx = ucp->rtt_history[0]; /* Initialize both min and max to the first sample's value for comparison with the second sample */
}
if (ucp->rtt_history[1]) { /* Second slot of the RTT history buffer has a valid sample */
avg += ucp->rtt_history[1]; v++; /* Add to the running sum and increment the sample counter */
mn = min(mn, ucp->rtt_history[1]); /* Update the minimum: keep the smaller of the existing min and this sample */
mx = max(mx, ucp->rtt_history[1]); /* Update the maximum: keep the larger of the existing max and this sample */
}
if (v < 2) { /* Need at least 2 RTT samples for a reasonably reliable path classification */
ucp->net_class = UCP_CLASS_DEFAULT; /* Insufficient data: revert to the DEFAULT unclassified path */
ucp->class_confirm = 0; /* Reset the hysteresis confirmation counter since we are starting fresh */
return; /* Cannot classify the path with fewer than 2 RTT samples */
}
avg /= v; /* Compute the arithmetic mean RTT from the available samples (sum / count) */
jitter = mx - mn; /* Compute jitter as the range (max - min) of the two RTT samples; a simple measure of RTT variability */
loss = ucp_get_loss_ratio(sk); /* Get the current smoothed loss ratio EWMA (BBR_SCALE) */
rinc = ucp_get_rtt_increase_ratio(sk); /* Get the RTT increase ratio (BBR_SCALE): measures queuing delay relative to min_rtt */
/* Decision tree for path class classification: evaluated in order from most significant to least specific */
if (loss > ucp_high_loss_thresh_val) { /* High loss (> 5%) is the strongest and most definitive signal of persistent congestion */
candidate = UCP_CLASS_CONGESTED; /* Loss exceeding the high threshold trumps all other signals: classify as persistently congested */
} else if (avg < UCP_CLASS_LAN_RTT_US && /* Average RTT < 5 ms (very low latency, typical of local networks) */
jitter < UCP_CLASS_LAN_JITTER_US && /* Jitter < 3 ms (very low RTT variation) */
loss < UCP_CLASS_LAN_LOSS_THRESH) { /* Loss < 0.1% (nearly lossless) */
candidate = UCP_CLASS_LAN; /* All three LAN criteria met: classify as a Local Area Network path */
} else if (loss > UCP_CLASS_MOBILE_LOSS_THRESH && /* Loss > 3% (high loss typical of cellular links) */
jitter > UCP_CLASS_MOBILE_JITTER_US) { /* Jitter > 20 ms (high variability typical of mobile networks) */
candidate = UCP_CLASS_MOBILE; /* Both mobile signatures present: classify as a MOBILE/cellular path */
} else if (avg > UCP_CLASS_LOSSY_RTT_US && /* Average RTT > 80 ms (high latency, typical of satellite links) */
loss > UCP_CLASS_LOSSY_LOSS_THRESH) { /* Loss > 1% (significant background packet loss) */
candidate = UCP_CLASS_LOSSY_FAT; /* High latency + significant loss: classify as a LOSSY_FAT (satellite-like) path */
} else if (rinc >= UCP_CLASS_CONG_RINC_THRESH && /* RTT increase ratio >= 50% (significant queuing delay buildup) */
loss >= ucp_low_loss_thresh_val) { /* Loss >= 1% (some packet loss is occurring) */
candidate = UCP_CLASS_CONGESTED; /* Significant RTT rise combined with loss: classify as temporarily CONGESTED */
} else if (avg > UCP_CLASS_VPN_RTT_US) { /* Average RTT > 60 ms (elevated but stable latency) */
candidate = UCP_CLASS_VPN; /* Elevated latency without loss or jitter: classify as a VPN tunnel path */
} else { /* None of the specific path signatures are matched */
candidate = UCP_CLASS_DEFAULT; /* Use the DEFAULT unclassified path: standard parameters apply */
}
/* Apply hysteresis to prevent rapid class flapping */
if (candidate == ucp->net_class) { /* The proposed class matches the current classification: the class is stable */
ucp->class_confirm = 0; /* Reset the disagreement counter so any future transition must build its streak from scratch (mirrors cond_confirm handling) */
} else { /* Proposed class differs from current: accumulate evidence for a transition */
ucp->class_confirm = min_t(u32, ucp->class_confirm + 1, /* Increment with saturation so the 3-bit field cannot wrap to 0 */
UCP_CLASS_CONFIRM_CNT); /* Saturating at the threshold keeps the pending transition armed without overflow */
if (ucp->class_confirm >= UCP_CLASS_CONFIRM_CNT) { /* Need UCP_CLASS_CONFIRM_CNT (2) consecutive disagreeing samples to change class */
ucp->net_class = candidate; /* Commit the transition to the new path class */
ucp->class_confirm = 0; /* Reset the confirmation counter after a successful transition */
}
}
}
}
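The ordering of the decision tree matters: high loss short-circuits everything, and VPN is only reached when no lossier signature matched. The sketch below replicates the order; the loss/rinc thresholds are BBR_SCALE approximations of the documented percentages (5% ≈ 13/256, 3% ≈ 8/256, 1% ≈ 3/256, 50% = 128/256), not the module's actual constants:

```c
#include <stdint.h>
#include <assert.h>

enum { D_DEFAULT, D_LAN, D_MOBILE, D_LOSSY_FAT, D_CONGESTED, D_VPN };

/* avg/jitter in usec; loss/rinc in 1/256 units (illustrative thresholds). */
static int demo_classify(uint32_t avg, uint32_t jitter, uint32_t loss, uint32_t rinc)
{
    if (loss > 13)                                   return D_CONGESTED; /* >~5% loss trumps all */
    if (avg < 5000 && jitter < 3000 && loss < 1)     return D_LAN;       /* low RTT, low jitter, ~lossless */
    if (loss > 8 && jitter > 20000)                  return D_MOBILE;    /* >~3% loss + >20ms jitter */
    if (avg > 80000 && loss > 3)                     return D_LOSSY_FAT; /* >80ms RTT + >~1% loss */
    if (rinc >= 128 && loss >= 3)                    return D_CONGESTED; /* >=50% RTT rise + loss */
    if (avg > 60000)                                 return D_VPN;       /* elevated but clean latency */
    return D_DEFAULT;
}
```

Note that a 100 ms path with ~2% loss lands in LOSSY_FAT rather than VPN precisely because the LOSSY_FAT test runs first.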
/**
* @brief Get the pacing gain for the current PROBE_BW gain cycle phase.
* @param sk The TCP socket
* @return Pacing gain in BBR_SCALE: probe gain (>BBR_UNIT), drain gain (<BBR_UNIT), or 1.0x (cruise)
*
* Phase 0 (probe): Returns the configured probe gain (default 1.10x) unless:
* - Loss >= probe_skip_loss_thresh (2%) OR RTT rise >= probe_skip_rtt_rise_thresh (40%): skip probe, return 1.0x
* - MOBILE path with loss >= drain_thresh (1%): return reduced mobile_probe_gain (1.0x default)
* Phase 1 (drain): Returns 0.75x (standard BBR drain gain) if the probe phase actually applied >1.0x gain;
* otherwise returns 1.0x (no drain needed since probe was skipped).
* Phases 2-7 (cruise): Returns 1.0x (neutral gain, send at estimated BtlBw).
*/
static u32 ucp_get_cycle_pacing_gain(const struct sock *sk)
{
const struct ucp *ucp = (const struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state (const access) */
if (ucp->cycle_idx == 0) { /* Phase 0: probe phase - attempt to send faster than BtlBw to test for newly available bandwidth */
/* Safety check: skip the probe if the network is already under stress (high loss or significant RTT rise) */
if (ucp_get_loss_ratio(sk) >= ucp_probe_skip_loss_val || /* Current loss EWMA >= 2% threshold: probing would likely cause more loss */
ucp_get_rtt_increase_ratio(sk) >= ucp_probe_skip_rtt_rise_val) { /* RTT increase >= 40%: probing would increase queue buildup */
return BBR_UNIT; /* Return neutral gain (1.0x) to avoid aggravating the network condition */
}
/* On MOBILE paths with loss above the drain threshold, use a reduced probe (UCP-specific; BBR mode skips this) */
if (ucp_bbr_mode == 0 && /* Full UCP mode: apply class-based probe gain reduction */
ucp->net_class == UCP_CLASS_MOBILE && /* Path is classified as MOBILE (cellular/high-loss wireless) */
ucp_get_loss_ratio(sk) >= ucp_drain_loss_thresh_val) { /* Loss >= 1% (drain threshold): condition is too lossy for aggressive probing */
return ucp_probe_gain_mobile_val; /* Return the reduced mobile probe gain (default 1.0x = no probe) */
}
return ucp_probe_gain_val; /* Conditions are favorable: return the standard probe gain (default 1.10x; user-configurable via sysfs) */
}
if (ucp->cycle_idx == UCP_PROBE_BW_DRAIN_IDX) { /* Phase 1: drain phase - reduce rate to drain any excess inflight built during the probe */
if (ucp->probe_gain_applied) { /* The preceding probe phase actually applied a gain > 1.0x, so there is excess inflight to drain */
return (BBR_UNIT * 3 / 4); /* Return the standard BBR drain gain: 0.75x BtlBw to drain the queue at a controlled rate */
} else { /* The probe phase was skipped (returned neutral gain 1.0x), so no excess inflight was created and no drain is needed */
return BBR_UNIT; /* Return neutral gain (1.0x) since there is no queue to drain */
}
}
/* Phases 2 through 7: cruise phases - maintain the current rate at exactly the estimated BtlBw */
return BBR_UNIT; /* Neutral gain (1.0x): send at the estimated bottleneck rate without adding or draining queue */
}
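Under clean conditions (no probe skip, probe actually applied) the 8-phase schedule reduces to a small lookup, sketched here with illustrative `demo_` names and the documented default gains:

```c
#include <stdint.h>
#include <assert.h>

#define DEMO_UNIT 256u

/* Gain for one PROBE_BW phase: idx 0 probes, idx 1 drains (only if the
 * probe actually applied >1.0x), idx 2..7 cruise at the estimated BtlBw. */
static uint32_t demo_cycle_gain(uint32_t idx, int probe_applied)
{
    if (idx == 0)
        return DEMO_UNIT * 110 / 100;            /* probe: 1.10x default */
    if (idx == 1)
        return probe_applied ? DEMO_UNIT * 3 / 4 /* drain: 0.75x */
                             : DEMO_UNIT;        /* probe skipped: nothing to drain */
    return DEMO_UNIT;                            /* cruise: 1.0x */
}
```

In 1/256 units the probe gain comes out to 281 (1.10x truncated) and the drain gain to 192 (0.75x); a skipped probe leaves phase 1 neutral.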
/**
* @brief Advance the PROBE_BW gain cycle to the next phase and handle post-probe drain queuing.
* @param sk The TCP socket
*
* Advances cycle_idx to the next phase (modulo 8 = UCP_PROBE_BW_CYCLE_LEN). Before advancing,
* if completing a probe phase (index 0) that actually applied >1.0x gain and the loss ratio
* exceeds the drain threshold, an early drain level (1-3) is queued based on loss severity.
* The phase start timestamp is updated, and the new phase's pacing gain is set.
*/
static void ucp_advance_cycle_phase(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for the delivered_mstamp timestamp */
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
/* Before advancing from the probe phase (index 0), check if an early drain needs to be queued (UCP-specific: BBR does not do loss-triggered early drains) */
if (ucp_bbr_mode == 0 && ucp->cycle_idx == 0 && ucp->probe_gain_applied && /* Only in UCP mode AND completing a probe phase that actually applied >1.0x gain */
ucp_get_loss_ratio(sk) >= ucp_drain_loss_thresh_val) { /* And current loss ratio >= drain trigger threshold (1%): probing caused detectable loss */
u32 loss = ucp_get_loss_ratio(sk); /* Get the current loss EWMA to determine the drain severity level */
if (loss >= ucp_drain_lvl3_loss_thresh_val) { /* Loss >= 10%: most aggressive drain needed for severe loss */
ucp->drain_pending = 3; /* queue aggressive drain (level 3): pacing gain -> 0.65x (default) */
} else if (loss >= ucp_drain_lvl2_loss_thresh_val) { /* Loss >= 5% but < 10%: moderate loss, standard drain */
ucp->drain_pending = 2; /* queue standard drain (level 2): pacing gain -> 0.75x (default) */
} else { /* Loss >= 1% (drain threshold) but < 5%: light loss, gentle drain */
ucp->drain_pending = 1; /* queue light drain (level 1): pacing gain -> 0.85x (default) */
}
}
/* Advance to the next phase in the 8-position gain cycle, wrapping around at the end */
ucp->cycle_idx = (ucp->cycle_idx + 1) & (UCP_PROBE_BW_CYCLE_LEN - 1); /* Increment and wrap via mask: & 7 ensures modulo 8 operation */
ucp_set_cycle_mstamp(ucp, tp->delivered_mstamp); /* Record the microsecond timestamp of when this new phase started */
/* Determine if the new phase will apply a probing gain: only at phase 0 with gain > BBR_UNIT */
ucp->probe_gain_applied = (ucp->cycle_idx == 0 && /* Only the probe phase (index 0) can have >1.0x gain */
ucp_get_cycle_pacing_gain(sk) > BBR_UNIT); /* Gain must strictly exceed 1.0x to be considered a probe */
ucp->pacing_gain = ucp_get_cycle_pacing_gain(sk); /* Set the current pacing gain to the gain value for the newly entered phase */
}
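The loss-to-drain-level mapping queued at the end of a probe phase is a simple severity ladder. Sketch below; the thresholds 26/13/3 are BBR_SCALE approximations of the documented 10%/5%/1% defaults, not the module's tunables:

```c
#include <stdint.h>
#include <assert.h>

/* Map the loss EWMA seen at the end of a probe phase (1/256 units) to the
 * drain level that ucp_apply_pacing_constraints() will later apply. */
static int demo_drain_level(uint32_t loss)
{
    if (loss >= 26) return 3;   /* >= ~10%: aggressive drain (0.65x) */
    if (loss >= 13) return 2;   /* >= ~5%: standard drain (0.75x) */
    if (loss >= 3)  return 1;   /* >= ~1%: light drain (0.85x) */
    return 0;                   /* below the drain threshold: none queued */
}
```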
/**
* @brief Compute the adaptive PROBE_RTT interval in jiffies based on path conditions.
* @param sk The TCP socket
* @return PROBE_RTT interval in jiffies (kernel timer ticks), capped at ucp_probe_rtt_max_jiffies
*
* The interval starts from the base value (default 10 seconds) and adds extra time based on:
* - Path class: MOBILE and LOSSY_FAT paths get class_extra seconds (default +5)
* - High loss (>= high_loss_thresh, 5%): adds high_loss_extra seconds (default 0)
* - Medium loss (>= probe_skip_loss, 2% but < high_loss): adds mid_loss_extra seconds (default 0)
* The total is capped at ucp_probe_rtt_max_jiffies (default max 15 seconds).
*/
static u32 ucp_get_probe_rtt_interval(const struct sock *sk)
{
const struct ucp *ucp = (const struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state (const access) */
u32 interval = ucp_probe_rtt_base_jiffies; /* Start with the base interval (default 10 seconds converted to jiffies) */
u32 loss = ucp_get_loss_ratio(sk); /* Get the current loss EWMA for the extra-time checks */
/* MOBILE and LOSSY_FAT paths need longer intervals between PROBE_RTT entries (UCP-specific: BBR uses fixed interval) */
if (ucp_bbr_mode == 0 && /* Only apply class-based extension in full UCP mode; BBR mode uses fixed interval */
(ucp->net_class == UCP_CLASS_MOBILE || /* Cellular paths have variable RTT; longer interval avoids unnecessary probes */
ucp->net_class == UCP_CLASS_LOSSY_FAT)) { /* Lossy fat pipes (e.g., satellite) have high latency; longer interval avoids excessive RTT samples */
interval += ucp_probe_rtt_class_extra_jiffies; /* Add the class extra time (default 5 seconds in jiffies) */
}
/* Add extra time when loss is elevated: more loss = longer window between RTT refresh operations.
* In BBR mode (ucp_bbr_mode == 1), these UCP-specific extensions are skipped to keep standard BBR behavior. */
if (ucp_bbr_mode == 0) {
if (loss >= ucp_high_loss_thresh_val) { /* Loss EWMA >= 5% (high threshold): significant loss in progress, postpone PROBE_RTT */
interval += ucp_probe_rtt_high_loss_extra_jiffies; /* Add high loss extra time (default 0 seconds, configurable via module parameter) */
} else if (loss >= ucp_probe_skip_loss_val) { /* Loss EWMA >= 2% (probe skip threshold) but below high threshold: moderate loss */
interval += ucp_probe_rtt_mid_loss_extra_jiffies; /* Add mid loss extra time (default 0 seconds, configurable via module parameter) */
}
}
/* Clamp the total interval to the configured maximum to prevent excessively long intervals between RTT refreshes */
return min_t(u32, interval, ucp_probe_rtt_max_jiffies); /* Cap at ucp_probe_rtt_max_jiffies (default max 15 seconds) */
}
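With the default module parameters (loss extras at 0), the interval arithmetic reduces to base-plus-class-extra with a cap. An illustrative sketch (`DEMO_HZ` is an assumed 100 ticks/s; the 10 s / +5 s / 15 s figures are the documented defaults):

```c
#include <stdint.h>
#include <assert.h>

#define DEMO_HZ 100u   /* assumed jiffies per second for this sketch */

/* PROBE_RTT interval in jiffies: 10 s base, +5 s on MOBILE/LOSSY_FAT
 * paths, capped at 15 s (loss extras default to 0 and are omitted). */
static uint32_t demo_probe_rtt_jiffies(int mobile_or_lossy)
{
    uint32_t iv = 10 * DEMO_HZ;                       /* base interval */
    if (mobile_or_lossy)
        iv += 5 * DEMO_HZ;                            /* class extra */
    return iv > 15 * DEMO_HZ ? 15 * DEMO_HZ : iv;     /* cap */
}
```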
/**
* @brief Map a drain level to its corresponding pacing gain in BBR_SCALE.
* @param level Drain severity level: 1 = light, 2 = standard, 3 = aggressive
* @return Pacing gain in BBR_SCALE (all drain gains are < BBR_UNIT to reduce sending rate):
* Level 1: ~0.85x (light), Level 2: 0.75x (standard), Level 3: ~0.65x (aggressive)
* Invalid level or 0: returns BBR_UNIT (1.0x = no drain)
*/
static u32 ucp_drain_gain_by_level(int level)
{
switch (level) { /* Select the appropriate drain pacing gain based on the queued drain severity level */
case 1: return ucp_drain_gain_light_val; /* Level 1: light drain with gentle pacing reduction (default 0.85x = ~217 BBR_SCALE) */
case 2: return ucp_drain_gain_standard_val; /* Level 2: standard drain with moderate pacing reduction (default 0.75x = 192 BBR_SCALE) */
case 3: return ucp_drain_gain_aggressive_val; /* Level 3: aggressive drain with strong pacing reduction (default 0.65x = ~166 BBR_SCALE) */
default: return BBR_UNIT; /* Invalid or zero level: return neutral gain (1.0x = 256 BBR_SCALE) meaning no drain effect */
}
}
/**
* @brief Apply any queued one-shot drain constraints to the pacing gain.
* @param sk The TCP socket
*
* This is called from the main ACK processing path (ucp_main) after ucp_update_model().
* If ucp->drain_pending is non-zero (meaning ucp_advance_cycle_phase queued a drain),
* this function:
* 1. Overrides pacing_gain with the corresponding drain gain from ucp_drain_gain_by_level()
* 2. Clears the pending flag (one-shot: the drain is applied exactly once)
* 3. Records the phase start timestamp at the current delivery time
* This provides a rapid queue-drain response to loss triggered by bandwidth probes.
*/
static void ucp_apply_pacing_constraints(struct sock *sk)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
if (ucp->drain_pending) { /* Check if a drain has been queued (non-zero drain_pending means level 1, 2, or 3 is waiting) */
ucp->pacing_gain = ucp_drain_gain_by_level(ucp->drain_pending); /* Override the cycling pacing gain with the drain-specific gain (0.65x-0.85x) */
ucp->drain_pending = 0; /* Clear the pending flag: this is a one-shot drain that has now been applied */
ucp_set_cycle_mstamp(ucp, tcp_sk(sk)->delivered_mstamp); /* Record the drain start timestamp at the current delivery time marking phase transition */
}
}
/**
* enum ucp_cong_level - Congestion severity levels for graduated cwnd ceiling application
* @UCP_CONG_NONE: No congestion detected; no cwnd cap beyond the standard cwnd gain (2.0x BDP)
* @UCP_CONG_MILD: Mild congestion; slight cwnd cap applied (1.75x BDP by default)
* @UCP_CONG_MODERATE: Moderate congestion; tighter cwnd cap (1.50x BDP by default)
* @UCP_CONG_SEVERE: Severe congestion; most restrictive cwnd cap (1.25x BDP by default)
*/
enum ucp_cong_level {
UCP_CONG_NONE = 0, /* Value 0: No congestion signals detected; cwnd can use the full standard gain (2.0x BDP) */
UCP_CONG_MILD = 1, /* Value 1: Mild congestion; cwnd capped at ucp_cwnd_cap_mild_val (1.75x BDP default) */
UCP_CONG_MODERATE = 2, /* Value 2: Moderate congestion; cwnd capped at ucp_cwnd_cap_moderate_val (1.50x BDP default) */
UCP_CONG_SEVERE = 3, /* Value 3: Severe congestion; cwnd capped at ucp_cwnd_cap_severe_val (1.25x BDP default) */
};
/**
* @brief Determine the current congestion severity level based on RTT increase and loss ratio.
* @param sk The TCP socket
* @return ucp_cong_level enum: NONE (0), MILD (1), MODERATE (2), or SEVERE (3)
*
* Combines two signals (RTT increase ratio and loss ratio) using an OR condition:
* whichever signal indicates higher severity determines the result. This means either
* queuing delay (RTT rise) or packet loss can independently trigger the congestion response.
* The severity level is used by ucp_apply_cwnd_constraints() to select the cwnd cap.
*/
static enum ucp_cong_level ucp_congestion_level(const struct sock *sk)
{
u32 rinc = ucp_get_rtt_increase_ratio(sk); /* Get the RTT increase ratio in BBR_SCALE: represents queuing delay relative to baseline */
u32 loss = ucp_get_loss_ratio(sk); /* Get the smoothed loss ratio EWMA in BBR_SCALE: represents packet loss severity */
/* Check the most severe level first: SEVERE - either RTT doubled or loss >= 5% */
if (rinc >= UCP_CONG_SEVERE_RINC_THRESH || /* RTT increased by >= 100% (doubled): severe queuing delay */
loss >= ucp_cong_severe_loss_val) { /* Loss >= 5% (high threshold): severe packet loss */
return UCP_CONG_SEVERE; /* Both conditions independently indicate severe congestion */
}
/* Check MODERATE: RTT up by >= 50% or loss >= 1% */
if (rinc >= UCP_CONG_MODERATE_RINC_THRESH || /* RTT increased by >= 50%: moderate queuing delay */
loss >= ucp_cong_moderate_loss_val) { /* Loss >= 1% (drain threshold): moderate packet loss */
return UCP_CONG_MODERATE; /* Either condition indicates moderate congestion */
}
/* Check MILD: RTT up by >= 25% or loss at or above the mild loss threshold */
if (rinc >= UCP_CONG_MILD_RINC_THRESH || /* RTT increased by >= 25%: mild queuing delay */
loss >= ucp_cong_mild_loss_val) { /* Loss at or above the mild threshold (lower than the 1% moderate threshold, otherwise this branch would be unreachable via loss): mild packet loss */
return UCP_CONG_MILD; /* Either condition indicates mild congestion */
}
return UCP_CONG_NONE; /* No significant congestion signals: no cwnd cap needed */
}
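The graduated severity decision above can be sketched in a standalone userspace form. All names and threshold constants below are hypothetical demo stand-ins mirroring the comments (BBR_SCALE = 8 fixed point, so 256 == 100%; loss thresholds approximated as 13/256 ≈ 5% and 3/256 ≈ 1%, with a lower mild bar), not the module's actual symbols:

```c
#include <assert.h>
#include <stdint.h>

enum demo_level { DEMO_NONE, DEMO_MILD, DEMO_MODERATE, DEMO_SEVERE };

/* rinc: RTT increase ratio in 1/256 units; loss: loss EWMA in 1/256 units. */
static enum demo_level demo_cong_level(uint32_t rinc, uint32_t loss)
{
	if (rinc >= 256 || loss >= 13)  /* RTT doubled, or loss >= ~5% */
		return DEMO_SEVERE;
	if (rinc >= 128 || loss >= 3)   /* RTT +50%, or loss >= ~1% */
		return DEMO_MODERATE;
	if (rinc >= 64 || loss >= 2)    /* RTT +25%, or a lower mild loss bar */
		return DEMO_MILD;
	return DEMO_NONE;               /* no significant congestion signal */
}
```

Because each tier is an OR of the two signals and tiers are checked from most to least severe, whichever signal indicates the higher severity wins, matching the behavior described in the kernel-doc above.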
/**
* @brief Apply cwnd ceiling caps for congestion severity and STARTUP loss-based gain limits.
* @param sk The TCP socket
*
* Two-stage constraint application in order:
* 1. CONGESTED condition cap: If net_condition is UCP_COND_CONGESTED, reduces cwnd_gain
* based on the current congestion severity level (NONE/MILD/MODERATE/SEVERE).
* 2. STARTUP loss-cap: If in STARTUP mode, applies loss-based gain reduction:
* - Loss >= hard_cap threshold (2%): cap both cwnd_gain and pacing_gain at cwnd_gain_val (2.0x)
* - Loss >= soft_drain threshold (0.5%): cap both gains at startup_soft_gain_val (2.5x)
*/
static void ucp_apply_cwnd_constraints(struct sock *sk)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
u32 cap = ucp_cwnd_gain_val; /* Start with the standard steady-state cwnd gain (2.0x BDP = 512 BBR_SCALE) as the base cap */
/* BBR compatible mode: skip all UCP-specific cwnd constraints (congestion ceiling and STARTUP caps) */
if (ucp_bbr_mode == 1) /* Pure BBR mode: do not apply graduated caps or STARTUP loss-based reduction */
return; /* Return immediately: BBR has no non-destructive cwnd constraint layer */
/* Stage 1: If network condition is classified as CONGESTED, apply graduated cwnd caps */
if (ucp->net_condition == UCP_COND_CONGESTED) { /* Only apply congestion severity caps when the network is actively classified as congested */
switch (ucp_congestion_level(sk)) { /* Determine the current congestion severity from RTT increase and loss ratio */
case UCP_CONG_MILD: /* Mild congestion: slight reduction to keep some headroom */
cap = ucp_cwnd_cap_mild_val; break; /* Cap = 1.75x BDP (default 448 BBR_SCALE): modest window reduction */
case UCP_CONG_MODERATE: /* Moderate congestion: more significant reduction */
cap = ucp_cwnd_cap_moderate_val; break; /* Cap = 1.50x BDP (default 384 BBR_SCALE): tighter window */
case UCP_CONG_SEVERE: /* Severe congestion: aggressive cwnd reduction */
cap = ucp_cwnd_cap_severe_val; break; /* Cap = 1.25x BDP (default 320 BBR_SCALE): most restrictive */
default: break; /* UCP_CONG_NONE: keep the default cap (ucp_cwnd_gain_val = 2.0x BDP) unchanged */
}
}
ucp->cwnd_gain = min_t(u32, ucp->cwnd_gain, cap); /* Apply the cap: cwnd_gain is the minimum of its current value and the selected cap */
/* Stage 2: STARTUP mode loss-based gain reduction (UCP-specific non-destructive constraint) */
if (ucp->mode == UCP_STARTUP) { /* Only apply these limits during the exponential bandwidth probing phase */
u32 loss = ucp_get_loss_ratio(sk); /* Get current loss EWMA for the loss threshold checks */
if (loss >= ucp_startup_hard_cap_val) { /* Loss >= 2% (hard cap threshold): significant loss during STARTUP requires aggressive gain reduction */
ucp->cwnd_gain = min_t(u32, ucp->cwnd_gain, ucp_cwnd_gain_val); /* Cap cwnd gain at standard cwnd gain (2.0x); reduces from UCP_HIGH_GAIN (~2.89x) */
ucp->pacing_gain = min_t(u32, ucp->pacing_gain, ucp_cwnd_gain_val); /* Cap pacing gain at standard cwnd gain (2.0x); reduces from UCP_HIGH_GAIN */
} else if (loss >= ucp_startup_soft_drain_val) { /* Loss >= 0.5% (soft drain threshold): moderate loss during STARTUP requires moderate gain reduction */
ucp->cwnd_gain = min_t(u32, ucp->cwnd_gain, ucp_startup_soft_gain_val); /* Reduce cwnd gain to the soft gain value (2.5x default) */
ucp->pacing_gain = min_t(u32, ucp->pacing_gain, ucp_startup_soft_gain_val); /* Reduce pacing gain to the soft gain value (2.5x default) */
}
}
}
/**
* @brief Apply quantization adjustments to the target cwnd: TSO headroom, even rounding, probe bonus.
* @param sk The TCP socket
* @param cwnd The base target cwnd in packets (from BDP * gain calculation)
* @return The adjusted target cwnd in packets after applying all quantization effects
*
* Three adjustments applied in order:
* 1. TSO headroom: adds UCP_TSO_HEADROOM_SEGS (3) * tso_segs_goal() segments to prevent TSO from starving the pipeline
* 2. Even rounding: rounds up to the nearest even integer for better GSO/TSO alignment
* 3. Probe bonus: adds UCP_PROBE_CWND_BONUS (2) extra segments if in PROBE_BW with >1.0x gain
*/
static u32 ucp_quantization_budget(struct sock *sk, u32 cwnd)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for mode and pacing_gain checks */
cwnd += UCP_TSO_HEADROOM_SEGS * ucp_tso_segs_goal(sk); /* Add TSO headroom: 3 * TSO goal segments to ensure the TSO/GSO layer has enough budget and doesn't stall the pipeline */
if (cwnd < U32_MAX) { /* Avoid overflow when rounding if cwnd is already at the maximum possible value */
cwnd = (cwnd + 1) & ~1U; /* Round up to the nearest even integer: add 1 then clear the LSB; improves GSO alignment on NIC hardware */
}
if (ucp->mode == UCP_PROBE_BW && ucp->pacing_gain > BBR_UNIT) { /* In probe phase with >1.0x gain: add extra segments to ensure we fully probe for new bandwidth */
cwnd += UCP_PROBE_CWND_BONUS; /* Add UCP_PROBE_CWND_BONUS (2) extra segments during probe to push the pipeline and detect newly available bandwidth */
}
return cwnd; /* Return the fully quantized cwnd target with all adjustments applied */
}
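The even-rounding step above is a small bit trick worth isolating. A standalone userspace sketch (hypothetical helper name) of the `(cwnd + 1) & ~1U` idiom:

```c
#include <assert.h>
#include <stdint.h>

/* Round a cwnd up to the nearest even integer: add 1, then clear the LSB.
 * Even values pass through unchanged; odd values round up by one. */
static uint32_t demo_round_even(uint32_t cwnd)
{
	return (cwnd + 1) & ~1u;   /* 7 -> 8, 8 -> 8, 9 -> 10 */
}
```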
/**
* @brief Compute target inflight data as BDP * gain with quantization adjustments.
* @param sk The TCP socket
* @param bw Bottleneck bandwidth in BW_SCALE
* @param gain Gain multiplier in BBR_SCALE
* @return Target inflight cwnd in packets (quantized for TSO, even, and probe bonus)
*
* Convenience wrapper: calls ucp_bdp() to compute the base BDP * gain, then passes the result
* through ucp_quantization_budget() for TSO headroom, even rounding, and probe bonus.
*/
static u32 ucp_inflight(struct sock *sk, u32 bw, int gain)
{
/* Compute BDP * gain as the base target, then apply quantization (TSO headroom, even rounding, probe bonus) */
return ucp_quantization_budget(sk, ucp_bdp(sk, bw, gain));
}
/**
* @brief Estimate the number of packets still in flight at the Earliest Departure Time (EDT) of the next packet.
* @param sk The TCP socket
* @param inflight_now Current number of packets in flight (prior_in_flight from the rate sample)
* @return Projected inflight packets at EDT; 0 if the pipe will have drained completely
*
* Projects how many packets will still be in the network at the earliest departure time of the next
 * packet to be sent. This estimate is used in ucp_is_next_cycle_phase() to determine whether the
 * current PROBE_BW probe phase has achieved its goal: if the projected inflight at EDT has reached
 * the target inflight (or loss occurred), the probe has filled the pipe and the cycle can advance;
 * otherwise the probe has not yet filled the pipe and it continues.
*
* Calculation: inflight_now + TSO goal (if probing) - packets delivered between now and EDT
*/
static u32 ucp_packets_in_net_at_edt(struct sock *sk, u32 inflight_now)
{
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for tcp_clock_cache and tcp_wstamp_ns */
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for pacing_gain */
u64 now_ns = tp->tcp_clock_cache; /* Current time in nanoseconds, cached from the TCP clock for efficiency */
u64 edt_ns = max(tp->tcp_wstamp_ns, now_ns); /* Earliest Departure Time: the later of the current time and the pacing wakeup timestamp; when the next packet can actually be transmitted */
u32 delivered = (u64)ucp_max_bw(sk) * /* Compute expected deliveries between now and EDT: bandwidth * time_delta converted to packets */
div_u64(edt_ns - now_ns, NSEC_PER_USEC) >> BW_SCALE; /* Convert (edt - now) from ns to us, multiply by BW (BW_SCALE), shift to get integer packets */
u32 inflight_at_edt = inflight_now; /* Start with the current inflight count; we will subtract deliveries occurring before EDT */
if (ucp->pacing_gain > BBR_UNIT) { /* If currently in a probe phase with >1.0x gain, additional TSO segments may be scheduled */
inflight_at_edt += ucp_tso_segs_goal(sk); /* Add the TSO goal segment count to account for TSO burst that will be queued before EDT */
}
if (delivered >= inflight_at_edt) { /* More packets will be delivered than are currently in flight: the pipe will drain completely before EDT */
return 0; /* Return 0: no packets will be in net at EDT, the pipe is empty */
}
return inflight_at_edt - delivered; /* Return the projected inflight count: current + TSO burst - deliveries during the interval */
}
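The delivery projection in the function above reduces to one fixed-point multiply and shift. A standalone userspace sketch, assuming BW_SCALE = 24 (so bandwidth is packets-per-microsecond shifted left by 24); the helper name is a hypothetical demo stand-in:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BW_SCALE 24

/* Expected deliveries between now and EDT: bandwidth (BW_SCALE fixed point,
 * pkts/us << 24) multiplied by the time delta in microseconds, shifted back
 * down to whole packets. */
static uint32_t demo_delivered_before_edt(uint64_t bw, uint64_t dt_us)
{
	return (uint32_t)((bw * dt_us) >> DEMO_BW_SCALE);
}
```

For example, a bandwidth of 0.5 pkt/us (`1 << 23` in this fixed point) over a 100 us gap projects 50 packets delivered before EDT.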
/**
* @brief Handle cwnd adjustments for entry into and exit from TCP fast recovery.
* @param sk The TCP socket
* @param rs The rate sample (contains loss count)
* @param acked Number of packets acknowledged in this ACK
* @param new_cwnd Output parameter: the computed new cwnd value
* @return true if cwnd was set for recovery entry (caller should use cwnd and skip normal update);
* false if no recovery entry action taken (caller continues with normal cwnd computation)
*
* On entry to TCP_CA_Recovery (state transition from non-recovery to recovery):
* - Sets fast_recovery flag
* - Reduces cwnd by the number of lost packets (packet conservation)
* - Sets cwnd to max(reduced_cwnd, packets_in_flight + acked) to avoid window collapse
* On exit from TCP_CA_Recovery:
* - Restores cwnd to max(current_cwnd, prior_cwnd) to recover the pre-recovery window
* - Clears fast_recovery flag
*/
static bool ucp_set_cwnd_to_recover_or_restore(
struct sock *sk, const struct rate_sample *rs, u32 acked, u32 *new_cwnd)
{
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for snd_cwnd */
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
u8 ps = ucp->prev_ca_state; /* The TCP CA state before the most recent transition; used to detect transitions */
u8 st = inet_csk(sk)->icsk_ca_state; /* The current TCP CA state (Open, Disorder, CWR, Recovery, Loss) */
u32 cwnd = tp->snd_cwnd; /* Current congestion window in packets */
/* Detect entry into fast recovery: state changed to Recovery and it was NOT in Recovery before */
if (st == TCP_CA_Recovery && ps != TCP_CA_Recovery) { /* Transition detected: we just entered TCP_CA_Recovery */
ucp->prev_ca_state = st; /* Update the saved previous state to current for next transition detection */
ucp->fast_recovery = 1; /* Set the fast_recovery flag: indicates we are handling a non-congestion loss recovery */
cwnd = max_t(s32, cwnd - (s32)rs->losses, 1); /* Reduce cwnd by the number of lost packets (packet conservation principle); signed max_t so that losses exceeding cwnd cannot wrap around, minimum 1 */
*new_cwnd = max(cwnd, tcp_packets_in_flight(tp) + acked); /* Ensure cwnd is at least what's still in flight plus what was just ACKed */
return true; /* Signal to caller: cwnd has been set for recovery entry, skip normal BDP-based cwnd update */
}
/* Detect exit from recovery: previous state was Recovery (or higher) but current state is not */
if (ps >= TCP_CA_Recovery && st < TCP_CA_Recovery) { /* Transition detected: we just exited TCP_CA_Recovery (back to Open/Disorder/CWR) */
cwnd = max(cwnd, ucp->prior_cwnd); /* Restore cwnd to at least the saved prior_cwnd (pre-recovery window) */
ucp->fast_recovery = 0; /* Clear the fast_recovery flag: recovery is complete */
}
ucp->prev_ca_state = st; /* Update the saved previous state to the current state for next transition detection */
*new_cwnd = cwnd; /* Output the computed (or current) cwnd value */
return false; /* Signal to caller: continue with normal cwnd computation */
}
/**
* @brief Compute and set the congestion window for the current ACK.
* @param sk The TCP socket
* @param rs The rate sample from the TCP stack
* @param acked Number of packets acknowledged in this ACK
* @param bw Current bottleneck bandwidth estimate in BW_SCALE
* @param gain Current cwnd gain in BBR_SCALE (may have been capped by constraints)
*
* Implements the core cwnd update algorithm:
* 1. If no packets acked, skip the update (cwnd unchanged)
* 2. If entering fast recovery, use the recovery cwnd (from ucp_set_cwnd_to_recover_or_restore)
* 3. Compute target = BDP * gain + quantization
* 4. Clamp target to [1.25x BDP, 2.0x BDP] via inflight bounds
* 5. If pipe is full (post-STARTUP): cwnd = min(cwnd + acked, target) (AIMD-like increase)
* 6. If pipe is not yet full: cwnd += acked (exponential growth during STARTUP)
* 7. Apply minimum cwnd floor (UCP_CWND_MIN_TARGET = 4)
* 8. If just exiting PROBE_RTT, restore cwnd to prior_cwnd at minimum
* 9. Clamp final cwnd to tp->snd_cwnd_clamp
* 10. If in PROBE_RTT mode, force cwnd to 4 (UCP_CWND_MIN_TARGET)
*/
static void ucp_set_cwnd(struct sock *sk, const struct rate_sample *rs,
u32 acked, u32 bw, int gain)
{
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for snd_cwnd and snd_cwnd_clamp */
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
u32 cwnd = tp->snd_cwnd, target; /* cwnd = current cwnd; target = computed BDP-based target cwnd */
/* If no packets were acknowledged in this ACK, there is no new information to update cwnd */
if (!acked) {
goto done; /* Skip all cwnd update logic and jump to the final clamping section */
}
/* Check if we are entering fast recovery: if so, use the recovery-special cwnd and skip normal update */
if (ucp_set_cwnd_to_recover_or_restore(sk, rs, acked, &cwnd)) {
goto done; /* Recovery entry handled cwnd; skip normal BDP-based computation */
}
/* Compute the target cwnd as BDP * gain with quantization (TSO headroom, even rounding, probe bonus) */
target = ucp_quantization_budget(sk, ucp_bdp(sk, bw, gain));
/* Clamp the target cwnd to a reasonable range based on min_rtt-based BDP bounds */
if (ucp->min_rtt_us != ~0U && ucp->min_rtt_us > 0 && bw > 0) { /* Validate that all BDP inputs are available and valid */
u64 bdp = (u64)bw * ucp->min_rtt_us >> BW_SCALE; /* Compute min-RTT-based BDP: bw * min_rtt in packets (bw is in BW_SCALE, shift right 24 for integer) */
u32 lo = (u32)max_t(u64, TCP_INIT_CWND, /* Lower bound: at least the initial cwnd (10 segments) */
(bdp * UCP_INFLIGHT_LOW_GAIN) >> BBR_SCALE); /* Lower bound: BDP * 1.25x (in BBR_SCALE), prevents cwnd from dropping below 1.25x BDP */
u32 hi = (u32)max_t(u64, lo, /* Upper bound: at least the lower bound */
(bdp * UCP_INFLIGHT_HIGH_GAIN) >> BBR_SCALE); /* Upper bound: BDP * 2.0x (in BBR_SCALE), prevents cwnd from exceeding 2.0x BDP */
target = clamp(target, lo, hi); /* Clamp the computed target cwnd to the [1.25x BDP, 2.0x BDP] range */
}
/* Apply ACK aggregation compensation: add a cwnd bonus based on recent excess ACK counts.
* Equivalent to BBR's extra_acked logic but using a single-slot exponential-decay max
* instead of a dual-slot sliding window (see ucp_update_ack_aggregation for details).
* Intentionally placed after the inflight bounds [1.25x, 2.0x] clamping so the compensation
* can lift the effective target above the nominal steady-state ceiling. This is safe because:
* - extra_acked_max is u8-saturated (max 255 pkts) and decays exponentially
* - compensation is bounded by tp->snd_cwnd_clamp (the socket's absolute max)
* - the non-destructive constraint layer (ucp_apply_cwnd_constraints) overrides cwnd_gain
* in CONGESTED conditions, compressing the effective window regardless of target
*/
if (ucp_extra_acked_gain_val > 0 && ucp->extra_acked_max > 0) { /* Compensation is enabled and there is a non-zero excess max recorded */
u32 comp = ((u32)ucp->extra_acked_max * ucp_extra_acked_gain_val) >> BBR_SCALE; /* Compute compensation = extra_acked_max * gain, scaled down by BBR_SCALE */
target = min_t(u32, target + comp, tp->snd_cwnd_clamp); /* Add compensation to target, clamped to socket max window */
}
/* Apply the cwnd update rule based on whether the pipe is full */
if (ucp_full_bw_reached(sk)) { /* Pipe is full (post-STARTUP or PROBE_BW): additive increase toward target */
cwnd = min(cwnd + acked, target); /* Increase cwnd by acked but not beyond the target (BDP-derived ceiling) */
} else if (cwnd < target || tp->delivered < TCP_INIT_CWND) { /* Pipe not yet full AND either cwnd is below target or the connection is very young: keep growing */
cwnd += acked; /* Grow by the full acked amount with no target ceiling: per-ACK increase compounds into exponential per-RTT growth during STARTUP */
}
/* Enforce the absolute minimum congestion window of UCP_CWND_MIN_TARGET (4 packets) */
cwnd = max(cwnd, UCP_CWND_MIN_TARGET);
/* If we just restored cwnd after exiting PROBE_RTT, ensure cwnd is at least prior_cwnd */
if (unlikely(ucp->probe_rtt_restored)) { /* Unlikely path: probe_rtt_restored flag set meaning we just exited PROBE_RTT */
cwnd = max(cwnd, ucp->prior_cwnd); /* Restore cwnd to at least the saved pre-PROBE_RTT window */
ucp->probe_rtt_restored = 0; /* Clear the flag: restoration has been applied */
}
done:
tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp); /* Write the final cwnd, clamped to the socket's configured maximum (snd_cwnd_clamp) */
if (ucp->mode == UCP_PROBE_RTT) { /* If currently in PROBE_RTT mode, override cwnd to the minimum target */
tp->snd_cwnd = min(tp->snd_cwnd, UCP_CWND_MIN_TARGET); /* Force cwnd to 4 packets during PROBE_RTT to drain the bottleneck queue */
}
}
/**
* @brief Check whether the current PROBE_BW gain phase should advance to the next phase.
* @param sk The TCP socket
* @param rs The rate sample (contains prior_in_flight)
* @return true if the current phase is complete and the cycle should advance; false to stay in the current phase
*
* For phases with pacing_gain <= 1.0x (drain/cruise): advances when the phase has lasted at least min_rtt_us
* (one full RTT of real time). This ensures enough time for the queue to drain.
*
 * For phases with pacing_gain > 1.0x (probe): advances only when both:
 * 1. One full RTT has elapsed since the phase started
 * 2. Packet loss was detected OR the estimated inflight at EDT has reached the target inflight
 * (meaning the probe has filled the pipe and further probing would only build queue).
 * This prevents advancing prematurely while the probe is still filling the pipe.
*/
static bool ucp_is_next_cycle_phase(struct sock *sk,
const struct rate_sample *rs)
{
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for delivered_mstamp */
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
bool is_full_length = tcp_stamp_us_delta(tp->delivered_mstamp, /* Compute time elapsed (in microseconds) since the current phase started */
ucp_get_cycle_mstamp(ucp)) > ucp->min_rtt_us; /* Compare elapsed time to min_rtt_us: true if the phase has lasted at least one full RTT */
if (ucp->pacing_gain <= BBR_UNIT) { /* Drain or cruise phase (gain <= 1.0x): only the time-based condition matters */
return is_full_length; /* Advance when the phase has run for at least one min_rtt interval */
}
/* Probe phase (gain > 1.0x): end when time-based condition met AND (losses detected OR inflight still full) */
return is_full_length && /* Phase has lasted at least one min_rtt */
(rs->losses || /* Packet loss detected: terminate probe phase early to avoid further damage */
ucp_packets_in_net_at_edt(sk, rs->prior_in_flight) >= /* Estimated packets in the network at EDT of next packet */
ucp_inflight(sk, ucp_max_bw(sk), ucp->pacing_gain)); /* Inflight has reached the probe-gain target: the pipe is full, so the probe phase can end */
}
/**
* @brief Update the PROBE_BW gain cycle phase if conditions are met.
* @param sk The TCP socket
* @param rs The rate sample from the TCP stack
*
* Called from ucp_update_model() as part of the per-ACK model update pipeline.
* Only acts when in PROBE_BW mode and the current phase completion conditions are satisfied.
* When ucp_is_next_cycle_phase() returns true, advances to the next gain phase.
*/
static void ucp_update_cycle_phase(struct sock *sk,
const struct rate_sample *rs)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state for mode check */
if (ucp->mode == UCP_PROBE_BW && ucp_is_next_cycle_phase(sk, rs)) { /* Only advance if in PROBE_BW mode AND the phase completion condition is met */
ucp_advance_cycle_phase(sk); /* Advance to the next phase in the 8-phase gain cycle */
}
}
/**
* @brief Reset the operating mode after DRAIN or PROBE_RTT: transition to STARTUP or PROBE_BW.
* @param sk The TCP socket
*
* Called when DRAIN completes or after PROBE_RTT exits. Determines the next mode:
* - If full_bw_reached is false (pipe never filled): transition to STARTUP (re-enter exponential growth)
* - If full_bw_reached is true (pipe was full before): transition to PROBE_BW (steady-state cycling)
* with a randomized initial cycle phase to desynchronize multiple connections
*/
static void ucp_reset_mode(struct sock *sk)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
if (!ucp_full_bw_reached(sk)) { /* The pipe was never confirmed full: return to STARTUP for exponential bandwidth search */
ucp->mode = UCP_STARTUP; /* Switch to STARTUP mode: begin probing for bandwidth with high gain (~2.89x) */
} else { /* The pipe was previously full: enter steady-state PROBE_BW cycling */
ucp->mode = UCP_PROBE_BW; /* Switch to PROBE_BW mode: begin the 8-phase steady-state gain cycle */
/* Randomize the starting cycle index to desynchronize multiple concurrent connections */
ucp->cycle_idx = UCP_PROBE_BW_CYCLE_LEN - 1 - /* Start near the end of the prior cycle: set to 7 - random(0..6) so concurrent flows begin at different phases */
prandom_u32_max(UCP_PROBE_BW_CYCLE_RAND); /* prandom_u32_max(7) returns a value in 0..6 (UCP_PROBE_BW_CYCLE_RAND = 7, exclusive upper bound) for phase start randomization */
ucp_advance_cycle_phase(sk); /* Immediately advance to the next phase (likely phase 0 probe) to begin the steady-state cycle */
}
}
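The start-phase randomization above can be checked with a standalone userspace sketch. Assuming CYCLE_LEN = 8 and a random draw r in the half-open range [0, 7) (as prandom_u32_max(7) returns), cycle_idx = 7 - r lands in 1..7, and the immediate advance then wraps it into the 8-phase cycle; the names below are hypothetical demo stand-ins:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_CYCLE_LEN 8

/* Starting cycle index for a given random draw r in [0, 7):
 * idx = 7 - r (1..7), then one advance with wrap-around modulo 8. */
static uint32_t demo_start_idx(uint32_t r)
{
	uint32_t idx = DEMO_CYCLE_LEN - 1 - r;      /* 1..7 */
	return (idx + 1) & (DEMO_CYCLE_LEN - 1);    /* advance once, mod 8 */
}
```

Different draws land different connections at different phases, which is what desynchronizes concurrent flows sharing a bottleneck.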
/**
* @brief Update the bottleneck bandwidth (BtlBw) estimate with a new delivery rate sample.
* @param sk The TCP socket
* @param rs The rate sample containing delivery rate and timestamp info
*
* Core bandwidth estimation pipeline:
* 1. Detects packet-timed round boundaries using delivered counter progression
* 2. Computes the delivery rate sample: delivered / interval_us in BW_SCALE
* 3. Stores the sample in the 2-slot delivery rate history
* 4. Updates the non-congested peak bandwidth (max_bw_non_congested)
* 5. Applies adaptive bandwidth floor based on path class and loss conditions
* 6. Updates the running-max bandwidth filter (minmax) with the (potentially floored) sample
*
* The adaptive bandwidth floor prevents the BtlBw estimate from dropping too low during
* transient idle or application-limited periods. The floor is calculated as a percentage
* of the peak non-congested bandwidth, and is only applied when loss is below the loss cap.
*/
static void ucp_update_bw(struct sock *sk, const struct rate_sample *rs)
{
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for delivered counter */
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
u64 bw; /* Computed bandwidth sample in BW_SCALE; may be modified by floor logic before being added to the max filter */
ucp->round_start = 0; /* Clear the round_start flag: will be re-set below if this ACK completes a new round */
/* Validate the rate sample: must have positive delivered count and positive interval */
if (rs->delivered < 0 || rs->interval_us <= 0) {
return; /* Invalid rate sample: skip bandwidth update (no data to compute a rate) */
}
/* Check for packet-timed round boundary: a round completes when prior_delivered passes next_rtt_delivered */
if (unlikely(ucp->next_rtt_delivered == 0)) { /* First ACK: initialize next_rtt_delivered without counting a spurious round (matches BBR behavior) */
ucp->next_rtt_delivered = tp->delivered; /* Set the first round boundary to current delivered count */
ucp->round_start = 1; /* Signal round start for initialization; rtt_cnt remains 0 to avoid a fake round increment */
} else if (!before(rs->prior_delivered, ucp->next_rtt_delivered)) { /* The ACK's prior_delivered >= next_rtt_delivered: this ACK completes a round */
ucp->next_rtt_delivered = tp->delivered; /* Set the next round boundary to the current total delivered count */
ucp->rtt_cnt++; /* Increment the round counter: one more packet-timed round has been completed */
ucp->round_start = 1; /* Set the round_start flag and signal that a new round has begun; used by pacing fast-ramp bypass */
}
{ /* Compute the delivery rate sample: delivered packets / interval in microseconds, scaled to BW_SCALE */
s64 sample = div64_long((u64)rs->delivered * BW_UNIT, /* Multiply delivered count by BW_UNIT (2^24) to convert to BW_SCALE */
rs->interval_us); /* Divide by the sample interval (microseconds); result is bandwidth in BW_SCALE (packets/usec * 2^24) */
ucp_add_delivery_rate_sample(sk, (u32)sample); /* Store the rate sample in the 2-slot circular history */
bw = sample; /* Set bw to the raw computed sample; may be modified by floor logic below */
}
/* Only update the bandwidth filter if this sample is not application-limited, or if it exceeds the current max */
if (!rs->is_app_limited || bw >= ucp_max_bw(sk)) { /* Skip rate samples taken during application-limited periods unless they set a new max */
/* Update the peak non-congested bandwidth if in a non-congested condition */
if (ucp->net_condition != UCP_COND_CONGESTED && /* The network is not currently classified as congested */
(u32)bw > ucp->max_bw_non_congested) { /* And this sample exceeds the previous peak non-congested bandwidth */
ucp->max_bw_non_congested = (u32)bw; /* Update the non-congested peak bandwidth to this higher sample */
}
/* Apply the adaptive bandwidth floor (UCP-specific non-destructive constraint: disabled in BBR compatible mode) */
if (ucp_bbr_mode == 0 && /* Bandwidth floor is a UCP-specific feature; BBR mode skips all floor logic */
ucp->net_condition != UCP_COND_CONGESTED && /* Only apply the bandwidth floor when not in a congested condition */
ucp->max_bw_non_congested > 0) { /* A non-congested peak bandwidth has been established */
/* Per-class bandwidth floor selection: each path class has its own configurable parameter */
u32 pct; /* Floor permyriad value (1/10000); 0 disables the floor for that class */
switch (ucp->net_class) { /* Select the floor permyriad based on the current path classification */
case UCP_CLASS_LAN: /* LAN paths: low latency, minimal variation; floor typically disabled */
pct = ucp_bw_floor_lan_val; /* Use the LAN-specific floor parameter (default 0 = disabled) */
break;
case UCP_CLASS_VPN: /* VPN paths: encapsulation adds overhead but bandwidth is usually stable */
pct = ucp_bw_floor_vpn_val; /* Use the VPN-specific floor parameter (default 0 = disabled) */
break;
case UCP_CLASS_MOBILE: /* Mobile paths: high jitter and variable radio conditions */
pct = ucp_bw_floor_mobile_val; /* Use the mobile floor parameter (default 2500 = 25%) */
break;
case UCP_CLASS_LOSSY_FAT: /* Lossy fat paths: background loss can depress bandwidth measurements */
pct = ucp_bw_floor_lossy_fat_val; /* Use the LOSSY_FAT-specific floor parameter (default 2500 = 25%) */
break;
case UCP_CLASS_CONGESTED: /* Congested-class paths: floor only active briefly after real-time congestion clears */
pct = ucp_bw_floor_congested_val; /* Use the CONGESTED-class floor parameter (default 2000 = 20%) */
break;
default: /* DEFAULT paths: general internet or unclassified */
pct = ucp_bw_floor_default_val; /* Use the default floor parameter (default 2000 = 20%) */
break;
}
/* Apply the floor only if the current loss level is below the loss cap threshold */
if (pct && ucp_get_loss_ratio(sk) < ucp_bw_floor_loss_cap_val) { /* Floor is enabled (non-zero) and loss is below the loss cap (5%) */
u64 floor_bw = (u64)ucp->max_bw_non_congested * pct / 10000; /* Compute floor: non-congested peak * permyriad / 10000 */
if (bw < floor_bw) { /* The current bandwidth sample is below the floor: apply the floor */
bw = floor_bw; /* Replace the bandwidth sample with the computed floor value to prevent BtlBw from dropping too low */
}
}
}
/* Update the running-maximum filter with the (possibly floored) bandwidth sample */
minmax_running_max(&ucp->bw, UCP_BW_RTTS, ucp->rtt_cnt, (u32)bw); /* Add the sample to the minmax filter with window size of 10 packet-timed rounds */
}
}
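The per-class floor arithmetic above uses permyriad (1/10000) units. A standalone userspace sketch of the floor computation, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth floor = non-congested peak * permyriad / 10000.
 * E.g. permyriad 2500 keeps the floor at 25% of the peak; 0 disables it. */
static uint64_t demo_bw_floor(uint64_t peak_bw, uint32_t permyriad)
{
	return peak_bw * permyriad / 10000;
}
```

A sample below this floor (and with loss under the 5% cap) would be lifted to the floor value before entering the minmax filter, which is what keeps BtlBw from collapsing during transient idle periods.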
/**
* @brief Detect whether the STARTUP phase has filled the bottleneck pipe.
* @param sk The TCP socket
* @param rs The rate sample (used for is_app_limited check)
*
* Implements the BBR full-pipe detection: if bandwidth growth does not exceed 1.25x for
* UCP_FULL_BW_CNT (3) consecutive packet-timed rounds, the pipe is declared full.
* The full_bw snapshot is updated each time growth exceeds the 1.25x threshold,
* resetting the counter. Application-limited rounds are excluded from the check.
* When full_bw_reached is set, it triggers the STARTUP -> DRAIN transition.
*/
static void ucp_check_full_bw_reached(struct sock *sk,
const struct rate_sample *rs)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
u32 bw_thresh; /* The 1.25x growth threshold: full_bw * 125% in BBR_SCALE */
/* Skip check if the pipe is already declared full, or this is not a round start, or the sample is app-limited */
if (ucp_full_bw_reached(sk) || !ucp->round_start || rs->is_app_limited) {
return; /* No update needed: pipe already full, not a new round, or application-limited sample is unreliable */
}
bw_thresh = (u64)ucp->full_bw * UCP_FULL_BW_THRESH >> BBR_SCALE; /* Compute 1.25x threshold: full_bw * 320 / 256 (UCP_FULL_BW_THRESH = 320 in BBR_SCALE units; 320/256 = 1.25x) */
if (ucp_max_bw(sk) >= bw_thresh) { /* Current max BW >= 1.25x of the last confirmed full_bw snapshot: bandwidth is still growing */
ucp->full_bw = ucp_max_bw(sk); /* Update the full_bw snapshot to the current max bandwidth (new growth high-water mark) */
ucp->full_bw_cnt = 0; /* Reset the non-growth counter: we just saw significant growth */
return; /* Pipe is not yet full; continue STARTUP probing */
}
ucp->full_bw_cnt++; /* Increment the non-growth counter: this round did not achieve 1.25x growth */
ucp->full_bw_reached = (ucp->full_bw_cnt >= UCP_FULL_BW_CNT); /* Set full_bw_reached if 3 consecutive rounds without 1.25x growth: pipe is considered full */
}
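The growth check reduces to one fixed-point comparison per round. A userspace sketch of the same state machine (struct and function names are illustrative; the constants come from the comments above: 320/256 = 1.25x growth threshold, 3 non-growth rounds):

```c
#include <assert.h>
#include <stdint.h>

#define BBR_SCALE	8	/* BBR fixed-point shift: 256 = 1.0x */
#define FULL_BW_THRESH	320	/* 320/256 = 1.25x growth threshold */
#define FULL_BW_CNT	3	/* rounds without growth => pipe full */

struct full_bw_state {
	uint32_t full_bw;	/* last confirmed growth high-water mark */
	uint32_t cnt;		/* consecutive non-growth rounds */
	int	 reached;	/* pipe declared full */
};

/* Feed one round's max-bw sample; mirrors ucp_check_full_bw_reached(). */
static void full_bw_round(struct full_bw_state *s, uint32_t max_bw)
{
	uint32_t thresh = (uint64_t)s->full_bw * FULL_BW_THRESH >> BBR_SCALE;

	if (s->reached)
		return;
	if (max_bw >= thresh) {		/* still growing >= 1.25x */
		s->full_bw = max_bw;
		s->cnt = 0;
		return;
	}
	if (++s->cnt >= FULL_BW_CNT)
		s->reached = 1;
}
```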
/**
* @brief Handle the STARTUP -> DRAIN and DRAIN -> PROBE_BW state transitions.
* @param sk The TCP socket
* @param rs The rate sample (currently unused by DRAIN completion check)
*
* Two transitions:
* 1. STARTUP -> DRAIN: triggered when full_bw_reached is set. Sets the slow start threshold
* to the current inflight at 1.0x BDP and transitions to DRAIN mode.
* 2. DRAIN -> PROBE_BW (or back to STARTUP if pipe was never filled): triggered when the
* estimated inflight at EDT drops to or below the target inflight at 1.0x BDP,
* indicating the excess queue from STARTUP has been fully drained.
*/
static void ucp_check_drain(struct sock *sk, const struct rate_sample *rs)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
/* STARTUP -> DRAIN transition: when the pipe is declared full */
if (ucp->mode == UCP_STARTUP && ucp_full_bw_reached(sk)) { /* Currently in STARTUP AND pipe is now full */
ucp->mode = UCP_DRAIN; /* Transition to DRAIN mode: begin draining the excess queue built during STARTUP */
tcp_sk(sk)->snd_ssthresh = ucp_inflight(sk, ucp_max_bw(sk), /* Set slow start threshold to the current BDP at 1.0x gain */
BBR_UNIT); /* ssthresh = current inflight at neutral gain (1.0x BDP); used as a target for the drain */
}
/* DRAIN completion check: transition out of DRAIN when queue is fully drained */
if (ucp->mode == UCP_DRAIN && /* Currently in DRAIN mode */
ucp_packets_in_net_at_edt(sk, tcp_packets_in_flight(tcp_sk(sk))) <= /* Projected inflight at EDT */
ucp_inflight(sk, ucp_max_bw(sk), BBR_UNIT)) { /* <= target inflight at 1.0x BDP: the excess queue has been drained */
ucp_reset_mode(sk); /* Exit DRAIN: transition to PROBE_BW (or re-enter STARTUP if pipe was never fully filled) */
}
}
/**
* @brief Exit PROBE_RTT mode after the dwell time has elapsed and restore normal operation.
* @param sk The TCP socket
*
* Called from the per-ACK path (via ucp_update_min_rtt) or from the TX_START event callback.
* The exit condition is that probe_rtt_done_stamp is set and the current jiffies value is
* past the done stamp. On exit:
* 1. Updates min_rtt_stamp to now (so the next PROBE_RTT entry is properly timed)
* 2. Restores cwnd to at least prior_cwnd (max of current cwnd and pre-PROBE_RTT window)
* 3. Sets probe_rtt_restored flag (so ucp_set_cwnd applies the restoration)
* 4. Calls ucp_reset_mode() to transition to STARTUP or PROBE_BW
*/
static void ucp_check_probe_rtt_done(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for snd_cwnd */
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
/* Check if PROBE_RTT exit conditions are met: done_stamp must be set AND current time must be past the done stamp */
if (!ucp->probe_rtt_done_stamp || /* No probe_rtt_done_stamp set: not yet scheduled or already exited */
!after(tcp_jiffies32, ucp->probe_rtt_done_stamp)) { /* Current jiffies not yet past the done stamp: PROBE_RTT dwell time has not expired */
return; /* Stay in PROBE_RTT: exit conditions not yet satisfied */
}
ucp->min_rtt_stamp = tcp_jiffies32; /* Update the min_rtt timestamp to now: this marks the end of the PROBE_RTT filter reset */
tp->snd_cwnd = max(tp->snd_cwnd, ucp->prior_cwnd); /* Restore cwnd to at least the saved prior_cwnd (pre-PROBE_RTT window) */
ucp->probe_rtt_restored = 1; /* Set flag: cwnd restoration from PROBE_RTT needs to be applied in ucp_set_cwnd() */
ucp_reset_mode(sk); /* Reset the operating mode: transition to STARTUP (if pipe never full) or PROBE_BW (steady-state) */
}
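The dwell-time check above relies on the kernel's wrap-safe `after()` jiffies comparison, which stays correct when the 32-bit tick counter wraps. A minimal sketch of that idiom (the `after32` name is illustrative; the signed-difference trick mirrors the kernel's implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-safe "a is later than b" for a free-running u32 tick counter:
 * cast the difference to signed, so a point up to 2^31-1 ticks ahead
 * of b compares as "after" even across the u32 wraparound. */
static int after32(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}
```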
/**
* @brief Update the minimum RTT estimate, RTT history, and manage PROBE_RTT state transitions.
* @param sk The TCP socket
* @param rs The rate sample (contains rtt_us and is_ack_delayed)
*
* Comprehensive min_rtt update logic:
* 1. Records whether the current ACK was delayed (affects RTT sample filtering)
* 2. Checks if the min_rtt filter has expired (time to consider lowering min_rtt)
* 3. Updates min_rtt if a lower sample is received (with fast-fall and sticky-floor logic)
* - If sample < 75% of current min_rtt: increments fast-fall counter; at 3 consecutive
* fast-fall samples, immediately updates min_rtt; otherwise gradually reduces by 25%
* - Otherwise: directly updates min_rtt to the lower sample
* 4. SRTT guard: if smoothed RTT is significantly lower than min_rtt, updates min_rtt from SRTT
* 5. Adds valid RTT sample to the history buffer via ucp_add_rtt_sample()
* 6. Enters PROBE_RTT if the filter has expired and idle_restart is not set
* 7. Manages the PROBE_RTT state: sets app_limited, schedules done_stamp, tracks round completion
* 8. Clears idle_restart when data delivery resumes
*/
static void ucp_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
{
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for srtt_us and delivered */
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
bool filter_expired; /* True if the min_rtt measurement window has expired (enough time has passed since last update) */
ucp->has_delayed_ack = rs->is_ack_delayed; /* Record whether this ACK was a delayed ACK; used later in RTT sample filtering */
/* Check if the min_rtt filter has expired: current time past (min_rtt_stamp + probe_rtt_interval) */
filter_expired = after(tcp_jiffies32,
ucp->min_rtt_stamp + ucp_get_probe_rtt_interval(sk)); /* True if the adaptive PROBE_RTT interval has elapsed since the last min_rtt update */
/* Update min_rtt if a lower RTT sample is available (with fast-fall protection) */
if (rs->rtt_us >= 0 && /* Valid (non-negative) RTT sample in the rate sample */
(rs->rtt_us < ucp->min_rtt_us || /* The new RTT sample is strictly lower than current min_rtt OR */
(filter_expired && !rs->is_ack_delayed))) { /* The filter has expired AND this is not a delayed ACK (allow min_rtt to relax upward on expiry) */
/* Check for fast-fall condition: the new RTT is below 75% of current min_rtt (significant drop) */
if (ucp->min_rtt_us && /* Current min_rtt is established (non-zero) */
rs->rtt_us < ucp->min_rtt_us * UCP_MINRTT_STICKY_FLOOR_NUM / /* Compare: rtt < min_rtt * 3/4 */
UCP_MINRTT_STICKY_FLOOR_DEN) { /* 3/4 = 75% threshold for fast-fall detection */
ucp->min_rtt_fast_fall_cnt++; /* Increment the fast-fall counter: another consecutive sample below 75% */
if (ucp->min_rtt_fast_fall_cnt >= UCP_MINRTT_FAST_FALL_CNT) { /* 3 consecutive fast-fall samples: confirm the downward trend */
ucp->min_rtt_us = rs->rtt_us; /* Directly update min_rtt to the new lower value after confirmation */
ucp->min_rtt_fast_fall_cnt = 0; /* Reset the fast-fall counter for the next detection cycle */
} else { /* Fast-fall accumulating but not yet confirmed: apply a gradual sticky floor reduction */
ucp->min_rtt_us = ucp->min_rtt_us * /* Reduce min_rtt to 3/4 of its current value as a progressive step */
UCP_MINRTT_STICKY_FLOOR_NUM /
UCP_MINRTT_STICKY_FLOOR_DEN; /* min_rtt = min_rtt * 3/4 (partial reduction toward the new lower value) */
}
} else { /* Normal (non-fast-fall) update: the new sample is lower but not drastically so */
ucp->min_rtt_us = rs->rtt_us; /* Directly update min_rtt to the new lower RTT sample */
ucp->min_rtt_fast_fall_cnt = 0; /* Reset the fast-fall counter since this is not a fast-fall pattern */
}
ucp->min_rtt_stamp = tcp_jiffies32; /* Update the timestamp: min_rtt was just modified, start the expiry countdown from now */
}
/* SRTT guard: if the smoothed RTT is significantly below min_rtt, it means min_rtt is stale */
if (tp->srtt_us && ucp->min_rtt_us && /* Both SRTT and min_rtt are available */
(tp->srtt_us >> 3) < ucp->min_rtt_us * /* Compare srtt (de-scaled) < min_rtt * 9/10: SRTT guard condition */
UCP_MINRTT_SRTT_GUARD_NUM / UCP_MINRTT_SRTT_GUARD_DEN) { /* 9/10 = 0.9: if srtt < 90% of min_rtt, min_rtt is too high */
ucp->min_rtt_us = tp->srtt_us >> 3; /* Update min_rtt to the de-scaled smoothed RTT (srtt is 8x the actual value) */
ucp->min_rtt_stamp = tcp_jiffies32; /* Update the timestamp to record when min_rtt was last changed */
}
/* Add the RTT sample to the 2-slot history buffer for P10 RTT estimation (if valid) */
if (rs->rtt_us > 0) { /* Only process strictly positive RTT samples */
ucp_add_rtt_sample(sk, rs->rtt_us); /* Add the filtered sample to the circular history buffer */
}
/* Check whether to enter PROBE_RTT mode: filter expired, no idle restart, not already in PROBE_RTT */
if (UCP_PROBE_RTT_MODE_MS > 0 && filter_expired && /* PROBE_RTT is enabled (dwell time > 0) AND min_rtt filter window has expired */
!ucp->idle_restart && ucp->mode != UCP_PROBE_RTT) { /* Not restarting from idle AND not already in PROBE_RTT */
ucp->mode = UCP_PROBE_RTT; /* Enter PROBE_RTT mode: cwnd will be clamped to 4 packets to drain queues */
ucp_save_cwnd(sk); /* Save the current cwnd as prior_cwnd before PROBE_RTT clamps it */
ucp->probe_rtt_done_stamp = 0; /* Defer: timer starts later when inflight <= 4 (queue drained), matching BBR behavior */
}
/* Manage the PROBE_RTT state: app_limited marking and exit timing */
if (ucp->mode == UCP_PROBE_RTT) { /* Currently in PROBE_RTT mode */
tp->app_limited = (tp->delivered + tcp_packets_in_flight(tp)) ? : 1; /* Mark the connection app-limited at (delivered + inflight); the '?: 1' avoids storing 0, which would mean "not app-limited". Prevents PROBE_RTT's low rate from polluting bandwidth samples */
if (!ucp->probe_rtt_done_stamp) { /* No done_stamp scheduled yet: waiting for inflight to drain */
if (tcp_packets_in_flight(tp) <= UCP_CWND_MIN_TARGET) { /* Inflight is at or below the PROBE_RTT min target (4 packets): queue drained, start timer */
ucp->probe_rtt_done_stamp = tcp_jiffies32 + /* Set the done_stamp: start the PROBE_RTT dwell timer now */
msecs_to_jiffies(UCP_PROBE_RTT_MODE_MS); /* Dwell for 200ms in the low-inflight state */
ucp->probe_rtt_round_done = 0; /* Reset the round-done flag; will be set when a full RTT passes */
ucp->next_rtt_delivered = tp->delivered; /* Set the next round boundary to detect a full RTT in PROBE_RTT */
} else if (ucp->round_start) { /* Safety: one full round elapsed with cwnd=4; inflight should have drained; force-start timer to prevent hang */
ucp->probe_rtt_done_stamp = tcp_jiffies32 +
msecs_to_jiffies(UCP_PROBE_RTT_MODE_MS);
ucp->probe_rtt_round_done = 0;
ucp->next_rtt_delivered = tp->delivered;
}
} else { /* Done stamp is already set: track round completion for exit */
if (ucp->round_start) { /* A new packet-timed round has started while in PROBE_RTT */
ucp->probe_rtt_round_done = 1; /* Mark that at least one full round has completed during PROBE_RTT */
}
if (ucp->probe_rtt_round_done) { /* At least one full round has completed and the done_stamp time has been reached */
ucp_check_probe_rtt_done(sk); /* Attempt to exit PROBE_RTT; will succeed if done_stamp time is past */
}
}
}
/* Clear idle_restart when data delivery resumes */
if (rs->delivered > 0) { /* At least one packet was delivered (acked) in this rate sample */
ucp->idle_restart = 0; /* Clear the idle restart flag: we have seen data delivery since the idle period ended */
}
}
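The fast-fall / sticky-floor update above can be isolated into a few lines. A userspace sketch under the constants described (3/4 sticky floor, 3-sample confirmation; names are illustrative, and it assumes min_rtt is already established):

```c
#include <assert.h>
#include <stdint.h>

#define STICKY_NUM	3	/* sticky floor: min_rtt * 3/4 per step */
#define STICKY_DEN	4
#define FAST_FALL_CNT	3	/* consecutive samples to confirm a drop */

struct minrtt {
	uint32_t min_rtt_us;	/* established minimum RTT */
	uint32_t fall_cnt;	/* consecutive fast-fall samples seen */
};

/* A sample below 75% of min_rtt only takes effect after 3 consecutive
 * such samples; until then min_rtt steps down by 25% per sample.
 * A milder lower sample is adopted immediately. */
static void minrtt_sample(struct minrtt *m, uint32_t rtt_us)
{
	if (rtt_us >= m->min_rtt_us)
		return;
	if (m->min_rtt_us &&
	    rtt_us < m->min_rtt_us * STICKY_NUM / STICKY_DEN) {
		if (++m->fall_cnt >= FAST_FALL_CNT) {
			m->min_rtt_us = rtt_us;		/* confirmed drop */
			m->fall_cnt = 0;
		} else {
			m->min_rtt_us = m->min_rtt_us *	/* partial step */
					STICKY_NUM / STICKY_DEN;
		}
	} else {
		m->min_rtt_us = rtt_us;			/* normal update */
		m->fall_cnt = 0;
	}
}
```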
/**
* @brief Update the ACK aggregation compensation state.
* @param sk The TCP socket
* @param rs The rate sample (contains acked_sacked and delivered_mstamp)
*
* Tracks excess ACK counts over RTT-scale epochs. At the end of each epoch
* (time > max(min_rtt_us, 1ms) since start), expected deliveries are computed
* from the bandwidth estimate and compared to the observed acked count.
* The 1ms floor prevents degenerate per-ACK epoch resets when min_rtt_us is
* unrealistically small (e.g. early connection or low-latency loopback).
* The excess is accumulated and the running max (extra_acked_max) is decayed by 3/4,
* then consumed by ucp_set_cwnd() to add a cwnd bonus.
*
* DIFFERENCES FROM GOOGLE BBR's extra_acked:
* BBR                | UCP
* -------------------|--------------------------------------------------
* u64 epoch mstamp   | u32 ack_epoch_start_us (low 32 bits only)
* u16 extra_acked[2] | u8 extra_acked + u8 extra_acked_max (single slot)
* dual-slot window   | single-slot exponential decay (x 0.75/epoch)
* fixed gain 1.0x    | ucp_extra_acked_gain (default 0 = off)
* ~16 bytes total    | 6 bytes total
*
* Space constraint: struct ucp must fit within 104 bytes (ICSK_CA_PRIV_SIZE).
* A full BBR extra_acked implementation would require ~16 bytes, which is
* not available. The single-slot decay approximation converges to the same
* steady state within 3-4 epochs and incurs <3% throughput penalty on most
* internet paths. The u8 saturation at 255 is rarely reached outside
* datacenter environments.
*/
static void ucp_update_ack_aggregation(struct sock *sk,
const struct rate_sample *rs)
{
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket state for delivered_mstamp */
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
u32 now_us, epoch_us, expected, extra; /* now_us = current microsecond timestamp; epoch_us = elapsed epoch time; expected = expected deliveries; extra = excess */
if (!ucp_extra_acked_gain_val) { /* Compensation gain is 0 (disabled): skip all tracking */
return;
}
if (rs->delivered < 0 || rs->interval_us <= 0) { /* Invalid rate sample: no delivery or interval data to base compensation on */
return;
}
if (!ucp_max_bw(sk)) { /* No bandwidth estimate yet (early connection): cannot compute expected deliveries; skip to prevent false excess */
return;
}
if (!ucp->min_rtt_us) { /* min_rtt not yet established: epoch boundary detection (epoch_us > min_rtt_us) would be degenerate; skip compensation until a valid baseline RTT exists */
return;
}
now_us = (u32)tp->delivered_mstamp; /* Lower 32 bits of the 64-bit microsecond delivery timestamp */
if (ucp->ack_epoch_start_us == 0) { /* First ACK or epoch just ended: initialize a new epoch */
ucp->ack_epoch_start_us = now_us; /* Record the start of this epoch */
ucp->extra_acked = 0; /* Reset the cumulative excess counter for the new epoch */
return; /* No comparison to make on the first sample of an epoch */
}
epoch_us = now_us - ucp->ack_epoch_start_us; /* Elapsed time since epoch start (handles 32-bit wraparound automatically) */
if (epoch_us > max(ucp->min_rtt_us, (u32)UCP_ACK_EPOCH_MIN_US)) { /* Epoch has exceeded max(min_rtt, 1ms): time to close the window and compute excess; the 1ms floor prevents degenerate per-ACK resets when min_rtt_us is unrealistically small */
expected = ((u64)ucp_max_bw(sk) * epoch_us) >> BW_SCALE; /* Expected deliveries = bw (pkts/usec * 2^24) * elapsed_us >> 24 */
extra = (ucp->extra_acked > expected) ? /* If observed exceeds expected, compute the surplus */
(ucp->extra_acked - expected) : 0; /* Excess = observed - expected, or 0 if observed <= expected */
ucp->extra_acked_max = max((u8)((u32)ucp->extra_acked_max * UCP_ACK_EPOCH_DECAY_NUM / UCP_ACK_EPOCH_DECAY_DEN), /* Decay the previous max to 3/4 of its value (exponential forgetting) */
(u8)(extra > UCP_U8_MAX ? UCP_U8_MAX : extra)); /* New max is at least the current epoch's excess (saturated to u8) */
/* Start a new epoch with the current ACK's acked count */
ucp->ack_epoch_start_us = now_us; /* Reset epoch start to now */
ucp->extra_acked = (u8)(rs->acked_sacked > UCP_U8_MAX ? UCP_U8_MAX : rs->acked_sacked); /* Seed the new epoch with this ACK's acked count (saturated) */
} else {
/* Still within the current epoch: accumulate the excess count */
ucp->extra_acked = (u8)min_t(u32, UCP_U8_MAX, /* Accumulate and saturate to u8 */
(u32)ucp->extra_acked + rs->acked_sacked);
}
}
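The epoch-close step above is a decay-and-max over u8 values. A userspace sketch of that single step (the `epoch_close` helper name is hypothetical; constants follow the 3/4 decay and u8 saturation described above):

```c
#include <assert.h>
#include <stdint.h>

#define DECAY_NUM	3	/* decay the running max to 3/4 per epoch */
#define DECAY_DEN	4
#define U8_SAT		255u	/* u8 saturation ceiling */

/* Close one epoch: compute the excess of observed acked packets over the
 * bandwidth-expected count, decay the previous running max to 3/4, and
 * keep whichever is larger (excess saturated to u8 first). */
static uint8_t epoch_close(uint8_t extra_max, uint32_t observed,
			   uint32_t expected)
{
	uint32_t excess = observed > expected ? observed - expected : 0;
	uint32_t decayed = (uint32_t)extra_max * DECAY_NUM / DECAY_DEN;

	if (excess > U8_SAT)
		excess = U8_SAT;
	return decayed > excess ? (uint8_t)decayed : (uint8_t)excess;
}
```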
/**
* @brief Return the standard steady-state cwnd gain value.
* @param sk The TCP socket (unused in this function, required for API consistency)
* @return The precomputed steady-state cwnd gain in BBR_SCALE (default 512 = 2.0x)
*/
static u32 ucp_get_cwnd_gain(const struct sock *sk)
{
(void)sk; /* Suppress unused parameter warning; parameter required by calling convention */
return ucp_cwnd_gain_val; /* Return the precomputed steady-state cwnd gain value (2.0x BDP default) */
}
/**
* @brief Run the complete UCP estimation pipeline for the current ACK.
* @param sk The TCP socket
* @param rs The rate sample from the TCP stack
*
* This is the main model update function, called once per ACK processing cycle.
* The pipeline executes sub-updates in dependency order:
* 1. ucp_update_bw() - Update bandwidth estimate and round tracking
* 2. ucp_update_loss_ewma() - Update loss EWMA
* 2a. ucp_update_ack_aggregation() - Track excess acked counts for ACK aggregation compensation
* 3. ucp_update_net_condition() - Classify network condition (IDLE/LIGHT_LOAD/CONGESTED/RANDOM_LOSS)
* 4. ucp_update_net_class() - Classify path type (LAN/MOBILE/LOSSY_FAT/CONGESTED/VPN/DEFAULT)
* 5. ucp_update_cycle_phase() - Advance PROBE_BW gain cycle if phase conditions met
* 6. ucp_check_full_bw_reached() - Detect STARTUP pipe-full condition
* 7. ucp_check_drain() - Handle STARTUP/DRAIN/PROBE_BW transitions
* 8. ucp_update_min_rtt() - Update min RTT and manage PROBE_RTT state
* 9. Set mode-specific gains: STARTUP uses UCP_HIGH_GAIN, DRAIN uses UCP_DRAIN_GAIN,
* PROBE_BW uses the cycle-based cwnd gain, PROBE_RTT uses neutral gain
*/
static void ucp_update_model(struct sock *sk, const struct rate_sample *rs)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
ucp_update_bw(sk, rs); /* Step 1: Update BtlBw estimate from current rate sample (with bandwidth floor) */
ucp_update_loss_ewma(sk, rs); /* Step 2: Update loss EWMA from current sample's acked and lost counts */
ucp_update_ack_aggregation(sk, rs); /* Step 2a: Update ACK aggregation compensation (tracks excess acked counts per RTT epoch) */
ucp_update_net_condition(sk, rs); /* Step 3: Classify network condition using rate trend, loss, ECN, and RTT rise */
ucp_update_net_class(sk); /* Step 4: Classify network path class using RTT stats and loss */
ucp_update_cycle_phase(sk, rs); /* Step 5: Advance PROBE_BW cycle phase if timing/inflight conditions are met */
ucp_check_full_bw_reached(sk, rs);/* Step 6: Check if STARTUP bandwidth growth has stalled (3 rounds < 1.25x) */
ucp_check_drain(sk, rs); /* Step 7: Check for STARTUP->DRAIN and DRAIN->PROBE_BW transitions */
ucp_update_min_rtt(sk, rs); /* Step 8: Update min_rtt, RTT history, and manage PROBE_RTT enter/exit */
/* Step 9: Set mode-specific base gains for pacing and cwnd */
switch (ucp->mode) { /* Set default gains based on the current operating mode; these may be further modified by constraints */
case UCP_STARTUP: /* Exponential bandwidth search phase: use high gain for both pacing and cwnd */
ucp->pacing_gain = UCP_HIGH_GAIN; /* Set pacing gain to UCP_HIGH_GAIN (~2.885x BBR_SCALE) for aggressive bandwidth probing */
ucp->cwnd_gain = UCP_HIGH_GAIN; /* Set cwnd gain to UCP_HIGH_GAIN (~2.885x BBR_SCALE) for aggressive window growth */
break; /* Exit the switch after setting STARTUP gains */
case UCP_DRAIN: /* Queue drain phase: drain pacing (low gain), keep high cwnd gain for window restoration */
ucp->pacing_gain = UCP_DRAIN_GAIN; /* Set pacing gain to UCP_DRAIN_GAIN (~0.346x BBR_SCALE) to drain the queue rapidly */
ucp->cwnd_gain = UCP_HIGH_GAIN; /* Keep cwnd gain at UCP_HIGH_GAIN to maintain the large window while pacing drains */
break; /* Exit the switch after setting DRAIN gains */
case UCP_PROBE_BW: /* Steady-state bandwidth probing: use the standard cwnd gain (from current phase) */
ucp->cwnd_gain = ucp_get_cwnd_gain(sk); /* Set cwnd gain to the standard steady-state value (2.0x BDP default); pacing_gain was set by cycle phase logic */
break; /* Exit the switch after setting PROBE_BW cwnd gain (pacing_gain is already managed by cycle phase) */
case UCP_PROBE_RTT: /* Min RTT refresh: use neutral gain for both pacing and cwnd */
ucp->pacing_gain = BBR_UNIT; /* Set pacing gain to 1.0x (neutral): no probe, no drain; just send at estimated BtlBw */
ucp->cwnd_gain = BBR_UNIT; /* Set cwnd gain to 1.0x (neutral); actual cwnd will be forced to 4 by ucp_set_cwnd() */
break; /* Exit the switch after setting PROBE_RTT gains */
}
}
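All the gains set above are Q8 fixed-point (BBR_SCALE = 8, so BBR_UNIT = 256 = 1.0x). A sketch of how such a gain is applied to a rate; the HIGH_GAIN formula shown is the one upstream BBR uses for its ~2.885x startup gain and is an assumption here, since UCP_HIGH_GAIN's exact definition is outside this excerpt:

```c
#include <assert.h>
#include <stdint.h>

#define BBR_SCALE	8
#define BBR_UNIT	(1u << BBR_SCALE)		/* 256 = 1.0x */
/* Upstream BBR's startup gain: 2/ln(2) ~= 2.885, rounded up in Q8. */
#define HIGH_GAIN	((BBR_UNIT * 2885) / 1000 + 1)	/* = 739 ~ 2.887x */

/* Apply a Q8 gain to a bandwidth/rate value: rate = bw * gain / 256. */
static uint64_t apply_gain(uint64_t bw, uint32_t gain)
{
	return bw * gain >> BBR_SCALE;
}
```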
/**
* @brief Per-ACK entry point for the UCP congestion control algorithm.
* @param sk The TCP socket
* @param rs The rate sample from the TCP stack (contains delivery rate, RTT, loss, etc.)
*
* This is the main callback registered in struct tcp_congestion_ops as .cong_control.
* Called by the TCP stack on every ACK that includes a rate sample. The execution order:
* 1. Run the full model update pipeline (ucp_update_model)
* 2. Apply any queued one-shot drain constraints (ucp_apply_pacing_constraints)
* 3. Apply cwnd ceiling and STARTUP loss caps (ucp_apply_cwnd_constraints)
* 4. Get the current max bandwidth and set the pacing rate with the current pacing gain
* 5. Clear the fast_recovery flag at the start of a new round if recovering (handles recovery exit)
* 6. Compute and set the new congestion window
*/
static void ucp_main(struct sock *sk, const struct rate_sample *rs)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
u32 bw; /* Current maximum bandwidth (BtlBw) from the minmax filter, used for pacing rate and cwnd calculation */
ucp_update_model(sk, rs); /* Step 1: Run the full estimation pipeline (BW, loss, conditions, class, phase, min_rtt, mode gains) */
ucp_apply_pacing_constraints(sk); /* Step 2: Apply any one-shot drain queued by advance_cycle_phase (overrides pacing_gain if drain_pending) */
ucp_apply_cwnd_constraints(sk); /* Step 3: Apply CONGESTED cwnd caps and STARTUP loss-based gain limits */
bw = ucp_max_bw(sk); /* Step 4: Get the current max bottleneck bandwidth estimate from the minmax filter */
ucp_set_pacing_rate(sk, bw, ucp->pacing_gain); /* Set the socket's pacing rate = bw * pacing_gain (with EWMA smoothing on increases) */
if (ucp->fast_recovery && ucp->round_start) { /* Step 5: If in fast recovery and a new round has started, recovery is effectively done */
ucp->fast_recovery = 0; /* Clear the fast recovery flag: the recovery window has been applied and a new round has begun */
}
ucp_set_cwnd(sk, rs, rs->acked_sacked, bw, ucp->cwnd_gain); /* Step 6: Compute and apply the new congestion window = BDP * cwnd_gain + quantization */
}
/* ---- Module callbacks (registered in tcp_congestion_ops) ---------------- */
/**
* @brief Initialize the per-connection UCP congestion control state.
* @param sk The TCP socket
*
* Called when a new TCP connection adopts the UCP congestion control algorithm.
* Initializes: all state to zero (via memset), prev_ca_state to TCP_CA_Open,
* min_rtt_us from the TCP stack's min RTT estimate, min_rtt_stamp to now,
* and the initial pacing rate from cwnd/SRTT. Also requests kernel pacing support.
*/
static void ucp_init(struct sock *sk)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state area (pre-allocated by inet_csk_ca) */
memset(ucp, 0, sizeof(*ucp)); /* Zero-initialize the entire UCP state structure to ensure deterministic startup */
ucp->rate_change_ewma = (s32)0x80000000; /* Sentinel: marks uninitialized, distinct from any valid EWMA value (INT_MIN) */
ucp->prev_ca_state = TCP_CA_Open; /* Set the initial previous CA state to Open (no prior recovery state) */
ucp->min_rtt_us = tcp_min_rtt(tcp_sk(sk)); /* Initialize min_rtt from the TCP stack's minimum observed RTT (if available) */
ucp->min_rtt_stamp = tcp_jiffies32; /* Set the min_rtt timestamp to the current jiffies to start the filter lifetime from now */
ucp_init_pacing_rate_from_rtt(sk); /* Compute and set the initial pacing rate from current cwnd and SRTT using high STARTUP gain */
cmpxchg(&sk->sk_pacing_status, SK_PACING_NONE, SK_PACING_NEEDED); /* Atomic compare-and-exchange to request kernel-side pacing support if not already enabled */
}
/**
* @brief Return the send buffer expansion factor for UCP connections.
* @param sk The TCP socket (unused)
* @return Buffer expansion multiplier: 3x the default socket buffer
*
* Called by the TCP stack to determine how much to expand the socket's send buffer
* beyond the default. UCP recommends 3x to accommodate the larger cwnd variations
* and TSO burst requirements of a BBR-derived algorithm.
*/
static u32 ucp_sndbuf_expand(struct sock *sk)
{
(void)sk; /* Suppress unused parameter warning; parameter required by callback API */
return 3; /* Return expansion factor of 3x: recommend send buffer be 3x the default size */
}
/**
* @brief Reset full_bw tracking on cwnd undo (after spurious loss recovery).
* @param sk The TCP socket
* @return The current cwnd to use after undo (unchanged from TCP stack's value)
*
* Called when the TCP stack detects that a loss recovery was spurious (e.g., due to
* packet reordering or a false loss indication) and undoes the cwnd reduction.
* Resets the full_bw tracking state so that bandwidth growth detection restarts
* fresh after the undo event.
*/
static u32 ucp_undo_cwnd(struct sock *sk)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
ucp->full_bw = 0; /* Reset the full_bw snapshot: bandwidth growth tracking will restart from zero */
ucp->full_bw_cnt = 0; /* Reset the non-growth round counter: fresh start for full-pipe detection */
ucp->full_bw_reached = 0; /* Clear full_bw_reached to force a fresh STARTUP bandwidth probe after spurious loss recovery undo */
return tcp_sk(sk)->snd_cwnd; /* Return the current cwnd (which the TCP stack may have already restored during undo) */
}
/**
* @brief Save current cwnd and return slow start threshold for loss recovery entry.
* @param sk The TCP socket
* @return The slow start threshold value (from the TCP socket's snd_ssthresh)
*
* Called by the TCP stack when entering loss recovery to get the slow start threshold.
* Before returning, saves the current cwnd via ucp_save_cwnd() so that the pre-recovery
* window can be restored when recovery exits.
*/
static u32 ucp_ssthresh(struct sock *sk)
{
ucp_save_cwnd(sk); /* Save the current cwnd as prior_cwnd before recovery reduces it */
return tcp_sk(sk)->snd_ssthresh; /* Return the current slow start threshold (may have been set by ucp_check_drain or default) */
}
/* ---- Diagnostic encoding for ss -i output ------------------------------ */
#define UCP_DIAG_MARKER 0x80000000U /* High bit marker to indicate UCP-specific diagnostic data in bbr_bw_hi field */
#define UCP_DIAG_CLASS_SHIFT 28 /* Shift offset within bbr_bw_hi for the net_class field (bits 28-30) */
#define UCP_DIAG_COND_SHIFT 26 /* Shift offset within bbr_bw_hi for the net_condition field (bits 26-27) */
#define UCP_DIAG_DRAIN_SHIFT 24 /* Shift offset within bbr_bw_hi for the drain_pending field (bits 24-25) */
#define UCP_DIAG_LOSS_SHIFT 16 /* Shift offset within bbr_bw_hi for the loss_ewma field (bits 16-23); loss in BBR_SCALE (0..256) is saturated to 0xFF, the closest displayable 8-bit value */
/**
* @brief Return diagnostic information for the ss -i tool via INET_DIAG_BBRINFO.
* @param sk The TCP socket
* @param ext The diagnostic extension bitmask (checked for BBRINFO or VEGASINFO)
* @param attr Output parameter: set to INET_DIAG_BBRINFO to indicate which info is returned
* @param info Output union: filled with UCP state mapped to the bbr structure
* @return Size of the diagnostic structure (sizeof bbr), or 0 if no relevant extension requested
*
* Maps UCP internal state into the standard BBR diagnostic structure for display by ss -i.
* When bandwidth fits in 32 bits, the bbr_bw_hi field is repurposed to encode UCP-specific
* state (net_class, net_condition, drain_pending, loss_ewma) using bit fields with a marker
* bit (0x80000000) to distinguish UCP data from actual high 32 bits of bandwidth.
*/
static size_t ucp_get_info(struct sock *sk, u32 ext, int *attr,
union tcp_cc_info *info)
{
if (ext & (1 << (INET_DIAG_BBRINFO - 1)) || /* Check if the BBRINFO diagnostic extension is requested */
ext & (1 << (INET_DIAG_VEGASINFO - 1))) { /* Also accept VEGASINFO as some tools request it interchangeably */
struct tcp_sock *tp = tcp_sk(sk); /* Get the TCP socket's internal state for MSS cache */
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
u64 bw = (u64)ucp_max_bw(sk) * tp->mss_cache * /* Convert max BW from BW_SCALE to bits or bytes per second for diagnostic display; cast to u64 before multiplication to prevent 32-bit overflow on high-rate links */
USEC_PER_SEC >> BW_SCALE; /* bw = max_bw (pkts/usec) * MSS (bytes) * 1,000,000 (usec/sec) >> 24 = bytes/sec */
memset(&info->bbr, 0, sizeof(info->bbr)); /* Zero-initialize the diagnostic structure before populating */
info->bbr.bbr_bw_lo = (u32)bw; /* Store the lower 32 bits of the bandwidth value (bytes/sec) */
info->bbr.bbr_bw_hi = (u32)(bw >> 32); /* Store the upper 32 bits of the bandwidth value (for high-speed links) */
info->bbr.bbr_min_rtt = ucp->min_rtt_us; /* Report the current minimum RTT estimate in microseconds */
info->bbr.bbr_pacing_gain = ucp->pacing_gain; /* Report the current pacing gain (BBR_SCALE) */
info->bbr.bbr_cwnd_gain = ucp->cwnd_gain; /* Report the current cwnd gain (BBR_SCALE) */
if (bw <= U32_MAX) { /* If bandwidth fits in 32 bits, bbr_bw_hi is free for UCP-specific diagnostic encoding */
info->bbr.bbr_bw_hi = UCP_DIAG_MARKER | /* Set the high bit marker to identify UCP-encoded diagnostics */
((u32)ucp->net_class << UCP_DIAG_CLASS_SHIFT) | /* Encode net_class at bits 28-30 */
((u32)ucp->net_condition << UCP_DIAG_COND_SHIFT) | /* Encode net_condition at bits 26-27 */
((u32)ucp->drain_pending << UCP_DIAG_DRAIN_SHIFT) | /* Encode drain_pending at bits 24-25 */
((u32)min_t(u16, ucp_get_loss_ratio(sk), UCP_U8_MAX) /* Full loss EWMA (0..BBR_UNIT) reconstructed via getter, then saturated to 8-bit (UCP_U8_MAX = closest to 100%) */
<< UCP_DIAG_LOSS_SHIFT); /* Encode the 8-bit loss value at bits 16-23 for ss -i display */
}
*attr = INET_DIAG_BBRINFO; /* Set the output attribute type to BBRINFO to tell the stack what we filled */
return sizeof(info->bbr); /* Return the size of the bbr info structure to indicate data was provided */
}
return 0; /* No matching extension requested: return 0 to indicate no data was filled */
}
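A userspace consumer of the `ss -i` data could unpack the repurposed bbr_bw_hi field with the shifts defined above. A hypothetical decoder sketch (struct and function names are illustrative; field widths follow the bit layout in the comments: 3-bit class, 2-bit condition, 2-bit drain, 8-bit loss):

```c
#include <assert.h>
#include <stdint.h>

#define UCP_DIAG_MARKER		0x80000000u	/* UCP-encoded, not high bw bits */
#define UCP_DIAG_CLASS_SHIFT	28		/* net_class, bits 28-30 */
#define UCP_DIAG_COND_SHIFT	26		/* net_condition, bits 26-27 */
#define UCP_DIAG_DRAIN_SHIFT	24		/* drain_pending, bits 24-25 */
#define UCP_DIAG_LOSS_SHIFT	16		/* loss_ewma, bits 16-23 */

struct ucp_diag {
	unsigned cls, cond, drain, loss;
};

/* Decode the repurposed bbr_bw_hi field. Returns 1 and fills *d when the
 * marker bit is set (bandwidth fit in 32 bits); returns 0 when bw_hi
 * carries real high-order bandwidth bits instead. */
static int ucp_diag_decode(uint32_t bw_hi, struct ucp_diag *d)
{
	if (!(bw_hi & UCP_DIAG_MARKER))
		return 0;
	d->cls   = (bw_hi >> UCP_DIAG_CLASS_SHIFT) & 0x7;
	d->cond  = (bw_hi >> UCP_DIAG_COND_SHIFT)  & 0x3;
	d->drain = (bw_hi >> UCP_DIAG_DRAIN_SHIFT) & 0x3;
	d->loss  = (bw_hi >> UCP_DIAG_LOSS_SHIFT)  & 0xFF;
	return 1;
}
```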
/**
* @brief Handle TCP state changes (set_state callback).
* @param sk The TCP socket
* @param new_state The new TCP CA state (TCP_CA_Loss, TCP_CA_Recovery, etc.)
*
* Called by the TCP stack when the congestion algorithm state changes.
* On entry to TCP_CA_Loss (RTO or retransmit timeout), resets the full_bw
* and full_bw_reached flags to restart bandwidth growth detection, sets
* round_start to trigger a fresh round, and clears fast_recovery.
* Other state transitions are handled in ucp_set_cwnd_to_recover_or_restore.
*/
static void ucp_set_state(struct sock *sk, u8 new_state)
{
struct ucp *ucp = (struct ucp *)inet_csk_ca(sk); /* Get the per-connection UCP private state */
if (new_state == TCP_CA_Loss) { /* Entering TCP loss state (RTO or retransmission timeout) */
ucp->full_bw = 0; /* Reset the full_bw snapshot: bandwidth growth tracking starts over after a timeout */
ucp->full_bw_cnt = 0; /* Reset the non-growth round counter: fresh start after timeout */
ucp->full_bw_reached = 0; /* Clear the full_bw_reached flag: pipe may no longer be full after a timeout */
ucp->round_start = 1; /* Set round_start to force a new round boundary on the next ACK */
ucp->fast_recovery = 0; /* Clear fast_recovery flag: recovery has been superseded by the loss event */
}
}
/* ---- Register/unregister the congestion-control operations structure ---- */
/**
* @brief UCP congestion control operations table.
*
* This structure is registered with the TCP stack via tcp_register_congestion_control().
* It defines the callbacks that the TCP stack invokes at various points in the
* connection lifecycle and per-ACK processing. The .cong_control callback (ucp_main)
* is the primary per-ACK entry point for model updates and cwnd/pacing control.
* The .flags field marks this algorithm as NON_RESTRICTED, allowing
* unprivileged processes to select it even when it is not listed in the
* net.ipv4.tcp_allowed_congestion_control allowed list.
*/
static struct tcp_congestion_ops tcp_ucp_cong_ops __read_mostly = {
.flags = TCP_CONG_NON_RESTRICTED, /* Allow unprivileged sockets to select this CC even when it is absent from the tcp_allowed_congestion_control list */
.name = "ucp", /* String identifier used by the TCP stack and exposed via tcp_congestion_control sysctl */
.owner = THIS_MODULE, /* Kernel module ownership: ties the ops lifetime to the module reference count */
.init = ucp_init, /* Callback: initialize per-connection UCP state when a connection adopts this CC */
.cong_control = ucp_main, /* Callback: per-ACK model update and control (the main UCP algorithm entry point) */
.sndbuf_expand = ucp_sndbuf_expand, /* Callback: return the send buffer expansion multiplier (3x recommended) */
.undo_cwnd = ucp_undo_cwnd, /* Callback: reset full_bw tracking after a spurious loss recovery undo */
.cwnd_event = ucp_cwnd_event, /* Callback: handle congestion events (TX_START after idle, etc.) */
.ssthresh = ucp_ssthresh, /* Callback: return slow start threshold on loss recovery entry; saves cwnd first */
.min_tso_segs = ucp_min_tso_segs, /* Callback: return minimum TSO segments for the current pacing rate */
.get_info = ucp_get_info, /* Callback: return diagnostic info for ss -i tool (maps to INET_DIAG_BBRINFO) */
.set_state = ucp_set_state, /* Callback: handle TCP CA state changes (Loss, Recovery, etc.) */
};
/* ---- Sysctl interface for runtime tuning via sysctl -w / /etc/sysctl.conf ---- */
static struct ctl_table_header *ucp_ctl_header; /* Opaque handle returned by register_sysctl(); stored for cleanup in ucp_unregister */
/**
* @brief Shared sysctl proc handler: delegates to proc_dointvec, then refreshes cached values on write.
* @param ctl Pointer to the sysctl table entry being accessed
* @param write 1 = write operation (user is setting a value), 0 = read operation (e.g. sysctl -a)
* @param buffer User-space buffer for the value being read or written
* @param lenp Pointer to buffer length; updated with bytes actually transferred
* @param ppos Current file position offset for partial reads/writes
* @return 0 on success, negative errno on failure
*
* Every successful write triggers ucp_init_module_params() to recompute internal BBR_SCALE
* and jiffies cached values, ensuring sysctl -w takes effect immediately on the next ACK.
*/
static int ucp_proc_handler(struct ctl_table *ctl, int write,
void *buffer, size_t *lenp, loff_t *ppos)
{
/* Delegate to the standard int-vector proc handler; it parses the user string,
* writes the parsed int to *ctl->data, and returns 0 on success */
int ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
/* If this was a successful write, recompute all cached derived values so the
* new parameter takes effect on the very next ACK (no module reload needed) */
if (write && ret == 0)
ucp_init_module_params();
return ret; /* Propagate the return value to the sysctl infrastructure */
}
/* Sysctl table registered under /proc/sys/net/ucp/.
* Each entry maps a sysctl name (e.g. "ucp_extra_acked_gain") to one of the module's
* static int variables. All 33 entries share the same ucp_proc_handler which triggers
* ucp_init_module_params() on every write, making "sysctl -w net.ucp.XXX=YYY" dynamic. */
static struct ctl_table ucp_ctl_table[] = {
/* UCP/BBR operation mode selector */
{ .procname = "ucp_bbr_mode", .data = &ucp_bbr_mode, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
/* Bandwidth floor per path class (permyriad of non-congested peak BDP) */
{ .procname = "ucp_bw_floor_default", .data = &ucp_bw_floor_default, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_bw_floor_mobile", .data = &ucp_bw_floor_mobile, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_bw_floor_lan", .data = &ucp_bw_floor_lan, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_bw_floor_vpn", .data = &ucp_bw_floor_vpn, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_bw_floor_lossy_fat", .data = &ucp_bw_floor_lossy_fat, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_bw_floor_congested", .data = &ucp_bw_floor_congested, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_bw_floor_loss_cap", .data = &ucp_bw_floor_loss_cap, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
/* PROBE_RTT interval tuning (raw seconds; converted to jiffies in ucp_init_module_params) */
{ .procname = "ucp_probe_rtt_base_sec", .data = &ucp_probe_rtt_base_sec, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_probe_rtt_class_extra_sec", .data = &ucp_probe_rtt_class_extra_sec, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_probe_rtt_high_loss_extra_sec", .data = &ucp_probe_rtt_high_loss_extra_sec, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_probe_rtt_mid_loss_extra_sec", .data = &ucp_probe_rtt_mid_loss_extra_sec, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_probe_rtt_max_sec", .data = &ucp_probe_rtt_max_sec, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
/* Congestion window gain and graduated caps (permyriad) */
{ .procname = "ucp_cwnd_gain", .data = &ucp_cwnd_gain, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_cwnd_cap_mild", .data = &ucp_cwnd_cap_mild, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_cwnd_cap_moderate", .data = &ucp_cwnd_cap_moderate, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_cwnd_cap_severe", .data = &ucp_cwnd_cap_severe, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
/* ACK aggregation compensation (permyriad, 0=disabled, 10000=1.0x BBR standard) */
{ .procname = "ucp_extra_acked_gain", .data = &ucp_extra_acked_gain, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
/* PROBE_BW probe pacing gain and mobile override (permyriad) */
{ .procname = "ucp_probe_gain", .data = &ucp_probe_gain, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_probe_gain_mobile", .data = &ucp_probe_gain_mobile, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
/* Early drain trigger and drain gain levels (permyriad) */
{ .procname = "ucp_drain_loss_thresh", .data = &ucp_drain_loss_thresh, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_drain_gain_light", .data = &ucp_drain_gain_light, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_drain_gain_standard", .data = &ucp_drain_gain_standard, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_drain_gain_aggressive", .data = &ucp_drain_gain_aggressive, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_drain_loss_lvl2_thresh", .data = &ucp_drain_loss_lvl2_thresh, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_drain_loss_lvl3_thresh", .data = &ucp_drain_loss_lvl3_thresh, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
/* General loss and RTT thresholds for condition/class classifiers (permyriad) */
{ .procname = "ucp_low_loss_thresh", .data = &ucp_low_loss_thresh, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_high_loss_thresh", .data = &ucp_high_loss_thresh, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_probe_skip_loss_thresh", .data = &ucp_probe_skip_loss_thresh, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_probe_skip_rtt_rise", .data = &ucp_probe_skip_rtt_rise, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
/* STARTUP loss-based gain reduction thresholds and reduced gain value (permyriad) */
{ .procname = "ucp_startup_soft_drain_thresh", .data = &ucp_startup_soft_drain_thresh, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_startup_hard_cap_thresh", .data = &ucp_startup_hard_cap_thresh, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ .procname = "ucp_startup_soft_gain", .data = &ucp_startup_soft_gain, .maxlen = sizeof(int), .mode = 0644, .proc_handler = ucp_proc_handler },
{ } /* Sentinel: empty entry marks the end of the table (required by older kernels' register_sysctl; harmless on newer kernels that size the table automatically) */
};
/**
* @brief Module initialization function.
* @return 0 on success, negative error code if registration fails
*
* Called when the module is loaded (insmod/modprobe). Performs:
* 1. Compile-time check that struct ucp fits within ICSK_CA_PRIV_SIZE (104 bytes)
* 2. Pre-compute all module parameter derived values (permyriad to BBR_SCALE, seconds to jiffies)
* 3. Register sysctl table under /proc/sys/net/ucp/ for sysctl -w support
* 4. Register the UCP congestion control algorithm with the TCP stack
*/
static int __init ucp_register(void)
{
int ret; /* Return value from tcp_register_congestion_control, propagated to module init */
/* ASSERT: struct ucp must fit in 104 bytes (ICSK_CA_PRIV_SIZE).
* If this BUILD_BUG_ON fires, you need to shrink the struct or enlarge the kernel's ca_priv area */
BUILD_BUG_ON(sizeof(struct ucp) > ICSK_CA_PRIV_SIZE);
/* Pre-compute all permyriad -> BBR_SCALE and seconds -> jiffies derived values.
* This must run before registration so the hot-path variables are ready for the first ACK.
* Also called from ucp_proc_handler / module_param_cb to refresh on runtime writes */
ucp_init_module_params();
/* Register the sysctl table under /proc/sys/net/ucp/. This enables:
* sysctl -w net.ucp.ucp_extra_acked_gain=0
* echo "net.ucp.ucp_bbr_mode=1" >> /etc/sysctl.d/ucp.conf
* The table's ucp_proc_handler calls ucp_init_module_params() on every write,
* so all cached values are refreshed immediately without unloading the module.
* register_sysctl() returns NULL on failure; that is non-fatal here, the
* module simply loads without runtime sysctl tuning */
ucp_ctl_header = register_sysctl("net/ucp", ucp_ctl_table);
/* Register the UCP congestion control ops with the TCP stack.
* After this succeeds, new TCP connections can select "ucp" via the
* net.ipv4.tcp_congestion_control sysctl or the TCP_CONGESTION setsockopt */
ret = tcp_register_congestion_control(&tcp_ucp_cong_ops);
/* If CC registration failed (e.g. name conflict or out of memory),
* clean up the sysctl table to avoid orphan entries in /proc/sys/net/ucp/.
* Without this, the sysctl nodes would exist but point to a dead module */
if (ret && ucp_ctl_header) {
unregister_sysctl_table(ucp_ctl_header);
ucp_ctl_header = NULL; /* Invalidate the handle so ucp_unregister won't double-free */
}
return ret; /* 0 = success, negative errno = failure (module load will be rejected) */
}
/**
* @brief Module exit function.
*
* Called when the module is unloaded (rmmod). Unregisters the UCP congestion
* control algorithm from the TCP stack and cleans up the sysctl table.
* All connections using UCP will be transitioned to the system default
* congestion control algorithm.
*/
static void __exit ucp_unregister(void)
{
/* Unregister the CC ops: the TCP stack will switch any remaining UCP
* connections to the system default CC before this call returns.
* This must happen first to prevent new connections from selecting UCP
* while we're tearing down */
tcp_unregister_congestion_control(&tcp_ucp_cong_ops);
/* Remove /proc/sys/net/ucp/ sysctl entries created at module init.
* Safe to call with NULL (ucp_ctl_header stays NULL if registration
* failed in ucp_register, and is set to NULL after a failed CC reg) */
if (ucp_ctl_header) {
unregister_sysctl_table(ucp_ctl_header);
ucp_ctl_header = NULL; /* Defensive: prevent double-unregister on any future code path */
}
}
module_init(ucp_register); /* Declare ucp_register as the module entry point, called on insmod */
module_exit(ucp_unregister); /* Declare ucp_unregister as the module exit point, called on rmmod */
MODULE_AUTHOR("PPP PRIVATE NETWORK™ X"); /* Module author: the organization that developed UCP */
MODULE_AUTHOR("Original BBR: Van Jacobson, Neal Cardwell, Yuchung Cheng, "
"Soheil Hassas Yeganeh (Google)"); /* Credit the original BBR algorithm authors whose work UCP is derived from */
MODULE_LICENSE("Dual BSD/GPL"); /* Dual licensing: recipients may take the code under either the BSD or the GPL v2 terms */
MODULE_DESCRIPTION("TCP UCP v1.0 - BBR-based congestion control with non-destructive constraint layer, "
"6-class path classifier, graduated cwnd ceilings, bandwidth floors, "
"ACK aggregation compensation, and pure BBR-compatible mode. "
"Fits within ICSK_CA_PRIV_SIZE=104 bytes."); /* Module description displayed by modinfo */
MODULE_VERSION("1.0"); /* Module version string: major.minor; aligned with the formal release version */