ceph的ceph df detail中的max avail

复制代码

[root@node1 ~]# ceph osd pool ls detail
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 20 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr_devicehealth
pool 5 'testpool' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 208 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 6 'rbdpool' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 142 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

[root@node1 ~]# ceph osd pool get rbdpool crush_rule
crush_rule: replicated_rule
[root@node1 ~]# ceph osd pool get testpool crush_rule
crush_rule: hdd_host_rule
[root@node1 ~]# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
 -1         0.87900  root default
 -3         0.29300      host node1
  1    hdd  0.19499          osd.1       up   1.00000  1.00000
  0    ssd  0.09799          osd.0       up   1.00000  1.00000
 -7         0.29300      host node2
  2    hdd  0.19499          osd.2       up   1.00000  1.00000
  3    ssd  0.09799          osd.3       up   1.00000  1.00000
-10         0.29300      host node3
  5    hdd  0.19499          osd.5       up   1.00000  1.00000
  4    ssd  0.09799          osd.4       up   1.00000  1.00000
[root@node1 ~]# ceph osd crush rule ls
replicated_rule
ssd2_host_rule
hdd_host_rule
ssd1_host_rule
[root@node1 ~]# ceph osd crush rule dump replicated_rule
{
    "rule_id": 0,
    "rule_name": "replicated_rule",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 12,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

[root@node1 ~]# ceph osd crush rule dump hdd_host_rule
{
    "rule_id": 2,
    "rule_name": "hdd_host_rule",
    "ruleset": 2,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -6,
            "item_name": "default~hdd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

可以。你给的信息已经足够把 277 GiB 和 184 GiB 反推出来，而且这两个数基本就是按 "满盘阈值（full ratio）之前还能用的 RAW 空间 / 副本数" 算出来的。

1）关键点：`MAX AVAIL` 不是用 `RAW AVAIL` 直接算

Ceph 计算 MAX AVAIL 时通常不会把磁盘"用到 100%"，而是按集群的 full ratio 阈值 （常见默认是 0.95）来估算：

也就是只允许用到 95%，剩下 5% 视为不可用（防止达到 full / backfillfull 导致集群不可写/不可恢复）。

所以更像是：

MAX_AVAIL ≈ ( RAW_SIZE × full_ratio ) − RAW_USED replica_size \text{MAX\_AVAIL} \approx \frac{(\text{RAW\_SIZE} \times \text{full\_ratio}) - \text{RAW\_USED}}{\text{replica\_size}} MAX_AVAIL≈replica_size(RAW_SIZE×full_ratio)−RAW_USED

2）为什么 rbdpool 是 277 GiB？

你这边 rbdpool 的规则

rbdpool：crush_rule 0 → replicated_rule
replicated_rule：step take default（不限定 class）
副本数：size 3

因此它可以使用 default root 下的所有 OSD（hdd+ssd）。

套用你的 `ceph df detail` RAW 区数据（你前面那段输出）

TOTAL RAW SIZE = 900 GiB
TOTAL RAW USED = 19 GiB
假设 full_ratio = 0.95

计算：

可用到阈值的 RAW 容量：

(900 \times 0.95 = 855) GiB

扣除已用 RAW：

(855 - 19 = 836) GiB（还能写进去的 RAW 上限）

换算成"用户数据"（3 副本要乘 3 的 RAW）：

(836 / 3 = 278.666...) GiB

Ceph 输出为 277 GiB，这是完全合理的：

实际实现里会有取整、保留、以及按 crush/host 约束做保守估计，最终显示略小于 278.7。

所以：277 GiB ≈ (900×0.95 − 19)/3

3）为什么 testpool 是 184 GiB？

你这边 testpool 的规则

testpool：crush_rule 2 → hdd_host_rule
hdd_host_rule：step take default~hdd（限定只用 HDD class）
副本数：size 3

因此它只使用 HDD 那部分 RAW 空间。

从你 ceph df detail 的 RAW STORAGE：

HDD SIZE = 600 GiB
HDD RAW USED = 18 GiB
full_ratio 仍按 0.95

计算：

可用到阈值的 HDD RAW 容量：

(600 \times 0.95 = 570) GiB

扣除已用 RAW：

(570 - 18 = 552) GiB

换算成"用户数据"（3 副本）：

(552 / 3 = 184) GiB

这和你看到的 MAX AVAIL = 184 GiB 完全一致。

所以：184 GiB = (600×0.95 − 18)/3

4）为什么用 `ceph osd tree` 的权重也能解释得通，但不是主导项

你给的 osd tree 显示每个 host 上都有 hdd+ssd，且 replicated_rule 不限定 class；理论上数据会按权重在 hdd/ssd 间分布。

但你这里 SSD 与 HDD 都很空 （SSD 0.58% 用量，HDD 2.96% 用量），并不存在"某一类先被写满"导致的瓶颈，所以最终 MAX AVAIL 更接近"按 full_ratio 的全局上限估算"。

如果后期 SSD 比 HDD 更接近满盘阈值，replicated_rule 的 MAX AVAIL 就会被 SSD 子集卡住，不会再接近 (TOTAL×full_ratio)/3。

5）你可以用一条命令确认 full_ratio

建议你在集群上看一下（不同版本输出字段略有差异，但都能查到）：

bash 复制代码

ceph osd dump | egrep 'full_ratio|backfillfull_ratio|nearfull_ratio'

你很可能会看到 full_ratio 接近 0.95，与上述两组数值吻合。

如果你把 ceph osd dump 里这几个 ratio 行贴出来，我还能把 "277 为何不是 278.7（具体少了哪一部分）" 精确到实现细节（取整/保留/约束项）。