Deploying a Ceph cluster on Rocky 9 (with ceph-ansible)

This is not the officially recommended installation method; upstream Ceph recommends cephadm.

Server preparation

OS        Hostname  Public IP        Cluster IP
Rocky9.2  node1     192.168.202.129  192.168.142.128
Rocky9.2  node2     192.168.202.130  192.168.142.129
Rocky9.2  node3     192.168.202.131  192.168.142.130
Rocky9.2  node4     192.168.202.132  192.168.142.131
Rocky9.2  node5     192.168.202.134  192.168.142.132

Each server has the following disks (nvme0n1 holds the OS; the two blank 20 GB disks nvme0n2 and nvme0n3 are used for OSDs later):

sh
[root@node2 ~]# lsblk
NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sr0              11:0    1  1.5G  0 rom  
nvme0n1         259:0    0   60G  0 disk 
├─nvme0n1p1     259:1    0    1G  0 part /boot
└─nvme0n1p2     259:2    0   59G  0 part 
  ├─rl_192-root 253:0    0   57G  0 lvm  /
  └─rl_192-swap 253:1    0    2G  0 lvm  [SWAP]
nvme0n2         259:3    0   20G  0 disk 
nvme0n3         259:4    0   20G  0 disk

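Disable the firewall and SELinux on every node

sh
# Stop firewalld now and disable it at boot
systemctl disable firewalld --now

# Disable SELinux permanently (the config change takes effect after a reboot)
sed -i 's#SELINUX=enforcing#SELINUX=disabled#g' /etc/selinux/config
# Switch SELinux to permissive mode immediately for the current boot
setenforce 0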
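
To check that the `sed` expression above behaves as intended before editing the real file, you can run it on a scratch copy first. The sketch below assumes a hypothetical helper, `selinux_mode` (not part of any tool), that prints the value of the `SELINUX=` line; on the live system, `getenforce` reports the mode actually in force.

```sh
# Hypothetical helper: print the SELINUX= mode recorded in a config file
# (defaults to /etc/selinux/config). SELINUXTYPE= lines do not match.
selinux_mode() {
    sed -n 's/^SELINUX=//p' "${1:-/etc/selinux/config}"
}

# Dry run of the same sed expression on a scratch copy, not the real file
tmp=$(mktemp)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$tmp"
sed -i 's#SELINUX=enforcing#SELINUX=disabled#g' "$tmp"
selinux_mode "$tmp"    # prints: disabled
rm -f "$tmp"
```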

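Ports used by the Ceph components

Service        Port                     Description
Monitor        3300/TCP, 6789/TCP       Communication within the Ceph cluster (3300 is the msgr2 port, 6789 the legacy msgr1 port)
Manager        7000/TCP                 Communication with the Ceph Manager dashboard
Manager        8003/TCP                 Communication with the Ceph Manager RESTful API over HTTPS
Manager        9283/TCP                 Communication with the Ceph Manager Prometheus plugin
OSD            6800-7300/TCP            Each OSD uses three ports in this range: one for clients and monitors over the public network, one for sending data to other OSDs over the cluster network, and one for heartbeats over the cluster network
RADOS Gateway  7480/TCP (configurable)  RADOS Gateway listens on 7480/TCP by default; the port can be changed, e.g. to 80/TCP or 443/TCP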
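
If you would rather keep firewalld running than disable it as above, the ports in the table could be opened instead. The function below is a sketch with a hypothetical name, `ceph_firewall_cmds`; it only prints `firewall-cmd` commands for one role so you can review them before piping them to `sh`. The monitor ports 3300 and 6789 match what `ss` shows for ceph-mon further down.

```sh
# Hypothetical helper: print (not run) the firewall-cmd commands that open
# the ports from the table for one Ceph role: mon, mgr, osd or rgw.
ceph_firewall_cmds() {
    local ports p
    case "$1" in
        mon) ports="3300/tcp 6789/tcp" ;;
        mgr) ports="7000/tcp 8003/tcp 9283/tcp" ;;
        osd) ports="6800-7300/tcp" ;;
        rgw) ports="7480/tcp" ;;
        *)   echo "unknown role: $1" >&2; return 1 ;;
    esac
    for p in $ports; do
        echo "firewall-cmd --permanent --add-port=$p"
    done
    # --permanent rules only take effect after a reload
    echo "firewall-cmd --reload"
}

# Example for a node that runs a monitor and OSDs (review first, then run):
# { ceph_firewall_cmds mon; ceph_firewall_cmds osd; } | sh
```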

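Keep the server clocks synchronized

Time synchronization matters for the cluster: if the monitors' clocks drift apart, Ceph reports clock skew and the cluster health drops to HEALTH_WARN.

sh
# Start chronyd now and enable it at boot so the clock stays synchronized
systemctl enable chronyd --now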
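
Once chronyd is running, `chronyc tracking` shows how far the local clock is from its NTP source. Ceph raises a clock-skew warning when monitor clocks differ by more than `mon_clock_drift_allowed`, which defaults to 0.05 s (the all.yml below relaxes it to 0.5). The helper below is a hypothetical local sanity check, not what Ceph itself measures: it reads `chronyc tracking` output and compares the "System time" offset with a threshold.

```sh
# Hypothetical helper: read `chronyc tracking` output on stdin and succeed
# if the "System time" offset is within the threshold in seconds
# (default 0.05, Ceph's default mon_clock_drift_allowed).
# Exit status: 0 = within threshold, 1 = too far off, 2 = no "System time" line.
clock_offset_ok() {
    awk -v t="${1:-0.05}" -F': *' '
        /^System time/ {
            split($2, a, " ")          # a[1] is the unsigned offset in seconds
            off = a[1] + 0
            found = 1
            if (off > t + 0) exit 1
            exit 0
        }
        END { if (!found) exit 2 }'
}

# Usage on a node:
# chronyc tracking | clock_offset_ok 0.05 && echo "clock OK"
```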

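Preparation

In this example node1 acts as the control node, i.e. the host that runs ceph-ansible.

sh
# Install the EPEL repository: some packages Ceph depends on come from EPEL.
# Every node that installs Ceph packages needs it (see Problem 1 below).
dnf install epel-release -y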

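Clone ceph-ansible

sh
git clone https://github.com/ceph/ceph-ansible.git
cd ceph-ansible
# Each ceph-ansible stable branch only supports certain Ceph releases; list the
# branches and check out the one matching ceph_stable_release in all.yml
git branch -r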

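Install the Python packages and Ansible collections ceph-ansible needs (run these inside the ceph-ansible directory)

sh
# Rocky 9 does not ship pip3 by default, so install it first
dnf install python3-pip -y
# Point pip at a mirror in China (Tsinghua)
pip3 config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# Install the Python dependencies, including Ansible itself
pip3 install -r requirements.txt
# requirements.txt pins the Ansible version, so Ansible does not need to be installed separately

# Install the Ansible collections ceph-ansible depends on
ansible-galaxy install -r requirements.yml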

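Set up passwordless SSH login from node1 to all hosts

sh
# Create a key pair on node1 first if it does not have one yet
ssh-keygen
ssh-copy-id root@192.168.202.129
ssh-copy-id root@192.168.202.130
ssh-copy-id root@192.168.202.131
ssh-copy-id root@192.168.202.132
ssh-copy-id root@192.168.202.134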

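Define the Ansible inventory

Note: the group names must be mons and mgrs, because the ceph-ansible playbooks select hosts by these names (see, for example, site.yml.sample). This example uses the default inventory path /etc/ansible/hosts; Ansible installed with pip does not create /etc/ansible, so create the directory first if it is missing.

sh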
# cat /etc/ansible/hosts
[mons]
192.168.202.129
192.168.202.130
192.168.202.131

[mgrs]
192.168.202.129
192.168.202.130
192.168.202.131

# If the dashboard is enabled, a [monitoring] group is required, and it must not be placed on a node that runs mgrs
[monitoring]
192.168.202.134
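
Because the playbooks select hosts by group name, a typo such as `[mon]` for `[mons]` only surfaces when the play runs. `ansible-inventory --graph` shows the groups Ansible actually parsed. For an offline check of an INI inventory, the sketch below uses a hypothetical helper, `inventory_has_groups`, that fails when a named group is missing or lists no hosts.

```sh
# Hypothetical helper: succeed only if every named group exists in an
# INI-style Ansible inventory and lists at least one host.
# Usage: inventory_has_groups /etc/ansible/hosts mons mgrs [monitoring ...]
inventory_has_groups() {
    local inv=$1 g n
    shift
    [ -r "$inv" ] || { echo "cannot read $inv" >&2; return 1; }
    for g in "$@"; do
        # Count host lines under [g] up to the next [section] header,
        # skipping blank lines and #/; comments.
        n=$(awk -v g="[$g]" '
            /^\[/ { h = $0; sub(/[ \t]+$/, "", h); in_g = (h == g); next }
            in_g && NF && $1 !~ /^[#;]/ { c++ }
            END { print c + 0 }' "$inv")
        if [ "$n" -eq 0 ]; then
            echo "missing or empty group: $g" >&2
            return 1
        fi
    done
}

# Example: inventory_has_groups /etc/ansible/hosts mons mgrs monitoring
```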

Copy the Ansible entry-point playbook

sh
[root@node1 ceph-ansible]# cp site.yml.sample site.yml

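Edit the variable files

Only the OSD settings are closely tied to the hardware, so strictly speaking only osds.yml needs to be copied; copying the others is optional.

sh
# Go to the group_vars directory
[root@node1 group_vars]# pwd
/root/ceph-ansible/group_vars

# Copy the .sample files to .yml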
[root@node1 group_vars]# cp mons.yml.sample mons.yml
[root@node1 group_vars]# cp mgrs.yml.sample mgrs.yml

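Edit the Ceph cluster configuration

all.yml holds the cluster-wide settings: where the Ceph packages come from (the yum repository and its key), which release to install, and a number of Ceph configuration options.

sh
# Create all.yml from its sample

[root@node1 group_vars]# pwd
/root/ceph-ansible/group_vars
[root@node1 group_vars]# cp all.yml.sample all.yml

# Edit all.yml; the non-comment lines used here are: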
[root@node1 group_vars]# grep -vE '^$|^#'  all.yml
---
dummy:
  #fetch_directory: ~/ceph-ansible-keys # directory the keys are copied to
ntp_service_enabled: false # do not let ceph-ansible manage NTP; chronyd is already set up on the system
ceph_origin: repository  # install Ceph from a package repository (the mirror below)
ceph_repository: community  # use the upstream community packages
ceph_mirror: http://mirrors.aliyun.com/ceph
ceph_stable_key: http://mirrors.aliyun.com/ceph/keys/release.asc
ceph_stable_release: reef  # Ceph release to install; must correspond to the ceph-ansible branch checked out, since each branch only supports specific releases
ceph_stable_repo: "{{ ceph_mirror }}/rpm-{{ ceph_stable_release }}"
rbd_cache: "true"
rbd_cache_writethrough_until_flush: "false"
rbd_client_directories: false # this will create rbd_client_log_path and rbd_client_admin_socket_path directories with proper permissions
dashboard_enabled: false
monitor_interface: ens160  # interface the monitors listen on
journal_size: 5120 # OSD journal size in MB
public_network: 192.168.202.0/24    # network clients use to reach the cluster
cluster_network: 192.168.142.0/24   # network for OSD replication and heartbeat traffic
ceph_conf_overrides:
  global:
    mon_osd_allow_primary_affinity: 1
    mon_clock_drift_allowed: 0.5
    osd_pool_default_size: 2
    osd_pool_default_min_size: 1
    mon_pg_warn_min_per_osd: 0
    mon_pg_warn_max_per_osd: 0
    mon_pg_warn_max_object_skew: 0
  client:
    rbd_default_features: 1
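
`public_network` and `cluster_network` must contain the addresses from the server table (192.168.202.x public, 192.168.142.x cluster); a node whose address falls outside them cannot be matched to that network. The sketch below uses hypothetical helpers, `ip_to_int` and `ip_in_cidr`, to test IPv4 membership in a CIDR block and runs the table's addresses against the two settings. With this article's values the loops print nothing.

```sh
# Hypothetical helper: convert a dotted IPv4 address to an integer.
ip_to_int() {
    local IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Hypothetical helper: succeed if IPv4 address $1 lies inside CIDR block $2.
ip_in_cidr() {
    local ip=$1 net=${2%/*} bits=${2#*/}
    local mask=$(( bits == 0 ? 0 : (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Compare the server table with public_network / cluster_network in all.yml
for ip in 192.168.202.129 192.168.202.130 192.168.202.131 192.168.202.132 192.168.202.134; do
    ip_in_cidr "$ip" 192.168.202.0/24 || echo "$ip not in public_network"
done
for ip in 192.168.142.128 192.168.142.129 192.168.142.130 192.168.142.131 192.168.142.132; do
    ip_in_cidr "$ip" 192.168.142.0/24 || echo "$ip not in cluster_network"
done
```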

Run Ansible

sh
# Without -e yes_i_know=true the playbook refuses to run and tells you to migrate to cephadm
ansible-playbook site.yml -e yes_i_know=true

Check the result

sh
[root@node1 ~]# ceph -s
  cluster:
    id:     a82e91cb-dedc-4115-b44a-2bc8b7190afe
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            1 daemons have recently crashed
            6 mgr modules have recently crashed
            OSD count 0 < osd_pool_default_size 2
 
  services:
    mon: 3 daemons, quorum node1,node2,node3 (age 38m)
    mgr: node3(active, since 36m), standbys: node2, node1
    osd: 0 osds: 0 up, 0 in
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     
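
Each warning above has a usual next step: `ceph crash ls` and `ceph crash archive-all` to review and acknowledge the crash reports, and `ceph config set mon auth_allow_insecure_global_id_reclaim false` to stop allowing insecure global_id reclaim. Only run the last one once every client has been updated, since unpatched clients are refused afterwards. The OSD count warning goes away once OSDs are added in the next step. As a sketch, the hypothetical functions `health_warnings` and `suggest_fix` below pull the warning lines out of `ceph -s` and map them to those commands.

```sh
# Hypothetical helper: print the indented warning lines that follow
# "health: HEALTH_..." in `ceph -s` output read from stdin.
health_warnings() {
    awk '
        /health: HEALTH_/ { h = 1; next }
        h && /^[ \t]*$/   { exit }          # a blank line ends the health block
        h                 { sub(/^[ \t]+/, ""); print }'
}

# Hypothetical helper: map one warning line to the command usually used for it.
suggest_fix() {
    case "$1" in
        *"insecure global_id reclaim"*)
            echo "ceph config set mon auth_allow_insecure_global_id_reclaim false" ;;
        *"recently crashed"*)
            echo "ceph crash ls; ceph crash archive-all" ;;
        *"OSD count"*)
            echo "add OSDs (osds.yml + [osds] group)" ;;
        *)
            echo "ceph health detail" ;;
    esac
}

# Usage on a mon node:
# ceph -s | health_warnings | while IFS= read -r w; do
#     printf '%s\n    -> %s\n' "$w" "$(suggest_fix "$w")"
# done
```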

Add and edit the OSD variable file

sh
# Create osds.yml from its sample
[root@node1 ceph-ansible]# cp group_vars/osds.yml.sample group_vars/osds.yml

# Edit osds.yml; the disks to turn into OSDs are listed under devices
[root@node1 ceph-ansible]# grep -vE '^$|^#' group_vars/osds.yml
---
dummy:
devices:
  - /dev/nvme0n2
  - /dev/nvme0n3
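
ceph-ansible turns each entry under `devices` into an OSD, so the listed disks should be empty: no partitions, LVM volumes or filesystem signatures. Once Ceph is installed, `ceph-volume inventory` reports which devices are available; before that, a rough check with `lsblk` is enough. The helper below, `is_blank_disk` (a hypothetical name), is a minimal sketch of that check, assuming the output of `lsblk -rno TYPE,FSTYPE <device>`.

```sh
# Hypothetical helper: succeed if `lsblk -rno TYPE,FSTYPE <device>` output
# (read from stdin) describes a bare disk: a single "disk" line with no
# filesystem signature and no child partitions or LVM volumes.
is_blank_disk() {
    awk 'NR == 1 && $1 == "disk" && NF == 1 { ok = 1 }
         NR > 1                              { ok = 0 }
         END                                 { exit !ok }'
}

# Usage on each OSD host:
# for d in /dev/nvme0n2 /dev/nvme0n3; do
#     lsblk -rno TYPE,FSTYPE "$d" | is_blank_disk || echo "$d is not empty"
# done
```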

Add the OSD hosts to the Ansible inventory

sh
[root@node1 ceph-ansible]# cat /etc/ansible/hosts 
[mons]
192.168.202.129
192.168.202.130
192.168.202.131

[mgrs]
192.168.202.129
192.168.202.130
192.168.202.131

[osds]
192.168.202.129
192.168.202.130
192.168.202.131
192.168.202.132

Run ansible-playbook again

sh
ansible-playbook site.yml -e yes_i_know=true

Check the result

sh
[root@node1 ceph-ansible]# ceph status
  cluster:
    id:     a82e91cb-dedc-4115-b44a-2bc8b7190afe
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            1 daemons have recently crashed
            9 mgr modules have recently crashed
 
  services:
    mon: 3 daemons, quorum node1,node2,node3 (age 2m)
    mgr: node3(active, since 112s), standbys: node1, node2
    osd: 8 osds: 8 up (since 46s), 8 in (since 57s)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   213 MiB used, 160 GiB / 160 GiB avail
    pgs:     1 active+clean
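
The numbers add up: 4 hosts in the [osds] group × 2 disks under `devices` gives 8 OSDs, and 8 × 20 GiB gives the 160 GiB shown under usage. The hypothetical helper below extracts the total/up/in counts from the `osd:` line of `ceph -s`, so the check can be scripted, for example after adding hosts; the expected count is hard-coded from this article's layout.

```sh
# Hypothetical helper: from `ceph -s` output on stdin, print
# "<total> <up> <in>" taken from the "osd: N osds: N up, N in" line.
osd_counts() {
    awk '/osd: [0-9]+ osds:/ {
        for (i = 2; i <= NF; i++) {
            f = $i; sub(/,$/, "", f)          # "up," -> "up"
            if (f == "osds:")   total = $(i - 1)
            else if (f == "up") up    = $(i - 1)
            else if (f == "in") nin   = $(i - 1)
        }
        print total, up, nin
    }'
}

# 4 hosts in [osds] x 2 entries under devices = 8 OSDs expected
expected=$(( 4 * 2 ))

# Usage on a mon node:
# set -- $(ceph -s | osd_counts)      # $1 = total, $2 = up, $3 = in
# if [ "$1" -eq "$expected" ] && [ "$2" -eq "$1" ] && [ "$3" -eq "$1" ]; then
#     echo "all $1 OSDs up and in"
# fi
```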

Check the addresses and ports the daemons listen on

sh
[root@node1 ceph-ansible]# ss -tunlp
Netid     State       Recv-Q      Send-Q             Local Address:Port           Peer Address:Port     Process                                   
udp       UNCONN      0           0                      127.0.0.1:323                 0.0.0.0:*         users:(("chronyd",pid=800,fd=5))         
udp       UNCONN      0           0                          [::1]:323                    [::]:*         users:(("chronyd",pid=800,fd=6))         
tcp       LISTEN      0           512              192.168.202.129:6789                0.0.0.0:*         users:(("ceph-mon",pid=37151,fd=28))     
tcp       LISTEN      0           512              192.168.202.129:6802                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=22))     
tcp       LISTEN      0           512              192.168.202.129:6803                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=23))     
tcp       LISTEN      0           512              192.168.202.129:6800                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=18))     
tcp       LISTEN      0           512              192.168.202.129:6801                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=19))     
tcp       LISTEN      0           512              192.168.202.129:6806                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=22))     
tcp       LISTEN      0           512              192.168.202.129:6807                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=23))     
tcp       LISTEN      0           512              192.168.202.129:6804                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=18))     
tcp       LISTEN      0           512              192.168.202.129:6805                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=19))     
tcp       LISTEN      0           512              192.168.202.129:3300                0.0.0.0:*         users:(("ceph-mon",pid=37151,fd=27))     
tcp       LISTEN      0           128                      0.0.0.0:22                  0.0.0.0:*         users:(("sshd",pid=813,fd=3))            
tcp       LISTEN      0           512              192.168.142.128:6802                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=24))     
tcp       LISTEN      0           512              192.168.142.128:6803                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=25))     
tcp       LISTEN      0           512              192.168.142.128:6800                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=20))     
tcp       LISTEN      0           512              192.168.142.128:6801                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=21))     
tcp       LISTEN      0           512              192.168.142.128:6806                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=24))     
tcp       LISTEN      0           512              192.168.142.128:6807                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=25))     
tcp       LISTEN      0           512              192.168.142.128:6804                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=20))     
tcp       LISTEN      0           512              192.168.142.128:6805                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=21))     
tcp       LISTEN      0           128                         [::]:22                     [::]:*         users:(("sshd",pid=813,fd=4))  
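
In this output the monitor listens on 3300 and 6789 on the public address only, while each OSD process (pid 47203 and pid 48625) listens on four ports on the public address and the same four on the cluster address. The hypothetical helper below condenses `ss -tunlp` output into one line per Ceph daemon and port, which makes comparing against the port table easier; it assumes the column layout shown above, where the fifth field is `Local Address:Port`.

```sh
# Hypothetical helper: from `ss -tunlp` output on stdin, list each Ceph
# daemon and the ports it listens on (public and cluster addresses merged).
ceph_listen_summary() {
    awk '/"ceph-[a-z]+"/ {
        match($0, /"ceph-[a-z]+"/)
        name = substr($0, RSTART + 1, RLENGTH - 2)   # strip the quotes
        n = split($5, a, ":")                          # $5 = Local Address:Port
        print name, a[n]
    }' | sort -u -k1,1 -k2,2n
}

# Usage: ss -tunlp | ceph_listen_summary
```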

Install the client

node5 is used as the client.

Edit the clients variable file

sh
# Create clients.yml from its sample
[root@node1 ceph-ansible]# cp group_vars/clients.yml.sample group_vars/clients.yml

# Set clients.yml as follows
[root@node1 ceph-ansible]# grep -vE '^$|^#' group_vars/clients.yml
---
dummy:
copy_admin_key: true  # copy the admin keyring to the client; without it, authentication fails

Update the Ansible inventory

sh
[root@node1 ceph-ansible]# cat /etc/ansible/hosts 
[mons]
192.168.202.129
192.168.202.130
192.168.202.131

[mgrs]
192.168.202.129
192.168.202.130
192.168.202.131

[osds]
192.168.202.129
192.168.202.130
192.168.202.131
192.168.202.132

[clients]
192.168.202.134

Run ansible-playbook

sh
ansible-playbook site.yml -e yes_i_know=true

Problems encountered

Problem 1

sh
fatal: [192.168.202.131]: FAILED! => changed=false 
  attempts: 3
  failures: []
  msg: |-
    Depsolve Error occurred:
     Problem 1: conflicting requests
      - nothing provides libtcmalloc.so.4()(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides libthrift-0.14.0.so()(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides liboath.so.0()(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides liboath.so.0(LIBOATH_1.10.0)(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides liboath.so.0(LIBOATH_1.2.0)(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
     Problem 2: conflicting requests
      - nothing provides libtcmalloc.so.4()(64bit) needed by ceph-mon-2:17.2.6-0.el9.x86_64
  rc: 1
  results: []
fatal: [192.168.202.130]: FAILED! => changed=false 
  attempts: 3
  failures: []
  msg: |-
    Depsolve Error occurred:
     Problem 1: conflicting requests
      - nothing provides libtcmalloc.so.4()(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides libthrift-0.14.0.so()(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides liboath.so.0()(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides liboath.so.0(LIBOATH_1.10.0)(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides liboath.so.0(LIBOATH_1.2.0)(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
     Problem 2: conflicting requests
      - nothing provides libtcmalloc.so.4()(64bit) needed by ceph-mon-2:17.2.6-0.el9.x86_64
  rc: 1
  results: []
fatal: [192.168.202.129]: FAILED! => changed=false 
  attempts: 3
  failures: []
  msg: |-
    Depsolve Error occurred:
     Problem 1: conflicting requests
      - nothing provides libthrift-0.14.0.so()(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
     Problem 2: package ceph-mon-2:17.2.6-0.el9.x86_64 requires ceph-base = 2:17.2.6-0.el9, but none of the providers can be installed
      - package ceph-base-2:17.2.6-0.el9.x86_64 requires librgw2 = 2:17.2.6-0.el9, but none of the providers can be installed
      - conflicting requests
      - nothing provides libthrift-0.14.0.so()(64bit) needed by librgw2-2:17.2.6-0.el9.x86_64
  rc: 1
  results: []

Fix: install the EPEL repository on the nodes that install the Ceph packages

sh
dnf install epel-release -y

Problem 2

sh
TASK [ceph-infra : install firewalld python binding] *********************************************************************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: AttributeError: module 'lib' has no attribute 'OpenSSL_add_all_algorithms'
fatal: [192.168.202.129]: FAILED! => changed=false 
  module_stderr: |-
    Shared connection to 192.168.202.129 closed.
  module_stdout: |-
    Traceback (most recent call last):
      File "/root/.ansible/tmp/ansible-tmp-1694568395.6913803-54958-96914655721864/AnsiballZ_dnf.py", line 107, in <module>
        _ansiballz_main()
      File "/root/.ansible/tmp/ansible-tmp-1694568395.6913803-54958-96914655721864/AnsiballZ_dnf.py", line 99, in _ansiballz_main
        invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)
      File "/root/.ansible/tmp/ansible-tmp-1694568395.6913803-54958-96914655721864/AnsiballZ_dnf.py", line 47, in invoke_module
        runpy.run_module(mod_name='ansible.modules.dnf', init_globals=dict(_module_fqn='ansible.modules.dnf', _modlib_path=modlib_path),
      File "/usr/lib64/python3.9/runpy.py", line 225, in run_module
        return _run_module_code(code, init_globals, run_name, mod_spec)
      File "/usr/lib64/python3.9/runpy.py", line 97, in _run_module_code
        _run_code(code, mod_globals, init_globals,
      File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/tmp/ansible_ansible.legacy.dnf_payload_o8_ihnkm/ansible_ansible.legacy.dnf_payload.zip/ansible/modules/dnf.py", line 359, in <module>
      File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
      File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 664, in _load_unlocked
      File "<frozen importlib._bootstrap>", line 627, in _load_backward_compatible
      File "<frozen zipimport>", line 259, in load_module
      File "/tmp/ansible_ansible.legacy.dnf_payload_o8_ihnkm/ansible_ansible.legacy.dnf_payload.zip/ansible/module_utils/urls.py", line 115, in <module>
      File "/usr/lib/python3.9/site-packages/urllib3/contrib/pyopenssl.py", line 50, in <module>
        import OpenSSL.SSL
      File "/usr/lib/python3.9/site-packages/OpenSSL/__init__.py", line 8, in <module>
        from OpenSSL import crypto, SSL
      File "/usr/lib/python3.9/site-packages/OpenSSL/crypto.py", line 3279, in <module>
        _lib.OpenSSL_add_all_algorithms()
    AttributeError: module 'lib' has no attribute 'OpenSSL_add_all_algorithms'
  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: 1

Fix: upgrade pyOpenSSL (the Python OpenSSL binding) on the node where the module failed

sh
python3 -m pip install --upgrade pyOpenSSL

Problem 3

sh
TASK [ceph-mgr : add modules to ceph-mgr] ********************************************************************************************************
failed: [192.168.202.131 -> 192.168.202.129] (item=dashboard) => changed=true 
  ansible_loop_var: item
  cmd:
  - ceph
  - -n
  - client.admin
  - -k
  - /etc/ceph/ceph.client.admin.keyring
  - --cluster
  - ceph
  - mgr
  - module
  - enable
  - dashboard
  delta: '0:00:00.210799'
  end: '2023-09-13 09:32:06.953160'
  item: dashboard
  rc: 2
  start: '2023-09-13 09:32:06.742361'
  stderr: 'Error ENOENT: module ''dashboard'' reports that it cannot run on the active manager daemon: PyO3 modules may only be initialized once per interpreter process (pass --force to force enablement)'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

Problem 4

sh
[WARNING]: log file at /root/ansible/ansible.log is not writeable and we cannot create it, aborting

[DEPRECATION WARNING]: "include" is deprecated, use include_tasks/import_tasks instead. This feature will be removed in version 2.16. Deprecation
 warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
ERROR! couldn't resolve module/action 'openstack.config_template.config_template'. This often indicates a misspelling, missing collection, or incorrect module path.

The error appears to be in '/root/ceph-ansible/roles/ceph-config/tasks/main.yml': line 137, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:


- name: "generate {{ cluster }}.conf configuration file"
  ^ here
We could be wrong, but this one looks like it might be an issue with
missing quotes. Always quote template expression brackets when they
start a value. For instance:

    with_items:
      - {{ foo }}

Should be written as:

    with_items:
      - "{{ foo }}"

Fix: install the Ansible collections ceph-ansible needs (run in the ceph-ansible directory)

sh
# Install the collections listed in requirements.yml
ansible-galaxy install -r requirements.yml

Problem 5

sh
TASK [ceph-validate : fail if monitoring group doesn't exist] ************************************************************************************
fatal: [192.168.202.131]: FAILED! => changed=false 
  msg: you must add a monitoring group and add at least one node.
fatal: [192.168.202.129]: FAILED! => changed=false 
  msg: you must add a monitoring group and add at least one node.
fatal: [192.168.202.130]: FAILED! => changed=false 
  msg: you must add a monitoring group and add at least one node.

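Fix, option 1: disable the dashboard

sh
# Add the following to group_vars/all.yml
dashboard_enabled: false

Option 2: keep the dashboard enabled and add a [monitoring] group with at least one host to the inventory, as in the first inventory above.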
