Scripted Creation of an AWS EC2 Instance + Installing Elasticsearch and Kibana + Integrating OpenTelemetry Monitoring

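Table of contents

Why deploy servers with scripts?
EC2 instance type and hardware selection
Other configuration
- Security group configuration
- Network configuration
- IAM Role
- Key Pair
- Internal domain name
Writing the automation script
- Properties file
- EBS configuration file
- Command to create the EC2 instance
  - user data file
  - OpenTelemetry monitoring
- Creating the internal domain name
- Notifying the relevant people of the deployment result
Verification
Summary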

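I recently took part in some server deployment work, so this post records and summarizes the process and reasoning from my first time deploying a server automatically with scripts.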

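Why deploy servers with scripts?

In production, we do not allow anyone to create a server, or bring one online or offline, by clicking through the cloud provider's console. The reasons are:

- The process cannot be standardized. Nobody knows exactly what each round of clicking did, and if someone else takes over maintenance, they will not know which steps are needed to create a server.
- A process that cannot be standardized cannot be automated. Every deployment repeats the same manual work, which becomes very error-prone over time.
- With a script, every change is tracked in git, so when looking back later we know why a step was added or removed.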

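EC2 instance type and hardware selection

Before writing the script, we first need to choose the instance type and hardware for the actual use case.

Choosing the instance type

We plan to install a single-node Elasticsearch plus Kibana on this instance. Elasticsearch leans heavily on memory (JVM heap plus filesystem cache), so we chose a memory-optimized type rather than general-purpose, compute-optimized, or other types.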

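Memory

This Elasticsearch instance stores only non-core data and runs as a single node. Based on that and on past experience, we consider 16 GB an appropriate size.

CPU

According to the official Elasticsearch documentation, CPU is usually not the limiting factor for Elasticsearch, so we consider 2 vCPUs sufficient to start with.

Storage

We use an EBS gp3 volume with an initial size of 30 GB. The size estimate is based on actual test results.

Architecture

The choices are x86_64 and arm64. arm64 is somewhat cheaper, and we are gradually migrating from x86_64 to arm64, so we chose arm64.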

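Operating system

The operating system is kept consistent across the whole platform: Rocky Linux 8.10. The AMI we use on AWS is ami-06459b48b47a92d77, a Rocky Linux 8.9 image that the user data script upgrades to 8.10.

Final choice

After filtering on the conditions above, the candidate instance types are r6g.large, r7g.large, and r8g.large. r8g.large is the newest generation and we were concerned about its stability, so we chose the middle generation, r7g.large.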

Other configuration

Security group configuration

As required, open Kibana's port 5601 and Elasticsearch's port 9200 so that the internal web servers can access them.
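
The same rule can be scripted instead of clicked together in the console. The following is a minimal sketch, assuming the web servers share a security group of their own. The function name, the AWS_CLI override variable, and the group IDs in the usage line are hypothetical and not part of the original setup.

```shell
#!/bin/bash
# AWS_CLI is a hypothetical override so the function can be dry-run with a stub.
AWS_CLI="${AWS_CLI:-aws}"

# Allow inbound TCP on each listed port from a source security group.
# Usage: allow_ports_from_group <target-sg-id> <source-sg-id> <port>...
allow_ports_from_group() {
  local sg_id="$1" source_sg="$2" port
  shift 2
  for port in "$@"; do
    "$AWS_CLI" ec2 authorize-security-group-ingress \
      --group-id "$sg_id" \
      --protocol tcp \
      --port "$port" \
      --source-group "$source_sg" || return 1
  done
}

# Example (placeholder IDs):
#   allow_ports_from_group "sg-elastic" "sg-webservers" 5601 9200
```

Note that authorize-security-group-ingress reports an error when the rule already exists, so a rerun of this step needs to tolerate that case.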

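Network configuration

Like the other servers, this one uses the shared VPC and subnet.

IAM Role

Configure an IAM Role as needed.

Key Pair

Like the other servers, this one uses the shared key pair.

Internal domain name

All of our servers are accessed through internal domain names rather than IPs, because IPs can change. So the internal domain name has to be decided before writing the script, for example: elastic-stack-standalone.xxx.io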

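Writing the automation script

We use the run-instances command of the aws ec2 CLI to create the instance and the change-resource-record-sets command of the aws route53 CLI to create the internal domain name.

Properties file

We put the hardware settings and the other configuration described above into a properties file, server.properties.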

SERVER_TYPE="elastic-stack-standalone"
SERVER_INSTANCE_TYPE="r7g.large"
# arm64 rocky linux 8.9 instead of x86_64
SERVER_AMI="ami-06459b48b47a92d77"

# security group id
SG_ID="security group id"

# key pair.
KEY_PAIR_NAME=keyPairName

# networking
SUBNET_ID="subnet id"

# elastic stack standalone server does not need public IP
PUBLIC_IP=""

# private domain name
ROUTE53_FILE="change-resource-record-sets.json"
PRIVATE_DOMAIN="elastic-stack-standalone.xxx.io"
HOSTED_ZONE_ID=hostZoneId
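
Because the file uses plain KEY=value shell syntax, the deployment script can source it directly. The sketch below shows one way to load it and fail early when a setting is missing. The helper names require_vars and load_properties are hypothetical, and the list of mandatory keys is an assumption based on the listing above.

```shell
#!/bin/bash
# Return non-zero if any of the named variables is unset or empty.
require_vars() {
  local name value missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (names are hard-coded below).
    eval "value=\${${name}:-}"
    if [ -z "$value" ]; then
      echo "missing required setting: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Source a properties file and check the keys the deployment relies on.
load_properties() {
  local file="$1"
  [ -r "$file" ] || { echo "cannot read ${file}" >&2; return 1; }
  . "$file"
  require_vars SERVER_TYPE SERVER_INSTANCE_TYPE SERVER_AMI SG_ID SUBNET_ID \
    KEY_PAIR_NAME PRIVATE_DOMAIN HOSTED_ZONE_ID
}

# Example:
#   load_properties ./server.properties || exit 1
```

Sourcing executes the file as shell code, so this only suits property files that live in the same trusted repository as the script.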

EBS configuration file

device-mappings.json

[
    {
        "DeviceName": "/dev/sda1",
        "Ebs": {
            "VolumeSize": 30,
            "VolumeType": "gp3",
            "DeleteOnTermination": true
        }
    }
]

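Command to create the EC2 instance

Apart from USER_DATA, all variables are read from server.properties. PRIVATE_IP, the fixed private address of the instance, is assumed to be defined there as well, although the listing above does not show it.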

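# KEY_PAIR_NAME is the key name defined in server.properties;
# the block device mapping is read from a file, which requires the file:// prefix
aws ec2 run-instances --image-id "${SERVER_AMI}" \
--key-name "${KEY_PAIR_NAME}" \
--user-data "${USER_DATA}" \
--instance-type "${SERVER_INSTANCE_TYPE}" \
--block-device-mappings file://device-mappings.json \
--subnet-id "${SUBNET_ID}" \
--security-group-ids "${SG_ID}" \
--private-ip-address "${PRIVATE_IP}"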
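
A deployment script usually needs the new instance ID for later steps and should not continue until the instance is running. The following sketch wraps the command above for that purpose. The function name launch_instance and the AWS_CLI override are hypothetical, and passing the user data as file://user-data.txt instead of a USER_DATA variable is an assumption made to keep the example self-contained.

```shell
#!/bin/bash
# AWS_CLI is a hypothetical override so the function can be dry-run with a stub.
AWS_CLI="${AWS_CLI:-aws}"

# Create the instance, wait until it is running, and print its instance ID.
# Expects the server.properties variables to be set in the environment.
launch_instance() {
  local instance_id
  instance_id=$("$AWS_CLI" ec2 run-instances \
    --image-id "${SERVER_AMI}" \
    --key-name "${KEY_PAIR_NAME}" \
    --user-data file://user-data.txt \
    --instance-type "${SERVER_INSTANCE_TYPE}" \
    --block-device-mappings file://device-mappings.json \
    --subnet-id "${SUBNET_ID}" \
    --security-group-ids "${SG_ID}" \
    --private-ip-address "${PRIVATE_IP}" \
    --query 'Instances[0].InstanceId' \
    --output text) || return 1

  # Block until EC2 reports the instance state as "running".
  "$AWS_CLI" ec2 wait instance-running --instance-ids "$instance_id" || return 1
  echo "$instance_id"
}

# Example:
#   INSTANCE_ID=$(launch_instance) || exit 1
```

The running state only means the virtual machine has booted. The user data script is typically still executing at that point, so application-level checks have to come later.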

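user data file

User data can be thought of as the follow-up operations you want to run after AWS has created the instance. When it starts with #!, it is executed as a script with root privileges on the first boot.

For example:

- Upgrade the operating system
- Install software such as git and an LDAP client
- Create and configure users

The content of user-data.txt is: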

#!/bin/bash
# user data is only executed as a shell script when it starts with "#!"

install_software() {
  echo "install required software"
  yum install -y expect git openldap-clients sssd sssd-ldap net-tools compat-openssl10 bc
}

init_os() {
  # upgrade rocky linux to 8.10 from 8.9
  yum -y update
  # the config_* helpers below are not listed in this post
  config_security
  config_network_and_firewall
  config_system_settings_for_elastic_stack
  install_software
}

config_ldap_client() {
  echo "config ldap client"
  CONF="/git/repositories/deployment/server-setup/ldap-client"
  yes | cp -fp $CONF/etc/openldap/ldap-pro.conf /etc/openldap/ldap.conf
  yes | cp -fp $CONF/etc/sssd/sssd-pro.conf /etc/sssd/sssd.conf
  # sssd refuses to start unless sssd.conf is readable by root only
  chmod 600 /etc/sssd/sssd.conf
  # restart sssd so it picks up the new configuration
  systemctl restart sssd oddjobd
  systemctl enable sssd oddjobd

  # create home directory for ldap login
  authselect select sssd with-mkhomedir
  systemctl restart sshd

  # add LDAP users to proper user groups
  # (userList is a placeholder for the real user names)
  for U in userList; do
    usermod -aG wheel $U
  done
}

install_elastic_stack_with_rpm() {
  rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch

  cat <<EOF | tee /etc/yum.repos.d/elasticsearch.repo >/dev/null
[elasticsearch]
name=Elasticsearch repository for 8.x packages
baseurl=https://artifacts.elastic.co/packages/8.x/yum
gpgcheck=0
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=0
autorefresh=1
type=rpm-md
EOF
  # the two install_*_then_start helpers are not listed in this post
  install_elasticsearch_then_start
  install_kibana_then_start
}

install_monitor() {
   cp -rp  /root/repositories/deployment/server-setup/monitored_host /opt/
   chmod u+x /opt/monitored_host/elastic-stack/standalone/monitor.sh
   /opt/monitored_host/elastic-stack/standalone/monitor.sh
}

main() {
  init_os
  pull_git_repo
  config_ldap_client
  install_elastic_stack_with_rpm
  # was install_otel_monitor, which is not defined anywhere
  install_monitor
}
main

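The main flow is in the main function. I have not listed every function, only a few. For example, init_os upgrades Rocky Linux from 8.9 to 8.10, because the newest AMI available on AWS is 8.9 while we run 8.10, and config_ldap_client sets up LDAP, which we use to manage server users.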

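OpenTelemetry monitoring

If you are not familiar with OpenTelemetry, I recommend reading the official documentation to understand what it does.

For OS-level monitoring we use node_exporter, and for Elasticsearch monitoring we use elasticsearch_exporter. otel_collector then scrapes the metrics from both and exposes them on a single endpoint for the Prometheus server to collect.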

monitor.sh

#!/bin/bash
# this script installs node_exporter, elasticsearch_exporter and the opentelemetry collector
workspace=/opt/monitored_host

# map the machine architecture to the name used in release file names
architecture=$(arch)
case "$architecture" in
  aarch64) hardware_architecture="arm64" ;;
  x86_64)  hardware_architecture="amd64" ;;
  *)       hardware_architecture="unknown-architecture" ;;
esac

echo "The architecture is: $hardware_architecture"


install_node_exporter() {
  cd /opt
  # first release asset that matches this architecture
  URL=$(curl -s https://api.github.com/repos/prometheus/node_exporter/releases | grep browser_download_url | grep "linux-$hardware_architecture" | head -n 1 | cut -d '"' -f 4)
  FILE=$(basename "$URL")
  DIR=${FILE%.tar.gz}
  curl -LO "$URL"
  tar -zxf "$FILE"
  # point /opt/node_exporter at the freshly unpacked version
  rm -rf /opt/node_exporter
  ln -s "/opt/$DIR" /opt/node_exporter
  rm -f "$FILE"

  # add node_exporter service
  cd "$workspace"
  \cp systemd_service/node_exporter.service /etc/systemd/system
  systemctl daemon-reload
  systemctl enable node_exporter.service
  systemctl start node_exporter.service
}

install_elasticsearch_exporter() {
  cd /opt
  URL=$(curl -s https://api.github.com/repos/prometheus-community/elasticsearch_exporter/releases | grep browser_download_url | grep "linux-$hardware_architecture" | head -n 1 | cut -d '"' -f 4)
  FILE=$(basename "$URL")
  DIR=${FILE%.tar.gz}
  curl -LO "$URL"
  tar -zxf "$FILE"
  # remove a stale link first so a rerun does not fail
  rm -rf /opt/elasticsearch_exporter
  ln -s "/opt/$DIR" /opt/elasticsearch_exporter
  rm -f "$FILE"
  # add service
  cd "$workspace"
  \cp systemd_service/elastic_stack_sre_exporter.service /etc/systemd/system
  systemctl daemon-reload
  systemctl enable elastic_stack_sre_exporter.service
  systemctl start elastic_stack_sre_exporter.service
}

install_otelcol() {
  cd /opt
  # core distribution only: skip otelcol-contrib and pick the rpm for this architecture
  URL=$(curl -s https://api.github.com/repos/open-telemetry/opentelemetry-collector-releases/releases|grep "browser_download_url"|grep -v "otelcol-contrib"|grep rpm|grep "linux_$hardware_architecture"|head -n 1|cut -d '"' -f 4)
  FILE=$(basename "$URL")
  curl -LO "$URL"
  # -U installs the package or upgrades an existing one
  rpm -Uh "$FILE"
  rm -f "$FILE"

  # add otel user unless it already exists
  id otel >/dev/null 2>&1 || useradd otel -s /sbin/nologin -M

  # add otel config path
  mkdir -p /etc/otelcol

  cd "$workspace"
  \cp elastic-stack/standalone/otelcol.yml /etc/otelcol/config.yaml
  # replace the host_name placeholder with the real host name
  sed -ri 's#( *host_name: ).*#\1"'"$(hostname)"'"#' /etc/otelcol/config.yaml
  \cp systemd_service/otelcol.service /etc/systemd/system
  systemctl daemon-reload
  systemctl enable otelcol.service
  systemctl restart otelcol.service
}
main() {
  install_node_exporter
  install_elasticsearch_exporter
  install_otelcol
}
main
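
The three install functions repeat the same step: query the GitHub releases API and take the first asset whose name matches the architecture. That step can be pulled into one helper that also fails loudly when nothing matches, instead of handing an empty URL to curl. The sketch below is one possible shape. The helper names are hypothetical, and using the releases/latest endpoint (which skips pre-releases) instead of the full release list is a swapped-in choice, not what the original script does.

```shell
#!/bin/bash
# Read a GitHub releases API response on stdin and print the first
# browser_download_url that contains the given substring.
pick_asset_url() {
  local needle="$1"
  grep '"browser_download_url"' | grep -F -- "$needle" | head -n 1 | cut -d '"' -f 4
}

# Download the matching asset of the latest release into the current
# directory and print its file name. Returns non-zero if nothing matches.
download_latest_asset() {
  local repo="$1" needle="$2" url
  url=$(curl -fsSL "https://api.github.com/repos/${repo}/releases/latest" | pick_asset_url "$needle")
  if [ -z "$url" ]; then
    echo "no release asset matching '${needle}' found for ${repo}" >&2
    return 1
  fi
  curl -fsSLO "$url" && basename "$url"
}

# Example:
#   FILE=$(download_latest_asset prometheus/node_exporter "linux-arm64") || exit 1
```

Unauthenticated GitHub API requests are rate limited per source IP, so pinning a version in the properties file may be more predictable for repeated deployments.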

otel_collector configuration file

extensions:
  health_check:

receivers:
  prometheus/os:
    config:
      scrape_configs:
        - job_name: 'node_exporter'
          scrape_interval: 5s
          static_configs:
            - targets:
                - "127.0.0.1:9100"
  prometheus/elasticsearch:
    config:
      scrape_configs:
        - job_name: 'elasticsearch_exporter'
          scrape_interval: 5s
          static_configs:
            - targets:
                - "127.0.0.1:9114"


exporters:
  prometheus/main:
    endpoint: "0.0.0.0:8090"
    const_labels:
      host_locale: "product"
      host_name: "replace_me"

service:
  # an extension only runs when it is listed here
  extensions: [health_check]
  pipelines:
    metrics/00:
      receivers: [prometheus/os, prometheus/elasticsearch]
      exporters: [prometheus/main]

Creating the internal domain name

change-resource-record-sets.json

{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "host.xxxx.io",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [
          {
            "Value": "IP0"
          }
        ]
      }
    }
  ]
}

route53 CLI request command

# substitute the placeholder values in the JSON file
ROUTE53_REQUEST=$(cat "${ROUTE53_FILE}")
# PRIVATE_DOMAIN from server.properties already holds the full record name
ROUTE53_REQUEST=${ROUTE53_REQUEST/host.xxxx.io/${PRIVATE_DOMAIN}}
ROUTE53_REQUEST=${ROUTE53_REQUEST/IP0/${PRIVATE_IP}}

aws route53 change-resource-record-sets --hosted-zone-id "${HOSTED_ZONE_ID}" --change-batch "${ROUTE53_REQUEST}" --output json
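
An alternative to patching placeholders in a template file is to generate the change batch from a function and then wait until Route 53 has applied the change. The sketch below illustrates that variant. The function names and the AWS_CLI override are hypothetical, and the record layout simply mirrors change-resource-record-sets.json above.

```shell
#!/bin/bash
# AWS_CLI is a hypothetical override so the function can be dry-run with a stub.
AWS_CLI="${AWS_CLI:-aws}"

# Print an UPSERT change batch for a single A record.
render_change_batch() {
  local fqdn="$1" ip="$2"
  cat <<EOF
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${fqdn}",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [{ "Value": "${ip}" }]
      }
    }
  ]
}
EOF
}

# Submit the change and block until Route 53 reports it as applied.
upsert_private_record() {
  local zone_id="$1" fqdn="$2" ip="$3" change_id
  change_id=$("$AWS_CLI" route53 change-resource-record-sets \
    --hosted-zone-id "$zone_id" \
    --change-batch "$(render_change_batch "$fqdn" "$ip")" \
    --query 'ChangeInfo.Id' \
    --output text) || return 1
  "$AWS_CLI" route53 wait resource-record-sets-changed --id "$change_id"
}

# Example:
#   upsert_private_record "${HOSTED_ZONE_ID}" "${PRIVATE_DOMAIN}" "${PRIVATE_IP}"
```

Because the action is UPSERT, running the step again with a new IP overwrites the existing record rather than failing.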

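Notifying the relevant people of the deployment result

The deployment only requires running one script, after which you can go do something else. Once the main script finishes, it sends the deployment result to the relevant people or updates the list of servers in service. Which IM tool to integrate depends on what your company actually uses; hook it up as needed.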
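
Many IM tools accept an incoming webhook, so the notification step often reduces to one HTTP POST. The sketch below assumes such a webhook. WEBHOOK_URL, the function names, and the {"text": ...} payload shape are all hypothetical and have to be adapted to the tool actually in use.

```shell
#!/bin/bash
# Escape backslashes and double quotes so the text fits inside a JSON string.
# (Single-line messages only; newlines are not handled.)
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the JSON body for a deployment result message.
build_payload() {
  local status="$1" host="$2"
  printf '{"text": "%s"}' "$(json_escape "Deployment of ${host} finished: ${status}")"
}

# POST the message to the webhook; aborts if WEBHOOK_URL is not set.
notify() {
  curl -fsS -X POST \
    -H 'Content-Type: application/json' \
    -d "$(build_payload "$1" "$2")" \
    "${WEBHOOK_URL:?WEBHOOK_URL is not set}"
}

# Example:
#   notify "SUCCESS" "${PRIVATE_DOMAIN}"
```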

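Verification

Strictly speaking, verification should also be integrated into the script, but I have not done that here. I verified manually, mainly checking:

- Whether the OS has been upgraded
- Whether the configured users can log in to the server
- Whether Elasticsearch and Kibana are reachable
- Whether the other web servers can reach Elasticsearch through the internal domain name
- Whether Prometheus can collect the metrics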
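
Several of these checks are easy to script. The following sketch covers the OS version and the HTTP endpoints. The function names are hypothetical, and the URLs assume that Elasticsearch and Kibana answer plain HTTP without authentication; with the default Elasticsearch 8.x security settings they would need https and credentials instead. Port 8090 is the Prometheus exporter endpoint from the collector configuration.

```shell
#!/bin/bash
# Print VERSION_ID from an os-release style file (run this on the instance).
os_version() {
  local file="${1:-/etc/os-release}"
  # os-release files use shell syntax; source in a subshell to avoid leaking variables.
  ( . "$file" && printf '%s\n' "$VERSION_ID" )
}

# Succeed if the URL answers with a non-error HTTP status within 5 seconds.
http_ok() {
  curl -fsS -o /dev/null --max-time 5 "$1"
}

# Check Elasticsearch, Kibana and the collector endpoint via the internal
# domain name; report every result and return non-zero if any check failed.
verify_endpoints() {
  local domain="$1" failed=0 url
  for url in \
    "http://${domain}:9200" \
    "http://${domain}:5601/api/status" \
    "http://${domain}:8090/metrics"; do
    if http_ok "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}

# Example (from a web server inside the VPC):
#   verify_endpoints "elastic-stack-standalone.xxx.io" || exit 1
```

Running verify_endpoints from one of the web servers also covers the internal domain name check, since every URL goes through that name.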

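Summary

The overall approach is:

- All servers use the same deployment script, which keeps the deployment process standardized
- Each server has its own server.properties and user-data.txt, so each deployment only needs these two files
- Post-deployment verification can also be scripted and integrated into the deployment process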