Setting Up Chaos Mesh, a Cloud-Native Chaos Engineering Platform
I. Environment Preparation
Make sure Helm is installed. To check whether Helm is available, run:
```shell
helm version
```
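The `helm install <release> <chart>` form used later in this guide is Helm 3 syntax (Helm 2 required `--name`), so it is worth confirming the client's major version. Below is a minimal sketch, assuming bash. The function names are hypothetical helpers and are not part of Helm:

```shell
# Print the major version from a Helm version string, e.g. "3" for "v3.14.2+gc309b6f".
# With no argument, read the local client version via `helm version --short`.
helm_major_version() {
  local v
  v="${1:-$(helm version --short)}"
  v="${v#v}"                  # strip the leading "v"
  printf '%s\n' "${v%%.*}"    # keep everything before the first dot
}

# Hypothetical check: fail unless helm is on PATH and is version 3 or newer.
require_helm3_or_newer() {
  if ! command -v helm >/dev/null 2>&1; then
    echo "helm not found in PATH" >&2
    return 1
  fi
  local major
  major="$(helm_major_version)"
  if [ "$major" -ge 3 ] 2>/dev/null; then
    return 0
  fi
  echo "expected Helm >= 3, found: $(helm version --short)" >&2
  return 1
}
```

Usage: `require_helm3_or_newer && echo "Helm OK"`.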

II. Installing with Helm
1. Add the Chaos Mesh repository
Add the Chaos Mesh repository to Helm:
```shell
helm repo add chaos-mesh https://charts.chaos-mesh.org
# Refresh the local chart index so newly published versions show up
helm repo update
```
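2. List the Chaos Mesh versions available for installation
```shell
# Latest version
helm search repo chaos-mesh
# All versions, including older ones
helm search repo chaos-mesh -l
```
As shown in the figure: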
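3. Install Chaos Mesh
```shell
# Create the namespace
kubectl create ns chaos-mesh
```
During installation, Kubernetes may be unable to pull the Chaos Mesh images (for example, because the default registry is not reachable from the cluster). You may also want to customize other settings. In either case, you can pass your own values file when deploying.
Open the chart directory `helm/chaos-mesh` in the chaos-mesh/chaos-mesh GitHub repository. Use the `release-2.7` branch, or whichever branch matches the version you are installing.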

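In that directory, open `values.yaml`. Copy the parts you need to change into a new local file and adjust them there.
The figure shows some settings you are likely to change. When copying, also copy the parent keys each setting sits under. Keep the indentation exactly as it is, because YAML nesting depends on it.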
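Instead of copying from GitHub, you can also dump the chart's defaults locally with `helm show values`. This makes it easier to copy whole sections together with their parent keys and indentation. Below is a minimal sketch. It assumes the repository was added as `chaos-mesh` above, and uses chart version `2.7.0` to match the `v2.7.0` image tag in the example below. The helper name and the output file name are hypothetical:

```shell
# Hypothetical helper: write the default values of a given chart version to a local file.
dump_default_values() {
  local version="${1:-2.7.0}"
  local out="${2:-chaos-mesh-default-values.yaml}"
  helm show values chaos-mesh/chaos-mesh --version "$version" > "$out" &&
    echo "default values written to $out"
}
```

Usage: run `dump_default_values 2.7.0`, then copy only the sections you change into your own values file.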

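Below is my own example. It mainly changes the image addresses and the timezone, and sets the container runtime to containerd. Replace the Aliyun registry placeholders with your own repository path: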
```yaml
rbac:
  create: true
# timezone is the timezone where controller-manager, chaos-daemon and dashboard uses.
# For example: "UTC" or "Asia/Shanghai"
# This value will be set on controller-manager and dashboard container's
# environment variable TZ.
# You may need to set the timezone to be consistent with your Grafana configuration,
# otherwise the query Grafana used to retrieve event maybe in wrong timezone.
timezone: "Asia/Shanghai"
images:
  # images.registry is the global container registry for the images, you could replace it with your self-hosted container registry.
  registry: "registry.cn-hangzhou.aliyuncs.com"
  # images.tag is the global image tag (for example, semiVer with prefix v, or latest).
  tag: "v2.7.0"
controllerManager:
  # securityContext if needed
  securityContext: {}
  # running chaos-controller-manager on host network
  hostNetwork: false
  # Allow testing on `hostNetwork` pods. This is Dangerous. Please run only as temporary solution.
  allowHostNetworkTesting: false
  # The serviceAccount for chaos-controller-manager
  serviceAccount: chaos-controller-manager
  # ServiceAccount annotations for chaos-controller-manager
  serviceAccountAnnotations: {}
  # Create the serviceAccount for chaos-controller-manager
  serviceAccountCreate: true
  # Custom priorityClassName for using pod priorities
  priorityClassName: ""
  # Replicas for chaos-controller-manager
  replicaCount: 3
  # image would be constructed by <registry>/<repository>:<tag>
  image:
    # override global registry, empty value means using the global images.registry
    registry: ""
    # repository part for image of chaos-controller-manager
    repository: <your-aliyun-namespace>/chaos-mesh
    # override global tag, empty value means using the global images.tag
    tag: ""
  # Image pull policy
  imagePullPolicy: IfNotPresent
  # The keys within the "env" map are mounted as environment variables on the pod.
  env:
    # WEBHOOK_PORT is configured the port for chaos-controller-manager provides webhooks.
    # In GKE private clusters, by default kubernetes apiservers are allowed to
    # talk to the cluster nodes only on 443 and 10250. so configuring
    # WEBHOOK_PORT: 10250, will work out of the box without needing to add firewall
    # rules or requiring NET_BIND_SERVICE capabilities to bind port numbers <1000
    WEBHOOK_PORT: 10250
    # METRICS_PORT is configured the port for chaos-controller-manager exposing prometheus metrics
    METRICS_PORT: 10080
  # If enabled, only pods in the namespace annotated with `"chaos-mesh.org/inject": "enabled"` could be injected
  enableFilterNamespace: false
  # targetNamespace only works with clusterScoped is false(namespace scoped mode).
  # It means namespace which will be injected chaos
  targetNamespace: chaos-mesh
  service:
    # Kubernetes Service type for service chaos-controller-manager
    type: ClusterIP
  resources:
    # We usually recommend not to specify default resources and to leave this as a conscious
    # choice for the user. This also increases chances charts run on environments with little
    # resources, such as Minikube. If you do want to specify resources, uncomment the following
    # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
    limits: {}
    #   cpu: 500m
    #   memory: 1024Mi
    requests:
      cpu: 25m
      memory: 256Mi
  # Node labels for chaos-controller-manager pod assignment
  nodeSelector: {}
  # Toleration labels for chaos-controller-manager pod assignment
  tolerations: []
  # Map of chaos-controller-manager node/pod affinities
  affinity: {}
  # Pod annotations of chaos-controller-manager
  podAnnotations: {}
  # A list of controllers to enable. "*" enables all controllers by default.
  enabledControllers:
    - "*"
  # A list of webhooks to enable. "*" enables all webhooks by default.
  enabledWebhooks:
    - "*"
  podChaos:
    podFailure:
      # Custom Pause Container Image for Pod Failure Chaos
      pauseImage: registry.cn-hangzhou.aliyuncs.com/<your-aliyun-namespace>/pause:latest
  leaderElection:
    # Enable leader election for controller manager.
    enabled: true
    # The duration that non-leader candidates will wait to force acquire leadership. This is measured against time of last observed ack.
    leaseDuration: 15s
    # The duration that the acting control-plane will retry refreshing leadership before giving up.
    renewDeadline: 10s
    # The duration the LeaderElector clients should wait between tries of actions.
    retryPeriod: 2s
  # chaosdSecurityMode is enabled for mTLS connection between chaos-controller-manager and chaosd
  chaosdSecurityMode: true
  # multi cluster install offline helm chart path
  localHelmChart:
    enabled: false
    volume:
      hostPath:
        path: /data/helm
        type: DirectoryOrCreate
chaosDaemon:
  # image would be constructed by <registry>/<repository>:<tag>
  image:
    # override global registry, empty value means using the global images.registry
    registry: ""
    # repository part for image of chaos-daemon
    repository: <your-aliyun-namespace>/chaos-daemon
    # empty tag means using the global images.tag
    tag: ""
  # Image pull policy
  imagePullPolicy: IfNotPresent
  # The port which grpc server listens on.
  grpcPort: 31767
  # The port which http server listens on.
  httpPort: 31766
  # extra chaosDaemon envs
  env: {}
  # securityContext if needed
  securityContext: {}
  # running chaosDaemon on host network
  hostNetwork: false
  # configurations about mtls.
  # currently we do not support use specified ca and cert for mtls, it would generate the ca and certs when chaos mesh deploy by helm.
  mtls:
    # enable mtls on the grpc connection between chaos-controller-manager and chaos-daemon
    enabled: true
  # Container runtime used by the cluster nodes, and the path to its socket
  runtime: containerd
  socketPath: /run/containerd/containerd.sock
dashboard:
  # Enable chaos-dashboard
  create: true
  # Optional, the secret name that has `DATABASE_DATASOURCE` defined.
  # It's recommended to use a secret to store the database credentials.
  databaseSecretName: ""
  # rootUrl specify the base url for openid/oauth2 (like GCP Auth Integration) callback URL.
  rootUrl: http://localhost:2333
  # securityContext if needed
  securityContext: {}
  # running chaos-dashboard on host network
  hostNetwork: false
  # replicas of chaos-dashboard
  replicaCount: 1
  # Custom priorityClassName for using pod priorities
  priorityClassName: ""
  # The serviceAccount for chaos-dashboard
  serviceAccount: chaos-dashboard
  image:
    # override global registry, empty value means using the global images.registry
    registry: ""
    # repository part for image of chaos-dashboard
    repository: <your-aliyun-namespace>/chaos-dashboard
    # override global tag, empty value means using the global images.tag
    tag: ""
  # Image pull policy
  imagePullPolicy: IfNotPresent
  # securityMode requires user to provide credentials on Chaos Dashboard, instead of using chaos-dashboard service account
  securityMode: true
dnsServer:
  # Enable DNS Server which required by DNSChaos
  create: true
  # Name of serviceaccount for chaos-dns-server.
  serviceAccount: chaos-dns-server
  # image would be constructed by <registry>/<repository>:<tag>
  image:
    # override global registry, empty value means using the global images.registry
    registry: ""
    # repository part for image of chaos-dns-server
    repository: <your-aliyun-namespace>/chaos-coredns
    # override global tag, empty value means using the global images.tag
    tag: "v0.2.6"
  # Image pull policy
  imagePullPolicy: IfNotPresent
  # Customized priorityClassName for chaos-dns-server
  priorityClassName: ""
```
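After updating the image addresses and the other settings, deploy with:
```shell
helm install chaos-mesh -f chaos_mesh_values.yaml chaos-mesh/chaos-mesh --namespace=chaos-mesh --create-namespace
```
`-f` points to your customized values file. Because `--create-namespace` is passed, Helm creates the `chaos-mesh` namespace if it does not already exist.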
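To make the install repeatable, one option is `helm upgrade --install`. It installs the release if it is absent and upgrades it otherwise, and `--version` pins the chart version. Below is a sketch, assuming chart version `2.7.0` and the values file name used above. `install_chaos_mesh` and `show_overrides` are hypothetical helpers:

```shell
# Hypothetical helper: install or upgrade the chaos-mesh release with a pinned chart version.
install_chaos_mesh() {
  local values_file="${1:-chaos_mesh_values.yaml}"
  local version="${2:-2.7.0}"
  helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
    --namespace chaos-mesh --create-namespace \
    --version "$version" \
    -f "$values_file"
}

# Show only the user-supplied values that Helm applied to the release.
show_overrides() {
  helm get values chaos-mesh --namespace chaos-mesh
}
```

Running `show_overrides` after the install is a quick way to confirm that the registry, timezone and runtime overrides were picked up.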
Check that the deployment succeeded. All Pods should reach the `Running` state:
```shell
kubectl get po -n chaos-mesh
```
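You don't have to re-run `kubectl get po` until everything is up. `kubectl wait` can block until the Pods are Ready. A jsonpath query then shows which image each Pod actually uses, which confirms whether the mirror registry override took effect. Below is a sketch, assuming the `chaos-mesh` namespace. The function names are hypothetical:

```shell
# Hypothetical helper: block until every Pod in the chaos-mesh namespace is Ready (default 5 minutes).
wait_chaos_mesh_ready() {
  local timeout="${1:-300s}"
  kubectl wait --for=condition=Ready pod --all -n chaos-mesh --timeout="$timeout"
}

# Hypothetical helper: print each Pod name followed by the images of its containers.
list_chaos_mesh_images() {
  kubectl get pods -n chaos-mesh \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
}
```

If the override worked, every image listed should start with `registry.cn-hangzhou.aliyuncs.com/`.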

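4. Access Chaos Mesh
The dashboard is reachable at `<cluster node IP>:30768`, as shown in the figure. Here 30768 is the NodePort of the `chaos-dashboard` Service. Check that Service if your port differs.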
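If the port in your cluster differs, you can read it from the dashboard Service. Below is a sketch that makes three assumptions: the Service uses the chart's default name `chaos-dashboard`, it is of type NodePort, and it exposes the dashboard on port 2333 (the port in `rootUrl` above). The function name is hypothetical:

```shell
# Hypothetical helper: print the NodePort mapped to the dashboard's port 2333.
dashboard_nodeport() {
  kubectl get svc chaos-dashboard -n chaos-mesh \
    -o jsonpath='{.spec.ports[?(@.port==2333)].nodePort}'
  echo
}
```

If no node IP is reachable from your machine, use port forwarding instead. Run `kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333`, then open http://localhost:2333.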

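Chaos Mesh RBAC authentication:

Select the namespace and role as shown in the following steps, then create the resources from the automatically generated manifest.
Note:
```shell
kubectl create token account-default-viewer-ixqbu
```
The token generated by this command expires after a limited time. You can request a longer lifetime with `--duration`, up to whatever limit the API server allows. For a long-lived token, create a Secret of type `kubernetes.io/service-account-token` for the service account instead: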
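```yaml
apiVersion: v1
kind: Secret
metadata:
  name: account-test-manager-sequd-token
  namespace: test
  annotations:
    # Must name an existing service account in the same namespace as this Secret
    kubernetes.io/service-account.name: account-test-manager-sequd
type: kubernetes.io/service-account-token
```
Note that `kubernetes.io/service-account.name` must be the name of the service account created in the previous step. The Secret must also live in the same namespace as that service account.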
```shell
# View the Secret; the token appears under the "token" key in the Data section
kubectl describe secrets -n test account-test-manager-sequd-token
```
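`kubectl describe` shows the token inside a longer listing. To print only the token, for example to paste it into the dashboard, decode the Secret's `token` field. Below is a sketch that uses the Secret name and namespace from the example above as defaults. `sa_token_from_secret` is a hypothetical helper:

```shell
# Hypothetical helper: print the decoded service-account token stored in a Secret.
sa_token_from_secret() {
  local secret="${1:-account-test-manager-sequd-token}"
  local namespace="${2:-test}"
  # Secret data is base64-encoded; decode it before use.
  kubectl get secret "$secret" -n "$namespace" -o jsonpath='{.data.token}' | base64 -d
  echo
}
```

On some older macOS versions, `base64` expects `-D` instead of `-d`.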

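Enter a name and the token in the dashboard login form. You can then create experiments.
III. Creating a Test Experiment
1. Choose the experiment type and set its conditions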


`Workers` is the number of stress processes. Here, three workers apply 100M of memory stress to the Pod.

Here you can configure label selectors and namespaces to decide which Pods take part in the experiment.
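You can also write the same experiment as a `StressChaos` manifest and apply it with `kubectl`, which is handy for repeating it. Below is a minimal sketch that renders such a manifest. The experiment name, namespace, `app` label value and the 5-minute `duration` are placeholders. Writing the size as `100MB` is an assumption about how the form's 100M maps onto the manifest:

```shell
# Hypothetical helper: print a StressChaos manifest with a memory stressor
# (3 workers, 100MB) targeting Pods labeled app=<label> in one namespace.
render_memory_stress() {
  local name="$1" namespace="$2" app_label="$3"
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  mode: all
  selector:
    namespaces:
      - ${namespace}
    labelSelectors:
      app: ${app_label}
  stressors:
    memory:
      workers: 3
      size: '100MB'
  duration: '5m'
EOF
}
```

Usage: `render_memory_stress memory-stress-demo test my-app | kubectl apply -f -`. Run it with an account that is allowed to create StressChaos objects in the target namespace.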

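Note: if nothing happens when you submit in the final step, open the browser developer tools (F12) and check the failing API request for the error message. The error does not show up in the Pod logs.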
a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.'
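This failure was caused by the experiment name. The name must be lowercase, may contain only lowercase letters, digits, '-' and '.', and must start and end with an alphanumeric character.
After renaming the experiment, it submits normally.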
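Because the dashboard reports this error only in the API response, it can help to check a name before submitting. Below is a sketch of the Kubernetes RFC 1123 subdomain rule. A valid name has at most 253 characters and contains only lowercase alphanumerics, `-` and `.`. Each dot-separated part must start and end with an alphanumeric character. The function name is hypothetical:

```shell
# Hypothetical helper: return 0 if $1 is a valid RFC 1123 subdomain
# (the naming rule Kubernetes applies to most object names), 1 otherwise.
is_valid_k8s_name() {
  local LC_ALL=C   # make [a-z] mean ASCII lowercase only
  local name="$1"
  local re='^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$'
  [ "${#name}" -ge 1 ] && [ "${#name}" -le 253 ] || return 1
  [[ $name =~ $re ]]
}
```

Usage: `is_valid_k8s_name "MyExperiment" || echo "rename before submitting"`.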
2. Check the experiment results
After submitting, you can see that the experiment is running.

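Running `top` inside the container now shows extra processes putting memory pressure on the Pod. This confirms that Chaos Mesh is installed correctly and the experiment works as expected.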
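You can also check the experiment from the command line. Below is a sketch. It assumes `kubectl` access to the target namespace and a container image that ships a `top` accepting `-b -n 1` (the procps and BusyBox versions do). The function name and its arguments are placeholders:

```shell
# Hypothetical helper: show the StressChaos object, then a one-shot process list
# from inside the target Pod, where the stress workers should appear.
check_memory_stress() {
  local name="$1" namespace="$2" pod="$3"
  kubectl get stresschaos "$name" -n "$namespace"
  kubectl exec -n "$namespace" "$pod" -- top -b -n 1
}
```

When you are done, `kubectl delete stresschaos <name> -n <namespace>` removes the experiment and stops the injected stress.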
