Deploying Prometheus on Kubernetes
These notes record deploying kube-prometheus release-0.5 on a Kubernetes 1.18 cluster.
1. Clone the kube-prometheus repository
shell
git clone https://github.com/coreos/kube-prometheus.git
cd kube-prometheus
My cluster runs Kubernetes 1.18, so I use the matching kube-prometheus branch, release-0.5:
shell
# Switch to release-0.5 (checked out as a local branch named 0.5)
git checkout -b 0.5 origin/release-0.5
Compatibility between Kubernetes versions and kube-prometheus branches:
kube-prometheus stack | Kubernetes 1.14 | Kubernetes 1.15 | Kubernetes 1.16 | Kubernetes 1.17 | Kubernetes 1.18 |
---|---|---|---|---|---|
release-0.3 | ✔ | ✔ | ✔ | ✔ | ✗ |
release-0.4 | ✗ | ✗ | ✔ | ✔ | ✗ |
release-0.5 | ✗ | ✗ | ✗ | ✗ | ✔ |
HEAD | ✗ | ✗ | ✗ | ✗ | ✔ |
For the current compatibility matrix, see the official kube-prometheus repository: https://github.com/prometheus-operator/kube-prometheus . Switch branches there to see the matrix for each release.
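The table can be turned into a small lookup so the branch choice is not made by eye. This is only a sketch: `pick_branch` is a hypothetical helper name, and where the table marks several branches as compatible it returns the newest one, which is my own assumption rather than an upstream rule.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a Kubernetes minor version to a kube-prometheus
# branch that the compatibility table marks as supported.
# Assumption: when several branches are compatible, prefer the newest.
pick_branch() {
  case "$1" in
    1.14|1.15) echo "release-0.3" ;;
    1.16|1.17) echo "release-0.4" ;;
    1.18)      echo "release-0.5" ;;
    *)         echo "unknown"; return 1 ;;   # not covered by the table
  esac
}

pick_branch 1.18   # prints release-0.5
```

Versions outside the table return `unknown` with a non-zero status, so a script can stop instead of checking out a guess.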
2. Inspect the manifests
shell
[root@k8s-master kube-prometheus]# cd manifests/
[root@k8s-master manifests]# ll
total 1684
-rw-r--r-- 1 root root 405 Jun 12 16:20 alertmanager-alertmanager.yaml
-rw-r--r-- 1 root root 973 Jun 12 16:20 alertmanager-secret.yaml
-rw-r--r-- 1 root root 96 Jun 12 16:20 alertmanager-serviceAccount.yaml
-rw-r--r-- 1 root root 254 Jun 12 16:20 alertmanager-serviceMonitor.yaml
-rw-r--r-- 1 root root 308 Jun 12 16:22 alertmanager-service.yaml
-rw-r--r-- 1 root root 550 Jun 12 16:20 grafana-dashboardDatasources.yaml
-rw-r--r-- 1 root root 1405645 Jun 12 16:20 grafana-dashboardDefinitions.yaml
-rw-r--r-- 1 root root 454 Jun 12 16:20 grafana-dashboardSources.yaml
-rw-r--r-- 1 root root 7539 Jun 12 16:20 grafana-deployment.yaml
-rw-r--r-- 1 root root 86 Jun 12 16:20 grafana-serviceAccount.yaml
-rw-r--r-- 1 root root 208 Jun 12 16:20 grafana-serviceMonitor.yaml
-rw-r--r-- 1 root root 238 Jun 12 16:22 grafana-service.yaml
-rw-r--r-- 1 root root 376 Jun 12 16:20 kube-state-metrics-clusterRoleBinding.yaml
-rw-r--r-- 1 root root 1651 Jun 12 16:20 kube-state-metrics-clusterRole.yaml
-rw-r--r-- 1 root root 1925 Jun 12 16:20 kube-state-metrics-deployment.yaml
-rw-r--r-- 1 root root 192 Jun 12 16:20 kube-state-metrics-serviceAccount.yaml
-rw-r--r-- 1 root root 829 Jun 12 16:20 kube-state-metrics-serviceMonitor.yaml
-rw-r--r-- 1 root root 403 Jun 12 16:20 kube-state-metrics-service.yaml
-rw-r--r-- 1 root root 266 Jun 12 16:20 node-exporter-clusterRoleBinding.yaml
-rw-r--r-- 1 root root 283 Jun 12 16:20 node-exporter-clusterRole.yaml
-rw-r--r-- 1 root root 2775 Jun 12 16:20 node-exporter-daemonset.yaml
-rw-r--r-- 1 root root 92 Jun 12 16:20 node-exporter-serviceAccount.yaml
-rw-r--r-- 1 root root 711 Jun 12 16:20 node-exporter-serviceMonitor.yaml
-rw-r--r-- 1 root root 355 Jun 12 16:20 node-exporter-service.yaml
-rw-r--r-- 1 root root 292 Jun 12 16:20 prometheus-adapter-apiService.yaml
-rw-r--r-- 1 root root 396 Jun 12 16:20 prometheus-adapter-clusterRoleAggregatedMetricsReader.yaml
-rw-r--r-- 1 root root 304 Jun 12 16:20 prometheus-adapter-clusterRoleBindingDelegator.yaml
-rw-r--r-- 1 root root 281 Jun 12 16:20 prometheus-adapter-clusterRoleBinding.yaml
-rw-r--r-- 1 root root 188 Jun 12 16:20 prometheus-adapter-clusterRoleServerResources.yaml
-rw-r--r-- 1 root root 219 Jun 12 16:20 prometheus-adapter-clusterRole.yaml
-rw-r--r-- 1 root root 1378 Jun 12 16:20 prometheus-adapter-configMap.yaml
-rw-r--r-- 1 root root 1344 Jun 12 16:20 prometheus-adapter-deployment.yaml
-rw-r--r-- 1 root root 325 Jun 12 16:20 prometheus-adapter-roleBindingAuthReader.yaml
-rw-r--r-- 1 root root 97 Jun 12 16:20 prometheus-adapter-serviceAccount.yaml
-rw-r--r-- 1 root root 236 Jun 12 16:20 prometheus-adapter-service.yaml
-rw-r--r-- 1 root root 269 Jun 12 16:20 prometheus-clusterRoleBinding.yaml
-rw-r--r-- 1 root root 216 Jun 12 16:20 prometheus-clusterRole.yaml
-rw-r--r-- 1 root root 621 Jun 12 16:20 prometheus-operator-serviceMonitor.yaml
-rw-r--r-- 1 root root 751 Jun 12 16:20 prometheus-prometheus.yaml
-rw-r--r-- 1 root root 293 Jun 12 16:20 prometheus-roleBindingConfig.yaml
-rw-r--r-- 1 root root 983 Jun 12 16:20 prometheus-roleBindingSpecificNamespaces.yaml
-rw-r--r-- 1 root root 188 Jun 12 16:20 prometheus-roleConfig.yaml
-rw-r--r-- 1 root root 820 Jun 12 16:20 prometheus-roleSpecificNamespaces.yaml
-rw-r--r-- 1 root root 86744 Jun 12 16:20 prometheus-rules.yaml
-rw-r--r-- 1 root root 93 Jun 12 16:20 prometheus-serviceAccount.yaml
-rw-r--r-- 1 root root 6829 Jun 12 16:20 prometheus-serviceMonitorApiserver.yaml
-rw-r--r-- 1 root root 395 Jun 12 16:20 prometheus-serviceMonitorCoreDNS.yaml
-rw-r--r-- 1 root root 6172 Jun 12 16:20 prometheus-serviceMonitorKubeControllerManager.yaml
-rw-r--r-- 1 root root 6778 Jun 12 16:20 prometheus-serviceMonitorKubelet.yaml
-rw-r--r-- 1 root root 347 Jun 12 16:20 prometheus-serviceMonitorKubeScheduler.yaml
-rw-r--r-- 1 root root 247 Jun 12 16:20 prometheus-serviceMonitor.yaml
-rw-r--r-- 1 root root 297 Jun 12 16:21 prometheus-service.yaml
drwxr-xr-x 2 root root 4096 Jun 12 16:20 setup
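3. Change the image registry
Point the prometheus-operator, prometheus, alertmanager, kube-state-metrics, node-exporter, and prometheus-adapter images at the USTC (University of Science and Technology of China) mirror. Run these commands from the manifests directory: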
shell
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' setup/prometheus-operator-deployment.yaml
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' prometheus-prometheus.yaml
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' alertmanager-alertmanager.yaml
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' kube-state-metrics-deployment.yaml
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' node-exporter-daemonset.yaml
sed -i 's/quay.io/quay.mirrors.ustc.edu.cn/g' prometheus-adapter-deployment.yaml
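The six sed calls can be folded into one function that also shows what was rewritten. A minimal sketch, assuming GNU sed (as on the CentOS host used here); `switch_registry` is a hypothetical name. The dot in `quay\.io` is escaped so that only the literal host name matches.

```shell
#!/usr/bin/env bash
# Rewrite quay.io references to the USTC mirror in the given files,
# then list every rewritten line so the result can be reviewed.
switch_registry() {
  local f
  for f in "$@"; do
    sed -i 's#quay\.io#quay.mirrors.ustc.edu.cn#g' "$f"
  done
  grep -Hn 'quay\.mirrors\.ustc\.edu\.cn' "$@" \
    || echo "no quay.io references were found" >&2
}

# Usage, from the manifests directory:
#   switch_registry setup/prometheus-operator-deployment.yaml \
#     prometheus-prometheus.yaml alertmanager-alertmanager.yaml \
#     kube-state-metrics-deployment.yaml node-exporter-daemonset.yaml \
#     prometheus-adapter-deployment.yaml
```

Running it twice is harmless, because the mirror host name no longer contains the string `quay.io`.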
4. Change the prometheus, alertmanager, and grafana Services to type NodePort
To reach prometheus, alertmanager, and grafana from outside the cluster, change their Services to type NodePort.
- Edit the prometheus Service
shell
[root@k8s-master kube-prometheus]# cat manifests/prometheus-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: k8s
  name: prometheus-k8s
  namespace: monitoring
spec:
  type: NodePort # added: expose the Service as a NodePort
  ports:
  - name: web
    port: 9090
    targetPort: web
    nodePort: 30090 # added: fixed node port
  selector:
    app: prometheus
    prometheus: k8s
  sessionAffinity: ClientIP
- Edit the grafana Service
shell
[root@k8s-master kube-prometheus]# cat manifests/grafana-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: grafana
  name: grafana
  namespace: monitoring
spec:
  type: NodePort # added: expose the Service as a NodePort
  ports:
  - name: http
    port: 3000
    targetPort: http
    nodePort: 32000 # added: fixed node port
  selector:
    app: grafana
- Edit the alertmanager Service
shell
[root@k8s-master kube-prometheus]# cat manifests/alertmanager-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    alertmanager: main
  name: alertmanager-main
  namespace: monitoring
spec:
  type: NodePort # added: expose the Service as a NodePort
  ports:
  - name: web
    port: 9093
    targetPort: web
    nodePort: 30093 # added: fixed node port
  selector:
    alertmanager: main
    app: alertmanager
  sessionAffinity: ClientIP
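The three nodePort values chosen above (30090, 32000, 30093) must fall inside the cluster's NodePort range, which is 30000-32767 unless kube-apiserver was started with a different `--service-node-port-range`. A quick sanity check, with `valid_nodeport` as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Succeed if the port lies inside the NodePort range.
# Defaults to 30000-32767; pass other bounds for a customized cluster.
valid_nodeport() {
  local port=$1 min=${2:-30000} max=${3:-32767}
  [ "$port" -ge "$min" ] && [ "$port" -le "$max" ]
}

for p in 30090 32000 30093; do
  if valid_nodeport "$p"; then
    echo "$p ok"
  else
    echo "$p out of range"
  fi
done
```

If a port is out of range, the API server rejects the Service when the manifest is applied, so checking first saves a round trip.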
5. Install kube-prometheus
Install the CRDs and prometheus-operator:
shell
[root@k8s-master manifests]# kubectl apply -f setup/
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
namespace/monitoring configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com configured
clusterrole.rbac.authorization.k8s.io/prometheus-operator created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-operator created
deployment.apps/prometheus-operator created
service/prometheus-operator created
serviceaccount/prometheus-operator created
[root@k8s-master manifests]# kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
prometheus-operator-5cd4d464cc-b9vqq 0/2 ContainerCreating 0 16s
Pulling the prometheus-operator image takes a few minutes; wait until the prometheus-operator pod is Running.
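Instead of re-running `kubectl get pod` by hand, the wait can be scripted with `kubectl rollout status`. A sketch only: `wait_for_operator` is a hypothetical wrapper, and the 300s default timeout is an arbitrary value I picked.

```shell
#!/usr/bin/env bash
# Block until the prometheus-operator Deployment has finished rolling out,
# or fail once the timeout expires.
wait_for_operator() {
  local timeout=${1:-300s}   # assumption: five minutes covers the image pull
  kubectl -n monitoring rollout status deployment/prometheus-operator \
    --timeout="$timeout"
}

# Usage: wait_for_operator && kubectl apply -f .
```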
Install prometheus, alertmanager, grafana, kube-state-metrics, node-exporter, and the remaining resources:
shell
[root@k8s-master manifests]# kubectl apply -f .
alertmanager.monitoring.coreos.com/main created
secret/alertmanager-main created
service/alertmanager-main created
serviceaccount/alertmanager-main created
servicemonitor.monitoring.coreos.com/alertmanager created
secret/grafana-datasources created
configmap/grafana-dashboard-apiserver created
configmap/grafana-dashboard-cluster-total created
configmap/grafana-dashboard-controller-manager created
configmap/grafana-dashboard-k8s-resources-cluster created
configmap/grafana-dashboard-k8s-resources-namespace created
configmap/grafana-dashboard-k8s-resources-node created
configmap/grafana-dashboard-k8s-resources-pod created
configmap/grafana-dashboard-k8s-resources-workload created
configmap/grafana-dashboard-k8s-resources-workloads-namespace created
configmap/grafana-dashboard-kubelet created
configmap/grafana-dashboard-namespace-by-pod created
configmap/grafana-dashboard-namespace-by-workload created
configmap/grafana-dashboard-node-cluster-rsrc-use created
configmap/grafana-dashboard-node-rsrc-use created
configmap/grafana-dashboard-nodes created
configmap/grafana-dashboard-persistentvolumesusage created
configmap/grafana-dashboard-pod-total created
configmap/grafana-dashboard-prometheus-remote-write created
configmap/grafana-dashboard-prometheus created
configmap/grafana-dashboard-proxy created
configmap/grafana-dashboard-scheduler created
configmap/grafana-dashboard-statefulset created
configmap/grafana-dashboard-workload-total created
configmap/grafana-dashboards created
deployment.apps/grafana created
service/grafana created
serviceaccount/grafana created
servicemonitor.monitoring.coreos.com/grafana created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
deployment.apps/kube-state-metrics created
service/kube-state-metrics created
serviceaccount/kube-state-metrics created
servicemonitor.monitoring.coreos.com/kube-state-metrics created
clusterrole.rbac.authorization.k8s.io/node-exporter created
clusterrolebinding.rbac.authorization.k8s.io/node-exporter created
daemonset.apps/node-exporter created
service/node-exporter created
serviceaccount/node-exporter created
servicemonitor.monitoring.coreos.com/node-exporter created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
clusterrole.rbac.authorization.k8s.io/prometheus-adapter created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-adapter created
clusterrolebinding.rbac.authorization.k8s.io/resource-metrics:system:auth-delegator created
clusterrole.rbac.authorization.k8s.io/resource-metrics-server-resources created
configmap/adapter-config created
deployment.apps/prometheus-adapter created
rolebinding.rbac.authorization.k8s.io/resource-metrics-auth-reader created
service/prometheus-adapter created
serviceaccount/prometheus-adapter created
clusterrole.rbac.authorization.k8s.io/prometheus-k8s created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-k8s created
servicemonitor.monitoring.coreos.com/prometheus-operator created
prometheus.monitoring.coreos.com/k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s-config created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s-config created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
prometheusrule.monitoring.coreos.com/prometheus-k8s-rules created
service/prometheus-k8s created
serviceaccount/prometheus-k8s created
servicemonitor.monitoring.coreos.com/prometheus created
servicemonitor.monitoring.coreos.com/kube-apiserver created
servicemonitor.monitoring.coreos.com/coredns created
servicemonitor.monitoring.coreos.com/kube-controller-manager created
servicemonitor.monitoring.coreos.com/kube-scheduler created
servicemonitor.monitoring.coreos.com/kubelet created
Wait until every pod in the monitoring namespace is Running:
shell
[root@k8s-master ~]# kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 0 157m
alertmanager-main-1 2/2 Running 0 157m
alertmanager-main-2 2/2 Running 0 157m
grafana-5c55845445-gh8g7 1/1 Running 0 20h
kube-state-metrics-75f946484-lqrbf 3/3 Running 0 20h
node-exporter-5h5cs 2/2 Running 0 20h
node-exporter-f28gj 2/2 Running 0 20h
node-exporter-w9rhr 2/2 Running 0 20h
prometheus-adapter-7d68d6f886-qwrfg 1/1 Running 0 20h
prometheus-k8s-0 3/3 Running 0 20h
prometheus-k8s-1 3/3 Running 0 20h
prometheus-operator-5cd4d464cc-b9vqq 2/2 Running 0 20h
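A listing like the one above can be filtered so that only the pods still needing attention are shown. This is a rough sketch that parses the default `kubectl get pod` columns (NAME, READY, STATUS); `not_ready_pods` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Read `kubectl get pod` output on stdin and print the names of pods that
# are not Running, or whose READY column (for example 2/3) is incomplete.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Usage against a live cluster:
#   kubectl get pod -n monitoring | not_ready_pods
```

An empty result means every pod is Running with all of its containers ready.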
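I ran into the following problems during the test deployment; the fixes are recorded below:
- All three alertmanager-main pods fail to start and sit in CrashLoopBackOff; one container in each pod cannot start.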
shell
Warning Unhealthy 11m (x5 over 11m) kubelet, k8s-node2 Liveness probe failed: Get http://10.244.2.8:9093/-/healthy: dial tcp 10.244.2.8:9093: connect: connection refused
Warning Unhealthy 10m (x10 over 11m) kubelet, k8s-node2 Readiness probe failed: Get http://10.244.2.8:9093/-/ready: dial tcp 10.244.2.8:9093: connect: connection refused
The fix below is adapted from: https://github.com/prometheus-operator/kube-prometheus/issues/653
shell
# Pause reconciliation: edit the resource below and add paused: true
kubectl -n monitoring edit alertmanagers.monitoring.coreos.com
...
spec:
  image: quay.io/prometheus/alertmanager:v0.23.0
  nodeSelector:
    kubernetes.io/os: linux
  paused: true
  podMetadata:
    labels:
...
[root@k8s-master ~]# kubectl -n monitoring get statefulset.apps/alertmanager-main -o yaml > dump.yaml
# Edit dump.yaml: add hostNetwork: true under spec.template.spec (around line 234 of the file)
[root@k8s-master manifests]# vi dump.yaml
...
    spec:
      hostNetwork: true # added
      containers:
      - args:
        - --config.file=/etc/alertmanager/config/alertmanager.yaml
...
# Remove the livenessProbe and readinessProbe blocks shown below
[root@k8s-master manifests]# vi dump.yaml
...
        livenessProbe:
          failureThreshold: 10
          httpGet:
            path: /-/healthy
            port: web
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 3
...
        readinessProbe:
          failureThreshold: 10
          httpGet:
            path: /-/ready
            port: web
            scheme: HTTP
          initialDelaySeconds: 3
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
...
# Delete the existing StatefulSet and recreate it
[root@k8s-master ~]# kubectl delete statefulset.apps/alertmanager-main -n monitoring
[root@k8s-master ~]# kubectl create -f dump.yaml
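The pause step can also be done without opening an editor. With `paused: true` set, the operator stops reconciling the Alertmanager, so the hand-edited StatefulSet is not overwritten. A sketch using a merge patch on the `main` resource created earlier; `set_alertmanager_paused` is a hypothetical wrapper name.

```shell
#!/usr/bin/env bash
# Set or clear spec.paused on the Alertmanager resource "main" with a
# merge patch, as a non-interactive alternative to `kubectl edit`.
set_alertmanager_paused() {
  local value=$1   # expected: true or false
  kubectl -n monitoring patch alertmanagers.monitoring.coreos.com main \
    --type merge -p "{\"spec\":{\"paused\":${value}}}"
}

# Usage: set_alertmanager_paused true
```

Calling it with `false` hands control back to the operator, which will then replace the manual StatefulSet changes with its own spec.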
- One of the alertmanager pods stays Pending because no node satisfies its scheduling requirements.
The fix is as follows:
shell
# Remove the taint from the master node
kubectl describe node k8s-master | grep Taints
kubectl taint nodes k8s-master node-role.kubernetes.io/master-
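Removing the master taint lets ordinary workloads schedule onto the master, which is acceptable on a test cluster like this one but worth a second thought elsewhere. The check and the removal can be combined so the step is safe to re-run. A sketch, with `remove_master_taint` as a hypothetical name; it reads the taint keys through a jsonpath query instead of grepping `kubectl describe`.

```shell
#!/usr/bin/env bash
# Remove the node-role.kubernetes.io/master taint from a node, but only
# if the node currently carries a taint with that key.
remove_master_taint() {
  local node=$1 key='node-role.kubernetes.io/master'
  if kubectl get node "$node" -o jsonpath='{.spec.taints[*].key}' \
      | tr ' ' '\n' | grep -qxF "$key"; then
    kubectl taint nodes "$node" "${key}-"
  else
    echo "node $node has no $key taint"
  fi
}

# Usage: remove_master_taint k8s-master
```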
6. Access prometheus, alertmanager, and grafana
- Access prometheus
Open http://192.168.0.51:30090 in a browser; 192.168.0.51 is the master node's IP.
- Access alertmanager
Open http://192.168.0.51:30093 in a browser.
- Access grafana
Open http://192.168.0.51:32000 in a browser.
Username/password: admin/admin
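Before opening a browser, the three endpoints can be probed from a shell. A sketch assuming the NodePorts from step 4 and the components' standard health paths (`/-/healthy` for Prometheus and Alertmanager, `/api/health` for Grafana); `health_urls` and `check_stack` are hypothetical names.

```shell
#!/usr/bin/env bash
# Print the health-check URL of each component for a given node IP,
# using the NodePorts configured in step 4.
health_urls() {
  local host=$1
  echo "http://${host}:30090/-/healthy"    # Prometheus
  echo "http://${host}:30093/-/healthy"    # Alertmanager
  echo "http://${host}:32000/api/health"   # Grafana
}

# Probe each URL; curl -f turns HTTP error statuses into failures.
check_stack() {
  local url
  for url in $(health_urls "$1"); do
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
    fi
  done
}

# Usage: check_stack 192.168.0.51
```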