Method 1
Modify the alerting rule
- alert: CPU usage above 88%
  expr: instance:node_cpu_utilization:ratio * 100 > 88
  for: 5m
  labels:
    severity: critical
    level: 3
    kind: CpuUsage
  annotations:
    summary: "CPU usage above 88%"
    description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}%"
The rules are distinguished by the kind label: rule 1 carries kind1 and rule 2 carries kind2.
Example alertmanager configuration
global:
  resolve_timeout: 5m
  smtp_from: from@email.com
  smtp_smarthost: smtp.net:port
  smtp_auth_username: from@email.com
  smtp_auth_password: PASS
  smtp_require_tls: false
route:
  receiver: 'email'
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  routes:
    - receiver: 'our'
      group_wait: 10s
      match_re:
        severity: warning
    - receiver: 'other'
      group_wait: 10s
      match_re:
        severity: busi
templates:
  - '*.html'
receivers:
  - name: 'email'
    email_configs:
      - to: 'xuxd@email.com'
        send_resolved: false
        html: '{{ template "default-monitor.html" . }}'
        headers: { Subject: "[WARN] Alert email" } # email subject
  - name: 'our'
    webhook_configs:
      - url: http://127.0.0.1:8060/dingtalk/our/send
  - name: 'other'
    webhook_configs:
      - url: http://127.0.0.1:8060/dingtalk/our/send
      - url: http://127.0.0.1:8060/dingtalk/other/send
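- route: besides the top-level default receiver "email", the routes below it define two specific receivers. "our" matches alerts whose severity is warning, and "other" matches alerts whose severity is busi. Both values are set as labels in the alerting rules shown at the top; they are not reserved keywords, just arbitrary user-defined markers.
- receivers: this section configures the receivers named above. "email" sets who the mail is sent to. "our" sets the dingtalk send URL; note the "our" segment right before send at the end of the URL. "other" lists two URLs that differ only in the segment before send: one is "our" and the other is "other".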
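For readers running a newer Alertmanager: `match` and `match_re` have been deprecated since Alertmanager 0.22 in favour of the `matchers` list. The two sub-routes described above could then be sketched roughly as follows, assuming the same receiver names and severity values. This is only an equivalent form under those assumptions; the configuration in these notes works as written on the older version.

```yaml
route:
  receiver: 'email'            # default receiver, unchanged
  group_by: ['alertname']
  routes:
    # severity=warning goes to the "our" receiver
    - receiver: 'our'
      group_wait: 10s
      matchers:
        - severity = "warning"
    # severity=busi goes to the "other" receiver
    - receiver: 'other'
      group_wait: 10s
      matchers:
        - severity = "busi"
```

Each entry in `matchers` is a single string of the form label, operator, value; use `=~` instead of `=` when a regular expression is really needed.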
prometheus-webhook-dingtalk configuration
## Customizable templates path
templates:
  - /home/user/monitor/alert/prometheus-webhook-dingtalk-1.4.0.linux-amd64/template/template.tmpl
## Targets, previously was known as "profiles"
targets:
  our:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxx
    secret: xxx_secret
  other:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxx_other
    secret: xxx_other_secret
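There are two entries under targets, "our" and "other". They correspond to the "our" and "other" segments in the URLs of the alertmanager configuration above.
With this setup, an alert from rule 1 is sent by the alertmanager receiver named other, which notifies both our DingTalk group and the business-side DingTalk group. An alert from rule 2 is sent through our, so it only reaches our DingTalk group.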
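The routes shown earlier match on severity, whereas the rules themselves are told apart by the kind label. A minimal sketch of routing directly on kind, assuming the label values kind1 and kind2 mentioned above and the existing receivers our and other, might look like this:

```yaml
route:
  receiver: 'email'            # fallback for alerts that match no sub-route
  group_by: ['alertname']
  routes:
    # rule 1 (kind1): "other" notifies our group and the business-side group
    - receiver: 'other'
      group_wait: 10s
      match:
        kind: kind1
    # rule 2 (kind2): "our" notifies our group only
    - receiver: 'our'
      group_wait: 10s
      match:
        kind: kind2
```

`match` does an exact comparison, which is enough when the label values are fixed strings; `match_re` is only needed when one route should cover several values.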
vmalert configuration file values.yaml
# Default values for victoria-metrics-alert.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name:
  # mount API token to pod directly
  automountToken: true
imagePullSecrets: []
rbac:
  create: true
  pspEnabled: true
  namespaced: false
  extraLabels: {}
  annotations: {}
server:
  name: server
  enabled: true
  image:
    repository: victoriametrics/vmalert
    tag: "" # rewrites Chart.AppVersion
    pullPolicy: IfNotPresent
  nameOverride: ""
  fullnameOverride: ""
  ## See `kubectl explain poddisruptionbudget.spec` for more
  ## ref: https://kubernetes.io/docs/tasks/run-application/configure-pdb/
  podDisruptionBudget:
    enabled: false
    # minAvailable: 1
    # maxUnavailable: 1
    labels: {}
  # -- Additional environment variables (ex.: secret tokens, flags) https://github.com/VictoriaMetrics/VictoriaMetrics#environment-variables
  env:
    []
    # - name: VM_remoteWrite_basicAuth_password
    #   valueFrom:
    #     secretKeyRef:
    #       name: auth_secret
    #       key: password
  replicaCount: 1
  # deployment strategy, set to standard k8s default
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  # specifies the minimum number of seconds for which a newly created Pod should be ready without any of its containers crashing/terminating
  # 0 is the standard k8s default
  minReadySeconds: 0
  # vmalert reads metrics from source, next section represents its configuration. It can be any service which supports
  # MetricsQL or PromQL.
  datasource:
    url: "http://192.168.47.9:8481/select/0/prometheus/"
    basicAuth:
      username: ""
      password: ""
  remote:
    write:
      url: ""
    read:
      url: ""
  notifier:
    alertmanager:
      url: "http://x.x.x.x:9093"
  extraArgs:
    envflag.enable: "true"
    envflag.prefix: VM_
    loggerFormat: json
  # Additional hostPath mounts
  extraHostPathMounts:
    []
    # - name: certs-dir
    #   mountPath: /etc/kubernetes/certs
    #   subPath: ""
    #   hostPath: /etc/kubernetes/certs
    #   readOnly: true
  # Extra Volumes for the pod
  extraVolumes:
    []
    # - name: example
    #   configMap:
    #     name: example
  # Extra Volume Mounts for the container
  extraVolumeMounts:
    []
    # - name: example
    #   mountPath: /example
  extraContainers:
    []
    # - name: config-reloader
    #   image: reloader-image
  service:
    annotations: {}
    labels: {}
    clusterIP: ""
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    externalIPs: []
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    servicePort: 8880
    type: ClusterIP
    # Ref: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#preserving-the-client-source-ip
    # externalTrafficPolicy: "local"
    # healthCheckNodePort: 0
  ingress:
    enabled: false
    annotations: {}
    #   kubernetes.io/ingress.class: nginx
    #   kubernetes.io/tls-acme: 'true'
    extraLabels: {}
    hosts: []
    #   - name: vmselect.local
    #     path: /select
    #     port: http
    tls: []
    #   - secretName: vmselect-ingress-tls
    #     hosts:
    #       - vmselect.local
    # For Kubernetes >= 1.18 you should specify the ingress-controller via the field ingressClassName
    # See https://kubernetes.io/blog/2020/04/02/improvements-to-the-ingress-api-in-kubernetes-1.18/#specifying-the-class-of-an-ingress
    # ingressClassName: nginx
    # -- pathType is only for k8s >= 1.18
    pathType: Prefix
  podSecurityContext: {}
  # fsGroup: 2000
  securityContext:
    {}
    # capabilities:
    #   drop:
    #     - ALL
    # readOnlyRootFilesystem: true
    # runAsNonRoot: true
    # runAsUser: 1000
  resources:
    {}
    # We usually recommend not to specify default resources and to leave this as a conscious
    # choice for the user. This also increases chances charts run on environments with little
    # resources, such as Minikube. If you do want to specify resources, uncomment the following
    # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
    # limits:
    #   cpu: 100m
    #   memory: 128Mi
    # requests:
    #   cpu: 100m
    #   memory: 128Mi
  # Annotations to be added to the deployment
  annotations: {}
  # labels to be added to the deployment
  labels: {}
  # Annotations to be added to pod
  podAnnotations: {}
  podLabels: {}
  nodeSelector: {}
  priorityClassName: ""
  tolerations: []
  affinity: {}
  # vmalert alert rules configuration:
  # use existing configmap if specified
  # otherwise .config values will be used
  configMap: ""
  config:
    alerts:
      groups:
        - name: Disk mount error
          rules:
            - alert: Disk mount error
              annotations:
                description: 'Disk mount error on node {{$labels.instance}} of chain {{$labels.job}}'
              expr: mount_error{job=~"dev|sit"} == 1
              for: 1m
              labels:
                severity: critical
                kind: kind1
        - name: Process not found
          rules:
            - alert: Process not found
              annotations:
                description: 'Process not found on {{$labels.instance}} of chain {{$labels.job}}'
              expr: process_total_error{job=~"dev|sit"} == 1
              for: 1m
              labels:
                severity: critical
                kind: kind2
serviceMonitor:
  enabled: false
  extraLabels: {}
  annotations: {}
  # interval: 15s
  # scrapeTimeout: 5s
  # -- Commented. HTTP scheme to use for scraping.
  # scheme: https
  # -- Commented. TLS configuration to use when scraping the endpoint
  # tlsConfig:
  #   insecureSkipVerify: true
alertmanager:
  enabled: true
  replicaCount: 1
  podMetadata:
    labels: {}
    annotations: {}
  image: prom/alertmanager
  tag: v0.20.0
  retention: 120h
  nodeSelector: {}
  priorityClassName: ""
  resources: {}
  tolerations: []
  imagePullSecrets: []
  podSecurityContext: {}
  extraArgs: {}
  # key: value
  # external URL, that alertmanager will expose to receivers
  baseURL: ""
  # use existing configmap if specified
  # otherwise .config values will be used
  configMap: ""
  config:
    global:
      resolve_timeout: 5m
    route:
      # default receiver
      receiver: aldaba
      # tag to group by
      group_by: [alertname]
      # How long to initially wait to send a notification for a group of alerts
      group_wait: 30s
      # How long to wait before sending a notification about new alerts that are added to a group
      group_interval: 60s
      # How long to wait before sending a notification again if it has already been sent successfully for an alert
      repeat_interval: 1h
      routes:
        - receiver: 'mychain'
          group_wait: 10s
          match_re:
            kind: mychain
    receivers:
      - name: aldaba
        webhook_configs:
          - url: http://192.168.208.133:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=72a3a55795094a6878c2c2443a81a3545add1f688ddee18701c0dd753dbb3b2a&split=false
            send_resolved: true
      - name: mychain
        webhook_configs:
          - url: http://192.168.208.133:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=307270fdcd1bb0c4b0533e29005cca7cb353c27d7f988fdff0ec00e6affc6e83&split=false
            send_resolved: true
    inhibit_rules:
      - source_match:
          #severity: 'warning'
        target_match:
          #severity: 'warning'
        #equal: ['alertname', 'job']
  templates: {}
  # alertmanager.tmpl: |-
  service:
    annotations: {}
    type: ClusterIP
    port: 9093
    # if you want to force a specific nodePort. Must be used with service.type=NodePort
    # nodePort:
  ingress:
    enabled: false
    annotations:
    #   nginx.ingress.kubernetes.io/auth-realm: Authentication Required
    #   nginx.ingress.kubernetes.io/auth-secret: victoria-metrics/basic-auth
    #   nginx.ingress.kubernetes.io/auth-type: basic
    #   kubernetes.io/ingress.class: nginx
    #   kubernetes.io/tls-acme: 'true'
    extraLabels: {}
    hosts: {}
    #   - name: wangjuan.test.com
    #     path: /
    #     port: web
    tls: []
    #   - secretName: alertmanager-ingress-tls
    #     hosts:
    #       - alertmanager.local
    # For Kubernetes >= 1.18 you should specify the ingress-controller via the field ingressClassName
    # See https://kubernetes.io/blog/2020/04/02/improvements-to-the-ingress-api-in-kubernetes-1.18/#specifying-the-class-of-an-ingress
    # ingressClassName: nginx
    # -- pathType is only for k8s >= 1.18
    pathType: Prefix
  persistentVolume:
    # -- Create/use Persistent Volume Claim for alertmanager component. Empty dir if false
    enabled: false
    # -- Array of access modes. Must match those of existing PV or dynamic provisioner. Ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
    accessModes:
      - ReadWriteOnce
    # -- Persistent volume annotations
    annotations: {}
    # -- StorageClass to use for persistent volume. Requires alertmanager.persistentVolume.enabled: true. If defined, PVC created automatically
    storageClass: ""
    # -- Existing Claim name. If defined, PVC must be created manually before volume will be bound
    existingClaim: ""
    # -- Mount path. Alertmanager data Persistent Volume mount root path.
    mountPath: /data
    # -- Mount subpath
    subPath: ""
    # -- Size of the volume. Better to set the same as resource limit memory property.
    size: 50Mi
Method 2
Filter by job
alertmanager configuration
apiVersion: v1
data:
  alertmanager.yaml: |-
    global:
      resolve_timeout: 5m
    inhibit_rules:
      - equal:
          - alertname
          - job
        source_match:
          severity: warning
        target_match:
          severity: warning
    receivers:
      - name: nft
        webhook_configs:
          - send_resolved: false
            url: http://x.x.x.x:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxxx&split=false
      - name: poap
        webhook_configs:
          - send_resolved: false
            url: http://x.x.x.x:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxxxx&split=false
      - name: ipforce
        webhook_configs:
          - send_resolved: false
            url: http://x.x.x.x:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxxxxxx&split=false
    route:
      group_by:
        - alertname
      group_interval: 60s
      group_wait: 30s
      receiver: nft
      repeat_interval: 1h
      routes:
        - group_wait: 10s
          match_re:
            job: test_poap
          receiver: poap
        - group_wait: 10s
          match_re:
            job: test_ipforce
          receiver: ipforce
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: vmalert
    meta.helm.sh/release-namespace: victoria-metrics
  creationTimestamp: '2022-04-06T07:31:38Z'
  labels:
    app: alertmanager
    app.kubernetes.io/instance: vmalert
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: victoria-metrics-alert
    helm.sh/chart: victoria-metrics-alert-0.4.33
  managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:data': {}
        'f:metadata':
          'f:annotations':
            .: {}
            'f:meta.helm.sh/release-name': {}
            'f:meta.helm.sh/release-namespace': {}
          'f:labels':
            .: {}
            'f:app': {}
            'f:app.kubernetes.io/instance': {}
            'f:app.kubernetes.io/managed-by': {}
            'f:app.kubernetes.io/name': {}
            'f:helm.sh/chart': {}
      manager: helm
      operation: Update
      time: '2022-04-06T07:31:38Z'
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:data':
          'f:alertmanager.yaml': {}
      manager: ACK-Console Apache-HttpClient
      operation: Update
      time: '2023-01-05T07:52:13Z'
  name: vmalert-alertmanager-alertmanager-config
  namespace: victoria-metrics
  resourceVersion: '80954053'
  uid: 653e4633-86e5-41ce-9a17-301f75224e9c
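The job-based routes in this ConfigMap only take effect when the alert itself carries a job label with one of those values. A hypothetical vmalert rule group that would end up at the poap receiver could be sketched as below. The group and alert names are made up for illustration, the metric name is borrowed from the rules in the first method, and the job value is taken from the route; all of these are assumptions about the environment.

```yaml
groups:
  - name: process-missing            # hypothetical group name
    rules:
      - alert: ProcessMissing        # hypothetical alert name
        # The labels of the series returned by the expression, including job,
        # are attached to the alert; the route's match_re on job inspects them.
        expr: process_total_error{job="test_poap"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          description: 'process missing on {{ $labels.instance }} ({{ $labels.job }})'
```

Alerts whose job matches neither test_poap nor test_ipforce fall through to the default receiver nft.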