PrometheusAlert: routing alerts to different DingTalk groups

Method 1

Modify the alert rules

- alert: CpuUsageAbove88Percent
  expr: instance:node_cpu_utilization:ratio * 100 > 88
  for: 5m
  labels:
    severity: critical
    level: "3"
    kind: CpuUsage
  annotations:
    summary: "CPU usage above 88%"
    description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}"

Alerts are distinguished by the kind label: rule one carries kind1 and rule two carries kind2, as in the sketch below.
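A minimal sketch of two such rules, differing only in their kind label (the second expression uses a hypothetical recording rule, named here only for illustration):

- alert: RuleOne
  expr: instance:node_cpu_utilization:ratio * 100 > 88
  for: 5m
  labels:
    severity: critical
    kind: kind1   # routing matches on this label
- alert: RuleTwo
  expr: instance:node_memory_utilization:ratio * 100 > 90   # hypothetical recording rule
  for: 5m
  labels:
    severity: critical
    kind: kind2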

Alertmanager configuration example

global:
  resolve_timeout: 5m
  smtp_from: from@email.com
  smtp_smarthost: smtp.net:port
  smtp_auth_username: from@email.com
  smtp_auth_password: PASS
  smtp_require_tls: false
route:
  receiver: 'email'
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  routes:
  - receiver: 'our'
    group_wait: 10s
    match_re:
       severity: warning
  - receiver: 'other'
    group_wait: 10s
    match_re:
       severity: busi

templates:
  - '*.html'
receivers:
- name: 'email'
  email_configs:
  - to: 'xuxd@email.com'
    send_resolved: false
    html: '{{ template "default-monitor.html" . }}'
    headers: { Subject: "[WARN] Alert email" } # email subject
- name: 'our'
  webhook_configs:
  - url: http://127.0.0.1:8060/dingtalk/our/send
- name: 'other'
  webhook_configs:
  - url: http://127.0.0.1:8060/dingtalk/our/send
  - url: http://127.0.0.1:8060/dingtalk/other/send
  • route: besides email, the globally configured default receiver, routes defines two dedicated receivers: one named "our", which matches the warning level, and another named "other", which matches the busi level. Both levels are defined in the rules at the beginning; they are not reserved keywords, just markers you make up yourself.
  • receivers: this is where the receivers defined above are configured. email specifies who the alert mail goes to; "our" specifies the DingTalk webhook URL, and note the end of that URI, where "our" precedes send; "other" specifies two URLs, which differ only in the segment before send at the end of the URL: one is "our", the other is "other".
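To check where a given label set would be routed without firing anything, amtool (shipped with Alertmanager) can evaluate the routing tree against the config file; the config path below is an assumption:

amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml severity=busi
# prints the matched receiver, here: other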

prometheus-webhook-dingtalk configuration

## Customizable templates path
templates:
   - /home/user/monitor/alert/prometheus-webhook-dingtalk-1.4.0.linux-amd64/template/template.tmpl

## Targets, previously was known as "profiles"
targets:
  our:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxx
    secret: xxx_secret
  other:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxx_other
    secret: xxx_other_secret

There are two entries under targets, "our" and "other"; they correspond to the "our" and "other" segments in the URLs of the Alertmanager configuration above.
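For reference, a typical way to launch prometheus-webhook-dingtalk against this file (the config file name is an assumption; port 8060 is the default listen address and matches the Alertmanager URLs above):

./prometheus-webhook-dingtalk --config.file=config.yml --web.listen-address=:8060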

With this configuration, when rule one fires, the Alertmanager receiver named other sends the notification, delivering it to both our own DingTalk group and the business-side DingTalk group. When rule two fires, the notification goes through our and reaches only our own group.
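To smoke-test one target without waiting for a rule to fire, you can POST a minimal Alertmanager-style payload to the webhook endpoint; the payload below is a hand-written approximation of the webhook format, not a capture:

curl -H 'Content-Type: application/json' \
  -d '{"version":"4","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"smoke_test"},"annotations":{"summary":"smoke test"}}]}' \
  http://127.0.0.1:8060/dingtalk/our/send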

vmalert configuration file values.yaml

# Default values for victoria-metrics-alert.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name:
  # mount API token to pod directly
  automountToken: true

imagePullSecrets: []

rbac:
  create: true
  pspEnabled: true
  namespaced: false
  extraLabels: {}
  annotations: {}

server:
  name: server
  enabled: true
  image:
    repository: victoriametrics/vmalert
    tag: "" # rewrites Chart.AppVersion
    pullPolicy: IfNotPresent
  nameOverride: ""
  fullnameOverride: ""

  ## See `kubectl explain poddisruptionbudget.spec` for more
  ## ref: https://kubernetes.io/docs/tasks/run-application/configure-pdb/
  podDisruptionBudget:
    enabled: false
    # minAvailable: 1
    # maxUnavailable: 1
    labels: {}

  # -- Additional environment variables (ex.: secret tokens, flags) https://github.com/VictoriaMetrics/VictoriaMetrics#environment-variables
  env:
    []
    # - name: VM_remoteWrite_basicAuth_password
    #   valueFrom:
    #     secretKeyRef:
    #       name: auth_secret
    #       key: password

  replicaCount: 1

  # deployment strategy, set to standard k8s default
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%

  # specifies the minimum number of seconds for which a newly created Pod should be ready without any of its containers crashing/terminating
  # 0 is the standard k8s default
  minReadySeconds: 0

  # vmalert reads metrics from source, next section represents its configuration. It can be any service which supports
  # MetricsQL or PromQL.
  datasource:
    url: "http://192.168.47.9:8481/select/0/prometheus/"
    basicAuth:
      username: ""
      password: ""

  remote:
    write:
      url: ""
    read:
      url: ""

  notifier:
    alertmanager:
      url: "http://x.x.x.x:9093"

  extraArgs:
    envflag.enable: "true"
    envflag.prefix: VM_
    loggerFormat: json

  # Additional hostPath mounts
  extraHostPathMounts:
    []
    # - name: certs-dir
    #   mountPath: /etc/kubernetes/certs
    #   subPath: ""
    #   hostPath: /etc/kubernetes/certs
  #   readOnly: true

  # Extra Volumes for the pod
  extraVolumes:
    []
     #- name: example
     #  configMap:
     #    name: example

  # Extra Volume Mounts for the container
  extraVolumeMounts:
    []
    # - name: example
    #   mountPath: /example

  extraContainers:
    []
    #- name: config-reloader
    #  image: reloader-image

  service:
    annotations: {}
    labels: {}
    clusterIP: ""
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    externalIPs: []
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    servicePort: 8880
    type: ClusterIP
    # Ref: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#preserving-the-client-source-ip
    # externalTrafficPolicy: "local"
    # healthCheckNodePort: 0

  ingress:
    enabled: false
    annotations: {}
    #   kubernetes.io/ingress.class: nginx
    #   kubernetes.io/tls-acme: 'true'

    extraLabels: {}
    hosts: []
    #   - name: vmselect.local
    #     path: /select
    #     port: http

    tls: []
    #   - secretName: vmselect-ingress-tls
    #     hosts:
    #       - vmselect.local

    # For Kubernetes >= 1.18 you should specify the ingress-controller via the field ingressClassName
    # See https://kubernetes.io/blog/2020/04/02/improvements-to-the-ingress-api-in-kubernetes-1.18/#specifying-the-class-of-an-ingress
    # ingressClassName: nginx
    # -- pathType is only for k8s >= 1.18
    pathType: Prefix

  podSecurityContext: {}
  # fsGroup: 2000

  securityContext:
    {}
    # capabilities:
    #   drop:
    #   - ALL
    # readOnlyRootFilesystem: true
    # runAsNonRoot: true
  # runAsUser: 1000

  resources:
    {}
    # We usually recommend not to specify default resources and to leave this as a conscious
    # choice for the user. This also increases chances charts run on environments with little
    # resources, such as Minikube. If you do want to specify resources, uncomment the following
    # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
    # limits:
    #   cpu: 100m
    #   memory: 128Mi
    # requests:
    #   cpu: 100m
  #   memory: 128Mi

  # Annotations to be added to the deployment
  annotations: {}
  # labels to be added to the deployment
  labels: {}

  # Annotations to be added to pod
  podAnnotations: {}

  podLabels: {}

  nodeSelector: {}

  priorityClassName: ""

  tolerations: []

  affinity: {}

  # vmalert alert rules configuration configuration:
  # use existing configmap if specified
  # otherwise .config values will be used
  configMap: ""
  config:
    alerts:
      groups:
          - name: disk-mount-error
            rules:
            - alert: DiskMountError
              annotations:
                description: 'disk mount error on node {{$labels.instance}} of chain {{$labels.job}}'
              expr: mount_error{job=~"dev|sit"} == 1
              for: 1m
              labels:
                severity: critical
                kind: kind1
          - name: process-missing
            rules:
            - alert: ProcessMissing
              annotations:
                description: 'process missing on {{$labels.instance}} of chain {{$labels.job}}'
              expr: process_total_error{job=~"dev|sit"} == 1
              for: 1m
              labels:
                severity: critical
                kind: kind2

serviceMonitor:
  enabled: false
  extraLabels: {}
  annotations: {}
#    interval: 15s
#    scrapeTimeout: 5s
  # -- Commented. HTTP scheme to use for scraping.
#    scheme: https
  # -- Commented. TLS configuration to use when scraping the endpoint
#    tlsConfig:
#      insecureSkipVerify: true

alertmanager:
  enabled: true
  replicaCount: 1
  podMetadata:
    labels: {}
    annotations: {}
  image: prom/alertmanager
  tag: v0.20.0
  retention: 120h
  nodeSelector: {}
  priorityClassName: ""
  resources: {}
  tolerations: []
  imagePullSecrets: []
  podSecurityContext: {}
  extraArgs: {}
  # key: value

  # external URL, that alertmanager will expose to receivers
  baseURL: ""
  # use existing configmap if specified
  # otherwise .config values will be used
  configMap: ""
  config:
    global:
      resolve_timeout: 5m
    route:
      # default receiver
      receiver: aldaba
      # tag to group by
      group_by: [alertname]
      # How long to initially wait to send a notification for a group of alerts
      group_wait: 30s
      # How long to wait before sending a notification about new alerts that are added to a group
      group_interval: 60s
      # How long to wait before sending a notification again if it has already been sent successfully for an alert
      repeat_interval: 1h
      routes:
      - receiver: 'mychain'
        group_wait: 10s
        match_re:
          kind: mychain
    receivers:
      - name: aldaba
        webhook_configs:
        - url: http://192.168.208.133:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=72a3a55795094a6878c2c2443a81a3545add1f688ddee18701c0dd753dbb3b2a&split=false
          send_resolved: true
      - name: mychain
        webhook_configs:
        - url: http://192.168.208.133:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=307270fdcd1bb0c4b0533e29005cca7cb353c27d7f988fdff0ec00e6affc6e83&split=false
          send_resolved: true
    inhibit_rules: []
    # an inhibit rule with no matchers at all would suppress every other alert,
    # so the example is left fully commented out:
    # - source_match:
    #     severity: 'warning'
    #   target_match:
    #     severity: 'warning'
    #   equal: ['alertname', 'job']

  templates: {}
  #  alertmanager.tmpl: |-
  service:
    annotations: {}
    type: ClusterIP
    port: 9093
    # if you want to force a specific nodePort. Must be use with service.type=NodePort
    # nodePort:
  ingress:
    enabled: false
    annotations: {}
    #   nginx.ingress.kubernetes.io/auth-realm: Authentication Required
    #   nginx.ingress.kubernetes.io/auth-secret: victoria-metrics/basic-auth
    #   nginx.ingress.kubernetes.io/auth-type: basic
    #   kubernetes.io/ingress.class: nginx
    #   kubernetes.io/tls-acme: 'true'
    extraLabels: {}
    hosts: []
    #   - name: wangjuan.test.com
    #     path: /
    #     port: web

    tls: []
    #   - secretName: alertmanager-ingress-tls
    #     hosts:
    #       - alertmanager.local

    # For Kubernetes >= 1.18 you should specify the ingress-controller via the field ingressClassName
    # See https://kubernetes.io/blog/2020/04/02/improvements-to-the-ingress-api-in-kubernetes-1.18/#specifying-the-class-of-an-ingress
    # ingressClassName: nginx
    # -- pathType is only for k8s >= 1.18
    pathType: Prefix
  persistentVolume:
    # -- Create/use Persistent Volume Claim for alertmanager component. Empty dir if false
    enabled: false
    # -- Array of access modes. Must match those of existing PV or dynamic provisioner. Ref: [http://kubernetes.io/docs/user-guide/persistent-volumes/](http://kubernetes.io/docs/user-guide/persistent-volumes/)
    accessModes:
      - ReadWriteOnce
    # -- Persistent volume annotations
    annotations: {}
    # -- StorageClass to use for persistent volume. Requires alertmanager.persistentVolume.enabled: true. If defined, PVC created automatically
    storageClass: ""
    # -- Existing Claim name. If defined, PVC must be created manually before volume will be bound
    existingClaim: ""
    # -- Mount path. Alertmanager data Persistent Volume mount root path.
    mountPath: /data
    # -- Mount subpath
    subPath: ""
    # -- Size of the volume. Better to set the same as resource limit memory property.
    size: 50Mi
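With values.yaml prepared, the chart can be installed or upgraded. The vm repo alias below is the conventional one for the VictoriaMetrics chart repository; the release name and namespace match the ConfigMap metadata shown under method 2:

helm repo add vm https://victoriametrics.github.io/helm-charts/
helm upgrade --install vmalert vm/victoria-metrics-alert -f values.yaml -n victoria-metrics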

Method 2

Filter by the job label.
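The job label that the routes below match on originates in the scrape configuration of Prometheus or vmagent; a minimal sketch using the job names from this example (the targets are placeholders):

scrape_configs:
  - job_name: test_poap
    static_configs:
      - targets: ['x.x.x.x:9100']
  - job_name: test_ipforce
    static_configs:
      - targets: ['x.x.x.x:9100']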

Alertmanager configuration

apiVersion: v1
data:
  alertmanager.yaml: |-
    global:
      resolve_timeout: 5m
    inhibit_rules:
    - equal:
      - alertname
      - job
      source_match:
        severity: warning
      target_match:
        severity: warning
    receivers:
    - name: nft
      webhook_configs:
      - send_resolved: false
        url: http://x.x.x.x:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxxx&split=false
    - name: poap
      webhook_configs:
      - send_resolved: false
        url: http://x.x.x.x:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxxxx&split=false
    - name: ipforce
      webhook_configs:
      - send_resolved: false
        url: http://x.x.x.x:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxxxxxx&split=false
    route:
      group_by:
      - alertname
      group_interval: 60s
      group_wait: 30s
      receiver: nft
      repeat_interval: 1h
      routes:
      - group_wait: 10s
        match_re:
          job: test_poap
        receiver: poap
      - group_wait: 10s
        match_re:
          job: test_ipforce
        receiver: ipforce
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: vmalert
    meta.helm.sh/release-namespace: victoria-metrics
  labels:
    app: alertmanager
    app.kubernetes.io/instance: vmalert
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: victoria-metrics-alert
    helm.sh/chart: victoria-metrics-alert-0.4.33
  name: vmalert-alertmanager-alertmanager-config
  namespace: victoria-metrics
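
After editing the ConfigMap, apply it and have Alertmanager reload its configuration; the Service name below is an assumption derived from the Helm release labels above:

kubectl -n victoria-metrics apply -f alertmanager-configmap.yaml
kubectl -n victoria-metrics port-forward svc/vmalert-alertmanager 9093:9093 &
curl -XPOST http://127.0.0.1:9093/-/reload   # Alertmanager also reloads on SIGHUP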