方法一

修改告警规则

复制代码

- alert: cpu使用率大于88%
    expr: instance:node_cpu_utilization:ratio * 100 > 88
    for: 5m
    labels:
      severity: critical
      level: 3
      kind: CpuUsage
    annotations:
      summary: "cpu使用率大于85%"
      description: "主机 {{ $labels.hostname }} 的cpu使用率为 {{ $value | humanize }}"

根据Kind区分，规则一kind1，规则二是kind2。

alertmanager配置示例

复制代码

global:
  resolve_timeout: 5m
  smtp_from: from@email.com
  smtp_smarthost: smtp.net:port
  smtp_auth_username: from@email.com
  smtp_auth_password: PASS
  smtp_require_tls: false
route:
  receiver: 'email'
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  routes:
  - receiver: 'our'
    group_wait: 10s
    match_re:
       severity: warning
  - receiver: 'other'
    group_wait: 10s
    match_re:
       severity: busi

templates:
  - '*.html'
receivers:
- name: 'email'
  email_configs:
  - to: 'xuxd@email.com'
    send_resolved: false
    html: '{{ template "default-monitor.html" . }}'
    headers: { Subject: "[WARN] 报警邮件" } #邮件主题
- name: 'our'
  webhook_configs:
  - url: http://127.0.0.1:8060/dingtalk/our/send
- name: 'other'
  webhook_configs:
  - url: http://127.0.0.1:8060/dingtalk/other/send

route:除了email这个全局配置的接收者外，下面的routes指定了两个特定的接收者，一个接收者叫"our"，匹配warning级别的；另一个叫"other"，匹配busi级别的，这两个级别在最前面的规则里定义，不是什么特定关键字，就是自己随便定义的一个标记
receivers:这里指定了上面定义的接收者的配置，email指定邮件发给谁；"our"指定dingtalk的发送url，注意这个uri的末尾，send前用的"our";"other"下面指定了两个url，区别就是url末尾的send前面，一个是"our"，另一个是"other"

prometheus-webhook-dingtalk配置

复制代码

## Customizable templates path
templates:
   - /home/user/monitor/alert/prometheus-webhook-dingtalk-1.4.0.linux-amd64/template/template.tmpl

## Targets, previously was known as "profiles"
targets:
  our:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxx
    secret: xxx_secret
  other:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxx_other
    secret: xxx_other_secret

targets下有两个，分别是"our"和"other",这里对应上面alertmanager配置的url里的"our"和"other。

这样配置，如果规则一告警，就是alertmanager的name为other的receiver来发送告警通知，发送到我们的钉钉群和业务侧钉钉群。如果是规则二告警，通过our发送，便只发送到我们的钉钉群。

vmalert配置文件value.yaml

复制代码

# Default values for victoria-metrics-alert.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name:
  # mount API token to pod directly
  automountToken: true

imagePullSecrets: []

rbac:
  create: true
  pspEnabled: true
  namespaced: false
  extraLabels: {}
  annotations: {}

server:
  name: server
  enabled: true
  image:
    repository: victoriametrics/vmalert
    tag: "" # rewrites Chart.AppVersion
    pullPolicy: IfNotPresent
  nameOverride: ""
  fullnameOverride: ""

  ## See `kubectl explain poddisruptionbudget.spec` for more
  ## ref: https://kubernetes.io/docs/tasks/run-application/configure-pdb/
  podDisruptionBudget:
    enabled: false
    # minAvailable: 1
    # maxUnavailable: 1
    labels: {}

  # -- Additional environment variables (ex.: secret tokens, flags) https://github.com/VictoriaMetrics/VictoriaMetrics#environment-variables
  env:
    []
    # - name: VM_remoteWrite_basicAuth_password
    #   valueFrom:
    #     secretKeyRef:
    #       name: auth_secret
    #       key: password

  replicaCount: 1

  # deployment strategy, set to standard k8s default
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%

  # specifies the minimum number of seconds for which a newly created Pod should be ready without any of its containers crashing/terminating
  # 0 is the standard k8s default
  minReadySeconds: 0

  # vmalert reads metrics from source, next section represents its configuration. It can be any service which supports
  # MetricsQL or PromQL.
  datasource:
    url: "http://192.168.47.9:8481/select/0/prometheus/"
    basicAuth:
      username: ""
      password: ""

  remote:
    write:
      url: ""
    read:
      url: ""

  notifier:
    alertmanager:
      url: "http://x.x.x.x:9093"

  extraArgs:
    envflag.enable: "true"
    envflag.prefix: VM_
    loggerFormat: json

  # Additional hostPath mounts
  extraHostPathMounts:
    []
    # - name: certs-dir
    #   mountPath: /etc/kubernetes/certs
    #   subPath: ""
    #   hostPath: /etc/kubernetes/certs
  #   readOnly: true

  # Extra Volumes for the pod
  extraVolumes:
    []
     #- name: example
     #  configMap:
     #    name: example

  # Extra Volume Mounts for the container
  extraVolumeMounts:
    []
    # - name: example
    #   mountPath: /example

  extraContainers:
    []
    #- name: config-reloader
    #  image: reloader-image

  service:
    annotations: {}
    labels: {}
    clusterIP: ""
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    externalIPs: []
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    servicePort: 8880
    type: ClusterIP
    # Ref: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#preserving-the-client-source-ip
    # externalTrafficPolicy: "local"
    # healthCheckNodePort: 0

  ingress:
    enabled: false
    annotations: {}
    #   kubernetes.io/ingress.class: nginx
    #   kubernetes.io/tls-acme: 'true'

    extraLabels: {}
    hosts: []
    #   - name: vmselect.local
    #     path: /select
    #     port: http

    tls: []
    #   - secretName: vmselect-ingress-tls
    #     hosts:
    #       - vmselect.local

    # For Kubernetes >= 1.18 you should specify the ingress-controller via the field ingressClassName
    # See https://kubernetes.io/blog/2020/04/02/improvements-to-the-ingress-api-in-kubernetes-1.18/#specifying-the-class-of-an-ingress
    # ingressClassName: nginx
    # -- pathType is only for k8s >= 1.1=
    pathType: Prefix

  podSecurityContext: {}
  # fsGroup: 2000

  securityContext:
    {}
    # capabilities:
    #   drop:
    #   - ALL
    # readOnlyRootFilesystem: true
    # runAsNonRoot: true
  # runAsUser: 1000

  resources:
    {}
    # We usually recommend not to specify default resources and to leave this as a conscious
    # choice for the user. This also increases chances charts run on environments with little
    # resources, such as Minikube. If you do want to specify resources, uncomment the following
    # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
    # limits:
    #   cpu: 100m
    #   memory: 128Mi
    # requests:
    #   cpu: 100m
  #   memory: 128Mi

  # Annotations to be added to the deployment
  annotations: {}
  # labels to be added to the deployment
  labels: {}

  # Annotations to be added to pod
  podAnnotations: {}

  podLabels: {}

  nodeSelector: {}

  priorityClassName: ""

  tolerations: []

  affinity: {}

  # vmalert alert rules configuration configuration:
  # use existing configmap if specified
  # otherwise .config values will be used
  configMap: ""
  config:
    alerts:
      groups:
          - name: 磁盘挂载错误
            rules:
            - alert: 磁盘挂载错误
              annotations:
                description: '{{$labels.job}}链{{$labels.instance}}节点磁盘挂载错误'
              expr: mount_error{job=~"dev|sit"} == 1
              for: 1m
              labels:
                severity: critical
                kind: kind1
          - name: 进程不存在
            rules:
            - alert: 进程不存在
              annotations:
                description: '{{$labels.job}}链{{$labels.instance}}进程不存在'
              expr: process_total_error{job=~"dev|sit"} == 1
              for: 1m
              labels:
                severity: critical
                kind: kind2

serviceMonitor:
  enabled: false
  extraLabels: {}
  annotations: {}
#    interval: 15s
#    scrapeTimeout: 5s
  # -- Commented. HTTP scheme to use for scraping.
#    scheme: https
  # -- Commented. TLS configuration to use when scraping the endpoint
#    tlsConfig:
#      insecureSkipVerify: true

alertmanager:
  enabled: true
  replicaCount: 1
  podMetadata:
    labels: {}
    annotations: {}
  image: prom/alertmanager
  tag: v0.20.0
  retention: 120h
  nodeSelector: {}
  priorityClassName: ""
  resources: {}
  tolerations: []
  imagePullSecrets: []
  podSecurityContext: {}
  extraArgs: {}
  # key: value

  # external URL, that alertmanager will expose to receivers
  baseURL: ""
  # use existing configmap if specified
  # otherwise .config values will be used
  configMap: ""
  config:
    global:
      resolve_timeout: 5m
    route:
      # default receiver
      receiver: aldaba
      # tag to group by
      group_by: [alertname]
      # How long to initially wait to send a notification for a group of alerts
      group_wait: 30s
      # How long to wait before sending a notification about new alerts that are added to a group
      group_interval: 60s
      # How long to wait before sending a notification again if it has already been sent successfully for an alert
      repeat_interval: 1h
      routes:
      - receiver: 'mychain'
        group_wait: 10s
        match_re:
          kind: mychain
    receivers:
      - name: aldaba
        webhook_configs:
        - url: http://192.168.208.133:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=72a3a55795094a6878c2c2443a81a3545add1f688ddee18701c0dd753dbb3b2a&split=false
          send_resolved: true
      - name: mychain
        webhook_configs:
        - url: http://192.168.208.133:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=307270fdcd1bb0c4b0533e29005cca7cb353c27d7f988fdff0ec00e6affc6e83&split=false
          send_resolved: true
    inhibit_rules:
      - source_match:
          #severity: 'warning'
        target_match:
          #severity: 'warning'
        #equal: ['alertname', 'job']

  templates: {}
  #  alertmanager.tmpl: |-
  service:
    annotations: {}
    type: ClusterIP
    port: 9093
    # if you want to force a specific nodePort. Must be use with service.type=NodePort
    # nodePort:
  ingress:
    enabled: false
    annotations:
            #  nginx.ingress.kubernetes.io/auth-realm: Authentication Required
            #  nginx.ingress.kubernetes.io/auth-secret: victoria-metrics/basic-auth
            #  nginx.ingress.kubernetes.io/auth-type: basic
    #   kubernetes.io/ingress.class: nginx
    #   kubernetes.io/tls-acme: 'true'
    extraLabels: {}
    hosts: {}
    #   - name: wangjuan.test.com
    #    path: /
    #     port: web

    tls: []
    #   - secretName: alertmanager-ingress-tls
    #     hosts:
    #       - alertmanager.local

    # For Kubernetes >= 1.18 you should specify the ingress-controller via the field ingressClassName
    # See https://kubernetes.io/blog/2020/04/02/improvements-to-the-ingress-api-in-kubernetes-1.18/#specifying-the-class-of-an-ingress
    # ingressClassName: nginx
    # -- pathType is only for k8s >= 1.1=
    pathType: Prefix
  persistentVolume:
    # -- Create/use Persistent Volume Claim for alertmanager component. Empty dir if false
    enabled: false
    # -- Array of access modes. Must match those of existing PV or dynamic provisioner. Ref: [http://kubernetes.io/docs/user-guide/persistent-volumes/](http://kubernetes.io/docs/user-guide/persistent-volumes/)
    accessModes:
      - ReadWriteOnce
    # -- Persistant volume annotations
    annotations: {}
    # -- StorageClass to use for persistent volume. Requires alertmanager.persistentVolume.enabled: true. If defined, PVC created automatically
    storageClass: ""
    # -- Existing Claim name. If defined, PVC must be created manually before volume will be bound
    existingClaim: ""
    # -- Mount path. Alertmanager data Persistent Volume mount root path.
    mountPath: /data
    # -- Mount subpath
    subPath: ""
    # -- Size of the volume. Better to set the same as resource limit memory property.
    size: 50Mi

方法二

根据job过滤

alertmanager配置

复制代码

apiVersion: v1
data:
  alertmanager.yaml: |-
    global:
      resolve_timeout: 5m
    inhibit_rules:
    - equal:
      - alertname
      - job
      source_match:
        severity: warning
      target_match:
        severity: warning
    receivers:
    - name: nft
      webhook_configs:
      - send_resolved: false
        url: http://x.x.x.x:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxxx&split=false
    - name: poap
      webhook_configs:
      - send_resolved: false
        url: http://x.x.x.x:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxxxx&split=false
    - name: ipforce
      webhook_configs:
      - send_resolved: false
        url: http://x.x.x.x:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxxxxxx&split=false
    route:
      group_by:
      - alertname
      group_interval: 60s
      group_wait: 30s
      receiver: nft
      repeat_interval: 1h
      routes:
      - group_wait: 10s
        match_re:
          job: test_poap
        receiver: poap
      - group_wait: 10s
        match_re:
          job: test_ipforce
        receiver: ipforce
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: vmalert
    meta.helm.sh/release-namespace: victoria-metrics
  creationTimestamp: '2022-04-06T07:31:38Z'
  labels:
    app: alertmanager
    app.kubernetes.io/instance: vmalert
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: victoria-metrics-alert
    helm.sh/chart: victoria-metrics-alert-0.4.33
  managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:data': {}
        'f:metadata':
          'f:annotations':
            .: {}
            'f:meta.helm.sh/release-name': {}
            'f:meta.helm.sh/release-namespace': {}
          'f:labels':
            .: {}
            'f:app': {}
            'f:app.kubernetes.io/instance': {}
            'f:app.kubernetes.io/managed-by': {}
            'f:app.kubernetes.io/name': {}
            'f:helm.sh/chart': {}
      manager: helm
      operation: Update
      time: '2022-04-06T07:31:38Z'
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:data':
          'f:alertmanager.yaml': {}
      manager: ACK-Console Apache-HttpClient
      operation: Update
      time: '2023-01-05T07:52:13Z'
  name: vmalert-alertmanager-alertmanager-config
  namespace: victoria-metrics
  resourceVersion: '80954053'
  uid: 653e4633-86e5-41ce-9a17-301f75224e9c

prometheusalert区分告警到不同钉钉群

方法一

修改告警规则

alertmanager配置示例

prometheus-webhook-dingtalk配置

vmalert配置文件value.yaml

方法二

根据job过滤