Recommended Azure Monitors

General

This document describes the recommended Azure monitors which can be implemented in Azure cloud application subscriptions.

SMT incident priority mapping

The priority "Blocker" is mostly used by Developers to prioritize their tasks and its not applicable for operations team.

0-CRITICAL Critical <= 4 hrs
1-ERROR High <= 12hrs
2-WARNING Medium <= 48hrs (2days)
3 - Informational Low <= 96hrs (4days)
4 - Verbose No Ticket Action based on the notification and analysis
All Resources Resource Health Resource Health Previous resource status=All, Current resource status=All Always Current status 4 - Verbose MS teams Included all future resource groups and future resourcesExcluding "Virtual machine instance from VMSS"
All Resources Service Health Service Health Event types: Service issue, Planned maintenance , Health advisories, Security Advisories Always Current status 4 - Verbose MS teams Regions : North Europe, West EuropeServices: Alerts & Metrics, Activity Logs & Alerts and 21 more
Azure SQL Database CPU Metric app_cpu_percent > 80 5 mins 1 hour 2-WARNING Email
Azure SQL Database CPU Metric app_cpu_percent > 95 5 mins 1 hour 1-ERROR MS teams & Email
Azure SQL Database Memory Metric app_memory_percent > 80 5 mins 1 hour 2-WARNING Email
Azure SQL Database Memory Metric app_memory_percent > 95 5 mins 1 hour 1-ERROR MS teams & Email
Azure SQL Database Space Metric allocated_data_storage greater or less than dynamic threshold 15 mins 1 hour 2-WARNING Email
AKS - Node Node CPU Metric node_cpu_usage_percentage > 80 15 mins 1 hour 2-WARNING Email Name of the node Include True
AKS - Node Node Memory Metric node_memory_working_set_percentage > 80 15 mins 1 hour 2-WARNING Email Name of the node Include True
AKS - Node Node Disk Metric node_disk_usage_percentage > 80 15 mins 1 hour 2-WARNING Email Name of the node Include True
AKS - Node Node Status (NotReady,Unknown) Metric kube_node_status_condition > 0 5 mins 15 mins 2-WARNING Email
AKS - Pods Pods phases (Failed,Unknown,Pending) Metric kube_pod_status_phase >= 1 5 mins 30 mins 2-WARNING Email Phase of the pod Include Failed,Unknown,Pending
AKS - Pods Unschedulable Pods Metric unschedulable > 1 15 mins 1 hour 2-WARNING Email
AKS - Pods Pods ready state percentage Metric podReadyPercentage(preview) 2-WARNING Email
AKS - Containers Restarting Containers Metric restarting container count(preview) 2-WARNING Email
AKS - Containers OOM killed containers Metric oomKilledContainerCount)preview) 2-WARNING Email
AKS - Containers CPU Exceeded Percentage Metric cpuExceededPercentage (preview) 2-WARNING Email
AKS - Containers Memory working set exceeded percentage Metric memoryWorkingSetExceededPercentage(preview) 2-WARNING Email
Application Gateway Unhealthy backend Host Metric UnhealthyHostCount > 0 1 min 5 mins 0-CRITICAL MS teams & Email
Application Gateway Failed Requests Metric FailedRequests > 100 5 mins 15 mins 2-WARNING Email
Load balancer SNAT Connection Status Count Metric SnatConnectionCount >= 1 5 mins 15 mins 2-WARNING Email Connection State = Failed, Pending
Public IP Addresses Under DDoS attack or not Metric IfUnderDDoSAttack > 0 1 min 5 mins 0-CRITICAL MS teams & Email
Virtual machine scaleset CPU Usage Metric Percentage CPU > 90 15 mins 1 hour 2-WARNING Email
Container Registry Storage Used Metric StorageUsed > 90% of Storage size included in the SKU 15 mins 1 hour 3 - Informational Email Review this which SKU of ACR has this metric
LogicApp RunsFailed Metric RunsFailed>0 1 hour 12 hours 3 - Informational Email
Log Analytics Workspace Container SIGKILL Error Logs Table rows Count > 0 15 mins 15 mins 2-WARNING Email Signal KILL error Expand source
Log Analytics Workspace WAF_Possible_DDoS_Detected Logs Query count_ > 1000 15 mins 15 mins 1 - Error MS teams & Email WAF_Possible_DDoS_Detected Expand source
Log Analytics workspace Node-restart-delayed triggered by Kured Logs Query 2-WARNING Email Node-restart-delayed Expand source
Log Analytics workspace Node-restart-successful-Kured Action Logs Query OBSOLETE Node-restart-successful Expand source
Azure SQL Database / server Vulnerability Scan Report Vulnerability Scan Report
Failure Failure Anomalies - ETAS-BCP-PT-Forensic-Logic-App Failure Anomalies detected 3 - Informational etas-bcp-pt-forensic-logic-app Application Insights Smart detector

Requirements

ACR ACR - To trigger alert when Create or Update Images from the ACR ?
SQL DB SQL DB - Slow / Long running Queries ?
Service Principal secret / certificate expiry ?
AKS Check if we can sent an alert if k8s is not able to scale in new workernode
VISUALIZATION KURED/AKS ALERTS Currently we dont have a Dashboard / Vis for kured alertsA overview over time would be helpful to

Refer : https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-overview

https://learn.microsoft.com/en-us/azure/azure-monitor/alerts/alerts-overview

相关推荐
宝桥南山8 小时前
AI - 在命令行中尝试一下ACP(Agent Client Protocol)通信
microsoft·微软·github·aigc·copilot
技术钱13 小时前
Prompt组件以及使用技巧
microsoft·prompt
宝桥南山15 小时前
GitHub Models - 尝试一下使用GitHub Models
microsoft·ai·微软·c#·github·.netcore
syounger16 小时前
SAP新API政策引发AI生态焦虑:开放平台还是变相锁定?
人工智能·microsoft
Data-Miner18 小时前
用DeepSeek V4做表:数以轻舟Agent让做Excel表像聊天一样简单
microsoft·excel
Lucky_Turtle19 小时前
【Linux】debain13开启bbr
服务器·azure
nbwenren20 小时前
2026实测:用 Gemini 3.1镜像站 自然语言转 SQL,打造内部数据分析对话式接口
microsoft
武藤一雄1 天前
WPF:MessageBox系统消息框
前端·microsoft·c#·.net·wpf
Leinwin1 天前
双城奔赴,智领未来:领驭科技亮相2026微软AI Tour上海·香港站
人工智能·科技·microsoft
编码者卢布2 天前
【Azure Container App】使用 yaml 部署 Container App 时遇见 400 Bad Request 错误
microsoft·azure