Recommended Azure Monitors

General

This document describes the recommended Azure monitors which can be implemented in Azure cloud application subscriptions.

SMT incident priority mapping

The priority "Blocker" is mostly used by Developers to prioritize their tasks and its not applicable for operations team.

0-CRITICAL Critical <= 4 hrs
1-ERROR High <= 12hrs
2-WARNING Medium <= 48hrs (2days)
3 - Informational Low <= 96hrs (4days)
4 - Verbose No Ticket Action based on the notification and analysis
All Resources Resource Health Resource Health Previous resource status=All, Current resource status=All Always Current status 4 - Verbose MS teams Included all future resource groups and future resourcesExcluding "Virtual machine instance from VMSS"
All Resources Service Health Service Health Event types: Service issue, Planned maintenance , Health advisories, Security Advisories Always Current status 4 - Verbose MS teams Regions : North Europe, West EuropeServices: Alerts & Metrics, Activity Logs & Alerts and 21 more
Azure SQL Database CPU Metric app_cpu_percent > 80 5 mins 1 hour 2-WARNING Email
Azure SQL Database CPU Metric app_cpu_percent > 95 5 mins 1 hour 1-ERROR MS teams & Email
Azure SQL Database Memory Metric app_memory_percent > 80 5 mins 1 hour 2-WARNING Email
Azure SQL Database Memory Metric app_memory_percent > 95 5 mins 1 hour 1-ERROR MS teams & Email
Azure SQL Database Space Metric allocated_data_storage greater or less than dynamic threshold 15 mins 1 hour 2-WARNING Email
AKS - Node Node CPU Metric node_cpu_usage_percentage > 80 15 mins 1 hour 2-WARNING Email Name of the node Include True
AKS - Node Node Memory Metric node_memory_working_set_percentage > 80 15 mins 1 hour 2-WARNING Email Name of the node Include True
AKS - Node Node Disk Metric node_disk_usage_percentage > 80 15 mins 1 hour 2-WARNING Email Name of the node Include True
AKS - Node Node Status (NotReady,Unknown) Metric kube_node_status_condition > 0 5 mins 15 mins 2-WARNING Email
AKS - Pods Pods phases (Failed,Unknown,Pending) Metric kube_pod_status_phase >= 1 5 mins 30 mins 2-WARNING Email Phase of the pod Include Failed,Unknown,Pending
AKS - Pods Unschedulable Pods Metric unschedulable > 1 15 mins 1 hour 2-WARNING Email
AKS - Pods Pods ready state percentage Metric podReadyPercentage(preview) 2-WARNING Email
AKS - Containers Restarting Containers Metric restarting container count(preview) 2-WARNING Email
AKS - Containers OOM killed containers Metric oomKilledContainerCount)preview) 2-WARNING Email
AKS - Containers CPU Exceeded Percentage Metric cpuExceededPercentage (preview) 2-WARNING Email
AKS - Containers Memory working set exceeded percentage Metric memoryWorkingSetExceededPercentage(preview) 2-WARNING Email
Application Gateway Unhealthy backend Host Metric UnhealthyHostCount > 0 1 min 5 mins 0-CRITICAL MS teams & Email
Application Gateway Failed Requests Metric FailedRequests > 100 5 mins 15 mins 2-WARNING Email
Load balancer SNAT Connection Status Count Metric SnatConnectionCount >= 1 5 mins 15 mins 2-WARNING Email Connection State = Failed, Pending
Public IP Addresses Under DDoS attack or not Metric IfUnderDDoSAttack > 0 1 min 5 mins 0-CRITICAL MS teams & Email
Virtual machine scaleset CPU Usage Metric Percentage CPU > 90 15 mins 1 hour 2-WARNING Email
Container Registry Storage Used Metric StorageUsed > 90% of Storage size included in the SKU 15 mins 1 hour 3 - Informational Email Review this which SKU of ACR has this metric
LogicApp RunsFailed Metric RunsFailed>0 1 hour 12 hours 3 - Informational Email
Log Analytics Workspace Container SIGKILL Error Logs Table rows Count > 0 15 mins 15 mins 2-WARNING Email Signal KILL error Expand source
Log Analytics Workspace WAF_Possible_DDoS_Detected Logs Query count_ > 1000 15 mins 15 mins 1 - Error MS teams & Email WAF_Possible_DDoS_Detected Expand source
Log Analytics workspace Node-restart-delayed triggered by Kured Logs Query 2-WARNING Email Node-restart-delayed Expand source
Log Analytics workspace Node-restart-successful-Kured Action Logs Query OBSOLETE Node-restart-successful Expand source
Azure SQL Database / server Vulnerability Scan Report Vulnerability Scan Report
Failure Failure Anomalies - ETAS-BCP-PT-Forensic-Logic-App Failure Anomalies detected 3 - Informational etas-bcp-pt-forensic-logic-app Application Insights Smart detector

Requirements

ACR ACR - To trigger alert when Create or Update Images from the ACR ?
SQL DB SQL DB - Slow / Long running Queries ?
Service Principal secret / certificate expiry ?
AKS Check if we can sent an alert if k8s is not able to scale in new workernode
VISUALIZATION KURED/AKS ALERTS Currently we dont have a Dashboard / Vis for kured alertsA overview over time would be helpful to

Refer : https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-overview

https://learn.microsoft.com/en-us/azure/azure-monitor/alerts/alerts-overview

相关推荐
垂钓的小鱼15 小时前
阿奇舒勒矛盾矩阵如何在萃智引擎中实现 AI 化——从 39×39 到一句话输入
microsoft
weixin_307779139 小时前
在 Azure 上构建数据库路由与异构整合层:原理、方案与最佳实践
数据库·人工智能·后端·云计算·azure
小黄人软件1 天前
Claude和Codex下载离线包 安装遇到问题:windows无法访问指定设备 路径 文件 应用无法打开也无法卸载,解决了
人工智能·microsoft·openai·codex
2601_961875241 天前
法考资料全套2026|客观题|主观题|资料已整理
阿里云·云计算·腾讯云·azure·七牛云存储·csdn开发云·火山引擎
golfscript1 天前
Playwright Python:微软出的浏览器自动化库
python·其他·microsoft·自动化
zhangfeng11331 天前
ONNX Runtime 微软的推理引擎 TensorRT,NVIDIA GPU 上的深度学习推理, CUDA Graph
人工智能·深度学习·microsoft
宝桥南山1 天前
GitHub Copilot - 尝试使用一下Azure Devops MCP server
microsoft·微软·github·aigc·copilot·devops
robot_???1 天前
Visual studio2022:找不到指定的SDK“Microsoft.NET.Sdk”
microsoft·.net·visual studio
墨小傲1 天前
Codex离线安装解决无法通过微软商店安装的痛处
microsoft
xhtdj1 天前
Build 2026:Azure API Management 推出统一模型 API 并新增 MCP 内容安全能力
人工智能·安全·azure