Recommended Azure Monitors

General

This document describes the recommended Azure monitors which can be implemented in Azure cloud application subscriptions.

SMT incident priority mapping

The priority "Blocker" is mostly used by developers to prioritize their tasks and is not applicable to the operations team.

Severity | SMT priority | Time limit / action
0-CRITICAL | Critical | <= 4 hrs
1-ERROR | High | <= 12 hrs
2-WARNING | Medium | <= 48 hrs (2 days)
3 - Informational | Low | <= 96 hrs (4 days)
4 - Verbose | No Ticket | Action based on the notification and analysis
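
The mapping above can be expressed as a small lookup, for example when a notification handler needs to decide which SMT priority to assign. This is only a sketch: the names PriorityMapping, SEVERITY_TO_SMT and smt_priority_for are hypothetical, and reading the hour values as a time limit per priority is an assumption based on the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PriorityMapping:
    smt_priority: str          # SMT incident priority, or "No Ticket"
    max_hours: Optional[int]   # time limit in hours; None when no ticket is raised


# Keys are the numeric alert severities (0-4) from the mapping above.
SEVERITY_TO_SMT = {
    0: PriorityMapping("Critical", 4),
    1: PriorityMapping("High", 12),
    2: PriorityMapping("Medium", 48),
    3: PriorityMapping("Low", 96),
    4: PriorityMapping("No Ticket", None),
}


def smt_priority_for(severity: int) -> PriorityMapping:
    """Return the SMT mapping for an alert severity, rejecting unknown values."""
    try:
        return SEVERITY_TO_SMT[severity]
    except KeyError:
        raise ValueError(f"unknown alert severity: {severity}") from None
```

For instance, smt_priority_for(2) yields the Medium priority with a 48 hour limit, while severity 4 yields "No Ticket" with no time limit.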

Resource | Monitor | Signal type | Condition | Evaluation frequency | Evaluation window | Severity | Notification | Dimensions / Remarks
All Resources | Resource Health | Resource Health | Previous resource status=All, Current resource status=All | Always | Current status | 4 - Verbose | MS teams | Included all future resource groups and future resources. Excluding "Virtual machine instance from VMSS"
All Resources | Service Health | Service Health | Event types: Service issue, Planned maintenance, Health advisories, Security Advisories | Always | Current status | 4 - Verbose | MS teams | Regions: North Europe, West Europe. Services: Alerts & Metrics, Activity Logs & Alerts and 21 more
Azure SQL Database | CPU | Metric | app_cpu_percent > 80 | 5 mins | 1 hour | 2-WARNING | Email | -
Azure SQL Database | CPU | Metric | app_cpu_percent > 95 | 5 mins | 1 hour | 1-ERROR | MS teams & Email | -
Azure SQL Database | Memory | Metric | app_memory_percent > 80 | 5 mins | 1 hour | 2-WARNING | Email | -
Azure SQL Database | Memory | Metric | app_memory_percent > 95 | 5 mins | 1 hour | 1-ERROR | MS teams & Email | -
Azure SQL Database | Space | Metric | allocated_data_storage greater or less than dynamic threshold | 15 mins | 1 hour | 2-WARNING | Email | -
AKS - Node | Node CPU | Metric | node_cpu_usage_percentage > 80 | 15 mins | 1 hour | 2-WARNING | Email | Name of the node Include True
AKS - Node | Node Memory | Metric | node_memory_working_set_percentage > 80 | 15 mins | 1 hour | 2-WARNING | Email | Name of the node Include True
AKS - Node | Node Disk | Metric | node_disk_usage_percentage > 80 | 15 mins | 1 hour | 2-WARNING | Email | Name of the node Include True
AKS - Node | Node Status (NotReady, Unknown) | Metric | kube_node_status_condition > 0 | 5 mins | 15 mins | 2-WARNING | Email | -
AKS - Pods | Pods phases (Failed, Unknown, Pending) | Metric | kube_pod_status_phase >= 1 | 5 mins | 30 mins | 2-WARNING | Email | Phase of the pod Include Failed, Unknown, Pending
AKS - Pods | Unschedulable Pods | Metric | unschedulable > 1 | 15 mins | 1 hour | 2-WARNING | Email | -
AKS - Pods | Pods ready state percentage | Metric | podReadyPercentage (preview) | - | - | 2-WARNING | Email | -
AKS - Containers | Restarting Containers | Metric | restarting container count (preview) | - | - | 2-WARNING | Email | -
AKS - Containers | OOM killed containers | Metric | oomKilledContainerCount (preview) | - | - | 2-WARNING | Email | -
AKS - Containers | CPU Exceeded Percentage | Metric | cpuExceededPercentage (preview) | - | - | 2-WARNING | Email | -
AKS - Containers | Memory working set exceeded percentage | Metric | memoryWorkingSetExceededPercentage (preview) | - | - | 2-WARNING | Email | -
Application Gateway | Unhealthy backend Host | Metric | UnhealthyHostCount > 0 | 1 min | 5 mins | 0-CRITICAL | MS teams & Email | -
Application Gateway | Failed Requests | Metric | FailedRequests > 100 | 5 mins | 15 mins | 2-WARNING | Email | -
Load balancer | SNAT Connection Status Count | Metric | SnatConnectionCount >= 1 | 5 mins | 15 mins | 2-WARNING | Email | Connection State = Failed, Pending
Public IP Addresses | Under DDoS attack or not | Metric | IfUnderDDoSAttack > 0 | 1 min | 5 mins | 0-CRITICAL | MS teams & Email | -
Virtual machine scaleset | CPU Usage | Metric | Percentage CPU > 90 | 15 mins | 1 hour | 2-WARNING | Email | -
Container Registry | Storage Used | Metric | StorageUsed > 90% of Storage size included in the SKU | 15 mins | 1 hour | 3 - Informational | Email | To be reviewed: which ACR SKUs provide this metric
LogicApp | RunsFailed | Metric | RunsFailed > 0 | 1 hour | 12 hours | 3 - Informational | Email | -
Log Analytics Workspace | Container SIGKILL Error | Logs | Table rows Count > 0 | 15 mins | 15 mins | 2-WARNING | Email | Signal KILL error
Log Analytics Workspace | WAF_Possible_DDoS_Detected | Logs | Query count_ > 1000 | 15 mins | 15 mins | 1-ERROR | MS teams & Email | WAF_Possible_DDoS_Detected
Log Analytics Workspace | Node-restart-delayed triggered by Kured | Logs | Query | - | - | 2-WARNING | Email | Node-restart-delayed
Log Analytics Workspace | Node-restart-successful-Kured Action | Logs | Query | - | - | OBSOLETE | - | Node-restart-successful
Azure SQL Database / server | Vulnerability Scan Report | Vulnerability Scan Report | - | - | - | - | - | -
Failure | Failure Anomalies - ETAS-BCP-PT-Forensic-Logic-App | - | Failure Anomalies detected | - | - | 3 - Informational | - | etas-bcp-pt-forensic-logic-app, Application Insights Smart detector
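
To illustrate how a static-threshold metric row from the list above is meant to behave, the following sketch evaluates samples against a rule. It is not Azure Monitor code: MetricRule and breaches are hypothetical names, the second duration in each row (for example 1 hour) is assumed to be the lookback window, and averaging over that window is an assumption, since the source does not state the aggregation type.

```python
import operator
from dataclasses import dataclass
from statistics import mean

# Comparison operators that appear in the monitor conditions.
OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class MetricRule:
    metric: str
    op: str
    threshold: float
    window_minutes: int   # assumed lookback window
    severity: str


def breaches(rule, samples):
    """Return True if the average of the samples inside the window violates the rule.

    samples is a list of (minutes_ago, value) pairs; averaging is an assumption.
    """
    in_window = [value for age, value in samples if 0 <= age < rule.window_minutes]
    if not in_window:
        return False  # no data in the window: treated here as "no alert"
    return OPERATORS[rule.op](mean(in_window), rule.threshold)


# The two Azure SQL Database CPU rows from the list above.
sql_cpu_warning = MetricRule("app_cpu_percent", ">", 80, 60, "2-WARNING")
sql_cpu_error = MetricRule("app_cpu_percent", ">", 95, 60, "1-ERROR")

# Illustrative samples only; the 90-minute-old value falls outside the window.
samples = [(5, 85.0), (30, 90.0), (55, 83.0), (90, 10.0)]
warning_fired = breaches(sql_cpu_warning, samples)
error_fired = breaches(sql_cpu_error, samples)
```

With these illustrative samples the average inside the window is 86, so the 2-WARNING rule fires and the 1-ERROR rule does not, which matches the intent of pairing a lower and a higher threshold for the same metric.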

Requirements

ACR: Trigger an alert when images are created or updated in the ACR?
SQL DB: Alert on slow / long-running queries?
Service Principal: Alert on secret / certificate expiry?
AKS: Check whether an alert can be sent when Kubernetes is not able to scale in a new worker node.
Visualization of Kured/AKS alerts: Currently there is no dashboard or visualization for Kured alerts. An overview over time would be helpful.
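
For the open question on Service Principal secret / certificate expiry, one possible approach is a scheduled check that compares each credential's end date against a warning window. The sketch below shows only the comparison logic; the function name expiring_credentials, the dictionary layout, the credential names and the 30-day default are all hypothetical, and fetching the real end dates from the directory is deliberately left out.

```python
from datetime import datetime, timedelta, timezone


def expiring_credentials(credentials, now, warn_days=30):
    """Return the names of credentials that expire within warn_days or have already expired.

    credentials is a list of dicts with "name" and "end" (timezone-aware datetime) keys.
    """
    cutoff = now + timedelta(days=warn_days)
    return [cred["name"] for cred in credentials if cred["end"] <= cutoff]


# Illustrative data only; real end dates would come from the directory.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
credentials = [
    {"name": "sp-example-a", "end": datetime(2024, 1, 15, tzinfo=timezone.utc)},
    {"name": "sp-example-b", "end": datetime(2024, 6, 1, tzinfo=timezone.utc)},
    {"name": "sp-example-c", "end": datetime(2023, 12, 1, tzinfo=timezone.utc)},
]
to_renew = expiring_credentials(credentials, now)
```

The resulting list could then be sent through the same Email or MS teams channels used by the other monitors.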

Refer to: https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-overview

https://learn.microsoft.com/en-us/azure/azure-monitor/alerts/alerts-overview
