Recommended Azure Monitors

General

This document describes the recommended Azure monitors that can be implemented in Azure cloud application subscriptions.

SMT incident priority mapping

The "Blocker" priority is mostly used by developers to prioritize their tasks; it is not applicable to the operations team.

0-CRITICAL | Critical | <= 4 hrs
1-ERROR | High | <= 12 hrs
2-WARNING | Medium | <= 48 hrs (2 days)
3 - Informational | Low | <= 96 hrs (4 days)
4 - Verbose | No Ticket | Action based on the notification and analysis
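
The mapping above can be kept as a small lookup table so that tooling raises tickets with a consistent priority. The following is a minimal sketch, not an existing implementation: the names `IncidentPolicy`, `SEVERITY_TO_POLICY` and `policy_for` are hypothetical, and the time values are simply the limits listed in the mapping.

```python
from datetime import timedelta
from typing import NamedTuple, Optional


class IncidentPolicy(NamedTuple):
    priority: str                    # incident priority, or "No Ticket"
    time_limit: Optional[timedelta]  # limit from the mapping; None when no ticket is raised


# Alert severity number -> incident policy, copied from the mapping above.
SEVERITY_TO_POLICY = {
    0: IncidentPolicy("Critical", timedelta(hours=4)),
    1: IncidentPolicy("High", timedelta(hours=12)),
    2: IncidentPolicy("Medium", timedelta(hours=48)),
    3: IncidentPolicy("Low", timedelta(hours=96)),
    4: IncidentPolicy("No Ticket", None),
}


def policy_for(severity: int) -> IncidentPolicy:
    """Return the incident policy for an alert severity between 0 and 4."""
    try:
        return SEVERITY_TO_POLICY[severity]
    except KeyError:
        raise ValueError(f"unknown alert severity: {severity}") from None


if __name__ == "__main__":
    for sev in range(5):
        print(sev, policy_for(sev))
```

Severity 4 (Verbose) returns a policy with no time limit, reflecting that such alerts are handled based on the notification and analysis rather than through a ticket.
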
All Resources | Resource Health | Resource Health | Previous resource status=All, Current resource status=All | Always | Current status | 4 - Verbose | MS Teams | Includes all future resource groups and future resources. Excludes "Virtual machine instance from VMSS".
All Resources | Service Health | Service Health | Event types: Service issue, Planned maintenance, Health advisories, Security advisories | Always | Current status | 4 - Verbose | MS Teams | Regions: North Europe, West Europe. Services: Alerts & Metrics, Activity Logs & Alerts and 21 more
Azure SQL Database | CPU | Metric | app_cpu_percent > 80 | 5 mins | 1 hour | 2-WARNING | Email |
Azure SQL Database | CPU | Metric | app_cpu_percent > 95 | 5 mins | 1 hour | 1-ERROR | MS Teams & Email |
Azure SQL Database | Memory | Metric | app_memory_percent > 80 | 5 mins | 1 hour | 2-WARNING | Email |
Azure SQL Database | Memory | Metric | app_memory_percent > 95 | 5 mins | 1 hour | 1-ERROR | MS Teams & Email |
Azure SQL Database | Space | Metric | allocated_data_storage greater or less than dynamic threshold | 15 mins | 1 hour | 2-WARNING | Email |
AKS - Node | Node CPU | Metric | node_cpu_usage_percentage > 80 | 15 mins | 1 hour | 2-WARNING | Email | Name of the node Include True
AKS - Node | Node Memory | Metric | node_memory_working_set_percentage > 80 | 15 mins | 1 hour | 2-WARNING | Email | Name of the node Include True
AKS - Node | Node Disk | Metric | node_disk_usage_percentage > 80 | 15 mins | 1 hour | 2-WARNING | Email | Name of the node Include True
AKS - Node | Node Status (NotReady, Unknown) | Metric | kube_node_status_condition > 0 | 5 mins | 15 mins | 2-WARNING | Email |
AKS - Pods | Pod phases (Failed, Unknown, Pending) | Metric | kube_pod_status_phase >= 1 | 5 mins | 30 mins | 2-WARNING | Email | Phase of the pod Include Failed,Unknown,Pending
AKS - Pods | Unschedulable Pods | Metric | unschedulable > 1 | 15 mins | 1 hour | 2-WARNING | Email |
AKS - Pods | Pods ready state percentage | Metric | podReadyPercentage (preview) | | | 2-WARNING | Email |
AKS - Containers | Restarting Containers | Metric | restarting container count (preview) | | | 2-WARNING | Email |
AKS - Containers | OOM killed containers | Metric | oomKilledContainerCount (preview) | | | 2-WARNING | Email |
AKS - Containers | CPU Exceeded Percentage | Metric | cpuExceededPercentage (preview) | | | 2-WARNING | Email |
AKS - Containers | Memory working set exceeded percentage | Metric | memoryWorkingSetExceededPercentage (preview) | | | 2-WARNING | Email |
Application Gateway | Unhealthy backend host | Metric | UnhealthyHostCount > 0 | 1 min | 5 mins | 0-CRITICAL | MS Teams & Email |
Application Gateway | Failed Requests | Metric | FailedRequests > 100 | 5 mins | 15 mins | 2-WARNING | Email |
Load balancer | SNAT Connection Status Count | Metric | SnatConnectionCount >= 1 | 5 mins | 15 mins | 2-WARNING | Email | Connection State = Failed, Pending
Public IP Addresses | Under DDoS attack or not | Metric | IfUnderDDoSAttack > 0 | 1 min | 5 mins | 0-CRITICAL | MS Teams & Email |
Virtual machine scale set | CPU Usage | Metric | Percentage CPU > 90 | 15 mins | 1 hour | 2-WARNING | Email |
Container Registry | Storage Used | Metric | StorageUsed > 90% of the storage size included in the SKU | 15 mins | 1 hour | 3 - Informational | Email | Review which ACR SKU provides this metric
Logic App | RunsFailed | Metric | RunsFailed > 0 | 1 hour | 12 hours | 3 - Informational | Email |
Log Analytics workspace | Container SIGKILL Error | Logs | Table rows Count > 0 | 15 mins | 15 mins | 2-WARNING | Email | Signal KILL error
Log Analytics workspace | WAF_Possible_DDoS_Detected | Logs | Query count_ > 1000 | 15 mins | 15 mins | 1-ERROR | MS Teams & Email | WAF_Possible_DDoS_Detected
Log Analytics workspace | Node-restart-delayed triggered by Kured | Logs | Query | | | 2-WARNING | Email | Node-restart-delayed
Log Analytics workspace | Node-restart-successful-Kured Action | Logs | Query | | | OBSOLETE | | Node-restart-successful
Azure SQL Database / server | Vulnerability Scan Report | Vulnerability Scan Report | | | | | |
Failure | Failure Anomalies - ETAS-BCP-PT-Forensic-Logic-App | Failure Anomalies detected | | | | 3 - Informational | etas-bcp-pt-forensic-logic-app | Application Insights Smart detector
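
Several resources in the list above use two tiers for the same metric, for example Azure SQL Database CPU and memory at > 80 (2-WARNING, Email) and > 95 (1-ERROR, MS Teams & Email). The sketch below shows one way to reason about which tier a given reading falls into. It is illustrative only: `ThresholdRule`, `SQL_RULES` and `most_severe_breach` are hypothetical names, and in Azure Monitor each alert rule is evaluated independently, so a reading above 95 would also satisfy the warning rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ThresholdRule:
    metric: str       # metric name as listed in the monitor list
    threshold: float  # the rule is breached when value > threshold
    severity: str     # severity label from the monitor list
    notify: str       # notification channel(s) from the monitor list


# The four Azure SQL Database CPU and memory entries from the list above.
SQL_RULES: List[ThresholdRule] = [
    ThresholdRule("app_cpu_percent", 80, "2-WARNING", "Email"),
    ThresholdRule("app_cpu_percent", 95, "1-ERROR", "MS Teams & Email"),
    ThresholdRule("app_memory_percent", 80, "2-WARNING", "Email"),
    ThresholdRule("app_memory_percent", 95, "1-ERROR", "MS Teams & Email"),
]


def most_severe_breach(
    metric: str, value: float, rules: List[ThresholdRule] = SQL_RULES
) -> Optional[ThresholdRule]:
    """Return the breached rule with the highest threshold, or None if none is breached."""
    breached = [r for r in rules if r.metric == metric and value > r.threshold]
    return max(breached, key=lambda r: r.threshold, default=None)


if __name__ == "__main__":
    for reading in (75.0, 85.0, 97.0):
        print(reading, most_severe_breach("app_cpu_percent", reading))
```

The comparison is strict, matching the `>` conditions in the list, so a reading of exactly 80 does not breach the warning tier.
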

Requirements

ACR: Trigger an alert when images are created or updated in the ACR?
SQL DB: Slow or long-running queries?
Service Principal secret / certificate expiry?
AKS: Check whether we can send an alert if Kubernetes is not able to scale in a new worker node.
Visualization (Kured/AKS alerts): Currently we don't have a dashboard or visualization for Kured alerts. An overview over time would be helpful to
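
For the Service Principal secret / certificate expiry requirement, one possible approach is a scheduled check that flags credentials nearing their end date. The sketch below covers only the filtering step and assumes the credential list has already been retrieved by other means. The function name `expiring_credentials`, the dictionary keys, the sample entries and the 30-day warning window are all hypothetical choices, not part of the requirement.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expiring_credentials(
    credentials: List[Dict], now: datetime, warn_within_days: int = 30
) -> List[Dict]:
    """Return credentials already expired or expiring within the window, soonest first."""
    cutoff = now + timedelta(days=warn_within_days)
    due = [c for c in credentials if c["expires"] <= cutoff]
    return sorted(due, key=lambda c: c["expires"])


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    now = datetime(2025, 1, 1, tzinfo=timezone.utc)
    sample = [
        {"app": "app-a", "kind": "secret", "expires": datetime(2025, 1, 15, tzinfo=timezone.utc)},
        {"app": "app-b", "kind": "certificate", "expires": datetime(2025, 6, 1, tzinfo=timezone.utc)},
        {"app": "app-c", "kind": "secret", "expires": datetime(2024, 12, 20, tzinfo=timezone.utc)},
    ]
    for cred in expiring_credentials(sample, now):
        print(cred["app"], cred["kind"], cred["expires"].date())
```

The result could then be forwarded to the same notification channels used by the other monitors, with the severity chosen according to how close the expiry date is.
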

Refer to: https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-overview

https://learn.microsoft.com/en-us/azure/azure-monitor/alerts/alerts-overview
