Recommended Azure Monitors

General

This document describes the recommended Azure monitors that can be implemented in Azure cloud application subscriptions.

SMT incident priority mapping

The priority "Blocker" is used mostly by developers to prioritize their tasks; it is not applicable to the operations team.

Alert severity | SMT priority | Time limit
0-CRITICAL | Critical | <= 4 hrs
1-ERROR | High | <= 12 hrs
2-WARNING | Medium | <= 48 hrs (2 days)
3 - Informational | Low | <= 96 hrs (4 days)
4 - Verbose | No ticket | Action based on the notification and analysis
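The severity-to-priority mapping above can be kept as a small lookup so that alert handling code and the ticketing rules stay in sync. The following is a minimal sketch only; the names `Priority`, `SEVERITY_TO_PRIORITY` and `ticket_for` are hypothetical and not part of any Azure or SMT API.

```python
from typing import NamedTuple, Optional


class Priority(NamedTuple):
    smt_priority: Optional[str]  # None means no ticket is raised
    max_hours: Optional[int]     # time limit from the mapping; None if not applicable


# Keys are the alert severities 0-4; values follow the mapping above.
SEVERITY_TO_PRIORITY = {
    0: Priority("Critical", 4),
    1: Priority("High", 12),
    2: Priority("Medium", 48),
    3: Priority("Low", 96),
    4: Priority(None, None),
}


def ticket_for(severity: int) -> str:
    """Describe the SMT ticket handling for an alert severity."""
    entry = SEVERITY_TO_PRIORITY[severity]
    if entry.smt_priority is None:
        return "No ticket; act on the notification and analysis"
    return f"{entry.smt_priority} priority, within {entry.max_hours} hours"


if __name__ == "__main__":
    for severity in sorted(SEVERITY_TO_PRIORITY):
        print(severity, "->", ticket_for(severity))
```

Keeping the time limits as plain hours makes it easy to compare them against ticket age in a report.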

Resource type | Monitor | Signal type | Condition | Evaluation frequency | Evaluation window | Severity | Notification | Remarks
All Resources | Resource Health | Resource Health | Previous resource status=All, Current resource status=All | Always | Current status | 4 - Verbose | MS Teams | Includes all future resource groups and future resources; excludes "Virtual machine instance from VMSS"
All Resources | Service Health | Service Health | Event types: Service issue, Planned maintenance, Health advisories, Security advisories | Always | Current status | 4 - Verbose | MS Teams | Regions: North Europe, West Europe; Services: Alerts & Metrics, Activity Logs & Alerts and 21 more
Azure SQL Database | CPU | Metric | app_cpu_percent > 80 | 5 mins | 1 hour | 2-WARNING | Email |
Azure SQL Database | CPU | Metric | app_cpu_percent > 95 | 5 mins | 1 hour | 1-ERROR | MS Teams & Email |
Azure SQL Database | Memory | Metric | app_memory_percent > 80 | 5 mins | 1 hour | 2-WARNING | Email |
Azure SQL Database | Memory | Metric | app_memory_percent > 95 | 5 mins | 1 hour | 1-ERROR | MS Teams & Email |
Azure SQL Database | Space | Metric | allocated_data_storage greater or less than dynamic threshold | 15 mins | 1 hour | 2-WARNING | Email |
AKS - Node | Node CPU | Metric | node_cpu_usage_percentage > 80 | 15 mins | 1 hour | 2-WARNING | Email | Name of the node Include True
AKS - Node | Node Memory | Metric | node_memory_working_set_percentage > 80 | 15 mins | 1 hour | 2-WARNING | Email | Name of the node Include True
AKS - Node | Node Disk | Metric | node_disk_usage_percentage > 80 | 15 mins | 1 hour | 2-WARNING | Email | Name of the node Include True
AKS - Node | Node Status (NotReady, Unknown) | Metric | kube_node_status_condition > 0 | 5 mins | 15 mins | 2-WARNING | Email |
AKS - Pods | Pods phases (Failed, Unknown, Pending) | Metric | kube_pod_status_phase >= 1 | 5 mins | 30 mins | 2-WARNING | Email | Phase of the pod Include Failed, Unknown, Pending
AKS - Pods | Unschedulable Pods | Metric | unschedulable > 1 | 15 mins | 1 hour | 2-WARNING | Email |
AKS - Pods | Pods ready state percentage | Metric | podReadyPercentage (preview) | | | 2-WARNING | Email |
AKS - Containers | Restarting Containers | Metric | restarting container count (preview) | | | 2-WARNING | Email |
AKS - Containers | OOM killed containers | Metric | oomKilledContainerCount (preview) | | | 2-WARNING | Email |
AKS - Containers | CPU Exceeded Percentage | Metric | cpuExceededPercentage (preview) | | | 2-WARNING | Email |
AKS - Containers | Memory working set exceeded percentage | Metric | memoryWorkingSetExceededPercentage (preview) | | | 2-WARNING | Email |
Application Gateway | Unhealthy backend Host | Metric | UnhealthyHostCount > 0 | 1 min | 5 mins | 0-CRITICAL | MS Teams & Email |
Application Gateway | Failed Requests | Metric | FailedRequests > 100 | 5 mins | 15 mins | 2-WARNING | Email |
Load balancer | SNAT Connection Status Count | Metric | SnatConnectionCount >= 1 | 5 mins | 15 mins | 2-WARNING | Email | Connection State = Failed, Pending
Public IP Addresses | Under DDoS attack or not | Metric | IfUnderDDoSAttack > 0 | 1 min | 5 mins | 0-CRITICAL | MS Teams & Email |
Virtual machine scale set | CPU Usage | Metric | Percentage CPU > 90 | 15 mins | 1 hour | 2-WARNING | Email |
Container Registry | Storage Used | Metric | StorageUsed > 90% of storage size included in the SKU | 15 mins | 1 hour | 3 - Informational | Email | To review: which ACR SKU has this metric
LogicApp | RunsFailed | Metric | RunsFailed > 0 | 1 hour | 12 hours | 3 - Informational | Email |
Log Analytics workspace | Container SIGKILL Error | Logs | Table rows Count > 0 | 15 mins | 15 mins | 2-WARNING | Email | Signal KILL error
Log Analytics workspace | WAF_Possible_DDoS_Detected | Logs | Query count_ > 1000 | 15 mins | 15 mins | 1-ERROR | MS Teams & Email | WAF_Possible_DDoS_Detected
Log Analytics workspace | Node-restart-delayed triggered by Kured | Logs | Query | | | 2-WARNING | Email | Node-restart-delayed
Log Analytics workspace | Node-restart-successful-Kured Action | Logs | Query | | | OBSOLETE | | Node-restart-successful
Azure SQL Database / server | Vulnerability Scan Report | Vulnerability Scan Report | | | | | |
Failure | Failure Anomalies - ETAS-BCP-PT-Forensic-Logic-App | | Failure Anomalies detected | | | 3 - Informational | | etas-bcp-pt-forensic-logic-app; Application Insights Smart detector
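
As an illustration, one metric row from the table above (Azure SQL Database CPU, app_cpu_percent > 80, evaluated every 5 minutes over 1 hour, severity 2) could be expressed as a Microsoft.Insights/metricAlerts resource for an ARM template. This is a sketch under assumptions, not the configuration actually deployed: the helper names `to_iso8601` and `metric_alert`, the alert name, the placeholder resource IDs, and the "Average" time aggregation are not taken from this document, and the property names should be checked against the current ARM template reference before use.

```python
import json


def to_iso8601(minutes: int) -> str:
    """Convert whole minutes to an ISO 8601 duration such as PT5M or PT1H."""
    if minutes <= 0:
        raise ValueError("duration must be a positive number of minutes")
    if minutes % 60 == 0:
        return f"PT{minutes // 60}H"
    return f"PT{minutes}M"


def metric_alert(name, scope, metric, operator, threshold,
                 frequency_min, window_min, severity, action_group_id):
    """Build a metric alert resource (as a dict) for one row of the table."""
    if frequency_min > window_min:
        raise ValueError("evaluation frequency must not exceed the window size")
    return {
        "type": "Microsoft.Insights/metricAlerts",
        "apiVersion": "2018-03-01",
        "name": name,
        "location": "global",
        "properties": {
            "severity": severity,
            "enabled": True,
            "scopes": [scope],
            "evaluationFrequency": to_iso8601(frequency_min),
            "windowSize": to_iso8601(window_min),
            "criteria": {
                "odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria",
                "allOf": [{
                    "name": "criterion1",
                    "criterionType": "StaticThresholdCriterion",
                    "metricName": metric,
                    "operator": operator,
                    "threshold": threshold,
                    # Assumption: the table does not state the aggregation.
                    "timeAggregation": "Average",
                }],
            },
            "actions": [{"actionGroupId": action_group_id}],
        },
    }


if __name__ == "__main__":
    # Placeholder IDs in angle brackets are hypothetical and must be replaced.
    rule = metric_alert(
        name="sql-cpu-warning",
        scope="/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
              "/providers/Microsoft.Sql/servers/<server>/databases/<database>",
        metric="app_cpu_percent",
        operator="GreaterThan",
        threshold=80,
        frequency_min=5,
        window_min=60,
        severity=2,
        action_group_id="<action-group-resource-id>",
    )
    print(json.dumps(rule, indent=2))
```

Generating the rules from the table rows, rather than creating them by hand in the portal, keeps thresholds and severities reviewable in one place.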

Requirements

Area | Requirement
ACR | Trigger an alert when images are created or updated in the ACR?
SQL DB | Slow / long-running queries?
Service Principal | Secret / certificate expiry?
AKS | Check whether an alert can be sent if Kubernetes is not able to scale in a new worker node
Visualization (Kured/AKS alerts) | Currently there is no dashboard or visualization for Kured alerts. An overview over time would be helpful to
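
For the open question on service principal secret / certificate expiry, one possible approach is a scheduled check that compares credential end dates against a warning threshold. The sketch below is only an illustration: the function name `expiring_credentials`, the 30-day threshold, and the sample credential names and dates are hypothetical, and how the end dates are exported from the directory is left out.

```python
from datetime import datetime, timedelta, timezone


def expiring_credentials(credentials, now, warn_days=30):
    """Return (name, days_left) for credentials that expire within warn_days.

    Already expired credentials are included with a negative days_left.
    The result is sorted so the most urgent entries come first.
    """
    limit = now + timedelta(days=warn_days)
    due = [
        (name, (end - now).days)
        for name, end in credentials
        if end <= limit
    ]
    return sorted(due, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical sample data: (credential name, end date in UTC).
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    sample = [
        ("sp-a", datetime(2024, 1, 11, tzinfo=timezone.utc)),
        ("sp-b", datetime(2024, 6, 1, tzinfo=timezone.utc)),
        ("sp-c", datetime(2023, 12, 25, tzinfo=timezone.utc)),
    ]
    for name, days_left in expiring_credentials(sample, now):
        print(f"{name}: {days_left} days left")
```

The result could feed an email or MS Teams notification in the same way as the other monitors.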

Refer to:
https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-overview
https://learn.microsoft.com/en-us/azure/azure-monitor/alerts/alerts-overview
