getCategoryData False Fault Alarm Process

Summary

The getCategoryData_Fault_Percent_SEV3 alarm fires false positives during traffic valleys on the ApiGateway service. The root cause is not a service failure --- it is a combination of client disconnects during response serialization and a percentage-based alarm threshold that becomes too sensitive at low traffic volumes.

Fix: Increase the alarm threshold for getCategoryData_Fault_Percent_SEV3 to accommodate the baseline client-abort fault rate during traffic valleys.


Root Cause Analysis

Architecture

This service is an API gateway that uses two different communication layers:

  • Tomcat + ARest + RestEasy (inbound) --- serves the public-facing REST API to external clients (H5 web app). Tomcat handles HTTP connections, ARest provides querylog/metrics, RestEasy routes requests to controller methods.

  • Coral client (outbound) --- makes internal RPC calls to downstream services. The controller bridges the two: receives HTTP via Tomcat, calls downstream via Coral, returns the result as HTTP.

    H5 Web App
    ↓ (HTTPS --- Tomcat inbound)
    Tomcat → Filters → RestEasy → Guice Interceptors → Controller
    ↓ (Coral RPC --- outbound)
    Coral Client → CategoryConfigService → DynamoDB
    ↓ (response flows back)
    Controller → RestEasy → Jackson serialization → Tomcat → H5 Web App

In the bug scenario, the Coral outbound call succeeded but the Tomcat inbound response write failed (client disconnected during Jackson serialization).
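
A minimal sketch of that bridge, reconstructed from the identifiers in the evidence logs below (the JAX-RS annotations, proxy signature, and logger are assumptions, not the real source):

import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Path("/gateway/config")
public class ConfigureServiceController {

    private static final Logger log = LoggerFactory.getLogger(ConfigureServiceController.class);

    @Inject
    private ConfigureServiceProxy proxy; // wraps the outbound Coral client

    @GET
    @Path("/category")
    @Produces(MediaType.APPLICATION_JSON)
    public Response getCategoryData(@QueryParam("language") String language,
                                    @QueryParam("marketplaceId") String marketplaceId) {
        try {
            // Outbound leg: Coral RPC to CategoryConfigService (backed by DynamoDB).
            GetCategoryDataOutput response = proxy.getCategoryData(language, marketplaceId);
            // Returning only *stages* the entity. Jackson serialization and the
            // Tomcat socket write happen later, after this method (and the
            // metrics interceptor wrapped around it) have already finished.
            return Response.ok(response).build();
        } catch (Exception e) {
            // Never entered in the bug scenario: ClientAbortException is thrown
            // in the servlet serialization layer, outside this method's frame.
            log.error("Fail to getCategoryData", e);
            return Response.serverError().build();
        }
    }
}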

Request Lifecycle

┌─────────────────────────────────────────────────────────────────┐
│                    Tomcat Servlet Container                      │
│                                                                  │
│  1. HTTP Request arrives from H5 client                         │
│  2. ARestQuerylogFilter --- starts querylog timer                 │
│  3. AuthenticationFilter --- validates access token               │
│  4. RateLimitFilter --- rate limiting check                       │
│  5. RestEasy routes to ConfigureServiceController               │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Guice @WithMetrics / @DimCounts Interceptor              │  │
│  │  6. BEFORE: start metrics tracking                        │  │
│  │  7. Controller executes:                                  │  │
│  │     → Coral client calls CategoryConfigService            │  │
│  │     → Downstream queries DynamoDB (1215 items)            │  │
│  │     → Controller returns Response.ok(response)            │  │
│  │  8. AFTER: method returned without throwing               │  │
│  │     → emits getCategoryDataSuccess=1  ← SUCCESS METRIC   │  │
│  └───────────────────────────────────────────────────────────┘  │
│  9. RestEasy/Jackson serializes Response to HTTP body           │
│     → IndexedListSerializer serializes 1215 items               │
│     → Tomcat OutputBuffer writes bytes to client socket         │
│     ╔═══════════════════════════════════════════════════╗       │
│     ║  CLIENT DISCONNECTS (Connection reset by peer)    ║       │
│     ╚═══════════════════════════════════════════════════╝       │
│     → ClientAbortException → servlet records 5xx               │
│  10. ARestQuerylogFilter completes                              │
│      → sees 5xx → writes Fault=1  ← FAULT METRIC              │
│                                                                  │
│  Result: Fault=1 AND getCategoryDataSuccess=1 on same entry    │
└─────────────────────────────────────────────────────────────────┘

The gap between step 8 (Guice interceptor emits success) and step 9 (serialization fails) is why both metrics appear on the same request.
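
On the interceptor side, that gap is structural: a Guice AOP interceptor can only observe the method frame it wraps. A sketch of the shape involved, assuming the internal @DimCounts interceptor follows the standard MethodInterceptor pattern (the metrics API here is a hypothetical stand-in):

import org.aopalliance.intercept.MethodInterceptor;
import org.aopalliance.intercept.MethodInvocation;

public class DimCountsInterceptor implements MethodInterceptor {

    /** Hypothetical stand-in for the real metrics client. */
    public interface Metrics {
        void count(String name, int value);
    }

    private final Metrics metrics;

    public DimCountsInterceptor(Metrics metrics) {
        this.metrics = metrics;
    }

    @Override
    public Object invoke(MethodInvocation invocation) throws Throwable {
        try {
            Object result = invocation.proceed(); // step 7: the controller body runs
            // Step 8: the method returned without throwing, so from this frame
            // the call is a success. Serialization (step 9) has not started yet
            // and can never be observed from here; hence getCategoryDataSuccess=1
            // even on requests that later fail to reach the client.
            metrics.count(invocation.getMethod().getName() + "Success", 1);
            return result;
        } catch (Throwable t) {
            // Only reached when the controller itself throws (genuine failure path).
            metrics.count(invocation.getMethod().getName() + "Failure", 1);
            throw t;
        }
    }
}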

What Happens

  1. H5 client calls GET /gateway/config/category?language=zh-CN with marketplaceId=MP_001
  2. Gateway's ConfigureServiceController.getCategoryData() calls ConfigureServiceProxy, which calls CategoryConfigService
  3. Downstream queries DynamoDB, loads 1215 category items with nested subcategories, returns successfully (~95ms)
  4. Controller receives the response, returns Response.ok(response) --- the @DimCounts Guice interceptor fires getCategoryDataSuccess=1
  5. Servlet/Jackson begins serializing the 1215-item response back to the H5 client
  6. Client disconnects mid-serialization (user navigated away, closed tab, or network interruption)
  7. Servlet throws ClientAbortException: java.io.IOException: Connection reset by peer
  8. RestEasy throws UnhandledException: Response is committed, can't handle exception
  9. Servlet container records 5xx → ARest ARestQuerylogFilter writes Fault=1 (see the filter sketch after this list)
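
The querylog filter sits outside the whole pipeline, which is why it sees the failure the Guice interceptor cannot. A sketch of the classification logic, assuming ARestQuerylogFilter behaves like a typical status-recording servlet filter (the real implementation is internal):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

public class QuerylogStatusFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        long start = System.nanoTime();
        int fault = 0;
        try {
            // Runs the entire downstream pipeline, including RestEasy routing,
            // the controller, Jackson serialization, and the Tomcat socket write.
            chain.doFilter(req, res);
            if (((HttpServletResponse) res).getStatus() >= 500) {
                fault = 1; // container recorded a 5xx (e.g. from ClientAbortException)
            }
        } catch (IOException | ServletException | RuntimeException e) {
            fault = 1; // e.g. UnhandledException propagating from a mid-write abort
            throw e;
        } finally {
            // Written to service_log; this is the Fault=1 the alarm aggregates.
            System.out.printf("Time=%.6f ms%nCounters=Error=0,Fault=%d%n",
                    (System.nanoTime() - start) / 1e6, fault);
        }
    }

    @Override public void init(FilterConfig cfg) { }
    @Override public void destroy() { }
}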

Why the Alarm Fires During Traffic Valleys

The absolute number of client-abort faults is small and roughly constant (users navigating away is normal behavior). During peak traffic, these faults are a negligible percentage:

Period   Total Requests   Client-Abort Faults   Fault Rate   Alarm (>1%)
Peak     1000             3                     0.3%         No
Valley   100              3                     3.0%         Yes

The fault rate spikes during valleys because the denominator shrinks, not because faults increase.
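
That denominator effect also gives the sizing rule for the fix: the threshold must clear the baseline abort count divided by the smallest expected valley traffic, with margin. A quick check using the illustrative numbers from the table above:

public class ThresholdCheck {
    public static void main(String[] args) {
        int baselineAborts = 3;     // roughly constant; independent of traffic
        int peakRequests = 1000;
        int valleyRequests = 100;

        // Fault rate = aborts / requests; only the denominator changes.
        System.out.printf("peak:   %.1f%%%n", 100.0 * baselineAborts / peakRequests);   // 0.3%
        System.out.printf("valley: %.1f%%%n", 100.0 * baselineAborts / valleyRequests); // 3.0%

        // Any threshold at or below 3% fires at the valley floor with zero real
        // faults; it must sit above the valley baseline yet stay far below the
        // rate a genuine outage would produce.
    }
}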

Why getCategoryDataSuccess=1 AND Fault=1 Appear on the Same Request

These two metrics are emitted at different layers of the request lifecycle:

  • getCategoryDataSuccess=1: emitted by the @DimCounts / @WithMetrics Guice interceptor when the controller method returns. What it sees: the method returned Response.ok() without throwing → Applies.SUCCESS.

  • Fault=1: emitted by the ARest ARestQuerylogFilter (servlet layer) after the HTTP response is fully written. What it sees: the servlet container recorded a 5xx from ClientAbortException.

Both are correct from their own perspective: the method succeeded, but the HTTP response delivery failed. The controller's catch (Exception e) block is never entered because the exception occurs in the servlet serialization layer, after the controller method has already returned.
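
The lazy-entity behavior behind this is easy to confirm: a JAX-RS Response built in the controller holds the entity by reference, and nothing is serialized until the runtime streams it. A small demonstration (requires a JAX-RS implementation such as RestEasy on the classpath, since Response.ok delegates to it):

import java.util.List;
import javax.ws.rs.core.Response;

public class LazyEntityDemo {
    public static void main(String[] args) {
        List<String> categories = List.of("cat-1", "cat-2"); // stand-in for the 1215 items

        // build() stores the entity object by reference; no JSON exists yet.
        Response response = Response.ok(categories).build();

        // Jackson runs only later, when a MessageBodyWriter streams the entity
        // to the client socket; that is where ClientAbortException can surface.
        System.out.println(response.getEntity() == categories); // true
    }
}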

What Was Ruled Out

  • Missing parameters / NPE: The request has valid marketplaceId and language
  • Downstream failure: Downstream service completed successfully (DynamoDB returned 200, 1215 items loaded, no errors)
  • Controller exception: "Fail to getCategoryData" never appears in application.log --- the catch block is never hit
  • H5 timeout: H5 timeout is 10 seconds; the request completes in ~125ms --- well within the limit

Evidence Logs

Gateway service_log --- Fault request

Operation=ConfigureServiceController.getCategoryData
Time=124.850746 ms
Counters=Error=0,Fault=1
Metrics=getCategoryDataSuccess=1

Fault=1 AND getCategoryDataSuccess=1 on the same request --- controller succeeded, HTTP response delivery failed.

Gateway service_log --- Normal request

Operation=ConfigureServiceController.getCategoryData
Time=136.968727 ms
Counters=Error=0,Fault=0
Metrics=getCategoryDataSuccess=1

Fault=0 --- same endpoint, client stayed connected.

Gateway application.log --- ClientAbortException

[ERROR] org.apache.catalina.core.ContainerBase.[Tomcat].[localhost]:
  Servlet.service() for servlet threw exception

org.jboss.resteasy.spi.UnhandledException: Response is committed, can't handle exception
  at org.jboss.resteasy.core.SynchronousDispatcher.writeException(...)
  ...
Caused by: org.apache.catalina.connector.ClientAbortException:
  java.io.IOException: Connection reset by peer
  at org.apache.catalina.connector.OutputBuffer.realWriteBytes(...)
  ...
  at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(...)
  at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(...)

Exception occurs in Jackson IndexedListSerializer --- client disconnected while serializing the category item list. No mention of getCategoryData in the stack trace; correlation is by timestamp only.

Downstream CategoryConfigService log

GetCategoryData input:GetCategoryDataInput(language=zh-CN, marketplaceId=MP_001)
Received successful response: 200
getCategoryList load 1215 items from ddb

Downstream completed successfully --- no errors.


Bug Analysis

Current Behavior (Defect)

1.1 WHEN a client disconnects while Jackson/the servlet container is serializing a large getCategoryData response THEN the servlet container throws ClientAbortException, records a 5xx HTTP status, and the ARest ARestQuerylogFilter increments Fault=1 --- even though the controller returned Response.ok(response) successfully and getCategoryDataSuccess=1 was emitted

1.2 WHEN the ClientAbortException causes a 5xx status THEN RestEasy throws UnhandledException: Response is committed, can't handle exception because the response was already partially written to the client

1.3 WHEN traffic volume drops during valley periods THEN the small constant number of client-abort faults becomes a larger percentage of total requests, causing the fault rate to exceed the 1% alarm threshold and triggering getCategoryData_Fault_Percent_SEV3 --- even though the absolute number of faults has not increased and every request was processed successfully

1.4 WHEN the getCategoryData response payload is large (1215 category items with nested subcategories) THEN the increased serialization time widens the window for client disconnects, contributing to the constant baseline of client-abort faults

Expected Behavior (Correct)

2.1 WHEN client disconnects cause a small constant number of Fault=1 counts during traffic valleys THEN the alarm threshold SHALL be set high enough to accommodate the baseline client-abort fault rate, so that getCategoryData_Fault_Percent_SEV3 does NOT fire due to traffic volume fluctuations alone (a possible implementation is sketched after this list)

2.2 WHEN an actual service outage or downstream failure causes a genuine spike in fault rate above the adjusted threshold THEN the alarm SHALL still fire to alert the team

2.3 WHEN client disconnects occur during getCategoryData response serialization THEN the alarm SHALL NOT fire, because client aborts during traffic valleys are not indicative of service health issues
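
The document does not name the alarming system, so treat the following purely as an assumption: if the alarm is a CloudWatch metric alarm, requirements 2.1 and 2.2 could be implemented roughly as below (AWS SDK for Java v2; the namespace, metric name, threshold value, and periods are illustrative, not taken from the real configuration):

import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.ComparisonOperator;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricAlarmRequest;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;

public class AdjustFaultAlarm {
    public static void main(String[] args) {
        try (CloudWatchClient cw = CloudWatchClient.create()) {
            cw.putMetricAlarm(PutMetricAlarmRequest.builder()
                    .alarmName("getCategoryData_Fault_Percent_SEV3")
                    .namespace("ApiGateway")                     // illustrative
                    .metricName("getCategoryData_Fault_Percent") // illustrative
                    .statistic(Statistic.AVERAGE)
                    .period(300)                                 // 5-minute windows
                    // 2.1: sits above the ~3% valley baseline with margin
                    .threshold(5.0)
                    .comparisonOperator(ComparisonOperator.GREATER_THAN_THRESHOLD)
                    // 2.1: require sustained breaches so a single sparse
                    // valley datapoint cannot page anyone
                    .evaluationPeriods(3)
                    .datapointsToAlarm(3)
                    // 2.2: a genuine outage pushes the rate far above the
                    // threshold across consecutive periods, so it still fires
                    .build());
        }
    }
}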

Unchanged Behavior (Regression Prevention)

3.1 WHEN getCategoryData is called with valid parameters and the client remains connected THEN the system SHALL CONTINUE TO return HTTP 200 with the GetCategoryDataOutput response and increment getCategoryDataSuccess

3.2 WHEN the downstream service throws an actual exception THEN the system SHALL CONTINUE TO catch it, log the error, return HTTP 500, and increment getCategoryDataFailure --- these are genuine faults that should be counted

3.3 WHEN getCategoryData succeeds THEN the system SHALL CONTINUE TO track getCategoryDataSuccess with the marketplaceId dimension

3.4 WHEN other controller endpoints are called THEN they SHALL CONTINUE TO behave as they currently do

3.5 WHEN getCategoryDataFailure is incremented due to actual downstream failures THEN the alarm SHALL CONTINUE TO fire when the fault rate exceeds the adjusted threshold --- only the threshold level changes, not the alarm mechanism


Debugging Notes

How to investigate getCategoryData_Fault_Percent alarms in the future:

Step 1: Check service_log (querylog) first

From the log aggregation system, decompress and search the service_log for the alarm time window:

# Decompress and extract all getCategoryData querylog entries
zstd -dc service_log.<date>-<hour>.<host>-* | grep -B 3 -A 7 "getCategoryData" > /tmp/getCategoryData.csv

# Then filter for fault entries
grep "Fault=1" /tmp/getCategoryData.csv

Key fields in each querylog block (a triage sketch follows this list):

  • Counters=Fault=1 --- confirms a fault was recorded
  • Metrics=getCategoryDataSuccess=1 alongside Fault=1 → client abort (controller succeeded, HTTP delivery failed)
  • Metrics=getCategoryDataFailure=1 → genuine downstream failure (controller catch block was hit)
  • StartTime --- Unix epoch timestamp for correlating with application.log
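
Those two Metrics lines are the whole triage decision, so it can be scripted. A sketch that classifies the blocks extracted in Step 1 (the block format is taken from the evidence logs above; adjust the input path if your extraction differs):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class FaultTriage {
    public static void main(String[] args) throws Exception {
        // Input: the file produced by the Step 1 extraction.
        List<String> lines = Files.readAllLines(Path.of("/tmp/getCategoryData.csv"));
        boolean fault = false, success = false, failure = false;
        for (String line : lines) {
            if (line.startsWith("Operation=")) { // a new querylog block begins
                classify(fault, success, failure);
                fault = success = failure = false;
            }
            fault   |= line.contains("Fault=1");
            success |= line.contains("getCategoryDataSuccess=1");
            failure |= line.contains("getCategoryDataFailure=1");
        }
        classify(fault, success, failure); // flush the last block
    }

    static void classify(boolean fault, boolean success, boolean failure) {
        if (fault && success) {
            System.out.println("client abort -> check application.log for ClientAbortException");
        } else if (fault && failure) {
            System.out.println("genuine failure -> grep 'Fail to getCategoryData'");
        }
    }
}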

Step 2: Check application.log for the matching timestamp

If getCategoryDataSuccess=1 AND Fault=1 (client abort):

grep "ClientAbortException\|Connection reset by peer" application.log.* | grep "<timestamp>"

The ClientAbortException won't mention getCategoryData --- match by timestamp only.

If getCategoryDataFailure=1 (genuine failure):

grep "Fail to getCategoryData" application.log.* | grep "<timestamp>"

Step 3: Check downstream service logs

Same timestamp window. Look for:

  • GetCategoryData input: --- request received
  • getCategoryList load N items from ddb --- DynamoDB succeeded
  • getCategoryData error: --- downstream internal error (returns success response with error code in body, does not throw)

Traffic Valley Pattern

If the alarm fires during low-traffic periods but the absolute Fault=1 count is small and constant, the issue is the percentage-based threshold being too sensitive for the traffic volume. Compare fault counts during alarm windows vs normal windows --- if the absolute count is similar, the alarm is triggered by the traffic valley, not by service degradation.
