Summary
The getCategoryData_Fault_Percent_SEV3 alarm fires false positives during traffic valleys on the ApiGateway service. The root cause is not a service failure --- it is a combination of client disconnects during response serialization and a percentage-based alarm threshold that becomes too sensitive at low traffic volumes.
Fix: Increase the alarm threshold for getCategoryData_Fault_Percent_SEV3 to accommodate the baseline client-abort fault rate during traffic valleys.
Root Cause Analysis
Architecture
This service is an API gateway that uses two different communication layers:
- Tomcat + ARest + RestEasy (inbound) --- serves the public-facing REST API to external clients (H5 web app). Tomcat handles HTTP connections, ARest provides querylog/metrics, RestEasy routes requests to controller methods.
- Coral client (outbound) --- makes internal RPC calls to downstream services. The controller bridges the two: receives HTTP via Tomcat, calls downstream via Coral, returns the result as HTTP.
```
H5 Web App
    ↓ (HTTPS --- Tomcat inbound)
Tomcat → Filters → RestEasy → Guice Interceptors → Controller
    ↓ (Coral RPC --- outbound)
Coral Client → CategoryConfigService → DynamoDB
    ↓ (response flows back)
Controller → RestEasy → Jackson serialization → Tomcat → H5 Web App
```
In the bug scenario, the Coral outbound call succeeded but the Tomcat inbound response write failed (client disconnected during Jackson serialization).
Request Lifecycle
```
┌─────────────────────────────────────────────────────────────────┐
│ Tomcat Servlet Container │
│ │
│ 1. HTTP Request arrives from H5 client │
│ 2. ARestQuerylogFilter --- starts querylog timer │
│ 3. AuthenticationFilter --- validates access token │
│ 4. RateLimitFilter --- rate limiting check │
│ 5. RestEasy routes to ConfigureServiceController │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Guice @WithMetrics / @DimCounts Interceptor │ │
│ │ 6. BEFORE: start metrics tracking │ │
│ │ 7. Controller executes: │ │
│ │ → Coral client calls CategoryConfigService │ │
│ │ → Downstream queries DynamoDB (1215 items) │ │
│ │ → Controller returns Response.ok(response) │ │
│ │ 8. AFTER: method returned without throwing │ │
│ │ → emits getCategoryDataSuccess=1 ← SUCCESS METRIC │ │
│ └───────────────────────────────────────────────────────────┘ │
│ 9. RestEasy/Jackson serializes Response to HTTP body │
│ → IndexedListSerializer serializes 1215 items │
│ → Tomcat OutputBuffer writes bytes to client socket │
│ ╔═══════════════════════════════════════════════════╗ │
│ ║ CLIENT DISCONNECTS (Connection reset by peer) ║ │
│ ╚═══════════════════════════════════════════════════╝ │
│ → ClientAbortException → servlet records 5xx │
│ 10. ARestQuerylogFilter completes │
│ → sees 5xx → writes Fault=1 ← FAULT METRIC │
│ │
│ Result: Fault=1 AND getCategoryDataSuccess=1 on same entry │
└─────────────────────────────────────────────────────────────────┘
```
The gap between step 8 (Guice interceptor emits success) and step 9 (serialization fails) is why both metrics appear on the same request.
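The gap can be reproduced in miniature. The sketch below is a hypothetical simulation of the two metric layers (method and metric names mirror this document; none of it is the real ARest or Guice code): success is emitted the moment the controller method returns, and only the later serialization step can record the fault.

```java
import java.util.HashMap;
import java.util.Map;

public class MetricGapDemo {
    // Returns the metrics recorded for one simulated request,
    // e.g. {getCategoryDataSuccess=1, Fault=1} when the client aborts.
    static Map<String, Integer> handleRequest(boolean clientDisconnects) {
        Map<String, Integer> metrics = new HashMap<>();
        Object response = "1215 category items";     // step 7: controller body succeeds
        metrics.put("getCategoryDataSuccess", 1);    // step 8: interceptor fires on normal return
        try {
            serialize(response, clientDisconnects);  // step 9: runs after the interceptor
            metrics.put("Fault", 0);
        } catch (RuntimeException clientAbort) {
            metrics.put("Fault", 1);                 // step 10: querylog filter sees the 5xx
        }
        return metrics;
    }

    // Stand-in for Jackson/Tomcat writing the response to the client socket.
    static void serialize(Object response, boolean clientDisconnects) {
        if (clientDisconnects) {
            throw new RuntimeException("ClientAbortException: Connection reset by peer");
        }
    }
}
```

In the disconnect case the map contains both Fault=1 and getCategoryDataSuccess=1, matching the querylog evidence.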
What Happens
- H5 client calls `GET /gateway/config/category?language=zh-CN` with `marketplaceId=MP_001`
- Gateway's `ConfigureServiceController.getCategoryData()` calls `ConfigureServiceProxy`, which calls `CategoryConfigService`
- Downstream queries DynamoDB, loads 1215 category items with nested subcategories, returns successfully (~95ms)
- Controller receives the response, returns `Response.ok(response)` --- the `@DimCounts` Guice interceptor fires `getCategoryDataSuccess=1`
- Servlet/Jackson begins serializing the 1215-item response back to the H5 client
- Client disconnects mid-serialization (user navigated away, closed tab, or network interruption)
- Servlet throws `ClientAbortException: java.io.IOException: Connection reset by peer`
- RestEasy throws `UnhandledException: Response is committed, can't handle exception`
- Servlet container records 5xx → ARest `ARestQuerylogFilter` writes `Fault=1`
Why the Alarm Fires During Traffic Valleys
The absolute number of client-abort faults is small and roughly constant (users navigating away is normal behavior). During peak traffic, these faults are a negligible percentage:
| Period | Total Requests | Client-Abort Faults | Fault Rate | Alarm (>1%) |
|---|---|---|---|---|
| Peak | 1000 | 3 | 0.3% | No |
| Valley | 100 | 3 | 3.0% | Yes |
The fault rate spikes during valleys because the denominator shrinks, not because faults increase.
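The table's arithmetic reduces to a single comparison. A minimal sketch (the 1% threshold comes from the alarm's name; the counts are the table's):

```java
public class FaultRateDemo {
    // Current getCategoryData_Fault_Percent_SEV3 threshold, in percent.
    static final double ALARM_THRESHOLD_PCT = 1.0;

    static double faultRatePct(int faults, int totalRequests) {
        return 100.0 * faults / totalRequests;
    }

    // The alarm fires when the fault percentage exceeds the threshold.
    static boolean alarmFires(int faults, int totalRequests) {
        return faultRatePct(faults, totalRequests) > ALARM_THRESHOLD_PCT;
    }
}
```

With a constant 3 faults, 1000 requests give 0.3% (no alarm) while 100 requests give 3.0% (alarm) --- the denominator alone flips the outcome.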
Why getCategoryDataSuccess=1 AND Fault=1 Appear on the Same Request
These two metrics are emitted at different layers of the request lifecycle:
| Metric | Layer | When Emitted | What It Sees |
|---|---|---|---|
| `getCategoryDataSuccess=1` | `@DimCounts` / `@WithMetrics` Guice interceptor | When controller method returns | Method returned `Response.ok()` without throwing → `Applies.SUCCESS` |
| `Fault=1` | ARest `ARestQuerylogFilter` (servlet layer) | After HTTP response is fully written | Servlet container recorded 5xx from `ClientAbortException` |
Both are correct from their own perspective: the method succeeded, but the HTTP response delivery failed. The controller's catch (Exception e) block is never entered because the exception occurs in the servlet serialization layer, after the controller method has already returned.
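Why the catch block cannot help can be shown with a simplified stand-in (hypothetical code, not the actual controller): the try/catch scope ends when the method returns, and the abort happens afterwards in the delivery layer.

```java
public class CatchScopeDemo {
    static final StringBuilder appLog = new StringBuilder();

    // Stand-in for the controller: its catch only covers its own body.
    static String controller() {
        try {
            return "Response.ok(response)";            // returns normally; nothing thrown here
        } catch (RuntimeException e) {
            appLog.append("Fail to getCategoryData");  // the log line that never appears
            return "Response.serverError()";
        }
    }

    // Stand-in for the servlet layer: runs after controller() has already returned.
    // Returns false when a client abort forces a 5xx (Fault=1).
    static boolean deliver(boolean clientDisconnects) {
        String response = controller();  // the catch scope is closed by this point
        return !clientDisconnects;       // disconnect → ClientAbortException → 5xx
    }
}
```

Even when delivery fails, the "Fail to getCategoryData" log line is never written, because the controller's catch block was never in scope.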
What Was Ruled Out
- Missing parameters / NPE: the request has valid `marketplaceId` and `language`
- Downstream failure: the downstream service completed successfully (DynamoDB returned 200, 1215 items loaded, no errors)
- Controller exception: `"Fail to getCategoryData"` never appears in application.log --- the catch block is never hit
- H5 timeout: the H5 timeout is 10 seconds; the request completes in ~125ms --- well within the limit
Evidence Logs
Gateway service_log --- Fault request
```
Operation=ConfigureServiceController.getCategoryData
Time=124.850746 ms
Counters=Error=0,Fault=1
Metrics=getCategoryDataSuccess=1
```
Fault=1 AND getCategoryDataSuccess=1 on the same request --- controller succeeded, HTTP response delivery failed.
Gateway service_log --- Normal request
```
Operation=ConfigureServiceController.getCategoryData
Time=136.968727 ms
Counters=Error=0,Fault=0
Metrics=getCategoryDataSuccess=1
```
Fault=0 --- same endpoint, client stayed connected.
Gateway application.log --- ClientAbortException
```
[ERROR] org.apache.catalina.core.ContainerBase.[Tomcat].[localhost]:
Servlet.service() for servlet threw exception
org.jboss.resteasy.spi.UnhandledException: Response is committed, can't handle exception
    at org.jboss.resteasy.core.SynchronousDispatcher.writeException(...)
    ...
Caused by: org.apache.catalina.connector.ClientAbortException:
java.io.IOException: Connection reset by peer
    at org.apache.catalina.connector.OutputBuffer.realWriteBytes(...)
    ...
    at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(...)
    at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(...)
```
The exception surfaces in Jackson's IndexedListSerializer --- the client disconnected while the category item list was being serialized. The stack trace never mentions getCategoryData; correlation with the querylog entry is by timestamp only.
Downstream CategoryConfigService log
```
GetCategoryData input:GetCategoryDataInput(language=zh-CN, marketplaceId=MP_001)
Received successful response: 200
getCategoryList load 1215 items from ddb
```
Downstream completed successfully --- no errors.
Bug Analysis
Current Behavior (Defect)
1.1 WHEN a client disconnects while Jackson/the servlet container is serializing a large getCategoryData response THEN the servlet container throws ClientAbortException, records a 5xx HTTP status, and the ARest ARestQuerylogFilter increments Fault=1 --- even though the controller returned Response.ok(response) successfully and getCategoryDataSuccess=1 was emitted
1.2 WHEN the ClientAbortException causes a 5xx status THEN RestEasy throws UnhandledException: Response is committed, can't handle exception because the response was already partially written to the client
1.3 WHEN traffic volume drops during valley periods THEN the small constant number of client-abort faults becomes a larger percentage of total requests, causing the fault rate to exceed the 1% alarm threshold and triggering getCategoryData_Fault_Percent_SEV3 --- even though the absolute number of faults has not increased and every request was processed successfully
1.4 WHEN the getCategoryData response payload is large (1215 category items with nested subcategories) THEN the increased serialization time widens the window for client disconnects, contributing to the constant baseline of client-abort faults
Expected Behavior (Correct)
2.1 WHEN client disconnects cause a small constant number of Fault=1 counts during traffic valleys THEN the alarm threshold SHALL be set high enough to accommodate the baseline client-abort fault rate, so that getCategoryData_Fault_Percent_SEV3 does NOT fire due to traffic volume fluctuations alone
2.2 WHEN an actual service outage or downstream failure causes a genuine spike in fault rate above the adjusted threshold THEN the alarm SHALL still fire to alert the team
2.3 WHEN client disconnects occur during getCategoryData response serialization THEN the alarm SHALL NOT fire, because client aborts during traffic valleys are not indicative of service health issues
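One plausible recipe for sizing the adjusted threshold (a sketch --- the 1.5x safety factor and treating 100 requests as the valley floor are assumptions for illustration, not measured values): take the worst-case baseline abort rate at the valley traffic floor and add headroom, so that 2.1/2.3 hold while a genuine spike (2.2) still crosses the line.

```java
public class ThresholdSizing {
    // The worst-case baseline client-abort percentage occurs at the valley traffic
    // floor, where the denominator is smallest; a safety factor keeps normal
    // jitter below the line while leaving genuine outages well above it.
    static double suggestedThresholdPct(int baselineAborts, int valleyFloorRequests, double safetyFactor) {
        double worstBaselinePct = 100.0 * baselineAborts / valleyFloorRequests;
        return worstBaselinePct * safetyFactor;
    }
}
```

With this document's numbers (3 aborts, 100-request valleys) and a 1.5x margin, this suggests roughly 4.5% --- high enough that valleys alone cannot fire the alarm, low enough that an outage pushing the rate to double digits still does.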
Unchanged Behavior (Regression Prevention)
3.1 WHEN getCategoryData is called with valid parameters and the client remains connected THEN the system SHALL CONTINUE TO return HTTP 200 with the GetCategoryDataOutput response and increment getCategoryDataSuccess
3.2 WHEN the downstream service throws an actual exception THEN the system SHALL CONTINUE TO catch it, log the error, return HTTP 500, and increment getCategoryDataFailure --- these are genuine faults that should be counted
3.3 WHEN getCategoryData succeeds THEN the system SHALL CONTINUE TO track getCategoryDataSuccess with the marketplaceId dimension
3.4 WHEN other controller endpoints are called THEN they SHALL CONTINUE TO behave as they currently do
3.5 WHEN getCategoryDataFailure is incremented due to actual downstream failures THEN the alarm SHALL CONTINUE TO fire when the fault rate exceeds the adjusted threshold --- only the threshold level changes, not the alarm mechanism
Debugging Notes
How to investigate getCategoryData_Fault_Percent alarms in the future:
Step 1: Check service_log (querylog) first
From the log aggregation system, decompress and search the service_log for the alarm time window:
```bash
# Decompress and extract all getCategoryData querylog entries
zstd -dc service_log.<date>-<hour>.<host>-* | grep -B 3 -A 7 "getCategoryData" > /tmp/getCategoryData.csv

# Then filter for fault entries
grep "Fault=1" /tmp/getCategoryData.csv
```
Key fields in each querylog block:
- `Counters=Fault=1` --- confirms a fault was recorded
- `Metrics=getCategoryDataSuccess=1` alongside `Fault=1` → client abort (controller succeeded, HTTP delivery failed)
- `Metrics=getCategoryDataFailure=1` → genuine downstream failure (controller catch block was hit)
- `StartTime` --- Unix epoch timestamp for correlating with application.log
Step 2: Check application.log for the matching timestamp
If getCategoryDataSuccess=1 AND Fault=1 (client abort):
```bash
grep "ClientAbortException\|Connection reset by peer" application.log.* | grep "<timestamp>"
```
The ClientAbortException won't mention getCategoryData --- match by timestamp only.
If getCategoryDataFailure=1 (genuine failure):
```bash
grep "Fail to getCategoryData" application.log.* | grep "<timestamp>"
```
Step 3: Check downstream service logs
Same timestamp window. Look for:
- `GetCategoryData input:` --- request received
- `getCategoryList load N items from ddb` --- DynamoDB succeeded
- `getCategoryData error:` --- downstream internal error (returns a success response with an error code in the body; does not throw)
Traffic Valley Pattern
If the alarm fires during low-traffic periods but the absolute Fault=1 count is small and constant, the issue is the percentage-based threshold being too sensitive for the traffic volume. Compare fault counts during alarm windows vs normal windows --- if the absolute count is similar, the alarm is triggered by the traffic valley, not by service degradation.
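The comparison described above can be written as a small triage helper (hypothetical; the tolerance parameter is an assumption):

```java
public class ValleyTriage {
    // True when the alarm window's absolute fault count is roughly the same as a
    // normal window's --- i.e. the alarm is a shrinking-denominator artifact,
    // not service degradation.
    static boolean isTrafficValleyArtifact(int faultsInAlarmWindow, int faultsInNormalWindow, int tolerance) {
        return Math.abs(faultsInAlarmWindow - faultsInNormalWindow) <= tolerance;
    }
}
```

A count of 3 faults in both windows points at the valley; a jump from 3 to 30 points at real degradation.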