Summary
The getCategoryData_Fault_Percent_SEV3 alarm fires false positives during traffic valleys on the ApiGateway service. The root cause is not a service failure --- it is a combination of client disconnects during response serialization and a percentage-based alarm threshold that becomes too sensitive at low traffic volumes.
Fix: Increase the alarm threshold for getCategoryData_Fault_Percent_SEV3 to accommodate the baseline client-abort fault rate during traffic valleys.
Root Cause Analysis
Architecture
This service is an API gateway that uses two different communication layers:
- Tomcat + ARest + RestEasy (inbound) --- serves the public-facing REST API to external clients (H5 web app). Tomcat handles HTTP connections, ARest provides querylog/metrics, RestEasy routes requests to controller methods.
- Coral client (outbound) --- makes internal RPC calls to downstream services. The controller bridges the two: receives HTTP via Tomcat, calls downstream via Coral, returns the result as HTTP.
```
H5 Web App
    ↓ (HTTPS --- Tomcat inbound)
Tomcat → Filters → RestEasy → Guice Interceptors → Controller
    ↓ (Coral RPC --- outbound)
Coral Client → CategoryConfigService → DynamoDB
    ↓ (response flows back)
Controller → RestEasy → Jackson serialization → Tomcat → H5 Web App
```
In the bug scenario, the Coral outbound call succeeded but the Tomcat inbound response write failed (client disconnected during Jackson serialization).
Request Lifecycle
```
┌─────────────────────────────────────────────────────────────────┐
│ Tomcat Servlet Container │
│ │
│ 1. HTTP Request arrives from H5 client │
│ 2. ARestQuerylogFilter --- starts querylog timer │
│ 3. AuthenticationFilter --- validates access token │
│ 4. RateLimitFilter --- rate limiting check │
│ 5. RestEasy routes to ConfigureServiceController │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Guice @WithMetrics / @DimCounts Interceptor │ │
│ │ 6. BEFORE: start metrics tracking │ │
│ │ 7. Controller executes: │ │
│ │ → Coral client calls CategoryConfigService │ │
│ │ → Downstream queries DynamoDB (1215 items) │ │
│ │ → Controller returns Response.ok(response) │ │
│ │ 8. AFTER: method returned without throwing │ │
│ │ → emits getCategoryDataSuccess=1 ← SUCCESS METRIC │ │
│ └───────────────────────────────────────────────────────────┘ │
│ 9. RestEasy/Jackson serializes Response to HTTP body │
│ → IndexedListSerializer serializes 1215 items │
│ → Tomcat OutputBuffer writes bytes to client socket │
│ ╔═══════════════════════════════════════════════════╗ │
│ ║ CLIENT DISCONNECTS (Connection reset by peer) ║ │
│ ╚═══════════════════════════════════════════════════╝ │
│ → ClientAbortException → servlet records 5xx │
│ 10. ARestQuerylogFilter completes │
│ → sees 5xx → writes Fault=1 ← FAULT METRIC │
│ │
│ Result: Fault=1 AND getCategoryDataSuccess=1 on same entry │
└─────────────────────────────────────────────────────────────────┘
```
The gap between step 8 (Guice interceptor emits success) and step 9 (serialization fails) is why both metrics appear on the same request.
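The gap can be reproduced in miniature. The sketch below is a hypothetical simulation of the two metric layers (method and metric names mirror this document; none of it is the real ARest or Guice code): success is emitted the moment the controller method returns, and only the later serialization step can record the fault.

```java
import java.util.HashMap;
import java.util.Map;

public class MetricGapDemo {
    // Returns the metrics recorded for one simulated request,
    // e.g. {getCategoryDataSuccess=1, Fault=1} when the client aborts.
    static Map<String, Integer> handleRequest(boolean clientDisconnects) {
        Map<String, Integer> metrics = new HashMap<>();
        Object response = "1215 category items";     // step 7: controller body succeeds
        metrics.put("getCategoryDataSuccess", 1);    // step 8: interceptor fires on normal return
        try {
            serialize(response, clientDisconnects);  // step 9: runs after the interceptor
            metrics.put("Fault", 0);
        } catch (RuntimeException clientAbort) {
            metrics.put("Fault", 1);                 // step 10: querylog filter sees the 5xx
        }
        return metrics;
    }

    // Stand-in for Jackson/Tomcat writing the response to the client socket.
    static void serialize(Object response, boolean clientDisconnects) {
        if (clientDisconnects) {
            throw new RuntimeException("ClientAbortException: Connection reset by peer");
        }
    }
}
```

In the disconnect case the map contains both Fault=1 and getCategoryDataSuccess=1, matching the querylog evidence.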
What Happens
- H5 client calls `GET /gateway/config/category?language=zh-CN` with `marketplaceId=MP_001`
- Gateway's `ConfigureServiceController.getCategoryData()` calls `ConfigureServiceProxy`, which calls `CategoryConfigService`
- Downstream queries DynamoDB, loads 1215 category items with nested subcategories, returns successfully (~95ms)
- Controller receives the response, returns `Response.ok(response)` --- the `@DimCounts` Guice interceptor fires `getCategoryDataSuccess=1`
- Servlet/Jackson begins serializing the 1215-item response back to the H5 client
- Client disconnects mid-serialization (user navigated away, closed tab, or network interruption)
- Servlet throws `ClientAbortException: java.io.IOException: Connection reset by peer`
- RestEasy throws `UnhandledException: Response is committed, can't handle exception`
- Servlet container records 5xx → ARest `ARestQuerylogFilter` writes `Fault=1`
Why the Alarm Fires During Traffic Valleys
The absolute number of client-abort faults is small and roughly constant (users navigating away is normal behavior). During peak traffic, these faults are a negligible percentage:
| Period | Total Requests | Client-Abort Faults | Fault Rate | Alarm (>1%) |
|---|---|---|---|---|
| Peak | 1000 | 3 | 0.3% | No |
| Valley | 100 | 3 | 3.0% | Yes |
The fault rate spikes during valleys because the denominator shrinks, not because faults increase.
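The table's arithmetic reduces to a single comparison. A minimal sketch (the 1% threshold comes from the alarm's name; the counts are the table's):

```java
public class FaultRateDemo {
    // Current getCategoryData_Fault_Percent_SEV3 threshold, in percent.
    static final double ALARM_THRESHOLD_PCT = 1.0;

    static double faultRatePct(int faults, int totalRequests) {
        return 100.0 * faults / totalRequests;
    }

    // The alarm fires when the fault percentage exceeds the threshold.
    static boolean alarmFires(int faults, int totalRequests) {
        return faultRatePct(faults, totalRequests) > ALARM_THRESHOLD_PCT;
    }
}
```

With a constant 3 faults, 1000 requests give 0.3% (no alarm) while 100 requests give 3.0% (alarm) --- the denominator alone flips the outcome.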
Why getCategoryDataSuccess=1 AND Fault=1 Appear on the Same Request
These two metrics are emitted at different layers of the request lifecycle:
| Metric | Layer | When Emitted | What It Sees |
|---|---|---|---|
| `getCategoryDataSuccess=1` | `@DimCounts` / `@WithMetrics` Guice interceptor | When controller method returns | Method returned `Response.ok()` without throwing → `Applies.SUCCESS` |
| `Fault=1` | ARest `ARestQuerylogFilter` (servlet layer) | After HTTP response is fully written | Servlet container recorded 5xx from `ClientAbortException` |
Both are correct from their own perspective: the method succeeded, but the HTTP response delivery failed. The controller's catch (Exception e) block is never entered because the exception occurs in the servlet serialization layer, after the controller method has already returned.
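Why the catch block cannot help can be shown with a simplified stand-in (hypothetical code, not the actual controller): the try/catch scope ends when the method returns, and the abort happens afterwards in the delivery layer.

```java
public class CatchScopeDemo {
    static final StringBuilder appLog = new StringBuilder();

    // Stand-in for the controller: its catch only covers its own body.
    static String controller() {
        try {
            return "Response.ok(response)";            // returns normally; nothing thrown here
        } catch (RuntimeException e) {
            appLog.append("Fail to getCategoryData");  // the log line that never appears
            return "Response.serverError()";
        }
    }

    // Stand-in for the servlet layer: runs after controller() has already returned.
    // Returns false when a client abort forces a 5xx (Fault=1).
    static boolean deliver(boolean clientDisconnects) {
        String response = controller();  // the catch scope is closed by this point
        return !clientDisconnects;       // disconnect → ClientAbortException → 5xx
    }
}
```

Even when delivery fails, the "Fail to getCategoryData" log line is never written, because the controller's catch block was never in scope.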
What Was Ruled Out
- Missing parameters / NPE: the request has valid `marketplaceId` and `language`
- Downstream failure: the downstream service completed successfully (DynamoDB returned 200, 1215 items loaded, no errors)
- Controller exception: `"Fail to getCategoryData"` never appears in application.log --- the catch block is never hit
- H5 timeout: the H5 timeout is 10 seconds; the request completes in ~125ms --- well within the limit
Evidence Logs
Gateway service_log --- Fault request
```
Operation=ConfigureServiceController.getCategoryData
Time=124.850746 ms
Counters=Error=0,Fault=1
Metrics=getCategoryDataSuccess=1
```
Fault=1 AND getCategoryDataSuccess=1 on the same request --- controller succeeded, HTTP response delivery failed.
Gateway service_log --- Normal request
```
Operation=ConfigureServiceController.getCategoryData
Time=136.968727 ms
Counters=Error=0,Fault=0
Metrics=getCategoryDataSuccess=1
```
Fault=0 --- same endpoint, client stayed connected.
Gateway application.log --- ClientAbortException
```
[ERROR] org.apache.catalina.core.ContainerBase.[Tomcat].[localhost]:
Servlet.service() for servlet threw exception
org.jboss.resteasy.spi.UnhandledException: Response is committed, can't handle exception
    at org.jboss.resteasy.core.SynchronousDispatcher.writeException(...)
    ...
Caused by: org.apache.catalina.connector.ClientAbortException:
java.io.IOException: Connection reset by peer
    at org.apache.catalina.connector.OutputBuffer.realWriteBytes(...)
    ...
    at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(...)
    at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(...)
```
The exception surfaces in Jackson's IndexedListSerializer --- the client disconnected while the category item list was being serialized. The stack trace never mentions getCategoryData; correlation with the querylog entry is by timestamp only.
Downstream CategoryConfigService log
```
GetCategoryData input:GetCategoryDataInput(language=zh-CN, marketplaceId=MP_001)
Received successful response: 200
getCategoryList load 1215 items from ddb
```
Downstream completed successfully --- no errors.
Bug Analysis
Current Behavior (Defect)
1.1 WHEN a client disconnects while Jackson/the servlet container is serializing a large getCategoryData response THEN the servlet container throws ClientAbortException, records a 5xx HTTP status, and the ARest ARestQuerylogFilter increments Fault=1 --- even though the controller returned Response.ok(response) successfully and getCategoryDataSuccess=1 was emitted
1.2 WHEN the ClientAbortException causes a 5xx status THEN RestEasy throws UnhandledException: Response is committed, can't handle exception because the response was already partially written to the client
1.3 WHEN traffic volume drops during valley periods THEN the small constant number of client-abort faults becomes a larger percentage of total requests, causing the fault rate to exceed the 1% alarm threshold and triggering getCategoryData_Fault_Percent_SEV3 --- even though the absolute number of faults has not increased and every request was processed successfully
1.4 WHEN the getCategoryData response payload is large (1215 category items with nested subcategories) THEN the increased serialization time widens the window for client disconnects, contributing to the constant baseline of client-abort faults
Expected Behavior (Correct)
2.1 WHEN client disconnects cause a small constant number of Fault=1 counts during traffic valleys THEN the alarm threshold SHALL be set high enough to accommodate the baseline client-abort fault rate, so that getCategoryData_Fault_Percent_SEV3 does NOT fire due to traffic volume fluctuations alone
2.2 WHEN an actual service outage or downstream failure causes a genuine spike in fault rate above the adjusted threshold THEN the alarm SHALL still fire to alert the team
2.3 WHEN client disconnects occur during getCategoryData response serialization THEN the alarm SHALL NOT fire, because client aborts during traffic valleys are not indicative of service health issues
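One plausible recipe for sizing the adjusted threshold (a sketch --- the 1.5x safety factor and treating 100 requests as the valley floor are assumptions for illustration, not measured values): take the worst-case baseline abort rate at the valley traffic floor and add headroom, so that 2.1/2.3 hold while a genuine spike (2.2) still crosses the line.

```java
public class ThresholdSizing {
    // The worst-case baseline client-abort percentage occurs at the valley traffic
    // floor, where the denominator is smallest; a safety factor keeps normal
    // jitter below the line while leaving genuine outages well above it.
    static double suggestedThresholdPct(int baselineAborts, int valleyFloorRequests, double safetyFactor) {
        double worstBaselinePct = 100.0 * baselineAborts / valleyFloorRequests;
        return worstBaselinePct * safetyFactor;
    }
}
```

With this document's numbers (3 aborts, 100-request valleys) and a 1.5x margin, this suggests roughly 4.5% --- high enough that valleys alone cannot fire the alarm, low enough that an outage pushing the rate to double digits still does.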
Unchanged Behavior (Regression Prevention)
3.1 WHEN getCategoryData is called with valid parameters and the client remains connected THEN the system SHALL CONTINUE TO return HTTP 200 with the GetCategoryDataOutput response and increment getCategoryDataSuccess
3.2 WHEN the downstream service throws an actual exception THEN the system SHALL CONTINUE TO catch it, log the error, return HTTP 500, and increment getCategoryDataFailure --- these are genuine faults that should be counted
3.3 WHEN getCategoryData succeeds THEN the system SHALL CONTINUE TO track getCategoryDataSuccess with the marketplaceId dimension
3.4 WHEN other controller endpoints are called THEN they SHALL CONTINUE TO behave as they currently do
3.5 WHEN getCategoryDataFailure is incremented due to actual downstream failures THEN the alarm SHALL CONTINUE TO fire when the fault rate exceeds the adjusted threshold --- only the threshold level changes, not the alarm mechanism
Debugging Notes
How to investigate getCategoryData_Fault_Percent alarms in the future:
Step 1: Check service_log (querylog) first
From the log aggregation system, decompress and search the service_log for the alarm time window:
```bash
# Decompress and extract all getCategoryData querylog entries
zstd -dc service_log.<date>-<hour>.<host>-* | grep -B 3 -A 7 "getCategoryData" > /tmp/getCategoryData.csv

# Then filter for fault entries
grep "Fault=1" /tmp/getCategoryData.csv
```
Key fields in each querylog block:
- `Counters=Fault=1` --- confirms a fault was recorded
- `Metrics=getCategoryDataSuccess=1` alongside `Fault=1` → client abort (controller succeeded, HTTP delivery failed)
- `Metrics=getCategoryDataFailure=1` → genuine downstream failure (controller catch block was hit)
- `StartTime` --- Unix epoch timestamp for correlating with application.log
Step 2: Check application.log for the matching timestamp
If getCategoryDataSuccess=1 AND Fault=1 (client abort):
```bash
grep "ClientAbortException\|Connection reset by peer" application.log.* | grep "<timestamp>"
```
The ClientAbortException won't mention getCategoryData --- match by timestamp only.
If getCategoryDataFailure=1 (genuine failure):
```bash
grep "Fail to getCategoryData" application.log.* | grep "<timestamp>"
```
Step 3: Check downstream service logs
Same timestamp window. Look for:
- `GetCategoryData input:` --- request received
- `getCategoryList load N items from ddb` --- DynamoDB succeeded
- `getCategoryData error:` --- downstream internal error (returns a success response with an error code in the body; does not throw)
Traffic Valley Pattern
If the alarm fires during low-traffic periods but the absolute Fault=1 count is small and constant, the issue is the percentage-based threshold being too sensitive for the traffic volume. Compare fault counts during alarm windows vs normal windows --- if the absolute count is similar, the alarm is triggered by the traffic valley, not by service degradation.
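The comparison described above can be written as a small triage helper (hypothetical; the tolerance parameter is an assumption):

```java
public class ValleyTriage {
    // True when the alarm window's absolute fault count is roughly the same as a
    // normal window's --- i.e. the alarm is a shrinking-denominator artifact,
    // not service degradation.
    static boolean isTrafficValleyArtifact(int faultsInAlarmWindow, int faultsInNormalWindow, int tolerance) {
        return Math.abs(faultsInAlarmWindow - faultsInNormalWindow) <= tolerance;
    }
}
```

A count of 3 faults in both windows points at the valley; a jump from 3 to 30 points at real degradation.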