TL;DR:
We reduced p95 API latency for the top Describe APIs by 30% to 60% and p50 AdminDB CPU usage by 10% in the IAD region. These improvements were driven by a set of optimizations in the WebService (WS) API implementation.
Background
We observed frequent tickets related to API faults and latency in the WS, as well as similar issues in other services such as AWF and EP. These issues were largely caused by overload on the AdminDB. As part of the investigation, we analyzed the Performance Insights dashboard in the RDS Console and found that the Feature Access table was the most frequently accessed table in the AdminDB. We suspected that a bug or inefficiency was behind this access pattern. Upon further investigation, we discovered a 10x growth in traffic to the Feature Access table from the WS over the last few years, particularly in the last 2 years, coinciding with multiple instance type, engine, and feature launches (e.g., Skyhook, Synchronaa).
Root Cause Analysis
Our analysis of how the WS uses the Feature Access table identified multiple inefficiencies:
- Sequential Feature Checks: The DescribeCacheParameters API checked for access to 12 engine type features and 23 instance type features sequentially, resulting in at least 35 calls to the AdminDB per API call.
- Duplicate Calls: The DescribeCacheParameters API checked the 23 instance type features in three different places in the code path, fetching the status from the database each time, which contributed 69 calls to the DB.
- Redundant Tagging Feature Checks: A bug in the tagging implementation caused the WS to check for access to the Partial Authz feature in every API, when the check was only required for two tagging APIs.
Solution
To address these inefficiencies, we implemented the following optimizations:
- Batching Feature Access Checks: We batched the feature access checks for the DescribeCacheParameters API, reducing the number of calls to the AdminDB from 35 individual calls to a single batched call.
- Eliminating Duplicate Calls: We optimized the DescribeCacheParameters API to fetch the feature access information once and reuse it throughout the code path, eliminating 69 unnecessary calls to the database.
- Removing Redundant Tagging Feature Checks: We fixed the bug that caused the Partial Authz feature to be checked in all APIs, so the check now runs only in the two tagging APIs that require it, removing the unnecessary database queries elsewhere.
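The first two optimizations can be illustrated with a minimal Python sketch. It is not the WS implementation: `FakeAdminDB`, `RequestFeatureAccess`, the method names, and the feature names are all hypothetical stand-ins, and the fake database only counts round trips. It assumes the batched lookup can be served in one round trip (for example, a single query over a list of feature names).

```python
from typing import Dict, Iterable, List

# Hypothetical feature names; the real feature identifiers are not given here.
ENGINE_FEATURES: List[str] = [f"engine-feature-{i}" for i in range(12)]
INSTANCE_FEATURES: List[str] = [f"instance-feature-{i}" for i in range(23)]
ALL_FEATURES: List[str] = ENGINE_FEATURES + INSTANCE_FEATURES


class FakeAdminDB:
    """Hypothetical stand-in for the AdminDB that counts round trips."""

    def __init__(self, enabled: Iterable[str]) -> None:
        self._enabled = set(enabled)
        self.round_trips = 0

    def get_feature(self, account_id: str, feature: str) -> bool:
        # One round trip per feature: the original sequential pattern.
        self.round_trips += 1
        return feature in self._enabled

    def get_features(self, account_id: str, features: Iterable[str]) -> Dict[str, bool]:
        # One round trip for the whole set (assumed to be a single query).
        self.round_trips += 1
        return {name: name in self._enabled for name in features}


class RequestFeatureAccess:
    """Fetches feature access once per request and serves later reads from memory."""

    def __init__(self, db: FakeAdminDB, account_id: str, features: Iterable[str]) -> None:
        self._access = db.get_features(account_id, list(features))

    def is_enabled(self, feature: str) -> bool:
        return self._access[feature]


def describe_sequential(db: FakeAdminDB, account_id: str) -> int:
    """Check every feature with its own call; returns the round trips used."""
    for feature in ALL_FEATURES:
        db.get_feature(account_id, feature)
    return db.round_trips


def describe_batched(db: FakeAdminDB, account_id: str) -> int:
    """Batch the lookup, then reuse the result at three points in the code path."""
    access = RequestFeatureAccess(db, account_id, ALL_FEATURES)
    for _ in range(3):  # three places that read the instance type features
        for feature in INSTANCE_FEATURES:
            access.is_enabled(feature)
    return db.round_trips


print(describe_sequential(FakeAdminDB(["engine-feature-0"]), "acct-1"))  # 35
print(describe_batched(FakeAdminDB(["engine-feature-0"]), "acct-1"))     # 1
```

Building the access object once at the start of the request and passing it down the code path is what makes the later reads free. In this sketch the result is scoped to a single request, so no cross-request cache invalidation is needed.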
Improvements
- Reduced traffic to the Feature Access table by roughly 90%, lowering QPS for the table from 35k to 4k.
- API latency (p95) for the top APIs dropped by 30% to 60%, with DescribeCacheParameters latency improving from 280 ms to 105 ms: link