The N+1 query problem is simple to understand and surprisingly hard to catch before production: your application fetches a list of N items, then issues N additional queries to fetch related data for each item. One request becomes 1+N database queries.
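The shape of the problem fits in a few lines. Here is a minimal, self-contained sketch using Python's built-in sqlite3; the orders/items schema is illustrative, not taken from any real application:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE items  (id INTEGER PRIMARY KEY, order_id INTEGER);
""")
conn.executemany("INSERT INTO orders (id) VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO items (order_id) VALUES (?)", [(1,), (1,), (2,), (3,)])

# The N+1 pattern: 1 query for the list, then one more per row in the loop.
query_count = 1
orders = conn.execute("SELECT id FROM orders").fetchall()
items_by_order = {}
for (order_id,) in orders:
    query_count += 1
    items_by_order[order_id] = conn.execute(
        "SELECT id FROM items WHERE order_id = ?", (order_id,)
    ).fetchall()
# query_count is now 1 + len(orders) == 4

# The fix: fetch the same data in a single joined query.
rows = conn.execute(
    "SELECT o.id, i.id FROM orders o LEFT JOIN items i ON i.order_id = o.id"
).fetchall()
```

The loop version is what an ORM's lazy loading quietly generates; the joined version (or a batched `WHERE order_id IN (...)`) is what eager loading produces instead.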
At scale, this is catastrophic. A checkout page that fetches 50 orders issues 51 database queries. At 10,000 requests per minute, that's 510,000 database queries per minute from a single endpoint, when a single joined query per request would keep it at 10,000.
In staging, your order list has 3 items; four queries is fine. In production, your order list has 47 items, and 48 queries per request, multiplied across thousands of concurrent users, collapse your connection pool.
This is why N+1 consistently slips through code review and staging. The problem only manifests at production data volumes, and traditional APM tools show you "database is slow" without telling you which query pattern is causing it or which code is responsible.
ArcIn correlates distributed traces with database query telemetry to identify N+1 patterns automatically: when a single trace generates more than a threshold number of structurally identical queries, it flags the trace and points to the code issuing the repeated query.
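ArcIn's internal detection logic isn't shown here, but the core idea of "structurally identical queries" can be sketched by fingerprinting each SQL statement: replace literals with placeholders, normalize whitespace and case, then count occurrences per trace. The function names and threshold below are hypothetical, not ArcIn's API:

```python
import re
from collections import Counter

N_PLUS_ONE_THRESHOLD = 10  # hypothetical default; real tools make this configurable

def fingerprint(sql: str) -> str:
    """Reduce a SQL statement to its structural shape."""
    sql = re.sub(r"'[^']*'", "?", sql)   # quoted string literals -> ?
    sql = re.sub(r"\b\d+\b", "?", sql)   # numeric literals -> ?
    return re.sub(r"\s+", " ", sql).strip().lower()

def detect_n_plus_one(trace_queries, threshold=N_PLUS_ONE_THRESHOLD):
    """Return fingerprints that repeat at least `threshold` times in one trace."""
    counts = Counter(fingerprint(q) for q in trace_queries)
    return {fp: n for fp, n in counts.items() if n >= threshold}
```

Run against a trace containing one list query plus 47 per-order lookups, this collapses the lookups into a single fingerprint with a count of 47, which is exactly the signal that distinguishes an N+1 from 47 unrelated queries.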
N+1 queries don't just slow down the affected endpoint — they cascade through your entire database connection pool. Each of the 47 queries holds a connection open for its duration. When enough concurrent requests are in flight, the pool exhausts and every service that uses the database starts timing out.
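The exhaustion mechanism can be demonstrated with a toy semaphore-backed pool; the class, sizes, and timeout below are illustrative stand-ins for a real driver's pool, not any specific library:

```python
import threading

class ConnectionPool:
    """Toy stand-in for a database driver's connection pool."""
    def __init__(self, size: int, timeout: float):
        self._sem = threading.Semaphore(size)
        self._timeout = timeout

    def acquire(self):
        # Block up to `timeout` seconds waiting for a free connection.
        if not self._sem.acquire(timeout=self._timeout):
            raise TimeoutError("connection pool exhausted")

    def release(self):
        self._sem.release()

pool = ConnectionPool(size=2, timeout=0.05)
pool.acquire()  # request A holds a connection for its query
pool.acquire()  # request B holds the second (and last) connection
try:
    pool.acquire()  # request C: nothing frees up within the timeout
    exhausted = False
except TimeoutError:
    exhausted = True
# exhausted is True: request C fails even though the database itself is healthy
```

Request C here could belong to a completely different service sharing the same database, which is why the symptom reads as system-wide unavailability rather than a bug in one endpoint.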
This is why N+1 incidents often look like a "database is down" event when they first manifest — the root cause is a code pattern in a specific service, but the symptom is system-wide database unavailability.
At 14:32, checkout-svc p99 jumped from 180ms to 520ms. ArcIn identified an N+1 in OrderRepository within 47 seconds. IntelliTune offered an immediate rollback. The total time from alert to resolution was 11 minutes — down from the previous average of 4.5 hours for database-related incidents.