AIOps·Mar 11, 2025·6 min read

Root cause in 60 seconds: how ArcIn traverses 40 services to find the answer

Applicare Engineering Team

The problem with traditional incident response

When an incident fires at 2am, the on-call engineer opens five dashboards, joins a bridge call with three colleagues, and spends 45 minutes correlating metrics across services before someone finally says "I think it's the database." Then someone else spends another 20 minutes confirming it.

This is not a people problem. This is a tooling problem. Your monitoring tools were built to show you data — not to answer questions.

ArcIn was built to answer questions. Here's how it traverses 40+ services in under 60 seconds to find the root cause of any incident.

<60s: root cause identification

40+: services traversed

11 min: average MTTR with ArcIn

Figure: Applicare, real transaction trace of a 147ms viewCategory call

The entity graph — ArcIn's foundation

ArcIn doesn't analyse services in isolation. It operates on the causal entity graph — a continuously updated model of your entire infrastructure that maps causal relationships between every service, host, database, and cloud resource in your environment.
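To make the idea of an entity graph concrete, here is a minimal sketch of one way such a model could be represented. It is not ArcIn's internal data structure: the `EntityGraph` class, its method names, and the entities `orders-db` and `host-17` are hypothetical, and an edge here simply means "A depends on B".

```python
from dataclasses import dataclass, field


@dataclass
class EntityGraph:
    """Illustrative directed graph: an edge A -> B means 'A depends on B'."""
    kinds: dict[str, str] = field(default_factory=dict)            # entity name -> kind
    depends_on: dict[str, set[str]] = field(default_factory=dict)  # entity -> its dependencies

    def add_entity(self, name: str, kind: str) -> None:
        self.kinds[name] = kind
        self.depends_on.setdefault(name, set())

    def add_dependency(self, source: str, target: str) -> None:
        # Both endpoints must be registered before they can be linked.
        if source not in self.kinds or target not in self.kinds:
            raise KeyError("register both entities before linking them")
        self.depends_on[source].add(target)

    def dependents_of(self, name: str) -> set[str]:
        # Reverse lookup: every entity that directly depends on `name`.
        return {src for src, targets in self.depends_on.items() if name in targets}


# Hypothetical three-entity environment (names other than checkout-svc are made up).
graph = EntityGraph()
graph.add_entity("checkout-svc", "service")
graph.add_entity("orders-db", "database")
graph.add_entity("host-17", "host")
graph.add_dependency("checkout-svc", "orders-db")
graph.add_dependency("orders-db", "host-17")
```

A real causal graph would also carry metrics, change events, and timestamps on each entity; this sketch only captures the dependency edges.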

When ArcIn receives a question like "why is checkout-svc slow?", it doesn't just look at checkout-svc's metrics. It traverses the graph upstream and downstream, looking for the entity whose degradation best explains the symptom.
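As a rough illustration of walking the graph in both directions, the sketch below uses a plain breadth-first search. It assumes "upstream" means the things an entity depends on and "downstream" means the things that depend on it; the topology, the function names, and that reading of the terms are assumptions for the example, not ArcIn's documented behaviour.

```python
from collections import deque

# Hypothetical topology: each key depends on the entities in its list.
DEPENDS_ON = {
    "web-frontend": ["checkout-svc"],
    "checkout-svc": ["orders-db", "payment-svc"],
    "payment-svc": [],
    "orders-db": ["host-17"],
    "host-17": [],
}


def reachable(start, neighbours):
    """Breadth-first search returning {entity: hop distance from start}."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours(current):
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    del distance[start]  # the starting entity is not its own candidate
    return distance


def upstream(start):
    # Entities that `start` depends on, directly or transitively.
    return reachable(start, lambda e: DEPENDS_ON.get(e, []))


def downstream(start):
    # Entities that depend on `start`, directly or transitively.
    return reachable(start, lambda e: [s for s, deps in DEPENDS_ON.items() if e in deps])
```

Calling `upstream("checkout-svc")` yields the candidate causes with their hop distance, while `downstream("checkout-svc")` yields the entities likely to feel the impact.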

The traversal algorithm — simplified

ArcIn's traversal works in three stages:

1. Symptom identification: identify the entity experiencing the reported degradation and the exact metric that changed
2. Causal graph traversal: walk upstream through the dependency graph, scoring each entity by its probability of being the root cause based on timing correlation, magnitude of change, and historical patterns
3. Root cause synthesis: identify the highest-probability root cause, retrieve the specific event (deploy, config change, traffic surge) that triggered it, and generate a plain-English explanation with a recommended fix
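The scoring step in the second stage can be sketched as a weighted combination of the three signals. This is a simplification under stated assumptions: the weights, the 0-to-1 normalisation, and the candidate values are invented for illustration, and the result is a ranking score rather than a calibrated probability.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    timing_correlation: float  # 0..1, how closely its change precedes the symptom
    magnitude: float           # 0..1, normalised size of its metric change
    historical_prior: float    # 0..1, how often it was the cause in past incidents


# Assumed weights for timing, magnitude, and history; not taken from ArcIn.
WEIGHTS = (0.5, 0.3, 0.2)


def score(candidate):
    """Weighted sum of the three signals; higher means more suspicious."""
    w_timing, w_magnitude, w_history = WEIGHTS
    return (
        w_timing * candidate.timing_correlation
        + w_magnitude * candidate.magnitude
        + w_history * candidate.historical_prior
    )


def most_likely_root_cause(candidates):
    return max(candidates, key=score)


# Hypothetical candidates gathered by the upstream walk.
candidates = [
    Candidate("orders-db", 0.9, 0.8, 0.6),
    Candidate("payment-svc", 0.2, 0.1, 0.3),
    Candidate("host-17", 0.4, 0.3, 0.1),
]
best = most_likely_root_cause(candidates)
```

With these made-up inputs, `best.name` is `"orders-db"`. The third stage would then look up the change event attached to that entity and turn it into an explanation.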

The key insight: most incidents have a single root cause. A slow database query causes a slow API call, which causes slow page loads, which in turn drive up error rates. ArcIn finds the database query, not just the slow page loads that your users are experiencing.
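The causal chain described above can be followed mechanically: start at the symptom and keep stepping into a degraded dependency until none is left. The sketch below assumes a simple linear chain and a precomputed set of degraded entities; the entity names and the `deepest_degraded` helper are hypothetical.

```python
# Hypothetical chain: page loads depend on an API call, which depends on a DB query.
DEPENDS_ON = {
    "page-load": ["api-call"],
    "api-call": ["db-query"],
    "db-query": [],
}
# Entities currently showing degradation (assumed to be known already).
DEGRADED = {"page-load", "api-call", "db-query"}


def deepest_degraded(symptom):
    """Follow degraded dependencies from the symptom; return the path taken."""
    path = [symptom]
    seen = {symptom}
    current = symptom
    while True:
        degraded_deps = [
            dep for dep in DEPENDS_ON.get(current, [])
            if dep in DEGRADED and dep not in seen
        ]
        if not degraded_deps:
            return path  # nothing degraded further upstream: last entry is the likely cause
        # Simplification: follow the first degraded dependency only.
        current = degraded_deps[0]
        seen.add(current)
        path.append(current)
```

Here `deepest_degraded("page-load")` ends at `"db-query"`. When an entity has several degraded dependencies, a real system would need to rank them rather than take the first one.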

A real example: the checkout incident

At 14:32 on a Tuesday, checkout-svc p99 latency increased from 180ms to 520ms. ArcIn traversed the graph and returned this analysis in 47 seconds:

Root cause: Deploy #6205 (14:28, checkout-svc v2.4)
Introduced N+1 query in OrderRepository.findAll()
Effect: 47 SQL queries per request (was: 1)
DB pool utilisation: 96% (threshold: 80%)
Cascading impact: checkout-svc p99 +340ms → 5xx rate +2.4%
Recommended fix: Add eager loading on order_items relationship
IntelliTune action available: rollback deploy #6205
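For readers unfamiliar with the N+1 pattern named in that analysis, here is a small self-contained illustration using an in-memory SQLite database. The schema, the helper names, and the 46 rows are assumptions chosen only so the query count lands on 47; the actual code and data behind the incident are not shown in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
""")
# 46 orders with one item each (illustrative data only).
conn.executemany("INSERT INTO orders (id) VALUES (?)", [(i,) for i in range(1, 47)])
conn.executemany(
    "INSERT INTO order_items (order_id, sku) VALUES (?, ?)",
    [(i, f"sku-{i}") for i in range(1, 47)],
)

query_count = 0


def run(sql, params=()):
    """Execute one query and count it."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def load_n_plus_one():
    # One query for the orders, then one more query per order for its items.
    orders = run("SELECT id FROM orders")
    return {
        oid: run("SELECT sku FROM order_items WHERE order_id = ?", (oid,))
        for (oid,) in orders
    }


def load_eager():
    # A single JOIN fetches orders and their items together.
    rows = run(
        "SELECT o.id, i.sku FROM orders o "
        "LEFT JOIN order_items i ON i.order_id = o.id"
    )
    result = {}
    for oid, sku in rows:
        result.setdefault(oid, []).append(sku)
    return result


query_count = 0
load_n_plus_one()
n_plus_one_queries = query_count  # 1 query for orders + 46 for items

query_count = 0
load_eager()
eager_queries = query_count  # 1 joined query
```

The lazy version issues 47 queries and the joined version issues 1, which is the same shape as the "47 SQL queries per request (was: 1)" line in the analysis. In an ORM, the equivalent fix is usually an eager-loading option on the relationship rather than a hand-written JOIN.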

The on-call engineer received this in a Slack message before they'd even opened a dashboard. Total resolution time: 11 minutes (including the rollback verification).

Why plain English matters

ArcIn deliberately outputs its analysis in plain English rather than raw metrics. This isn't just a UX decision — it reflects a fundamental belief that observability tools should serve engineers at every skill level, not just the ones who know how to read flame graphs.

When a developer can ask "why is my service slow?" and get a specific, actionable answer in under a minute, the conversation about incident response changes completely.
