IntelliTune has auto-resolved over 400,000 Kubernetes incidents across our customer base. When we analysed the full dataset, we found that 10 patterns account for 78% of all auto-resolutions. Here they are — what triggers them, how they're detected, and what IntelliTune does to fix them.
Pattern 1: OOMKill restart loop
Trigger: Pod OOMKilled, restarts > 2 in 10 minutes
Detection: IntelliSense flags the restart loop and correlates it with the memory utilisation trend
Action: Increase memory limit by 40% within policy gates, create Jira ticket with heap dump analysis
Success rate: 94%
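To make the mechanics concrete, here's a minimal sketch of what this kind of remediation looks like against the Kubernetes API, using the official Python client. The namespace, deployment name, and 2048Mi ceiling are hypothetical stand-ins for your own workload and policy gate; IntelliTune's internal implementation isn't shown here.

```python
# Sketch: bump a deployment's memory limit by 40%, capped by a policy ceiling.
# NAMESPACE, DEPLOYMENT, and POLICY_CAP_MI are hypothetical placeholders.
from kubernetes import client, config

NAMESPACE, DEPLOYMENT = "payments", "api"
POLICY_CAP_MI = 2048  # hypothetical policy-gate ceiling, in MiB

def parse_mi(quantity: str) -> int:
    """Convert a K8s memory quantity like '512Mi' or '1Gi' to MiB."""
    units = {"Mi": 1, "Gi": 1024}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {quantity}")

config.load_kube_config()
apps = client.AppsV1Api()

dep = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
ctr = dep.spec.template.spec.containers[0]  # assumes a memory limit is set
current_mi = parse_mi(ctr.resources.limits["memory"])
new_mi = min(int(current_mi * 1.4), POLICY_CAP_MI)  # +40%, gated by the cap

patch = {"spec": {"template": {"spec": {"containers": [
    {"name": ctr.name, "resources": {"limits": {"memory": f"{new_mi}Mi"}}}
]}}}}
apps.patch_namespaced_deployment(DEPLOYMENT, NAMESPACE, patch)
```

Patching the deployment (rather than the pod) matters: it rolls out new pods with the higher limit instead of fighting the controller over a live pod spec.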
Pattern 2: Node resource contention
Trigger: Node CPU ready > 8% or node memory pressure flagged
Detection: Correlate node resource contention with pod performance degradation on that node
Action: Trigger live migration to a less-loaded node via DRS recommendation or K8s rescheduling
Success rate: 97%
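On the Kubernetes side, the rescheduling step reduces to cordoning the contended node and letting the workload's controller recreate the pod elsewhere. A minimal sketch with the official Python client, using hypothetical node and pod names (the VMware DRS path is out of scope here):

```python
# Sketch: cordon the hot node, then delete the pod so its controller
# reschedules it on a less-loaded node. Names are hypothetical.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

NODE, POD, NAMESPACE = "node-7", "api-6d4f9c-xk2lp", "payments"

# Cordon: mark the contended node unschedulable so the pod can't land back on it.
core.patch_node(NODE, {"spec": {"unschedulable": True}})

# Delete the pod; its Deployment/ReplicaSet recreates it elsewhere.
# Remember to uncordon the node once the pressure clears.
core.delete_namespaced_pod(POD, NAMESPACE)
```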
Pattern 3: Connection pool exhaustion
Trigger: DB connection pool > 90% for > 2 minutes
Detection: ArcIn identifies the service causing the exhaustion and traces it to a missing pool config
Action: Scale pool size within configured bounds, alert dev team with root cause and recommended config change
Success rate: 89%
Connection pool exhaustion is the single most common root cause of "mysterious" latency spikes we see across customer environments. It almost always traces back to a deploy that changed connection handling without updating the pool configuration.
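For reference, this is the shape of the pool configuration that typically goes missing in those deploys. The snippet below uses SQLAlchemy purely as an illustration, and the numbers are starting points, not IntelliTune's recommendations:

```python
# Sketch: an explicit connection pool configuration, using SQLAlchemy as a
# stand-in. Values are illustrative; tune them to your database's limits.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://app:secret@db:5432/orders",
    pool_size=20,       # steady-state connections held open
    max_overflow=10,    # extra connections allowed under burst, then closed
    pool_timeout=5,     # fail fast instead of queueing forever at exhaustion
    pool_recycle=1800,  # recycle connections before the server drops them
)
```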
Pattern 4: CrashLoopBackOff after a config change
Trigger: Pod in CrashLoopBackOff, exit code 1 or 137
Detection: Parse container logs for known config error signatures; correlate with recent ConfigMap changes
Action: If a config error is detected and a previous ConfigMap version exists, roll back the ConfigMap and restart the pod
Success rate: 82%
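Here's a minimal sketch of the rollback-and-restart sequence with the official Python client. How IntelliTune versions ConfigMaps isn't public, so `previous_data` below is a placeholder for the last-known-good data; the restart uses the same annotation bump that `kubectl rollout restart` performs:

```python
# Sketch: roll a ConfigMap back to its previous data, then trigger a rolling
# restart so pods re-read the config. Names and data are hypothetical.
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

NAMESPACE, CONFIGMAP, DEPLOYMENT = "payments", "api-config", "api"
previous_data = {"LOG_LEVEL": "info"}  # placeholder: last-known-good data

# Roll the ConfigMap back to the previous version.
core.patch_namespaced_config_map(CONFIGMAP, NAMESPACE, {"data": previous_data})

# Bump a template annotation so the Deployment rolls out fresh pods,
# the same mechanism `kubectl rollout restart` uses.
stamp = datetime.now(timezone.utc).isoformat()
apps.patch_namespaced_deployment(DEPLOYMENT, NAMESPACE, {
    "spec": {"template": {"metadata": {"annotations": {
        "kubectl.kubernetes.io/restartedAt": stamp}}}},
})
```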
Pattern 5: HPA maxed out under sustained load
Trigger: HPA at max replicas, request queue building, p99 latency increasing
Detection: IntelliSense predicts queue exhaustion 3-5 minutes before user impact
Action: Temporarily increase HPA max within policy gates; notify platform team
Success rate: 91%
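The HPA bump itself is a one-field patch. A minimal sketch with the official Python client, where the 1.5x factor and hard cap are hypothetical stand-ins for whatever your policy gate permits:

```python
# Sketch: raise an HPA's maxReplicas within a gated ceiling.
# The HPA name, 1.5x factor, and HARD_CAP are hypothetical.
from kubernetes import client, config

config.load_kube_config()
autoscaling = client.AutoscalingV1Api()

NAMESPACE, HPA, HARD_CAP = "payments", "api-hpa", 50

hpa = autoscaling.read_namespaced_horizontal_pod_autoscaler(HPA, NAMESPACE)
new_max = min(int(hpa.spec.max_replicas * 1.5), HARD_CAP)  # gated increase

autoscaling.patch_namespaced_horizontal_pod_autoscaler(
    HPA, NAMESPACE, {"spec": {"maxReplicas": new_max}}
)
```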
Every IntelliTune action runs through policy gates before executing. Gates define: which patterns are allowed to run automatically, which require human approval, which are blocked entirely, and what rollback looks like if the remediation makes things worse.
The default gate configuration is conservative — most actions require approval for the first 30 days, then graduate to automatic execution based on their observed success rate in your environment. You can override this at any time.
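IntelliTune's gate format isn't shown in this post, so the sketch below is a hypothetical illustration of the moving parts a gate has to encode: a per-pattern mode, a graduation window, and a success-rate threshold.

```python
# Hypothetical sketch of a policy-gate check; pattern names, modes, and
# thresholds are illustrative, not IntelliTune's actual configuration schema.
from datetime import datetime, timedelta, timezone

GATES = {
    "oom_restart_loop":   {"mode": "auto",     "min_success_rate": 0.90},
    "configmap_rollback": {"mode": "approval", "min_success_rate": 0.80},
    "node_drain":         {"mode": "blocked"},
}
GRADUATION_PERIOD = timedelta(days=30)  # conservative break-in window

def allowed_automatically(pattern: str, enabled_at: datetime,
                          observed_success: float) -> bool:
    """True if the pattern may run without human approval."""
    gate = GATES.get(pattern, {"mode": "blocked"})  # unknown patterns blocked
    if gate["mode"] != "auto":
        return False
    if datetime.now(timezone.utc) - enabled_at < GRADUATION_PERIOD:
        return False  # still in the approval-first window
    return observed_success >= gate["min_success_rate"]
```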