Analysis of 10,000+ agent memory states across domains. How unreliable is AI agent memory — and what does it cost?
Memory unreliability rates vary significantly by domain. High-stakes domains show higher rates due to faster information decay and stricter source requirements.
Key finding: The most common cause of unreliability is temporal decay — memories older than 30 days are 3.4× more likely to conflict with current ground truth. Commercial bias (sponsored content in memory sources) accounts for 18% of flagged entries in fintech.
Why do agent memories fail? Sgraal classifies every BLOCK and WARN decision into failure categories.
- **Temporal decay** — memory is too old relative to the action being taken. Weibull decay model: half-life varies by domain (fintech: 7 days; general: 45 days).
- **Source conflict** — two or more memory entries directly contradict each other. Most common in multi-agent systems where agents share memory pools.
- **Commercial bias** — memory sourced from sponsored content, affiliate articles, or commercially motivated sources. Detected via commercial_intent scoring.
- **Missing provenance** — memory entry has no traceable source. Common in agents that summarize web content without preserving source metadata.
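The temporal-decay category can be made concrete with a small sketch. This is a minimal illustration of a Weibull survival curve calibrated so that freshness drops to 0.5 at the domain half-life quoted above; the shape parameter `k` and the function names are assumptions, not Sgraal's published calibration:

```python
import math

# Half-lives from the text; other domains would need their own calibration.
HALF_LIFE_DAYS = {"fintech": 7.0, "general": 45.0}

def freshness(age_days: float, domain: str, k: float = 1.5) -> float:
    """Weibull survival S(t) = exp(-(t/lam)^k), scaled so S(half_life) = 0.5."""
    h = HALF_LIFE_DAYS[domain]
    lam = h / math.log(2) ** (1.0 / k)  # solve S(h) = 0.5 for the scale
    return math.exp(-((age_days / lam) ** k))
```

With `k = 1` this reduces to plain exponential decay; a shape above 1 makes fresh memories decay slowly at first and then fall off faster, which is one plausible reason to prefer Weibull over exponential here.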
Sgraal runs 108 scoring models in parallel; end-to-end latency is measured from request to decision.
Note: This benchmark uses synthetic data. 10,847 memory state evaluations generated using adversarial test patterns and realistic agent memory profiles. Synthetic memories were constructed to represent real-world distributions of temporal decay, source conflict, and commercial bias. All evaluations span 4 domains: fintech, healthcare, legal, and general. No real user data was used.
A memory entry is classified as "unreliable" if it receives omega_mem_final ≥ 31 (WARN threshold) on at least one preflight call. This includes temporal decay, source conflict, commercial bias, and provenance failures.
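As a minimal sketch of that classification rule (threshold from the text; function and argument names are illustrative):

```python
WARN_THRESHOLD = 31  # omega_mem_final >= 31 triggers WARN, per the text

def classify(omega_scores: list[float]) -> str:
    """An entry is 'unreliable' if ANY preflight call scores at or above WARN."""
    if any(score >= WARN_THRESHOLD for score in omega_scores):
        return "unreliable"
    return "reliable"
```

Note the any-call semantics: a single WARN on one preflight is enough to flag the entry, even if later calls score below the threshold.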
83 scoring modules evaluated per preflight call: Weibull freshness decay, 5-method drift detection ensemble, source trust scoring, conflict graph analysis, causal graph construction, Entry Shapley attribution, commercial intent classification, compliance profile evaluation, timestamp integrity, identity drift, and consensus collapse detection.
Results may not be representative of all AI agent deployments: domain-specific rates are influenced by the agent profiles modeled in each domain. Latency measurements are from Railway (EU West) to client.
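Running many independent scoring modules per preflight call is naturally a concurrent fan-out. The sketch below is purely illustrative — the module names, signatures, and aggregation are assumptions, not Sgraal's implementation:

```python
import asyncio

async def weibull_freshness(entry: dict) -> float:
    return 1.0  # placeholder: a real module would apply the decay model

async def source_trust(entry: dict) -> float:
    return 1.0  # placeholder: a real module would score source reliability

async def run_scorers(entry: dict, scorers) -> dict:
    # Launch every scoring module concurrently, then collect results by name.
    results = await asyncio.gather(*(scorer(entry) for scorer in scorers))
    return dict(zip([scorer.__name__ for scorer in scorers], results))
```

Usage: `asyncio.run(run_scorers(entry, [weibull_freshness, source_trust]))` returns a name-to-score dict that a downstream aggregator could fold into `omega_mem_final`.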
Independent builds, side-by-side results across 8 adversarial corpora.
239 cases: sponsored drift (60), subtle drift (59), hallucination (60)
Sgraal: F1 = 1.000 · FP=0 · FN=0
Grok: F1 = 1.000
90 cases · 4 attack vectors: injection mid-chain, drift amplification, RAG poisoning, API drift
Sgraal: F1 = 1.000
Grok: F1 = 1.000 · <180ms · blast radius <2%
60 cases · timestamp forgery detection · old decisions disguised as fresh
Sgraal: F1 = 1.000 · 60/60
New field: timestamp_integrity: VALID | SUSPICIOUS | MANIPULATED
90 cases · gradual role and authority escalation across agent hops
Sgraal: F1 = 1.000 · 90/90
New field: identity_drift: CLEAN | SUSPICIOUS | MANIPULATED
90 cases · self-reinforcing false consensus detection
Sgraal: F1 = 1.000 · 90/90
New field: consensus_collapse: CLEAN | SUSPICIOUS | MANIPULATED
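Taken together, a preflight response exposing these fields might look like the following. The three integrity fields and the WARN threshold come from the text above; the overall payload shape is an assumption for illustration:

```json
{
  "omega_mem_final": 34,
  "decision": "WARN",
  "timestamp_integrity": "SUSPICIOUS",
  "identity_drift": "CLEAN",
  "consensus_collapse": "CLEAN"
}
```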
3 independent stacks syncing on fabricated consensus. Joint corpus with Grok.
Sgraal: Armed · anti-consensus layer active
Grok: Corpus incoming
When multiple attack vectors fire simultaneously, Sgraal computes a unified attack surface score.
| Layers active | attack_surface_score | attack_surface_level |
|---|---|---|
| 1 layer SUSPICIOUS | 0.50 | MODERATE |
| 2 layers SUSPICIOUS | 0.65 | HIGH |
| 3 layers SUSPICIOUS | 0.70 | HIGH |
| 1 layer MANIPULATED | 1.00 | CRITICAL |
| All 3 MANIPULATED | 1.40 | CRITICAL |
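One possible scoring rule consistent with the table is a diminishing-weight sum for SUSPICIOUS layers and a dominant term for MANIPULATED layers. The weights and level thresholds below are reverse-engineered assumptions that merely reproduce the published rows, not Sgraal's actual formula:

```python
def attack_surface_score(suspicious: int, manipulated: int) -> float:
    """Hypothetical decomposition matching the table above."""
    if manipulated > 0:
        # First MANIPULATED layer scores 1.00; each extra adds 0.20 (assumed).
        return 1.00 + 0.20 * (manipulated - 1)
    # Diminishing weights for SUSPICIOUS layers (assumed): 0.50, 0.15, 0.05.
    weights = [0.50, 0.15, 0.05]
    return sum(weights[:suspicious])

def attack_surface_level(score: float) -> str:
    """Thresholds assumed from the table: CRITICAL >= 1.00, HIGH >= 0.60."""
    if score >= 1.00:
        return "CRITICAL"
    if score >= 0.60:
        return "HIGH"
    if score >= 0.40:
        return "MODERATE"
    return "LOW"
```

The diminishing weights capture the table's behavior that a second and third SUSPICIOUS layer add progressively less, while any MANIPULATED verdict jumps straight to CRITICAL.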
**614** total corpus cases · **8** adversarial rounds · **0** false negatives
These figures reflect synthetic R12/R14 corpus performance; production calibration is pending paying-customer onboarding.
Run a preflight check on your memory state. No signup required.