BENCHMARK REPORT · 2026

Memory Reliability
Benchmark

Analysis of 10,000+ agent memory states across domains. How unreliable is AI agent memory — and what does it cost?

34.2%
of agent memories are unreliable at time of action
Based on 10,847 preflight evaluations · Q1 2026
✓ 2,900+ test scenarios ✓ 0 false negatives on R12/R14 corpus ✓ 4 domains analyzed

Unreliability by Domain

Memory unreliability rates vary significantly by domain. High-stakes domains show higher rates due to faster information decay and stricter source requirements.

Fintech Regulatory changes, market data, client status
41.3%
Healthcare Treatment protocols, drug interactions, patient data
38.7%
Legal Case law, regulatory updates, jurisdiction changes
35.1%
General Web content, general knowledge, tool state
22.8%

Key finding: The most common cause of unreliability is temporal decay — memories older than 30 days are 3.4× more likely to conflict with current ground truth. Commercial bias (sponsored content in memory sources) accounts for 18% of flagged entries in fintech.

Top Failure Modes

Why do agent memories fail? Sgraal classifies every BLOCK and WARN decision into failure categories.

47%

Temporal Decay

Memory is too old relative to the action being taken. Weibull decay model: half-life varies by domain (fintech: 7 days, general: 45 days).
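The report names a Weibull decay model with per-domain half-lives but does not publish its parameters. A minimal sketch, assuming freshness is the Weibull survival function S(t) = exp(-(t/λ)^k) with the scale λ derived from the half-life; the shape parameter k is an assumption (k = 1.0 reduces to exponential decay):

```python
import math

# Half-lives quoted in the report (days); other domains are not published.
HALF_LIFE_DAYS = {"fintech": 7.0, "general": 45.0}

def freshness(age_days: float, domain: str, k: float = 1.0) -> float:
    """Weibull survival S(t) = exp(-(t/lam)^k), scaled so S(half_life) = 0.5.

    k is an assumed shape parameter; the report does not specify it.
    """
    t_half = HALF_LIFE_DAYS[domain]
    lam = t_half / math.log(2) ** (1.0 / k)          # forces S(t_half) = 0.5
    return math.exp(-((age_days / lam) ** k))

# A 14-day-old fintech memory (two half-lives, k=1) retains 25% freshness,
# while the same age in the general domain is still above 80%.
```

Under this parameterization, raising k above 1 keeps young memories fresher and then drops them faster near the half-life, which matches the intuition that fintech facts stay valid until an event invalidates them.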

23%

Source Conflict

Two or more memory entries directly contradict each other. Most common in multi-agent systems where agents share memory pools.

18%

Commercial Bias

Memory sourced from sponsored content, affiliate articles, or commercially motivated sources. Detected via commercial_intent scoring.
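The report says commercial bias is "detected via commercial_intent scoring" without describing the classifier. A hypothetical sketch of one simple approach, weighted lexical cues; the cue list and weights below are illustrative assumptions, not Sgraal's actual features:

```python
# Illustrative cues only — the real commercial_intent classifier's features
# and weights are not published in the report.
COMMERCIAL_CUES = {
    "sponsored": 0.6,
    "affiliate": 0.5,
    "partner content": 0.4,
    "promo code": 0.3,
}

def commercial_intent(source_text: str) -> float:
    """Return a 0..1 score from weighted cue matches, capped at 1.0."""
    text = source_text.lower()
    score = sum(w for cue, w in COMMERCIAL_CUES.items() if cue in text)
    return min(score, 1.0)
```

A production classifier would more likely use source-level metadata and a trained model; this sketch only shows the shape of the signal (a bounded score attached to each memory's source).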

12%

Provenance Unknown

Memory entry has no traceable source. Common in agents that summarize web content without preserving source metadata.

API Performance

Sgraal runs 83 scoring models in parallel. End-to-end latency from request to decision:

12ms
p50 latency
Median response time
23ms
p95 latency
95th percentile
41ms
p99 latency
99th percentile
99.97%
Uptime (Q1 2026)
83
Scoring modules per call
355
API endpoints

Methodology

Dataset

Note: This benchmark uses synthetic data. 10,847 memory state evaluations generated using adversarial test patterns and realistic agent memory profiles. Synthetic memories were constructed to represent real-world distributions of temporal decay, source conflict, and commercial bias. All evaluations span 4 domains: fintech, healthcare, legal, and general. No real user data was used.

Reliability Definition

A memory entry is classified as "unreliable" if it receives omega_mem_final ≥ 31 (WARN threshold) on at least one preflight call. This includes temporal decay, source conflict, commercial bias, and provenance failures.
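The definition above reduces to a simple predicate over a memory entry's preflight history. A minimal sketch, assuming each preflight call yields one omega_mem_final score for the entry:

```python
WARN_THRESHOLD = 31  # omega_mem_final >= 31 triggers WARN, per the definition

def is_unreliable(preflight_scores: list[int]) -> bool:
    """An entry is unreliable if at least one preflight call hit WARN."""
    return any(score >= WARN_THRESHOLD for score in preflight_scores)
```

Note the asymmetry this creates: a single WARN anywhere in an entry's history marks it unreliable for the benchmark, so the 34.2% headline rate counts entries that may have passed most of their checks.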

Scoring Models

83 scoring modules evaluated per preflight call: Weibull freshness decay, 5-method drift detection ensemble, source trust scoring, conflict graph analysis, causal graph construction, Entry Shapley attribution, commercial intent classification, compliance profile evaluation, timestamp integrity, identity drift, and consensus collapse detection.
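The fan-out pattern behind a preflight call can be sketched as follows. Two toy modules stand in for the 83 listed above, and taking the maximum module score as omega_mem_final is an assumption; the report does not specify how module outputs are combined:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for the real modules; thresholds here are illustrative.
def freshness_module(entry: dict) -> int:
    return 40 if entry.get("age_days", 0) > 30 else 5

def provenance_module(entry: dict) -> int:
    return 35 if not entry.get("source") else 0

MODULES = [freshness_module, provenance_module]

def preflight(entry: dict) -> int:
    """Run all scoring modules in parallel; combine via max (assumed)."""
    with ThreadPoolExecutor(max_workers=len(MODULES)) as pool:
        scores = list(pool.map(lambda m: m(entry), MODULES))
    return max(scores)
```

Running modules concurrently rather than sequentially is what makes the p50 of 12ms plausible for dozens of independent scorers, since latency is bounded by the slowest module rather than the sum.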

Limitations

This benchmark is based on synthetic evaluations rather than production traffic and may not be representative of all AI agent deployments. Domain-specific rates reflect the memory profiles constructed for each domain. Latency measurements are from Railway (EU West) to client.

Joint Benchmark with Grok

Independent builds, side-by-side results across 8 adversarial corpora.

Round 1–3: Drift & Hallucination

COMPLETE

179 cases: sponsored drift (60), subtle drift (59), hallucination (60)

Sgraal: F1 = 1.000 · FP=0 · FN=0

Grok: F1 = 1.000

Round 4: Real-world Propagation

COMPLETE

90 cases · 4 attack vectors: injection mid-chain, drift amplification, RAG poisoning, API drift

Sgraal: F1 = 1.000

Grok: F1 = 1.000 · <180ms · blast radius <2%

Round 6: Memory Time Attack

COMPLETE

60 cases · timestamp forgery detection · old decisions disguised as fresh

Sgraal: F1 = 1.000 · 60/60

New field: timestamp_integrity: VALID | SUSPICIOUS | MANIPULATED

Round 7: Identity Drift

COMPLETE

90 cases · gradual role and authority escalation across agent hops

Sgraal: F1 = 1.000 · 90/90

New field: identity_drift: CLEAN | SUSPICIOUS | MANIPULATED

Round 8: Silent Consensus Collapse

COMPLETE

90 cases · self-reinforcing false consensus detection

Sgraal: F1 = 1.000 · 90/90

New field: consensus_collapse: CLEAN | SUSPICIOUS | MANIPULATED

Round 5: Multi-model Consensus Poisoning

IN PROGRESS

3 independent stacks syncing on fabricated consensus. Joint corpus with Grok.

Sgraal: Armed · anti-consensus layer active

Grok: Corpus incoming

Compound Attack Detection

When multiple attack vectors fire simultaneously, Sgraal computes a unified attack surface score.

Layers active         attack_surface_score   attack_surface_level
1 layer SUSPICIOUS    0.50                   MODERATE
2 layers SUSPICIOUS   0.65                   HIGH
3 layers SUSPICIOUS   0.70                   HIGH
1 layer MANIPULATED   1.00                   CRITICAL
All 3 MANIPULATED     1.40                   CRITICAL
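One plausible reading of the published scores is a direct lookup in which any MANIPULATED layer dominates SUSPICIOUS counts. A sketch reproducing the five published rows; behavior outside them (e.g. two MANIPULATED layers, mixed states) is an assumption:

```python
# Scores for N layers at SUSPICIOUS, taken directly from the table.
SUSPICIOUS_SCORE = {1: 0.50, 2: 0.65, 3: 0.70}

def attack_surface(suspicious: int, manipulated: int) -> tuple[float, str]:
    """Map per-layer statuses to (attack_surface_score, attack_surface_level)."""
    if manipulated == 3:
        return 1.40, "CRITICAL"
    if manipulated >= 1:            # 2 MANIPULATED layers: assumed >= 1.00
        return 1.00, "CRITICAL"
    score = SUSPICIOUS_SCORE.get(suspicious, 0.0)
    if score == 0.0:
        return 0.0, "NONE"
    return score, ("MODERATE" if score == 0.50 else "HIGH")
```

The sublinear growth across SUSPICIOUS layers (0.50 to 0.70, not 1.50) suggests the score treats correlated weak signals as one compound event rather than summing them.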

614

Total corpus cases

8

Adversarial rounds

0

False negatives

These figures reflect synthetic R12/R14 corpus performance; production calibration is pending paying-customer onboarding.

Is your agent's memory reliable?

Run a preflight check on your memory state. No signup required.