BENCHMARK REPORT · 2026

Memory Reliability
Benchmark

Analysis of 10,000+ agent memory states across domains. How unreliable is AI agent memory — and what does it cost?

34.2%
of agent memories are unreliable at time of action
Based on 10,847 preflight evaluations · Q1 2026
✓ 2,900+ test scenarios ✓ 0 false negatives on R12/R14 corpus ✓ 4 domains analyzed

Unreliability by Domain

Memory unreliability rates vary significantly by domain. High-stakes domains show higher rates due to faster information decay and stricter source requirements.

Fintech Regulatory changes, market data, client status
41.3%
Healthcare Treatment protocols, drug interactions, patient data
38.7%
Legal Case law, regulatory updates, jurisdiction changes
35.1%
General Web content, general knowledge, tool state
22.8%

Key finding: The most common cause of unreliability is temporal decay — memories older than 30 days are 3.4× more likely to conflict with current ground truth. Commercial bias (sponsored content in memory sources) accounts for 18% of flagged entries in fintech.

Top Failure Modes

Why do agent memories fail? Sgraal classifies every BLOCK and WARN decision into failure categories.

47%

Temporal Decay

Memory is too old relative to the action being taken. Weibull decay model: half-life varies by domain (fintech: 7 days, general: 45 days).
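The report names a Weibull decay model with per-domain half-lives but does not publish its parameters. A minimal sketch, assuming freshness is the Weibull survival function S(t) = exp(-(t/λ)^k) with the scale λ derived from the half-life; the shape parameter k is an assumption (k = 1.0 reduces to exponential decay):

```python
import math

# Half-lives quoted in the report (days); other domains are not published.
HALF_LIFE_DAYS = {"fintech": 7.0, "general": 45.0}

def freshness(age_days: float, domain: str, k: float = 1.0) -> float:
    """Weibull survival S(t) = exp(-(t/lam)^k), scaled so S(half_life) = 0.5.

    k is an assumed shape parameter; the report does not specify it.
    """
    t_half = HALF_LIFE_DAYS[domain]
    lam = t_half / math.log(2) ** (1.0 / k)          # forces S(t_half) = 0.5
    return math.exp(-((age_days / lam) ** k))

# A 14-day-old fintech memory (two half-lives, k=1) retains 25% freshness,
# while the same age in the general domain is still above 80%.
```

Under this parameterization, raising k above 1 keeps young memories fresher and then drops them faster near the half-life, which matches the intuition that fintech facts stay valid until an event invalidates them.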

23%

Source Conflict

Two or more memory entries directly contradict each other. Most common in multi-agent systems where agents share memory pools.

18%

Commercial Bias

Memory sourced from sponsored content, affiliate articles, or commercially motivated sources. Detected via commercial_intent scoring.
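The report says commercial bias is "detected via commercial_intent scoring" without describing the classifier. A hypothetical sketch of one simple approach, weighted lexical cues; the cue list and weights below are illustrative assumptions, not Sgraal's actual features:

```python
# Illustrative cues only — the real commercial_intent classifier's features
# and weights are not published in the report.
COMMERCIAL_CUES = {
    "sponsored": 0.6,
    "affiliate": 0.5,
    "partner content": 0.4,
    "promo code": 0.3,
}

def commercial_intent(source_text: str) -> float:
    """Return a 0..1 score from weighted cue matches, capped at 1.0."""
    text = source_text.lower()
    score = sum(w for cue, w in COMMERCIAL_CUES.items() if cue in text)
    return min(score, 1.0)
```

A production classifier would more likely use source-level metadata and a trained model; this sketch only shows the shape of the signal (a bounded score attached to each memory's source).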

12%

Provenance Unknown

Memory entry has no traceable source. Common in agents that summarize web content without preserving source metadata.

API Performance

Sgraal runs 83 scoring models in parallel. End-to-end latency from request to decision:

12ms
p50 latency
Median response time
23ms
p95 latency
95th percentile
41ms
p99 latency
99th percentile
99.97%
Uptime (Q1 2026)
83
Scoring modules per call
355
API endpoints

Methodology

Dataset

Note: This benchmark uses synthetic data. 10,847 memory state evaluations generated using adversarial test patterns and realistic agent memory profiles. Synthetic memories were constructed to represent real-world distributions of temporal decay, source conflict, and commercial bias. All evaluations span 4 domains: fintech, healthcare, legal, and general. No real user data was used.

Reliability Definition

A memory entry is classified as "unreliable" if it receives omega_mem_final ≥ 31 (WARN threshold) on at least one preflight call. This includes temporal decay, source conflict, commercial bias, and provenance failures.
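The definition above reduces to a simple predicate over a memory entry's preflight history. A minimal sketch, assuming each preflight call yields one omega_mem_final score for the entry:

```python
WARN_THRESHOLD = 31  # omega_mem_final >= 31 triggers WARN, per the definition

def is_unreliable(preflight_scores: list[int]) -> bool:
    """An entry is unreliable if at least one preflight call hit WARN."""
    return any(score >= WARN_THRESHOLD for score in preflight_scores)
```

Note the asymmetry this creates: a single WARN anywhere in an entry's history marks it unreliable for the benchmark, so the 34.2% headline rate counts entries that may have passed most of their checks.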

Scoring Models

83 scoring modules evaluated per preflight call: Weibull freshness decay, 5-method drift detection ensemble, source trust scoring, conflict graph analysis, causal graph construction, Entry Shapley attribution, commercial intent classification, compliance profile evaluation, timestamp integrity, identity drift, and consensus collapse detection.
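The fan-out pattern behind a preflight call can be sketched as follows. Two toy modules stand in for the 83 listed above, and taking the maximum module score as omega_mem_final is an assumption; the report does not specify how module outputs are combined:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for the real modules; thresholds here are illustrative.
def freshness_module(entry: dict) -> int:
    return 40 if entry.get("age_days", 0) > 30 else 5

def provenance_module(entry: dict) -> int:
    return 35 if not entry.get("source") else 0

MODULES = [freshness_module, provenance_module]

def preflight(entry: dict) -> int:
    """Run all scoring modules in parallel; combine via max (assumed)."""
    with ThreadPoolExecutor(max_workers=len(MODULES)) as pool:
        scores = list(pool.map(lambda m: m(entry), MODULES))
    return max(scores)
```

Running modules concurrently rather than sequentially is what makes the p50 of 12ms plausible for dozens of independent scorers, since latency is bounded by the slowest module rather than the sum.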

Limitations

This benchmark is based on synthetic evaluations rather than production traffic and may not be representative of all AI agent deployments. Domain-specific rates reflect the memory profiles constructed for each domain. Latency measurements are from Railway (EU West) to client.

Joint Benchmark with Grok

Independent builds, side-by-side results across 8 adversarial corpora.

Round 1–3: Drift & Hallucination

COMPLETE

179 cases: sponsored drift (60), subtle drift (59), hallucination (60)

Sgraal: F1 = 1.000 · FP=0 · FN=0

Grok: F1 = 1.000

Round 4: Real-world Propagation

COMPLETE

90 cases · 4 attack vectors: injection mid-chain, drift amplification, RAG poisoning, API drift

Sgraal: F1 = 1.000

Grok: F1 = 1.000 · <180ms · blast radius <2%

Round 6: Memory Time Attack

COMPLETE

60 cases · timestamp forgery detection · old decisions disguised as fresh

Sgraal: F1 = 1.000 · 60/60

New field: timestamp_integrity: VALID | SUSPICIOUS | MANIPULATED

Round 7: Identity Drift

COMPLETE

90 cases · gradual role and authority escalation across agent hops

Sgraal: F1 = 1.000 · 90/90

New field: identity_drift: CLEAN | SUSPICIOUS | MANIPULATED

Round 8: Silent Consensus Collapse

COMPLETE

90 cases · self-reinforcing false consensus detection

Sgraal: F1 = 1.000 · 90/90

New field: consensus_collapse: CLEAN | SUSPICIOUS | MANIPULATED

Round 5: Multi-model Consensus Poisoning

IN PROGRESS

3 independent stacks syncing on fabricated consensus. Joint corpus with Grok.

Sgraal: Armed · anti-consensus layer active

Grok: Corpus incoming

Compound Attack Detection

When multiple attack vectors fire simultaneously, Sgraal computes a unified attack surface score.

Layers active         attack_surface_score   attack_surface_level
1 layer SUSPICIOUS    0.50                   MODERATE
2 layers SUSPICIOUS   0.65                   HIGH
3 layers SUSPICIOUS   0.70                   HIGH
1 layer MANIPULATED   1.00                   CRITICAL
All 3 MANIPULATED     1.40                   CRITICAL
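One plausible reading of the published scores is a direct lookup in which any MANIPULATED layer dominates SUSPICIOUS counts. A sketch reproducing the five published rows; behavior outside them (e.g. two MANIPULATED layers, mixed states) is an assumption:

```python
# Scores for N layers at SUSPICIOUS, taken directly from the table.
SUSPICIOUS_SCORE = {1: 0.50, 2: 0.65, 3: 0.70}

def attack_surface(suspicious: int, manipulated: int) -> tuple[float, str]:
    """Map per-layer statuses to (attack_surface_score, attack_surface_level)."""
    if manipulated == 3:
        return 1.40, "CRITICAL"
    if manipulated >= 1:            # 2 MANIPULATED layers: assumed >= 1.00
        return 1.00, "CRITICAL"
    score = SUSPICIOUS_SCORE.get(suspicious, 0.0)
    if score == 0.0:
        return 0.0, "NONE"
    return score, ("MODERATE" if score == 0.50 else "HIGH")
```

The sublinear growth across SUSPICIOUS layers (0.50 to 0.70, not 1.50) suggests the score treats correlated weak signals as one compound event rather than summing them.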

614

Total corpus cases

8

Adversarial rounds

0

False negatives

These figures reflect synthetic R12/R14 corpus performance; production calibration is pending paying-customer onboarding.

Is your agent's memory reliable?

Run a preflight check on your memory state. No signup required.