Sgraal | 11 metrics for AI memory health

01. Days until block

What it measures: projected runway before memory drift forces a BLOCK decision, with a confidence interval.

Why it matters: early-warning signal — schedule maintenance ahead of operational impact.

Customer action: if < 7 days, queue refresh / heal / re-source.

02. Confidence calibration

What it measures: whether stated confidence matches actual reliability of the underlying memory.

Why it matters: overconfident memory is the leading silent failure mode for autonomous agents.

Customer action: on OVERCONFIDENT flag, require explicit user confirmation before irreversible action.
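
One way to picture the calibration check, as a minimal sketch: compare the mean stated confidence with the observed hit rate over past decisions. The function names, the 0.1 tolerance, and the input shape are assumptions made for illustration, not Sgraal's actual computation.

```python
def calibration_gap(stated: list[float], outcomes: list[bool]) -> float:
    """Mean stated confidence minus observed accuracy; positive means overconfident."""
    if not stated or len(stated) != len(outcomes):
        raise ValueError("stated and outcomes must be non-empty and the same length")
    mean_confidence = sum(stated) / len(stated)
    accuracy = sum(outcomes) / len(outcomes)  # True counts as 1, False as 0
    return mean_confidence - accuracy


def requires_confirmation(gap: float, tolerance: float = 0.1) -> bool:
    """Hypothetical policy: ask the user when the gap exceeds the tolerance."""
    return gap > tolerance
```

With four decisions each stated at 0.9 confidence and only two correct, the gap is about 0.4, so this policy would ask for confirmation.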

03. Knowledge age

What it measures: wall-clock age of the memory entries the decision relied on.

Why it matters: reasoning quality decays with stale data. Some domains tolerate weeks; others minutes.

Customer action: tune per-domain freshness threshold; alert when crossed.
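
A per-domain freshness check could look like the sketch below. The domain names and limits are hypothetical placeholders; real thresholds depend on how quickly each domain's data goes stale.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-domain limits: some domains tolerate weeks, others minutes.
FRESHNESS_LIMITS = {
    "market_data": timedelta(minutes=5),
    "customer_profile": timedelta(weeks=2),
}


def is_stale(domain: str, entry_timestamp: datetime, now: datetime) -> bool:
    """True when the entry is older than its domain's freshness limit."""
    return now - entry_timestamp > FRESHNESS_LIMITS[domain]
```

Passing `now` explicitly (for example `datetime.now(timezone.utc)`) keeps the check deterministic and easy to test.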

04. Top-ROI heal

What it measures: which single memory entry yields the largest decision improvement if healed.

Why it matters: healing is not free; the metric prioritizes the one entry where intervention buys the most safety.

Customer action: automate refresh of the flagged entry; defer others.
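
As an illustration of the prioritization idea, the sketch below picks the entry with the best improvement-to-cost ratio. Treating ROI as improvement divided by heal cost is an assumption, as are the function name and the input format.

```python
def top_roi_entry(candidates: dict[str, tuple[float, float]]) -> str:
    """Return the entry id with the highest improvement-to-cost ratio.

    `candidates` maps entry id -> (expected decision improvement, heal cost).
    Heal costs are assumed to be strictly positive.
    """
    if not candidates:
        raise ValueError("no candidate entries")
    return max(candidates, key=lambda entry_id: candidates[entry_id][0] / candidates[entry_id][1])
```

Only the returned entry would be refreshed automatically; the rest stay deferred.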

05. Fleet health distance

What it measures: how this agent's memory profile compares to the fleet aggregate distribution.

Why it matters: outliers are early signals of either compromise or genuinely novel context.

Customer action: investigate agents whose distance grows monotonically; consider quarantine.
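
Detecting monotonic growth in an agent's distance readings can be sketched as follows. The minimum of three readings is an arbitrary assumption, and the readings are assumed to be ordered oldest to newest.

```python
def grows_monotonically(distances: list[float], min_points: int = 3) -> bool:
    """True when every reading is strictly larger than the one before it."""
    if len(distances) < min_points:
        return False  # too few readings to call it a trend
    return all(later > earlier for earlier, later in zip(distances, distances[1:]))
```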

06. Memory complexity trend

What it measures: direction and rate of change in the structural complexity of recent memory.

Why it matters: rising complexity without rising relevance indicates accumulation without curation.

Customer action: trigger MVMem retention sweep when trend stays positive across 7+ calls.
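
The "positive across 7+ calls" rule might be implemented along these lines. The function name is hypothetical, and the trend values are assumed to arrive as one signed number per call, oldest first.

```python
def should_sweep(trend_values: list[float], required_calls: int = 7) -> bool:
    """True when the most recent `required_calls` trend readings are all positive."""
    if len(trend_values) < required_calls:
        return False
    return all(value > 0 for value in trend_values[-required_calls:])
```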

07. Decision cost asymmetry

What it measures: relative cost of false-allow vs false-block for the current action and domain.

Why it matters: a 1% false-allow on a $5M wire transfer is not the same as 1% on a chat reply.

Customer action: auto-tune your local approval workflow to match.
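
The asymmetry can be made concrete with a standard expected-loss comparison: a 1% false-allow rate on a $5M transfer is an expected loss of $50,000, while the same rate on a low-stakes reply costs almost nothing. The sketch below is an illustration under that assumption, not Sgraal's scoring method; the function names and cost figures are hypothetical.

```python
def expected_loss(error_rate: float, cost_per_error: float) -> float:
    """Expected cost of an error that occurs with the given probability."""
    return error_rate * cost_per_error


def choose_action(p_bad: float, cost_false_allow: float, cost_false_block: float) -> str:
    """Allow only when allowing has the lower expected loss.

    p_bad is the probability that the action should have been blocked.
    """
    loss_if_allow = expected_loss(p_bad, cost_false_allow)
    loss_if_block = expected_loss(1 - p_bad, cost_false_block)
    return "ALLOW" if loss_if_allow < loss_if_block else "BLOCK"
```

With `p_bad = 0.01`, a $5M false-allow cost, and a $100 false-block cost, the rule blocks; with a $1 false-allow cost and a $0.50 false-block cost, it allows.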

08. Single point of failure

What it measures: the entry whose removal would change the decision most — and the score quantifying it.

Why it matters: a decision driven by one source is fragile. Diversification is the fix.

Customer action: add corroborating sources before acting; or accept the risk consciously.
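
A leave-one-out comparison is one simple way to find such an entry. The sketch assumes the decision score is the plain mean of per-entry scores, which is a simplification chosen for illustration; the function name is hypothetical.

```python
def most_influential_entry(entries: dict[str, float]) -> tuple[str, float]:
    """Leave-one-out: find the entry whose removal shifts the mean score the most."""
    if len(entries) < 2:
        raise ValueError("need at least two entries for a leave-one-out comparison")
    baseline = sum(entries.values()) / len(entries)
    best_id, best_shift = "", -1.0
    for entry_id in entries:
        rest = [score for key, score in entries.items() if key != entry_id]
        shift = abs(baseline - sum(rest) / len(rest))
        if shift > best_shift:
            best_id, best_shift = entry_id, shift
    return best_id, best_shift
```

A large shift relative to the baseline suggests the decision leans on that one entry and would benefit from corroboration.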

09. Monoculture risk

What it measures: diversity of provenance and structural independence of supporting evidence.

Why it matters: three sources that all trace back to the same root agent are one source wearing three hats.

Customer action: on HIGH, require human approval before irreversible action.
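
The "three hats" case can be checked by following each source back to its root and counting distinct roots. This sketch assumes provenance is available as a simple child-to-parent map, which is a hypothetical representation.

```python
from typing import Optional


def count_independent_roots(parents: dict[str, Optional[str]], sources: list[str]) -> int:
    """Follow each source's parent chain to its root and count the distinct roots."""
    roots = set()
    for source in sources:
        node = source
        seen = set()
        # Walk upward until a node has no parent; `seen` guards against cycles.
        while parents.get(node) is not None and node not in seen:
            seen.add(node)
            node = parents[node]
        roots.add(node)
    return len(roots)
```

Three sources that share one root count as a single independent source, which is the situation that should raise the risk level.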

10. Insight summary

What it measures: compact narrative explanation of the decision in plain English.

Why it matters: compliance teams, end users, and operators all need an explainability surface — not raw scores.

Customer action: include in audit trail; surface to end-user UI on ASK_USER decisions.

11. Knowledge age confidence interval

What it measures: uncertainty bounds on the days-until-block estimate.

Why it matters: a point estimate of "5 days" with a ±0.5-day CI is actionable; the same estimate with a ±10-day CI is not.

Customer action: only act on the projection when CI width is below your domain threshold.
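
Gating on interval width could be as small as the sketch below. The function name and the 2-day limit used in the example are assumptions; the CI is assumed to arrive as lower and upper bounds in days.

```python
def projection_is_actionable(ci_low: float, ci_high: float, max_width_days: float) -> bool:
    """True when the confidence interval is narrow enough to act on."""
    return (ci_high - ci_low) <= max_width_days
```

With a 2-day limit, an estimate of 5 days ± 0.5 (bounds 4.5 to 5.5, width 1) passes, while 5 days ± 10 (width 20) does not.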
