A conformance test for stateful AI systems
This project is maintained by Parslee-ai
This document provides a portable specification for implementing a state-based context management system. It is designed to be implementation-agnostic and can be used as a reference for any language or framework.
Paper: For theoretical foundations and empirical evaluation, see Beyond Conversation: A State-Based Context Architecture for Enterprise AI Agents (Liotta, 2025).
The state-based approach manages context through explicit state layers rather than raw transcript replay. Each layer serves a specific purpose and has defined semantics for reads, writes, and supersession.
This document describes the full specification for production systems. The StateBench
reference implementation (state_based baseline) demonstrates core concepts but intentionally
omits some optimizations to isolate the effect of supersession tracking:
| Feature | Full Spec | StateBench Reference |
|---|---|---|
| Four-layer context assembly | ✓ | ✓ |
Supersession tracking (is_valid, pointers) |
✓ | ✓ |
| Scope inference (global/task/hypothetical/draft) | ✓ | ✓ |
| Dependency tracking & repair propagation | ✓ | ✓ |
| Tri-partite memory classification | ✓ | ✓ |
| Constraint detection & emphasis | ✓ | ✓ |
| Relevance ranking by query | ✓ | ✗ (timestamp sort) |
| Token budget management (70% rule) | ✓ | ✗ (all facts included) |
| NLU-based supersession detection | ✓ | ✗ (explicit events only) |
This design isolates the effect of supersession tracking from retrieval optimizations, providing a cleaner ablation for benchmarking purposes.
Purpose: Immutable context about who the user is and their authority level.
Schema:
IdentityRole {
user_name: string
user_id: string
authority: string // e.g., "admin", "manager", "employee", "guest"
department: string
organization: string
permissions: string[] // explicit permission grants
}
Semantics:
Purpose: Long-term facts that persist across sessions and may be superseded.
Persistent Facts use a tri-partite memory structure (from the paper):
Schema:
PersistentFact {
id: string // unique fact identifier (required)
key: string // semantic key for the fact
value: any // the fact content
source: Source // provenance information (see below)
timestamp: datetime // when fact was established
is_valid: boolean // whether fact is current (not superseded)
superseded_by: string? // id of fact that superseded this one
supersedes: string? // id of fact this one supersedes
// Tri-partite classification
memory_type: "user" | "capability" | "organizational"
// Scope tracking (prevents scope leak)
scope: "global" | "task" | "hypothetical" | "draft" | "session"
scope_id: string? // identifies specific task/session
// Dependency tracking (enables repair propagation)
depends_on: string[] // ids of facts this depends on
derived_facts: string[] // ids of facts derived from this
needs_review: boolean // true if dependency was invalidated
// Constraint metadata
is_constraint: boolean // formal policy constraint
constraint_type: string? // "budget", "deadline", "capacity", "policy"
}
Source {
type: "conversation" | "state_write" | "system" | "policy"
turn_index: int? // for conversation sources
event_index: int? // for state_write sources
authority: string? // authority level of source
}
Memory Type Classification:
organizational: source is policy, finance_system, hr_system, etc.capability: source is observation, pattern, heuristicuser: source is decision, user, preference (default)Semantics:
is_valid = falseneeds_review = truePurpose: Session-local scratch data that doesn’t persist.
Schema:
WorkingSetItem {
key: string
value: any
created_at: datetime
expires_at: datetime? // optional TTL
}
Semantics:
Purpose: External context that changes independently of conversation.
Schema:
Environment {
current_time: datetime
current_date: date
timezone: string
location: string?
external_data: map<string, any> // API data, system state, etc.
}
Semantics:
When new information is provided:
function write(layer: Layer, key: string, value: any, supersedes: string?) {
if layer == PERSISTENT_FACTS {
if supersedes != null {
// Mark old fact as superseded
old_fact = get(PERSISTENT_FACTS, supersedes)
old_fact.is_valid = false
old_fact.superseded_by = key
}
create(PERSISTENT_FACTS, {
key: key,
value: value,
source: determine_source(),
timestamp: now(),
is_valid: true,
supersedes: supersedes
})
} else {
// Other layers have simpler semantics
upsert(layer, key, value)
}
}
When building context for a query:
function read_valid_facts() -> list<PersistentFact> {
return filter(PERSISTENT_FACTS, f => f.is_valid == true)
}
function build_context(query: string, token_budget: int) -> string {
sections = []
// Layer 1: Identity (always include)
sections.append(format_identity(IDENTITY))
// Layer 2: Valid persistent facts only
valid_facts = read_valid_facts()
sections.append(format_facts(valid_facts))
// Layer 3: Relevant working set items
working = filter(WORKING_SET, item => is_relevant(item, query))
sections.append(format_working_set(working))
// Layer 4: Current environment
sections.append(format_environment(ENVIRONMENT))
// Truncate to budget if needed
return truncate_to_budget(sections, token_budget)
}
Before using any fact:
function is_fact_current(key: string) -> boolean {
fact = get(PERSISTENT_FACTS, key)
if fact == null {
return false
}
return fact.is_valid
}
function get_current_value(key: string) -> any {
fact = get(PERSISTENT_FACTS, key)
while fact != null && !fact.is_valid {
fact = get(PERSISTENT_FACTS, fact.superseded_by)
}
return fact?.value
}
The full specification recognizes four types of supersession detection:
When a new fact explicitly invalidates an old fact:
User: "My address is 123 Main St" -> creates address_v1
User: "I moved to 456 Oak Ave" -> creates address_v2, supersedes address_v1
Detection: Keywords like “new”, “changed”, “moved”, “updated”, “actually”, “correction”, “instead”, “now”
When new information contradicts existing fact without explicit reference:
Fact: "Meeting is Tuesday at 2pm"
User: "Let's do the meeting Thursday at 3pm"
Detection: Same entity (meeting) + conflicting values (different day/time)
Higher-authority sources can supersede lower-authority sources:
Authority hierarchy: policy > manager > employee > guest
Employee says: "We can offer 25% discount"
Policy states: "Max discount is 15%"
-> Policy supersedes employee's statement
More recent information supersedes older for time-sensitive facts:
Jan 1: "Stock price is $100"
Jan 15: "Stock price is $95"
-> Jan 15 value supersedes Jan 1
Implementation Note: The StateBench reference baseline does not infer supersession from conversation text. Instead, it relies on explicit
Supersessionevents in the timeline format. This is intentional: StateBench tests whether a system correctly handles supersession once detected, not whether it correctly detects supersession. Production systems would need NLU components to implement the full detection rules.
On every query, context is assembled by iterating through layers in priority order:
is_valid == true)The key constraint is that superseded facts are never included. This is what prevents “resurrection” failures where dead facts reappear in responses.
function build_query_context(query: string, budget: int) -> string {
context_parts = []
remaining_budget = budget
// Step 1: Always include identity (minimal tokens)
identity_section = format_identity(IDENTITY)
context_parts.append(identity_section)
remaining_budget -= token_count(identity_section)
// Step 2: Include current environment
env_section = format_environment(ENVIRONMENT)
context_parts.append(env_section)
remaining_budget -= token_count(env_section)
// Step 3: Include relevant valid facts
valid_facts = read_valid_facts()
relevant_facts = rank_by_relevance(valid_facts, query)
facts_section = ""
for fact in relevant_facts {
fact_text = format_fact(fact)
if token_count(facts_section + fact_text) <= remaining_budget * 0.7 {
facts_section += fact_text
} else {
break
}
}
context_parts.append(facts_section)
remaining_budget -= token_count(facts_section)
// Step 4: Include relevant working set
working_items = filter(WORKING_SET, item => is_relevant(item, query))
working_section = format_working_set(working_items)
if token_count(working_section) <= remaining_budget {
context_parts.append(working_section)
}
return join(context_parts, "\n\n")
}
Implementation Note: The StateBench reference baseline implements steps 1-3 (identity, environment, valid-only filtering) but omits steps 4-5 (relevance ranking, token budgeting). Facts are sorted by timestamp rather than query relevance, and all valid facts are included regardless of token count. These simplifications isolate the effect of supersession tracking from retrieval optimizations, providing a cleaner ablation. Production systems operating under real token constraints would need to implement the full algorithm.
Empirical evaluation reveals a fundamental tension: approaches that provide more context achieve higher decision accuracy but also higher resurrection rates (SFRR). Approaches that provide less context have lower resurrection rates but miss relevant information.
This suggests that resurrection failures are not solely a context management problem—they also reflect model limitations in distinguishing valid from superseded facts even when supersession metadata is explicit.
The tradeoff suggests a tunable parameter: the proportion of token budget allocated to facts (currently 70% in the full specification). Adjusting this enables deployment-specific tuning based on which failure mode is more costly:
| Use Case | Recommended Tuning |
|---|---|
| High-stakes decisions (medical, financial) | Lower context density, prioritize SFRR |
| Comprehensive responses (research, analysis) | Higher context density, accept resurrection risk |
| Balanced | Default 70% allocation |
A production system might achieve better SFRR-Accuracy balance by combining supersession tracking with aggressive relevance filtering, rather than including all valid facts.
For a conforming implementation:
is_valid flag and superseded_by pointer for persistent facts# BAD: Shows all facts including invalid ones
facts = get_all(PERSISTENT_FACTS)
# GOOD: Only valid facts
facts = filter(PERSISTENT_FACTS, f => f.is_valid)
# BAD: Just use latest mention
latest_address = get_latest_mention("address")
# GOOD: Follow supersession chain
current_address = get_current_value("address")
# BAD: Anyone can set any fact
write(PERSISTENT_FACTS, "discount_policy", "50%")
# GOOD: Check authority before write
if user.authority >= POLICY_WRITE_AUTHORITY {
write(PERSISTENT_FACTS, "discount_policy", "50%")
}
Input:
- Write: {key: "status_v1", value: "approved"}
- Write: {key: "status_v2", value: "cancelled", supersedes: "status_v1"}
- Query: "What is the current status?"
Expected:
- Context contains: "cancelled"
- Context does NOT contain: "approved"
Input:
- Write: {key: "order_v1", value: "approved"} (mentioned 3x)
- Write: {key: "order_v2", value: "cancelled", supersedes: "order_v1"} (mentioned 1x)
- Query: "Should we proceed with the order?"
Expected:
- Decision: NO
- Frequency of mention should NOT affect supersession
Input:
- Identity: {authority: "intern"}
- Fact: {key: "policy", value: "max 15%", source: "CFO"}
- User says: "Let's offer 25%"
- Query: "Can we offer 25%?"
Expected:
- Decision: NO
- Context includes policy fact
- Intern's suggestion does NOT supersede CFO policy
id field to
PersistentFact, structured Source object for provenance tracking