Runtime Persistence & Session Recovery

Overview

Phase 10 transforms the AIRI runtime from process-bound orchestration into recoverable persistent infrastructure. The runtime can now survive daemon restarts by persisting state to disk and replaying events to reconstruct in-memory state.

Architecture

Append-Only Event Store

The core persistence primitive is an append-only event store. Every significant state transition (plan started, step completed, task failed, etc.) is recorded as an immutable event with a monotonic sequence number.

Event ID format: evt_{sequence}_{timestamp}
Example: evt_42_1706119234567

Events are ordered by sequence number, not by timestamp. This guarantees deterministic ordering even when multiple events share the same timestamp.

Key properties:

Events are never mutated or deleted (append-only).
Each event gets a unique, monotonically increasing sequence number.
Events can be queried by session, module, type, or execution ID, or for all events since a given event ID.
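The properties above can be illustrated with a minimal in-memory sketch. It is not the actual InMemoryEventStore: the names SketchEventStore, SketchRuntimeEvent, and parseSketchSequence, the event field names, and the choice to make "since" exclusive of the given event ID are all assumptions made for illustration. Only the ID format and the sequence-based ordering come from the text.

```typescript
// Hypothetical event shape; the field names are assumptions, not the real types.ts.
interface SketchRuntimeEvent {
  id: string; // evt_{sequence}_{timestamp}
  sequence: number; // monotonic, assigned by the store
  timestamp: number; // milliseconds since the Unix epoch
  type: string; // e.g. "plan.started"
  sessionId?: string;
  payload: unknown;
}

type SketchNewEvent = Omit<SketchRuntimeEvent, "id" | "sequence" | "timestamp">;

// Extracts the sequence number from an ID of the form evt_{sequence}_{timestamp}.
function parseSketchSequence(eventId: string): number {
  const match = /^evt_(\d+)_(\d+)$/.exec(eventId);
  if (!match) throw new Error(`Malformed event ID: ${eventId}`);
  return Number(match[1]);
}

class SketchEventStore {
  private readonly events: SketchRuntimeEvent[] = [];
  private nextSequence = 1;
  private readonly now: () => number;

  // The clock is injectable so that tests can produce stable IDs.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Appends an immutable event and assigns the next sequence number.
  append(input: SketchNewEvent): SketchRuntimeEvent {
    const sequence = this.nextSequence++;
    const timestamp = this.now();
    const event: SketchRuntimeEvent = Object.freeze({
      ...input,
      id: `evt_${sequence}_${timestamp}`,
      sequence,
      timestamp,
    });
    this.events.push(event);
    return event;
  }

  // Returns events strictly after the given event ID (assumed exclusive),
  // ordered by sequence number rather than timestamp.
  since(eventId: string): SketchRuntimeEvent[] {
    const sequence = parseSketchSequence(eventId);
    return this.events.filter((e) => e.sequence > sequence);
  }

  byType(type: string): SketchRuntimeEvent[] {
    return this.events.filter((e) => e.type === type);
  }
}
```

Because ordering relies only on the sequence number, two events appended within the same millisecond still have a well-defined order.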

Snapshot Strategy

Snapshots are point-in-time captures of the full runtime state. They are:

Versioned: each save increments the version number.
Lightweight: only essential state is serialized (plans, tasks, capabilities, sessions).
Prunable: old snapshots are automatically removed, keeping only the N most recent.

Recovery loads the latest snapshot and replays only the events recorded after it. This bounds recovery time: instead of replaying the entire event history, the runtime replays only the events since the last snapshot.
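As a rough illustration of versioning and pruning, the sketch below keeps snapshots in memory. The names SketchSnapshot and SketchSnapshotStore, the lastEventSequence field, and the JSON-based deep copy are assumptions made for this example. The real InMemorySnapshotStore and SnapshotManager may be structured differently.

```typescript
// Hypothetical snapshot shape; lastEventSequence marks where replay would resume.
interface SketchSnapshot<S> {
  version: number;
  lastEventSequence: number;
  state: S;
}

class SketchSnapshotStore<S> {
  private snapshots: SketchSnapshot<S>[] = [];
  private nextVersion = 1;
  private readonly maxSnapshots: number;

  constructor(maxSnapshots: number) {
    if (maxSnapshots < 1) throw new Error("maxSnapshots must be at least 1");
    this.maxSnapshots = maxSnapshots;
  }

  // Saves a point-in-time copy of the state and prunes older snapshots.
  save(state: S, lastEventSequence: number): SketchSnapshot<S> {
    const snapshot: SketchSnapshot<S> = {
      version: this.nextVersion++,
      lastEventSequence,
      // Deep copy via JSON; this assumes the state is JSON-serializable.
      state: JSON.parse(JSON.stringify(state)) as S,
    };
    this.snapshots.push(snapshot);
    // Keep only the N most recent snapshots.
    this.snapshots = this.snapshots.slice(-this.maxSnapshots);
    return snapshot;
  }

  latest(): SketchSnapshot<S> | undefined {
    return this.snapshots[this.snapshots.length - 1];
  }

  versions(): number[] {
    return this.snapshots.map((s) => s.version);
  }
}
```

Version numbers keep increasing even after pruning, so a retained snapshot can always be told apart from one that was removed.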

Deterministic Recovery Philosophy

Recovery is deterministic: given the same snapshot and the same events, the recovered state is always the same. There are no "smart" recovery heuristics and no implicit magic.

Recovery flow:

Load the latest snapshot.
Replay events since the snapshot.
Restore planner state (active plans, pending steps).
Restore active executions (running tasks).
Restore session ownership.
Reconcile incomplete executions (mark executions that are recorded as running but are not actually running as failed).

Reconciliation: Executions that were running at snapshot time but are not actively running at recovery time are marked as failed. This handles the case where the daemon crashed mid-execution.
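The load, replay, and reconcile steps can be sketched as a pure function, which is one way to get the determinism described above. This is not the RecoveryCoordinator itself. The state shape, the execution.* event names, and the activeExecutionIds parameter are hypothetical, and planner and session restoration are omitted for brevity.

```typescript
type SketchExecutionStatus = "running" | "completed" | "failed";

// Hypothetical runtime state: execution ID mapped to its last known status.
interface SketchRuntimeState {
  executions: Record<string, SketchExecutionStatus>;
}

// Hypothetical event names; the real event types may differ.
interface SketchRecoveryEvent {
  sequence: number;
  type: "execution.started" | "execution.completed" | "execution.failed";
  executionId: string;
}

interface SketchRecoveryInput {
  snapshot: { lastEventSequence: number; state: SketchRuntimeState } | undefined;
  events: SketchRecoveryEvent[];
  // Executions that are genuinely alive in the recovering process.
  activeExecutionIds: ReadonlySet<string>;
}

// Pure reducer: returns a new state and never mutates its input.
function applySketchEvent(
  state: SketchRuntimeState,
  event: SketchRecoveryEvent,
): SketchRuntimeState {
  const status: SketchExecutionStatus =
    event.type === "execution.started"
      ? "running"
      : event.type === "execution.completed"
        ? "completed"
        : "failed";
  return { executions: { ...state.executions, [event.executionId]: status } };
}

function recoverSketch(input: SketchRecoveryInput): SketchRuntimeState {
  // 1. Load the latest snapshot, or start from an empty state.
  const base: SketchRuntimeState = input.snapshot?.state ?? { executions: {} };
  const fromSequence = input.snapshot?.lastEventSequence ?? 0;

  // 2. Replay only events after the snapshot, ordered by sequence number.
  const replayed = [...input.events]
    .filter((e) => e.sequence > fromSequence)
    .sort((a, b) => a.sequence - b.sequence)
    .reduce(applySketchEvent, base);

  // 3. Reconcile: "running" without a live execution becomes "failed".
  const executions: Record<string, SketchExecutionStatus> = {};
  for (const [id, status] of Object.entries(replayed.executions)) {
    const orphaned = status === "running" && !input.activeExecutionIds.has(id);
    executions[id] = orphaned ? "failed" : status;
  }
  return { executions };
}
```

Since the function reads nothing but its arguments, calling it twice with the same snapshot, events, and active set yields identical results.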

Session Persistence

Sessions survive frontend disconnects. When a client disconnects:

The session is marked as "detached" but retained.
A recovery token is generated for reconnection.
The session can be resumed by providing the recovery token.

Sessions are only destroyed on:

Explicit cleanup (e.g., user logs out).
Expiry (detached sessions older than a configurable threshold).
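The detach, resume, and expiry lifecycle might look roughly like the sketch below. It is not the PersistentSessionManager. The class and method names, the single-use treatment of recovery tokens, and the injected clock and token generator are assumptions. A real implementation would presumably generate tokens from a cryptographically secure source.

```typescript
// Hypothetical session record; the field names are assumptions.
interface SketchSessionRecord {
  id: string;
  status: "attached" | "detached";
  recoveryToken: string | null;
  detachedAt: number | null;
}

class SketchSessionManager {
  private readonly sessions = new Map<string, SketchSessionRecord>();
  private readonly now: () => number;
  private readonly newToken: () => string;

  // The clock and token generator are injected; use a CSPRNG-backed
  // generator for tokens outside of tests.
  constructor(now: () => number, newToken: () => string) {
    this.now = now;
    this.newToken = newToken;
  }

  create(id: string): SketchSessionRecord {
    const record: SketchSessionRecord = {
      id,
      status: "attached",
      recoveryToken: null,
      detachedAt: null,
    };
    this.sessions.set(id, record);
    return record;
  }

  // Client disconnected: retain the session and hand out a recovery token.
  detach(id: string): string {
    const record = this.sessions.get(id);
    if (!record) throw new Error(`Unknown session: ${id}`);
    const token = this.newToken();
    record.status = "detached";
    record.recoveryToken = token;
    record.detachedAt = this.now();
    return token;
  }

  // Reattaches the session that owns the token, if any.
  resume(token: string): SketchSessionRecord | undefined {
    for (const record of Array.from(this.sessions.values())) {
      if (record.status === "detached" && record.recoveryToken === token) {
        record.status = "attached";
        record.recoveryToken = null; // assumption: tokens are single-use
        record.detachedAt = null;
        return record;
      }
    }
    return undefined;
  }

  // Explicit cleanup, e.g. when the user logs out.
  destroy(id: string): boolean {
    return this.sessions.delete(id);
  }

  // Destroys detached sessions older than the threshold; returns their IDs.
  expireDetached(maxDetachedMs: number): string[] {
    const cutoff = this.now() - maxDetachedMs;
    const expired: string[] = [];
    for (const record of Array.from(this.sessions.values())) {
      const detachedAt = record.detachedAt;
      if (record.status === "detached" && detachedAt !== null && detachedAt < cutoff) {
        this.sessions.delete(record.id);
        expired.push(record.id);
      }
    }
    return expired;
  }
}
```

Attached sessions are never touched by expireDetached, which matches the rule that sessions are destroyed only on explicit cleanup or expiry.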

Planner Persistence Integration

The PlanExecutor accepts an optional EventStore. When configured:

All plan lifecycle events (plan.started, plan.completed, plan.failed, plan.cancelled) are persisted to the event store.
All step lifecycle events (step.started, step.completed, step.failed) are persisted.
Plans can be marked as resumable, meaning they can be restored after a restart.
Already-completed steps are skipped on replay (idempotent recovery).
Cancelled plans remain cancelled after restart.
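The idempotent-resume behaviour described above can be sketched as follows. This is not the PlanExecutor. The function resumePlanSketch, the event shape, and the use of a plain array as the event log are hypothetical. Treating plan.completed as terminal is an additional assumption; the text only states this for cancelled plans. The event type strings are the ones listed above.

```typescript
// Hypothetical event shape; stepId is present only on step.* events.
interface SketchPlanEvent {
  type: string;
  planId: string;
  stepId?: string;
}

// Resumes a plan from its event history and returns the steps actually run.
// The log array stands in for an event store and is appended to in place.
function resumePlanSketch(
  planId: string,
  stepIds: string[],
  log: SketchPlanEvent[],
  runStep: (stepId: string) => void,
): string[] {
  const history = log.filter((e) => e.planId === planId);

  // Cancelled plans stay cancelled. Completed plans are also treated as
  // terminal here, which is an assumption of this sketch.
  const isTerminal = history.some(
    (e) => e.type === "plan.cancelled" || e.type === "plan.completed",
  );
  if (isTerminal) return [];

  // Collect steps that already completed before the restart.
  const done = new Set<string>();
  for (const e of history) {
    if (e.type === "step.completed" && e.stepId !== undefined) done.add(e.stepId);
  }

  const executed: string[] = [];
  for (const stepId of stepIds) {
    if (done.has(stepId)) continue; // idempotent: skip completed steps
    log.push({ type: "step.started", planId, stepId });
    try {
      runStep(stepId);
    } catch {
      log.push({ type: "step.failed", planId, stepId });
      log.push({ type: "plan.failed", planId });
      return executed;
    }
    log.push({ type: "step.completed", planId, stepId });
    executed.push(stepId);
  }
  log.push({ type: "plan.completed", planId });
  return executed;
}
```

A step that was started but never completed, for example because the daemon crashed, is run again on resume, so step implementations in this sketch need to tolerate re-execution.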

Execution Persistence Integration

The ExecutionTrace accepts an optional EventStore. When configured:

Each execution record is persisted to the event store.
Execution IDs remain stable across restarts (they're already UUIDs).
Failed recovery attempts become events in the store.

The LocalToolRuntime also accepts an optional EventStore for persisting tool execution lifecycle events.
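The "optional EventStore" pattern shared by these components might look like the sketch below: the component works without a store and, when one is supplied, mirrors each record into it. The names SketchEventSink, SketchExecutionRecord, and SketchExecutionTrace and the event type "execution.recorded" are hypothetical, not the actual ExecutionTrace API.

```typescript
// Hypothetical minimal view of an event store: anything that can append.
interface SketchEventSink {
  append(event: { type: string; executionId: string; payload: unknown }): void;
}

// Hypothetical execution record; executionId is expected to be a UUID.
interface SketchExecutionRecord {
  executionId: string;
  tool: string;
  status: "running" | "completed" | "failed";
}

class SketchExecutionTrace {
  private readonly records: SketchExecutionRecord[] = [];
  private readonly sink: SketchEventSink | undefined;

  // Persistence is opt-in: omit the sink to keep the trace purely in memory.
  constructor(sink?: SketchEventSink) {
    this.sink = sink;
  }

  record(entry: SketchExecutionRecord): void {
    this.records.push(entry);
    // Mirror the record into the store only when one was configured.
    this.sink?.append({
      type: "execution.recorded",
      executionId: entry.executionId,
      payload: entry,
    });
  }

  all(): SketchExecutionRecord[] {
    return [...this.records];
  }
}
```

Keeping the store optional means existing callers and tests that do not care about persistence need no changes.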

File Structure

core/
  persistence/
    types.ts                    — Persistence abstractions (interfaces)
    event-store.ts              — InMemoryEventStore + PersistedEventStore
    snapshots.ts                — InMemorySnapshotStore + SnapshotManager
    index.ts                    — Barrel export
    adapters/
      filesystem/
        adapter.ts              — FilesystemPersistenceAdapter
        event-store.ts          — FilesystemEventStore
        snapshot-store.ts       — FilesystemSnapshotStore
        runtime-state-store.ts  — FilesystemRuntimeStateStore
        index.ts                — Barrel export
  session/
    types.ts                    — Persistent session types
    session-manager.ts          — PersistentSessionManager
  runtime/
    recovery.ts                 — RecoveryCoordinator
    execution-trace.ts          — Modified: optional EventStore persistence
    local-tool-runtime.ts       — Modified: optional EventStore persistence
  planner/
    executor.ts                 — Modified: optional EventStore persistence
  __tests__/
    persistence.test.ts         — Persistence layer tests
    filesystem-adapter.test.ts  — Filesystem adapter tests

Why Persistence Precedes Autonomy

Before the runtime can support autonomous operation (self-directed task execution, long-running plans), it must be able to survive restarts. A plan that takes hours to complete cannot afford to lose progress because of a daemon restart.

Persistence provides the foundation for:

Long-running plans: plans that span daemon restarts.
Session continuity: clients can disconnect and reconnect without losing state.
Auditability: the event store provides a complete history of all state transitions.
Debugging: replaying events to understand how a particular state was reached.

Future Distributed Runtime Implications

The persistence abstractions are designed to support future distributed scenarios:

PersistenceAdapter can be implemented for remote blob stores (S3, GCS).
EventStore can be backed by distributed log systems (Kafka, Pulsar).
SessionOwnership tracks which process owns a session, enabling session affinity.
RecoveryCoordinator can be extended to coordinate recovery across multiple nodes.

Known Limitations

Filesystem-only: Currently only filesystem-based persistence is implemented. No database or remote storage backends.
Full event scan: The filesystem event store does a full file scan for queries. This is acceptable for bounded event counts but will need indexing for large-scale use.
No event pruning: The event store does not yet prune old events. A future version should implement time-based or count-based retention.
Single-process: Recovery assumes a single daemon process. No distributed coordination is implemented.
No encryption: Data is stored in plaintext on disk. Sensitive data should be encrypted at rest in production deployments.

Future Migration Recommendations

Database backend: Implement PersistenceAdapter for SQLite or PostgreSQL for better query performance and concurrent access.
Event indexing: Add an index layer (e.g., B-tree) for efficient event queries without full scans.
Snapshot compression: Compress snapshot files to reduce disk usage.
Incremental snapshots: Instead of full snapshots, store only the delta since the last snapshot.
Distributed consensus: For multi-node deployments, use a consensus protocol (Raft) to coordinate recovery.