Production Debugging at Scale

Scenario: Production system with 167+ accumulated memories needs debugging of an issue from 3 days ago.

Problem: Manually searching hundreds of memories for the exact state when the problem occurred is impractical.

Solution: Time-travel instantly to any point in production history with complete memory context.

Overview

In production AI systems, memories accumulate over weeks and months of user interactions. When a user reports a problem from several days ago, debugging by hand becomes impractical: finding the exact state at the time of the issue means searching through hundreds of memories.

Memoir provides production-scale time-travel debugging: jump to any snapshot in the system's history and see the complete memory context as it was at that point, regardless of how many memories have accumulated since.
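To make the idea concrete, here is a minimal, self-contained sketch of snapshot-based time travel. `ToySnapshotStore` is a hypothetical stand-in written for illustration and is not Memoir's implementation; it copies the whole state on every snapshot, which a real versioned store would likely avoid.

```python
from copy import deepcopy


class ToySnapshotStore:
    """Hypothetical in-memory stand-in used only to illustrate the idea."""

    def __init__(self):
        self._live = {}        # path -> content (current state)
        self._snapshots = {}   # snapshot name -> frozen copy of the state

    def store(self, path, content):
        self._live[path] = content

    def snapshot(self, name):
        self._snapshots[name] = deepcopy(self._live)

    def view(self, name):
        """Return the state exactly as it was when `name` was taken."""
        return self._snapshots[name]


store = ToySnapshotStore()
store.store("preferences.ui.theme", "dark")
store.snapshot("before_bug")
store.store("system.errors.accessibility", "yellow on white")
store.snapshot("at_bug")

# Later writes do not change what an earlier snapshot shows.
print(len(store.view("before_bug")), len(store.view("at_bug")))  # prints: 1 2
```

The point of the sketch is the lookup by snapshot name: the debugger asks for a named state rather than filtering the current state by time.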

Large-Scale Production Timeline

Production Memory Timeline (167+ memories):

Week 1-4: [84 memories] ──→ Checkpoint: 145 memories
Day 1: UI prefs (147) ──→ Day 2: Theme (147) ──→ Day 3: BUG (167)
       │                          │                      │
       └─ 145 mem baseline        └─ Last good state    └─ Problem occurs
Day 4: User reports ──→ Time-travel debugging ──→ FIX DEPLOYED
       │                        │                        │
       └─ 167 mem current      └─ Jump to any point    └─ Production fixed

Memory Distribution:
• User activities: 84 memories (weeks 1-4)
• System logs: 60 memories (errors, searches, analytics)
• Preferences: 23 memories (UI, settings, feedback)
• Total: 167 memories across timeline

Debugging Power:
• Traditional: Linear search through 167 memories
• Memoir: Instant jump to exact problem moment
• Context: See exact memory state when bug occurred
• Fix: Test in isolation, deploy safely
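A breakdown like the memory distribution above could be derived from the memory paths themselves. The sketch below groups paths by their first dotted segment. The sample paths are taken from this page, the helper name `count_by_category` is an assumption, and the categories in the diagram may be defined differently.

```python
from collections import Counter


def count_by_category(paths):
    """Count memory paths by their first dotted segment."""
    return Counter(path.split(".", 1)[0] for path in paths)


sample_paths = [
    "preferences.ui.theme",
    "preferences.accessibility.contrast",
    "profile.schedule.work_hours",
    "system.errors.accessibility",
    "system.logs.error_1",
]

counts = count_by_category(sample_paths)
for category, n in counts.most_common():
    print(f"{category}: {n}")
print(f"Total: {sum(counts.values())}")
```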

Key Code Snippets

Building Production History

import asyncio
import os
import tempfile
import time
from datetime import datetime, timedelta
from memoir.store.prolly_adapter import ProllyTreeStore

# Initialize production memory store
temp_dir = tempfile.mkdtemp()
prolly_path = os.path.join(temp_dir, "memory_store")

prolly_store = ProllyTreeStore(
    path=prolly_path,
    enable_versioning=True,
    cache_size=10000,
)

namespace = "production_user"

# Simulate 6 months of accumulated memories
base_memories = [
    ("User prefers dark theme for all interfaces", "preferences.ui.theme"),
    ("User typically works 9-5 PST timezone", "profile.schedule.work_hours"),
    ("User has accessibility needs for high contrast", "preferences.accessibility.contrast"),
    # ... 144 total memories accumulated over 6 months
]

# Note: `await` requires an async context (e.g. inside `async def main()`)
for content, path in base_memories:
    await prolly_store.store_memory_async(namespace, content, path)

# Create the baseline checkpoint (reported as 145 memories in the sample output)
initial_checkpoint = f"checkpoint_{int(time.time())}"
prolly_store.create_time_snapshot(initial_checkpoint)
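The snippets on this page call `await` directly, which only works inside a coroutine or an async-aware REPL. One way to run them as a script is to wrap them in `async def main()` and call `asyncio.run`. The sketch below uses a hypothetical `StubStore` in place of `ProllyTreeStore` so that it runs without Memoir installed; only the `store_memory_async(namespace, content, path)` call shape is taken from the snippet above.

```python
import asyncio


class StubStore:
    """Hypothetical stand-in that mimics the call shape used on this page."""

    def __init__(self):
        self.records = []

    async def store_memory_async(self, namespace, content, path):
        self.records.append((namespace, path, content))


async def main():
    store = StubStore()
    base_memories = [
        ("User prefers dark theme for all interfaces", "preferences.ui.theme"),
        ("User typically works 9-5 PST timezone", "profile.schedule.work_hours"),
    ]
    for content, path in base_memories:
        await store.store_memory_async("production_user", content, path)
    return store


store = asyncio.run(main())
print(f"Stored {len(store.records)} memories")
```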

Simulating Problem Timeline

# Day 1: Normal user activity (147 memories)
await prolly_store.store_memory_async(
    namespace,
    "User updated UI preferences to use blue accent color",
    "preferences.ui.accent_color"
)

await prolly_store.store_memory_async(
    namespace,
    "User set notifications to quiet mode during meetings",
    "preferences.notifications.meeting_mode"
)

day1_snapshot = f"day1_{int(time.time())}"
prolly_store.create_time_snapshot(day1_snapshot)

# Day 2: Theme preferences (still 147 memories)
await prolly_store.store_memory_async(
    namespace,
    "User mentioned liking purple color scheme for dashboards",
    "preferences.ui.dashboard_colors"
)

day2_snapshot = f"day2_{int(time.time())}"
prolly_store.create_time_snapshot(day2_snapshot)

# Day 3: Problem occurs! (167 memories - system adds 20 error logs)
problem_time = datetime.now()

# Simulate agent malfunction - bad color recommendation
await prolly_store.store_memory_async(
    namespace,
    "SYSTEM ERROR: Agent recommended bright yellow on white - accessibility violation!",
    "system.errors.accessibility"
)

# System logs flood in after the error
for i in range(19):
    await prolly_store.store_memory_async(
        namespace,
        f"Error log {i+1}: Color contrast failed validation checks",
        f"system.logs.error_{i+1}"
    )

problem_snapshot = f"problem_{int(time.time())}"
prolly_store.create_time_snapshot(problem_snapshot)
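The simulated error concerns color contrast. As an illustration of what such a validation check might compute, the sketch below implements the WCAG 2.x contrast-ratio formula. The function names are assumptions for this example, and this page does not say which check the agent actually runs.

```python
def _linearize(channel_8bit):
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) tuple with 0-255 channels."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """Return the WCAG contrast ratio, from 1.0 (identical) up to 21.0."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


yellow, white, black = (255, 255, 0), (255, 255, 255), (0, 0, 0)
ratio = contrast_ratio(yellow, white)
# WCAG AA asks for at least 4.5:1 for normal-size text.
print(f"yellow on white: {ratio:.2f}:1, passes AA: {ratio >= 4.5}")
```

Bright yellow on white comes out at roughly 1.07:1, far below the 4.5:1 threshold, which is consistent with the scenario treating it as an accessibility violation.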

User Complaint and Debugging Challenge

# Day 4: User files complaint
current_memories = prolly_store.search((namespace,), limit=500)
print("User Complaint Received:")
print('"Agent recommended terrible colors yesterday at 11:15 AM"')
print(f"Current production state: {len(current_memories)} memories")

print("Production Debugging Challenge:")
print(f"   Current state: {len(current_memories)} memories in production")
print("   Need to debug: Problem from 3 days ago")
print(f"   Traditional approach: Search through {len(current_memories)} memories manually")
print("   Memoir approach: Time-travel to exact snapshot")

Time-Travel Debugging

print("Time-traveling to problem moment...")

# Instantly jump to exact moment of problem
prolly_store.tree.checkout(problem_snapshot)

problem_memories = prolly_store.search((namespace,), limit=200)
problem_count = len([m for m in problem_memories if m[2] is not None])

print("Memory state at problem time:")
print(f"   Total memories then: {problem_count}")

# Check for the specific error
error_memory = prolly_store.get((namespace,), "system.errors.accessibility")
if error_memory:
    # Aggregated memories are stored as dicts with a `memories` list
    entries = error_memory.get("memories") or [{}]  # guard against an empty list
    first_entry = entries[0]
    print(f"Found error: {str(first_entry.get('content', ''))[:50]}...")
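When the same lookup is needed at several snapshots, it may help to pull the extraction into a small helper. The sketch below assumes the aggregated shape described in the comment above (a dict with a `memories` list of entries that carry a `content` field); the helper name `first_content` is hypothetical.

```python
def first_content(record, max_len=50):
    """Return the first entry's content, truncated, or None if absent.

    Assumes the aggregated shape {"memories": [{"content": ...}, ...]}.
    """
    if not record:
        return None
    entries = record.get("memories") or []
    if not entries:
        return None
    return str(entries[0].get("content", ""))[:max_len]


print(first_content({"memories": [{"content": "SYSTEM ERROR: bad contrast"}]}))
print(first_content({"memories": []}))   # an empty list is handled
print(first_content(None))               # a missing record is handled
```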

Root Cause Analysis

print("Root Cause Analysis:")

# Jump to different points in timeline
checkpoints = [
    (initial_checkpoint, "Initial checkpoint"),
    (day1_snapshot, "Before problem"),
    (day2_snapshot, "Day before problem"),
    (problem_snapshot, "At problem time")
]

timeline_analysis = []

for checkpoint_id, description in checkpoints:
    prolly_store.tree.checkout(checkpoint_id)
    memories = prolly_store.search((namespace,), limit=200)
    count = len([m for m in memories if m[2] is not None])
    timeline_analysis.append((description, count))
    print(f"{description}: {count} memories")

print("Timeline progression:")
for desc, count in timeline_analysis:
    print(f"   {desc}: {count} memories")
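Once the counts per checkpoint are collected, the step with the largest increase is a natural place to look first. The sketch below computes it from a list shaped like `timeline_analysis`; the helper name `largest_jump` is an assumption, and the counts are the ones shown in the sample output on this page.

```python
def largest_jump(timeline):
    """Return (from_label, to_label, delta) for the biggest count increase."""
    steps = [
        (a_label, b_label, b_count - a_count)
        for (a_label, a_count), (b_label, b_count) in zip(timeline, timeline[1:])
    ]
    return max(steps, key=lambda step: step[2])


# Counts taken from the sample output on this page.
timeline_analysis = [
    ("Initial checkpoint", 145),
    ("Before problem", 147),
    ("At problem time", 167),
]
print(largest_jump(timeline_analysis))
# prints: ('Before problem', 'At problem time', 20)
```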

Historical Context Analysis

# Analyze what agent knew before the problem
prolly_store.tree.checkout(day2_snapshot)  # Last good state

ui_preferences = []
memories = prolly_store.search((namespace,), limit=200)

for _, path, data in memories:
    if data and ("ui" in path or "accessibility" in path):
        ui_preferences.append(f"[{path}] {data}")

print("Agent's knowledge before problem:")
for pref in ui_preferences[:3]:  # Show top 3
    print(f"   {pref}")

print("Root cause identified:")
print("   Agent had correct preferences but logic bug ignored them")
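One caveat with the filter above: `"ui" in path` is a substring test, so it would also match unrelated paths that merely contain the letters "ui". Assuming paths are dot-separated strings, as in the examples on this page, a stricter variant compares whole segments. The helper name `has_segment` and the third sample path are hypothetical.

```python
def has_segment(path, *segments):
    """True if any dotted component of `path` equals one of `segments`."""
    return any(part in segments for part in path.split("."))


paths = [
    "preferences.ui.theme",
    "preferences.accessibility.contrast",
    "system.build.requirements",   # hypothetical path; "build" contains "ui"
]
print([p for p in paths if has_segment(p, "ui", "accessibility")])
```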

Running the Example

python examples/production_debugging.py

Sample Output

# Production Debugging Demo
Time-travel to debug production issues from user reports

Building production history (6 months of user interactions)...
  - Built initial production history: 144 memories

Simulating production timeline...
Day 1: UI preference saved
Day 2: Theme preference saved
Day 3: Agent malfunction - bad recommendation

Simulating continued production usage...
  - Total production memories: 167

Day 4: User complaint received
"Agent recommended terrible colors yesterday at 11:15 AM"
Current production state: 167 memories accumulated

Production Debugging Challenge:
   Current state: 167 memories in production
   Need to debug: Problem from 3 days ago
   Traditional approach: Search through 167 memories manually
   Memoir approach: Time-travel to exact snapshot

Time-traveling to problem moment...

Memory state at problem time:
   Total memories then: 167
   Current memories now: 167
   Time-traveled back through 167 memories instantly!

Root Cause Analysis:
Time-traveled to initial checkpoint...
   Memory state at checkpoint: 145 memories

Time-traveling to just before problem...
   Memory state before problem: 147 memories

Timeline progression:
   Initial checkpoint: 145 memories
   Before problem: 147 memories
   At problem time: 167 memories
   Current production: 167 memories

Root cause identified:
   Agent had correct preferences but logic bug ignored them
   Debugged by time-traveling through 167 memories in seconds!

Key Benefits

  • Large Scale: Handle 100s-1000s of memories without performance loss
  • Time-Travel: Jump to any point in production history instantly
  • Historical Context: See exact memory state when bug occurred
  • Safe Fixes: Test fixes in isolation before production deployment
  • Complete Audit Trail: Track all changes with timestamps and snapshots
  • Traditional Limitation: Manual search impossible at scale, no historical context

Use Cases

  • Production Incidents: "Why did the agent fail 3 days ago?"
  • User Complaints: "Agent gave bad advice last week"
  • Regression Analysis: "When did this behavior start?"
  • Compliance Audits: "Show agent state at specific time"
  • Performance Issues: "What caused slowdown yesterday?"
  • A/B Test Analysis: "Compare agent behavior before/after change"
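For the regression question ("When did this behavior start?"), an ordered list of snapshots can be searched with bisection instead of being inspected one by one. The sketch below is a generic binary search; `first_bad_snapshot` and `is_bad` are hypothetical names, and in practice `is_bad` would check out the snapshot and inspect the memory state. It assumes that once the behavior breaks, it stays broken in every later snapshot.

```python
def first_bad_snapshot(snapshots, is_bad):
    """Binary-search an ordered snapshot list for the first bad one.

    Assumes every snapshot after the first bad one is also bad.
    Returns None if no snapshot is bad.
    """
    lo, hi = 0, len(snapshots)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(snapshots[mid]):
            hi = mid          # first bad one is at mid or earlier
        else:
            lo = mid + 1      # first bad one is after mid
    return snapshots[lo] if lo < len(snapshots) else None


snapshots = ["checkpoint", "day1", "day2", "problem"]
bad = {"problem"}  # stand-in for "the error memory exists at this snapshot"
print(first_bad_snapshot(snapshots, lambda name: name in bad))  # prints: problem
```

With four snapshots the saving is small, but the number of checkouts grows logarithmically with the number of snapshots rather than linearly.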

Advanced Production Debugging

Multi-User Timeline Analysis

# Debug across multiple user namespaces
production_users = ["user123", "user456", "user789"]

# Check out the snapshot once; it is the same for every user
prolly_store.tree.checkout(problem_snapshot)

for user_id in production_users:
    user_memories = prolly_store.search((user_id,), limit=100)

    # Check if problem affected this user
    for _, path, data in user_memories:
        if data and "error" in path.lower():
            print(f"User {user_id} affected: {path}")
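For follow-up work it can be more useful to collect the affected users than to print them. The sketch below does that with a hypothetical helper, `affected_users`, which accepts any callable with the same call shape as `search` in the loop above. The data and `fake_search` are stand-ins so the example runs without a store.

```python
def affected_users(search, user_ids, limit=100):
    """Map each affected user id to the error paths found for that user.

    `search` is any callable with the call shape used on this page:
    search((user_id,), limit=...) -> iterable of (namespace, path, data).
    """
    hits = {}
    for user_id in user_ids:
        paths = [
            path
            for _, path, data in search((user_id,), limit=limit)
            if data and "error" in path.lower()
        ]
        if paths:
            hits[user_id] = paths
    return hits


# Hypothetical data standing in for a real store.
fake_data = {
    ("user123",): [(("user123",), "system.errors.accessibility", {"x": 1})],
    ("user456",): [(("user456",), "preferences.ui.theme", {"x": 1})],
}


def fake_search(namespace, limit=100):
    return fake_data.get(namespace, [])[:limit]


print(affected_users(fake_search, ["user123", "user456", "user789"]))
# prints: {'user123': ['system.errors.accessibility']}
```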

Performance Impact Analysis

# Measure time-travel performance with large datasets
# (perf_counter is a monotonic, high-resolution clock intended for timing)
start_time = time.perf_counter()

prolly_store.tree.checkout(problem_snapshot)
memories = prolly_store.search((namespace,), limit=1000)

elapsed = time.perf_counter() - start_time

# The measured time covers both the checkout and the search
print(f"Time-travel through {len(memories)} memories: {elapsed:.3f}s")
print("Manual search estimate (illustrative, not measured): 30-120+ seconds")
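A single timing can be noisy, so it may be worth repeating the measurement and reporting the median. The sketch below is a generic helper; `time_call` is a hypothetical name, and the lambda is a stand-in workload that would be replaced by a callable performing the checkout and search.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Run `fn` several times and return (median_seconds, last_result)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), result


# Stand-in workload; replace with a checkout-and-search callable.
median_s, result = time_call(lambda: sum(range(10_000)))
print(f"median over 5 runs: {median_s:.6f}s")
```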

Production Fix Workflow

# Create fix branch from clean state
prolly_store.tree.checkout(day2_snapshot)  # Last known good

fix_branch = f"hotfix_{int(time.time())}"
prolly_store.tree.create_branch(fix_branch)
prolly_store.tree.checkout(fix_branch)

# Apply corrected logic
await prolly_store.store_memory_async(
    namespace,
    "Enhanced accessibility validation: Always check contrast ratios",
    "system.fixes.accessibility_validation"
)

# Test fix in isolation
# (run_color_recommendation_test is your own test helper; it is not defined on this page)
test_results = await run_color_recommendation_test()

if test_results.passed:
    # Deploy to production
    prolly_store.tree.checkout("main")
    prolly_store.tree.merge(fix_branch)
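Because `checkout` changes what subsequent reads and writes see, it is easy to leave the store pointing at a historical snapshot by accident. One defensive pattern is a context manager that always returns to a known ref. The sketch below assumes only a `checkout(ref)` method, as used on this page, and a branch named "main" as in the workflow above; `checked_out` and `StubTree` are hypothetical names.

```python
from contextlib import contextmanager


@contextmanager
def checked_out(tree, ref, restore="main"):
    """Check out `ref`, then always return to `restore` afterwards."""
    tree.checkout(ref)
    try:
        yield tree
    finally:
        tree.checkout(restore)


class StubTree:
    """Hypothetical stand-in that only records checkouts."""

    def __init__(self):
        self.current = "main"
        self.history = []

    def checkout(self, ref):
        self.current = ref
        self.history.append(ref)


tree = StubTree()
with checked_out(tree, "day2_snapshot"):
    print("inspecting", tree.current)
print("back on", tree.current)
```

The `finally` clause means the restore also happens when the inspection code raises, which is the case that tends to be forgotten in ad hoc debugging sessions.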

Next Steps