
Reproducible Testing

Scenario: Agent behavior varies between test runs due to memory state differences.

Problem: Hard to create consistent test environments for different user personas.

Solution: Git-like branches create isolated, reproducible test scenarios.

Overview

Traditional agent testing faces a fundamental challenge: memory state contamination between tests. When testing different user personas or scenarios, previous test data pollutes the memory, making results unpredictable and unreproducible.

Memoir solves this with Git-like branching - create completely isolated test environments that start from the same baseline and produce identical results every time.

Test Isolation Diagram

Reproducible Testing with Branch Isolation:

baseline branch:  A──B (shared foundation)
                  ├──→ test_beginner ──→ C1──D1 (beginner scenario)
                  ├──→ test_expert ────→ C2──D2 (expert scenario)
                  └──→ test_mixed ─────→ C3──D3 (mixed scenario)

A: Software engineer profile
B: Python/JavaScript skills

C1,D1: Beginner-specific memories
C2,D2: Expert-specific memories
C3,D3: Mixed experience memories

Key Benefits:
• Each test starts from same clean baseline
• Tests run in complete isolation
• Same test produces identical results
• Parallel testing without interference
• Easy cleanup - just delete branch
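The branching model in the diagram can be sketched with a toy in-memory store. This is only an illustration of the isolation idea, not Memoir's actual implementation: each branch holds its own copy of the memory state, so writes on a test branch never touch the baseline.

```python
import copy

class ToyBranchStore:
    """Toy Git-like store: each branch holds an independent snapshot of memories."""
    def __init__(self):
        self.branches = {"main": {}}   # branch name -> {path: content}
        self.current = "main"

    def create_branch(self, name):
        # A new branch starts as a copy of the current branch's state (the baseline)
        self.branches[name] = copy.deepcopy(self.branches[self.current])

    def checkout(self, name):
        self.current = name

    def store(self, path, content):
        self.branches[self.current][path] = content

    def search(self):
        return dict(self.branches[self.current])

store = ToyBranchStore()
store.store("profile.role", "software engineer")          # A: baseline memory

store.create_branch("test_beginner")
store.checkout("test_beginner")
store.store("profile.experience.level", "beginner")       # C1: branch-only memory

store.checkout("main")
assert "profile.experience.level" not in store.search()   # baseline is untouched
```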

Key Code Snippets

Creating Baseline State

import asyncio
import os
import tempfile
import time
from memoir.store.prolly_adapter import ProllyTreeStore

# Initialize store with versioning
# (the `await` calls below must run inside an async function,
#  e.g. one driven by asyncio.run())
temp_dir = tempfile.mkdtemp()
prolly_path = os.path.join(temp_dir, "memory_store")

prolly_store = ProllyTreeStore(
    path=prolly_path,
    enable_versioning=True,
    cache_size=10000,
)

namespace = "test_user"

# Create shared baseline memories
await prolly_store.store_memory_async(
    namespace,
    "User is a software engineer with 5 years experience",
    "profile.professional.occupation.role"
)

await prolly_store.store_memory_async(
    namespace,
    "User knows Python and JavaScript proficiently",
    "profile.skills.programming.languages"
)

# Create baseline snapshot
baseline_snapshot = f"baseline_{int(time.time())}"
prolly_store.create_time_snapshot(baseline_snapshot)

Creating Isolated Test Scenarios

# Test Scenario 1: Beginner user
beginner_branch = f"test_beginner_{int(time.time())}"
prolly_store.tree.create_branch(beginner_branch)
prolly_store.tree.checkout(beginner_branch)

await prolly_store.store_memory_async(
    namespace,
    "User is new to programming and learning basics",
    "profile.experience.level"
)

await prolly_store.store_memory_async(
    namespace,
    "User needs simple explanations and step-by-step guidance",
    "preferences.learning.style"
)

# Test Scenario 2: Expert user
prolly_store.tree.checkout("main")  # Return to the baseline state on the default branch
expert_branch = f"test_expert_{int(time.time())}"
prolly_store.tree.create_branch(expert_branch)
prolly_store.tree.checkout(expert_branch)

await prolly_store.store_memory_async(
    namespace,
    "User is a senior engineer with deep technical expertise",
    "profile.experience.level"
)

Verifying Test Isolation

# Switch between branches to verify isolation
prolly_store.tree.checkout(beginner_branch)
beginner_memories = prolly_store.search((namespace,), limit=10)

prolly_store.tree.checkout(expert_branch)
expert_memories = prolly_store.search((namespace,), limit=10)

# Check for branch-specific memories
beginner_paths = {path for _, path, data in beginner_memories if data}
expert_paths = {path for _, path, data in expert_memories if data}

beginner_only = beginner_paths - expert_paths
expert_only = expert_paths - beginner_paths

print(f"Beginner branch unique memories: {beginner_only}")
print(f"Expert branch unique memories: {expert_only}")

Demonstrating Reproducibility

def run_test_scenario(branch_name):
    """Run test and return memory content for comparison"""
    prolly_store.tree.checkout(branch_name)

    memories = prolly_store.search((namespace,), limit=10)
    memory_contents = []

    for _, path, data in memories:
        if data and "experience.level" in path:
            memory_contents.append(f"[{path}] {data}")

    return memory_contents

# Run test multiple times
original_results = run_test_scenario(beginner_branch)
rerun_results = run_test_scenario(beginner_branch)

# Verify identical results
print("Original test results:")
for result in original_results:
    print(f"  {result}")

print("\nRerun test results:")
for result in rerun_results:
    print(f"  {result}")

# Compare results
if original_results == rerun_results:
    print(f"All {len(original_results)} memories identical - Perfect reproducibility!")
else:
    print("Test results differ - reproducibility failed")
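Comparing full result lists works for small tests; for larger memory sets, a content fingerprint gives a compact reproducibility check. The sketch below uses only the standard library's hashlib (independent of Memoir) to reduce a branch's memory contents to a single digest that must match across reruns:

```python
import hashlib

def fingerprint(memory_contents):
    """Hash a sorted list of memory strings into a stable hex digest."""
    h = hashlib.sha256()
    for item in sorted(memory_contents):
        h.update(item.encode("utf-8"))
        h.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return h.hexdigest()

run1 = ["[profile.experience.level] User is new to programming"]
run2 = ["[profile.experience.level] User is new to programming"]
assert fingerprint(run1) == fingerprint(run2)  # identical state -> identical digest
```

Sorting before hashing makes the digest independent of search result ordering, so only the actual memory contents affect the comparison.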

Running the Example

python examples/reproducible_testing.py

Sample Output

# Reproducible Testing Demo
Create isolated test environments for consistent testing

Creating baseline test state...
  - Baseline state created (snapshot: baseline_1755877601)

Test Scenario 1: Beginner user
  Memory count: 4
  Experience level: Beginner

Test Scenario 2: Expert user
  Memory count: 4
  Experience level: Expert

Verifying test isolation...
  Beginner branch has 'new to programming': True
  Expert branch has 'senior engineer': True
  Test scenarios are completely isolated

Demonstrating reproducibility...
Original beginner test results:
  [profile.experience.level] User is new to programming...
  [preferences.learning.style] User needs simple explanations...

Rerun beginner test results:
  [profile.experience.level] User is new to programming...
  [preferences.learning.style] User needs simple explanations...

Comparison:
  Original test: 4 memories
  Rerun test: 4 memories
  All 4 memories identical - Perfect reproducibility!

Key Benefits

  • Test Isolation: each test runs in a completely clean environment
  • Consistent Results: the same test produces identical results every time
  • Parallel Testing: multiple scenarios run without interference
  • Easy Cleanup: just delete the branch when the test completes

Traditional limitation: without branching, tests contaminate each other and produce inconsistent results.
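The cleanup step can be wrapped in a context manager so a test branch is always deleted, even when the test fails. The sketch below assumes a store exposing create_branch, checkout, and delete_branch (delete_branch is a hypothetical method here, not confirmed Memoir API) and demonstrates the pattern with a minimal stub:

```python
from contextlib import contextmanager

@contextmanager
def isolated_branch(store, name, base="main"):
    """Create a test branch, yield it, and always delete it afterwards."""
    store.checkout(base)
    store.create_branch(name)
    store.checkout(name)
    try:
        yield name
    finally:
        store.checkout(base)
        store.delete_branch(name)   # hypothetical cleanup method

class StubStore:
    """Minimal stand-in so the pattern is runnable without Memoir."""
    def __init__(self):
        self.branches = {"main"}
        self.current = "main"
    def create_branch(self, name): self.branches.add(name)
    def checkout(self, name): self.current = name
    def delete_branch(self, name): self.branches.discard(name)

store = StubStore()
with isolated_branch(store, "test_tmp"):
    assert store.current == "test_tmp"
assert "test_tmp" not in store.branches   # branch removed after the test
```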

Use Cases

  • A/B Testing: Compare agent behavior with different memory configurations
  • Persona Testing: Test how agent responds to different user types
  • Regression Testing: Verify agent behavior doesn't change between versions
  • Performance Testing: Benchmark with consistent baseline data
  • Integration Testing: Test agent with known memory states
  • User Acceptance Testing: Reproduce exact user scenarios

Advanced Testing Patterns

Parameterized Test Scenarios

test_personas = [
    {"name": "beginner", "experience": "new to programming"},
    {"name": "expert", "experience": "senior engineer"},
    {"name": "student", "experience": "computer science student"},
]

for persona in test_personas:
    # Branch from the baseline each time, not from the previous persona's branch
    prolly_store.tree.checkout("main")
    branch_name = f"test_{persona['name']}"
    prolly_store.tree.create_branch(branch_name)
    prolly_store.tree.checkout(branch_name)

    # Setup persona-specific memories (setup_persona_memories, run_agent_tests,
    # and assert_expected_behavior are placeholders for your own test helpers)
    await setup_persona_memories(persona)

    # Run tests
    results = await run_agent_tests()
    assert_expected_behavior(persona['name'], results)
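To make the per-persona flow above concrete, here is a self-contained async harness with the store and agent calls faked by plain dicts and functions. All helper names are illustrative stand-ins, not Memoir API:

```python
import asyncio

personas = [
    {"name": "beginner", "experience": "new to programming"},
    {"name": "expert", "experience": "senior engineer"},
]

results = {}

async def setup_persona(store, persona):
    # Stand-in for setup_persona_memories(): one isolated "branch" per persona
    store[persona["name"]] = {"profile.experience.level": persona["experience"]}

async def run_tests(store, persona):
    # Stand-in for run_agent_tests(): read back only this persona's branch
    return store[persona["name"]]["profile.experience.level"]

async def main():
    store = {}
    for persona in personas:
        await setup_persona(store, persona)
        results[persona["name"]] = await run_tests(store, persona)

asyncio.run(main())
assert results["beginner"] == "new to programming"
assert results["expert"] == "senior engineer"
```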

Test Data Factories

async def create_user_profile_baseline():
    """Factory for creating consistent test baselines"""
    memories = [
        ("Basic profile info", "profile.identity"),
        ("Programming experience", "profile.skills.programming"),
        ("Learning preferences", "preferences.learning"),
    ]

    for content, path in memories:
        await prolly_store.store_memory_async(namespace, content, path)

    return prolly_store.create_time_snapshot(f"baseline_{int(time.time())}")
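A factory is most useful when it is deterministic: every invocation must yield exactly the same baseline, or snapshots built from it will drift between test runs. A minimal stdlib sketch of that property (function name is illustrative):

```python
def make_baseline_memories():
    """Deterministic factory: returns the same (content, path) pairs on every call."""
    return [
        ("Basic profile info", "profile.identity"),
        ("Programming experience", "profile.skills.programming"),
        ("Learning preferences", "preferences.learning"),
    ]

# Two invocations must match exactly, or baselines drift between test runs
assert make_baseline_memories() == make_baseline_memories()
```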

Next Steps