Memoir Search Theory & Architecture¶
Summary¶
Memoir exposes three retrieval pipelines that share the same taxonomy-structured store:
- IntelligentSearchEngine — mode="single" (in-engine, one LLM call): the engine presents the full path inventory with content samples to an LLM and asks it to pick the relevant paths in one shot. Lowest latency when the store is small/medium; signal-to-noise degrades as the inventory grows. Total latency ~500–800ms.
- IntelligentSearchEngine — mode="tiered" (in-engine, staged LLM calls): the engine runs the same drill-down shape the caller-driven skill uses, but with its own LLM. L1 histogram (no LLM) → L1 pick → optional L2 pick when a branch is wide → key pick → batched fetch. Narrower prompts at each stage and better scaling with store size, at the cost of 2–3 LLM calls instead of 1. Typical total latency ~1–2s.
- Caller-driven tiered retrieval (out-of-engine, LLM-free): the CLI primitives summarize --depth N and get let an outer LLM (the Claude Code memory-recall skill is the canonical example) drive its own drill-down — no LLM call inside memoir. Latency is tens to hundreds of ms because no model inference happens on the retrieval side; the outer agent also contributes conversational context to the picker, which an in-engine pass cannot.
All three paths exploit pre-classified semantic paths for O(log n)-shaped lookups instead of O(n) similarity search. Where the path picker sits — inside the engine (single or tiered), or outside in the calling agent — is a factoring choice, not an algorithmic one. The single-stage and tiered in-engine modes are selected per call via the mode argument on IntelligentSearchEngine.search() (also exposed as --mode {single,tiered} on memoir recall and ?mode=… on the UI /api/recall endpoint).
Core Problem Statement¶
Traditional AI memory systems suffer from fundamental search inefficiencies:
- O(n) Complexity: Vector similarity search across entire corpus
- High Latency: 150-750ms for embeddings + similarity computation
- Opaque Ranking: Black-box similarity scores without interpretability
- No Hierarchical Leverage: Flat search space ignoring semantic relationships
Memoir solves this through hierarchical semantic search where:
- Memories are pre-organized into semantic paths (via classification)
- Search can leverage the hierarchical structure for O(log n) operations
- Path-based filtering dramatically reduces search space
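As a toy illustration (plain dictionaries, not memoir's actual store API), pre-classified paths let a lookup keep only the keys under one taxonomy prefix instead of scoring every record:

# Hypothetical miniature store: memories keyed by pre-classified semantic paths.
memories = {
    "profile.personal.identity": "John Smith, age 28",
    "preferences.technology.programming": "Loves Python, prefers typed code",
    "context.conversation.history": "Discussed AI memory systems",
}

def filter_by_prefix(store, prefix):
    """Keep only memories whose path sits under the given taxonomy prefix."""
    return {k: v for k, v in store.items() if k == prefix or k.startswith(prefix + ".")}

filter_by_prefix(memories, "preferences")
# {'preferences.technology.programming': 'Loves Python, prefers typed code'}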
Search Philosophy¶
Key Innovation¶
flowchart TB
subgraph Trad[Traditional]
direction LR
t1[query] --> t2[embedding] --> t3["O(n) similarity search"] --> t4[ranked results]
end
subgraph Single["Memoir — in-engine, single"]
direction LR
s1[query] --> s2[LLM path selection] --> s3["O(log n) retrieval"] --> s4[filtered results]
end
subgraph Tiered["Memoir — in-engine, tiered"]
direction LR
ti1[query] --> ti2[LLM picks L1] --> ti3["(optional) LLM picks L2"] --> ti4[LLM picks keys] --> ti5["O(log n) retrieval"]
end
subgraph Caller["Memoir — caller-driven"]
direction LR
c1[query] --> c2[caller-LLM picks prefix] --> c3["summarize --depth N"] --> c4[get]
end
All three memoir pipelines exploit pre-classified semantic structure to:
- Reduce Search Space: Focus only on relevant taxonomy branches
- Improve Interpretability: Clear path-based result organization
- Enable Prefix Queries: Efficient hierarchical exploration
- Leverage LLM Understanding: True semantic query comprehension
The caller-driven mode additionally avoids a second LLM inside memoir when the caller is itself an LLM — the outer model already has the query plus session context and is a strictly better picker than a fresh in-engine pass.
Architecture Overview¶
IntelligentSearchEngine — Single-Stage Mode (mode="single")¶
This is the default IntelligentSearchEngine.search() pipeline. One LLM call picks 1–3 paths from the full path inventory and the engine returns the memories at those paths.
Design Goals¶
- Semantic Understanding: LLM comprehends query intent
- Path-Aware Selection: Leverages hierarchical structure
- Context-Rich Results: Provides memory samples for decisions
- Flexible Relevance: LLM-based path selection
Algorithm Deep Dive¶
Stage 1: Path Discovery (10-50ms)¶
# Get all memories to extract unique paths
all_memories = store.search(namespace_tuple, limit=1000)
# Build path information with samples
paths_info = {}
for _, path, data in all_memories:
    if path not in paths_info:
        # Aggregated entries carry a "memories" list; single entries carry "content" directly
        is_aggregated = "memories" in data
        sample = data["memories"][0]["content"] if is_aggregated else data.get("content", "")
        paths_info[path] = {
            "type": "aggregated" if is_aggregated else "single",
            "count": data.get("count", 1),
            "sample": sample[:100],  # Preview: first 100 chars
        }
Path Information Structure:
- Type: Aggregated vs single memory
- Count: Number of memories at path
- Sample: First 100 chars for context
This provides the LLM with:
- Complete path inventory
- Memory density information
- Content previews for informed selection
Stage 2: LLM Path Selection (200-500ms)¶
prompt = f"""Given this search query: "{query}"
Please select the most relevant memory paths from:
- profile.personal.identity (5 memories): John Smith, age 28...
- preferences.technology.programming (3 memories): Loves Python...
- context.conversation.history (10 memories): Discussed AI...
Instructions:
- Select 1-3 paths most relevant to query
- Return ONLY path names, one per line
- If no paths relevant, return "NONE"
"""
Prompt Engineering Details:
- Query Prominence: Query shown first for focus
- Path Context: Each path shown with count and sample
- Limited Selection: 1-3 paths to prevent over-retrieval
- Clear Format: Line-separated paths for parsing
- Null Case: Explicit "NONE" for no matches
Stage 3: Memory Retrieval (5-20ms)¶
for path in selected_paths[:limit]:
    path_memories = _get_memories_from_path(namespace, path, all_memories)
    results.extend(path_memories)
    if len(results) >= limit:
        break
Retrieval Strategy:
- Path-Limited: Only retrieve from selected paths
- Early Termination: Stop when limit reached
- Memory Expansion: Unpack aggregated memories
- Metadata Enrichment: Add path and source info
Fallback Handling¶
Robustness Features:
- LLM Failure Fallback: Use first 3 paths
- Parse Error Recovery: Handle malformed LLM output
- Empty Result Handling: Return empty list gracefully
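A minimal sketch of that parse-and-fallback behavior (illustrative only; the function name is hypothetical, not memoir's internal API):

def parse_selected_paths(llm_output, all_paths, fallback_n=3):
    """Parse the LLM's line-separated path list; degrade gracefully on NONE or malformed output."""
    if not llm_output or llm_output.strip().upper() == "NONE":
        return []  # explicit no-match: return an empty list gracefully
    known = set(all_paths)
    picked = [line.strip() for line in llm_output.splitlines() if line.strip() in known]
    return picked if picked else all_paths[:fallback_n]  # LLM failure: fall back to the first N paths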
Performance Characteristics¶
- Path Discovery: 10-50ms
- LLM Path Selection: 200-500ms (single LLM call; prompt caching applies to the static taxonomy section of the prompt)
- Memory Retrieval: 5-20ms
- Total Latency: 215-570ms from summing the stages; 500-800ms measured end-to-end
- Memory Usage: O(all_paths + selected_memories)
- LLM Token Usage: ~500-1500 tokens per search
Strengths & Limitations¶
Strengths:
- True semantic understanding
- Handles complex, abstract queries
- Leverages memory organization
- Provides reasoning transparency
- Single LLM call keeps latency bounded
Limitations:
- Higher latency (LLM dependency)
- Costs associated with LLM usage
- Non-deterministic results
- Requires online LLM access
- Picker sees only the query string — not the caller's conversational context
IntelligentSearchEngine — Tiered Mode (mode="tiered")¶
Opt in with search(..., mode="tiered"). Same engine, same store APIs, same result shape — but the selection work is split into narrower stages so each prompt stays small as the store grows. The pattern mirrors the caller-driven [mode=drill] flow (see below), only with the engine's own LLM driving instead of an outer agent.
When to use it over mode="single"¶
- Large stores. Single-stage sends every path (with content samples) in one prompt — token cost grows linearly with store size, and relevance signal thins out. Tiered mode's L1 histogram stays constant (typically 5–15 entries) and the key-pick stage only sees paths under picked L1s.
- Signal-to-noise matters more than latency. Tiered spends 2–3 LLM calls in sequence (~1–2s end-to-end). If the single-stage picker is good enough and latency is the bottleneck, stay on mode="single".
- You want to A/B the two against a real workload. Because mode is a per-call argument (not a config knob), benchmarks and the UI can toggle without restart.
The picker in tiered mode still does not see the caller's conversational context (that's a property only the skill-side caller-driven flow has — see next section).
Algorithm Deep Dive¶
- L1 survey (pure compute, no LLM) — after the shared path-discovery step loads all memories once, the engine runs _group_by_depth(paths, 1) over the stored keys and gets a histogram {prefix: count} of top-level taxonomy segments.
- L1 pick (LLM #1) — the engine sends a small prompt with the query and the histogram, asking for 2–4 plausible L1 prefixes. Malformed / empty output falls back to top-N by count so the search never dies silently.
- Descent (pure compute) — for each picked L1, _filter_keys(paths, f"{L1}.*") narrows to concrete keys. If any single L1 exceeds L2_ESCALATION_THRESHOLD (40 keys), that branch is marked for L2 escalation.
- L2 pick (optional, LLM #1.5) — for each oversized L1, the engine groups that branch's keys by depth-2 and asks the LLM to pick 2–3 L2 sub-prefixes. Same fallback-to-top-N safety net.
- Key pick (LLM #2) — the descended key set plus content samples goes into the reused _select_relevant_paths prompt (the same static taxonomy-aware prompt the single-stage path uses). The LLM returns 3–7 exact keys.
- Memory retrieval (pure compute) — picked keys are fetched from the already-loaded memory_dict via _extract_memories_from_data, same shape as single-stage.
flowchart TB
Q["query: what's my testing setup?"]
S1["1. L1 histogram<br/>{preferences: 28, context: 25,<br/>workflow: 24, routine: 8, ...}"]
S2["2. LLM #1<br/>[preferences, workflow, routine]"]
S3["3. Descent<br/>15 keys under preferences.*<br/>12 under workflow.*, 6 under routine.*"]
S4["4. (no L2)<br/>all three L1s under the 40-key threshold"]
S5["5. LLM #2<br/>[preferences.coding.testing,<br/>workflow.coding.testing,<br/>routine.coding.testing]"]
S6["6. Fetch<br/>3 IntelligentSearchResult objects<br/>with step_timings + llm_prompts"]
Q --> S1 --> S2 --> S3 --> S4 --> S5 --> S6
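A behavioral sketch of the two pure-compute helpers named above, inferred from how they are described here (the real implementations in src/memoir/search/intelligent.py may differ in detail):

import fnmatch
from collections import Counter

def _group_by_depth(paths, depth):
    """Histogram of keys grouped by their first `depth` dot-separated segments."""
    return dict(Counter(".".join(p.split(".")[:depth]) for p in paths))

def _filter_keys(paths, pattern):
    """Narrow the key set with a glob pattern such as 'preferences.*'."""
    return [p for p in paths if fnmatch.fnmatch(p, pattern)]

paths = ["preferences.coding.testing", "preferences.tools.editor", "workflow.coding.testing"]
_group_by_depth(paths, 1)             # {'preferences': 2, 'workflow': 1}
_filter_keys(paths, "preferences.*")  # ['preferences.coding.testing', 'preferences.tools.editor']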
Prompt stages¶
- L1 pick prompt. Query + a one-line-per-prefix histogram. Explicitly instructed to return prefix names only, one per line, or NONE.
- L2 pick prompt. Only emitted when at least one L1 triggered escalation; scoped to that branch. When multiple branches escalate in the same call, all sub-prompts are concatenated under the l2_pick capture key so return_prompts=True shows the full chain.
- Key pick prompt. Delegates to the existing _select_relevant_paths with the descended subset — this reuses the static [STATIC_SECTION_START]/[STATIC_SECTION_END] taxonomy prelude, so prompt caching still applies to this stage exactly as it does for single-stage.
Observability¶
Results from mode="tiered" carry the same metadata shape as single-stage but with tiered-specific keys. Callers that only care about "give me memories" see no difference; callers that inspect metadata["step_timings"] or metadata["llm_prompts"] (the benchmark, the UI's return_prompts=1 panel, tests) see the staged breakdown.
| step_timings key | Present? | Meaning |
|---|---|---|
| step1_path_discovery | always | Shared path-discovery step (same as single-stage). |
| l1_survey | tiered only | Pure-compute L1 histogram. Typically <10ms. |
| l1_pick_llm | tiered only | LLM #1 latency. |
| descend | tiered only | Pure-compute filtering into concrete keys. |
| l2_pick_llm | tiered only (conditional) | Present only when at least one L1 triggered escalation. |
| key_pick_llm | tiered only | LLM #2 latency. |
| memory_retrieval | tiered only | Final fetch + shape conversion. |
| total_search | always | End-to-end wall time. |
llm_prompts keys in tiered mode are l1_pick, l2_pick (when present), and key_pick. metadata["mode"] is stamped to "tiered" (or "single" on the default path) so downstream consumers never need to guess which pipeline produced a result.
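For example, a consumer of this metadata might read the staged breakdown off the first result (illustrative sketch; assumes the engine call shape from the mode-selection API below, with prompt capture enabled where llm_prompts is wanted):

results = engine.search("what's my testing setup?", "user:alice", mode="tiered")
if results:
    meta = results[0].metadata
    print(meta["mode"])  # "tiered"
    print(meta["step_timings"].get("l1_pick_llm"), meta["step_timings"].get("key_pick_llm"))
    print(sorted(meta.get("llm_prompts", {})))  # ['key_pick', 'l1_pick'], plus 'l2_pick' when escalation fired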
Fallbacks & robustness¶
- LLM returns NONE or garbage at L1 → top-N-by-count fallback on the histogram.
- LLM returns NONE or garbage at L2 → top-N-by-count fallback on that branch's L2 histogram.
- Descended key set is empty after all picks → return a single timing-only dummy result (mirrors single-stage's dummy-on-no-match convention) so callers can still observe timings.
- Unknown mode value → ValueError at the top of search(); fails loud rather than silently falling back.
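The top-N-by-count fallback itself is tiny (sketch; the helper name is hypothetical):

def top_n_by_count(histogram, n=3):
    """Fallback pick when the LLM returns NONE or garbage: take the busiest prefixes."""
    return [prefix for prefix, _ in sorted(histogram.items(), key=lambda kv: kv[1], reverse=True)[:n]]

top_n_by_count({"preferences": 28, "context": 25, "workflow": 24, "routine": 8}, n=3)
# ['preferences', 'context', 'workflow']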
Performance Characteristics¶
| Stage | Typical latency |
|---|---|
| L1 survey + descend | <20ms (pure compute) |
| L1 pick LLM | ~300–500ms |
| L2 pick LLM (when escalated) | ~300–500ms |
| Key pick LLM | ~400–600ms |
| Memory retrieval | ~5–20ms |
| End-to-end | ~1–2s (vs. ~0.5–0.8s for single-stage) |
LLM token usage scales better than single-stage once the store is large: L1 pick costs are effectively O(1) in corpus size (histograms are small), and the key-pick stage only sees the descended subset rather than the whole inventory.
Mode selection API¶
Mode is a per-call argument, not a configuration toggle:
- Engine: engine.search(query, namespace, mode="tiered")
- Service: service.recall(query, mode="tiered") / recall_sync(..., mode="tiered")
- CLI: memoir recall "query" --mode tiered (default single)
- UI: GET /api/recall?path=...&query=...&mode=tiered (whitelisted to single / tiered)
Per-call selection was chosen over a global config so benchmarks, tests, and end-users can A/B the two pipelines on the same store without environment juggling.
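A sketch of such an A/B run on one store (same call shape as above; total_search comes from the timing metadata described in the observability section):

for mode in ("single", "tiered"):
    results = engine.search("what's my testing setup?", "user:alice", mode=mode)
    total = results[0].metadata["step_timings"]["total_search"] if results else None
    print(f"{mode}: {len(results)} results, total_search={total}")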
Caller-Driven Tiered Retrieval¶
When the caller is itself an LLM (e.g. the Claude Code memory-recall skill, agentic tool-use clients), running a second LLM inside memoir to pick paths is wasteful — the outer model already reads the query plus full session context and can pick better than a context-free in-engine pass. The caller-driven pipeline exposes raw primitives and lets the outer LLM drive the drill-down.
Primitives¶
Three LLM-free CLI commands compose into every retrieval shape:
- memoir summarize --depth N [--keys <pattern>] — groups taxonomy keys by the first N dot-separated segments and emits a prefix_counts histogram. N=1 gives the L1 layout (typically 5–15 prefixes); deeper N drills further. Composable with --keys <pattern> for scoped surveys (--keys "preferences.*" --depth 2 gives the L2 breakdown under preferences).
  - Implementation: src/memoir/cli/commands/analysis.py — _filter_keys (fnmatch) + _group_by_depth.
  - Cost: pure taxonomy scan, no LLM. ~100ms on a mid-sized store.
- memoir get <key> [<key>...] [-n <namespace>] — batched exact-path lookup. Missing keys return found: false rather than erroring, so the caller can include speculative candidates without branching logic.
  - Implementation: src/memoir/cli/commands/memory.py + src/memoir/services/memory_service.py.
  - Cost: <10ms for a batched lookup (merkle-tree point queries).
- memoir blame <path> -l N / memoir diff <a> <b> — escalations for history / cross-commit questions. Not on the hot path.
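Plausible response shapes for the two hot-path primitives (only prefix_counts, found, and value.content are named in this document; the remaining field names are illustrative assumptions):

summarize_depth1 = {"prefix_counts": {"preferences": 28, "context": 25, "workflow": 24, "routine": 8}}

get_batch = [
    {"key": "preferences.coding.testing", "found": True, "value": {"content": "Uses pytest over unittest"}},
    {"key": "workflow.automation.testing", "found": False},  # speculative candidate: no error, just found: false
]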
The four modes¶
Every response from the caller-driven path is prefixed with a mode marker, so the cost/correctness trade-off is visible in the transcript:
| Mode | Trigger | Flow | Typical cost |
|---|---|---|---|
| [mode=get] | Query already names a path | direct get | <10ms |
| [mode=flat] | A single glob covers the scope (e.g. *.testing.*, *pytest*) | summarize --keys <pattern> → pick → get | ~100ms |
| [mode=drill] | Open-ended query (the default) | summarize --depth 1 → pick 2–4 L1 prefixes → summarize --keys "<L1>.*" → (optional depth-2 escalation when an L1 has > 40 keys) → get | ~200–300ms |
| [mode=blame] / [mode=diff] | Provenance or cross-commit/branch question | run drill first to identify keys, then blame -l N or diff <a> <b> | +100ms on top of drill |
Markers combine when paths chain ([mode=drill+blame]). The legacy LLM-bundled path is tagged [mode=recall-legacy] and is explicitly discouraged for agent callers.
Drill-down walkthrough¶
flowchart TB
Q["query: what's my testing setup?"]
S1["1. summarize --depth 1 -n default<br/>prefix_counts: {preferences, context,<br/>workflow, routine, ...}"]
S2["2. Caller-LLM picks<br/>[preferences, workflow, routine]<br/>(all plausibly host testing-related facts)"]
S3["3. For each pick: summarize --keys prefix.*<br/>preferences.coding.testing, preferences.tools.testing, ...<br/>workflow.coding.testing, workflow.automation.testing<br/>routine.coding.testing"]
S4["4. Caller-LLM picks 3–7 exact keys<br/>Batched: memoir get preferences.coding.testing ...<br/>→ 4 items with value.content"]
S5["5. Response<br/>[mode=drill] — You use pytest over unittest ..."]
Q --> S1 --> S2 --> S3 --> S4 --> S5
No LLM call on memoir's side anywhere in this flow. Total wall-time is dominated by CLI startup + the three summarize invocations.
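An agent-framework-side sketch of that composition (hypothetical wrapper; assumes the stable JSON output promised by the caller contract and uses only the flags documented above):

import json
import subprocess

def memoir_json(*args):
    """Run one memoir CLI command and parse its JSON stdout."""
    proc = subprocess.run(["memoir", *args], capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)

l1 = memoir_json("summarize", "--depth", "1", "-n", "default")                     # step 1: L1 histogram
picked_prefixes = ["preferences", "workflow", "routine"]                           # step 2: the caller's own LLM picks
scoped = [memoir_json("summarize", "--keys", f"{p}.*") for p in picked_prefixes]   # step 3: scoped surveys
picked_keys = ["preferences.coding.testing", "workflow.coding.testing"]            # step 4: caller's LLM picks exact keys
memories = memoir_json("get", *picked_keys)                                        # batched fetch, no LLM on memoir's side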
Why the outer LLM is the better picker¶
- It sees the full query plus conversational context; the in-engine picker sees only the query string.
- It can make ambiguity-aware calls (fetch from 2–4 plausible prefixes, discard irrelevant results downstream) without a second round trip.
- It avoids the latency + token cost of a nested LLM invocation.
- The contract is minimal — three CLI commands, stable JSON shape — so any agent framework can consume it.
Performance Characteristics¶
| Pipeline | LLM calls in memoir | Typical wall time | Network dependence |
|---|---|---|---|
| IntelligentSearchEngine — mode="single" | 1 (path selection) | 500–800ms | Requires online LLM |
| IntelligentSearchEngine — mode="tiered" (no L2 escalation) | 2 (L1 pick + key pick) | ~1–1.5s | Requires online LLM |
| IntelligentSearchEngine — mode="tiered" (with L2 escalation) | 3 (L1 + L2 + key pick) | ~1.5–2s | Requires online LLM |
| [mode=get] (caller-driven) | 0 | <10ms | Local only |
| [mode=flat] (caller-driven) | 0 | ~100ms | Local only |
| [mode=drill] (caller-driven) | 0 | ~200–300ms | Local only |
| [mode=blame] / [mode=diff] (caller-driven) | 0 | +100ms | Local only |
The caller-driven modes consume zero memoir-side tokens. Tokens spent by the outer LLM to do the picking are amortized against conversational context it already has loaded — effectively free at the margin.
When to use which entry point¶
- SDK / non-LLM caller with a small/medium store → IntelligentSearchEngine with mode="single" — one LLM call, lowest latency.
- SDK / non-LLM caller with a large store or noisy single-stage results → IntelligentSearchEngine with mode="tiered" — more LLM calls, narrower prompts per stage, better scaling with store size.
- Agent caller with its own LLM (Claude Code, agentic clients) → caller-driven drill-down — skip nested LLM calls entirely and let the outer agent's context contribute to the picker.
- Narrow lookup (known path, obvious pattern) → [mode=get] or [mode=flat]; the outer LLM decides.
- Open-ended or ambiguous query from an agent → [mode=drill]; escalate to blame / diff only for explicit provenance questions.
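The decision rules above collapse into a very small router (illustrative pseudologic, not an API memoir ships):

def choose_entry_point(caller_has_llm, store_is_large, path_already_known):
    """Illustrative routing of the rules above; not something memoir provides."""
    if caller_has_llm:
        return "[mode=get] / [mode=flat]" if path_already_known else "[mode=drill]"
    return 'search(mode="tiered")' if store_is_large else 'search(mode="single")'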
Design references¶
- plugins/claude-code/skills/memory-recall/SKILL.md — canonical caller contract, decision rules, and mode-marker convention.
- plugins/claude-code/hooks/user-prompt-submit.sh — SessionStart nudge that steers the outer LLM toward recall on non-trivial prompts.
Reaching the pipelines from the shell¶
All three pipelines above are exposed through the memoir CLI:
- memoir recall — in-engine single or tiered search (via --mode {single,tiered}).
- memoir get — direct exact-key lookup (the no-LLM shortcut).
- memoir summarize --depth N / --keys <glob> — the raw primitives the caller-driven skill composes.
See the dedicated CLI Reference for full examples, environment variables, and a shell-only drill-down walkthrough.
Advanced Search Patterns¶
1. Hierarchical Prefix Search¶
Exploit path structure for exploration:
# Search all memories under a path prefix
prefix = "profile.professional"
memories = store.search_prefix(namespace, prefix)
2. Multi-Namespace Search¶
Search across multiple user namespaces:
namespaces = ["user:alice", "user:bob", "shared:team"]
results = []
for ns in namespaces:
    results.extend(engine.search(query, ns, limit=3))
3. Temporal Search¶
Combine with version control for time-based queries:
# Search at specific commit/timestamp
historical_results = engine.search(
    query, namespace,
    at_commit="abc123"  # Git-like time travel
)
4. Person-Filtered Search¶
Filter results by person context:
# Search only memories related to a specific person
results = await engine.search(
    query="favorite food",
    namespace="user123",
    person_filter="john"
)
Performance Optimization Strategies¶
1. Search Result Caching¶
# Cache search results by query + namespace
cache_key = hash(query + namespace)
if cache_key in search_cache:
    return search_cache[cache_key]
2. Path Index Precomputation¶
# Precompute path -> memory count mapping
path_index = {}
for _, path, data in all_memories:
    path_index[path] = path_index.get(path, 0) + 1
3. Parallel Path Retrieval¶
# Retrieve from multiple paths concurrently
async def parallel_retrieval(paths):
    tasks = [retrieve_path(p) for p in paths]
    return await asyncio.gather(*tasks)
4. Progressive Result Loading¶
# Return results as they're found
async def streaming_search():
    for path in selected_paths:
        memories = await get_memories(path)
        yield memories  # Stream results
Implementation Details¶
Memory Format Handling¶
The engine handles two memory formats:
Aggregated Memory Format:
{
    "memories": [
        {"content": "...", "confidence": 0.9, "metadata": {}},
        {"content": "...", "confidence": 0.8, "metadata": {}}
    ],
    "count": 2,
    "last_updated": "2024-01-01"
}
Single Memory Format:
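Mirroring the per-item fields inside the aggregated format above, a single memory is plausibly stored directly at its path as (exact fields may differ):

{
    "content": "...",
    "confidence": 0.9,
    "metadata": {}
}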
Search Result Structure¶
Results use a standardized structure:
from dataclasses import dataclass

@dataclass
class IntelligentSearchResult:
    path: str               # Semantic path
    content: str            # Memory content
    metadata: dict          # Additional metadata
    relevance_score: float  # 0.0 to 1.0
    namespace: str          # User namespace
Namespace Handling¶
Flexible namespace format support:
# String format
namespace = "user:alice"
# Tuple format
namespace = ("user", "alice")
# Conversion
namespace_tuple = tuple(namespace.split(":"))
Theoretical Foundation¶
Information Retrieval Theory¶
The search engine implements concepts from:
- Probabilistic Retrieval:
  - LLM estimates P(relevant | path, query)
  - Bayesian inference through language understanding
- Hierarchical Search:
  - Logarithmic complexity through path structure
  - Semantic clustering of related memories
Hierarchical Search Advantages¶
The semantic path structure supports O(log n)-shaped lookups via two concrete mechanisms exposed as CLI primitives:
- Prefix-indexed summarization — summarize --depth N [--keys <pattern>] groups keys by the first N segments in O(k), where k is the number of matching keys (always ≤ the full corpus). An L1 histogram is typically 5–15 entries, constant-sized from the caller's perspective regardless of how many memories exist. A depth-2 survey scoped to one L1 prefix is similarly bounded.
- Exact-key batched get — once the caller has picked keys, retrieval is O(1) per key in the underlying ProllyTree (merkle-tree point query). Batching amortizes CLI startup across 3–7 fetches.
The "log n" label is a shape claim rather than a strict complexity bound — typical queries touch one depth-1 histogram + one depth-2 survey + a handful of gets, which is bounded independently of corpus size for well-distributed taxonomies. The formal complexity depends on branching factor at each level; in practice, depth-3 is the ceiling because the taxonomy itself is capped at 3 levels.
Additional benefits that fall out of the same structure:
- Semantic Clustering: Related memories are naturally grouped at a common prefix.
- Progressive Refinement: The caller drills only the prefixes that plausibly match, skipping irrelevant subtrees entirely.
- Faceted Search: --keys <pattern> supports arbitrary glob filters (preferences.coding.*, *testing*, context.project.*) composable with --depth N.
- Auditable decisions: Because mode markers ([mode=get|flat|drill|blame|diff]) tag every caller-driven response, the retrieval path taken is visible in the transcript — the search becomes debuggable post-hoc.
Reference Files¶
Implementation entry points for the three retrieval pipelines:
| Component | File |
|---|---|
| IntelligentSearchEngine (both mode="single" and mode="tiered") | src/memoir/search/intelligent.py |
| _search_tiered + L1/L2 pickers + L2_ESCALATION_THRESHOLD | src/memoir/search/intelligent.py |
| MemoryService.recall (passes mode through) | src/memoir/services/memory_service.py |
| memoir recall --mode {single,tiered} CLI flag | src/memoir/cli/commands/memory.py |
| UI /api/recall?mode=… | src/memoir/ui/handlers/memory_handler.py |
| summarize --depth N (drill-down primitive) | src/memoir/cli/commands/analysis.py |
| get <key>... (batched exact lookup) | src/memoir/cli/commands/memory.py |
| get service layer | src/memoir/services/memory_service.py |
| Response shapes | src/memoir/services/models.py |
| Caller contract + mode markers | plugins/claude-code/skills/memory-recall/SKILL.md |
| Per-prompt recall nudge | plugins/claude-code/hooks/user-prompt-submit.sh |
| --depth CLI tests | tests/test_cli.py |
| Tiered-mode engine tests | tests/test_search_tiered.py |
Conclusion¶
The Memoir search architecture demonstrates that effective memory retrieval doesn't require expensive vector similarity search. By leveraging semantic taxonomy paths, the system provides three complementary pipelines that share the same substrate:
- IntelligentSearchEngine — mode="single" — one LLM call picks paths for SDK-style callers that don't have their own model. 10–50× faster than vector approaches while preserving semantic understanding.
- IntelligentSearchEngine — mode="tiered" — the same engine runs the drill-down pattern in staged LLM calls (L1 pick → optional L2 pick → key pick) when the store is large enough that a single-prompt path inventory stops fitting cleanly. Narrower prompts per stage at the cost of 2–3 LLM calls.
- Caller-driven tiered retrieval — LLM-free primitives (summarize --depth N, get) let agentic callers drive the drill-down themselves. Zero memoir-side tokens, ~100–300ms wall time, fully auditable via mode markers.
All three pipelines benefit from the same underlying insight: pre-classification into semantic paths transforms retrieval from finding needles in haystacks to navigating a well-organized filing cabinet. The choice between them is a factoring decision about where the picker lives and how many stages it runs — one in-engine pass for simple callers, a staged in-engine pass when the store outgrows that, and fully out-of-engine for agent callers whose outer LLM is already the best possible picker.