Watch: file & folder ingestion with vector search
memoir watch ingests local files (PDFs, Markdown, docx, html, ...) into
memoir and makes them semantically searchable. It complements the existing
memoir remember write path: anything you point watch at is parsed,
classified into the same taxonomy, and indexed for vector search.
memoir search is the new top-level command that queries the vector index.
When to use watch vs. remember
- memoir remember: short, ad-hoc facts and notes you type or paste in. Classified by LLM, stored at a taxonomy path. No vector index.
- memoir watch: bulk ingest of existing files. Each file is parsed with markitdown, classified, stored as a memory, and indexed for vector search by memoir search.
Quick start
# Create a store (or reuse an existing one)
memoir new ~/.memoir/notes
export MEMOIR_STORE=~/.memoir/notes
# Point watch at a folder of docs
memoir watch add ~/Documents/notes
# → walks the folder, parses each file, classifies, indexes for search
# Or a single file
memoir watch add ~/papers/transformer.pdf -n research
# See what's registered
memoir watch list
# Re-scan (idempotent — only changed files are re-indexed)
memoir watch scan ~/Documents/notes
# Search the indexed content
memoir search "transformer attention mechanism"
memoir search "async patterns" -n research -k 10 --json
# Tear down
memoir watch remove ~/Documents/notes --purge
How it works
For each watched file:
- Size guard. Files larger than watch:config.max_size_bytes (default 100 000 bytes on disk) are rejected outright. The whole pipeline is sized for short documents.
- Hash. Content hash (blake3 if installed, otherwise sha256). If the hash matches the prior scan, nothing happens, which keeps re-scans cheap.
- Parse. markitdown extracts plaintext.
- Chunk + summarize. A single LLM call asks for (a) a one-paragraph summary of the whole document, and (b) a list of chunk boundaries sized for vector search. Boundaries are reported as verbatim anchor strings (first / last ~40 chars of each chunk); the pipeline locates them in the source text via str.find to recover real char offsets. Hard cap of 10 chunks per file.
- Store. The summary lands at raw.<file>.summary; each chunk at raw.<file>.chunk.001, .chunk.002, … under the watch namespace via MemoryService.remember. extra_metadata.source records {kind: "watch", abs_path, content_hash, kind_detail, chunk_index, chunk_start, chunk_end, ...} so each entry traces back to its origin.
- Index. Every memory key (summary + chunks) is added to the vector index as its own prollytree text-index document. Semantic search returns chunk-level hits.
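The anchor-to-offset recovery in the Chunk + summarize step can be sketched as follows. This is a minimal illustration, not memoir's actual code: the function name locate_chunks, the forward-only search cursor, and the policy of skipping a chunk whose anchors cannot be found are all assumptions made for the example.

```python
from typing import List, Tuple

MAX_CHUNKS = 10  # hard cap per file, as stated in the steps above


def locate_chunks(text: str, anchors: List[Tuple[str, str]]) -> List[Tuple[int, int]]:
    """Map (first_anchor, last_anchor) pairs to (start, end) char offsets.

    Hypothetical helper: skipping unresolvable chunks is an assumed
    fallback, not documented behaviour.
    """
    spans: List[Tuple[int, int]] = []
    cursor = 0  # search forward only, so repeated phrases resolve in order
    for first, last in anchors[:MAX_CHUNKS]:
        start = text.find(first, cursor)
        if start == -1:
            continue  # opening anchor not found verbatim
        last_pos = text.find(last, start)
        if last_pos == -1:
            continue  # closing anchor not found after the opening one
        end = last_pos + len(last)  # end offset is exclusive
        spans.append((start, end))
        cursor = end
    return spans
```

Each returned pair can be used directly as a slice, text[start:end], which is presumably what chunk_start and chunk_end in the stored metadata describe.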
Re-scanning a changed file tears down every previous key — both KV and vector — before writing the new summary + chunks, so the key namespace never accumulates orphans across rewrites. The data tree commits once per file (covering all the puts/deletes + path-registry update); the vector tree commits once per file as well.
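A rough sketch of the hash-gated re-scan and teardown described above, using a plain dict and set as stand-ins for the KV store and vector index. The function names, the in-memory containers, and the file_id argument are hypothetical; only the blake3-or-sha256 fallback and the raw.<file>.summary / raw.<file>.chunk.NNN key layout come from this page.

```python
import hashlib
from typing import Dict, List, Set


def content_hash(data: bytes) -> str:
    """Use blake3 if the optional package is installed, otherwise sha256."""
    try:
        import blake3  # optional third-party dependency
        return blake3.blake3(data).hexdigest()
    except ImportError:
        return hashlib.sha256(data).hexdigest()


def rescan(file_id: str, data: bytes, summary: str, chunks: List[str],
           kv: Dict[str, str], vec: Set[str], hashes: Dict[str, str]) -> bool:
    """Re-index one file; return False when the content hash is unchanged."""
    new_hash = content_hash(data)
    if hashes.get(file_id) == new_hash:
        return False  # unchanged file: nothing to do

    prefix = f"raw.{file_id}."
    # Tear down every previous key in both stores before writing new ones.
    for key in [k for k in kv if k.startswith(prefix)]:
        del kv[key]
        vec.discard(key)

    kv[f"{prefix}summary"] = summary
    for i, chunk in enumerate(chunks, start=1):
        kv[f"{prefix}chunk.{i:03d}"] = chunk
    vec.update(k for k in kv if k.startswith(prefix))
    hashes[file_id] = new_hash
    return True
```

Deleting by prefix before writing is what keeps a file that shrinks from three chunks to one from leaving chunk.002 and chunk.003 behind.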
memoir watch remove --purge <file> deletes every raw.<file>.* key
from both KV and vector and removes the file from the watched-paths
registry.
All state (config, registered paths, per-file hashes) lives inside the
memoir store, under the watch:config, watch:paths, watch:files keys
— no sidecar files.
CLI reference
memoir watch add <path> [-n NAMESPACE] [--model MODEL]
memoir watch list
memoir watch scan [path] [-n NAMESPACE] [--model MODEL]
memoir watch remove <path> [--purge]
memoir watch status <path>
memoir watch formats
memoir search <query> [-n NAMESPACE] [-k INT] [--json]
Defaults
- Single files only. Folders are rejected by watch add / watch scan.
- Namespace: watch unless -n is given.
- Max file size: 100 000 bytes on disk (in watch:config.max_size_bytes). Files larger than this are rejected. The config dict is written on first scan; reset to defaults with memoir forget config -n watch --force and re-scan.
- Max chunks per file: 10 (hard cap in the pipeline; prompt nudges toward 1–5).
- Embedder: MiniLmEmbedder (downloads ~90 MB of model weights on first run into ~/.cache/prollytree/embedders/).
- LLM model: resolves via --model → MEMOIR_LLM_MODEL env → claude-haiku-4-5 default.
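The model resolution order and the size guard can be expressed in a few lines. This is an illustrative sketch only: resolve_model and within_size_limit are hypothetical names, and treating an empty string as "not set" is an assumption.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL = "claude-haiku-4-5"
DEFAULT_MAX_SIZE_BYTES = 100_000


def resolve_model(cli_model: Optional[str] = None,
                  env: Optional[Mapping[str, str]] = None) -> str:
    """Precedence: --model flag, then MEMOIR_LLM_MODEL, then the default."""
    env = os.environ if env is None else env
    return cli_model or env.get("MEMOIR_LLM_MODEL") or DEFAULT_MODEL


def within_size_limit(size_bytes: int,
                      max_size_bytes: int = DEFAULT_MAX_SIZE_BYTES) -> bool:
    """Files strictly larger than the limit are rejected."""
    return size_bytes <= max_size_bytes
```

For example, resolve_model(None, {}) falls through to the default, while an explicit --model value wins over the environment variable.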
v1 limits
- Local files only. No URL / RSS / cloud-storage sources.
- On-demand scans. No live filesystem watcher / daemon; re-run memoir watch scan to pick up changes.
- Original bytes not stored. Only the markitdown plaintext (or its deterministic summary) is stored. To re-process a file, point watch at the original on disk.
- Vector search is watch-only. memoir remember writes are not added to the vector index in v1.
- Sequential. Large folders take a while: one LLM classification call per changed file. Re-scans skip unchanged files via content hash.
Installation
Watch requires the markitdown extra:
The vector index itself comes with the default prollytree wheel — no extra install needed.