MeMo: Memory as a Model

The Problem

LLMs Are Frozen in Time

Large language models are, at their core, static artifacts. Once pretraining ends, the weights freeze, and so does the world they know. For real applications, this is not merely inconvenient but architecturally limiting: a medical assistant blind to new clinical guidelines, a legal tool unaware of recent rulings, an enterprise system that cannot consult documents that postdate its training cutoff.

Three paradigms have emerged to address this. Non-parametric methods (RAG, In-Context Learning (ICL)) retrieve documents at inference time: flexible, but constrained by context windows, brittle to retrieval noise, and unable to synthesize facts scattered across many documents. Parametric methods (fine-tuning, continual pretraining) bake new knowledge into weights: powerful, but expensive, prone to catastrophic forgetting, and limited in how well they generalize to unseen queries. Latent memory methods compress knowledge into soft tokens or other model-specific representations: compact, but shackled by representation coupling, since the memory cannot be reused with any model other than the one that produced it.

Each paradigm bets on a different place for knowledge to live. MeMo asks a more fundamental question: what if knowledge lived in a model of its own?

MeMo combines all three paradigms' strengths into a single modular framework. It avoids context-window limitations (unlike RAG), prevents catastrophic forgetting (unlike fine-tuning), and decouples knowledge from the reasoning model (unlike latent memory), while requiring zero access to the Executive model's weights, making it compatible with any LLM including both open and proprietary closed-source models.

Method	Frozen LLM	No Retrieval Index	Black-box	No Forgetting	Constant-size Memory	Cross-LLM
Non-parametric (RAG, ICL)	✓	✗	✓	✓	✗	✓
Parametric (Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT))	✗	✓	✗	✗	✓	✗
Latent Memory	✓	✓	✗	✓	✗	✗
MeMo (Ours)	✓	✓	✓	✓	✓	✓

Method

Two Models, One Knowledge Interface

MeMo introduces a two-model architecture. The Memory model, substantially smaller than the Executive model, is trained on a synthesized reflection QA dataset derived from a target corpus. It internalizes the corpus's knowledge parametrically and answers sub-queries entirely from its weights at inference time, with no access to source documents. The Executive model (the large frozen LLM) treats the Memory model as an external oracle, querying it through a structured multi-turn protocol to answer user queries.

[Figure: MeMo two-phase architecture, covering the training pipeline and the inference protocol]

Overview of the MeMo training and inference pipeline. A frozen Generator model synthesizes a reflection QA dataset from the target corpus; the Memory model is SFT-trained on it. At inference, the frozen Executive model queries the Memory model through a structured three-stage protocol to answer complex user queries without ever reading the source documents.

Data Synthesis Pipeline

Given a target corpus, the Generator model drives a five-step pipeline to produce a reflection QA dataset Q_final that captures both single-document facts and cross-document relationships. No document identifiers or watermarks are embedded in the generated QA pairs, preventing the Memory model from exploiting shortcut signals at evaluation time.

1. Fact Extraction

Each document is segmented into chunks. For each chunk the Generator runs two parallel passes: direct extraction capturing explicitly stated facts (Q_dir), and indirect extraction targeting inferred or synthesized information beyond the surface text (Q_indir). This dual signal ensures the Memory model trains on both factual recall and inferential reasoning.

2. Consolidation

QA pairs sharing a common underlying context (entity, time period, or relationship type) are merged into composite multi-fact questions (Q_mrg). This produces training instances that require integrating multiple facts within the same contextual chunk, going beyond single-fact question answering. The consolidated set is Q_con = Q_dir ∪ Q_indir ∪ Q_mrg.

3. Verification and Rewriting

Each QA pair in Q_con is evaluated for self-containment: whether it can be correctly answered in isolation, without access to the source chunk. Common failure modes include unresolved pronouns ("What did they propose?") and implicit references ("As noted above…"). Non-self-contained pairs are rewritten using the original chunk as context; those that remain ambiguous are discarded. This yields the verified set Q_ver.

4. Entity Surfacing

For every named entity in Q_ver, the Generator creates reverse-lookup QA pairs (Q_ent): the question encodes the entity's attributes and relationships, and the answer reveals its identity. Facts are aggregated across all QA pairs within the chunk first, and questions span varying complexity levels from single-fact to multi-fact. This directly combats the reversal curse, training the Memory model to infer entities from indirect or partially specified descriptions, a capability exploited during Stage 2 of inference.

5. Cross-Document Synthesis

The final step operates over groups of topically related document chunks. The Generator identifies two types of cross-document connections and writes QA pairs for them (Q_cross): converging clues (multiple documents each contribute complementary facts that together identify the same entity) and parallel properties (different entities across documents share a common attribute or role, enabling comparative and analogical reasoning). The final dataset is Q_final = Q_ver ∪ Q_ent ∪ Q_cross.

The data synthesis pipeline produces three complementary training signals: self-contained factual QA, entity-centric reverse lookups, and cross-document synthesis, giving the Memory model a 360° view of the corpus's knowledge structure.
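The set algebra behind the five steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `QA` tuple, the `looks_self_contained` regular-expression heuristic, and the sample pairs are hypothetical stand-ins. In MeMo the verification and rewriting are carried out by the Generator model, and failing pairs are rewritten before being discarded; the sketch simply drops them.

```python
import re
from typing import NamedTuple


class QA(NamedTuple):
    question: str
    answer: str


# Hypothetical heuristic standing in for the Generator's self-containment check:
# flag questions with unresolved pronouns or implicit references.
_DANGLING = re.compile(
    r"\b(they|he|she|it|this|these|those)\b|as noted above", re.IGNORECASE
)


def looks_self_contained(pair: QA) -> bool:
    return _DANGLING.search(pair.question) is None


def assemble(q_dir, q_indir, q_mrg, q_ent, q_cross):
    """Mirror the unions in the text: Q_con -> Q_ver -> Q_final."""
    q_con = set(q_dir) | set(q_indir) | set(q_mrg)
    # Simplification: the real pipeline first tries to rewrite failing pairs.
    q_ver = {pair for pair in q_con if looks_self_contained(pair)}
    # In the real pipeline Q_ent is generated from Q_ver; here it is passed in.
    return q_ver | set(q_ent) | set(q_cross)


# Invented sample pairs, for illustration only.
q_dir = [QA("When was Acme Bridge completed?", "1931")]
q_indir = [QA("What did they propose?", "A second span")]  # dangling pronoun
q_mrg = [QA("When was Acme Bridge completed and who designed Acme Bridge?",
            "1931; J. Doe")]
q_ent = [QA("Which bridge completed in 1931 was designed by J. Doe?",
            "Acme Bridge")]
q_cross = [QA("Which two bridges share the designer J. Doe?",
              "Acme Bridge and Birch Bridge")]

q_final = assemble(q_dir, q_indir, q_mrg, q_ent, q_cross)
print(len(q_final))
```

The pair with the unresolved "they" is filtered out before the final union, which is the behaviour the Verification and Rewriting step is meant to guarantee.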

Why Reflections, Not Raw Documents?

The core insight behind MeMo's data pipeline is the concept of reflections: synthesized QA pairs that act as compositional windows into the corpus. Unlike training on raw text or simple paraphrase pairs, reflections are engineered to capture exactly what makes cross-document reasoning hard: they consolidate facts from multiple chunks, enforce self-containment (no dangling pronouns, no implicit context), and encode entity relationships explicitly in both forward and reverse directions.
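To make the forward and reverse directions concrete, here is a toy sketch that turns fact triples into both kinds of QA pair. The triple format, the question templates, and the function name `reflections_for` are assumptions made for illustration; the actual pipeline relies on the Generator model rather than string templates.

```python
from collections import defaultdict


def reflections_for(facts):
    """Build forward and reverse-lookup QA pairs.

    `facts` is an iterable of (entity, attribute, value) triples,
    a hypothetical format chosen for this sketch.
    """
    by_entity = defaultdict(list)
    forward = []
    for entity, attribute, value in facts:
        by_entity[entity].append((attribute, value))
        # Forward direction: the entity is given, the attribute is asked for.
        forward.append((f"What is the {attribute} of {entity}?", value))

    reverse = []
    for entity, attributes in by_entity.items():
        # Reverse direction: aggregate the entity's facts into clues
        # and ask for the entity's identity.
        clues = " and ".join(f"{name} {value}" for name, value in attributes)
        reverse.append((f"Which entity has {clues}?", entity))
    return forward, reverse


# Invented facts, for illustration only.
facts = [
    ("Acme Bridge", "completion year", "1931"),
    ("Acme Bridge", "designer", "J. Doe"),
]
forward, reverse = reflections_for(facts)
for question, answer in forward + reverse:
    print(question, "->", answer)
```

Training on both directions is what lets the Memory model go from a partial description back to the entity, not only from the entity to its attributes.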

Three Stages, One Answer

Naive single-turn querying of the Memory model fails on compositional questions requiring chained reasoning. MeMo's protocol mirrors how a skilled analyst would interrogate an unfamiliar knowledge base, proceeding in three stages:

1. Grounding: Cast Wide

The Executive decomposes the user query into K atomic sub-questions, each targeting a single identifying constraint. Each is answered by Memory independently (with no shared context between sub-questions), providing broad, unbiased grounding. K is adaptively determined per query by the Executive.

2. Entity Identification: Converge

Using the grounding responses as context, the Executive iteratively issues targeted follow-up questions to Memory, progressively narrowing the candidate entity pool. This stage exploits the entity-surfacing training; Memory has learned to resolve partial descriptions to concrete entities. The stage terminates when a single entity is identified or the budget is exhausted.

3. Answer Seeking: Verify and Synthesize

Conditioned on the identified entity, the Executive queries Memory for precise supporting facts. Once sufficient evidence is gathered, it synthesizes a final answer from the accumulated responses. Memory responses are compact natural-language snippets whose length is independent of corpus size, enabling constant-time retrieval regardless of knowledge base scale.
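The control flow of the three stages can be sketched as a single function. Everything here is hypothetical scaffolding: the callables `decompose`, `next_followup`, `pick_entity`, `fact_questions`, and `synthesize` stand in for prompts to the Executive model, `memory` stands in for a call to the Memory model, and the default `budget` is an arbitrary choice. The source does not specify these interfaces.

```python
from typing import Callable, List, Optional


def answer_query(
    user_query: str,
    memory: Callable[[str], str],
    decompose: Callable[[str], List[str]],
    next_followup: Callable[[str, List[str]], Optional[str]],
    pick_entity: Callable[[List[str]], Optional[str]],
    fact_questions: Callable[[str, str], List[str]],
    synthesize: Callable[[str, List[str]], str],
    budget: int = 5,
) -> str:
    # Stage 1 (Grounding): each sub-question is answered independently,
    # with no context shared between the Memory calls.
    notes = [memory(question) for question in decompose(user_query)]

    # Stage 2 (Entity Identification): ask follow-ups until one entity
    # is identified or the budget is exhausted.
    entity = pick_entity(notes)
    turns = 0
    while entity is None and turns < budget:
        followup = next_followup(user_query, notes)
        if followup is None:
            break
        notes.append(memory(followup))
        entity = pick_entity(notes)
        turns += 1

    # Stage 3 (Answer Seeking): gather supporting facts about the entity,
    # then let the Executive synthesize the final answer.
    if entity is not None:
        notes.extend(memory(q) for q in fact_questions(user_query, entity))
    return synthesize(user_query, notes)


# Toy stand-ins: a dictionary plays the Memory model, lambdas play the Executive.
toy_memory = {
    "Which bridge crosses the Blue River?": "Acme Bridge",
    "Which bridge was completed in 1931?": "Acme Bridge",
    "Who designed Acme Bridge?": "J. Doe",
}

result = answer_query(
    "Who designed the 1931 bridge over the Blue River?",
    memory=lambda q: toy_memory.get(q, "unknown"),
    decompose=lambda q: ["Which bridge crosses the Blue River?",
                         "Which bridge was completed in 1931?"],
    next_followup=lambda q, notes: None,
    pick_entity=lambda notes: notes[0] if len(set(notes)) == 1 else None,
    fact_questions=lambda q, entity: [f"Who designed {entity}?"],
    synthesize=lambda q, notes: notes[-1],
)
print(result)  # J. Doe
```

Because the Executive only exchanges text with `memory`, the same loop works whether the Executive is an open-weights model or a closed API.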

Continual Integration via Model Merging

When a new corpus arrives, full parametric retraining must process the union of all previous sources, a cost that grows quadratically with the number of corpora. MeMo instead trains a separate Memory model on the new corpus and applies model merging to combine it with existing Memory models. Each Memory model contributes a task vector capturing its parametric shift from the shared pretrained base; these vectors are merged via Trim, Elect Sign & Merge (TIES) with a configurable sparsification density.
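A minimal sketch of TIES merging on flat lists of floats follows, assuming each Memory model's weights can be treated as one vector. It follows the three published steps of TIES (trim, elect sign, disjoint mean), but the function names, the `scale` parameter, and the toy numbers are illustrative assumptions, not the authors' implementation. The `density` argument plays the role of the sparsification density mentioned above.

```python
def trim(task_vector, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest.

    Entries tied with the cut-off magnitude are all kept.
    """
    keep = max(1, round(density * len(task_vector)))
    threshold = sorted((abs(v) for v in task_vector), reverse=True)[keep - 1]
    return [v if abs(v) >= threshold else 0.0 for v in task_vector]


def ties_merge(base, finetuned_models, density=0.3, scale=1.0):
    """Merge several fine-tuned weight vectors that share one base."""
    # Task vector: parametric shift of each model from the shared base.
    task_vectors = [
        [weight - b for weight, b in zip(model, base)]
        for model in finetuned_models
    ]
    trimmed = [trim(tv, density) for tv in task_vectors]

    merged = []
    for i, b in enumerate(base):
        column = [tv[i] for tv in trimmed]
        # Elect sign: the direction with the larger total mass wins.
        sign = 1.0 if sum(column) >= 0 else -1.0
        # Disjoint mean: average only the entries that agree with that sign.
        agreeing = [v for v in column if v * sign > 0]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + scale * delta)
    return merged


# Toy example with two "Memory models" fine-tuned from the same base.
base = [0.0, 0.0, 0.0, 0.0]
memory_a = [1.0, -0.1, 0.5, 0.0]
memory_b = [0.8, 0.05, -0.9, 0.02]
merged = ties_merge(base, [memory_a, memory_b], density=0.5)
print([round(w, 3) for w in merged])
```

In the toy example the third coordinate has conflicting signs (+0.5 and -0.9); the negative direction carries more mass, so only -0.9 contributes to the merged weight.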

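For K=2 corpus subsets of ~640k QA pairs each, let X and Y denote the cost of training on the first and second subset (about 24 8×H100 GPU-hours each, as implied by the totals). Merging accumulates only X+Y ≈ 48 8×H100 GPU-hours, versus X+(X+Y) ≈ 72 8×H100 GPU-hours for full retraining, which must reprocess the first subset when the second arrives: a 33% reduction. At K=10, the saving grows to 5.5× (240 vs. 1,320 8×H100 GPU-hours), because cumulative merging cost scales linearly with K while cumulative retraining cost scales quadratically.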
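The quoted figures can be reproduced with a short calculation. The sketch assumes equally sized corpora, a training cost proportional to the amount of data, and a per-corpus cost of 24 8×H100 GPU-hours inferred from the totals above; the cost of the merge operation itself is ignored. The function names are hypothetical.

```python
def merging_cost(num_corpora, cost_per_corpus):
    # One Memory model is trained per corpus, so cost grows linearly.
    return num_corpora * cost_per_corpus


def retraining_cost(num_corpora, cost_per_corpus):
    # When the k-th corpus arrives, retraining processes all k corpora:
    # c * (1 + 2 + ... + K) = c * K * (K + 1) / 2.
    return cost_per_corpus * num_corpora * (num_corpora + 1) // 2


COST_PER_CORPUS = 24  # 8xH100 GPU-hours, inferred from the K=2 totals

for k in (2, 10):
    merge = merging_cost(k, COST_PER_CORPUS)
    retrain = retraining_cost(k, COST_PER_CORPUS)
    print(f"K={k}: merging={merge}, retraining={retrain}, "
          f"ratio={retrain / merge:.1f}x, saving={1 - merge / retrain:.0%}")
```

Under these assumptions the ratio of retraining to merging cost is (K+1)/2, which gives 1.5× at K=2 (a 33% saving) and 5.5× at K=10.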

Results

Beating RAG Where It Matters Most

MeMo was evaluated on three knowledge-intensive benchmarks spanning different reasoning challenges: BrowseComp-Plus (multi-hop retrieval across 300 questions with 3,541 total corpus documents), NarrativeQA (long-document comprehension across books and movie scripts), and MuSiQue (multi-step reasoning across 2–4 Wikipedia paragraphs, 1,000 questions). Results are reported under two Executive models (Qwen2.5-32B-Instruct and Gemini-3.0-Flash) against three retrieval baselines (BM25, NV-Embed-V2, HippoRAG2) and one latent-memory baseline (Cartridges).

53.58%: NarrativeQA accuracy (Gemini-3.0-Flash), more than 2× the best RAG baseline

60.20%: MuSiQue accuracy (Gemini-3.0-Flash), the best among all methods tested

33%: compute reduction via model merging vs. full retraining at K=2

Method	BrowseComp-Plus (%) Q2.5-32B	BrowseComp-Plus (%) Gemini-3F	NarrativeQA (%) Q2.5-32B	NarrativeQA (%) Gemini-3F	MuSiQue (%) Q2.5-32B	MuSiQue (%) Gemini-3F
Perfect Retrieval*	79.67	88.33	51.42	60.41	62.83	73.00
BM25	1.11	27.00	10.24	14.33	20.00	23.20
NV-Embed-V2	50.67	57.00	20.59	26.62	37.47	46.60
HippoRAG2	56.11	66.33	21.39	23.21	42.17	57.00
Cartridges	0.00	—	3.75	—	8.57	—
MeMo (Ours)	54.22	66.67	26.85	53.58	48.30	60.20

*Perfect Retrieval is an empirical upper bound. Bold = best among real methods. Q2.5-32B = Qwen2.5-32B-Instruct; Gemini-3F = Gemini-3.0-Flash. MeMo Memory model = Qwen2.5-14B-Instruct.

The pattern is consistent. On NarrativeQA and MuSiQue, where reasoning requires synthesizing facts distributed across many documents, MeMo outperforms every retrieval baseline under both Executive models. On NarrativeQA under Gemini-3.0-Flash, the strongest retrieval baseline (NV-Embed-V2) achieves 26.62% and HippoRAG2 achieves 23.21%; MeMo reaches 53.58%, more than double either. This is exactly where retrieval systems structurally fail: they can find relevant passages but cannot synthesize coherent answers from content distributed across a full book.

On BrowseComp-Plus, MeMo leads under Gemini-3.0-Flash (66.67%) while narrowly trailing HippoRAG2 under Qwen2.5-32B-Instruct (54.22% vs. 56.11%). This gap is informative: BrowseComp-Plus answers are absent from Executive model pretraining, making direct document access inherently valuable; parametric encoding is at a mild structural disadvantage when answers require verbatim retrieval.

The plug-and-play advantage is quantifiable: upgrading the Executive from Qwen2.5-32B-Instruct to Gemini-3.0-Flash yields gains of +12.45 percentage points on BrowseComp-Plus, +26.73 on NarrativeQA, and +11.90 on MuSiQue, with zero retraining of the Memory model. As frontier Executive models improve over time, MeMo's accuracy can improve with them at no additional training cost.
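These gains are simple differences between the two MeMo columns of the results table, as the following check shows. The dictionary layout and variable names are illustrative; the numbers are copied from the MeMo row.

```python
# MeMo accuracy (%) per benchmark and Executive model, from the results table.
memo_accuracy = {
    "BrowseComp-Plus": {"Qwen2.5-32B-Instruct": 54.22, "Gemini-3.0-Flash": 66.67},
    "NarrativeQA": {"Qwen2.5-32B-Instruct": 26.85, "Gemini-3.0-Flash": 53.58},
    "MuSiQue": {"Qwen2.5-32B-Instruct": 48.30, "Gemini-3.0-Flash": 60.20},
}

# Gain in percentage points from swapping the Executive, Memory model unchanged.
gains = {
    benchmark: round(
        scores["Gemini-3.0-Flash"] - scores["Qwen2.5-32B-Instruct"], 2
    )
    for benchmark, scores in memo_accuracy.items()
}

for benchmark, gain in gains.items():
    print(f"{benchmark}: +{gain:.2f} points")
```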

Near-Immunity to Retrieval Noise

When an equal number of negative distractor documents is added to the corpus (1×N, where N=1,775 evidence documents for BrowseComp-Plus and N=2,648 for MuSiQue), retrieval-based systems suffer significant accuracy degradation as they struggle to distinguish relevant from irrelevant content. MeMo remains essentially unaffected:

Method	BrowseComp-Plus	MuSiQue
NV-Embed-V2	−6.22%	−4.83%
HippoRAG2	−6.22%	−5.16%
MeMo (Ours)	+0.55%	−1.77%

The robustness is structural: the Memory model responds entirely from internalized parametric knowledge at inference time. Noise documents in the retrieval pool have no effect on its weights. The −1.77% drop on MuSiQue falls within standard deviation bounds, in stark contrast to the 4–6% degradations seen in retrieval systems.

Model Merging: The Compute–Accuracy Trade-off

TIES-merging at ρ=0.3 achieves 15.81% versus full retraining's 26.85% on NarrativeQA under Qwen2.5-32B-Instruct, a gap of 11.04 percentage points. This is a real cost. The merged model still outperforms BM25 (10.24%) on NarrativeQA, though it trails NV-Embed-V2 (20.59%) and HippoRAG2 (21.39%), and at K=10 the 5.5× compute saving becomes increasingly compelling for knowledge bases spanning many independent sources updated incrementally over time.

Key Insight

One Thing to Remember

RAG retrieves documents and hopes the LLM can synthesize them. MeMo encodes the corpus directly into a model, letting any frozen LLM query that model as a knowledge oracle. The result is a system immune to retrieval noise, free from context limits, and composable with any LLM you already have.

For machine learning (ML) practitioners deploying knowledge-intensive systems: if your use case involves long documents, multi-hop reasoning, or knowledge bases too large to fit in a context window, MeMo offers a qualitatively different architecture. Its three-stage protocol adds per-query overhead compared to single-turn RAG, but the accuracy gains on NarrativeQA (more than 2× over the best RAG baseline) suggest the trade-off is frequently worthwhile.

The plug-and-play compatibility is the sleeper advantage. Train the Memory model once with a weaker open-source Generator, then deploy it alongside GPT-4, Gemini, Claude, or any other frontier model as the Executive. As Executive models improve, accuracy improves, with no retraining required. MeMo inverts the usual dependency: the knowledge store and the reasoning engine are finally independent.

Limitations & Future Work

Where MeMo Falls Short

The current design has four structural constraints:

Training Cost

The five-step data synthesis pipeline and subsequent SFT require substantial one-time compute (~72 8×H100 GPU-hours for the NarrativeQA experiments). Corpora that change frequently will accumulate costs proportional to update frequency.

Memory Capacity Ceiling

The Memory model has finite representational capacity. Very large or information-dense corpora may exceed what a fixed-size Memory model can faithfully encode; scaling behavior with corpus size is not yet fully characterized.

Merging Accuracy Gap

TIES-merging trails full retraining by 11%–19% on NarrativeQA. For high-stakes deployments, this accuracy cost may outweigh the compute savings, depending on the number of corpora K and accuracy requirements.

Evaluation Scope

Evaluation covers three English QA benchmarks. Performance on structured knowledge (tables, code, formal logic) or multilingual corpora is not assessed. The multi-turn protocol also introduces per-query latency overhead not present in single-pass RAG systems.

Future directions the authors identify: more efficient memory construction, extensions to dynamically evolving corpora, tighter coordination loops between the Executive and Memory models, and weight-sharing techniques that could eliminate the dedicated Memory model GPU footprint for very large knowledge bases.

Citation

Cite this Work

BibTeX

@article{quek2026memo,
  title   = {MeMo: Memory as a Model},
  author  = {Quek, Ryan Wei Heng and Lee, Sanghyuk and
           Leong, Alfred Wei Lun and Verma, Arun and
           Prakash, Alok and Chen, Nancy F. and
           Low, Bryan Kian Hsiang and Rus, Daniela and
           Solar-Lezama, Armando},
  journal = {arXiv preprint arXiv:2605.15156},
  year    = {2026}
}