RAG Retrieves Text. MSA Retrieves Thought: Rethinking Memory in LLMs
Why Fine-Tuning, RAG, and Latent Compression Fall Short—and What Memory Sparse Attention Changes
Summary
Giving LLMs memory has always been a trade-off between scale, precision, and efficiency. Fine-tuning, RAG, and latent compression each solve part of the problem, but none solves it completely. Memory Sparse Attention (MSA) introduces a new approach: it retrieves internal model representations instead of text, potentially redefining how AI systems remember and reason.
Primary source: Memory Sparse Attention: Long-Context LLM Memory via Sparse Latent Retrieval
Table of Contents
- Introduction
- The Core Problem: Memory in LLMs
- Approach #1: Fine-Tuning the Weights
- Approach #2: Retrieval-Augmented Generation (RAG)
- Approach #3: Latent State Compression
- Why All Three Approaches Break
- Enter MSA: Memory Sparse Attention
- How MSA Actually Works
- The Key Innovation: Retrieval in Thought Space
- Scaling Trick: Document-wise RoPE
- Performance Results and Benchmarks
- The Hybrid Insight: Latent + Text
- Limitations and Open Questions
- What This Means for AI Systems
- Conclusion
- References
- FAQ
Introduction
There is a quiet assumption behind most modern AI systems:
Large Language Models can “remember.”
But that assumption breaks the moment you push beyond short contexts.
In reality, memory in LLMs is still an unsolved systems problem.
And today, we only have three ways to approximate it.
All three are flawed.
The Core Problem: Memory in LLMs
LLMs are not databases.
They do not “store” knowledge in the way traditional systems do.
Instead, they:
- Encode patterns into weights
- Use context windows for short-term memory
- Rely on external systems for long-term recall
The challenge is:
How do you give an LLM scalable, accurate, and efficient memory?
So far, the industry has converged on three approaches.
Approach #1: Fine-Tuning the Weights
This is the most direct method.
How It Works
- Train the model on new data
- Embed knowledge directly into weights
Advantages
- High precision
- Native integration into model
Problems
- Fixed capacity
- Expensive retraining
- Catastrophic forgetting
When you add new knowledge, you risk overwriting old knowledge.
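The overwrite risk is easy to see even in a toy setting. The sketch below (plain Python, made-up data) fits a one-weight model on one task, then fine-tunes it on a conflicting task; the error on the first task climbs back up:

```python
# Toy illustration of catastrophic forgetting: a one-parameter model y = w*x,
# fine-tuned on task A (where w should be 2), then on task B (where w should
# be -1). Plain gradient descent with no replay overwrites the old knowledge.

def train(w, data, lr=0.1, steps=200):
    for _ in range(steps):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x   # gradient of the squared error
    return w

def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

task_a = [(1.0, 2.0), (2.0, 4.0)]    # consistent with w = 2
task_b = [(1.0, -1.0), (2.0, -2.0)]  # consistent with w = -1

w = train(0.0, task_a)               # fine-tune on task A
err_before = mse(w, task_a)          # task A learned: error near zero
w = train(w, task_b)                 # fine-tune further on task B
err_after = mse(w, task_a)           # task A forgotten: error is large
```

The same dynamic, scaled up to billions of parameters, is why naive continued fine-tuning degrades previously learned knowledge.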
Approach #2: Retrieval-Augmented Generation (RAG)
The current industry standard.
How It Works
- Store documents externally
- Retrieve relevant chunks
- Inject into prompt
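The three steps can be sketched end to end. The bag-of-words embedder below is a deliberately crude stand-in for the learned embeddings and vector database a real system would use:

```python
# Minimal RAG retrieval step: embed documents and query, rank by cosine
# similarity, inject the top chunk into the prompt. The toy bag-of-words
# "embedding" stands in for a real embedding model and vector store.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "rotary position embeddings encode token positions",
    "the kv cache stores attention keys and values",
    "fine tuning updates model weights on new data",
]

def retrieve(query, k=1):
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

query = "how does the kv cache work"
context = retrieve(query)[0]
prompt = f"Context: {context}\n\nQuestion: {query}"
```

Note that everything here happens in text space: the model never sees the retriever's representations, only the reconstructed prompt.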
Advantages
- Scales to any dataset
- Easy to update
- Modular
Problems
- Works in text space, not model reasoning space
- Retrieval quality bottleneck
- Structural ceiling
Even with reranking and embeddings, there is a mismatch:
The model thinks in latent space, but retrieves in text space.
Approach #3: Latent State Compression
A more experimental approach.
How It Works
- Compress memory into hidden states
- Pass through model
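The idea can be sketched with a stand-in recurrent update (an exponential moving average here; RWKV's actual state mixing is more elaborate):

```python
# Sketch of latent-state compression: a sequence of arbitrary length is
# folded into a fixed-size state vector by a recurrent update. The EMA
# below is a simplified stand-in for RWKV-style state mixing.
import numpy as np

D_STATE = 8          # fixed memory size, independent of sequence length
DECAY = 0.9          # how quickly old tokens fade from the state

def compress(token_vecs, state=None):
    state = np.zeros(D_STATE) if state is None else state
    for v in token_vecs:
        state = DECAY * state + (1 - DECAY) * v   # old info decays each step
    return state

rng = np.random.default_rng(0)
short = rng.normal(size=(10, D_STATE))
long = rng.normal(size=(10_000, D_STATE))

# The memory stays the same size no matter how long the input is...
assert compress(short).shape == compress(long).shape == (D_STATE,)
# ...which is exactly why early tokens are eventually lost at scale.
```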
Advantages
- Efficient
- Compact
Problems
- Fixed-size memory
- Information loss
- Degrades at scale
Example:
RWKV drops from 100% to 53% accuracy at 1M tokens in needle-in-a-haystack tests (source).
Why All Three Approaches Break
Each method optimizes for one dimension:
| Approach | Strength | Weakness |
|---|---|---|
| Fine-tuning | Precision | Forgetting |
| RAG | Scalability | Representation mismatch |
| Latent states | Efficiency | Capacity limits |
But none delivers all three at once:
Scalable, accurate, native memory.
Enter MSA: Memory Sparse Attention
A fourth approach emerges:
Memory Sparse Attention (MSA)
Developed by Evermind, MSA rethinks memory at the architectural level (paper).
How MSA Actually Works
Instead of retrieving text, MSA retrieves internal representations: the key-value states the model produced while encoding the documents.
Key Components
- Router Projectors
- KV Cache Signatures
- Top-K Selection
Process
- Model encodes documents into KV cache
- Router scores documents based on relevance
- Selects top-K documents
- Feeds them directly into attention
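The pipeline above can be sketched with stand-in tensors. Everything here (mean-pooled signatures, dot-product routing) is an illustrative simplification, not the paper's exact design:

```python
# Sketch of MSA-style latent retrieval. Each document is pre-encoded into a
# KV cache; a per-document "signature" (here, mean-pooled keys) lets a router
# score documents against the query state, and only the top-K documents'
# KV entries are handed to attention. Shapes and scoring are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_docs, doc_len, d_head = 16, 32, 64

# Offline: encode each document into keys/values (random stand-ins here).
keys = rng.normal(size=(n_docs, doc_len, d_head))
values = rng.normal(size=(n_docs, doc_len, d_head))
signatures = keys.mean(axis=1)                # one summary vector per document

def route_top_k(query_state, k=2):
    scores = signatures @ query_state         # router: dot-product relevance
    top = np.argsort(scores)[-k:][::-1]       # indices of the k best documents
    return top, keys[top], values[top]        # sparse KV set for attention

q = rng.normal(size=d_head)
top, k_sel, v_sel = route_top_k(q, k=2)
# Attention now runs over 2 * doc_len entries instead of n_docs * doc_len.
assert k_sel.shape == (2, doc_len, d_head)
```

The sparsity is the point: attention cost scales with K, not with the size of the memory bank.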
The Key Innovation: Retrieval in Thought Space
This is the breakthrough.
RAG retrieves text. MSA retrieves thought.
Why This Matters
- No embedding mismatch
- No text reconstruction gap
- Same representation as model reasoning
Retrieval and generation now:
- Share the same forward pass
- Use the same loss function
- Operate in the same space
Scaling Trick: Document-wise RoPE
Scaling is where most systems fail.
MSA solves this elegantly.
The Problem
Position embeddings drift out of distribution when inference contexts far exceed the training length.
The Solution
- Each document starts at position 0
- Independent positional encoding
Result
- Train on 64K tokens
- Infer on 100M tokens
Without out-of-distribution issues.
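A minimal sketch, assuming a standard rotary encoding: because positions restart at 0 for every retrieved document, the rotation angles never exceed the range seen during training, regardless of how long the combined stream is.

```python
# Document-wise RoPE sketch: each document gets positions 0..len-1, so the
# largest position ever encoded is bounded by the longest document, not by
# the total stream length.
import numpy as np

def rope(x, positions, theta=10000.0):
    # Standard rotary embedding: rotate dimension pairs by position-scaled angles.
    d = x.shape[-1]
    freqs = 1.0 / theta ** (np.arange(0, d, 2) / d)
    ang = positions[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def doc_positions(doc_lens):
    # Positions restart at 0 for every document (document-wise RoPE).
    return np.concatenate([np.arange(n) for n in doc_lens])

# Three retrieved documents, concatenated into one attention stream:
pos = doc_positions([5, 7, 3])
assert list(pos[:5]) == [0, 1, 2, 3, 4]   # first document
assert pos.max() == 6                     # bounded by the longest document

rng = np.random.default_rng(0)
doc = rng.normal(size=(5, 8))
encoded = rope(doc, pos[:5])              # identical wherever the doc appears
```

A model trained on 64K-token positions therefore never sees a position index outside that range at inference, even over a 100M-token memory bank.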
Performance Results and Benchmarks
On a 4B model (Qwen3-4B):
Key Results
- <9% degradation from 16K → 100M tokens
- 94.84% accuracy at 1M tokens
- Backbone baseline collapses to 25%
Comparison
- Beats RAG + 235B generators on multiple QA benchmarks
- 16% average improvement over standard RAG
Results reference: MSA benchmarks
The Hybrid Insight: Latent + Text
The most important finding is not the architecture.
It is the hybrid approach.
What Ablation Shows
- Latent retrieval finds relevant documents
- Raw text generates final answers
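A compressed sketch of that recipe, with hand-picked toy latents standing in for the model's real hidden states:

```python
# Hybrid retrieval sketch: rank documents by similarity in latent space,
# but hand the *raw text* of the winner to the generator. The latent
# vectors here are hand-crafted stand-ins for real hidden states.
import numpy as np

docs = [
    "doc about rotary embeddings",
    "doc about kv caches",
    "doc about routers",
]
latents = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.7, 0.7]])          # one latent signature per document
query_latent = np.array([0.1, 0.9])       # closest to document 1

scores = latents @ query_latent           # step 1: latent retrieval scores
best = int(np.argmax(scores))             # finds the relevant document
prompt = f"Context: {docs[best]}\n\nAnswer the question."  # step 2: raw text
```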
Insight
MSA is not replacing text retrieval. It is refining it.
Limitations and Open Questions
This is not a solved problem yet.
Key Concerns
- Tested on single backbone (Qwen3-4B)
- Unknown scaling to 70B+ models
- Static memory bank (no real-time updates)
- No confidence intervals
Practical Constraints
- Memory parallelism depends on small models
- Offline encoding required
What This Means for AI Systems
MSA signals a shift.
From:
External pipelines
To:
Internalized memory
Implications
- Better retrieval quality
- Reduced system complexity
- Closer integration of memory and reasoning
But RAG Still Wins In
- Dynamic updates
- Interpretability
- Cost efficiency
The Bigger Direction
The real trend is clear:
Retrieval is moving inside the model.
MSA is not the final answer.
But it is a directional shift.
Conclusion
We are still early in solving memory for LLMs.
The current approaches:
- Fine-tuning
- RAG
- Latent compression
Each solves part of the problem.
MSA introduces something new:
Memory aligned with how the model actually thinks.
It closes a structural gap that pipelines cannot.
And that makes it worth paying attention to.
References
- Memory Sparse Attention: Long-Context LLM Memory via Sparse Latent Retrieval
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- RoFormer: Enhanced Transformer with Rotary Position Embedding
- RWKV-LM repository
- Qwen3-4B model card
FAQ
1. Is MSA better than RAG?
Not universally. It excels in static, large-scale knowledge bases.
2. Can MSA replace current systems?
Not yet. It has scalability and implementation challenges.
3. What is the biggest advantage?
Retrieval in the model’s native representation space.
4. What is the biggest limitation?
Unproven scalability for large models.
5. What should developers do now?
Continue using RAG, but track developments in internal memory architectures like MSA.

