RAG Retrieves Text. MSA Retrieves Thought: Rethinking Memory in LLMs
Why Fine-Tuning, RAG, and Latent Compression Fall Short—and What Memory Sparse Attention Changes
Summary
Giving LLMs memory has always been a trade-off between scale, precision, and efficiency. Fine-tuning, RAG, and latent compression each solve part of the problem, but none solves it completely. Memory Sparse Attention (MSA) introduces a new approach: it retrieves internal model representations instead of text, potentially redefining how AI systems remember and reason.
Primary source: Memory Sparse Attention: Long-Context LLM Memory via Sparse Latent Retrieval
Table of Contents
- Introduction
- The Core Problem: Memory in LLMs
- Approach #1: Fine-Tuning the Weights
- Approach #2: Retrieval-Augmented Generation (RAG)
- Approach #3: Latent State Compression
- Why All Three Approaches Break
- Enter MSA: Memory Sparse Attention
- How MSA Actually Works
- The Key Innovation: Retrieval in Thought Space
- Scaling Trick: Document-wise RoPE
- Performance Results and Benchmarks
- The Hybrid Insight: Latent + Text
- Limitations and Open Questions
- What This Means for AI Systems
- Conclusion
- References
- FAQ
Introduction
There is a quiet assumption behind most modern AI systems:
Large Language Models can “remember.”
But that assumption breaks the moment you push beyond short contexts.
In reality, memory in LLMs is still an unsolved systems problem.
And today, we only have three ways to approximate it.
All three are flawed.
The Core Problem: Memory in LLMs
LLMs are not databases.
They do not “store” knowledge in the way traditional systems do.
Instead, they:
- Encode patterns into weights
- Use context windows for short-term memory
- Rely on external systems for long-term recall
The challenge is:
How do you give an LLM scalable, accurate, and efficient memory?
So far, the industry has converged on three approaches.
Approach #1: Fine-Tuning the Weights
This is the most direct method.
How It Works
- Train the model on new data
- Embed knowledge directly into weights
Advantages
- High precision
- Native integration into model
Problems
- Fixed capacity
- Expensive retraining
- Catastrophic forgetting
When you add new knowledge, you risk overwriting old knowledge.
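The overwrite risk is easy to see even in a toy setting. The sketch below (plain Python, made-up data) fits a one-weight model on one task, then fine-tunes it on a conflicting task; the error on the first task climbs back up:

```python
# Toy illustration of catastrophic forgetting: a one-parameter model y = w*x,
# fine-tuned on task A (where w should be 2), then on task B (where w should
# be -1). Plain gradient descent with no replay overwrites the old knowledge.

def train(w, data, lr=0.1, steps=200):
    for _ in range(steps):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x   # gradient of the squared error
    return w

def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

task_a = [(1.0, 2.0), (2.0, 4.0)]    # consistent with w = 2
task_b = [(1.0, -1.0), (2.0, -2.0)]  # consistent with w = -1

w = train(0.0, task_a)               # fine-tune on task A
err_before = mse(w, task_a)          # task A learned: error near zero
w = train(w, task_b)                 # fine-tune further on task B
err_after = mse(w, task_a)           # task A forgotten: error is large
```

The same dynamic, scaled up to billions of parameters, is why naive continued fine-tuning degrades previously learned knowledge.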
Approach #2: Retrieval-Augmented Generation (RAG)
The current industry standard.
How It Works
- Store documents externally
- Retrieve relevant chunks
- Inject into prompt
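The three steps can be sketched end to end. The bag-of-words embedder below is a deliberately crude stand-in for the learned embeddings and vector database a real system would use:

```python
# Minimal RAG retrieval step: embed documents and query, rank by cosine
# similarity, inject the top chunk into the prompt. The toy bag-of-words
# "embedding" stands in for a real embedding model and vector store.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "rotary position embeddings encode token positions",
    "the kv cache stores attention keys and values",
    "fine tuning updates model weights on new data",
]

def retrieve(query, k=1):
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

query = "how does the kv cache work"
context = retrieve(query)[0]
prompt = f"Context: {context}\n\nQuestion: {query}"
```

Note that everything here happens in text space: the model never sees the retriever's representations, only the reconstructed prompt.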
Advantages
- Scales to any dataset
- Easy to update
- Modular
Problems
- Works in text space, not model reasoning space
- Retrieval quality bottleneck
- Structural ceiling
Even with reranking and embeddings, there is a mismatch:
The model thinks in latent space, but retrieves in text space.
Approach #3: Latent State Compression
A more experimental approach.
How It Works
- Compress memory into hidden states
- Pass through model
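The idea can be sketched with a stand-in recurrent update (an exponential moving average here; RWKV's actual state mixing is more elaborate):

```python
# Sketch of latent-state compression: a sequence of arbitrary length is
# folded into a fixed-size state vector by a recurrent update. The EMA
# below is a simplified stand-in for RWKV-style state mixing.
import numpy as np

D_STATE = 8          # fixed memory size, independent of sequence length
DECAY = 0.9          # how quickly old tokens fade from the state

def compress(token_vecs, state=None):
    state = np.zeros(D_STATE) if state is None else state
    for v in token_vecs:
        state = DECAY * state + (1 - DECAY) * v   # old info decays each step
    return state

rng = np.random.default_rng(0)
short = rng.normal(size=(10, D_STATE))
long = rng.normal(size=(10_000, D_STATE))

# The memory stays the same size no matter how long the input is...
assert compress(short).shape == compress(long).shape == (D_STATE,)
# ...which is exactly why early tokens are eventually lost at scale.
```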
Advantages
- Efficient
- Compact
Problems
- Fixed-size memory
- Information loss
- Degrades at scale
Example:
RWKV drops from 100% to 53% accuracy at 1M tokens in needle-in-a-haystack tests (source).
Why All Three Approaches Break
Each method optimizes for one dimension:
| Approach | Strength | Weakness |
|---|---|---|
| Fine-tuning | Precision | Forgetting |
| RAG | Scalability | Representation mismatch |
| Latent states | Efficiency | Capacity limits |
But none delivers all three at once:
Scalable, accurate, native memory.
Enter MSA: Memory Sparse Attention
A fourth approach emerges:
Memory Sparse Attention (MSA)
Developed by Evermind, MSA rethinks memory at the architectural level (paper).
How MSA Actually Works
Instead of retrieving text, MSA retrieves internal representations: the key-value states the model produced while encoding the documents.
Key Components
- Router Projectors
- KV Cache Signatures
- Top-K Selection
Process
- Model encodes documents into KV cache
- Router scores documents based on relevance
- Selects top-K documents
- Feeds them directly into attention
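The pipeline above can be sketched with stand-in tensors. Everything here (mean-pooled signatures, dot-product routing) is an illustrative simplification, not the paper's exact design:

```python
# Sketch of MSA-style latent retrieval. Each document is pre-encoded into a
# KV cache; a per-document "signature" (here, mean-pooled keys) lets a router
# score documents against the query state, and only the top-K documents'
# KV entries are handed to attention. Shapes and scoring are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_docs, doc_len, d_head = 16, 32, 64

# Offline: encode each document into keys/values (random stand-ins here).
keys = rng.normal(size=(n_docs, doc_len, d_head))
values = rng.normal(size=(n_docs, doc_len, d_head))
signatures = keys.mean(axis=1)                # one summary vector per document

def route_top_k(query_state, k=2):
    scores = signatures @ query_state         # router: dot-product relevance
    top = np.argsort(scores)[-k:][::-1]       # indices of the k best documents
    return top, keys[top], values[top]        # sparse KV set for attention

q = rng.normal(size=d_head)
top, k_sel, v_sel = route_top_k(q, k=2)
# Attention now runs over 2 * doc_len entries instead of n_docs * doc_len.
assert k_sel.shape == (2, doc_len, d_head)
```

The sparsity is the point: attention cost scales with K, not with the size of the memory bank.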
The Key Innovation: Retrieval in Thought Space
This is the breakthrough.
RAG retrieves text. MSA retrieves thought.
Why This Matters
- No embedding mismatch
- No text reconstruction gap
- Same representation as model reasoning
Retrieval and generation now:
- Share the same forward pass
- Use the same loss function
- Operate in the same space
Scaling Trick: Document-wise RoPE
Scaling is where most systems fail.
MSA solves this elegantly.
The Problem
Position embeddings drift out of distribution when inference contexts far exceed the training length.
The Solution
- Each document starts at position 0
- Independent positional encoding
Result
- Train on 64K tokens
- Infer on 100M tokens
Without out-of-distribution issues.
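A minimal sketch, assuming a standard rotary encoding: because positions restart at 0 for every retrieved document, the rotation angles never exceed the range seen during training, regardless of how long the combined stream is.

```python
# Document-wise RoPE sketch: each document gets positions 0..len-1, so the
# largest position ever encoded is bounded by the longest document, not by
# the total stream length.
import numpy as np

def rope(x, positions, theta=10000.0):
    # Standard rotary embedding: rotate dimension pairs by position-scaled angles.
    d = x.shape[-1]
    freqs = 1.0 / theta ** (np.arange(0, d, 2) / d)
    ang = positions[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def doc_positions(doc_lens):
    # Positions restart at 0 for every document (document-wise RoPE).
    return np.concatenate([np.arange(n) for n in doc_lens])

# Three retrieved documents, concatenated into one attention stream:
pos = doc_positions([5, 7, 3])
assert list(pos[:5]) == [0, 1, 2, 3, 4]   # first document
assert pos.max() == 6                     # bounded by the longest document

rng = np.random.default_rng(0)
doc = rng.normal(size=(5, 8))
encoded = rope(doc, pos[:5])              # identical wherever the doc appears
```

A model trained on 64K-token positions therefore never sees a position index outside that range at inference, even over a 100M-token memory bank.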
Performance Results and Benchmarks
On a 4B model (Qwen3-4B):
Key Results
- <9% degradation from 16K → 100M tokens
- 94.84% accuracy at 1M tokens
- Backbone baseline collapses to 25%
Comparison
- Beats RAG + 235B generators on multiple QA benchmarks
- 16% average improvement over standard RAG
Results reference: MSA benchmarks
The Hybrid Insight: Latent + Text
The most important finding is not the architecture.
It is the hybrid approach.
What Ablation Shows
- Latent retrieval finds relevant documents
- Raw text generates final answers
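A compressed sketch of that recipe, with hand-picked toy latents standing in for the model's real hidden states:

```python
# Hybrid retrieval sketch: rank documents by similarity in latent space,
# but hand the *raw text* of the winner to the generator. The latent
# vectors here are hand-crafted stand-ins for real hidden states.
import numpy as np

docs = [
    "doc about rotary embeddings",
    "doc about kv caches",
    "doc about routers",
]
latents = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.7, 0.7]])          # one latent signature per document
query_latent = np.array([0.1, 0.9])       # closest to document 1

scores = latents @ query_latent           # step 1: latent retrieval scores
best = int(np.argmax(scores))             # finds the relevant document
prompt = f"Context: {docs[best]}\n\nAnswer the question."  # step 2: raw text
```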
Insight
MSA is not replacing text retrieval. It is refining it.
Limitations and Open Questions
This is not a solved problem yet.
Key Concerns
- Tested on single backbone (Qwen3-4B)
- Unknown scaling to 70B+ models
- Static memory bank (no real-time updates)
- No confidence intervals
Practical Constraints
- Memory parallelism depends on small models
- Offline encoding required
What This Means for AI Systems
MSA signals a shift.
From:
External pipelines
To:
Internalized memory
Implications
- Better retrieval quality
- Reduced system complexity
- Closer integration of memory and reasoning
But RAG Still Wins In
- Dynamic updates
- Interpretability
- Cost efficiency
The Bigger Direction
The real trend is clear:
Retrieval is moving inside the model.
MSA is not the final answer.
But it is a directional shift.
Conclusion
We are still early in solving memory for LLMs.
The current approaches:
- Fine-tuning
- RAG
- Latent compression
Each solves part of the problem.
MSA introduces something new:
Memory aligned with how the model actually thinks.
It closes a structural gap that pipelines cannot.
And that makes it worth paying attention to.
References
- Memory Sparse Attention: Long-Context LLM Memory via Sparse Latent Retrieval
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- RoFormer: Enhanced Transformer with Rotary Position Embedding
- RWKV-LM repository
- Qwen3-4B model card
FAQ
1. Is MSA better than RAG?
Not universally. It excels in static, large-scale knowledge bases.
2. Can MSA replace current systems?
Not yet. It has scalability and implementation challenges.
3. What is the biggest advantage?
Retrieval in the model’s native representation space.
4. What is the biggest limitation?
Unproven scalability for large models.
5. What should developers do now?
Continue using RAG, but track developments in internal memory architectures like MSA.

