RAG Retrieves Text. MSA Retrieves Thought: Rethinking Memory in LLMs

Why Fine-Tuning, RAG, and Latent Compression Fall Short—and What Memory Sparse Attention Changes

Apr 12, 2026

By Roopak Nijhara


Summary

Giving LLMs memory has always been a trade-off between scale, precision, and efficiency. Fine-tuning, RAG, and latent compression each solve part of the problem—but none solve it completely. Memory Sparse Attention (MSA) introduces a new approach by retrieving internal model representations instead of text, potentially redefining how AI systems remember and reason.

Primary source: Memory Sparse Attention: Long-Context LLM Memory via Sparse Latent Retrieval


Table of Contents

  1. Introduction
  2. The Core Problem: Memory in LLMs
  3. Approach #1: Fine-Tuning the Weights
  4. Approach #2: Retrieval-Augmented Generation (RAG)
  5. Approach #3: Latent State Compression
  6. Why All Three Approaches Break
  7. Enter MSA: Memory Sparse Attention
  8. How MSA Actually Works
  9. The Key Innovation: Retrieval in Thought Space
  10. Scaling Trick: Document-wise RoPE
  11. Performance Results and Benchmarks
  12. The Hybrid Insight: Latent + Text
  13. Limitations and Open Questions
  14. What This Means for AI Systems
  15. Conclusion
  16. References
  17. FAQ

Introduction

There is a quiet assumption behind most modern AI systems:

Large Language Models can “remember.”

But that assumption breaks the moment you push beyond short contexts.

In reality, memory in LLMs is still an unsolved systems problem.

And today, we only have three ways to approximate it.

All three are flawed.


The Core Problem: Memory in LLMs

LLMs are not databases.

They do not “store” knowledge in the way traditional systems do.

Instead, they:

  • Encode patterns into weights
  • Use context windows for short-term memory
  • Rely on external systems for long-term recall

The challenge is:

How do you give an LLM scalable, accurate, and efficient memory?

So far, the industry has converged on three approaches.


Approach #1: Fine-Tuning the Weights

This is the most direct method.

How It Works

  • Train the model on new data
  • Embed knowledge directly into weights

Advantages

  • High precision
  • Native integration into model

Problems

  • Fixed capacity
  • Expensive retraining
  • Catastrophic forgetting

When you add new knowledge, you risk overwriting old knowledge.


Approach #2: Retrieval-Augmented Generation (RAG)

The current industry standard.

How It Works

  • Store documents externally
  • Retrieve relevant chunks
  • Inject into prompt

Advantages

  • Scales to any dataset
  • Easy to update
  • Modular

Problems

  • Works in text space, not model reasoning space
  • Retrieval quality bottleneck
  • Structural ceiling

Even with reranking and embeddings, there is a mismatch:

The model thinks in latent space, but retrieves in text space.
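The store → retrieve → inject loop above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: real systems use learned embeddings and a vector index, while here a bag-of-words cosine similarity stands in so the example is self-contained. All names (`docs`, `retrieve`, `build_prompt`) are illustrative.

```python
import math
import re
from collections import Counter

# Toy document store. A real system would hold chunked documents
# in a vector database.
docs = [
    "The Eiffel Tower is in Paris.",
    "Photosynthesis converts light into chemical energy.",
    "Transformers use attention to mix token information.",
]

def embed(text: str) -> Counter:
    # Stand-in for a learned embedding: bag-of-words counts.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    # Score every document against the query, keep the top-k chunks.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, k: int = 1) -> str:
    # Inject retrieved chunks into the prompt as plain text.
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("Where is the Eiffel Tower?"))
```

Note that the retrieved chunk re-enters the model as plain text, which is exactly the mismatch described above: the generator must re-encode it from scratch.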


Approach #3: Latent State Compression

A more experimental approach.

How It Works

  • Compress memory into hidden states
  • Pass through model

Advantages

  • Efficient
  • Compact

Problems

  • Fixed-size memory
  • Information loss
  • Degrades at scale

Example:

RWKV drops from 100% to 53% accuracy at 1M tokens in needle-in-a-haystack tests (source).
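The failure mode behind numbers like this can be illustrated with a toy fixed-size latent memory: an exponential moving average over token vectors. The state never grows, so every update dilutes older content. All constants here are illustrative, and this is a cartoon of the capacity problem, not RWKV's actual recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_tokens, decay = 256, 1000, 0.9
tokens = rng.standard_normal((n_tokens, dim))

# Fixed-size state update: new information overwrites old.
state = np.zeros(dim)
for t in tokens:
    state = decay * state + (1 - decay) * t

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

early = cos(state, tokens[0])    # "needle" placed at the start
recent = cos(state, tokens[-1])  # information just written
print(f"early={early:.3f} recent={recent:.3f}")
```

The similarity to the earliest token decays toward noise while the most recent token remains clearly represented: fixed capacity forces the trade-off.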


Why All Three Approaches Break

Each method optimizes for one dimension:

Approach        Strength      Weakness
Fine-tuning     Precision     Forgetting
RAG             Scalability   Representation mismatch
Latent states   Efficiency    Capacity limits

But none solves all three at once:

Scalable, accurate, native memory.


Enter MSA: Memory Sparse Attention

A fourth approach emerges:

Memory Sparse Attention (MSA)

Developed by Evermind, MSA rethinks memory at the architectural level (paper).


How MSA Actually Works

Instead of retrieving text:

  • It retrieves internal representations

Key Components

  • Router Projectors
  • KV Cache Signatures
  • Top-K Selection

Process

  1. Model encodes documents into KV cache
  2. Router scores documents based on relevance
  3. Selects top-K documents
  4. Feeds them directly into attention
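The four steps above can be sketched as follows. The projector shapes and mean-pooling are assumptions for illustration, not the paper's exact design: real MSA trains its router projectors jointly with the model, whereas random weights here just show the data flow from KV signatures to top-K selection.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sig, n_docs, top_k = 64, 16, 8, 2

# 1. Offline: each document is pre-encoded into a KV cache block.
doc_kv = [rng.standard_normal((int(rng.integers(5, 20)), d_model))
          for _ in range(n_docs)]

# Compact per-document signatures (mean-pooled keys through a projector).
W_sig = rng.standard_normal((d_model, d_sig)) / np.sqrt(d_model)
signatures = np.stack([kv.mean(axis=0) @ W_sig for kv in doc_kv])

# 2-3. Online: a router projector scores every document against the
# query's hidden state and keeps the top-K.
W_router = rng.standard_normal((d_model, d_sig)) / np.sqrt(d_model)

def select_documents(hidden_state, k=top_k):
    scores = signatures @ (hidden_state @ W_router)
    top = np.argsort(scores)[::-1][:k]
    return top, [doc_kv[i] for i in top]

# 4. The selected KV blocks are concatenated into the attention context,
# so the model attends over stored latents instead of re-reading text.
hidden = rng.standard_normal(d_model)
idx, blocks = select_documents(hidden)
memory_kv = np.concatenate(blocks, axis=0)
print(idx, memory_kv.shape)
```

The key point the sketch makes concrete: nothing here is ever converted back to text. Selection and attention both operate on the same cached representations.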

The Key Innovation: Retrieval in Thought Space

This is the breakthrough.

RAG retrieves text. MSA retrieves thought.


Why This Matters

  • No embedding mismatch
  • No text reconstruction gap
  • Same representation as model reasoning

Retrieval and generation now:

  • Share the same forward pass
  • Use the same loss function
  • Operate in the same space

Scaling Trick: Document-wise RoPE

Scaling is where most systems fail.

MSA solves this elegantly.


The Problem

Position embeddings break down at context lengths far beyond those seen in training.


The Solution

  • Each document starts at position 0
  • Independent positional encoding

Result

  • Train on 64K tokens
  • Infer on 100M tokens

Without out-of-distribution issues.
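The reset can be sketched directly. Each document's position ids restart at 0, so positions never exceed the range seen in training no matter how many documents the memory bank holds. The rotation schedule below is standard RoPE; the per-document reset is the scaling trick described above, and the function names are illustrative.

```python
import numpy as np

def document_positions(doc_lengths):
    # Global concatenation would use positions 0..sum(lengths)-1 and
    # eventually exceed the trained range; restarting per document keeps
    # every position <= max(doc_lengths) - 1.
    return np.concatenate([np.arange(n) for n in doc_lengths])

def rope_angles(positions, dim, base=10000.0):
    # Standard RoPE frequency schedule applied to the chosen positions.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)  # (total_tokens, dim // 2)

lengths = [5, 3, 4]
pos = document_positions(lengths)
print(pos.tolist())  # [0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2, 3]
angles = rope_angles(pos, dim=8)
```

Because the largest position is bounded by the longest single document, a model trained on 64K-token documents never sees an out-of-distribution position, even with 100M tokens in memory.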


Performance Results and Benchmarks

On a 4B model (Qwen3-4B):


Key Results

  • <9% degradation from 16K → 100M tokens
  • 94.84% accuracy at 1M tokens
  • Backbone baseline collapses to 25%

Comparison

  • Beats RAG + 235B generators on multiple QA benchmarks
  • 16% average improvement over standard RAG

Results reference: MSA benchmarks


The Hybrid Insight: Latent + Text

The most important finding is not the architecture.

It is the hybrid approach.


What Ablation Shows

  • Latent retrieval finds relevant documents
  • Raw text generates final answers

Insight

MSA is not replacing text retrieval. It is refining it.


Limitations and Open Questions

This is not a solved problem yet.


Key Concerns

  • Tested on single backbone (Qwen3-4B)
  • Unknown scaling to 70B+ models
  • Static memory bank (no real-time updates)
  • No confidence intervals

Practical Constraints

  • Memory parallelism depends on small models
  • Offline encoding required

What This Means for AI Systems

MSA signals a shift.

From:

External pipelines

To:

Internalized memory


Implications

  • Better retrieval quality
  • Reduced system complexity
  • Closer integration of memory and reasoning

But RAG Still Wins In

  • Dynamic updates
  • Interpretability
  • Cost efficiency

The Bigger Direction

The real trend is clear:

Retrieval is moving inside the model.

MSA is not the final answer.

But it is a directional shift.


Conclusion

We are still early in solving memory for LLMs.

The current approaches:

  • Fine-tuning
  • RAG
  • Latent compression

Each solves part of the problem.

MSA introduces something new:

Memory aligned with how the model actually thinks.

It closes a structural gap that pipelines cannot.

And that makes it worth paying attention to.


References

  • Memory Sparse Attention: Long-Context LLM Memory via Sparse Latent Retrieval (Evermind)

FAQ

1. Is MSA better than RAG?

Not universally. It excels in static, large-scale knowledge bases.


2. Can MSA replace current systems?

Not yet. It has scalability and implementation challenges.


3. What is the biggest advantage?

Retrieval in the model’s native representation space.


4. What is the biggest limitation?

Unproven scalability for large models.


5. What should developers do now?

Continue using RAG, but track developments in internal memory architectures like MSA.
