Tired of waiting for your AI to search for answers? What if your AI already knew everything it needed without searching? Meet CAG: the next evolution beyond RAG that’s faster, simpler, and sometimes even smarter.
Have you ever asked ChatGPT a question and watched that annoying “thinking…” animation for what feels like forever? Or worse—built a RAG system that sometimes grabs the wrong documents entirely?
You’re not alone. And there’s a better way.
Today, I’m diving into Cache-Augmented Generation (CAG)—the approach that’s making some developers ditch retrieval altogether. It’s like giving your AI a photographic memory instead of sending it to the library every time you ask a question.
Let’s explore why this matters and how it could transform your AI projects.
The Problem with RAG: The Hidden Tax on Your AI
RAG (Retrieval-Augmented Generation) was revolutionary when it arrived. Finally, a way to give AI access to knowledge beyond its training data! But if you’ve implemented RAG systems, you’ve probably noticed some frustrations:
The Waiting Game
“Please wait while I search for relevant information…”
Every RAG system has this hidden tax: retrieval latency. Your users ask a question, then wait while your system:
- Processes their query
- Searches through vector databases
- Ranks the results
- Retrieves the documents
- Finally starts generating an answer
In testing with the HotpotQA dataset, researchers found RAG systems taking up to 94 seconds to generate answers for complex queries. That’s an eternity in user experience terms!
When Retrieval Goes Wrong
Ever seen this happen?
User: “What’s our refund policy for damaged items?”
AI: “According to your shipping policy, we use FedEx and UPS for most deliveries…”
That’s retrieval failure—when your system grabs the wrong documents. And it happens more often than we’d like to admit. The research shows that even with dense retrieval methods, accuracy plateaus around 75-80% for complex queries.
Each retrieval error cascades into wrong answers, frustrated users, and lost trust.
The Complexity Monster
A typical RAG system requires:
- Embedding models
- Vector databases
- Retrieval pipelines
- Ranking algorithms
- Prompt engineering
- Document chunking strategies
Each component needs maintenance, optimization, and debugging. When something breaks (and it will), you’ll spend hours figuring out if it’s your embeddings, chunking strategy, or retrieval parameters.
As one frustrated developer told me: “Half my debugging time is spent fixing retrieval issues, not actual AI problems.”
Enter CAG: Cache-Augmented Generation
What if your AI already knew everything it needed without searching?
That’s the insight behind CAG (Cache-Augmented Generation), a paradigm shift introduced by researchers at National Chengchi University and Academia Sinica that’s gaining traction as context windows expand.
What is CAG in Plain English?
CAG preloads all relevant knowledge into the AI’s memory and keeps it there, ready for instant access.

Figure (from the CAG paper): the upper section illustrates the RAG pipeline, with real-time retrieval and reference text fed in during inference; the lower section depicts the CAG approach, which preloads the KV cache, eliminating both the retrieval step and the reference text input at inference.
Source: https://arxiv.org/pdf/2412.15605v1
Think of it like this:
- RAG is like a librarian who runs to find books every time you ask a question
- CAG is like a librarian who already has all the relevant books on their desk, open to the right pages
The difference? Speed, simplicity, and sometimes even better answers.
The Core Insight: Why Search When You Can Remember?
Modern LLMs like Llama 3.1, Claude, and GPT-4 can handle massive context windows (up to 128K tokens in some cases). That’s roughly 300+ pages of text!
CAG leverages this capability by:
- Loading your entire knowledge base into the model’s context window
- Precomputing and saving the model’s internal state (the KV cache)
- Using this cached state to answer questions instantly
It’s like giving your AI a perfect memory of your documents instead of making it search through them every time.
How CAG Works: The 3-Step Process
1. Preload (One-Time Setup)
- Take all your knowledge documents (manuals, guides, policies)
- Format them into a single, well-structured context
- Feed this context to a long-context LLM
2. Cache (The Magic Part)
- As the LLM processes this knowledge, it creates a “Key-Value (KV) cache”
- This cache is essentially the LLM’s working memory of all the information
- Save this cache to disk or keep it in memory
3. Generate (The Fast Part)
- When a user asks a question, load the cached knowledge state
- The LLM instantly accesses the relevant information from its memory
- Generate an answer without any retrieval step
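To make the three steps concrete, here is a minimal sketch of the idea using the Hugging Face transformers API. It is not the paper's reference implementation: the model choice, prompt format, document paths, and the per-question cache copy are assumptions of mine.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed; any long-context causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# 1. Preload: concatenate the knowledge base into one well-structured context.
docs = [open(p).read() for p in ["manual.txt", "faq.txt"]]  # hypothetical files
prefix = "Answer questions using only these documents:\n\n" + "\n\n".join(docs)
prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids.to(model.device)

# 2. Cache: a single forward pass over the knowledge produces the KV cache.
with torch.no_grad():
    knowledge_cache = model(prefix_ids, use_cache=True).past_key_values

# 3. Generate: reuse the cached state for every question, with no retrieval step.
def answer(question: str, max_new_tokens: int = 100) -> str:
    q_ids = tokenizer(
        f"\n\nQuestion: {question}\nAnswer:", return_tensors="pt"
    ).input_ids.to(model.device)
    input_ids = torch.cat([prefix_ids, q_ids], dim=-1)
    # Copy the cache so each question starts from the clean, knowledge-only state;
    # production code would truncate the cache in place instead of deep-copying.
    out = model.generate(
        input_ids,
        past_key_values=copy.deepcopy(knowledge_cache),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True)

print(answer("What is the refund policy for damaged items?"))
```

Persisting the cache to disk (which is what the CAG repo's --kvcache file option does, as we'll see in the coding section) means the one-time preload cost isn't paid on every restart.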
The result? Responses that are:
- Faster: No retrieval latency (up to 40x quicker in some tests)
- More consistent: No retrieval errors or ranking problems
- Simpler: One unified system instead of multiple components
CAG vs. RAG: The Showdown
When I first implemented CAG for a client’s customer support system, the results were so dramatic that the product manager asked, “Did you just upgrade our servers?”
Nope—just ditched retrieval entirely.
Let’s break down exactly how CAG compares to RAG across the metrics that matter most:
Speed: From “Please Wait” to “Here You Go”
The research doesn’t lie. In experiments with the HotpotQA dataset:
| Dataset Size | RAG Response Time | CAG Response Time | Speedup |
|---|---|---|---|
| Small | 9.25 seconds | 0.85 seconds | 10.9x |
| Medium | 28.82 seconds | 1.66 seconds | 17.4x |
| Large | 94.35 seconds | 2.33 seconds | 40.5x |
That’s not a typo—CAG was 40x faster with large datasets!
Why such a dramatic difference? With CAG, you’re skipping:
- Query processing time
- Vector search operations
- Document retrieval
- Context assembly
Instead, you’re jumping straight to generation with all the knowledge already loaded. It’s like the difference between looking up a fact in a book versus already knowing it by heart.
Accuracy: When Retrieval Fails, CAG Prevails
Here’s where things get interesting. The researchers measured answer quality using BERTScore (higher is better):
| Dataset | RAG (Best Result) | CAG | Improvement |
|---|---|---|---|
| HotpotQA Small | 0.7516 | 0.7759 | +3.2% |
| SQuAD Small | 0.8191 | 0.8265 | +0.9% |
| HotpotQA Large | 0.7495 | 0.7527 | +0.4% |
CAG consistently outperformed even the best RAG configurations. Why?
- No Retrieval Errors: CAG eliminates the possibility of retrieving the wrong documents
- Holistic Understanding: The model sees all information at once, enabling better cross-referencing
- Consistent Context: No variation in what information is available for each query
This is especially important for complex questions that require synthesizing information from multiple sources—exactly where traditional RAG often struggles.
Simplicity: Fewer Moving Parts, Fewer Headaches
Let’s compare the system architectures:
RAG System Components:
- Document processing pipeline
- Embedding model
- Vector database
- Retrieval mechanism
- Ranking algorithm
- Prompt engineering system
- LLM for generation
CAG System Components:
- Document processing pipeline
- KV cache management
- LLM for generation
That’s it! With CAG, you’re eliminating the most complex and error-prone components of a RAG system: the embedding model, the vector database, and the retrieval and ranking machinery.
The result? Less code, fewer dependencies, simpler debugging, and easier maintenance.
When CAG Wins (And When It Doesn’t)
CAG isn’t always the right choice. Here’s my rule of thumb after implementing both approaches:
Use CAG when:
- Your knowledge base is relatively stable (doesn’t change hourly)
- Total knowledge size fits within your LLM’s context window (8K-128K tokens)
- Response speed is critical to user experience
- You need high consistency in answers
- Your system needs to work offline or with minimal infrastructure
Stick with RAG when:
- Your knowledge base is massive (millions of documents)
- Information changes constantly (like news or stock prices)
- You need to search across diverse, unstructured sources
- Your use case requires explicit citation of sources
- You need to scale beyond a single LLM’s context window
The beauty is that you don’t have to choose permanently—many systems can start with CAG for core knowledge and add RAG capabilities as they scale.
Real-World Applications: Where CAG Shines
So where can you actually use CAG today? Let’s explore five practical applications where ditching retrieval makes perfect sense.
Customer Support with Fixed Knowledge
First, consider customer service chatbots. Most companies have support documentation that rarely changes—product manuals, troubleshooting guides, and FAQs.
Instead of making customers wait while your bot searches through these documents, CAG preloads everything. As a result, response times drop from seconds to milliseconds. Plus, since the bot has access to the entire knowledge base at once, it can connect information across different sections more effectively.
Legal and Compliance Assistants
Next, legal teams often work with fixed sets of documents—contracts, regulations, or company policies. These documents might be updated quarterly or annually, but rarely change day-to-day.
For example, a compliance officer might ask: “What are our obligations regarding customer data in California?”
With CAG, the assistant instantly accesses all relevant privacy policies and regulations without searching. Because the entire regulatory framework is preloaded, the answers are more comprehensive and contextually aware.
Educational Tools with Defined Curricula
Similarly, educational applications benefit tremendously from CAG. Textbooks, course materials, and reference guides can all be preloaded.
A student studying biology might ask dozens of questions about cell structure in a single session. Rather than retrieving documents for each question, CAG keeps all the textbook content in memory, ready for instant access.
This approach not only speeds up response times but also helps maintain consistency across related questions—the AI “remembers” what it told you earlier in the conversation.
Product Documentation Assistants
Furthermore, technical documentation for software or hardware products is another perfect CAG use case. The documentation might be extensive but has clear boundaries.
Developers or users can ask specific questions about APIs, features, or troubleshooting steps. Since CAG preloads the entire documentation set, it can provide more holistic answers that connect information across different sections.
Internal Knowledge Management
Finally, company wikis and internal knowledge bases often contain critical information that employees need quickly. However, this information typically changes on a predictable schedule—not minute by minute.
By implementing CAG, employees get instant answers about company policies, procedures, or project details without waiting for retrieval. Moreover, the system can be refreshed weekly or monthly when the knowledge base is updated.
In each of these cases, CAG offers three key advantages:
- Speed: Instant responses without retrieval delays
- Consistency: The same knowledge is available for every question
- Simplicity: No complex retrieval pipeline to maintain
Of course, CAG isn’t magic—it works best when your knowledge fits within your LLM’s context window. But with modern models handling 32K, 64K, or even 128K tokens, that covers a surprising number of real-world applications.
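Before committing to CAG, it is worth counting tokens to check that fit. A minimal sketch, assuming a Hugging Face tokenizer and a hypothetical knowledge/ folder of markdown files:

```python
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # assumed model
docs = [p.read_text() for p in Path("knowledge/").glob("*.md")]  # hypothetical folder

total_tokens = sum(len(tokenizer.encode(d)) for d in docs)
context_limit = 128_000          # Llama 3.1's advertised context window
budget = context_limit - 2_000   # leave headroom for the question and the answer

print(f"Knowledge base: {total_tokens:,} tokens (budget: {budget:,})")
print("CAG is feasible" if total_tokens <= budget
      else "Too big: stick with RAG or split the corpus")
```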
Coding Section: Hands-On with CAG vs. RAG
Ready to see CAG in action? Let’s roll up our sleeves and run some real-world tests. I’ve spent hours benchmarking CAG against traditional RAG, and now I’ll walk you through exactly how to replicate my experiments.
Fair warning: You’ll need some serious GPU power for this. CAG is hungry for VRAM—think of it as the price you pay for ditching retrieval.
Setting Up Your Environment
First things first, you’ll need a machine with decent GPU capabilities. For my tests, I used a Google Cloud Workstation with dual A100 40GB GPUs. While you might get away with a single T4 for smaller tests, don’t be surprised if you hit memory limits quickly.
Step 1: Check Your GPU Setup
Before diving in, make sure your GPU is properly configured:
nvidia-smi
You should see your GPU(s) listed with their memory capacity. If not, you’ll need to troubleshoot your CUDA drivers before proceeding.
Step 2: Install Basic Dependencies
Next, let’s get the necessary packages installed:
```bash
sudo apt-get update
sudo apt-get install -y git python3 python3-pip
```
If you prefer working with Docker (which I often do for cleaner environments), add:
```bash
sudo apt-get install -y docker.io
sudo usermod -aG docker $USER
```
Remember to log out and back in after adding yourself to the Docker group!
Step 3: Clone the CAG Repository
Now, let’s grab the code from the original researchers:
```bash
git clone https://github.com/hhhuang/CAG.git
cd CAG
```
Step 4: Set Up a Virtual Environment
Trust me on this one—virtual environments will save you countless headaches:
```bash
python3 -m venv venv
source venv/bin/activate
```
Step 5: Install Python Requirements
Now for the dependencies:
```bash
pip install --upgrade pip
pip install -r requirements.txt
```
If you run into PyTorch CUDA compatibility issues (which happens more often than not), try:
pip install torch --extra-index-url https://download.pytorch.org/whl/cu118
You might need to adjust the CUDA version (cu118) to match your system.
Step 6: Configure API Keys and Download Data
Create your environment file:
cp .env.template .env
Then edit it to add your API keys:
```
HF_TOKEN="hf_abc123..."       # for Hugging Face
OPENAI_API_KEY="sk-..."       # for OpenAI (if you plan to use their models)
```
Download the test datasets:
sh ./downloads.sh
This fetches both SQuAD and HotpotQA datasets that we’ll use for benchmarking.
Finally, verify that PyTorch can see your GPU:
python3 -c "import torch; print(torch.cuda.is_available())"
You should get True as the output. If not, your PyTorch installation isn’t configured correctly for GPU usage.
Running the Experiments
Now for the fun part—putting CAG head-to-head against RAG. We’ll run both on the same dataset to ensure a fair comparison.
Testing RAG Performance
First, let’s establish our baseline with a traditional RAG approach:
```bash
time python ./rag.py \
  --index "bm25" \
  --dataset "squad-train" \
  --similarity "bertscore" \
  --maxKnowledge 5 \
  --maxParagraph 100 \
  --maxQuestion 100 \
  --topk 3 \
  --modelname "meta-llama/Llama-3.1-8B-Instruct" \
  --randomSeed 0 \
  --output "./rag_results_squad.txt"
```
Let’s break down what’s happening here:
- We’re using BM25 for retrieval (a proven sparse retrieval method)
- Testing on the SQuAD dataset
- Limiting to 5 knowledge documents and 100 questions to keep things manageable
- Retrieving the top 3 most relevant passages for each question
- Using Llama 3.1 8B as our model
- Measuring similarity with BERTScore
The time command at the beginning will tell us exactly how long the entire process takes.
Testing CAG Performance
Now, let’s run the same test with CAG:
```bash
time python ./kvcache.py \
  --kvcache file \
  --dataset "squad-train" \
  --similarity "bertscore" \
  --maxKnowledge 5 \
  --maxParagraph 100 \
  --maxQuestion 100 \
  --modelname "meta-llama/Llama-3.1-8B-Instruct" \
  --randomSeed 0 \
  --output "./cag_results_squad.txt"
```
The key difference here is --kvcache file, which tells the system to precompute and store the key-value cache—the heart of the CAG approach.
What I Discovered
After running multiple scenarios with varying dataset sizes, I found some surprising results. While the research paper showed CAG consistently outperforming RAG, my real-world tests revealed a more nuanced picture.
In my small dataset test (maxKnowledge=5, ~100 queries):
- RAG processed queries in about 0.576 seconds each with a similarity score of 0.848
- CAG took about 40 seconds upfront to preload documents, achieved a slightly better similarity of 0.875, but averaged 0.744 seconds per query
When scaling up (maxKnowledge=10, 200 queries):
- RAG maintained quick 0.539 second responses with 0.8207 similarity
- CAG had a much larger upfront cost (~90 seconds) with queries averaging 1.416 seconds, though accuracy improved to 0.8339
The real eye-opener came when pushing to maxKnowledge=500:
- RAG completed the test with 0.785 similarity at 0.557 seconds per query
- CAG crashed with an out-of-memory error, demanding over 44GB of VRAM!
This highlights a critical limitation: CAG’s memory requirements can be astronomical for large knowledge bases. While it might offer slight accuracy improvements, the GPU costs can be prohibitive unless you’re running thousands of repeated queries against the same knowledge base.
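To see why the memory bill grows so fast, here is a rough back-of-envelope estimate of KV-cache size. The per-token figure assumes Llama 3.1 8B's published configuration (32 layers, 8 KV heads, head dimension 128) stored in fp16; treat the numbers as estimates, since real runs also hold the model weights, activations, and framework overhead.

```python
# Rough KV-cache memory estimate (assumptions: Llama 3.1 8B config, fp16, no quantization)
n_layers, n_kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2

# Both keys and values are cached, for every layer and every KV head.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
print(f"{bytes_per_token / 1024:.0f} KiB per token")   # ~128 KiB

for context_tokens in (8_000, 32_000, 128_000):
    gib = bytes_per_token * context_tokens / 1024**3
    print(f"{context_tokens:>7,} tokens -> {gib:5.1f} GiB of KV cache")
# roughly 1 GiB at 8K, 4 GiB at 32K, and 16 GiB at 128K tokens,
# on top of the ~16 GB of fp16 model weights
```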
Experiment Video:
Conclusion: Finding the Right Tool for Your Knowledge System
After testing CAG against traditional RAG systems, I’ve come to a nuanced conclusion: neither approach is universally superior. Instead, each shines in specific scenarios.
Despite the impressive speed numbers in the research papers, my hands-on testing revealed that CAG comes with significant trade-offs. Yes, it can deliver faster responses once everything is loaded—but that initial loading process is extremely memory-intensive. In fact, when I tried scaling to larger knowledge bases, my dual A100 GPUs with 40GB each couldn’t handle the load.
Meanwhile, RAG proved surprisingly resilient. It maintained consistent performance across different knowledge base sizes and actually outperformed CAG in many of my real-world tests, especially when considering the total processing time including setup.
So where does this leave us? Here’s my practical advice:
Consider CAG when:
- Your knowledge base is small and stable
- You have access to powerful GPUs with plenty of VRAM
- Users frequently ask similar questions about the same content
- You need the absolute fastest response times after initial setup
- Accuracy improvements of 1-3% would significantly impact your application
Stick with RAG when:
- Your knowledge base is large or frequently changing
- You’re working with limited GPU resources
- Questions vary widely across your knowledge domain
- You need to scale beyond a single model’s context window
- You want a more cost-effective solution with reasonable performance
The most exciting possibility? Hybrid approaches. Imagine preloading your most frequently accessed information with CAG while maintaining a RAG system for less common queries. This could give you the best of both worlds—lightning-fast responses for common questions and flexible retrieval for everything else.
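As a rough sketch of what that routing could look like (the topic check and the two answer functions below are hypothetical stand-ins, not an existing API):

```python
# Hypothetical hybrid router: serve hot topics from a preloaded CAG cache,
# fall back to RAG retrieval for everything else.
CACHED_TOPICS = {"refunds", "shipping", "warranty"}  # assumed high-frequency areas

def cag_answer(question: str) -> str:
    return f"[CAG] answer to: {question}"   # stand-in for the KV-cache path

def rag_answer(question: str) -> str:
    return f"[RAG] answer to: {question}"   # stand-in for the retrieval path

def classify_topic(question: str) -> str:
    # Placeholder: a real system might use a lightweight classifier or keyword match.
    return next((t for t in CACHED_TOPICS if t in question.lower()), "other")

def answer(question: str) -> str:
    if classify_topic(question) in CACHED_TOPICS:
        return cag_answer(question)   # instant: the cache already holds the core docs
    return rag_answer(question)       # flexible: retrieve from the long tail

print(answer("How do refunds work for damaged items?"))
```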
As context windows continue to expand and GPU memory becomes more affordable, CAG will likely become increasingly viable. But for now, I recommend starting with RAG and experimenting with CAG only after you’ve identified specific use cases where its benefits clearly outweigh its costs.
The future of AI knowledge systems isn’t about choosing between retrieval or no retrieval—it’s about knowing when to use each approach for maximum impact.
Further Reading
Want to dive deeper into CAG, RAG, and the future of AI knowledge systems? Here are some resources I’ve found invaluable:
- The Original CAG Research Paper – Explore the theoretical foundations and initial benchmarks
- GitHub Repo for CAG – Use this to experiment with CAG vs RAG
- LlamaIndex Documentation – The go-to resource for implementing production-grade RAG systems
- Hugging Face KV Cache Guide – Learn the technical details of key-value caching in transformer models
- Context Window Expansion in Modern LLMs – Understand how expanding context windows are changing the game
- TurboRAG: Accelerating RAG with Precomputed KV Caches – A hybrid approach that combines elements of both CAG and RAG
Remember, the field is evolving rapidly. What seems cutting-edge today might be standard practice tomorrow. The key is to stay curious, keep experimenting, and always measure real-world performance rather than relying solely on theoretical advantages.