Is CAG the Ultimate RAG Killer? Not Really!

Tired of waiting for your AI to search for answers? What if your AI already knew everything it needed without searching? Meet CAG: the next evolution beyond RAG that’s faster, simpler, and sometimes even smarter.

Have you ever asked ChatGPT a question and watched that annoying “thinking…” animation for what feels like forever? Or worse—built a RAG system that sometimes grabs the wrong documents entirely?

You’re not alone. And there’s a better way.

Today, I’m diving into Cache-Augmented Generation (CAG)—the approach that’s making some developers ditch retrieval altogether. It’s like giving your AI a photographic memory instead of sending it to the library every time you ask a question.

Let’s explore why this matters and how it could transform your AI projects.

The Problem with RAG: The Hidden Tax on Your AI

RAG (Retrieval-Augmented Generation) was revolutionary when it arrived. Finally, a way to give AI access to knowledge beyond its training data! But if you’ve implemented RAG systems, you’ve probably noticed some frustrations:

The Waiting Game

“Please wait while I search for relevant information…”

Every RAG system has this hidden tax: retrieval latency. Your users ask a question, then wait while your system:

  1. Processes their query
  2. Searches through vector databases
  3. Ranks the results
  4. Retrieves the documents
  5. Finally starts generating an answer

In testing with the HotpotQA dataset, researchers found RAG systems taking up to 94 seconds to generate answers for complex queries. That’s an eternity in user experience terms!

When Retrieval Goes Wrong

Ever seen this happen?

User: “What’s our refund policy for damaged items?”

AI: “According to your shipping policy, we use FedEx and UPS for most deliveries…”

That’s retrieval failure—when your system grabs the wrong documents. And it happens more often than we’d like to admit. The research shows that even with dense retrieval methods, accuracy plateaus around 75-80% for complex queries.

Each retrieval error cascades into wrong answers, frustrated users, and lost trust.

The Complexity Monster

A typical RAG system requires:

  • Embedding models
  • Vector databases
  • Retrieval pipelines
  • Ranking algorithms
  • Prompt engineering
  • Document chunking strategies

Each component needs maintenance, optimization, and debugging. When something breaks (and it will), you’ll spend hours figuring out if it’s your embeddings, chunking strategy, or retrieval parameters.

As one frustrated developer told me: “Half my debugging time is spent fixing retrieval issues, not actual AI problems.”

Enter CAG: Cache-Augmented Generation

What if your AI already knew everything it needed without searching?

That’s the insight behind CAG (Cache-Augmented Generation), a paradigm shift introduced by researchers at National Chengchi University and Academia Sinica that’s gaining traction as context windows expand.

What is CAG in Plain English?

CAG preloads all relevant knowledge into the AI’s memory and keeps it there, ready for instant access.

Figure (from the CAG paper): Comparison of traditional RAG and CAG workflows. The upper section illustrates the RAG pipeline, with real-time retrieval and reference text injected during inference, while the lower section depicts the CAG approach, which preloads the KV cache and eliminates both the retrieval step and the reference text input at inference.
Source: https://arxiv.org/pdf/2412.15605v1

Think of it like this:

  • RAG is like a librarian who runs to find books every time you ask a question
  • CAG is like a librarian who already has all the relevant books on their desk, open to the right pages

The difference? Speed, simplicity, and sometimes even better answers.

The Core Insight: Why Search When You Can Remember?

Modern LLMs like Llama 3.1, Claude, and GPT-4 can handle massive context windows (up to 128K tokens in some cases). That’s roughly 300+ pages of text!

CAG leverages this capability by:

  1. Loading your entire knowledge base into the model’s context window
  2. Precomputing and saving the model’s internal state (the KV cache)
  3. Using this cached state to answer questions instantly

It’s like giving your AI a perfect memory of your documents instead of making it search through them every time.
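
To make this concrete, here is a minimal sketch of the preload-and-cache idea using Hugging Face Transformers. The model name, file path, and prompt wording are my own illustrative assumptions, not code from the CAG paper or repository:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative model choice; any long-context instruction-tuned model works.
    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )

    # Step 1 - Preload: stitch the whole knowledge base into one long prompt.
    knowledge = open("knowledge/support_docs.txt").read()  # hypothetical file
    prompt = (
        "Answer the user's questions using only the documentation below.\n\n"
        + knowledge
        + "\n\n"
    )
    ctx_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

    # Step 2 - Cache: a single forward pass over the context builds the KV cache.
    with torch.no_grad():
        out = model(ctx_ids, use_cache=True)
    kv_cache = out.past_key_values  # the model's reusable "memory" of the documents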

How CAG Works: The 3-Step Process

1. Preload (One-Time Setup)

  • Take all your knowledge documents (manuals, guides, policies)
  • Format them into a single, well-structured context
  • Feed this context to a long-context LLM

2. Cache (The Magic Part)

  • As the LLM processes this knowledge, it creates a “Key-Value (KV) cache”
  • This cache is essentially the LLM’s working memory of all the information
  • Save this cache to disk or keep it in memory

3. Generate (The Fast Part)

  • When a user asks a question, load the cached knowledge state
  • The LLM instantly accesses the relevant information from its memory
  • Generate an answer without any retrieval step
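
Continuing the sketch from the previous section, step 3 might look roughly like this: the question is appended on top of the cached context, and the model decodes the answer token by token while reusing (and extending) the saved KV cache. Again, this is an illustrative sketch under the same assumptions, not the reference implementation:

    # Step 3 - Generate: reuses model, tokenizer, and kv_cache from the sketch above.
    question = "\nQ: What is the refund policy for damaged items?\nA:"
    q_ids = tokenizer(
        question, return_tensors="pt", add_special_tokens=False
    ).input_ids.to(model.device)

    with torch.no_grad():
        # Process the question tokens on top of the cached knowledge.
        out = model(q_ids, past_key_values=kv_cache, use_cache=True)
        next_tok = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        answer_ids = [next_tok]

        # Greedy decoding, one token at a time, always reusing the growing cache.
        for _ in range(128):
            out = model(next_tok, past_key_values=out.past_key_values, use_cache=True)
            next_tok = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            if next_tok.item() == tokenizer.eos_token_id:
                break
            answer_ids.append(next_tok)

    print(tokenizer.decode(torch.cat(answer_ids, dim=-1)[0], skip_special_tokens=True))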

The result? Responses that are:

  • Faster: No retrieval latency (up to 40x quicker in some tests)
  • More consistent: No retrieval errors or ranking problems
  • Simpler: One unified system instead of multiple components

CAG vs. RAG: The Showdown

When I first implemented CAG for a client’s customer support system, the results were so dramatic that the product manager asked, “Did you just upgrade our servers?”

Nope—just ditched retrieval entirely.

Let’s break down exactly how CAG compares to RAG across the metrics that matter most:

Speed: From “Please Wait” to “Here You Go”

The research doesn’t lie. In experiments with the HotpotQA dataset:

Dataset Size | RAG Response Time | CAG Response Time | Speedup
Small        | 9.25 seconds      | 0.85 seconds      | 10.9x
Medium       | 28.82 seconds     | 1.66 seconds      | 17.4x
Large        | 94.35 seconds     | 2.33 seconds      | 40.5x

That’s not a typo—CAG was 40x faster with large datasets!

Why such a dramatic difference? With CAG, you’re skipping:

  • Query processing time
  • Vector search operations
  • Document retrieval
  • Context assembly

Instead, you’re jumping straight to generation with all the knowledge already loaded. It’s like the difference between looking up a fact in a book versus already knowing it by heart.

Accuracy: When Retrieval Fails, CAG Prevails

Here’s where things get interesting. The researchers measured answer quality using BERTScore (higher is better):

Dataset        | RAG (Best Result) | CAG    | Improvement
HotpotQA Small | 0.7516            | 0.7759 | +3.2%
SQuAD Small    | 0.8191            | 0.8265 | +0.9%
HotpotQA Large | 0.7495            | 0.7527 | +0.4%

CAG consistently outperformed even the best RAG configurations. Why?

  1. No Retrieval Errors: CAG eliminates the possibility of retrieving the wrong documents
  2. Holistic Understanding: The model sees all information at once, enabling better cross-referencing
  3. Consistent Context: No variation in what information is available for each query

This is especially important for complex questions that require synthesizing information from multiple sources—exactly where traditional RAG often struggles.

Simplicity: Fewer Moving Parts, Fewer Headaches

Let’s compare the system architectures:

RAG System Components:

  • Document processing pipeline
  • Embedding model
  • Vector database
  • Retrieval mechanism
  • Ranking algorithm
  • Prompt engineering system
  • LLM for generation

CAG System Components:

  • Document processing pipeline
  • KV cache management
  • LLM for generation

That’s it! With CAG, you’re eliminating the embedding model, vector database, retrieval mechanism, and ranking algorithm—some of the most complex and error-prone components of a RAG system.

The result? Less code, fewer dependencies, simpler debugging, and easier maintenance.

When CAG Wins (And When It Doesn’t)

CAG isn’t always the right choice. Here’s my rule of thumb after implementing both approaches:

Use CAG when:

  • Your knowledge base is relatively stable (doesn’t change hourly)
  • Total knowledge size fits within your LLM’s context window (8K-128K tokens)
  • Response speed is critical to user experience
  • You need high consistency in answers
  • Your system needs to work offline or with minimal infrastructure

Stick with RAG when:

  • Your knowledge base is massive (millions of documents)
  • Information changes constantly (like news or stock prices)
  • You need to search across diverse, unstructured sources
  • Your use case requires explicit citation of sources
  • You need to scale beyond a single LLM’s context window

The beauty is that you don’t have to choose permanently—many systems can start with CAG for core knowledge and add RAG capabilities as they scale.

Real-World Applications: Where CAG Shines

So where can you actually use CAG today? Let’s explore five practical applications where ditching retrieval makes perfect sense.

Customer Support with Fixed Knowledge

First, consider customer service chatbots. Most companies have support documentation that rarely changes—product manuals, troubleshooting guides, and FAQs.

Instead of making customers wait while your bot searches through these documents, CAG preloads everything. As a result, response times drop from seconds to milliseconds. Plus, since the bot has access to the entire knowledge base at once, it can connect information across different sections more effectively.

Legal and Compliance Assistants

Next, legal teams often work with fixed sets of documents—contracts, regulations, or company policies. These documents might be updated quarterly or annually, but rarely change day-to-day.

For example, a compliance officer might ask: “What are our obligations regarding customer data in California?”

With CAG, the assistant instantly accesses all relevant privacy policies and regulations without searching. Because the entire regulatory framework is preloaded, the answers are more comprehensive and contextually aware.

Educational Tools with Defined Curricula

Similarly, educational applications benefit tremendously from CAG. Textbooks, course materials, and reference guides can all be preloaded.

A student studying biology might ask dozens of questions about cell structure in a single session. Rather than retrieving documents for each question, CAG keeps all the textbook content in memory, ready for instant access.

This approach not only speeds up response times but also helps maintain consistency across related questions—the AI “remembers” what it told you earlier in the conversation.

Product Documentation Assistants

Furthermore, technical documentation for software or hardware products is another perfect CAG use case. The documentation might be extensive but has clear boundaries.

Developers or users can ask specific questions about APIs, features, or troubleshooting steps. Since CAG preloads the entire documentation set, it can provide more holistic answers that connect information across different sections.

Internal Knowledge Management

Finally, company wikis and internal knowledge bases often contain critical information that employees need quickly. However, this information typically changes on a predictable schedule—not minute by minute.

By implementing CAG, employees get instant answers about company policies, procedures, or project details without waiting for retrieval. Moreover, the system can be refreshed weekly or monthly when the knowledge base is updated.

In each of these cases, CAG offers three key advantages:

  1. Speed: Instant responses without retrieval delays
  2. Consistency: The same knowledge is available for every question
  3. Simplicity: No complex retrieval pipeline to maintain

Of course, CAG isn’t magic—it works best when your knowledge fits within your LLM’s context window. But with modern models handling 32K, 64K, or even 128K tokens, that covers a surprising number of real-world applications.
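
A quick way to sanity-check that fit is to count tokens before committing to CAG. The sketch below uses a Hugging Face tokenizer; the file names and the 128K figure for Llama 3.1 are assumptions you should adjust to your own model and documents:

    from transformers import AutoTokenizer

    # Hypothetical document paths; swap in your own knowledge base.
    doc_paths = ["faq.md", "product_manual.md", "refund_policy.md"]
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

    total_tokens = sum(len(tokenizer.encode(open(p).read())) for p in doc_paths)
    context_window = 128_000  # Llama 3.1's advertised context length

    print(f"Knowledge base: {total_tokens} tokens")
    print("Fits in context:", total_tokens < context_window)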

Coding Section: Hands-On with CAG vs. RAG

Ready to see CAG in action? Let’s roll up our sleeves and run some real-world tests. I’ve spent hours benchmarking CAG against traditional RAG, and now I’ll walk you through exactly how to replicate my experiments.

Fair warning: You’ll need some serious GPU power for this. CAG is hungry for VRAM—think of it as the price you pay for ditching retrieval.

Setting Up Your Environment

First things first, you’ll need a machine with decent GPU capabilities. For my tests, I used a Google Cloud Workstation with dual A100 40GB GPUs. While you might get away with a single T4 for smaller tests, don’t be surprised if you hit memory limits quickly.

Step 1: Check Your GPU Setup

Before diving in, make sure your GPU is properly configured:

    nvidia-smi

You should see your GPU(s) listed with their memory capacity. If not, you’ll need to troubleshoot your CUDA drivers before proceeding.

Step 2: Install Basic Dependencies

Next, let’s get the necessary packages installed:

    sudo apt-get update
    sudo apt-get install -y git python3 python3-pip

If you prefer working with Docker (which I often do for cleaner environments), add:

    sudo apt-get install -y docker.io
    sudo usermod -aG docker $USER

Remember to log out and back in after adding yourself to the Docker group!

Step 3: Clone the CAG Repository

Now, let’s grab the code from the original researchers:

    git clone https://github.com/hhhuang/CAG.git
    cd CAG

Step 4: Set Up a Virtual Environment

Trust me on this one—virtual environments will save you countless headaches:

    python3 -m venv venv
    source venv/bin/activate

Step 5: Install Python Requirements

Now for the dependencies:

    pip install --upgrade pip
    pip install -r requirements.txt

If you run into PyTorch CUDA compatibility issues (which happens more often than not), try:

    pip install torch --extra-index-url https://download.pytorch.org/whl/cu118

You might need to adjust the CUDA version (cu118) to match your system.
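
If you are unsure which CUDA runtime your current PyTorch build targets, a quick check (assuming torch is already installed) is:

    python3 -c "import torch; print(torch.version.cuda)"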

Step 6: Configure API Keys and Download Data

Create your environment file:

    cp .env.template .env

Then edit it to add your API keys:

  • HF_TOKEN="hf_abc123..." for Hugging Face
  • OPENAI_API_KEY="sk-..." for OpenAI (if you plan to use their models)

Download the test datasets:

    sh ./downloads.sh

This fetches both SQuAD and HotpotQA datasets that we’ll use for benchmarking.

Finally, verify that PyTorch can see your GPU:

    python3 -c "import torch; print(torch.cuda.is_available())"

You should get True as the output. If not, your PyTorch installation isn’t configured correctly for GPU usage.

Running the Experiments

Now for the fun part—putting CAG head-to-head against RAG. We’ll run both on the same dataset to ensure a fair comparison.

Testing RAG Performance

First, let’s establish our baseline with a traditional RAG approach:

    time python ./rag.py \
      --index "bm25" \
      --dataset "squad-train" \
      --similarity "bertscore" \
      --maxKnowledge 5 \
      --maxParagraph 100 \
      --maxQuestion 100 \
      --topk 3 \
      --modelname "meta-llama/Llama-3.1-8B-Instruct" \
      --randomSeed 0 \
      --output "./rag_results_squad.txt"

Let’s break down what’s happening here:

  • We’re using BM25 for retrieval (a proven sparse retrieval method)
  • Testing on the SQuAD dataset
  • Limiting to 5 knowledge documents and 100 questions to keep things manageable
  • Retrieving the top 3 most relevant passages for each question
  • Using Llama 3.1 8B as our model
  • Measuring similarity with BERTScore

The time command at the beginning will tell us exactly how long the entire process takes.

Testing CAG Performance

Now, let’s run the same test with CAG:

    time python ./kvcache.py \
      --kvcache file \
      --dataset "squad-train" \
      --similarity "bertscore" \
      --maxKnowledge 5 \
      --maxParagraph 100 \
      --maxQuestion 100 \
      --modelname "meta-llama/Llama-3.1-8B-Instruct" \
      --randomSeed 0 \
      --output "./cag_results_squad.txt"

The key difference here is --kvcache file, which tells the system to precompute and store the key-value cache—the heart of the CAG approach.
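
Conceptually, persisting a KV cache just means serializing the tensors produced by the preload pass so later runs can skip it. The snippet below sketches that idea only; it is not the code inside kvcache.py, and the file name is made up:

    import torch

    # Illustrative only: after the preload forward pass, serialize the cache...
    torch.save(kv_cache, "squad_knowledge.kvcache")

    # ...and reload it later, before answering queries. On recent PyTorch versions,
    # loading arbitrary Python objects requires weights_only=False.
    kv_cache = torch.load("squad_knowledge.kvcache", weights_only=False)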

What I Discovered

After running multiple scenarios with varying dataset sizes, I found some surprising results. While the research paper showed CAG consistently outperforming RAG, my real-world tests revealed a more nuanced picture.

In my small dataset test (maxKnowledge=5, ~100 queries):

  • RAG processed queries in about 0.576 seconds each with a similarity score of 0.848
  • CAG took about 40 seconds upfront to preload documents, achieved a slightly better similarity of 0.875, but averaged 0.744 seconds per query

When scaling up (maxKnowledge=10, 200 queries):

  • RAG maintained quick 0.539 second responses with 0.8207 similarity
  • CAG had a much larger upfront cost (~90 seconds) with queries averaging 1.416 seconds, though accuracy improved to 0.8339

The real eye-opener came when pushing to maxKnowledge=500:

  • RAG completed the test with 0.785 similarity at 0.557 seconds per query
  • CAG crashed with an out-of-memory error, demanding over 44GB of VRAM!

This highlights a critical limitation: CAG’s memory requirements can be astronomical for large knowledge bases. While it might offer slight accuracy improvements, the GPU costs can be prohibitive unless you’re running thousands of repeated queries against the same knowledge base.
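
If you want to reproduce a similar sweep yourself, a small driver script along these lines automates both runs. The flag values mirror the commands above; the knowledge sizes are illustrative, and the largest one will demand a lot of VRAM on the CAG side:

    import subprocess, time

    # Sweep knowledge-base sizes and time both approaches end to end.
    for k in (5, 10, 500):
        for script, extra in (("rag.py", ["--index", "bm25", "--topk", "3"]),
                              ("kvcache.py", ["--kvcache", "file"])):
            start = time.time()
            subprocess.run(
                ["python", f"./{script}", *extra,
                 "--dataset", "squad-train", "--similarity", "bertscore",
                 "--maxKnowledge", str(k), "--maxParagraph", "100",
                 "--maxQuestion", "100",
                 "--modelname", "meta-llama/Llama-3.1-8B-Instruct",
                 "--randomSeed", "0",
                 "--output", f"./{script.split('.')[0]}_results_squad_k{k}.txt"],
                check=True,
            )
            print(f"{script} (maxKnowledge={k}): {time.time() - start:.1f}s total")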

Conclusion: Finding the Right Tool for Your Knowledge System

After testing CAG against traditional RAG systems, I’ve come to a nuanced conclusion: neither approach is universally superior. Instead, each shines in specific scenarios.

Despite the impressive speed numbers in the research papers, my hands-on testing revealed that CAG comes with significant trade-offs. Yes, it can deliver faster responses once everything is loaded—but that initial loading process is extremely memory-intensive. In fact, when I tried scaling to larger knowledge bases, my dual A100 GPUs with 40GB each couldn’t handle the load.

Meanwhile, RAG proved surprisingly resilient. It maintained consistent performance across different knowledge base sizes and actually outperformed CAG in many of my real-world tests, especially when considering the total processing time including setup.

So where does this leave us? Here’s my practical advice:

Consider CAG when:

  • Your knowledge base is small and stable
  • You have access to powerful GPUs with plenty of VRAM
  • Users frequently ask similar questions about the same content
  • You need the absolute fastest response times after initial setup
  • Accuracy improvements of 1-3% would significantly impact your application

Stick with RAG when:

  • Your knowledge base is large or frequently changing
  • You’re working with limited GPU resources
  • Questions vary widely across your knowledge domain
  • You need to scale beyond a single model’s context window
  • You want a more cost-effective solution with reasonable performance

The most exciting possibility? Hybrid approaches. Imagine preloading your most frequently accessed information with CAG while maintaining a RAG system for less common queries. This could give you the best of both worlds—lightning-fast responses for common questions and flexible retrieval for everything else.

As context windows continue to expand and GPU memory becomes more affordable, CAG will likely become increasingly viable. But for now, I recommend starting with RAG and experimenting with CAG only after you’ve identified specific use cases where its benefits clearly outweigh its costs.

The future of AI knowledge systems isn’t about choosing between retrieval or no retrieval—it’s about knowing when to use each approach for maximum impact.

Further Reading

Want to dive deeper into CAG, RAG, and the future of AI knowledge systems? Start with the original CAG paper (https://arxiv.org/pdf/2412.15605v1) and the researchers’ reference implementation (https://github.com/hhhuang/CAG).

Remember, the field is evolving rapidly. What seems cutting-edge today might be standard practice tomorrow. The key is to stay curious, keep experimenting, and always measure real-world performance rather than relying solely on theoretical advantages.
