Training a 7-billion-parameter model from scratch costs hundreds of thousands of dollars and requires a cluster of A100s. But adapting that same model to answer medical questions accurately? With LoRA and QLoRA, you can do that on a single consumer GPU in a few hours for roughly $10. This guide walks through exactly how — the math, the 2026 variant landscape, the code, the hyperparameters, and the hard-won lessons about what goes wrong.
Why Fine-Tuning Beats Prompting (Sometimes)
There are four ways to customize an LLM: prompt engineering, few-shot examples, RAG, and fine-tuning. The first three require zero model changes. They're fast to try and easy to update. So why fine-tune at all?
Three situations tip the scales toward fine-tuning:
Behavioral consistency at scale. Prompt engineering produces variable results. At a thousand requests per day, "mostly follows the format" becomes hundreds of malformed outputs. Fine-tuning bakes the behavior into the weights, so it holds on every request.
Style and tone beyond what prompting can reach. A hospital wants responses that sound exactly like their clinical documentation standards. No prompt reliably replicates a specific writing register across arbitrary topics.
Latency and cost. Long system prompts add tokens on every call. A fine-tuned model carries the "expertise" in its weights and needs no prompt scaffolding.
That said, fine-tuning is not always the answer. If your primary need is fresh external knowledge (current drug interactions, live pricing, recent clinical trials), Retrieval-Augmented Generation is almost always cheaper and more maintainable. The decision framework is in a dedicated section later.
The Problem with Full Fine-Tuning
Full fine-tuning means updating every one of the model's weights. For a 7B-parameter model stored in 16-bit floats, that's 14 GB just for the weights. Add gradients plus optimizer states (Adam keeps two moment estimates per parameter) and you're looking at around 56 GB in 16-bit precision, which pushes past the memory of a single A100 80GB GPU once you factor in activations and batch data.
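The arithmetic behind that 56 GB figure, as a quick sketch (16-bit optimizer states assumed; fp32 states would roughly double the total):

```python
params = 7e9  # 7B parameters, decimal GB throughout

weights_gb = params * 2 / 1e9   # fp16 weights: 2 bytes each -> 14 GB
grads_gb = params * 2 / 1e9     # fp16 gradients -> 14 GB
adam_gb = params * 2 * 2 / 1e9  # two Adam moment estimates, 16-bit -> 28 GB

total_gb = weights_gb + grads_gb + adam_gb
print(f"{total_gb:.0f} GB")  # 56 GB, before activations and batch data
```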
The cost scales brutally: a 70B model in full fine-tuning needs multiple nodes. For most teams, it's simply not viable.
Full fine-tuning also risks catastrophic forgetting: the model's general knowledge and instruction-following behavior can degrade when you push hard on a narrow domain. You spend resources making the model better at one thing while inadvertently making it worse at everything else.
LoRA was designed to sidestep both problems.
How LoRA Works: Low-Rank Weight Updates
LoRA (Low-Rank Adaptation), introduced by Hu et al. in 2021 (arXiv:2106.09685), rests on a simple but powerful observation: the changes needed to adapt a pretrained model to a new task have low intrinsic dimensionality. You don't need to shift all 4096 × 4096 values in an attention weight matrix. A much smaller update captures what you need.
Here's the mechanism. During fine-tuning, instead of modifying a weight matrix W0 directly, LoRA freezes W0 and adds a parallel branch with two small matrices. The effective weight becomes:

W = W0 + (alpha / r) * B * A

Where:

- W0 is the original frozen pretrained weight matrix (d × d)
- B is the up-projection matrix (d × r, initialized to zeros)
- A is the down-projection matrix (r × d, initialized from a Gaussian)
- r is the rank, a small integer (typically 8 to 64) much smaller than d
- alpha is a scaling constant that controls the magnitude of the update
- d is the weight matrix dimension (e.g., 4096 for Llama 3.1 8B attention layers)
In Plain English: Imagine your Llama 3.1 8B model knows general medicine from pretraining — it can discuss symptoms, drugs, and anatomy. To specialize it for your hospital's Q&A style, you don't rewrite its entire medical knowledge. You add two thin "adapter layers" (B and A) that nudge its outputs toward your target behavior. The original knowledge stays frozen. Only the adapters train.
Figure: LoRA weight decomposition, frozen W0 plus trainable low-rank BA matrices.
Why Low Rank Works
A 4096 × 4096 weight matrix has 16.7 million parameters. With rank r=16, the LoRA matrices A and B together have only 131,072 parameters, 128x fewer. The key insight is that adaptation tasks don't require full-rank updates. Research by Aghajanyan et al. (2020) showed that the "intrinsic dimensionality" of fine-tuning is surprisingly low, validating this design.
The initialization also matters. B starts at zeros, so at the beginning of training, the LoRA branch contributes exactly nothing to the output. Training starts from the same pretrained behavior and diverges gradually — a much more stable starting point than random initialization.
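A minimal numpy sketch of the forward pass makes both points concrete (toy dimensions, standard alpha / r scaling):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16

W0 = rng.normal(size=(d, d))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d))  # down-projection: small Gaussian
B = np.zeros((d, r))                     # up-projection: zeros at init

def lora_forward(x):
    # frozen path plus the scaled low-rank update (alpha / r) * B @ A @ x
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# With B = 0, the LoRA branch contributes exactly nothing:
assert np.allclose(lora_forward(x), W0 @ x)
```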
```
Full fine-tuning: 16,777,216 params per weight matrix
LoRA (rank 16):      131,072 params per weight matrix
Reduction:              128x fewer trainable parameters

r= 8, alpha=16 -> scale=2.00, params=65,536
r=16, alpha=16 -> scale=1.00, params=131,072
r=32, alpha=16 -> scale=0.50, params=262,144
r=64, alpha=64 -> scale=1.00, params=524,288
```
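The scale and parameter figures above follow directly from the definitions; a short script reproduces them:

```python
def lora_params(d, r):
    # one d x r matrix (B) plus one r x d matrix (A)
    return 2 * d * r

def lora_scale(alpha, r):
    # standard LoRA scaling applied to the B @ A update
    return alpha / r

d = 4096  # Llama-class attention dimension
for r, alpha in [(8, 16), (16, 16), (32, 16), (64, 64)]:
    print(f"r={r:2d}, alpha={alpha} -> "
          f"scale={lora_scale(alpha, r):.2f}, params={lora_params(d, r):,}")
```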
After training, the product of B and A can be merged back into W0 with a single matrix addition. The resulting model has the same architecture and inference speed as the base model; there's no runtime overhead.
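Merging really is a single addition. A numpy sketch (toy sizes, standard scaling) verifying that the merged weight reproduces the adapter path exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, alpha = 64, 8, 16

W0 = rng.normal(size=(d, d))
B = rng.normal(size=(d, r))  # trained up-projection
A = rng.normal(size=(r, d))  # trained down-projection

# One-time merge: fold the scaled low-rank product into the base weight
W_merged = W0 + (alpha / r) * B @ A

x = rng.normal(size=d)
adapter_out = W0 @ x + (alpha / r) * (B @ (A @ x))
assert np.allclose(W_merged @ x, adapter_out)  # identical outputs, one matmul
```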
The 2026 LoRA Variant Landscape
The original LoRA paper spawned a family of improvements. Knowing which variant to reach for saves a lot of trial and error.
rsLoRA: Fixing the Scaling Problem
Standard LoRA scales the adapter contribution by alpha / r. This works at low ranks but causes instability and stunted learning as rank increases. rsLoRA (Rank-Stabilized LoRA, arXiv:2312.03732) proves the correct scaling is alpha / sqrt(r).
The fix is mathematically elegant: dividing by sqrt(r) rather than r preserves the gradient signal at higher ranks, enabling stable training up to ranks of 512 or 2048 where standard LoRA would plateau. In PEFT, you enable it with use_rslora=True. For most tasks at rank 16 or 32 the difference is subtle, but if you're pushing to high ranks for complex domain adaptation, rsLoRA is the better default.
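A quick numeric comparison shows why the standard rule starves high ranks while the rank-stabilized rule does not:

```python
import math

alpha = 16
for r in [16, 64, 256, 1024]:
    standard = alpha / r                # shrinks linearly in r
    stabilized = alpha / math.sqrt(r)   # shrinks only as sqrt(r)
    print(f"r={r:4d}: alpha/r={standard:.4f}  alpha/sqrt(r)={stabilized:.4f}")
```

At r=1024 the standard scale has collapsed to ~0.016 while the rank-stabilized scale is still 0.5, which is the gradient-starvation effect rsLoRA fixes.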
DoRA: Decomposing Magnitude and Direction
DoRA (Weight-Decomposed Low-Rank Adaptation, arXiv:2402.09353) takes a different angle. Instead of adding a low-rank matrix directly, DoRA first decomposes the weight into its magnitude (a scalar per output dimension) and direction (a unit-norm matrix), then applies LoRA updates only to the directional component.
The result is consistent accuracy improvements over standard LoRA: +3.7% on LLaMA-7B, +1 to 4.4% on LLaMA-13B and LLaMA3-8B on commonsense reasoning benchmarks (Liu et al., 2024). Crucially, DoRA introduces no extra inference overhead — the magnitude vector and low-rank direction update merge back into the base weights like standard LoRA.
In PEFT, it's a single flag: use_dora=True.
PiSSA and LoftQ: Better Initialization
Both address a subtle weakness: when you quantize the base model to 4-bit (as in QLoRA), the quantization introduces error that standard LoRA random initialization doesn't account for.
LoftQ jointly optimizes the quantization and the initial LoRA matrices to minimize this approximation error. PiSSA (Principal Singular values and Singular vectors Adaptation) initializes A and B from the principal components of the weight matrix rather than from random noise, meaning the adapters start already encoding the most important structure in the weights.
In practice: if you're doing QLoRA on a model where quality degradation from quantization is visible, try LoftQ or PiSSA initialization before increasing rank.
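A rough numpy sketch of the PiSSA idea, initializing the adapters from the top-r singular directions of the weight (the actual method also subtracts this component from the frozen base; details here are simplified):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
W = rng.normal(size=(d, d))

# Top-r singular directions of W
U, S, Vt = np.linalg.svd(W)
B = U[:, :r] * np.sqrt(S[:r])         # d x r, carries half of each singular value
A = np.sqrt(S[:r])[:, None] * Vt[:r]  # r x d, carries the other half

# B @ A is the best rank-r approximation of W (Eckart-Young),
# so the adapters start out encoding W's dominant structure
approx_err = np.linalg.norm(W - B @ A) / np.linalg.norm(W)
print(f"relative error of rank-{r} init: {approx_err:.2f}")
```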
| Variant | Key Change | Best For | PEFT Flag |
|---|---|---|---|
| Standard LoRA | Baseline | General fine-tuning | — |
| rsLoRA | alpha / sqrt(r) scaling | High-rank experiments | use_rslora=True |
| DoRA | Magnitude + direction decomposition | Quality-critical tasks | use_dora=True |
| QLoRA | 4-bit NF4 base + fp16 adapters | Consumer GPU fine-tuning | load_in_4bit=True |
| LoftQ | Quantization-aware init | Minimizing quantization error | Custom init |
| PiSSA | SVD-based adapter init | Faster convergence | Custom init |
Pro Tip: For a new project, start with QLoRA + DoRA (load_in_4bit=True, use_dora=True). It's the current best-practice combination for quality within consumer VRAM budgets.
Figure: LoRA variant decision guide, choosing between QLoRA, DoRA, rsLoRA, and standard LoRA.
LoRA Hyperparameters in Practice
These four hyperparameters determine almost everything about how LoRA behaves.
Rank (r)
Rank controls the capacity of the adapter. A higher rank means more expressive updates at the cost of more parameters. The relationship is not linear — rank 64 is not eight times better than rank 8.
2025 research on rank selection (arXiv:2512.15634) found that intermediate ranks (32 to 64) offer the best balance between capacity and stability, representing a shift from earlier guidance. In practice, r=16 remains a reliable starting point that covers most instruction-following and style adaptation tasks. Reach for r=32 or r=64 when your task involves significant domain shift — teaching a model a new medical specialty it barely encountered during pretraining, for example.
The practical upper bound is around r=64 for standard LoRA. Above that, use rsLoRA to maintain stable gradients.
Alpha
Alpha scales the LoRA contribution by alpha / r. The convention "set alpha to twice the rank" (e.g., r=16, alpha=32) keeps the scaling factor at 2.0. Setting alpha equal to rank (scale = 1.0) is more conservative. Going much higher than 2x tends to destabilize training.
With rsLoRA enabled, the effective scaling becomes alpha / sqrt(r). The convention there is to set alpha equal to rank (giving a scale of 1.0 in the sqrt formula, which is well-behaved at all ranks).
Dropout
LoRA dropout applies between the A and B matrices for regularization. For datasets of 1K to 100K examples, 0.05 to 0.1 is safe. With very small datasets (under 500 examples), increase to 0.1 to 0.2. Don't overthink this parameter — it rarely makes or breaks a fine-tuning run.
Target Modules
This is the hyperparameter most people get wrong. LoRA can be applied to any linear layer: attention projections (Q, K, V, output), feed-forward layers, and embedding layers. The original LoRA paper targeted only Q and V attention matrices. Current best practice has moved toward applying LoRA to all linear layers.
Hugging Face PEFT lets you do this with target_modules="all-linear". Applying to all linear layers increases parameter count but consistently improves task performance, particularly for instruction-following and domain adaptation tasks.
| Module Scope | Adoption | Effect |
|---|---|---|
| q_proj, v_proj | LoRA paper default | Good baseline, fewest params |
| q_proj, k_proj, v_proj, o_proj | Common setting | Stronger attention adaptation |
| All linear layers | Current best practice | Best adaptation quality |
| Embedding layers | Rare | Needed for new token vocabularies |
Pro Tip: Start with target_modules="all-linear" and reduce only if you're hitting memory limits. The parameter overhead is modest compared to the gains.
QLoRA: Fine-Tuning at 4-Bit Precision
LoRA solves the trainable parameter problem. QLoRA, introduced by Dettmers et al. (arXiv:2305.14314), solves the memory storage problem. Even if you only train the LoRA adapters, you still need to hold the frozen base model in GPU memory. For an 8B model, that's roughly 15 GB in fp16.
QLoRA combines three techniques to shrink that footprint dramatically:
4-bit NF4 Quantization. NF4 (NormalFloat4) is a data type designed specifically for neural network weights, which follow a roughly normal distribution. It's information-theoretically optimal for this distribution — it places quantization boundaries where they minimize expected error given Gaussian-distributed values. This is different from standard int4 (which assumes uniform distribution) and produces noticeably better quality.
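A toy sketch of the principle (this is not bitsandbytes' exact NF4 codebook, just the quantile idea): levels are placed at evenly spaced quantiles of a standard normal, so the codebook is dense near zero where most weights live.

```python
import numpy as np
from statistics import NormalDist

# 16 levels at evenly spaced quantiles of a standard normal
nd = NormalDist()
qs = [(i + 0.5) / 16 for i in range(16)]
levels = np.array([nd.inv_cdf(q) for q in qs])
levels /= np.abs(levels).max()  # normalize to [-1, 1]

def quantize(w):
    scale = np.abs(w).max()  # absmax scaling (per block in real NF4)
    idx = np.abs(w[:, None] / scale - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def dequantize(idx, scale):
    return levels[idx] * scale

w = np.random.default_rng(0).normal(size=256)
idx, scale = quantize(w)  # 4 bits of index per weight, plus one constant
w_hat = dequantize(idx, scale)
```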
Double Quantization. The 4-bit quantization process requires quantization constants (one per small block of weights). Double quantization quantizes those constants too, saving an additional 0.37 bits per parameter on average.
Paged Optimizers. NVIDIA's unified memory allows optimizer states to page from GPU to CPU RAM when the GPU runs out of memory, handling the occasional spikes during gradient computation without crashing.
The net effect:
| Configuration | 8B Model Memory | Training Cost |
|---|---|---|
| Full fine-tuning (fp16) | ~60 GB | ~$40 on A100 |
| LoRA in fp16 | ~18 GB | ~$15 on A100 |
| QLoRA (4-bit NF4) | ~6 GB | ~$8 on A100, free on Colab |
In Plain English: Llama 3.1 8B has 8 billion parameters. In 16-bit floats, that's about 15 GB. Compress to 4-bit and the weights shrink to roughly 3.7 GB. Add LoRA adapters and optimizer states and you land around 6 GB. A consumer GPU with 8 GB of VRAM can handle it — and on RunPod, an RTX 4090 runs about $0.74/hr, so a full fine-tuning job costs under $5.
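Those footprint numbers follow from simple arithmetic (8.03e9 is the approximate Llama 3.1 8B parameter count):

```python
GIB = 1024**3
n = 8.03e9  # approximate Llama 3.1 8B parameter count

fp16_gib = n * 2 / GIB   # 16-bit: 2 bytes per weight
nf4_gib = n * 0.5 / GIB  # 4-bit: half a byte per weight

print(f"fp16: {fp16_gib:.1f} GiB, nf4: {nf4_gib:.1f} GiB")  # fp16: 15.0 GiB, nf4: 3.7 GiB
```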
QLoRA does have a trade-off: training takes roughly 30 to 40% longer than standard LoRA in fp16, because the forward pass must dequantize weights before computation. For most use cases, the memory savings outweigh the speed penalty.
For a deeper look at how quantization works internally, see LLM Quantization for Consumer Hardware.
Figure: Comparison of fine-tuning approaches: full fine-tuning, LoRA, QLoRA, and RAG.
Implementation with PEFT and Unsloth
Setting Up the Base Model
For our running example — fine-tuning Llama 3.1 8B as a medical Q&A assistant — here's the standard QLoRA setup using Hugging Face PEFT and bitsandbytes:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch

# 4-bit NF4 quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # double quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
```
Configuring LoRA
```python
lora_config = LoraConfig(
    r=16,                         # rank
    lora_alpha=32,                # scaling: alpha/r = 2.0
    lora_dropout=0.05,
    target_modules="all-linear",  # apply to all linear layers
    task_type=TaskType.CAUSAL_LM,
    bias="none",
    use_dora=True,                # DoRA: weight-decomposed adaptation
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5195
```
Notice: 0.52% of parameters are trainable. The other 99.48% stay frozen.
Unsloth for Faster Training
Unsloth is an open-source library with custom Triton kernels for LoRA fine-tuning. By fusing the LoRA computation directly into the forward pass and rewriting backward pass logic, Unsloth delivers 2 to 2.7x faster training with 60 to 74% less VRAM compared to standard PEFT. For MoE architectures like Qwen3-235B, the speedup reaches 12x with 35% less VRAM. Training jobs that previously took 12 hours now finish in under 2 hours.
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # 30% more memory savings
)
```
For fine-tuning on consumer or mid-range GPU hardware in 2026, Unsloth is the default choice — there's rarely a reason not to use it.
Dataset Preparation and Training
Preparing Medical Q&A Data
LoRA fine-tuning works best with instruction-formatted data. For our medical assistant, each training example should follow the chat template:
```python
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Format your data with the model's chat template
def format_example(example):
    messages = [
        {"role": "system", "content": "You are a medical Q&A assistant. Answer clinical questions accurately and concisely."},
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

# Note: medmcqa is a multiple-choice dataset; in practice, map its
# option/answer fields into plain question/answer strings first
dataset = load_dataset("medmcqa", split="train[:10000]")
dataset = dataset.map(format_example)
```
Training Configuration
```python
training_args = SFTConfig(
    output_dir="./llama-3.1-8b-medical",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size = 8
    learning_rate=2e-4,             # higher than full fine-tuning (typical: 1e-4 to 3e-4)
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    fp16=False,
    bf16=True,
    logging_steps=25,
    save_strategy="epoch",
    max_seq_length=2048,
    packing=True,  # pack short sequences to reduce padding waste
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```
Saving and Loading LoRA Adapters
After training, save only the adapter weights — not the entire model:
```python
# Save just the adapter (a few MB, not the full 8B model)
model.save_pretrained("./medical-lora-adapters")
tokenizer.save_pretrained("./medical-lora-adapters")

# Load adapter on top of base model later
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
)
model = PeftModel.from_pretrained(base_model, "./medical-lora-adapters")

# Or merge into base for zero-latency inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llama-3.1-8b-medical-merged")
```
The adapter files are typically 50 to 200 MB depending on rank and target modules. The base model download is a one-time cost shared across all adapters.
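The figure checks out against the trainable-parameter count printed earlier (41,943,040 for the r=16 all-linear configuration):

```python
trainable = 41_943_040               # from print_trainable_parameters() above
size_mib = trainable * 2 / 1024**2   # bf16 adapters: 2 bytes per parameter

print(f"{size_mib:.0f} MiB")  # 80 MiB, squarely in the 50-200 MB range
```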
Figure: QLoRA fine-tuning pipeline from base model to merged fine-tuned model.
LoRA for Reasoning Models
One of the bigger 2025 developments is using LoRA to fine-tune for reasoning behaviors, not just instruction-following. The recipe, popularized by DeepSeek R1 and its derivatives, is GRPO (Group Relative Policy Optimization) — a form of reinforcement learning that rewards correct reasoning chains.
The workflow looks like this:
- Start with an instruction-tuned model (not a base model)
- Attach LoRA adapters as usual
- Instead of SFT with (question, answer) pairs, use GRPO with (question, reward_fn) pairs
- The reward function checks mathematical correctness, code execution, or format compliance
- The model learns to generate better reasoning chains through trial and error
The TRL library's GRPOTrainer handles this directly. In 2025, the Tina paper showed that LoRA fine-tuning with GRPO on just 7,000 examples could transfer DeepSeek R1's reasoning style to smaller models — at a total compute cost under $25.
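A reward function for GRPOTrainer is just a callable that maps a batch of completions to per-sample scores. Here's a minimal format-compliance reward as a sketch (the answer-tag format is illustrative, not from any specific recipe):

```python
import re

def format_reward(completions, **kwargs):
    # Score 1.0 when the completion wraps its final answer in
    # <answer>...</answer> tags (hypothetical format), else 0.0
    rewards = []
    for text in completions:
        ok = re.search(r"<answer>.+?</answer>", text, re.DOTALL) is not None
        rewards.append(1.0 if ok else 0.0)
    return rewards

print(format_reward(["Let me think... <answer>42</answer>", "no tags here"]))
# [1.0, 0.0]
```

Real reward functions for math or code tasks would additionally check numerical correctness or execute the generated code, returning graded scores rather than binary ones.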
This matters for the LoRA hyperparameter discussion too: reasoning fine-tuning typically benefits from higher ranks (r=32 to r=64) because the behavioral shift is more fundamental than style adaptation. The model isn't just learning to format responses differently; it's learning to generate extended reasoning traces it may have rarely produced before.
Key Insight: Reasoning fine-tuning and instruction fine-tuning have different rank requirements. Use r=16 for style/format tasks. Use r=32 to r=64 for reasoning transfer.
When to Use Fine-Tuning vs RAG (and When to Skip Both)
This is the decision most teams get wrong. Both techniques improve LLM outputs. They solve different problems.
Use fine-tuning (LoRA/QLoRA) when:
- You need consistent output format or writing style (not achievable by prompting alone)
- The task involves fixed skills that don't change (code generation in a specific style, medical triage classification)
- You're processing high volumes where token costs matter
- The "knowledge" is about how to respond, not what facts to include
Use RAG when:
- Your knowledge base changes frequently (treatment guidelines, drug databases, internal policies)
- You need to cite sources or show your reasoning
- You're dealing with a large document corpus (thousands of PDFs)
- Data freshness matters (live inventory, current regulations)
Use both when:
- You want consistent behavior AND access to external knowledge
- The model needs to retrieve specialized facts AND format them in a specific clinical style
Use neither (prompting is enough) when:
- The task is general and the base model already handles it well
- You have fewer than 200 high-quality examples (too little data to avoid overfitting)
- Your knowledge update frequency is too high for retraining (multiple times per day)
The 2026 best practice is a hybrid: RAG for facts, fine-tuning for behavior. For a medical Q&A assistant specifically, LoRA trains the model to use clinical tone and follow documentation standards, while RAG provides current drug interactions and clinical guidelines. See RAG vs Fine-Tuning: When to Use Which for a full decision tree.
Key Insight: Fine-tuning teaches the model HOW to respond. RAG gives it WHAT to say. They're complementary, not competing.
Common Mistakes and How to Avoid Them
Rank too high without rsLoRA. Setting r=128 without use_rslora=True causes gradient instability because the standard alpha / r scaling becomes too small at high ranks. Either stay at r=64 or below with standard LoRA, or use rsLoRA for anything higher.
Wrong target modules. Applying LoRA only to q_proj and v_proj (the original paper's default) consistently underperforms applying it to all linear layers for instruction-following tasks. The difference is measurable and the parameter cost is manageable.
Learning rate too high. LoRA adapters are sensitive to learning rate. Rates above 5e-4 commonly cause the adapters to overshoot and destabilize training. Start at 2e-4 with a cosine scheduler and warmup. If loss doesn't decrease within the first 50 steps, lower to 1e-4.
Too few training epochs on small datasets. With under 2,000 examples, three to five epochs are usually necessary. With 50,000 or more examples, one or two epochs may suffice.
Forgetting the base model's capabilities. LoRA mitigates catastrophic forgetting better than full fine-tuning, but aggressive fine-tuning on a narrow domain still degrades general abilities. If your evaluation shows regression on general tasks, reduce learning rate or add diverse general instruction data to the training mix.
Not merging adapters before deployment. Serving base + adapter separately keeps the extra low-rank matmuls in every forward pass. Always call merge_and_unload() before deploying to production: same outputs, standard inference latency.
Skipping evaluation on out-of-domain examples. After fine-tuning, always run your model on questions it wasn't trained on. A model that scores 95% on the training distribution but can't handle rephrased questions has overfit. Evaluation breadth matters as much as held-out accuracy.
Conclusion
LoRA and QLoRA have made LLM fine-tuning accessible to anyone with a consumer GPU — and the 2025-2026 variant ecosystem has made them better yet. The core insight remains the same: adaptation has low intrinsic dimensionality, so you don't need to update all the weights to change the behavior. A rank-16 DoRA configuration on all linear layers trains only 0.5% of parameters but captures the behavioral changes that matter. QLoRA's 4-bit NF4 quantization cuts the memory footprint by 4x, bringing 8B model fine-tuning comfortably within 8 GB of VRAM for under $10 on cloud GPUs.
The practical guidance for 2026: use r=16 with DoRA and target_modules="all-linear" as your starting configuration. Enable rsLoRA if you're experimenting with high ranks. Use Unsloth on consumer hardware. Keep learning rate at 2e-4 with cosine warmup. Evaluate on out-of-domain examples, not just the held-out split.
For the underlying architecture that makes LoRA possible — why attention weight matrices are the right target — start with Attention Is All You Need: The Transformer Revolution. For the full RAG vs fine-tuning decision framework, RAG vs Fine-Tuning: When to Use Which covers every scenario in detail. And if you want to understand all the quantization options beyond NF4, LLM Quantization for Consumer Hardware explains the tradeoffs.
Build the adapter, merge it, and ship it. Your medical Q&A assistant is waiting.
Interview Questions
What is the difference between LoRA and full fine-tuning?
Full fine-tuning updates all of the model's weights, requiring memory proportional to the number of parameters times the optimizer state (often 4 to 8x the model size). LoRA freezes the original weights and adds small trainable rank-decomposition matrices (B and A) whose product approximates the necessary weight update. For a 7B model, this reduces trainable parameters from billions to tens of millions, making fine-tuning feasible on a single GPU.
Why is the B matrix initialized to zeros in LoRA?
Initializing B to zeros ensures that at the start of training, the LoRA contribution (BA) is exactly zero. This means the model begins fine-tuning from the exact pretrained behavior rather than from a random perturbation. A starts with a small Gaussian — it provides the gradient signal — but the overall output is unchanged until training updates B away from zero.
What does the LoRA rank parameter control, and how do you choose it?
Rank controls the capacity of the adapter — specifically, the inner dimension of the B and A matrices. Lower rank means fewer parameters and less expressive adaptation; higher rank means more capacity but diminishing returns above about 64. For style and instruction-following tasks, r=16 works well. Use r=32 to r=64 for complex domain adaptation or reasoning transfer. Above rank 64, consider rsLoRA to maintain stable gradients.
Explain what QLoRA adds on top of LoRA.
QLoRA quantizes the frozen base model to 4-bit NF4 precision, which reduces the memory needed to store those weights by roughly 4x. LoRA adapters still train in higher precision (bfloat16). QLoRA also introduces double quantization (quantizing the quantization constants) and paged optimizers (swapping optimizer states to CPU RAM during memory spikes). Together, these cut the memory footprint enough to fine-tune an 8B model on a consumer GPU with 8 GB VRAM for under $10.
What is DoRA and how does it improve on standard LoRA?
DoRA (Weight-Decomposed Low-Rank Adaptation) decomposes the pretrained weight into magnitude and direction components, then applies LoRA updates only to the directional part. This separates the two types of changes a fine-tuning task requires — rescaling important output directions versus rotating the weight to cover new patterns. Empirically, DoRA consistently outperforms LoRA by 1 to 4% on commonsense reasoning benchmarks across LLaMA models, with no extra inference overhead since the magnitude vector and direction update merge back into base weights post-training.
What is rsLoRA and when should you use it?
rsLoRA changes the LoRA scaling factor from alpha / r to alpha / sqrt(r). The original scaling causes gradient magnitude to shrink as rank increases, limiting useful fine-tuning to low ranks. The sqrt scaling preserves gradient flow at any rank, enabling stable training up to ranks of 512 or higher. Use rsLoRA whenever you're experimenting with ranks above 64, or if you find that increasing rank beyond 32 stops helping — it's a single flag (use_rslora=True) in PEFT with no downside.
What is NF4 and why is it better than standard int4 for LLM weights?
NF4 (NormalFloat4) places its quantization levels to minimize expected error given a normally distributed input. Neural network weights follow a roughly normal distribution, so NF4's non-uniform quantization boundaries match the data much better than int4, which assumes uniform distribution. In practice, NF4 quantization preserves model quality noticeably better than int4 at the same bit width.
When should you use RAG instead of LoRA for improving LLM outputs?
Use RAG when your knowledge needs to be frequently updated, citations are required, or the information lives in a large document corpus. Use LoRA when you need consistent behavioral patterns, specific output formats, or task-specific reasoning that prompting alone can't reliably achieve. In production systems, RAG and LoRA are often complementary: RAG handles dynamic knowledge retrieval, LoRA handles consistent response style and structure.