LoRA / Parameter-Efficient Adapters

Production

Modularity Pattern

Techniques that fine-tune a small number of additional parameters (typically 0.1-1% of the base model's parameter count) while freezing the original weights, enabling cheap, composable, and reversible model customization without full retraining.

LoRA (Low-Rank Adaptation, Hu et al., 2021) freezes the pretrained weights and injects trainable low-rank decomposition matrices into selected weight matrices (typically the attention projections) of each transformer layer. For a weight matrix W (d×d), LoRA adds ΔW = BA, where B is d×r, A is r×d, and the rank r << d (typically 4-64); the update is scaled by a constant α/r. At inference, ΔW can be merged into W, so the adapter adds zero latency overhead. QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit quantization of the frozen base weights, enabling fine-tuning of a 65B model on a single 48GB GPU. Practical usage: fine-tune via the Hugging Face PEFT library, Axolotl, or provider APIs. Adapters can be shared on the Hugging Face Hub and stacked or merged.
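A minimal sketch of this workflow with the PEFT library, assuming a generic causal LM; the model ID, rank, and target modules are illustrative choices, not recommendations:

```python
# Minimal sketch: attach a LoRA adapter with Hugging Face PEFT.
# Model name and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # rank of the low-rank matrices A and B
    lora_alpha=32,                        # update is scaled by alpha / r
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which weight matrices receive ΔW = BA
)

model = get_peft_model(base, config)  # freezes base weights, injects A and B
model.print_trainable_parameters()    # reports trainable vs. total parameter counts

# ... train with the usual Trainer / training loop; only A and B get gradients ...

model.save_pretrained("my-lora-adapter")  # saves only the adapter weights
```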

Why Does This Exist?

LoRA adapters are the most practical implementation of modular capabilities today. Each adapter is a self-contained capability delta (medical terminology, legal reasoning, code style) that can be independently trained, stored, shared, and swapped. Adapter-merging methods such as TIES and DARE enable combining capabilities, and removal is trivial: delete the adapter and the base model is restored. This is modularity at the weight level.
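A sketch of swapping and merging adapters with PEFT; the adapter repositories are hypothetical, and TIES merging via add_weighted_adapter assumes a recent PEFT release:

```python
# Sketch: load two capability deltas onto one frozen base, swap between them,
# and merge them TIES-style. Adapter names/paths are hypothetical.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

model = PeftModel.from_pretrained(base, "org/medical-lora", adapter_name="medical")
model.load_adapter("org/legal-lora", adapter_name="legal")

model.set_adapter("medical")  # route inference through the medical delta
model.set_adapter("legal")    # ...or swap to legal, with no retraining

# Combine both capabilities into a new adapter (TIES-style merge).
model.add_weighted_adapter(
    adapters=["medical", "legal"],
    weights=[1.0, 1.0],
    adapter_name="medical_legal",
    combination_type="ties",
    density=0.5,              # fraction of weights kept by TIES trimming
)
model.set_adapter("medical_legal")
```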

Fine-tuning a full 70B model requires hundreds of GPU-hours and specialized infrastructure. A LoRA adapter for the same model trains in single-digit GPU-hours on a single machine, occupies ~100MB of storage instead of ~140GB, and can be served by dynamically loading adapters onto a shared base model, dramatically reducing both training and serving costs for customized deployments.
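The storage figures follow from simple arithmetic. The sketch below assumes an 80-layer model with hidden size 8192, rank 16, and LoRA on two square attention projections per layer; these are illustrative assumptions, and real 70B architectures (e.g. with grouped-query attention) differ somewhat:

```python
# Back-of-the-envelope arithmetic behind the ~100MB vs ~140GB claim.
# All dimensions are assumptions for a 70B-class model.
n_layers, d, r = 80, 8192, 16
bytes_per_param = 2                        # fp16/bf16

# LoRA on q_proj and v_proj: each target matrix gains A (r×d) and B (d×r).
lora_params = n_layers * 2 * (d * r + r * d)
print(f"adapter: {lora_params / 1e6:.0f}M params, "
      f"{lora_params * bytes_per_param / 1e6:.0f} MB")  # ≈ 42M params, ≈ 84 MB

base_params = 70e9
print(f"base model: {base_params * bytes_per_param / 1e9:.0f} GB")  # ≈ 140 GB
```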

When a capability or knowledge domain is added via a LoRA adapter, unlearning is as simple as removing the adapter: a clean, complete, verifiable removal with zero collateral damage to base model capabilities. This only applies to adapter-introduced knowledge (not base model knowledge), but it demonstrates the unlearning benefits of modular architecture and provides a template for what principled unlearning could look like.
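A sketch of that removal path with PEFT, continuing the hypothetical adapter names from above:

```python
# Sketch: unlearning by deletion. The adapter path is hypothetical;
# unload() strips all adapter layers and returns the base model.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "org/medical-lora", adapter_name="medical")

# ... serve the adapted model ...

restored = model.unload()  # base weights were frozen throughout, so this is exact
# With multiple adapters loaded, model.delete_adapter("medical") removes just one.
```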