---
name: gptq
description: Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Optimization, GPTQ, Quantization, 4-Bit, Post-Training, Memory Optimization, Consumer GPUs, Fast Inference, QLoRA, Group-Wise Quantization]
dependencies: [auto-gptq, transformers, optimum, peft]
---
# GPTQ (Generative Pre-trained Transformer Quantization)
Post-training quantization method that compresses LLMs to 4-bit with minimal accuracy loss using group-wise quantization.
## When to use GPTQ
**Use GPTQ when:**
- Need to fit large models (70B+) on limited GPU memory
- Want 4× memory reduction with <2% accuracy loss
- Deploying on consumer GPUs (RTX 4090, 3090)
- Need faster inference (3-4× speedup vs FP16)
**Use AWQ instead when:**
- Need slightly better accuracy (<1% loss)
- Have newer GPUs (Ampere, Ada)
- Want Marlin kernel support (2× faster on some GPUs)
**Use bitsandbytes instead when:**
- Need simple integration with transformers
- Want 8-bit quantization (less compression, better quality)
- Don't need pre-quantized model files (bitsandbytes quantizes at load time; see the sketch below)
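For contrast, bitsandbytes quantizes a standard FP16 checkpoint on the fly at load time, so no pre-quantized files are needed. A minimal sketch (the model ID is the same one used in the quantization example later in this guide):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# On-the-fly 4-bit quantization: no calibration step, no pre-quantized checkpoint
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
```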
## Quick start
### Installation
```bash
# Install AutoGPTQ
pip install auto-gptq
# With Triton (Linux only, faster)
pip install auto-gptq[triton]
# With CUDA extensions (faster)
pip install auto-gptq --no-build-isolation
# Full installation
pip install auto-gptq transformers accelerate
```
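A quick sanity check that the packages installed and a GPU is visible (standard-library and PyTorch calls only):
```python
from importlib.metadata import version

import torch

print("auto-gptq:", version("auto-gptq"))        # raises PackageNotFoundError if missing
print("transformers:", version("transformers"))
print("CUDA available:", torch.cuda.is_available())
```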
### Load pre-quantized model
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
# Load quantized model from HuggingFace
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"
model = AutoGPTQForCausalLM.from_quantized(
model_name,
device="cuda:0",
use_triton=False # Set True on Linux for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Generate
prompt = "Explain quantum computing"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
```
### Quantize your own model
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
# Load model
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Quantization config
quantize_config = BaseQuantizeConfig(
bits=4, # 4-bit quantization
group_size=128, # Group size (recommended: 128)
desc_act=False, # Activation order (False for CUDA kernel)
damp_percent=0.01 # Dampening factor
)
# Load model for quantization
model = AutoGPTQForCausalLM.from_pretrained(
model_name,
quantize_config=quantize_config
)
# Prepare calibration data
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
calibration_data = [
    # keep full encodings (input_ids + attention_mask), as model.quantize() expects
    tokenizer(example["text"], truncation=True, max_length=512)
    for example in dataset.take(128)
]
# Quantize
model.quantize(calibration_data)
# Save quantized model
model.save_quantized("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")
# Push to HuggingFace
model.push_to_hub("username/llama-2-7b-gptq")
```
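Once saved, the local directory reloads the same way as a Hub checkpoint (mirrors the loading example above):
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Reload the freshly quantized model from the local directory
model = AutoGPTQForCausalLM.from_quantized("llama-2-7b-gptq", device="cuda:0")
tokenizer = AutoTokenizer.from_pretrained("llama-2-7b-gptq")
```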
## Group-wise quantization
**How GPTQ works**:
1. **Group weights**: Divide each weight matrix into groups (typically 128 elements)
2. **Quantize per-group**: Each group has its own scale/zero-point
3. **Minimize error**: Uses Hessian information to minimize quantization error
4. **Result**: 4-bit weights with near-FP16 accuracy
**Group size trade-off**:
| Group Size | Model Size | Accuracy | Speed | Recommendation |
|------------|------------|----------|-------|----------------|
| 32 | Largest (most scales/zero-points) | Best | Slower | High accuracy needed |
| **128** | Medium | Good | **Fast** | **Recommended default** |
| 256 | Smaller | Lower | Faster | Speed critical |
| 1024 | Smallest | Low | Fastest | Not recommended |
| -1 (per-channel, no grouping) | Smallest | Lowest | Fastest | Rarely useful |
**Example**:
```
Weight matrix: [1024, 4096] = 4.2M elements
Group size = 128:
- Groups: 4.2M / 128 = 32,768 groups
- Each group: its own scale + zero-point (the weights themselves are stored in 4-bit)
- Result: finer granularity → better accuracy
```
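A minimal NumPy sketch of the per-group scale/zero-point scheme above (illustrative only: real GPTQ additionally uses Hessian information to compensate quantization error column by column, which plain round-to-nearest omits):
```python
import numpy as np

def quantize_group(w, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** bits - 1                     # 15 levels for 4-bit
    scale = (w.max() - w.min()) / qmax       # one scale per group (kept in higher precision)
    zero = np.round(-w.min() / scale)        # one zero-point per group
    q = np.clip(np.round(w / scale + zero), 0, qmax).astype(np.uint8)
    return q, scale, zero

row = np.random.randn(4096).astype(np.float32)   # one row of a weight matrix
groups = row.reshape(-1, 128)                    # 32 groups of 128 weights

packed = [quantize_group(g) for g in groups]
dequant = np.concatenate([(q.astype(np.float32) - z) * s for q, s, z in packed])
print("mean abs error:", np.abs(row - dequant).mean())
```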
## Quantization configurations
### Standard 4-bit (recommended)
```python
from auto_gptq import BaseQuantizeConfig
config = BaseQuantizeConfig(
bits=4, # 4-bit quantization
group_size=128, # Standard group size
desc_act=False, # Faster CUDA kernel
damp_percent=0.01 # Dampening factor
)
```
**Performance**:
- Memory: 4× reduction (70B model: 140GB → 35GB)
- Accuracy: ~1.5% perplexity increase
- Speed: 3-4× faster than FP16
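The 4× figure can be sanity-checked with a back-of-the-envelope estimate (hypothetical helper; weight storage only, ignoring activations, KV cache, and framework overhead):
```python
def gptq_weight_memory_gb(params_billion, bits=4, group_size=128):
    """Rough weight-only memory estimate for a GPTQ model, in GB."""
    packed = params_billion * bits / 8                         # packed low-bit weights
    overhead = params_billion / group_size * (16 + bits) / 8   # per-group FP16 scale + zero-point
    return packed + overhead

print(f"70B @ 4-bit, g=128: ~{gptq_weight_memory_gb(70):.0f} GB")   # ~36 GB vs 140 GB in FP16
```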
### Higher compression (3-bit)
```python
config = BaseQuantizeConfig(
bits=3, # 3-bit (more compression)
group_size=128, # Keep standard group size
desc_act=True, # Better accuracy (slower)
damp_percent=0.01
)
```
**Trade-off**:
- Memory: 5× reduction
- Accuracy: ~3% perplexity increase
- Speed: 5× faster (but less accurate)
### Maximum accuracy (4-bit with small groups)
```python
config = BaseQuantizeConfig(
bits=4,
group_size=32, # Smaller groups (better accuracy)
desc_act=True, # Activation reordering
damp_percent=0.005 # Lower dampening
)
```
**Trade-off**:
- Memory: 3.5× reduction (slightly larger)
- Accuracy: ~0.8% perplexity increase (best)
- Speed: 2-3× faster (kernel overhead)
## Kernel backends
### ExLlamaV2 (default, fastest)
```python
model = AutoGPTQForCausalLM.from_quantized(
model_name,
device="cuda:0",
use_exllama=True, # Use ExLlamaV2
exllama_config={"version": 2}
)
```
**Performance**: 1.5-2× faster than Triton
### Marlin (Ampere+ GPUs)
```python
# Quantize with Marlin format
config = BaseQuantizeConfig(
bits=4,
group_size=128,
desc_act=False # Required for Marlin
)
model.quantize(calibration_data, use_marlin=True)
# Load with Marlin
model = AutoGPTQForCausalLM.from_quantized(
model_name,
device="cuda:0",
use_marlin=True # 2× faster on A100/H100
)
```
**Requirements**:
- NVIDIA Ampere or newer (A100, H100, RTX 30xx/40xx)
- Compute capability ≥ 8.0 (see the check below)
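To check this requirement before enabling Marlin (plain PyTorch, no AutoGPTQ needed):
```python
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")   # Marlin needs >= 8.0 (Ampere or newer)
```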
### Triton (Linux only)
```python
model = AutoGPTQForCausalLM.from_quantized(
model_name,
device="cuda:0",
use_triton=True # Linux only
)
```
**Performance**: 1.2-1.5× faster than CUDA backend
## Integration with transformers
### Direct transformers usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load quantized model (transformers auto-detects GPTQ)
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-13B-Chat-GPTQ",
device_map="auto",
trust_remote_code=False
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-GPTQ")
# Use like any transformers model
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
```
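Transformers (via optimum) can also run the quantization itself, without calling AutoGPTQ directly. A hedged sketch of that path; the exact options depend on your transformers/optimum versions:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# "c4" is a built-in calibration dataset option; a list of strings also works
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto"
)
model.save_pretrained("llama-2-7b-chat-gptq")
tokenizer.save_pretrained("llama-2-7b-chat-gptq")
```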
### QLoRA fine-tuning (GPTQ + LoRA)
```python
from transformers import AutoModelForCausalLM
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
# Load GPTQ model
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-7B-GPTQ",
device_map="auto"
)
# Prepare for LoRA training
model = prepare_model_for_kbit_training(model)
# LoRA config
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# Add LoRA adapters
model = get_peft_model(model, lora_config)
# Fine-tune (memory efficient!)
# 70B model trainable on single A100 80GB
```
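A minimal training-loop sketch to go with the setup above, assuming `train_dataset` is a tokenized causal-LM dataset you provide (hyperparameters are illustrative, not tuned):
```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
tokenizer.pad_token = tokenizer.eos_token

args = TrainingArguments(
    output_dir="llama2-7b-gptq-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    num_train_epochs=1,
    fp16=True,
    logging_steps=10,
)
trainer = Trainer(
    model=model,                       # the LoRA-wrapped GPTQ model from above
    args=args,
    train_dataset=train_dataset,       # assumption: your tokenized dataset with input_ids
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama2-7b-gptq-lora")   # saves only the LoRA adapter weights
```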
## Performance benchmarks
### Memory reduction
| Model | FP16 | GPTQ 4-bit | Reduction |
|-------|------|------------|-----------|
| Llama 2-7B | 14 GB | 3.5 GB | 4× |
| Llama 2-13B | 26 GB | 6.5 GB | 4× |
| Llama 2-70B | 140 GB | 35 GB | 4× |
| Llama 3.1-405B | 810 GB | 203 GB | 4× |
**Enables**:
- 70B on single A100 80GB (vs 2× A100 needed for FP16)
- 405B on 3× A100 80GB (vs 11× A100 needed for FP16)
- 13B on RTX 4090 24GB (vs OOM with FP16)
### Inference speed (Llama 2-7B, A100)
| Precision | Tokens/sec | vs FP16 |
|-----------|------------|---------|
| FP16 | 25 tok/s | 1× |
| GPTQ 4-bit (CUDA) | 85 tok/s | 3.4× |
| GPTQ 4-bit (ExLlama) | 105 tok/s | 4.2× |
| GPTQ 4-bit (Marlin) | 120 tok/s | 4.8× |
### Accuracy (perplexity on WikiText-2)
| Model | FP16 | GPTQ 4-bit (g=128) | Degradation |
|-------|------|---------------------|-------------|
| Llama 2-7B | 5.47 | 5.55 | +1.5% |
| Llama 2-13B | 4.88 | 4.95 | +1.4% |
| Llama 2-70B | 3.32 | 3.38 | +1.8% |
**Excellent quality preservation** - less than 2% degradation!
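A rough way to reproduce this kind of measurement (sketch; exact numbers depend on context length and windowing, and assume the model and tokenizer are loaded as in the transformers integration example above):
```python
import torch
from datasets import load_dataset

# Non-overlapping 2048-token windows over the WikiText-2 test split
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

window, nll_sum, n_tokens = 2048, 0.0, 0
for i in range(0, ids.size(1), window):
    chunk = ids[:, i : i + window].to("cuda")
    if chunk.size(1) < 2:
        continue
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss   # mean next-token NLL over the chunk
    n = chunk.size(1) - 1                        # number of predicted tokens
    nll_sum += loss.item() * n
    n_tokens += n

print(f"perplexity: {torch.exp(torch.tensor(nll_sum / n_tokens)).item():.2f}")
```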
## Common patterns
### Multi-GPU deployment
```python
# Automatic device mapping
model = AutoGPTQForCausalLM.from_quantized(
"TheBloke/Llama-2-70B-GPTQ",
device_map="auto", # Automatically split across GPUs
max_memory={0: "40GB", 1: "40GB"} # Limit per GPU
)
# Manual device mapping (accelerate needs one entry per module; ranges like "0-39" are not supported)
device_map = {
    "model.embed_tokens": 0,
    **{f"model.layers.{i}": 0 for i in range(40)},      # first 40 layers on GPU 0
    **{f"model.layers.{i}": 1 for i in range(40, 80)},  # last 40 layers on GPU 1
    "model.norm": 1,
    "lm_head": 1
}
model = AutoGPTQForCausalLM.from_quantized(
model_name,
device_map=device_map
)
```
### CPU offloading
```python
# Offload some layers to CPU (for very large models)
model = AutoGPTQForCausalLM.from_quantized(
"TheBloke/Llama-2-405B-GPTQ",
device_map="auto",
max_memory={
0: "80GB", # GPU 0
1: "80GB", # GPU 1
2: "80GB", # GPU 2
"cpu": "200GB" # Offload overflow to CPU
}
)
```
### Batch inference
```python
# Process multiple prompts efficiently
prompts = [
"Explain AI",
"Explain ML",
"Explain DL"
]
# Llama tokenizers have no pad token; use EOS and pad on the left for generation
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id
)
for i, output in enumerate(outputs):
    print(f"Prompt {i}: {tokenizer.decode(output, skip_special_tokens=True)}")
```
## Finding pre-quantized models
**TheBloke on HuggingFace**:
- https://huggingface.co/TheBloke
- 1000+ models in GPTQ format
- Multiple group sizes (32, 128)
- Both CUDA and Marlin formats
**Search**: https://huggingface.co/models?library=gptq lists GPTQ-format models on HuggingFace
**Download**:
```python
from auto_gptq import AutoGPTQForCausalLM
# Automatically downloads from HuggingFace
model = AutoGPTQForCausalLM.from_quantized(
"TheBloke/Llama-2-70B-Chat-GPTQ",
device="cuda:0"
)
```
## Supported models
- **LLaMA family**: Llama 2, Llama 3, Code Llama
- **Mistral**: Mistral 7B, Mixtral 8x7B, 8x22B
- **Qwen**: Qwen, Qwen2, QwQ
- **DeepSeek**: V2, V3
- **Phi**: Phi-2, Phi-3
- **Yi, Falcon, BLOOM, OPT**
- **100+ models** on HuggingFace
## References
- **[Calibration Guide](references/calibration.md)** - Dataset selection, quantization process, quality optimization
- **[Integration Guide](references/integration.md)** - Transformers, PEFT, vLLM, TensorRT-LLM
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, performance optimization
## Resources
- **GitHub**: https://github.com/AutoGPTQ/AutoGPTQ
- **Paper**: GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (arXiv:2210.17323)
- **Models**: https://huggingface.co/models?library=gptq
- **Discord**: https://discord.gg/autogptq