hqq-quantization by davila7
Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.
Coding
15.7K Stars
1.4K Forks
Updated Jan 12, 2026, 05:31 AM
Why Use This
This skill provides specialized capabilities for davila7's codebase.
Use Cases
- Developing new features in the davila7 repository
- Refactoring existing code to follow davila7 standards
- Understanding and working with davila7's codebase structure
Skill Snapshot
Auto scan of skill assets. Informational only.
Valid SKILL.md
Checks against SKILL.md specification
Source & Community
Repository claude-code-templates
Skill Version
main
Community
15.7K 1.4K
Updated At Jan 12, 2026, 05:31 AM
Skill Stats
SKILL.md 446 Lines
Total Files 1
Total Size 0 B
License MIT
---
name: hqq-quantization
description: Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Quantization, HQQ, Optimization, Memory Efficiency, Inference, Model Compression]
dependencies: [hqq>=0.2.0, torch>=2.0.0]
---
# HQQ - Half-Quadratic Quantization
Fast, calibration-free weight quantization supporting 8/4/3/2/1-bit precision with multiple optimized backends.
## When to use HQQ
**Use HQQ when:**
- Quantizing models without calibration data (no dataset needed)
- Need fast quantization (minutes vs hours for GPTQ/AWQ)
- Deploying with vLLM or HuggingFace Transformers
- Fine-tuning quantized models with LoRA/PEFT
- Experimenting with extreme quantization (2-bit, 1-bit)
**Key advantages:**
- **No calibration**: Quantize any model instantly without sample data
- **Multiple backends**: PyTorch, ATEN, TorchAO, Marlin, BitBlas for optimized inference
- **Flexible precision**: 8/4/3/2/1-bit with configurable group sizes
- **Framework integration**: Native HuggingFace and vLLM support
- **PEFT compatible**: Fine-tune quantized models with LoRA
**Use alternatives instead:**
- **AWQ**: Need calibration-based accuracy, production serving
- **GPTQ**: Maximum accuracy with calibration data available
- **bitsandbytes**: Simple 8-bit/4-bit without custom backends
- **llama.cpp/GGUF**: CPU inference, Apple Silicon deployment
## Quick start
### Installation
```bash
pip install hqq
# With specific backend
pip install hqq[torch] # PyTorch backend
pip install hqq[torchao] # TorchAO int4 backend
pip install hqq[bitblas] # BitBlas backend
pip install hqq[marlin] # Marlin backend
```
### Basic quantization
```python
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear
import torch.nn as nn
# Configure quantization
config = BaseQuantizeConfig(
nbits=4, # 4-bit quantization
group_size=64, # Group size for quantization
axis=1 # Quantize along output dimension
)
# Quantize a linear layer
linear = nn.Linear(4096, 4096)
hqq_linear = HQQLinear(linear, config)
# Use normally
output = hqq_linear(input_tensor)
```
### Quantize full model with HuggingFace
```python
from transformers import AutoModelForCausalLM, HqqConfig
# Configure HQQ
quantization_config = HqqConfig(
nbits=4,
group_size=64,
axis=1
)
# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=quantization_config,
device_map="auto"
)
# Model is quantized and ready to use
```
## Core concepts
### Quantization configuration
HQQ uses `BaseQuantizeConfig` to define quantization parameters:
```python
from hqq.core.quantize import BaseQuantizeConfig
# Standard 4-bit config
config_4bit = BaseQuantizeConfig(
nbits=4, # Bits per weight (1-8)
group_size=64, # Weights per quantization group
axis=1 # 0=input dim, 1=output dim
)
# Aggressive 2-bit config
config_2bit = BaseQuantizeConfig(
nbits=2,
group_size=16, # Smaller groups for low-bit
axis=1
)
# Mixed precision per layer type
layer_configs = {
"self_attn.q_proj": BaseQuantizeConfig(nbits=4, group_size=64),
"self_attn.k_proj": BaseQuantizeConfig(nbits=4, group_size=64),
"self_attn.v_proj": BaseQuantizeConfig(nbits=4, group_size=64),
"mlp.gate_proj": BaseQuantizeConfig(nbits=2, group_size=32),
"mlp.up_proj": BaseQuantizeConfig(nbits=2, group_size=32),
"mlp.down_proj": BaseQuantizeConfig(nbits=4, group_size=64),
}
```
### HQQLinear layer
The core quantized layer that replaces `nn.Linear`:
```python
from hqq.core.quantize import HQQLinear
import torch
# Create quantized layer
linear = torch.nn.Linear(4096, 4096)
hqq_layer = HQQLinear(linear, config)
# Access quantized weights
W_q = hqq_layer.W_q # Quantized weights
scale = hqq_layer.scale # Scale factors
zero = hqq_layer.zero # Zero points
# Dequantize for inspection
W_dequant = hqq_layer.dequantize()
```
### Backends
HQQ supports multiple inference backends for different hardware:
```python
from hqq.core.quantize import HQQLinear
# Available backends
backends = [
"pytorch", # Pure PyTorch (default)
"pytorch_compile", # torch.compile optimized
"aten", # Custom CUDA kernels
"torchao_int4", # TorchAO int4 matmul
"gemlite", # GemLite CUDA kernels
"bitblas", # BitBlas optimized
"marlin", # Marlin 4-bit kernels
]
# Set backend globally
HQQLinear.set_backend("torchao_int4")
# Or per layer
hqq_layer.set_backend("marlin")
```
**Backend selection guide:**
| Backend | Best For | Requirements |
|---------|----------|--------------|
| pytorch | Compatibility | Any GPU |
| pytorch_compile | Moderate speedup | torch>=2.0 |
| aten | Good balance | CUDA GPU |
| torchao_int4 | 4-bit inference | torchao installed |
| marlin | Maximum 4-bit speed | Ampere+ GPU |
| bitblas | Flexible bit-widths | bitblas installed |
## HuggingFace integration
### Load pre-quantized models
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load HQQ-quantized model from Hub
model = AutoModelForCausalLM.from_pretrained(
"mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
# Use normally
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
```
### Quantize and save
```python
from transformers import AutoModelForCausalLM, HqqConfig
# Quantize
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=config,
device_map="auto"
)
# Save quantized model
model.save_pretrained("./llama-8b-hqq-4bit")
# Push to Hub
model.push_to_hub("my-org/Llama-3.1-8B-HQQ-4bit")
```
### Mixed precision quantization
```python
from transformers import AutoModelForCausalLM, HqqConfig
# Different precision per layer type
config = HqqConfig(
nbits=4,
group_size=64,
# Attention layers: higher precision
# MLP layers: lower precision for memory savings
dynamic_config={
"attn": {"nbits": 4, "group_size": 64},
"mlp": {"nbits": 2, "group_size": 32}
}
)
```
## vLLM integration
### Serve HQQ models with vLLM
```python
from vllm import LLM, SamplingParams
# Load HQQ-quantized model
llm = LLM(
model="mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
quantization="hqq",
dtype="float16"
)
# Generate
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["What is machine learning?"], sampling_params)
```
### vLLM with custom HQQ config
```python
from vllm import LLM
llm = LLM(
model="meta-llama/Llama-3.1-8B",
quantization="hqq",
quantization_config={
"nbits": 4,
"group_size": 64
}
)
```
## PEFT/LoRA fine-tuning
### Fine-tune quantized models
```python
from transformers import AutoModelForCausalLM, HqqConfig
from peft import LoraConfig, get_peft_model
# Load quantized model
quant_config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=quant_config,
device_map="auto"
)
# Apply LoRA
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# Train normally with Trainer or custom loop
```
### QLoRA-style training
```python
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
output_dir="./hqq-lora-output",
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
num_train_epochs=3,
fp16=True,
logging_steps=10,
save_strategy="epoch"
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=data_collator
)
trainer.train()
```
## Quantization workflows
### Workflow 1: Quick model compression
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig
# 1. Configure quantization
config = HqqConfig(nbits=4, group_size=64)
# 2. Load and quantize (no calibration needed!)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=config,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
# 3. Verify quality
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
# 4. Save
model.save_pretrained("./llama-8b-hqq")
tokenizer.save_pretrained("./llama-8b-hqq")
```
### Workflow 2: Optimize for inference speed
```python
from hqq.core.quantize import HQQLinear
from transformers import AutoModelForCausalLM, HqqConfig
# 1. Quantize with optimal backend
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=config,
device_map="auto"
)
# 2. Set fast backend
HQQLinear.set_backend("marlin") # or "torchao_int4"
# 3. Compile for additional speedup
import torch
model = torch.compile(model)
# 4. Benchmark
import time
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
start = time.time()
for _ in range(10):
model.generate(**inputs, max_new_tokens=100)
print(f"Avg time: {(time.time() - start) / 10:.2f}s")
```
## Best practices
1. **Start with 4-bit**: Best quality/size tradeoff for most models
2. **Use group_size=64**: Good balance; smaller for extreme quantization
3. **Choose backend wisely**: Marlin for 4-bit Ampere+, TorchAO for flexibility
4. **Verify quality**: Always test generation quality after quantization
5. **Mixed precision**: Keep attention at higher precision, compress MLP more
6. **PEFT training**: Use LoRA r=16-32 for good fine-tuning results
## Common issues
**Out of memory during quantization:**
```python
# Quantize layer-by-layer
from hqq.models.hf.base import AutoHQQHFModel
model = AutoHQQHFModel.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=config,
device_map="sequential" # Load layers sequentially
)
```
**Slow inference:**
```python
# Switch to optimized backend
from hqq.core.quantize import HQQLinear
HQQLinear.set_backend("marlin") # Requires Ampere+ GPU
# Or compile
model = torch.compile(model, mode="reduce-overhead")
```
**Poor quality at 2-bit:**
```python
# Use smaller group size
config = BaseQuantizeConfig(
nbits=2,
group_size=16, # Smaller groups help at low bits
axis=1
)
```
## References
- **[Advanced Usage](references/advanced-usage.md)** - Custom backends, mixed precision, optimization
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, benchmarks
## Resources
- **Repository**: https://github.com/mobiusml/hqq
- **Paper**: Half-Quadratic Quantization
- **HuggingFace Models**: https://huggingface.co/mobiuslabsgmbh
- **Version**: 0.2.0+
- **License**: Apache 2.0
Name Size