Why Use This This skill provides specialized capabilities for jeremylongshore's codebase.
Use Cases Developing new features in the jeremylongshore repository Refactoring existing code to follow jeremylongshore standards Understanding and working with jeremylongshore's codebase structure
Install Guide 2 steps 1 2 Install inside Ananke
Click Install Skill, paste the link below, then press Install.
https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/main/plugins/saas-packs/vastai-pack/skills/vastai-reference-architecture Skill Snapshot Auto scan of skill assets. Informational only.
Valid SKILL.md Checks against SKILL.md specification
Source & Community
Updated At Mar 11, 2026, 05:33 AM
Skill Stats
SKILL.md 189 Lines
Total Files 1
Total Size 7.6 KB
License MIT
---
name: vastai-reference-architecture
description: |
Implement Vast.ai reference architecture with best-practice project layout.
Use when designing new Vast.ai integrations, reviewing project structure,
or establishing architecture standards for Vast.ai applications.
Trigger with phrases like "vastai architecture", "vastai best practices",
"vastai project structure", "how to organize vastai", "vastai layout".
allowed-tools: Read, Grep
version: 1.0.0
license: MIT
author: Jeremy Longshore <jeremy@intentsolutions.io>
compatible-with: claude-code, codex, openclaw
---
# Vast.ai Reference Architecture
## Overview
Production architecture for GPU compute on Vast.ai. Covers instance lifecycle management, training job orchestration, model serving patterns, and cost-optimized instance selection for ML workloads.
## Prerequisites
- Vast.ai account with API key
- `vastai` CLI installed
- Docker for container image management
- SSH key pair configured
## Architecture Diagram
```
┌──────────────────────────────────────────────────────┐
│ Job Orchestrator │
│ Training Jobs │ Inference Serving │ Batch Processing │
└──────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Vast.ai Instance Manager │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ Search │ │ Create │ │ Monitor │ │
│ │ (find GPUs) │ │ (provision) │ │ (status) │ │
│ └──────────────┘ └──────────────┘ └───────────┘ │
├──────────────────────────────────────────────────────┤
│ GPU Instance │
│ ┌──────────────────────────────────────────────┐ │
│ │ Docker Container │ │
│ │ PyTorch │ Data │ Model Weights │ Checkpoints│ │
│ └──────────────────────────────────────────────┘ │
├──────────────────────────────────────────────────────┤
│ Data Transfer │
│ rsync (upload) │ S3/GCS (persist) │ rsync (download)│
└──────────────────────────────────────────────────────┘
```
## Instructions
### Step 1: Instance Selection Strategy
```python
import subprocess
import json
def search_instances(
gpu_type: str = "RTX 4090",
max_price: float = 0.50,
min_ram_gb: int = 32,
min_disk_gb: int = 50,
):
"""Search for cost-effective GPU instances."""
cmd = [
"vastai", "search", "offers",
"--gpu-name", gpu_type,
"--dph", f"<={max_price}",
"--min-ram", str(min_ram_gb),
"--min-disk", str(min_disk_gb),
"--reliability", ">0.95",
"--order", "dlperf_per_dphtotal-desc",
"--limit", "5",
"--raw"
]
result = subprocess.run(cmd, capture_output=True, text=True)
return json.loads(result.stdout)
# GPU selection guide:
# Training: A100 80GB ($1.50-2.50/hr) or H100 ($2.50-4.00/hr)
# Fine-tuning: RTX 4090 ($0.30-0.50/hr) or A6000 ($0.60-1.00/hr)
# Inference: RTX 3090 ($0.20-0.35/hr) or RTX 4090
```
### Step 2: Training Job Automation
```python
class VastTrainingJob:
def __init__(self, api_key: str):
self.api_key = api_key
def launch(self, offer_id: int, image: str, disk_gb: int = 50):
result = subprocess.run([
"vastai", "create", "instance", str(offer_id),
"--image", image,
"--disk", str(disk_gb),
"--onstart-cmd", "cd /workspace && python train.py",
"--raw"
], capture_output=True, text=True)
return json.loads(result.stdout)
def wait_until_ready(self, instance_id: int, timeout: int = 600): # 600: timeout: 10 minutes
import time
for _ in range(timeout // 10):
status = self.get_status(instance_id)
if status.get("actual_status") == "running":
return status
time.sleep(10)
raise TimeoutError("Instance failed to start")
def get_status(self, instance_id: int):
result = subprocess.run(
["vastai", "show", "instance", str(instance_id), "--raw"],
capture_output=True, text=True
)
return json.loads(result.stdout)
def destroy(self, instance_id: int):
subprocess.run(["vastai", "destroy", "instance", str(instance_id)])
```
### Step 3: Docker Image for Training
```dockerfile
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
WORKDIR /workspace
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Pre-download model weights during build
RUN python -c "from transformers import AutoModel; AutoModel.from_pretrained('meta-llama/Llama-3.2-1B')"
ENTRYPOINT ["python", "train.py"]
```
### Step 4: Data Transfer Pattern
```python
def upload_data(instance_id: int, local_path: str, remote_path: str = "/workspace/data"):
info = get_instance_info(instance_id)
subprocess.run([
"rsync", "-avz", "--progress",
"-e", f"ssh -p {info['ssh_port']} -o StrictHostKeyChecking=no",
local_path, f"root@{info['ssh_host']}:{remote_path}"
], check=True)
def download_results(instance_id: int, remote_path: str, local_path: str):
info = get_instance_info(instance_id)
subprocess.run([
"rsync", "-avz", "--progress",
"-e", f"ssh -p {info['ssh_port']}",
f"root@{info['ssh_host']}:{remote_path}", local_path
], check=True)
```
## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| No instances available | Filters too strict | Relax price or GPU type constraints |
| Instance killed | Host maintenance | Use checkpointing, auto-restart |
| Slow data transfer | Large dataset | Pre-load data in Docker image |
| OOM during training | Model too large for VRAM | Use gradient checkpointing, smaller batch |
## Examples
### End-to-End Training Pipeline
```python
job = VastTrainingJob(api_key="...")
offers = search_instances("A100", max_price=2.00)
instance = job.launch(offers[0]["id"], "my-training:latest")
job.wait_until_ready(instance["new_contract"])
upload_data(instance["new_contract"], "./data/")
# Training runs via onstart-cmd
download_results(instance["new_contract"], "/workspace/output/", "./results/")
job.destroy(instance["new_contract"])
```
## Resources
- [Vast.ai CLI Documentation](https://vast.ai/docs/cli/commands)
- [Vast.ai Instance Guide](https://vast.ai/docs)
## Output
- Configuration files or code changes applied to the project
- Validation report confirming correct implementation
- Summary of changes made and their rationale