Run 70B models with 128k context on a single GPU

TurboAgent brings Google Research's TurboQuant KV-cache compression to open-source LLMs. 4.9x memory reduction, zero accuracy loss, one-line setup.

pip install turboagent-ai[llama,native]
4.9x KV cache compression · 0 accuracy loss · 128k+ effective context · 93 tests passing

One-line agent creation

Auto-detects your hardware, selects the optimal backend, configures TurboQuant compression, and manages multi-turn memory. Just import and run.

KV cache for 70B at 128k context: ~4 GB instead of ~20 GB.

from turboagent import TurboAgent

agent = TurboAgent(
    "meta-llama/Llama-3.1-70B-Instruct",
    kv_mode="turbo3",
    context=131072,
)

response = agent.run(
    "Analyze my 50k-token research doc..."
)
# KV cache: 4 GB (was 20 GB)

Features

Everything you need for long-context agentic AI on consumer hardware.

TurboQuant Compression

PolarQuant + QJL algorithms from the paper. Precomputed Lloyd-Max codebooks. Bit-packed storage for true 4.9x compression.
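Bit-packed storage is what turns small quantization codes into real memory savings: codes are stored back-to-back at the bit level instead of one per byte. A minimal sketch of packing two 4-bit codes per byte (illustrative only; TurboAgent's actual layout and its 3.25-bpv packing are more involved):

```python
def pack4(codes):
    """Pack a list of 4-bit codes (0-15) two per byte."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes((codes[i] << 4) | codes[i + 1]
                 for i in range(0, len(codes), 2))

def unpack4(packed, n):
    """Recover the first n 4-bit codes from packed bytes."""
    out = []
    for b in packed:
        out.append(b >> 4)
        out.append(b & 0x0F)
    return out[:n]

codes = [3, 15, 0, 7, 9]
packed = pack4(codes)
assert unpack4(packed, len(codes)) == codes
assert len(packed) == 3  # 5 codes fit in 3 bytes instead of 5
```

Compared with storing one code per byte, this alone halves the footprint; fractional bit widths push the ratio further.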

Hardware Auto-Tuning

Detects CUDA, ROCm, Metal, CPU. Selects backend, layer offloading, context size, and compression mode automatically.
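The selection logic can be pictured as a simple priority check. This is an illustrative sketch, not TurboAgent's actual detection code (which also weighs VRAM, layer offloading, and context size):

```python
import platform
import shutil

def pick_backend():
    """Illustrative backend choice: prefer CUDA, then ROCm,
    then Metal on Apple Silicon, else fall back to CPU."""
    if shutil.which("nvidia-smi"):
        return "cuda"
    if shutil.which("rocm-smi"):
        return "rocm"
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "metal"
    return "cpu"

print(pick_backend())
```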

Multiple Backends

llama.cpp for consumer GPUs, vLLM for server throughput, PyTorch for research. One unified API.

Multi-Agent Swarms

Shared compressed KV pool across agents. Round-robin or custom routing. 5x less memory than independent contexts.
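Round-robin routing simply hands each incoming task to the next agent in a fixed rotation. A sketch with hypothetical agent names (TurboSwarm's real router API may differ):

```python
from itertools import cycle

# Hypothetical agent names matching the swarm example below.
agents = ["researcher", "critic", "writer"]
route = cycle(agents)

tasks = ["gather sources", "check claims", "draft summary", "gather more"]
assignments = [(task, next(route)) for task in tasks]
# Each task goes to the next agent in turn, wrapping around after "writer".
```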

RAG Vector Store

Inner-product-preserving search using TurboQuant codebooks. Superior recall vs. Product Quantization.
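The key idea behind inner-product-preserving search is that similarity scores can be computed directly against quantized codes, without decompressing the database. TurboQuant uses Lloyd-Max codebooks; the sketch below substitutes a plain uniform scalar quantizer just to show the mechanic:

```python
import random

def quantize(v, bits=4):
    """Uniform scalar quantization (a stand-in for TurboQuant's
    Lloyd-Max codebooks): store a scale plus small integer codes."""
    half = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in v) or 1.0
    codes = [round(x / scale * half) for x in v]
    return scale / half, codes

def dot_q(q, v):
    """Inner product computed directly on quantized codes."""
    scale, codes = q
    return scale * sum(c * x for c, x in zip(codes, v))

random.seed(0)
db = [[random.gauss(0, 1) for _ in range(64)] for _ in range(100)]
query = [random.gauss(0, 1) for _ in range(64)]

exact = max(range(100),
            key=lambda i: sum(a * b for a, b in zip(db[i], query)))
quantized = max(range(100),
                key=lambda i: dot_q(quantize(db[i]), query))
# With 4-bit codes the quantized top-1 usually matches the exact search.
```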

Cloud API Server

OpenAI-compatible /v1/chat/completions. Docker deployment with GPU passthrough. Auth + rate limiting built in.
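Because the endpoint follows the OpenAI chat-completions convention, any OpenAI-compatible client works. A minimal request sketch (the host, port, and API key are assumptions; substitute your deployment's values):

```python
import json
from urllib import request

# Assumed local deployment; adjust host/port for your server.
payload = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [
        {"role": "user", "content": "Summarize TurboQuant in one line."}
    ],
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed address
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",  # placeholder token
    },
)
# resp = request.urlopen(req)  # uncomment with a running server
```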

Compression Modes

Validated against the paper's theoretical bounds on real hardware.

| Mode | Bits/Value | Compression | 70B KV @ 128k | Best For |
|---|---|---|---|---|
| turbo3 | 3.25 bpv | 4.9x | ~4 GB | Maximum context on limited VRAM |
| turbo4 | 4.25 bpv | 3.8x | ~5.3 GB | Higher quality, ample memory |
| FP16 baseline | 16 bpv | 1x | ~20 GB | Reference (no compression) |
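The ratios follow directly from bits per value: compression is 16 bits divided by the mode's bpv, and cache size is the FP16 baseline divided by that ratio. Checking the arithmetic against the table:

```python
FP16_BITS = 16.0
baseline_gb = 20.0  # FP16 KV cache for 70B at 128k, per the table

for mode, bpv in [("turbo3", 3.25), ("turbo4", 4.25)]:
    ratio = FP16_BITS / bpv
    print(f"{mode}: {ratio:.2f}x -> {baseline_gb / ratio:.1f} GB")
# turbo3: 16 / 3.25 ≈ 4.92x -> ~4.1 GB
# turbo4: 16 / 4.25 ≈ 3.76x -> ~5.3 GB
```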

Multi-Agent Swarms

Specialist agents share a single compressed KV cache. Agent B attends to context generated by Agent A without re-encoding. TurboQuant's inner-product fidelity ensures attention accuracy across agent boundaries.

from turboagent.agents.swarm import (
    TurboSwarm, SwarmAgent
)

swarm = TurboSwarm(
    "meta-llama/Llama-3.1-70B-Instruct",
    agents=[
        SwarmAgent(name="researcher", role="research"),
        SwarmAgent(name="critic", role="review"),
        SwarmAgent(name="writer", role="writing"),
    ],
)

results = swarm.run("Analyze KV cache compression.")

Validated at Scale

Tested on real hardware with production models.

| Model | GPU | VRAM | Compression | Multi-Turn Recall |
|---|---|---|---|---|
| Qwen2.5-32B | RTX PRO 6000 | 96 GB | 5.28x | Recalls facts after compress/decompress |
| Qwen2-0.5B | RTX 4070 | 8 GB | 4.9x | Bridge verified (quality limited by model size) |

TurboAgent Enterprise

SSO, audit logging, compliance exports (SOC-2, GDPR), governance policies, multi-node KV sharing, priority kernels, dedicated support.

The open-source core is free forever under the MIT license.

Contact Sales