Run 70B models with 128k context on a single GPU

TurboAgent brings Google Research's TurboQuant KV-cache compression to open-source LLMs. 4.9x memory reduction, zero accuracy loss, one-line setup.

pip install turboagent-ai[llama,native]
4.9x KV cache compression · 0 accuracy loss · 128k+ effective context · 93 tests passing

One-line agent creation

Auto-detects your hardware, selects the optimal backend, configures TurboQuant compression, and manages multi-turn memory. Just import and run.

KV cache for 70B at 128k context: ~4 GB instead of ~20 GB.

from turboagent import TurboAgent

agent = TurboAgent(
    "meta-llama/Llama-3.1-70B-Instruct",
    kv_mode="turbo3",
    context=131072,
)

response = agent.run(
    "Analyze my 50k-token research doc..."
)
# KV cache: 4 GB (was 20 GB)

Features

Everything you need for long-context agentic AI on consumer hardware.

TurboQuant Compression

PolarQuant + QJL algorithms from the paper. Precomputed Lloyd-Max codebooks. Bit-packed storage for true 4.9x compression.
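Bit-packed storage is what turns small quantization codes into real memory savings: codes are stored back-to-back at the bit level instead of one per byte. A minimal sketch of packing two 4-bit codes per byte (illustrative only; TurboAgent's actual layout and its 3.25-bpv packing are more involved):

```python
def pack4(codes):
    """Pack a list of 4-bit codes (0-15) two per byte."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes((codes[i] << 4) | codes[i + 1]
                 for i in range(0, len(codes), 2))

def unpack4(packed, n):
    """Recover the first n 4-bit codes from packed bytes."""
    out = []
    for b in packed:
        out.append(b >> 4)
        out.append(b & 0x0F)
    return out[:n]

codes = [3, 15, 0, 7, 9]
packed = pack4(codes)
assert unpack4(packed, len(codes)) == codes
assert len(packed) == 3  # 5 codes fit in 3 bytes instead of 5
```

Compared with storing one code per byte, this alone halves the footprint; fractional bit widths push the ratio further.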

Hardware Auto-Tuning

Detects CUDA, ROCm, Metal, CPU. Selects backend, layer offloading, context size, and compression mode automatically.
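The selection logic can be pictured as a simple priority check. This is an illustrative sketch, not TurboAgent's actual detection code (which also weighs VRAM, layer offloading, and context size):

```python
import platform
import shutil

def pick_backend():
    """Illustrative backend choice: prefer CUDA, then ROCm,
    then Metal on Apple Silicon, else fall back to CPU."""
    if shutil.which("nvidia-smi"):
        return "cuda"
    if shutil.which("rocm-smi"):
        return "rocm"
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "metal"
    return "cpu"

print(pick_backend())
```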

Multiple Backends

llama.cpp for consumer GPUs, vLLM for server throughput, PyTorch for research. One unified API.

Multi-Agent Swarms

Shared compressed KV pool across agents. Round-robin or custom routing. 5x less memory than independent contexts.
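Round-robin routing simply hands each incoming task to the next agent in a fixed rotation. A sketch with hypothetical agent names (TurboSwarm's real router API may differ):

```python
from itertools import cycle

# Hypothetical agent names matching the swarm example below.
agents = ["researcher", "critic", "writer"]
route = cycle(agents)

tasks = ["gather sources", "check claims", "draft summary", "gather more"]
assignments = [(task, next(route)) for task in tasks]
# Each task goes to the next agent in turn, wrapping around after "writer".
```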

RAG Vector Store

Inner-product-preserving search using TurboQuant codebooks. Superior recall vs. Product Quantization.
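The key idea behind inner-product-preserving search is that similarity scores can be computed directly against quantized codes, without decompressing the database. TurboQuant uses Lloyd-Max codebooks; the sketch below substitutes a plain uniform scalar quantizer just to show the mechanic:

```python
import random

def quantize(v, bits=4):
    """Uniform scalar quantization (a stand-in for TurboQuant's
    Lloyd-Max codebooks): store a scale plus small integer codes."""
    half = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in v) or 1.0
    codes = [round(x / scale * half) for x in v]
    return scale / half, codes

def dot_q(q, v):
    """Inner product computed directly on quantized codes."""
    scale, codes = q
    return scale * sum(c * x for c, x in zip(codes, v))

random.seed(0)
db = [[random.gauss(0, 1) for _ in range(64)] for _ in range(100)]
query = [random.gauss(0, 1) for _ in range(64)]

exact = max(range(100),
            key=lambda i: sum(a * b for a, b in zip(db[i], query)))
quantized = max(range(100),
                key=lambda i: dot_q(quantize(db[i]), query))
# With 4-bit codes the quantized top-1 usually matches the exact search.
```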

Cloud API Server

OpenAI-compatible /v1/chat/completions. Docker deployment with GPU passthrough. Auth + rate limiting built in.
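Because the endpoint follows the OpenAI chat-completions convention, any OpenAI-compatible client works. A minimal request sketch (the host, port, and API key are assumptions; substitute your deployment's values):

```python
import json
from urllib import request

# Assumed local deployment; adjust host/port for your server.
payload = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [
        {"role": "user", "content": "Summarize TurboQuant in one line."}
    ],
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed address
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",  # placeholder token
    },
)
# resp = request.urlopen(req)  # uncomment with a running server
```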

Compression Modes

Validated against the paper's theoretical bounds on real hardware.

| Mode | Bits/Value | Compression | 70B KV @ 128k | Best For |
|---|---|---|---|---|
| turbo3 | 3.25 bpv | 4.9x | ~4 GB | Maximum context on limited VRAM |
| turbo4 | 4.25 bpv | 3.8x | ~5.3 GB | Higher quality, ample memory |
| FP16 baseline | 16 bpv | 1x | ~20 GB | Reference (no compression) |
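The ratios follow directly from bits per value: compression is 16 bits divided by the mode's bpv, and cache size is the FP16 baseline divided by that ratio. Checking the arithmetic against the table:

```python
FP16_BITS = 16.0
baseline_gb = 20.0  # FP16 KV cache for 70B at 128k, per the table

for mode, bpv in [("turbo3", 3.25), ("turbo4", 4.25)]:
    ratio = FP16_BITS / bpv
    print(f"{mode}: {ratio:.2f}x -> {baseline_gb / ratio:.1f} GB")
# turbo3: 16 / 3.25 ≈ 4.92x -> ~4.1 GB
# turbo4: 16 / 4.25 ≈ 3.76x -> ~5.3 GB
```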

Multi-Agent Swarms

Specialist agents share a single compressed KV cache. Agent B attends to context generated by Agent A without re-encoding. TurboQuant's inner-product fidelity ensures attention accuracy across agent boundaries.

from turboagent.agents.swarm import (
    TurboSwarm, SwarmAgent
)

swarm = TurboSwarm(
    "meta-llama/Llama-3.1-70B-Instruct",
    agents=[
        SwarmAgent(name="researcher", role="research"),
        SwarmAgent(name="critic", role="review"),
        SwarmAgent(name="writer", role="writing"),
    ],
)

results = swarm.run("Analyze KV cache compression.")

Validated at Scale

Tested on real hardware with production models.

| Model | GPU | VRAM | Compression | Multi-Turn Recall |
|---|---|---|---|---|
| Qwen2.5-32B | RTX PRO 6000 | 96 GB | 5.28x | Recalls facts after compress/decompress |
| Qwen2-0.5B | RTX 4070 | 8 GB | 4.9x | Bridge verified (quality limited by model size) |

TurboAgent Enterprise

SSO, audit logging, compliance exports (SOC-2, GDPR), governance policies, multi-node KV sharing, priority kernels, dedicated support.

The open-source core is free forever under the MIT license.

Contact Sales