
TurboQuant KV Cache — Running 128B Models on Consumer Hardware

Code: github.com/KellerKev/turboquant-plus-local — Pre-built conda packages for llama.cpp with TurboQuant KV cache compression. No compilation required.

The memory wall for local LLMs isn’t the model weights — it’s the KV cache. Weight quantization (4-bit, 8-bit) is mature and well-understood. But as context windows grow from 8K to 32K to 128K tokens, the KV cache grows linearly and quickly dominates memory usage, especially during generation with long histories.

TurboQuant applies the same insight that made vector compression powerful for embeddings (covered in a previous post) to the KV cache: compress the keys and values produced at each layer during inference, not just the weights. The result is a 3.8–5.1x reduction in cache memory with negligible quality loss.

This project packages that capability into a single conda install for macOS and Linux.


The KV Cache Problem

When a transformer generates a token, it computes attention over every previous token. The keys and values from those attention computations get stored in the KV cache so they don’t need to be recomputed. This cache grows with every token generated.
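
A minimal single-head sketch of that loop (plain NumPy, purely illustrative, not TurboQuant's implementation): each decode step appends one key row and one value row to the cache, then attends over everything cached so far.

# Illustrative single-head decode step with a growing KV cache (not TurboQuant code)
import numpy as np

d = 128                       # head dimension (assumed for illustration)
k_cache = np.empty((0, d))    # one row appended per generated token
v_cache = np.empty((0, d))

def decode_step(x, Wq, Wk, Wv):
    global k_cache, v_cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache = np.vstack([k_cache, k])      # stored so it is never recomputed
    v_cache = np.vstack([v_cache, v])
    scores = k_cache @ q / np.sqrt(d)      # attention over all cached tokens
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_cache                     # context vector for this token

rng = np.random.default_rng(0)
Wq = Wk = Wv = rng.standard_normal((d, d)) / np.sqrt(d)
for _ in range(4):                         # generate 4 tokens; cache grows to 4 rows
    out = decode_step(rng.standard_normal(d), Wq, Wk, Wv)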

For a 32K token context with a modern 26B model, the KV cache alone can consume several gigabytes — separate from the model weights entirely. Extend to 128K tokens or scale to a 128B model and you either run out of memory or are forced to truncate context.
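
The arithmetic behind that claim, with hypothetical architecture numbers chosen to be in the ballpark of a 26B-class model (the layer count, KV head count, and head dimension below are assumptions, not measured values):

# Back-of-envelope KV cache size; all architecture numbers are assumptions
layers, kv_heads, head_dim = 46, 16, 128   # hypothetical 26B-class configuration
ctx, bytes_per_value = 32_768, 2           # 32K context, f16 storage

cache_bytes = 2 * layers * kv_heads * head_dim * ctx * bytes_per_value  # 2x: keys and values
print(f"f16 cache:    {cache_bytes / 2**30:.1f} GiB")        # ~11.5 GiB
print(f"turbo4 cache: {cache_bytes / 2**30 / 5.1:.1f} GiB")  # ~2.3 GiB at the 5.1x ratio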

The standard workaround is aggressive model quantization (3-bit, 2-bit), which hurts quality. TurboQuant’s approach leaves the model weights at whatever precision you prefer and instead compresses the cache itself, where the information structure allows for more aggressive compression without perceptible quality degradation.


Benchmark Results

Running on a Mac Studio M4 Max with the turbo4 compression mode:

Model                      KV Compression   Notes
Gemma 4 26B                5.1x             Reliable across architectures
Mistral Medium 3.5 128B    3.8x             Minimal performance degradation
Llama 3.x                  Supported        turbo4 + turbo3 both work
Qwen 3.5                   Supported        turbo4 + turbo3 both work

The practical effect: models that previously required truncating to 8K tokens can now run comfortably at 32K+. 128B-parameter models that wouldn’t fit at all become viable on consumer hardware.

Two compression modes are available:

  • turbo4 — 4-bit KV compression. Works reliably across all tested architectures. Recommended default.
  • turbo3 — 3-bit KV compression. Higher compression ratio, but produces corrupted output on Mistral Medium 3.5. Safe for Llama, Qwen, and Gemma.

Installation

Pre-built packages are distributed through my prefix.dev channel. No Xcode, no CUDA toolkit, no compilation:

pixi add --channel https://repo.prefix.dev/turboquant-plus-local llama-cpp-turboquant

Or install directly with conda/mamba if you’re not using pixi:

conda install -c https://repo.prefix.dev/turboquant-plus-local llama-cpp-turboquant

Available platforms:

  • macOS — Apple Silicon (M1/M2/M3/M4) with Metal GPU acceleration
  • Linux x86_64 — CPU-only and NVIDIA CUDA 12+ variants
  • Linux aarch64 — ARM CPU builds

Running the Server

Once installed, start the llama.cpp server with TurboQuant KV compression enabled:

# Start the server with a downloaded GGUF model (example: Gemma 4 26B)
llama-server \
  --model /path/to/model.gguf \
  --cache-type-k turbo4 \
  --cache-type-v turbo4 \
  --ctx-size 32768 \
  --port 8080

The --cache-type-k turbo4 and --cache-type-v turbo4 flags activate TurboQuant compression on the key and value caches respectively. You can also mix — for example, turbo4 for keys and f16 for values — but compressing both gives the full benefit.

For agentic workflows where system prompts are large (opencode, Claude Code, custom agent frameworks), 32,768 tokens is a good starting context size. It accommodates most system prompt + conversation history combinations without hitting the cache memory wall.


Integration with AI Coding Agents

The server exposes an OpenAI-compatible API, so it drops directly into any client that supports a custom base URL. For opencode:

{
  "providers": {
    "local": {
      "name": "Local (TurboQuant)",
      "api": "openai",
      "url": "http://localhost:8080/v1",
      "models": {
        "local-model": {
          "name": "local-model",
          "options": {
            "maxTokens": 32768
          }
        }
      }
    }
  }
}

For Claude Code or any tool using an OpenAI-compatible client, point OPENAI_BASE_URL to http://localhost:8080/v1.
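
As a quick smoke test, any OpenAI client library works the same way. A minimal sketch with the official openai Python package (the model name is a placeholder; llama-server responds with whichever model it loaded regardless of the name sent):

# Minimal sketch: query the local server through its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # llama-server does not require an API key by default
)

resp = client.chat.completions.create(
    model="local-model",   # placeholder; the server uses the model it loaded
    messages=[{"role": "user", "content": "Explain the KV cache in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)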

The 32K context is particularly important for agentic use — a typical Claude Code session with a large system prompt, file context, and conversation history can easily consume 15–20K tokens. Without TurboQuant, that leaves little headroom for responses on consumer hardware.


How It’s Built

The project is a packaging layer on top of TheTom’s llama-cpp-turboquant, which integrates TurboQuant compression into the llama.cpp codebase. Rather than maintaining a full fork, it tracks the feature/turboquant-kv-cache branch as a git submodule, so upstream llama.cpp improvements flow through automatically.

The build system uses pixi for reproducible compilation and rattler-build for conda package creation:

[tasks]
configure = "cmake -B build -G Ninja -DGGML_METAL=ON -DGGML_BLAS=ON -DLLAMA_SERVER=ON ..."
build = { cmd = "ninja -C build -j $(nproc)", depends-on = ["configure"] }
package = "rattler-build build --recipe recipes/llama-cpp-turboquant/"
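
With these tasks defined, pixi run build triggers the configure step first through the depends-on chain, and pixi run package invokes rattler-build on the recipe.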

GitHub Actions builds packages across all target platforms on every recipe change and uploads them to prefix.dev. The CUDA build gets its own workflow targeting Linux x86_64 with CUDA 12.6.0.

The platform-specific flags are handled in the build script:

# macOS: Metal + Accelerate BLAS
-DGGML_METAL=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=Accelerate

# Linux with GPU
-DGGML_CUDA=ON

# All platforms
-DGGML_OPENMP=ON -DCMAKE_BUILD_TYPE=Release

Why Conda vs. Docker

Docker is the common suggestion for distributing compiled binaries, but it adds overhead — image size, runtime isolation, volume mounts for model files. For local inference, you want the binary as close to bare metal as possible: direct GPU access, no virtualization layer, minimal latency.

Conda packages with pixi give you the same reproducibility as Docker for dependencies while running natively. On Apple Silicon this means Metal shaders execute directly against the GPU. On Linux the CUDA variant links against the system CUDA runtime without a container boundary.

The tradeoff is that conda packages require a compatible runtime environment (the right OS, the right GPU drivers), whereas Docker can paper over some of those differences. For the target use case — a developer running models locally on their own machine — native execution wins.


Context

My previous TurboQuant post covered the compression algorithm’s application to embedding vectors for agent memory systems. This project applies the related but distinct KV cache compression to the inference loop itself — different problem, same underlying insight about exploiting distributional structure to compress without training.

The two stack together well: TurboQuant embeddings for retrieval-augmented agent memory, TurboQuant KV cache for the inference engine that processes those retrievals. Both run locally, both eliminate the memory bottleneck that previously required cloud infrastructure for serious context lengths.

Source, build instructions, and pre-built packages at github.com/KellerKev/turboquant-plus-local.

Kevin Keller
Personal blog about AI, Observability & Data Sovereignty. Snowflake-related articles explore the art of the possible and are not official Snowflake solutions or endorsed by Snowflake unless explicitly stated. Opinions are my own. Content is meant as educational inspiration, not production guidance.