AI · ~3 min read

Silent Failure: Detecting Token Context Mismatches in Local LLMs

The model claims 24K context but suddenly delivers shorter responses at 9K tokens — without any error message. How to diagnose and permanently fix token context mismatches.

LLM · llama-swap · CUDA · Debugging · AI

The Problem

An LLM with a supposed 24K-token context suddenly starts delivering noticeably shorter, truncated responses beyond ~9K tokens — without any error message. No context_length_exceeded, no warning, just a shorter answer.

This is silent failure: the system fails without communicating it.

How It Happens

With llama-swap (and similar tools like Ollama), the context_length parameter in the config can differ from the value the model actually uses internally. Common sources:

  1. Wrong model card — GGUF files contain metadata that isn’t always accurate
  2. Forgotten config override — the actual value in the llama.cpp backend differs
  3. Quantization artifacts — different quant levels (Q4_K_M vs. Q5_K) have different VRAM footprints, which changes how much context actually fits alongside the weights

Diagnosis

# Watch llama-swap logs during a long request
journalctl -u llama-swap -f

# Or directly from llama.cpp output:
# [INFO] n_ctx = 8192  ← This is the real value
# [INFO] n_ctx_train = 32768  ← This is what the model "knows"

The critical value is n_ctx, not n_ctx_train: n_ctx_train is the context length the model was trained with, while n_ctx is what llama.cpp actually allocates in memory.

# API test: directly against the inference endpoint
curl -s http://localhost:9292/v1/models | jq '.data[].context_length'
# 24576  ← What the API claims

# But in the log:
# n_ctx = 8192  ← What's actually running
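The two readings above can be compared automatically. A minimal sketch, assuming an OpenAI-compatible /v1/models endpoint on port 9292 and a systemd unit named llama-swap; the check_ctx helper itself is hypothetical:

```shell
# check_ctx: warn when the API-advertised context differs from the backend log value.
check_ctx() {
  local api_ctx="$1" log_ctx="$2"
  if [ "$api_ctx" != "$log_ctx" ]; then
    echo "MISMATCH: API claims $api_ctx, backend allocates $log_ctx"
  else
    echo "OK: context = $api_ctx"
  fi
}

# On a live system the two values would come from the commands above
# (endpoint and unit name are assumptions):
#   api_ctx=$(curl -s http://localhost:9292/v1/models | jq '.data[0].context_length')
#   log_ctx=$(journalctl -u llama-swap --no-pager | grep -oE 'n_ctx += +[0-9]+' | tail -1 | grep -oE '[0-9]+')

check_ctx 24576 8192
```

With the values from the example above, this prints the mismatch instead of failing silently.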

The Fix

In the llama-swap.yaml config, set the context size explicitly (--ctx-size) and align it with the actual GPU VRAM:

models:
  qwen3.5-35b-turbo:
    cmd: >
      llama-server
      --model /models/qwen3.5-35b-q4km.gguf
      --ctx-size 8192        # ← Set explicitly, don't take from model card
      --n-gpu-layers 99
      --main-gpu 1
    proxy: "http://localhost:9292"
    ttl: 300

Then validate:

# After restarting the service:
curl -s http://localhost:9292/v1/models | jq '.data[].context_length'
# 8192  ← API now matches reality

VRAM Calculation

Context uses VRAM. Rule of thumb for a 35B Q4_K_M model:

| Ctx Size | VRAM for KV Cache | Remaining for Weights |
|----------|-------------------|-----------------------|
| 8K       | ~1.5 GB           | ~22.5 GB              |
| 16K      | ~3 GB             | ~21 GB                |
| 32K      | ~6 GB             | ~18 GB                |

With 24 GB VRAM, 32K context is theoretically possible for a 35B Q4_K_M model — but only if the weights (~18 GB) fit entirely in VRAM.
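The KV-cache numbers above can be sanity-checked from the model architecture. A rough sketch; the layer count, KV-head count, and head dimension below are assumed values for a hypothetical 35B model with grouped-query attention, not read from any real model card:

```shell
# KV cache bytes = 2 (K and V) * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_element
n_layers=64; n_kv_heads=8; head_dim=128; bytes_per_elem=2   # fp16 cache (assumed dims)
n_ctx=8192
kv_mib=$(( 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / 1024 / 1024 ))
echo "KV cache at ${n_ctx} ctx: ${kv_mib} MiB"
# → KV cache at 8192 ctx: 2048 MiB
```

With these assumed dimensions, the 8K cache lands around 2 GiB, the same order of magnitude as the table. The cache grows linearly with n_ctx, which is why 32K costs roughly four times as much VRAM as 8K.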

Takeaway

Never blindly trust a model card. Always validate n_ctx directly from llama.cpp logs and ensure your API reports the same value. Silent failure at 8K context in a system promising 24K leads to subtly wrong responses — which can be severe in code review or document analysis scenarios.
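One practical guard against silent truncation: estimate prompt size before sending and warn on anything at or above the allocated n_ctx. A sketch using the common ~4-characters-per-token heuristic; fits_ctx is a hypothetical helper and the estimate is approximate, not a real tokenizer:

```shell
# fits_ctx: rough pre-flight check that a prompt file fits the allocated context.
fits_ctx() {
  local file="$1" n_ctx="$2"
  local approx_tokens=$(( $(wc -c < "$file") / 4 ))   # ~4 chars per token heuristic
  if [ "$approx_tokens" -ge "$n_ctx" ]; then
    echo "WARN: ~${approx_tokens} tokens, n_ctx=${n_ctx}: expect silent truncation"
  else
    echo "OK: ~${approx_tokens} tokens fit in n_ctx=${n_ctx}"
  fi
}

# Example: check a long code-review prompt against the validated 8192 context
# fits_ctx review-prompt.txt 8192
```

Wiring a check like this into the request path turns the silent failure into a loud one, which is the whole point.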