The Problem
An LLM advertised with a 24K-token context window suddenly starts delivering noticeably shorter, truncated responses once the input exceeds roughly 9K tokens — and there is no error message. No context_length_exceeded, no warning, just: a shorter answer.
This is silent failure: the system fails without communicating it.
How It Happens
With llama-swap (and similar tools such as Ollama), the context_length parameter in the config can differ from the value the model actually runs with internally. Common sources of the mismatch:
- Wrong model card — GGUF files contain metadata that isn’t always accurate
- Forgotten config override — the actual value in the llama.cpp backend differs
- Quantization footprint — different quant levels (Q4_K_M vs. Q5_K) need different amounts of VRAM for the weights, which changes how much context actually fits
Diagnosis
```shell
# Watch llama-swap logs during a long request
journalctl -u llama-swap -f

# Or directly in the llama.cpp output:
# [INFO] n_ctx       = 8192   ← This is the real value
# [INFO] n_ctx_train = 32768  ← This is what the model "knows"
```
The critical value is n_ctx, not n_ctx_train: n_ctx_train is the context length the model was trained with, while n_ctx is what the server actually allocates in memory.
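This check is easy to automate: grep both values out of the log output and compare n_ctx against what the API advertises. A minimal sketch (the log format follows the example above; the 24576 advertised value is taken from this article's scenario):

```python
import re

def check_ctx(log_text: str, advertised: int) -> dict:
    """Parse n_ctx / n_ctx_train from llama.cpp log output and flag
    a mismatch against the context length the API advertises."""
    values = {}
    for key in ("n_ctx_train", "n_ctx"):
        # "n_ctx" won't falsely match "n_ctx_train" because of the "=" that must follow
        m = re.search(rf"{key}\s*=\s*(\d+)", log_text)
        if m:
            values[key] = int(m.group(1))
    values["mismatch"] = values.get("n_ctx") != advertised
    return values

log = """
[INFO] n_ctx       = 8192
[INFO] n_ctx_train = 32768
"""
print(check_ctx(log, advertised=24576))
# → {'n_ctx_train': 32768, 'n_ctx': 8192, 'mismatch': True}
```

Wiring this into a startup health check turns the silent failure into a loud one.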
```shell
# API test: directly against the inference endpoint
curl -s http://localhost:9292/v1/models | jq '.data[].context_length'
# 24576 ← What the API claims

# But in the log:
# n_ctx = 8192 ← What's actually running
```
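When the server logs aren't accessible, the effective limit can also be found empirically from the client side: probe with growing prompts and binary-search the point where responses start getting cut off. A sketch of the search logic only — the HTTP round trip is abstracted behind a hypothetical `fits(n)` callable (e.g. a function that POSTs an n-token prompt to /v1/completions and reports whether it came back un-truncated):

```python
def effective_ctx(fits, lo: int = 1, hi: int = 32768) -> int:
    """Binary-search the largest prompt size (in tokens) the server
    actually handles; fits(n) returns True if n tokens succeed."""
    if not fits(lo):
        return 0
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the search converges
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

# Example with a fake server silently capped at 8192 tokens:
print(effective_ctx(lambda n: n <= 8192))  # → 8192
```

About 15 probes cover the whole 1–32K range, so the check is cheap enough to run after every config change.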
The Fix
In the llama-swap.yaml config, set --ctx-size explicitly and align it with the actually available GPU VRAM:
```yaml
models:
  qwen3.5-35b-turbo:
    # Set --ctx-size explicitly, don't take it from the model card
    cmd: >
      llama-server
      --model /models/qwen3.5-35b-q4km.gguf
      --ctx-size 8192
      --n-gpu-layers 99
      --main-gpu 1
    proxy: "http://localhost:9292"
    ttl: 300
```
Then validate:
```shell
# After restarting the service:
curl -s http://localhost:9292/v1/models | jq '.data[].context_length'
# 8192 ← API now matches reality
```
VRAM Calculation
Context uses VRAM. Rule of thumb for a 35B Q4_K_M model on a 24 GB card:
| Ctx Size | VRAM for KV-Cache | Remaining for Weights |
|---|---|---|
| 8K | ~1.5 GB | ~22.5 GB |
| 16K | ~3 GB | ~21 GB |
| 32K | ~6 GB | ~18 GB |
With 24 GB VRAM, 32K context is theoretically possible for a 35B Q4_K_M model — but only if the weights (~18 GB) fit entirely in VRAM.
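The rule of thumb follows from the standard KV-cache formula: 2 tensors (K and V) × layers × context × KV heads × head dimension × bytes per element. A sketch with assumed architecture numbers — 48 layers, 8 KV heads via GQA, head dim 128, fp16 cache — which are illustrative defaults, not the real dimensions of any particular 35B model:

```python
def kv_cache_gib(n_ctx: int, n_layers: int = 48, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV-cache VRAM in GiB: one K and one V tensor per layer,
    each of shape (n_ctx, n_kv_heads, head_dim) at fp16."""
    total_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
    return total_bytes / 2**30

for ctx in (8192, 16384, 32768):
    print(f"{ctx:>5} tokens → {kv_cache_gib(ctx):.1f} GiB")
```

With these assumptions the output reproduces the table (1.5 / 3.0 / 6.0 GiB); plug in the real layer count and KV-head count of your model to get exact numbers, and note that a KV cache quantized to q8_0 roughly halves them.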
Takeaway
Never blindly trust a model card. Always validate n_ctx directly in the llama.cpp logs and make sure the API reports the same value. A system that promises 24K but silently runs at 8K context produces subtly wrong responses, which can be severe in code review or document analysis scenarios.