July 3, 2026

What vLLM actually does and why your GPU is wasting memory without it

An open-source inference engine that enables fast, memory-efficient serving of Large Language Models at scale

You have an AI model running on a GPU. Users send it questions, it sends back answers. Everything works.

Until you check the numbers. Your 80GB A100 is serving 8 concurrent users. Your GPU utilization hovers at 30%. Your cloud bill says $30,000/month. And your PM asks why the response times are climbing.

The model is not the problem. The model is doing its job fine. The problem is everything around the model: how you load it into memory, how you manage the conversations it holds in its head, and how you schedule who gets to talk to it next. The serving layer is the bottleneck.

That is where vLLM comes in.

What is vLLM#

vLLM is an open-source engine that sits between your AI model and the users talking to it. It handles the plumbing: loading the model onto the GPU, managing memory as conversations happen, scheduling multiple users at once, and streaming answers back. It was built by researchers at UC Berkeley and is now maintained by a large community. The name stands for "Virtual Large Language Model" because its core trick borrows from how operating systems manage virtual memory.

You do not change your model to use vLLM. You point vLLM at a model (from Hugging Face, for example), and it runs it for you. Your existing model, better infrastructure.

The problem vLLM solves#

To understand why vLLM matters, you need to understand one thing about how AI models generate text: they do it one word at a time.

When you ask an LLM "what is the capital of France?", it does not instantly know the full answer. It predicts one token (roughly one word) at a time. First "The", then "capital", then "of", then "France", then "is", then "Paris", then a period. Each prediction step uses the GPU.

Here is the catch. To predict the next token, the model needs to remember everything it has already said and everything you asked. It stores this memory in a structure called the KV cache (Key-Value cache). Think of it as the model's short-term memory for the current conversation.

ℹ️ Note

This KV cache is where the trouble starts.
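To make the loop concrete, here is a toy sketch. It is not a real model: `fake_model` and `ANSWER` are made-up stand-ins that replay a canned reply, and the "cache" is just a Python list. What it does show is the shape of the process: one step per token, and one more cache entry after every step.

```python
from typing import List, Tuple

# Hypothetical stand-in for a transformer: it "predicts" the next token by
# replaying a canned answer. A real model would attend over the cached entries.
ANSWER = ["The", "capital", "of", "France", "is", "Paris", "."]


def fake_model(kv_cache: List[Tuple[str, str]]) -> str:
    """Return the next token given everything cached so far."""
    generated = sum(1 for kind, _ in kv_cache if kind == "output")
    return ANSWER[generated] if generated < len(ANSWER) else "<eos>"


def generate(prompt_tokens: List[str]) -> Tuple[List[str], int]:
    # Prefill: every prompt token gets one cache entry up front.
    kv_cache = [("prompt", tok) for tok in prompt_tokens]
    output = []
    while True:
        token = fake_model(kv_cache)        # one model step per generated token
        if token == "<eos>":                # stop token ends the reply
            break
        output.append(token)
        kv_cache.append(("output", token))  # the cache grows by one entry per step
    return output, len(kv_cache)


tokens, cache_entries = generate(["what", "is", "the", "capital", "of", "France", "?"])
print(" ".join(tokens))
print(cache_entries)  # 7 prompt entries + 7 generated entries = 14
```

In a real model each cache entry is a pair of tensors per layer, which is why the cache takes up so much GPU memory.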

The memory waste problem#

In a standard setup, when a user sends a request, the system reserves GPU memory for the maximum possible conversation length. If your model supports conversations up to 4096 tokens long, the system grabs enough memory for all 4096 tokens right away, even if the user only typed "hi."

Most conversations are short. The average prompt-plus-response in production is typically 200-500 tokens, not 4096. So the system reserves space for 4096 tokens but uses only 200. The memory for the remaining 3,896 tokens sits completely empty. Multiply that by every user connected at the same time, and 60-80% of the GPU memory set aside for conversations is wasted.

That is expensive memory to waste. An NVIDIA A100 GPU costs $1-2 per hour on cloud providers. If 70% of its memory is sitting empty because of how the serving code allocates it, you are effectively burning money.

The idle GPU problem#

The second problem is about compute, not memory.

Traditional serving systems process requests in fixed groups called "batches." They take 8 requests, process them together, and wait for all 8 to finish before starting the next group. If 7 of those requests finish in 50 tokens but the 8th takes 500 tokens, the GPU slots for those 7 users sit idle for the remaining 450 steps while the batch waits for that one slow request.

⚠️ Warning

The GPU is literally doing nothing for 7 out of 8 slots. That is a lot of wasted compute.
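To put a number on the example above, here is a back-of-the-envelope calculation. It assumes one decode step per generated token and counts "slot-steps" (one slot for one step); the function name is made up for illustration.

```python
def static_batch_utilization(output_lengths):
    """Fraction of slot-steps doing useful work when the whole batch
    has to wait for its longest request."""
    steps = max(output_lengths)                     # batch runs until the slowest finishes
    total_slot_steps = steps * len(output_lengths)  # every slot is held for every step
    useful_slot_steps = sum(output_lengths)         # steps that actually produced a token
    return useful_slot_steps / total_slot_steps


lengths = [50] * 7 + [500]  # the 8-request batch from the example
print(f"useful work: {static_batch_utilization(lengths):.0%}")
```

With these lengths, 850 of the 4,000 slot-steps produce a token, so roughly one fifth of the batch's capacity is doing useful work.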

How vLLM fixes these two problems#

vLLM has two core ideas. They are simple to explain but hard to build correctly.

1. Paged memory (PagedAttention)#

Instead of reserving one big chunk of memory per user, vLLM breaks GPU memory into small, fixed-size blocks (like pages in a notebook). When a user starts a conversation, vLLM gives them one block. When that block fills up, it grabs another block from wherever there is free space. When the conversation ends, the blocks go back to the free pool.

This is the same idea your computer's operating system uses to manage RAM. Your laptop does not give each application a single contiguous slab of memory. It gives them pages, scattered wherever there is space. vLLM does the same thing for GPU memory.

The result: almost zero wasted memory. The only slack left is the unfilled part of each conversation's last block. The system uses what it actually needs, block by block, and reclaims it the moment a conversation ends.
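Here is a minimal sketch of that bookkeeping. It is a toy, not vLLM's internal API: the class name, method names, and the 16-token block size are assumptions made for illustration, and the "blocks" are just integer ids rather than GPU memory.

```python
class BlockAllocator:
    """Toy paged KV-cache allocator (illustrative names, not vLLM internals)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # free pool of block ids
        self.block_tables = {}                      # request id -> list of block ids
        self.num_tokens = {}                        # request id -> tokens stored

    def append_token(self, request_id: str) -> None:
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count == len(table) * self.block_size:   # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())    # grab any free block, wherever it is
        self.num_tokens[request_id] = count + 1

    def free(self, request_id: str) -> None:
        # Conversation ended: all of its blocks go back to the free pool.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


pool = BlockAllocator(num_blocks=8)
for _ in range(20):
    pool.append_token("alice")   # 20 tokens -> 2 blocks of 16
pool.append_token("bob")         # 1 token   -> 1 block
print(len(pool.free_blocks))     # 5
pool.free("alice")
print(len(pool.free_blocks))     # 7
```

Notice that a request never reserves more than one partially filled block, no matter how long the model's maximum context is.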

Figure: Traditional LLM vs vLLM memory allocation

In the traditional approach, each user gets a fixed-size reservation with most of it empty. In the paged approach, each user only holds the blocks they actually need, and the rest stays available for other users.

2. Continuous batching#

Instead of waiting for an entire batch of requests to finish, vLLM checks after every single token generation step. If a user's response is complete (the model produced a stop token), vLLM immediately pulls that user out and slots in the next waiting request. The GPU never sits idle.

Think of it like a checkout lane at a grocery store. Traditional batching is like closing the lane until the slowest customer finishes, then opening it for a new group. Continuous batching is keeping the lane open: the moment one customer finishes, the next person steps up.
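The difference is easy to see in a small simulation. This is a simplified model, not vLLM's scheduler: it assumes every active request produces exactly one token per step, ignores prompt processing and memory limits, and both function names are made up.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Total decode steps when each batch waits for its slowest request."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """Total decode steps when a finished request is replaced immediately."""
    waiting = deque(lengths)  # output length of each queued request
    running = []              # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; finished ones leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


# Two groups of 8 requests: seven short replies and one long one in each.
lengths = ([50] * 7 + [500]) * 2
print(static_batching_steps(lengths, batch_size=8))      # 1000
print(continuous_batching_steps(lengths, batch_size=8))  # 600
```

In this toy run the second long request starts at step 101 instead of step 501, because it takes over a slot that a short request has already vacated.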

Figure: Continuous batching

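These two ideas together, paged memory and continuous batching, let vLLM serve more concurrent users on the same GPU. The vLLM paper reports 2-4x higher throughput than earlier serving systems such as FasterTransformer and Orca at the same latency, and the gap over a plain Hugging Face Transformers pipeline is larger still.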

Figure: LLM inference bottleneck

What you get out of the box#

vLLM is not just the two optimizations above. It comes with a full serving stack:

An OpenAI-compatible API - You start vLLM with one command and it gives you an HTTP server that speaks the same API format as OpenAI's GPT models. If your application already calls the OpenAI API, you can point it at your vLLM server by changing the base URL. No code rewrite.

# Start serving a model
vllm serve meta-llama/Llama-3.1-8B-Instruct

# Your app just changes the base URL
# from: https://api.openai.com/v1
# to:   http://localhost:8000/v1

200+ model architectures - Llama (Meta), Qwen (Alibaba), Gemma (Google), Mistral, DeepSeek, and most other popular models published on Hugging Face in the standard Transformers format. You do not need to write model-specific serving code.

Streaming - Responses stream token by token over Server-Sent Events, so users see words appearing as they are generated rather than waiting for the entire response.

Multi-GPU support - If your model does not fit on one GPU, vLLM can split it across multiple GPUs automatically. You set --tensor-parallel-size 4 and it shards the model across 4 GPUs.

Quantization support - vLLM can run models in reduced precision (FP8, INT8, 4-bit) to shrink memory usage. A model that normally needs 2 GPUs might fit on 1 with quantization, in exchange for a small loss in accuracy.
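For the streaming item above, this is roughly what a client does with the event stream. The sketch assumes the OpenAI-style chunk format, where each event is a `data:` line carrying a JSON chunk and the stream ends with `data: [DONE]`. The `sample` lines are hand-written for illustration, not captured from a real server, and `iter_stream_text` is a made-up helper name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming chat completion lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                     # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":          # sentinel that marks the end of the stream
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:                     # the first chunk may carry only the role
                yield text


# Hand-written sample of what the stream looks like on the wire.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    '',
    'data: {"choices": [{"delta": {"content": "Par"}}]}',
    'data: {"choices": [{"delta": {"content": "is"}}]}',
    'data: [DONE]',
]
print("".join(iter_stream_text(sample)))  # Paris
```

A real client would feed this function the lines of the HTTP response as they arrive and print each fragment immediately.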

Where vLLM fits in the stack#

If you are building an AI application, here is where vLLM sits relative to everything else:

Figure: Where vLLM sits

Your application sends requests to vLLM's API. vLLM handles everything between the HTTP request and the GPU: scheduling, memory management, batching, streaming. You do not manage any of that yourself.

vLLM is not a model. It is not a training framework. It is not an orchestration tool. It is the serving engine that runs a model efficiently at inference time.

The bottlenecks you will hit#

vLLM fixes the two biggest problems (memory waste and GPU idling), but it does not make every problem disappear. Here is what you will run into.

Cold starts are slow#

Loading a large model from disk to GPU takes time. A 70B parameter model can take 60-120 seconds to load, even from a fast SSD. If you are running on Kubernetes and a new pod starts from scratch, you are looking at 3-5 minutes before the first request is served (container pull + model download + model load).

This means scaling up when traffic spikes is not instant. If you need to react to sudden load, you need pre-warmed replicas sitting ready.
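One practical consequence: whatever sends traffic to a new replica should wait until the model has finished loading. A minimal sketch of that wait loop follows. It assumes the server exposes a `/health` endpoint that returns HTTP 200 once the engine is up (verify this against the version you run); the function names, timeout, and polling interval are illustrative choices.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_probe(base_url: str) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # not listening yet, or still loading


def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 600.0,
                     interval_s: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() + interval_s > deadline:
            return False  # the next attempt would land past the deadline
        sleep(interval_s)


# Example (needs a server starting up on this address):
# ready = wait_until_ready(lambda: http_health_probe("http://localhost:8000"))
```

On Kubernetes the same idea is usually expressed as a readiness probe, so the pod receives no traffic until the check passes.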

Single-request latency is not vLLM's strength#

vLLM is built for throughput: serving many users at once. If you are the only user sending one request at a time, vLLM's scheduler, block manager, and engine loop add overhead that a simpler tool would not. For single-user, low-traffic scenarios, something like llama.cpp or Ollama will feel snappier.

Long conversations eat memory fast#

Even with paged memory, the KV cache for a single conversation scales linearly with its length. A 128k-token context window on a 13B model can consume 20+ GB of GPU memory for one user. If your use case involves long documents or multi-turn conversations that accumulate context, you will run out of GPU memory faster than you expect.
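The arithmetic behind that is short. For every token, the cache stores one key vector and one value vector per layer and per KV head. The model shape below is a hypothetical 13B-class configuration chosen for illustration (40 layers, head dimension 128, FP16 values), not the spec of any particular model.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one sequence: a key and a value vector
    per token, per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


GIB = 1024 ** 3
context = 128 * 1024  # 128k tokens

# Hypothetical 13B-class shape, with and without grouped-query attention.
full_heads = kv_cache_bytes(context, num_layers=40, num_kv_heads=40, head_dim=128)
grouped = kv_cache_bytes(context, num_layers=40, num_kv_heads=8, head_dim=128)
print(full_heads / GIB, grouped / GIB)  # 100.0 20.0
```

Under these assumptions, a model that shares KV heads (grouped-query attention) needs about 20 GiB for one full-length conversation, and one with a separate KV head per attention head needs about 100 GiB. Either way, the cost grows in a straight line with the number of tokens.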

Version updates can break things#

vLLM moves fast. New releases ship frequently, and they sometimes change default behaviors or break compatibility with specific model formats. Teams that run vLLM in production typically pin to a specific version and test upgrades in staging before rolling them out. Treat it like upgrading a database driver, not like bumping a patch version of a utility library.

GPU memory is still the ceiling#

vLLM squeezes more users into the same GPU, but it cannot create more memory than the GPU physically has. If your model weights take up 70% of the GPU and the remaining 30% is the KV cache pool, that pool is your hard limit for concurrent users. Quantization (running the model in lower precision) is the main lever to free up more space.
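You can estimate that ceiling before you deploy. The sketch below uses the 70/30 split from the paragraph above on an 80 GiB GPU and assumes 160 KiB of KV cache per token (a hypothetical 40-layer model with 8 KV heads of dimension 128 in FP16). It ignores activation memory and other runtime overhead, so treat the result as an upper bound.

```python
GIB = 1024 ** 3


def max_concurrent_sequences(gpu_bytes, weight_bytes, kv_bytes_per_token, avg_tokens):
    """Upper bound on sequences that fit in the KV cache pool at once."""
    kv_pool_bytes = gpu_bytes - weight_bytes        # what is left after the weights
    bytes_per_sequence = kv_bytes_per_token * avg_tokens
    return kv_pool_bytes // bytes_per_sequence


gpu = 80 * GIB
weights = 56 * GIB        # 70% of the GPU, leaving a 24 GiB pool
per_token = 160 * 1024    # assumed KV bytes per token (see lead-in)

print(max_concurrent_sequences(gpu, weights, per_token, avg_tokens=500))   # 314
print(max_concurrent_sequences(gpu, weights, per_token, avg_tokens=4000))  # 39
```

The same pool that holds a few hundred short chats holds only a few dozen long ones, which is why average conversation length matters as much as GPU size.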

When to use vLLM (and when not to)#

Your situation	Use vLLM?	Better alternative
Serving an LLM to multiple users over an API	Yes	-
Running a model on your laptop for personal use	No	Ollama or llama.cpp
CPU-only server, no GPU available	No	llama.cpp with GGUF
Need to fine-tune or train a model	No	PyTorch, Hugging Face Trainer
Building a RAG pipeline that calls an LLM	Yes (as the LLM serving layer)	-
Running inference on edge devices or phones	No	llama.cpp, MLC-LLM
Extreme low-latency single-user chat	Probably not	TensorRT-LLM or llama.cpp

The short version: if you have a GPU, multiple users, and need an LLM API, vLLM is the default choice in mid-2026. If you are doing something else, there are better fits.

Getting started#

If you want to try vLLM on a machine with a GPU:

# Install
pip install vllm

# Serve a model (this will download it from Hugging Face)
vllm serve meta-llama/Llama-3.1-8B-Instruct

# In another terminal, send a request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is vLLM?"}]
  }'

That is it. You now have an OpenAI-compatible API serving Llama on your own hardware.
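If you would rather call it from code than from curl, here is the same request as a small Python sketch using only the standard library. The helper names `build_chat_request` and `ask` are made up; the URL, model name, and request body mirror the curl command above. In a real application you would more likely use an OpenAI client library and just change its base URL.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request the curl command sends."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=60) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Example (needs the server from the commands above to be running):
# print(ask("http://localhost:8000/v1", "meta-llama/Llama-3.1-8B-Instruct", "What is vLLM?"))
```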

If you want to understand what is happening under the hood (the block tables, the scheduler queues, the kernel internals), that is a separate conversation. I wrote a deep dive on the internal architecture: Why vLLM can serve 10x more users on the same GPU.

The metrics that tell you if vLLM is healthy#

Once vLLM is running in production, it exposes Prometheus metrics at /metrics. These are the ones that matter:

Metric	What it tells you	When to worry
`vllm:num_requests_waiting`	How many users are queued	Sustained > 0 means you need more capacity
`vllm:gpu_cache_usage_perc`	How full the KV cache is	Above 90% means users may start queuing
`vllm:num_requests_running`	How many requests are actively generating	If this stays at 1 while requests are waiting, batching is not working
`vllm:time_to_first_token_seconds` (TTFT)	How long users wait before seeing the first word	Spikes here usually mean queue buildup or cache pressure

If `vllm:gpu_cache_usage_perc` is constantly near 100% and `vllm:num_requests_waiting` is climbing, you need another GPU, a smaller model, or quantization to free up memory.
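That rule is simple enough to script. The sketch below parses the plain-text output of `/metrics` and applies the check. It assumes one series per metric (a single model per server), no timestamps on the sample lines, and that the cache gauge is reported as a fraction where 1.0 means 100%. The `sample` text is hand-written for illustration, and the function names and the 0.9 threshold are made up.

```python
def parse_metrics(text: str) -> dict:
    """Map metric name -> value from Prometheus text output.
    Assumes one series per metric and no trailing timestamps."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                                  # skip HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]       # drop the {label="..."} part
        values[name] = float(value)
    return values


def needs_more_capacity(metrics: dict, cache_threshold: float = 0.9) -> bool:
    """True when requests are queuing and the KV cache is nearly full."""
    waiting = metrics.get("vllm:num_requests_waiting", 0.0)
    cache = metrics.get("vllm:gpu_cache_usage_perc", 0.0)
    return waiting > 0 and cache >= cache_threshold


# Hand-written sample, shaped like the /metrics output.
sample = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 12.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.97
"""
print(needs_more_capacity(parse_metrics(sample)))  # True
```

In production you would normally express the same condition as an alert rule over a time window rather than a single scrape, so a brief spike does not page anyone.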

What I would tell someone starting out#

If I were sitting across from someone who just asked "should I use vLLM?", I would say:

If you are serving an LLM to more than one person at a time on a GPU, yes. It is the industry standard for a reason, and that reason is straightforward: it wastes less memory and keeps the GPU busier than the alternatives. The setup is three commands.

But understand what it is and what it is not. It is a serving engine. It makes your existing model run more efficiently for multiple users. It does not make the model smarter. It does not replace your application logic. It does not handle prompt engineering, RAG retrieval, or agent orchestration. Those are your problems.

Start with vLLM, get your model running, then worry about everything else.
