LLM Serving System

LLM System Design Infrastructure

July 2, 2025

Ever wonder how language models stream responses word by word while juggling hundreds of users and still staying responsive? That question led me down the rabbit hole of real-time inference infrastructure. I wanted to peel back the layers and understand what it really takes to serve large models efficiently under load, with low latency, and without crashing the system. So I started a self-driven project to build a complete LLM serving system from scratch, exploring everything that lies between receiving a prompt and delivering a streamed response.

This blog walks through that system: how it's architected, the real problems it solves, and the thinking behind each design decision. If you're curious about the systems side of LLMs, you're in the right place.

Challenges

Before diving into how the system works, let’s first understand why serving LLMs is far from trivial. From performance expectations to fault handling, the challenges are surprisingly deep, involving many moving parts that must work seamlessly together under pressure.

This system was built with these challenges at its core, shaping the design of the components you'll explore next.

System Overview

This is a microservices-based distributed system built on an async, event-driven architecture. Each service operates independently and communicates internally over gRPC. The system runs across multiple nodes (in this case, VMs), where:

Here are the core building blocks:

We’ll now break these down in more depth across the control and data planes.
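
Before breaking the planes down, here's a rough sketch of the async gRPC server pattern each service shares. The servicer and stub names are illustrative, since the project's actual .proto definitions aren't shown in this post:

```python
import asyncio
import grpc

# Illustrative only: the real servicer classes come from generated gRPC code
# (e.g. scheduler_pb2_grpc), produced from .proto files not shown here.

async def serve(port: int = 50051) -> None:
    # grpc.aio provides a non-blocking server that coexists with the rest of
    # the asyncio stack (FastAPI frontend, background tasks, etc.).
    server = grpc.aio.server()
    # Generated servicer registration would go here, e.g.:
    # scheduler_pb2_grpc.add_SchedulerServicer_to_server(SchedulerServicer(), server)
    server.add_insecure_port(f"[::]:{port}")
    await server.start()
    await server.wait_for_termination()

if __name__ == "__main__":
    asyncio.run(serve())
```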

Architecture Overview

With the system’s components introduced, the next step is to see how they interact. To simplify this, the architecture is organized into two complementary planes:

While the same services participate in both, their behavior and responsibilities shift depending on which plane they’re operating in. Let’s look at each in turn.

Control Plane

Figure 1: Control plane of the LLM serving system

At the heart of the control plane is the Head Controller. It acts as the system's central coordinator:

Schedulers act as a bridge between the control and data planes. Their responsibilities in the control plane include:

The HTTP Proxy, while primarily responsible for handling requests in the data plane, participates in the control plane by subscribing to routing table updates from the Head Controller using a gRPC-based pub-sub model. This ensures the proxy stays in sync with the latest deployment state and routes traffic only to healthy and available backends.
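
A hedged sketch of that subscription loop on the proxy side is below; the update shape and stream source stand in for the real generated gRPC types, which aren't shown in the post:

```python
import asyncio
from dataclasses import dataclass, field

# Hypothetical message shape standing in for the Head Controller's real
# routing-update protobuf (not shown in the post).
@dataclass
class RoutingUpdate:
    deployment: str
    backend_addresses: list

@dataclass
class RoutingTableCache:
    """Proxy-local view of the routing table, kept fresh by the pub-sub stream."""
    replicas: dict = field(default_factory=dict)

    def apply(self, update: RoutingUpdate) -> None:
        # Replace the backend set for a deployment whenever the controller
        # reports replicas added, removed, or marked unhealthy.
        self.replicas[update.deployment] = list(update.backend_addresses)

async def watch_routing_table(update_stream, cache: RoutingTableCache) -> None:
    # In the real proxy, `update_stream` would be the async iterator returned by
    # a server-streaming gRPC call on the Head Controller.
    async for update in update_stream:
        cache.apply(update)

# Tiny stand-in stream so the sketch runs end to end.
async def fake_updates():
    yield RoutingUpdate("chat-deployment", ["10.0.0.12:9000", "10.0.0.13:9000"])

if __name__ == "__main__":
    cache = RoutingTableCache()
    asyncio.run(watch_routing_table(fake_updates(), cache))
    print(cache.replicas)
```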

Data Plane

The data plane handles all client traffic and inference.

Figure 2: Flow of live requests through the data plane

This layer is built for concurrency, responsiveness, and throughput. It orchestrates the flow of user requests through the proxy, routes them to backend workers, and ensures the output is streamed efficiently and reliably.
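
To make that flow concrete, here's a minimal sketch of what the proxy's streaming endpoint might look like; the endpoint path, request shape, and scheduler call are assumptions standing in for the real async gRPC client:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str

async def stream_from_scheduler(prompt: str):
    # Placeholder: in the real proxy this iterates over a server-streaming gRPC
    # response from the scheduler chosen via the routing table.
    for token in ("This ", "is ", "a ", "streamed ", "reply ", "to: ", prompt):
        yield token

@app.post("/generate")
async def generate(req: GenerateRequest):
    # Tokens are forwarded to the client as soon as the backend produces them,
    # which keeps time-to-first-token low even for long completions.
    return StreamingResponse(stream_from_scheduler(req.prompt), media_type="text/plain")
```

Served with an ASGI server such as uvicorn (e.g. `uvicorn proxy:app`; the module name is assumed), this frontend stays stateless, matching the description below.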

The HTTP Proxy runs a stateless, async FastAPI server on the frontend to handle incoming HTTP requests from clients, and communicates with backend schedulers using an async gRPC client. It:

Schedulers act as intermediaries between the proxy and the replicas in the data plane. They:

Each Replica is a subprocess responsible for running the actual inference workload. It:

Together, these components form the backbone of a real-time, distributed inference system.
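
As one way to picture a replica's inner loop, here's a hedged sketch built on vLLM's AsyncLLMEngine (the engine is implied by the vLLM metrics later in this post; the model name is a placeholder and the exact API differs across vLLM versions):

```python
import uuid
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Placeholder model; the post does not specify which model was served.
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))

async def generate_stream(prompt: str):
    """Yield incremental text for one request as the engine emits new tokens."""
    params = SamplingParams(temperature=0.7, max_tokens=256)
    sent = ""
    async for output in engine.generate(prompt, params, request_id=str(uuid.uuid4())):
        text = output.outputs[0].text   # cumulative generated text so far
        yield text[len(sent):]          # forward only the newly produced part
        sent = text
```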

Design Decisions

Let’s talk about a few of the core architectural choices behind this system and, more importantly, why they were made. Each one was driven by practical needs uncovered during implementation.

Deployment Setup

Before benchmarking the system, I deployed it entirely on virtual machines (VMs) in AWS, using a containerized deployment flow that ensured reproducibility and fast scaling.

Load Testing & Evaluation

To evaluate the system’s performance under concurrent traffic, I used Locust to simulate real user load against the deployed infrastructure.
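
The locustfile for this kind of test can stay small; the sketch below uses an assumed /generate endpoint, payload, and wait times rather than the exact test configuration:

```python
from locust import HttpUser, task, between

class LLMUser(HttpUser):
    # Simulated users pause 1-3 s between prompts; the real test parameters may differ.
    wait_time = between(1, 3)

    @task
    def generate(self):
        # Endpoint and prompt are placeholders; the proxy streams tokens back over HTTP.
        with self.client.post(
            "/generate",
            json={"prompt": "Explain KV caching in one paragraph."},
            stream=True,
            catch_response=True,
        ) as response:
            for _ in response.iter_lines():
                pass  # drain the stream so latency covers the full response
            response.success()
```

Pointing it at the deployed proxy is then a matter of something like `locust -f locustfile.py --host http://<proxy-address>` and choosing the user count and spawn rate in the Locust UI.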

This section focuses on one representative test case to illustrate the system's behavior under load. Additional test scenarios will be added to the repo linked below.

Load Test Insights

Figure 3: Locust test results (end-to-end latency)
Figure 4: 95th percentile of TTFT (Prometheus)
Figure 5: 95th percentile of vLLM end-to-end latency (Prometheus)
Figure 6: Requests running per replica (Prometheus)

Prometheus metrics were monitored alongside Locust to capture deeper infrastructure behavior:
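
vLLM ships built-in Prometheus metrics for the engine, and the surrounding services can expose their own with prometheus_client; as a minimal sketch (metric names, labels, and buckets are illustrative, not the project's exact instrumentation):

```python
import time
from prometheus_client import Gauge, Histogram, start_http_server

# Illustrative metric names; not necessarily the ones used in the project.
TTFT_SECONDS = Histogram(
    "proxy_time_to_first_token_seconds",
    "Time from request arrival at the proxy to the first streamed token",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0),
)
REQUESTS_RUNNING = Gauge(
    "scheduler_requests_running",
    "Requests currently executing on a replica",
    ["replica"],
)

def record_first_token(request_start: float) -> None:
    # Called when the first token of a response is forwarded to the client.
    TTFT_SECONDS.observe(time.monotonic() - request_start)

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape
    REQUESTS_RUNNING.labels(replica="replica-0").set(2)
    record_first_token(time.monotonic() - 0.4)
    time.sleep(600)  # keep the endpoint up while Prometheus scrapes it
```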