LLM Serving System: V2

LLM System Design Infrastructure

August 12, 2025

In my previous blog, I shared how I built an LLM serving system that could stream model responses in real time. As I started to test the system further, a few limitations began to emerge:

The initial version, built entirely in Python, ultimately ran into the inherent limitations of Python's Global Interpreter Lock (GIL) and event loop contention, which bottlenecked performance under heavy, concurrent loads.

This prompted a redesign of several components, and this post details those changes. The new version strategically uses a hybrid C++/Python approach, combining raw performance with Python's rich machine learning ecosystem to address the previous version's shortcomings.

The Bottleneck in Detail: A Blocked Event Loop

So, why exactly does a single event loop cause such a problem? Python's asyncio event loop relies on cooperative multitasking: all tasks share one thread and must voluntarily yield control at await points. A CPU-bound call never awaits, so it holds that thread until it finishes, and every other pending request queues up behind it.
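To make this concrete, here is a minimal, self-contained sketch (with a hypothetical `tokenize` stand-in, not the system's actual code) of how a single CPU-bound call serializes every coroutine on the loop:

```python
import asyncio
import time

# Hypothetical CPU-bound step standing in for tokenization. It never
# awaits, so it cannot hand control back to the event loop.
def tokenize(prompt: str) -> list[int]:
    time.sleep(0.5)  # stand-in for ~500 ms of pure CPU work
    return [0] * len(prompt)

async def handle_request(i: int, t0: float) -> None:
    tokenize(f"prompt-{i}")  # blocks the loop's single thread
    print(f"request {i} done at t={time.perf_counter() - t0:.1f}s")

async def main() -> None:
    t0 = time.perf_counter()
    # Ten "concurrent" requests end up fully serialized: request 9
    # completes at ~5 s even though each needs only 0.5 s of work.
    await asyncio.gather(*(handle_request(i, t0) for i in range(10)))

asyncio.run(main())
```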

This queuing behavior is precisely what happened in the original system. The CPU-intensive tokenization step would stall the entire event loop, preventing the server from handling concurrent requests efficiently and creating a pipeline bottleneck that starved the GPU of data.
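Within pure Python, the usual mitigation is to push such work off the loop. Below is a sketch of that approach using asyncio.to_thread with the same hypothetical `tokenize` stand-in; note that threads only truly help when the underlying tokenizer releases the GIL (as native, Rust-backed tokenizers do), otherwise a process pool is needed. V2 goes further and moves the hot path out of Python entirely.

```python
import asyncio
import time

def tokenize(prompt: str) -> list[int]:
    time.sleep(0.5)  # same hypothetical CPU-bound stand-in as above
    return [0] * len(prompt)

async def handle_request(i: int, t0: float) -> None:
    # Offload the blocking call to a worker thread so the event loop
    # stays responsive; all ten requests now finish at roughly t=0.5s.
    await asyncio.to_thread(tokenize, f"prompt-{i}")
    print(f"request {i} done at t={time.perf_counter() - t0:.1f}s")

async def main() -> None:
    t0 = time.perf_counter()
    await asyncio.gather(*(handle_request(i, t0) for i in range(10)))

asyncio.run(main())
```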

The diagram below shows that the average end-to-end latency measured at vLLM is about 20 s lower than the latency measured at the proxy; that gap is time requests spend queued before reaching the model, which indicates significant queuing under high load.

Graph showing old and new system comparison
Figure 1: Avg end-to-end latency (Old system)

High Level Architecture

The system still has the same two logical planes, a control plane and a data plane, but the planes are now completely separated and no longer share any components.

This separation allows the data plane to be highly optimized for low latency processing while the control plane focuses on maintaining system stability and state.

LLM serving system
Figure 2: LLM serving system

We will now look at the changes in each plane.

Control Plane

Most of the control plane is unchanged from the original version. The main difference is that the Scheduler no longer participates in the data path; it only manages the lifecycle of the replicas.
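As an illustration of what "lifecycle only" means (a minimal sketch, not the system's actual code; the `httpx` client, `/health` endpoint, replica registry, and restart hook are all assumptions), the Scheduler reduces to a control loop that probes replica health and replaces failures, without ever touching request traffic:

```python
import asyncio
import httpx  # hypothetical choice of async HTTP client

REPLICAS = {"replica-0": "http://10.0.0.1:8000"}  # hypothetical registry

async def probe(url: str) -> bool:
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            r = await client.get(f"{url}/health")  # hypothetical endpoint
            return r.status_code == 200
    except httpx.HTTPError:
        return False

async def scheduler_loop() -> None:
    # Control loop: watch replica health and replace failures.
    # Requests never flow through here; the proxy talks to replicas
    # directly, so this loop stays off the hot path.
    while True:
        for name, url in REPLICAS.items():
            if not await probe(url):
                print(f"{name} unhealthy, restarting")
                # restart_replica(name)  # placement/launch logic elided
        await asyncio.sleep(5)

asyncio.run(scheduler_loop())
```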

Data Plane

The Data Plane has significantly changed from the previous design.

HTTP Proxy

Replica Server

The new design eliminates the major bottlenecks of the previous design and delivers a low-latency system.

The deployment setup is similar to the previous version and is discussed in the previous blog.

Design Decisions

Let’s walk through the most important architectural decisions behind the new system and, more importantly, why they were made.

Load Testing & Evaluation

To evaluate the new design, I used a setup similar to the previous version and ran Locust to simulate real user load against the deployed architecture.
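For reference, a Locust scenario for a streaming LLM endpoint might look like the sketch below. The route, payload, and pacing here are hypothetical placeholders, not the actual test parameters, which follow the setup from the previous post:

```python
from locust import HttpUser, task, between

class ChatUser(HttpUser):
    wait_time = between(1, 3)  # hypothetical think time between requests

    @task
    def generate(self) -> None:
        with self.client.post(
            "/generate",  # hypothetical route on the proxy
            json={"prompt": "Explain KV caching.", "max_tokens": 128},
            stream=True,
            catch_response=True,
        ) as resp:
            # Drain the streamed tokens so the connection completes cleanly.
            for _ in resp.iter_lines():
                pass
            resp.success()
```

Running it with `locust -f locustfile.py --host http://<proxy-host>` drives the simulated users against the deployed architecture.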

This section presents the load test results and compares them with those from the earlier design.

For a detailed analysis, you can refer to the results.

Load Test Insights

Graph showing end-to-end latency
Figure 3: Locust test results (Older version)
Graph showing end-to-end latency
Figure 4: Locust test results (Newer version)

The graphs above compare Locust test results for the previous and new versions of the system.

The graphs below show the Prometheus metrics for the newer version of the system.

Graph showing active requests
Figure 5: Active requests (Prometheus)
Graph showing 95th percentile of vLLM end-to-end latency
Figure 6: 95th percentile of vLLM end-to-end latency (Prometheus)
Graph showing GPU utilization
Figure 7: GPU utilization (Prometheus)

Overall, these metrics show that the new design effectively removes previous bottlenecks and delivers a more robust, scalable system.

Future Improvements

In the next version of the system, I plan to explore:

Closing Thoughts

This redesign takes the lessons from the first version and turns them into a system that is faster, more scalable, and more resilient. For me, this was more than just an optimization exercise; it was an opportunity to deepen my understanding of building high-performance distributed systems. If you’re working on similar challenges, I’d love to hear about your approach.

For deeper technical details, you can refer to the design document.