IEEE Educational Events

High Performance Inferencing for LLMs


Inferencing has become ubiquitous across cloud, regional, edge, and device environments, powering a wide spectrum of AI use cases spanning vision, language, and traditional machine learning applications. In recent years, Large Language Models (LLMs), initially developed for natural language tasks, have expanded to multimodal applications including vision, speech, reasoning, and planning, each demanding distinct service-level objectives (SLOs). Achieving high-performance inferencing for such diverse workloads requires both model-level and system-level optimizations.

This talk focuses on system-level optimization techniques that maximize token throughput, meet user-experience metrics, and improve inference service-provider efficiency. We review several recent innovations, including KV caching, Paged/Flash/Radix Attention, Speculative Decoding, Prefill/Decode (P/D) Disaggregation, and KV Routing, and explain how these mechanisms enhance performance by reducing latency, memory footprint, and compute overhead. These techniques are implemented in leading open-source inference frameworks such as vLLM, SGLang, Hugging Face TGI, and NVIDIA NIM, which form the backbone of large-scale public and private LLM serving platforms.
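To make the first of these techniques concrete, the sketch below illustrates the core idea of KV caching: during autoregressive decoding, each step appends one new key/value pair to a per-sequence cache, so attention over the growing prefix never recomputes past keys and values. This is a minimal toy in pure Python (1-D vectors, single head); the class and function names are illustrative, not from any of the frameworks named above.

```python
import math

def attend(q, ks, vs):
    # Scaled dot-product attention of query q over cached keys/values.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]       # softmax over the cached positions
    return [sum(w * v[i] for w, v in zip(weights, vs))
            for i in range(len(vs[0]))]

class KVCache:
    """Append-only per-sequence cache: each decode step stores one new
    key/value pair instead of recomputing K and V for the whole prefix."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # O(1) append, then attention over all cached positions so far.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)
```

Paged Attention builds directly on this idea: instead of one contiguous list per sequence, the cache is stored in fixed-size blocks allocated on demand, which cuts memory fragmentation when serving many sequences of varying length.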

The use of GPU training, inference, and analysis clusters with Multi-Instance GPUs (MIG), and of federated models with QML applications, has now become practical.

Attendees will gain a practical understanding of the challenges in delivering scalable, low-latency LLM inference, and of the architectural and algorithmic innovations driving next-generation high-performance inference systems.