Inference Engineering

What is Inference?

Inference is the second phase in a generative AI model’s lifecycle:

  • Training: The process of learning model weights from data.
  • Inference: Serving generative AI models in production.

Doing inference well requires three layers:

  • Runtime: Optimizing the performance of a single model on a single GPU-backed instance.
  • Infrastructure: Scaling across clusters, regions, and clouds without creating silos while maintaining excellent uptime.
  • Tooling: Providing engineers working on inference with the right level of abstraction to balance control with productivity.

The Runtime layer

The runtime layer relies on a number of model performance techniques:

  • Batching: Run incoming requests in parallel, weaving them together on a token-by-token basis to increase throughput.
  • Caching: Re-use the KV cache – the cached results of the attention algorithm – between requests that share prefixes.
  • Quantization: Lower the precision of select pieces of the model to access more compute and reduce memory burden.
  • Speculation: Generate and validate draft tokens to produce more than one token per forward pass during decode.
  • Parallelism: Efficiently leverage more than one GPU to accelerate large models without introducing new bottlenecks.
  • Disaggregation: Separate the two phases of LLM inference, prefill and decode, onto independently scaling workers.
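Token-level (continuous) batching can be sketched with a toy scheduler. This is a minimal illustration, not a real serving loop: `decode_one_token` is a hypothetical stand-in for a model's forward pass, and the scheduling policy (admit up to `max_batch`, retire finished requests each step) is a simplifying assumption.

```python
def decode_one_token(request):
    """Hypothetical stand-in for one decode forward pass on one request."""
    request["generated"].append(f"tok{len(request['generated'])}")

def continuous_batching_step(active, incoming, max_batch=4):
    """One scheduler step: admit new requests, then decode one token
    for every active request in the same batch."""
    while incoming and len(active) < max_batch:
        active.append(incoming.pop(0))
    for req in active:
        decode_one_token(req)
    # Retire requests that have reached their target length; freed batch
    # slots are filled by new arrivals on the next step.
    done = [r for r in active if len(r["generated"]) >= r["max_tokens"]]
    for r in done:
        active.remove(r)
    return done

# Usage: three requests of different lengths share every decode step,
# so short requests finish early instead of waiting for the longest one.
incoming = [{"id": i, "generated": [], "max_tokens": n}
            for i, n in [(0, 2), (1, 3), (2, 1)]]
active, finished = [], []
while incoming or active:
    finished += continuous_batching_step(active, incoming)
print([r["id"] for r in finished])  # → [2, 0, 1]
```

The key property the sketch demonstrates is that requests are interleaved per token rather than per request, which is what lets a serving runtime keep the GPU busy as requests of varying lengths come and go.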

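Prefix caching can be sketched in the same toy style. Here `attend` is a hypothetical stand-in for running attention over a prefix, and the cache is a plain dict keyed by token tuples; real runtimes use block-level KV-cache structures, but the reuse logic is the same idea.

```python
kv_cache = {}      # maps a token-tuple prefix -> fake "KV" entry
compute_calls = 0  # counts how much real attention work we did

def attend(prefix):
    """Stand-in for computing attention state over `prefix`."""
    global compute_calls
    compute_calls += 1
    return f"kv({len(prefix)})"

def kv_for(tokens):
    """Return KV state for `tokens`, reusing the longest cached prefix."""
    for cut in range(len(tokens), 0, -1):
        if tuple(tokens[:cut]) in kv_cache:
            break
    else:
        cut = 0
    # Only compute attention for tokens past the cached prefix.
    for i in range(cut, len(tokens)):
        kv_cache[tuple(tokens[:i + 1])] = attend(tokens[:i + 1])
    return kv_cache[tuple(tokens)]

# Two requests sharing a 4-token system-prompt prefix:
kv_for(["sys", "you", "are", "helpful", "Q1"])  # 5 attention calls
kv_for(["sys", "you", "are", "helpful", "Q2"])  # only 1 new call
print(compute_calls)  # → 6
```

The second request pays for one new token instead of five, which is why prefix reuse matters so much for workloads with shared system prompts or multi-turn chat histories.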
Balancing the trade-offs

In practice, inference optimization is about finding the right balance between three factors rather than maximizing any single one:

  • Latency: The time it takes to respond to a single request.
  • Throughput: The number of requests that can be processed in a given time period.
  • Quality: The relevance and accuracy of the generated output.
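The latency/throughput tension can be made concrete with a back-of-the-envelope model. The cost figures below are made-up assumptions for illustration, not measurements: suppose each decode step costs a fixed base time plus a small per-request increment as the batch grows.

```python
def per_step_time_ms(batch_size, base_ms=10.0, per_req_ms=1.0):
    """Assumed cost model: fixed per-step overhead plus a small
    per-request increment (hypothetical numbers)."""
    return base_ms + per_req_ms * batch_size

for batch in (1, 8, 32):
    step = per_step_time_ms(batch)
    latency = step                    # ms per token for each request
    throughput = batch / step * 1000  # tokens/second across the batch
    print(f"batch={batch:2d}  latency/token={latency:.0f}ms  "
          f"throughput={throughput:.0f} tok/s")
```

Under these assumptions, growing the batch from 1 to 32 raises per-token latency from 11 ms to 42 ms but lifts aggregate throughput from roughly 91 to roughly 762 tokens/second: better hardware utilization, worse individual response time. Picking the operating point on that curve is the balancing act described above.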
