LLMs are moving onto edge devices and naturally come with variable prompt lengths. Unlike traditional inference with fixed input sizes, LLM prefill therefore operates on dynamic input shapes. At the same time, edge GPUs and NPUs are typically optimized for fixed-size computations. This makes dynamic shape handling a key prerequisite for high-performance on-device LLM inference.
Roofline elevates handling of dynamic input shapes to a first-class compiler capability for on-device LLMs. Expanding on established operator-level techniques such as padding, peeling, and masking, we introduce a model-level approach that constrains dynamic prompt lengths once, at the model boundary, to hardware-friendly multiples. This global guarantee enables efficient fixed-size tensor execution on edge hardware.
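The core idea can be sketched in a few lines. This is an illustrative example only, not Roofline's implementation: the prompt is padded once at the model boundary to the next hardware-friendly multiple (the tile size of 64 and the pad token id are hypothetical values), so all downstream kernels see one of a small set of fixed shapes, with an attention mask excluding the padding.

```python
PAD_ID = 0      # hypothetical pad token id
MULTIPLE = 64   # hypothetical hardware-friendly tile size

def pad_to_multiple(token_ids, multiple=MULTIPLE, pad_id=PAD_ID):
    """Pad a dynamic-length prompt up to the next multiple of `multiple`.

    Returns the padded ids and an attention mask (1 = real token,
    0 = padding) so padded positions can be masked out in attention.
    """
    n = len(token_ids)
    target = -(-n // multiple) * multiple  # ceil(n / multiple) * multiple
    padded = token_ids + [pad_id] * (target - n)
    mask = [1] * n + [0] * (target - n)
    return padded, mask

# A 70-token prompt is padded to the next multiple of 64, i.e. 128.
ids, mask = pad_to_multiple(list(range(70)))
print(len(ids), sum(mask))  # 128 70
```

Because the constraint is applied once at the model boundary, every operator inside the graph can be compiled for a small, fixed set of shapes instead of handling arbitrary lengths.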
In the video below, our LLM wizard Thomas Ziereis guides you through the key concepts and demonstrates up to 23× higher prefill performance for Qwen3-0.6B on an NVIDIA RTX 3070.
Read the full case study here: https://lnkd.in/eucmJDBN
#EdgeAI #AIDeployment #AICompiler #MLIR #Roofline