AI deployment is rapidly moving to edge devices, where dedicated AI accelerators (NPUs) are getting most of the attention for heavy AI workloads. Additionally, GPUs are present on many edge SoCs but are typically being overlooked for AI deployment. The deployment capabilities of current software stacks reflect that since edge GPUs are often lacking full AI support. This case study shows how Roofline unlocks edge GPUs for larger AI models by building a new Vulkan 1.3 HAL for IREE. We demonstrate this by running Qwen3-8B heterogeneously on an ARM Mali-G720 GPU.
This case study shows how Roofline achieves up to 23× higher throughput by solving one of the most fundamental bottlenecks in efficient on-device language model inference: dynamic shapes in the prefill stage.
This case study showcases how Roofline and ARM enabled scalable, vector-length-agnostic ML execution on Arm CPUs by implementing data-tiled Scalable Vector Extension (SVE) support end-to-end in IREE, unlocking up to 100× speedups on real models and hardware.