12.05.2026

Unlocking edge GPUs for larger AI models and asynchronous execution: Roofline brings Vulkan 1.3 to IREE

AI deployment is rapidly moving to edge devices, where dedicated AI accelerators (NPUs) receive most of the attention for heavy AI workloads. GPUs, however, are present on many edge SoCs yet are typically overlooked for AI deployment, and current software stacks reflect that: edge GPUs often lack full AI support. This case study shows how Roofline unlocks edge GPUs for larger AI models by building a new Vulkan 1.3 HAL for IREE. We demonstrate this by running Qwen3-8B heterogeneously on an Arm Mali-G720 GPU.

Edge GPUs are ready for AI, but lack mature software support

Edge hardware is becoming increasingly heterogeneous, with CPUs, GPUs, and NPUs integrated on a single SoC. NPUs are purpose-built for edge AI and rightfully receive significant investment as low-power and cost-efficient accelerators for edge AI inference. However, edge GPUs, which are already present on many higher-end SoCs, are often overlooked despite being a readily available compute resource:

1. Availability: GPUs are shipped in many higher-end SoCs.

2. Performance: Modern edge GPUs from vendors like Qualcomm, AMD, and Arm deliver substantial compute throughput suitable for AI workloads.

3. Standardization: GPUs from different vendors share common programming models and can be targeted through standard APIs like Vulkan, making them more accessible than vendor-specific NPU toolchains.

Yet, software support for AI on edge GPUs remains thin, as most GPU-focused AI tooling targets server-class hardware. Let's look at IREE, a leading piece of AI deployment infrastructure, and how it handles AI deployment on edge GPUs.

IREE's layered runtime architecture separates hardware concerns

IREE is an open-source, retargetable AI compiler built on MLIR that enables AI deployment across diverse hardware. It compiles models from frameworks like PyTorch into a single executable artifact (.vmfb) that runs across CPUs, GPUs, and NPUs.

During inference on the device, execution is orchestrated by IREE's runtime. To keep the runtime independent from any specific hardware, IREE separates execution into three layers:

1. The Hardware Abstraction Layer (HAL) is a uniform interface that defines the generic device operations the runtime relies on (memory allocation, kernel dispatch, synchronization). The HAL itself is abstract: it specifies what the runtime can ask a device to do, not how it is done.

2. A runtime backend is a concrete implementation of the HAL for a specific hardware API. For example, the Vulkan HAL backend implements the HAL by translating its generic operations into Vulkan API calls.

3. The hardware API (e.g., Vulkan for GPUs, or a vendor SDK for NPUs) is the external interface that the runtime backend ultimately drives. It is provided by the hardware vendor or a standards body.
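To make the layering concrete, here is a deliberately simplified sketch in C++. It is not IREE's actual C API; all names are illustrative. An abstract device interface stands in for the HAL (layer 1), a Vulkan-flavored subclass stands in for a runtime backend (layer 2), and the comments name the Vulkan calls (layer 3) each operation would ultimately map to.

```cpp
#include <cstddef>
#include <cstdint>

struct Buffer;       // opaque handle to device memory
struct Executable;   // compiled kernels, e.g. SPIR-V for GPUs
struct Fence;        // synchronization primitive signaled on completion

// Layer 1: the abstract HAL-style interface the runtime is written against.
class HalDevice {
 public:
  virtual ~HalDevice() = default;
  virtual Buffer* Allocate(std::size_t bytes) = 0;
  virtual void Dispatch(Executable* exe, std::uint32_t gx, std::uint32_t gy,
                        std::uint32_t gz, Fence* signal_when_done) = 0;
  virtual void Wait(Fence* fence) = 0;
};

// Layer 2: a backend implements the interface in terms of one hardware API.
// Bodies are left as stubs; comments name the layer-3 Vulkan calls they wrap.
class VulkanHalDevice : public HalDevice {
 public:
  Buffer* Allocate(std::size_t) override { return nullptr; }  // vkCreateBuffer + vkAllocateMemory
  void Dispatch(Executable*, std::uint32_t, std::uint32_t,
                std::uint32_t, Fence*) override {}            // vkCmdDispatch + vkQueueSubmit
  void Wait(Fence*) override {}                               // vkWaitForFences
};
```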

Together, they coordinate how a piece of work is transferred from the host CPU to the accelerators and back:

1. The host (CPU) drives execution using two threads: a scheduling thread that submits work, manages dependencies, and coordinates across devices, and a compute thread that executes CPU workloads itself.

2. Accelerators (GPUs and NPUs) execute the workloads submitted by the host and signal completion back to the host through synchronization primitives such as fences.

3. The degree of actual concurrency (how much work runs in parallel across devices and how their execution is synchronized) depends on the backend implementation.
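The sketch below illustrates this host-side model with plain C++ threads. It is only an analogy for how the scheduling thread keeps orchestrating while devices signal completion, not Roofline's or IREE's actual scheduler; the accelerator is simulated by a thread and the fence by a future.

```cpp
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

int main() {
  // "Fence": signaled when the (simulated) accelerator finishes its workload.
  std::promise<void> gpu_done;
  std::future<void> gpu_fence = gpu_done.get_future();

  // Simulated accelerator executing a submitted workload.
  std::thread accelerator([&] {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));  // kernel time
    gpu_done.set_value();  // signal the fence back to the host
  });

  // CPU compute thread executing a CPU-resident part of the model.
  std::thread compute([] {
    std::this_thread::sleep_for(std::chrono::milliseconds(30));  // CPU workload
  });

  // Scheduling thread (here: main) keeps orchestrating and only waits at the end.
  std::cout << "work submitted to GPU and CPU, scheduler keeps running\n";
  gpu_fence.wait();  // wait on the fence only when the result is actually needed
  accelerator.join();
  compute.join();
}
```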

For GPUs, the Vulkan HAL backend translates IREE's execution plan into Vulkan API calls for memory management, kernel dispatch, and synchronization. The GPU kernels themselves are compiled to SPIR-V, the standard low-level bytecode format consumed by Vulkan.
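As an illustration of that translation, here is a simplified sketch of how a single dispatch might be recorded into a Vulkan command buffer. The function and its parameters are hypothetical; the compute pipeline wrapping the compiler-generated SPIR-V kernel is assumed to exist already, and error handling is omitted.

```cpp
#include <vulkan/vulkan.h>

// Record one compute dispatch into a command buffer (illustrative sketch).
void record_dispatch(VkCommandBuffer cmd, VkPipeline pipeline,
                     VkPipelineLayout layout, VkDescriptorSet descriptors,
                     uint32_t groups_x, uint32_t groups_y, uint32_t groups_z) {
  VkCommandBufferBeginInfo begin{};
  begin.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
  begin.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
  vkBeginCommandBuffer(cmd, &begin);

  // The compute pipeline wraps a SPIR-V kernel produced by the compiler.
  vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
  vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, layout,
                          /*firstSet=*/0, /*descriptorSetCount=*/1, &descriptors,
                          /*dynamicOffsetCount=*/0, nullptr);
  vkCmdDispatch(cmd, groups_x, groups_y, groups_z);  // launch the kernel grid

  vkEndCommandBuffer(cmd);
}
```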

IREE’s existing Vulkan HAL has yet to adopt Vulkan 1.3

The current Vulkan HAL in IREE is based on a mixture of Vulkan 1.1 and 1.2, released in 2018 and 2020 respectively. This reflects the state of the ecosystem when it was originally built, and leaves room for improvement in three areas:

1. API complexity: Mixing two Vulkan versions increases implementation complexity and makes the code harder to maintain.

2. Buffer device address: This feature (GPU pointer-based memory access) exists as an optional extension in Vulkan 1.2, but because it is optional, the HAL cannot rely on it being available on a given device.

3. Execution model: The current HAL uses blocking GPU execution, meaning the CPU scheduling thread waits for the GPU to complete each submission before proceeding.

These design choices translate into two concrete limitations on the device. First, without guaranteed buffer device address support, the HAL falls back to a segmented buffer model that caps individual model weight allocations at 4GB. And second, because GPU execution is blocking, the scheduling thread cannot distribute work to other devices (such as the CPU compute thread or NPUs) while the GPU is busy.

The practical consequences for real-world deployment follow directly: Large models exceed the 4GB buffer limit and fail to run, and the lack of overlap between CPU, GPU, and other accelerators prevents true asynchronous heterogeneous execution.
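The blocking pattern looks roughly like the following sketch (a simplified illustration, not the actual HAL code): the submission is immediately followed by a fence wait on the same thread, so the scheduler cannot feed any other device in the meantime.

```cpp
#include <cstdint>
#include <vulkan/vulkan.h>

// Submit a command buffer and block the calling thread until the GPU is done.
void submit_blocking(VkDevice device, VkQueue queue, VkCommandBuffer cmd,
                     VkFence fence) {
  VkSubmitInfo submit{};
  submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
  submit.commandBufferCount = 1;
  submit.pCommandBuffers = &cmd;

  vkQueueSubmit(queue, 1, &submit, fence);
  // Blocking: the scheduling thread stalls here until the GPU signals the fence,
  // so no work can be dispatched to the CPU compute thread or an NPU meanwhile.
  vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
  vkResetFences(device, 1, &fence);
}
```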

Vulkan 1.3 provides a strong foundation

Vulkan 1.3 made previously optional features mandatory, providing a more reliable baseline for GPU compute. Two changes are particularly relevant for AI workloads.

First, buffer device address became mandatory. Under Vulkan 1.1 and 1.2, GPU memory was accessed through descriptors, an intermediate lookup layer that defines buffer location and size. This layer constrains how large buffers can be addressed and effectively caps individual allocations at 4GB in current implementations. Buffer device address replaces descriptors with 64-bit device pointers, similar to pointers in C, so buffer sizes are limited only by hardware capabilities rather than by descriptor constraints.
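In Vulkan terms, this looks roughly like the sketch below (simplified and hypothetical, with memory allocation and binding elided): the buffer is created with the device-address usage flag and its 64-bit GPU address is queried so shaders can address it directly instead of going through descriptors.

```cpp
#include <vulkan/vulkan.h>

// Create a storage buffer and return its 64-bit device address (illustrative).
VkDeviceAddress get_buffer_address(VkDevice device, VkDeviceSize size_bytes,
                                   VkBuffer* out_buffer) {
  VkBufferCreateInfo info{};
  info.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
  info.size = size_bytes;  // no descriptor-imposed cap; limited by the hardware
  info.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT |
               VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT;
  info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
  vkCreateBuffer(device, &info, nullptr, out_buffer);

  // Memory allocated with VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT must be bound
  // to the buffer here before its address can be queried (omitted for brevity).

  VkBufferDeviceAddressInfo addr_info{};
  addr_info.sType = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO;
  addr_info.buffer = *out_buffer;
  return vkGetBufferDeviceAddress(device, &addr_info);  // 64-bit GPU pointer
}
```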

Second, improved synchronization primitives enable non-blocking execution models. They allow the scheduling thread to submit GPU work and continue execution without waiting for completion, which is the foundation for true asynchronous heterogeneous execution across CPU, GPU, and NPU.
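A simplified sketch of such a non-blocking submission, using a timeline semaphore together with Vulkan 1.3's synchronization2 submit path, could look like this (illustrative only, not the new HAL's actual code):

```cpp
#include <cstdint>
#include <vulkan/vulkan.h>

// Submit a command buffer and return immediately; completion is published by
// signaling `signal_value` on a timeline semaphore created elsewhere.
void submit_async(VkQueue queue, VkCommandBuffer cmd,
                  VkSemaphore timeline, uint64_t signal_value) {
  VkCommandBufferSubmitInfo cmd_info{};
  cmd_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_SUBMIT_INFO;
  cmd_info.commandBuffer = cmd;

  VkSemaphoreSubmitInfo signal_info{};
  signal_info.sType = VK_STRUCTURE_TYPE_SEMAPHORE_SUBMIT_INFO;
  signal_info.semaphore = timeline;        // must be a timeline semaphore
  signal_info.value = signal_value;        // timeline value signaled on completion
  signal_info.stageMask = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;

  VkSubmitInfo2 submit{};
  submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO_2;
  submit.commandBufferInfoCount = 1;
  submit.pCommandBufferInfos = &cmd_info;
  submit.signalSemaphoreInfoCount = 1;
  submit.pSignalSemaphoreInfos = &signal_info;

  vkQueueSubmit2(queue, 1, &submit, VK_NULL_HANDLE);
  // No wait here: the scheduling thread returns and can dispatch work to the
  // CPU or an NPU. Later, vkWaitSemaphores() blocks only when the result is
  // actually needed.
}
```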

On top of these technical improvements, Vulkan 1.3 also has widespread platform support, including Android, making it a practical target for edge deployment. Together, these features remove a key barrier to running large AI models heterogeneously on GPUs via Vulkan, which is why we decided to design and implement a new Vulkan HAL for IREE.

Roofline has built a new Vulkan 1.3 HAL backend for IREE

To enable Vulkan 1.3 for edge AI deployment, Roofline built a completely new Vulkan 1.3 HAL backend in IREE. Implementing it end-to-end also required several upstream fixes to the MLIR compiler infrastructure that IREE builds on, which we contributed in parallel.

A new Vulkan 1.3 HAL backend to modernize GPU execution

Roofline designed and implemented a new Vulkan HAL backend that targets Vulkan 1.3 exclusively. The new backend:

1. Eliminates the Vulkan 1.1/1.2 split, reducing implementation complexity and removing version-dependent code paths.

2. Enables non-blocking GPU execution, so the scheduling thread can submit GPU work and immediately continue dispatching to other devices such as the CPU or an NPU.

3. Supports weight tensors larger than 4GB by relying on buffer device address, which is guaranteed available in Vulkan 1.3.
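As an illustration, a Vulkan 1.3 backend can simply request the relevant features at device creation instead of probing optional extensions. The sketch below is a simplified, hypothetical example of enabling buffer device address, timeline semaphores, and synchronization2; it is not the new HAL's actual initialization code.

```cpp
#include <vulkan/vulkan.h>

// Create a compute-capable device with the features a Vulkan 1.3 backend relies on.
VkDevice create_compute_device(VkPhysicalDevice gpu, uint32_t queue_family) {
  float priority = 1.0f;
  VkDeviceQueueCreateInfo queue{};
  queue.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
  queue.queueFamilyIndex = queue_family;
  queue.queueCount = 1;
  queue.pQueuePriorities = &priority;

  // Core-1.3 feature used for non-blocking submission paths.
  VkPhysicalDeviceVulkan13Features features13{};
  features13.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_3_FEATURES;
  features13.synchronization2 = VK_TRUE;

  // Features exposed via the 1.2 feature struct, guaranteed on 1.3 devices.
  VkPhysicalDeviceVulkan12Features features12{};
  features12.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_2_FEATURES;
  features12.bufferDeviceAddress = VK_TRUE;
  features12.timelineSemaphore = VK_TRUE;
  features12.pNext = &features13;

  VkDeviceCreateInfo info{};
  info.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
  info.pNext = &features12;
  info.queueCreateInfoCount = 1;
  info.pQueueCreateInfos = &queue;

  VkDevice device = VK_NULL_HANDLE;
  vkCreateDevice(gpu, &info, nullptr, &device);
  return device;
}
```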

Upstream MLIR fixes to unblock SPIR-V code generation

Adopting buffer device address and running larger models on GPUs required closing several gaps in the upstream SPIR-V lowering path:

1. Addressing of buffers larger than 2GB (PR pending): removes a 32-bit addressing limit that prevented SPIR-V code from correctly accessing larger buffers.

2. Buffer device address support for vectorized memory accesses (PR #187698): closes a gap that prevented the compiler from using buffer device address when generating vectorized code.

3. Correct GPU parallelization of common tensor reshaping patterns (PR #183830): fixes bugs that appeared when distributing certain reshaping operations across GPU threads.

4. Preservation of performance-tuning metadata through compilation (PR #182064): ensures that tiling configurations attached by IREE are not lost during intermediate compiler passes.

New GPU deployment flow unlocks models >4 GB and asynchronous execution

With the new Vulkan 1.3 HAL and the upstream SPIR-V fixes in place, edge GPUs can now run AI workloads that were previously out of reach. Two capabilities in particular stand out.

First, asynchronous heterogeneous execution on GPU is now possible through IREE. Today, Roofline's deployment stack is the only solution that enables this on edge GPUs. The scheduling thread can dispatch work to the GPU and immediately continue orchestrating other devices, so CPU, GPU, and NPU run in parallel instead of taking turns.

Second, larger models now run end-to-end on edge GPUs. The removal of the 4GB buffer limit means weight tensors for billion-parameter LLMs fit into a single allocation, and the full model executes on the GPU without workarounds.

In the demo video below, Qwen3-8B runs heterogeneously end-to-end on an Arm Mali-G720 GPU through the new Vulkan HAL.

Stay tuned: Next up is stabilization and performance

The new Vulkan HAL is functional and enables workloads that were previously impossible, but there is still plenty of ground to cover. Our next focus is twofold: stabilizing the new HAL across more devices and Vulkan driver implementations, and further improving the performance of the Vulkan 1.3 stack. As part of this, we are continuously expanding our testing to additional edge platforms, with a particular focus on Android-based devices given Android's broad Vulkan 1.3 support across the edge GPU ecosystem.
