runtime

sdk

.

03

.

2026

Roofline and NXP enable next-generation software support for LLMs on eIQ Neutron NPU

Advanced AI models are rapidly moving to edge devices, driving demand for on-device compute that CPUs and GPUs alone cannot satisfy. To meet this demand, dedicated AI accelerators, or NPUs, are being embedded into edge SoCs.

In collaboration with NXP® Semiconductors, we built on Roofline’s scalable MLIR and IREE compiler infrastructure to extend our heterogeneous execution stack to NPUs. Starting with LLM enablement for NXP’s eIQ® Neutron NPU on the i.MX 95 applications processor, this case study showcases three key advantages of the software enablement: 1) Unlocking broad model coverage, 2) overcoming accelerator memory limitations for models above 2GB, and 3) delivering clear performance gains of up to 3.2x in LLM prefill performance over CPU-only execution.

Software velocity unlocks the full potential of edge AI hardware innovation

In edge AI, new models and new use cases are emerging faster than ever, while semiconductor companies are shipping processors with increasingly capable NPUs. However, hardware innovation alone does not translate into real-world adoption. The decisive factor is the connection between applications and silicon. This connection is defined by software velocity, which rests on two key levers:

1. Day-0 support: New models compile and run on target hardware by default. While performance might not yet be fully optimized for each and every model, this unlocks immediate evaluation, prototyping and product development.

2. Steep optimization curve: Targeted compiler improvements drive performance from functional to production-ready within days or weeks, not months of manual kernel engineering.

This software enablement allows application and product developers to focus on what truly differentiates them: Refining and iterating use cases, advancing features, and accelerating time-to-market.

Roofline's compiler infrastructure and NXP's hardware-specific expertise combine to enable Day-0 NPU support

Achieving software velocity for NPUs requires two complementary capabilities: A flexible and efficient AI compiler infrastructure, combined with deep hardware-specific optimization expertise.
Building on MLIR and IREE, Roofline brings software infrastructure that cleanly separates hardware-agnostic and hardware-specific concerns. Standing on the shoulders of this powerful ecosystem, the infrastructure allows seamless model importing across frameworks, applies generic optimizations, and brings target agnostic features like heterogeneous execution.

On the hardware-specific side, NXP brings deep expertise in software optimization for their own silicon. Backends are purpose-built and continuously optimized for their specific architecture, ensuring that operations run at maximum efficiency.
This separation of concerns enables a scalable software stack for Day-0 support across a diverse hardware portfolio: Targeting new devices requires the integration of a new hardware-specific backend but not re-writing the whole stack from top to bottom.

As a first step in the Roofline x NXP collaboration, we applied this infrastructure to the i.MX 95 SoC, targeting the eIQ Neutron NPU as the first accelerator backend.

NXP's eIQ Neutron NPU excels at MAC-heavy AI workloads: A natural fit for vision networks, challenging for LLM enablement

NXP has been greatly expanding its AI capabilities. A key cornerstone in this expanded Edge AI enablement is the eIQ® Neutron Neural Processing Unit (NPU), a highly scalable accelerator core architecture. In the NXP i.MX 95 SoC, eIQ Neutron NPU is implemented as a 8 eTOPS accelerator with a dedicated 2GB memory space. eIQ Neutron NPU is purpose-built for executing multiply-accumulate (MAC) operations at a massive scale. Matrix multiplications, convolutions, and dot products are all reduced to MACs, and by packing thousands of MAC units into a near-memory-compute architecture with tightly coupled local memory, Neutron delivers high compute utilization and energy efficiency. On standardized vision benchmarks, this design has already proven its capabilities: NXP reports an average 1.8x speedup over a competing embedded NPU at equal TOPS and memory resources⁽¹⁾.

While this architecture is a natural fit for convolutions and matrix-heavy vision workloads, enabling LLM inference introduces additional challenges. NXP’s eIQ Neutron NPU exclusively performs integer arithmetic, meaning only quantized operations can run on the accelerator. LLMs combine MAC-heavy matrix multiplications that map well to Neutron with operations that require floating-point arithmetic, even in fully quantized models. In addition, powerful edge LLMs can exceed the NXP eIQ Neutron NPU’s 2GB local memory space.

Unlocking the NXP eIQ Neutron NPU for LLMs therefore requires heterogeneous execution: Offloading MAC-intensive matrix multiplications to the NPU while executing remaining operations on the CPU, coordinated through efficient scheduling and memory management. This workload partitioning enables the accelerator to operate at high efficiency while preserving full model functionality.

NXP and Roofline join forces to enhance software support for the Neutron NPU

Roofline and NXP jointly enabled compiled LLM inference that runs heterogeneously across the NXP eIQ Neutron NPU and CPU on NXP’s i.MX 95 SoC. This was achieved through three concrete contributions:

1. Backend Integration: Roofline and NXP jointly integrated the Neutron backend into Roofline's compiler infrastructure, build on MLIR and IREE. The compiler automatically identifies matrix multiplications in the model and routes them to Neutron, while all remaining operations compile for the CPU.

2. Asynchronous heterogenous execution: Following our previous case study on heterogenous execution and Figure 1, we have enabled asynchronous execution between the CPU and NPU. Instead of blocking calls that lead to idle times, the execution gets coordinated through lightweight synchronization fences. This allows each device to run the operations it performs best with minimal idle time.

3. Shared NPU memory: Neutron has shared memory space that is accessible to both the CPU and NPU. This allows both processing unit types to read from and write to the same memory. We leveraged this to minimize data transfer overhead between devices by allocating only the required input and output buffers in the NXP eIQ Neutron NPU's memory space. The CPU writes results of preceding operations directly into the NPU's input buffers and reads inputs for subsequent operations directly from the NPU's output buffers. This eliminates redundant copies and enables deployment of models exceeding the 2GB memory limit since only active I/O buffers reside on the NPU.

Figure 1: Asynchronous heterogeneous execution on CPU and eIQ Neutron NPU. Zero-copy operations allow the CPU to read and write directly from shared NPU memory (orange arrows), while synchronization fences coordinate both devices with minimal idle time.

Through seamless asynchronous execution we are now able to run modern transformer-based LLMs jointly on CPU + NPU with Qwen3-1.7B as an example in the following video.

Clear advantages for LLM deployment on NXP eIQ Neutron NPU

The joint Roofline and NXP enablement of NXP’s eIQ Neutron NPU delivers three key results, each with direct practical implications for developers building on NXP hardware:

1. Fully compiled LLM enablement for NXP’s eIQ Neutron NPU: Starting from the initial enablement shown in the video above, Roofline and NXP expanded software support to virtually any 8-bit quantized LLM within just a few weeks, validated on a joint model zoo of 10+ LLMs from ONNX and PyTorch. Matmul offloading at the compiler level works for any LLM architecture, showcasing the scalability only provided by a truly compiler-based stack. This is the direct payoff of the separation of concerns: New models compile and run on Neutron without model-specific engineering.

2. Support for larger models: The shared NPU memory and zero-copy approach enables models that exceed the NPU’s 2GB memory, such as Qwen3-1.7B (slightly above 2GB) and Gemma-2-2b-it (~3.5GB).

3. Clear performance gains: Across all validated models, offloading matmuls to Neutron delivers consistent speed-ups during the LLM prefill phase. Figure 2 outlines benchmarks for six models of various sizes. The speed-up translates into up to 1.9x higher performance compared to a 6-core CPU baseline utilizing all available cores of the i.MX 95 SoC.

Figure 2: Prefill performance of six LLMs on i.MX 95 SoC , comparing CPU-only execution against heterogeneous CPU+eIQ Neutron NPU execution. Matmul offloading delivers up to 1.9x faster time to first token and enables models that exceed Neutron's 2GB memory limit.

These results translate into three practical implications for developers building on NXP hardware:

1. Accelerated time-to-market through Day-0 LLM support on NXP’s eIQ Neutron NPU.

2. New use cases unlocked by deploying larger models that were previously out of reach.

3. Leveraging performance benefits to either speed up inference time or free compute resources for other workloads running on the same SoC. As shown in Figure 3, we see gains of up to 3.2x for NPU + CPU when comparing versus a single CPU core.

Figure 3: Prefill performance of Qwen3-1.7B on i.MX 95 across different input sequence lengths, comparing 1-core and 6-core CPU baselines against CPU+ eIQ Neutron NPU. NPU offloading delivers up to 3.2x speedup over single-core execution.

From first LLM enablement to full SoC utilization

This case study demonstrates how scalable compiler infrastructure unlocks software velocity for NPUs-based edge SoCs. Day-0 model support and steep optimization curves enable product and application developers on NXP hardware to focus on what matters most: Building applications, user experiences, and product differentiation that set them apart.

Next, we are extending the NXP eIQ Neutron NPU enablement from prefill to the decode phase of LLMs to ensure efficient SoC utilization during token generation. We will also expand the NXP eIQ Neutron NPU offloading capabilities to further operator types, unlocking performance gains also for less matmul-heavy models like conventional CNNs. The objective: ensure any AI model for any given use case can run across the entire NXP i.MX 95 SoC, making optimal use of the CPU, GPU, and NPU.

CITATIONS

(1) Bamberg, Lennart, et al. "eIQ Neutron: Redefining edge-AI inference with integrated npu and compiler innovations." arXiv preprint arXiv:2509.14388 (2025)

‍

Co-funded by the European Union

runtime

sdk

Similar cases

runtime

sdk

Roofline brings Mixture-of-Experts to the Edge: MoE Compiler Support in IREE

Mixture-of-Experts (MoE) can accelerate LLM inference by an order of magnitude compared to dense LLMs of similar size, by activating only the parts of the model selected for a given token. This is why all frontier labs are training MoE models, and these models are now moving to the edge. But while the hardware is ready, the software is not. Current edge AI compilers do not support the dynamic expert routing that MoE depends on. Roofline closes this gap by bringing native MoE support to its IREE-based deployment stack.

Case study

runtime

Unlocking edge GPUs for larger AI models and asynchronous execution: Roofline has built a new Vulkan 1.3 HAL for IREE

AI deployment is rapidly moving to edge devices, where dedicated AI accelerators (NPUs) are getting most of the attention for heavy AI workloads. Additionally, GPUs are present on many edge SoCs but are typically being overlooked for AI deployment. The deployment capabilities of current software stacks reflect that since edge GPUs are often lacking full AI support. This case study shows how Roofline unlocks edge GPUs for larger AI models by building a new Vulkan 1.3 HAL for IREE. We demonstrate this by running Qwen3-8B heterogeneously on an ARM Mali-G720 GPU.

Case study

sdk

runtime

Dynamic shape support: A key enabler for on-device LLM inference

This case study shows how Roofline achieves up to 23× higher throughput by solving one of the most fundamental bottlenecks in efficient on-device language model inference: dynamic shapes in the prefill stage.

Case study