Roofline brings Mixture-of-Experts to the Edge: MoE Compiler Support in IREE

Mixture-of-Experts (MoE) models can speed up LLM inference by up to an order of magnitude compared to dense LLMs of similar size, because they activate only the parts of the model selected for a given token. This is why frontier labs are training MoE models, and these models are now moving to the edge. But while the hardware is ready, the software is not. Current edge AI compilers do not support the dynamic expert routing that MoE depends on. Roofline closes this gap by bringing native MoE support to its IREE-based deployment stack.

Key Takeaways

1. End-to-end MoE support: Roofline compiles and runs open MoE models through its IREE-based stack, demonstrated on Granite 3.1 1B-A400M, LFM2-8B-A1B, and Qwen3-REAP-15B-A3B, spanning 1B to 15B total parameters.

2. Generalizable solution: Support comes from new compiler operators for TopK routing, histogram-based token counting, and grouped matmul, which generalize across MoE architectures.

3. No runtime changes needed: Roofline maps MoE’s dynamic execution pattern into compiler-level primitives that IREE can lower and execute through the existing runtime.

MoEs expand model capacity without increasing compute demand

An MoE model delivers more capacity per unit of compute than a regular dense model. Instead of using a single feed-forward network inside each transformer block, it uses many parallel expert networks. A router network selects a subset of these experts per token, and their outputs are combined into the layer's result. This is fundamentally different from a dense model, which uses every parameter for every token. The animation below showcases the MoE mechanism.
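The mechanism described above can be sketched for a single token in a few lines of plain Python. This is a minimal illustration, not Roofline's implementation: the names `moe_layer`, `router_weights`, and `experts` are hypothetical, the experts are toy functions rather than feed-forward networks, and normalizing the selected logits with a softmax is one common gating choice that varies between models.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a short list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, router_weights, experts, k):
    """Route one token to its top-k experts and mix their outputs."""
    # Router: one logit per expert (dot product of the token with a router row).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Select the k experts with the highest logits.
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    # Turn the selected logits into mixing weights (assumed gating scheme).
    gates = softmax([logits[e] for e in top])
    # Only the selected experts run; the others are never evaluated.
    out = [0.0] * len(token)
    for gate, e in zip(gates, top):
        y = experts[e](token)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top

# Toy setup: expert number e simply scales its input by (e + 1).
experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

output, chosen = moe_layer([0.5, 2.0], router_weights, experts, k=2)
print(chosen, output)
```

With four experts and k=2, only two expert functions are called for this token. A dense layer would correspond to running every expert on every token.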

This selective activation is what makes MoE fast, because LLM inference speed is limited by memory bandwidth, the rate at which model weights are read from memory. This matters especially on edge devices, where bandwidth is much lower than in datacenter systems. Dense models read all their weights for every token they generate, so inference gets slower as the model grows. An MoE model reads only the active experts, which decouples inference speed from total parameter count. In other words, MoE makes it possible to run larger models much faster.
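A back-of-the-envelope estimate makes the bandwidth argument concrete. The sketch below assumes that single-stream decoding is purely bandwidth-bound, so the upper bound on tokens per second is memory bandwidth divided by the bytes of weights read per token. The 50 GB/s bandwidth and 1 byte per parameter are illustrative assumptions, not measurements, and the estimate ignores the KV cache, attention, and compute time. The parameter counts are the ones quoted for LFM2-8B-A1B in this article.

```python
def decode_tokens_per_second(params_read_per_token, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on decode speed if weight reads are the only cost."""
    bytes_per_token = params_read_per_token * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token

BANDWIDTH = 50e9        # assumed edge-class memory bandwidth, in bytes/s
BYTES_PER_PARAM = 1.0   # assumed 8-bit weights

TOTAL_PARAMS = 8.3e9    # a dense model of this size reads all of these per token
ACTIVE_PARAMS = 1.5e9   # the MoE model reads only the active parameters

dense = decode_tokens_per_second(TOTAL_PARAMS, BYTES_PER_PARAM, BANDWIDTH)
moe = decode_tokens_per_second(ACTIVE_PARAMS, BYTES_PER_PARAM, BANDWIDTH)
speedup = moe / dense

print(f"dense bound: {dense:.1f} tok/s, MoE bound: {moe:.1f} tok/s, ratio: {speedup:.1f}x")
```

Under these assumptions the dense bound is about 6 tokens per second and the MoE bound about 33, a ratio of roughly 5.5x. The ratio depends only on total versus active parameters, which is why sparser models gain more.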

The faster inference comes with one trade-off: memory use. All expert weights must still fit in memory, even though only a fraction of them is activated for each token. This is why MoE has mostly lived in the datacenter rather than on edge devices. But this constraint is now weakening. As successive hardware generations increase their memory capacity, edge devices can increasingly hold these models. The hardware is becoming ready for MoE, but current edge compiler stacks are not. Roofline changes this.
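The trade-off can be put into numbers by separating what must be resident in memory (all parameters) from what is read per token (the active parameters). The sketch below uses the total and active parameter counts quoted later in this article. The 16-bit and 4-bit weight widths are assumptions for illustration only, and the function name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(num_params, bits_per_param):
    """Size of num_params weights in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# (total parameters, active parameters per token) as quoted in this article.
MODELS = {
    "Granite 3.1 1B-A400M": (1e9, 0.4e9),
    "LFM2-8B-A1B": (8.3e9, 1.5e9),
    "Qwen3-REAP-15B-A3B": (15e9, 3e9),
}

for name, (total, active) in MODELS.items():
    for bits in (16, 4):  # assumed weight precisions
        resident = weight_footprint_gb(total, bits)
        per_token = weight_footprint_gb(active, bits)
        print(f"{name} @ {bits}-bit: {resident:.1f} GB resident, "
              f"{per_token:.1f} GB read per token")
```

For the largest model at an assumed 4 bits per weight, about 7.5 GB must be resident while about 1.5 GB is read per token. Memory capacity limits which MoE models fit on a device, and the much smaller per-token read determines how fast they run.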

Roofline unlocks MoE support in IREE

IREE is an open-source, retargetable AI compiler built on MLIR that enables AI deployment across diverse hardware. IREE supports a wide range of input architectures including recent LLMs, as shown in this case study. These have so far been dense models, where every token follows the same computation path. MoE introduces a new requirement: expert selection happens at runtime, so different tokens are routed to different experts and each expert receives a data-dependent number of tokens.

This requires the compiler to support a set of operators that dense models did not need, including:

1. TopK, used to select which experts are activated for each token.

2. Histogram, used in the router to count how many tokens are assigned to each expert, which determines each expert's workload size.

3. Grouped Matrix Multiplication, a set of matmuls that share the same weight shape but take inputs of different sizes, one per expert, corresponding to the number of tokens each expert receives.
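How the three operators fit together can be sketched in plain Python. This is an illustrative reference under stated assumptions, not the IREE lowering: the function names are hypothetical, each expert is reduced to a single weight matrix, and the gating weights and the final scatter back to token order are left out to keep the focus on routing.

```python
def topk(logits, k):
    """Indices of the k largest router logits for one token."""
    return sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]

def histogram(expert_ids, num_experts):
    """Number of token assignments per expert."""
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    return counts

def matmul(rows, weight):
    """rows: [n, d_in] times weight: [d_in, d_out] gives [n, d_out]."""
    return [[sum(r[i] * weight[i][j] for i in range(len(weight)))
             for j in range(len(weight[0]))] for r in rows]

def grouped_matmul(sorted_rows, counts, expert_weights):
    """One matmul per expert over its contiguous, variable-sized slice of rows."""
    out, start = [], 0
    for e, n in enumerate(counts):
        if n:  # experts that received no tokens do no work
            out.extend(matmul(sorted_rows[start:start + n], expert_weights[e]))
        start += n
    return out

def route_and_run(tokens, router_logits, expert_weights, k):
    # 1. TopK: pick k experts per token, recorded as (expert, token) pairs.
    assignments = [(e, t) for t, logits in enumerate(router_logits)
                   for e in topk(logits, k)]
    # 2. Histogram: count how many tokens each expert receives.
    counts = histogram([e for e, _ in assignments], len(expert_weights))
    # 3. Group tokens by expert, then run the grouped matmul.
    assignments.sort()
    rows = [tokens[t] for _, t in assignments]
    return counts, grouped_matmul(rows, counts, expert_weights)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
router_logits = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2], [0.1, 0.7, 0.0]]
# Toy experts: expert e multiplies its input by (e + 1).
expert_weights = [[[s, 0.0], [0.0, s]] for s in (1.0, 2.0, 3.0)]

counts, outputs = route_and_run(tokens, router_logits, expert_weights, k=1)
print(counts, outputs)
```

In this toy run the first expert receives two tokens, the second receives one, and the third receives none. All three matmuls share the same weight shape, but the number of input rows per expert is known only once the router has run, which is the data-dependent behavior the compiler has to handle.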

By adding support for these operators, Roofline unlocks MoE in IREE. We now have a working proof of concept that compiles and runs open MoE models of different sizes end-to-end through this path. Notably, all of this is handled at the compiler level. MoE's dynamic execution pattern is lowered into compiler-level primitives ahead of time, so IREE’s existing runtime executes the model with no MoE-specific changes. This brings MoE's capacity advantage to the edge through a single compiler path.

Roofline delivers fully compiled end-to-end support for MoE models

Roofline now provides fully compiled, end-to-end enablement for MoE models. The MoE layer enablement generalizes across all recent MoE architectures, so support is not tied to any individual model. We have verified this across a broad range of models from the Hugging Face Transformers library, including:

1. Granite 3.1 1B-A400M (IBM): 1B total parameters, 400M active per token, with 32 experts and top-8 routing.

2. LFM2-8B-A1B (Liquid AI): 8.3B total parameters, around 1.5B active per token, with 32 experts and top-4 routing.

3. Qwen3-REAP-15B-A3B: 15B total parameters, 3B active per token. REAP (Router-weighted Expert Activation Pruning) is a one-shot compression method that removes low-impact experts, shrinking the total parameter count while leaving the active path unchanged.

These runs were executed end-to-end on an x86 CPU from AMD. Watch the models in action!

Because this MoE support lives in the compiler rather than in hand-written, per-model code, it generalizes across MoE architectures: new MoE models are covered through the same compiler path, without per-model engineering. The same path also extends to new hardware backends, from CPUs to GPUs and NPUs, through IREE's retargetable design. This is a structural difference from runtimes that rely on hand-written implementations for each new architecture, where adding support can take months of specialized work.

Next up is performance optimization and upstreaming

With end-to-end MoE deployment in place, the next step is performance: turning this proof of concept into an optimized path that makes full use of the target hardware, including ARM SME/SVE and RISC-V RVV CPUs. Once performance optimization is done, we plan to upstream our MoE support to IREE and to bring MoE support to NXP and Qualcomm NPUs. Stay tuned!

Co-funded by the European Union
