December 26, 2025

Dropping our biggest feature to close the year: Unlock your full SoC with heterogeneous execution.

Today, we are unveiling a capability the edge AI ecosystem has been missing for years. Modern edge SoCs are hardware powerhouses: CPU, GPU, and NPU on a single chip, each with complementary strengths. But until now, software could only use them one at a time, forcing them into slow lockstep. The result: idle hardware and wasted performance.

That gap ends today. roofline now enables true asynchronous heterogeneous execution, across devices, across vendors, fully end-to-end.

In this demo, we showcase Qwen3-0.6B on NXP Semiconductors' i.MX 95. The model is executed heterogeneously on CPU and GPU in parallel, with parts of the prefill and decode workloads offloaded to the GPU.

Get all the details in our full case study, where we outline our heterogeneous execution infrastructure, show how it extends to NPUs, and apply it across NXP Semiconductors, Qualcomm, and Apple SoCs: Case study: asynchronous heterogeneous execution

#EdgeAI #AIDeployment #AICompiler #MLIR #Roofline