On-device image-to-text with multimodal LLMs

LLMs are getting smaller and now fit on edge systems, but bringing them into products and unlocking disruptive use cases remains a challenge. Common edge AI deployment tools cannot keep up with the pace of AI innovation, especially for cutting-edge models like multimodal LLMs.

Here is a look at what Roofline's MLIR-based compiler can do. We run an image-to-text task using Google DeepMind's Gemma-3-4B, fully compiled, on real Qualcomm edge hardware:

🖼️ Input: Camera view of a mobile robot in an aisle.

💬 Output: Natural language reasoning. The mobile robot decides to slow down and adjust its path.

⚡ Performance: ~9x faster than TorchInductor.

Curious? Let's talk.