Efficient On-Device Diffusion LLM Inference with Mobile NPU
Researchers introduce a framework to optimize diffusion LLM inference on mobile NPUs by addressing memory and compute bottlenecks.
Diffusion LLMs (dLLMs) enable parallel token generation, but on mobile NPUs they are held back by constraints such as limited address space and inefficient KV cache reuse. The proposed framework optimizes dLLM workloads around these constraints to improve performance in latency-sensitive use on smartphone hardware.
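
To make "parallel token generation" concrete, here is a minimal toy sketch of one common dLLM decoding style: start from a fully masked sequence and commit several of the most confident predictions at each step. It is not the framework's method. The `toy_denoiser` function, its confidence scores, and the `tokens_per_step` parameter are hypothetical stand-ins for a real model forward pass and decoding schedule.

```python
MASK = None  # placeholder for a position that has not been generated yet


def toy_denoiser(seq):
    """Hypothetical stand-in for a dLLM forward pass (an assumption).

    Returns {position: (token, confidence)} for every masked position.
    A real model would predict these from the whole sequence at once.
    """
    return {
        i: (f"tok{i}", 1.0 / (1 + i))  # earlier positions score higher here
        for i, tok in enumerate(seq)
        if tok is MASK
    }


def diffusion_decode(length, tokens_per_step):
    """Fill a fully masked sequence, committing several tokens per step."""
    seq = [MASK] * length
    steps = 0
    while any(tok is MASK for tok in seq):
        predictions = toy_denoiser(seq)
        # Rank masked positions by confidence, highest first.
        ranked = sorted(predictions, key=lambda i: predictions[i][1], reverse=True)
        # Commit the top predictions together in a single step.
        for i in ranked[:tokens_per_step]:
            seq[i] = predictions[i][0]
        steps += 1
    return seq, steps


seq, steps = diffusion_decode(length=8, tokens_per_step=4)
print(seq, steps)
```

With `tokens_per_step=1` the loop degenerates to one token per step, which is the step count an autoregressive decoder would need for the same length. Committing more tokens per step reduces the number of model passes, and those passes are the workload that has to fit within the NPU's memory and compute limits.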