NPU Passthrough on NXP i.MX8M Plus
Overview
The NPU Passthrough on NXP i.MX8M Plus container image provides a comprehensive environment for building and deploying AI applications on NXP i.MX 8MP hardware. This container features full hardware acceleration support, optimized AI frameworks, and industrial-grade reliability. With this container, developers can quickly prototype and deploy AI use cases such as computer vision without the burden of solving time-consuming dependency issues or manually setting up complex toolchains. All required runtimes, libraries, and drivers are pre-configured, ensuring seamless integration with the NXP AI acceleration stack.
Key Features
- Complete AI Framework Stack: Pre-integrated runtimes, including LiteRT, for seamless execution of `.tflite` models. Developers can deploy models without worrying about low-level compatibility issues.
- Edge AI Capabilities: Optimized support for computer vision leveraging NXP NPU acceleration.
- Hardware Acceleration: Direct passthrough access to the NPU ensures high-performance, low-latency inference with minimal power consumption.
- Preconfigured Environment: Eliminates time-consuming setup by bundling drivers, toolchains, and AI libraries, so developers can focus directly on building applications.
- Rapid Prototyping & Deployment: Ideal for quickly testing AI models, validating PoCs, and deploying without rebuilding from scratch.
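As a minimal sketch of what deployment looks like in practice, the helper below runs a single `.tflite` inference with the LiteRT/TFLite interpreter. The `tflite_runtime` package name, the model path, and the `top_k` post-processing helper are assumptions for illustration, not part of the container's documented API; the runtime import is kept lazy so the post-processing can be reused independently.

```python
def top_k(scores, k=5):
    """Hypothetical post-processing helper: indices of the k highest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def run_tflite_inference(model_path, input_array, delegate=None):
    """Run one inference with LiteRT/TFLite; an NPU/GPU delegate is optional."""
    # Lazy import: tflite_runtime is assumed to be available inside the container.
    from tflite_runtime.interpreter import Interpreter

    delegates = [delegate] if delegate is not None else None
    interpreter = Interpreter(model_path=str(model_path),
                              experimental_delegates=delegates)
    interpreter.allocate_tensors()

    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], input_array)  # input must match dtype/shape
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])
```

For a classifier such as MobileNet V1, `top_k(run_tflite_inference(model, img)[0])` would yield the most likely class indices.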
Hardware Specifications
| Component | Specification |
|---|---|
| Target Hardware | Advantech EPC-R3720 |
| SoC | NXP i.MX8MPlus |
| GPU | Vivante GC7000UL |
| NPU | Vivante GC7000UL |
| Memory | 6 GB LPDDR4 |
Operating System
| Environment | Operating System |
|---|---|
| Device Host | Yocto 4.0 (LTS) (5.15-kirkstone) |
| Container | Ubuntu:22.04 |
Software Components
| Component | Version | Description |
|---|---|---|
| LiteRT | 2.9.1 | Provides TFLite Delegate support for GPU and NPU acceleration |
| GStreamer | 1.20.0 | Multimedia framework for building flexible audio/video pipelines |
| NNStreamer | 2.1.1 | Pipeline-centric framework for integrating TFLite and other ML backends into GStreamer audio/video pipelines |
| Python | 3.10 | Python runtime for building applications |
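To illustrate how GStreamer and NNStreamer fit together, the function below assembles a typical `gst-launch`-style camera-to-inference pipeline string. The device node, model path, and the `Delegate:External,ExtDelegateLib` custom option for routing `tensor_filter` through the VX delegate are assumptions based on common NXP BSP setups; verify the exact option syntax against your NNStreamer version.

```python
def nnstreamer_pipeline(model="/opt/models/model.tflite",
                        device="/dev/video0", width=224, height=224):
    """Build a gst-launch style NNStreamer pipeline string for TFLite inference."""
    return (
        f"v4l2src device={device} ! "
        f"videoconvert ! videoscale ! "
        f"video/x-raw,width={width},height={height},format=RGB ! "
        # tensor_converter turns raw video frames into NNStreamer tensors.
        f"tensor_converter ! "
        # tensor_filter invokes the TFLite model; the custom option (assumed
        # syntax) asks for the external VX delegate so inference lands on the NPU.
        f"tensor_filter framework=tensorflow-lite model={model} "
        f"custom=Delegate:External,ExtDelegateLib:/usr/lib/libvx_delegate.so ! "
        f"tensor_sink"
    )
```

The resulting string can be passed to `gst-launch-1.0` or parsed with `Gst.parse_launch` from Python.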
Supported AI Capabilities
Vision Models
| Model | Format | Note |
|---|---|---|
| PoseNet (ResNet50) | TFLite | Provided by NXP Demo Experience |
| MobileNet V1 | TFLite | Provided by NXP Demo Experience |
| SSD MobileNet V2 | TFLite | Provided by NXP Demo Experience |
| FaceNet | TFLite | Provided by NXP Demo Experience |
Note: The above table highlights a subset of commonly used models validated for this environment. Other transformer-based or vision models may also be supported depending on runtime compatibility and hardware resources. For the most detailed and up-to-date list of supported models and runtimes, please refer to the official NXP Demo Experience documentation.
Supported AI Model Formats
| Runtime | Format | Compatible Versions |
|---|---|---|
| LiteRT | .tflite | 2.9.1 |
Hardware Acceleration Support
| Accelerator | Support Level | Compatible Libraries |
|---|---|---|
| NPU | INT8 (primary); limited mixed precision (INT16/FP16 quantized internally) | TensorFlow Lite (VX Delegate), NNStreamer |
| GPU | FP32 / FP16 | TensorFlow Lite (GPU delegate) |
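Selecting an accelerator from Python amounts to loading the matching delegate library before constructing the interpreter. The sketch below locates the Vivante VX delegate and loads it; the candidate paths are assumptions based on typical NXP BSP layouts, and `tflite_runtime` is imported lazily since it only exists inside the container.

```python
import os

# Typical VX delegate locations in NXP BSP images (assumed, verify on device).
VX_DELEGATE_CANDIDATES = (
    "/usr/lib/libvx_delegate.so",
    "/usr/lib/aarch64-linux-gnu/libvx_delegate.so",
)

def find_vx_delegate(candidates=VX_DELEGATE_CANDIDATES):
    """Return the first existing VX delegate library path, or None."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None

def load_npu_delegate():
    """Load the VX delegate for NPU offload; fall back to CPU when absent."""
    from tflite_runtime.interpreter import load_delegate  # lazy import
    path = find_vx_delegate()
    return load_delegate(path) if path else None
```

The returned delegate object can be passed to the interpreter via `experimental_delegates=[delegate]`; a `None` result simply leaves inference on the CPU.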
Precision Support
| Precision | Support Level | Notes |
|---|---|---|
| FP32 | CPU, GPU | Highest accuracy, slower performance |
| FP16 | GPU; NPU (internal) | Not directly exposed on the NPU (converted internally); natively accelerated on the GPU |
| INT16 | NPU (internal mixed ops) | Some operators quantized into 16-bit internally |
| INT8 | NPU, CPU | Primary mode for NPU acceleration, best performance-per-watt |
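The INT8 mode in the table uses the standard affine quantization scheme, where a float value maps to an 8-bit integer via a per-tensor scale and zero point. The two helpers below show the arithmetic (q = round(x / scale) + zero_point, clamped to the INT8 range, and the inverse); they are a worked illustration, not part of any shipped API.

```python
def quantize(x, scale, zero_point):
    """Map a float to INT8 via the affine scheme: q = round(x/scale) + zp."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))  # clamp to the signed 8-bit range

def dequantize(q, scale, zero_point):
    """Recover the approximate float value: x ~= (q - zp) * scale."""
    return (q - zero_point) * scale
```

Values outside the representable range saturate at -128/127, which is one reason aggressive quantization can cost accuracy (see Known Limitations).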
Possible Use Cases
| Domain | Applications |
|---|---|
| Smart Surveillance & Security | Real-time object detection and person tracking using SSD MobileNet V2 models; intrusion detection, face recognition, and abnormal behavior monitoring on edge devices without cloud dependency. |
| Industrial Automation & Robotics | Defect detection in manufacturing lines with computer vision; gesture or pose estimation for human–robot collaboration; autonomous navigation and obstacle avoidance for robots and drones. |
| Healthcare & Wellness | Contactless vital sign monitoring using vision models; fall detection and activity recognition for elderly care; medical imaging assistance with lightweight segmentation models. |
| Retail & Smart Spaces | Customer flow analysis, heatmap generation, and people counting; shelf stock monitoring and automated checkout solutions; emotion detection for personalized customer experiences. |
| Transportation & Mobility | Driver monitoring (drowsiness, distraction detection); traffic analysis and smart signaling; vehicle and license plate recognition at the edge. |
Quick Start Guide
Prerequisites
- Ensure Docker and Docker Compose are installed and accessible on the device host OS.
- The default eMMC boot provides only 16 GB of storage, which is insufficient to run or build the container image; boot the host OS from an SD card of at least 32 GB.
For a container quick start, including the docker-compose file and more, please refer to the Advantech Container GitHub Repository.
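For orientation, a compose file that passes the NPU through to the container typically looks something like the sketch below. The service name, image tag, and device nodes here are placeholders, not the repository's actual file; `/dev/galcore` is the Vivante driver node on i.MX 8MP, and `USE_GPU_INFERENCE=0` is the NXP environment variable that routes OpenVX work to the NPU rather than the GPU.

```yaml
# Hypothetical docker-compose.yml sketch -- see the Advantech Container
# GitHub Repository for the authoritative file.
services:
  npu-demo:
    image: advantech/imx8mp-npu:latest   # placeholder image name
    devices:
      - /dev/galcore:/dev/galcore        # Vivante GPU/NPU driver node
      - /dev/video0:/dev/video0          # camera for vision pipelines
    volumes:
      - ./models:/opt/models             # mount models from the host
    environment:
      - USE_GPU_INFERENCE=0              # 0 routes OpenVX work to the NPU
```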
Best Practices
Precision Selection
| Topic | Description |
|---|---|
| Prefer INT8 for NPU acceleration | The i.MX8MP NPU is optimized for quantized INT8 models. Always convert to INT8 using post-training quantization or quantization-aware training for maximum performance and efficiency. |
| Fallback to FP16/FP32 when INT8 unsupported | If some operators cannot be quantized, LiteRT may run those parts on CPU/GPU in FP32. FP16 is not natively accelerated but can sometimes reduce memory usage. |
| Accuracy validation post-quantization | Benchmark the INT8 model against FP32 baseline on-device using NNStreamer pipelines to ensure accuracy is acceptable before deployment. |
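The INT8 conversion recommended above is typically done on a build host with the full TensorFlow package, using a representative dataset for calibration. The sketch below shows the standard TFLite converter flow; the `samples` argument is assumed to be a list of already-preprocessed float32 input arrays, and the import is lazy because full TensorFlow is a build-host dependency, not a device one.

```python
def representative_dataset(samples, limit=100):
    """Yield calibration inputs (preprocessed float32 arrays) one at a time."""
    for sample in samples[:limit]:
        yield [sample]

def convert_to_int8(saved_model_dir, samples):
    """Full-integer post-training quantization for NPU deployment (sketch)."""
    import tensorflow as tf  # needed on the build host only, not the device

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = lambda: representative_dataset(samples)
    # Force INT8 kernels end to end so the VX delegate can keep ops on the NPU.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()  # serialized .tflite flatbuffer bytes
```

The returned bytes are written to a `.tflite` file and benchmarked on-device against the FP32 baseline before deployment, as the table advises.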
Model Optimization
| Topic | Description |
|---|---|
| Use lightweight backbones | Models like MobileNetV1/V2, EfficientNet-Lite, SSD-MobileNet, or YOLOv4-Tiny quantized are best suited for real-time workloads on i.MX8MP. |
| Leverage pre-tested models | Start with NXP-provided sample models (mobilenet, ssd_mobilenet_v2, etc.) or Ultralytics-exported YOLO models already verified with TFLite. |
| Prune and compress | Reduce model size via pruning and weight clustering to lower memory footprint and improve NPU throughput without major accuracy loss. |
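Pruning as suggested above is commonly applied with the `tensorflow-model-optimization` toolkit before quantization. The sketch below wraps a Keras model for magnitude pruning under a polynomial sparsity schedule; the schedule parameters are illustrative, the import is lazy (build-host dependency), and `pruned_param_count` is a hypothetical helper for estimating the effect.

```python
def pruned_param_count(total_params, final_sparsity):
    """Approximate non-zero parameters remaining after magnitude pruning."""
    return int(total_params * (1.0 - final_sparsity))

def prune_keras_model(model, final_sparsity=0.5, end_step=1000):
    """Wrap a Keras model for magnitude pruning (tensorflow-model-optimization)."""
    import tensorflow_model_optimization as tfmot  # build-host dependency

    # Ramp sparsity from 0% to the target over end_step training steps.
    schedule = tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=final_sparsity,
        begin_step=0, end_step=end_step)
    return tfmot.sparsity.keras.prune_low_magnitude(
        model, pruning_schedule=schedule)
```

The wrapped model is fine-tuned briefly, stripped with `tfmot.sparsity.keras.strip_pruning`, and then quantized as in the precision-selection workflow above.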
Known Limitations
| Topic | Description |
|---|---|
| Minimum storage | A minimum of 32 GB of storage is required to run the Docker containers; the default 16 GB eMMC boot is insufficient. |
| LiteRT delegate coverage | Not all TFLite ops are supported by the NPU; unsupported layers fall back to CPU/GPU, reducing FPS. |
| Operator coverage gaps | Advanced layers (attention, deformable conv, some postprocessing ops) are not supported on NPU. Custom handling is required. |
| Quantization trade-offs | INT8 is mandatory for NPU, but aggressive quantization can degrade accuracy. |
| Resource constraints | Large models (YOLOv8l, transformers >100M params) won’t run in real-time or may not fit in memory. Stick to quantized tiny/small variants. |
| Version sensitivity | BSP release defines supported NNStreamer + LiteRT versions. Using mismatched SDKs (e.g., newer TFLite) can break delegate acceleration. |
Copyright © 2025 Advantech Corporation. All rights reserved.
