Overview

ONNX Runtime on Qualcomm Hexagon – QCS6490

This document describes how to validate the Qualcomm NPU-enabled ONNX Runtime container on the QCS6490 platform.

1. Hardware Specifications

Component	Specification
Target Hardware	ADVANTECH AOM-2721
SoC	Qualcomm QCS6490
GPU	Adreno™ 643
DSP	Hexagon™ 770
Memory	8GB LPDDR5

2. Software Components

Component	Version	Description
Python	3.10	Runtime environment
ONNX Runtime (QNN)	1.24.1	Custom build with QNN Execution Provider (Built with QAIRT 2.43.0)
QAIRT (QNN SDK)	2.43.0	Qualcomm AI Runtime backend

Note: The custom build of onnxruntime-qnn currently works only within this container environment.
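A quick way to check that the container exposes the QNN Execution Provider is to list the providers ONNX Runtime reports. The following is a minimal sketch, not part of the container's shipped scripts; the helper names `qnn_available` and `check_environment` are hypothetical, and it assumes the `onnxruntime` module from the custom build is importable inside the container.

```python
def qnn_available(providers):
    """Return True if the QNN Execution Provider is in the given provider list."""
    return "QNNExecutionProvider" in providers


def check_environment():
    """Print the ONNX Runtime version and whether the QNN provider is registered."""
    # Imported lazily so the helper above can be used without onnxruntime installed.
    import onnxruntime as ort

    providers = ort.get_available_providers()
    print(f"onnxruntime version: {ort.__version__}")
    print(f"available providers: {providers}")
    print(f"QNN provider present: {qnn_available(providers)}")
    return qnn_available(providers)
```

Calling `check_environment()` from a Python shell inside the container should report whether `QNNExecutionProvider` is registered before any benchmark is run.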

3. Quick Start Guide

For the container quick start, including docker-compose files, build scripts, AI inference application source code, and more, please refer to the Advantech Container Repository.

4. Test ONNX Runtime with NPU capability

Run the benchmark script:

cd nycu-benchmark
python nycu-cosmoslab-onnxruntime-benchmark.py
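The source of the benchmark script is not reproduced here. As a rough illustration of what a CPU-versus-QNN comparison of this kind can look like, the sketch below builds one session per provider and times repeated `run` calls. Several details are assumptions rather than facts about the actual script: the function names, the warm-up runs, the float32 random input, and the substitution of 1 for dynamic dimensions. The `backend_path` value `libQnnHtp.so` is the HTP backend library name used by the QNN Execution Provider.

```python
import time


def average_latency_ms(run_once, iterations=100, warmup=5):
    """Call run_once repeatedly and return the mean latency in milliseconds."""
    for _ in range(warmup):  # warm-up runs are an assumption, excluded from timing
        run_once()
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    total_ms = (time.perf_counter() - start) * 1000.0
    return total_ms / iterations


def make_session(model_path, use_qnn):
    """Create an ONNX Runtime session on the CPU or on the QNN HTP backend."""
    import onnxruntime as ort  # lazy import: only needed when a session is built

    if use_qnn:
        return ort.InferenceSession(
            model_path,
            providers=["QNNExecutionProvider"],
            provider_options=[{"backend_path": "libQnnHtp.so"}],
        )
    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])


def benchmark(model_path, iterations=100):
    """Return a dict mapping each provider label to its average latency in ms."""
    import numpy as np

    results = {}
    for label, use_qnn in (("CPU Only", False), ("QNN (NPU)", True)):
        session = make_session(model_path, use_qnn)
        inp = session.get_inputs()[0]
        # Assumption: replace symbolic or unknown dimensions with 1.
        shape = [d if isinstance(d, int) else 1 for d in inp.shape]
        # Assumption: the model accepts float32 input.
        data = np.random.rand(*shape).astype(np.float32)
        results[label] = average_latency_ms(
            lambda: session.run(None, {inp.name: data}), iterations
        )
    return results
```

Inside the container, `benchmark("model.onnx")` (with a hypothetical model path) would return the two average latencies that a report like the one below is built from.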

Benchmark Result (100 Iterations)

Model: EfficientNet-B0

Quantization: w8a16

Note: The model was downloaded from Qualcomm AI Hub.

--- Initializing CPU Session ---
--- Initializing QNN Session (HTP/DSP) ---
/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.43.0/release/SNPE_SRC/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:38:dummy call to rpcmem_init, rpcmem APIs will be used from libxdsprpc
Starting stage: Graph Preparation Initializing
Completed stage: Graph Preparation Initializing (268 us)
Starting stage: Graph Optimizations
Completed stage: Graph Optimizations (603465 us)
Starting stage: Post Graph Optimization
Completed stage: Post Graph Optimization (18554 us)
Starting stage: Graph Sequencing for Target
Completed stage: Graph Sequencing for Target (100218 us)
Starting stage: VTCM Allocation
Completed stage: VTCM Allocation (25607 us)
Starting stage: Parallelization Optimization
Completed stage: Parallelization Optimization (7322 us)
Starting stage: Finalizing Graph Sequence

====== DDR bandwidth summary ======
spill_bytes=0
fill_bytes=0
write_total_bytes=65536
read_total_bytes=11130880

Completed stage: Finalizing Graph Sequence (9903 us)
Starting stage: Completion
Completed stage: Completion (551 us)

========================================
 PERFORMANCE COMPARISON (100 Iterations)
========================================
[CPU Only] Running 100 iterations...
[CPU Only] Total Time: 13687.09 ms
[CPU Only] Average Latency: 136.8709 ms
[QNN (NPU)] Running 100 iterations...
[QNN (NPU)] Total Time: 537.07 ms
[QNN (NPU)] Average Latency: 5.3707 ms

 Result: QNN is 25.48x faster than CPU (Average)

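The log shows the QNN session completing graph preparation for the HTP backend, and the average latency falls from 136.87 ms on the CPU to 5.37 ms on the NPU. This indicates that inference is offloaded to the Hexagon 770 through the QNN Execution Provider, achieving approximately 25× acceleration compared to CPU execution.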
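The speedup figure can be re-derived from the log itself. The sketch below is a hypothetical helper, not part of the benchmark; it parses the "Average Latency" and "Completed stage" lines in the format shown above, computes the CPU-to-QNN ratio, and sums the one-time graph preparation stages. The sample string contains only a few lines copied from the log.

```python
import re

LATENCY_RE = re.compile(r"^\[(?P<label>[^\]]+)\] Average Latency: (?P<ms>[\d.]+) ms")
STAGE_RE = re.compile(r"^Completed stage: (?P<name>.+) \((?P<us>\d+) us\)")


def summarize(log_text):
    """Return (latencies in ms keyed by label, total graph stage time in us)."""
    latencies = {}
    stage_total_us = 0
    for line in log_text.splitlines():
        line = line.strip()
        match = LATENCY_RE.match(line)
        if match:
            latencies[match.group("label")] = float(match.group("ms"))
            continue
        match = STAGE_RE.match(line)
        if match:
            stage_total_us += int(match.group("us"))
    return latencies, stage_total_us


def speedup(latencies, baseline="CPU Only", target="QNN (NPU)"):
    """Ratio of baseline latency to target latency."""
    return latencies[baseline] / latencies[target]


sample = """\
Completed stage: Graph Optimizations (603465 us)
Completed stage: VTCM Allocation (25607 us)
[CPU Only] Average Latency: 136.8709 ms
[QNN (NPU)] Average Latency: 5.3707 ms
"""
latencies, stage_us = summarize(sample)
print(f"{speedup(latencies):.2f}x")  # prints 25.48x
print(f"{stage_us} us")              # prints 629072 us
```

Feeding the full log to `summarize` also gives the total one-time graph preparation cost, which is paid once at session creation and is separate from the per-inference latency.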