Overview
ONNX Runtime on Qualcomm Hexagon – QCS6490
Version: 1.0
Release Date: Feb 2026
Copyright: © 2026 Advantech Corporation & NYCU COSMOS Lab. All rights reserved.
This document describes how to validate the Qualcomm NPU-enabled ONNX Runtime container on the QCS6490 platform.
1. Hardware Specifications
| Component | Specification |
|---|---|
| Target Hardware | ADVANTECH AOM-2721 |
| SoC | Qualcomm QCS6490 |
| GPU | Adreno™ 643 |
| DSP | Hexagon™ 770 |
| Memory | 8 GB LPDDR5 |
2. Software Components
| Component | Version | Description |
|---|---|---|
| Python | 3.10 | Runtime environment |
| ONNX Runtime (QNN) | 1.24.1 | Custom build with QNN Execution Provider (Built with QAIRT 2.43.0) |
| QAIRT (QNN SDK) | 2.43.0 | Qualcomm AI Runtime backend |
Note: The custom build of onnxruntime-qnn currently only works within this container environment.
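One way to confirm, from inside the container, that the custom build exposes the QNN Execution Provider is to inspect the list returned by `onnxruntime.get_available_providers()`. The following is a minimal sketch and not part of the shipped container. The helper names `has_qnn_provider` and `available_providers` are hypothetical, and the import is guarded so the script also runs where ONNX Runtime is not installed.

```python
"""Check whether the installed ONNX Runtime build exposes the QNN EP."""

QNN_PROVIDER = "QNNExecutionProvider"


def has_qnn_provider(providers):
    """Return True if the QNN Execution Provider is in the given list."""
    return QNN_PROVIDER in providers


def available_providers():
    """Return ONNX Runtime's provider list, or [] if it is not installed."""
    try:
        import onnxruntime as ort
    except ImportError:
        return []
    return ort.get_available_providers()


if __name__ == "__main__":
    providers = available_providers()
    print("Available providers:", providers)
    print("QNN EP available:", has_qnn_provider(providers))
```

If `QNNExecutionProvider` is missing from the list, the interpreter is most likely not using the container's custom `onnxruntime-qnn` build.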
3. Quick Start Guide
For the container quick start, including docker-compose files, build scripts, AI inference application source code, and more, refer to the Advantech Container Repository.
4. Test ONNX Runtime with NPU capability
Run the benchmark script:
cd nycu-benchmark
python nycu-cosmoslab-onnxruntime-benchmark.py
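The benchmark script itself is not reproduced in this document. As a rough sketch of how a CPU-versus-QNN comparison could be structured with the ONNX Runtime Python API, the example below times repeated `session.run` calls for each provider. Everything specific here is an assumption for illustration: the helper names (`time_runs`, `make_session`, `benchmark`), the warm-up runs, the `(1, 3, 224, 224)` float32 input, and the `libQnnHtp.so` backend path for the HTP backend. The actual script may differ.

```python
import time


def time_runs(run_once, iterations=100, warmup=5):
    """Return (total_ms, average_ms) for `iterations` timed calls of run_once().

    The warm-up calls are executed first and are not timed (an assumption;
    the original script may not warm up at all).
    """
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    total_ms = (time.perf_counter() - start) * 1000.0
    return total_ms, total_ms / iterations


def make_session(model_path, use_qnn):
    """Create an InferenceSession on the QNN EP (HTP backend) or on the CPU."""
    import onnxruntime as ort  # lazy import: needs the container's custom build

    if use_qnn:
        return ort.InferenceSession(
            model_path,
            providers=["QNNExecutionProvider"],
            # Assumed backend library name for the HTP backend on Linux.
            provider_options=[{"backend_path": "libQnnHtp.so"}],
        )
    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])


def benchmark(model_path, input_shape=(1, 3, 224, 224), iterations=100):
    """Time CPU and QNN sessions on random input; shape and dtype are assumed."""
    import numpy as np

    results = {}
    for label, use_qnn in (("CPU Only", False), ("QNN (NPU)", True)):
        session = make_session(model_path, use_qnn)
        input_name = session.get_inputs()[0].name
        feed = {input_name: np.random.rand(*input_shape).astype(np.float32)}
        results[label] = time_runs(lambda: session.run(None, feed), iterations)
    return results
```

Calling `benchmark("model.onnx")` (a placeholder path) inside the container would return a dictionary mapping each label to its total and average latency in milliseconds.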
Benchmark Result (100 Iterations)
Model: EfficientNet-B0
Quantization: w8a16
Note: The model was downloaded from Qualcomm AI Hub.
--- Initializing CPU Session ---
--- Initializing QNN Session (HTP/DSP) ---
/prj/qct/webtech_scratch20/mlg_user_admin/qaisw_source_repo/rel/qairt-2.43.0/release/SNPE_SRC/avante-tools/prebuilt/dsp/hexagon-sdk-5.5.5/ipc/fastrpc/rpcmem/src/rpcmem_android.c:38:dummy call to rpcmem_init, rpcmem APIs will be used from libxdsprpc
Starting stage: Graph Preparation Initializing
Completed stage: Graph Preparation Initializing (268 us)
Starting stage: Graph Optimizations
Completed stage: Graph Optimizations (603465 us)
Starting stage: Post Graph Optimization
Completed stage: Post Graph Optimization (18554 us)
Starting stage: Graph Sequencing for Target
Completed stage: Graph Sequencing for Target (100218 us)
Starting stage: VTCM Allocation
Completed stage: VTCM Allocation (25607 us)
Starting stage: Parallelization Optimization
Completed stage: Parallelization Optimization (7322 us)
Starting stage: Finalizing Graph Sequence
====== DDR bandwidth summary ======
spill_bytes=0
fill_bytes=0
write_total_bytes=65536
read_total_bytes=11130880
Completed stage: Finalizing Graph Sequence (9903 us)
Starting stage: Completion
Completed stage: Completion (551 us)
========================================
PERFORMANCE COMPARISON (100 Iterations)
========================================
[CPU Only] Running 100 iterations...
[CPU Only] Total Time: 13687.09 ms
[CPU Only] Average Latency: 136.8709 ms
[QNN (NPU)] Running 100 iterations...
[QNN (NPU)] Total Time: 537.07 ms
[QNN (NPU)] Average Latency: 5.3707 ms
Result: QNN is 25.48x faster than CPU (Average)
The results indicate that inference is offloaded to the Hexagon 770 through the QNN Execution Provider: average latency drops from 136.87 ms on the CPU to 5.37 ms on the NPU, a speedup of approximately 25×.
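The reported figures can be re-derived from the log. The sketch below recomputes the average latencies and the speedup from the total times, and sums the "Completed stage" durations. Those stage lines appear before the timed iterations, so they look like a one-time session-creation cost (about 0.77 s in this run) rather than part of the per-inference latency. The function name `stage_durations_us` is hypothetical; all numbers are copied from the log above.

```python
import re

# Values copied from the benchmark log in this document.
ITERATIONS = 100
CPU_TOTAL_MS = 13687.09
QNN_TOTAL_MS = 537.07

STAGE_LOG = """\
Completed stage: Graph Preparation Initializing (268 us)
Completed stage: Graph Optimizations (603465 us)
Completed stage: Post Graph Optimization (18554 us)
Completed stage: Graph Sequencing for Target (100218 us)
Completed stage: VTCM Allocation (25607 us)
Completed stage: Parallelization Optimization (7322 us)
Completed stage: Finalizing Graph Sequence (9903 us)
Completed stage: Completion (551 us)
"""


def stage_durations_us(log_text):
    """Map each completed stage name to its duration in microseconds."""
    pattern = re.compile(r"Completed stage: (.+) \((\d+) us\)")
    return {name: int(us) for name, us in pattern.findall(log_text)}


cpu_avg_ms = CPU_TOTAL_MS / ITERATIONS
qnn_avg_ms = QNN_TOTAL_MS / ITERATIONS
speedup = cpu_avg_ms / qnn_avg_ms

stages = stage_durations_us(STAGE_LOG)
prep_total_ms = sum(stages.values()) / 1000.0

print(f"CPU average latency: {cpu_avg_ms:.4f} ms")
print(f"QNN average latency: {qnn_avg_ms:.4f} ms")
print(f"Speedup: {speedup:.2f}x")
print(f"One-time graph preparation: {prep_total_ms:.3f} ms")
```

Graph Optimizations dominates the preparation time at 603,465 us of the 765,888 us total.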
