Overview

About Advantech Container Catalog (ACC)

Advantech Container Catalog is a comprehensive collection of ready-to-use, containerized software packages designed to accelerate the development and deployment of Edge AI applications. By offering pre-integrated solutions optimized for embedded hardware, it simplifies the challenges often faced with software and hardware compatibility, especially in GPU/NPU-accelerated environments.

Feature / Benefit	Description
Accelerated Edge AI Development	Ready-to-use containerized solutions for faster prototyping and deployment
Hardware Compatible	Reduces hardware and package incompatibility issues
GPU/NPU Access Ready	Supports passthrough for efficient hardware acceleration
Model Conversion & Optimization	Built-in model conversion and quantization recommendations
Optimized for CV & LLM Applications	Optimized stacks for vision and language workloads

This container, LLM MLC LLM on Qualcomm® Adreno™, delivers an on-device LLM inference stack using the MLC runtime (TVM) with Adreno GPU OpenCL acceleration on Qualcomm® QCS6490™ devices. It includes:

MLC TVM runtime + compiled model binaries
FastAPI backend exposing OpenAI-compatible endpoints (/chat/completions)
Optional OpenWebUI frontend for interactive chat
Utilities for model conversion, verification (wise-bench), and service management

Use cases: low-latency conversational AI, local inference for privacy-sensitive deployments, and offline-capable edge LLM services.

Container Demo

Use Cases

Use Case	Description
Private LLM inference on local devices	Run LLMs entirely on QCS6490 without internet — ideal for privacy-critical and air-gapped environments.
Lightweight backend for LLM APIs	Expose local models via FastAPI for LangChain, custom UIs, or automation scripts.
Multilingual assistants	Deploy translation, summarization, and conversational agents locally.
LLM evaluation and benchmarking	Swap and test quantized models (Q4/Q8) to compare performance, accuracy, and DSP utilization.
Custom offline agents	Build agents that interact with local tools (databases, APIs, sensors, MQTT) for edge decision-making.
Edge AI for industrial applications	Natural language interfaces, command parsing, and decision-support at the edge.
On-device conversational AI for robotics	Low-latency conversational AI optimized for industrial/robotics scenarios.
Edge deployments requiring offline inference	Offline, low-latency deployments for constrained connectivity.
Privacy-sensitive applications	Deploy where cloud inference is unacceptable; preserves data privacy.

Key Features

MLC TVM Runtime: TVM-compiled Llama models for efficient execution
Adreno (OpenCL) Acceleration: GPU inference via OpenCL backend on Adreno 643
Hexagon DSP Acceleration: QNN/SNPE support for quantized exec on Hexagon v68
OpenAI-Compatible API: FastAPI-based /chat/completions endpoint
OpenWebUI Integration: Lightweight frontend for real-time chat
Streaming Output: Token-by-token streaming for responsive UX
Offline-First: Fully functional without internet after model deployment
Developer Tools: mlc_cli_chat, wise-bench.sh, and example configs

Host Device Prerequisites

Item	Specification
Compatible Hardware	Advantech QCS6490-based devices (e.g., AOM-2721)
SoC	Qualcomm® QCS6490™ (soc_id-35)
GPU	Adreno™ 643 (OpenCL backend supported)
DSP/HTP	Hexagon™ 770 v68 with tensor accelerator
Memory	8GB LPDDR5
Host OS	QCOM Robotics Reference Distro with ROS 1.3-ver.1.1
Compatible BSP	Yocto 4.0 (LE1.3)

Container Environment Overview

Software Components in the Image

Component	Version	Description
MLC LLM Runtime	0.1.dev0	TVM-compiled runtime for LLM inference
Apache TVM	0.14+	Model compilation stack
Python	3.10.12	Runtime for FastAPI and utilities
FastAPI	0.116.1	OpenAI-compatible API server
OpenWebUI	0.6.5	Chat UI (optional container)
Uvicorn	Latest	ASGI server for FastAPI

Quick Start Guide

For container quick start, including docker-compose, build scripts, AI inference application source code and more, please refer to Advantech Container Repository

Including:

Model preparation & coversion Guide
Installation
AI Accelerator and Software Stack Verification
Model Parameters Customization
Prompt Guidelines
MLC LLM Logs and Troubleshooting
MLC LLM CLI Inference Sample

Supported AI Capabilities

LLM Models

Model Family	Versions	Notes
Meta Llama 3.2	1B, 3B	Converted with MLC TVM (`.bin`). 1B recommended for QCS6490 8GB configurations

Model Formats

Format	Support Level	Notes
MLC `.bin`	Full	TVM-compiled format for MLC runtime
QNN `.bin`	Partial	For DSP execution via QNN/SNPE
SNPE `.dlc`	Partial	For Hexagon DSP-specific execution

Hardware Acceleration Support

Accelerator	Support Level	Libraries / Notes
Adreno (OpenCL)	Full	TVM + MLC runtime (OpenCL backend)
Hexagon DSP	Partial	QNN / SNPE support on v68

Best Practices and Recommendations

Memory Management & Speed (MLC LLM on QCS6490)

Model Placement: Ensure models are fully loaded into the DSP/HTP memory or GPU VRAM for optimal inference performance.
Batch Inference: Use batch inference when running multiple requests to improve throughput efficiency.
Dynamic Memory Offloading: Offload unused models from DSP/HTP/GPU memory when not needed to free up resources for active workloads.
Quantization Preference: Prefer quantized models (e.g., INT8, Q4F16) to balance speed, memory usage, and accuracy.
Context Length Tuning: Reduce the maximum context length when possible to minimize memory usage without impacting task quality.
Token Management: override max_tokens to avoid unnecessarily long generations that increase latency and memory consumption.
Model Size Guidance: For best performance on QCS6490, use models with ≤3B parameters.

Known Limitations

OpenWebUI Dependencies
On the first startup, OpenWebUI installs certain dependencies. These are persisted in the associated Docker volume, so allow some time for this one-time setup to complete before use.
Model Compilation
Models must be explicitly converted for execution. Always verify the quantization format and device compatibility before running a model.
Model Size Restrictions
Models larger than 1B parameters may not run efficiently on QCS6490 due to memory constraints. GPU execution is possible but may suffer from latency or thermal throttling.
Context Length Limitations
Very long context lengths can exceed memory limits, leading to errors or performance degradation. Adjust max_tokens and context size accordingly.
Docker Storage Constraints
Running inside Docker containers can quickly consume disk space due to model weights, logs, and cache. Ensure sufficient storage is available on the device.
Streaming Support
While streaming improves responsiveness, it can cause higher memory pressure if multiple clients are connected simultaneously.