Catalog

Overview

About Advantech Container Catalog (ACC)

Advantech Container Catalog is a comprehensive collection of ready-to-use, containerized software packages designed to accelerate the development and deployment of Edge AI applications. By offering pre-integrated solutions optimized for embedded hardware, it simplifies the challenges often faced with software and hardware compatibility, especially in GPU/NPU-accelerated environments.

Feature / Benefit Description
Accelerated Edge AI Development Ready-to-use containerized solutions for faster prototyping and deployment
Hardware Compatible Reduces hardware and package incompatibility issues
GPU/NPU Access Ready Supports passthrough for efficient hardware acceleration
Model Conversion & Optimization Built-in model conversion and quantization recommendations
Optimized for CV & LLM Applications Optimized stacks for vision and language workloads


This container, LLM MLC LLM on Qualcomm® Adreno™, delivers an on-device LLM inference stack using the MLC runtime (TVM) with Adreno GPU OpenCL acceleration on Qualcomm® QCS6490™ devices. It includes:

  • MLC TVM runtime + compiled model binaries
  • FastAPI backend exposing OpenAI-compatible endpoints (/chat/completions)
  • Optional OpenWebUI frontend for interactive chat
  • Utilities for model conversion, verification (wise-bench), and service management

Use cases: low-latency conversational AI, local inference for privacy-sensitive deployments, and offline-capable edge LLM services.


Container Demo


Use Cases

Use Case Description
Private LLM inference on local devices Run LLMs entirely on QCS6490 without internet — ideal for privacy-critical and air-gapped environments.
Lightweight backend for LLM APIs Expose local models via FastAPI for LangChain, custom UIs, or automation scripts.
Multilingual assistants Deploy translation, summarization, and conversational agents locally.
LLM evaluation and benchmarking Swap and test quantized models (Q4/Q8) to compare performance, accuracy, and DSP utilization.
Custom offline agents Build agents that interact with local tools (databases, APIs, sensors, MQTT) for edge decision-making.
Edge AI for industrial applications Natural language interfaces, command parsing, and decision-support at the edge.
On-device conversational AI for robotics Low-latency conversational AI optimized for industrial/robotics scenarios.
Edge deployments requiring offline inference Offline, low-latency deployments for constrained connectivity.
Privacy-sensitive applications Deploy where cloud inference is unacceptable; preserves data privacy.

Key Features

  • MLC TVM Runtime: TVM-compiled Llama models for efficient execution
  • Adreno (OpenCL) Acceleration: GPU inference via OpenCL backend on Adreno 643
  • Hexagon DSP Acceleration: QNN/SNPE support for quantized exec on Hexagon v68
  • OpenAI-Compatible API: FastAPI-based /chat/completions endpoint
  • OpenWebUI Integration: Lightweight frontend for real-time chat
  • Streaming Output: Token-by-token streaming for responsive UX
  • Offline-First: Fully functional without internet after model deployment
  • Developer Tools: mlc_cli_chat, wise-bench.sh, and example configs

Host Device Prerequisites

Item Specification
Compatible Hardware Advantech QCS6490-based devices (e.g., AOM-2721)
SoC Qualcomm® QCS6490™ (soc_id-35)
GPU Adreno™ 643 (OpenCL backend supported)
DSP/HTP Hexagon™ 770 v68 with tensor accelerator
Memory 8GB LPDDR5
Host OS QCOM Robotics Reference Distro with ROS 1.3-ver.1.1
Compatible BSP Yocto 4.0 (LE1.3)

Container Environment Overview

Software Components in the Image

Component Version Description
MLC LLM Runtime 0.1.dev0 TVM-compiled runtime for LLM inference
Apache TVM 0.14+ Model compilation stack
Python 3.10.12 Runtime for FastAPI and utilities
FastAPI 0.116.1 OpenAI-compatible API server
OpenWebUI 0.6.5 Chat UI (optional container)
Uvicorn Latest ASGI server for FastAPI

Quick Start Guide

For container quick start, including docker-compose, build scripts, AI inference application source code and more, please refer to Advantech Container Repository

Including:

  • Model preparation & coversion Guide
  • Installation
  • AI Accelerator and Software Stack Verification
  • Model Parameters Customization
  • Prompt Guidelines
  • MLC LLM Logs and Troubleshooting
  • MLC LLM CLI Inference Sample

Supported AI Capabilities

LLM Models

Model Family Versions Notes
Meta Llama 3.2 1B, 3B Converted with MLC TVM (.bin). 1B recommended for QCS6490 8GB configurations

Model Formats

Format Support Level Notes
MLC .bin Full TVM-compiled format for MLC runtime
QNN .bin Partial For DSP execution via QNN/SNPE
SNPE .dlc Partial For Hexagon DSP-specific execution

Hardware Acceleration Support

Accelerator Support Level Libraries / Notes
Adreno (OpenCL) Full TVM + MLC runtime (OpenCL backend)
Hexagon DSP Partial QNN / SNPE support on v68

Best Practices and Recommendations

Memory Management & Speed (MLC LLM on QCS6490)

  • Model Placement: Ensure models are fully loaded into the DSP/HTP memory or GPU VRAM for optimal inference performance.
  • Batch Inference: Use batch inference when running multiple requests to improve throughput efficiency.
  • Dynamic Memory Offloading: Offload unused models from DSP/HTP/GPU memory when not needed to free up resources for active workloads.
  • Quantization Preference: Prefer quantized models (e.g., INT8, Q4F16) to balance speed, memory usage, and accuracy.
  • Context Length Tuning: Reduce the maximum context length when possible to minimize memory usage without impacting task quality.
  • Token Management: override max_tokens to avoid unnecessarily long generations that increase latency and memory consumption.
  • Model Size Guidance: For best performance on QCS6490, use models with ≤3B parameters.

Known Limitations

  1. OpenWebUI Dependencies
    On the first startup, OpenWebUI installs certain dependencies. These are persisted in the associated Docker volume, so allow some time for this one-time setup to complete before use.

  2. Model Compilation
    Models must be explicitly converted for execution. Always verify the quantization format and device compatibility before running a model.

  3. Model Size Restrictions
    Models larger than 1B parameters may not run efficiently on QCS6490 due to memory constraints. GPU execution is possible but may suffer from latency or thermal throttling.

  4. Context Length Limitations
    Very long context lengths can exceed memory limits, leading to errors or performance degradation. Adjust max_tokens and context size accordingly.

  5. Docker Storage Constraints
    Running inside Docker containers can quickly consume disk space due to model weights, logs, and cache. Ensure sufficient storage is available on the device.

  6. Streaming Support
    While streaming improves responsiveness, it can cause higher memory pressure if multiple clients are connected simultaneously.


Copyright © Advantech Corporation. All rights reserved.