Overview
The Qwen2.5 3B AI Agent on NVIDIA Jetson™ provides a plug-and-play AI runtime for NVIDIA Jetson™ devices. It integrates the Qwen 2.5 3B model (served via Ollama) with a FastAPI-based LangChain AI agent, the EdgeSync Device Library, and the OpenWebUI interface. This container offers:
- Offline, on-device LLM inference using Qwen 2.5 3B via Ollama (no internet required post-setup)
- LangChain middleware with FastAPI for orchestrating modular pipelines
- Built-in FAISS vector database for efficient semantic search and RAG use cases
- Agent support to enable autonomous, multi-step task execution and decision-making
- Prompt memory and context handling for smarter conversations
- Streaming chat UI via OpenWebUI
- OpenAI-compatible API endpoints for seamless integration
- Customizable model parameters via modelfile & environment variables
- AI Agent integrated with EdgeSync Device Library for calling various peripheral functions via natural language prompts
- Predefined LangChain tools (functions) registered for the agent to call hardware APIs
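The tool-registration pattern in the last two bullets can be sketched without the LangChain dependency. In the container the registration is done with LangChain's `@tool` decorator; below, a plain dict stands in, and `set_fan_speed` is a hypothetical stand-in for an EdgeSync Device Library peripheral call:

```python
# Dependency-free sketch of the tool-registration pattern used by the agent.
# `set_fan_speed` is a hypothetical placeholder for an EdgeSync hardware call.
TOOLS = {}

def register_tool(fn):
    """Register a callable so the agent can dispatch to it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@register_tool
def set_fan_speed(percent: int) -> str:
    # Placeholder for a real EdgeSync peripheral function.
    percent = max(0, min(100, percent))
    return f"fan set to {percent}%"

# The agent maps a natural-language intent to a tool name plus arguments:
result = TOOLS["set_fan_speed"](percent=80)
print(result)  # fan set to 80%
```

With LangChain, the registry and dispatch logic are handled by the agent executor; only the decorated function and its docstring need to be supplied.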
Use Cases
- Predictive Maintenance Chatbots: Integrate with edge telemetry or logs to summarize anomalies, explain error codes, or recommend corrective actions using historical context.
- Compliance and Audit Q&A: Run offline LLMs trained on local policy or compliance data to assist with audits or generate summaries of regulatory alignment—ensuring data never leaves the premises.
- Safety Manual Conversational Agents: Deploy LLMs to provide instant answers from on-site safety manuals or procedures, reducing downtime and improving adherence to protocols.
- Technician Support Bots: Field service engineers can interact with the bot to troubleshoot equipment based on past repair logs, parts catalogs, and service manuals.
- Smart Edge Controllers: LLMs can translate human intent (e.g., “reduce line 2 speed by 10%”) into control commands for industrial PLCs or middleware using AI agents.
- Conversational Retrieval (RAG): Integrate with vector databases (like FAISS and ChromaDB) to retrieve relevant context from local documents and enable conversational Q&A over your custom data.
- Tool-Enabled Agents: Create intelligent agents that use calculators, APIs, or search tools as part of their reasoning process—LangChain handles the logic and LLM interface.
- Factory Incident Reporting: Ingest logs or voice input → extract incident type → summarize → trigger automated alerts or next steps
- Custom Tool-Driven Agents: Expand the system with new LangChain tools to call additional hardware functions, fetch local metrics, or trigger external workflows—all via natural language.
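The retrieval step behind the RAG use cases above can be sketched as follows. The container uses FAISS for this; NumPy stands in here to keep the sketch self-contained, and the 4-dimensional "embeddings" are toy values rather than real model output:

```python
# Sketch of RAG retrieval: score document vectors against a query vector
# by inner product and return the closest chunks. FAISS (IndexFlatIP)
# performs this step in the container; NumPy stands in here.
import numpy as np

docs = ["reset the PLC", "check the safety manual", "replace the filter"]
doc_vecs = np.array([
    [0.9, 0.1, 0.0, 0.0],   # toy embeddings, not real model output
    [0.0, 0.8, 0.2, 0.0],
    [0.1, 0.0, 0.9, 0.1],
])

def retrieve(query_vec, k=1):
    """Return the top-k documents by inner-product similarity."""
    scores = doc_vecs @ np.asarray(query_vec)
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

print(retrieve([0.0, 0.9, 0.1, 0.0]))  # ['check the safety manual']
```

The retrieved chunks are then injected into the prompt as context before the LLM generates its answer.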
Key Features
- LangChain Middleware: Agent logic with memory and modular chains
- Ollama Integration: Lightweight inference engine for quantized models
- Complete AI Framework Stack: PyTorch, TensorFlow, ONNX Runtime, and TensorRT™
- Industrial Vision Support: Accelerated OpenCV and GStreamer pipelines
- Edge AI Capabilities: Support for computer vision, LLMs, and time-series analysis
- Performance Optimized: Tuned specifically for NVIDIA® Jetson Orin™ NX 8GB
- EdgeSync Agent Integration: Integration of the EdgeSync Device Library with the agent to interact with low-level edge hardware components via natural language
Host Device Prerequisites
| Item | Specification |
|---|---|
| Compatible Hardware | Advantech devices accelerated by NVIDIA Jetson™—refer to Compatible Hardware |
| NVIDIA Jetson™ Version | 5.x |
| Host OS | Ubuntu 20.04 |
| Required Software Packages | See Software Installation below |
| Software Installation | NVIDIA Jetson™ Software Package Installation |
Container Environment Overview
Software Components on Container Image
| Component | Version | Description |
|---|---|---|
| CUDA® | 11.4.315 | GPU computing platform |
| cuDNN | 8.6.0 | Deep Neural Network library |
| TensorRT™ | 8.5.2.2 | Inference optimizer and runtime |
| PyTorch | 2.0.0+nv23.02 | Deep learning framework |
| TensorFlow | 2.12.0 | Machine learning framework |
| ONNX Runtime | 1.16.3 | Cross-platform inference engine |
| OpenCV | 4.5.0 | Computer vision library with CUDA® |
| GStreamer | 1.16.2 | Multimedia framework |
| Ollama | 0.5.7 | LLM inference engine |
| LangChain | 0.2.17 | Orchestration layer for memory, RAG, and agent workflows |
| FastAPI | 0.115.12 | API service exposing LangChain interface |
| OpenWebUI | 0.6.5 | Web interface for chat interactions |
| FAISS | 1.8.0.post1 | Vector store for RAG pipelines |
| EdgeSync | 1.0.0 | Device library, included in the container image, for low-level edge hardware interaction with the AI Agent |
Quick Start Guide
For a container quick start, including the docker-compose file and more, please refer to the README.
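Once the container is running, the OpenAI-compatible endpoint can be exercised as sketched below. The host, port, and path are assumptions about a typical deployment, so check your docker-compose file; the request itself is left commented out so the sketch runs without a live server:

```python
# Sketch of a chat-completion call against the container's
# OpenAI-compatible API. Host, port, and path are assumptions.
import json
import urllib.request

payload = {
    "model": "qwen2.5:3b",
    "messages": [{"role": "user", "content": "Summarize today's error logs."}],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment on a running deployment:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI schema, existing OpenAI client libraries can be pointed at it by overriding the base URL.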
Supported AI Capabilities
Recommended Language Models
| Model Family | Parameters | Quantization | Size | Performance |
|---|---|---|---|---|
| DeepSeek R1 | 1.5 B | Q4_K_M | 1.1 GB | ~15-17 tokens/sec |
| DeepSeek R1 | 7 B | Q4_K_M | 4.7 GB | ~5-7 tokens/sec |
| DeepSeek Coder | 1.3 B | Q4_0 | 776 MB | ~20-25 tokens/sec |
| Llama 3.2 | 1 B | Q8_0 | 1.3 GB | ~17-20 tokens/sec |
| Llama 3.2 Instruct | 1 B | Q4_0 | ~0.8 GB | ~17-20 tokens/sec |
| Llama 3.2 | 3 B | Q4_K_M | 2 GB | ~10-12 tokens/sec |
| Llama 2 | 7 B | Q4_0 | 3.8 GB | ~5-7 tokens/sec |
| Tinyllama | 1.1 B | Q4_0 | 637 MB | ~22-27 tokens/sec |
| Qwen 2.5 | 0.5 B | Q4_K_M | 398 MB | ~25-30 tokens/sec |
| Qwen 2.5 | 1.5 B | Q4_K_M | 986 MB | ~15-17 tokens/sec |
| Qwen 2.5 Coder | 0.5 B | Q8_0 | 531 MB | ~25-30 tokens/sec |
| Qwen 2.5 Coder | 1.5 B | Q4_K_M | 986 MB | ~15-17 tokens/sec |
| Qwen | 0.5 B | Q4_0 | 395 MB | ~25-30 tokens/sec |
| Qwen | 1.8 B | Q4_0 | 1.1 GB | ~15-20 tokens/sec |
| Gemma 2 | 2 B | Q4_0 | 1.6 GB | ~10-12 tokens/sec |
| Mistral | 7 B | Q4_0 | 4.1 GB | ~5-7 tokens/sec |
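Model parameters for any of the models above can be adjusted through an Ollama modelfile, as noted under Key Features. A minimal sketch, with illustrative values rather than tuned recommendations:

```
# Modelfile (illustrative values)
FROM qwen2.5:3b
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM "You are an on-device assistant for industrial equipment."
```

A custom model is then built and served with `ollama create my-agent -f Modelfile` followed by `ollama run my-agent`.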
Tuning Tips for Efficient RAG and Agent Workflows
- Use asynchronous chains and streaming response handlers to reduce latency in FastAPI endpoints.
- For RAG pipelines, use efficient vector stores (e.g., FAISS with cosine or inner product) and pre-filter data when possible.
- Avoid long chain dependencies; break workflows into smaller composable components.
- Cache prompt templates and tool results when applicable to reduce unnecessary recomputation.
- For agent-based flows, limit tool calls per loop to avoid runaway execution or high memory usage.
- Log intermediate steps (using LangChain’s callbacks) for better debugging and observability.
- Use models with ≥3B parameters (e.g., Llama 3.2 3B or larger) for agent development to ensure better reasoning depth and tool usage reliability.
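The first tip, asynchronous chains with streaming responses, comes down to the async-generator pattern below. The fake `generate` coroutine is a stand-in for a real Ollama/LangChain streaming call; in FastAPI the generator would be wrapped in a `StreamingResponse` so tokens reach the client as they are produced:

```python
# Sketch of token streaming via an async generator.
import asyncio

async def generate(prompt: str):
    # Placeholder: a real backend yields tokens as the model produces them.
    for token in ["Line", " 2", " slowed", " by", " 10%."]:
        await asyncio.sleep(0)  # yield control, as real I/O would
        yield token

async def stream(prompt: str) -> str:
    chunks = []
    async for token in generate(prompt):
        chunks.append(token)  # in FastAPI, each chunk is sent to the client
    return "".join(chunks)

print(asyncio.run(stream("reduce line 2 speed by 10%")))
# Line 2 slowed by 10%.
```

Streaming keeps time-to-first-token low even when full-response latency on a 7B model runs several seconds at the throughput figures listed above.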
Copyright © Advantech Corporation. All rights reserved.