Overview
The Qwen2.5 3B AI Agent on NVIDIA Jetson™ provides a plug-and-play AI runtime for NVIDIA Jetson™ devices. It integrates the Qwen 2.5 3B model (served via Ollama) with a FastAPI-based LangChain AI agent, the EdgeSync Device Library, and the OpenWebUI interface. This container offers:
- Offline, on-device LLM inference using Qwen 2.5 3B via Ollama (no internet required post-setup)
- LangChain middleware with FastAPI for orchestrating modular pipelines
- Built-in FAISS vector database for efficient semantic search and RAG use cases
- Agent support to enable autonomous, multi-step task execution and decision-making
- Prompt memory and context handling for smarter conversations
- Streaming chat UI via OpenWebUI
- OpenAI-compatible API endpoints for seamless integration
- Customizable model parameters via modelfile & environment variables
- AI Agent integrated with EdgeSync Device Library for calling various peripheral functions via natural language prompts
- Predefined LangChain tools (functions) registered for the agent to call hardware APIs
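The tool-registration pattern in the last two bullets can be sketched without the LangChain dependency. In the container the registration is done with LangChain's `@tool` decorator; below, a plain dict stands in, and `set_fan_speed` is a hypothetical stand-in for an EdgeSync Device Library peripheral call:

```python
# Dependency-free sketch of the tool-registration pattern used by the agent.
# `set_fan_speed` is a hypothetical placeholder for an EdgeSync hardware call.
TOOLS = {}

def register_tool(fn):
    """Register a callable so the agent can dispatch to it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@register_tool
def set_fan_speed(percent: int) -> str:
    # Placeholder for a real EdgeSync peripheral function.
    percent = max(0, min(100, percent))
    return f"fan set to {percent}%"

# The agent maps a natural-language intent to a tool name plus arguments:
result = TOOLS["set_fan_speed"](percent=80)
print(result)  # fan set to 80%
```

With LangChain, the registry and dispatch logic are handled by the agent executor; only the decorated function and its docstring need to be supplied.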
Use Cases
- Predictive Maintenance Chatbots: Integrate with edge telemetry or logs to summarize anomalies, explain error codes, or recommend corrective actions using historical context.
- Compliance and Audit Q&A: Run offline LLMs trained on local policy or compliance data to assist with audits or generate summaries of regulatory alignment—ensuring data never leaves the premises.
- Safety Manual Conversational Agents: Deploy LLMs to provide instant answers from on-site safety manuals or procedures, reducing downtime and improving adherence to protocols.
- Technician Support Bots: Field service engineers can interact with the bot to troubleshoot equipment based on past repair logs, parts catalogs, and service manuals.
- Smart Edge Controllers: LLMs can translate human intent (e.g., “reduce line 2 speed by 10%”) into control commands for industrial PLCs or middleware using AI agents.
- Conversational Retrieval (RAG): Integrate with vector databases (like FAISS and ChromaDB) to retrieve relevant context from local documents and enable conversational Q&A over your custom data.
- Tool-Enabled Agents: Create intelligent agents that use calculators, APIs, or search tools as part of their reasoning process—LangChain handles the logic and LLM interface.
- Factory Incident Reporting: Ingest logs or voice input → extract incident type → summarize → trigger automated alerts or next steps
- Custom Tool-Driven Agents: Expand the system with new LangChain tools to call additional hardware functions, fetch local metrics, or trigger external workflows—all via natural language.
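The retrieval step behind the RAG use cases above can be sketched as follows. The container uses FAISS for this; NumPy stands in here to keep the sketch self-contained, and the 4-dimensional "embeddings" are toy values rather than real model output:

```python
# Sketch of RAG retrieval: score document vectors against a query vector
# by inner product and return the closest chunks. FAISS (IndexFlatIP)
# performs this step in the container; NumPy stands in here.
import numpy as np

docs = ["reset the PLC", "check the safety manual", "replace the filter"]
doc_vecs = np.array([
    [0.9, 0.1, 0.0, 0.0],   # toy embeddings, not real model output
    [0.0, 0.8, 0.2, 0.0],
    [0.1, 0.0, 0.9, 0.1],
])

def retrieve(query_vec, k=1):
    """Return the top-k documents by inner-product similarity."""
    scores = doc_vecs @ np.asarray(query_vec)
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

print(retrieve([0.0, 0.9, 0.1, 0.0]))  # ['check the safety manual']
```

The retrieved chunks are then injected into the prompt as context before the LLM generates its answer.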
Key Features
- LangChain Middleware: Agent logic with memory and modular chains
- Ollama Integration: Lightweight inference engine for quantized models
- Complete AI Framework Stack: PyTorch, TensorFlow, ONNX Runtime, and TensorRT™
- Industrial Vision Support: Accelerated OpenCV and GStreamer pipelines
- Edge AI Capabilities: Support for computer vision, LLMs, and time-series analysis
- Performance Optimized: Tuned specifically for NVIDIA® Jetson Orin™ NX 8GB
- EdgeSync Agent Integration: Integration of the EdgeSync Device Library with the agent to interact with low-level edge hardware components via natural language
Host Device Prerequisites
| Item | Specification |
|---|---|
| Compatible Hardware | Advantech devices accelerated by NVIDIA Jetson™—refer to Compatible Hardware |
| NVIDIA Jetson™ Version | 5.x |
| Host OS | Ubuntu 20.04 |
| Required Software Packages | See Software Installation below |
| Software Installation | NVIDIA Jetson™ Software Package Installation |
Container Environment Overview
Software Components on Container Image
| Component | Version | Description |
|---|---|---|
| CUDA® | 11.4.315 | GPU computing platform |
| cuDNN | 8.6.0 | Deep Neural Network library |
| TensorRT™ | 8.5.2.2 | Inference optimizer and runtime |
| PyTorch | 2.0.0+nv23.02 | Deep learning framework |
| TensorFlow | 2.12.0 | Machine learning framework |
| ONNX Runtime | 1.16.3 | Cross-platform inference engine |
| OpenCV | 4.5.0 | Computer vision library with CUDA® |
| GStreamer | 1.16.2 | Multimedia framework |
| Ollama | 0.5.7 | LLM inference engine |
| LangChain | 0.2.17 | Orchestration layer for memory, RAG, and agent workflows |
| FastAPI | 0.115.12 | API service exposing LangChain interface |
| OpenWebUI | 0.6.5 | Web interface for chat interactions |
| FAISS | 1.8.0.post1 | Vector store for RAG pipelines |
| EdgeSync | 1.0.0 | Device library, included in the container image, for low-level edge hardware interaction with the AI Agent |
Quick Start Guide
For a container quick start, including the docker-compose file and more, please refer to the README.
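Once the container is running, the OpenAI-compatible endpoint can be exercised as sketched below. The host, port, and path are assumptions about a typical deployment, so check your docker-compose file; the request itself is left commented out so the sketch runs without a live server:

```python
# Sketch of a chat-completion call against the container's
# OpenAI-compatible API. Host, port, and path are assumptions.
import json
import urllib.request

payload = {
    "model": "qwen2.5:3b",
    "messages": [{"role": "user", "content": "Summarize today's error logs."}],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment on a running deployment:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI schema, existing OpenAI client libraries can be pointed at it by overriding the base URL.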
Supported AI Capabilities
Recommended Language Models
| Model Family | Parameters | Quantization | Size | Performance |
|---|---|---|---|---|
| DeepSeek R1 | 1.5 B | Q4_K_M | 1.1 GB | ~15-17 tokens/sec |
| DeepSeek R1 | 7 B | Q4_K_M | 4.7 GB | ~5-7 tokens/sec |
| DeepSeek Coder | 1.3 B | Q4_0 | 776 MB | ~20-25 tokens/sec |
| Llama 3.2 | 1 B | Q8_0 | 1.3 GB | ~17-20 tokens/sec |
| Llama 3.2 Instruct | 1 B | Q4_0 | ~0.8 GB | ~17-20 tokens/sec |
| Llama 3.2 | 3 B | Q4_K_M | 2 GB | ~10-12 tokens/sec |
| Llama 2 | 7 B | Q4_0 | 3.8 GB | ~5-7 tokens/sec |
| Tinyllama | 1.1 B | Q4_0 | 637 MB | ~22-27 tokens/sec |
| Qwen 2.5 | 0.5 B | Q4_K_M | 398 MB | ~25-30 tokens/sec |
| Qwen 2.5 | 1.5 B | Q4_K_M | 986 MB | ~15-17 tokens/sec |
| Qwen 2.5 Coder | 0.5 B | Q8_0 | 531 MB | ~25-30 tokens/sec |
| Qwen 2.5 Coder | 1.5 B | Q4_K_M | 986 MB | ~15-17 tokens/sec |
| Qwen | 0.5 B | Q4_0 | 395 MB | ~25-30 tokens/sec |
| Qwen | 1.8 B | Q4_0 | 1.1 GB | ~15-20 tokens/sec |
| Gemma 2 | 2 B | Q4_0 | 1.6 GB | ~10-12 tokens/sec |
| Mistral | 7 B | Q4_0 | 4.1 GB | ~5-7 tokens/sec |
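Model parameters for any of the models above can be adjusted through an Ollama modelfile, as noted under Key Features. A minimal sketch, with illustrative values rather than tuned recommendations:

```
# Modelfile (illustrative values)
FROM qwen2.5:3b
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM "You are an on-device assistant for industrial equipment."
```

A custom model is then built and served with `ollama create my-agent -f Modelfile` followed by `ollama run my-agent`.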
Tuning Tips for Efficient RAG and Agent Workflows
- Use asynchronous chains and streaming response handlers to reduce latency in FastAPI endpoints.
- For RAG pipelines, use efficient vector stores (e.g., FAISS with cosine or inner product) and pre-filter data when possible.
- Avoid long chain dependencies; break workflows into smaller composable components.
- Cache prompt templates and tool results when applicable to reduce unnecessary recomputation.
- For agent-based flows, limit tool calls per loop to avoid runaway execution or high memory usage.
- Log intermediate steps (using LangChain’s callbacks) for better debugging and observability.
- Use models with ≥3B parameters (e.g., Llama 3.2 3B or larger) for agent development to ensure better reasoning depth and tool usage reliability.
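The first tip, asynchronous chains with streaming responses, comes down to the async-generator pattern below. The fake `generate` coroutine is a stand-in for a real Ollama/LangChain streaming call; in FastAPI the generator would be wrapped in a `StreamingResponse` so tokens reach the client as they are produced:

```python
# Sketch of token streaming via an async generator.
import asyncio

async def generate(prompt: str):
    # Placeholder: a real backend yields tokens as the model produces them.
    for token in ["Line", " 2", " slowed", " by", " 10%."]:
        await asyncio.sleep(0)  # yield control, as real I/O would
        yield token

async def stream(prompt: str) -> str:
    chunks = []
    async for token in generate(prompt):
        chunks.append(token)  # in FastAPI, each chunk is sent to the client
    return "".join(chunks)

print(asyncio.run(stream("reduce line 2 speed by 10%")))
# Line 2 slowed by 10%.
```

Streaming keeps time-to-first-token low even when full-response latency on a 7B model runs several seconds at the throughput figures listed above.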
Copyright © Advantech Corporation. All rights reserved.