Overview
Container Overview
The DeepSeek-R1 1.5B LangChain AI Agent (RAG) on NVIDIA Jetson™ delivers a plug-and-play AI runtime purpose-built for Retrieval-Augmented Generation workflows on edge devices. This container integrates the DeepSeek R1 1.5B model (served via Ollama) with LangChain's FastAPI middleware and OpenWebUI, providing a complete, lightweight, GPU-accelerated solution for real-time, offline RAG applications.
Designed for edge AI workflows, it offers an efficient development environment to implement document-grounded Q&A, contextual assistants, and autonomous agents—all running locally on Jetson devices. The prebuilt sample enables developers to quickly build custom RAG pipelines with hardware-optimized performance for intelligent, context-aware applications.
This container enables:
Feature | Description |
---|---|
Offline LLM Inference | DeepSeek R1 1.5B via Ollama (no internet required post-setup) |
LangChain Middleware | FastAPI orchestration for modular pipelines |
FAISS Vector Database | Built-in semantic search for efficient RAG |
Agent Support | Autonomous multi-step task execution |
Context Management | Prompt memory for smarter conversations |
Streaming Chat UI | OpenWebUI interface |
OpenAI-Compatible API | Drop-in endpoints for OpenAI-style clients (see the example below) |
Customizable Parameters | Modelfile & environment variable configuration |
RAG Sample | Prebuilt code demonstrating RAG implementation |
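
As a quick illustration of the OpenAI-compatible endpoint, the sketch below sends a chat-completion request with Python's requests library. The host, port (Ollama's default, 11434), and model tag are assumptions; adjust them to match how your deployment maps ports.

```python
import requests

# Assumed endpoint: Ollama's OpenAI-compatible API on its default port 11434.
API_URL = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "deepseek-r1:1.5b",  # model tag as served by Ollama (assumption)
    "messages": [
        {"role": "user", "content": "Summarize what RAG is in one sentence."}
    ],
    "stream": False,
}

resp = requests.post(API_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```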
Use Cases
Use Case | Description |
---|---|
Legal Document Assistant | Query contracts, case law, or legal memos without cloud exposure |
Internal SOP Assistant | Smart assistant for Standard Operating Procedures across departments |
Medical Protocol Access | Offline retrieval from medical guidelines and drug data in low-connectivity zones |
Compliance & Audit Q&A | Offline LLMs for policy compliance and regulatory alignment summaries |
Safety Manual Agents | Instant answers from safety manuals to improve protocol adherence |
Conversational RAG | Extend container capabilities for custom RAG development |
Tool-Enabled Agents | Intelligent agents with calculator, API, and search tool integration |
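
To make the tool-enabled agent use case concrete, here is a minimal sketch of a ReAct-style LangChain agent with a calculator tool running against the locally served model. It uses the classic initialize_agent helper (deprecated but still available in LangChain 0.2.x); the tool, prompt, and model tag are illustrative assumptions, not the container's prebuilt sample. Because small models may follow tool-use formats unreliably (see the tuning tips below), parsing errors are tolerated and iterations capped.

```python
from langchain.agents import AgentType, Tool, initialize_agent
from langchain_community.llms import Ollama

# Local model served by Ollama inside the container (tag is an assumption).
llm = Ollama(model="deepseek-r1:1.5b", base_url="http://localhost:11434")

def calculator(expression: str) -> str:
    """Evaluate a simple arithmetic expression, e.g. '12 * 7'."""
    return str(eval(expression, {"__builtins__": {}}))  # demo only; not hardened

tools = [
    Tool(
        name="calculator",
        func=calculator,
        description="Evaluates arithmetic expressions such as '2 + 2 * 10'.",
    )
]

# Cap iterations and tolerate malformed model output to avoid runaway loops.
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    max_iterations=3,
    handle_parsing_errors=True,
    verbose=True,
)

print(agent.run("What is 23 * 7, and is it greater than 150?"))
```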
Key Features
Technical Components
Component | Description |
---|---|
LangChain Middleware | Agent logic with memory and modular chains |
Prebuilt RAG Example | PDF retrieval application for reference/extension |
Ollama Integration | Lightweight inference engine for quantized models |
AI Framework Stack | PyTorch, TensorFlow, ONNX Runtime, and TensorRT™ |
Industrial Vision | Accelerated OpenCV and GStreamer pipelines |
Edge AI Capabilities | Computer vision, LLMs, and time-series analysis |
Performance Optimized | Tuned for NVIDIA® Jetson Orin™ NX 8GB |
RAG/Agent Support | Accelerated development environment for agent workflows |
Host Device Prerequisites
Item | Specification |
---|---|
Compatible Hardware | Advantech devices accelerated by NVIDIA Jetson™ (refer to Compatible hardware) |
NVIDIA Jetson™ Version | 5.x |
Host OS | Ubuntu 20.04 |
Required Software Packages | Refer to Software Installation below |
Software Installation | NVIDIA Jetson™ Software Package Installation |
Container Environment Overview
Software Components on Container Image
Component | Version | Description |
---|---|---|
CUDA® | 11.4.315 | GPU computing platform |
cuDNN | 8.6.0 | Deep Neural Network library |
TensorRT™ | 8.5.2.2 | Inference optimizer and runtime |
PyTorch | 2.0.0+nv23.02 | Deep learning framework |
TensorFlow | 2.12.0 | Machine learning framework |
ONNX Runtime | 1.16.3 | Cross-platform inference engine |
OpenCV | 4.5.0 | Computer vision library with CUDA® |
GStreamer | 1.16.2 | Multimedia framework |
Ollama | 0.5.7 | LLM inference engine |
LangChain | 0.2.17 | Orchestration layer for memory, RAG, and agent workflows |
FastAPI | 0.115.12 | API service exposing LangChain interface |
OpenWebUI | 0.6.5 | Web interface for chat interactions |
FAISS | 1.8.0.post1 | Vector store for RAG pipelines |
RAG Code Sample | N/A | Sample code demonstrating how to build a RAG pipeline |
Sentence-T5-Base | N/A | sentence-t5-base embedding model, pulled from Hugging Face |
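
The components above compose into a working RAG pipeline in a few lines. The sketch below embeds a handful of text snippets with the bundled sentence-t5-base model, indexes them in FAISS, and answers a question through the Ollama-served LLM. The sample texts, model tags, and base URL are assumptions; the container's prebuilt sample shows the full PDF-based flow.

```python
from langchain.chains import RetrievalQA
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import FAISS

# Embedding model bundled with the container (pulled from Hugging Face).
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/sentence-t5-base")

# Illustrative documents; a real pipeline would load and split PDFs instead.
docs = [
    "Jetson Orin NX 8GB is an edge AI module from NVIDIA.",
    "FAISS provides fast similarity search over dense vectors.",
    "Ollama serves quantized GGUF models locally.",
]
db = FAISS.from_texts(docs, embeddings)

llm = Ollama(model="deepseek-r1:1.5b", base_url="http://localhost:11434")

# Retrieval-augmented QA: fetch the top-k chunks, then generate a grounded answer.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=db.as_retriever(search_kwargs={"k": 2}),
)
print(qa.invoke({"query": "What does FAISS do?"})["result"])
```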
Container Quick Start Guide
For a container quick start, including the docker-compose file and more, refer to the README.
Supported AI Capabilities
Language Model Recommendations
Model Family | Parameters | Quantization | Size | Performance |
---|---|---|---|---|
DeepSeek R1 | 1.5 B | Q4_K_M | 1.1 GB | ~15-17 tokens/sec |
DeepSeek R1 | 7 B | Q4_K_M | 4.7 GB | ~5-7 tokens/sec |
DeepSeek Coder | 1.3 B | Q4_0 | 776 MB | ~20-25 tokens/sec |
Llama 3.2 | 1 B | Q8_0 | 1.3 GB | ~17-20 tokens/sec |
Llama 3.2 Instruct | 1 B | Q4_0 | ~0.8 GB | ~17-20 tokens/sec |
Llama 3.2 | 3 B | Q4_K_M | 2 GB | ~10-12 tokens/sec |
Llama 2 | 7 B | Q4_0 | 3.8 GB | ~5-7 tokens/sec |
Tinyllama | 1.1 B | Q4_0 | 637 MB | ~22-27 tokens/sec |
Qwen 2.5 | 0.5 B | Q4_K_M | 398 MB | ~25-30 tokens/sec |
Qwen 2.5 | 1.5 B | Q4_K_M | 986 MB | ~15-17 tokens/sec |
Qwen 2.5 Coder | 0.5 B | Q8_0 | 531 MB | ~25-30 tokens/sec |
Qwen 2.5 Coder | 1.5 B | Q4_K_M | 986 MB | ~15-17 tokens/sec |
Qwen | 0.5 B | Q4_0 | 395 MB | ~25-30 tokens/sec |
Qwen | 1.8 B | Q4_0 | 1.1 GB | ~15-20 tokens/sec |
Gemma 2 | 2 B | Q4_0 | 1.6 GB | ~10-12 tokens/sec |
Mistral | 7 B | Q4_0 | 4.1 GB | ~5-7 tokens/sec |
Tuning Tips for Efficient RAG and Agent Workflows
- Use asynchronous chains and streaming response handlers to reduce latency in FastAPI endpoints.
- For RAG pipelines, use efficient vector stores (e.g., FAISS with cosine or inner-product similarity) and pre-filter data when possible.
- Avoid long chain dependencies; break workflows into smaller composable components.
- Cache prompt templates and tool results where applicable to avoid unnecessary recomputation.
- For agent-based flows, limit tool calls per loop to avoid runaway execution or high memory usage.
- Log intermediate steps (using LangChain's callbacks) for better debugging and observability.
- Use models with ≥3B parameters (e.g., Llama 3.2 3B or larger) for agent development to ensure better reasoning depth and more reliable tool usage.
- Use FAISS or Chroma with pre-computed embeddings for faster retrieval.
- Apply score thresholding in the retriever configuration to filter out irrelevant documents.
- Keep a persistent vector DB (e.g., a saved FAISS index) to avoid re-indexing on container restart; the sketch after this list combines this with score thresholding.
- Choose embedding models whose size and precision fit the use case.
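
The persistence and thresholding tips can be combined as in the sketch below: build and save the FAISS index once, reload it on subsequent container starts, and attach a score-threshold retriever. The index path, corpus, and threshold value are assumptions to tune for your data.

```python
import os

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

INDEX_DIR = "/data/faiss_index"  # hypothetical persistent path (e.g., a mounted volume)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/sentence-t5-base")

if os.path.isdir(INDEX_DIR):
    # Reload the saved index instead of re-embedding the corpus on every restart.
    db = FAISS.load_local(INDEX_DIR, embeddings, allow_dangerous_deserialization=True)
else:
    db = FAISS.from_texts(["Example corpus text."], embeddings)
    db.save_local(INDEX_DIR)

# Drop weak matches instead of padding the prompt with irrelevant context.
retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5, "k": 4},  # threshold is data-dependent
)
print(retriever.invoke("example query"))
```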
Supported AI Model Formats
Format | Support Level | Compatible Versions | Notes |
---|---|---|---|
ONNX | Full | 1.10.0 - 1.16.3 | Recommended for cross-framework compatibility |
TensorRT™ | Full | 7.x - 8.5.x | Best for performance-critical applications |
PyTorch (JIT) | Full | 1.8.0 - 2.0.0 | Native support via TorchScript |
TensorFlow SavedModel | Full | 2.8.0 - 2.12.0 | Recommended TF deployment format |
TFLite | Partial | Up to 2.12.0 | May have limited hardware acceleration |
GGUF | Full | v3 | Format used by Ollama backend |
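
Since GGUF is the format the Ollama backend consumes, any model pulled in that format can be exercised from Python through the ollama client package, as in this sketch. The package may need installing separately, and the model tag is an assumption.

```python
import ollama  # pip install ollama, if it is not already present

# Chat against a GGUF-quantized model served by the local Ollama daemon.
response = ollama.chat(
    model="deepseek-r1:1.5b",  # assumed tag; pull it first with `ollama pull`
    messages=[{"role": "user", "content": "Explain GGUF quantization briefly."}],
)
print(response["message"]["content"])
```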
Copyright © 2025 Advantech Corporation. All rights reserved.