Overview
The Deepseek-R1 1.5B Langchain AI Agent (RAG) on NVIDIA Jetson™ delivers a plug-and-play AI runtime purpose-built for Retrieval-Augmented Generation workflows on edge devices. This container integrates the DeepSeek R1 1.5B model (served via Ollama) with a LangChain-based FastAPI middleware layer and OpenWebUI, providing a complete, lightweight, GPU-accelerated solution for real-time, offline RAG applications.
Designed for edge AI workflows, it offers an efficient development environment to implement document-grounded Q&A, contextual assistants, and autonomous agents—all running locally on Jetson devices. The prebuilt sample enables developers to quickly build custom RAG pipelines with hardware-optimized performance for intelligent, context-aware applications.
This container enables:
| Feature | Description |
|---|---|
| Offline LLM Inference | DeepSeek R1 1.5B via Ollama (no internet required post-setup) |
| LangChain Middleware | FastAPI orchestration for modular pipelines |
| FAISS Vector Database | Built-in semantic search for efficient RAG |
| Agent Support | Autonomous multi-step task execution |
| Context Management | Prompt memory for smarter conversations |
| Streaming Chat UI | OpenWebUI interface |
| OpenAI-Compatible API | Seamless integration endpoints |
| Customizable Parameters | Modelfile & environment variable configuration |
| RAG Sample | Prebuilt code demonstrating RAG implementation |
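The OpenAI-compatible API row above means existing OpenAI client code can target the container. Below is a minimal sketch using only the Python standard library; the endpoint URL and port are assumptions (Ollama's default OpenAI-compatible route is shown, but your container may publish a different port), and the model tag is illustrative.

```python
import json
from urllib import request

# Assumed endpoint: Ollama's OpenAI-compatible route on its default port.
# Adjust host/port to match the ports your container actually publishes.
API_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(prompt, model="deepseek-r1:1.5b", stream=False):
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def ask(prompt):
    """POST the payload to the running container and return the reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Summarize the preloaded PDF")` requires the container to be up; `build_chat_request` can be reused as-is with any OpenAI-compatible client or gateway.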
Container Demo
- DeepSeek answers questions based on the preloaded PDF file.

Use Cases
| Use Case | Description |
|---|---|
| Legal Document Assistant | Query contracts, case law, or legal memos without cloud exposure |
| Internal SOP Assistant | Smart assistant for Standard Operating Procedures across departments |
| Medical Protocol Access | Offline retrieval from medical guidelines and drug data in low-connectivity zones |
| Compliance & Audit Q&A | Offline LLMs for policy compliance and regulatory alignment summaries |
| Safety Manual Agents | Instant answers from safety manuals to improve protocol adherence |
| Conversational RAG | Extend container capabilities for custom RAG development |
| Tool-Enabled Agents | Intelligent agents with calculator, API, and search tool integration |
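The "Tool-Enabled Agents" row above can be sketched as a dispatch loop with a hard cap on tool calls. This is a model-free illustration: the tool set, the hard-coded plan, and the cap value are placeholders for an LLM-driven agent that would choose each tool and argument itself.

```python
# Demo tools only: eval with stripped builtins is for illustration,
# not a production-safe calculator.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "search": lambda q: f"(stub) top result for: {q}",
}

MAX_TOOL_CALLS = 5  # cap iterations to avoid runaway agent loops

def run_agent(plan):
    """Execute a list of (tool_name, argument) steps, honoring the cap."""
    transcript = []
    for step, (tool, arg) in enumerate(plan):
        if step >= MAX_TOOL_CALLS:
            transcript.append("stopped: tool-call limit reached")
            break
        transcript.append(f"{tool}({arg}) -> {TOOLS[tool](arg)}")
    return transcript

result = run_agent([("calculator", "6*7"), ("search", "Jetson Orin NX")])
```

The cap mirrors the tuning advice later in this document: bounding tool calls per loop keeps memory and latency predictable on an 8GB-class device.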
Technical Components
| Component | Description |
|---|---|
| LangChain Middleware | Agent logic with memory and modular chains |
| Prebuilt RAG Example | PDF retrieval application for reference/extension |
| Ollama Integration | Lightweight inference engine for quantized models |
| AI Framework Stack | PyTorch, TensorFlow, ONNX Runtime, and TensorRT™ |
| Industrial Vision | Accelerated OpenCV and GStreamer pipelines |
| Edge AI Capabilities | Computer vision, LLMs, and time-series analysis |
| Performance Optimized | Tuned for NVIDIA® Jetson Orin™ NX 8GB |
| RAG/Agent Support | Accelerated development environment for agent workflows |
Host Device Prerequisites
| Item | Specification |
|---|---|
| Compatible Hardware | Advantech devices accelerated by NVIDIA Jetson™—refer to Compatible hardware |
| NVIDIA Jetson™ Version | 5.x |
| Host OS | Ubuntu 20.04 |
| Required Software Packages | Refer to the software installation guide below |
| Software Installation | NVIDIA Jetson™ Software Package Installation |
Container Environment Overview
Software Components on Container Image
| Component | Version | Description |
|---|---|---|
| CUDA® | 11.4.315 | GPU computing platform |
| cuDNN | 8.6.0 | Deep Neural Network library |
| TensorRT™ | 8.5.2.2 | Inference optimizer and runtime |
| PyTorch | 2.0.0+nv23.02 | Deep learning framework |
| TensorFlow | 2.12.0 | Machine learning framework |
| ONNX Runtime | 1.16.3 | Cross-platform inference engine |
| OpenCV | 4.5.0 | Computer vision library with CUDA® |
| GStreamer | 1.16.2 | Multimedia framework |
| Ollama | 0.5.7 | LLM inference engine |
| LangChain | 0.2.17 | Orchestration layer for memory, RAG, and agent workflows |
| FastAPI | 0.115.12 | API service exposing LangChain interface |
| OpenWebUI | 0.6.5 | Web interface for chat interactions |
| FAISS | 1.8.0.post1 | Vector store for RAG pipelines |
| RAG Code Sample | NA | Sample code that shows RAG capability development |
| Sentence-T5-Base | NA | Pulls the sentence-t5-base embedding model from Hugging Face |
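To make the FAISS row concrete, here is a pure-Python sketch of what a flat inner-product index (FAISS `IndexFlatIP`-style) computes during retrieval: rank document chunks by inner product with the query embedding. The 2-D toy vectors stand in for the sentence-t5-base embeddings the container actually stores in FAISS.

```python
def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k best-matching chunks."""
    scores = [inner_product(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return [(i, scores[i]) for i in order]

# Toy "embeddings" for three document chunks.
docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
hits = top_k([1.0, 0.1], docs)  # best match first
```

FAISS performs the same ranking with SIMD-accelerated kernels and scales it to millions of vectors; the logic above is only the conceptual core.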
Quick Start Guide
For a container quick start, including the docker-compose file and more, refer to the README.
Supported AI Capabilities
Language Models Recommendation
| Model Family | Parameters | Quantization | Size | Performance |
|---|---|---|---|---|
| DeepSeek R1 | 1.5 B | Q4_K_M | 1.1 GB | ~15-17 tokens/sec |
| DeepSeek R1 | 7 B | Q4_K_M | 4.7 GB | ~5-7 tokens/sec |
| DeepSeek Coder | 1.3 B | Q4_0 | 776 MB | ~20-25 tokens/sec |
| Llama 3.2 | 1 B | Q8_0 | 1.3 GB | ~17-20 tokens/sec |
| Llama 3.2 Instruct | 1 B | Q4_0 | ~0.8 GB | ~17-20 tokens/sec |
| Llama 3.2 | 3 B | Q4_K_M | 2 GB | ~10-12 tokens/sec |
| Llama 2 | 7 B | Q4_0 | 3.8 GB | ~5-7 tokens/sec |
| Tinyllama | 1.1 B | Q4_0 | 637 MB | ~22-27 tokens/sec |
| Qwen 2.5 | 0.5 B | Q4_K_M | 398 MB | ~25-30 tokens/sec |
| Qwen 2.5 | 1.5 B | Q4_K_M | 986 MB | ~15-17 tokens/sec |
| Qwen 2.5 Coder | 0.5 B | Q8_0 | 531 MB | ~25-30 tokens/sec |
| Qwen 2.5 Coder | 1.5 B | Q4_K_M | 986 MB | ~15-17 tokens/sec |
| Qwen | 0.5 B | Q4_0 | 395 MB | ~25-30 tokens/sec |
| Qwen | 1.8 B | Q4_0 | 1.1 GB | ~15-20 tokens/sec |
| Gemma 2 | 2 B | Q4_0 | 1.6 GB | ~10-12 tokens/sec |
| Mistral | 7 B | Q4_0 | 4.1 GB | ~5-7 tokens/sec |
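The Size column above roughly tracks parameter count times bits per weight. As a hedged rule of thumb (not an exact GGUF figure): Q4-family quantizations land near 4.5-5 bits per weight, with the remainder explained by higher-precision embedding and output layers. The helper below is an estimate only.

```python
def approx_model_size_gb(params_billion, bits_per_weight=5.0):
    """Rough on-disk size estimate (GB) for a quantized model.

    bits_per_weight is an approximation for Q4-family quantization;
    real GGUF files add overhead for mixed-precision layers.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

size_1_5b = approx_model_size_gb(1.5)  # ~0.9 GB, near the table's 1.1 GB
size_7b = approx_model_size_gb(7)      # ~4.4 GB, near the table's 4.7 GB
```

This kind of estimate is useful for checking whether a candidate model plus its KV cache will fit in the Jetson Orin NX's 8GB of shared memory before pulling it.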
Tuning Tips for Efficient RAG and Agent Workflows:
- Use asynchronous chains and streaming response handlers to reduce latency in FastAPI endpoints.
- For RAG pipelines, use efficient vector stores (e.g., FAISS with cosine or inner-product similarity) and pre-filter data when possible.
- Avoid long chain dependencies; break workflows into smaller composable components.
- Cache prompt templates and tool results where applicable to avoid unnecessary recomputation.
- For agent-based flows, limit tool calls per loop to avoid runaway execution or high memory usage.
- Log intermediate steps (using LangChain's callbacks) for better debugging and observability.
- Use models with ≥3B parameters (e.g., Llama 3.2 3B or larger) for agent development to ensure better reasoning depth and more reliable tool usage.
- Use FAISS or Chroma with pre-computed embeddings for faster retrieval.
- Apply score thresholding in the retriever configuration to filter out irrelevant documents.
- Keep a persistent vector DB (e.g., a saved FAISS index) to avoid re-indexing on container restart.
- Choose embedding models whose size and precision suit the use case.
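The score-thresholding tip above can be sketched as a simple post-retrieval filter: drop chunks whose similarity score falls below a cutoff so the prompt carries only relevant context. The scores and threshold below are illustrative; in LangChain this is typically configured on the retriever itself rather than applied by hand.

```python
def filter_by_score(hits, threshold=0.75):
    """Keep (doc, score) pairs at or above the threshold, best first.

    The 0.75 default is an illustrative cutoff; tune it per embedding
    model and corpus, since score scales differ between models.
    """
    kept = [(doc, score) for doc, score in hits if score >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])

hits = [("chunk-a", 0.92), ("chunk-b", 0.40), ("chunk-c", 0.81)]
relevant = filter_by_score(hits)  # chunk-b is dropped as irrelevant
```

Filtering before prompt assembly both shortens the context (faster generation on a 1.5B model) and reduces the chance of the model grounding its answer in an off-topic chunk.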
Copyright © Advantech Corporation. All rights reserved.