
Overview

This container image, DeepSeek R1 1.5B Ollama on NVIDIA Jetson™, bundles a ready-to-use Ollama inference engine with the DeepSeek R1 1.5B model and the OpenWebUI interface for LLM interaction.

Developed for NVIDIA Jetson™, this image offers:

  • On-device LLM (DeepSeek R1 1.5B via Ollama) inference (no internet required post-setup)
  • OpenAI-compatible APIs
  • Streaming chat support via OpenWebUI & Ollama
  • Modelfile for customizing model parameters

Use Case

  • Private LLM Inference on Local Devices: Run large language models locally with no internet requirement—ideal for privacy-critical environments

  • Lightweight Backend for LLM APIs: Use Ollama to expose models via its local API for fast integration with tools like LangChain, FastAPI, or custom UIs.

  • Document-Based Q&A Systems: Combine Ollama with a vector database to create offline RAG (Retrieval-Augmented Generation) systems for querying internal documents or manuals.

  • Rapid Prototyping for Prompt Engineering: Use the Modelfile to fine-tune system prompts, default instructions, and model parameters—great for experimenting with prompt structures or multi-turn workflows.

  • Multilingual Assistants: Deploy multilingual chatbots using local models that can translate, summarize, or interact in different languages without depending on cloud services.

  • LLM Evaluation and Benchmarking: Easily swap and test different quantized models (e.g., Mistral, LLaMA, DeepSeek) to compare performance, output quality, and memory usage across devices.

  • Custom Offline Agents: Use Ollama as the reasoning core of intelligent agents that interact with other local tools (e.g., databases, APIs, sensors); this is especially powerful when paired with LangChain.
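The local-API integration described above can be exercised directly once the container is running. The sketch below assumes Ollama is listening on its default port 11434 on the device; `deepseek-r1:1.5b` is the model's published Ollama tag, and the prompts are illustrative:

```shell
# Query the bundled model through Ollama's native API (default port 11434)
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:1.5b",
  "prompt": "Summarize the benefits of edge AI in one sentence.",
  "stream": false
}'

# The same model is also reachable through the OpenAI-compatible endpoint,
# so existing OpenAI SDK clients can simply point at the local server:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:1.5b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

Tools such as LangChain or FastAPI backends can be pointed at the same base URL for offline integration.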


Key Features

  • Ollama Integration: Lightweight inference engine for quantized models
  • Complete AI Framework Stack: PyTorch, TensorFlow, ONNX Runtime, and TensorRT™
  • Industrial Vision Support: Accelerated OpenCV and GStreamer pipelines
  • Edge AI Capabilities: Support for computer vision, LLMs, and time-series analysis
  • Performance Optimized: Tuned specifically for NVIDIA® Jetson Orin™ NX 8GB
  • System Prompt & Temperature Control: Fine-tune model behavior and parameters via the Modelfile
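As an illustration of the Modelfile-based customization mentioned above, here is a minimal sketch. The base tag `deepseek-r1:1.5b` is the model's published Ollama tag; the parameter values and system prompt are illustrative, not shipped defaults:

```
# Modelfile: a minimal customization sketch (values are illustrative)
FROM deepseek-r1:1.5b

# Sampling parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 2048

# Default system prompt applied to every conversation
SYSTEM """You are a concise technical assistant running fully on-device."""
```

A customized variant can then be built and run with `ollama create my-assistant -f Modelfile` followed by `ollama run my-assistant` (the name `my-assistant` is hypothetical).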

Host Device Prerequisites

| Item | Specification |
| --- | --- |
| Compatible Hardware | Advantech devices accelerated by NVIDIA Jetson™ (refer to Compatible Hardware) |
| NVIDIA Jetson™ Version | 5.x |
| Host OS | Ubuntu 20.04 |
| Required Software Packages | Refer to the list below |
| Software Installation | NVIDIA Jetson™ Software Package Installation |

Required Software Packages on Host Device

These packages are bound with the NVIDIA Jetson™ version of the device. This container supports NVIDIA Jetson™ 5.x.

| Component | Version | Description |
| --- | --- | --- |
| CUDA® Toolkit | 11.4.315 | GPU computing platform |
| cuDNN | 8.6.0 | Deep Neural Network library |
| TensorRT™ | 8.5.2.2 | Inference optimizer and runtime |
| VPI | 2.2.7 or above | NVIDIA Vision Programming Interface |
| Vulkan | 1.3.204 or above | Cross-platform graphics and compute API |
| OpenCV | 4.5.0 | Computer vision library with CUDA® support |

Container Environment Overview

Software Components on Container Image

| Component | Version | Description |
| --- | --- | --- |
| CUDA® | 11.4.315 | GPU computing platform |
| cuDNN | 8.6.0 | Deep Neural Network library |
| TensorRT™ | 8.5.2.2 | Inference optimizer and runtime |
| PyTorch | 2.0.0+nv23.02 | Deep learning framework |
| TensorFlow | 2.12.0 | Machine learning framework |
| ONNX Runtime | 1.16.3 | Cross-platform inference engine |
| OpenCV | 4.5.0 | Computer vision library with CUDA® support |
| GStreamer | 1.16.2 | Multimedia framework |
| Ollama | 0.5.7 | LLM inference engine |
| OpenWebUI | 0.6.5 | Web interface for chat interactions |

Quick Start Guide

For a container quick start, including the docker-compose file and more, please refer to the README.
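The README contains the authoritative docker-compose file; purely for orientation, a sketch of what such a two-service deployment typically looks like is shown below. The image names are placeholders (take the real ones from the README), while the ports are Ollama's and OpenWebUI's documented defaults:

```yaml
# Orientation sketch only; image names are placeholders, see the README.
services:
  ollama:
    image: <ollama-jetson-image-from-readme>   # placeholder
    runtime: nvidia                            # expose the Jetson GPU to the container
    ports:
      - "11434:11434"                          # Ollama's default API port
    volumes:
      - ollama-data:/root/.ollama              # persist pulled models across restarts
  open-webui:
    image: <openwebui-image-from-readme>       # placeholder
    ports:
      - "3000:8080"                            # OpenWebUI listens on 8080 in-container
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434    # point the UI at the Ollama service
    depends_on:
      - ollama
volumes:
  ollama-data:
```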


Supported AI Capabilities

Language Models Recommendation

| Model Family | Parameters | Quantization | Size | Performance |
| --- | --- | --- | --- | --- |
| DeepSeek R1 | 1.5 B | Q4_K_M | 1.1 GB | ~15-17 tokens/sec |
| DeepSeek R1 | 7 B | Q4_K_M | 4.7 GB | ~5-7 tokens/sec |
| DeepSeek Coder | 1.3 B | Q4_0 | 776 MB | ~20-25 tokens/sec |
| Llama 3.2 | 1 B | Q8_0 | 1.3 GB | ~17-20 tokens/sec |
| Llama 3.2 Instruct | 1 B | Q4_0 | ~0.8 GB | ~17-20 tokens/sec |
| Llama 3.2 | 3 B | Q4_K_M | 2 GB | ~10-12 tokens/sec |
| Llama 2 | 7 B | Q4_0 | 3.8 GB | ~5-7 tokens/sec |
| TinyLlama | 1.1 B | Q4_0 | 637 MB | ~22-27 tokens/sec |
| Qwen 2.5 | 0.5 B | Q4_K_M | 398 MB | ~25-30 tokens/sec |
| Qwen 2.5 | 1.5 B | Q4_K_M | 986 MB | ~15-17 tokens/sec |
| Qwen 2.5 Coder | 0.5 B | Q8_0 | 531 MB | ~25-30 tokens/sec |
| Qwen 2.5 Coder | 1.5 B | Q4_K_M | 986 MB | ~15-17 tokens/sec |
| Qwen | 0.5 B | Q4_0 | 395 MB | ~25-30 tokens/sec |
| Qwen | 1.8 B | Q4_0 | 1.1 GB | ~15-20 tokens/sec |
| Gemma 2 | 2 B | Q4_0 | 1.6 GB | ~10-12 tokens/sec |
| Mistral | 7 B | Q4_0 | 4.1 GB | ~5-7 tokens/sec |
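Any of the models above can be pulled and swapped in through the Ollama CLI inside the running container. A short sketch, assuming network access for the initial pull (`qwen2.5:0.5b` is that model's published Ollama tag):

```shell
# Pull an alternative quantized model from the table above
ollama pull qwen2.5:0.5b

# Run it interactively to compare output quality and speed
ollama run qwen2.5:0.5b "Explain quantization in one sentence."

# List locally available models and their on-disk sizes
ollama list
```

Once pulled, models run fully offline, so comparisons between entries in the table can be done without an internet connection.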

DeepSeek R1 1.5B Optimization Recommendations:

  • Use INT4/INT8 (Q4, Q8) quantized models for faster inference
  • Best performance with TensorRT™ engine conversion
  • Enable Jetson™ clocks and select a higher power mode, if possible
  • Avoid loading the model from slow storage such as SD cards or eMMC
  • Ensure the model is fully loaded into GPU memory
  • Add swap or ZRAM if the model's memory requirements exceed available RAM
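The power-mode, clocks, and swap recommendations above map to standard Jetson host commands. A sketch, run on the host (mode IDs vary by module, and the 4 GB swap size is only an example):

```shell
# Inspect the current power mode, then select a higher one
sudo nvpmodel -q            # query the active power mode
sudo nvpmodel -m 0          # mode 0 is MAXN on most Jetson modules; verify for your device

# Lock clocks at their maximum for consistent inference throughput
sudo jetson_clocks

# Add swap headroom if the model exceeds free RAM (example: 4 GB swap file)
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```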

Hardware Acceleration Support

| Accelerator | Support Level | Compatible Libraries | Notes |
| --- | --- | --- | --- |
| CUDA® | Full | PyTorch, TensorFlow, OpenCV, ONNX Runtime | Primary acceleration method |
| TensorRT™ | Full | ONNX, TensorFlow, PyTorch (via export) | Recommended for inference optimization |
| cuDNN | Full | PyTorch, TensorFlow | Accelerates deep learning primitives |
| NVDEC | Full | GStreamer, FFmpeg | Hardware video decoding |
| NVENC | Full | GStreamer, FFmpeg | Hardware video encoding |
| DLA | Partial | TensorRT™ | Requires specific model optimization |
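As an example of the NVDEC path in the table above, an H.264 file can be decoded on the hardware decoder using Jetson's GStreamer elements (`nvv4l2decoder`, `nvvidconv`). The file name is a placeholder:

```shell
# Decode H.264 on NVDEC via the Jetson-specific nvv4l2decoder element;
# fakesink discards frames so this measures decode throughput only.
gst-launch-1.0 filesrc location=sample.mp4 ! qtdemux ! h264parse ! \
  nvv4l2decoder ! nvvidconv ! fakesink sync=false
```

Replacing `fakesink` with an inference or display element turns this into the front end of an accelerated vision pipeline.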

Copyright © Advantech Corporation. All rights reserved.