RamaLama allows you to run AI workloads on your laptop just as easily as you run them in the cloud. The CLI can help whether you’re running a coding agent locally or developing a reproducible local environment that matches production.

Prerequisites

  1. Podman or Docker installed (recommended)
  2. RamaLama installed (pip install ramalama or dnf install python3-ramalama)
  3. Optional: GPU drivers/runtime (NVIDIA Container Toolkit, AMD ROCm, etc.). Check your install as shown below.
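A quick sanity check is to run the CLI and have it report what it detected (ramalama info prints the container engine, accelerator, and store path; exact fields vary by version):
ramalama version
ramalama info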

Serve a model locally

Start a REST API on port 8080 in the background:
ramalama serve --image rlcr.io/ramalama/llamacpp-cpu-distroless -d -p 8080 rlcr://gemma3-270m
Interact with it through the OpenAI-compatible API, for example with the built-in chat client:
ramalama chat "Say hello in one sentence"
Hello!
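Because the endpoint is OpenAI-compatible, any OpenAI client or plain curl works too. A minimal sketch, assuming the default /v1/chat/completions path and that the server accepts the model name as sent (check /v1/models if it does not):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma3-270m", "messages": [{"role": "user", "content": "Say hello in one sentence"}]}'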
List and stop containers:
ramalama containers
ramalama stop --all

GPU acceleration

RamaLama detects your hardware and picks an accelerated image automatically (quay.io/ramalama/cuda, rocm, intel-gpu, etc.). To override, specify --image:
ramalama serve -d -p 8080 --image rlcr.io/ramalama/llamacpp-cpu-distroless llama3
If you use Docker with NVIDIA GPUs, ensure the NVIDIA Container Toolkit is installed and your compose/run commands have GPU access enabled as needed.
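One common way to confirm the toolkit is wired up is to run nvidia-smi inside a throwaway container (the CUDA image tag below is illustrative; pick one that matches your driver):
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi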

Data and storage

Models are stored under your user data directory (e.g., ~/.local/share/ramalama). Use ramalama list to see downloaded models and ramalama rm to remove them.
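For example (the model name below is illustrative; pass the name exactly as ramalama list prints it):
ramalama list
ramalama rm gemma3-270m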

Security defaults

RamaLama runs models in rootless containers with --network=none, read-only model mounts, and --rm cleanup.
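To see these defaults on a running server, inspecting the container with Podman is one option (substitute the container name that ramalama containers reports; the fields shown assume Podman’s Docker-compatible inspect output):
podman inspect --format '{{ .HostConfig.NetworkMode }} {{ .HostConfig.AutoRemove }}' <container-name>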

Next Steps