RamaLama allows you to run AI workloads on your laptop just as easily as you run them in the cloud. The CLI can help whether you’re running a coding agent locally or developing a reproducible local environment that matches production.

Prerequisites

  1. Podman or Docker installed (recommended)
  2. RamaLama installed (pip install ramalama or dnf install python3-ramalama)
  3. Optional: GPU drivers/runtime (NVIDIA Container Toolkit, AMD ROCm, etc.). Check your install as shown below.
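A quick sanity check is to run the CLI and have it report what it detected (ramalama info prints the container engine, accelerator, and store path; exact fields vary by version):
ramalama version
ramalama info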

Serve a model locally

Start a REST API on port 8080 in the background:
ramalama serve --image rlcr.io/ramalama/llamacpp-cpu-distroless -d -p 8080 rlcr://gemma3-270m
Interact with it through the OpenAI-compatible API, for example with the built-in chat client:
ramalama chat "Say hello in one sentence"
Hello!
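Because the endpoint is OpenAI-compatible, any OpenAI client or plain curl works too. A minimal sketch, assuming the default /v1/chat/completions path and that the server accepts the model name as sent (check /v1/models if it does not):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma3-270m", "messages": [{"role": "user", "content": "Say hello in one sentence"}]}'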
List and stop containers:
ramalama containers
ramalama stop --all

GPU acceleration

RamaLama detects your hardware and picks an accelerated image automatically (quay.io/ramalama/cuda, rocm, intel-gpu, etc.). To override, specify --image:
ramalama serve -d -p 8080 --image rlcr.io/ramalama/llamacpp-cpu-distroless llama3
If you use Docker with NVIDIA GPUs, ensure the NVIDIA Container Toolkit is installed and your compose/run commands have GPU access enabled as needed.
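One common way to confirm the toolkit is wired up is to run nvidia-smi inside a throwaway container (the CUDA image tag below is illustrative; pick one that matches your driver):
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi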

Data and storage

Models are stored under your user data directory (e.g., ~/.local/share/ramalama). Use ramalama list to see downloaded models and ramalama rm to remove them.
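For example (the model name below is illustrative; pass the name exactly as ramalama list prints it):
ramalama list
ramalama rm gemma3-270m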

Security defaults

RamaLama runs models in rootless containers with --network=none, read-only model mounts, and --rm cleanup.
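To see these defaults on a running server, inspecting the container with Podman is one option (substitute the container name that ramalama containers reports; the fields shown assume Podman’s Docker-compatible inspect output):
podman inspect --format '{{ .HostConfig.NetworkMode }} {{ .HostConfig.AutoRemove }}' <container-name>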

Next Steps