RamaLama runtime images are minimal, security‑hardened container images that package an inference engine without any model files. Use them when you want to manage models separately (versioning, provenance, air‑gapped environments) or need fine‑grained control over mounts and updates.

When to use runtimes

  • Isolate the execution environment from model content for stricter change control
  • Update model files without rebuilding container images
  • Pin/roll back runtime versions independently of models
  • Support multiple models on the same host via mounts

Supported flavors

Common runtime images include:
  • rlcr.io/ramalama/llamacpp-cpu-distroless:latest — CPU‑only
  • rlcr.io/ramalama/llamacpp-cuda-distroless:latest — NVIDIA CUDA
    • Requires the NVIDIA Container Toolkit when using Docker
Additional hardware variants may be available (e.g., ROCm, Intel GPU). Check the registry for your hardware.
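For NVIDIA + Docker, install the NVIDIA Container Toolkit before running GPU containers. A quick sanity check is to run nvidia-smi in a throwaway container (assuming a recent Docker with --gpus support; the ubuntu image here is just an example):
docker run --rm --gpus all ubuntu nvidia-smi
If this prints your GPU table, the CUDA runtime image should see the same devices.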

Run with a local model directory

Mount a directory containing your .gguf model and point the runtime to the file with --model.
docker run --rm -p 8080:8080 \
  -v "$PWD/models:/models:ro" \
  rlcr.io/ramalama/llamacpp-cpu-distroless:latest \
  --model /models/gemma-3-270m-it-Q6_K.gguf --host 0.0.0.0 --port 8080
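Once the container is up, you can exercise it from the host. The commands below assume the image's entrypoint is llama.cpp's llama-server, which exposes a health endpoint and an OpenAI‑compatible chat API:
curl http://localhost:8080/health
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'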

Compose example

Define the runtime service and mount your model directory read‑only at /models.
docker-compose.yaml (CPU)
services:
  llama:
    image: rlcr.io/ramalama/llamacpp-cpu-distroless:latest
    command: ["--model", "/models/gemma-3-270m-it-Q6_K.gguf", "--host", "0.0.0.0", "--port", "8080"]
    volumes:
      - ./models:/models:ro
    ports:
      - "8080:8080"
    restart: unless-stopped
docker-compose.yaml (CUDA)
services:
  llama-gpu:
    image: rlcr.io/ramalama/llamacpp-cuda-distroless:latest
    command: ["--model", "/models/gemma-3-270m-it-Q6_K.gguf", "--host", "0.0.0.0", "--port", "8080"]
    volumes:
      - ./models:/models:ro
    ports:
      - "8080:8080"
    gpus: all
    restart: unless-stopped
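
Bring the service up in the background and follow its logs with standard Docker Compose commands (use the service name from whichever file you chose, llama or llama-gpu):
docker compose up -d
docker compose logs -f llama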

RamaLama CLI (override image)

The CLI auto‑detects your hardware and chooses an image, but you can override it explicitly:
ramalama serve --image rlcr://llamacpp-cuda-distroless:latest rlcr://gemma3-270m

Next steps

  • See deployment patterns: /pages/deploying/compose
  • Learn about OCI‑packaged models: /pages/artifacts/model