RamaLama runtime images are minimal, security‑hardened containers that package an inference engine without any model files.
Use them when you want to manage models separately (versioning, provenance, air‑gapped environments) or need fine‑grained control over mounts and updates.
## When to use runtimes

- Isolate the execution environment from model content for stricter change control
- Update model files without rebuilding container images
- Pin or roll back runtime versions independently of models
- Support multiple models on the same host via mounts
## Supported flavors

Common runtime images include:

- rlcr.io/ramalama/llamacpp-cpu-distroless:latest — CPU‑only
- rlcr.io/ramalama/llamacpp-cuda-distroless:latest — NVIDIA CUDA (requires the NVIDIA Container Toolkit when using Docker)

Additional hardware variants may be available (e.g., ROCm, Intel GPU). Check the registry for images matching your hardware.
For NVIDIA + Docker, install the NVIDIA Container Toolkit before running GPU containers.
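As a rough sketch of that setup on a Docker host (assuming the toolkit package itself is already installed from NVIDIA's repositories, and using an illustrative CUDA image tag for the sanity check):

```bash
# Register the NVIDIA runtime with Docker, then restart the daemon.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Quick sanity check that GPU containers work (image tag is illustrative).
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```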
## Run with a local model directory
Mount a directory containing your .gguf model and point the runtime to the file with --model.
### Docker (CPU)
```bash
docker run --rm -p 8080:8080 \
  -v "$PWD/models:/models:ro" \
  rlcr.io/ramalama/llamacpp-cpu-distroless:latest \
  --model /models/gemma-3-270m-it-Q6_K.gguf --host 0.0.0.0 --port 8080
```
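The GPU and Podman variants follow the same pattern. The commands below are sketches rather than verbatim reference: they assume Docker's standard `--gpus` flag and, for Podman, CDI device access set up through the NVIDIA Container Toolkit. Adjust the model filename to match the file you actually downloaded.

### Docker (CUDA)

```bash
# Expose all GPUs to the container via the NVIDIA Container Toolkit.
docker run --rm -p 8080:8080 --gpus all \
  -v "$PWD/models:/models:ro" \
  rlcr.io/ramalama/llamacpp-cuda-distroless:latest \
  --model /models/gemma-3-270m-it-Q6_K.gguf --host 0.0.0.0 --port 8080
```

### Podman (CPU)

The Podman invocation mirrors Docker; on SELinux systems you may need the `:Z` volume option to relabel the mounted directory.

```bash
podman run --rm -p 8080:8080 \
  -v "$PWD/models:/models:ro" \
  rlcr.io/ramalama/llamacpp-cpu-distroless:latest \
  --model /models/gemma-3-270m-it-Q6_K.gguf --host 0.0.0.0 --port 8080
```

### Podman (CUDA)

A sketch assuming GPU access through CDI (generate the spec first with `nvidia-ctk cdi generate`):

```bash
# Pass all NVIDIA GPUs to the container using the CDI device name.
podman run --rm -p 8080:8080 --device nvidia.com/gpu=all \
  -v "$PWD/models:/models:ro" \
  rlcr.io/ramalama/llamacpp-cuda-distroless:latest \
  --model /models/gemma-3-270m-it-Q6_K.gguf --host 0.0.0.0 --port 8080
```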
## Compose example
Define the runtime service and mount your model directory read‑only at /models.
### docker-compose.yaml (CPU)
```yaml
services:
  llama:
    image: rlcr.io/ramalama/llamacpp-cpu-distroless:latest
    command: ["--model", "/models/gemma-3-270m-it-Q6_K.gguf", "--host", "0.0.0.0", "--port", "8080"]
    volumes:
      - ./models:/models:ro
    ports:
      - "8080:8080"
    restart: unless-stopped
```
### docker-compose.yaml (CUDA)
```yaml
services:
  llama-gpu:
    image: rlcr.io/ramalama/llamacpp-cuda-distroless:latest
    command: ["--model", "/models/gemma-3-270m-it-Q6_K.gguf", "--host", "0.0.0.0", "--port", "8080"]
    volumes:
      - ./models:/models:ro
    ports:
      - "8080:8080"
    gpus: all
    restart: unless-stopped
```
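To bring either service up and smoke-test it, something like the following should work. The endpoint paths are an assumption based on upstream llama.cpp's llama-server (which exposes a `/health` check and an OpenAI-compatible API), not something this page specifies:

```bash
# Start the service in the background.
docker compose up -d

# Check that the server is responding (assumes llama.cpp's /health endpoint).
curl -s http://localhost:8080/health

# Send a minimal OpenAI-style chat completion request.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```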
## RamaLama CLI (override image)
The CLI auto‑detects your hardware and chooses an image, but you can override it explicitly:
```bash
ramalama serve --image rlcr://llamacpp-cuda-distroless:latest rlcr://gemma3-270m
```
## Next steps

- See deployment patterns: /pages/deploying/compose
- Learn about OCI‑packaged models: /pages/artifacts/model