# Runtimes

> Hardened, distroless inference engines (e.g., llama.cpp, vLLM) for CPU and GPU.

RamaLama runtime images are minimal, security‑hardened container images that package an inference engine without any model files.
Use them when you want to manage models separately (versioning, provenance, air‑gapped environments) or need fine‑grained control over mounts and updates.

## When to use runtimes

* Isolate the execution environment from model content for stricter change control
* Update model files without rebuilding container images
* Pin/roll back runtime versions independently of models
* Support multiple models on the same host via mounts
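
Pinning the runtime independently of the model can be sketched with an image digest instead of the mutable `latest` tag. This is a minimal sketch, not a prescribed workflow: `pin_image` is a hypothetical helper name, and it assumes the image has already been pulled from the registry so that Docker has a repo digest recorded for it.

```bash
# Hypothetical helper: print the digest-pinned reference (repo@sha256:...)
# for a locally pulled image, so deployments can reference an immutable version.
pin_image() {
  local image="$1"
  # RepoDigests is populated once the image has been pulled from a registry.
  docker image inspect --format '{{index .RepoDigests 0}}' "$image"
}

# Example (requires Docker and a pulled image):
#   docker pull rlcr.io/ramalama/llamacpp-cpu-distroless:latest
#   RUNTIME_IMAGE="$(pin_image rlcr.io/ramalama/llamacpp-cpu-distroless:latest)"
#   docker run --rm -p 8080:8080 -v "$PWD/models:/models:ro" "$RUNTIME_IMAGE" \
#     --model /models/gemma-3-270m-it-Q6_K.gguf --host 0.0.0.0 --port 8080
```

Because the model is mounted rather than baked in, rolling back means switching `RUNTIME_IMAGE` to an earlier digest while leaving the model directory untouched.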

## Supported flavors

Common runtime images include:

* `rlcr.io/ramalama/llamacpp-cpu-distroless:latest` — CPU‑only
* `rlcr.io/ramalama/llamacpp-cuda-distroless:latest` — NVIDIA CUDA
  * Requires NVIDIA Container Toolkit when using Docker

Additional hardware variants may be available (e.g., ROCm, Intel GPU). Check the registry for an image that matches your hardware.

<Tip>
  For NVIDIA + Docker, install the NVIDIA Container Toolkit before running GPU containers.
</Tip>
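
If you script your deployments, choosing between the two images listed above could look roughly like the sketch below. `pick_runtime_image` is a hypothetical helper, and it assumes that a successful `nvidia-smi -L` is a good enough signal that an NVIDIA GPU and driver are present. It does not verify that the NVIDIA Container Toolkit is installed.

```bash
# Hypothetical helper: choose between the CPU and CUDA runtime images.
# Assumption: a working `nvidia-smi -L` means an NVIDIA GPU and driver are usable.
pick_runtime_image() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
    echo "rlcr.io/ramalama/llamacpp-cuda-distroless:latest"
  else
    echo "rlcr.io/ramalama/llamacpp-cpu-distroless:latest"
  fi
}

# Example:
#   IMAGE="$(pick_runtime_image)"
#   echo "Using runtime image: $IMAGE"
```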

## Run with a local model directory

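Mount a directory containing your `.gguf` model and point the runtime at the file with `--model`. The `--model` path is the path inside the container (under `/models` in these examples), not the path on the host.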

<CodeGroup>
  ```bash title="Docker (CPU)" theme={"system"}
  docker run --rm -p 8080:8080 \
    -v "$PWD/models:/models:ro" \
    rlcr.io/ramalama/llamacpp-cpu-distroless:latest \
    --model /models/gemma-3-270m-it-Q6_K.gguf --host 0.0.0.0 --port 8080
  ```

  ```bash title="Docker (CUDA)" theme={"system"}
  docker run --rm -p 8080:8080 --gpus all \
    -v "$PWD/models:/models:ro" \
    rlcr.io/ramalama/llamacpp-cuda-distroless:latest \
    --model /models/gemma-3-270m-it-Q6_K.gguf --host 0.0.0.0 --port 8080
  ```

  ```bash title="Podman (CPU)" theme={"system"}
  podman run --rm -p 8080:8080 \
    -v "$PWD/models:/models:ro" \
    rlcr.io/ramalama/llamacpp-cpu-distroless:latest \
    --model /models/gemma-3-270m-it-Q6_K.gguf --host 0.0.0.0 --port 8080
  ```

  ```bash title="Podman (CUDA)" theme={"system"}
  podman run --rm -p 8080:8080 --gpus all \
    -v "$PWD/models:/models:ro" \
    rlcr.io/ramalama/llamacpp-cuda-distroless:latest \
    --model /models/gemma-3-270m-it-Q6_K.gguf --host 0.0.0.0 --port 8080
  ```
</CodeGroup>
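
A wrong host path or a truncated download is a common reason for the server to fail at startup. The pre-flight check below is a minimal sketch: `is_gguf` is a hypothetical helper that only confirms the file exists and begins with the four-byte `GGUF` magic. It does not validate the rest of the file.

```bash
# Hypothetical pre-flight check: confirm the file exists and starts with the
# 4-byte GGUF magic ("GGUF") before mounting it into the container.
is_gguf() {
  local path="$1" magic
  [ -f "$path" ] || return 1
  # Read only the first four bytes; drop NUL bytes so the shell can hold the value.
  magic="$(head -c 4 "$path" | tr -d '\0')"
  [ "$magic" = "GGUF" ]
}

# Example (host path; the container sees the same file under /models):
#   is_gguf "$PWD/models/gemma-3-270m-it-Q6_K.gguf" || echo "not a GGUF file" >&2
```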

## Compose example

Define the runtime service and mount your model directory read‑only at `/models`.

```yaml title="docker-compose.yaml (CPU)" theme={"system"}
services:
  llama:
    image: rlcr.io/ramalama/llamacpp-cpu-distroless:latest
    command: ["--model", "/models/gemma-3-270m-it-Q6_K.gguf", "--host", "0.0.0.0", "--port", "8080"]
    volumes:
      - ./models:/models:ro
    ports:
      - "8080:8080"
    restart: unless-stopped
```

```yaml title="docker-compose.yaml (CUDA)" theme={"system"}
services:
  llama-gpu:
    image: rlcr.io/ramalama/llamacpp-cuda-distroless:latest
    command: ["--model", "/models/gemma-3-270m-it-Q6_K.gguf", "--host", "0.0.0.0", "--port", "8080"]
    volumes:
      - ./models:/models:ro
    ports:
      - "8080:8080"
    gpus: all
    restart: unless-stopped
```
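
After starting a service, the container may need some time to load the model before it can answer requests. The polling loop below is a minimal sketch under two assumptions: that the image's entrypoint is llama.cpp's HTTP server, as the `--model`, `--host`, and `--port` flags suggest, and that it exposes a `GET /health` endpoint on the published port. `wait_for_ready` and `WAIT_INTERVAL` are hypothetical names.

```bash
# Hypothetical helper: poll a URL until it returns HTTP success or attempts run out.
# Assumption: the server exposes a GET /health endpoint on port 8080.
wait_for_ready() {
  local url="${1:-http://localhost:8080/health}"
  local attempts="${2:-30}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (e.g. while the model is loading).
    if curl -fsS -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "${WAIT_INTERVAL:-2}"
  done
  return 1
}

# Example:
#   docker compose up -d && wait_for_ready && echo "server is ready"
```

With the defaults, the helper gives up after 30 attempts roughly two seconds apart; larger models may need a higher attempt count.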

## RamaLama CLI (override image)

The CLI auto‑detects your hardware and chooses an image, but you can override it explicitly:

```bash theme={"system"}
ramalama serve --image rlcr://llamacpp-cuda-distroless:latest rlcr://gemma3-270m
```

## Next steps

* See deployment patterns: `/pages/deploying/compose`
* Learn about OCI‑packaged models: `/pages/artifacts/model`
