# Kubernetes

> Run RamaLama on Kubernetes with CPU or GPU nodes.

RamaLama images can be deployed like any other containerized workload.
This page describes two strategies for deploying to Kubernetes; the list is not exhaustive.

<Tip>
  Although model images that package both the runtime and the model into a single image are available, we generally advise mounting models as volumes onto a runtime image in production.
  This keeps the runtime and model lifecycles independent and reduces image size.
</Tip>

## OCI Image Volume (Kubernetes 1.33+)

As of Kubernetes 1.33, [image volumes](https://kubernetes.io/docs/tasks/configure-pod-container/image-volumes/) have been promoted to beta.
This feature lets you mount a container image as a read-only volume.
For many models we provide both raw OCI artifacts tagged by file type (e.g. `:gguf`) and OCI images, tagged `:gguf-image`, that store the model file under `/models`.

<Note>
  Requires Kubernetes 1.33+ with OCI image volume support enabled in your cluster.
  GPU prerequisites apply to the GPU example below: NVIDIA drivers on nodes and the NVIDIA Device Plugin.
</Note>
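
What "enabled in your cluster" means depends on your distribution. As a minimal sketch, assuming your cluster still has the `ImageVolume` feature gate turned off by default and that you manage the kubelet through a configuration file, the gate could be enabled on each node like this. The API server typically needs the same gate (`--feature-gates=ImageVolume=true`), and the container runtime on the node must itself support image volumes; check your distribution's documentation before relying on this.

```yaml
# Sketch only: kubelet configuration enabling image volumes.
# Assumption: the ImageVolume feature gate is not already on in your cluster.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ImageVolume: true
```

Managed Kubernetes offerings may not expose feature gates at all; in that case the InitContainer strategy further down is the safer choice.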

<CodeGroup>
  ```yaml title="CPU" theme={"system"}
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: llama-cpu
  spec:
    replicas: 1
    selector:
      matchLabels: { app: llama-cpu }
    template:
      metadata:
        labels: { app: llama-cpu }
      spec:
        containers:
          - name: llama
            image: rlcr.io/ramalama/llamacpp-cpu-distroless:latest
            ports:
              - containerPort: 8080
            args:
              - "--model"
              - "/models/gemma-3-1b-it-Q6_K.gguf" # update to your exact filename
              - "--host"
              - "0.0.0.0"
              - "--port"
              - "8080"
            volumeMounts:
              - name: model
                mountPath: /models
                readOnly: true
                subPath: models  # mount only the /models directory from the image
            securityContext:
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
              capabilities: { drop: ["ALL"] }
        volumes:
          - name: model
            image:
              reference: rlcr.io/ramalama/gemma-3-1b-it:gguf-image
              pullPolicy: IfNotPresent
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: llama-cpu
  spec:
    selector: { app: llama-cpu }
    ports:
      - name: http
        port: 80
        targetPort: 8080
    type: ClusterIP
  ```

  ```yaml title="GPU" theme={"system"}
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: llama-gpu
  spec:
    replicas: 1
    selector:
      matchLabels: { app: llama-gpu }
    template:
      metadata:
        labels: { app: llama-gpu }
      spec:
        containers:
          - name: llama
            image: rlcr.io/ramalama/llamacpp-cuda-distroless:latest
            ports:
              - containerPort: 8080
            args:
              - "--model"
              - "/models/gemma-3-1b-it-Q6_K.gguf" # update to your exact filename
              - "--host"
              - "0.0.0.0"
              - "--port"
              - "8080"
            volumeMounts:
              - name: model
                mountPath: /models
                readOnly: true
                subPath: models
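            env:
              # CUDA_VISIBLE_DEVICES is intentionally not set: it expects device
              # indices or UUIDs, and a value of "all" would hide every GPU from CUDA.
              - name: NVIDIA_VISIBLE_DEVICES
                value: all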
            resources:
              limits:
                nvidia.com/gpu: "1"
            securityContext:
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
              capabilities: { drop: ["ALL"] }
        volumes:
          - name: model
            image:
              reference: rlcr.io/ramalama/gemma-3-1b-it:gguf-image
              pullPolicy: IfNotPresent
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: llama-gpu
  spec:
    selector: { app: llama-gpu }
    ports:
      - name: http
        port: 80
        targetPort: 8080
    type: ClusterIP
  ```
</CodeGroup>

## InitContainer (ORAS into emptyDir)

Use an `initContainer` to pull the model's OCI artifact (`:gguf`) with ORAS into an `emptyDir` mounted at `/models` before the runtime starts.
This strategy works on any currently supported version of Kubernetes without special volume types.

<CodeGroup>
  ```yaml title="CPU" theme={"system"}
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: llama-cpu
  spec:
    replicas: 1
    selector:
      matchLabels: { app: llama-cpu }
    template:
      metadata:
        labels: { app: llama-cpu }
      spec:
        initContainers:
          - name: pull-model
            image: ghcr.io/oras-project/oras:latest
            args: ["pull", "rlcr.io/ramalama/gemma-3-1b-it:gguf", "-o", "/models"]
            volumeMounts:
              - name: model
                mountPath: /models
        containers:
          - name: llama
            image: rlcr.io/ramalama/llamacpp-cpu-distroless:latest
            ports:
              - containerPort: 8080
            args:
              - "--model"
              - "/models/gemma-3-1b-it-Q6_K.gguf" # update to your exact filename
              - "--host"
              - "0.0.0.0"
              - "--port"
              - "8080"
            volumeMounts:
              - name: model
                mountPath: /models
                readOnly: true
            securityContext:
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
              capabilities: { drop: ["ALL"] }
        volumes:
          - name: model
            emptyDir: {}
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: llama-cpu
  spec:
    selector: { app: llama-cpu }
    ports:
      - name: http
        port: 80
        targetPort: 8080
    type: ClusterIP
  ```

  ```yaml title="GPU" theme={"system"}
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: llama-gpu
  spec:
    replicas: 1
    selector:
      matchLabels: { app: llama-gpu }
    template:
      metadata:
        labels: { app: llama-gpu }
      spec:
        initContainers:
          - name: pull-model
            image: ghcr.io/oras-project/oras:latest
            args: ["pull", "rlcr.io/ramalama/gemma-3-1b-it:gguf", "-o", "/models"]
            volumeMounts:
              - name: model
                mountPath: /models
        containers:
          - name: llama
            image: rlcr.io/ramalama/llamacpp-cuda-distroless:latest
            ports:
              - containerPort: 8080
            args:
              - "--model"
              - "/models/gemma-3-1b-it-Q6_K.gguf" # update to your exact filename
              - "--host"
              - "0.0.0.0"
              - "--port"
              - "8080"
            volumeMounts:
              - name: model
                mountPath: /models
                readOnly: true
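            env:
              # CUDA_VISIBLE_DEVICES is intentionally not set: it expects device
              # indices or UUIDs, and a value of "all" would hide every GPU from CUDA.
              - name: NVIDIA_VISIBLE_DEVICES
                value: all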
            resources:
              limits:
                nvidia.com/gpu: "1"
            securityContext:
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
              capabilities: { drop: ["ALL"] }
        volumes:
          - name: model
            emptyDir: {}
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: llama-gpu
  spec:
    selector: { app: llama-gpu }
    ports:
      - name: http
        port: 80
        targetPort: 8080
    type: ClusterIP
  ```
</CodeGroup>

## Operational Tips

* Pin a specific RamaLama image tag for reproducible rollouts; the examples above use `:latest` only for brevity.
* For other accelerators (ROCm, Intel GPU, etc.), browse tags at `registry.ramalama.com` and pull from `rlcr.io/ramalama/*`, then apply the appropriate device resources.
* For persistence across pod restarts, replace `emptyDir` with a PVC and write to it from the initContainer once; subsequent restarts can mount the pre‑seeded PVC read‑only.
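
The PVC tip above could look roughly like the following. This is a sketch under assumptions, not a tested manifest: the claim name `llama-models`, the `5Gi` size, and the `ReadWriteOnce` access mode are illustrative choices, and the second document is a fragment of the Deployment's pod spec rather than a complete object.

```yaml
# Hypothetical claim that holds the model across pod restarts.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama-models            # illustrative name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 5Gi              # assumption: size this for your model file
---
# Fragment of spec.template.spec in the Deployment: swap emptyDir for the claim.
# The initContainer and the runtime container keep mounting the volume named "model".
volumes:
  - name: model
    persistentVolumeClaim:
      claimName: llama-models
```

The pod-level volume has to stay writable while the initContainer seeds it; the runtime container's `readOnly: true` mount from the examples above still prevents the server from modifying the model.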
