RamaLama images can be used like any other containerized workload. While not exhaustive, the two strategies below cover common ways to deploy them to Kubernetes.
Although model images which package both the runtime and the model into a single image are available, we generally advise mounting models as volumes onto a runtime image when deploying to production. This keeps runtime and model lifecycles independent and reduces image size.

OCI Image Volume (Kubernetes 1.33+)

As of Kubernetes 1.33, image volumes have been promoted to beta. With this feature, you can mount a container image as a read-only volume. For many models we provide both raw OCI artifacts tagged by their file type (e.g., :gguf) and OCI images with the model file placed under /models, tagged as :gguf-image.
Requires Kubernetes 1.33+ with OCI image volume support (the ImageVolume feature gate) enabled in your cluster. GPU prerequisites apply to the GPU variant sketched below: NVIDIA drivers on the nodes and the NVIDIA Device Plugin.
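Enabling the feature means turning on the ImageVolume feature gate for both the kube-apiserver and the kubelet (the gate is disabled by default even in beta), on top of a container runtime recent enough to support image volumes. As a minimal sketch of the kubelet side, assuming you manage kubelet configuration directly:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ImageVolume: true

With the feature enabled, the Deployment below mounts the :gguf-image model image as a read-only volume: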
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-cpu
spec:
  replicas: 1
  selector:
    matchLabels: { app: llama-cpu }
  template:
    metadata:
      labels: { app: llama-cpu }
    spec:
      containers:
        - name: llama
          image: rlcr.io/ramalama/llamacpp-cpu-distroless:latest
          ports:
            - containerPort: 8080
          args:
            - "--model"
            - "/models/gemma-3-1b-it-Q6_K.gguf" # update to your exact filename
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8080"
          volumeMounts:
            - name: model
              mountPath: /models
              readOnly: true
              subPath: models  # mount only the /models directory from the image
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: ["ALL"] }
      volumes:
        - name: model
          image:
            reference: rlcr.io/ramalama/gemma-3-1b-it:gguf-image
            pullPolicy: IfNotPresent
---
apiVersion: v1
kind: Service
metadata:
  name: llama-cpu
spec:
  selector: { app: llama-cpu }
  ports:
    - name: http
      port: 80
      targetPort: 8080
  type: ClusterIP
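
The same pattern extends to GPU nodes. The fragment below sketches the changes relative to the CPU Deployment above; the CUDA runtime tag is an assumption (browse registry.ramalama.com for the tags actually published), and the nvidia.com/gpu resource is only schedulable once the NVIDIA Device Plugin is installed.

      containers:
        - name: llama
          image: rlcr.io/ramalama/llamacpp-cuda:latest # assumed tag; verify in the registry
          args:
            - "--model"
            - "/models/gemma-3-1b-it-Q6_K.gguf"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8080"
            - "--n-gpu-layers" # llama.cpp flag: number of layers to offload to the GPU
            - "999"            # a large value offloads every layer
          resources:
            limits:
              nvidia.com/gpu: 1 # surfaced by the NVIDIA Device Plugin
      # volumeMounts, securityContext, the image volume, and the Service
      # are unchanged from the CPU example above.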

InitContainer (ORAS into emptyDir)

Use an initContainer to pull the model's OCI artifact (:gguf) with ORAS into an emptyDir mounted at /models before the runtime starts. This strategy works on any currently supported version of Kubernetes without special volume types.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-cpu
spec:
  replicas: 1
  selector:
    matchLabels: { app: llama-cpu }
  template:
    metadata:
      labels: { app: llama-cpu }
    spec:
      initContainers:
        - name: pull-model
          image: ghcr.io/oras-project/oras:latest
          args: ["pull", "rlcr.io/ramalama/gemma-3-1b-it:gguf", "-o", "/models"]
          volumeMounts:
            - name: model
              mountPath: /models
      containers:
        - name: llama
          image: rlcr.io/ramalama/llamacpp-cpu-distroless:latest
          ports:
            - containerPort: 8080
          args:
            - "--model"
            - "/models/gemma-3-1b-it-Q6_K.gguf" # update to your exact filename
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8080"
          volumeMounts:
            - name: model
              mountPath: /models
              readOnly: true
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: ["ALL"] }
      volumes:
        - name: model
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: llama-cpu
spec:
  selector: { app: llama-cpu }
  ports:
    - name: http
      port: 80
      targetPort: 8080
  type: ClusterIP
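
Either strategy can be smoke-tested end to end with a port-forward, since llama.cpp's server exposes an OpenAI-compatible API. Assuming the manifests above are saved as llama-cpu.yaml:

kubectl apply -f llama-cpu.yaml
kubectl rollout status deployment/llama-cpu

# Forward the Service locally, then send a test chat completion
kubectl port-forward svc/llama-cpu 8080:80 &
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello"}]}'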

Operational Tips

  • Pin a specific RamaLama image tag for reproducible rollouts.
  • For other accelerators (ROCm, Intel GPU, etc.), browse tags at registry.ramalama.com and pull from rlcr.io/ramalama/*, then apply the appropriate device resources.
  • For persistence across pod restarts, replace emptyDir with a PVC and write to it from the initContainer once; subsequent restarts can mount the pre‑seeded PVC read‑only (see the sketch below).
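
As a sketch of that last tip, assuming the cluster has a default StorageClass: create a PVC, point the model volume at it, and skip the pull when the model is already present. The shell guard in the initContainer assumes an ORAS image that ships a shell; if yours does not, the unconditional pull from the example above still works, it just re-downloads on each fresh pod.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 5Gi # size to fit your model file(s)

In the Deployment, the initContainer and volume become:

      initContainers:
        - name: pull-model
          image: ghcr.io/oras-project/oras:latest
          command: ["sh", "-c"] # assumes the image provides a shell
          args:
            - test -f /models/gemma-3-1b-it-Q6_K.gguf || oras pull rlcr.io/ramalama/gemma-3-1b-it:gguf -o /models
          volumeMounts:
            - name: model
              mountPath: /models
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: model-cache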