RamaLama images can be used like any other containerized workload.
While not exhaustive, we provide two strategies you can use when deploying to Kubernetes.
Although model images that package both the runtime and the model into a single image are available, we generally recommend mounting models as volumes onto a runtime image when deploying to production.
This keeps the runtime and model lifecycles independent and reduces image size.
OCI Image Volume (Kubernetes 1.33+)
As of Kubernetes 1.33, image volumes have officially been promoted to beta.
With this feature, you can mount a container image as a read-only volume.
For many models we provide both raw OCI artifacts tagged by file type (e.g. :gguf) and OCI images with the model file placed under /models, tagged :gguf-image.
Requires Kubernetes 1.33+ with the ImageVolume feature gate enabled and a container runtime that supports image volume mounts.
If you deploy a GPU variant of the runtime image, the usual GPU prerequisites apply: NVIDIA drivers on the nodes and the NVIDIA Device Plugin.
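For reference, a GPU-enabled spec differs from the CPU manifests below mainly in the runtime image tag and a device resource request. This is a minimal sketch, not a tested manifest; the CUDA image tag shown is an assumption, so browse registry.ramalama.com for the actual accelerator tags before using it.

# Sketch only: the image tag is a placeholder assumption; check registry.ramalama.com for real tags.
containers:
  - name: llama
    image: rlcr.io/ramalama/llamacpp-cuda:latest  # hypothetical CUDA runtime tag
    resources:
      limits:
        nvidia.com/gpu: 1  # scheduled onto a GPU node via the NVIDIA Device Plugin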
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-cpu
spec:
  replicas: 1
  selector:
    matchLabels: { app: llama-cpu }
  template:
    metadata:
      labels: { app: llama-cpu }
    spec:
      containers:
        - name: llama
          image: rlcr.io/ramalama/llamacpp-cpu-distroless:latest
          ports:
            - containerPort: 8080
          args:
            - "--model"
            - "/models/gemma-3-1b-it-Q6_K.gguf"  # update to your exact filename
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8080"
          volumeMounts:
            - name: model
              mountPath: /models
              readOnly: true
              subPath: models  # mount only the /models directory from the image
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: ["ALL"] }
      volumes:
        - name: model
          image:
            reference: rlcr.io/ramalama/gemma-3-1b-it:gguf-image
            pullPolicy: IfNotPresent
---
apiVersion: v1
kind: Service
metadata:
  name: llama-cpu
spec:
  selector: { app: llama-cpu }
  ports:
    - name: http
      port: 80
      targetPort: 8080
  type: ClusterIP
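Once the Deployment is running, health probes make rollouts safer while the model loads. llama.cpp's server exposes a /health endpoint; assuming the RamaLama runtime image serves it on the same port (verify this against the image you deploy), a probe block on the llama container might look like the sketch below.

# Goes under the llama container spec. Assumes the runtime answers llama.cpp's
# /health endpoint on port 8080; verify against the image you deploy.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10   # model load can take a while
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 10

A readiness probe keeps traffic away from the pod until the model has finished loading.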
InitContainer (ORAS into emptyDir)
Use an initContainer running ORAS to pull the model's OCI artifact (:gguf) into an emptyDir mounted at /models before the runtime starts.
This strategy works on any currently supported version of Kubernetes without special volume types.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-cpu
spec:
  replicas: 1
  selector:
    matchLabels: { app: llama-cpu }
  template:
    metadata:
      labels: { app: llama-cpu }
    spec:
      initContainers:
        - name: pull-model
          image: ghcr.io/oras-project/oras:latest
          args: ["pull", "rlcr.io/ramalama/gemma-3-1b-it:gguf", "-o", "/models"]
          volumeMounts:
            - name: model
              mountPath: /models
      containers:
        - name: llama
          image: rlcr.io/ramalama/llamacpp-cpu-distroless:latest
          ports:
            - containerPort: 8080
          args:
            - "--model"
            - "/models/gemma-3-1b-it-Q6_K.gguf"  # update to your exact filename
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8080"
          volumeMounts:
            - name: model
              mountPath: /models
              readOnly: true
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: ["ALL"] }
      volumes:
        - name: model
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: llama-cpu
spec:
  selector: { app: llama-cpu }
  ports:
    - name: http
      port: 80
      targetPort: 8080
  type: ClusterIP
Operational Tips
- Pin a specific RamaLama image tag for reproducible rollouts.
- For other accelerators (ROCm, Intel GPU, etc.), browse tags at registry.ramalama.com, pull from rlcr.io/ramalama/*, and apply the appropriate device resources.
- For persistence across pod restarts, replace emptyDir with a PVC and write to it from the initContainer once; subsequent restarts can mount the pre-seeded PVC read-only (a minimal sketch follows this list).
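A minimal sketch of that PVC-backed variant, assuming a default StorageClass and a placeholder 5Gi size; adjust both to your cluster and model:

# Sketch only: size, access mode, and StorageClass defaults are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 5Gi
---
# In the Deployment above, swap the emptyDir volume for the claim:
volumes:
  - name: model
    persistentVolumeClaim:
      claimName: model-cache

To avoid re-downloading on every restart, have the initContainer skip the pull when the model file already exists on the claim.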