Our containerized AI artifacts are OCI-compatible, allowing you to use them directly with Docker, Podman, and Kubernetes wherever you need them: in the cloud, a datacenter, or your basement. Our artifacts are regularly rebuilt, updated, and scanned for vulnerabilities to provide the smallest, fastest, and most secure runtime possible.
You can find comparisons between different images on each image's comparisons page (e.g. for llama.cpp's CUDA and CPU runtimes).
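
Because the images are OCI-compatible, you can try one directly with Docker or Podman before setting up Compose. As a minimal sketch, this runs the same gemma3-270m model image used in the quick start below (substitute docker for podman if you prefer):

podman run -d --name ai -p 8080:8080 rlcr.io/ramalama/gemma3-270m:latest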

Quick start

The fastest path is to deploy a model image, which bundles the runtime and model together, using Docker Compose. For more information about deploying in production environments, check out deployment.
Step 1: Install dependencies

Getting started requires either Docker or Podman. We also recommend the RamaLama CLI for a streamlined experience.
  1. Install Podman or Docker
  2. (Optional) Install RamaLama CLI
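
For example, on a Fedora-family system (the package names and install commands below are one common path, not the only one; adjust for your OS, e.g. brew install podman on macOS):

# Install Podman from the distribution repositories
sudo dnf install podman

# (Optional) Install the RamaLama CLI from PyPI
pip install ramalama
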
Step 2: Create docker-compose.yaml

Create a docker-compose.yaml that uses a model image, which bundles the runtime and model into a single runnable container.
services:
  ai:
    image: rlcr.io/ramalama/gemma3-270m:latest
    ports:
      - "8080:8080"
    restart: unless-stopped
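
If you want GPU acceleration with an accelerated image variant, Compose can reserve the devices for the container. This is a hedged sketch using the standard Compose device-reservation syntax for NVIDIA GPUs; it requires the NVIDIA Container Toolkit on the host, and the CUDA-enabled image tag is an assumption, so check the image's comparisons page for the exact name:

services:
  ai:
    image: rlcr.io/ramalama/gemma3-270m:latest  # swap in the CUDA-enabled tag if one is listed
    ports:
      - "8080:8080"
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
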
Step 3: Start the stack

docker compose up -d
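
The container exposes an OpenAI-compatible HTTP API on port 8080. Before chatting, you can confirm the stack is healthy and see which model is being served (most OpenAI-compatible runtimes, including llama.cpp's server, expose /v1/models):

docker compose ps
curl -s http://localhost:8080/v1/models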
Step 4: Get chatting

curl -s http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model":"gemma3-270m","messages":[{"role":"user","content":"Say hello in one sentence"}]}'
Hello!
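
The raw response from /v1/chat/completions is an OpenAI-style JSON document; the assistant's reply ("Hello!" above) lives in the choices array. To print just the message text, pipe the response through jq (assuming jq is installed):

curl -s http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model":"gemma3-270m","messages":[{"role":"user","content":"Say hello in one sentence"}]}' \
    | jq -r '.choices[0].message.content'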