Model images package both an inference runtime (e.g., llama.cpp) and a specific model into a single container image. They’re ideal for quick starts, demos, single‑purpose services, and environments where simplicity is preferred over component isolation.

When to use model images

  • Fastest way to get an endpoint running
  • Minimal choices: no need to choose a runtime or mount model files
  • Great for laptops, POCs, and small dedicated services
If you need stronger isolation or to manage model files independently, see /pages/artifacts/runtime and /pages/artifacts/model.

Quick start

docker pull rlcr.io/ramalama/gemma3-270m:latest
docker run --rm -p 8080:8080 rlcr.io/ramalama/gemma3-270m:latest
Test the OpenAI‑compatible API:
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gemma3-270m","messages":[{"role":"user","content":"Say hello in one sentence"}]}'
You can find the full catalogue of RamaLama Labs images here.

Compose

docker-compose.yaml
services:
  ai:
    image: rlcr.io/ramalama/gemma3-270m:latest
    ports:
      - "8080:8080"
    restart: unless-stopped
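To start the service in the background and confirm it responds, assuming the port mapping above:
# Start the service detached
docker compose up -d

# Same request as the quick start
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gemma3-270m","messages":[{"role":"user","content":"Say hello in one sentence"}]}'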

Notes on updates, tags, and hardware

  • Examples use :latest; pin tags in production for repeatability (see the pinning example after this list)
  • Images are rebuilt and scanned regularly for security and performance
  • Hardware acceleration is chosen by the underlying image; for advanced control, use runtimes directly
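One way to pin is to pull by immutable digest instead of a moving tag; the digest below is a placeholder, so substitute the one published for the image you deploy:
# Pull by digest rather than :latest (digest shown is a placeholder)
docker pull rlcr.io/ramalama/gemma3-270m@sha256:<digest>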

See also

  • Manage models separately: /pages/artifacts/model
  • Engines only (mount a model): /pages/artifacts/runtime