RamaLama CLI is a local-first developer toolkit that treats AI models like container images: you can pull, run, and serve them with familiar container-centric workflows. It automatically detects your hardware and selects an appropriate runtime image for your configuration.
The RamaLama CLI is open-source and open to contributors. Check the project out at https://github.com/containers/ramalama

Installation

1. Install RamaLama CLI

Choose your preferred installation method:
pip install ramalama
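If you prefer an isolated, tool-managed install, something like pipx should also work, assuming pipx itself is already installed:
pipx install ramalama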
2. Verify the installation

Verify that RamaLama was successfully installed:
ramalama version

Functionality

The CLI includes a variety of useful functions, including:
  • Serving and interacting with AI models locally
  • Packaging containerized AI deployments
  • Building optimized deployments for RAG workloads
  • and more
This documentation covers only a small subset of the project's full capabilities. Complete information about the CLI is available on the GitHub project: https://github.com/containers/ramalama.
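You can also list every available subcommand locally:
ramalama --help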

Serve a REST API

RamaLama makes it easy to work with AI on your laptop. You can deploy an OpenAI-compatible API with a single command.
ramalama serve --image rlcr.io/ramalama/llamacpp-cpu-distroless -d -p 8080 rlcr://gemma3-270m
This command uses your locally installed container manager, such as Docker or Podman, to pull the requested image and run a container serving the requested LLM. You can find the full catalogue of RamaLama Labs images here.
You can query the server however you prefer, including curl, Postman, or the RamaLama CLI itself:
ramalama chat "Say hello in one sentence"
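If you prefer raw HTTP, the server exposes an OpenAI-compatible REST API. A minimal sketch with curl, assuming the standard /v1/chat/completions endpoint on the port chosen above (the model field is illustrative and is typically ignored by a single-model server):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma3-270m", "messages": [{"role": "user", "content": "Say hello in one sentence"}]}'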
Once you’re done working with the AI you can stop the server either with the CLI or your preferred container manager.
ramalama stop --all
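Because the server runs as an ordinary container, it also shows up in your container manager, so you can inspect or stop it there instead; for example, with Podman (the container name is a placeholder):
podman ps
podman stop <container_name>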

Model Repositories

RamaLama can serve models from any of the major model providers, including RamaLama Labs, Hugging Face, Ollama, and ModelScope. Additionally, it supports generic OCI model artifacts, meaning you can easily run and serve models from your own or your enterprise's model registry. For example, you can serve an OCI-compatible artifact from Docker Hub with
ramalama serve oci://docker.io/<your_repo>/<your_model>
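The same URL-style prefixes work for the other providers. A quick sketch, using placeholder model references:
ramalama serve huggingface://<your_org>/<your_model>
ramalama serve ollama://<your_model>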

Hardware acceleration

RamaLama inspects your system and chooses a matching runtime image (e.g., CUDA, ROCm, Intel GPU, or CPU). However, you can override the default image explicitly with the --image flag:
ramalama serve -d -p 8081 --image rlcr://llamacpp-distroless-cuda:latest llama3
For NVIDIA GPUs with Docker, ensure the NVIDIA Container Toolkit is installed.
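To see which container engine, runtime image, and accelerator RamaLama detected on your system, you can inspect its configuration with the info subcommand:
ramalama info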

Next steps