RamaLama CLI is a local-first developer toolkit that treats AI models like container images: you can pull, run, and serve them with familiar container-centric workflows. It automatically detects your hardware and selects an appropriate runtime image for your configuration.
The RamaLama CLI is open-source and open to contributors. Check the project out at https://github.com/containers/ramalama

Installation

1. Install RamaLama CLI

Choose your preferred installation method:
pip install ramalama
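If you prefer an isolated, tool-managed install, something like pipx should also work, assuming pipx itself is already installed:
pipx install ramalama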
2. Verify the installation

Verify that RamaLama was successfully installed:
ramalama version

Functionality

The CLI includes a variety of useful functions, including:
  • Serving and interacting with AI models locally
  • Packaging containerized AI deployments
  • Building optimized deployments for RAG workloads
  • and more
This documentation covers only a small subset of the project's full capabilities. Complete information about the CLI is available on the GitHub project: https://github.com/containers/ramalama.
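You can also list every available subcommand locally:
ramalama --help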

Serve a REST API

RamaLama makes it easy to work with AI on your laptop. You can deploy an OpenAI-compatible API with a single command.
ramalama serve --image rlcr.io/ramalama/llamacpp-cpu-distroless -d -p 8080 rlcr://gemma3-270m
This command uses your locally installed container manager, such as Docker or Podman, to pull the requested image and run a container serving the requested LLM. You can find the full catalogue of RamaLama Labs images here.
You can query the server however you prefer, including curl, Postman, or the RamaLama CLI itself:
ramalama chat "Say hello in one sentence"
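If you prefer raw HTTP, the server exposes an OpenAI-compatible REST API. A minimal sketch with curl, assuming the standard /v1/chat/completions endpoint on the port chosen above (the model field is illustrative and is typically ignored by a single-model server):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma3-270m", "messages": [{"role": "user", "content": "Say hello in one sentence"}]}'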
Once you’re done working with the AI you can stop the server either with the CLI or your preferred container manager.
ramalama stop --all
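Because the server runs as an ordinary container, it also shows up in your container manager, so you can inspect or stop it there instead; for example, with Podman (the container name is a placeholder):
podman ps
podman stop <container_name>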

Model Repositories

RamaLama can serve models from any of the major model providers, including RamaLama Labs, Hugging Face, Ollama, and ModelScope. Additionally, it supports generic OCI model artifacts, meaning you can easily run and serve models from your own or your enterprise's model registry. For example, you can serve an OCI-compatible artifact from Docker Hub with
ramalama serve oci://docker.io/<your_repo>/<your_model>
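The same URL-style prefixes work for the other providers. A quick sketch, using placeholder model references:
ramalama serve huggingface://<your_org>/<your_model>
ramalama serve ollama://<your_model>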

Hardware acceleration

RamaLama inspects your system and chooses a matching runtime image (e.g., CUDA, ROCm, Intel GPU, or CPU). However, you can override the default image explicitly with the --image flag:
ramalama serve -d -p 8081 --image rlcr://llamacpp-distroless-cuda:latest llama3
For NVIDIA GPUs with Docker, ensure the NVIDIA Container Toolkit is installed.
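To see which container engine, runtime image, and accelerator RamaLama detected on your system, you can inspect its configuration with the info subcommand:
ramalama info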

Next steps