The chat() method sends a chat completion request to a running model server and returns a ChatMessage payload. It is a simple API for quick prompts when you do not need to call the HTTP endpoint directly.

Basic chat

from ramalama_sdk import RamalamaModel

with RamalamaModel(model="tinyllama") as model:
    response = model.chat("How tall is Michael Jordan?")
    print(response["content"])
Michael Jordan is 6 feet 6 inches (1.98 m) tall.

Multiturn conversations

For multiturn conversations, the chat() method accepts an optional history argument, which can also be used to set a system prompt.
from ramalama_sdk import RamalamaModel

sys_prompt = {
    "role": "system",
    "content": "Respond to all conversations as if you were a dog with variations of bark and woof."
}
history = [sys_prompt]

with RamalamaModel(model="tinyllama") as model:
    response = model.chat("How tall is Michael Jordan?", history)
    print(response["content"])
Woof woof. Bark bark bark. Rrr-woooooof.
Arf arf arf arf arf arf. Ruff!
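To carry a conversation across several calls, you can maintain the history list yourself, appending each user prompt and the returned ChatMessage before the next chat() call. A minimal sketch of that bookkeeping, using a stand-in reply instead of a live model (the extend_history helper is illustrative, not part of the SDK):

```python
# Illustrative bookkeeping for multiturn chat; extend_history is a local
# helper, not part of ramalama_sdk.
def extend_history(history, user_text, assistant_reply):
    """Record one completed turn so the next chat() call sees it."""
    history.append({"role": "user", "content": user_text})
    history.append(assistant_reply)  # the ChatMessage returned by chat()
    return history

history = [{"role": "system", "content": "Respond as if you were a dog."}]
# Stand-in for: reply = model.chat("How tall is Michael Jordan?", history)
reply = {"role": "assistant", "content": "Woof woof. Bark!"}
extend_history(history, "How tall is Michael Jordan?", reply)
print(len(history))  # 3 messages: system, user, assistant
```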

Model instantiation

RamalamaModel exposes a variety of customization parameters, including base_image, which lets you swap the container image used for the model runtime. This is especially useful when you need to run inference on custom hardware that requires a specially compiled build of llama.cpp, vLLM, or another inference engine.
from ramalama_sdk import RamalamaModel

model = RamalamaModel(
    model="tinyllama",
    base_image="artifactory.corp.com/llama-runtime:prod",
    temp=0.7,
    ngl=20,
    max_tokens=256,
    threads=8,
    ctx_size=4096,
    timeout=30,
)
| Field | Type | Description | Default |
| --- | --- | --- | --- |
| model | str | Model name or identifier. | required |
| base_image | str | Container image to use for serving, if different from config. | quay.io/ramalama/ramalama |
| temp | float | Temperature override for sampling. | 0.8 |
| ngl | int | GPU layers override. | -1 (all) |
| max_tokens | int | Maximum tokens for completions. | 0 (unlimited) |
| threads | int | CPU threads override. | -1 (all) |
| ctx_size | int | Context window override. | 0 (loaded from the model) |
| timeout | int | Seconds to wait for server readiness. | 30 |

Async models

The async model API is identical to the sync examples above.
from ramalama_sdk import AsyncRamalamaModel

async with AsyncRamalamaModel(model="tinyllama") as model:
    response = await model.chat("How tall is Michael Jordan?")
    print(response["content"])

Before you call chat()

The server must be running. If you are not using a context manager, manage the model lifecycle yourself:
from ramalama_sdk import RamalamaModel

model = RamalamaModel(model="tinyllama")
model.download()
model.serve()

try:
    response = model.chat("Hello!")
    print(response["content"])
finally:
    model.stop()

Method signature

RamalamaModel.chat(message: str, history: list[ChatMessage] | None = None) -> ChatMessage

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| message | str | User prompt content. | required |
| history | list[ChatMessage] or None | Optional prior conversation messages. | None |

Returns

A ChatMessage typed dict with the assistant response.
| Field | Type | Description |
| --- | --- | --- |
| role | Literal["system", "user", "assistant", "developer"] | Message author role. |
| content | str | Message text content. |
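The shape described above corresponds to a TypedDict like the sketch below. This is a local illustration only; the real ChatMessage type is defined in ramalama_sdk.

```python
from typing import Literal, TypedDict

# Local illustration of the ChatMessage shape; the real type lives in
# ramalama_sdk.
class ChatMessage(TypedDict):
    role: Literal["system", "user", "assistant", "developer"]
    content: str

msg: ChatMessage = {"role": "assistant", "content": "6 ft 6 in."}
print(msg["role"], "-", msg["content"])
```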

Raises

  • RuntimeError if the server is not running.
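A defensive calling pattern looks like the sketch below; the chat stub stands in for an SDK call against a stopped server, and its error message is illustrative rather than the SDK's exact wording.

```python
# Stand-in that mimics chat() failing because the server is down; the
# real method raises RuntimeError in the same situation.
def chat(message):
    raise RuntimeError("server is not running")

error = None
try:
    chat("Hello!")
except RuntimeError as exc:
    error = str(exc)
    print(f"start the server first: {error}")
```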

When to use chat() vs direct HTTP

| Use case | Recommended approach |
| --- | --- |
| Quick responses | chat() |
| Custom payloads or full OpenAI schema control | Direct HTTP to /chat/completions |
| Interoperability with existing OpenAI clients | Direct HTTP to /chat/completions |
For direct HTTP calls, see the quick start example that uses requests.
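For comparison, a direct HTTP call builds the OpenAI-style messages array by hand. The URL below is an assumption (adjust host and port to match what your server reports), and the actual POST is left commented so the sketch runs without a server:

```python
import json

# Assumed address of the running model server; verify against your setup.
URL = "http://localhost:8080/v1/chat/completions"

def build_payload(message, history=None):
    """Assemble the request body an OpenAI-compatible endpoint expects."""
    messages = list(history or [])
    messages.append({"role": "user", "content": message})
    return {"model": "tinyllama", "messages": messages}

payload = build_payload("How tall is Michael Jordan?")
print(json.dumps(payload, indent=2))

# With a server running:
#   import requests
#   reply = requests.post(URL, json=payload, timeout=30).json()
#   print(reply["choices"][0]["message"]["content"])
```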