The chat() method sends a chat completion request to a running model server and returns a ChatMessage payload. It is a simple API for quick prompts when you do not need to call the HTTP endpoint directly.

Basic chat

from ramalama_sdk import RamalamaModel

with RamalamaModel(model="tinyllama") as model:
    response = model.chat("How tall is Michael Jordan?")
    print(response["content"])
Michael Jordan is 6 feet 6 inches (1.98 m) tall.

Multiturn conversations

For multiturn conversations, the chat() method accepts an optional history argument, which can also be used to set a system prompt.
from ramalama_sdk import RamalamaModel

sys_prompt = {
    "role": "system",
    "content": "Respond to all conversations as if you were a dog with variations of bark and woof."
}
history = [sys_prompt]

with RamalamaModel(model="tinyllama") as model:
    response = model.chat("How tall is Michael Jordan?", history)
    print(response["content"])
Woof woof. Bark bark bark. Rrr-woooooof.
Arf arf arf arf arf arf. Ruff!
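To carry a conversation across several calls, you can maintain the history list yourself, appending each user prompt and the returned ChatMessage before the next chat() call. A minimal sketch of that bookkeeping, using a stand-in reply instead of a live model (the extend_history helper is illustrative, not part of the SDK):

```python
# Illustrative bookkeeping for multiturn chat; extend_history is a local
# helper, not part of ramalama_sdk.
def extend_history(history, user_text, assistant_reply):
    """Record one completed turn so the next chat() call sees it."""
    history.append({"role": "user", "content": user_text})
    history.append(assistant_reply)  # the ChatMessage returned by chat()
    return history

history = [{"role": "system", "content": "Respond as if you were a dog."}]
# Stand-in for: reply = model.chat("How tall is Michael Jordan?", history)
reply = {"role": "assistant", "content": "Woof woof. Bark!"}
extend_history(history, "How tall is Michael Jordan?", reply)
print(len(history))  # 3 messages: system, user, assistant
```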

Model instantiation

RamalamaModel exposes a variety of customization parameters, including base_image, which lets you swap the container image used for the model runtime. This is especially useful when you need to run inference on custom hardware that requires a specially compiled build of llama.cpp, vLLM, or another inference engine.
from ramalama_sdk import RamalamaModel

model = RamalamaModel(
    model="tinyllama",
    base_image="artifactory.corp.com/llama-runtime:prod",
    temp=0.7,
    ngl=20,
    max_tokens=256,
    threads=8,
    ctx_size=4096,
    timeout=30,
)
| Field | Type | Description | Default |
| --- | --- | --- | --- |
| model | str | Model name or identifier. | required |
| base_image | str | Container image to use for serving, if different from config. | quay.io/ramalama/ramalama |
| temp | float | Temperature override for sampling. | 0.8 |
| ngl | int | GPU layers override. | -1 (all) |
| max_tokens | int | Maximum tokens for completions. | 0 (unlimited) |
| threads | int | CPU threads override. | -1 (all) |
| ctx_size | int | Context window override. | 0 (loaded from the model) |
| timeout | int | Seconds to wait for server readiness. | 30 |

Async models

The async model API is identical to the sync examples above.
from ramalama_sdk import AsyncRamalamaModel

async with AsyncRamalamaModel(model="tinyllama") as model:
    response = await model.chat("How tall is Michael Jordan?")
    print(response["content"])

Before you call chat()

The server must be running. If you are not using a context manager, manage the model lifecycle yourself:
from ramalama_sdk import RamalamaModel

model = RamalamaModel(model="tinyllama")
model.download()
model.serve()

try:
    response = model.chat("Hello!")
    print(response["content"])
finally:
    model.stop()

Method signature

RamalamaModel.chat(message: str, history: list[ChatMessage] | None = None) -> ChatMessage

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| message | str | User prompt content. | required |
| history | list[ChatMessage] or None | Optional prior conversation messages. | None |

Returns

A ChatMessage typed dict with the assistant response.
| Field | Type | Description |
| --- | --- | --- |
| role | Literal["system", "user", "assistant", "developer"] | Message author role. |
| content | str | Message text content. |
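The shape described above corresponds to a TypedDict like the sketch below. This is a local illustration only; the real ChatMessage type is defined in ramalama_sdk.

```python
from typing import Literal, TypedDict

# Local illustration of the ChatMessage shape; the real type lives in
# ramalama_sdk.
class ChatMessage(TypedDict):
    role: Literal["system", "user", "assistant", "developer"]
    content: str

msg: ChatMessage = {"role": "assistant", "content": "6 ft 6 in."}
print(msg["role"], "-", msg["content"])
```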

Raises

  • RuntimeError if the server is not running.
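A defensive calling pattern looks like the sketch below; the chat stub stands in for an SDK call against a stopped server, and its error message is illustrative rather than the SDK's exact wording.

```python
# Stand-in that mimics chat() failing because the server is down; the
# real method raises RuntimeError in the same situation.
def chat(message):
    raise RuntimeError("server is not running")

error = None
try:
    chat("Hello!")
except RuntimeError as exc:
    error = str(exc)
    print(f"start the server first: {error}")
```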

When to use chat() vs direct HTTP

| Use case | Recommended approach |
| --- | --- |
| Quick responses | chat() |
| Custom payloads or full OpenAI schema control | Direct HTTP to /chat/completions |
| Interoperability with existing OpenAI clients | Direct HTTP to /chat/completions |
For direct HTTP calls, see the quick start example that uses requests.
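For comparison, a direct HTTP call builds the OpenAI-style messages array by hand. The URL below is an assumption (adjust host and port to match what your server reports), and the actual POST is left commented so the sketch runs without a server:

```python
import json

# Assumed address of the running model server; verify against your setup.
URL = "http://localhost:8080/v1/chat/completions"

def build_payload(message, history=None):
    """Assemble the request body an OpenAI-compatible endpoint expects."""
    messages = list(history or [])
    messages.append({"role": "user", "content": message})
    return {"model": "tinyllama", "messages": messages}

payload = build_payload("How tall is Michael Jordan?")
print(json.dumps(payload, indent=2))

# With a server running:
#   import requests
#   reply = requests.post(URL, json=payload, timeout=30).json()
#   print(reply["choices"][0]["message"]["content"])
```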