The `chat()` method sends a chat completion request to a running model server and returns a `ChatMessage` payload. It is a simple API for quick prompts when you do not need to call the HTTP endpoint directly.
## Basic Chat
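A minimal sketch of a single-prompt call. The import path, `Model` class, and model name below are assumptions based on this page; adapt them to your installation.

```python
# Hypothetical sketch of a basic chat() call. The import path and model
# name are assumptions; adapt them to your installation.
try:
    from ramalama import Model  # assumed import path
except ImportError:  # library not installed; keep the sketch importable
    Model = None

def ask(prompt: str) -> str:
    """Start a server, send one prompt, and return the assistant text."""
    # The context manager is assumed to start the server on entry and stop
    # it on exit; without it, manage the lifecycle yourself (see below).
    with Model("smollm:135m") as model:
        reply = model.chat(prompt)  # returns a ChatMessage dict
        return reply["content"]
```

The context-manager form keeps server startup and teardown out of your code path; for long-lived servers, start the model once and call `chat()` repeatedly.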
## Multiturn conversations
For multiturn conversations, the `chat()` method accepts an additional `history` argument, which can also be used to set system prompts.
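As a sketch, a history is a list of `ChatMessage` dicts with `role` and `content` keys (see the Returns section); the `chat()` calls themselves are shown in comments since they require a running server.

```python
# Build a conversation history, including a system prompt, as plain dicts.
history = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user", "content": "Name one prime number."},
    {"role": "assistant", "content": "7"},
]
# reply = model.chat("Name another one.", history=history)
# Append both sides to keep the conversation going:
# history += [{"role": "user", "content": "Name another one."}, reply]
```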
## Model instantiation
The model exposes a variety of customization parameters, including `base_image`, which lets you customize the model's container runtime. This is especially useful if you need to run inference on custom hardware that requires a specifically compiled build of llama.cpp, vLLM, or another backend.
| Field | Type | Description | Default |
|---|---|---|---|
| model | str | Model name or identifier. | required |
| base_image | str | Container image to use for serving, if different from config. | quay.io/ramalama/ramalama |
| temp | float | Temperature override for sampling. | 0.8 |
| ngl | int | GPU layers override. | -1 (all) |
| max_tokens | int | Maximum tokens for completions. | 0 (unlimited) |
| threads | int | CPU threads override. | -1 (all) |
| ctx_size | int | Context window override. | 0 (loaded from the model) |
| timeout | int | Seconds to wait for server readiness. | 30 |
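The keyword names and defaults below are taken from the table above; the constructor call itself is an assumption and is shown in a comment.

```python
# The fields from the table above, expressed as keyword arguments with
# their documented defaults.
kwargs = dict(
    base_image="quay.io/ramalama/ramalama",  # container image for serving
    temp=0.8,       # sampling temperature
    ngl=-1,         # -1 = offload all layers to the GPU
    max_tokens=0,   # 0 = unlimited completion length
    threads=-1,     # -1 = use all CPU threads
    ctx_size=0,     # 0 = context size loaded from the model
    timeout=30,     # seconds to wait for server readiness
)
# model = Model("smollm:135m", **kwargs)  # hypothetical constructor call
```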
## Async models
The async model API is identical to the sync examples above.

## Before you call chat()
The server must be running. If you are not using a context manager, manage the model lifecycle yourself.

## Method signature
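Written as a stub, the signature implied by the Parameters and Returns tables below looks like this (`ChatMessage` is the typed dict described under Returns):

```python
# Stub of the chat() signature implied by the Parameters and Returns
# tables; annotations are strings since ChatMessage is defined elsewhere.
def chat(message: str, history: "list[ChatMessage] | None" = None) -> "ChatMessage":
    """Send one user prompt, optionally with prior history, to the server."""
    ...
```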
### Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| message | str | User prompt content. | required |
| history | list[ChatMessage] or None | Optional prior conversation messages. | None |
### Returns
A `ChatMessage` typed dict containing the assistant response.
| Field | Type | Description |
|---|---|---|
| role | Literal['system', 'user', 'assistant', 'developer'] | Message author role. |
| content | str | Message text content. |
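The shape in the table above can be sketched as a `TypedDict`:

```python
# ChatMessage as a TypedDict, matching the fields in the table above.
from typing import Literal, TypedDict

class ChatMessage(TypedDict):
    role: Literal["system", "user", "assistant", "developer"]
    content: str

# Example value with the shape of an assistant response:
reply: ChatMessage = {"role": "assistant", "content": "Paris."}
```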
### Raises
`RuntimeError` if the server is not running.
## When to use chat() vs direct HTTP
| Use case | Recommended approach |
|---|---|
| Quick responses | chat() |
| Custom payloads or full OpenAI schema control | Direct HTTP to /chat/completions |
| Interoperability with existing OpenAI clients | Direct HTTP to /chat/completions |
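For full schema control, you can build an OpenAI-style request body yourself and POST it to the server's `/chat/completions` endpoint. The sketch below only constructs the body; the host and port in the comment are assumptions.

```python
# Build an OpenAI-style chat completion payload with the stdlib; no
# request is sent here, we only serialize the body.
import json

payload = {
    "messages": [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.8,
}
body = json.dumps(payload).encode("utf-8")
# urllib.request could then POST `body` to the server's /chat/completions
# endpoint (host and port depend on how the server was started).
```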
