The SDK spins up a local model server and lets you chat with it using a simple API.

Run a model

Context Managers

The context manager automatically starts the model and cleans it up for you when the block exits.
from ramalama_sdk import RamalamaModel

with RamalamaModel(model="tinyllama") as model:
    response = model.chat("How tall is Michael Jordan?")
    print(response["content"])

Manual Management

It’s also possible to manage the model’s run state manually.
from ramalama_sdk import RamalamaModel

model_name = "tinyllama"
model = RamalamaModel(model=model_name)
model.download()  # fetch and cache the model locally
model.serve()     # start the local model server
Once the model is serving, you can chat with it through the SDK or call the local OpenAI-compatible endpoint yourself (a sketch of a direct call follows the example below). Stop the model when you are finished:
try:
    response = model.chat("How tall is Michael Jordan?")
    print(response["content"])
finally:
    model.stop()  # shut the server down even if chat() raises
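
If you prefer to call the endpoint directly, the snippet below is a minimal sketch using requests. It assumes the server is listening at http://localhost:8080/v1; the actual host and port depend on how the server was configured.

import requests

# Assumed address; check your configuration for the real host and port.
base_url = "http://localhost:8080/v1"

payload = {
    "model": "tinyllama",
    "messages": [{"role": "user", "content": "How tall is Michael Jordan?"}],
}
resp = requests.post(f"{base_url}/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])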

Download models

Use download() to fetch and cache models before serving. The model identifier controls where the SDK pulls from. Common prefixes include:
  • HuggingFace: hf://
  • Ollama: ollama://
  • OCI (any OCI image repository): oci://
  • ModelScope: modelscope://
  • File: file://
from ramalama_sdk import RamalamaModel

model = RamalamaModel(model="hf://ggml-org/gpt-oss-20b-GGUF")
model.download()
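
The same call works for the other transports. The identifiers below are illustrative placeholders, not real artifacts:

from ramalama_sdk import RamalamaModel

# Illustrative identifiers; substitute the models you actually want to pull.
RamalamaModel(model="ollama://tinyllama").download()                      # Ollama registry
RamalamaModel(model="oci://quay.io/example/tinyllama:latest").download()  # OCI image repository (hypothetical image)
RamalamaModel(model="file:///models/tinyllama.gguf").download()           # local file (hypothetical path)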

Instantiating a model

You can pass runtime overrides when creating a model session:
from ramalama_sdk import RamalamaModel

model = RamalamaModel(
    model="tinyllama",
    base_image=None,
    temp=0.7,
    ngl=20,
    max_tokens=256,
    threads=8,
    ctx_size=4096,
    timeout=30,
)
Parameter  | Type          | Description                                                   | Default
model      | str           | Model name or identifier.                                     | required
base_image | str or None   | Container image to use for serving, if different from config. | None
temp       | float or None | Temperature override for sampling.                            | None
ngl        | int or None   | GPU layers override.                                          | None
max_tokens | int or None   | Maximum tokens for completions.                               | None
threads    | int or None   | CPU threads override.                                         | None
ctx_size   | int or None   | Context window override.                                      | None
timeout    | int           | Seconds to wait for server readiness.                         | 30
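
These overrides compose with either lifecycle style. A minimal sketch using the context manager from earlier (the override values are illustrative, not tuned recommendations):

from ramalama_sdk import RamalamaModel

# Illustrative override values; adjust them for your hardware and use case.
with RamalamaModel(model="tinyllama", temp=0.2, ctx_size=4096, timeout=60) as model:
    response = model.chat("Summarize the rules of basketball in one sentence.")
    print(response["content"])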