
## Overview
The Python SDK provides a local-first developer experience for running AI models on device. It wraps the RamaLama CLI to provision models in containers and exposes a simple API for inference in your apps. Core capabilities include:

- LLM: local chat with OpenAI-compatible HTTP endpoints for direct requests.
- STT: speech-to-text with Whisper models running on device.

## Capabilities

### Chat
Send chat completion requests to a running model server.
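
The server speaks the OpenAI chat-completions protocol, so any OpenAI-compatible client can talk to it. A minimal sketch using the `openai` Python package; the port (8080) and model name are assumptions, so match them to whatever your local server reports:

```python
from openai import OpenAI

# Point the client at the local model server instead of the OpenAI cloud.
# The base URL and model name are assumptions -- adjust to your setup.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="tinyllama",  # hypothetical model name
    messages=[{"role": "user", "content": "Explain containers in one sentence."}],
)
print(response.choices[0].message.content)
```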

### Speech-to-Text
Local transcription with Whisper models (coming soon).

## Key Capabilities
- Container-native model provisioning with the RamaLama CLI.
- Flexible model sources (HuggingFace, Ollama, ModelScope, OCI registries, local files, URLs); see the example references after this list.
- Local-first inference to minimize latency and protect data.
- Model lifecycle control (download, serve, stop) from code.
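
For illustration, here are model references in the style the RamaLama CLI resolves. The scheme prefixes and repository names below are assumptions, so verify them against your installed CLI:

```python
# Example model references, one per source type. The scheme prefixes and
# repo names are assumptions -- verify against your RamaLama version.
MODEL_REFERENCES = [
    "huggingface://TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",  # Hugging Face
    "ollama://tinyllama",                                    # Ollama registry
    "modelscope://Qwen/Qwen2.5-0.5B-Instruct-GGUF",          # ModelScope
    "oci://quay.io/example/model:latest",                    # OCI registry
    "/models/local-model.gguf",                              # local file
    "https://example.com/models/model.gguf",                 # direct URL
]
```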

## Core Philosophy
- On-device first
- Container-native by default
- Privacy-focused
- Developer-friendly APIs

## Features

### Language Models (LLM)
- Local chat with a simple SDK interface.
- OpenAI-compatible HTTP endpoint for direct requests (see the raw-HTTP sketch after this list).
- Bring-your-own model sources through the RamaLama CLI.
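
Because the endpoint is plain HTTP, direct requests need no SDK at all. A sketch using the `requests` library, again assuming a server on port 8080 and a hypothetical model name:

```python
import requests

# POST a chat completion to the local OpenAI-compatible endpoint.
# The URL and model name are assumptions -- match your running server.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tinyllama",
        "messages": [{"role": "user", "content": "Say hello."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```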

### Speech-to-Text (STT)
- Local transcription with Whisper models (coming soon).
- Works entirely on device.

### Model Management
- Download and cache models locally.
- Start and stop model servers programmatically (see the CLI sketch after this list).
- Use the same model catalog and resolution as the CLI.
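
This overview does not show the SDK's own lifecycle API, so as an illustration of what it wraps, here is a subprocess sketch around the underlying CLI. `pull`, `serve`, and `stop` are real RamaLama subcommands, but the `--detach` and `--name` flags are assumptions to confirm against `ramalama serve --help`:

```python
import subprocess

MODEL = "ollama://tinyllama"  # any supported model reference works here

# Download and cache the model locally.
subprocess.run(["ramalama", "pull", MODEL], check=True)

# Serve the model in a container in the background. The --detach and
# --name flags are assumptions -- confirm with `ramalama serve --help`.
subprocess.run(
    ["ramalama", "serve", "--detach", "--name", "demo-server", MODEL],
    check=True,
)

# ... send inference requests to the local endpoint ...

# Stop the named server when done.
subprocess.run(["ramalama", "stop", "demo-server"], check=True)
```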

## System Requirements
| Requirement | Notes |
|---|---|
| RamaLama CLI | Installed and available on your PATH |
| Container manager | Docker or Podman |
| Local storage | Free disk space for downloaded models (often several GB each) |

