The RamaLama Python SDK wraps the RamaLama CLI to provision and run local models from your apps. Use it when you want local-first inference with the same container-based model provisioning as the CLI.

Overview

The Python SDK provides a local-first developer experience for running AI models on device. It wraps the RamaLama CLI to provision models in containers and exposes a simple API for inference in your apps. Core capabilities include:
  • LLM: local chat with OpenAI-compatible HTTP endpoints for direct requests (see the example after this list).
  • STT: speech-to-text with Whisper models running on device.
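For example, once a model is being served with the RamaLama CLI, a chat request can go straight to its local OpenAI-compatible endpoint. The sketch below assumes the default serve port of 8080 and the /v1/chat/completions path; adjust both to match how the model was actually served.

```python
# Minimal sketch: chat with a locally served model over its OpenAI-compatible
# HTTP endpoint. Assumes `ramalama serve <model>` is already running and that
# the server listens on port 8080 (port and path are assumptions; adjust them).
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "default",  # placeholder; local servers may ignore or require a name
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```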

Capabilities

Key Capabilities

  • Container-native model provisioning with the RamaLama CLI.
  • Flexible model sources (Hugging Face, Ollama, ModelScope, OCI registries, local files, URLs); see the pull sketch after this list.
  • Local-first inference to minimize latency and protect data.
  • Model lifecycle control (download, serve, stop) from code.
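To illustrate the model-source flexibility listed above, the same pull command accepts several transports via a prefix on the model reference. The references below are illustrative only, and the exact set of supported prefixes depends on your CLI version.

```python
# Sketch: pulling models from different sources by prefixing the reference with
# a transport. The specific model references here are examples, not recommendations.
import subprocess

models = [
    "ollama://smollm:135m",                              # Ollama registry
    "huggingface://TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # Hugging Face (example repo)
    "oci://quay.io/example/my-model:latest",             # OCI registry (example reference)
]

for ref in models:
    subprocess.run(["ramalama", "pull", ref], check=True)
```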

Core Philosophy

  • On-device first
  • Container-native by default
  • Privacy-focused
  • Developer-friendly APIs

Features

Language Models (LLM)

  • Local chat with a simple SDK interface.
  • OpenAI-compatible HTTP endpoint for direct requests.
  • Bring-your-own model sources through the RamaLama CLI.
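Because the served endpoint speaks the OpenAI chat-completions protocol, existing OpenAI client libraries can be pointed at it. The base URL below assumes a local server on port 8080 exposing a /v1 prefix; the API key is a placeholder since the local server does not need one.

```python
# Sketch: reuse the official openai Python client against the local endpoint.
# base_url (port 8080, /v1 prefix) is an assumption; match it to your serve command.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="default",  # placeholder model name; some local servers ignore it
    messages=[{"role": "user", "content": "Summarize what RamaLama does in one line."}],
)
print(reply.choices[0].message.content)
```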

Speech-to-Text (STT)

  • Local transcription with Whisper models.
  • Works entirely on device.
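The STT surface is not detailed on this page, so the following is only a rough sketch of what an on-device transcription call could look like. Every name in it (the ramalama_sdk import, SpeechToText, transcribe) is a hypothetical placeholder, not the SDK's documented API; check the SDK reference for the real interface.

```python
# Hypothetical sketch only: ramalama_sdk, SpeechToText, and transcribe are
# placeholder names for illustration and are not confirmed SDK API.
from ramalama_sdk import SpeechToText  # hypothetical import

stt = SpeechToText(model="whisper-base")  # hypothetical: a locally provisioned Whisper model
text = stt.transcribe("meeting.wav")      # hypothetical: returns the transcript as a string
print(text)
```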

Model Management

  • Download and cache models locally.
  • Start and stop model servers programmatically.
  • Use the same model catalog and resolution as the CLI.
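Since the SDK wraps the CLI, the same lifecycle can be driven from Python by shelling out to the ramalama commands directly; the sketch below shows roughly what the SDK automates. The --detach, --name, and --port flags are assumptions about the CLI and may vary between versions.

```python
# Sketch: pull a model, serve it in a detached container, then stop it.
# Flag names (--detach, --name, --port) are assumptions; confirm with `ramalama serve --help`.
import subprocess

MODEL = "ollama://smollm:135m"  # any reference the CLI can resolve

# Download and cache the model locally.
subprocess.run(["ramalama", "pull", MODEL], check=True)

# Start a model server in the background.
subprocess.run(
    ["ramalama", "serve", "--detach", "--name", "demo", "--port", "8080", MODEL],
    check=True,
)

# ... send requests to http://localhost:8080 while the server is up ...

# Stop the named server when finished.
subprocess.run(["ramalama", "stop", "demo"], check=True)
```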

System Requirements

| Requirement | Notes |
| --- | --- |
| RamaLama CLI | Installed and available on your PATH |
| Container manager | Docker or Podman |
| Local storage | Space for model downloads |
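A quick way to confirm these requirements is to check that the CLI and a container manager are discoverable on PATH before provisioning anything:

```python
# Sketch: verify the RamaLama CLI and a container manager (Podman or Docker)
# are installed and on PATH.
import shutil

assert shutil.which("ramalama"), "RamaLama CLI not found on PATH"
assert shutil.which("podman") or shutil.which("docker"), "no container manager found"
print("Environment looks ready for local model provisioning.")
```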

Next steps