Serving Models

timber serve starts an HTTP server that exposes a compiled model over a REST API.

Basic Usage

timber serve my-model

This starts the server on 0.0.0.0:11434 — the same default port as Ollama.

Options

timber serve my-model --host 127.0.0.1 --port 8080
Option    Default    Description
--host    0.0.0.0    Bind address
--port    11434      Bind port

Endpoints

POST /api/predict

Run inference. Send a JSON body with model and inputs:

curl http://localhost:11434/api/predict \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "inputs": [[1.0, 2.0, 3.0, 4.0, 5.0]]
  }'

Response:

{
  "model": "my-model",
  "outputs": [0.87],
  "n_samples": 1,
  "latency_us": 91.0,
  "done": true
}
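
The same request can be made from any HTTP client. As a minimal sketch, here is the equivalent call from Python using only the standard library (the model name and feature values are the placeholders from the curl example above):

import json
import urllib.request

# Same request body as the curl example above.
body = json.dumps({
    "model": "my-model",
    "inputs": [[1.0, 2.0, 3.0, 4.0, 5.0]],
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/predict",
    data=body,
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())

print(result["outputs"])  # e.g. [0.87]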

Batch Inference

Send multiple samples:

curl http://localhost:11434/api/predict \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "inputs": [
      [1.0, 2.0, 3.0, 4.0, 5.0],
      [5.0, 4.0, 3.0, 2.0, 1.0],
      [2.5, 3.5, 1.5, 4.5, 0.5]
    ]
  }'
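
The response keeps the single-sample shape, with (as one would expect) one value in outputs per input row, so n_samples would be 3 here. A short Python sketch, again standard library only, that sends the batch and pairs each row with its prediction:

import json
import urllib.request

samples = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [2.5, 3.5, 1.5, 4.5, 0.5],
]

req = urllib.request.Request(
    "http://localhost:11434/api/predict",
    data=json.dumps({"model": "my-model", "inputs": samples}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    outputs = json.loads(resp.read())["outputs"]

# One prediction per input row.
for row, pred in zip(samples, outputs):
    print(row, "->", pred)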

GET /api/models

List all loaded models:

curl http://localhost:11434/api/models

GET /api/health

Health check:

curl http://localhost:11434/api/health
# {"status": "ok", "version": "0.1.0"}

POST /api/generate

Alias for /api/predict — for Ollama client compatibility.

Architecture

The serving architecture separates concerns:

  • Python handles: HTTP parsing, JSON serialization, request validation, CORS headers
  • Compiled C handles: the actual inference computation via ctypes

This means Python is never in the inference hot path. The C function call takes ~2 µs; the total HTTP round-trip is ~91 µs.
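
As an illustration of that split, the sketch below shows roughly how Python can hand the numeric work to a compiled shared library through ctypes. The library name (libmy_model.so), the exported symbol (predict), and its signature are illustrative assumptions, not timber's actual interface:

import ctypes

# Load the compiled model; library name and symbol are assumptions for illustration.
lib = ctypes.CDLL("./libmy_model.so")

# Assumed C signature: double predict(const double *features, int n_features);
lib.predict.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_int]
lib.predict.restype = ctypes.c_double

def run_inference(features):
    # Marshal the Python list into a C array; the actual math runs in compiled code.
    arr = (ctypes.c_double * len(features))(*features)
    return lib.predict(arr, len(features))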

Error Handling

The server returns structured errors:

{"error": "model 'xyz' not loaded"}
{"error": "expected 30 features, got 10"}
{"error": "missing 'inputs' field"}

Production Deployment

The built-in server uses Python's http.server. For production at scale, put it behind a reverse proxy such as nginx:

upstream timber {
    server 127.0.0.1:11434;
}

server {
    listen 80;

    location /api/ {
        proxy_pass http://timber;
    }
}