Ollama Usage Guide
This guide covers how to interact with the Ollama LLM service running on aomi (192.168.1.23).
Service Architecture
┌─────────────┐          ┌──────────────────┐          ┌─────────────┐
│ Application │ ────────>│ Ollama Exporter  │ ────────>│   Ollama    │
│ (OpenCode,  │  :8000   │ (Prometheus)     │  :11434  │   Service   │
│  Emacs)     │          │ Metrics Proxy    │          │             │
└─────────────┘          └──────────────────┘          └─────────────┘
                                  │
                                  v
                          ┌────────────────┐
                          │   Prometheus   │
                          │   (sakhalin)   │
                          └────────────────┘
Endpoints
With Metrics (Recommended) - Port 8000
Use these URLs to have your requests tracked in Prometheus/Grafana:
- Native API: http://192.168.1.23:8000/api/generate
- Native Chat: http://192.168.1.23:8000/api/chat
- OpenAI-compatible: http://192.168.1.23:8000/v1/chat/completions
- OpenAI completions: http://192.168.1.23:8000/v1/completions
Direct Ollama (No Metrics) - Port 11434
Use these URLs to bypass metrics collection (about 25ms faster per request, no tracking):
- Native API: http://192.168.1.23:11434/api/generate
- Native Chat: http://192.168.1.23:11434/api/chat
- OpenAI-compatible: http://192.168.1.23:11434/v1/chat/completions
- OpenAI completions: http://192.168.1.23:11434/v1/completions
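The two endpoint lists above differ only in the port, so a client can switch between them with a single flag. A minimal sketch, assuming the host and ports listed in this guide; `ollama_url` and `ENDPOINTS` are hypothetical names introduced here for illustration, not part of Ollama or the exporter.

```python
# Hypothetical helper: the names below are illustrative only.
HOST = "192.168.1.23"   # aomi, as listed in this guide
METRICS_PORT = 8000     # exporter proxy (requests are tracked)
DIRECT_PORT = 11434     # Ollama itself (no tracking)

ENDPOINTS = {
    "generate": "/api/generate",
    "chat": "/api/chat",
    "openai_chat": "/v1/chat/completions",
    "openai_completions": "/v1/completions",
}


def ollama_url(endpoint: str, metrics: bool = True) -> str:
    """Return the full URL for an endpoint, going through the proxy by default."""
    port = METRICS_PORT if metrics else DIRECT_PORT
    return f"http://{HOST}:{port}{ENDPOINTS[endpoint]}"
```

Defaulting `metrics` to `True` mirrors the recommendation in this guide: callers have to opt out of tracking explicitly.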
VPN URLs (From Any Machine)
- With metrics: http://ollama.sbr.pm/ or http://llm.sbr.pm/ (via Traefik, port 443)
- Direct: Not exposed via VPN (local network only)
Available Models
# List all models
curl http://192.168.1.23:11434/api/tags
# List via OpenAI-compatible API
curl http://192.168.1.23:11434/v1/models
Current models:
- llama3.1:8b - Best for tool calling (OpenCode)
- mistral-nemo - Fast tool calling
- qwen2.5-coder:7b - Best coding performance
- codestral:latest - Large coding model (22B)
- deepseek-r1:7b - Reasoning
- phi4-reasoning:latest - 14B reasoning
- phi3.5:3.8b - Fastest, smallest
- qwen2.5vl:7b - Vision/multimodal
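The model list can drift as models are pulled or removed, so a script may prefer to read it from the server. A minimal sketch using only the standard library, assuming the `/api/tags` response is a JSON object with a `models` array whose entries carry a `name` field; the function names are hypothetical.

```python
import json
import urllib.request

TAGS_URL = "http://192.168.1.23:11434/api/tags"  # from this guide


def model_names(tags_response: dict) -> list[str]:
    """Extract the sorted model names from a parsed /api/tags response."""
    return sorted(m["name"] for m in tags_response.get("models", []))


def fetch_model_names(url: str = TAGS_URL, timeout: float = 10.0) -> list[str]:
    """Fetch /api/tags over HTTP and return the sorted model names."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return model_names(json.load(resp))
```

Keeping the parsing in `model_names` separate from the HTTP call makes it possible to check the logic against a saved response without reaching the server.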
Usage Examples
1. Native Ollama API (Simple)
Non-streaming request:
curl http://192.168.1.23:8000/api/generate \
-d '{
"model": "phi3.5:3.8b",
"prompt": "Why is the sky blue?",
"stream": false
}'
Streaming request:
curl http://192.168.1.23:8000/api/generate \
-d '{
"model": "phi3.5:3.8b",
"prompt": "Write a haiku about coding",
"stream": true
}'
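With "stream": true the native API replies with one JSON object per line; each carries a "response" fragment, and the last one has "done" set to true. A minimal sketch of reassembling the text from those lines, assuming that line-delimited format; `join_stream` is a hypothetical helper name.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the "response" fragments of a streamed /api/generate reply."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk carries statistics, not more text
    return "".join(parts)
```

The function accepts any iterable of byte lines, so it can be fed from `response.iter_lines()` in requests or from a captured curl output split on newlines.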
Chat format:
curl http://192.168.1.23:8000/api/chat \
-d '{
"model": "llama3.1:8b",
"messages": [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "How do I reverse a string in Python?"}
],
"stream": false
}'
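The chat endpoint is stateless: to continue a conversation, the client sends the whole message history again, including the previous assistant reply. A minimal sketch of that bookkeeping, assuming a non-streaming `/api/chat` response whose `message` field holds the assistant's `role` and `content`; `append_reply` is a hypothetical helper name.

```python
def append_reply(messages: list[dict], chat_response: dict) -> list[dict]:
    """Return a new history with the assistant message from /api/chat appended."""
    reply = chat_response["message"]
    # Copy only role and content so extra response fields are not sent back.
    return messages + [{"role": reply["role"], "content": reply["content"]}]


# Usage: extend the history, then add the next user turn before re-sending.
history = [{"role": "user", "content": "How do I reverse a string in Python?"}]
fake_response = {"message": {"role": "assistant", "content": "Use s[::-1]."}}
history = append_reply(history, fake_response)
history.append({"role": "user", "content": "And a list?"})
```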
2. OpenAI-Compatible API (For Compatibility)
Chat completions:
curl http://192.168.1.23:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms"}
],
"temperature": 0.7,
"max_tokens": 500
}'
With streaming:
curl http://192.168.1.23:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-coder:7b",
"messages": [
{"role": "user", "content": "Write a bubble sort in Rust"}
],
"stream": true
}'
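The OpenAI-compatible stream uses a different framing from the native one: each chunk arrives as a line starting with "data: ", and the stream ends with "data: [DONE]". A minimal sketch of extracting the text, assuming the usual OpenAI chunk layout (`choices[].delta.content`); `sse_deltas` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def sse_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style "data: ..." stream lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Because it is a generator, fragments can be printed as they arrive or joined into a single string once the stream ends.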
3. Direct Ollama (Bypass Metrics)
Use port 11434 instead of 8000 for direct access:
# Same as above, but skips the ~25ms proxy overhead
curl http://192.168.1.23:11434/api/generate \
-d '{
"model": "phi3.5:3.8b",
"prompt": "Quick test",
"stream": false
}'
4. Advanced Options
Control generation parameters:
curl http://192.168.1.23:8000/api/generate \
-d '{
"model": "llama3.1:8b",
"prompt": "Write a function to calculate fibonacci",
"stream": false,
"options": {
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"num_ctx": 8192,
"num_predict": 1024
}
}'
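When the same parameters are set from code, it helps to leave out any option the caller did not specify so that the model's defaults apply. A minimal sketch of building such a request body; `generate_payload` is a hypothetical helper name, and the option names are the ones used in the curl example above.

```python
def generate_payload(model: str, prompt: str, **options) -> dict:
    """Build a non-streaming /api/generate body, including only options that are set."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    # Drop options passed as None so the server-side defaults are used.
    opts = {key: value for key, value in options.items() if value is not None}
    if opts:
        payload["options"] = opts
    return payload


# Usage: only temperature and num_ctx end up in the request.
body = generate_payload(
    "llama3.1:8b",
    "Write a function to calculate fibonacci",
    temperature=0.7,
    num_ctx=8192,
    top_k=None,
)
```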
System prompt + context:
curl http://192.168.1.23:8000/api/generate \
-d '{
"model": "qwen2.5-coder:7b",
"prompt": "Add error handling to this function",
"system": "You are an expert code reviewer focusing on robustness and error handling. This is a production application handling financial transactions.",
"stream": false
}'
Programming Language Examples
Python
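import json

import requests


def ask_ollama(prompt, model="phi3.5:3.8b", stream=False):
    """Query Ollama with metrics tracking and return the response text.

    With stream=True, fragments are also printed as they arrive.
    """
    url = "http://192.168.1.23:8000/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": stream}
    # stream=stream keeps requests from buffering the whole body first;
    # the long timeout leaves room for model loading on the first request.
    response = requests.post(url, json=payload, stream=stream, timeout=300)
    response.raise_for_status()
    if not stream:
        return response.json()["response"]
    parts = []
    for line in response.iter_lines():
        if line:
            chunk = json.loads(line)
            text = chunk.get("response", "")
            print(text, end="", flush=True)
            parts.append(text)
    print()
    return "".join(parts)


# Usage
result = ask_ollama("What is Python?")
print(result)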
Bash/Shell Script
#!/bin/bash
# ollama-query.sh - Query Ollama from shell
MODEL="${1:-phi3.5:3.8b}"
PROMPT="${2:-Hello}"
# Build the JSON with jq so quotes and newlines in the prompt are escaped
jq -cn --arg model "$MODEL" --arg prompt "$PROMPT" \
  '{model: $model, prompt: $prompt, stream: false}' \
  | curl -s http://192.168.1.23:8000/api/generate -d @- \
  | jq -r '.response'
Emacs Lisp (see Emacs Configuration section below)
Proxy Overhead Benchmarks
Measured overhead of exporter proxy: ~25ms average
Test results (5 runs with phi3.5:3.8b):
Run 1: 17ms overhead
Run 2: 19ms overhead
Run 3: 36ms overhead
Run 4: 36ms overhead
Run 5: 15ms overhead
Average: 25ms (0.025s)
Impact on typical requests:
- Small model (2-5s): 0.5-1.25% overhead
- Medium model (30-60s): 0.04-0.08% overhead
- Large model (60-120s): 0.02-0.04% overhead
Conclusion: The overhead is negligible for all practical purposes.
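The figures above can be reproduced with a few lines of arithmetic. The sketch below recomputes the average from the five measured runs and the relative overhead for a given request duration; `overhead_percent` is a hypothetical helper name.

```python
runs_ms = [17, 19, 36, 36, 15]          # measured overhead per run, from this guide
mean_ms = sum(runs_ms) / len(runs_ms)   # 24.6, rounded to 25 in the text


def overhead_percent(request_seconds: float, overhead_ms: float = 25.0) -> float:
    """Proxy overhead as a percentage of the total request time."""
    return overhead_ms / (request_seconds * 1000.0) * 100.0


# Relative overhead for the shortest and longest durations quoted above.
short = overhead_percent(2)     # 1.25
long_ = overhead_percent(120)   # about 0.02
```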
When to Use Direct vs Metrics
Use Metrics Endpoint (Port 8000) - RECOMMENDED
- ✅ Default choice - overhead is negligible (~25ms)
- ✅ You want to track usage in Grafana
- ✅ You need to monitor performance
- ✅ Running production workloads
- ✅ Debugging slow responses
- ✅ Want to see token counts and costs
Use Direct Endpoint (Port 11434) - RARE CASES ONLY
- ✅ Running synthetic benchmarks where 25ms matters
- ✅ Troubleshooting the exporter itself
- ✅ Explicitly don’t want metrics for privacy/compliance
Monitoring
View Metrics
# Prometheus metrics endpoint
curl http://192.168.1.23:8000/metrics | grep ollama_
# Query Prometheus directly
curl -s "http://192.168.1.70:9001/api/v1/query?query=ollama_requests_total"
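The Prometheus query above returns JSON rather than plain numbers. A minimal sketch of totalling an instant query result, assuming the standard Prometheus HTTP API shape (`status`, then `data.result` entries whose `value` is a `[timestamp, "string"]` pair); `sum_vector` is a hypothetical helper name, and the `model` label in the sample is made up for illustration.

```python
def sum_vector(query_response: dict) -> float:
    """Sum the sample values of a Prometheus instant-vector query result."""
    if query_response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    results = query_response["data"]["result"]
    # Each sample looks like {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return sum((float(r["value"][1]) for r in results), 0.0)


# Illustrative response; the label name and the numbers are invented.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"model": "phi3.5:3.8b"}, "value": [1700000000.0, "12"]},
            {"metric": {"model": "llama3.1:8b"}, "value": [1700000000.0, "30"]},
        ],
    },
}
total = sum_vector(sample)
```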
Grafana Dashboards
- Navigate to Grafana: http://grafana.sbr.pm
- Look for “Ollama Metrics” and “Ollama Performance” dashboards
Troubleshooting
Check Service Status
# Ollama service
ssh aomi.sbr.pm "systemctl status ollama.service"
# Exporter service
ssh aomi.sbr.pm "systemctl status ollama-exporter.service"
# Exporter container
ssh aomi.sbr.pm "docker ps | grep ollama"
Test Connectivity
# Test direct Ollama
curl http://192.168.1.23:11434/api/version
# Test exporter
curl http://192.168.1.23:8000/api/version
# Test models endpoint
curl http://192.168.1.23:11434/api/tags
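The same checks can be scripted, for example to tell whether the proxy or Ollama itself is the one that is down. A minimal sketch using only the standard library, assuming `/api/version` returns a JSON object with a `version` field; `get_version` is a hypothetical helper name.

```python
import json
import urllib.error
import urllib.request


def get_version(base_url: str, timeout: float = 5.0):
    """Return the reported version string, or None if the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return None


def diagnose() -> str:
    """Compare the direct and proxied endpoints listed in this guide."""
    direct = get_version("http://192.168.1.23:11434")
    proxied = get_version("http://192.168.1.23:8000")
    if direct is None:
        return "Ollama is not reachable on port 11434"
    if proxied is None:
        return "Ollama is up, but the exporter on port 8000 is not responding"
    return f"Both endpoints respond (Ollama {direct})"
```

If Ollama answers directly but not through port 8000, the exporter service and its logs are the next place to look.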
View Logs
# Ollama service logs
ssh aomi.sbr.pm "journalctl -u ollama.service -f"
# Exporter logs
ssh aomi.sbr.pm "journalctl -u ollama-exporter.service -f"
# Docker exporter logs
ssh aomi.sbr.pm "docker logs -f ollama-exporter"
Performance Notes
- CPU-only: Models run on CPU (no GPU acceleration)
- First request: 30-90 seconds (model loading)
- Subsequent requests: 15-45 seconds (model cached)
- Model stays loaded: 10 minutes after last request (configurable)
- Fastest model: phi3.5:3.8b (~8 seconds response time)
- Best coding: qwen2.5-coder:7b (~30-60 seconds)
- Tool calling: llama3.1:8b, mistral-nemo
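Because the first request pays the model-loading cost, a client can warm a model up before it is needed. A minimal sketch, assuming the server honours the per-request `keep_alive` field and loads the model when `/api/generate` is called without a prompt, as described in the Ollama API documentation; `warmup_payload` is a hypothetical helper name.

```python
def warmup_payload(model: str, keep_alive: str = "10m") -> dict:
    """Body for /api/generate that loads a model without generating any text."""
    # No "prompt" key: the server only loads the model into memory.
    return {"model": model, "keep_alive": keep_alive}


# Usage: POST this body to /api/generate before the first real request.
body = warmup_payload("qwen2.5-coder:7b")
```

Given the 30-90 second load time noted above, any client timeout for the warm-up call should be set comfortably above 90 seconds.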
Security
- Network access: Local network (192.168.1.0/24) and VPN (10.100.0.0/24) only
- Authentication: None (trusted network)
- HTTPS: Available via Traefik (https://ollama.sbr.pm)
- Firewall: Port 11434 not exposed publicly, only 8000 for metrics
See Also
- Ollama API Documentation
- OpenAI API Compatibility
- Emacs configuration: See next section
- OpenCode configuration: ~/.config/opencode/opencode.json