Ollama Usage Guide

This guide covers how to interact with the Ollama LLM service running on aomi (192.168.1.23).

Service Architecture

┌─────────────┐          ┌──────────────────┐          ┌─────────────┐
│ Application │ ────────>│ Ollama Exporter  │ ────────>│   Ollama    │
│  (OpenCode, │   :8000  │  (Prometheus)    │  :11434  │   Service   │
│   Emacs)    │          │   Metrics Proxy  │          │             │
└─────────────┘          └──────────────────┘          └─────────────┘
                                  │
                                  v
                         ┌────────────────┐
                         │  Prometheus    │
                         │  (sakhalin)    │
                         └────────────────┘

Endpoints

With Metrics (Default) - Port 8000

Use these URLs to have your requests tracked in Prometheus/Grafana:

  • Native API: http://192.168.1.23:8000/api/generate
  • Native Chat: http://192.168.1.23:8000/api/chat
  • OpenAI-compatible: http://192.168.1.23:8000/v1/chat/completions
  • OpenAI completions: http://192.168.1.23:8000/v1/completions

Direct Ollama (No Metrics) - Port 11434

Use these URLs to bypass the exporter (saves roughly 25 ms per request, no tracking):

  • Native API: http://192.168.1.23:11434/api/generate
  • Native Chat: http://192.168.1.23:11434/api/chat
  • OpenAI-compatible: http://192.168.1.23:11434/v1/chat/completions
  • OpenAI completions: http://192.168.1.23:11434/v1/completions
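
Because the two endpoint families differ only in the port, a client can switch between them with a single flag. The following is a minimal Python sketch; the helper name `endpoint_url` and its `metrics` flag are illustrative and not part of Ollama or the exporter.

```python
HOST = "192.168.1.23"
METRICS_PORT = 8000   # exporter proxy: requests are tracked
DIRECT_PORT = 11434   # Ollama itself: no tracking


def endpoint_url(path: str, metrics: bool = True) -> str:
    """Return the full URL for an API path such as "/api/chat"."""
    port = METRICS_PORT if metrics else DIRECT_PORT
    return f"http://{HOST}:{port}/{path.lstrip('/')}"


print(endpoint_url("/api/generate"))
print(endpoint_url("/v1/chat/completions", metrics=False))
```

Defaulting `metrics` to `True` keeps the tracked endpoint as the path of least resistance.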

VPN URLs (From Any Machine)

  • With metrics: https://ollama.sbr.pm/ or https://llm.sbr.pm/ (via Traefik, port 443)
  • Direct: Not exposed via VPN (local network only)

Available Models

# List all models
curl http://192.168.1.23:11434/api/tags

# List via OpenAI-compatible API
curl http://192.168.1.23:11434/v1/models

Current models:

  • llama3.1:8b - Best for tool calling (OpenCode)
  • mistral-nemo - Fast tool calling
  • qwen2.5-coder:7b - Best coding performance
  • codestral:latest - Large coding model (22B)
  • deepseek-r1:7b - Reasoning
  • phi4-reasoning:latest - 14B reasoning
  • phi3.5:3.8b - Fastest, smallest
  • qwen2.5vl:7b - Vision/multimodal
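
The list above may drift as models are pulled or removed, so a script might prefer to read the names from `/api/tags`. A small sketch, assuming the usual response shape of a top-level `models` array whose entries carry a `name` field; the sample body is trimmed and illustrative, not captured from this server.

```python
import json


def model_names(tags_body: str) -> list[str]:
    """Extract sorted model names from an /api/tags response body."""
    data = json.loads(tags_body)
    return sorted(entry["name"] for entry in data.get("models", []))


# Trimmed, illustrative response body (real entries carry more fields).
sample = '{"models": [{"name": "phi3.5:3.8b"}, {"name": "llama3.1:8b"}]}'
print(model_names(sample))
```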

Usage Examples

1. Native Ollama API (Simple)

Non-streaming request:

curl http://192.168.1.23:8000/api/generate \
  -d '{
    "model": "phi3.5:3.8b",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'

Streaming request:

curl http://192.168.1.23:8000/api/generate \
  -d '{
    "model": "phi3.5:3.8b",
    "prompt": "Write a haiku about coding",
    "stream": true
  }'

Chat format:

curl http://192.168.1.23:8000/api/chat \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "How do I reverse a string in Python?"}
    ],
    "stream": false
  }'
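
The chat endpoint keeps no conversation state between calls, so a follow-up question has to resend the earlier turns together with the assistant's reply. A minimal sketch of that bookkeeping; `extend_history` is a hypothetical helper and the response dict is a trimmed, illustrative example of a non-streaming `/api/chat` reply.

```python
def extend_history(messages: list[dict], chat_response: dict, follow_up: str) -> list[dict]:
    """Return a new message list: old turns, assistant reply, next user turn."""
    reply = chat_response["message"]  # {"role": "assistant", "content": "..."}
    return messages + [reply, {"role": "user", "content": follow_up}]


history = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I reverse a string in Python?"},
]
# Trimmed, illustrative /api/chat response (non-streaming).
response = {"message": {"role": "assistant", "content": "Use s[::-1]."}, "done": True}

history = extend_history(history, response, "And in Rust?")
print([m["role"] for m in history])
```

The resulting list can be sent as the `messages` field of the next request.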

2. OpenAI-Compatible API (For Compatibility)

Chat completions:

curl http://192.168.1.23:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

With streaming:

curl http://192.168.1.23:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:7b",
    "messages": [
      {"role": "user", "content": "Write a bubble sort in Rust"}
    ],
    "stream": true
  }'
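
With `"stream": true`, the OpenAI-compatible endpoint answers with server-sent events: lines of the form `data: {json}` followed by a final `data: [DONE]`. The sketch below shows one way to reassemble the text from those lines; `collect_sse` is a hypothetical helper, and the sample lines are hand-written for illustration.

```python
import json


def collect_sse(lines) -> str:
    """Concatenate the delta content from OpenAI-style streaming lines."""
    parts = []
    for raw in lines:
        line = raw.decode() if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if choices:
            parts.append(choices[0].get("delta", {}).get("content") or "")
    return "".join(parts)


sample_lines = [
    b'data: {"choices": [{"delta": {"role": "assistant", "content": "fn "}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "bubble_sort"}}]}',
    b"data: [DONE]",
]
print(collect_sse(sample_lines))
```

The function accepts any iterable of `str` or `bytes` lines, so it can be fed from `response.iter_lines()` of a streaming `requests` call.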

3. Direct Ollama (Bypass Metrics)

Use port 11434 instead of 8000 for direct access:

# Same request as in section 1, minus the ~25 ms exporter overhead
curl http://192.168.1.23:11434/api/generate \
  -d '{
    "model": "phi3.5:3.8b",
    "prompt": "Quick test",
    "stream": false
  }'

4. Advanced Options

Control generation parameters:

curl http://192.168.1.23:8000/api/generate \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "Write a function to calculate fibonacci",
    "stream": false,
    "options": {
      "temperature": 0.7,
      "top_p": 0.9,
      "top_k": 40,
      "num_ctx": 8192,
      "num_predict": 1024
    }
  }'
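
Any field left out of `options` falls back to the model's default, so a client only needs to send the values it wants to override. A small sketch of a payload builder along those lines; `generate_payload` is a hypothetical name, and the option keys are the ones from the request above.

```python
def generate_payload(model: str, prompt: str, **options) -> dict:
    """Build an /api/generate body; options passed as None are omitted."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    chosen = {key: value for key, value in options.items() if value is not None}
    if chosen:
        payload["options"] = chosen
    return payload


print(generate_payload("llama3.1:8b", "Write a function to calculate fibonacci",
                       temperature=0.7, num_ctx=8192, top_k=None))
```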

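System prompt with background context:

# Note: the "context" field of /api/generate expects the token array
# returned by a previous response, not free text, so background
# information goes into the system prompt instead.
curl http://192.168.1.23:8000/api/generate \
  -d '{
    "model": "qwen2.5-coder:7b",
    "prompt": "Add error handling to this function",
    "system": "You are an expert code reviewer focusing on robustness and error handling. This is a production application handling financial transactions.",
    "stream": false
  }'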

Programming Language Examples

Python

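import json

import requests


def ask_ollama(prompt, model="phi3.5:3.8b", stream=False, timeout=300):
    """Query Ollama through the exporter (port 8000) so the request is tracked.

    Returns the full response text. With stream=True, tokens are also
    printed as they arrive.
    """
    url = "http://192.168.1.23:8000/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": stream
    }

    # stream=stream stops requests from buffering the whole body first;
    # the generous timeout covers cold starts on CPU.
    response = requests.post(url, json=payload, stream=stream, timeout=timeout)
    response.raise_for_status()

    if not stream:
        return response.json()["response"]

    # Streaming responses are newline-delimited JSON, one chunk per line.
    parts = []
    for line in response.iter_lines():
        if line:
            chunk = json.loads(line)
            parts.append(chunk.get("response", ""))
            print(parts[-1], end="", flush=True)
    print()
    return "".join(parts)

# Usage
result = ask_ollama("What is Python?")
print(result)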

Bash/Shell Script

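#!/bin/bash
# ollama-query.sh - Query Ollama from shell
# Usage: ollama-query.sh [MODEL] [PROMPT]

MODEL="${1:-phi3.5:3.8b}"
PROMPT="${2:-Hello}"

# Build the JSON with jq so quotes and newlines in the prompt are escaped.
jq -cn --arg model "$MODEL" --arg prompt "$PROMPT" \
  '{model: $model, prompt: $prompt, stream: false}' \
  | curl -s http://192.168.1.23:8000/api/generate -d @- \
  | jq -r '.response'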

Emacs Lisp (see Emacs Configuration section below)

Proxy Overhead Benchmarks

Measured overhead of exporter proxy: ~25ms average

Test results (5 runs with phi3.5:3.8b):

Run 1: 17ms overhead
Run 2: 19ms overhead
Run 3: 36ms overhead
Run 4: 36ms overhead
Run 5: 15ms overhead
Average: 25ms (0.025s)

Impact on typical requests:

  • Small model (2-5s): 0.5-1.25% overhead
  • Medium model (30-60s): 0.04-0.08% overhead
  • Large model (60-120s): 0.02-0.04% overhead

Conclusion: The overhead is negligible for all practical purposes.
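
The figures above follow from simple arithmetic, which the short Python sketch below reproduces: the mean of the five runs is 24.6 ms (rounded to 25 ms in the text), and each percentage is the 25 ms overhead divided by the request duration. The function name `overhead_pct` is illustrative.

```python
from statistics import mean

overheads_ms = [17, 19, 36, 36, 15]
avg_ms = mean(overheads_ms)  # 24.6, quoted as ~25 ms


def overhead_pct(request_seconds: float, overhead_ms: float = 25.0) -> float:
    """Overhead as a percentage of the total request time."""
    return overhead_ms / (request_seconds * 1000) * 100


ranges = {"small": (2, 5), "medium": (30, 60), "large": (60, 120)}
for label, (fast, slow) in ranges.items():
    print(f"{label}: {overhead_pct(slow):.2f}-{overhead_pct(fast):.2f}%")
```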

When to Use Direct vs Metrics

Use Metrics Endpoint (Port 8000) - DEFAULT

  • Default choice - overhead is negligible (~25ms)
  • ✅ You want to track usage in Grafana
  • ✅ You need to monitor performance
  • ✅ Running production workloads
  • ✅ Debugging slow responses
  • ✅ Want to see token counts and costs

Use Direct Endpoint (Port 11434) - RARE CASES ONLY

  • ✅ Running synthetic benchmarks where 25ms matters
  • ✅ Troubleshooting the exporter itself
  • ✅ Explicitly don’t want metrics for privacy/compliance

Monitoring

View Metrics

# Prometheus metrics endpoint
curl http://192.168.1.23:8000/metrics | grep ollama_

# Query Prometheus directly
curl -s "http://192.168.1.70:9001/api/v1/query?query=ollama_requests_total"
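
The Prometheus query API returns JSON in which every sample is a `[timestamp, "value"]` pair, with the value encoded as a string. A minimal sketch of totalling `ollama_requests_total` across series; the `model` label and the numbers in the sample body are assumptions made for illustration, not output from this setup.

```python
import json


def sum_instant_vector(body: str) -> float:
    """Sum the sample values of a Prometheus instant-query response."""
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise ValueError(f"query failed: {doc.get('error', 'unknown error')}")
    # Each result carries "value": [unix_timestamp, "number as string"].
    return sum(float(series["value"][1]) for series in doc["data"]["result"])


# Illustrative response body; label names and counts are made up.
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "ollama_requests_total", "model": "phi3.5:3.8b"},
         "value": [1770000000, "12"]},
        {"metric": {"__name__": "ollama_requests_total", "model": "llama3.1:8b"},
         "value": [1770000000, "30"]},
    ]},
})
print(sum_instant_vector(sample))
```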

Grafana Dashboards

  • Navigate to Grafana: http://grafana.sbr.pm
  • Look for “Ollama Metrics” and “Ollama Performance” dashboards

Troubleshooting

Check Service Status

# Ollama service
ssh aomi.sbr.pm "systemctl status ollama.service"

# Exporter service
ssh aomi.sbr.pm "systemctl status ollama-exporter.service"

# Exporter container
ssh aomi.sbr.pm "docker ps | grep ollama"

Test Connectivity

# Test direct Ollama
curl http://192.168.1.23:11434/api/version

# Test exporter
curl http://192.168.1.23:8000/api/version

# Test models endpoint
curl http://192.168.1.23:11434/api/tags
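
The same checks can be scripted so that one failing endpoint does not hide the state of the other. A sketch using only the standard library, assuming `/api/version` returns a JSON object with a `version` field; `check` and `http_get_json` are hypothetical names, and the `fetch` parameter exists so the logic can be exercised without network access.

```python
import json
import urllib.request

TARGETS = {
    "direct": "http://192.168.1.23:11434/api/version",
    "exporter": "http://192.168.1.23:8000/api/version",
}


def http_get_json(url: str) -> dict:
    """Fetch a URL and decode its JSON body (5 second timeout)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def check(fetch=http_get_json) -> dict:
    """Map each target name to its reported version, or to an error string."""
    results = {}
    for name, url in TARGETS.items():
        try:
            results[name] = fetch(url).get("version", "unknown")
        except Exception as exc:  # network errors, timeouts, bad JSON
            results[name] = f"error: {exc}"
    return results
```

Calling `check()` with no arguments queries the real endpoints. If `direct` succeeds while `exporter` fails, the problem is likely in the exporter rather than in Ollama.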

View Logs

# Ollama service logs
ssh aomi.sbr.pm "journalctl -u ollama.service -f"

# Exporter logs
ssh aomi.sbr.pm "journalctl -u ollama-exporter.service -f"

# Docker exporter logs
ssh aomi.sbr.pm "docker logs -f ollama-exporter"

Performance Notes

  • CPU-only: Models run on CPU (no GPU acceleration)
  • First request: 30-90 seconds (model loading)
  • Subsequent requests: 15-45 seconds (model cached)
  • Model stays loaded: 10 minutes after last request (configurable)
  • Fastest model: phi3.5:3.8b (~8 seconds response time)
  • Best coding: qwen2.5-coder:7b (~30-60 seconds)
  • Tool calling: llama3.1:8b, mistral-nemo
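
Since the first request pays the model-loading cost, it can help to load a model before it is needed. Ollama's API accepts a `keep_alive` field on `/api/generate`, and a request without a prompt only loads the model. The sketch below builds such a request body; `preload_payload` is a hypothetical helper, and it assumes the exporter forwards the request unchanged when sent to port 8000.

```python
def preload_payload(model: str, keep_alive: str = "10m") -> dict:
    """Body for /api/generate that loads `model` without generating text."""
    # No "prompt" key: Ollama loads the model and returns right away.
    return {"model": model, "keep_alive": keep_alive}


print(preload_payload("qwen2.5-coder:7b"))
```

Posting this body shortly before an interactive session should bring the first real request closer to the cached timings listed above.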

Security

  • Network access: Local network (192.168.1.0/24) and VPN (10.100.0.0/24) only
  • Authentication: None (trusted network)
  • HTTPS: Available via Traefik (https://ollama.sbr.pm)
  • Firewall: Port 11434 is not exposed beyond the local network; only port 8000 (the metrics proxy) is reachable over the VPN via Traefik

See Also