Cost Tracking & Usage Analytics

Every AI request is automatically tracked — tokens consumed, cost calculated, and success or failure recorded. No instrumentation required. The data is available immediately via API and visualized in the frontend analytics dashboard.

Requires Database Backend

Cost tracking is only active when the AI service is configured with a database backend:

aegis init my-app --services "ai[sqlite]"
aegis init my-app --services "ai[postgres]"

With the default in-memory backend, the /ai/usage/stats endpoint is not available and usage is not persisted.

What You Get

Automatic tracking - every chat and stream_chat call records tokens and cost
Catalog-based pricing - costs calculated from the LLM Catalog's versioned price entries
Per-user breakdown - filter usage by user_id to see individual consumption
Model breakdown - compare spend and volume across models and providers
Real-time dashboard - frontend analytics tab with hero stats, pie chart, and activity table
Illiana context - usage stats injected into Illiana's system prompt so she can answer cost questions

How It Works

Every call to chat() or stream_chat() follows this flow:

Request
  │
  ├─ AI provider call (OpenAI, Groq, Anthropic, etc.)
  │
  ├─ Extract token counts from response
  │     PydanticAI: result.usage.request_tokens / response_tokens
  │     LangChain:  response_metadata['token_usage']
  │
  ├─ Look up model price from LLM Catalog
  │     SELECT ... ORDER BY effective_date DESC LIMIT 1
  │
  ├─ Calculate cost
  │     (input_tokens × input_cost) + (output_tokens × output_cost)
  │
  └─ Write LLMUsage record to database
        Non-blocking: a failure here never fails the request
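
The last two steps of the flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in service.py: `calculate_cost`, `track_usage`, and `failing_write` are hypothetical names, and the prices are the illustrative gpt-4o figures ($2.50 and $10.00 per 1M tokens) converted to per-token values.

```python
import logging

logger = logging.getLogger(__name__)


def calculate_cost(input_tokens: int, output_tokens: int,
                   input_cost: float, output_cost: float) -> float:
    """(input_tokens x input_cost) + (output_tokens x output_cost), per-token prices."""
    return input_tokens * input_cost + output_tokens * output_cost


def track_usage(record: dict, write_record) -> bool:
    """Try to persist a usage record; log and swallow any failure."""
    try:
        write_record(record)
        return True
    except Exception:
        # The AI response has already been produced, so the error is only logged.
        logger.exception("Failed to write usage record")
        return False


def failing_write(record: dict) -> None:
    """Hypothetical writer that simulates a database outage."""
    raise RuntimeError("database unavailable")


# Illustrative per-token prices: $2.50 and $10.00 per 1M tokens.
cost = calculate_cost(1_000, 500, 2.50 / 1_000_000, 10.00 / 1_000_000)
saved = track_usage({"model_id": "gpt-4o", "total_cost": cost}, failing_write)
print(f"cost=${cost:.4f} saved={saved}")  # cost=$0.0075 saved=False
```

The caller still gets its cost (and, in the real service, its AI response) even though the write raised.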

Key Behaviors

Vendor prefix stripping - openai/gpt-4o is recorded as gpt-4o for consistency
Unknown models - if a model isn't in the catalog, usage is still recorded with total_cost = 0.0
Non-blocking - if the database write fails, the error is logged but the AI response returns normally
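
A small sketch of the first two behaviors, assuming the prefix is everything up to the first "/" (the actual stripping rule is not shown on this page). `CATALOG`, `strip_vendor_prefix`, and `build_usage_row` are hypothetical names used only for illustration.

```python
# Hypothetical per-token prices keyed by the stored (prefix-free) model id.
CATALOG = {"gpt-4o": (2.50e-6, 10.00e-6)}


def strip_vendor_prefix(model: str) -> str:
    """Drop a leading "vendor/" segment if present (assumed rule)."""
    return model.split("/", 1)[-1]


def build_usage_row(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Build the row that would be stored for one request."""
    model_id = strip_vendor_prefix(model)
    prices = CATALOG.get(model_id)
    if prices is None:
        total_cost = 0.0  # unknown model: keep the row, record zero cost
    else:
        total_cost = input_tokens * prices[0] + output_tokens * prices[1]
    return {
        "model_id": model_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_cost": total_cost,
    }


known = build_usage_row("openai/gpt-4o", 1_000, 500)
unknown = build_usage_row("my-local-model", 1_000, 500)
print(known["model_id"], unknown["total_cost"])  # gpt-4o 0.0
```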

API Endpoint

GET `/ai/usage/stats`

Returns aggregated usage statistics.

Query Parameters:

Parameter	Type	Required	Default	Description
`user_id`	string	No	all users	Filter to a specific user
`start_time`	datetime	No	all time	ISO 8601 lower bound
`end_time`	datetime	No	now	ISO 8601 upper bound
`recent_limit`	integer	No	10	Number of recent activity records

All aggregations are performed at the SQL level (GROUP BY, SUM, COUNT, AVG).
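
To illustrate what SQL-level aggregation means here, the sketch below runs comparable queries against an in-memory SQLite table. The table is a cut-down stand-in for llm_usage, the rows are made-up sample data, and the queries are an assumption about the general shape, not the service's actual SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE llm_usage (
           model_id TEXT, input_tokens INTEGER, output_tokens INTEGER,
           total_cost REAL, success INTEGER)"""
)
# Made-up sample rows for illustration only.
conn.executemany(
    "INSERT INTO llm_usage VALUES (?, ?, ?, ?, ?)",
    [
        ("gpt-4o", 1500, 800, 0.02, 1),
        ("gpt-4o", 1000, 500, 0.01, 1),
        ("my-local-model", 400, 200, 0.0, 0),
    ],
)

# Overall totals: the database does the counting and summing.
totals = conn.execute(
    """SELECT COUNT(*), SUM(input_tokens + output_tokens),
              SUM(total_cost), AVG(success) * 100.0
       FROM llm_usage"""
).fetchone()

# Per-model breakdown via GROUP BY.
per_model = conn.execute(
    """SELECT model_id, COUNT(*) AS requests,
              SUM(input_tokens + output_tokens) AS tokens,
              SUM(total_cost) AS cost
       FROM llm_usage
       GROUP BY model_id
       ORDER BY requests DESC"""
).fetchall()

print(totals)
print(per_model)
```

Only the aggregated rows leave the database, so the amount of data sent to the application does not grow with the size of the usage table.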

Response:

{
  "total_tokens": 45230,
  "input_tokens": 32100,
  "output_tokens": 13130,
  "total_cost": 0.47,
  "total_requests": 23,
  "success_rate": 95.6,
  "models": [
    {
      "model_id": "gpt-4o",
      "vendor": "OpenAI",
      "requests": 15,
      "tokens": 30000,
      "cost": 0.35,
      "percentage": 65.2
    }
  ],
  "recent_activity": [
    {
      "timestamp": "2024-01-15T10:30:00Z",
      "model": "gpt-4o",
      "input_tokens": 1500,
      "output_tokens": 800,
      "cost": 0.02,
      "success": true,
      "action": "chat"
    }
  ]
}
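
As a small example of working with this payload, the sketch below derives two averages and the most expensive model from a trimmed copy of the sample response. `summarize` is a hypothetical helper, not part of the API.

```python
def summarize(stats: dict) -> dict:
    """Derive simple per-request averages from a /ai/usage/stats payload."""
    requests = stats["total_requests"]
    top = max(stats["models"], key=lambda m: m["cost"], default=None)
    return {
        "avg_cost_per_request": stats["total_cost"] / requests if requests else 0.0,
        "avg_tokens_per_request": stats["total_tokens"] / requests if requests else 0.0,
        "top_model_by_cost": top["model_id"] if top else None,
    }


# Trimmed copy of the sample response above.
sample = {
    "total_tokens": 45230,
    "total_cost": 0.47,
    "total_requests": 23,
    "models": [{"model_id": "gpt-4o", "requests": 15, "tokens": 30000, "cost": 0.35}],
}

summary = summarize(sample)
print(summary["top_model_by_cost"])  # gpt-4o
print(f"{summary['avg_cost_per_request']:.4f}")  # 0.0204
```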

Examples:

cURL:

# All-time totals
curl http://localhost:8000/ai/usage/stats | jq

# Filter by user
curl "http://localhost:8000/ai/usage/stats?user_id=alice" | jq

# Last 7 days
curl "http://localhost:8000/ai/usage/stats?start_time=2024-01-08T00:00:00Z" | jq

# Show last 25 requests
curl "http://localhost:8000/ai/usage/stats?recent_limit=25" | jq

Python:

import httpx
from datetime import datetime, timedelta, timezone

# All-time stats
response = httpx.get("http://localhost:8000/ai/usage/stats")
stats = response.json()
print(f"Total cost: ${stats['total_cost']:.4f}")
print(f"Total requests: {stats['total_requests']}")
print(f"Success rate: {stats['success_rate']:.1f}%")

# Per-model breakdown
for model in stats["models"]:
    print(f"  {model['model_id']}: {model['requests']} reqs, "
          f"${model['cost']:.4f} ({model['percentage']:.1f}%)")

# Filter to last 24 hours for a specific user
now = datetime.now(timezone.utc)
yesterday = now - timedelta(days=1)

response = httpx.get(
    "http://localhost:8000/ai/usage/stats",
    params={
        "user_id": "alice",
        "start_time": yesterday.isoformat(),
    },
)

Analytics Dashboard

The frontend includes an analytics tab (ai_analytics_tab.py) with real-time usage visualization.

Hero stats cards:

Card	Description
Total Tokens	Sum of all input + output tokens
Total Cost	Cumulative spend across all models
Success Rate	Percentage of requests completed without error
Total Requests	Count of all tracked AI calls

Model usage pie chart - visual breakdown of request volume per model.

Recent activity table - last N requests with timestamp, model, token counts, cost, success flag, and action type.

Cost Calculation

Location: app/services/ai/service.py

# Simplified - actual implementation in service.py
async def calculate_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    # Look up the newest catalog price for this model
    # (ORDER BY effective_date DESC LIMIT 1)
    price = get_latest_price(model_id)
    if not price:
        # Unknown model: usage is still recorded, at zero cost
        return 0.0
    return (
        input_tokens * price.input_cost_per_token
        + output_tokens * price.output_cost_per_token
    )

Price Versioning

LLMPrice rows include an effective_date. When a provider updates pricing, a new row is added rather than overwriting the old one. Historical records retain the cost that was accurate when written; only new requests pick up the new price.

LLMPrice rows for gpt-4o:
  effective_date=2024-01-01  input=$2.50/1M  output=$10.00/1M
  effective_date=2024-05-01  input=$2.50/1M  output=$10.00/1M  ← current
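
The "newest row wins" lookup can be sketched with an in-memory SQLite table holding the two rows above. The table name `llm_price` and its column layout are assumptions made for this example; only the ORDER BY effective_date DESC LIMIT 1 pattern comes from this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE llm_price (
           model_id TEXT, effective_date TEXT,
           input_cost_per_token REAL, output_cost_per_token REAL)"""
)
# The two gpt-4o rows listed above, converted from $/1M tokens to $/token.
conn.executemany(
    "INSERT INTO llm_price VALUES (?, ?, ?, ?)",
    [
        ("gpt-4o", "2024-01-01", 2.50e-6, 10.00e-6),
        ("gpt-4o", "2024-05-01", 2.50e-6, 10.00e-6),
    ],
)


def get_latest_price(model_id: str):
    """Return the newest price row for a model, or None if it has no rows."""
    return conn.execute(
        "SELECT effective_date, input_cost_per_token, output_cost_per_token "
        "FROM llm_price WHERE model_id = ? "
        "ORDER BY effective_date DESC LIMIT 1",
        (model_id,),
    ).fetchone()


current = get_latest_price("gpt-4o")
print(current[0])  # 2024-05-01
print(get_latest_price("not-in-catalog"))  # None
```

ISO 8601 date strings sort chronologically as text, which is why a plain ORDER BY works here. Older rows stay in the table, so the price history remains queryable.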

LLMPrice also supports a cache_input_cost_per_token field for providers that offer prompt caching.

Database Model

Table: llm_usage

Location: app/services/ai/models/llm/llm_usage.py

Column	Type	Notes
`model_id`	string, indexed	Without vendor prefix (`gpt-4o`, not `openai/gpt-4o`)
`user_id`	string, indexed, nullable	Null for unauthenticated requests
`timestamp`	datetime, indexed	Auto-set to UTC
`input_tokens`	integer	>= 0
`output_tokens`	integer	>= 0
`total_cost`	float	Calculated at record time from catalog prices
`success`	boolean	Default `True`; `False` on provider errors
`error_message`	string, nullable	Populated on failure
`action`	string, indexed	`"chat"` or `"stream_chat"`

A compound index on (timestamp, model_id) keeps time-range + model aggregation queries fast.
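
For orientation, the columns and indexes above map onto roughly the following SQLite schema. This is a sketch, not the real model in llm_usage.py: the `id` primary key, the index names, and the exact SQL types are assumptions.

```python
import sqlite3

DDL = """
CREATE TABLE llm_usage (
    id            INTEGER PRIMARY KEY,                  -- assumed surrogate key
    model_id      TEXT    NOT NULL,                     -- plain string, no FOREIGN KEY
    user_id       TEXT,                                 -- NULL when unauthenticated
    timestamp     TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,  -- UTC in SQLite
    input_tokens  INTEGER NOT NULL CHECK (input_tokens >= 0),
    output_tokens INTEGER NOT NULL CHECK (output_tokens >= 0),
    total_cost    REAL    NOT NULL,
    success       INTEGER NOT NULL DEFAULT 1,
    error_message TEXT,
    action        TEXT    NOT NULL
);
CREATE INDEX ix_llm_usage_model_id  ON llm_usage (model_id);
CREATE INDEX ix_llm_usage_user_id   ON llm_usage (user_id);
CREATE INDEX ix_llm_usage_timestamp ON llm_usage (timestamp);
CREATE INDEX ix_llm_usage_action    ON llm_usage (action);
CREATE INDEX ix_llm_usage_ts_model  ON llm_usage (timestamp, model_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# Columns with defaults (user_id, timestamp, success, error_message) can be omitted.
conn.execute(
    "INSERT INTO llm_usage (model_id, input_tokens, output_tokens, total_cost, action) "
    "VALUES (?, ?, ?, ?, ?)",
    ("gpt-4o", 1500, 800, 0.02, "chat"),
)
row = conn.execute("SELECT user_id, success, timestamp FROM llm_usage").fetchone()
print(row[0], row[1])  # None 1
```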

Decoupled from Foreign Key

model_id is a plain string, not a foreign key to the LLM Catalog. Usage records survive catalog resets, model renames, and schema migrations without orphaned rows.

Illiana Context

Usage statistics are automatically included in Illiana's system prompt via UsageContext (app/services/ai/usage_context.py).

This means Illiana can answer questions like:

"How much have I spent this month?"
"Which model am I using most?"
"What's my success rate?"

format_for_prompt(compact=True) produces a condensed version for smaller models, such as those run locally through Ollama, to keep the system prompt short.
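
The idea behind the compact flag can be illustrated with a stand-in class. `UsageSnapshot`, its fields, and both output formats below are hypothetical; the real UsageContext in usage_context.py may expose different fields and wording.

```python
from dataclasses import dataclass


@dataclass
class UsageSnapshot:
    """Hypothetical stand-in for UsageContext, for illustration only."""

    total_requests: int
    total_tokens: int
    total_cost: float
    success_rate: float
    top_model: str

    def format_for_prompt(self, compact: bool = False) -> str:
        if compact:
            # One short line, cheap to include in a small model's context.
            return (
                f"Usage: {self.total_requests} reqs, {self.total_tokens} tokens, "
                f"${self.total_cost:.2f}, {self.success_rate:.1f}% ok, "
                f"top model {self.top_model}"
            )
        # Fuller, labelled form for models with more context to spare.
        return "\n".join(
            [
                "AI usage statistics:",
                f"- Total requests: {self.total_requests}",
                f"- Total tokens: {self.total_tokens}",
                f"- Total cost: ${self.total_cost:.2f}",
                f"- Success rate: {self.success_rate:.1f}%",
                f"- Most used model: {self.top_model}",
            ]
        )


snapshot = UsageSnapshot(23, 45230, 0.47, 95.6, "gpt-4o")
compact_text = snapshot.format_for_prompt(compact=True)
full_text = snapshot.format_for_prompt()
print(compact_text)
```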

Ollama Zero-Cost

Illiana knows that Ollama models report $0.00 cost by design — local inference has no per-token billing. She won't flag zero cost as an anomaly.

Next Steps:

LLM Catalog - Model registry with pricing data
AI Service Overview - Getting started
API Reference - All REST endpoints
CLI Commands - Command-line interface
