Overseer Integration Guide
This guide shows how to add health monitoring to custom components and services in your Aegis Stack application.
Health Check Registration
Overseer uses a registry pattern to collect health information from all components and services. During application startup, components and services register their health check functions with the central health registry.
Component Health Checks
Components are infrastructure pieces like backends, databases, workers, and schedulers.
Registration Function:
from app.services.system import register_health_check
register_health_check(name: str, check_fn: Callable[[], Awaitable[ComponentStatus]])
Example: Backend Component
# app/components/backend/health.py
from app.services.system import register_health_check
from app.services.system.models import ComponentStatus, ComponentStatusType
async def check_backend_health() -> ComponentStatus:
"""Check backend component health."""
# Collect system metrics
cpu_percent = psutil.cpu_percent(interval=0.1)
memory = psutil.virtual_memory()
# Determine status based on thresholds
status = ComponentStatusType.HEALTHY
if cpu_percent > 90:
status = ComponentStatusType.WARNING
return ComponentStatus(
name="backend",
status=status,
message="Backend API operational",
metadata={
"cpu_percent": cpu_percent,
"memory_percent": memory.percent,
"disk_percent": psutil.disk_usage('/').percent,
}
)
# Register during component initialization
register_health_check("backend", check_backend_health)
Service Health Checks
Services are business capabilities like authentication, AI integrations, and communications.
Registration Function:
from app.services.system import register_service_health_check
register_service_health_check(name: str, check_fn: Callable[[], Awaitable[ComponentStatus]])
Example: Auth Service
# app/services/auth/health.py
from app.services.system import register_service_health_check
from app.services.system.models import ComponentStatus, ComponentStatusType
from app.core.config import settings
async def check_auth_service_health() -> ComponentStatus:
"""Check auth service health including JWT configuration."""
jwt_errors = []
# Validate JWT configuration
if not settings.SECRET_KEY or len(settings.SECRET_KEY) < 32:
jwt_errors.append("SECRET_KEY misconfigured")
# Check database connectivity
database_available = True
try:
from sqlalchemy import text
from app.core.db import db_session
with db_session() as session:
session.execute(text("SELECT 1"))
except Exception:
database_available = False
jwt_errors.append("Database unavailable")
# Determine status
status = ComponentStatusType.HEALTHY
if jwt_errors:
status = ComponentStatusType.UNHEALTHY if not database_available else ComponentStatusType.WARNING
return ComponentStatus(
name="auth",
status=status,
message="Auth service ready" if not jwt_errors else f"Issues: {'; '.join(jwt_errors)}",
metadata={
"service_type": "auth",
"jwt_algorithm": getattr(settings, "ALGORITHM", "HS256"),
"database_available": database_available,
}
)
register_service_health_check("auth", check_auth_service_health)
ComponentStatus Model
The ComponentStatus model defines the structure for health check results.
ComponentStatus(
name: str, # Component/service name
status: ComponentStatusType, # HEALTHY | INFO | WARNING | UNHEALTHY
message: str, # Human-readable status message
response_time_ms: float | None = None, # Optional response time
metadata: dict[str, Any] = {}, # Custom metrics for dashboard
children: list[ComponentStatus] = [], # Nested components (optional)
)
Custom Metadata for Dashboard Cards
The metadata dictionary passes custom data to dashboard cards.
Example Patterns:
# System metrics
metadata = {
"cpu_percent": 12.5,
"memory_percent": 45.2,
"disk_percent": 60.0,
}
# Service configuration
metadata = {
"service_type": "auth",
"user_count": 42,
"jwt_algorithm": "HS256",
"security_level": "standard",
}
# Queue statistics
metadata = {
"jobs_queued": 5,
"jobs_active": 2,
"jobs_failed": 3,
"failure_rate": 0.02,
}
Status Hierarchy and Propagation
Overseer uses a four-tier status system that propagates from child components to parents.
Status Priority (highest to lowest):
- UNHEALTHY - Any unhealthy child makes parent unhealthy
- WARNING - Any warning child makes parent warning (if no unhealthy)
- INFO - Any info child makes parent info (if no unhealthy/warning)
- HEALTHY - All children healthy makes parent healthy
Example with Children:
# Worker component with multiple queues
async def check_worker_health() -> ComponentStatus:
"""Check worker with per-queue status."""
queue_statuses = [
await check_queue_health("system"),
await check_queue_health("notifications"),
]
# Propagate worst status from queues
from app.services.system.health import propagate_status
overall_status = propagate_status([q.status for q in queue_statuses])
return ComponentStatus(
name="worker",
status=overall_status,
message=f"Worker with {len(queue_statuses)} queues",
children=queue_statuses,
metadata={"total_queues": len(queue_statuses)},
)
Health Check Best Practices
1. Keep Checks Fast
Health checks run every 30 seconds via dashboard polling. Avoid expensive operations.
# ❌ BAD - Slow query
async def slow_health_check() -> ComponentStatus:
users = session.exec(select(User)).all() # Loads ALL users
return ComponentStatus(...)
# ✅ GOOD - Limited query
async def fast_health_check() -> ComponentStatus:
statement = select(User).limit(101) # Only check if >100 users
users = list(session.exec(statement).all())
user_count_display = "100+" if len(users) > 100 else str(len(users))
return ComponentStatus(metadata={"user_count_display": user_count_display})
2. Use Caching for Expensive Metrics
System metrics (CPU, memory, disk) can be cached briefly to improve performance.
from datetime import UTC, datetime, timedelta
_metrics_cache: dict[str, tuple[dict, datetime]] = {}
CACHE_TTL = timedelta(seconds=5)
async def get_system_metrics() -> dict:
"""Get system metrics with 5-second cache."""
now = datetime.now(UTC)
if "system" in _metrics_cache:
cached_data, cached_time = _metrics_cache["system"]
if now - cached_time < CACHE_TTL:
return cached_data
# Expensive metrics collection
metrics = {
"cpu_percent": psutil.cpu_percent(interval=0.1),
"memory_percent": psutil.virtual_memory().percent,
}
_metrics_cache["system"] = (metrics, now)
return metrics
3. Handle Errors Gracefully
Always return a ComponentStatus, even on failure.
async def check_external_api_health() -> ComponentStatus:
"""Check external API dependency."""
try:
response = await http_client.get("https://api.example.com/health")
response.raise_for_status()
return ComponentStatus(
name="external_api",
status=ComponentStatusType.HEALTHY,
message="External API reachable",
)
except Exception as e:
# Don't raise - return unhealthy status
return ComponentStatus(
name="external_api",
status=ComponentStatusType.UNHEALTHY,
message=f"External API unreachable: {str(e)}",
metadata={"error": str(e)},
)
4. Use Descriptive Messages
Messages appear in both the dashboard and CLI health output.
# ❌ BAD - Vague message
message = "Worker error"
# ✅ GOOD - Specific message
message = f"Worker queue 'system' has 50 failed jobs (failure rate: 15%)"
5. Include Actionable Metadata
Metadata powers dashboard visualizations and debugging.
# ✅ GOOD - Rich metadata for debugging
metadata = {
"queue_name": "system",
"jobs_queued": 50,
"jobs_active": 0, # ⚠️ No workers processing
"jobs_failed": 15,
"failure_rate": 0.15,
"redis_connected": False, # 🔴 Root cause
}
Registration Timing
Health checks must be registered during application startup, before the health endpoint is accessed.
Typical Registration Points:
- Component Initialization:
app/components/{component}/__init__.py - Service Initialization:
app/services/{service}/__init__.py - Backend Startup Hook:
app/components/backend/startup/component_health.py
Conditional Registration Pattern:
# Register backend (always included)
register_health_check("backend", check_backend_health)
# Conditionally register optional components
try:
from app.components.worker.health import check_worker_health
register_health_check("worker", check_worker_health)
except ImportError:
pass # Worker component not included
Next Steps
- Overseer Overview - Learn about Overseer's architecture and purpose
- The Overseer Story - Evolution and vision
- CLI Reference - Health command documentation (search for "health")