What does Operations contain?

Runbooks · Incident playbooks · How-to guides · Deployment procedures · Rollback protocols · Observability recipes · Escalation & alerting


Humans vs AI Agents

Human

Guides responders in crisis situations, captures lessons from resolved incidents, and onboards newcomers to routine operations.

AI Agent

Automates repeatable actions, executes playbooks autonomously, generates runbooks from other registers, and proposes updates after each incident.

Asymmetry: guidance vs. automation.


Consumption modes

RAG

Searches past incidents and retrieves operational context on demand.

Skills / MCP

Repeatable runbooks and playbooks encapsulated as executable MCP Skills.
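
One way to picture the two modes is a small dispatcher keyed on an artifact's `consumption-mode` field. This is only a sketch under stated assumptions: the `Artifact` class, the `rag` mode value, the example paths, and the in-memory index and registry are hypothetical stand-ins, not part of any MCP or RAG library, and the keyword match replaces the embedding search a real RAG setup would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Artifact:
    path: str
    consumption_mode: str   # "skill", or "rag" (assumed value)
    body: str

# Hypothetical in-memory stores standing in for a vector index and a skill registry.
rag_index: list[Artifact] = []
skill_registry: dict[str, Callable[..., dict]] = {}

def register(artifact: Artifact, skill: Optional[Callable[..., dict]] = None) -> str:
    """Route an artifact to the store that matches its consumption mode."""
    if artifact.consumption_mode == "skill":
        if skill is None:
            raise ValueError(f"{artifact.path}: skill mode needs a callable")
        skill_registry[artifact.path] = skill
        return "skill"
    if artifact.consumption_mode == "rag":
        rag_index.append(artifact)
        return "rag"
    raise ValueError(f"{artifact.path}: unknown mode {artifact.consumption_mode!r}")

def search(query: str) -> list[str]:
    """Naive keyword match over indexed artifacts (not a real retriever)."""
    words = query.lower().split()
    return [a.path for a in rag_index if any(w in a.body.lower() for w in words)]
```

Retrieved artifacts are read as context, while registered skills are called with parameters; the split mirrors the guidance vs. automation asymmetry described earlier.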


Runbooks as MCP Skills

Runbooks are not just documented: they are encapsulated as callable tools. The agent does not read a runbook linearly — it invokes it with parameters and gets back a structured result.

skills/catalog_indexation_lag.py Python (MCP)
import time
from dataclasses import dataclass

from mcp.server.fastmcp import FastMCP

# get_consumer_lag, scale_consumer_group and wait_for_resorption are platform
# helpers assumed to be defined elsewhere in the team's Kafka tooling.

mcp = FastMCP("operations-runbooks")  # server name is illustrative

@dataclass
class RunbookResult:
    status: str          # "resolved" | "escalate"
    lag_before: int
    lag_after: int
    actions_taken: list[str]
    duration_seconds: int

@mcp.tool()
def catalog_indexation_lag(
    partition: str,
    consumer_group: str = "catalog-indexer",
) -> RunbookResult:
    """
    Runbook: Kafka consumer lag on the catalog indexation partition.
    Diagnoses the lag, scales the consumer group, then waits for the lag to drain.
    """
    started = time.monotonic()
    lag_before = get_consumer_lag(partition, consumer_group)
    actions = [f"Detected lag: {lag_before} messages"]

    if lag_before > 5000:
        scale_consumer_group(consumer_group, replicas=6)
        actions.append("Scaled consumer group to 6 instances")
        # Wait up to 10 minutes, matching the escalation window in the runbook.
        lag_after = wait_for_resorption(partition, consumer_group, timeout=600)
        actions.append(f"Lag after scaling: {lag_after} messages remaining")
        return RunbookResult(
            status="resolved" if lag_after < 100 else "escalate",
            lag_before=lag_before,
            lag_after=lag_after,
            actions_taken=actions,
            duration_seconds=int(time.monotonic() - started),
        )
    # Lag is at or below the alert threshold: nothing to do.
    return RunbookResult(
        status="resolved", lag_before=lag_before, lag_after=lag_before,
        actions_taken=["Lag within threshold: no action taken"], duration_seconds=0,
    )
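
Because the skill returns a structured result rather than free text, the calling agent can branch on it directly. The sketch below shows one possible way to do that; `next_step` and its action strings are hypothetical, and `RunbookResult` is redeclared only so the snippet runs on its own.

```python
from dataclasses import dataclass

@dataclass
class RunbookResult:  # same fields as the skill's return type above
    status: str
    lag_before: int
    lag_after: int
    actions_taken: list[str]
    duration_seconds: int

def next_step(result: RunbookResult) -> str:
    """Turn a structured runbook result into the agent's next action."""
    if result.status == "escalate":
        # Hypothetical action label; the runbook itself names PagerDuty platform-oncall.
        return (
            f"page platform-oncall: lag still {result.lag_after} "
            f"after {result.duration_seconds}s"
        )
    if result.lag_before != result.lag_after:
        # Something was changed, so keep a trace for the post-incident update.
        return "record incident: " + "; ".join(result.actions_taken)
    return "no follow-up needed"
```

For example, a result with `status="escalate"` leads to a paging action, while a result whose lag never crossed the threshold needs no follow-up.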

Runbook artifact structure

operations/runbooks/catalog-indexation-lag.md YAML + Markdown
---
register: operations
level: team
owner: platform-engineer
status: active
consumption-mode: skill
mcp-skill: catalog_indexation_lag
last-validated: 2026-06-03
---

# Runbook — Catalogue Indexation Lag

## Trigger
Datadog alert: `kafka.consumer.lag > 5000` on partition `catalog-updates`.

## Known causes
- High-volume operator burst (> 2000 listings updated simultaneously)
- Elasticsearch backpressure during bulk flush window
- Consumer group crash / pod eviction

## Resolution
Execute MCP skill `catalog_indexation_lag` with affected partition.
Expected resolution time: < 10 minutes for lags under 50,000 messages.

## Escalation
If lag persists after 10 min or exceeds 100,000 messages → PagerDuty: platform-oncall.

## Post-incident
Update this runbook with observed cause. Add Decision Context to Knowledge if pattern is new.
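
Two parts of this artifact lend themselves to automated checks: the front matter and the escalation rule. The following is a minimal sketch, assuming flat `key: value` front matter and a 180-day revalidation policy. The policy and all function names are assumptions, and reading "lag persists" as "lag still above the 5000 alert threshold" is an interpretation, not something the runbook states.

```python
from datetime import date

def parse_front_matter(text: str) -> dict[str, str]:
    """Read flat `key: value` pairs between the first two `---` lines."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta

def is_stale(meta: dict[str, str], today: date, max_age_days: int = 180) -> bool:
    """True when `last-validated` is older than max_age_days (assumed policy)."""
    validated = date.fromisoformat(meta["last-validated"])
    return (today - validated).days > max_age_days

def should_escalate(lag: int, minutes_elapsed: float, alert_threshold: int = 5000) -> bool:
    """Escalate if lag exceeds 100,000 messages or is still above the
    alert threshold after 10 minutes (interpretation of "persists")."""
    return lag > 100_000 or (minutes_elapsed >= 10 and lag > alert_threshold)
```

A check like `is_stale` could run in CI to flag runbooks that have not been revalidated recently, while `should_escalate` keeps the written escalation rule and the automated one in a single place.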