Operations register
Runbooks, playbooks, deployment procedures, and incident protocols. Workflows agents can execute autonomously when conditions are met.
What does Operations contain?
Runbooks · Incident playbooks · How-to guides · Deployment procedures · Rollback protocols · Observability recipes · Escalation & alerting
Humans vs AI Agents
For humans: guides them through crisis situations, captures lessons from resolved incidents, and onboards them on routine operations.
For AI agents: automates repeatable actions, executes playbooks autonomously, generates runbooks from other registers, and proposes updates after each incident.
Asymmetry: guidance vs. automation.
Consumption modes
RAG: incident search and operational context retrieval.
Skill: repeatable runbooks and playbooks encapsulated as executable MCP Skills.
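The retrieval mode can be illustrated with a toy incident search. This is only a sketch: the incident summaries are hypothetical, and Jaccard token overlap stands in for whatever embedding-based ranking a real RAG pipeline would use. The names `INCIDENTS`, `tokens`, and `search_incidents` are illustrative and do not come from the register.

```python
import re

# Hypothetical incident summaries; real ones would come from the register.
INCIDENTS = {
    "inc-001": "Kafka consumer lag on catalog indexation after operator burst",
    "inc-002": "Elasticsearch bulk flush backpressure slowed indexing",
    "inc-003": "TLS certificate expired on internal gateway",
}


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_incidents(query: str, top_k: int = 2) -> list[str]:
    """Return up to top_k incident ids ranked by Jaccard overlap with the query."""
    query_tokens = tokens(query)
    scored = []
    for incident_id, summary in INCIDENTS.items():
        summary_tokens = tokens(summary)
        union = query_tokens | summary_tokens
        score = len(query_tokens & summary_tokens) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, incident_id))
    # Highest score first; ties broken by id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [incident_id for _, incident_id in scored[:top_k]]
```

An agent would pass the retrieved summaries to the model as context. The skill mode skips this step and calls the runbook directly.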
Runbooks as MCP Skills
Runbooks are not just documented: they are encapsulated as callable tools. The agent does not read a runbook linearly — it invokes it with parameters and gets back a structured result.
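import time
from dataclasses import dataclass

# `tool` is imported as written in the source. With the official MCP Python SDK,
# tools are usually registered on a server object instead (for example
# FastMCP(...).tool()); adapt this line to the SDK in use.
from mcp import tool

# get_consumer_lag, scale_consumer_group and wait_for_resorption are operational
# helpers assumed to exist elsewhere; they are not defined in this document.


@dataclass
class RunbookResult:
    status: str  # "resolved" | "escalate"
    lag_before: int
    lag_after: int
    actions_taken: list[str]
    duration_seconds: int


@tool
def catalog_indexation_lag(
    partition: str,
    consumer_group: str = "catalog-indexer",
) -> RunbookResult:
    """
    Runbook: Kafka consumer lag on the catalogue indexation partition.
    Diagnoses, scales the consumer group, waits for the lag to drain.
    """
    started = time.monotonic()
    lag_before = get_consumer_lag(partition, consumer_group)
    actions = [f"Detected lag: {lag_before} messages"]

    if lag_before > 5000:
        scale_consumer_group(consumer_group, replicas=6)
        actions.append("Scaled consumer group to 6 instances")
        # Blocks for up to 10 minutes, then reports the remaining lag.
        lag_after = wait_for_resorption(partition, consumer_group, timeout=600)
        actions.append(f"Lag after scaling: {lag_after} messages remaining")
        return RunbookResult(
            status="resolved" if lag_after < 100 else "escalate",
            lag_before=lag_before,
            lag_after=lag_after,
            actions_taken=actions,
            duration_seconds=int(time.monotonic() - started),
        )

    # Lag is at or below the alert threshold: report it and do nothing.
    actions.append("Lag within threshold: no action taken")
    return RunbookResult(
        status="resolved",
        lag_before=lag_before,
        lag_after=lag_before,
        actions_taken=actions,
        duration_seconds=int(time.monotonic() - started),
    )

Runbook artifact structure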
---
register: operations
level: team
owner: platform-engineer
status: active
consumption-mode: skill
mcp-skill: catalog-indexation-lag
last-validated: 2026-06-03
---
# Runbook — Catalogue Indexation Lag
## Trigger
Datadog alert: `kafka.consumer.lag > 5000` on partition `catalog-updates`.
## Known causes
- High-volume operator burst (> 2000 listings updated simultaneously)
- Elasticsearch backpressure during bulk flush window
- Consumer group crash / pod eviction
## Resolution
Execute MCP skill `catalog_indexation_lag` with affected partition.
Expected resolution time: < 10 minutes for lags under 50,000 messages.
## Escalation
If lag persists after 10 min or exceeds 100,000 messages → PagerDuty: platform-oncall.
## Post-incident
Update this runbook with observed cause. Add Decision Context to Knowledge if pattern is new.
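The link between the artifact's front matter, its Resolution step, and its Escalation rule can be sketched as a small dispatcher. Everything below is an assumption made for illustration: `parse_front_matter`, `SKILLS`, `Outcome`, `fake_skill`, and `run_runbook` are hypothetical names, the skill is a stub, and the parser handles only flat `key: value` pairs rather than full YAML. The registry is keyed by the hyphenated `mcp-skill` value from the front matter, while the Python function itself uses underscores.

```python
from dataclasses import dataclass
from typing import Callable

# Trimmed copy of the front matter shown above.
ARTIFACT_HEADER = """\
---
register: operations
consumption-mode: skill
mcp-skill: catalog-indexation-lag
---
"""


@dataclass
class Outcome:
    status: str  # "resolved" | "escalate", mirroring RunbookResult.status
    lag_after: int


def parse_front_matter(text: str) -> dict[str, str]:
    """Read flat `key: value` pairs between the two `---` delimiter lines."""
    lines = text.strip().splitlines()
    if not lines or lines[0] != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line == "---":
            break
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def fake_skill(partition: str) -> Outcome:
    """Stub standing in for the real skill; always reports a drained partition."""
    return Outcome(status="resolved", lag_after=40)


# Hypothetical registry mapping `mcp-skill` names to callables.
SKILLS: dict[str, Callable[[str], Outcome]] = {"catalog-indexation-lag": fake_skill}


def run_runbook(artifact: str, partition: str) -> str:
    """Return 'retrieve', 'resolved', or 'page:platform-oncall'."""
    meta = parse_front_matter(artifact)
    if meta.get("consumption-mode") != "skill":
        return "retrieve"  # not executable; leave it to retrieval
    skill = SKILLS.get(meta.get("mcp-skill", ""))
    if skill is None:
        return "page:platform-oncall"  # no registered skill, so hand over to a human
    outcome = skill(partition)
    # Escalation rule from the artifact: lag persists, or exceeds 100,000 messages.
    if outcome.status != "resolved" or outcome.lag_after > 100_000:
        return "page:platform-oncall"
    return "resolved"
```

Calling `run_runbook(ARTIFACT_HEADER, "catalog-updates")` with the stub returns `"resolved"`. A real integration would replace `fake_skill` with the MCP tool call and send the page through the alerting system rather than returning a string.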