MarketPlace SA

An online booking platform: hotels, experiences, transport. 80,000 establishments across Europe.

The organization has a Nexus Org carrying non-negotiable standards: PCI-DSS compliance, WCAG AA accessibility, baseline performance SLA. The BU Digital Products adds an intermediate level with its UX standards and design system. Three inheritance levels total.

team-search is the two-pizza team responsible for the search engine and results page. Seven people: Sophie (PM), Marcus (Tech Lead), two backend engineers, one frontend engineer, one UX designer, one data/ML engineer. They operate their own Nexus Team, which inherits from the two parent levels.


Episode 1

The ADR

The situation

For three weeks, the backend engineers have been observing production latency spikes during mass catalogue updates. When a hotel operator updates several hundred listings at once, synchronous writes to Elasticsearch saturate the thread pool. P95 climbs to 1.8 seconds during these windows. Marcus launches a two-day spike. The team evaluates two options: CQRS with a separate read model, or event-driven indexation via Kafka. CQRS is ruled out: its maintenance cost exceeds the team's current capacity, and two documented incidents on a similar system showed the pattern's fragility under high catalogue volume. The decision: switch to asynchronous event-driven indexation for all catalogue updates.

What enters Knowledge

Marcus opens a Decision Context. He documents the situation that made the decision necessary, the two options evaluated, and the precise reasons for rejecting CQRS.

knowledge/decision-contexts/event-driven-indexation.md
---
register: knowledge
level: team
owner: tech-lead
status: active
consumption-mode: rag
---

# Decision Context — Event-driven indexation

## Situation
P95 latency spikes to 1.8s during mass catalogue updates.
Cause: synchronous writes to Elasticsearch.

## Options evaluated
### CQRS with separate read model
Rejected: high maintenance cost for a 7-person team.
Reference: two P1s on similar pattern (project X, 2024).

### Event-driven indexation via Kafka
Selected (see Decision taken).

## Decision taken
Async event-driven indexation via Kafka.

## Observed consequences
(to be completed after 30 days in production)

What enters Intent

Marcus creates the corresponding Decision Directive. Short, directly applicable, no narration.

intent/directives/event-driven-indexation.md
---
register: intent
level: team
owner: tech-lead
status: active
consumption-mode: system-prompt
---

# Decision Directive — Event-driven indexation

All hotel catalogue updates must transit through
a Kafka event before any write to Elasticsearch.
No synchronous catalogue write to Elasticsearch
is allowed in this bounded context.

Acceptance criteria: see contracts/search-indexation-sla.md
Decision context: knowledge/decision-contexts/event-driven-indexation.md
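
To make the directive concrete, here is a minimal sketch of the rule it states: the write path only publishes an event, and a separate consumer is the only component that writes to the index. This is an illustration, not team-search's code. A `queue.Queue` stands in for the Kafka topic and a plain `dict` for the Elasticsearch index, and the names `CatalogueUpdated`, `CatalogueService` and `IndexationConsumer` are hypothetical.

```python
import queue
from dataclasses import dataclass


@dataclass
class CatalogueUpdated:
    """Event emitted for every catalogue change (hypothetical schema)."""
    listing_id: str
    fields: dict


class CatalogueService:
    """Write path: publishes an event and never touches the search index."""

    def __init__(self, topic: queue.Queue):
        self._topic = topic  # stand-in for a Kafka topic

    def update_listing(self, listing_id: str, fields: dict) -> None:
        self._topic.put(CatalogueUpdated(listing_id, fields))


class IndexationConsumer:
    """Read side: the only component allowed to write to the index."""

    def __init__(self, topic: queue.Queue, index: dict):
        self._topic = topic
        self._index = index  # stand-in for an Elasticsearch index

    def drain(self) -> int:
        """Apply every pending event to the index; return how many were applied."""
        applied = 0
        while True:
            try:
                event = self._topic.get_nowait()
            except queue.Empty:
                return applied
            self._index.setdefault(event.listing_id, {}).update(event.fields)
            applied += 1


topic: queue.Queue = queue.Queue()
index: dict = {}
service = CatalogueService(topic)
consumer = IndexationConsumer(topic, index)

service.update_listing("hotel-1", {"price": 120})
service.update_listing("hotel-1", {"available": True})
index_before_drain = dict(index)  # still empty: the update did not write synchronously
applied = consumer.drain()        # the index is updated only when the consumer runs
```

The index stays empty until the consumer drains the topic, which is the property the directive enforces: no synchronous catalogue write reaches the index.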

Episode 2

The Sprint

The situation

Sophie creates ticket SEARCH-89: reduce the P95 latency of the results page to under 200ms for standard queries. An agent is assigned to implement the indexation pipeline.

The Context Assembler assembles the task brief

Before the agent begins, the Context Assembler builds the task brief, pulling from all applicable registers and the inheritance chain.
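
One way to picture this assembly step is a function that keeps active artifacts, orders them from parent to child level, and groups them by their `consumption-mode` front-matter field. The sketch below is an assumption about how such an assembler could work, not a description of the actual Context Assembler: the level name `bu`, the ordering rule, the consumption mode of the org contract, and the file `intent/specs/OTHER-TICKET.md` are all hypothetical.

```python
from dataclasses import dataclass

# Assumed level names, parent first; only "team" appears in the front matter shown.
LEVEL_ORDER = ["org", "bu", "team"]


@dataclass
class Artifact:
    path: str
    register: str
    level: str
    consumption_mode: str
    status: str = "active"


def assemble_brief(artifacts, ticket_spec_path):
    """Group active artifacts by how the agent consumes them."""
    brief = {"system-prompt": [], "context-injection": [], "rag": []}
    active = [a for a in artifacts if a.status == "active"]
    # Stable sort: parent levels first, original order kept within a level.
    active.sort(key=lambda a: LEVEL_ORDER.index(a.level))
    for a in active:
        if a.consumption_mode == "context-injection" and a.path != ticket_spec_path:
            continue  # only the spec of the current ticket is injected
        brief[a.consumption_mode].append(a.path)
    return brief


artifacts = [
    Artifact("intent/directives/event-driven-indexation.md", "intent", "team", "system-prompt"),
    Artifact("org/contracts/performance.md", "contracts", "org", "system-prompt"),
    Artifact("knowledge/decision-contexts/event-driven-indexation.md", "knowledge", "team", "rag"),
    Artifact("intent/specs/SEARCH-89.md", "intent", "team", "context-injection"),
    Artifact("intent/specs/OTHER-TICKET.md", "intent", "team", "context-injection"),  # hypothetical
]
brief = assemble_brief(artifacts, "intent/specs/SEARCH-89.md")
```

In this sketch the org contract lands ahead of the team directive in the system prompt, the Decision Context is left to retrieval, and only the SEARCH-89 spec is injected.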

intent/specs/SEARCH-89.md (Context Injection)
---
register: intent
consumption-mode: context-injection
---

# Spec SEARCH-89

## Objective
Reduce P95 latency of the results page to under 200ms.

## Acceptance criteria
- [ ] P95 measured on the indexation pipeline: ≤ 200ms
- [ ] No synchronous Elasticsearch write (see Decision Directive /directives/event-driven)
- [ ] Load test: 500 req/s for 5 minutes without degradation

The quality gate fails

The agent generates the pipeline. It invokes the performance-test.mcp skill before proposing the code. P95 reads 340ms, above the 200ms threshold. Bottleneck identified: synchronous per-document writes to Elasticsearch in the Kafka consumer. The agent does not propose the code. It reports the finding and proposes a correction: switch to bulk writes with a 50ms flush window and 100-document batches. Marcus validates. The agent generates the corrected code, reruns the test, and obtains P95 = 138ms. The gate passes.
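
The correction described above, flushing on whichever comes first of 100 buffered documents or a 50ms window, can be sketched as a small buffer class. This is a minimal illustration under stated assumptions, not the code the agent generated: `BulkWriter` and its `tick` method are hypothetical names, and the `flush` callable stands in for whatever call sends the batch to Elasticsearch's bulk API.

```python
import time
from typing import Callable, List


class BulkWriter:
    """Buffers documents and flushes on batch size or elapsed time."""

    def __init__(self, flush: Callable[[List[dict]], None],
                 max_batch: int = 100, flush_window_s: float = 0.050,
                 clock: Callable[[], float] = time.monotonic):
        self._flush = flush
        self._max_batch = max_batch
        self._flush_window_s = flush_window_s
        self._clock = clock
        self._buffer: List[dict] = []
        self._first_at = 0.0  # arrival time of the oldest buffered document

    def add(self, doc: dict) -> None:
        if not self._buffer:
            self._first_at = self._clock()
        self._buffer.append(doc)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def tick(self) -> None:
        """Call periodically (for example between consumer polls) to honour the window."""
        if self._buffer and self._clock() - self._first_at >= self._flush_window_s:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush(batch)


# Usage with a fake clock so the example is deterministic.
batches: List[List[dict]] = []
now = [0.0]
writer = BulkWriter(batches.append, clock=lambda: now[0])
for i in range(250):
    writer.add({"id": i})  # two full batches of 100 are flushed by size
now[0] = 0.060             # 60 ms later, past the 50 ms window
writer.tick()              # the remaining 50 documents are flushed by time
batch_sizes = [len(b) for b in batches]
```

The time window bounds how long a document can wait in the buffer when traffic is light, while the size limit keeps batches from growing without bound during a burst.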

A Contracts update follows

Marcus notices the 200ms threshold in Contracts was calibrated for synchronous pipelines. It does not account for the flush latency inherent to asynchronous bulk writers. He updates the team's Contracts to add a separate P95 ≤ 80ms threshold for the bulk writer alone, leaving a 120ms budget for Kafka transit.
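
The budget arithmetic and the per-stage check can be written down as a short sketch. The thresholds come from the text; the stage names, the `gate` function, the nearest-rank percentile method and the sample latencies are assumptions made for illustration. Note that splitting a P95 threshold into per-stage budgets is a simplification, since percentiles of separate stages do not add up exactly.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


END_TO_END_P95_MS = 200  # threshold from the spec
BULK_WRITER_P95_MS = 80  # new threshold for the bulk writer alone
KAFKA_TRANSIT_BUDGET_MS = END_TO_END_P95_MS - BULK_WRITER_P95_MS  # remaining budget


def gate(stage_samples_ms):
    """Return, per stage, whether its measured P95 fits its budget."""
    limits = {
        "bulk_writer": BULK_WRITER_P95_MS,
        "kafka_transit": KAFKA_TRANSIT_BUDGET_MS,
    }
    return {stage: p95(samples) <= limits[stage]
            for stage, samples in stage_samples_ms.items()}


# Illustrative latencies in milliseconds, not measurements from the case study.
result = gate({
    "bulk_writer": list(range(40, 60)),      # P95 = 58 ms
    "kafka_transit": [90] * 18 + [130, 150], # P95 = 130 ms
})
```

With these made-up samples the bulk writer passes its 80ms threshold and Kafka transit exceeds its 120ms budget, so the gate points at the stage that needs attention instead of reporting a single end-to-end failure.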


Episode 3

The Incident

The situation

Three weeks after production deployment, a Datadog alert fires on a Tuesday morning at 8:47 a.m. Search results show hotels with expired availability. The initial investigation shows the Kafka consumer is under backpressure, with roughly 12,000 messages queued in the partition. An operator has just updated 3,400 listings simultaneously.

The agent executes the runbook

The agent searches the Operations register by semantic similarity. The closest runbook is catalog-indexation-lag.md. It executes it: diagnoses the lag (12,847 messages, last ES write 4m32s ago), scales the consumer group from 2 to 6 instances, and waits for the lag to drain. Lag resolves in 8 minutes. Total incident duration: 23 minutes. The agent notifies Sophie and Marcus via Slack.

Runbook execution (agent log)
[08:47:12] Alert received: kafka.consumer.lag > 5000 (partition: catalog-updates)
[08:47:13] Searching Operations register: "kafka consumer lag catalogue indexation"
[08:47:13] Matched: operations/runbooks/catalog-indexation-lag.md (similarity: 0.94)
[08:47:14] Invoking MCP skill: catalog_indexation_lag(partition="catalog-updates")
[08:47:15] Diagnosed: lag=12847 messages, last_write=4m32s ago
[08:47:16] Action: scaling consumer group from 2 → 6 instances
[08:47:45] Monitoring lag resorption...
[08:55:22] Lag resolved: 43 messages remaining
[08:55:23] Skill result: status=resolved, duration=487s
[08:55:24] Notified: sophie@marketplace.sa, marcus@marketplace.sa
[08:55:24] Incident duration: 23 minutes
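
The control flow visible in this log (diagnose, scale, poll until the lag drains) can be sketched as follows. This is a hedged illustration rather than the content of catalog-indexation-lag.md: the function name, the injected callables, the `escalate` outcome and the resolution threshold of 100 messages are assumptions. The log only shows that 43 remaining messages counted as resolved, and the intermediate lag readings below are invented.

```python
from typing import Callable

LAG_ALERT_THRESHOLD = 5000  # from the alert rule in the log
LAG_RESOLVED_BELOW = 100    # assumption: the log only shows 43 counted as resolved


def run_lag_runbook(get_lag: Callable[[], int],
                    scale_consumers: Callable[[int], None],
                    wait: Callable[[], None],
                    target_instances: int = 6,
                    max_checks: int = 60) -> dict:
    """Diagnose the lag, scale the consumer group, then poll until it drains."""
    lag = get_lag()
    if lag <= LAG_ALERT_THRESHOLD:
        return {"status": "no-action", "lag": lag}
    scale_consumers(target_instances)
    for _ in range(max_checks):
        wait()
        lag = get_lag()
        if lag < LAG_RESOLVED_BELOW:
            return {"status": "resolved", "lag": lag}
    # Hypothetical fallback: hand over to a human if the lag never drains.
    return {"status": "escalate", "lag": lag}


# Usage with canned lag readings; the first and last values match the log.
readings = iter([12847, 6100, 900, 43])
scaled_to = []
result = run_lag_runbook(lambda: next(readings), scaled_to.append, lambda: None)
```

Bounding the polling loop with `max_checks` is a design choice of this sketch: it gives the agent a defined point at which to stop waiting and involve the team.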

The post-mortem feeds Knowledge and Operations

Marcus organizes the post-mortem 48 hours later. The agent drafts a structured summary from the logs and the runbook execution. Marcus and Sophie enrich it. Two artifacts are created or updated: the runbook in Operations gains a new known cause (Elasticsearch backpressure), and a new Decision Context is added to Knowledge documenting the fragility pattern of large-operator bursts.


Episode 4

The Product Signal

The situation

Two weeks after the incident, the data/ML engineer delivers an analysis on two months of usage data. The results are clear: users who get results with a relevance precision above 0.75 convert 3.4× more than those who get fast but less relevant results. The initial hypothesis in Intent held that speed was the primary conversion lever. The analysis contradicts it.

Intent is updated

Sophie revises the product hypothesis in Intent: relevance precision is the primary conversion lever, ahead of display speed. Implications for the roadmap: prioritize ranking algorithm improvement in sprint S+1, and accept higher P95 latency for requests with complex boosting.

intent/hypotheses/relevance-vs-speed.md (Intent update)
---
register: intent
level: team
owner: product-manager
status: active
supersedes: intent/hypotheses/speed-first.md
---

# Product Hypothesis — Relevance as primary conversion lever

## Previous hypothesis (superseded)
Speed is the primary conversion lever for the results page.

## Revised hypothesis
Relevance precision > 0.75 is the primary conversion lever.
Source: ML analysis 2026-05-20, 2 months of production data, n=48,000 sessions.
Conversion lift: 3.4× for precision > 0.75 vs. baseline.

## Roadmap implication
Sprint S+1: prioritize ranking algorithm improvement.
Accept P95 > 200ms for multi-criteria boosting queries.

The tension with Contracts

This revision creates a direct tension with the org performance Contracts (P95 < 200ms for all requests). Requests with multi-criteria boosting structurally exceed this threshold: their observed P95 is 320ms. The team cannot simply modify the org Contracts: they must declare an exception.

The exception workflow

Marcus prepares the team contract with exception-to pointing to org/contracts/performance.md. The Context Assembler detects that exception-approved-by is null and blocks inclusion of this contract in the system prompt until it is approved. Marcus submits it to the BU Digital Products quality lead. Three days later, the BU lead approves it. The Context Assembler now includes the exception in the context of team-search agents.

contracts/engineering/search-boosting-exception.md (Exception workflow)
---
register: contracts
level: team
exception-to: org/contracts/performance.md
exception-approved-by: null   # ← Context Assembler blocks until validated
---

# → Marcus submits to BU Digital Products quality lead
# → BU lead validates (3 days later)

---
register: contracts
level: team
exception-to: org/contracts/performance.md
exception-approved-by: bu-digital-quality-lead   # ← now included
exception-approved-date: 2026-05-28
---

P95 for multi-criteria boosting queries: ≤ 350ms (exception to 200ms org standard)
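
The blocking rule shown in the two front-matter states can be expressed as a small predicate. This is a minimal sketch of the check as the text describes it, assuming the front matter has already been parsed into a dictionary (YAML `null` becomes `None`); the function name `is_includable` is hypothetical.

```python
def is_includable(front_matter: dict) -> bool:
    """A contract declaring an exception is included only once approved."""
    if front_matter.get("exception-to") is None:
        return True  # ordinary contract, no approval needed
    return front_matter.get("exception-approved-by") is not None


# State 1: Marcus has declared the exception, nobody has approved it yet.
pending = {
    "register": "contracts",
    "level": "team",
    "exception-to": "org/contracts/performance.md",
    "exception-approved-by": None,
}

# State 2: the BU quality lead has approved it.
approved = {
    **pending,
    "exception-approved-by": "bu-digital-quality-lead",
    "exception-approved-date": "2026-05-28",
}

pending_included = is_includable(pending)
approved_included = is_includable(approved)
```

Keying the check on the presence of `exception-to` means ordinary team contracts pass through untouched, and only contracts that relax a parent standard wait for a named approver.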

What the four episodes traverse

| Mechanism | Episode |
| --- | --- |
| ADR split into Decision Context + Decision Directive | Ep. 1 |
| Knowledge fed by decision memory | Ep. 1 |
| Context Assembler assembling the task brief | Ep. 2 |
| Contracts quality gate triggered by agent | Ep. 2 |
| Self-correction before submission | Ep. 2 |
| Ship loop → Contracts update | Ep. 2 |
| Operations runbook executed by agent | Ep. 3 |
| Sync loop → Knowledge + Operations enrichment | Ep. 3 |
| Product hypothesis revised in Intent (Shape) | Ep. 4 |
| Enforcement mechanism: extension vs. exception | Ep. 4 |
| Cross-level validation (team → BU) | Ep. 4 |
| Context Assembler blocking an unapproved contract | Ep. 4 |
