Designing Fallback Routes for PMS Outages

This page builds the concrete degrade-and-recover state machine that keeps outbound rate and inventory pushes safe while a property management system is down: how to model the three routing phases, buffer every mutation to a store that survives the outage, and replay the buffer idempotently in version order once the PMS answers again. It sits under Fallback Routing for Downtime, which defines the health-probe, circuit-breaker, and delta-safety control plane as a whole — here we focus on the finite-state routing and durable buffer that control plane orchestrates.

The failure it exists to prevent is specific to distribution. Without a deterministic fallback route, a PMS maintenance window leaves the automation with two bad options: keep pushing the last-known rate until an OTA flags a parity violation, or stop pushing and let every channel drift until someone notices. A revenue manager then finds a room selling on Booking.com at last week’s price; the on-call engineer is left reconstructing which pushes were lost mid-outage. A three-phase route turns that open-ended failure into a bounded, replayable one.

Prerequisites & environment

The route below is a standalone module that wraps your existing dispatch call; it does not replace the delta-safety and quarantine logic from the parent workflow, it feeds them. Pin these versions so the async, hashing, and Pydantic behaviour is reproducible:

Python 3.11+ — for enum.StrEnum and the match statement used in the transition table.
redis 5.0+ — the durable buffer, run with appendonly yes and appendfsync everysec so buffered records survive a node restart mid-outage without paying for an fsync on every write. The trade-off is that a crash can lose up to about one second of appends.
pydantic 2.6+ — the v2 API (field_validator, model_dump) validates each buffered record at the boundary; v1 validator/dict() will not run against this code.
httpx 0.27+ and tenacity 8.x — the probe client and the replay dispatch with retry.
structlog 24.x — every state transition is logged as a machine-readable event keyed by property_id.
API access — a lightweight PMS health endpoint plus channel-manager write scopes, with credentials kept valid across a long outage through OAuth2 token refresh so a mid-recovery 401 does not abort the flush.
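The persistence settings above are easy to lose in a config change, so it can help to check them at startup. The following is a minimal sketch, not part of the original module: `check_buffer_persistence` is a hypothetical helper that takes the settings as a plain dict, which is the shape redis-py returns from `r.config_get("append*")`. Managed Redis services sometimes disable `CONFIG GET`, in which case the dict would have to come from elsewhere.

```python
def check_buffer_persistence(config: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the settings look right."""
    problems: list[str] = []
    if config.get("appendonly") != "yes":
        problems.append("appendonly must be 'yes' so buffered pushes reach the AOF")
    if config.get("appendfsync") not in {"everysec", "always"}:
        # 'no' leaves flushing entirely to the OS, which widens the loss window.
        problems.append("appendfsync should be 'everysec' or 'always'")
    return problems
```

A caller could refuse to enter DEGRADED-capable operation, or at least log loudly, when the returned list is non-empty.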

Two upstream contracts must already hold. Every rate_plan_code the buffer stores must resolve through your rate plan taxonomy so a cached push references a stable code rather than a PMS-internal key, and every room_type_code must be reconciled through OTA channel mapping so a replayed push never double-counts one physical room across channels. The buffered objects themselves are the validated canonical shape produced by data schema standardization, which is what lets the recovery step assume its cached data is already well-typed and currency-normalized.

Step-by-step implementation

The route is four ordered parts: define the phases and the record contract, detect degradation, buffer durably while degraded, then flush in version order on recovery.

Step 1 — Model the three routing phases and the buffered record

Represent the route as an explicit finite state machine, not a scatter of boolean flags. Three phases are the minimum that recovers cleanly: HEALTHY dispatches straight through, DEGRADED buffers, and RECOVERY drains the buffer while probing. Each buffered mutation carries a monotonic version and an idempotency key derived from its business identity (not from the payload content), so the recovery step can order replays and collapse duplicates.

python

from datetime import date, datetime, timezone
from decimal import Decimal
from enum import StrEnum
import hashlib

from pydantic import BaseModel, ConfigDict, Field, field_validator


class RouteState(StrEnum):
    HEALTHY = "healthy"      # push straight to the channel manager
    DEGRADED = "degraded"    # PMS unreachable — buffer every push
    RECOVERY = "recovery"    # PMS back — drain the buffer, still probing


class BufferedPush(BaseModel):
    model_config = ConfigDict(strict=True, extra="forbid")

    property_id: str = Field(pattern=r"^prop_[0-9a-f]{8}$")
    room_type_code: str = Field(min_length=2, max_length=16)
    rate_plan_code: str = Field(min_length=3, max_length=24, pattern=r"^[A-Z0-9_-]+$")
    ota: str  # channel slug, e.g. "booking_com" or "expedia"
    base_amount: Decimal = Field(ge=0, max_digits=10, decimal_places=2)
    currency: str = Field(pattern=r"^[A-Z]{3}$")
    date_from: date
    date_to: date
    version: int = Field(ge=0)  # monotonic per (property, room, rate plan, dates)
    buffered_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))

    @field_validator("ota")
    @classmethod
    def known_channel(cls, v: str) -> str:
        if v not in {"booking_com", "expedia", "agoda", "direct"}:
            raise ValueError(f"unknown OTA slug: {v}")
        return v

    def idempotency_key(self) -> str:
        # Keyed on business identity (including the target channel), NOT payload
        # content or buffered_at, so the same room-night replayed on recovery reuses
        # the key the primary path would have used and the channel manager collapses
        # the duplicate. Including ota keeps pushes for different channels distinct
        # when the flush collapses the buffer to one record per key.
        identity = (f"{self.property_id}:{self.ota}:{self.room_type_code}:"
                    f"{self.rate_plan_code}:{self.date_from}:{self.date_to}")
        return hashlib.sha256(identity.encode()).hexdigest()[:32]

Deriving idempotency_key from business identity rather than from the amount or timestamp is the load-bearing choice: a rate that was corrected twice during the outage collapses to a single key, so the OTA applies the latest version once instead of stacking two conflicting updates in arrival order.

Step 2 — Detect degradation and drive the transition

The route changes phase on a health signal, never on the failure of one business push: a single transient 503 on a rate push must not tip a healthy PMS into buffering. Feed each result from a background probe into a deterministic transition table that counts consecutive failures and consecutive successes.

python

import structlog

log = structlog.get_logger()


class RouteController:
    def __init__(self, fail_threshold: int = 3, recover_probes: int = 2):
        self.state = RouteState.HEALTHY
        self.fail_threshold = fail_threshold
        self.recover_probes = recover_probes
        self._consecutive_fail = 0
        self._consecutive_ok = 0

    def observe(self, healthy: bool) -> RouteState:
        self._consecutive_ok = self._consecutive_ok + 1 if healthy else 0
        self._consecutive_fail = 0 if healthy else self._consecutive_fail + 1
        prev = self.state
        match self.state:
            case RouteState.HEALTHY if self._consecutive_fail >= self.fail_threshold:
                self.state = RouteState.DEGRADED
            case RouteState.DEGRADED if healthy:
                self.state = RouteState.RECOVERY  # one good probe starts draining
            case RouteState.RECOVERY if self._consecutive_ok >= self.recover_probes:
                self.state = RouteState.HEALTHY   # confirmed stable — resume direct
            case RouteState.RECOVERY if not healthy:
                self.state = RouteState.DEGRADED  # relapse — stop draining, buffer again
        if self.state is not prev:
            log.warning("route_transition", frm=prev, to=self.state,
                        consecutive_fail=self._consecutive_fail, consecutive_ok=self._consecutive_ok)
        return self.state

Requiring recover_probes consecutive successes before returning to HEALTHY — while a single failure during RECOVERY drops straight back to DEGRADED — makes the route asymmetric on purpose: it is cheap to keep buffering and expensive to resume too early against a PMS that is still flapping.
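The controller only reacts to what it is fed, so something has to run the probe and call observe. A minimal sketch of that loop follows, under stated assumptions: `run_probes` is a hypothetical helper, `check` is any zero-argument callable that returns True when the PMS health endpoint answers, and `observe` would be `ctrl.observe`. The sleep function is injectable so the loop can be tested without waiting.

```python
import time
from typing import Callable


def run_probes(check: Callable[[], bool],
               observe: Callable[[bool], object],
               probes: int,
               interval_s: float = 5.0,
               sleep: Callable[[float], None] = time.sleep) -> list[object]:
    """Call `check` `probes` times and feed each result to `observe`."""
    states: list[object] = []
    for i in range(probes):
        try:
            healthy = bool(check())
        except Exception:
            # Any probe error (timeout, DNS failure, refused connection)
            # is treated as an unhealthy observation.
            healthy = False
        states.append(observe(healthy))
        if i < probes - 1:
            sleep(interval_s)  # no sleep after the final probe
    return states
```

With httpx, `check` could be something like `lambda: client.get(HEALTH_URL, timeout=2.0).is_success`, where `HEALTH_URL` stands in for your PMS health endpoint. Counting probe exceptions as failures keeps a hung endpoint from leaving the route stuck in HEALTHY.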

Step 3 — Buffer every push durably while degraded

While DEGRADED, no push leaves the process. Each validated mutation is appended to a per-property Redis list under a key that outlives the outage. Storing the model with model_dump(mode="json") keeps Decimal and date values loss-free as strings, so the replay reads back exactly what was buffered.

python

import json
import redis


def buffer_push(r: redis.Redis, push: BufferedPush) -> None:
    key = f"fallback:buffer:{push.property_id}"
    r.rpush(key, json.dumps(push.model_dump(mode="json")))
    r.expire(key, 172_800)  # sliding 48h TTL: reset on every append, so the key expires 48h after the last write
    log.info("push_buffered", property_id=push.property_id, ota=push.ota,
             rate_plan_code=push.rate_plan_code, version=push.version,
             idempotency_key=push.idempotency_key())


def route_push(ctrl: RouteController, r: redis.Redis, push: BufferedPush, dispatch) -> str:
    match ctrl.state:
        case RouteState.HEALTHY:
            dispatch(push)  # straight through — your existing delta-safe dispatch
            return "dispatched"
        case RouteState.DEGRADED | RouteState.RECOVERY:
            buffer_push(r, push)  # buffer even during RECOVERY so the ordered drain owns replay
            return "buffered"

Buffering during RECOVERY as well as DEGRADED — rather than letting fresh pushes race the drain straight to the channel — is deliberate: it keeps a single ordered replay path in charge of the whole outage window, so a new push and a buffered older version of the same room-night can never arrive out of order.
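One caveat follows from the controller in Step 2: it returns to HEALTHY on probe results alone and does not look at the buffer. The single-ordered-path guarantee therefore also depends on the buffer being empty before direct dispatch resumes. A minimal sketch of that extra gate, assuming a hypothetical helper `should_dispatch_direct` and a `pending` count read from the buffer list:

```python
def should_dispatch_direct(state: str, pending: int) -> bool:
    """True only when the route is healthy AND nothing is waiting in the buffer."""
    if pending < 0:
        raise ValueError("pending must be >= 0")
    # RouteState is a StrEnum, so its members compare equal to plain strings.
    return state == "healthy" and pending == 0
```

In the routing function this could be called as `should_dispatch_direct(ctrl.state, r.llen(key))`, buffering whenever it returns False. Whether this gate is needed depends on how your orchestration sequences the flush relative to the probes; treat it as one option rather than a required part of the design.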

Step 4 — Flush the buffer in version order on recovery

Draining is where correctness is won or lost. Read the whole buffer, keep only the highest version per idempotency key (so intra-outage corrections supersede earlier values), and replay the survivors in stay-date order with exponential-backoff retry and a small random pause between pushes. The channel manager collapses any duplicate key, so a flush that is interrupted and restarted is safe.

python

import random
import time
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

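RETRYABLE = {429, 500, 502, 503, 504}


class NonRetryableReplayError(Exception):
    """A replay was rejected with a status that retrying cannot fix (e.g. 400, 401, 422)."""


@retry(stop=stop_after_attempt(4),
       wait=wait_exponential(multiplier=1, min=1, max=15),
       retry=retry_if_exception_type(httpx.HTTPStatusError), reraise=True)
def _replay(client: httpx.Client, push: BufferedPush) -> None:
    resp = client.post(
        "https://api.channelmanager.example/v1/inventory/sync",
        json={"rates": {push.rate_plan_code: str(push.base_amount)}},
        headers={"Idempotency-Key": push.idempotency_key()},
        timeout=5.0,
    )
    if resp.status_code in RETRYABLE:
        resp.raise_for_status()  # HTTPStatusError: tenacity backs off and retries
    elif resp.is_error:
        # Any other 4xx/5xx is not retried. Raise so the flush stops with the buffer
        # intact and the caller can dead-letter the record, instead of silently
        # counting a rejected push as sent.
        raise NonRetryableReplayError(
            f"status {resp.status_code} for key {push.idempotency_key()}")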


def flush_buffer(r: redis.Redis, property_id: str, client: httpx.Client) -> int:
    key = f"fallback:buffer:{property_id}"
    raw = r.lrange(key, 0, -1)
    pushes = [BufferedPush.model_validate_json(item) for item in raw]

    # Collapse to the latest version per room-night; a mid-outage re-price wins.
    latest: dict[str, BufferedPush] = {}
    for p in pushes:
        k = p.idempotency_key()
        if k not in latest or p.version > latest[k].version:
            latest[k] = p

    sent = 0
    for p in sorted(latest.values(), key=lambda p: (p.date_from, p.version)):
        _replay(client, p)
        time.sleep(random.uniform(0, 0.2))  # spread the burst under the OTA ceiling
        sent += 1
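    # Reached only after every replay succeeded; a raised error leaves the buffer intact.
    # Trim just the records read above, so pushes appended during the drain stay queued.
    r.ltrim(key, len(raw), -1)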
    log.info("buffer_flushed", property_id=property_id, replayed=sent, buffered=len(pushes))
    return sent

Defining _replay with a module-level @retry decorator rather than wrapping each call inline keeps the backoff policy in one place: every buffered record is replayed under the same limits, up to four attempts with exponential backoff bounded between 1 and 15 seconds. That budget is per call, not shared across the flush, so each record gets its own four attempts, and the first record that exhausts them raises and stops the drain. Clearing the replayed records only after the loop completes means an exception mid-flush leaves the buffer in place, so a restart re-drains from a consistent state instead of losing the tail of the outage; records that were already sent are replayed again and rely on the idempotency key to be collapsed.
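Because route_push keeps appending during RECOVERY, the list can grow while a drain is in progress. The sketch below illustrates removing only the records that were actually read, using a plain Python list as a stand-in for the Redis list; in Redis the corresponding calls would be LRANGE to take the snapshot and LTRIM to drop it. `drain_snapshot` is a hypothetical name, and the sketch assumes a single drainer per property.

```python
from typing import Callable


def drain_snapshot(buffer: list[str], replay: Callable[[str], None]) -> int:
    """Replay the records present at the start, then remove only those records."""
    snapshot = list(buffer)        # like LRANGE key 0 -1
    for record in snapshot:
        replay(record)             # may raise; nothing is removed in that case
    del buffer[:len(snapshot)]     # like LTRIM key len(snapshot) -1
    return len(snapshot)
```

A record appended while the loop is running sits behind the snapshot and survives the trim, so the next drain picks it up. If a replay raises, the list is untouched and the whole snapshot is retried.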

Gotchas & production notes

The buffer must not share a failure domain with the PMS. If the Redis node backing the buffer is co-located with or dependent on the PMS host, the same outage that trips the route also loses the pushes you were trying to protect. Host it independently and keep appendonly yes; treat an empty buffer on recovery as a signal to re-derive from live PMS state, not as “nothing to do”.
Replay only after superseding by version — never blind. A push captured at 02:10 may be stale by the time the PMS returns at 04:30. Step 4 keeps only the latest version per room-night, but you should still run every replayed rate through the parent workflow’s delta guardrail so an out-of-tolerance value quarantines rather than reaching the OTA; verbatim replay of a stale rate is the classic way an outage becomes a parity penalty.
Store effective dates as naive property-local dates. A push buffered near midnight and replayed hours later must land on the same stay_date it was created for. If date_from is stored UTC-shifted, the replay can move a rate onto the adjacent night; keep the date property-local and independent of buffered_at.
A recovery flush is a burst — size it to the channel budget. Every buffered push wanting to leave at once will trip OTA API rate limits and cascade into 429s. The jittered sleep spreads it, and the retry follows the shared exponential backoff profile; classify any non-retryable response with the 4xx-vs-5xx taxonomy so a contract error dead-letters instead of burning quota on replay.
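The property-local date note is easiest to see with numbers. The sketch below assumes a property at a fixed UTC+10 offset (a simplification; a real property would use its IANA zone) and a push created at 00:30 local time on the stay date.

```python
from datetime import date, datetime, timedelta, timezone

# Assumed fixed offset for illustration only.
PROPERTY_TZ = timezone(timedelta(hours=10))

# Push created at 00:30 local time on 2 July.
created_local = datetime(2026, 7, 2, 0, 30, tzinfo=PROPERTY_TZ)

# Property-local date: the night the rate was meant for.
stay_date = created_local.date()

# Same instant converted to UTC first: 14:30 on the previous day.
utc_shifted = created_local.astimezone(timezone.utc).date()

print(stay_date, utc_shifted)  # 2026-07-02 2026-07-01
```

Deriving date_from from the UTC instant would move this rate onto the night of 1 July, which is the adjacent-night error the note warns about.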

Verification snippet

Prove the two properties the whole route depends on before an outage exercises it: that the transition table degrades and recovers on the right signal, and that the flush collapses intra-outage corrections to the latest version under one idempotency key.

python

def test_route_degrades_then_recovers() -> None:
    ctrl = RouteController(fail_threshold=3, recover_probes=2)
    assert [ctrl.observe(False) for _ in range(3)][-1] is RouteState.DEGRADED
    assert ctrl.observe(True) is RouteState.RECOVERY   # first good probe starts drain
    assert ctrl.observe(True) is RouteState.HEALTHY    # second confirms stable


def test_flush_keeps_latest_version_per_room_night() -> None:
    base = dict(property_id="prop_0a1b2c3d", room_type_code="DLX",
                rate_plan_code="BAR_FLEX", ota="booking_com", currency="EUR",
                date_from=date(2026, 7, 2), date_to=date(2026, 7, 3))
    v1 = BufferedPush(**base, base_amount=Decimal("180.00"), version=1)
    v2 = BufferedPush(**base, base_amount=Decimal("175.00"), version=2)  # mid-outage re-price
    assert v1.idempotency_key() == v2.idempotency_key()  # same room-night → same key
    latest = {v1.idempotency_key(): max((v1, v2), key=lambda p: p.version)}
    assert latest[v1.idempotency_key()].base_amount == Decimal("175.00")


test_route_degrades_then_recovers()
test_flush_keeps_latest_version_per_room_night()

Asserting that two versions of the same room-night share an idempotency key directly tests the guarantee the flush relies on: if the key ever varied, a recovery replay would apply both the original and the correction. In production, also assert that replayed plus superseded plus quarantined records equals the count buffered (no silent drop; superseded records are the older versions the flush intentionally collapses), and reconcile the flushed totals against live PMS state through async polling and the nightly batch reconciliation run.
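The no-silent-drop check can be sketched as a small accounting function. This is an illustration under stated assumptions: `assert_no_silent_drop` is a hypothetical name, and `superseded` is taken to be the number of older versions removed by the collapse step (in flush_buffer terms, `len(pushes) - len(latest)`), since those records are dropped on purpose and have to be counted for the totals to balance.

```python
def assert_no_silent_drop(buffered: int, replayed: int,
                          superseded: int, quarantined: int) -> None:
    """Raise if any buffered record is unaccounted for after a flush."""
    accounted = replayed + superseded + quarantined
    if accounted != buffered:
        raise AssertionError(
            f"buffer accounting mismatch: buffered={buffered}, "
            f"accounted={accounted} (replayed={replayed}, "
            f"superseded={superseded}, quarantined={quarantined})"
        )
```

For example, ten buffered records that collapse to seven keys, with six replayed and one quarantined, balance as 6 + 3 + 1 = 10.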

Fallback Routing for Downtime — the parent control plane: health probes, circuit breaker, delta safety, and quarantine that this state machine and buffer plug into.
Data Schema Standardization — the validated canonical objects the buffer stores and replays.
Handling OTA API Rate Limits — the channel budget a recovery flush must stay under.
Categorizing 4xx vs 5xx Sync Errors — deciding when a replay retries versus dead-letters.
Building Batch Reconciliation Scripts for Daily Syncs — the nightly audit that catches anything an outage flush missed.
