Batch Reconciliation Workflows for Hospitality Distribution

Batch reconciliation is the deterministic audit layer that proves — on a fixed schedule, against full snapshots rather than individual events — that every online travel agency (OTA) is selling the exact inventory, rates, and restrictions the property management system (PMS) intends. Without it, parity drift accumulates silently: a dropped channel manager webhook leaves a room sellable that housekeeping has already blocked, a mid-day rate push half-applies, and the first signal anyone gets is an overbooking or a Booking.com parity-compliance flag that suppresses the property in search. Revenue managers lose margin to channel arbitrage they cannot see; operations teams field walk-in disputes they cannot explain; and Python automation engineers are handed a “why is Expedia wrong?” ticket with no auditable trail to answer it. This page sits inside the broader API Sync & Data Ingestion Workflows domain and specifies the architecture, code, data contracts, and failure handling for running reconciliation in production.

Unlike the live push path, reconciliation operates as a scheduled, read-only audit: it compares complete state snapshots during off-peak windows without triggering cascading writes during peak booking hours. Where async polling catches near-real-time deltas, batch reconciliation is the authoritative backstop that catches whatever both the event and polling paths missed, and it is the escalation target when drift on a property exceeds the tolerance a piecemeal correction can safely absorb.

Architecture & Prerequisites

A reconciliation run is a five-stage pipeline: parallel extraction of PMS truth and channel/OTA state, normalization of both into one canonical schema, a deterministic vectorized diff, severity scoring and triage, then durable state persistence plus alerting. Extraction is deliberately decoupled from comparison — authentication, pagination, and throttling live in the ingestion layer so that a credential expiry or a 429 can never corrupt the validation result. Each run is keyed by an immutable batch_id; every raw payload is hashed and staged before comparison, which makes any run reproducible and point-in-time replayable when a downstream assertion fails.

Pipeline overview: payloads from two parallel extractors are hashed and frozen against an immutable batch_id, normalized into one canonical Polars frame, then diffed into three discrepancy classes. Severity-scored rows are upserted idempotently; a 429 backs off extraction, and excess async-polling drift escalates a property into the run.

Environment assumptions and dependency versions:

Python 3.11+ (for asyncio.TaskGroup and exception groups during parallel extraction)
polars 0.20+ for vectorized, memory-efficient batch joins over room-night records
pydantic 2.6+ (v2 API — field_validator, model_dump) for the canonical contract
httpx 0.27+ and tenacity 8.x for resilient extraction
structlog 24.x for JSON-structured audit telemetry
A relational store (PostgreSQL 14+) supporting INSERT … ON CONFLICT DO UPDATE for idempotent state
Read access to both PMS and channel-manager/OTA snapshot endpoints, with credentials managed through OAuth2 token refresh so a mid-extraction 401 never voids a run

Reconciliation consumes the same canonical inventory shape produced by standardized JSON payloads, and every rate_plan_code it compares must already conform to your rate plan taxonomy. Reconciling upstream of those two contracts is possible but self-defeating: unmapped rate plans and unnormalized room types manufacture false discrepancies on every run and drown the real ones.

Implementation

The build proceeds in four numbered steps. Each step is anchored to a self-contained block; wire them together in the order shown, driven by an off-peak scheduler (typically 02:00–04:00 local property time).
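The off-peak gate can be sketched as a small guard the scheduler calls before it starts a run. This is a minimal sketch under stated assumptions: `in_offpeak_window` and its default bounds are illustrative names, and the property's time zone is passed in as a `tzinfo` (in production, typically a `zoneinfo.ZoneInfo` for the property's location).

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_offpeak_window(now_utc: datetime, property_tz: tzinfo,
                      start: time = time(2, 0), end: time = time(4, 0)) -> bool:
    """Return True when the property's local wall-clock time is inside [start, end)."""
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    local = now_utc.astimezone(property_tz).time()  # naive local wall-clock time
    return start <= local < end


# A fixed UTC+2 offset stands in for a real ZoneInfo zone in this example.
utc_plus_2 = timezone(timedelta(hours=2))
early = in_offpeak_window(datetime(2026, 7, 10, 0, 30, tzinfo=timezone.utc), utc_plus_2)  # 02:30 local
late = in_offpeak_window(datetime(2026, 7, 10, 9, 0, tzinfo=timezone.utc), utc_plus_2)    # 11:00 local
print(early, late)
```

Evaluating the window in property-local time, not UTC, is what keeps a multi-timezone portfolio from auditing some properties in the middle of their booking day.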

Step 1 — Extract and stage both sides deterministically

False discrepancies often originate from non-deterministic pulls. A rate that changes between the PMS read and the OTA read is not a real discrepancy, so both sides must be captured as close together as possible and frozen. Each raw payload is hashed with SHA-256 and staged against the run’s batch_id before any normalization, guaranteeing immutability and replay.

python

import asyncio
import hashlib
import json
from datetime import datetime, timezone

import httpx
import structlog
from tenacity import (
    retry, stop_after_attempt, wait_exponential, retry_if_exception_type,
)

structlog.configure(processors=[structlog.processors.JSONRenderer()])
log = structlog.get_logger()


@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type((httpx.RequestError, httpx.HTTPStatusError)),
    reraise=True,
)
async def fetch_snapshot(client: httpx.AsyncClient, url: str, headers: dict) -> dict:
    resp = await client.get(url, headers=headers)
    resp.raise_for_status()
    return resp.json()


async def stage_run(property_id: str, pms_url: str, channel_url: str, headers: dict,
                    batch_id: str | None = None) -> dict:
    # Pass the original batch_id when retrying an aborted run so the upsert converges.
    batch_id = batch_id or f"recon_{property_id}_{datetime.now(timezone.utc):%Y%m%dT%H%M%SZ}"
    # http2=True requires the httpx[http2] extra to be installed.
    async with httpx.AsyncClient(timeout=30.0, http2=True) as client:
        # Capture both sides concurrently to minimize the read-skew window.
        pms_raw, channel_raw = await asyncio.gather(
            fetch_snapshot(client, pms_url, headers),
            fetch_snapshot(client, channel_url, headers),
        )
    for source, payload in (("pms", pms_raw), ("channel", channel_raw)):
        # Canonical JSON (sorted keys) so identical content always yields the same digest.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        log.info("snapshot_staged", batch_id=batch_id, source=source, payload_sha256=digest)
    return {"batch_id": batch_id, "pms": pms_raw, "channel": channel_raw}

Capturing both snapshots with a single asyncio.gather rather than sequential awaits shrinks the read-skew window to roughly the difference between the two response latencies, which is what stops most in-flight rate changes from masquerading as parity violations.
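A staged digest is only useful for replay if identical content always hashes to the same value. The sketch below assumes payloads are JSON-compatible dicts; `payload_sha256` is an illustrative helper name. It shows that hashing a canonical JSON encoding is independent of key order, whereas hashing `repr(payload)` is not.

```python
import hashlib
import json


def payload_sha256(payload: dict) -> str:
    """Hash a JSON-compatible payload so that key order does not affect the digest."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = {"room_type_code": "DLX_KING", "rate": "129.00"}
b = {"rate": "129.00", "room_type_code": "DLX_KING"}  # same content, different key order

same_canonical = payload_sha256(a) == payload_sha256(b)
same_repr = (hashlib.sha256(repr(a).encode()).hexdigest()
             == hashlib.sha256(repr(b).encode()).hexdigest())
print(same_canonical, same_repr)
```

If byte-exact fidelity to the upstream response matters more than content equality, hashing the raw response bytes before decoding is the alternative.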

Step 2 — Normalize both sides into the canonical frame

Extraction returns two dialects: the PMS speaks its internal codes, the channel manager echoes OTA-flavoured aliases. Both are coerced into one Polars frame with identical column names, dtypes, and units before any comparison. Dates are forced to ISO 8601, rates to integer minor units (cents/pence), and rate_plan_code/room_type_code through the same mapping the live path uses.

python

import polars as pl

MERGE_KEYS = ["property_id", "room_type_code", "rate_plan_code", "stay_date"]


def normalize(records: list[dict], source: str) -> pl.DataFrame:
    return (
        pl.DataFrame(records)
        .with_columns(
            pl.col("room_type_code").str.strip_chars().str.to_uppercase(),
            pl.col("rate_plan_code").str.strip_chars().str.to_uppercase(),
            pl.col("stay_date").str.to_date("%Y-%m-%d"),
            # Rates arrive as decimal strings; store base-currency minor units as ints.
            (pl.col("rate").cast(pl.Float64) * 100).round(0).cast(pl.Int64).alias("rate_minor"),
            pl.col("available_units").cast(pl.Int64),
        )
        .select([*MERGE_KEYS, "rate_minor", "available_units",
                 "min_length_of_stay", "closed_to_arrival"])
        # Key columns keep the same names on both sides; the join suffixes the measures.
    )

Normalizing rates into integer minor units at ingestion — never at the diff site — eliminates the floating-point rounding class of false positives, where 129.00 and 129.0000001 would otherwise register as a parity break.
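Where rates arrive as decimal strings, the conversion to minor units can avoid binary floats entirely. A minimal sketch using the standard library's `decimal` module; `to_minor_units` and its `exponent` parameter are illustrative, and the exponent is assumed to come from the currency (2 for most currencies, 0 for JPY).

```python
from decimal import Decimal, ROUND_HALF_UP


def to_minor_units(rate: str, exponent: int = 2) -> int:
    """Convert a decimal string such as '129.00' to integer minor units."""
    scaled = Decimal(rate).scaleb(exponent)  # shift the decimal point exactly
    return int(scaled.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


print(to_minor_units("129.00"))            # 12900
print(to_minor_units("1.15"))              # 115
print(to_minor_units("1500", exponent=0))  # 1500
```

The rounding mode is a business decision; half-up is used here only as an example and should match whatever the PMS applies.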

Step 3 — Run the deterministic diff engine

The engine left-joins on the composite key so PMS truth is preserved and any row missing on the channel side surfaces as a null. It flags three discrepancy classes (inventory mismatch, rate parity violation, and restriction misalignment) plus a MISSING_ON_CHANNEL marker for unpublished room-nights, and assigns a severity score so triage is automatic. The code below scores on the availability gap as a stand-in; a production score would also weight revenue impact and date proximity. A configurable tolerance absorbs legitimate holds and overbooking buffers.

python

def reconcile(pms: pl.DataFrame, channel: pl.DataFrame,
              avail_tolerance: int = 0, rate_bps: int = 0) -> pl.DataFrame:
    merged = pms.join(channel, on=MERGE_KEYS, how="left", suffix="_ch")

    avail_gap = (pl.col("available_units_ch") - pl.col("available_units")).abs()
    rate_gap_bps = (
        (pl.col("rate_minor_ch") - pl.col("rate_minor")).abs() * 10_000
        / pl.col("rate_minor")
    )

    flagged = merged.with_columns(
        pl.when(pl.col("available_units_ch").is_null()).then(pl.lit("MISSING_ON_CHANNEL"))
          .when(avail_gap > avail_tolerance).then(pl.lit("INVENTORY_MISMATCH"))
          .when(rate_gap_bps > rate_bps).then(pl.lit("RATE_PARITY_VIOLATION"))
          .when(pl.col("closed_to_arrival") != pl.col("closed_to_arrival_ch")).then(pl.lit("RESTRICTION_MISALIGN"))
          .when(pl.col("min_length_of_stay") != pl.col("min_length_of_stay_ch")).then(pl.lit("RESTRICTION_MISALIGN"))
          .otherwise(pl.lit("OK")).alias("discrepancy"),
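        # Severity here is driven by the availability gap alone. Rows missing on the
        # channel have a null gap, so rank them highest explicitly.
        pl.when(pl.col("available_units_ch").is_null()).then(pl.lit(3))
          .when(avail_gap > 3).then(pl.lit(3))
          .when(avail_gap > 0).then(pl.lit(2))
          .otherwise(pl.lit(1)).alias("severity"),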
    )
    return flagged.filter(pl.col("discrepancy") != "OK")

The join is how="left" rather than inner on purpose: an inner join would silently drop room-nights the channel forgot to publish, which is precisely the most dangerous discrepancy — phantom-absent inventory that never triggers a rate check.
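Two pieces of the diff are easier to reason about with numbers: the basis-point rate gap and a severity score that blends exposure with date proximity. The following is a plain-Python sketch, not the engine's implementation; the function names, the 1,000-minor-unit exposure threshold, and the 7-day proximity cutoff are all illustrative assumptions.

```python
from datetime import date


def rate_gap_bps(pms_minor: int, channel_minor: int) -> float:
    """Absolute rate gap in basis points of the PMS rate (1 bps = 0.01%)."""
    if pms_minor <= 0:
        raise ValueError("PMS rate must be positive to express a relative gap")
    return abs(channel_minor - pms_minor) * 10_000 / pms_minor


def severity(avail_gap: int, rate_gap_minor: int, stay_date: date, as_of: date) -> int:
    """Score 1-3 from exposure and how soon the stay date arrives (assumed thresholds)."""
    days_out = (stay_date - as_of).days
    exposure = rate_gap_minor * max(avail_gap, 1)  # rough minor units at risk per night
    score = 1
    if avail_gap > 0 or exposure >= 1_000:
        score = 2
    if avail_gap > 3 or (score == 2 and days_out <= 7):
        score = 3  # large gaps, or any real gap close to arrival, go to the top
    return score


print(rate_gap_bps(12_900, 13_158))                            # 200.0
print(severity(1, 0, date(2026, 7, 10), as_of=date(2026, 7, 8)))   # 3
print(severity(1, 0, date(2026, 9, 10), as_of=date(2026, 7, 8)))   # 2
```

The worked rate case: a PMS rate of 129.00 against a channel rate of 131.58 is a gap of 258 minor units, which is 258 × 10,000 / 12,900 = 200 bps.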

Step 4 — Persist state and emit alerts idempotently

The run’s output is upserted into a state table keyed on (batch_id, property_id, room_type_code, rate_plan_code, stay_date) with ON CONFLICT DO UPDATE. Re-running the same batch_id therefore refreshes rows in place instead of duplicating alerts or correction tickets. Each run logs its discrepancy count and maximum severity under the batch_id, which ties the persisted rows back to the payload hashes staged in Step 1 so the audit trail is complete.

python

def persist(conn, discrepancies: pl.DataFrame, batch_id: str) -> int:
    rows = discrepancies.to_dicts()
    for row in rows:
        conn.execute(
            """
            INSERT INTO reconciliation_state
                (batch_id, property_id, room_type_code, rate_plan_code, stay_date,
                 discrepancy, severity, detected_at)
            VALUES (%(batch_id)s, %(property_id)s, %(room_type_code)s,
                    %(rate_plan_code)s, %(stay_date)s, %(discrepancy)s, %(severity)s, now())
            ON CONFLICT (batch_id, property_id, room_type_code, rate_plan_code, stay_date)
            DO UPDATE SET discrepancy = EXCLUDED.discrepancy,
                          severity   = EXCLUDED.severity,
                          detected_at = now()
            """,
            {**row, "batch_id": batch_id},
        )
    log.info("reconciliation_persisted", batch_id=batch_id,
             discrepancies=len(rows),
             max_severity=max((r["severity"] for r in rows), default=0))
    return len(rows)

Deriving the conflict key from the business coordinates (property, room, rate plan, stay date) rather than a surrogate row id is what makes a re-run converge on the same rows instead of appending a second, contradictory alert set.
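The convergence property is easy to check locally. The sketch below uses an in-memory SQLite database purely as a stand-in for PostgreSQL (SQLite 3.24+ accepts the same ON CONFLICT ... DO UPDATE form); the trimmed column set and sample values are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE reconciliation_state (
        batch_id TEXT, property_id TEXT, room_type_code TEXT,
        rate_plan_code TEXT, stay_date TEXT,
        discrepancy TEXT, severity INTEGER,
        PRIMARY KEY (batch_id, property_id, room_type_code, rate_plan_code, stay_date)
    )
    """
)

UPSERT = """
    INSERT INTO reconciliation_state
        (batch_id, property_id, room_type_code, rate_plan_code, stay_date,
         discrepancy, severity)
    VALUES (:batch_id, :property_id, :room_type_code, :rate_plan_code, :stay_date,
            :discrepancy, :severity)
    ON CONFLICT (batch_id, property_id, room_type_code, rate_plan_code, stay_date)
    DO UPDATE SET discrepancy = excluded.discrepancy, severity = excluded.severity
"""

row = {
    "batch_id": "recon_prop_0a1b2c3d_20260710T020000Z", "property_id": "prop_0a1b2c3d",
    "room_type_code": "DLX_KING", "rate_plan_code": "BAR", "stay_date": "2026-07-10",
    "discrepancy": "INVENTORY_MISMATCH", "severity": 2,
}
conn.execute(UPSERT, row)
conn.execute(UPSERT, {**row, "severity": 3})  # re-run of the same batch with a new score

count, max_sev = conn.execute(
    "SELECT COUNT(*), MAX(severity) FROM reconciliation_state"
).fetchone()
print(count, max_sev)  # 1 3
```

The second write updates the existing row instead of adding one, which is the behaviour the production upsert relies on.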

Schema & Data Contracts

Before any record enters the Polars frame, it is validated against a canonical Pydantic v2 model. This rejects the malformations a raw OTA snapshot routinely carries — negative availability, unmapped rate plans, lower-cased channel slugs, mismatched currencies — at the boundary, so the diff engine only ever sees clean, comparable rows.

python

from datetime import date
from pydantic import BaseModel, Field, field_validator

ALLOWED_CHANNELS = {"booking_com", "expedia", "agoda", "direct"}


class ReconRecord(BaseModel):
    property_id: str = Field(pattern=r"^prop_[0-9a-f]{8}$")
    room_type_code: str
    rate_plan_code: str
    channel: str
    stay_date: date
    rate_minor: int = Field(ge=0)          # base-currency minor units (cents/pence)
    currency: str = Field(pattern=r"^[A-Z]{3}$")
    available_units: int = Field(ge=0)
    min_length_of_stay: int = Field(ge=1, default=1)
    closed_to_arrival: bool = False

    @field_validator("channel")
    @classmethod
    def known_channel(cls, v: str) -> str:
        slug = v.strip().lower()
        if slug not in ALLOWED_CHANNELS:
            raise ValueError(f"unknown channel slug: {v!r}")
        return slug

    @field_validator("room_type_code", "rate_plan_code")
    @classmethod
    def upper(cls, v: str) -> str:
        return v.strip().upper()

Storing rate_minor as an integer plus an explicit currency code — rather than a float — means the contract itself forbids the two most common silent parity bugs: rounding drift and cross-currency comparison. Serialize with record.model_dump(mode="json") when staging so date fields round-trip as ISO-8601 strings.
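Rejecting bad rows at the boundary without stopping the run can be sketched as a partition step. Assumptions: `partition_records` and the dead-letter entry shape are illustrative, and `validate` is any callable that raises ValueError on a bad record. Pydantic v2's ValidationError subclasses ValueError, so `ReconRecord.model_validate` should fit that slot; a toy validator is used here to keep the example dependency-free.

```python
from collections.abc import Callable
from typing import Any


def partition_records(records: list[dict],
                      validate: Callable[[dict], Any]) -> tuple[list[Any], list[dict]]:
    """Split records into validated rows and dead-letter entries; never raises per record."""
    valid: list[Any] = []
    dead_letter: list[dict] = []
    for record in records:
        try:
            valid.append(validate(record))
        except ValueError as exc:
            dead_letter.append({"record": record, "error": str(exc)})
    return valid, dead_letter


def toy_validate(record: dict) -> dict:
    # Stand-in for the real contract: only checks the availability bound.
    if record["available_units"] < 0:
        raise ValueError("available_units must be >= 0")
    return record


good, dead = partition_records(
    [{"available_units": 4}, {"available_units": -1}], toy_validate
)
print(len(good), len(dead))  # 1 1
```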

Error Handling & Retry Strategy

Reconciliation inherits the extraction layer’s failure classes and adds a validation class of its own. Reuse the shared taxonomy in categorizing 4xx vs 5xx sync errors rather than inventing per-endpoint logic.

Status / class	Meaning	Action
`401 Unauthorized`	Token expired mid-extraction	Trigger OAuth2 token refresh, retry the same request; keep the `batch_id`
`429 Too Many Requests`	Extraction exhausted the budget	Honor `Retry-After`, back off, and shed parallel extractors
`5xx`	Upstream snapshot endpoint unstable	Retry with jittered exponential backoff, then abort the run cleanly
Schema validation failure	Payload violates `ReconRecord`	Do not retry; route the record to a dead-letter queue and continue the run
Business-rule violation	A real discrepancy	Persist, score, and raise a correction ticket — not an error

Extraction retries use the same exponential backoff profile as the rest of the pipeline: wait_exponential(multiplier=1, min=2, max=10) capped at four attempts, bounded by the channel’s published OTA API rate limits. tenacity’s wait_exponential is deterministic, so swap in wait_random_exponential where full jitter is required. The critical distinction reconciliation must preserve: a transport failure aborts and reschedules the whole run, but a validation failure only quarantines one record. Conflating the two either loses good data or, worse, persists a partial audit as if it were complete. The batch_id is the idempotency key that makes an aborted run safe to retry wholesale, provided the retry reuses the original batch_id: the ON CONFLICT upsert then overwrites any rows the failed attempt managed to write.

Outcome classification, as encoded in the error table: only a transport-class failure (exhausted 429, 5xx, timeout) aborts the whole run, while a schema-validation failure isolates a single record to the dead-letter queue and lets the run continue.

Verification & Testing

A green process proves nothing; confirm a run by asserting on the discrepancy set, the record counts, and the log stream together.

python

import polars as pl

def test_missing_channel_row_is_flagged() -> None:
    pms = pl.DataFrame([{
        "property_id": "prop_0a1b2c3d", "room_type_code": "DLX_KING",
        "rate_plan_code": "BAR", "stay_date": "2026-07-10",
        "rate_minor": 12900, "available_units": 4,
        "min_length_of_stay": 1, "closed_to_arrival": False,
    }]).with_columns(pl.col("stay_date").str.to_date())
    # Channel snapshot is empty — the room-night was never published.
    channel = pms.clear()

    out = reconcile(pms, channel)
    assert out.height == 1
    assert out["discrepancy"][0] == "MISSING_ON_CHANNEL"

test_missing_channel_row_is_flagged()

This asserts the highest-risk case — inventory the channel never published — is caught rather than dropped by the join. In production, additionally assert that the count of persisted rows equals the count of discrepancy != OK rows in the frame (no silent write loss), that every batch_id appears exactly once per property per scheduled window, and that re-running a batch_id leaves the row count unchanged (idempotency holds). Track rows_compared, discrepancies_by_class, max_severity, and run_duration_ms as structured fields so a rising discrepancy rate alerts before it reaches booking conversion.
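The structured fields named above can be derived from the discrepancy rows in one place. A minimal sketch, assuming each discrepancy is a dict with `discrepancy` and `severity` keys (as produced by `to_dicts()` on the flagged frame); `run_metrics` and the added `discrepancy_rate` field are illustrative.

```python
from collections import Counter


def run_metrics(rows_compared: int, discrepancies: list[dict], run_duration_ms: int) -> dict:
    """Summarize a run into the fields emitted to the structured log."""
    by_class = Counter(d["discrepancy"] for d in discrepancies)
    return {
        "rows_compared": rows_compared,
        "discrepancies_by_class": dict(by_class),
        "max_severity": max((d["severity"] for d in discrepancies), default=0),
        "discrepancy_rate": len(discrepancies) / rows_compared if rows_compared else 0.0,
        "run_duration_ms": run_duration_ms,
    }


metrics = run_metrics(
    rows_compared=200,
    discrepancies=[
        {"discrepancy": "INVENTORY_MISMATCH", "severity": 2},
        {"discrepancy": "MISSING_ON_CHANNEL", "severity": 3},
        {"discrepancy": "INVENTORY_MISMATCH", "severity": 2},
    ],
    run_duration_ms=8_400,
)
print(metrics)
```

Passing the resulting dict as keyword arguments to the logger (for example `log.info("reconciliation_metrics", **metrics)`) keeps each field individually queryable.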

Troubleshooting

Every run reports thousands of rate-parity violations after a currency change

Root cause: one side is comparing gross rates against net, or a rate arrived in a different currency than the baseline. Fix: enforce rate_minor + currency in the ReconRecord contract and route cross-currency rows to the dead-letter queue instead of diffing them.

Discrepancy count explodes on rate-plan renames

Root cause: the channel side carries an alias that no longer maps to a canonical rate_plan_code. Fix: run normalization against the current rate plan taxonomy and confirm room mappings via OTA channel mapping strategies before comparison.

A property’s inventory is only partially reconciled

Root cause: the snapshot endpoint paginates across date ranges or room categories and extraction read only the first page. Fix: apply cursor traversal per parsing paginated OTA responses so the frame holds the full window.

Real changes are flagged as discrepancies because of read skew

Root cause: PMS and channel snapshots were captured minutes apart and a live rate push landed between them. Fix: capture both sides concurrently (Step 1) and, for high-velocity dates, slightly widen the relevant tolerance (rate_bps for rate pushes, avail_tolerance for inventory changes) rather than chasing transient noise.

Re-running a failed job doubles the open correction tickets

Root cause: state was inserted without an ON CONFLICT clause, the conflict key used a surrogate id, or the retry generated a fresh batch_id. Fix: key the upsert on the business coordinates plus batch_id (Step 4) and reuse the original batch_id so a re-run refreshes rows in place.

FAQ

How is batch reconciliation different from async polling?

Polling chases near-real-time deltas on a short cadence and pushes small corrections continuously; reconciliation is a scheduled, full-snapshot audit that runs off-peak and proves the entire state matches. Most production stacks run both: async polling for freshness, and batch reconciliation as the authoritative backstop that catches anything the event and polling paths missed.

When should a run be aborted versus just quarantining a record?

Abort the whole run on a transport-class failure — 5xx, exhausted 429 retries, or an extraction timeout — because a partial snapshot would produce a misleading audit. Quarantine a single record (route it to a dead-letter queue and continue) only when it fails schema validation. The batch_id and ON CONFLICT upsert make a wholesale re-run safe.

Why Polars instead of pandas for the diff?

Reconciliation joins tens of thousands of room-night rows per property portfolio. Polars runs the join and the vectorized discrepancy expressions on multiple threads, typically with a far smaller memory footprint than pandas, and building the same pipeline on a LazyFrame lets the query optimizer plan it as a single pass. That keeps a multi-property nightly run inside its off-peak window.

What tolerance should I set for inventory and rate parity?

Start at zero for rates (any published deviation beyond your contracted margin is a real parity risk) and a small non-zero avail_tolerance for inventory to absorb legitimate housekeeping holds and overbooking buffers. Tune from the discrepancy rate: if a class of noise dominates every run, it is usually a normalization gap, not a tolerance that is too tight.

How does polling hand a property off to reconciliation?

When accumulated drift exceeds the polling tolerance — typically after a throttling window where many changes piled up — emitting a burst of individual corrections risks tripping OTA parity-compliance flags. The polling engine instead escalates the whole property to the next reconciliation run, which resyncs it atomically against PMS truth.

Building Batch Reconciliation Scripts for Daily Syncs — sharding, caching, and parallelizing this workflow across a multi-property portfolio
Async Polling for Inventory Updates — the near-real-time delta path that escalates to reconciliation on excess drift
Categorizing 4xx vs 5xx Sync Errors — the shared retry taxonomy behind the error table above
Handling OTA API Rate Limits — the request budget that caps how aggressively extraction can pull
Data Schema Standardization — the canonical payload shape reconciliation normalizes both sides into
