Automating Channel Manager Token Renewal

Channel manager integrations fail silently when an OAuth2 access token expires mid-sync: a single stale bearer token during a rate push triggers parity violations, downstream overbookings, and manual reconciliation across every connected OTA. This page is the build guide for the background worker that eliminates that failure class: a scheduled process that proactively rotates credentials for every property_id-and-channel pair before any of them expires, so the sync engine never discovers expiry at dispatch time. It sits under OAuth2 token refresh strategies, which specifies the deterministic pre-flight-and-lock refresh design that this worker applies across a whole portfolio.

Prerequisites & environment

The worker runs decoupled from the rate push scheduler and owns exactly one responsibility: keep a valid access_token in the shared store for each property-and-channel pair. Pin these versions so the async, retry, and validation behaviour below is reproducible:

Python 3.11+ — for asyncio.TaskGroup and exception groups when fanning out across many properties.
httpx 0.27+ — async client with granular connect/read/write timeouts for the token endpoint.
redis 5.x (asyncio interface) — atomic per-key state persistence and the distributed lock that serializes concurrent refreshes.
tenacity 8.x — declarative exponential backoff scoped to retryable transport failures only.
pydantic 2.6+ — the TokenState contract (v2 API: field_validator, model_dump) that validates every credential record at the store boundary.
structlog 24.x — key=value telemetry for audit-grade token lifecycle logging.

On the access side you need, per property-and-channel pair, a stored refresh_token, the provider’s token endpoint URL, and the client_id / client_secret issued during the initial grant covered in implementing OAuth2 for PMS API access. The client credentials belong in a vault (HashiCorp Vault, AWS Secrets Manager, or an encrypted config), never in the Redis state key. Redis holds only the rotating token state: the access token, its expiry, and the current refresh token. Because that record includes the refresh token, restrict access to the Redis instance accordingly.

Step-by-step implementation

The worker has four parts: a validated state model, an atomic store, the refresh call itself, and a fan-out loop that sweeps the portfolio on a fixed cadence.

Step 1 — Model the token state with a proactive skew window

Represent each credential record as a Pydantic v2 model so malformed writes are rejected at the boundary rather than surfacing as a mysterious 401 three hops downstream. The model carries its own “should I refresh yet” decision.

python

import time
from pydantic import BaseModel, Field, field_validator

# Renew this many seconds before the provider-stated expiry. Covers clock skew
# between the worker host and the OTA, plus serialization/network jitter on the
# renewal round-trip itself — never wait for the token to actually expire.
RENEWAL_SKEW_SECONDS = 300

class TokenState(BaseModel):
    property_id: str = Field(pattern=r"^PROP_\d+$")
    channel: str  # OTA slug: "booking_com", "expedia", "agoda"
    access_token: str
    refresh_token: str
    expires_at: int  # absolute epoch seconds, not a relative "expires_in"

    @field_validator("channel")
    @classmethod
    def known_channel(cls, v: str) -> str:
        allowed = {"booking_com", "expedia", "agoda", "hostelworld"}
        if v not in allowed:
            raise ValueError(f"unmapped channel slug: {v}")
        return v

    def needs_refresh(self, now: int | None = None) -> bool:
        # Compare against None explicitly so an injected now=0 is not mistaken for "unset".
        if now is None:
            now = int(time.time())
        return now >= (self.expires_at - RENEWAL_SKEW_SECONDS)

Storing an absolute expires_at epoch rather than the provider’s relative expires_in is the non-obvious choice: a relative value is only meaningful at the instant of the response, so persisting it forces every later reader to remember when the token was issued — an absolute epoch is comparable directly against time.time() from any process.

Step 2 — Persist state atomically, keyed per pair

Give every property-and-channel pair its own Redis key so a refresh for PROP_8842 on booking_com never contends with expedia. Writes go through a transactional pipeline so a concurrent reader never observes a half-written record.

python

import json
import redis.asyncio as redis

class TokenStore:
    def __init__(self, redis_url: str):
        self.redis = redis.from_url(redis_url, decode_responses=True)

    def _key(self, property_id: str, channel: str) -> str:
        return f"cm:token:{property_id}:{channel}"

    async def load(self, property_id: str, channel: str) -> TokenState | None:
        raw = await self.redis.get(self._key(property_id, channel))
        return TokenState.model_validate_json(raw) if raw else None

    async def save(self, state: TokenState) -> None:
        key = self._key(state.property_id, state.channel)
        # Pipeline the write so a reader mid-refresh sees either the old record
        # or the new one — never a partially serialized token.
        async with self.redis.pipeline(transaction=True) as pipe:
            pipe.set(key, json.dumps(state.model_dump()))
            await pipe.execute()

model_validate_json on load means a record hand-edited during an incident, or written by an older build with a different shape, is rejected loudly at read time instead of poisoning a rate push with an invalid channel slug.

Step 3 — Refresh under a lock, retrying only transport failures

The refresh call must distinguish failures that a retry can fix from failures that a retry only makes worse. Transient timeouts and 5xx responses warrant exponential backoff; a 400 invalid_grant means the refresh token is revoked or already consumed, and retrying it just burns your OTA rate-limit budget on a request that can never succeed. This split mirrors the shared taxonomy in categorizing 4xx vs 5xx sync errors.
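Before the full implementation, the retryable-versus-terminal split can be sketched as a pure function over the status code. This is an illustration under stated assumptions: Disposition and classify_refresh_response are hypothetical names, treating 429 as retryable is an assumption, and so is the conservative default of treating any other 4xx as terminal.

```python
from enum import Enum


class Disposition(Enum):
    SUCCESS = "success"
    RETRY = "retry"        # transient: back off and try again
    TERMINAL = "terminal"  # credential is dead: alert, do not retry


def classify_refresh_response(status_code: int) -> Disposition:
    if 200 <= status_code < 300:
        return Disposition.SUCCESS
    if status_code in (400, 401, 403):
        # invalid_grant and friends: resubmitting the same token cannot succeed.
        return Disposition.TERMINAL
    if status_code >= 500 or status_code == 429:
        # Server-side trouble or throttling (429 is an assumption here).
        return Disposition.RETRY
    # Assumption: any other 4xx is treated as terminal rather than retried blindly.
    return Disposition.TERMINAL


for code in (200, 400, 429, 503):
    print(code, classify_refresh_response(code).value)
```

Keeping the decision in one small function makes it easy to unit-test the taxonomy separately from the HTTP client.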

python

import httpx
import structlog
from tenacity import (
    retry, stop_after_attempt, wait_exponential, retry_if_exception_type,
)

log = structlog.get_logger()

class TerminalAuthError(Exception):
    """Refresh token is dead — halt and alert, never retry."""

class TokenRenewer:
    def __init__(self, store: TokenStore, endpoint: str, client_id: str, client_secret: str):
        self.store, self.endpoint = store, endpoint
        self.client_id, self.client_secret = client_id, client_secret

    @retry(
        retry=retry_if_exception_type((httpx.ConnectTimeout, httpx.ReadTimeout, httpx.HTTPStatusError)),
        wait=wait_exponential(multiplier=1, min=2, max=10),
        stop=stop_after_attempt(3),
        reraise=True,
    )
    async def _call_endpoint(self, refresh_token: str) -> dict:
        timeout = httpx.Timeout(connect=5.0, read=30.0, write=10.0, pool=10.0)
        async with httpx.AsyncClient(timeout=timeout) as client:
            resp = await client.post(self.endpoint, data={
                "grant_type": "refresh_token",
                "refresh_token": refresh_token,
                "client_id": self.client_id,
                "client_secret": self.client_secret,
            })
        if resp.status_code in (400, 401, 403):
            # Terminal: raise a type tenacity does NOT retry, so we fail fast.
            raise TerminalAuthError(resp.json().get("error", "invalid_grant"))
        resp.raise_for_status()  # 5xx -> HTTPStatusError -> retried by tenacity
        return resp.json()

    async def renew(self, state: TokenState) -> TokenState:
        # Serialize refresh across workers: a single-use refresh token submitted
        # twice returns invalid_grant and voids the chain. See the parent
        # workflow page for the full distributed-lock design.
        # The lock TTL has to outlast the worst-case retry budget of
        # _call_endpoint (three attempts plus backoff); otherwise it can lapse
        # mid-refresh and let a second worker submit the same token.
        lock = self.store.redis.lock(f"cm:lock:{state.property_id}:{state.channel}", timeout=180)
        async with lock:
            fresh = await self.store.load(state.property_id, state.channel)
            if fresh and not fresh.needs_refresh():
                return fresh  # another worker already rotated it while we waited
            # Prefer the record re-read under the lock: the caller's copy may
            # hold a refresh token that a peer has already consumed.
            current = fresh or state
            data = await self._call_endpoint(current.refresh_token)
            new_state = TokenState(
                property_id=state.property_id,
                channel=state.channel,
                access_token=data["access_token"],
                # Providers may rotate the refresh token; fall back to the old one if not.
                refresh_token=data.get("refresh_token", current.refresh_token),
                expires_at=int(time.time()) + data.get("expires_in", 3600),
            )
            await self.store.save(new_state)
            log.info("token_renewed", property_id=state.property_id,
                     channel=state.channel, expires_at=new_state.expires_at,
                     rotated=bool(data.get("refresh_token")))
            return new_state

Re-reading state after acquiring the lock is the pattern that prevents a thundering herd: when several workers all notice the same token expiring, only the first refreshes; the rest wait on the lock, re-read the freshly rotated token, and skip the endpoint call entirely.
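The re-read-under-lock pattern can be demonstrated without Redis. The sketch below substitutes an in-process asyncio.Lock for the distributed lock and a counter for the token endpoint; FakeTokenStore and renew_once are hypothetical names used only for illustration.

```python
import asyncio


class FakeTokenStore:
    """In-memory stand-in for the Redis record and the token endpoint."""

    def __init__(self) -> None:
        self.version = 0          # stands in for the stored token
        self.endpoint_calls = 0   # how often the fake endpoint was hit
        self.lock = asyncio.Lock()

    async def call_endpoint(self) -> int:
        self.endpoint_calls += 1
        await asyncio.sleep(0.01)  # simulate the network round-trip
        return self.version + 1


async def renew_once(store: FakeTokenStore, seen_version: int) -> int:
    async with store.lock:
        # Re-read under the lock: if the version moved, a peer already renewed.
        if store.version != seen_version:
            return store.version
        store.version = await store.call_endpoint()
        return store.version


async def main() -> tuple[FakeTokenStore, list[int]]:
    store = FakeTokenStore()
    # Five workers all observed version 0 as "about to expire".
    results = await asyncio.gather(*(renew_once(store, 0) for _ in range(5)))
    return store, list(results)


store, results = asyncio.run(main())
print(store.endpoint_calls, results)  # 1 [1, 1, 1, 1, 1]
```

Without the version check inside the lock, each of the five workers would call the endpoint in turn, which is exactly the double submit that voids a single-use refresh token.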

Step 4 — Fan out across the portfolio on a fixed cadence

The sweep loop checks every pair, refreshing only those inside the skew window. The TaskGroup itself is unbounded, so an asyncio.Semaphore caps concurrency and keeps a portfolio of hundreds of pairs from opening hundreds of simultaneous connections to the token endpoint.

python

import asyncio

async def sweep(renewer: TokenRenewer, pairs: list[tuple[str, str]]) -> None:
    sem = asyncio.Semaphore(8)  # cap concurrent token-endpoint calls

    async def maybe_renew(property_id: str, channel: str) -> None:
        state = await renewer.store.load(property_id, channel)
        if state is None or not state.needs_refresh():
            return
        async with sem:
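            try:
                await renewer.renew(state)
            except TerminalAuthError as exc:
                # Do not crash the sweep: one dead credential must not stall the rest.
                log.error("token_terminal", property_id=property_id,
                          channel=channel, error=str(exc))
            except httpx.HTTPError as exc:
                # Retries exhausted on a transient failure. tenacity re-raises it
                # (reraise=True), so contain it here and let the next sweep retry.
                log.warning("token_renew_failed", property_id=property_id,
                            channel=channel, error=type(exc).__name__)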

    async with asyncio.TaskGroup() as tg:
        for property_id, channel in pairs:
            tg.create_task(maybe_renew(property_id, channel))

async def run_forever(renewer: TokenRenewer, pairs, interval_seconds: int = 60) -> None:
    while True:
        await sweep(renewer, pairs)
        await asyncio.sleep(interval_seconds)

Catching TerminalAuthError inside maybe_renew rather than letting it propagate is deliberate: a TaskGroup cancels its siblings on the first unhandled exception, so an un-caught terminal error on one property would abandon renewal for every other property in the same sweep.

Gotchas & production notes

Single-use refresh tokens void the chain on double submit. Most OTA providers rotate the refresh token on every renewal, which RFC 6749 §6 permits but does not require. If two workers submit the same one-use token, the second gets invalid_grant and the credential chain is dead until a human re-authorizes the property. The per-pair lock plus the re-read in Step 3 is what makes this safe, so do not remove it as an “optimization”.
Clock skew, not expires_in, is what actually expires you. If the worker host’s clock falls even a minute behind the OTA’s, a token you believe is valid may already be rejected; a clock that runs ahead only makes the worker renew early. The 300-second RENEWAL_SKEW_SECONDS absorbs this, but you should still run NTP on the worker host and alert on drift rather than widening the window indefinitely.
Never log the token bodies. The structlog events above emit expires_at and a rotated flag but never the access_token or refresh_token themselves — hospitality stacks operate under PCI-adjacent governance, and a leaked bearer token in a log aggregator is a reportable incident.
Renewal is not a substitute for a fallback. When a credential goes terminal, the rate push engine still needs somewhere to go; wire the alert into the fallback routing for PMS outages path so that channel is quarantined rather than retried blindly against a dead token.
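As a safety net for the logging rule above, a redaction step can scrub token fields even if a later change accidentally logs them. The sketch below is a plain function written with the (logger, method_name, event_dict) signature that structlog processors use; redact_tokens and SENSITIVE_KEYS are hypothetical names, and placing it in the processors list passed to structlog.configure is an assumption about how you would wire it in.

```python
from typing import Any

# Assumed set of field names that must never reach a log sink.
SENSITIVE_KEYS = {"access_token", "refresh_token", "client_secret"}


def redact_tokens(logger: Any, method_name: str, event_dict: dict[str, Any]) -> dict[str, Any]:
    """Replace sensitive values with a fixed marker before the event is rendered."""
    for key in SENSITIVE_KEYS & event_dict.keys():
        event_dict[key] = "[REDACTED]"
    return event_dict


event = {
    "event": "token_renewed",
    "property_id": "PROP_8842",
    "access_token": "not-a-real-token",  # logged by mistake in this example
    "expires_at": 1_900_003_600,
}
print(redact_tokens(None, "info", event))
```

Non-sensitive fields such as property_id and expires_at pass through untouched, so the audit trail stays intact.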

Verification snippet

Prove the skew-window logic and the store round-trip before trusting the worker with live credentials. This asserts that a token inside the window is flagged, one outside it is not, and that a saved record survives serialization unchanged.

python

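def test_needs_refresh_honours_skew_window() -> None:
    now = 1_900_000_000
    fresh = TokenState(property_id="PROP_8842", channel="booking_com",
                       access_token="a", refresh_token="r",
                       expires_at=now + RENEWAL_SKEW_SECONDS + 60)
    # Exactly on the edge of the window: needs_refresh uses >=, so this renews.
    boundary = fresh.model_copy(update={"expires_at": now + RENEWAL_SKEW_SECONDS})
    stale = fresh.model_copy(update={"expires_at": now + RENEWAL_SKEW_SECONDS - 1})
    assert fresh.needs_refresh(now) is False
    assert boundary.needs_refresh(now) is True
    assert stale.needs_refresh(now) is True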

def test_state_survives_a_store_round_trip() -> None:
    s = TokenState(property_id="PROP_8842", channel="expedia",
                   access_token="a", refresh_token="r", expires_at=1_900_000_000)
    assert TokenState.model_validate_json(json.dumps(s.model_dump())) == s

test_needs_refresh_honours_skew_window()
test_state_survives_a_store_round_trip()

Testing the boundary at exactly expires_at - RENEWAL_SKEW_SECONDS is the assertion that matters most: an off-by-one here is the difference between renewing one second early (harmless) and one second late (a live token expiring mid-push). In production, also assert that a 400 response raises TerminalAuthError and that renew is never called twice for a token another worker already rotated.

OAuth2 Token Refresh Strategies — the parent workflow: the deterministic pre-flight-and-lock refresh design this worker operationalizes
Implementing OAuth2 for PMS API Access — the initial grant flow that issues the refresh_token this worker rotates
Handling OTA API Rate Limits — sizing the concurrency cap so the token endpoint never becomes the thing that gets you throttled
Categorizing 4xx vs 5xx Sync Errors — the shared taxonomy behind the retryable-vs-terminal split in Step 3
Designing Fallback Routes for PMS Outages — where a channel goes when a credential goes terminal
