Research: Retry Policy Configuration (ID-010)

Date: 2026-03-05
Backlog items: ID-010 (Retry policy configuration)
Status: Research complete, ready for design decisions

1. Problem Statement

The SFTP backend has hardcoded retry logic (3 attempts, 2–10 s exponential backoff via tenacity). S3 and Azure rely on their SDK's built-in retry. There is no unified retry surface — each backend does (or doesn't do) its own thing, and users have zero knobs to tune retry behavior through remote-store.

The backlog item states:

SFTP has hardcoded retry logic (3 attempts, 2–10 s backoff via tenacity). Expose a RetryPolicy dataclass in BackendConfig.options so users can tune attempts, backoff, and jitter per-backend.

Why this matters

Production workloads need tunable retries. Batch jobs want aggressive retry (10 attempts, long backoff). Request-serving code wants fast failure (2 attempts, short backoff). One-size-fits-all is wrong for both.
Flaky networks are common. SFTP over WAN, S3 behind a VPN, Azure in cross-region setups — transient failures are routine, not exceptional.
Rate limiting. Cloud backends throttle (S3 503 SlowDown, Azure 429). Users need backoff tuning to avoid hammering a throttled endpoint.
Observability. Users want to know that retries happened and why, not discover them via unexplained latency spikes.

Design constraints

Core package has zero runtime dependencies (dependencies = []).
tenacity>=4.0 is an optional dependency (part of the sftp extra).
BackendConfig.options is a dict[str, object] — the natural injection point for retry configuration.
S3 and Azure SDKs have their own built-in retry mechanisms — a naive tenacity wrapper on top creates "retry multiplication."

2. Current State: How Each Backend Handles Retry

2.1 SFTP: Explicit tenacity on connect only

File: src/remote_store/backends/_sftp.py — _connect() method (lines 519–566)

@retry(
    retry=retry_if_exception_type((paramiko.SSHException, OSError, EOFError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    before_sleep=before_sleep_log(log, logging.WARNING),
    reraise=True,
)
def _do_connect() -> None:
    ssh.connect(...)

Scope: Connection establishment only. Individual file operations (read, write, delete, list) are NOT retried.
Hardcoded: 3 attempts, 2–10 s exponential backoff, no jitter.
Spec: SFTP-009 (Tenacity Retry on Connect).
Audit note: M-17 flags that this retry logic is untested.
Liveness check: The _sftp property also calls stat('.') to detect stale connections and auto-reconnects. This is recovery, not retry.

2.2 S3: Relies on botocore built-in retry

File: src/remote_store/backends/_s3.py

s3fs delegates to botocore, which has its own retry layer:

Default mode: legacy (5 attempts, exponential backoff).
Standard mode: Configurable via botocore.config.Config(retries={"max_attempts": N, "mode": "standard"}).
Adaptive mode: Dynamic token-bucket rate limiting.
Retried errors: Throttling (503 SlowDown, 429), transient errors (500, 502, 503, 504), connection errors, timeout errors.

Users can already pass retry config through client_options:

import botocore.config

S3Backend(
    bucket="b",
    client_options={
        "client_kwargs": {
            "config": botocore.config.Config(retries={"max_attempts": 10, "mode": "adaptive"})
        }
    },
)

But this is undiscoverable, botocore-specific, and doesn't help with non-botocore errors (e.g., s3fs's own network handling).

Additionally, s3fs has a module-level retries parameter (default 5) and functions s3fs.add_retryable_error() / s3fs.set_custom_error_handler() for registering retryable exception types. These are global — not per-instance.

2.3 Azure: Relies on Azure SDK built-in retry

File: src/remote_store/backends/_azure.py

Azure Storage SDK uses HTTP pipeline policies for retry:

Default: ExponentialRetry(initial_backoff=15, increment_base=3, retry_total=3, random_jitter_range=3).
Alternative: LinearRetry(backoff=15, retry_total=3, random_jitter_range=3).
Retried errors: HTTP 408, 429, 500, 502, 503, 504, connection errors.

Users can pass retry config through client_options:

from azure.storage.blob import ExponentialRetry
AzureBackend(
    container="c",
    account_name="a",
    client_options={"retry_policy": ExponentialRetry(retry_total=10)},
)

Again, undiscoverable and SDK-specific.

2.4 S3-PyArrow: Hybrid, partially configurable

File: src/remote_store/backends/_s3_pyarrow.py

Data path (PyArrow C++ S3): PyArrow exposes retry_strategy on S3FileSystem with two strategy classes:
AwsStandardS3RetryStrategy(max_attempts=3): default, exponential backoff, broad error coverage (recommended).
AwsDefaultS3RetryStrategy(max_attempts=N): legacy, narrower coverage.
The only configurable knob is max_attempts. Backoff timing, jitter, and error classification are not configurable through PyArrow's API. Related timeout parameters: request_timeout, connect_timeout.
Control path (s3fs): Same as S3 backend above.

2.5 Local / Memory: No retry (correct)

Local filesystem and in-memory operations don't have transient failures worth retrying. No retry needed.

Summary

Backend	Retry mechanism	Scope	User-configurable?
SFTP	tenacity (hardcoded)	Connect only	No
S3	botocore built-in	All SDK calls	Yes, but buried in `client_options`
Azure	Azure SDK policies	All SDK calls	Yes, but buried in `client_options`
S3-PyArrow	C++ internal + botocore	Partial	Minimal
Local	None	N/A	N/A
Memory	None	N/A	N/A

3. Survey: How the Python Ecosystem Handles Retry

3.1 google-api-core `Retry` (gold standard)

Retry(
    predicate=if_transient_error,  # Callable[[Exception], bool]
    initial=1.0,                   # initial delay (seconds)
    maximum=60.0,                  # max delay cap (seconds)
    multiplier=2.0,                # exponential multiplier
    timeout=120.0,                 # total retry window (seconds)
    on_error=None,                 # callback on each error
)

Key decisions:

Uses timeout (total wall-clock window) rather than max_attempts.
Predicate-based retryability: if_exception_type(Exc1, Exc2) builds predicates.
Immutable with with_XXX builders: retry.with_timeout(500) returns a new instance.
ConditionalRetryPolicy (provided by google-cloud-storage on top of this Retry class) activates retry only for idempotent operations.

Strengths: Most thoughtful design. Predicate composability is powerful.
Weaknesses: Timeout-only (no attempt count) is confusing for some users.
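
The predicate idea can be illustrated with a small stdlib-only sketch. This is not google-api-core's code: `if_exception_type` is re-created locally to mirror the helper named above, and `any_of` is a hypothetical combinator added purely for illustration.

```python
from __future__ import annotations

from typing import Callable

Predicate = Callable[[BaseException], bool]


def if_exception_type(*exc_types: type[BaseException]) -> Predicate:
    """Build a predicate that accepts instances of the given exception types."""

    def predicate(exc: BaseException) -> bool:
        return isinstance(exc, exc_types)

    return predicate


def any_of(*predicates: Predicate) -> Predicate:
    """Hypothetical combinator: retryable if any sub-predicate accepts the error."""

    def predicate(exc: BaseException) -> bool:
        return any(p(exc) for p in predicates)

    return predicate


# Compose two narrow predicates into one retryability check.
is_transient = any_of(
    if_exception_type(ConnectionError),
    if_exception_type(TimeoutError, EOFError),
)
```

Because a predicate is just a callable, a backend could ship a default one and still let callers swap in their own without subclassing anything.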

3.2 urllib3 `Retry` (most widely used)

Retry(
    total=10,              # total retries
    backoff_factor=0.5,    # exponential backoff multiplier
    backoff_max=120,       # backoff ceiling (seconds)
    backoff_jitter=0.0,    # random jitter
    status_forcelist=None, # HTTP status codes to retry
)

Backoff formula: backoff_factor * (2 ** num_previous_retries) + uniform(0, backoff_jitter)

Strengths: Battle-tested, explicit jitter, attempt-based.
Weaknesses: HTTP-specific knobs (status codes, methods) don't map to storage operations.
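
To make the backoff formula concrete, the sketch below evaluates it as written in this section, with the ceiling from backoff_max applied. It is a hand-rolled illustration, not urllib3's implementation, which may differ in detail (for example in how the first retry is counted); `backoff_delay` is a hypothetical name.

```python
import random


def backoff_delay(
    num_previous_retries: int,
    backoff_factor: float = 0.5,
    backoff_max: float = 120.0,
    backoff_jitter: float = 0.0,
) -> float:
    """Delay in seconds per the formula above, clamped to [0, backoff_max]."""
    delay = backoff_factor * (2 ** num_previous_retries)
    if backoff_jitter:
        delay += random.uniform(0.0, backoff_jitter)
    return max(0.0, min(backoff_max, delay))


# With jitter disabled the schedule is deterministic:
# 0.5, 1, 2, 4, ... doubling until it reaches the 120 s ceiling.
schedule = [backoff_delay(n) for n in range(10)]
```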

3.3 obstore `RetryConfig` (closest analog)

RetryConfig(
    max_retries=10,
    retry_timeout="60s",          # total wall-clock cap
    backoff={"init_backoff": "1s", "max_backoff": "30s", "base": 2.0},
)

Of the libraries surveyed here, the only multi-backend storage library with a unified retry config object. Backed by the Rust object_store crate.

Strengths: Simple, storage-specific, proven in production.
Weaknesses: No predicate customization, no jitter knob.

3.4 tenacity (our existing dependency)

retry(
    retry=retry_if_exception_type(OSError),
    stop=stop_after_attempt(3) | stop_after_delay(60),
    wait=wait_exponential(multiplier=1, min=2, max=10) + wait_random(0, 2),
    before_sleep=before_sleep_log(log, logging.WARNING),
    reraise=True,
)

Strengths: Maximum composability, with combinable stop/wait/retry conditions.
Weaknesses: Not a config object but a decorator API. Hard to serialize/deserialize for config files.

3.5 fsspec ecosystem (fragmented)

s3fs: Module-level retries parameter (default 5), add_retryable_error() for exception types. No per-instance config.
adlfs: Delegates to Azure SDK.
sshfs: No retry.
No shared retry abstraction across fsspec.

Summary table

Library	Type	Limit style	Backoff params	Jitter
google-api-core	Class	timeout (seconds)	initial, maximum, multiplier	Internal
urllib3	Class	total (count)	backoff_factor, backoff_max	Explicit
obstore	Dict	max_retries + retry_timeout	init_backoff, max_backoff, base	Internal
tenacity	Decorator	stop_after_attempt / stop_after_delay	wait_exponential(multiplier, min, max)	Additive
s3fs	Module-level	retries (count)	Fixed	None

4. Design Space: Where Should Retry Live?

The central tension: S3 and Azure already retry internally. Adding a retry layer on top creates multiplication. Four approaches are considered below.

4.1 Option A: Unified tenacity layer at Store level

Wrap every Store method call with a configurable tenacity retry decorator. The Store becomes the retry boundary.

store = Store(backend, root_path="data")
store.retry_policy = RetryPolicy(max_attempts=5, backoff_base=2.0, backoff_max=30.0)
# Every store.read(), store.write(), etc. is retried on transient errors

Pros:

Single retry config for all backends.
Users don't need to think about per-backend differences.
Works with any backend, including future ones.

Cons:

Retry multiplication: with botocore's default 5 attempts under a 5-attempt Store policy, one call can make 5 × 5 = 25 actual attempts. Azure's default retry_total=3 allows 4 SDK attempts, so 4 × 5 = 20 attempts. This is wasteful and can cause extremely long waits.
Wrong error types: Store methods raise remote-store errors (NotFound, BackendUnavailable, etc.), not SDK exceptions. Retrying on NotFound makes no sense. The retryable-error predicate must be carefully curated.
Conflates levels: Connection retry (SFTP) vs operation retry (S3 503) vs application retry (idempotency) are different concerns.
Adds tenacity as a core dependency (currently it's SFTP-only optional).
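
The multiplication effect is plain arithmetic, as the sketch below shows. `worst_case_attempts` is a hypothetical helper for illustration, not part of remote-store.

```python
import math


def worst_case_attempts(*attempts_per_layer: int) -> int:
    """Worst-case number of real calls when retry layers are nested.

    Each outer attempt can trigger a full cycle of inner attempts,
    so the per-layer attempt counts multiply.
    """
    if any(n < 1 for n in attempts_per_layer):
        raise ValueError("every layer makes at least one attempt")
    return math.prod(attempts_per_layer)


# botocore's 5 attempts nested under a 5-attempt Store-level policy
print(worst_case_attempts(5, 5))  # 25
```

The same multiplication applies to sleep time: every outer attempt pays for a full inner backoff cycle before the outer delay even starts.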

4.2 Option B: Per-backend native retry configuration

Expose each SDK's native retry configuration through a unified RetryPolicy dataclass that maps to backend-specific settings.

policy = RetryPolicy(max_attempts=5, backoff_base=2.0, backoff_max=30.0, jitter=1.0)

# S3: maps to botocore Config(retries={"total_max_attempts": 5, "mode": "standard"})
# (botocore's "max_attempts" key counts retries only, excluding the initial request)
S3Backend(bucket="b", retry=policy)

# Azure: maps to ExponentialRetry(retry_total=4, initial_backoff=2, increment_base=2)
# (retry_total counts retries, so 5 attempts means 4 retries)
AzureBackend(container="c", retry=policy)

# SFTP: maps to tenacity @retry(stop=stop_after_attempt(5), wait=wait_exponential(...))
SFTPBackend(host="h", retry=policy)

Pros:

No retry multiplication: the policy replaces SDK defaults and doesn't stack on top.
Each backend interprets the policy in the most efficient way for its SDK.
Clean separation: retry stays at the transport level where it belongs.

Cons:

Lossy mapping: a RetryPolicy can't express everything botocore or Azure SDK supports (status code lists, conditional retry, adaptive mode).
PyArrow C++ S3 retry is barely configurable, so the mapping is very limited.
Users who need full SDK control still use client_options (which already works today).
Different backends may interpret the same policy slightly differently.

4.3 Option C: Retry-aware Store middleware (ext layer)

Instead of building retry into backends, offer it as an observable middleware in ext/, similar to how ext.observe wraps Store with hooks.

from remote_store.ext.retry import RetryPolicy, retry_store

policy = RetryPolicy(max_attempts=3, backoff_base=1.0, backoff_max=30.0)
store = retry_store(base_store, policy=policy)
# store.write() now retries on BackendUnavailable

Under the hood, retry_store() returns a RetryStore(Store) proxy that wraps each method with tenacity retry, filtering on retryable error types (BackendUnavailable, PermissionDenied on rate-limit, etc.).

Pros:

Zero change to backends or Store: purely additive.
Composable with ext.observe: observe(retry_store(base)).
Explicit opt-in: users who want retry get it; others don't pay for it.
Can be combined with SDK-level retry intentionally: the ext layer catches errors that escape the SDK retry.
tenacity stays optional (part of an ext extra, not core).

Cons:

Two retry layers (SDK + middleware) are harder to reason about.
Retry at the Store level means the full method re-executes, including path validation, logging, etc. (minor overhead).
read() returns BinaryIO, so retrying a streaming read is tricky (must discard the partial stream and re-open).
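
What wrapping a single Store call might look like can be sketched without tenacity. This is a stdlib-only approximation under stated assumptions: the `BackendUnavailable` class below is a local stand-in for the library's error type, and `call_with_retry` and its `sleep` parameter are hypothetical names. The proposal above would use tenacity instead of a hand-written loop.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class BackendUnavailable(Exception):
    """Local stand-in for the library's retryable error type (assumed shape)."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    backoff_base: float = 1.0,
    backoff_max: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying only when it raises BackendUnavailable."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except BackendUnavailable:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            # Exponential backoff, capped at backoff_max.
            sleep(min(backoff_base * (2 ** attempt), backoff_max))
    # Only reachable when the loop body never ran.
    raise ValueError("max_attempts must be at least 1")
```

Any other exception propagates on the first attempt. For a streaming read, `operation` would have to reopen the stream from the start on every attempt, which is the tricky case noted in the cons.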

4.4 Option D: Hybrid (backend-native defaults + Store-level override)

Combine B and C: backends configure their SDK's native retry from the policy (eliminating SDK defaults), and the Store middleware handles cross-cutting retry (e.g., reconnect-and-retry for SFTP connection drops mid-operation).

Pros: Theoretically cleanest, since each layer does what it's best at.
Cons: Most complex. Two places to configure retry. Hard to explain.

5. Recommendation

Primary: Option B (per-backend native retry configuration)

This is the most natural fit for the library's architecture:

Backends own their transport — retry is a transport concern.
No retry multiplication — the policy replaces SDK defaults.
Minimal API surface — one dataclass, one constructor parameter.
No new core dependencies — tenacity stays in the SFTP extra.

Secondary consideration: Option C as a future extension

A Store-level retry middleware in ext/ could handle higher-level retry (e.g., "reconnect SFTP and retry the operation" or "retry a write that failed mid-stream"). This is orthogonal to backend-level retry and can be added later without changing the backend-level design.

Not recommended: Option A

Store-level retry as the primary mechanism causes retry multiplication and is at the wrong abstraction level. Rejected.

6. Proposed `RetryPolicy` Dataclass

@dataclasses.dataclass(frozen=True)
class RetryPolicy:
    """Retry configuration for transient backend errors.

    Backends map these parameters to their native retry mechanisms.
    Backends that don't support a parameter silently ignore it.
    """

    max_attempts: int = 3
    """Maximum number of attempts (including the initial attempt).
    Set to 1 to disable retry."""

    backoff_base: float = 1.0
    """Base delay in seconds for exponential backoff.
    Delay = backoff_base * (2 ** attempt) capped at backoff_max."""

    backoff_max: float = 60.0
    """Maximum delay between retries in seconds."""

    jitter: float = 1.0
    """Maximum random jitter added to each delay in seconds.
    Set to 0.0 to disable jitter."""

    timeout: float | None = None
    """Total wall-clock timeout in seconds for all attempts combined.
    None means no total timeout (only max_attempts limits retries)."""
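
The delay schedule implied by these fields can be sketched as follows. The block restates the proposed dataclass so it runs on its own. Three points are assumptions rather than settled design: `attempt` is counted from zero, jitter is added after the cap, and `delay_schedule` is a hypothetical helper. The `timeout` field is not modelled.

```python
from __future__ import annotations

import dataclasses
import random


@dataclasses.dataclass(frozen=True)
class RetryPolicy:
    """Copy of the proposed dataclass, repeated so the sketch is self-contained."""

    max_attempts: int = 3
    backoff_base: float = 1.0
    backoff_max: float = 60.0
    jitter: float = 1.0
    timeout: float | None = None


def delay_schedule(policy: RetryPolicy, rng: random.Random | None = None) -> list[float]:
    """Sleep durations between attempts; there is one fewer than max_attempts."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(policy.max_attempts - 1):  # assumption: zero-based
        capped = min(policy.backoff_base * (2 ** attempt), policy.backoff_max)
        delays.append(capped + rng.uniform(0.0, policy.jitter))
    return delays


# Jitter disabled gives a deterministic schedule: 2, 4, 8, then capped at 10.
no_jitter = RetryPolicy(max_attempts=5, backoff_base=2.0, backoff_max=10.0, jitter=0.0)
print(delay_schedule(no_jitter))  # [2.0, 4.0, 8.0, 10.0]
```

Under these assumptions the default RetryPolicy() sleeps twice, roughly 1 to 2 s and then 2 to 3 s, so backoff_max only matters for policies with many attempts.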

Mapping to each backend

Parameter	SFTP (tenacity)	S3 (botocore)	Azure SDK	S3-PyArrow
`max_attempts`	`stop_after_attempt(N)`	`Config(retries={"total_max_attempts": N})`²	`retry_total=N-1`	`AwsStandardS3RetryStrategy(max_attempts=N)` + s3fs side
`backoff_base`	`wait_exponential(min=N)`	Not directly mapped¹	`initial_backoff=N`	Not configurable (C++ internal)
`backoff_max`	`wait_exponential(max=N)`	Not directly mapped¹	Implicit via increment	Not configurable (C++ internal)
`jitter`	`wait_random(0, N)`	Built-in (not configurable)	`random_jitter_range=N`	Not configurable (C++ internal)
`timeout`	`stop_after_delay(N)`	Not supported	Not supported	Not supported

¹ botocore's standard mode computes each delay as min(b * 2^attempt, 20) seconds, where b is a random value between 0 and 1. backoff_base and backoff_max cannot be set directly, so the S3 mapping is best-effort.
² In botocore's Config, the max_attempts key counts retries after the initial request, while total_max_attempts counts every attempt including the first. The latter matches RetryPolicy.max_attempts.
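
As a rough illustration of the S3 and Azure columns, the sketch below turns a policy into plain keyword dictionaries, so it runs without either SDK installed. The function names are hypothetical. The botocore side uses the total_max_attempts key because it counts the initial request, whereas botocore's max_attempts key counts retries only; whether to map it this way is an assumption about the intended semantics.

```python
from __future__ import annotations

import dataclasses


@dataclasses.dataclass(frozen=True)
class RetryPolicy:
    """Copy of the proposed dataclass, repeated so the sketch is self-contained."""

    max_attempts: int = 3
    backoff_base: float = 1.0
    backoff_max: float = 60.0
    jitter: float = 1.0
    timeout: float | None = None


def to_botocore_retries(policy: RetryPolicy) -> dict[str, object]:
    """Value for botocore.config.Config(retries=...). Backoff is not mappable."""
    return {"total_max_attempts": policy.max_attempts, "mode": "standard"}


def to_azure_retry_kwargs(policy: RetryPolicy) -> dict[str, object]:
    """Keyword arguments for azure.storage.blob.ExponentialRetry(...)."""
    return {
        # retry_total counts retries, so subtract the initial attempt.
        "retry_total": max(policy.max_attempts - 1, 0),
        "initial_backoff": policy.backoff_base,
        "random_jitter_range": policy.jitter,
    }


policy = RetryPolicy(max_attempts=5, backoff_base=2.0)
print(to_botocore_retries(policy))
print(to_azure_retry_kwargs(policy))
```

The Azure SDK documents these parameters as integers, so a real mapping may need to round backoff_base and jitter. backoff_max and timeout are dropped on both sides, which is the lossy part of Option B.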

Where it lives

Type definition: src/remote_store/_config.py (alongside BackendConfig)
No new dependencies for the dataclass itself.
SFTP uses tenacity (already an optional dep) to implement the policy.
S3/Azure map to their SDK's native config objects.

How users configure it

from remote_store import RetryPolicy, BackendConfig, RegistryConfig

# Direct backend construction
from remote_store.backends import SFTPBackend
backend = SFTPBackend(host="sftp.example.com", retry=RetryPolicy(max_attempts=5))

# Via BackendConfig
config = BackendConfig(
    type="sftp",
    options={"host": "sftp.example.com"},
    retry=RetryPolicy(max_attempts=5),
)

# Via dict config (from_dict / YAML / TOML)
config = RegistryConfig.from_dict({
    "stores": {
        "remote": {
            "backend": {
                "type": "sftp",
                "options": {"host": "sftp.example.com"},
                "retry": {"max_attempts": 5, "backoff_base": 2.0},
            }
        }
    }
})

# Disable retry entirely
backend = SFTPBackend(host="h", retry=RetryPolicy(max_attempts=1))

7. Scope Question: Connect Retry vs Operation Retry

SFTP currently retries connection only. Should RetryPolicy also cover individual operations (read, write, delete)?

Arguments for operation retry (SFTP)

SSH connections drop mid-session (network blip, server restart).
A write() that fails with OSError or SSHException mid-transfer is retryable if the backend reconnects first.
The liveness check (stat('.')) already detects stale connections — retry could trigger reconnect + re-attempt.

Arguments against operation retry (SFTP)

Non-idempotent operations (write with overwrite=False, move) are unsafe to retry blindly.
Partial writes may leave orphaned data — retry without cleanup causes corruption.
The current SFTP-009 spec explicitly limits retry to connection.

Arguments for operation retry (S3/Azure)

S3 and Azure SDKs already retry individual operations at the HTTP level.
RetryPolicy naturally configures this existing behavior.
No additional code needed — just pass config to the SDK.

Recommendation

S3/Azure: RetryPolicy configures the SDK's existing operation-level retry. This is safe because the SDKs already handle idempotency.
SFTP: RetryPolicy configures connection retry (replacing the hardcoded values). Operation retry is a separate, more complex feature that requires idempotency analysis and should be a follow-up if needed. The liveness check + auto-reconnect already handles the most common case (stale connection on next operation).

8. Open Questions for Spec/ADR

Should RetryPolicy be a field on BackendConfig or nested in options? The backlog item says BackendConfig.options. A dedicated field (BackendConfig.retry: RetryPolicy | None) is cleaner and type-safe, but changes the BackendConfig schema. Recommendation: dedicated field.
Should RetryPolicy be a core type or live in an optional module? As a frozen dataclass with no imports, it has zero dependency cost. Recommendation: core type in _config.py.
What about a RetryPolicy.disabled() class method? RetryPolicy(max_attempts=1) works but reads poorly. A RetryPolicy.disabled() factory (returns RetryPolicy(max_attempts=1)) is more expressive. Nice-to-have, not critical.
Should backends accept retry as a constructor parameter? Currently backends take individual kwargs (host, port, etc.) not config objects. Adding retry: RetryPolicy | None = None to each backend constructor is the cleanest API. Backends that ignore retry (Local, Memory) simply don't accept the parameter.
Should the default RetryPolicy() match current SFTP behavior? Current SFTP: 3 attempts, 2–10 s exponential. Proposed default: 3 attempts, 1–60 s exponential, 1 s jitter. The defaults should be reasonable for all backends, not SFTP-specific. SFTP's current behavior can be preserved with RetryPolicy(max_attempts=3, backoff_base=2.0, backoff_max=10.0).
How does this interact with ext.observe? Retry happens inside the backend, below the observe middleware. ext.observe sees the final result (success after N retries, or failure after exhausting retries). If users want retry-level observability, they need backend-level logging (which SFTP's before_sleep_log already provides). The RetryPolicy could optionally accept an on_retry callback, but this adds complexity — defer unless there's demand.
Should from_dict() auto-parse retry from nested dicts? Yes — {"retry": {"max_attempts": 5}} should produce BackendConfig(retry=RetryPolicy(max_attempts=5)). This keeps YAML/TOML config ergonomic.
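
Two of the questions above, the disabled() factory and nested-dict parsing, are small enough to sketch together. `parse_retry` is a hypothetical helper standing in for whatever from_dict() would call, and rejecting unknown keys is an assumption, not a decided behavior.

```python
from __future__ import annotations

import dataclasses
from typing import Any, Mapping


@dataclasses.dataclass(frozen=True)
class RetryPolicy:
    """Copy of the proposed dataclass, plus the disabled() factory under discussion."""

    max_attempts: int = 3
    backoff_base: float = 1.0
    backoff_max: float = 60.0
    jitter: float = 1.0
    timeout: float | None = None

    @classmethod
    def disabled(cls) -> RetryPolicy:
        """A single attempt, that is, no retry."""
        return cls(max_attempts=1)


def parse_retry(raw: Mapping[str, Any] | RetryPolicy | None) -> RetryPolicy | None:
    """Turn a nested config mapping into a RetryPolicy; pass through the rest."""
    if raw is None or isinstance(raw, RetryPolicy):
        return raw
    known = {field.name for field in dataclasses.fields(RetryPolicy)}
    unknown = set(raw) - known
    if unknown:
        # Assumption: fail loudly on typos instead of silently ignoring them.
        raise ValueError(f"unknown retry option(s): {sorted(unknown)}")
    return RetryPolicy(**raw)


parsed = parse_retry({"max_attempts": 5, "backoff_base": 2.0})
```

Accepting an existing RetryPolicy unchanged lets the same helper serve both the typed BackendConfig(retry=...) path and the dict-based path.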

9. Implementation Estimate

Work item	Effort	Dependencies
`RetryPolicy` dataclass in `_config.py`	Small	None
`BackendConfig.retry` field + `from_dict()` parsing	Small	RetryPolicy
SFTP: read policy from constructor, replace hardcoded values	Medium	RetryPolicy
S3: map policy to botocore `Config(retries=...)`	Small	RetryPolicy
Azure: map policy to `ExponentialRetry(...)`	Small	RetryPolicy
S3-PyArrow: map policy to s3fs side	Small	RetryPolicy
Tests: unit tests for RetryPolicy, integration tests per backend	Medium	All above
Spec: `sdd/specs/0XX-retry-policy.md`	Medium	Design decisions
Docs: user guide, config examples	Small	Spec
CHANGELOG, BACKLOG update	Small	All above