Research: Retry Policy Configuration (ID-010)¶
Date: 2026-03-05
Backlog items: ID-010 (Retry policy configuration)
Status: Research complete, ready for design decisions
1. Problem Statement¶
The SFTP backend has hardcoded retry logic (3 attempts, 2–10 s exponential
backoff via tenacity). S3 and Azure rely on their SDK's built-in retry.
There is no unified retry surface — each backend does (or doesn't do) its own
thing, and users have zero knobs to tune retry behavior through remote-store.
The backlog item states:
> SFTP has hardcoded retry logic (3 attempts, 2–10 s backoff via tenacity). Expose a `RetryPolicy` dataclass in `BackendConfig.options` so users can tune attempts, backoff, and jitter per-backend.
Why this matters¶
- Production workloads need tunable retries. Batch jobs want aggressive retry (10 attempts, long backoff). Request-serving code wants fast failure (2 attempts, short backoff). One-size-fits-all is wrong for both.
- Flaky networks are common. SFTP over WAN, S3 behind a VPN, Azure in cross-region setups — transient failures are routine, not exceptional.
- Rate limiting. Cloud backends throttle (S3 503 SlowDown, Azure 429). Users need backoff tuning to avoid hammering a throttled endpoint.
- Observability. Users want to know that retries happened and why, not discover them via unexplained latency spikes.
Design constraints¶
- Core package has zero runtime dependencies (`dependencies = []`).
- `tenacity>=4.0` is an optional dependency (part of the `sftp` extra).
- `BackendConfig.options` is a `dict[str, object]`, the natural injection point for retry configuration.
- S3 and Azure SDKs have their own built-in retry mechanisms, so a naive tenacity wrapper on top creates "retry multiplication."
2. Current State: How Each Backend Handles Retry¶
2.1 SFTP — Explicit tenacity on connect only¶
File: src/remote_store/backends/_sftp.py — _connect() method (lines 519–566)
```python
@retry(
    retry=retry_if_exception_type((paramiko.SSHException, OSError, EOFError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    before_sleep=before_sleep_log(log, logging.WARNING),
    reraise=True,
)
def _do_connect() -> None:
    ssh.connect(...)
```
- Scope: Connection establishment only. Individual file operations (read, write, delete, list) are NOT retried.
- Hardcoded: 3 attempts, 2–10 s exponential backoff, no jitter.
- Spec: SFTP-009 (Tenacity Retry on Connect).
- Audit note: M-17 flags that this retry logic is untested.
- Liveness check: The `_sftp` property also calls `stat('.')` to detect stale connections and auto-reconnects. This is recovery, not retry.
2.2 S3 — Relies on botocore built-in retry¶
File: src/remote_store/backends/_s3.py
s3fs delegates to botocore, which has its own retry layer:
- Default mode: `legacy` (5 attempts, exponential backoff).
- Standard mode: Configurable via `botocore.config.Config(retries={"max_attempts": N, "mode": "standard"})`.
- Adaptive mode: Dynamic token-bucket rate limiting.
- Retried errors: Throttling (503 SlowDown, 429), transient errors (500, 502, 503, 504), connection errors, timeout errors.
Users can already pass retry config through client_options:
```python
import botocore.config

S3Backend(
    bucket="b",
    client_options={
        "client_kwargs": {
            "config": botocore.config.Config(retries={"max_attempts": 10, "mode": "adaptive"})
        }
    },
)
```
But this is undiscoverable, botocore-specific, and doesn't help with non-botocore errors (e.g., s3fs's own network handling).
Additionally, s3fs has a module-level retries parameter (default 5) and
functions s3fs.add_retryable_error() / s3fs.set_custom_error_handler()
for registering retryable exception types. These are global — not per-instance.
2.3 Azure — Relies on Azure SDK built-in retry¶
File: src/remote_store/backends/_azure.py
Azure Storage SDK uses HTTP pipeline policies for retry:
- Default: `ExponentialRetry(initial_backoff=15, increment_base=3, retry_total=3, random_jitter_range=3)`.
- Alternative: `LinearRetry(backoff=15, retry_total=3, random_jitter_range=3)`.
- Retried errors: HTTP 408, 429, 500, 502, 503, 504, connection errors.
Users can pass retry config through client_options:
```python
from azure.storage.blob import ExponentialRetry

AzureBackend(
    container="c",
    account_name="a",
    client_options={"retry_policy": ExponentialRetry(retry_total=10)},
)
```
Again, undiscoverable and SDK-specific.
2.4 S3-PyArrow — Hybrid, partially configurable¶
File: src/remote_store/backends/_s3_pyarrow.py
- Data path (PyArrow C++ S3): PyArrow exposes `retry_strategy` on `S3FileSystem` with two strategy classes:
  - `AwsStandardS3RetryStrategy(max_attempts=3)`: default, exponential backoff, broad error coverage (recommended).
  - `AwsDefaultS3RetryStrategy(max_attempts=N)`: legacy, narrower coverage.

  The only configurable knob is `max_attempts`. Backoff timing, jitter, and error classification are not configurable through PyArrow's API. Related timeout parameters: `request_timeout`, `connect_timeout`.
- Control path (s3fs): Same as S3 backend above.
2.5 Local / Memory — No retry (correct)¶
Local filesystem and in-memory operations don't have transient failures worth retrying. No retry needed.
Summary¶
| Backend | Retry mechanism | Scope | User-configurable? |
|---|---|---|---|
| SFTP | tenacity (hardcoded) | Connect only | No |
| S3 | botocore built-in | All SDK calls | Yes, but buried in client_options |
| Azure | Azure SDK policies | All SDK calls | Yes, but buried in client_options |
| S3-PyArrow | C++ internal + botocore | Partial | Minimal |
| Local | None | N/A | N/A |
| Memory | None | N/A | N/A |
3. Survey: How the Python Ecosystem Handles Retry¶
3.1 google-api-core Retry (gold standard)¶
```python
Retry(
    predicate=if_transient_error,  # Callable[[Exception], bool]
    initial=1.0,                   # initial delay (seconds)
    maximum=60.0,                  # max delay cap (seconds)
    multiplier=2.0,                # exponential multiplier
    timeout=120.0,                 # total retry window (seconds)
    on_error=None,                 # callback on each error
)
```
Key decisions:
- Uses timeout (total wall-clock window) rather than max_attempts.
- Predicate-based retryability — if_exception_type(Exc1, Exc2) builds
predicates.
- Immutable with with_XXX builders: retry.with_timeout(500) returns
a new instance.
- ConditionalRetryPolicy activates retry only for idempotent operations.
Strengths: Most thoughtful design. Predicate composability is powerful.

Weaknesses: Timeout-only (no attempt count) is confusing for some users.
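The immutable `with_XXX` builder pattern noted above is easy to reproduce with a frozen dataclass, which matters here because the `RetryPolicy` proposed later is also frozen. The following is a minimal sketch of the pattern only: the `Policy` class and its fields are hypothetical stand-ins and are not google-api-core's `Retry` implementation.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass(frozen=True)
class Policy:
    """Hypothetical stand-in for an immutable retry configuration."""

    max_attempts: int = 3
    timeout: Optional[float] = None

    def with_timeout(self, timeout: Optional[float]) -> "Policy":
        # dataclasses.replace builds a new instance; the original is untouched.
        return dataclasses.replace(self, timeout=timeout)


base = Policy()
patient = base.with_timeout(500.0)

print(base)     # the original still has timeout=None
print(patient)  # the copy carries timeout=500.0 and the same max_attempts
```

Because each builder returns a fresh object, a shared default policy can be specialised per call site without one caller's change leaking into another.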
3.2 urllib3 Retry (most widely used)¶
```python
Retry(
    total=10,               # total retries
    backoff_factor=0.5,     # exponential backoff multiplier
    backoff_max=120,        # backoff ceiling (seconds)
    backoff_jitter=0.0,     # random jitter
    status_forcelist=None,  # HTTP status codes to retry
)
```
Backoff formula: backoff_factor * (2 ** num_previous_retries) + uniform(0, backoff_jitter)
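To make the formula concrete, the sketch below evaluates it for the parameter values shown in the snippet above. It implements the formula as written in this document, with the result capped at `backoff_max`; the function name `backoff_delay` is made up for illustration, and urllib3's real implementation may differ in details such as how the very first retry is treated.

```python
import random


def backoff_delay(
    num_previous_retries: int,
    backoff_factor: float = 0.5,
    backoff_max: float = 120.0,
    backoff_jitter: float = 0.0,
) -> float:
    """Delay in seconds before the next retry, per the formula quoted above."""
    raw = backoff_factor * (2 ** num_previous_retries)
    raw += random.uniform(0.0, backoff_jitter)  # adds nothing when jitter is 0.0
    return min(raw, backoff_max)                # never exceed the ceiling


# Delays before each of the 10 retries with the defaults used above.
schedule = [backoff_delay(n) for n in range(10)]
print(schedule)
```

With `backoff_factor=0.5` the delay doubles from 0.5 s and reaches the 120 s ceiling on the ninth retry, so the ceiling only matters for long retry budgets.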
Strengths: Battle-tested, explicit jitter, attempt-based.

Weaknesses: HTTP-specific knobs (status codes, methods) don't map to storage operations.
3.3 obstore RetryConfig (closest analog)¶
```python
RetryConfig(
    max_retries=10,
    retry_timeout="60s",  # total wall-clock cap
    backoff={"init_backoff": "1s", "max_backoff": "30s", "base": 2.0},
)
```
The only multi-backend storage library with a unified retry config object.
Backed by the Rust object_store crate.
Strengths: Simple, storage-specific, proven in production.

Weaknesses: No predicate customization, no jitter knob.
3.4 tenacity (our existing dependency)¶
```python
retry(
    retry=retry_if_exception_type(OSError),
    stop=stop_after_attempt(3) | stop_after_delay(60),
    wait=wait_exponential(multiplier=1, min=2, max=10) + wait_random(0, 2),
    before_sleep=before_sleep_log(log, logging.WARNING),
    reraise=True,
)
```
Strengths: Maximum composability, with combinable stop/wait/retry conditions.

Weaknesses: Not a config object; it's a decorator. Hard to serialize/deserialize for config files.
3.5 fsspec ecosystem (fragmented)¶
- s3fs: Module-level `retries` parameter (default 5), `add_retryable_error()` for exception types. No per-instance config.
- adlfs: Delegates to Azure SDK.
- sshfs: No retry.
- No shared retry abstraction across fsspec.
Summary table¶
| Library | Type | Limit style | Backoff params | Jitter |
|---|---|---|---|---|
| google-api-core | Class | timeout (seconds) | initial, maximum, multiplier | Internal |
| urllib3 | Class | total (count) | backoff_factor, backoff_max | Explicit |
| obstore | Dict | max_retries + retry_timeout | init_backoff, max_backoff, base | Internal |
| tenacity | Decorator | stop_after_attempt / stop_after_delay | wait_exponential(multiplier, min, max) | Additive |
| s3fs | Module-level | retries (count) | Fixed | None |
4. Design Space: Where Should Retry Live?¶
The central tension: S3 and Azure already retry internally. Adding a retry layer on top creates multiplication. There are three viable approaches.
4.1 Option A: Unified tenacity layer at Store level¶
Wrap every Store method call with a configurable tenacity retry decorator.
The Store becomes the retry boundary.
```python
store = Store(backend, root_path="data")
store.retry_policy = RetryPolicy(max_attempts=5, backoff_base=2.0, backoff_max=30.0)
# Every store.read(), store.write(), etc. is retried on transient errors
```
Pros:
- Single retry config for all backends.
- Users don't need to think about per-backend differences.
- Works with any backend, including future ones.
Cons:
- Retry multiplication: S3 has 5 botocore retries × 5 Store retries = 25
actual attempts. Azure has 3 SDK retries × 5 Store retries = 15 attempts.
This is wasteful and can cause extremely long waits.
- Wrong error types: Store methods raise remote-store errors (NotFound,
BackendUnavailable, etc.), not SDK exceptions. Retrying on NotFound
makes no sense. The retryable-error predicate must be carefully curated.
- Conflates levels: Connection retry (SFTP) vs operation retry (S3 503)
vs application retry (idempotency) are different concerns.
- Adds tenacity as a core dependency (currently it's SFTP-only optional).
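The retry-multiplication arithmetic in the first con generalises to any number of stacked layers: in the worst case every outer attempt exhausts the inner layer, so the totals multiply. A small sketch, using the per-layer figures quoted above (the helper name `worst_case_attempts` is made up for illustration):

```python
import math


def worst_case_attempts(*attempts_per_layer: int) -> int:
    """Upper bound on backend calls when retry layers are nested."""
    # Each outer attempt can trigger a full run of the inner layer.
    return math.prod(attempts_per_layer)


s3_total = worst_case_attempts(5, 5)     # botocore layer x Store layer
azure_total = worst_case_attempts(3, 5)  # Azure SDK layer x Store layer

print(s3_total, azure_total)
```

The same multiplication applies to waiting time, which is why stacked backoff schedules can turn a brief outage into a stall of several minutes.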
4.2 Option B: Per-backend native retry configuration¶
Expose each SDK's native retry configuration through a unified
RetryPolicy dataclass that maps to backend-specific settings.
```python
policy = RetryPolicy(max_attempts=5, backoff_base=2.0, backoff_max=30.0, jitter=1.0)

# S3: maps to botocore Config(retries={"max_attempts": 5, "mode": "standard"})
S3Backend(bucket="b", retry=policy)

# Azure: maps to ExponentialRetry(retry_total=4, initial_backoff=2, increment_base=2)
# (retry_total = max_attempts - 1, matching the mapping table in section 6)
AzureBackend(container="c", retry=policy)

# SFTP: maps to tenacity @retry(stop=stop_after_attempt(5), wait=wait_exponential(...))
SFTPBackend(host="h", retry=policy)
```
Pros:
- No retry multiplication: the policy replaces SDK defaults rather than stacking on top.
- Each backend interprets the policy in the most efficient way for its SDK.
- Clean separation: retry stays at the transport level where it belongs.
Cons:
- Lossy mapping: a RetryPolicy can't express everything botocore or Azure
SDK supports (status code lists, conditional retry, adaptive mode).
- PyArrow C++ S3 retry is barely configurable — mapping is very limited.
- Users who need full SDK control still use client_options (which already
works today).
- Different backends may interpret the same policy slightly differently.
4.3 Option C: Retry-aware Store middleware (ext layer)¶
Instead of building retry into backends, offer it as an observable middleware
in ext/, similar to how ext.observe wraps Store with hooks.
```python
from remote_store.ext.retry import RetryPolicy, retry_store

policy = RetryPolicy(max_attempts=3, backoff_base=1.0, backoff_max=30.0)
store = retry_store(base_store, policy=policy)
# store.write() now retries on BackendUnavailable
```
Under the hood, retry_store() returns a RetryStore(Store) proxy that
wraps each method with tenacity retry, filtering on retryable error types
(BackendUnavailable, PermissionDenied on rate-limit, etc.).
Pros:
- Zero change to backends or Store — purely additive.
- Composable with ext.observe — observe(retry_store(base)).
- Explicit opt-in — users who want retry get it; others don't pay for it.
- Can be combined with SDK-level retry (intentionally — the ext layer
catches errors that escape the SDK retry).
- tenacity stays optional (part of an ext extra, not core).
Cons:
- Two retry layers (SDK + middleware) are harder to reason about.
- Retry at the Store level means the full method re-executes, including
path validation, logging, etc. (minor overhead).
- read() returns BinaryIO — retrying a streaming read is tricky
(must discard the partial stream and re-open).
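The core of such a proxy is a loop that re-invokes a call only when it raises a retryable error. The sketch below shows that loop using only the standard library, swapping in a plain `for` loop for the tenacity decorator the proposal would use. `BackendUnavailable` is redefined locally as a stand-in, and `call_with_retry` and its `sleep` parameter are hypothetical names, not part of remote-store.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class BackendUnavailable(Exception):
    """Local stand-in for the library's retryable error type."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    backoff_base: float = 1.0,
    backoff_max: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying only on BackendUnavailable with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except BackendUnavailable:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            sleep(min(backoff_base * (2 ** attempt), backoff_max))
    raise ValueError("max_attempts must be at least 1")


# Demo: an operation that fails twice, then succeeds.
calls = {"n": 0}


def flaky_write() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise BackendUnavailable("transient failure")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = call_with_retry(flaky_write, max_attempts=3, sleep=delays.append)
print(result, delays, calls["n"])
```

Any other exception type propagates immediately, which is the behaviour wanted for errors such as NotFound. Note that the wrapped callable is re-run from scratch on each attempt, so the streaming-read concern listed above still applies.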
4.4 Option D: Hybrid — Backend-native defaults + Store-level override¶
Combine B and C: backends configure their SDK's native retry from the policy (eliminating SDK defaults), and the Store middleware handles cross-cutting retry (e.g., reconnect-and-retry for SFTP connection drops mid-operation).
Pros: Theoretically cleanest, since each layer does what it's best at.

Cons: Most complex. Two places to configure retry. Hard to explain.
5. Recommendation¶
Primary: Option B (per-backend native retry configuration)¶
This is the most natural fit for the library's architecture:
- Backends own their transport — retry is a transport concern.
- No retry multiplication — the policy replaces SDK defaults.
- Minimal API surface — one dataclass, one constructor parameter.
- No new core dependencies — tenacity stays in the SFTP extra.
Secondary consideration: Option C as a future extension¶
A Store-level retry middleware in ext/ could handle higher-level retry
(e.g., "reconnect SFTP and retry the operation" or "retry a write that
failed mid-stream"). This is orthogonal to backend-level retry and can be
added later without changing the backend-level design.
Not recommended: Option A¶
Store-level retry as the primary mechanism causes retry multiplication and is at the wrong abstraction level. Rejected.
6. Proposed RetryPolicy Dataclass¶
```python
from __future__ import annotations  # allows "float | None" on older Python versions

import dataclasses


@dataclasses.dataclass(frozen=True)
class RetryPolicy:
    """Retry configuration for transient backend errors.

    Backends map these parameters to their native retry mechanisms.
    Backends that don't support a parameter silently ignore it.
    """

    max_attempts: int = 3
    """Maximum number of attempts (including the initial attempt).
    Set to 1 to disable retry."""

    backoff_base: float = 1.0
    """Base delay in seconds for exponential backoff.
    Delay = backoff_base * (2 ** attempt) capped at backoff_max."""

    backoff_max: float = 60.0
    """Maximum delay between retries in seconds."""

    jitter: float = 1.0
    """Maximum random jitter added to each delay in seconds.
    Set to 0.0 to disable jitter."""

    timeout: float | None = None
    """Total wall-clock timeout in seconds for all attempts combined.
    None means no total timeout (only max_attempts limits retries)."""
```
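To show what these fields imply in practice, the sketch below computes the sequence of waits a policy would produce, following the `backoff_base * (2 ** attempt)` formula from the docstring. Two points are assumptions rather than decisions from this document: `attempt` is counted from 0 for the first retry, and jitter is added after the cap is applied. The `RetryPolicy` here is a local copy so the example runs on its own, and `delay_schedule` is a hypothetical helper.

```python
import dataclasses
import random
from typing import List, Optional


@dataclasses.dataclass(frozen=True)
class RetryPolicy:
    """Local copy of the proposed dataclass, so the sketch is self-contained."""

    max_attempts: int = 3
    backoff_base: float = 1.0
    backoff_max: float = 60.0
    jitter: float = 1.0
    timeout: Optional[float] = None


def delay_schedule(policy: RetryPolicy, rng: Optional[random.Random] = None) -> List[float]:
    """Waits (in seconds) between consecutive attempts under the given policy."""
    rng = rng or random.Random()
    delays = []
    # N attempts leave N - 1 gaps in which to wait.
    for attempt in range(policy.max_attempts - 1):
        capped = min(policy.backoff_base * (2 ** attempt), policy.backoff_max)
        delays.append(capped + rng.uniform(0.0, policy.jitter))
    return delays


no_jitter = RetryPolicy(max_attempts=5, backoff_base=2.0, backoff_max=10.0, jitter=0.0)
print(delay_schedule(no_jitter))                    # deterministic because jitter is 0.0
print(delay_schedule(RetryPolicy(max_attempts=1)))  # retry disabled: no waits at all
```

Writing the schedule out this way also makes it easy to check a candidate default against the optional `timeout` before committing to it.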
Mapping to each backend¶
| Parameter | SFTP (tenacity) | S3 (botocore) | Azure SDK | S3-PyArrow |
|---|---|---|---|---|
| `max_attempts` | `stop_after_attempt(N)` | `Config(retries={"max_attempts": N})` | `retry_total=N-1` | `AwsStandardS3RetryStrategy(max_attempts=N)` + s3fs side |
| `backoff_base` | `wait_exponential(min=N)` | Not directly mapped¹ | `initial_backoff=N` | Not configurable (C++ internal) |
| `backoff_max` | `wait_exponential(max=N)` | Not directly mapped¹ | Implicit via increment | Not configurable (C++ internal) |
| `jitter` | `wait_random(0, N)` | Built-in (not configurable) | `random_jitter_range=N` | Not configurable (C++ internal) |
| `timeout` | `stop_after_delay(N)` | Not supported | Not supported | Not supported |
¹ botocore computes its own backoff. In standard mode the delay is roughly
min(b * 2^attempt, 20) seconds, where b is a random value in [0, 1).
`backoff_base` and `backoff_max` cannot be set directly, so for S3 the
mapping is best-effort.
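The S3 and Azure columns of the table can be expressed as two small translation functions. This is a sketch under stated assumptions, not the planned implementation: the helper names `to_botocore_retries` and `to_azure_retry_kwargs` are hypothetical, the `RetryPolicy` is a local copy, and the functions return plain dicts so the example runs without either SDK installed. In a backend, the dicts would be passed to `botocore.config.Config(retries=...)` and `ExponentialRetry(**kwargs)` respectively.

```python
import dataclasses
from typing import Any, Dict, Optional


@dataclasses.dataclass(frozen=True)
class RetryPolicy:
    """Local copy of the proposed dataclass, so the sketch is self-contained."""

    max_attempts: int = 3
    backoff_base: float = 1.0
    backoff_max: float = 60.0
    jitter: float = 1.0
    timeout: Optional[float] = None


def to_botocore_retries(policy: RetryPolicy) -> Dict[str, Any]:
    """Value for botocore Config(retries=...); backoff fields have no equivalent."""
    return {"max_attempts": policy.max_attempts, "mode": "standard"}


def to_azure_retry_kwargs(policy: RetryPolicy) -> Dict[str, Any]:
    """Keyword arguments for ExponentialRetry, following the table above."""
    return {
        "retry_total": max(policy.max_attempts - 1, 0),  # retries, not attempts
        "initial_backoff": policy.backoff_base,
        "random_jitter_range": policy.jitter,
    }


policy = RetryPolicy(max_attempts=5, backoff_base=2.0, jitter=1.0)
print(to_botocore_retries(policy))
print(to_azure_retry_kwargs(policy))
```

Two details are worth checking during implementation. botocore also accepts a `total_max_attempts` key that counts the initial request, so the `max_attempts` row may be off by one depending on which key is used. The Azure SDK documents its backoff and jitter parameters as integers, so float policy values may need rounding.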
Where it lives¶
- Type definition: `src/remote_store/_config.py` (alongside `BackendConfig`).
- No new dependencies for the dataclass itself.
- SFTP uses tenacity (already an optional dep) to implement the policy.
- S3/Azure map to their SDK's native config objects.
How users configure it¶
```python
from remote_store import RetryPolicy, BackendConfig, RegistryConfig

# Direct backend construction
from remote_store.backends import SFTPBackend

backend = SFTPBackend(host="sftp.example.com", retry=RetryPolicy(max_attempts=5))

# Via BackendConfig
config = BackendConfig(
    type="sftp",
    options={"host": "sftp.example.com"},
    retry=RetryPolicy(max_attempts=5),
)

# Via dict config (from_dict / YAML / TOML)
config = RegistryConfig.from_dict({
    "stores": {
        "remote": {
            "backend": {
                "type": "sftp",
                "options": {"host": "sftp.example.com"},
                "retry": {"max_attempts": 5, "backoff_base": 2.0},
            }
        }
    }
})

# Disable retry entirely
backend = SFTPBackend(host="h", retry=RetryPolicy(max_attempts=1))
```
7. Scope Question: Connect Retry vs Operation Retry¶
SFTP currently retries connection only. Should RetryPolicy also cover
individual operations (read, write, delete)?
Arguments for operation retry (SFTP)¶
- SSH connections drop mid-session (network blip, server restart).
- A `write()` that fails with `OSError` or `SSHException` mid-transfer is retryable if the backend reconnects first.
- The liveness check (`stat('.')`) already detects stale connections, so retry could trigger reconnect + re-attempt.
Arguments against operation retry (SFTP)¶
- Non-idempotent operations (`write` with `overwrite=False`, `move`) are unsafe to retry blindly.
- Partial writes may leave orphaned data; retrying without cleanup can cause corruption.
- The current SFTP-009 spec explicitly limits retry to connection.
Arguments for operation retry (S3/Azure)¶
- S3 and Azure SDKs already retry individual operations at the HTTP level.
- `RetryPolicy` naturally configures this existing behavior.
- No additional code needed: just pass config to the SDK.
Recommendation¶
- S3/Azure: `RetryPolicy` configures the SDK's existing operation-level retry. This is safe because the SDKs already handle idempotency.
- SFTP: `RetryPolicy` configures connection retry (replacing the hardcoded values). Operation retry is a separate, more complex feature that requires idempotency analysis and should be a follow-up if needed. The liveness check + auto-reconnect already handles the most common case (stale connection on next operation).
8. Open Questions for Spec/ADR¶
1. Should `RetryPolicy` be a field on `BackendConfig` or nested in `options`? The backlog item says `BackendConfig.options`. A dedicated field (`BackendConfig.retry: RetryPolicy | None`) is cleaner and type-safe, but changes the `BackendConfig` schema. Recommendation: dedicated field.
2. Should `RetryPolicy` be a core type or live in an optional module? As a frozen dataclass with no imports, it has zero dependency cost. Recommendation: core type in `_config.py`.
3. What about a `RetryPolicy.disabled()` class method? `RetryPolicy(max_attempts=1)` works but reads poorly. A `RetryPolicy.disabled()` factory (returns `RetryPolicy(max_attempts=1)`) is more expressive. Nice-to-have, not critical.
4. Should backends accept `retry` as a constructor parameter? Currently backends take individual kwargs (`host`, `port`, etc.), not config objects. Adding `retry: RetryPolicy | None = None` to each backend constructor is the cleanest API. Backends that ignore retry (Local, Memory) simply don't accept the parameter.
5. Should the default `RetryPolicy()` match current SFTP behavior? Current SFTP: 3 attempts, 2–10 s exponential, no jitter. Proposed default: 3 attempts, 1–60 s exponential, 1 s jitter. The defaults should be reasonable for all backends, not SFTP-specific. SFTP's current behavior can be preserved with `RetryPolicy(max_attempts=3, backoff_base=2.0, backoff_max=10.0, jitter=0.0)`; `jitter=0.0` is required because the proposed default adds 1 s of jitter.
6. How does this interact with `ext.observe`? Retry happens inside the backend, below the observe middleware. `ext.observe` sees the final result (success after N retries, or failure after exhausting retries). If users want retry-level observability, they need backend-level logging (which SFTP's `before_sleep_log` already provides). The `RetryPolicy` could optionally accept an `on_retry` callback, but this adds complexity; defer unless there's demand.
7. Should `from_dict()` auto-parse `retry` from nested dicts? Yes: `{"retry": {"max_attempts": 5}}` should produce `BackendConfig(retry=RetryPolicy(max_attempts=5))`. This keeps YAML/TOML config ergonomic.
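The nested-dict parsing in the last question could look roughly like the sketch below. It is one possible shape under stated assumptions, not the planned `from_dict()` implementation: `parse_retry` is a hypothetical helper, the `RetryPolicy` is a local copy, and rejecting unknown keys (rather than ignoring them) is a design choice made here so that typos in YAML/TOML fail loudly.

```python
import dataclasses
from typing import Any, Mapping, Optional


@dataclasses.dataclass(frozen=True)
class RetryPolicy:
    """Local copy of the proposed dataclass, so the sketch is self-contained."""

    max_attempts: int = 3
    backoff_base: float = 1.0
    backoff_max: float = 60.0
    jitter: float = 1.0
    timeout: Optional[float] = None


def parse_retry(raw: Optional[Mapping[str, Any]]) -> Optional[RetryPolicy]:
    """Turn the optional "retry" mapping from a config dict into a RetryPolicy."""
    if raw is None:
        return None  # no "retry" key: leave the backend's default behaviour alone
    known = {field.name for field in dataclasses.fields(RetryPolicy)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown retry options: {sorted(unknown)}")
    return RetryPolicy(**raw)


backend_dict = {
    "type": "sftp",
    "options": {"host": "sftp.example.com"},
    "retry": {"max_attempts": 5, "backoff_base": 2.0},
}
parsed = parse_retry(backend_dict.get("retry"))
print(parsed)
```

Unspecified fields fall back to the dataclass defaults, so a config file only needs to list the values it wants to change.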
9. Implementation Estimate¶
| Work item | Effort | Dependencies |
|---|---|---|
| `RetryPolicy` dataclass in `_config.py` | Small | None |
| `BackendConfig.retry` field + `from_dict()` parsing | Small | RetryPolicy |
| SFTP: read policy from constructor, replace hardcoded values | Medium | RetryPolicy |
| S3: map policy to botocore `Config(retries=...)` | Small | RetryPolicy |
| Azure: map policy to `ExponentialRetry(...)` | Small | RetryPolicy |
| S3-PyArrow: map policy to s3fs side | Small | RetryPolicy |
| Tests: unit tests for RetryPolicy, integration tests per backend | Medium | All above |
| Spec: `sdd/specs/0XX-retry-policy.md` | Medium | Design decisions |
| Docs: user guide, config examples | Small | Spec |
| CHANGELOG, BACKLOG update | Small | All above |
Total: Medium-sized feature. Comparable to ID-039 (Secret/credential hygiene).
10. References¶
Python ecosystem¶
- google-api-core Retry
- urllib3 Retry
- obstore RetryConfig
- tenacity docs
- botocore retry configuration
- Azure Storage retry policies
Internal¶
- SFTP spec: `sdd/specs/009-sftp-backend.md` (SFTP-009)
- Audit: M-17 (SFTP retry untested)
- BackendConfig: `src/remote_store/_config.py`
- SFTP retry code: `src/remote_store/backends/_sftp.py` (lines 519–566)