Research: Configuration Loaders and Store Config Patterns

Date: 2026-03-02
Backlog items: ID-002 (YAML config support), ID-003 (Pydantic BaseSettings integration), ID-005 (Built-in from_toml() config loader)
Status: Research complete, awaiting design decisions

1. Executive Summary

This document researches how remote-store should extend its configuration surface beyond the current RegistryConfig.from_dict(). The three backlog ideas under review are:

ID	Proposal	Dependency impact
ID-002	`RegistryConfig.from_yaml(path)`	Optional: `pyyaml` or `ruamel.yaml`
ID-003	Pydantic `BaseSettings` integration	Optional: `pydantic-settings`
ID-005	`RegistryConfig.from_toml(path)`	Zero on 3.11+; optional `tomli` on 3.10

A cross-cutting theme is that a single backend technology often needs multiple configurations — the same S3 bucket accessed with different credentials, or the same Azure account using account-key in production and connection-string in CI. The config system must support this naturally without forcing users to duplicate boilerplate.

12-Factor framing. The three proposed loaders map cleanly onto the file-vs-environment spectrum from The Twelve-Factor App, Factor III ("Store config in the environment"):

Loader	12-Factor alignment	Primary audience
`from_toml()` / `from_yaml()`	Low (file-based)	Libraries, scripts, local dev
Pydantic `BaseSettings` adapter	High (env-var native)	Services, containers, 12-factor apps
Code-level `RegistryConfig()`	N/A (explicit code)	Complex credentials, tests, notebooks

This tension between file-based configuration (developer ergonomics, VCS-friendly) and environment-variable configuration (deploy-time flexibility, secret safety) is the central design question. ADR-0002 resolves it by keeping all three as pre-processing steps that produce a single, immutable RegistryConfig — the Registry never merges sources at runtime.

Headline findings:

from_toml() (ID-005) is the lowest-friction, highest-value addition — zero runtime dependency on 3.11+, aligns with Python packaging conventions, and the TOML structure maps cleanly to the existing dict schema.
from_yaml() (ID-002) is straightforward but adds an optional dependency. pyyaml is the pragmatic choice (ubiquitous, simple); ruamel.yaml is technically superior but heavier.
Pydantic BaseSettings (ID-003) is the most complex but enables env-var binding, .env file loading, and type validation. It serves a different user segment (framework-heavy apps like FastAPI) and should be designed as an adapter, not a replacement for the core config model.
All three loaders are thin translation layers over from_dict(). The core config model (BackendConfig, StoreProfile, RegistryConfig) does not change.

2. Current State

2.1 Config model

Three frozen dataclasses in src/remote_store/_config.py:

RegistryConfig
├── backends: dict[str, BackendConfig]
│   └── BackendConfig(type: str, options: dict[str, object])
└── stores: dict[str, StoreProfile]
    └── StoreProfile(backend: str, root_path: str, options: dict[str, object])

2.2 Loading path

RegistryConfig.from_dict(data) is the only loader. It expects:

{
    "backends": {
        "<name>": {"type": "<type>", "options": {<kwargs>}},
    },
    "stores": {
        "<name>": {"backend": "<backend-name>", "root_path": "<prefix>"},
    },
}

The Registry instantiates backends via factory(**cfg.options) — a direct kwarg splat. This means options keys must exactly match constructor parameter names.

2.3 ADR-0002: No merging

Config-as-code has absolute priority. No env-var merging, no layering. If RegistryConfig is provided, it is used exclusively. This is a deliberate design decision for determinism and test safety.

Implication for this research: All three loaders must produce a complete RegistryConfig. We do not layer TOML + env vars + defaults. Users who want env-var injection do it before constructing the config (or use the Pydantic adapter which handles this in its own layer, yielding a final RegistryConfig that is then used exclusively).

3. External Landscape Survey

Before proposing solutions, we survey how existing Python libraries and frameworks handle configuration for storage and application settings. This establishes prior art and justifies where remote-store should align with, diverge from, or defer to ecosystem conventions.

3.1 fsspec: the closest analog

fsspec is the Python ecosystem's abstract filesystem interface. It is the closest analog to remote-store: multiple backends, credential management, used by pandas, dask, and xarray.

fsspec's config approach:

fsspec.config.conf — A global nested dict keyed by protocol (s3, gcs, abfs) that supplies default kwargs to any filesystem constructor. This is the "config-as-defaults" pattern.
storage_options pass-through — Every fsspec.filesystem("s3", **opts) call accepts kwargs that override global config. This two-tier design (global defaults + per-call overrides) was adopted by pandas read_parquet(storage_options=...), dask, xarray, and PyArrow.
set_conf_files() / set_conf_env(): read JSON and INI files from the config directory (by default ~/.config/fsspec/) and env vars like FSSPEC_S3_KEY.
No TOML/YAML loader: fsspec config files are JSON or INI only.

Key differences from remote-store:

Aspect	fsspec	remote-store
Config model	Global mutable dict	Immutable frozen dataclasses
Layering	Global defaults + per-call overrides	No merging (ADR-0002)
Backend identity	Protocol string (`"s3"`)	User-chosen name (`"s3-prod"`)
Multiple configs per protocol	Not supported natively	First-class (§4.2)
File format	JSON and INI	TOML, YAML, Pydantic (proposed)

Relevance to remote-store: fsspec's storage_options convention is table stakes for interoperability with the data ecosystem. Even though remote-store uses a different config model, we should document how users can bridge between storage_options dicts and BackendConfig.options. This is a documentation concern, not a code change — BackendConfig.options already is a dict of kwargs, so storage_options dicts are often directly usable as options. See §8 (Cross-Cutting Concerns) for the compatibility note.
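The bridge described above can be sketched in a few lines. This is an illustration under the assumption that the option names of the target backend coincide with the keys in the `storage_options` dict; `backend_entry_from_storage_options` is a hypothetical helper, not an existing remote-store function.

```python
def backend_entry_from_storage_options(
    backend_type: str, storage_options: dict[str, object]
) -> dict[str, object]:
    """Wrap an fsspec-style storage_options dict as one `backends` entry."""
    # Copy so later mutation of the caller's dict cannot leak into the config.
    return {"type": backend_type, "options": dict(storage_options)}


storage_options = {
    "key": "minioadmin",
    "secret": "minioadmin",
    "endpoint_url": "http://localhost:9000",
}

# bucket is required by the S3 backend (section 4.1) but is not part of
# a typical storage_options dict, so it is added explicitly here.
entry = backend_entry_from_storage_options("s3", {**storage_options, "bucket": "dev"})

config_dict = {
    "backends": {"s3-dev": entry},
    "stores": {"scratch": {"backend": "s3-dev", "root_path": "scratch"}},
}
# config_dict has the shape that RegistryConfig.from_dict() expects.
```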

3.2 Competing storage abstraction libraries

Library	Config approach	Credential handling	File-based config
Apache libcloud	Provider-specific drivers with explicit credential passing. Connection objects constructed in code.	Explicit kwargs only	No
cloudpathlib	`Client` objects wrapping cloud SDK clients.	Delegated entirely to underlying SDK (boto3, google-cloud-storage)	No
smart_open	`transport_params` dicts (analogous to `storage_options`)	Delegated to underlying SDK	No

Pattern across all: Storage abstraction libraries delegate credential management to the underlying SDK and focus on clean pass-through configuration. None provide their own config file format — they all accept dicts/kwargs and let users construct them however they wish.

Implication for remote-store: This validates the "thin translation layer" design. from_toml() and from_yaml() are file-to-dict loaders that feed into from_dict() — exactly the pattern the ecosystem expects. We are not building a config management framework; we are providing format-specific convenience for dict construction.

3.3 Hydra / OmegaConf: state-of-the-art config

Meta's Hydra + OmegaConf is the most sophisticated config system in the Python ML ecosystem:

Config groups — Organizing configs by concern (db, server, logging) and composing via command-line overrides.
Variable interpolation — ${backend.bucket} references within configs.
Structured configs — Pydantic-like validation via dataclasses.
Override grammar — +backend=s3 to select, ~backend to remove.

Hydra's config groups are directly analogous to remote-store's backends/stores split. However, Hydra targets ML experiment management with complex composition needs. remote-store's config is structurally simple (two flat dicts of typed entries) and doesn't need interpolation, overrides, or config groups.

Why we don't adopt Hydra's approach: The complexity budget doesn't justify it. Hydra adds omegaconf, hydra-core, and a CLI framework. remote-store's config is a two-level nested dict — TOML/YAML handle this natively. Users who do use Hydra can trivially convert OmegaConf dicts to plain dicts via OmegaConf.to_container() and pass them to from_dict().

3.4 dynaconf: Python settings management

dynaconf is a mature library specifically for settings management:

Multiple file formats — TOML, YAML, JSON, INI, .env
Layered environments — [default], [development], [production]
Env-var overrides — DYNACONF_ prefix convention
Vault integration — HashiCorp Vault, Redis
Framework extensions — Django and Flask

dynaconf solves much of what the Pydantic adapter (ID-003) addresses: env-var binding, multi-format file loading, and secrets integration. The question is whether remote-store should recommend dynaconf as an integration pattern rather than building a bespoke Pydantic adapter.

Assessment: dynaconf is powerful but opinionated — it manages settings globally, uses its own merge semantics, and has a learning curve. The Pydantic adapter approach is lighter: users who already use pydantic-settings (common in FastAPI/Django) get integration for free via model_dump() → from_dict(). Users who prefer dynaconf can use it the same way: settings.as_dict() → from_dict(). We should document dynaconf as a supported integration path in the Pydantic adapter docs, not build a dedicated adapter.

3.5 Airflow's env-var override convention

Apache Airflow uses a .cfg (INI) file with a widely-copied env-var override convention:

AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql://...
AIRFLOW__SMTP__SMTP_HOST=smtp.example.com

The double-underscore convention (SECTION__KEY) maps directly to the INI section/key hierarchy. Prefix-based env-var overrides of this kind are common across Python data tooling, although the exact delimiter convention varies by project.

Comparison with the Pydantic adapter proposal (§7.4): The Pydantic adapter proposes RS_BACKENDS__S3_PROD__OPTIONS__BUCKET=prod-data, which has four segments and is verbose. Airflow's convention works because its config is two levels deep (section + key). remote-store's config needs four levels (backends/stores → name → options → key), making flat env vars inherently unwieldy.

Implication: For env-var-heavy deployments, the Pydantic adapter should document a flatter alternative pattern:

# Instead of deeply nested env vars:
RS_BACKENDS__S3_PROD__OPTIONS__BUCKET=prod-data

# Consider per-backend env prefix pattern:
RS_S3_PROD_BUCKET=prod-data
RS_S3_PROD_REGION=eu-central-1

This requires a custom Pydantic model per deployment but is far more ergonomic. We should document both approaches with trade-offs.
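For comparison, the per-backend prefix pattern can also be approximated with the standard library alone. The following is a minimal sketch, assuming the suffix after the prefix matches the backend's option name once lowercased; `options_from_env` is a hypothetical helper, and the variable names are examples only.

```python
import os
from collections.abc import Mapping


def options_from_env(prefix: str, environ: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect PREFIX_* variables into a lowercase-keyed options dict."""
    options: dict[str, str] = {}
    for name, value in environ.items():
        if name.startswith(prefix):
            # RS_S3_PROD_REGION_NAME -> region_name
            options[name[len(prefix):].lower()] = value
    return options


# A plain dict stands in for the process environment in this example.
fake_env = {
    "RS_S3_PROD_BUCKET": "prod-data",
    "RS_S3_PROD_REGION_NAME": "eu-central-1",
    "UNRELATED": "ignored",
}
backend = {"type": "s3", "options": options_from_env("RS_S3_PROD_", fake_env)}
```

All values arrive as strings, so numeric options such as `port` or `timeout` would need explicit conversion; that type coercion is the part a Pydantic model provides for free.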

3.6 Django and Flask configuration lessons

The two most popular Python web frameworks have decades of configuration experience worth studying:

Django:

settings.py-as-Python-module was deliberate: config IS code.
The ecosystem split into three camps: django-environ (12-factor, env vars), django-configurations (class-based inheritance), django-split-settings (modular files).
Key lesson: the community never converged on one approach. Different deployment models need different config patterns.

Flask:

app.config offers five explicit loading methods: from_envvar(), from_pyfile(), from_object(), from_file() (TOML/JSON with a loader callable), from_mapping().
Flask 2.0+ added from_file() with a load parameter, which is exactly the "thin translation layer" pattern this research proposes.
Key lesson: every successful framework supports multiple config sources but does NOT merge automatically. The user explicitly chains load calls.

Validation for ADR-0002: Flask's approach — explicit, user-controlled loading with no automatic merging — is precisely what ADR-0002 mandates and what our proposed from_toml() / from_yaml() / Pydantic adapter implements. This is not a novel design; it is the proven pattern in battle-tested frameworks.

4. Backend Configuration Landscape

Understanding the full configuration surface per backend is essential for evaluating how well each format and loader handles real-world configs.

4.1 Configuration options by backend

Backend	Type	Required	Optional	Sensitive
Local	`"local"`	`root`	—	—
Memory	`"memory"`	—	—	—
S3	`"s3"`	`bucket`	`key`, `secret`, `region_name`, `endpoint_url`, `client_options`	`key`, `secret`
S3-PyArrow	`"s3-pyarrow"`	`bucket`	`key`, `secret`, `region_name`, `endpoint_url`, `client_options`	`key`, `secret`
SFTP	`"sftp"`	`host`	`port`, `username`, `password`, `pkey`, `base_path`, `host_key_policy`, `known_host_keys`, `host_keys_path`, `config`, `timeout`, `connect_kwargs`	`password`, `pkey`
Azure	`"azure"`	`container` + one of (`account_name`, `account_url`, `connection_string`)	`account_key`, `sas_token`, `credential`, `client_options`	`account_key`, `sas_token`, `connection_string`, `credential`

4.2 Multiple configs per backend technology

A single project commonly needs multiple backend configs of the same type with different credentials or endpoints. Examples:

# Same S3 technology, different access patterns
backends:
  s3-prod:       {type: s3, options: {bucket: prod-data, region_name: eu-central-1}}
  s3-analytics:  {type: s3, options: {bucket: analytics, key: AKIA..., secret: ...}}
  s3-minio-dev:  {type: s3, options: {bucket: dev, endpoint_url: http://localhost:9000, key: minioadmin, secret: minioadmin}}

# Same Azure technology, different auth methods
backends:
  az-prod:       {type: azure, options: {container: prod, account_name: acme}}          # DefaultAzureCredential
  az-ci:         {type: azure, options: {container: test, connection_string: "..."}}     # Connection string
  az-readonly:   {type: azure, options: {container: prod, account_name: acme, sas_token: "sv=..."}}

# SFTP to different hosts
backends:
  sftp-vendor-a: {type: sftp, options: {host: files.vendor-a.com, username: upload, password: "..."}}
  sftp-vendor-b: {type: sftp, options: {host: sftp.vendor-b.io, username: etl, pkey: <PKey>}}

Multiple stores then map to these backends:

stores:
  raw-events:    {backend: s3-prod,       root_path: events/raw}
  aggregates:    {backend: s3-analytics,   root_path: agg/v2}
  dev-scratch:   {backend: s3-minio-dev,  root_path: scratch}
  invoices:      {backend: az-prod,        root_path: invoices/2026}
  test-fixtures: {backend: az-ci,          root_path: fixtures}
  vendor-a-drop: {backend: sftp-vendor-a,  root_path: /incoming}
  vendor-b-drop: {backend: sftp-vendor-b,  root_path: /data/drop}

Key design requirement: The config format must allow an arbitrary number of backend entries of the same type, each with its own credential set. This is already supported by the current dict schema (backends are keyed by user-chosen names, not by type), and all three file formats handle this naturally.
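The point that names, not types, are the keys can be illustrated directly on the dict schema. This is a small sketch using entries from the examples above; the grouping and the dangling-reference check are illustrative only and do not describe remote-store's actual validation.

```python
from collections import defaultdict

config = {
    "backends": {
        "s3-prod": {"type": "s3", "options": {"bucket": "prod-data"}},
        "s3-minio-dev": {
            "type": "s3",
            "options": {"bucket": "dev", "endpoint_url": "http://localhost:9000"},
        },
        "az-prod": {"type": "azure", "options": {"container": "prod", "account_name": "acme"}},
    },
    "stores": {
        "raw-events": {"backend": "s3-prod", "root_path": "events/raw"},
        "dev-scratch": {"backend": "s3-minio-dev", "root_path": "scratch"},
        "invoices": {"backend": "az-prod", "root_path": "invoices/2026"},
    },
}

# Names are the dict keys; the type is just a field, so it may repeat freely.
names_by_type: dict[str, list[str]] = defaultdict(list)
for name, backend in config["backends"].items():
    names_by_type[backend["type"]].append(name)

# Every store must point at a declared backend name.
dangling = [
    store
    for store, profile in config["stores"].items()
    if profile["backend"] not in config["backends"]
]
```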

4.3 Credential chain patterns in cloud SDKs

All three major cloud SDKs implement a credential resolution chain with well-defined priority ordering. This is the most important configuration pattern in production cloud systems and directly affects how remote-store users will manage credentials.

AWS (boto3/botocore) credential chain:

Explicit kwargs (aws_access_key_id, aws_secret_access_key)
Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
Assume role provider
Shared credential file (~/.aws/credentials)
AWS config file (~/.aws/config)
Boto2 config file (/etc/boto.cfg, ~/.boto)
Container credential provider (ECS)
Instance metadata service (IMDS) on EC2

Azure DefaultAzureCredential chain:

EnvironmentCredential
WorkloadIdentityCredential
ManagedIdentityCredential
AzureCliCredential
AzurePowerShellCredential
AzureDeveloperCliCredential

GCP Application Default Credentials:

GOOGLE_APPLICATION_CREDENTIALS env var (service account JSON path)
User credentials via gcloud auth application-default login
Attached service account (GCE, GKE, Cloud Run)

Implications for remote-store:

The credential chain pattern means that in many production deployments, no credentials appear in config at all. The backend's underlying SDK resolves credentials automatically via IAM roles, managed identity, or workload identity. This is the ideal state.

remote-store already supports this: if key/secret are omitted from S3 options, the underlying credential chain resolves automatically (S3Backend uses s3fs, which uses aiobotocore, which delegates to botocore's credential provider chain — not boto3 directly, though the chain behavior is identical). If credential is omitted from Azure options, DefaultAzureCredential() is used. The config loaders should not attempt to replicate or interfere with these chains.

Recommendation: Document the credential chain pattern explicitly in the config loader guide. Show that a minimal TOML config with no secrets is the recommended production pattern:

# Production config — credentials resolved by cloud SDK chain
[backends.s3-prod]
type = "s3"
options.bucket = "prod-data"
options.region_name = "eu-central-1"
# No key/secret — IAM role resolves automatically

Users who must put credentials in config (dev, CI, on-prem) should use the Pydantic adapter with env-var binding or the TOML/YAML + env-var injection pattern documented in §8.1.

4.4 Sensitive values and the secrets problem

The most common pain points in configuration:

Problem	Frequency	Affected backends
Secrets in config files (committed to VCS)	Very common	S3, SFTP, Azure
Different secrets per environment (dev/staging/prod)	Very common	All cloud
Non-string credentials (`pkey` is a `paramiko.PKey` object)	SFTP only	SFTP
Credential objects (`DefaultAzureCredential()`)	Azure only	Azure

Observation: TOML and YAML can express all string-serializable options, but pkey (a paramiko.PKey instance) and credential (an Azure credential object) cannot be represented in any config file format. These always require code-level construction. This is acceptable — the from_dict() / from_toml() / from_yaml() path is for the common case; complex credentials use the Python-object constructor.

Additionally, host_key_policy (HostKeyPolicy Enum with values "strict", "tofu", "auto") is string-representable but requires coercion: TOML/YAML will produce a raw string like "strict", but SFTPBackend.__init__ expects a HostKeyPolicy instance. Without coercion, factory(**cfg.options) passes the raw string through, and comparisons in _create_ssh_client() silently evaluate to False ("strict" != HostKeyPolicy.STRICT, because a plain Enum member never compares equal to its raw value unless the class also subclasses str). The implementation should add string→Enum coercion in SFTPBackend.__init__:

if isinstance(host_key_policy, str):
    host_key_policy = HostKeyPolicy(host_key_policy)

This is a pre-existing gap in SFTPBackend, not specific to config loaders, but config loaders make it a practical problem. Tracked as part of ID-039 (credential hygiene), item 4.
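The mismatch and the proposed coercion can be reproduced in isolation. The sketch below uses a stand-in `HostKeyPolicy` with the three values named above, assumed to be a plain `Enum`; `coerce_policy` is a hypothetical helper showing the same check outside the constructor.

```python
from enum import Enum


class HostKeyPolicy(Enum):
    """Stand-in with the three values named in the text (not the real class)."""
    STRICT = "strict"
    TOFU = "tofu"
    AUTO = "auto"


def coerce_policy(value: "HostKeyPolicy | str") -> HostKeyPolicy:
    # Enum lookup by value; an unknown string raises ValueError instead of
    # slipping through to later comparisons.
    return HostKeyPolicy(value) if isinstance(value, str) else value


raw = "strict"  # what a TOML/YAML loader produces
mismatch = raw == HostKeyPolicy.STRICT  # a plain Enum member is not equal to its value
policy = coerce_policy(raw)
```

Coercing at the constructor boundary also turns typos such as "strcit" into an immediate `ValueError` rather than a policy check that never matches.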

Possible mitigations for file-based configs:

pkey from PEM string: SFTP's load_private_key() can load from a PEM string. A TOML/YAML config could store pkey_pem: "-----BEGIN RSA..." and a thin post-processing step converts it. However, this is outside the scope of from_toml() / from_yaml() — those are pure dict loaders.
Secrets via env vars: The Pydantic adapter (ID-003) handles this natively. For TOML/YAML, users inject secrets before calling from_dict().
Recommendation: Document the pattern of loading TOML/YAML for structure, then overriding options with secrets from env vars / vault before constructing the RegistryConfig. Do not build env-var resolution into from_toml() / from_yaml() (ADR-0002).
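The recommended load-then-inject pattern might look like the following. It is a minimal sketch: `inject_secrets` and the `(backend_name, option_key) → env-var name` mapping are hypothetical, and a literal dict stands in for the result of parsing a TOML or YAML file.

```python
import copy
import os
from collections.abc import Mapping


def inject_secrets(
    data: dict,
    secret_env: Mapping[tuple[str, str], str],
    environ: Mapping[str, str] = os.environ,
) -> dict:
    """Return a copy of `data` with selected options filled from env vars.

    `secret_env` maps (backend_name, option_key) to an env-var name.
    """
    result = copy.deepcopy(data)  # leave the parsed file contents untouched
    for (backend, option), var in secret_env.items():
        if var not in environ:
            raise KeyError(f"Missing environment variable {var!r} for {backend}.{option}")
        result["backends"][backend].setdefault("options", {})[option] = environ[var]
    return result


# `parsed` stands for the dict returned by tomllib.load() or yaml.safe_load().
parsed = {
    "backends": {"sftp-vendor": {"type": "sftp", "options": {"host": "files.vendor.com"}}},
    "stores": {"vendor-drop": {"backend": "sftp-vendor", "root_path": "incoming"}},
}
complete = inject_secrets(
    parsed,
    {("sftp-vendor", "password"): "VENDOR_PASSWORD"},
    environ={"VENDOR_PASSWORD": "s3cr3t"},  # explicit mapping instead of os.environ
)
# `complete` can now be passed to RegistryConfig.from_dict().
```

Because the injection happens before the config object exists, the resulting RegistryConfig is still a single complete source, which keeps the pattern within ADR-0002.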

5. ID-005: `from_toml()` (TOML Config Loader)

5.1 Why TOML

stdlib on 3.11+: tomllib is built-in since Python 3.11 (PEP 680). tomli is the compatible backport for 3.10.
Python ecosystem alignment: pyproject.toml is the standard for project config. Tools like pytest, mypy, ruff, black all use TOML.
Strict typing: TOML distinguishes strings, integers, booleans, arrays, and tables — unlike YAML, there are no ambiguous value types.
Read-only is fine: tomllib is read-only by design. We only need to read config.

5.2 Dependency strategy

# Compatibility shim (standard pattern)
try:
    import tomllib
except ModuleNotFoundError:
    import tomli as tomllib  # type: ignore[no-redef]

Python version	Module	Dependency
3.11+	`tomllib` (stdlib)	None
3.10	`tomli` (backport)	Optional extra

Since remote-store targets >=3.10, the optional extra would be:

[project.optional-dependencies]
toml = ["tomli>=1.1.0; python_version < '3.11'"]

Alternatively, since tomli is tiny (~3 KB) and pure Python, it could be a hard dependency for 3.10 users without an extra. But the extra approach is more consistent with our "zero core dependencies" philosophy.

5.3 TOML schema

Natural mapping from the existing dict schema:

# remote-store.toml (standalone) or [tool.remote-store] in pyproject.toml

[backends.local]
type = "local"
options.root = "/data/store"

[backends.s3-prod]
type = "s3"

[backends.s3-prod.options]
bucket = "prod-data"
region_name = "eu-central-1"
# key and secret intentionally omitted — use IAM role or inject at runtime

[backends.s3-dev]
type = "s3"

[backends.s3-dev.options]
bucket = "dev-data"
endpoint_url = "http://localhost:9000"
key = "minioadmin"
secret = "minioadmin"

[backends.azure]
type = "azure"

[backends.azure.options]
container = "my-container"
account_name = "mystorageaccount"

[stores.raw-events]
backend = "s3-prod"
root_path = "events/raw"

[stores.scratch]
backend = "s3-dev"
root_path = "scratch"

[stores.documents]
backend = "azure"
root_path = "documents"

[stores.local-cache]
backend = "local"
root_path = "cache"

This maps 1:1 to the dict that from_dict() already accepts.

5.4 Proposed API

@classmethod
def from_toml(
    cls,
    path: str | Path,
    *,
    table: tuple[str, ...] = (),
) -> RegistryConfig:
    """Load config from a TOML file.

    :param path: Path to the TOML file.
    :param table: Keys of the nested tables to descend into, outermost first.
        For pyproject.toml use ``table=("tool", "remote-store")``.
    """

The table parameter enables reading from a nested table, which is essential for pyproject.toml usage:

# Standalone file
config = RegistryConfig.from_toml("remote-store.toml")

# From pyproject.toml
config = RegistryConfig.from_toml("pyproject.toml", table=("tool", "remote-store"))

5.5 Implementation sketch

@classmethod
def from_toml(cls, path: str | Path, *, table: tuple[str, ...] = ()) -> RegistryConfig:
    try:
        import tomllib
    except ModuleNotFoundError:
        try:
            import tomli as tomllib  # type: ignore[no-redef]
        except ModuleNotFoundError:
            raise ModuleNotFoundError(
                "TOML support requires tomli on Python < 3.11. "
                "Install it with: pip install 'remote-store[toml]'"
            ) from None

    with open(path, "rb") as f:
        data = tomllib.load(f)

    for key in table:
        if not isinstance(data, dict) or key not in data:
            raise KeyError(f"Table key {key!r} not found in {path}")
        data = data[key]

    return cls.from_dict(data)

~15 lines of logic. Delegates entirely to from_dict().
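The table-descent loop is the only non-trivial step, so it is worth seeing on its own. In this sketch `extract_table` is a hypothetical standalone version of that loop, and a literal dict stands in for a parsed pyproject.toml so the example runs without a file.

```python
def extract_table(data: dict, table: tuple[str, ...]) -> dict:
    """Walk nested TOML tables key by key, as the loop in the sketch does."""
    for key in table:
        if not isinstance(data, dict) or key not in data:
            raise KeyError(f"Table key {key!r} not found")
        data = data[key]
    return data


# Shape of a parsed pyproject.toml containing a [tool.remote-store] table.
pyproject = {
    "project": {"name": "demo"},
    "tool": {
        "remote-store": {
            "backends": {"local": {"type": "local", "options": {"root": "/data/store"}}},
            "stores": {"local-cache": {"backend": "local", "root_path": "cache"}},
        }
    },
}

section = extract_table(pyproject, ("tool", "remote-store"))
whole = extract_table(pyproject, ())  # empty tuple: use the document root
```

Using a tuple rather than a dotted string avoids ambiguity for table names that themselves contain dots.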

5.6 Assessment

Criterion	Rating	Notes
Implementation effort	Very low	~15 lines wrapping `from_dict()`
Dependency cost	Zero on 3.11+; `tomli` on 3.10	Aligns with zero-dep philosophy
User demand	High	TOML is the standard Python config format
Risk	Very low	Pure translation layer, no new semantics
Multi-backend support	Natural	TOML tables map cleanly to nested dicts

6. ID-002: `from_yaml()` (YAML Config Loader)

6.1 Why YAML

Familiar: Widely used for application config (Kubernetes, Ansible, Docker Compose, etc.).
Readable: More compact than TOML for deeply nested structures.
Comments: YAML supports inline comments (like TOML, unlike JSON).

6.2 Library comparison

Feature	PyYAML	ruamel.yaml
YAML spec	1.1	1.2
Comment preservation	No	Yes
Round-trip editing	No	Yes
Safety defaults	`yaml.load()` can construct arbitrary objects with an unsafe loader; 6.0+ requires an explicit `Loader`	Safer
Install size	Small	Larger
PyPI downloads	~300M/month	~2.5M/month
API simplicity	Simple	More complex

Recommendation: pyyaml. We only need read-only parsing of config files. We do not need comment preservation or round-trip editing. pyyaml is ubiquitous (likely already installed in most environments), simpler, and well-tested. The YAML 1.1 vs 1.2 differences (yes/no as booleans) rarely matter for our config schema, since option values are explicit strings, numbers, or dicts; the exceptions are covered in §6.6.

However, we should accept either library — users who have ruamel.yaml installed should be able to use it. The import strategy:

try:
    from yaml import safe_load  # pyyaml
except ImportError:
    try:
        from ruamel.yaml import YAML
        _yaml = YAML(typ="safe")
        safe_load = _yaml.load  # ruamel.yaml
    except ImportError:
        safe_load = None

6.3 YAML schema

# remote-store.yaml
backends:
  s3-prod:
    type: s3
    options:
      bucket: prod-data
      region_name: eu-central-1

  s3-dev:
    type: s3
    options:
      bucket: dev-data
      endpoint_url: "http://localhost:9000"
      key: minioadmin
      secret: minioadmin

  azure:
    type: azure
    options:
      container: my-container
      account_name: mystorageaccount

  sftp-vendor:
    type: sftp
    options:
      host: files.vendor.com
      port: 22
      username: etl
      password: "${VENDOR_PASSWORD}"  # user resolves before loading
      base_path: /incoming
      timeout: 30

stores:
  raw-events:
    backend: s3-prod
    root_path: events/raw

  scratch:
    backend: s3-dev
    root_path: scratch

  documents:
    backend: azure
    root_path: documents

  vendor-drop:
    backend: sftp-vendor
    root_path: incoming

Again, maps 1:1 to the dict schema.

6.4 Proposed API

@classmethod
def from_yaml(
    cls,
    path: str | Path,
) -> RegistryConfig:
    """Load config from a YAML file.

    :param path: Path to the YAML file.
    :raises ModuleNotFoundError: If neither pyyaml nor ruamel.yaml is installed.
    """

Simpler than TOML — no table parameter needed because YAML files are typically standalone (no pyproject.yaml convention).

API asymmetry note: This creates an asymmetry with from_toml(table=...). YAML files are sometimes embedded in larger application config bundles (Ansible vars, Helm values, multi-concern app configs). A user with remote_store: nested under a parent key must pre-process: yaml.safe_load(f)["remote_store"] → from_dict(). This is an acceptable workaround and consistent with YAML ecosystem conventions (no standard shared-file format exists). If demand emerges, a key parameter can be added later without breaking changes.

6.5 Implementation sketch

@classmethod
def from_yaml(cls, path: str | Path) -> RegistryConfig:
    try:
        from yaml import safe_load
    except ImportError:
        try:
            from ruamel.yaml import YAML
            _yaml = YAML(typ="safe")
            safe_load = _yaml.load
        except ImportError:
            raise ModuleNotFoundError(
                "YAML support requires pyyaml or ruamel.yaml. "
                "Install with: pip install pyyaml"
            ) from None

    with open(path, encoding="utf-8") as f:
        data = safe_load(f)

    if not isinstance(data, dict):
        raise TypeError(f"Expected YAML mapping at top level, got {type(data).__name__}")

    return cls.from_dict(data)

~20 lines. Delegates to from_dict().

6.6 YAML pitfalls for config files

Pitfall	Impact on remote-store	Mitigation
`yes`/`no`/`on`/`off` parsed as booleans (YAML 1.1)	Port numbers like `port: 22` are fine; string values that happen to match YAML boolean literals would be silently coerced. Import precedence interaction: since pyyaml (YAML 1.1) takes priority over ruamel.yaml (YAML 1.2) in our import chain (§6.2), a user who specifically installed ruamel.yaml expecting YAML 1.2 strictness (no implicit boolean coercion) will get pyyaml behavior silently if both are installed. The implementation spec should document this precedence prominently in the `from_yaml()` docstring and consider adding a `parser` parameter (e.g., `parser="ruamel"`) for users who need YAML 1.2 semantics.	Document: always quote string values that could be ambiguous. Document import precedence in `from_yaml()` docstring.
Enum values loaded as raw strings	`host_key_policy: "strict"` loads as a Python `str`, but `SFTPBackend` expects a `HostKeyPolicy` Enum. The `from_dict()` → `factory(**options)` pipeline passes strings through without coercion, causing silent failures (see §4.4).	Implement string→Enum coercion in `SFTPBackend.__init__`
The Norway problem — bare `NO` is parsed as `false` in YAML 1.1	Country codes, region names, or any short string matching YAML 1.1 boolean literals (`NO`, `YES`, `ON`, `OFF`) silently become booleans. This caused real bugs in npm package country-code lists and GitHub Actions workflows.	Always quote string values. This is a concrete argument for TOML's stricter typing — TOML has no implicit boolean coercion, making it the safer default for config files.
Indentation errors silently change structure	Could produce malformed config	`from_dict()` validation catches invalid structures
Implicit typing of plain scalars (no type-explicit syntax as in TOML)	Numbers and booleans auto-convert, which is actually desirable for our schema	Non-issue
`yaml.load()` is unsafe	Remote code execution if using untrusted input	Always use `safe_load()` — enforced in our implementation
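Beyond quoting, silent boolean coercion can be surfaced after loading. The following is a rough sketch of such a check; `find_bool_values` is a hypothetical helper, and the input dict imitates what a YAML 1.1 parser would return for an unquoted `username: NO`.

```python
def find_bool_values(node: object, path: tuple[str, ...] = ()) -> list[str]:
    """Return dotted paths of every bool found in a nested mapping."""
    if isinstance(node, bool):
        return [".".join(path)]
    if isinstance(node, dict):
        hits: list[str] = []
        for key, value in node.items():
            hits.extend(find_bool_values(value, path + (str(key),)))
        return hits
    return []


# Result of parsing `username: NO` under YAML 1.1 rules: the string became False.
loaded = {
    "backends": {
        "sftp-vendor": {
            "type": "sftp",
            "options": {"host": "files.vendor.com", "port": 22, "username": False},
        },
    },
    "stores": {"vendor-drop": {"backend": "sftp-vendor", "root_path": "incoming"}},
}
suspicious = find_bool_values(loaded)
```

The check cannot tell an intended boolean from a coerced string, so its output is a list for human review rather than an error condition.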

6.7 Assessment

Criterion	Rating	Notes
Implementation effort	Very low	~20 lines wrapping `from_dict()`
Dependency cost	Optional `pyyaml`	Ubiquitous, likely already installed
User demand	Medium	YAML is common but less so in Python-native tooling
Risk	Low	`safe_load()` mitigates security; `from_dict()` validates
Multi-backend support	Natural	YAML mappings are dicts

7. ID-003: Pydantic `BaseSettings` Integration

7.1 Why Pydantic

Pydantic BaseSettings (from pydantic-settings) provides:

Env-var binding: Fields automatically populate from environment variables.
.env file support: Load from .env files.
Type validation: Constructor-time validation with clear error messages.
Nested model support: env_nested_delimiter for APP__DB__HOST=....
Built-in file sources: TomlConfigSettingsSource, YamlConfigSettingsSource, JsonConfigSettingsSource.
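Source priority customization: the default order is init kwargs > env > .env > secrets dir > defaults (CLI arguments come first when CLI parsing is enabled); file sources are added and reordered via settings_customise_sources().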
Docker secrets: secrets_dir='/run/secrets'.

This is the go-to configuration approach for FastAPI, Django, and other framework-heavy Python applications. As of March 2026, pydantic-settings v2.13+ supports Python 3.10–3.14.

7.2 Design challenge: ADR-0002 tension

ADR-0002 says "no merging, no env var overrides." Pydantic BaseSettings is built for merging and env var overrides. These appear to conflict.

Resolution: The Pydantic adapter operates in its own layer. It merges env vars, .env files, and config files to produce a final RegistryConfig. Once that RegistryConfig is constructed, ADR-0002 applies — the Registry uses it exclusively with no further merging. The Pydantic layer is user-side glue, not core library behavior.

User's Pydantic model (merges env + .env + files)
    ↓ produces
RegistryConfig (immutable, no further merging)
    ↓ used by
Registry (ADR-0002 applies here)

This is consistent with ADR-0002's note: "those users can build their own config loader and pass RegistryConfig." The Pydantic adapter is exactly that — a pre-built config loader that users opt into.

7.3 Proposed design: adapter, not replacement

The Pydantic integration should be an adapter module (e.g., remote_store.ext.pydantic or a top-level helper), not a modification to the core config model. The core remains pure dataclasses with zero dependencies.

Option A: Pydantic models that produce RegistryConfig

# remote_store/ext/pydantic.py (or remote_store/_pydantic.py)
from pydantic_settings import BaseSettings, SettingsConfigDict

from remote_store import RegistryConfig  # assumed public import path

class S3Options(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="RS_S3_")

    bucket: str
    key: str | None = None
    secret: str | None = None
    region_name: str | None = None
    endpoint_url: str | None = None

class AzureOptions(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="RS_AZURE_")

    container: str
    account_name: str | None = None
    account_key: str | None = None
    sas_token: str | None = None
    connection_string: str | None = None

class SFTPOptions(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="RS_SFTP_")

    host: str
    port: int = 22
    username: str | None = None
    password: str | None = None
    base_path: str = "/"
    timeout: int = 10

class RemoteStoreSettings(BaseSettings):
    """Pydantic settings that produces a RegistryConfig."""

    def to_registry_config(self) -> RegistryConfig:
        ...

SecretStr for credential awareness. Documented Pydantic example models should use pydantic.SecretStr for sensitive fields (key, secret, password, account_key, sas_token, connection_string). SecretStr provides no real security boundary — the plain value is accessible via .get_secret_value() — but it prevents accidental exposure in logs, repr(), and serialization output. More importantly, it signals to users that these fields contain secrets and should be treated with care. The to_registry_config() method would call .get_secret_value() when building the options dict.

Option B: Generic converter from any Pydantic model¶

def pydantic_to_registry_config(settings: BaseModel) -> RegistryConfig:
    """Convert a Pydantic model to RegistryConfig.

    Expects the model to have 'backends' and 'stores' fields
    matching the RegistryConfig schema.
    """
    return RegistryConfig.from_dict(settings.model_dump())

SecretStr interaction warning: If the user's Pydantic model uses SecretStr for credential fields (as recommended in Option A above), model_dump() does not unwrap them. In the default Python mode the dumped dict still contains SecretStr objects, and mode="json" replaces them with a masked "**********" string. Passing that dict straight to from_dict() would therefore hand SecretStr objects (or masked placeholders) to backend constructors that expect plain strings. The converter has to call .get_secret_value() somewhere, and from that point on SecretStr's accidental-exposure protection no longer applies. The implementation should either:
- Unwrap as late as possible, for example with a custom serializer that preserves wrapping until BackendConfig.options is built, or
- Document that the SecretStr → plain string conversion is intentional at this boundary (secrets must be plain strings for backend constructors), or
- Accept the trade-off: SecretStr protects against accidental repr()/log exposure in user code, and the to_registry_config() call is an explicit "I'm done configuring, build the registry" boundary where exposure is expected.

The third option is the pragmatic choice — document it explicitly.
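
One way to make that boundary explicit is a small unwrapping step applied to the dumped dict before it reaches from_dict(). The sketch below is an assumption about how such a helper could look, not existing remote-store code. It is duck-typed on get_secret_value(), the method pydantic.SecretStr exposes, so it runs without Pydantic installed; FakeSecret is a stand-in used only for the demonstration. Values that are already plain strings pass through unchanged.

```python
from typing import Any


def unwrap_secrets(value: Any) -> Any:
    """Recursively replace secret wrappers with their plain values.

    Anything exposing get_secret_value() is unwrapped; dicts and lists are
    walked; every other value is returned as-is.
    """
    if hasattr(value, "get_secret_value"):
        return value.get_secret_value()
    if isinstance(value, dict):
        return {key: unwrap_secrets(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [unwrap_secrets(item) for item in value]
    return value


class FakeSecret:
    """Stand-in for pydantic.SecretStr so the sketch runs without pydantic."""

    def __init__(self, raw: str) -> None:
        self._raw = raw

    def get_secret_value(self) -> str:
        return self._raw

    def __repr__(self) -> str:
        return "FakeSecret('**********')"


dumped = {
    "backends": {
        "s3-prod": {"type": "s3", "options": {"key": FakeSecret("AKIA-example")}}
    }
}
plain = unwrap_secrets(dumped)  # this dict is what would be passed to from_dict()
```

Keeping the unwrap in one named function gives the documentation a single place to point at when explaining where secrets become plain strings.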

Recommendation: Option B with documented patterns¶

Option A is opinionated and hard to maintain — it pre-defines env var prefixes and field structures that may not match every user's deployment. Option B is a thin utility that users combine with their own Pydantic models. We provide documented example patterns, not rigid pre-built models.

7.4 Multi-backend configs with Pydantic¶

The key challenge with Pydantic is mapping multiple backend instances of the same type to different env var prefixes:

# How does the user configure two S3 backends via env vars?
RS_BACKENDS__S3_PROD__TYPE=s3
RS_BACKENDS__S3_PROD__OPTIONS__BUCKET=prod-data
RS_BACKENDS__S3_PROD__OPTIONS__KEY=AKIA...
RS_BACKENDS__S3_DEV__TYPE=s3
RS_BACKENDS__S3_DEV__OPTIONS__BUCKET=dev-data
RS_BACKENDS__S3_DEV__OPTIONS__KEY=AKIA...

This works with env_nested_delimiter="__" but is verbose. The Pydantic settings model:

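from typing import Any

from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

from remote_store import RegistryConfig

class BackendEntry(BaseModel):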
    type: str
    options: dict[str, Any] = {}

class StoreEntry(BaseModel):
    backend: str
    root_path: str = ""
    options: dict[str, Any] = {}

class RemoteStoreSettings(BaseSettings):
    model_config = SettingsConfigDict(
        env_prefix="RS_",
        env_nested_delimiter="__",
    )

    backends: dict[str, BackendEntry] = {}
    stores: dict[str, StoreEntry] = {}

    def to_registry_config(self) -> RegistryConfig:
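        # Option values parsed from env vars are already plain strings here.
        # A custom model that wraps them in SecretStr must unwrap them before
        # this call: this is the config→registry boundary where plain strings
        # are required.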
        return RegistryConfig.from_dict(self.model_dump())

Env vars such as RS_BACKENDS__S3_PROD__OPTIONS__BUCKET=prod-data then resolve correctly. This is a documented pattern, not library code. See the SecretStr interaction warning in §7.3 Option B for how SecretStr fields should be handled at this boundary.
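
To make the env-var-to-dict mapping concrete, the following stdlib-only sketch mimics what a prefix plus nested delimiter does to variable names. It is an illustration of the naming scheme, not pydantic-settings' implementation; the function name env_to_nested is hypothetical, and lower-casing the names is an assumption that mirrors case-insensitive matching.

```python
from typing import Any, Mapping


def env_to_nested(
    env: Mapping[str, str], prefix: str = "RS_", delimiter: str = "__"
) -> dict[str, Any]:
    """Split PREFIX_A__B__C=value names into a nested {a: {b: {c: value}}} dict."""
    nested: dict[str, Any] = {}
    for name, value in env.items():
        if not name.startswith(prefix):
            continue  # unrelated variables are ignored
        parts = name[len(prefix):].lower().split(delimiter)
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


example = env_to_nested(
    {
        "RS_BACKENDS__S3_PROD__TYPE": "s3",
        "RS_BACKENDS__S3_PROD__OPTIONS__BUCKET": "prod-data",
        "HOME": "/home/user",
    }
)
```

Under these assumptions the backend name comes out as s3_prod (lower case, with an underscore), so store entries that reference it would need to use the same spelling.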

7.5 Pydantic's built-in file sources¶

As of pydantic-settings 2.13+, users can combine env vars with TOML, YAML, and JSON files in a single BaseSettings class. This means the Pydantic adapter partially subsumes ID-002 and ID-005 for users who adopt it — but only for those users. The standalone from_toml() and from_yaml() remain valuable for users who don't want Pydantic.

7.6 Assessment¶

Criterion	Rating	Notes
Implementation effort	Medium	Adapter + documentation + examples
Dependency cost	Optional `pydantic-settings` (+ `pydantic`)	Heavy; ~5 MB
User demand	Medium-high	Strong in FastAPI/Django ecosystem
Risk	Medium	Must not violate ADR-0002 semantics
Multi-backend support	Works but verbose	`env_nested_delimiter` handles it
ADR-0002 compatibility	Compatible	Pydantic merges then produces RegistryConfig

8. Cross-Cutting Concerns¶

8.1 Secrets in config files¶

None of the three loaders should resolve secrets from env vars or vaults. This is the user's responsibility (per ADR-0002). However, we should document the common patterns with concrete examples:

Pattern	When to use	How
Inject before `from_dict()`	Simple scripts	Load TOML/YAML, replace secrets from `os.environ`, call `from_dict()`
Pydantic env-var binding	Framework apps	Pydantic resolves env vars, produces `RegistryConfig`
Config-as-code	Prod deployments	Secrets in vault, injected into Python code at app startup
`.env` + Pydantic	Local dev	`.env` file with secrets, loaded by `BaseSettings`
SOPS / sealed secrets	GitOps workflows	Encrypted config files committed to VCS, decrypted at deploy
Kubernetes Secrets	Container orchestration	Mounted as files or env vars in pods

Concrete example 1: TOML + env-var injection¶

import os
import tomllib
from remote_store import RegistryConfig

# Load structure from TOML
with open("remote-store.toml", "rb") as f:
    data = tomllib.load(f)

# Inject secrets from environment
s3_opts = data["backends"]["s3-prod"]["options"]
s3_opts["key"] = os.environ["AWS_ACCESS_KEY_ID"]
s3_opts["secret"] = os.environ["AWS_SECRET_ACCESS_KEY"]

config = RegistryConfig.from_dict(data)

Concrete example 2: HashiCorp Vault integration¶

import hvac
import tomllib
from remote_store import RegistryConfig

# Load structure from TOML
with open("remote-store.toml", "rb") as f:
    data = tomllib.load(f)

# Fetch secrets from Vault
client = hvac.Client(url="https://vault.example.com")
secret = client.secrets.kv.v2.read_secret_version(path="remote-store/s3-prod")
s3_creds = secret["data"]["data"]

data["backends"]["s3-prod"]["options"]["key"] = s3_creds["access_key"]
data["backends"]["s3-prod"]["options"]["secret"] = s3_creds["secret_key"]

config = RegistryConfig.from_dict(data)

Concrete example 3: SOPS-encrypted config¶

SOPS (Secrets OPerationS), originally developed at Mozilla and now maintained under the getsops organization, encrypts config files so they can be committed to VCS. At deploy time, the file is decrypted and loaded:

# Encrypt a config file (one-time)
sops --encrypt remote-store.toml > remote-store.enc.toml

# At deploy time, decrypt and load
sops --decrypt remote-store.enc.toml > /tmp/remote-store.toml

# In application code — identical to normal TOML loading
config = RegistryConfig.from_toml("/tmp/remote-store.toml")

Real-world secrets infrastructure¶

Tool	How it works	Integration with remote-store
HashiCorp Vault	API-based secret storage. `hvac` Python client. Vault Agent sidecar for automatic injection.	Fetch secrets via `hvac`, inject into dict, call `from_dict()`
SOPS (Mozilla)	Encrypts YAML/JSON/TOML files in-place. Committed to VCS encrypted.	Decrypt at deploy time, load normally via `from_toml()` / `from_yaml()`
AWS Secrets Manager / SSM	`boto3` calls at startup. Often used with IAM role authentication.	Fetch via `boto3`, inject into config dict
Kubernetes Secrets	Mounted as files (`/run/secrets/`) or env vars in pods.	Use mounted file paths in config, or env vars via Pydantic adapter
Docker Secrets	Mounted at `/run/secrets/<name>`. Available in Swarm mode.	Pydantic adapter: `secrets_dir='/run/secrets'`
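
For the mounted-file rows in the table above (Kubernetes and Docker secrets), injection can follow the same "load structure, fill in secrets, call from_dict()" shape as the earlier examples. This is a minimal sketch under assumptions: the helper name inject_file_secrets, the secret file name s3_secret, and the option names are all hypothetical, and a temporary directory stands in for /run/secrets.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from typing import Any


def inject_file_secrets(
    data: dict[str, Any],
    backend: str,
    secrets_dir: str | Path,
    mapping: dict[str, str],
) -> dict[str, Any]:
    """Fill one backend's options from files under secrets_dir (in place).

    mapping: option name -> secret file name.
    """
    options = data["backends"][backend].setdefault("options", {})
    for option, filename in mapping.items():
        # Mounted secret files often end with a trailing newline; strip it.
        raw = (Path(secrets_dir) / filename).read_text(encoding="utf-8")
        options[option] = raw.strip()
    return data


# Demonstration with a temporary directory standing in for /run/secrets.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "s3_secret").write_text("example-secret\n", encoding="utf-8")
    data = {"backends": {"s3-prod": {"type": "s3", "options": {"bucket": "prod-data"}}}}
    inject_file_secrets(data, "s3-prod", tmp, {"secret": "s3_secret"})
```

The resulting dict would then be handed to RegistryConfig.from_dict() exactly as in the TOML and Vault examples.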

8.2 Non-serializable options (`pkey`, `credential`)¶

SFTP's pkey (a paramiko.PKey instance) and Azure's credential (e.g., DefaultAzureCredential()) cannot be represented in TOML, YAML, or JSON. File-based configs work for all string-serializable options; complex credential objects require code-level construction.

Acceptable trade-off: Users with complex credentials use RegistryConfig() directly or use the Pydantic adapter with a custom validator that constructs the credential object. Document both paths.

8.3 Validation and error messages¶

All three loaders delegate to from_dict(), which validates structure. The Registry constructor calls validate(), which checks backend references. Backend construction catches TypeError from invalid options and re-raises with a clear message including the provided option keys. No format-specific validation is needed — from_dict() handles structure.

Gap: unknown top-level keys are silently ignored. from_dict() uses data.get("backends", {}) and data.get("stores", {}), so a typo like backend: (singular) or store: produces an empty RegistryConfig with no error or warning. This is acceptable for programmatic use but becomes a real usability problem when loading from config files — a user's carefully written TOML/YAML silently produces nothing.

Implementation spec must address this: either warnings.warn() for unrecognized top-level keys, or a strict mode that raises ValueError. The implementation spec should include a test case for this scenario. Suggested approach: warn by default, with a strict=True parameter on from_dict() (or on from_toml() / from_yaml()) that raises instead.
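
The suggested warn-by-default behaviour could look roughly like this. It is a sketch for discussion, not the implementation: the function name check_top_level_keys and the message wording are assumptions, and the known-key set is taken from the two keys from_dict() reads today.

```python
import warnings
from typing import Any, Mapping

# The two top-level keys from_dict() currently reads.
KNOWN_TOP_LEVEL_KEYS = frozenset({"backends", "stores"})


def check_top_level_keys(data: Mapping[str, Any], strict: bool = False) -> None:
    """Warn about (or, in strict mode, reject) unrecognized top-level keys."""
    unknown = sorted(str(key) for key in set(data) - KNOWN_TOP_LEVEL_KEYS)
    if not unknown:
        return
    message = (
        f"Unrecognized top-level config keys: {unknown}; "
        f"expected only {sorted(KNOWN_TOP_LEVEL_KEYS)}"
    )
    if strict:
        raise ValueError(message)
    # stacklevel=2 points the warning at the caller rather than this helper.
    warnings.warn(message, UserWarning, stacklevel=2)
```

With this shape, a file containing backend: instead of backends: produces a visible warning by default and a ValueError under strict=True, which covers the test case the spec should include.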

8.4 Where to put the code¶

Loader	Location	Rationale
`from_toml()`	`_config.py` (classmethod on `RegistryConfig`)	Zero-dep on 3.11+, core workflow
`from_yaml()`	`_config.py` (classmethod on `RegistryConfig`)	Parallel to `from_toml()`, import-guarded
Pydantic adapter	`ext/pydantic.py`	Optional dependency, adapter pattern

from_toml() and from_yaml() belong on RegistryConfig because they are simple format loaders (like from_dict()). The Pydantic adapter is more complex and involves a separate settings model, so it fits the ext/ pattern.

8.5 fsspec `storage_options` compatibility¶

fsspec's storage_options convention is the de facto standard for passing backend configuration in the data ecosystem (pandas, dask, xarray, PyArrow). A storage_options dict for S3 looks like:

storage_options = {
    "key": "AKIA...",
    "secret": "...",
    "client_kwargs": {"endpoint_url": "http://localhost:9000"},
}

remote-store's BackendConfig.options is already a dict of kwargs splatted into the backend constructor. For S3, the constructor accepts key, secret, region_name, endpoint_url, and client_options. These overlap significantly with fsspec's storage_options, but the mapping is hierarchical, not a simple rename.

S3Backend's client_options is a pass-through dict for all s3fs.S3FileSystem kwargs, while fsspec's client_kwargs is a nested key within that dict. In _s3.py, region_name is placed into opts.setdefault("client_kwargs", {}), so client_options={"client_kwargs": {"region_name": "us-east-1"}} is valid.

For endpoint_url, s3fs accepts it as both a top-level kwarg and inside client_kwargs, with the top-level form being the documented preferred approach. remote-store accepts it as a top-level constructor option (which is then passed as a top-level kwarg to s3fs.S3FileSystem). The divergence between fsspec and remote-store is therefore smaller than it might appear: both prefer endpoint_url at the top level.

Recommendation: Do not attempt to auto-translate storage_options dicts. Instead, document the mapping between fsspec's storage_options keys and remote-store's BackendConfig.options keys for each backend. Users who work with both ecosystems (e.g., using remote-store for writes and PyArrow datasets for reads) can maintain a shared config and translate as needed.

This is a documentation concern for the config loader guide, not a code change. If demand emerges for a from_storage_options() helper, it can be added later as a thin key-mapping utility.
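
For the documentation, the S3 mapping described above could be illustrated with a thin key-mapping function. This is a hypothetical sketch only: the function name is invented, the option names are the ones listed in this section, and anything not recognized is assumed to belong in client_options.

```python
from typing import Any, Mapping


def s3_storage_options_to_backend_options(
    storage_options: Mapping[str, Any],
) -> dict[str, Any]:
    """Translate an fsspec-style S3 storage_options dict (input is not mutated)."""
    remaining = dict(storage_options)
    client_kwargs = dict(remaining.pop("client_kwargs", {}))
    options: dict[str, Any] = {}

    # Keys that exist at the top level on both sides.
    for name in ("key", "secret", "endpoint_url"):
        if name in remaining:
            options[name] = remaining.pop(name)

    # Keys fsspec configs often nest under client_kwargs; top level wins.
    for name in ("endpoint_url", "region_name"):
        if name in client_kwargs:
            options.setdefault(name, client_kwargs.pop(name))

    # Everything else is passed through untouched via client_options.
    if client_kwargs:
        remaining["client_kwargs"] = client_kwargs
    if remaining:
        options["client_options"] = remaining
    return options


translated = s3_storage_options_to_backend_options(
    {
        "key": "AKIA...",
        "secret": "...",
        "client_kwargs": {"endpoint_url": "http://localhost:9000"},
    }
)
```

A table in the guide would carry the same information; the code form mainly makes the hierarchical part of the mapping (client_kwargs inside client_options) easier to see.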

8.6 Optional extras¶

[project.optional-dependencies]
# Existing
s3 = [...]
sftp = [...]
azure = [...]
arrow = [...]
otel = [...]

# New
toml = ["tomli>=1.1.0; python_version < '3.11'"]
yaml = ["pyyaml>=5.1"]
pydantic = ["pydantic-settings>=2.0.0"]

9. Priority and Sequencing¶

9.1 Recommended order¶

Priority	Item	Rationale
1	ID-005 `from_toml()`	Lowest cost, highest value. Zero dep on 3.11+. Natural for Python projects.
2	ID-002 `from_yaml()`	Low cost. Parallel implementation to `from_toml()`.
3	ID-003 Pydantic adapter	Higher cost, narrower audience. Can be done independently.

ID-005 and ID-002 can ship together in a single release. ID-003 is independent and can ship later.

9.2 Spec requirements¶

Per project conventions, new features require a spec in sdd/specs/. A single spec covering all three config loaders would be appropriate since they share the same config model and validation chain. Suggested invariants:

CFG-008: from_toml(path, table=()) loads config from a TOML file.
CFG-009: from_yaml(path) loads config from a YAML file.
CFG-010: Pydantic adapter converts BaseSettings to RegistryConfig.
CFG-011: All loaders produce identical RegistryConfig for equivalent input.
CFG-012: Missing optional dependency raises ModuleNotFoundError with install instructions.
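
CFG-012 implies an import guard around each optional dependency. One possible shape is sketched below; the helper name require_optional is hypothetical, and the install hint assumes the distribution is published as remote-store with the extras named in §8.6.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"{module_name!r} is required for this loader. "
            f'Install it with: pip install "remote-store[{extra}]"'
        ) from exc
```

A loader such as from_yaml() would call require_optional("yaml", "yaml") at the top of the method, so that importing remote_store itself never requires the optional package.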

10. Open Questions¶

#	Question	Candidates	Recommendation
Q1	Should `from_toml()` support reading from a `pyproject.toml` `[tool.remote-store]` table?	Yes (via `table` parameter) / No (only standalone files)	Yes — TOML's primary Python use is `pyproject.toml`. The `table` kwarg costs nothing and enables this.
Q2	Should we accept both `pyyaml` and `ruamel.yaml`?	Accept both / Only `pyyaml` / Only `ruamel.yaml`	Accept both with `pyyaml` as primary and `ruamel.yaml` as fallback.
Q3	Should the Pydantic adapter live in `ext/pydantic.py` or `_pydantic.py`?	`ext/` / top-level private	`ext/pydantic.py` — follows extension architecture (ADR-0008).
Q4	Should the Pydantic adapter provide pre-built `S3Options` etc., or just a generic converter?	Pre-built models / Generic converter + docs	Generic converter + documented patterns. Pre-built models are opinionated and maintenance-heavy.
Q5	Should `from_toml()` accept `str`, `Path`, or both?	`str` only / `Path` only / Both	Both (`str \| Path`) — consistent with Python stdlib conventions.
Q6	Should `from_yaml()` accept a `key` parameter (like `table` for TOML)?	Yes / No	No — YAML has no equivalent of `pyproject.toml` shared-file convention. Creates an API asymmetry with `from_toml(table=...)`, but the asymmetry reflects a real ecosystem difference. Users with nested YAML can use `yaml.safe_load(f)["remote_store"]` → `from_dict()`. A `key` parameter can be added later without breaking changes if demand emerges.
Q7	Should we add `from_json()` while we're at it?	Yes / No	No — JSON has no comments, is less readable, and `from_dict(json.load(f))` is a one-liner. Not worth a dedicated method.
Q8	Should there be automatic config file discovery (e.g., `RegistryConfig.from_default()` searching `~/.config/remote-store/`, `pyproject.toml`, etc.)?	Yes / No	No — ADR-0002's explicit-config philosophy means the user provides the path. Auto-discovery adds implicit behavior, magic path conventions, and platform-specific logic (XDG vs. AppDirs vs. Windows `%APPDATA%`). Other tools that do discovery (ruff, pytest) are CLI tools where implicit config lookup is expected; remote-store is a library where explicit is better. Users who want discovery can implement it in their application layer.
Q9	Should the config schema include a `version` key for future migration?	Yes (reserve `version` key) / No	No — premature. The schema maps directly to `from_dict()` which maps directly to constructor kwargs. Schema evolution would likely be additive (new keys) rather than breaking (renamed keys), so a version field adds complexity without clear benefit today. If a breaking change is ever needed, a `v2` key or a separate `from_dict_v2()` method is simpler than a version-based migration system. Acknowledge this decision in the implementation spec so it's revisited if the schema grows.
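
Taken together, the recommendations for Q1 and Q5 suggest a loader core along the following lines. This is a sketch under assumptions, not the final API: load_toml_table is a hypothetical free function (the proposal is a classmethod), the error wording is invented, and on Python 3.10 it relies on the optional tomli backport.

```python
from __future__ import annotations

import sys
from pathlib import Path
from typing import Any

if sys.version_info >= (3, 11):
    import tomllib
else:  # Python 3.10: relies on the optional tomli backport
    import tomli as tomllib


def load_toml_table(path: str | Path, table: tuple[str, ...] = ()) -> dict[str, Any]:
    """Parse a TOML file and descend into the given table path, if any."""
    with open(path, "rb") as f:  # TOML parsers require a binary file handle
        data: Any = tomllib.load(f)
    for key in table:
        if not isinstance(data, dict) or key not in data:
            raise KeyError(f"Table {'.'.join(table)!r} not found in {path}")
        data = data[key]
    return data
```

With table=("tool", "remote-store") this reads the [tool.remote-store] section of a pyproject.toml; with the default empty tuple it returns the whole document, which covers standalone config files. The returned dict is what from_dict() would receive.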

11. References¶

ADR-0002: Configuration Resolution — No Merging
Spec 002: Registry & Configuration
PEP 680: tomllib — Support for Parsing TOML in the Standard Library
pydantic-settings documentation: https://docs.pydantic.dev/latest/concepts/pydantic_settings/
tomllib documentation: https://docs.python.org/3/library/tomllib.html
PyYAML: https://pyyaml.org/
ruamel.yaml: https://yaml.dev/doc/ruamel.yaml/
The Twelve-Factor App, Factor III (Config): https://12factor.net/config
fsspec documentation: https://filesystem-spec.readthedocs.io/
Hydra documentation: https://hydra.cc/
OmegaConf documentation: https://omegaconf.readthedocs.io/
dynaconf documentation: https://www.dynaconf.com/
Apache libcloud: https://libcloud.apache.org/
cloudpathlib: https://cloudpathlib.drivendata.org/
smart_open: https://github.com/piskvorky/smart_open
SOPS (Secrets OPerationS): https://github.com/getsops/sops
HashiCorp Vault hvac client: https://hvac.readthedocs.io/
boto3 credential configuration: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
Azure Identity DefaultAzureCredential: https://learn.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential
GCP Application Default Credentials: https://cloud.google.com/docs/authentication/application-default-credentials
Flask configuration handling: https://flask.palletsprojects.com/en/stable/config/
The YAML Norway problem: https://hitchdev.com/strictyaml/why/implicit-typing-removed/