S3 Backend¶

The S3 backend stores files on Amazon S3 or any S3-compatible service (MinIO, DigitalOcean Spaces, etc.).

Installation¶

pip install "remote-store[s3]"

Usage¶

from remote_store import BackendConfig, RegistryConfig, Registry, StoreProfile

config = RegistryConfig(
    backends={
        "s3": BackendConfig(
            type="s3",
            options={
                "bucket": "my-bucket",
                "key": "AWS_ACCESS_KEY_ID",
                "secret": "AWS_SECRET_ACCESS_KEY",
                "endpoint_url": "https://s3.amazonaws.com",
            },
        ),
    },
    stores={"data": StoreProfile(backend="s3", root_path="data")},
)

with Registry(config) as registry:
    store = registry.get_store("data")
    store.write("report.csv", b"col1,col2\n1,2\n")

Options¶

Option	Type	Description
`bucket`	`str`	S3 bucket name (required)
`key`	`str`	AWS access key ID
`secret`	`str`	AWS secret access key
`region_name`	`str`	AWS region name
`endpoint_url`	`str`	Custom endpoint for S3-compatible services
`tls_ca_bundle`	`str`	Path to a PEM CA bundle file for custom/self-signed certificates
`client_options`	`dict`	Additional options passed to s3fs

Custom TLS Certificates¶

Use tls_ca_bundle when connecting to S3-compatible services with custom or self-signed certificates (e.g., on-premises MinIO):

backend = S3Backend(
    bucket="my-bucket",
    endpoint_url="https://minio.internal:9000",
    tls_ca_bundle="/etc/ssl/certs/internal-ca.pem",
)

Or via config:

backends:
  minio:
    type: s3
    options:
      bucket: my-bucket
      endpoint_url: https://minio.internal:9000
      tls_ca_bundle: /etc/ssl/certs/internal-ca.pem

If tls_ca_bundle is not set, the following environment variables are checked in order (first non-empty value wins):

Priority	Env var	Standard
1	`AWS_CA_BUNDLE`	boto3
2	`REQUESTS_CA_BUNDLE`	requests
3	`SSL_CERT_FILE`	OpenSSL

The path is validated at construction time — a ValueError is raised immediately if the file does not exist.
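
The lookup order above can be sketched with the standard library alone. `resolve_ca_bundle` and `CA_BUNDLE_ENV_VARS` are hypothetical names used for illustration, not part of the library's API, and validating a path taken from the environment is an assumption of this sketch:

```python
import os
from typing import Mapping, Optional

# Order taken from the table above; the first non-empty value wins.
CA_BUNDLE_ENV_VARS = ("AWS_CA_BUNDLE", "REQUESTS_CA_BUNDLE", "SSL_CERT_FILE")


def resolve_ca_bundle(
    tls_ca_bundle: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Hypothetical helper: return the CA bundle path to use, or None."""
    env = os.environ if env is None else env
    candidate = tls_ca_bundle
    if not candidate:
        # Fall back to the environment, skipping unset or empty variables.
        candidate = next(
            (env[name] for name in CA_BUNDLE_ENV_VARS if env.get(name)), None
        )
    if candidate is None:
        return None
    if not os.path.isfile(candidate):
        # Fail early, mirroring the construction-time check described above.
        raise ValueError(f"CA bundle not found: {candidate}")
    return candidate
```

An explicit `tls_ca_bundle` always wins in this sketch; the environment is consulted only when it is unset.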

This replaces the previous workaround of passing client_options={"client_kwargs": {"verify": "/path/to/ca.pem"}}.

Botocore Client Tuning¶

Anything beyond the first-class options above (proxies, timeouts, retry mode, S3 addressing style, pool sizes, custom User-Agent, …) is configured by passing a config_kwargs dict inside client_options. The dict is forwarded to aiobotocore.config.AioConfig(**config_kwargs), so every keyword the underlying botocore.config.Config accepts is available.

Why config_kwargs, not client_kwargs['config']

s3fs.S3FileSystem.set_session always passes config=AioConfig(**self.config_kwargs) to aiobotocore.create_client(). Setting client_kwargs['config'] in addition would duplicate the config= keyword and raise TypeError: got multiple values for keyword argument 'config'. The S3 backend spec pins the routing: every Config option flows through client_options['config_kwargs'], and a pre-built Config in client_kwargs is rejected with a clear ValueError.
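
The keyword collision can be reproduced without s3fs. In the sketch below, `create_client` and `set_session` are simplified stand-ins that only mimic the call shape described above, and `validate_client_options` is a hypothetical guard, not the backend's actual implementation:

```python
from typing import Any, Dict


def create_client(service_name: str, *, config: Any = None, **kwargs: Any) -> Dict[str, Any]:
    """Stand-in for the real client factory; records what it received."""
    return {"service": service_name, "config": config, **kwargs}


def set_session(client_kwargs: Dict[str, Any], config_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Mimics the call shape: config= is always passed explicitly."""
    return create_client("s3", config=dict(config_kwargs), **client_kwargs)


def validate_client_options(client_options: Dict[str, Any]) -> None:
    """Hypothetical guard: reject a pre-built config routed via client_kwargs."""
    if "config" in client_options.get("client_kwargs", {}):
        raise ValueError(
            "pass botocore Config options via client_options['config_kwargs'], "
            "not client_kwargs['config']"
        )


# Works: Config options travel through config_kwargs.
client = set_session({"verify": False}, {"connect_timeout": 3.0})

# Fails: config= would be supplied twice.
try:
    set_session({"config": object()}, {"connect_timeout": 3.0})
except TypeError as exc:
    duplicate_error = str(exc)
```

Rejecting the bad shape up front with a `ValueError` gives a clearer message than the `TypeError` raised deep inside the client factory.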

Disabling inherited HTTP proxies¶

Required when the host has HTTP_PROXY / HTTPS_PROXY set in the environment but the S3 endpoint is reachable directly (typical for on-premises MinIO):

backend = S3Backend(
    bucket="my-bucket",
    endpoint_url="https://s3.internal:9000",
    client_options={
        "config_kwargs": {
            "proxies": {"http": None, "https": None},
        },
    },
)

Pointing at a corporate proxy¶

backend = S3Backend(
    bucket="my-bucket",
    client_options={
        "config_kwargs": {
            "proxies": {
                "http": "http://proxy.corp:3128",
                "https": "http://proxy.corp:3128",
            },
        },
    },
)

Retry policy¶

Prefer the first-class retry=RetryPolicy(...) argument — it is portable across backends and applied via the same merge path:

backend = S3Backend(
    bucket="my-bucket",
    retry=RetryPolicy(max_attempts=5),
)

When you need mode="adaptive" or other botocore-specific retry settings, pass them through config_kwargs["retries"] and omit retry=RetryPolicy(...). RetryPolicy replaces the entire retries dict (matching botocore.Config.merge semantics), so any caller-supplied mode or other non-max_attempts field would be lost. Pick one channel:

backend = S3Backend(
    bucket="my-bucket",
    client_options={
        "config_kwargs": {
            "retries": {"max_attempts": 5, "mode": "adaptive"},
        },
    },
)
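
To see why mixing the two channels loses settings, here is a minimal sketch of a shallow, top-level merge. `merge_config_kwargs` is a hypothetical helper, and the way a retry policy maps onto the `retries` entry is an assumption made for illustration:

```python
from typing import Any, Dict, Optional


def merge_config_kwargs(
    config_kwargs: Dict[str, Any],
    retry_max_attempts: Optional[int] = None,
) -> Dict[str, Any]:
    """Shallow merge: a retry policy replaces the whole 'retries' entry."""
    merged = dict(config_kwargs)
    if retry_max_attempts is not None:
        # Top-level replacement, not a deep merge: 'mode' is dropped.
        merged["retries"] = {"max_attempts": retry_max_attempts}
    return merged


caller = {"retries": {"max_attempts": 5, "mode": "adaptive"}, "read_timeout": 10.0}
both_channels = merge_config_kwargs(caller, retry_max_attempts=3)
one_channel = merge_config_kwargs(caller)
```

With both channels in play, `both_channels["retries"]` contains only `max_attempts`; with `config_kwargs` alone, the adaptive mode survives.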

Connect / read timeouts¶

backend = S3Backend(
    bucket="my-bucket",
    client_options={
        "config_kwargs": {
            "connect_timeout": 3.0,
            "read_timeout": 10.0,
        },
    },
)

MinIO-style path addressing¶

Required for endpoints that do not support virtual-host-style bucket addressing (most on-premises MinIO deployments):

backend = S3Backend(
    bucket="my-bucket",
    endpoint_url="https://minio.internal:9000",
    key="AKIA...",
    secret="...",
    client_options={
        "config_kwargs": {
            "s3": {"addressing_style": "path"},
        },
    },
)

Putting it together¶

A realistic on-premises MinIO configuration combining the pieces above. The custom CA bundle still comes from tls_ca_bundle= or the environment variables described under Custom TLS Certificates:

backend = S3Backend(
    bucket="my-bucket",
    endpoint_url="https://s3.internal:9000",
    key="AKIA...",
    secret="...",
    retry=RetryPolicy(max_attempts=5),
    client_options={
        "config_kwargs": {
            "connect_timeout": 3.0,
            "read_timeout": 10.0,
            "s3": {"addressing_style": "path"},
            "proxies": {"http": None, "https": None},
        },
    },
)

File Metadata¶

get_file_info() and list_files() return FileInfo objects with the following fields populated by the S3 backend:

Field	Source	Notes
`etag`	`ETag` response header	Double-quotes stripped; lowercased. Example: `"abc123"` → `abc123`.
`digest`	—	Always `None`; S3 checksums require `ChecksumMode: ENABLED` (not requested by default).
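
The `etag` normalization in the table can be sketched as a small helper. `normalize_etag` is a hypothetical name used for illustration; the backend's own code may differ:

```python
def normalize_etag(raw: str) -> str:
    """Strip the surrounding double quotes and lowercase the value."""
    return raw.strip().strip('"').lower()


plain = normalize_etag('"abc123"')
upper = normalize_etag('"ABC123"')
```

Both calls produce the same string, so ETags compare equal regardless of quoting or letter case.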

Write Results¶

The S3 backend declares WRITE_RESULT_NATIVE and USER_METADATA. Write operations return a WriteResult with digest, etag, and last_modified populated from the upload response — contrast with reads, where digest is always None (see the File Metadata table above).

Pass metadata= to store custom string key-value pairs as S3 object metadata. They round-trip through get_file_info() in FileInfo.metadata.

Listing Strategies and Performance¶

S3 listing behavior differs sharply between shallow and recursive traversals. Understanding these trade-offs is critical for large buckets.

Shallow Listing (Non-Recursive)¶

Use list_files(path, recursive=False) or iter_children(path) to list only direct children:

# List direct children only
for entry in store.iter_children("data/"):
    print(entry.name)  # Files and folders one level deep

Characteristics:
- Single S3 ListObjectsV2 API call (or paginated requests if >1000 entries)
- O(n) cost where n = direct children count
- Flat cost per call, not dependent on bucket size
- Suitable for folder-first navigation (e.g., building a file browser UI)

Recursive Listing (Flat Stream)¶

Use list_files(path, recursive=True) to fetch all files under a prefix:

# Stream all files under a prefix, regardless of depth
for file_info in store.list_files("data/", recursive=True):
    process(file_info)

Why flat streaming wins:
- Internally uses S3's ListObjectsV2 pagination with a prefix, not delimiter-based folder traversal
- Single logical stream; S3 SDK handles pagination transparently
- O(n) cost where n = total objects in the prefix tree
- Avoids the O(n_folders) × (API calls + parsing overhead) of delimiter-based iteration

If you need all objects under a prefix, use recursive=True:

# ✓ Single flat stream (optimal)
for file in store.list_files("data/", recursive=True):
    process(file)

Do not implement folder-by-folder traversal:

# ❌ This makes at least one API call per folder
def traverse_folders(prefix):
    for folder in list_folders(prefix):  # Calls ListObjectsV2 with delimiter=/
        yield from list_files(folder, recursive=False)
        yield from traverse_folders(folder)  # Recursive calls per subfolder

The traversal approach costs O(n_folders) API calls, even with few total files.
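
The difference in request counts can be estimated with a small in-memory model. This is a sketch under stated assumptions: `flat_calls` and `traversal_calls` are hypothetical helpers that only count requests, assuming each ListObjectsV2 response returns at most 1000 entries; they make no network calls.

```python
import math
from typing import List, Set

PAGE_SIZE = 1000  # assumed maximum number of entries per listing response


def flat_calls(keys: List[str], prefix: str) -> int:
    """Requests needed for one prefix-only (no delimiter) listing."""
    matching = sum(1 for key in keys if key.startswith(prefix))
    return max(1, math.ceil(matching / PAGE_SIZE))


def traversal_calls(keys: List[str], prefix: str) -> int:
    """Requests needed when every folder is listed separately with a delimiter."""
    subfolders: Set[str] = set()
    direct_files = 0
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if "/" in rest:
            # Everything below the first slash collapses into one common prefix.
            subfolders.add(prefix + rest.split("/", 1)[0] + "/")
        else:
            direct_files += 1
    entries = direct_files + len(subfolders)
    own_calls = max(1, math.ceil(entries / PAGE_SIZE))
    return own_calls + sum(traversal_calls(keys, sub) for sub in subfolders)


# 200 objects spread over 200 folders, one object each.
keys = [f"data/part={i}/file.parquet" for i in range(200)]
flat = flat_calls(keys, "data/")
per_folder = traversal_calls(keys, "data/")
```

In this model the flat listing needs a single request, while the folder-by-folder walk needs one request for the root plus one per folder.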

Streaming Over Parallelization¶

For large buckets, use a single sequential flat stream, not parallel folder traversal:

# ✓ Single flat stream (optimal for large buckets)
for file in store.list_files("data/", recursive=True):
    process(file)

Why not parallelize folder traversal:

# ❌ Avoid parallel folder enumeration
from concurrent.futures import ThreadPoolExecutor

def parallel_traverse(prefix, executor):
    # Spawning threads per folder creates:
    # - O(n_folders) concurrent requests (thundering herd)
    # - Earlier rate-limiting hits (S3 per-partition limits)
    # - Thread pool overhead
    # - Loss of connection pooling benefits
    raise NotImplementedError("anti-pattern: intentionally left unimplemented")

Flat streams are superior:
- Single sequential ListObjectsV2 respects S3's request pipelining
- Reuses pooled connections across paginated responses
- No thread overhead for what is already a streaming operation
- More predictable latency and throughput

On large buckets with thousands of folders, a flat stream issues far fewer requests than per-folder traversal and can be substantially faster; see the performance guide for measured results.

Performance¶

See the performance guide for benchmark results. Listing is dominated by S3 API round-trip latency, not file count. Connection pooling is automatic; successive calls reuse connections.

Directory-listing cache (off by default)¶

The S3 backend disables the underlying s3fs directory-listing cache by default, so every listing call reflects the current state of the bucket.

s3fs caches directory listings in a cache that never expires: once a prefix has been listed, s3fs serves later listings of that prefix from memory until the process restarts. For a single reader that never writes, that saves a round trip on repeated listings. But for any store shared by more than one writer — two processes, two Store instances, or another tool writing to the same bucket — a write made elsewhere is then permanently invisible to a reader that has already listed the prefix. Fresh listings cost one bounded round trip; the default favours correctness.

Re-enable the cache when you have a single-writer (or read-only) workload and want to avoid repeated listing round trips:

backend = S3Backend(
    bucket="my-bucket",
    client_options={"use_listings_cache": True},
)

client_options["use_listings_cache"] takes precedence over the default, so passing True restores the s3fs caching behaviour (and False is a harmless no-op). The ext.cache extension is the portable, backend-agnostic alternative when you want caching with explicit invalidation.
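
The staleness problem can be illustrated with a toy model. `Bucket` and `Client` below are hypothetical in-memory stand-ins, not s3fs or library classes; the sketch only assumes a listing cache that is filled on first use and never expires:

```python
from typing import Dict, List, Set


class Bucket:
    """In-memory stand-in for a bucket shared by several clients."""

    def __init__(self) -> None:
        self.keys: Set[str] = set()


class Client:
    """Toy client with an optional listing cache that never expires."""

    def __init__(self, bucket: Bucket, use_listings_cache: bool = False) -> None:
        self.bucket = bucket
        self.use_listings_cache = use_listings_cache
        self._cache: Dict[str, List[str]] = {}

    def write(self, key: str) -> None:
        self.bucket.keys.add(key)

    def list(self, prefix: str) -> List[str]:
        if self.use_listings_cache and prefix in self._cache:
            return self._cache[prefix]  # served from memory, possibly stale
        listing = sorted(k for k in self.bucket.keys if k.startswith(prefix))
        if self.use_listings_cache:
            self._cache[prefix] = listing
        return listing


bucket = Bucket()
writer = Client(bucket)
cached_reader = Client(bucket, use_listings_cache=True)
fresh_reader = Client(bucket)

writer.write("data/a.csv")
cached_reader.list("data/")   # primes the cache
writer.write("data/b.csv")    # written by a different client

stale = cached_reader.list("data/")
fresh = fresh_reader.list("data/")
```

The cached reader never sees `data/b.csv`, while the uncached reader does, which is the trade-off behind the default.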

Data Lake Pattern (Few Root Folders, Deep Nesting)¶

If you have few root-level folders (e.g., /bronze, /silver, /gold) with deeply nested structures:

To explore the structure:

# ✓ Shallow listing to see root tiers
for folder in store.list_folders(""):  # bronze/, silver/, gold/
    print(folder.name)

To process a specific tier incrementally:

# ✓ Depth-limited listing to explore one branch without full recursion
for file in store.list_files("bronze", max_depth=3):
    # Gets files up to 3 levels deep under bronze/
    process(file)

Only if you need all files across all depths:

# ✓ Full recursive listing (streaming, memory-efficient despite size)
for file in store.list_files("bronze", recursive=True):
    process(file)

Characteristics:
- max_depth=0: Direct children only (equivalent to non-recursive)
- max_depth=1: One level of nesting
- max_depth=None (default): Defers to recursive parameter (non-recursive by default)
- Cost on S3: Full recursive ListObjectsV2 listing (O(n_total) API cost); client-side depth filter reduces the yielded result set. Local/SFTP/Memory backends prune natively.
- Streaming, memory-efficient (unlike loading entire tree)

Use case: Incremental exploration, tier-by-tier processing, or when you know the data structure depth in advance.
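
A client-side depth filter over a flat key stream might look like the sketch below. `filter_by_depth` is a hypothetical helper written for illustration, assuming that depth is counted as the number of folder levels between the prefix and the object:

```python
from typing import Iterable, Iterator


def filter_by_depth(keys: Iterable[str], prefix: str, max_depth: int) -> Iterator[str]:
    """Yield keys under prefix that sit at most max_depth folders below it."""
    base = prefix.rstrip("/") + "/" if prefix else ""
    for key in keys:
        if not key.startswith(base):
            continue
        depth = key[len(base):].count("/")  # 0 means a direct child
        if depth <= max_depth:
            yield key


keys = [
    "bronze/a.csv",
    "bronze/x/b.csv",
    "bronze/x/y/c.csv",
    "silver/d.csv",
]
direct_only = list(filter_by_depth(keys, "bronze", 0))
one_level = list(filter_by_depth(keys, "bronze", 1))
```

Every key is still enumerated before the filter runs, so the filter reduces what the caller receives, not the number of listing requests.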

Recommendations¶

Shallow listing (non-recursive): Interactive UI, folder browsers, or when you only need direct children.
Depth-limited listing (max_depth=N): Data lake patterns with known structure depth. Explore incrementally without full recursion.
Recursive listing (full tree): Data processing, backups, or scanning entire prefix. Streaming operation, memory-efficient despite size.
Pattern matching: Use glob(pattern) (internally uses flat stream with filtering) rather than custom folder traversal.
Note: Do not parallelize any listing — single flat streams are already optimal.

Capabilities¶

Supports all capabilities except ATOMIC_MOVE. See the capabilities matrix for full details.

Caveats¶

move() is not atomic. S3 has no native rename operation. move() is implemented as copy + delete. If the process crashes between the two steps, both source and destination will exist.
overwrite=False has a TOCTOU race. The exists-check and write are separate API calls. Concurrent writers can both pass the check and overwrite each other.

See the Concurrency and Atomicity Guarantees guide for details and workarounds.
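
The overwrite=False race can be shown with a deterministic interleaving. This sketch uses an in-memory dict and a local `AlreadyExists` stand-in; `check` and `put` are hypothetical names for the two separate API calls:

```python
from typing import Dict


class AlreadyExists(Exception):
    """Local stand-in for the library's AlreadyExists error."""


store: Dict[str, bytes] = {}


def check(path: str) -> None:
    """Step 1 of overwrite=False: the existence check (one API call)."""
    if path in store:
        raise AlreadyExists(path)


def put(path: str, content: bytes) -> None:
    """Step 2: the unconditional write (a second API call)."""
    store[path] = content


# Interleaving: both writers run step 1 before either runs step 2.
check("report.csv")            # writer A: path is absent, check passes
check("report.csv")            # writer B: still absent, check passes too
put("report.csv", b"from A")   # writer A writes
put("report.csv", b"from B")   # writer B silently replaces A's object
```

Neither writer sees an error, and only the last write survives.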

API Reference¶

S3Backend ¶

S3Backend(
    bucket: str,
    *,
    endpoint_url: str | None = None,
    key: str | Secret | None = None,
    secret: str | Secret | None = None,
    region_name: str | None = None,
    tls_ca_bundle: str | None = None,
    client_options: dict[str, Any] | None = None,
    retry: RetryPolicy | None = None,
    reject_write_under_file_ancestor: bool = False,
)

Bases: _S3Base

S3-compatible object storage backend using s3fs.

move() is implemented as a server-side copy followed by a delete. This is non-atomic: a crash or network error between the two steps may leave both source and destination present. ATOMIC_MOVE is not declared.

Parameters:

bucket (str) –

S3 bucket name (required, non-empty).
endpoint_url (str | None, default: None ) –

Custom endpoint URL (e.g. for MinIO).
key (str | Secret | None, default: None ) –

AWS access key ID.
secret (str | Secret | None, default: None ) –

AWS secret access key.
region_name (str | None, default: None ) –

AWS region name.
tls_ca_bundle (str | None, default: None ) –

Path to a PEM CA bundle file. Falls back to AWS_CA_BUNDLE / REQUESTS_CA_BUNDLE / SSL_CERT_FILE.
client_options (dict[str, Any] | None, default: None ) –

Additional options passed to s3fs.
reject_write_under_file_ancestor (bool, default: False ) –

If True, write / write_atomic / open_atomic / move / copy issue a HEAD for each slash-aligned ancestor of the target path and raise InvalidPath on the first one that is a regular file, matching the cross-backend contract that hierarchical filesystems enforce natively. Defaults to False because the check costs one HEAD per ancestor on every nested-path write; paths without slashes short-circuit.
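
The ancestor check can be sketched as follows. `slash_aligned_ancestors` and `check_ancestors` are hypothetical helpers, `InvalidPath` is a local stand-in for the library's error, and probing the shallowest ancestor first is an assumption of this sketch:

```python
from typing import Callable, List


class InvalidPath(Exception):
    """Local stand-in for the library's InvalidPath error."""


def slash_aligned_ancestors(path: str) -> List[str]:
    """'a/b/c.txt' -> ['a', 'a/b']; a path without slashes has no ancestors."""
    parts = path.strip("/").split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts))]


def check_ancestors(path: str, is_file: Callable[[str], bool]) -> int:
    """Probe each ancestor (one HEAD each); return how many probes were made."""
    probes = 0
    for ancestor in slash_aligned_ancestors(path):
        probes += 1
        if is_file(ancestor):
            raise InvalidPath(f"{ancestor!r} is a file, cannot write {path!r} under it")
    return probes


nested_probes = check_ancestors("a/b/c.txt", lambda p: False)
flat_probes = check_ancestors("c.txt", lambda p: False)
```

The probe count grows with nesting depth, which is the per-write cost the default avoids.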

check_health ¶

check_health() -> None

Confirm the bucket is reachable and credentials valid via one HeadBucket.

Raises:

NotFound –

If the bucket does not exist.
PermissionDenied –

If the credentials are rejected or lack access.
BackendUnavailable –

On a transport or service failure, or after close().

exists ¶

exists(path: str) -> bool

Return True if an object or prefix exists at path; never NotFound.

Raises:

PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().

is_file ¶

is_file(path: str) -> bool

Return True if path is an existing object (False if absent or a prefix).

Raises:

PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().

is_folder ¶

is_folder(path: str) -> bool

Return True if path is an existing virtual folder (a common prefix).

Raises:

PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().

read ¶

read(path: str) -> BinaryIO

Open path for reading and return a streaming handle.

s3fs reads the object lazily in range-backed chunks, so memory stays constant regardless of size.
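
Any binary handle of this kind can be consumed chunk by chunk so that memory use stays bounded. The sketch below uses only the standard library; `sha256_of_stream` and the 1 MiB chunk size are illustrative choices, and `io.BytesIO` stands in for the handle returned by read():

```python
import hashlib
import io
from typing import BinaryIO

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def sha256_of_stream(handle: BinaryIO, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a binary stream without holding more than one chunk in memory."""
    digest = hashlib.sha256()
    # iter() with a sentinel stops at the first empty read (end of stream).
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


payload = b"col1,col2\n1,2\n"
streamed = sha256_of_stream(io.BytesIO(payload), chunk_size=4)
```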

Raises:

NotFound –

If the object does not exist.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().

read_bytes ¶

read_bytes(path: str) -> bytes

Read and return the full object content as bytes.

Downloads the whole object into memory (unlike the lazy read stream).

Raises:

NotFound –

If the object does not exist.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().

write ¶

write(
    path: str,
    content: WritableContent,
    *,
    overwrite: bool = False,
    metadata: Mapping[str, str] | None = None,
) -> WriteResult

Write content to path as an S3 object.

The upload commits atomically — a reader sees either the old object or the new one, never a partial. A streamed write that fails mid-body aborts the multipart upload rather than finalising a truncated object, so no partial object is ever left at path.

Raises:

AlreadyExists –

If the object exists and overwrite is False.
InvalidPath –

With the reject_write_under_file_ancestor opt-in, if an ancestor of path exists as an object.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().

write_atomic ¶

write_atomic(
    path: str,
    content: WritableContent,
    *,
    overwrite: bool = False,
    metadata: Mapping[str, str] | None = None,
) -> WriteResult

Write content to path atomically (delegates to write).

An S3 PUT is already atomic, so this is exactly write.

Raises:

AlreadyExists –

If the object exists and overwrite is False.
InvalidPath –

With the opt-in, if an ancestor of path exists as an object.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().

open_atomic ¶

open_atomic(
    path: str, *, overwrite: bool = False
) -> Iterator[BinaryIO]

Yield a writable buffer committed to path atomically on clean exit.

Writes spool to a temporary file (up to 8 MB in memory, then on disk); on clean exit the buffer is uploaded in a single atomic PUT. An exception before exit leaves path untouched.
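
The spool-then-commit behaviour can be approximated with `tempfile.SpooledTemporaryFile`. This is a sketch, not the backend's implementation: the `objects` dict stands in for the bucket, and the function below is a hypothetical local analogue that shares only its name with the method:

```python
import tempfile
from contextlib import contextmanager
from typing import BinaryIO, Dict, Iterator

SPOOL_LIMIT = 8 * 1024 * 1024  # 8 MB in memory before rolling over to disk


@contextmanager
def open_atomic(objects: Dict[str, bytes], path: str) -> Iterator[BinaryIO]:
    """Buffer writes locally and publish them only if the block exits cleanly."""
    with tempfile.SpooledTemporaryFile(max_size=SPOOL_LIMIT) as spool:
        yield spool  # an exception raised in the with-body propagates from here
        # Reached only on clean exit: publish the whole buffer in one step.
        spool.seek(0)
        objects[path] = spool.read()


objects: Dict[str, bytes] = {}

with open_atomic(objects, "out.bin") as handle:
    handle.write(b"hello")

try:
    with open_atomic(objects, "bad.bin") as handle:
        handle.write(b"partial")
        raise RuntimeError("simulated failure before exit")
except RuntimeError:
    pass
```

After both blocks run, `out.bin` holds the full payload and `bad.bin` was never created.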

Raises:

AlreadyExists –

If the object exists and overwrite is False.
InvalidPath –

With the opt-in, if an ancestor of path exists as an object.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().

delete ¶

delete(path: str, *, missing_ok: bool = False) -> None

Delete the object at path.

Raises:

NotFound –

If the object does not exist and missing_ok is False.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().

delete_folder ¶

delete_folder(
    path: str,
    *,
    recursive: bool = False,
    missing_ok: bool = False,
) -> None

Delete the virtual folder at path.

recursive=True removes every object under the prefix; this is a best-effort multi-object delete, not atomic, so an interruption can leave the prefix partially deleted. recursive=False removes the prefix only when it has no contents.

Raises:

NotFound –

If no object exists under path and missing_ok is False.
DirectoryNotEmpty –

If the prefix is non-empty and recursive is False.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().

get_file_info ¶

get_file_info(path: str) -> FileInfo

Return metadata for the object at path from one HeadObject.

The HEAD is issued with ChecksumMode=ENABLED so a stored checksum surfaces as the FileInfo digest.

Raises:

NotFound –

If the object does not exist.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().

move ¶

move(
    src: str, dst: str, *, overwrite: bool = False
) -> None

Move or rename the object src to dst.

Implemented as a server-side copy followed by a delete of src. This is not atomic — a crash or network error between the two steps can leave both src and dst present — so ATOMIC_MOVE is not declared. src == dst is a no-op.
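
The failure window can be modelled with an in-memory dict. This sketch is a hypothetical local analogue of the method, and its `crash_after_copy` flag exists only to simulate an interruption between the two steps:

```python
from typing import Dict


def move(
    objects: Dict[str, bytes],
    src: str,
    dst: str,
    *,
    crash_after_copy: bool = False,
) -> None:
    """Copy then delete; crash_after_copy simulates a failure between the steps."""
    if src == dst:
        return  # no-op, as documented
    objects[dst] = objects[src]  # step 1: copy
    if crash_after_copy:
        raise RuntimeError("simulated crash between copy and delete")
    del objects[src]  # step 2: delete the source


interrupted = {"old.csv": b"data"}
try:
    move(interrupted, "old.csv", "new.csv", crash_after_copy=True)
except RuntimeError:
    pass

completed = {"old.csv": b"data"}
move(completed, "old.csv", "new.csv")
```

The interrupted run leaves both keys present, while the completed run leaves only the destination.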

Raises:

NotFound –

If src does not exist.
AlreadyExists –

If dst exists, src != dst, and overwrite is False.
InvalidPath –

With the opt-in, if an ancestor of dst exists as an object.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().

copy ¶

copy(
    src: str, dst: str, *, overwrite: bool = False
) -> None

Copy the object src to dst via a server-side CopyObject.

The bytes are copied entirely server-side (never through the client). Like move, the operation carries no cross-operation atomicity guarantee. src == dst is a no-op.

Raises:

NotFound –

If src does not exist.
AlreadyExists –

If dst exists, src != dst, and overwrite is False.
InvalidPath –

With the opt-in, if an ancestor of dst exists as an object.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().
