S3-PyArrow Backend¶

Drop-in alternative to the S3 backend optimized for analytical workloads. Uses PyArrow's C++ S3 filesystem for data-path operations (reads, writes, copies) and s3fs for control-path operations (listing, metadata, deletion). This enables Tier 1 PyArrow integration — Parquet column pruning, I/O coalescing, and GIL-free reads — while keeping the same API.

Installation¶

pip install "remote-store[s3-pyarrow]"

This pulls in both s3fs and pyarrow.

Usage¶

from remote_store import BackendConfig, RegistryConfig, Registry, StoreProfile

config = RegistryConfig(
    backends={
        "s3pa": BackendConfig(
            type="s3-pyarrow",
            options={
                "bucket": "my-bucket",
                "key": "AWS_ACCESS_KEY_ID",
                "secret": "AWS_SECRET_ACCESS_KEY",
                "endpoint_url": "https://s3.amazonaws.com",
            },
        ),
    },
    stores={"data": StoreProfile(backend="s3pa", root_path="data")},
)

with Registry(config) as registry:
    store = registry.get_store("data")
    store.write("report.csv", b"col1,col2\n1,2\n")

Options¶

Same constructor signature as the S3 backend:

Option	Type	Description
`bucket`	`str`	S3 bucket name (required)
`key`	`str`	AWS access key ID
`secret`	`str`	AWS secret access key
`region_name`	`str`	AWS region name
`endpoint_url`	`str`	Custom endpoint for S3-compatible services
`tls_ca_bundle`	`str`	Path to a PEM CA bundle file for custom/self-signed certificates
`client_options`	`dict`	Additional options passed to s3fs (control-path)

Custom TLS Certificates¶

Use tls_ca_bundle when connecting to S3-compatible services with custom or self-signed certificates. The parameter applies to both the PyArrow data path (tls_ca_file_path) and the s3fs control path (client_kwargs.verify).

backend = S3PyArrowBackend(
    bucket="my-bucket",
    endpoint_url="https://minio.internal:9000",
    tls_ca_bundle="/etc/ssl/certs/internal-ca.pem",
)

If tls_ca_bundle is not set, the backend falls back to environment variables in this order of precedence: AWS_CA_BUNDLE, then REQUESTS_CA_BUNDLE, then SSL_CERT_FILE. See the S3 backend guide for the full fallback chain and config examples.
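The precedence described above can be illustrated with a small standalone sketch. The helper name resolve_ca_bundle and the rule that empty values count as unset are assumptions made for illustration; this is not the library's own resolution code.

```python
import os
from typing import Mapping, Optional

# Precedence order taken from the text above: the first variable that is set wins.
CA_BUNDLE_ENV_VARS = ("AWS_CA_BUNDLE", "REQUESTS_CA_BUNDLE", "SSL_CERT_FILE")


def resolve_ca_bundle(
    explicit: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the CA bundle path to use, or None if nothing is configured."""
    if explicit:
        return explicit  # an explicit tls_ca_bundle always takes priority
    env = os.environ if environ is None else environ
    for name in CA_BUNDLE_ENV_VARS:
        value = env.get(name)
        if value:  # assumption: an empty value is treated as unset
            return value
    return None
```

Passing a plain dict as environ makes the order easy to check without touching the real process environment.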

When to use S3-PyArrow vs S3¶

Scenario	Recommended backend
General-purpose file storage	`s3`
Sequential byte streaming (read/write)	`s3` (faster at every file size)
Analytical workloads (Parquet, datasets)	`s3-pyarrow` (Tier 1 column pruning)
Minimal dependencies	`s3` (only needs `s3fs`)
PyArrow already in your stack	`s3-pyarrow` (zero extra deps)

S3-PyArrow's C++ data path adds per-call overhead for sequential reads compared to the regular S3 backend. The advantage is native PyArrow integration: when PyArrow reads Parquet files through the adapter, it uses C++ range requests and I/O coalescing directly — no Python in the loop.

Both backends lack ATOMIC_MOVE, but they differ on USER_METADATA: the regular S3 backend declares it (write calls accept metadata= and store it as S3 object metadata), while S3-PyArrow does not. Code that passes metadata= to write methods will get CapabilityNotSupported on the PyArrow backend. For workloads that do not use object metadata, switching is safe — change the type in your config.
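For code that must run against either backend, one option is to pass metadata= only when the backend declares USER_METADATA. The sketch below is a hedged illustration: write_portable and the supports_user_metadata flag are hypothetical names, and how you query a backend's capabilities is left to the caller because this page does not show that API.

```python
from typing import Any, Mapping, Optional


def write_portable(
    store: Any,
    path: str,
    content: bytes,
    metadata: Optional[Mapping[str, str]] = None,
    *,
    supports_user_metadata: bool,
) -> Any:
    """Write content, forwarding metadata= only when the backend supports it."""
    if metadata and supports_user_metadata:
        return store.write(path, content, metadata=metadata)
    # Metadata is dropped on this branch; workloads that depend on object
    # metadata should stay on the regular s3 backend instead.
    return store.write(path, content)
```

Silently dropping metadata is a design choice that only suits workloads where the metadata is optional; raising an error instead would be the stricter alternative.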

Escape Hatch¶

Access the underlying filesystems when you need protocol-level features:

from pyarrow.fs import S3FileSystem as PyArrowS3
import s3fs

# PyArrow filesystem (data path)
pa_fs = backend.unwrap(PyArrowS3)

# s3fs filesystem (control path)
s3_fs = backend.unwrap(s3fs.S3FileSystem)

Caveats¶

move() is not atomic. Like the S3 backend, move is copy + delete. A crash between the two steps leaves duplicates.
overwrite=False has a TOCTOU race. The exists-check and write are separate API calls. Concurrent writers can both pass the check and overwrite each other.

See the Concurrency and Atomicity Guarantees guide for details and workarounds.
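The overwrite=False race can be reproduced deterministically with a toy in-memory bucket. InMemoryBucket is a hypothetical stand-in and makes no network calls; it only mirrors the fact that the existence check and the write are separate steps.

```python
class InMemoryBucket:
    """Toy object store in which exists() and put() are separate calls."""

    def __init__(self):
        self.objects = {}

    def exists(self, key):
        return key in self.objects

    def put(self, key, body):
        self.objects[key] = body


bucket = InMemoryBucket()

# Step 1: both writers run the overwrite=False existence check first.
writer_a_may_write = not bucket.exists("report.csv")
writer_b_may_write = not bucket.exists("report.csv")

# Step 2: both passed the check, so both write; the later write wins silently.
if writer_a_may_write:
    bucket.put("report.csv", b"from A")
if writer_b_may_write:
    bucket.put("report.csv", b"from B")

final_body = bucket.objects["report.csv"]
```

Neither writer sees AlreadyExists here, which is the outcome the caveat warns about.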

API Reference¶

S3PyArrowBackend ¶

S3PyArrowBackend(
    bucket: str,
    *,
    endpoint_url: str | None = None,
    key: str | Secret | None = None,
    secret: str | Secret | None = None,
    region_name: str | None = None,
    tls_ca_bundle: str | None = None,
    client_options: dict[str, Any] | None = None,
    retry: RetryPolicy | None = None,
    reject_write_under_file_ancestor: bool = False,
)

Bases: _S3Base

Hybrid S3 backend: PyArrow for reads/writes/copies, s3fs for listing/metadata.

Drop-in alternative to S3Backend with the same constructor signature. Uses PyArrow's C++ S3 filesystem for data-path operations (higher throughput for large files) and s3fs for control-path operations (listing, metadata, deletion).

move() is implemented as a PyArrow copy followed by an s3fs delete. This is non-atomic: a crash or network error between the two steps may leave both source and destination present. ATOMIC_MOVE is not declared.

Parameters:

bucket (str) –

S3 bucket name (required, non-empty).
endpoint_url (str | None, default: None ) –

Custom endpoint URL (e.g. for MinIO).
key (str | Secret | None, default: None ) –

AWS access key ID.
secret (str | Secret | None, default: None ) –

AWS secret access key.
region_name (str | None, default: None ) –

AWS region name.
tls_ca_bundle (str | None, default: None ) –

Path to a PEM CA bundle file. Falls back to AWS_CA_BUNDLE / REQUESTS_CA_BUNDLE / SSL_CERT_FILE.
client_options (dict[str, Any] | None, default: None ) –

Additional options passed to s3fs.
reject_write_under_file_ancestor (bool, default: False ) –

If True, write, write_atomic, open_atomic, move, and copy each issue a HEAD request for every slash-aligned ancestor of the target path and raise InvalidPath on the first ancestor that exists as a regular file, matching the cross-backend contract that hierarchical filesystems enforce natively. The default is False because enabling the check costs one HEAD per ancestor on every nested-path write; paths without slashes short-circuit and incur no extra requests.
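To make "slash-aligned ancestor" concrete, here is a minimal sketch of the ancestor enumeration and the per-ancestor probe. Both function names are hypothetical, is_file stands in for a HEAD request, and the shallowest-first probing order is an assumption; the backend's actual implementation may differ.

```python
from typing import Callable, List, Optional


def slash_aligned_ancestors(path: str) -> List[str]:
    """Return every proper prefix of path that ends at a slash boundary."""
    parts = path.strip("/").split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts))]


def first_file_ancestor(
    path: str, is_file: Callable[[str], bool]
) -> Optional[str]:
    """Probe each ancestor once and return the first one that is an object."""
    for ancestor in slash_aligned_ancestors(path):
        if is_file(ancestor):  # stands in for one HEAD request per ancestor
            return ancestor
    return None
```

For "a/b/c.txt" the ancestors are "a" and "a/b", so a write to that path would cost two probes with the option enabled; "report.csv" has no ancestors and costs none.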

read ¶

read(path: str) -> BinaryIO

Open path for reading and return a streaming handle.

Uses PyArrow's open_input_file (higher throughput for large objects) rather than the s3fs reader; still lazy, so memory stays constant.

Raises:

NotFound –

If the object does not exist.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().

read_bytes ¶

read_bytes(path: str) -> bytes

Read and return the full object content as bytes (via PyArrow).

Raises:

NotFound –

If the object does not exist.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().

write ¶

write(
    path: str,
    content: WritableContent,
    *,
    overwrite: bool = False,
    metadata: Mapping[str, str] | None = None,
) -> WriteResult

Write content to path, streaming straight to a multipart upload.

Unlike S3Backend.write, a plain streamed write here is not atomic: PyArrow's output stream exposes no abort, so a failure mid-body finalises a truncated object at path. Use write_atomic when readers must never observe a partial object. A bytes payload (no streaming) commits in one shot.

Raises:

AlreadyExists –

If the object exists and overwrite is False.
InvalidPath –

With the reject_write_under_file_ancestor opt-in, if an ancestor of path exists as an object.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().

write_atomic ¶

write_atomic(
    path: str,
    content: WritableContent,
    *,
    overwrite: bool = False,
    metadata: Mapping[str, str] | None = None,
) -> WriteResult

Write content to path atomically by buffering before upload.

The whole body is buffered first (a bytes payload is already materialised and delegates straight through), so a source failure happens off the wire and leaves no object at path — closing the atomicity gap in the plain streaming write.

Raises:

AlreadyExists –

If the object exists and overwrite is False.
InvalidPath –

With the reject_write_under_file_ancestor opt-in, if an ancestor of path exists as an object.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().
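The difference between the plain streamed write and write_atomic can be modelled with a dict standing in for the bucket. This is a toy model of the outcome only, not of PyArrow's multipart mechanics; streamed_write, buffered_write, and failing_source are hypothetical names.

```python
def streamed_write(bucket, key, chunks):
    """Non-atomic model: no abort exists, so closing finalises what arrived."""
    received = []
    try:
        for chunk in chunks:
            received.append(chunk)
    finally:
        # Runs on failure too, leaving a truncated object at key.
        bucket[key] = b"".join(received)


def buffered_write(bucket, key, chunks):
    """Atomic model: materialise the whole body, then commit in one step."""
    body = b"".join(chunks)  # a source failure raises here, before any commit
    bucket[key] = body


def failing_source():
    """Yield one chunk, then fail mid-body."""
    yield b"first half,"
    raise IOError("source died mid-body")


streamed_bucket, buffered_bucket = {}, {}
for write_fn, target in ((streamed_write, streamed_bucket),
                         (buffered_write, buffered_bucket)):
    try:
        write_fn(target, "report.csv", failing_source())
    except IOError:
        pass  # both variants propagate the source error
```

After the loop, streamed_bucket holds a truncated object while buffered_bucket is still empty. The trade-off is memory: buffering holds the full body before the upload starts.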

open_atomic ¶

open_atomic(
    path: str, *, overwrite: bool = False
) -> Iterator[BinaryIO]

Yield a writable buffer committed to path atomically on clean exit.

Writes are spooled to a temporary file (up to 8 MB in memory, then on disk) and uploaded only on clean exit, so path never holds a partial object. An exception before exit leaves path untouched.

Raises:

AlreadyExists –

If the object exists and overwrite is False.
InvalidPath –

With the reject_write_under_file_ancestor opt-in, if an ancestor of path exists as an object.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().
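The spool-then-commit behaviour described for open_atomic resembles what Python's standard tempfile.SpooledTemporaryFile provides. The sketch below assumes that class and uses a dict in place of the bucket; whether the backend uses SpooledTemporaryFile internally is not stated on this page, and the overwrite check is omitted for brevity.

```python
import contextlib
import tempfile
from typing import BinaryIO, Dict, Iterator

SPOOL_LIMIT = 8 * 1024 * 1024  # 8 MB kept in memory before rolling to disk


@contextlib.contextmanager
def open_atomic_sketch(bucket: Dict[str, bytes], key: str) -> Iterator[BinaryIO]:
    """Yield a writable buffer; commit it to bucket[key] only on clean exit."""
    with tempfile.SpooledTemporaryFile(max_size=SPOOL_LIMIT) as spool:
        yield spool  # an exception in the caller's block is re-raised here
        # Reached only when the caller's block finished without raising.
        spool.seek(0)
        bucket[key] = spool.read()
```

Because the commit sits after the yield, an exception raised inside the with block skips it and the destination key is never created.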

move ¶

move(
    src: str, dst: str, *, overwrite: bool = False
) -> None

Move or rename the object src to dst.

Existence checks and the delete go through s3fs; the copy is a PyArrow copy_file. Copy-then-delete is not atomic — a failure between the two steps can leave both src and dst present — so ATOMIC_MOVE is not declared. src == dst is a no-op.

Raises:

NotFound –

If src does not exist.
AlreadyExists –

If dst exists, src != dst, and overwrite is False.
InvalidPath –

With the reject_write_under_file_ancestor opt-in, if an ancestor of dst exists as an object.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().
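Why copy-then-delete cannot be atomic is easiest to see in a toy model. In the sketch below a dict stands in for the bucket, and move_sketch with its fail_before_delete flag is a hypothetical illustration; the existence and overwrite checks of the real method are omitted.

```python
def move_sketch(bucket, src, dst, *, fail_before_delete=False):
    """Model move as two independent steps: copy, then delete."""
    if src == dst:
        return  # documented no-op
    bucket[dst] = bucket[src]  # step 1: copy
    if fail_before_delete:
        # Simulates a crash or network error between the two steps.
        raise ConnectionError("simulated failure between copy and delete")
    del bucket[src]  # step 2: delete


bucket = {"old.csv": b"data"}
try:
    move_sketch(bucket, "old.csv", "new.csv", fail_before_delete=True)
except ConnectionError:
    pass  # the copy already happened, the delete never ran

keys_after_failure = sorted(bucket)
```

After the simulated failure both keys exist with identical content, so a retry needs to tolerate an already-present destination.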

copy ¶

copy(
    src: str, dst: str, *, overwrite: bool = False
) -> None

Copy the object src to dst via a PyArrow copy_file.

Like move, the operation carries no cross-operation atomicity guarantee. src == dst is a no-op.

Raises:

NotFound –

If src does not exist.
AlreadyExists –

If dst exists, src != dst, and overwrite is False.
InvalidPath –

With the reject_write_under_file_ancestor opt-in, if an ancestor of dst exists as an object.
PermissionDenied –

If the credentials lack access.
BackendUnavailable –

On a transport or service failure, or after close().
