S3-PyArrow Backend¶
Drop-in alternative to the S3 backend optimized for analytical workloads. Uses PyArrow's C++ S3 filesystem for data-path operations (reads, writes, copies) and s3fs for control-path operations (listing, metadata, deletion). This enables Tier 1 PyArrow integration — Parquet column pruning, I/O coalescing, and GIL-free reads — while keeping the same API.
Installation¶
This pulls in both s3fs and pyarrow.
Usage¶
from remote_store import BackendConfig, RegistryConfig, Registry, StoreProfile

config = RegistryConfig(
    backends={
        "s3pa": BackendConfig(
            type="s3-pyarrow",
            options={
                "bucket": "my-bucket",
                "key": "AWS_ACCESS_KEY_ID",
                "secret": "AWS_SECRET_ACCESS_KEY",
                "endpoint_url": "https://s3.amazonaws.com",
            },
        ),
    },
    stores={"data": StoreProfile(backend="s3pa", root_path="data")},
)

with Registry(config) as registry:
    store = registry.get_store("data")
    store.write("report.csv", b"col1,col2\n1,2\n")
Options¶
Same constructor signature as the S3 backend:
| Option | Type | Description |
|---|---|---|
| bucket | str | S3 bucket name (required) |
| key | str | AWS access key ID |
| secret | str | AWS secret access key |
| region_name | str | AWS region name |
| endpoint_url | str | Custom endpoint for S3-compatible services |
| tls_ca_bundle | str | Path to a PEM CA bundle file for custom/self-signed certificates |
| client_options | dict | Additional options passed to s3fs (control-path) |
Custom TLS Certificates¶
Use tls_ca_bundle when connecting to S3-compatible services with custom or
self-signed certificates. The parameter applies to both the PyArrow data path
(tls_ca_file_path) and the s3fs control path (client_kwargs.verify).
backend = S3PyArrowBackend(
    bucket="my-bucket",
    endpoint_url="https://minio.internal:9000",
    tls_ca_bundle="/etc/ssl/certs/internal-ca.pem",
)
If tls_ca_bundle is not set, the backend falls back to environment variables,
in order of precedence: AWS_CA_BUNDLE > REQUESTS_CA_BUNDLE > SSL_CERT_FILE.
See the S3 backend guide for the full fallback chain and config examples.
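The precedence described here can be sketched as a small resolver. This is an illustration only, not the backend's actual code: `resolve_ca_bundle` is a hypothetical helper, it covers just the three variables named in this section, and treating empty values as unset is an assumption.

```python
import os
from typing import Mapping, Optional

# Environment variables consulted when tls_ca_bundle is not set,
# highest priority first (order taken from the text above).
_CA_ENV_VARS = ("AWS_CA_BUNDLE", "REQUESTS_CA_BUNDLE", "SSL_CERT_FILE")


def resolve_ca_bundle(
    tls_ca_bundle: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Hypothetical helper: return the CA bundle path that would take effect."""
    if tls_ca_bundle:
        return tls_ca_bundle  # an explicit option wins over any env var
    env = os.environ if environ is None else environ
    for name in _CA_ENV_VARS:
        value = env.get(name)
        if value:  # assumption: empty values are treated as unset
            return value
    return None  # nothing configured explicitly
```

Passing a mapping as `environ` makes the lookup easy to exercise without touching the real process environment.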
When to use S3-PyArrow vs S3¶
| Scenario | Recommended backend |
|---|---|
| General-purpose file storage | s3 |
| Sequential byte streaming (read/write) | s3 (faster at every file size) |
| Analytical workloads (Parquet, datasets) | s3-pyarrow (Tier 1 column pruning) |
| Minimal dependencies | s3 (only needs s3fs) |
| PyArrow already in your stack | s3-pyarrow (zero extra deps) |
S3-PyArrow's C++ data path adds per-call overhead for sequential reads compared to the regular S3 backend. The advantage is native PyArrow integration: when PyArrow reads Parquet files through the adapter, it uses C++ range requests and I/O coalescing directly — no Python in the loop.
Both backends lack ATOMIC_MOVE, but they differ on USER_METADATA: the regular S3
backend declares it (write calls accept metadata= and store it as S3 object metadata),
while S3-PyArrow does not. Code that passes metadata= to write methods will get
CapabilityNotSupported on the PyArrow backend. For workloads that do not use object
metadata, switching is safe — change the type in your config.
Escape Hatch¶
Access the underlying filesystems when you need protocol-level features:
from pyarrow.fs import S3FileSystem as PyArrowS3
import s3fs
# PyArrow filesystem (data path)
pa_fs = backend.unwrap(PyArrowS3)
# s3fs filesystem (control path)
s3_fs = backend.unwrap(s3fs.S3FileSystem)
Caveats¶
- move() is not atomic. Like the S3 backend, move is copy + delete. A crash between the two steps leaves duplicates.
- overwrite=False has a TOCTOU race. The exists-check and write are separate API calls. Concurrent writers can both pass the check and overwrite each other.
See the Concurrency and Atomicity Guarantees guide for details and workarounds.
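To make the overwrite=False race concrete, the following sketch replays the problematic interleaving deterministically. `InMemoryStore` is a toy stand-in written for this example, not the remote_store API; the point is only that a separate exists-check and write cannot protect against a second writer.

```python
class InMemoryStore:
    """Toy stand-in for an object store; not the remote_store API."""

    def __init__(self) -> None:
        self.objects = {}  # maps path -> bytes

    def exists(self, path: str) -> bool:
        return path in self.objects

    def put(self, path: str, data: bytes) -> None:
        self.objects[path] = data


store = InMemoryStore()
path = "report.csv"

# Step 1: both writers run the exists-check before either has written.
writer_a_may_write = not store.exists(path)
writer_b_may_write = not store.exists(path)

# Step 2: both checks passed, so both writers proceed to the write call.
if writer_a_may_write:
    store.put(path, b"from writer A")
if writer_b_may_write:
    store.put(path, b"from writer B")  # silently replaces writer A's data

final_content = store.objects[path]
```

Neither writer sees an error, yet only one payload survives.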
API Reference¶
S3PyArrowBackend¶
S3PyArrowBackend(
    bucket: str,
    *,
    endpoint_url: str | None = None,
    key: str | Secret | None = None,
    secret: str | Secret | None = None,
    region_name: str | None = None,
    tls_ca_bundle: str | None = None,
    client_options: dict[str, Any] | None = None,
    retry: RetryPolicy | None = None,
    reject_write_under_file_ancestor: bool = False,
)
Bases: _S3Base
Hybrid S3 backend: PyArrow for reads/writes/copies, s3fs for listing/metadata.
Drop-in alternative to S3Backend with the same constructor signature.
Uses PyArrow's C++ S3 filesystem for data-path operations (higher throughput
for large files) and s3fs for control-path operations (listing, metadata,
deletion).
move() is implemented as a PyArrow copy followed by an s3fs delete.
This is non-atomic: a crash or network error between the two steps may
leave both source and destination present. ATOMIC_MOVE is not
declared.
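The failure mode can be illustrated with a toy copy-then-delete move over a plain dict. This is a sketch under stated assumptions, not the backend's implementation: the `move` function and `SimulatedCrash` exception are hypothetical, and the "crash" is injected with a flag.

```python
class SimulatedCrash(Exception):
    """Raised to model a process crash or network error mid-move."""


def move(objects: dict, src: str, dst: str, *, crash_after_copy: bool = False) -> None:
    """Toy copy-then-delete move over a dict; not the backend's implementation."""
    objects[dst] = objects[src]  # step 1: copy
    if crash_after_copy:
        raise SimulatedCrash("failed between copy and delete")
    del objects[src]  # step 2: delete the source


objects = {"in/a.bin": b"payload"}
try:
    move(objects, "in/a.bin", "out/a.bin", crash_after_copy=True)
except SimulatedCrash:
    pass

# Both keys now exist: the source was copied but never deleted.
leftover_keys = sorted(objects)
```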
Parameters:
- bucket (str) – S3 bucket name (required, non-empty).
- endpoint_url (str | None, default: None) – Custom endpoint URL (e.g. for MinIO).
- key (str | Secret | None, default: None) – AWS access key ID.
- secret (str | Secret | None, default: None) – AWS secret access key.
- region_name (str | None, default: None) – AWS region name.
- tls_ca_bundle (str | None, default: None) – Path to a PEM CA bundle file. Falls back to AWS_CA_BUNDLE/REQUESTS_CA_BUNDLE/SSL_CERT_FILE.
- client_options (dict[str, Any] | None, default: None) – Additional options passed to s3fs.
- reject_write_under_file_ancestor (bool, default: False) – If True, write/write_atomic/open_atomic/move/copy issue a HEAD request for each slash-aligned ancestor of the target path and raise InvalidPath on the first regular-file hit, matching the cross-backend contract that hierarchical filesystems enforce natively. Defaults to False because, when enabled, each nested-path write pays one HEAD per ancestor; paths without slashes short-circuit.
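As a rough illustration of what "slash-aligned ancestor" means and why the check costs one probe per ancestor, consider the sketch below. All names here are hypothetical (`slash_aligned_ancestors`, `check_ancestors`, and the `InvalidPathError` stand-in), the probe is a set lookup rather than a real HEAD request, and checking the shortest ancestor first is an assumption.

```python
class InvalidPathError(Exception):
    """Stand-in for the library's InvalidPath so the sketch runs on its own."""


def slash_aligned_ancestors(path: str) -> list[str]:
    """Return every proper prefix of path that ends at a slash boundary."""
    parts = path.split("/")
    # "a/b/c.txt" -> ["a", "a/b"]; a path without slashes has no ancestors.
    return ["/".join(parts[:i]) for i in range(1, len(parts))]


def check_ancestors(path: str, is_file) -> int:
    """Probe each ancestor and raise on the first regular-file hit.

    Returns the number of probes made (one HEAD each in a real backend).
    """
    probes = 0
    for ancestor in slash_aligned_ancestors(path):
        probes += 1
        if is_file(ancestor):
            raise InvalidPathError(f"{ancestor!r} is a file, cannot write {path!r}")
    return probes


existing_files = {"data/report.csv"}  # pretend these keys are regular files
```

With this setup, `check_ancestors("logs/2024/x.txt", existing_files.__contains__)` makes two probes, a slash-free path makes none, and writing to `"data/report.csv/part-0"` raises because `data/report.csv` is already a file.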